Saturday, May 24, 2014

It's kinda hard to read 'cause all the lines are red

I'd say that in the final takeaway charts the blue lines are broadly consistent with the red lines. It's hard to say which red lines 'cause all the lines are red.
Harris: This green line represents our product. And the other green lines represents the competitors' product. So what we've got here is basically a case of ... 
Don: Uh-oh. See, it's kinda hard to read 'cause all the lines are green.

Thursday, May 23, 2013

Reaching for the relatable application

These kids are totally rad; all I managed to do at their age was break a Raman spectrometer. However, I think the desire to say Khare's discovery could be used for cell phones is a bit of a stretch. Here is the relevant diagram:
You'd need a supercapacitor ~10 times the size of your current cell phone battery to carry as much energy and you'd need to dissipate a lot of heat to have it discharge more slowly.

Friday, March 8, 2013

Is X better than Y if X is perfectly correlated with Y?

The Dow Jones Industrial Average hit its first record in a few years the other day and I was bombarded with mathematical innumeracy stampeding out of the speakers of my car like a herd of ... well, innumerate journalists.

DAVIDSON: Here's the thing. For reasons I cannot understand, nobody adjusts for inflation when they're talking about the Dow. .... And anyway, even if it did reach a record, this is not the measure we should be paying attention to. 
BLOCK: OK. Well, if it's not the measure we should be paying attention to, what is? 
DAVIDSON: There are, as I mentioned, a handful of indexes that do a much better job, like the S&P 500 ...

Don't worry too much right now about the fact that the S&P 500 isn't adjusted for inflation either. More came from Marketplace minutes later:

“[The Dow is] a rough indicator of the health of the market,” says Kelly School of Business professor Scott Smart, “but there are some problems with the Dow as such an indicator.”
For one, the Dow is a very, very small sample ... “since it only looks at 30 stocks, there are obviously big portions of the market that the Dow doesn't monitor or doesn't capture.” ... 
... “it is weighted in a very unusual way.” ... 
That’s the Dow. No complicated formula. No algorithms. No wonder more serious investors prefer the S&P 500 ...

So while the Dow is a giant turd on the world of financial journalism, the S&P 500 is, like, totally the awesomest.

Except they have a correlation of 0.96 ...

The Dow is in blue and the S&P 500 is in gray. Or, wait ...

Here is another view. Basically Dow = a*S&P 500 (where a is ~ 9.2) ... and you will only be off by a couple percent.

The simplicity of the formula, its "unusual" weighting based on price not market cap, its small sample size: none of these things matter. Why? Because most companies large enough to be listed in an index are highly correlated with each other. Here, for example, are Boeing and GE:

It doesn't matter how you weight the companies since these weights have no effect on correlation ... if the correlation of X and Y is c, then the correlation of a*X and b*Y is also c for any positive a and b.
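For those following along at home, here is a minimal sketch of that scaling claim. The numbers are made up for illustration (they are not market data), and the helper name pearson is my own. Note the claim needs positive multipliers: a negative one flips the sign of the correlation.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)

# Made-up series, purely illustrative.
x = [10.0, 12.0, 11.0, 15.0, 14.0, 18.0]
y = [1.0, 1.3, 1.1, 1.4, 1.6, 1.7]

c = pearson(x, y)
# Rescale each series by a different positive constant.
c_scaled = pearson([9.2 * v for v in x], [0.5 * v for v in y])
# A negative constant flips the sign but not the magnitude.
c_flipped = pearson([-2.0 * v for v in x], y)

print(c, c_scaled, c_flipped)
```

The first two printed values agree up to floating-point rounding, and the third is the same number with the opposite sign.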

And even if individual companies weren't very correlated with each other, creating indices that lump individual companies together tends to destroy the information about their individual performance, so you end up with an index that shows an average trend (this is the idea behind these indices in the first place). All that matters is how highly correlated the stocks are in the first place, whether it takes 30 companies, 500 or 5000 to get there. (The answer is 30.)

Wednesday, September 5, 2012

Study changes (NYT editors') understanding of how DNA causes disease

Study Changes Understanding of How DNA Causes Disease
At least four million gene switches that reside in bits of DNA once thought to be inactive turn out to play critical roles in health, researchers reported.

So, what is your first take on what this story is about? Just, say, reading the title and the lede. It sounds like this is some sort of new result about how "junk DNA" actually does something. Wow! And there might be new understanding of (potentially all?) disease! 

Of course, these pieces of junk DNA had been known to be associated with certain diseases for a decade. In fact, the author likely knew this as it is written in the article.
In large studies over the past decade, scientists found that minor changes in human DNA sequences increase the risk that a person will get those diseases.
The earliest papers are from the late 1990s and 2000s. As the Human Genome Project was coming up with much less than it expected, scientists pushed into this area.

And of course, gene switches aren't new -- another fact the author likely knew as it too is written in the article.

In recent years, some [scientists] began to find switches in the 99 percent of human DNA that is not genes
I think the author left out what "recent years" meant because 10 years doesn't sound so new. In recent years (2007) it was sufficiently established for NOVA to cover it.

In fact, the entire concept has been around for a while. The reason I wrote this particular post is that I personally have known about this simply through the aforementioned NOVA episode. I knew enough about gene switches in 2008 to comment (with proto-spittle-flecked ire) on an idiotic statement by Ray Kurzweil saying the brain is simple because a human DNA sequence consists of only "50 million bytes" of information. I said:
In the worst case, a sizable fraction of all 2^20000 [gene on/off] states could be involved to get from a stem cell to every neuron in its right place of the brain with the proper function.

I don't want to detract from the actual work presented in the article. It is a pretty awesome piece of human genome mapping, and it really sheds light on how complex the whole thing is.  (And it puts some more hurt on Kurzweil since the entire 3D structure along with the switches appears to be important in DNA.) 

Gina Kolata seems to be a stand-up molecular biologist cum journalist. I imagine the editors of the NYT were completely blown away by progress in stuff they hadn't been paying attention to since the 1990s (or maybe ever) and said she should change the lede. 

And I guess it got me to click the link.

Monday, September 3, 2012

Other than the grammar ...

How often do people ask the question "Other than the grammar, how was the speech?" Well, apparently over 298 million times.

Human speech derives its information-carrying capacity from several places, not the least of which is its temporal structure. Sorting on word frequency literally destroys significant quantities of information. The entire information content of the word green next to the word frog (i.e. green frog for those following along at home) is that green is modifying frog so as both to convey the information that the frog is green and distinguish said frog from e.g. a poison dart frog (which is not green, but instead blue or yellow). If I take that word green and move it to a different position unrelated to the position of frog, that word green no longer carries any information at all ... and any information you do decide to imbue it with has no foundation whatsoever.
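A quick sketch of the point, using two sentences I made up: they say different things about the frog, yet once you reduce them to word counts they are indistinguishable. The helper name word_counts is mine.

```python
from collections import Counter

def word_counts(text):
    """Bag of words: how often each word appears, with word order thrown away."""
    return Counter(text.lower().split())

# Two made-up sentences with the same words in a different order.
sentence_a = "the green frog ate the blue fly"
sentence_b = "the blue frog ate the green fly"

same_counts = word_counts(sentence_a) == word_counts(sentence_b)

print(sentence_a == sentence_b)  # the sentences themselves differ
print(same_counts)               # their word counts do not
```

Anything built only from the counts, a word cloud included, cannot tell you which frog was green.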

The above word cloud (apparently also known as a wordle, though that may just be the name of one particular piece of generation software) of Romney's speech to the RNC has removed all of the information except that he might be running for President of America. However that is information I am adding to this infographic. The speaker simply mentions President and America. It could be in a negative light. The most commonly appearing words in this blog post are green and frog, but I'm not talking about green frogs. In a sense, the creators have only done a half-assed job. Below I lay waste to the information content, reducing the speech to an empirical estimate of the letter frequency in English.

(Sorry. I couldn't help myself.)

Thursday, July 5, 2012

Fun with normalization, economics edition

So the new thing in economic circles is for rich countries to look at poorer ones as economies to emulate and this has sparked some kind of debate about Iceland, Estonia, Latvia, Ireland and Lithuania: which one fared best during the recession? And debate begets ... graphs! I love graphs.

This blog summarizes the graphs, but takes a demonstrably wrong view of the data. I link to it because it and the side it supports are to be the recipients of my spittle-flecked ire.

The subject of the graphs is the RGDP data for the aforementioned countries. Now RGDPs of all countries at any particular time form a power law distribution, making it difficult to graph on standard axes in a way that conveys information. That's why humans in all their wisdom have invented several ways to "enhance" graphical information to try and make their point. Percent changes, derivatives, normalization, logarithmic scales: pick your poison. But make sure you pick the right poison because certain poisons work on certain subjects.

Let's start with the "raw" RGDP data. Iceland is in blue, the rest red (because the question is: Is Iceland faring better?). I forgot to put units on the graph (bad Bourbaki) but the y-axis is Millions of 2005 Euros.
Iceland has only a few hundred thousand people in it so its RGDP is pretty small. However, RGDP per capita is huge; Iceland and Ireland are wealthy countries relative to the others. So according to one metric, How much money do you have?, Iceland wins with a much larger RGDP per capita.

But we want to look at the recession, so one side of this "debate" made the choice to normalize to the pre-recession peak. This is standard practice in economics. A recession only lasts a few years, so inside that window your RGDP data are approximately linear. Normal growth rates (r) are on the order of a few to several percent per year, so r*t << 1 when the time t is only several years. For linear data, normalization is fine, but you need a way to select your normalization point that isn't arbitrary ... hence choosing the peak (or other feature). This is what we get.
You can see Iceland near the top from 2008 to 2012 since its recession wasn't as big relative to the peak. Even the pre-recession data has some value because it shows the run-up to the peak (slope) was shallower in Iceland, although the pre-peak levels are not valid for points far from the peak for reasons we'll describe later. Overall, lots of information. Excellent.

Except that the libertarians of the world love the Baltic countries because one of them mentioned Milton Friedman at one point. So they set out to show this wasn't correct. They chose to normalize to the year 2000. And thus, Iceland sucks.
Why 2000? No idea. The peak of Iceland's boom was about 2002, and 2000 was an unremarkable year in the other countries as well. You can choose other years. In fact, if you choose other years, you can show Iceland being anywhere from the bottom to the middle of the pack (2003) ...
To the top of the heap (2006) ...
Actually, by choice of normalization year, you can show any country listed to be at the top of the heap during the recent recession (2008 to today). Iceland in 2007, Estonia in 1997, Latvia in 2011, Ireland in 2011, and Lithuania in 1997. In fact, the year 2000 is the year you'd choose if you wanted to show Iceland at the bottom (which makes me think this was deliberately manipulated by one side of the argument).
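Here is a toy version of the trick, as a hedged sketch. The two economies below are invented (a "rich" slow grower with a mild recession and a "poor" fast grower with a deep one); the growth rates, the 2008 to 2009 crisis window and the function names are all my assumptions, not the actual RGDP data.

```python
YEARS = list(range(2000, 2012))

def levels(start, pre_growth, crisis_growth, post_growth):
    """Yearly levels: steady growth through 2007, two crisis years, then recovery."""
    out = {YEARS[0]: start}
    for year in YEARS[1:]:
        if year <= 2007:
            growth = pre_growth
        elif year <= 2009:
            growth = crisis_growth
        else:
            growth = post_growth
        out[year] = out[year - 1] * (1 + growth)
    return out

def leader(series_by_name, base_year, year):
    """Name of the series that is highest in `year` after normalizing to `base_year`."""
    return max(
        series_by_name,
        key=lambda name: series_by_name[name][year] / series_by_name[name][base_year],
    )

# Invented economies, not real data.
economies = {
    "rich": levels(100.0, 0.02, -0.05, 0.02),
    "poor": levels(20.0, 0.08, -0.12, 0.03),
}

leader_base_2000 = leader(economies, 2000, 2011)
leader_base_2007 = leader(economies, 2007, 2011)

print("normalized to 2000, top in 2011:", leader_base_2000)
print("normalized to 2007, top in 2011:", leader_base_2007)
```

Same data, two base years, two different winners. Normalizing to 2000 rewards the faster pre-crisis trend; normalizing to the 2007 peak compares only the size of the drop.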

In general, normalizing time series data that is linear in log space creates a time-dependent scale. For short times, log(a+b*x) ~ log(a) + x*(b/a) + O(x^2). You can normalize lines. But over 10 years or so with growth rates on the order of a few to several percent per year, you need those O(x^2) terms.
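To put rough numbers on "short times", here is a small sketch comparing exponential growth exp(r*t) with its straight-line approximation 1 + r*t. The 5% growth rate and the 2-year and 10-year windows are values I picked for illustration, and continuous compounding is a simplifying assumption.

```python
import math

def linear_error(rate, years):
    """Relative shortfall of the linear approximation 1 + r*t versus exp(r*t)."""
    exact = math.exp(rate * years)
    approx = 1 + rate * years
    return (exact - approx) / exact

short_window = linear_error(0.05, 2)   # a recession-sized window
long_window = linear_error(0.05, 10)   # a decade

print(f"2 years:  {100 * short_window:.2f}% off")
print(f"10 years: {100 * long_window:.2f}% off")
```

Over a couple of years the straight line is off by well under a percent; over a decade the miss is several percent, which is comparable to the recession effects being argued about.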

If you look back to the first graph, you can see a nice long linear trend in the data, which suggests the correct approach if you want to look at recovery from a deviation from the previous trend: fit the pre-crisis trend and look at the percent difference.
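A minimal sketch of that recipe, assuming an ordinary least-squares line as the trend and 2007 as the last pre-crisis year. The series is invented (exactly linear through 2007, then a dip) and the function names are mine; this is not necessarily the fit used for the graphs.

```python
def fit_line(xs, ys):
    """Ordinary least-squares slope and intercept for y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sum(
        (x - mean_x) ** 2 for x in xs
    )
    return slope, mean_y - slope * mean_x

def percent_gap(years, values, last_trend_year):
    """Percent difference between each value and the extrapolated pre-crisis trend."""
    pre = [(y, v) for y, v in zip(years, values) if y <= last_trend_year]
    slope, intercept = fit_line([y for y, _ in pre], [v for _, v in pre])
    return {
        y: 100.0 * (v - (intercept + slope * y)) / (intercept + slope * y)
        for y, v in zip(years, values)
    }

# Invented series: +3 per year through 2007, then a recession.
years = list(range(2000, 2012))
values = [100.0 + 3.0 * (y - 2000) for y in range(2000, 2008)] + [115.0, 110.0, 112.0, 115.0]

gap = percent_gap(years, values, last_trend_year=2007)
for y in (2005, 2009, 2011):
    print(y, round(gap[y], 1))
```

Because the gap is measured against each country's own trend, no normalization year has to be chosen, and different slopes and intercepts stop mattering.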

Here are some fits to the pre-crisis trend (Iceland: blue, Estonia: red, Latvia: orange) ...
Note these linear fits have different slopes and intercepts. That's why normalization to a specific year allows you to put any of the countries on top. Also note that the slopes are higher for the Baltics. I think this is what the libertarians are trying to give credit for, but the overall higher trend is not germane to the question of how bad the recession is. Additionally, poor countries like the Baltics all have higher growth rates than rich countries like Iceland and Ireland. Rapid growth from a low base is what is behind massive growth numbers from China, for example. You can think of it as picking the low-hanging fruit. (China is in the process of transforming low-productivity agricultural workers to higher-productivity industrial workers.)

The result after taking the percent difference from these trends (Iceland: blue, everyone else: red) ...
Iceland is on top again.

What have we learned?

  • All data can be manipulated. If someone shows you a graph in a certain format (removing the origin, normalizing to some arbitrary year), question their formatting choices.
  • Corollary: Especially question when someone decides to change the format of previously graphed data to opine or make a political/partisan/school-of-thought point. It could even be just to get more page views.
  • Normalize and scale to features of your data (peaks, troughs, trends), not arbitrary points.
  • Specifically, Iceland seems to have fared better than the Baltic countries (and Ireland) in the recession when the data is normalized to the peak or fit to the trend. Iceland is also doing better when measured by RGDP per capita. As these (peak, level, trend) are the only features of a linear data set besides, say, the level of noise/seasonal variations, we can say with confidence that Iceland is indeed doing much better when measured with RGDP.

Marginal Revolution has been a serious offender on the kind of manipulation described in the first bullet. Or at least on spreading the offending graphs around. This graph shows the same shenanigans mentioned here. This graph basically shows the first graph at the top of the page and asks what's the big deal? The big deal is of course that graphing on a linear scale in this case exaggerates the level when the question is about the trend. They are entering Freakonomics territory. My opinion of this last graph is well known.

Sunday, May 13, 2012

It slices *and* dices?

Problems solved by the slime mold include ... other complex mathematical challenges (like creating a Voronoi diagram and a Delaunay triangulation).
A Delaunay triangulation is a dual graph to a Voronoi diagram. They are the same mathematical problem (at least with the ordinary distance metric ... and guess what ... they didn't bother with any other metrics). It sounds neater when you put them both in as examples, though.

Nice to see that the NYT is coming around to stuff that was reported two years ago.