Wednesday, September 5, 2012

Study changes (NYT editors') understanding of how DNA causes disease

Study Changes Understanding of How DNA Causes Disease
At least four million gene switches that reside in bits of DNA once thought to be inactive turn out to play critical roles in health, researchers reported.

So, what is your first take on what this story is about? Just, say, reading the title and the lede. It sounds like this is some sort of new result about how "junk DNA" actually does something. Wow! And there might be new understanding of (potentially all?) disease! 

Of course, these pieces of junk DNA have been known to be associated with certain diseases for a decade. In fact, the author likely knew this, as it is written in the article:
In large studies over the past decade, scientists found that minor changes in human DNA sequences increase the risk that a person will get those diseases.
The earliest papers are from the late 1990s and 2000s. As the Human Genome Project was coming up with much less than expected, scientists pushed into this area. 

And of course, gene switches aren't new -- another fact the author likely knew as it too is written in the article.

In recent years, some [scientists] began to find switches in the 99 percent of human DNA that is not genes
I think the author left out what "recent years" meant because ten years doesn't sound so new. By 2007 it was sufficiently established for NOVA to cover it.

In fact, the entire concept has been around for a while. The reason I wrote this particular post is that I have personally known about this simply through the aforementioned NOVA episode. I knew enough about gene switches in 2008 to comment (with proto-spittle-flecked ire) on an idiotic statement by Ray Kurzweil claiming the brain is simple because a human DNA sequence consists of only "50 million bytes" of information. I said:
In the worst case, a sizable fraction of all 2^20000 [gene on/off] states could be involved to get from a stem cell to every neuron in its right place of the brain with the proper function.
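For scale, here is a back-of-the-envelope sketch, assuming roughly 20,000 protein-coding genes each crudely modeled as a binary on/off switch (the gene count is an assumption for illustration):

```python
import math

n_genes = 20000  # rough count of human protein-coding genes (an assumption)

# 2^20000 states is a number with n * log10(2) + 1 decimal digits
digits = math.floor(n_genes * math.log10(2)) + 1
print(digits)  # 6021 -- a state space that dwarfs "50 million bytes"
```

Even this crude counting ignores the 3D structure and regulatory timing, so it understates the complexity.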

I don't want to detract from the actual work presented in the article. It is a pretty awesome piece of human genome mapping, and it really sheds light on how complex the whole thing is.  (And it puts some more hurt on Kurzweil since the entire 3D structure along with the switches appears to be important in DNA.) 

Gina Kolata seems to be a stand-up molecular biologist cum journalist. I imagine the editors of the NYT were completely blown away by progress in stuff they hadn't been paying attention to since the 1990s (or maybe ever) and said she should change the lede. 

And I guess it got me to click the link.

Monday, September 3, 2012

Other than the grammar ...

How often do people ask the question "Other than the grammar, how was the speech?" Well, apparently over 298 million times.

Human speech derives its information-carrying capacity from several places, not the least of which is its temporal structure. Sorting on word frequency literally destroys significant quantities of information. The entire information content of the word green next to the word frog (i.e. green frog, for those following along at home) is that green is modifying frog, both to convey the information that the frog is green and to distinguish said frog from e.g. a poison dart frog (which is not green, but instead blue or yellow). If I take that word green and move it to a different position unrelated to the position of frog, that word green no longer carries any information at all ... and any information you do decide to imbue it with has no foundation whatsoever.
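The point is easy to demonstrate: a bag-of-words count is invariant under any shuffling of the text, so whatever the count tells you, it cannot be about word order. A minimal sketch (the sentence is made up for illustration):

```python
import random
from collections import Counter

text = "the green frog sat near the blue poison dart frog".split()

shuffled = text[:]
random.shuffle(shuffled)

# Word frequencies survive the shuffle unchanged...
assert Counter(text) == Counter(shuffled)

# ...but adjacency -- which word modifies which -- does not.
bigrams = list(zip(text, text[1:]))
print(("green", "frog") in bigrams)  # True only in the original order
```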

The above word cloud (apparently also known as a wordle, though that may just be the name of a specific piece of generation software) of Romney's speech to the RNC has removed all of the information except that he might be running for President of America. However, that is information I am adding to this infographic. The speaker simply mentions President and America. It could be in a negative light. The most commonly appearing words in this blog post are green and frog, but I'm not talking about green frogs. In a sense, the creators have only done a half-assed job. Below, I lay waste to the information content, reducing the speech to an empirical estimate of the letter frequency of English.
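That last reduction is just a letter count; a sketch of the sort of thing I mean (the input string is a stand-in for the actual transcript):

```python
from collections import Counter

speech = "We will restore America and keep America strong"  # stand-in text

letters = [c.lower() for c in speech if c.isalpha()]
freq = Counter(letters)
total = len(letters)

# Empirical letter frequencies, most common first
for letter, count in freq.most_common(5):
    print(f"{letter}: {count / total:.3f}")
```

As in real English text, "e" comes out on top, and every trace of the speech's meaning is gone.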

(Sorry. I couldn't help myself.)

Thursday, July 5, 2012

Fun with normalization, economics edition

So the new thing in economic circles is for rich countries to look at poorer ones as economies to emulate, and this has sparked some kind of debate about Iceland, Estonia, Latvia, Ireland, and Lithuania: which one fared best during the recession? And debate begets ... graphs! I love graphs.

This blog summarizes the graphs, but takes a demonstrably wrong view of the data. I link to it because it and the side it supports are to be the recipients of my spittle-flecked ire.

The subjects of the graphs are the RGDP data for the aforementioned countries. Now, the RGDPs of all countries at any particular time form a power-law distribution, making it difficult to graph them on standard axes in a way that conveys information. That's why humans in all their wisdom have invented several ways to "enhance" graphical information to try to make their point. Percent changes, derivatives, normalization, logarithmic scales: pick your poison. But make sure you pick the right poison, because certain poisons work on certain subjects.

Let's start with the "raw" RGDP data. Iceland is in blue, the rest red (because the question is: Is Iceland faring better?). I forgot to put units on the graph (bad Bourbaki) but the y-axis is Millions of 2005 Euros.
Iceland has only a few hundred thousand people, so its RGDP is pretty small. However, its RGDP per capita is huge; Iceland and Ireland are wealthy countries relative to the others. So according to one metric, "How much money do you have?", Iceland wins with a much larger RGDP per capita.

But we want to look at the recession, so one side of this "debate" made the choice to normalize to the pre-recession peak. This is standard practice in economics. A recession only lasts a few years, so inside that window your RGDP data are approximately linear: normal growth rates (r) are on the order of a few to several percent per year, so r*t << 1 over a window of a few years (t). For linear data, normalization is fine, but you need a way to select your normalization point that isn't arbitrary ... hence choosing the peak (or another feature). This is what we get.
You can see Iceland near the top from 2008 to 2012 since its recession wasn't as big relative to the peak. Even the pre-recession data has some value because it shows the run-up to the peak (slope) was shallower in Iceland. The pre-peak levels are not valid for points far from the peak, for reasons we'll describe later. Overall, lots of information. Excellent.

Except that the libertarians of the world love the Baltic countries because one of them mentioned Milton Friedman at one point. So they set out to show this wasn't correct. They chose to normalize to the year 2000. And thus, Iceland sucks.
Why 2000? No idea. The peak of Iceland's boom was about 2002, and the year 2000 was an unremarkable year in the other countries. You can choose other years. In fact, if you choose other years, you can show Iceland being anywhere from the bottom to the middle of the pack (2003) ...
To the top of the heap (2006) ...
Actually, by choice of normalization year, you can show any country listed to be at the top of the heap during the recent recession (2008 to today). Iceland in 2007, Estonia in 1997, Latvia in 2011, Ireland in 2011, and Lithuania in 1997. In fact, the year 2000 is the year you'd choose if you wanted to show Iceland at the bottom (which makes me think this was deliberately manipulated by one side of the argument).
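The mechanism is easy to reproduce with two toy series: one with a boom and bust, one with modest growth and a mild dip. Which one "wins" depends entirely on the base year you divide by (the numbers below are made up for illustration, not the actual RGDP data):

```python
# Two made-up "RGDP" series, 2000-2010 (index = year - 2000); purely illustrative
baltic  = [100 * 1.08 ** t for t in range(8)] + [155, 140, 145]  # boom, then deep bust
iceland = [100 * 1.02 ** t for t in range(8)] + [112, 110, 112]  # modest growth, mild bust

def normalize(series, base_index):
    """Divide every point by the value in the chosen base year."""
    return [x / series[base_index] for x in series]

# Base year 2000 (index 0): the boom-bust country still looks better in 2010
b0, i0 = normalize(baltic, 0), normalize(iceland, 0)
print(b0[-1] > i0[-1])   # True

# Base year 2007, the peak (index 7): the ranking flips
b7, i7 = normalize(baltic, 7), normalize(iceland, 7)
print(b7[-1] > i7[-1])   # False
```

Same data, opposite conclusions, just by sliding the denominator around.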

In general, normalizing a time series that is linear in log space creates a time-dependent scale. For short times, log(a + b*t) ~ log(a) + (b/a)*t + O(t^2), so you can normalize lines. But over 10 years or so, with growth rates on the order of a few to several percent per year, you need those O(t^2) terms.
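Spelled out, here is a sketch of the algebra, using exponential growth as the stand-in for a log-linear series:

```latex
% Normalizing exponential growth Y(t) = Y_0 e^{rt} to a base year t_0:
\frac{Y(t)}{Y(t_0)} = e^{r(t - t_0)}
% Two countries with rates r_1 > r_2 then differ by the factor
\frac{Y_1(t)/Y_1(t_0)}{Y_2(t)/Y_2(t_0)} = e^{(r_1 - r_2)(t - t_0)}
% which depends on the choice of t_0: sliding the base year slides the
% crossover point. The short-time expansion shows when a linear treatment
% is safe:
\log(a + b\,t) = \log a + \frac{b}{a}\,t - \frac{b^2}{2a^2}\,t^2 + O(t^3)
% so treating the series as a normalizable line is fine only while
% (b/a) t << 1.
```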

If you look back to the first graph, you can see a nice long linear trend in the data, which suggests the correct way to look at recovery from a deviation from the previous trend: fit the pre-crisis trend and look at the percent difference from it.
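That procedure is nothing exotic: an ordinary least-squares line through the pre-crisis points, then the percent deviation of the data from the extrapolated line. A self-contained sketch with made-up numbers (not the actual RGDP series):

```python
def linear_fit(xs, ys):
    """Ordinary least-squares slope and intercept."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    return slope, mean_y - slope * mean_x

# Made-up RGDP series, 2000-2010; the crisis hits after 2007 (index 7)
years = list(range(2000, 2011))
rgdp = [100, 103, 106, 109, 113, 116, 120, 124, 118, 112, 115]

slope, intercept = linear_fit(years[:8], rgdp[:8])  # fit pre-crisis points only
trend = [slope * y + intercept for y in years]

# Percent difference from the extrapolated pre-crisis trend
pct_diff = [100 * (actual - t) / t for actual, t in zip(rgdp, trend)]
print(round(pct_diff[-1], 1))  # how far below trend the economy sits in 2010
```

Unlike picking a base year, the fit uses a feature of the data itself (the pre-crisis trend), so there is no arbitrary knob to turn.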

Here are some fits to the pre-crisis trend (Iceland: blue, Estonia: red, Latvia: orange) ...
Note these linear fits have different slopes and intercepts. That's why normalization to a specific year allows you to put any of the countries on top. Also note that the slopes are higher for the Baltics. I think this is what the libertarians are trying to give the Baltics credit for, but the overall higher trend is not germane to the question of how bad the recession was. Additionally, poor countries like the Baltics generally have higher growth rates than rich countries like Iceland and Ireland. Rapid growth from a low base is what is behind the massive growth numbers from China, for example. You can think of it as picking the low-hanging fruit. (China is in the process of transforming low-productivity agricultural workers into higher-productivity industrial workers.)

The result after taking the percent difference from these trends (Iceland: blue, everyone else: red) ...
Iceland is on top again.

What have we learned?

  • All data can be manipulated. If someone shows you a graph in a certain format (removing the origin, normalizing to some arbitrary year), question their formatting choices.
  • Corollary: Be especially skeptical when someone decides to reformat previously graphed data to make a political, partisan, or school-of-thought point. It could even be just to get more page views.
  • Normalize and scale to features of your data (peaks, troughs, trends), not arbitrary points.
  • Specifically, Iceland seems to have fared better than the Baltic countries (and Ireland) in the recession when the data are normalized to the peak or fit to the trend. Iceland is also doing better when measured by RGDP per capita. As these (peak, level, trend) are the only features of a linear data set besides, say, the level of noise/seasonal variations, we can say with confidence that Iceland is indeed doing much better when measured with RGDP.

Marginal Revolution has been a serious offender on this kind of manipulation in the first bullet. Or at least on spreading the offending graphs around. This graph shows the same shenanigans mentioned here. This graph basically shows the first graph at the top of the page and asks what's the big deal? The big deal is of course that graphing on a linear scale in this case exaggerates the level when the question is about the trend. They are entering Freakonomics territory. My opinion of this last graph is well known.

Sunday, May 13, 2012

It slices *and* dices?

Problems solved by the slime mold include ... other complex mathematical challenges (like creating a Voronoi diagram and a Delaunay triangulation).
A Delaunay triangulation is the dual graph of a Voronoi diagram. They are the same mathematical problem (at least with the ordinary distance metric ... and guess what ... they didn't bother with any other metrics). It sounds neater when you put them both in as examples, though.
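The duality is concrete: two sites share a Delaunay edge exactly when their Voronoi cells share a boundary, so solving one problem hands you the other for free. A brute-force numerical sketch (the four sites are made up, and the ordinary Euclidean metric is assumed):

```python
from collections import Counter
from itertools import product

sites = [(0.0, 0.0), (5.0, 0.0), (6.0, 4.0), (1.0, 3.0)]  # arbitrary convex quad

def nearest(p):
    """Index of the Voronoi cell (nearest site) containing point p."""
    return min(range(len(sites)),
               key=lambda i: (p[0] - sites[i][0]) ** 2 + (p[1] - sites[i][1]) ** 2)

# Sample a grid; whenever two horizontally or vertically adjacent samples fall
# in different cells, those two Voronoi cells (approximately) share a boundary.
step = 0.05
xs = [-2 + step * i for i in range(200)]
ys = [-2 + step * j for j in range(160)]
crossings = Counter()
for x, y in product(xs, ys):
    a = nearest((x, y))
    for b in (nearest((x + step, y)), nearest((x, y + step))):
        if a != b:
            crossings[frozenset((a, b))] += 1

# Keep only robust adjacencies; these pairs are the Delaunay edges (dual graph):
edges = {pair for pair, n in crossings.items() if n >= 10}
print(sorted(tuple(sorted(e)) for e in edges))
```

For this quad you get the four hull edges plus exactly one diagonal, the Delaunay triangulation, read straight off the Voronoi cell adjacencies.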

Nice to see that the NYT is coming around to stuff that was reported two years ago.

Sunday, April 22, 2012

Blood from The Stone

The Stone is one of the greatest threats to intellectual discourse since the invention of the blog.
“I can’t answer ['What is philosophy?'] directly. I will tell you why I became a philosopher. I became a philosopher because I wanted to be able to talk about many, many things, ideally with knowledge, but sometimes not quite the amount of knowledge that I would need if I were to be a specialist in them. It allows you to be many different things. And plurality and complexity are very, very important to me.” (Alexander Nehamas)
Nehamas became a philosopher in order to further his desire to talk out of his ass. I think that about sums it up.

Luckily for us, there was an example of this from earlier in the month.
Take for example mathematics**, theoretical physics, psychology and economics***. These are predominately rational conceptual disciplines. That is, they are not chiefly reliant on empirical observation. For unlike science, they may be conducted while sitting in an armchair with eyes closed.
Ok, I'll bite: theoretical physics is not based on data.
As such, whereas science tends to alter and update its findings day to day through trial and error, logical deductions are timeless(**). This is why Einstein pompously called attempts to empirically confirm his special theory of relativity “the detail work.” 
Ha ha. Einstein is pompous. Wait; I thought you said theoretical physics is not based on empirical observation?
Indeed last September, The New York Times reported that scientists at the European Center for Nuclear Research (CERN) thought they had empirically disproved Einstein’s theory that nothing could travel faster than the speed of light, only to find their results could not be reproduced in follow-up experiments last month. Such experimental anomalies are confounding. But as CERN’s research director Sergio Bertolucci plainly put it, “This is how science works.”
So now theoretical physics is based on empirical data? I pause to note that these are two consecutive paragraphs.

I have two subsequent points.
  • This would not have "disproved" Einstein's theory. It would have meant that Einstein's theory was some approximation to some underlying theory. GPS, which uses Einstein's General Relativity to work, would keep on working. Muons generated from cosmic rays would still make it through the atmosphere.
  • It wasn't that their results couldn't be reproduced. There was an error of some kind that failed to account for the difference in clock speeds at different points in the Earth's gravitational field. (Per Student, a loose cable.)

Talking out of one's ass indeed.

** "However, 5 plus 7 will always equal 12. No amount of further observation will change that." Except in modular arithmetic. Or under any other redefinition of the binary operator "+". Or in a different base. Or when adding 5 mL of isopropyl alcohol to 7 mL of water. The reason you can be sure of the constancy of the underlying claim, snarking aside, is that it is arbitrary: it is a small step from the Peano axioms to 5 + 7 = 12, and therefore it is as arbitrary as those axioms. Timeless, indeed. We see the fallacy of the preeminence of human thought again.
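The snark is checkable; a toy illustration:

```python
# Ordinary integer addition
assert 5 + 7 == 12

# Clock (mod-12) arithmetic: 5 + 7 wraps around to 0
assert (5 + 7) % 12 == 0

# Same quantity, different base: twelve is written "14" in base 8
assert oct(5 + 7) == "0o14"

# And 5 mL of isopropyl alcohol plus 7 mL of water yields less than 12 mL,
# since mixed volumes are not additive -- a physical fact, not something
# arithmetic (or Python) can check.
print("all redefinitions behaved as advertised")
```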

*** I don't think it was an intentional hit on economics, but I like it. They do make charts with empirical data in them. So does psychology.