Editor's Note: This is a repost of a previous post on our blog from 2012. The repost is inspired by similar issues with statistical illiteracy that are coming up in allergy screening and pregnancy screening.
I was just doing my morning reading of a few news sources and stumbled across this Huffington Post article talking about research correlating babies' cries to autism. It suggests that the sound of a baby's cries may predict their future risk for autism. As the parent of a young son, this obviously caught my attention in a very lizard-brain, caveman sort of way. I couldn't find a link to the research paper in the article so I did some searching and found out this result is also being covered by Time, Science Daily, Medical Daily, and a bunch of other news outlets.
Now thoroughly freaked, I looked online and found the pdf of the original research article. I started looking at the statistics and took a deep breath. Based on the analysis they present in the article there is absolutely no statistical evidence that a baby's cries can predict autism. Here are the flaws with the study:
Taken together, these problems mean that the statistical analysis of these data does not show any connection between crying and autism.
The problem here exists on two levels. First, there was a failing in the statistical evaluation of this manuscript at the peer review level. Most statistical referees would have spotted these flaws and pointed them out for such a highly controversial paper. A second problem is that news agencies report on this result and, despite paying lip-service to potential limitations, are not statistically literate enough to point out the major flaws in the analysis that reduce the probability of a true positive. Should journalists have some minimal training in statistics that allows them to determine whether a result is likely to be a false positive to save us parents a lot of panic?
Editor's Note: Last year I made a list off the top of my head of awesome things other people did. I loved doing it so much that I'm doing it again for 2014. Like last year, I have surely missed awesome things people have done. If you know of some, you should make your own list or add it to the comments! The rules remain the same. I have avoided talking about stuff I worked on or that people here at Hopkins are doing because this post is supposed to be about other people's awesome stuff. I wrote this post because a blog often feels like a place to complain, but we started Simply Stats as a place to be pumped up about the stuff people were doing with data. Update: I missed pipes in R, now added!
I'll let @ResearchMark take us out:
This year, Kobe leads the league in missed shots (by a lot), has an abysmal FG% of 39 and his team plays better when he is on the bench. Yet he blames his teammates for the Lakers' 6-16 record. Below is a plot showing that 2014 is not the first time the Lakers have been mediocre during Kobe's tenure. It shows the percentage points above .500 per season with the Shaq and twin towers eras highlighted. I include the same plot for LeBron as a control.
So stop blaming your teammates!
And here is my hastily written code (don't judge me!).
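For readers who want to recompute the plotted quantity, here is a minimal sketch, assuming "percentage points above .500" means the winning percentage minus 50. It is an illustration only, not the code used for the plot, and the function name is hypothetical.

```python
def points_above_500(wins, losses):
    """Winning percentage minus 50, in percentage points (assumed definition)."""
    games = wins + losses
    if games == 0:
        raise ValueError("no games played")
    return 100.0 * wins / games - 50.0


# The Lakers' 6-16 record mentioned above.
print(round(points_above_500(6, 16), 1))  # -22.7
```

A team that wins exactly half its games scores 0 on this scale, so negative values mark losing seasons.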
Editor's note: Last week the Latin American media picked up a blog post with the eye-catching title "The perfect human is Puerto Rican". More attention appears to have been given to the title than the post itself. The coverage and comments on social media have demonstrated the need for scientific education on the topic of genetics and race. Here I will try to explain, in layman's terms, why the interpretations I read in the main Puerto Rican paper are scientifically incorrect and somewhat concerning. The post was originally written in Spanish.
In a recent article titled “Ser humano perfecto sería puertorriqueño” ("The perfect human being would be Puerto Rican"), El Nuevo Día summarized a blog post (erroneously called a study) by the mathematician Lior Pachter. The author of the blog, attempting to ridicule racist comments he heard James Watson make, describes a thought experiment in which he finds that the “perfect” human (the quotation marks are important), if one existed, would belong to a genetically admixed group. Of the people studied, the one genetically closest to his “perfect” human turned out to be a Puerto Rican woman. The motivation for this exercise was to ridicule the idea that one race can be superior to another. El Nuevo Día seems to miss this point and tells us that “the expert concluded that in any case it is not surprising that the person closest to such perfection would be a Puerto Rican woman, due to the combination of good genes that the Puerto Rican race has.” Here I describe why this interpretation is scientifically wrong.
What is the genome?
The human genome encodes (in DNA molecules) the genetic information necessary for our biological development. We can think of the genome as two series of 3,000,000,000 letters (A, T, C, or G) concatenated. We receive one from our father and the other from our mother. Different pieces (the genes) encode proteins needed for the thousands of functions that our cells carry out and that lead to some of our physical characteristics. With a few exceptions, every cell in our body contains an exact copy of these two series of letters. The sperm and the egg have only one series of letters, a mixture of the other two. When the sperm and the egg unite, the new cell, the zygote, joins the two series, and that is how we inherit characteristics from each parent.
What is genetic variation?
If we all descend from the first human, how is it that we are different? Although it is very rare, these letters sometimes mutate at random. For example, a C can change to a T. Over hundreds of thousands of years, enough mutations have occurred to create variation among humans. The theory of natural selection tells us that if a mutation confers a survival advantage, whoever carries it is more likely to pass it on to their descendants. For example, light skin is more advantageous in Europe, because of its ability to produce vitamin D when there is little sun, than in West Africa, where the melanin in dark skin protects against intense sun. It is estimated that differences among humans can be found in at least 10 million of the 3 billion letters (note that this is less than 1%).
Genetically, what is a “race”?
This is a controversial question. What is not controversial is that if we compare the series of letters of northern Europeans with those of West Africans or of the indigenous peoples of the Americas, we find pieces of the code that are unique to each region. If we study the parts of the code that vary among humans, we can easily distinguish the three groups. This should not surprise us, given that, for example, differences in eye color and skin pigmentation are encoded by different letters in the genes associated with these traits. In this sense we could create a genetic definition of “race” based on the letters that distinguish these groups. Now, can we do the same to distinguish a Puerto Rican from a Dominican? Can we create a genetic definition that includes Carlos Delgado and Mónica Puig but not Robinson Canó and Juan Luis Guerra? The scientific literature tells us no.
In a series of articles, the geneticist Carlos Bustamante and his colleagues have studied the genomes of people from several ethnic groups. They define a genetic distance that they summarize in two dimensions in the plot above. Each point is a person and the color represents their group. Note the three corners of the plot, each with many points of the same color clustered together. These are white Europeans (red points), West Africans (green), and indigenous Americans (blue). The more scattered points in the middle are the admixed populations. Between the Europeans and the indigenous Americans we see the Mexicans, and between the Europeans and the Africans, the African Americans. Puerto Ricans are the orange points. I have highlighted three of them with numbers. Number 1 is close to the supposed “perfect” human. Number 2 is indistinguishable from a European and number 3 is indistinguishable from an African American. The rest of us cover a wide spectrum. I also highlight with the number 4 a Dominican who is as close to “perfection” as the Puerto Rican woman. The main observation is that there is a great deal of genetic variation among Puerto Ricans. Among those Bustamante studied, African ancestry ranges from 5% to 60%, European from 35% to 95%, and Taíno from 0% to 20%. How, then, can we speak of a Puerto Rican "race" when our genomes span a space so large that it can include, among others, Europeans, African Americans, and Dominicans?
What are “good” genes?
Some mutations are lethal. Others result in changes to proteins that cause diseases such as cystic fibrosis and require both parents to carry the mutation. The mixing of different genomes therefore lowers the probability of these diseases. Recently, a series of studies has found advantages for some combinations of letters related to common diseases such as hypertension. A genetic mixture that avoids having two copies of these higher-risk genes can be advantageous. But the supposed advantages are tiny and specific to diseases, not to other characteristics that we associate with “perfection”. The concept of “good genes” is a vestige of eugenics.
Despite our current social and economic problems, Puerto Rico has much to be proud of. In particular, we produce excellent engineers, athletes, and musicians. Attributing their success to the “good genes” of our “race” is not only scientific nonsense but also disrespectful to these individuals, who through hard work, discipline, and dedication have achieved what they have achieved. If you want to know whether Puerto Rico had anything to do with the success of these individuals, ask a historian, an anthropologist, or a sociologist, not a geneticist. Now, if you want to learn about the potential of studying genomes to improve medical treatments, and the importance of studying a diverse range of individuals, a geneticist will have plenty to share.
CT: The questions that get me up and out of bed in the morning the fastest are biology questions. I work on cell differentiation - I want to know how to define the state of a cell and how to predict transitions between states. That said, my approach to these questions so far has been to use new technologies to look at previously hard to access aspects of gene regulation. For example, I’ve used RNA-Seq to look beyond gene expression into finer layers of regulation like splicing. Analyzing sequencing experiments often involves some pretty non-trivial math, computer science, and statistics. These data sets are huge, so you need fast algorithms to even look at them. They all involve transforming reads into a useful readout of biology, and the technical and biological variability in that transformation needs to be understood and controlled for, so you see cool mathematical and statistical problems all the time. So I guess you could say that I’m a biologist, both experimental and computational. I have to do some computer science and statistics in order to do biology.
CT: Three reasons, mainly:
1) I thought learning to do bench work would make me a better overall scientist. It has, in many ways, I think. It’s fundamentally changed the way I approach the questions I work on, but it’s also made me more effective in lots of tiny ways. I remember when I first got to John Rinn’s lab, we needed some way to track lots of libraries and other material. I came up with some scheme where each library would get an 8-digit alphanumeric code generated by a hash function or something like that (we’d never have to worry about collisions!). My lab mate handed me a marker and said, “OK, write that on the side of these 12 microcentrifuge tubes”. I threw out my scheme and came up with something like “JR_1”, “JR_2”, etc. That’s a silly example, but I mention it because it reminds me of how completely clueless I was about where biological data really comes from.
2) I wanted to establish an independent, long-term research program investigating differentiation, and I didn’t want to have to rely on collaborators to generate data. I knew at the end of grad school that I wanted to have my own wet lab, and I doubted that anyone would trust me with that kind of investment without doing some formal training. Despite the now-common recognition by experimental biologists that analysis is incredibly important, there’s still a perception out there that computational biologists aren’t “real biologists”, and that computational folks are useful tools, but not the drivers of the intellectual agenda. That's of course not true, but I didn’t want to fight the stigma.
3) It sounded fun. I had one or two friends who had followed the “dry to wet” training trajectory, and they were having a blast. Seeing a result live under the microscope is satisfying in a way that I’ve rarely experienced looking at a computer screen.
CT: Yes. I’m going to be starting my lab at the University of Washington in the department of Genome Sciences this summer, and it’s going to be a roughly 50/50 operation, I hope. Many of the labs there are set up that way, and there’s a real culture of valuing both sides. As a postdoc, I’ve been extremely fortunate to collaborate with grad students and postdocs who were trained as cell or molecular biologists but wanted to learn sequencing analysis. We’d train each other, often at great cost in terms of time spent solving “somebody else’s problem”. I’m going to do my best to create an environment like that, the way John did for me and my lab mates.
CT: That’s a good question, and I don’t really have a good answer. You’ve talked a lot on this blog about the importance of making science more reproducible and how journals could change to make it so. I agree wholeheartedly with a lot of what you’ve said. I like the idea of “papers as packages”, but I don’t see it happening soon, because it’s a huge amount of extra work and there’s not a big incentive to do so. Doing so might make it easier to be attacked, so there could even be a disincentive! Scientists do well when they publish papers and those papers are cited widely. We have lots of ways to quantify “impact” - h-index, total citation count, how many times your paper is shared via twitter on a given day, etc. (Say what you want about whether these are meaningful measures).
We don’t have a good way to track who’s right and who’s wrong, or whose results are reproducible and whose aren’t, short of full blown paper retraction. Most papers aren’t even checked in a serious way. Worse, the papers that are checked are the ones that a lot of people see - few people spend precious time following up on tangential observations in low circulation journals. So there’s actually an incentive to publish “controversial” results in highly visible journals because at least you’re getting attention.
Maybe we need a Yelp for papers and data sets? One where in order to dispute the reproducibility of the analysis, you’d have to provide the code *you* ran to generate a contradictory result? There needs to be a genuine and tangible *reward* (read: funding and career advancement) for putting up an analysis that others can dive into, verify, extend, and learn from.
In any case, I think it’s worth noting that reproducibility is not a problem unique to computation - experimentalists have a hard time reproducing results they got last week, much less results that came from some other lab! There’s all kinds of harmless reasons for that. Experiments are hard. Reagents come in bad lots. You had too much coffee that morning and can’t steady your pipet hand to save your life. But I worry a bit that we could spend a lot of effort making our analysis totally automated and perfectly reproducible and still be faced with the same problem.
CT: Oh man, there are many. Here’s a few:
1) There are some very interesting questions about variability in expression across cells, or within one cell across time. There’s clearly a lot of variability in the expression level of a given gene across cells. But there’s really no way right now to take “replicate” measurements of a single cell. What would that mean? With current technology, to make an RNA-Seq library from a cell, you have to lyse it. So that’s it for that cell. Even if you had a non-invasive way to measure the whole transcriptome, the cell is a living machine that’s always changing in ways large and small, even in culture. Would you consider repeated measurements “replicates”? Furthermore, how can you say that two different cells are “replicate” measurements of a single, defined cell state? Do such states even really exist?
For that matter, we don’t have a good way of assessing how much variability stems from technical sources as opposed to biological sources. One common way of assessing technical variability is to spike some alien transcripts at known concentrations into purified RNA before making the library, so you can see how variable your endpoint measurements are for those alien transcripts. But to do that for single-cell RNA-Seq, we’d have to actually spike transcripts *into* the nucleus of a cell before we lyse it and put it through the library prep process. Just doping it into the lysate’s not good enough, because the lysis itself might (and likely does) destroy a substantial fraction of the endogenous RNA in the cell. So there are some real barriers to overcome in order to get a handle on how much variability is really biological.
2) A second challenge is writing down what a biological process looks like at single cell resolution. I mean we want to write down a model that predicts the expression levels of each gene in a cell as it goes through some biological process. We want to be able to say this gene comes on first, then this one, then these genes, and so on. In genomics up until now, we’ve been in the situation where we are measuring many variables (P) from few measurements (N). That is, N << P, typically, which has made this problem extremely difficult. With single cell RNA-Seq, that may no longer be the case. We can already easily capture hundreds of cells, and thousands of cells per capture is just around the corner, so soon, N will be close to P, and maybe someday greater.
Assume for the moment that we are capturing cells that are either resting at or transiting between well defined states. You can think of each cell as a point in a high-dimensional geometric space, where each gene is a different dimension. We’d like to find those equilibrium states and figure out which genes are correlated with which other genes. Even better, we’d like to study the transitions between states and identify the genes that drive them. The curse of dimensionality is always going to be a problem (we’re not likely to capture millions or billions of cells anytime soon), but maybe we have enough data to make some progress. There’s interesting literature out there for tackling problems at this scale, but to my knowledge these methods haven’t yet been widely applied in biology. I guess you can think of cell differentiation viewed at whole-transcriptome, single-cell resolution as one giant manifold learning problem. Same goes for oncogenesis, tissue homeostasis, reprogramming, and on and on. It’s going to be very exciting to see the convergence of large scale statistical machine learning and cell biology.
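The geometric picture can be made concrete with a toy example. This is a minimal sketch, assuming a tiny invented expression matrix (the gene names and values are made up, not real data): each cell is a point whose coordinates are its genes' expression levels, and we ask how far apart two cells are and how strongly two genes move together.

```python
import math

# Toy expression matrix: rows are cells (N = 4), columns are genes (P = 3).
# All values are invented for illustration.
GENES = ["geneA", "geneB", "geneC"]
cells = [
    [1.0, 2.0, 9.0],
    [2.0, 4.1, 7.0],
    [3.0, 5.9, 5.0],
    [4.0, 8.0, 3.0],
]


def cell_distance(cell_1, cell_2):
    """Euclidean distance between two cells in gene-expression space."""
    return math.dist(cell_1, cell_2)  # math.dist requires Python 3.8+


def gene_correlation(matrix, i, j):
    """Pearson correlation between gene i and gene j across all cells."""
    xs = [row[i] for row in matrix]
    ys = [row[j] for row in matrix]
    mean_x, mean_y = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


print(cell_distance(cells[0], cells[3]))
print(gene_correlation(cells, 0, 1))  # geneA and geneB rise together
print(gene_correlation(cells, 0, 2))  # geneC falls as geneA rises
```

With real data P is in the thousands, which is where the curse of dimensionality mentioned above starts to bite: distances between points become less informative as dimensions are added.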
SS: If you could do it again would you do computational training then wet lab training or the other way around?
CT: I’m happy with how I did things, but I’ve seen folks go the other direction very successfully. My labmates Loyal Goff and Dave Hendrickson started out as molecular biologists, but they’re wizards at the command line now.
CT: Oh, I’d say I hate them all equally.
Just kidding. I’ll always love C++. I work in R a lot these days, as my work has veered away from developing tools for other people towards analyzing data I’ve generated. I still find lots of things about R to be very painful, but ggplot2, plyr, and a handful of other godsend packages make the juice worth the squeeze.
Editor's note: This is a repost of our previous post about deterministic statistical machines. It is inspired by the recent announcement that the Automatic Statistician received funding from Google. In 2012 we also applied to Google for a small research award to study this same problem, but didn't get it. In the interest of extreme openness like Titus Brown or Ethan White, here is the application we submitted to Google. I showed this to a friend who told me the reason we didn't get it is that our proposal was missing two words: "artificial", "intelligence".
As Roger pointed out, the most recent batch of Y Combinator startups included a bunch of data-focused companies. One of these companies, StatWing, is a web-based tool for data analysis that looks like an improvement on SPSS with more plain text, more visualization, and a lot of the technical statistical details “under the hood”. I first read about StatWing on TechCrunch, under the title “How Statwing Makes It Easier To Ask Questions About Data So You Don’t Have To Hire a Statistical Wizard”.
StatWing looks super user-friendly and the idea of democratizing statistical analysis so more people can access these ideas is something that appeals to me. But, as one of the aforementioned statistical wizards, this had me freaked out for a minute. Once I looked at the software though, I realized it suffers from the same problem that most “user-friendly” statistical software suffers from. It makes it really easy to screw up a data analysis. It will tell you when something is significant and if you don’t like that it isn’t, you can keep slicing and dicing the data until it is. The key issue behind getting insight from data is knowing when you are fooling yourself with confounders, or small effect sizes, or overfitting. StatWing looks like an improvement on the UI experience of data analysis, but it won’t prevent false positives that plague science and cost business big $$.
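To see how quickly slicing and dicing produces a "significant" result, here is a minimal sketch. It assumes every slice is an independent test at the 0.05 level and that there is no real effect anywhere. That is an idealization, since slices of the same data set are usually correlated, and the function name is made up for illustration.

```python
def prob_at_least_one_false_positive(num_tests, alpha=0.05):
    """Chance that at least one of `num_tests` independent null tests
    comes out 'significant' at level `alpha`."""
    return 1.0 - (1.0 - alpha) ** num_tests


for num_slices in (1, 5, 20, 100):
    print(num_slices, round(prob_at_least_one_false_positive(num_slices), 3))
```

Under these assumptions, 20 slices already give roughly a 64% chance of at least one spurious finding.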
So I started thinking about what kind of software would prevent these sorts of problems while still being accessible to a big audience. My idea is a “deterministic statistical machine” (DSM). Here is how it works: you input a data set and then specify the question you are asking (is variable Y related to variable X? can I predict Z from W?). Then, depending on your question, it uses a deterministic set of methods to analyze the data: say, regression for inference, linear discriminant analysis for prediction, etc. But the method is fixed and deterministic for each question. It also performs a pre-specified set of checks for outliers, confounders, missing data, maybe even data fudging. It generates a report with a markdown tool and then immediately publishes the result to figshare.
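Here is a rough sketch of what the core of such a machine might look like. Everything in it is an assumption made for illustration: the function names, the single supported question type ("association", answered with simple linear regression), and the two checks (missing values and a 3-standard-deviation outlier rule). Publishing to figshare is left out entirely.

```python
import statistics


def _simple_linear_regression(x, y):
    """Least-squares fit of y on x; the one fixed method for 'association'."""
    mean_x, mean_y = statistics.fmean(x), statistics.fmean(y)
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return {"slope": slope, "intercept": mean_y - slope * mean_x}


# Each question type maps to exactly one method; the user cannot swap it.
METHODS = {"association": _simple_linear_regression}


def run_checks(x, y, z_cutoff=3.0):
    """Pre-specified checks: drop rows with missing values, flag outliers in y."""
    if len(x) != len(y):
        raise ValueError("x and y must have the same length")
    pairs = [(xi, yi) for xi, yi in zip(x, y) if xi is not None and yi is not None]
    ys = [yi for _, yi in pairs]
    mean_y, sd_y = statistics.fmean(ys), statistics.stdev(ys)
    outliers = sum(1 for v in ys if sd_y > 0 and abs(v - mean_y) / sd_y > z_cutoff)
    return pairs, {"missing": len(x) - len(pairs), "outliers": outliers}


def deterministic_analysis(x, y, question):
    """Run the fixed checks and the fixed method, and return a markdown report."""
    if question not in METHODS:
        raise ValueError(f"unsupported question: {question!r}")
    pairs, checks = run_checks(x, y)
    xs, ys = zip(*pairs)
    result = METHODS[question](xs, ys)
    return "\n".join([
        f"# Report: {question}",
        f"- rows used: {len(pairs)}",
        f"- rows dropped for missing data: {checks['missing']}",
        f"- outliers flagged: {checks['outliers']}",
        f"- slope: {result['slope']:.3f}",
        f"- intercept: {result['intercept']:.3f}",
    ])


print(deterministic_analysis([1, 2, 3, 4, None], [2, 4, 6, 8, 10], "association"))
```

The point of the lookup table is that the same data and the same question always produce the same report, so there are no analysis knobs to turn until something comes out significant.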
The advantage is that people can get their data-related questions answered using a standard tool. It does a lot of the “heavy lifting” in checking for potential problems and produces nice reports. But it is a deterministic algorithm for analysis so overfitting, fudging the analysis, etc. are harder. By publishing all reports to figshare, it makes it even harder to fudge the data. If you fiddle with the data to try to get a result you want, there will be a “multiple testing paper trail” following you around.
The DSM should be a web service that is easy to use. Anybody want to build it? Any suggestions for how to do it better?
A few months ago the Pew Research Internet Project published a paper on social media and the ‘spiral of silence’. Their main finding is that people are less likely to discuss a controversial topic on social media than in person. Unlike others, I did not find this result surprising, perhaps because I think like a statistician.
Shares or retweets of published opinions on controversial political topics - religion, abortion rights, gender inequality, immigration, income inequality, race relations, the role of government, foreign policy, education, climate change - are ubiquitous in social media. These are usually accompanied by passionate statements of strong support or outraged disagreement. Because these are posted by people we elect to follow, we generally agree with what we see on our feeds. Here is a statistical explanation for why many keep silent when they disagree.
We will summarize the political view of an individual as their opinions on the 10 topics listed above. For simplicity I will assume these opinions can be quantified with a left (liberal) to right (conservative) scale. Every individual can therefore be defined by a point in a 10-dimensional space. Once quantified in this way, we can define a political distance between any pair of individuals. In the American landscape there are two clear clusters which I will call the Fox News and MSNBC clusters. As seen in the illustration below, the cluster centers are very far from each other and individuals within the clusters are very close. Each cluster has a very low opinion of the other. A glance through a social media feed will quickly reveal individuals squarely inside one of these clusters. Members of the clusters fearlessly post their opinions on controversial topics as this behavior is rewarded by likes, retweets or supportive comments from others in their cluster. Based on the uniformity of opinion inferred from the comments, one would think that everybody is in one of these two groups. But this is obviously not the case.
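One way to make the political distance concrete is sketched below. It assumes each opinion is scored from -1 (left) to +1 (right) and uses Euclidean distance. Both choices, and all the numbers, are mine for illustration rather than measured values.

```python
import math

TOPICS = [
    "religion", "abortion rights", "gender inequality", "immigration",
    "income inequality", "race relations", "role of government",
    "foreign policy", "education", "climate change",
]


def political_distance(a, b):
    """Euclidean distance between two opinion vectors, one score per topic."""
    if len(a) != len(b):
        raise ValueError("opinion vectors must have the same length")
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))


# Made-up cluster centers and one made-up cluster member.
msnbc_center = [-0.8] * len(TOPICS)
fox_center = [0.8] * len(TOPICS)
msnbc_member = [-0.7] * len(TOPICS)

print(political_distance(msnbc_center, fox_center))    # far apart
print(political_distance(msnbc_member, msnbc_center))  # close together
```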
In the illustration above I include an example of an individual (the green dot) who is outside the two clusters. Although not shown, there are many of these independent thinkers. In our example, this individual is very close to the MSNBC cluster, but not in it. The controversial topic posts in this person's feed are mostly posted by those in the cluster of closest proximity, and the spiral of silence is due in part to the fact that independent thinkers are uniformly averse to disagreeing publicly. For the mathematical explanation of why, we introduce the concept of a projection.
In mathematics, a projection can map a multidimensional point to a smaller, simpler subset. In our illustration, the independent thinker is very close to the MSNBC cluster on all dimensions except one. To use education as an example, let's say this person supports school choice. As seen in the illustration, in the projection to the education dimension, that mostly liberal person is squarely in the Fox News cluster. Now imagine that a friend shares an article on The Corporate Takeover of Public Education along with a passionate statement of approval. Independent thinkers have a feeling that by voicing their dissent, dozens, perhaps hundreds, of strangers on social media (friends of friends for example) will judge them solely on this projection. To make matters worse, public shaming of the independent thinker, for supposedly being a member of the Fox News cluster, will then be rewarded by increased social standing among the MSNBC cluster as evidenced by retweets, likes and supportive comments. In a worst-case scenario for this person, and best-case scenario for the critics, this public shaming goes viral. While the short term rewards for preaching to the echo chamber are clear, there are no apparent incentives for dissent.
The superficial and fast-paced nature of social media is not amenable to nuances and subtleties. Disagreement with the groupthink on one specific topic can therefore get a person labeled as a "neoliberal corporate shill" by the MSNBC cluster or a "godless liberal" by the Fox News one. The irony is that in social media, those politically closest to you will be the ones attaching the unwanted label.
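The effect of the projection can be shown with a small sketch. It assumes opinions are scored from -1 (left) to +1 (right), cluster centers sit at -0.8 and +0.8 on every topic, and the independent thinker matches the MSNBC center on everything except education. All of these numbers are invented for illustration.

```python
import math

TOPICS = [
    "religion", "abortion rights", "gender inequality", "immigration",
    "income inequality", "race relations", "role of government",
    "foreign policy", "education", "climate change",
]

# Made-up cluster centers on a -1 (left) to +1 (right) scale.
CENTERS = {
    "MSNBC": [-0.8] * len(TOPICS),
    "Fox News": [0.8] * len(TOPICS),
}


def project(opinions, topic):
    """Project an opinion vector onto a single topic axis."""
    return opinions[TOPICS.index(topic)]


def nearest_cluster(opinions, centers=CENTERS, topic=None):
    """Closest cluster in the full space, or on one topic axis if `topic` is given."""
    if topic is None:
        return min(centers, key=lambda name: math.dist(opinions, centers[name]))
    value = project(opinions, topic)
    return min(centers, key=lambda name: abs(value - project(centers[name], topic)))


# Liberal on nine topics, but supports school choice.
independent = [-0.8] * len(TOPICS)
independent[TOPICS.index("education")] = 0.8

print(nearest_cluster(independent))                     # MSNBC
print(nearest_cluster(independent, topic="education"))  # Fox News
```

Judged on all ten dimensions this person sits next to the MSNBC cluster, but a reader who sees only the education comment places them in the other one.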