Simply Statistics


Why I support statisticians and their resistance to hype

Despite Statistics being the most mature data-related discipline, statisticians have not fared well in terms of being selected for funding or leadership positions in the new initiatives brought about by the increasing interest in data. Just to give one example (Jeff and Terry Speed give many more), the White House Big Data Partners Workshop had 19 members, of which 0 were statisticians. The statistical community is clearly worried about this predicament and there is widespread consensus that we need to be better at marketing. Although I agree that only good can come from better communicating what we do, it is also important to continue doing one of the things we do best: resisting the hype and being realistic about data.

This week, after reading Mike Jordan's reddit ask me anything, I was reminded of exactly how much I admire this quality in statisticians. From reading the interview one learns about instances where hype has led to confusion, and how getting past this confusion helps us better understand and consequently appreciate the importance of his field. For the past 30 years, Mike Jordan has been one of the most prolific academics working in the areas that today are receiving increased attention. Yet, you won't find a hyped-up press release coming out of his lab. In fact, when a journalist tried to hype up Jordan's critique of hype, Jordan called out the author.

Assessing the current situation with data initiatives, it is hard not to conclude that hype is being rewarded. Many statisticians have come to the sad realization that by being cautious and skeptical, we may be losing out on funding possibilities and leadership roles. However, I remain very much upbeat about our discipline. First, being skeptical and cautious has actually led to many important contributions. An important example is how randomized controlled experiments changed how medical procedures are evaluated. A more recent one is the concept of the false discovery rate (FDR), which helps control false discoveries in, for example, high-throughput experiments. Second, many of us continue to work at the interface with real-world applications, placing us in a good position to make relevant contributions. Third, despite the failures alluded to above, we continue to successfully find ways to fund our work. Although resisting the hype has cost us in the short term, we will continue to produce methods that will be useful in the long term, as we have been doing for decades. Our methods will still be used when today's hyped-up press releases are long forgotten.




Thinking like a statistician: don't judge a society by its internet comments

In a previous post I explained how thinking like a statistician can help you avoid  feeling sad after using Facebook. The basic point was that missing not at random (MNAR) data on your friends' profiles (showing only the best parts of their life) can result in the biased view that your life is boring and uninspiring in comparison. A similar argument can be made to avoid  losing faith in humanity after reading internet comments or anonymous tweets, one of the most depressing activities that I have voluntarily engaged in.  If you want to see proof that racism, xenophobia, sexism and homophobia are still very much alive, read the unfiltered comments sections of articles related to race, immigration, gender or gay rights. However, as a statistician, I remain optimistic about our society after realizing how extremely biased these particular MNAR data can be.

Assume we could summarize an individual's "righteousness" with a numerical index. I realize this is a gross oversimplification, but bear with me. Below is my view on the distribution of this index across all members of our society.


Note that the distribution is not bimodal. This means there is no gap between good and evil, instead we have a continuum. Although there is variability, and we do have some extreme outliers on both sides of the distribution, most of us are much closer to the median than we like to believe. The offending internet commentators represent a very small proportion (the "bad" tail shown in red). But in a large population, such as internet users, this extremely small proportion can be quite numerous and gives us a biased view.

There is one more level of variability here that introduces biases. Since internet comments can be anonymous, we get an unprecedentedly large glimpse into people's opinions and thoughts. We assign a "righteousness" index to our thoughts and opinions and include it in the scatter plot shown in the figure above. Note that this index exhibits variability within individuals: even the best people have the occasional bad thought. The points in red represent thoughts so awful that no one, not even the worst people, would ever express them publicly. The red points give us an overly pessimistic estimate of the individuals that are posting these comments, which exacerbates our already pessimistic view due to a non-representative sample of individuals.
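The first of these biases, judging a population by its extreme tail, can be illustrated with a small simulation. This is only a sketch under made-up assumptions: the standard normal shape of the index, the population size, and the rule that only individuals below a cutoff of -3 post offensive comments are all hypothetical choices, not numbers from the post.

```python
import random
import statistics

random.seed(0)

# Hypothetical setup: a standard normal "righteousness" index per person.
population = [random.gauss(0, 1) for _ in range(200_000)]

# Assumption: only people in the extreme "bad" tail post offensive comments.
cutoff = -3.0
commenters = [x for x in population if x < cutoff]

# The tail is a tiny share of the population, yet still many individuals.
share = len(commenters) / len(population)
print(f"commenters: {len(commenters)} ({share:.2%} of the population)")

# Judging society by the commenters gives a badly biased summary.
print(f"population median: {statistics.median(population):.2f}")
print(f"commenter median:  {statistics.median(commenters):.2f}")
```

Under these assumptions the commenters are a tiny fraction of the population, but there are still hundreds of them, and a summary based only on them sits far below the population median.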

I hope that thinking like a statistician will help the media and social networks put in statistical perspective the awful tweets or internet comments that represent the worst of the worst. These actually provide little to no information on humanity's distribution of righteousness, which I think is moving consistently, albeit slowly, towards the good.




Bayes Rule in an animated gif

Say Pr(A)=5% is the prevalence of a disease (% of red dots in the top figure). Each individual is given a test with accuracy Pr(B|A)=Pr(no B|no A)=90%. The O in the middle turns into an X when the test fails. The rate of Xs is 1-Pr(B|A). We want to know the probability of having the disease if you tested positive: Pr(A|B). Many find it counterintuitive that this probability is much lower than 90%; this animated gif is meant to help.

The individual being tested is highlighted with a moving black circle. Pr(B) of these will test positive: we put these in the bottom left and the rest in the bottom right. The proportion of all points that are red and end up in the bottom left is the proportion of red points, Pr(A), times the proportion of those with a positive test, Pr(B|A), thus Pr(B|A) x Pr(A). Pr(A|B), or the proportion of reds in the bottom left, is therefore Pr(B|A) x Pr(A) divided by Pr(B): Pr(A|B) = Pr(B|A) x Pr(A) / Pr(B)
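Plugging the numbers from this post into the formula gives a quick check. The only step not spelled out above is Pr(B), which follows from the law of total probability: true positives plus false positives. The variable names are just illustrative.

```python
# Numbers from the post: 5% prevalence, 90% accuracy in both directions.
p_a = 0.05                 # Pr(A): prevalence of the disease
p_b_given_a = 0.90         # Pr(B|A): positive test given disease
p_notb_given_nota = 0.90   # Pr(no B|no A): negative test given no disease

# Pr(B): true positives plus false positives.
p_b = p_b_given_a * p_a + (1 - p_notb_given_nota) * (1 - p_a)

# Bayes rule: Pr(A|B) = Pr(B|A) x Pr(A) / Pr(B)
p_a_given_b = p_b_given_a * p_a / p_b

print(f"Pr(B)   = {p_b:.3f}")
print(f"Pr(A|B) = {p_a_given_b:.3f}")
```

With these inputs Pr(B) is 0.14 and Pr(A|B) is about 0.32, far below the 90% accuracy of the test, because the 95% of healthy individuals generate more false positives than the 5% of sick individuals generate true positives.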

ps - Is this a frequentist or Bayesian gif?


I declare the Bayesian vs. Frequentist debate over for data scientists

In a recent New York Times article the "Frequentists versus Bayesians" debate was brought up once again. I agree with Roger:

Because the real story (or non-story) is way too boring to sell newspapers, the author resorted to a sensationalist narrative that went something like this: "Evil and/or stupid frequentists were ready to let a fisherman die; the persecuted Bayesian heroes saved him." This piece adds to the growing number of writings blaming frequentist statistics for the so-called reproducibility crisis in science. If there is something Roger, Jeff and I agree on, it is that this debate is not constructive. As Rob Kass suggests, it's time to move on to pragmatism. Here I follow up Jeff's recent post by sharing related thoughts brought about by two decades of practicing applied statistics, and I hope it helps put this unhelpful debate to rest.

Applied statisticians help answer questions with data. How should I design a roulette so my casino makes $? Does this fertilizer increase crop yield? Does streptomycin cure pulmonary tuberculosis? Does smoking cause cancer? What movie would this user enjoy? Which baseball player should the Red Sox give a contract to? Should this patient receive chemotherapy? Our involvement typically means analyzing data and designing experiments. To do this we use a variety of techniques that have been successfully applied in the past and that we have mathematically shown to have desirable properties. Some of these tools are frequentist, some of them are Bayesian, some could be argued to be both, and some don't even use probability. The casino will do just fine with frequentist statistics, while the baseball team might want to apply a Bayesian approach to avoid overpaying for players that have simply been lucky.

It is also important to remember that good applied statisticians also *think*. They don't apply techniques blindly or religiously. If applied statisticians, regardless of their philosophical bent, are asked if the sun just exploded, they would not design an experiment as the one depicted in this popular XKCD cartoon.

Only someone that does not know how to think like a statistician would act like the frequentists in the cartoon. Unfortunately we do have such people analyzing data. But their choice of technique is not the problem, it's their lack of critical thinking. However, even the most frequentist-appearing applied statistician understands Bayes rule and will adopt the Bayesian approach when appropriate. In the above XKCD example, any respectable applied statistician would not even bother examining the data (the dice roll), because they would assign a probability of 0 to the sun exploding (the empirical prior based on the fact that they are alive). However, superficial propositions arguing for wider adoption of Bayesian methods fail to realize that using these techniques in an actual data analysis project is very different from simply thinking like a Bayesian. To do this we have to represent our intuition or prior knowledge (or whatever you want to call it) with mathematical formulae. When theoretical Bayesians pick these priors, they mainly have mathematical/computational considerations in mind. In practice we can't afford this luxury: a bad prior will render the analysis useless regardless of its convenient mathematical properties.

Despite these challenges, applied statisticians regularly use Bayesian techniques successfully. In one of the fields I work in, Genomics, empirical Bayes techniques are widely used. In this popular application of empirical Bayes we use data from all genes to improve the precision of estimates obtained for specific genes. However, the most widely used output of the software implementation is not a posterior probability. Instead, an empirical Bayes technique is used to improve the estimate of the standard error used in a good ol' fashioned t-test. This idea has changed the way thousands of Biologists search for differentially expressed genes and is, in my opinion, one of the most important contributions of Statistics to Genomics. Is this approach frequentist? Bayesian? To this applied statistician it doesn't really matter.
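The idea of borrowing strength across genes to stabilize a t-test can be sketched in a few lines. This is a toy illustration, not the software's actual implementation: the shrinkage formula is the commonly described weighted average of the gene-wise variance and a prior variance, and the prior values d0 and s0_sq are picked by hand here, whereas real software would estimate them from all genes. The data, function name and group sizes are all hypothetical.

```python
import math
import random
import statistics

random.seed(1)

def moderated_variance(s_sq, d, s0_sq, d0):
    """Shrink a gene-wise variance s_sq (d degrees of freedom) toward a
    prior variance s0_sq that is worth d0 degrees of freedom."""
    return (d0 * s0_sq + d * s_sq) / (d0 + d)

# Hypothetical data: 5 genes, 3 samples per group, no true difference.
n = 3
genes = [([random.gauss(0, 1) for _ in range(n)],
          [random.gauss(0, 1) for _ in range(n)]) for _ in range(5)]

# Assumed prior (hand-picked here; estimated from all genes in practice).
d0, s0_sq = 4.0, 1.0

for i, (x, y) in enumerate(genes):
    diff = statistics.mean(y) - statistics.mean(x)
    d = 2 * n - 2  # residual degrees of freedom for two groups of size n
    # Pooled within-group variance for this gene alone.
    s_sq = ((n - 1) * statistics.variance(x)
            + (n - 1) * statistics.variance(y)) / d
    s_tilde_sq = moderated_variance(s_sq, d, s0_sq, d0)
    scale = math.sqrt(2 / n)  # sqrt(1/n + 1/n)
    t_ordinary = diff / (math.sqrt(s_sq) * scale)
    t_moderated = diff / (math.sqrt(s_tilde_sq) * scale)
    print(f"gene {i}: s2={s_sq:.2f} shrunk={s_tilde_sq:.2f} "
          f"t={t_ordinary:.2f} moderated t={t_moderated:.2f}")
```

With only three samples per group, a gene can get a tiny variance estimate by chance and hence a huge ordinary t-statistic. Pulling each variance toward a common value dampens exactly those cases, while the statistic reported is still a t-statistic.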

For those arguing that simply switching to a Bayesian philosophy will improve the current state of affairs, let's consider the smoking and cancer example. Today there is wide agreement that smoking causes lung cancer. Without a clear deductive biochemical/physiological argument and without the possibility of a randomized trial, this connection was established with a series of observational studies. Most, if not all, of the associated data analyses were based on frequentist techniques. None of the reported confidence intervals on their own established the consensus. Instead, as usually happens in science, a long series of studies supporting this conclusion was needed. How exactly would this have been different with a strictly Bayesian approach? Would a single paper have been enough? Would using priors have helped given the "expert knowledge" at the time (see below)?

And how would the Bayesian analysis performed by tobacco companies shape the debate? Ultimately, I think applied statisticians would have made an equally convincing case against smoking with Bayesian posteriors as opposed to frequentist confidence intervals. Going forward I hope applied statisticians continue to be free to use whatever techniques they see fit and that critical thinking about data continues to be what distinguishes us. Imposing Bayesian or frequentist philosophy on us would be a disaster.


Applied Statisticians: people want to learn what we do. Let's teach them.

In this recent opinion piece, Hadley Wickham explains how data science goes beyond Statistics and that data science is not promoted in academia. He defines data science as follows:

I think there are three main steps in a data science project: you collect data (and questions), analyze it (using visualization and models), then communicate the results.

and makes the important point that

Any real data analysis involves data manipulation (sometimes called wrangling or munging), visualization and modelling.

The above describes what I have been doing since I became an academic applied statistician about 20 years ago. It describes what several of my colleagues do as well. For example, 15 years ago Karl Broman, in his excellent job talk, covered all the items in Hadley's definition. The arc of the talk revolved around the scientific problem and not the statistical models. He spent a considerable amount of time describing how the data was acquired and how he used perl scripts to clean up microsatellites data.  More than half his slides contained visualizations, either illustrative cartoons or data plots. This research eventually led to his widely used "data product" R/qtl. Although not described in the talk, Karl used make to help make the results reproducible.

So why then does Hadley think that "Statistics research focuses on data collection and modeling, and there is little work on developing good questions, thinking about the shape of data, communicating results or building data products"? I suspect one reason is that most applied work is published outside the flagship statistical journals. For example, Karl's work was published in the American Journal of Human Genetics. A second reason may be that most of us academic applied statisticians don't teach what we do. Despite writing a thesis that involved much data wrangling (reading music aiff files into Splus) and data visualization (including listening to fitted signals and residuals), the first few courses I taught as an assistant professor were almost solely on GLM theory.

About five years ago I tried changing the Methods course for our PhD students from one focusing on the math behind statistical methods to a problem and data-driven course. This was not very successful as many of our students were interested in the mathematical aspects of statistics and did not like the open-ended assignments. Jeff Leek built on that class by incorporating question development, much more vague problem statements, data wrangling, and peer grading. He also found it challenging to teach the more messy parts of applied statistics. It often requires exploration and failure which can be frustrating for new students.

This story has a happy ending though. Last year Jeff created a data science Coursera course that enrolled over 180,000 students with 6,000+ completing. This year I am subbing for Joe Blitzstein (talk about filling in big shoes) in CS109: the Data Science undergraduate class Hanspeter Pfister and Joe created last year at Harvard. We have over 300 students registered, making it one of the largest classes on campus. I am not teaching them GLM theory.

So if you are an experienced applied statistician in academia, consider developing a data science class that teaches students what you do.





Academic statisticians: there is no shame in developing statistical solutions that solve just one problem

I think that the main distinction between academic statisticians and those calling themselves data scientists is that the latter are very much willing to invest most of their time and energy into solving specific problems by analyzing specific data sets. In contrast, most academic statisticians strive to develop methods that can be very generally applied across problems and data types. There is a reason for this of course: historically statisticians have had enormous influence by developing general theory/methods/concepts such as the p-value, maximum likelihood estimation, and linear regression. However, these types of success stories are becoming more and more rare while data scientists are becoming increasingly influential in their respective areas of application by solving important context-specific problems. The success of Moneyball and the prediction of election results are two recent widely publicized examples.

A survey of papers published in our flagship journals makes it quite clear that context-agnostic methodology is valued much more than detailed descriptions of successful solutions to specific problems. These applied papers tend to get published in subject matter journals and do not usually receive the same weight in appointments and promotions. This culture has therefore kept most statisticians holding academic positions away from collaborations that require substantial time and energy investments in understanding and attacking the specifics of the problem at hand. Below I argue that to remain relevant as a discipline we need a cultural shift.

It is of course understandable that to remain a discipline academic statisticians can’t devote all our effort to solving specific problems and none to trying to generalize these solutions. It is the development of these abstractions that defines us as an academic discipline and not just a profession. However, if our involvement with real problems is too superficial, we run the risk of developing methods that solve no problem at all, which will eventually render us obsolete. We need to accept that as data and problems become more complex, more time will have to be devoted to understanding the gory details.

But what should the balance be?

Note that many of the giants of our discipline were very much interested in solving specific problems in genetics, agriculture, and the social sciences. In fact, many of today’s most widely-applied methods were originally inspired by insights gained by answering very specific scientific questions. I worry that the balance between application and theory has shifted too far away from applications. An unfortunate consequence is that our flagship journals, including our applied journals, are publishing too many methods seeking to solve many problems but actually solving none.  By shifting some of our efforts to solving specific problems we will get closer to the essence of modern problems and will actually inspire more successful generalizable methods.


The Big in Big Data relates to importance not size

In the past couple of years several non-statisticians have asked me "what is Big Data exactly?" or "How big is Big Data?". My answer has been: "I think Big Data is much more about 'data' than 'big'". I explain below.

[Figures: Google Trends screenshots referenced in the text below]

Since 2011 Big Data has been all over the news. The New York Times, The Economist, Science, Nature, etc. have told us that the Big Data Revolution is upon us (see Google Trends figure above). But was this really a revolution? What happened to the Massive Data Revolution (see figure above)? For this to be called a revolution, there must be some drastic change, a discontinuity, or a quantum leap of some kind. So has there been such a discontinuity in the rate of growth of data? Although this may be true for some fields (for example in genomics, next generation sequencing did introduce a discontinuity around 2007), overall, data size seems to have been growing at a steady rate for decades. For example, in the graph below (see this paper for source) note the trend in internet traffic data (which btw dwarfs genomics data). There does seem to be a change of rate, but it occurred during the 1990s, which brings me to my main point.

[Figure: internet data traffic]

Although several fields (including Statistics) are having to innovate to keep up with growing data size, I don't see this as something that new. But I do think that we are in the midst of a Big Data revolution. Although the media only noticed it recently, it started about 30 years ago. The discontinuity is not in the size of data, but in the percent of fields (across academia, industry and government) that use data. At some point in the 1980s, with the advent of cheap computers, data were moved from the file cabinet to the disk drive. Then in the 1990s, with the democratization of the internet, these data started to become easy to share. All of a sudden, people could use data to answer questions that were previously answered only by experts, theory or intuition.

In this blog we like to point out examples, but let me review a few. Credit card companies started using purchase data to detect fraud. Baseball teams started scraping data and evaluating players without ever seeing them. Financial companies started analyzing stock market data to develop investment strategies. Environmental scientists started to gather and analyze data from air pollution monitors. Molecular biologists started quantifying outcomes of interest into matrices of numbers (as opposed to looking at stains on nylon membranes) to discover new tumor types and develop diagnostic tools. Cities started using crime data to guide policing strategies. Netflix started using customer ratings to recommend movies. Retail stores started mining bonus card data to deliver targeted advertisements. Note that all the data sets mentioned were tiny in comparison to, for example, sky survey data collected by astronomers. But I still call this phenomenon Big Data because the percent of people using data was in fact Big.


I borrowed the title of this talk from a very nice presentation by Diego Kuonen


Confession: I sometimes enjoy reading the fake journal/conference spam

I've spent a considerable amount of time setting up filters to avoid getting spam from fake journals and conferences. Unfortunately, they are exceptionally good at thwarting my defenses. This does not annoy me as much as I pretend because, secretly, I enjoy reading some of these emails. Here are three of my favorites.

1) Over-the-top robot:

It gives us immense pleasure to invite you and your research allies to submit a manuscript for the journal “REDACTED”. The expertise of you in the never ending field of Gene Technology is highly appreciable. The level of intricacy shown by you in your work makes us even more proud, and we believe that your works should be known to mankind of science.

2) Sarcastic robot?

First of all, congratulations on the publication of your highly cited original article < The human colon cancer methylome shows similar hypo- and hypermethylation at conserved tissue-specific CpG island shores > in the field of colon cancer, which has been cited more than 1 times and is in the world's top one percent of papers. Such high number of citations reflects the high quality and influence of your paper.

3) Intimidating robot:

This is Rocky.... Recently we have mailed you about the details of the conference. But we still have not received your response. So today we contact you again.

NB: Although I am joking in this post, I do think these fake journals and conferences are a very serious problem. The fact that they are still around means enough money (mostly taxpayer money) is being spent to keep them in business. If you want to learn more, this blog does a good job of reporting on them and includes a list of culprits.


Correlation does not imply causation (parental involvement edition)

The New York Times recently published an article on education titled "Parental Involvement Is Overrated". Most research in this area supports the opposite view, but the authors claim that "evidence from our research suggests otherwise". Before you stop helping your children understand long division or correcting their grammar, you should learn about one of the most basic statistical concepts: correlation does not imply causation. The first two chapters of this very popular textbook describe the problem and even Khan Academy has a class on it. As several of the commenters on the NYT article point out, the authors fail to make this distinction.

To illustrate the problem, imagine you want to know how effective tutoring is for students in a math class you are teaching. So you compare the test scores of students that received tutoring to those that didn't. You find that receiving tutoring is correlated with lower test scores. So do you conclude that tutoring causes lower grades? Of course not! In this particular case we are confusing cause and effect: students that have trouble with math are much more likely to seek out tutoring and this is what drives the observed correlation. With that example in mind, consider this quote from the New York Times article:

When we examined whether regular help with homework had a positive impact on children’s academic performance, we were quite startled by what we found. Regardless of a family’s social class, racial or ethnic background, or a child’s grade level, consistent homework help almost never improved test scores or grades.... Even more surprising to us was that when parents regularly helped with homework, kids usually performed worse.

A first question we would ask here is: how do we know that the children's performance would not have been even worse had they not received help? I imagine the authors made use of controls: we compare the group that received the treatment (regular help with homework) to a control group that did not. But this brings up a more difficult question: how do we know that the treatment and control groups are comparable?

In a randomized controlled experiment, we would take a group of kids and randomly assign each one to the treatment group (will be helped with their homework) or control group (no help with homework). By doing this we can use probability calculations to determine the range of differences we expect to see by chance when the treatment has no effect.  Note that by chance one group may end up with a few more "better testers" than the other. However, if we see a big enough difference that can't be explained by chance, then the alternative that the treatment is responsible for the observed differences becomes more believable.

Given all the prior research (and common sense) suggesting that parent involvement, in its many manifestations, is in fact helpful to students, many would consider it unethical to run a randomized controlled trial on this issue (you would knowingly hurt the control group). Therefore, the authors are left with no choice but to use an observational study to reach their conclusions. In this case, we have no control over who receives help and who doesn't. Kids that require regular help with their homework are different in many ways from kids that don't, even after correcting for all the factors mentioned. For example, one can envision how kids that have a mediocre teacher or have trouble with tests are more likely to be in the treatment group, while kids who naturally test well or go to schools that offer in-school tutoring are more likely to be in the control group.
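The difference between the observational and the randomized comparison can be made concrete with a toy simulation of the tutoring example. Everything here is a made-up assumption for illustration: the score model, the true benefit of 5 points, the ability cutoff that decides who seeks help, and the function names. It says nothing about the actual data behind the New York Times article.

```python
import random
import statistics

random.seed(2)

TRUE_EFFECT = 5.0  # assumed benefit of tutoring, in test-score points

def score(ability, tutored):
    # Test score = ability + tutoring benefit (if any) + noise.
    return ability + (TRUE_EFFECT if tutored else 0.0) + random.gauss(0, 5)

def mean_difference(assign):
    """Mean score of tutored minus untutored students, where assign(ability)
    decides whether a student with that ability gets tutoring."""
    tutored, untutored = [], []
    for _ in range(20_000):
        ability = random.gauss(70, 10)
        got_help = assign(ability)
        (tutored if got_help else untutored).append(score(ability, got_help))
    return statistics.mean(tutored) - statistics.mean(untutored)

# Observational: struggling students are the ones who seek out tutoring.
observational = mean_difference(lambda ability: ability < 65)

# Randomized: a coin flip decides who is tutored, regardless of ability.
randomized = mean_difference(lambda ability: random.random() < 0.5)

print(f"observational difference: {observational:+.1f}")
print(f"randomized difference:    {randomized:+.1f}")
```

In this sketch tutoring helps every student by construction, yet the observational comparison shows tutored students scoring lower, because the two groups differ in ability before any tutoring happens. Random assignment removes that difference and the comparison lands close to the assumed benefit.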

I am not an expert on education, but as a statistician I am skeptical of the conclusions of this data-driven article.  In fact, I would  recommend parents actually do get involved early on by, for example, teaching children that correlation does not imply causation.

Note that I am not saying that observational studies are uninformative. If properly analyzed, observational data can be very valuable. For example, the data supporting smoking as a cause of lung cancer is all observational. Furthermore, there is an entire subfield within statistics (referred to as causal inference) that develops methodologies to deal with observational data. But unfortunately, observational data are commonly misinterpreted.


Writing good software can have more impact than publishing in high impact journals for genomic statisticians

Every once in a while we see computational papers published in science journals with high impact factors.  Genomics related methods appear quite often in these journals. Several of my junior colleagues express frustration that all their papers get rejected from these journals. I tell them that the same is true for most of my papers and remind them of these examples:

Method                  Journal          Year   #Citations
PLINK                   AJHG             2007   6481
Bioconductor            Genome Biology   2004   5973
RMA                     Biostatistics    2003   5674
limma                   SAGMB            2004   5637
quantile normalization  Bioinformatics   2003   4646
Bowtie                  Genome Biology   2009   3849
BWA                     Bioinformatics   2009   3327
Loess normalization     NAR              2002   3313
qvalues                 JRSS-B           2002   2758
tophat                  Bioinformatics   2008   1868
vsn                     Bioinformatics   2002   1398
GCRMA                   JASA             2004   1397
MACS                    Genome Biology   2008   1277
deseq                   Genome Biology   2010   1264
CBS                     Biostatistics    2004   1051
R/qtl                   Bioinformatics   2003   1027

Let me know of other examples in the comments.
update: I added one more to the list.