Simply Statistics


HarvardX Biomedical Data Science Open Online Training Curriculum launches on January 19

We recently received funding from the NIH BD2K initiative to develop MOOCs for biomedical data science. Our first offering will be version 2 of my Data Analysis for Genomics course, which launches on January 19. In this version, the course has been turned into an eight-course series, and you can earn a certificate in each one. The motivation is to go more in-depth into the different topics and to provide different entry points for students with different levels of expertise. We provide four courses on concepts and skills and four case-study-based courses. We basically broke the original class into the following eight parts:

    1. Statistics and R for the Life Sciences
    2. Introduction to Linear Models and Matrix Algebra
    3. Advanced Statistics for the Life Sciences
    4. Introduction to Bioconductor
    5. Case study: RNA-seq data analysis
    6. Case study: Variant Discovery and Genotyping
    7. Case study: ChIP-seq data analysis
    8. Case study: DNA methylation data analysis

You can follow the links to enroll. While not required, some familiarity with R and RStudio will serve you well, so consider taking Roger's R course and Jeff's Toolbox course before delving into this class.

In years 2 and 3 we plan to introduce several other courses covering topics such as Python for data analysis, probability, software engineering, and data visualization, taught by a collaboration between the departments of Biostatistics, Statistics, and Computer Science at Harvard.

Announcements will be made here and on Twitter: @rafalab



Data Science Students Predict the Midterm Election Results

As explained in an earlier post, one of the homework assignments in my CS109 class was to predict the results of the midterm election. We created a competition that 49 students entered. The most interesting challenge was to provide intervals for the Republican-Democrat difference in each of the 35 Senate races. Anybody missing more than two races was eliminated; the average size of the intervals was the tiebreaker.

The main teaching objective was to get students thinking about how to evaluate prediction strategies when chance is involved. To a naive observer, a biased strategy that favored Democrats and correctly called, say, Virginia may look good in comparison to strategies that called it a toss-up. However, a look at the other 34 states would reveal the weakness of this biased strategy. I wanted students to think of procedures that can help distinguish lucky guesses from strategies that perform well across the board.

One of the concepts we discussed in class was the systematic bias of polls, which we modeled as a random effect: one can't infer this bias from the polls themselves until after the election. By studying previous elections, students were able to estimate the SE of this random effect and incorporate it into the calculation of their intervals. The realization of this random effect was very large in these elections (about +4 for the Democrats), which clearly showed the importance of modeling this source of variability. Strategies that restricted standard error estimates to this year's polls did very poorly. The 90% credible intervals provided by 538, which I believe does incorporate this source, missed 8 of the 35 races (23%), suggesting that they underestimated the variance. Several of our students compared favorably to 538:

name                    avg bias   MSE   avg interval size   # missed
Manuel Andere             -3.9     6.9         24.1              3
Richard Lopez             -5.0     7.4         26.9              3
Daniel Sokol              -4.5     6.4         24.1              4
Isabella Chiu             -5.3     9.6         26.9              6
Denver Mosigisi Ogaro     -3.2     6.6         18.9              7
Yu Jiang                  -5.6     9.6         22.6              7
David Dowey               -3.5     6.2         16.3              8
Nate Silver               -4.2     6.6         16.4              8
Filip Piasevoli           -3.5     7.4         22.1              8
Yapeng Lu                 -6.5     8.2         16.5             10
David Jacob Lieb          -3.7     7.2         17.1             10
Vincent Nguyen            -3.8     5.9         11.1             14

It is important to note that 538 would probably have increased their interval sizes had they been actively participating in a competition requiring 95% of the intervals to cover. But all in all, the students did very well. The majority correctly predicted the Republican takeover. The median mean squared error across all 49 participants was 8.2, not much worse than 538's 6.6. Other examples of strategies that I think helped some of these students perform well were the use of creative weighting schemes (based on previous elections) to average polls and the use of splines to estimate trends, which in this particular election were moving in the Republicans' favor.
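To make the interval construction concrete, here is a minimal R sketch for a single race. Every number in it is invented, and the bias SE is an assumed stand-in for an estimate from previous elections, not anything a student actually used:

```r
# A minimal sketch (all numbers made up) of the interval construction
# described above: the poll sampling SE is inflated by the SE of the
# systematic poll bias, modeled as a random effect.
polls <- c(2.1, -0.5, 1.3, 0.8)   # poll spreads (R - D) for one race, in percent
poll_avg    <- mean(polls)
sampling_se <- sd(polls) / sqrt(length(polls))
bias_se     <- 2.5                # assumed SE of the bias random effect,
                                  # standing in for an estimate from past elections
total_se    <- sqrt(sampling_se^2 + bias_se^2)
poll_avg + c(-1, 1) * qnorm(0.975) * total_se   # 95% interval for the race
```

Intervals built only from `sampling_se` would be far too narrow in a year when the shared bias turns out to be large.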

Here are some plots showing results from two of our top performers:

[Figures: Rplot, Rplot01 - results from two of the top performers]

I hope this exercise helped students realize that data science can be both fun and useful. I can't wait to do this again in 2016.





538 election forecasts made simple

Nate Silver does a great job of explaining his forecast model to laypeople. However, as a statistician I've always wanted to know more details. After preparing a "predict the midterm elections" homework for my data science class, I have a better idea of what is going on.

Here is my best attempt at explaining the ideas of 538 using formulas and data. And here is the R markdown.
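As a taste of the key idea, namely that all races share one realization of the systematic poll bias, here is a hedged R sketch. It is not 538's actual model, and every poll average and SE in it is made up:

```r
# Rough sketch of a 538-style simulation (not their actual model): simulate
# many elections in which every race shares one draw of the systematic bias.
set.seed(1)
poll_avg <- rnorm(35, mean = 1, sd = 4)   # invented R - D poll spreads, 35 races
race_se  <- 2                             # race-specific sampling error (assumed)
bias_se  <- 3                             # SE of the shared systematic bias (assumed)
B <- 10000
rep_wins <- replicate(B, {
  bias <- rnorm(1, 0, bias_se)            # one draw, shared across all races
  sum(poll_avg + bias + rnorm(35, 0, race_se) > 0)
})
mean(rep_wins >= 18)   # probability Republicans win at least half the 35 races
```

Because the bias draw is shared, race outcomes are correlated, which makes simultaneous surprises across many races far more likely than independent errors would suggest.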





Why I support statisticians and their resistance to hype

Despite Statistics being the most mature data-related discipline, statisticians have not fared well in terms of being selected for funding or leadership positions in the new initiatives brought about by the increasing interest in data. To give one example (Jeff and Terry Speed give many more), the White House Big Data Partners Workshop had 19 members, of which 0 were statisticians. The statistical community is clearly worried about this predicament, and there is widespread consensus that we need to be better at marketing. Although I agree that only good can come from better communicating what we do, it is also important to continue doing one of the things we do best: resisting the hype and being realistic about data.

This week, after reading Mike Jordan's reddit Ask Me Anything, I was reminded of exactly how much I admire this quality in statisticians. From the interview one learns about instances where hype has led to confusion, and how getting past this confusion helps us better understand, and consequently appreciate, the importance of his field. For the past 30 years, Mike Jordan has been one of the most prolific academics working in the areas that today are receiving increased attention. Yet you won't find a hyped-up press release coming out of his lab. In fact, when a journalist tried to hype up Jordan's critique of hype, Jordan called out the author.

Assessing the current situation with data initiatives, it is hard not to conclude that hype is being rewarded. Many statisticians have come to the sad realization that by being cautious and skeptical, we may be losing out on funding opportunities and leadership roles. However, I remain very much upbeat about our discipline. First, being skeptical and cautious has actually led to many important contributions. An important example is how randomized controlled experiments changed how medical procedures are evaluated. A more recent one is the concept of FDR, which helps control false discoveries in, for example, high-throughput experiments. Second, many of us continue to work at the interface with real-world applications, placing us in a good position to make relevant contributions. Third, despite the failures alluded to above, we continue to successfully find ways to fund our work. Although resisting the hype has cost us in the short term, we will continue to produce methods that will be useful in the long term, as we have been doing for decades. Our methods will still be used when today's hyped-up press releases are long forgotten.




Thinking like a statistician: don't judge a society by its internet comments

In a previous post I explained how thinking like a statistician can help you avoid feeling sad after using Facebook. The basic point was that missing-not-at-random (MNAR) data on your friends' profiles (they show only the best parts of their lives) can give you the biased view that your life is boring and uninspiring in comparison. A similar argument can help you avoid losing faith in humanity after reading internet comments or anonymous tweets, one of the most depressing activities I have voluntarily engaged in. If you want proof that racism, xenophobia, sexism, and homophobia are still very much alive, read the unfiltered comment sections of articles related to race, immigration, gender, or gay rights. However, as a statistician, I remain optimistic about our society after realizing how extremely biased these particular MNAR data can be.

Assume we could summarize an individual's "righteousness" with a numerical index. I realize this is a gross oversimplification, but bear with me. Below is my view on the distribution of this index across all members of our society.

[Figure: distribution of the righteousness index across society]

Note that the distribution is not bimodal. This means there is no gap between good and evil; instead, we have a continuum. Although there is variability, and we do have some extreme outliers on both sides of the distribution, most of us are much closer to the median than we like to believe. The offending internet commentators represent a very small proportion (the "bad" tail shown in red). But in a population as large as the internet's, even this extremely small proportion can be quite numerous, and it gives us a biased view.

There is one more level of variability here that introduces bias. Since internet comments can be anonymous, we get an unprecedentedly large glimpse into people's opinions and thoughts. We can assign a "righteousness" index to our thoughts and opinions and include it in the figure above. Note that this index exhibits variability within individuals: even the best people have the occasional bad thought. The points in red represent thoughts so awful that no one, not even the worst people, would ever express them publicly if not for the anonymity. The red points give us an overly pessimistic estimate of the individuals posting these comments, which exacerbates our already pessimistic view based on a non-representative sample of individuals.
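A toy simulation makes the argument concrete. All the parameters below are invented; the only thing that matters is the missing-not-at-random selection mechanism:

```r
# Toy simulation of the selection bias described above; every parameter is
# made up for illustration, only the MNAR selection mechanism matters.
set.seed(1)
n <- 1e6
person <- rnorm(n)                          # each person's righteousness index
worst  <- person - abs(rnorm(n, mean = 3))  # their occasional worst thought,
                                            # well below their own index
posted <- worst < -7                        # only the truly awful thoughts get
                                            # posted anonymously (the red points)
mean(posted)           # a tiny fraction of the population...
sum(posted)            # ...still produces thousands of comments
mean(person[posted])   # these commenters look far worse than the
mean(person)           # population average, which is near zero
```

Even though well under 1% of simulated people cross the threshold, they generate thousands of comments, and both their indices and their posted thoughts sit far below the population median.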

I hope that thinking like a statistician will help the media and social networks put into statistical perspective the awful tweets and internet comments that represent the worst of the worst. These actually provide little to no information about humanity's distribution of righteousness, which I think is moving consistently, albeit slowly, toward the good.




Bayes Rule in an animated gif

Say Pr(A) = 5% is the prevalence of a disease (the percent of red dots in the top figure). Each individual is given a test with accuracy Pr(B|A) = Pr(no B|no A) = 90%. The O in the middle turns into an X when the test fails; the rate of Xs is 1 - Pr(B|A). We want to know the probability of having the disease given a positive test: Pr(A|B). Many find it counterintuitive that this probability is much lower than 90%; this animated gif is meant to help.

The individual being tested is highlighted with a moving black circle. A proportion Pr(B) of individuals test positive: we put these in the bottom left and the rest in the bottom right. The proportion of red points that end up in the bottom left is the proportion of reds Pr(A) with a positive test Pr(B|A), thus Pr(B|A) x Pr(A). Pr(A|B), the proportion of reds in the bottom left, is therefore Pr(B|A) x Pr(A) divided by Pr(B): Pr(A|B) = Pr(B|A) x Pr(A) / Pr(B)
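For those who prefer code to moving dots, the arithmetic behind the gif amounts to a few lines of R:

```r
# The arithmetic behind the gif, using the numbers above.
prA      <- 0.05    # prevalence Pr(A)
accuracy <- 0.90    # Pr(B|A) = Pr(no B|no A)
prB <- accuracy * prA + (1 - accuracy) * (1 - prA)  # Pr(B): all positives
accuracy * prA / prB                                # Pr(A|B): about 0.32, not 0.90
```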

ps - Is this a frequentist or Bayesian gif?


I declare the Bayesian vs. Frequentist debate over for data scientists

In a recent New York Times article the "Frequentists versus Bayesians" debate was brought up once again. I agree with Roger:

Because the real story (or non-story) is way too boring to sell newspapers, the author resorted to a sensationalist narrative that went something like this: "Evil and/or stupid frequentists were ready to let a fisherman die; the persecuted Bayesian heroes saved him." This piece adds to the growing number of writings blaming frequentist statistics for the so-called reproducibility crisis in science. If there is something Roger, Jeff, and I agree on, it is that this debate is not constructive. As Rob Kass suggests, it's time to move on to pragmatism. Here I follow up Jeff's recent post by sharing related thoughts brought about by two decades of practicing applied statistics, in the hope of putting this unhelpful debate to rest.

Applied statisticians help answer questions with data. How should I design a roulette wheel so my casino makes money? Does this fertilizer increase crop yield? Does streptomycin cure pulmonary tuberculosis? Does smoking cause cancer? What movie would this user enjoy? Which baseball player should the Red Sox give a contract to? Should this patient receive chemotherapy? Our involvement typically means analyzing data and designing experiments. To do this, we use a variety of techniques that have been successfully applied in the past and that we have mathematically shown to have desirable properties. Some of these tools are frequentist, some are Bayesian, some could be argued to be both, and some don't even use probability. The casino will do just fine with frequentist statistics, while the baseball team might want to apply a Bayesian approach to avoid overpaying for players who have simply been lucky.

It is also important to remember that good applied statisticians also *think*. They don't apply techniques blindly or religiously. If applied statisticians, regardless of their philosophical bent, were asked whether the sun had just exploded, they would not design an experiment like the one depicted in this popular XKCD cartoon.

Only someone who does not know how to think like a statistician would act like the frequentists in the cartoon. Unfortunately, we do have such people analyzing data. But their choice of technique is not the problem; it's their lack of critical thinking. Moreover, even the most frequentist-appearing applied statistician understands Bayes rule and will adopt a Bayesian approach when appropriate. In the XKCD example, any self-respecting applied statistician would not even bother examining the data (the dice roll), because they would assign a probability of essentially 0 to the sun exploding (an empirical prior based on the fact that they are alive). However, superficial propositions arguing for wider adoption of Bayesian methods fail to realize that using these techniques in an actual data analysis project is very different from simply thinking like a Bayesian. To do so we have to represent our intuition or prior knowledge (or whatever you want to call it) with mathematical formulae. When theoretical Bayesians pick these priors, they mainly have mathematical/computational considerations in mind. In practice we can't afford this luxury: a bad prior will render the analysis useless regardless of its convenient mathematical properties.
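Note that the reasoning is robust: you don't even need a prior of exactly 0, since any remotely sensible prior gives the same answer. A quick sketch with a made-up, nearly zero prior:

```r
# Posterior that the sun exploded given the cartoon's detector said "yes".
# In the cartoon, the detector lies only if both dice come up six (prob 1/36).
prior <- 1e-12                   # hypothetical, nearly zero prior
p_yes_given_exploded <- 35/36    # detector tells the truth
p_yes_given_fine     <- 1/36     # detector lies
p_yes_given_exploded * prior /
  (p_yes_given_exploded * prior + p_yes_given_fine * (1 - prior))
# about 3.5e-11: the posterior is still essentially zero
```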

Despite these challenges, applied statisticians regularly use Bayesian techniques successfully. In one of the fields I work in, genomics, empirical Bayes techniques are widely used. In one popular application of empirical Bayes, we use data from all genes to improve the precision of the estimates obtained for each specific gene. However, the most widely used output of the software implementation is not a posterior probability. Instead, an empirical Bayes technique is used to improve the estimate of the standard error used in a good ol' fashioned t-test. This idea has changed the way thousands of biologists search for differentially expressed genes and is, in my opinion, one of the most important contributions of Statistics to Genomics. Is this approach frequentist? Bayesian? To this applied statistician, it doesn't really matter.
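The post doesn't name the software, but the Bioconductor package limma implements exactly this kind of moderated t-test; here is a minimal sketch on simulated data (the package choice is my assumption, and the data are fake):

```r
# Sketch of a moderated t-test analysis with the Bioconductor package limma,
# which I assume is the implementation alluded to above, on simulated data.
library(limma)
set.seed(1)
y <- matrix(rnorm(1000 * 6), nrow = 1000)   # 1000 genes, 6 samples
y[1:50, 4:6] <- y[1:50, 4:6] + 1            # 50 truly differential genes
group  <- factor(rep(c("control", "treated"), each = 3))
design <- model.matrix(~ group)
fit <- eBayes(lmFit(y, design))   # empirical Bayes shrinkage of each gene's
                                  # variance estimate toward a common prior
topTable(fit, coef = 2)           # moderated t-statistics, not posteriors
```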

For those arguing that simply switching to a Bayesian philosophy will improve the current state of affairs, let's consider the smoking and cancer example. Today there is wide agreement that smoking causes lung cancer. Without a clear deductive biochemical/physiological argument and without the possibility of a randomized trial, this connection was established through a series of observational studies. Most, if not all, of the associated data analyses were based on frequentist techniques. None of the reported confidence intervals established the consensus on its own. Instead, as usually happens in science, a long series of studies supporting this conclusion was needed. How exactly would this have been different with a strictly Bayesian approach? Would a single paper have been enough? Would using priors have helped, given the "expert knowledge" at the time (see below)?

And how would the Bayesian analyses performed by tobacco companies have shaped the debate? Ultimately, I think applied statisticians would have made an equally convincing case against smoking with Bayesian posteriors as with frequentist confidence intervals. Going forward, I hope applied statisticians remain free to use whatever techniques they see fit, and that critical thinking about data continues to be what distinguishes us. Imposing a Bayesian or frequentist philosophy on us would be a disaster.


Applied Statisticians: people want to learn what we do. Let's teach them.

In this recent opinion piece, Hadley Wickham explains how data science goes beyond statistics and argues that data science is not promoted in academia. He defines data science as follows:

I think there are three main steps in a data science project: you collect data (and questions), analyze it (using visualization and models), then communicate the results.

and makes the important point that

Any real data analysis involves data manipulation (sometimes called wrangling or munging), visualization and modelling.

The above describes what I have been doing since I became an academic applied statistician about 20 years ago. It describes what several of my colleagues do as well. For example, 15 years ago Karl Broman, in his excellent job talk, covered all the items in Hadley's definition. The arc of the talk revolved around the scientific problem, not the statistical models. He spent a considerable amount of time describing how the data were acquired and how he used Perl scripts to clean up microsatellite data. More than half his slides contained visualizations, either illustrative cartoons or data plots. This research eventually led to his widely used "data product," R/qtl. Although not described in the talk, Karl used make to help make the results reproducible.

So why then does Hadley think that "Statistics research focuses on data collection and modeling, and there is little work on developing good questions, thinking about the shape of data, communicating results or building data products"? I suspect one reason is that most applied work is published outside the flagship statistical journals. For example, Karl's work was published in the American Journal of Human Genetics. A second reason may be that most of us academic applied statisticians don't teach what we do. Despite writing a thesis that involved much data wrangling (reading music AIFF files into S-Plus) and data visualization (including listening to fitted signals and residuals), the first few courses I taught as an assistant professor were almost solely on GLM theory.

About five years ago I tried changing the Methods course for our PhD students from one focusing on the math behind statistical methods to a problem- and data-driven course. This was not very successful, as many of our students were interested in the mathematical aspects of statistics and did not like the open-ended assignments. Jeff Leek built on that class by incorporating question development, much vaguer problem statements, data wrangling, and peer grading. He also found it challenging to teach the messier parts of applied statistics: they often require exploration and failure, which can be frustrating for new students.

This story has a happy ending, though. Last year Jeff created a data science Coursera course that enrolled over 180,000 students, with 6,000+ completing it. This year I am subbing for Joe Blitzstein (talk about filling big shoes) in CS109, the data science undergraduate class Hanspeter Pfister and Joe created last year at Harvard. We have over 300 students registered, making it one of the largest classes on campus. I am not teaching them GLM theory.

So if you are an experienced applied statistician in academia, consider developing a data science class that teaches students what you do.





Academic statisticians: there is no shame in developing statistical solutions that solve just one problem

I think that the main distinction between academic statisticians and those calling themselves data scientists is that the latter are very much willing to invest most of their time and energy in solving specific problems by analyzing specific data sets. In contrast, most academic statisticians strive to develop methods that can be applied very generally across problems and data types. There is a reason for this, of course: historically, statisticians have had enormous influence by developing general theory, methods, and concepts such as the p-value, maximum likelihood estimation, and linear regression. However, these types of success stories are becoming rarer, while data scientists are becoming increasingly influential in their respective areas of application by solving important context-specific problems. The success of Moneyball and the prediction of election results are two recent widely publicized examples.

A survey of papers published in our flagship journals makes it quite clear that context-agnostic methodology is valued much more than detailed descriptions of successful solutions to specific problems. Applied papers tend to get published in subject-matter journals and do not usually receive the same weight in appointments and promotions. This culture has therefore kept most statisticians holding academic positions away from collaborations that require substantial investments of time and energy in understanding and attacking the specifics of the problem at hand. Below I argue that to remain relevant as a discipline we need a cultural shift.

It is of course understandable that, to remain a discipline, we academic statisticians can't devote all our efforts to solving specific problems and none to generalizing these solutions. It is the development of these abstractions that defines us as an academic discipline and not just a profession. However, if our involvement with real problems is too superficial, we run the risk of developing methods that solve no problem at all, which will eventually render us obsolete. We need to accept that as data and problems become more complex, more time will have to be devoted to understanding the gory details.

But what should the balance be?

Note that many of the giants of our discipline were very much interested in solving specific problems in genetics, agriculture, and the social sciences. In fact, many of today's most widely applied methods were originally inspired by insights gained in answering very specific scientific questions. I worry that the balance between applications and theory has shifted too far away from applications. An unfortunate consequence is that our flagship journals, including our applied journals, are publishing too many methods seeking to solve many problems but actually solving none. By shifting some of our efforts to solving specific problems, we will get closer to the essence of modern problems and will actually inspire more successful generalizable methods.


The Big in Big Data relates to importance not size

In the past couple of years several non-statisticians have asked me "What is Big Data exactly?" or "How big is Big Data?" My answer has been that Big Data is much more about the "data" than the "big". I explain below.

[Figures: Google Trends screenshots for "Big Data" and "Massive Data"]

Since 2011, Big Data has been all over the news. The New York Times, The Economist, Science, Nature, etc. have told us that the Big Data Revolution is upon us (see the Google Trends figure above). But was this really a revolution? What happened to the Massive Data Revolution (see the other figure above)? For something to be called a revolution, there must be a drastic change, a discontinuity, or a quantum leap of some kind. So has there been such a discontinuity in the rate of growth of data? Although this may be true for some fields (in genomics, for example, next-generation sequencing did introduce a discontinuity around 2007), overall, data size seems to have been growing at a steady rate for decades. For example, in the graph below (see this paper for the source), note the trend in internet traffic data (which, by the way, dwarfs genomics data). There does seem to be a change of rate, but it occurred during the 1990s, which brings me to my main point.

[Figure: internet data traffic over time]

Although several fields (including Statistics) are having to innovate to keep up with growing data sizes, I don't see this as something new. But I do think that we are in the midst of a Big Data revolution. Although the media only noticed it recently, it started about 30 years ago. The discontinuity is not in the size of data, but in the percent of fields (across academia, industry, and government) that use data. At some point in the 1980s, with the advent of cheap computers, data moved from the file cabinet to the disk drive. Then in the 1990s, with the democratization of the internet, these data became easy to share. All of a sudden, people could use data to answer questions that were previously answered only by experts, theory, or intuition.

In this blog we like to point out examples, so let me review a few. Credit card companies started using purchase data to detect fraud. Baseball teams started scraping data and evaluating players without ever seeing them. Financial companies started analyzing stock market data to develop investment strategies. Environmental scientists started gathering and analyzing data from air pollution monitors. Molecular biologists started quantifying outcomes of interest into matrices of numbers (as opposed to looking at stains on nylon membranes) to discover new tumor types and develop diagnostic tools. Cities started using crime data to guide policing strategies. Netflix started using customer ratings to recommend movies. Retail stores started mining bonus card data to deliver targeted advertisements. Note that all the data sets mentioned were tiny in comparison to, for example, the sky survey data collected by astronomers. But I still call this phenomenon Big Data, because the percent of people using data was in fact Big.


I borrowed the title of this post from a very nice presentation by Diego Kuonen.