Category: Uncategorized


Dear Laboratory Scientists: Welcome to My World

Consider the following question: Is there a reproducibility/replication crisis in epidemiology?

I think there are only two possible ways to answer that question:

  1. No, there is no replication crisis in epidemiology because no one ever believes the result of an epidemiological study unless it has been replicated a minimum of 1,000 times in every possible population.
  2. Yes, there is a replication crisis in epidemiology, and it started in 1854 when John Snow inferred, from observational data, that cholera was spread via contaminated water obtained from public pumps.

If you chose (2), then I don't think you are allowed to call it a "crisis" because I think by definition, a crisis cannot last 160 years. In that case, it's more of a chronic disease.

I had an interesting conversation last week with a prominent environmental epidemiologist over the replication crisis that has been reported about extensively in the scientific and popular press. In his view, he felt this was less of an issue in epidemiology because epidemiologists never really had the luxury of people (or at least fellow scientists) believing their results because of their general inability to conduct controlled experiments.

Given the observational nature of most environmental epidemiological studies, it's generally accepted in the community that no single study can be considered causal, and that many replications of a finding are need to establish a causal connection. Even the popular press knows now to include the phrase "correlation does not equal causation" when reporting on an observational study. The work of Sir Austin Bradford Hill essentially codifies the standard of evidence needed to draw causal conclusions from observational studies.

So if "correlation does not equal causation", it begs the question, what does equal causation? Many would argue that a controlled experiment, whether it's a randomized trial or a laboratory experiment, equals causation. But people who work in this area have long known that while controlled experiments do assign the treatment or exposure, there are still many other elements of the experiment that are not controlled.

For example, if subjects drop out of a randomized trial, you now essentially have an observational study (or at least a "broken" randomized trial). If you are conducting a laboratory experiment and all of the treatment samples are measured with one technology and all of the control samples are measured with a different technology (perhaps because of a lack of blinding), then you still have confounding.

The correct statement is not "correlation does not equal causation" but rather "no single study equals causation", regardless of whether it was an observational study or a controlled experiment. Of course, a very tightly controlled and rigorously conducted controlled experiment will be more valuable than a similarly conducted observational study. But in general, all studies should simply be considered as further evidence for or against an hypothesis. We should not be lulled into thinking that any single study about an important question can truly be definitive.


I declare the Bayesian vs. Frequentist debate over for data scientists

In a recent New York Times article the "Frequentists versus Bayesians" debate was brought up once again. I agree with Roger:

Because the real story (or non-story) is way too boring to sell newspapers, the author resorted to a sensationalist narrative that went something like this:  "Evil and/or stupid frequentists were ready to let a fisherman die; the persecuted Bayesian heroes saved him." This piece adds to the growing number of writings blaming frequentist statistics for the so-called reproducibility crisis in science. If there is something Roger, Jeff and I agree on is that this debate is not constructive. As Rob Kass suggests it's time to move on to pragmatism. Here I follow up Jeff's recent post by sharing related thoughts brought about by two decades of practicing applied statistics and hope it helps put this unhelpful debate to rest.

Applied statisticians help answer questions with data. How should I design a roulette so my casino makes $? Does this fertilizer increase crop yield? Does streptomycin cure pulmonary tuberculosis? Does smoking cause cancer? What movie would would this user enjoy? Which baseball player should the Red Sox give a contract to? Should this patient receive chemotherapy? Our involvement typically means analyzing data and designing experiments. To do this we use a variety of techniques that have been successfully applied in the past and that we have mathematically shown to have desirable properties. Some of these tools are frequentist, some of them are Bayesian, some could be argued to be both, and some don't even use probability. The Casino will do just fine with frequentist statistics, while the baseball team might want to apply a Bayesian approach to avoid overpaying for players that have simply been lucky.

It is also important to remember that good applied statisticians also *think*. They don't apply techniques blindly or religiously. If applied statisticians, regardless of their philosophical bent, are asked if the sun just exploded, they would not design an experiment as the one depicted in this popular XKCD cartoon.

Only someone that does not know how to think like a statistician would act like the frequentists in the cartoon. Unfortunately we do have such people analyzing data. But their choice of technique is not the problem, it's their lack of critical thinking. However, even the most frequentist-appearing applied statistician understands Bayes rule and will adapt the Bayesian approach when appropriate. In the above XCKD example, any respectful applied statistician would not even bother examining the data (the dice roll), because they would assign a probability of 0 to the sun exploding (the empirical prior based on the fact that they are alive). However, superficial propositions arguing for wider adoption of Bayesian methods fail to realize that using these techniques in an actual data analysis project is very different from simply thinking like a Bayesian. To do this we have to represent our intuition or prior knowledge (or whatever you want to call it) with mathematical formulae. When theoretical Bayesians pick these priors, they mainly have mathematical/computational considerations in mind. In practice we can't afford this luxury: a bad prior will render the analysis useless regardless of its convenient mathematically properties.

Despite these challenges, applied statisticians regularly use Bayesian techniques successfully. In one of the fields I work in, Genomics, empirical Bayes techniques are widely used. In this popular application of empirical Bayes we use data from all genes to improve the precision of estimates obtained for specific genes. However, the most widely used output of the software implementation is not a posterior probability. Instead, an empirical Bayes technique is used to improve the estimate of the standard error used in a good ol' fashioned t-test. This idea has changed the way thousands of Biologists search for differential expressed genes and is, in my opinion, one of the most important contributions of Statistics to Genomics. Is this approach frequentist? Bayesian? To this applied statistician it doesn't really matter.

For those arguing that simply switching to a Bayesian philosophy will improve the current state of affairs, let's consider the smoking and cancer example. Today there is wide agreement that smoking causes lung cancer. Without a clear deductive biochemical/physiological argument and without
the possibility of a randomized trial, this connection was established with a series of observational studies. Most, if not all, of the associated data analyses were based on frequentist techniques. None of the reported confidence intervals on their own established the consensus. Instead, as usually happens in science, a long series of studies supporting this conclusion were needed. How exactly would this have been different with a strictly Bayesian approach? Would a single paper been enough? Would using priors helped given the "expert knowledge" at the time (see below)?

And how would the Bayesian analysis performed by tabacco companies shape the debate? Ultimately, I think applied statisticians would have made an equally convincing case against smoking with Bayesian posteriors as opposed to frequentist confidence intervals. Going forward I hope applied statisticians continue to be free to use whatever techniques they see fit and that critical thinking about data continues to be what distinguishes us. Imposing Bayesian or frequentists philosophy on us would be a disaster.


Data science can't be point and click

As data becomes cheaper and cheaper there are more people that want to be able to analyze and interpret that data.  I see more and more that people are creating tools to accommodate folks who aren't trained but who still want to look at data right now. While I admire the principle of this approach - we need to democratize access to data - I think it is the most dangerous way to solve the problem.

The reason is that, especially with big data, it is very easy to find things like this with point and click tools:

US spending on science, space, and technology correlates with Suicides by hanging, strangulation and suffocation (

The danger with using point and click tools is that it is very hard to automate the identification of warning signs that seasoned analysts get when they have their hands in the data. These may be spurious correlation like the plot above or issues with data quality, or missing confounders, or implausible results. These things are much easier to spot when analysis is being done interactively. Point and click software is also getting better about reproducibility, but it still a major problem for many interfaces.

Despite these issues, point and click software are still all the rage. I understand the sentiment, there is a bunch of data just laying there and there aren't enough people to analyze it expertly. But you wouldn't want me to operate on you using point and click surgery software. You'd want a surgeon who has practiced on real people and knows what to do when she has an artery in her hand. In the same way, I think point and click software allows untrained people to do awful things to big data.

The ways to solve this problem are:

  1. More data analysis training
  2. Encouraging people to do their analysis interactively

I have a few more tips which I have summarized in this talk on things statistics taught us about big data.


The Leek group guide to genomics papers

Leek group guide to genomics papers

When I was a student, my advisor, John Storey, made a list of papers for me to read on nights and weekends. That list was incredibly helpful for a couple of reasons.

  • It got me caught up on the field of computational genomics
  • It was expertly curated, so it filtered a lot of papers I didn't need to read
  • It gave me my first set of ideas to try to pursue as I was reading the papers

I have often thought I should make a similar list for folks who may want to work wtih me (or who want to learn about statistial genomics). So this is my first attempt at that list. I've tried to separate the papers into categories and I've probably missed important papers. I'm happy to take suggestions for the list, but this is primarily designed for people in my group so I might be a little bit parsimonious.



An economic model for peer review

I saw this tweet the other day:

It reminded me that a few years ago I had a paper that went through the peer review wringer. It drove me completely bananas. One thing that drove me so crazy about the process was how long the referees waited before reviewing and how terrible the reviews were after that long wait. So I started thinking about the "economics of peer review". Basically, what is the incentive for scientists to contribute to the system.

To get a handle on this idea, I designed a "peer review game" where there are a fixed number of players N. The players play the game for a fixed period of time. During that time, they can submit papers or they can review papers. For each person, their final score at the end of the time is S_i = \sum {\rm Submitted \; Papers \; Accepted}.

Based on this model, under closed peer review, there is one Nash equilibrium under the strategy that no one reviews any papers. Basically, no one can hope to improve their score by reviewing, they can only hope to improve their score by submitting more papers (sound familiar?). Under open peer review, there are more potential equilibria, based on the relative amount of goodwill you earn from your fellow reviewers by submitting good reviews.

We then built a model system for testing out our theory. The system involved having groups of students play a "peer review game" where they submitted solutions to SAT problems like:

Each solution was then randomly assigned to another player to review. Those players could (a) review it and reject it, (b) review it and accept it, or (c) not review it. The person with the most points at the end of the time (one hour) won.

We found some cool things:

  1. In closed review, reviewing gave no benefit.
  2. In open review, reviewing gave a small positive benefit.
  3. Both systems gave comparable accuracy
  4. All peer review increased the overall accuracy of responses

The paper is here and all of the data and code are here.


The Drake index for academics

I think academic indices are pretty silly; maybe we should introduce so many academic indices that people can't even remember which one is which. There are pretty serious flaws with both citation indices and social media indices that I think render them pretty meaningless in a lot of ways.

Regardless of these obvious flaws I want in the game. Instead of the K-index for academics I propose the Drake index. Drake has achieved both critical and popular success. His song "Honorable Mentions" from the ESPYs (especially the first verse) reminds me of the motivation of the K-index paper.

To quantify both the critical and popular success of a scientist, I propose the Drake Index (TM). The Drake Index is defined as follows

(# Twitter Followers)/(Max Twitter Followers by a Person in your Field) + (#Citations)/(Max Citations by a Person in your Field)

Let's break the index down. There are two main components (Twitter followers and Citations) measuring popular and critical acclaim. But they are measured on different scales. So we attempt to normalize them to the maximum in their field so the indices will both be between 0 and 1. This means that your Drake index score is between 0 and 2. Let's look at a few examples to see how the index works.

  1. Drake  = (16.9M followers)/(55.5 M followers for Justin Bieber) + (0 citations)/(134 Citations for Natalie Portman) = 0.30
  2. Rafael Irizarry = (1.1K followers)/(17.6K followers for Simply Stats) + (38,194 citations)/(185,740 citations for Doug Altman) = 0.27
  3. Roger Peng - (4.5K followers)/(17.6K followers for Simply Stats) + (4,011 citations)/(185,740 citations for Doug Altman) = 0.27
  4. Jeff Leek - (2.6K followers)/(17.6K followers for Simply + (2,348 citations)/(185,740 citations for Doug Altman) = 0.16

In the interest of this not being taken any seriously than an afternoon blogpost should be I won't calculate any other people's Drake index. But you can :-).


You think P-values are bad? I say show me the data.

Both the scientific community and the popular press are freaking out about reproducibility right now. I think they have good reason to, because even the US Congress is now investigating the transparency of science. It has been driven by the very public reproducibility disasters in genomics and economics.

There are three major components to a reproducible and replicable study from a computational perspective: (1) the raw data from the experiment must be available, (2) the statistical code and documentation to reproduce the analysis must be available and (3) a correct data analysis must be performed.

There have been successes and failures in releasing all the data, but PLoS' policy on data availability and the alltrials initiative hold some hope. The most progress has been made on making code and documentation available. Galaxy, knitr, and iPython make it easier to distribute literate programs than it has ever been previously and people are actually using them!

The trickiest part of reproducibility and replicability is ensuring that people perform a good data analysis. The first problem is that we actually don't know which statistical methods lead to higher reproducibility and replicability in users hands.  Articles like the one that just came out in the NYT suggest that using one type of method (Bayesian approaches) over another (p-values) will address the problem. But the real story is that those are still 100% philosophical arguments. We actually have very little good data on whether analysts will perform better analyses using one method or another.  I agree with Roger in his tweet storm (quick someone is wrong on the internet Roger, fix it!):

This is even more of a problem because the data deluge demands that almost all data analysis be performed by people with basic to intermediate statistics training at best. There is no way around this in the short term. There just aren't enough trained statisticians/data scientists to go around.  So we need to study statistics just like any other human behavior to figure out which methods work best in the hands of the people most likely to be using them.


A non-comprehensive list of awesome female data people on Twitter

I was just talking to a student who mentioned she didn't know Jenny Bryan was on Twitter. She is and she is an awesome person to follow. I also realized that I hadn't seen a good list of women on Twitter who do stats/data. So I thought I'd make one. This list is what I could make in 15 minutes based on my own feed and will, with 100% certainty, miss really people. Can you please add them in the comments and I'll update the list?

I have also been informed that these Twitter lists are probably better than my post. But I'll keep updating my list anyway cause I want to know who all the right people to follow are!



Why the three biggest positive contributions to reproducible research are the iPython Notebook, knitr, and Galaxy

There is a huge amount of interest in reproducible research and replication of results. Part of this is driven by some of the pretty major mistakes in reproducibility we have seen in economics and genomics. This has spurred discussion at a variety of levels including at the level of the United States Congress.

To solve this problem we need the appropriate infrastructure. I think developing infrastructure is a lot like playing the lottery, only if the lottery required a lot more work to buy a ticket. You pour a huge amount of effort into building good infrastructure.  I think it helps if you build it for yourself like Yihui did for knitr:

(also make sure you go read the blog post over at Data Science LA)

If lots of people adopt it, you are set for life. If they don't, you did all that work for nothing. So you have to applaud all the groups who have made efforts at building infrastructure for reproducible research.

I would contend that the largest positive contributions to reproducibility in sheer number of analyses made reproducible are:

  •  The knitr R package (or more recently rmarkdown) for creating literate webpages and documents in R.
  • iPython notebooks  for creating literate webpages and documents interactively in Python.
  • The Galaxy project for creating reproducible work flows (among other things) combining known tools.

There are similarities and differences between the different platforms but the one thing I think they all have in common is that they added either no or negligible effort to people's data analytic workflows.

knitr and iPython notebooks have primarily increased reproducibility among folks who have some scripting experience. I think a major reason they are so popular is because you just write code like you normally would, but embed it in a simple to use document. The workflow doesn't change much for the analyst because they were going to write that code anyway. The document just allows it to be built into a more shareable document.

Galaxy has increased reproducibility for many folks, but my impression is the primary user base are folks who have less experience scripting. They have worked hard to make it possible for these folks to analyze data they couldn't before in a reproducible way. But the reproducibility is incidental in some sense. The main reason users come is that they would have had to stitch those pipelines together anyway. Now they have an easier way to do it (lowering workload) and they get reproducibility as a bonus.

If I was in charge of picking the next round of infrastructure projects that are likely to impact reproducibility or science in a positive way, I would definitely look for projects that have certain properties.

  • For scripters and experts I would look for projects that interface with what people are already doing (most data analysis is in R or Python these days), require almost no extra work, and provide some benefit (reproducibility or otherwise). I would also look for things that are agnostic to which packages/approaches people are using.
  • For non-experts I would look for projects that enable people to build pipelines  they were't able to before using already standard tools and give them things like reproducibility for free.

Of course I wouldn't put me in charge anyway, I've never won the lottery with any infrastructure I've tried to build.


A (very) brief review of published human subjects research conducted with social media companies

As I wrote the other day, more and more human subjects research is being performed by large tech companies. The best way to handle the ethical issues raised by this research is still unclear. The first step is to get some idea of what has already been published from these organizations. So here is a brief review of the papers I know about where human subjects experiments have been conducted by companies. I'm only counting experiments here that have (a) been published in the literature and (b) involved experiments on users. I realized I could come up with surprisingly few.  I'd be interested to see more in the comments if people know about them.

Paper: Experimental evidence of massive-scale emotional contagion through social networks
Company: Facebook
What they did: Randomized people to get different emotions in their news feed and observed if they showed an emotional reaction.
What they found: That there was almost no real effect on emotion. The effect was statistically significant but not scientifically or emotionally meaningful.

Paper: Social influence bias: a randomized experiment
Company: Not stated but sounds like Reddit
What they did: Randomly up-voted, down voted, or left alone posts to the social networking site. Then they observed whether there was a difference in the overall rating of posts within each treatment.
What they found: Posts that were upvoted ended up with a final rating score (total upvotes - total downvotes) that was 25% higher.

Paper: Identifying influential and susceptible members of social networks 
Company: Facebook
What they did: Using a commercial Facebook app,  they found users who adopted a product and randomized sending messages to their friends about the use of the product. Then they measured whether their friends decided to adopt the product as well.
What they found: Many interesting things. For example: susceptibility to influence decreases with age, people over 31 are stronger influencers, women are less susceptible to influence than men, etc. etc.


Paper: Inferring causal impact using Bayesian structural time-series models
Company: Google
What they did: They developed methods for inferring the causal impact of an ad in a time series situation. They used data from an advertiser who showed ads to people related to keywords and measured how many visits there were to the advertiser's website through paid and organic (non-paid) clicks.
What they found: That the ads worked. But more importantly that they could predict the causal effect of the ad using their methods.