Simply Statistics


Thinking like a statistician: don't judge a society by its internet comments

Tweet about this on TwitterShare on FacebookShare on Google+Share on LinkedInEmail this to someone

In a previous post I explained how thinking like a statistician can help you avoid  feeling sad after using Facebook. The basic point was that missing not at random (MNAR) data on your friends' profiles (showing only the best parts of their life) can result in the biased view that your life is boring and uninspiring in comparison. A similar argument can be made to avoid  losing faith in humanity after reading internet comments or anonymous tweets, one of the most depressing activities that I have voluntarily engaged in.  If you want to see proof that racism, xenophobia, sexism and homophobia are still very much alive, read the unfiltered comments sections of articles related to race, immigration, gender or gay rights. However, as a statistician, I remain optimistic about our society after realizing how extremely biased these particular MNAR data can be.

Assume we could summarize an individual's "righteousness" with a numerical index. I realize this is a gross oversimplification, but bear with me. Below is my view on the distribution of this index across all members of our society.


Note that the distribution is not bimodal. This means there is no gap between good and evil, instead we have a continuum. Although there is variability, and we do have some extreme outliers on both sides of the distribution, most of us are much closer to the median than we like to believe. The offending internet commentators represent a very small proportion (the "bad" tail shown in red). But in a large population, such as internet users, this extremely small proportion can be quite numerous and gives us a biased view.

There is one more level of variability here that introduces biases. Since internet comments can be anonymous, we get an unprecedentedly large glimpse into people's opinions and thoughts. We assign a "righteousness" index to our thoughts and opinion and include it in the scatter plot shown in the figure above. Note that this index exhibits variability within individuals: even the best people have the occasional bad thought.  The points in red represent thoughts so awful that no one, not even the worst people, would ever express publicly. The red points give us an overly pessimistic estimate of the individuals that are posting these comments, which exacerbates our already pessimistic view due to a non-representative sample of individuals.

I hope that thinking like a statistician will help the media and social networks put in statistical perspective the awful tweets or internet comments that represent the worst of the worst. These actually provide little to no information on humanity's distribution of righteousness, that I think is moving consistently, albeit slowly, towards the good.




Bayes Rule in an animated gif

Tweet about this on TwitterShare on FacebookShare on Google+Share on LinkedInEmail this to someone

Say Pr(A)=5% is the prevalence of a disease (% of red dots on top fig). Each individual is given a test with accuracy Pr(B|A)=Pr(no B| no A) = 90% .  The O in the middle turns into an X when the test fails. The rate of Xs is 1-Pr(B|A). We want to know the probability of having the disease if you tested positive: Pr(A|B). Many find it counterintuitive that this probability is much lower than 90%; this animated gif is meant to help.

The individual being tested is highlighted with a moving black circle. Pr(B) of these will test positive: we put these in the bottom left and the rest in the bottom right. The proportion of red points that end up in the bottom left is the proportion of red points Pr(A) with a positive test Pr(B|A), thus Pr(B|A) x Pr(A). Pr(A|B), or the proportion of reds in the bottom left, is therefore Pr(B|A) x Pr(A) divided by Pr(B):  Pr(A|B)=Pr(B|A) x Pr(A) / Pr(B)

ps - Is this a frequentist or Bayesian gif?


Creating the field of evidence based data analysis - do people know what a p-value looks like?

Tweet about this on TwitterShare on FacebookShare on Google+Share on LinkedInEmail this to someone

In the medical sciences, there is a discipline called "evidence based medicine". The basic idea is to study the actual practice of medicine using experimental techniques. The reason is that while we may have good experimental evidence about specific medicines or practices, the global behavior and execution of medical practice may also matter. There have been some success stories from this approach and also backlash from physicians who don't like to be told how to practice medicine. However, on the whole it is a valuable and interesting scientific exercise.

Roger introduced the idea of evidence based data analysis in a previous post. The basic idea is to study the actual practice and behavior of data analysts to identify how analysts behave. There is a strong history of this type of research within the data visualization community starting with Bill Cleveland and extending forward to work by Diane Cook, , Jeffrey Heer, and others.

Today we published a large-scale evidence based data analysis randomized trial. Two of the most common data analysis tasks (for better or worse) are exploratory analysis and the identification of statistically significant results. Di Cook's group calls this idea "graphical inference" or "visual significance" and they have studied human's ability to detect significance in the context of linear regression and how it associates with demographics and visual characteristics of the plot.

We performed a randomized study to determine if data analysts with basic training could identify statistically significant relationships. Or as the first author put it in a tweet:

What we found was that people were pretty bad at detecting statistically significant results, but that over multiple trials they could improve. This is a tentative first step toward understanding how the general practice of data analysis works. If you want to play around and see how good you are at seeing p-values we also built this interactive Shiny app. If you don't see the app you can also go to the Shiny app page here.



Dear Laboratory Scientists: Welcome to My World

Tweet about this on TwitterShare on FacebookShare on Google+Share on LinkedInEmail this to someone

Consider the following question: Is there a reproducibility/replication crisis in epidemiology?

I think there are only two possible ways to answer that question:

  1. No, there is no replication crisis in epidemiology because no one ever believes the result of an epidemiological study unless it has been replicated a minimum of 1,000 times in every possible population.
  2. Yes, there is a replication crisis in epidemiology, and it started in 1854 when John Snow inferred, from observational data, that cholera was spread via contaminated water obtained from public pumps.

If you chose (2), then I don't think you are allowed to call it a "crisis" because I think by definition, a crisis cannot last 160 years. In that case, it's more of a chronic disease.

I had an interesting conversation last week with a prominent environmental epidemiologist over the replication crisis that has been reported about extensively in the scientific and popular press. In his view, he felt this was less of an issue in epidemiology because epidemiologists never really had the luxury of people (or at least fellow scientists) believing their results because of their general inability to conduct controlled experiments.

Given the observational nature of most environmental epidemiological studies, it's generally accepted in the community that no single study can be considered causal, and that many replications of a finding are need to establish a causal connection. Even the popular press knows now to include the phrase "correlation does not equal causation" when reporting on an observational study. The work of Sir Austin Bradford Hill essentially codifies the standard of evidence needed to draw causal conclusions from observational studies.

So if "correlation does not equal causation", it begs the question, what does equal causation? Many would argue that a controlled experiment, whether it's a randomized trial or a laboratory experiment, equals causation. But people who work in this area have long known that while controlled experiments do assign the treatment or exposure, there are still many other elements of the experiment that are not controlled.

For example, if subjects drop out of a randomized trial, you now essentially have an observational study (or at least a "broken" randomized trial). If you are conducting a laboratory experiment and all of the treatment samples are measured with one technology and all of the control samples are measured with a different technology (perhaps because of a lack of blinding), then you still have confounding.

The correct statement is not "correlation does not equal causation" but rather "no single study equals causation", regardless of whether it was an observational study or a controlled experiment. Of course, a very tightly controlled and rigorously conducted controlled experiment will be more valuable than a similarly conducted observational study. But in general, all studies should simply be considered as further evidence for or against an hypothesis. We should not be lulled into thinking that any single study about an important question can truly be definitive.


I declare the Bayesian vs. Frequentist debate over for data scientists

Tweet about this on TwitterShare on FacebookShare on Google+Share on LinkedInEmail this to someone

In a recent New York Times article the "Frequentists versus Bayesians" debate was brought up once again. I agree with Roger:

Because the real story (or non-story) is way too boring to sell newspapers, the author resorted to a sensationalist narrative that went something like this:  "Evil and/or stupid frequentists were ready to let a fisherman die; the persecuted Bayesian heroes saved him." This piece adds to the growing number of writings blaming frequentist statistics for the so-called reproducibility crisis in science. If there is something Roger, Jeff and I agree on is that this debate is not constructive. As Rob Kass suggests it's time to move on to pragmatism. Here I follow up Jeff's recent post by sharing related thoughts brought about by two decades of practicing applied statistics and hope it helps put this unhelpful debate to rest.

Applied statisticians help answer questions with data. How should I design a roulette so my casino makes $? Does this fertilizer increase crop yield? Does streptomycin cure pulmonary tuberculosis? Does smoking cause cancer? What movie would would this user enjoy? Which baseball player should the Red Sox give a contract to? Should this patient receive chemotherapy? Our involvement typically means analyzing data and designing experiments. To do this we use a variety of techniques that have been successfully applied in the past and that we have mathematically shown to have desirable properties. Some of these tools are frequentist, some of them are Bayesian, some could be argued to be both, and some don't even use probability. The Casino will do just fine with frequentist statistics, while the baseball team might want to apply a Bayesian approach to avoid overpaying for players that have simply been lucky.

It is also important to remember that good applied statisticians also *think*. They don't apply techniques blindly or religiously. If applied statisticians, regardless of their philosophical bent, are asked if the sun just exploded, they would not design an experiment as the one depicted in this popular XKCD cartoon.

Only someone that does not know how to think like a statistician would act like the frequentists in the cartoon. Unfortunately we do have such people analyzing data. But their choice of technique is not the problem, it's their lack of critical thinking. However, even the most frequentist-appearing applied statistician understands Bayes rule and will adapt the Bayesian approach when appropriate. In the above XCKD example, any respectful applied statistician would not even bother examining the data (the dice roll), because they would assign a probability of 0 to the sun exploding (the empirical prior based on the fact that they are alive). However, superficial propositions arguing for wider adoption of Bayesian methods fail to realize that using these techniques in an actual data analysis project is very different from simply thinking like a Bayesian. To do this we have to represent our intuition or prior knowledge (or whatever you want to call it) with mathematical formulae. When theoretical Bayesians pick these priors, they mainly have mathematical/computational considerations in mind. In practice we can't afford this luxury: a bad prior will render the analysis useless regardless of its convenient mathematically properties.

Despite these challenges, applied statisticians regularly use Bayesian techniques successfully. In one of the fields I work in, Genomics, empirical Bayes techniques are widely used. In this popular application of empirical Bayes we use data from all genes to improve the precision of estimates obtained for specific genes. However, the most widely used output of the software implementation is not a posterior probability. Instead, an empirical Bayes technique is used to improve the estimate of the standard error used in a good ol' fashioned t-test. This idea has changed the way thousands of Biologists search for differential expressed genes and is, in my opinion, one of the most important contributions of Statistics to Genomics. Is this approach frequentist? Bayesian? To this applied statistician it doesn't really matter.

For those arguing that simply switching to a Bayesian philosophy will improve the current state of affairs, let's consider the smoking and cancer example. Today there is wide agreement that smoking causes lung cancer. Without a clear deductive biochemical/physiological argument and without
the possibility of a randomized trial, this connection was established with a series of observational studies. Most, if not all, of the associated data analyses were based on frequentist techniques. None of the reported confidence intervals on their own established the consensus. Instead, as usually happens in science, a long series of studies supporting this conclusion were needed. How exactly would this have been different with a strictly Bayesian approach? Would a single paper been enough? Would using priors helped given the "expert knowledge" at the time (see below)?

And how would the Bayesian analysis performed by tabacco companies shape the debate? Ultimately, I think applied statisticians would have made an equally convincing case against smoking with Bayesian posteriors as opposed to frequentist confidence intervals. Going forward I hope applied statisticians continue to be free to use whatever techniques they see fit and that critical thinking about data continues to be what distinguishes us. Imposing Bayesian or frequentists philosophy on us would be a disaster.


Data science can't be point and click

Tweet about this on TwitterShare on FacebookShare on Google+Share on LinkedInEmail this to someone

As data becomes cheaper and cheaper there are more people that want to be able to analyze and interpret that data.  I see more and more that people are creating tools to accommodate folks who aren't trained but who still want to look at data right now. While I admire the principle of this approach - we need to democratize access to data - I think it is the most dangerous way to solve the problem.

The reason is that, especially with big data, it is very easy to find things like this with point and click tools:

US spending on science, space, and technology correlates with Suicides by hanging, strangulation and suffocation (

The danger with using point and click tools is that it is very hard to automate the identification of warning signs that seasoned analysts get when they have their hands in the data. These may be spurious correlation like the plot above or issues with data quality, or missing confounders, or implausible results. These things are much easier to spot when analysis is being done interactively. Point and click software is also getting better about reproducibility, but it still a major problem for many interfaces.

Despite these issues, point and click software are still all the rage. I understand the sentiment, there is a bunch of data just laying there and there aren't enough people to analyze it expertly. But you wouldn't want me to operate on you using point and click surgery software. You'd want a surgeon who has practiced on real people and knows what to do when she has an artery in her hand. In the same way, I think point and click software allows untrained people to do awful things to big data.

The ways to solve this problem are:

  1. More data analysis training
  2. Encouraging people to do their analysis interactively

I have a few more tips which I have summarized in this talk on things statistics taught us about big data.


The Leek group guide to genomics papers

Tweet about this on TwitterShare on FacebookShare on Google+Share on LinkedInEmail this to someone

Leek group guide to genomics papers

When I was a student, my advisor, John Storey, made a list of papers for me to read on nights and weekends. That list was incredibly helpful for a couple of reasons.

  • It got me caught up on the field of computational genomics
  • It was expertly curated, so it filtered a lot of papers I didn't need to read
  • It gave me my first set of ideas to try to pursue as I was reading the papers

I have often thought I should make a similar list for folks who may want to work wtih me (or who want to learn about statistial genomics). So this is my first attempt at that list. I've tried to separate the papers into categories and I've probably missed important papers. I'm happy to take suggestions for the list, but this is primarily designed for people in my group so I might be a little bit parsimonious.



An economic model for peer review

Tweet about this on TwitterShare on FacebookShare on Google+Share on LinkedInEmail this to someone

I saw this tweet the other day:

It reminded me that a few years ago I had a paper that went through the peer review wringer. It drove me completely bananas. One thing that drove me so crazy about the process was how long the referees waited before reviewing and how terrible the reviews were after that long wait. So I started thinking about the "economics of peer review". Basically, what is the incentive for scientists to contribute to the system.

To get a handle on this idea, I designed a "peer review game" where there are a fixed number of players N. The players play the game for a fixed period of time. During that time, they can submit papers or they can review papers. For each person, their final score at the end of the time is S_i = \sum {\rm Submitted \; Papers \; Accepted}.

Based on this model, under closed peer review, there is one Nash equilibrium under the strategy that no one reviews any papers. Basically, no one can hope to improve their score by reviewing, they can only hope to improve their score by submitting more papers (sound familiar?). Under open peer review, there are more potential equilibria, based on the relative amount of goodwill you earn from your fellow reviewers by submitting good reviews.

We then built a model system for testing out our theory. The system involved having groups of students play a "peer review game" where they submitted solutions to SAT problems like:

Each solution was then randomly assigned to another player to review. Those players could (a) review it and reject it, (b) review it and accept it, or (c) not review it. The person with the most points at the end of the time (one hour) won.

We found some cool things:

  1. In closed review, reviewing gave no benefit.
  2. In open review, reviewing gave a small positive benefit.
  3. Both systems gave comparable accuracy
  4. All peer review increased the overall accuracy of responses

The paper is here and all of the data and code are here.


The Drake index for academics

Tweet about this on TwitterShare on FacebookShare on Google+Share on LinkedInEmail this to someone

I think academic indices are pretty silly; maybe we should introduce so many academic indices that people can't even remember which one is which. There are pretty serious flaws with both citation indices and social media indices that I think render them pretty meaningless in a lot of ways.

Regardless of these obvious flaws I want in the game. Instead of the K-index for academics I propose the Drake index. Drake has achieved both critical and popular success. His song "Honorable Mentions" from the ESPYs (especially the first verse) reminds me of the motivation of the K-index paper.

To quantify both the critical and popular success of a scientist, I propose the Drake Index (TM). The Drake Index is defined as follows

(# Twitter Followers)/(Max Twitter Followers by a Person in your Field) + (#Citations)/(Max Citations by a Person in your Field)

Let's break the index down. There are two main components (Twitter followers and Citations) measuring popular and critical acclaim. But they are measured on different scales. So we attempt to normalize them to the maximum in their field so the indices will both be between 0 and 1. This means that your Drake index score is between 0 and 2. Let's look at a few examples to see how the index works.

  1. Drake  = (16.9M followers)/(55.5 M followers for Justin Bieber) + (0 citations)/(134 Citations for Natalie Portman) = 0.30
  2. Rafael Irizarry = (1.1K followers)/(17.6K followers for Simply Stats) + (38,194 citations)/(185,740 citations for Doug Altman) = 0.27
  3. Roger Peng - (4.5K followers)/(17.6K followers for Simply Stats) + (4,011 citations)/(185,740 citations for Doug Altman) = 0.27
  4. Jeff Leek - (2.6K followers)/(17.6K followers for Simply + (2,348 citations)/(185,740 citations for Doug Altman) = 0.16

In the interest of this not being taken any seriously than an afternoon blogpost should be I won't calculate any other people's Drake index. But you can :-).


You think P-values are bad? I say show me the data.

Tweet about this on TwitterShare on FacebookShare on Google+Share on LinkedInEmail this to someone

Both the scientific community and the popular press are freaking out about reproducibility right now. I think they have good reason to, because even the US Congress is now investigating the transparency of science. It has been driven by the very public reproducibility disasters in genomics and economics.

There are three major components to a reproducible and replicable study from a computational perspective: (1) the raw data from the experiment must be available, (2) the statistical code and documentation to reproduce the analysis must be available and (3) a correct data analysis must be performed.

There have been successes and failures in releasing all the data, but PLoS' policy on data availability and the alltrials initiative hold some hope. The most progress has been made on making code and documentation available. Galaxy, knitr, and iPython make it easier to distribute literate programs than it has ever been previously and people are actually using them!

The trickiest part of reproducibility and replicability is ensuring that people perform a good data analysis. The first problem is that we actually don't know which statistical methods lead to higher reproducibility and replicability in users hands.  Articles like the one that just came out in the NYT suggest that using one type of method (Bayesian approaches) over another (p-values) will address the problem. But the real story is that those are still 100% philosophical arguments. We actually have very little good data on whether analysts will perform better analyses using one method or another.  I agree with Roger in his tweet storm (quick someone is wrong on the internet Roger, fix it!):

This is even more of a problem because the data deluge demands that almost all data analysis be performed by people with basic to intermediate statistics training at best. There is no way around this in the short term. There just aren't enough trained statisticians/data scientists to go around.  So we need to study statistics just like any other human behavior to figure out which methods work best in the hands of the people most likely to be using them.