Simply Statistics


An interactive visualization to teach about the curse of dimensionality

I recently was contacted for an interview about the curse of dimensionality. During the course of the conversation, I realized how hard it is to explain the curse to a general audience. One of the best descriptions I could come up with was trying to describe sampling from a unit line, square, cube, etc. and taking samples with side length fixed. You would capture fewer and fewer points. As I was saying this, I realized it is a pretty bad way to explain the curse of dimensionality in words. But there was potentially a cool data visualization that would illustrate the idea. I went to my student Prasad, our resident interactive viz design expert to see if he could build it for me. He came up with this cool Shiny app where you can simulate a number of points (n) and then fix a side length for 1-D, 2-D, 3-D, and 4-D and see how many points you capture in a cube of that length in that dimension. You can find the full app here or check it out on the blog here:



Vote on simply statistics new logo design

As you can tell, we have given the Simply Stats blog a little style update. It should be more readable on phones or tablets now. We are also about to get a new logo. We are down to the last couple of choices and can't decide. Since we are statisticians, we thought we'd collect some data. Here is the link to the poll. Let us know


Creating the field of evidence based data analysis - do people know what a p-value looks like?

In the medical sciences, there is a discipline called "evidence based medicine". The basic idea is to study the actual practice of medicine using experimental techniques. The reason is that while we may have good experimental evidence about specific medicines or practices, the global behavior and execution of medical practice may also matter. There have been some success stories from this approach and also backlash from physicians who don't like to be told how to practice medicine. However, on the whole it is a valuable and interesting scientific exercise.

Roger introduced the idea of evidence based data analysis in a previous post. The basic idea is to study the actual practice and behavior of data analysts to identify how analysts behave. There is a strong history of this type of research within the data visualization community starting with Bill Cleveland and extending forward to work by Diane Cook, , Jeffrey Heer, and others.

Today we published a large-scale evidence based data analysis randomized trial. Two of the most common data analysis tasks (for better or worse) are exploratory analysis and the identification of statistically significant results. Di Cook's group calls this idea "graphical inference" or "visual significance" and they have studied human's ability to detect significance in the context of linear regression and how it associates with demographics and visual characteristics of the plot.

We performed a randomized study to determine if data analysts with basic training could identify statistically significant relationships. Or as the first author put it in a tweet:

What we found was that people were pretty bad at detecting statistically significant results, but that over multiple trials they could improve. This is a tentative first step toward understanding how the general practice of data analysis works. If you want to play around and see how good you are at seeing p-values we also built this interactive Shiny app. If you don't see the app you can also go to the Shiny app page here.



Data science can't be point and click

As data becomes cheaper and cheaper there are more people that want to be able to analyze and interpret that data.  I see more and more that people are creating tools to accommodate folks who aren't trained but who still want to look at data right now. While I admire the principle of this approach - we need to democratize access to data - I think it is the most dangerous way to solve the problem.

The reason is that, especially with big data, it is very easy to find things like this with point and click tools:

US spending on science, space, and technology correlates with Suicides by hanging, strangulation and suffocation (

The danger with using point and click tools is that it is very hard to automate the identification of warning signs that seasoned analysts get when they have their hands in the data. These may be spurious correlation like the plot above or issues with data quality, or missing confounders, or implausible results. These things are much easier to spot when analysis is being done interactively. Point and click software is also getting better about reproducibility, but it still a major problem for many interfaces.

Despite these issues, point and click software are still all the rage. I understand the sentiment, there is a bunch of data just laying there and there aren't enough people to analyze it expertly. But you wouldn't want me to operate on you using point and click surgery software. You'd want a surgeon who has practiced on real people and knows what to do when she has an artery in her hand. In the same way, I think point and click software allows untrained people to do awful things to big data.

The ways to solve this problem are:

  1. More data analysis training
  2. Encouraging people to do their analysis interactively

I have a few more tips which I have summarized in this talk on things statistics taught us about big data.


The Leek group guide to genomics papers

Leek group guide to genomics papers

When I was a student, my advisor, John Storey, made a list of papers for me to read on nights and weekends. That list was incredibly helpful for a couple of reasons.

  • It got me caught up on the field of computational genomics
  • It was expertly curated, so it filtered a lot of papers I didn't need to read
  • It gave me my first set of ideas to try to pursue as I was reading the papers

I have often thought I should make a similar list for folks who may want to work wtih me (or who want to learn about statistial genomics). So this is my first attempt at that list. I've tried to separate the papers into categories and I've probably missed important papers. I'm happy to take suggestions for the list, but this is primarily designed for people in my group so I might be a little bit parsimonious.



An economic model for peer review

I saw this tweet the other day:

It reminded me that a few years ago I had a paper that went through the peer review wringer. It drove me completely bananas. One thing that drove me so crazy about the process was how long the referees waited before reviewing and how terrible the reviews were after that long wait. So I started thinking about the "economics of peer review". Basically, what is the incentive for scientists to contribute to the system.

To get a handle on this idea, I designed a "peer review game" where there are a fixed number of players N. The players play the game for a fixed period of time. During that time, they can submit papers or they can review papers. For each person, their final score at the end of the time is S_i = \sum {\rm Submitted \; Papers \; Accepted}.

Based on this model, under closed peer review, there is one Nash equilibrium under the strategy that no one reviews any papers. Basically, no one can hope to improve their score by reviewing, they can only hope to improve their score by submitting more papers (sound familiar?). Under open peer review, there are more potential equilibria, based on the relative amount of goodwill you earn from your fellow reviewers by submitting good reviews.

We then built a model system for testing out our theory. The system involved having groups of students play a "peer review game" where they submitted solutions to SAT problems like:

Each solution was then randomly assigned to another player to review. Those players could (a) review it and reject it, (b) review it and accept it, or (c) not review it. The person with the most points at the end of the time (one hour) won.

We found some cool things:

  1. In closed review, reviewing gave no benefit.
  2. In open review, reviewing gave a small positive benefit.
  3. Both systems gave comparable accuracy
  4. All peer review increased the overall accuracy of responses

The paper is here and all of the data and code are here.


The Drake index for academics

I think academic indices are pretty silly; maybe we should introduce so many academic indices that people can't even remember which one is which. There are pretty serious flaws with both citation indices and social media indices that I think render them pretty meaningless in a lot of ways.

Regardless of these obvious flaws I want in the game. Instead of the K-index for academics I propose the Drake index. Drake has achieved both critical and popular success. His song "Honorable Mentions" from the ESPYs (especially the first verse) reminds me of the motivation of the K-index paper.

To quantify both the critical and popular success of a scientist, I propose the Drake Index (TM). The Drake Index is defined as follows

(# Twitter Followers)/(Max Twitter Followers by a Person in your Field) + (#Citations)/(Max Citations by a Person in your Field)

Let's break the index down. There are two main components (Twitter followers and Citations) measuring popular and critical acclaim. But they are measured on different scales. So we attempt to normalize them to the maximum in their field so the indices will both be between 0 and 1. This means that your Drake index score is between 0 and 2. Let's look at a few examples to see how the index works.

  1. Drake  = (16.9M followers)/(55.5 M followers for Justin Bieber) + (0 citations)/(134 Citations for Natalie Portman) = 0.30
  2. Rafael Irizarry = (1.1K followers)/(17.6K followers for Simply Stats) + (38,194 citations)/(185,740 citations for Doug Altman) = 0.27
  3. Roger Peng - (4.5K followers)/(17.6K followers for Simply Stats) + (4,011 citations)/(185,740 citations for Doug Altman) = 0.27
  4. Jeff Leek - (2.6K followers)/(17.6K followers for Simply + (2,348 citations)/(185,740 citations for Doug Altman) = 0.16

In the interest of this not being taken any seriously than an afternoon blogpost should be I won't calculate any other people's Drake index. But you can :-).


You think P-values are bad? I say show me the data.

Both the scientific community and the popular press are freaking out about reproducibility right now. I think they have good reason to, because even the US Congress is now investigating the transparency of science. It has been driven by the very public reproducibility disasters in genomics and economics.

There are three major components to a reproducible and replicable study from a computational perspective: (1) the raw data from the experiment must be available, (2) the statistical code and documentation to reproduce the analysis must be available and (3) a correct data analysis must be performed.

There have been successes and failures in releasing all the data, but PLoS' policy on data availability and the alltrials initiative hold some hope. The most progress has been made on making code and documentation available. Galaxy, knitr, and iPython make it easier to distribute literate programs than it has ever been previously and people are actually using them!

The trickiest part of reproducibility and replicability is ensuring that people perform a good data analysis. The first problem is that we actually don't know which statistical methods lead to higher reproducibility and replicability in users hands.  Articles like the one that just came out in the NYT suggest that using one type of method (Bayesian approaches) over another (p-values) will address the problem. But the real story is that those are still 100% philosophical arguments. We actually have very little good data on whether analysts will perform better analyses using one method or another.  I agree with Roger in his tweet storm (quick someone is wrong on the internet Roger, fix it!):

This is even more of a problem because the data deluge demands that almost all data analysis be performed by people with basic to intermediate statistics training at best. There is no way around this in the short term. There just aren't enough trained statisticians/data scientists to go around.  So we need to study statistics just like any other human behavior to figure out which methods work best in the hands of the people most likely to be using them.


A non-comprehensive list of awesome female data people on Twitter

I was just talking to a student who mentioned she didn't know Jenny Bryan was on Twitter. She is and she is an awesome person to follow. I also realized that I hadn't seen a good list of women on Twitter who do stats/data. So I thought I'd make one. This list is what I could make in 15 minutes based on my own feed and will, with 100% certainty, miss really people. Can you please add them in the comments and I'll update the list?

I have also been informed that these Twitter lists are probably better than my post. But I'll keep updating my list anyway cause I want to know who all the right people to follow are!



Why the three biggest positive contributions to reproducible research are the iPython Notebook, knitr, and Galaxy

There is a huge amount of interest in reproducible research and replication of results. Part of this is driven by some of the pretty major mistakes in reproducibility we have seen in economics and genomics. This has spurred discussion at a variety of levels including at the level of the United States Congress.

To solve this problem we need the appropriate infrastructure. I think developing infrastructure is a lot like playing the lottery, only if the lottery required a lot more work to buy a ticket. You pour a huge amount of effort into building good infrastructure.  I think it helps if you build it for yourself like Yihui did for knitr:

(also make sure you go read the blog post over at Data Science LA)

If lots of people adopt it, you are set for life. If they don't, you did all that work for nothing. So you have to applaud all the groups who have made efforts at building infrastructure for reproducible research.

I would contend that the largest positive contributions to reproducibility in sheer number of analyses made reproducible are:

  •  The knitr R package (or more recently rmarkdown) for creating literate webpages and documents in R.
  • iPython notebooks  for creating literate webpages and documents interactively in Python.
  • The Galaxy project for creating reproducible work flows (among other things) combining known tools.

There are similarities and differences between the different platforms but the one thing I think they all have in common is that they added either no or negligible effort to people's data analytic workflows.

knitr and iPython notebooks have primarily increased reproducibility among folks who have some scripting experience. I think a major reason they are so popular is because you just write code like you normally would, but embed it in a simple to use document. The workflow doesn't change much for the analyst because they were going to write that code anyway. The document just allows it to be built into a more shareable document.

Galaxy has increased reproducibility for many folks, but my impression is the primary user base are folks who have less experience scripting. They have worked hard to make it possible for these folks to analyze data they couldn't before in a reproducible way. But the reproducibility is incidental in some sense. The main reason users come is that they would have had to stitch those pipelines together anyway. Now they have an easier way to do it (lowering workload) and they get reproducibility as a bonus.

If I was in charge of picking the next round of infrastructure projects that are likely to impact reproducibility or science in a positive way, I would definitely look for projects that have certain properties.

  • For scripters and experts I would look for projects that interface with what people are already doing (most data analysis is in R or Python these days), require almost no extra work, and provide some benefit (reproducibility or otherwise). I would also look for things that are agnostic to which packages/approaches people are using.
  • For non-experts I would look for projects that enable people to build pipelines  they were't able to before using already standard tools and give them things like reproducibility for free.

Of course I wouldn't put me in charge anyway, I've never won the lottery with any infrastructure I've tried to build.