Why the three biggest positive contributions to reproducible research are the iPython Notebook, knitr, and Galaxy

Jeff Leek
2014-09-04

There is a huge amount of interest in reproducible research and replication of results. Part of this is driven by some of the pretty major mistakes in reproducibility we have seen in economics and genomics. This has spurred discussion at a variety of levels including at the level of the United States Congress.

To solve this problem we need the appropriate infrastructure. I think developing infrastructure is a lot like playing the lottery, only if the lottery required a lot more work to buy a ticket. You pour a huge amount of effort into building good infrastructure.  I think it helps if you build it for yourself like Yihui did for knitr:

(also make sure you go read the blog post over at Data Science LA)

If lots of people adopt it, you are set for life. If they don’t, you did all that work for nothing. So you have to applaud all the groups who have made efforts at building infrastructure for reproducible research.

I would contend that the largest positive contributions to reproducibility in sheer number of analyses made reproducible are:

There are similarities and differences between the different platforms but the one thing I think they all have in common is that they added either no or negligible effort to people’s data analytic workflows.

knitr and iPython notebooks have primarily increased reproducibility among folks who have some scripting experience. I think a major reason they are so popular is because you just write code like you normally would, but embed it in a simple to use document. The workflow doesn’t change much for the analyst because they were going to write that code anyway. The document just allows it to be built into a more shareable document.

Galaxy has increased reproducibility for many folks, but my impression is the primary user base are folks who have less experience scripting. They have worked hard to make it possible for these folks to analyze data they couldn’t before in a reproducible way. But the reproducibility is incidental in some sense. The main reason users come is that they would have had to stitch those pipelines together anyway. Now they have an easier way to do it (lowering workload) and they get reproducibility as a bonus.

If I was in charge of picking the next round of infrastructure projects that are likely to impact reproducibility or science in a positive way, I would definitely look for projects that have certain properties.

Of course I wouldn’t put me in charge anyway, I’ve never won the lottery with any infrastructure I’ve tried to build.