Tag: reproducible research


The value of re-analysis

I just saw this really nice post over on John Cook's blog. He talks about how it is a valuable exercise to re-type code for examples you find in a book or on a blog. I completely agree that this is a good way to learn through osmosis, learn about debugging, and often pick up the reasons for particular coding tricks (this is how I learned about vectorized calculations in Matlab, by re-typing and running my advisors code back in my youth).

In a more statistical version of this idea, Gary King has proposed reproducing the analysis in a published paper as a way to get a paper of your own.  You can figure out the parts that a person did well and the parts that you would do differently, maybe finding enough insight to come up with your own new paper. But I think this level of replication involves actually two levels of thinking:

  1. Can you actually reproduce the code used to perform the analysis?
  2. Can you solve the "paper as puzzle" exercise proposed by Ethan Perlstein over at his site. Given the results in the paper, can you come up with the story?

Both of these things require a bit more "higher level thinking" than just re-running the analysis if you have the code. But I think even the seemingly "low-level" task of just retyping and running the code that is used to perform a data analysis can be very enlightening. The problem is that this code, in many cases, does not exist. But that is starting to change. If you check out Rpubs or RunMyCode or even the right parts of Figshare you can find data analyses you can run through and reproduce.

The only downside is there is currently no measure of quality on these published analyses. It would be great if people could focus their time re-typing only good data analyses, rather than one at random. Or, as a guy once (almost) said, "Data analysis practice doesn't make perfect, perfect data analysis practice makes perfect."


Pro Tips for Grad Students in Statistics/Biostatistics (Part 1)

I just finished teaching a Ph.D. level applied statistical methods course here at Hopkins. As part of the course, I gave one “pro-tip” a day; something I wish I had learned in graduate school that has helped me in becoming a practicing applied statistician. Here are the first three, more to come soon. 
  1. A major component of being a researcher is knowing what’s going on in the research community. Set up an RSS feed with journal articles. Google Reader is a good one, but there are others. Here are some good applied stat journals: Biostatistics, Biometrics, Annals of Applied Statistics…
  2. Reproducible research is a hot topic, in part because a couple of high-profile papers that were disastrously non-reproducible (see “Deriving chemosensitivity from cell lines: Forensic bioinformatics and reproducible research in high-throughput biology”). When you write code for statistical analysis try to make sure that: (a) It is neat and well-commented - liberal and specific comments are your friend. (b)That it can be run by someone other than you, to produce the same results that you report.
  3. In data analysis - particularly for complex high-dimensional
    data - it is frequently better to choose simple models for clearly defined parameters. With a lot of data, there is a strong temptation to go overboard with statistically complicated models; the danger of overfitting/ over-interpreting is extreme. The most reproducible results are often produced by sensible and statistically “simple” analyses (Note: being sensible and simple does not always lead to higher prole results).

figshare and don't trust celebrities stating facts

A couple of links:

  1. figshare is a site where scientists can share data sets/figures/code. One of the goals is to encourage researchers to share negative results as well. I think this is a great idea - I often find negative results and this could be a place to put them. It also uses a tagging system, like Flickr. I think this is a great idea for scientific research discovery. They give you unlimited public space and 1GB of private space. This could be big, a place to help make reproducible research efforts user-friendly. Via TechCrunch
  2. Don’t trust celebrities stating facts because they usually don’t know what they are talking about. I completely agree with this. Particularly because I have serious doubts about the statisteracy of most celebrities. Nod to Alex for the link (our most active link finder!).  

Where do you get your data?

Here’s a question I get fairly frequently from various types of people: Where do you get your data? This is sometimes followed up quickly with “Can we use some of your data?”

My contention is that if someone asks you these questions, start looking for the exits.

There are of course legitimate reasons why someone might ask you this question. For example, they might be interested in the source of the data to verify its quality. But too often, they are interested in getting the data because they believe it would be a good fit to a method that they have recently developed. Even if that is in fact true, there are some problems.

Before I go on, I need to clarify that I don’t have a problem with data sharing per se, but I usually get nervous when a person’s opening line is “Where do you get your data?” This question presumes a number of things that are usually signs of a bad collaborator:

  • The data are just numbers. My method works on numbers, and these data are numbers, so my method should work here. If it doesn’t work, then I’ll find some other numbers where it does work.
  • The data are all that are important. I’m not that interested in working with an actual scientist on an important problem that people care about, because that would be an awful lot of work and time (see here). I just care about getting the data from whomever will give it to me. I don’t care about the substantive context.
  • Once I have the data, I’m good, thank you. In other words, the scientific process is modular. Scientists generate the data and once I have it I’ll apply my method until I get something that I think makes sense. There’s no need for us to communicate. That is unless I need you to help make the data pretty and nice for me.

The real question that I think people should be asking is “Where do you find such great scientific collaborators?” Because it’s those great collaborators that generated the data and worked hand-in-hand with you to get intelligible results.

Niels Keiding wrote a provocative commentary about the tendency for statisticians to ignore the substantive context of data and to use illustrative/toy examples over and over again. He argued that because of this tendency, we should not be so excited about reproducible research, because as more data become available, we will see more examples of people ignoring the science.

I disagree that this is an argument against reproducible research, but I agree that statisticians (and others) do have a tendency to overuse datasets simply because they are “out there” (stackloss data, anyone?). However, it’s probably impossible to stop people from conducting poor science in any field, and we shouldn’t use the possibility that this might happen in statistics to prevent research from being more reproducible in general. 

But I digress…. My main point is that people who simply ask for “the data” are probably not interested in digging down and understanding the really interesting questions. 


Reproducible Research in Computational Science

First of all, thanks to Rafa for scooping me with my own article. Not sure if that’s reverse scooping or recursive scooping or….

The latest issue of Science has a special section on Data Replication and Reproducibility. As part of the section I wrote a brief commentary on the need for reproducible research in computational science. Science has a pretty tight word limit for it’s commentaries and so it was unfortunately necessary to omit a number of relevant topics.

The editorial introducing the special section, as well as a separate editorial in the same issue, seem to emphasize the errors/fraud angle. This might be because Science has once or twice been at the center of instances of scientific fraud. But as I’ve said previously (and a point I tried to make in the commentary), reproducibility is not needed soley to prevent fraud, although that is an important objective. Another important objective is getting ideas across and disseminating knowledge. I think this second objective often gets lost because there’s a sense that knowledge dissemination already happens and that it’s the errors that are new and interesting. While the errors are perhaps new, there is a problem of ideas not getting across as quickly as they could because of a lack of code and/or data. The lack of published code/data is arguably holding up the advancement of science (if not Science).

One important idea I wanted to get across was that we can ramp up to achieve the ideal scenario, if getting there immediately is not possible. People often get hung up on making the data available but I think a substantial step could be made by simply making code available. Why doesn’t every journal just require it? We don’t have to start with a grand strategy involving funding agencies and large consortia. We can start modestly and make useful improvements

A final interesting question that came up as the issue was going to press was whether I was talking about “reproducibility” or “replication”. As I made clear in the commentary, I define “replication” as independent people going out and collecting new data and “reproducibility” as independent people analyzing the same data. Apparently, others have the reverse definitions for the two words. The confusion is unfortunate because one idea has a centuries long history whereas the importance of the other idea has only recently become relevant. I’m going to stick to my guns here but we’ll have to see how the language evolves.


Reproducible Research and Turkey

Over the Thanksgiving recent break I naturally started thinking about reproducible research in between salting the turkey and making the turkey stock. Clearly, these things are all related. 

I sometimes get the sense that many people see reproducibility as essentially binary. A published paper is either reproducible, as in you can compute every single last numerical result to within epsilon precision, or it’s not. My feeling is that there is a spectrum of reproducibility when it comes to published scientific findings. Some papers are more reproducible than others. And that’s where cooking comes in.

I do a bit of cooking and I am a shameless consumer of food blogs/web sites. There seems pretty solid agreement (and my own experience essentially confirms) that the more you can make yourself and not have to rely on other people doing the cooking, the better. For example, for Thanksgiving, you could theoretically buy yourself a pre-roasted turkey that’s ready to eat. My brother tells me this is what homesick Americans do in China because so few people have an oven (I suppose you could steam a turkey?). Or you could buy an un-cooked turkey that is “flavor injected”. Or you could buy a normal turkey and brine/salt it yourself. Or you could get yourself one of those heritage turkeys. Or you could raise your own turkeys…. I think in all of these cases, the turkey would definitely be edible and maybe even tasty. But some would probably be more tasty than others. 

And that’s the point. There’s a spectrum when it comes to cooking and some methods result in better food than others. Similarly, when it comes to published research there is a spectrum of what authors can make available to reproduce their work. On the one hand, you have just the paper itself, which reveals quite a bit of information (i.e. the scientific question, the general approach) but usually too few details to actually reproduce (or even replicate) anything. Some authors might release the code, which allows you to study the algorithms and maybe apply them to your own work. Some might release the code and the data so that you can actually reproduce the published findings. Some might make a nice R package/vignette so that you barely have to lift a finger. Each case is better than the previous, but that’s not to say that I would only accept the last/best case. Some reproducibility is better than none.

That said, I don’t think we should shoot low. Ideally, we would have the best case, which would allow for full reproducibility and rapid dissemination of ideas. But while we wait for that best case scenario, it couldn’t hurt to have a few steps in between.