Tag: reproducibility


Does NIH fund innovative work? Does Nature care about publishing accurate articles?

Editor's Note: In a recent post we disagreed with a Nature article claiming that NIH doesn't support innovation. Our colleague Steven Salzberg actually looked at the data and wrote the guest post below. 

Nature published an article last month with the provocative title "Research grants: Conform and be funded."  The authors looked at papers with over 1000 citations to find out whether scientists "who do the most influential scientific work get funded by the NIH."  Their dramatic conclusion, widely reported, was that only 40% of such influential scientists get funding.

Dramatic, but wrong.  I re-analyzed the authors' data and wrote a letter to Nature, which was published today along with the authors response, which more or less ignored my points.  Unfortunately, Nature cut my already-short letter in half, so what readers see in the journal omits half my argument.  My entire letter is published here, thanks to my colleagues at Simply Statistics.  I titled it "NIH funds the overwhelming majority of highly influential original science results," because that's what the original study should have concluded from their very own data.  Here goes:

To the Editors:

In their recent commentary, "Conform and be funded," Joshua Nicholson and John Ioannidis claim that "too many US authors of the most innovative and influential papers in the life sciences do not receive NIH funding."  They support their thesis with an analysis of 200 papers sampled from 700 life science papers with over 1,000 citations.  Their main finding was that only 40% of "primary authors" on these papers are PIs on NIH grants, from which they argue that the peer review system "encourage[s] conformity if not mediocrity."

While this makes for an appealing headline, the authors' own data does not support their conclusion.  I downloaded the full text for a random sample of 125 of the 700 highly cited papers [data available upon request].  A majority of these papers were either reviews (63), which do not report original findings, or not in the life sciences (17) despite being included in the authors' database.  For the remaining 45 papers, I looked at each paper to see if the work was supported by NIH.  In a few cases where the paper did not include this information, I used the NIH grants database to determine if the corresponding author has current NIH support.  34 out of 45 (75%) of these highly-cited papers were supported by NIH.  The 11 papers not supported included papers published by other branches of the U.S. government, including the CDC and the U.S. Army, for which NIH support would not be appropriate.  Thus, using the authors' own data, one would have to conclude that NIH has supported a large majority of highly influential life sciences discoveries in the past twelve years.

The authors – and the editors at Nature, who contributed to the article – suffer from the same biases that Ioannidis himself has often criticized.  Their inclusion of inappropriate articles and especially the choice to require that both the first and last author be PIs on an NIH grant, even when the first author was a student, produced an artificially low number that misrepresents the degree to which NIH supports innovative original research.

It seems pretty clear that Nature wanted a headline about how NIH doesn't support innovation, and Ioannidis was happy to give it to them.  Now, I'd love it if NIH had the funds to support more scientists, and I'd also be in favor of funding at least some work retrospectively - based on recent major achievements, for example, rather than proposed future work.  But the evidence doesn't support the "Conform and be funded" headline, however much Nature might want it to be true.


Replication and validation in -omics studies - just as important as reproducibility

The psychology/social psychology community has made replication a huge focus over the last year. One reason is the recent, public blow-up over a famous study that did not replicate. There are also concerns about the experimental and conceptual design of these studies that go beyond simple lack of replication. In genomics, a similar scandal occurred due to what amounted to “data fudging”. Although, in the genomics case, much of the blame and focus has been on lack of reproducibility or data availability

I think one of the reasons that the field of genomics has focused more on reproducibility is that replication is already more consistently performed in genomics. There are two forms for this replication: validation and independent replication. Validation generally refers to a replication experiment performed by the same research lab or group - with a different technology or a different data set. On the other hand, independent replication of results is usually performed by an outside laboratory. 

Validation is by far the more common form of replication in genomics. In this article in Science, Ioannidis and Khoury point out that validation has different meaning depending on the subfield of genomics. In GWAS studies, it is now expected that every significant result will be validated in a second large cohort with genome-wide significance for the identified variants.

In gene expression/protein expression/systems biology analyses, there has been no similar definition of the “criteria for validation”. Generally the experiments are performed and if a few/a majority/most of the results are confirmed, the approach is considered validated. My colleagues and I just published a paper where we define a new statistical sampling approach for validating lists of features in genomics studies that is somewhat less ambiguous. But I think this is only a starting point. Just like in psychology, we need to focus not just on reproducibility, but also replicability of our results, and we need new statistical approaches for evaluating whether validation/replication have actually occurred. 


People in positions of power that don't understand statistics are a big problem for genomics

I finally got around to reading the IOM report on translational omics and it is very good. The report lays out problems with current practices and how these led to undesired results such as the now infamous Duke trials and the growth in retractions in the scientific literature. Specific recommendations are provided related to reproducibility and validation. I expect the report will improve things. Although I think bigger improvements will come as a result of retirements.

In general, I think the field of genomics  (a label that is used quite broadly) is producing great discoveries and I strongly believe we are just getting started. But we can’t help but notice that retraction and questionable findings are particularly high in this field. In my view most of the problems we are currently suffering stem from the fact that a substantial number of the people with positions of power do not understand statistics and have no experience with computing. Nevin’s biggest mistake was not admitting to himself that he did not understand what Baggerly and Coombes were saying. The lack of reproducibility just exacerbated the problem. The same is true for the editors that rejected the letters written by this pair in their effort to expose a serious problem - a problem that was obvious to all the statistics savvy biologists I talked to.

Unfortunately Nevins is not the only head of a large genomics lab that does not understand basic statistical principles and has no programming/data-management experience. So how do people without the necessary statistical and computing skills to be considered experts in genomics become leaders of the field?  I think this is due to the speed at which Biology changed from a data poor discipline to a data intensive ones. For example, before microarrays, the analysis of gene expression data amounted to spotting black dots on a piece of paper (see Figure A below). In the mid 90s this suddenly changed to sifting through tens of thousands of numbers (see Figure B).

Note that typically, statistics is not a requirement of the Biology graduate programs associated with genomics. At Hopkins neither of the two major programs (CMM and BCMB) require it. And this is expected, since outside of genomics one can do great  Biology without quantitative skills and for most of the 20th century most Biology was like this. So when the genomics revolution first arrived, the great majority of powerful Biology lab heads had no statistical training whatsoever. Nonetheless, a few of these decided to delve into this “sexy” new field and using their copious resources were able to perform some of the first big experiments. Similarly, Biology journals that were not equipped to judge the data analytic component of genomics papers were eager to publish papers in this field, a fact that further compounded the problem.

But I as I mentioned above, in general, the field of genomics is producing wonderful results. Several lab heads did have statistics and computational expertise, while others formed strong partnerships with quantitative types. Here I should mentioned that for these partnerships to be successful the statisticians also needed to expand their knowledge base. The quantitative half of the partnership needs to be biology and technology savvy or they too can make mistakes that lead to retractions

Nevertheless, the field is riddled with problems; enough to prompt an IOM report. But although the present is somewhat grim, I am optimistic about the future. The new generation of biologists leading the genomics field are clearly more knowledgeable and appreciative about statistics and computing than the previous ones. Natural selection helps, as these new investigators can’t rely on pre-genomics-revolution accomplishments and those that do not posses these skills are simply outperformed by those that do. I am also optimistic because biology graduate programs are starting to incorporate statistics and computation into their curricula. For example, as of last year, our Human Genetics program requires our Biostats 615-616 course


Replication, psychology, and big science

Reproducibility has been a hot topic for the last several years among computational scientists. A study is reproducible if there is a specific set of computational functions/analyses (usually specified in terms of code) that exactly reproduce all of the numbers in a published paper from raw data. It is now recognized that a critical component of the scientific process is that data analyses can be reproduced. This point has been driven home particularly for personalized medicine applications, where irreproducible results can lead to delays in evaluating new procedures that affect patients’ health. 

But just because a study is reproducible does not mean that it is replicable. Replicability is stronger than reproducibility. A study is only replicable if you perform the exact same experiment (at least) twice, collect data in the same way both times, perform the same data analysis, and arrive at the same conclusions. The difference with reproducibility is that to achieve replicability, you have to perform the experiment and collect the data again. This of course introduces all sorts of new potential sources of error in your experiment (new scientists, new materials, new lab, new thinking, different settings on the machines, etc.)

Replicability is getting a lot of attention recently in psychology due to some high-profile studies that did not replicate. First, there was the highly-cited experiment that failed to replicate, leading to a show down between the author of the original experiment and the replicators. Now there is a psychology project that allows researchers to post the results of replications of experiments - whether they succeeded or failed. Finally, the Reproducibility Project, probably better termed the Replicability Project, seeks to replicate the results of every experiment in the journals Psychological Science, the Journal of Personality and Social Psychology,or the Journal of Experimental Psychology: Learning, Memory, and Cognition in the year 2008.

Replicability raises important issues for “big science” projects, ranging from genomics (The Thousand Genomes Project) to physics (The Large Hadron Collider). These experiments are too big and costly to actually replicate. So how do we know the results of these experiments aren’t just errors, that upon replication (if we could do it) would not show up again? Maybe smaller scale replications of sub-projects could be used to help convince us of discoveries in these big projects?

In the meantime, I love the idea that replication is getting the credit it deserves (at least in psychology). The incentives in science often only credit the first person to an idea, not the long tail of folks who replicate the results. For example, replications of experiments are often not considered interesting enough to publish. Maybe these new projects will start to change some of the perverse academic incentives.


Some thoughts from Keith Baggerly on the recently released IOM report on translational omics

Shortly after the Duke trial scandal broke, the Institute of Medicine convened a committee to write a report on translational omics. Several statisticians (including one of our interviewees) either served on the committee or provided key testimony. The report came out yesterday.  Nature, Nature Medicine, and Science had posts about the release. Keith Baggerly sent an email with his thoughts and he gave me permission to post it here. He starts by pointing out that the Science piece has a key new observation:

The NCI’s Lisa McShane, who spent months herself trying to validate Duke results, says the IOM committee “did a really fine job” in laying out the issues. NCI now plans to require that its cooperative groups who want to use omics tests follow a checklist similar to that in the IOM report. NCI has not yet decided whether it should add new requirements for omics tests to its peer review process for investigator-initiated grants. But “our hope is that this report will heighten everyone’s awareness,” McShane says. 

Some further thoughts from Keith:

First, the report helps clarify the regulatory landscape: if omics-based tests (which the FDA views as medical devices) will direct patient therapy, FDA approval in the form of an Investigational Device Exemption (IDE) is required. This is in keeping with increased guidance FDA has been providing over the past year and a half dealing with companion diagnostics. It seems likely that several of the problems identified with the Duke trials would have been caught by an FDA review, particularly if the agency already had cause for concern, such as a letter to the editor identifying analytical shortcomings. 

 Second, the report recommends the publication of the full data, code, and metadata used to construct the omics assays prior to their use to guide patient therapy. Had such data and code been available earlier, this would have greatly reduced the amount of effort required for others (including us) to check and potentially extend on the underlying results.

Third, the report emphasizes, repeatedly, that the test must be fully specified (“locked down”) before it is validated, let alone used to guide patient therapy. Quite a bit of effort is given to providing an explicit definition of locked down, in part (we suspect) because both Lisa McShane (NCI) and Robert Becker (FDA) provided testimony that incomplete specification was a problem their agencies encountered frequently. Such specification would have prevented problems such as that identified by the NCI for the Lung Metagene Score (LMS) in 2010, which led the NCI to remove the LMS evaluation as a goal of the Phase III cooperative group trial CALGB-30506.

 Finally, the very existence of the report is recognition that reproducibility is an important problem for the omics-test community. This is a necessary step towards fixing the problem.


Where do you get your data?

Here’s a question I get fairly frequently from various types of people: Where do you get your data? This is sometimes followed up quickly with “Can we use some of your data?”

My contention is that if someone asks you these questions, start looking for the exits.

There are of course legitimate reasons why someone might ask you this question. For example, they might be interested in the source of the data to verify its quality. But too often, they are interested in getting the data because they believe it would be a good fit to a method that they have recently developed. Even if that is in fact true, there are some problems.

Before I go on, I need to clarify that I don’t have a problem with data sharing per se, but I usually get nervous when a person’s opening line is “Where do you get your data?” This question presumes a number of things that are usually signs of a bad collaborator:

  • The data are just numbers. My method works on numbers, and these data are numbers, so my method should work here. If it doesn’t work, then I’ll find some other numbers where it does work.
  • The data are all that are important. I’m not that interested in working with an actual scientist on an important problem that people care about, because that would be an awful lot of work and time (see here). I just care about getting the data from whomever will give it to me. I don’t care about the substantive context.
  • Once I have the data, I’m good, thank you. In other words, the scientific process is modular. Scientists generate the data and once I have it I’ll apply my method until I get something that I think makes sense. There’s no need for us to communicate. That is unless I need you to help make the data pretty and nice for me.

The real question that I think people should be asking is “Where do you find such great scientific collaborators?” Because it’s those great collaborators that generated the data and worked hand-in-hand with you to get intelligible results.

Niels Keiding wrote a provocative commentary about the tendency for statisticians to ignore the substantive context of data and to use illustrative/toy examples over and over again. He argued that because of this tendency, we should not be so excited about reproducible research, because as more data become available, we will see more examples of people ignoring the science.

I disagree that this is an argument against reproducible research, but I agree that statisticians (and others) do have a tendency to overuse datasets simply because they are “out there” (stackloss data, anyone?). However, it’s probably impossible to stop people from conducting poor science in any field, and we shouldn’t use the possibility that this might happen in statistics to prevent research from being more reproducible in general. 

But I digress…. My main point is that people who simply ask for “the data” are probably not interested in digging down and understanding the really interesting questions. 


Preventing Errors through Reproducibility

Checklist mania has hit clinical medicine thanks to people like Peter Pronovost and many others. The basic idea is that simple and short checklists along with changes to clinical culture can prevent major errors from occurring in medical practice. One particular success story is Pronovost’s central line checklist which dramatically reduced bloodstream infections in hospital intensive care units.  

There are three important points about the checklist. First, it neatly summarizes information, bringing the latest evidence directly to clinical practice. It is easy to follow because it is short. Second, it serves to slow you down from whatever you’re doing. Before you cut someone open for surgery, you stop for a second and run the checklist. Third, it is a kind of equalizer that subtly changes the culture: everyone has to follow the checklist, no exceptions. A number of studies have now shown that when clinical units follow checklists, infection rates go down and hospital stays are shorter compared to units using standard procedures. 

Here’s a question: What would it take to convince you that an article’s results were reproducible, short of going in and reproducing the results yourself? I recently raised this question in a talk I gave at the Applied Mathematics Perspectives conference. At the time I didn’t get any responses, but I’ve had some time to think about it since then.

I think most people are thinking of this issue along the lines of “The only way I can confirm that an analysis is reproducible is to reproduce it myself”. In order for that to work, everyone needs to have the data and code available to them so that they can do their own independent reproduction. Such a scenario would be sufficient (and perhaps ideal) to claim reproducibility, but is it strictly necessary? For example, if I reproduced a published analysis, would that satisfy you that the work was reproducible, or would you have to independently reproduce the results for yourself? If you had to choose someone to reproduce an analysis for you (not including yourself), who would it be?

This idea is embedded in the reproducible research policy at Biostatistics, but of course we make the data and code available too. There, a (hopefully) trusted third party (the Associate Editor for Reproducibility) reproduces the analysis and confirms that the code was runnable (at least at that moment in time). 

It’s important to point out that reproducible research is not only about correctness and prevention of errors. It’s also about making research results available to others so that they may more easily build on the work. However, preventing errors is an important part and the question is then what is the best way to do that? Can we generate a reproducibility checklist?


Reproducible Research in Computational Science

First of all, thanks to Rafa for scooping me with my own article. Not sure if that’s reverse scooping or recursive scooping or….

The latest issue of Science has a special section on Data Replication and Reproducibility. As part of the section I wrote a brief commentary on the need for reproducible research in computational science. Science has a pretty tight word limit for it’s commentaries and so it was unfortunately necessary to omit a number of relevant topics.

The editorial introducing the special section, as well as a separate editorial in the same issue, seem to emphasize the errors/fraud angle. This might be because Science has once or twice been at the center of instances of scientific fraud. But as I’ve said previously (and a point I tried to make in the commentary), reproducibility is not needed soley to prevent fraud, although that is an important objective. Another important objective is getting ideas across and disseminating knowledge. I think this second objective often gets lost because there’s a sense that knowledge dissemination already happens and that it’s the errors that are new and interesting. While the errors are perhaps new, there is a problem of ideas not getting across as quickly as they could because of a lack of code and/or data. The lack of published code/data is arguably holding up the advancement of science (if not Science).

One important idea I wanted to get across was that we can ramp up to achieve the ideal scenario, if getting there immediately is not possible. People often get hung up on making the data available but I think a substantial step could be made by simply making code available. Why doesn’t every journal just require it? We don’t have to start with a grand strategy involving funding agencies and large consortia. We can start modestly and make useful improvements

A final interesting question that came up as the issue was going to press was whether I was talking about “reproducibility” or “replication”. As I made clear in the commentary, I define “replication” as independent people going out and collecting new data and “reproducibility” as independent people analyzing the same data. Apparently, others have the reverse definitions for the two words. The confusion is unfortunate because one idea has a centuries long history whereas the importance of the other idea has only recently become relevant. I’m going to stick to my guns here but we’ll have to see how the language evolves.


Reproducible Research and Turkey

Over the Thanksgiving recent break I naturally started thinking about reproducible research in between salting the turkey and making the turkey stock. Clearly, these things are all related. 

I sometimes get the sense that many people see reproducibility as essentially binary. A published paper is either reproducible, as in you can compute every single last numerical result to within epsilon precision, or it’s not. My feeling is that there is a spectrum of reproducibility when it comes to published scientific findings. Some papers are more reproducible than others. And that’s where cooking comes in.

I do a bit of cooking and I am a shameless consumer of food blogs/web sites. There seems pretty solid agreement (and my own experience essentially confirms) that the more you can make yourself and not have to rely on other people doing the cooking, the better. For example, for Thanksgiving, you could theoretically buy yourself a pre-roasted turkey that’s ready to eat. My brother tells me this is what homesick Americans do in China because so few people have an oven (I suppose you could steam a turkey?). Or you could buy an un-cooked turkey that is “flavor injected”. Or you could buy a normal turkey and brine/salt it yourself. Or you could get yourself one of those heritage turkeys. Or you could raise your own turkeys…. I think in all of these cases, the turkey would definitely be edible and maybe even tasty. But some would probably be more tasty than others. 

And that’s the point. There’s a spectrum when it comes to cooking and some methods result in better food than others. Similarly, when it comes to published research there is a spectrum of what authors can make available to reproduce their work. On the one hand, you have just the paper itself, which reveals quite a bit of information (i.e. the scientific question, the general approach) but usually too few details to actually reproduce (or even replicate) anything. Some authors might release the code, which allows you to study the algorithms and maybe apply them to your own work. Some might release the code and the data so that you can actually reproduce the published findings. Some might make a nice R package/vignette so that you barely have to lift a finger. Each case is better than the previous, but that’s not to say that I would only accept the last/best case. Some reproducibility is better than none.

That said, I don’t think we should shoot low. Ideally, we would have the best case, which would allow for full reproducibility and rapid dissemination of ideas. But while we wait for that best case scenario, it couldn’t hurt to have a few steps in between.


Reproducible research: Notes from the field

Over the past year, I’ve been doing a lot of talking about reproducible research. Talking to people, talking on panel discussions, and talking about some of my own work. It seems to me that interest in the topic has exploded recently, in part due to some recent scandals, such as the Duke clinical trials fiasco.

If you are unfamiliar with the term “reproducible research”, the basic idea is that authors of published research should make available the necessary materials so that others may reproduce to a very high degree of similarity the published findings. If that definitions seems imprecise, well that’s because it is.

I think reproducibility becomes easier to define in the context of a specific field or application. Reproducibility often comes up in the context of computational science. In computational science fields, often much of the work is done on the computer using often very large amounts of data. In other words, the analysis of the data is of comparable difficulty as the collection of the data (maybe even more complicated). Then the notion of reproducibility typically comes down to the idea of making the analytic data and the computer code available to others. That way, knowledgeable people can run your code on your data and presumably get your results. If others do not get your results, then that may be a sign of a problem, or perhaps a misunderstanding. In either case, a resolution needs to be found. Reproducibility is key to science much the way it is key to programming. When bugs are found in software, being able to reproduce the bug is an important step to fixing it. Anyone learning to program in C knows the pain of dealing with a memory-related bug, which will often exhibit seemingly random and non-reproducible behavior.

My discussions with others about the need for reproducibility in science often range far and wide. One reason is that many people have very different ideas what (a) what is reproducibility and (b) why we need it. Here is my take on various issues.

  • Reproducibility is not replication. There’s often honest confusion between the notion of reproducibility and what I would call a “full replication”. A full replication doesn’t analyze the same dataset, but rather involves an independent investigator collecting an independent dataset conducting an independent analysis. Full replication has been a fundamental component of science for a long time now and will continue to be the primary yardstick for measuring the plausibility of scientific claims. I think most would agree that full replication is preferable, but often it is simply not possible.
  • Reproducibility is not needed solely to prevent fraud. I’ve heard many people emphasize reproducibility as a means to prevent fraud. Journal editors seem to think this is the main reason for demanding reproducibility. It is one reason, but to be honest, I’m not sure it’s all that useful for detecting fraud. If someone truly wants to commit fraud, then it’s possible to make the fraud reproducible. If I just generate a bunch of numbers and claim it as data that I collected, any analysis from that dataset can be reproducible. While demanding reproducibility may be useful for ferreting out certain types of fraud, it’s not a general solution and it’s not the primary reason we need it. 
  • Reproducibility is not as easy as it sounds. Making one’s research reproducible is hard. It’s especially hard when you try to do it after the research has been done. In that case it’s more like an audit, and I’m guessing for most people the word “audit” is NOT synonymous with “fun”. Even if you set out to make your work reproducible from the get go, it’s easy to miss things. Code can get lost (even with a version control system) and metadata can slip through the cracks. Even when you’ve done everything right, computers and software can change. Virtual machines like Amazon EC2 and others seem to have some potential. The single most useful tool that I have found is a good version control system, like git
  • At this point, anything would be better than nothing. Right now, I think the bar for reproducibility is quite low in the sense that most published work is not reproducible. Even if data are available, often the code that analyzed the data is not available. So if you’re publishing research and you want to make it at least partially reproducible, just put what you can out there. On the web, on github, in a data repository, wherever you can. If you can’t publish the data, make your code available. Even that is better than nothing. In fact, I find reading someone’s code to be very informative and often questions can arise without looking at the data. Until we have a better infrastructure for distributing reproducible research, we will have to make do with what we have. But if we all start putting stuff out there, the conversation will turn from “Why should I make stuff available?” to “Why wouldn’t I make stuff available?”