Simply Statistics: On research parasites and internet mobs - let's try to solve the real problem.

A couple of days ago one of the editors of the New England Journal of Medicine posted an editorial showing some moderate level of support for data sharing but also introducing the term “research parasite”:

A second concern held by some is that a new class of research person will emerge — people who had nothing to do with the design and execution of the study but use another group’s data for their own ends, possibly stealing from the research productivity planned by the data gatherers, or even use the data to try to disprove what the original investigators had posited. There is concern among some front-line researchers that the system will be taken over by what some researchers have characterized as “research parasites.”

While this is obviously the most inflammatory statement in the article, I think that there are several more important and overlooked misconceptions. The biggest problems are:

“The first concern is that someone not involved in the generation and collection of the data may not understand the choices made in defining the parameters.” This almost certainly would be the fault of the investigators who published the data. If the authors adhere to good [data sharing](https://github.com/jtleek/datasharing) policies and respond to queries from people using their data promptly, then this should not be a problem at all.
“… but use another group’s data for their own ends, possibly stealing from the research productivity planned by the data gatherers, or even use the data to try to disprove what the original investigators had posited.” The idea that no one should be able to try to disprove ideas with the authors’ data has been covered in other blogs and on Twitter. One thing I do think is worth considering here is the concern about credit. Traditionally, credit has accrued to authors through citations. But if you get a major study funded, say for 50 million dollars, run that study carefully, sit on a million conference calls, and end up with a single major paper, that could be frustrating. That is why I think a better policy would be for the people who run massive studies to get credit in a way that does not depend on papers. They should get some kind of formal administrative credit. But then the data should be immediately and publicly available to anyone to publish on. That allows people who run massive studies to get credit and science to proceed normally.
“The new investigators arrived on the scene with their own ideas and worked symbiotically, rather than parasitically, with the investigators holding the data, moving the field forward in a way that neither group could have done on its own.” The story that follows about a group of researchers who collaborated with the NSABP to validate their gene expression signature is very encouraging. But it isn’t the only way science should work. Researchers shouldn’t be constrained to one model or another. Sometimes collaboration is necessary, sometimes it isn’t, but in neither case should we label the researchers “symbiotic” or “parasitic”, terms that have extreme connotations.
“How would data sharing work best? We think it should happen symbiotically, not parasitically.” I think that it should happen automatically. If you generate a data set with public funds, you should be required to immediately make it available to researchers in the community. But you should get credit for generating the data set and the hypothesis that led to the data set. The problem is that people who generate data will almost never be as fast at analyzing it as people who know how to analyze data. But both deserve credit, whether they are working together or not.
“Start with a novel idea, one that is not an obvious extension of the reported work. Second, identify potential collaborators whose collected data may be useful in assessing the hypothesis and propose a collaboration. Third, work together to test the new hypothesis. Fourth, report the new findings with relevant coauthorship to acknowledge both the group that proposed the new idea and the investigative group that accrued the data that allowed it to be tested.” The trouble with this framework is that it preferentially accrues credit to data generators and doesn’t accurately describe the role of either party. To flip this argument around, you could just as easily say that anyone who uses Steven Salzberg’s software for aligning or assembling short reads should make him a co-author. I think Dr. Drazen would agree that not everyone who aligned reads should add Steven as co-author, despite his contribution being critical for the completion of their work.

After the piece was posted there was predictable internet rage from data parasites, a dedicated hashtag, and half a dozen angry blog posts written about the piece. These inspired a follow-up piece from Drazen. I recognize why these folks were upset - the “research parasites” thing was unnecessarily inflammatory. But I also sympathize with data creators, who face a tough environment of their own - particularly when they are junior scientists.

I think the response to the internet outrage also misses the mark and comes off as a defense of people with angry perspectives on data sharing. I would much rather have seen a more proactive approach from a leading journal of medicine. I’d like to see something that acknowledges different contributions appropriately and doesn’t slow down science. Something like:

We will require all data, including data from clinical trials, to be made public immediately on publication as long as it poses minimal risk to the patients involved or the patients have been consented to broad sharing.
When data are not made publicly available they are still required to be deposited with a third party such as the NIH or Figshare to be held available for request from qualified/approved researchers.
We will require that all people who use data give appropriate credit to the original data generators in terms of data citations.
We will require that all people who use software/statistical analysis tools give credit to the original tool developers in terms of software citations.
We will include a new designation for leaders of major data collection or software generation projects that can be included to demonstrate credit for major projects undertaken and completed.
When reviewing papers written by experimentalists with no statistical/computational co-authors we will require no fewer than 2 statistical/computational referees to ensure there has not been a mistake made by inexperienced researchers.
When reviewing papers written by statistical/computational authors with no experimental co-authors we will require no fewer than 2 experimental referees to ensure there has not been a mistake made by inexperienced researchers.
