Rethinking Academic Data Sharing

Roger Peng
2018-05-15

The sharing of data is one of the key principles of reproducible research (the other being code sharing). Using the data and code a researcher used to generate a finding, other researchers can reproduce those findings and examine the process that led to them. Reproducibility is critical for transparency, so that others can verify the process, and for speeding up knowledge transfer. But recent events have gotten me thinking more about the data sharing aspect of reproducibility and whether it is tenable in the long run.

Much, if not most, data sharing can occur without any sort of controversy. For example, in our air pollution studies, we make available the air pollution monitoring levels that are inputs to our models (these data are already publicly available from the US EPA, although we process them a little). But the majority of my research involves health data, which contain sensitive information about individuals. These data range from national studies using Medicare and Medicaid insurance claims to more local Baltimore studies involving patients at Johns Hopkins Hospital. In most cases, I hesitate to publish these data unless there is an obvious way to do so that doesn’t violate privacy. For example, we sometimes publish high-level summary statistics of the data, or we will publish small clinical datasets with identifiers removed. My discussions with others about sharing health data for research purposes have always been vague and hand-wavy. In general, there’s a sense that “reproducibility is good” and that “data sharing is good,” but as someone who has to implement these vague ideas, I find frustratingly little guidance on how to do it.
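For concreteness, here is a minimal sketch of those two release strategies, assuming a tabular dataset in pandas; the column names and the list of identifiers dropped are hypothetical and not drawn from our actual studies.

```python
# Minimal sketch (hypothetical column names) of the two release strategies
# described above: dropping direct identifiers from a small clinical dataset
# and publishing only high-level summary statistics.
import pandas as pd

# Illustrative list of direct identifiers; a real study would use a
# vetted, study-specific list.
DIRECT_IDENTIFIERS = ["name", "medical_record_number", "date_of_birth", "address"]

def deidentify(df: pd.DataFrame) -> pd.DataFrame:
    """Drop columns that directly identify individuals before release."""
    return df.drop(columns=[c for c in DIRECT_IDENTIFIERS if c in df.columns])

def outcome_summary(df: pd.DataFrame, outcome: str) -> pd.Series:
    """Publish high-level summaries of an outcome instead of individual records."""
    return df[outcome].agg(["count", "mean", "std", "min", "max"])
```

Neither step is sufficient on its own, of course; the point is only that these are the kinds of reduced products we can release when the full records cannot be shared.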

Times have changed quite a bit since I first started thinking seriously about reproducible research back in 2006. Today, companies like Facebook, Google, and other online platforms routinely collect troves of information about our behavior. Credit scoring companies like Equifax collect a tremendous amount of financial data on us and are vulnerable to hacking. For the most part, people do not have a problem with the fact that the data are collected, because there is an exchange there (data for services). But when personal data are shared with others without their knowledge or consent, they are rightfully outraged. An interesting aspect of the most recent Facebook/Cambridge Analytica controversy is that it was an academic researcher who shared the data with the consulting firm Cambridge Analytica. If he had written an academic paper and shared the data with people who wanted to reproduce the findings, would that have been the right thing to do? For the sake of reproducibility?

I think research on humans is moving in a direction that makes data harder to share rather than easier. It is getting easier and cheaper to collect highly detailed information about individuals and their behavior, and it is tempting to incorporate that information into research questions. In my own work, a paper published in 2008 linked health data aggregated to the county level with air pollution concentrations in that county. In a more recent study on coarse particulate matter exposure, we linked health data at the ZIP code level (a smaller unit of aggregation) because we were able to build a machine learning model to predict air pollution exposure at smaller scales. When you further stratify a ZIP code by race, gender, and age, the counts start to get very small. While using data at this level of detail is useful for answering important new research questions, it is impossible to share this kind of data for reproducibility purposes. That said, this particular dataset (Medicaid billing claims) is not proprietary; any researcher can obtain it from the Centers for Medicare and Medicaid Services for a fee and under a research protocol.
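To see why those cells become too small to release, here is a small sketch, again assuming claims records in a pandas data frame with hypothetical field names; it counts records in each ZIP code by race by gender by age-group cell and suppresses cells below a small-cell threshold (11 is used here purely as an illustrative cutoff).

```python
# Sketch of why fine stratification makes the data unshareable: count records
# per ZIP x race x gender x age-group cell and suppress small cells.
# Field names and the threshold of 11 are illustrative assumptions.
import pandas as pd

def cell_counts_with_suppression(claims: pd.DataFrame, threshold: int = 11) -> pd.DataFrame:
    counts = (
        claims.groupby(["zip_code", "race", "gender", "age_group"])
        .size()
        .reset_index(name="n")
    )
    # Cells with fewer than `threshold` individuals carry too much
    # re-identification risk to release, so mark them as suppressed.
    counts["n"] = counts["n"].astype("Int64")
    counts.loc[counts["n"] < threshold, "n"] = pd.NA
    return counts
```

Once many of the cells fall below the threshold, there is no finely aggregated version of the dataset left that can be released, which is exactly the situation described above.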

There are a few points that need some serious consideration before we can have any real sharing of health data, and at this point I don’t think there are good answers to the questions raised.

Recent controversies at Facebook and various other companies have highlighted the vast quantities of data these companies collect and how vulnerable the data are to exposure. There is a reasonable debate going on regarding whether companies should be able to share this data and for what purposes. Academics have to realize that they are also part of this debate and that any decisions made in that domain will likely affect them. For example, the Health Insurance Portability and Accountability Act (HIPAA) was not a law that particularly had academics in mind, but it nevertheless had a profound effect on the way that academics handled health insurance data for research purposes. Should governments around the world decide to restrict the amount of data that social networks like Facebook can share with third parties, those decisions will likely affect academics too. The recently implemented EU General Data Protection Regulation (GDPR) is a step in this direction, and there will likely be more to follow.

While there are many complexities involved in sharing health data, I do think data sharing serves an important practical purpose, beyond transparency and the “it’s the right thing to do” reasons. Lack of data sharing entrenches incumbents in a field: large, rich institutions can afford to hoard their data and to collect new data whenever they feel like it. This hoarding feeds an upward cycle in which those institutions get more funding to collect even more data that they don’t share. Newer entrants to a given field are left with fewer resources and an inability to collect the same kinds of data. A rich infrastructure of data sharing allows these new entrants to get “up to speed” more quickly and to ask more interesting questions from the get-go. The same is true in the commercial world, where huge companies like Facebook and Google can afford to hold on to their data and comply with various complex regulations. They may be especially motivated to do so if it prevents upstarts and competitors from appearing. The recent release of Facebook’s dating feature is a near-perfect example: Facebook might be very motivated not to share its data if it can build its own separate business while simultaneously jeopardizing the business models of its competitors Match.com, Tinder, and OkCupid, which depend critically on Facebook’s social graph.

In the end, doing something in the name of research does not give someone a blank check. Over time we have restricted the activities that researchers can engage in because society has decided those activities are unethical or otherwise inappropriate. I think it’s inevitable that the sharing of data will require a society-wide discussion about whether sharing for research purposes provides a benefit that outweighs the costs. Right now, researchers are essentially making these decisions unilaterally on an ad hoc basis, and I don’t think that is sustainable.