Rethinking Academic Data Sharing

Roger Peng
2018-05-15

The sharing of data is one of the key principles of reproducible research (the other being code sharing). Using the data and code a researcher used to generate a finding, other researchers can reproduce those findings and examine the process that led to them. Reproducibility is critical for transparency, so that others can verify the process, and for speeding up knowledge transfer. But recent events have gotten me thinking more about the data sharing aspect of reproducibility and whether it is tenable in the long run.

Much, if not most, data sharing can occur without any sort of controversy. For example, in our air pollution studies, we make available the air pollution monitoring levels that are inputs to our models (these data are already publicly available from the US EPA, although we process them a little). But the majority of my research involves health data, which contain sensitive information about individuals. These data range from national studies using Medicare and Medicaid insurance claims to more local Baltimore studies involving patients at Johns Hopkins Hospital. In most cases, I hesitate to publish these data unless there is an obvious way to do so that doesn’t violate privacy. For example, we sometimes publish high-level summary statistics of the data, or we will publish small clinical datasets with identifiers removed. My discussions with others about sharing health data for research purposes have always been vague and hand-wavy. In general, there’s a sense that “reproducibility is good” and that “data sharing is good,” but as someone who has to implement these vague ideas, I find frustratingly little guidance on how to do it.
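For concreteness, here is a minimal sketch of those two release strategies, assuming a tabular dataset in pandas; the column names and the list of identifiers dropped are hypothetical and not drawn from our actual studies.

```python
# Minimal sketch (hypothetical column names) of the two release strategies
# described above: dropping direct identifiers from a small clinical dataset
# and publishing only high-level summary statistics.
import pandas as pd

# Illustrative list of direct identifiers; a real study would use a
# vetted, study-specific list.
DIRECT_IDENTIFIERS = ["name", "medical_record_number", "date_of_birth", "address"]

def deidentify(df: pd.DataFrame) -> pd.DataFrame:
    """Drop columns that directly identify individuals before release."""
    return df.drop(columns=[c for c in DIRECT_IDENTIFIERS if c in df.columns])

def outcome_summary(df: pd.DataFrame, outcome: str) -> pd.Series:
    """Publish high-level summaries of an outcome instead of individual records."""
    return df[outcome].agg(["count", "mean", "std", "min", "max"])
```

Neither step is sufficient on its own, of course; the point is only that these are the kinds of reduced products we can release when the full records cannot be shared.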

Times have changed quite a bit since I first started thinking seriously about reproducible research back in 2006. Today, companies like Facebook, Google, and other online platforms routinely collect troves of information about our behavior. Credit scoring companies like Equifax collect a tremendous amount of financial data on us and are vulnerable to hacking. For the most part, people do not have a problem with the fact that the data are collected, because there is an exchange there (data for services). But when personal data are shared with others without their knowledge or consent, they are rightfully outraged. An interesting aspect of the most recent Facebook/Cambridge Analytica controversy is that it was an academic researcher who shared the data with the consulting firm Cambridge Analytica. If he had written an academic paper and shared the data with people who wanted to reproduce the findings, would that have been the right thing to do? For the sake of reproducibility?

I think research on humans is moving in a direction that makes data harder to share rather than easier. It is getting easier and cheaper to collect highly detailed information about individuals and their behavior, and it is tempting to incorporate that information into research questions. In my own work, a paper published in 2008 linked health data aggregated to the county level with air pollution concentrations in that county. In a more recent study on coarse particulate matter exposure, we linked health data at the ZIP code level (a smaller unit of aggregation) because we were able to build a machine learning model to predict air pollution exposure at smaller scales. When you further stratify a ZIP code by race, gender, and age, the counts start to get very small. While using data at this level of detail is useful for answering important new research questions, it is impossible to share this kind of data for reproducibility purposes. That said, this particular dataset (Medicaid billing claims) is not proprietary; any researcher can obtain it from the Centers for Medicare and Medicaid Services for a fee and under a research protocol.
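To see why those cells become too small to release, here is a small sketch, again assuming claims records in a pandas data frame with hypothetical field names; it counts records in each ZIP code by race by gender by age-group cell and suppresses cells below a small-cell threshold (11 is used here purely as an illustrative cutoff).

```python
# Sketch of why fine stratification makes the data unshareable: count records
# per ZIP x race x gender x age-group cell and suppress small cells.
# Field names and the threshold of 11 are illustrative assumptions.
import pandas as pd

def cell_counts_with_suppression(claims: pd.DataFrame, threshold: int = 11) -> pd.DataFrame:
    counts = (
        claims.groupby(["zip_code", "race", "gender", "age_group"])
        .size()
        .reset_index(name="n")
    )
    # Cells with fewer than `threshold` individuals carry too much
    # re-identification risk to release, so mark them as suppressed.
    counts["n"] = counts["n"].astype("Int64")
    counts.loc[counts["n"] < threshold, "n"] = pd.NA
    return counts
```

Once many of the cells fall below the threshold, there is no finely aggregated version of the dataset left that can be released, which is exactly the situation described above.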

There are a few points that need some serious consideration before we can have any real sharing of health data, and at this point I don’t think there are good answers to the questions raised.

Recent controversies at Facebook and various other companies have highlighted the vast quantities of data these companies collect and how vulnerable the data are to exposure. There is a reasonable debate going on regarding whether companies should be able to share this data and for what purposes. Academics have to realize that they are also part of this debate and that any decisions made in that domain will likely affect them. For example, the Health Insurance Portability and Accountability Act (HIPAA) was not a law that particularly had academics in mind, but it nevertheless had a profound effect on the way that academics handled health insurance data for research purposes. Should governments around the world decide to restrict the amount of data that social networks like Facebook can share with third parties, those decisions will likely affect academics too. The recently implemented EU General Data Protection Regulation (GDPR) is a step in this direction, and there will likely be more to follow.

While there are many complexities involved in sharing health data, I do think data sharing serves an important practical purpose, beyond transparency and the “it’s the right thing to do” reasons. Lack of data sharing entrenches incumbents in a field: large, rich institutions can afford to hoard their data and to collect new data whenever they feel like it. This hoarding feeds an upward cycle in which those institutions get more funding to collect even more data that they don’t share. Newer entrants to a given field are left with fewer resources and an inability to collect the same kinds of data. A rich infrastructure of data sharing allows these new entrants to get “up to speed” more quickly and to ask more interesting questions from the get-go. The same is true in the commercial world, where huge companies like Facebook and Google can afford to hold on to their data and comply with various complex regulations. They may be especially motivated to do so if it prevents upstarts and competitors from appearing. The recent release of Facebook’s dating feature is a near-perfect example: Facebook might be very motivated not to share its data if it can build its own separate business while simultaneously jeopardizing the business models of its competitors Match.com, Tinder, and OkCupid, which depend critically on Facebook’s social graph.

In the end, doing something in the name of research does not give someone a blank check. Over time we have restricted the activities that researchers can engage in because society has decided those activities are unethical or otherwise inappropriate. I think it’s inevitable that the sharing of data will require a society-wide discussion about whether sharing for research purposes provides a benefit that outweighs the costs. Right now, researchers are essentially making these decisions unilaterally on an ad hoc basis, and I don’t think that is sustainable.