Where do you get your data?


Here’s a question I get fairly frequently from various types of people: Where do you get your data? This is sometimes followed up quickly with “Can we use some of your data?”

My contention is that if someone asks you these questions, start looking for the exits.

There are of course legitimate reasons why someone might ask you this question. For example, they might be interested in the source of the data to verify its quality. But too often, they are interested in getting the data because they believe it would be a good fit to a method that they have recently developed. Even if that is in fact true, there are some problems.

Before I go on, I need to clarify that I don’t have a problem with data sharing per se, but I usually get nervous when a person’s opening line is “Where do you get your data?” This question presumes a number of things that are usually signs of a bad collaborator:

The real question that I think people should be asking is “Where do you find such great scientific collaborators?” Because it’s those great collaborators that generated the data and worked hand-in-hand with you to get intelligible results.

Niels Keiding wrote a provocative commentary about the tendency for statisticians to ignore the substantive context of data and to use illustrative/toy examples over and over again. He argued that because of this tendency, we should not be so excited about reproducible research, because as more data become available, we will see more examples of people ignoring the science.

I disagree that this is an argument against reproducible research, but I agree that statisticians (and others) do have a tendency to overuse datasets simply because they are “out there” (stackloss data, anyone?). However, it’s probably impossible to stop people from conducting poor science in any field, and we shouldn’t use the possibility that this might happen in statistics to prevent research from being more reproducible in general. 

But I digress…. My main point is that people who simply ask for “the data” are probably not interested in digging down and understanding the really interesting questions.