Simply Statistics: The problem with small big data

There’s lots of talk about “big data” these days and I think that’s great. I think it’s bringing statistics out into the mainstream (even if they don’t call it statistics) and it’s creating lots of opportunities for people with statistics training. It’s one of the reasons we created this blog.

One thing that I think gets missed in much of the mainstream reporting is that, in my opinion, the biggest problems aren’t with the truly massive datasets out there that need to be mined for important information. Sure, those types of problems pose interesting challenges with respect to hardware infrastructure and algorithm design.

I think a bigger problem is what I call “small big data”. Small big data is the dataset that is collected by an individual whose data collection skills are far superior to his/her data analysis skills. You can think of the size of the problem as being measured by the ratio of the dataset size to the investigator’s statistical skill level. For someone with no statistical skills, any dataset represents “big data”.

These days, any individual can create a massive dataset with relatively few resources. In some of the work I do, we send people out with portable air pollution monitors that record pollution levels every 5 minutes over a 1-week period. People with Fitbits can get highly time-resolved data about their daily movements. A single MRI can produce millions of voxels of data.

One challenge here is that these examples all represent datasets that are large “on paper”. That is, there are a lot of bits to store, but that doesn’t mean there’s a lot of useful information there. For example, I find people are often impressed by data that are collected with very high temporal or spatial resolution. But often, you don’t need that level of detail and can get away with coarser resolution over a wider range of scenarios. If you’re interested in changes in air pollution exposure across seasons but you only measure people in the summer, then it doesn’t matter if you measure levels down to the microsecond and produce terabytes of data. Another example might be the idea that sequencing technology doesn’t in fact remove biological variability, no matter how large a dataset it produces.
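To make the resolution point a bit more concrete, here is a minimal Python sketch. It is not the analysis from any real study: the baseline, daily cycle, and noise level are made-up numbers, and `simulate_week` and `hourly_means` are hypothetical helper names introduced only for illustration. It simulates one week of readings taken every 5 minutes, collapses them to hourly averages, and compares the weekly mean computed both ways.

```python
import math
import random

READINGS_PER_HOUR = 60 // 5                       # one reading every 5 minutes
HOURS_PER_WEEK = 24 * 7
N_READINGS = READINGS_PER_HOUR * HOURS_PER_WEEK   # 2016 readings in one week


def simulate_week(seed=0):
    """Hypothetical pollution series: baseline + daily cycle + noise (invented units)."""
    rng = random.Random(seed)
    per_day = READINGS_PER_HOUR * 24
    return [
        20.0 + 5.0 * math.sin(2 * math.pi * i / per_day) + rng.gauss(0.0, 2.0)
        for i in range(N_READINGS)
    ]


def hourly_means(series):
    """Collapse 5-minute readings into one average per hour."""
    return [
        sum(series[i:i + READINGS_PER_HOUR]) / READINGS_PER_HOUR
        for i in range(0, len(series), READINGS_PER_HOUR)
    ]


def mean(values):
    return sum(values) / len(values)


week = simulate_week()
hourly = hourly_means(week)

# 2016 fine-grained readings shrink to 168 hourly values.
print(len(week), len(hourly))

# The weekly mean is the same either way, up to floating-point rounding,
# because every hour contributes the same number of readings.
print(abs(mean(week) - mean(hourly)))
```

For a question about average weekly exposure, the hourly series carries the same answer in a twelfth of the storage. Neither version says anything about winter, which is the design problem that no amount of resolution fixes.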

Another challenge is that the person who collected the data is often not qualified/prepared to analyze it. If the data collector didn’t arrange beforehand to have someone analyze the data, then they’re often stuck. Furthermore, usually the grant that paid for the data collection didn’t budget (enough) for the analysis of the data. The result is that there’s a lot of “small big data” that just sits around unanalyzed. This is an unfortunate circumstance, but in my experience quite common.

One conclusion we can draw is that we need to get more statisticians out into the field, both helping to analyze the data and, perhaps more importantly, designing good studies so that useful data are collected in the first place (as opposed to merely “big” data). But the sad truth is that there aren’t enough of us on the planet to meet the demand. So we need to come up with more creative ways to get the skills out there without requiring our physical presence.