Big Data - Context = Bad

There's a nice article by Nick Bilton in the New York Times Bits blog about the need for context when looking at Big Data. The article starts off by describing how Google's Flu Trends model overestimated the number of people infected with the flu in the U.S. this season, then veers off into a more general discussion about Big Data.

My favorite quote comes from Mark Hansen:

“Data inherently has all of the foibles of being human,” said Mark Hansen, director of the David and Helen Gurley Brown Institute for Media Innovation at Columbia University. “Data is not a magic force in society; it’s an extension of us.”

Bilton also talks about a course he taught where students built sensors to install in elevators and stairwells at NYU to see how often they were used. The idea was to explore how often and when the NYU students used the stairs versus the elevator.

As I left campus that evening, one of the N.Y.U. security guards who had seen students setting up the computers in the elevators asked how our experiment had gone. I explained that we had found that students seemed to use the elevators in the morning, perhaps because they were tired from staying up late, and switch to the stairs at night, when they became energized.

“Oh, no, they don’t,” the security guard told me, laughing as he assured me that lazy college students used the elevators whenever possible. “One of the elevators broke down a few evenings last week, so they had no choice but to use the stairs.”

I can see at least three problems here, not necessarily mutually exclusive:

  1. Big Data are often "Wrong" Data. The students used the sensors to measure something, but it didn't give them everything they needed. Part of this is that the sensors were cheap, and budget was likely a big constraint here. Big Data are often big precisely because they are cheap to collect. But however many readings the sensors produced, they still couldn't tell the students that the elevator was broken.
  2. A failure of interrogation. With all the data the students collected with their multitude of sensors, they were unable to answer the question "What else could explain what I'm observing?"
  3. A strong desire to tell a story. Upon looking at the data, they seemed to "make sense" or to at least match a preconceived notion of what they should look like. This is related to #2 above: you have to challenge what you see. It's very easy and tempting to let the data tell an interesting story rather than the right story.

I don't mean to be unduly critical of some students in a class who were just trying to collect some data. I think there should be more of that going on. But my point is that it's not as easy as it looks. Even trying to answer a seemingly innocuous question of how students use elevators and stairs requires some forethought, study design, and careful analysis.
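
To make the elevator point concrete, here is a minimal Python sketch with entirely made-up trip counts. The class's actual data are not reproduced here, and the record layout, the `elevator_working` flag, and every number are assumptions for illustration only. It compares the share of trips taken by stairs in the morning and in the evening, first naively and then using only periods when the elevator was working, which is the piece of context the security guard supplied.

```python
# Hypothetical sensor counts, invented for illustration (not the class's data).
# Each record: (day, period, elevator_working, stair_trips, elevator_trips)
RECORDS = [
    ("Mon", "morning", True, 12, 88),
    ("Mon", "evening", True, 15, 85),
    ("Tue", "morning", True, 10, 90),
    ("Tue", "evening", False, 70, 30),  # one elevator out of service
    ("Wed", "morning", True, 14, 86),
    ("Wed", "evening", False, 75, 25),  # one elevator out of service
    ("Thu", "morning", True, 11, 89),
    ("Thu", "evening", True, 13, 87),
]


def stair_share(records):
    """Fraction of all trips in `records` that used the stairs."""
    stairs = sum(r[3] for r in records)
    total = sum(r[3] + r[4] for r in records)
    return stairs / total


def share_by_period(records, working_only=False):
    """Stair share per period, optionally dropping outage periods."""
    shares = {}
    for period in ("morning", "evening"):
        rows = [
            r for r in records
            if r[1] == period and (r[2] or not working_only)
        ]
        shares[period] = stair_share(rows)
    return shares


# What the sensors alone suggest.
naive = share_by_period(RECORDS)
# What remains once the outage context is taken into account.
adjusted = share_by_period(RECORDS, working_only=True)

for label, shares in (("naive", naive), ("working elevator only", adjusted)):
    print(label, {p: round(s, 3) for p, s in shares.items()})
```

With these invented numbers, the naive evening stair share is about 43% versus about 12% in the morning, but restricting to periods with a working elevator brings the evening share down to 14%. The sensors alone could never supply the `elevator_working` column; someone has to go and get it.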

 

  • http://www.facebook.com/tranlm John Smith

    well said. It's only by being as critical as we can that the field can continue moving forward.

  • Ken

    Confounding happens; it is always a problem in observational studies. The problem is that rather than asking a general why, they excluded the possibility that there was some other process at work rather than the students doing what they wanted to.

  • Mayo

    Possibly they were lured into using pig statistics? http://errorstatistics.com/2013/03/04/big-data-or-pig-data/

  • http://twitter.com/Malarky67 Stephen Henderson

    There was a recent news story here in the UK about the suspected under-estimation of alcohol consumption. Previous government figures have been based upon large public health surveys. However, all alcohol has a sales tax (duty) here, and these figures are available as part of Treasury documents. A group just tallied them both and found that what people tell the surveys falls well short of the alcohol that is actually bought in the country (by about a half).

    http://www.bbc.co.uk/news/health-21586566

    Even fairly traditional, well-respected data sources need reality checks.

  • Hélio Silva

    ( Big Data – Context = Bad Prediction ) X publicity = Fiasco