12 Dec

The key word in "Data Science" is not Data, it is Science

One of my colleagues was just at a conference where they saw a presentation about using data to solve a problem where data had previously not been abundant. The speaker claimed the data were "big data" and a question from the audience was: "Well, that isn't really big data is it, it is only X Gigabytes".

While that exact question would elicit groans from most people who work with data, I think it highlights one of the key problems with the thinking around data science. Most people hyping data science have focused on the first word: data. They care about volume and velocity and whatever other buzzwords describe data that is too big for you to analyze in Excel. This hype about the size (relative or absolute) of the data being collected fed into the second category of hype: hype about tools. People threw around EC2, Hadoop, and Pig, and had huge debates about Python versus R.

But the key word in data science is not "data"; it is "science". Data science is only useful when the data are used to answer a question. That is the science part of the equation. The problem with this view of data science is that it is much harder than the view that focuses on data size or tools. It is much, much easier to calculate the size of a data set and say, "My data are bigger than yours" or, "I can code in Hadoop, can you?" than to say, "I have this really hard question, can I answer it with my data?"

A few reasons it is harder to focus on the science than the data/tools are:

  1. John Tukey's quote: "The combination of some data and an aching desire for an answer does not ensure that a reasonable answer can be extracted from a given body of data." You may have 100 GB of data, of which only 3 KB are useful for answering the real question you care about.
  2. When you start with the question you often discover that you need to collect new data or design an experiment to confirm you are getting the right answer.
  3. It is easy to discover "structure" or "networks" in a data set. There will always be correlations for a thousand reasons if you collect enough data. Understanding whether these correlations matter for specific, interesting questions is much harder.
  4. Often the structure you found on the first pass is due to a phenomenon (measurement error, artifacts, data processing) that doesn't answer an interesting question.
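
The third point, that correlations always turn up if you collect enough data, can be illustrated with a small simulation. This is only a sketch under stated assumptions: the function names (`pearson`, `count_strong_pairs`), the sample size of 20 observations, and the |r| > 0.5 threshold are all illustrative choices, not anything from a real analysis. Every variable is independent random noise, so any "strong" correlation it reports is spurious by construction.

```python
import random
from itertools import combinations


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def count_strong_pairs(n_vars, n_obs, threshold, seed=0):
    """Count pairs of pure-noise variables whose |r| exceeds the threshold.

    All columns are independent Gaussian noise (an assumption made for
    illustration), so every pair counted here is a spurious correlation.
    """
    rng = random.Random(seed)
    columns = [[rng.gauss(0, 1) for _ in range(n_obs)] for _ in range(n_vars)]
    strong = sum(
        1
        for a, b in combinations(columns, 2)
        if abs(pearson(a, b)) > threshold
    )
    total = n_vars * (n_vars - 1) // 2
    return strong, total


if __name__ == "__main__":
    # More variables means more pairs, and more chances for noise to
    # look like structure.
    for n_vars in (5, 50, 200):
        strong, total = count_strong_pairs(n_vars, n_obs=20, threshold=0.5)
        print(f"{n_vars} variables: {strong} of {total} pairs have |r| > 0.5")
```

The number of pairs grows quadratically with the number of variables, so the count of apparently strong correlations grows with it even though no variable is related to any other. Deciding which correlations matter still requires a question.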

The issue is that the hype around big data/data science will flame out (it already is flaming out) if data science is only about "data" and not about "science". The long-term impact of data science will be measured by the scientific questions we can answer with the data.

  • Will

    YES! Nothing has really changed in terms of analysis - it always has to start with a question. There are so many companies out there that aren't willing to ask questions of the data but they buy into advanced statistical software. Then they have the expectation that it will solve all of their problems but they're not willing to experiment to get the answers...

  • http://www.dewdrops.net/drew Drew V

    Great post. You lost me in the last line, though.

    I've always thought of the "Science" in "Data Science" as not referring to the questions we answer, but to the methodology.

    To me, "Data Science" is about making better decisions by applying the scientific method to problems which we weren't able to before, because either the data or the tools to analyze it didn't exist before. This allows us to draw conclusions from empirical evidence instead of relying on instinct and anecdotes.

    And the value isn't limited to scientific questions, but extends to everything from deciding which treatments patients should receive, to how roadways should be designed, or even mundane things like which TV shows should be created.

  • Thomas Speidel

    Great points and I agree with everything. I see the "hype" as both a blessing and curse. It is a blessing because it is finally exposing organizations to doing more analytic/stats work, regardless of the size of data. A reality that, so far, was a prerogative of only a few: "using statistics has been the sexy job of the last 30 years. It has just taken awhile for organisations to catch on." (Jim Goodnight, SAS co-founder).

    As a statistician, it's also a curse because it has biased the conversation, blurred the objectives of what we do and opened the doors to both companies and people who want to quickly monetize from the hype. There's little science in big data or data science in the sense that often there is no question. Instead, efforts are geared towards knowledge discovery (what we would call exploratory data analysis), with all the problems that entails.

    As the cost of storage keeps going down and data collection is ubiquitous, there's a tendency to keep collecting more just because we can, with the false expectation that more is always better. Hence "big data", which we cannot access with conventional tools, Hadoop, data warehouses, database solutions, becomes more of an IT/CS problem than an analytical one. Sampling, which forces us to think about representativeness, type I/II errors, effect size, loss of information, etc., is being supplanted by a super-sample that seems to have magic properties by virtue of its size.

    The fundamental question that needs to be answered is whether the increased costs that are associated with big data result in improved results (lower prediction error, lower type I/II error, better validity, etc.) compared to sampling.

  • https://sites.google.com/site/themattprather Matt Prather

    Yes.

  • No One Special

    I agree with the 3 previous commenters. 1) If the amount of stored data is increasing at rate X, are the energy costs, carbon footprints, etc. decreasing at the same rate? 2) The last sentence needs to have the word "science" shifted over a couple of places: The long term impact of data science will be measured by the questions we can answer scientifically with the data. 3) It always has to start with a question.

  • Vincent Granville

    There are several flavors of data science, like there are several flavors of statistics. The version that I promote is a blend of engineering, technology and business management. It does include science and research, but science is not the most important and certainly not the most visible ingredient.