What statistics should do about big data: problem forward not solution backward

There has been a lot of discussion among statisticians about big data and what statistics should do to get involved. Recently Steve M. and Larry W. took up the same issue on their blog. I have been thinking about this for a while, since I work in genomics, which almost always comes with "big data". It is also one area of big data where statistics and statisticians have played a huge role.

A question that naturally arises is, "why have statisticians been so successful in genomics?" I think a major reason is the phrase I borrowed from Brian C. (who may have borrowed it from Ron B.):

problem first, not solution backward

One of the reasons that "big data" is even a term is that data are less expensive to collect than they were a few years ago. One example is the dramatic drop in the price of DNA sequencing. But there are many, many more examples. The quantified-self movement and Fitbits, Google Books, social network data from Twitter, etc. are all areas where data that would have cost a huge amount to collect 10 years ago can now be collected and stored very cheaply.

As statisticians we look for generalizable principles; I would say that you have to zoom pretty far out to generalize from social networks to genomics, but here are two:

  1. The data can't easily be analyzed in an R session on a simple laptop (say, low gigabytes to terabytes).
  2. The data are generally quirky and messy (unstructured text, JSON files with lots of missing data, fastq files with quality metrics, etc.); a small sketch of this follows below.
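
To make the second point concrete, here is a minimal R sketch of the kind of wrangling even a simple JSON feed can need before any statistics happens. The records and field names are invented, and it assumes the jsonlite package is available.

    # Hypothetical social-network records: fields come and go from record to record.
    library(jsonlite)
    raw <- '[{"user":"a","followers":120,"age":34},
             {"user":"b","followers":5},
             {"user":"c","age":51,"followers":null}]'
    d <- fromJSON(raw)                # simplifies to a data frame, filling gaps with NA
    mean(d$followers, na.rm = TRUE)   # even a mean forces a decision about the NAs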

So how does one end up at the "leading edge" of big data? By being willing to deal with the schlep and work out the nitty gritty of how you apply even standard methods to data sets where taking the mean takes hours. Or by taking the time to learn all the kinks specific to, say, processing a microarray, and then taking the time to fix them. This is why statisticians were so successful in genomics: they focused on the practical problems, and that gave them access to data no one else had or could use properly.
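
As a toy illustration of that nitty gritty, here is a minimal R sketch of computing a mean over a file too large to load into memory at once, reading it in chunks. The file name and the one-value-per-line layout are made up for the example.

    # Accumulate a mean over a huge file without ever loading it whole.
    # "expression_values.txt" (one numeric value per line) is a hypothetical file.
    con <- file("expression_values.txt", open = "r")
    total <- 0
    n <- 0
    repeat {
      chunk <- readLines(con, n = 100000)         # read the next block of lines
      if (length(chunk) == 0) break               # end of file
      x <- suppressWarnings(as.numeric(chunk))    # a header or bad row becomes NA
      x <- x[!is.na(x)]                           # messy data: drop missing values
      total <- total + sum(x)
      n <- n + length(x)
    }
    close(con)
    total / n                                     # the mean, built up chunk by chunk

Nothing here is statistically deep, but on a file of hundreds of gigabytes even this much takes real care and real time.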

Doing these things requires a lot of effort that isn't elegant. It also isn't "statistics" by the definition that says only mathematical methodology is statistics. Steve alludes to this in his post when he says:

Frankly I am a little disappointed that there does not seem to be any really compelling new idea (e.g. as in neural nets or the kernel embedding idea that drove machine learning).

I think this is a view shared by many statisticians: since there isn't an elegant new theory yet, there are no "new ideas" in big data. That focus is solution backward. We want an elegant theory that we can then apply to specific problems if they happen to come up.

The alternative is problem forward. The fact that we can collect data so cheaply means we can measure and study things we never could before. Computer scientists, physicists, genome biologists, and others are leading in big data precisely because they aren't thinking about the statistical solution. They are thinking about solving an important scientific problem and are willing to deal with all the dirty details to get there. This allows them to work on data sets and problems that haven't been considered by other people.

In genomics, this has happened before. The invention of microarrays revolutionized the field, and statisticians jumped on board, working closely with scientists, handling the dirty details, and building software so others could too. As a discipline, if we want to be part of the "big data" revolution, I think we need to focus on the scientific problems and let methodology come second. That requires a rethinking of what it means to do statistics: things like parallel computing, data munging, reproducibility, and software development have to be accepted as just as important as methods development.

The good news is that there is plenty of room for statisticians to bring our unique skills in dealing with uncertainty to these new problems; but we will only get a seat at the table if we are willing to deal with the mess that comes with doing real science.

I'll close by listing a few things I'd love to see:

  1. A Bioconductor-like project for social network data. Tyler M. and Ali S. have a paper that would make for an awesome package for this project. 
  2. Statistical pre-processing for fMRI and other brain imaging data. Keep an eye on our smart group for that.
  3. Data visualization for translational applications, dealing with all the niceties of human-data interfaces. See healthvis or the stuff Miriah Meyer is doing.
  4. Most importantly, starting with specific, unsolved scientific problems: seeking novel ways to collect cheap data and analyzing them, even with known and straightforward statistical methods, to deepen our understanding of ourselves or the universe.
  • Larry Wasserman

    Jeff
    I agree with you that statisticians need to be willing to dig into the nitty gritty and get their hands dirty. Still, we need to educate the world that not all answers are created equal. If a statistician wrote a paper on string theory, without learning any of the background physics, it would get ignored (or highly criticized). And rightly so. But when it comes to statistics, the world tolerates people creating data analysis methods that might make little sense. The same standard of scholarship does not seem to apply.

    --Larry

    • jtleek

      Larry-

      I agree that statistical education/literacy are completely critical. I think the analogy with physics/string theory fails because an 85% statistical solution can be incredibly helpful in a wide range of applications (business, biology, physics, etc.). This means the majority of statistics will likely be done by people who are not statisticians. We should embrace this; it makes our field central to so many disciplines.

      So the question is - what is the right kind of education we should be pursuing? One option is to set about defining a "one true path" and stating that some methods are strictly better than others. Another is to teach critical statistical thinking (the way critical thinking and composition classes are taught) - teaching people how to think about statistics while encouraging them to contribute.

      I would say that there are many people who are not statisticians who have made incredibly strong contributions to statistics in a variety of areas - we should ask them how they became experts.

      -Jeff

      • Larry Wasserman

        agreed

      • Evan Johnson

        Jeff,

        I liked your thoughts and I think you are right. However, there are a couple problems that we (as statisticians) need to solve before this will ever happen, among them are:

        Speaking for genomics: very few people outside our discipline read our journals. This is primarily because we have set the bar for statistical innovation/sophistication too high. Most of what we publish is not accessible to non-statisticians in more applied fields (those who actually have the data and are analyzing it). Most of the time we demand novel statistical innovation for brand-new problems and fail to realize that maybe we should start with simpler, more straightforward approaches. Also, if we want others outside of statistics to contribute to the field, we need to be willing to accept less sophisticated approaches in our journals, as long as those methods work well and are statistically justified.

        Until we do this, the impact factors of our top journals will continue to hover around 1-2, and our journals will continue to have little influence on how data are actually analyzed in practice (at least in genomics).

        Just my two cents...

        Evan

        • Bin Yu

          I agree, and I'd like to add that our journals should put much more emphasis on solving problems outside statistics than on solving problems inside statistics.

          • Evan Johnson

            Bin:

            Well stated. I totally agree.

            To be clear, I think there is no reason to deemphasize the value of solving problems within statistics--I believe these types of problems are crucial for our identity as a discipline.

            However, we do need to be more willing to value less technical statistical contributions when they truly benefit science in other disciplines. When we expand the boundaries of statistics into new realms, we have to accept that the initial contributions will not be as sophisticated as the statistical contributions in more mature areas, and we have to embrace that fact.

            Evan

    • Rick Wicklin

      Jerome Cornfield addressed these issues in his 1975 ASA Presidential Address (http://www.jstor.org/stable/2285368). He very much advocated that statistics advances by statisticians "getting their hands dirty," as @Larry puts it. He called statistics "the bedfellow of the sciences" and claimed "the true joy [of a statistician] is to see the breadth of application and the breadth of understanding grow together" (p. 11 of the cited reference). Cornfield's contributions came from his work at the NIH, and I think his Presidential Address speaks well to many of the issues we are facing with Big Data.

      For more on Jerome Cornfield, see http://blogs.sas.com/content/iml/2013/03/18/biography-of-jerome-cornfield/

  • Tao Shi

    I think one area where statisticians might make great contributions is working with scientists on how to collect relevant big data to answer their questions, rather than just helping them find answerable questions in existing cheap big data. Either way, we need to get involved in solving the problems people care about and get our hands dirty with the messy data.

  • itfeature.com

    Agreed. Great article. We need more advanced techniques that help analyze large data sets with ease.


  • DataH

    Jeff, we are seeing an increase in businesses seeking specialized skills to help address challenges that arose with the era of big data. The HPCC Systems platform from LexisNexis helps to fill this gap by allowing data analysts themselves to own the complete data lifecycle. Designed by data scientists, ECL is a declarative programming language used to express data algorithms across the entire HPCC platform. Their built-in analytics libraries for Machine Learning and BI integration provide a complete integrated solution from data ingestion and data processing to data delivery. More at http://hpccsystems.com

  • Erick

    Are there R packages with documentation for implementing DoE (design of experiments)? I am familiar with JMP's (SAS) interface for DoE screening. I'm thinking of finding factors that (A) influence an outcome and (B) optimize it (e.g., maximize yield or minimize cost). I'm asking because this is an important tool for biotech/pharma, and we are graduating lots of mol. bio., genetics, biochem, microbio., and chem. BS majors who want to go into biotech/pharma but have never heard of DoE. I'm thinking of creating a statistics-focused microbiology lab that teaches applications of DoE, but my expertise is limited to JMP, and I'd rather teach with open source than require paying for a GUI.
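
    For what it's worth, CRAN does have dedicated DoE packages (for example FrF2 for fractional factorial screening designs and rsm for response-surface optimization), and for teaching the core idea even base R will do. A minimal sketch, with made-up factors and a simulated yield, might look like this:

        # 2^3 full factorial screening design in base R; factors and effects are invented.
        design <- expand.grid(temp  = c(-1, 1),   # coded low/high levels
                              pH    = c(-1, 1),
                              media = c(-1, 1))

        set.seed(1)
        # Simulated yield: temp and pH matter, media does not.
        design$yield <- 50 + 4 * design$temp + 2 * design$pH + rnorm(nrow(design), sd = 1)

        # (A) Which factors influence the outcome? Fit the main effects.
        fit <- lm(yield ~ temp + pH + media, data = design)
        summary(fit)

        # (B) Optimize: pick the run with the highest predicted yield.
        design[which.max(predict(fit, design)), ]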

  • antagomir

    "I'll close by listing a few things I'd love to see: A Bioconductor-like project for social network data." -> A Bioconductor-like initiative for open government data and computational social sciences (including but not limited to social network analysis) is now starting to take shape and welcoming new developers: ropengov.github.com

  • Temesgen Mekuria

    Is there anyone who can give me a few ideas about

    1. the current contribution of statistics to decision making under uncertainty, and

    2. the role of statistics in "big data" and analytics?