Why big data is in trouble: they forgot about applied statistics

This year the idea that statistics is important for big data has exploded into the popular media. Here are a few examples, starting with the Lazer et al. paper in Science that got the ball rolling on this idea.

All of these articles warn about issues that statisticians have been thinking about for a very long time: sampling populations, confounders, multiple testing, bias, and overfitting. In the rush to take advantage of the hype around big data, these ideas were ignored or not given sufficient attention.

One reason is that when you actually take the time to do an analysis right, with careful attention to all the sources of variation in the data, it is almost a law that you will have to make smaller claims than you could if you just shoved your data in a machine learning algorithm and reported whatever came out the other side.
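To make the multiple-testing point concrete, here is a minimal simulation sketch. It is not taken from any of the articles mentioned above; the function names, sample sizes, and seed are all assumptions chosen for illustration. It screens 1,000 pure-noise features against a pure-noise outcome and counts how many look "significant" at the 5% level, with and without a Bonferroni correction. The test uses the Fisher z-transform, which is only an approximation.

```python
import math
import random
from statistics import NormalDist


def pearson(x, y):
    """Sample Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def count_noise_hits(n_features=1000, n_samples=50, alpha=0.05, seed=0):
    """Screen pure-noise features against a pure-noise outcome.

    Returns (uncorrected_hits, bonferroni_hits). Every hit is a false
    positive, because nothing here is related to anything else.
    """
    rng = random.Random(seed)
    outcome = [rng.gauss(0, 1) for _ in range(n_samples)]
    # Two-sided critical values on the standard normal scale.
    z_plain = NormalDist().inv_cdf(1 - alpha / 2)
    z_bonf = NormalDist().inv_cdf(1 - alpha / (2 * n_features))
    plain = bonf = 0
    for _ in range(n_features):
        feature = [rng.gauss(0, 1) for _ in range(n_samples)]
        r = pearson(feature, outcome)
        # Fisher z-transform: roughly N(0, 1) when the true correlation is 0.
        z = math.atanh(r) * math.sqrt(n_samples - 3)
        plain += abs(z) > z_plain
        bonf += abs(z) > z_bonf
    return plain, bonf


if __name__ == "__main__":
    plain, bonf = count_noise_hits()
    print(f"'significant' noise features, uncorrected: {plain}")
    print(f"'significant' noise features, Bonferroni:  {bonf}")
```

With these settings one would expect roughly 5% of the noise features, on the order of 50, to pass the uncorrected screen, and few or none to survive the correction. Reporting "whatever came out the other side" of the uncorrected screen is exactly the kind of larger claim a careful analysis would not support.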

The prime example in the press is Google Flu Trends. Google Flu Trends was originally developed as a machine learning algorithm for predicting the number of flu cases based on Google search terms. While the underlying data management and machine learning algorithms were correct, a misunderstanding about the uncertainties in the data collection and modeling process has led to highly inaccurate estimates over time. A statistician would have thought carefully about the sampling process, identified time series components to the spatial trend, investigated why the search terms were predictive, and tried to understand the likely reason that Google Flu Trends was working.

As we have seen, lack of expertise in statistics has led to fundamental errors in both genomic science and economics. In the first case a team of scientists led by Anil Potti created an algorithm for predicting the response to chemotherapy. This solution was widely praised in both the scientific and popular press. Unfortunately the researchers did not correctly account for all the sources of variation in the data set, misapplied statistical methods, and ignored major data integrity problems. The lead author and the editors who handled this paper didn't have the necessary statistical expertise, which led to major consequences and cancelled clinical trials.

Similarly, two economists, Reinhart and Rogoff, published a paper claiming that GDP growth was slowed by high governmental debt. Later it was discovered that there was an error in an Excel spreadsheet they used to perform the analysis. But more importantly, the choice of weights they used in their regression model was questioned as being unrealistic and leading to dramatically different conclusions than the authors espoused publicly. The primary failing was a lack of sensitivity analysis to data analytic assumptions that any well-trained applied statistician would have performed.
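A sensitivity analysis of this kind can be very simple. The sketch below uses invented growth figures, not the Reinhart and Rogoff data, and the country labels and function names are hypothetical. It only shows how two defensible weighting choices (one vote per country versus one vote per country-year) can give different answers from the same numbers.

```python
from statistics import mean

# Hypothetical growth rates (percent) for country-years with high debt.
# These numbers are invented for illustration only.
GROWTH = {
    "A": [2.0, 2.5, 3.0, 2.5],  # four high-debt years
    "B": [-7.0],                # a single bad year
    "C": [1.5, 2.5],            # two high-debt years
}


def mean_by_country(growth):
    """Average each country first, then average the country means."""
    return mean(mean(years) for years in growth.values())


def mean_by_country_year(growth):
    """Pool all country-years, so every observation counts equally."""
    return mean(rate for years in growth.values() for rate in years)


if __name__ == "__main__":
    print(f"equal weight per country:      {mean_by_country(GROWTH):+.2f}%")
    print(f"equal weight per country-year: {mean_by_country_year(GROWTH):+.2f}%")
```

In this toy data set the single bad year for country B carries a third of the weight under the first scheme and a seventh under the second, which is enough to flip the sign of the average. Neither scheme is automatically right; the point is that a conclusion which depends this strongly on the weighting should be reported along with that dependence.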

Statistical thinking has also been conspicuously absent from major public big data efforts so far. Here are some examples:

One example of this kind of thinking is this insane table from the alumni magazine of the University of California, which I found in this amazing talk by Terry Speed (via Rafa; go watch his talk right now, it gets right to the heart of the issue). It shows a fundamental disrespect for applied statisticians who have developed serious expertise in a range of scientific disciplines.

[Image: screenshot of the table from the alumni magazine]

All of this leads to two questions:

  1. Given the importance of statistical thinking why aren't statisticians involved in these initiatives?
  2. When thinking about the big data era, what are some statistical ideas we've already figured out?

  • Chad LaFever

    Suppose one is currently employed as a data scientist or data analyst, and wants to get this expertise without necessarily going back for another MS in statistics. What's the best way to gain it? Take classes on the side at a uni? My background is in economics and pure mathematics, with exposure to econometric modeling, etc, but no deep immersion in the finer areas of statistics.

    • Georgette Asherman

      If you have the time and employer support, take a graduate level course with a calculus requirement in applied probability and statistics. Ideally it should be in a department where the students have a quantitative bent, not where they are fulfilling a requirement. For work on your own, I won't recommend specific books or on-line courses, but aim for more advanced content with some applications beyond the contrived 'ball and urn' problems. A lot of good examples from farming or industry might seem obscure but offer a way to understand applied stats concepts.

      Look at the American Statistical Association's CHANCE and SIGNIFICANCE magazines. They have accessible discussions and interesting case studies in current events, sports, social issues and ecology.

    • Lynn Michaluk

      You can take advanced stat courses for free, and get a statement of accomplishment for doing so, at coursera.org and edX.org. Many courses will be redundant, but there are courses in Big Data and Machine Learning offered, among others.

  • BK

    I agree that the challenges of found data and even experimental design are frequently lost, even on relative experts in the field that generates the data. This leads to all sorts of nightmares.

    However, you have opened Pandora's comment box by asking why statisticians aren't involved. The estimable Susan Holmes of Stanford explained part of it quite well when she said that by-and-large biologists can't afford statisticians. This is a truth also heard from Cornell faculty, and others. Biologists not only can't afford world-class statisticians, they can't even afford recent graduates in most cases. I have heard explicit justifications for hiring any recent Cornell masters degree holder in Biostats; because otherwise, biology faculty instead ask biology students and postdocs to take classes in the subject and use their own statistics training, formal and informal.

    At the same time, in planning animal experiments, for example, institutions attempt to force biologists to consult the biostatistics staff, even to the point of including them on the regulatory committees; justifying the $75/hr rates is a perennial fight. As a result, the consultations are short, shallow, after the fact and pro forma. They often are ultimately adversarial, untimely, and not terribly helpful. The relationships spiral down from there.

    What is going on here? The basic problem seems to be a pricing/valuation issue. Biologists price statisticians much lower than statisticians price themselves. Is it disrespect? I'd argue not - because biologists price statisticians at the same price that they price biologists with similar years of training. Who is being undervalued?

    To my eye, the issue is that biologists are invested in the specific work to a different degree, which forms a large part of their compensation. Physicians who wish to do research often 'buy' their own time and take a substantial decrease in compensation as a result.

    So, the question becomes whether a scientist believes he can achieve his goals based on the education he has received in statistics - or whether he is willing to pay someone many times more per hour as a consultant. I do believe that scientists often make the wrong decision at this point, when it comes to information technology, programming, human resources, architecture, facilities, animal husbandry, and other 'overhead' areas. However, they are operating under constraints and perverse incentives.

    If a biologist wants to join a project, he essentially 'buys in' with labor, grant money, and so on. Very few are paid high wages. The humanities, from my experience, are similar. If statisticians really want to be involved, the best way is to 'join the team.' Otherwise, people will continue to muddle on with what they can glean from statistics classes and tutorials.

    • CompStat

      If you have a toothache, and feel the estimated cost given by the dentist is too much, will you settle for do-it-yourself home remedies providing variable results?

      Would you rather subject your hard-earned data to amateur or even wrong analysis and interpretation, which nullifies or downgrades the quality of your own hard work rather than pay what it takes to get professional services?

      Have you heard of the law of demand and supply? There are so few good biostatisticians in the market and so much more demand....

      How many biologists have joined a department of statistics to understand the subject? You expect the biostatistician to master your language sufficiently to understand you, analyze your data and even write down the interpretation of results in your language for publication. Do you realize how much more than a mere statistician or a mere biologist this individual has to be?

      The biologist who "buys in with labor, grant money and so on" gets plenty of career growth rewards in return. What do you give the biostatistician who is at your mercy in terms of career growth and position in your organization? You expect them to be happy watching every junior biologist rise above them in the organizational structure while they stay exactly where they are?

      Most importantly, the above biologist gets to do projects of his choice and grants to do so. Will you allow the biostatistician to play a crucial role or have a crucial say in the overall direction/management of your project, even when he/she has more experience than you and understands things better?

      The physician who takes a lower salary than he could earn as a practitioner nevertheless gets monetary compensation for not practising, which puts his salary above that of a biologist with similar years of study and experience. Will you give the same to a statistician who is sacrificing (i) the comfort zone of working in the sociology of a statistics department to come and work in the different sociology of a biology lab, (ii) the option of doing his/her own research, where he/she is the boss, to work on your projects, (iii) career growth in a statistics department for stagnation in your lab, and (iv) assessment of performance by experts based on quality of work done for being judged by people who do not understand the work and hence give higher weight to criteria involving soft skills like friendliness, behavior, etc.?

      Are you willing to treat your statistician with the same professional respect you give the physician? Are you willing to acknowledge that the statistician is also a scientist and not just a tool-pusher? Are you willing to do some experiments to generate data for your biostatistician's innovative research, on a reciprocal basis, in return for his/her supporting your projects?

  • Rex

    Good problem to think about, but isn't there bias here? This article only focuses on the cases that support its claim. Inception.

  • Andrew Gelman

    That image is pretty amazing. But it does capture the attitude about applied statistics of most of the Berkeley stat dept in 1993!

  • Paramita

    Very good points -- it worries me that we are jumping onto the "big data" bandwagon without much thought about study designs and how to meaningfully analyze and interpret data. Another thing that worries me is data collection via crowd-sourcing, which reminds me of this strip from Calvin and Hobbes: http://www.smartkpis.com/blog/wp-content/uploads/Calvin_Hobbes_Data_Quality.gif

  • Jonathan

    On Q2: We were doing SGD a long time before BigData turned Newton-type optimization irrelevant:

  • tpepler

    I have finally figured out why my wife does not like it when I leave dirty underwear lying on the floor. Therefore I now am a Psychologist.

    The problem is that there are too many people who think this way about the field of Statistics.

  • ilir

    The way I see it, abundant data only relaxes one of the constraints on what can be done with data. Statistical methods were initially developed with 3 considerations in mind: 1) that they do not point at the wrong measure, 2) that they work with a limited sample, 3) that they are not onerous to calculate. Ubiquitous computers and numerical algorithms relaxed part 3 some time ago, and now part 2 is being relaxed in some areas. I cannot imagine a world where part 1 would be accomplished without a good understanding of the underlying process.

  • Harlan Harris

    This is great. My only quibble is that not everyone with excellent statistical training is necessarily a graduate of a statistics department and/or a card-carrying statistician. Many fields, in particular the experimental social sciences, teach statistics rather well, if not as broadly or deeply as statistics departments do. But the original point is spot on -- you can throw a lot of machine learning or statistical algorithms at big data problems and fool yourself very efficiently, if you don't deeply understand what you're doing.

  • bakunin

    Great post. I would like to comment on the argument that statisticians are not involved enough in studies involving statistical analysis (business analytics included). I believe that we can partially blame the academic system. Although it has been recognized somewhat that statistical education is important these days in a world where data is everywhere, we seem to be going about statistical education the wrong way. The goal of statistics courses is often that the student will be able to apply on his own the statistical methods covered. This goal is not wrong as a whole, but in many courses it is applied too generally, to all the concepts discussed. Take business statistics for example. Should a student be able to build summaries through tables and plots? ABSOLUTELY!! Should a student be able to perform two-way ANOVA and multiple regression analysis? ABSOLUTELY NOT!! In the case of two-way ANOVA and multiple regression there is a series of assumptions and complexities in the procedure. A business student should be capable of understanding the need for them and interpreting the results of the analysis (performed by a statistician). However, any business statistics textbook will be layered with triple summations while discussing how to do two-way ANOVA. The textbook will also explain how to use hypothesis testing to choose covariates in multiple regression analysis without any mention of the problems that arise when applying this technique, leaving out alternative ways to choose covariates. Business students shouldn't be expected to perform these types of analysis and, in practice, they almost never do perform them anyway. Furthermore, the pedagogical principle backfires, since many business students are turned off the subject of statistics when facing triple summations and whatnot. Statistical education in other areas uses the same pedagogical principle.

    As someone who teaches business statistics and conducts research in epidemiology, I can say that the need to work with a statistician is often not understood, and that our pedagogical system should teach the student to apply SOME of the concepts learned and to understand the results of some statistical analyses instead of conducting the analyses themselves.

  • Taylan

    The greatest contribution statisticians can make in the discipline is development of theories around text data.

  • Ryan

    What difference does it make what it is called? Big data... applied statistics, etc. If there are issues in a study, raise them. Don't focus on credentialism. Focus on output. No one needs a degree in statistics to do statistics. It isn't a profession. It is a method.

  • http://perceivant.com/ Mike Hurley

    Honestly, I've not seen a better example illustrating why companies shouldn't try to do data science projects totally in-house. Most companies don't have a team with the cross section of skills required to do this right. Companies should divide the labor for a project between the right people internally, or outsource for the talent sets they are missing. Business executives need to understand the talent requirements when helping with project design.

  • Mark Burgess

    This excellent post left a nail in my head until I reconciled a few points overnight. First, if you passed on the Terry Speed reference, go read (pdf) or listen (media) to that, NOW. Second, new toy (Hadoop/NoSQL ecosystems) slushiness - which will pass as the tools mature and we advance on the hype curve - and new data variety (forget about scale for now) are giving us developers/DBAs/data managers some new-order-of-magnitude headaches in reconciling nuances of similar-sounding data elements from various sources and times. Statisticians and data analysts sometimes tended to trust us database geeks' data dictionaries too much already, when we had the benefit of yesterday's RDBMS strong data typing and other controls, so maybe we weren't quite ready for you? Third, I thought the pragmatic practices Jeff espouses would be requisite for any project and any technology - the "low relative cost" of the new platforms means you can ask bigger, better questions, not that you can hire dummies to do your data analysis. So, "Come back!" Yes, there's room at the table for you, but be prepared that the snappiness and convenience of one-click RDBMS installs and fancy ETL tools are gone for now, regressed into the dark ages. Until the tooling catches up with the new data management systems, be prepared to get splattered with oddities and border cases that we might've been hiding from you (sadly) all those years to make the data fit our strongly-typed data stores.

  • W. Evan Johnson

    Great post. While I totally agree that more input from statisticians is needed here, I have to say that some of the blame should also be given to the statisticians. I can't tell you how many colleagues and collaborators have decided to go it alone without a statistician because they find statisticians too difficult to communicate with, and because most statisticians end up focusing on the wrong problem (a theoretically optimal or overly generalized problem, while not really solving the actual problem at hand). I think statisticians in general need a bit of a reality check and need to learn better communication and collaboration skills (although this is not to say that biologists, physicians and others don't need the same reality check).

  • http://www.pietutors.com/ Ashish Soni

    I think one does not need to have a graduate/postgraduate degree or a Ph.D. to be a statistician. I know many very good statisticians who do not hold any of these degrees and yet are working as statistical consultants in big firms just based on their flair for statistics.

  • junkcharts

    Jeff: see my article on calling on statisticians to join the Big Data movement in Significance from last year.

    Also, reading the bestselling Big Data book gives you an idea of the huge mountain we have to climb. The book fundamentally rejects statistics. It describes core principles of Big Data -- and these are principles, which means they don't require justification. For example, the oft-cited "N=All" is almost never a fact of the data but an assumption on the part of the analyst. Then, a whole slew of conclusions tumble out relying on that assumption.

    Be forewarned. You have to be in a sedated mood while you read that book.

  • aa2858

    Very good points, and they touch on some of the issues I have been struggling with in my quest to become an uber data scientist. My college degree was in computer science and economics, which exposed me to a number of basic maths and statistics courses about two decades ago. I have been working with databases all my career.

    In my data science journey, I have gone through a number of MOOC classes; the relevant ones for this comment are Coursera's Statistics One and Andrew Ng's Machine Learning (finishing this now). Leaving out implementation details, the concepts I learned from both classes are pretty much the same, in my opinion. I saw hypothesis testing, null hypotheses, confidence intervals, etc. that I learned in the Statistics One class showing up in the ML class, although at a different level of coverage. The ML and pattern recognition books that I have bought are statistics-laden.

    These points made me think that, to be the uber data scientist I am aspiring to be, I will need courses in applied statistics to bring everything home, considering my background. In the ML class and other data science courses I have previewed, there wasn't serious attention given to statistics, which I think is an oversimplification of things.

    I live in Georgia and I am thinking about taking this course:

    My data science quest is more of a curiosity thing than a career move at this point. What do you guys think?

  • Millie Barr

    This sums up a lot that is wrong with government mismanagement of Health data in my state. Our Health Department insists on collecting hundreds (literally hundreds) of data items for every patient for every trip to hospital - millions and millions of rows of data every year. The audit process that assesses the accuracy of this data is unspeakably deficient. Only about one in every million rows of data ever gets checked back to the actual patient record for accuracy, and then only a small non-controversial subset of the columns is checked. There is a "validation" process that is simply a tool for pedantic public servants to beat hapless hospitals into submission. For instance, if out of the 180,000 rows of data your hospital sends this year, 100 postal codes are invalid, the state withholds $400,000 in funding. So what do people do to fix the missing or inaccurate data? They make it up, of course. What is the statistical significance of 100 missing values out of 180,000? There is no problem handling unknown values if you know they're unknown. If they've been made to appear like valid data though, that's a different matter entirely. But it gets worse. If your hospital fails to "correct" (make up/fudge/falsify) a single value in a row, then the entire row is withheld from the dataset, so your reported activity is reduced.

    By petulantly forcing hospitals to waste massive resources "correcting" data that makes no difference to anything, the data itself is actually rendered less accurate, not more.

    This utterly child-like approach to evaluation of hospital activity has spawned an entire industry specialising in incompetence - that of the Health Information Manager. These people are taught NOTHING at university about data management (normalisation, querying, warehousing) and even less about inferential statistics, sampling methods and so on.

    But that's not the end of it. Once this accretion of data is stockpiled in the most inefficient way possible, zealous researchers of all persuasions trawl through it and make monumental conclusions on the flimsiest of evidence. The data shows the incidence of health condition X has spiked this year; this is a most disturbing development. Or is it? Could it simply be that a handful of hospitals have worked out that coding the data a particular way will net them more money? There's a huge jump in presentations at hospital B from minority group A. We need to move resources out of other areas to that area to help. Or do we? Could it be that a genius Health Information Manager decided that code 5, which meant minority group B last year, now means minority group A this year, and the hospital simply failed to update its software?

    Vastly smaller amounts of vastly better collected data would yield better decisions and better research for a fraction of the cost. As I say, the current process supports an entire industry of stupidity.

  • http://www.facebook.com/mlouca Michael Louca

    Great QUESTION: Given the importance of statistical thinking why aren't statisticians involved in these initiatives?

    Not just statisticians though. There should be attention from strategists on what they want answered. The bottom-up approach sounds exciting, but with the complexity and diversity of data structures, database systems, and quality, one has to be a little suspicious of the quality of the insights.

    My take is that part of the "Big Data" message/hype has been 'hijacked' by a couple of different groups, all with different "agendas", or in some cases no agenda other than to discuss a concept or the potential.

    When you look at the backgrounds of "Data Scientists" it really is a rather broad group of very highly intelligent individuals with different training, skills, and work experiences. The proponents are clearly not coordinated in their efforts, and so you get a muddy definition and purpose.

    For some "Big Data" proponents, it feels obvious that the underlying motive is not better analysis or better utilization of unused data, but rather an attempt to control existing data and ultimately the business messages that get communicated. A power struggle so to speak.

    The challenge for "Big Data" is that there is no unified group within any organization that is "Big Data". "Big Data" supporters are entities both within and outside the organization. Often they are talking about different challenges.

    Often, "Big Data", when described, seems to differ very little from what existing analytical groups would do. And often the business questions are those that market researchers have tackled in the past.

    If the ultimate goal is to produce superior results at a faster speed, why not go about this by unifying existing data, research, and professionals? The first step would be to identify existing gaps in knowledge, data, and skills. For each organization, it will be different. But no doubt, it will not be just one individual.

    As a market researcher, I have had some very good experiences uncovering insights working with analytical folks who could produce data that I had no access to. The combined data often produced results and discussion that were far superior to what would have been produced by separate efforts.

    Why not drop the moniker "Big Data", as it seems to confuse and divide people, and push for "Coordinated Quant Analysis", or something that better communicates the intention of the professionals?

  • Damjan Vukcevic

    Terry Speed gave a similar talk in Australia recently. If you don't have time to watch the full video, but would like a more coherent summary than just flipping through his slides, you can read my review of his talk: http://damjan.vukcevic.net/portfolio/terry-speed-big-data/

  • Vincent Granville

    My Hidden Decision Trees methodology (a hybrid of an approximate, very robust logistic regression blended with a few hundred small decision trees, relying on fast combinatorial feature selection to optimize a newly created metric called predictive power - see my book at http://www.datasciencecentral.com/profiles/blogs/my-data-science-book for details) has been used to process and score billions of clicks, IP addresses (sometimes in real time), and keywords, and to detect some of the largest botnets impacting the digital advertising industry. It is heavily statistical technology in nature, uses model-free, data-driven confidence intervals, and pretty much none of the statistical techniques described in any textbooks other than mine. It is indeed data science. Most recently, it was used to detect large-scale criminal activity hiding behind Amazon AWS (web services) and not detected by Amazon, by analyzing ad network data and data from Spamhaus, Barracuda, ProjectHoneyPot, Adometry and some other sources including social network data. Many of the techniques I used to turn big data into value are described at http://www.datasciencecentral.com/group/research.

  • Diego Pereira

    Historically, statistics was at the service of states. Nothing more contrary to the spirit of a scientist. Even nowadays, some statisticians seem to live to attack what scientists and doctors in other areas are producing. Certainly this kind of attitude won't gain them respect but rejection.

    In science uncertainty is all we have. We do not possess the truth about the world or about the way it works. That doesn't mean we are relativists unable to commit to an idea or approach. We defend our approaches, but usually we are open-minded to others. It is the strength of the argument, and the evidence that supports it, that matters the most.

    In this regard, be aware that what you call evidence is not the only source of evidence, nor the strongest one.

    Scientists understand there is more evidence than that explained by a mathematical model, which, no matter how good it may be, remains an idea, a hypothesis, a theoretical approach instead of a fact. After the "Science wars" scientists started understanding reality as a construct, and science stopped claiming to possess "the truth". We build realities from our limited perception of the world in order to understand a bit of its complexity.

    For all this, the most impressive scientists are very humble people. And for all this, subjectivists and Bayesians fit better in the scientific world.

    Statisticians have to earn their stripes in science, and that implies a switch in attitude towards knowledge and truth.

    You have to decide if you are going to behave as politicians or as scientists; if you are going to support the efforts of the scientific community for doing research out of the mainstream or if you are going to be lame and keep saying how good are the governments in supporting new areas of research when they are not.

    But the most important thing is you have to understand models are just models; that your perspective about the world, while valid, is not sufficient to disqualify others; and that the knowledge you possess is not enough for claiming control over other sciences and disciplines or to rule the world.

    If you work in medicine stop reading and go to a medical consultation, do a fellowship that allows you to understand how medicine works, what is the diagnostic process and how decisions are made there. But overall be respectful before going publicly to criticize doctors when you don't understand what they do or where they come from.

    If you are going to work with a disease, understand the disease before disqualifying people's opinions about something you don't understand. Study how ontologies and classifications are built in medicine, starting with the classics, from Hippocrates to French semiology. No, medicine didn't start with Fisher, and the so-called evidence-based medicine is a bad joke: it is neither medicine nor evidence based.

    And before complaining about how scientists and doctors may treat you, be sure you treat developers and programmers with at least the same respect you would like to obtain in science and medicine. I will never understand why some incredible hackers are excluded from publications when their work in many cases has been the pillar for the results achieved.

    A side note:
    Philosophically, and practically, statisticians are not scientists but artists. Statistics is not an empirical practice but a theoretical one: mathematics can't be defined as a science. I don't see why this can be taken as an offense, or as a disrespectful comment, but certainly some people feel offended by that. I sincerely would like to understand why...