The vast majority of statistical analysis is not performed by statisticians


Whether you know it or not, everything you do produces data - from the websites you read to the rate at which your heart beats. Until pretty recently, most of the data you produced wasn't collected; it floated off unmeasured. The only data that were collected were painstakingly gathered by scientists one number at a time in small experiments with a few people. This laborious process meant that data were expensive and time-consuming to collect. Yet many of the most amazing scientific discoveries of the last two centuries were squeezed from just a few data points. But over the last two decades, the unit price of data has dropped dramatically. New technologies touching every aspect of our lives - from our money, to our health, to our social interactions - have made data collection cheap and easy (see e.g. Camp Williams).

To give you an idea of how steep the drop in the price of data has been: in 1967 Stanley Milgram ran an experiment to determine the number of degrees of separation between two people in the U.S. In his experiment he sent 296 letters to people in Omaha, Nebraska and Wichita, Kansas. The goal was to get the letters to a specific person in Boston, Massachusetts. The trick was that each person had to send the letter to someone they knew personally, who then sent it to someone they knew, and so on. At the end of the experiment, only 64 letters made it to the individual in Boston. On average, the letters had passed through 6 people to get there. This is where the idea of "6 degrees of Kevin Bacon" comes from - based on 64 data points. A 2007 study updated that number to "7 degrees of Kevin Bacon". That study was based on 30 billion instant messaging conversations collected over the course of a month or two with roughly the same amount of effort.

Once data started getting cheaper to collect, it got cheaper fast. Take another example: the human genome. The genome is the DNA code in every one of your cells - a set of 3 billion letters that is unique to you. By many measures, the race to be the first group to read all 3 billion letters from a single person kicked off the data revolution in biology. The project was completed in 2000 after a decade of work and $3 billion - roughly a dollar per letter. The project was actually a stunning success; most people thought it would be far more expensive. But just over a decade later, new technology means that we can now collect all 3 billion letters from a person's genome for about $10,000 in about a week.

As the price of data dropped so dramatically over the last two decades, the division of labor between analysts and everyone else became less and less clear. Data became so cheap that it couldn't be confined to just a few highly trained people. So raw data started to trickle out in a number of different ways. It started with maps of temperatures across the U.S. in newspapers and quickly ramped up to information on how many friends you had on Facebook, the price of tickets on 50 airlines for the same flight, or measurements of your blood pressure, good cholesterol, and bad cholesterol at every doctor's visit. Arguments about politics started focusing on the results of opinion polls and who was asking the questions. The doctor stopped telling you what to do and started presenting you with options and the risks that went along with each.

That is when statisticians stopped being the primary data analysts. At some point, the trickle of data about you, your friends, and the world started impacting every component of your life. Now almost every decision you make is based on data you have about the world around you. Let's take something simple, like where you are going to eat tonight. You might just pick the restaurant nearest to your house. But you could also ask your friends on Facebook where you should eat, read reviews on Yelp, or check out menus on the restaurants' websites. All of these are pieces of data that are collected and presented for you to "analyze".

This revolution demands a new way of thinking about statistics. It has precipitated explosive growth in data visualization - the most accessible form of data analysis. It has encouraged explosive growth in MOOCs like the ones Roger, Brian and I taught. It has created open data initiatives in government. It has also encouraged more accessible data analysis platforms in the form of startups like StatWing that make it easier for non-statisticians to analyze data.

What does this mean for statistics as a discipline? Well, it is great news in that we have a lot more people to train. It also really drives home the importance of statistical literacy. But it also means we need to adapt our thinking about what it means to teach and perform statistics. We need to focus increasingly on interpretation and critique, and move away from formulas and memorization (think English composition versus grammar). We also need to realize that the most impactful statistical methods will not be used by statisticians, which means we need more foolproofing, more time automating, and more time creating software. The potential payoff is huge for those who realize that the tide has turned and most people who analyze data aren't statisticians.

Comments (10)
  • Jimmy Jin says:

    Hi Jeff. I absolutely agree. I would also argue that, not only is the majority of data analysis not performed by statisticians, but also that "data analysis" is increasingly being seen (by non-statisticians) as something separate from statistics, even though it is not.

    In short, do you think that statistics is having a branding issue? I hear the word "statistics" much less than I do "machine learning" or "data science" nowadays, even though they are (essentially) the same thing. I'm a little worried that statisticians are backing themselves into a corner somehow, although I can't really put my finger on how or why.

    • Wesley Brooks says:

      I think one issue is that lots of people had terrible "statistics" classes in college, but not "data science" or "machine learning" classes, leading them to prefer the latter terms. Also, they probably associate "statistics" with the things they were taught in those classes: t-tests, looking up p-values in tables, calculating standard deviations. The more exciting things you do in real data analysis need a more exciting name!

    • Parag Kulkarni says:

      I totally agree that statisticians will be cornered by the community of engineers and others. It is high time that all statisticians unite, just as architects and lawyers are united across their platforms.

  • Randy Bartlett says:

    Interesting. I think we should emphasize that these non-statisticians are dangerous to the corporation and should report into the statisticians.

  • cinnamon50 says:

    As a PhD in molecular biology, may I make some technical corrections ?

    Each of the following is a common error that is propagated by the DNA community

    There are not 3 billion letters per cell; there are 6 billion - 3 billion from mom and 3 billion from pop, and Mom and Pop are different. Not to mention all of the non-human DNA (the gut bacterial DNA/genome).

    There is today no technology that can sequence all of the DNA in the human genome (e.g., the centromeres), and no technology on the horizon that I know of (possible exceptions: long-read PacBio or Moleculo). That is, there are large parts of the human genome that have never been sequenced, and no currently proposed technology will allow us to sequence these regions.

    Also, the quality of most sequences is quite low. DNA in the human body is packaged in 46 chromosomes, so by definition a "complete" sequence should have 46 parts. In fact, most "high quality" sequences have hundreds of parts because of the gaps in current sequences. (In jargon: the number of "contigs" (continuous sequences) should be 46 - or 92, assuming we sequence to the edge of each centromere - but in fact the number of contigs is much, much higher.)

    you are totally right that the bottleneck is now statistics but, I think, not in the way you mean.

    The problem is twofold:
    1) doing experiments on people is messy and very expensive, and

    2) small changes can have big impacts. E.g., a change of 1-2% in average blood pressure might save thousands of lives. Typical blood pressure is ~100 in standard units, and I believe the error on replicate measurements - same person, same measuring technique, just a minute or two apart - is on the order of 5%.

  • Nicola Ward Petty says:

    Nicely put. There are some statisticians who are protectionist and want to prevent non-statisticians from handling data without the correct training. Well, that ship has sailed, and our job as educators is to help the majority of people understand "chance, data and evidence". The New Zealand school curriculum is leading the world in this, and the subject of statistics is increasingly popular at high school level. These are exciting times. I write a blog about teaching statistics and operations research and try to take statistical concepts to the people with my YouTube videos.

  • Monika says:

    Increasingly, control of the statistical analysis in our research unit (consisting mostly of psychologists) is being wrenched from my hands by SPSS. Well, actually by users thereof, who know only the rudiments of statistics and badly apply the software to their situation. I had a conversation with one bright, medalled PhD post-doc recently who had not heard of the term 'intercept' in connection with regression. I notice a tendency now for single words to be replaced with a more definitive phrase (no analytical data to back this up), so maybe we should follow suit and come up with a phrase for 'intercept' - suggestions? It saddens me to see tools from such a powerful and complex box (statistics) being wielded so incompetently, indiscriminately and dangerously.

  • Mary Howard says:

    I do think in the current era statistics departments should focus on making sure their graduates are every bit as good at computer programming as they are at statistics; this aspect did seem to be missing from master's-level statistics programs in the past, and perhaps the current backlash is due to this.
    I've never actually called myself a statistician because my master's degree is in Management Science, but I took over 30 graduate hours in statistics and have worked as a statistical programmer for many years. I'm currently taking the Coursera Data Science course, and am very surprised that there are so many who wish to be "data scientists" with hardly any training in statistics at all. I don't really understand how they will be able to do important work like medical breakthroughs, but perhaps that is not what they desire to do. Even with the simplest of things - say, putting every variable on the same scale - how can such a person function with no statistics background, when they might not even be aware of a z-score?
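    For anyone unfamiliar with the term, a z-score just puts a variable on a common, unitless scale by subtracting the mean and dividing by the standard deviation. A minimal sketch in Python (the heights here are made up purely for illustration):

    ```python
    # Standardize a list of values to z-scores: subtract the mean,
    # divide by the (sample) standard deviation.
    def z_scores(values):
        n = len(values)
        mean = sum(values) / n
        sd = (sum((v - mean) ** 2 for v in values) / (n - 1)) ** 0.5
        return [(v - mean) / sd for v in values]

    heights_cm = [150, 160, 170, 180, 190]
    print(z_scores(heights_cm))
    # approximately [-1.26, -0.63, 0.0, 0.63, 1.26] - centered at 0, spread of 1
    ```

    After this transformation, every variable has mean 0 and standard deviation 1, so variables measured in different units can be compared directly.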

  • kynn says:

    I'm a bit skeptical about the idea that this is a new phenomenon. Fisher wrote "Statistical Methods for Research Workers" in the 1920s, aimed at non-statisticians, and it was enormously successful (14 editions). Many books aimed at the same audience have followed. For close to a century now, researchers in the biological and social sciences have been expected to be capable of carrying out basic statistical analyses. Even if only a relatively small fraction of them actually do, it would not surprise me if the sum total of all that data analysis trumps the amount done by card-carrying statisticians...

  • StevenS123 says:

    And this is precisely why you should give Data Analysis MOOC another run.
