Data scientist is just a sexed up word for statistician

A couple of cool things happened at this year's JSM.

  1. Twitter adoption went way up and it was much easier for people (like me) who weren't there to keep track of all the action by monitoring the #JSM2013 hashtag.
  2. Nate Silver gave the keynote and about a million statisticians showed up.

Nate Silver is hands down the rockstar of our field. I mean, no other statistician changing jobs would make the news at the Times, at ESPN, and on pretty much every other major news source.

Silver's talk at JSM focused on 11 principles of statistical journalism, which are covered really nicely here by Joseph Rickert from Revolution. After his talk, he answered questions tweeted from the audience. He brought the house down (I'm sure in person, but definitely on Twitter) with a perfectly weighted response to a question about data scientists versus statisticians:

Data scientist is just a sexed up word for statistician

Of course statisticians love to hear this, but data scientists didn't necessarily agree.

I've talked about the statistician/data scientist divide before and how I think that we need better marketing as statisticians. I think it is telling that some of the very accomplished, very successful people tweeting about Nate's quote are uncomfortable being labeled statistician. The reason, I think, is that statisticians have a reputation for focusing primarily on theory and not being willing to do the schlep.

I do think there is some cachet to having the "hot job title" but eventually solving real problems matters more. Which leads me to my favorite part of Nate's quote, the part that isn't getting nearly as much play as it should:

Just do good work and call yourself whatever you want.

I think that as statisticians we should embrace a "big tent" approach to labeling. But rather than making it competitive by saying data scientists aren't that great, they are just "sexed up" statisticians, we should make it inclusive: "data scientists are statisticians because being a statistician is awesome and anyone who does cool things with data is a statistician". People who build websites, or design graphics, or make reproducible documents, or build pipelines, or hack low-level data are all statisticians and we should respect them all for their unique skills.

  • Hilary Mason

    It's always fun to read a post and see one of your tweets included!

    I'd like to explain my comment a bit. I'm a computer scientist by training who explores data and builds algorithms, systems, and products around data. I use statistics in my practice, but would never claim to be an expert statistician.

    One of the things I love about the term "data scientist" is that it allows for people to come to the practice of learning things from data with a variety of different backgrounds, relative strengths, and eventual goals.

    • jtleek

      First of all, thanks for reading, it is awesome to have you as part of the discussion!

      My point wasn't to disparage the people who made those comments or to disparage the term data scientist. I just am passionate about statisticians taking a "big tent" approach to calling people part of our community. It is something I think we have struggled with as a community.

      When you say "expert statistician" I think that has connotations (likely created by our community's use of that term) that I'm not sure I'm comfortable with. I'd prefer people with statistical skill who use data in their day-to-day job to be comfortable calling themselves statisticians as much as data scientists.

      In any case, I think that the right way to go is positive regardless. The people I admire have many different labels, but what they have in common is that they tackle important, hard problems in science with computation and statistics.

      • Hilary Mason

        +1 to the positive framing of statistics!

  • Thomas Lumley

    In the other direction, 'data scientist' is arguably just a subfield of statistics -- though I suppose you could subdivide like physics into 'experimental data science' and 'theoretical data science'. There's definitely too much uninteresting theory, but there's also important theory that gives you tools for thinking about how to do things with data.

    • Abhijit Dasgupta

      Thomas, perhaps the broadest subdivision would be "applied data science". So much of what is going on in business is not experimental in any sense, but observational.

      The relationship of statistics as a field to the scaled-up data sizes we have today is reflected in Davidian's recent blog post about statistics being about "small data". The relevance of the hypothesis-testing framework, which most people associate with the field of statistics (or perhaps all that is remembered from that statistics course oh so long ago), is increasingly in doubt when sample sizes are this large. Statistics is so much more, but if the image of statistics is stuck at H0 and p-values, it is no wonder that the role of statisticians is being doubted for this age of vacuum-cleaner data collection.

  • Abhijit Dasgupta

    I think you hit on a great point that the connotations of the term "statistics" are not positive, hence some reaction to Silver's statement. I think the "non-statisticians" harken back to college where "sadistics" reigned. I note, smilingly, Drew's tweet about "self-ID'd statisticians". That is really the point. So many of the data scientists are not formally trained statisticians but quantitatively oriented people trying to make sense of their data in their domain: they're sociologists, psychologists, business people, computer scientists, and the like (basically what Hilary is also saying). I remember back in grad school getting into an argument about "applied mathematics" vs "statistics". There has been a silo'ing of fields, which is unfortunate. The quantitative sciences and arts are really a big tent from which we all should borrow. The history of statistics, like that of other fields, is full of examples of this.

    The current practices to me are more akin to engineering than formal science. We don't really have a good handle even in statistics as to what the best practices are and should be; that's where a lot of the theoretical statisticians and computer scientists have a role to play. However, we have a need (for business reasons or scientific reasons or just to write another paper) to find methods, computational or methodological, which address the nature of the data and find signal within that data, in hopefully some sort of reproducible fashion. I suspect a lot of what's going on is not reproducible, but that doesn't mean it isn't valuable. It will help in learning what works and what doesn't, and what can find signal rather than false signal within noise, and slowly things will sort themselves out. Till then, to echo Silver, we just try to "do good work".

  • Julian Wolfson

    I think the "small tent" attitude of many statisticians is a result of seeing a lot of bad statistics being done by people who consider themselves competent in the field without adequate training. The worry about 'data science', then, is that it will attract a lot of folks with no statistical expertise who will become the de facto statistical 'experts'.

    In my view, the reason 'data science is just sexed up statistics' was such an applause line isn't because statisticians see no distinction between the type of work they do and that being done at Google/Facebook/etc., but because they feel that statistical training should still be the foundation for such work.

    I wonder if 'data science' will end up following a similar path to bioinformatics, which emerged in the '90s looking like it was going to be its own field requiring a unique set of skills (and accompanying training programs). Now, some programs offer training in using the *tools* of the discipline, and graduates mostly work in mid-level implementation-oriented jobs, while the major advances in the field are being made by specialists in more traditional disciplines such as statistics and computer science. Data science could end up much the same, with training providing the workers necessary to "do the schlep", but the thought leaders mostly following more traditional paths and then gaining an interest in the specific problems of the discipline.

  • Mike Hunter

    The distinctions between data scientists and statisticians are real and involve fundamentally different toolkits and training, at least among practitioners. Type the words "statistics" vs "data scientist" into any job search engine -- the results should show qualitatively different job types, with "statistics" positions concentrated in pharma industries using SAS and "data scientist" positions in new media working with big data using Hadoop, Hive, Mahout, etc. Different industries...different tools and training required.

    • isomorphisms

      Agree. DS involves more programming and databases, and less statistics, than pure statistics does.

  • Daniel D. Gutierrez

    Ah gee, I'm all sexed up? I'll take that characterization. But seriously, I like the "scientist" in Data Scientist very much because I really do think it is quite accurate in how I approach a project - it is like I'm in a lab experimenting with data and my lab equipment is a laptop running R. I approach all my machine learning projects in exactly that manner. Do I wear a lab coat? No, but you can carry out The Scientific Method in your mind, like Einstein did with his thought experiments. I'm just pleased that my field has evolved in this direction. After all, computer "science" is still a science.