

Data Scientist vs. Statistician

There’s an interesting discussion over at reddit on the difference between a data scientist and a statistician. My crude summary is that by and large they are the same, but the phrase “data scientist” is just the hip new name for statistician and will probably sound stupid 5 years from now.

My question is why isn’t “statistician” hip? The comments don’t address that much (although a few go in that direction). There are a few interesting comments about computing. For example, from ByteMining:

Statisticians typically don’t care about performance or coding style as long as it gets a result. A loop within a loop within a loop is all the same as an O(1) lookup.
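To make the contrast in that comment concrete, here is a minimal Python sketch. It is my own illustration, not something from the reddit thread, and the function names and toy data are hypothetical. Both functions match a list of keys against a small table, one by rescanning the table for every key and the other by building a dictionary once.

```python
def match_nested(keys, table):
    """Scan the whole table for every key: O(len(keys) * len(table))."""
    out = []
    for k in keys:
        for name, value in table:
            if name == k:
                out.append(value)
                break  # stop at the first match for this key
    return out


def match_lookup(keys, table):
    """Build a dict once; each lookup is then O(1) on average."""
    index = dict(table)  # assumes names in the table are unique
    return [index[k] for k in keys if k in index]


# Toy data (hypothetical): "z" is absent from the table and is skipped.
table = [("a", 1), ("b", 2), ("c", 3)]
keys = ["c", "a", "z"]
same = match_nested(keys, table) == match_lookup(keys, table)
```

On three rows the difference is invisible, which is presumably why it is easy not to care; it only starts to matter as the table grows.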

Another more down-to-earth comment comes from marshallp:

There is a real distinction between data scientist and statistician

  • the statistician spent years banging his/her head against blackboards full of math notation to get a modestly paid job

  • the data scientist gets s—loads of cash after having learnt a scripting language and an api

More people should be encouraged into data science and not pointless years of stats classes

Not sure I fully agree but I see where he’s coming from!

[Note: See also our post on how to determine whether you are a data scientist.]


[youtube http://www.youtube.com/watch?v=V-hFORcBj44]

The History of Nonlinear Principal Components Analysis, a lecture given by Jan de Leeuw. For those who have ~45 minutes to spare, it’s a very nice talk given in Jan’s characteristic style.


Coarse PM and measurement error paper

Howard Chang, a former PhD student of mine now at Emory, just published a paper on a measurement error model for estimating the health effects of coarse particulate matter (PM). This is a cool paper that deals with the problem that coarse PM tends to be very spatially heterogeneous. Coarse PM is a bit of a hot topic now because there is currently no national ambient air quality standard for coarse PM specifically. There is a standard for fine PM, but the scientific evidence for health effects of coarse PM is less developed than it is for fine PM.

When you want to assign a coarse PM exposure level to people in a county (assuming you don’t have personal monitoring), there is a fair amount of uncertainty about the assignment because of the spatial variability. This is in contrast to pollutants like fine PM or ozone, which tend to be more spatially smooth. Standard approaches essentially ignore the uncertainty, which may lead to some bias in estimates of the health effects.

Howard developed a measurement error model that uses observations from multiple monitors to estimate the spatial variability and correct for it in time series regression models estimating the health effects of coarse PM. Another nice thing about his approach is that it avoids any complex spatial-temporal modeling to do the correction.
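To give a feel for why ignoring the uncertainty matters, here is a rough Python simulation. To be clear, this is not Howard's model: it is the textbook classical measurement error setup, swapped in as a deliberately simplified stand-in, and every name and parameter value below is an assumption chosen for illustration. Each day has one true exposure, several monitors each observe it with independent noise, and the outcome depends on the true exposure. Regressing on the monitor average gives a slope that is biased toward zero; the spread across monitors on the same day estimates the error variance, which can then be used to undo the attenuation.

```python
import random
import statistics


def slope(x, y):
    """Ordinary least-squares slope of y on x."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    return sxy / sxx


def simulate(n_days=2000, n_monitors=3, beta=1.0, sigma_u=1.5, seed=0):
    """Simulated data: all parameter values here are made up."""
    rng = random.Random(seed)
    true_x = [rng.gauss(0, 1) for _ in range(n_days)]
    # Each monitor sees the true exposure plus independent noise.
    monitors = [[x + rng.gauss(0, sigma_u) for _ in range(n_monitors)]
                for x in true_x]
    # The outcome depends on the true (unobserved) exposure.
    y = [beta * x + rng.gauss(0, 1) for x in true_x]
    return monitors, y


def naive_and_corrected(monitors, y):
    n_monitors = len(monitors[0])
    daily_mean = [statistics.fmean(day) for day in monitors]
    naive = slope(daily_mean, y)
    # Within-day variance across monitors estimates the error variance.
    sigma_u2 = statistics.fmean([statistics.variance(day) for day in monitors])
    var_mean = statistics.variance(daily_mean)
    # Share of the variance in the daily mean that is true signal.
    reliability = (var_mean - sigma_u2 / n_monitors) / var_mean
    return naive, naive / reliability


monitors, y = simulate()
naive, corrected = naive_and_corrected(monitors, y)
print("naive slope:", round(naive, 3))
print("corrected slope:", round(corrected, 3))
```

With a true slope of 1.0 in this simulation, the naive estimate should land well below 1 and the corrected one close to it. The real problem is harder, since the monitor-to-monitor differences for coarse PM reflect actual spatial variation and the health models are time series regressions, but the basic idea of using multiple monitors to estimate the variability and correct for it is the same.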

Related Posts: Jeff on “Cool papers” and “Dissecting the genomics of trauma”


Do we really need applied statistics journals?

All statisticians in academia are constantly confronted with the question of where to publish their papers. Sometimes it’s obvious: A theoretical paper might go to the Annals of Statistics or JASA Theory & Methods or Biometrika. A more “methods-y” paper might go to JASA or JRSS-B or Biometrics or maybe even Biostatistics (where all three of us are or have been associate editors).

But where should the applied papers go? I think this is an increasingly large category of papers being produced by statisticians. These are papers that do not necessarily develop a brand new method or uncover any new theory, but apply statistical methods to an interesting dataset in a not-so-obvious way. Some papers might combine a set of existing methods that have never been combined before in order to solve an important scientific problem.

Well, there are some official applied statistics journals: JASA Applications & Case Studies or JRSS-C or Annals of Applied Statistics. At least they have the word “application” or “applied” in their title. But the question we should be asking is this: if a paper is published in one of those journals, will it reach the right audience?

What is the audience for an applied stat paper? Perhaps it depends on the subject matter. If the application is biology, then maybe biologists. If it’s an air pollution and health application, maybe environmental epidemiologists. My point is that the key audience is probably not a bunch of other statisticians.

The fundamental conundrum of applied stat papers comes down to this question: If your application of statistical methods is truly addressing an important scientific question, then shouldn’t the scientists in the relevant field want to hear about it? If the answer is yes, then we have two options: Force other scientists to read our applied stat journals, or publish our papers in their journals. There doesn’t seem to be much momentum for the former, but the latter is already being done rather frequently. 

Across a variety of fields we see statisticians making direct contributions to science by publishing in non-statistics journals. Some examples are this recent paper in Nature Genetics or a paper I published a few years ago in the Journal of the American Medical Association. I think there are two key features that these papers (and many others like them) have in common:

  • There was an important scientific question addressed. The first paper investigates variability of methylated regions of the genome and its relation to cancer tissue and the second paper addresses the problem of whether ambient coarse particles have an acute health effect. In both cases, scientists in the respective substantive areas were interested in the problem and so it was natural to publish the “answer” in their journals. 
  • The problem was well-suited to be addressed by statisticians. Both papers involved large and complex datasets for which training in data analysis and statistics was important. In the analysis of coarse particles and hospitalizations, we used a national database of air pollution concentrations and obtained health status data from Medicare. Linking these two databases together and conducting the analysis required enormous computational effort and statistical sophistication. While I doubt we were the only people who could have done that analysis, we were very well-positioned to do so. 

So when statisticians are confronted by scientific problems that are both (1) important and (2) well-suited for statisticians, what should we do? My feeling is we should skip the applied statistics journals and bring the message straight to the people who want/need to hear it.

There are two problems that come to mind immediately. First, sometimes the paper ends up being so statistically technical that a scientific journal won’t accept it. Second, in academia, there is the sticky problem of how you get promoted in a statistics department when your CV is filled with papers in non-statistics journals. This entry is already long enough, so I’ll address these issues in a future post.

Related Posts: Rafa on “Where are the Case Studies?” and “Authorship Conventions”