I think there are three main steps in a data science project: you collect data (and questions), analyze it (using visualization and models), then communicate the results.
and makes the important point that
Any real data analysis involves data manipulation (sometimes called wrangling or munging), visualization and modelling.
The above describes what I have been doing since I became an academic applied statistician about 20 years ago. It describes what several of my colleagues do as well. For example, 15 years ago Karl Broman, in his excellent job talk, covered all the items in Hadley’s definition. The arc of the talk revolved around the scientific problem, not the statistical models. He spent a considerable amount of time describing how the data was acquired and how he used Perl scripts to clean up microsatellite data. More than half his slides contained visualizations, either illustrative cartoons or data plots. This research eventually led to his widely used “data product” R/qtl. Although not described in the talk, Karl used make to help ensure the results were reproducible.
So why then does Hadley think that “Statistics research focuses on data collection and modeling, and there is little work on developing good questions, thinking about the shape of data, communicating results or building data products”? I suspect one reason is that most applied work is published outside the flagship statistical journals. For example, Karl’s work was published in the American Journal of Human Genetics. A second reason may be that most of us academic applied statisticians don’t teach what we do. Despite writing a thesis that involved much data wrangling (reading music AIFF files into S-PLUS) and data visualization (including listening to fitted signals and residuals), the first few courses I taught as an assistant professor were almost solely on GLM theory.
About five years ago I tried changing the Methods course for our PhD students from one focusing on the math behind statistical methods to a problem- and data-driven course. This was not very successful, as many of our students were interested in the mathematical aspects of statistics and did not like the open-ended assignments. Jeff Leek built on that class by incorporating question development, much vaguer problem statements, data wrangling, and peer grading. He also found it challenging to teach the messier parts of applied statistics. Applied work often requires exploration and failure, which can be frustrating for new students.
This story has a happy ending, though. Last year Jeff created a data science Coursera course that enrolled over 180,000 students, with more than 6,000 completing it. This year I am subbing for Joe Blitzstein (talk about big shoes to fill) in CS109, the Data Science undergraduate class that Hanspeter Pfister and Joe created last year at Harvard. We have over 300 students registered, making it one of the largest classes on campus. I am not teaching them GLM theory.
So if you are an experienced applied statistician in academia, consider developing a data science class that teaches students what you do.