Thoughts on David Donoho’s “Fifty Years of Data Science”

Roger Peng

Note: This post was originally published as part of a collection of discussion pieces on David Donoho’s paper. The original paper and collection of discussions can be found at the JCGS web site.

Professor Donoho’s commentary comes at a perfect time, given that, according to his own chronology, we are just about due for another push to “widen the tent” of statistics to include a broader array of activities. Looking back at the efforts of people like Tukey, Cleveland, and Chambers to broaden the meaning of statistics, I would argue that to some extent their efforts have failed. If you look at a textbook for a course in a typical PhD program in statistics today, I believe it would look much like the textbooks used by Cleveland, Chambers, and Tukey in their own studies. In fact, it might even be the same textbook! Progress has been slow, in my opinion. But why is that? Why can’t statistics grow to more fully embrace the many activities that Professor Donoho describes in his six divisions of “Greater Data Science”?

One frustration that I think many statisticians have in discussions of data science is that if you look at each of the six divisions that Donoho lays out—data exploration, data transformation, computing, modeling, visualization, and science of data science—statisticians do all those things. But the truth is, we do not teach most of them. Over the years, the teaching of statistics has expanded to include topics like computing and visualization, but typically as optional or ancillary courses. Many of the activities on Donoho’s list are things that students are assumed to “figure out on their own” without any formal instruction. Asymptotic theory, on the other hand, requires formal instruction.

In my own experience, the biggest challenge to teaching the areas of Greater Data Science is that teaching them is difficult and can be very inefficient. Ultimately, I believe these are the reasons that we as a field choose not to teach this material. Many of the areas in the six divisions, such as data exploration and transformation, can be frustratingly difficult to generalize. If I clean up administrative claims data from Medicare and link them to air pollution data from the Environmental Protection Agency, does any of the knowledge I gain from those activities apply to the processing of RNA-seq data and linking it with clinical phenotypes? It’s difficult to see any connection. On the other hand, both datasets will likely serve as inputs to a generalized linear model. Rather than teach one course on cleaning administrative claims data and another course on processing RNA-seq data, consider how many birds can be hit with the three stones of an exponential family, a link function, and a linear predictor. Furthermore, the behavior of generalized linear models can be analyzed mathematically to make incredibly useful predictions about the variability of their estimates.
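For readers who want the three stones written out, one standard textbook formulation is sketched below. The notation ($\theta_i$, $\phi$, $a$, $b$, $c$, $g$, $W$) is an assumption chosen for illustration and is not taken from Donoho's paper.

```latex
% Exponential family: density of each response y_i
f(y_i;\theta_i,\phi) = \exp\!\left\{ \frac{y_i\theta_i - b(\theta_i)}{a(\phi)} + c(y_i,\phi) \right\},
\qquad \mu_i = \mathrm{E}[Y_i] = b'(\theta_i)

% Link function g and linear predictor \eta_i
g(\mu_i) = \eta_i = x_i^{\top}\beta

% Large-sample behavior of the maximum likelihood estimate
\hat{\beta} \;\overset{\text{approx}}{\sim}\; N\!\left(\beta,\; (X^{\top} W X)^{-1}\right),
\qquad W = \operatorname{diag}\!\left( \frac{1}{\operatorname{Var}(Y_i)\,[g'(\mu_i)]^{2}} \right)
```

The last line is the kind of mathematical result the paragraph refers to: the same approximate variance formula applies whether the rows of $X$ came from claims data or from RNA-seq data.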

The lack of a formal framework for “data cleaning” reduces the teaching of the subject to a parade of special cases. While each case might be of interest to some, it’s unlikely that any case would be applicable to all. In any institution of higher learning with finite resources, it’s impossible to provide formal instruction on all the special cases to everybody who needs them. It’s much more efficient to teach the generalized linear model and the central limit theorem.

Is the lack of a formal framework for some areas of data science attributable to some fundamental aspect of those topics, or does it arise simply from a lack of trying? In my opinion, the evidence to date lays the blame on our field’s traditional bias towards the use of mathematics as the principal tool for analysis. Indeed, much of the interesting formal work being done in data cleaning and transformation makes use of a completely different toolbox, one largely drawing from computer science and software engineering. Because of this different toolbox, our field has been blinded to recent developments and has missed an important opportunity to cultivate more academic leaders in this area.

Professor Donoho rightly highlights the work of Hadley Wickham and Yihui Xie, both statisticians, who have made seminal contributions to the field of statistics in their development of ggplot2, knitr, dplyr, and many other packages for R. It is notable that Wickham’s paper outlining the concept of “tidy data”, a concept which has sparked a minor revolution in the field of data analysis, was originally published in the Journal of Statistical Software, a nominally “applied” journal. I would argue that such a paper more properly belongs in the Annals of Statistics than in a software journal. The formal framework offered in that paper has inspired the creation of numerous “tidy data” approaches to analyzing data that have proven remarkably simple to use and understand. The “grammar” outlined in Wickham’s paper and implemented in the dplyr package serves as an abstraction whose usefulness has been demonstrated in a variety of situations. The lack of mathematical notation in the presentation of dplyr or any of its “tidyverse” relatives does not make it less useful nor does it make it less broadly applicable.
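To make the abstraction concrete, here is a minimal sketch of the tidy-data idea in plain Python rather than R. The dataset, the column names, and the helper functions `to_tidy` and `group_mean` are all hypothetical, invented for illustration; they are not dplyr's API, only a rough analogue of reshaping followed by a group-and-summarise step.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical "wide" table: one row per site, one column per year.
# The years are values of a variable, but here they are stored as column names.
wide = [
    {"site": "A", "2019": 12.0, "2020": 10.0},
    {"site": "B", "2019": 8.0, "2020": 9.0},
]


def to_tidy(rows, id_col, var_name, value_name):
    """Reshape wide rows so that each output row is a single observation."""
    tidy_rows = []
    for row in rows:
        for key, value in row.items():
            if key == id_col:
                continue
            tidy_rows.append({id_col: row[id_col], var_name: key, value_name: value})
    return tidy_rows


def group_mean(rows, by, value):
    """Group rows by one column and average another column within each group."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[by]].append(row[value])
    return {key: mean(values) for key, values in groups.items()}


# Once the data are tidy, the same verb works for any grouping variable.
tidy = to_tidy(wide, id_col="site", var_name="year", value_name="pm25")
mean_by_year = group_mean(tidy, by="year", value="pm25")
mean_by_site = group_mean(tidy, by="site", value="pm25")
```

The point of the sketch is that `group_mean` knows nothing about sites, years, or pollution; after the one-time reshaping, the same operation applies to any table laid out with one observation per row and one variable per column.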

Finally, it is worth a comment that the people that Professor Donoho cites as driving previous pushes to widen the tent of statistics either did not initially come from academia or at least straddled the boundary. Cleveland, Chambers, and Tukey all spent significant time in industry and government settings and that experience no doubt colored their perspectives. Moving to today, it might just be a coincidence that both Wickham and Xie are employed outside of academia by RStudio, but I doubt it. Perhaps it is always the case that the experts come from somewhere else. However, I fear that academic statistics will miss out on the opportunity to recruit bright people who are making contributions in a manner not familiar to many of us.