Cleveland’s (?) 2001 plan for redefining statistics as “data science”


This plan has been making the rounds on Twitter and has been attributed to William Cleveland in 2001 (thanks to Kasper for the link). I’m not sure of the provenance of the document, but it contains some really interesting ideas and is worth reading in its entirety. I actually think that many Biostatistics departments follow the proposed distribution of effort pretty closely.

One of the most interesting sections is the discussion of computing (emphasis mine): 

Data analysis projects today rely on databases, computer and network hardware, and computer and network software. A collection of models and methods for data analysis will be used only if the collection is implemented in a computing environment that makes the models and methods sufficiently efficient to use. In choosing competing models and methods, analysts will trade effectiveness for efficiency of use.


This suggests that statisticians should look to computing for knowledge today, just as statistics looked to mathematics in the past.

I also found the theory section worth a read, and I figure it will definitely lead to some discussion:

Mathematics is an important knowledge base for theory. It is far too important to take for granted by requiring the same body of mathematics for all. Students should study mathematics on an as-needed basis.


Not all theory is mathematical. In fact, the most fundamental theories of data science are distinctly nonmathematical. For example, the fundamentals of the Bayesian theory of inductive inference involve nonmathematical ideas about combining information from the data and information external to the data. Basic ideas are conveniently expressed by simple mathematical expressions, but mathematics is surely not at issue.
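A quick gloss of my own (not part of the document): the "simple mathematical expressions" Cleveland alludes to here amount to Bayes' rule, which combines information external to the data (the prior) with information from the data (the likelihood):

$$
p(\theta \mid y) \;=\; \frac{p(y \mid \theta)\, p(\theta)}{\int p(y \mid \theta')\, p(\theta')\, d\theta'} \;\propto\; p(y \mid \theta)\, p(\theta)
$$

The formula is the easy part; the distinctly nonmathematical content is the decision to encode external information as a prior $p(\theta)$ in the first place, which I take to be exactly his point.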