Tag: Theory


Sunday data/statistics link roundup (1/6/2013)

  1. Not really statistics, but this is an interesting article about how rational optimization by individual actors does not always lead to an optimal solutiohn. Related, ere is the coolest street sign I think I've ever seen, with a heatmap of traffic density to try to influence commuters.
  2. An interesting paper that talks about how clustering is only a really hard problem when there aren't obvious clusters. I was a little disappointed in the paper, because it defines the "obviousness" of clusters only theoretically by a distance metric. There is very little discussion of the practical distance/visual distance metrics people use when looking at clustering dendograms, etc.
  3. A post about the two cultures of statistical learning and a related post on how data-driven science is a failure of imagination. I think in both cases, it is worth pointing out that the only good data science is good science - i.e. it seeks to answer a real, specific question through the scientific method. However, I think for many modern scientific problems it is pretty naive to think we will be able to come to a full, mechanistic understanding complete with tidy theorems that describe all the properties of the system. I think the real failure of imagination is to think that science/statistics/mathematics won't change to tackle the realistic challenges posed in solving modern scientific problems.
  4. A graph that shows the incredibly strong correlation ( > 0.99!) between the growth of autism diagnoses and organic food sales. Another example where even really strong correlation does not imply causation.
  5. The Buffalo Bills are going to start an advanced analytics department (via Rafa and Chris V.), maybe they can take advantage of all this free play-by-play data from years of NFL games.
  6. A prescient interview with Isaac Asimov on learning, predicting the Kahn Academy, MOOCs and other developments in online learning (via Rafa and Marginal Revolution).
  7. The statistical software signal - what your choice of software says about you. Just another reason we need a deterministic statistical machine.



Cleveland's (?) 2001 plan for redefining statistics as "data science"

This plan has been making the rounds on Twitter and is being attributed to William Cleveland in 2001 (thanks to Kasper for the link). I’m not sure of the provenance of the document but it has some really interesting ideas and is worth reading in its entirety. I actually think that many Biostatistics departments follow the proposed distribution of effort pretty closely. 

One of the most interesting sections is the discussion of computing (emphasis mine): 

Data analysis projects today rely on databases, computer and network hardware, and computer and network software. A collection of models and methods for data analysis will be used only if the collection is implemented in a computing environment that makes the models and methods sufficiently efficient to use. In choosing competing models and methods, analysts will trade effectiveness for efficiency of use.


This suggests that statisticians should look to computing for knowledge today, just as data science looked to mathematics in the past.

I also found the theory section worth a read and figure it will definitely lead to some discussion: 

Mathematics is an important knowledge base for theory. It is far too important to take for granted by requiring the same body of mathematics for all. Students should study mathematics on an as-needed basis.


Not all theory is mathematical. In fact, the most fundamental theories of data science are distinctly nonmathematical. For example, the fundamentals of the Bayesian theory of inductive inference involve nonmathematical ideas about combining information from the data and information external to the data. Basic ideas are conveniently expressed by simple mathematical expressions, but mathematics is surely not at issue.