Simply Statistics


Caffo's Theorem

Brian Caffo from the comments:

Personal theorem: the application of statistics in any new field will be labeled “technical-sounding word” + ics. Examples: Sabermetrics, analytics, econometrics, neuroinformatics, bioinformatics, informatics, chemometrics. 

It’s like how adding mayonnaise to anything turns it into a salad (e.g., egg salad, tuna salad, ham salad, pasta salad, …)

I’d like to be the first to propose the statistical study of turning things into salad: so-called mayonnaisics.

Related Posts: Caffo + Ninjas = Awesome


Do we really need applied statistics journals?

All statisticians in academia are constantly confronted with the question of where to publish their papers. Sometimes it’s obvious: A theoretical paper might go to the Annals of Statistics or JASA Theory & Methods or Biometrika. A more “methods-y” paper might go to JASA or JRSS-B or Biometrics or maybe even Biostatistics (where all three of us are or have been associate editors).

But where should the applied papers go? I think this is an increasingly large category of papers being produced by statisticians. These are papers that do not necessarily develop a brand new method or uncover any new theory, but apply statistical methods to an interesting dataset in a not-so-obvious way. Some papers might combine a set of existing methods that have never been combined before in order to solve an important scientific problem.

Well, there are some official applied statistics journals: JASA Applications & Case Studies or JRSS-C or Annals of Applied Statistics. At least they have the word “application” or “applied” in their title. But the question we should be asking is if a paper is published in one of those journals, will it reach the right audience?

What is the audience for an applied stat paper? Perhaps it depends on the subject matter. If the application is biology, then maybe biologists. If it’s an air pollution and health application, maybe environmental epidemiologists. My point is that the key audience is probably not a bunch of other statisticians.

The fundamental conundrum of applied stat papers comes down to this question: If your application of statistical methods is truly addressing an important scientific question, then shouldn’t the scientists in the relevant field want to hear about it? If the answer is yes, then we have two options: Force other scientists to read our applied stat journals, or publish our papers in their journals. There doesn’t seem to be much momentum for the former, but the latter is already being done rather frequently. 

Across a variety of fields we see statisticians making direct contributions to science by publishing in non-statistics journals. Some examples are this recent paper in Nature Genetics and a paper I published a few years ago in the Journal of the American Medical Association. I think there are two key features that these papers (and many others like them) have in common:

  • There was an important scientific question addressed. The first paper investigates variability of methylated regions of the genome and its relation to cancer tissue and the second paper addresses the problem of whether ambient coarse particles have an acute health effect. In both cases, scientists in the respective substantive areas were interested in the problem and so it was natural to publish the “answer” in their journals. 
  • The problem was well-suited to be addressed by statisticians. Both papers involved large and complex datasets for which training in data analysis and statistics was important. In the analysis of coarse particles and hospitalizations, we used a national database of air pollution concentrations and obtained health status data from Medicare. Linking these two databases together and conducting the analysis required enormous computational effort and statistical sophistication. While I doubt we were the only people who could have done that analysis, we were very well-positioned to do so. 

So when statisticians are confronted by scientific problems that are both (1) important and (2) well-suited for statisticians, what should we do? My feeling is that we should skip the applied statistics journals and bring the message straight to the people who want/need to hear it.

There are two problems that come to mind immediately. First, sometimes the paper ends up being so statistically technical that a scientific journal won’t accept it. And of course, in academia, there is the sticky problem of how you get promoted in a statistics department when your CV is filled with papers in non-statistics journals. This entry is already long enough so I’ll address these issues in a future post.

Related Posts: Rafa on “Where are the Case Studies?” and “Authorship Conventions”


Spectacular Plots Made Entirely in R

When doing data analysis, I often create a set of plots quickly just to explore the data and see what the general trends are. Later I go back and fiddle with the plots to make them look pretty for publication. But some people have taken this to the next level. Here are two plots made entirely in R:

The descriptions of how they were created are here and here.

Related: Check out Roger’s post on R colors and my post on APIs


Caffo + Ninjas = Awesome

Our colleague Brian Caffo and his team of statistics ninjas won the “Imaging-Based Diagnostic Classification Contest” as part of the ADHD 200 Global Competition. From the prize citation:

The method developed by the team from Johns Hopkins University excelled in its specificity, or its ability to identify typically developing children (TDC) without falsely classifying them as ADHD-positive. They correctly classified 94% of TDC, showing that a diagnostic imaging methodology can be developed with a very low risk of false positives, a fantastic result. Their method was not as effective in terms of sensitivity, or its ability to identify true positive ADHD diagnoses. They only identified 21% of cases; however, among those cases, they discerned the subtypes of ADHD with 89.47% accuracy. Other teams demonstrated that there is ample room to improve sensitivity scores. 

Congratulations to Brian and his team!


Colors in R

One of my favorite R packages that I use all the time is the RColorBrewer package. The package has been around for a while now and is written/maintained by Erich Neuwirth. The guts of the package are based on Cynthia Brewer’s very cool work on the use of color in cartography (check out the colorbrewer web site).

As a side note, I think the ability to manipulate colors in plots/graphs/maps is one of R’s many great strengths. My personal experience is that getting the right color scheme can make a difference in how data are perceived in a plot.

RColorBrewer basically provides one function, brewer.pal, that generates different types of color palettes. There are three types of palettes: sequential, diverging, and qualitative. Roughly speaking, sequential palettes are for continuous data where low is less important and high is more important, diverging palettes are for continuous data where both low and high are important (i.e. deviation from some reference point), and qualitative palettes are for categorical data where there is no logical order (e.g. male/female).
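As a quick sketch of the three types (the palette names below are standard RColorBrewer palettes), you can pull one of each with brewer.pal:

```r
library(RColorBrewer)

# One palette of each type, four colors apiece
brewer.pal(4, "BuPu")  # sequential: light to dark blue-purple
brewer.pal(4, "RdBu")  # diverging: red through white to blue
brewer.pal(4, "Set1")  # qualitative: distinct hues, no inherent order
```

Each call returns a character vector of hexadecimal color strings that you can pass straight to the ‘col’ argument of most plotting functions.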

To use the brewer.pal function, it’s often useful to combine it with another R function, colorRampPalette. This function is built into R and is part of the grDevices package. It takes a palette of colors and interpolates between the colors to give you an entire spectrum. Think of a painter’s palette with 4 or 5 color blotches on it, and then think of the painter taking a brush and blending the colors together. That’s what colorRampPalette does. So brewer.pal gives you the colors and colorRampPalette mashes them together. It’s a happy combination.

So, how do we use these functions? My basic approach is to first set the palette depending on the type of data. Suppose we have continuous sequential data and we want the “Blue-Purple” palette:

library(RColorBrewer)
colors <- brewer.pal(4, "BuPu")

Here, I’ve taken 4 colors from the “BuPu” palette, so there are now 4 blotches on my palette. To interpolate these colors, I can call colorRampPalette, which actually returns a function.

pal <- colorRampPalette(colors)

Now, pal is a function that takes a positive integer argument and returns that number of colors from the palette. So for example

> pal(5)
[1] "#EDF8FB" "#C1D7E9" "#9FB1D4" "#8B80BB" "#88419D"

I got 5 different colors from the palette, with their red, green, and blue values coded in hexadecimal. If I wanted 20 colors I could have called pal(20).

The pal function is useful in other functions like image or wireframe (in the lattice package). In both of those functions, the ‘col’ argument can be given a set of colors generated by the pal function. For example, you could call

image(volcano, col = pal(30))

and you would plot the ‘volcano’ data using 30 colors from the “BuPu” palette.

If you’re wondering what all the different palettes are and what colors are in them, you can just call

display.brewer.all()

which plots every palette in the package as a handy reference.
There’s been a lot of interesting work done on colors in R and this is just scratching the surface. I’ll probably return to this subject in a future post.


Where would we be without Dennis Ritchie?

Most have probably seen this already since it happened a few days ago, but Dennis Ritchie died. It just blows my mind how influential his work was (developing the C language, Unix) and how many pieces of technology bear his fingerprints. 

My first encounter with K&R was in college, when I learned C programming in the “Data Structures and Programming Techniques” class at Yale (taught by Stan “the man” Eisenstadt). Looking back, the book seems fairly easy to read and understand, but I must have cursed it a million times when I took that course!


Interview With Daniela Witten

Note: This is the first in a series of posts where we will be interviewing junior, up-and-coming statisticians/data scientists. Our goal is to build visibility for people who are at the early stages of their careers. 

Daniela Witten

Daniela is an assistant professor of Biostatistics at the University of Washington in Seattle. She moved to Seattle after getting her Ph.D. at Stanford. Daniela has been developing exciting new statistical methods for analyzing high dimensional data and is a recipient of the NIH Director’s Early Independence Award.

Which term applies to you: data scientist/statistician/analyst?

Statistician! We have to own the term. Some of us have a tendency to try to sugarcoat what we do. But I say that I’m a statistician with pride! It means that I have been rigorously trained, that I have a broadly applicable skill set, and that I’m always open to new and interesting problems. Also, I sometimes get surprised reactions from people at cocktail parties, which is funny.

To the extent that there is a stigma associated with being a statistician, we statisticians need to face the problem and overcome it. The future of our field depends on it.

How did you get into statistics/data science?

I definitely did not set out to become a statistician. Before I got to college, I was planning to study foreign languages. Like most undergrads, I changed my mind, and eventually I majored in biology and math. I spent a summer in college doing experimental biology, but quickly discovered that I had neither the hand-eye coordination nor the patience for lab work. When I was nearing the end of college, I wasn’t sure what was next. I wanted to go to grad school, but I didn’t want to commit to one particular area of study for the next five years and potentially for my entire career. 

I was lucky to be at Stanford and to stumble upon the Stat department there. Initially, statistics appealed to me because it was a good way to combine my interests in math and biology from the safety of a computer terminal instead of a lab bench. After spending more time in the department, I realized that if I studied statistics, I could develop a broad skill set that could be applied to a variety of areas, from cancer research to movie recommendations to the stock market.

What is the problem currently driving you?

My research involves the development of statistical methods for the analysis of very large data sets. Recently, I’ve been interested in better understanding networks and their applications to biology. In the past few years there has been a lot of work in the statistical community on network estimation, or graphical modeling. In parallel, biologists have been interested in taking network-based approaches to understanding large-scale biological data sets. There is a real need for these two areas of research to be brought closer together, so that statisticians can develop useful tools for rigorous network-based analysis of biological data sets.

For example, the standard approach for analyzing a gene expression data set with samples from two classes (like cancer and normal tissue) involves testing each gene for differential expression between the two classes, for instance using a two-sample t-statistic. But we know that an individual gene does not drive the differences between cancer and normal tissue; rather, sets of genes work together in pathways in order to have an effect on the phenotype. Instead of testing individual genes for differential expression, can we develop an approach to identify aspects of the gene network that are perturbed in cancer?

What are the top 3 skills you look for in a student who works with you?

I look for a student who is intellectually curious, self-motivated, and a good personality fit. Intellectual curiosity is a prerequisite for grad school, self-motivation is needed to make it through the 2 years of PhD level coursework and 3 years of research that make up a typical Stat/Biostat PhD, and a good personality fit is needed because grad school is long and sometimes frustrating*, and it’s important to have an advisor who can be a friend along the way!

*but ultimately very rewarding

Who were really good mentors to you? What were the qualities that really helped you?

My PhD advisor, Rob Tibshirani, has been a great mentor. In addition to being a top statistician, he is also an enthusiastic advisor, a tireless advocate for his students, and a loyal friend. I learned from him the value of good collaborations and of simple solutions to complicated problems. I also learned that it is important to maintain a relaxed attitude and to occasionally play pranks on students.

For more information:

Check out her website. Or read her really nice papers on penalized classification and penalized matrix decompositions.


Moneyball for Academic Institutes

A way that universities grow in research fields for which they have no department is by creating institutes. Millions of dollars are invested to promote collaboration between existing faculty interested in the new field. But do they work? Does the university get its investment back? Through the years I have noticed that many institutes are nothing more than a webpage, while others are so successful that they practically become self-sustained entities. This paper (published in STM), led by John Hogenesch, uses data from papers and grants to evaluate an institute at Penn. Among other things, they present a method that uses network analysis to objectively evaluate the effect of the institute on collaboration. The findings are fascinating. 

The use of data to evaluate academics is becoming more and more popular, especially among administrators. Is this a good thing? I am not sure yet, but statisticians had better get involved before a biased analysis gets some of us fired.


Benford's law

Am I the only one who didn’t know about Benford’s law? It says that for many datasets, the probability that the first digit of a random element is d is given by P(d) = log_10(1 + 1/d). This post by Jialan Wang explores financial report data and, using Benford’s law, notices that something fishy is going on… 
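As a quick illustration in R (the lognormal sample below is simulated for the sake of the example, not the financial data from the post), you can compare Benford’s predicted digit frequencies to the observed leading digits of a heavy-tailed sample:

```r
# Benford's law: P(d) = log10(1 + 1/d) for leading digits d = 1, ..., 9
benford <- log10(1 + 1 / (1:9))

# Leading digits of a simulated heavy-tailed (lognormal) sample
set.seed(1)
x <- rlnorm(10000, meanlog = 0, sdlog = 3)
first_digit <- as.integer(substr(formatC(x, format = "e"), 1, 1))
observed <- as.vector(table(factor(first_digit, levels = 1:9))) / length(x)

round(rbind(benford = benford, observed = observed), 3)
```

The benford row puts about 30% of the mass on the digit 1 and under 5% on the digit 9, and the simulated sample tracks it fairly closely; real datasets that deviate sharply from this pattern are what made Wang suspicious.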

Hat tip to David Santiago.

Update: A link has been fixed.