Simply Statistics: Data analysis subcultures

Roger and I responded to the controversy around the journal that banned p-values today in Nature. A piece like this requires a lot of information packed into very little space but I thought one idea that deserved to be talked about more was the idea of data analysis subcultures. From the paper:

Data analysis is taught through an apprenticeship model, and different disciplines develop their own analysis subcultures. Decisions are based on cultural conventions in specific communities rather than on empirical evidence. For example, economists call data measured over time ‘panel data’, to which they frequently apply mixed-effects models. Biomedical scientists refer to the same type of data structure as ‘longitudinal data’, and often go at it with generalized estimating equations.

I think this is one of the least appreciated components of modern data analysis. Data analysis is almost entirely taught through an apprenticeship culture with completely different behaviors taught in different disciplines. All of these disciplines agree about the mathematical optimality of specific methods under very specific conditions. That is why you see methods like randomized trials Roger and I responded to the controversy around the journal that banned p-values today [in Nature.](http://www.nature.com/news/statistics-p-values-are-just-the-tip-of-the-iceberg-1.17412) A piece like this requires a lot of information packed into very little space but I thought one idea that deserved to be talked about more was the idea of data analysis subcultures. From the paper: across multiple disciplines.

But any real data analysis is always a multi-step process involving data cleaning and tidying, exploratory analysis, model fitting and checking, summarization and communication. If you gave someone from economics, biostatistics, statistics, and applied math an identical data set they’d give you back very different reports on what they did, why they did it, and what it all meant. Here are a few examples I can think of off the top of my head:

Economics calls longitudinal data panel data and uses mostly linear mixed effects models, while generalized estimating equations are more common in biostatistics (this is the example from Roger/my paper).
In genome wide association studies the family wise error rate is the most common error rate to control. In gene expression studies people frequently use the false discovery rate.
This is changing a bit, but if you learned statistics at Duke you are probably a Bayesian and if you learned at Berkeley you are probably a frequentist.
Psychology has a history of using parametric statistics, genomics is big into empirical Bayes, and you see a lot of Bayesian statistics in climate studies.
You see [Roger and I responded to the controversy around the journal that banned p-values today in Nature. A piece like this requires a lot of information packed into very little space but I thought one idea that deserved to be talked about more was the idea of data analysis subcultures. From the paper:

Data analysis is taught through an apprenticeship model, and different disciplines develop their own analysis subcultures. Decisions are based on cultural conventions in specific communities rather than on empirical evidence. For example, economists call data measured over time ‘panel data’, to which they frequently apply mixed-effects models. Biomedical scientists refer to the same type of data structure as ‘longitudinal data’, and often go at it with generalized estimating equations.

Economics calls longitudinal data panel data and uses mostly linear mixed effects models, while generalized estimating equations are more common in biostatistics (this is the example from Roger/my paper).
In genome wide association studies the family wise error rate is the most common error rate to control. In gene expression studies people frequently use the false discovery rate.
This is changing a bit, but if you learned statistics at Duke you are probably a Bayesian and if you learned at Berkeley you are probably a frequentist.
Psychology has a history of using parametric statistics, genomics is big into empirical Bayes, and you see a lot of Bayesian statistics in climate studies.
You see](http://en.wikipedia.org/wiki/White_test) used a lot in econometrics, but that is hardly ever done through formal hypothesis testing in biostatistics.
Training sets and test sets are used in machine learning for prediction, but rarely used for inference.

This is just a partial list I thought of off the top of my head, there are a ton more. These decisions matter a lot in a data analysis. The problem is that the behavioral component of a data analysis is incredibly strong, no matter how much we’d like to think of the process as mathematico-theoretical. Until we acknowledge that the most common reason a method is chosen is because, “I saw it in a widely-cited paper in journal XX from my field” it is likely that little progress will be made on resolving the statistical problems in science.