Data analysis subcultures

Jeff Leek
2015-04-29

Roger and I responded to the controversy around the journal that banned p-values today in Nature. A piece like this requires a lot of information packed into very little space but I thought one idea that deserved to be talked about more was the idea of data analysis subcultures. From the paper:

Data analysis is taught through an apprenticeship model, and different disciplines develop their own analysis subcultures. Decisions are based on cultural conventions in specific communities rather than on empirical evidence. For example, economists call data measured over time ‘panel data’, to which they frequently apply mixed-effects models. Biomedical scientists refer to the same type of data structure as ‘longitudinal data’, and often go at it with generalized estimating equations.

I think this is one of the least appreciated components of modern data analysis. Data analysis is almost entirely taught through an apprenticeship culture with completely different behaviors taught in different disciplines. All of these disciplines agree about the mathematical optimality of specific methods under very specific conditions. That is why you see methods like randomized trials Roger and I responded to the controversy around the journal that banned p-values today [in Nature.](http://www.nature.com/news/statistics-p-values-are-just-the-tip-of-the-iceberg-1.17412) A piece like this requires a lot of information packed into very little space but I thought one idea that deserved to be talked about more was the idea of data analysis subcultures. From the paper: across multiple disciplines.

But any real data analysis is always a multi-step process involving data cleaning and tidying, exploratory analysis, model fitting and checking, summarization and communication. If you gave someone from economics, biostatistics, statistics, and applied math an identical data set they’d give you back very different reports on what they did, why they did it, and what it all meant. Here are a few examples I can think of off the top of my head:

Data analysis is taught through an apprenticeship model, and different disciplines develop their own analysis subcultures. Decisions are based on cultural conventions in specific communities rather than on empirical evidence. For example, economists call data measured over time ‘panel data’, to which they frequently apply mixed-effects models. Biomedical scientists refer to the same type of data structure as ‘longitudinal data’, and often go at it with generalized estimating equations.

I think this is one of the least appreciated components of modern data analysis. Data analysis is almost entirely taught through an apprenticeship culture with completely different behaviors taught in different disciplines. All of these disciplines agree about the mathematical optimality of specific methods under very specific conditions. That is why you see methods like randomized trials Roger and I responded to the controversy around the journal that banned p-values today [in Nature.](http://www.nature.com/news/statistics-p-values-are-just-the-tip-of-the-iceberg-1.17412) A piece like this requires a lot of information packed into very little space but I thought one idea that deserved to be talked about more was the idea of data analysis subcultures. From the paper: across multiple disciplines.

But any real data analysis is always a multi-step process involving data cleaning and tidying, exploratory analysis, model fitting and checking, summarization and communication. If you gave someone from economics, biostatistics, statistics, and applied math an identical data set they’d give you back very different reports on what they did, why they did it, and what it all meant. Here are a few examples I can think of off the top of my head:

This is just a partial list I thought of off the top of my head, there are a ton more. These decisions matter a lot in a data analysis.  The problem is that the behavioral component of a data analysis is incredibly strong, no matter how much we’d like to think of the process as mathematico-theoretical. Until we acknowledge that the most common reason a method is chosen is because, “I saw it in a widely-cited paper in journal XX from my field” it is likely that little progress will be made on resolving the statistical problems in science.