Simply Statistics: If you were going to write a paper about the false discovery rate you should have done it in 2002

People often talk about academic superstars as people who have written highly cited papers. Some of that has to do with people’s genius, or ability, or whatever. But one factor that I think sometimes gets lost is luck and timing. So I wrote a little script to get the first 30 papers that appear when you search Google Scholar for the terms:

empirical processes
proportional hazards model
generalized linear model
semiparametric
generalized estimating equation
false discovery rate
microarray statistics
lasso shrinkage
rna-seq statistics

Google Scholar sorts by relevance, but that relevance is driven to a large degree by citations. For example, the first 10 papers you get when you search for “false discovery rate” are:

Controlling the false discovery rate: a practical and powerful approach to multiple testing
Thresholding of statistical maps in functional neuroimaging using the false discovery rate
The control of the false discovery rate in multiple testing under dependency
Controlling the false discovery rate in behavior genetics research
Identifying differentially expressed genes using false discovery rate controlling procedures
The positive false discovery rate: A Bayesian interpretation and the q-value
On the adaptive control of the false discovery rate in multiple testing with independent statistics
Implementing false discovery rate control: increasing your power
Operating characteristics and extensions of the false discovery rate procedure
Adaptive linear step-up procedures that control the false discovery rate

People who work in this area will recognize that many of these papers are the most important/most cited in the field.

Now we can make a plot that shows for each term when these 30 highest ranked papers appear. There are some missing values, because of the way the data are scraped, but this plot gives you some idea of when the most cited papers on these topics were published:

You can see from the plot that the median publication year of the top 30 hits for “empirical processes” was 1990 and for “RNA-seq statistics” was 2010. The values for all of the topics were:

Emp. Proc. 1990.241
Prop. Haz. 1990.929
GLM 1994.433
Semi-param. 1994.433
GEE 2000.379
FDR 2002.760
microarray 2003.600
lasso 2004.900
rna-seq 2010.765
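
For anyone who wants to reproduce this kind of summary, here is a minimal sketch in Python. It is not the script used for this post. The function name `summarize_years`, the term labels, and the years in `hypothetical_hits` are all made up for illustration, and missing years are assumed to be stored as `None`. One thing worth noting: a median of whole-number years is always a whole number or ends in .5, so fractional values like the ones listed above look more like averages over the non-missing years. The sketch reports both so the two can be compared.

```python
from statistics import mean, median


def summarize_years(years):
    """Return (median, mean, n_used) for a list of publication years.

    Missing years (None) are dropped before summarizing, mirroring the
    missing values mentioned in the post.
    """
    known = [y for y in years if y is not None]
    if not known:
        raise ValueError("no publication years available")
    return median(known), mean(known), len(known)


# Placeholder data only: these are NOT the scraped Google Scholar results.
hypothetical_hits = {
    "term A": [1995, 1999, None, 2001, 2003, 2002],
    "term B": [2008, None, 2010, 2011, None, 2012],
}

for term, years in hypothetical_hits.items():
    med, avg, n = summarize_years(years)
    print(f"{term}: median={med}, mean={avg:.3f}, papers with a year={n}")
```

With real data, each list would hold the publication years of the top 30 hits for one search term.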

I think this pretty much matches up with the intuition most people have about the relative timing of fields, with a few exceptions (GEE in particular seems a bit late). There are a bunch of reasons this analysis isn’t perfect, but it does suggest that luck and timing in choosing a problem can play a major role in the “success” of academic work as measured by citations. It also suggests a reason for success in science other than individual brilliance. Given the potentially negative consequences the expectation of brilliance has on certain subgroups, it is important to recognize the importance of timing and luck. The median most cited “false discovery rate” paper was published in 2002, but almost none of the 30 top hits were published after about 2008.

The code for my analysis is here. It is super hacky so have mercy.