Rafa’s citations above replacement in statistics journals is crazy high.

Jeff Leek

Editor’s note: I thought it would be fun to do some bibliometrics on a Friday. This is super hacky and the CAR/Y stat should not be taken seriously.

I downloaded data from Web of Science on the 400 most cited papers published from 2000 to 2010 in each of several statistics journals. Here is a boxplot of the average number of citations per year (from publication date through 2015) to these papers in the journals Annals of Statistics, Biometrics, Biometrika, Biostatistics, JASA, Journal of Computational and Graphical Statistics, Journal of Machine Learning Research, and Journal of the Royal Statistical Society Series B.




There are several interesting things about this graph right away. One is that JASA has the highest median number of citations, but has fewer “big hits” (papers with 100+ citations/year) than Annals of Statistics, JMLR, or JRSS-B. Another thing is how much of a lottery developing statistical methods seems to be. Most papers, even among the 400 most cited, have around 3 citations/year on average. But a few lucky winners have 100+ citations per year. One interesting thing for me is the papers that get 10 or more citations per year but aren’t huge hits. I suspect these are the papers that solve one problem well but don’t solve the most general problem ever.

Something that jumps out from that plot is the outlier for the journal Biostatistics. One of their papers is cited 367.85 times per year, which is 19 standard deviations above the mean! The next nearest competitor is at 67.75. The paper in question is: “Exploration, normalization, and summaries of high density oligonucleotide array probe level data”, which is the paper that introduced RMA, one of the most popular methods for pre-processing microarrays ever created. It was written by Rafa and colleagues. It made me think of the statistic “wins above replacement”, which quantifies how many extra wins a baseball team gets by playing a specific player in place of a replacement-level player.

What about a “citations/year above replacement” (CAR/Y) statistic where you calculate for each journal:

(Median citations/year for papers by Author X in that journal) - (Median citations/year for all papers in that journal)

Then average this number across journals. This attempts to quantify how many extra citations/year a person’s papers generate compared to the “average” paper in that journal. For Rafa the numbers look like this:

So Rafa’s citations above replacement is (13.62 + 69.3 + 0.95)/3 = 27.96! There are a couple of reasons why this isn’t a completely accurate picture. One is the low sample size. The other is that I only took the 400 most cited papers in each journal. Rafa has a few papers that didn’t make the top 400 for journals like JASA, and including them would bring down his CAR/Y.
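
For anyone who wants to play along, here is a minimal sketch of the CAR/Y calculation described above. It is not the code used for this post. The function name `citations_above_replacement`, the record layout (`journal`, `cites_per_year`, `authors`), and the toy papers are all made up for illustration. It also assumes the journal median is taken over every paper in the sample, including the author's own, and that journals where the author has no papers are skipped.

```python
from statistics import mean, median


def citations_above_replacement(papers, author):
    """Return (CAR/Y, per-journal differences) for one author.

    papers: list of dicts with hypothetical keys 'journal',
    'cites_per_year', and 'authors' (a set of names).
    """
    # Group the sampled papers by journal.
    by_journal = {}
    for paper in papers:
        by_journal.setdefault(paper["journal"], []).append(paper)

    diffs = {}
    for journal, plist in by_journal.items():
        author_rates = [p["cites_per_year"] for p in plist if author in p["authors"]]
        if not author_rates:
            continue  # assumption: skip journals where the author has no papers
        # Assumption: the "replacement" level is the median over all sampled papers.
        journal_median = median(p["cites_per_year"] for p in plist)
        diffs[journal] = median(author_rates) - journal_median

    if not diffs:
        raise ValueError("author has no papers in the sample")
    # Average the per-journal differences across journals.
    return mean(list(diffs.values())), diffs


# Toy data, invented purely to exercise the function.
toy_papers = [
    {"journal": "A", "cites_per_year": 2.0, "authors": {"X"}},
    {"journal": "A", "cites_per_year": 4.0, "authors": {"Y"}},
    {"journal": "A", "cites_per_year": 10.0, "authors": {"X"}},
    {"journal": "B", "cites_per_year": 3.0, "authors": {"Y"}},
    {"journal": "B", "cites_per_year": 5.0, "authors": {"X"}},
    {"journal": "B", "cites_per_year": 1.0, "authors": {"Z"}},
]
toy_cary, toy_diffs = citations_above_replacement(toy_papers, "X")

# The final averaging step, using the three per-journal differences quoted above.
reported_cary = mean([13.62, 69.3, 0.95])
```

With the toy data, author X sits 2 citations/year above the median in both made-up journals, so `toy_cary` comes out to 2. Averaging the three differences quoted above gives `reported_cary`, which rounds to 27.96.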