Simply Statistics


If you were going to write a paper about the false discovery rate you should have done it in 2002

People often talk about academic superstars as people who have written highly cited papers. Some of that has to do with people's genius, or ability, or whatever. But one factor that I think sometimes gets lost is luck and timing. So I wrote a little script to get the first 30 papers that appear when you search Google Scholar for the terms:

  • empirical processes
  • proportional hazards model
  • generalized linear model
  • semiparametric
  • generalized estimating equation
  • false discovery rate
  • microarray statistics
  • lasso shrinkage
  • rna-seq statistics

Google Scholar sorts by relevance, but that relevance is driven to a large degree by citations. For example, if you look at the first 10 papers you get for searching for false discovery rate you get.

  • Controlling the false discovery rate: a practical and powerful approach to multiple testing
  • Thresholding of statistical maps in functional neuroimaging using the false discovery rate
  • The control of the false discovery rate in multiple testing under dependency
  • Controlling the false discovery rate in behavior genetics research
  • Identifying differentially expressed genes using false discovery rate controlling procedures
  • The positive false discovery rate: A Bayesian interpretation and the q-value
  • On the adaptive control of the false discovery rate in multiple testing with independent statistics
  • Implementing false discovery rate control: increasing your power
  • Operating characteristics and extensions of the false discovery rate procedure
  • Adaptive linear step-up procedures that control the false discovery rate

People who work in this area will recognize that many of these papers are the most important/most cited in the field.

Now we can make a plot that shows for each term when these 30 highest ranked papers appear. There are some missing values, because of the way the data are scraped, but this plot gives you some idea of when the most cited papers on these topics were published:



You can see from the plot that the median publication year of the top 30 hits for "empirical processes" was 1990 and for "RNA-seq statistics" was 2010. The medians for the other topics were:

  • Emp. Proc. 1990.241
  • Prop. Haz. 1990.929
  • GLM 1994.433
  • Semi-param. 1994.433
  • GEE 2000.379
  • FDR 2002.760
  • microarray 2003.600
  • lasso 2004.900
  • rna-seq 2010.765

I think this pretty much matches up with the intuition most people have about the relative timing of fields, with a few exceptions (GEE in particular seems a bit late). There are a bunch of reasons this analysis isn't perfect, but it does suggest that luck and timing in choosing a problem can play a major role in the "success" of academic work as measured by citations.  It also suggests another reason for success in science than individual brilliance. Given the potentially negative consequences the expectation of brilliance has on certain subgroups, it is important to recognize the importance of timing and luck. The median most cited "false discovery rate" paper was 2002, but almost none of the 30 top hits were published after about 2008.

The code for my analysis is here. It is super hacky so have mercy.


How to find the science paper behind a headline when the link is missing

I just saw a pretty wild statistic on Twitter that less than 60% of university news releases link to the papers they are describing.


Before you believe anything your read about science in the news, you need to go and find the original article.  When the article isn't linked in the press release, sometimes you need to do a bit of sleuthing.  Here is an example of how I do it for a news article. In general the press-release approach is very similar, but you skip the first step I describe below.

Here is the news article (link):


Screen Shot 2015-01-15 at 1.11.22 PM



Step 1: Look for a link to the article

Usually it will be linked near the top or the bottom of the article. In this case, the article links to the press release about the paper. This is not the original research article. If you don't get to a scientific journal you aren't finished. In this case, the press release actually gives the full title of the article, but that will happen less than 60% of the time according to the statistic above.


Step 2: Look for names of the authors, scientific key words and journal name if available

You are going to use these terms to search in a minute. In this case the only two things we have are the journal name:
Untitled presentation (2)


And some key words:


Untitled presentation (3)


Step 3 Use Google Scholar

You could just google those words and sometimes you get the real paper, but often you just end up back at the press release/news article. So instead the best way to find the article is to go to Google Scholar then click on the little triangle next to the search box.




Untitled presentation (4)

Fill in information while you can. Fill in the same year as the press release, information about the journal, university and key words.


Screen Shot 2015-01-15 at 1.31.38 PM


Step 4 Victory

Often this will come up with the article you are looking for:

Screen Shot 2015-01-15 at 1.32.45 PM


Unfortunately, the article may be paywalled, so if you don't work at a university or institute with a subscription, you can always tweet the article name with the hashtag #icanhazpdf and your contact info. Then you just have to hope that someone will send it to you (they often do).




Beast mode parenting as shown by my Fitbit data

This weekend was one of those hardcore parenting weekends that any parent of little kids will understand. We were up and actively taking care of kids for a huge fraction of the weekend. (Un)fortunately I was wearing my Fitbit, so I can quantify exactly how little we were sleeping over the weekend.

Here is Saturday:




There you can see that I rocked about midnight-4am without running around chasing a kid or bouncing one to sleep. But Sunday was the real winner:



Check that out. I was totally asleep from like 4am-6am there. Nice.

Stay tuned for much more from my Fitbit data over the next few weeks.




Sunday data/statistics link roundup (1/4/15)

  1. I am digging this visualization of your life in weeks. I might have to go so far as to actually make one for myself.
  2. I'm very excited about the new podcast TalkingMachines and what an awesome name! I wish someone would do that same thing for applied statistics (Roger?)
  3. I love that they call Ben Goldacre the anti-Dr. Oz in this piece, especially given how often Dr. Oz is telling the truth.
  4. If you haven't read it yet, this piece in the Economist on statisticians during the war is really good.
  5. The arXiv celebrated it's 1M paper upload. It costs less to run than the top 2 executives at PLoS make. It is too darn expensive to publish open access right now.

Ugh ... so close to one million page views for 2014

In my last Sunday Links roundup I mentioned we were going to be really close to 1 million page views this year. Chris V. tried to rally the troops:



but alas we are probably not going to make it (unless by some miracle one of our posts goes viral in the next 12 hours):



Stay tuned for a bunch of cool new stuff from Simply Stats in 2015, including a new podcasting idea, more interviews, another unconference, and a new plotting theme!


Sunday data/statistics link roundup (12/21/14)

James Stewart, author of the most popular Calculus textbook in the world, passed away. In case you wonder if there is any money in textbooks, he had a $32 million house in Toronto. Maybe I should get out of MOOCs and into textbooks.

  1. This post on medium about a new test for causality is making the rounds.  The authors of the original paper make clear their assumptions make the results basically unrealistic for any real analysis for example:"We simplify the causal discovery problem by assuming no confounding, selection bias and feedback." The medium article is too bold and as I replied to an economist who tweeted there was a new test that could distinguish causality: "Nope".
  2. I'm excited that the Rafa + the ASA have started a section on Genomics and Genetics. It is nice to have a place to belong within our community. I hope it can be a place where folks who aren't into the hype (a lot of those in genomics), but really care about applications, can meet each other and work together.
  3. Great essay by Hanna W. about data, machine learning and fairness. I love this quote: "in order to responsibly articulate and address issues relating to bias, fairness, and inclusion, we need to stop thinking of big data sets as being homogeneous, and instead shift our focus to the many diverse data sets nested within these larger collections." (via Hilary M.)
  4. Over at Flowing Data they ran down the best data visualizations of the year.
  5. This rant from Dirk E. perfectly encapsulates every annoying thing about the Julia versus R comparisons I see regularly.
  6. We are tantalizingly close to 1 million page views for the year for Simply Stats. Help get us over the edge, share your favorite simply stats article with all your friends using the hashtag #simplystats1e6

Interview with Emily Oster

Emily Oster
Emily Oster is an Associate Professor of Economics at Brown University. She is a frequent and highly respected contributor to 538 where she brings clarity to areas of interest to parents, pregnant woman, and the general public where empirical research is conflicting or difficult to interpret. She is also the author of the popular new book about pregnancy: Expecting Better: Why the Conventional Pregnancy Wisdom Is Wrong--and What You Really Need to KnowWe interviewed Emily as part of our ongoing interview series with exciting empirical data scientists. 
SS: Do you consider yourself an economist, econometrician, statistician, data scientist or something else?
EO: I consider myself an empirical economist. I think my econometrics colleagues would have a hearty laugh at the idea that I'm an econometrician! The questions I'm most interested in tend to have a very heavy empirical component - I really want to understand what we can learn from data. In this sense, there is a lot of overlap with statistics. But at the end of the day, the motivating questions and the theories of behavior I want to test come straight out of economics.
SS: You are a frequent contributor to 538. Many of your pieces are attempts to demystify often conflicting sets of empirical research (about concussions and suicide, or the dangers of water flouridation). What would you say are the issues that make empirical research about these topics most difficult?
EO: In nearly all the cases, I'd summarize the problem as : "The data isn't good enough." Sometimes this is because we only see observational data, not anything randomized. A large share of studies using observational data that I discuss have serious problems with either omitted variables or reverse causality (or both).  This means that the results are suggestive, but really not conclusive.  A second issue is even when we do have some randomized data, it's usually on a particular population, or a small group, or in the wrong time period. In the flouride case, the studies which come closest to being "randomized" are from 50 years ago. How do we know they still apply now?  This makes even these studies challenging to interpret.
SS: Your recent book "Expecting Better: Why the Conventional Pregnancy Wisdom Is Wrong--and What You Really Need to Know" takes a similar approach to pregnancy. Why do you think there are so many conflicting studies about pregnancy? Is it because it is so hard to perform randomized studies?
EO: I think the inability to run randomized studies is a big part of this, yes. One area of pregnancy where the data is actually quite good is labor and delivery. If you want to know the benefits and consequences of pain medication in labor, for example, it is possible to point you to some reasonably sized randomized trials. For various reasons, there has been more willingness to run randomized studies in this area. When pregnant women want answers to less medical questions (like, "Can I have a cup of coffee?") there is typically no randomized data to rely on. Because the possible benefits of drinking coffee while pregnant are pretty much nil, it is difficult to conceptualize a randomized study of this type of thing.
Another big issue I found in writing the book was that even in cases where the data was quite good, data often diverges from practice. This was eye-opening for me and convinced me that in pregnancy (and probably in other areas of health) people really do need to be their own advocates and know the data for themselves.
SS: Have you been surprised about the backlash to your book for your discussion of the zero-alcohol policy during pregnancy? 
EO: A little bit, yes. This backlash has died down a lot as pregnant women actually read the book and use it. As it turns out, the discussion of alcohol makes up a tiny fraction of the book and most pregnant women are more interested in the rest of it!  But certainly when the book came out this got a lot of focus. I suspected it would be somewhat controversial, although the truth is that every OB I actually talked to told me they thought it was fine. So I was surprised that the reaction was as sharp as it was.  I think in the end a number of people felt that even if the data were supportive of this view, it was important not to say it because of the concern that some women would over-react. I am not convinced by this argument.
SS: What are the three most important statistical concepts for new mothers to know? 
EO: I really only have two!
I think the biggest thing is to understand the difference between randomized and non-randomized data and to have some sense of the pittfalls of non-randomized data. I reviewed studies of alcohol where the drinkers were twice as likely as non-drinkers to use cocaine. I think people (pregnant or not) should be able to understand why one is going to struggle to draw conclusions about alcohol from these data.
A second issue is the concept of probability. It is easy to say, "There is a 10% chance of the following" but do we really understand that? If someone quotes you a 1 in 100 risk from a procedure, it is important to understand the difference between 1 in 100 and 1 in 400.  For most of us, those seem basically the same - they are both small. But they are not, and people need to think of ways to structure decision-making that acknowledge these differences.
SS: What computer programming language is most commonly taught for data analysis in economics? 
EO: So, I think the majority of empirical economists use Stata. I have been seeing more R, as well as a variety of other things, but more commonly among people who do heavier computational fields.
SS: Do you have any advice for young economists/statisticians who are interested in empirical research? 
1. Work on topics that interest you. As an academic you will ultimately have to motivate yourself to work. If you aren't interested in your topic (at least initially!), you'll never succeed.
2. One project which is 100% done is way better than five projects at 80%. You need to actually finish things, something which many of us struggle with.
3. Presentation matters. Yes, the substance is the most important thing, but don't discount the importance of conveying your ideas well.

Repost: Statistical illiteracy may lead to parents panicking about Autism

Editor's Note: This is a repost of a previous post on our blog from 2012. The repost is inspired by similar issues with statistical illiteracy that are coming up in allergy screening and pregnancy screening

I just was doing my morning reading of a few news sources and stumbled across this Huffington Post article talking about research correlating babies cries to autism. It suggests that the sound of a babies cries may predict their future risk for autism. As the parent of a young son, this obviously caught my attention in a very lizard-brain, caveman sort of way. I couldn't find a link to the research paper in the article so I did some searching and found out this result is also being covered by Time, Science Daily, Medical Daily, and a bunch of other news outlets.

Now thoroughly freaked, I looked online and found the pdf of the original research article. I started looking at the statistics and took a deep breath. Based on the analysis they present in the article there is absolutely no statistical evidence that a babies' cries can predict autism. Here are the flaws with the study:

  1. Small sample size. The authors only recruited 21 at risk infants and 18 healthy infants. Then, because of data processing issues, only ended up analyzing 7 high autistic risk versus 5 low autistic-risk in one analysis and 10 versus 6 in another. That is no where near a representative sample and barely qualifies as a pilot study.
  2. Major and unavoidable confounding. The way the authors determined high autistic risk versus low risk was based on whether an older sibling had autism. Leaving aside the quality of this metric for measuring risk of autism, there is a major confounding factor: the families of the high risk children all had an older sibling with autism and the families of the low risk children did not! It would not be surprising at all if children with one autistic older sibling might get a different kind of attention and hence cry differently regardless of their potential future risk of autism.
  3. No correction for multiple testing. This is one of the oldest problems in statistical analysis. It is also one that is a consistent culprit of false positives in epidemiology studies. XKCD even did a cartoon about it! They tested 9 variables measuring the way babies cry and tested each one with a statistical hypothesis test. They did not correct for multiple testing. So I gathered resulting p-values and did the correction for them. It turns out that after adjusting for multiple comparisons, nothing is significant at the usual P < 0.05 level, which would probably have prevented publication.

Taken together, these problems mean that the statistical analysis of these data do not show any connection between crying and autism.

The problem here exists on two levels. First, there was a failing in the statistical evaluation of this manuscript at the peer review level. Most statistical referees would have spotted these flaws and pointed them out for such a highly controversial paper. A second problem is that news agencies report on this result and despite paying lip-service to potential limitations, are not statistically literate enough to point out the major flaws in the analysis that reduce the probability of a true positive. Should journalists have some minimal in statistics that allows them to determine whether a result is likely to be a false positive to save us parents a lot of panic?



A non-comprehensive list of awesome things other people did in 2014

Editor's Note: Last year I made a list off the top of my head of awesome things other people did. I loved doing it so much that I'm doing it again for 2014. Like last year, I have surely missed awesome things people have done. If you know of some, you should make your own list or add it to the comments! The rules remain the same. I have avoided talking about stuff I worked on or that people here at Hopkins are doing because this post is supposed to be about other people's awesome stuff. I wrote this post because a blog often feels like a place to complain, but we started Simply Stats as a place to be pumped up about the stuff people were doing with data. Update: I missed pipes in R, now added!


  1. I'm copying everything about Jenny Bryan's amazing Stat 545 class in my data analysis classes. It is one of my absolute favorite open online set of notes on data analysis.
  2. Ben Baumer, Mine Cetinkaya-Rundel, Andrew Bray, Linda Loi, Nicholas J. Horton wrote this awesome paper on integrating R markdown into the curriculum. I love the stuff that Mine and Nick are doing to push data analysis into undergrad stats curricula.
  3. Speaking of those folks, the undergrad guidelines for stats programs put out by the ASA do an impressive job of balancing the advantages of statistics and the excitement of modern data analysis.
  4. Somebody tell Hector Corrada Bravo to stop writing so many awesome papers. He is making us all look bad. His epiviz paper is great and you should go start using the Bioconductor package if you do genomics.
  5. Hilary Mason founded fast forward labs. I love the business model of translating cutting edge academic (and otherwise) knowledge to practice. I am really pulling for this model to work.
  6. As far as I can tell 2014 was the year that causal inference become the new hotness. One example of that is this awesome paper from the Google folks on trying to infer causality from related time series. The R package has some cool features too. I definitely am excited to see all the new innovation in this area.
  7. Hadley was Hadley.
  8. Rafa and Mike taught an awesome class on data analysis for genomics. They also created a book on Github that I think is one of the best introductions to the statistics of genomics that exists so far.
  9. Hilary Parker wrote this amazing introduction to writing R packages that took the twitterverse by storm. It is perfectly written for people who are just at the point of being able to create their own R package. I think it probably generated 100+ R packages just by being so easy to follow.
  10. Oh you're not reading StatsChat yet? For real?
  11. FiveThirtyEight launched. Despite some early bumps they have done some really cool stuff. Loved the recent piece on the beer mile and I read every piece that Emily Oster writes. She does an amazing job of explaining pretty complicated statistical topics to a really broad audience.
  12. David Robinson's broom package is one of my absolute favorite R packages that was built this year. One of the most annoying things about R is the variety of outputs different models give and this tidy version makes it really easy to do lots of neat stuff.
  13. Chung and Storey introduced the jackstraw which is both a very clever idea and the perfect name for a method that can be used to identify variables associated with principal components in a statistically rigorous way.
  14. I rarely dig excel-type replacements, but the simplicity of makes me love it. It does one thing and one thing really well.
  15. The hipsteR package for teaching old R dogs new tricks is one of the many cool things Karl Broman did this year. I read all of his tutorials and never cease to learn stuff. In related news if I was 1/10th as organized as that dude I'd actually you know, get stuff done.
  16. Whether I agree with them or not that they should be allowed to do unregulated human subjects research, statistics at tech companies, and in particular randomized experiments have never been hotter. The boldest of the bunch is OKCupid who writes blog posts with titles like, "We experiment on human beings!"
  17. In related news, I love the PlanOut project by the folks over at Facebook, so cool to see an open source approach to experimentation at web scale.
  18. No wonder Mike Jordan (no not that Mike Jordan) is such a superstar. His reddit AMA raised my respect for him from already super high levels. First, its awesome that he did it, and second it is amazing how well he articulates the relationship between CS and Stats.
  19. I'm trying to figure out a way to get Matthew Stephens to write more blog posts. He teased us with the Dynamic Statistical Comparisons post and then left us hanging. The people demand more Matthew.
  20. Di Cook also started a new blog in 2014. She was also part of this cool exploratory data analysis event for the UN. They have a monster program going over there at Iowa State, producing some amazing research and a bunch of students that are recognizable by one name (Yihui, Hadley, etc.).
  21. Love this paper on sure screening of graphical models out of Daniela Witten's group at UW. It is so cool when a simple idea ends up being really well justified theoretically, it makes the world feel right.
  22. I'm sure this actually happened before 2014, but the Bioconductor folks are still the best open source data science project that exists in my opinion. My favorite development I started using in 2014 is the git-subversion bridge that lets me update my Bioc packages with pull requests.
  23. rOpenSci ran an awesome hackathon. The lineup of people they invited was great and I loved the commitment to a diverse group of junior R programmers. I really, really hope they run it again.
  24. Dirk Eddelbuettel and Carl Boettiger continue to make bigtime contributions to R. This time it is Rocker, with Docker containers for R. I think this could be a reproducibility/teaching gamechanger.
  25. Regina Nuzzo brought the p-value debate to the masses. She is also incredible at communicating pretty complicated statistical ideas to a broad audience and I'm looking forward to more stats pieces by her in the top journals.
  26. Barbara Engelhardt keeps rocking out great papers. But she is also one of the best AE's I have ever had handle a paper for me at PeerJ. Super efficient, super fair, and super demanding. People don't get enough credit for being amazing in the peer review process and she deserves it.
  27. Ben Goldacre and Hans Rosling continue to be two of the best advocates for statistics and the statistical discipline - I'm not sure either claims the title of statistician but they do a great job anyway. This piece about Professor Rosling in Science gives some idea about the impact a statistician can have on the most current problems in public health. Meanwhile, I think Dr. Goldacre does a great job of explaining how personalized medicine is an information science in this piece on statins in the BMJ.
  28. Michael Lopez's series of posts on graduate school in statistics should be 100% required reading for anyone considering graduate school in statistics. He really nails it.
  29.  Trey Causey has an equally awesome Getting Started in Data Science post that I read about 10 times.
  30. Drop everything and go read all of Philip Guo's posts. Especially this one about industry versus academia or this one on the practical reason to do a PhD.
  31. The top new Twitter feed of 2014 has to be @ResearchMark (incidentally I'm still mourning the disappearance of @STATSHULK).
  32. Stephanie Hicks' blog combines recipes for delicious treats and statistics, also I thought she had a great summary of the Women in Stats (#WiS2014) conference.
  33. Emma Pierson is a Rhodes Scholar who wrote for 538, 23andMe, and a bunch of other major outlets as an undergrad. Her blog, is another must read. Here is an example of her awesome work on how different communities ignored each other on Twitter during the Ferguson protests.
  34. The Rstudio crowd continues to be on fire. I think they are a huge part of the reason that R is gaining momentum. It wouldn't be possible to list all their contributions (or it would be an Rstudio exclusive list) but I really like Packrat and R markdown v2.
  35. Another huge reason for the movement with R has been the outreach and development efforts of the Revolution Analytics folks. The Revolutions blog has been a must read this year.
  36. Julian Wolfson and Joe Koopmeiners at University of Minnesota are straight up gamers. They live streamed their recruiting event this year. One way I judge good ideas is by how mad I am I didn't think of it and this one had me seeing bright red.
  37. This is just an awesome paper comparing lots of machine learning algorithms on lots of data sets. Random forests wins and this is a nice update of one of my favorite papers of all time: Classifier technology and the illusion of progress.
  38. Pipes in R! This stuff is for real. The piping functionality created by Stefan Milton and Hadley is one of the few inventions over the last several years that immediately changed whole workflows for me.


I'll let @ResearchMark take us out:


Sunday data/statistics link roundup (12/14/14)

  1. A very brief analysis suggests that economists are impartial when it comes to their liberal/conservative views. That being said, I'm not sure the regression line says what they think it does, particularly if you pay attention to the variance around the line (via Rafa).
  2. I am digging the simplicity of from the folks at Medium. But I worry about spurious correlations everywhere. I guess I should just let that ship sail.
  3. FiveThirtyEight does a run down of the beer mile. If they set up a data crunchers beer mile, we are in.
  4. I love it when Thomas Lumley interviews himself about silly research studies and particularly their associated press releases. I can actually hear his voice in my head when I read them. This time the lipstick/IQ silliness gets Lumleyed.
  5. Jordan was better than Kobe. Surprise. Plus Rafa always takes the Kobe bait.
  6. Matlab/Python/R translation cheat sheet (via Stephanie H.).
  7. If I've said it once, I've said it a million times, statistical thinking is now as important as reading and writing. The latest example is parents not understanding the difference between sensitivity and the predictive value of a positive may be leading to unnecessary abortions (via Dan M./Rafa).