Simply Statistics

16 Jan

If you were going to write a paper about the false discovery rate you should have done it in 2002


People often talk about academic superstars as people who have written highly cited papers. Some of that has to do with people's genius, or ability, or whatever. But one factor that I think sometimes gets lost is luck and timing. So I wrote a little script to get the first 30 papers that appear when you search Google Scholar for the terms:

  • empirical processes
  • proportional hazards model
  • generalized linear model
  • semiparametric
  • generalized estimating equation
  • false discovery rate
  • microarray statistics
  • lasso shrinkage
  • rna-seq statistics

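The scrape itself is fragile (and the author's original script was in R, not shown here), but the core step is just pulling a publication year out of each search result's byline. A hypothetical Python sketch, assuming the byline text has already been extracted from the results page:

```python
import re

def extract_year(byline, min_year=1900, max_year=2015):
    # Find all plausible 4-digit years in the byline and return the first
    # one inside the allowed range, or None if there isn't one.
    years = [int(y) for y in re.findall(r"\b(?:19|20)\d{2}\b", byline)]
    years = [y for y in years if min_year <= y <= max_year]
    return years[0] if years else None

# A byline as it appears in a Scholar result for the Benjamini-Hochberg paper:
byline = ("Y Benjamini, Y Hochberg - Journal of the Royal "
          "Statistical Society ..., 1995 - JSTOR")
extract_year(byline)  # 1995
```

Bylines that lack a year (or whose only digits fall outside the range) come back as the missing values mentioned below.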
Google Scholar sorts by relevance, but that relevance is driven to a large degree by citation counts. For example, here are the first 10 results you get when searching for false discovery rate:

  • Controlling the false discovery rate: a practical and powerful approach to multiple testing
  • Thresholding of statistical maps in functional neuroimaging using the false discovery rate
  • The control of the false discovery rate in multiple testing under dependency
  • Controlling the false discovery rate in behavior genetics research
  • Identifying differentially expressed genes using false discovery rate controlling procedures
  • The positive false discovery rate: A Bayesian interpretation and the q-value
  • On the adaptive control of the false discovery rate in multiple testing with independent statistics
  • Implementing false discovery rate control: increasing your power
  • Operating characteristics and extensions of the false discovery rate procedure
  • Adaptive linear step-up procedures that control the false discovery rate

People who work in this area will recognize that many of these papers are the most important/most cited in the field.

Now we can make a plot that shows for each term when these 30 highest ranked papers appear. There are some missing values, because of the way the data are scraped, but this plot gives you some idea of when the most cited papers on these topics were published:

 

[Figure: boxplot of publication years for the top 30 Google Scholar hits for each search term]

You can see from the plot that the median publication year of the top 30 hits for "empirical processes" was 1990, while for "RNA-seq statistics" it was 2010. The medians for all of the topics were:

  • Emp. Proc. 1990.241
  • Prop. Haz. 1990.929
  • GLM 1994.433
  • Semi-param. 1994.433
  • GEE 2000.379
  • FDR 2002.760
  • microarray 2003.600
  • lasso 2004.900
  • rna-seq 2010.765
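Once the years are scraped, the group medians fall out in a line or two. A sketch with made-up years for two of the terms (the numbers below are illustrative, not the actual scraped data):

```python
from statistics import median

# Hypothetical scraped publication years for two search terms
# (illustrative only, not the real scraped data):
years_by_term = {
    "false discovery rate": [1995, 2001, 2002, 2003, 2006],
    "rna-seq statistics": [2008, 2010, 2011, 2012],
}

# One median publication year per search term
medians = {term: median(years) for term, years in years_by_term.items()}
```

Feeding the same per-term lists of years to a boxplot gives the figure above.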

I think this pretty much matches the intuition most people have about the relative timing of these fields, with a few exceptions (GEE in particular seems a bit late). There are a bunch of reasons this analysis isn't perfect, but it does suggest that luck and timing in choosing a problem can play a major role in the "success" of academic work as measured by citations. It also suggests a route to success in science other than individual brilliance. Given the potentially negative consequences that the expectation of brilliance has on certain subgroups, it is important to recognize the role of timing and luck. The median publication year of the most cited "false discovery rate" papers was 2002, and almost none of the top 30 hits were published after about 2008.

The code for my analysis is here. It is super hacky so have mercy.

15 Jan

How to find the science paper behind a headline when the link is missing


I just saw a pretty wild statistic on Twitter that less than 60% of university news releases link to the papers they are describing.
[Embedded tweet]
Before you believe anything you read about science in the news, you need to go and find the original article. When the article isn't linked in the press release, sometimes you need to do a bit of sleuthing. Here is an example of how I do it for a news article. The press-release approach is very similar; you just skip the first step described below.

Here is the news article (link):

 

[Screenshot of the news article]

 

 

Step 1: Look for a link to the article

Usually it will be linked near the top or the bottom of the article. In this case, the article links to the press release about the paper, which is not the original research article. If you haven't reached a scientific journal, you aren't finished. Here the press release actually gives the full title of the article, but according to the statistic above that will happen less than 60% of the time.

 

Step 2: Look for names of the authors, scientific key words and journal name if available

You are going to use these terms to search in a minute. In this case the only two things we have are the journal name:
[Screenshot: the journal name in the press release]

 

And some key words:

 

[Screenshot: key words in the press release]

 

Step 3: Use Google Scholar

You could just google those words and sometimes you get the real paper, but often you just end up back at the press release/news article. So instead the best way to find the article is to go to Google Scholar then click on the little triangle next to the search box.

 

 

 

[Screenshot: the Google Scholar search box with the advanced-search triangle]

Fill in what information you can: the year of the press release, information about the journal and university, and the key words.
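Under the hood, that advanced-search form just fills in URL query parameters (the `as_ylo`, `as_yhi`, `as_publication`, and `as_sauthors` names below are what Scholar's form currently puts in the URL; they are an observation, not a documented API). A sketch of building such a search URL directly, with hypothetical search terms:

```python
from urllib.parse import urlencode

def scholar_search_url(keywords, year, journal=None, author=None):
    # Build a Google Scholar advanced-search URL by hand. The as_*
    # parameter names mirror what the advanced-search form generates;
    # they are assumptions about Scholar's current URLs, not an API.
    params = {"q": keywords, "as_ylo": year, "as_yhi": year}
    if journal:
        params["as_publication"] = journal
    if author:
        params["as_sauthors"] = author
    return "https://scholar.google.com/scholar?" + urlencode(params)

# Hypothetical example: key words, journal, and year from a press release
url = scholar_search_url("mosquito malaria attraction", 2015, journal="PNAS")
```

Pasting the resulting URL into a browser performs the same restricted search as the form.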

 

[Screenshot: the completed Google Scholar advanced search form]

 

Step 4: Victory

Often this will come up with the article you are looking for:

[Screenshot: search results showing the article]

 

Unfortunately, the article may be paywalled, so if you don't work at a university or institute with a subscription, you can always tweet the article name with the hashtag #icanhazpdf and your contact info. Then you just have to hope that someone will send it to you (they often do).

 

 

12 Jan

Statistics and R for the Life Sciences: New HarvardX course starts January 19


The first course of our Biomedical Data Science online curriculum starts next week. You can sign up here. Instead of relying on mathematical formulas to teach statistical concepts, students can program along as we show computer code for simulations that illustrate the main ideas of exploratory data analysis and statistical inference (p-values, confidence intervals, and power calculations, for example). By doing this, students will learn statistics and R simultaneously and will not be bogged down by having to memorize formulas. We have three types of learning modules: lectures (see picture below), screencasts, and assessments. After each video, students will have the opportunity to assess their understanding through homework involving coding in R. A big improvement over the first version is that we have added dozens of assessments.

Note that this course is the first in an eight-part series on Data Analysis for Genomics. Updates will be provided via Twitter @rafalab.

 

[Screenshot of the edX course]

07 Jan

Beast mode parenting as shown by my Fitbit data


This weekend was one of those hardcore parenting weekends that any parent of little kids will understand. We were up and actively taking care of kids for a huge fraction of the weekend. (Un)fortunately I was wearing my Fitbit, so I can quantify exactly how little we were sleeping over the weekend.

Here is Saturday:

[Fitbit sleep chart for Saturday]

 

 

There you can see that I rocked about midnight-4am without running around chasing a kid or bouncing one to sleep. But Sunday was the real winner:

 

[Fitbit sleep chart for Sunday]

Check that out. I was totally asleep from like 4am-6am there. Nice.

Stay tuned for much more from my Fitbit data over the next few weeks.

 

 

04 Jan

Sunday data/statistics link roundup (1/4/15)

  1. I am digging this visualization of your life in weeks. I might have to go so far as to actually make one for myself.
  2. I'm very excited about the new podcast TalkingMachines and what an awesome name! I wish someone would do that same thing for applied statistics (Roger?)
  3. I love that they call Ben Goldacre the anti-Dr. Oz in this piece, especially given how often Dr. Oz is telling the truth.
  4. If you haven't read it yet, this piece in the Economist on statisticians during the war is really good.
  5. The arXiv celebrated its one millionth paper upload. It costs less to run than the top two executives at PLoS make. It is too darn expensive to publish open access right now.
31 Dec

Ugh ... so close to one million page views for 2014


In my last Sunday Links roundup I mentioned we were going to be really close to 1 million page views this year. Chris V. tried to rally the troops:

[Embedded tweet from Chris V.]

but alas we are probably not going to make it (unless by some miracle one of our posts goes viral in the next 12 hours):

[Screenshot: page view count just short of one million]

 

Stay tuned for a bunch of cool new stuff from Simply Stats in 2015, including a new podcasting idea, more interviews, another unconference, and a new plotting theme!

22 Dec

On how meetings and conference calls are disruptive to a data scientist


Editor's note: The week of Xmas eve is usually my most productive of the year. This is because there are fewer emails and zero meetings (I do take a break, but only after this great week for work). Here is a repost of one of our first entries explaining why meetings and conference calls are particularly disruptive in data science.

In this TED talk Jason Fried explains why work doesn't happen at work, and describes the evils of meetings. Meetings are particularly disruptive for applied statisticians, especially for those of us who hack data files, explore data for systematic errors, get inspiration from visual inspection, and thoroughly test our code. Why? Before I become productive I go through a ramp-up/boot-up stage. Scripts need to be found, data loaded into memory, and most importantly, my brain needs to re-familiarize itself with the data and the essence of the problem at hand. I need a similar ramp-up for writing as well. It usually takes me between 15 and 60 minutes before I am in full-productivity mode. But once I am in “the zone”, I become very focused and I can stay in this mode for hours. There is nothing worse than interrupting this state of mind to go to a meeting. I lose much more than the hour I spend at the meeting. A short way to explain this is that having 10 separate hours to work adds up to basically nothing, while 10 uninterrupted hours in the zone is when I get stuff done.

Of course not all meetings are a waste of time. Academic leaders and administrators need to consult and get advice before making important decisions. I find lab meetings very stimulating and, generally, productive: we unstick the stuck and realign the derailed. But before you set up a standing meeting, consider this calculation: a weekly one-hour meeting with 20 people translates into 1 hour x 20 people x 52 weeks/year = 1,040 person-hours of potentially lost production per year. Assuming 40-hour weeks, that is six months of full-time work. How many grants, papers, and lectures could we produce in six months? And this does not take into account the non-linear effect described above. Jason Fried suggests you cancel your next meeting, notice that nothing bad happens, and enjoy the extra hour of work.
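The arithmetic behind that calculation, spelled out:

```python
# Weekly one-hour meeting with 20 people, 52 weeks a year:
person_hours = 1 * 20 * 52           # person-hours lost per year
full_time_weeks = person_hours / 40  # equivalent 40-hour work weeks
# 26 forty-hour weeks is half a working year: roughly six months
# of potentially lost production from a single standing meeting.
```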

I know many others who are like me in this regard, and for you I have these recommendations: (1) avoid unnecessary meetings, especially if you are already in full-productivity mode; don't be afraid to use this as an excuse to cancel, and if you are in a soft-money institution, remember who pays your salary; (2) try to bunch all the necessary meetings together into one day; (3) set aside at least one day a week to stay home and work for 10 hours straight. Jason Fried also recommends that every workplace declare a day on which no one talks. No meetings, no chit-chat, no friendly banter. No-talk Thursdays, anyone?

21 Dec

Sunday data/statistics link roundup (12/21/14)


James Stewart, author of the most popular Calculus textbook in the world, passed away. In case you wonder if there is any money in textbooks, he had a $32 million house in Toronto. Maybe I should get out of MOOCs and into textbooks.

  1. This post on medium about a new test for causality is making the rounds. The authors of the original paper make clear that their assumptions render the results basically unrealistic for any real analysis, for example: "We simplify the causal discovery problem by assuming no confounding, selection bias and feedback." The medium article is too bold, and as I replied to an economist who tweeted there was a new test that could distinguish causality: "Nope".
  2. I'm excited that Rafa + the ASA have started a section on Genomics and Genetics. It is nice to have a place to belong within our community. I hope it can be a place where folks who aren't into the hype (a lot of those in genomics), but really care about applications, can meet each other and work together.
  3. Great essay by Hanna W. about data, machine learning and fairness. I love this quote: "in order to responsibly articulate and address issues relating to bias, fairness, and inclusion, we need to stop thinking of big data sets as being homogeneous, and instead shift our focus to the many diverse data sets nested within these larger collections." (via Hilary M.)
  4. Over at Flowing Data they ran down the best data visualizations of the year.
  5. This rant from Dirk E. perfectly encapsulates every annoying thing about the Julia versus R comparisons I see regularly.
  6. We are tantalizingly close to 1 million page views for the year for Simply Stats. Help get us over the edge, share your favorite simply stats article with all your friends using the hashtag #simplystats1e6
19 Dec

Interview with Emily Oster

Emily Oster is an Associate Professor of Economics at Brown University. She is a frequent and highly respected contributor to 538, where she brings clarity to areas of interest to parents, pregnant women, and the general public where empirical research is conflicting or difficult to interpret. She is also the author of the popular new book about pregnancy, Expecting Better: Why the Conventional Pregnancy Wisdom Is Wrong--and What You Really Need to Know. We interviewed Emily as part of our ongoing interview series with exciting empirical data scientists.
 
SS: Do you consider yourself an economist, econometrician, statistician, data scientist or something else?
EO: I consider myself an empirical economist. I think my econometrics colleagues would have a hearty laugh at the idea that I'm an econometrician! The questions I'm most interested in tend to have a very heavy empirical component - I really want to understand what we can learn from data. In this sense, there is a lot of overlap with statistics. But at the end of the day, the motivating questions and the theories of behavior I want to test come straight out of economics.
SS: You are a frequent contributor to 538. Many of your pieces are attempts to demystify often conflicting sets of empirical research (about concussions and suicide, or the dangers of water fluoridation). What would you say are the issues that make empirical research about these topics most difficult?
 
EO: In nearly all the cases, I'd summarize the problem as: "The data isn't good enough." Sometimes this is because we only see observational data, not anything randomized. A large share of studies using observational data that I discuss have serious problems with either omitted variables or reverse causality (or both). This means that the results are suggestive, but really not conclusive. A second issue is that even when we do have some randomized data, it's usually on a particular population, or a small group, or in the wrong time period. In the fluoride case, the studies which come closest to being "randomized" are from 50 years ago. How do we know they still apply now? This makes even these studies challenging to interpret.
SS: Your recent book "Expecting Better: Why the Conventional Pregnancy Wisdom Is Wrong--and What You Really Need to Know" takes a similar approach to pregnancy. Why do you think there are so many conflicting studies about pregnancy? Is it because it is so hard to perform randomized studies?
 
EO: I think the inability to run randomized studies is a big part of this, yes. One area of pregnancy where the data is actually quite good is labor and delivery. If you want to know the benefits and consequences of pain medication in labor, for example, it is possible to point you to some reasonably sized randomized trials. For various reasons, there has been more willingness to run randomized studies in this area. When pregnant women want answers to less medical questions (like, "Can I have a cup of coffee?") there is typically no randomized data to rely on. Because the possible benefits of drinking coffee while pregnant are pretty much nil, it is difficult to conceptualize a randomized study of this type of thing.
Another big issue I found in writing the book was that even in cases where the data was quite good, data often diverges from practice. This was eye-opening for me and convinced me that in pregnancy (and probably in other areas of health) people really do need to be their own advocates and know the data for themselves.
SS: Have you been surprised about the backlash to your book for your discussion of the zero-alcohol policy during pregnancy? 
 
EO: A little bit, yes. This backlash has died down a lot as pregnant women actually read the book and use it. As it turns out, the discussion of alcohol makes up a tiny fraction of the book and most pregnant women are more interested in the rest of it!  But certainly when the book came out this got a lot of focus. I suspected it would be somewhat controversial, although the truth is that every OB I actually talked to told me they thought it was fine. So I was surprised that the reaction was as sharp as it was.  I think in the end a number of people felt that even if the data were supportive of this view, it was important not to say it because of the concern that some women would over-react. I am not convinced by this argument.
SS: What are the three most important statistical concepts for new mothers to know? 
 
EO: I really only have two!
I think the biggest thing is to understand the difference between randomized and non-randomized data and to have some sense of the pitfalls of non-randomized data. I reviewed studies of alcohol where the drinkers were twice as likely as non-drinkers to use cocaine. I think people (pregnant or not) should be able to understand why one is going to struggle to draw conclusions about alcohol from these data.
A second issue is the concept of probability. It is easy to say, "There is a 10% chance of the following" but do we really understand that? If someone quotes you a 1 in 100 risk from a procedure, it is important to understand the difference between 1 in 100 and 1 in 400.  For most of us, those seem basically the same - they are both small. But they are not, and people need to think of ways to structure decision-making that acknowledge these differences.
SS: What computer programming language is most commonly taught for data analysis in economics? 
 
EO: So, I think the majority of empirical economists use Stata. I have been seeing more R, as well as a variety of other things, but more commonly among people who work in more computationally heavy fields.
SS: Do you have any advice for young economists/statisticians who are interested in empirical research? 
EO:
1. Work on topics that interest you. As an academic you will ultimately have to motivate yourself to work. If you aren't interested in your topic (at least initially!), you'll never succeed.
2. One project which is 100% done is way better than five projects at 80%. You need to actually finish things, something which many of us struggle with.
3. Presentation matters. Yes, the substance is the most important thing, but don't discount the importance of conveying your ideas well.
18 Dec

Repost: Statistical illiteracy may lead to parents panicking about Autism


Editor's Note: This is a repost of a previous post on our blog from 2012. The repost is inspired by similar issues with statistical illiteracy that are coming up in allergy screening and pregnancy screening.

I was just doing my morning reading of a few news sources and stumbled across this Huffington Post article talking about research correlating babies' cries with autism. It suggests that the sound of a baby's cries may predict his or her future risk of autism. As the parent of a young son, this obviously caught my attention in a very lizard-brain, caveman sort of way. I couldn't find a link to the research paper in the article, so I did some searching and found out this result is also being covered by Time, Science Daily, Medical Daily, and a bunch of other news outlets.

Now thoroughly freaked out, I looked online and found the pdf of the original research article. I started looking at the statistics and took a deep breath. Based on the analysis they present in the article, there is absolutely no statistical evidence that a baby's cries can predict autism. Here are the flaws with the study:

  1. Small sample size. The authors only recruited 21 at-risk infants and 18 healthy infants. Then, because of data processing issues, they ended up analyzing only 7 high-autistic-risk versus 5 low-autistic-risk infants in one analysis and 10 versus 6 in another. That is nowhere near a representative sample and barely qualifies as a pilot study.
  2. Major and unavoidable confounding. The way the authors determined high autistic risk versus low risk was based on whether an older sibling had autism. Leaving aside the quality of this metric for measuring risk of autism, there is a major confounding factor: the families of the high-risk children all had an older sibling with autism and the families of the low-risk children did not! It would not be surprising at all if children with an autistic older sibling got a different kind of attention, and hence cried differently, regardless of their potential future risk of autism.
  3. No correction for multiple testing. This is one of the oldest problems in statistical analysis, and it is a consistent culprit behind false positives in epidemiology studies. XKCD even did a cartoon about it! The authors measured 9 variables describing the way babies cry and tested each one with a statistical hypothesis test, without correcting for multiple testing. So I gathered the resulting p-values and did the correction for them. It turns out that after adjusting for multiple comparisons, nothing is significant at the usual P < 0.05 level, which would probably have prevented publication.
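The post doesn't say which correction was applied; a Bonferroni adjustment, the simplest option, illustrates the point with hypothetical p-values:

```python
def bonferroni(p_values):
    # Bonferroni adjustment: multiply each p-value by the number of
    # tests performed, capping the result at 1.
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]

# Hypothetical raw p-values for the 9 cry variables (illustrative only):
raw = [0.04, 0.10, 0.21, 0.35, 0.48, 0.55, 0.62, 0.74, 0.90]
adjusted = bonferroni(raw)
# Even the smallest raw p-value (0.04) is about 0.36 after adjustment,
# so nothing clears the usual 0.05 threshold.
```

A raw p-value of 0.04 looks "significant" in isolation, but once you account for running 9 tests it no longer is.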

Taken together, these problems mean that the statistical analysis of these data does not show any connection between crying and autism.

The problem here exists on two levels. First, there was a failure in the statistical evaluation of this manuscript at the peer-review level. Most statistical referees would have spotted these flaws and pointed them out for such a highly controversial paper. Second, the news agencies reporting on this result, despite paying lip service to potential limitations, are not statistically literate enough to point out the major flaws in the analysis that reduce the probability of a true positive. Should journalists have some minimal training in statistics that allows them to determine whether a result is likely to be a false positive, to save us parents a lot of panic?