Simply Statistics


The Leek group guide to genomics papers

When I was a student, my advisor, John Storey, made a list of papers for me to read on nights and weekends. That list was incredibly helpful for a couple of reasons.

  • It got me caught up on the field of computational genomics
  • It was expertly curated, so it filtered a lot of papers I didn't need to read
  • It gave me my first set of ideas to try to pursue as I was reading the papers

I have often thought I should make a similar list for folks who may want to work with me (or who want to learn about statistical genomics). So this is my first attempt at that list. I've tried to separate the papers into categories, and I've probably missed important papers. I'm happy to take suggestions for the list, but it is primarily designed for people in my group, so I might be a little bit parsimonious.



An economic model for peer review

I saw this tweet the other day:

It reminded me that a few years ago I had a paper that went through the peer review wringer. It drove me completely bananas. One thing that made me so crazy about the process was how long the referees waited before reviewing and how terrible the reviews were after that long wait. So I started thinking about the "economics of peer review". Basically, what is the incentive for scientists to contribute to the system?

To get a handle on this idea, I designed a "peer review game" where there are a fixed number of players N. The players play the game for a fixed period of time. During that time, they can submit papers or they can review papers. For each player i, the final score at the end of the time is S_i = (number of papers player i submitted that were accepted).

Based on this model, under closed peer review, there is one Nash equilibrium: the strategy where no one reviews any papers. Basically, no one can hope to improve their score by reviewing; they can only hope to improve their score by submitting more papers (sound familiar?). Under open peer review, there are more potential equilibria, based on the relative amount of goodwill you earn from your fellow reviewers by submitting good reviews.
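To make the closed-review argument concrete, here is a toy sketch in Python. It is not the model from the paper: the time budget, the costs of writing and reviewing, the base acceptance probability, and the `goodwill_per_review` parameter are all assumptions made up for illustration. The only idea it borrows from the text is that the score counts accepted submissions and that reviewing takes time away from writing.

```python
def expected_score(reviews_done, time_budget=60, paper_cost=5, review_cost=5,
                   base_accept=0.5, goodwill_per_review=0.0):
    """Expected number of accepted papers (S_i) for one player.

    All parameters are hypothetical. goodwill_per_review is an assumed
    boost to the acceptance probability earned per review written;
    0.0 stands in for closed review, where reviews earn no credit.
    """
    time_left = time_budget - reviews_done * review_cost
    if time_left < 0:
        raise ValueError("not enough time for that many reviews")
    papers = time_left // paper_cost  # papers the player still has time to write
    p_accept = min(1.0, base_accept + goodwill_per_review * reviews_done)
    return papers * p_accept


review_counts = range(13)  # 0 to 12 reviews fit in the assumed time budget

# Closed review: every review only costs writing time.
closed = [expected_score(r) for r in review_counts]
# Open review: each review earns a little (assumed) goodwill.
open_ = [expected_score(r, goodwill_per_review=0.05) for r in review_counts]

best_closed = max(review_counts, key=lambda r: closed[r])
best_open = max(review_counts, key=lambda r: open_[r])
print("best number of reviews, closed:", best_closed)
print("best number of reviews, open:", best_open)
```

In this toy setup the best response under closed review is to write no reviews at all, whatever the other players do, while a small amount of goodwill under open review is enough to make some reviewing worthwhile.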

We then built a model system for testing out our theory. The system involved having groups of students play a "peer review game" where they submitted solutions to SAT problems like:

Each solution was then randomly assigned to another player to review. Those players could (a) review it and reject it, (b) review it and accept it, or (c) not review it. The person with the most points at the end of the time (one hour) won.

We found some cool things:

  1. In closed review, reviewing gave no benefit.
  2. In open review, reviewing gave a small positive benefit.
  3. Both systems gave comparable accuracy.
  4. All peer review increased the overall accuracy of responses.

The paper is here and all of the data and code are here.


The Drake index for academics

I think academic indices are pretty silly; maybe we should introduce so many academic indices that people can't even remember which one is which. There are pretty serious flaws with both citation indices and social media indices that I think render them pretty meaningless in a lot of ways.

Regardless of these obvious flaws, I want in on the game. Instead of the K-index for academics I propose the Drake index. Drake has achieved both critical and popular success. His song "Honorable Mentions" from the ESPYs (especially the first verse) reminds me of the motivation of the K-index paper.

To quantify both the critical and popular success of a scientist, I propose the Drake Index (TM). The Drake Index is defined as follows:

(# Twitter Followers)/(Max Twitter Followers by a Person in your Field) + (#Citations)/(Max Citations by a Person in your Field)

Let's break the index down. There are two main components (Twitter followers and Citations) measuring popular and critical acclaim. But they are measured on different scales. So we attempt to normalize them to the maximum in their field so the indices will both be between 0 and 1. This means that your Drake index score is between 0 and 2. Let's look at a few examples to see how the index works.

  1. Drake = (16.9M followers)/(55.5M followers for Justin Bieber) + (0 citations)/(134 citations for Natalie Portman) = 0.30
  2. Rafael Irizarry = (1.1K followers)/(17.6K followers for Simply Stats) + (38,194 citations)/(185,740 citations for Doug Altman) = 0.27
  3. Roger Peng = (4.5K followers)/(17.6K followers for Simply Stats) + (4,011 citations)/(185,740 citations for Doug Altman) = 0.27
  4. Jeff Leek = (2.6K followers)/(17.6K followers for Simply Stats) + (2,348 citations)/(185,740 citations for Doug Altman) = 0.16
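If you want to compute your own, the formula above is a one-liner. Here is a minimal Python sketch; the function name `drake_index` and the argument names are mine, and the inputs are the rounded follower and citation counts quoted in the examples, so the results are only as precise as those roundings.

```python
def drake_index(followers, max_followers, citations, max_citations):
    """Popular acclaim plus critical acclaim, each scaled by the field maximum."""
    if max_followers <= 0 or max_citations <= 0:
        raise ValueError("field maxima must be positive")
    return followers / max_followers + citations / max_citations


# (followers, max followers in field, citations, max citations in field),
# using the rounded counts quoted in the examples above.
examples = {
    "Drake": (16.9e6, 55.5e6, 0, 134),
    "Rafael Irizarry": (1.1e3, 17.6e3, 38194, 185740),
    "Jeff Leek": (2.6e3, 17.6e3, 2348, 185740),
}

scores = {name: drake_index(*counts) for name, counts in examples.items()}
for name, score in scores.items():
    print(f"{name}: {score:.2f}")
```

Because each ratio is capped at 1 by the field maximum, the score always lands between 0 and 2.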

In the interest of this not being taken any more seriously than an afternoon blog post should be, I won't calculate any other people's Drake index. But you can :-).


You think P-values are bad? I say show me the data.

Both the scientific community and the popular press are freaking out about reproducibility right now. I think they have good reason to, because even the US Congress is now investigating the transparency of science. It has been driven by the very public reproducibility disasters in genomics and economics.

There are three major components to a reproducible and replicable study from a computational perspective: (1) the raw data from the experiment must be available, (2) the statistical code and documentation to reproduce the analysis must be available and (3) a correct data analysis must be performed.

There have been successes and failures in releasing all the data, but PLoS' policy on data availability and the AllTrials initiative hold some hope. The most progress has been made on making code and documentation available. Galaxy, knitr, and iPython make it easier to distribute literate programs than it has ever been, and people are actually using them!

The trickiest part of reproducibility and replicability is ensuring that people perform a good data analysis. The first problem is that we actually don't know which statistical methods lead to higher reproducibility and replicability in users' hands. Articles like the one that just came out in the NYT suggest that using one type of method (Bayesian approaches) over another (p-values) will address the problem. But the real story is that those are still 100% philosophical arguments. We actually have very little good data on whether analysts will perform better analyses using one method or another. I agree with Roger in his tweet storm (quick, someone is wrong on the internet, Roger, fix it!):

This is even more of a problem because the data deluge demands that almost all data analysis be performed by people with basic to intermediate statistics training at best. There is no way around this in the short term. There just aren't enough trained statisticians/data scientists to go around.  So we need to study statistics just like any other human behavior to figure out which methods work best in the hands of the people most likely to be using them.


Unbundling the educational package

I just got back from the World Economic Forum's summer meeting in Tianjin, China and there was much talk of disruption and innovation there. Basically, if you weren't disrupting, you were furniture. Perhaps not surprisingly, one topic area that was universally considered ripe for disruption was Education.

There are many ideas bandied about with respect to "disrupting" education and some are interesting to consider. MOOCs were the darlings of...last year...but they're old news now. Sam Lessin has a nice piece in The Information (total paywall, sorry, but it's worth it) about building a subscription model for universities. Aswath Damodaran has what I think is a nice framework for thinking about the "education business".

One thing that I latched on to in Damodaran's piece is the idea of education as a "bundled product". Indeed, I think the key aspect of traditional on-site university education is the simultaneous offering of

  1. Subject matter content (i.e. course material)
  2. Mentoring and guidance by faculty
  3. Social and professional networking
  4. Other activities (sports, arts ensembles, etc.)

MOOCs have attacked #1 for many subjects, typically large introductory courses. Endeavors like the Minerva project are attempting to provide lower-cost seminar-style courses (i.e. anti-MOOCs).

I think the extent to which universities will truly be disrupted will hinge on how well we can unbundle the four (or maybe more?) elements described above and provide them separately but at roughly the same level of quality. Is it possible? I don't know.


Applied Statisticians: people want to learn what we do. Let's teach them.

In this recent opinion piece, Hadley Wickham explains how data science goes beyond Statistics and that data science is not promoted in academia. He defines data science as follows:

I think there are three main steps in a data science project: you collect data (and questions), analyze it (using visualization and models), then communicate the results.

and makes the important point that

Any real data analysis involves data manipulation (sometimes called wrangling or munging), visualization and modelling.

The above describes what I have been doing since I became an academic applied statistician about 20 years ago. It describes what several of my colleagues do as well. For example, 15 years ago Karl Broman, in his excellent job talk, covered all the items in Hadley's definition. The arc of the talk revolved around the scientific problem and not the statistical models. He spent a considerable amount of time describing how the data was acquired and how he used Perl scripts to clean up microsatellite data. More than half his slides contained visualizations, either illustrative cartoons or data plots. This research eventually led to his widely used "data product" R/qtl. Although not described in the talk, Karl used make to help make the results reproducible.

So why then does Hadley think that "Statistics research focuses on data collection and modeling, and there is little work on developing good questions, thinking about the shape of data, communicating results or building data products"? I suspect one reason is that most applied work is published outside the flagship statistical journals. For example, Karl's work was published in the American Journal of Human Genetics. A second reason may be that most of us academic applied statisticians don't teach what we do. Despite writing a thesis that involved much data wrangling (reading music aiff files into Splus) and data visualization (including listening to fitted signals and residuals), the first few courses I taught as an assistant professor were almost solely on GLM theory.

About five years ago I tried changing the Methods course for our PhD students from one focusing on the math behind statistical methods to a problem and data-driven course. This was not very successful, as many of our students were interested in the mathematical aspects of statistics and did not like the open-ended assignments. Jeff Leek built on that class by incorporating question development, much vaguer problem statements, data wrangling, and peer grading. He also found it challenging to teach the messier parts of applied statistics. It often requires exploration and failure, which can be frustrating for new students.

This story has a happy ending though. Last year Jeff created a data science Coursera course that enrolled over 180,000 students with 6,000+ completing. This year I am subbing for Joe Blitzstein (talk about filling in big shoes) in CS109: the Data Science undergraduate class Hanspeter Pfister and Joe created last year at Harvard. We have over 300 students registered, making it one of the largest classes on campus. I am not teaching them GLM theory.

So if you are an experienced applied statistician in academia, consider developing a data science class that teaches students what you do.





A non-comprehensive list of awesome female data people on Twitter

I was just talking to a student who mentioned she didn't know Jenny Bryan was on Twitter. She is, and she is an awesome person to follow. I also realized that I hadn't seen a good list of women on Twitter who do stats/data. So I thought I'd make one. This list is what I could make in 15 minutes based on my own feed and will, with 100% certainty, miss really important people. Can you please add them in the comments and I'll update the list?

I have also been informed that these Twitter lists are probably better than my post. But I'll keep updating my list anyway cause I want to know who all the right people to follow are!



Why the three biggest positive contributions to reproducible research are the iPython Notebook, knitr, and Galaxy

There is a huge amount of interest in reproducible research and replication of results. Part of this is driven by some of the pretty major mistakes in reproducibility we have seen in economics and genomics. This has spurred discussion at a variety of levels including at the level of the United States Congress.

To solve this problem we need the appropriate infrastructure. I think developing infrastructure is a lot like playing the lottery, only if the lottery required a lot more work to buy a ticket. You pour a huge amount of effort into building good infrastructure.  I think it helps if you build it for yourself like Yihui did for knitr:

(also make sure you go read the blog post over at Data Science LA)

If lots of people adopt it, you are set for life. If they don't, you did all that work for nothing. So you have to applaud all the groups who have made efforts at building infrastructure for reproducible research.

I would contend that the largest positive contributions to reproducibility in sheer number of analyses made reproducible are:

  • The knitr R package (or more recently rmarkdown) for creating literate webpages and documents in R.
  • iPython notebooks for creating literate webpages and documents interactively in Python.
  • The Galaxy project for creating reproducible workflows (among other things) combining known tools.

There are similarities and differences between the different platforms but the one thing I think they all have in common is that they added either no or negligible effort to people's data analytic workflows.

knitr and iPython notebooks have primarily increased reproducibility among folks who have some scripting experience. I think a major reason they are so popular is that you just write code like you normally would, but embed it in a simple-to-use document. The workflow doesn't change much for the analyst because they were going to write that code anyway. The document just allows the code to be built into something more shareable.

Galaxy has increased reproducibility for many folks, but my impression is that the primary user base is folks who have less experience scripting. The Galaxy team has worked hard to make it possible for these folks to analyze data they couldn't before in a reproducible way. But the reproducibility is incidental in some sense. The main reason users come is that they would have had to stitch those pipelines together anyway. Now they have an easier way to do it (lowering workload) and they get reproducibility as a bonus.

If I was in charge of picking the next round of infrastructure projects that are likely to impact reproducibility or science in a positive way, I would definitely look for projects that have certain properties.

  • For scripters and experts I would look for projects that interface with what people are already doing (most data analysis is in R or Python these days), require almost no extra work, and provide some benefit (reproducibility or otherwise). I would also look for things that are agnostic to which packages/approaches people are using.
  • For non-experts I would look for projects that enable people to build pipelines they weren't able to before using already standard tools and give them things like reproducibility for free.

Of course I wouldn't put me in charge anyway, I've never won the lottery with any infrastructure I've tried to build.


A (very) brief review of published human subjects research conducted with social media companies

As I wrote the other day, more and more human subjects research is being performed by large tech companies. The best way to handle the ethical issues raised by this research is still unclear. The first step is to get some idea of what has already been published from these organizations. So here is a brief review of the papers I know about where human subjects experiments have been conducted by companies. I'm only counting experiments here that have (a) been published in the literature and (b) involved experiments on users. I realized I could come up with surprisingly few.  I'd be interested to see more in the comments if people know about them.

Paper: Experimental evidence of massive-scale emotional contagion through social networks
Company: Facebook
What they did: Randomized people to get different emotions in their news feed and observed if they showed an emotional reaction.
What they found: That there was almost no real effect on emotion. The effect was statistically significant but not scientifically or emotionally meaningful.

Paper: Social influence bias: a randomized experiment
Company: Not stated but sounds like Reddit
What they did: Randomly upvoted, downvoted, or left alone posts to the social networking site. Then they observed whether there was a difference in the overall rating of posts within each treatment.
What they found: Posts that were upvoted ended up with a final rating score (total upvotes - total downvotes) that was 25% higher.

Paper: Identifying influential and susceptible members of social networks 
Company: Facebook
What they did: Using a commercial Facebook app, they found users who adopted a product and randomized sending messages to their friends about the use of the product. Then they measured whether their friends decided to adopt the product as well.
What they found: Many interesting things. For example: susceptibility to influence decreases with age, people over 31 are stronger influencers, women are less susceptible to influence than men, etc. etc.


Paper: Inferring causal impact using Bayesian structural time-series models
Company: Google
What they did: They developed methods for inferring the causal impact of an ad in a time series situation. They used data from an advertiser who showed ads to people related to keywords and measured how many visits there were to the advertiser's website through paid and organic (non-paid) clicks.
What they found: That the ads worked. But more importantly that they could predict the causal effect of the ad using their methods.









SwiftKey and Johns Hopkins partner for Data Science Specialization Capstone

I use SwiftKey on my Android phone all the time. So I was super pumped up when they agreed to partner with us on the first Capstone course for the Johns Hopkins Data Science Specialization to run in October 2014. To enroll in the course you have to pass the other 9 courses in the Data Science Specialization.

The 9 courses have only been running for 4 months but already 200+ people have finished all 9! It has been unbelievable to see the response to the specialization and we are excited about taking it to the next level.

Around the world, people are spending an increasing amount of time on their mobile devices for email, social networking, banking and a whole range of other activities. But typing on mobile devices can be a serious pain. SwiftKey, our corporate partner in this capstone, builds a smart keyboard that makes it easier for people to type on their mobile devices. One cornerstone of their smart keyboard is predictive text models. When someone types:

I went to the

the keyboard presents three options for what the next word might be. For example, the three words might be gym, store, restaurant. In this capstone you will work on understanding and building predictive text models like those used by SwiftKey.

This course will start with the basics, analyzing a large corpus of text documents to discover the structure in the data and how words are put together. It will cover cleaning and analyzing text data, then building and sampling from a predictive text model. Finally, students will use the knowledge gained in our Data Products course to build a predictive text product they can show off to their family, friends, and potential employers.
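To give a flavor of what a predictive text model looks like, here is a minimal sketch in Python of a trigram model that suggests the three most likely next words. This is not SwiftKey's method or the course's reference solution: the tiny corpus, the crude tokenizer, and the function names `tokenize`, `train_ngrams`, and `suggest` are all made up for illustration, and a real model would need smoothing and a backoff strategy for unseen contexts.

```python
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase the text and keep runs of letters and apostrophes."""
    return re.findall(r"[a-z']+", text.lower())


def train_ngrams(corpus, n=3):
    """Map each (n-1)-word context to a Counter of the words that follow it.

    Assumes n >= 2, so every context has at least one word.
    """
    model = defaultdict(Counter)
    for sentence in corpus:
        tokens = tokenize(sentence)
        for i in range(len(tokens) - n + 1):
            context = tuple(tokens[i:i + n - 1])
            model[context][tokens[i + n - 1]] += 1
    return model


def suggest(model, text, n=3, k=3):
    """Return up to k candidate next words, most frequent first."""
    context = tuple(tokenize(text)[-(n - 1):])
    # .get avoids adding empty entries to the model for unseen contexts.
    return [word for word, _ in model.get(context, Counter()).most_common(k)]


# A toy corpus, invented for this example.
corpus = [
    "I went to the gym this morning",
    "She went to the gym",
    "We went to the gym again",
    "He went to the store",
    "They went to the store yesterday",
    "I went to the restaurant",
]

model = train_ngrams(corpus)
print(suggest(model, "I went to the"))
```

With this toy corpus the suggestions for "I went to the" come out as gym, store, restaurant, in order of how often each followed "to the". An unseen context simply returns an empty list, which is exactly the gap that smoothing and backoff are meant to fill.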

We are really excited to work with SwiftKey to take our Specialization to the next level! Here is Roger's intro video for the course to get you fired up too.