Simply Statistics


Data Science Students Predict the Midterm Election Results

Tweet about this on TwitterShare on FacebookShare on Google+Share on LinkedInEmail this to someone

As explained in an earlier post, one of the homework assignments of my CS109 class was to predict the results of the midterm election. We created a competition in which 49 students entered. The most interesting challenge was to provide intervals for the republican - democrat difference in each of the 35 senate races. Anybody missing more than 2 was eliminated. The average size of the intervals was the tie breaker.

The main teaching objective here was to get students thinking about how to evaluate prediction strategies when chance is involved. To a naive observer, a biased strategy that favored democrats and correctly called, say, Virginia may look good in comparison to strategies that called it a toss-up.  However, a look at the other 34 states would reveal the weakness of this biased strategy. I wanted students to think of procedures that can help distinguish lucky guesses from strategies that universally perform well.

One of the concepts we discussed in class was the systematic bias of polls which we modeled as a random effect. One can't infer this bias from polls until after the election passes. By studying previous elections students were able to estimate the SE of this random effect and incorporate it into the calculation of intervals. The realization of this random effect was very large in these elections (about +4 for the democrats) which clearly showed the importance of modeling this source of variability. Strategies that restricted standard error measures to sample estimates from this year's polls did very poorly. The 90% credible intervals provided by 538, which I believe does incorporate this, missed 8 of the 35 races (23%). This suggests that they underestimated the variance.  Several of our students compared favorably to 538:

name avg bias MSE avg interval size # missed
Manuel Andere -3.9 6.9 24.1 3
Richard Lopez -5.0 7.4 26.9 3
Daniel Sokol -4.5 6.4 24.1 4
Isabella Chiu -5.3 9.6 26.9 6
Denver Mosigisi Ogaro -3.2 6.6 18.9 7
Yu Jiang -5.6 9.6 22.6 7
David Dowey -3.5 6.2 16.3 8
Nate Silver -4.2 6.6 16.4 8
Filip Piasevoli -3.5 7.4 22.1 8
Yapeng Lu -6.5 8.2 16.5 10
David Jacob Lieb -3.7 7.2 17.1 10
Vincent Nguyen -3.8 5.9 11.1 14

It is important to note that 538 would have probably increased their interval size had they actively participated in a competition requiring 95% of the intervals to cover. But all in all, students did very well. The majority correctly predicted the republican take over. The median mean square error across all 49 participantes was 8.2 which was not much worse that 538's 6.6. Other example of strategies that I think helped some of these students perform well was the use of creative weighting schemes (based on previous elections) to average poll and the use of splines to estimate trends, which in this particular election were going in the republican's favor.

Here are some plots showing results from two of our top performers:

Rplot Rplot01

I hope this exercise helped students realize that data science can be both fun and useful. I can't wait to do this again in 2016.





Sunday data/statistics link roundup (11/9/14)

Tweet about this on TwitterShare on FacebookShare on Google+Share on LinkedInEmail this to someone

So I'm a day late, but you know, I got a new kid and stuff...

  1. The New Yorker hating on MOOCs, they mention all the usual stuff. Including the really poorly designed San Jose State experiment. I think this deserves a longer post, but this is definitely a case where people are looking at MOOCs on the wrong part of the hype curve. MOOCs won't solve all possible education problems, but they are hugely helpful to many people and writing them off is a little silly (via Rafa).
  2. My colleague Dan S. is teaching a missing data workshop here at Hopkins next week (via Dan S.)
  3. A couple of cool Youtube videos explaining how the normal distribution sounds and the pareto principle with paperclips (via Presh T., pair with the 80/20 rule of statistical methods development)
  4. If you aren't following Research Wahlberg, you aren't on academic twitter.
  5. I followed  #biodata14  closely. I think having a meeting on Biological Big Data is a great idea and many of the discussion leaders are people I admire a ton. I also am a big fan of Mike S. I have to say I was pretty bummed that more statisticians weren't invited (we like to party too!).
  6. Our data science specialization generates almost 1,000 new R github repos a month! Roger and I are in a neck and neck race to be the person who has taught the most people statistics/data science in the history of the world.
  7. The Rstudio guys have also put together what looks like a great course similar in spirit to our Data Science Specialization. The Rstudio folks have been *super* supportive of the DSS and we assume anything they make will be awesome.
  8. Congrats to Data Carpentry and Tracy Teal on their funding from the Moore Foundation!


Time varying causality in n=1 experiments with applications to newborn care

Tweet about this on TwitterShare on FacebookShare on Google+Share on LinkedInEmail this to someone

We just had our second son about a week ago and I've been hanging out at home with him and the rest of my family. It has reminded me of a few things from when we had our first son. First, newborns are tiny and super-duper adorable. Second, daylight savings time means gaining an extra hour of sleep for many people, but for people with young children it is more like (via Reddit):


Third, taking care of a newborn is like performing a series of n=1 experiments where the causal structure of the problem changes every time you perform an experiment.

Suppose, hypothetically, that your newborn has just had something to eat and it is 2am in the morning (again, just hypothetically). You are hoping he'll go back down to sleep so you can catch some shut-eye yourself. But your baby just can't sleep and seems uncomfortable. Here are a partial list of causes for this: (1) dirty diaper, (2) needs to burp, (3) still hungry, (4) not tired, (5) over tired, (6) has gas, (7) just chillin. So you start going down the list and trying to address each of the potential causes of late-night sleeplessness: (1) check diaper, (2) try burping, (3) feed him again, etc. etc. Then, miraculously, one works and the little guy falls asleep.

It is interesting how the natural human reaction  to this is to reorder the potential causes of sleeplessness and start with the thing that worked next time. Then often get frustrated when the same thing doesn't work the next time. You can't help it, you did an experiment, you have some data, you want to use it. But the reality is that the next time may have nothing to do with the first.

I'm in the process of collecting some very poorly annotated data collected exclusively at night if anyone wants to write a dissertation on this problem.


538 election forecasts made simple

Tweet about this on TwitterShare on FacebookShare on Google+Share on LinkedInEmail this to someone

Nate Silver does a great job of explaining his forecast model to laypeople. However, as a statistician I've always wanted to know more details. After preparing a "predict the midterm electionshomework for my data science class I have a better idea of what is going on.

Here is my best attempt at explaining the ideas of 538 using formulas and data. And here is the R markdown.





Sunday data/statistics link roundup (11/2/14)

Tweet about this on TwitterShare on FacebookShare on Google+Share on LinkedInEmail this to someone

Better late than never! If you have something cool to share, please continue to email it to me with subject line "Sunday links".

  1. A DrivenData is a Kaggle-like site but for social good. I like the principle of using data for societal benefit, since there are so many ways it seems to be used for nefarious purposes (via Rafa).
  2. This article claiming academic science isn't sexist has been widely panned Emily Willingham pretty much destroys it here (via Sherri R.). The thing that is interesting about this article is the way that it tries to use data to give the appearance of empiricism, while using language to try to skew the results. Is it just me or is this totally bizarre in light of the NYT also publishing this piece about academic sexual harassment at Yale?
  3. Noah Smith, an economist, tries to summarize the problem with "most research being wrong". It is an interesting take, I wonder if he read Roger's piece saying almost exactly the same thing  like the week before? He also mentions it is hard to quantify the rate of false discoveries in science, maybe he should read our paper?
  4. Nature now requests that code sharing occur "where possible" (via Steven S.)
  5. Great movie scientist versus real scientist cartoons, I particularly like the one about replication (via Steven S.).

Why I support statisticians and their resistance to hype

Tweet about this on TwitterShare on FacebookShare on Google+Share on LinkedInEmail this to someone

Despite Statistics being the most mature data related discipline, statisticians have not fared well in terms of being selected for funding or leadership positions in the new initiatives brought about by the increasing interest in data. Just to give one example (Jeff and Terry Speed give many more) the White House Big Data Partners Workshop  had 19 members of which 0 were statisticians. The statistical community is clearly worried about this predicament and there is widespread consensus that we need to be better at marketing. Although I agree that only good can come from better communicating what we do, it is also important to continue doing one of the things we do best: resisting the hype and being realistic about data.

This week, after reading Mike Jordan's reddit ask me anything, I was reminded of exactly how much I admire this quality in statisticians. From reading the interview one learns about instances where hype has led to confusion, how getting past this confusion helps us better understand and consequently appreciate the importance of his field. For the past 30 years, Mike Jordan has been one of the most prolific academics working in the areas that today are receiving increased attention. Yet, you won't find a hyped-up press release coming out of his lab.  In fact when a journalist tried to hype up Jordan's critique of hype, Jordan called out the author.

Assessing the current situation with data initiatives it is hard not to conclude that hype is being rewarded. Many statisticians have come to the sad realization that by being cautious and skeptical, we may be losing out on funding possibilities and leadership roles. However, I remain very much upbeat about our discipline.  First, being skeptical and cautious has actually led to many important contributions. An important example is how randomized controlled experiments changed how medical procedures are evaluated. A more recent one is the concept of FDR, which helps control false discoveries in, for example,  high-throughput experiments. Second, many of us continue to work in the interface with real world applications placing us in a good position to make relevant contributions. Third, despite the failures alluded to above, we continue to successfully find ways to fund our work. Although resisting the hype has cost us in the short term, we will continue to produce methods that will be useful in the long term, as we have been doing for decades. Our methods will still be used when today's hyped up press releases are long forgotten.




Return of the sunday links! (10/26/14)

Tweet about this on TwitterShare on FacebookShare on Google+Share on LinkedInEmail this to someone

New look for the blog and bringing back the links. If you have something that you'd like included in the Sunday links, email me and let me know. If you use the title of the message "Sunday Links" you'll be more likely for me to find it when I search my gmail.

  1. Thomas L. does a more technical post on semi-parametric efficiency, normally I'm a data n' applications guy, but I love these in depth posts, especially when the papers remind me of all the things I studied at my alma mater.
  2. I am one of those people who only knows a tiny bit about Docker, but hears about it all the time. That being said, after I read about Rocker, I got pretty excited.
  3. Hadley W.'s favorite tools, seems like that dude likes R Studio for some reason....(me too)
  4. A cool visualization of chess piece survival rates.
  5. A short movie by 538 about statistics and the battle between Deep Blue and Gary Kasparov. Where's the popcorn?
  6. Twitter engineering released an R package for detecting outbreaks. I wonder how circular binary segmentation would do?




An interactive visualization to teach about the curse of dimensionality

Tweet about this on TwitterShare on FacebookShare on Google+Share on LinkedInEmail this to someone

I recently was contacted for an interview about the curse of dimensionality. During the course of the conversation, I realized how hard it is to explain the curse to a general audience. One of the best descriptions I could come up with was trying to describe sampling from a unit line, square, cube, etc. and taking samples with side length fixed. You would capture fewer and fewer points. As I was saying this, I realized it is a pretty bad way to explain the curse of dimensionality in words. But there was potentially a cool data visualization that would illustrate the idea. I went to my student Prasad, our resident interactive viz design expert to see if he could build it for me. He came up with this cool Shiny app where you can simulate a number of points (n) and then fix a side length for 1-D, 2-D, 3-D, and 4-D and see how many points you capture in a cube of that length in that dimension. You can find the full app here or check it out on the blog here:



Vote on simply statistics new logo design

Tweet about this on TwitterShare on FacebookShare on Google+Share on LinkedInEmail this to someone

As you can tell, we have given the Simply Stats blog a little style update. It should be more readable on phones or tablets now. We are also about to get a new logo. We are down to the last couple of choices and can't decide. Since we are statisticians, we thought we'd collect some data. Here is the link to the poll. Let us know


Thinking like a statistician: don't judge a society by its internet comments

Tweet about this on TwitterShare on FacebookShare on Google+Share on LinkedInEmail this to someone

In a previous post I explained how thinking like a statistician can help you avoid  feeling sad after using Facebook. The basic point was that missing not at random (MNAR) data on your friends' profiles (showing only the best parts of their life) can result in the biased view that your life is boring and uninspiring in comparison. A similar argument can be made to avoid  losing faith in humanity after reading internet comments or anonymous tweets, one of the most depressing activities that I have voluntarily engaged in.  If you want to see proof that racism, xenophobia, sexism and homophobia are still very much alive, read the unfiltered comments sections of articles related to race, immigration, gender or gay rights. However, as a statistician, I remain optimistic about our society after realizing how extremely biased these particular MNAR data can be.

Assume we could summarize an individual's "righteousness" with a numerical index. I realize this is a gross oversimplification, but bear with me. Below is my view on the distribution of this index across all members of our society.


Note that the distribution is not bimodal. This means there is no gap between good and evil, instead we have a continuum. Although there is variability, and we do have some extreme outliers on both sides of the distribution, most of us are much closer to the median than we like to believe. The offending internet commentators represent a very small proportion (the "bad" tail shown in red). But in a large population, such as internet users, this extremely small proportion can be quite numerous and gives us a biased view.

There is one more level of variability here that introduces biases. Since internet comments can be anonymous, we get an unprecedentedly large glimpse into people's opinions and thoughts. We assign a "righteousness" index to our thoughts and opinion and include it in the scatter plot shown in the figure above. Note that this index exhibits variability within individuals: even the best people have the occasional bad thought.  The points in red represent thoughts so awful that no one, not even the worst people, would ever express publicly. The red points give us an overly pessimistic estimate of the individuals that are posting these comments, which exacerbates our already pessimistic view due to a non-representative sample of individuals.

I hope that thinking like a statistician will help the media and social networks put in statistical perspective the awful tweets or internet comments that represent the worst of the worst. These actually provide little to no information on humanity's distribution of righteousness, that I think is moving consistently, albeit slowly, towards the good.