Category: Uncategorized

12
Nov

Data Science Students Predict the Midterm Election Results

As explained in an earlier post, one of the homework assignments in my CS109 class was to predict the results of the midterm elections. We created a competition, which 49 students entered. The most interesting challenge was to provide intervals for the republican-democrat difference in each of the 35 senate races. Anybody missing more than 2 was eliminated, and the average size of the intervals was the tiebreaker.

The main teaching objective here was to get students thinking about how to evaluate prediction strategies when chance is involved. To a naive observer, a biased strategy that favored democrats and correctly called, say, Virginia, may look good in comparison to strategies that called it a toss-up. However, a look at the other 34 states would reveal the weakness of this biased strategy. I wanted students to think of procedures that can help distinguish lucky guesses from strategies that universally perform well.

One of the concepts we discussed in class was the systematic bias of polls, which we modeled as a random effect. One can't infer this bias from the polls themselves until after the election has passed, but by studying previous elections students were able to estimate the SE of this random effect and incorporate it into their interval calculations. The realized value of this random effect was very large in this election (about +4 for the democrats), which clearly showed the importance of modeling this source of variability: strategies that restricted standard error measures to sample estimates from this year's polls did very poorly. The 90% credible intervals provided by 538, which I believe does incorporate this bias, missed 8 of the 35 races (23%), which suggests that they underestimated the variance. Several of our students compared favorably to 538:

| name | avg bias | MSE | avg interval size | # missed |
| --- | --- | --- | --- | --- |
| Manuel Andere | -3.9 | 6.9 | 24.1 | 3 |
| Richard Lopez | -5.0 | 7.4 | 26.9 | 3 |
| Daniel Sokol | -4.5 | 6.4 | 24.1 | 4 |
| Isabella Chiu | -5.3 | 9.6 | 26.9 | 6 |
| Denver Mosigisi Ogaro | -3.2 | 6.6 | 18.9 | 7 |
| Yu Jiang | -5.6 | 9.6 | 22.6 | 7 |
| David Dowey | -3.5 | 6.2 | 16.3 | 8 |
| Nate Silver | -4.2 | 6.6 | 16.4 | 8 |
| Filip Piasevoli | -3.5 | 7.4 | 22.1 | 8 |
| Yapeng Lu | -6.5 | 8.2 | 16.5 | 10 |
| David Jacob Lieb | -3.7 | 7.2 | 17.1 | 10 |
| Vincent Nguyen | -3.8 | 5.9 | 11.1 | 14 |
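
To make the interval calculation concrete, here is a minimal R sketch of the idea (all numbers are hypothetical, and this is my illustration rather than any student's actual entry): the sampling variance of the poll average is inflated by the estimated variance of the election-level bias.

```r
# Interval for one race's republican-democrat difference.
# `polls` and `bias_se` are made-up numbers; in practice bias_se would be
# estimated from how much polls missed in previous elections.
polls <- c(-2.1, 0.5, -1.3, 1.8, -0.4)       # hypothetical poll margins (R - D)
bias_se <- 2.0                               # hypothetical SE of the random effect

est <- mean(polls)
sampling_se <- sd(polls) / sqrt(length(polls))
total_se <- sqrt(sampling_se^2 + bias_se^2)  # add the bias variance

# 95% interval; setting bias_se to 0 gives the too-narrow intervals
# that did poorly in the competition
round(est + c(-1, 1) * qnorm(0.975) * total_se, 1)
```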

It is important to note that 538 would probably have increased their interval size had they actively participated in a competition requiring 95% of the intervals to cover. But all in all, students did very well. The majority correctly predicted the republican takeover. The median mean squared error across all 49 participants was 8.2, which was not much worse than 538's 6.6. Other examples of strategies that I think helped some of these students perform well were the use of creative weighting schemes (based on previous elections) to average polls, and the use of splines to estimate trends, which in this particular election were moving in the republicans' favor (a sketch of this idea follows).
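
Here is an equally hypothetical sketch of the spline-trend idea: smooth the poll margins over time and carry the fitted trend to election day.

```r
# Fit a smooth trend to fake poll margins and read it off at day 0.
set.seed(1)
days <- -60:-1                                   # days before the election
margin <- -2 + 0.03 * days + rnorm(60, sd = 1.5) # fake margins, drifting upward
fit <- smooth.spline(days, margin, df = 4)
predict(fit, x = 0)$y                            # trend carried to election day
```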

Here are some plots showing results from two of our top performers:

[Plots: Rplot, Rplot01]

I hope this exercise helped students realize that data science can be both fun and useful. I can't wait to do this again in 2016.


10
Nov

Sunday data/statistics link roundup (11/9/14)

So I'm a day late, but you know, I got a new kid and stuff...

  1. The New Yorker hating on MOOCs; they mention all the usual stuff, including the really poorly designed San Jose State experiment. I think this deserves a longer post, but this is definitely a case where people are looking at MOOCs on the wrong part of the hype curve. MOOCs won't solve all possible education problems, but they are hugely helpful to many people, and writing them off is a little silly (via Rafa).
  2. My colleague Dan S. is teaching a missing data workshop here at Hopkins next week (via Dan S.).
  3. A couple of cool YouTube videos explaining how the normal distribution sounds and the Pareto principle with paperclips (via Presh T.; pair with the 80/20 rule of statistical methods development).
  4. If you aren't following Research Wahlberg, you aren't on academic twitter.
  5. I followed #biodata14 closely. I think having a meeting on Biological Big Data is a great idea, and many of the discussion leaders are people I admire a ton. I am also a big fan of Mike S. I have to say I was pretty bummed that more statisticians weren't invited (we like to party too!).
  6. Our data science specialization generates almost 1,000 new R GitHub repos a month! Roger and I are in a neck-and-neck race to be the person who has taught the most people statistics/data science in the history of the world.
  7. The RStudio folks have also put together what looks like a great course similar in spirit to our Data Science Specialization. They have been *super* supportive of the DSS, and we assume anything they make will be awesome.
  8. Congrats to Data Carpentry and Tracy Teal on their funding from the Moore Foundation!

05
Nov

Time varying causality in n=1 experiments with applications to newborn care

We just had our second son about a week ago and I've been hanging out at home with him and the rest of my family. It has reminded me of a few things from when we had our first son. First, newborns are tiny and super-duper adorable. Second, daylight saving time means gaining an extra hour of sleep for many people, but for people with young children it is more like this (via Reddit):

[image]

Third, taking care of a newborn is like performing a series of n=1 experiments where the causal structure of the problem changes every time you perform an experiment.

Suppose, hypothetically, that your newborn has just had something to eat and it is 2 a.m. (again, just hypothetically). You are hoping he'll go back down to sleep so you can catch some shut-eye yourself. But your baby just can't sleep and seems uncomfortable. Here is a partial list of possible causes: (1) dirty diaper, (2) needs to burp, (3) still hungry, (4) not tired, (5) overtired, (6) has gas, (7) just chillin'. So you start going down the list and trying to address each potential cause of late-night sleeplessness: (1) check diaper, (2) try burping, (3) feed him again, etc. Then, miraculously, one works and the little guy falls asleep.

It is interesting how the natural human reaction to this is to reorder the potential causes of sleeplessness and, the next time, start with the thing that worked. Then you often get frustrated when the same thing doesn't work again. You can't help it: you did an experiment, you have some data, you want to use it. But the reality is that the next time may have nothing to do with the first.

I'm in the process of amassing some very poorly annotated data, collected exclusively at night, if anyone wants to write a dissertation on this problem.

04
Nov

538 election forecasts made simple

Nate Silver does a great job of explaining his forecast model to laypeople. However, as a statistician I've always wanted to know more details. After preparing a "predict the midterm elections" homework for my data science class, I have a better idea of what is going on.

Here is my best attempt at explaining the ideas of 538 using formulas and data. And here is the R markdown.


02
Nov

Sunday data/statistics link roundup (11/2/14)

Better late than never! If you have something cool to share, please continue to email it to me with subject line "Sunday links".

  1. DrivenData is a Kaggle-like site, but for social good. I like the principle of using data for societal benefit, since there are so many ways it seems to be used for nefarious purposes (via Rafa).
  2. This article claiming academic science isn't sexist has been widely panned; Emily Willingham pretty much destroys it here (via Sherri R.). The interesting thing about the article is the way it tries to use data to give the appearance of empiricism while using language to skew the results. Is it just me, or is this totally bizarre in light of the NYT also publishing this piece about academic sexual harassment at Yale?
  3. Noah Smith, an economist, tries to summarize the problem with "most research being wrong". It is an interesting take; I wonder if he read Roger's piece saying almost exactly the same thing like the week before? He also mentions that it is hard to quantify the rate of false discoveries in science; maybe he should read our paper?
  4. Nature now requests that code sharing occur "where possible" (via Steven S.).
  5. Great movie scientist versus real scientist cartoons; I particularly like the one about replication (via Steven S.).

28
Oct

Why I support statisticians and their resistance to hype

Despite Statistics being the most mature data-related discipline, statisticians have not fared well in terms of being selected for funding or leadership positions in the new initiatives brought about by the increasing interest in data. To give just one example (Jeff and Terry Speed give many more), the White House Big Data Partners Workshop had 19 members, of whom 0 were statisticians. The statistical community is clearly worried about this predicament, and there is widespread consensus that we need to be better at marketing. Although I agree that only good can come from better communicating what we do, it is also important to continue doing one of the things we do best: resisting the hype and being realistic about data.

This week, after reading Mike Jordan's Reddit Ask Me Anything, I was reminded of exactly how much I admire this quality in statisticians. From the interview one learns about instances where hype has led to confusion, and how getting past this confusion helps us better understand, and consequently appreciate, the importance of his field. For the past 30 years, Mike Jordan has been one of the most prolific academics working in the areas that today are receiving increased attention. Yet you won't find a hyped-up press release coming out of his lab. In fact, when a journalist tried to hype up Jordan's critique of hype, Jordan called out the author.

Assessing the current situation with data initiatives, it is hard not to conclude that hype is being rewarded. Many statisticians have come to the sad realization that by being cautious and skeptical, we may be losing out on funding possibilities and leadership roles. However, I remain very much upbeat about our discipline. First, being skeptical and cautious has actually led to many important contributions. An important example is how randomized controlled experiments changed how medical procedures are evaluated. A more recent one is the concept of FDR, which helps control false discoveries in, for example, high-throughput experiments. Second, many of us continue to work at the interface with real-world applications, placing us in a good position to make relevant contributions. Third, despite the failures alluded to above, we continue to successfully find ways to fund our work. Although resisting the hype has cost us in the short term, we will continue to produce methods that will be useful in the long term, as we have been doing for decades. Our methods will still be used when today's hyped-up press releases are long forgotten.


26
Oct

Return of the Sunday links! (10/26/14)

New look for the blog, and the links are back! If you have something that you'd like included in the Sunday links, email me and let me know. If you use the subject line "Sunday Links", it will be easier for me to find your message when I search my Gmail.

  1. Thomas L. does a more technical post on semi-parametric efficiency. Normally I'm a data n' applications guy, but I love these in-depth posts, especially when the papers remind me of all the things I studied at my alma mater.
  2. I am one of those people who only knows a tiny bit about Docker but hears about it all the time. That being said, after I read about Rocker, I got pretty excited.
  3. Hadley W.'s favorite tools; seems like that dude likes RStudio for some reason... (me too).
  4. A cool visualization of chess piece survival rates.
  5. A short movie by 538 about statistics and the battle between Deep Blue and Garry Kasparov. Where's the popcorn?
  6. Twitter engineering released an R package for detecting outbreaks. I wonder how circular binary segmentation would do?


24
Oct

An interactive visualization to teach about the curse of dimensionality

I was recently contacted for an interview about the curse of dimensionality. During the course of the conversation, I realized how hard it is to explain the curse to a general audience. One of the best descriptions I could come up with was to imagine sampling points uniformly from a unit line, square, cube, and so on, and counting how many fall in a sub-region of fixed side length: as the dimension grows, you capture fewer and fewer of the points. As I was saying this, I realized it is a pretty bad way to explain the curse of dimensionality in words, but that a data visualization could illustrate the idea well. I went to my student Prasad, our resident interactive viz design expert, to see if he could build it for me. He came up with a cool Shiny app where you can simulate a number of points (n), then fix a side length for 1-D, 2-D, 3-D, and 4-D and see how many points you capture in a cube of that side length in each dimension. You can find the full app here or check it out on the blog here:

[embedded Shiny app]
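
For readers who just want the numbers, here is a minimal R simulation of the same phenomenon (my sketch, not the code behind Prasad's app). The expected fraction of uniform points captured in a corner hypercube of side s is s^d, so it collapses as the dimension d grows:

```r
# Capture fraction in [0, s]^d for uniform points in the unit d-cube.
set.seed(1)
n <- 10000
s <- 0.5
for (d in 1:4) {
  x <- matrix(runif(n * d), ncol = d)  # n points in the unit d-cube
  captured <- rowSums(x < s) == d      # does the point fall inside [0, s]^d ?
  cat(sprintf("d = %d: captured %.3f (expected %.3f)\n",
              d, mean(captured), s^d))
}
```

With s = 0.5 you capture about half the points in 1-D but only about 6% of them in 4-D.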

22
Oct

Vote on Simply Statistics' new logo design

As you can tell, we have given the Simply Stats blog a little style update. It should now be more readable on phones and tablets. We are also about to get a new logo. We are down to the last couple of choices and can't decide. Since we are statisticians, we thought we'd collect some data. Here is the link to the poll. Let us know what you think.

16
Oct

Creating the field of evidence based data analysis - do people know what a p-value looks like?

In the medical sciences, there is a discipline called "evidence based medicine". The basic idea is to study the actual practice of medicine using experimental techniques. The reason is that while we may have good experimental evidence about specific medicines or practices, the global behavior and execution of medical practice may also matter. There have been some success stories from this approach and also backlash from physicians who don't like to be told how to practice medicine. However, on the whole it is a valuable and interesting scientific exercise.

Roger introduced the idea of evidence based data analysis in a previous post. The basic idea is to study the actual practice of data analysis and how analysts behave. There is a strong history of this type of research within the data visualization community, starting with Bill Cleveland and extending forward to work by Dianne Cook, Jeffrey Heer, and others.

Today we published a large-scale randomized trial in evidence based data analysis. Two of the most common data analysis tasks (for better or worse) are exploratory analysis and the identification of statistically significant results. Di Cook's group calls this idea "graphical inference" or "visual significance"; they have studied humans' ability to detect significance in the context of linear regression, and how that ability associates with demographics and visual characteristics of the plot.

We performed a randomized study to determine whether data analysts with basic training could identify statistically significant relationships. Or, as the first author put it in a tweet:

[embedded tweet]

What we found was that people were pretty bad at detecting statistically significant results, but that they improved over multiple trials. This is a tentative first step toward understanding how the general practice of data analysis works. If you want to play around and see how good you are at spotting p-values, we also built this interactive Shiny app. If you don't see the app embedded, you can also go to the Shiny app page here.
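
For intuition about the task, here is a small hypothetical R simulation (not the study's actual design): how often does a simple linear regression on a scatterplot reach p < 0.05 when there is no relationship versus a modest real one?

```r
# p-values from scatterplots with and without a real linear relationship.
set.seed(2014)
n <- 50                                  # points per scatterplot
sim_p <- function(slope) {
  x <- runif(n)
  y <- slope * x + rnorm(n)              # slope = 0 gives a null plot
  summary(lm(y ~ x))$coefficients["x", "Pr(>|t|)"]
}
null_p <- replicate(1000, sim_p(0))
real_p <- replicate(1000, sim_p(1))
mean(null_p < 0.05)  # false positive rate, close to 0.05 by construction
mean(real_p < 0.05)  # power of the formal test, a benchmark for visual detection
```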