06
Mar

The importance of simulating the extremes

Simulation is commonly used by statisticians/data analysts to: (1) estimate variability/improve predictors, (2) to evaluate the space of potential outcomes, and (3) to evaluate the properties of new algorithms or procedures. Over the last couple of days, discussions of simulation have popped up in a couple of different places.

First, the reviewers of a paper that my student is working on had asked a question about the behavior of the method in different conditions. I mentioned in passing, that I thought it was a good idea to simulate some cases where our method will definitely break down.

I also saw this post by John Cook about simple/complex models. He raises the really important point that increasingly complex models built on a canonical, small, data set can fool you. You can make the model more and more complicated - but in other data sets the assumptions might not hold and the model won't generalize. Of course, simple models can have the same problems, but generally simple models will fail on small data sets in the same way they would fail on larger data sets (in my experience) - either they work or they don't.

These two ideas got me thinking about why I like simulation. Some statisticians, particularly applied statisticians, aren't fond of simulation for evaluating methods. I think the reason is that you can always simulate a situation that meets all of your assumptions and make your approach look good. Real data rarely conform to model assumptions and so are harder to "trick". On the other hand, I really like simulation, it can reveal a lot about how and when a method will work well and it allows you to explore scenarios - particularly for new or difficult to obtain data.

Here are the simulations I like to see:

1. Simulation where the assumptions are true There are a surprising number of proposed methods/analysis procedures/analyses that fail or perform poorly even when the model assumptions hold. This could be because the methods overfit, have a bug, are computationally unstable, are on the wrong place on the bias/variance tradeoff curve, etc. etc. etc. I always do at least one simulation for every method where the answer should be easy to get, because I know if I don't get the right answer, it is back to the drawing board.
2. Simulation where things should definitely fail I like to try out a few realistic scenarios where I'm pretty sure my model assumptions won't hold and the method should fail. This kind of simulation is good for two reasons: (1) sometimes I'm pleasantly surprised and the model will hold up and (2) (the more common scenario) I can find out where the model assumption boundaries are so that I can give concrete guidance to users about when/where the method will work and when/where it will fail.

The first type of simulation is easy to come up with - generally you can just simulate from the model. The second type is much harder. You have to creatively think about reasonable ways that your model can fail. I've  found that using real data for simulations can be the best way to start coming up with ideas to try - but I usually find that it is worth building on those ideas to imagine even more extreme circumstances. Playing the evil demon for my own methods often leads me to new ideas/improvements I hadn't thought of before. It also helps me to evaluate the work of other people - since I've tried to explore the contexts where methods likely fail.

In any case, if you haven't simulated the extremes I don't think you really know how your methods/analysis procedures are working.

30
Oct

On weather forecasts, Nate Silver, and the politicization of statistical illiteracy

As you know, we have a thing for statistical literacy here at Simply Stats. So of course this column over at Politico got our attention (via Chris V. and others). The column is an attack on Nate Silver, who has a blog where he tries to predict the outcome of elections in the U.S., you may have heard of it…

The argument that Dylan Byers makes in the Politico column is that Nate Silver is likely to be embarrassed by the outcome of the election if Romney wins. The reason is that Silver’s predictions have suggested Obama has a 75% chance to win the election recently and that number has never dropped below 60% or so.

I don’t know much about Dylan Byers, but from reading this column and a quick scan of his twitter feed, it appears he doesn’t know much about statistics. Some people have gotten pretty upset at him on Twitter and elsewhere about this fact, but I’d like to take a different approach: education. So Dylan, here is a really simple example that explains how Nate Silver comes up with a number like the 75% chance of victory for Obama.

Let’s pretend, just to make the example really simple, that if Obama gets greater than 50% of the vote, he will win the election. Obviously, Silver doesn’t ignore the electoral college and all the other complications, but it makes our example simpler. Then assume that based on averaging a bunch of polls  we estimate that Obama is likely to get about 50.5% of the vote.

Now, we want to know what is the “percent chance” Obama will win, taking into account what we know. So let’s run a bunch of “simulated elections” where on average Obama gets 50.5% of the vote, but there is variability because we don’t have the exact number. Since we have a bunch of polls and we averaged them, we can get an estimate for how variable the 50.5% number is. The usual measure of variance is the standard deviation. Say we get a standard deviation of 1% for our estimate. That would be a pretty accurate number, but not totally unreasonable given the amount of polling data out there.

We can run 1,000 simulated elections like this in R* (a free software programming language, if you don’t know R, may I suggest Roger’s Computing for Data Analysis class?). Here is the code to do that. The last line of code calculates the percent of times, in our 1,000 simulated elections, that Obama wins. This is the number that Nate would report on his site. When I run the code, I get an Obama win 68% of the time (Obama gets greater than 50% of the vote). But if you run it again that number will vary a little, since we simulated elections.

The interesting thing is that even though we only estimate that Obama leads by about 0.5%, he wins 68% of the simulated elections. The reason is that we are pretty confident in that number, with our standard deviation being so low (1%). But that doesn’t mean that Obama will win 68% of the vote in any of the elections! In fact, here is a histogram of the percent of the vote that Obama wins:

He never gets more than 54% or so and never less than 47% or so. So it is always a reasonably close election. Silver’s calculations are obviously more complicated, but the basic idea of simulating elections is the same.

Now, this might seem like a goofy way to come up with a “percent chance” with simulated elections and all. But it turns out it is actually a pretty important thing to know and relevant to those of us on the East Coast right now. It turns out weather forecasts (and projected hurricane paths) are based on the same sort of thing - simulated versions of the weather are run and the “percent chance of rain” is the fraction of times it rains in a particular place.

So Romney may still win and Obama may lose - and Silver may still get a lot of it right. But regardless, the approach taken by Silver is not based on politics, it is based on statistics. Hopefully we can move away from politicizing statistical illiteracy and toward evaluating the models for the real, underlying assumptions they make.

* In this case, we could calculate the percent of times Obama would win with a formula (called an analytical calculation) since we have simplified so much. In Nate’s case it is much more complicated, so you have to simulate.

21
Oct

The self-assessment trap

Several months ago I was sitting next to my colleague Ben Langmead at the Genome Informatics meeting. Various talks were presented on short read alignments and every single performance table showed the speaker’s method as #1 and Ben’s Bowtie as #2 among a crowded field of lesser methods. It was fun to make fun of Ben for getting beat every time, but the reality was that all I could conclude was that Bowtie was best and speakers were falling into the the self-assessment trap: each speaker had tweaked the assessment to make their method look best. This practice is pervasive in Statistics where easy-to-tweak Monte Carlo simulations are commonly used to assess performance. In a recent paper, a team at IBM described how the problem in the systems biology literature is pervasive as well. Co-author Gustavo Stolovitzky Stolovitsky is a co-developer of the DREAM challenge in which the assessments are fixed and developers are asked to submit. About 7 years ago we developed affycomp, a comparison webtool for microarray preprocessing methods. I encourage others involved in fields where methods are constantly being compared to develop such tools. It’s a lot of work, but journals are usually friendly to papers describing the results of such competitions.

Related Posts:  Roger on colors in R, Jeff on battling bad science