Tag: cook


The importance of simulating the extremes

Simulation is commonly used by statisticians/data analysts to: (1) estimate variability/improve predictors, (2) to evaluate the space of potential outcomes, and (3) to evaluate the properties of new algorithms or procedures. Over the last couple of days, discussions of simulation have popped up in a couple of different places.

First, the reviewers of a paper that my student is working on had asked a question about the behavior of the method in different conditions. I mentioned in passing, that I thought it was a good idea to simulate some cases where our method will definitely break down.

I also saw this post by John Cook about simple/complex models. He raises the really important point that increasingly complex models built on a canonical, small, data set can fool you. You can make the model more and more complicated - but in other data sets the assumptions might not hold and the model won't generalize. Of course, simple models can have the same problems, but generally simple models will fail on small data sets in the same way they would fail on larger data sets (in my experience) - either they work or they don't.

These two ideas got me thinking about why I like simulation. Some statisticians, particularly applied statisticians, aren't fond of simulation for evaluating methods. I think the reason is that you can always simulate a situation that meets all of your assumptions and make your approach look good. Real data rarely conform to model assumptions and so are harder to "trick". On the other hand, I really like simulation, it can reveal a lot about how and when a method will work well and it allows you to explore scenarios - particularly for new or difficult to obtain data.

Here are the simulations I like to see:

  1. Simulation where the assumptions are true There are a surprising number of proposed methods/analysis procedures/analyses that fail or perform poorly even when the model assumptions hold. This could be because the methods overfit, have a bug, are computationally unstable, are on the wrong place on the bias/variance tradeoff curve, etc. etc. etc. I always do at least one simulation for every method where the answer should be easy to get, because I know if I don't get the right answer, it is back to the drawing board.
  2. Simulation where things should definitely fail I like to try out a few realistic scenarios where I'm pretty sure my model assumptions won't hold and the method should fail. This kind of simulation is good for two reasons: (1) sometimes I'm pleasantly surprised and the model will hold up and (2) (the more common scenario) I can find out where the model assumption boundaries are so that I can give concrete guidance to users about when/where the method will work and when/where it will fail.

The first type of simulation is easy to come up with - generally you can just simulate from the model. The second type is much harder. You have to creatively think about reasonable ways that your model can fail. I've  found that using real data for simulations can be the best way to start coming up with ideas to try - but I usually find that it is worth building on those ideas to imagine even more extreme circumstances. Playing the evil demon for my own methods often leads me to new ideas/improvements I hadn't thought of before. It also helps me to evaluate the work of other people - since I've tried to explore the contexts where methods likely fail.

In any case, if you haven't simulated the extremes I don't think you really know how your methods/analysis procedures are working.


The value of re-analysis

I just saw this really nice post over on John Cook's blog. He talks about how it is a valuable exercise to re-type code for examples you find in a book or on a blog. I completely agree that this is a good way to learn through osmosis, learn about debugging, and often pick up the reasons for particular coding tricks (this is how I learned about vectorized calculations in Matlab, by re-typing and running my advisors code back in my youth).

In a more statistical version of this idea, Gary King has proposed reproducing the analysis in a published paper as a way to get a paper of your own.  You can figure out the parts that a person did well and the parts that you would do differently, maybe finding enough insight to come up with your own new paper. But I think this level of replication involves actually two levels of thinking:

  1. Can you actually reproduce the code used to perform the analysis?
  2. Can you solve the "paper as puzzle" exercise proposed by Ethan Perlstein over at his site. Given the results in the paper, can you come up with the story?

Both of these things require a bit more "higher level thinking" than just re-running the analysis if you have the code. But I think even the seemingly "low-level" task of just retyping and running the code that is used to perform a data analysis can be very enlightening. The problem is that this code, in many cases, does not exist. But that is starting to change. If you check out Rpubs or RunMyCode or even the right parts of Figshare you can find data analyses you can run through and reproduce.

The only downside is there is currently no measure of quality on these published analyses. It would be great if people could focus their time re-typing only good data analyses, rather than one at random. Or, as a guy once (almost) said, "Data analysis practice doesn't make perfect, perfect data analysis practice makes perfect."