Simply Statistics


Interview with Emily Oster

Emily Oster is an Associate Professor of Economics at Brown University. She is a frequent and highly respected contributor to 538, where she brings clarity to areas of interest to parents, pregnant women, and the general public where empirical research is conflicting or difficult to interpret. She is also the author of the popular new book about pregnancy: Expecting Better: Why the Conventional Pregnancy Wisdom Is Wrong--and What You Really Need to Know. We interviewed Emily as part of our ongoing interview series with exciting empirical data scientists.
SS: Do you consider yourself an economist, econometrician, statistician, data scientist or something else?
EO: I consider myself an empirical economist. I think my econometrics colleagues would have a hearty laugh at the idea that I'm an econometrician! The questions I'm most interested in tend to have a very heavy empirical component - I really want to understand what we can learn from data. In this sense, there is a lot of overlap with statistics. But at the end of the day, the motivating questions and the theories of behavior I want to test come straight out of economics.
SS: You are a frequent contributor to 538. Many of your pieces are attempts to demystify often conflicting sets of empirical research (about concussions and suicide, or the dangers of water fluoridation). What would you say are the issues that make empirical research about these topics most difficult?
EO: In nearly all the cases, I'd summarize the problem as: "The data isn't good enough." Sometimes this is because we only see observational data, not anything randomized. A large share of studies using observational data that I discuss have serious problems with either omitted variables or reverse causality (or both). This means that the results are suggestive, but really not conclusive. A second issue is that even when we do have some randomized data, it's usually on a particular population, or a small group, or in the wrong time period. In the fluoride case, the studies which come closest to being "randomized" are from 50 years ago. How do we know they still apply now? This makes even these studies challenging to interpret.
SS: Your recent book "Expecting Better: Why the Conventional Pregnancy Wisdom Is Wrong--and What You Really Need to Know" takes a similar approach to pregnancy. Why do you think there are so many conflicting studies about pregnancy? Is it because it is so hard to perform randomized studies?
EO: I think the inability to run randomized studies is a big part of this, yes. One area of pregnancy where the data is actually quite good is labor and delivery. If you want to know the benefits and consequences of pain medication in labor, for example, it is possible to point you to some reasonably sized randomized trials. For various reasons, there has been more willingness to run randomized studies in this area. When pregnant women want answers to less medical questions (like, "Can I have a cup of coffee?") there is typically no randomized data to rely on. Because the possible benefits of drinking coffee while pregnant are pretty much nil, it is difficult to conceptualize a randomized study of this type of thing.
Another big issue I found in writing the book was that even in cases where the data was quite good, data often diverges from practice. This was eye-opening for me and convinced me that in pregnancy (and probably in other areas of health) people really do need to be their own advocates and know the data for themselves.
SS: Have you been surprised about the backlash to your book for your discussion of the zero-alcohol policy during pregnancy? 
EO: A little bit, yes. This backlash has died down a lot as pregnant women actually read the book and use it. As it turns out, the discussion of alcohol makes up a tiny fraction of the book and most pregnant women are more interested in the rest of it!  But certainly when the book came out this got a lot of focus. I suspected it would be somewhat controversial, although the truth is that every OB I actually talked to told me they thought it was fine. So I was surprised that the reaction was as sharp as it was.  I think in the end a number of people felt that even if the data were supportive of this view, it was important not to say it because of the concern that some women would over-react. I am not convinced by this argument.
SS: What are the three most important statistical concepts for new mothers to know? 
EO: I really only have two!
I think the biggest thing is to understand the difference between randomized and non-randomized data and to have some sense of the pitfalls of non-randomized data. I reviewed studies of alcohol where the drinkers were twice as likely as non-drinkers to use cocaine. I think people (pregnant or not) should be able to understand why one is going to struggle to draw conclusions about alcohol from these data.
A second issue is the concept of probability. It is easy to say, "There is a 10% chance of the following" but do we really understand that? If someone quotes you a 1 in 100 risk from a procedure, it is important to understand the difference between 1 in 100 and 1 in 400.  For most of us, those seem basically the same - they are both small. But they are not, and people need to think of ways to structure decision-making that acknowledge these differences.
SS: What computer programming language is most commonly taught for data analysis in economics? 
EO: So, I think the majority of empirical economists use Stata. I have been seeing more R, as well as a variety of other things, but more commonly among people who do heavier computational fields.
SS: Do you have any advice for young economists/statisticians who are interested in empirical research? 
EO: 1. Work on topics that interest you. As an academic you will ultimately have to motivate yourself to work. If you aren't interested in your topic (at least initially!), you'll never succeed.
2. One project which is 100% done is way better than five projects at 80%. You need to actually finish things, something which many of us struggle with.
3. Presentation matters. Yes, the substance is the most important thing, but don't discount the importance of conveying your ideas well.

Repost: Statistical illiteracy may lead to parents panicking about Autism

Editor's Note: This is a repost of a previous post on our blog from 2012. The repost is inspired by similar issues with statistical illiteracy that are coming up in allergy screening and pregnancy screening

I was just doing my morning reading of a few news sources and stumbled across this Huffington Post article talking about research correlating babies' cries to autism. It suggests that the sound of a baby's cries may predict their future risk for autism. As the parent of a young son, this obviously caught my attention in a very lizard-brain, caveman sort of way. I couldn't find a link to the research paper in the article so I did some searching and found out this result is also being covered by Time, Science Daily, Medical Daily, and a bunch of other news outlets.

Now thoroughly freaked, I looked online and found the pdf of the original research article. I started looking at the statistics and took a deep breath. Based on the analysis they present in the article there is absolutely no statistical evidence that a baby's cries can predict autism. Here are the flaws with the study:

  1. Small sample size. The authors recruited only 21 at-risk infants and 18 healthy infants. Then, because of data processing issues, they ended up analyzing only 7 high autistic-risk versus 5 low autistic-risk infants in one analysis and 10 versus 6 in another. That is nowhere near a representative sample and barely qualifies as a pilot study.
  2. Major and unavoidable confounding. The way the authors determined high autistic risk versus low risk was based on whether an older sibling had autism. Leaving aside the quality of this metric for measuring risk of autism, there is a major confounding factor: the families of the high risk children all had an older sibling with autism and the families of the low risk children did not! It would not be surprising at all if children with one autistic older sibling might get a different kind of attention and hence cry differently regardless of their potential future risk of autism.
  3. No correction for multiple testing. This is one of the oldest problems in statistical analysis. It is also a consistent culprit behind false positives in epidemiology studies. XKCD even did a cartoon about it! They tested 9 variables measuring the way babies cry and tested each one with a statistical hypothesis test. They did not correct for multiple testing. So I gathered the resulting p-values and did the correction for them. It turns out that after adjusting for multiple comparisons, nothing is significant at the usual P < 0.05 level, which would probably have prevented publication.
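
To make the multiple-testing point concrete, here is a minimal Python sketch of two common adjustments, Bonferroni and Benjamini-Hochberg. The nine p-values below are made up for illustration; they are not the values from the paper, and the function names are just my own. The post does not say which correction was used, so treat this as one plausible way to do it rather than the actual analysis.

```python
def bonferroni(pvalues):
    """Multiply each p-value by the number of tests, capping at 1."""
    m = len(pvalues)
    return [min(1.0, p * m) for p in pvalues]


def benjamini_hochberg(pvalues):
    """Benjamini-Hochberg adjusted p-values (controls the false discovery rate)."""
    m = len(pvalues)
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# HYPOTHETICAL p-values for 9 tests; NOT the values from the study.
raw = [0.03, 0.04, 0.045, 0.20, 0.35, 0.50, 0.62, 0.80, 0.91]

bonf = bonferroni(raw)
bh = benjamini_hochberg(raw)

print("significant before correction:", sum(p < 0.05 for p in raw))
print("significant after Bonferroni: ", sum(p < 0.05 for p in bonf))
print("significant after BH:         ", sum(p < 0.05 for p in bh))
```

With these made-up inputs, three tests look significant on their own and none survive either correction, which is the same pattern described in the list above.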

Taken together, these problems mean that the statistical analysis of these data does not show any connection between crying and autism.

The problem here exists on two levels. First, there was a failing in the statistical evaluation of this manuscript at the peer review level. Most statistical referees would have spotted these flaws and pointed them out for such a highly controversial paper. A second problem is that news agencies report on this result and, despite paying lip-service to potential limitations, are not statistically literate enough to point out the major flaws in the analysis that reduce the probability of a true positive. Should journalists have some minimal training in statistics that allows them to determine whether a result is likely to be a false positive, to save us parents a lot of panic?



A non-comprehensive list of awesome things other people did in 2014

Editor's Note: Last year I made a list off the top of my head of awesome things other people did. I loved doing it so much that I'm doing it again for 2014. Like last year, I have surely missed awesome things people have done. If you know of some, you should make your own list or add it to the comments! The rules remain the same. I have avoided talking about stuff I worked on or that people here at Hopkins are doing because this post is supposed to be about other people's awesome stuff. I wrote this post because a blog often feels like a place to complain, but we started Simply Stats as a place to be pumped up about the stuff people were doing with data. Update: I missed pipes in R, now added!


  1. I'm copying everything about Jenny Bryan's amazing Stat 545 class in my data analysis classes. It is one of my absolute favorite open online sets of notes on data analysis.
  2. Ben Baumer, Mine Cetinkaya-Rundel, Andrew Bray, Linda Loi, Nicholas J. Horton wrote this awesome paper on integrating R markdown into the curriculum. I love the stuff that Mine and Nick are doing to push data analysis into undergrad stats curricula.
  3. Speaking of those folks, the undergrad guidelines for stats programs put out by the ASA do an impressive job of balancing the advantages of statistics and the excitement of modern data analysis.
  4. Somebody tell Hector Corrada Bravo to stop writing so many awesome papers. He is making us all look bad. His epiviz paper is great and you should go start using the Bioconductor package if you do genomics.
  5. Hilary Mason founded fast forward labs. I love the business model of translating cutting edge academic (and otherwise) knowledge to practice. I am really pulling for this model to work.
  6. As far as I can tell 2014 was the year that causal inference became the new hotness. One example of that is this awesome paper from the Google folks on trying to infer causality from related time series. The R package has some cool features too. I definitely am excited to see all the new innovation in this area.
  7. Hadley was Hadley.
  8. Rafa and Mike taught an awesome class on data analysis for genomics. They also created a book on Github that I think is one of the best introductions to the statistics of genomics that exists so far.
  9. Hilary Parker wrote this amazing introduction to writing R packages that took the twitterverse by storm. It is perfectly written for people who are just at the point of being able to create their own R package. I think it probably generated 100+ R packages just by being so easy to follow.
  10. Oh you're not reading StatsChat yet? For real?
  11. FiveThirtyEight launched. Despite some early bumps they have done some really cool stuff. Loved the recent piece on the beer mile and I read every piece that Emily Oster writes. She does an amazing job of explaining pretty complicated statistical topics to a really broad audience.
  12. David Robinson's broom package is one of my absolute favorite R packages that was built this year. One of the most annoying things about R is the variety of outputs different models give and this tidy version makes it really easy to do lots of neat stuff.
  13. Chung and Storey introduced the jackstraw which is both a very clever idea and the perfect name for a method that can be used to identify variables associated with principal components in a statistically rigorous way.
  14. I rarely dig excel-type replacements, but the simplicity of makes me love it. It does one thing and one thing really well.
  15. The hipsteR package for teaching old R dogs new tricks is one of the many cool things Karl Broman did this year. I read all of his tutorials and never cease to learn stuff. In related news if I was 1/10th as organized as that dude I'd actually you know, get stuff done.
  16. Whether or not I agree that they should be allowed to do unregulated human subjects research, statistics at tech companies, and in particular randomized experiments, have never been hotter. The boldest of the bunch is OKCupid, who write blog posts with titles like, "We experiment on human beings!"
  17. In related news, I love the PlanOut project by the folks over at Facebook, so cool to see an open source approach to experimentation at web scale.
  18. No wonder Mike Jordan (no not that Mike Jordan) is such a superstar. His reddit AMA raised my respect for him from already super high levels. First, it's awesome that he did it, and second it is amazing how well he articulates the relationship between CS and Stats.
  19. I'm trying to figure out a way to get Matthew Stephens to write more blog posts. He teased us with the Dynamic Statistical Comparisons post and then left us hanging. The people demand more Matthew.
  20. Di Cook also started a new blog in 2014. She was also part of this cool exploratory data analysis event for the UN. They have a monster program going over there at Iowa State, producing some amazing research and a bunch of students that are recognizable by one name (Yihui, Hadley, etc.).
  21. Love this paper on sure screening of graphical models out of Daniela Witten's group at UW. It is so cool when a simple idea ends up being really well justified theoretically, it makes the world feel right.
  22. I'm sure this actually happened before 2014, but the Bioconductor folks are still the best open source data science project that exists in my opinion. My favorite development I started using in 2014 is the git-subversion bridge that lets me update my Bioc packages with pull requests.
  23. rOpenSci ran an awesome hackathon. The lineup of people they invited was great and I loved the commitment to a diverse group of junior R programmers. I really, really hope they run it again.
  24. Dirk Eddelbuettel and Carl Boettiger continue to make bigtime contributions to R. This time it is Rocker, with Docker containers for R. I think this could be a reproducibility/teaching gamechanger.
  25. Regina Nuzzo brought the p-value debate to the masses. She is also incredible at communicating pretty complicated statistical ideas to a broad audience and I'm looking forward to more stats pieces by her in the top journals.
  26. Barbara Engelhardt keeps rocking out great papers. But she is also one of the best AE's I have ever had handle a paper for me at PeerJ. Super efficient, super fair, and super demanding. People don't get enough credit for being amazing in the peer review process and she deserves it.
  27. Ben Goldacre and Hans Rosling continue to be two of the best advocates for statistics and the statistical discipline - I'm not sure either claims the title of statistician but they do a great job anyway. This piece about Professor Rosling in Science gives some idea about the impact a statistician can have on the most current problems in public health. Meanwhile, I think Dr. Goldacre does a great job of explaining how personalized medicine is an information science in this piece on statins in the BMJ.
  28. Michael Lopez's series of posts on graduate school in statistics should be 100% required reading for anyone considering graduate school in statistics. He really nails it.
  29.  Trey Causey has an equally awesome Getting Started in Data Science post that I read about 10 times.
  30. Drop everything and go read all of Philip Guo's posts. Especially this one about industry versus academia or this one on the practical reason to do a PhD.
  31. The top new Twitter feed of 2014 has to be @ResearchMark (incidentally I'm still mourning the disappearance of @STATSHULK).
  32. Stephanie Hicks' blog combines recipes for delicious treats and statistics, also I thought she had a great summary of the Women in Stats (#WiS2014) conference.
  33. Emma Pierson is a Rhodes Scholar who wrote for 538, 23andMe, and a bunch of other major outlets as an undergrad. Her blog is another must read. Here is an example of her awesome work on how different communities ignored each other on Twitter during the Ferguson protests.
  34. The Rstudio crowd continues to be on fire. I think they are a huge part of the reason that R is gaining momentum. It wouldn't be possible to list all their contributions (or it would be an Rstudio exclusive list) but I really like Packrat and R markdown v2.
  35. Another huge reason for the movement with R has been the outreach and development efforts of the Revolution Analytics folks. The Revolutions blog has been a must read this year.
  36. Julian Wolfson and Joe Koopmeiners at University of Minnesota are straight up gamers. They live streamed their recruiting event this year. One way I judge good ideas is by how mad I am I didn't think of it and this one had me seeing bright red.
  37. This is just an awesome paper comparing lots of machine learning algorithms on lots of data sets. Random forests win, and this is a nice update of one of my favorite papers of all time: Classifier technology and the illusion of progress.
  38. Pipes in R! This stuff is for real. The piping functionality created by Stefan Milton and Hadley is one of the few inventions over the last several years that immediately changed whole workflows for me.


I'll let @ResearchMark take us out:


Sunday data/statistics link roundup (12/14/14)

  1. A very brief analysis suggests that economists are impartial when it comes to their liberal/conservative views. That being said, I'm not sure the regression line says what they think it does, particularly if you pay attention to the variance around the line (via Rafa).
  2. I am digging the simplicity of from the folks at Medium. But I worry about spurious correlations everywhere. I guess I should just let that ship sail.
  3. FiveThirtyEight does a run down of the beer mile. If they set up a data crunchers beer mile, we are in.
  4. I love it when Thomas Lumley interviews himself about silly research studies and particularly their associated press releases. I can actually hear his voice in my head when I read them. This time the lipstick/IQ silliness gets Lumleyed.
  5. Jordan was better than Kobe. Surprise. Plus Rafa always takes the Kobe bait.
  6. Matlab/Python/R translation cheat sheet (via Stephanie H.).
  7. If I've said it once, I've said it a million times, statistical thinking is now as important as reading and writing. The latest example: parents not understanding the difference between sensitivity and the predictive value of a positive test may be leading to unnecessary abortions (via Dan M./Rafa).
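
For anyone fuzzy on the sensitivity versus predictive value distinction in the last item, here is a small Python sketch using Bayes' rule. The test characteristics and prevalence are hypothetical numbers I picked for illustration, not figures from the linked story, and the function name is my own.

```python
def positive_predictive_value(sensitivity, specificity, prevalence):
    """P(condition | positive test), computed with Bayes' rule."""
    true_pos = sensitivity * prevalence                    # has the condition and tests positive
    false_pos = (1.0 - specificity) * (1.0 - prevalence)   # does not have it but tests positive
    return true_pos / (true_pos + false_pos)


# HYPOTHETICAL screening test: 99% sensitive and 99% specific.
# HYPOTHETICAL prevalence: 1 in 1,000.
ppv = positive_predictive_value(sensitivity=0.99, specificity=0.99, prevalence=0.001)
print(f"Sensitivity: 99.0%, predictive value of a positive: {ppv:.1%}")
```

Under these assumed numbers the test catches 99% of true cases, yet only roughly 9% of positive results are true positives, because the condition is rare and false positives from the large unaffected group swamp the true ones.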

Sunday data/statistics link roundup (12/7/14)

  1. A randomized controlled trial shows that using conversation to detect suspicious behavior is much more effective than just monitoring body language (via Ann L. on Twitter). This comes as a crushing blow to those of us who enjoyed the now-cancelled Lie to Me and assumed it was all real.
  2. Check out this awesome real-time visualization of different types of network attacks. Rafa says if you watch long enough you will almost certainly observe a "storm" of attacks. A cool student project would be modeling the distribution of these attacks if you could collect the data (via David S.).
  3. Consider this: Did Big Data Kill the Statistician? I understand the sentiment, that statistical thinking and applied statistics have been around a long time and have produced some good ideas. On the other hand, there is definitely a large group of statisticians who aren't willing to expand their thinking beyond a really narrow set of ideas (via Rafa).
  4. Gangnam Style viewership creates integers too big for Youtube (via Rafa)
  5. A couple of interviews worth reading, ours with Cole Trapnell and SAMSI's with Jyotishka Data (via Jamie N.)
  6.  A piece on the secrets we don't know we are giving away through giving our data to [companies/the government/the internet].

Interview with Cole Trapnell of UW Genome Sciences

Cole Trapnell is an Assistant Professor of Genome Sciences at the University of Washington. He is the developer of multiple incredibly widely used tools for genomics including Tophat, Cufflinks, and Monocle. His lab at UW studies cell differentiation, reprogramming, and other transitions between stable or metastable cellular states using a combination of computational and experimental techniques. We talked to Cole as part of our ongoing interview series with exciting junior data scientists. 
SS: Do you consider yourself a computer scientist, a statistician, a computational biologist, or something else?

CT: The questions that get me up and out of bed in the morning the fastest are biology questions. I work on cell differentiation - I want to know how to define the state of a cell and how to predict transitions between states. That said, my approach to these questions so far has been to use new technologies to look at previously hard to access aspects of gene regulation.  For example, I’ve used RNA-Seq to look beyond gene expression into finer layers of regulation like splicing. Analyzing sequencing experiments often involves some pretty non-trivial math, computer science, and statistics.  These data sets are huge, so you need fast algorithms to even look at them. They all involve transforming reads into a useful readout of biology, and the technical and biological variability in that transformation needs to be understood and controlled for, so you see cool mathematical and statistical problems all the time. So I guess you could say that I’m a biologist, both experimental and computational. I have to do some computer science and statistics in order to do biology.

SS: You got a Ph.D. in computer science but have spent the last several years in a wet lab learning to be a bench biologist - why did you make that choice?

CT: Three reasons, mainly:

1) I thought learning to do bench work would make me a better overall scientist.  It has, in many ways, I think. It’s fundamentally changed the way I approach the questions I work on, but it’s also made me more effective in lots of tiny ways. I remember when I first got to John Rinn’s lab, we needed some way to track lots of libraries and other material.  I came up with some scheme where each library would get an 8-digit alphanumeric code generated by a hash function or something like that (we’d never have to worry about collisions!). My lab mate handed me a marker and said, “OK, write that on the side of these 12 micro centrifuge tubes”.  I threw out my scheme and came up with something like “JR_1”, “JR_2”, etc.  That’s a silly example, but I mention it because it reminds me of how completely clueless I was about where biological data really comes from.

2) I wanted to establish an independent, long-term research program investigating differentiation, and I didn’t want to have to rely on collaborators to generate data. I knew at the end of grad school that I wanted to have my own wet lab, and I doubted that anyone would trust me with that kind of investment without doing some formal training. Despite the now-common recognition by experimental biologists that analysis is incredibly important, there’s still a perception out there that computational biologists aren’t “real biologists”, and that computational folks are useful tools, but not the drivers of the intellectual agenda. That's of course not true, but I didn’t want to fight the stigma.

3) It sounded fun. I had one or two friends who had followed the “dry to wet” training trajectory, and they were having a blast. Seeing a result live under the microscope is satisfying in a way that I’ve rarely experienced looking at a computer screen.

SS: Do you plan to have both a wet lab and a dry lab when you start your new group? 

CT: Yes. I’m going to be starting my lab at the University of Washington in the department of Genome Sciences this summer, and it’s going to be a roughly 50/50 operation, I hope. Many of the labs there are set up that way, and there’s a real culture of valuing both sides. As a postdoc, I’ve been extremely fortunate to collaborate with grad students and postdocs who were trained as cell or molecular biologists but wanted to learn sequencing analysis. We’d train each other, often at great cost in terms of time spent solving “somebody else’s problem”.  I’m going to do my best to create an environment like that, the way John did for me and my lab mates.

SS: You are frequently on the forefront of new genomic technologies. As data sets get larger and more complicated how do we ensure reproducibility and replicability of computational results? 

CT: That’s a good question, and I don’t really have a good answer. You’ve talked a lot on this blog about the importance of making science more reproducible and how journals could change to make it so. I agree wholeheartedly with a lot of what you’ve said. I like the idea of “papers as packages”, but I don’t see it happening soon, because it’s a huge amount of extra work and there’s not a big incentive to do so. Doing so might make it easier to be attacked, so there could even be a disincentive! Scientists do well when they publish papers and those papers are cited widely. We have lots of ways to quantify “impact” - h-index, total citation count, how many times your paper is shared via twitter on a given day, etc. (Say what you want about whether these are meaningful measures.)

We don’t have a good way to track who’s right and who’s wrong, or whose results are reproducible and whose aren’t, short of full blown paper retraction. Most papers aren’t even checked in a serious way. Worse, the papers that are checked are the ones that a lot of people see - few people spend precious time following up on tangential observations in low circulation journals. So there’s actually an incentive to publish “controversial” results in highly visible journals because at least you’re getting attention.

Maybe we need a Yelp for papers and data sets?  One where in order to dispute the reproducibility of the analysis, you’d have to provide the code *you* ran to generate a contradictory result?  There needs to be a genuine and tangible *reward* (read: funding and career advancement) for putting up an analysis that others can dive into, verify, extend, and learn from.

In any case, I think it’s worth noting that reproducibility is not a problem unique to computation - experimentalists have a hard time reproducing results they got last week, much less results that came from some other lab!  There’s all kinds of harmless reasons for that.  Experiments are hard.  Reagents come in bad lots. You had too much coffee that morning and can’t steady your pipet hand to save your life. But I worry a bit that we could spend a lot of effort making our analysis totally automated and perfectly reproducible and still be faced with the same problem.

SS: What are the interesting statistical challenges in single-cell RNA-sequencing? 


CT: Oh man, there are many. Here’s a few:

1) There are some very interesting questions about variability in expression across cells, or within one cell across time. There’s clearly a lot of variability in the expression level of a given gene across cells. But there’s really no way right now to take “replicate” measurements of a single cell. What would that mean? With current technology, to make an RNA-Seq library from a cell, you have to lyse it. So that’s it for that cell. Even if you had a non-invasive way to measure the whole transcriptome, the cell is a living machine that’s always changing in ways large and small, even in culture. Would you consider repeated measurements “replicates”? Furthermore, how can you say that two different cells are “replicate” measurements of a single, defined cell state? Do such states even really exist?

For that matter, we don’t have a good way of assessing how much variability stems from technical sources as opposed to biological sources. One common way of assessing technical variability is to spike some alien transcripts at known concentrations into purified RNA before making the library, so you can see how variable your endpoint measurements are for those alien transcripts. But to do that for single-cell RNA-Seq, we’d have to actually spike transcripts *into* the nucleus of a cell before we lyse it and put it through the library prep process. Just doping it into the lysate’s not good enough, because the lysis itself might (and likely does) destroy a substantial fraction of the endogenous RNA in the cell. So there are some real barriers to overcome in order to get a handle on how much variability is really biological.

2) A second challenge is writing down what a biological process looks like at single cell resolution. I mean we want to write down a model that predicts the expression levels of each gene in a cell as it goes through some biological process. We want to be able to say this gene comes on first, then this one, then these genes, and so on. In genomics up until now, we’ve been in the situation where we are measuring many variables (P) from few measurements (N).  That is, N << P, typically, which has made this problem extremely difficult.  With single cell RNA-Seq, that may no longer be the case.  We can already easily capture hundreds of cells, and thousands of cells per capture is just around the corner, so soon, N will be close to P, and maybe someday greater.

Assume for the moment that we are capturing cells that are either resting at or transiting between well defined states. You can think of each cell as a point in a high-dimensional geometric space, where each gene is a different dimension.  We’d like to find those equilibrium states and figure out which genes are correlated with which other genes.  Even better, we’d like to study the transitions between states and identify the genes that drive them.  The curse of dimensionality is always going to be a problem (we’re not likely to capture millions or billions of cells anytime soon), but maybe we have enough data to make some progress. There’s interesting literature out there for tackling problems at this scale, but to my knowledge these methods haven’t yet been widely applied in biology.  I guess you can think of cell differentiation viewed at whole-transcriptome, single-cell resolution as one giant manifold learning problem.  Same goes for oncogenesis, tissue homeostasis, reprogramming, and on and on. It’s going to be very exciting to see the convergence of large scale statistical machine learning and cell biology.
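
As a toy illustration of the "each cell is a point, each gene is a dimension" picture, the Python sketch below simulates cells at hidden positions along a single transition and computes pairwise gene correlations across cells. Everything in it is an assumption made for illustration: the three gene names, the linear expression model, and the noise levels are invented, and plain Pearson correlation is a stand-in here, not a manifold learning method and not anything from Cole's own tools.

```python
import math
import random


def pearson(x, y):
    """Pearson correlation between two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


random.seed(0)

# SIMULATED data: 200 cells, each at a hidden position t in [0, 1]
# along a transition between two states. Gene names are hypothetical.
n_cells = 200
cells = []
for _ in range(n_cells):
    t = random.random()
    cells.append({
        "gene_a": 10.0 * t + random.gauss(0, 1),          # switches on during the transition
        "gene_b": 10.0 * (1.0 - t) + random.gauss(0, 1),  # switches off during the transition
        "gene_c": random.gauss(5, 1),                     # unrelated to the transition
    })

# One column of expression values per gene, i.e. one coordinate axis of the space.
genes = ["gene_a", "gene_b", "gene_c"]
columns = {g: [cell[g] for cell in cells] for g in genes}

for i, g1 in enumerate(genes):
    for g2 in genes[i + 1:]:
        print(f"{g1} vs {g2}: r = {pearson(columns[g1], columns[g2]):+.2f}")
```

In this simulation the two genes that track the transition come out strongly anti-correlated while the unrelated gene does not correlate with either. Real data has thousands of dimensions and no known ordering of cells, which is where the harder problem described above begins.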

SS: If you could do it again would you do computational training then wet lab training or the other way around? 

CT: I’m happy with how I did things, but I’ve seen folks go the other direction very successfully.  My labmates Loyal Goff and Dave Hendrickson started out as molecular biologists, but they’re wizards at the command line now.

SS: What is your programming language of choice? 

CT: Oh, I’d say I hate them all equally ;)

Just kidding. I’ll always love C++. I work in R a lot these days, as my work has veered away from developing tools for other people towards analyzing data I’ve generated.  I still find lots of things about R to be very painful, but ggplot2, plyr, and a handful of other godsend packages make the juice worth the squeeze.


Repost: A deterministic statistical machine

Editor's note: This is a repost of our previous post about deterministic statistical machines. It is inspired by the recent announcement that the Automatic Statistician received funding from Google. In 2012 we also applied to Google for a small research award to study this same problem, but didn't get it. In the interest of extreme openness like Titus Brown or Ethan White, here is the application we submitted to Google. I showed this to a friend who told me the reason we didn't get it is that our proposal was missing two words: "artificial", "intelligence". 

As Roger pointed out, the most recent batch of Y Combinator startups included a bunch of data-focused companies. One of these companies, StatWing, is a web-based tool for data analysis that looks like an improvement on SPSS with more plain text, more visualization, and a lot of the technical statistical details “under the hood”. I first read about StatWing on TechCrunch, under the title “How Statwing Makes It Easier To Ask Questions About Data So You Don’t Have To Hire a Statistical Wizard”.

StatWing looks super user-friendly and the idea of democratizing statistical analysis so more people can access these ideas is something that appeals to me. But, as one of the aforementioned statistical wizards, this had me freaked out for a minute. Once I looked at the software though, I realized it suffers from the same problem that most “user-friendly” statistical software suffers from. It makes it really easy to screw up a data analysis. It will tell you when something is significant and if you don’t like that it isn’t, you can keep slicing and dicing the data until it is. The key issue behind getting insight from data is knowing when you are fooling yourself with confounders, or small effect sizes, or overfitting. StatWing looks like an improvement on the UI experience of data analysis, but it won’t prevent false positives that plague science and cost business big $$.

So I started thinking about what kind of software would prevent these sorts of problems while still being accessible to a big audience. My idea is a “deterministic statistical machine”. Here is how it works: you input a data set and then specify the question you are asking (is variable Y related to variable X? can I predict Z from W?). Then, depending on your question, it uses a deterministic set of methods to analyze the data, say regression for inference, linear discriminant analysis for prediction, etc. But the method is fixed and deterministic for each question. It also performs a pre-specified set of checks for outliers, confounders, missing data, maybe even data fudging. It generates a report with a markdown tool and then immediately publishes the result to figshare.
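
Here is a rough sketch of what the core of such a machine might look like. It is only an illustration under loud assumptions: the names (`METHODS`, `run_dsm`, the question label `is_y_related_to_x`) are hypothetical, only the regression branch is implemented, the checks are limited to missing values and a crude 3-standard-deviation outlier rule, and the figshare publishing step is left out entirely.

```python
import statistics


def _regression(x, y):
    """Ordinary least squares fit of y = intercept + slope * x."""
    mean_x, mean_y = statistics.fmean(x), statistics.fmean(y)
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return {
        "method": "linear regression",
        "slope": slope,
        "intercept": mean_y - slope * mean_x,
    }


# Hypothetical registry: each question type maps to exactly one fixed method,
# so the user cannot shop around for an analysis that gives a nicer answer.
METHODS = {"is_y_related_to_x": _regression}


def _checks(x, y):
    """Pre-specified checks, applied the same way to every data set."""
    pairs = [(xi, yi) for xi, yi in zip(x, y) if xi is not None and yi is not None]
    n_missing = len(x) - len(pairs)
    ys = [yi for _, yi in pairs]
    mean_y, sd_y = statistics.fmean(ys), statistics.stdev(ys)
    n_outliers = sum(1 for yi in ys if abs(yi - mean_y) > 3 * sd_y)
    return pairs, {"missing": n_missing, "outliers": n_outliers}


def run_dsm(question, x, y):
    """Run the fixed method for `question` and return a Markdown report."""
    if question not in METHODS:
        raise ValueError(f"no fixed method registered for question: {question}")
    pairs, checks = _checks(x, y)
    xs = [p[0] for p in pairs]
    ys = [p[1] for p in pairs]
    result = METHODS[question](xs, ys)
    report = "\n".join([
        f"# Report: {question}",
        f"- method: {result['method']}",
        f"- slope: {result['slope']:.3f}",
        f"- intercept: {result['intercept']:.3f}",
        f"- rows dropped for missing values: {checks['missing']}",
        f"- outliers flagged (more than 3 SD from the mean): {checks['outliers']}",
    ])
    return result, checks, report


# Made-up example data with one missing value.
x = [1.0, 2.0, 3.0, 4.0, 5.0, None]
y = [2.1, 3.9, 6.2, 7.8, 10.1, 12.0]
result, checks, report = run_dsm("is_y_related_to_x", x, y)
print(report)
```

The point of the registry is that the choice of method is made once, by the tool, and not renegotiated after seeing the result.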

The advantage is that people can get their data-related questions answered using a standard tool. It does a lot of the “heavy lifting” in checking for potential problems and produces nice reports. But it is a deterministic algorithm for analysis so overfitting, fudging the analysis, etc. are harder. By publishing all reports to figshare, it makes it even harder to fudge the data. If you fiddle with the data to try to get a result you want, there will be a “multiple testing paper trail” following you around.
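
The “paper trail” idea can be sketched in a few lines. This is just one possible way to do it and not a description of any existing service: the `AnalysisLog` class is hypothetical, and an in-memory list stands in for reports published somewhere public. Each run stores a fingerprint of the data, so re-running the same question on a tweaked data set is visible afterwards.

```python
import hashlib
import json


class AnalysisLog:
    """Append-only record of every analysis run (stand-in for public reports)."""

    def __init__(self):
        self._entries = []

    def record(self, question, data):
        # Fingerprint the data so any change to it produces a new hash.
        payload = json.dumps(data, sort_keys=True).encode("utf-8")
        entry = {
            "run": len(self._entries) + 1,
            "question": question,
            "data_sha256": hashlib.sha256(payload).hexdigest(),
        }
        self._entries.append(entry)
        return entry

    def versions_tried(self, question):
        """Number of distinct data sets analysed for the same question."""
        return len({e["data_sha256"] for e in self._entries
                    if e["question"] == question})


log = AnalysisLog()
log.record("is_y_related_to_x", {"x": [1, 2, 3, 4], "y": [2, 4, 7, 8]})
# Same question again, after quietly dropping an inconvenient point.
log.record("is_y_related_to_x", {"x": [1, 2, 4], "y": [2, 4, 8]})
print(log.versions_tried("is_y_related_to_x"))
```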

The DSM should be a web service that is easy to use. Anybody want to build it? Any suggestions for how to do it better?


Sunday data/statistics link roundup (11/9/14)

So I'm a day late, but you know, I got a new kid and stuff...

  1. The New Yorker is hating on MOOCs; they mention all the usual stuff, including the really poorly designed San Jose State experiment. I think this deserves a longer post, but this is definitely a case where people are looking at MOOCs on the wrong part of the hype curve. MOOCs won't solve all possible education problems, but they are hugely helpful to many people and writing them off is a little silly (via Rafa).
  2. My colleague Dan S. is teaching a missing data workshop here at Hopkins next week (via Dan S.)
  3. A couple of cool YouTube videos explaining how the normal distribution sounds and the Pareto principle with paperclips (via Presh T., pair with the 80/20 rule of statistical methods development)
  4. If you aren't following Research Wahlberg, you aren't on academic Twitter.
  5. I followed #biodata14 closely. I think having a meeting on Biological Big Data is a great idea and many of the discussion leaders are people I admire a ton. I also am a big fan of Mike S. I have to say I was pretty bummed that more statisticians weren't invited (we like to party too!).
  6. Our data science specialization generates almost 1,000 new R GitHub repos a month! Roger and I are in a neck and neck race to be the person who has taught the most people statistics/data science in the history of the world.
  7. The RStudio guys have also put together what looks like a great course similar in spirit to our Data Science Specialization. The RStudio folks have been *super* supportive of the DSS and we assume anything they make will be awesome.
  8. Congrats to Data Carpentry and Tracy Teal on their funding from the Moore Foundation!


Time varying causality in n=1 experiments with applications to newborn care

We just had our second son about a week ago and I've been hanging out at home with him and the rest of my family. It has reminded me of a few things from when we had our first son. First, newborns are tiny and super-duper adorable. Second, daylight saving time means gaining an extra hour of sleep for many people, but for people with young children it is more like (via Reddit):


Third, taking care of a newborn is like performing a series of n=1 experiments where the causal structure of the problem changes every time you perform an experiment.

Suppose, hypothetically, that your newborn has just had something to eat and it is 2am (again, just hypothetically). You are hoping he'll go back down to sleep so you can catch some shut-eye yourself. But your baby just can't sleep and seems uncomfortable. Here is a partial list of causes for this: (1) dirty diaper, (2) needs to burp, (3) still hungry, (4) not tired, (5) over tired, (6) has gas, (7) just chillin. So you start going down the list and trying to address each of the potential causes of late-night sleeplessness: (1) check diaper, (2) try burping, (3) feed him again, etc. etc. Then, miraculously, one works and the little guy falls asleep.

It is interesting how the natural human reaction to this is to reorder the potential causes of sleeplessness and start with the thing that worked next time. Then we often get frustrated when the same thing doesn't work the next time. You can't help it, you did an experiment, you have some data, you want to use it. But the reality is that the next time may have nothing to do with the first.
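
For the skeptical (and sleep deprived), here is a tiny simulation of that last point. It assumes the most extreme version of the story, which is my simplification: every night the true cause is drawn uniformly at random from the list, independently of previous nights. Under that assumption, moving "what worked last time" to the front of the list should buy you nothing on average.

```python
import random

CAUSES = ["dirty diaper", "needs to burp", "still hungry", "not tired",
          "over tired", "has gas", "just chillin"]


def average_attempts(reorder, n_nights=20000, seed=1):
    """Average number of fixes tried per night before hitting the true cause."""
    rng = random.Random(seed)
    order = list(CAUSES)
    total = 0
    for _ in range(n_nights):
        # Assumption: tonight's cause has nothing to do with last night's.
        true_cause = rng.choice(CAUSES)
        total += order.index(true_cause) + 1
        if reorder:
            # "Learn" from the experiment: try what worked first next time.
            order.remove(true_cause)
            order.insert(0, true_cause)
    return total / n_nights


fixed = average_attempts(reorder=False)
adaptive = average_attempts(reorder=True)
print(f"fixed order: {fixed:.2f} attempts, reordered: {adaptive:.2f} attempts")
```

Both strategies should average about 4 attempts (the midpoint of 1 to 7) in this setup. If the causes were persistent from night to night, reordering would help, which is exactly why it is so tempting.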

I'm in the process of collecting some very poorly annotated data, gathered exclusively at night, if anyone wants to write a dissertation on this problem.


Sunday data/statistics link roundup (11/2/14)

Better late than never! If you have something cool to share, please continue to email it to me with subject line "Sunday links".

  1. DrivenData is a Kaggle-like site but for social good. I like the principle of using data for societal benefit, since there are so many ways it seems to be used for nefarious purposes (via Rafa).
  2. This article claiming academic science isn't sexist has been widely panned; Emily Willingham pretty much destroys it here (via Sherri R.). The thing that is interesting about this article is the way that it tries to use data to give the appearance of empiricism, while using language to try to skew the results. Is it just me or is this totally bizarre in light of the NYT also publishing this piece about academic sexual harassment at Yale?
  3. Noah Smith, an economist, tries to summarize the problem with "most research being wrong". It is an interesting take; I wonder if he read Roger's piece saying almost exactly the same thing like the week before? He also mentions it is hard to quantify the rate of false discoveries in science; maybe he should read our paper?
  4. Nature now requests that code sharing occur "where possible" (via Steven S.)
  5. Great movie scientist versus real scientist cartoons, I particularly like the one about replication (via Steven S.).