Simply Statistics

A statistics blog by Rafa Irizarry, Roger Peng, and Jeff Leek

What can we learn from data analysis failures?

Back in February, I gave a talk at the Walter and Eliza Hall Institute of Medical Research in Melbourne titled “Lessons in Disaster: What Can We Learn from Data Analysis Failures?” This talk was quite different from the talks I usually give on computing or environmental health, and I’m guessing it probably showed. It was nevertheless a fun talk, and I got to talk about space-related stuff. If you want to hear some discussion of the development of this talk, you can listen to Episode 53 of Not So Standard Deviations.

Process versus outcome productivity

Several times over the last few weeks my hatred of Doodle polls has come up in meetings. I think the polling technology is great, but I’m still frustrated by the polls. Someone asked what I’d rather have happen, and I said: “Set the meeting, then let me know when it is. If I can come, I will. If I’m not there, then I’m happy for you to decide without me.”

What is a Successful Data Analysis?

Defining success in data analysis has eluded me for quite some time now. About two years ago I tried to explore this question in my Dean’s Lecture, but ultimately I think I missed the mark. In that talk I tried to identify standards (I called them “aesthetics”) by which we could universally evaluate the quality of a data analysis and tried to make an analogy with music theory. It was a fun talk, in part because I got to play the end of Charles Ives’ Second Symphony.

Input on the Draft NIH Strategic Plan for Data Science

The NIH is soliciting input for their Strategic Plan for Data Science. If you are interested, today, April 2, is the deadline. You can provide input here. Below is what I plan to submit.

Summary

My main critique is that the report is somewhat vague; more specifics and concrete examples should be included. My main concern is that the draft describes initiatives aimed at improving the back end of data science (data storage, data management, and computing infrastructure) without recognizing that doing so requires understanding the needs of those working on the front end of data science (data exploration, quality assessment, interactive data analysis, and method development).

What do Fahrenheit, comma separated files, and markdown have in common?

Anil Dash asked people what their favorite file format was. David Robinson (@drob) replied: “CSV is similar to Markdown. No one global standard (though there are attempts) but a damn good attempt at ‘Whatever humans think it is at a glance, they’re probably right.’” His tweet reminded me a lot of a tweet from Stephen Turner (@strnr) in defense of Fahrenheit. There is a spectrum for tools from the theoretically optimal to the most human usable.
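The “whatever humans think it is at a glance, they’re probably right” property is easy to demonstrate: the same CSV text a person can read by eye is also trivially machine-parseable. A minimal sketch using Python’s standard csv module (the data below is made up for illustration):

```python
import csv
import io

# A toy CSV: readable at a glance by a human, and trivially
# parseable by a machine -- no formal schema required up front.
text = "city,temp_f\nBaltimore,68\nMelbourne,75\n"

rows = list(csv.reader(io.StringIO(text)))
header, data = rows[0], rows[1:]

print(header)  # ['city', 'temp_f']
print(data)    # [['Baltimore', '68'], ['Melbourne', '75']]
```

Note that csv.reader makes no assumptions beyond the delimiter, which is exactly the point: the format survives the lack of a single global standard because human intuition and the parser usually agree.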

Some datasets for teaching data science

In this post I describe the dslabs package, which contains some datasets that I use in my data science courses. A much-discussed topic in statistics education is that computing should play a more prominent role in the curriculum. I strongly agree, but I think the main improvement will come from bringing applications to the forefront and mimicking, as closely as possible, the challenges applied statisticians face in real life. I therefore try to avoid widely used toy examples, such as the mtcars dataset, when I teach data science.

A non-comprehensive list of awesome things other people did in 2017

Editor’s note: For the last few years I have made a list of awesome things that other people did (2016, 2015, 2014, 2013). As in previous years, I’m making this list right off the top of my head. If you know of something I missed, make your own list or add it to the comments! I have also avoided talking about stuff I worked on or that people here at Hopkins are doing, because this post is supposed to be about other people’s awesome stuff.

Thoughts on David Donoho’s "Fifty Years of Data Science"

Note: This post was originally published as part of a collection of discussion pieces on David Donoho’s paper. The original paper and collection of discussions can be found at the JCGS web site. Professor Donoho’s commentary comes at a perfect time, given that, according to his own chronology, we are just about due for another push to “widen the tent” of statistics to include a broader array of activities. Looking back at the efforts of people like Tukey, Cleveland, and Chambers to broaden the meaning of statistics, I would argue that to some extent their efforts have failed.

How Do Machines Learn?

I like all of CGP Grey’s videos, but most of them have to do with voting systems and so aren’t really relevant to this blog. His latest video, titled “How Do Machines Learn?”, is highly relevant, though, and I thought it was very well done. That said, although the animations of the robots were very cute and helped to tell the story, I found them a bit disconcerting in a way that I can’t quite explain.

Puerto Rico's governor wants recount of hurricane death toll

A quick followup to Rafa’s analysis of the death toll from Hurricane Maria, from Axios: Puerto Rican Governor Ricardo Rosselló ordered a recount Monday of every death on the island since Hurricane Maria made landfall on September 20, as evidence continues to show that the official death toll grossly undercounts the true number, reports the New York Times. There are at least two ways to do this. One way is inferential in nature: estimate what we would expect the mortality to be and compare that with what was observed.