Simply Statistics

A statistics blog by Rafa Irizarry, Roger Peng, and Jeff Leek

This is a brave post and everyone in statistics should read it

This post by Kristian Lum is incredibly brave. It points out some awful behavior by people in our field and should be required reading for everyone. It took a lot of courage for Kristian to post this, but we believe her, think this is a serious and critical issue for our field, and will not tolerate this kind of behavior among our colleagues. Her post has already inspired important discussions among the faculty at Johns Hopkins Biostatistics and is an important contribution to making sure our field is welcoming for everyone.

Hurricane María official death count in conflict with mortality data

A recent preprint by Alexis R. Santos-Lozada and Jeffrey T. Howard concludes that “The mortality burden may [be] higher than official counts, and may exceed the current official death toll by a factor of 10.” The authors used monthly death records from the Puerto Rico Vital Statistics system from 2010 to 2016. Although data for 2017 was apparently not available, they extracted data from a statement made by Héctor Pesquera, the Secretary of Public Safety:

Some roadblocks to the broad adoption of machine learning and AI

I read two blog posts on AI over the Thanksgiving break. One was a nice post by Luke Oakden-Rayner discussing the challenges for AI in medicine, and the other, by Tim Harford, was about the need for increased focus on basic research in AI, motivated by AlphaGo. I’ve had a lot of interactions with people lately who want to take advantage of machine learning/AI in their research or business. Despite the excitement around AI and the exciting results we see from sophisticated research teams almost daily, the actual extent and application of AI is much smaller.

A few things that would reduce stress around reproducibility/replicability in science

I was listening to the Effort Report Episode on The Messy Execution of Reproducible Research where they were discussing the piece about Amy Cuddy in the New York Times. I think both the article and the podcast did a good job of discussing the nuances of the importance of reproducibility and the challenges of the social interactions around this topic. After listening to the podcast I realized that I see a lot of posts about reproducibility/replicability, but many of them are focused on the technical side.

Follow Up on Reasoning About Data

Sometimes, when I write a really long blog post, I forget what the point was at the end. I suppose I could just update the previous post…but that feels wrong for some reason. I meant to make one final point in my last post about how better data analyses help you reason about the data. In particular, I meant to tie the discussion about garbage collection to the section on data analysis.

Reasoning About Data

In my ongoing discussion in my mind about what makes for a good data analysis, one of the ideas that keeps coming back to me is this notion of being able to “reason about the data”. The idea here is that it’s important that a data analysis allow you to understand how the data, as opposed to other aspects of an analysis like assumptions or models, played a role in producing the outputs.

How do you convince other people to use R?

I just got back from the rOpenSci OzUnconf that was run in Melbourne last week. I’d like to give a big thanks to the organizers (Nick Tierney, Di Cook, Rob Hyndman and others) for putting on a great unconference. These events are always a great opportunity to meet people just getting started in the R community and to get them involved. As is typical for these unconferences, topic ideas were pitched via issues on the OzUnconf GitHub repo.

It Costs Money to Get It Right

On the latest episode of Not So Standard Deviations I talked with Hilary about Apple’s efforts to train machine learning algorithms in their Face ID technology in the iPhone X. The gist of Face ID is that it recognizes your face using a mathematical representation and then unlocks the phone when it can confirm that it is you. In its keynote presentation, Apple mentioned that it’s using machine learning to do this and even had developed its own custom chips to do the computations.

Creating an expository graph for a talk

I’m co-teaching a data science class at Johns Hopkins with John Muschelli. I gave the lectures on EDA and he just gave a lecture on how to create an “expository graph”. When we teach the class, an exploratory graph is the kind of graph you make for yourself just to try to understand a data set. An expository graph is one where you are trying to communicate information to someone else.

Recording Podcasts with a Remote Co-Host

I previously wrote about my editing workflow for podcasts and I thought I’d follow up with some details on how I record both Not So Standard Deviations and The Effort Report. This post is again going to be a bit Mac-specific because, well, that’s what I do. Communication: Both of my podcasts have a co-host who is not in the same physical location as me. Therefore, we need to use some sort of Internet-based communication software (Skype, Google Hangouts, FaceTime, etc.