Simply Statistics

A statistics blog by Rafa Irizarry, Roger Peng, and Jeff Leek

Partitioning the Variation in Data

One of the fundamental questions that we can ask in any data analysis is, “Why do things vary?” Although I think this is fundamental, I’ve found that it’s not explicitly asked as often as I would expect. The problem with not asking this question is that it can often lead to a lot of pointless and time-consuming work. Taking a moment to ask yourself, “What do I know that can explain why this feature or variable varies?

Teaching R to New Users - From tapply to the Tidyverse

Abstract: The intentional ambiguity of the R language, inherited from the S language, is one of its defining features. Is it an interactive system for data analysis or is it a sophisticated programming language for software developers? The ability of R to cater to users who do not see themselves as programmers, but then allow them to slide gradually into programming, is an enduring quality of the language and is what has allowed it to gain significance over time.

What Should be Done When Data Have Creators?

I was listening to the podcast The West Wing Weekly recently and Episode 4.17 (“Red Haven’s on Fire”) featured former staff writer Lauren Schmidt Hissrich. In introducing her, the podcast co-hosts mentioned that Hissrich was a writer for the Netflix series Daredevil, based on the Marvel Comics character. She is also the showrunner for a new Netflix series called The Witcher, which is based on a book by Andrzej Sapkowski.

Cultural Differences in Map Data Visualization

Matthew Panzarino had an interesting article in TechCrunch on Apple’s process for rebuilding their Maps app. While most of the article describes the laborious process of data collection, one part jumped out at me, which was the team that Panzarino describes as the “Department of Details.” They are responsible for a number of odds and ends regarding how maps are presented, but they are particularly concerned with presenting maps to people around the world.

Creativity in Data Analysis

I’ve often heard that there is a need for data analysts to be creative in their work. But why? Where and how exactly is that creativity exercised? On one extreme, it could be thought that a data analyst should be easily replaced by a machine. For various types of data and for various types of questions, there should be a deterministic approach to analysis that does not change. Presumably, this could be coded up into a computer program and the data could be fed into the program every time, with a result presented at the end.

The Role of Resources in Data Analysis

When learning about data analysis in school, you don’t hear much about the role that resources—time, money, and technology—play in the development of analysis. This is a conversation that is often had “in the hallway” when talking to senior faculty or mentors. But the available resources do play a significant role in determining what can be done with a given question and dataset. It’s tempting to think that the situation is binary—either you have sufficient resources to do the “right” analysis, or you simply don’t do the analysis.

People vs. Institutions in Data Analysis

In my post about relationships in data analysis, I got a little pushback regarding whether human relationships would ever not be important in data analysis and whether that has anything to do with the “maturity” of the field. I believe human beings will always play a role in data analysis, but it’s possible that over time they will play different roles. In this post, I want to discuss what I meant by “institutions” and “institutional knowledge” in the context of data analysis, and when the specific person who does the analysis is critical to how the analysis is done.

An ode to King James

The NBA season is over and, once again, what I will most remember are King James’ heroics. As a lifelong Boston Celtics fan, I am supposed to hate LeBron James. But I don’t. As a fan of the game of basketball, and a statistician, I just can’t help but be in awe of the best player ever to play the game. Also, how can you hate this guy (don’t miss the wristwatch)?

Estimating mortality rates in Puerto Rico after hurricane María using newly released official death counts

Late last Friday, the Puerto Rico Department of Health finally released monthly death count data for the time period following Hurricane Maria. David Begnaud (@DavidBegnaud) tweeted on June 1, 2018: “BREAKING: the Puerto Rico Health Department has buckled under pressure and released the number of deaths for each month, through May of 2018. In September 2017, when Hurricane Maria made landfall, there was a notable spike, followed by an even larger one in October.” The news came three days after the publication of our paper describing a survey conducted to better understand what happened after the hurricane.

Trustworthy Data Analysis

The success of a data analysis depends critically on the audience. But why? A lot has to do with whether the audience trusts the analysis as well as the person presenting the analysis. Almost all presentations are incomplete because, for any analysis of reasonable size, some details must be omitted for the sake of clarity. A good presentation will have a structured narrative that will guide the presenter in choosing what should be included and what should be omitted.