Simply Statistics

A statistics blog by Rafa Irizarry, Roger Peng, and Jeff Leek

The Role of Resources in Data Analysis

When learning about data analysis in school, you don’t hear much about the role that resources—time, money, and technology—play in the development of analysis. This is a conversation that is often had “in the hallway” when talking to senior faculty or mentors. But the available resources do play a significant role in determining what can be done with a given question and dataset. It’s tempting to think that the situation is binary—either you have sufficient resources to do the “right” analysis, or you simply don’t do the analysis.

People vs. Institutions in Data Analysis

In my post about relationships in data analysis I got a little pushback regarding whether human relationships would ever not be important in data analysis and whether that has anything to do with the “maturity” of the field. I believe human beings will always play a role in data analysis, but it’s possible that over time they will play different roles. I wanted to discuss in this post what I meant by “institutions” and “institutional knowledge” in the context of data analysis, and when the specific person who does the analysis is critical to how the analysis is done.

An ode to King James

The NBA season is over and, once again, what I will most remember are King James’ heroics. As a lifelong Boston Celtics fan, I am supposed to hate LeBron James. But I don’t. As a fan of the game of basketball, and a statistician, I just can’t help but be in awe of the best player ever to play the game. Also, how can you hate this guy (don’t miss the wristwatch)?

Estimating mortality rates in Puerto Rico after hurricane María using newly released official death counts

Late last Friday, the Puerto Rico Department of Health finally released monthly death count data for the time period following Hurricane Maria: BREAKING: the Puerto Rico Health Department has buckled under pressure and released the number of deaths for each month, through May of 2018. In September 2017, when Hurricane Maria made landfall, there was a notable spike, followed by an even larger one in October. — David Begnaud (@DavidBegnaud) June 1, 2018 The news came three days after the publication of our paper describing a survey conducted to better understand what happened after the hurricane.

Trustworthy Data Analysis

The success of a data analysis depends critically on the audience. But why? A lot has to do with whether the audience trusts the analysis as well as the person presenting the analysis. Almost all presentations are incomplete because, for any analysis of reasonable size, some details must be omitted for the sake of clarity. A good presentation will have a structured narrative that will guide the presenter in choosing what should be included and what should be omitted.

Context Compatibility in Data Analysis

All data arise within a particular context and often as a result of a specific question being asked. That is all well and good until we attempt to use that same data to answer a different question within a different context. When you match an existing dataset with a new question, you have to ask if the original context in which the data were collected is compatible with the new question and the new context.

Awesome postdoc opportunities in computational genomics at JHU

Johns Hopkins is a pretty amazing place to do computational genomics right now. My colleagues are really impressive; for example, five of our faculty are part of the Chan Zuckerberg Initiative, and we have faculty across a range of departments including Biostatistics, Computer Science, Biology, Biomedical Engineering, and Human Genetics. A number of my colleagues are actively looking for postdocs, and in an effort to make the postdoc job market a little less opaque, I’m sharing this non-comprehensive list of opportunities I know about here.

Rethinking Academic Data Sharing

The sharing of data is one of the key principles of reproducible research (the other being code sharing). Using the data and code a researcher has used to generate a finding, other researchers can reproduce those findings and examine the process that led to them. Reproducibility is critical for transparency, so that others can verify the process, and for speeding up knowledge transfer. But recent events have gotten me thinking more about the data sharing aspect of reproducibility and whether it is tenable in the long run.

Software as an academic publication

Software has for a while now played a weird and uncomfortable role in the academic statistics world. When I first started out (circa 2000), I think developing software was considered “nice,” but for the most part it was not considered valuable as an academic contribution in the statistical universe. People were generally happy to use the software and extol its virtues, but when it came to evaluating a person’s scholarship, software usually ranked somewhere near the bottom of the list, after papers, grants, and maybe even JSM contributed talks.

Relationships in Data Analysis

I recently finished reading Steve Coll’s book Directorate S, which is a chronicle of the U.S. war in Afghanistan post 9-11. It’s a good book, and one line stuck out for me as I thought it had relevance for data analysis. In one chapter, Coll writes about Lieutenant Colonel John Loftis, who helped run a training program for U.S. military officials who were preparing to go serve in Afghanistan. In reference to Afghan society, he says, “Everything over there is about relationships.