Simply Statistics

A statistics blog by Rafa Irizarry, Roger Peng, and Jeff Leek

Divergent and Convergent Phases of Data Analysis

There are often discussions within the data science community about which tools are best for doing data science. The most recent iteration of this discussion is the so-called “First Notebook War”, which is well-summarized by Yihui Xie in his blog post (it is a great read). One thing that I have found missing from many discussions about tooling in data analysis is an acknowledgment that data analysis tends to advance through different phases and that different tools can be more or less useful in each of those phases.

Being at the Center

Hilary Parker and I just released part 2 of our book club discussion of Nigel Cross’s book Design Thinking, and it centers around a profile of designer Gordon Murray, who spent his career designing Formula One race cars. One of the aspects of his job as a designer is taking a “systems approach” to solving problems. Coupled with that approach is his role in balancing the various priorities of members of his team.

Constructing a Data Analysis

This week Hilary Parker and I have started our “Book Club” on Not So Standard Deviations where we will be discussing Nigel Cross’s book Design Thinking: Understanding How Designers Think and Work. We will be talking about how the work of designers parallels the work of data scientists and how many of the principles developed in design port over so well to data analysis. While data visualization has always taken cues from design, I think much broader aspects of data analysis could benefit from the work studying design.

The Law and Order of Data Science

One conversation I’ve had a few times revolves around the question, “What’s the difference between science and data science?” If I were to come up with a simple distinction, I might say that science starts with a question; data science starts with the data. What makes data science so difficult is that it starts in the wrong place. As a result, a certain amount of extra work must be done to understand the context surrounding a dataset before we can do anything useful.

The Trillion Dollar Question

Recently, Apple’s stock price rose to the point where the company’s market valuation was above $1 trillion, the first U.S. company to reach that benchmark. Subsequently, numerous articles were published describing Apple’s journey to this point and why it got there. Most people describe Apple as a technology company. They make technology products: iPhones, iPads, Macs, etc. These are all computing devices. But there is another way to think of Apple and what kind of company they are as well as how they became so successful.

Why I Indent My Code 8 Spaces

Jenny Bryan recently gave a wonderful talk at the useR! 2018 meeting in Brisbane about “Code Smells and Feels” (I recommend you watch a video of that talk). Her talk covers various ways to detect when your code “smells” and how to fix those smells through refactoring. While there is quite a bit of literature on this with respect to other programming languages, it’s not well-covered with respect to R.

Partitioning the Variation in Data

One of the fundamental questions that we can ask in any data analysis is, “Why do things vary?” Although I think this is fundamental, I’ve found that it’s not explicitly asked as often as I might think. The problem with not asking this question is that it can often lead to a lot of pointless and time-consuming work. Taking a moment to ask yourself, “What do I know that can explain why this feature or variable varies?

Teaching R to New Users - From tapply to the Tidyverse

Abstract: The intentional ambiguity of the R language, inherited from the S language, is one of its defining features. Is it an interactive system for data analysis or is it a sophisticated programming language for software developers? The ability of R to cater to users who do not see themselves as programmers, but then allow them to slide gradually into programming, is an enduring quality of the language and is what has allowed it to gain significance over time.

What Should be Done When Data Have Creators?

I was listening to the podcast The West Wing Weekly recently and Episode 4.17 (“Red Haven’s on Fire”) featured former staff writer Lauren Schmidt Hissrich. In introducing her, the podcast co-hosts mentioned that Hissrich was a writer for the Netflix series Daredevil, based on the Marvel Comics character. She is also the showrunner for a new Netflix series called The Witcher, which is based on a book by Andrzej Sapkowski.

Cultural Differences in Map Data Visualization

Matthew Panzarino had an interesting article in TechCrunch on Apple’s process for rebuilding their Maps app. While most of the article describes the laborious process of data collection, one part jumped out at me, which was the team that Panzarino describes as the “Department of Details.” They are responsible for a number of odds and ends regarding how maps are presented, but they are particularly concerned with presenting maps to people around the world.