Simply Statistics

A statistics blog by Rafa Irizarry, Roger Peng, and Jeff Leek

The role of academia in data science education

I was recently asked to moderate an academic panel on the role of universities in training the data science workforce. I preceded each question with opinionated introductions which I have fused into this blog post. These are weakly held opinions so please consider commenting if you disagree with anything. To discuss data science education we first need to clearly state what it means. The panel organizers defined data science as “an emerging discipline that draws upon knowledge in statistical methodology and computer science to create impactful predictions and insights for a wide range of traditional scholarly fields.”

Guest Post: Galin Jones on criteria for promotion and tenure in (bio)statistics departments

Editor’s Note: I attended an ASA Chair’s meeting and spoke about ways we could support junior faculty in data science. After my talk, Galin Jones, Professor and Director of Statistics at the University of Minnesota, and I had an interesting conversation about how they had changed their promotion criteria in response to a faculty candidate being unique. I asked him to write about his experience and he kindly contributed the following post.

The economic consequences of MOOCs

tl;dr check out our new paper on the relationship between MOOC completion and economic outcomes! Last Monday we launched our Chromebook Data Science Program so that anyone with an internet connection, a web browser, and the ability to read and follow instructions could become a data scientist. Why did we launch another MOOC program? Aren’t MOOCs dead? Well we didn’t think so :). We have been pretty excited about MOOCs for a while now and now run five different MOOC programs through the Johns Hopkins Data Science Lab.

Chromebook Data Science - a free online data science program for anyone with a web browser.

The Johns Hopkins Data Science Lab has been teaching massive open online courses for more than 5 years now. During that time we’ve reached more than 5 million learners who want to break into the number one rated job in America. While we have been incredibly excited about the results of these training programs, we’ve also learned over the last 5+ years that there are still significant barriers to getting into data science.

The complex process of obtaining Puerto Rico mortality data: a timeline

Rolando Acosta and I recently posted a manuscript on bioRxiv describing the effects of Hurricane María, based on an analysis of mortality data provided by the Demographic Registry. I was also an author on a paper published in May based on a survey of 3,000 households. These are very different datasets. Assuming it is complete, the Demographic Registry dataset provides much more precise quantitative information. However, this dataset was not made publicly available until June 1, 2018, three days after the paper based on the survey data was released.

Divergent and Convergent Phases of Data Analysis

There are often discussions within the data science community about which tools are best for doing data science. The most recent iteration of this discussion is the so-called “First Notebook War”, which is well-summarized by Yihui Xie in his blog post (it is a great read). One thing that I have found missing from many discussions about tooling in data analysis is an acknowledgment that data analysis tends to advance through different phases and that different tools can be more or less useful in each of those phases.

Being at the Center

Hilary Parker and I just released part 2 of our book club discussion of Nigel Cross’s book Design Thinking and it centers around a profile of designer Gordon Murray, who spent his career designing Formula One race cars. One of the aspects of his job as a designer is taking a “systems approach” to solving problems. Coupled with that approach is his role in balancing the various priorities of members of his team.

Constructing a Data Analysis

This week Hilary Parker and I have started our “Book Club” on Not So Standard Deviations where we will be discussing Nigel Cross’s book Design Thinking: Understanding How Designers Think and Work. We will be talking about how the work of designers parallels the work of data scientists and how many of the principles developed in design port over so well to data analysis. While data visualization has always taken cues from design, I think much broader aspects of data analysis could benefit from the work studying design.

The Law and Order of Data Science

One conversation I’ve had a few times revolves around the question, “What’s the difference between science and data science?” If I were to come up with a simple distinction, I might say that Science starts with a question; data science starts with the data. What makes data science so difficult is that it starts in the wrong place. As a result, a certain amount of extra work must be done to understand the context surrounding a dataset before we can do anything useful.

The Trillion Dollar Question

Recently, Apple’s stock price rose to the point where the company’s market valuation was above $1 trillion, the first U.S. company to reach that benchmark. Subsequently, numerous articles were published describing Apple’s journey to this point and why it got there. Most people describe Apple as a technology company. They make technology products: iPhones, iPads, Macs, etc. These are all computing devices. But there is another way to think of Apple and what kind of company they are as well as how they became so successful.