Simply Statistics

A statistics blog by Rafa Irizarry, Roger Peng, and Jeff Leek

Is Artificial Intelligence Revolutionizing Environmental Health?

NOTE: This post was written by Kevin Elliott, Michigan State University; Nicole Kleinstreuer, National Institutes of Health; Patrick McMullen, ScitoVation; Gary Miller, Columbia University; Bhramar Mukherjee, University of Michigan; Roger D. Peng, Johns Hopkins University; Melissa Perry, The George Washington University; Reza Rasoulpour, Corteva Agriscience, and Elizabeth Boyle, National Academies of Sciences, Engineering, and Medicine. The full summary for the workshop on which this post is based can be obtained here.

You can replicate almost any plot with R

Although R is great for quickly turning data into plots, it is not widely used for making publication ready figures. But, with enough tinkering you can make almost any plot in R. For examples check out the flowingdata blog or the Fundamentals of Data Visualization book. Here I show five charts from the lay press that I use as examples in my data science courses. In the past I would show the originals, but I decided to replicate them in R to make it possible to generate class notes with just R code (there was a lot of googling involved).

So You Want to Start a Podcast

Podcasting has gotten quite a bit easier over the past 10 years, due in part to improvements to hardware and software. I wrote about both how I edit and record both of my podcasts about 2 years ago and, while not much has changed since then, I thought it might be helpful if I organized the information in a better way for people just starting out with a new podcast.

The data deluge means no reasonable expectation of privacy - now what?

Today a couple of different things reminded me about something that I suppose many people are talking about but has been on my mind as well. The idea is that many of our societies social norms are based on the reasonable expectation of privacy. But the reasonable expectation of privacy is increasingly a thing of the past. Three types of data I’ve been thinking about are: Obviously identifying data: Data like cellphone GPS traces and public social media posts are obviously information that is indentifiable and reduce privacy.

More datasets for teaching data science: The expanded dslabs package

Introduction We have expanded the dslabs package, which we previously introduced as a package containing realistic, interesting and approachable datasets that can be used in introductory data science courses. This release adds 7 new datasets on climate change, astronomy, life expectancy, and breast cancer diagnosis. They are used in improved problem sets and new projects within the HarvardX Data Science Professional Certificate Program, which teaches beginning R programming, data visualization, data wrangling, statistics, and machine learning for students with no prior coding background.

Research quality data and research quality databases

When you are doing data science, you are doing research. You want to use data to answer a question, identify a new pattern, improve a current product, or come up with a new product. The common factor underlying each of these tasks is that you want to use the data to answer a question that you haven’t answered before. The most effective process we have come up for getting those answers is the scientific research process.

I co-founded a company! Meet Problem Forward Data Science

I have some exciting news about something I’ve been working on for the last year or so. I started a company! It’s called Problem Forward data science. I’m pumped about this new startup for a lot of reasons. My co-founder is one of my families closest friends, Jamie McGovern, who has more than 2 decades of experience in the consulting world and who I’ve known for 15 years. We are creating a cool new model of “data scientist as a service” (more on that below) We have a problem forward, not solution backward approach to data science that grew out of the Hopkins philosophy of data science.

Generative and Analytical Models for Data Analysis

Describing how a data analysis is created is a topic of keen interest to me and there are a few different ways to think about it. Two different ways of thinking about data analysis are what I call the “generative” approach and the “analytical” approach. Another, more informal, way that I like to think about these approaches is as the “biological” model and the “physician” model. Reading through the literature on the process of data analysis, I’ve noticed that many seem to focus on the former rather than the latter and I think that presents an opportunity for new and interesting work.

Tukey, Design Thinking, and Better Questions

Roughly once a year, I read John Tukey’s paper “The Future of Data Analysis”, originally published in 1962 in the Annals of Mathematical Statistics. I’ve been doing this for the past 17 years, each time hoping to really understand what it was he was talking about. Thankfully, each time I read it I seem to get something new out of it. For example, in 2017 I wrote a whole talk around some of the basic ideas.

Interview with Abhi Datta

Editor’s note: This is the next in our series of interviews with early career statisticians and data scientists. Today we are talking to Abhi Datta about his work in large scale spatial analysis and his interest in soccer! Follow him on Twitter at @datta_science. If you have recommendations of an (early career) person in academics or industry you would like to see promoted, reach out to Jeff (@jtleek) on Twitter!