Roughly once a year, I read John Tukey’s paper “The Future of Data Analysis”, originally published in 1962 in the Annals of Mathematical Statistics. I’ve been doing this for the past 17 years, each time hoping to really understand what it was he was talking about. Thankfully, each time I read it I seem to get something new out of it. For example, in 2017 I wrote a whole talk around some of the basic ideas.
Editor’s note: This is the next in our series of interviews with early career statisticians and data scientists. Today we are talking to Abhi Datta about his work in large scale spatial analysis and his interest in soccer! Follow him on Twitter at @datta_science. If you have recommendations of an (early career) person in academics or industry you would like to see promoted, reach out to Jeff (@jtleek) on Twitter!
Over the last few weeks I’ve had a couple of interactions with folks from the computer science world who were pretty disparaging of the R programming language. A lot of the criticism focused on the perception that R is limited to statistical analysis. It’s true, R does have a hugely comprehensive list of analysis packages on CRAN, Bioconductor, Neuroconductor, and ROpenSci as well as great package management. As I was having these conversations I realized that R has grown into a multi-purpose connective language for things beyond just data analysis.
Statisticians have been pointing out the problem with dynamite plots, also known as bar and line graphs, for years. Karl Broman lists them as one of the top ten worst graphs. The problem has even been documented in the peer-reviewed literature. For example, this British Journal of Pharmacology paper titled “Show the data, don’t conceal them” was published in 2011. However, despite all these efforts, dynamite plots continue to be ubiquitous in the scientific literature.
Editor’s note: For a while we ran an interview series for statisticians and data scientists, but things have gotten a little hectic around here so we’ve dropped the ball! But we are re-introducing the series, starting with Stephanie Hicks. If you have recommendations of a (junior) person in academics or industry you would like to see promoted, reach out to Jeff (@jtleek) on Twitter! Stephanie Hicks received her PhD in statistics in 2013 at Rice University and has already made major contributions to the analysis of single cell sequencing data and the theory and practice of teaching data science.
What makes for a good data scientist? This is a question I asked a long time ago and am still trying to figure out the answer. Seven years ago, I wrote: I was thinking about the people who I think are really good at data analysis and it occurred to me that they were all people I knew. So I started thinking about people that I don’t know (and there are many) but are equally good at data analysis.
In episode 71 of Not So Standard Deviations, Hilary Parker and I inaugurated our first “Data Science Design Challenge” segment where we discussed how we would solve a given problem using data science. The idea with calling it a “design challenge” was to contrast it with common “hackathon” type models where you are presented with an already-collected dataset and then challenged to find something interesting in the data. Here, we wanted to start with a problem and then talk about how data might be collected and analyzed to address the problem.
A recent article in the Wall Street Journal, “At Netflix, Who Wins When It’s Hollywood vs. the Algorithm?” by Shalini Ramachandran and Joe Flint, details some of the internal debates within Netflix between the Los Angeles-based content team, which is in charge of developing and marketing new content for the streaming service, and the data team. The initial example described is an advertising image for a new show (“Grace and Frankie”) starring Jane Fonda and Lily Tomlin.
In data analysis, we make use of a lot of theory, whether we like to admit it or not. In traditional statistical training, things like the central limit theorem and the law of large numbers (and their many variations) are deeply baked into our heads. I probably use the central limit theorem every day in my work, sometimes for the better, and sometimes for the worse. Even if I’m not directly applying a Normal approximation, knowledge of the central limit theorem will often guide my thinking and help me to decide what to do in a given data analytic situation.
I was recently asked to moderate an academic panel on the role of universities in training the data science workforce. I preceded each question with opinionated introductions which I have fused into this blog post. These are weakly held opinions so please consider commenting if you disagree with anything. To discuss data science education we first need to clearly state what it means. The panel organizers defined data science as “an emerging discipline that draws upon knowledge in statistical methodology and computer science to create impactful predictions and insights for a wide range of traditional scholarly fields.