What makes for a good data scientist? This is a question I asked a long time ago and am still trying to answer. Seven years ago, I wrote: I was thinking about the people who I think are really good at data analysis and it occurred to me that they were all people I knew. So I started thinking about people who I don’t know (and there are many) but who are equally good at data analysis.
In episode 71 of Not So Standard Deviations, Hilary Parker and I inaugurated our first “Data Science Design Challenge” segment, where we discussed how we would solve a given problem using data science. The idea behind calling it a “design challenge” was to contrast it with common “hackathon” type models, where you are presented with an already-collected dataset and then challenged to find something interesting in the data. Here, we wanted to start with a problem and then talk about how data might be collected and analyzed to address that problem.
A recent article in the Wall Street Journal, “At Netflix, Who Wins When It’s Hollywood vs. the Algorithm?” by Shalini Ramachandran and Joe Flint, details some of the internal debates within Netflix between the Los Angeles-based content team, which is in charge of developing and marketing new content for the streaming service, and the data team. The initial example described is an advertising image for a new show (“Grace and Frankie”) starring Jane Fonda and Lily Tomlin.
In data analysis, we make use of a lot of theory, whether we like to admit it or not. In a traditional statistical training, things like the central limit theorem and the law of large numbers (and their many variations) are deeply baked into our heads. I probably use the central limit theorem every day in my work, sometimes for the better, and sometimes for the worse. Even if I’m not directly applying a Normal approximation, knowledge of the central limit theorem will often guide my thinking and help me decide what to do in a given data analytic situation.
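As a toy illustration (not from the original post), a quick simulation shows the kind of Normal approximation the central limit theorem licenses: sample means of draws from a heavily skewed distribution look approximately Normal even at a moderate sample size. The distribution and sample sizes here are arbitrary choices for the sketch.

```python
import numpy as np

rng = np.random.default_rng(42)

# Draw many samples from a heavily skewed distribution (exponential),
# then look at the distribution of their sample means.
n, reps = 50, 10_000
samples = rng.exponential(scale=1.0, size=(reps, n))
means = samples.mean(axis=1)

# The CLT suggests the sample means are approximately Normal with
# mean 1 (the exponential's mean) and sd 1/sqrt(n) (its sd / sqrt(n)).
print(means.mean())  # close to 1.0
print(means.std())   # close to 1/sqrt(50), about 0.14
```

Even though a single exponential draw is far from Normal, the averages concentrate around the true mean with the spread the theorem predicts, which is why the Normal approximation so often works in practice.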
I was recently asked to moderate an academic panel on the role of universities in training the data science workforce. I preceded each question with opinionated introductions, which I have fused into this blog post. These are weakly held opinions, so please consider commenting if you disagree with anything. To discuss data science education, we first need to clearly state what it means. The panel organizers defined data science as “an emerging discipline that draws upon knowledge in statistical methodology and computer science to create impactful predictions and insights for a wide range of traditional scholarly fields.”
Editor’s Note: I attended an ASA Chair’s meeting and spoke about ways we could support junior faculty in data science. After my talk, Galin Jones, Professor and Director of Statistics at the University of Minnesota, and I had an interesting conversation about how his department had changed its promotion criteria in response to a faculty candidate with an unusual profile. I asked him to write about his experience, and he kindly contributed the following post.
tl;dr check out our new paper on the relationship between MOOC completion and economic outcomes! Last Monday we launched our Chromebook Data Science Program so that anyone with an internet connection, a web browser, and the ability to read and follow instructions could become a data scientist. Why did we launch another MOOC program? Aren’t MOOCs dead? Well, we didn’t think so :). We have been pretty excited about MOOCs for a while and currently run five different MOOC programs through the Johns Hopkins Data Science Lab.
The Johns Hopkins Data Science Lab has been teaching massive open online courses for more than 5 years now. During that time we’ve reached more than 5 million learners who want to break into the number one rated job in America. While we have been incredibly excited about the results of these training programs, we’ve also learned over the last 5+ years that there are still significant barriers to getting into data science.
Rolando Acosta and I recently posted a manuscript on bioRxiv describing the effects of Hurricane María, based on an analysis of mortality data provided by the Demographic Registry. I was also an author on a paper published in May based on a survey of 3,000 households. These are very different datasets. Assuming it is complete, the Demographic Registry dataset provides much more precise quantitative information. However, this dataset was not made publicly available until June 1, 2018, three days after the paper based on the survey data was released.
There are often discussions within the data science community about which tools are best for doing data science. The most recent iteration of this discussion is the so-called “First Notebook War”, which is well-summarized by Yihui Xie in his blog post (it is a great read). One thing that I have found missing from many discussions about tooling in data analysis is an acknowledgment that data analysis tends to advance through different phases and that different tools can be more or less useful in each of those phases.