Which term applies to you: data scientist, statistician, computer scientist, or something else?
Technically, my undergraduate degree is in computer science, so that term can be applied. I was actually a double major in CS and political science, however, so it wouldn’t tell the whole story. I have always been most interested in answering social science problems with the tools of computer science, math and statistics.
I have struggled a bit with the term “data scientist.” About a year ago, when it seemed to be gaining a lot of popularity, I bristled at it. Like many others, I complained that it was simply a corporate rebranding of other skills, and that the term “science” was appended to give some veil of legitimacy. Since then, I have warmed to the term, but, as is often the case, only when I can define what data science is on my own terms. Now, I do think of what I do as being data science, that is, the blending of technical skills and tools from computer science with the methodological training of math and statistics, and my own substantive interest in questions about collective action and political ideology.
I think the term is very loaded, however, and when many people invoke it they often do so as a catch-all for talking about working with a certain set of tools: R, map-reduce, data visualization, etc. I think this actually hurts the discipline a great deal, because if it is meant to actually be a science, the majority of our focus should be on questions, not tools.
You are in the department of politics? How is it being a “data person” in a non-computational department?
Data has always been an integral part of the discipline, so in that sense many of my colleagues are data people. I think the difference between my work and the work that many other political scientists do is simply a matter of where and how I get my data.
For example, a traditional political science experiment might involve a small set of undergraduates taking a survey or playing a simple game on a closed network. That data would then be collected and analyzed as a controlled experiment. By contrast, I am currently running an experiment wherein my co-authors and I are attempting to code text documents (political party manifestos) with ideological scores (very liberal to very conservative). To do this we have broken down the documents into small chunks of text and are having workers on Mechanical Turk code single chunks, rather than the whole document at once. In this case the data scale up very quickly, but by aggregating the results we are able to have a very different kind of experiment with much richer data.
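The chunk-and-aggregate idea described above can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration, not the actual method used in the experiment: the function names `chunk_text` and `aggregate_scores`, the fixed chunk size in words, the numeric scale from -2 (very liberal) to 2 (very conservative), and the use of simple averaging to combine worker ratings.

```python
from statistics import mean


def chunk_text(text, words_per_chunk=50):
    """Split a document into consecutive chunks of at most words_per_chunk words.

    The chunk size is a hypothetical choice; any splitting rule could be used.
    """
    words = text.split()
    return [
        " ".join(words[i:i + words_per_chunk])
        for i in range(0, len(words), words_per_chunk)
    ]


def aggregate_scores(ratings):
    """Combine per-chunk worker ratings into chunk-level and document-level scores.

    ratings maps a chunk index to the list of scores given by different workers,
    on an assumed scale from -2 (very liberal) to 2 (very conservative).
    Averaging is an assumption here; other aggregation rules are possible.
    """
    # Average the workers' scores for each chunk that received at least one rating.
    chunk_means = {cid: mean(scores) for cid, scores in ratings.items() if scores}
    # Average the chunk-level scores to get a single score for the document.
    document_score = mean(list(chunk_means.values()))
    return chunk_means, document_score


if __name__ == "__main__":
    chunks = chunk_text("We will lower taxes and expand public services", words_per_chunk=4)
    # Hypothetical worker ratings, keyed by chunk index.
    ratings = {0: [1, 2, 1], 1: [-1, -2, -1]}
    chunk_means, document_score = aggregate_scores(ratings)
    print(chunks)
    print(chunk_means, document_score)
```

Because each worker sees only one short chunk, many independent ratings can be gathered per document and then pooled, which is where the richer data comes from.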
At the same time, I think political science, and perhaps the social sciences more generally, suffers from a tradition of undervaluing technical expertise. In that sense, it is difficult to convince colleagues that developing software tools is important.
Is that what inspired you to create the New York Open Statistical Meetup?
I actually didn’t create the New York Open Statistical Meetup (formerly the R meetup). Joshua Reich was the original founder, back in 2008, and shortly after the first meeting we partnered and ran the Meetup together. Once Josh became fully consumed by starting and running BankSimple, I took it over by myself. I think the best part about the Meetup is how it brings people together from a wide range of academic and industry backgrounds, and we can all talk to each other in a common language of computational programming. The cross-pollination of ideas and talents is inspiring.
We are also very fortunate in that the community here is so strong, and that New York City is a well-traveled place, so there is never a shortage of great speakers.
You created the data science Venn diagram. Where do you fall on the diagram?
Right at the center, of course! Actually, before I entered graduate school, which is long before I drew the Venn diagram, I fell squarely in the danger zone. I had a lot of hacking skills, and my work (as an analyst in the U.S. intelligence community) afforded me a lot of substantive expertise, but I had little to no formal training in statistics. If you could describe my journey through graduate school within the framework of the data science Venn diagram, it would be about me trying to pull myself out of the danger zone by gaining as much math and statistics knowledge as I can.
I see that a lot of your software (including R packages) is on GitHub. Do you post it on CRAN as well? Do you think R developers will eventually move to GitHub from CRAN?
I am a big proponent of open source development, especially in the context of sharing data and analyses and creating reproducible results. I love GitHub because it creates a great environment for following the work of other coders and participating in the development process. For data analysis, it is also a great place to upload data and R scripts and allow the community to see how you did things and comment. I also think, however, that there is a big opportunity for a new site, like GitHub, to be created that is more tailored for data analysis, and for storing and disseminating data and visualizations.
I do post my R packages to CRAN, and I think that CRAN is one of the biggest strengths of the R language and community. I think ideally more package developers would open their development process, on GitHub or some other social coding platform, and then push their well-vetted packages to CRAN. This would allow more people to participate while maintaining the great community resource that CRAN provides.
What inspired you to write “Machine Learning for Hackers”? Who was your target audience?
A little over a year ago John Myles White (my co-author) and I were having a lot of conversations with other members of the data community in New York City about what a data science curriculum would look like. During these conversations people would always cite the classic texts (Elements of Statistical Learning, Pattern Recognition and Machine Learning, etc.), which are excellent and deep treatments of the foundational theories of machine learning. From these conversations it occurred to us that there was not a good text on machine learning for people who thought more algorithmically. That is, there was not a text for “hackers,” people who enjoy learning about computation by opening up black boxes and getting their hands dirty with code.
It was from this idea that the book, and eventually the title, were born. We think the audience for the book is anyone who wants to get a relatively broad introduction to some of the basic tools of machine learning, and do so through code, not math. This can be someone working at a company with data who wants to add some of these tools to their belt, or it can be an undergraduate in a computer science or statistics program who can relate to the material more easily through this presentation than through the more theoretically heavy texts they’re probably already reading for class.