Tag: Machine Learning

Apr 13

Interview with Drew Conway - Author of "Machine Learning for Hackers"

Drew Conway

Drew Conway is a Ph.D. student in Politics at New York University and the co-ordinator of the New York Open Statistical Programming Meetup. He is the creator of the famous (or infamous) data science Venn diagram, the basis for our R function to determine if you're a data scientist. He is also the co-author of Machine Learning for Hackers, a book of case studies that illustrates data science from a hacker's perspective.

Which term applies to you: data scientist, statistician, computer scientist, or something else?
Technically, my undergraduate degree is in computer science, so that term can be applied.  I was actually a double major in CS and political science, however, so it wouldn't tell the whole story.  I have always been most interested in answering social science questions with the tools of computer science, math and statistics.
I have struggled a bit with the term "data scientist."  About a year ago, when it seemed to be gaining a lot of popularity, I bristled at it.  Like many others, I complained that it was simply a corporate rebranding of other skills, and that the term "science" was appended to give some veil of legitimacy.  Since then, I have warmed to the term, but—as is often the case—only when I can define what data science is in my own terms.  Now, I do think of what I do as being data science, that is, the blending of technical skills and tools from computer science with the methodological training of math and statistics, and my own substantive interest in questions about collective action and political ideology.
I think the term is very loaded, however, and when many people invoke it they often do so as a catch-all for talking about working with a certain set of tools: R, map-reduce, data visualization, etc.  I think this actually hurts the discipline a great deal, because if it is meant to actually be a science, the majority of our focus should be on questions, not tools.
 

You are in the department of politics? How is it being a "data person" in a non-computational department?

Data has always been an integral part of the discipline, so in that sense many of my colleagues are data people.  I think the difference between my work and the work that many other political scientists do is simply a matter of where and how I get my data.
For example, a traditional political science experiment might involve a small set of undergraduates taking a survey or playing a simple game on a closed network.  That data would then be collected and analyzed as a controlled experiment.  Alternatively, I am currently running an experiment wherein my co-authors and I are attempting to code text documents (political party manifestos) with ideological scores (very liberal to very conservative).  To do this we have broken down the documents into small chunks of text and are having workers on Mechanical Turk code single chunks—rather than the whole document at once.  In this case the data scale up very quickly, but by aggregating the results we are able to have a very different kind of experiment with much richer data.
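To make the aggregation step concrete, here is a minimal sketch in R. The data frame, column names, and scores are made up for illustration and are not the actual study's replication code: several Mechanical Turk workers score each small chunk of a manifesto on a 1 (very liberal) to 5 (very conservative) scale, worker scores are averaged within chunks, and the chunk averages are rolled up to one ideology estimate per manifesto.

```r
# Toy illustration of the aggregation step (hypothetical data, not the actual study)
codings <- data.frame(
  doc   = c("party_a", "party_a", "party_a", "party_b", "party_b", "party_b"),
  chunk = c(1, 1, 2, 1, 1, 2),
  score = c(2, 3, 1, 4, 5, 4)  # worker ratings: 1 = very liberal, 5 = very conservative
)

# Average worker scores within each chunk, then average the chunk means
# to get a single ideology estimate per manifesto.
chunk_means <- aggregate(score ~ doc + chunk, data = codings, FUN = mean)
doc_scores  <- aggregate(score ~ doc, data = chunk_means, FUN = mean)
doc_scores
#       doc score
# 1 party_a  1.75
# 2 party_b  4.25
```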
At the same time, I think political science—and perhaps the social sciences more generally—suffers from a tradition of undervaluing technical expertise. In that sense, it is difficult to convince colleagues that developing software tools is important.
 

Is that what inspired you to create the New York Open Statistical Programming Meetup?

I actually didn't create the New York Open Statistical Programming Meetup (formerly the R meetup).  Joshua Reich was the original founder, back in 2008, and shortly after the first meeting we partnered and ran the Meetup together.  Once Josh became fully consumed by starting and running BankSimple, I took it over by myself.  I think the best part about the Meetup is how it brings people together from a wide range of academic and industry backgrounds, and we can all talk to each other in a common language of computational programming.  The cross-pollination of ideas and talents is inspiring.
We are also very fortunate in that the community here is so strong, and that New York City is a well traveled place, so there is never a shortage of great speakers.
 

You created the data science Venn diagram. Where do you fall on the diagram?

Right at the center, of course! Actually, before I entered graduate school, which is long before I drew the Venn diagram, I fell squarely in the danger zone.  I had a lot of hacking skills, and my work (as an analyst in the U.S. intelligence community) afforded me a lot of substantive expertise, but I had little to no formal training in statistics.  If you could describe my journey through graduate school within the framework of the data science Venn diagram, it would be about me trying to pull myself out of the danger zone by gaining as much math and statistics knowledge as I can.  
 

I see that a lot of your software (including R packages) is on Github. Do you post them on CRAN as well? Do you think R developers will eventually move to Github from CRAN?

I am a big proponent of open source development, especially in the context of sharing data and analyses and creating reproducible results.  I love Github because it creates a great environment for following the work of other coders and participating in the development process.  For data analysis, it is also a great place to upload data and R scripts and allow the community to see how you did things and comment.  I also think, however, that there is a big opportunity for a new site—like Github—to be created that is more tailored for data analysis, and for storing and disseminating data and visualizations.
I do post my R packages to CRAN, and I think that CRAN is one of the biggest strengths of the R language and community.  I think ideally more package developers would open their development process, on Github or some other social coding platform, and then push their well-vetted packages to CRAN.  This would allow for more people to participate, but maintain the great community resource that CRAN provides. 
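As a concrete illustration of that workflow, a user could install the vetted release of a package from CRAN while following or testing the in-development version on Github. This is only a sketch; "mypackage" and "someuser" are placeholder names, not a real package or account.

```r
# Placeholder names for illustration only.

# Stable, vetted release from CRAN:
install.packages("mypackage")

# In-development version from the author's GitHub repository,
# installed via the devtools package:
install.packages("devtools")
devtools::install_github("someuser/mypackage")
```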
 

What inspired you to write "Machine Learning for Hackers"? Who was your target audience?

A little over a year ago John Myles White (my co-author) and I were having a lot of conversations with other members of the data community in New York City about what a data science curriculum would look like.  During these conversations people would always cite the classic texts: Elements of Statistical Learning, Pattern Recognition and Machine Learning, etc., which are excellent and deep treatments of the foundational theories of machine learning.  From these conversations it occurred to us that there was not a good text on machine learning for people who thought more algorithmically.  That is, there was not a text for "hackers," people who enjoy learning about computation by opening up black boxes and getting their hands dirty with code.
It was from this idea that the book, and eventually the title, were born.  We think the audience for the book is anyone who wants to get a relatively broad introduction to some of the basic tools of machine learning, and do so through code—not math.  This can be someone working at a company with data who wants to add some of these tools to their belt, or an undergraduate in a computer science or statistics program who can relate to the material more easily through this presentation than through the more theoretically heavy texts they're probably already reading for class.
Sep 29

Kindle Fire and Machine Learning

Amazon released its new iPad competitor, the Kindle Fire, today. A quick read through the description shows it has some interesting features, including a custom-built web browser called Silk. One claimed innovation is that the browser works in conjunction with Amazon's EC2 cloud computing platform to speed up the web-surfing experience by doing some of the computing on your end and some on their end. Seems cool, if it really does make things faster.

Also there’s this interesting bit:

Machine Learning

Finally, Silk leverages the collaborative filtering techniques and machine learning algorithms Amazon has built over the last 15 years to power features such as “customers who bought this also bought…” As Silk serves up millions of page views every day, it learns more about the individual sites it renders and where users go next. By observing the aggregate traffic patterns on various web sites, it refines its heuristics, allowing for accurate predictions of the next page request. For example, Silk might observe that 85 percent of visitors to a leading news site next click on that site’s top headline. With that knowledge, EC2 and Silk together make intelligent decisions about pre-pushing content to the Kindle Fire. As a result, the next page a Kindle Fire customer is likely to visit will already be available locally in the device cache, enabling instant rendering to the screen.
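
Amazon doesn't spell out the algorithm, but the basic idea in that passage (estimate the most likely next page from aggregate click counts, and pre-push it only when the conditional probability is high) can be sketched in a few lines of R. The data and the 0.8 threshold below are made up; this is not Amazon's implementation.

```r
# Toy sketch of next-page prediction from aggregate click logs
# (fabricated data; not Amazon's algorithm or scale).
clicks <- data.frame(
  current = c(rep("news_home", 20), rep("sports_home", 5)),
  next_pg = c(rep("top_headline", 17), rep("weather", 3), rep("scores", 5))
)

# Estimate P(next page | current page) from the aggregate counts.
trans <- prop.table(table(clicks$current, clicks$next_pg), margin = 1)

# Pre-fetch the most likely next page only when the estimate is confident enough.
prefetch <- function(page, threshold = 0.8) {
  probs <- trans[page, ]
  if (max(probs) >= threshold) names(which.max(probs)) else NA
}

prefetch("news_home")    # "top_headline" (17/20 = 0.85, like the 85 percent example above)
prefetch("sports_home")  # "scores" (5/5 = 1.0)
```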

That seems like a logical thing for Amazon to do. While the idea of pre-fetching pages is not particularly new, I haven't yet heard of doing data analysis on web traffic to predict which things to pre-fetch. One issue this raises in my mind is that, in order to do this, Amazon needs to combine information across browsers, which means your surfing habits will become part of one large mega-dataset. Is that what we want?

On the one hand, Amazon already does some form of this by keeping track of what you buy. But keeping track of every web page you go to and what links you click on seems like a much wider scope.