Simply Statistics

13
Apr

Interview with Drew Conway - Author of "Machine Learning for Hackers"

Drew Conway

Drew Conway is a Ph.D. student in Politics at New York University and the co-ordinator of the New York Open Statistical Programming Meetup. He is the creator of the famous (or infamous) data science Venn diagram, the basis for our R function to determine if your a data scientist. He is also the co-author of Machine Learning for Hackers, a book of case studies that illustrates data science from a hacker’s perspective. 

Which term applies to you: data scientist, statistician, computer
scientist, or something else?
Technically, my undergraduate degree is in computer science, so that term can be applied.  I was actually double-major in CS and political science, however, so it wouldn’t tell the whole story.  I have always been most interested in answering social science problems with the tools of computer science, math and statistics.
I have struggled a bit with the term “data scientist.”  About a year ago, when it seemed to be gaining a lot of popularity, I bristled at it.  Like many others, I complained that it was simply a corporate rebranding of other skills, and that the term “science” was appended to give some veil of legitimacy.  Since then, I have warmed to the term, but—-as is often the case—-only when I can define what data science is in my own terms.  Now, I do think of what I do as being data science, that is, the blending of technical skills and tools from computer science, with the methodological training of math and statistics, and my own substantive interest in questions about collective action and political ideology.
I think the term is very loaded, however, and when many people invoke it they often do so as a catch-all for talking about working with a certain a set of tools: R, map-reduce, data visualization, etc.  I think this actually hurts the discipline a great deal, because if it is meant to actually be a science the majority of our focus should be on questions, not tools.
 

You are in the department of politics? How is it being a “data
person” in a non-computational department?

Data has always been an integral part of the discipline, so in that sense many of my colleagues are data people.  I think the difference between my work and the work that many other political scientist do is simply a matter of where and how I get my data.  
For example, a traditional political science experiment might involve a small set of undergraduates taking a survey or playing a simple game on a closed network.  That data would then be collected and analyzed as a controlled experiment.  Alternatively, I am currently running an experiment wherein my co-authors and I are attempting to code text documents (political party manifestos) with ideological scores (very liberal to very conservative).  To do this we have broken down the documents into small chunks of text and are having workers on Mechanical Turk code single chunks—rather than the whole document at once.  In this case the data scale up very quickly, but by aggregating the results we are able to have a very different kind of experiment with much richer data.
At the same time, I think political science—-and perhaps the social sciences more generally—suffer from a tradition of undervaluing technical expertise. In that sense, it is difficult to convince colleagues that developing software tools is important. 
 

Is that what inspired you to create the New York Open Statistical Meetup?

I actually didn’t create the New York Open Statistical Meetup (formerly the R meetup).  Joshua Reich was the original founder, back in 2008, and shortly after the first meeting we partnered and ran the Meetup together.  Once Josh became fully consumed by starting / running BankSimple I took it over by myself.  I think the best part about the Meetup is how it brings people together from a wide range of academic and industry backgrounds, and we can all talk to each other in a common language of computational programming.  The cross-pollination of ideas and talents is inspiring.
We are also very fortunate in that the community here is so strong, and that New York City is a well traveled place, so there is never a shortage of great speakers.
 

You created the data science Venn diagram. Where do you fall on the diagram?

Right at the center, of course! Actually, before I entered graduate school, which is long before I drew the Venn diagram, I fell squarely in the danger zone.  I had a lot of hacking skills, and my work (as an analyst in the U.S. intelligence community) afforded me a lot of substantive expertise, but I had little to no formal training in statistics.  If you could describe my journey through graduate school within the framework of the data science Venn diagram, it would be about me trying to pull myself out of the danger zone by gaining as much math and statistics knowledge as I can.  
 

I see that a lot of your software (including R packages) are on Github. Do you post them on CRAN as well? Do you think R developers will eventually move to Github from CRAN?

I am a big proponent of open source development, especially in the context of sharing data and analyses; and creating reproducible results.  I love Github because it creates a great environment for following the work of other coders, and participating in the development process.  For data analysis, it is also a great place to upload data and R scripts and allow the community to see how you did things and comment.  I also think, however, that there is a big opportunity for a new site—-like Github—-to be created that is more tailored for data analysis, and storing and disseminating data and visualizations.
I do post my R packages to CRAN, and I think that CRAN is one of the biggest strengths of the R language and community.  I think ideally more package developers would open their development process, on Github or some other social coding platform, and then push their well-vetted packages to CRAN.  This would allow for more people to participate, but maintain the great community resource that CRAN provides. 
 

What inspired you to write, “Machine Learning for Hackers”? Who
was your target audience?

A little over a year ago John Myles White (my co-author) and I were having a lot of conversations with other members of the data community in New York City about what a data science curriculum would look like.  During these conversations people would always cite the classic text; Elements of Statistical Learning, Pattern Recognition and Machine Learning, etc., which are excellent and deep treatments of the foundational theories of machine learning.  From these conversations it occurred to us that there was not a good text on machine learning for people who thought more algorithmically.  That is, there was not a text for “hackers,” people who enjoy learning about computation by opening up black-boxes and getting their hands dirty with code.
It was from this idea that the book, and eventually the title, were borne.  We think the audience for the book is anyone who wants to get a relatively broad introduction to some of the basic tools of machine learning, and do so through code—-not math.  This can be someone working at a company with data that wants to add some of these tools to their belt, or it can be an undergraduate in a computer science or statistics program that can relate to the material more easily through this presentation than the more theoretically heavy texts they’re probably already reading for class. 
12
Apr

The Problem with Universities

I have had the following conversation a number of times recently:

  1. I want to do X. X is a lot of fun and is really interesting. Doing X involves a little of A and a little of B.
  2. We should get some students to do X also.
  3. Okay, but from where should we get the students? Students in Department of A don’t know B. Students from Department of B don’t know A.
  4. Fine, maybe we could start a program that specifically trains people in X. In this program we’ll teach them A and B. It’ll be the first program of it’s kind! Woohoo!
  5. Sure that’s great, but because there aren’t any other departments of X, the graduates of our program now have to get jobs in departments of A or B. Those departments complain that students from Department of X only know a little of A (or B).
  6. Grrr. Go away.

Has anyone figured out a solution to this problem? Specifically, how do you train students to do something for which there’s no formal department/program without jeopardizing their career prospects?

11
Apr

Statistics is not math...

Statistics depends on math, like a lot of other disciplines (physics, engineering, chemistry, computer science). But just like those other disciplines, statistics is not math; math is just a tool used to solve statistical problems. Unlike those other disciplines, statistics gets lumped in with math in headlines. Whenever people use statistical analysis to solve an interesting problem, the headline reads:

“Math can be used to solve amazing problem X”

or

“The Math of Y” 

Here are some examples:

The Mathematics of Lego - Using data on legos to estimate a distribution

The Mathematics of War - Using data on conflicts to estimate a distribution

Usain Bolt can run faster with maths (Tweet) - Turns out they analyzed data on start times to come to the conclusion

The Mathematics of Beauty - Analysis of data relating dating profile responses and photo attractiveness

These are just a few off of the top of my head, but I regularly see headlines like this. I think there are a couple reasons for math being grouped with statistics: (1) many of the founders of statistics were mathematicians first (but not all of them) (2) many statisticians still identify themselves as mathematicians, and (3) in some cases statistics and statisticians define themselves pretty narrowly. 

With respect to (3), consider the following list of disciplines:

  1. Biostatistics
  2. Data science
  3. Machine learning
  4. Natural language processing
  5. Signal processing
  6. Business analytics
  7. Econometrics
  8. Text mining
  9. Social science statistics
  10. Process control

All of these disciplines could easily be classified as “applied statistics”. But how many folks in each of those disciplines would classify themselves as statisticians? More importantly, how many would be claimed by statisticians? 

10
Apr
09
Apr

What is a major revision?

I posted a little while ago on a proposal for a fast statistics journal. It generated a bunch of comments and even a really nice follow up post with some great ideas. Since then I’ve gotten reviews back on a couple of papers and I think I realized one of the key issues that is driving me nuts about the current publishing model. It boils down to one simple question: 

What is a major revision? 

I often get reviews back that suggest “major revisions” in one or many of the following categories:

  1. More/different simulations
  2. New simulations
  3. Re-organization of content
  4. Re-writing language
  5. Asking for more references
  6. Asking me to include a new method
  7. Asking me to implement someone else’s method for comparison
I don’t consider any of these major revisions. Personally, I have stopped asking for them as major revisions. In my opinion, major revisions should be reserved for issues with the manuscript that suggest that it may be reporting incorrect results. Examples include:
  1. No simulations
  2. No real data
  3. The math/computations look incorrect
  4. The software didn’t work when I tried it
  5. The methods/algorithms are unreadable and can’t be followed
The first list is actually a list of minor/non-essential revisions in my opinion. They may improve my paper, but they won’t confirm that it is correct or not. I find that they are often subjective and are up to the whims of referees. In my own personal refereeing I am making an effort to remove subjective major revisions and only include issues that are critical to evaluate the correctness of a manuscript. I also try to divorce the issues of whether an idea is interesting or not from whether an idea is correct or not. 
I’d be curious to know what other peoples’ definitions of major/minor revisions are?




09
Apr

Sunday data/statistics link roundup (4/8)

  1. This is a great article about the illusion of progress in machine learning. In part, I think it explains why the Leekasso (just using the top 10) isn’t a totally silly idea. I also love how he talks about sources of uncertainty in real prediction problems that aren’t part of the classical models when developing prediction algorithms. I think that this is a hugely underrated component of building an accurate classifier - just finding the quirks particular to a type of data. Via @chlalanne.
  2. An interesting post from Michael Eisen on a serious abuse of statistical ideas in the New York Times. The professor of genetics quoted in the story apparently wasn’t aware of the birthday problem. Lack of statistical literacy, even among scientists, is becoming critical. I would love it if the Kahn academy (or some enterprising students) would come up with a set of videos that just explained a bunch of basic statistical concepts - skipping all the hard math and focusing on the ideas. 
  3.  TechCrunch finally caught up to our Mayo vs. Prometheus coverage. This decision is going to affect more than just personalized medicine. Speaking of the decision, stay tuned for more on that topic from the folks over here at Simply Statistics. 
  4. How much is a megabyte? I love this question. They asked people on the street how much data was in a megabyte. The answers were pretty far ranging looks like. This question is hyper-critical for scientists in the new era, but the better question might be, “How much is a terabyte?”
05
Apr
04
Apr
03
Apr

ENAR Meeting

This is the ENAR meeting so posting will be intermittent. If you’re at the meeting I’ll be talking at 1:45 today in the Columbia B room in a session on climate change and health. I hear Rafa is roaming the halls too so make sure you say hi if you see him.