Tag: github

28
Oct

Sunday Data/Statistics Link Roundup (10/28/12)

  1. An important article about anti-science sentiment in the U.S. (via David S.). The politicization of scientific issues such as global warming, evolution, and healthcare (think vaccination) makes the U.S. less competitive. I think the lack of statistical literacy and training in the U.S. is one of the sources of the problem. People use/skew/mangle statistical analyses and experiments to support their view and without a statistically well trained public, it all looks “reasonable and scientific”. But when science seems to contradict itself, it loses credibility. Another reason to teach statistics to everyone in high school.
  2. Scientific American was loaded this last week, here is another article on cancer screening.  The article covers several of the issues that make it hard to convince people that screening isn’t always good. The predictive value of the positive confusion is a huge one in cancer screening right now. The author of the piece is someone worth following on Twitter @hildabast.
  3. A bunch of data on the use of Github. Always cool to see new data sets that are worth playing with for student projects, etc. (via Hilary M.). 
  4. A really interesting post over at Stats Chat about why we study seemingly obvious things. Hint, the reason is that “obvious” things aren’t always true. 
  5. A story on “sentiment analysis” by NPR that suggests that most of the variation in a stock’s price during the day can be explained by the number of Facebook likes. Obviously, this is an interesting correlation. Probably more interesting for hedge funders/stockpickers if the correlation was with the change in stock price the next day. (via Dan S.)
  6. Yihui Xie visited our department this week. We had a great time chatting with him about knitr/animation and all the cool work he is doing. Here are his slides from the talk he gave. Particularly check out his idea for a fast journal. You are seeing the future of publishing.  
  7. Bonus Link: R is a trendy open source technology for big data
27
May

Sunday data/statistics link roundup (5/27)

  1. Amanda Cox on the process they went through to come up with this graphic about the Facebook IPO. So cool to see how R is used in the development process. A favorite quote of mine, “But rather than bringing clarity, it just sort of looked chaotic, even to the seasoned chart freaks of 620 8th Avenue.” One of the more interesting things about posts like this is you get to see how statistics versus a deadline works. This is typically the role of the analyst, since they come in late and there is usually a deadline looming…
  2. An interview with Steve Blank about Silicon valley and how venture capitalists (VC’s) are focused on social technologies since they can make a profit quickly. A depressing/fascinating quote from this one is, “If I have a choice of investing in a blockbuster cancer drug that will pay me nothing for ten years,  at best, whereas social media will go big in two years, what do you think I’m going to pick? If you’re a VC firm, you’re tossing out your life science division.” He also goes on to say thank goodness for the NIH, NSF, and Google who are funding interesting “real science” problems. This probably deserves its own post later in the week, the difference between analyzing data because it will make money and analyzing data to solve a hard science problem. The latter usually takes way more patience and the data take much longer to collect. 
  3. An interesting post on how Obama’s analytics department ran an A/B test which improved the number of people who signed up for his mailing list. I don’t necessarily agree with their claim that they helped raise $60 million, there may be some confounding factors that mean that the individuals who sign up with the best combination of image/button don’t necessarily donate as much. But still, an interesting look into why Obama needs statisticians
  4. A cute statistics cartoon from @kristin_linn  via Chris V. Yes, we are now shamelessly reposting cute cartoons for retweets :-). 
  5. Rafa’s post inspired some interesting conversation both on our blog and on some statistics mailing lists. It seems to me that everyone is making an effort to understand the increasingly diverse field of statistics, but we still have a ways to go. I’m particularly interested in discussion on how we evaluate the contribution/effort behind making good and usable academic software. I think the strength of the Bioconductor community and the rise of Github among academics are a good start.  For example, it is really useful that Bioconductor now tracks the number of package downloads
13
Apr

Interview with Drew Conway - Author of "Machine Learning for Hackers"

Drew Conway

Drew Conway is a Ph.D. student in Politics at New York University and the co-ordinator of the New York Open Statistical Programming Meetup. He is the creator of the famous (or infamous) data science Venn diagram, the basis for our R function to determine if your a data scientist. He is also the co-author of Machine Learning for Hackers, a book of case studies that illustrates data science from a hacker’s perspective. 

Which term applies to you: data scientist, statistician, computer
scientist, or something else?
Technically, my undergraduate degree is in computer science, so that term can be applied.  I was actually double-major in CS and political science, however, so it wouldn’t tell the whole story.  I have always been most interested in answering social science problems with the tools of computer science, math and statistics.
I have struggled a bit with the term “data scientist.”  About a year ago, when it seemed to be gaining a lot of popularity, I bristled at it.  Like many others, I complained that it was simply a corporate rebranding of other skills, and that the term “science” was appended to give some veil of legitimacy.  Since then, I have warmed to the term, but—-as is often the case—-only when I can define what data science is in my own terms.  Now, I do think of what I do as being data science, that is, the blending of technical skills and tools from computer science, with the methodological training of math and statistics, and my own substantive interest in questions about collective action and political ideology.
I think the term is very loaded, however, and when many people invoke it they often do so as a catch-all for talking about working with a certain a set of tools: R, map-reduce, data visualization, etc.  I think this actually hurts the discipline a great deal, because if it is meant to actually be a science the majority of our focus should be on questions, not tools.
 

You are in the department of politics? How is it being a “data
person” in a non-computational department?

Data has always been an integral part of the discipline, so in that sense many of my colleagues are data people.  I think the difference between my work and the work that many other political scientist do is simply a matter of where and how I get my data.  
For example, a traditional political science experiment might involve a small set of undergraduates taking a survey or playing a simple game on a closed network.  That data would then be collected and analyzed as a controlled experiment.  Alternatively, I am currently running an experiment wherein my co-authors and I are attempting to code text documents (political party manifestos) with ideological scores (very liberal to very conservative).  To do this we have broken down the documents into small chunks of text and are having workers on Mechanical Turk code single chunks—rather than the whole document at once.  In this case the data scale up very quickly, but by aggregating the results we are able to have a very different kind of experiment with much richer data.
At the same time, I think political science—-and perhaps the social sciences more generally—suffer from a tradition of undervaluing technical expertise. In that sense, it is difficult to convince colleagues that developing software tools is important. 
 

Is that what inspired you to create the New York Open Statistical Meetup?

I actually didn’t create the New York Open Statistical Meetup (formerly the R meetup).  Joshua Reich was the original founder, back in 2008, and shortly after the first meeting we partnered and ran the Meetup together.  Once Josh became fully consumed by starting / running BankSimple I took it over by myself.  I think the best part about the Meetup is how it brings people together from a wide range of academic and industry backgrounds, and we can all talk to each other in a common language of computational programming.  The cross-pollination of ideas and talents is inspiring.
We are also very fortunate in that the community here is so strong, and that New York City is a well traveled place, so there is never a shortage of great speakers.
 

You created the data science Venn diagram. Where do you fall on the diagram?

Right at the center, of course! Actually, before I entered graduate school, which is long before I drew the Venn diagram, I fell squarely in the danger zone.  I had a lot of hacking skills, and my work (as an analyst in the U.S. intelligence community) afforded me a lot of substantive expertise, but I had little to no formal training in statistics.  If you could describe my journey through graduate school within the framework of the data science Venn diagram, it would be about me trying to pull myself out of the danger zone by gaining as much math and statistics knowledge as I can.  
 

I see that a lot of your software (including R packages) are on Github. Do you post them on CRAN as well? Do you think R developers will eventually move to Github from CRAN?

I am a big proponent of open source development, especially in the context of sharing data and analyses; and creating reproducible results.  I love Github because it creates a great environment for following the work of other coders, and participating in the development process.  For data analysis, it is also a great place to upload data and R scripts and allow the community to see how you did things and comment.  I also think, however, that there is a big opportunity for a new site—-like Github—-to be created that is more tailored for data analysis, and storing and disseminating data and visualizations.
I do post my R packages to CRAN, and I think that CRAN is one of the biggest strengths of the R language and community.  I think ideally more package developers would open their development process, on Github or some other social coding platform, and then push their well-vetted packages to CRAN.  This would allow for more people to participate, but maintain the great community resource that CRAN provides. 
 

What inspired you to write, “Machine Learning for Hackers”? Who
was your target audience?

A little over a year ago John Myles White (my co-author) and I were having a lot of conversations with other members of the data community in New York City about what a data science curriculum would look like.  During these conversations people would always cite the classic text; Elements of Statistical Learning, Pattern Recognition and Machine Learning, etc., which are excellent and deep treatments of the foundational theories of machine learning.  From these conversations it occurred to us that there was not a good text on machine learning for people who thought more algorithmically.  That is, there was not a text for “hackers,” people who enjoy learning about computation by opening up black-boxes and getting their hands dirty with code.
It was from this idea that the book, and eventually the title, were borne.  We think the audience for the book is anyone who wants to get a relatively broad introduction to some of the basic tools of machine learning, and do so through code—-not math.  This can be someone working at a company with data that wants to add some of these tools to their belt, or it can be an undergraduate in a computer science or statistics program that can relate to the material more easily through this presentation than the more theoretically heavy texts they’re probably already reading for class. 
04
Mar

Sunday data/statistics link roundup (3/4)

  1. A cool article on Github by the folks at Wired. I’m starting to think the fact that I’m not on Github is a serious dent in my nerd cred. 
  2. Datawrapper - a less intensive, but less flexible open source data visualization creator. I have seen a few of these types of services starting to pop up. I think that some statistics training should be mandatory before people use them. 
  3. An interesting blog post with the provocative title, “Why bother publishing in a journal” The story he describes works best if you have a lot of people who are interested in reading what you put on the internet. 
  4. A post on stackexchange comparing the machine learning and statistics cultures. 
  5. Stackoverflow is a great place to look for R answers. It is the R mailing list, minus the flames…
  6. Roger’s posts on Beijing air pollution are worth another read if you missed them. Particularly this one, where he computes the cigarette equivalent of the air pollution levels.