Tag: data science


My Online Course Development Workflow

One of the nice things about developing 9 new courses for the JHU Data Science Specialization in a short period of time is that you get to learn all kinds of cool and interesting tools. One of the ways that we were able to push out so much content in just a few months was that we did most of the work ourselves, rather than outsourcing things like video production and editing. You could argue that this results in a poorer quality final product but (a) I disagree; and (b) even if that were true, I think the content is still valuable.

The advantage of learning all the tools was that it allowed for a quick turn-around from the creation of the lecture to the final exporting of the video (often within a single day). For a hectic schedule, it's nice to be able to write slides in the morning, record some video in between two meetings in the afternoon, and then combine/edit all the video in the evening. Then if you realize something doesn't work, you can start over the next day and have another version done in less than 24 hours.

I thought it might be helpful to someone out there to detail the workflow and tools that I use to develop the content for my online courses.

  • I use Camtasia for Mac to do all my screencasting/recording. This is a nice tool and I think has more features than your average screen recorder. That said, if you just want to record your screen on your Mac, you can actually use the built-in QuickTime software. I used to do all of my video editing in Camtasia but now it's pretty much glorified screencasting software for me.
  • For talking head type videos I use my iPhone 5S mounted on a tripod. The iPhone produces surprisingly good 1080p HD 30 fps video and is definitely sufficient for my purposes (see here for a much better example of what can be done). I attach the phone to an Apogee microphone to pick up better sound. For some of the interviews that we do on Simply Statistics I use two iPhones (a 5S and a 4S, my older phone).
  • To record my primary sound (i.e. me talking), I use the Zoom H4N portable recorder. This thing is not cheap but it records very high-quality stereo sound. I can connect it to my computer via USB or it can record to an SD card.
  • For simple sound recording (no video or screen) I use Audacity.
  • All of my lecture videos are run through Final Cut Pro X on my 15-inch MacBook Pro with Retina Display. Videos from Camtasia are exported in Apple ProRes format and then imported into Final Cut. Learning FCPX is not for the faint-of-heart if you're not used to a nonlinear editor (as I was not). I bought this excellent book to help me learn it, but I still probably only use 1% of the features. In the end using a real editor was worth it because it makes merging multiple videos much easier (i.e. multicam shots for screencasts + talking head) and editing out mistakes (e.g. typos on slides) much simpler. The editor in Camtasia is pretty good but if you have more than one camera/microphone it becomes infeasible.
  • I have an 8TB Western Digital Thunderbolt drive to store the raw video for all my classes (and some backups). I also use two 1TB Thunderbolt drives to store video for individual classes (each 4-week class borders on 1TB of raw video). These smaller drives are nice because I can just throw them in my bag and edit video at home or on the weekend if I need to.
  • Finished videos are shared with a Dropbox for Business account so that Jeff, Brian, and I can all look at each other's stuff. Videos are exported to H.264/AAC and uploaded to Coursera.
  • For developing slides, Jeff, Brian, and I have standardized around using Slidify. The beauty of using Slidify is that it lets you write everything in Markdown, a super simple text format. It also makes it simpler to manage all the course material in Git/GitHub because you don't have to lug around huge PowerPoint files. Everything is a light-weight text file. And thanks to Ramnath's incredible grit and moxie, we have handy tools to easily export everything to PDF and HTML slides (HTML slides hosted via GitHub Pages).

The first courses for the Data Science Specialization start on April 7th. Don't forget to sign up!


NIH is looking for an Associate Director for Data Science: Statisticians should consider applying

NIH understands the importance of data and several months ago they announced this new position. Here is an excerpt from the ad:

The ADDS will focus on the urgent need and increased opportunities for capitalizing on the expanding collections of biomedical data to advance NIH’s mission. In doing so, the incumbent will provide programmatic NIH-wide leadership for areas of data science that relate to data emanating from many areas of study (e.g., genomics, imaging, and electronic health records). This will require knowledge about multiple domains of study as well as familiarity with approaches for integrating data from these various domains.

In my opinion, the person holding this job should have hands-on experience with data analysis and programming. The nuances of what a data analyst needs to successfully do his/her job can't be underestimated. This knowledge will help this director make the right decisions when it comes to choosing what data to make available and how to make it available. When it comes to creating data resources, good intentions don't always translate into usable products.

In this new era of data-driven science, this position will be highly influential, which makes the job quite attractive. If you know of a statistician who you think would be interested, please pass along the information.


Sunday data/statistics link roundup (1/6/2013)

  1. Not really statistics, but this is an interesting article about how rational optimization by individual actors does not always lead to an optimal solution. Related, here is the coolest street sign I think I've ever seen, with a heatmap of traffic density to try to influence commuters.
  2. An interesting paper that talks about how clustering is only a really hard problem when there aren't obvious clusters. I was a little disappointed in the paper, because it defines the "obviousness" of clusters only theoretically by a distance metric. There is very little discussion of the practical distance/visual distance metrics people use when looking at clustering dendrograms, etc.
  3. A post about the two cultures of statistical learning and a related post on how data-driven science is a failure of imagination. I think in both cases, it is worth pointing out that the only good data science is good science - i.e. it seeks to answer a real, specific question through the scientific method. However, I think for many modern scientific problems it is pretty naive to think we will be able to come to a full, mechanistic understanding complete with tidy theorems that describe all the properties of the system. I think the real failure of imagination is to think that science/statistics/mathematics won't change to tackle the realistic challenges posed in solving modern scientific problems.
  4. A graph that shows the incredibly strong correlation ( > 0.99!) between the growth of autism diagnoses and organic food sales. Another example where even really strong correlation does not imply causation.
  5. The Buffalo Bills are going to start an advanced analytics department (via Rafa and Chris V.). Maybe they can take advantage of all this free play-by-play data from years of NFL games.
  6. A prescient interview with Isaac Asimov on learning, predicting the Khan Academy, MOOCs and other developments in online learning (via Rafa and Marginal Revolution).
  7. The statistical software signal - what your choice of software says about you. Just another reason we need a deterministic statistical machine.
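
The point in item 4 is easy to reproduce. Here is a minimal Python sketch, using made-up numbers rather than the actual autism or organic food data: two series that both trend upward for unrelated reasons are almost perfectly correlated in levels, while the correlation of their year-over-year changes (which removes the shared trend) tends to be much weaker. The slopes, noise levels, and helper names (`pearson`, `diffs`) are illustrative assumptions, not anything taken from the linked graph.

```python
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def diffs(xs):
    """Year-over-year changes, which strip out a shared linear trend."""
    return [b - a for a, b in zip(xs, xs[1:])]


rng = random.Random(0)  # fixed seed so the sketch is repeatable
years = range(20)

# Two hypothetical series that grow over time for unrelated reasons.
series_a = [100 + 10 * t + rng.gauss(0, 3) for t in years]
series_b = [5 + 2 * t + rng.gauss(0, 0.6) for t in years]

r_levels = pearson(series_a, series_b)
r_changes = pearson(diffs(series_a), diffs(series_b))

print(f"correlation of levels:  {r_levels:.3f}")
print(f"correlation of changes: {r_changes:.3f}")
```

The noise terms are drawn independently, so whatever correlation shows up in the levels comes entirely from the fact that both series go up over time.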



Sunday data/statistics link roundup (11/25/2012)

  1. My wife used to teach at Grinnell College, so we were psyched to see that a Grinnell player set the NCAA record for most points in a game. We used to go to the games, which were amazing to watch, when we lived in Iowa. The system the coach has in place there is a ton of fun to watch and is based on statistics!
  2. Someone has to vet the science writers at the Huffpo. This is out of control, basically claiming that open access publishing is harming science. I mean, I'm all about being a curmudgeon and all, but the internet exists now, so we might as well get used to it. 
  3. This one is probably better for Steven's blog, but this is a pretty powerful graph about the life-saving potential of vaccines.  
  4. Roger posted yesterday about the NY Times piece on deep learning. It is one of our most shared posts of all time; you should also check out the comments, which are exceedingly good. Two things I thought I'd point out in response to a lot of the reaction: (1) I think part of Roger's post was suggesting that the statistics community should adopt some of CS's culture of solving problems with already existing, really good methods and (2) I tried searching for a really clear example of "deep learning" yesterday so we could try some statistics on it and didn't find any really clear explanations. Does anyone have a really simple example of deep learning (ideally with code) so we can see how it relates to statistical concepts?

How important is abstract thinking for graduate students in statistics?

A recent lunchtime discussion here at Hopkins brought up the somewhat-controversial topic of abstract thinking in our graduate program. We, like a lot of other biostatistics/statistics programs, require our students to take measure theoretic probability as part of the curriculum. The discussion started as a conversation about whether we should require measure theoretic probability for our students. It evolved into a discussion of the value of abstract thinking (and whether measure theoretic probability was a good tool to measure abstract thinking).

Brian Caffo and I decided an interesting idea would be a point-counterpoint with the prompt, “How important is abstract thinking for the education of statistics graduate students?” Next week Brian and I will provide a point-counterpoint response based on our discussion.

In the meantime we’d love to hear your opinions!


A disappointing response from @NatureMagazine about folks with statistical skills

Last week I linked to an ad for a Data Editor position at Nature Magazine. I was super excited that Nature was recognizing data as an important growth area. But the ad doesn’t mention anything about statistical analysis skills; it focuses exclusively on data management expertise. As I pointed out in the earlier post, managing data is only half the equation - figuring out what to do with the data is the other half. The second half requires knowledge of statistics.

The folks over at Nature responded to our post on Twitter:

 it’s unrealistic to think this editor (or anyone) could do what you suggest. Curation & accessibility are key. ^ng

I disagree with this statement for the following reasons:

1. Is it really unrealistic to think someone could have data management and statistical expertise? Pick your favorite data scientist and you would have someone with those skills. Most students coming out of computer science, computational biology, bioinformatics, or statistical genomics programs would have a blend of those two skills in some proportion. 

But maybe the problem is this:

Applicants must have a PhD in the biological sciences

It is possible that there are few PhDs in the biological sciences who know both statistics and data management (although that is probably changing). But most computational biologists have a pretty good knowledge of biology and a very good knowledge of data - both managing and analyzing. If you are hiring a data editor, this might be the target audience. I’d replace “PhD in the biological sciences” in the ad with “knowledge of biology, statistics, data analysis, and data visualization.” There would be plenty of folks with those qualifications.

2. The response mentions curation, which is a critical issue. But good curation requires knowledge of two things: (i) the biological or scientific problem and (ii) how and in what way the data will be analyzed and used by researchers. As the Duke scandal made clear, a statistician with technological and biological knowledge running through a data analysis will identify many critical issues in data curation that would be missed by someone who doesn’t actually analyze data. 

3. The response says that “Curation and accessibility” are key. I agree that they are part of the key. It is critical that data can be properly accessed by researchers to perform new analyses, verify results in papers, and discover new results. But if the goal is to ensure the quality of science being published in Nature (the role of an editor) curation and accessibility are not enough. The editor should be able to evaluate statistical methods described in papers to identify potential flaws, or to rerun code and make sure that it performs the same/sensible analyses. A bad analysis that is reproducible will be discovered more quickly, but it is still a bad analysis. 

To be fair, I don’t think that Nature is the only organization that is missing the value of statistical skill in hiring data positions. It seems like many organizations are still just searching for folks who can handle/process the massive data sets being generated. But if they want to make accurate and informed decisions, statistical knowledge needs to be at the top of their list of qualifications.  


Sunday data/statistics link roundup (4/29)

  1. Nature genetics has an editorial on the Mayo and Myriad cases. I agree with this bit: “In our opinion, it is not new judgments or legislation that are needed but more innovation. In the era of whole-genome sequencing of highly variable genomes, it is increasingly hard to justify exclusive ownership of particularly useful parts of the genome, and method claims must be more carefully described.” Via Andrew J.
  2. One of Tech Review’s 10 emerging technologies from a February 2003 article? Data mining. I think doing interesting things with data has probably always been a hot topic, it just gets press in cycles. Via Aleks J. 
  3. An infographic in the New York Times compares the profits and taxes of Apple over time; here is an explanation of how they do it. (Via Tim O.)
  4. Saw this tweet via Joe B. I’m not sure if the frequentists or the Bayesians are winning, but it seems to me that the battle no longer matters to my generation of statisticians - there are too many data sets to analyze, better to just use what works!
  5. Statistical and computational algorithms that write news stories. Simply Statistics remains 100% human written (for now). 
  6. The 5 most critical statistical concepts. 

Interview with Drew Conway - Author of "Machine Learning for Hackers"

Drew Conway

Drew Conway is a Ph.D. student in Politics at New York University and the co-ordinator of the New York Open Statistical Programming Meetup. He is the creator of the famous (or infamous) data science Venn diagram, the basis for our R function to determine if you’re a data scientist. He is also the co-author of Machine Learning for Hackers, a book of case studies that illustrates data science from a hacker’s perspective.

Which term applies to you: data scientist, statistician, computer scientist, or something else?
Technically, my undergraduate degree is in computer science, so that term can be applied.  I was actually double-major in CS and political science, however, so it wouldn’t tell the whole story.  I have always been most interested in answering social science problems with the tools of computer science, math and statistics.
I have struggled a bit with the term “data scientist.”  About a year ago, when it seemed to be gaining a lot of popularity, I bristled at it.  Like many others, I complained that it was simply a corporate rebranding of other skills, and that the term “science” was appended to give some veil of legitimacy.  Since then, I have warmed to the term, but, as is often the case, only when I can define what data science is in my own terms.  Now, I do think of what I do as being data science, that is, the blending of technical skills and tools from computer science, with the methodological training of math and statistics, and my own substantive interest in questions about collective action and political ideology.
I think the term is very loaded, however, and when many people invoke it they often do so as a catch-all for talking about working with a certain set of tools: R, map-reduce, data visualization, etc.  I think this actually hurts the discipline a great deal, because if it is meant to actually be a science the majority of our focus should be on questions, not tools.

You are in the department of politics? How is it being a “data person” in a non-computational department?

Data has always been an integral part of the discipline, so in that sense many of my colleagues are data people.  I think the difference between my work and the work that many other political scientists do is simply a matter of where and how I get my data.
For example, a traditional political science experiment might involve a small set of undergraduates taking a survey or playing a simple game on a closed network.  That data would then be collected and analyzed as a controlled experiment.  Alternatively, I am currently running an experiment wherein my co-authors and I are attempting to code text documents (political party manifestos) with ideological scores (very liberal to very conservative).  To do this we have broken down the documents into small chunks of text and are having workers on Mechanical Turk code single chunks—rather than the whole document at once.  In this case the data scale up very quickly, but by aggregating the results we are able to have a very different kind of experiment with much richer data.
At the same time, I think political science, and perhaps the social sciences more generally, suffers from a tradition of undervaluing technical expertise. In that sense, it is difficult to convince colleagues that developing software tools is important.

Is that what inspired you to create the New York Open Statistical Meetup?

I actually didn’t create the New York Open Statistical Meetup (formerly the R meetup).  Joshua Reich was the original founder, back in 2008, and shortly after the first meeting we partnered and ran the Meetup together.  Once Josh became fully consumed by starting / running BankSimple I took it over by myself.  I think the best part about the Meetup is how it brings people together from a wide range of academic and industry backgrounds, and we can all talk to each other in a common language of computational programming.  The cross-pollination of ideas and talents is inspiring.
We are also very fortunate in that the community here is so strong, and that New York City is a well traveled place, so there is never a shortage of great speakers.

You created the data science Venn diagram. Where do you fall on the diagram?

Right at the center, of course! Actually, before I entered graduate school, which is long before I drew the Venn diagram, I fell squarely in the danger zone.  I had a lot of hacking skills, and my work (as an analyst in the U.S. intelligence community) afforded me a lot of substantive expertise, but I had little to no formal training in statistics.  If you could describe my journey through graduate school within the framework of the data science Venn diagram, it would be about me trying to pull myself out of the danger zone by gaining as much math and statistics knowledge as I can.  

I see that a lot of your software (including R packages) is on GitHub. Do you post them on CRAN as well? Do you think R developers will eventually move to GitHub from CRAN?

I am a big proponent of open source development, especially in the context of sharing data and analyses, and creating reproducible results.  I love GitHub because it creates a great environment for following the work of other coders, and participating in the development process.  For data analysis, it is also a great place to upload data and R scripts and allow the community to see how you did things and comment.  I also think, however, that there is a big opportunity for a new site, like GitHub, to be created that is more tailored for data analysis, and storing and disseminating data and visualizations.
I do post my R packages to CRAN, and I think that CRAN is one of the biggest strengths of the R language and community.  I think ideally more package developers would open their development process, on GitHub or some other social coding platform, and then push their well-vetted packages to CRAN.  This would allow for more people to participate, but maintain the great community resource that CRAN provides.

What inspired you to write “Machine Learning for Hackers”? Who was your target audience?

A little over a year ago John Myles White (my co-author) and I were having a lot of conversations with other members of the data community in New York City about what a data science curriculum would look like.  During these conversations people would always cite the classic texts: Elements of Statistical Learning, Pattern Recognition and Machine Learning, etc., which are excellent and deep treatments of the foundational theories of machine learning.  From these conversations it occurred to us that there was not a good text on machine learning for people who thought more algorithmically.  That is, there was not a text for “hackers,” people who enjoy learning about computation by opening up black-boxes and getting their hands dirty with code.
It was from this idea that the book, and eventually the title, were born.  We think the audience for the book is anyone who wants to get a relatively broad introduction to some of the basic tools of machine learning, and do so through code, not math.  This can be someone working at a company with data who wants to add some of these tools to their belt, or it can be an undergraduate in a computer science or statistics program who can relate to the material more easily through this presentation than the more theoretically heavy texts they’re probably already reading for class.

Sunday data/statistics link roundup (3/25)

  1. The psychologist whose experiment didn’t replicate, and who then went off on the scientists who did the replication experiment, is at it again. I don’t see a clear argument about the facts of the matter in his post, just more name calling. This seems to be a case study in what not to do when your study doesn’t replicate. More on “conceptual replication” in there too.
  2. Berkeley is running a data science course with instructors Jeff Hammerbacher and Mike Franklin; I looked through the notes and it looks pretty amazing. Stay tuned for more info about my applied statistics class which starts this week.
  3. A cool article about Factual, one of the companies whose sole mission in life is to collect and distribute data. We’ve linked to them before. We are so out ahead of the Times on this one…
  4. This isn’t statistics related, but I love this post about Jeff Bezos. If we all indulged our inner 11 year old a little more, it wouldn’t be a bad thing. 
  5. If you haven’t had a chance to read Reeves’ guest post on the Mayo Supreme Court decision yet, you should; it is really interesting. A fascinating intersection of law and statistics is going on in the personalized medicine world right now.

Cleveland's (?) 2001 plan for redefining statistics as "data science"

This plan has been making the rounds on Twitter and is being attributed to William Cleveland in 2001 (thanks to Kasper for the link). I’m not sure of the provenance of the document but it has some really interesting ideas and is worth reading in its entirety. I actually think that many Biostatistics departments follow the proposed distribution of effort pretty closely. 

One of the most interesting sections is the discussion of computing (emphasis mine): 

Data analysis projects today rely on databases, computer and network hardware, and computer and network software. A collection of models and methods for data analysis will be used only if the collection is implemented in a computing environment that makes the models and methods sufficiently efficient to use. In choosing competing models and methods, analysts will trade effectiveness for efficiency of use.


This suggests that statisticians should look to computing for knowledge today, just as data science looked to mathematics in the past.

I also found the theory section worth a read and figure it will definitely lead to some discussion: 

Mathematics is an important knowledge base for theory. It is far too important to take for granted by requiring the same body of mathematics for all. Students should study mathematics on an as-needed basis.


Not all theory is mathematical. In fact, the most fundamental theories of data science are distinctly nonmathematical. For example, the fundamentals of the Bayesian theory of inductive inference involve nonmathematical ideas about combining information from the data and information external to the data. Basic ideas are conveniently expressed by simple mathematical expressions, but mathematics is surely not at issue.