You think P-values are bad? I say show me the data.

Both the scientific community and the popular press are freaking out about reproducibility right now. I think they have good reason to: even the US Congress is now investigating the transparency of science. The concern has been driven by the very public reproducibility disasters in genomics and economics.

There are three major components to a reproducible and replicable study from a computational perspective: (1) the raw data from the experiment must be available, (2) the statistical code and documentation to reproduce the analysis must be available and (3) a correct data analysis must be performed.

There have been successes and failures in releasing all the data, but PLoS' policy on data availability and the AllTrials initiative hold some hope. The most progress has been made on making code and documentation available. Galaxy, knitr, and iPython make it easier than ever to distribute literate programs, and people are actually using them!

The trickiest part of reproducibility and replicability is ensuring that people perform a good data analysis. The first problem is that we actually don't know which statistical methods lead to higher reproducibility and replicability in users' hands. Articles like the one that just came out in the NYT suggest that using one type of method (Bayesian approaches) over another (p-values) will address the problem. But the real story is that those are still 100% philosophical arguments. We actually have very little good data on whether analysts will perform better analyses using one method or another. I agree with Roger in his tweet storm (quick, someone is wrong on the internet Roger, fix it!).

This is even more of a problem because the data deluge demands that almost all data analysis be performed by people with basic to intermediate statistics training at best. There is no way around this in the short term. There just aren't enough trained statisticians/data scientists to go around.  So we need to study statistics just like any other human behavior to figure out which methods work best in the hands of the people most likely to be using them.


Unbundling the educational package

I just got back from the World Economic Forum's summer meeting in Tianjin, China and there was much talk of disruption and innovation there. Basically, if you weren't disrupting, you were furniture. Perhaps not surprisingly, one topic area that was universally considered ripe for disruption was Education.

There are many ideas bandied about with respect to "disrupting" education and some are interesting to consider. MOOCs were the darlings of...last year...but they're old news now. Sam Lessin has a nice piece in The Information (total paywall, sorry, but it's worth it) about building a subscription model for universities. Aswath Damodaran has what I think is a nice framework for thinking about the "education business".

One thing that I latched on to in Damodaran's piece is the idea of education as a "bundled product". Indeed, I think the key aspect of traditional on-site university education is the simultaneous offering of

  1. Subject matter content (i.e. course material)
  2. Mentoring and guidance by faculty
  3. Social and professional networking
  4. Other activities (sports, arts ensembles, etc.)

MOOCs have attacked #1 for many subjects, typically large introductory courses. Endeavors like the Minerva project are attempting to provide lower-cost seminar-style courses (i.e. anti-MOOCs).

I think the extent to which universities will truly be disrupted will hinge on how well we can unbundle the four (or maybe more?) elements described above and provide them separately but at roughly the same level of quality. Is it possible? I don't know.


Applied Statisticians: people want to learn what we do. Let's teach them.

In this recent opinion piece, Hadley Wickham explains how data science goes beyond Statistics and that data science is not promoted in academia. He defines data science as follows:

I think there are three main steps in a data science project: you collect data (and questions), analyze it (using visualization and models), then communicate the results.

and makes the important point that

Any real data analysis involves data manipulation (sometimes called wrangling or munging), visualization and modelling.

The above describes what I have been doing since I became an academic applied statistician about 20 years ago. It describes what several of my colleagues do as well. For example, 15 years ago Karl Broman, in his excellent job talk, covered all the items in Hadley's definition. The arc of the talk revolved around the scientific problem and not the statistical models. He spent a considerable amount of time describing how the data was acquired and how he used perl scripts to clean up microsatellite data. More than half his slides contained visualizations, either illustrative cartoons or data plots. This research eventually led to his widely used "data product" R/qtl. Although not described in the talk, Karl used make to help keep the results reproducible.

So why then does Hadley think that "Statistics research focuses on data collection and modeling, and there is little work on developing good questions, thinking about the shape of data, communicating results or building data products"? I suspect one reason is that most applied work is published outside the flagship statistical journals. For example, Karl's work was published in the American Journal of Human Genetics. A second reason may be that most of us academic applied statisticians don't teach what we do. Despite writing a thesis that involved much data wrangling (reading music aiff files into Splus) and data visualization (including listening to fitted signals and residuals), the first few courses I taught as an assistant professor were almost solely on GLM theory.

About five years ago I tried changing the Methods course for our PhD students from one focusing on the math behind statistical methods to a problem- and data-driven course. This was not very successful as many of our students were interested in the mathematical aspects of statistics and did not like the open-ended assignments. Jeff Leek built on that class by incorporating question development, much more vague problem statements, data wrangling, and peer grading. He also found it challenging to teach the messier parts of applied statistics. It often requires exploration and failure, which can be frustrating for new students.

This story has a happy ending though. Last year Jeff created a data science Coursera course that enrolled over 180,000 students with 6,000+ completing. This year I am subbing for Joe Blitzstein (talk about filling in big shoes) in CS109: the Data Science undergraduate class Hanspeter Pfister and Joe created last year at Harvard. We have over 300 students registered, making it one of the largest classes on campus. I am not teaching them GLM theory.

So if you are an experienced applied statistician in academia, consider developing a data science class that teaches students what you do.

A non-comprehensive list of awesome female data people on Twitter

I was just talking to a student who mentioned she didn't know Jenny Bryan was on Twitter. She is and she is an awesome person to follow. I also realized that I hadn't seen a good list of women on Twitter who do stats/data. So I thought I'd make one. This list is what I could make in 15 minutes based on my own feed and will, with 100% certainty, miss really awesome people. Can you please add them in the comments and I'll update the list?

I have also been informed that these Twitter lists are probably better than my post. But I'll keep updating my list anyway cause I want to know who all the right people to follow are!

Why the three biggest positive contributions to reproducible research are the iPython Notebook, knitr, and Galaxy

There is a huge amount of interest in reproducible research and replication of results. Part of this is driven by some of the pretty major mistakes in reproducibility we have seen in economics and genomics. This has spurred discussion at a variety of levels, up to and including the United States Congress.

To solve this problem we need the appropriate infrastructure. I think developing infrastructure is a lot like playing the lottery, except that buying a ticket takes a lot more work. You pour a huge amount of effort into building good infrastructure. I think it helps if you build it for yourself, like Yihui did for knitr:

(also make sure you go read the blog post over at Data Science LA)

If lots of people adopt it, you are set for life. If they don't, you did all that work for nothing. So you have to applaud all the groups who have made efforts at building infrastructure for reproducible research.

I would contend that the largest positive contributions to reproducibility in sheer number of analyses made reproducible are:

  • The knitr R package (or more recently rmarkdown) for creating literate webpages and documents in R.
  • iPython notebooks for creating literate webpages and documents interactively in Python.
  • The Galaxy project for creating reproducible workflows (among other things) combining known tools.

There are similarities and differences between the different platforms but the one thing I think they all have in common is that they added either no or negligible effort to people's data analytic workflows.

knitr and iPython notebooks have primarily increased reproducibility among folks who have some scripting experience. I think a major reason they are so popular is that you just write code like you normally would, but embed it in a simple-to-use document. The workflow doesn't change much for the analyst because they were going to write that code anyway; the document just packages it into something more shareable.
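
As a concrete (hypothetical) illustration, here is roughly what a minimal R Markdown/knitr document might look like. The file name, title, and chunk are made up; knitting it (for example with rmarkdown::render("report.Rmd")) runs the ordinary R code inside and produces a shareable HTML report with the code, results, and figures inline.

````markdown
---
title: "A tiny reproducible report"
output: html_document
---

The chunk below is exactly the analysis code the analyst was going to write anyway.

```{r fit-and-plot}
# Fit and plot a simple model on a data set that ships with R
fit <- lm(mpg ~ wt, data = mtcars)
summary(fit)$coefficients
plot(mpg ~ wt, data = mtcars)
abline(fit)
```
````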

Galaxy has increased reproducibility for many folks, but my impression is that the primary user base is folks who have less experience scripting. They have worked hard to make it possible for these folks to analyze data they couldn't before in a reproducible way. But the reproducibility is incidental in some sense. The main reason users come is that they would have had to stitch those pipelines together anyway. Now they have an easier way to do it (lowering workload) and they get reproducibility as a bonus.

If I was in charge of picking the next round of infrastructure projects that are likely to impact reproducibility or science in a positive way, I would definitely look for projects that have certain properties.

  • For scripters and experts I would look for projects that interface with what people are already doing (most data analysis is in R or Python these days), require almost no extra work, and provide some benefit (reproducibility or otherwise). I would also look for things that are agnostic to which packages/approaches people are using.
  • For non-experts I would look for projects that enable people to build pipelines they weren't able to build before using already-standard tools, and that give them things like reproducibility for free.

Of course, I wouldn't put myself in charge anyway; I've never won the lottery with any infrastructure I've tried to build.


A (very) brief review of published human subjects research conducted with social media companies

As I wrote the other day, more and more human subjects research is being performed by large tech companies. The best way to handle the ethical issues raised by this research is still unclear. The first step is to get some idea of what has already been published from these organizations. So here is a brief review of the papers I know about where human subjects experiments have been conducted by companies. I'm only counting experiments here that have (a) been published in the literature and (b) involved experiments on users. I realized I could come up with surprisingly few.  I'd be interested to see more in the comments if people know about them.

Paper: Experimental evidence of massive-scale emotional contagion through social networks
Company: Facebook
What they did: Randomized people to receive different emotional content in their news feed and observed whether they showed an emotional reaction.
What they found: That there was almost no real effect on emotion. The effect was statistically significant but not scientifically or emotionally meaningful.
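
As a side note on that last point, here is a quick simulation (made-up numbers, nothing to do with the actual Facebook data) showing how, with hundreds of thousands of users, an effect that is practically negligible still comes out wildly statistically significant:

```r
# Made-up numbers: a huge sample size makes a tiny shift "significant"
set.seed(1)
n <- 350000                                   # a large per-group sample, the scale of such experiments
control   <- rnorm(n, mean = 5.00, sd = 2)    # e.g., % positive words per user
treatment <- rnorm(n, mean = 5.02, sd = 2)    # a 0.02 percentage-point shift
t.test(treatment, control)$p.value            # very small p-value
mean(treatment) - mean(control)               # but the estimated effect is ~0.02
```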

Paper: Social influence bias: a randomized experiment
Company: Not stated but sounds like Reddit
What they did: Randomly upvoted, downvoted, or left alone posts on the social networking site. Then they observed whether there was a difference in the overall rating of posts within each treatment.
What they found: Posts that were upvoted ended up with a final rating score (total upvotes - total downvotes) that was 25% higher.

Paper: Identifying influential and susceptible members of social networks 
Company: Facebook
What they did: Using a commercial Facebook app,  they found users who adopted a product and randomized sending messages to their friends about the use of the product. Then they measured whether their friends decided to adopt the product as well.
What they found: Many interesting things. For example: susceptibility to influence decreases with age, people over 31 are stronger influencers, women are less susceptible to influence than men, etc. etc.

 

Paper: Inferring causal impact using Bayesian structural time-series models
Company: Google
What they did: They developed methods for inferring the causal impact of an ad in a time series situation. They used data from an advertiser who showed ads to people related to keywords and measured how many visits there were to the advertiser's website through paid and organic (non-paid) clicks.
What they found: That the ads worked. But, more importantly, that they could predict the causal effect of the ad using their methods.
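
Google also open-sourced an R package, CausalImpact, implementing the approach in this paper. Here is a minimal sketch with simulated data, roughly following the package's own example; the numbers are made up, not a real ad campaign.

```r
# Simulated example: a control series x1, a response y that tracks it, and a
# bump of +10 in the "post-intervention" period whose effect we want to estimate.
library(CausalImpact)

set.seed(1)
x1 <- 100 + arima.sim(model = list(ar = 0.999), n = 100)
y  <- 1.2 * x1 + rnorm(100)
y[71:100] <- y[71:100] + 10
data <- cbind(y, x1)

pre.period  <- c(1, 70)    # before the (simulated) campaign
post.period <- c(71, 100)  # after the (simulated) campaign

impact <- CausalImpact(data, pre.period, post.period)
summary(impact)   # estimated absolute and relative causal effect
plot(impact)      # observed series vs. counterfactual prediction
```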

SwiftKey and Johns Hopkins partner for Data Science Specialization Capstone

I use SwiftKey on my Android phone all the time. So I was super pumped up when they agreed to partner with us on the first Capstone course for the Johns Hopkins Data Science  Specialization to run in October 2014. To enroll in the course you have to pass the other 9 courses in the Data Science Specialization.

The 9 courses have only been running for 4 months but already 200+ people have finished all 9! It has been unbelievable to see the response to the specialization and we are excited about taking it to the next level.

Around the world, people are spending an increasing amount of time on their mobile devices for email, social networking, banking and a whole range of other activities. But typing on mobile devices can be a serious pain. SwiftKey, our corporate partner in this capstone, builds a smart keyboard that makes it easier for people to type on their mobile devices. One cornerstone of their smart keyboard is predictive text models. When someone types:

I went to the

the keyboard presents three options for what the next word might be. For example, the three words might be gym, store, restaurant. In this capstone you will work on understanding and building predictive text models like those used by SwiftKey.

This course will start with the basics, analyzing a large corpus of text documents to discover the structure in the data and how words are put together. It will cover cleaning and analyzing text data, then building and sampling from a predictive text model. Finally, students will use the knowledge gained in our  Data Products course to build a predictive text product they can show off to their family, friends, and potential employers.
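
To give a flavor of what goes into such a model, here is a toy sketch of next-word prediction with a bigram model in base R. The mini-corpus and the predict_next() helper are made up for illustration; the capstone will involve a much larger corpus plus issues like smoothing, higher-order n-grams, and memory/speed tradeoffs.

```r
# Toy corpus (made up); a real model would be trained on a large text corpus
corpus <- c("I went to the gym", "I went to the store",
            "I went to the store", "we went to the restaurant")

# Tokenize: lower-case and split on whitespace
tokens <- unlist(strsplit(tolower(corpus), "\\s+"))

# Count bigrams (for simplicity this ignores sentence boundaries)
bigrams <- paste(head(tokens, -1), tail(tokens, -1))
counts  <- table(bigrams)

# Predict the most frequent word observed after a given word
predict_next <- function(word) {
  followers <- counts[grepl(paste0("^", word, " "), names(counts))]
  if (length(followers) == 0) return(NA_character_)
  sub(paste0("^", word, " "), "", names(which.max(followers)))
}

predict_next("the")   # "store" in this toy corpus
```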

We are really excited to work with SwiftKey to take our Specialization to the next level! Here is Roger's intro video for the course to get you fired up too.


Interview with COPSS Award winner Martin Wainwright

Editor's note: Martin Wainwright is the winner of the 2014 COPSS Award. This award is the most prestigious award in statistics, sometimes referred to as the Nobel Prize in Statistics. Martin received the award "for fundamental and groundbreaking contributions to high-dimensional statistics, graphical modeling, machine learning, optimization and algorithms, covering deep and elegant mathematical analysis as well as new methodology with wide-ranging implications for numerous applications." He kindly agreed to be interviewed by Simply Statistics.


SS: How did you find out you had received the COPSS prize?

It was pretty informal --- I received an email in February from
Raymond Carroll, who chaired the committee. But it had explicit
instructions to keep the information private until the award ceremony
in August.

SS: You are in Electrical Engineering & Computer Science (EECS) and
Statistics at Berkeley: why that mix of departments?

Just to give a little bit of history, I did my undergraduate degree in
math at the University of Waterloo in Canada, and then my Ph.D. in
EECS at MIT, before coming to Berkeley to work as a postdoc in
Statistics. So when it came time to look at faculty positions,
having a joint position between these two departments made a lot of
sense. Berkeley has always been at the forefront of having effective
joint appointments of the "Statistics plus X" variety, whether X is
EECS, Mathematics, Political Science, Computational Biology and so on.

For me personally, the EECS plus Statistics combination is terrific,
as a lot of my interests lie at the boundary between these two areas,
whether it is investigating tradeoffs between computational and
statistical efficiency, connections between information theory and
statistics, and so on. I hope that it is also good for my students!
In any case, whether they enter in EECS or Statistics, they graduate
with a strong background in both statistical theory and methods, as
well as optimization, algorithms and so on. I think that this kind of
mix is becoming increasingly relevant to the practice of modern
statistics, and one can certainly see that Berkeley consistently
produces students, whether from my own group or from other groups at
Berkeley, with this kind of hybrid background.

SS: What do you see as the relationship between statistics and machine
learning?

This is an interesting question, but tricky to answer, as it can
really depend on the person. In my own view, statistics is a very
broad and encompassing field, and in this context, machine learning
can be viewed as a particular subset of it, one especially focused on
algorithmic and computational aspects of statistics. But on the other
hand, as things stand, machine learning has rather different cultural
roots than statistics, certainly strongly influenced by computer
science. In general, I think that both groups have lessons to learn
from each other. For instance, in my opinion, anyone who wants to do
serious machine learning needs to have a solid background in
statistics. Statisticians have been thinking about data and
inferential issues for a very long time now, and these fundamental
issues remain just as important now, even though the application
domains and data types may be changing. On the other hand, in certain
ways, statistics is still a conservative field, perhaps not as quick
to move into new application domains, experiment with new methods and
so on, as people in machine learning do. So I think that
statisticians can benefit from the playful creativity and unorthodox
experimentation that one sees in some machine learning work, as well
as the algorithmic and programming expertise that is standard in
computer science.

SS: What sorts of things is your group working on these days?

I have fairly eclectic interests, so we are working on a range of
topics. A number of projects concern the interface between
computation and statistics. For instance, we have a recent pre-print
(with postdoc Sivaraman Balakrishnan and colleague Bin Yu) that tries
to address the gap between statistical and computational guarantees in
applications of the expectation-maximization (EM) algorithm for latent
variable models. In theory, we know that the global minimizer of the
(nonconvex) likelihood has good properties, but in practice, the
EM algorithm only returns local optima. How to resolve this gap
between existing theory and actual practice? In this paper, we show
that under pretty reasonable conditions---that hold for various types
of latent variable models---the EM fixed points are as good as the
global minima from the statistical perspective. This explains what is
observed a lot in practice, namely that when the EM algorithm is given
a reasonable initialization, it often returns a very good answer.
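
To make that point concrete, here is a toy sketch (ours, not Martin's) of EM for a two-component Gaussian mixture with known unit variances; from a reasonable starting point, the fixed point EM reaches essentially recovers the truth.

```r
# Toy illustration: EM for a two-component Gaussian mixture (unit variances known)
set.seed(2014)
x <- c(rnorm(500, mean = 0), rnorm(500, mean = 4))   # true means 0 and 4

em_two_gaussians <- function(x, mu, iters = 100) {
  pi1 <- 0.5
  for (i in seq_len(iters)) {
    # E-step: posterior probability each point came from component 1
    d1 <- pi1 * dnorm(x, mean = mu[1])
    d2 <- (1 - pi1) * dnorm(x, mean = mu[2])
    r  <- d1 / (d1 + d2)
    # M-step: update the mixing weight and the two means
    pi1 <- mean(r)
    mu  <- c(sum(r * x) / sum(r), sum((1 - r) * x) / sum(1 - r))
  }
  list(weight = pi1, means = mu)
}

em_two_gaussians(x, mu = c(-1, 5))   # recovers means near 0 and 4
```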

There are lots of other interesting questions at this
computation/statistics interface. For instance, a lot of modern data
sets (e.g., Netflix) are so large that they cannot be stored on a
single machine, but must be split up into separate pieces. Any
statistical task must then be carried out in a distributed way, with
each processor performing local operations on a subset of the data,
and then passing messages to other processors that summarize the
results of its local computations. This leads to a lot of fascinating
questions. What can be said about the statistical performance of such
distributed methods for estimation or inference? How many bits do the
machines need to exchange in order for the distributed performance to
match that of the centralized "oracle method" that has access to all
the data at once? We have addressed some of these questions in a
recent line of work (with student Yuchen Zhang, former student John
Duchi and colleague Michael Jordan).

So my students and postdocs are keeping me busy, and in addition, I am
also busy writing a couple of books, one jointly with Trevor Hastie
and Rob Tibshirani at Stanford University on the Lasso and related
methods, and a second solo-authored effort, more theoretical in focus,
on high-dimensional and non-asymptotic statistics.

SS: What role do you see statistics playing in the relationship
between Big Data and Privacy?

Another very topical question: privacy considerations are certainly
becoming more and more relevant as the scale and richness of data
collection grows. Witness the recent controversies with the NSA, data
manipulation on social media sites, etc. I think that statistics
should have a lot to say about data and privacy. There is a long
line of statistical work on privacy, dating back at least to Warner's
work on survey sampling in the 1960s, but I anticipate seeing more of
it over the coming years. Privacy constraints bring a lot of
interesting statistical questions---how to design experiments, how to
perform inference, how should data be aggregated and what should be
released and so on---and I think that statisticians should be at the
forefront of this discussion.

In fact, in some joint work with former student John Duchi and
colleague Michael Jordan, we have examined some tradeoffs between
privacy constraints and statistical utility. We adopt the framework
of local differential privacy that has been put forth in the computer
science community, and study how statistical utility (in the form of
estimation accuracy) varies as a function of the privacy level.
Obviously, preserving privacy means obscuring something, so that
estimation accuracy goes down, but what is the quantitative form of
this tradeoff? An interesting consequence of our analysis is that in
certain settings, it identifies optimal mechanisms for preserving a
certain level of privacy in data.

SS: What advice would you give young statisticians getting into the
discipline right now?

It is certainly an exciting time to be getting into the discipline.
For undergraduates thinking of going to graduate school in statistics,
I would encourage them to build a strong background in basic
mathematics (linear algebra, analysis, probability theory and so on),
all of which are important for a deep understanding of statistical methods
and theory. I would also suggest "getting their hands dirty", that is,
doing some applied work involving statistical modeling, data analysis
and so on. Even for a person who ultimately wants to do more
theoretical work, having some exposure to real-world problems is
essential. As part of this, I would suggest acquiring some knowledge
of algorithms, optimization, and so on, all of which are essential in
dealing with large, real-world data sets.


Crowdsourcing resources for the Johns Hopkins Data Science Specialization

Since we began offering the Johns Hopkins Data Science Specialization we've noticed the unbelievable passion that our students have about our courses and the generosity they show toward each other on the course forums. Many students have created quality content around the subjects we discuss, and many of these materials are so good we feel that they should be shared with all of our students. We also know there are tons of other great organizations creating material (looking at you Software Carpentry folks).

We're excited to announce that we've created a site using GitHub Pages: http://datasciencespecialization.github.io/ to serve as a directory for content that the community has created. If you've created materials relating to any of the courses in the Data Science Specialization please send us a pull request and we will add a link to your content on our site. You can find out more about contributing here: https://github.com/DataScienceSpecialization/DataScienceSpecialization.github.io#contributing

We can't wait to see what you've created and where the community can take this site!


swirl and the little data scientist's predicament

Editor's note: This is a repost of "R and the little data scientist's predicament". A brief idea for an update is presented at the end in italics. 

I just read this fascinating post on _why, apparently a bit of a cult hero among enthusiasts of the Ruby programming language. One of the most interesting bits was The Little Coder’s Predicament, which, boiled down, essentially says that computer programming languages have grown too complex, so children/newbies can't get instant gratification when they start programming. He suggested a simplified “gateway language” that would get kids fired up about programming, because with a simple line of code or two they could make the computer do things like play some music or make a video.

I feel like there is a similar ramp up with data scientists. To be able to do anything cool/inspiring with data you need to know (a) a little statistics, (b) a little bit about a programming language, and (c) quite a bit about syntax.

Wouldn’t it be cool if there was an R package that solved the little data scientist’s predicament? The package would have to have at least some of these properties:

  1. It would have to be easy to load data sets with one line of uncomplicated code. You could write an interface for RCurl/read.table/download.file for a defined set of APIs/data sets so the command would be something like: load(“education-data”) and it would load a bunch of data on education. It would handle all the messiness of scraping the web, formatting data, etc. in the background.
  2. It would have to have a lot of really easy visualization functions. Right now, if you want to make pretty plots with ggplot(), plot(), etc. in R, you need to know all the syntax for pch, cex, col, etc. The plotting function should handle all this behind the scenes and make super pretty pictures.
  3. It would be awesome if the functions would include some sort of dynamic graphics (with svgAnnotation or a wrapper for D3.js). Again, the syntax would have to be really accessible/not too much to learn.

That alone would be a huge start. In just 2 lines kids could load and visualize cool data in a pretty way they could show their parents/friends.
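
Something close to those two lines is already possible for a handful of data sets that ship with R; here is a rough stand-in (using UCBAdmissions in place of a curated “education-data” source):

```r
# A rough stand-in for the "two lines": a built-in data set plays the role of
# a curated "education-data" source, and one call makes a reasonably pretty plot.
data(UCBAdmissions)                       # admissions by gender and department
mosaicplot(UCBAdmissions, color = TRUE)   # one-line visualization
```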

Update: Now that Nick and co. have created swirl the technology is absolutely in place to have people do something awesome quickly. You could imagine taking the airplane data and immediately having them make a plot of all the flights using ggplot. Or any number of awesome government data sets and going straight to ggvis. Solving this problem is now no longer technically a challenge, it is just a matter of someone coming up with an amazing swirl module that immediately sucks students in. This would be a really awesome project for a grad student or even an undergrad with an interest in teaching. If you do do it, you should absolutely send it our way and we'll advertise the heck out of it!
