Simply Statistics


Interview with COPSS Award winner Martin Wainwright

Tweet about this on TwitterShare on FacebookShare on Google+Share on LinkedInEmail this to someone

Editor's note: Martin Wainwright is the winner of the 2014 COPSS Award. This award is the most prestigious award in statistics, sometimes refereed to as the Nobel Prize in Statistics. Martin received the award for: " For fundamental and groundbreaking contributions to high-dimensional statistics, graphical modeling, machine learning, optimization and algorithms, covering deep and elegant mathematical analysis as well as new methodology with wide-ranging implications for numerous applications." He kindly agreed to be interviewed by Simply Statistics. 


SS: How did you find out you had received the COPSS prize?

It was pretty informal --- I received an email in February from
Raymond Carroll, who chaired the committee. But it had explicit
instructions to keep the information private until the award ceremony
in August.

SS: You are in Electrical Engineering & Computer Science (EECS) and
Statistics at Berkeley: why that mix of departments?

Just to give a little bit of history, I did my undergraduate degree in
math at the University of Waterloo in Canada, and then my Ph.D. in
EECS at MIT, before coming to Berkeley to work as a postdoc in
Statistics. So when it came time to looking at faculty positions,
having a joint position between these two departments made a lot of
sense. Berkeley has always been at the forefront of having effective
joint appointments of the "Statistics plus X" variety, whether X is
EECS, Mathematics, Political Science, Computational Biology and so on.

For me personally, the EECS plus Statistics combination is terrific,
as a lot of my interests lie at the boundary between these two areas,
whether it is investigating tradeoffs between computational and
statistical efficiency, connections between information theory and
statistics, and so on. I hope that it is also good for my students!
In any case, whether they enter in EECS or Statistics, they graduate
with a strong background in both statistical theory and methods, as
well as optimization, algorithms and so on. I think that this kind of
mix is becoming increasingly relevant to the practice of modern
statistics, and one can certainly see that Berkeley consistently
produces students, whether from my own group or other people at
Berkeley, with this kind of hybrid background.
SS: What do you see as the relationship between statistics and machine

This is an interesting question, but tricky to answer, as it can
really depend on the person. In my own view, statistics is a very
broad and encompassing field, and in this context, machine learning
can be viewed as a particular subset of it, one especially focused on
algorithmic and computational aspects of statistics. But on the other
hand, as things stand, machine learning has rather different cultural
roots than statistics, certainly strongly influenced by computer
science. In general, I think that both groups have lessons to learn
from each other. For instance, in my opinion, anyone who wants to do
serious machine learning needs to have a solid background in
statistics. Statisticians have been thinking about data and
inferential issues for a very long time now, and these fundamental
issues remain just as important now, even though the application
domains and data types may be changing. On the other hand, in certain
ways, statistics is still a conservative field, perhaps not as quick
to move into new application domains, experiment with new methods and
so on, as people in machine learning do. So I think that
statisticians can benefit from the playful creativity and unorthodox
experimentation that one sees in some machine learning work, as well
as the algorithmic and programming expertise that is standard in
computer science.

SS: What sorts of things is your group working on these days?

I have fairly eclectic interests, so we are working on a range of
topics. A number of projects concern the interface between
computation and statistics. For instance, we have a recent pre-print
(with postdoc Sivaraman Balakrishnan and colleague Bin Yu) that tries
to address the gap between statistical and computational guarantees in
applications of the expectation-maximization (EM) algorithm for latent
variable models. In theory, we know that the global minimizer of the
(nonconvex) likelihood has good properties, but the in practice, the
EM algorithm only returns local optima. How to resolve this gap
between existing theory and actual practice? In this paper, we show
that under pretty reasonable conditions---that hold for various types
of latent variable models---the EM fixed points are as good as the
global minima from the statistical perspective. This explains what is
observed a lot in practice, namely that when the EM algorithm is given
a reasonable initialization, it often returns a very good answer.

There are lots of other interesting questions at this
computation/statistics interface. For instance, a lot of modern data
sets (e.g., Netflix) are so large that they cannot be stored on a
single machine, but must be split up into separate pieces. Any
statistical task must then be carried out in a distributed way, with
each processor performing local operations on a subset of the data,
and then passing messages to other processors that summarize the
results of its local computations. This leads to a lot of fascinating
questions. What can be said about the statistical performance of such
distributed methods for estimation or inference? How many bits do the
machines need to exchange in order for the distributed performance to
match that of the centralized "oracle method" that has access to all
the data at once? We have addressed some of these questions in a
recent line of work (with student Yuchen Zhang, former student John
Duchi and colleague Micheel Jordan).

So my students and postdocs are keeping me busy, and in addition, I am
also busy writing a couple of books, one jointly with Trevor Hastie
and Rob Tibshirani at Stanford University on the Lasso and related
methods, and a second solo-authored effort, more theoretical in focus,
on high-dimensional and non-asymptotic statistics.
SS: What role do you see statistics playing in the relationship
between Big Data and Privacy?

Another very topical question: privacy considerations are certainly
becoming more and more relevant as the scale and richness of data
collection grows. Witness the recent controversies with the NSA, data
manipulation on social media sites, etc. I think that statistics
should have a lot to say about data and privacy. There has a long
line of statistical work on privacy, dating back at least to Warner's
work on survey sampling in the 1960s, but I anticipate seeing more of
it over the next years. Privacy constraints bring a lot of
interesting statistical questions---how to design experiments, how to
perform inference, how should data be aggregated and what should be
released and so on---and I think that statisticians should be at the
forefront of this discussion.

In fact, in some joint work with former student John Duchi and
colleague Michael Jordan, we have examined some tradeoffs between
privacy constraints and statistical utility. We adopt the framework
of local differential privacy that has been put forth in the computer
science community, and study how statistical utility (in the form of
estimation accuracy) varies as a function of the privacy level.
Obviously, preserving privacy means obscuring something, so that
estimation accuracy goes down, but what is the quantitative form of
this tradeoff? An interesting consequence of our analysis is that in
certain settings, it identifies optimal mechanisms for preserving a
certain level of privacy in data.

What advice would you give young statisticians getting into the
discipline right now?

It is certainly an exciting time to be getting into the discipline.
For undergraduates thinking of going to graduate school in statistics,
I would encourage them to build a strong background in basic
mathematics (linear algebra, analysis, probability theory and so on)
that are all important for a deep understanding of statistical methods
and theory. I would also suggest "getting their hands dirty", that is
doing some applied work involving statistical modeling, data analysis
and so on. Even for a person who ultimately wants to do more
theoretical work, having some exposure to real-world problems is
essential. As part of this, I would suggest acquiring some knowledge
of algorithms, optimization, and so on, all of which are essential in
dealing with large, real-world data sets.


Crowdsourcing resources for the Johns Hopkins Data Science Specialization

Tweet about this on TwitterShare on FacebookShare on Google+Share on LinkedInEmail this to someone

Since we began offering the Johns Hopkins Data Science Specialization we've noticed the unbelievable passion that our students have about our courses and the generosity they show toward each other on the course forums. Many students have created quality content around the subjects we discuss, and many of these materials are so good we feel that they should be shared with all of our students. We also know there are tons of other great organizations creating material (looking at you Software Carpentry folks).

We're excited to announce that we've created a site using GitHub Pages: to serve as a directory for content that the community has created. If you've created materials relating to any of the courses in the Data Science Specialization please send us a pull request and we will add a link to your content on our site. You can find out more about contributing here:

We can't wait to see what you've created and where the community can take this site!


swirl and the little data scientist's predicament

Tweet about this on TwitterShare on FacebookShare on Google+Share on LinkedInEmail this to someone

Editor's note: This is a repost of "R and the little data scientist's predicament". A brief idea for an update is presented at the end in italics. 

I just read this fascinating post on _why, apparently a bit of a cult hero among enthusiasts of the Ruby programming language. One of the most interesting bits was The Little Coder’s Predicament, which boiled down essentially says that computer programming languages have grown too complex - so children/newbies can’t get the instant gratification when they start programming. He suggested a simplified “gateway language” that would get kids fired up about programming, because with a simple line of code or two they could make the computer do things like play some music or make a video.

I feel like there is a similar ramp up with data scientists. To be able to do anything cool/inspiring with data you need to know (a) a little statistics, (b) a little bit about a programming language, and (c) quite a bit about syntax.

Wouldn’t it be cool if there was an R package that solved the little data scientist’s predicament? The package would have to have at least some of these properties:

  1. It would have to be easy to load data sets, one line of not complicated code. You could write an interface for RCurl/read.table/download.file for a defined set of APIs/data sets so the command would be something like: load(“education-data”) and it would load a bunch of data on education. It would handle all the messiness of scraping the web, formatting data, etc. in the background.
  2. It would have to have a lot of really easy visualization functions. Right now, if you want to make pretty plots with ggplot(), plot(), etc. in R, you need to know all the syntax for pch, cex, col, etc. The plotting function should handle all this behind the scenes and make super pretty pictures.
  3. It would be awesome if the functions would include some sort of dynamic graphics (withsvgAnnotation or a wrapper for D3.js). Again, the syntax would have to be really accessible/not too much to learn.

That alone would be a huge start. In just 2 lines kids could load and visualize cool data in a pretty way they could show their parents/friends.

Update: Now that Nick and co. have created swirl the technology is absolutely in place to have people do something awesome quickly. You could imagine taking the airplane data and immediately having them make a plot of all the flights using ggplot. Or any number of awesome government data sets and going straight to ggvis. Solving this problem is now no longer technically a challenge, it is just a matter of someone coming up with an amazing swirl module that immediately sucks students in. This would be a really awesome project for a grad student or even an undergrad with an interest in teaching. If you do do it, you should absolutely send it our way and we'll advertise the heck out of it!


The Leek group guide to giving talks

Tweet about this on TwitterShare on FacebookShare on Google+Share on LinkedInEmail this to someone

I wrote a little guide to giving talks that goes along with my data sharing , R packages, and reviewing guides. I posted it to Github and would be really happy to take any feedback/pull requests that folks might have. If you send a pull request please be sure to add yourself to the contributor list.


Stop saying "Scientists discover..." instead say, "Prof. Doe's team discovers..."

Tweet about this on TwitterShare on FacebookShare on Google+Share on LinkedInEmail this to someone

I was just reading an article about data science in the WSJ. They were talking about how data scientists with just 2 years experience can earn a whole boatload of money*. I noticed a description that seemed very familiar:

At e-commerce site operator Etsy Inc., for instance, a biostatistics Ph.D. who spent years mining medical records for early signs of breast cancer now writes statistical models to figure out the terms people use when they search Etsy for a new fashion they saw on the street.

This perfectly describes the resume of a student that worked with me here at Hopkins and is now tearing it up in industry. But it made me a little bit angry that they didn't publicize her name. Now she may have requested her name not be used, but I think it is more likely that it is a case of the standard, "Scientists discover..." (see e.g. this article or this one or this one).

There is always a lot of discussion about how to push people to get into STEM fields, including a ton of misguided attempts that waste time and money. But here is one way that would cost basically nothing and dramatically raise the profile of scientists in the eyes of the public: use their names when you describe their discoveries.

The value of this simple change could be huge. In an era of selfies, reality TV, and the power of social media, emphasizing the value that individual scientists bring could have a huge impact on STEM recruiting. That paragraph above is a lot more inspiring to potential young data scientists when rewritten:

At e-commerce site operator Etsy Inc., for instance, Dr Hilary Parker,  a biostatistics Ph.D. who spent years mining medical records for early signs of breast cancer now writes statistical models to figure out the terms people use when they search Etsy for a new fashion they saw on the street.





Incidentally, I think it is a bit overhyped. I have rarely heard of anyone making $200k-$300k with that little experience, but maybe I'm wrong? I'd be interested to hear if people really were making that kind of $$ at that stage in their careers. 


It's like Tinder, but for peer review.

Tweet about this on TwitterShare on FacebookShare on Google+Share on LinkedInEmail this to someone

I have an idea for an app. You input the title and authors of a preprint (maybe even the abstract). The app shows the title/authors/abstract to people who work in a similar area to you. You could estimate this based on papers they have published that have similar key words to start.

Then you swipe left if you think the paper is interesting and right if you think it isn't. We could then aggregate the data on how many "likes" a paper gets as a measure of how "interesting" it is. I wonder if this would be a better measure of later citations/interestingness than the opinion of a small number of editors and referees.

This is obviously taking my proposal of a fast statistics journal to the extreme and would provide no measure of how scientifically sound the paper was. But in an age when scientific soundness is only one part of the equation for top journals, a measure of interestingness that was available before review could be of huge value to journals.

If done properly, it would encourage people to publish preprints. If you posted a preprint and it was immediately "interesting" to many scientists, you could use that to convince editors to get past that stage and consider your science. More things like this could happen:

So anyone want to build it?


If you like A/B testing here are some other Biostatistics ideas you may like

Tweet about this on TwitterShare on FacebookShare on Google+Share on LinkedInEmail this to someone

Web companies are using A/B testing and experimentation regularly now to determine which features to push for advertising or improving user experience. A/B testing is a form of randomized controlled trial that was originally employed in psychology but first adopted on a massive scale in Biostatistics. Since then a large amount of work on trials and trial design has been performed in the Biostatistics community. Some of these ideas may be useful in the same context within web companies, probably a lot of them are already being used and I just haven't seen published examples. Here are some examples:

  1. Sequential study designs. Here the sample size isn't fixed in advance (an issue that I imagine is pretty hard to do with web experiments) but as the experiment goes on, the data are evaluated and a stopping rule that controls appropriate error rates is used. Here are a couple of  good (if a bit dated) review on sequential designs [1] [2].
  2. Adaptive study designs. These are study designs that use covariates or responses to adapt the treatment assignments of people over time. With careful design and analysis choices, you can still control the relevant error rates. Here are a couple of reviews on adaptive trial designs [1] [2]
  3. Noninferiority trials These are trials designed to show that one treatment is at least as good as the standard of care. They are often implemented when a good placebo group is not available, often for ethical reasons. In light of the ethical concerns for human subjects research at tech companies  this could be a useful trial design. Here is a systematic review for noninferiority trials [1]

It is also probably useful to read about proportional hazards models and time varying coefficients. Obviously these are just a few ideas that might be useful, but talking to a Biostatistician who works on clinical trials (not me!) would be a great way to get more information.


Do we need institutional review boards for human subjects research conducted by big web companies?

Tweet about this on TwitterShare on FacebookShare on Google+Share on LinkedInEmail this to someone

Web companies have been doing human subjects research for a while now. Companies like Facebook and Google have employed statisticians for almost a decade (or more) and part of the culture they have introduced is the idea of randomized experiments to identify ideas that work and that don't. They have figured out that experimentation and statistical analysis often beat out the opinion of the highest paid person at the company for identifying features that "work". Here "work" may mean features that cause people to read advertising, or click on ads, or match up with more people.

This has created a huge amount of value and definitely a big interest in the statistical community. For example, today's session on "Statistics: The Secret Weapon of Successful Web Giants" was standing room only.

But at the same time, these experiments have raised some issues. Recently scientists from Cornell and Facebook published a study where they experimented with the news feeds of users. This turned into a PR problem for Facebook and Cornell because people were pretty upset they were being experimented on and weren't being told about it. This has led defenders of the study to say: (a) Facebook is doing the experiments anyway, they just published it this time, (b) in this case very little harm was done, (c) most experiments done by Facebook are designed to increase profitability, at least this experiment had a more public good focused approach, and (d) there was a small effect size so what's the big deal?

OK Cupid then published a very timely blog postwith the title, "We experiment on human beings!", probably at least in part to take advantage of the press around the Facebook experiment. This post was received with less vitriol than the Facebook study, but really drove home the point that large web companies perform as much human subjects research as most universities and with little or no oversight. 

The same situation was the way academic research used to work. Scientists used their common sense and their scientific sense to decide on what experiments to run.  Most of the time this worked fine, but then things like the Tuskegee Syphillis Study happened. These really unethical experiments led to the National Research Act of 1974 which codified rules about institutional review boards to oversee research conducted on human subjects, to guarantee their protection. The IRBs are designed to consider the ethical issues involved with performing research on humans to balance protection of rights with advancing science.

Facebook, OK Cupid, and other companies are not subject to IRB approval. Yet they are performing more and more human subjects experiments. Obviously the studies described in the Facebook paper and the OK Cupid post pale in comparison to the Tuskegee study. I also know scientists at these companies and know they are ethical and really trying to do the right thing. But it raises interesting questions about oversight. Given the emotional, professional, and economic value that these websites control for individuals around the globe, it may be time to discuss whether it is time to consider the equivalent of "institutional review boards" for human subjects research conducted by companies.

Companies who test drugs on humans such as Merck are subject to careful oversight and regulation to prevent potential harm to patients during the discovery process. This is obviously not the optimal solution for speed - understandably a major advantage and goal of tech companies. But there are issues that deserve serious consideration. For example, I think it is no where near sufficient to claim that by signing the terms of service that people have given informed consent to be part of an experiment. That being said, they could just stop using Facebook if they don't like that they are being experimented on.

Our reliance on these tools for all aspects of our lives means that it isn't easy to just tell people, "Well if you don't like being experimented on, don't use that tool." You would have to give up at minimum Google, Gmail, Facebook, Twitter, and Instagram to avoid being experimented on. But you'd also have to give up using smaller sites like OK Cupid, because almost all web companies are recognizing the importance of statistics. One good place to start might be in considering new and flexible forms of consent that make it possible to opt in and out of studies in an informed way, but with enough speed and flexibility not to slowing down the innovation in tech companies.



Introducing people to R: 14 years and counting

Tweet about this on TwitterShare on FacebookShare on Google+Share on LinkedInEmail this to someone

I've been introducing people to R for quite a long time now and I've been doing some reflecting today on how that process has changed quite a bit over time. I first started using R around 1998--1999 I think I first started talking about R informally to my fellow classmates (and some faculty) back when I was in graduate school at UCLA. There, the department was officially using Lisp-Stat (which I loved) and only later converted its courses over to R. Through various brown-bag lunches and seminars I would talk about R, and the main selling point at the time was "It's just like S-PLUS but it's free!" As it turns out, S-PLUS was basically abandoned by academics and its ownership changed hands a number of times over the years (it is currently owned by TIBCO). I still talk about S-PLUS when I talk about the history of R but I'm not sure many people nowadays actually have any memories of the product.

When I got to Johns Hopkins in 2003 there wasn't really much of a modern statistical computing class, so Karl Broman, Rafa Irizarry, Brian Caffo, Ingo Ruczinski, and I got together and started what we called the "KRRIB" class, which was basically a weekly seminar where one of us talked about a computing topic of interest. I gave some of the R lectures in that class and when I asked people who had heard of R before, almost no one raised their hand. And no one had actually used it before. My approach was pretty much the same at the time, although I left out the part about S-PLUS because no one had used that either. A lot of people had experience with SAS or Stata or SPSS. A number of people had used something like Java or C/C++ before and so I often used that a reference frame. No one had ever used a functional-style of programming language like Scheme or Lisp.

Over time, the population of students (mostly first-year graduate students) slowly shifted to the point where many of them had been introduced to R while they were undergraduates. This trend mirrored the overall trend with statistics where we are seeing more and more students do undergraduate majors in statistics (as opposed to, say, mathematics). Eventually, by 2008--2009, when I'd ask how many people had heard of or used R before, everyone raised their hand. However, even at that late date, I still felt the need to convince people that R was a "real" language that could be used for real tasks.

R has grown a lot in recent years, and is being used in so many places now, that I think its essentially impossible for a person to keep track of everything that is going on. That's fine, but it makes "introducing" people to R an interesting experience. Nowadays in class, students are often teaching me something new about R that I've never seen or heard of before (they are quite good at Googling around for themselves). I feel no need to "bring people over" to R. In fact it's quite the opposite--people might start asking questions if I weren't teaching R.

Even though my approach to introducing R has evolved over time, with the topics that I emphasize or de-emphasize changing, I've found there are a few topics that I always  stress to people who are generally newcomers to R. For whatever reason, these topics are always new or at least a little unfamiliar.

  • R is a functional-style language. Back when most people primarily saw something like C as a first programming language, it made sense to me that the functional style of programming would seem strange. I came to R from Lisp-Stat so the functional aspect was pretty natural for me. But many people seem to get tripped up over the idea of passing a function as an argument or not being able to modify the state of an object in place. Also, it sometimes takes people a while to get used to doing things like lapply() and map-reduce types of operations. Everyone still wants to write a for loop!
  • R is both an interactive system and a programming language. Yes, it's a floor wax and a dessert topping--get used to it. Most people seem expect one or the other. SAS users are wondering why you need to write 10 lines of code to do what SAS can do in one massive PROC statement. C programmers are wondering why you don't write more for loops. C++ programmers are confused by the weird system for object orientation. In summary, no one is ever happy.
  • Visualization/plotting capabilities are state-of-the-art. One of the big selling points back in the "old days" was that from the very beginning R's plotting and graphics capabilities where far more elegant than the ASCII-art that was being produced by other statistical packages (true for S-PLUS too). I find it a bit strange that this point has largely remained true. While other statistical packages have definitely improved their output (and R certainly has some areas where it is perhaps deficient), R still holds its own quite handily against those other packages. If the community can continue to produce things like ggplot2 and rgl, I think R will remain at the forefront of data visualization.

I'm looking forward to teaching R to people as long as people will let me, and I'm interested to see how the next generation of students will approach it (and how my approach to them will change). Overall, it's been just an amazing experience to see the widespread adoption of R over the past decade. I'm sure the next decade will be just as amazing.


Academic statisticians: there is no shame in developing statistical solutions that solve just one problem

Tweet about this on TwitterShare on FacebookShare on Google+Share on LinkedInEmail this to someone

I think that the main distinction between academic statisticians and those calling themselves data scientists is that the latter are very much willing to invest most of their time and energy into solving specific problems by analyzing specific data sets. In contrast, most academic statisticians strive to develop methods that can be very generally applied across problems and data types. There is a reason for this of course:  historically statisticians have had enormous influence by developing general theory/methods/concepts such as the p-value, maximum likelihood estimation, and linear regression. However, these types of success stories are becoming more and more rare while data scientists are becoming increasingly influential in their respective areas of applications by solving important context-specific problems. The success of Money Ball and the prediction of election results are two recent widely publicized examples.

A survey of papers published in our flagship journals make it quite clear that context-agnostic methodology are valued much more than detailed descriptions of successful solutions to specific problems. These applied papers tend to get published in subject matter journals and do not usually receive the same weight in appointments and promotions. This culture has therefore kept most statisticians holding academic position away from collaborations that require substantial time and energy investments in understanding and attacking the specifics of the problem at hand. Below I argue that to remain relevant as a discipline we need a cultural shift.

It is of course understandable that to remain a discipline academic statisticians can’t devote all our effort to solving specific problems and none to trying to the generalize these solutions. It is the development of these abstractions that defines us as an academic discipline and not just a profession. However, if our involvement with real problems is too superficial, we run the risk of developing methods that solve no problem at all which will eventually render us obsolete. We need to accept that as data and problems become more complex, more time will have to be devoted to understanding the gory details.

But what should the balance be?

Note that many of the giants of our discipline were very much interested in solving specific problems in genetics, agriculture, and the social sciences. In fact, many of today’s most widely-applied methods were originally inspired by insights gained by answering very specific scientific questions. I worry that the balance between application and theory has shifted too far away from applications. An unfortunate consequence is that our flagship journals, including our applied journals, are publishing too many methods seeking to solve many problems but actually solving none.  By shifting some of our efforts to solving specific problems we will get closer to the essence of modern problems and will actually inspire more successful generalizable methods.