Simply Statistics

15 Sep

Applied Statisticians: people want to learn what we do. Let's teach them.


In this recent opinion piece, Hadley Wickham explains how data science goes beyond statistics and how it is not promoted in academia. He defines data science as follows:

I think there are three main steps in a data science project: you collect data (and questions), analyze it (using visualization and models), then communicate the results.

and makes the important point that

Any real data analysis involves data manipulation (sometimes called wrangling or munging), visualization and modelling.

The above describes what I have been doing since I became an academic applied statistician about 20 years ago. It describes what several of my colleagues do as well. For example, 15 years ago Karl Broman, in his excellent job talk, covered all the items in Hadley's definition. The arc of the talk revolved around the scientific problem and not the statistical models. He spent a considerable amount of time describing how the data was acquired and how he used perl scripts to clean up microsatellite data. More than half his slides contained visualizations, either illustrative cartoons or data plots. This research eventually led to his widely used "data product" R/qtl. Although not described in the talk, Karl used make to help make the results reproducible.

So why then does Hadley think that "Statistics research focuses on data collection and modeling, and there is little work on developing good questions, thinking about the shape of data, communicating results or building data products"?  I suspect one reason is that most applied work is published outside the flagship statistical journals. For example, Karl's work was published in the American Journal of Human Genetics. A second reason may be that most of us academic applied statisticians don't teach what we do. Despite writing a thesis that involved much data wrangling (reading music aiff files into S-Plus) and data visualization (including listening to fitted signals and residuals), the first few courses I taught as an assistant professor were almost solely on GLM theory.

About five years ago I tried changing the Methods course for our PhD students from one focusing on the math behind statistical methods to a problem and data-driven course. This was not very successful, as many of our students were interested in the mathematical aspects of statistics and did not like the open-ended assignments. Jeff Leek built on that class by incorporating question development, much more vague problem statements, data wrangling, and peer grading. He also found it challenging to teach the more messy parts of applied statistics. It often requires exploration and failure, which can be frustrating for new students.

This story has a happy ending though. Last year Jeff created a data science Coursera course that enrolled over 180,000 students with 6,000+ completing. This year I am subbing for Joe Blitzstein (talk about filling in big shoes) in CS109: the Data Science undergraduate class Hanspeter Pfister and Joe created last year at Harvard. We have over 300 students registered, making it one of the largest classes on campus. I am not teaching them GLM theory.

So if you are an experienced applied statistician in academia, consider developing a data science class that teaches students what you do.


09 Sep

A non-comprehensive list of awesome female data people on Twitter


I was just talking to a student who mentioned she didn't know Jenny Bryan was on Twitter. She is and she is an awesome person to follow. I also realized that I hadn't seen a good list of women on Twitter who do stats/data. So I thought I'd make one. This list is what I could make in 15 minutes based on my own feed and will, with 100% certainty, miss really awesome people. Can you please add them in the comments and I'll update the list?

I have also been informed that these Twitter lists are probably better than my post. But I'll keep updating my list anyway cause I want to know who all the right people to follow are!


04 Sep

Why the three biggest positive contributions to reproducible research are the iPython Notebook, knitr, and Galaxy


There is a huge amount of interest in reproducible research and replication of results. Part of this is driven by some of the pretty major mistakes in reproducibility we have seen in economics and genomics. This has spurred discussion at a variety of levels including at the level of the United States Congress.

To solve this problem we need the appropriate infrastructure. I think developing infrastructure is a lot like playing the lottery, except that buying a ticket requires a lot more work. You pour a huge amount of effort into building good infrastructure. I think it helps if you build it for yourself, like Yihui did for knitr:

(also make sure you go read the blog post over at Data Science LA)

If lots of people adopt it, you are set for life. If they don't, you did all that work for nothing. So you have to applaud all the groups who have made efforts at building infrastructure for reproducible research.

I would contend that the largest positive contributions to reproducibility in sheer number of analyses made reproducible are:

  • The knitr R package (or more recently rmarkdown) for creating literate webpages and documents in R.
  • iPython notebooks for creating literate webpages and documents interactively in Python.
  • The Galaxy project for creating reproducible workflows (among other things) combining known tools.

There are similarities and differences between the different platforms but the one thing I think they all have in common is that they added either no or negligible effort to people's data analytic workflows.

knitr and iPython notebooks have primarily increased reproducibility among folks who have some scripting experience. I think a major reason they are so popular is because you just write code like you normally would, but embed it in a simple-to-use document. The workflow doesn't change much for the analyst because they were going to write that code anyway. The document format just allows that code to be built into something more shareable.
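As a rough illustration (a minimal sketch, not from the original post, using the built-in mtcars data set), an R Markdown source file is just the analysis code you would have written anyway, dropped into chunks between short bits of prose, and knitr/rmarkdown renders it into a shareable report:

---
title: "A quick look at mtcars"
output: html_document
---

Heavier cars tend to get worse gas mileage:

```{r model}
# Fit and summarize a simple linear model, exactly as you would at the console
fit <- lm(mpg ~ wt, data = mtcars)
summary(fit)$coefficients
```

```{r scatterplot}
# The rendered document will include this figure automatically
plot(mpg ~ wt, data = mtcars)
abline(fit)
```

Running rmarkdown::render() (or knitr::knit()) on a file like this produces a single page with the code, the results, and the figure all together.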

Galaxy has increased reproducibility for many folks, but my impression is that the primary user base is folks who have less experience scripting. They have worked hard to make it possible for these folks to analyze data they couldn't before in a reproducible way. But the reproducibility is incidental in some sense. The main reason users come is that they would have had to stitch those pipelines together anyway. Now they have an easier way to do it (lowering workload) and they get reproducibility as a bonus.

If I were in charge of picking the next round of infrastructure projects that are likely to impact reproducibility or science in a positive way, I would definitely look for projects that have certain properties.

  • For scripters and experts I would look for projects that interface with what people are already doing (most data analysis is in R or Python these days), require almost no extra work, and provide some benefit (reproducibility or otherwise). I would also look for things that are agnostic to which packages/approaches people are using.
  • For non-experts I would look for projects that enable people to build pipelines they weren't able to before using already-standard tools and give them things like reproducibility for free.

Of course I wouldn't put me in charge anyway; I've never won the lottery with any infrastructure I've tried to build.

20 Aug

A (very) brief review of published human subjects research conducted with social media companies


As I wrote the other day, more and more human subjects research is being performed by large tech companies. The best way to handle the ethical issues raised by this research is still unclear. The first step is to get some idea of what has already been published from these organizations. So here is a brief review of the papers I know about where human subjects experiments have been conducted by companies. I'm only counting experiments here that have (a) been published in the literature and (b) involved experiments on users. I realized I could come up with surprisingly few.  I'd be interested to see more in the comments if people know about them.

Paper: Experimental evidence of massive-scale emotional contagion through social networks
Company: Facebook
What they did: Randomized people to see different emotional content in their news feeds and observed whether they showed an emotional reaction.
What they found: That there was almost no real effect on emotion. The effect was statistically significant but not scientifically or emotionally meaningful.

Paper: Social influence bias: a randomized experiment
Company: Not stated but sounds like Reddit
What they did: Randomly up-voted, down-voted, or left alone posts on the social networking site. Then they observed whether there was a difference in the overall rating of posts within each treatment.
What they found: Posts that were upvoted ended up with a final rating score (total upvotes - total downvotes) that was 25% higher.

Paper: Identifying influential and susceptible members of social networks 
Company: Facebook
What they did: Using a commercial Facebook app,  they found users who adopted a product and randomized sending messages to their friends about the use of the product. Then they measured whether their friends decided to adopt the product as well.
What they found: Many interesting things. For example: susceptibility to influence decreases with age, people over 31 are stronger influencers, women are less susceptible to influence than men, etc. etc.


Paper: Inferring causal impact using Bayesian structural time-series models
Company: Google
What they did: They developed methods for inferring the causal impact of an ad in a time series situation. They used data from an advertiser who showed ads to people related to keywords and measured how many visits there were to the advertiser's website through paid and organic (non-paid) clicks.
What they found: That the ads worked. But more importantly that they could predict the causal effect of the ad using their methods.
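For readers who want to experiment with this approach, the methods from the paper are implemented in the CausalImpact R package. A minimal sketch, loosely following the package's own simulated-data example (the series, coefficients, and intervention day below are all made up for illustration):

library(CausalImpact)

set.seed(1)
# Simulated daily visits: a control series x and a response y that
# increases by about 10 visits per day once the campaign starts on day 71.
x <- 100 + arima.sim(model = list(ar = 0.5), n = 100)
y <- 1.2 * x + rnorm(100)
y[71:100] <- y[71:100] + 10
data <- cbind(y, x)

pre.period  <- c(1, 70)    # before the ad campaign
post.period <- c(71, 100)  # after the ad campaign

impact <- CausalImpact(data, pre.period, post.period)
summary(impact)  # estimated absolute and relative causal effect
plot(impact)     # observed series vs. the model's counterfactual prediction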


19 Aug

SwiftKey and Johns Hopkins partner for Data Science Specialization Capstone


I use SwiftKey on my Android phone all the time. So I was super pumped up when they agreed to partner with us on the first Capstone course for the Johns Hopkins Data Science Specialization to run in October 2014. To enroll in the course you have to pass the other 9 courses in the Data Science Specialization.

The 9 courses have only been running for 4 months but already 200+ people have finished all 9! It has been unbelievable to see the response to the specialization and we are excited about taking it to the next level.

Around the world, people are spending an increasing amount of time on their mobile devices for email, social networking, banking and a whole range of other activities. But typing on mobile devices can be a serious pain. SwiftKey, our corporate partner in this capstone, builds a smart keyboard that makes it easier for people to type on their mobile devices. One cornerstone of their smart keyboard is predictive text models. When someone types:

I went to the

the keyboard presents three options for what the next word might be. For example, the three words might be gym, store, restaurant. In this capstone you will work on understanding and building predictive text models like those used by SwiftKey.

This course will start with the basics, analyzing a large corpus of text documents to discover the structure in the data and how words are put together. It will cover cleaning and analyzing text data, then building and sampling from a predictive text model. Finally, students will use the knowledge gained in our Data Products course to build a predictive text product they can show off to their family, friends, and potential employers.
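To give a flavor of the kind of thing students will build, here is a minimal sketch of a bigram-based next-word predictor in base R. The toy corpus and the predict_next() helper are made up for illustration; the actual capstone uses a much larger collection of documents and more sophisticated models.

# Toy corpus; the real capstone uses millions of lines of text.
corpus <- c("i went to the gym",
            "i went to the store",
            "she went to the store",
            "we went to the restaurant yesterday")

# Build a table of bigram counts (word, next word) within each sentence.
bigrams <- do.call(rbind, lapply(strsplit(corpus, " "), function(w) {
  if (length(w) < 2) return(NULL)
  data.frame(word = head(w, -1), next_word = tail(w, -1),
             stringsAsFactors = FALSE)
}))
counts <- aggregate(cnt ~ word + next_word,
                    data = transform(bigrams, cnt = 1), FUN = sum)

# Return the top candidates for the word following `word`.
predict_next <- function(word, n = 3) {
  cand <- counts[counts$word == tolower(word), ]
  cand <- cand[order(-cand$cnt), ]
  head(cand$next_word, n)
}

predict_next("the")   # e.g. "store", "gym", "restaurant"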

We are really excited to work with SwiftKey to take our Specialization to the next level! Here is Roger's intro video for the course to get you fired up too.

18 Aug

Interview with COPSS Award winner Martin Wainwright


Editor's note: Martin Wainwright is the winner of the 2014 COPSS Award. This award is the most prestigious award in statistics, sometimes referred to as the Nobel Prize in Statistics. Martin received the award "for fundamental and groundbreaking contributions to high-dimensional statistics, graphical modeling, machine learning, optimization and algorithms, covering deep and elegant mathematical analysis as well as new methodology with wide-ranging implications for numerous applications." He kindly agreed to be interviewed by Simply Statistics.


SS: How did you find out you had received the COPSS prize?

It was pretty informal --- I received an email in February from Raymond Carroll, who chaired the committee. But it had explicit instructions to keep the information private until the award ceremony in August.

SS: You are in Electrical Engineering & Computer Science (EECS) and Statistics at Berkeley: why that mix of departments?

Just to give a little bit of history, I did my undergraduate degree in math at the University of Waterloo in Canada, and then my Ph.D. in EECS at MIT, before coming to Berkeley to work as a postdoc in Statistics. So when it came time to look at faculty positions, having a joint position between these two departments made a lot of sense. Berkeley has always been at the forefront of having effective joint appointments of the "Statistics plus X" variety, whether X is EECS, Mathematics, Political Science, Computational Biology and so on.

For me personally, the EECS plus Statistics combination is terrific, as a lot of my interests lie at the boundary between these two areas, whether it is investigating tradeoffs between computational and statistical efficiency, connections between information theory and statistics, and so on. I hope that it is also good for my students! In any case, whether they enter in EECS or Statistics, they graduate with a strong background in both statistical theory and methods, as well as optimization, algorithms and so on. I think that this kind of mix is becoming increasingly relevant to the practice of modern statistics, and one can certainly see that Berkeley consistently produces students, whether from my own group or other people at Berkeley, with this kind of hybrid background.

SS: What do you see as the relationship between statistics and machine learning?

This is an interesting question, but tricky to answer, as it can really depend on the person. In my own view, statistics is a very broad and encompassing field, and in this context, machine learning can be viewed as a particular subset of it, one especially focused on algorithmic and computational aspects of statistics. But on the other hand, as things stand, machine learning has rather different cultural roots than statistics, certainly strongly influenced by computer science. In general, I think that both groups have lessons to learn from each other. For instance, in my opinion, anyone who wants to do serious machine learning needs to have a solid background in statistics. Statisticians have been thinking about data and inferential issues for a very long time now, and these fundamental issues remain just as important now, even though the application domains and data types may be changing. On the other hand, in certain ways, statistics is still a conservative field, perhaps not as quick to move into new application domains, experiment with new methods and so on, as people in machine learning do. So I think that statisticians can benefit from the playful creativity and unorthodox experimentation that one sees in some machine learning work, as well as the algorithmic and programming expertise that is standard in computer science.

SS: What sorts of things is your group working on these days?

I have fairly eclectic interests, so we are working on a range of topics. A number of projects concern the interface between computation and statistics. For instance, we have a recent pre-print (with postdoc Sivaraman Balakrishnan and colleague Bin Yu) that tries to address the gap between statistical and computational guarantees in applications of the expectation-maximization (EM) algorithm for latent variable models. In theory, we know that the global optimizer of the (nonconvex) likelihood has good properties, but in practice, the EM algorithm only returns local optima. How to resolve this gap between existing theory and actual practice? In this paper, we show that under pretty reasonable conditions---that hold for various types of latent variable models---the EM fixed points are as good as the global optima from the statistical perspective. This explains what is observed a lot in practice, namely that when the EM algorithm is given a reasonable initialization, it often returns a very good answer.

There are lots of other interesting questions at this computation/statistics interface. For instance, a lot of modern data sets (e.g., Netflix) are so large that they cannot be stored on a single machine, but must be split up into separate pieces. Any statistical task must then be carried out in a distributed way, with each processor performing local operations on a subset of the data, and then passing messages to other processors that summarize the results of its local computations. This leads to a lot of fascinating questions. What can be said about the statistical performance of such distributed methods for estimation or inference? How many bits do the machines need to exchange in order for the distributed performance to match that of the centralized "oracle method" that has access to all the data at once? We have addressed some of these questions in a recent line of work (with student Yuchen Zhang, former student John Duchi and colleague Michael Jordan).

So my students and postdocs are keeping me busy, and in addition, I am also busy writing a couple of books, one jointly with Trevor Hastie and Rob Tibshirani at Stanford University on the Lasso and related methods, and a second solo-authored effort, more theoretical in focus, on high-dimensional and non-asymptotic statistics.

SS: What role do you see statistics playing in the relationship between Big Data and Privacy?

Another very topical question: privacy considerations are certainly becoming more and more relevant as the scale and richness of data collection grows. Witness the recent controversies with the NSA, data manipulation on social media sites, etc. I think that statistics should have a lot to say about data and privacy. There has been a long line of statistical work on privacy, dating back at least to Warner's work on survey sampling in the 1960s, but I anticipate seeing more of it over the next years. Privacy constraints bring a lot of interesting statistical questions---how to design experiments, how to perform inference, how should data be aggregated and what should be released and so on---and I think that statisticians should be at the forefront of this discussion.

In fact, in some joint work with former student John Duchi and colleague Michael Jordan, we have examined some tradeoffs between privacy constraints and statistical utility. We adopt the framework of local differential privacy that has been put forth in the computer science community, and study how statistical utility (in the form of estimation accuracy) varies as a function of the privacy level. Obviously, preserving privacy means obscuring something, so that estimation accuracy goes down, but what is the quantitative form of this tradeoff? An interesting consequence of our analysis is that in certain settings, it identifies optimal mechanisms for preserving a certain level of privacy in data.

SS: What advice would you give young statisticians getting into the discipline right now?

It is certainly an exciting time to be getting into the discipline. For undergraduates thinking of going to graduate school in statistics, I would encourage them to build a strong background in basic mathematics (linear algebra, analysis, probability theory and so on), all of which are important for a deep understanding of statistical methods and theory. I would also suggest "getting their hands dirty", that is, doing some applied work involving statistical modeling, data analysis and so on. Even for a person who ultimately wants to do more theoretical work, having some exposure to real-world problems is essential. As part of this, I would suggest acquiring some knowledge of algorithms, optimization, and so on, all of which are essential in dealing with large, real-world data sets.

15 Aug

Crowdsourcing resources for the Johns Hopkins Data Science Specialization


Since we began offering the Johns Hopkins Data Science Specialization we've noticed the unbelievable passion that our students have about our courses and the generosity they show toward each other on the course forums. Many students have created quality content around the subjects we discuss, and many of these materials are so good that we feel they should be shared with all of our students. We also know there are tons of other great organizations creating material (looking at you, Software Carpentry folks).

We're excited to announce that we've created a site using GitHub Pages: http://datasciencespecialization.github.io/ to serve as a directory for content that the community has created. If you've created materials relating to any of the courses in the Data Science Specialization please send us a pull request and we will add a link to your content on our site. You can find out more about contributing here: https://github.com/DataScienceSpecialization/DataScienceSpecialization.github.io#contributing

We can't wait to see what you've created and where the community can take this site!

13 Aug

swirl and the little data scientist's predicament


Editor's note: This is a repost of "R and the little data scientist's predicament". A brief idea for an update is presented at the end in italics. 

I just read this fascinating post on _why, apparently a bit of a cult hero among enthusiasts of the Ruby programming language. One of the most interesting bits was The Little Coder's Predicament, which, boiled down, essentially says that computer programming languages have grown too complex - so children/newbies can't get the instant gratification when they start programming. He suggested a simplified "gateway language" that would get kids fired up about programming, because with a simple line of code or two they could make the computer do things like play some music or make a video.

I feel like there is a similar ramp up with data scientists. To be able to do anything cool/inspiring with data you need to know (a) a little statistics, (b) a little bit about a programming language, and (c) quite a bit about syntax.

Wouldn’t it be cool if there was an R package that solved the little data scientist’s predicament? The package would have to have at least some of these properties:

  1. It would have to be easy to load data sets, one line of uncomplicated code. You could write an interface for RCurl/read.table/download.file for a defined set of APIs/data sets so the command would be something like: load("education-data") and it would load a bunch of data on education. It would handle all the messiness of scraping the web, formatting data, etc. in the background.
  2. It would have to have a lot of really easy visualization functions. Right now, if you want to make pretty plots with ggplot(), plot(), etc. in R, you need to know all the syntax for pch, cex, col, etc. The plotting function should handle all this behind the scenes and make super pretty pictures.
  3. It would be awesome if the functions would include some sort of dynamic graphics (with SVGAnnotation or a wrapper for D3.js). Again, the syntax would have to be really accessible/not too much to learn.

That alone would be a huge start. In just 2 lines kids could load and visualize cool data in a pretty way they could show their parents/friends.
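To make this concrete, those two lines might look something like the sketch below. Everything here is hypothetical: the package, the load_data() and quick_plot() functions, and the "education-data" set do not exist; this is just the kind of interface such a package would need.

# Purely hypothetical interface -- none of these functions or data sets exist yet.
library(littledatascientist)               # imagined package name

ed <- load_data("education-data")          # fetches, scrapes, and tidies the data behind the scenes
quick_plot(ed, x = "spending", y = "test_scores")  # pretty defaults, no pch/cex/col required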

Update: Now that Nick and co. have created swirl, the technology is absolutely in place to have people do something awesome quickly. You could imagine taking the airplane data and immediately having them make a plot of all the flights using ggplot. Or any number of awesome government data sets and going straight to ggvis. Solving this problem is no longer technically a challenge; it is just a matter of someone coming up with an amazing swirl module that immediately sucks students in. This would be a really awesome project for a grad student or even an undergrad with an interest in teaching. If you do do it, you should absolutely send it our way and we'll advertise the heck out of it!

12 Aug

The Leek group guide to giving talks


I wrote a little guide to giving talks that goes along with my data sharing, R packages, and reviewing guides. I posted it to Github and would be really happy to take any feedback/pull requests that folks might have. If you send a pull request please be sure to add yourself to the contributor list.

11 Aug

Stop saying "Scientists discover..."; instead say "Prof. Doe's team discovers..."


I was just reading an article about data science in the WSJ. They were talking about how data scientists with just 2 years' experience can earn a whole boatload of money*. I noticed a description that seemed very familiar:

At e-commerce site operator Etsy Inc., for instance, a biostatistics Ph.D. who spent years mining medical records for early signs of breast cancer now writes statistical models to figure out the terms people use when they search Etsy for a new fashion they saw on the street.

This perfectly describes the resume of a student that worked with me here at Hopkins and is now tearing it up in industry. But it made me a little bit angry that they didn't publicize her name. Now she may have requested her name not be used, but I think it is more likely that it is a case of the standard, "Scientists discover..." (see e.g. this article or this one or this one).

There is always a lot of discussion about how to push people to get into STEM fields, including a ton of misguided attempts that waste time and money. But here is one way that would cost basically nothing and dramatically raise the profile of scientists in the eyes of the public: use their names when you describe their discoveries.

The value of this simple change could be huge. In an era of selfies, reality TV, and the power of social media, emphasizing the value that individual scientists bring could have a huge impact on STEM recruiting. That paragraph above is a lot more inspiring to potential young data scientists when rewritten:

At e-commerce site operator Etsy Inc., for instance, Dr. Hilary Parker, a biostatistics Ph.D. who spent years mining medical records for early signs of breast cancer, now writes statistical models to figure out the terms people use when they search Etsy for a new fashion they saw on the street.


*Incidentally, I think it is a bit overhyped. I have rarely heard of anyone making $200k-$300k with that little experience, but maybe I'm wrong? I'd be interested to hear if people really were making that kind of $$ at that stage in their careers.