Simply Statistics


swirl and the little data scientist's predicament


Editor's note: This is a repost of "R and the little data scientist's predicament". A brief idea for an update is presented at the end in italics. 

I just read this fascinating post on _why, apparently a bit of a cult hero among enthusiasts of the Ruby programming language. One of the most interesting bits was The Little Coder’s Predicament, which, boiled down, essentially says that computer programming languages have grown too complex, so children/newbies can’t get instant gratification when they start programming. He suggested a simplified “gateway language” that would get kids fired up about programming, because with a simple line of code or two they could make the computer do things like play some music or make a video.

I feel like there is a similar ramp up with data scientists. To be able to do anything cool/inspiring with data you need to know (a) a little statistics, (b) a little bit about a programming language, and (c) quite a bit about syntax.

Wouldn’t it be cool if there was an R package that solved the little data scientist’s predicament? The package would have to have at least some of these properties:

  1. It would have to be easy to load data sets, with one line of uncomplicated code. You could write an interface for RCurl/read.table/download.file for a defined set of APIs/data sets so the command would be something like: load(“education-data”) and it would load a bunch of data on education. It would handle all the messiness of scraping the web, formatting data, etc. in the background.
  2. It would have to have a lot of really easy visualization functions. Right now, if you want to make pretty plots with ggplot(), plot(), etc. in R, you need to know all the syntax for pch, cex, col, etc. The plotting function should handle all this behind the scenes and make super pretty pictures.
  3. It would be awesome if the functions would include some sort of dynamic graphics (with SVGAnnotation or a wrapper for D3.js). Again, the syntax would have to be really accessible/not too much to learn.

That alone would be a huge start. In just 2 lines kids could load and visualize cool data in a pretty way they could show their parents/friends.

Update: Now that Nick and co. have created swirl, the technology is absolutely in place to have people do something awesome quickly. You could imagine taking the airplane data and immediately having them make a plot of all the flights using ggplot. Or any number of awesome government data sets and going straight to ggvis. Solving this problem is no longer a technical challenge; it is just a matter of someone coming up with an amazing swirl module that immediately sucks students in. This would be a really awesome project for a grad student or even an undergrad with an interest in teaching. If you do do it, you should absolutely send it our way and we'll advertise the heck out of it!


The Leek group guide to giving talks


I wrote a little guide to giving talks that goes along with my data sharing, R packages, and reviewing guides. I posted it to GitHub and would be really happy to take any feedback/pull requests that folks might have. If you send a pull request please be sure to add yourself to the contributor list.


Stop saying "Scientists discover..." instead say, "Prof. Doe's team discovers..."


I was just reading an article about data science in the WSJ. They were talking about how data scientists with just 2 years experience can earn a whole boatload of money*. I noticed a description that seemed very familiar:

At e-commerce site operator Etsy Inc., for instance, a biostatistics Ph.D. who spent years mining medical records for early signs of breast cancer now writes statistical models to figure out the terms people use when they search Etsy for a new fashion they saw on the street.

This perfectly describes the resume of a student that worked with me here at Hopkins and is now tearing it up in industry. But it made me a little bit angry that they didn't publicize her name. Now she may have requested her name not be used, but I think it is more likely that it is a case of the standard, "Scientists discover..." (see e.g. this article or this one or this one).

There is always a lot of discussion about how to push people to get into STEM fields, including a ton of misguided attempts that waste time and money. But here is one way that would cost basically nothing and dramatically raise the profile of scientists in the eyes of the public: use their names when you describe their discoveries.

The value of this simple change could be huge. In an era of selfies, reality TV, and the power of social media, emphasizing the value that individual scientists bring could have a huge impact on STEM recruiting. That paragraph above is a lot more inspiring to potential young data scientists when rewritten:

At e-commerce site operator Etsy Inc., for instance, Dr. Hilary Parker, a biostatistics Ph.D. who spent years mining medical records for early signs of breast cancer, now writes statistical models to figure out the terms people use when they search Etsy for a new fashion they saw on the street.





* Incidentally, I think it is a bit overhyped. I have rarely heard of anyone making $200k-$300k with that little experience, but maybe I'm wrong? I'd be interested to hear if people really were making that kind of $$ at that stage in their careers.


It's like Tinder, but for peer review.


I have an idea for an app. You input the title and authors of a preprint (maybe even the abstract). The app shows the title/authors/abstract to people who work in a similar area to you. You could estimate this based on papers they have published that have similar key words to start.

Then you swipe left if you think the paper is interesting and right if you think it isn't. We could then aggregate the data on how many "likes" a paper gets as a measure of how "interesting" it is. I wonder if this would be a better measure of later citations/interestingness than the opinion of a small number of editors and referees.
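The two pieces of this idea, matching a preprint to readers by keyword overlap and aggregating swipes into an interest score, can be sketched in a few lines of Python. This is only an illustration: the function names, the Jaccard overlap measure, the 0.2 threshold, and the example readers are all assumptions made up for the sketch, not part of any existing app.

```python
def keyword_overlap(a, b):
    """Jaccard similarity between two collections of keywords."""
    a, b = set(a), set(b)
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def pick_readers(preprint_keywords, reader_keywords, min_overlap=0.2):
    """Names of readers whose published keywords overlap enough with the preprint.

    The 0.2 threshold is an arbitrary choice for this sketch.
    """
    return sorted(
        name
        for name, keywords in reader_keywords.items()
        if keyword_overlap(preprint_keywords, keywords) >= min_overlap
    )


def interest_score(swipes):
    """Fraction of swipes marked 'interesting' (True); None if nobody has swiped."""
    if not swipes:
        return None
    return sum(swipes) / len(swipes)


# Hypothetical example data.
preprint = {"rna-seq", "batch effects", "normalization"}
readers = {
    "reader_a": {"rna-seq", "normalization", "splicing"},
    "reader_b": {"survival analysis", "clinical trials"},
}

print(pick_readers(preprint, readers))             # ['reader_a']
print(interest_score([True, True, False, True]))   # 0.75
```

Recording each swipe as a simple interesting/not-interesting flag keeps the score independent of which direction the app assigns to "like".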

This is obviously taking my proposal of a fast statistics journal to the extreme and would provide no measure of how scientifically sound the paper was. But in an age when scientific soundness is only one part of the equation for top journals, a measure of interestingness that was available before review could be of huge value to journals.

If done properly, it would encourage people to publish preprints. If you posted a preprint and it was immediately "interesting" to many scientists, you could use that to convince editors to get past that stage and consider your science. More things like this could happen:

So anyone want to build it?


If you like A/B testing here are some other Biostatistics ideas you may like


Web companies are using A/B testing and experimentation regularly now to determine which features to push for advertising or improving user experience. A/B testing is a form of randomized controlled trial that was originally employed in psychology but first adopted on a massive scale in Biostatistics. Since then a large amount of work on trials and trial design has been performed in the Biostatistics community. Some of these ideas may be useful in the same context within web companies; probably a lot of them are already being used and I just haven't seen published examples. Here are some examples:

  1. Sequential study designs. Here the sample size isn't fixed in advance (an issue that I imagine is pretty hard to do with web experiments) but as the experiment goes on, the data are evaluated and a stopping rule that controls appropriate error rates is used. Here are a couple of good (if a bit dated) reviews on sequential designs [1] [2].
  2. Adaptive study designs. These are study designs that use covariates or responses to adapt the treatment assignments of people over time. With careful design and analysis choices, you can still control the relevant error rates. Here are a couple of reviews on adaptive trial designs [1] [2].
  3. Noninferiority trials. These are trials designed to show that one treatment is at least as good as the standard of care. They are often implemented when a good placebo group is not available, often for ethical reasons. In light of the ethical concerns for human subjects research at tech companies this could be a useful trial design. Here is a systematic review for noninferiority trials [1].
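To make the first item concrete, here is a minimal simulation sketch in Python of why a sequential design needs a stopping rule that controls error rates. It simulates A/A experiments (no true difference) with five interim looks and compares three rules: stop at the first look where |z| > 1.96, stop at the first look where |z| exceeds a Pocock-style boundary, and test only once at the end. Everything here is an assumption for illustration: unit-variance normal outcomes, equal group sizes, five equally spaced looks, and the textbook Pocock constant of 2.413 for that setup at a two-sided 5% level.

```python
import math
import random

NAIVE_Z = 1.96      # usual two-sided 5% cutoff, reused at every look
POCOCK_Z = 2.413    # Pocock constant: 5 equally spaced looks, two-sided 5%


def z_at_each_look(rng, looks=5, per_look=50, effect=0.0):
    """Simulate one two-arm experiment and return the z statistic at each look.

    Outcomes are normal with known variance 1; `effect` is the true mean
    difference between arm B and arm A (0.0 means an A/A test).
    """
    sum_a = sum_b = 0.0
    n = 0
    zs = []
    for _ in range(looks):
        for _ in range(per_look):
            sum_a += rng.gauss(0.0, 1.0)
            sum_b += rng.gauss(effect, 1.0)
        n += per_look
        diff = sum_b / n - sum_a / n
        zs.append(diff / math.sqrt(2.0 / n))
    return zs


def rejection_rates(n_sims=2000, seed=1):
    """Estimated false positive rates under the three stopping rules."""
    rng = random.Random(seed)
    naive = pocock = final_only = 0
    for _ in range(n_sims):
        zs = z_at_each_look(rng)
        naive += any(abs(z) > NAIVE_Z for z in zs)
        pocock += any(abs(z) > POCOCK_Z for z in zs)
        final_only += abs(zs[-1]) > NAIVE_Z
    return naive / n_sims, pocock / n_sims, final_only / n_sims


if __name__ == "__main__":
    naive, pocock, final_only = rejection_rates()
    print(f"peek at 1.96 every look: {naive:.3f}")
    print(f"Pocock boundary:         {pocock:.3f}")
    print(f"single final test:       {final_only:.3f}")
```

Under these assumptions the naive peeking rule should reject far more often than the nominal 5% (theory puts it at roughly 14% for five looks), while the Pocock boundary and the single final test should both land near 5%. Real web experiments usually involve binary outcomes and unequal traffic, so the boundary would need to be worked out for that setting.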

It is also probably useful to read about proportional hazards models and time varying coefficients. Obviously these are just a few ideas that might be useful, but talking to a Biostatistician who works on clinical trials (not me!) would be a great way to get more information.


Do we need institutional review boards for human subjects research conducted by big web companies?


Web companies have been doing human subjects research for a while now. Companies like Facebook and Google have employed statisticians for almost a decade (or more) and part of the culture they have introduced is the idea of randomized experiments to identify ideas that work and that don't. They have figured out that experimentation and statistical analysis often beat out the opinion of the highest paid person at the company for identifying features that "work". Here "work" may mean features that cause people to read advertising, or click on ads, or match up with more people.

This has created a huge amount of value and definitely a big interest in the statistical community. For example, today's session on "Statistics: The Secret Weapon of Successful Web Giants" was standing room only.

But at the same time, these experiments have raised some issues. Recently scientists from Cornell and Facebook published a study where they experimented with the news feeds of users. This turned into a PR problem for Facebook and Cornell because people were pretty upset they were being experimented on and weren't being told about it. This has led defenders of the study to say: (a) Facebook is doing the experiments anyway, they just published it this time, (b) in this case very little harm was done, (c) most experiments done by Facebook are designed to increase profitability, at least this experiment had a more public good focused approach, and (d) there was a small effect size so what's the big deal?

OK Cupid then published a very timely blog post with the title, "We experiment on human beings!", probably at least in part to take advantage of the press around the Facebook experiment. This post was received with less vitriol than the Facebook study, but really drove home the point that large web companies perform as much human subjects research as most universities and with little or no oversight.

This is the way academic research used to work. Scientists used their common sense and their scientific sense to decide on what experiments to run. Most of the time this worked fine, but then things like the Tuskegee Syphilis Study happened. These really unethical experiments led to the National Research Act of 1974 which codified rules about institutional review boards to oversee research conducted on human subjects, to guarantee their protection. The IRBs are designed to consider the ethical issues involved with performing research on humans to balance protection of rights with advancing science.

Facebook, OK Cupid, and other companies are not subject to IRB approval. Yet they are performing more and more human subjects experiments. Obviously the studies described in the Facebook paper and the OK Cupid post pale in comparison to the Tuskegee study. I also know scientists at these companies and know they are ethical and really trying to do the right thing. But it raises interesting questions about oversight. Given the emotional, professional, and economic value that these websites control for individuals around the globe, it may be time to consider the equivalent of "institutional review boards" for human subjects research conducted by companies.

Companies who test drugs on humans such as Merck are subject to careful oversight and regulation to prevent potential harm to patients during the discovery process. This is obviously not the optimal solution for speed - understandably a major advantage and goal of tech companies. But there are issues that deserve serious consideration. For example, I think it is nowhere near sufficient to claim that by signing the terms of service that people have given informed consent to be part of an experiment. That being said, they could just stop using Facebook if they don't like that they are being experimented on.

Our reliance on these tools for all aspects of our lives means that it isn't easy to just tell people, "Well if you don't like being experimented on, don't use that tool." You would have to give up at minimum Google, Gmail, Facebook, Twitter, and Instagram to avoid being experimented on. But you'd also have to give up using smaller sites like OK Cupid, because almost all web companies are recognizing the importance of statistics. One good place to start might be in considering new and flexible forms of consent that make it possible to opt in and out of studies in an informed way, but with enough speed and flexibility not to slow down the innovation in tech companies.



Introducing people to R: 14 years and counting


I've been introducing people to R for quite a long time now and I've been doing some reflecting today on how that process has changed quite a bit over time. I first started using R around 1998--1999, and I think I first started talking about R informally to my fellow classmates (and some faculty) back when I was in graduate school at UCLA. There, the department was officially using Lisp-Stat (which I loved) and only later converted its courses over to R. Through various brown-bag lunches and seminars I would talk about R, and the main selling point at the time was "It's just like S-PLUS but it's free!" As it turns out, S-PLUS was basically abandoned by academics and its ownership changed hands a number of times over the years (it is currently owned by TIBCO). I still talk about S-PLUS when I talk about the history of R but I'm not sure many people nowadays actually have any memories of the product.

When I got to Johns Hopkins in 2003 there wasn't really much of a modern statistical computing class, so Karl Broman, Rafa Irizarry, Brian Caffo, Ingo Ruczinski, and I got together and started what we called the "KRRIB" class, which was basically a weekly seminar where one of us talked about a computing topic of interest. I gave some of the R lectures in that class and when I asked people who had heard of R before, almost no one raised their hand. And no one had actually used it before. My approach was pretty much the same at the time, although I left out the part about S-PLUS because no one had used that either. A lot of people had experience with SAS or Stata or SPSS. A number of people had used something like Java or C/C++ before and so I often used that as a reference frame. No one had ever used a functional-style programming language like Scheme or Lisp.

Over time, the population of students (mostly first-year graduate students) slowly shifted to the point where many of them had been introduced to R while they were undergraduates. This trend mirrored the overall trend with statistics where we are seeing more and more students do undergraduate majors in statistics (as opposed to, say, mathematics). Eventually, by 2008--2009, when I'd ask how many people had heard of or used R before, everyone raised their hand. However, even at that late date, I still felt the need to convince people that R was a "real" language that could be used for real tasks.

R has grown a lot in recent years, and is being used in so many places now, that I think it's essentially impossible for a person to keep track of everything that is going on. That's fine, but it makes "introducing" people to R an interesting experience. Nowadays in class, students are often teaching me something new about R that I've never seen or heard of before (they are quite good at Googling around for themselves). I feel no need to "bring people over" to R. In fact it's quite the opposite--people might start asking questions if I weren't teaching R.

Even though my approach to introducing R has evolved over time, with the topics that I emphasize or de-emphasize changing, I've found there are a few topics that I always stress to people who are generally newcomers to R. For whatever reason, these topics are always new or at least a little unfamiliar.

  • R is a functional-style language. Back when most people primarily saw something like C as a first programming language, it made sense to me that the functional style of programming would seem strange. I came to R from Lisp-Stat so the functional aspect was pretty natural for me. But many people seem to get tripped up over the idea of passing a function as an argument or not being able to modify the state of an object in place. Also, it sometimes takes people a while to get used to doing things like lapply() and map-reduce types of operations. Everyone still wants to write a for loop!
  • R is both an interactive system and a programming language. Yes, it's a floor wax and a dessert topping--get used to it. Most people seem to expect one or the other. SAS users are wondering why you need to write 10 lines of code to do what SAS can do in one massive PROC statement. C programmers are wondering why you don't write more for loops. C++ programmers are confused by the weird system for object orientation. In summary, no one is ever happy.
  • Visualization/plotting capabilities are state-of-the-art. One of the big selling points back in the "old days" was that from the very beginning R's plotting and graphics capabilities were far more elegant than the ASCII-art that was being produced by other statistical packages (true for S-PLUS too). I find it a bit strange that this point has largely remained true. While other statistical packages have definitely improved their output (and R certainly has some areas where it is perhaps deficient), R still holds its own quite handily against those other packages. If the community can continue to produce things like ggplot2 and rgl, I think R will remain at the forefront of data visualization.

I'm looking forward to teaching R to people as long as people will let me, and I'm interested to see how the next generation of students will approach it (and how my approach to them will change). Overall, it's been just an amazing experience to see the widespread adoption of R over the past decade. I'm sure the next decade will be just as amazing.


Academic statisticians: there is no shame in developing statistical solutions that solve just one problem


I think that the main distinction between academic statisticians and those calling themselves data scientists is that the latter are very much willing to invest most of their time and energy into solving specific problems by analyzing specific data sets. In contrast, most academic statisticians strive to develop methods that can be very generally applied across problems and data types. There is a reason for this of course: historically statisticians have had enormous influence by developing general theory/methods/concepts such as the p-value, maximum likelihood estimation, and linear regression. However, these types of success stories are becoming more and more rare while data scientists are becoming increasingly influential in their respective areas of applications by solving important context-specific problems. The success of Moneyball and the prediction of election results are two recent widely publicized examples.

A survey of papers published in our flagship journals makes it quite clear that context-agnostic methodology is valued much more than detailed descriptions of successful solutions to specific problems. These applied papers tend to get published in subject matter journals and do not usually receive the same weight in appointments and promotions. This culture has therefore kept most statisticians holding academic positions away from collaborations that require substantial time and energy investments in understanding and attacking the specifics of the problem at hand. Below I argue that to remain relevant as a discipline we need a cultural shift.

It is of course understandable that to remain a discipline academic statisticians can’t devote all our effort to solving specific problems and none to trying to generalize these solutions. It is the development of these abstractions that defines us as an academic discipline and not just a profession. However, if our involvement with real problems is too superficial, we run the risk of developing methods that solve no problem at all, which will eventually render us obsolete. We need to accept that as data and problems become more complex, more time will have to be devoted to understanding the gory details.

But what should the balance be?

Note that many of the giants of our discipline were very much interested in solving specific problems in genetics, agriculture, and the social sciences. In fact, many of today’s most widely-applied methods were originally inspired by insights gained by answering very specific scientific questions. I worry that the balance between application and theory has shifted too far away from applications. An unfortunate consequence is that our flagship journals, including our applied journals, are publishing too many methods seeking to solve many problems but actually solving none. By shifting some of our efforts to solving specific problems we will get closer to the essence of modern problems and will actually inspire more successful generalizable methods.


Jan de Leeuw owns the Internet


One of the best things to happen on the Internet recently is that Jan de Leeuw has decided to own the Twitter/Facebook universe. If you do not already, you should be following him. Among his many accomplishments, he founded the Department of Statistics at UCLA (my alma mater), which is currently thriving. On the occasion of the Department's 10th birthday, there was a small celebration, and I recall Don Ylvisaker mentioning that the reason they invited Jan to UCLA way back when was because he "knew everyone and knew everything". Pretty accurate description, in my opinion.

Jan's been tweeting quite a bit of late, but recently had this gem:

followed by

I'm not sure what Jan's thinking behind the first tweet was, but I think many in statistics would consider it a "good thing" to be a minor subfield of data science. Why get involved in that messy thing called data science where people are going wild with data in an unprincipled manner?

This is a situation where I think there is a large disconnect between what "should be" and what "is reality". What should be is that statistics should include the field of data science. Honestly, that would be beneficial to the field of statistics and would allow us to provide a home to many people who don't necessarily have one (primarily, people working on the border between two fields). Nate Silver made reference to this in his keynote address to the Joint Statistical Meetings last year when he said data science was just a fancy term for statistics.

The reality though is the opposite. Statistics has chosen to limit itself to a few areas, such as inference, as Jan mentions, and to willfully ignore other important aspects of data science as "not statistics". This is unfortunate, I think, because unlike many in the field of statistics, I believe data science is here to stay. The reason is that statistics has decided not to fill the spaces that have been created by the increasing complexity of modern data analysis. The needs of modern data analyses (reproducibility, computing on large datasets, data preprocessing/cleaning) didn't fall into the usual statistics curriculum, and so they were ignored. In my view, data science is about stringing together many different tools for many different purposes into an analytic whole. Traditional statistical modeling is a part of this (often a small part), but statistical thinking plays a role in all of it.

Statisticians should take on the challenge of data science and own it. We may not be successful in doing so, but we certainly won't be if we don't try.


Piketty in R markdown - we need some help from the crowd


Thomas Piketty's book Capital in the 21st Century was a surprise best seller and the subject of intense scrutiny. A few weeks ago the Financial Times claimed that the analysis was riddled with errors, leading to a firestorm of discussion. A few days ago the London School of Economics posted a similar call to make the data open and machine readable, saying:

None of this data is explicitly open for everyone to reuse, clearly licenced and in machine-readable formats.

A few friends of Simply Stats had started on a project to translate his work from the Excel files where the original analysis resides into R. The people that helped were Alyssa Frazee, Aaron Fisher, Bruce Swihart, Abhinav Nellore, Hector Corrada Bravo, John Muschelli, and me. We haven't finished translating all chapters, so we are asking anyone who is interested to help contribute to translating the book's technical appendices into R markdown documents. If you are interested, please send pull requests to the gh-pages branch of this GitHub repo.

As a way to entice you to participate, here is one interesting thing we found. We don't know enough economics to know if what we are finding is "right" or not, but the x-axes in the Excel files are really distorted. For example, here is Figure 1.1 from the Excel files, where the ticks on the x-axis are separated by 20, 50, 43, 37, 20, 20, and 22 years.



Here is the same plot with an equally spaced x-axis.
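The size of the distortion can be worked out from the tick gaps alone. The short Python sketch below (an illustration, not code from the repo) takes the gaps quoted above and compares where each tick sits when every tick gets one equal step, as in the Excel chart, with where it would sit on an axis drawn to scale in years. Positions are expressed as fractions of the axis width; the function name is made up for this example.

```python
from itertools import accumulate

# Gaps in years between consecutive x-axis ticks, as quoted in the text above.
gaps = [20, 50, 43, 37, 20, 20, 22]


def tick_positions(gaps):
    """Return (offsets, as_drawn, to_scale) for the ticks implied by `gaps`.

    offsets  : years elapsed since the first tick
    as_drawn : position when every tick gets one equal step (0 to 1)
    to_scale : position proportional to elapsed years (0 to 1)
    """
    offsets = [0] + list(accumulate(gaps))
    total_years = offsets[-1]
    n_steps = len(gaps)
    as_drawn = [i / n_steps for i in range(len(offsets))]
    to_scale = [off / total_years for off in offsets]
    return offsets, as_drawn, to_scale


offsets, as_drawn, to_scale = tick_positions(gaps)
for off, drawn, scaled in zip(offsets, as_drawn, to_scale):
    print(f"+{off:3d} years   one step per tick: {drawn:.2f}   to scale: {scaled:.2f}")
```

With these gaps the second tick sits at about 0.14 of the axis width when each tick gets an equal step but at about 0.09 when drawn to scale, so the earliest period is stretched and the 50-year gap that follows is compressed.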


I'm not sure if it makes any difference but it is interesting. It sounds like, on balance, the Piketty analysis was mostly reproducible and reasonable. But having the data available in a more readily analyzable format will allow for more concrete discussion based on the data. So consider contributing to our GitHub repo.