It's like Tinder, but for peer review.

I have an idea for an app. You input the title and authors of a preprint (maybe even the abstract). The app shows the title/authors/abstract to people who work in a similar area to you. You could estimate this based on papers they have published that have similar key words to start.

Then you swipe left if you think the paper is interesting and right if you think it isn't. We could then aggregate the data on how many "likes" a paper gets as a measure of how "interesting" it is. I wonder if this would be a better measure of later citations/interestingness than the opinion of a small number of editors and referees.

This is obviously taking my proposal of a fast statistics journal to the extreme and would provide no measure of how scientifically sound the paper was. But in an age when scientific soundness is only one part of the equation for top journals, a measure of interestingness that was available before review could be of huge value to journals.

If done properly, it would encourage people to publish preprints. If you posted a preprint and it was immediately "interesting" to many scientists, you could use that to convince editors to get past that stage and consider your science. More things like this could happen:

So anyone want to build it?

Posted in Uncategorized | 5 Comments

If you like A/B testing here are some other Biostatistics ideas you may like

Web companies are using A/B testing and experimentation regularly now to determine which features to push for advertising or improving user experience. A/B testing is a form of randomized controlled trial that was originally employed in psychology but first adopted on a massive scale in Biostatistics. Since then a large amount of work on trials and trial design has been performed in the Biostatistics community. Some of these ideas may be useful in the same context within web companies, probably a lot of them are already being used and I just haven't seen published examples. Here are some examples:

  1. Sequential study designs. Here the sample size isn't fixed in advance (an issue that I imagine is pretty hard to do with web experiments) but as the experiment goes on, the data are evaluated and a stopping rule that controls appropriate error rates is used. Here are a couple of  good (if a bit dated) review on sequential designs [1] [2].
  2. Adaptive study designs. These are study designs that use covariates or responses to adapt the treatment assignments of people over time. With careful design and analysis choices, you can still control the relevant error rates. Here are a couple of reviews on adaptive trial designs [1] [2]
  3. Noninferiority trials These are trials designed to show that one treatment is at least as good as the standard of care. They are often implemented when a good placebo group is not available, often for ethical reasons. In light of the ethical concerns for human subjects research at tech companies  this could be a useful trial design. Here is a systematic review for noninferiority trials [1]

It is also probably useful to read about proportional hazards models and time varying coefficients. Obviously these are just a few ideas that might be useful, but talking to a Biostatistician who works on clinical trials (not me!) would be a great way to get more information.

Posted in Uncategorized | Leave a comment

Do we need institutional review boards for human subjects research conducted by big web companies?

Web companies have been doing human subjects research for a while now. Companies like Facebook and Google have employed statisticians for almost a decade (or more) and part of the culture they have introduced is the idea of randomized experiments to identify ideas that work and that don't. They have figured out that experimentation and statistical analysis often beat out the opinion of the highest paid person at the company for identifying features that "work". Here "work" may mean features that cause people to read advertising, or click on ads, or match up with more people.

This has created a huge amount of value and definitely a big interest in the statistical community. For example, today's session on "Statistics: The Secret Weapon of Successful Web Giants" was standing room only.

But at the same time, these experiments have raised some issues. Recently scientists from Cornell and Facebook published a study where they experimented with the news feeds of users. This turned into a PR problem for Facebook and Cornell because people were pretty upset they were being experimented on and weren't being told about it. This has led defenders of the study to say: (a) Facebook is doing the experiments anyway, they just published it this time, (b) in this case very little harm was done, (c) most experiments done by Facebook are designed to increase profitability, at least this experiment had a more public good focused approach, and (d) there was a small effect size so what's the big deal?

OK Cupid then published a very timely blog postwith the title, "We experiment on human beings!", probably at least in part to take advantage of the press around the Facebook experiment. This post was received with less vitriol than the Facebook study, but really drove home the point that large web companies perform as much human subjects research as most universities and with little or no oversight. 

The same situation was the way academic research used to work. Scientists used their common sense and their scientific sense to decide on what experiments to run.  Most of the time this worked fine, but then things like the Tuskegee Syphillis Study happened. These really unethical experiments led to the National Research Act of 1974 which codified rules about institutional review boards to oversee research conducted on human subjects, to guarantee their protection. The IRBs are designed to consider the ethical issues involved with performing research on humans to balance protection of rights with advancing science.

Facebook, OK Cupid, and other companies are not subject to IRB approval. Yet they are performing more and more human subjects experiments. Obviously the studies described in the Facebook paper and the OK Cupid post pale in comparison to the Tuskegee study. I also know scientists at these companies and know they are ethical and really trying to do the right thing. But it raises interesting questions about oversight. Given the emotional, professional, and economic value that these websites control for individuals around the globe, it may be time to discuss whether it is time to consider the equivalent of "institutional review boards" for human subjects research conducted by companies.

Companies who test drugs on humans such as Merck are subject to careful oversight and regulation to prevent potential harm to patients during the discovery process. This is obviously not the optimal solution for speed - understandably a major advantage and goal of tech companies. But there are issues that deserve serious consideration. For example, I think it is no where near sufficient to claim that by signing the terms of service that people have given informed consent to be part of an experiment. That being said, they could just stop using Facebook if they don't like that they are being experimented on.

Our reliance on these tools for all aspects of our lives means that it isn't easy to just tell people, "Well if you don't like being experimented on, don't use that tool." You would have to give up at minimum Google, Gmail, Facebook, Twitter, and Instagram to avoid being experimented on. But you'd also have to give up using smaller sites like OK Cupid, because almost all web companies are recognizing the importance of statistics. One good place to start might be in considering new and flexible forms of consent that make it possible to opt in and out of studies in an informed way, but with enough speed and flexibility not to slowing down the innovation in tech companies.


Posted in Uncategorized | 6 Comments

Introducing people to R: 14 years and counting

I've been introducing people to R for quite a long time now and I've been doing some reflecting today on how that process has changed quite a bit over time. I first started using R around 1998--1999 I think I first started talking about R informally to my fellow classmates (and some faculty) back when I was in graduate school at UCLA. There, the department was officially using Lisp-Stat (which I loved) and only later converted its courses over to R. Through various brown-bag lunches and seminars I would talk about R, and the main selling point at the time was "It's just like S-PLUS but it's free!" As it turns out, S-PLUS was basically abandoned by academics and its ownership changed hands a number of times over the years (it is currently owned by TIBCO). I still talk about S-PLUS when I talk about the history of R but I'm not sure many people nowadays actually have any memories of the product.

When I got to Johns Hopkins in 2003 there wasn't really much of a modern statistical computing class, so Karl Broman, Rafa Irizarry, Brian Caffo, Ingo Ruczinski, and I got together and started what we called the "KRRIB" class, which was basically a weekly seminar where one of us talked about a computing topic of interest. I gave some of the R lectures in that class and when I asked people who had heard of R before, almost no one raised their hand. And no one had actually used it before. My approach was pretty much the same at the time, although I left out the part about S-PLUS because no one had used that either. A lot of people had experience with SAS or Stata or SPSS. A number of people had used something like Java or C/C++ before and so I often used that a reference frame. No one had ever used a functional-style of programming language like Scheme or Lisp.

Over time, the population of students (mostly first-year graduate students) slowly shifted to the point where many of them had been introduced to R while they were undergraduates. This trend mirrored the overall trend with statistics where we are seeing more and more students do undergraduate majors in statistics (as opposed to, say, mathematics). Eventually, by 2008--2009, when I'd ask how many people had heard of or used R before, everyone raised their hand. However, even at that late date, I still felt the need to convince people that R was a "real" language that could be used for real tasks.

R has grown a lot in recent years, and is being used in so many places now, that I think its essentially impossible for a person to keep track of everything that is going on. That's fine, but it makes "introducing" people to R an interesting experience. Nowadays in class, students are often teaching me something new about R that I've never seen or heard of before (they are quite good at Googling around for themselves). I feel no need to "bring people over" to R. In fact it's quite the opposite--people might start asking questions if I weren't teaching R.

Even though my approach to introducing R has evolved over time, with the topics that I emphasize or de-emphasize changing, I've found there are a few topics that I always  stress to people who are generally newcomers to R. For whatever reason, these topics are always new or at least a little unfamiliar.

  • R is a functional-style language. Back when most people primarily saw something like C as a first programming language, it made sense to me that the functional style of programming would seem strange. I came to R from Lisp-Stat so the functional aspect was pretty natural for me. But many people seem to get tripped up over the idea of passing a function as an argument or not being able to modify the state of an object in place. Also, it sometimes takes people a while to get used to doing things like lapply() and map-reduce types of operations. Everyone still wants to write a for loop!
  • R is both an interactive system and a programming language. Yes, it's a floor wax and a dessert topping--get used to it. Most people seem expect one or the other. SAS users are wondering why you need to write 10 lines of code to do what SAS can do in one massive PROC statement. C programmers are wondering why you don't write more for loops. C++ programmers are confused by the weird system for object orientation. In summary, no one is ever happy.
  • Visualization/plotting capabilities are state-of-the-art. One of the big selling points back in the "old days" was that from the very beginning R's plotting and graphics capabilities where far more elegant than the ASCII-art that was being produced by other statistical packages (true for S-PLUS too). I find it a bit strange that this point has largely remained true. While other statistical packages have definitely improved their output (and R certainly has some areas where it is perhaps deficient), R still holds its own quite handily against those other packages. If the community can continue to produce things like ggplot2 and rgl, I think R will remain at the forefront of data visualization.

I'm looking forward to teaching R to people as long as people will let me, and I'm interested to see how the next generation of students will approach it (and how my approach to them will change). Overall, it's been just an amazing experience to see the widespread adoption of R over the past decade. I'm sure the next decade will be just as amazing.

Posted in Uncategorized | 7 Comments

Academic statisticians: there is no shame in developing statistical solutions that solve just one problem

I think that the main distinction between academic statisticians and those calling themselves data scientists is that the latter are very much willing to invest most of their time and energy into solving specific problems by analyzing specific data sets. In contrast, most academic statisticians strive to develop methods that can be very generally applied across problems and data types. There is a reason for this of course:  historically statisticians have had enormous influence by developing general theory/methods/concepts such as the p-value, maximum likelihood estimation, and linear regression. However, these types of success stories are becoming more and more rare while data scientists are becoming increasingly influential in their respective areas of applications by solving important context-specific problems. The success of Money Ball and the prediction of election results are two recent widely publicized examples.

A survey of papers published in our flagship journals make it quite clear that context-agnostic methodology are valued much more than detailed descriptions of successful solutions to specific problems. These applied papers tend to get published in subject matter journals and do not usually receive the same weight in appointments and promotions. This culture has therefore kept most statisticians holding academic position away from collaborations that require substantial time and energy investments in understanding and attacking the specifics of the problem at hand. Below I argue that to remain relevant as a discipline we need a cultural shift.

It is of course understandable that to remain a discipline academic statisticians can’t devote all our effort to solving specific problems and none to trying to the generalize these solutions. It is the development of these abstractions that defines us as an academic discipline and not just a profession. However, if our involvement with real problems is too superficial, we run the risk of developing methods that solve no problem at all which will eventually render us obsolete. We need to accept that as data and problems become more complex, more time will have to be devoted to understanding the gory details.

But what should the balance be?

Note that many of the giants of our discipline were very much interested in solving specific problems in genetics, agriculture, and the social sciences. In fact, many of today’s most widely-applied methods were originally inspired by insights gained by answering very specific scientific questions. I worry that the balance between application and theory has shifted too far away from applications. An unfortunate consequence is that our flagship journals, including our applied journals, are publishing too many methods seeking to solve many problems but actually solving none.  By shifting some of our efforts to solving specific problems we will get closer to the essence of modern problems and will actually inspire more successful generalizable methods.

Posted in Uncategorized | 7 Comments

Jan de Leeuw owns the Internet

One of the best things to happen on the Internet recently is that Jan de Leeuw has decided to own the Twitter/Facebook universe. If you do not already, you should be following him. Among his many accomplishments, he founded the Department of Statistics at UCLA (my alma mater), which is currently thriving. On the occasion of the Department's 10th birthday, there was a small celebration, and I recall Don Ylvisaker mentioning that the reason they invited Jan to UCLA way back when was because he "knew everyone and knew everything". Pretty accurate description, in my opinion.

Jan's been tweeting quite a bit of late, but recently had this gem:

followed by

I'm not sure what Jan's thinking behind the first tweet was, but I think many in statistics would consider it a "good thing" to be a minor subfield of data science. Why get involved in that messy thing called data science where people are going wild with data in an unprincipled manner?

This is a situation where I think there is a large disconnect between what "should be" and what "is reality". What should be is that statistics should include the field of data science. Honestly, that would be beneficial to the field of statistics and would allow us to provide a home to many people who don't necessarily have one (primarily, people working not he border between two fields). Nate Silver made reference to this in his keynote address to the Joint Statistical Meetings last year when he said data science was just a fancy term for statistics.

The reality though is the opposite. Statistics has chosen to limit itself to a few areas, such as inference, as Jan mentions, and to willfully ignore other important aspects of data science as "not statistics". This is unfortunate, I think, because unlike many in the field of statistics, I believe data science is here to stay. The reason is because statistics has decided not to fill the spaces that have been created by the increasing complexity of modern data analysis. The needs of modern data analyses (reproducibility, computing on large datasets, data preprocessing/cleaning) didn't fall into the usual statistics curriculum, and so they were ignored. In my view, data science is about stringing together many different tools for many different purposes into an analytic whole. Traditional statistical modeling is a part of this (often a small part), but statistical thinking plays a role in all of it.

Statisticians should take on the challenge of data science and own it. We may not be successful in doing so, but we certainly won't be if we don't try.

Posted in Uncategorized | 1 Comment

Piketty in R markdown - we need some help from the crowd

Thomas Piketty's book Capital in the 21st Century was a surprise best seller and the subject of intense scrutiny. A few weeks ago the Financial Times claimed that the analysis was riddled with errors, leading to a firestorm of discussion. A few days ago the London School of economics posted a similar call to make the data open and machine readable saying.

None of this data is explicitly open for everyone to reuse, clearly licenced and in machine-readable formats.

A few friends of Simply Stats  had started on a project to translate his work from the excel files where the original analysis resides into R. The people that helped were Alyssa Frazee, Aaron Fisher, Bruce Swihart, Abhinav Nellore, Hector Corrada Bravo, John Muschelli, and me. We haven't finished translating all chapters, so we are asking anyone who is interested to help contribute to translating the book's technical appendices into R markdown documents. If you are interested, please send pull requests to the gh-pages branch of this Github repo.

As a way to entice you to participate, here is one interesting thing we found. We don't know enough economics to know if what we are finding is "right" or not, but one interesting thing I found is that the x-axes in the excel files are really distorted. For example here is Figure 1.1 from the Excel files where the ticks on the x-axis are separated by 20, 50, 43, 37, 20, 20, and 22 years.



Here is the same plot with an equally spaced x-axis.


I'm not sure if it makes any difference but it is interesting. It sounds like on measure, the Piketty analysis was mostly reproducible and reasonable.  But having the data available in a more readily analyzable format will allow for more concrete discussion based on the data. So consider contributing to our github repo.

Posted in Uncategorized | 1 Comment

Privacy as a function of sample size

The U.S. Supreme Court just made a unanimous ruling in Riley v. California making it clear that police officers must get a warrant before searching through the contents of a cell phone obtained incident to an arrest. The message was put pretty clearly in the decision:

 Our answer to the question of what police must do before searching a cell phone seized incident to an arrest is accordingly simple — get a warrant.

But I was more fascinated by this quote:

The sum of an individual’s private life can be reconstructed through a thousand photographs labeled with dates, locations, and descriptions; the same cannot be said of a photograph or two of loved ones tucked into a wallet.

So n = 2 is not enough to recreate a private life, but n = 2,000 (with associated annotation) is enough.  I wonder what the minimum sample size needed is to officially violate someone's privacy. I'd be curious get Cathy O'Neil's opinion on that question, she seems to have thought very hard about the relationship between data and privacy.

This is another case where I think that, to some extent, the Supreme Court made a decision on the basis of a statistical concept. Last time it was correlation, this time it is inference. As I read the opinion, part of the argument hinged on how much information do you get by searching a cell phone versus a wallet? Importantly, how much can you infer from those two sets of data?

If any of the Supreme's want a primer in statistics, I'm available.

Posted in Uncategorized | Leave a comment

New book on implementing reproducible research

9781466561595I have mentioned this in a few places but my book edited with Victoria Stodden and Fritz Leisch, Implementing Reproducible Research, has just been published by CRC Press. Although it is technically in their "R Series", the chapters contain information on a wide variety of useful tools, not just R-related tools. 

There is also a supplementary web site hosted through Open Science Framework that contains a lot of additional information, including the list of chapters.

Posted in Uncategorized | 1 Comment

The difference between data hype and data hope

I was reading one of my favorite stats blogs, StatsChat, where Thomas points to this article in the Atlantic and highlights this quote:

Dassault Systèmes is focusing on that level of granularity now, trying to simulate propagation of cholesterol in human cells and building oncological cell models. "It's data science and modeling," Charlès told me. "Coupling the two creates a new environment in medicine."

I think that is a perfect example of data hype. This is a cool idea and if it worked would be completely revolutionary. But the reality is we are not even close to this. In very simple model organisms we can predict very high level phenotypes some of the time with whole cell modeling. We aren't anywhere near the resolution we'd need to model the behavior of human cells, let alone the complex genetic, epigenetic, genomic, and environmental components that likely contribute to complex diseases. It is awesome that people are thinking about the future and the fastest way to science future is usually through science fiction, but this is way overstating the power of current or even currently achievable data science.

So does that mean data science for improving clinical trials right now should be abandoned?


There is tons of currently applicable and real world data science being done in sequential analysis,  adaptive clinical trials, and dynamic treatment regimes. These are important contributions that are impacting clinical trials right now and where advances can reduce costs, save patient harm, and speed the implementation of clinical trials. I think that is the hope of data science - using statistics and data to make steady, realizable improvement in the way we treat patients.

Posted in Uncategorized | Leave a comment