Simply Statistics


More commentary on Mayo v. Prometheus

Some more commentary on Mayo v. Prometheus via the Patently-O blog.

A summary of the various briefs and history of the case can be found at the SCOTUS blog.

Some actual news coverage of the decision.

The decision is well worth reading, if you’re that kind of nerd. Here, the Court uses the phrase “law of nature” a bit more loosely than perhaps I would use it. On the one hand, something like E=mc^2 might be considered a law of nature, but on the other hand I would consider the observation that certain blood metabolites are correlated with the occurrence of patient side effects as, well, a correlation. Einstein is referred to quite a few times in the opinion, no doubt in part because he himself worked in a patent office (and also discovered a few interesting laws of nature).

If one were to set aside the desire to do inference, then one could argue that in a given sample of people (random or not), any correlation observed within that sample is a “law of nature”, at least within that sample. Then if I draw a different sample and observe a different correlation, is that a different law of nature? Well, it might depend on whether it’s statistically significantly different.
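One common way to ask whether two sample correlations are "statistically significantly different" is to compare them on the Fisher z scale. The sketch below is only an illustration of that idea, assuming two independent samples and approximately bivariate normal data; the function name and the example numbers are made up, and nothing like this appears in the opinion.

```python
import math


def compare_correlations(r1, n1, r2, n2):
    """Two-sided z-test for equality of two independent Pearson correlations.

    r1, r2 are the sample correlations; n1, n2 are the sample sizes (each > 3).
    Returns the z statistic and an approximate two-sided p-value.
    """
    # Fisher's transformation makes the sampling distribution roughly normal
    z1 = math.atanh(r1)
    z2 = math.atanh(r2)
    # Approximate standard error of the difference on the z scale
    se = math.sqrt(1.0 / (n1 - 3) + 1.0 / (n2 - 3))
    z = (z1 - z2) / se
    # Two-sided p-value under the standard normal distribution
    p = math.erfc(abs(z) / math.sqrt(2))
    return z, p


# Hypothetical numbers: correlation 0.6 in one sample, 0.2 in another
z, p = compare_correlations(0.6, 103, 0.2, 103)
print(f"z = {z:.2f}, p = {p:.4f}")
```

A small p-value here would suggest the two samples are telling different stories, which is exactly the situation where calling either correlation a "law of nature" feels like a stretch.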

In the end, maybe it doesn’t matter, because no law of nature is patentable, no matter how many there are. I do find it interesting that the Court considered, in some sense, the possibility of statistical variation.

The Court also noted that simply ordering a bunch of steps together did not make a procedure patentable, if the things that were put together were things that doctors (or people in the profession) were already doing. The question becomes, if you take away the statistical correlation in the patent, is there anything left? No, because doctors were already treating patients with immune-mediated gastrointestinal disorders and those patients were already being tested for blood metabolites. 

This section of the decision caught my eye because it sounded a lot like the work of an applied statistician. Much of applied statistics involves taking methods and techniques that are already well known (lasso, anyone?) and applying them in new and interesting ways to new and interesting data. It seems taking a bunch of well-known processes/techniques and putting them together is not patentable, even if it is interesting. I don’t think I have a problem with that, but then again, getting patents isn’t my main goal.

Actual lawyers will be able to tell whether this case is significant. However, it seems there are many statistical correlations out there that are waiting to be turned into medical treatments. For example, take the Duke clinical trials saga. I don’t think it’s the case that none of these are patentable, because there still is the option of adding an “inventive concept” on top. However, it seems the simple algorithmic approach of “If X do this, and if Y do that” isn’t going to fly.


Laws of Nature and the Law of Patents: Supreme Court Rejects Patents for Correlations

This is a guest post by Reeves Anderson, an associate at Arnold & Porter LLP. Reeves Anderson is a member of the Appellate and Supreme Court practice group at Arnold & Porter LLP in Washington, D.C.  The views expressed herein are those of the author alone and not of Arnold & Porter LLP or any of the firm’s clients. Stay tuned for follow-up posts by the Simply Statistics crowd on the implications of this ruling for statistics in general and personalized medicine in particular. 

With the country’s attention focused on next week’s arguments over the constitutionality of President Obama’s health care law, the Supreme Court slipped in an important decision today concerning personalized medicine patents.  In Mayo Collaborative Services v. Prometheus Laboratories, the Court unanimously struck down medical diagnostic patents that concerned the use of thiopurine drugs in the treatment of autoimmune diseases.  Prometheus’s patents, which provided that doctors should increase or decrease a treatment dosage depending on metabolite correlations, were ineligible for patent protection, the Court held, because the patents “simply stated a law of nature.” 

As Jeff aptly described the issue in December, Prometheus’s patents sought to control a treatment process centered “on the basis of a statistical correlation.”  Specifically, when a patient ingests a thiopurine drug, metabolites form in the patient’s bloodstream.  Because the production of metabolites varies among patients, the same dosage of thiopurine causes different effects in different patients.  This variation makes it difficult for doctors to determine optimal treatment for a particular patient.  Too high of a dosage risks harmful side effects, whereas too low would be therapeutically ineffective. 

But measurement of a patient’s metabolite levels, in particular 6-thioguanine and its nucleotides (6-TG) and 6-methyl-mercaptopurine (6-MMP), is more closely correlated with the likelihood that a particular dosage of a thiopurine drug could cause harm or prove ineffective.  As the Court explained today, however, “those in the field did not know the precise correlations between metabolite levels and the likely harm or ineffectiveness.”  This is where Prometheus stepped in.  “The patent claims at issue here set forth processes embodying researchers’ findings that identified those correlations with some precision.”  Prometheus contended that blood concentrations of 6-TG or of 6-MMP above 400 and 7,000 picomoles per 8x10^8 red blood cells, respectively, could be toxic, while a concentration of 6-TG metabolite less than 230 pmol per 8x10^8 red blood cells is likely too low to be effective. 

Prometheus utilized this correlation by patenting a three-step method by which one (i) administers a drug providing 6-TG to a patient with an autoimmune disease; (ii) determines the level of 6-TG in the patient; and (iii) the administrator then can determine whether the thiopurine dosage should be adjusted accordingly.  Significantly, Prometheus’s patents did not include a treatment protocol and thus applied regardless of whether a doctor actually altered his treatment decision in light of the test—in other words, even if the doctor thought the correlations were wrong, irrelevant, or inapplicable to a particular patient.  And in fact, Mayo Clinic, the party challenging Prometheus’s patents, believed Prometheus’s correlations were wrong.  (Mayo’s toxicity levels were 450 and 5,700 pmol per 8x10^8 red blood cells for 6-TG and 6-MMP, respectively.  At oral argument on December 7, 2011, Mayo insisted that its numbers were “more accurate” than Prometheus’s.) 
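To see how the two sets of cutoffs can disagree about the same patient, the thresholds quoted above can be written as a simple rule. This is only an illustrative sketch: the function name, argument names, and labels are hypothetical, the example patient is invented, and the code is not taken from either party's test. Levels are in pmol per 8x10^8 red blood cells.

```python
def flag_levels(tg, mmp, tg_high, mmp_high, tg_low=None):
    """Label a pair of metabolite levels against a set of cutoffs.

    tg, mmp: measured 6-TG and 6-MMP levels.
    tg_high, mmp_high: levels above which toxicity is suspected.
    tg_low: optional level below which the dose may be too low to work.
    """
    if tg > tg_high or mmp > mmp_high:
        return "possibly toxic"
    if tg_low is not None and tg < tg_low:
        return "possibly too low"
    return "no flag"


# Invented patient: 6-TG = 420, 6-MMP = 5000
# Prometheus's cutoffs as quoted above (400, 7,000, and 230)
print(flag_levels(420, 5000, tg_high=400, mmp_high=7000, tg_low=230))
# Mayo's toxicity cutoffs as quoted above (450 and 5,700)
print(flag_levels(420, 5000, tg_high=450, mmp_high=5700))
```

With these made-up numbers the first call returns "possibly toxic" and the second returns "no flag", which is the kind of disagreement Mayo pointed to.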

Turning to the legal issues, both parties agreed that the correlations were “laws of nature,” which, by themselves, are not patentable.  As the Supreme Court has explained repeatedly, laws of nature, like natural phenomena and abstract ideas, are “manifestations of … nature, free to all men and reserved exclusively to none.”  This principle reflects a concern that patent law ought not inhibit further discovery and innovation by tying up the “basic tools of scientific and technological work.” 

In contrast, the application of a law of nature is patentable.  The question for the Court, then, was whether Prometheus’s patent claims “add enough to their statements of correlations to allow the process they describe to qualify as patent-eligible processes that apply natural laws.” 

The Court’s answer was no.  Distilled down, Prometheus’s “three steps simply tell doctors to gather data from which they may draw an inference in light of the correlations.”  The Court determined that Prometheus’s method simply informed the relevant audience (doctors treating patients with autoimmune diseases) about a law of nature, and that the additional steps of “administering” a drug and “determining” metabolite levels were “well-understood, routine, conventional activity already engaged in by the scientific community.”  “[T]he effect is simply to tell doctors to apply the law somehow when treating their patients.”   

Although I leave it to Jeff & company to assess the impact of today’s decision on the practice of personalized medicine, I have two principal observations.  First, it appears that the Court was disturbed by Mayo’s insistence that the correlations in Prometheus’s patents were wrong, and that patent protection would prevent Mayo from improving upon them.  Towards the end of the opinion, Justice Breyer wrote that the patents “threaten to inhibit the development of more refined treatment recommendations (like that embodied in Mayo’s test), that combine Prometheus’s correlations with later discovered features of metabolites, human physiology or individual patient characteristics.”  The worry of stifling future innovation applies to every patent, but the Court seemed especially attuned to that concern here, perhaps due in part to Mayo’s insistence that its “better” test could not be used to help patients. 

Second, Mayo argued that a decision in its favor would reduce the costs of challenging similar patents that purported to “apply” a natural law.  Mayo’s argument was in response to the position of the U.S. Government, which participated in the case as amicus curiae (“friend of the court”).  The Government urged the Court not to rule on the threshold issue of whether Prometheus’s patents applied a law of nature, but rather to strike down the patents because they lacked “novelty” or were “obvious in light of prior art.”  The questions of novelty and obviousness, Mayo argued, are much more fact-intensive and expensive to litigate.  Whether or not the Court agreed with Mayo’s argument, it declined to follow the Government’s advice.  To skip the threshold question, the Court concluded, “would make the ‘law of nature’ exception … a dead letter.” 

Many Supreme Court watchers will now turn their attention to another patent case that has been waiting in the wings, Association for Molecular Pathology v. Myriad Genetics, which asks the Court to decide whether human genes are patentable.  Predictions anyone?


Supreme court unanimously rules against personalized medicine patent!

Just a few minutes ago the Supreme Court released its decision in the Mayo case; see here for the Simply Statistics summary of the case. The Court ruled unanimously that the personalized medicine test could not be patented. Such a strong ruling likely has major implications going forward for the field of personalized medicine. At the end of the day, this decision was based on an interpretation of statistical correlation. Stay tuned for a special in-depth analysis in the next couple of days that will get into the details of the ruling and the implications for personalized medicine. 


Interview with Amy Heineike - Director of Mathematics at Quid

Amy Heineike

Amy Heineike is the Director of Mathematics at Quid, a startup that seeks to understand technology development and dissemination through data analysis. She was the first employee at Quid, where she helped develop their technology early on. She has been recognized as one of the top Big Data Scientists. As a part of our ongoing interview series we talked to Amy about data science, Quid, and how statisticians can get involved in the tech scene. 

Which term applies to you: data scientist, statistician, computer scientist, or something else?
Data Scientist fits better than any, because it captures the mix of analytics, engineering and product management that is my current day to day.  
When I started with Quid I was focused on R&D - developing the first prototypes of what are now our core analytics technologies, and working to define and QA new data streams.  This required the analysis of lots of unstructured data, like news articles and patent filings, as well as the end visualisation and communication of the results.  
After we raised VC funding last year I switched to building our data science and engineering teams out.  These days I jump from conversations with the team about ideas for new analysis, to defining refinements to our data model, to questions about scalable architecture and filling out pivotal tracker tickets.  The core challenge is translating the vision for the product back to the team so they can build it.
How did you end up at Quid?
In my previous work I’d been building models to improve our understanding of complex human systems - in particular the complex interaction of cities and their transportation networks in order to evaluate the economic impacts of Crossrail, a new train line across London, and the implications of social networks on public policy.  Through this work it became clear that data was the biggest constraint - I became fascinated by a quest to find usable data for these questions - and that’s what led me to Silicon Valley.  I knew the founders of Quid from University, and approached them with the idea of analysing their data according to ideas I’d had - especially around network analysis - and the initial work we collaborated on became core to the founding technology of Quid.
Who were really good mentors to you? What were the qualities that helped you? 
I’ve been fortunate to work with some brilliant people in my career so far.  While I still worked in London I worked closely with two behavioural economists - Paul Ormerod, who’s written some fantastic books on the subject (most recently Why Things Fail), and Bridget Rosewell, until recently the Chief Economist to the Greater London Authority (the city government for London).  At Quid I’ve had a very productive collaboration with Sean Gourley, our CTO.
One unifying characteristic of these three is their ability to communicate complex ideas in a powerful way to a broad audience.  It’s an incredibly important skill: a core part of analytics work is taking the results to where they are needed, which is often beyond those who know the technical details, to those who care about the implications first.
How does Quid determine relationships between organizations and develop insight based on data? 
The core questions our clients ask us are around how technology is changing and how this impacts their business.  That’s a really fascinating and huge question that requires not just discovering a document with the answer in it, but organizing lots and lots of pieces of data to paint a picture of the emergent change.  What we can offer is not only being able to find a snapshot of that, but also being able to track how it changes over time.
We organize the data firstly through the insight that much disruptive technology emerges in organizations, and that the events that occur between and to organizations are a fantastic way to signal both the traction of technologies and to observe strategic decision making by key actors.
The first kind of relationship that’s important is of the transactional type (who is acquiring, funding or partnering with whom), and the second is an estimate of the technological clustering of organizations (what trends particular organizations represent).  Both of these can be discovered through documents about them, including government filings, press releases and news, but this requires analysis of unstructured natural language.  
We’ve experimented with some very engaging visualisations of the results, and have had particular success with network visualisations, which are a very powerful way of allowing people to interact with a large amount of data in a quite playful way.  You can see some of our analyses in the press links at
What skills do you think are most important for statisticians/data scientists moving into the tech industry?
Technical statistical chops are the foundation. You need to be able to take a dataset and discover and communicate what’s interesting about it for your users.  To turn this into a product requires understanding how to turn one-off analysis into something reliable enough to run day after day, even as the data evolves and grows, and as different users experience different aspects of it.  A key part of that is being willing to engage with questions about where the data comes from (how it can be collected, stored, processed and QAed on an ongoing basis), how the analytics will be run (how will it be tested, distributed and scaled) and how people interact with it (through visualisations, UI features or static presentations?).  
For your ideas to become great products, you need to become part of a great team though!  One of the reasons that such a broad set of skills are associated with Data Science is that there are a lot of pieces that have to come together for it to all work out - and it really takes a team to pull it off.  Generally speaking, the earlier stage the company that you join, the broader the range of skills you need, and the more scrappy you need to be about getting involved in whatever needs to be done.  Later stage teams, and big tech companies may have roles that are purer statistics.
Do you have any advice for grad students in statistics/biostatistics on how to get involved in the start-up community or how to find a job at a start-up? 
There is a real opportunity for people who have good statistical and computational skills to get into the startup and tech scenes now.  Many people in Data Science roles have statistics and biostatistics backgrounds, so you shouldn’t find it hard to find kindred spirits.
We’ve always been especially impressed with people who have built software in a group and shared or distributed that software in some way.  Getting involved in an open source project, working with version control in a team, or sharing your code on github are all good ways to start on this.
It’s really important to be able to show that you want to build products though.  Imagine the clients or users of the company and see if you get excited about building something that they will use.  Reach out to people in the tech scene, explore who’s posting jobs - and then be able to explain to them what it is you’ve done and why it’s relevant, and be able to think about their business and how you’d want to help contribute towards it.  Many companies offer internships, which could be a good way to contribute for a short period and find out if it’s a good fit for you.


Sunday data/statistics link roundup (3/18)

  1. A really interesting proposal by Rafa (in Spanish - we’ll get on him to write a translation) for the University of Puerto Rico. The post concerns changing the focus from simply teaching to creating knowledge and the potential benefits to both the university and to Puerto Rico. It also has a really nice summary of the benefits that the university system in the United States has produced. Definitely worth a read. The comments are also interesting; it looks like Rafa’s post is pretty controversial…
  2. An interesting article suggesting that the Challenger Space Shuttle disaster was at least in part due to bad data visualization. Via @DatainColour
  3. The Snyderome is getting a lot of attention in genomics circles. He used as many new technologies as he could to measure a huge amount of molecular information about his body over time. I am really on board with the excitement about measurement technologies, but this poses a huge challenge for statistics and statistical literacy. If this kind of thing becomes commonplace, the potential for false positives and ghost diagnoses is huge without a really good framework for uncertainty. Via Peter S. 
  4. More news about the Nike API. Now that is how to unveil some data! 
  5. Add the Nike API to the list of potential statistics projects for students. 

Peter Norvig on the "Unreasonable Effectiveness of Data"

“The Unreasonable Effectiveness of Data”, a talk by Peter Norvig of Google. Sometimes, more data is more better. (Thanks to John C. for the link.)


A proposal for a really fast statistics journal

I know we need a new journal like we need a good poke in the eye. But I got fired up by the recent discussion of open science (by Paul Krugman and others) and the seriously misguided Research Works Act, which aimed to make it illegal to deposit published papers funded by the government in PubMed Central or other open access databases.

I also realized that I spend a huge amount of time/effort on the following things: (1) waiting for reviews (typically months), (2) addressing reviewer comments that are unrelated to the accuracy of my work - just adding citations to referees’ papers or doing additional simulations, and (3) resubmitting rejected papers to new journals - this is a huge time suck since I have to reformat, etc. Furthermore, if I want my papers to be published open-access, I also realized I have to pay at minimum $1,000 per paper.

So I thought up my criteria for an ideal statistics journal. It would be accurate, have fast review times, and not discriminate based on how interesting an idea is. I have found that my most interesting ideas are the hardest ones to get published.  This journal would:

  • Be open-access and free to publish your papers there. You own the copyright on your work. 
  • The criteria for publication would be: (1) it has to do with statistics, computation, or data analysis, (2) the work is technically correct. 
  • We would accept manuals, reports of new statistical software, and full length research articles. 
  • There would be no page limits/figure limits. 
  • The journal would be published exclusively online. 
  • We would guarantee reviews within 1 week and publication immediately upon review if criteria (1) and (2) are satisfied.
  • Papers would receive a star rating from the editor - 0-5 stars. There would be a place for readers to also review articles.
  • All articles would be published with a tweet/like button so they can be easily distributed
To achieve such a fast review time, here is how it would work. We would have a large group of Associate Editors (hopefully 30 or more). When a paper was received, it would be assigned to an AE. The AEs would agree to referee papers within 2 days. They would use a form like this:
  • Review of: Jeff’s Paper
  • Technically Correct: Yes
  • About statistics/computation/data analysis: Yes
  • Number of Stars: 3 stars

  • 3 Strengths of Paper (1 required): 
  • This paper revolutionizes statistics 

  • 3 Weaknesses of Paper (1 required): 
  • The proof that this paper revolutionizes statistics is pretty weak because he only includes one example.
That’s it, super quick, super simple, so it wouldn’t be hard to referee. As long as the answers to the first two questions were yes, it would be published. 
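To show how little machinery the form needs, here is a minimal sketch of it as a data structure with the acceptance rule attached. The class, field, and method names are all made up for illustration; the only rules encoded are the ones stated above (0-5 stars, between 1 and 3 strengths and weaknesses, publish if both yes/no questions are answered yes).

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Review:
    """One associate editor's review, mirroring the form above (names are hypothetical)."""
    paper: str
    technically_correct: bool
    about_statistics: bool  # statistics/computation/data analysis
    stars: int              # editor's rating, 0-5
    strengths: List[str] = field(default_factory=list)
    weaknesses: List[str] = field(default_factory=list)

    def problems(self):
        """Return a list of ways the form is incomplete (empty if it is fine)."""
        issues = []
        if not 0 <= self.stars <= 5:
            issues.append("stars must be between 0 and 5")
        if not 1 <= len(self.strengths) <= 3:
            issues.append("list between 1 and 3 strengths")
        if not 1 <= len(self.weaknesses) <= 3:
            issues.append("list between 1 and 3 weaknesses")
        return issues

    def publish(self):
        """Publication depends only on the two yes/no criteria, not on the stars."""
        return self.technically_correct and self.about_statistics


review = Review(
    paper="Jeff's Paper",
    technically_correct=True,
    about_statistics=True,
    stars=3,
    strengths=["This paper revolutionizes statistics"],
    weaknesses=["The proof is pretty weak because he only includes one example"],
)
print(review.problems(), review.publish())
```

The star rating is deliberately kept out of `publish()`: in this proposal it is information for readers, not a gate.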
So now here are my questions: 
  1. Would you ever consider submitting a paper to such a journal?
  2. Would you be willing to be one of the AEs for such a journal? 
  3. Is there anything you would change? 

Sunday Data/Statistics Link Roundup (3/11)

  1. This is the big one. ESPN has opened up access to their API! It looks like there may only be access to some of the data for the general public though. Does anyone know more? 
  2. Looks like ESPN isn’t the only sports-related organization in the API mood, Nike plans to open up an API too. It would be great if they had better access to individual, downloadable data. 
  3. Via Leonid K.: a highly influential psychology study failed to replicate in a study published in PLoS One. The author of the original study went off on the author of the paper, on PLoS One, and on the reporter who broke the story (including personal attacks!). It looks like the authors of the PLoS One paper actually did a more careful study than the original authors to me. The authors of the PLoS One paper, the reporter, and the editor of PLoS One all replied in a much more reasonable way. See this excellent summary for all the details. Here are a few choice quotes from the comments: 

1. But there’s a long tradition in social psychology of experiments as parables,

2. I’d love to write a really long response, but let’s just say: priming methods like these fail to replicate all the time (frequently in my own studies), and the news that one of Bargh’s studies failed to replicate is not surprising to me at all.

3. This distinction between direct and conceptual replication helps to explain why a psychologist isn’t particularly concerned whether Bargh’s finding replicates or not.

  4. Reproducible != Replicable in scientific research. But Roger’s perspective on reproducible research still seems appropriate here.