Simply Statistics


Please save the unsolicited R01s

Editor's note: With the sequestration deadline hours away, the careers of many young US scientists are on the line. In this guest post, our colleague Steven Salzberg, an avid defender of NIH and its peer review process, tells us why now more than ever the NIH should prioritize funding R01s over other project grants.

First let's get the obvious facts out of the way: the federal budget is a mess, and Congress is completely dysfunctional. When it comes to NIH funding, this is not a good thing.

Hidden within the larger picture, though, is a serious menace to our decades-long record of incredibly successful research in the United States.  The investigator-driven, basic research grant is in even worse shape than the overall NIH budget.  A recent analysis by FASEB, shown in the figure here, reveals that the number of new R01s reached its peak in 2003 - ten years ago! - and has been steadily declining since.  In 2003, 7,430 new R01s were awarded.  In 2012, that number had dropped to 5,437, a 27% decline.
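
As a quick check on the arithmetic above, here is a minimal Python sketch that recomputes the decline from the two award counts quoted in this post. The function name percent_decline is just an illustrative helper, not anything from the FASEB analysis.

```python
def percent_decline(start: int, end: int) -> float:
    """Return the percentage drop from start to end."""
    return 100.0 * (start - end) / start

# New R01 awards quoted above: 7,430 in 2003 and 5,437 in 2012.
decline = percent_decline(7430, 5437)
print(f"{decline:.1f}% decline")  # prints "26.8% decline", i.e. about 27%
```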


For those who might not be familiar with the NIH system, the R01 grant is the crown jewel of research grants.  R01s are awarded to individual scientists to pursue all varieties of biomedical research, from very basic science to clinical research.  For R01s, NIH doesn't tell the scientists what to do: we propose the ideas, we write them up, and then NIH organizes a rigorous peer review (which isn't perfect, but it's the best system anyone has).  Only the top-scoring proposals get funded.

This process has gotten much tougher over the years.  In 1995, the success rate for R01s was 25.9%.  Today it is 18.4% and falling.  This includes applications from everyone, even the most experienced and proven scientists.  Thus no matter who you are, you can expect that there is more than an 80% chance that your grant application will be turned down.  In some areas it is even worse: NIAID's website announced that it is currently funding only 6% of R01s.

Why are R01s declining?  Not for lack of interest: the number of applications last year was 29,627, an all-time high.  Besides the overall budget problem, another problem is growing: the fondness of the NIH administration for big, top-down science projects, many times with the letters "ome" or "omics" attached.

Yes, the human genome was a huge success. Maybe the human microbiome will be too. But now NIH is pushing gigantic, top-down projects: ENCODE, 1000 Genomes, the Cancer Genome Anatomy Project (CGAP), The Cancer Genome Atlas (TCGA), a new "brain-ome" project, and more. The more money is allocated to these big projects, the fewer R01s NIH can fund. For example, NIAID, with its 6% R01 success rate, has been spending tens of millions of dollars per year on 3 large Microbial Genome Sequencing Center contracts and tens of millions more on 5 large Bioinformatics Resource Center contracts. As far as I can tell, no one uses these bioinformatics resource centers for anything - in fact, virtually no one outside the centers even knows they exist. Furthermore, these large, top-down driven sequencing projects don't address specific scientific hypotheses, but they produce something that the NIH administration seems to love: numbers. It's impressive to see how many genomes they've sequenced, and it makes for nice press releases. But very often we simply don't need these huge, top-down projects to answer scientific questions. Genome sequencing is cheap enough that we can include it in an R01 grant, if only NIH will stop pouring all its sequencing money into these huge, monolithic projects.

I'll be the first person to cheer if Congress gets its act together and funds NIH at a level that allows reasonable growth. But whether or not that happens, the growth of big science projects, often created and run by administrators at NIH rather than scientists who have successfully competed for R01s, represents a major threat to the scientist-driven research that has served the world so well for the past 50 years. Many scientists are afraid to speak out against this trend, because by doing so we (yes, this includes me) are criticizing those same NIH administrators who manage our R01s. But someone has to say something. A 27% decline in the number of R01s over the past decade is not a good thing. Maybe it's time to stop the omics train.


Big data: Giving people what they want

Netflix is using data to create original content for its subscribers, the first example of which was House of Cards. Three main data points for this show were that (1) people like David Fincher (because they watch The Social Network, like, all the time); (2) people like Kevin Spacey; and (3) people liked the British version of House of Cards. Netflix obviously has tons of other data, including when you stop, pause, or rewind certain scenes in a movie or TV show.

Netflix has always used data to decide which shows to license, and now that expertise is extended to the first-run. And there was not one trailer for “House of Cards,” there were many. Fans of Mr. Spacey saw trailers featuring him, women watching “Thelma and Louise” saw trailers featuring the show’s female characters and serious film buffs saw trailers that reflected Mr. Fincher’s touch.

Using data to program television content is about as new as Brylcreem, but Netflix has the Big Data and has direct interaction with its viewers (so does Amazon Prime, which apparently is also looking to create original content). So the question is, does it work? My personal opinion is that it's probably not any worse than previous methods, but may not be a lot better. But I would be delighted to be proven wrong. From my walks around the hallway here it seems House of Cards is in fact a good show (I haven't seen it). But one observation probably isn't enough to draw a conclusion here.

John Landgraf of FX Networks thinks Big Data won't help:

“Data can only tell you what people have liked before, not what they don’t know they are going to like in the future,” he said. “A good high-end programmer’s job is to find the white spaces in our collective psyche that aren’t filled by an existing television show,” adding, those choices were made “in a black box that data can never penetrate.”

I was a bit confused when I read this, but I'm pretty sure the word "programmer" here refers to a television programmer. This quote is reminiscent of Steve Jobs' line about how it's not the consumer's job to know what he/she wants. It also reminds me of financial markets, where all the data in the world can only tell you about the past.

In the end, can any of it help you predict the future? Or do some people just get lucky?



Sunday data/statistics link roundup (2/24/2013)

  1. An attempt to create a version of knitr for Stata (via John M.). I like the direction that reproducible research is moving - toward easier use and more widespread adoption. The success of the IPython Notebook is another great sign for the whole research area.
  2. Email is always a problem for me. In the last week I've been introduced to a couple of really nice apps that give me insight into my email habits (Gmail meter - via John M.) and that help me to send reminders to myself with minimal hassle (Boomerang - via Brian C.).
  3. Andrew Lo proposes a new model for cancer research funding based on his research in financial engineering. In light of the impending sequester I'm interested in alternative funding models for data science/statistics in biology. But the concerns I have about both crowd-funding and Lo's idea are whether the basic scientists get hosed and whether sustained funding at a level that will continue to attract top scientists is possible.
  4. This is a really nice rundown of why medical costs are so high. The key things in the article to me are that (1) he chased down the data about actual costs versus charges, (2) he highlights the role of the chargemaster - the price setter in medical centers - and how the prices are often set historically with yearly markups (not based on estimates of costs, etc.), and (3) he discusses key nuances like medical liability if the "best" tests aren't run on everyone. Overall, it is definitely worth a read and this seems like a hugely important problem a statistician could really help with (if they could get their hands on the data).
  5. A really cool applied math project where flying robot helicopters toss and catch a stick. Applied math can be super impressive, but they always still need a little boost from statistics: "This also involved bringing the insights gained from their initial and many subsequent experiments to bear on their overall system design. For example, a learning algorithm was added to account for model inaccuracies." (via Rafa via MR).
  6. We've talked about trying to reduce meetings to increase productivity before. Here is an article in the NYT talking about the same issue (via Rafa via Karl B.). Brian C. made an interesting observation though, that in a soft money research environment there should be evolutionary pressure against anything that doesn't improve your ability to obtain research funding. Despite this, meetings proliferate in soft-money environments. So there must be some selective advantage to them! Another interesting project for a stats/evolutionary biology student.
  7. If you have read all the Simply Statistics interviews and still want more, check out



Tesla vs. NYT: Do the Data Really Tell All?

So far I've enjoyed the back and forth between Tesla Motors and New York Times reporter John Broder. The short version is:

  • Broder tested one of Tesla's new Model S all-electric sedans on a drive from Washington, D.C. to Groton, CT. Part of the reason for this specific trip was to make use of Tesla's new supercharger stations along the route (one in Delaware and one in Connecticut).
  • Broder's trip appeared to have some bumps, including running out of electricity at one point and requiring a tow.
  • After the review was published in the New York Times, Elon Musk, the CEO/Founder of Tesla, was apparently livid. He published a detailed response on the Tesla blog explaining that what Broder wrote in his review was not true and that "he simply did not accurately capture what happened and worked very hard to force our car to stop running".
  • Broder has since responded to Musk's response with further explanation.

Of course, the most interesting aspect of Musk's response on the Tesla blog was that he published the data collected by the car during Broder's test drive. When revelations of this data came about, I thought it was a bit creepy, but Musk makes clear in his post that they require data collection for all reviewers because of a previous bad experience. So, the fact that data were being collected on speed, cabin temperature, battery charge %, and rated range remaining, was presumably known to all, especially Broder. Given that you know Big Brother Musk is watching, it seems odd to deliberately lie in a widely read publication like the Times.

Having read the original article, Musk's response, and Broder's rebuttal, one thing is clear to me--there's more than one way to see the data. The challenge here is that Broder had the car, but not the data, so had to rely on his personal recollection and notes. Musk has the data, but wasn't there, and so has to rely on peering at graphs to interpret what happened on the trip.

One graph in particular was fascinating. Musk shows a periodic-looking segment of the speed graph and concludes

Instead of plugging in the car, he drove in circles for over half a mile in a tiny, 100-space parking lot. When the Model S valiantly refused to die, he eventually plugged it in.

Broder claims

I drove around the Milford service plaza in the dark looking for the Supercharger, which is not prominently marked. I was not trying to drain the battery. (It was already on reserve power.) As soon as I found the Supercharger, I plugged the car in.

Okay, so who's right? Isn't the data supposed to settle this?

In a few other cases in this story, the data support both people. In particular, it seems that there was some serious miscommunication between Broder and Tesla's staff. I'm sure they also have recordings of those telephone calls, but they were not reproduced in Musk's response.

The bottom line here, in my opinion, is that sometimes the data don't tell all, especially "big data". In the end, data are one thing, interpretation is another. Tesla had reams of black-box data from the car and yet some of the data still appear to be open to interpretation. My guess is that the data Tesla collects are not collected specifically to root out liars, and so are maybe not optimized for this purpose. Which leads to another key point about big data--they are often used "off-label", i.e. not for the purpose for which they were originally collected.

I read this story with interest because I actually think Tesla is a fascinating company that makes cool products (that sadly, I could never afford). This episode will surely not be the end of Tesla or of the New York Times, but it illustrates to me that simply "having the data" doesn't necessarily give you what you want.


Sunday data/statistics link roundup (2/17/2013)

  1. The Why Axis - discussion of important visualizations on the web. This is one I think a lot of people know about, but it is new to me. (via Thomas L. - p.s. I'm @leekgroup on Twitter, not @jtleek). 
  2. This paper says that people who "engage in outreach" (read: write blogs) tend to have higher academic output (hooray!) but that outreach itself doesn't help their careers (boo!).
  3. It is a little too late for this year, but next year you could make a Valentine with R.
  4. An email charter (via Rafa). This is pretty similar to my getting email responses from busy people. Not sure who scooped who. I'm still waiting for my to-do list app. Mailbox is close, but I still want actions to be multiple choice or yes/no or delegation rather than just snoozing emails for later.
  5. Top ten reasons not to share your code, and why you should anyway.

Interview with Nick Chamandy, statistician at Google

Nick Chamandy received his M.S. in statistics from the University of Chicago and his Ph.D. in statistics from McGill University, and then joined Google as a statistician. We talked to him about how he ended up at Google, what software he uses, and how big the Google data sets are. To read more interviews - check out our interviews page.

SS: Which term applies to you: data scientist, statistician, computer scientist, or something else?

NC: I usually use the term Statistician, but at Google we are also known as Data Scientists or Quantitative Analysts. All of these titles apply to some degree. As with many statisticians, my day to day job is a mixture of analyzing data, building models, thinking about experiments, and trying to figure out how to deal with large and complex data structures. When posting job opportunities, we are cognizant that people from different academic fields tend to use different language, and we don't want to miss out on a great candidate because he or she comes from a non-statistics background and doesn't search for the right keyword. On my team alone, we have had successful "statisticians" with degrees in statistics, electrical engineering, econometrics, mathematics, computer science, and even physics. All are passionate about data and about tackling challenging inference problems.

SS: How did you end up at Google?

NC: Coming out of my PhD program at McGill, I was somewhat on the fence about the academia vs. industry decision. Ideally I wanted an opportunity that combined the intellectual freedom and stimulation of academia with the concreteness and real-world relevance of industrial problems. Google seemed to me at the time (and still does) to be by far the most exciting place to pursue that happy medium. The culture at Google emphasizes independent thought and idea generation, and the data are staggering in both size and complexity. That places us squarely on the "New Frontier" of statistical innovation, which is really motivating. I don't know of too many other places where you can both solve a research problem and have an impact on a multi-billion dollar business in the same day.

SS: Is your work related to the work you did as a Ph.D. student?

NC: Although I apply many of the skills I learned in grad school on a daily basis, my PhD research was on Gaussian random fields, with particular application to brain imaging data. The bulk of my work at Google is in other areas, since I work for the Ads Quality Team, whose goal is to quantify and improve the experience that users have interacting with text ads on the search results page. Once in a while though, I come across data sets with a spatial or spatio-temporal component and I get the opportunity to leverage my experience in that area. Some examples are eye-tracking studies run by the user research lab (measuring user engagement on different parts of the search page), and click pattern data. These data sets typically violate many of the assumptions made in neuroimaging applications, notably smoothness and isotropy conditions. And they are predominantly 2-D applications, as opposed to 3-D or higher.

SS: What is your programming language of choice, R, Python or something else?

NC: I use R, and occasionally MATLAB, for data analysis. There is a large, active and extremely knowledgeable R community at Google. Because of the scale of Google data, however, R is typically only useful after a massive data aggregation step has been accomplished. Before that, the data are not only too large for R to handle, but are stored on many thousands of machines. This step is usually accomplished using the MapReduce parallel computing framework, and there are several Google-developed scripting languages that can be used for this purpose, including Go. We also have an interactive, ad hoc query language which can be applied to massive, "sharded" data sets (even those with a nested structure), and for which there is an R API. The engineers at Google have also developed a truly impressive package for massive parallelization of R computations on hundreds or thousands of machines. I typically use shell or Python scripts for chaining together data aggregation and analysis steps into "pipelines".
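
To make the "aggregate first, analyze later" workflow described in this answer concrete, here is a toy, single-machine sketch of a map/reduce-style aggregation in Python. It is not Google's MapReduce or any internal tool: the shard contents, the field names, and the helper functions map_shard and reduce_counts are all hypothetical. The only point is that per-shard summaries are small enough to merge and hand off to something like R.

```python
from collections import Counter
from functools import reduce

# Hypothetical "shards": each list stands in for raw log records on one machine.
shards = [
    [{"query": "shoes", "clicked": True}, {"query": "shoes", "clicked": False}],
    [{"query": "hats", "clicked": True}, {"query": "shoes", "clicked": True}],
]

def map_shard(records):
    """Map step: summarize one shard as (impressions, clicks) counters keyed by query."""
    impressions, clicks = Counter(), Counter()
    for r in records:
        impressions[r["query"]] += 1
        clicks[r["query"]] += int(r["clicked"])
    return impressions, clicks

def reduce_counts(a, b):
    """Reduce step: merge two per-shard summaries by adding their counters."""
    return a[0] + b[0], a[1] + b[1]

impressions, clicks = reduce(reduce_counts, map(map_shard, shards))

# The aggregated table is tiny compared with the raw records.
click_rate = {q: clicks[q] / impressions[q] for q in impressions}
print(click_rate)
```

In a real setting the map step would run where the data live and only the small summaries would travel; here everything runs in one process purely for illustration.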

SS: How big are the data sets you typically handle? Do you extract them yourself or does someone else extract them for you?

NC: Our data sets contain billions of observations before any aggregation is done. Even after aggregating down to a more manageable size, they can easily consist of 10s of millions of rows, and on the order of 100s of columns. Sometimes they are smaller, depending on the problem of interest. In the vast majority of cases, the statistician pulls his or her own data -- this is an important part of the Google statistician culture. It is not purely a question of self-sufficiency. There is a strong belief that without becoming intimate with the raw data structure, and the many considerations involved in filtering, cleaning, and aggregating the data, the statistician can never truly hope to have a complete understanding of the data. For massive and complex data, there are sometimes as many subtleties in whittling down to the right data set as there are in choosing or implementing the right analysis procedure. Also, we want to guard against creating a class system among data analysts -- every statistician, whether BS, MS or PhD level, is expected to have competence in data pulling. That way, nobody becomes the designated data puller for a colleague. That said, we always feel comfortable asking an engineer or other statistician for help using a particular language, code library, or tool for the purpose of data-pulling. That is another important value of the Google culture -- sharing knowledge and helping others get "unstuck".

SS: Do you work collaboratively with other statisticians/computer scientists at Google? How do projects you work on get integrated into Google's products, is there a process of approval?

NC: Yes, collaboration with both statisticians and engineers is a huge part of working at Google. In the Ads Team we work on a variety of flavours of statistical problems, spanning but not limited to the following categories: (1) Retrospective analysis with the goal of understanding the way users and advertisers interact with our system; (2) Designing and running randomized experiments to measure the impact of changes to our systems; (3) Developing metrics, statistical methods and tools to help evaluate experiment data and inform decision-making; (4) Building models and signals which feed directly into our engineering systems. "Systems" here are things like the algorithms that decide which ads to display for a given query and context.

Clearly (2) and (4) require deep collaboration with engineers -- they can make the changes to our production codebase which deploy a new experiment or launch a new feature in a prediction model. There are multiple engineering and product approval steps involved here, meant to avoid introducing bugs or features which harm the user experience. We work with engineers and computer scientists on (1) and (3) as well, but to a lesser degree. Engineers and computer scientists tend to be extremely bright and mathematically-minded people, so their feedback on our analyses, methodology and evaluation tools is pretty invaluable!

SS: Who have been good mentors to you during your career? Is there something in particular they did to help you?

NC: I've had numerous important mentors at Google (in addition, of course, to my thesis advisors and professors at McGill). Largely they are statisticians who have worked in industry for a number of years and have mastered the delicate balance between deep-thinking a problem and producing something quick and dirty that can have an immediate impact. Grad school teaches us to spend weeks thinking about a problem and coming up with an elegant or novel methodology to solve it (sometimes without even looking at data). This process certainly has its place, but in some contexts a better outcome is to produce an unsophisticated but useful and data-driven answer, and then refine it further as needed. Sometimes the simple answer provides 80% of the benefit, and there is no reason to deprive the consumers of your method this short-term win while you optimize for the remaining 20%. By encouraging the "launch and iterate" mentality for which Google is well-known, my mentors have helped me produce analysis, models and methods that have a greater and more immediate impact.

SS: What skills do you think are most important for statisticians/data scientists moving into the tech industry?

NC: Broadly, statisticians entering the tech industry should do so with an open mind. Technically speaking, they should be comfortable with heavy-tailed, poorly-behaved distributions that fail to conform to assumptions or data structures underlying the models taught in most statistics classes. They should not be overly attached to the ways in which they currently interact with data sets, since most of these don't work for web-scale applications. They should be receptive to statistical techniques that require massive amounts of data or vast computing networks, since many tech companies have these resources at their disposal. That said, a statistician interested in the tech industry should not feel discouraged if he or she has not already mastered large-scale computing or the hottest programming languages. To me, it is less about what skills one must brush up on, and much more about a willingness to adaptively learn new skills and adjust one's attitude to be in tune with the statistical nuances and tradeoffs relevant to this New Frontier of statistics. Statisticians in the tech industry will be well-served by the classical theory and techniques they have mastered, but at times must be willing to re-learn things that they have come to regard as trivial. Standard procedures and calculations can quickly become formidable when the data are massive and complex.


I'm a young scientist and sequestration will hurt me

I'm a biostatistician. That means that I help scientists and doctors analyze their medical data to try to figure out new screening tools, new therapies, and new ways to improve patients' health. I'm also a professor. I  spend a good fraction of my time teaching students about analyzing data in classes here at my university and online. Big data/data analysis is an area of growth for the U.S. economy and some have even suggested that there will be a critical shortage of trained data analysts.

I have other responsibilities but these are the two biggies - teaching and research. I work really hard to be good at them because I'm passionate about education and I'm passionate about helping people. I'm by no means the only (relatively) young person with this same drive. I would guess this is a big reason why a lot of people become scientists. They want to contribute to both our current knowledge (research) and the future of knowledge (teaching).

My salary comes from two places - the students who pay tuition at our school and, to a much larger extent, the federal government's research funding through the NIH. So you are paying my salary. The way that the NIH distributes that funding is through a serious and very competitive process. I submit proposals of my absolute best ideas, so do all the other scientists in the U.S., and they are evaluated by yet another group of scientists who don't have a vested interest in our grants. This system is the reason that only the best, most rigorously vetted science is funded by taxpayer money.

It is very hard to get a grant. In 2012, between 7% and 16% of new projects were funded. So you have to write a proposal that is better than 84-93% of all other proposals being submitted by other really, really smart and dedicated scientists. The practical result is that it is already very difficult for a good young scientist to get a grant. The NIH recognizes this and implements special measures for new scientists to get grants, but it still isn't easy by any means.

Sequestration will likely dramatically reduce the fraction of grants that get funded. Already on that website, the "payline" or cutoff for funding, has dropped from 10% of grants in 2012 to 6% in 2013 for some NIH institutes. If sequestration goes through, it will be worse - maybe a lot worse. The result is that it will go from being really hard to get individual grants to nearly impossible. If that happens, many young scientists like me won't be able to get grants. No matter how passionate we are about helping people or doing the right thing, many of us will have to stop being researchers and scientists and get other jobs to pay the bills - we have to eat.

So if sequestration or other draconian cuts to the NIH go through, they will hurt me and other junior scientists like me. It will make it harder - if not impossible - for me to get grants. It will affect whether I can afford to educate the future generation of students who will analyze all the data we are creating. It will create dramatic uncertainty/difficulty in the lives of the young biological scientists I work with who may not be able to rely on funding from collaborative grants to the extent that I can. In the end, this will hurt me, it will hurt my other scientific colleagues, and it could dramatically reduce our competitiveness in science, technology, engineering, and mathematics (STEM) for years to come. Steven wrote this up beautifully on his blog.

I know that these cuts will also affect the lives of many other people from all walks of life, not just scientists. So I hope that Congress will do the right thing and decide that hurting all these people isn't worth the political points they will score - on both sides. Sequestration isn't the right choice - it is the choice that was most politically expedient when people's backs were against the wall.

Instead of making dramatic, untested, and possibly disastrous cuts across the board for political reasons, let's do what scientists and statisticians have been doing for years when deciding which drugs work and don't. Let's run controlled studies and evaluate the impact of budget cuts to different programs - as Ben Goldacre and his colleagues have so beautifully laid out in their proposal. That way we can bring our spending into line, but sensibly and based on evidence, rather than the politics of the moment or untested economic models not based on careful experimentation.


Sunday data/statistics link roundup (2/10/2013)

  1. An article about how NBA teams have installed cameras that allow their analysts to collect information on every movement/pass/play that is performed in a game. I think the most interesting part for me would be how you would define features. They talk about, for example, how many times a  player drives. I wonder if they have an intern in the basement manually annotating those features or if they are using automatic detection algorithms (via Marginal Revolution).
  2. Our friend Florian jumps into the MIC debate. I haven't followed the debate very closely, but I agree with Florian that if a theory paper  is published in a top journal, later falling back on heuristics and hand waving seems somewhat unsatisfying.
  3. An opinion piece pushing the Journal of Negative Results in Biomedicine. If you can't get your negative result in there, think about our P > 0.05 journal :-).
  4. This has nothing to do with statistics/data but is a bit of nerd greatness. Run these commands from a terminal: traceroute
  5. A data visualization describing the effectiveness of each state's election administrations. I think that it is a really cool idea, although I'm not sure I understand the index. A couple of related plots are this one that shows distance to polling place versus election day turnout and this one that shows the same thing for early voting. It's pretty interesting how dramatically different the plots are.
  6. Postdoc Sherri Rose writes about big data and junior statisticians at Stattrak. My favorite quote: "We need to take the time to understand the science behind our projects before applying and developing new methods. The importance of defining our research questions will not change as methods progress and technology advances."

Issues with reproducibility at scale on Coursera

As you know, we are big fans of reproducible research here at Simply Statistics. The scandal around the lack of reproducibility in the analyses performed by Anil Potti and subsequent fallout drove the importance of this topic home.

So when I started teaching a course on Data Analysis for Coursera, of course I wanted to focus on reproducible research. The students in the class will be performing two data analyses during the course. They will be peer evaluated using a rubric specifically designed for evaluating data analyses at scale. One of the components of the rubric was to evaluate whether the code people submitted with their assignments reproduced all the numbers in the assignment.
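
One way to picture that rubric component is as a comparison between the numbers reported in a write-up and the numbers obtained by re-running the submitted code. The Python sketch below is only an assumption about how such a check could look, not the course's actual grading process (which used R and peer evaluation); the function name check_reproduced, the example values, and the tolerance are all hypothetical.

```python
import math

def check_reproduced(reported, recomputed, rel_tol=1e-6):
    """Return the names of reported numbers that the re-run code failed to reproduce."""
    failures = []
    for name, value in reported.items():
        if name not in recomputed or not math.isclose(value, recomputed[name], rel_tol=rel_tol):
            failures.append(name)
    return failures

# Hypothetical values: what the write-up claims vs. what the re-run produced.
reported = {"mean_age": 41.25, "n_subjects": 120, "slope": 0.318}
recomputed = {"mean_age": 41.25, "n_subjects": 120, "slope": 0.322}

print(check_reproduced(reported, recomputed))  # prints ['slope']
```

A check like this only covers the comparison step; it does nothing about the harder problems of running someone else's code safely and on a compatible machine.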

Unfortunately, I just had to cancel the reproducibility component of the first data analysis assignment. Here are the things I realized while trying to set up the process that may seem obvious but weren't to me when I was designing the rubric:

  1. Security: I realized (thanks to a very smart subset of the students in the class who posted on the message boards) that there is a major security issue with students exchanging R code and data files with each other. Even if they use only the data downloaded from the official course website, it is possible that people could use the code to try to hack/do nefarious things to each other. The students in the class are great and the probability of this happening is small, but with a class this size, it isn't worth the risk.
  2. Compatibility: I'm requiring that people use R for the course. Even so, people are working on every possible operating system, with many different versions of R. In this scenario, it is entirely conceivable for a person to write totally reproducible code that works on their machine but won't work on a random peer reviewer's machine.
  3. Computing Resources: The range of computing resources used by people in the class is huge, from modern clusters to a single old beat-up laptop. Inefficient code on a fast computer is fine, but on a slow computer with little memory it could mean the difference between reproducibility and crashed computers.

Overall, I think the solution is to run some kind of EC2 instance with a standardized set of software. That is the only thing I can think of that would be scalable to a class this size. On the other hand, that would be expensive and a pain to maintain, and it would require everyone to run code on EC2.

Regardless, it is a super interesting question. How do you do reproducibility at scale?


Sunday data/statistics link roundup (2/3/2013)

  1. My student, Hilary, wrote a post about how her name is the most poisoned in history. A poisoned name is a name that quickly loses popularity year over year. The post is awesome for the following reasons: (1) she is a good/funny writer and has lots of great links in the post, (2) she very clearly explains concepts that are widely used in biostatistics like relative risk, and (3) she took the time to try to really figure out all the trends she saw in the name popularity. I'm not the only one who thinks it is a good post: it was reprinted in New York Magazine and went viral this last week.
  2. In honor of it being Super Bowl Sunday (go Ravens!) here is a post about the reasons why it often doesn't make sense to consider the odds of an event retrospectively due to the Wyatt Earp effect. Another way to think about it is, if you have a big tournament with tons of teams - someone will win. But at the very beginning, any team had a pretty small chance of winning all the games and taking the championship. If we wait until some team wins and calculate their pre-tournament odds of winning, it will probably be small. (via David S.)
  3. A new article by Ben Goldacre in the NYT about unreported clinical trials. This is a major issue and Ben is all over it with his All Trials project. This is another reason we need a deterministic statistical machine. Don't worry, we are working on building it.
  4. Even though it is Super Bowl Sunday, I'm still eagerly looking forward to spring and the real sport of baseball. Rafa sends along this link analyzing the effectiveness of patient hitters when they swing at a first strike. It looks like it is only a big advantage if you are an elite hitter.
  5. An article in Wired on the importance of long data. The article talks about how in addition to cross-sectional big data, we might also want to be looking at data over time - possibly large amounts of time. I think the title is maybe a little over the top, but the point is well taken. It turns out this is something a bunch of my colleagues in imaging and environmental health  have been working on/talking about for a while. Longitudinal/time series big data seems like an important and wide-open field (via Nick R.).
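
The tournament argument in item 2 can be sketched with a small simulation. This is a minimal Python illustration under assumptions of my own: a 64-team single-elimination bracket in which every game is a fair coin flip. The function name play_tournament, the bracket size, and the seed are all hypothetical choices, not taken from the linked post.

```python
import random

def play_tournament(n_teams, rng):
    """Single-elimination bracket among equally matched teams; returns the winner."""
    teams = list(range(n_teams))
    while len(teams) > 1:
        # Each game is a fair coin flip between adjacent teams in the bracket.
        teams = [rng.choice(pair) for pair in zip(teams[0::2], teams[1::2])]
    return teams[0]

n_teams = 64                         # assumed bracket size (a power of two)
rounds = n_teams.bit_length() - 1    # 6 rounds for 64 teams
pre_tournament_odds = 0.5 ** rounds  # chance that one named team wins every game

rng = random.Random(2013)            # fixed seed so the run is repeatable
winners = [play_tournament(n_teams, rng) for _ in range(1000)]

print(f"Pre-tournament chance for any one team: {pre_tournament_odds:.4f}")
print(f"Tournaments that produced a champion: {len(winners)} of 1000")
```

Every simulated tournament ends with a champion even though each team started with only a 1-in-64 chance, which is why computing the winner's odds after the fact says little about how surprising the outcome was.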