## Sunday data/statistics link roundup (1/12/2014)

Well, it technically is Monday, but I never went to sleep, so that still counts as Sunday, right?

1. As a person who has taught a couple of MOOCs I'm used to getting some pushback from people who don't like the whole concept. But I'm still happy that I'm not the only one who thinks they are a pretty good idea and still worth doing. I think that both the hype and the backlash are too much. The hype claimed MOOCs would completely end the university as we know it. The backlash says they will have no impact. I think it is more likely they will have a major impact on people who traditionally don't attend college. That's ok with me. I think this post gets it about right.
2. The Leekasso is finally dethroned! Korbinian Strimmer used my simulation code and compared the Leekasso to CAT scores from the sda package coupled with Higher Criticism feature selection. Here is the accuracy plot. It looks like the Leekasso is competitive with the CAT-Leekasso, but CAT+HC wins. Big win for Github there, and thanks to Korbinian for taking the time to do the simulation!
3. Jack Andraka is getting some pushback from serious scientists on the draft of his paper describing the research he outlined in his TED talk. He is taking the criticism like a pro, which says a lot about the guy. From reading the secondhand reviews, it sounds like his project was like most good science projects: it made some interesting progress but needs a lot of grinding before it turns into something real. The hype made it sound too good to be true. I hope that he will just ignore the hype machine from here on out and keep grinding (via Rafa).
4. I've probably posted this before, but here is the illustrated guide to a Ph.D. Lest you think that little bump doesn't matter, don't forget to scroll to the bottom and read this.
5. The bmorebiostat bloggers (http://bmorebiostat.com/): if you aren't following them, you should be.
6. Potentially cool website for accessing treasury data.
7. Ok, it's 5am. I need a githug and then off to bed.

## The top 10 predictor takes on the debiased Lasso - still the champ!

After reposting on the comparison between the lasso and the always-top-10 predictor (the Leekasso) I got some feedback that the problem could be that I wasn't debiasing the Lasso (thanks Tim T. on Twitter!). The idea behind debiasing (as I understand it) is to use the Lasso to do feature selection and then fit a model without shrinkage to "debias" the coefficients. The debiased model is then used for prediction. Noah Simon, who knows approximately infinitely more about this than I do, kindly provided some code for fitting a debiased Lasso. He is not responsible for any mistakes/silliness in the simulation; he was just nice enough to provide some debiased Lasso code. He mentions a similar idea appears in the relaxo package if you set $\phi=0$.
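To make the two-stage idea concrete, here is a minimal sketch of my own (using glmnet, not Noah's code), assuming a binary outcome y and a predictor matrix x:

```r
# Minimal sketch of the two-stage ("debiased") Lasso: select with the
# Lasso, then refit the selected features without shrinkage.
library(glmnet)

set.seed(1)
x <- matrix(rnorm(100 * 500), nrow = 100)  # 100 samples, 500 predictors
y <- rep(0:1, each = 50)                   # binary outcome

# Stage 1: Lasso for feature selection, lambda chosen by cross-validation
cvfit <- cv.glmnet(x, y, family = "binomial")
beta <- coef(cvfit, s = "lambda.min")
selected <- which(beta[-1] != 0)  # drop the intercept, keep nonzero features

# Stage 2: unpenalized refit on the selected features "un-shrinks" the
# coefficients; this refit model is then used for prediction
if (length(selected) > 0) {
  debiased <- glm(y ~ x[, selected, drop = FALSE], family = binomial)
}
```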

I used the same simulation setup as before and tried out the Leekasso, the Lasso, and the Debiased Lasso. Here are the accuracy results (more red = higher accuracy):

The results suggest the debiased Lasso still doesn't work well under this design. Keep in mind, as I mentioned in my previous post, that the Lasso may perform better under a different causal model.

Update: Code is available here on Github if you want to play around.


## Preparing for tenure track job interviews

Editor's note: This is a slightly modified version of a previous post.

If you are in the job market you will soon be receiving (or have already received) an invitation for an interview. So how should you prepare? You have two goals. The first is to make a good impression. Here are some tips:

1) During your talk, do NOT go over your allotted time. Practice your talk at least twice, both times in front of a live audience that asks questions.

2) Know your audience. If it’s a “math-y” department, give a more “math-y” talk. If it’s an applied department, give a more applied talk. But (sorry for the cliché) be yourself. Don’t pretend to be interested in something you are not as this almost always backfires.

3) Learn about the faculty’s research interests. This will help during the one-on-one meetings.

4) Be ready to answer the questions "what do you want to teach?" and "where do you see yourself in five years?"

5) I can’t think of any department where it is necessary to wear a suit (correct me if I’m wrong in the comments). In some places you might feel uncomfortable wearing a suit while those interviewing you are in shorts and a t-shirt.

Second, and just as important, you want to figure out if you like the department you are visiting. Do you want to spend the next 5, 10, 50 years there?  Make sure to find out as much as you can to answer this question. Some questions are more appropriate for junior faculty, the more sensitive ones for the chair. Here are some example questions I would ask:

1) What are the expectations for promotion? Would you promote someone publishing exclusively in subject matter journals such as Nature, Science, Cell, PLoS Biology, or the American Journal of Epidemiology? What about somebody publishing exclusively in the Annals of Statistics? Is being a PI on an R01 a requirement for tenure?

2) What are the expectations for teaching/service/collaboration? How are teaching and committee service assignments made?

3) How did you connect with your collaborators? How are these connections made?

4) What percent of my salary am I expected to cover? Is it possible to do this by being a co-investigator?

5) Where do you live? How are the schools? How is the commute?

6) How many graduate students does the department have? How are graduate students funded? If I want someone to work with me, do I have to cover their stipend/tuition?

7) How is computing supported? This varies a lot from place to place. Some departments share amazing systems; ask how the costs are shared. How is the IT staff? Is R supported? In other places you might have to buy your own hardware. Get all the details.

Specific questions for the junior faculty:

Are the expectations for promotion made clear to you? Do you get feedback on your progress? Do the senior faculty mentor you? Do the senior faculty get along? What do you like most about the department? What can be improved? In the last 10 years, what percent of junior faculty have been promoted?

Questions for the chair:

What percent of my salary am I expected to cover? How soon? Is there bridge funding? What is a standard startup package? Can you describe the promotion process in detail? What space is available for postdocs? (For hard money places) I love teaching, but can I buy out of teaching with grants?

I am sure I missed stuff, so please comment away….


## Sunday data/statistics link roundup (1/5/14)

1. If you haven't seen lolmythesis it is pretty incredible: 1-2 line descriptions of thesis projects. I think every student should be required to make one of these up before they defend. The best I could come up with for mine is, "We built a machine sensitive enough to measure the abundance of every gene in your body at once; turns out it measures other stuff too."
2. An interesting article about how different direct-to-consumer genetic tests give different results. It doesn't say, but it would be interesting to know whether the raw data were highly replicable and only the interpretations differed. If the genotype calls themselves didn't match up, that would be much worse on some level. I agree people have a right to their genetic data. On the other hand, I think it is important to remember that even people with Ph.D.s and 15 years of experience have trouble interpreting the results of a GWAS. To assume the average individual will understand their genetic risk is seriously optimistic (via Rafa).
3. The 10 commandments of egoless programming. These are so important on big collaborative projects like my group has been working on the last year or so. Fortunately my students and postdocs are much better at being egoless than I am (I am an academic with a blog so it isn't like you couldn't see the ego coming).
4. This is a neat post on parsing and analyzing data from a Garmin. The analysis even produces an automated report! I love it when people do cool things like this with their own data in R.
5. Super interesting advice page for potential graduate students from a faculty member at Duke Biology. This is particularly interesting in light of the ongoing debate about the viability of the graduate education pipeline highlighted in this recent article. I think it is important for graduate students in Ph.D. programs to know that not every student goes on to an academic position. This has been true for a long time in Biostatistics, where many people end up in industry positions. That also means it is the obligation of Ph.D. programs to prepare students for a variety of jobs. Fortunately, most Ph.D.s in Biostatistics have experience processing data, working with collaborators, and developing data products, so they are usually well prepared for industry too.
6. This old video of Tukey and Friedman is awesome and mind-blowing (via Mike L.).
7. Cool site that lets you try to balance Baltimore's budget. This type of thing would be even cooler if there were Github-like pull requests where you could make new suggestions as well.
8. My student Alyssa has a very interesting post on teaching R to a non-programmer in one hour. Take the Frazee Challenge and list what you would teach.

## Repost: Prediction: the Lasso vs. just using the top 10 predictors

Editor's note: This is a previously published post of mine from a couple of years ago (!). I always thought about turning it into a paper. The interesting idea (I think) is how the causal model matters for whether the lasso or the marginal regression approach works better. Also check it out, the Leekasso is part of the SuperLearner package.

One incredibly popular tool for the analysis of high-dimensional data is the lasso. The lasso is commonly used in cases when you have many more predictors than independent samples (the $n \ll p$ problem). It is also often used in the context of prediction.

Suppose you have an outcome $Y$ and several predictors $X_1, \ldots, X_M$. The lasso fits the model:

$$Y = B_0 + B_1 X_1 + B_2 X_2 + \cdots + B_M X_M + E$$

subject to a constraint on the sum of the absolute value of the B coefficients. The result is that: (1) some of the coefficients get set to zero, and those variables drop out of the model, (2) other coefficients are “shrunk” toward zero. Dropping some variables is good because there are a lot of potentially unimportant variables. Shrinking coefficients may be good, since the big coefficients might be just the ones that were really big by random chance (this is related to Andrew Gelman’s type M errors).

I work in genomics, where $n \ll p$ problems come up all the time. Whenever I use the lasso or when I read papers where the lasso is used for prediction, I always think: “How does this compare to just using the top 10 most significant predictors?” I have asked this out loud enough that some people around here started calling it the “Leekasso” to poke fun at me. So I’m going to call it that in a thinly veiled attempt to avoid Stigler’s law of eponymy (actually Rafa points out that using this name is a perfect example of this law, since this feature selection approach has been proposed before at least once).

Here is how the Leekasso works. You fit each of the models:

$$Y = B_0 + B_k X_k + E$$

take the 10 variables with the smallest p-values from testing the $B_k$ coefficients, then fit a linear model with just those 10 predictors. You never use 9 or 11; the Leekasso is always 10.
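In code, the Leekasso is only a few lines. Here is a minimal sketch of my own (assuming a binary outcome y and a predictor matrix x; this is not the exact code used for the results below):

```r
# Minimal Leekasso sketch: marginal p-values, take the top 10, refit.
set.seed(1)
x <- matrix(rnorm(100 * 500), nrow = 100)  # 100 samples, 500 predictors
y <- rep(0:1, each = 50)                   # binary outcome

# Fit each univariate model Y = B0 + Bk Xk + E and record the p-value for Bk
pvals <- apply(x, 2, function(xk) summary(lm(y ~ xk))$coefficients[2, 4])

# Always take the 10 variables with the smallest p-values -- never 9 or 11
top10 <- order(pvals)[1:10]

# Fit a linear model with just those 10 predictors
leekasso <- lm(y ~ x[, top10])
```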

For fun I did an experiment to compare the accuracy of the Leekasso and the Lasso.

Here is the setup:

• I simulated 500 variables and 100 samples for each study, each N(0,1)
• I created an outcome that was 0 for the first 50 samples, 1 for the last 50
• I set a certain number of variables (between 5 and 50) to be associated with the outcome using the model $X_i = b_{0i} + b_{1i} Y + e$ (this is an important choice, more later in the post)
• I tried different levels of signal for the truly predictive features
• I generated two data sets (training and test) from the exact same model for each scenario
• I fit the Lasso using the lars package, choosing the shrinkage parameter as the value that minimized the cross-validation MSE in the training set
• I fit the Leekasso and the Lasso on the training sets and evaluated accuracy on the test sets.

The R code for this analysis is available here and the resulting data is here.
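If you just want the shape of the experiment, here is a minimal sketch of one simulated scenario (my own illustration: it uses glmnet for brevity rather than the lars package used in the actual analysis, and it skips the train/test split):

```r
# One scenario of the simulation: features generated FROM the outcome,
# i.e. Xi = b0i + b1i*Y + e for the truly associated variables.
library(glmnet)

set.seed(1)
n <- 100; p <- 500; n.true <- 25; signal <- 1

y <- rep(0:1, each = n / 2)                  # 0 for first 50, 1 for last 50
x <- matrix(rnorm(n * p), nrow = n)          # 500 N(0,1) variables
x[, 1:n.true] <- x[, 1:n.true] + signal * y  # add signal to the true features

# Lasso fit with the shrinkage parameter chosen by cross-validation
cvfit <- cv.glmnet(x, y, family = "binomial")
pred <- predict(cvfit, newx = x, s = "lambda.min", type = "class")
mean(pred == y)  # accuracy (in the real analysis, on an independent test set)
```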

The results show that for all configurations, using the top 10 has a higher out of sample prediction accuracy than the lasso. A larger version of the plot is here.

Interestingly, this is true even when there are fewer than 10 real features in the data or when there are many more than 10 real features (remember, the Leekasso always picks 10).

Some thoughts on this analysis:

1. This is only test-set prediction accuracy; it says nothing about selecting the “right” features for prediction.
2. The Leekasso took about 0.03 seconds to fit and test per data set compared to about 5.61 seconds for the Lasso.
3. The data generating model is the model underlying the top 10, so it isn’t surprising it has higher performance. Note that I simulated from the model $X_i = b_{0i} + b_{1i} Y + e$; this is the model commonly assumed in differential expression analysis (genomics) or voxel-wise analysis (fMRI). Alternatively I could have simulated from the model $Y = B_0 + B_1 X_1 + B_2 X_2 + \cdots + B_M X_M + E$, where most of the coefficients are zero. In this case, the Lasso would outperform the top 10 (data not shown). This is a key, and possibly obvious, issue raised by this simulation: when doing prediction, differences in the true “causal” model matter a lot. So if we believe the “top 10 model” holds in many high-dimensional settings, then it may be the case that regularization approaches don’t work well for prediction, and vice versa.
4. I think what may be happening is that the Lasso is overshrinking the parameter estimates; in other words, you take on too much bias in exchange for the reduction in variance. Alan Dabney and John Storey have a really nice paper discussing shrinkage in the context of genomic prediction that I think is related.

## The Supreme Court takes on Pollution Source Apportionment...and Realizes It's Hard

Recently, the U.S. Supreme Court heard arguments in the cases EPA v. EME Homer City Generation and American Lung Association v. EME Homer City Generation. SCOTUSblog has a nice summary of the legal arguments, for the law buffs out there.

The basic problem is that the way air pollution is regulated, the EPA and state and local agencies monitor the air pollution in each state. When the levels of pollution are above the national ambient air quality standards at the monitors in that state, the state is considered in "non-attainment" (i.e. they have not attained the standard). Otherwise, they are in attainment.

But what if your state doesn't actually generate any pollution, but there's all this pollution blowing in from another state? Pollution knows no boundaries and in that case, the monitors in your state will be in non-attainment, and it isn't even your fault! The Clean Air Act has something called the "good neighbor" policy that was designed to address this issue. From SCOTUSblog:

One of the obligations that states have, in drafting implementation plans [to reduce pollution], is imposed by what is called the “good neighbor” policy.  It dates from 1963, in a more elemental form, but its most fully developed form requires each state to include in its plan the measures necessary to prevent the migration of their polluted air to their neighbors, if that would keep the neighbors from meeting EPA’s quality standards.

The problem is that if you live in a state like Maryland, your air pollution is coming from a bunch of states (Pennsylvania, Ohio, etc.). So who do you blame? Well, the logical thing would be to say that if Pennsylvania contributes to 90% of Maryland's interstate air pollution and Ohio contributes 10%, then Pennsylvania should get 90% of the blame and Ohio 10%. But it's not so easy because air pollution doesn't have any special identifiers on it to indicate what state it came from. This is the source apportionment problem in air pollution and it involves trying to back-calculate where a given amount of pollution came from (or what was its source). It's not an easy problem.

EPA realized the unfairness here and devised the Cross-State Air Pollution Rule, also known as the "Transport Rule". From SCOTUSblog:

What the Transport Rule sought to do is to set up a regime to limit cross-border movement of emissions of nitrogen oxides and sulfur dioxide.  Those substances, sent out from coal-fired power plants and other sources, get transformed into ozone and “fine particular matter” (basically, soot), and both are harmful to human health, contributing to asthma and heart attacks.  They also damage natural terrain such as forests, destroy farm crops, can kill fish, and create hazes that reduce visibility.

Both of those pollutants are carried by the wind, and they can be transported very large distances — a phenomenon that is mostly noticed in the eastern states.

There are actually a few versions of this problem. One common one involves identifying the source of a particle (e.g. automobile, power plants, road dust) based on its chemical composition. The idea here is that at any given monitor, there are particles blowing in from all different types of sources, and so the pollution you measure is a mixture of all these sources. Making some assumptions about chemical mass balance, there are ways to statistically separate out the contributions from individual sources based on the chemical composition of the total mass measurement. If the particles that we measure, say, have a lot of ammonium ions and we know that particles generated by coal-burning power plants have a lot of ammonium ions, then we might infer that the particles came from a coal-burning power plant.

The key idea here is that different sources of particles have "chemical signatures" that can be used to separate out their various contributions. This is already a difficult problem, but at least here, we have some knowledge of the chemical makeup of various sources and can incorporate that knowledge into the statistical analysis.
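To make the idea concrete, here is a toy version of that calculation. All the numbers are invented; the point is just the regression structure, where the measured chemical composition is regressed on known source profiles:

```r
# Toy chemical mass balance: the measured composition is a mixture of
# known source "chemical signatures" (all numbers made up).
set.seed(1)

profiles <- cbind(  # fraction of each source's mass in each chemical species
  coal    = c(ammonium = 0.30, sulfate = 0.40, nitrate = 0.10, carbon = 0.20),
  traffic = c(ammonium = 0.05, sulfate = 0.10, nitrate = 0.35, carbon = 0.50)
)

true.contrib <- c(coal = 8, traffic = 3)  # true source contributions
measured <- as.vector(profiles %*% true.contrib) + rnorm(4, sd = 0.05)

# Regress the measured composition on the source profiles (no intercept);
# the coefficients estimate each source's contribution to the total mass
fit <- lm(measured ~ profiles - 1)
coef(fit)  # should recover roughly 8 (coal) and 3 (traffic)
```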

In the problem at the Supreme Court, we're not concerned with particles from various types of sources, but rather from different locations. But, for the most part, different states don't have "chemical signatures" or tracer elements, so it's hard to identify whether a given particle (or other pollutant) blowing in the wind came from Pennsylvania versus Ohio.

So what did EPA do? Well, instead of figuring out where the pollution came from, they decided that states would reduce emissions based on how much it would cost to control those emissions. The states objected because the cost of controlling emissions may well have nothing to do with how much pollution is actually being contributed downwind.

The legal question involves whether or not EPA has the authority to devise a regulatory plan based on costs as opposed to actual pollution contribution. I will let people who actually know the law address that question, but given the general difficulty of source apportionment, I'm not sure EPA could have come up with a much better plan.

## Some things R can do you might not be aware of

There is a lot of noise around "R versus Contender X" for Data Science. I think the two main competitors right now that I hear about are Python and Julia. I'm not going to weigh in on the debates because I go by the motto: "Why not just use something that works?"

R offers a lot of benefits if you are interested in statistical or predictive modeling. It is basically unrivaled in terms of the breadth of packages for applied statistics. But I think sometimes it isn't obvious that R can handle some tasks that you used to have to do with other languages. This misconception is particularly common among people who regularly code in a different language and are moving to R. So I thought I'd point out a few cool things that R can do (a quick sketch of a couple of them follows the list). Please add to the list in the comments if I've missed things people don't expect R can do.

1. R can do regular expressions/text processing: Check out stringr, tm, and a large number of other natural language processing packages.
2. R can get data out of a database: Check out RMySQL, RMongoDB, rhdf5, ROracle, MonetDB.R (via Anthony D.).
3. R can process nasty data: Check out plyr, reshape2, and Hmisc.
4. R can process images: EBImage is a good general purpose tool, but there are also packages for various file types like jpeg.
5. R can handle different data formats: XML and RJSONIO handle two common types, but you can also read from Excel files with xlsx or handle pretty much every common data storage type (you'll have to search "R + data type" to find the package).
6. R can interact with APIs: Check out RCurl and httr for general purpose software, or you could try some specific examples like twitteR. You can create an API from R code using yhat.
7. R can build apps/interactive graphics: Some pretty cool things have already been built with shiny, and rCharts interfaces with a ton of interactive graphics packages.
8. R can create dynamic documents: Try out knitr or slidify.
9. R can play with Hadoop: Check out the rhadoop wiki.
10. R can create interactive teaching modules: You can do it in the console with swirl or on the web with Datamind.
11. R interfaces very nicely with C++ if you need to be hardcore (and maybe with Python too): Rcpp, enough said. Also read the tutorial. I haven't tried the rPython library, but it looks like a great idea.
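To give a flavor of items 1 and 5, here is a tiny sketch using stringr and RJSONIO (the particular packages are interchangeable with the others listed above):

```r
# A small taste of text processing (item 1) and data formats (item 5).
library(stringr)   # regular expressions / text processing
library(RJSONIO)   # JSON handling

# Regular expressions: which strings look like email addresses?
strings <- c("ada@stats.edu", "not-an-email", "ron@jhu.edu")
str_detect(strings, "^[^@]+@[^@]+\\.[a-z]+$")
#> TRUE FALSE TRUE

# JSON: parse a common web data format into an R object
json <- '{"package": "RJSONIO", "rating": 5}'
fromJSON(json)
```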

## A non-comprehensive list of awesome things other people did this year.

Editor's Note: I made this list off the top of my head and have surely missed awesome things people have done this year. If you know of some, you should make your own list or add it to the comments! I have also avoided talking about stuff I worked on or that people here at Hopkins are doing because this post is supposed to be about other people's awesome stuff. I wrote this post because a blog often feels like a place to complain, but we started Simply Stats as a place to be pumped up about the stuff people were doing with data.

• I emailed Hadley Wickham about some trouble we were having memory profiling. He wrote back immediately, then wrote an R package, then wrote this awesome guide. That guy is ridiculous.
• Jared Horvath wrote this incredibly well-written and compelling argument for the scientific system that has given us a wide range of discoveries.
• Yuwen Liu and colleagues wrote this really interesting paper on power for RNA-seq studies comparing biological replicates and sequencing depth. It shows pretty conclusively that you should go for more replicates (music to a statistician's ears!).
• Yoav Benjamini and Yotam Hechtlinger wrote an amazing discussion of the paper we wrote about the science-wise false discovery rate. It contributes new ideas about estimation/control in that context.
• Sherri Rose wrote a fascinating article about the statistician's role in big data. One thing I really liked was this line: "This may require implementing commonly used methods, developing a new method, or integrating techniques from other fields to answer our problem." I really like the idea that integrating and applying standard methods in new and creative ways can be viewed as a statistical contribution.
• Karl Broman gave his now legendary talk (part1/part2) on statistical graphics, on a Google Hangout with the Iowa State data viz crowd; I think it should be required viewing for anyone who will ever plot data. They had some technical difficulties during the broadcast so Karl B. took it down. Join me in begging him to put it back up again despite the warts.
• Everything Thomas Lumley wrote on notstatschat; I follow that blog super closely. I love this scrunchable poster he pointed to and this post on Statins and the Causal Markov property.
• I wish I could take Joe Blitzstein's data science class. Particularly check out the reading list, which I think is excellent.
• Lev Muchnik, Sinan Aral, and Sean Taylor brought the randomized controlled trial to social influence bias on a massive scale. I love how RCTs are finding their way into the new, sexy areas.
• Genevera Allen taught a congressman about statistical brain mapping and holy crap he talked about it on the floor of the House.
• Lior Pachter started mixing it up on his blog. I don't necessarily agree with all of his posts but it is hard to deny the influence that his posts have had on real science. I definitely read it regularly.
• Marie Davidian, President of the ASA, has been on a tear this year, doing tons of cool stuff, including landing the big fish, Nate Silver, for JSM. Super impressive to watch the energy. I'm also really excited to see what Bin Yu works on this year as president of IMS.
• The Stats 2013 crowd has done a ridiculously good job of getting the word out about statistics this year. I keep seeing statistics pop up in places like the WSJ, which warms my heart.
• One way I judge a paper is by how angry/jealous I am that I didn't think of or write that paper. This paper on the reproducibility of RNA-seq experiments was so good I was seeing red. I'll be reading everything that Tuuli Lappalainen's new group at the New York Genome Center writes.
• Hector Corrada Bravo and the crowd at UMD wrote this paper about differential abundance in microbial communities that also made me crazy jealous. Just such a good idea done so well.
• Chad Myers and Curtis Huttenhower continue to absolutely tear it up on networks and microbiome stuff. Just stop guys, you are making the rest of us look bad...
• I don't want to go to Stanford, I want to go to Johns Hopkins.
• Ramnath keeps Ramnathing (def. to build incredible things at a speed that we can't keep up with by repurposing old tools in the most creative way possible) with rCharts.
• Neo Chung and John Storey invented the jackstraw for testing the association between measured variables and principal components. It is an awesome idea and a descriptive name.
• I wasn't at Bioc 2013, but I heard from two people who I highly respect (and who take a lot to impress) that Levi Waldron gave one of the best talks they'd ever seen. The paper isn't up yet (I think) but here is the R package with the data he described. His survHD package for fast coxph fits (think rowFtests but with Cox) is also worth checking out.
• John Cook kept cranking out interesting posts, as usual. One of my favorites talks about how one major component of expertise is the ability to quickly find and correct inevitable errors (for example, in code).
• Larry Wasserman's Simpson's Paradox post should be required reading. He is shutting down Normal Deviate, which is a huge bummer.
• Andrew Gelman and I don't always agree on scientific issues, but there is no arguing that he and the Stan team have made a pretty impressive piece of software with Stan. Richard McElreath also wrote a slick interface that makes fitting a fully Bayesian model match the syntax of lmer.
• Steve Pierson and Ron Wasserstein from the ASA are also doing a huge service for our community in tackling the big issues like interfacing statistics with government funding agencies. Steve's Twitter feed has been a great resource for keeping track of competitions, grants, and other deadlines.
• Joshua Katz built these amazing dialect maps that have been all over the news. Shiny Apps are getting to be serious business.
• Speaking of RStudio, they keep rolling out the goodies, my favorite recent addition is interactive debugging.
• I'll close with David Duvenaud's HarlMCMC shake:


## A summary of the evidence that most published research is false

One of the hottest topics in science right now centers on two main claims:

• Most published research is false
• There is a reproducibility crisis in science

The first claim is often stated in a slightly different way: that most results of scientific experiments do not replicate. I recently got caught up in this debate and I frequently get asked about it.

So I thought I'd do a very brief review of the reported evidence for the two perceived crises. An important point is that all of the scientists below have made the best effort they can to tackle a fairly complicated problem, and it is early days in the study of science-wise false discovery rates. But the take home message is that there is currently no definitive evidence one way or another about whether most results are false.

1. Paper: Why most published research findings are false. Main idea: People use hypothesis testing to determine if specific scientific discoveries are significant. This significance calculation is used as a screening mechanism in the scientific literature. Under assumptions about the way people perform these tests and report them it is possible to construct a universe where most published findings are false positive results (a quick sketch of this calculation appears after the list). Important drawback: The paper contains no real data; it is purely based on conjecture and simulation.
2. Paper: Drug development: Raise standards for preclinical research. Main idea: Many drugs fail when they move through the development process. Amgen scientists tried to replicate 53 high-profile basic research findings in cancer and could only replicate 6. Important drawback: This is not a scientific paper. The study design, replication attempts, selected studies, and the statistical methods to define "replicate" are not defined. No data is available or provided.
3. Paper: An estimate of the science-wise false discovery rate and application to the top medical literature. Main idea: The paper collects P-values from published abstracts of papers in the medical literature and uses a statistical method to estimate the false discovery rate proposed in paper 1 above. Important drawback: The paper only collected data from major medical journals and the abstracts. P-values can be manipulated in many ways that could call into question the statistical results in the paper.
4. Paper: Revised standards for statistical evidence. Main idea: The P-value cutoff of 0.05 is used by many journals to determine statistical significance. This paper proposes an alternative method for screening hypotheses based on Bayes factors. Important drawback: The paper is a theoretical and philosophical argument for simple hypothesis tests. The data analysis recalculates Bayes factors for reported t-statistics and plots the Bayes factor versus the t-test, then makes an argument for why one is better than the other.
5. Paper: Contradicted and initially stronger effects in highly cited research. Main idea: This paper looks at studies that attempted to answer the same scientific question where the second study had a larger sample size or a more robust (e.g. randomized trial) study design. Some effects reported in the second study do not match the results exactly from the first. Important drawback: The title does not match the results. 16% of studies were contradicted (meaning an effect in a different direction), 16% reported a smaller effect size, 44% were replicated, and 24% were unchallenged. So 44% + 24% + 16% = 84% were not contradicted. Lack of replication is also not proof of error.
6. Paper: Modeling the effects of subjective and objective decision making in scientific peer review. Main idea: This paper considers a theoretical model for how referees of scientific papers may behave socially. They use simulations to point out how an effect called "herding" (basically peer-mimicking) may lead to biases in the review process. Important drawback: The model makes major simplifying assumptions about human behavior and supports these conclusions entirely with simulation. No data is presented.
7. Paper: Repeatability of published microarray gene expression analyses. Main idea: This paper attempts to collect the data used in published papers and to repeat one randomly selected analysis from each paper. For many of the papers the data was either not available or available in a format that made it difficult/impossible to repeat the analysis performed in the original paper. The types of software used were also not clear. Important drawback: This paper was written about 18 data sets in 2005-2006. This is both early in the era of reproducibility and not comprehensive in any way. This says nothing about the rate of false discoveries in the medical literature but does speak to the reproducibility of genomics experiments 10 years ago.
8. Paper: Investigating variation in replicability: The "Many Labs" replication project (not yet published). Main idea: The idea is to take a bunch of published high-profile results and try to get multiple labs to replicate the results. They successfully replicated 10 out of 13 results and the distribution of results you see is about what you'd expect (see embedded figure below). Important drawback: The paper isn't published yet and it only covers 13 experiments. That being said, this is by far the strongest, most comprehensive, and most reproducible analysis of replication among all the papers surveyed here.
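As promised above, here is a quick sketch of the calculation behind Paper 1's argument: the positive predictive value of a "significant" finding as a function of the fraction of truly null hypotheses, the significance level, and power. This is my own illustration of the standard calculation, not code from the paper:

```r
# Positive predictive value of a significant finding. pi0 is the fraction
# of tested hypotheses that are truly null, alpha the significance level,
# and power = 1 - beta the chance of detecting a real effect.
ppv <- function(pi0, alpha = 0.05, power = 0.8) {
  true.pos  <- (1 - pi0) * power   # rate of true positive findings
  false.pos <- pi0 * alpha         # rate of false positive findings
  true.pos / (true.pos + false.pos)
}

# If only a small fraction of tested hypotheses are real effects, the
# literature can fill with false positives even at alpha = 0.05
ppv(pi0 = 0.90)  # ~0.64: most published findings true
ppv(pi0 = 0.99)  # ~0.14: most published findings false
```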

I do think that the reviewed papers are important contributions because they draw attention to real concerns about the modern scientific process. Namely:

• We need more statistical literacy
• We need more computational literacy
• We need to require code be published
• We need mechanisms of peer review that deal with code
• We need a culture that doesn't use reproducibility as a weapon
• We need increased transparency in review and evaluation of papers

Some of these have simple fixes (more statistics courses, publishing code) some are much, much harder (changing publication/review culture).

The Many Labs project (Paper 8) points out that statistical research is proceeding in a fairly reasonable fashion. Some effects are overestimated in individual studies, some are underestimated, and some are just about right. Regardless, no single study should stand alone as the last word about an important scientific issue. It obviously won't be possible to replicate every study as intensely as those in the Many Labs project, but this is a reassuring piece of evidence that things aren't as bad as some paper titles and headlines may make it seem.

Many Labs data: blue x's are original effect sizes; the other dots are effect sizes from replication experiments (http://rolfzwaan.blogspot.com/2013/11/what-can-we-learn-from-many-labs.html).

The Many Labs results suggest that the hype about the failures of science is, at the very least, premature. I think an equally important idea is that science has pretty much always worked with some number of false positive and irreplicable studies. This was beautifully described by Jared Horvath in this blog post from the Economist. I think the take home message is that regardless of the rate of false discoveries, the scientific process has led to amazing and life-altering discoveries.


## Sunday data/statistics link roundup (12/15/13)

1. Rafa (in Spanish) clarifying some of the problems with the anti-GMO crowd.
2. Joe Blitzstein, most recently of #futureofstats fame, talks up data science in the Harvard Crimson (via Rafa). As has been pointed out by Rebecca Nugent when she stopped to visit us, class sizes in undergrad stats programs are blowing up!
3. If you missed it, Michael Eisen dropped by to chat about open access (part 1/part 2). We talked about Randy Schekman, a recent Nobel prize winner who says he isn't publishing in Nature/Science/Cell anymore. Professor Schekman did a Reddit AMA where he got grilled pretty hard for pushing the glamour open access journal eLife while dissing N/S/C, where he published a lot of stuff before winning the Nobel.
4. The article I received most over the last couple of weeks is this one. In it, Peter Higgs says that in the modern publish-or-perish academic system he wouldn't have had the time to think deeply enough to do the research that led to the discovery of the boson. But he got the prize, at least in part, because of the people who conceived/built/tested the theory in the Large Hadron Collider. I'm much more inclined to believe someone would have come up with the boson theory in our current system than that someone would have built the LHC in a system without competitive pressure.
5. I think this post raises some interesting questions about the Obesity Paradox, which says overweight people with diabetes may have a lower risk of death than normal-weight people. The analysis is obviously tongue-in-cheek, but I'd be interested to hear what other people think about whether it is a serious issue or not.