30

May

May

Spurred by Rafa’s post on evaluating statisticians working in genomics, there’s an interesting discussion going on at the Scientists for Reproducible Research group on statistics journals. Evan Johnson kicks it off:

…our statistics journals have little impact on how genomic data are analyzed. My group rarely looks to publish in statistics journals anymore because even IF we can get it published quickly, NO ONE will read it, so the only things we send there anymore are things that we don’t care if anyone ever uses.

Evan continues:

It’s crazy to me that all of our statistical journals are barely even noticed by bioinformaticians, computational biologists, and by people in genomics. Even worse, very few non-statisticians in genomics ever try to publish in our journals. Ultimately, this represents a major failure in the statistical discipline to be collectively influential on how genomic data are analyzed.

I may agree with the first point but I’m not sure I agree with second. Regarding the first, I think Karl put it best in that really the problem is that “the bulk of the people who might benefit from my method do not read the statistical literature”. For the second point, I think the issue is that the way science works is changing. Here’s my cartoon of how science worked in the “old days”, say, pre-computer era:

The idea here is that scientists worked with statisticians (they may have been one and the same) to publish stat papers and scientific papers. If Scientist A saw a paper in a domain journal written by Scientist B using a method developed by Statistician C, how could Scientist A apply that method? He had to talk to Statistician D, who would read that statistics literature and find Statistician C’s paper to learn about the method. The point is that there is no direct link from Scientist A to Statistician C except through statistics journals. Therefore, it was critical for Statistician C to publish in the stat journals to ensure that there would be an impact on scientists.

My cartoon of the “new way” of doing things is below.

Now, if Scientist wants to use a method developed by Statistician C (and used by Scientist B), he simply finds the software developed by Statistician C and applies it to his data. Here, there is a direct connection between A and C through software. If Statistician C wants his method to have an impact on scientists, there are two options: publish in stat journals and hope that the method filters through other statisticians, or publish in domain journals *with software* so that other scientists may apply the method directly. It seems the latter approach is more popular in some areas.

Peter Diggle makes an important point about generalized linear models and the seminal book written by McCullagh and Nelder:

the book [by McCullagh and Nelder] would have been read by many fewer people if Nelder and colleague had not embedded the idea in software that (for the time) was innovative in being interactive rather than batch-oriented.

For better or for worse (and probably very often for worse), the software allowed many many people access to the methods.

The supposed attraction of publishing a statistical method in a statistics journal like JASA or JRSS-B is that the methods are published in a more abstract manner (usually using mathematical symbols) in the hopes that the methods will be applicable to a wide array of problems, not just the problem for which it was developed. Of course, the flip side of this argument is, as Karl says, again eloquently, “if you don’t get down to the specifics of a particular data set, then you haven’t really solved *any* problem”.

I think abstraction is important and we need to continue publishing those kinds of ideas. However, I think there is one key point that the statistics community has had difficulty grasping, which is that **software represents an important form of abstraction**, if not the most important form. Anyone who has written software knows that there are many approaches to implementing your method in software and various levels of abstraction one can use. The variety of problems to which the software can be applied depends on how general the interface to your software is. This is why I always encourage people to write R packages because it often forces them to think a bit more abstractly about who might be using the software.

Whither the statistics journals? It’s hard to say. Having them publish more software probably won’t help as the audience remains the same. I’m a bit stumped here but I look forward to continued discussion!

29

May

May

This year I recorded my lectures during my Statistics for Genomics course. Slowly but surely I am putting all the videos on Youtube. Links will eventually be here (all slides and the first lecture is already up). As new lectures become available I will post updates on rafalab’s facebook page and twitter feed where I will answer questions posted as comments (time permitting). Guest lecturers include Jeff Leek, Ben Langmead, Kasper Hansen and Hongkai Ji.

28

May

May

This is yet another outstanding post by Paul Graham, this time on “Schlep Blindness”. He talks about how there are great startup ideas that no one considers because they are too much of a “schlep” (a tedious unpleasant task). He talks about how most founders of startups want to put up a clever bit of code they wrote and just watch the money flow in. But of course it doesn’t work like that, you need to advertise, interact with customers, raise money, go out and promote your work, fix bugs at 3am, etc.

In academia there is a similar tendency to avoid projects that involve a big schlep. For example, it is relatively straightforward to develop a mathematical model, work out the parameter estimates, and write a paper. But it is a big schlep to then write fast code that implements that method, debug the code, dummy proof the code, fix bugs submitted by users, etc. Rafa’s post, Hadley’s interview, and the discussion Rafa linked to all allude to this issue. Particularly the fact that the schlep, the long slow slog of going through a new data type or writing a piece of usable software is somewhat undervalued.

I think part of the problem is our academic culture and heritage, which has traditionally put a very high premium on being clever and a relatively low premium on being willing to go through the schlep. As applied statistics touches more areas and the number of users of statistical software and ideas grows, the schlep becomes just as important as the clever idea. If you aren’t willing to put in the time to code your methods up and make them accessible to other investigators, then who will be?

To bring this back to the discussion inspired by Rafa’s post, I wonder if applied statistics journals could increase their impact, encourage more readership from scientific folks, and support a broader range of applied statisticians if there was a re-weighting of the importance of cleverness and schlep? As Paul points out:

In addition to their intrinsic value, they’re like undervalued stocks in the sense that there’s less demand for them among founders. If you pick an ambitious idea, you’ll have less competition, because everyone else will have been frightened off by the challenges involved.

24

May

May

During the past couple of years I have been asked these questions by several department chairs and other senior statisticians interested in hiring or promoting faculty working in genomics. The main difficulty stems from the fact that we (statisticians working in genomics) publish in journals outside the mainstream statistical journals. This can be a problem during evaluation because a quick-and-dirty approach to evaluating an academic statistician is to count papers in the Annals of Statistics, JASA, JRSS and Biometrics. The evaluators feel safe counting these papers because they trust the fellow-statistician editors of these journals. However, statisticians working in genomics tend to publish in journals like Nature Genetics, Genome Research, PNAS, Nature Methods, Nucleic Acids Research, Genome Biology, and Bioinformatics. In general, these journals do not recruit statistical referees and a considerable number of papers with questionable statistics do get published in them. **However, **when the paper’s main topic is a statistical method or if it heavily relies on statistical methods, statistical referees are used. So, if the statistician is the corresponding or last author and it’s a stats paper, it is OK to assume the statistics are fine and you should go ahead and be impressed by the impact factor of the journal… it’s not east getting statistics papers in these journals.

But we really should not be counting papers blindly. Instead we should be reading at least some of them. But here again the evaluators get stuck as we tend to publish papers with application/technology specific jargon and show-off by presenting results that are of interest to our potential users (biologists) and not necessarily to our fellow statisticians. Here all I can recommend is that you seek help. There are now a handful of us that are full professors and most of us are more than willing to help out with, for example, promotion letters.

So why don’t we publish in statistical journals? The fear of getting scooped due to the slow turnaround of stats journals is only one reason. New technologies that quickly became widely used (microarrays in 2000 and nextgen sequencing today) created a need for data analysis methods among large groups of biologists. Journals with large readerships and high impact factors, typically not interested in straight statistical methodology work, suddenly became amenable to publishing our papers, especially if they solved a data analytic problem faced by many biologists. The possibility of publishing in widely read journals is certainly seductive.

While in several other fields, data analysis methodology development is restricted to the statistics discipline, in genomics we compete with other quantitative scientists capable of developing useful solutions: computer scientists, physicists, and engineers were also seduced by the possibility of gaining notoriety with publications in high impact journals. Thus, in genomics, the competition for funding, citation and publication in the top scientific journals is fierce.

Then there is funding. Note that while most biostatistics methodology NIH proposals go to the Biostatistical Methods and Research Design (BMRD) study section, many of the genomics related grants get sent to other sections such as the Genomics Computational Biology and Technology (GCAT) and Biodata Management and Anlayis (BDMA) study sections. BDMA and GCAT are much more impressed by Nature Genetics and Genome Research than JASA and Biometrics. They also look for citations and software downloads.

To be considered successful by our peers in genomics, those who referee our papers and review our grant applications, our statistical methods need to be delivered as software and garner a user base. Publications in statistical journals, especially those not appearing in PubMed, are not rewarded. This lack of incentive combined with how time consuming it is to produce and maintain usable software, has led many statisticians working in genomics to focus solely on the development of practical methods rather than generalizable mathematical theory. As a result, statisticians working in genomics do not publish much in the traditional statistical journals. You should not hold this against them, especially if they are developers and maintainers of widely used software.

20

May

May

It’s grant season around here so I’ll be brief:

- I love this article in the WSJ about the crisis at JP Morgan. The key point it highlights is that looking only at the high-level analysis and summaries can be misleading, you have to look at the raw data to see the potential problems. As data become more complex, I think its critical we stay in touch with the raw data, regardless of discipline. At least if I miss something in the raw data I don’t lose a couple billion. Spotted by Leonid K.
- On the other hand, this article in the Times drives me a little bonkers. It makes it sound like there is one mathematical model that will solve the obesity epidemic. Lines like this are ridiculous: “Because to do this experimentally would take years. You could find out much more quickly if you did the math.” The obesity epidemic is due to a complex interplay of cultural, sociological, economic, and policy factors. The idea you could “figure it out” with a set of simple equations is laughable. If you check out their model this is clearly not the answer to the obesity epidemic. Just another example of why statistics is not math. If you don’t want to hopelessly oversimplify the problem, you need careful data collection, analysis, and interpretation. For a broader look at this problem, check out this article on Science vs. PR. Via Andrew J.
- Some cool applications of the raster package in R. This kind of thing is fun for student projects because analyzing images leads to results that are easy to interpret/visualize.
- Check out John C.’s really fascinating post on determining when a white-collar worker is great. Inspired by Roger’s post on knowing when someone is good at data analysis.

16

May

May

[youtube http://www.youtube.com/watch?v=t7FJFuuvxpI?wmode=transparent&autohide=1&egm=0&hd=1&iv_load_policy=3&modestbranding=1&rel=0&showinfo=0&showsearch=0&w=500&h=375]

The West Wing was always a favorite show of mine (at least, seasons 1-4, the Sorkin years) and I think this is a great scene which talks about the difference between evidence and interpretation. The topic is a 5-day waiting period for gun purchases and they’ve just received a poll in a few specific congressional districts showing weak support for this proposed policy.

(Source: http://www.youtube.com/)