What's wrong with the predicting h-index paper.

Editor’s Note: I recently posted about a paper in Nature that purported to predict the h-index. The authors contacted me to get my criticisms, then responded to those criticisms. They have requested the opportunity to respond publicly, and I think it is a totally reasonable request. Until there is a better comment-generating mechanism at the journal level, this seems like as good a forum as any to discuss statistical papers. I will post an extended version of my criticisms here and give them the opportunity to respond publicly in the comments.

The paper in question is clearly a clever idea and the kind that would get people fired up. Quantifying researchers' output is all the rage, and being able to predict this quantity in the future would obviously make a lot of evaluators happy. I think it was, in that sense, a really good idea to chase down these data, since it was clear that if they found anything at all it would be very widely covered in the scientific/popular press.

My original post was inspired by my frustration with Nature, which has a history of publishing somewhat suspect statistical papers, such as this one. I posted the prediction contest after reading another paper I consider to be flawed, both for statistical reasons and for scientific reasons. I originally commented on the statistics in my post. The authors, being good sports, contacted me for my criticisms. I sent them the following criticisms, which I think are sufficiently major that a statistical referee or statistical journal would likely have rejected the paper:
  1. Lack of reproducibility. The code/data are not made available either through Nature or on your website. This is a critical component of papers based on computation and has led to serious problems before. It is also easily addressable. 
  2. No training/test set. You mention cross-validation (and maybe the R^2 is the R^2 using the held out scientists?) but if you use the cross-validation step to optimize the model parameters and to estimate the error rate, you could see some major overfitting. 
  3. The R^2 values are pretty low. An R^2 of 0.67 is obviously superior to the h-index alone, but (a) there is concern about overfitting, and (b) even without overfitting, an R^2 that low could lead to substantial variance in predictions. 
  4. The prediction error is not reported in the paper (or in the online calculator). How far off could you be at 5 years, at 10? Would the results still be impressive with those errors reported?
  5. You use model selection and show only the optimal model (as described in the last paragraph of the supplementary), but the text gives no indication of the potential difficulties with this model selection. 
  6. You use a single regression model without any time variation in the coefficients and without any potential non-linearity. Clearly when predicting several years into the future there will be variation with time and non-linearity. There is also likely heavy variance in the types of individuals/career trajectories, and outliers may be important, etc. 
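
To make the concern in point 2 concrete, here is a minimal simulation sketch. It is not the authors' model or data: the sample sizes, the number of candidate features, and the one-feature linear models are all assumptions chosen for illustration. The outcome is pure noise, so no candidate model has any real predictive value. The code picks the candidate with the best cross-validated R^2 and then compares that same score with the R^2 on a test set that played no part in the selection.

```python
import numpy as np

def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SSE / SST."""
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - np.mean(y_true)) ** 2)
    return 1.0 - ss_res / ss_tot

def fit_predict(x_train, y_train, x_new):
    """Fit a one-feature linear regression and predict at x_new."""
    slope, intercept = np.polyfit(x_train, y_train, 1)
    return slope * x_new + intercept

def cv_r2(x, y, n_folds=5):
    """R^2 of out-of-fold predictions from k-fold cross-validation."""
    idx = np.arange(len(y))
    preds = np.empty(len(y))
    for held_out in np.array_split(idx, n_folds):
        train = np.setdiff1d(idx, held_out)
        preds[held_out] = fit_predict(x[train], y[train], x[held_out])
    return r2(y, preds)

# Hypothetical setup: 60 training "scientists", 1000 test "scientists",
# 200 candidate features, and an outcome unrelated to every feature.
rng = np.random.default_rng(0)
n_train, n_test, n_features = 60, 1000, 200
X = rng.normal(size=(n_train + n_test, n_features))
y = rng.normal(size=n_train + n_test)
X_tr, y_tr = X[:n_train], y[:n_train]
X_te, y_te = X[n_train:], y[n_train:]

# Model selection: keep the feature with the best cross-validated R^2.
scores = np.array([cv_r2(X_tr[:, j], y_tr) for j in range(n_features)])
best = int(np.argmax(scores))

# The same CV score is then (mis)used as the error estimate.
cv_score = scores[best]

# Honest estimate: refit on the training data, score on untouched test data.
test_score = r2(y_te, fit_predict(X_tr[:, best], y_tr, X_te[:, best]))

print(f"CV R^2 of selected model:   {cv_score:.3f}")
print(f"Test R^2 of selected model: {test_score:.3f}")
```

The winning cross-validated score is the maximum over many noisy estimates, so it tends to flatter the selected model, while the held-out score does not. On point 3, recall that R^2 = 1 - SSE/SST by definition, so even an honest R^2 of 0.67 leaves a residual standard deviation of about sqrt(1 - 0.67) ≈ 0.57 times the standard deviation of the outcome.
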
They carefully responded to these criticisms and hopefully they will post their responses in the comments. My impression based on their responses is that the statistics were not as flawed as I originally thought, but that the data aren’t sufficient to form a useful prediction. 
However, I think the much bigger flaw is the basic scientific premise. The h-index has been identified as having major flaws and biases (including gender bias), and as being a generally poor summary of a scientist’s contribution. See here, the list of criticisms here, and the discussion here for starters. The authors of the Nature paper propose a highly inaccurate predictor of this deeply flawed index. While that alone is sufficient to call into question the results in the paper, the authors also make bold claims about their prediction tool: 
"Our formula is particularly useful for funding agencies, peer reviewers and hiring committees who have to deal with vast numbers of applications and can give each only a cursory examination. Statistical techniques have the advantage of returning results instantaneously and in an unbiased way."
Suggesting that this type of prediction should be used to make important decisions on hiring, promotion, and funding is highly scientifically flawed. Coupled with the online calculator the authors handily provide (which produces no measure of uncertainty), it makes it all too easy for people to miss the real value of scientific publications: the science contained in them. 

This is an awesome paper all students in statistics should read

The paper is a review of how to do software development for academics. I saw it via C. Titus Brown (whom we have interviewed), who is also a co-author. How to write software (particularly for other people) is something that is underemphasized in many curricula. But it turns out this is also one of the more important components of disseminating your work in modern applied statistics. My only wish is that there was an accompanying website with resources/links for people to chase down. 


A statistician and Apple fanboy buys a Chromebook...and loves it!

I don’t mean to brag, but I was an early Apple Fanboy - not sure that is something to brag about now that I write it down. I convinced my advisor to go to all Macs in our lab in 2004. Since then I have been pretty dedicated to the brand, dutifully shelling out almost 2g’s every time I need a new laptop. I love the way Macs just work (until they don’t and you need a new laptop).

But I hate the way Apple seems to be dedicated to bleeding every last cent out of me. So I saved up my Christmas gift money (thanks Grandmas!) and bought a Chromebook. It cost me $350 and I was at least in part inspired by these clever ads.

So far I’m super pumped about the performance of the Chromebook. Things I love:

  1. About 10 seconds to boot from shutdown, instantly awake from sleep
  2. Super long battery life - 8 hours a charge might be an underestimate
  3. Size - it's a 12-inch laptop and just right for sitting on my lap and typing
  4. Since everything is cloud-based, nothing to install/optimize

It took me a while to get used to the Browser being the operating system. When I close the last browser window, I expect to see the Desktop. Instead, a new browser window pops up. But that discomfort only lasted a short time. 

It turns out I can do pretty much everything I do on my MacBook on the Chromebook. I can access our department’s computing cluster by turning on developer mode and opening a shell (thanks Caffo!). I can do all my word processing on Google Docs. Email is just Gmail as usual. ScribTeX for LaTeX (Caffo again). Google Music is so awesome I wish I had started my account before I got my Chromebook. The only thing I’m really trying to settle on is a cloud-based code editor with syntax highlighting. I’m open to suggestions (Caffo?). 

I’m starting to think I could bail on Apple….