09
Sep

Sunday Data/Statistics Link Roundup (9/9/12)

  1. Not necessarily statistics related, but pretty appropriate now that the school year is starting. Here is a little introduction to “how to google” (via Andrew J.). Being able to “just google it” and find answers for oneself without having to resort to asking folks is maybe the #1 most useful skill for a statistician.
  2. A really nice presentation on interactive graphics with the googleVis package. I think one of the most interesting things about the presentation is that it was built with markdown/knitr/slidy (see slide 53). I am seeing more and more of these web-based presentations. I like them for a lot of reasons (ability to incorporate interactive graphics, easy sharing, etc.), although building one is still harder than building a PowerPoint. I also wonder, what happens when you are trying to present somewhere that doesn’t have a good internet connection?
  3. We talked a lot about the ENCODE project this week. We had an interview with Steven Salzberg, then Rafa followed it up with a discussion of top-down vs. bottom-up science. Tons of data from the ENCODE project are now available; there is even a virtual machine with all the software used in the main analysis of the data that was just published. But my favorite quote/tweet/comment this week came from Leonid K. about a flawed, over-the-top piece trying to make a little too much of the ENCODE discoveries: “that’s a clown post, bro”.
  4. Another breathless post from the Chronicle about how there are “dozens of plagiarism cases being reported on Coursera”. Given that tens of thousands of people are taking the course, it would be shocking if there were no plagiarism, but my guess is that it is about the same rate you see in in-person classes. I will be using peer grading in my course; hopefully plagiarism software will be in place by then.
  5. A New York Times article about a new book on visualizing data for scientists/engineers. I love all the attention data visualization is getting. I’ll take a look at the book for sure. I bet it says a lot of the same things Tufte said and a lot of the things Nathan Yau says in his book. This one may just be targeted at scientists/engineers. (link via Dan S.)
  6. Edo and co. are putting together a workshop on the analysis of social network data for NIPS in December. If you do this kind of stuff, it should be a pretty awesome crowd, so get your paper in by the Oct. 15th deadline!
07
Sep

Top-down versus bottom-up science: data analysis edition

In our most recent video, Steven Salzberg discusses the ENCODE project and describes some of the advantages and disadvantages of top-down science. Here, top-down refers to big coordinated projects like the Human Genome Project (HGP). In contrast, the approach of funding many small independent projects, via the R01 mechanism, is referred to as bottom-up. Note that for the cost of the HGP we could have funded thousands of R01s. However, it is not clear that without the HGP we would have had public sequence data as early as we did. As Steven points out, when it comes to data generation the economies of scale make big projects more efficient. But the same is not necessarily true for data analysis.

Big projects like ENCODE and 1000 Genomes include data analysis teams that work in coordination with the data producers. It is true that very good teams are assembled and very good tools developed. But what if, instead of holding the data under embargo until the first analysis is done and a paper (or 30) is published, the data were made publicly available with no restrictions and the scientific community was challenged to compete for data analysis and biological discovery R01s? I have no evidence that this would produce better science, but my intuition is that, at least in the case of data analysis, better methods would be developed. Here is my reasoning. Think of the best 100 data analysts in academia and consider the following two approaches:

1- Pick the best among the 100 and have a small group carefully coordinate with the data producers to develop data analysis methods.

2- Let all 100 take a whack at it and see what falls out.

In scenario 1 the selected group has artificial protection from competing approaches and there are fewer brains generating novel ideas. In scenario 2 the competition would be fierce, and after several rounds of sharing ideas (via publications and conferences), groups would borrow from others and generate even better methods.

Note that the big projects do make the data available and R01s are awarded to develop analysis tools for these data. But this only happens after giving the consortium’s group a substantial head start. 

I have not participated in any of these consortia, and perhaps I am being naive, so I am very interested to hear the opinions of others.