Tag: sunday links

02
Dec

Sunday data/statistics link roundup (12/2/12)

  1. An interview with Anthony Goldbloom, CEO of Kaggle. I'm not sure I'd agree with the characterization that all data scientists are: creative, curious, and competitive and certainly those characteristics aren't unique to data scientists. And I didn't know this: "We have 65,000 data scientists signed up to Kaggle, and just like with golf tournaments, we have them all ranked from 1 to 65,000." 
  2. Check it out, art with R! It's actually pretty interesting to see how they use statistical algorithms to generate different artistic styles. Here are some more. 
  3. Now that Ethan Perlstein's crowdfunding experiment was successful, other people are getting on the bandwagon. If you want to find out what kind of bacteria you have in your gut, for example, you could check out this
  4. I thought I had it rough, but apparently some data analysts spend all their time developing algorithms to detect penis drawings!
  5. Roger was on Anderson Cooper 360 as part of the Building America segment. We can't find the video, but here is the transcript. 
  6. An interesting article on the half-life of facts. I think the analogy is an interesting one and certainly there is research to be done there. But I think it jumps the shark a bit when they start talking about how the moon landing was predictable, etc. I completely believe in the retrospective analysis of knowledge, but predicting things is pretty hard, especially when it is the future.  
18
Nov

Sunday Data/Statistics Link Roundup (11/18/12)

  1. An interview with Brad Efron about scientific writing. I haven’t watched the whole interview, but I do know that Efron is one of my favorite writers among statisticians.
  2. Slidify, another approach for making HTML5 slides directly from R.  I love the idea of making HTML slides, I would definitely do this regularly. But there are a couple of issues I feel still aren’t resolved: (1) It is still just a little too hard to change the theme/feel of the slides in my opinion. It is just CSS, but that’s still just enough of a hurdle that it is keeping me away and (2) I feel that the placement/insertion of images is still a little clunky, Google Docs has figured this out, I’d love it if they integrated the best features of Slidify, Latex, etc. into that system. 
  3. Statistics is still the new hotness. Here is a Business Insider list about 5 statistics problems that will “change the way you think about the world”
  4. I love this one in the New Yorker, especially the line,”statisticians are the new sexy vampires, only even more pasty” (via Brooke A.)
  5. We’ve hit the big time! We have been linked to by a real (Forbes) blogger. 
  6. If you haven’t noticed, we have a new logo. We are going to be making a few other platform-related changes over the next week or so. If you have any trouble, let us know!
28
Oct

Sunday Data/Statistics Link Roundup (10/28/12)

  1. An important article about anti-science sentiment in the U.S. (via David S.). The politicization of scientific issues such as global warming, evolution, and healthcare (think vaccination) makes the U.S. less competitive. I think the lack of statistical literacy and training in the U.S. is one of the sources of the problem. People use/skew/mangle statistical analyses and experiments to support their view and without a statistically well trained public, it all looks “reasonable and scientific”. But when science seems to contradict itself, it loses credibility. Another reason to teach statistics to everyone in high school.
  2. Scientific American was loaded this last week, here is another article on cancer screening.  The article covers several of the issues that make it hard to convince people that screening isn’t always good. The predictive value of the positive confusion is a huge one in cancer screening right now. The author of the piece is someone worth following on Twitter @hildabast.
  3. A bunch of data on the use of Github. Always cool to see new data sets that are worth playing with for student projects, etc. (via Hilary M.). 
  4. A really interesting post over at Stats Chat about why we study seemingly obvious things. Hint, the reason is that “obvious” things aren’t always true. 
  5. A story on “sentiment analysis” by NPR that suggests that most of the variation in a stock’s price during the day can be explained by the number of Facebook likes. Obviously, this is an interesting correlation. Probably more interesting for hedge funders/stockpickers if the correlation was with the change in stock price the next day. (via Dan S.)
  6. Yihui Xie visited our department this week. We had a great time chatting with him about knitr/animation and all the cool work he is doing. Here are his slides from the talk he gave. Particularly check out his idea for a fast journal. You are seeing the future of publishing.  
  7. Bonus Link: R is a trendy open source technology for big data
21
Oct

Sunday Data/Statistics Link Roundup (10/21/12)

  1. This is scientific variant on the #whatshouldwecallme meme isn’t exclusive to statistics, but it is hilarious. 
  2. This is a really interesting post that is a follow-up to the XKCD password security comic. The thing I find most interesting about this is that researchers realized the key problem with passwords was that we were looking at them purely from a computer science perspective. But people use passwords, so we need a person-focused approach to maximize security. This is a very similar idea to our previous post on an experimental foundation for statistics. Looks like Di Cook and others are already way ahead of us on this idea. It would be interesting to redefine optimality incorporating the knowledge that most of the time it is a person running the statistics. 
  3. This is another fascinating article about the math education wars. It starts off as the typical dueling schools issue in academia - two different schools of thought who routinely go after the other side. But the interesting thing here is it sounds like one side of this math debate is being waged by a person collecting data and the other is being waged by a side that isn’t. It is interesting how many areas are being touched by data - including what kind of math we should teach. 
  4. I’m going to visit Minnesota in a couple of weeks. I was so pumped up to be an outlaw. Looks like I’m just a regular law abiding citizen though….
  5. Here are outstanding summaries of what went on at the Carl Morris Big Data conference this last week. Tons of interesting stuff there. Parts one, two, and three
14
Oct

Sunday Data/Statistics Link Roundup (10/14/12)

  1. A fascinating article about the debate on whether to regulate sugary beverages. One of the protagonists is David Allison, a statistical geneticist, among other things. It is fascinating to see the interplay of statistical analysis and public policy. Yet another example of how statistics/data will drive some of the most important policy decisions going forward. 
  2. A related article is this one on the way risk is reported in the media. It is becoming more and more clear that to be an educated member of society now means that you absolutely have to have a basic understanding of the concepts of statistics. Both leaders and the general public are responsible for the danger that lies in misinterpreting/misleading with risk. 
  3. A press release from the Census Bureau about how the choice of college major can have a major impact on career earnings. More data breaking the results down by employment characteristics and major are here and here. These data update some of the data we have talked about before in calculating expected salaries by major. (via Scott Z.)
  4. An interesting article about Recorded Future that describes how they are using social media data etc. to try to predict events that will happen. I think this isn’t an entirely crazy idea, but the thing that always strikes me about these sorts of project is how hard it is to measure success. It is highly unlikely you will ever exactly predict a future event, so how do you define how close you were? For instance, if you predicted an uprising in Egypt, but missed by a month, is that a good or a bad prediction? 
  5. Seriously guys, this is getting embarrassing. An article appears in the New England Journal “finding” an association between chocolate consumption and Nobel prize winners.  This is, of course, a horrible statistical analysis and unless it was a joke to publish it, it is irresponsible of the NEJM to publish. I’ll bet any student in Stat 101 could find the huge flaws with this analysis. If the editors of the major scientific journals want to continue publishing statistical papers, they should get serious about statistical editing.
26
Aug

Sunday data/statistics link roundup (8/26/12)

First off, a quick apology for missing last week, and thanks to Augusto for noticing! On to the links:

  1. Unbelievably the BRCA gene patents were upheld by the lower court despite the Supreme Court coming down pretty unequivocally against patenting correlations between metabolites and health outcomes. I wonder if this one will be overturned if it makes it back up to the Supreme Court. 
  2. A really nice interview with David Spiegelhalter on Statistics and Risk. David runs the Understanding Uncertainty blog and published a recent paper on visualizing uncertainty. My favorite line from the interview might be: “There is a nice quote from Joel Best that “all statistics are social products, the results of people’s efforts”. He says you should always ask, “Why was this statistic created?” Certainly statistics are constructed from things that people have chosen to measure and define, and the numbers that come out of those studies often take on a life of their own.”
  3. For those of you who use Tumblr like we do, here is a cool post on how to put technical content into your blog. My favorite thing I learned about is the Github Gist that can be used to embed syntax-highlighted code.
  4. A few interesting and relatively simple stats for projecting the success of NFL teams.  One thing I love about sports statistics is that they are totally willing to be super ad-hoc and to be super simple. Sometimes this is all you need to be highly predictive (see for example, the results of Football’s Pythagorean Theorem). I’m sure there are tons of more sophisticated analyses out there, but if it ain’t broke… (via Rafa). 
  5. My student Hilary has a new blog that’s worth checking out. Here is a nice review of ProjectTemplate she did. I think the idea of having an organizing principle behind your code is a great one. Hilary likes ProjectTemplate, I think there are a few others out there that might be useful. If you know about them, you should leave a comment on her blog!
  6. This is ridiculously cool. Man City has opened up their data/statistics to the data analytics community. After registering, you have access to many of the statistics the club uses to analyze their players. This is yet another example of open data taking over the world. It’s clear that data generators can create way more value for themselves by releasing cool data, rather than holding it all in house. 
  7. The Portland Public Library has created a website called Book Psychic, basically a recommender system for books. I love this idea. It would be great to have a recommender system for scientific papers
12
Aug

Sunday data/statistics link roundup (8/12/12)

  1. An interesting blog post about the top N reasons to do a Ph.D. in bioinformatics or computational biology. A couple of things that I find interesting and could actually be said of any program in biostatistics as well are: computing is the key skill of the 21st century and computational skills are highly transferrable. Via Andrew J. 
  2. Here is an interesting auto-complete map of the United States where the prompt was, “Why is [state] so”. It seems like using the Google auto-complete functions can lead to all sorts of humorous data, xkcd has used it as a data source a couple of times in the past. By the way, the person(s) who think Idaho is boring haven’t been to the right parts of Idaho. (via Rafa). 
  3. One of my all-time favorite statistics quotes appears in this column by David Brooks: “…what God hath woven together, even multiple regression analysis cannot tear asunder.” It seems like the perfect quote for any study that attempts to build a predictive model for a complicated phenomenon where only limited knowledge of the underlying mechanisms are known. 
  4. I’ve been reading up a lot on how to summarize and communicate risk. At the moment, I’ve been following a lot of David Spiegelhalter’s stuff, and really liked this 30,000 foot view summary.
  5. It is interesting how often you see R popping up in random places these days. Here is a blog post with some clearly R-created plots that appeared on Business Insider about predicting the stock-market. 
  6. Roger and I had a post on MOOC’s this week from the perspective of faculty teaching the courses. For a more departmental/administrative level view, be sure to re-read Rafa’s post on the future of graduate education
22
Jul

Sunday Data/Statistics Link Roundup (7/22/12)

  1. This paper is the paper describing how Uri Simonsohn identified academic misconduct using statistical analyses. This approach has received a huge amount of press in the scientific literature. The basic approach is that he calculates the standard deviations of mean/standard deviation estimates across groups being compared. Then he simulates from a Normal distribution and shows that under the Normal model, it is unlikely that the means/standard deviations are so similar. I think the idea is clever, but I wonder if the Normal model is the best choice here…could the estimates be similar because it was the same experimenter, etc.? I suppose the proof is in the pudding though, several of the papers he identifies have been retracted. 
  2. This is an amazing rant by a history professor at Swarthmore over the development of massive online courses, like the ones Roger, Brian and I are teaching. I think he makes some important points (especially about how we could do the same thing with open access in a heart beat if universities/academics through serious muscle behind it), but I have to say, I’m personally very psyched to be involved in teaching one of these big classes. I think that statistics is a field that a lot of people would like to learn something about and I’d like to make it easier for them to do that because I love statistics. I also see the strong advantage of in-person education. The folks who enroll at Hopkins and take our courses will obviously get way more one-on-one interaction, which is clearly valuable. I don’t see why it has to be one or the other…
  3. An interesting discussion with Facebook’s former head of big data. I think the first point is key. A lot of the “big data” hype has just had to do with the infrastructure needed to deal with all the data we are collecting. The bigger issue (and where statisticians will lead) is figuring out what to do with the data. 
  4. This is a great post about data smuggling. The two key points that I think are raised are: (1) how when the data get big enough, they have their own mass and aren’t going to be moved, and (2) how physically mailing harddrives is still the fastest way of transferring big data sets. That is certainly true in genomics where it is called “sneaker net” when a collaborator walks a hard drive over to our office. Hopefully putting data in physical terms will drive home the point that the new scientists are folks that deal with/manipulate/analyze data. 
  5. Not statistics related, but here is a high-bar to hold your work to: the bus-crash test. If you died in a bus-crash tomorrow, would your discipline notice? Yikes. Via C.T. Brown. 
15
Jul

Sunday Data/Statistics Link Roundup (7/15/12)

  1. A really nice list of journals software/data release policies from Titus’ blog. Interesting that he couldn’t find a data/release policy for the New England Journal of Medicine. I wonder if that is because it publishes mostly clinical studies, where the data are often protected for privacy reasons? It seems like there is going to eventually be a big discussion of the relative importance of privacy and open data in the clinical world. 
  2. Some interesting software that can be used to build virtual workflows for computational science. It seems like a lot of data analysis is still done via “drag and drop” programs. I can’t help but wonder if our effort should be focused on developing drag and drop or educating the next generation of scientists to have minimum scripting capabilities. 
  3. We added StatsChat by Thomas L. and company to our blogroll. Lots of good stuff there, for example, this recent post on when randomized trials don’t help. You can also follow them on twitter.  
  4. A really nice post on processing public data with R. As more and more public data becomes available, from governments, companies, APIs, etc. the ability to quickly obtain, process, and visualize public data is going to be hugely valuable. 
  5. Speaking of public data, you could get it from APIs or from government websites. But beware those category 2 problems
17
Jun

Sunday data/statistics link roundup (6/17)

Happy Father’s Day!

  1. A really interesting read on randomized controlled trials (RCTs) and public policy. The examples in the boxes are fantastic. This seems to be one of the cases where the public policy folks are borrowing ideas from Biostatistics, which has been involved in randomized controlled trials for a long time. It’s a cool example of adapting good ideas in one discipline to the specific challenges of another. 
  2. Roger points to this link in the NY Times about the “Consumer Genome”, which basically is a collection of information about your purchases and consumer history. On Twitter, Leonid K. asks: ‘Since when has “genome” becaome a generic term for “a bunch of information”?’. I completely understand the reaction against the “genome of x”, which is an over-used analogy. I actually think the analogy isn’t that unreasonable; like a genome, the information contained in your purchase/consumer history says something about you, but doesn’t tell the whole picture. I wonder how this information could be used for public health, since it is already being used for advertising….
  3. This PeerJ journal looks like it has the potential to be good.  They even encourage open peer review, which has some benefits. Not sure if it is sustainable, see for example, this breakdown of the costs. I still think we can do better.  
  4. Elon Musk is one of my favorite entrepreneurs. He tackles what I consider to be some of the most awe-inspiring and important problems around. This article about the Tesla S got me all fired up about how a person with vision can literally change the fuel we run on. Nothing to do with statistics, other than I think now is a similarly revolutionary time for our discipline. 
  5. There was some interesting discussion on Twitter of the usefulness of the Yelp dataset I posted for academic research. Not sure if this ever got resolved, but I think more and more as data sets from companies/startups become available, the terms of use for these data will be critical. 
  6. I’m still working on Roger’s puzzle from earlier this week.