An interview with Anthony Goldbloom, CEO of Kaggle. I’m not sure I’d agree with the characterization that all data scientists are creative, curious, and competitive, and certainly those characteristics aren’t unique to data scientists. And I didn’t know this: “We have 65,000 data scientists signed up to Kaggle, and just like with golf tournaments, we have them all ranked from 1 to 65,000.” Check it out, art with R!
An interview with Brad Efron about scientific writing. I haven’t watched the whole interview, but I do know that Efron is one of my favorite writers among statisticians. Slidify, another approach for making HTML5 slides directly from R. I love the idea of making HTML slides and would definitely do this regularly. But there are a couple of issues I feel still aren’t resolved: (1) It is still just a little too hard to change the theme/feel of the slides in my opinion.
An important article about anti-science sentiment in the U.S. (via David S.). The politicization of scientific issues such as global warming, evolution, and healthcare (think vaccination) makes the U.S. less competitive. I think the lack of statistical literacy and training in the U.S. is one of the sources of the problem. People use/skew/mangle statistical analyses and experiments to support their views, and without a statistically well-trained public, it all looks “reasonable and scientific”.
This scientific variant on the #whatshouldwecallme meme isn’t exclusive to statistics, but it is hilarious. This is a really interesting post that is a follow-up to the XKCD password security comic. The thing I find most interesting about this is that researchers realized the key problem with passwords was that we were looking at them purely from a computer science perspective. But _people_ use passwords, so we need a person-focused approach to maximize security.
A fascinating article about the debate on whether to regulate sugary beverages. One of the protagonists is David Allison, a statistical geneticist, among other things. It is fascinating to see the interplay of statistical analysis and public policy. Yet another example of how statistics/data will drive some of the most important policy decisions going forward. A related article is this one on the way risk is reported in the media.
First off, a quick apology for missing last week, and thanks to Augusto for noticing! On to the links: Unbelievably, the BRCA gene patents were upheld by the lower court despite the Supreme Court coming down pretty unequivocally against patenting correlations between metabolites and health outcomes. I wonder if this one will be overturned if it makes it back up to the Supreme Court. A really nice interview with David Spiegelhalter on Statistics and Risk.
An interesting blog post about the top N reasons to do a Ph.D. in bioinformatics or computational biology. A couple of points that I find interesting, and that could be said of any program in biostatistics as well: computing is the key skill of the 21st century, and computational skills are highly transferable. Via Andrew J. Here is an interesting auto-complete map of the United States where the prompt was, “Why is [state] so”.
This is the paper describing how Uri Simonsohn identified academic misconduct using statistical analyses. This approach has received a huge amount of press in the scientific literature. The basic approach is that he calculates the standard deviations of mean/standard deviation estimates across groups being compared. Then he simulates from a Normal distribution and shows that under the Normal model, it is unlikely that the means/standard deviations are so similar.
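To make the idea concrete, here is a minimal sketch of that kind of check in Python. It is not the code or the exact procedure from the paper: the function name, the equal group sizes, the pooled-SD assumption, and the input numbers are all hypothetical and chosen only for illustration. It asks how often Normal sampling would produce group standard deviations at least as similar as the reported ones.

```python
import random
import statistics


def sd_of_sds_pvalue(reported_sds, n_per_group, n_sims=2000, seed=0):
    """Estimate how often Normal sampling yields group SDs as similar as reported.

    Assumptions (not from the paper): equal group sizes and one common true SD,
    estimated by pooling the reported SDs.
    """
    rng = random.Random(seed)
    # How spread out the reported SDs are across groups.
    observed = statistics.stdev(reported_sds)
    # Pooled SD for equal group sizes: root of the mean variance.
    pooled_sd = (sum(s * s for s in reported_sds) / len(reported_sds)) ** 0.5

    hits = 0
    for _ in range(n_sims):
        sim_sds = []
        for _ in reported_sds:
            sample = [rng.gauss(0.0, pooled_sd) for _ in range(n_per_group)]
            sim_sds.append(statistics.stdev(sample))
        # Count simulations whose SDs are at least as similar as the reported ones.
        if statistics.stdev(sim_sds) <= observed:
            hits += 1
    return hits / n_sims


# Hypothetical inputs: three groups of 15 with nearly identical SDs.
p_similar = sd_of_sds_pvalue([10.0, 10.1, 9.9], n_per_group=15)
# Hypothetical inputs: three groups of 15 with clearly different SDs.
p_spread = sd_of_sds_pvalue([5.0, 10.0, 15.0], n_per_group=15)
print(p_similar, p_spread)
```

A small value for the first call says that SDs this close together would rarely arise by chance under the Normal model, which is the flavor of argument described above. On its own it suggests something worth a closer look, not proof of misconduct.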
A really nice list of journals’ software/data release policies from Titus’ blog. Interesting that he couldn’t find a data/release policy for the New England Journal of Medicine. I wonder if that is because it publishes mostly clinical studies, where the data are often protected for privacy reasons? It seems like there is going to eventually be a big discussion of the relative importance of privacy and open data in the clinical world.
Happy Father’s Day! A really interesting read on randomized controlled trials (RCTs) and public policy. The examples in the boxes are fantastic. This seems to be one of the cases where the public policy folks are borrowing ideas from Biostatistics, which has been involved in randomized controlled trials for a long time. It’s a cool example of adapting good ideas in one discipline to the specific challenges of another. Roger points to this link in the NY Times about the “Consumer Genome”, which basically is a collection of information about your purchases and consumer history.
Yelp put a data set online for people to play with, including reviews, star ratings, etc. This could be a really neat data set for a student project. The data they have made available focuses on the area around 30 universities. My alma mater is one of them. A sort of goofy talk about how to choose the optimal marriage partner when viewing the problem as an optimal stopping problem.
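For the optimal stopping item, a small simulation shows the classic version of the problem. This is a sketch under assumptions, not the formulation used in the talk: candidates arrive in random order, each can only be compared with those already seen, a rejected candidate cannot be recalled, and success means ending up with the single best one. The function names and the choice of 50 candidates are hypothetical. The well-known rule is to skip roughly the first n/e candidates and then take the first one who beats everyone seen so far.

```python
import math
import random


def pick_with_cutoff(scores, cutoff):
    """Skip the first `cutoff` candidates, then take the first who beats them all."""
    best_seen = max(scores[:cutoff], default=float("-inf"))
    for s in scores[cutoff:]:
        if s > best_seen:
            return s
    return scores[-1]  # nobody beat the early group, so settle for the last one


def success_rate(n, cutoff, trials=5000, seed=0):
    """Fraction of random orderings in which the rule picks the best candidate."""
    rng = random.Random(seed)
    wins = 0
    for _ in range(trials):
        scores = list(range(n))  # n - 1 is the best candidate
        rng.shuffle(scores)
        if pick_with_cutoff(scores, cutoff) == n - 1:
            wins += 1
    return wins / trials


n = 50
rate_rule = success_rate(n, cutoff=round(n / math.e))  # skip about 37% first
rate_first = success_rate(n, cutoff=0)                 # just take the first one
print(rate_rule, rate_first)
```

With the n/e cutoff the simulated success rate should land near 1/e, roughly 37%, while taking the first candidate succeeds only about 1 time in n.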
Amanda Cox on the process they went through to come up with this graphic about the Facebook IPO. So cool to see how R is used in the development process. A favorite quote of mine, “But rather than bringing clarity, it just sort of looked chaotic, even to the seasoned chart freaks of 620 8th Avenue.” One of the more interesting things about posts like this is you get to see how statistics versus a deadline works.
It’s grant season around here so I’ll be brief: I love this article in the WSJ about the crisis at JP Morgan. The key point it highlights is that looking only at the high-level analysis and summaries can be misleading; you have to look at the raw data to see the potential problems. As data become more complex, I think it’s critical we stay in touch with the raw data, regardless of discipline.
Patenting statistical sampling? I’m pretty sure the Supreme Court that threw out the Mayo Patent wouldn’t have much trouble tossing this patent either. The properties of sampling are a “law of nature”, right? Via Leonid K. This video has me all fired up. It’s called 23 1⁄2 hours and talks about how the best preventative health measure is getting 30 minutes of exercise - just walking - every day. He shows how in some cases this beats doing much more high-tech interventions.
Nature Genetics has an editorial on the Mayo and Myriad cases. I agree with this bit: “In our opinion, it is not new judgments or legislation that are needed but more innovation. In the era of whole-genome sequencing of highly variable genomes, it is increasingly hard to justify exclusive ownership of particularly useful parts of the genome, and method claims must be more carefully described.” Via Andrew J. One of Tech Review’s 10 emerging technologies from a February 2003 article?
Now we know who is to blame for the pie chart. I had no idea it had been around, straining our ability to compare relative areas, since 1801. However, the same guy (William Playfair) apparently also invented the bar chart. So he wouldn’t be totally shunned by statisticians. (via Leonid K.) A nice article in the Guardian about the current group of scientists that are boycotting Elsevier. I have to agree with the quote that leads the article, “All professions are conspiracies against the laity.
This is a great article about the illusion of progress in machine learning. In part, I think it explains why the Leekasso (just using the top 10) isn’t a totally silly idea. I also love how he talks about sources of uncertainty in real prediction problems that aren’t part of the classical models when developing prediction algorithms. I think that this is a hugely underrated component of building an accurate classifier - just finding the quirks particular to a type of data.
The psychologist whose experiment didn’t replicate, and who then went off on the scientists who did the replication experiment, is at it again. I don’t see a clear argument about the facts of the matter in his post, just more name calling. This seems to be a case study in what not to do when your study doesn’t replicate. More on “conceptual replication” in there too. Berkeley is running a data science course with instructors Jeff Hammerbacher and Mike Franklin. I looked through the notes and it looks pretty amazing.
A really interesting proposal by Rafa (in Spanish - we’ll get on him to write a translation) for the University of Puerto Rico. The post concerns changing the focus from simply teaching to creating knowledge and the potential benefits to both the university and to Puerto Rico. It also has a really nice summary of the benefits that the university system in the United States has produced. Definitely worth a read.
This is the big one. ESPN has opened up access to their API! It looks like there may only be access to some of the data for the general public though, does anyone know more? Looks like ESPN isn’t the only sports-related organization in the API mood, Nike plans to open up an API too. It would be great if they had better access to individual, downloadable data. Via Leonid K.
A cool article on Github by the folks at Wired. I’m starting to think the fact that I’m not on Github is a serious dent in my nerd cred. Datawrapper - a less intensive, but less flexible open source data visualization creator. I have seen a few of these types of services starting to pop up. I think that some statistics training should be mandatory before people use them.
An awesome alternative to D3.js - R’s svgAnnotation package. Here’s the paper in JSS. I feel like this is one step away from gaining broad use in the statistics community - it still feels a little complicated building the graphics, but there is plenty of flexibility there. I feel like a great project for a student at any level would be writing some easy wrapper functions for these functions. How to run R on your Android device.
Cool app: you can write out an equation on the screen and it translates the equation to LaTeX. Via Andrew G. Yet another D3 tutorial. Stay tuned for some cool stuff on this front here at Simply Stats in the near future. Via Vishal. Our favorite Greek statistician in the news again. How measurement of academic output harms science. Related: is submitting scientific papers too time consuming? Stay tuned for more on this topic this week.
A really nice D3 tutorial. I’m 100% on board with D3. If they could figure out a way to export the graphics as PDFs, I think this would be the best visualization tool out there. A personalized calculator that tells you which number (of the 7 billion or so) you are, based on your birthday. I’m person 4,590,743,884. Makes me feel so special…. An old post of ours, on dongle communism.
Is the microarray dead? Jeremy Leipzig seems to think that statistical methods for microarrays should be. I’m not convinced. The technology has finally matured to the point we can use it for personalized medicine and we abandon it for the next hot thing? Nod to Andrew for the link. Data from 5 billion webpages available from the Common Crawl. Want to build your own search tool - or just find out what’s on the web?
Statistics help for journalists (don’t forget to keep rating stories!) This is the kind of thing that could grow into a statisteracy page. The author also has a really nice plug for public schools. An interactive graphic to determine if you are in the 1% from the New York Times (I’m not…). Mike Bostock’s d3.js presentation, this is some really impressive visualization software. You have to change the slide numbers manually but it is totally worth it.
A few data/statistics related links of interest: Eric Lander profile. The math of Lego (should be “The statistics of Lego”). Where people are looking for homes. Hans Rosling’s TED Talk on the developing world (an oldie but a goodie). Elsevier is trying to make open access illegal (not strictly statistics related, but a hugely important issue for academics who believe government-funded research should be freely accessible), more here.