

Sunday data/statistics link roundup (6/10)

  1.  Yelp put a data set online for people to play with, including reviews, star ratings, etc. This could be a really neat data set for a student project. The data they have made available focuses on the area around 30 universities. My alma mater is one of them. 
  2. A sort of goofy talk about how to choose the optimal marriage partner when viewing the problem as an optimal stopping problem. The author suggests that you need to date around 196,132 partners to make sure you have made the optimal decision. Fortunately for the Simply Statistics authors, it took many fewer for us all to end up with our optimal matches. Via @fhuszar.
  3. An interesting article on the recent Kaggle contest that sought to identify statistical algorithms that could accurately match human scoring of written essays. Several students in my advanced biostatistics course competed in this competition and did quite well. I understand the need for these kinds of algorithms, since it takes a huge amount of human labor to score these essays well. But it also makes me a bit sad, since it seems even the best algorithms will have a hard time scoring creativity. For example, this phrase from my favorite president doesn’t use big words, but it sure is clever: “I think there is only one quality worse than hardness of heart and that is softness of head.”
  4. A really good article by friend of the blog, Steven, on the perils of gene patents. This part sums it up perfectly, “Genes are not inventions. This simple fact, which no serious scientist would dispute, should be enough to rule them out as the subject of patents.” Simply Statistics has weighed in on this issue a couple of times before. But I think in light of 23andMe’s recent Parkinson’s patent it bears repeating. Here is an awesome summary of the issue from Genomics Lawyer.
  5. A proposal for a really fast statistics journal I wrote about a month or two ago. Expect more on this topic from me this week. 
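The optimal-stopping framing in item 2 is the classical secretary problem: skip roughly the first n/e candidates, then commit to the first one better than everything seen so far, which selects the single best option about 37% of the time. A quick simulation (a minimal sketch; the pool size and trial count here are arbitrary illustrations, not figures from the talk) recovers the 1/e rule:

```python
import math
import random

def simulate_secretary(n, trials=20000, seed=0):
    """Estimate the probability of picking the single best candidate
    when we skip the first n/e candidates, then accept the first one
    better than everything seen so far."""
    rng = random.Random(seed)
    cutoff = int(n / math.e)
    wins = 0
    for _ in range(trials):
        ranks = list(range(n))  # 0 marks the best candidate
        rng.shuffle(ranks)
        # Best rank observed during the look-only phase
        best_seen = min(ranks[:cutoff]) if cutoff else n
        chosen = ranks[-1]  # forced to take the last one if none qualifies
        for r in ranks[cutoff:]:
            if r < best_seen:
                chosen = r
                break
        wins += (chosen == 0)
    return wins / trials

print(simulate_secretary(100))  # hovers near 1/e ≈ 0.368
```

Of course, as the talk's 196,132 figure suggests, the hard part in practice is the size of n, not the stopping rule.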

Sunday data/statistics link roundup (5/27)

  1. Amanda Cox on the process they went through to come up with this graphic about the Facebook IPO. So cool to see how R is used in the development process. A favorite quote of mine: “But rather than bringing clarity, it just sort of looked chaotic, even to the seasoned chart freaks of 620 8th Avenue.” One of the more interesting things about posts like this is that you get to see how statistics gets done under a deadline. This is typically the analyst’s role, since they come in late and there is usually a deadline looming…
  2. An interview with Steve Blank about Silicon Valley and how venture capitalists (VCs) are focused on social technologies since they can make a profit quickly. A depressing/fascinating quote from this one: “If I have a choice of investing in a blockbuster cancer drug that will pay me nothing for ten years, at best, whereas social media will go big in two years, what do you think I’m going to pick? If you’re a VC firm, you’re tossing out your life science division.” He also goes on to say thank goodness for the NIH, NSF, and Google, which are funding interesting “real science” problems. This probably deserves its own post later in the week: the difference between analyzing data because it will make money and analyzing data to solve a hard science problem. The latter usually takes way more patience, and the data take much longer to collect.
  3. An interesting post on how Obama’s analytics department ran an A/B test which improved the number of people who signed up for his mailing list. I don’t necessarily agree with their claim that they helped raise $60 million; there may be confounding factors, since the individuals who sign up with the best combination of image and button don’t necessarily donate as much. But it is still an interesting look into why Obama needs statisticians.
  4. A cute statistics cartoon from @kristin_linn, via Chris V. Yes, we are now shamelessly reposting cute cartoons for retweets :-).
  5. Rafa’s post inspired some interesting conversation both on our blog and on some statistics mailing lists. It seems to me that everyone is making an effort to understand the increasingly diverse field of statistics, but we still have a ways to go. I’m particularly interested in discussion of how we evaluate the contribution/effort behind making good and usable academic software. I think the strength of the Bioconductor community and the rise of Github among academics are a good start. For example, it is really useful that Bioconductor now tracks the number of package downloads.
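The A/B test in item 3 ultimately comes down to comparing two signup proportions. Here is a minimal sketch of that comparison as a two-proportion z-test; the counts are entirely hypothetical and this is just one standard way to test such a split, not necessarily what the campaign's analysts did:

```python
import math

def two_proportion_ztest(success_a, n_a, success_b, n_b):
    """Two-sided z-test for a difference between two signup rates."""
    p_a, p_b = success_a / n_a, success_b / n_b
    # Pooled rate under the null hypothesis of no difference
    p_pool = (success_a + success_b) / (n_a + n_b)
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_a - p_b) / se
    # Two-sided p-value from the normal CDF
    p_value = 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))
    return z, p_value

# Hypothetical traffic split: variant B converts 11.0% vs 8.3% for A
z, p = two_proportion_ztest(830, 10000, 1100, 10000)
print(z, p)
```

A significant difference in signups, of course, says nothing about downstream donations per signup, which is exactly the confounding worry raised above.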

Sunday data/statistics link roundup (5/20)

It’s grant season around here so I’ll be brief:
  1. I love this article in the WSJ about the crisis at JP Morgan. The key point it highlights is that looking only at high-level analyses and summaries can be misleading; you have to look at the raw data to see the potential problems. As data become more complex, I think it’s critical we stay in touch with the raw data, regardless of discipline. At least if I miss something in the raw data I don’t lose a couple billion. Spotted by Leonid K.
  2. On the other hand, this article in the Times drives me a little bonkers. It makes it sound like there is one mathematical model that will solve the obesity epidemic. Lines like this are ridiculous: “Because to do this experimentally would take years. You could find out much more quickly if you did the math.” The obesity epidemic is due to a complex interplay of cultural, sociological, economic, and policy factors. The idea that you could “figure it out” with a set of simple equations is laughable. If you check out their model, it is clearly not the answer to the obesity epidemic. Just another example of why statistics is not math. If you don’t want to hopelessly oversimplify the problem, you need careful data collection, analysis, and interpretation. For a broader look at this problem, check out this article on Science vs. PR. Via Andrew J.
  3. Some cool applications of the raster package in R. This kind of thing is fun for student projects because analyzing images leads to results that are easy to interpret/visualize.
  4. Check out John C.’s really fascinating post on determining when a white-collar worker is great. Inspired by Roger’s post on knowing when someone is good at data analysis. 

Sunday data/statistics link roundup (5/13)

  1. Patenting statistical sampling? I’m pretty sure the Supreme Court that threw out the Mayo patent wouldn’t have much trouble tossing this patent either. The properties of sampling are a “law of nature,” right? Via Leonid K.
  2. This video has me all fired up; it’s called 23 1/2 Hours, and it argues that the best preventive health measure is getting 30 minutes of exercise - just walking - every day. The speaker shows how in some cases this beats much more high-tech interventions. My favorite part of this video is how he uses a ton of statistical/epidemiological terms like “effect sizes,” “meta-analysis,” “longitudinal study,” and “attributable fractions,” but makes them understandable to a broad audience. This is a great example of “statistics for good.”
  3. A very nice collection of 2-minute tutorials in R. This is a great way to teach the concepts, most of which don’t need more than 2 minutes, and it covers a lot of ground. One thing that drives me crazy is when I go into Rafa’s office with a hairy computational problem and he says, “Oh, you didn’t know about function x?” Of course this only happens after I’ve wasted an hour re-inventing the wheel. If more people put up 2-minute tutorials on all the cool tricks they know, we’d all be better off.
  4. A plot using ggplot2, developed by this week’s interviewee Hadley Wickham, appears in the Atlantic! Via David S.
  5. I’m refusing to buy into Apple’s hegemony, so I’m still running OS X 10.5. I’m having trouble getting GitHub up and running. Does anyone have the same problem or know a solution? I know, I know, I’m way behind the times on this…

Sunday data/statistics link roundup (4/29)

  1. Nature genetics has an editorial on the Mayo and Myriad cases. I agree with this bit: “In our opinion, it is not new judgments or legislation that are needed but more innovation. In the era of whole-genome sequencing of highly variable genomes, it is increasingly hard to justify exclusive ownership of particularly useful parts of the genome, and method claims must be more carefully described.” Via Andrew J.
  2. One of Tech Review’s 10 emerging technologies from a February 2003 article? Data mining. I think doing interesting things with data has probably always been a hot topic, it just gets press in cycles. Via Aleks J. 
  3. An infographic in the New York Times compares the profits and taxes of Apple over time, here is an explanation of how they do it. (Via Tim O.)
  4. Saw this tweet via Joe B. I’m not sure if the frequentists or the Bayesians are winning, but it seems to me that the battle no longer matters to my generation of statisticians - there are too many data sets to analyze, better to just use what works!
  5. Statistical and computational algorithms that write news stories. Simply Statistics remains 100% human written (for now). 
  6. The 5 most critical statistical concepts. 

Sunday data/statistics link roundup (4/22)

  1. Now we know who is to blame for the pie chart. I had no idea it had been around, straining our ability to compare relative areas, since 1801. However, the same guy (William Playfair) apparently also invented the bar chart. So he wouldn’t be totally shunned by statisticians. (via Leonid K.)
  2. A nice article in the Guardian about the current group of scientists that are boycotting Elsevier. I have to agree with the quote that leads the article, “All professions are conspiracies against the laity.” On the other hand, I agree with Rafa that academics are partially to blame for buying into the closed access hegemony. I think more than a boycott of a single publisher is needed; we need a change in culture. (first link also via Leonid K)
  3. A blog post on how to add a transparent image layer to a plot. For some reason, I have wanted to do this several times over the last couple of weeks, so the serendipity of seeing it on R Bloggers merited a mention. 
  4. I agree the Earth Institute needs a better graphics advisor. (via Andrew G.)
  5. A great article on why multiple choice tests are used - they are an easy way to collect data on education. But that doesn’t mean they are the right data. This reminds me of the Tukey quote: “The data may not contain the answer. The combination of some data and an aching desire for an answer does not ensure that a reasonable answer can be extracted from a given body of data”. It seems to me if you wanted to have a major positive impact on education right now, the best way would be to develop a new experimental design that collects the kind of data that really demonstrates mastery of reading/math/critical thinking. 
  6. Finally, a bit of a bleg…what is the best way to do the SVD of a huge (think 1e6 x 1e6) sparse matrix in R? Preferably without loading the whole thing into memory…
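On the bleg in item 6: the usual answer is a truncated SVD, which touches only the nonzero entries and computes just the top k singular triplets, so the full dense matrix is never formed. Here is a sketch of that idea in Python with scipy's Lanczos-based `svds` (the matrix size and k are small illustrative stand-ins for the 1e6 x 1e6 case); R packages such as irlba implement the same approach:

```python
import numpy as np
from scipy import sparse
from scipy.sparse.linalg import svds

# A random sparse matrix as a stand-in for the huge real one;
# only the ~3,000 nonzero entries are ever stored.
X = sparse.random(2000, 1500, density=0.001, format="csr", random_state=0)

# Truncated SVD: compute only the k leading singular triplets,
# without densifying X.
k = 5
U, s, Vt = svds(X, k=k)

print(U.shape, s.shape, Vt.shape)  # (2000, 5) (5,) (5, 1500)
```

Memory then scales with the number of nonzeros plus n*k for the factors, rather than with n^2.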

Sunday data/statistics link roundup (4/8)

  1. This is a great article about the illusion of progress in machine learning. In part, I think it explains why the Leekasso (just using the top 10) isn’t a totally silly idea. I also love how he talks about sources of uncertainty in real prediction problems that aren’t part of the classical models when developing prediction algorithms. I think that this is a hugely underrated component of building an accurate classifier - just finding the quirks particular to a type of data. Via @chlalanne.
  2. An interesting post from Michael Eisen on a serious abuse of statistical ideas in the New York Times. The professor of genetics quoted in the story apparently wasn’t aware of the birthday problem. Lack of statistical literacy, even among scientists, is becoming critical. I would love it if the Khan Academy (or some enterprising students) would come up with a set of videos that just explained a bunch of basic statistical concepts - skipping all the hard math and focusing on the ideas.
  3.  TechCrunch finally caught up to our Mayo vs. Prometheus coverage. This decision is going to affect more than just personalized medicine. Speaking of the decision, stay tuned for more on that topic from the folks over here at Simply Statistics. 
  4. How much is a megabyte? I love this question. They asked people on the street how much data is in a megabyte, and the answers were pretty far-ranging. This question is hyper-critical for scientists in the new era, but the better question might be, “How much is a terabyte?”
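The birthday problem from item 2 takes only a few lines to compute exactly: the chance that n people all have distinct birthdays is a product of shrinking fractions, and the complement crosses 50% far sooner than most people guess. A minimal sketch:

```python
def birthday_collision_prob(n, days=365):
    """Exact probability that at least two of n people share a birthday,
    assuming 365 equally likely days and independent birthdays."""
    p_unique = 1.0
    for i in range(n):
        p_unique *= (days - i) / days  # i-th person avoids the first i birthdays
    return 1 - p_unique

# The counterintuitive part: only 23 people gives better-than-even odds.
print(round(birthday_collision_prob(23), 3))  # 0.507
```

The same "many pairs, so coincidences are cheap" logic is exactly what the geneticist in the Times story seems to have missed.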

Sunday data/statistics link roundup (3/25)

  1. The psychologist whose experiment didn’t replicate, and who then went off on the scientists who ran the replication, is at it again. I don’t see a clear argument about the facts of the matter in his post, just more name calling. This seems to be a case study in what not to do when your study doesn’t replicate. More on “conceptual replication” in there too.
  2. Berkeley is running a data science course with instructors Jeff Hammerbacher and Mike Franklin. I looked through the notes and it looks pretty amazing. Stay tuned for more info about my applied statistics class, which starts this week.
  3. A cool article about Factual, one of the companies whose sole mission in life is to collect and distribute data. We’ve linked to them before. We are so out ahead of the Times on this one…
  4. This isn’t statistics related, but I love this post about Jeff Bezos. If we all indulged our inner 11 year old a little more, it wouldn’t be a bad thing. 
  5. If you haven’t had a chance to read Reeves’ guest post on the Mayo Supreme Court decision yet, you should; it is really interesting. A fascinating intersection of law and statistics is going on in the personalized medicine world right now.

Sunday data/statistics link roundup (3/18)

  1. A really interesting proposal by Rafa (in Spanish - we’ll get on him to write a translation) for the University of Puerto Rico. The post concerns changing the focus from simply teaching to creating knowledge and the potential benefits to both the university and to Puerto Rico. It also has a really nice summary of the benefits that the university system in the United States has produced. Definitely worth a read. The comments are also interesting; it looks like Rafa’s post is pretty controversial…
  2. An interesting article suggesting that the Challenger Space Shuttle disaster was at least in part due to bad data visualization. Via @DatainColour
  3. The Snyderome is getting a lot of attention in genomics circles. He used as many new technologies as he could to measure a huge amount of molecular information about his body over time. I am really on board with the excitement about measurement technologies, but this poses a huge challenge for statistics and statistical literacy. If this kind of thing becomes commonplace, the potential for false positives and ghost diagnoses is huge without a really good framework for uncertainty. Via Peter S.
  4. More news about the Nike API. Now that is how to unveil some data! 
  5. Add the Nike API to the list of potential statistics projects for students. 

Sunday data/statistics link roundup (3/11)

  1. This is the big one. ESPN has opened up access to their API! It looks like there may only be access to some of the data for the general public though, does anyone know more? 
  2. Looks like ESPN isn’t the only sports-related organization in the API mood, Nike plans to open up an API too. It would be great if they had better access to individual, downloadable data. 
  3. Via Leonid K.: a highly influential psychology study failed to replicate in a study published in PLoS One. The author of the original study went off on the author of the paper, on PLoS One, and on the reporter who broke the story (including personal attacks!). It looks to me like the authors of the PLoS One paper actually did a more careful study than the original authors. The authors of the PLoS One paper, the reporter, and the editor of PLoS One all replied in a much more reasonable way. See this excellent summary for all the details. Here are a few choice quotes from the comments:

1. But there’s a long tradition in social psychology of experiments as parables,

2. I’d love to write a really long response, but let’s just say: priming methods like these fail to replicate all the time (frequently in my own studies), and the news that one of Bargh’s studies failed to replicate is not surprising to me at all.

3. This distinction between direct and conceptual replication helps to explain why a psychologist isn’t particularly concerned whether Bargh’s finding replicates or not.

4. Reproducible != Replicable in scientific research. But Roger’s perspective on reproducible research still seems appropriate here.