Simply Statistics

A statistics blog by Rafa Irizarry, Roger Peng, and Jeff Leek

Lowering the GWAS threshold would save millions of dollars

A recent publication (pay-walled) by Boyle et al. introducing the concept of an omnigenic model has generated much discussion. It reminded me of a question I’ve had for a while about the way genetics data is analyzed. Before getting into this, I’ll briefly summarize the general issue. With the completion of the human genome project, human geneticists saw much promise in the possibility of scanning the entire genome for the genes associated with a trait.

The future of education is plain text

I was recently at a National Academy meeting on Envisioning the Data Science Curriculum. It was a fun meeting and one of the questions that came up was what kind of infrastructure do we need to enable shared curricula, compatibility across schools, and not reinventing the wheel. My answer to this question was that we need lecture notes stored in plain text files (like rmarkdown files) and data stored in csv files with direct links.

papr - rate papers on biorxiv in a single swipe and help science!

Like a lot of modern scientists I now find papers to read to a large extent based on what I see on social media. It is a great way to find out what my colleagues are reading and keep up with the newest cool research. A few weeks ago I released a really simple prototype for an app that would improve on this experience, called papr. It's like Tinder but for papers :).

Don't use deep learning your data isn't that big

The other day Brian was at a National Academies meeting and he gave one of his usual classic quotes:

Best quote from NAS DS Round Table: "I mean, do we need deep learning to analyze 30 subjects?" - B Caffo @simplystats #datascienceinreallife (CMU Stats, @CMU_Stats, May 1, 2017)

When I saw that quote I was reminded of the blog post "Don’t use hadoop - your data isn’t that big".

Toward tidy analysis

Tidy data at its heart is a set of three rules for organizing a data set: (1) each variable forms a column, (2) each observation forms a row, and (3) each type of observational unit forms a table. This is an incredibly useful abstraction for thinking about organizing data sets for analysis. In particular any data set that can be conveniently rectangled to look like this fun diagram from Jenny Bryan

[Figure: Data rectangle]
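To make the three rules concrete, here is a minimal sketch in plain Python of reshaping a "wide" table into a tidy one. The data, the column names (`subject`, `visit`, `score`), and the helper `to_tidy` are all made up for illustration; they are not from the original post, and in practice you would likely use a data-frame library instead.

```python
# Hypothetical wide-format data: one row per subject, one column per visit.
# The visit columns hold values of the same variable (score), so this
# layout breaks the rule that each variable forms a column.
wide = [
    {"subject": "A", "visit1": 5.1, "visit2": 5.4},
    {"subject": "B", "visit1": 4.8, "visit2": 4.9},
]


def to_tidy(rows, id_col, var_name, value_name):
    """Reshape wide rows into one row per (id, variable) observation."""
    tidy = []
    for row in rows:
        for col, value in row.items():
            if col == id_col:
                continue  # the identifier is repeated on every output row
            tidy.append({id_col: row[id_col], var_name: col, value_name: value})
    return tidy


# Each output row is one observation; each key is one variable.
tidy = to_tidy(wide, "subject", "visit", "score")
for r in tidy:
    print(r)
```

After reshaping, every row is a single (subject, visit) observation and every column is a single variable, which is the rectangular shape the rules describe.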

The Past and Future of Data Analysis

On May 3rd I gave my Dean’s lecture titled “The Past and Future of Data Analysis”, which was a lot of fun and gave me the opportunity to play lots of different kinds of music on stage! I talked a little bit about it on the latest episode of Not So Standard Deviations. Now the School has posted the full video of the lecture and you can watch it here:

Data on the Comey Effect

There is currently a debate about whether or not the Comey letter flipped the election. Nate Cohn makes a convincing argument that the letter had little to no effect. Some time ago I looked at this myself and came to a similar conclusion. If anything, it was the ACA price hike announcement that had the bigger effect. To test out blogdown (thanks Yihui Xie!) I decided to write this post showing the code I used for the simple analysis I performed, hoping to get others to look at the data, point out mistakes, or show me a better way to do what I did.

Will Machine Learning and AI Ever Solve the Last Mile?

Facebook recently announced that they were hiring 3,000 people (on top of an existing 4,500) to review images, videos, and posts for inappropriate content. From Popular Science: "The scale of this labor is vast: Facebook is hiring more people than work in the combined newsrooms of the New York Times, the Wall Street Journal, and the Washington Post. Facebook isn’t saying at this time if the jobs will be employees or contractors, and if they’ll be based in the United States or abroad."

Some default and debt restructuring data

Yesterday the government of Puerto Rico asked for bankruptcy relief in federal court. Puerto Rico owes about $70 billion to bondholders and about $50 billion in pension obligations. Before asking for protection the government offered to pay back some of the debt (50% according to some news reports) but bondholders refused. Bondholders will now fight in court to recover as much of what is owed as possible while the government and a federal oversight board will try to lower this amount.

Science really is non-partisan: facts and skepticism annoy everybody

This is a short open letter to those who believe scientists have a “liberal bias” and question their objectivity. I suspect that for many conservatives, this Saturday’s March for Science served as confirmation of this fact. In this post I will try to convince you that this is not the case, specifically by pointing out how scientists often annoy the left as much as the right. First, let me emphasize that scientists are highly appreciative of members of Congress and past administrations that have supported Science funding through the DoD, NIH and NSF.