Simply Statistics

A statistics blog by Rafa Irizarry, Roger Peng, and Jeff Leek

The Past and Future of Data Analysis

On May 3rd I gave my Dean’s lecture titled “The Past and Future of Data Analysis”, which was a lot of fun and gave me the opportunity to play lots of different kinds of music on stage! I talked a little bit about it on the latest episode of Not So Standard Deviations. Now the School has posted the full video of the lecture and you can watch it here:

Data on the Comey Effect

There is currently a debate about whether or not the Comey letter flipped the election. Nate Cohn makes a convincing argument that the letter had little to no effect. Some time ago I looked at this myself and came to a similar conclusion. If anything, it was the ACA price hike announcement that had the bigger effect. To test out blogdown (thanks Yihui Xie!) I decided to write this post showing the code I used for the simple analysis I performed, hoping to get others to look at the data, point out mistakes, or show me a better way to do what I did.

Will Machine Learning and AI Ever Solve the Last Mile?

Facebook recently announced that they were hiring 3,000 people (on top of an existing 4,500) to review images, videos, and posts for inappropriate content. From Popular Science: The scale of this labor is vast: Facebook is hiring more people than work in the combined newsrooms of the New York Times, the Wall Street Journal, and the Washington Post. Facebook isn’t saying at this time if the jobs will be employees or contractors, and if they’ll be based in the United States or abroad.

Some default and debt restructuring data

Yesterday the government of Puerto Rico asked for bankruptcy relief in federal court. Puerto Rico owes about $70 billion to bondholders and about $50 billion in pension obligations. Before asking for protection, the government offered to pay back some of the debt (50% according to some news reports), but bondholders refused. Bondholders will now fight in court to recover as much of what is owed as possible while the government and a federal oversight board will try to lower this amount.

Science really is non-partisan: facts and skepticism annoy everybody

This is a short open letter to those who believe scientists have a “liberal bias” and question their objectivity. I suspect that for many conservatives, this Saturday’s March for Science served as confirmation of this fact. In this post I will try to convince you that this is not the case, specifically by pointing out how scientists often annoy the left as much as the right. First, let me emphasize that scientists are highly appreciative of members of Congress and past administrations that have supported Science funding through the DoD, NIH and NSF.

Enrollment, the cost per credit, and strikes at the UPR

The University of Puerto Rico (UPR) receives approximately 800 million dollars from the state each year. This investment allows it to offer higher salaries, which attracts the best professors, to have the best facilities for research and teaching, and to keep its price per credit lower than that of private universities. Thanks to these great advantages, the UPR is usually the first choice of Puerto Rican students, particularly its two largest campuses, Río Piedras (UPRRP) and Mayagüez.


The Importance of Interactive Data Analysis for Data-Driven Discovery

Data analysis workflows and recipes are commonly used in science. They are actually indispensable since reinventing the wheel for each project would result in a colossal waste of time. On the other hand, mindlessly applying a workflow can result in totally wrong conclusions if the required assumptions don’t hold. This is why successful data analysts rely heavily on interactive data analysis (IDA). I write today because I am somewhat concerned that the importance of IDA is not fully appreciated by many of the policy makers and thought leaders who will influence how we access and work with data in the future.

The levels of data science class

In a recent post, Nathan Yau points to a comment by Jake Porway about data science hackathons. They both say that for data science/visualization projects to be successful you have to start with an important question, not with a pile of data. This is the problem-forward, not solution-backward, approach to data science and big data. It is also the approach advocated in the really nice piece on teaching data science by Stephanie and Rafa.

When do we need interpretability?

I just saw a link to an interesting article by Finale Doshi-Velez and Been Kim titled “Towards A Rigorous Science of Interpretable Machine Learning”. From the abstract: Unfortunately, there is little consensus on what interpretability in machine learning is and how to evaluate it for benchmarking. Current interpretability evaluation typically falls into two categories. The first evaluates interpretability in the context of an application: if the system is useful in either a practical application or a simplified version of it, then it must be somehow interpretable.