Simply Statistics A statistics blog by Rafa Irizarry, Roger Peng, and Jeff Leek

The Mystery of Palantir Continues

Palantir, the secretive data science/consulting/software company, continues to be a mystery to most people, but recent reports have not been great. Reuters reports that the U.S. Department of Labor is suing it for employment discrimination:

The lawsuit alleges Palantir routinely eliminated Asian applicants in the resume screening and telephone interview phases, even when they were as qualified as white applicants.

Interestingly, the report cites a statistical argument:

In one example cited by the Labor Department, Palantir reviewed a pool of more than 130 qualified applicants for the role of engineering intern. About 73 percent of applicants were Asian. The lawsuit, which covers Palantir’s conduct between January 2010 and the present, said the company hired 17 non-Asian applicants and four Asians. “The likelihood that this result occurred according to chance is approximately one in a billion,” said the lawsuit, which was filed with the department’s Office of Administrative Law Judges.

Note the use of the phrase “qualified applicants” in reference to the 130. Presumably, there was a screening process that removed “unqualified applicants” and left the 130. Of the 130, 73 were Asian, or about 56%. Presumably, there was a follow-up selection process (interview, exam) that led to 4 Asians being hired out of 21 total hires (about 19%). Clearly there’s a difference between 19% and 56%, but the reasons may not be nefarious. If you assume the number of Asians hired should be proportional to their share of the qualified pool, then the p-value for the observed data is about 0.0006, which is not quite the “1 in a billion” the lawsuit claims. But my guess is the Labor Department has more evidence than this test of binomial proportions if it intends to see the suit through.
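The calculation behind that 0.0006 can be sketched as follows. This is a minimal sketch in Python (the post itself includes no code; R’s binom.test would give the same one-sided tail probability): under the null that each of the 21 hires is Asian with probability 73/130, we compute the probability of seeing 4 or fewer Asian hires.

```python
from math import comb

# Qualified pool: 130 applicants, 73 of whom were Asian (about 56%)
p = 73 / 130   # proportion of Asians among qualified applicants
n = 21         # total hires
k = 4          # Asian hires

# One-sided binomial tail: P(X <= 4) if each hire is independently
# Asian with probability p
p_value = sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k + 1))
print(round(p_value, 4))  # about 0.0006, far from "1 in a billion"
```

A hypergeometric version (sampling 21 hires without replacement from the pool of 130) gives an even smaller tail probability, but still nowhere near one in a billion.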

Alfred Lee from The Information reports that a mutual fund run by Valic sold their shares of Palantir for below the recent valuation:

The Valic fund sold its stake at $4.50 per share, filings show, down from the $11.38 per share at which the company raised money in December. The value of the stake at the sale price was $621,000. Despite the price drop, Valic made money on the deal, as it had acquired stock in preferred fundraisings in 2012 and 2013 at between $3.06 and $3.51 per share.

In my previous post on Palantir, I noted that while other large-scale consulting companies certainly make a lot of money, none have the sky-high valuation that Palantir commands. However, a more “down-to-earth” valuation of $8 billion might be roughly in line with these other companies. It may be bad news for Palantir, but should the company ever have an IPO, it would benefit the public for market participants to recognize the company’s intrinsic value.

Thinking like a statistician: this is not the election for progressives to vote third party

Democratic elections permit us to vote for whoever we perceive is most likely to do well on the issues we care about. Let’s simplify and assume we can quantify how satisfied we are with an elected official’s performance. Denote this quantity with X. Because when we cast our vote we don’t yet know how the candidate will perform, we base our decision on what we expect, denoted here with E(X). Thus we try to maximize E(X). However, both political theory and data tell us that in US presidential elections only two parties have a non-negligible probability of winning. Because E(X) weights performance by the probability of winning, this implies that E(X) is essentially 0 for every other candidate, no matter how large X could potentially be. So what we are really doing is deciding whether E(X-Y) is positive or negative, with X representing one major-party candidate and Y the other.
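The expected-satisfaction argument can be made concrete with a small sketch. The numbers below are entirely made up for illustration; only the structure (satisfaction weighted by win probability) comes from the argument above.

```python
# Hypothetical satisfaction scores (0-10) and win probabilities.
# These values are invented purely to illustrate the E(X) argument.
candidates = {
    "major_party_A": {"satisfaction": 6.0, "win_prob": 0.5},
    "major_party_B": {"satisfaction": 2.0, "win_prob": 0.5},
    "third_party":   {"satisfaction": 9.0, "win_prob": 0.0},
}

# E(X) = satisfaction * probability of winning
expected = {name: c["satisfaction"] * c["win_prob"]
            for name, c in candidates.items()}

# The third-party candidate's E(X) is 0 no matter how large X is,
# so the effective decision is the sign of E(X - Y) between the
# two major-party candidates.
```

Under these made-up numbers E(X) for the third party is exactly 0, even though that candidate’s X is the largest, which is the point of the argument.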

In past elections some progressives have argued that the difference between candidates is negligible and have therefore supported the Green Party ticket. The 2000 election is a notable example. The 2000 election was won by George W. Bush by just five electoral votes. In Florida, which had 25 electoral votes, Bush beat Al Gore by just 537 votes. Green Party candidate Ralph Nader obtained 97,488 votes. Many progressive voters were comfortable voting for Nader because they perceived E(X-Y) to be practically 0.

In contrast, in 2016, I suspect few progressives think that E(X-Y) is anywhere near 0. In the figures below I attempt to quantify the progressive’s pre-election perception of consequences for the last five contests. The first figure shows E(X) and E(Y) and the second shows E(X-Y). Note that despite E(X) being the lowest in the past five elections, E(X-Y) is by far the largest. So if these figures accurately depict your perception and you think like a statistician, it becomes clear that this is not the election to vote third party.



Facebook and left censoring

From the Wall Street Journal:

Several weeks ago, Facebook disclosed in a post on its “Advertiser Help Center” that its metric for the average time users spent watching videos was artificially inflated because it was only factoring in video views of more than three seconds. The company said it was introducing a new metric to fix the problem.

A classic case of left censoring (in this case, by “accident”).

Also this:

Ad buying agency Publicis Media was told by Facebook that the earlier counting method likely overestimated average time spent watching videos by between 60% and 80%, according to a late August letter Publicis Media sent to clients that was reviewed by The Wall Street Journal.

What does this information tell us about the actual time spent watching Facebook videos?
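One way to see the effect is to simulate it. The sketch below assumes, purely for illustration, that watch times follow an exponential distribution with a mean of 5 seconds; Facebook’s actual distribution is unknown. Dropping views under three seconds before averaging, as described above, inflates the reported mean.

```python
import random

random.seed(1)

# Hypothetical watch times in seconds. An exponential with mean 5s
# is assumed only for illustration; the real distribution is unknown.
watch_times = [random.expovariate(1 / 5.0) for _ in range(100_000)]

true_mean = sum(watch_times) / len(watch_times)

# The reported metric averaged only views longer than 3 seconds,
# i.e. the sample is left-censored (here effectively truncated,
# since short views drop out of the average entirely).
over_3s = [t for t in watch_times if t > 3]
censored_mean = sum(over_3s) / len(over_3s)

inflation = censored_mean / true_mean - 1
print(f"true mean: {true_mean:.2f}s, "
      f"censored mean: {censored_mean:.2f}s, "
      f"inflation: {inflation:.0%}")
```

Under the exponential assumption the truncated mean exceeds the true mean by exactly the 3-second threshold (by memorylessness), giving roughly 60% inflation here; a heavier- or lighter-tailed distribution would shift that figure, which is consistent with the 60% to 80% range Publicis was reportedly told.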

Not So Standard Deviations Episode 22 - Number 1 Side Project

Hilary and I celebrate our one-year anniversary of doing the podcast together by discussing whether there are cities that are good for data scientists, reproducible research, and professionalizing data science.

Also, Hilary and I have just published a new book, Conversations on Data Science, which collects some of our episodes in an easy-to-read format. The book is available from Leanpub and will be updated as we record more episodes. If you’re new to the podcast, this is a good way to do some catching up!

If you have questions you’d like us to answer, you can send them to nssdeviations @ or tweet us at @NSSDeviations.

Subscribe to the podcast on iTunes or Google Play.

Please leave us a review on iTunes!

Support us through our Patreon page.

Show Notes:

Download the audio for this episode.

Listen here:

Mastering Software Development in R

Today I’m happy to announce that we’re launching a new specialization on Coursera titled Mastering Software Development in R. This is a 5-course sequence developed with Sean Kross and Brooke Anderson.

This sequence differs from our previous Data Science Specialization because it focuses primarily on using R for developing software. We’ve found that as the field of data science evolves, it is becoming ever more clear that software development skills are essential for producing useful data science results and products. In addition, there is a tremendous need for tooling in the data science universe and we want to train people to build those tools.

The first course, The R Programming Environment, launches today. In the following months, we will launch the remaining courses:

  • Advanced R Programming
  • Building R Packages
  • Building Data Visualization Tools

In addition to the courses, we have a companion textbook that goes along with the sequence. The book is available from Leanpub and is currently in progress (if you get the book now, you will receive free updates as they become available). We will be releasing new chapters of the book alongside the launches of the other courses in the sequence.