Simply Statistics


A thanksgiving dplyr Rubik's cube puzzle for you


Nick Carchedi is back visiting from DataCamp and for fun we came up with a dplyr Rubik's cube puzzle. Here is how it works. To solve the puzzle you have to make a 4 x 3 data frame that spells Thanksgiving like this:

To solve the puzzle you need to pipe this data frame in 

and pipe out the Thanksgiving data frame using only the dplyr commands arrange, mutate, slice, filter and select. For advanced users you can try our slightly more complicated puzzle:

See if you can do it this fast. Post your solutions in the comments and Happy Thanksgiving!
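In case the shape of the pipeline is unclear, here is a minimal sketch of the general pattern: a scrambled data frame piped through the allowed verbs. The scrambled data frame below is hypothetical (it just holds the twelve letters of "thanksgiving" in a jumbled order), so this is not a solution to the puzzle, only an illustration of the mechanics.

    library(dplyr)

    # A hypothetical scrambled 4 x 3 data frame holding the letters of
    # "thanksgiving" -- not the actual puzzle input, just an illustration.
    scrambled <- data.frame(
      a = c("g", "t", "k", "i"),
      b = c("n", "h", "s", "v"),
      c = c("i", "a", "g", "n"),
      stringsAsFactors = FALSE
    )

    # The general pattern: pipe the data frame in, apply the allowed verbs,
    # and pipe the rearranged data frame out.
    scrambled %>%
      mutate(row = row_number()) %>%  # add a helper column for reordering
      arrange(row) %>%                # reorder the rows
      filter(row <= 4) %>%            # keep only the rows you need
      select(a, b, c)                 # drop the helper column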


20 years of Data Science: from Music to Genomics


I finally got around to reading David Donoho's 50 Years of Data Science paper.  I highly recommend it. The following quote seems to summarize the sentiment that motivated the paper, as well as why it has resonated among academic statisticians:

The statistics profession is caught at a confusing moment: the activities which preoccupied it over centuries are now in the limelight, but those activities are claimed to be bright shiny new, and carried out by (although not actually invented by) upstarts and strangers.

The reason we started this blog over four years ago was that, as Jeff wrote in his inaugural post, we were "fired up about the new era where data is abundant and statisticians are scientists". It was clear that many disciplines were becoming data-driven and that interest in data analysis was growing rapidly. We were further motivated because, despite this newfound interest in our work, academic statisticians were, in general, more interested in the development of context-free methods than in leveraging applied statistics to take leadership roles in data-driven projects. Meanwhile, great and highly visible applied statistics work was occurring in other fields such as astronomy, computational biology, computer science, political science and economics. So it was not completely surprising that some (bio)statistics departments were being left out of larger university-wide data science initiatives. Some of our posts exhorted academic departments to embrace larger numbers of applied statisticians:

[M]any of the giants of our discipline were very much interested in solving specific problems in genetics, agriculture, and the social sciences. In fact, many of today’s most widely-applied methods were originally inspired by insights gained by answering very specific scientific questions. I worry that the balance between application and theory has shifted too far away from applications. An unfortunate consequence is that our flagship journals, including our applied journals, are publishing too many methods seeking to solve many problems but actually solving none.  By shifting some of our efforts to solving specific problems we will get closer to the essence of modern problems and will actually inspire more successful generalizable methods.

Donoho points out that John Tukey had a similar preoccupation 50 years ago:

For a long time I have thought I was a statistician, interested in inferences from the particular to the general. But as I have watched mathematical statistics evolve, I have had cause to wonder and to doubt. ... All in all I have come to feel that my central interest is in data analysis, which I take to include, among other things: procedures for analyzing data, techniques for interpreting the results of such procedures, ways of planning the gathering of data to make its analysis easier, more precise or more accurate, and all the machinery and results of (mathematical) statistics which apply to analyzing data

Many applied statisticians do the things Tukey mentions above. On the blog we have encouraged them to teach the gory details of what they do, along with the general methodology we currently teach. With all this in mind, several months ago, when I was invited to give a talk at a department that was, at the time, deciphering its role in its university's data science initiative, I gave a talk titled 20 years of Data Science: from Music to Genomics. The goal was to explain why "applied statistician" is not considered synonymous with "data scientist" even when we focus on the same goal: extracting knowledge or insights from data.

The first example in the talk related to how academic applied statisticians tend to emphasize the parts that will be most appreciated by our math stat colleagues and ignore the aspects that are today being heralded as the linchpins of data science. I used my thesis papers as examples. My dissertation work was about finding meaningful parametrization of musical sound signals that my collaborators could use to manipulate sounds to create new ones. To do this, I prepared a database of sounds, wrote code to extract and import the digital representations from CDs into S-plus (yes, I'm that old), visualized the data to motivate models, wrote code in C (or was it Fortran?) to make the analysis go faster, and tested these models with residual analysis by ear (you can listen to them here). None of these data science aspects were highlighted in the papers I wrote about my thesis. Here is a screen shot from this paper:

[Screenshot from the paper]

I am actually glad I wrote out and published all the technical details of this work.  It was great training. My point was simply that based on the focus of these papers, this work would not be considered data science.

The rest of my talk described some of the work I did once I transitioned into applications in Biology. I was fortunate to have a department chair who appreciated lead-author papers in the subject matter journals as much as statistical methodology papers. This opened the door for me to become a full-fledged applied statistician/data scientist. In the talk I described how developing software packages, planning the gathering of data to aid method development, developing web tools to assess data analysis techniques in the wild, and facilitating data-driven discovery in biology have been very gratifying and, simultaneously, have helped my career. However, at some point, early in my career, senior members of my department encouraged me to write and submit a methods paper to a statistical journal to go along with every paper I sent to the subject matter journals. Although I do write methods papers when I think the ideas add to the statistical literature, I did not follow the advice to simply write papers for the sake of publishing in statistics journals. Note that if (bio)statistics departments require applied statisticians to do this, then it becomes harder for them to have an impact as data scientists. Departments that are not producing widely used methodology or successful and visible applied statistics projects (or both) should not be surprised when they are not included in data science initiatives. So, applied statistician, read that Tukey quote again, listen to President Obama, and go do some great data science.




Some Links Related to Randomized Controlled Trials for Policymaking


In response to my previous post, Avi Feller sent me these links related to efforts promoting the use of RCTs  and evidence-based approaches for policymaking:

  •  The theme of this year's just-concluded APPAM conference (the national public policy research organization) was "evidence-based policymaking," with a headline panel on using experiments in policy (see here and here).
  • Jeff Liebman has written extensively about the use of randomized experiments in policy (see here for a recent interview).
  • The White House now has an entire office devoted to running randomized trials to improve government performance (the so-called "nudge unit"). Check out their recent annual report here.
  • JPAL North America just launched a major initiative to help state and local governments run randomized trials (see here).

Given the history of medicine, why are randomized trials not used for social policy?


Policy changes can have substantial societal effects. For example, clean water and  hygiene policies have saved millions, if not billions, of lives. But effects are not always positive. For example, prohibition, or the "noble experiment", boosted organized crime, slowed economic growth and increased deaths caused by tainted liquor. Good intentions do not guarantee desirable outcomes.

The medical establishment is well aware of the danger of basing decisions on the good intentions of doctors or biomedical researchers. For this reason, randomized controlled trials (RCTs) are the standard approach to determining if a new treatment is safe and effective. In these trials an objective assessment is achieved by assigning patients at random to a treatment or control group, and then comparing the outcomes in these two groups. Probability calculations are used to summarize the evidence in favor of or against the new treatment. Modern RCTs are considered one of the greatest medical advances of the 20th century.
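To make the mechanics concrete, here is a minimal R sketch of the logic just described: random assignment followed by a probability calculation comparing the two groups. All of the numbers (sample size, improvement rates) are invented purely for illustration.

    set.seed(1)

    # Randomly assign 200 hypothetical patients to treatment or control
    n <- 200
    arm <- sample(rep(c("treatment", "control"), each = n / 2))

    # Simulate a binary outcome (1 = improved), with made-up rates per arm
    p <- ifelse(arm == "treatment", 0.60, 0.45)
    improved <- rbinom(n, size = 1, prob = p)

    # Summarize the evidence: two-sample test of proportions
    x <- tapply(improved, arm, sum)         # number improved in each arm
    n_arm <- tapply(improved, arm, length)  # patients in each arm
    prop.test(x, n_arm)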

Despite their unprecedented success in medicine, RCTs have not been fully adopted outside of scientific fields. In this post, Ben Goldacre advocates for politicians to learn from scientists and base policy decisions on RCTs. He provides several examples in which results contradicted conventional wisdom. In this TED talk Esther Duflo convincingly argues that RCTs should be used to determine what interventions are best at fighting poverty. Although some RCTs are being conducted, they are still rare and oftentimes ignored by policymakers. For example, despite at least two RCTs finding that universal pre-K programs are not effective, policymakers in New York are implementing a $400 million a year program. Supporters of this noble endeavor defend their decision by pointing to observational studies and "expert" opinion that support their preconceived views. Before the 1950s, indifference to RCTs was common among medical doctors as well, and the outcomes were at times devastating.

Today, when we compare conclusions from non-RCT studies to RCTs, we see the strong, unintended effects that preconceived notions can have. The first chapter in this book provides a summary and some examples. One example comes from a review of 51 studies on the effectiveness of the portacaval shunt. Here is a table summarizing the conclusions of the 51 studies:

Design                      Marked Improvement   Moderate Improvement   None
No controls                 24                   7                      1
Controls, not randomized    10                   3                      2
Randomized                  0                    1                      3

Compare the first and last columns to appreciate the importance of the randomized trials.
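For readers who want to poke at these numbers, here is the same table as an R matrix, along with the share of studies reaching each conclusion under each design (no new data, just a restatement of the table above).

    # The 51 portacaval shunt studies, by design and reported conclusion
    shunt <- matrix(
      c(24, 7, 1,
        10, 3, 2,
         0, 1, 3),
      nrow = 3, byrow = TRUE,
      dimnames = list(
        Design = c("No controls", "Controls, not randomized", "Randomized"),
        Conclusion = c("Marked improvement", "Moderate improvement", "None")
      )
    )

    # Proportion of studies reaching each conclusion, by design
    round(prop.table(shunt, margin = 1), 2)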

A particularly troubling example relates to the studies on Diethylstilbestrol (DES). DES is a drug that was used to prevent spontaneous abortions. Five out of five studies using historical controls found the drug to be effective, yet all three randomized trials found the opposite. Before the randomized trials convinced doctors to stop using this drug, it was given to thousands of women. This turned out to be a tragedy, as later studies showed that DES has terrible side effects. Despite the doctors' best intentions, ignoring the randomized trials resulted in unintended consequences.

Well-meaning experts regularly implement policies without really testing their effects. Although randomized trials are not always possible, they seem to be rarely considered, particularly when the intentions are noble. Just as well-meaning turn-of-the-20th-century doctors, convinced that they were doing good, put their patients at risk by providing ineffective treatments, well-intentioned policies may end up hurting society.

Update: A reader pointed me to these preprints, which point out that the control group in one of the cited early education RCTs included children that received care in a range of different settings, not just staying at home. This implies that the signal is attenuated if what we want to know is whether the program is effective for children that would otherwise stay at home. In this preprint they use statistical methodology (the principal stratification framework) to obtain separate estimates: the effect for children that would otherwise go to other center-based care and the effect for children that would otherwise stay at home. They find no effect for the former group but a significant effect for the latter. Note that in this analysis the effect being estimated is no longer based on groups assigned at random. Instead, model assumptions are used to infer the two effects. To avoid dependence on these assumptions we will have to perform an RCT with better-defined controls. Also note that the RCT data facilitated the principal stratification analysis. I also want to restate what I've posted before, "I am not saying that observational studies are uninformative. If properly analyzed, observational data can be very valuable. For example, the data supporting smoking as a cause of lung cancer is all observational. Furthermore, there is an entire subfield within statistics (referred to as causal inference) that develops methodologies to deal with observational data. But unfortunately, observational data are commonly misinterpreted."


So you are getting crushed on the internet? The new normal for academics.


Roger and I were just talking about all the discussion around the Case and Deaton paper on death rates for middle-aged people. Andrew Gelman, among many others, discussed it; he noticed a potential bias in the analysis and did some re-analysis. Just yesterday an economist blogger wrote a piece about academics versus blogs and how many academics are taken by surprise when they see their paper being discussed so rapidly on the internet. Much of the debate comes down to the speed, tone, and ferocity of internet discussion of academic work - along with the fact that sometimes it isn't fully fleshed out.

I have been seeing this play out not just in the case of this specific paper, but also many other times when folks have been confronted with blogs or the quick publication process of f1000Research. I think it is pretty scary for folks who aren't used to "internet speed" to see this play out, and I thought it would be helpful to make a few points.

  1. Everyone is an internet scientist now. The internet has arrived as part of academics and if you publish a paper that is of interest (or if you are a Nobel prize winner, or if you dispute a claim, etc.) you will see discussion of that paper within a day or two on the blogs. This is now a fact of life.
  2. The internet loves a fight. The internet responds best to personal/angry blog posts or blog posts about controversial topics like p-values, errors, and bias. Almost certainly if someone writes a blog post about your work or an f1000 paper it will be about an error/bias/correction or something personal.
  3. Takedowns are easier than new research and happen faster. It is much, much easier to critique a paper than to design an experiment, collect data, figure out what question to ask, ask it quantitatively, analyze the data, and write it up. This doesn't mean the critique won't be good/right; it just means it will happen much, much faster than it took you to publish the paper, because it is easier to do. All it takes is noticing one little bug in the code or one error in the regression model. So be prepared for speed in the response.

In light of these three things, you have a couple of options about how to react if you write an interesting paper and people are discussing it - which they will certainly do (point 1), in a way that will likely make you uncomfortable (point 2), and faster than you'd expect (point 3). The first thing to keep in mind is that the internet wants you to "fight back" and wants to declare a "winner". Reading about amicable disagreements doesn't build an audience. That is why there is reality TV. So there will be pressure for you to score points, be clever, be fast, and refute every point or be declared the loser. I have found from my own experience that that is what I feel like doing too. I think that resisting this urge is both (a) very, very hard and (b) the right thing to do. I find the best solution is to be proud of your work but be humble, because no paper is perfect and that's ok. If you do the best you can, sensible people will acknowledge that.

I think these are the three ways to respond to rapid internet criticism of your work.

  • Option 1: Respond on internet time. This means if you publish a big paper that you think might be controversial, you should block off a day or two to spend time on the internet responding. You should be ready to do new analysis quickly, be prepared to admit mistakes quickly if they exist, and be prepared to make it clear when there aren't any. You will need social media accounts and you should probably have a blog so you can post longer-form responses. Github/Figshare accounts make it easier to quickly share quantitative/new analyses. Again, your goal is to avoid the personal and stick to facts, so I find that Twitter/Facebook are best for disseminating links to your longer-form responses on blogs/Github/Figshare. If you are going to go this route you should try to respond to as many of the major criticisms as possible, but usually they cluster into one or two specific comments, which you can address all at once.
  • Option 2: Respond in academic time. You might have spent a year writing a paper only to have people respond to it essentially instantaneously. Sometimes they will have good points, but they will rarely have carefully thought-out arguments given the internet-speed response (although remember point 3: good critiques can come together faster than good papers). One approach is to collect all the feedback, ignore the pressure for an immediate response, and write a careful, scientific response which you can publish in a journal or in a fast outlet like f1000Research. I think this route can be the most scientific and productive if executed well. But it will be hard because people will treat the delay like "you didn't have a good answer so you didn't respond immediately". The internet wants a quick winner/loser and that is terrible for science. Even if you choose this route, you should make sure you have a way of publicizing your well-thought-out response - through blogs, social media, etc. - once it is done.
  • Option 3: Do not respond. This is what a lot of people do and I'm unsure if it is ok or not. Clearly internet-facing commentary can have an impact on you, your work, and how it is perceived, for better or worse. So if you ignore it, you are ignoring those consequences. This may be ok, but depending on the severity of the criticism it may be hard to deal with, and it may mean that you have a lot of questions to answer later. Honestly, I think as time goes on, if you write a big paper under a lot of scrutiny, Option 3 is going to go away.

All of this only applies if you write a paper that a ton of people care about/is controversial. Many technical papers won't have this issue and if you keep your claims small, this also probably won't apply. But I thought it was useful to try to work out how to act under this "new normal".


Prediction Markets for Science: What Problem Do They Solve?


I've recently seen a bunch of press on this paper, which describes an experiment with developing a prediction market for scientific results. From FiveThirtyEight:

Although replication is essential for verifying results, the current scientific culture does little to encourage it in most fields. That’s a problem because it means that misleading scientific results, like those from the “shades of gray” study, could be common in the scientific literature. Indeed, a 2005 study claimed that most published research findings are false.


The researchers began by selecting some studies slated for replication in the Reproducibility Project: Psychology — a project that aimed to reproduce 100 studies published in three high-profile psychology journals in 2008. They then recruited psychology researchers to take part in two prediction markets. These are the same types of markets that people use to bet on who’s going to be president. In this case, though, researchers were betting on whether a study would replicate or not.

There are all kinds of prediction markets these days--for politics, general ideas--so having one for scientific ideas is not too controversial. But I'm not sure I see exactly what problem is solved by having a prediction market for science. In the paper, they claim that the market-based bets were better predictors of replication than the general survey that was administered to the scientists. I'll admit that's an interesting result, but I'm not yet convinced.

First off, it's worth noting that this work comes out of the massive replication project conducted by the Center for Open Science, where I believe they have a fundamentally flawed definition of replication. So I'm not sure I can really agree with the idea of basing a prediction market on such a definition, but I'll let that go for now.

The purpose of most markets is some general notion of "price discovery". One popular market is the stock market and I think it's instructive to see how that works. Basically, people continuously bid on the shares of certain companies and markets keep track of all the bids/offers and the completed transactions. If you are interested in finding out what people are willing to pay for a share of Apple, Inc., then it's probably best to look at...what people are willing to pay. That's exactly what the stock market gives you. You only run into trouble when there's no liquidity, so no one shows up to bid/offer, but that would be a problem for any market.

Now, suppose you're interested in finding out what the "true fundamental value" of Apple, Inc. is. Some people think the stock market gives you that at every instant, while others think that the stock market can behave irrationally for long periods of time. Perhaps in the very long run you get a sense of the fundamental value of a company, but that may not be useful information at that point.

What does the market for scientific hypotheses give you? Well, it would be one thing if granting agencies participated in the market. Then, we would never have to write grant applications. The granting agencies could then signal what they'd be willing to pay for different ideas. But that's not what we're talking about.

Here, we're trying to get at whether a given hypothesis is true or not. The only real way to get information about that is to conduct an experiment. How many people betting in the markets will have conducted an experiment? Likely the minority, given that the whole point is to save money by not having people conduct experiments investigating hypotheses that are likely false.

But if market participants aren't contributing real information about an hypothesis, what are they contributing? Well, they're contributing their opinion about an hypothesis. How is that related to science? I'm not sure. Of course, participants could be experts in the field (although not necessarily) and so their opinions will be informed by past results. And ultimately, it's consensus amongst scientists that determines, after repeated experiments, whether an hypothesis is true or not. But at the early stages of investigation, it's not clear how valuable people's opinions are.

In a way, this reminds me of a time a while back when the EPA was soliciting "expert opinion" about the health effects of outdoor air pollution, as if that were a reasonable substitute for collecting actual data on the topic. At least it cost less money--just the price of a conference call.

There's a version of this playing out in the health tech market right now. Companies like Theranos and 23andMe are selling health products that they claim are better than some current benchmark. In particular, Theranos claims its blood tests are accurate when only using a tiny sample of blood. Is this claim true or not? No one outside Theranos knows for sure, but we can look to the financial markets.

Theranos can point to the marketplace and show that people are willing to pay for its products. Indeed, the $9 billion valuation of the private company is another indicator that people...highly value the company. But ultimately, we still don't know if their blood tests are accurate because we don't have any data. If we were to go by the financial markets alone, we would necessarily conclude that their tests are good, because why else would anyone invest so much money in the company?

I think there may be a role to play for prediction markets in science, but I'm not sure discovering the truth about nature is one of them.


Biostatistics: It's not what you think it is


My department recently sent me on a recruitment trip for our graduate program. I had the opportunity to chat with undergrads interested in pursuing a career related to data analysis. I found that several did not know about the existence of Departments of Biostatistics and most of the rest thought Biostatistics was the study of clinical trials. We have posted on the need for better marketing for Statistics, but Biostatistics needs it even more. So this post is for students who are considering a career as applied statisticians or data scientists and are weighing PhD programs.

There are dozens of Biostatistics departments and most run PhD programs. You may never have heard of them because they are usually in schools that undergrads don't regularly frequent: Public Health and Medicine. However, they are very active in research and in teaching graduate students. In fact, the 2014 US News & World Report ranking of Statistics Departments includes three Biostat departments in the top five spots. Although clinical trials are a popular area of interest in these departments, there are now many other areas of research. With so many fields of science shifting to data-intensive research, Biostatistics has adapted to work in these areas. Today pretty much any Biostat department will have people working on projects related to genetics, genomics, computational biology, electronic medical records, neuroscience, environmental sciences, epidemiology, health-risk analysis, and clinical decision making. Through collaborations, academic biostatisticians have early access to the cutting-edge datasets produced by public health scientists and biomedical researchers. Our research usually revolves around either developing statistical methods that are used by researchers working in these fields or working directly with a collaborator on data-driven discovery.

How is it different from Statistics? In the grand scheme of things, they are not very different. As implied by the name, Biostatisticians focus on data related to biology while statisticians tend to be more general. However, the underlying theory and skills we learn are similar. In my view, the major difference is that Biostatisticians, in general, tend to be more interested in data and the subject matter, while in Statistics Departments more emphasis is given to the mathematical theory.

What type of job can I get with a PhD in Biostatistics? A well-paying one. And you will have many options to choose from. Our graduates tend to go into academia, industry, or government. Also, the Bio in the name does not keep our graduates from landing non-bio-related jobs, such as in high tech. The reason for this is that the training our students receive and what they learn from research experiences can be widely applied to data analysis challenges.

How should I prepare if I want to apply to a PhD program? First you need to decide if you are going to like it. One way to do this is to participate in one of the many summer programs where you get a glimpse of what we do. My department runs one of these as well. However, as an undergrad I would mainly focus on courses. Undergraduate research experiences are a good way to get an idea of what it's like, but it is difficult to do real research unless you can set aside several hours a week for several consecutive months. This is difficult as an undergrad because you have to make sure to do well in your courses, prepare for the GRE, and get a solid mathematical and computing foundation in order to conduct research later. This is why these programs are usually in the summer.

If you decide to apply to a PhD program, I recommend you take advanced math courses such as Real Analysis and Matrix Algebra. If you plan to develop software for complex datasets, I recommend CS courses that cover algorithms and optimization. Note that programming skills are not the same thing as the theory taught in these CS courses. Programming skills in R will serve you well if you plan to analyze data regardless of what academic route you follow. Python and a low-level language such as C++ are more powerful languages that many biostatisticians use these days.

I think the demand for well-trained researchers that can make sense of data will continue to be on the rise. If you want a fulfilling job where you analyze data for a living, you should consider a PhD in Biostatistics.



Not So Standard Deviations: Episode 4 - A Gajillion Time Series


Episode 4 of Not So Standard Deviations is hot off the audio editor. In this episode Hilary first explains to me what the heck DevOps is, and then we talk about the statistical challenges of detecting rare events in an enormous set of time series data. There's also some discussion of Ben and Jerry's and the t-test, so you'll want to hang on for that.




How I decide when to trust an R package


One thing that I've given a lot of thought to recently is the process that I use to decide whether I trust an R package or not. Kasper Hansen took a break from trolling me on Twitter to talk about how he trusts packages on Github less than packages that are on CRAN, and particularly Bioconductor. He makes a couple of points that I think are very relevant. First, that having a package on CRAN/Bioconductor raises trust in that package:

The primary reason is that Bioc/CRAN demonstrate something about the developer's willingness to do the boring but critically important parts of package development, like documentation, vignettes, minimum coding standards, and being sure that their code isn't just a rehash of something else.

The other big point Kasper made was the difference between a repository, which is user-oriented and should provide certain guarantees, and Github, which is a developer platform that makes things easier/better for developers but doesn't have a user guarantee system in place.

This discussion got me thinking about when/how I depend on R packages and how I make that decision. The scenarios where I depend on R packages are:

  1. Quick and dirty analyses for myself
  2. Shareable data analyses that I hope are reproducible
  3. As dependencies of R packages I maintain

As you move from 1 to 3 it becomes more and more of a pain if the package I'm depending on breaks. If it is just something I was doing for fun, it's not that big a deal. But if it means I have to rewrite/recheck/rerelease my R package, then that is a much bigger headache.

So my scale for how stringent I am about relying on packages varies by the type of activity, but what are the criteria I use to measure how trustworthy a package is? For me, the criteria are in this order:

  1. People prior 
  2. Forced competence
  3. Indirect data

I'll explain each criterion in a minute, but the main purpose of using these criteria is (a) to ensure that I'm using a package that works and (b) to ensure that if the package breaks I can trust it will be fixed, or at least that I can get some help from the developer.

People prior

The first thing I do when I look at a package I might depend on is look at who the developer is. If that person is someone I know has developed widely used, reliable software and who quickly responds to requests/feedback then I immediately trust the package. I have a list of people like Brian, or Hadley, or Jenny, or Rafa, who could post their package just as a link to their website and I would trust it. It turns out almost all of these folks end up putting their packages on CRAN/Bioconductor anyway. But even if they didn't I assume that the reason is either (a) the package is very new or (b) they have a really good reason for not distributing it through the normal channels.

Forced competence

For people who I don't know or whose software I've never used, I have very little confidence in the package a priori. This is because there are a ton of people developing R packages now, with highly variable levels of commitment to making them work. So as a placeholder for all the variables I don't know about them, I use the repository they choose as a surrogate. My personal prior on the trustworthiness of a package from someone I don't know goes something like this:

[Figure: prior trust in a package, from highest to lowest: Bioconductor, then CRAN, then Github]

This prior is based on the idea of forced competence. In general, you have to do more to get a package approved on Bioconductor than on CRAN (for example you have to have a good vignette) and you have to do more to get a package on CRAN (pass R CMD CHECK and survive the review process) than to put it on Github.

This prior isn't perfect, but it does tell me something about how much the person cares about their package. If they go to the work of getting it on CRAN/Bioc, then at least they cared enough to document it. They are at least forced to be minimally competent - at least at the time of submission and enough for the packages to still pass checks.

Indirect data

After I've applied my priors I then typically look at the data. For Bioconductor I look at the badges: how often the package is downloaded, whether it passes the checks, and how well it is covered by tests. I'm already inclined to trust it a bit since it is on that platform, but I use the data to adjust my prior. For CRAN I might look at the download stats provided by Rstudio. The interesting thing is that, as John Muschelli points out, Github actually has the most indirect data available for a package.

If I'm going to use a package that is on Github from a person who isn't on my prior list of people to trust, then I look at a few things. The number of stars/forks/watchers is a quick and dirty estimate of how widely used a package is. I also look very carefully at how many commits the person has submitted, both to the package in question and to all their other packages over the last couple of months. If the person isn't actively developing either the package or anything else on Github, that is a bad sign. I also look to see how quickly they have responded to issues/bug reports on the package in the past, if possible. One idea I haven't used, but I think is a good one, is to submit an issue for a trivial change to the package and see if I get a response very quickly. Finally, I look to see if they have some demonstration that their package works across platforms (say with a travis badge). If the package is highly starred, frequently maintained, has all issues responded to and up to date, and passes checks on all platforms, then that data might overwhelm my prior and I'd go ahead and trust the package.
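As a rough illustration of pulling this kind of indirect data programmatically, here is a sketch using the cranlogs package for CRAN download counts and the gh package for Github metadata. The package and repository names are arbitrary examples, and both helper packages are assumed to be installed.

    library(cranlogs)
    library(gh)

    # CRAN download counts over the last month (from the RStudio mirror logs)
    downloads <- cran_downloads(packages = "dplyr", when = "last-month")
    sum(downloads$count)

    # Github metadata: stars, forks, open issues, and date of last push
    repo <- gh("GET /repos/tidyverse/dplyr")
    repo$stargazers_count
    repo$forks_count
    repo$open_issues_count
    repo$pushed_at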


In general, one of the best things about the R ecosystem is being able to rely on other packages so that you don't have to write everything from scratch. But there is a hard balance to strike in keeping the dependency list small. One way I maintain this balance is to use the strategy I've outlined above, so that I worry less about whether my dependencies are trustworthy.


The Statistics Identity Crisis: Am I a Data Scientist?


The joint ASA/Simply Statistics webinar on the statistics identity crisis is now live!