Simply Statistics


20 years of Data Science: from Music to Genomics

I finally got around to reading David Donoho's 50 Years of Data Science paper.  I highly recommend it. The following quote seems to summarize the sentiment that motivated the paper, as well as why it has resonated among academic statisticians:

The statistics profession is caught at a confusing moment: the activities which preoccupied it over centuries are now in the limelight, but those activities are claimed to be bright shiny new, and carried out by (although not actually invented by) upstarts and strangers.

The reason we started this blog over four years ago was that, as Jeff wrote in his inaugural post, we were "fired up about the new era where data is abundant and statisticians are scientists". It was clear that many disciplines were becoming data-driven and that interest in data analysis was growing rapidly. We were further motivated because, despite this newfound interest in our work, academic statisticians were, in general, more interested in the development of context-free methods than in leveraging applied statistics to take leadership roles in data-driven projects. Meanwhile, great and highly visible applied statistics work was occurring in other fields such as astronomy, computational biology, computer science, political science and economics. So it was not completely surprising that some (bio)statistics departments were being left out of larger university-wide data science initiatives. Some of our posts exhorted academic departments to embrace larger numbers of applied statisticians:

[M]any of the giants of our discipline were very much interested in solving specific problems in genetics, agriculture, and the social sciences. In fact, many of today’s most widely-applied methods were originally inspired by insights gained by answering very specific scientific questions. I worry that the balance between application and theory has shifted too far away from applications. An unfortunate consequence is that our flagship journals, including our applied journals, are publishing too many methods seeking to solve many problems but actually solving none.  By shifting some of our efforts to solving specific problems we will get closer to the essence of modern problems and will actually inspire more successful generalizable methods.

Donoho points out that John Tukey had a similar preoccupation 50 years ago:

For a long time I have thought I was a statistician, interested in inferences from the particular to the general. But as I have watched mathematical statistics evolve, I have had cause to wonder and to doubt. ... All in all I have come to feel that my central interest is in data analysis, which I take to include, among other things: procedures for analyzing data, techniques for interpreting the results of such procedures, ways of planning the gathering of data to make its analysis easier, more precise or more accurate, and all the machinery and results of (mathematical) statistics which apply to analyzing data

Many applied statisticians do the things Tukey mentions above. In the blog we have encouraged them to teach the gory details of what they do, along with the general methodology we currently teach. With all this in mind, several months ago, when I was invited to give a talk at a department that was, at the time, figuring out its role in its university's data science initiative, I gave a talk titled 20 years of Data Science: from Music to Genomics. The goal was to explain why applied statistician is not considered synonymous with data scientist, even when we focus on the same goal: extracting knowledge or insights from data.

The first example in the talk related to how academic applied statisticians tend to emphasize the parts that will be most appreciated by our math stat colleagues and ignore the aspects that are today being heralded as the linchpins of data science. I used my thesis papers as examples. My dissertation work was about finding meaningful parametrizations of musical sound signals that my collaborators could use to manipulate sounds to create new ones. To do this, I prepared a database of sounds, wrote code to extract and import the digital representations from CDs into S-plus (yes, I'm that old), visualized the data to motivate models, wrote code in C (or was it Fortran?) to make the analysis go faster, and tested these models with residual analysis by ear (you can listen to them here). None of these data science aspects were highlighted in the papers I wrote about my thesis. Here is a screen shot from this paper:

[Screenshot of a page from the paper]

I am actually glad I wrote out and published all the technical details of this work.  It was great training. My point was simply that based on the focus of these papers, this work would not be considered data science.

The rest of my talk described some of the work I did once I transitioned into applications in biology. I was fortunate to have a department chair who appreciated lead-author papers in the subject matter journals as much as statistical methodology papers. This opened the door for me to become a full-fledged applied statistician/data scientist. In the talk I described how developing software packages, planning the gathering of data to aid method development, developing web tools to assess data analysis techniques in the wild, and facilitating data-driven discovery in biology have been very gratifying and, simultaneously, helped my career. However, at some point early in my career, senior members of my department encouraged me to write and submit a methods paper to a statistical journal to go along with every paper I sent to the subject matter journals. Although I do write methods papers when I think the ideas add to the statistical literature, I did not follow the advice to simply write papers for the sake of publishing in statistics journals. Note that if (bio)statistics departments require applied statisticians to do this, then it becomes harder to have an impact as data scientists. Departments that are not producing widely used methodology or successful and visible applied statistics projects (or both) should not be surprised when they are not included in data science initiatives. So, applied statistician, read that Tukey quote again, listen to President Obama, and go do some great data science.




Some Links Related to Randomized Controlled Trials for Policymaking

In response to my previous post, Avi Feller sent me these links related to efforts promoting the use of RCTs and evidence-based approaches for policymaking:

  •  The theme of this year's just-concluded APPAM conference (the national public policy research organization) was "evidence-based policymaking," with a headline panel on using experiments in policy (see here and here).
  • Jeff Liebman has written extensively about the use of randomized experiments in policy (see here for a recent interview).
  • The White House now has an entire office devoted to running randomized trials to improve government performance (the so-called "nudge unit"). Check out their recent annual report here.
  • JPAL North America just launched a major initiative to help state and local governments run randomized trials (see here).

Given the history of medicine, why are randomized trials not used for social policy?

Policy changes can have substantial societal effects. For example, clean water and hygiene policies have saved millions, if not billions, of lives. But the effects are not always positive. For example, Prohibition, or the "noble experiment", boosted organized crime, slowed economic growth, and increased deaths caused by tainted liquor. Good intentions do not guarantee desirable outcomes.

The medical establishment is well aware of the danger of basing decisions on the good intentions of doctors or biomedical researchers. For this reason, randomized controlled trials (RCTs) are the standard approach to determining if a new treatment is safe and effective. In these trials, an objective assessment is achieved by assigning patients at random to a treatment or control group and then comparing the outcomes in the two groups. Probability calculations are used to summarize the evidence for or against the new treatment. Modern RCTs are considered one of the greatest medical advances of the 20th century.
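As a rough illustration of the mechanics, here is a minimal R sketch of a simulated two-arm trial; the sample size and response rates are hypothetical, chosen only to show how random assignment plus a standard test summarize the evidence.

set.seed(1)
n <- 200
treatment <- sample(rep(c("control", "treatment"), each = n / 2))  # random assignment
p <- ifelse(treatment == "treatment", 0.35, 0.25)                  # hypothetical response rates
outcome <- rbinom(n, 1, p)                                         # observed binary outcomes
tapply(outcome, treatment, mean)       # compare outcomes in the two groups
chisq.test(table(treatment, outcome))  # probability calculation summarizing the evidence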

Despite their unprecedented success in medicine, RCTs have not been fully adopted outside of scientific fields. In this post, Ben Goldacre advocates for politicians to learn from scientists and base policy decisions on RCTs. He provides several examples in which results contradicted conventional wisdom. In this TED talk, Esther Duflo convincingly argues that RCTs should be used to determine which interventions are best at fighting poverty. Although some RCTs are being conducted, they are still rare and oftentimes ignored by policymakers. For example, despite at least two RCTs finding that universal pre-K programs are not effective, policymakers in New York are implementing a $400 million a year program. Supporters of this noble endeavor defend their decision by pointing to observational studies and "expert" opinion that support their preconceived views. Before the 1950s, indifference to RCTs was common among medical doctors as well, and the outcomes were at times devastating.

Today, when we compare conclusions from non-RCT studies to RCTs, we see the unintended but strong effects that preconceived notions can have. The first chapter of this book provides a summary and some examples. One example comes from a review of 51 studies of the effectiveness of the portacaval shunt. Here is a table summarizing the conclusions of the 51 studies:

Design                         Marked Improvement   Moderate Improvement   None
No control                     24                   7                      1
Controls, but not randomized   10                   3                      2
Randomized                     0                    1                      3

Compare the first and last columns to appreciate the importance of randomized trials.

A particularly troubling example relates to the studies of Diethylstilbestrol (DES). DES is a drug that was used to prevent spontaneous abortions. Five out of five studies using historical controls found the drug to be effective, yet all three randomized trials found the opposite. Before the randomized trials convinced doctors to stop using this drug, it was given to thousands of women. This turned out to be a tragedy, as later studies showed that DES has terrible side effects. Despite the doctors having the best intentions, ignoring the randomized trials resulted in unintended consequences.

Well-meaning experts regularly implement policies without really testing their effects. Although randomized trials are not always possible, it seems that they are rarely even considered, in particular when the intentions are noble. Just as well-meaning turn-of-the-20th-century doctors, convinced they were doing good, put their patients at risk by providing ineffective treatments, well-intentioned policies may end up hurting society.

Update: A reader pointed me to these preprints, which point out that the control group in one of the cited early education RCTs included children who received care in a range of different settings, not just staying at home. This implies that the signal is attenuated if what we want to know is whether the program is effective for children who would otherwise stay at home. In this preprint they use statistical methodology (the principal stratification framework) to obtain separate estimates: the effect for children who would otherwise go to other center-based care and the effect for children who would otherwise stay at home. They find no effect for the former group but a significant effect for the latter. Note that in this analysis the effect being estimated is no longer based on groups assigned at random. Instead, model assumptions are used to infer the two effects. To avoid dependence on these assumptions we would have to perform an RCT with better defined controls. Also note that the RCT data facilitated the principal stratification analysis. I also want to restate what I've posted before: "I am not saying that observational studies are uninformative. If properly analyzed, observational data can be very valuable. For example, the data supporting smoking as a cause of lung cancer is all observational. Furthermore, there is an entire subfield within statistics (referred to as causal inference) that develops methodologies to deal with observational data. But unfortunately, observational data are commonly misinterpreted."


Biostatistics: It's not what you think it is

My department recently sent me on a recruitment trip for our graduate program. I had the opportunity to chat with undergrads interested in pursuing a career related to data analysis. I found that several did not know about the existence of Departments of Biostatistics and most of the rest thought Biostatistics was the study of clinical trials. We have posted on the need for better marketing for Statistics, but Biostatistics needs it even more. So this post is for students considering a career as applied statisticians or data scientists and who are considering PhD programs.

There are dozens of Biostatistics departments and most run PhD programs. You may never have heard of them because they are usually in schools that undergrads don't regularly frequent: Public Health and Medicine. However, they are very active in research and in teaching graduate students. In fact, the 2014 US News & World Report ranking of Statistics departments includes three Biostat departments in the top five spots. Although clinical trials are a popular area of interest in these departments, there are now many other areas of research. With so many fields of science shifting to data-intensive research, Biostatistics has adapted to work in these areas. Today pretty much any Biostat department will have people working on projects related to genetics, genomics, computational biology, electronic medical records, neuroscience, environmental sciences, epidemiology, health-risk analysis, and clinical decision making. Through collaborations, academic biostatisticians have early access to the cutting-edge datasets produced by public health scientists and biomedical researchers. Our research usually revolves around either developing statistical methods that are used by researchers working in these fields or working directly with a collaborator on data-driven discovery.

How is it different from Statistics? In the grand scheme of things, they are not very different. As implied by the name, Biostatisticians focus on data related to biology while statisticians tend to be more general. However, the underlying theory and skills we learn are similar. In my view, the major difference is that Biostatisticians, in general, tend to be more interested in data and the subject matter, while in Statistics Departments more emphasis is given to the mathematical theory.

What type of job can I get with a PhD in Biostatistics? A well-paying one. And you will have many options to choose from. Our graduates tend to go to academia, industry, or government. Also, the Bio in the name does not keep our graduates from landing non-bio-related jobs, such as in high tech. The reason for this is that the training our students receive and what they learn from research experiences can be widely applied to data analysis challenges.

How should I prepare if I want to apply to a PhD program? First you need to decide if you are going to like it. One way to do this is to participate in one of the many summer programs where you get a glimpse of what we do. My department runs one of these as well. However, as an undergrad I would mainly focus on courses. Undergraduate research experiences are a good way to get an idea of what it's like, but it is difficult to do real research unless you can set aside several hours a week for several consecutive months. This is difficult as an undergrad because you have to make sure to do well in your courses, prepare for the GRE, and get a solid mathematical and computing foundation in order to conduct research later. This is why these programs are usually in the summer.

If you decide to apply to a PhD program, I recommend you take advanced math courses such as Real Analysis and Matrix Algebra. If you plan to develop software for complex datasets, I recommend CS courses that cover algorithms and optimization. Note that programming skills are not the same thing as the theory taught in these CS courses. Programming skills in R will serve you well if you plan to analyze data regardless of what academic route you follow. Python and a low-level language such as C++ are more powerful languages that many biostatisticians use these days.

I think the demand for well-trained researchers that can make sense of data will continue to be on the rise. If you want a fulfilling job where you analyze data for a living, you should consider a PhD in Biostatistics.



We need a statistically rigorous and scientifically meaningful definition of replication

Replication and confirmation are indispensable concepts that help define scientific facts. However, the way in which we reach scientific consensus on a given finding is rather complex. Although some press releases try to convince us otherwise, rarely is one publication enough. In fact, most published results go unnoticed and no attempts to replicate them are made. These are not debunked either; they simply get discarded to the dustbin of history. The very few results that garner enough attention for others to spend time and energy on them are assessed by an ad hoc process involving a community of peers. The assessments are usually a combination of deductive reasoning, direct attempts at replication, and indirect checks obtained by attempting to build on the result in question. This process eventually leads to a result either being accepted by consensus or not. For particularly important cases, an official scientific consensus report may be commissioned by a national academy or an established scientific society. Examples of results that have become part of the scientific consensus in this way include smoking causing lung cancer, HIV causing AIDS, and climate change being caused by humans. In contrast, the published result that vaccines cause autism has been thoroughly debunked by several follow-up studies. In none of these four cases was a simple definition of replication used to confirm or falsify a result. The same is true for most results for which there is consensus. Yet science moves on, and continues to be an incomparable force for improving our quality of life.

Regulatory agencies, such as the FDA, are an exception since they clearly spell out a definition of replication. For example, to approve a drug they may require two independent clinical trials, adequately powered, to show statistical significance at some predetermined level. They also require a large enough effect size to justify the cost and potential risks associated with treatment. This is not to say that FDA approval is equivalent to scientific consensus, but they do provide a clearcut definition of replication.

In response to growing concern over a reproducibility crisis, projects such as the Open Science Collaboration have begun systematically trying to replicate published results. In a recent post, Jeff described one of their recent papers on estimating the reproducibility of psychological science (they really mean replicability; see note below). This Science paper led to lay press reports with eye-catching headlines such as "only 36% of psychology experiments replicate". Note that the 36% figure comes from a definition of replication that mimics the one used by regulatory agencies: results are considered replicated if a p-value < 0.05 was reached in both the original study and the replication. Unfortunately, this definition ignores both effect size and statistical power. If power is not controlled, then the expected proportion of correct findings that replicate can be quite small. For example, if I try to replicate the smoking-causes-lung-cancer result with a sample size of 5, there is a good chance it will not replicate. In his post, Jeff notes that for several of the studies that did not replicate, the 95% confidence intervals intersected. So should intersecting confidence intervals be our definition of replication? This too has a flaw, since it favors imprecise studies with very large confidence intervals. If effect size is ignored, we may waste our time trying to replicate studies reporting practically meaningless findings. Defining replication for published studies in general is not as easy as it is for highly controlled clinical trials. However, one clear improvement over what is currently being done is to consider statistical power and effect sizes.
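To make the power point concrete, here is a small R simulation (the effect size and sample sizes are hypothetical) showing that a perfectly real effect frequently fails the "p < 0.05 in both studies" criterion when the replication study is underpowered.

set.seed(1)
effect <- 0.5  # true difference between groups, in standard deviation units (hypothetical)
B <- 10000
pvalue <- function(n) t.test(rnorm(n), rnorm(n, mean = effect))$p.value
mean(replicate(B, pvalue(100)) < 0.05)  # power with 100 observations per group: about 0.94
mean(replicate(B, pvalue(10)) < 0.05)   # power with 10 per group: under 25%, so a true effect usually fails to "replicate"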

To further illustrate this, let's consider a very concrete example with real-life consequences. Imagine a loved one has a disease with a high mortality rate and asks for your help in evaluating the scientific evidence on treatments. Four experimental drugs are available, all with promising clinical trials that produced p-values < 0.05. However, a replication project redoes the experiments and finds that only the drug A and drug B studies replicate (p < 0.05). So which drug do you take? Let's give a bit more information to help you decide. Here are the p-values for both the original and replication trials:

Drug   Original p-value   Replication p-value   Replicated?
A      0.0001             0.001                 Yes
B      <0.000001          0.03                  Yes
C      0.03               0.06                  No
D      <0.000001          0.10                  No

Which drug would you take now? The information I have provided is based on p-values and is therefore missing a key piece of information: the effect sizes. Below I show the confidence intervals for the four original studies (left) and the four replication studies (right). Note that, except for drug B, all confidence intervals intersect. In light of the figure below, which one would you choose?

[Figure: effect-size estimates with 95% confidence intervals for the four original trials (left) and the four replication trials (right)]
I would be inclined to go with drug D because it has a large effect size, a small p-value, and the replication experiment's effect estimate fell inside a 95% confidence interval. I would definitely not go with A, since it provides marginal benefits even though the trial found a statistically significant effect that was replicated. So the p-value-based definition of replication is of little practical value.

It seems that before continuing the debate over replication, and certainly before declaring that we are in a reproducibility crisis, we need a statistically rigorous and scientifically meaningful definition of replication.  This definition does not necessarily need to be dichotomous (replicated or not) and it will probably require more than one replication experiment and more than one summary statistic: one for effect size and one for uncertainty. In the meantime, we should be careful not to dismiss the current scientific process, which seems to be working rather well at either ignoring or debunking false positive results while producing useful knowledge and discovery.

Footnote on reproducibility versus replicability: As Jeff pointed out, the cited Open Science Collaboration paper is about replication, not reproducibility. A study is considered reproducible if an independent researcher can recreate the tables and figures from the original raw data. Replication is not nearly as simple to define because it involves probability. To replicate the experiment, it has to be performed again, with a different random sample and a new set of measurement errors.


We Used Data to Improve our HarvardX Courses: New Versions Start Oct 15

You can sign up by following the links here.

Last semester we successfully ran version 2 of my Data Analysis course. To create the second version, the first was split into eight courses. Over 2,000 students successfully completed the first of these but, as expected, the numbers were lower for the more advanced courses. We wanted to remove any structural problems keeping students from getting the most out of our courses, so we studied data on the assessment questions, including completion rates and times, and used the findings to make improvements. We also used qualitative data from the discussion board. The major changes in version 3 are the following:

  • We no longer use R packages that Microsoft Windows users had trouble installing in the first course.
  • All courses are now designed to be completed in 4 weeks.
  • We added new assessment questions.
  • We improved the assessment questions determined to be problematic.
  • We split the two courses that students took the longest to complete into smaller modules. Students now have twice as much time to complete these.
  • We consolidated the case studies into one course.
  • We combined the materials from the statistics courses into a book, which you can download here. The material in the book matches the material taught in class, so you can use it to follow along.

You can enroll in any of the seven courses by following the links below. We will be on the discussion boards starting October 15, and we hope to see you there.

  1. Statistics and R for the Life Sciences starts October 15.
  2. Introduction to Linear Models and Matrix Algebra starts November 15.
  3. Statistical Inference and Modeling for High-throughput Experiments starts December 15.
  4. High-Dimensional Data Analysis starts January 15.
  5. Introduction to Bioconductor: Annotation and Analysis of Genomes and Genomic Assays starts February 15.
  6. High-performance Computing for Reproducible Genomics starts March 15.
  7. Case Studies in Functional Genomics starts April 15.

The landing page for the series continues to be here.


Data Analysis for the Life Sciences - a book completely written in R markdown

The book Data Analysis for the Life Sciences is now available on Leanpub.

Data analysis is now part of practically every research project in the life sciences. In this book we use data and computer code to teach the necessary statistical concepts and programming skills to become a data analyst. Following in the footsteps of Stat Labs, instead of showing theory first and then applying it to toy examples, we start with actual applications and describe the theory as it becomes necessary to solve specific challenges. We use simulations and data analysis examples to teach statistical concepts. The book includes links to computer code that readers can use to program along as they read the book.

It includes the following chapters: Inference, Exploratory Data Analysis, Robust Statistics, Matrix Algebra, Linear Models, Inference for High-Dimensional Data, Statistical Modeling, Distance and Dimension Reduction, Practical Machine Learning, and Batch Effects.

The text was written entirely in R markdown and every section contains a link to the document that was used to create that section. This means that you can use knitr to reproduce any section of the book on your own computer. You can also access all these markdown documents directly from GitHub. Please send a pull request if you fix a typo or other mistake! For now we are keeping the R markdown files for the exercises private since they contain the solutions. But you can see the solutions if you take our online course quizzes. If we find that most readers want access to the solutions, we will open them up as well.
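As a quick sketch of what reproducing a section looks like (the file name below is hypothetical; use the .Rmd linked from the section you are interested in):

library(knitr)
knit("inference.Rmd")  # hypothetical file name; produces the section's markdown output
# or render straight to HTML with rmarkdown:
# rmarkdown::render("inference.Rmd")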

The material is based on the online courses I have been teaching with Mike Love. As we created the courses, Mike and I wrote R markdown documents for the students and put them on GitHub. We then used Jekyll to create a webpage with HTML versions of the markdown documents. Jeff then convinced us to publish it on Leanpub. So we wrote a shell script that compiled the entire book into a Leanpub directory, and after countless hours of editing and tinkering we have a 450+ page book with over 200 exercises. The entire book compiles from scratch in about 20 minutes. We hope you like it.


The Next National Library of Medicine Director Can Help Define the Future of Data Science

The main motivation for starting this blog was to share our enthusiasm about the increased importance of data and data analysis in science, industry, and society in general. Based on recent initiatives such as BD2K, it is clear that the NIH is also enthusiastic and very much interested in supporting data science. For those who don't know, the National Institutes of Health (NIH) is the largest public funder of biomedical research in the world. This federal agency has an annual budget of about $30 billion.

The NIH has several institutes, each with its own budget and capability to guide funding decisions. Currently, the missions of most of these institutes relate to a specific disease or public health challenge.  Many of them fund research in statistics and computing because these topics are important components of achieving their specific mission. Currently, however, there is no institute directly tasked with supporting data science per se. This is about to change.

The National Library of Medicine (NLM) is one of the few NIH institutes that is not focused on a particular disease or public health challenge. Apart from the important task of maintaining an actual library, it supports, among many other initiatives, indispensable databases such as PubMed, GenBank and GEO. After over 30 years of successful service as NLM director, Dr. Donald Lindberg stepped down this year and, as is customary, an advisory board was formed to advise the NIH on what's next for the NLM. One of the main recommendations of the report is the following:

NLM should be the intellectual and programmatic epicenter for data science at NIH and stimulate its advancement throughout biomedical research and application.

Data science features prominently throughout the report, making it clear that the NIH is very much interested in further supporting this field. The next director can therefore have an enormous influence on the future of data science. So, if you love data, have administrative experience, and have a vision for the future of data science as it relates to the medical and related sciences, consider this exciting opportunity.

Here is the ad.





Correlation is not a measure of reproducibility

Biologists make wide use of correlation as a measure of reproducibility. Specifically, they quantify reproducibility with the correlation between measurements obtained from replicated experiments. For example, the ENCODE data standards document states

A typical R2 (Pearson) correlation of gene expression (RPKM) between two biological replicates, for RNAs that are detected in both samples using RPKM or read counts, should be between 0.92 to 0.98. Experiments with biological correlations that fall below 0.9 should be either be repeated or explained.

However, for reasons I will explain here, correlation is not necessarily informative with regard to reproducibility. The mathematical results described below are not inconsequential theoretical details, and understanding them will help you assess new technologies, experimental procedures, and computational methods.

Suppose you have collected data from an experiment

x1, x2, ..., xn

and want to determine if  a second experiment replicates these findings. For simplicity, we represent data from the second experiment as adding unbiased (averages out to 0) and statistically independent measurement error d to the first:

y1=x1+d1, y2=x2+d2, ... yn=xn+dn.

For us to claim reproducibility we want the differences

d1=y1-x1, d2=y2-x2,... ,dn=yn-xn

to be "small". To give this some context, imagine the x and y are log scale (base 2) gene expression measurements which implies the d represent log fold changes. If these differences have a standard deviation of 1, it implies that fold changes of 2 are typical between replicates. If our replication experiment produces measurements that are typically twice as big or twice as small as the original, I am not going to claim the measurements are reproduced. However, as it turns out, such terrible reproducibility can still result in correlations higher than 0.92.

To someone basing their definition of correlation on everyday language this may seem surprising, but to someone basing it on the mathematical definition it is not. To see this, note that because d and x are independent:

cor(x, y) = cov(x, y) / (sd(x) sd(y)) = var(x) / sqrt(var(x) (var(x) + var(d))) = 1 / sqrt(1 + var(d)/var(x))

This tells us that correlation summarizes the variability of d relative to the variability of x. Because of the wide range of gene expression values we observe in practice, the standard deviation of x can easily be as large as 3 (a variance of 9). This implies we expect to see correlations as high as 1/sqrt(1 + 1/9) = 0.95, despite the lack of reproducibility when comparing x to y.
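Here is a minimal R simulation of this scenario, with the standard deviations set to the hypothetical values used above (3 for x, 1 for the errors d):

set.seed(1)
n <- 10000
x <- rnorm(n, mean = 8, sd = 3)  # hypothetical log2 expression values
d <- rnorm(n, mean = 0, sd = 1)  # log2 differences: 2-fold changes are typical
y <- x + d                       # the "replicate" measurements
cor(x, y)                        # close to 1/sqrt(1 + 1/9) = 0.95
sd(y - x)                        # about 1, so replicates still typically differ 2-fold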

Note that using Spearman correlation does not fix this problem. A Spearman correlation of 1 tells us that the ranks of x and y are preserved, yet it does not summarize the actual differences. The problem comes down to the fact that we care about the variability of d, and correlation, Pearson or Spearman, does not provide an optimal summary. While correlation relates to the preservation of ranks, a much more appropriate summary of reproducibility is the distance between x and y, which is related to the standard deviation of the differences d. A very simple R command you can use to generate this summary statistic is:

sqrt(mean(d^2))
or the robust version:

median(abs(d)) ##multiply by 1.4826 for unbiased estimate of true sd

The equivalent suggestion for plots is to make an MA-plot instead of a scatterplot.
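A minimal sketch, using simulated replicates like those described above, of why the MA-plot is more informative than the scatterplot:

set.seed(1)
x <- rnorm(10000, mean = 8, sd = 3)  # hypothetical log2 expression, replicate 1
y <- x + rnorm(10000)                # replicate 2, with typical 2-fold differences
par(mfrow = c(1, 2))
plot(x, y, main = "Scatterplot")            # the high correlation hides the problem
plot((x + y) / 2, y - x, main = "MA-plot")  # the size of the differences is obvious
abline(h = 0)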

But aren't correlations and distances directly related? Sort of, and this actually brings up another problem. If x and y are standardized to have average 0 and standard deviation 1 then, yes, correlation and distance are directly related, since the average squared distance is:

mean[(x - y)^2] = 2 (1 - cor(x, y))
However, if instead x and y have different average values, which would put reproducibility into question, then distance is sensitive to this problem while correlation is not. If the standard deviation is 1 but the averages differ, the formula becomes:

mean[(x - y)^2] = 2 (1 - cor(x, y)) + (mean(x) - mean(y))^2
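A quick numerical check of these two relationships with simulated data (the bias of 2 added below is arbitrary, chosen only for illustration):

set.seed(1)
x <- rnorm(1000)
y <- x + rnorm(1000)
xs <- as.numeric(scale(x))  # standardized: average 0, SD 1
ys <- as.numeric(scale(y))
c(mean((xs - ys)^2), 2 * (1 - cor(xs, ys)))  # approximately equal
yb <- ys + 2                                 # introduce a bias in the averages
c(mean((xs - yb)^2), 2 * (1 - cor(xs, yb)) + (mean(xs) - mean(yb))^2)  # distance detects the bias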
Once we consider units (standard deviations different from 1) then the relationship becomes even more complicated. Two advantages of distance you should be aware of are:

  1. it is in the same units as the data, while correlation has no units, which makes it hard to interpret and to select thresholds, and
  2. distance accounts for bias (differences in average), while correlation does not.

A final important point relates to the use of correlation with data that is not approximately normal. The useful interpretation of correlation as a summary statistic stems from the bivariate normal approximation: for every standard unit increase in the first variable, the second variable increases r standard units on average, with r the correlation. A summary of this is here. However, when data is not normal this interpretation no longer holds. Furthermore, heavy-tailed distributions, which are common in genomics, can lead to instability. Here is an example of uncorrelated data with a single point added that leads to correlations close to 1. This is quite common with RNAseq data.
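To see the instability, here is a toy example in R: uncorrelated data plus one extreme point, of the kind RNA-seq counts can produce, yields a Pearson correlation close to 1.

set.seed(1)
x <- c(rnorm(999), 1000)  # 999 uncorrelated points plus one extreme value
y <- c(rnorm(999), 1000)
cor(x, y)                       # close to 1, driven entirely by the single point
cor(x, y, method = "spearman")  # the rank-based correlation stays near 0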




rafalib package now on CRAN

For the last several years I have been collecting functions I routinely use during exploratory data analysis in a private R package. Mike Love and I used some of these in our HarvardX course and now, due to popular demand, I have created man pages and added the rafalib package to CRAN. Mike has made several improvements and added some functions of his own. Here are quick descriptions of the rafalib functions I use most:

mypar - Before making a plot in R I almost always type mypar(). This basically gets around the suboptimal defaults of par. For example, it makes the margins (mar, mgp) smaller and defines RColorBrewer colors as defaults. It is optimized for the RStudio window. Another advantage is that you can type mypar(3,2) instead of par(mfrow=c(3,2)). bigpar() is optimized for R presentations or PowerPoint slides.
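A small usage sketch (assuming rafalib is installed):

library(rafalib)
mypar(1, 2)  # like par(mfrow = c(1, 2)), plus smaller margins and RColorBrewer color defaults
hist(rnorm(100))
plot(rnorm(100), rnorm(100))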

as.fumeric - This function turns characters into factors and then into numerics. This is useful, for example, if you want to plot values x,y with colors defined by their corresponding categories saved in a character vector labs: plot(x, y, col = as.fumeric(labs)).
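For instance, a self-contained sketch with made-up labels:

library(rafalib)
labs <- rep(c("treated", "control"), each = 25)  # hypothetical category labels
x <- rnorm(50); y <- rnorm(50)
mypar()
plot(x, y, col = as.fumeric(labs), pch = 16)     # one color per category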

shist (smooth histogram, pronounced shitz) - I wrote this function because I have a hard time interpreting the y-axis of density plots. The height of the curve drawn by shist can be interpreted as the height of a histogram if you used the units shown on the plot. Also, it automatically draws a smooth histogram for each entry in a matrix on the same plot.

splot (subset plot) - The datasets I work with are typically large enough that plot(x,y) involves millions of points, which is a problem. Several solutions are available to avoid overplotting, such as alpha-blending, hexbinning and 2d kernel smoothing. For reasons I won't explain here, I generally prefer subsampling over these solutions. splot automatically subsamples. You can also specify an index that defines the subset.

sboxplot (smart boxplot) - This function draws points, boxplots or outlier-less boxplots depending on sample size. Coming soon is the kaboxplot (Karl Broman box-plots) for when you have too many boxplots.

install_bioc - For Bioconductor users, this function simply does the source("") for you and then uses BiocLite to install.