# R Workshop: Reading in Large Data Frames

One question I get a lot is how to read large data frames into R. There are some useful tricks that can save you both time and memory when reading large data frames, but I find that many people are not aware of them. Of course, your ability to read data is limited by your available memory. I usually do a rough calculation along the lines of

# rows * # columns * 8 bytes / 2^20

This gives you the approximate size of the data frame in megabytes; the 8 bytes is the storage for a single numeric (double) value, so the estimate is rough and the true size could be less. If this number is more than half the amount of memory on your computer, then you might run into trouble.
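To make the arithmetic concrete, here is a quick back-of-the-envelope calculation for a hypothetical table of 1,500,000 rows and 120 all-numeric columns (the dimensions are made up for illustration):

rows <- 1500000
cols <- 120
rows * cols * 8 / 2^20  ## about 1373 MB, i.e. roughly 1.34 GB

So on a machine with, say, 2 GB of RAM, a table that size would likely be asking for trouble.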

First, read the help page for ‘read.table’. It contains many hints for how to read in large tables. Of course, help pages tend to be a little confusing, so I’ll try to distill the relevant details here.
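You can pull that help page up from within an R session:

?read.table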

The following options to ‘read.table()’ can affect R’s ability to read large tables:

colClasses

This option takes a vector whose length is equal to the number of columns in your table. Specifying this option instead of using the default can make ‘read.table’ run MUCH faster, often twice as fast. In order to use this option, you have to know the class of each column in your data frame. If all of the columns are “numeric”, for example, then you can just set ‘colClasses = “numeric”’. If the columns are all different classes, or perhaps you just don’t know, then you can have R do some of the work for you.
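To make the simple all-numeric case concrete first, the call might look like this (the file name here is made up for illustration):

tab <- read.table("allnumeric.txt", header = TRUE, colClasses = "numeric")

With that one argument, R can skip the per-column type detection it would otherwise do on every field, which is where the speedup comes from.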

You can read in just a few rows of the table and then create a vector of classes from those rows. For example, if I have a file called “datatable.txt”, I can read in the first 100 rows and determine the column classes from that:

## Read only the first 100 rows to infer each column's class
tab100rows <- read.table("datatable.txt", header = TRUE, nrows = 100)
classes <- sapply(tab100rows, class)
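From here, the natural next step is to feed those classes back into the full read (‘tabAll’ is just an illustrative name):

## Re-read the entire file, supplying the inferred column classes
tabAll <- read.table("datatable.txt", header = TRUE, colClasses = classes)

One caveat: if the first 100 rows are not representative of a column (say, values that look like integers at the top but contain decimals further down), the inferred class can be wrong and the full read will complain, so it is worth a quick look at ‘classes’ before committing to it.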