New Feather Format for Data Frames

Roger Peng
2016-03-31

This past Tuesday, Hadley Wickham and Wes McKinney announced a new binary file format specifically for storing data frames.

“One thing that struck us was that, while R’s data frames and Python’s pandas data frames utilize different internal memory representations, the semantics of their user data types are mostly the same. In both R and pandas, data frames contain lists of named, equal-length columns, which can be numeric, boolean, and date-and-time, categorical (factors), or string. Additionally, these columns must support missing (null) values.”
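To make that shared model concrete, here is a rough sketch in plain Python of a table as a set of named, equal-length columns that allow missing values. This is only an illustration of the semantics described above, not how R, pandas, or Feather actually store data. The names frame, group_levels, check_frame, and count_missing are made up for this example, and None stands in for a missing value.

```python
from datetime import datetime

# Hypothetical toy "data frame": a dict mapping column names to equal-length lists.
# None marks a missing (null) value in any column.
frame = {
    "id": [1, 2, 3],                                                   # numeric
    "passed": [True, None, False],                                     # boolean
    "seen_at": [datetime(2016, 3, 29), datetime(2016, 3, 30), None],   # date-and-time
    "group": ["a", "b", "a"],                                          # categorical (factor)
    "comment": ["ok", None, "late"],                                   # string
}

# Assumed convention for this sketch: the allowed levels of the categorical
# column are kept in a separate list.
group_levels = ["a", "b"]


def check_frame(columns):
    """Return the shared row count, or raise ValueError if column lengths differ."""
    lengths = {name: len(values) for name, values in columns.items()}
    if len(set(lengths.values())) > 1:
        raise ValueError(f"columns differ in length: {lengths}")
    return next(iter(lengths.values()), 0)


def count_missing(columns):
    """Count the None entries in each column."""
    return {
        name: sum(value is None for value in values)
        for name, values in columns.items()
    }


n_rows = check_frame(frame)
missing = count_missing(frame)
```

Any format that wants to serve both languages has to carry the column names, the column types, and the missing-value markers, and those are the pieces this toy structure makes explicit.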

Their work builds on the Apache Arrow project, which specifies a format for tabular data. There are currently Python and R implementations for reading and writing these files, but other implementations could easily be built, as the file format looks pretty straightforward. The git repository is here.

Initial thoughts:

I’ve only had a chance to look quickly at the code, but I’m excited to see what comes next.