Simply Statistics

12
Aug

UCLA Statistics 2015 Commencement Address

I was asked to speak at the UCLA Department of Statistics Commencement Ceremony this past June. As one of the first graduates of that department back in 2003, I was tremendously honored to be invited to speak to the graduates. When I arrived I was just shocked at how much the department had grown. When I graduated I think there were no more than 10 of us between the PhD and Master's programs. Now they have ~90 graduates per year with undergrad, Master's and PhD. It was just stunning.

Here's the text of what I said, which I think I mostly stuck to in the actual speech.

 

UCLA Statistics Graduation: Some thoughts on a career in statistics

When I asked Rick [Schoenberg] what I should talk about, he said to "talk for 95 minutes on asymptotic properties of maximum likelihood estimators under nonstandard conditions". I thought, this is a great opportunity! I busted out Tom Ferguson’s book and went through my old notes. Here we go. Let X be a complete normed vector space….

I want to thank the department for inviting me here today. It’s always good to be back. I entered the UCLA stat department in 1999, only the second entering class, and graduated from UCLA Stat in 2003. Things were different then. Jan was the chair and there were not many classes so we could basically do whatever we wanted. Things are different now and that’s a good thing. Since 2003, I’ve been at the Department of Biostatistics at the Johns Hopkins Bloomberg School of Public Health, where I was first a postdoctoral fellow and then joined the faculty. It’s been a wonderful place for me to grow up and I’ve learned a lot there.

It’s just an incredible time to be a statistician. You guys timed it just right. I’ve been lucky enough to witness two periods like this, the first time being when I graduated from college at the height of the dot-com boom. Today, it’s not computer programming skills that the world needs, but rather it’s statistical skills. I wish I were in your shoes today, just getting ready to start up. But since I’m not, I figured the best thing I could do is share some of the things I’ve learned and talk about the role that these things have played in my own life.

Know your edge: What’s the one thing that you know that no one else seems to know? You’re not a clone—you have original ideas and skills. You might think they’re not valuable but you’re wrong. Be proud of these ideas and use them to your advantage. As an example, I’ll give you my one thing. Right now, I believe the greatest challenge facing the field of statistics today is getting the entire world to know what we in this room already know. Data are everywhere today and the biggest barrier to progress is our collective inability to process and analyze those data to produce useful information. The need for the things that we know has absolutely exploded and we simply have not caught up. That’s why I created, along with Jeff Leek and Brian Caffo, the Johns Hopkins Data Science Specialization, which is currently the most successful massive open online course program ever. Our goal is to teach the entire world statistics, which we think is an essential skill. We’re not quite there yet, but—assuming you guys don’t steal my idea—I’m hopeful that we’ll get there sometime soon.

At some point the edge you have will no longer work: That sounds like a bad thing, but it’s actually good. If what you’re doing really matters, then at some point everyone will be doing it. So you’ll need to find something else. I’ve been confronted with this problem at least 3 times in my life so far. Before college, I was pretty good at the violin, and it opened a lot of doors for me. It got me into Yale. But when I got to Yale, I quickly realized that there were a lot of really good violinists there. Suddenly, my talent didn’t have so much value. This was when I started to pick up computer programming and in 1998 I learned an obscure little language called R. When I got to UCLA I realized I was one of the only people who knew R. So I started a little brown bag lunch series where I’d talk about some feature of R to whoever would show up (which wasn’t many people usually). Picking up on R early on turned out to be really important because it was a small community back then and it was easy to have a big impact. Also, as more and more people wanted to learn R, they’d usually call on me. It’s always nice to feel needed. Over the years, the R community exploded and R’s popularity got to the point where it was being talked about in the New York Times. But now you see the problem. Saying that you know R doesn’t exactly distinguish you anymore, so it’s time to move on again. These days, I’m realizing that the one useful skill that I have is the ability to make movies. Also, my experience being a performer on the violin many years ago is coming in handy. My ability to quickly record and edit movies was one of the key factors that enabled me to create an entire online data science program in 2 months last year.

Find the right people, and stick with them forever. Being a statistician means working with other people. Choose those people wisely and develop a strong relationship. It doesn’t matter how great the project is or how famous or interesting the other person is, if you can’t get along then bad things will happen. Statistics and data analysis is a highly verbal process that requires constant and very clear communication. If you’re uncomfortable with someone in any way, everything will suffer. Data analysis is unique in this way—our success depends critically on other people. I’ve only had a few collaborators in the past 12 years, but I love them like family. When I work with these people, I don’t necessarily know what will happen, but I know it will be good. In the end, I honestly don’t think I’ll remember the details of the work that I did, but I’ll remember the people I worked with and the relationships I built.

So I hope you weren’t expecting a new asymptotic theorem today, because this is pretty much all I’ve got. As you all go on to the next phase of your life, just be confident in your own ideas, be prepared to change and learn new things, and find the right people to do them with. Thank you.

09
Aug

Statistical Theory is our "Write Once, Run Anywhere"

Having followed the software industry as a casual bystander, I periodically see the tension flare up between the idea of writing "native apps", software that is tuned to a particular platform (Windows, Mac, etc.) and more cross-platform apps, which run on many platforms without too much modification. Over the years it has come up in many different forms, but the fundamentals are the same. Back in the day, there was Java, which was supposed to be the platform that ran on any computing device. Sun Microsystems originated the phrase "Write Once, Run Anywhere" to illustrate the cross-platform strengths of Java. More recently, Steve Jobs famously banned Flash from any iOS device. Apple is also moving away from standards like OpenGL and towards its own Metal platform.

What's the problem with "write once, run anywhere", or of cross-platform development more generally, assuming it's possible? Well, there are a number of issues: often there are performance penalties, it may be difficult to use the native look and feel of a platform, and you may be reduced to using the "lowest common denominator" of feature sets. It seems to me that anytime a new meta-platform comes out that promises to relieve programmers of the burden of having to write for multiple platforms, it eventually gets modified or subsumed by the need to optimize apps for a given platform as much as possible. The need to squeeze as much juice out of an app seems to be too important an opportunity to pass up.

In statistics, theory and theorems are our version of "write once, run anywhere". The basic idea is that theorems provide an abstract layer (a "virtual machine") that allows us to reason across a large number of specific problems. Think of the central limit theorem, probably our most popular theorem. It could be applied to any problem/situation where you have a notion of sample size that could in principle be increasing.

But can it be applied to every situation, or even any situation? This might be more of a philosophical question, given that the CLT is stated asymptotically (maybe we'll find out the answer eventually). In practice, my experience is that many people attempt to apply it to problems where it likely is not appropriate. Think, large-scale studies with a sample size of 10. Many people will use Normal-based confidence intervals in those situations, but they probably have very poor coverage.
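To make the coverage concern concrete, here is a minimal simulation sketch in R. The choice of exponential data, the number of simulations, and the helper name `coverage()` are all assumptions made for illustration; the point is only to compare how often a Normal-based 95% interval for the mean contains the true mean at n = 10 versus a much larger n.

```r
# Sketch: empirical coverage of the Normal-based 95% interval for a mean.
# Assumption: skewed (exponential) data with a known true mean of 1.
set.seed(2015)

coverage <- function(n, nsim = 5000, true_mean = 1) {
  z <- qnorm(0.975)
  covered <- replicate(nsim, {
    x <- rexp(n, rate = 1 / true_mean)     # one simulated study of size n
    half_width <- z * sd(x) / sqrt(n)      # Normal-based half width
    abs(mean(x) - true_mean) <= half_width # did the interval catch the truth?
  })
  mean(covered)                            # fraction of intervals that covered
}

cov_small <- coverage(n = 10)
cov_large <- coverage(n = 1000)
print(c(n10 = cov_small, n1000 = cov_large))
```

Under these assumptions the n = 10 intervals should cover the true mean noticeably less often than the nominal 95%, while the n = 1000 intervals should come close to it. Other data distributions will give different numbers.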

Because the CLT doesn't apply in many situations (small sample, dependent data, etc.), variations of the CLT have been developed, as well as entirely different approaches to achieving the same ends, like confidence intervals, p-values, and standard errors (think bootstrap, jackknife, permutation tests). While the CLT can provide beautiful insight in a large variety of situations, in reality, one must often resort to a custom solution when analyzing a given dataset or problem. This should be a familiar conclusion to anyone who analyzes data. The promise of "write once, run anywhere" is always tantalizing, but the reality never seems to meet that expectation.
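As one example of such a "custom solution", here is a minimal sketch of a nonparametric bootstrap for the mean in base R. The data are made up, and the 5,000 resamples and the percentile interval are illustrative choices, not a recommendation; the bootstrap has its own limitations with very small samples.

```r
# Sketch: bootstrap standard error and percentile interval for a mean.
# Assumption: x is a small, skewed sample invented for illustration.
set.seed(42)
x <- rexp(10, rate = 1)

# Resample the observed data with replacement and recompute the mean each time
boot_means <- replicate(5000, mean(sample(x, replace = TRUE)))

boot_se <- sd(boot_means)                        # bootstrap standard error
boot_ci <- quantile(boot_means, c(0.025, 0.975)) # percentile 95% interval

print(boot_se)
print(boot_ci)
```

Nothing here relies on a Normal approximation; the resampling itself stands in for the sampling distribution.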

Ironically, if you look across history and all programming languages, probably the most "cross-platform" language is C, which was originally considered to be too low-level to be broadly useful. C programs run on basically every existing platform and the language has been completely standardized so that compilers can be written to produce well-defined output. The keys to C's success I think are that it's a very simple/small language which gives enormous (sometimes dangerous) power to the programmer, and that an enormous toolbox (compiler toolchains, IDEs) has been developed over time to help developers write applications on all platforms.

In a sense, we need "compilers" that can help us translate statistical theory for specific data analysis problems. In many cases, I'd imagine the compiler would "fail", meaning the theory was not applicable to that problem. This would be a Good Thing, because right now we have no way of really enforcing the appropriateness of a theorem for specific problems.

More practically (perhaps), we could develop data analysis pipelines that could be applied to broad classes of data analysis problems. Then a "compiler" could be employed to translate the pipeline so that it worked for a given dataset/problem/toolchain.

The key point is to recognize that there is a "translation" process that occurs when we use theory to justify certain data analysis actions, but this translation process is often not well documented or even thought through. Having an explicit "compiler" for this would help us to understand the applicability of certain theorems and may serve to prevent bad data analysis from occurring.

28
Jul

Announcing the JHU Data Science Hackathon 2015

We are pleased to announce that the Department of Biostatistics at the Johns Hopkins Bloomberg School of Public Health will be hosting the first ever JHU Data Science Hackathon (DaSH) on September 21-23, 2015 at the Baltimore Marriott Waterfront.

This event will be an opportunity for data scientists and data scientists-in-training to get together and hack on real-world problems collaboratively and to learn from each other. The DaSH will feature data scientists from government, academia, and industry presenting problems and describing challenges in their respective areas. There will also be a number of networking opportunities where attendees can get to know each other. We think this will be a fun event and we encourage people from all areas, including students (graduate and undergraduate), to attend.

To get more details and to sign up for the hackathon, you can go to the DaSH web site. We will be posting more information as the event nears.

Organizers:

  • Jeff Leek
  • Brian Caffo
  • Roger Peng
  • Leah Jager

Funding:

  • National Institutes of Health
  • Johns Hopkins University

 

24
Jul

stringsAsFactors: An unauthorized biography

Recently, I was listening in on the conversation of some colleagues who were discussing a bug in their R code. The bug was ultimately traced back to the well-known phenomenon that functions like 'read.table()' and 'read.csv()' in R convert columns that are detected to be character/strings to be factor variables. This led to the spontaneous outcry from one colleague of

Why does stringsAsFactors not default to FALSE????

The argument 'stringsAsFactors' is an argument to the 'data.frame()' function in R. It is a logical that indicates whether strings in a data frame should be treated as factor variables or as just plain strings. The argument also appears in 'read.table()' and related functions because of the role these functions play in reading in table data and converting them to data frames. By default, 'stringsAsFactors' is set to TRUE.
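A small sketch of what the argument does, using made-up data. The argument is passed explicitly in every call so the result does not depend on the default, which differs across R versions: the default described in this post is the one in place when it was written, while in R 4.0.0 and later the default is FALSE.

```r
# Sketch: the same data frame built with and without string-to-factor conversion.
# The column names and values are invented for illustration.
df_factor <- data.frame(sex = c("male", "female", "female"),
                        age = c(34, 29, 41),
                        stringsAsFactors = TRUE)

df_string <- data.frame(sex = c("male", "female", "female"),
                        age = c(34, 29, 41),
                        stringsAsFactors = FALSE)

print(class(df_factor$sex))  # factor
print(class(df_string$sex))  # character

# The same argument is accepted by read.csv() and read.table()
csv_text <- "region,count\neast,10\nwest,12\neast,7"
csv_factor <- read.csv(text = csv_text, stringsAsFactors = TRUE)
print(levels(csv_factor$region))
```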

This argument dates back to May 20, 2006 when it was originally introduced into R as the 'charToFactor' argument to 'data.frame()'. Soon afterwards, on May 24, 2006, it was changed to 'stringsAsFactors' to be compatible with S-PLUS by request from Bill Dunlap.

Most people I talk to today who use R are completely befuddled by the fact that 'stringsAsFactors' is set to TRUE by default. First of all, it should be noted that before the 'stringsAsFactors' argument even existed, the behavior of R was to coerce all character strings to be factors in a data frame. If you didn't want this behavior, you had to manually coerce each column to be character.

So here's the story:

In the old days, when R was primarily being used by statisticians and statistical types, setting strings to be factors made total sense. In most tabular data, if there were a column of the table that was non-numeric, it almost certainly encoded a categorical variable. Think sex (male/female), country (U.S./other), region (east/west), etc. In R, categorical variables are represented by 'factor' vectors and so character columns got converted to factors.

Why do we need factor variables to begin with? Because of modeling functions like 'lm()' and 'glm()'. Modeling functions need to expand categorical variables into individual dummy variables, so that a categorical variable with 5 levels will be expanded into 4 different columns in your modeling matrix. There's no way for R to know it should do this unless it has some extra information in the form of the factor class. From this point of view, setting 'stringsAsFactors = TRUE' when reading in tabular data makes total sense. If the data is just going to go into a regression model, then R is doing the right thing.
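The dummy-variable expansion can be seen directly with 'model.matrix()', which builds the kind of design matrix that 'lm()' and 'glm()' work from. This is a minimal sketch with an invented five-level variable, using R's default treatment contrasts.

```r
# Sketch: a 5-level factor becomes 4 dummy columns plus an intercept.
# The variable name and its levels are made up for illustration.
region <- factor(c("north", "south", "east", "west", "central",
                   "north", "east", "west", "south", "central"))

X <- model.matrix(~ region)

print(nlevels(region))  # number of categories
print(colnames(X))      # intercept plus one column per non-baseline level
print(dim(X))
```

The first level in sorted order ("central" here) serves as the baseline and gets no column of its own, which is why 5 levels yield 4 dummy columns.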

There's also a more obscure reason. Factor variables are encoded as integers in their underlying representation. So a variable with the levels "disease" and "non-disease" will be encoded as 1 and 2 in the underlying representation. Roughly speaking, since integers only require 4 bytes on most systems, the conversion from string to integer actually saved some space for long strings. All that had to be stored was the integer codes and the labels. That way you didn't have to repeat the strings "disease" and "non-disease" for as many observations as you had, which would have been wasteful.
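The integer encoding is easy to inspect. A minimal sketch with a made-up vector:

```r
# Sketch: a factor is stored as integer codes plus a vector of labels.
status <- factor(c("disease", "non-disease", "disease", "disease"))

print(typeof(status))      # the underlying storage type
print(as.integer(status))  # the integer codes, one per observation
print(levels(status))      # the labels, stored once
```

Each observation holds only a code; the strings themselves appear once, in the levels.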

Around June of 2007, R introduced hashing of CHARSXP elements in the underlying C code thanks to Seth Falcon. What this meant was that effectively, character strings were hashed to an integer representation and stored in a global table in R. Anytime a given string was needed in R, it could be referenced by its underlying integer. This effectively put in place, globally, the factor encoding behavior of strings from before. Once this was implemented, there was little to be gained from an efficiency standpoint by encoding character variables as factor. Of course, you still needed to use 'factors' for the modeling functions.

The difference nowadays is that R is being used by a very wide variety of people doing all kinds of things the creators of R never envisioned. This is, of course, wonderful, but it introduces lots of use cases that were not originally planned for. I find that most often, the people complaining about 'stringsAsFactors' not being FALSE are people who are doing things that are not the traditional statistical modeling things (things that old-time statisticians like me used to do). In fact, I would argue that if you're upset about 'stringsAsFactors = TRUE', then it's a pretty good indicator that you're either not a statistician by training, or you're doing non-traditional statistical things.

For example, in genomics, you might have the names of the genes in one column of data. It really doesn't make sense to encode these as factors because they won't be used in any modeling function. They're just labels, essentially. And because of CHARSXP hashing, you don't gain anything from an efficiency standpoint by converting them to factors either.

But of course, given the long-standing behavior of R, many people depend on the default conversion of characters to factors when reading in tabular data. Changing this default would likely result in an equal number of people complaining about 'stringsAsFactors'.

I fully expect that this blog post will now make all R users happy. If you think I've missed something from this unauthorized biography, please let me know on Twitter (@rdpeng).

10
Jul

The Mozilla Fellowship for Science

This looks like an interesting opportunity for grad students, postdocs, and early career researchers:

We're looking for researchers with a passion for open source and data sharing, already working to shift research practice to be more collaborative, iterative and open. Fellows will spend 10 months starting September 2015 as community catalysts at their institutions, mentoring the next generation of open data practitioners and researchers and building lasting change in the global open science community.

Throughout their fellowship year, chosen fellows will receive training and support from Mozilla to hone their skills around open source and data sharing. They will also craft code, curriculum and other learning resources that help their local communities learn open data practices, and teach forward to their peers.

Here's what you get:

Fellows will receive:

  • A stipend of $60,000 USD, paid in 10 monthly installments.
  • One-time health insurance supplement for Fellows and their families, ranging from $3,500 for single Fellows to $7,000 for a couple with two or more children.
  • One-time childcare allotment for families with children of up to $6,000.
  • Allowance of up to $3,000 towards the purchase of laptop computer, digital cameras, recorders and computer software; fees for continuing studies or other courses, research fees or payments, to the extent related to the fellowship.
  • All approved fellowship trips – domestic and international – are covered in full.

Deadline is August 14.

08
Jul

JHU, UMD researchers are getting a really big Big Data center

From Technical.ly Baltimore:

A nondescript, 3,700-square-foot building on Johns Hopkins’ Bayview campus will house a new data storage and computing center for university researchers. The $30 million Maryland Advanced Research Computing Center (MARCC) will be available to faculty from JHU and the University of Maryland, College Park.

The web site has a pretty cool time-lapse video of the construction of the computing center. There's also a bit more detail at the JHU Hub site.

03
Jul

The Massive Future of Statistics Education

NOTE: This post was written as a chapter for the not-yet-released Handbook on Statistics Education. 

Data are eating the world, but our collective ability to analyze data is going on a starvation diet.

Everywhere you turn, data are being generated somehow. By the time you read this piece, you’ll probably have collected some data. (For example this piece has 2,072 words). You can’t avoid data—it’s coming from all directions.

So what do we do with it? For the most part, nothing. There’s just too much data being spewed about. But for the data that we are interested in, we need to know the appropriate methods for thinking about and analyzing them. And by “we”, I mean pretty much everyone.

In the future, everyone will need some data analysis skills. People are constantly confronted with data and the need to make choices and decisions from the raw data they receive. Phones deliver information about traffic, we have ratings about restaurants or books, and even rankings of hospitals. High school students can obtain complex and rich information about the colleges to which they’re applying while admissions committees can get real-time data on applicants’ interest in the college.

Many people already have heuristic algorithms to deal with the data influx—and these algorithms may serve them well—but real statistical thinking will be needed for situations beyond choosing which restaurant to try for dinner tonight.

Limited Capacity

The McKinsey Global Institute, in a highly cited report, predicted that there would be a shortage of “data geeks” and that by 2018 there would be between 140,000 and 190,000 unfilled positions in data science. In addition, there will be an estimated 1.5 million people in managerial positions who will need to be trained to manage data scientists and to understand the output of data analysis. If history is any guide, it’s likely that these positions will get filled by people, regardless of whether they are properly trained. The potential consequences are disastrous as untrained analysts interpret complex big data coming from myriad sources of varying quality.

Who will provide the necessary training for all these unfilled positions? The field of statistics’ current system of training people and providing them with master’s degrees and PhDs is woefully inadequate to the task. In 2013, the top 10 largest statistics master’s degree programs in the U.S. graduated a total of 730 people. At this rate we will never train the people needed. While statisticians have greatly benefited from the sudden and rapid increase in the amount of data flowing around the world, our capacity for scaling up the needed training for analyzing those data is essentially nonexistent.
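A back-of-the-envelope calculation shows the size of the gap. This sketch uses only the figures quoted above and makes a deliberately crude assumption: that the 730 graduates per year from the ten largest programs were the only source of trained people. Real capacity is larger than that, so the result is an illustration of scale, not a forecast.

```r
# Sketch: years needed to fill the projected shortfall at 730 graduates per year.
grads_per_year <- 730
shortfall <- c(low = 140000, high = 190000)  # McKinsey range quoted above

years_needed <- shortfall / grads_per_year
print(round(years_needed))
```

Even at the low end of the range, the answer is on the order of two centuries.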

On top of all this, I believe that the McKinsey report is a gross underestimation of how many people will need to be trained in some data analysis skills in the future. Given how much data is being generated every day, and how critical it is for everyone to be able to intelligently interpret these data, I would argue that it’s necessary for everyone to have some data analysis skills. Needless to say, it’s foolish to suggest that everyone go get a master’s or even bachelor’s degrees in statistics. We need an alternate approach that is both high-quality and scalable to a large population over a short period of time.

Enter the MOOCs

In April of 2014, Jeff Leek, Brian Caffo, and I launched the Johns Hopkins Data Science Specialization on the Coursera platform. This is a sequence of nine courses that intends to provide a “soup-to-nuts” training in data science for people who are highly motivated and have some basic mathematical and computing background. The sequence of the nine courses follows what we believe is the essential “data science process”, which is

  1. Formulating a question that can be answered with data
  2. Assembling, cleaning, tidying data relevant to a question
  3. Exploring data, checking, eliminating hypotheses
  4. Developing a statistical model
  5. Making statistical inference
  6. Communicating findings
  7. Making the work reproducible

We took these basic steps and designed courses around each one of them.

Each course is provided in a massive open online format, which means that many thousands of people typically enroll in each course every time it is offered. The learners in the courses do homework assignments, take quizzes, and peer assess the work of others in the class. All grading and assessment is handled automatically so that the process can scale to arbitrarily large enrollments. As an example, the April 2015 session of the R Programming course had nearly 45,000 learners enrolled. Each class is exactly 4 weeks long and every class runs every month.

We developed this sequence of courses in part to address the growing demand for data science training and education across the globe. Our background as biostatisticians was very closely aligned with the training needs of people interested in data science because, essentially, data science is what we do every single day. Indeed, one curriculum rule that we had was that we couldn’t include something if we didn’t in fact use it in our own work.

The sequence has a substantial amount of standard statistics content, such as probability and inference, linear models, and machine learning. It also has non-standard content, such as git, GitHub, R programming, Shiny, and Markdown. Together, the sequence covers the full spectrum of tools that we believe will be needed by the practicing data scientist.

For those who complete the nine courses, there is a capstone project at the end that involves taking all of the skills in the course and developing a data product. For our first capstone project we partnered with SwiftKey, a predictive text analytics company, to develop a project where learners had to build a statistical model for predicting words in a sentence. This project involves taking unstructured, messy data, processing it into an analyzable form, developing a statistical model while making tradeoffs for efficiency and accuracy, and creating a Shiny app to show off the model to the public.

Degree Alternatives

The Data Science Specialization is not a formal degree program offered by Johns Hopkins University—learners who complete the sequence do not get any Johns Hopkins University credit—and so one might wonder what the learners get out of the program (besides, of course, the knowledge itself). To begin with, the sequence is completely portfolio based, so learners complete projects that are immediately viewable by others. This allows others to evaluate a learner’s ability on the spot with real code or data analysis.

All of the lecture content is openly available and hosted on GitHub, so outsiders can view the content and see for themselves what is being taught. This gives outsiders an opportunity to evaluate the program directly rather than have to rely on the sterling reputation of the institution teaching the courses.

Each learner who completes a course using Coursera’s “Signature Track” (which currently costs $49 per course) can get a badge on their LinkedIn profile, which shows that they completed the course. This can often be as valuable as a degree or other certification as recruiters scouring LinkedIn for data scientist positions will be able to see our completers’ certifications in various data science courses.

Finally, the scale and reach of our specialization immediately creates a large alumni social network that learners can take advantage of. As of March 2015, there were approximately 700,000 people who had taken at least one course in the specialization. These 700,000 people have a shared experience that, while not quite at the level of a college education, still is useful for forging connections between people, especially when people are searching around for jobs.

Early Numbers

So far, the sequence has been wildly successful. It averaged 182,507 enrollees a month for the first year in existence. The overall course completion rate was about 6% and the completion rate amongst those in the “Signature Track” (i.e. paid enrollees) was 67%. In October of 2014, barely 7 months since the start of the specialization, we had 663 learners enroll in the capstone project.

Some Early Lessons

From running the Data Science Specialization for over a year now, we have learned a number of lessons, some of which were unexpected. Here, I summarize the highlights of what we’ve learned.

Data Science as Art and Science. Ironically, although the word “Science” appears in the name “Data Science”, there’s actually quite a bit about the practice of data science that doesn’t really resemble science at all. Much of what statisticians do in the act of data analysis is intuitive and ad hoc, with each data analysis being viewed as a unique flower.

When attempting to design data analysis assignments that could be graded at scale with tens of thousands of people, we discovered that designing the rubrics for grading these assignments was not trivial. The reason is that our understanding of what makes a “good” analysis different from a bad one is not well-articulated. We could not identify any community-wide understanding of what the components of a good analysis are. What are the “correct” methods to use in a given data analysis situation? What is definitely the “wrong” approach?

Although each one of us had been doing data analysis for the better part of a decade, none of us could succinctly write down what the process was and how to recognize when it was being done wrong. To paraphrase Daryl Pregibon from his 1991 talk at the National Academies of Science, we had a process that we regularly espoused but barely understood.

Content vs. Curation. Much of the content that we put online is available elsewhere. With YouTube, you can find high-quality videos on almost any topic, and our videos are not really that much better. Furthermore, the subject matter that we were teaching was in no way proprietary. The linear models that we teach are the same linear models taught everywhere else. So what exactly was the value we were providing?

Searching on YouTube requires that you know what you are looking for. This is a problem for people who are just getting into an area. Effectively, what we provided was a curation of all the knowledge that’s out there on the topic of data science (we also added our own quirky spin). Curation is hard, because you need to make definitive choices between what is and is not a core element of a field. But curation is essential for learning a field for the uninitiated.

Skill sets vs. Certification. Because we knew that we were not developing a true degree program, we knew we had to develop the program in a manner so that the learners could quickly see for themselves the value they were getting out of it. This led us to take a portfolio approach where learners produced things that could be viewed publicly.

In part because of the self-selection of the population seeking to learn data science skills, our learners were more interested in being able to demonstrate the skills taught in the course rather than an abstract (but official) certification as might be gotten in a degree program. This is not unlike going to a music conservatory, where the output is your ability to play an instrument rather than the piece of paper you receive upon graduation. We feel that giving people the ability to demonstrate skills and skill sets is perhaps more important than official degrees in some instances because it gives employers a concrete sense of what a person is capable of doing.

Conclusions

As of April 2015, we had a total of 1,158 learners complete the entire specialization, including the capstone project. Given these numbers and our rate of completion for the specialization as a whole, we believe we are on our way to achieving our goal of creating a highly scalable program for training people in data science skills. Of course, this program alone will not be sufficient for all of the data science training needs of society. But we believe that the approach that we’ve taken, using non-standard MOOC channels, focusing on skill sets instead of certification, and emphasizing our role in curation, is a rich opportunity for the field of statistics to explore in order to educate the masses about our important work.

02
Jul

Looks like this R thing might be for real

Not sure how I missed this, but the Linux Foundation just announced the R Consortium for supporting the "world’s most popular language for analytics and data science and support the rapid growth of the R user community". From the Linux Foundation:

The R language is used by statisticians, analysts and data scientists to unlock value from data. It is a free and open source programming language for statistical computing and provides an interactive environment for data analysis, modeling and visualization. The R Consortium will complement the work of the R Foundation, a nonprofit organization based in Austria that maintains the language. The R Consortium will focus on user outreach and other projects designed to assist the R user and developer communities.

Founding companies and organizations of the R Consortium include The R Foundation, Platinum members Microsoft and RStudio; Gold member TIBCO Software Inc.; and Silver members Alteryx, Google, HP, Mango Solutions, Ketchum Trading and Oracle.

01
Jul

How Airbnb built a data science team

From VentureBeat:

Back then we knew so little about the business that any insight was groundbreaking; data infrastructure was fast, stable, and real-time (I was querying our production MySQL database); the company was so small that everyone was in the loop about every decision; and the data team (me) was aligned around a singular set of metrics and methodologies.

But five years and 43,000 percent growth later, things have gotten a bit more complicated. I’m happy to say that we’re also more sophisticated in the way we leverage data, and there’s now a lot more of it. The trick has been to manage scale in a way that brings together the magic of those early days with the growing needs of the present — a challenge that I know we aren’t alone in facing.

16
Jun

Interview at Leanpub

A few weeks ago I sat down with Len Epp over at Leanpub to talk about my recently published book R Programming for Data Science. So far, I've only published one book through Leanpub but I'm a huge fan. They've developed a system that is, in my opinion, perfect for academic publishing. The book's written in Markdown and they compile it into PDF, ePub, and mobi formats automatically.

The full interview transcript is over at the Leanpub blog. If you want to listen to the audio of the interview, you can subscribe to the Leanpub podcast on iTunes.

R Programming for Data Science is available at Leanpub for a suggested price of $15 (but you can get it for free if you want). R code files, datasets, and video lectures are available through the various add-on packages. Thanks to all of you who've already bought a copy!