Navigating Big Data Careers with a Statistics PhD

Jeff Leek
2015-02-18

Editor’s note: This is a guest post by Sherri Rose. She is an Assistant Professor of Biostatistics in the Department of Health Care Policy at Harvard Medical School. Her work focuses on nonparametric estimation, causal inference, and machine learning in health settings. Dr. Rose received her BS in statistics from The George Washington University and her PhD in biostatistics from the University of California, Berkeley, where she coauthored a book on Targeted Learning. She tweets @sherrirose.

A quick scan of the science and technology headlines often yields two words: big data. The amount of information we collect has continued to increase, and this data can be found in varied sectors, ranging from social media to genomics. Claims are made that big data will solve an array of problems, from understanding devastating diseases to predicting political outcomes. There is substantial “big data” hype in the press, as well as in business and academic communities, but how do upcoming, current, and recent statistical science PhDs handle the array of training opportunities and career paths in this new era? Undergraduate interest in statistics degrees is exploding, bringing new talent to graduate programs and the post-PhD job pipeline. Statistics training is diversifying, with students focusing on theory, methods, computation, and applications, or a blending of these areas. A few years ago, Rafa outlined the academic career options for statistics PhDs in two posts, which cover great background material I do not repeat here. The landscape for statistics PhD careers is also changing quickly, with a variety of companies attracting top statistics students in new roles. As a new faculty member at the intersection of machine learning, causal inference, and health care policy, I’ve already found myself frequently giving career advice to trainees. The choices have become much more nuanced than just academia vs. industry vs. government.

So, you find yourself inspired by big data problems and fascinated by statistics. While you are a student, figuring out what you enjoy working on is crucial. This exploration could involve engaging in internship opportunities or collaborating with multiple faculty on different types of projects. Both positive and negative experiences can help you identify your preferences.

Undergraduates may wish to spend a couple of months at a Summer Institute for Training in Biostatistics or National Science Foundation Research Experience for Undergraduates. There are also many MOOC options to get a taste of different areas of statistics. Selecting a graduate program for PhD study can be a difficult choice, especially when your interests within statistics have yet to be identified, as is often the case for undergraduates. However, if you know that you have interests in software and programming, it can be easy to sort which statistical science PhD programs have a curricular or research focus in this area by looking at department websites. Similarly, if you know you want to work in epidemiologic methods, genomics, or imaging, specific programs are going to jump right to the top as good fits. Getting advice from faculty in your department will be important. Competition for admissions into statistics and biostatistics PhD programs has continued to increase, and most faculty advise applying to as many relevant programs as is reasonable given the demands on your time and finances. If you end up sitting on multiple (funded) offers come April, talking to current students and alums and looking at alumni placement can be helpful. Don’t hesitate to contact these people, selectively. Most PhD programs genuinely do want you to end up in the place that is best for you, even if it is not with them.

Once you’re in a PhD program, internship opportunities for graduate students are listed each year by the American Statistical Association. Your home department may also have ties with local research organizations and companies with openings. Internships can help you identify future positions and the types of environments where you will flourish in your career. Lauren Kunz, a recent PhD graduate in biostatistics from Harvard University, is currently a Statistician at the National Heart, Lung, and Blood Institute (NHLBI) of the National Institutes of Health. Dr. Kunz said, “As a previous summer intern at the NHLBI, I was able to get a feel for the day to day life of a biostatistician at the NHLBI. I found the NHLBI Office of Biostatistical Research to be a collegial, welcoming environment, and I soon learned that NHLBI biostatisticians have the opportunity to work on a variety of projects, very often collaborating with scientists and clinicians. Due to the nature of these collaborations, the biostatisticians are frequently presented with scientifically interesting and important statistical problems. This work often motivates methodological research which in turn has immediate, practical applications. These factors matched well with my interest in collaborative research that is both methodological and applied.”

Industry is also enticing to statistics PhDs, particularly those with an applied or computational focus, like Stephanie Sapp and Alyssa Frazee. Dr. Sapp has a PhD in statistics from the University of California, Berkeley, and is currently a Quantitative Analyst at Google. She also completed an internship there the summer before she graduated. In commenting about her choice to join Google, Dr. Sapp said, “I really enjoy both academic research and seeing my work used in practice.  Working at Google allows me to continue pursuing new and interesting research topics, as well as see my results drive more immediate impact.” Dr. Frazee just finished her PhD in biostatistics at Johns Hopkins University and previously spent a summer exploring her interests in Hacker School. While she applied to both academic and industry positions, receiving multiple offers, she ultimately chose to go into industry and work for Stripe: “I accepted a tech company’s offer for many reasons, one of them being that I really like programming and writing code. There are tons of opportunities to grow as a programmer/engineer at a tech company, but building an academic career on that foundation would be more of a challenge. I’m also excited about seeing my statistical work have more immediate impact. At smaller companies, much of the work done there has visible/tangible bearing on the product. Academic research in statistics is operating a lot closer to the boundaries of what we know and discovering a lot of cool stuff, which means researchers get to try out original ideas more often, but the impact is less immediately tangible. A new method or estimator has to go through a lengthy peer review/publication process and be integrated into the community’s body of knowledge, which could take several years, before its impact can be fully observed.” One consideration that Dr. Frazee, Dr. Sapp, and Dr. Kunz all weighed in choosing a job is shared by many in the early-career statistics community: having an impact.

Interest in both developing methods and translating statistical advances into practice is a common theme in the big data statistics world, but not one that always leads to an industry or government career. There are also academic opportunities in statistics, biostatistics, and interdisciplinary departments like my own where your work can have an impact on current science. The Department of Health Care Policy (HCP) at Harvard Medical School has 5 tenure-track/tenured statistics faculty members, including myself, among a total of about 20 core faculty members. The statistics faculty work on a range of theoretical and methodological problems while collaborating with HCP faculty (health economists, clinician researchers, and sociologists) and leading our own substantive projects in health care policy (e.g., Mass-DAC). I find it to be a unique and exciting combination of roles, and love that the science truly informs my statistical research, giving it broader impact. Since joining the department a year and a half ago, I’ve worked in many new areas, such as plan payment risk adjustment methodology. I have also applied some of my previous work in machine learning to predicting adverse health outcomes in large datasets. Here, I immediately saw a need for new avenues of statistical research to make the optimal approach based on statistical theory align with an optimal approach in practice. My current research portfolio is diverse; example projects include the development of a double robust estimator for the study of chronic disease, leading an evaluation of a new state-wide health plan initiative, and collaborating with department colleagues on statistical issues in all-payer claims databases, physician prescribing intensification behavior, and predicting readmissions. The larger statistics community at Harvard also affords many opportunities to interact with statistics faculty across the campus, and university-wide junior faculty events have connected me with professors in computer science and engineering. I feel an immense sense of research freedom to pursue my interests at HCP, which was a top priority when I was comparing job offers.

Hadley Wickham, of ggplot2 and Advanced R fame, took on a new role as Chief Scientist at RStudio in 2013. Freedom was also a key component in his choice to move sectors: “For me, the driving motivation is freedom: I know what I want to work on, I just need the freedom (and support) to work on it. It’s pretty unusual to find an industry job that has more freedom than academia, but I’ve been noticeably more productive at RStudio because I don’t have any meetings, and I can spend large chunks of time devoted to thinking about hard problems. It’s not possible for everyone to get that sort of job, but everyone should be thinking about how they can negotiate the freedom to do what makes them happy. I really like the thesis of Cal Newport’s book So Good They Can’t Ignore You - the better you are at your job, the greater your ability to negotiate for what you want.”

There continues to be a strong emphasis in the work force on the vaguely defined field of “data science,” which incorporates the collection, storage, analysis, and interpretation of big data. Statisticians work in and lead teams not only with other scientists (e.g., clinicians, biologists, computer scientists) to attack big data challenges, but also with each other. Your time as a statistics trainee is an amazing opportunity to explore your strengths and preferences, and which sectors and jobs appeal to you. Do your due diligence to figure out which employers are interested in and supportive of the type of career you want to create for yourself. Think about how you want to spend your time, and remember that you’re the only person who has to live your life once you get that job. Other people’s opinions are great, but your values and instincts matter too. Your definition of “best” doesn’t have to match someone else’s. Ask questions! Try new things! The potential for breakthroughs with novel flexible methods is strong. Statistical science training has progressed to the point where trainees are armed with thorough knowledge in design, methodology, theory, and, increasingly, data collection, applications, and computation. Statisticians working in data science are poised to continue making important contributions in all sectors for years to come. Now, you just need to decide where you fit.