Interview with Nathan Yau of FlowingData

Nathan Yau

Nathan Yau is a graduate student in statistics at UCLA and the author of the extremely popular data visualization blog flowingdata.com. He recently published a book, Visualize This, a really nice guide to modern data visualization using R, Illustrator and JavaScript, which should be on the bookshelf of any statistician working on data visualization.

Do you consider yourself a statistician/data scientist/or something else?

Statistician. I feel like statisticians can call themselves data scientists, but not the other way around. Although with data scientists there’s an implied knowledge of programming, which statisticians need to get better at.

Who have been good mentors to you and what qualities have been most helpful for you?

I’m visualization-focused, and I really got into the area during a summer internship at The New York Times. Before that, I mostly made graphs in R for reports. I learned a lot about telling stories with data and presenting data to a general audience, and that has stuck with me ever since.

Similarly, my adviser Mark Hansen has shown me how data is more free-flowing and intertwined with everything. It’s hard to describe. I mean coming into graduate school, I thought in terms of datasets and databases, but now I see it as something more organic. I think that helps me see what the data is about more clearly.

How did you get into statistics/data visualization?

In undergrad, an introduction to statistics (for engineering) actually pulled me in. The professor taught with so much energy, and the material sort of clicked with me. My friends who were also taking the course complained and had trouble with it, but I wanted more for some reason. I eventually switched from electrical engineering to statistics.

I got into visualization during my first year in grad school. My adviser gave a presentation on visualization, but from a media arts perspective rather than a charts-and-graphs-in-R-Tufte point of view. I went home after that class, googled visualization and that was that.

Why do you think there has been an explosion of interest in data visualization?

The Web is a really visual place, so it’s easy for good visualization to spread. It’s also easier for a general audience to read a graph than it is to understand statistical concepts. And from a more analytical point of view, there’s just a growing amount of data and visualization is a good way to poke around.

Other than R, what tools should students learn to improve their data visualizations?

For static graphics, I use Illustrator all the time to bring storytelling into the mix or to just provide some polish. For interactive graphics on the Web, it’s all about JavaScript nowadays. D3, Raphael.js, and Processing.js are all good libraries to get started.

Do you think the rise of infographics has led to a “watering down” of data visualization?

So I actually just wrote a post along these lines. It’s true that there are a lot of low-quality infographics, but I don’t think that takes away from visualization at all. It makes good work more obvious. I think the flood of infographics is a good indicator of people’s eagerness to read data.

How did you decide to write your book “Visualize This”?

Pretty simple. I get emails and comments all the time when I post graphics on FlowingData that ask how something was done. There aren’t many resources that show people how to do that. There are books that describe what makes good graphics but don’t say anything about how to actually go about doing it, and there are programming books for, say, R, but they are too technical for most and aren’t visualization-centric. I wanted to write a book that I wish I had in the early days.

Any final thoughts on statistics, data and visualization?

Keep an open mind. Oftentimes, statisticians seem to box themselves into positions of analysis and reports. Statistics is an applied field though, and now more than ever, there are opportunities to work anywhere there is data, which is practically everywhere.

Interview with Héctor Corrada Bravo

Héctor Corrada Bravo

Héctor Corrada Bravo is an assistant professor in the Department of Computer Science and the Center for Bioinformatics and Computational Biology at the University of Maryland, College Park. He moved to College Park after finishing his Ph.D. in computer science at the University of Wisconsin and a postdoc in biostatistics at the Johns Hopkins Bloomberg School of Public Health. He has done outstanding work at the intersection of molecular biology, computer science, and statistics. For more info check out his webpage.

Which term applies to you: statistician/data scientist/computer
scientist/machine learner?

I want to understand interesting phenomena (in my case mostly in
biology and medicine) and I believe that our ability to collect a large number of relevant
measurements and infer characteristics of these phenomena can drive
scientific discovery and commercial innovation in the near future.
Perhaps that makes me a data scientist and means that depending on the
task at hand one or more of the other terms apply.

A lot of the distinctions many people make between these terms are
vacuous and unnecessary, but some are nonetheless useful to think
about. For example, both statisticians and machine learners [sic] know
how to create statistical algorithms that compute interesting and informative objects using measurements (perhaps) obtained through some stochastic or partially observed
process. These objects could be genomic tools for cancer screening, or
statistics that better reflect the relative impact of baseball players
on team success.

Both fields also give us ways to evaluate and characterize these objects.
However, there are times when these objects are tools that fulfill an
immediately utilitarian purpose and thinking like an engineer might
(as many people in Machine Learning do) is the right approach.
Other times, these objects are there to help us get insights about our
world and thinking in ways that many statisticians do is the right
approach.  You need both of these ways of thinking to do interesting
science and dogmatically avoiding either of them is a terrible idea.

How did you get into statistics/data science (i.e. your history)?

I got interested in Artificial Intelligence at one point, and found
that my mathematics background was nicely suited to work on this. Once
I got into it, thinking about statistics and how to analyze and
interpret data was natural and necessary. I started working with two
wonderful advisors at Wisconsin, Raghu Ramakrishnan (CS) and Grace Wahba (Statistics)
that helped shape the way I approach problems from different angles
and with different goals. The last piece was discovering that
computational biology is a fantastic setting in which to apply and
devise these methods to answer really interesting questions.

What is the problem currently driving you?

I’ve been working on cancer epigenetics to find specific genomic
measurements for which increased stochasticity appears to be general
across multiple cancer types. Right now, I’m really wondering how far
into the clinic can these discoveries be taken, if at all. For
example, can we build tools that use these genomic measurements to
improve cancer screening?

How do you see CS/statistics merging in the future?

I think that future got here some time ago, but is about to get much
more interesting.

Here is one example: Computer Science is about creating and analyzing
algorithms and building the systems that can implement them. Some of
what many computer scientists have done looks at problems concerning how to
keep, find and ship around information (Operating Systems, Networks,
Databases, etc.). Many times these have been driven by very specific
needs, e.g., commercial transactions in databases. In some ways,
companies have moved from asking how do I use data to keep track
of my activities to how do I use data to decide which activities to do
and how to do them. Statistical tools should be used to answer these
questions, and systems built by computer scientists have statistical
algorithms at their core.

Beyond R, what are some really useful computational tools for
statisticians to know about?

I think a computational tool that everyone can benefit a lot from
understanding better is algorithm design and analysis. This doesn’t
have to be at a particularly deep level, but just getting a sense of
how long a particular process might take, and how to devise a different way of doing it that might make it more efficient is really useful. I’ve been toying with the idea of creating a CS course called (something like) “Highlights of continuous
mathematics for computer science” that reminds everyone of the cool
stuff that one learns in math now that we can appreciate its usefulness. Similarly, I think
statistics students can benefit from “Highlights of discrete
mathematics for statisticians”.

Now a request for comments below from you and readers: (5a) Beyond R,
what are some really useful statistical tools for computer scientists
to know about?

Review times in statistics journals are long, should statisticians
move to conference papers?

I don’t think so. Long review times (anything more than 3 weeks) are
really not necessary. We tend to publish in journals with fairly quick
review times that produce (for the most part) really useful and
insightful reviews.

I was recently talking to senior members in my field who were telling
me stories about the “old times” when CS was moving from mainly
publishing in journals to now mainly publishing in conferences. But
now, people working in collaborative projects (like computational biology) work in fields
that primarily publish in journals, so the field needs to be able to
properly evaluate their impact and productivity. There is no perfect

For instance, review requests in fields where conferences are the main
publication venue come in waves (dictated by conference schedule).
Reviewers have a lot of papers to go over in a relatively short time
which makes their job of providing really helpful and fair reviews not
so easy. So, in that respect, the journal system can be better. The one thing that is universally true is that you don’t need long review times.

Previous Interviews: Daniela Witten, Chris Barr, Victoria Stodden


Interview with Victoria Stodden

Victoria Stodden

Victoria Stodden is an assistant professor of statistics at Columbia University in New York City. She moved to Columbia after getting her Ph.D. at Stanford University. Victoria has made major contributions to the area of reproducible research and has been appointed to the NSF’s Advisory Committee for Infrastructure. She is the recent recipient of an NSF grant for “Policy Design for Reproducibility and Data Sharing in Computational Science”.

Which term applies to you: data scientist/statistician/analyst (or something else)?

Definitely statistician. My PhD is from the stats department at Stanford University.

How did you get into statistics/data science (e.g. your history)?

Since my undergrad days I’ve been motivated by problems in what’s called ‘social welfare economics.’ I interpret that as studying how people can best reach their potential, particularly how the policy environment affects outcomes. This includes the study of regulatory design, economic growth, access to knowledge, development, and empowerment. My undergraduate degree was in economics, and I thought I would carry on with a PhD in economics as well. I realized that folks with my interests were mostly doing empirical work so I thought I should prepare myself with the best training I could in statistics. Hence I chose to do a PhD in statistics to augment my data analysis capabilities as much as I could since I envisioned myself immersed in empirical research in the future.

What is the problem currently driving you?

Right now I’m working on the problem of reproducibility in our body of published computational science. This ties into my interests because of the critical role of knowledge and reasoning in advancing social welfare. Scientific research is becoming heavily computational and as a result the empirical work scientists do is becoming more complex and yet less tacit: the myriad decisions made in data filtering, analysis, and modeling are all recordable in code. In computational research there are so many details in the scientific process it is nearly impossible to communicate them effectively in the traditional scientific paper – rendering our published computational results unverifiable, if there isn’t access to the code and data that generated them.

Access to the code and data permits readers to check whether the descriptions in the paper correspond to the published results, and allows people to understand why independent implementations of the methods in the paper might produce differing results. It also puts the tools of scientific reasoning into people’s hands – this is new. For much of scientific research today all you need is an internet connection to download the reasoning associated with a particular result. Wide availability of the data and code is still largely a dream, but one the scientific community is moving towards.

Who were really good mentors to you? What were the qualities that really helped you?

My advisor, David Donoho, is an enormous influence. He is the clearest scientific thinker I have ever been exposed to. I’ve been so very lucky with the people who have come into my life. Through his example, Dave is the one who has had the most impact on how I think about and prioritize problems and how I understand our role as statisticians and scientific thinkers. He’s given me an example of how to do this and it’s hard to overestimate his influence in my life.

What do you think are the barriers to reproducible research?

At this point, incentives. There are many concrete barriers, which I talk about in my papers and talks (available on my website http://stodden.net), but they all stem from misaligned incentives. If you think about it, scientists do lots of things they don’t particularly like in the interest of research communication and scientific integrity. I don’t know any computational scientist who really loves writing up their findings into publishable articles for example, but they do. This is because the right incentives exist. A big part of the work I am doing concerns the scientific reward structure. For example, my work on the Reproducible Research Standard is an effort to realign the intellectual property rules scientists are subject to, to be closer to our scientific norms. Scientific norms create the incentive structure for the production of scientific research, providing rewards for doing things people might not do otherwise. For example, scientists have a long established norm of giving up all intellectual property rights over their work in exchange for attribution, which is the currency of success. It’s the same for sharing the code and data that underlie published results – not part of the scientific incentive and reward structure today but becoming so, through adjusting a variety of other factors like funding agency policy, journal publication policy, and expectations at the institutional level.

What have been some success stories in reproducible research?

I can’t help but point to my advisor, David Donoho. An example he gives is his release of http://www-stat.stanford.edu/~wavelab - the first implementation of wavelet routines in MATLAB, before MATLAB included their own wavelet toolbox.  The release of the Wavelab code was a factor that he believes made him one of the top 5 highly cited authors in Mathematics in 2000.

Hiring and promotion committees seem to be starting to recognize the difference between candidates that recognize the importance of reproducibility and clear scientific communication, compared to others who seem to be wholly innocent of these issues.

There is a nascent community of scientific software developers that is achieving remarkable success.  I co-organized a workshop this summer bringing some of these folks together, see http://www.stodden.net/AMP2011. There are some wonderful projects underway to assist in reproducibility, from workflow tracking to project portability to unique identifiers for results reproducible in the cloud. Fascinating stuff.

Can you tell us a little about the legal ramifications of distributing code/data?

Sure. Many aspects of our current intellectual property laws are quite detrimental to the sharing of code and data. I’ll discuss the two most impactful ones. Copyright creates exclusive rights vested in the author for original expressions of ideas – and it’s a default. What this means is that your expression of your idea – your code, your writing, figures you create – is by default copyrighted to you. So for your lifetime and 70+ years after that, you (or your estate) need to give permission for the reproduction and re-use of the work – this is exactly counter to scientific norms of independent verification and building on others’ findings. The Reproducible Research Standard is a suite of licenses that permit scientists to set the terms of use of their code, data, and paper according to scientific norms: use freely but attribute. I have written more about this here: http://ieeexplore.ieee.org/xpl/freeabs_all.jsp?arnumber=4720221

In 1980 Congress passed the Bayh-Dole Act, which was designed to create incentives for access to federally funded scientific discoveries by securing ownership rights for universities with regard to inventions by their researchers. The idea was that these inventions could then be patented and licensed by the university, making the otherwise unavailable technology available for commercial development. Notice that Bayh-Dole was passed on the eve of the computer revolution and Congress could not have foreseen the future importance of code to scientific investigation and its subsequent susceptibility to patentability. The patentability of scientific code now creates incentives to keep the code hidden: to avoid creating prior art in order to maximize the chance of obtaining the patent, and to keep hidden from potential competitors any information that might be involved in commercialization. Bayh-Dole has created new incentives for computational scientists – that of startups and commercialization – that must be reconciled with traditional scientific norms of openness.

Related Posts: Jeff’s interviews with Daniela Witten and Chris Barr. Roger’s talk on reproducibility 


Interview With Chris Barr

Chris Barr

Chris Barr is an assistant professor of biostatistics at the Harvard School of Public Health in Boston. He moved to Boston after getting his Ph.D. at UCLA and then doing a postdoc at Johns Hopkins Bloomberg School of Public Health. Chris has done important work in environmental biostatistics and is also the co-founder of OpenIntro, a very cool open-source (and free!) educational resource for statistics.  

Which term applies to you: data scientist/statistician/analyst?

I’m a “statistician” by training. One day, I hope to graduate to “scientist”. The distinction, in my mind, is that a scientist can bring real insight to a tough problem, even when the circumstances take them far beyond their training.

Statisticians get a head start on becoming scientists. Like chemists and economists and all the rest, we were trained to think hard as independent researchers. Unlike other specialists, however, we are given the opportunity, from a young age, to see all types of different problems posed from a wide range of perspectives.

How did you get into statistics/data science (e.g. your history)?

I studied economics in college, and I had planned to pursue a doctorate in the same field. One day a senior professor of statistics asked me about my future, and in response to my stated ambition, said: “Whatever an economist can do, a statistician can do better.” I started looking at graduate programs in statistics and noticed UCLA’s curriculum. It was equal parts theory, application, and computing, and that sounded like how I wanted to spend my next few years. I couldn’t have been luckier. The program and the people were fantastic.

What is the problem currently driving you?

I’m working on so many projects, it’s difficult to single out just one. Our work on smoking bans (joint with Diez, Wang, Samet, and Dominici) has been super exciting. It is a great example of how careful modeling can really make a big difference. I’m also soloing a methods paper on residual analysis for point process models that is bolstered by a simple idea from physics. When I’m not working on research, I spend as much time as I can on OpenIntro.

What is your favorite paper/idea you have had? Why?

I get excited about a lot of the problems and ideas. I like the small teams (one, two, or three authors) that generally take on theory and methods problems; I also like the long stretches of thinking time that go along with those papers. That said, big science papers, where I get to team up with smart folks from disciplines and destinations far and wide, really get me fired up. Last, but not least, I really value the work we do on open source education and reproducible research. That work probably has the greatest potential for introducing me to people, internationally and in small local communities, that I’d never know otherwise.

Who were really good mentors to you? What were the qualities that really helped you?

Identifying key mentors is such a tough challenge, so I’ll adhere to a self-imposed constraint by picking just one: Rick Schoenberg. Rick was my doctoral advisor, and has probably had the single greatest impact on my understanding of what it means to be a scientist and colleague. I could tell you a dozen stories about the simple kindness and encouragement that Rick offered. Most importantly, Rick was positive and professional in every interaction we ever had. He was diligent, but relaxed. He offered structure and autonomy. He was all the things a student needs, and none of the things that make students want to read those xkcd comics. Now that I’m starting to make my own way, I’m grateful to Rick for his continuing friendship and collaboration.

I know you asked about mentors, but if I could mention somebody who, even though not my mentor, has taught me a ton, it would be David Diez. David was my classmate at UCLA and colleague at Harvard. We are also cofounders of OpenIntro. David is probably the hardest working person I know. He is also the most patient and clear thinking. These qualities, like Rick’s, are often hard to find in oneself and can never be too abundant.

What is OpenIntro?

OpenIntro is part of the growing movement in open source education. Our goal, with the help of community involvement, is to improve the quality and reduce the cost of educational materials at the introductory level. Founded by two statisticians (Diez, Barr), our early activities have generated a full length textbook (OpenIntro Statistics: Diez, Barr, Cetinkaya-Rundel) that is available for free in PDF and at cost ($9.02) in paperback. People can also use openintro.org to manage their course materials for free, whether they are using our book or not. The software, developed almost entirely by David Diez, makes it easy for people to post lecture notes, assignments, and other resources. Additionally, it gives people access to our online question bank and quiz utility. Last but not least, we are sponsoring a student project competition. The first round will be this semester, and interested people can visit openintro.org/stat/comp for additional information. We are little fish, but with the help of our friends (openintro.org/about.php) and involvement from the community, we hope to do a good thing.

How did you get the idea for OpenIntro?


Regarding the book and webpage - David and I had both started writing a book on our own; David was keen on an introductory text, and I was working on one about statistical computing. We each realized that trying to solo a textbook while finishing a PhD was nearly impossible, so we teamed up. As the project began to grow, we were very lucky to be joined by Mine Cetinkaya-Rundel, who became our co-author on the text and has since played a big role in developing the kinds of teaching supplements that instructors find so useful (labs and lecture notes to name a few). Working with the people at OpenIntro has been a blast, and a bucket full of nights and weekends later, here we are!

Regarding making everything free - David and I started the OpenIntro project during the peak of the global financial crisis. With kids going to college while their parents’ house was being foreclosed, it seemed timely to help out the best way we knew how. Three years later, as I write this, the daily news is running headline stories about the Occupy Wall Street movement featuring hard times for young people in America and around the world. Maybe “free” will always be timely.

For More Information

Check out Chris’ webpage, his really nice publications including this one on the public health benefits of cap and trade, and the OpenIntro project website. Keep your eye open for the paper on cigarette bans Chris mentions in the interview; it is sure to be good.

Related Posts: Jeff’s interview with Daniela Witten, Rafa on the future of graduate education, Roger on colors in R.


Interview With Daniela Witten

Note: This is the first in a series of posts where we will be interviewing junior, up-and-coming statisticians/data scientists. Our goal is to build visibility for people who are at the early stages of their careers. 

Daniela Witten

Daniela is an assistant professor of Biostatistics at the University of Washington in Seattle. She moved to Seattle after getting her Ph.D. at Stanford. Daniela has been developing exciting new statistical methods for analyzing high dimensional data and is a recipient of the NIH Director’s Early Independence Award.

Which term applies to you: data scientist/statistician/analyst?

Statistician! We have to own the term. Some of us have a tendency to try to sugarcoat what we do. But I say that I’m a statistician with pride! It means that I have been rigorously trained, that I have a broadly applicable skill set, and that I’m always open to new and interesting problems. Also, I sometimes get surprised reactions from people at cocktail parties, which is funny.

To the extent that there is a stigma associated with being a statistician, we statisticians need to face the problem and overcome it. The future of our field depends on it.

How did you get into statistics/data science?

I definitely did not set out to become a statistician. Before I got to college, I was planning to study foreign languages. Like most undergrads, I changed my mind, and eventually I majored in biology and math. I spent a summer in college doing experimental biology, but quickly discovered that I had neither the hand-eye coordination nor the patience for lab work. When I was nearing the end of college, I wasn’t sure what was next. I wanted to go to grad school, but I didn’t want to commit to one particular area of study for the next five years and potentially for my entire career. 

I was lucky to be at Stanford and to stumble upon the Stat department there. Initially, statistics appealed to me because it was a good way to combine my interests in math and biology from the safety of a computer terminal instead of a lab bench. After spending more time in the department, I realized that if I studied statistics, I could develop a broad skill set that could be applied to a variety of areas, from cancer research to movie recommendations to the stock market.

What is the problem currently driving you?

My research involves the development of statistical methods for the analysis of very large data sets. Recently, I’ve been interested in better understanding networks and their applications to biology. In the past few years there has been a lot of work in the statistical community on network estimation, or graphical modeling. In parallel, biologists have been interested in taking network-based approaches to understanding large-scale biological data sets. There is a real need for these two areas of research to be brought closer together, so that statisticians can develop useful tools for rigorous network-based analysis of biological data sets.

For example, the standard approach for analyzing a gene expression data set with samples from two classes (like cancer and normal tissue) involves testing each gene for differential expression between the two classes, for instance using a two-sample t-statistic. But we know that an individual gene does not drive the differences between cancer and normal tissue; rather, sets of genes work together in pathways in order to have an effect on the phenotype. Instead of testing individual genes for differential expression, can we develop an approach to identify aspects of the gene network that are perturbed in cancer?
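The standard per-gene approach mentioned above can be sketched in a few lines. This is only an illustration under stated assumptions: the data are simulated, the function name `two_sample_t` is hypothetical, and the unequal-variance (Welch) form of the two-sample t-statistic is one common choice rather than the specific test used in any particular study.

```python
import numpy as np

def two_sample_t(expr, labels):
    """Welch two-sample t-statistic for each gene (row) of expr.

    expr   : (n_genes, n_samples) array of expression values
    labels : boolean array of length n_samples, True for one class
             (e.g. cancer), False for the other (e.g. normal)
    """
    expr = np.asarray(expr, dtype=float)
    labels = np.asarray(labels, dtype=bool)
    a, b = expr[:, labels], expr[:, ~labels]
    # Difference in class means for every gene
    mean_diff = a.mean(axis=1) - b.mean(axis=1)
    # Standard error without assuming equal variances in the two classes
    se = np.sqrt(a.var(axis=1, ddof=1) / a.shape[1]
                 + b.var(axis=1, ddof=1) / b.shape[1])
    return mean_diff / se

# Simulated data (assumption): 1000 genes, 10 cancer and 10 normal samples
rng = np.random.default_rng(0)
labels = np.array([True] * 10 + [False] * 10)
expr = rng.normal(size=(1000, 20))
expr[:5, labels] += 3.0  # shift the first five genes in the cancer class

t = two_sample_t(expr, labels)
# Indices of the genes with the largest absolute t-statistics
top = np.argsort(-np.abs(t))[:5]
```

Each gene is scored on its own here, which is exactly the limitation the network-based question is aimed at: nothing in this calculation uses the relationships between genes.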

What are the top 3 skills you look for in a student who works with you?

I look for a student who is intellectually curious, self-motivated, and a good personality fit. Intellectual curiosity is a prerequisite for grad school, self-motivation is needed to make it through the 2 years of PhD level coursework and 3 years of research that make up a typical Stat/Biostat PhD, and a good personality fit is needed because grad school is long and sometimes frustrating*, and it’s important to have an advisor who can be a friend along the way!

*but ultimately very rewarding

Who were really good mentors to you? What were the qualities that really helped you?

My PhD advisor, Rob Tibshirani, has been a great mentor. In addition to being a top statistician, he is also an enthusiastic advisor, a tireless advocate for his students, and a loyal friend. I learned from him the value of good collaborations and of simple solutions to complicated problems. I also learned that it is important to maintain a relaxed attitude and to occasionally play pranks on students.

For more information:

Check out her website. Or read her really nice papers on penalized classification and penalized matrix decompositions.