Tag: interview

19
Dec

Rafa interviewed about statistical genomics

He talks about the problems created by the rapid growth of data sizes in molecular biology, the way that genomics is hugely driven by data analysis/statistics, how Bioconductor is an example of bottom-up science, and how new data are going to lead to new modeling/statistical challenges, and he gives an ode to boxplots (Simply Statistics also gets a shout out). It's worth watching the whole thing...

09
Nov

Interview with Tom Louis - New Chief Scientist at the Census Bureau

Tom Louis


Tom Louis is a professor of Biostatistics at Johns Hopkins and will be joining the Census Bureau through an interagency personnel agreement as the new associate director for research and methodology and chief scientist. Tom has an impressive history of accomplishment in developing statistical methods for everything from environmental science to genomics. We talked to Tom about his new role at the Census, how it relates to his impressive research career, and how young statisticians can get involved in the statistical work at the Census. 


SS: How did you end up being invited to lead the research branch of the Census?

TL: Last winter, then-director Robert Groves (now Provost at Georgetown University) asked if I would be interested in  the possibility of becoming the next Associate Director of Research and Methodology (R&M) and Chief Scientist, succeeding  Rod Little (Professor of Biostatistics at the University of Michigan) in these roles.  I expressed interest and after several discussions with Bob and Rod, decided that if offered, I would accept.  It was offered and I did accept.  

As background, components of my research, especially Bayesian methods, are Census-relevant.  Furthermore, during my time as a member of the National Academies Committee on National Statistics I served on the panel that recommended improvements in small area income and poverty estimates, chaired the panel that evaluated methods for allocating federal and state program funds by formula, and chaired a workshop on facilitating innovation in the Federal statistical system.

Rod and I noted that it’s interesting and possibly not coincidental that with my appointment the first two associate directors are both former chairs of Biostatistics departments.  It is the case that R&M’s mission is quite similar to that of a Biostatistics department: methods and collaborative research, consultation, and education.  And there are many statisticians at the Census Bureau who are not in the R&M directorate, a sociology quite similar to that in a School of Public Health or a Medical campus. 

SS: What made you interested in taking on this major new responsibility?

TL: I became energized by the opportunity for national service, and excited by the scientific, administrative, and sociological responsibilities and challenges.  I’ll be engaged in hiring and staff development, and increasing the visibility of the bureau’s pre- and post-doctoral programs.  The position will provide the impetus to take a deep dive into finite-population statistical approaches, and contribute to the evolving understanding of the strengths and weaknesses of design-based, model-based and hybrid approaches to inference.  That I could remain a Hopkins employee by working via an Interagency Personnel Agreement sealed the deal.  I will start in January 2013 and serve through 2015, and will continue to participate in some Hopkins-based activities.

In addition to activities within the Census Bureau, I’ll be increasing connections among statisticians in other federal statistical agencies, and I’ll have a role in relations with researchers funded through the NSF to conduct census-related research.

SS: What are the sorts of research projects the Census is involved in? 

TL: The Census Bureau designs and conducts the decennial Census, the Current Population Survey, the American Community Survey, and many, many other surveys for other Federal Statistical Agencies, including the Bureau of Labor Statistics; it is a quite extraordinary portfolio. Each identifies issues in design and analysis that merit attention, many entail “Big Data” and many require combining information from a variety of sources.  I’ll give a few examples, and encourage exploration of www.census.gov/research.

You can get a flavor of the types of research from the titles of the six current centers within R&M: The Center for Adaptive Design, The Center for Administrative Records Research and Acquisition, The Center for Disclosure Avoidance Research, The Center for Economic Studies, The Center for Statistical Research and Methodology, and The Center for Survey Measurement.  Projects include multi-mode survey approaches, stopping rules for household visits, methods of combining information from surveys and administrative records, provision of focused estimates while preserving identity protection, improved small area estimates of income and of limited English skills (used to trigger provision of election ballots in languages other than English), and continuing investigation of issues related to model-based and design-based inferences.

 
SS: Are those projects related to your research?

TL: Some are, some will be, some will never be.  Small area estimation, hierarchical modeling with a Bayesian formalism, some aspects of adaptive design, some of combining evidence from a variety of sources, and general statistical modeling are in my power zone.  I look forward to getting involved in these and contributing to other projects.

SS: How does research performed at the Census help the American Public?

TL: Research innovations enable the bureau to produce more timely and accurate information at lower cost and to improve validity (for example, new approaches have at least maintained respondent participation in surveys), enhancing the reputation of the Census Bureau as a trusted source of information.  Estimates developed by Census are used to allocate billions of dollars in school aid, and they provide key planning information for businesses and governments.

SS: How can young statisticians get more involved in government statistical research?

TL: The first step is to become aware of the wide variety of activities and their high impact.  Visiting the Census website and those of other federal and state agencies, and the Committee on National Statistics (http://sites.nationalacademies.org/DBASSE/CNSTAT/) and the National Institute of Statistical Sciences (http://www.niss.org/) is a good start.   Make contact with researchers at the JSM and other meetings and be on the lookout for pre- and post-doctoral positions at Census and other federal agencies.

05
Oct

Not just one statistics interview...John McGready is the Jon Stewart of statistics

Editor’s Note: We usually reserve Fridays for posting Simply Statistics interviews. This week, we have a special guest post by John McGready, a colleague of ours who has been doing interviews with many of us in the department and has some cool ideas about connecting students in their first statistics class with cutting-edge researchers wrestling with many of the same concepts applied to modern problems. I’ll let him explain…

I teach a two-quarter course in introductory biostatistics to master’s students in public health at Johns Hopkins.  The majority of the class is composed of MPH students, but there are also students doing professional master’s degrees in environmental health, molecular biology, health policy and mental health. Despite the short length of the course, it covers the “greatest hits” of biostatistics, encompassing everything from exploratory data analysis up through and including multivariable proportional hazards regression.  The course focus is more conceptual and less mathematical/computing-centric than the other two introductory sequences taught at Hopkins: as such it has earned the nickname “baby biostatistics” from some at the School.  This, in my opinion, is an unfortunate misnomer: statistical reasoning is often the most difficult part of learning statistics.  We spend a lot of time focusing on the current literature, and making sense of or critiquing research by considering not only the statistical methods employed and the numerical findings, but also the study design and the logic of the substantive conclusions made by the study authors.

Via the course, I always hope to demonstrate the importance of biostatistics as a core driver of public health discovery, the importance of statistical reasoning in the research process, and how the fundamentals that are covered are the framework for more advanced methodology. At some point it dawned on me that the best approach for doing this was to have my colleagues speak to my students about these ideas.  Because of timing and scheduling constraints, this proved difficult to do in a live setting.  However, in June of 2012 a video recording studio opened here at the Hopkins Bloomberg School. At this point, I knew that I had to get my colleagues on video so that I could share their wealth of experiences and expertise with my students, and give the students multiple perspectives. To my delight my colleagues are very amenable to being interviewed and have been very generous with their time. I plan to continue doing the interviews so long as my colleagues are willing and the studio is available.

I have created a YouTube channel for these interviews.  At some point in the future, I plan to invite the biostatistics community as a whole to participate.  This will include interviews with visitors to my department, and submissions by biostatistics faculty and students from other schools. (I realize I am very lucky to have these facilities and video expertise at Hopkins, but many folks are tech savvy enough to film their own videos on their cameras, phones, etc… in fact you have seen such creativity by the editors of this here blog). With the help of some colleagues I plan on making a complementary website that will allow for easy submission of videos for posting, so stay tuned!

17
Aug

Interview with C. Titus Brown - Computational biologist and open access champion

C. Titus Brown 


C. Titus Brown is an assistant professor in the Department of Computer Science and Engineering at Michigan State University. He develops computational software for next generation sequencing and is the author of the blog “Living in an Ivory Basement”. We talked to Titus about open access (he publishes his unfunded grants online!), improving the reputation of PLoS One, his research in computational software development, and work-life balance in academics. 

Do you consider yourself a statistician, data scientist, computer scientist, or something else?

Good question.  Short answer: apparently somewhere along the way I
became a biologist, but with a heavy dose of “computational scientist”
in there.

The longer answer?  Well, it’s a really long answer…

My first research was on Avida, a bottom-up model for evolution that
Chris Adami, Charles Ofria and I wrote together at Caltech in 1993:
http://en.wikipedia.org/wiki/Avida.  (Fun fact: Chris, Charles and I
are now all faculty at Michigan State!  Chris and I have offices one
door apart, and Charles has an office one floor down.)  Avida got me
very interested in biology, but not in the undergrad “memorize stuff”
biology — more in research.  This was computational science: using
simple models to study biological phenomena.

While continuing evolution research, I did my undergrad in pure math at Reed
College, which was pretty intense; I worked in the Software Development
lab there, which connected me to a bunch of reasonably well known hackers
including Keith Packard, Mark Galassi, and Nelson Minar.

I also took a year off and worked on Earthshine:

http://en.wikipedia.org/wiki/Planetshine#Earthshine

and then rebooted the project as an RA in 1997, the summer after
graduation.  This was mostly data analysis, although it included a
fair amount of hanging off of telescopes adjusting things as the
freezing winter wind howled through the Big Bear Solar Observatory’s
observing room, aka “data acquisition”.

After Reed, I applied to a bunch of grad schools, including Princeton
and Caltech in bio, UW in Math, and UT Austin and Ohio State in
physics.  I ended up at Caltech, where I switched over to
developmental biology and (eventually) regulatory genomics and genome
biology in Eric Davidson’s lab.  My work there included quite a bit
of wet bench biology, which is not something many people associate with me,
but was nonetheless something I did!

Genomics was really starting to hit the fan in the early 2000s, and I
was appalled by how biologists were handling the data — as one
example, we had about $500k worth of sequences sitting on a shared
Windows server, with no metadata or anything — just the filenames.
As another example, I watched a postdoc manually BLAST a few thousand
ESTs against the NCBI nr database; he sat there and did them three by
three, having figured out that he could concatenate three sequences
together and then manually deconvolve the results.  As probably the
most computationally experienced person in the lab, I quickly got
involved in data analysis and Web site stuff, and ended up writing
some comparative sequence analysis software that was mildly popular
for a while.

As part of the sequence analysis Web site I wrote, I became aware that
maintaining software was a *really* hard problem.  So, towards the end
of my 9 year stint in grad school, I spent a few years getting into
testing, both Web testing and more generally automated software
testing.  This led to perhaps my most used piece of software, twill, a
scripting language for Web testing.  It also ended up being one of the
things that got me elected into the Python Software Foundation,
because I was doing everything in Python (which is a really great
language, incidentally).

I also did some microbial genome analysis (which led to my first
completely reproducible paper: Brown and Callan, 2004;
http://www.ncbi.nlm.nih.gov/pubmed/14983022) and collaborated with the
Orphan lab on some metagenomics:
http://www.ncbi.nlm.nih.gov/pubmed?term=18467493.  This led to a
fascination with the biological “dark matter” in nature that is the
subject of some of my current work on metagenomics.

I landed my faculty position at MSU right out of grad school, because
bioinformatics is sexy and CS departments are OK with hiring grad
students as faculty.  However, I deferred for two years to do a
postdoc in Marianne Bronner-Fraser’s lab because I wanted to switch to
the chick as a model organism, and so I ended up arriving at MSU in
2009.  I had planned to focus a lot on developmental gene regulatory
networks, but 2009 was when Illumina sequencing hit, and as one of the
few people around who wasn’t visibly frightened by the term “gigabyte”
I got inextricably involved in a lot of different sequence analysis
projects.  These all converged on assembly, and, well, that seems to
be what I work on now :).

The two strongest threads that run through my research are these:

1. “better science through superior software” — so much of science
depends on computational inference these days, and so little of the
underlying software is “good”.  Scientists *really* suck at software
development (for both good and bad reasons) and I worry that a lot of
our current science is on a really shaky foundation.  This is one
reason I’m invested in Software Carpentry
(http://software-carpentry.org), a training program that Greg Wilson
has been developing — he and I agree that science is our best hope
for a positive future, and good software skills are going to be
essential for a lot of that science.  More generally I hope to turn
good software development into a competitive advantage for my lab
and my students.

2. “better hypothesis generation is needed” — biologists, in
particular, tend to leap towards the first testable hypothesis they
find.  This is a cultural thing stemming (I think) from a lot of
really bad interactions with theory: the way physicists and
mathematicians think about the world simply doesn’t fit with the Rube
Goldberg-esque features of biology (see
http://ivory.idyll.org/blog/is-discovery-science-really-bogus.html).

So getting back to the question, uh, yeah, I think I’m a computational
scientist who is working on biology?  And if I need to write a little
(or a lot) of software to solve my problems, I’ll do that, and I’ll
try to do it with some attention to good software development
practice — not just out of ethical concern for correctness, but
because it makes our research move faster.

One thing I’m definitely *not* is a statistician.  I have friends who
are statisticians, though, and they seem like perfectly nice people.

You have a pretty radical approach to open access, can you tell us a little bit about that?

Ever since Mark Galassi introduced me to open source, I thought it
made sense.  So I’ve been an open source-nik since … 1988?

From there it’s just a short step to thinking that open science makes
a lot of sense, too.  When you’re a grad student or a postdoc, you
don’t get to make those decisions, though; it took until I was a PI
for me to start thinking about how to do it.  I’m still conflicted
about *how* open to be, but I’ve come to the conclusion that posting
preprints is obvious
(http://ivory.idyll.org/blog/blog-practicing-open-science.html).

The “radical” aspect that you’re referring to is probably my posting
of grants (http://ivory.idyll.org/blog/grants-posted.html).  There are
two reasons I ended up posting all of my single-PI grants.  Both have
their genesis in this past summer, when I spent about 5 months writing
6 different grants — 4 of which were written entirely by me.  Ugh.

First, I was really miserable one day and joked on Twitter that “all
this grant writing is really cutting into my blogging” — a mocking
reference to the fact that grant writing (to get $$) is considered
academically worthwhile, while blogging (which communicates with the
public and is objectively quite valuable) counts for naught with my
employer.  Jonathan Eisen responded by suggesting that I post all of
the grants and I thought, what a great idea!

Second, I’m sure it’s escaped most people (hah!), but grant funding
rates are in the toilet — I spent all summer writing grants while
expecting most of them to be rejected.  That’s just flat-out
depressing!  So it behooves me to figure out how to make them serve
multiple duties.  One way to do that is to attract collaborators;
another is to serve as google bait for my lab; a third is to provide
my grad students with well-laid-out PhD projects.  A fourth duty they
serve (and I swear this was unintentional) is to point out to people
that this is MY turf and I’m already solving these problems, so maybe
they should go play in less occupied territory.  I know, very passive
aggressive…

So I posted the grants, and unknowingly joined a really awesome cadre
of folk who had already done the same
(http://jabberwocky.weecology.org/2012/08/10/a-list-of-publicly-available-grant-proposals-in-the-biological-sciences/).
Most feedback I’ve gotten has been from grad students and undergrads
who really appreciate the chance to look at grants; some people told
me that they’d been refused the chance to look at grants from their
own PIs!

At the end of the day, I’d be lucky to be relevant enough that people
want to steal my grants or my software (which, by the way, is under a
BSD license — free for the taking, no “theft” required…).  My
observation over the years is that most people will do just about
anything to avoid using other people’s software.

In theoretical statistics, there is a tradition of publishing pre-prints while papers are submitted. Why do you think biology is lagging behind?

I wish I knew!  There’s clearly a tradition of secrecy in biology;
just look at the Cold Spring Harbor rules re tweeting and blogging
(http://meetings.cshl.edu/report.html) - this is a conference, for
chrissakes, where you go to present and communicate!  I think it’s
self-destructive and leads to an insider culture where only those who
attend meetings and chat informally get to be members of the club,
which frankly slows down research. Given the societal and medical
challenges we face, this seems like a really bad way to continue doing
research.

One of the things I’m proudest of is our effort on the cephalopod
genome consortium’s white paper,
http://ivory.idyll.org/blog/cephseq-cephalopod-genomics.html, where a
group of bioinformaticians at the meeting pushed really hard to walk
the line between secrecy and openness.  I came away from that effort
thinking two things: first, that biologists were erring on the side of
risk aversion; and second, that genome database folk were smoking
crack when they pushed for complete openness of data.  (I have a blog
post on that last statement coming up at some point.)

The bottom line is that the incentives in academic biology are aligned
against openness.  In particular, you are often rewarded for the first
observation, not for the most useful one; if your data is used to do
cool stuff, you don’t get much if any credit; and it’s all about
first/last authorship and who is PI on the grants.  All too often this
means that people sit on their data endlessly.

This is getting particularly bad with next-gen data sets, because
anyone can generate them but most people have no idea how to analyze
their data, and so they just sit on it forever…

Do you think the ArXiv model will catch on in biology or just within the bioinformatics community?

One of my favorite quotes is: “Making predictions is hard, especially
when they’re about the future.” I attribute it to Niels Bohr.

It’ll take a bunch of big, important scientists to lead the way. We
need key members of each subcommunity of biology to decide to do it on
a regular basis. (At this point I will take the obligatory cheap shot
and point out that Jonathan Eisen, noted open access fan, doesn’t post
his stuff to preprint servers very often.  What’s up with that?)  It’s
going to be a long road.

What is the reaction you most commonly get when you tell people you have posted your un-funded grants online?

“Ohmigod what if someone steals them?”

Nobody has come up with a really convincing model for why posting
grants is a bad thing.  They’re just worried that it *might* be.  I
get the vague concerns about theft, but I have a hard time figuring
out exactly how it would work out well for the thief — reputation is
a big deal in science, and gossip would inevitably happen.  And at
least in bioinformatics I’m aiming to be well enough known that
straight up ripping me off would be suicidal.  Plus, if reviewers
do/did google searches on key concepts then my grants would pop up,
right?  I just don’t see it being a path to fame and glory for anyone.

Revisiting the passive-aggressive nature of my grant posting, I’d like
to point out that most of my grants depend on preliminary results from
our own algorithms.  So even if they want to compete on my turf, it’ll
be on a foundation I laid.  I’m fine with that — more citations for
me, either way :).

More optimistically, I really hope that people read my grants and then
find new (and better!) ways of solving the problems posed in them.  My
goal is to enable better science, not to hunker down in a tenured job
and engage in irrelevant science; if someone else can use my grants as
a positive or negative signpost to make progress, then broadly
speaking, my job is done.

Or, to look at it another way: I don’t have a good model for either
the possible risks OR the possible rewards of posting the grants, and
my inclinations are towards openness, so I thought I’d see what
happens.

How can junior researchers correct misunderstandings about open access/journals like PLoS One that separate correctness from impact? Do you have any concrete ideas for changing minds of senior folks who aren’t convinced?

Render them irrelevant by becoming senior researchers who supplant them
when they retire.  It’s the academic tradition, after all!  And it’s
really the only way within the current academic system, which — for
better or for worse — isn’t going anywhere.

Honestly, we need fewer people yammering on about open access and more
people simply doing awesome science and submitting it to OA journals.
Conveniently, many of the high impact journals are shooting themselves
in the foot and encouraging this by rejecting good science that then
ends up in an OA journal; that wonderful ecology op-ed on PLoS One
citation rates shows this well
(http://library.queensu.ca/ojs/index.php/IEE/article/view/4351).

Do you have any advice on what computing skills/courses statistics students interested in next generation sequencing should take?

For courses, no — in my opinion 80% of what any good researcher
learns is self-motivated and often self-taught, and so it’s almost
silly to pretend that any particular course or set of skills is
sufficient or even useful enough to warrant a whole course.  I’m not a
big fan of our current undergrad educational system :)

For skills?  You need critical thinking coupled with an awareness that
a lot of smart people have worked in science, and odds are that there
are useful tricks and approaches that you can use.  So talk to other
people, a lot!  My lab has a mix of biologists, computer scientists,
graph theorists, bioinformaticians, and physicists; more labs should
be like that.

Good programming skills are going to serve you well no matter what, of
course.  But I know plenty of good programmers who aren’t very
knowledgeable about biology, and who run into problems doing actual
science.  So it’s not a panacea.

How does replicable or reproducible research fit into your interests?

I’ve wasted *so much time* reproducing other people’s work that when
the opportunity came up to put down a marker, I took it.

http://ivory.idyll.org/blog/replication-i.html

The digital normalization paper shouldn’t have been particularly
radical; that it is tells you all you need to know about replication
in computational biology.

This is actually something I first did a long time ago, with what was
perhaps my favorite pre-faculty-job paper: if you look at the methods
for Brown & Callan (2004) you’ll find a downloadable package that
contains all of the source code for the paper itself and the analysis
scripts.  But back then I didn’t blog :).

Lack of reproducibility and openness in methods has serious
consequences — how much of cancer research has been useless, for
example?  See this horrific report:
http://online.wsj.com/article/SB10001424052970203764804577059841672541590.html
Again, the incentives are all wrong: you get grant money for
publishing, not for being useful.  The two are not necessarily the
same…

Do you have a family, and how do you balance work life and home life?

Why, thank you for asking!  I do have a family — my wife, Tracy Teal,
is a bioinformatician and microbial ecologist, and we have two
wonderful daughters, Amarie (4) and Jessie (1).  It’s not easy being a
junior professor and a parent at the same time, and I keep on trying
to figure out how to balance the needs of travel with the need to be a
parent (hint: I’m not good at it).  I’m increasingly leaning towards
blogging as being a good way to have an impact while being around
more; we’ll see how that goes.

20
Jul

Interview with Lauren Talbot - Quantitative analyst for the NYC Financial Crime Task Force

Lauren Talbot


Lauren Talbot is a quantitative analyst for the New York City Financial Crime Task Force. Before working for NYC she was an analyst at Acumen LLC and got her degree in economics from Stanford University. She is a key player turning spatial data in NYC into new tools for government management. We talked to Lauren about her work, how she is using open data to do things like predict where fires might occur, and how she got started in the Financial Crime Task Force. 

SS: Do you consider yourself a statistician, computer scientist, or something else?

LT: A lot of us can’t call ourselves statisticians or computer scientists, even if that is a large part of what we do, because we never studied those fields formally. Quantitative or Data Analyst are popular job titles, but don’t really do justice to all the code infrastructure/systems you have to build and cultivate — you aren’t simply analyzing, you are matching and automating and illustrating, too. There is also a large creative aspect, because you have to figure out how to present the data in a way that is useful and compelling to people, many of whom have no prior experience working with data. So I am glad people have started using the term “Data Scientist,” even if it makes me chuckle a little. Ideally I would call myself “Data Artist,” or “Data Whisperer,” but I don’t think people would take me seriously.

SS: How did you end up in the NYC Mayor’s Financial Crimes Task Force?

LT: I actually responded to a Craigslist posting. While I was still in the Bay Area (where I went to college), I was looking for jobs in NYC because I wanted to relocate back here, where I am originally from. I was searching for SAS programmer jobs, and finding a lot of stuff in healthcare that made me yawn a little. And then I had the idea to try the government jobs section. The Financial Crimes Task Force (now part of a broader citywide analytics effort under the Office of Policy and Strategic Planning) was one of two listings that popped up, and I read the description and immediately thought “dream job!” It has turned out to be even better than I imagined, because there is such a huge opportunity to make a difference — the Bloomberg administration is actually very interested in operationalizing insights from city data, so they are listening to the data people and using their work to inform agency resource allocation and even sometimes policy. My fellow analysts are also just really fun and intelligent. I’m constantly impressed by how quickly they pick up new skills, get to the bottom of things, and jump through hoops to get things done. We also amuse and entertain each other throughout the day, which is awesome. 

SS: Can you tell us about one of the more interesting cases you have tackled and how data analysis/statistics played into the case?

LT: Since this is the NYC Mayor’s Office, dealing with city data, almost all of our analyses are in some way location-based. We are trying to answer questions like, “what locations are most likely to have a catastrophic event (e.g. fire) in the near future?” This involves combining many disparate datasets such as fire data, buildings data, emergency calls data, city planning data, even garbage data. We use the tax lot ID as a common identifier, but many of the datasets do not come with this variable - they only have a text address or intersection. In many cases, the address is entered manually and has spelling mistakes. In the beginning, we were using a point-and-click geocoding tool that the city provides that reads the text field and assigns the tax lot ID. However, it was taking a long time to prepare the data so it could be used by the program, and the program was returning many errors. When we visually inspected the errors, we saw that they were caused by minor spelling differences and naming conventions. Now, almost every week we get new datasets in different structures, and we need to geocode them immediately before we can really work with them. So we needed a geocoding program that was automated and flexible, as well as capable of geocoding addresses and intersections with spelling errors and different conventions. Over the past few months, using publicly available city planning datasets and regular expressions, my side project has been creating such a program in SAS. My first test case was self-reported data created solely through user entry. This dataset, which could only be 40% geocoded using the original tool, is now 93% geocoded using the program we developed. The program is constantly evolving and improving. Now it is assigning block faces, spellchecking street and city names, and accounting for the occasional gaps in the data. We use it for everything.
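
For readers curious what this kind of regex-driven geocoding looks like in practice, here is a minimal sketch in R (rather than the SAS program Lauren describes); the abbreviation rules and the reference table of tax lot IDs are made up for illustration, and a real system would be built on the city planning datasets she mentions.

    # Toy regex-based address normalization and matching.
    # The abbreviation list and reference table are hypothetical.
    normalize_address <- function(x) {
      x <- toupper(x)
      x <- gsub("[.,]", "", x)                        # drop punctuation
      x <- gsub("(\\d+)(ST|ND|RD|TH)\\b", "\\1", x)   # 5TH -> 5
      x <- gsub("\\bST\\b", "STREET", x)              # expand common abbreviations
      x <- gsub("\\bAVE\\b", "AVENUE", x)
      x <- gsub("\\bE\\b", "EAST", x)
      x <- gsub("\\bW\\b", "WEST", x)
      gsub("\\s+", " ", trimws(x))                    # collapse whitespace
    }

    # Hypothetical reference data mapping clean addresses to tax lot IDs
    reference <- data.frame(
      address    = c("100 EAST MAIN STREET", "250 WEST 5 AVENUE"),
      tax_lot_id = c("1001750001", "1008420015"),
      stringsAsFactors = FALSE
    )

    # Self-reported addresses with typos and inconsistent conventions
    raw <- c("100 E. Main St", "250 W 5th Ave.")
    cleaned <- data.frame(address = normalize_address(raw), stringsAsFactors = FALSE)
    merge(cleaned, reference, by = "address", all.x = TRUE)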

SS: What are the computational tools and ideas you use most frequently in your day to day work (R, databases, regression analysis, etc.)?

LT: In the beginning, all of the data was sent to us in SQL or Excel, which was not very efficient. Now we are building a multi-agency SAS platform that can be used by programmers and non-programmers. Since there are so many data sources that can work together, having a unified platform creates new discoveries that agencies can use to be more efficient or effective. For example, a building investigator can use 311 noise complaints to uncover vacated properties that are being illegally occupied. The platform employs Palantir, which is an excellent front-end tool for playing around with the data and exploring many-to-many relationships.  Internally, my team has also used R, Python, Java, even VBA. Whatever gets the job done. We use a good mix of statistical tools. The bread and butter is usually manipulating and understanding new data sources, which is necessary before we can start trying to do something like run a multiple regression, for example. In the end, it’s really a mashup: text parsing, name matching, summarizing/describing/reporting using comparative statistics, geomapping, graphing, logistic regression, even kernel density, can all be part of the mix. Our guiding principle is to use the tool/analysis/strategy that has the highest return on investment of time and analyst resources for the city.

SS: What are the challenges of working as a quantitative analyst in a regulatory role? Is it hard to make your analyses/discoveries understandable?

LT: A lot of data analysts working in government have a difficult time getting agencies and policymakers to take action based on their work due to political priorities and organizational structures. We circumvent that issue by operating based on the needs and requests of the agencies, as well as paying attention to current events. An agency or official may come to us with a problem, and we figure out what we can deliver that will be of use to them. This starts a dialogue. It becomes an iterative process, and projects can grow and morph once we have feedback. Oftentimes, it is better to use a data-mining approach, which is more understandable to non-statisticians, rather than a regression, which can seem like a black box. For example, my colleague came up with an algorithm to target properties that were a high fire risk based on the presence of illegal conversion complaints and evidence that the property owner was under financial distress. He began with a simple list of properties for the Department of Buildings to focus on, and now they go out to inspect a list of places selected by his algorithm weekly. This video of the fire chief speaking about the project illustrates the challenges encountered and why the simpler approach was ultimately successful: http://www.youtube.com/watch?v=425QSx0U8lU&feature=youtube_gdata_player

SS: Do you have any advice for statisticians/data scientists who want to get involved with open government or government data analysis?

LT: I’ve found that people in government are actually very open to and interested in using data. The first challenge is that they don’t know that the data they have is of value. To be the most effective, you should get in touch with the people who have subject matter expertise (usually employees who have been working on the ground for some time), interview them, check your assumptions, and share whatever you’re seeing in the data on an ongoing basis. Not only will both parties learn faster, but it helps build a culture of interest in the data. Once people see what is possible, they will become more creative and start requesting deliverables that are increasingly actionable. The second challenge is getting data, and the legal and social/political issues surrounding that. The big secret is that so much useful data is actually publicly available. Do your research — you may find what you need without having to fight for it. If what you need is protected, however, consider whether the data would still be useful to you if scrubbed of personally identifiable information. Location-based data is a good example of this. If so, see whether you can negotiate with the data owner to obtain only the parts needed to do your analysis. Finally, you may find that the cohort of data scientists in government is all too sparse, and too few people “speak your language.” Reach out and align yourself with people in other agencies who are also working with data. This is a great way to gain new insight into the goals and issues of your administration, as well as friends to support and advise you as you navigate “the system.”

01
Jun

Interview with Amanda Cox - Graphics Editor at the New York Times

Amanda Cox 



Amanda Cox received her M.S. in statistics from the University of Washington in 2005. She then moved to the New York Times, where she is a graphics editor. She, and the graphics team at the New York Times, are responsible for many of the cool, informative, and interactive graphics produced by the Times. For example, this, this and this (the last one, Olympic Symphony, is one of my all time favorites). 

You have a background in statistics, do you consider yourself a statistician? Do you consider what you do statistics?

I don’t deal with uncertainty in a formal enough way to call what I do statistics, or myself a statistician. (My technical title is “graphics editor,” but no one knows what this means. On the good days, what we do is “journalism.”) Mark Hansen, a statistician at UCLA, has possibly changed my thinking on this a little bit though, by asking who I want to be the best at visualizing data, if not statisticians.

How did you end up at the NY Times?

In the middle of my first year of grad school (in statistics at the University of Washington), I started applying for random things. One of them was to be a summer intern in the graphics department at the Times.

How are the graphics and charts you develop different than producing graphs for a quantitative/scientific audience?


“Feels like homework” is a really negative reaction to a graphic or a story here. In practice, that means a few things: we don’t necessarily assume our audience already cares about a topic. We try to get rid of jargon, which can be useful shorthand for technical audiences, but doesn’t belong in a newspaper. Most of our graphics can stand on their own, meaning you shouldn’t need to read any accompanying text to understand the basic point. Finally, we probably pay more attention to things like typography and design, which, done properly, are really about hierarchy and clarity, and not just about making things cute. 


How do you use R to prototype graphics? 

I sketch in R, which mostly just means reading data, and trying on different forms or subsets or levels of aggregation. It’s nothing fancy: usually just points and lines and text from base graphics. For print, I will sometimes clean up a pdf of R output in Illustrator. You can see some of that in practice at chartsnthings.tumblr.com, which is where one of my colleagues, Kevin Quealy, posts some of the department’s sketches. (Kevin and I are the only regular R users here, so the amount of R used on chartsnthings is not at all representative of NYT graphics as a whole.)
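
To make that concrete, here is a hypothetical quick sketch of the kind she describes, nothing more than points, a smoother, and a note to self from base graphics, with R's built-in airquality dataset standing in for whatever data is at hand.

    # Rough look at the data; not a finished graphic
    d <- subset(airquality, !is.na(Ozone))
    plot(d$Temp, d$Ozone, pch = 16, col = "grey40",
         xlab = "Temperature (F)", ylab = "Ozone (ppb)")
    lines(lowess(d$Temp, d$Ozone), lwd = 2)
    text(58, 150, "note to self: check the hottest days", adj = 0)

    # Try the same thing split out by month before settling on a form
    par(mfrow = c(1, 5))
    for (m in 5:9) with(subset(d, Month == m),
                        plot(Temp, Ozone, pch = 16, main = month.name[m]))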

Do you have any examples where the R version and the eventual final web version are nearly identical?

Real interactivity changes things, so my use of R for web graphics is mostly just a proof-of-concept thing. (Sometimes I will also generate “poor-man’s interactivity,” which means hitting the pagedown key on a pdf of charts made in a for loop.) But here are a couple of proof-of-concept sketches, where the initial R output doesn’t look so different from the final web version.

The Jobless Rate for People Like You

How Different Groups Spend Their Day
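
The “poor-man’s interactivity” trick mentioned above is simple enough to show in a few lines; this sketch (again using airquality as a stand-in dataset) writes one chart per group to a multi-page pdf that you can flip through with the pagedown key.

    # One page per month; paging through the pdf approximates a slider
    pdf("ozone-by-month.pdf", width = 7, height = 5)
    for (m in 5:9) {
      with(subset(airquality, Month == m),
           plot(Temp, Ozone, pch = 16, xlim = c(55, 100), ylim = c(0, 170),
                main = paste("Month:", month.name[m])))
    }
    dev.off()

Fixing the axis limits keeps the pages comparable as you flip through them.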

You consistently produce arresting and informative graphics about a range of topics. How do you decide on which topics to tackle?

News value and interestingness are probably the two most important criteria for deciding what to work on. In an ideal world, you get both, but sometimes, one is enough (or the best you can do).

Are your project choices motivated by availability of data?

Sure. The availability of data also affects the scope of many projects. For example, the guys who work on our live election results will probably map them by county, even though precinct-level results are so much better. But precinct-level data isn’t generally available in real time.

What is the typical turn-around time from idea to completed project?

The department is most proud of some of its one-day, breaking news work, but very little of that is what I would think of as data-heavy.  The real answer to “how long does it take?” is “how long do we have?” Projects always find ways to expand to fill the available space, which often ranges from a couple of days to a couple of weeks.


Do you have any general principles for how you make complicated data understandable to the general public?

I’m a big believer in learning by example. If you annotate three points in a scatterplot, I’m probably good, even if I’m not super comfortable reading scatterplots. I also think the words in a graphic should highlight the relevant pattern, or an expert’s interpretation, and not merely say “Here is some data.” The annotation layer is critical, even in a newspaper (where the data is not usually super complicated).
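
A minimal sketch of that annotation layer in base R, using the built-in mtcars data: label a handful of points and state the pattern in words rather than leaving the reader to decode the scatterplot. The particular cars and the wording are just for illustration.

    pts <- c("Toyota Corolla", "Cadillac Fleetwood", "Lotus Europa")
    plot(mtcars$wt, mtcars$mpg, pch = 16,
         xlab = "Weight (1,000 lbs)", ylab = "Miles per gallon")
    text(mtcars[pts, "wt"], mtcars[pts, "mpg"], labels = pts,
         pos = c(4, 2, 4), cex = 0.8)   # nudge labels off the points
    mtext("Heavier cars get fewer miles per gallon", side = 3, adj = 0, line = 0.5)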

What do you consider to be the most informative graphical elements or interactive features that you consistently use?

I like sliders, because there’s something about them that suggests story (beginning-middle-end), even if the thing you’re changing isn’t time. Using movement in a way that means something, like this or this, is still also fun, because it takes advantage of one of the ways the web is different from print.

11
May

Interview with Hadley Wickham - Developer of ggplot2

Hadley Wickham



Hadley Wickham is the Dobelman Family Junior Chair of Statistics at Rice University. Prior to moving to Rice, he completed his Ph.D. in Statistics at Iowa State University. He is the developer of the wildly popular ggplot2 software for data visualization and a contributor to the GGobi project. He has developed a number of really useful R packages touching everything from data processing, to data modeling, to visualization. 

Which term applies to you: data scientist, statistician, computer
scientist, or something else?

I’m an assistant professor of statistics, so I at least partly
associate with statistics :).  But the idea of data science really
resonates with me: I like the combination of tools from statistics and
computer science, data analysis and hacking, with the core goal of
developing a better understanding of data. Sometimes it seems like not
much statistics research is actually about gaining insight into data.

You have created/maintain several widely used R packages. Can you
describe the unique challenges to writing and maintaining packages
above and beyond developing the methods themselves?

I think there are two main challenges: turning ideas into code, and
documentation and community building.

Compared to other languages, the software development infrastructure
in R is weak, which sometimes makes it harder than necessary to turn
my ideas into code. Additionally, I get less and less time to do
software development, so I can’t afford to waste time recreating old
bugs, or releasing packages that don’t work. Recently, I’ve been
investing time in helping build better dev infrastructure; better
tools for documentation [roxygen2], unit testing [testthat], package development [devtools], and creating package websites [staticdocs]. Generally, I’ve
found unit tests to be a worthwhile investment: they ensure you never
accidentally recreate an old bug, and give you more confidence when
radically changing the implementation of a function.
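
As a concrete illustration of that kind of regression test, here is a minimal testthat sketch; str_trim() from stringr simply stands in for whatever function once harbored the bug.

    library(testthat)
    library(stringr)

    test_that("str_trim keeps handling the edge cases that broke before", {
      expect_equal(str_trim("  hello  "), "hello")
      expect_equal(str_trim(""), "")          # a once-reported bug stays fixed
      expect_equal(str_trim("\t a \n"), "a")
    })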

Documenting code is hard work, and it’s certainly something I haven’t
mastered. But documentation is absolutely crucial if you want people
to use your work. I find the main challenge is putting yourself in the
mind of the new user: what do they need to know to use the package
effectively. This is really hard to do as a package author because
you’ve internalised both the motivating problem and many of the common
solutions.

Connected to documentation is building up a community around your
work. This is important to get feedback on your package, and can be
helpful for reducing the support burden. One of the things I’m most
proud of about ggplot2 is something that I’m barely responsible for:
the ggplot2 mailing list. There are now ggplot2 experts who answer far
more questions on the list than I do. I’ve also found github to be
great: there’s an increasing community of users proficient in both R
and git who produce pull requests that fix bugs and add new features.

The flip side of building a community is that as your work becomes
more popular you need to be more careful when releasing new versions.
The last major release of ggplot2 (0.9.0) broke over 40 (!!) CRAN
packages, and forced me to rethink my release process. Now I advertise
releases a month in advance, and run `R CMD check` on all downstream
dependencies (`devtools::revdep_check` in the development version), so
I can pick up potential problems and give other maintainers time to
fix any issues.
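
In code, that release checklist amounts to something like the following sketch; revdep_check() was in the development version of devtools at the time, and the exact call may differ in current versions.

    library(devtools)
    check(".")          # R CMD check on the package source in the working directory
    revdep_check(".")   # then run R CMD check on every reverse dependency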

Do you feel that the academic culture has caught up with and supports
non-traditional academic contributions (e.g. R packages instead of
papers)?

It’s hard to tell. I think it’s getting better, but it’s still hard to
get recognition that software development is an intellectual activity
in the same way that developing a new mathematical theorem is. I try
to hedge my bets by publishing papers to accompany my major packages:
I’ve also found the peer-review process very useful for improving the
quality of my software. Reviewers from both the R journal and the
Journal of Statistical Software have provided excellent suggestions
for enhancements to my code.

You have given presentations at several start-up and tech companies.
Do the corporate users of your software have different interests than
the academic users?

By and large, no. Everyone, regardless of domain, is struggling to
understand ever larger datasets. Across both industry and academia,
practitioners are worried about reproducible research and thinking
about how to apply the principles of software engineering to data
analysis.

You gave one of my favorite presentations called Tidy Data/Tidy Tools
at the NYC Open Statistical Computing Meetup. What are the key
elements of tidy data that all applied statisticians should know?

Thanks! Basically, make sure you store your data in a consistent
format, and pick (or develop) tools that work with that data format.
The more time you spend munging data in the middle of an analysis, the
less time you have to discover interesting things in your data. I’ve
tried to develop a consistent philosophy of data that means when you
use my packages (particularly plyr and ggplot2), you can focus on the
data analysis, not on the details of the data format. The principles
of tidy data that I adhere to are that every column should be a
variable, every row an observation, and different types of data should
live in different data frames. (If you’re familiar with database
normalisation this should sound pretty familiar!). I expound these
principles in depth in my in-progress [paper on the
topic].
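
A minimal sketch of what those principles look like in practice, using reshape2::melt() (another of Hadley's packages) on a made-up wide table:

    library(reshape2)

    # Wide format: one column per year, so "year" is trapped in the headers
    wide <- data.frame(country = c("A", "B"),
                       `1999` = c(0.7, 2.0),
                       `2000` = c(1.0, 2.3),
                       check.names = FALSE)

    # Tidy format: every column is a variable, every row an observation
    melt(wide, id.vars = "country", variable.name = "year", value.name = "rate")
    #   country year rate
    # 1       A 1999  0.7
    # 2       B 1999  2.0
    # 3       A 2000  1.0
    # 4       B 2000  2.3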

How do you decide what project to work on next? Is your work inspired
by a particular application or more general problems you are trying to
tackle?

Very broadly, I’m interested in the whole process of data analysis:
the process that takes raw data and converts it into understanding,
knowledge and insight. I’ve identified three families of tools
(manipulation, modelling and visualisation) that are used in every
data analysis, and I’m interested both in developing better individual
tools, but also smoothing the transition between them. In every good
data analysis, you must iterate multiple times between manipulation,
modelling and visualisation, and anything you can do to make that
iteration faster yields qualitative improvements to the final analysis
(that was one of the driving reasons I’ve been working on tidy data).

Another factor that motivates a lot of my work is teaching. I hate
having to teach a topic that’s just a collection of special cases,
with no underlying theme or theory. That drive led to [stringr] (for
string manipulation) and [lubridate] (with Garrett Grolemund for working
with dates). I recently released the [httr] package which aims to do a similar thing for http requests - I think this is particularly important as more and more data starts living on the web and must be accessed through an API.

What do you see as the biggest open challenges in data visualization
right now? Do you see interactive graphics becoming more commonplace?

I think one of the biggest challenges for data visualisation is just
communicating what we know about good graphics. The first article
decrying 3d bar charts was published in 1951! Many plots still use
rainbow scales or red-green colour contrasts, even though we’ve known
for decades that those are bad. How can we ensure that people
producing graphics know enough to do a good job, without making them
read hundreds of papers? It’s a really hard problem.

Another big challenge is balancing the tension between exploration and
presentation. For exploratory graphics, you want to spend five seconds
(or less) to create a plot that helps you understand the data, while you might spend
five hours on a plot that’s persuasive to an audience who
isn’t as intimately familiar with the data as you. To date, we have
great interactive graphics solutions at either end of the spectrum
(e.g. ggobi/iplots/manet vs d3) but not much that transitions from one
end of the spectrum to the other. This summer I’ll be spending some
time thinking about what ggplot2 + [d3] might
equal, and how we can design something like an interactive grammar of
graphics that lets you explore data in R, while making it easy to
publish interactive presentation graphics on the web.

13
Apr

Interview with Drew Conway - Author of "Machine Learning for Hackers"

Drew Conway

Drew Conway is a Ph.D. student in Politics at New York University and the co-ordinator of the New York Open Statistical Programming Meetup. He is the creator of the famous (or infamous) data science Venn diagram, the basis for our R function to determine if you’re a data scientist. He is also the co-author of Machine Learning for Hackers, a book of case studies that illustrates data science from a hacker’s perspective. 

Which term applies to you: data scientist, statistician, computer
scientist, or something else?
Technically, my undergraduate degree is in computer science, so that term can be applied.  I was actually a double major in CS and political science, however, so it wouldn’t tell the whole story.  I have always been most interested in answering social science problems with the tools of computer science, math and statistics.
I have struggled a bit with the term “data scientist.”  About a year ago, when it seemed to be gaining a lot of popularity, I bristled at it.  Like many others, I complained that it was simply a corporate rebranding of other skills, and that the term “science” was appended to give some veil of legitimacy.  Since then, I have warmed to the term, but—as is often the case—only when I can define what data science is in my own terms.  Now, I do think of what I do as being data science, that is, the blending of technical skills and tools from computer science, with the methodological training of math and statistics, and my own substantive interest in questions about collective action and political ideology.
I think the term is very loaded, however, and when many people invoke it they often do so as a catch-all for talking about working with a certain set of tools: R, map-reduce, data visualization, etc.  I think this actually hurts the discipline a great deal, because if it is meant to actually be a science the majority of our focus should be on questions, not tools.
 

You are in the department of politics? How is it being a “data
person” in a non-computational department?

Data has always been an integral part of the discipline, so in that sense many of my colleagues are data people.  I think the difference between my work and the work that many other political scientists do is simply a matter of where and how I get my data.  
For example, a traditional political science experiment might involve a small set of undergraduates taking a survey or playing a simple game on a closed network.  That data would then be collected and analyzed as a controlled experiment.  Alternatively, I am currently running an experiment wherein my co-authors and I are attempting to code text documents (political party manifestos) with ideological scores (very liberal to very conservative).  To do this we have broken down the documents into small chunks of text and are having workers on Mechanical Turk code single chunks—rather than the whole document at once.  In this case the data scale up very quickly, but by aggregating the results we are able to have a very different kind of experiment with much richer data.
At the same time, I think political science—and perhaps the social sciences more generally—suffers from a tradition of undervaluing technical expertise. In that sense, it is difficult to convince colleagues that developing software tools is important. 
 

Is that what inspired you to create the New York Open Statistical Meetup?

I actually didn’t create the New York Open Statistical Meetup (formerly the R meetup).  Joshua Reich was the original founder, back in 2008, and shortly after the first meeting we partnered and ran the Meetup together.  Once Josh became fully consumed by starting / running BankSimple I took it over by myself.  I think the best part about the Meetup is how it brings people together from a wide range of academic and industry backgrounds, and we can all talk to each other in a common language of computational programming.  The cross-pollination of ideas and talents is inspiring.
We are also very fortunate in that the community here is so strong, and that New York City is a well traveled place, so there is never a shortage of great speakers.
 

You created the data science Venn diagram. Where do you fall on the diagram?

Right at the center, of course! Actually, before I entered graduate school, which is long before I drew the Venn diagram, I fell squarely in the danger zone.  I had a lot of hacking skills, and my work (as an analyst in the U.S. intelligence community) afforded me a lot of substantive expertise, but I had little to no formal training in statistics.  If you could describe my journey through graduate school within the framework of the data science Venn diagram, it would be about me trying to pull myself out of the danger zone by gaining as much math and statistics knowledge as I can.  
 

I see that a lot of your software (including R packages) are on Github. Do you post them on CRAN as well? Do you think R developers will eventually move to Github from CRAN?

I am a big proponent of open source development, especially in the context of sharing data and analyses, and creating reproducible results.  I love Github because it creates a great environment for following the work of other coders, and participating in the development process.  For data analysis, it is also a great place to upload data and R scripts and allow the community to see how you did things and comment.  I also think, however, that there is a big opportunity for a new site—like Github—to be created that is more tailored for data analysis, and storing and disseminating data and visualizations.
I do post my R packages to CRAN, and I think that CRAN is one of the biggest strengths of the R language and community.  I think ideally more package developers would open their development process, on Github or some other social coding platform, and then push their well-vetted packages to CRAN.  This would allow for more people to participate, but maintain the great community resource that CRAN provides. 
 

What inspired you to write, “Machine Learning for Hackers”? Who
was your target audience?

A little over a year ago John Myles White (my co-author) and I were having a lot of conversations with other members of the data community in New York City about what a data science curriculum would look like.  During these conversations people would always cite the classic texts: Elements of Statistical Learning, Pattern Recognition and Machine Learning, etc., which are excellent and deep treatments of the foundational theories of machine learning.  From these conversations it occurred to us that there was not a good text on machine learning for people who thought more algorithmically.  That is, there was not a text for “hackers,” people who enjoy learning about computation by opening up black-boxes and getting their hands dirty with code.
It was from this idea that the book, and eventually the title, were born.  We think the audience for the book is anyone who wants a relatively broad introduction to some of the basic tools of machine learning, delivered through code, not math.  That might be someone working at a company who wants to add some of these tools to their belt, or an undergraduate in a computer science or statistics program who relates to the material more easily through this presentation than through the more theoretically heavy texts they’re probably already reading for class.
19
Mar

Interview with Amy Heineike - Director of Mathematics at Quid

Amy Heineike

Amy Heineike is the Director of Mathematics at Quid, a startup that seeks to understand technology development and dissemination through data analysis. She was the first employee at Quid, where she helped develop their technology early on. She has been recognized as one of the top Big Data Scientists. As part of our ongoing interview series, we talked to Amy about data science, Quid, and how statisticians can get involved in the tech scene.

Which term applies to you: data scientist, statistician, computer scientist, or something else?
Data Scientist fits better than any of the others, because it captures the mix of analytics, engineering, and product management that is my current day-to-day.
When I started with Quid I was focused on R&D: developing the first prototypes of what are now our core analytics technologies, and working to define and QA new data streams.  This required the analysis of lots of unstructured data, like news articles and patent filings, as well as the end visualisation and communication of the results.
After we raised VC funding last year, I switched to building out our data science and engineering teams.  These days I jump from conversations with the team about ideas for new analysis, to defining refinements to our data model, to questions about scalable architecture and filling out Pivotal Tracker tickets.  The core challenge is translating the vision for the product back to the team so they can build it.
 
How did you end up at Quid?
In my previous work I’d been building models to improve our understanding of complex human systems: in particular, the complex interaction of cities and their transportation networks, in order to evaluate the economic impacts of Crossrail, a new train line across London, and the implications of social networks for public policy.  Through this work it became clear that data was the biggest constraint; I became fascinated by the quest to find usable data for these questions, and that’s what led me to Silicon Valley.  I knew the founders of Quid from university, and approached them with the idea of analysing their data according to ideas I’d had, especially around network analysis, and the initial work we collaborated on became core to the founding technology of Quid.
Who were really good mentors to you? What were the qualities that helped you? 
I’ve been fortunate to work with some brilliant people in my career so far.  While I still worked in London I worked closely with two behavioural economists: Paul Ormerod, who’s written some fantastic books on the subject (most recently Why Things Fail), and Bridget Rosewell, until recently the Chief Economist to the Greater London Authority (the city government for London).  At Quid I’ve had a very productive collaboration with Sean Gourley, our CTO.
One unifying characteristic of these three is their ability to communicate complex ideas in a powerful way to a broad audience.  It’s an incredibly important skill: a core part of analytics work is taking the results to where they are needed, which is often beyond those who know the technical details, to those who care about the implications first.
 
How does Quid determine relationships between organizations and develop insight based on data? 
The core questions our clients ask us are around how technology is changing and how this impacts their business.  That’s a fascinating and huge question, one that requires not just discovering a document with the answer in it, but organizing lots and lots of pieces of data to paint a picture of the emergent change.  What we can offer is not only the ability to find a snapshot of that, but also the ability to track how it changes over time.
We organize the data firstly through the insight that much disruptive technology emerges in organizations, and that the events that occur between and to organizations are a fantastic way both to signal the traction of technologies and to observe strategic decision making by key actors.
The first kind of relationship that’s important is transactional: who is acquiring, funding, or partnering with whom.  The second is an estimate of the technological clustering of organizations: what trends particular organizations represent.  Both of these can be discovered through documents about them, including government filings, press releases, and news, but this requires analysis of unstructured natural language.
 
We’ve experimented with some very engaging visualisations of the results, and have had particular success with network visualisations, which are a powerful way of letting people interact with a large amount of data quite playfully.  You can see some of our analyses in the press links at http://quid.com/in-the-news.php
What skills do you think are most important for statisticians/data scientists moving into the tech industry?
Technical statistical chops are the foundation. You need to be able to take a dataset and discover and communicate what’s interesting about it for your users.  To turn this into a product requires understanding how to turn one-off analysis into something reliable enough to run day after day, even as the data evolves and grows, and as different users experience different aspects of it.  A key part of that is being willing to engage with questions about where the data comes from (how it can be collected, stored, processed, and QAed on an ongoing basis), how the analytics will be run (how it will be tested, distributed, and scaled), and how people will interact with it (through visualisations, UI features, or static presentations).
For your ideas to become great products, though, you need to become part of a great team!  One of the reasons that such a broad set of skills is associated with Data Science is that there are a lot of pieces that have to come together for it all to work out, and it really takes a team to pull it off.  Generally speaking, the earlier the stage of the company you join, the broader the range of skills you need, and the scrappier you need to be about getting involved in whatever needs to be done.  Later-stage teams and big tech companies may have roles that are closer to pure statistics.
 
Do you have any advice for grad students in statistics/biostatistics on how to get involved in the start-up community or how to find a job at a start-up? 
There is a real opportunity for people who have good statistical and computational skills to get into the startup and tech scenes now.  Many people in Data Science roles have statistics and biostatistics backgrounds, so you shouldn’t find it hard to find kindred spirits.
We’ve always been especially impressed with people who have built software in a group and shared or distributed that software in some way.  Getting involved in an open source project, working with version control in a team, or sharing your code on Github are all good ways to start on this.
It’s really important to be able to show that you want to build products, though.  Imagine the clients or users of the company and see if you get excited about building something that they will use.  Reach out to people in the tech scene, explore who’s posting jobs, and then be able to explain to them what it is you’ve done and why it’s relevant, and be able to think about their business and how you’d want to help contribute towards it.  Many companies offer internships, which can be a good way to contribute for a short period and find out if it’s a good fit for you.

20
Jan

Interview With Joe Blitzstein

Joe Blitzstein
Joe Blitzstein is Professor of the Practice in Statistics at Harvard University and co-director of the graduate program. He moved to Harvard after obtaining his Ph.D. with Persi Diaconis at Stanford University. Since joining the faculty at Harvard, he has been immortalized in Youtube prank videos, been awarded a “favorite professor” distinction four times, and performed interesting research on the statistical analysis of social networks. Joe was also the first person to discover our blog on Twitter. You can find more information about him on his personal website. Or check out his Stat 110 class, now available from iTunes!
Which term applies to you: data scientist/statistician/analyst?

Statistician, but that should and does include working with data! I
think statistics at its best interweaves modeling, inference,
prediction, computing, exploratory data analysis (including
visualization), and mathematical and scientific thinking. I don’t
think “data science” should be a separate field, and I’m concerned
about people working with data without having studied much statistics,
and, conversely, about statisticians who don’t consider it important ever to
look at real data. I enjoyed the discussions by Drew Conway and on
your blog (at http://www.drewconway.com/zia/?p=2378 and
http://simplystatistics.tumblr.com/post/11271228367/datascientist )
and think the relationships between statistics, machine learning, data
science, and analytics need to be clarified.

How did you get into statistics/data science (e.g. your history)?

I always enjoyed math and science, and became a math major as an
undergrad at Caltech partly because I love logic and probability and
partly because I couldn’t decide which science to specialize in. One
of my favorite things about being a math major was that it felt so
connected to everything else: I could often help my friends who were
doing astronomy, biology, economics, etc. with problems, once they had
explained enough so that I could see the essential pattern/structure
of the problem. At the graduate level, there is a tendency for math to
become more and more disconnected from the rest of science, so I was
very happy to discover that statistics let me regain this, and have
the best of both worlds: you can apply statistical thinking and tools
to almost anything, and there are so many opportunities to do things
that are both beautiful and useful.

Who were really good mentors to you? What were the qualities that really
helped you?

I’ve been extremely lucky that I have had so many inspiring
colleagues, teachers, and students (far too numerous to list), so I
will just mention three. My mother, Steffi, taught me at an early age
to love reading and knowledge, and to ask a lot of “what if?”
questions. My PhD advisor, Persi Diaconis, taught me many beautiful
ideas in probability and combinatorics, about the importance of
starting with a simple nontrivial example, and to ask a lot of “who
cares?” questions. My colleague Carl Morris taught me a lot about how
to think inferentially (Brad Efron called Carl a “natural”
statistician in his interview at
http://www-stat.stanford.edu/~ckirby/brad/other/2010Significance.pdf ,
by which I think he meant that valid inferential thinking does not
come naturally to most people), about parametric and hierarchical
modeling, and to ask a lot of “does that assumption make sense in the
real world?” questions.

How do you get students fired up about statistics in your classes?

Statisticians know that their field is both incredibly useful in the
real world and exquisitely beautiful aesthetically. So why isn’t that
always conveyed successfully in courses? Statistics is often
misconstrued as a messy menagerie of formulas and tests, rather than a
coherent approach to scientific reasoning based on a few fundamental
principles. So I emphasize thinking and understanding rather than
memorization, and try to make sure everything is well-motivated and
makes sense both mathematically and intuitively. I talk a lot about
paradoxes and results which at first seem counterintuitive, since
they’re fun to think about and insightful once you figure out what’s
going on.

And I emphasize what I call “stories,” by which I mean an
application/interpretation that does not lose generality. As a simple
example, if X is Binomial(m,p) and Y is Binomial(n,p) independently,
then X+Y is Binomial(m+n,p). A story proof would be to interpret X as
the number of successes in m Bernoulli trials and Y as the number of
successes in n different Bernoulli trials, so X+Y is the number of
successes in the m+n trials. Once you’ve thought of it this way,
you’ll always understand this result and never forget it. A
misconception is that this kind of proof is somehow less rigorous than
an algebraic proof; actually, rigor is determined by the logic of the
argument, not by how many fancy symbols and equations one writes out.
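As a quick complement to the story proof, here is a small simulation sketch (not from the interview; the seed and parameter values are arbitrary) that checks the result empirically in R by comparing the simulated distribution of X + Y against the Binomial(m + n, p) pmf.

    # Empirical check: X ~ Bin(m, p), Y ~ Bin(n, p) independent  =>  X + Y ~ Bin(m + n, p)
    set.seed(110)                      # arbitrary seed
    m <- 5; n <- 8; p <- 0.3           # arbitrary parameter values
    reps <- 1e5

    x <- rbinom(reps, m, p)
    y <- rbinom(reps, n, p)
    s <- x + y

    # simulated frequencies of X + Y versus the Binomial(m + n, p) pmf
    simulated   <- as.numeric(table(factor(s, levels = 0:(m + n)))) / reps
    theoretical <- dbinom(0:(m + n), m + n, p)
    round(cbind(k = 0:(m + n), simulated, theoretical), 4)

The two columns should agree closely, which is exactly what the story proof guarantees.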

My undergraduate probability course, Stat 110, is now worldwide
viewable for free on iTunes U at
http://itunes.apple.com/WebObjects/MZStore.woa/wa/viewPodcast?id=495213607
with 34 lecture videos and about 250 practice problems with solutions.
I hope that will be a useful resource, but in any case looking through
those materials says more about my teaching style than anything I can
write here does.

What are your main research interests these days?

I’m especially interested in the statistics of networks, with
applications to social network analysis and public health. There is
a tremendous amount of interest in networks these days, coming from so
many different fields of study, which is wonderful, but I think there
needs to be much more attention devoted to the statistical issues.
Computationally, most network models are difficult to work with since
the space of all networks is so vast, and so techniques like Markov
chain Monte Carlo and sequential importance sampling become crucial;
but there remains much to do in making these algorithms more efficient
and in figuring out whether one has run them long enough (usually the
answer is “no” to the question of whether one has run them long
enough). Inferentially, I am especially interested in how to make
valid conclusions when, as is typically the case, it is not feasible
to observe the full network. For example, respondent-driven sampling
is a link-tracing scheme being used all over the world these days to
study so-called “hard-to-reach” populations, but much remains to be
done to know how best to analyze such data; I’m working on this with
my student Sergiy Nesterko. With other students and collaborators I’m
working on various other network-related problems. Meanwhile, I’m also
finishing up a graduate probability book with Carl Morris,
“Probability for Statistical Science,” which has quite a few new
proofs and perspectives on the parts of probability theory that are
most useful in statistics.
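Since the answer above mentions Markov chain Monte Carlo over the space of networks, here is a minimal illustrative sketch in R, not Blitzstein's own methodology, of an edge-toggle Metropolis sampler for a toy exponential-family random graph model whose only statistic is the edge count; the number of nodes and the parameter value are arbitrary.

    # Toy model: P(G) proportional to exp(theta * edges(G)), sampled by toggling one dyad at a time.
    set.seed(1)
    n_nodes <- 10       # arbitrary number of nodes
    theta   <- -0.5     # arbitrary parameter (negative favors sparser graphs)
    iters   <- 5000

    A <- matrix(0L, n_nodes, n_nodes)   # symmetric adjacency matrix; start from the empty graph
    edge_count <- integer(iters)

    for (t in seq_len(iters)) {
      ij <- sample(n_nodes, 2)                  # pick a random dyad to toggle
      i <- min(ij); j <- max(ij)
      delta <- if (A[i, j] == 1L) -1L else 1L   # change in the edge count if toggled
      # symmetric proposal, so accept with probability min(1, exp(theta * delta))
      if (log(runif(1)) < theta * delta) {
        A[i, j] <- A[j, i] <- A[i, j] + delta
      }
      edge_count[t] <- sum(A) %/% 2
    }

    # crude convergence diagnostic: trace plot of the edge count
    plot(edge_count, type = "l", xlab = "iteration", ylab = "number of edges")

Real network models are far richer than this toy example, but the sketch shows why run length and convergence diagnostics matter: each step changes the graph one dyad at a time, so the chain moves through the vast space of networks very slowly.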

You have been immortalized in several Youtube videos. Do you think this
helped make your class more “approachable”?

There were a couple of strange and funny pranks that occurred in my first
year at Harvard. I’m used to pranks since Caltech has a long history
and culture of pranks, commemorated in several “Legends of Caltech”
volumes (there’s even a movie in development about this), but pranks
are quite rare at Harvard. I try to make the class approachable
through the lectures and by making sure plenty of support, help, and
encouragement is available from the teaching assistants and
me, not through YouTube, but it’s fun having a few interesting
occasions from the history of the class commemorated there.