Simply Statistics A statistics blog by Rafa Irizarry, Roger Peng, and Jeff Leek

Not So Standard Deviations Episode 30 - Philately and Numismatology

Hilary and I follow up on open data and data sharing in government. We also discuss artificial intelligence, self-driving cars, and doing your taxes in R.

If you have questions you’d like Hilary and me to answer, you can send them to nssdeviations @ gmail.com or tweet us at @NSSDeviations.

Some things I've found help reduce my stress around science

Being a scientist can be pretty stressful for any number of reasons, from the peer review process, to getting funding, to getting blown up on the internet.

Like a lot of academics, I suffer from a lot of stress related to my own high standards and the imposter syndrome that comes from not meeting them on a regular basis. I was just reading through the excellent material in Lorena Barba’s class on essential skills in reproducibility and came across this set of slides by Philip Stark. The one that caught my attention said:

If I say “just trust me” and I’m wrong, I’m untrustworthy. If I say “here’s my work” and it’s wrong, I’m honest, human, and serving scientific progress.

I love this quote because it shows how being open about both your successes and failures makes it less stressful to be a scientist. Inspired by this quote, I decided to make a list of things that, as I’ve learned through hard experience, do not cure my imposter syndrome but do help me feel less stressed out about my science.

  1. Put everything out in the open. We release all of our software, data, and analysis scripts. This has led to almost exclusively positive interactions with people as they help us figure out good and bad things about our work.
  2. Admit mistakes quickly. Since my code/data are out in the open I’ve had people find little bugs and big “whoa, this is bad” bugs in my code. I used to freak out when that happened. But I found the thing that minimizes my stress is to just quickly admit the error and submit updates/changes/revisions to code and papers as necessary.
  3. Respond to requests for support at my own pace. I try to be as responsive as I can when people email me about software/data/code/papers of mine. I used to stress about doing this right away when I would get the emails. I still try to be prompt, but I don’t let that dominate my attention/time. I also prioritize things that are wrong/problematic and then later handle the requests for free consulting every open source person gets.
  4. Treat rejection as a feature, not a bug. This one is by far the hardest for me, but preprints have helped a ton. The academic system is designed to be critical. That is a good thing; skepticism is one of the key tenets of the scientific process. It took me a while to just plan on one or two rejections for each paper, one or two or more rejections for each grant, etc. But now that I plan on the rejection I find I can just focus on how to steadily move forward and constructively address criticism rather than taking it as a personal blow.
  5. Don’t argue with people on the internet, especially on Twitter. This is a new one for me and one I’m having to practice hard every single day. But I’ve found that I’ve had very few constructive debates on Twitter. I also found that this is almost purely negative energy for me and doesn’t help me accomplish much.
  6. Redefine success. I’ve found that if I recalibrate what success means to include accomplishing tasks like peer reviewing papers, getting letters of recommendation sent at the right times, providing support to people I mentor, and the submission rather than the success of papers/grants then I’m much less stressed out.
  7. Don’t compare myself to other scientists. It is very hard to get good evaluation in science and I’m extra bad at self-evaluation. Scientists are good in many different dimensions and so whenever I pick a one dimensional summary and compare myself to others there are always people who are “better” than me. I find I’m happier when I set internal, short term goals for myself and only compare myself to them.
  8. When comparing, at least pick a metric I’m good at. I’d like to claim I never compare myself to others, but the reality is I do it more than I’d like. I’ve found one way to not stress myself out for my own internal comparisons is to pick metrics I’m good at - even if they aren’t the “right” metrics. That way at least if I’m comparing I’m not hurting my own psyche.
  9. Let myself be bummed sometimes. Some days despite all of that I still get the imposter syndrome feels and can’t get out of the funk. I used to beat myself up about those days, but now I try to just build that into the rhythm of doing work.
  10. Try very hard to be positive in my interactions. This is another hard one, because it is important to be skeptical/critical as a scientist. But I also try very hard to do that in as productive a way as possible. I try to assume other people are doing the right thing and I try very hard to stay positive or neutral when writing blog posts/opinion pieces, etc.
  11. Realize that giving credit doesn’t take away from me. In my research career I have worked with some extremely generous mentors. They taught me to always give credit whenever possible. I also learned from Roger that you can give credit and not lose anything yourself, in fact you almost always gain. Giving credit is low cost but feels really good so is a nice thing to help me feel better.

The last thing I’d say is that having a blog has helped reduce my stress, because sometimes I’m having a hard time getting going on my big project for the day and I can quickly write a blog post and still feel like I got something done…

A non-comprehensive list of awesome things other people did in 2016

Editor’s note: For the last few years I have made a list of awesome things that other people did (2015, 2014, 2013). Like in previous years I’m making a list, again right off the top of my head. If you know of some, you should make your own list or add it to the comments! I have also avoided talking about stuff I worked on or that people here at Hopkins are doing because this post is supposed to be about other people’s awesome stuff. I write this post because a blog often feels like a place to complain, but we started Simply Stats as a place to be pumped up about the stuff people were doing with data.

  • Thomas Lin Pedersen created the tweenr package for interpolating graphs in animations. Check out this awesome logo he made with it.
  • Yihui Xie is still blowing everyone away with everything he does. First it was bookdown and then the yolo feature in the xaringan package.
  • J Alammar built this great visual introduction to neural networks.
  • Jenny Bryan is working literal world wonders with legos to teach functional programming. I loved her Data Rectangling talk. The analogy between exponential families and data frames is so so good.
  • Hadley Wickham’s book on R for data science is everything you’d expect. Super clear, great examples, just a really nice book.
  • David Robinson is a machine put on this earth to create awesome data science stuff. Here he is analyzing Trump’s tweets and here he is on empirical Bayes modeling explained with baseball.
  • Julia Silge and David created the tidytext package. This is a holy moly big contribution to NLP in R. They also have a killer book on tidy text mining.
  • Julia used the package to do this fascinating post on mining Reddit after the election.
  • It would be hard to pick just five different major contributions from JJ Allaire (great interview here), Joe Cheng, and the rest of the RStudio folks. RStudio is absolutely churning out awesome stuff at a rate that is hard to keep up with. I loved R notebooks and have used them extensively for teaching.
  • Konrad Kording and Brett Mensh full on mic dropped on how to write a paper with their 10 simple rules piece. Figure 1 from that paper should be affixed to the office of every student/faculty in the world permanently.
  • Yaniv Erlich just can’t stop himself from doing interesting things like seeq.io and dna.land.
  • Thomaz Berisa and Joe Pickrell set up a freaking Python API for genomics projects.
  • DataCamp continues to do great things. I love their DataChats series and they have been rolling out tons of new courses.
  • Sean Rife and Michele Nuijten created statcheck.io for checking papers for p-value calculation errors. This was all over the press, but I just like the site as a dummy proofing for myself.
  • This was the artificial intelligence tweet of the year
  • I loved seeing PLoS Genetics start a policy of looking for papers in bioRxiv.
  • Matthew Stephens’s post on his preprint getting pre-accepted and reproducibility is also awesome. Preprints are so hot right now!
  • Lorena Barba made this amazing reproducibility syllabus then won the Leamer-Rosenthal prize in open science.
  • Colin Dewey continues to do just stellar stellar work, this time on re-annotating genomics samples. This is one of the key open problems in genomics.
  • I love FlowingData sooooo much. Here is one on the changing American diet.
  • If you like computational biology and data science and like super detailed reports of meetings/talks, then Michael Hoffman is your man. How he actually summarizes that much information in real time is still beyond me.
  • I really really wish I had been at Alyssa Frazee’s talk at startup.ml but loved this review of it. Sampling, inverse probability weighting? Love that stats flavor!
  • I have followed Cathy O’Neil for a long time in her persona as mathbabedotorg so it is no surprise to me that her new book Weapons of Math Destruction is so good. One of the best works on the ethics of data out there.
  • A related and very important piece is on Machine bias in sentencing by Julia Angwin, Jeff Larson, Surya Mattu and Lauren Kirchner at ProPublica.
  • Dimitris Rizopoulos created this stellar integrated Shiny app for his repeated measures class. I wish I could build things half this nice.
  • Daniel Engber’s piece on Who will debunk the debunkers? at FiveThirtyEight just keeps getting more relevant.
  • I rarely am willing to watch a talk posted on the internet, but Amelia McNamara’s talk on seeing nothing was an exception. Plus she talks so fast #jealous.
  • Sherri Rose’s post on economic diversity in the academy focuses on statistics but should be required reading for anyone thinking about diversity. Everything about it is impressive.
  • If you like your data science with a side of Python you should definitely be checking out Jake Vanderplas’s data science handbook and the associated Jupyter notebooks.
  • I love Thomas Lumley being snarky about the stats news. It’s a guilty pleasure. If he ever collected them into a book I’d buy it (hint Thomas :)).
  • Dorothy Bishop’s blog is one of the ones I read super regularly. Her post on When is a replication a replication is just one example of her very clearly explaining a complicated topic in a sensible way. I find that so hard to do and she does it so well.
  • Ben Goldacre’s crowd is doing a bunch of interesting things. I really like their OpenPrescribing project.
  • I’m really excited to see what Elizabeth Rhodes does with the experimental design for the Y Combinator Basic Income Experiment.
  • Lucy D’Agostino McGowan made this amazing explanation of Hill’s criteria using xkcd.
  • It is hard to overstate how good Leslie McClure’s blog is. This post on biostatistics is public health should be read aloud at every SPH in the US.
  • The ASA’s statement on p-values is a really nice summary of all the issues around a surprisingly controversial topic. Ron Wasserstein and Nicole Lazar did a great job putting it together.
  • I really liked this piece on the relationship between income and life expectancy by Raj Chetty and company.
  • Christie Aschwanden continues to be the voice of reason on the statistical crises in science.

That’s all I have for now, I know I’m missing things. Maybe my New Year’s resolution will be to keep better track of the awesome things other people are doing :).

The four eras of data

I’m teaching a class in data science for our masters and PhD students here at Hopkins. I’ve been teaching a variation on this class since 2011 and over time I’ve introduced a number of new components to the class: high-dimensional data methods (2011), data manipulation and cleaning (2012), real, possibly not doable data analyses (2012, 2013), peer reviews (2014), building swirl tutorials for data analysis techniques (2015), and this year building data analytic web apps/R packages.

I’m the least efficient teacher in the world, probably because I’m very self conscious about my teaching. So I always feel like I have to completely re-do my lecture materials every year I teach the class (I know, I know I’m a dummy). This year I was reviewing my notes on high-dimensional data and I was looking at this breakdown of the three eras of statistics from Brad Efron’s book:

  1. The age of Quetelet and his successors, in which huge census-level data sets were brought to bear on simple but important questions: Are there more male than female births? Is the rate of insanity rising?
  2. The classical period of Pearson, Fisher, Neyman, Hotelling, and their successors, intellectual giants who developed a theory of optimal inference capable of wringing every drop of information out of a scientific experiment. The questions dealt with still tended to be simple — Is treatment A better than treatment B? — but the new methods were suited to the kinds of small data sets individual scientists might collect.
  3. The era of scientific mass production, in which new technologies typified by the microarray allow a single team of scientists to produce data sets of a size Quetelet would envy. But now the flood of data is accompanied by a deluge of questions, perhaps thousands of estimates or hypothesis tests that the statistician is charged with answering together; not at all what the classical masters had in mind.

While I think this is a useful breakdown, I realized I think about it in a slightly different way as a statistician. My breakdown goes more like this:

  1. The era of not much data. This is everything prior to about 1995 in my field. The era when we could only collect a few measurements at a time. The whole point of statistics was to try to optimally squeeze information out of a small number of samples - so you see methods like maximum likelihood and minimum variance unbiased estimators being developed.
  2. The era of lots of measurements on a few samples. This one hit hard in biology with the development of the microarray and the ability to measure thousands of genes simultaneously. This is the same statistical problem as in the previous era but with a lot more noise added. Here you see the development of methods for multiple testing and regularized regression to separate signals from piles of noise.
  3. The era of a few measurements on lots of samples. This era is overlapping to some extent with the previous one. Large scale collections of data from EMRs and Medicare are examples where you have a huge number of people (samples) but a relatively modest number of variables measured. Here there is a big focus on statistical methods for knowing how to model different parts of the data with hierarchical models and separating signals of varying strength with model calibration.
  4. The era of all the data on everything. This is an era that currently we as civilians don’t get to participate in. But Facebook, Google, Amazon, the NSA and other organizations have thousands or millions of measurements on hundreds of millions of people. Other than just sheer computing I’m speculating that a lot of the problem is in segmentation (like in era 3) coupled with avoiding crazy overfitting (like in era 2).
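To make the "piles of noise" problem from era 2 a bit more concrete, here is a minimal sketch in Python (my choice for illustration, not something from the class). It assumes 10,000 features with no real signal at all, so the p-values are simply drawn as uniform random numbers, which is how p-values behave under the null for a continuous test statistic. The function names, the seed, and the 5% threshold are all hypothetical choices for the example. It compares a naive p < 0.05 cutoff with the Benjamini-Hochberg step-up procedure, one common multiple testing method.

```python
import random


def benjamini_hochberg(p_values, alpha=0.05):
    """Count hypotheses rejected by the Benjamini-Hochberg step-up procedure."""
    m = len(p_values)
    ranked = sorted(p_values)
    n_rejected = 0
    for k, p in enumerate(ranked, start=1):
        # Reject the k smallest p-values, where k is the largest rank
        # whose p-value falls at or below alpha * k / m.
        if p <= alpha * k / m:
            n_rejected = k
    return n_rejected


def simulate_null_p_values(n_features, seed=0):
    """Draw p-values for features with no signal (uniform on [0, 1))."""
    rng = random.Random(seed)  # hypothetical seed, only for repeatability
    return [rng.random() for _ in range(n_features)]


p_values = simulate_null_p_values(10_000)

# Naive approach: call anything with p < 0.05 a discovery.
naive_hits = sum(p < 0.05 for p in p_values)

# Multiple testing approach: control the false discovery rate at 5%.
bh_hits = benjamini_hochberg(p_values, alpha=0.05)

print(f"naive p < 0.05: {naive_hits} 'discoveries' from pure noise")
print(f"Benjamini-Hochberg at 5% FDR: {bh_hits} discoveries")
```

With pure noise the naive cutoff should flag roughly 5% of the features, while the corrected procedure can never flag more than that and will usually flag far fewer.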

I’ve focused here on the implications of these eras from a statistical modeling perspective, but as we discussed in my class, era 4 coupled with advances in machine learning methods means that there are social, economic, and behavioral implications of these eras as well.

Not So Standard Deviations Episode 28 - Writing is a lot Harder than Just Talking

Hilary and I talk about building data science products that provide a good user experience while adhering to some kind of ground truth, whether it’s in medicine, education, news, or elsewhere. Also Gilmore Girls.

If you have questions you’d like Hilary and me to answer, you can send them to nssdeviations @ gmail.com or tweet us at @NSSDeviations.
