Simply Statistics A statistics blog by Rafa Irizarry, Roger Peng, and Jeff Leek

Got a data app idea? Apply to get it prototyped by the JHU DSL!

Get your app built

Last fall we ran the first iteration of a class at the Johns Hopkins Data Science Lab where we teach students to build data web apps using Shiny, R, Google Sheets, and a number of other technologies. Our goals were to teach students to build data products, to reduce friction for students who want to build things with data, and to help people solve important data problems with web and SMS apps.

We are going to be running a second iteration of our program from March to June this year. We are looking for awesome projects for students to build that solve real-world problems. We are particularly interested in projects that could have a positive impact on health but are open to any cool idea. We generally build apps that are useful for:

  • Data donation - if you have a group of people you would like to donate data to your project.
  • Data collection - if you would like to build an app for collecting data from people.
  • Data visualization - if you have a data set and would like to have a web app for interacting with the data.
  • Data interaction - if you have a statistical or machine learning model and you would like a web interface for it.

But we are interested in any consumer-facing data product that you might be interested in having built. We want you to submit your wildest, most interesting ideas and we’ll see if we can get them built for you.

We are hoping to solicit a large number of projects and then build as many as possible. The best part is that we will build the prototype for you for free! If you have an idea of something you’d like built please submit it to this Google form.

Students in the class will select projects they are interested in during early March. We will let you know if your idea was selected for the program by mid-March. If you aren’t selected you will have the opportunity to roll your submission over to our next round of prototyping.

I’ll be writing a separate post targeted at students, but if you are interested in being a data app prototyper, sign up here.

Interview with Al Sommer - Effort Report Episode 23

My colleague Elizabeth Matsui and I had a great opportunity to talk with Al Sommer on the latest episode of our podcast The Effort Report. Al is the former Dean of the Johns Hopkins Bloomberg School of Public Health and is Professor of Epidemiology and International Health at the School. He is (among other things) world renowned for his pioneering research in vitamin A deficiency and mortality in children.

Al had some good bits of advice for academics on being successful in academia.

What you are excited about and interested in at the moment, you’re much more likely to be successful at - because you’re excited about it! So you’re going to get up at 2 in the morning and think about it, you’re going to be putting things together in ways that nobody else has put things together. And guess what? When you do that you’re more successful [and] you actually end up getting academic promotions.

On the slow rate of progress:

It took ten years, after we had seven randomized trials already to show that you get this 1/3 reduction in child mortality by giving them two cents worth of vitamin A twice a year. It took ten years to convince the child survival Nawabs of the world, and there are still some that don’t believe it.

On working overseas:

It used to be true [that] it’s a lot easier to work overseas than it is to work here because the experts come from somewhere else. You’re never an expert in your own home.

You can listen to the entire episode here:

Not So Standard Deviations Episode 30 - Philately and Numismatology

Hilary and I follow up on open data and data sharing in government. We also discuss artificial intelligence, self-driving cars, and doing your taxes in R.

If you have questions you’d like Hilary and me to answer, you can send them to nssdeviations @ gmail.com or tweet us at @NSSDeviations.

Some things I've found help reduce my stress around science

Being a scientist can be pretty stressful for any number of reasons, from the peer review process, to getting funding, to getting blown up on the internet.

Like a lot of academics I suffer from a lot of stress related to my own high standards and the imposter syndrome that comes from not meeting them on a regular basis. I was just reading through the excellent material in Lorena Barba’s class on essential skills in reproducibility and came across this set of slides by Philip Stark. The one that caught my attention said:

If I say just trust me and I’m wrong, I’m untrustworthy. If I say here’s my work and it’s wrong, I’m honest, human, and serving scientific progress.

I love this quote because it shows how being open about both your successes and failures makes it less stressful to be a scientist. Inspired by this quote I decided to make a list of things that I’ve learned through hard experience do not help me with my own imposter syndrome and do help me to feel less stressed out about my science.

  1. Put everything out in the open. We release all of our software, data, and analysis scripts. This has led to almost exclusively positive interactions with people as they help us figure out good and bad things about our work.
  2. Admit mistakes quickly. Since my code/data are out in the open I’ve had people find little bugs and big “whoa, this is bad” bugs in my code. I used to freak out when that happened. But I found the thing that minimizes my stress is to just quickly admit the error and submit updates/changes/revisions to code and papers as necessary.
  3. Respond to requests for support at my own pace. I try to be as responsive as I can when people email me about software/data/code/papers of mine. I used to stress about doing this right away when I would get the emails. I still try to be prompt, but I don’t let that dominate my attention/time. I also prioritize things that are wrong/problematic and then later handle the requests for free consulting every open source person gets.
  4. Treat rejection as a feature, not a bug. This one is by far the hardest for me but preprints have helped a ton. The academic system is designed to be critical. That is a good thing; skepticism is one of the key tenets of the scientific process. It took me a while to just plan on one or two rejections for each paper, one or two or more rejections for each grant, etc. But now that I plan on the rejection I find I can just focus on how to steadily move forward and constructively address criticism rather than taking it as a personal blow.
  5. Don’t argue with people on the internet, especially on Twitter. This is a new one for me and one I’m having to practice hard every single day. But I’ve found that I’ve had very few constructive debates on Twitter. I also found that this is almost purely negative energy for me and doesn’t help me accomplish much.
  6. Redefine success. I’ve found that if I recalibrate what success means to include accomplishing tasks like peer reviewing papers, getting letters of recommendation sent at the right times, providing support to people I mentor, and the submission rather than the success of papers/grants then I’m much less stressed out.
  7. Don’t compare myself to other scientists. It is very hard to get good evaluation in science and I’m extra bad at self-evaluation. Scientists are good in many different dimensions and so whenever I pick a one dimensional summary and compare myself to others there are always people who are “better” than me. I find I’m happier when I set internal, short term goals for myself and only compare myself to them.
  8. When comparing, at least pick a metric I’m good at. I’d like to claim I never compare myself to others, but the reality is I do it more than I’d like. I’ve found one way to not stress myself out for my own internal comparisons is to pick metrics I’m good at - even if they aren’t the “right” metrics. That way at least if I’m comparing I’m not hurting my own psyche.
  9. Let myself be bummed sometimes. Some days despite all of that I still get the imposter syndrome feels and can’t get out of the funk. I used to beat myself up about those days, but now I try to just build that into the rhythm of doing work.
  10. Try very hard to be positive in my interactions. This is another hard one, because it is important to be skeptical/critical as a scientist. But I also try very hard to do that in as productive a way as possible. I try to assume other people are doing the right thing and I try very hard to stay positive or neutral when writing blog posts/opinion pieces, etc.
  11. Realize that giving credit doesn’t take away from me. In my research career I have worked with some extremely generous mentors. They taught me to always give credit whenever possible. I also learned from Roger that you can give credit and not lose anything yourself, in fact you almost always gain. Giving credit is low cost but feels really good so is a nice thing to help me feel better.

The last thing I’d say is that having a blog has helped reduce my stress, because sometimes I’m having a hard time getting going on my big project for the day and I can quickly write a blog post and still feel like I got something done…

A non-comprehensive list of awesome things other people did in 2016

Editor’s note: For the last few years I have made a list of awesome things that other people did (2015, 2014, 2013). Like in previous years I’m making a list, again right off the top of my head. If you know of some, you should make your own list or add it to the comments! I have also avoided talking about stuff I worked on or that people here at Hopkins are doing because this post is supposed to be about other people’s awesome stuff. I write this post because a blog often feels like a place to complain, but we started Simply Stats as a place to be pumped up about the stuff people were doing with data.

  • Thomas Lin Pedersen created the tweenr package for interpolating graphs in animations. Check out this awesome logo he made with it.
  • Yihui Xie is still blowing me away with everything he does. First it was bookdown and then the yolo feature in the xaringan package.
  • J Alammar built this great visual introduction to neural networks
  • Jenny Bryan is working literal world wonders with legos to teach functional programming. I loved her Data Rectangling talk. The analogy between exponential families and data frames is so so good.
  • Hadley Wickham’s book on R for data science is everything you’d expect. Super clear, great examples, just a really nice book.
  • David Robinson is a machine put on this earth to create awesome data science stuff. Here he is analyzing Trump’s tweets and here he is on empirical Bayes modeling explained with baseball.
  • Julia Silge and David created the tidytext package. This is a holy moly big contribution to NLP in R. They also have a killer book on tidy text mining.
  • Julia used the package to do this fascinating post on mining Reddit after the election.
  • It would be hard to pick just five different major contributions from JJ Allaire (great interview here), Joe Cheng, and the rest of the RStudio folks. RStudio is absolutely churning out awesome stuff at a rate that is hard to keep up with. I loved R notebooks and have used them extensively for teaching.
  • Konrad Kording and Brett Mensh full on mic dropped on how to write a paper with their 10 simple rules piece. Figure 1 from that paper should be affixed to the office of every student/faculty in the world permanently.
  • Yaniv Erlich just can’t stop himself from doing interesting things like seeq.io and dna.land.
  • Thomaz Berisa and Joe Pickrell set up a freaking Python API for genomics projects.
  • DataCamp continues to do great things. I love their DataChats series and they have been rolling out tons of new courses.
  • Sean Rife and Michele Nuijten created statcheck.io for checking papers for p-value calculation errors. This was all over the press, but I just like the site as a dummy proofing for myself.
  • This was the artificial intelligence tweet of the year
  • I loved seeing PLoS Genetics start a policy of looking for papers in bioRxiv.
  • Matthew Stephens’ post on his preprint getting pre-accepted and reproducibility is also awesome. Preprints are so hot right now!
  • Lorena Barba made this amazing reproducibility syllabus then won the Leamer-Rosenthal prize in open science.
  • Colin Dewey continues to do just stellar stellar work, this time on re-annotating genomics samples. This is one of the key open problems in genomics.
  • I love FlowingData sooooo much. Here is one on the changing American diet.
  • If you like computational biology and data science and like super detailed reports of meetings/talks then Michael Hoffman is your man. How he actually summarizes that much information in real time is still beyond me.
  • I really really wish I had been at Alyssa Frazee’s talk at startup.ml but loved this review of it. Sampling, inverse probability weighting? Love that stats flavor!
  • I have followed Cathy O’Neil for a long time in her persona as mathbabedotorg so it is no surprise to me that her new book Weapons of Math Destruction is so good. One of the best works on the ethics of data out there.
  • A related and very important piece is on Machine bias in sentencing by Julia Angwin, Jeff Larson, Surya Mattu and Lauren Kirchner at ProPublica.
  • Dimitris Rizopoulos created this stellar integrated Shiny app for his repeated measures class. I wish I could build things half this nice.
  • Daniel Engber’s piece on Who will debunk the debunkers? at fivethirtyeight just keeps getting more relevant.
  • I rarely am willing to watch a talk posted on the internet, but Amelia McNamara’s talk on seeing nothing was an exception. Plus she talks so fast #jealous.
  • Sherri Rose’s post on economic diversity in the academy focuses on statistics but should be required reading for anyone thinking about diversity. Everything about it is impressive.
  • If you like your data science with a side of Python you should definitely be checking out Jake Vanderplas’s data science handbook and the associated Jupyter notebooks.
  • I love Thomas Lumley being snarky about the stats news. It’s a guilty pleasure. If he ever collected them into a book I’d buy it (hint Thomas :)).
  • Dorothy Bishop’s blog is one of the ones I read super regularly. Her post on When is a replication a replication is just one example of her very clearly explaining a complicated topic in a sensible way. I find that so hard to do and she does it so well.
  • Ben Goldacre’s crowd is doing a bunch of interesting things. I really like their OpenPrescribing project.
  • I’m really excited to see what Elizabeth Rhodes does with the experimental design for the Y Combinator Basic Income Experiment.
  • Lucy D’Agostino McGowan made this amazing explanation of Hill’s criteria using xkcd.
  • It is hard to overstate how good Leslie McClure’s blog is. This post on biostatistics is public health should be read aloud at every SPH in the US.
  • The ASA’s statement on p-values is a really nice summary of all the issues around a surprisingly controversial topic. Ron Wasserstein and Nicole Lazar did a great job putting it together.
  • I really liked this piece on the relationship between income and life expectancy by Raj Chetty and company.
  • Christie Aschwanden continues to be the voice of reason on the statistical crises in science.

That’s all I have for now, I know I’m missing things. Maybe my New Year’s resolution will be to keep better track of the awesome things other people are doing :).