A little over a year ago I saw a request from the Howard Hughes Medical Institute for proposals focused on undergraduate teaching. I decided to apply for this grant since it combines all of the things I’m interested in: teaching, education research, biology, and data science. So I put together a proposal, got a couple of colleagues to write me letters of support, and sent it off. I was optimistic about the proposal since our work in scalable education gives us a cool opportunity to reach a large student population, and we have been spending a lot of time thinking about using this platform to create a “science of data science” platform (more about that soon!
As a self-interested podcaster, I’m obviously interested in getting more people to listen to more podcasts. For those who may be interested in getting into podcasts but are wondering where to begin, I thought I’d make a list of what I’m currently listening to (perhaps to be updated as I acquire new ones). In no particular order: Hello Internet - YouTubers Brady Haran and CGP Grey basically just talk, often for an hour to an hour and a half.
I’ve previously written about how businesses (that are built by humans) can initially become successful by optimizing the user experience. That great user experience is what defines the value that their product provides. However, over time, companies have to find a way to provide value that is defined outside of the business-user relationship. The difficulty here is that this new definition of value is not within the control of the company: it is defined by the community via laws.
I dislike violin plots because they look like Christmas ornaments. It’s a pet peeve but there is somewhat of a practical reason as well. To demonstrate I created a dataset called dat that contains an outcome value from 25 different groups. One of the first steps I take when analyzing data is to look at the distribution of my data. If there are groups, I like to stratify and look at the distributions.
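As a rough sketch of that first step, here is one way to build a stand-in for dat and summarize the outcome within each group. The 100 observations per group, the standard normal outcome, and the helper names make_dat and stratified_summary are all assumptions made for illustration; none of them come from the original analysis.

```python
import random
import statistics
from collections import defaultdict


def make_dat(n_groups=25, n_per_group=100, seed=1):
    """Simulate a long-format dataset with one (group, value) row per observation.

    The group size and the normal outcome are illustrative assumptions.
    """
    rng = random.Random(seed)
    rows = []
    for group in range(1, n_groups + 1):
        for _ in range(n_per_group):
            rows.append({"group": group, "value": rng.gauss(0, 1)})
    return rows


def stratified_summary(rows):
    """Return the count and quartiles of the outcome within each group."""
    by_group = defaultdict(list)
    for row in rows:
        by_group[row["group"]].append(row["value"])

    summary = {}
    for group, values in sorted(by_group.items()):
        # quantiles(..., n=4) returns the three quartile cut points
        q1, median, q3 = statistics.quantiles(values, n=4)
        summary[group] = {"n": len(values), "q1": q1, "median": median, "q3": q3}
    return summary


dat = make_dat()
summary = stratified_summary(dat)

# Show the first few groups; every group is available in `summary`
for group in (1, 2, 3):
    s = summary[group]
    print(f"group {group}: n={s['n']} q1={s['q1']:.2f} "
          f"median={s['median']:.2f} q3={s['q3']:.2f}")
```

The per-group quartiles are the same numbers a boxplot would draw, so this is only a numeric stand-in for looking at the stratified distributions, not a replacement for plotting them.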
I was genuinely amazed at this article by George Nott published in Computerworld quoting Peter Norvig on explainable artificial intelligence. From the article: Speaking at an event at UNSW in Sydney on Thursday, Norvig – who at NASA developed software that flew on Deep Space 1 – said: “You can ask a human, but, you know, what cognitive psychologists have discovered is that when you ask a human you’re not really getting at the decision process.
A recent publication (pay-walled) by Boyle et al. introducing the concept of an omnigenic model has generated much discussion. It reminded me of a question I’ve had for a while about the way genetics data is analyzed. Before getting into this, I’ll briefly summarize the general issue. With the completion of the human genome project, human geneticists saw much promise in the possibility of scanning the entire genome for the genes associated with a trait.
I was recently at a National Academy meeting on Envisioning the Data Science Curriculum. It was a fun meeting, and one of the questions that came up was what kind of infrastructure we need to enable shared curricula and compatibility across schools, and to avoid reinventing the wheel. My answer to this question was that we need lecture notes stored in plain text files (like rmarkdown files) and data stored in csv files with direct links.
Like a lot of modern scientists, I now find papers to read to a large extent based on what I see on social media. It is a great way to find out what my colleagues are reading and keep up with the newest cool research. A few weeks ago I released a really simple prototype for an app called papr that would improve on this experience; it’s like Tinder but for papers :).
The other day Brian was at a National Academies meeting and he gave one of his usual classic quotes: Best quote from NAS DS Round Table: "I mean, do we need deep learning to analyze 30 subjects?" - B Caffo @simplystats #datascienceinreallife — CMU Stats (@CMU_Stats) May 1, 2017 When I saw that quote I was reminded of the blog post Don’t use hadoop - your data isn’t that big.
Tidy data at its heart is a set of three rules for organizing a data set: Each variable forms a column. Each observation forms a row. Each type of observational unit forms a table. This is an incredibly useful abstraction for thinking about organizing data sets for analysis. In particular, any data set that can be conveniently rectangled to look like this fun diagram from Jenny Bryan [Figure: Data rectangle]