I recently coded up a simple package that implements a file-based queue abstract data type. This package was needed for a different package that I’m working on involving parallel processing (more on that in the near future). Actually, this other package is a project that I started almost nine years ago but was never able to get off the ground. I tried to implement a queue interface in the filehash package but it never served the purpose that I needed.
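The post doesn’t show the implementation, and the actual package is written in R, but the core idea of a file-based queue is simple enough to sketch. Here is a purely illustrative Python version (all names here are my own, not from the package): each element lives in its own file, named by an increasing counter, so the queue survives across processes and restarts.

```python
import os
import tempfile


class FileQueue:
    """A minimal file-based FIFO queue: each element is stored as its
    own file inside a directory, named by an increasing integer key."""

    def __init__(self, directory):
        self.directory = directory
        os.makedirs(directory, exist_ok=True)

    def _keys(self):
        # Sort numerically so insertion order is preserved.
        return sorted(int(name) for name in os.listdir(self.directory))

    def enqueue(self, value):
        keys = self._keys()
        next_key = (keys[-1] + 1) if keys else 0
        with open(os.path.join(self.directory, str(next_key)), "w") as f:
            f.write(value)

    def dequeue(self):
        keys = self._keys()
        if not keys:
            raise IndexError("queue is empty")
        path = os.path.join(self.directory, str(keys[0]))
        with open(path) as f:
            value = f.read()
        os.remove(path)  # deleting the file pops the element
        return value
```

Because the state is just files on disk, separate worker processes can share one queue directory, which is presumably what makes this shape useful for parallel processing.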
During preparation for class I sometimes think up animations that will explain the concept I am teaching. I sometimes share the resulting animations on social media via @rafalab. John Storey recently asked if the source code is publicly available. Because I am not that organized, and these ideas come about during last-minute preparations, the code was spread across several unrelated files. John’s request motivated me to include the code in one post.
A few years ago I helped write a paper where we proposed scraping p-values from the medical literature to try to estimate the science-wise false discovery rate. The paper generated a ton of interesting discussion and inspired other groups to start collecting p-values from the literature. As I’ve mentioned before, the p-value is the most popular statistic ever invented, so there are a lot of published p-values out there. The tidypvals package is an effort to find previous collections of published p-values, synthesize them, and tidy them into one analyzable data set.
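tidypvals itself is an R package, and the original scraping surely involved more care than this, but as an illustrative sketch, pulling reported p-values out of text usually comes down to a regular expression over the common reporting styles (the example text below is made up, not from any real abstract):

```python
import re

# Hypothetical snippet of results text; not from any real paper.
text = ("Treatment reduced mortality (p = 0.03). "
        "No effect was seen on readmission (P=0.47) "
        "or length of stay (p < 0.001).")

# Match "p" or "P", optional spaces, "=" or "<", and a decimal number.
pval_re = re.compile(r"[Pp]\s*([=<])\s*(0?\.\d+)")

# Keep the comparison operator too: "p < 0.001" is censored, not exact,
# which matters when aggregating collections of published p-values.
pvals = [(op, float(v)) for op, v in pval_re.findall(text)]
```

Real reporting is messier (scientific notation, "n.s.", thresholds like "p < .05"), which is part of why synthesizing existing collections into one tidy data set is a project in itself.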
A little over a year ago I saw a request from the Howard Hughes Medical Institute for proposals focused on undergraduate teaching. I decided to apply for this grant since it combines all of the things I’m interested in: teaching, education research, biology, and data science. So I put together a proposal, got a couple of colleagues to write me letters of support, and sent it off. I was optimistic about the proposal since we have a cool opportunity through our work in scalable education to reach a large student population, and we have been spending a lot of time thinking about using this platform to create a “science of data science” platform (more about that soon!)
As a self-interested podcaster, I’m obviously interested in getting more people to listen to more podcasts. For those who may be interested in getting into podcasts but are wondering where to begin, I thought I’d make a list of what I’m currently listening to (perhaps to be updated as I acquire new ones). In no particular order: Hello Internet - YouTubers Brady Haran and CGP Grey basically just talk, often for an hour to an hour and a half.
I’ve previously written about how businesses (which are built by humans) can initially become successful by optimizing the user experience. That great user experience is what defines the value that their product provides. However, over time, companies have to find a way to provide value that is defined outside of the business-user relationship. The difficulty here is that this new definition of value is not within the control of the company: it is defined by the community via laws.
I dislike violin plots because they look like Christmas ornaments. It’s a pet peeve, but there is somewhat of a practical reason as well. To demonstrate, I created a dataset called dat that contains an outcome value from 25 different groups. One of the first steps I take when analyzing data is to look at the distribution of my data. If there are groups, I like to stratify and look at the distributions.
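The dat object from the post isn’t reproduced here, but the stratify-then-summarize step is easy to illustrate. This Python sketch simulates an outcome across 25 groups (loosely in the spirit of dat, with made-up parameters) and computes, per group, the quartiles that a boxplot would display:

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulate an outcome measured in 25 groups; the group means and
# sizes here are arbitrary, just to have something to stratify.
groups = {f"group_{i}": rng.normal(loc=i % 5, scale=1.0, size=100)
          for i in range(25)}

# Stratify: summarize the distribution within each group.
# These per-group quartiles are the core of what a boxplot shows.
summaries = {name: np.percentile(values, [25, 50, 75])
             for name, values in groups.items()}
```

With matplotlib, a single call like `plt.boxplot(list(groups.values()))` would then draw one box per group, which conveys the same stratified information a violin plot would.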
I was genuinely amazed at this article by George Nott published in Computerworld quoting Peter Norvig on explainable artificial intelligence. From the article: Speaking at an event at UNSW in Sydney on Thursday, Norvig – who at NASA developed software that flew on Deep Space 1 – said: “You can ask a human, but, you know, what cognitive psychologists have discovered is that when you ask a human you’re not really getting at the decision process.
A recent publication (pay-walled) by Boyle et al. introducing the concept of an omnigenic model has generated much discussion. It reminded me of a question I’ve had for a while about the way genetics data is analyzed. Before getting into this, I’ll briefly summarize the general issue. With the completion of the human genome project, human geneticists saw much promise in the possibility of scanning the entire genome for the genes associated with a trait.
I was recently at a National Academy meeting on Envisioning the Data Science Curriculum. It was a fun meeting, and one of the questions that came up was what kind of infrastructure we need to enable shared curricula, ensure compatibility across schools, and avoid reinventing the wheel. My answer to this question was that we need lecture notes stored in plain text files (like rmarkdown files) and data stored in csv files with direct links.