

Interview at Leanpub

A few weeks ago I sat down with Len Epp over at Leanpub to talk about my recently published book R Programming for Data Science. So far, I've only published one book through Leanpub but I'm a huge fan. They've developed a system that is, in my opinion, perfect for academic publishing. The book's written in Markdown and they compile it into PDF, ePub, and mobi formats automatically.

The full interview transcript is over at the Leanpub blog. If you want to listen to the audio of the interview, you can subscribe to the Leanpub podcast on iTunes.

R Programming for Data Science is available at Leanpub for a suggested price of $15 (but you can get it for free if you want). R code files, datasets, and video lectures are available through the various add-on packages. Thanks to all of you who've already bought a copy!


Johns Hopkins Data Science Specialization Capstone 2 Top Performers

The second capstone session of the Johns Hopkins Data Science Specialization concluded recently. This time, we had 1,040 learners sign up to participate in the session, which again featured a project developed in collaboration with the amazingly innovative folks at SwiftKey.

We've identified the learners listed below as the top performers in this capstone session. This is an incredibly talented group of people who have worked very hard throughout the entire nine-course specialization. Please take some time to read their stories and look at their work.

Ben Apple


Ben Apple is a Data Scientist and Enterprise Architect with the Department of Defense. Mr. Apple holds an MS in Information Assurance and is a PhD candidate in Information Sciences.

Why did you take the JHU Data Science Specialization?

As a self-trained data scientist, I was looking for a program that would formalize my established skills while expanding my data science knowledge and toolbox.

What are you most proud of doing as part of the JHU Data Science Specialization?

The capstone project was the most demanding aspect of the program. As such, I am most proud of the final project. The project stretched each of us beyond the standard coursework of the program and was quite satisfying.

How are you planning on using your Data Science Specialization Certificate?

To open doors so that I may further my research into the operational value of applying data science thought and practice to analytics of my domain.

Final Project:

Project Slide Deck:


Ivan Corneillet


A technologist, thinker, and tinkerer, Ivan facilitates the establishment of start-up companies by advising these companies about the hiring process, product development, and technology development, including big data, cloud computing, and cybersecurity. In his 17-year career, Ivan has held a wide range of engineering and management positions at various Silicon Valley companies. Ivan is a recent Wharton MBA graduate, and he previously earned his master’s degree in computer science from the Ensimag, and his master’s degree in electrical engineering from Université Joseph Fourier, both located in France.

Why did you take the JHU Data Science Specialization?

There are three reasons why I decided to enroll in the JHU Data Science Specialization. First, fresh from college, my formal education was best suited for scaling up the Internet’s infrastructure. However, because every firm in every industry now creates products and services from analyses of data, I challenged myself to learn about Internet-scale datasets. Second, I am a big supporter of MOOCs. I do not believe that MOOCs should replace traditional education; however, I do believe that MOOCs and traditional education will eventually coexist in the same way that open-source and closed-source software do (read my blog post for more information on this topic). Third, the Johns Hopkins University brand certainly motivated me to choose their program. With a great name comes a great curriculum and fantastic professors, right?
Once I had completed the program, I was not disappointed at all. I had read a blog post that explained that the JHU Data Science Specialization was only a start to learning about data science. I certainly agree, but I would add that this program is a great start, because the curriculum emphasizes information that is crucial, while providing additional resources to those who wish to deepen their understanding of data science. My thanks to Professors Caffo, Leek, and Peng; the TAs, and Coursera for building and delivering this track!

What are you most proud of doing as part of the JHU Data Science Specialization?

The capstone project made for a very rich and exhilarating learning experience, and was my favorite course in the specialization. Because I did not have prior knowledge in natural language processing (NLP), I had to conduct a fair amount of research. However, the program’s minimal-guidance approach mimicked a real-world environment, and gave me the opportunity to leverage my experience with developing code and designing products to get the most out of the skillset taught in the track. The result was that I created a data product that implemented state-of-the-art NLP algorithms using what I think are the best technologies (i.e., C++, JavaScript, R, Ruby, and SQL), given the choices that I had made. Bringing everything together is what made me the most proud. Additionally, my product capabilities are a far cry from IBM’s Watson, but while I am well versed in supercomputer hardware, this track helped me to gain a much deeper appreciation of Watson’s AI.

How are you planning on using your Data Science Specialization Certificate?

Thanks to the broad skillset that the specialization covered, I feel confident wearing a data science hat. The concepts and tools covered in this program helped me to better understand the concerns that data scientists have and the challenges they face. From a business standpoint, I am also better equipped to identify the opportunities that lie ahead.

Final Project:

Project Slide Deck:

Oscar de León


Oscar is an assistant researcher at a research institute in a developing country. He graduated as a licentiate in biochemistry and microbiology in 2010 from the same university that hosts the institute. He has always loved technology, programming, and statistics, and has engaged in self-directed learning of these subjects from an early age, finally putting his abilities to use in the health-related research in which he has been involved since 2008. He is now working on the design, execution, and analysis of various research projects, consulting for other researchers and students, and looking forward to developing his academic career in biostatistics.

Why did you take the JHU Data Science Specialization?

I wanted to integrate my R experience into a more comprehensive data analysis workflow, which is exactly what this specialization offers. This was in line with the objectives of my position at the research institute in which I work, so I presented a study plan to my supervisor and she approved it. I also wanted to engage in an activity which enabled me to document my abilities in a verifiable way, and a Coursera Specialization seemed like a good option.

Additionally, I've followed the JHSPH group's courses since the first offering of Mathematical Biostatistics Boot Camp in November 2012. They have proven the standards and quality of education at their institution, and this was not an opportunity to let pass by.

What are you most proud of doing as part of the JHU Data Science Specialization?

I'm not usually one to interact with other students, and I certainly didn't during most of the specialization courses, but I decided to try out the forums for the capstone project. It was wonderful; sharing ideas with, and receiving criticism from, my peers made for a very complete learning experience. In the end, my contributions were appreciated by the community, and the few posts saying so were very rewarding. This rekindled my passion for teaching, and I'll try to engage in it more from now on.

How are you planning on using your Data Science Specialization Certificate?

First, I'll file it with HR at my workplace, since our research projects paid for the specialization :)

I plan to use the certificate as a credential for data analysis with R when it is relevant. For example, I've been interested in offering an R workshop for life sciences students and researchers at my university, and this certificate (and the projects I prepared during the specialization) could help me show that I have a working knowledge of the subject.

Final Project:

Project Slide Deck:

Jeff Hedberg


I am passionate about turning raw data into actionable insights that solve relevant business problems. I also greatly enjoy leading large, multi-functional projects with impact in areas pertaining to machine and/or sensor data.  I have a Mechanical Engineering Degree and an MBA, in addition to a wide range of Data Science (IT/Coding) skills.

Why did you take the JHU Data Science Specialization?

I was looking to gain additional exposure into Data Science as a current practitioner, and thought this would be a great program.

What are you most proud of doing as part of the JHU Data Science Specialization?

I am most proud of completing all courses with distinction (top of peers). I'm also proud to have achieved full points on my capstone project despite having no prior experience in natural language processing.

How are you planning on using your Data Science Specialization Certificate?

I am going to add this to my Resume and LinkedIn Profile.  I will use it to solidify my credibility as a data science practitioner of value.

Final Project:

Project Slide Deck:

Hernán Martínez-Foffani


I was born in Argentina but now I'm settled in Spain. I've been working in computer technology since the eighties, in digital networks, programming, consulting, and project management. Now, as CTO of a software company, I lead a small team of programmers developing a supply chain management app.

Why did you take the JHU Data Science Specialization?

In my opinion, the curriculum is carefully designed, with a nice balance between theory and practice. JHU's authorship and the teachers' widely known prestige ensure the quality of the content. The ability to choose the learning pace, one course per month in my case, means it fits everyone's schedule.

What are you most proud of doing as part of the JHU Data Science Specialization?

The capstone, definitely. It was a fresh and interesting challenge. I sweated a lot, learned much more, and in the end had a lot of fun.

How are you planning on using your Data Science Specialization Certificate?

While for the time being I don't have any specific plan for the certificate, it's a beautiful reward for the effort.

Final Project:

Project Slide Deck:

Francois Schonken



I'm a 36-year-old South African male, born and raised. I recently (4 years ago now) immigrated to lovely Melbourne, Australia. I wrapped up a BSc (Hons) in Computer Science with a specialization in Computer Systems back in 2001. Next, I co-founded a small boutique software development house operating from South Africa. I wrapped up my MBA from Melbourne Business School in 2013, and now I consult for my small boutique software development house and 2 (very) small internet start-ups.

Why did you take the JHU Data Science Specialization?

One of the core subjects in my MBA was Data Analysis, basically an MBA take on undergrad statistics with a focus on application over theory (not that there was any shortage of theory). Waiting in a lobby some 6 months later, I was paging through the financial section of a business-focused weekly. I came across an article explaining how a Melbourne local had applied a language called R to solve a grammatically and statistically challenging problem. The rest, as they say, is history.

What are you most proud of doing as part of the JHU Data Science Specialization?

I'm quite proud of both my Developing Data Products and Capstone projects, but for me these tangible outputs merely served as a vehicle to better understand a different way of thinking about data. I've spent most of my software development life dealing with one form or another of RDBMS (relational database management system). This, in my experience, leads to a very set-oriented way of thinking about data.

I'm most proud of developing a new tool in my "Skills Toolbox" that I consider highly complementary to both my Software and Business outlook on projects.

How are you planning on using your Data Science Specialization Certificate?

Honestly, I had not planned on using my certificate in and of itself. The skills I've acquired have already helped shape my thinking in designing an in-house, web-based consulting collaboration platform.

I do not foresee this being the last time I'll be applying Data Science thinking moving forward on my journey.

Final Project:

Project Slide Deck:

David J. Tagler



David is passionate about solving the world’s most important and challenging problems. His expertise spans chemical/biomedical engineering, regenerative medicine, healthcare technology management, information technology/security, and data science/analysis. David earned his Ph.D. in Chemical Engineering from Northwestern University and B.S. in Chemical Engineering from the University of Notre Dame.

Why did you take the JHU Data Science Specialization?

I enrolled in this specialization in order to advance my statistics, programming, and data analysis skills. I was interested in taking a series of courses that covered the entire data science pipeline. I believe that these skills will be critical for success in the future.

What are you most proud of doing as part of the JHU Data Science Specialization?

I am most proud of the R programming and modeling skills that I developed throughout this specialization. Previously, I had no experience with R. Now, I can effectively manage complex data sets, perform statistical analyses, build prediction models, create publication-quality figures, and deploy web applications.

How are you planning on using your Data Science Specialization Certificate?

I look forward to utilizing these skills in future research projects. Furthermore, I plan to take additional courses in data science, machine learning, and bioinformatics.

Final Project:

Project Slide Deck:

Melissa Tan



I'm a financial journalist from Singapore. I did philosophy and computer science at the University of Chicago, and I'm keen on picking up more machine learning and data viz skills.

Why did you take the JHU Data Science Specialization?

I wanted to keep up with coding, while learning new tools and techniques for wrangling and analyzing data that I could potentially apply to my job. Plus, it sounded fun. :)

What are you most proud of doing as part of the JHU Data Science Specialization?

Building a word prediction app pretty much from scratch (with a truckload of forum reading). The capstone project seemed insurmountable initially and ate up all my weekends, but getting the app to work passably was worth it.

How are you planning on using your Data Science Specialization Certificate?

It'll go on my CV, but I think it's more important to be able to actually do useful things. I'm keeping an eye out for more practical opportunities to apply and sharpen what I've learnt.

Final Project:

Project Slide Deck:

Felicia Yii


Felicia likes to dream, think and do. With over 20 years in the IT industry, her current fascination is the intersection of people, information and decision-making. Ever inquisitive, she has acquired expertise in subjects as diverse as coding, cookery, costume making and cosmetics chemistry. It’s not apparent that there is anything she can’t learn to do, apart from housework. Felicia lives in Wellington, New Zealand with her husband, two children and two cats.

Why did you take the JHU Data Science Specialization?

Well, I love learning and the JHU Data Science Specialization appealed to my thirst for a new challenge. I'm really interested in how we can use data to help people make better decisions.  There's so much data out there these days that it is easy to be overwhelmed by it all. Data visualisation was at the heart of my motivation when starting out. As I got into the nitty gritty of the course, I really began to see the power of making data accessible and appealing to the data-agnostic world. There's so much potential for data science thinking in my professional work.

What are you most proud of doing as part of the JHU Data Science Specialization?

Getting through it for starters while also working and looking after two children. Seriously though, being able to say I know what 'practical machine learning' is all about.  Before I started the course, I had limited knowledge of statistics, let alone knowing how to apply them in a machine learning context.  I was thrilled to be able to use what I learned to test a cool game concept in my final project.

How are you planning on using your Data Science Specialization Certificate?

I want to use what I have learned in as many ways possible. Firstly, I see opportunities to apply my skills to my day-to-day work in information technology. Secondly, I would like to help organisations that don't have the skills or expertise in-house to apply data science thinking to help their decision making and communication. Thirdly, it would be cool one day to have my own company consulting on data science. I've more work to do to get there though!

Final Project:

Project Slide Deck:



I'm a data scientist - mind if I do surgery on your heart?

There has been a lot of recent interest from scientific journals and from other folks in creating checklists for data science and data analysis. The idea is that the checklist will help prevent results that won't reproduce or replicate from the literature. One analogy that I'm frequently hearing is the analogy with checklists for surgeons that can help reduce patient mortality.

The one major difference between checklists for surgeons and the checklists I'm seeing for research purposes is the difference in credentialing between people allowed to perform surgery and people allowed to perform complex data analysis. You would never let me do surgery on you. I have no medical training at all. Yet I'm frequently asked to review papers that include complicated and technical data analyses but list no trained data analysts or statisticians. The most common approach is that a postdoc or graduate student in the group is assigned to do the analysis, even if they don't have much formal training. Whenever this happens, red flags go up all over the place. Just as I wouldn't trust someone without years of training and a medical license to do surgery on me, I wouldn't let someone without years of training and credentials in data analysis make major conclusions from complex data analysis.

You might argue that the consequences for surgery and for complex data analysis are on completely different scales. I'd agree with you, but not in the direction that you might think. I would argue that high-pressure and complex data analysis can have much larger consequences than surgery. In surgery there is usually only one person who can be hurt. But if you do a bad data analysis, say, claiming that vaccines cause autism, that can have massive consequences for hundreds or even thousands of people. So complex data analysis, especially for important results, should be treated with at least as much care as surgery.

The reason why I don't think checklists alone will solve the problem is that they are likely to be used by people without formal training. One obvious (and recent) example that I think makes this really clear is the HealthKit data we are about to start seeing. A ton of people signed up for studies on their iPhones and it has been all over the news. The checklist will (almost certainly) say to have a big sample size. HealthKit studies will certainly pass the checklist, but they are going to get Truman/Deweyed big time if they aren't careful about biased sampling.

If I walked into an operating room and said I'm going to start dabbling in surgery I would be immediately thrown out. But people do that with statistics and data analysis all the time. What they really need is to require careful training and expertise in data analysis on each paper that analyzes data. Until we treat it as a first class component of the scientific process we'll continue to see retractions, falsifications, and irreproducible results flourish.

Interview with Class Central

Recently I sat down with Class Central to do an interview about the Johns Hopkins Data Science Specialization. We talked about the motivation for designing the sequence and the capstone project. With the demand for data science skills greater than ever, the importance of the specialization is only increasing.

See the full interview at the Class Central site. Below is a short excerpt.


Interview with Chris Wiggins, chief data scientist at the New York Times

Editor's note: We are trying something a little new here and doing an interview with Google Hangouts on Air. The interview will be live at 11:30am EST. I have some questions lined up for Chris, but if you have others you'd like to ask, you can tweet them @simplystats and I'll see if I can work them in. After the livestream we'll leave the video on YouTube so you can check out the interview if you can't watch the live stream. I'm embedding the YouTube video here, but if you can't see the live stream when it is running, go check out the event page:


Science is a calling and a career, here is a career planning guide for students and postdocs

Editor’s note: This post was inspired by a really awesome career planning guide that Ben Langmead wrote up for his postdocs which you should go check out right now. You can also find the slightly adapted Leek group career planning guide here.

The most common reason that people go into science is altruistic. They loved dinosaurs and spaceships when they were kids and that never wore off. On some level this is one of the reasons I love this field so much: it is an area where, if you can get past all the hard parts, you can really keep introducing wonder into what you work on every day.

Sometimes I feel like this altruism has negative consequences. For example, I think that there is less emphasis on the career planning and development side in the academic community. I don’t think this is malicious, but I do think that people sometimes treat the career part of science as unseemly. But if you have any job that you want people to pay you to do, then there will be parts of that job that are career oriented. So if you want to be a professional scientist, being brilliant and good at science is not enough. You also need to pay attention to, and carefully plan, your career trajectory.

A colleague of mine, Ben Langmead, created a really nice guide for his postdocs to thinking about and planning the career side of a postdoc which he has over on Github. I thought it was such a good idea that I immediately modified it and asked all of my graduate students and postdocs to fill it out. It is kind of long so there was no penalty if they didn’t finish it, but I think it is an incredibly useful tool for thinking about how to strategize a career in the sciences. I think that the more we are concrete about the career side of graduate school and postdocs, including being honest about all the realistic options available, the better prepared our students will be to succeed on the market.

You can find the Leek Group Guide to Career Planning here and make sure you also go check out Ben’s since it was his idea and his is great.



Is it species or is it batch? They are confounded, so we can't know

In a 2005 OMICS paper, an analysis of human and mouse gene expression microarray measurements from several tissues led the authors to conclude that "any tissue is more similar to any other human tissue examined than to its corresponding mouse tissue". Note that this was a rather surprising result given how similar tissues are between species. For example, both mice and humans see with their eyes, breathe with their lungs, pump blood with their hearts, etc... Two follow-up papers (here and here) demonstrated that platform-specific technical variability was the cause of this apparent dissimilarity. The arrays used for the two species were different, and thus measurement platform and species were completely confounded. In a 2010 paper, we confirmed that once this technical variability was accounted for, the number of genes expressed in common between the same tissue across the two species was much higher than the number expressed in common between two species across different tissues (see Figure 2 here).

So what is confounding and why is it a problem? This topic has been discussed broadly. We wrote a review some time ago. But based on recent discussions I've participated in, it seems that there is still some confusion. Here I explain, aided by some math, how confounding leads to problems in the context of estimating species effects in genomics. We will use

  • Xi to represent the gene expression measurements for human tissue i,
  • aX to represent the level of expression that is specific to humans, and
  • bX to represent the batch effect introduced by the use of the human microarray platform.
  • Therefore Xi = aX + bX + ei, with ei the tissue i effect and other uninteresting sources of variability.

Similarly, we will use:

  • Yi to represent the measurements for mouse tissue i,
  • aY to represent the mouse-specific level, and
  • bY to represent the batch effect introduced by the use of the mouse microarray platform.
  • Therefore Yi = aY + bY + fi, with fi the tissue i effect and other uninteresting sources of variability.

If we are interested in estimating a species effect that is general across tissues, then we are interested in the following quantity:

 aY - aX

Naively, we would think that we can estimate this quantity using the observed differences between the species, which cancel out the tissue effect. We observe a difference for each tissue: Y1 - X1, Y2 - X2, etc... The problem is that aX and bX always appear together, as do aY and bY. We say that the batch effect bX is confounded with the species effect aX. Therefore, on average, the observed differences include both the species and the batch effects. To estimate the difference above we would write a model like this:

Yi - Xi = (aY - aX) + (bY - bX) + other sources of variability

and then estimate the unknown quantities of interest, (aY - aX) and (bY - bX), from the observed data Y1 - X1, Y2 - X2, etc... The problem is that we can estimate the aggregate effect (aY - aX) + (bY - bX), but, mathematically, we can't tease apart the two differences. To see this, note that if we are using least squares, the estimates (aY - aX) = 7, (bY - bX) = 3 will fit the data exactly as well as (aY - aX) = 3, (bY - bX) = 7, since

{(Yi - Xi) - (7 + 3)}^2 = {(Yi - Xi) - (3 + 7)}^2.

In fact, under these circumstances, there are an infinite number of solutions to the standard statistical estimation approaches. A simple analogy is trying to find a unique solution to the equation m + n = 0. If batch and species are not confounded, then we are able to tease the differences apart, just as if we were given a second equation: m + n = 0; m - n = 2. You can learn more about this in this linear models course.
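The non-identifiability above is easy to see in a quick simulation. Here is a minimal sketch (in Python, with made-up numbers: a true species effect of 3 and a true batch effect of 7) showing that any split of the aggregate effect fits the observed differences equally well:

```python
import numpy as np

rng = np.random.default_rng(0)
n_tissues = 5

# Each observed per-tissue difference Yi - Xi equals the aggregate effect
# (aY - aX) + (bY - bX) plus noise. The true split here is species = 3,
# batch = 7, but only their sum (10) is identifiable from the data.
observed_diff = 3.0 + 7.0 + rng.normal(0, 0.1, n_tissues)

def rss(species_effect, batch_effect):
    """Residual sum of squares for a candidate (species, batch) split."""
    fitted = species_effect + batch_effect
    return np.sum((observed_diff - fitted) ** 2)

# Any two splits with the same sum give identical fits:
print(np.isclose(rss(3, 7), rss(7, 3)))    # True
print(np.isclose(rss(10, 0), rss(0, 10)))  # True
```

Least squares (or any standard estimation approach) can only pin down the sum, so every point on the line species + batch = 10 is an equally good solution.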

Note that the above derivation applies to each gene affected by the batch effect. In practice, we commonly see hundreds of genes affected. As a consequence, when we compute distances between two samples from different species, we may see large differences even where there is no species effect. This is because the bY - bX differences for each gene are squared and added up.
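A small simulation illustrates this (a sketch with hypothetical numbers, not the actual microarray data): even when the two species share exactly the same expression profile, per-gene batch shifts accumulate and dominate the between-sample distance.

```python
import numpy as np

rng = np.random.default_rng(1)
n_genes = 1000
noise = 0.1

# A shared expression profile: no true species effect for any gene.
profile = rng.normal(0, 1, n_genes)

# Human sample on the human platform (taken as the baseline).
human = profile + rng.normal(0, noise, n_genes)
# Mouse sample on the mouse platform: a per-gene batch shift bY - bX.
batch_shift = rng.normal(0, 1, n_genes)
mouse = profile + batch_shift + rng.normal(0, noise, n_genes)

# The squared batch shifts add up across genes, inflating the distance
# far beyond what measurement noise alone would produce.
dist_with_batch = np.linalg.norm(human - mouse)
dist_noise_only = np.linalg.norm(human - (profile + rng.normal(0, noise, n_genes)))
print(dist_with_batch > 5 * dist_noise_only)  # True
```

With a thousand affected genes, the batch contribution to the squared distance is roughly the sum of a thousand squared shifts, which is why confounded samples can cluster by platform rather than by biology.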

In summary, if you completely confound your variable of interest, in this case species, with a batch effect, you will not be able to estimate the effect of either. In fact, in a 2010 Nature Genetics Review about batch effects we warned about "cases in which batch effects are confounded with an outcome of interest and result in misleading biological or clinical conclusions". We also warned that none of the existing solutions for batch effects (ComBat, SVA, RUV, etc.) can save you from a situation with perfect confounding. Because we can't always predict what will introduce unwanted variability, we recommend randomization as an experimental design approach.

Almost a decade after the OMICS paper was published, the same surprising conclusion was reached in this PNAS paper: "tissues appear more similar to one another within the same species than to the comparable organs of other species". This time RNAseq was used for both species, and therefore the different-platform issue was not considered*. The authors thus implicitly assumed that (bY - bX) = 0. However, in a recent F1000 Research publication, Gilad and Mizrahi-Man describe an exercise in forensic bioinformatics that led them to discover that the mouse and human samples were run in different lanes or on different instruments. The confounding was near perfect (see Figure 1). As pointed out by these authors, with this experimental design we can't simply accept that (bY - bX) = 0, which implies that we can't estimate a species effect. Gilad and Mizrahi-Man then apply a linear model (ComBat) to account for the batch/species effect and find that samples cluster almost perfectly by tissue. However, they correctly note that, due to the confounding, if there is in fact a species effect, this approach will remove it along with the batch effect. Unfortunately, due to the experimental design, it will be hard or impossible to determine whether it's batch or species. More data and more analyses are needed.

Confounded designs ruin experiments. Current batch effect removal methods will not save you. If you are designing a large genomics experiments, learn about randomization.
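As a concrete sketch of that advice (the sample names and two-lane setup are hypothetical), randomizing the assignment of samples to lanes breaks the link between species and lane, so any lane effect averages out across species instead of masquerading as a species effect:

```python
import numpy as np

rng = np.random.default_rng(2)

# Six samples per species and two sequencing lanes (hypothetical labels).
samples = [f"human_{i}" for i in range(6)] + [f"mouse_{i}" for i in range(6)]

# Confounded design: all human samples in lane 1, all mouse samples in lane 2.
confounded_lanes = [1] * 6 + [2] * 6

# Randomized design: shuffle the lane labels so each lane mixes species.
randomized_lanes = np.array([1] * 6 + [2] * 6)
rng.shuffle(randomized_lanes)

for sample, lane in zip(samples, randomized_lanes):
    print(sample, "-> lane", lane)
```

With the randomized assignment, a model with both species and lane terms is estimable, because the two factors are no longer perfectly collinear.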

 * The fact that RNAseq was used does not necessarily mean there is no platform effect. The species have different genomes, with different sequences, which can lead to different biases during the experimental protocols.

Update: Shin Lin has repeated a small version of the experiment described in the PNAS paper. The new experimental design does not confound lane/instrument with species. The new data confirms their original results pointing to the fact that lane/instrument do not explain the clustering by species. You can see his response in the comments here.


Residual expertise - or why scientists are amateurs at most of science

Editor's note: I have been unsuccessfully attempting to finish a book I started 3 years ago about how and why everyone should get pumped about reading and understanding scientific papers. I've adapted part of one of the chapters into this blog post. It is pretty raw but hopefully gets the idea across.

An episode of The Daily Show with Jon Stewart featured physicist Lisa Randall, an incredible physicist and noted scientific communicator, as the invited guest.

Near the end of the interview, Stewart asked Randall why, with all the scientific progress we have made, that we have been unable to move away from fossil fuel-based engines. The question led to the exchange:

Randall: “So this is part of the problem, because I’m a scientist doesn’t mean I know the answer to that question.”

Stewart: ”Oh is that true? Here’s the thing, here’s what’s part of the answer. You could say anything and I would have no idea what you are talking about.”

Professor Randall is a world-leading physicist, the first woman to achieve tenure in physics at Princeton, Harvard, and MIT, and a member of the National Academy of Sciences. But when it comes to the science of fossil fuels, she is just an amateur. Her response to this question is just perfect: it shows that even brilliant scientists can be merely interested amateurs on topics outside of their expertise. Despite Professor Randall’s over-the-top qualifications, she is an amateur on a whole range of scientific topics from medicine, to computer science, to nuclear engineering. Being an amateur isn’t a bad thing, and recognizing where you are an amateur may be the truest indicator of genius. That doesn’t mean Professor Randall can’t know a little bit about fossil fuels or be curious about why we don’t all have nuclear-powered hovercrafts yet. It just means she isn’t the authority.

Stewart’s response is particularly telling and indicative of what a lot of people think about scientists. It takes years of experience to become an expert in a scientific field - some have suggested as many as 10,000 hours of dedicated time. Professor Randall is a scientist, so she must have more information about any scientific problem than an informed amateur like Jon Stewart. But of course this isn’t true: Jon Stewart (and you) could quickly learn as much about fossil fuels as a scientist, provided the scientist wasn't already an expert in the area. Sure, a background in physics would help, but there are a lot of moving parts in our dependence on fossil fuels, including social, political, and economic problems in addition to the physics involved.

This is an example of "residual expertise" - when people without deep scientific training are willing to attribute expertise to scientists even when the topic lies outside their primary area of focus. It is closely related to the logical fallacy behind the argument from authority:

  1. A is an authority on a particular topic
  2. A says something about that topic
  3. Therefore, A is probably correct

The difference is that with residual expertise you assume that since A is an authority on a particular topic, if they say something about another, potentially related topic, they will probably be correct. This idea is critically important: it is how quacks make their living. The logical leap of faith from "that person is a doctor" to "that person is a doctor, so of course they understand epidemiology, or vaccination, or risk communication" is exactly the leap empowered by the idea of residual expertise. It is also how you can line up scientific experts against any well-established doctrine like evolution or climate change. Experts in the field will know all of the relevant information that supports its key ideas and what it would take to overturn those ideas. But experts outside of the field can be lined up and their residual expertise used to call into question even the most supported ideas.

What does this have to do with you?

Most people aren't experts in the scientific disciplines they care about. But becoming a successful amateur requires a much smaller time commitment than becoming an expert, and it can still be incredibly satisfying, fun, and useful. This book is designed to help you become a fired-up amateur in the science of your choice. Think of it like a hobby, but one where you get to learn about some of the coolest new technologies and ideas coming out in the scientific literature. If you can ignore the way residual expertise makes you feel silly for reading scientific papers you don't fully understand, you can still learn a ton and have a pretty fun time doing it.




The tyranny of the idea in science

There are a lot of analogies between startups and academic science labs. One thing that is definitely very different is the relative value of ideas in the startup world and in the academic world. For example, Paul Graham has said:

Actually, startup ideas are not million dollar ideas, and here's an experiment you can try to prove it: just try to sell one. Nothing evolves faster than markets. The fact that there's no market for startup ideas suggests there's no demand. Which means, in the narrow sense of the word, that startup ideas are worthless.

In academics, almost the opposite is true. There is huge value to being first with an idea, even if you haven't gotten all the details worked out or stable software in place. Here are a couple of extreme examples illustrated with Nobel prizes:

  1. Higgs boson - Peter Higgs postulated the boson in 1964 and won the Nobel Prize in 2013 for that prediction. In between, tons of people did follow-on work: someone convinced Europe to build one of the most expensive pieces of scientific equipment ever constructed, and conservatively thousands of scientists and engineers had to do a ton of work to get the equipment to (a) work and (b) confirm the prediction.
  2. Human genome - Watson and Crick postulated the structure of DNA in 1953 and won the Nobel Prize in Physiology or Medicine in 1962 for this work. But the real value of the human genome was realized when the largest biological collaboration in history sequenced it, along with all of the subsequent work in the genomics revolution.

These are two large-scale examples where the academic scientific community (as represented by the Nobel committee, mostly because it is a concrete example) rewards the original idea and not the hard work it takes to realize that idea. I call this "the tyranny of the idea." I notice a similar issue on a much smaller scale, for example when people don't recognize software as a primary product of science. I feel like these decisions devalue the real work it takes to make any scientific idea a reality. Sure the ideas are good, but it isn't clear that many of them wouldn't eventually be discovered by someone else - yet we surely aren't going to build another Large Hadron Collider. I'd like to see the scales correct back the other way a little bit, so we put at least as much emphasis on the science it takes to follow through on an idea as on discovering it in the first place.


Mendelian randomization inspires a randomized trial design for multiple drugs simultaneously

Joe Pickrell has an interesting new paper out about Mendelian randomization. He discusses some of the interesting issues that come up with these studies and performs a mini-review of previously published studies using the technique.

The basic idea behind Mendelian randomization is the following. In a simple, randomly mating population, Mendel's laws tell us that at any genomic locus (a measured spot in the genome) the allele you inherit (the genetic material you got) is assigned at random. At the chromosome level this is very close to true due to the properties of meiosis (here is an example of how this looks, in very cartoonish form, in yeast). A very famous example of this was an experiment performed by Leonid Kruglyak's group, where they took two strains of yeast and repeatedly mated them, then measured genetic and gene expression data. The experimental design looked like this:



If you look at the allele inherited from the two parental strains (BY, RM) at two separate genes on different chromosomes in each of the 112 segregants (yeast offspring), they do appear to be random and independent:

[Figure: alleles inherited at the two loci across the 112 segregants]



So this is a randomized trial in yeast, where each yeast strain was randomized to many, many genetic "treatments" simultaneously. Now this isn't strictly true, since alleles at genes near each other on the same chromosome aren't inherited independently, and in humans it is definitely not true, since there is population structure, non-random mating, and a host of other issues. But you can still do cool things to try to infer causality from the genetic "treatments" to downstream measurements like gene expression (and even do a reasonable job in the model-organism case).
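A tiny simulation can make the "genetic treatments" intuition concrete. This is a hypothetical sketch, not the real Kruglyak data: each of 112 segregants inherits the BY or RM allele at each of two unlinked loci with probability 1/2, independently, so the four allele combinations each show up in roughly a quarter of the offspring.

```python
import random

random.seed(42)

# Simulate 112 segregants; at each of two loci on different chromosomes,
# each segregant inherits the "BY" or "RM" parental allele at random.
n = 112
locus1 = [random.choice(["BY", "RM"]) for _ in range(n)]
locus2 = [random.choice(["BY", "RM"]) for _ in range(n)]

# Cross-tabulate the two loci: under independent inheritance each of the
# four allele combinations should appear in about a quarter of segregants.
counts = {}
for a, b in zip(locus1, locus2):
    counts[(a, b)] = counts.get((a, b), 0) + 1

for combo, k in sorted(counts.items()):
    print(combo, round(k / n, 2))
```

In a real analysis you would look at the observed genotype table rather than a simulation, but the point is the same: the allele at one locus carries essentially no information about the allele at the other, which is what makes it behave like a randomized treatment assignment.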

In my mind this raises a potentially interesting study design for clinical trials. Suppose that there are 10 known treatments for a disease. We design a study where each patient in the trial is randomized to receive treatment or placebo for each of the 10 treatments, so on average each person would get 5 treatments. Then you could try to tease apart the effects using methods developed for the Mendelian randomization case. Of course, this ignores potential interactions, side effects of taking multiple drugs simultaneously, etc. But I'm seeing lots of interesting proposals for new trial designs (which may or may not work), so I thought I'd contribute my own interesting idea.
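Here is a sketch of why the design could work in principle, with made-up effect sizes and a plain least-squares analysis (ignoring interactions and side effects, as noted above). Each patient is independently randomized to receive or not receive each of 10 treatments, and because the resulting treatment indicators are nearly orthogonal, a single regression can estimate all 10 effects at once.

```python
import numpy as np

rng = np.random.default_rng(0)

n_patients, n_treatments = 2000, 10
# Each patient is randomized to receive (1) or not receive (0) each of
# the 10 treatments independently - like alleles at unlinked loci.
X = rng.integers(0, 2, size=(n_patients, n_treatments)).astype(float)

# Hypothetical true additive effects of the 10 treatments on the outcome.
true_effects = np.array([2.0, -1.0, 0.0, 0.5, 0.0, 1.5, 0.0, -0.5, 0.0, 1.0])
outcome = X @ true_effects + rng.normal(scale=1.0, size=n_patients)

# Because the treatment assignments are (nearly) independent by design,
# ordinary least squares can tease the 10 effects apart jointly.
design = np.column_stack([np.ones(n_patients), X])  # intercept + treatments
coef, *_ = np.linalg.lstsq(design, outcome, rcond=None)
estimated_effects = coef[1:]

print(np.round(estimated_effects, 2))
```

With 2,000 patients the per-treatment standard error is small enough that the estimates land close to the true effects; a real trial would of course need to model interactions and safety constraints rather than assume a purely additive outcome.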