Simply Statistics 2017-02-20T16:35:09+00:00 http://simplystats.github.io My Podcasting Setup 2017-02-20T00:00:00+00:00 http://simplystats.github.io/2017/02/20/podcasting-setup <p>I’ve gotten a number of inquiries over the last 2 years about my podcasting setup and I’ve been meaning to write about it but….</p> <p>But here it is! I wanted to write this because I felt like there wasn’t a ton of good information about this on the Internet for more “casual” podcasters, as opposed to people who want to do it professionally. So here’s what I’ve got.</p> <p>Roughly, there are two types of podcasts: the kind you record with everyone in the same room and the kind you record where everyone is in different rooms. The two require slightly different setups so I’ll talk about both. For me, Elizabeth Matsui and I record <a href="http://effortreport.libsyn.com">The Effort Report</a> locally because we’re both at Johns Hopkins. But Hilary Parker and I record <a href="https://soundcloud.com/nssd-podcast">Not So Standard Deviations</a> remotely because she’s on the other side of the country most of the time.</p> <h2 id="recording-equipment">Recording Equipment</h2> <p>When Hilary and I first started we just used the microphone attached to the headphones you get with your iPhone or whatever. That’s okay but the sound feels very “narrow” to me. That said, it’s a good way to get started and it likely costs you nothing.</p> <p>The next level up for many people is the <a href="https://www.amazon.com/Blue-Yeti-USB-Microphone-Silver/dp/B002VA464S/">Blue Yeti USB Microphone</a> which is a perfectly fine microphone and not too expensive. Also, it uses USB (as opposed to more professional XLR) so it connects to any computer, which is nice. However, it typically retails for $120, which isn’t nothing, and there are probably cheaper microphones that are just as good. 
For example, Jason Snell recommends the <a href="https://www.amazon.com/Audio-Technica-ATR2100-USB-Cardioid-Dynamic-Microphone/dp/B004QJOZS4/ref=as_li_ss_tl?ie=UTF8&amp;qid=1479488629&amp;sr=8-2&amp;keywords=audio-technica+atr&amp;linkCode=sl1&amp;tag=incomparablepod-20&amp;linkId=0919132824ac2090de45f2b1135b0163">Audio Technica ATR2100</a> which is only about $70.</p> <p>If you’re willing to shell out a little more money, I’d highly recommend the <a href="https://www.zoom-na.com/products/field-video-recording/field-recording/zoom-h4n-handy-recorder">Zoom H4n</a> portable recorder. This is actually two things: a microphone <em>and</em> a recorder. It has a nice stereo microphone built into the top along with two XLR inputs on the bottom that allow you to record from external mics. It records to SD cards so it’s great for a portable setup where you don’t want to carry a computer around with you. It retails for about $200 so it’s <em>not</em> cheap, but in my opinion it is worth every penny. I’ve been using my H4n for years now.</p> <p>Because we do a lot of recording for our online courses here, we’ve actually got a bit more equipment in the office. So for in-person podcasts I sometimes record using a <a href="https://en-us.sennheiser.com/short-gun-tube-microphone-camera-films-mkh-416-p48u3">Sennheiser MKH416-P48US</a> attached to an <a href="https://www.amazon.com/gp/product/B00D4AGIBS/">Auray MS-5230T microphone stand</a> which is decidedly not cheap but is a great piece of hardware.</p> <p>By the way, a microphone stand is great to have, if you can get one, so you don’t have to set the microphone on your desk or table. That way if you bump the table by accident or generally like to bang the table, it won’t get picked up on the microphone. 
It’s not something to get right away, but maybe later when you make the big time.</p> <h2 id="recording-software">Recording Software</h2> <p>If you’re recording by yourself, you can just hook up your microphone to your computer and record to any old software that records sound (on the Mac you can use Quicktime). If you have multiple people, you can either</p> <ol> <li>Speak into the same mic and have both your voices recorded on the same audio file</li> <li>Use separate mics (and separate computers) and record separately onto separate audio files. This requires syncing the audio files in an editor, but that’s not too big a deal if you only have 2-3 people.</li> </ol> <p>For local podcasts, I actually just use the H4n and record directly to the SD card. This creates separate WAV files for each microphone that are already synced so you can just plop them in the editor.</p> <p>For remote podcasts, you’ll need some communication software. Hilary and I use <a href="https://zencastr.com">Zencastr</a> which has its own VoIP system that allows you to talk to anyone by just sending your guests a link. So I create a session in Zencastr, send Hilary the link for the session, she logs in (without needing any credentials) and we just start talking. The web site records the audio directly off of your microphone and then uploads the audio files (one for each guest) to Dropbox. The service is really nice and there are now a few just like it. Zencastr costs $20 a month right now but there is a limited free tier.</p> <p>The other approach is to use something like Skype and then use something like <a href="http://www.ecamm.com/mac/callrecorder/">ecamm call-recorder</a> to record the conversation. The downside with this approach is that if you have any network trouble that messes up the audio, then you will also record that. However, Zencastr (and related services) do not work on iOS devices and other devices that use WebKit based browsers. 
So if you have someone calling in on a mobile device via Skype or something, then you’ll have to use this approach. Otherwise, I prefer the Zencastr approach and can’t really see any downside except for the cost.</p> <h2 id="editing-software">Editing Software</h2> <p>There isn’t a lot of software that’s specifically designed for editing podcasts. I actually started off editing podcasts in Final Cut Pro X (nonlinear video editor) because that’s what I was familiar with. But now I use <a href="http://www.apple.com/logic-pro/">Logic Pro X</a>, which is not really designed for podcasts, but it’s a real digital audio workstation and has nice features (like <a href="https://support.apple.com/kb/PH13055?locale=en_US">strip silence</a>). But I think something like <a href="http://www.audacityteam.org">Audacity</a> would be fine for basic editing.</p> <p>The main thing I need to do with editing is merge the different audio tracks together and cut off any extraneous material at the beginning or the end. I don’t usually do a lot of editing in the middle unless there’s a major mishap like a siren goes by or a cat jumps on the computer. Once the editing is done I bounce to an AAC or MP3 file for uploading.</p> <h2 id="hosting">Hosting</h2> <p>You’ll need a service for hosting your audio files if you don’t have your own server. You can technically host your audio files anywhere, but specific services have niceties like auto-generating the RSS feed. For Not So Standard Deviations I use <a href="https://soundcloud.com/stream">SoundCloud</a> and for The Effort Report I use <a href="https://www.libsyn.com">Libsyn</a>.</p> <p>Of the two services, I think I prefer Libsyn, because it’s specifically designed for podcasting and has somewhat better analytics. The web site feels a little bit like it was designed in 2003, but otherwise it works great. Libsyn also has features for things like advertising and subscriptions, but I don’t use any of those. 
SoundCloud is fine but wasn’t really designed for podcasting and sometimes feels a little unnatural.</p> <h2 id="summary">Summary</h2> <p>If you’re interested in getting started in podcasting, here’s my bottom line:</p> <ol> <li>Get a partner. It’s more fun that way!</li> <li>If you and your partner are remote, use Zencastr or something similar.</li> <li>Splurge for the Zoom H4n if you can, otherwise get a reasonable cheap microphone like the Audio Technica or the Yeti.</li> <li>Don’t focus too much on editing. Just clip off the beginning and the end.</li> <li>Host on Libsyn.</li> </ol> Data Scientists Clashing at Hedge Funds 2017-02-15T00:00:00+00:00 http://simplystats.github.io/2017/02/15/Data-Scientists-Clashing-at-Hedge-Funds <p>There’s an interesting article over at Bloomberg about how <a href="https://www.bloomberg.com/news/articles/2017-02-15/point72-shows-how-firms-face-culture-clash-on-road-to-quantland">data scientists have struggled at some hedge funds</a>:</p> <blockquote> <p>The firms have been loading up on data scientists and coders to deliver on the promise of quantitative investing and lift their ho-hum returns. But they are discovering that the marriage of old-school managers and data-driven quants can be rocky. Managers who have relied on gut calls resist ceding control to scientists and their trading signals. And quants, emboldened by the success of computer-driven funds like Renaissance Technologies, bristle at their second-class status and vie for a bigger voice in investing.</p> </blockquote> <p>There are some interesting tidbits in the article that I think hold lessons for any collaboration between a data scientist or analyst and a non-data scientist (for lack of a better word).</p> <p>On the problems at the quant unit (known as Aperio) of Point72, the family office successor to SAC Capital:</p> <blockquote> <p>The divide between Aperio quants and fundamental money managers was also intellectual. 
They struggled to communicate about the basics, like how big data could inform investment decisions. [Michael] Recce’s team, which was stacked with data scientists and coders, developed trading signals but didn’t always fully explain the margin of error in the analysis to make them useful to fund managers, the people said.</p> </blockquote> <p>It’s hard to know the details of what actually happened, but for data scientists collaborating with others, there always needs to be an explanation of “what’s going on”. There’s a general feeling that it’s okay that machine learning techniques build complicated uninterpretable models because they work better. But in my experience that’s not enough. People want to know why they work better, when they work better, and when they <em>don’t</em> work.</p> <p>On over-theorizing:</p> <blockquote> <p>Haynes, who joined Stamford, Connecticut-based Point72 in early 2014 after about two decades at McKinsey &amp; Co., and other senior managers grew dissatisfied with Aperio’s progress and impact on returns, the people said. When the group obtained new data sets, it spent too much time developing theories about how to process them rather than quickly producing actionable results.</p> </blockquote> <p>I don’t necessarily agree with this “criticism”, but I only put it here because the land of hedge funds isn’t generally viewed on the outside as a place where lots of theorizing goes on.</p> <p>At BlueMountain, another hedge fund:</p> <blockquote> <p>When quants showed their risk analysis and trading signals to fundamental managers, they sometimes were rejected as nothing new, the people said. Quants at times wondered if managers simply didn’t want to give them credit for their ideas.</p> </blockquote> <p>I’ve seen this quite a bit. 
When a data scientist presents results to collaborators, there are often two responses:</p> <ol> <li>“I knew that already” and so you haven’t taught me anything new</li> <li>“I didn’t know that already” and so you must be wrong</li> </ol> <p>The common link here, of course, is the inability to admit that there are things you don’t know. Whether this is an inherent character flaw or something that can be overcome through teaching is not yet clear to me. But it is common when data is brought to bear on a problem that previously lacked data. One of the key tasks that a data scientist in any industry must prepare for is giving people information that will make them uncomfortable.</p> Not So Standard Deviations Episode 32 - You Have to Reinvent the Wheel a Few Times 2017-02-13T00:00:00+00:00 http://simplystats.github.io/2017/02/13/nssd-episode-32 <p>Hilary and I discuss training in PhD programs, estimating the variance vs. the standard deviation, the bias variance tradeoff, and explainable machine learning.</p> <p>We’re also introducing a new level of support on our Patreon page, where you can get access to some of the outtakes from our episodes. 
Check out our <a href="https://www.patreon.com/NSSDeviations">Patreon page</a> for details.</p> <p>Show notes:</p> <ul> <li> <p><a href="http://www.darpa.mil/program/explainable-artificial-intelligence">Explainable AI</a></p> </li> <li> <p><a href="http://multithreaded.stitchfix.com/blog/2016/11/22/nba-rankings/">Stitch Fix Blog NBA Rankings</a></p> </li> <li> <p><a href="http://varianceexplained.org/r/empirical-bayes-book/">David Robinson’s Empirical Bayes book</a></p> </li> <li> <p><a href="https://warontherocks.com/2017/01/introducing-bombshell-the-explosive-first-episode/">War on the Rocks podcast</a></p> </li> <li> <p><a href="https://twitter.com/rdpeng">Roger on Twitter</a></p> </li> <li> <p><a href="https://twitter.com/hspter">Hilary on Twitter</a></p> </li> <li> <p><a href="https://leanpub.com/conversationsondatascience/">Get the Not So Standard Deviations book</a></p> </li> <li> <p><a href="https://itunes.apple.com/us/podcast/not-so-standard-deviations/id1040614570">Subscribe to the podcast on iTunes</a></p> </li> <li> <p><a href="https://play.google.com/music/listen?u=0#/ps/Izfnbx6tlruojkfrvhjfdj3nmna">Subscribe to the podcast on Google Play</a></p> </li> <li> <p><a href="https://soundcloud.com/nssd-podcast">Find past episodes</a></p> </li> </ul> <p><a href="https://soundcloud.com/nssd-podcast/episode-32-you-have-to-reinvent-the-wheel-a-few-times">Download the audio for this episode</a></p> <p>Listen here:</p> <iframe width="100%" height="166" scrolling="no" frameborder="no" src="https://w.soundcloud.com/player/?url=https%3A//api.soundcloud.com/tracks/306883468&amp;color=ff5500&amp;auto_play=false&amp;hide_related=false&amp;show_comments=true&amp;show_user=true&amp;show_reposts=false"></iframe> Reproducible Research Needs Some Limiting Principles 2017-02-01T00:00:00+00:00 http://simplystats.github.io/2017/02/01/reproducible-research-limits <p>Over the past 10 years thinking and writing about reproducible research, I’ve come to the conclusion that much of 
the discussion is incomplete. While I think we as a scientific community have come a long way in changing people’s thinking about data and code and making them available to others, there are some key sticking points that keep coming up that are preventing further progress in the area.</p> <p>When I used to write about reproducibility, I felt that the primary challenge/roadblock was a lack of tooling. Much has changed in just the last five years though, and many new tools have been developed to make life a lot easier. Packages like knitr (for R), markdown, and iPython notebooks, have made writing reproducible data analysis documents a lot easier. Web sites like GitHub and many others have made distributing analyses a lot simpler because now everyone effectively has a free web site (this was NOT true in 2005).</p> <p>Even still, our basic definition of reproducibility is incomplete. Most people would say that a data analysis is reproducible if the analytic data and metadata are available and the code that did the analysis is available. Furthermore, it would be preferable to have some documentation to go along with both. But there are some key issues that need to be resolved to complete this general definition.</p> <h2 id="reproducible-for-whom">Reproducible for Whom?</h2> <p>In discussions about reproducibility with others, the topic of <strong>who</strong> should be able to reproduce the analysis only occasionally comes up. There’s a general sense, especially amongst academics, that <strong>anyone</strong> should be able to reproduce any analysis if they wanted to.</p> <p>There is an analogy with free software here in the sense that free software can be free for some people and not for others. This made more sense in the days before the Internet when distribution was much more costly. The idea here was that I could write software for a client and give them the source code for that software (as they would surely demand). 
The software is free for them but not for anyone else. But free software ultimately only matters when it comes to distribution. Once I distribute a piece of software, that’s when all the restrictions come into play. However, if I only distribute it to a few people, I only need to guarantee that those few people have those freedoms.</p> <p>Richard Stallman once said that something like 90% of software was free software because almost all software being written was custom software for individual clients (I have no idea where he got this number). Even if the number is wrong, the point still stands that if I write software for a single person, it can be free for that person even if no one else in the world has access to the software.</p> <p>Of course, now with the Internet, everything pretty much gets distributed to everyone because there’s nothing stopping someone from taking a piece of free software and posting it on a web site. But the idea still holds: Free software only needs to be free for the people who receive it.</p> <p>That said, the analogy is not perfect. Software and research are not the same thing. The key difference is that you can’t call something research unless it is generally available and disseminated. If Pfizer comes up with the cure for cancer and never tells anyone about it, it’s not research. If I discover that there’s a 9th planet and only tell my neighbor about it, it’s not research. Many companies might call those activities research (particularly from a tax/accounting point of view) but since society doesn’t get to learn about them, it’s not research.</p> <p>If research is by definition disseminated to all, then it should therefore be reproducible by all. 
However, there are at least two circumstances in which we do not even pretend to believe this is possible.</p> <ol> <li><strong>Imbalance of resources</strong>: If I conduct a data analysis that requires the <a href="https://www.top500.org/lists/2016/06/">world’s largest supercomputer</a>, I can make all the code and data available that I want–few people will be able to actually reproduce it. That’s an extreme case, but even if I were to make use of a <a href="https://jhpce.jhu.edu">dramatically smaller computing cluster</a> it’s unlikely that anyone would be able to recreate those resources. So I can distribute something that’s reproducible in theory but not in reality by most people.</li> <li><strong>Protected data</strong>: Numerous analyses in the biomedical sciences make use of protected health information that cannot easily be disseminated. Privacy is an important issue, in part, because in many cases it allows us to collect the data in the first place. However, most would agree we cannot simply post that data for all to see in the name of reproducibility. First, it is against the law, and second it would likely deter anyone from agreeing to participate in any study in the future.</li> </ol> <p>We can pretend that we can make data analyses reproducible for all, but in reality it’s not possible. So perhaps it would make sense for us to consider whether a limiting principle should be applied. The danger of not considering it is that one may take things to the extreme—if it can’t be made reproducible for all, then why bother trying? A partial solution is needed here.</p> <h2 id="for-how-long">For How Long?</h2> <p>Another question that needs to be resolved for reproducibility to be a widely implemented and sustainable phenomenon is for how long should something be reproducible? 
Ultimately, this is a question about time and resources because ensuring that data and code can be made available and can run on current platforms <em>in perpetuity</em> requires substantial time and money. In the academic community, where projects are often funded off of grants or contracts with finite lifespans, often the money is long gone even though the data and code must be maintained. The question then is who pays for the maintenance and the upkeep of the data and code?</p> <p>I’ve never heard a satisfactory answer to this question. If the answer is that data analyses should be reproducible forever, then we need to consider a different funding model. This position would require a perpetual funds model, essentially an endowment, for each project that is disseminated and claims to be reproducible. The endowment would pay for things like servers for hosting the code and data and perhaps engineers to adapt and adjust the code as the surrounding environment changes. While there are a number of <a href="http://dataverse.org">repositories</a> that have developed scalable operating models, it’s not clear to me that the funding model is completely sustainable.</p> <p>If we look at how scientific publications are sustained, we see that it’s largely private enterprise that shoulders the burden. Journals house most of the publications out there and they charge a fee for access (some for profit, some not for profit). Whether the reader pays or the author pays is not relevant; the point is that a decision has been made about <em>who</em> pays.</p> <p>The author-pays model is interesting though. Here, an author pays a publication charge of ~$2,000, and the reader never pays anything for access (in perpetuity, presumably). The $2,000 payment by the author is like a one-time capital expense for maintaining that one publication forever (a mini-endowment, in a sense). It works for authors because grant/contract supported research often budgets for one-time publication charges. 
There’s no need for continued payments after a grant/contract has expired.</p> <p>The publication system is quite a bit simpler because almost all publications are the same size and require the same resources for access: basically a web site that can serve up PDF files and people to maintain it. For data analyses, one could see things potentially getting out of control. For a large analysis with terabytes of data, what would the one-time up-front fee be to house the data and pay for anyone to access it for free forever?</p> <p>Using Amazon’s <a href="http://calculator.s3.amazonaws.com/index.html">monthly cost estimator</a> we can get a rough sense of what the pure data storage might cost. Suppose we have a 10GB dataset that we want to store and we anticipate that it might be downloaded 10 times per month. This would cost about $7.65 per month, or $91.80 per year. If we assume Amazon raises their prices about 3% per year and a discount rate of 5%, the total cost for the storage, treated as a perpetuity, is $91.80 / (0.05 - 0.03) = $4,590. If we tack on 20% for other costs, that brings us to $5,508. This is perhaps not unreasonable, and this scenario would certainly cover most people. For comparison, a 1 TB dataset downloaded once a year, using the same formula, gives us a one-time cost of about $40,000. This is real money when it comes to fixed research budgets and would likely require some discussion of trade-offs.</p> <h2 id="summary">Summary</h2> <p>Reproducibility is a necessity in science, but it’s high time that we start considering the practical implications of actually doing the job. There are still holdouts when it comes to the basic idea of reproducibility, but they are fewer and farther between. 
If we do not seriously consider the details of how to implement reproducibility, perhaps by introducing some limiting principles, we may never be able to achieve any sort of widespread adoption.</p> Turning data into numbers 2017-01-31T00:00:00+00:00 http://simplystats.github.io/2017/01/31/data-into-numbers <p><em>Editor’s note: This is the third chapter of a book I’m working on called <a href="https://leanpub.com/demystifyai/">Demystifying Artificial Intelligence</a>. The goal of the book is to demystify what modern AI is and does for a general audience. So something to smooth the transition between AI fiction and highly mathematical descriptions of deep learning. I’m developing the book over time - so if you buy the book on Leanpub know that there are only three chapters in there so far, but I’ll be adding more over the next few weeks and you get free updates. The cover of the book was inspired by this <a href="https://twitter.com/notajf/status/795717253505413122">amazing tweet</a> by Twitter user <a href="https://twitter.com/notajf/">@notajf</a>. Feedback is welcome and encouraged!</em></p> <blockquote> <p>“It is a capital mistake to theorize before one has data.” Arthur Conan Doyle</p> </blockquote> <h2 id="data-data-everywhere">Data, data everywhere</h2> <p>I already have some data about you. You are reading this book. Does that seem like data? It’s just something you did, that’s not data is it? But if I collect that piece of information about you, it actually tells me a surprising amount. It tells me you have access to an internet connection, since the only place to get the book is online. That in turn tells me something about your socioeconomic status and what part of the world you live in. It also tells me that you like to read, which suggests a certain level of education.</p> <p>Whether you know it or not, everything you do produces data - from the websites you read to the rate at which your heart beats. 
Until pretty recently, most of the data you produced wasn’t collected, it floated off unmeasured. Data were painstakingly gathered by scientists one number at a time in small experiments with a few people. This laborious process meant that data were expensive and time-consuming to collect. Yet many of the most amazing scientific discoveries over the last two centuries were squeezed from just a few data points. But over the last two decades, the unit price of data has dramatically dropped. New technologies touching every aspect of our lives from our money, to our health, to our social interactions have made data collection cheap and easy.</p> <p>To give you an idea of how steep the drop in the price of data has been, in 1967 Stanley Milgram did an experiment to determine the number of degrees of separation between two people in the U.S. (Travers and Milgram 1969). In his experiment he sent 296 letters to people in Omaha, Nebraska and Wichita, Kansas. The goal was to get the letters to a specific person in Boston, Massachusetts. The trick was people had to send the letters to someone they knew, and they then sent it to someone they knew and so on. At the end of the experiment, only 64 letters made it to the individual in Boston. On average, the letters had gone through 6 people to get there.</p> <p>This is an idea that is so powerful it even became part of the popular consciousness. For example it is the foundation of the internet meme “the 6-degrees of Kevin Bacon” (Wikipedia contributors 2016a) - the idea that if you take any actor and look at the people they have been in movies with, then the people those people have been in movies with, it will take you at most six steps to end up at the actor Kevin Bacon. This idea, despite its popularity was originally studied by Milgram using only 64 data points. A 2007 study updated that number to “7 degrees of Kevin Bacon”. 
The study was based on 30 billion instant messaging conversations collected over the course of a month or two with the same amount of effort (Leskovec and Horvitz 2008).</p> <p>Once data started getting cheaper to collect, it got cheaper fast. Take another example, the human genome. The genome is the unique DNA code in every one of your cells. It consists of a set of 3 billion letters that is unique to you. By many measures, the race to be the first group to collect all 3 billion letters from a single person kicked off the data revolution in biology. The project was completed in 2000 after a decade of work and $3 billion to collect the 3 billion letters in the first human genome (Venter et al. 2001). This project was actually a stunning success, most people thought it would be much more expensive. But just over a decade later, new technology means that we can now collect all 3 billion letters from a person’s genome for about $1,000 in about a week (“The Cost of Sequencing a Human Genome,” n.d.), soon it may be less than $100 (Buhr 2017).</p> <p>You may have heard that this is the era of “big data” from The Economist or The New York Times. It is really the era of cheap data collection and storage. Measurements we never bothered to collect before are now so easy to obtain that there is no reason not to collect them. Advances in computer technology also make it easier to store huge amounts of data digitally. This may not seem like a big deal, but it is much easier to calculate the average of a bunch of numbers stored electronically than it is to calculate that same average by hand on a piece of paper. Couple these advances with the free and open distribution of data over the internet and it is no surprise that we are awash in data. But tons of data on their own are meaningless. 
It is in understanding and interpreting the data that the real advances start to happen.</p> <p>This explosive growth in data collection is one of the key driving influences behind interest in artificial intelligence. When teaching computers to do something that only humans could do previously, it helps to have lots of examples. You can then use statistical and machine learning models to summarize that set of examples and help a computer decide what to do. The more examples you have, the more flexible your computer model can be in making decisions, and the more “intelligent” the resulting application.</p> <h2 id="what-is-data">What is data?</h2> <h3 id="tidy-data">Tidy data</h3> <p>“What is data?” seems like a relatively simple question. In some ways this question is easy to answer. According to <a href="https://en.wikipedia.org/wiki/Data">Wikipedia</a>:</p> <blockquote> <p>Data (/ˈdeɪtə/ day-tə, /ˈdætə/ da-tə, or /ˈdɑːtə/ dah-tə) is a set of values of qualitative or quantitative variables. An example of qualitative data would be an anthropologist’s handwritten notes about her interviews with people of an Indigenous tribe. Pieces of data are individual pieces of information. While the concept of data is commonly associated with scientific research, data is collected by a huge range of organizations and institutions, ranging from businesses (e.g., sales data, revenue, profits, stock price), governments (e.g., crime rates, unemployment rates, literacy rates) and non-governmental organizations (e.g., censuses of the number of homeless people by non-profit organizations).</p> </blockquote> <p>When you think about data, you probably think of orderly sets of numbers arranged in something like an Excel spreadsheet. In the world of data science and machine learning this type of data has a name - “tidy data” (Wickham and others 2014). Tidy data has the property that all measured quantities are represented by numbers or character strings (think words). 
The data are organized such that:</p> <ol> <li>Each variable you measured is in one column.</li> <li>Each different measurement of that variable is in a different row.</li> <li>There is one data table for each “type” of variable.</li> <li>If there are multiple tables then they are linked by a common ID.</li> </ol> <p>This idea is borrowed from data management schemas that have long been used for storing data in databases. Here is an example of a tidy data set of swimming world records.</p> <table> <thead> <tr> <th style="text-align: right">year</th> <th style="text-align: right">time</th> <th style="text-align: left">sex</th> </tr> </thead> <tbody> <tr> <td style="text-align: right">1905</td> <td style="text-align: right">65.8</td> <td style="text-align: left">M</td> </tr> <tr> <td style="text-align: right">1908</td> <td style="text-align: right">65.6</td> <td style="text-align: left">M</td> </tr> <tr> <td style="text-align: right">1910</td> <td style="text-align: right">62.8</td> <td style="text-align: left">M</td> </tr> <tr> <td style="text-align: right">1912</td> <td style="text-align: right">61.6</td> <td style="text-align: left">M</td> </tr> <tr> <td style="text-align: right">1918</td> <td style="text-align: right">61.4</td> <td style="text-align: left">M</td> </tr> <tr> <td style="text-align: right">1920</td> <td style="text-align: right">60.4</td> <td style="text-align: left">M</td> </tr> <tr> <td style="text-align: right">1922</td> <td style="text-align: right">58.6</td> <td style="text-align: left">M</td> </tr> <tr> <td style="text-align: right">1924</td> <td style="text-align: right">57.4</td> <td style="text-align: left">M</td> </tr> <tr> <td style="text-align: right">1934</td> <td style="text-align: right">56.8</td> <td style="text-align: left">M</td> </tr> <tr> <td style="text-align: right">1935</td> <td style="text-align: right">56.6</td> <td style="text-align: left">M</td> </tr> </tbody> </table> <p>This type of data, neat, organized and nicely numeric 
is not the kind of data people are talking about when they say the “era of big data”. Data almost never start their lives in such a neat and organized format.</p> <h3 id="raw-data">Raw data</h3> <p>The explosion of interest in AI has been powered by a variety of types of data that you might not even think of when you think of “data”. The data might be pictures you take and upload to social media, the text of the posts on that same platform, or the sound captured from your voice when you speak to your phone.</p> <p>Social media and cell phones aren’t the only area where data is being collected more frequently. Speed cameras on roads collect data on the movement of cars, electronic medical records store information about people’s health, wearable devices like Fitbit collect information on the activity of people. GPS information stores the location of people, cars, boats, airplanes, and an increasingly wide array of other objects.</p> <p>Images, voice recordings, text files, and GPS coordinates are what experts call “raw data”. To create an artificial intelligence application you need to begin with a lot of raw data. But as we discussed in the simple AI example from the previous chapter - a computer doesn’t understand raw data in its natural form. It is not always immediately obvious how the raw data can be turned into numbers that a computer can understand. For example, when an artificial intelligence works with a picture the computer doesn’t “see” the picture file itself. It sees a set of numbers that represent that picture and operates on those numbers. The first step in almost every artificial intelligence application is to “pre-process” the data - to take the image files or the movie files or the text of a document and turn it into numbers that a computer can understand. 
Then those numbers can be fed into algorithms that can make predictions and ultimately be used to make an interface look intelligent.</p> <h2 id="turning-raw-data-into-numbers">Turning raw data into numbers</h2> <p>So how do we convert raw data into a form we can work with? It depends on what type of measurement or data you have collected. Here I will use two examples to explain how you can convert images and the text of a document into numbers that an algorithm can be applied to.</p> <h3 id="images">Images</h3> <p>Suppose that we were developing an AI to identify pictures of the author of this book. We would need to collect a picture of the author - maybe an embarrassing one.</p> <p><img src="https://raw.githubusercontent.com/simplystats/simplystats.github.io/master/_images/jeff.jpg" alt="An embarrassing picture of the author" /></p> <p>This picture is made of pixels. You can see that if you zoom in very close on the image and look more closely. You can see that the image consists of many hundreds of little squares, each square just one color. Those squares are called pixels and they are one step closer to turning the image into numbers.</p> <p><img src="https://raw.githubusercontent.com/simplystats/simplystats.github.io/master/_images/jeff-smile.png" alt="A zoomed in view of the author's smile - you can see that each little square corresponds to one pixel and has an individual color" /></p> <p>You can think of each pixel like a dot of color. Let’s zoom in a little bit more and instead of showing each pixel as a square show each one as a colored dot.</p> <p><img src="https://raw.githubusercontent.com/simplystats/simplystats.github.io/master/_images/jeff-smile-dots.png" alt="A zoomed in view of the author's smile - now each of the pixels are little dots one for each pixel." /></p> <p>Imagine we are going to build an AI application on the basis of lots of images. Then we would like to turn a set of images into “tidy data”. 
As described above, a tidy data set is defined as follows.</p> <ol> <li>Each variable you measured is in one column</li> <li>Each different measurement of that variable is in a different row</li> <li>There is one data table for each “type” of variable.</li> <li>If there are multiple tables then they are linked by a common ID.</li> </ol> <p>A translation of tidy data for a collection of images would be the following.</p> <ol> <li><em>Variables</em>: The variables are the pixels measured in the images. So the top left pixel is a variable, the bottom left pixel is a variable, and so on. Each pixel should be in a separate column.</li> <li><em>Measurements</em>: The measurements are the values for each pixel in each image. So each row corresponds to the values of the pixels for one image.</li> <li><em>Tables</em>: There would be two tables - one with the data from the pixels and one with the labels of each image (if we know them).</li> </ol> <p>To start to turn the image into a row of the data set we need to stretch the dots into a single row. One way to do this is to snake along the image, going from the top left corner to the bottom right corner, to create a single line of dots.</p> <p><img src="https://raw.githubusercontent.com/simplystats/simplystats.github.io/master/_images/jeff-smile-lines.png" alt="Follow the path of the arrows to see how you can turn the two dimensional picture into a one dimensional picture" /></p> <p>This still isn’t quite data a computer can understand - a computer doesn’t know about dots. But we could take each dot and label it with a color name.</p> <p><img src="https://raw.githubusercontent.com/simplystats/simplystats.github.io/master/_images/jeff-color-names.png" alt="Labeling each color with a name" /></p> <p>We could take each color name and give it a number, something like <code class="highlighter-rouge">rosybrown = 1</code>, <code class="highlighter-rouge">mistyrose = 2</code>, and so on. 
This approach runs into some trouble because we don’t have names for every possible color and because it is pretty inefficient to have a different number for every hue we could imagine. Arbitrary color numbers would also not be very understandable by a computer.</p> <p>An alternative strategy that is often used is to encode the intensity of the red, green, and blue colors for each pixel. This is sometimes called the RGB color model (Wikipedia contributors 2016b). So for example we can take these dots and show how much red, green, and blue they have in them.</p> <p><img src="https://raw.githubusercontent.com/simplystats/simplystats.github.io/master/_images/jeff-rgb.png" alt="Breaking each color down into the amount of red, green and blue" /></p> <p>Looking at it this way we now have three measurements for each pixel. So we need to update our tidy data definition to be:</p> <ol> <li><em>Variables</em>: The variables are the three colors for each pixel measured in the images. So the top left pixel red value is a variable, the top left pixel green value is a variable, and so on. Each pixel/color combination should be in a separate column.</li> <li><em>Measurements</em>: The measurements are the red, green, and blue values for each pixel in each image. So each row corresponds to the values of the pixels for one image.</li> <li><em>Tables</em>: There would be two tables - one with the data from the pixels and one with the labels of each image (if we know them).</li> </ol> <p>So a tidy data set might look something like this for just the image of Jeff.</p> <table> <thead> <tr> <th>id</th> <th>label</th> <th>p1red</th> <th>p1green</th> <th>p1blue</th> <th>p2red</th> <th>…</th> </tr> </thead> <tbody> <tr> <td>1</td> <td>“jeff”</td> <td>238</td> <td>180</td> <td>180</td> <td>205</td> <td>…</td> </tr> </tbody> </table> <p>Each additional image would then be another row in the data set. 
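</p> <p>To make the table above concrete, here is a minimal Python sketch of how one image could be flattened into a single tidy row. It is only an illustration under some assumptions: the tiny 2x2 image and most of its color values are made up (only the first pixel and the red value of the second pixel follow the table), the function name <code class="highlighter-rouge">image_to_row</code> is hypothetical, and the pixels are read left to right on every line rather than along a snaking path.</p>

```python
# Hypothetical 2x2 image. Each pixel is a (red, green, blue) triple with
# values from 0 to 255. Apart from the first pixel and the red value of the
# second pixel, these numbers are invented for illustration.
image = [
    [(238, 180, 180), (205, 92, 92)],
    [(255, 228, 225), (188, 143, 143)],
]


def image_to_row(image_id, label, pixels):
    """Flatten a grid of RGB pixels into one tidy row, stored as a dict."""
    row = {"id": image_id, "label": label}
    number = 1
    for line in pixels:                  # top to bottom
        for red, green, blue in line:    # left to right
            row[f"p{number}red"] = red
            row[f"p{number}green"] = green
            row[f"p{number}blue"] = blue
            number += 1
    return row


row = image_to_row(1, "jeff", image)
print(row)
```

<p>Calling the same function on a second image would produce a second row with the same column names, which is what lets a set of images be stacked into one table.</p> <p>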
As we will see in the chapters that follow we can then feed this data into an algorithm for performing an artificial intelligence task.</p> <h2 id="notes">Notes</h2> <p>Parts of this chapter appeared in the Simply Statistics blog post <a href="http://simplystatistics.org/2013/06/14/the-vast-majority-of-statistical-analysis-is-not-performed-by-statisticians/">“The vast majority of statistical analysis is not performed by statisticians”</a> written by the author of this book.</p> <h2 id="references">References</h2> <p>Buhr, Sarah. 2017. “Illumina Wants to Sequence Your Whole Genome for $100.” <a href="https://techcrunch.com/2017/01/10/illumina-wants-to-sequence-your-whole-genome-for-100/">https://techcrunch.com/2017/01/10/illumina-wants-to-sequence-your-whole-genome-for-100/</a>.</p> <p>Leskovec, Jure, and Eric Horvitz. 2008. “Planetary-Scale Views on an Instant-Messaging Network,” 6 Mar.</p> <p>“The Cost of Sequencing a Human Genome.” n.d. <a href="https://www.genome.gov/sequencingcosts/">https://www.genome.gov/sequencingcosts/</a>.</p> <p>Travers, Jeffrey, and Stanley Milgram. 1969. “An Experimental Study of the Small World Problem.” <em>Sociometry</em> 32 (4). [American Sociological Association, Sage Publications, Inc.]: 425–43.</p> <p>Venter, J Craig, Mark D Adams, Eugene W Myers, Peter W Li, Richard J Mural, Granger G Sutton, Hamilton O Smith, et al. 2001. “The Sequence of the Human Genome.” <em>Science</em> 291 (5507). American Association for the Advancement of Science: 1304–51.</p> <p>Wickham, Hadley, and others. 2014. “Tidy Data.” <em>Under Review</em>.</p> <p>Wikipedia contributors. 2016a. “Six Degrees of Kevin Bacon.” <a href="https://en.wikipedia.org/w/index.php?title=Six_Degrees_of_Kevin_Bacon&amp;oldid=748831516">https://en.wikipedia.org/w/index.php?title=Six_Degrees_of_Kevin_Bacon&amp;oldid=748831516</a>.</p> <p>———. 2016b. 
“RGB Color Model.” <a href="https://en.wikipedia.org/w/index.php?title=RGB_color_model&amp;oldid=756764504">https://en.wikipedia.org/w/index.php?title=RGB_color_model&amp;oldid=756764504</a>.</p> New class - Data App Prototyping for Public Health and Beyond 2017-01-26T00:00:00+00:00 http://simplystats.github.io/2017/01/26/new-prototyping-class <p>Are you interested in building data apps to help save the world, start the next big business, or just to see if you can? We are running a data app prototyping class for people interested in creating these apps.</p> <p>This will be a special topics class at JHU and is open to any undergrad student, grad student, postdoc, or faculty member at the university. We are also seeing if we can make the class available to people outside of JHU so even if you aren’t at JHU but are interested you should let us know below.</p> <p>One of the principles of our approach is that anyone can prototype an app. Our class starts with some tutorials on Shiny and R. While we have no formal pre-reqs for the class you will have much more fun if you have the background equivalent to our Coursera classes:</p> <ul> <li><a href="https://www.coursera.org/learn/data-scientists-tools">Data Scientist’s Toolbox</a></li> <li><a href="https://www.coursera.org/learn/r-programming">R programming</a></li> <li><a href="https://www.coursera.org/learn/r-packages">Building R packages</a></li> <li><a href="https://www.coursera.org/learn/data-products">Developing Data Products</a></li> </ul> <p>If you don’t have that background you can take the classes online starting now to get up to speed! 
To see some examples of apps we will be building check out our <a href="http://jhudatascience.org/data_app_gallery.html">gallery</a>.</p> <p>We will mostly be able to support development with R and Shiny but would be pumped to accept people with other kinds of development background - we just might not be able to give a lot of technical assistance.</p> <p>As part of the course we are also working with JHU’s <a href="https://ventures.jhu.edu/fastforward/">Fast Forward</a> program to streamline and ease the process of starting a company around the app you build for the class. So if you have entrepreneurial ambitions, this is the class for you!</p> <p>We are in the process of setting up the course times, locations, and enrollment cap. The class will run from March to May (exact dates TBD). To sign up for announcements about the class please fill out your information <a href="http://jhudatascience.org/prototyping_students.html">here</a>.</p> User Experience and Value in Products - What Regression and Surrogate Variables can Teach Us 2017-01-23T00:00:00+00:00 http://simplystats.github.io/2017/01/23/ux-value <p>Over the past year, there have been a number of recurring topics in my global news feed that have a shared theme to them. Some examples of these topics are:</p> <ul> <li><strong>Fake news</strong>: Before and after the election in 2016, Facebook (or Facebook’s Trending News algorithm) was accused of promoting news stories that turned out to be completely false, promoted by dubious news sources in FYROM and elsewhere.</li> <li><strong>Theranos</strong>: This diagnostic testing company promised to revolutionize the blood testing business and prevent disease for all by making blood testing simple and painless. This way people would not be afraid to get blood tests and would do them more often, presumably catching diseases while they were in the very early stages. 
Theranos lobbied to allow patients to order their own blood tests so that they wouldn’t need a doctor’s order.</li> <li><strong>Homeopathy</strong>: This is a so-called <a href="https://nccih.nih.gov/health/homeopathy">alternative medical system</a> developed in the late 18th century based on notions such as “like cures like” and “law of minimum dose”.</li> <li><strong>Online education</strong>: New companies like Coursera and Udacity promised to revolutionize education by making it accessible to a broader audience than conventional universities were able to reach.</li> </ul> <p>What exactly do these things have in common?</p> <p>First, consumers love them. Fake news played to people’s biases by confirming to them, from a seemingly trustworthy source, what they always “knew to be true”. The fact that the stories weren’t actually true was irrelevant given that users enjoyed the experience of seeing what they agreed with. Perhaps the best explanation of the entire Facebook fake news issue was from Kim-Mai Cutler:</p> <blockquote class="twitter-tweet" data-lang="en"><p lang="en" dir="ltr">The best way to have the stickiest and most lucrative product? Be a systematic tool for confirmation bias. <a href="https://t.co/8uOHZLomhX">https://t.co/8uOHZLomhX</a></p>&mdash; Kim-Mai Cutler (@kimmaicutler) <a href="https://twitter.com/kimmaicutler/status/796560990854905857">November 10, 2016</a></blockquote> <script async="" src="//platform.twitter.com/widgets.js" charset="utf-8"></script> <p>Theranos promised to revolutionize blood testing and change the user experience behind the whole industry. Indeed the company had some fans (particularly amongst its <a href="https://www.axios.com/tim-drapers-keeps-defending-theranos-2192078259.html">investor base</a>). 
However, after investigations by the Centers for Medicare and Medicaid Services, the FDA, and an independent laboratory, it was found that Theranos’s blood testing machine was wildly inconsistent and variable, leading to Theranos ultimately retracting all of its blood test results and cutting half its workforce.</p> <p>Homeopathy is not company specific, but is touted by many as an “alternative” treatment for many diseases, with many claiming that it “works for them”. However, the NIH states quite clearly on its <a href="https://nccih.nih.gov/health/homeopathy">web site</a> that “There is little evidence to support homeopathy as an effective treatment for any specific condition.”</p> <p>Finally, companies like Coursera and Udacity in the education space have indeed produced products that people like, but in some instances have hit bumps in the road. Udacity conducted a brief experiment/program with San Jose State University that failed due to the large differences between the population that took online courses and the one that took them in person. Coursera has massive offerings from major universities (including my own) but has run into continuing <a href="http://www.economist.com/news/special-report/21714173-alternative-providers-education-must-solve-problems-cost-and">challenges with drop out</a> and questions over whether the courses offered are suitable for job placement.</p> <h2 id="user-experience-and-value">User Experience and Value</h2> <p>In each of these four examples there is a consumer product that people love, often because it provides a great user experience. Take the fake news example: people love to read headlines from “trusted” news sources that agree with what they believe. With Theranos, people love to take a blood test that is not painful (maybe “love” is the wrong word here). With many consumer products companies, it is the user experience that defines the value of a product. 
Often when describing the user experience, you are simultaneously describing the value of the product.</p> <p>Take for example Uber. With Uber, you open an app on your phone, click a button to order a car, watch the car approach you on your phone with an estimate of how long you will be waiting, get in the car and go to your destination, and get out without having to deal with paying. If someone were to ask me “What’s the value of Uber?” I would probably just repeat the description in the previous sentence. Isn’t it obvious that it’s better than the usual taxi experience? The same could be said for many companies that have recently come up: Airbnb, Amazon, Apple, Google. With many of the products from these companies, <em>the description of the user experience is a description of its value</em>.</p> <h2 id="disruption-through-user-experience">Disruption Through User Experience</h2> <p>In the example of Uber (and Airbnb, and Amazon, etc.) you could depict the relationship between the product, the user experience, and the value as such:</p> <p><img src="https://raw.githubusercontent.com/simplystats/simplystats.github.io/master/_images/ux1.png" alt="" /></p> <p>Any changes that you can make to the product to improve the user experience will then improve the value that the product offers. Another way to say it is that the user experience serves as a <em>surrogate outcome</em> for the value. We can influence the UX and know that we are improving value. Furthermore, any measurements that we take on the UX (surveys, focus groups, app data) will serve as direct observations on the value provided to customers.</p> <p>New companies in these kinds of consumer product spaces can disrupt the incumbents by providing a much better user experience. When incumbents have gotten fat and lazy, there is often a sizable segment of the customer base that feels underserved. 
That’s when new companies can swoop in to specifically serve that segment, often with a “worse” product overall (as in fewer features) and usually much cheaper. The Internet has made the “swooping in” much easier by <a href="https://stratechery.com/2015/netflix-and-the-conservation-of-attractive-profits/">dramatically reducing transaction and distribution costs</a>. Once the new company has a foothold, they can gradually work their way up the ladder of customer segments to take over the market. It’s classic disruption theory a la <a href="http://www.claytonchristensen.com">Clayton Christensen</a>.</p> <h2 id="when-value-defines-the-user-experience-and-product">When Value Defines the User Experience and Product</h2> <p>There has been much talk of applying the classic disruption model to every space imaginable, but I contend that not all product spaces are the same. In particular, the four examples I described in the beginning of this post cover some of those different areas:</p> <ul> <li>Medicine (Theranos, homeopathy)</li> <li>News (Facebook/fake news)</li> <li>Education (Coursera/Udacity)</li> </ul> <p>One thing you’ll notice about these areas, particularly with medicine and education, is that they are all heavily regulated. The reason is because we as a community have decided that there is a minimum level of value that is required to be provided by entities in this space. That is, the value that a product offers is <em>defined first</em>, before the product can come to market. Therefore, the value of the product actually constrains the space of products that can be produced. We can depict this relationship as such:</p> <p><img src="https://raw.githubusercontent.com/simplystats/simplystats.github.io/master/_images/ux2.png" alt="" /></p> <p>In classic regression modeling language, the value of a product must be “adjusted for” before examining the relationship between the product and the user experience. 
Naturally, as in any regression problem, when you adjust for a variable that is related to the product and the user experience, you reduce the overall variation in the product.</p> <p>In situations where the value defines the product and the user experience, there is much less room to maneuver for new entrants in the market. The reason is that they, like everyone else, are constrained by the value that is agreed upon by the community, usually in the form of regulations.</p> <p>When Theranos comes in and claims that it’s going to dramatically improve the user experience of blood testing, that’s great, but they must be constrained by the value that society demands, which is a certain precision and accuracy in its testing results. Companies in the online education space are welcome to disrupt things by providing a better user experience. Online offerings in fact do this by allowing students to take classes according to their own schedule, wherever they may live in the world. But we still demand that the students learn an agreed-upon set of facts, skills, or lessons.</p> <p>New companies will often argue that the things that we currently value are outdated or no longer valuable. Their incentive is to change the value required so that there is more room for new companies to enter the space. This is a good thing, but it’s important to realize that this cannot happen solely through changes in the product. Innovative features of a product may help us to understand that we should be valuing different things, but ultimately the change in what we perceive as value occurs independently of any given product.</p> <p>When I see new companies enter the education, medicine, or news areas, I always hesitate a bit because I want some assurance that they will still provide the value that we have come to expect. In addition, with these particular areas, there is a genuine sense that failing to deliver on what we value could cause serious harm to individuals. 
However, I think the discussion that is provoked by new companies entering the space is always welcome because we need to constantly re-evaluate what we value and whether it matches the needs of our time.</p> An example that isn't that artificial or intelligent 2017-01-20T00:00:00+00:00 http://simplystats.github.io/2017/01/20/not-artificial-not-intelligent <p><em>Editor’s note: This is the second chapter of a book I’m working on called <a href="https://leanpub.com/demystifyai/">Demystifying Artificial Intelligence</a>. The goal of the book is to demystify what modern AI is and does for a general audience, so it is meant to smooth the transition between AI fiction and highly mathematical descriptions of deep learning. I’m developing the book over time - so if you buy the book on Leanpub know that there are only two chapters in there so far, but I’ll be adding more over the next few weeks and you get free updates. The cover of the book was inspired by this <a href="https://twitter.com/notajf/status/795717253505413122">amazing tweet</a> by Twitter user <a href="https://twitter.com/notajf/">@notajf</a>. Feedback is welcome and encouraged!</em></p> <blockquote> <p>“I am so clever that sometimes I don’t understand a single word of what I am saying.” Oscar Wilde</p> </blockquote> <p>As we have described it, artificial intelligence applications consist of three things:</p> <ol> <li>A large collection of data examples</li> <li>An algorithm for learning a model from that training set.</li> <li>An interface with the world.</li> </ol> <p>In the following chapters we will go into each of these components in much more detail, but let’s start with a couple of very simple examples to make sure that the components of an AI are clear. 
We will start with a completely artificial example and then move to more complicated examples.</p> <h2 id="building-an-album">Building an album</h2> <p>Let’s start with a very simple hypothetical example that can be understood even if you don’t have a technical background. We can also use this example to define some of the terms we will be discussing later in the book.</p> <p>In our simple example the goal is to make an album of photos for a friend. For example, suppose I want to take the photos in my photobook and find all the ones that include pictures of myself and my son Dex for his grandmother.</p> <p><img src="https://raw.githubusercontent.com/simplystats/simplystats.github.io/master/_images/cartoon-phone-photos.png" alt="The author's drawing of the author's phone album. Don't make fun, he's a data scientist, not an artist" /></p> <p>If you are anything like the author of this book, then you probably have a very large number of pictures of your family on your phone. So the first step in making the photo album would be to sort through all of my pictures and pick out the ones that should be part of the album.</p> <p>This is a typical example of the type of thing we might want to train a computer to do in an artificial intelligence application. Each of the components of an AI application is there:</p> <ol> <li><strong>The data</strong>: all of the pictures on the author’s phone (a big training set!)</li> <li><strong>The algorithm</strong>: finding pictures of me and my son Dex</li> <li><strong>The interface</strong>: the album to give to Dex’s grandmother.</li> </ol> <p>One way to solve this problem is for me to sort through the pictures one by one and decide whether they should be in the album or not, then assemble them together, and then put them into the album. If I did it like this then I myself would be the AI! 
That wouldn’t be very artificial though…imagine we instead wanted to teach a computer to make this album.</p> <blockquote> <p>But what does it mean to “teach” a computer to do something?</p> </blockquote> <p>The terms “machine learning” and “artificial intelligence” invoke the idea of teaching computers in the same way that we teach children. This was a deliberate choice to make the analogy - both because in some ways it is appropriate and because it is useful for explaining complicated concepts to people with limited backgrounds. To teach a child to find pictures of the author and his son, you would show her lots of examples of that type of picture and maybe some examples of the author with other kids who were not his son. You’d repeat to the child that the pictures of the author and his son were the kinds you wanted and the others weren’t. Eventually she would retain that information and if you gave her a new picture she could tell you whether it was the right kind or not.</p> <p>To teach a machine to perform the same kind of recognition you go through a similar process. You “show” the machine many pictures labeled as either the ones you want or not. You repeat this process until the machine “retains” the information and can correctly label a new photo. Getting the machine to “retain” this information is a matter of getting the machine to create a set of step by step instructions it can apply to go from the image to the label that you want.</p> <h2 id="the-data">The data</h2> <p>The images are what people in the fields of artificial intelligence and machine learning call <em>“raw data”</em> (Leek, n.d.). The categories of pictures (a picture of the author and his son or a picture of something else) are called the <em>“labels”</em> or <em>“outcomes”</em>. 
If the computer gets to see the labels when it is learning then it is called <em>“supervised learning”</em> (Wikipedia contributors 2016) and when the computer doesn’t get to see the labels it is called <em>“unsupervised learning”</em> (Wikipedia contributors 2017a).</p> <p>Going back to our analogy with the child, supervised learning would be teaching the child to recognize pictures of the author and his son together. Unsupervised learning would be giving the child a pile of pictures and asking her to sort them into groups. She might sort them by color or subject or location - not necessarily into categories that you care about. But probably one of the categories she would make would be pictures of people - so she would have found some potentially useful information even if it wasn’t exactly what you wanted. One whole field of artificial intelligence is figuring out how to take the information learned in this “unsupervised” setting and use it for supervised tasks - this is sometimes called <em>“transfer learning”</em> (Raina et al. 2007) by people in the field since you are transferring information from one task to another.</p> <p>Returning to the task of “teaching” a computer to retain information about what kind of pictures you want we run into a problem - computers don’t know what pictures are! They also don’t know what audio clips, text files, videos, or any other kind of information is. At least not directly. They don’t have eyes, ears, and other senses along with a brain designed to decode the information from these senses.</p> <p>So what can a computer understand? A good rule of thumb is that a computer works best with numbers. If you want a computer to sort pictures into an album for you, the first thing you need to do is to find a way to turn all of the information you want to “show” the computer into numbers. 
In the case of sorting pictures into albums - a supervised learning problem - we need to turn the labels and the images into numbers the computer can use.</p> <p><img src="https://raw.githubusercontent.com/simplystats/simplystats.github.io/master/_images/labels-to-numbers.png" alt="Label each picture as a one or a zero depending on whether it is the kind of picture you want in the album" /></p> <p>One way to do that would be for you to do it for the computer. You could take every picture on your phone and label it with a 1 if it was a picture of the author and his son and a 0 if not. Then you would have a set of 1’s and 0’s corresponding to all of the pictures. This takes something the computer can’t understand (the picture) and turns it into something the computer can understand (the label).</p> <p>Although this process turns the labels into something a computer can understand, it still isn’t something we could teach a computer to do. The computer can’t actually “look” at the image and doesn’t know who the author or his son are. So we need to figure out a way to turn the images into numbers for the computer to use to generate those labels directly.</p> <p>This is a little more complicated but you could still do it for the computer. Let’s suppose that the author and his son always wear matching blue shirts when they spend time together. Then you could go through and look at each image and decide what fraction of the image is blue. 
So each picture would get a number ranging from zero to one like 0.30 if the picture was 30% blue and 0.53 if it was 53% blue.</p> <p><img src="https://raw.githubusercontent.com/simplystats/simplystats.github.io/master/_images/images-to-numbers.png" alt="Calculate the fraction of each image that is the color blue as a &quot;feature&quot; of the image that is numeric" /></p> <p>The fraction of the picture that is blue is called a <em>“feature”</em> and the process of creating that feature is called <em>“feature engineering”</em> (Wikipedia contributors 2017b). Until very recently feature engineering of text, audio, or video files was best performed by an expert human. In later chapters we will discuss how one of the most exciting parts of modern AI applications is that it is now possible to have computers perform feature engineering for you.</p> <h2 id="the-algorithm">The algorithm</h2> <p>Now that we have converted the images to numbers and the labels to numbers, we can talk about how to “teach” a computer to label the pictures. A good rule of thumb when thinking about algorithms is that a computer can’t “do” anything without being told very explicitly what to do. It needs a step by step set of instructions. The instructions should start with a calculation on the numbers for the image and should end with a prediction of what label to apply to that image. The image (converted to numbers) is the <em>“input”</em> and the label (also a number) is the <em>“output”</em>. You may have heard the phrase:</p> <blockquote> <p>“Garbage in, garbage out”</p> </blockquote> <p>What this phrase means is that if the inputs (the images) are bad - say they are all very dark or hard to see - then the output of the algorithm will also be bad, and the predictions won’t be very good.</p> <p>A machine learning <em>“algorithm”</em> can be thought of as a set of instructions with some of the parts left blank - sort of like mad-libs. 
One example of a really simple algorithm for sorting pictures into the album would be:</p> <blockquote> <ol> <li>Calculate the fraction of blue in the image.</li> <li>If the fraction of blue is above <em>X</em> label it 1</li> <li>If the fraction of blue is less than <em>X</em> label it 0</li> <li>Put all of the images labeled 1 in the album</li> </ol> </blockquote> <p>The machine <em>“learns”</em> by using the examples to fill in the blanks in the instructions. In the case of our really simple algorithm we need to figure out what fraction of blue to use (<em>X</em>) for labeling the picture.</p> <p>To figure out a guess for <em>X</em> we need to decide what we want the algorithm to do. If we set <em>X</em> to be too low then all of the images will be labeled with a 1 and put into the album. If we set <em>X</em> to be too high then all of the images will be labeled 0 and none will appear in the album. In between there is some grey area - do we care if we accidentally get some pictures of the ocean or the sky with our algorithm?</p> <p>But the number of images in the album isn’t even the thing we really care about. What we might care about is making sure that the album is mostly pictures of the author and his son. In the field of AI they usually turn this statement around - we want to make sure the album has a very small fraction of pictures that are not of the author and his son. This fraction - the fraction that are incorrectly placed in the album is called the <em>“loss”</em>. You can think about it like a game where the computer loses a point every time it puts the wrong kind of picture into the album.</p> <p>Using our loss (how many pictures we incorrectly placed in the album) we can now use the data we have created (the numbers for the labels and the images) to fill in the blanks in our mad-lib algorithm (picking the cutoff on the amount of blue). 
We have a large number of pictures where we know what fraction of each picture is blue and whether it is a picture of the author and his son or not. We can try each possible <em>X</em> and calculate the fraction of pictures in the album that are incorrectly placed into the album (the loss) and find the <em>X</em> that produces the smallest fraction.</p> <p>Suppose that the value of <em>X</em> that gives the smallest fraction of wrong pictures in the album is 0.1. Then our “learned” model would be:</p> <blockquote> <ol> <li>Calculate the fraction of blue in the image</li> <li>If the fraction of blue is above 0.1 label it 1</li> <li>If the fraction of blue is less than 0.1 label it 0</li> <li>Put all of the images labeled 1 in the album</li> </ol> </blockquote> <h2 id="the-interface">The interface</h2> <p>The last part of an AI application is the interface. In this case, the interface would be the way that we share the pictures with Dex’s grandmother. For example we could imagine uploading the pictures to <a href="https://www.shutterfly.com/">Shutterfly</a> and having the album delivered to Dex’s grandmother.</p> <p>Putting this all together we could imagine an application using our trained AI. The author uploads his unlabeled photos. The photos are then passed to the computer program which calculates the fraction of the image that is blue, then applies a label according to the algorithm we learned, then takes all the images predicted to be of the author and his son and sends them off to be a Shutterfly album mailed to the author’s mother.</p> <p><img src="https://raw.githubusercontent.com/simplystats/simplystats.github.io/master/_images/ai-album.png" alt="Whoa that computer is smart - from the author's picture to grandma's hands!" /></p> <p>If the algorithm was good, then from the perspective of the author the website would look “intelligent”. I just uploaded pictures and it created an album for me with the pictures that I wanted. 
But the steps in the process were very simple and understandable behind the scenes.</p> <h2 id="references">References</h2> <p>Leek, Jeffrey. n.d. “The Elements of Data Analytic Style.” <a href="https://leanpub.com/datastyle">https://leanpub.com/datastyle</a>.</p> <p>Raina, Rajat, Alexis Battle, Honglak Lee, Benjamin Packer, and Andrew Y Ng. 2007. “Self-Taught Learning: Transfer Learning from Unlabeled Data.” In <em>Proceedings of the 24th International Conference on Machine Learning</em>, 759–66. ICML ’07. New York, NY, USA: ACM.</p> <p>Wikipedia contributors. 2016. “Supervised Learning.” <a href="https://en.wikipedia.org/w/index.php?title=Supervised_learning&amp;oldid=752493505">https://en.wikipedia.org/w/index.php?title=Supervised_learning&amp;oldid=752493505</a>.</p> <p>———. 2017a. “Unsupervised Learning.” <a href="https://en.wikipedia.org/w/index.php?title=Unsupervised_learning&amp;oldid=760556815">https://en.wikipedia.org/w/index.php?title=Unsupervised_learning&amp;oldid=760556815</a>.</p> <p>———. 2017b. “Feature Engineering.” <a href="https://en.wikipedia.org/w/index.php?title=Feature_engineering&amp;oldid=760758719">https://en.wikipedia.org/w/index.php?title=Feature_engineering&amp;oldid=760758719</a>.</p> What is artificial intelligence? A three part definition 2017-01-19T00:00:00+00:00 http://simplystats.github.io/2017/01/19/what-is-artificial-intelligence <p><em>Editor’s note: This is the first chapter of a book I’m working on called <a href="https://leanpub.com/demystifyai/">Demystifying Artificial Intelligence</a>. The goal of the book is to demystify what modern AI is and does for a general audience. So something to smooth the transition between AI fiction and highly mathematical descriptions of deep learning. I’m developing the book over time - so if you buy the book on Leanpub know that there is only one chapter in there so far, but I’ll be adding more over the next few weeks and you get free updates. 
The cover of the book was inspired by this <a href="https://twitter.com/notajf/status/795717253505413122">amazing tweet</a> by Twitter user <a href="https://twitter.com/notajf/">@notajf</a>. Feedback is welcome and encouraged!</em></p> <h1 id="what-is-artificial-intelligence">What is artificial intelligence?</h1> <blockquote> <p>“If it looks like a duck and quacks like a duck but it needs batteries, you probably have the wrong abstraction” <a href="https://lostechies.com/derickbailey/2009/02/11/solid-development-principles-in-motivational-pictures/">Derick Bailey</a></p> </blockquote> <p>This book is about artificial intelligence. The term “artificial intelligence” or “AI” has a long and convoluted history (Cohen and Feigenbaum 2014). It has been used by philosophers, statisticians, machine learning experts, mathematicians, and the general public. This historical context means that when people say <em>artificial intelligence</em> the term is loaded with one of many potential different meanings.</p> <h2 id="humanoid-robots">Humanoid robots</h2> <p>Before we can demystify artificial intelligence it is helpful to have some context for what the word means. When asked about artificial intelligence, most people’s imagination leaps immediately to images of robots that can act like and interact with humans. Near-human robots have long been a source of fascination for humans and have appeared in cartoons like the <em>Jetsons</em> and science fiction like <em>Star Wars</em>. More recently, subtler forms of near-human robots with artificial intelligence have played roles in movies like <em>Her</em> and <em>Ex Machina</em>.</p> <p><img src="https://raw.githubusercontent.com/simplystats/simplystats.github.io/master/_images/movie-ai.png" alt="People usually think of artificial intelligence as a human-like robot performing all the tasks that a person could." 
/></p> <p>The type of artificial intelligence that can think and act like a human is something that experts call artificial general intelligence (Wikipedia contributors 2017a).</p> <blockquote> <p>is the intelligence of a machine that could successfully perform any intellectual task that a human being can</p> </blockquote> <p>There is an understandable fascination and fear associated with robots, created by humans, but evolving and thinking independently. While this is a major area of research (Laird, Newell, and Rosenbloom 1987) and of course the center of most people’s attention when it comes to AI, there is no near-term possibility of this type of intelligence (Urban, n.d.). There are a number of barriers to human-mimicking AI from difficulty with robotics (Couden 2015) to needed speedups in computational power (Langford, n.d.).</p> <p>One of the key barriers is that most current forms of the computer models behind AI are trained to do one thing really well, but cannot be applied beyond that narrow task. There are extremely effective artificial intelligence applications for translating between languages (Wu et al. 2016), for recognizing faces in images (Taigman et al. 2014), and even for driving cars (Santana and Hotz 2016).</p> <p>But none of these technologies are generalizable across the range of tasks that most adult humans can accomplish. For example, the AI application for recognizing faces in images cannot be directly applied to drive cars and the translation application couldn’t recognize a single image. While some of the internal technology used in the applications is the same, the final version of the applications can’t be transferred. This means that when we talk about artificial intelligence we are not talking about a general purpose humanoid replacement. 
Currently we are talking about technologies that can typically accomplish one or two specific tasks that a human could accomplish.</p> <h2 id="cognitive-tasks">Cognitive tasks</h2> <p>While modern AI applications couldn’t do everything that an adult could do (Baciu and Baciu 2016), they can perform individual tasks nearly as well as a human. There is a second commonly used definition of artificial intelligence that is considerably more narrow (Wikipedia contributors 2017b):</p> <blockquote> <p>… the term “artificial intelligence” is applied when a machine mimics “cognitive” functions that humans associate with other human minds, such as “learning” and “problem solving”.</p> </blockquote> <p>This definition encompasses applications like machine translation and facial recognition. They are “cognitive” functions that are usually only performed by humans. A difficulty with this definition is that it is relative. People refer to machines that can do tasks that we thought only humans could do as artificial intelligence. But over time, as we become used to machines performing a particular task it is no longer surprising and we stop calling it artificial intelligence. John McCarthy, one of the leading early figures in artificial intelligence, said (Vardi 2012):</p> <blockquote> <p>As soon as it works, no one calls it AI anymore…</p> </blockquote> <p>As an example, when you send a letter in the mail, there is a machine that scans the writing on the letter. A computer then “reads” the characters on the front of the letter. The computer reads the characters in several steps - the color of each pixel in the picture of the letter is stored in a data set on the computer. Then the computer uses an algorithm that has been built using thousands or millions of other letters to take the pixel data and turn it into predictions of the characters in the image. Then the characters are identified as addresses, names, zipcodes, and other relevant pieces of information. 
Those are then stored in the computer as text which can be used for sorting the mail.</p> <p>This task used to be considered “artificial intelligence” (Pavlidis, n.d.). It was surprising that a computer could perform the tasks of recognizing characters and addresses just based on a picture of the letter. This task is now called “optical character recognition” (Wikipedia contributors 2016). Many tutorials on the algorithms behind machine learning begin with this relatively simple task (Google Tensorflow Team, n.d.). Optical character recognition is now used in a wide range of applications including in Google’s effort to digitize millions of books (Darnton 2009).</p> <p>Since this type of algorithm has become so common it is no longer called “artificial intelligence”. This transition happened because we no longer think it is surprising that computers can do this task - so it is no longer considered intelligent. This process has played out with a number of other technologies. Initially it is thought that only a human can do a particular cognitive task. As computers become increasingly proficient at that task they are called artificially intelligent. Finally, when that task is performed almost exclusively by computers it is no longer considered “intelligent” and the boundary moves.</p> <p><img src="https://raw.githubusercontent.com/simplystats/simplystats.github.io/master/_images/timeline-ai.png" alt="Timeline of tasks we were surprised that computers could do as well as humans." /></p> <p>Over the last two decades tasks from optical character recognition, to facial recognition in images, to playing chess have started as artificially intelligent applications. At the time of this writing there are a number of technologies that are currently on the boundary between doable only by a human and doable by a computer. These are the tasks that are considered AI when you read about the term in the media. 
Examples of tasks that are currently considered “artificial intelligence” include:</p> <ul> <li>Computers that can drive cars</li> <li>Computers that can identify human faces from pictures</li> <li>Computers that can translate text from one language to another</li> <li>Computers that can label pictures with text descriptions</li> </ul> <p>Just as it used to be with optical character recognition, self-driving cars and facial recognition are tasks that still surprise us when performed by a computer. So we still call them artificially intelligent. Eventually, many or most of these tasks will be performed nearly exclusively by computers and we will no longer think of them as components of computer “intelligence”. To go a little further we can think about any task that is repetitive and performed by humans. For example, picking out music that you like or helping someone buy something at a store. An AI can eventually be built to do those tasks provided that: (a) there is a way of measuring and storing information about the tasks and (b) there is technology in place to perform the task if given a set of computer instructions.</p> <p>The more narrow definition of AI is used colloquially in the news to refer to new applications of computers to perform tasks previously thought impossible. It is important to know both the definition of AI used by the general public and the more narrow and relative definition used to describe modern applications of AI by companies like Google and Facebook. But neither of these definitions is satisfactory to help demystify the current state of artificial intelligence applications.</p> <h2 id="a-three-part-definition">A three part definition</h2> <p>The first definition describes a technology that we are not currently faced with - fully functional general purpose artificial intelligence. The second definition suffers from the fact that it is relative to the expectations of people discussing applications. 
For this book, we need a definition that is concrete, specific, and doesn’t change with societal expectations.</p> <p>We will consider specific examples of human-like tasks that computers can perform. So we will use the definition that artificial intelligence requires the following components:</p> <ol> <li><em>The data set</em> : A set of data examples that can be used to train a statistical or machine learning model to make predictions.</li> <li><em>The algorithm</em> : An algorithm that can be trained based on the data examples to take a new example and execute a human-like task.</li> <li><em>The interface</em> : An interface for the trained algorithm to receive a data input and execute the human-like task in the real world.</li> </ol> <p>This definition encompasses optical character recognition and all the more modern examples like self-driving cars. It is also intentionally broad, covering even examples where the data set is not large or the algorithm is not complicated. We will use our definition to break down modern artificial intelligence applications into their constituent parts and make it clear how the computer represents knowledge learned from data examples and then applies that knowledge.</p> <p>As one example, consider Amazon Echo and Alexa - an application currently considered to be artificially intelligent (Nuñez, n.d.). 
This combination meets our definition of artificially intelligent since each of the components is in place.</p> <ol> <li><em>The data set</em> : The large set of data examples consists of all the recordings that Amazon has collected of people talking to their Amazon devices.</li> <li><em>The machine learning algorithm</em> : The Alexa voice service (Alexa Developers 2016) is a machine learning algorithm trained using the previous recordings of people talking to Amazon devices.</li> <li><em>The interface</em> : The interface is the Amazon Echo (Amazon Inc 2016), a speaker that can record humans talking to it and respond with information or music.</li> </ol> <p><img src="https://raw.githubusercontent.com/simplystats/simplystats.github.io/master/_images/alexa-ai.png" alt="The three parts of an artificial intelligence illustrated with Amazon Echo and Alexa" /></p> <p>When we break down artificial intelligence into these steps it makes it clearer why there has been such a sudden explosion of interest in artificial intelligence over the last several years.</p> <p>First, the cost of data storage and collection has gone down steadily (Irizarry, n.d.) but dramatically (Quigley, n.d.) over the last several years. As the costs have come down, it is increasingly feasible for companies, governments, and even individuals to store large collections of data (Component 1 - <em>The Data</em>). To take advantage of these huge collections of data requires incredibly flexible statistical or machine learning algorithms that can capture most of the patterns in the data and re-use them for prediction. The most common type of algorithms used in modern artificial intelligence are something called “deep neural networks”. These algorithms are so flexible they capture nearly all of the important structure in the data. They can only be trained well if huge data sets exist and computers are fast enough. 
Continual increases in computing speed and power over the last several decades now make it possible to apply these models to huge collections of data (Component 2 - <em>The Algorithm</em>).</p> <p>Finally, the most underappreciated component of the AI revolution does not have to do with data or machine learning. Rather it is the development of new interfaces that allow people to interact directly with machine learning models. For a number of years now, if you were an expert with statistical and machine learning software it has been possible to build highly accurate predictive models. But if you were a person without technical training it was not possible to directly interact with algorithms.</p> <p>Or as statistical experts Diego Kuonen and Rafael Irizarry have put it:</p> <blockquote> <p>The big in big data refers to importance, not size</p> </blockquote> <p><img src="https://raw.githubusercontent.com/simplystats/simplystats.github.io/master/_images/importance-not-size.jpg" alt="It isn't about how much data you have, it is about how many people you can get to use it." /></p> <p>The explosion of interfaces for regular, non-technical people to interact with machine learning is an underappreciated driver of the AI revolution of the last several years. Artificial intelligence can now power labeling friends on Facebook, parsing your speech to your personal assistant Siri or Google Assistant, or providing you with directions in your car, or when you talk to your Echo. More recently sensors and devices make it possible for the instructions created by a computer to steer and drive a car.</p> <p>These interfaces now make it possible for hundreds of millions of people to directly interact with machine learning algorithms. These algorithms can range from exceedingly simple to mind-bendingly complex. But the common result is that the interface allows the computer to perform a human-like action and makes it look like artificial intelligence to the person on the other side. 
This interface explosion only promises to accelerate as we are building sensors for both data input and behavior output in objects from phones to refrigerators to cars (Component 3 - <em>The interface</em>).</p> <p>This definition of artificial intelligence in three components will allow us to demystify artificial intelligence applications from self driving cars to facial recognition. Our goal is to provide a high-level interface to the current conception of AI and how it can be applied to problems in real life. It will include discussion and references to the sophisticated models and data collection methods used by Facebook, Tesla, and other companies. However, the book does not assume a mathematical or computer science background and will attempt to explain these ideas in plain language. Of course, this means that some details will be glossed over, so we will attempt to point the interested reader toward more detailed resources throughout the book.</p> <h2 id="references">References</h2> <p>Alexa Developers. 2016. “Alexa Voice Service.” <a href="https://developer.amazon.com/alexa-voice-service">https://developer.amazon.com/alexa-voice-service</a>.</p> <p>Amazon Inc. 2016. “Amazon Echo.” <a href="https://www.amazon.com/Amazon-Echo-Bluetooth-Speaker-with-WiFi-Alexa/dp/B00X4WHP5E">https://www.amazon.com/Amazon-Echo-Bluetooth-Speaker-with-WiFi-Alexa/dp/B00X4WHP5E</a>.</p> <p>Baciu, Assaf, and Assaf Baciu. 2016. “Artificial Intelligence Is More Artificial Than Intelligent.” <em>Wired</em>, 7~dec.</p> <p>Cohen, Paul R, and Edward A Feigenbaum. 2014. <em>The Handbook of Artificial Intelligence</em>. Vol. 3. Butterworth-Heinemann. <a href="https://goo.gl/wg5rMk">https://goo.gl/wg5rMk</a>.</p> <p>Couden, Craig. 2015. “Why It’s so Hard to Make Humanoid Robots | Make:” <a href="http://makezine.com/2015/06/15/hard-make-humanoid-robots/">http://makezine.com/2015/06/15/hard-make-humanoid-robots/</a>.</p> <p>Darnton, Robert. 2009. <em>Google &amp; the Future of Books</em>. 
na.</p> <p>Google Tensorflow Team. n.d. “MNIST for ML Beginners | TensorFlow.” <a href="https://www.tensorflow.org/tutorials/mnist/beginners/">https://www.tensorflow.org/tutorials/mnist/beginners/</a>.</p> <p>Irizarry, Rafael. n.d. “The Big in Big Data Relates to Importance Not Size · Simply Statistics.” <a href="http://simplystatistics.org/2014/05/28/the-big-in-big-data-relates-to-importance-not-size/">http://simplystatistics.org/2014/05/28/the-big-in-big-data-relates-to-importance-not-size/</a>.</p> <p>Laird, John E, Allen Newell, and Paul S Rosenbloom. 1987. “Soar: An Architecture for General Intelligence.” <em>Artificial Intelligence</em> 33 (1). Elsevier: 1–64.</p> <p>Langford, John. n.d. “AlphaGo Is Not the Solution to AI « Machine Learning (Theory).” <a href="http://hunch.net/?p=3692542">http://hunch.net/?p=3692542</a>.</p> <p>Nuñez, Michael. n.d. “Amazon Echo Is the First Artificial Intelligence You’ll Want at Home.” <a href="http://www.popsci.com/amazon-echo-first-artificial-intelligence-youll-want-home">http://www.popsci.com/amazon-echo-first-artificial-intelligence-youll-want-home</a>.</p> <p>Pavlidis, Theo. n.d. “Computers Versus Humans - 2002 Lecture.” <a href="http://www.theopavlidis.com/comphumans/comphuman.htm">http://www.theopavlidis.com/comphumans/comphuman.htm</a>.</p> <p>Quigley, Robert. n.d. “The Cost of a Gigabyte over the Years.” <a href="http://www.themarysue.com/gigabyte-cost-over-years/">http://www.themarysue.com/gigabyte-cost-over-years/</a>.</p> <p>Santana, Eder, and George Hotz. 2016. “Learning a Driving Simulator,” 3~aug.</p> <p>Taigman, Y, M Yang, M Ranzato, and L Wolf. 2014. “DeepFace: Closing the Gap to Human-Level Performance in Face Verification.” In <em>2014 IEEE Conference on Computer Vision and Pattern Recognition</em>, 1701–8.</p> <p>Urban, Tim. n.d. 
“The AI Revolution: How Far Away Are Our Robot Overlords?” <a href="http://gizmodo.com/the-ai-revolution-how-far-away-are-our-robot-overlords-1684199433">http://gizmodo.com/the-ai-revolution-how-far-away-are-our-robot-overlords-1684199433</a>.</p> <p>Vardi, Moshe Y. 2012. “Artificial Intelligence: Past and Future.” <em>Commun. ACM</em> 55 (1). New York, NY, USA: ACM: 5–5.</p> <p>Wikipedia contributors. 2016. “Optical Character Recognition.” <a href="https://en.wikipedia.org/w/index.php?title=Optical_character_recognition&amp;oldid=757150540">https://en.wikipedia.org/w/index.php?title=Optical_character_recognition&amp;oldid=757150540</a>.</p> <p>———. 2017a. “Artificial General Intelligence.” <a href="https://en.wikipedia.org/w/index.php?title=Artificial_general_intelligence&amp;oldid=758867755">https://en.wikipedia.org/w/index.php?title=Artificial_general_intelligence&amp;oldid=758867755</a>.</p> <p>———. 2017b. “Artificial Intelligence.” <a href="https://en.wikipedia.org/w/index.php?title=Artificial_intelligence&amp;oldid=759177704">https://en.wikipedia.org/w/index.php?title=Artificial_intelligence&amp;oldid=759177704</a>.</p> <p>Wu, Yonghui, Mike Schuster, Zhifeng Chen, Quoc V Le, Mohammad Norouzi, Wolfgang Macherey, Maxim Krikun, et al. 2016. “Google’s Neural Machine Translation System: Bridging the Gap Between Human and Machine Translation,” 26~sep.</p> Got a data app idea? Apply to get it prototyped by the JHU DSL! 2017-01-18T00:00:00+00:00 http://simplystats.github.io/2017/01/18/data-prototyping-class <p><img src="https://raw.githubusercontent.com/simplystats/simplystats.github.io/master/_images/papr.png" alt="Get your app built" /></p> <p>Last fall we ran the first iteration of a class at the <a href="http://jhudatascience.org/">Johns Hopkins Data Science Lab</a> where we teach students to build data web-apps using Shiny, R, GoogleSheets and a number of other technologies. 
Our goals were to teach students to build data products, to reduce friction for students who want to build things with data, and to help people solve important data problems with web and SMS apps.</p> <p>We are going to be running a second iteration of our program from March-June this year. We are looking for awesome projects for students to build that solve real world problems. We are particularly interested in projects that could have a positive impact on health but are open to any cool idea. We generally build apps that are useful for:</p> <ul> <li><strong>Data donation</strong> - if you have a group of people you would like to donate data to your project.</li> <li><strong>Data collection</strong> - if you would like to build an app for collecting data from people.</li> <li><strong>Data visualization</strong> - if you have a data set and would like to have a web app for interacting with the data</li> <li><strong>Data interaction</strong> - if you have a statistical or machine learning model and you would like a web interface for it.</li> </ul> <p>But we are interested in any consumer-facing data product that you might be interested in having built. We want you to submit your wildest, most interesting ideas and we’ll see if we can get them built for you.</p> <p>We are hoping to solicit a large number of projects and then build as many as possible. The best part is that we will build the prototype for you for free! If you have an idea of something you’d like built please submit it to this <a href="https://docs.google.com/forms/d/1UPl7h8_SLw4zNFl_I9li_8GN14gyAEtPHtwO8fJ232E/edit?usp=forms_home&amp;ths=true">Google form</a>.</p> <p>Students in the class will select projects they are interested in during early March. We will let you know if your idea was selected for the program by mid-March. 
If you aren’t selected you will have the opportunity to roll your submission over to our next round of prototyping.</p> <p>I’ll be writing a separate post targeted at students, but if you are interested in being a data app prototyper, sign up <a href="http://jhudatascience.org/prototyping_students.html">here</a>.</p> Interview with Al Sommer - Effort Report Episode 23 2017-01-17T00:00:00+00:00 http://simplystats.github.io/2017/01/17/effort-report-episode-23 <p>My colleague <a href="https://twitter.com/elizabethmatsui">Elizabeth Matsui</a> and I had a great opportunity to talk with Al Sommer on the <a href="http://effortreport.libsyn.com/23-special-guest-al-sommer">latest episode</a> of our podcast <a href="http://effortreport.libsyn.com">The Effort Report</a>. Al is the former Dean of the Johns Hopkins Bloomberg School of Public Health and is Professor of Epidemiology and International Health at the School. He is (among other things) world renowned for his pioneering research in vitamin A deficiency and mortality in children.</p> <p>Al had some good bits of advice for academics and being successful in academia.</p> <blockquote> <p>What you are excited about and interested in at the moment, you’re much more likely to be successful at—because you’re excited about it! So you’re going to get up at 2 in the morning and think about it, you’re going to be putting things together in ways that nobody else has put things together. And guess what? When you do that you’re more successful [and] you actually end up getting academic promotions.</p> </blockquote> <p>On the slow rate of progress:</p> <blockquote> <p>It took ten years, after we had seven randomized trials already to show that you get this 1/3 reduction in child mortality by giving them two cents worth of vitamin A twice a year. 
It took ten years to convince the child survival Nawabs of the world, and there are still some that don’t believe it.</p> </blockquote> <p>On working overseas:</p> <blockquote> <p>It used to be true [that] it’s a lot easier to work overseas than it is to work here because the experts come from somewhere else. You’re never an expert in your own home.</p> </blockquote> <p>You can listen to the entire episode here:</p> <iframe style="border: none" src="//html5-player.libsyn.com/embed/episode/id/4992405/height/90/width/700/theme/custom/autonext/no/thumbnail/yes/autoplay/no/preload/no/no_addthis/no/direction/forward/render-playlist/no/custom-color/87A93A/" height="90" width="700" scrolling="no" allowfullscreen="" webkitallowfullscreen="" mozallowfullscreen="" oallowfullscreen="" msallowfullscreen=""></iframe> Not So Standard Deviations Episode 30 - Philately and Numismatology 2017-01-09T00:00:00+00:00 http://simplystats.github.io/2017/01/09/nssd-episode-30 <p>Hilary and I follow up on open data and data sharing in government. 
We also discuss artificial intelligence, self-driving cars, and doing your taxes in R.</p> <p>If you have questions you’d like Hilary and me to answer, you can send them to nssdeviations @ gmail.com or tweet us at <a href="https://twitter.com/nssdeviations">@NSSDeviations</a>.</p> <p>Show notes:</p> <ul> <li> <p>Lucy D’Agostino McGowan (@LucyStats) made a <a href="http://www.lucymcgowan.com/hill-for-data-scientists.html">great translation of Hill’s criteria using XKCD comics</a></p> </li> <li> <p><a href="http://www.lucymcgowan.com">Lucy’s web page</a></p> </li> <li> <p><a href="https://www.whitehouse.gov/sites/default/files/whitehouse_files/microsites/ostp/NSTC/preparing_for_the_future_of_ai.pdf">Preparing for the Future of Artificial Intelligence</a></p> </li> <li> <p><a href="http://12%20Dec%202016%20White%20House%20Special%20with%20DJ%20Patil,%20US%20Chief%20Data%20Scientist">Partially Derivative White House Special – with DJ Patil, US Chief Data Scientist</a></p> </li> <li> <p><a href="https://soundcloud.com/nssd-podcast/episode-29-standards-are-like-toothbrushes">Not So Standard Deviations – Standards are Like Toothbrushes – with Daniel Morgan, Chief Data Officer for the U.S. Department of Transportation and Terah Lyons, Policy Advisor to the Chief Technology Officer of the U.S.</a></p> </li> <li> <p><a href="http://www.hgitner.com">Henry Gitner Philatelists</a></p> </li> <li> <p><a href="https://drive.google.com/file/d/0B678uTpUfn80a2RkOUc5LW51cVU/view?usp=sharing">Some Pioneers of Modern Statistical Theory: A Personal Reflection by Sir David R. 
Cox</a></p> </li> </ul> <p><a href="https://soundcloud.com/nssd-podcast/episode-30-philately-and-numismatology">Download the audio for this episode</a></p> <p>Listen here:</p> <iframe width="100%" height="166" scrolling="no" frameborder="no" src="https://w.soundcloud.com/player/?url=https%3A//api.soundcloud.com/tracks/301065336&amp;color=ff5500&amp;auto_play=false&amp;hide_related=false&amp;show_comments=true&amp;show_user=true&amp;show_reposts=false"></iframe> Some things I've found help reduce my stress around science 2016-12-29T00:00:00+00:00 http://simplystats.github.io/2016/12/29/some-stress-reducers <p>Being a scientist can be pretty stressful for any number of reasons, from the peer review process, to getting funding, to <a href="http://simplystatistics.org/2015/11/16/so-you-are-getting-crushed-on-the-internet-the-new-normal-for-academics/">getting blown up on the internet</a>.</p> <p>Like a lot of academics I suffer from a lot of stress related to my own high standards and the imposter syndrome that comes from not meeting them on a regular basis. I was just reading through the excellent material in Lorena Barba’s class on <a href="https://barbagroup.github.io/essential_skills_RRC/">essential skills in reproducibility</a> and came across this <a href="http://www.stat.berkeley.edu/~stark/Seminars/reproNE16.htm#1">set of slides</a> by Phillip Stark. The one that caught my attention said:</p> <blockquote> <p>If I say just trust me and I’m wrong, I’m untrustworthy. If I say here’s my work and it’s wrong, I’m honest, human, and serving scientific progress.</p> </blockquote> <p>I love this quote because it shows how being open about both your successes and failures makes it less stressful to be a scientist. 
Inspired by this quote I decided to make a list of things that I’ve learned through hard experience do not help me with my own imposter syndrome and do help me to feel less stressed out about my science.</p> <ol> <li><em>Put everything out in the open.</em> We release all of our software, data, and analysis scripts. This has led to almost exclusively positive interactions with people as they help us figure out good and bad things about our work.</li> <li><em>Admit mistakes quickly.</em> Since my code/data are out in the open I’ve had people find little bugs and big whoa this is bad bugs in my code. I used to freak out when that happens. But I found the thing that minimizes my stress is to just quickly admit the error and submit updates/changes/revisions to code and papers as necessary.</li> <li><em>Respond to requests for support at my own pace.</em> I try to be as responsive as I can when people email me about software/data/code/papers of mine. I used to stress about doing this <em>right away</em> when I would get the emails. I still try to be prompt, but I don’t let that dominate my attention/time. I also prioritize things that are wrong/problematic and then later handle the requests for free consulting every open source person gets.</li> <li><em>Treat rejection as a feature not a bug.</em> This one is by far the hardest for me but preprints have helped a ton. The academic system is <em>designed</em> to be critical. That is a good thing, skepticism is one of the key tenets of the scientific process. It took me a while to just plan on one or two rejections for each paper, one or two or more rejections for each grant, etc. But now that I plan on the rejection I find I can just focus on how to steadily move forward and constructively address criticism rather than taking it as a personal blow.</li> <li><em>Don’t argue with people on the internet, especially on Twitter.</em> This is a new one for me and one I’m having to practice hard every single day. 
But I’ve found that I’ve had very few constructive debates on Twitter. I also found that this is almost purely negative energy for me and doesn’t help me accomplish much.</li> <li><em>Redefine success.</em> I’ve found that if I recalibrate what success means to include accomplishing tasks like peer reviewing papers, getting letters of recommendation sent at the right times, providing support to people I mentor, and the submission rather than the success of papers/grants then I’m much less stressed out.</li> <li><em>Don’t compare myself to other scientists.</em> It is <a href="http://simplystatistics.org/2015/02/09/the-trouble-with-evaluating-anything/">very hard to get good evaluation in science</a> and I’m extra bad at self-evaluation. Scientists are good in many different dimensions and so whenever I pick a one dimensional summary and compare myself to others there are always people who are “better” than me. I find I’m happier when I set internal, short term goals for myself and only compare myself to them.</li> <li><em>When comparing, at least pick a metric I’m good at.</em> I’d like to claim I never compare myself to others, but the reality is I do it more than I’d like. I’ve found one way to not stress myself out for my own internal comparisons is to pick metrics I’m good at - even if they aren’t the “right” metrics. That way at least if I’m comparing I’m not hurting my own psyche.</li> <li><em>Let myself be bummed sometimes.</em> Some days despite all of that I still get the imposter syndrome feels and can’t get out of the funk. I used to beat myself up about those days, but now I try to just build that into the rhythm of doing work.</li> <li><em>Try very hard to be positive in my interactions.</em> This is another hard one, because it is important to be skeptical/critical as a scientist. But I also try very hard to do that in as productive a way as possible. 
I try to assume other people are doing the right thing and I try very hard to stay positive or neutral when writing blog posts/opinion pieces, etc.</li> <li><em>Realize that giving credit doesn’t take away from me.</em> In my research career I have worked with some extremely <a href="http://genomics.princeton.edu/storeylab/">generous</a> <a href="http://rafalab.github.io/">mentors</a>. They taught me to always give credit whenever possible. I also learned from <a href="http://www.biostat.jhsph.edu/~rpeng/">Roger</a> that you can give credit and not lose anything yourself, in fact you almost always gain. Giving credit is low cost but feels really good so is a nice thing to help me feel better.</li> </ol> <p>The last thing I’d say is that having a blog has helped reduce my stress, because sometimes I’m having a hard time getting going on my big project for the day and I can quickly write a blog post and still feel like I got something done…</p> A non-comprehensive list of awesome things other people did in 2016 2016-12-20T00:00:00+00:00 http://simplystats.github.io/2016/12/20/noncomprehensive-list-of-awesome <p><em>Editor’s note: For the last few years I have made a list of awesome things that other people did (<a href="http://simplystatistics.org/2015/12/21/a-non-comprehensive-list-of-awesome-things-other-people-did-in-2015/">2015</a>, <a href="http://simplystatistics.org/2014/12/17/a-non-comprehensive-list-of-awesome-things-other-people-did-in-2014/">2014</a>, <a href="http://simplystatistics.org/2013/12/20/a-non-comprehensive-list-of-awesome-things-other-people-did-this-year/">2013</a>). Like in previous years I’m making a list, again right off the top of my head. If you know of some, you should make your own list or add it to the comments! I have also avoided talking about stuff I worked on or that people here at Hopkins are doing because this post is supposed to be about other people’s awesome stuff. 
I write this post because a blog often feels like a place to complain, but we started Simply Stats as a place to be pumped up about the stuff people were doing with data.</em></p> <ul> <li>Thomas Lin Pedersen created the <a href="https://github.com/thomasp85/tweenr">tweenr</a> package for interpolating graphs in animations. Check out this awesome <a href="https://twitter.com/thomasp85/status/809896220906897408">logo</a> he made with it.</li> <li>Yihui Xie is still blowing away everything he does. First it was <a href="https://bookdown.org/yihui/bookdown/">bookdown</a> and then the yolo feature in the <a href="https://github.com/yihui/xaringan">xaringan</a> package.</li> <li>J Alammar built this great <a href="https://jalammar.github.io/visual-interactive-guide-basics-neural-networks/">visual introduction to neural networks</a></li> <li>Jenny Bryan is working literal world wonders with legos to teach functional programming. I loved her <a href="https://speakerdeck.com/jennybc/data-rectangling">Data Rectangling</a> talk. The analogy between exponential families and data frames is so so good.</li> <li>Hadley Wickham’s book on <a href="http://r4ds.had.co.nz/">R for data science</a> is everything you’d expect. Super clear, great examples, just a really nice book.</li> <li>David Robinson is a machine put on this earth to create awesome data science stuff. Here he is <a href="http://varianceexplained.org/r/trump-tweets/">analyzing Trump’s tweets</a> and here he is on <a href="http://varianceexplained.org/r/hierarchical_bayes_baseball/">empirical Bayes modeling explained with baseball</a>.</li> <li>Julia Silge and David created the <a href="https://cran.r-project.org/web/packages/tidytext/index.html">tidytext</a> package. This is a holy moly big contribution to NLP in R. 
They also have a killer <a href="http://tidytextmining.com/">book on tidy text mining</a>.</li> <li>Julia used the package to do this <a href="http://juliasilge.com/blog/Reddit-Responds/">fascinating post</a> on mining Reddit after the election.</li> <li>It would be hard to pick just five different major contributions from JJ Allaire (great interview <a href="https://www.rstudio.com/rviews/2016/10/12/interview-with-j-j-allaire/">here</a>), Joe Cheng, and the rest of the RStudio folks. RStudio is absolutely <em>churning</em> out awesome stuff at a rate that is hard to keep up with. I loved <a href="https://blog.rstudio.org/2016/10/05/r-notebooks/">R notebooks</a> and have used them extensively for teaching.</li> <li>Konrad Kording and Brett Mensh full on mic dropped on how to write a paper with their <a href="http://biorxiv.org/content/early/2016/11/28/088278">10 simple rules piece</a>. Figure 1 from that paper should be affixed to the office of every student/faculty in the world permanently.</li> <li>Yaniv Erlich just can’t stop himself from doing interesting things like <a href="https://seeq.io/">seeq.io</a> and <a href="https://dna.land/">dna.land</a>.</li> <li>Thomaz Berisa and Joe Pickrell set up a freaking <a href="https://medium.com/the-seeq-blog/start-a-human-genomics-project-with-a-few-lines-of-code-dde90c4ef68#.g64meyjim">Python API for genomics projects</a>.</li> <li>DataCamp continues to do great things. I love their <a href="https://www.datacamp.com/community/blog/an-interview-with-david-robinson-data-scientist-at-stack-overflow">DataChats</a> series and they have been rolling out tons of new courses.</li> <li>Sean Rife and Michele Nuijten created <a href="http://statcheck.io/">statcheck.io</a> for checking papers for p-value calculation errors. 
This was all over the press, but I just like the site as dummy-proofing for myself.</li> <li>This was the artificial intelligence <a href="https://twitter.com/notajf/status/795717253505413122">tweet of the year</a></li> <li>I loved seeing PLoS Genetics start a policy of looking for papers in <a href="http://blogs.plos.org/plos/2016/10/the-best-of-both-worlds-preprints-and-journals/">biorxiv</a>.</li> <li>Matthew Stephens’ <a href="https://medium.com/@biostatistics/guest-post-matthew-stephens-on-biostatistics-pre-review-and-reproducibility-a14a26d83d6f#.usisi7kd3">post</a> on his preprint getting pre-accepted and reproducibility is also awesome. Preprints are so hot right now!</li> <li>Lorena Barba made this amazing <a href="https://hackernoon.com/barba-group-reproducibility-syllabus-e3757ee635cf#.2orb46seg">reproducibility syllabus</a> then <a href="https://twitter.com/LorenaABarba/status/809641955437051904">won the Leamer-Rosenthal prize</a> in open science.</li> <li>Colin Dewey continues to do just stellar stellar work, this time on <a href="http://biorxiv.org/content/early/2016/11/30/090506">re-annotating genomics samples</a>. This is one of the key open problems in genomics.</li> <li>I love FlowingData sooooo much. Here is one on <a href="http://flowingdata.com/2016/05/17/the-changing-american-diet/">the changing American diet</a>.</li> <li>If you like computational biology and data science and like <em>super</em> detailed reports of meetings/talks then <a href="https://twitter.com/michaelhoffman">Michael Hoffman</a> is your man. How he actually summarizes that much information in real time is still beyond me.</li> <li>I really really wish I had been at Alyssa Frazee’s talk at startup.ml but loved this <a href="http://www.win-vector.com/blog/2016/09/adversarial-machine-learning/">review of it</a>. Sampling, inverse probability weighting? 
Love that stats flavor!</li> <li>I have followed Cathy O’Neil for a long time in her persona as <a href="https://twitter.com/mathbabedotorg">mathbabedotorg</a> so it is no surprise to me that her new book <a href="https://www.amazon.com/dp/B019B6VCLO/ref=dp-kindle-redirect?_encoding=UTF8&amp;btkr=1">Weapons of Math Destruction</a> is so good. One of the best works on the ethics of data out there.</li> <li>A related and very important piece is on <a href="https://www.propublica.org/article/machine-bias-risk-assessments-in-criminal-sentencing">Machine bias in sentencing</a> by Julia Angwin, Jeff Larson, Surya Mattu and Lauren Kirchner at ProPublica.</li> <li>Dimitris Rizopoulos created this stellar <a href="http://iprogn.blogspot.com/2016/03/an-integrated-shiny-app-for-course-on.html">integrated Shiny app</a> for his repeated measures class. I wish I could build things half this nice.</li> <li>Daniel Engber’s piece on <a href="http://fivethirtyeight.com/features/who-will-debunk-the-debunkers/">Who will debunk the debunkers?</a> at fivethirtyeight just keeps getting more relevant.</li> <li>I rarely am willing to watch a talk posted on the internet, but <a href="https://www.youtube.com/watch?v=hps9r7JZQP8">Amelia McNamara’s talk on seeing nothing</a> was an exception. Plus she talks so fast #jealous.</li> <li>Sherri Rose’s post on <a href="http://drsherrirose.com/economic-diversity-and-the-academy-statistical-science">economic diversity in the academy</a> focuses on statistics but should be required reading for anyone thinking about diversity. 
Everything about it is impressive.</li> <li>If you like your data science with a side of Python you should definitely be checking out Jake Vanderplas’s <a href="http://shop.oreilly.com/product/0636920034919.do">data science handbook</a> and the associated <a href="https://github.com/jakevdp/PythonDataScienceHandbook">Jupyter notebooks</a>.</li> <li>I love Thomas Lumley <a href="http://www.statschat.org.nz/2016/12/19/sauna-and-dementia/">being snarky</a> about the stats news. It’s a guilty pleasure. If he ever collected them into a book I’d buy it (hint Thomas :)).</li> <li>Dorothy Bishop’s blog is one of the ones I read super regularly. Her post on <a href="http://deevybee.blogspot.com/2016/12/when-is-replication-not-replication.html">When is a replication a replication</a> is just one example of her very clearly explaining a complicated topic in a sensible way. I find that so hard to do and she does it so well.</li> <li>Ben Goldacre’s crowd is doing a bunch of interesting things. I really like their <a href="https://openprescribing.net/">OpenPrescribing</a> project.</li> <li>I’m really excited to see what Elizabeth Rhodes does with the experimental design for the <a href="http://blog.ycombinator.com/moving-forward-on-basic-income/">Ycombinator Basic Income Experiment</a>.</li> <li>Lucy D’Agostino McGowan made this <a href="http://www.lucymcgowan.com/hill-for-data-scientists.html">amazing explanation</a> of Hill’s criteria using xkcd.</li> <li>It is hard to overstate how good Leslie McClure’s blog is. This post on <a href="https://statgirlblog.wordpress.com/2016/09/16/biostatistics-is-public-health/">biostatistics is public health</a> should be read aloud at every SPH in the US.</li> <li>The ASA’s <a href="http://amstat.tandfonline.com/doi/abs/10.1080/00031305.2016.1154108">statement on p-values</a> is a really nice summary of all the issues around a surprisingly controversial topic. 
Ron Wasserstein and Nicole Lazar did a great job putting it together.</li> <li>I really liked <a href="http://jama.jamanetwork.com/article.aspx?articleId=2513561&amp;guestAccessKey=4023ce75-d0fb-44de-bb6c-8a10a30a6173">this piece</a> on the relationship between income and life expectancy by Raj Chetty and company.</li> <li>Christie Aschwanden continues to be the voice of reason on the <a href="http://fivethirtyeight.com/features/failure-is-moving-science-forward/">statistical crises in science</a>.</li> </ul> <p>That’s all I have for now, I know I’m missing things. Maybe my New Year’s resolution will be to keep better track of the awesome things other people are doing :).</p> The four eras of data 2016-12-16T00:00:00+00:00 http://simplystats.github.io/2016/12/16/the-four-eras-of-data <p>I’m teaching <a href="http://jtleek.com/advdatasci16/">a class in data science</a> for our masters and PhD students here at Hopkins. I’ve been teaching a variation on this class since 2011 and over time I’ve introduced a number of new components to the class: high-dimensional data methods (2011), data manipulation and cleaning (2012), real, possibly not doable data analyses (2012,2013), peer reviews (2014), building <a href="http://swirlstats.com/">swirl tutorials</a> for data analysis techniques (2015), and this year building data analytic web apps/R packages.</p> <p>I’m the least efficient teacher in the world, probably because I’m very self conscious about my teaching. So I always feel like I have to completely re-do my lecture materials every year I teach the class (I know, I know I’m a dummy). 
This year I was reviewing my notes on high-dimensional data and I was looking at this breakdown of the three eras of statistics from Brad Efron’s <a href="http://statweb.stanford.edu/~ckirby/brad/other/2010LSIexcerpt.pdf">book</a>:</p> <blockquote> <ol> <li>The age of Quetelet and his successors, in which huge census-level data sets were brought to bear on simple but important questions: Are there more male than female births? Is the rate of insanity rising?</li> <li>The classical period of Pearson, Fisher, Neyman, Hotelling, and their successors, intellectual giants who developed a theory of optimal inference capable of wringing every drop of information out of a scientific experiment. The questions dealt with still tended to be simple — Is treatment A better than treatment B? — but the new methods were suited to the kinds of small data sets individual scientists might collect.</li> <li>The era of scientific mass production, in which new technologies typi- fied by the microarray allow a single team of scientists to produce data sets of a size Quetelet would envy. But now the flood of data is accompanied by a deluge of questions, perhaps thousands of estimates or hypothesis tests that the statistician is charged with answering together; not at all what the classical masters had in mind.</li> </ol> </blockquote> <p>While I think this is a useful breakdown, I realized I think about it in a slightly different way as a statistician. My breakdown goes more like this:</p> <ol> <li><strong>The era of not much data</strong> This is everything prior to about 1995 in my field. The era when we could only collect a few measurements at a time. 
The whole point of statistics was to try to optimally squeeze information out of a small number of samples - so you see methods like maximum likelihood and minimum variance unbiased estimators being developed.</li> <li><strong>The era of lots of measurements on a few samples</strong> This one hit hard in biology with the development of the microarray and the ability to measure thousands of genes simultaneously. This is the same statistical problem as in the previous era but with a lot more noise added. Here you see the development of methods for multiple testing and regularized regression to separate signals from piles of noise.</li> <li><strong>The era of a few measurements on lots of samples</strong> This era is overlapping to some extent with the previous one. Large scale collections of data from EMRs and Medicare are examples where you have a huge number of people (samples) but a relatively modest number of variables measured. Here there is a big focus on statistical methods for knowing how to model different parts of the data with hierarchical models and separating signals of varying strength with model calibration.</li> <li><strong>The era of all the data on everything.</strong> This is an era that currently we as civilians don’t get to participate in. But Facebook, Google, Amazon, the NSA and other organizations have thousands or millions of measurements on hundreds of millions of people. 
Other than just sheer computing I’m speculating that a lot of the problem is in segmentation (like in era 3) coupled with avoiding crazy overfitting (like in era 2).</li> </ol> <p>I’ve focused here on the implications of these eras from a statistical modeling perspective, but as we discussed in my class, era 4 coupled with advances in machine learning methods means that there are social, economic, and behavioral implications of these eras as well.</p> Not So Standard Deviations Episode 28 - Writing is a lot Harder than Just Talking 2016-12-15T00:00:00+00:00 http://simplystats.github.io/2016/12/15/nssd-episode-28 <p>Hilary and I talk about building data science products that provide a good user experience while adhering to some kind of ground truth, whether it’s in medicine, education, news, or elsewhere. Also Gilmore Girls.</p> <p>If you have questions you’d like Hilary and me to answer, you can send them to nssdeviations @ gmail.com or tweet us at <a href="https://twitter.com/nssdeviations">@NSSDeviations</a>.</p> <p>Show notes:</p> <ul> <li><a href="https://en.wikipedia.org/wiki/Bradford_Hill_criteria">Hill’s criteria for causation</a></li> <li><a href="https://www.oreilly.com/topics/oreilly-bots-podcast">O’Reilly Bots Podcast</a></li> <li><a href="http://www.nhtsa.gov/nhtsa/av/index.html">NHTSA’s Federal Automated Vehicles Policy</a></li> <li>Subscribe to the podcast on <a href="https://itunes.apple.com/us/podcast/not-so-standard-deviations/id1040614570">iTunes</a> or <a href="https://play.google.com/music/listen?u=0#/ps/Izfnbx6tlruojkfrvhjfdj3nmna">Google Play</a>. 
And please <a href="https://itunes.apple.com/us/podcast/not-so-standard-deviations/id1040614570">leave us a review on iTunes</a>.</li> <li>Support us through our <a href="https://www.patreon.com/NSSDeviations?ty=h">Patreon page</a>.</li> <li>Get the <a href="https://leanpub.com/conversationsondatascience/">Not So Standard Deviations book</a>.</li> </ul> <p><a href="https://soundcloud.com/nssd-podcast/episode-28-writing-is-a-lot-harder-than-just-talking">Download the audio for this episode</a></p> <p>Listen here:</p> <iframe width="100%" height="166" scrolling="no" frameborder="no" src="https://w.soundcloud.com/player/?url=https%3A//api.soundcloud.com/tracks/297930039&amp;color=ff5500&amp;auto_play=false&amp;hide_related=false&amp;show_comments=true&amp;show_user=true&amp;show_reposts=false"></iframe> What is going on with math education in the US? 2016-12-09T00:00:00+00:00 http://simplystats.github.io/2016/12/09/pisa-us-math <p>When colleagues with young children seeking information about schools ask me if I like the Massachusetts public school my children attend, my answer is always the same: “it’s great…except for math”. The fact is that in our household we supplement our kids’ math education with significant extracurricular work in order to ensure that they receive a math education comparable to what we received as children in the public system.</p> <p>The latest <a href="http://www.businessinsider.com/pisa-worldwide-ranking-of-math-science-reading-skills-2016-12">results</a> from the Program for International Student Assessment (PISA) show that there is a general problem with math education in the US. Were it a country, Massachusetts would have been in second place in reading, sixth in science, but 20th in math, only ten points above the OECD average of 490. The US as a whole did not fare nearly as well as MA, and the same discrepancy between math and the other two subjects was present. 
In fact, among the top 30 performing countries ranked by their average of science and reading scores, the US has, by far, the largest discrepancy between math and the other two subjects tested by PISA. The difference of 27 was substantially greater than the second largest difference, which came from Finland at 17. Massachusetts had a difference of 28.</p> <p><img src="https://raw.githubusercontent.com/simplystats/simplystats.github.io/master/_images/pisa-2015-math-v-others.png" alt="PISA 2015 Math minus average of science and reading" /></p> <p>If we look at the trend of this difference since PISA was started 16 years ago, we see a disturbing progression. While science and reading have <a href="http://www.artofteachingscience.org/wp-content/uploads/2013/12/Screen-Shot-2013-12-17-at-9.28.38-PM.png">remained stable, math has declined</a>. In 2000 the difference between the results in math and the other subjects was only 8.5. Furthermore, the US is not performing exceptionally well in any subject:</p> <p><img src="https://raw.githubusercontent.com/simplystats/simplystats.github.io/master/_images/pisa-2015-scatter.png" alt="PISA 2015 Math versus average of science and reading" /></p> <p>So what is going on? I’d love to read theories in the comment section. From my experience comparing my kids’ public schools now with those that I attended, I have one theory of my own. When I was a kid there was a math textbook. Even when a teacher was bad, it provided structure and an organized alternative for learning on your own. Today this approach is seen as being “algorithmic” and has fallen out of favor. “Project based learning” coupled with group activities have become popular replacements.</p> <p>Project based learning is great in principle. But, speaking from experience, I can say it is very hard to come up with good projects, even for highly trained mathematical minds. And it is certainly much more time consuming for the instructor than following a textbook. 
Teachers don’t have more time now than they did 30 years ago so it is no surprise that this new more open approach leads to improvisation and mediocre lessons. A recent example of a pointless math project involved 5th graders picking a number and preparing a colorful poster showing “interesting” facts about this number. To make things worse in terms of math skills, students are often rewarded for effort, while correctness is secondary and often disregarded.</p> <p>Regardless of the reason for the decline, given the trends we are seeing, we need to rethink the approach to math education. Math education may have had its problems in the past, but recent evidence suggests that the reforms of the past few decades seem to have only worsened the situation.</p> <p>Note: To make these plots I downloaded the data and read it into R as described <a href="https://www.r-bloggers.com/pisa-2015-how-to-readprocessplot-the-data-with-r/">here</a>.</p> Not So Standard Deviations Episode 27 - Special Guest Amelia McNamara 2016-11-30T00:00:00+00:00 http://simplystats.github.io/2016/11/30/nssd-episode-27 <p>I had the pleasure of sitting down with Amelia McNamara, Visiting Assistant Professor of Statistical and Data Sciences at Smith College, to talk about data science, data journalism, visualization, the problems with R, and adult coloring books.</p> <p>If you have questions you’d like Hilary and me to answer, you can send them to nssdeviations @ gmail.com or tweet us at <a href="https://twitter.com/nssdeviations">@NSSDeviations</a>.</p> <p>Show notes:</p> <ul> <li> <p><a href="http://www.science.smith.edu/~amcnamara/index.html">Amelia McNamara’s web site</a></p> </li> <li> <p><a href="http://datascience.columbia.edu/mark-hansen">Mark Hansen</a></p> </li> <li> <p><a href="https://www.youtube.com/watch?v=dD36IajCz6A">Listening Post</a></p> </li> <li> <p><a href="http://www.nytimes.com/video/arts/1194817116105/moveable-type.html">Moveable Type</a></p> </li> <li> <p><a 
href="https://en.wikipedia.org/wiki/Alan_Kay">Alan Kay</a></p> </li> <li> <p><a href="https://harc.ycr.org/">HARC (Human Advancement Research Community)</a></p> </li> <li> <p><a href="http://www.vpri.org/index.html">VPRI (Viewpoints Research Institute)</a></p> </li> <li> <p><a href="https://www.youtube.com/watch?v=hps9r7JZQP8">Interactive essays</a></p> </li> <li> <p><a href="https://rafaelaraujoart.com/products/golden-ratio-coloring-book">Golden Ratio Coloring Book</a></p> </li> <li> <p>Subscribe to the podcast on <a href="https://itunes.apple.com/us/podcast/not-so-standard-deviations/id1040614570">iTunes</a> or <a href="https://play.google.com/music/listen?u=0#/ps/Izfnbx6tlruojkfrvhjfdj3nmna">Google Play</a>. And please <a href="https://itunes.apple.com/us/podcast/not-so-standard-deviations/id1040614570">leave us a review on iTunes</a>.</p> </li> <li> <p>Support us through our <a href="https://www.patreon.com/NSSDeviations?ty=h">Patreon page</a>.</p> </li> <li> <p>Get the <a href="https://leanpub.com/conversationsondatascience/">Not So Standard Deviations book</a>.</p> </li> </ul> <p><a href="https://soundcloud.com/nssd-podcast/episode-27-special-guest-amelia-mcnamara">Download the audio for this episode</a></p> <p>Listen here:</p> <iframe width="100%" height="166" scrolling="no" frameborder="no" src="https://w.soundcloud.com/player/?url=https%3A//api.soundcloud.com/tracks/295593774&amp;color=ff5500&amp;auto_play=false&amp;hide_related=false&amp;show_comments=true&amp;show_user=true&amp;show_reposts=false"></iframe> Help choose the Leek group color palette 2016-11-17T00:00:00+00:00 http://simplystats.github.io/2016/11/17/leekgroup-colors <p>My research group recently finished a paper where several different teams within the group worked on different analyses. 
If you are interested the paper describes the <a href="http://biorxiv.org/content/early/2016/08/08/068478">recount resource</a> which includes processed versions of thousands of human RNA-seq data sets.</p> <p>As part of this project each group had to contribute some plots to the paper. One thing that I noticed is that each person used their own color palette and theme when building the plots. When we wrote the paper this made it a little harder for the figures to all fit together - especially when different group members worked on a single panel of a multi-panel plot.</p> <p>So I started thinking about setting up a Leek group theme for both base R and ggplot2 graphics. One of the first problems was that every group member had their own opinion about what the best color palette would be. So we are running a little competition to determine what the official Leek group color palette for plots will be in the future.</p> <p>As part of that process, one of my awesome postdocs, Shannon Ellis, decided to collect some data on how people perceive different color palettes. The survey is here:</p> <p>https://docs.google.com/forms/d/e/1FAIpQLSfHMXVsl7pxYGarGowJpwgDSf9lA2DfWJjjEON1fhuCh6KkRg/viewform?c=0&amp;w=1</p> <p>If you have a few minutes and have an opinion about colors (I know you do!) please consider participating in our little poll and helping to determine the future of Leek group plots!</p> Open letter to my lab: I am not "moving to Canada" 2016-11-11T00:00:00+00:00 http://simplystats.github.io/2016/11/11/im-not-moving-to-canada <p>Dear Lab Members,</p> <p>I know that the results of Tuesday’s election have many of you concerned about your future. You are not alone. I am concerned about my future as well. But I want you to know that I have no plans of going anywhere and I intend to dedicate as much time to our projects as I always have. 
Meeting, discussing ideas and putting them into practice with you is, by far, the best part of my job.</p> <p>We are all concerned that if certain campaign promises are kept many of our fellow citizens may need our help. If this happens, then we will pause to do whatever we can to help. But I am currently cautiously optimistic that we will be able to continue focusing on helping society in the best way we know how: by doing scientific research.</p> <p>This week Dr. Francis Collins assured us that there is strong bipartisan support for scientific research. As an example consider <a href="http://www.nytimes.com/2015/04/22/opinion/double-the-nih-budget.html?_r=0">this op-ed</a> in which Newt Gingrich advocates for doubling the NIH budget. There also seems to be wide consensus in this country that scientific research is highly beneficial to society and an understanding that to do the best research we need the best of the best no matter their gender, race, religion or country of origin. Nothing good comes from creative, intelligent, dedicated people leaving science.</p> <p>I know there is much uncertainty but, as of now, there is nothing stopping us from continuing to work hard. My plan is to do just that and I hope you join me.</p> Not all forecasters got it wrong: Nate Silver does it again (again) 2016-11-09T00:00:00+00:00 http://simplystats.github.io/2016/11/09/not-all-forecasters-got-it-wrong <p>Four years ago we <a href="http://simplystatistics.org/2012/11/07/nate-silver-does-it-again-will-pundits-finally-accept/">posted</a> on Nate Silver’s, and other forecasters’, triumph over pundits. In contrast, after yesterday’s presidential election results contradicted most polls and data-driven forecasters, several news articles came out wondering how this happened. It is important to point out that not all forecasters got it wrong. 
Statistically speaking, Nate Silver, once again, got it right.</p> <p>To show this, below I include a plot showing the expected margin of victory for Clinton versus the actual results for the most competitive states provided by 538. It includes the uncertainty bands provided by 538 on <a href="http://projects.fivethirtyeight.com/2016-election-forecast/">this site</a> (I eyeballed the band sizes to make the plot in R, so they are not exactly like 538’s).</p> <p><img src="https://raw.githubusercontent.com/simplystats/simplystats.github.io/master/_images/us-election-2016-538-prediction.png" alt="538-2016-election" /></p> <p>Note that if these are 95% confidence/credible intervals, 538 got 1 wrong. This is exactly what we expect since 15/16 is about 95%. Furthermore, judging by the plot <a href="http://projects.fivethirtyeight.com/2016-election-forecast/">here</a>, 538 estimated the popular vote margin to be 3.6% with a confidence/credible interval of about 5%. This too was an accurate prediction since Clinton is going to win the popular vote by about 1% <del>0.5%</del> (note this final result is in the margin of error of several traditional polls as well). Finally, when other forecasters were giving Trump between 14% and 0.1% chances of winning, 538 gave him about a 30% chance which is slightly more than what a team has when down 3-2 in the World Series. In contrast, in 2012 538 gave Romney only a 9% chance of winning. Also, remember, if in ten election cycles you call it for someone with a 70% chance, you should get it wrong 3 times. If you get it right every time then your 70% statement was wrong.</p> <p>So how did 538 outperform all other forecasters? First, as far as I can tell they model the possibility of an overall bias, modeled as a random effect, that affects every state. This bias can be introduced by systematic lying to pollsters or undersampling some group. 
Note that this bias can’t be estimated from data from one election cycle but its variability can be estimated from historical data. 538 appears to estimate the standard error of this term to be about 2%. More details on this are included <a href="http://simplystatistics.org/html/midterm2012.html">here</a>. In 2016 we saw this bias and you can see it in the plot above (more points are above the line than below). The confidence bands account for this source of variability and furthermore their simulations account for the strong correlation you will see across states: the chance of seeing an upset in Pennsylvania, Wisconsin, and Michigan is <strong>not</strong> the product of the chances of an upset in each. In fact it’s much higher. Another advantage 538 had is that they somehow were able to predict a systematic, not random, bias against Trump. You can see this by comparing their adjusted data to the raw data (the adjustment favored Trump about 1.5 on average). We can clearly see this when comparing the 538 estimates to The Upshot’s:</p> <p><img src="https://raw.githubusercontent.com/simplystats/simplystats.github.io/master/_images/us-election-2016-538-v-upshot.png" alt="538-2016-election" /></p> <p>The fact that 538 did so much better than other forecasters should remind us how hard it is to do data analysis in real life. Knowing math, statistics and programming is not enough. It requires experience and a deep understanding of the nuances related to the specific problem at hand. Nate Silver and the 538 team seem to understand this more than others.</p> <p>Update: Jason Merkin points out (via Twitter) that 538 provides 80% credible intervals.</p> Data scientist on a chromebook take two 2016-11-08T00:00:00+00:00 http://simplystats.github.io/2016/11/08/chromebook-part2 <p>My friend Fernando showed me his collection of <a href="https://twitter.com/jtleek/status/795749713966497793">old Apple dongles</a> that no longer work with the latest generation of Apple devices. 
This, coupled with the announcement of the MacBook Pro that promises way more dongles and mostly the same computing, had me freaking out about my computing platform for the future. I’ve been using cloudy tools for more and more of what I do and so it had me wondering if it was time to go back and try my <a href="http://simplystatistics.org/2012/01/09/a-statistician-and-apple-fanboy-buys-a-chromebook-and/">Chromebook experiment</a> again. Basically the question is whether I can do everything I need to do comfortably on a Chromebook.</p> <p>So to execute the experiment I got a brand new <a href="https://www.asus.com/us/Notebooks/ASUS_Chromebook_Flip_C100PA/">ASUS chromebook flip</a> and the connector I need to plug it into HDMI monitors (there is no escaping at least one dongle I guess :(). Here is what that badboy looks like in my home office with Apple superfanboy Roger on the screen.</p> <p><img src="https://raw.githubusercontent.com/simplystats/simplystats.github.io/master/_images/chromebook2.jpg" alt="chromebook2" /></p> <p>In terms of software there have been some major improvements since I last tried this experiment out. Some of these I talk about in my book <a href="https://leanpub.com/modernscientist">How to be a modern scientist</a>. 
As of this writing this is my current setup:</p> <ul> <li>Music on <a href="https://play.google.com">Google Play</a></li> <li>LaTeX on <a href="https://www.overleaf.com">Overleaf</a></li> <li>Blog/website/code on <a href="https://github.com/">GitHub</a></li> <li>R programming on an <a href="http://www.louisaslett.com/RStudio_AMI/">Amazon AMI with RStudio loaded</a> although <a href="https://twitter.com/earino/status/795750908457984000">I hear</a> there may be other options that are good there that I should try.</li> <li>Email/Calendar/Presentations/Spreadsheets/Docs with <a href="https://www.google.com/">Google</a> products</li> <li>Twitter with <a href="https://tweetdeck.twitter.com/">Tweetdeck</a></li> </ul> <p>That handles the vast majority of my workload so far (it’s only been a day :)). But I would welcome suggestions and I’ll report back when either I give up or if things are still going strong in a little while….</p> Not So Standard Deviations Episode 25 - How Exactly Do You Pronounce SQL? 2016-10-28T00:00:00+00:00 http://simplystats.github.io/2016/10/28/nssd-episode-25 <p>Hilary and I go through the overflowing mailbag to respond to listener questions! 
Topics include causal inference in trend modeling, regression model selection, using SQL, and data science certification.</p> <p>If you have questions you’d like us to answer, you can send them to nssdeviations @ gmail.com or tweet us at <a href="https://twitter.com/nssdeviations">@NSSDeviations</a>.</p> <p>Show notes:</p> <ul> <li> <p><a href="https://www.amazon.com/gp/product/B0017LNHY2/">Professor Kobre’s Lightscoop Standard Version Bounce Flash Device</a></p> </li> <li> <p><a href="https://www.speechpad.com">Speechpad</a></p> </li> <li> <p><a href="https://www.amazon.com/gp/product/0544703391/">Speaking American by Josh Katz</a></p> </li> <li> <p><a href="https://medium.com/@josh_nussbaum/data-sets-are-the-new-server-rooms-40fdb5aed6b0?_hsenc=p2ANqtz-8IHAReMPP2JjyYs6TqyMYCnjUapQdLQFEaQOjNX9BfUhZV2nzXWwy2NHJHrCs-VN67GxT4djKCUWq8tkhTyiQkb965bg&amp;_hsmi=36470868#.wybl0l3p7">Data Sets Are The New Server Rooms</a></p> </li> <li> <p><a href="http://simplystatistics.org/2016/10/26/datasets-new-server-rooms/">Are Datasets the New Server Rooms?</a></p> </li> <li> <p>Subscribe to the podcast on <a href="https://itunes.apple.com/us/podcast/not-so-standard-deviations/id1040614570">iTunes</a> or <a href="https://play.google.com/music/listen?u=0#/ps/Izfnbx6tlruojkfrvhjfdj3nmna">Google Play</a>. 
And please <a href="https://itunes.apple.com/us/podcast/not-so-standard-deviations/id1040614570">leave us a review on iTunes</a>.</p> </li> <li> <p>Support us through our <a href="https://www.patreon.com/NSSDeviations?ty=h">Patreon page</a>.</p> </li> <li> <p>Get the <a href="https://leanpub.com/conversationsondatascience/">Not So Standard Deviations book</a>.</p> </li> </ul> <p><a href="https://soundcloud.com/nssd-podcast/episode-25-how-exactly-do-you-pronounce-sql">Download the audio for this episode</a></p> <p>Listen here:</p> <iframe width="100%" height="166" scrolling="no" frameborder="no" src="https://w.soundcloud.com/player/?url=https%3A//api.soundcloud.com/tracks/290164484&amp;color=ff5500&amp;auto_play=false&amp;hide_related=false&amp;show_comments=true&amp;show_user=true&amp;show_reposts=false"></iframe> Are Datasets the New Server Rooms? 2016-10-26T00:00:00+00:00 http://simplystats.github.io/2016/10/26/datasets-new-server-rooms <p>Josh Nussbaum has an <a href="https://medium.com/@josh_nussbaum/data-sets-are-the-new-server-rooms-40fdb5aed6b0?_hsenc=p2ANqtz-8IHAReMPP2JjyYs6TqyMYCnjUapQdLQFEaQOjNX9BfUhZV2nzXWwy2NHJHrCs-VN67GxT4djKCUWq8tkhTyiQkb965bg&amp;_hsmi=36470868#.wz8f23tak">interesting post</a> over at Medium about whether massive datasets are the new server rooms of tech business.</p> <p>The analogy comes from the “old days” where in order to start an Internet business, you had to buy racks and servers, rent server space, buy network bandwidth, license expensive server software, backups, and on and on. In order to do all that up front, it required a substantial amount of capital just to get off the ground. 
As inconvenient as this might have been, it provided an immediate barrier to entry for any other competitors who weren’t able to raise similar capital.</p> <p>Of course,</p> <blockquote> <p>…the emergence of open source software and cloud computing completely eviscerated the costs and barriers to starting a company, leading to deflationary economics where one or two people could start their company without the large upfront costs that were historically the hallmark of the VC industry.</p> </blockquote> <p>So if startups don’t have huge capital costs in the beginning, what costs <em>do</em> they have? Well, for many new companies that rely on machine learning, they need to collect data.</p> <blockquote> <p>As a startup collects the data necessary to feed their ML algorithms, the value the product/service provides improves, allowing them to access more customers/users that provide more data and so on and so forth.</p> </blockquote> <p>Collecting huge datasets ultimately costs money. The sooner a startup can raise money to get that data, the sooner they can defend themselves from competitors who may not yet have collected the huge datasets for training their algorithms.</p> <p>I’m not sure the analogy between datasets and server rooms quite works. Even back when you had to pay a lot of up-front costs to set up servers and racks, a lot of that technology was already a commodity, and anyone could have access to it for a price.</p> <p>I see massive datasets used to train machine learning algorithms as more like the new proprietary software. The startups of yore spent a lot of time writing custom software for what we might now consider mundane tasks. This was a time-consuming activity but the software that was developed had value and was a differentiator for the company. Today, many companies write complex machine learning algorithms, but those algorithms and their implementations are quickly becoming commodities. 
So the only thing that separates one company from another is the amount and quality of data that they have to train those algorithms.</p> <p>Going forward, it will be interesting to see what these companies will do with those massive datasets once they no longer need them. Will they “open source” them and make them available to everyone? Could there be an open data movement analogous to the open source movement?</p> <p>For the most part, I doubt it. While I think many today would perhaps sympathize with the sentiment that <a href="https://www.gnu.org/gnu/manifesto.en.html">software shouldn’t have owners</a>, those same people I think would argue vociferously that data most certainly do have owners. I’m not sure how I’d feel if Facebook made all their data available to anyone. That said, many datasets are made available by various businesses, and as these datasets grow in number and in usefulness, we may see a day where the collection of data is not a key barrier to entry, and where you can train your machine learning algorithm on whatever is out there.</p> Distributed Masochism as a Pedagogical Model 2016-10-20T00:00:00+00:00 http://simplystats.github.io/2016/10/20/distributed-masochism-as-a-pedagogical-model <p><em>Editor’s note: This is a guest post by <a href="http://seankross.com/">Sean Kross</a>. Sean is a software developer in the Department of Biostatistics at the Johns Hopkins Bloomberg School of Public Health. Sean has contributed to several of our specializations including <a href="https://www.coursera.org/specializations/jhu-data-science">Data Science</a>, <a href="https://www.coursera.org/specializations/executive-data-science">Executive Data Science</a>, and <a href="https://www.coursera.org/specializations/r">Mastering Software Development in R</a>. 
He tweets <a href="https://twitter.com/seankross">@seankross</a>.</em></p> <p>Over the past few months I’ve been helping Jeff develop the Advanced Data Science class he’s teaching at the Johns Hopkins Bloomberg School of Public Health. We’ve been trying to identify technologies that we can teach to students which (we hope) will enable them to rapidly prototype data-based software applications which will serve a purpose in public health. We started with technologies that we’re familiar with (R, Shiny, static websites) but we’re also trying to teach ourselves new technologies (the Amazon Alexa Skills API, iOS and Swift). We’re teaching skills that we know intimately along with skills that we’re learning on the fly which is a style of teaching that we’ve practiced <a href="https://www.coursera.org/specializations/jhu-data-science">several</a> <a href="https://www.coursera.org/specializations/r">times</a>.</p> <p>Jeff and I have come to realize that while building new courses with technologies that are new to us we experience particular pains and frustrations which, when documented, become valuable learning resources for our students. This process of documenting new-tech-induced pain is only a preliminary step. When we actually launch classes either online or in person our students run into new frustrations which we respond to with changes to either documentation or course content. This process of quickly iterating on course material is especially enhanced in online courses where a course lasts a few weeks rather than a full semester, so kinks in the course are ironed out at a faster rate compared to traditional in-person courses. All of the material in our courses is open-source and available on GitHub, and we teach our students how to use Git and GitHub. We can take advantage of improvements and contributions the students think we should make to our courses through pull requests that we receive. 
Student contributions further reduce the overall start-up pain experienced by other students.</p> <p>With students from all over the world participating in our online courses we’re unable to anticipate every technical need considering different locales, languages, and operating systems. Instead of being anxious about this reality we depend on a system of “distributed masochism” whereby documenting every student’s unique technical learning pains is an important aspect of improving the online learning experience. Since we only have a few months head start using some of these technologies compared to our students it’s likely that as instructors we’ve recently climbed a similar learning curve which makes it easier for us to help our students. We believe that this approach of teaching new technologies by allowing any student to contribute to open course material allows a course to rapidly adapt to students’ needs and to the inevitable changes and upgrades that are made to new technologies.</p> <p>I’m extremely interested in communicating with anyone else who is using similar techniques, so if you’re interested please contact me via Twitter (<a href="https://twitter.com/seankross">@seankross</a>) or send me an email: sean at seankross.com.</p> Not So Standard Deviations Episode 24 - 50 Minutes of Blathering 2016-10-16T00:00:00+00:00 http://simplystats.github.io/2016/10/16/nssd-episode-24 <p>Another IRL episode! Hilary and I met at a Jimmy John’s to talk data science, like you do. Topics covered include RStudio Conf, polling, millennials, Karl Broman, and more!</p> <p>If you have questions you’d like us to answer, you can send them to nssdeviations @ gmail.com or tweet us at <a href="https://twitter.com/nssdeviations">@NSSDeviations</a>.</p> <p>Subscribe to the podcast on <a href="https://itunes.apple.com/us/podcast/not-so-standard-deviations/id1040614570">iTunes</a> or <a href="https://play.google.com/music/listen?u=0#/ps/Izfnbx6tlruojkfrvhjfdj3nmna">Google Play</a>. 
And please <a href="https://itunes.apple.com/us/podcast/not-so-standard-deviations/id1040614570">leave us a review on iTunes</a>.</p> <p>Support us through our <a href="https://www.patreon.com/NSSDeviations?ty=h">Patreon page</a>.</p> <p>Get the <a href="https://leanpub.com/conversationsondatascience/">Not So Standard Deviations book</a>.</p> <p>Show notes:</p> <ul> <li> <p><a href="https://www.rstudio.com/conference/">rstudio::conf</a></p> </li> <li> <p><a href="http://www.nytimes.com/interactive/2016/09/20/upshot/the-error-the-polling-world-rarely-talks-about.html?_r=0">We Gave Four Good Pollsters the Same Raw Data. They Had Four Different Results</a></p> </li> <li> <p><a href="https://en.wikipedia.org/wiki/Millennials">Millennials</a></p> </li> <li> <p><a href="http://kbroman.org">Karl Broman</a></p> </li> <li> <p><a href="https://www.rstudio.com/2016/10/12/interview-with-j-j-allaire/">Interview with J.J. Allaire</a></p> </li> <li> <p><a href="http://varianceexplained.org/r/year_data_scientist/">One Year at Stack Overflow</a></p> </li> </ul> <p><a href="https://soundcloud.com/nssd-podcast/episode-24-50-minutes-of-blathering">Download the audio for this episode</a></p> <p>Listen here:</p> <iframe width="100%" height="166" scrolling="no" frameborder="no" src="https://w.soundcloud.com/player/?url=https%3A//api.soundcloud.com/tracks/287815210&amp;color=ff5500&amp;auto_play=false&amp;hide_related=false&amp;show_comments=true&amp;show_user=true&amp;show_reposts=false"></iframe> Should I make a chatbot or a better FAQ? 2016-10-14T00:00:00+00:00 http://simplystats.github.io/2016/10/14/chatabot-or-faq <p>Roger pointed me to this <a href="https://www.theinformation.com/behind-facebooks-messenger-missteps">interesting article</a> (paywalled, sorry!) about Facebook’s chatbot service. I think the article made a couple of interesting points. 
The first thing I thought was interesting was their explicit acknowledgement of the process I outlined in a previous post for building an AI startup - (1) convince (or in this case pay) some humans to be your training set, and (2) collect the data on the humans and then use it to build your AI.</p> <p>The other point that is pretty fascinating is that they realized how many data points they would need before they could reasonably replace a human with an AI chatbot. The original estimate was tens of thousands and the ultimate number was millions or more. I have been thinking a lot that the AI “revolution” is just a tradeoff between parameters and data points. If you have a billion parameter prediction algorithm it may work pretty awesome - as long as you have a few hundred billion data points to train it with.</p> <p>But the theme of the article was that chatbots may have had some mis-steps/may not be ready for prime time. I think the main reason is that at the moment most AI efforts can only report facts, not intuit intention and alter the question for the user or go beyond the facts/state of the world.</p> <p>One example I’ve run into recently was booking a ticket on an airline. I wanted to know if I could make a certain change to my ticket. The airline didn’t have any information about the change I wanted to make online. After checking thoroughly I clicked on the “Chat with an agent” button and was directed to what was clearly a chatbot. The chatbot asked a question or two and then sent me to the “make changes to a ticket” page of the website.</p> <p>I eventually had to call and get a person on the phone, because what I wanted to ask about didn’t apply to the public information. They set me straight and I booked the ticket. The chatbot wasn’t helpful because it could only respond with information it had available on the website. 
It couldn’t identify a new situation, realize it had to ask around, figure out there was an edge case, and then make a ruling/help out.</p> <p>I would guess that most of the time if a person interacts with a chatbot they are doing it only because they already looked at all the publicly available information on the FAQ, etc. and couldn’t find it. So an alternative solution, which would require a lot less work and a much smaller training set, is to just have a more complete FAQ.</p> <p>The question to me is does anyone other than Facebook or Google have a big enough training set to make a chatbot worth it?</p> The Dangers of Weighting Up a Sample 2016-10-12T00:00:00+00:00 http://simplystats.github.io/2016/10/12/weighting-survey <p>There’s a <a href="http://www.nytimes.com/2016/10/13/upshot/how-one-19-year-old-illinois-man-is-distorting-national-polling-averages.html">great story</a> by Nate Cohn over at the New York Times’ Upshot about the dangers of “weighting up” a sample from a survey. In this case, it is in regards to a U.S.C/LA Times poll asking who people will vote for President:</p> <blockquote> <p>The U.S.C./LAT poll weights for many tiny categories: like 18-to-21-year-old men, which U.S.C./LAT estimates make up around 3.3 percent of the adult citizen population. Weighting simply for 18-to-21-year-olds would be pretty bold for a political survey; 18-to-21-year-old men is really unusual.</p> </blockquote> <p>The U.S.C./LA Times poll apparently goes even further:</p> <blockquote> <p>When you start considering the competing demands across multiple categories, it can quickly become necessary to give an astonishing amount of extra weight to particularly underrepresented voters — like 18-to-21-year-old black men. This wouldn’t be a problem with broader categories, like those 18 to 29, and there aren’t very many national polls that are weighting respondents up by more than eight or 10-fold. 
The extreme weights for the 19-year-old black Trump voter in Illinois are not normal.</p> </blockquote> <p>It’s worth noting (as a good thing) that the U.S.C./LA Times poll data is completely open, thus allowing the NYT to reproduce this entire analysis.</p> <p>I haven’t done much in the way of survey analyses, but I’ve done some inverse probability weighting and in my experience it can be a tricky procedure in ways that are not always immediately obvious. The article discusses weight trimming, but also notes the dangers of that procedure. Overall, a good treatment of a complex issue.</p> Information and VC Investing 2016-10-03T00:00:00+00:00 http://simplystats.github.io/2016/10/03/the-information-vc <p>Sam Lessin at The Information has a <a href="http://go.theinformation.com/xXfQ5plmVMI">nice post</a> (sorry, paywall, but it’s a great publication) about how increased measurement and analysis is changing the nature of venture capital investing.</p> <blockquote> <p>This brings me back to what is happening at series A financings. Investors have always, obviously, tried to do diligence at all financing rounds. But series A investments used to be an exercise in a few top-level metrics a company might know, some industry interviews and analysis, and a whole lot of trust. The data that would drive capital market efficiency usually just wasn’t there, so capital was expensive and there were opportunities for financiers. Now, I am seeing more and more that after a seed round to boot up most companies, the backbone of a series A financing is an intense level of detail in reporting and analytics. It can be that way because the companies have the data</p> </blockquote> <p>I’ve seen this happen in other areas where data comes in to disrupt the way things are done. Good analysis only gives you an advantage if no one else is doing it. 
Once everyone accepts the idea and everyone has the data (and a good analytics team), there’s no more value left in the market.</p> <p>Time to search elsewhere.</p> papr - it's like tinder, but for academic preprints 2016-10-03T00:00:00+00:00 http://simplystats.github.io/2016/10/03/papr <p>As part of the <a href="http://jhudatascience.org/">Johns Hopkins Data Science Lab</a> we are setting up a web and mobile <a href="http://jhudatascience.org/prototyping/">data product prototyping shop</a>. As part of that process I’ve been working on different types of very cheap and easy to prototype apps. A few days ago I posted about creating a <a href="http://simplystatistics.org/2016/08/26/googlesheets/">distributed data collection app with Google Sheets</a>.</p> <p>So for fun I built another kind of app. This one I’m calling <a href="https://jhubiostatistics.shinyapps.io/papr/">papr</a> and it’s sort of like “Tinder for preprints”. I scraped all of the papers out of the <a href="http://biorxiv.org/">http://biorxiv.org/</a> database. When you open the app you see one at random and you can rate it according to two axes:</p> <ul> <li><em>Is the paper interesting?</em> - a paper can be rated as exciting or boring. We leave the definitions of those terms up to you.</li> <li><em>Is the paper correct or questionable?</em> - a paper can either be solidly correct or potentially questionable in its results. We leave the definitions of those terms up to you.</li> </ul> <p>When you click on your rating you are shown another randomly selected paper from bioRxiv. You can “level up” to different levels if you rate more papers. You can also download your ratings at any time.</p> <p>If you have any feedback on the app I’d love to hear it and if anyone knows how to get custom domain names to work with shinyapps.io I’d also love to hear from you. 
I tried the instructions with no luck…</p> <p>Try the app here:</p> <p>https://jhubiostatistics.shinyapps.io/papr/</p> Not So Standard Deviations Episode 23 - Special Guest Walt Hickey 2016-10-01T00:00:00+00:00 http://simplystats.github.io/2016/10/01/nssd-episode-23 <p>Hilary and Roger invite Walt Hickey of FiveThirtyEight.com on to the show to talk about polling, movies, and data analysis reproducibility (of course).</p> <p>If you have questions you’d like us to answer, you can send them to nssdeviations @ gmail.com or tweet us at <a href="https://twitter.com/nssdeviations">@NSSDeviations</a>.</p> <p>Subscribe to the podcast on <a href="https://itunes.apple.com/us/podcast/not-so-standard-deviations/id1040614570">iTunes</a> or <a href="https://play.google.com/music/listen?u=0#/ps/Izfnbx6tlruojkfrvhjfdj3nmna">Google Play</a>.</p> <p>Please <a href="https://itunes.apple.com/us/podcast/not-so-standard-deviations/id1040614570">leave us a review on iTunes</a>.</p> <p>Support us through our <a href="https://www.patreon.com/NSSDeviations?ty=h">Patreon page</a>.</p> <p>Get the <a href="https://leanpub.com/conversationsondatascience/">Not So Standard Deviations book</a>.</p> <p>Show Notes:</p> <ul> <li> <p><a href="http://fivethirtyeight.com/features/a-users-guide-to-fivethirtyeights-2016-general-election-forecast/">FiveThirtyEight’s polling methodology</a></p> </li> <li> <p><a href="https://twitter.com/walthickey">Walt Hickey on Twitter</a></p> </li> <li> <p><a href="http://fivethirtyeight.com/features/the-20-most-extreme-cases-of-the-book-was-better-than-the-movie/">The 20 Most Extreme Cases Of ‘The Book Was Better Than The Movie’</a></p> </li> <li> <p><a href="http://practicaltypography.com">Matthew Butterick Typography</a></p> </li> <li> <p><a href="http://www.hoppstudios.com">Hopp</a></p> </li> </ul> <p><a href="https://soundcloud.com/nssd-podcast/episode-23-special-guest-walt-hickey">Download the audio for this episode</a>.</p> <p>Listen here:</p> <iframe width="100%" 
height="166" scrolling="no" frameborder="no" src="https://w.soundcloud.com/player/?url=https%3A//api.soundcloud.com/tracks/285159790&amp;color=ff5500&amp;auto_play=false&amp;hide_related=false&amp;show_comments=true&amp;show_user=true&amp;show_reposts=false"></iframe> Statistical vitriol 2016-09-29T00:00:00+00:00 http://simplystats.github.io/2016/09/29/statistical-vitriol <p>Over the last few months there has been a lot of vitriol around statistical ideas. First there were <a href="http://www.nejm.org/doi/full/10.1056/NEJMe1516564">data parasites</a> and then there were <a href="https://www.dropbox.com/s/9zubbn9fyi1xjcu/Fiske%20presidential%20guest%20column_APS%20Observer_copy-edited.pdf">methodological terrorists</a>. These epithets came from established scientists who have relatively little statistical training. There was the predictable backlash to these folks from their counterparties, typically statisticians or statistically trained folks who care about open source.</p> <p>I’m a statistician who cares about open source but I also frequently collaborate with scientists from different fields. It makes me sad and frustrated that statistics - which I’m so excited about and have spent my entire professional career working on - is something that is causing so much frustration, anxiety, and anger.</p> <p>I have been thinking a lot about the cause of this anger and division in the sciences. As a person who interacts with both groups pretty regularly I think that the reasons are some combination of the following.</p> <ol> <li>Data is now everywhere, so every single publication involves some level of statistical modeling and analysis. 
It can’t be escaped.</li> <li>The deluge of scientific papers means that only big claims get your work noticed, get you into fancy journals, and get you attention.</li> <li>Most senior scientists, the ones leading and designing studies, <a href="http://simplystatistics.org/2012/04/27/people-in-positions-of-power-that-dont-understand/">have little or no training in statistics</a>. There is a structural reason for this: data was sparse when they were trained and there wasn’t any reason for them to learn statistics. So statistics and data science weren’t (and still often aren’t) integrated into medical and scientific curricula.</li> <li>There is an imbalance of power in the scientific process between statisticians/computational scientists and scientific investigators or clinicians. The clinicians/scientific investigators are “in charge” and the statisticians are often relegated to a secondary role. Statisticians with some control over their environment (think senior tenured professors of (bio)statistics) can avoid these imbalances and look for collaborators who respect statistical thinking, but not everyone can. There are a large number of <a href="http://www.opiniomics.org/a-guide-for-the-lonely-bioinformatician/">lonely bioinformaticians</a> out there.</li> <li>Statisticians and computational scientists are also frustrated because there is often no outlet for them to respond to these papers in the formal scientific literature - those outlets are controlled by scientists and rarely have statisticians in positions of influence within the journals.</li> </ol> <p>Since statistics is everywhere (1) and only flashy claims get you into journals (2) and the people leading studies don’t understand statistics very well (3), you get many publications where the paper makes a big claim based on shaky statistics but it gets through. 
This then frustrates the statisticians because they have little control over the process (4) and can’t get their concerns into the published literature (5).</p> <p>This used to just result in lots of statisticians and computational scientists complaining behind closed doors. The internet changed all that; everyone is an <a href="http://simplystatistics.org/2015/11/16/so-you-are-getting-crushed-on-the-internet-the-new-normal-for-academics/">internet scientist</a> now. So the statisticians and statistically savvy take to blogs, f1000research, and other outlets to get their point across.</p> <p>Sometimes to get attention, statisticians start to have the same problem as scientists; they need their complaints to get attention to have any effect. So they go over the top. They accuse people of fraud, or being statistically dumb, or nefarious, or intentionally doing things with data, or cast a wide net and try to implicate a large number of scientists in poor statistics. The ironic thing is that these are the same things the scientists were doing to get attention that frustrated the statisticians in the first place.</p> <p>Just to be 100% clear here I am also guilty of this. I have definitely fallen into the hype trap - talking about the “replicability crisis”. I also made the mistake earlier in my blogging career of trashing the statistics of a paper that frustrated me. I am embarrassed now that I did that; it wasn’t constructive and the author ended up being very responsive. I think if I had just emailed that person they would have resolved their problem.</p> <p>I just recently had an experience where a very prominent paper hadn’t made their data public and I was having trouble getting the data. I thought about writing a blog post to get attention, but at the end of the day just did the work of emailing the authors, explaining myself over and over and finally getting the data from them. The result is the same (I have the data) but it cost me time and frustration. 
So I understand when people don’t want to deal with that.</p> <p>The problem is that scientists see the attention the statisticians are calling down on them - primarily negative and often over-hyped. Then they get upset and call the statisticians/open scientists names, or push back on entirely sensible policies because they are worried about being humiliated or discredited. While I don’t agree with that response, I also understand the feeling of “being under attack”. I’ve had that happen to me too and it doesn’t feel good.</p> <p>So where do we go from here? How do we end statistical vitriol and make statistics a positive force? Here is my six-part plan:</p> <ol> <li>We should create continuing education for senior scientists and physicians in statistical and open data thinking so people who never got that training can understand the unique requirements of a data rich scientific world.</li> <li>We should encourage journals and funders to incorporate statisticians and computational scientists at the highest levels of influence so that they can drive policy that makes sense in this new data driven time.</li> <li>We should recognize that scientists and data generators have <a href="http://simplystatistics.org/2016/01/25/on-research-parasites-and-internet-mobs-lets-try-to-solve-the-real-problem/">a lot more on the line</a> when they produce a result or a scientific data set. We should give them appropriate credit for doing that even if they don’t get the analysis exactly right.</li> <li>We should de-escalate the consequences of statistical mistakes. Right now the consequences are: retractions that hurt careers, blog posts that are aggressive and often too personal, and humiliation by the community. We should make it easy to acknowledge these errors without ruining careers. This will be hard - scientists’ careers often depend on the results they get (recall 2 above). 
So we need a way to pump up/give credit to/acknowledge scientists who are willing to sacrifice that to get the stats right.</li> <li>We need to stop treating retractions/statistical errors/mistakes like a sport where there are winners and losers. Statistical criticism should be easy, allowable, publishable and not angry or personal.</li> <li>Any paper where statistical analysis is part of the paper must have either a statistically trained author or a statistically trained reviewer, or both. I wouldn’t believe a paper on genomics that was performed entirely by statisticians with no biology training any more than I believe a paper with statistics in it performed entirely by physicians with no statistical training.</li> </ol> <p>I think scientists forget that statisticians feel un-empowered in the scientific process and statisticians forget that a lot is riding on any given study for a scientist. So being a little more sympathetic to the pressures we all face would go a long way to resolving statistical vitriol.</p> <p>I’d be eager to hear other ideas too. It makes me sad that statistics has become so political on both sides.</p> The Mystery of Palantir Continues 2016-09-28T00:00:00+00:00 http://simplystats.github.io/2016/09/28/mystery-palantir-continues <p>Palantir, the secretive data science/consulting/software company, continues to be a mystery to most people, but recent reports have not been great. <a href="http://www.nytimes.com/reuters/2016/09/26/business/26reuters-palantir-tech-discrimination-lawsuit.html?smprod=nytcore-iphone&amp;smid=nytcore-iphone-share&amp;_r=0">Reuters reports</a> that the U.S. 
Department of Labor is suing it for employment discrimination:</p> <blockquote> <p>The lawsuit alleges Palantir routinely eliminated Asian applicants in the resume screening and telephone interview phases, even when they were as qualified as white applicants.</p> </blockquote> <p>Interestingly, the report indicates a statistical argument:</p> <blockquote> <p>In one example cited by the Labor Department, Palantir reviewed a pool of more than 130 qualified applicants for the role of engineering intern. About 73 percent of applicants were Asian. The lawsuit, which covers Palantir’s conduct between January 2010 and the present, said the company hired 17 non-Asian applicants and four Asians. “The likelihood that this result occurred according to chance is approximately one in a billion,” said the lawsuit, which was filed with the department’s Office of Administrative Law Judges.</p> </blockquote> <p><em>Update: Thanks to David Robinson for pointing out that (a) I read the numbers incorrectly and (b) I should have used the hypergeometric distribution to account for the sampling without replacement. The paragraph below is corrected accordingly.</em></p> <p>Note the use of the phrase “qualified applicants” in reference to the 130. Presumably, there was a screening process that removed “unqualified applicants” and that led us to 130. Of the 130, 73% were Asian. Presumably, there was a follow up selection process (interview, exam) that led to 4 Asians being hired out of 21 (about 19%). Clearly there’s a difference between 19% and 73% but the reasons may not be nefarious. If you assume the number of Asians hired is proportional to the number in the qualified pool, then the p-value for the observed data is about 10^-8, which is not quite “1 in a billion” as the report claims but it’s indeed small. 
But my guess is the Labor Department has more than this test of binomial proportions in terms of evidence if they were to go through with a suit.</p> <p>Alfred Lee from <a href="http://go.theinformation.com/r958P12lLdw">The Information</a> reports that a mutual fund run by Valic sold their shares of Palantir for below the recent valuation:</p> <blockquote> <p>The Valic fund sold its stake at $4.50 per share, filings show, down from the $11.38 per share at which the company raised money in December. The value of the stake at the sale price was $621,000. Despite the price drop, Valic made money on the deal, as it had acquired stock in preferred fundraisings in 2012 and 2013 at between $3.06 and $3.51 per share.</p> </blockquote> <p>The valuation suggested in the article by the recent sale is $8 billion. In my <a href="http://simplystatistics.org/2016/05/11/palantir-struggles/">previous post on Palantir</a>, I noted that while other large-scale consulting companies certainly make a lot of money, none have the sky-high valuation that Palantir commands. However, a more “down-to-Earth” valuation of $8 billion might be more or less in line with these other companies. It may be bad news for Palantir, but should the company ever have an IPO, it would be good for the public for market participants to realize the intrinsic value of the company.</p> Thinking like a statistician: this is not the election for progressives to vote third party 2016-09-27T00:00:00+00:00 http://simplystats.github.io/elections/2016/09/27/thinking-like-statistician-election-2016 <p>Democratic elections permit us to vote for whoever we perceive has the highest expectation to do better with the issues we care about. Let’s simplify and assume we can quantify how satisfied we are with an elected official’s performance. Denote this quantity with <em>X</em>. 
Because when we cast our vote we still don’t know for sure how the candidate will perform, we base our decision on what we expect, denoted here with <em>E(X)</em>. Thus we try to maximize <em>E(X)</em>. However, both political theory and data tell us that in US presidential elections only two parties have a non-negligible probability of winning. This implies that <em>E(X)</em> is 0 for some candidates no matter how large <em>X</em> could potentially be. So what we are really doing is deciding if <em>E(X-Y)</em> is positive or negative with <em>X</em> representing one candidate and <em>Y</em> the other.</p> <p>In past elections some progressives have argued that the difference between candidates is negligible and have therefore supported the Green Party ticket. The 2000 election is a notable example. The <a href="https://en.wikipedia.org/wiki/United_States_presidential_election,_2000">2000 election</a> was won by George W. Bush by just five <a href="https://en.wikipedia.org/wiki/Electoral_College_(United_States)">electoral votes</a>. In Florida, which had 25 electoral votes, Bush beat Al Gore by just 537 votes. Green Party candidate Ralph Nader obtained 97,488 votes. Many progressive voters were OK with this outcome because they perceived <em>E(X-Y)</em> to be practically 0.</p> <p>In contrast, in 2016, I suspect few progressives think that <em>E(X-Y)</em> is anywhere near 0. In the figures below I attempt to quantify the progressive’s pre-election perception of consequences for the last five contests. The first figure shows <em>E(X)</em> and <em>E(Y)</em> and the second shows <em>E(X-Y)</em>. Note that despite <em>E(X)</em> being the lowest in the last five elections, <em>E(X-Y)</em> is by far the largest. 
So if these figures accurately depict your perception and you think like a statistician, it becomes clear that this is not the election to vote third party.</p> <p><img src="https://raw.githubusercontent.com/simplystats/simplystats.github.io/master/_images/election.png" alt="election-2016" /></p> <p><img src="https://raw.githubusercontent.com/simplystats/simplystats.github.io/master/_images/election-diff.png" alt="election-diff-2016" /></p> Facebook and left censoring 2016-09-26T00:00:00+00:00 http://simplystats.github.io/2016/09/26/facebook-left-censoring <p>From the <a href="http://www.wsj.com/articles/facebook-overestimated-key-video-metric-for-two-years-1474586951">Wall Street Journal</a>:</p> <blockquote> <p>Several weeks ago, Facebook disclosed in a post on its “Advertiser Help Center” that its metric for the average time users spent watching videos was artificially inflated because it was only factoring in video views of more than three seconds. The company said it was introducing a new metric to fix the problem.</p> </blockquote> <p>A classic case of left censoring (in this case, by “accident”).</p> <p>Also this:</p> <blockquote> <p>Ad buying agency Publicis Media was told by Facebook that the earlier counting method likely overestimated average time spent watching videos by between 60% and 80%, according to a late August letter Publicis Media sent to clients that was reviewed by The Wall Street Journal.</p> </blockquote> <p>What does this information tell us about the actual time spent watching Facebook videos?</p> Not So Standard Deviations Episode 22 - Number 1 Side Project 2016-09-19T00:00:00+00:00 http://simplystats.github.io/2016/09/19/nssd-episode-22 <p>Hilary and I celebrate our one year anniversary doing the podcast together by discussing whether there are cities that are good for data scientists, reproducible research, and professionalizing data science.</p> <p>Also, Hilary and I have just published a new book, <a 
href="https://leanpub.com/conversationsondatascience?utm_source=SimplyStats&amp;utm_campaign=NSSD&amp;utm_medium=BlogPost">Conversations on Data Science</a>, which collects some of our episodes in an easy-to-read format. The book is available from Leanpub and will be updated as we record more episodes. If you’re new to the podcast, this is a good way to do some catching up!</p> <p>If you have questions you’d like us to answer, you can send them to nssdeviations @ gmail.com or tweet us at <a href="https://twitter.com/nssdeviations">@NSSDeviations</a>.</p> <p>Subscribe to the podcast on <a href="https://itunes.apple.com/us/podcast/not-so-standard-deviations/id1040614570">iTunes</a> or <a href="https://play.google.com/music/listen?u=0#/ps/Izfnbx6tlruojkfrvhjfdj3nmna">Google Play</a>.</p> <p>Please <a href="https://itunes.apple.com/us/podcast/not-so-standard-deviations/id1040614570">leave us a review on iTunes</a>!</p> <p>Support us through our <a href="https://www.patreon.com/NSSDeviations?ty=h">Patreon page</a>.</p> <p>Show Notes:</p> <ul> <li> <p><a href="https://www.biostat.washington.edu/suminst/sisbid2016/modules/BD1603">Roger’s reproducible research workshop</a></p> </li> <li> <p><a href="http://radar.oreilly.com/2013/06/theres-more-than-one-kind-of-data-scientist.html">There’s More Than One Kind of Data Scientist by Harlan Harris</a></p> </li> <li> <p><a href="http://sf.curbed.com/maps/mapping-the-10-sf-homes-with-the-highest-property-taxes">Billionaire’s row in San Francisco</a></p> </li> <li> <p><a href="https://en.wikipedia.org/wiki/Mindfulness-based_stress_reduction">Mindfulness-based stress reduction</a></p> </li> <li> <p><a href="http://www.asteroidmission.org/">OSIRIS-REx</a></p> </li> </ul> <p><a href="https://soundcloud.com/nssd-podcast/episode-22-1-side-project">Download the audio for this episode</a>.</p> <p>Listen here:</p> <iframe width="100%" height="166" scrolling="no" frameborder="no" 
src="https://w.soundcloud.com/player/?url=https%3A//api.soundcloud.com/tracks/282927998&amp;color=ff5500&amp;auto_play=false&amp;hide_related=false&amp;show_comments=true&amp;show_user=true&amp;show_reposts=false"></iframe> Mastering Software Development in R 2016-09-19T00:00:00+00:00 http://simplystats.github.io/2016/09/19/msdr-launch-announcement <p>Today I’m happy to announce that we’re launching a new specialization on Coursera titled <a href="https://www.coursera.org/specializations/r/"><strong>Mastering Software Development in R</strong></a>. This is a 5-course sequence developed with <a href="https://twitter.com/seankross">Sean Kross</a> and <a href="http://csu-cvmbs.colostate.edu/academics/erhs/Pages/brooke-anderson.aspx">Brooke Anderson</a>.</p> <p>This sequence differs from our previous Data Science Specialization because it focuses primarily on using R for developing <em>software</em>. We’ve found that as the field of data science evolves, it is becoming ever more clear that software development skills are essential for producing useful data science results and products. In addition, there is a tremendous need for tooling in the data science universe and we want to train people to build those tools.</p> <p>The first course, <a href="https://www.coursera.org/learn/r-programming-environment">The R Programming Environment</a>, launches today. In the following months, we will launch the remaining courses:</p> <ul> <li>Advanced R Programming</li> <li>Building R Packages</li> <li>Building Data Visualization Tools</li> </ul> <p>In addition to the course, we have a <a href="https://leanpub.com/msdr">companion textbook</a> that goes along with the sequence. The book is available from Leanpub and is currently in progress (if you get the book now, you will receive free updates as they are available). 
We will be releasing new chapters of the book alongside the launches of the other courses in the sequence.</p> Interview With a Data Sucker 2016-09-07T00:00:00+00:00 http://simplystats.github.io/open%20science/2016/09/07/interview-with-a-data-sucker <p>A few months ago Jill Sederstrom from ASH Clinical News interviewed me for <a href="http://ashclinicalnews.org/attack-of-the-data-suckers/">this article</a> on the data sharing editorial published by The New England Journal of Medicine (NEJM) and the debate it generated. The article presented a nice summary, but I thought the original comprehensive set of questions was very good too. So, with permission from ASH Clinical News, I am sharing them here along with my answers.</p> <p>Before I answer the questions below, I want to make an important remark. When writing these answers I am reflecting on data sharing in general. Nuances arise in different contexts that need to be discussed on an individual basis. For example, there are different considerations to keep in mind when sharing publicly funded data in genomics (my field) and sharing privately funded clinical trials data, just to name two examples.</p> <h3 id="in-your-opinion-what-do-you-see-as-the-biggest-pros-of-data-sharing">In your opinion, what do you see as the biggest pros of data sharing?</h3> <p>The biggest pro of data sharing is that it can accelerate and improve the scientific enterprise. This can happen in a variety of ways. For example, competing experts may apply an improved statistical analysis that finds a hidden discovery the original data generators missed. Furthermore, examination of data by many experts can help correct errors missed by the analyst of the original project. Finally, sharing data facilitates the merging of datasets from different sources that allow discoveries not possible with just one study.</p> <p>Note that data sharing is not a radical idea. 
For example, thanks to an organization called <a href="http://fged.org">The MGED Society</a>, most journals require all published microarray gene expression data to be public in one of two repositories: GEO or ArrayExpress. This has been an incredible success, leading to new discoveries, new databases that combine studies, and the development of widely used statistical methods and software built with these data as practice examples.</p> <h3 id="the-nejm-editorial-expressed-concern-that-a-new-generation-of-researchers-will-emerge-those-who-had-nothing-to-do-with-collecting-the-research-but-who-will-use-it-to-their-own-ends-it-referred-to-these-as-research-parasites-is-this-a-real-concern">The NEJM editorial expressed concern that a new generation of researchers will emerge, those who had nothing to do with collecting the research but who will use it to their own ends. It referred to these as “research parasites.” Is this a real concern?</h3> <p>Absolutely not. If our goal is to facilitate scientific discoveries that improve our quality of life, I would be much more concerned about “data hoarders” than “research parasites”. If an important nugget of knowledge is hidden in a dataset, don’t you want the best data analysts competing to find it? Restricting the researchers who can analyze the data to those directly involved with the generators cuts out the great majority of experts.</p> <p>To further illustrate this, let’s consider a very concrete example with real life consequences. Imagine a loved one has a disease with high mortality rates. Finding a cure is possible but only after analyzing a very, very complex genomic assay. If some of the best data analysts in the world want to help, does it make any sense at all to restrict the pool of analysts to, say, a freshly minted masters level statistician working for the genomics core that generated the data? 
Furthermore, what would be the harm of having someone double check that analysis?</p> <h3 id="the-nejm-editorial-also-presented-several-other-concerns-it-had-with-data-sharing-including-whether-researchers-would-compare-data-across-clinical-trials-that-is-not-in-fact-comparable-and-a-failure-to-provide-correct-attribution-do-you-see-these-as-being-concerns-what-cons-do-you-believe-there-may-be-to-data-sharing">The NEJM editorial also presented several other concerns it had with data sharing including whether researchers would compare data across clinical trials that is not in fact comparable and a failure to provide correct attribution. Do you see these as being concerns? What cons do you believe there may be to data sharing?</h3> <p>If such mistakes are made, good peer reviewers will catch the error. If it escapes peer review, we point it out in post publication discussions. Science is constantly self correcting.</p> <p>Regarding attribution, this is a legitimate, but in my opinion, minor concern. Developers of open source statistical methods and software see our methods used without attribution quite often. We survive. But as I elaborate below, we can do things to alleviate this concern.</p> <h3 id="is-data-stealing-a-real-worry-have-you-ever-heard-of-it-happening-before">Is data stealing a real worry? Have you ever heard of it happening before?</h3> <p>I can’t say I can recall any case of data being stolen. But let’s remember that most published data is paid for by tax payers. They are the actual owners. So there is an argument to be made that the public’s data is being held hostage.</p> <h3 id="does-data-sharing-need-to-happen-symbiotically-as-the-editorial-suggests-why-or-why-not">Does data sharing need to happen symbiotically as the editorial suggests? Why or why not?</h3> <p>I think symbiotic sharing is the most effective approach to the repurposing of data. But no, I don’t think we need to force it to happen this way. 
Competition is one of the key ingredients of the scientific enterprise. Having many groups competing almost always beats out a small group of collaborators. And note that the data generators won’t necessarily have time to collaborate with all the groups interested in the data.</p> <h3 id="in-a-recent-blog-post-you-suggested-several-possible-data-sharing-guidelines-what-would-the-advantage-be-of-having-guidelines-in-place-in-help-guide-the-data-sharing-process">In a recent blog post, you suggested several possible data sharing guidelines. What would the advantage be of having guidelines in place to help guide the data sharing process?</h3> <p>I think you are referring to <a href="http://simplystatistics.org/2016/01/25/on-research-parasites-and-internet-mobs-lets-try-to-solve-the-real-problem/">a post by Jeff Leek</a> but I am happy to answer. For data to be generated, we need to incentivize the endeavor. Guidelines that assure patient privacy should of course be followed. Some other simple guidelines related to those mentioned by Jeff are:</p> <ol> <li>Reward data generators when their data is used by others.</li> <li>Penalize those that do not give proper attribution.</li> <li>Apply the same critical rigor on critiques of the original analysis as we apply to the original analysis.</li> <li>Include data sharing ethics in scientific education.</li> </ol> <h3 id="one-of-the-guidelines-suggested-a-new-designation-for-leaders-of-major-data-collection-or-software-generation-projects-why-do-you-think-this-is-important">One of the guidelines suggested a new designation for leaders of major data collection or software generation projects. Why do you think this is important?</h3> <p>Again, this was Jeff, but I agree. 
This is important because we need an incentive other than giving the generators exclusive rights to publications emanating from said data.</p> <h3 id="you-also-discussed-the-need-for-requiring-statisticalcomputational-co-authors-for-papers-written-by-experimentalists-with-no-statisticalcomputational-co-authors-and-vice-versa-what-role-do-you-see-the-referee-serving-why-is-this-needed">You also discussed the need for requiring statistical/computational co-authors for papers written by experimentalists with no statistical/computational co-authors and vice versa. What role do you see the referee serving? Why is this needed?</h3> <p>I think the same rule should apply to referees. Every paper based on the analysis of complex data needs to have a referee with statistical/computational expertise. I also think biomedical journals publishing data-driven research should start adding these experts to their editorial boards. I should mention that NEJM actually has had such experts on their editorial board for a while now.</p> <h3 id="are-there-certain-guidelines-would-feel-would-be-most-critical-to-include">Are there certain guidelines you feel would be most critical to include?</h3> <p>To me the most important ones are:</p> <ol> <li> <p>The funding agencies and the community should reward data generators when their data is used by others. Perhaps more than for the papers they produce with these data.</p> </li> <li> <p>Apply the same critical rigor on critiques of the original analysis as we apply to the original analysis. Bashing published results and talking about the “replication crisis” has become fashionable. Although in some cases it is very well merited (see Baggerly and Coombes’ <a href="http://projecteuclid.org/euclid.aoas/1267453942#info">work</a> for example), in some circumstances critiques are made without much care, mainly for the attention. 
If we are not careful about keeping a good balance, we may end up paralyzing scientific progress.</p> </li> </ol> <h3 id="you-mentioned-that-you-think-symbiotic-data-sharing-would-be-the-most-effective-approach-what-are-some-ways-in-which-scientists-can-work-symbiotically">You mentioned that you think symbiotic data sharing would be the most effective approach. What are some ways in which scientists can work symbiotically?</h3> <p>I can describe my experience. I am trained as a statistician. I analyze data on a daily basis both as a collaborator and method developer. Experience has taught me that if one does not understand the scientific problem at hand, it is hard to make a meaningful contribution through data analysis or method development. Most successful applied statisticians will tell you the same thing.</p> <p>Most difficult scientific challenges have nuances that only the subject matter expert can effectively describe. Failing to understand these usually leads analysts to chase false leads, interpret results incorrectly or waste time solving a problem no one cares about. Successful collaborations usually involve a constant back and forth between the data analysts and the subject matter experts.</p> <p>However, in many circumstances the data generator is not necessarily the only one that can provide such guidance. Some data analysts actually become subject matter experts themselves, others download data and seek out other collaborators that also understand the details of the scientific challenge and data generation process.</p> A Short Guide for Students Interested in a Statistics PhD Program 2016-09-06T00:00:00+00:00 http://simplystats.github.io/advice/2016/09/06/a-short-guide-for-phd-applicants <p>This summer I had several conversations with undergraduate students seeking career advice. All were interested in data analysis and were considering graduate school. I also frequently receive requests for advice via email. 
We have posted on this topic before, for example <a href="http://simplystatistics.org/2015/02/18/navigating-big-data-careers-with-a-statistics-phd/">here</a> and <a href="http://simplystatistics.org/2015/11/09/biostatistics-its-not-what-you-think-it-is/">here</a>, but I thought it would be useful to share this short guide I put together based on my recent interactions.</p> <h2 id="its-ok-to-be-confused">It’s OK to be confused</h2> <p>When I was a college senior I didn’t really understand what Applied Statistics was nor did I understand what one does as a researcher in academia. Now I love being an academic doing research in applied statistics. But it is hard to understand what being a researcher is like until you do it for a while. Things become clearer as you gain more experience. One important piece of advice is to carefully consider advice from those with more experience than you. It might not make sense at first, but I can tell you today that I knew much less than I thought I did when I was 22.</p> <h2 id="should-i-even-go-to-graduate-school">Should I even go to graduate school?</h2> <p>Yes. An undergraduate degree in mathematics, statistics, engineering, or computer science provides a great background, but some more training greatly increases your career options. You may be able to learn on the job, but note that a masters can be as short as a year.</p> <h2 id="a-masters-or-a-phd">A masters or a PhD?</h2> <p>If you want a career in academia or as a researcher in industry or government you need a PhD. In general, a PhD will give you more career options. If you want to become a data analyst or research assistant, a masters may be enough. A masters is also a good way to test out if this career is a good match for you. Many people do a masters before applying to PhD programs. 
The rest of this guide focuses on those interested in a PhD.</p> <h2 id="what-discipline">What discipline?</h2> <p>There are many disciplines that can lead you to a career in data science: Statistics, Biostatistics, Astronomy, Economics, Machine Learning, Computational Biology, and Ecology are examples that come to mind. I did my PhD in Statistics and got a job in a Department of Biostatistics. So this guide focuses on Statistics/Biostatistics.</p> <p>Note that once you finish your PhD you have a chance to become a postdoctoral fellow and further focus your training. By then you will have a much better idea of what you want to do and will have the opportunity to choose a lab that closely matches your interests.</p> <h2 id="what-is-the-difference-between-statistics-and-biostatistics">What is the difference between Statistics and Biostatistics?</h2> <p>Short answer: very little. I treat them as the same in this guide. Long answer: read <a href="http://simplystatistics.org/2015/11/09/biostatistics-its-not-what-you-think-it-is/">this</a>.</p> <h2 id="how-should-i-prepare-during-my-senior-year">How should I prepare during my senior year?</h2> <h3 id="math">Math</h3> <p>Good grades in math and statistics classes are almost a requirement. Good GRE scores help and you need to get a near perfect score in the Quantitative Reasoning part of the GRE. Get yourself a practice book and start preparing. Note that to survive the first two years of a statistics PhD program you need to prove theorems and derive relatively complicated mathematical results. If you can’t easily handle the math part of the GRE, this will be quite challenging.</p> <p>When choosing classes note that the area of math most related to your stat PhD courses is Real Analysis. The area of math most used in applied work is Linear Algebra, specifically matrix theory including understanding eigenvalues and eigenvectors. 
You might not make the connection between what you learn in class and what you use in practice until much later. This is totally normal.</p> <p>If you don’t feel ready, consider doing a masters first. But also, get a second opinion. You might be being too hard on yourself.</p> <h3 id="programming">Programming</h3> <p>You will be using a computer to analyze data so knowing some programming is a must these days. At a minimum, take a basic programming class. Other computer science classes will help especially if you go into an area dealing with large datasets. In hindsight, I wish I had taken classes on optimization and algorithm design.</p> <p>Know that learning to program and learning a computer language are different things. You need to learn to program. The choice of language is up for debate. If you only learn one, learn R. If you learn three, learn R, Python and C++.</p> <p>Knowing Linux/Unix is an advantage. If you have a Mac try to use the terminal as much as possible. On Windows get an emulator.</p> <h3 id="writing-and-communicating">Writing and Communicating</h3> <p>My biggest educational regret is that, as a college student, I underestimated the importance of writing. To this day I am correcting that mistake.</p> <p>Your success as a researcher greatly depends on how well you write and communicate. Your thesis, papers, grant proposals and even emails have to be well written. So practice as much as possible. Take classes, read works by good writers, and <a href="http://bulletin.imstat.org/2011/09/terence%E2%80%99s-stuff-speaking-reading-writing/">practice</a>. Consider starting a blog even if you don’t make it public. Also note that in academia, job interviews will involve a 50 minute talk as well as several conversations about your work and future plans. So communication skills are also a big plus.</p> <h2 id="but-wait-why-so-much-math">But wait, why so much math?</h2> <p>The PhD curriculum is indeed math heavy. 
Faculty often debate the possibility of changing the curriculum. But regardless of differing opinions on what is the right amount, math is the foundation of our discipline. Although it is true that you will not directly use much of what you learn, I don’t regret learning so much abstract math because I believe it positively shaped the way I think and attack problems.</p> <p>Note that after the first two years you are pretty much done with courses and you start on your research. If you work with an applied statistician you will learn data analysis via the apprenticeship model. You will learn the most, by far, during this stage. So be patient. Watch <a href="https://www.youtube.com/watch?v=R37pbIySnjg">these</a> <a href="https://www.youtube.com/watch?v=Bg21M2zwG9Q">two</a> Karate Kid scenes for some inspiration.</p> <h2 id="what-department-should-i-apply-to">What department should I apply to?</h2> <p>The top 20-30 departments are practically interchangeable in my opinion. If you are interested in applied statistics make sure you pick a department with faculty doing applied research. Note that some professors focus their research on the mathematical aspects of statistics. By reading some of their recent papers you will be able to tell. An applied paper usually shows data (not simulated) and motivates a subject area challenge in the abstract or introduction. A theory paper shows no data at all or uses it only as an example.</p> <h2 id="can-i-take-a-year-off">Can I take a year off?</h2> <p>Absolutely. Especially if it’s to work in a data related job. In general, maturity and life experiences are an advantage in grad school.</p> <h2 id="what-should-i-expect-when-i-finish">What should I expect when I finish?</h2> <p>You will have many, many options. The demand for your expertise is great and growing. As a result there are many high-paying options. If you want to become an academic I recommend doing a postdoc. 
<a href="http://simplystatistics.org/2011/12/28/grad-students-in-bio-statistics-do-a-postdoc/">Here</a> is why. But there are many other options as we describe <a href="http://simplystatistics.org/2015/02/18/navigating-big-data-careers-with-a-statistics-phd/">here</a> and <a href="http://simplystatistics.org/2011/09/12/advice-for-stats-students-on-the-academic-job-market/">here</a>.</p> Not So Standard Deviations Episode 21 - This Might be the Future! 2016-08-26T00:00:00+00:00 http://simplystats.github.io/2016/08/26/nssd-episode-21 <p>Hilary and I are apart again and this time we’re talking about political polling. Also, we discuss Trump’s tweets, and the fact that Hilary owns a bowling ball.</p> <p>Also, Hilary and I have just published a new book, <a href="https://leanpub.com/conversationsondatascience?utm_source=SimplyStats&amp;utm_campaign=NSSD&amp;utm_medium=BlogPost">Conversations on Data Science</a>, which collects some of our episodes in an easy-to-read format. The book is available from Leanpub and will be updated as we record more episodes. 
If you’re new to the podcast, this is a good way to do some catching up!</p> <p>If you have questions you’d like us to answer, you can send them to nssdeviations @ gmail.com or tweet us at <a href="https://twitter.com/nssdeviations">@NSSDeviations</a>.</p> <p>Subscribe to the podcast on <a href="https://itunes.apple.com/us/podcast/not-so-standard-deviations/id1040614570">iTunes</a> or <a href="https://play.google.com/music/listen?u=0#/ps/Izfnbx6tlruojkfrvhjfdj3nmna">Google Play</a>.</p> <p>Please <a href="https://itunes.apple.com/us/podcast/not-so-standard-deviations/id1040614570">leave us a review on iTunes</a>!</p> <p>Support us through our <a href="https://www.patreon.com/NSSDeviations?ty=h">Patreon page</a>.</p> <p>Show Notes:</p> <ul> <li> <p><a href="http://projects.fivethirtyeight.com/2016-election-forecast/">FiveThirtyEight election dashboard</a></p> </li> <li> <p><a href="http://www.nytimes.com/interactive/2016/upshot/presidential-polls-forecast.html">The Upshot’s election dashboard</a></p> </li> <li> <p><a href="http://varianceexplained.org/r/trump-tweets/">David Robinson’s post on Trump’s tweets</a></p> </li> <li> <p><a href="https://twitter.com/juliasilge">Julia Silge’s Twitter account</a></p> </li> <li> <p><a href="http://thekateringshow.com">The Katering Show</a></p> </li> <li> <p><a href="https://www.beomni.com">Omni</a></p> </li> </ul> <p><a href="https://soundcloud.com/nssd-podcast/episode-21-this-might-be-the-future">Download the audio for this episode</a>.</p> <p>Listen here:</p> <iframe width="100%" height="166" scrolling="no" frameborder="no" src="https://w.soundcloud.com/player/?url=https%3A//api.soundcloud.com/tracks/279922412&amp;color=ff5500&amp;auto_play=false&amp;hide_related=false&amp;show_comments=true&amp;show_user=true&amp;show_reposts=false"></iframe> How to create a free distributed data collection "app" with R and Google Sheets 2016-08-26T00:00:00+00:00 http://simplystats.github.io/2016/08/26/googlesheets <p><a 
href="http://www.stat.ubc.ca/~jenny/">Jenny Bryan</a>, developer of the <a href="https://github.com/jennybc/googlesheets">googlesheets R package</a>, <a href="https://speakerdeck.com/jennybc/googlesheets-talk-at-user2015">gave a talk</a> at useR! 2015 about the package.</p> <p>One of the things that got me most excited about the package was an example she gave in her talk of using the googlesheets package for data collection at ultimate frisbee tournaments. One reason is that I used to play a little ultimate <a href="http://www.pbase.com/jmlane/image/76852417">back in the day</a>.</p> <p>Another is that her idea is an amazing one for producing cool public health applications. One of the major issues with public health is being able to do distributed data collection cheaply, easily, and reproducibly. So I decided to write a little tutorial on how one could use <a href="https://www.google.com/sheets/about/">Google Sheets</a> and R to create a free distributed data collection “app” for public health (or anything else really).</p> <h3 id="what-you-will-need">What you will need</h3> <ul> <li>A Google account and access to <a href="https://www.google.com/sheets/about/">Google Sheets</a></li> <li><a href="https://www.r-project.org/">R</a> and the <a href="https://github.com/jennybc/googlesheets">googlesheets</a> package.</li> </ul> <h3 id="the-app">The “app”</h3> <p>What we are going to do is collect data in a Google Sheet or sheets. This sheet can be edited by anyone with the link using their computer or a mobile phone. 
Then we will use the <code class="highlighter-rouge">googlesheets</code> package to pull the data into R and analyze it.</p> <h3 id="making-the-google-sheet-work-with-googlesheets">Making the Google Sheet work with googlesheets</h3> <p>The first thing to do is to go to Google Sheets. I suggest bookmarking this page: https://docs.google.com/spreadsheets/u/0/ which skips the annoying splash screen.</p> <p>Create a blank sheet and give it an appropriate title for whatever data you will be collecting.</p> <p>Next, we need to make the sheet <em>public on the web</em> so that the <em>googlesheets</em> package can read it. This is different from the sharing settings you set with the big button on the right. To make the sheet public on the web, go to the “File” menu and select “Publish to the web…”. Like this:</p> <p><img src="https://raw.githubusercontent.com/simplystats/simplystats.github.io/master/_images/gs_publishweb.png" alt="" /></p> <p>Then it will ask you if you want to publish the sheet; just hit “Publish”.</p> <p><img src="https://raw.githubusercontent.com/simplystats/simplystats.github.io/master/_images/gs_publish.png" alt="" /></p> <p>Copy the link it gives you and you can use it to read in the Google Sheet. 
If you want to see all the Google Sheets you can read in, you can load the package and use the <code class="highlighter-rouge">gs_ls</code> function.</p> <div class="language-r highlighter-rouge"><pre class="highlight"><code><span class="n">library</span><span class="p">(</span><span class="n">googlesheets</span><span class="p">)</span><span class="w"> </span><span class="n">sheets</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">gs_ls</span><span class="p">()</span><span class="w"> </span><span class="n">sheets</span><span class="p">[</span><span class="m">1</span><span class="p">,]</span><span class="w"> </span></code></pre> </div> <div class="highlighter-rouge"><pre class="highlight"><code>## # A tibble: 1 x 10 ## sheet_title author perm version updated ## &lt;chr&gt; &lt;chr&gt; &lt;chr&gt; &lt;chr&gt; &lt;time&gt; ## 1 app_example jtleek rw new 2016-08-26 17:48:21 ## # ... with 5 more variables: sheet_key &lt;chr&gt;, ws_feed &lt;chr&gt;, ## # alternate &lt;chr&gt;, self &lt;chr&gt;, alt_key &lt;chr&gt; </code></pre> </div> <p>It will pop up a dialog asking for you to authorize the <code class="highlighter-rouge">googlesheets</code> package to read from your Google Sheets account. 
Then you should see a list of spreadsheets you have created.</p> <p>In my example I created a sheet called “app_example” so I can load the Google Sheet like this:</p> <div class="language-r highlighter-rouge"><pre class="highlight"><code><span class="c1">## Identifies the Google Sheet </span><span class="n">example_sheet</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">gs_title</span><span class="p">(</span><span class="s2">"app_example"</span><span class="p">)</span><span class="w"> </span></code></pre> </div> <div class="highlighter-rouge"><pre class="highlight"><code>## Sheet successfully identified: "app_example" </code></pre> </div> <div class="language-r highlighter-rouge"><pre class="highlight"><code><span class="c1">## Reads the data </span><span class="n">dat</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">gs_read</span><span class="p">(</span><span class="n">example_sheet</span><span class="p">)</span><span class="w"> </span></code></pre> </div> <div class="highlighter-rouge"><pre class="highlight"><code>## Accessing worksheet titled 'Sheet1'. </code></pre> </div> <div class="highlighter-rouge"><pre class="highlight"><code>## No encoding supplied: defaulting to UTF-8. </code></pre> </div> <div class="language-r highlighter-rouge"><pre class="highlight"><code><span class="n">head</span><span class="p">(</span><span class="n">dat</span><span class="p">)</span><span class="w"> </span></code></pre> </div> <div class="highlighter-rouge"><pre class="highlight"><code>## # A tibble: 3 x 5 ## who_collected at_work person time date ## &lt;chr&gt; &lt;chr&gt; &lt;chr&gt; &lt;chr&gt; &lt;chr&gt; ## 1 jeff no ingo 13:47 08/26/2016 ## 2 jeff yes roger 13:47 08/26/2016 ## 3 jeff yes brian 13:47 08/26/2016 </code></pre> </div> <p>In this case the data I’m collecting is about who is at work right now as I’m writing this post :). 
But you could collect whatever you want.</p> <h3 id="distributing-the-data-collection">Distributing the data collection</h3> <p>Now that you have the data published to the web, you can read it into R. Also, anyone with the link will be able to view the Google Sheet. But if you don’t change the sharing settings, you are the only one who can edit the sheet.</p> <p>This is where you can make your data collection distributed if you want. If you go to the “Share” button, then click on “Advanced”, you will get a screen like this and have some options.</p> <p><img src="https://raw.githubusercontent.com/simplystats/simplystats.github.io/master/_images/gs_share_advanced.png" alt="" /></p> <p><em>Private data collection</em></p> <p>In the example I’m using I haven’t changed the sharing settings, so while you can <em>see</em> the sheet, you can’t edit it. This is nice if you want to collect some data and allow other people to read it, but you don’t want them to edit it.</p> <p><em>Controlled distributed data collection</em></p> <p>If you just enter people’s emails then you can open the data collection to just those individuals you have shared the sheet with. Be careful though: if they don’t have Google email addresses, then they get a link which they could share with other people and this could lead to open data collection.</p> <p><em>Uncontrolled distributed data collection</em></p> <p>Another option is to click on “Change” next to “Private - Only you can access”. Then click on “On - Anyone with the link” and click on “Can View”.</p> <p><img src="https://raw.githubusercontent.com/simplystats/simplystats.github.io/master/_images/gs_can_view.png" alt="" /></p> <p>Then you can modify it to say “Can Edit” and hit “Save”. Now anyone who has the link can edit the Google Sheet. This means that you can’t control who will be editing it (careful!) 
but you can really widely distribute the data collection.</p> <h3 id="collecting-data">Collecting data</h3> <p>Once you have distributed the link either to your collaborators or more widely it is time to collect data. This is where I think that the “app” part of this is so cool. You can edit the Google Sheet from a desktop computer, but if you have the (free!) Google Sheets app for your phone then you can also edit the data on the go. There is even an offline mode if the internet connection isn’t available where you are working (more on this below).</p> <h3 id="quality-control">Quality control</h3> <p>One of the major issues with distributed data collection is quality control. If possible you want people to input data using (a) a controlled vocabulary/system and (b) the same controlled vocabulary/system. My suggestion here depends on whether you are using a controlled distributed system or an uncontrolled distributed system.</p> <p>For the controlled distributed system you are specifically giving access to individual people - you can provide some training or a walk through before giving them access.</p> <p>For the uncontrolled distributed system you should create a <em>very</em> detailed set of instructions. For example, for my sheet I would create a set of instructions like:</p> <ol> <li>Every data point must have a label of who collected it in the <code class="highlighter-rouge">who_collected</code> column. You should pick a username that does not currently appear in the sheet and stick with it. Use all lower case for your username.</li> <li>You should either report “yes” or “no” in lowercase in the <code class="highlighter-rouge">at_work</code> column.</li> <li>You should report the name of the person in all lower case in the <code class="highlighter-rouge">person</code> column. You should search and make sure that the person you are reporting on doesn’t appear before introducing a new name. 
If the name already exists, use the name spelled exactly as it is in the sheet already.</li> <li>You should report the <code class="highlighter-rouge">time</code> in the format hh:mm on a 24 hour clock in the eastern time zone of the United States.</li> <li>You should report the <code class="highlighter-rouge">date</code> in the mm/dd/yyyy format.</li> </ol> <p>You could be much more detailed depending on the case.</p> <h3 id="offline-editing-and-conflicts">Offline editing and conflicts</h3> <p>One of the cool things about Google Sheets is that they can even be edited without an internet connection. This is particularly useful if you are collecting data in places where internet connections may be spotty. But that may generate conflicts if you use only one sheet.</p> <p>There may be different ways to handle this, but one I thought of is to just create one sheet for each person collecting data (if you are using controlled distributed data collection). Then each person only edits the data in their sheet, avoiding potential conflicts if multiple people are editing offline and non-synchronously.</p> <h3 id="reading-the-data">Reading the data</h3> <p>Anyone with the link can now read the most up-to-date data with the following simple code.</p> <div class="language-r highlighter-rouge"><pre class="highlight"><code><span class="c1">## Identifies the Google Sheet </span><span class="n">example_sheet</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">gs_url</span><span class="p">(</span><span class="s2">"https://docs.google.com/spreadsheets/d/177WyyzWOHGIQ9O5iUY9P9IVwGi7jL3f4XBY4d98CY_o/pubhtml"</span><span class="p">)</span><span class="w"> </span></code></pre> </div> <div class="highlighter-rouge"><pre class="highlight"><code>## Sheet-identifying info appears to be a browser URL. ## googlesheets will attempt to extract sheet key from the URL. 
</code></pre> </div> <div class="highlighter-rouge"><pre class="highlight"><code>## Putative key: 177WyyzWOHGIQ9O5iUY9P9IVwGi7jL3f4XBY4d98CY_o </code></pre> </div> <div class="highlighter-rouge"><pre class="highlight"><code>## Sheet successfully identified: "app_example" </code></pre> </div> <div class="language-r highlighter-rouge"><pre class="highlight"><code><span class="c1">## Reads the data </span><span class="n">dat</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">gs_read</span><span class="p">(</span><span class="n">example_sheet</span><span class="p">,</span><span class="w"> </span><span class="n">ws</span><span class="o">=</span><span class="s2">"Sheet1"</span><span class="p">)</span><span class="w"> </span></code></pre> </div> <div class="highlighter-rouge"><pre class="highlight"><code>## Accessing worksheet titled 'Sheet1'. </code></pre> </div> <div class="highlighter-rouge"><pre class="highlight"><code>## No encoding supplied: defaulting to UTF-8. </code></pre> </div> <div class="language-r highlighter-rouge"><pre class="highlight"><code><span class="n">dat</span><span class="w"> </span></code></pre> </div> <div class="highlighter-rouge"><pre class="highlight"><code>## # A tibble: 3 x 5 ## who_collected at_work person time date ## &lt;chr&gt; &lt;chr&gt; &lt;chr&gt; &lt;chr&gt; &lt;chr&gt; ## 1 jeff no ingo 13:47 08/26/2016 ## 2 jeff yes roger 13:47 08/26/2016 ## 3 jeff yes brian 13:47 08/26/2016 </code></pre> </div> <p>Here the url is the one I got when I went to the “File” menu and clicked on “Publish to the web…”. The argument <code class="highlighter-rouge">ws</code> in the <code class="highlighter-rouge">gs_read</code> command is the name of the worksheet. If you have multiple sheets assigned to different people, you can read them in one at a time and then merge them together.</p> <h3 id="conclusion">Conclusion</h3> <p>So that’s it, it’s pretty simple. 
But as I gear up to teach advanced data science here at Hopkins I’m thinking a lot about Sean Taylor’s awesome post <a href="http://seanjtaylor.com/post/41463778912/real-scientists-make-their-own-data">Real scientists make their own data</a></p> <p>I think this approach is a super cool/super lightweight system for collecting data either on your own or as a team. As I said I think it could be really useful in public health, but it could also be used for any data collection you want.</p> Interview with COPSS award winner Nicolai Meinshausen. 2016-08-24T00:00:00+00:00 http://simplystats.github.io/2016/08/24/meinshausen-copss <p><em>Editor’s Note: We are again pleased to interview the COPSS President’s award winner. The COPSS Award is one of the most prestigious in statistics, sometimes called the Nobel Prize in statistics. This year the award went to Nicolai Meinshausen from ETH Zurich. He is known for his work in causality, high-dimensional statistics, and machine learning. Also see our past COPSS award interviews with <a href="http://simplystatistics.org/2015/08/25/interview-with-copss-award-winner-john-storey/">John Storey</a> and <a href="http://simplystatistics.org/2014/08/18/interview-with-copss-award-winner-martin-wainright/">Martin Wainwright</a>.</em></p> <p><img src="https://raw.githubusercontent.com/simplystats/simplystats.github.io/master/_images/meinshausen.png" alt="Nicolai Meinshausen" /></p> <h2 id="do-you-consider-yourself-to-be-a-statistician-data-scientist-machine-learner-or-something-else">Do you consider yourself to be a statistician, data scientist, machine learner, or something else?</h2> <p>Perhaps all of the above. If you forced me to pick one, then statistician but I hope we will soon come to a point where these distinctions do not matter much any more.</p> <h2 id="how-did-you-find-out-you-had-won-the-copss-award">How did you find out you had won the COPSS award?</h2> <p>Jeremy Taylor called me. 
I know I am expected to say I did not expect it but that was indeed the case and it was a genuine surprise.</p> <h2 id="how-do-you-see-the-fields-of-causal-inference-and-high-dimensional-statistics-merging">How do you see the fields of causal inference and high-dimensional statistics merging?</h2> <p>Causal inference is already very challenging in the low-dimensional case - if understood as data for which the number of observations exceeds the number of variables. There are commonalities between high-dimensional statistics and the subfield of causal discovery, however, as we try to recover a sparse underlying structure from data in both cases (say when trying to reconstruct a gene network from observational and intervention data). The interpretations are just slightly different. A further difference is the implicit optimization. High-dimensional estimators can often be framed as convex optimization problems and the question is whether causal discovery can or should be pushed in this direction as well.</p> <h2 id="can-you-explain-a-little-about-how-you-can-infer-causal-effects-from-inhomogeneous-data">Can you explain a little about how you can infer causal effects from inhomogeneous data?</h2> <p>Why do we want a causal model in the first place? In most cases the benefit of a causal over a regression model is that the predictions of a causal model continue to be valid even if we intervene on the variables we use for prediction.</p> <p>The inference we proposed turns this around and is looking for all models that are invariant in the sense that they give the same prediction accuracy across a number of different settings or environments. If we just have observational data, then this invariance holds for all models but if the data are inhomogeneous, certain models can be discarded since they work better in one environment than in another and can thus not be causal. 
If all models that show invariance use a certain variable, we can claim that the variable in question has a causal effect (while controlling type I error rates) and construct confidence intervals for the strength of the effect.</p> <h2 id="you-have-worked-on-studying-the-effects-of-climate-change---do-you-think-statisticians-can-play-an-important-role-in-this-debate">You have worked on studying the effects of climate change - do you think statisticians can play an important role in this debate?</h2> <p>To a certain extent. I have worked on several projects with physicists and the general caveat is that physicists are in general quite advanced in their methodology already and have quite a good understanding of the relevant statistical concepts. Biology is thus maybe a field where even more external input is required. Then again, it saves one from having to calculate t-tests in collaborations with physicists and just the interesting and challenging problems are left.</p> <h2 id="what-advice-would-you-give-young-statisticians-getting-into-the-discipline-right-now">What advice would you give young statisticians getting into the discipline right now?</h2> <p>First I would say that they have made a good choice since these are interesting times for the field with many challenging and relevant problems still open and unsolved (but not completely out of reach either). I think it’s important to keep an open mind and read as much literature as possible from neighbouring fields. My personal experience has been that it is very beneficial to get involved in some scientific collaborations.</p> <h2 id="what-sorts-of-things-is-your-group-working-on-these-days">What sorts of things is your group working on these days?</h2> <p>Two general themes: the first is what people would call more classical machine learning. For example, how can we detect interactions in large-scale datasets in sub-quadratic time? 
Secondly, we are trying to extend the invariance approach to causal inference to more general settings, for example allowing for nonlinearities and hidden variables while at the same time improving the computational aspects.</p> A Simple Explanation for the Replication Crisis in Science 2016-08-24T00:00:00+00:00 http://simplystats.github.io/2016/08/24/replication-crisis <p>By now, you’ve probably heard of the <a href="https://en.wikipedia.org/wiki/Replication_crisis">replication crisis in science</a>. In summary, many conclusions from experiments done in a variety of fields have been found to not hold water when followed up in subsequent experiments. There are now any number of famous examples, particularly from the fields of <a href="http://science.sciencemag.org/content/349/6251/aac4716">psychology</a> and <a href="http://biorxiv.org/content/early/2016/04/27/050575">clinical medicine</a>, that show that the rate of replication of findings is less than the expected rate.</p> <p>The reasons proposed for this crisis are wide ranging, but typically center on the preference for “novel” findings in science and the pressure on investigators (especially new ones) to “publish or perish”. My favorite reason places the blame for the entire crisis on <a href="http://www.nature.com/news/psychology-journal-bans-p-values-1.17001">p-values</a>.</p> <p>I think to develop a better understanding of why there is a “crisis”, we need to step back and look across different fields of study. There is one key question you should be asking yourself: <em>Is the replication crisis evenly distributed across different scientific disciplines?</em> My reading of the literature would suggest “no”, but the reasons attributed to the replication crisis are common to all scientists in every field (i.e. novel findings, publishing, etc.). 
So why would there be any heterogeneity?</p> <h2 id="an-aside-on-replication-and-reproducibility">An Aside on Replication and Reproducibility</h2> <p>As Lorena Barba recently <a href="https://twitter.com/LorenaABarba/status/764836487212957696">pointed</a> <a href="https://github.com/ReScience/ReScience-article/issues/5#issuecomment-241232791">out</a>, there can be tremendous confusion over the use of the words <strong>replication</strong> and <strong>reproducibility</strong>, so it’s best that we sort that out now. Here’s how I use both words:</p> <ul> <li> <p><em>replication</em>: This is the act of repeating an entire study, independently of the original investigator without the use of original data (but generally using the same methods).</p> </li> <li> <p><em>reproducibility</em>: A study is reproducible if you can take the original data and the <em>computer code</em> used to analyze the data and reproduce all of the numerical findings from the study. This may initially sound like a trivial task but experience has shown that it’s not always easy to achieve this seemingly minimal standard.</p> </li> </ul> <p>For more precise definitions of what I mean by these terms, you can take a look at <a href="http://biorxiv.org/content/early/2016/07/29/066803">my recent paper with Jeff Leek and Prasad Patil</a>.</p> <p>One key distinction between replication and reproducibility is that with replication, there is no need to trust any of the original findings. In fact, you may be attempting to replicate a study <em>because</em> you do not trust the findings of the original study. A recent example includes the creation of stem cells from ordinary cells, a finding that was so extraordinary that it led several laboratories to quickly try to replicate the study. 
Ultimately, <a href="http://www.nature.com/nature/journal/v525/n7570/full/nature15513.html">seven separate laboratories could not replicate the finding</a> and the original study was retracted.</p> <h2 id="astronomy-and-epidemiology">Astronomy and Epidemiology</h2> <p>What do the fields of astronomy and epidemiology have in common? You might think nothing. Those two departments are often not even on the same campus at most universities! However, they have at least one common element, which is that the things that they study are generally reluctant to be controlled by human beings. As a result, both astronomers and epidemiologists rely heavily on one tool: the <strong>observational study</strong>. Much has been written about observational studies of late, and I’ll spare you the literature search by saying that the bottom line is they can’t be trusted (particularly observational studies that have not been pre-registered!).</p> <p>But that’s fine: we have a method for dealing with things we don’t trust. It’s called replication. Epidemiologists actually codified their understanding of when they believe a causal claim (see <a href="https://en.wikipedia.org/wiki/Bradford_Hill_criteria">Hill’s Criteria</a>), which is typically only after a claim has been replicated in numerous studies in a variety of settings. My understanding is that astronomers have a similar mentality as well: no single study will result in anyone believing something new about the universe. Rather, findings need to be replicated using different approaches, instruments, etc.</p> <p>The key point here is that in both astronomy and epidemiology expectations are low with respect to individual studies. <strong>It’s difficult to have a replication crisis when nobody believes the findings in the first place</strong>. Investigators have a culture of distrusting individual one-off findings until they have been replicated numerous times. 
In my own area of research, the idea that ambient air pollution causes health problems was difficult to believe for decades, until we started seeing the same associations appear in numerous studies conducted all around the world. It’s hard to imagine any single study “proving” that connection, no matter how well it was conducted.</p> <h2 id="theory-and-experimentation-in-science">Theory and Experimentation in Science</h2> <p>I’ve already described the primary mode of investigation in astronomy and epidemiology, but there are of course other methods in other fields. One large category of methods includes the <strong>controlled experiment</strong>. Controlled experiments come in a variety of forms, whether they are laboratory experiments on cells or randomized clinical trials with humans, all of them involve intentional manipulation of some factor by the investigator in order to observe how such manipulation affects an outcome. In clinical medicine and the social sciences, controlled experiments are considered the “gold standard” of evidence. Meta-analyses and literature summaries generally weight publications with controlled experiments more highly than other approaches like observational studies.</p> <p>The other aspect I want to look at here is whether a field has a strong basic theoretical foundation. The idea here is that some fields, like say physics, have a strong set of basic theories whose predictions have been consistently validated over time. Other fields, like medicine, lack even the most rudimentary theories that can be used to make basic predictions. 
Granted, the distinction between fields with or without “basic theory” is a bit arbitrary on my part, but I think it’s fair to say that different fields of study fall on a spectrum in terms of how much basic theory they can rely on.</p> <p>Given the theoretical nature of different fields and the primary mode of investigation, we can develop the following crude 2x2 table, in which I’ve inserted some representative fields of study.</p> <p><img src="https://raw.githubusercontent.com/simplystats/simplystats.github.io/master/_images/replication_2x2.png" alt="Theory vs. Experimentation in Science" /></p> <p>My primary contention here is</p> <blockquote> <p>The replication crisis in science is concentrated in areas where (1) there is a tradition of controlled experimentation and (2) there is relatively little basic theory underpinning the field.</p> </blockquote> <p>Further, in general, I don’t believe that there’s anything wrong with the people tirelessly working in the upper right box. At least, I don’t think there’s anything <em>more</em> wrong with them compared to the good people working in the other three boxes.</p> <p>In case anyone is wondering where the state of clinical science is relative to, say, particle physics with respect to basic theory, I only point you to the web site for the <a href="https://nccih.nih.gov">National Center for Complementary and Integrative Health</a>. This is essentially a government agency with a budget of $124 million dedicated to <a href="http://www.forbes.com/sites/stevensalzberg/2011/08/29/nihs-350000-questionnaire/#1ee73d4d4fc6">advancing pseudoscience</a>. 
This is the state of “basic theory” in clinical medicine.</p> <h2 id="the-bottom-line">The Bottom Line</h2> <p>The people working in the upper right box have an uphill battle for at least two reasons:</p> <ol> <li>The lack of strong basic theory makes it more difficult to guide investigation, leading to wider ranging efforts that may be less likely to replicate.</li> <li>The tradition of controlled experimentation creates <em>high expectations</em> that research produced here is “valid”. I mean, hey, they’re using the gold standard of evidence, right?</li> </ol> <p>The confluence of these two factors leads to a much greater disappointment when findings from these fields do not replicate. This leads me to believe that <strong>the replication crisis in science is largely attributable to a mismatch between our expectations of how often findings should replicate and how difficult it is to actually discover true findings in certain fields</strong>. Further, the reliance on controlled experiments in certain fields has lulled us into believing that we can place tremendous weight on a small number of studies. Ultimately, when someone asks, “Why <em>haven’t</em> we cured cancer yet?” the answer is “Because it’s hard”.</p> <h2 id="the-silver-lining">The Silver Lining</h2> <p>It’s important to remember that, as my colleague Rafa Irizarry <a href="http://simplystatistics.org/2013/08/01/the-roc-curves-of-science/">pointed out</a>, findings from many of the fields in the upper right box, especially clinical medicine, can have tremendous positive impacts on our lives when they do work out. Rafa says</p> <blockquote> <p>…I argue that the rate of discoveries is higher in biomedical research than in physics. But, to achieve this higher true positive rate, biomedical research has to tolerate a higher false positive rate.</p> </blockquote> <p>It is certainly possible to reduce the rate of false positives. One way would be to do no experiments at all! But is that what we want? 
Would that most benefit us as a society?</p> <h2 id="the-takeaway">The Takeaway</h2> <p>What to do? Here are a few thoughts:</p> <ul> <li>We need to stop thinking that any single study is definitive or confirmatory, no matter if it was a controlled experiment or not. Science is always a cumulative business, and the value of a given study should be understood in the context of what came before it.</li> <li>We have to recognize that some areas are more difficult to study and are less mature than other areas because of the lack of basic theory to guide us.</li> <li>We need to think about what the tradeoffs are for discovering many things that may not pan out relative to discovering only a few things. What are we willing to accept in a given field? This is a discussion that I’ve not seen much of.</li> <li>As Rafa pointed out in his article, we can definitely focus on things that make science better for everyone (better methods, rigorous designs, etc.).</li> </ul> A meta list of what to do at JSM 2016 2016-07-30T00:00:00+00:00 http://simplystats.github.io/2016/07/30/jsm2016 <p>I’m going to be heading out tomorrow for JSM 2016. If you want to catch up I’ll be presenting in the 6-8PM poster session on <a href="https://www.amstat.org/meetings/jsm/2016/onlineprogram/ActivityDetails.cfm?SessionID=213079">The Extraordinary Power of Data</a> on Sunday and on <a href="https://www.amstat.org/meetings/jsm/2016/onlineprogram/ActivityDetails.cfm?SessionID=212543">data visualization (and other things) in MOOCs</a> at 8:30am on Monday. Here is a little sneak preview, the first slide from my talk:</p> <p><img src="https://raw.githubusercontent.com/simplystats/simplystats.github.io/master/_images/firstslide.jpg" alt="Was too scared to use GIFs" /></p> <p>This year I am so excited that other people have done all the work of going through the program for me and picking out what talks to see. 
Here is a list of lists.</p> <ul> <li><a href="https://kbroman.wordpress.com/2016/07/27/my-jsm-2016-itinerary/">Karl Broman</a> - if you like open source software, data viz, and genomics.</li> <li><a href="https://blog.rstudio.org/2016/07/19/discover-r-and-rstudio-at-jsm-2016-chicago/">Rstudio</a> - if you like Rstudio</li> <li><a href="http://citizen-statistician.org/2016/07/29/my-jsm2016-itinerary/">Mine Cetinkaya Rundel</a> - if you like stat ed, data science, data viz, and data journalism.</li> <li><a href="https://twitter.com/DrJWolfson/status/758990552754827264">Julian Wolfson</a> - if you like missing sessions and guilt.</li> <li><a href="https://github.com/stephaniehicks/classroomNotes/blob/master/conferences/JSM2016.md">Stephanie Hicks</a> - if you like lots of sessions and can’t make up your mind (also stat genomics, open source software, stat computing, stats for social good…)</li> </ul> <p>If you know about more lists, please feel free to tweet at me or send pull requests.</p> <p>I also saw the materials for this <a href="https://github.com/simonmunzert/rscraping-jsm-2016">awesome tutorial on webscraping</a> that I’m sorry I’ll miss.</p> The relativity of raw data 2016-07-20T00:00:00+00:00 http://simplystats.github.io/2016/07/20/relativity-raw-data <p>“Raw data” is one of those terms that everyone in statistics and data science uses but no one defines. For example, we all agree that we should be able to recreate results in scientific papers from the raw data and the code for that paper.</p> <blockquote> <p>But what do we mean when we say raw data?</p> </blockquote> <p>When working with collaborators or students I often find myself saying - could you just give me the raw data so I can do the normalization or processing myself. 
To give a concrete example, I work in the analysis of data from <a href="http://www.nature.com/nbt/journal/v26/n10/full/nbt1486.html">high-throughput genomic sequencing experiments</a>.</p> <p>These experiments produce data by breaking up genomic molecules into short fragments of DNA - then reading off parts of those fragments to generate “reads” - usually 100 to 200 letters long per read. But the reads are just puzzle pieces that need to be fit back together and then quantified to produce measurements on DNA variation or gene expression abundances.</p> <p><img src="https://raw.githubusercontent.com/simplystats/simplystats.github.io/master/_images/sequencing.png" alt="High throughput sequencing" /></p> <p><a href="http://cbcb.umd.edu/~hcorrada/CFG/lectures/lect22_seqIntro/seqIntro.pdf">Image from Hector Corrada Bravo’s lecture notes</a></p> <p>When I say “raw data” to a collaborator, I mean the reads that are reported from the sequencing machine. To me that is the rawest form of the data I will look at. But to generate those reads the sequencing machine first (1) created a set of images for each letter in the sequence of reads, (2) measured the color at the spots on that image to get the quantitative measurement of which letter, and (3) calculated which letter was there with a confidence measure. The raw data I ask for only includes the confidence measure and the sequence of letters itself, but ignores the images and the colors extracted from them (steps 1 and 2).</p> <p>So to me the “raw data” is the files of reads. But to the people who produce the machine for sequencing the raw data may be the images or the color data. To my collaborator the raw data may be the quantitative measurements I calculate from the reads. 
When thinking about this I realized an important characteristic of raw data.</p> <blockquote> <p>Raw data is relative to your reference frame.</p> </blockquote> <p>In other words the raw data is raw to <em>you</em> if you have done no processing, manipulation, coding, or analysis of the data. That is, the file you received from the person before you is untouched. But it may not be the <em>rawest</em> version of the data. The person who gave you the raw data may have done some computations. They have a different “raw data set”.</p> <p>The implication for reproducibility and replicability is that we need a “chain of custody” just like with evidence collected by the police. As long as each person keeps a copy and record of the “raw data” to them you can trace the provenance of the data back to the original source.</p> Not So Standard Deviations Episode 19 - Divide by n-1, or n-2, or Whatever 2016-07-18T00:00:00+00:00 http://simplystats.github.io/2016/07/18/nssd-episode-19 <p>Hilary and I talk about statistical software in fMRI analyses, the differences between software testing differences in proportions (a must listen!), and a preview of JSM 2016.</p> <p>Also, Hilary and I have just published a new book, <a href="https://leanpub.com/conversationsondatascience?utm_source=SimplyStats&amp;utm_campaign=NSSD&amp;utm_medium=BlogPost">Conversations on Data Science</a>, which collects some of our episodes in an easy-to-read format. 
The book is available from Leanpub and will be updated as we record more episodes.</p> <p>If you have questions you’d like us to answer, you can send them to nssdeviations @ gmail.com or tweet us at <a href="https://twitter.com/nssdeviations">@NSSDeviations</a>.</p> <p><a href="https://itunes.apple.com/us/podcast/not-so-standard-deviations/id1040614570">Subscribe to the podcast on iTunes</a>.</p> <p><a href="https://play.google.com/music/listen?u=0#/ps/Izfnbx6tlruojkfrvhjfdj3nmna">Subscribe to the podcast on Google Play</a>.</p> <p>Please <a href="https://itunes.apple.com/us/podcast/not-so-standard-deviations/id1040614570">leave us a review on iTunes</a>!</p> <p>Support us through our <a href="https://www.patreon.com/NSSDeviations?ty=h">Patreon page</a>.</p> <p>Show Notes:</p> <ul> <li> <p><a href="http://www.theregister.co.uk/2016/07/03/mri_software_bugs_could_upend_years_of_research/?mt=1467760452040">fMRI bugs could upend years of research</a></p> </li> <li> <p><a href="http://www.pnas.org/content/113/28/7900.full">Eklund et al. 
PNAS Paper</a></p> </li> <li> <p><a href="https://www.amstat.org/meetings/jsm/2016/onlineprogram/index.cfm">JSM 2016 Program</a></p> </li> <li> <p><a href="https://leanpub.com/conversationsondatascience">Conversations on Data Science</a></p> </li> </ul> <p><a href="https://soundcloud.com/nssd-podcast/episode-19-divide-by-n-1-or-n-2-or-whatever">Download the audio for this episode</a>.</p> <p>Listen here:</p> <iframe width="100%" height="166" scrolling="no" frameborder="no" src="https://w.soundcloud.com/player/?url=https%3A//api.soundcloud.com/tracks/274214566&amp;color=ff5500&amp;auto_play=false&amp;hide_related=false&amp;show_comments=true&amp;show_user=true&amp;show_reposts=false"></iframe> Tuesday update 2016-07-11T00:00:00+00:00 http://simplystats.github.io/2016/07/11/tuesday-update <h2 id="it-might-all-be-wrong">It Might All Be Wrong</h2> <p>Tom Nichols and colleagues have published a paper on the software used to analyze fMRI data:</p> <blockquote> <p>Functional MRI (fMRI) is 25 years old, yet surprisingly its most common statistical methods have not been validated using real data. Here, we used resting-state fMRI data from 499 healthy controls to conduct 3 million task group analyses. Using this null data with different experimental designs, we estimate the incidence of significant results. In theory, we should find 5% false positives (for a significance threshold of 5%), but instead we found that the most common software packages for fMRI analysis (SPM, FSL, AFNI) can result in false-positive rates of up to 70%. 
These results question the validity of some 40,000 fMRI studies and may have a large impact on the interpretation of neuroimaging results.</p> </blockquote> <h2 id="criminal-justice-forecasts">Criminal Justice Forecasts</h2> <p>The <a href="http://www.theatlantic.com/technology/archive/2016/06/when-algorithms-take-the-stand/489566/">ongoing discussion</a> over the use of prediction algorithms in the criminal justice system reminds me a bit of the introduction of DNA evidence decades ago. Ultimately, there is a technology that few people truly understand and there are questions as to whether the information they provide is fair or accurate.</p> <h2 id="shameless-promotion">Shameless Promotion</h2> <p>I have a <a href="https://leanpub.com/conversationsondatascience">new book</a> coming out with Hilary Parker, based on our <em>Not So Standard Deviations</em> podcast. Sign up to be notified of its release (which should be Real Soon Now).</p> Not So Standard Deviations Episode 18 - Back on Planet Earth 2016-07-05T00:00:00+00:00 http://simplystats.github.io/2016/07/05/nssd-episode-18 <p>With Hilary fresh from Use R! 2016, Hilary and I discuss some of the highlights from the conference. 
Also, some followup about a previous Free Advertising and the NSSD drinking game.</p> <p>If you have questions you’d like us to answer, you can send them to nssdeviations @ gmail.com or tweet us at <a href="https://twitter.com/nssdeviations">@NSSDeviations</a>.</p> <p><a href="https://itunes.apple.com/us/podcast/not-so-standard-deviations/id1040614570">Subscribe to the podcast on iTunes</a>.</p> <p><a href="https://play.google.com/music/listen?u=0#/ps/Izfnbx6tlruojkfrvhjfdj3nmna">Subscribe to the podcast on Google Play</a>.</p> <p>Please <a href="https://itunes.apple.com/us/podcast/not-so-standard-deviations/id1040614570">leave us a review on iTunes</a>!</p> <p>Support us through our <a href="https://www.patreon.com/NSSDeviations?ty=h">Patreon page</a>.</p> <p>Show notes:</p> <ul> <li> <p><a href="http://www.vanityfair.com/hollywood/2016/06/jennifer-lawrence-theranos-elizabeth-holmes">Theranos movie with Jennifer Lawrence and Adam McKay</a></p> </li> <li> <p><a href="https://en.wikipedia.org/wiki/Snowden_(film)">Snowden movie</a></p> </li> <li> <p><a href="http://www.npr.org/2016/06/19/482514949/welcome-to-mongolias-new-postal-system-an-atlas-of-random-words">What3Words being used in Mongolia</a></p> </li> <li> <p><a href="https://github.com/jimhester/lintr">lintr package</a></p> </li> <li> <p><a href="https://youtu.be/dhh8Ao4yweQ">“The Electronic Coach” with Don Knuth</a></p> </li> <li> <p><a href="http://alyssafrazee.com/gender-and-github-code.html">Exploring the data on gender and GitHub repo ownership</a></p> </li> <li> <p><a href="https://blog.codinghorror.com/falling-into-the-pit-of-success/">Jeff Atwood “Falling Into the Pit of Success”</a></p> </li> <li> <p><a href="https://research.googleblog.com/2014/08/doing-data-science-with-colaboratory.html">Google coLaboratory</a></p> </li> <li> <p><a href="https://www.stickermule.com/marketplace/12936-number-rcatladies">#rcatladies stickers</a></p> </li> <li> <p><a 
href="https://twitter.com/astrokatie/status/745529809669787649">Katie Mack time-lapse video</a></p> </li> </ul> <p><a href="https://soundcloud.com/nssd-podcast/episode-18-back-on-planet-earth">Download the audio for this episode</a>.</p> <iframe width="100%" height="166" scrolling="no" frameborder="no" src="https://w.soundcloud.com/player/?url=https%3A//api.soundcloud.com/tracks/272064450&amp;color=ff5500&amp;auto_play=false&amp;hide_related=false&amp;show_comments=true&amp;show_user=true&amp;show_reposts=false"></iframe> Tuesday Update 2016-06-28T00:00:00+00:00 http://simplystats.github.io/2016/06/28/tuesday-update <h2 id="if-you-werent-sick-of-theranos-yet">If you weren’t sick of Theranos yet….</h2> <p>Looks like there will be a movie version of the <a href="http://simplystatistics.org/2016/05/23/update-on-theranos/">Theranos saga</a> which, as far as I can tell, isn’t over yet, but no matter. It will be done by Adam McKay, the writer-director of The Big Short (excellent film), and will star Jennifer Lawrence as Elizabeth Holmes. From <a href="http://www.vanityfair.com/hollywood/2016/06/jennifer-lawrence-theranos-elizabeth-holmes">Vanity Fair</a>:</p> <blockquote> <p>Legendary Pictures snapped up rights to the hot-button biopic for a reported $3 million Thursday evening, after outbidding and outlasting a swarm of competition from Warner Bros., Twentieth Century Fox, STX Entertainment, Regency Enterprises, Cross Creek, Amazon Studios, AG Capital, the Weinstein Company, and, in the penultimate stretch, Paramount, among other studio suitors.</p> </blockquote> <blockquote> <p>Based on a book proposal by two-time Pulitzer Prize-winning journalist John Carreyrou titled Bad Blood: Secrets and Lies in Silicon Valley, the project (reported to be in the $40 million to $50 million budget range) has made the rounds to almost every studio in town. 
It’s been personally pitched by McKay, who won an Oscar for best adapted screenplay for last year’s rollicking financial meltdown procedural The Big Short.</p> </blockquote> <p>Frankly, I think we all know how this movie will end.</p> <h2 id="the-people-vs-oj-simpson-vsstatistics">The People vs. OJ Simpson vs….Statistics</h2> <p>I’m in the middle of watching <a href="https://en.wikipedia.org/wiki/The_People_v._O._J._Simpson:_American_Crime_Story">The People vs. OJ Simpson</a> and so far it is fantastic—I highly recommend it. One thing that is not represented in the show is the important role that statistics played in the trial. The trial was just in the early days of using DNA as evidence in criminal trials and there were many questions about how likely it was to find DNA matches in blood.</p> <p>Terry Speed ended up testifying for the defense (Simpson) and in this <a href="http://www.statisticsviews.com/details/feature/4915471/To-some-statisticians-a-number-is-a-number-but-to-me-a-number-is-packed-with-his.html">nice interview</a>, he explains how that came to be:</p> <blockquote> <p>At the beginning of the Simpson trial, there was going to be a pre-trial hearing and experts from both sides would argue in front of the judge as to what approaches should be accepted. Other pre-trial activities dragged on, and the one on DNA forensics was eventually scrapped. The DNA experts, including me were then asked whether they wanted to give evidence for the prosecution or defence, or leave. 
I did not initially plan to join the defence team, but wished to express my point of view in what was more or less a scientific environment before the trial started, but when the pre-trial DNA hearing was scrapped, I decided that I had no choice but to express my views in court on behalf of the defence, which I did.</p> </blockquote> <p>The full interview is well worth the read.</p> <h2 id="ai-is-the-residual">AI is the residual</h2> <p>I just recently found out about the <a href="https://en.m.wikipedia.org/wiki/AI_effect">AI effect</a> which I thought was interesting. Basically, “AI” is whatever can’t be explained, or in other words, the residuals of machine learning.</p> A Year at Stack Overflow 2016-06-28T00:00:00+00:00 http://simplystats.github.io/2016/06/28/stack-overflow-drob <p>David Robinson (<a href="https://twitter.com/drob">@drob</a>) has a great post on his blog about his <a href="http://varianceexplained.org/r/year_data_scientist/">first year as a data scientist at Stack Overflow</a>. This section in particular stood out for me:</p> <blockquote> <p>I like using R to learn interesting things about our data, but my longer term goal is to make it easy for any of our engineers to do so….Towards this goal, I’ve been focusing on building reliable tools and frameworks that people can apply to a variety of problems, rather than “one-off” analysis scripts. (There’s an awesome post by Jeff Magnusson at StitchFix about some of these general challenges). My approach has been building internal R packages, similar to AirBnb’s strategy (though our data team is quite a bit younger and smaller than theirs). These internal packages can query databases and parsing our internal APIs, including making various security and infrastructure issues invisible to the user.</p> </blockquote> <p>The world needs an army of David Robinsons.</p> Ultimate AI battle - Apple vs. 
Google 2016-06-14T00:00:00+00:00 http://simplystats.github.io/2016/06/14/ultimate-ai-battle <p>Yesterday, Apple launched its Worldwide Developers Conference (WWDC) and had its public keynote address. While many new things were announced, the one thing that caught my eye was the <a href="http://go.theinformation.com/HnOAdA6DQ7g">dramatic expansion</a> of Apple’s use of artificial intelligence (AI) tools. I talked a bit about AI with Hilary Parker on the latest <a href="http://simplystatistics.org/2016/06/09/nssd-episode-17/"><em>Not So Standard Deviations</em></a>, particularly in the context of Amazon’s Echo/Alexa, and I think it’s definitely going to be an area of intense competition between the major tech companies.</p> <p>Pretty much every major tech player is involved in AI: Google, Facebook, Amazon, Apple, Microsoft, and the list goes on. Recently, <a href="https://marco.org/2016/05/21/avoiding-blackberrys-fate">some commentators</a> <a href="https://stratechery.com/2015/tim-cooks-unfair-and-unrealistic-privacy-speech-strategy-credits-the-privacy-priority-problem/">have suggested</a> that Apple in particular will never catch up with the likes of Google with respect to AI because of Apple’s strict stance on privacy and unwillingness to gather/aggregate data from all its users. However, yesterday at WWDC, Apple revealed a few clues about what it was up to in the AI world.</p> <p>First, Apple mentioned deep learning more than a few times, including specifically calling out its use of <a href="https://en.wikipedia.org/wiki/Long_short-term_memory">LSTM</a> in its Messages app and facial recognition in its Photos app. Previously, Apple had been rumored to be applying deep learning to its <a href="http://go.theinformation.com/4Z2WhEs9_Nc">Siri assistant and its fingerprint sensor</a>. 
At WWDC, Craig Federighi stressed Apple’s continued focus on privacy and how Apple does not need to develop “user profiles” server-side, but rather does most computation on-device (in this case, on the iPhone).</p> <p>However, it can’t be that Apple does all its deep learning computation on the iPhone. These models tend to be enormous and take advantage of reams of data that can only be reasonably processed server-side. Unfortunately, because most companies (Apple in particular) release few details about what they do, we may never know how this works. But we can definitely speculate!</p> <h2 id="apple-vs-google">Apple vs. Google</h2> <p>I think the Apple/Google dichotomy provides an interesting opportunity to talk about how models can be learned using data in different ways. There are two approaches being represented here by Apple and Google:</p> <ul> <li><strong>Google way</strong> - Collect lots of data from users and store it on a server in the Googleplex somewhere. Then use that data to fit an enormous model that can predict when you’ve taken a picture of a cat. As users generate more data, bring that data back to the Googleplex and update/refine the model.</li> <li><strong>Apple way</strong> - Build a “starter model” in the Apple <a href="http://9to5mac.com/2015/10/05/spaceship-campus-2-drone-video-october/">Mothership</a>. As users generate data on their phones, bring the model to the phone and update the model using just their data. Bring the updated model back to the Apple Mothership and leave the user’s data on the phone.</li> </ul> <p>The easiest way to understand this difference is probably with the arithmetic mean, which is perhaps the simplest “model”. Suppose you have a bunch of users out there and you want to compute the average of some attribute that they have on their phones (or whatever device). 
The first approach would be to get all that data and compute the mean in the usual way.</p> <p><img src="https://raw.githubusercontent.com/simplystats/simplystats.github.io/master/_images/googleway.png" alt="Google way" /></p> <p>Once all the data is in the Googleplex, we can just use the formula</p> <p><img src="https://raw.githubusercontent.com/simplystats/simplystats.github.io/master/_images/Googlemean.png" alt="Google mean" /></p> <p>I’ll call this the “Google mean” because it requires that you get all the data X<sub>1</sub> through X<sub>n</sub>, then sum them up and divide by n. Here, each of the X<sub>i</sub>’s represents the ith user’s data. The general principle here is to gather all the data and then estimate the model parameters server-side.</p> <p>What if you didn’t want to gather everyone’s data centrally? Can you still compute the mean?</p> <p><img src="https://raw.githubusercontent.com/simplystats/simplystats.github.io/master/_images/appleway.png" alt="Apple way" /></p> <p>Yes, because there’s a nice recurrence formula for the mean:</p> <p><img src="https://raw.githubusercontent.com/simplystats/simplystats.github.io/master/_images/Applemean.png" alt="Apple mean" /></p> <p>We can call this the “Apple mean”. With this strategy, we can send our current estimate of the mean to each user, update our estimate by taking the weighted average of the old value and the new value, and then move on to the next user. Here, you send the model parameters out to the users, update those parameters and then bring the parameters back.</p> <p>Which method is better? Well, in this case, both give you the same answer. In general, for linear models (like the mean), you can usually rework the formulas to build out either “whole data” (Google) approaches or “streaming” (Apple) approaches and get pretty much the same answer. 
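</p> <p>To make the equivalence concrete, here is a minimal Python sketch of the two ways of computing the mean. The function names <code>google_mean</code> and <code>apple_mean</code> and the sample values are made up for illustration, and the update rule is the standard running-mean recurrence, which I’m assuming is what the formula above expresses. It is not a description of what either company actually runs.</p>

```python
from typing import Iterable, List


def google_mean(values: List[float]) -> float:
    """Whole-data approach: gather every user's value centrally, then average."""
    return sum(values) / len(values)


def apple_mean(values: Iterable[float]) -> float:
    """Streaming approach: carry only the running estimate from user to user."""
    mean = 0.0
    for n, x in enumerate(values, start=1):
        # Weighted average of the old estimate and the new user's value:
        # mean_n = ((n - 1) / n) * mean_{n-1} + (1 / n) * x_n
        mean = ((n - 1) / n) * mean + x / n
    return mean


# Hypothetical data: one value per user.
user_values = [2.0, 4.0, 6.0, 8.0]
print(google_mean(user_values))
print(apple_mean(user_values))
```

<p>On this toy data both functions return the same value, up to floating-point rounding. The streaming version only ever holds the current estimate and a count, which is the point of the “Apple way”: each user’s value is folded into the estimate and never has to be stored centrally.</p> <p>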
But for non-linear models, it’s not so simple and you usually cannot achieve this kind of equivalence.</p> <h2 id="clients-and-servers">Clients and Servers</h2> <p>Balancing how much work is done on a server and how much is done on the client is an age-old computing problem and, over time, the balance of work between client and server seems to shift back and forth like a pendulum. When I was in grad school, we had so-called “dumb terminals” that were basically a screen that you used to log in to the server. Today, I use my laptop for computing/work and that’s it. But I use the cloud for many other tasks.</p> <p>The Apple approach definitely requires a “fatter” client because the work of integrating current model parameters with new user data has to happen on the phone. With the Google approach, all the phone has to do is be able to collect the data and send it over the network to Google.</p> <p>The Apple approach is also closely related to what my colleagues <a href="http://www.biostat.jhsph.edu/~mlindqui/">Martin Lindquist</a> and <a href="http://www.bcaffo.com">Brian Caffo</a> refer to as “fusion science”, whereby Big Data and “Small Data” can be fused together via models to improve inference, but without ever having to actually combine the data. In a Bayesian context, you might think of the Big Data as making up the prior distribution and the Small Data as the likelihood. The Small Data can be used to update the model parameters and produce the posterior distribution, after which the Small Data can be thrown out.</p> <h2 id="and-the-winner-is">And the Winner is…</h2> <p>It’s not clear to me which approach is better in terms of building a better model for prediction or inference. Sadly, we may never have enough details to find out, and will only be able to evaluate which approach is better by the performance of the systems in the marketplace. 
But perhaps that’s the way things should be evaluated in this case?</p> Good list of good books 2016-06-13T00:00:00+00:00 http://simplystats.github.io/2016/06/13/good-books <p>The MultiThreaded blog over at Stitch Fix (hat tip to Hilary Parker) has posted a <a href="http://multithreaded.stitchfix.com/blog/2016/06/09/ds-books/">really nice list of data science books</a> (disclosure: one of <a href="https://leanpub.com/artofdatascience/">my books</a> is on the list).</p> <blockquote> <p>We’ve queried our data science team for some of their favorite data science books. This list is by no means exhaustive, but should keep any data scientist/engineer new or old learning and entertained for many an evening.</p> </blockquote> <p>Enjoy!</p> Not So Standard Deviations Episode 17 - Diurnal High Variance 2016-06-09T00:00:00+00:00 http://simplystats.github.io/2016/06/09/nssd-episode-17 <p>Hilary and I talk about Amazon Echo and Alexa as AI as a service, the COMPAS algorithm, criminal justice forecasts, and whether algorithms can introduce or remove bias (or both).</p> <p>If you have questions you’d like us to answer, you can send them to nssdeviations @ gmail.com or tweet us at <a href="https://twitter.com/nssdeviations">@NSSDeviations</a>.</p> <p><a href="https://itunes.apple.com/us/podcast/not-so-standard-deviations/id1040614570">Subscribe to the podcast on iTunes</a>.</p> <p><a href="https://play.google.com/music/listen?u=0#/ps/Izfnbx6tlruojkfrvhjfdj3nmna">Subscribe to the podcast on Google Play</a>.</p> <p>Please <a href="https://itunes.apple.com/us/podcast/not-so-standard-deviations/id1040614570">leave us a review on iTunes</a>!</p> <p>Support us through our <a href="https://www.patreon.com/NSSDeviations?ty=h">Patreon page</a>.</p> <p>Show notes:</p> <ul> <li> <p><a href="http://www.wired.com/2016/03/two-moves-alphago-lee-sedol-redefined-future/">In Two Moves, AlphaGo and Lee Sedol Redefined the Future</a></p> </li> <li> <p><a 
href="http://qz.com/639952/googles-ai-won-the-game-go-by-defying-millennia-of-basic-human-instinct/">Google’s AI won the game Go by defying millennia of basic human instinct</a></p> </li> <li> <p><a href="https://www.propublica.org/article/machine-bias-risk-assessments-in-criminal-sentencing">Machine Bias: There’s Software Used Across the Country to Predict Future Criminals. And it’s Biased Against Blacks</a></p> </li> <li> <p><a href="https://www.propublica.org/article/how-we-analyzed-the-compas-recidivism-algorithm">ProPublica analysis of COMPAS</a></p> </li> <li> <p><a href="http://www.amazon.com/Criminal-Justice-Forecasts-Risk-SpringerBriefs/dp/1461430844?ie=UTF8&amp;*Version*=1&amp;*entries*=0">Richard Berk’s <em>Criminal Justice Forecasts of Risk</em></a></p> </li> <li> <p><a href="http://www.amazon.com/Weapons-Math-Destruction-Increases-Inequality/dp/0553418815">Cathy O’Neil’s <em>Weapons of Math Destruction</em></a></p> </li> <li> <p><a href="https://mathbabe.org/2016/04/07/ill-stop-calling-algorithms-racist-when-you-stop-anthropomorphizing-ai/">I’ll stop calling algorithms racist when you stop anthropomorphizing AI</a></p> </li> <li> <p><a href="https://cran.r-project.org/web/packages/rmsfact/index.html">RMS Fact package</a></p> </li> <li> <p><a href="http://user2016.org">Use R! 
2016</a></p> </li> </ul> <p><a href="https://soundcloud.com/nssd-podcast/episode-17-diurnal-high-variance">Download the audio for this episode.</a></p> <iframe width="100%" height="166" scrolling="no" frameborder="no" src="https://w.soundcloud.com/player/?url=https%3A//api.soundcloud.com/tracks/268232081&amp;color=ff5500&amp;auto_play=false&amp;hide_related=false&amp;show_comments=true&amp;show_user=true&amp;show_reposts=false"></iframe> Defining success - Four secrets of a successful data science experiment 2016-06-03T00:00:00+00:00 http://simplystats.github.io/2016/06/03/defining-success <p><em>Editor’s note: This post is excerpted from the book <a href="https://leanpub.com/eds">Executive Data Science: A Guide to Training and Managing the Best Data Scientists</a>, written by myself, Brian Caffo, and Jeff Leek. This particular section was written by Brian Caffo.</em></p> <p>Defining success is a crucial part of managing a data science experiment. Of course, success is often context specific. However, some aspects of success are general enough to merit discussion. A list of hallmarks of success includes:</p> <ol> <li>New knowledge is created.</li> <li>Decisions or policies are made based on the outcome of the experiment.</li> <li>A report, presentation, or app with impact is created.</li> <li>It is learned that the data can’t answer the question being asked of it.</li> </ol> <p>Some more negative outcomes include: Decisions being made that disregard clear evidence from the data, equivocal results that do not shed light in one direction or another, uncertainty which prevents new knowledge from being created.</p> <p>Let’s discuss some of the successful outcomes first.</p> <p>New knowledge seems ideal in many cases (especially since we are academics), but new knowledge doesn’t necessarily mean that it’s important. If this new knowledge produces actionable decisions or policies, that’s even better. 
The idea of having evidence-based policy, while perhaps newer than the analogous evidence-based medicine movement that has transformed medical practice, has the potential to similarly transform public policy. Finally, it is of course ideal when our data science products have great (positive) impact on an audience that is much wider than a group of data scientists. Creating reusable code or apps is a great way to increase the impact of a project and to disseminate its findings.</p> <p>The fourth point is perhaps the most controversial. I view it as a success if we can show that the data can’t answer the questions being asked. I am reminded of a friend who told a story of the company he worked at. They hired many expensive prediction consultants to help use their data to inform pricing. However, the prediction results weren’t helping. They were able to prove that the data couldn’t test the hypothesis under study. There was too much noise and the measurements just weren’t accurately measuring what was needed. Sure, the result wasn’t optimal, as they still needed to know how to price things, but it did save money on consultants. I have since heard this story repeated nearly identically by friends in different industries.</p> Sometimes the biggest challenge is applying what we already know 2016-05-31T00:00:00+00:00 http://simplystats.github.io/2016/05/31/barrier-to-medication <p>There’s definitely a need to innovate and develop new treatments in the area of asthma, but it’s easy to underestimate the barriers to just doing what we already know, such as making sure that people are following existing, well-established guidelines on how to treat asthma. 
My colleague, Elizabeth Matsui, has <a href="http://skybrudeconsulting.com/blog/2016/05/31/barriers-medication.html">written about the challenges</a> in a <a href="https://clinicaltrials.gov/ct2/show/NCT02251379?term=ecatch&amp;rank=1">study</a> that we are collaborating on:</p> <blockquote> <p>Our group is currently conducting a study that includes implementation of national guidelines-based medical care for asthma, so that one process that we’ve had to get right is to <strong>prescribe an appropriate dose of medication and get it into the family’s hands</strong>. [emphasis added]</p> </blockquote> <p>Seems simple, right?</p> Sometimes there's friction for a reason 2016-05-24T00:00:00+00:00 http://simplystats.github.io/2016/05/24/somtimes-theres-friction-for-a-reason <p>Thinking about <a href="http://simplystatistics.org/2016/05/23/update-on-theranos/">my post on Theranos</a> yesterday it occurred to me that one thing that’s great about all of the innovation and technology coming out of places like Silicon Valley is the tremendous reduction of friction in our lives. With Uber it’s much easier to get a ride because of improvement in communication and an increase in the supply of cars. With Amazon, I can get that jug of <a href="http://www.amazon.com/Wesson-Pure-100%25-Natural-Vegetable/dp/B007F1KVX8/ref=sr_1_2_a_it?ie=UTF8&amp;qid=1464092378&amp;sr=8-2&amp;keywords=vegetable+oil">vegetable oil</a> that I always wanted without having to leave the house, because Amazon.</p> <p>So why is there all this friction? Sometimes it’s because of regulation, which may have made sense at an earlier time, but perhaps doesn’t make as much sense now. Sometimes, you need a company like Amazon to really master the logistics operation to be able to deliver anything anywhere. Otherwise, you’re just stuck driving to the grocery store to get that vegetable oil.</p> <p>But sometimes there’s friction for a reason. 
For example, <a href="https://stratechery.com/2013/friction/">Ben Thompson talks about</a> how previously there was quite a bit more friction involved before law enforcement could listen in on our communications. Although wiretapping had long been around (as <a href="http://davidsimon.com/we-are-shocked-shocked/">noted</a> by David Simon of…<a href="http://www.hbo.com/the-wire">The Wire</a>) the removal of all friction by the NSA made the situation quite different. Related to this idea is the massive <a href="http://www.vox.com/2016/5/12/11666116/70000-okcupid-users-data-release">data release from OkCupid</a> a few weeks ago, as I discussed on the latest <a href="https://soundcloud.com/nssd-podcast/episode-16-the-silicon-valley-episode">Not So Standard Deviations</a> podcast episode. Sure, your OkCupid profile is visible to everyone with an account, but having someone vacuum up 70,000 profiles and dump them on the web for anyone to view is not what people signed up for; there is a qualitative difference there.</p> <p>When it comes to Theranos and diagnostic testing in general, there is similarly a need for some friction in order to protect public health. John Ioannidis notes in his <a href="http://jama.jamanetwork.com/article.aspx?articleid=2524161#.Vz-lkeuAj9p.twitter">commentary for JAMA</a>:</p> <blockquote> <p>Even if the tests were accurate, when they are performed in massive scale and multiple times, the possibility of causing substantial harm from widespread testing is very real, as errors accumulate with multiple testing. Repeated testing of an individual is potentially a dangerous self-harm practice, and these individuals are destined to have some incorrect laboratory results and eventually experience harm, such as, for example, the anxiety of being labeled with a serious condition or adverse effects from increased testing and procedures to evaluate false-positive test results. 
Moreover, if the diagnostic testing process becomes dissociated from physicians, self-testing and self-interpretation could cause even more problems than they aim to solve.</p> </blockquote> <p>Unlike with the NSA, where the differences in scale may be difficult to quantify because the exact extent of the program is unknown to most people, with diagnostic testing, we can <a href="https://en.wikipedia.org/wiki/Bayes%27_theorem">precisely quantify</a> how a diagnostic test’s characteristics will change if we apply it to 1,000 people vs. 1,000,000 people. This is why organizations like the US Preventive Services Task Force so carefully consider recommendations for testing or screening (and why they have a really tough job).</p> <p>I’ll admit that a lot of the friction in our daily lives is pointless and it would be great to reduce it if possible. But in many cases, it was us that put the friction there for a reason, and it’s sometimes good to think about why before we move to eliminate it.</p> Update On Theranos 2016-05-23T00:00:00+00:00 http://simplystats.github.io/2016/05/23/update-on-theranos <p>I think it’s fair to say that things for Theranos, the Silicon Valley blood testing company, are not looking up. From the Wall Street Journal (via <a href="http://www.theverge.com/2016/5/19/11711004/theranos-voids-edison-blood-test-results">The Verge</a>):</p> <blockquote> <p>Theranos has voided two years of results from its Edison blood-testing machines, issuing tens of thousands of corrected reports to patients and doctors and raising the possibility that many health care decisions may have been made based on inaccurate data. The Wall Street Journal first reported the news, saying that many of the corrected tests have been run using traditional machinery. 
One doctor told the Journal that she sent a patient to the emergency room after seeing abnormal results from a Theranos test; the corrected report returned normal readings.</p> </blockquote> <p>Furthermore, <a href="http://jama.jamanetwork.com/article.aspx?articleid=2524161#.Vz-lkeuAj9p.twitter">this commentary in JAMA</a> from John Ioannidis emphasizes the need for caution when implementing testing on a massive scale. In particular, “The notion of patients and healthy people being repeatedly tested in supermarkets and pharmacies, or eventually in cafeterias or at home, sounds revolutionary, but little is known about the consequences” and the consequences really matter here. In addition, as the title of the commentary would indicate, research done in secret is not research at all. For there to be credibility for a company like this, data needs to be made public.</p> <p>I <a href="http://simplystatistics.org/2015/10/28/discussion-of-the-theranos-controversy-with-elizabeth-matsui/">continue to maintain</a> that the fundamental premise on which the company is built, as stated by its founder Elizabeth Holmes, is flawed. Two claims are repeatedly made in the context of Theranos:</p> <ul> <li><strong>More testing is better</strong>. Anyone who stayed awake in their introduction to Bayesian statistics lecture knows this is very difficult to make true in all circumstances, no matter how accurate a test is. With rare diseases, false positives can far outnumber true positives and can have very real harmful effects on people. Combine testing on a massive scale, with repeated application over time, and you get a recipe for confusion.</li> <li><strong>People do not get tested because they are afraid of needles</strong>. Elizabeth Holmes makes a big deal about her personal fear of needles and its impact on her (not) getting blood tests done. 
I have no doubt that many people share this fear, but I have serious doubt that this is the reason people don’t get the medical testing done. There are <a href="http://www.rwjf.org/en/library/research/2012/02/special-issue-of-health-services-research-links-health-care-rese/nonfinancial-barriers-and-access-to-care-for-us-adults.html">many barriers</a> to people getting the medical care that they need, many that are non-financial in nature and do not include fear of needles. The problem of getting people the medical care that they need is one deserving of serious attention, but changing the manner in which blood is collected is not going to do it.</li> </ul> Not So Standard Deviations Episode 16 - The Silicon Valley Episode 2016-05-23T00:00:00+00:00 http://simplystats.github.io/2016/05/23/nssd-episode-16 <p>Roger and Hilary are back, with Hilary broadcasting from the west coast. Hilary and Roger discuss the possibility of scaling data analysis and how that may or may not work for companies like Palantir. 
Also, the latest on Theranos and the release of data from OkCupid.</p> <p>If you have questions you’d like us to answer, you can send them to nssdeviations @ gmail.com or tweet us at <a href="https://twitter.com/nssdeviations">@NSSDeviations</a>.</p> <p><a href="https://itunes.apple.com/us/podcast/not-so-standard-deviations/id1040614570">Subscribe to the podcast on iTunes</a>.</p> <p><a href="https://play.google.com/music/listen?u=0#/ps/Izfnbx6tlruojkfrvhjfdj3nmna">Subscribe to the podcast on Google Play</a>.</p> <p>Please <a href="https://itunes.apple.com/us/podcast/not-so-standard-deviations/id1040614570">leave us a review on iTunes</a>!</p> <p>Support us through our <a href="https://www.patreon.com/NSSDeviations?ty=h">Patreon page</a>.</p> <p>Show notes:</p> <ul> <li> <p><a href="https://www.buzzfeed.com/williamalden/inside-palantir-silicon-valleys-most-secretive-company">BuzzFeed Article on Palantir</a></p> </li> <li> <p><a href="http://simplystatistics.org/2016/05/11/palantir-struggles/">Roger’s Simply Statistics post on Palantir</a></p> </li> <li> <p><a href="https://looker.com">Looker</a></p> </li> <li> <p><a href="http://simplystatistics.org/2015/03/17/data-science-done-well-looks-easy-and-that-is-a-big-problem-for-data-scientists/">Data science done well looks easy</a></p> </li> <li> <p><a href="http://www.wsj.com/articles/theranos-voids-two-years-of-edison-blood-test-results-1463616976">Latest on Theranos</a></p> </li> <li> <p><a href="http://www.vox.com/2016/5/12/11666116/70000-okcupid-users-data-release">OkCupid Data Release</a></p> </li> <li> <p><a href="http://fr.slideshare.net/sblank/secret-history-why-stanford-and-not-berkeley">Secret history of Silicon Valley</a></p> </li> <li> <p><a href="https://blog.wealthfront.com">Wealthfront blog</a></p> </li> </ul> <p><a href="https://soundcloud.com/nssd-podcast/episode-16-the-silicon-valley-episode">Download the audio for this episode.</a></p> <iframe width="100%" height="166" scrolling="no" 
frameborder="no" src="https://w.soundcloud.com/player/?url=https%3A//api.soundcloud.com/tracks/265158223&amp;color=ff5500&amp;auto_play=false&amp;hide_related=false&amp;show_comments=true&amp;show_user=true&amp;show_reposts=false"></iframe> What is software engineering for data science? 2016-05-18T00:00:00+00:00 http://simplystats.github.io/2016/05/18/software-engineering-data-science <p><em>Editor’s note: This post is a chapter from the book <a href="https://leanpub.com/eds">Executive Data Science: A Guide to Training and Managing the Best Data Scientists</a>, written by myself, Brian Caffo, and Jeff Leek.</em></p> <p>Software is the generalization of a specific aspect of a data analysis. If specific parts of a data analysis require implementing or applying a number of procedures or tools together, software is the encompassing of all these tools into a specific module or procedure that can be repeatedly applied in a variety of settings. Software allows for the systematizing and the standardizing of a procedure, so that different people can use it and understand what it’s going to do at any given time.</p> <p>Software is useful because it formalizes and abstracts the functionality of a set of procedures or tools, by developing a well defined interface to the analysis. Software will have an interface, or a set of inputs and a set of outputs that are well understood. People can think about the inputs and the outputs without having to worry about the gory details of what’s going on underneath. Now, they may be interested in those details, but the application of the software at any given setting will not necessarily depend on the knowledge of those details. Rather, the knowledge of the <em>interface</em> to that software is important to using it in any given situation.</p> <p>For example, most statistical packages will have a linear regression function which has a very well defined interface. 
Typically, you’ll have to input things like the outcome and the set of predictors, and maybe there will be some other inputs like the data set or weights. Most linear regression functions work roughly this way. And importantly, the user does not have to know exactly how the linear regression calculation is done underneath the hood. Rather, they only need to know that they need to specify the outcome, the predictors, and a couple of other things. The linear regression function abstracts all the details that are required to implement linear regression, so that the user can apply the tool in a variety of settings.</p> <p>There are three levels of software that are important to consider, going from the simplest to the most abstract.</p> <ol> <li>At the first level you might just have some code that you wrote, and you might want to encapsulate the automation of a set of procedures using a loop (or something similar) that repeats an operation multiple times.</li> <li>The next step might be some sort of function. Regardless of what language you may be using, often there will be some notion of a function, which is used to encapsulate a set of instructions. And the key thing about a function is that you’ll have to define some sort of interface, which will be the inputs to the function. The function may also have a set of outputs, or it may have some side effect, for example if it’s a plotting function. Now the user only needs to know those inputs and what the outputs will be. This is the first level of abstraction that you might encounter, where you have to actually define an interface to that function.</li> <li>The highest level is an actual software package, which will often be a collection of functions and other things. That will be a little bit more formal because there’ll be a very specific interface or API that a user has to understand. 
Often for a software package there’ll be a number of convenience features for users, like documentation, examples, or tutorials that may come with it, to help the user apply the software to many different settings. A full-on software package will be most general in the sense that it should be applicable to more than one setting.</li> </ol> <p>One question that you’ll find yourself asking is at what point you need to systematize common tasks and procedures across projects versus recreating code or writing new code from scratch on every new project. It depends on a variety of factors and answering this question may require communication within your team, and with people outside of your team. You may need to develop an understanding of how often a given process is repeated, or how often a given type of data analysis is done, in order to weigh the costs and benefits of investing in developing a software package or something similar.</p> <p>Within your team, you may want to ask yourself, “Is the data analysis you’re going to do something that you are going to build upon for future work, or is it just going to be a one shot deal?” In our experience, there are relatively few one shot deals out there. Often you will have to do a certain analysis more than once, twice, or even three times, at which point you’ve reached the threshold where you want to write some code, write some software, or at least a function. The point at which you need to systematize a given set of procedures is going to be sooner than you think it is. The initial investment for developing more formal software will be higher, of course, but that will likely pay off in time savings down the road.</p> <p>A basic rule of thumb is</p> <ul> <li>If you’re going to do something <strong>once</strong> (that does happen on occasion), just write some code and document it very well. 
The important thing is that you want to make sure that you understand what the code does, and so that requires both writing the code well and writing documentation. You want to be able to reproduce it later on if you ever come back to it, or if someone else comes back to it.</li> <li>If you’re going to do something <strong>twice</strong>, write a function. This allows you to abstract a small piece of code, and it forces you to define an interface, so you have well defined inputs and outputs.</li> <li>If you’re going to do something <strong>three</strong> times or more, you should think about writing a small package. It doesn’t have to be commercial level software, but a small package which encapsulates the set of operations that you’re going to be doing in a given analysis. It’s also important to write some real documentation so that people can understand what’s supposed to be going on, and can apply the software to a different situation if they have to.</li> </ul> Disseminating reproducible research is fundamentally a language and communication problem 2016-05-13T00:00:00+00:00 http://simplystats.github.io/2016/05/13/reproducible-research-language <p>Just about 10 years ago, I wrote my <a href="http://www.ncbi.nlm.nih.gov/pubmed/16510544">first</a> of many articles about the importance of reproducible research. Since that article, one of the points I’ve made is that the key issue to resolve was one of tools and infrastructure. At the time, many people were concerned that people would not want to share data and that we had to spend a lot of energy finding ways to either compel or incentivize them to do so. 
But the reality was that it was difficult to answer the question “What should I do if I desperately want to make my work reproducible?” Back then, even if you could convince a clinical researcher to use R and LaTeX to create a <a href="https://en.wikipedia.org/wiki/Sweave">Sweave</a> document (!), it was not immediately obvious where they should host the document, code, and data files.</p> <p>Much has happened since then. We now have knitr and Markdown for live documents (as well as iPython notebooks and the like). We also have git, GitHub, and friends, which provide free code sharing repositories in a distributed manner (unlike older systems like CVS and Subversion). With the recent announcement of the <a href="http://www.arfon.org/announcing-the-journal-of-open-source-software">Journal of Open Source Software</a>, posting code on GitHub can now be recognized within the current system of credits and incentives. Finally, the number of <a href="http://dataverse.org">data</a> <a href="https://osf.io">repositories</a> has grown, providing more places for researchers to deposit their data files.</p> <p>Is the tools and infrastructure problem solved? I’d say yes. One thing that has become clear is that disseminating reproducible research is <strong>no longer a software problem</strong>. At least in R land, building live documents that can be executed by others is straightforward and not too difficult to pick up (thank you <a href="https://daringfireball.net/projects/markdown/">John Gruber</a>!). For other languages there are many equivalent (if not better) tools for writing documents that mix code and text. For this kind of thing, there’s just no excuse anymore. Could things be optimized a bit for some edge cases? Sure, but the tools are perfectly fine for the vast majority of use cases.</p> <p>But now there is a bigger problem that needs to be solved, which is that <strong>we do not have an effective way to communicate data analyses</strong>. 
One might think that publishing the full code and datasets is the perfect way to communicate a data analysis, but in a way, it is too perfect. That approach can provide too much information.</p> <p>I find the following analogy useful for discussing this problem. If you look at music, one way to communicate music is to provide an audio file, a standard WAV file or something similar. In a way, that is a near-perfect representation of the music—bit-for-bit—that was produced by the performer. If I want to experience a Beethoven symphony the way that it was meant to be experienced, I’ll listen to a <a href="https://itun.es/us/TudVe?i=79443286">recording of it</a>.</p> <p>But if I want to understand how Beethoven wrote the piece—the process and the details—the recording is not a useful tool. What I look at instead is <a href="http://www.amazon.com/dp/0486260348">the score</a>. The recording is a serialization of the music. The score provides an expanded representation of the music that shows all of the different pieces and how they fit together. A person with a good ear can often reconstruct the score, but this is a difficult and time-consuming task. Better to start with what the composer wrote originally.</p> <p>Over centuries, classical music composers developed a language and system for communicating their musical ideas so that</p> <ol> <li>there was enough detail that a 3rd party could interpret the music and perform it to a level of accuracy that satisfied the composer; but</li> <li>it was not so prescriptive or constraining so that different performers could not build on the work and incorporate their own ideas</li> </ol> <p>It would seem that traditional computer code satisfies those criteria, but I don’t think so. Traditional computer code (even R code) is designed to communicate programming concepts and constructs, not to communicate data analysis constructs. 
For example, a <code class="highlighter-rouge">for</code> loop is not a data analysis concept, even though we may use <code class="highlighter-rouge">for</code> loops all the time in data analysis.</p> <p>Because of this disconnect between computer code and data analysis, I often find it difficult to understand what a data analysis is doing, even if it is coded very well. I imagine this is what programmers felt like when programming in processor-specific <a href="https://en.wikipedia.org/wiki/Assembly_language">assembly language</a>. Before languages like C were developed, where high-level concepts could be expressed, you had to know the gory details of how each CPU operated.</p> <p>The closest thing that I can see to a solution emerging is the work that Hadley Wickham is doing with packages like <a href="https://github.com/hadley/dplyr">dplyr</a> and <a href="https://github.com/hadley/ggplot2">ggplot2</a>. The <code class="highlighter-rouge">dplyr</code> package’s verbs (<code class="highlighter-rouge">filter</code>, <code class="highlighter-rouge">arrange</code>, etc.) represent data manipulation concepts that are meaningful to analysts. But we still have a long way to go to cover all of data analysis in this way.</p> <p>Reproducible research is important because it is fundamentally about communicating what you have done in your work. Right now we have a sub-optimal way to communicate what was done in a data analysis, via traditional computer code. 
I think developing a new approach to communicating data analysis could have a few benefits:</p> <ol> <li>It would provide greater transparency</li> <li>It would allow others to more easily build on what was done in an analysis by extending or modifying specific elements</li> <li>It would make it easier to understand what common elements there were across many different data analyses</li> <li>It would make it easier to teach data analysis in a systematic and scalable way</li> </ol> <p>So, any takers?</p> The Real Lesson for Data Science That is Demonstrated by Palantir's Struggles 2016-05-11T00:00:00+00:00 http://simplystats.github.io/2016/05/11/palantir-struggles <p>Buzzfeed recently published a <a href="https://www.buzzfeed.com/williamalden/inside-palantir-silicon-valleys-most-secretive-company?utm_term=.ko2PLKaMJ#.wiPxJERyA">long article</a> on the struggles of the secretive data science company, Palantir.</p> <blockquote> <p>Over the last 13 months, at least three top-tier corporate clients have walked away, including Coca-Cola, American Express, and Nasdaq, according to internal documents. Palantir mines data to help companies make more money, but clients have balked at its high prices that can exceed $1 million per month, expressed doubts that its software can produce valuable insights over time, and even experienced difficult working relationships with Palantir’s young engineers. Palantir insiders have bemoaned the “low-vision” clients who decide to take their business elsewhere.</p> </blockquote> <p>Palantir’s origins are with PayPal, and its founders are part of the <a href="https://en.wikipedia.org/wiki/PayPal_Mafia">PayPal Mafia</a>. As Peter Thiel describes it in his book <a href="https://en.wikipedia.org/wiki/Zero_to_One">Zero to One</a>, PayPal was having a lot of trouble with fraud and the FBI was getting on its case. 
So PayPal developed some software to monitor the millions of transactions going through its systems and to flag transactions that were suspicious. Eventually, they realized that this kind of software might be useful to government agencies in a variety of contexts and the idea for Palantir was born.</p> <p>Much of the press reaction to Buzzfeed’s article amounts to schadenfreude over the potential fall of <a href="http://simplystatistics.org/2015/10/16/thorns-runs-head-first-into-the-realities-of-diagnostic-testing/">another</a> so-called Silicon Valley unicorn. Indeed, Palantir is valued at $20 billion, a valuation only exceeded in the private markets by Airbnb and Uber. But to me, nothing in the article indicates that Palantir is necessarily more poorly run than your average startup. It looks like they are going through pretty standard growing pains, trying to scale the business and diversify the customer base. It’s not surprising to me that employees would leave at this point; going from startup to “real company” is often not that fun and just a lot of work.</p> <p>However, a key question that arises is: if Palantir is having trouble trying to scale the business, why might that be? The Buzzfeed article doesn’t contain any answers but in this post I will attempt to speculate.</p> <p>The real message from the Buzzfeed article goes beyond just Palantir and is highly relevant to the data science world. 
It ultimately comes down to the question of <strong>what is the value of data analysis?</strong>, and secondarily, <strong>how do you communicate that value?</strong></p> <h2 id="the-data-science-spectrum">The Data Science Spectrum</h2> <p>Data science activities live on a spectrum with <strong>software</strong> on one end and <strong>highly customized consulting</strong> on another end (I lump a lot of things into consulting, including methods development, modeling, etc.).</p> <p><img src="https://raw.githubusercontent.com/simplystats/simplystats.github.io/master/_images/DS_Spectrum2.png" alt="Data Science Spectrum" /></p> <p>The idea being that if someone comes to you with a data problem, there are two extremes that you might offer to them:</p> <ol> <li>Give them some software, some documentation, and maybe a brief tutorial on how to use the software, and then send them on their way. For example, if someone wants to see if two groups are different from each other, you could send them the <code class="highlighter-rouge">t.test()</code> function in R and explain how to use it. This could be done over email; you wouldn’t even have to talk to the person.</li> <li>Meet with the person, talk about their problem and the question they’re trying to answer, develop an analysis plan, and build a custom software solution that produces the exact output that they’re looking for.</li> </ol> <p>The first option is cheap, simple, and if you had a good enough web site, the person probably wouldn’t even have to talk with you at all! For example, I use <a href="http://hedwig.mgh.harvard.edu/sample_size/size.html">this web site</a> for sample size calculations and I’ve never spoken with the author of the web site. Much of the labor is up front, for the development of the software, and then is amortized over the life of the product. 
Ultimately, a software product has zero marginal cost and so it can be easily replicated and is <em>infinitely scalable</em>.</p> <p>The second option is labor intensive, time-consuming, ongoing in nature, and is only scalable to the extent that you are willing to forgo sleep and maybe bend the space-time continuum. By definition, a custom solution is unique and is difficult to replicate.</p> <h2 id="selling-data-science">Selling Data Science</h2> <p>An important question for Palantir and data scientists in general is “How do you communicate the value of data analysis?” Many people expect the result of a good data analysis to be something “surprising”, i.e. something that they didn’t already know. Because if they knew it already why bother looking at the data? Think Moneyball: if you can find that “diamond in the rough” it makes spending the time to analyze the data worthwhile. But <strong>the success of a data analysis can’t depend on the results</strong>. What if you go through the data and find nothing? Is the data analysis a failure? We as data scientists can only show what the data show. Otherwise, it just becomes a recipe for telling people what they want to hear.</p> <p>It’s tempting for a client to say “well, the data didn’t show anything surprising so there’s no value there.” Also, a data analysis may reveal something that is perhaps interesting but doesn’t necessarily lead to any sort of decision. For example, there may be an aspect of a business process that is inefficient but is nevertheless unmodifiable. There may be little perceived value in discovering this with data.</p> <h3 id="what-is-useful">What is Useful?</h3> <p>Palantir apparently tried to develop a relationship with American Express, but ultimately failed.</p> <blockquote> <p>But some major firms have not found Palantir’s products and services that useful. 
In April 2015, employees were informed that American Express (codename: Charlie’s Angels) had dumped Palantir after 18 months of cybersecurity work, including a six-month pilot, an email shows. “We struggled from day 1 to make Palantir a sticky product for users and generate wins,” Sid Rajgarhia, a Palantir business development employee, said in the email.</p> </blockquote> <p>What does it mean for a data analysis product to be useful? It’s not necessarily clear to me in this case. Did Palantir not reveal new information? Did they not highlight something that could be modified?</p> <h3 id="lack-of-deep-expertise">Lack of Deep Expertise</h3> <p>A failed attempt at working with Coke reveals some other challenges of the data science business model.</p> <blockquote> <p>The beverage giant also had other concerns [in addition to the price]. Coke “wanted deeper industry expertise in a partner,” Jonty Kelt, a Palantir executive, told colleagues in the email. He added that Coca-Cola’s “working relationship” with the youthful Palantir employees was “difficult.” The Coke executive acknowledged that the beverage giant “needs to get better at working with millennials,” according to Kelt. Coke spokesperson Scott Williamson declined to comment.</p> </blockquote> <p>Annoying millennials notwithstanding, it’s clear that Coke didn’t feel comfortable collaborating with Palantir’s personnel. Like any data science collaboration, it’s key that the data scientist have some familiarity with the domain. In many cases, having “deep expertise” in an area can give a collaborator confidence that you will focus on the things that matter to them. But developing that expertise costs money and time and it may prevent you from working with other types of clients where you will necessarily have less expertise. 
For example, Palantir’s long experience working with the US military and intelligence agencies gave them deep expertise in those areas, but how does that help them with a consumer products company?</p> <h3 id="harder-than-it-looks">Harder Than It Looks</h3> <p>The final example of a client that backed out is Kimberly-Clark:</p> <blockquote> <p>But Kimberly-Clark was getting cold feet by early 2016. In January, a year after the initial pilot, Kimberly-Clark executive Anthony J. Palmer said he still wasn’t ready to sign a binding contract, meeting notes show. Palmer also “confirmed our suspicion” that a primary reason Kimberly-Clark had not moved forward was that “<em>they wanted to see if they could do it cheaper themselves</em>,” Kelt told colleagues in January. [emphasis added]</p> </blockquote> <p>This is a common problem confronted by anyone in the data science business. A good analysis often looks easy in retrospect: all you did was run some functions and put the data through some models! In fact, running the models probably <em>is</em> the easy part, but getting to the point where you can actually fit models can be incredibly hard. Many clients, not seeing the long and winding process leading to a model, will be tempted to think they can do it themselves.</p> <h2 id="palantirs-valuation">Palantir’s Valuation</h2> <p>Ultimately, what makes Palantir interesting is its astounding valuation. But what is the driver of this valuation? I think the key to answering this question lies in the description of the company itself:</p> <blockquote> <p>The company, based in Palo Alto, California, is essentially a hybrid software and consulting firm, placing what it calls “forward deployed engineers” on-site at client offices.</p> </blockquote> <p>What does it mean to be a “hybrid software and consulting firm”? And which one is the company more like? Consulting or software? 
Because ultimately, revealing which side of the spectrum Palantir is <em>really</em> on could have huge implications for its valuation and future prospects.</p> <p>Consulting companies can surely make a lot of money, but none to my knowledge have the kind of valuation that Palantir currently commands. If it turns out that every customer of Palantir’s requires a custom solution, then I think they’re likely overvalued, because that model scales poorly. On the other hand, if Palantir has genuinely figured out a way to “software-ize” data analysis and to turn it into a commodity, then they are very likely undervalued.</p> <p>Given the tremendous difficulty of turning data analysis into a software problem, my guess is that they are more akin to a consulting company and are overvalued. This is not to say that they won’t make money—they will likely make plenty—but that they won’t be the Silicon Valley darling that everyone wants them to be.</p> A means not an end - building a social media presence as a junior scientist 2016-05-10T00:00:00+00:00 http://simplystats.github.io/2016/05/10/social-media <p><em>Editor’s note - This is a chapter from my book <a href="https://leanpub.com/modernscientist">How to be a modern scientist</a> where I talk about some of the tools and techniques that scientists have available to them now that they didn’t before. 50% of all royalties from the book go to support <a href="http://www.datacarpentry.org/">Data Carpentry</a> to promote data science education.</em></p> <h2 id="social-media---what-should-i-do-and-why">Social media - what should I do and why?</h2> <p>Social media can serve a variety of roles for modern scientists. Here I am going to focus on the role of social media for working scientists whose primary focus is not on scientific communication. 
Something that is often missed by people who are just getting started with social media is that there are two separate components to developing a successful social media presence.</p> <p>The first is to develop a following and connections to people in your community. This is achieved through being either a content curator, a content generator, or being funny/interesting in some other way. This often has nothing to do with your scientific output.</p> <p>The second component is using your social media presence to magnify the audience for your scientific work. You can only do this if you have successfully developed a network and community in the first step. Then, when you post about your own scientific papers they will be shared.</p> <p>To most effectively achieve all of these goals you need to identify relevant communities and develop a network of individuals who follow you and will help to share your ideas and work.</p> <p><strong>Set up social media accounts and follow relevant people/journals</strong></p> <p>One of the largest academic communities has developed around Twitter, but some scientists are also using Facebook for professional purposes. If you set up a Twitter account, you should then find as many colleagues in your area of expertise that you can find and also any journals that are in your area.</p> <p><strong>Use your social media account to promote the work of other people</strong></p> <p>If you just use your social media account to post links to any papers that you publish, it will be hard to develop much of a following. It is also hard to develop a following by constantly posting long form original content such as blog posts. Alternatively you can gain a large number of followers by being (a) funny, (b) interesting, or (c) being a content curator. This latter approach can be particularly useful for people new to social media. 
Just follow people and journals you find interesting and share anything that you think is important/creative/exciting.</p> <p><strong>Share any work that you develop</strong></p> <p>Any code, publications, data, or blog posts you create you can share from your social media account. This can help raise your profile as people notice your good work. But if you only post your own work it is rarely possible to develop a large following unless you are already famous for another reason.</p> <h2 id="social-media---what-tools-should-i-use">Social media - what tools should I use?</h2> <p>There are a large number of social media platforms that are available to scientists. Creatively using any new social media platform if it has a large number of users can be a way to quickly jump into the consciousness of more people. That being said the two largest communities of scientists have organized around two of the largest social media platforms.</p> <ul> <li><a href="https://twitter.com/">Twitter</a> - is a platform where you can post short (less than 140 character) messages. This is a great platform for both discovering science and engaging in conversations about topics at a superficial level. It is not particularly useful for in depth scientific discussions.</li> <li><a href="https://www.facebook.com/">Facebook</a> - some scientists post longer form scientific discussions on Facebook, but the community there is somewhat less organized and people tend to use it less for professional reasons. However, sharing content on Facebook, particularly when it is of interest to a general audience, can lead to a broader engagement in your work.</li> </ul> <p>There are also a large and growing number of academic-specific social networks. 
For the most part these social networks are not widely used by practicing scientists and so don’t represent the best use of your time.</p> <p>You might also consider short videos on <a href="https://vine.co/">Vine</a>, longer videos on <a href="https://www.youtube.com/">Youtube</a>, more image intensive social media on <a href="https://www.tumblr.com/">Tumblr</a> or <a href="https://www.instagram.com">Instagram</a> if you have content that regularly fits those outlets. But they tend to have smaller communities of scientists with less opportunity for back and forth.</p> <h2 id="social-media---further-tips-and-issues">Social media - further tips and issues</h2> <h3 id="you-do-not-need-to-develop-original-content">You do not need to develop original content</h3> <p>Social media can be a time suck, particularly if you are spending a lot of time engaging in conversations on your platform of choice. Generating long form content in particular can take up a lot of time. But you don’t need to do that to generate a decent following. Finding the right community and then sharing work within that community and adding brief commentary and ideas can often help you develop a large following which can then be useful for other reasons.</p> <h3 id="add-your-own-commentary">Add your own commentary</h3> <p>Once you are comfortable using the social media platform of your choice you can start to engage with other people in conversation or add comments when you share other people’s work. This will increase the interest in your social media account and help you develop followers. This can be as simple as one-liners copied straight from the text of papers or posts that you think are most important.</p> <h3 id="make-online-friends---then-meet-them-offline">Make online friends - then meet them offline</h3> <p>One of the biggest advantages of scientific social media is that it levels the playing ground. 
Don’t be afraid to engage with members of your scientific community at all levels, from members of the National Academy (if they are online!) all the way down to junior graduate students. Getting to know a diversity of people can really help you during scientific meetings and visits. If you spend time cultivating online friendships, you’ll often meet a “familiar handle” at any conference or meeting you go to.</p> <h3 id="include-images-when-you-can">Include images when you can</h3> <p>If you see a plot from a paper you think is particularly compelling, copy it and attach it when you post/tweet when you link to the paper. On social media, images are often better received than plain text.</p> <h3 id="be-careful-of-hot-button-issues-unless-you-really-care">Be careful of hot button issues (unless you really care)</h3> <p>One thing to keep in mind on social media is the amplification of opinions. There are a large number of issues that are of extreme interest and generate really strong opinions on multiple sides. Some of these issues are common societal issues (e.g., racism, feminism, economic inequality) and some are specific to science (e.g., open access publishing, open source development). If you are starting a social media account to engage in these topics then you should definitely do that. If you are using your account primarily for scientific purposes you should consider carefully the consequences of wading into these discussions. The debates run very hot on social media and you may post what you consider to be a relatively tangential or light message on one of these topics and find yourself the center of a lot of attention (positive and negative).</p> Time Series Analysis in Biomedical Science - What You Really Need to Know 2016-05-05T00:00:00+00:00 http://simplystats.github.io/2016/05/05/timeseries-biomedical <p>For a few years now I have given a guest lecture on time series analysis in our School’s <em>Environmental Epidemiology</em> course. 
The basic thrust of this lecture is that you should generally ignore what you read about time series modeling, either in papers or in books. The reason is that I find much of the time series literature is not particularly helpful when doing analyses in a biomedical or population health context, which is what I do almost all the time.</p> <h2 id="prediction-vs-inference">Prediction vs. Inference</h2> <p>First, most of the literature on time series models tends to assume that you are interested in doing prediction, that is, forecasting future values in a time series. I am almost never doing this. In my work looking at air pollution and mortality, the goal is never to find the best model that predicts mortality. In particular, if our goal were to predict mortality, we would probably <em>never include air pollution as a predictor</em>. This is because air pollution has an inherently weak association with mortality at the population level, whereas things like temperature and other seasonal factors tend to have a much stronger association.</p> <p>What I <em>am</em> interested in doing is estimating an association between changes in air pollution levels and mortality and making some sort of inference about that association, either to a broader population or to other time periods. The challenges in these types of analyses include estimating weak associations in the presence of many stronger signals and appropriately adjusting for any potential confounding variables that similarly vary over time.</p> <p>The reason the distinction between prediction and inference is important is that focusing on one vs. the other can lead you to very different model building strategies. Prediction modeling strategies will always want you to include in the model factors that are strongly correlated with the outcome and explain a lot of the outcome’s variation. 
If you’re trying to do inference and use a prediction modeling strategy, you may make at least two errors:</p> <ol> <li>You may conclude that your key predictor of interest (e.g. air pollution) is not important because the modeling strategy didn’t choose to include it.</li> <li>You may omit important potential confounders because they have a weak relationship with the outcome (but maybe have a strong relationship with your key predictor). For example, one class of potential confounders in air pollution studies is other pollutants, which tend to be weakly associated with mortality but may be strongly associated with your pollutant of interest.</li> </ol> <h2 id="random-vs-fixed">Random vs. Fixed</h2> <p>Another area where I feel much time series literature differs from my practice is on whether to focus on fixed effects or random effects. Most of what you might think of when you think of time series models (i.e. AR models, MA models, GARCH, etc.) focuses on modeling the <em>random</em> part of the model. Often, you end up treating time series data as random because you simply do not have any other data. But the reality is that in many biomedical and public health applications, patterns in time series data can be explained by clearly understood fixed patterns.</p> <p>For example, take this time series here. It is lower at the beginning and at the end of the series, with higher levels in the middle of the period.</p> <p><img src="https://raw.githubusercontent.com/simplystats/simplystats.github.io/master/_images/ts_fixed.png" alt="Time series with seasonal pattern 1" /></p> <p>It’s possible that this time series could be modeled with an auto-regressive (AR) model or maybe an auto-regressive moving average (ARMA) model. Or it’s possible that the data are exhibiting a seasonal pattern. It’s impossible to tell from the data whether this is a random formulation of this pattern or whether it’s something you’d expect every time. 
The problem is that we usually only have <em>one observation</em> from the time series. That is, we observe the entire series only once.</p> <p>Now take a look at this time series. It exhibits some of the same properties as the first series: it’s low at the beginning and end and high in the middle.</p> <p><img src="https://raw.githubusercontent.com/simplystats/simplystats.github.io/master/_images/ts_random.png" alt="Time series with seasonal pattern 2" /></p> <p>Should we model this as a random process or as a process with a fixed pattern? That ultimately will depend on what type of data this is and what we know about it. If it’s air pollution data, we might do one thing, but if it’s stock market data, we might do a totally different thing.</p> <p>If we were to see replicates of the time series, we’d be able to resolve the fixed vs. random question. For example, because I simulated the data above, I can simulate another replicate and see what happens. In the plot below I show two replications from each of the processes.</p> <p><img src="https://raw.githubusercontent.com/simplystats/simplystats.github.io/master/_images/ts_both.png" alt="Fixed and random time series patterns" /></p> <p>It’s clear now that the time series on the top row has a fixed “seasonal” pattern while the time series on the bottom row is random (in fact it is simulated from an AR(1) model).</p> <p>The point here is that very often we know things about the time series we’re modeling that introduce fixed variation into the data: seasonal patterns, day-of-week effects, and long-term trends. Furthermore, there may be other time-varying covariates that can help predict whatever time series we’re modeling and can be put into the fixed part of the model (a.k.a. regression modeling). 
Ultimately, when many of these fixed components are accounted for, there’s relatively little of interest left in the residuals.</p> <h2 id="what-to-model">What to Model?</h2> <p>So the question remains: What should I do? The short answer is to try to incorporate everything that you know about the data into the fixed/regression part of the model. Then take a look at the residuals and see if you still care.</p> <p>Here’s a quick example from my work in air pollution and mortality. The data are all-cause mortality and PM10 pollution from Detroit for the years 1987–2000. The question is whether daily mortality is associated with daily changes in ambient PM10 levels. We can try to answer that with a simple linear regression model:</p> <div class="highlighter-rouge"><pre class="highlight"><code>Call:
lm(formula = death ~ pm10, data = ds)

Residuals:
    Min      1Q  Median      3Q     Max
-26.978  -5.559  -0.386   5.109  34.022

Coefficients:
             Estimate Std. Error t value Pr(&gt;|t|)
(Intercept) 46.978826   0.112284 418.394   &lt;2e-16
pm10         0.004885   0.001936   2.523   0.0117

Residual standard error: 8.03 on 5112 degrees of freedom
Multiple R-squared: 0.001244, Adjusted R-squared: 0.001049
F-statistic: 6.368 on 1 and 5112 DF, p-value: 0.01165
</code></pre> </div> <p>PM10 appears to be positively associated with mortality, but when we look at the autocorrelation function of the residuals, we see</p> <p><img src="https://raw.githubusercontent.com/simplystats/simplystats.github.io/master/_images/2016-05-05-timeseries-biomedical_files/figure-html/unnamed-chunk-3-1.png" alt="ACF1" /></p> <p>If we see a seasonal-like pattern in the auto-correlation function, then that means there’s a seasonal pattern in the residuals as well. 
Not good.</p> <p>But okay, we can just model the seasonal component with an indicator of the season.</p> <div class="highlighter-rouge"><pre class="highlight"><code>Call: lm(formula = death ~ season + pm10, data = ds) Residuals: Min 1Q Median 3Q Max -25.964 -5.087 -0.242 4.907 33.884 Coefficients: Estimate Std. Error t value Pr(&gt;|t|) (Intercept) 50.830458 0.215679 235.676 &lt; 2e-16 seasonQ2 -4.864167 0.304838 -15.957 &lt; 2e-16 seasonQ3 -6.764404 0.304346 -22.226 &lt; 2e-16 seasonQ4 -3.712292 0.302859 -12.258 &lt; 2e-16 pm10 0.009478 0.001860 5.097 0.000000358 Residual standard error: 7.649 on 5109 degrees of freedom Multiple R-squared: 0.09411, Adjusted R-squared: 0.09341 F-statistic: 132.7 on 4 and 5109 DF, p-value: &lt; 2.2e-16 </code></pre> </div> <p>Note that the coefficient for PM10, the coefficient of real interest, gets a little bigger when we add the seasonal component.</p> <p>When we look at the residuals now, we see</p> <p><img src="https://raw.githubusercontent.com/simplystats/simplystats.github.io/master/_images/2016-05-05-timeseries-biomedical_files/figure-html/unnamed-chunk-5-1.png" alt="ACF2" /></p> <p>The seasonal pattern is gone, but we see that there’s positive autocorrelation at seemingly long distances (~100s of days). This is usually an indicator that there’s some sort of long-term trend in the data. Since we only care about the day-to-day changes in PM10 and mortality, it would make sense to remove any such long-term trend. I can do that by just including the date as a linear predictor.</p> <div class="highlighter-rouge"><pre class="highlight"><code> Call: lm(formula = death ~ season + date + pm10, data = ds) Residuals: Min 1Q Median 3Q Max -23.407 -5.073 -0.375 4.718 32.179 Coefficients: Estimate Std. 
Error t value Pr(&gt;|t|) (Intercept) 60.04317325 0.64858433 92.576 &lt; 2e-16 seasonQ2 -4.76600268 0.29841993 -15.971 &lt; 2e-16 seasonQ3 -6.56826695 0.29815323 -22.030 &lt; 2e-16 seasonQ4 -3.42007191 0.29704909 -11.513 &lt; 2e-16 date -0.00106785 0.00007108 -15.022 &lt; 2e-16 pm10 0.00933871 0.00182009 5.131 0.000000299 Residual standard error: 7.487 on 5108 degrees of freedom Multiple R-squared: 0.1324, Adjusted R-squared: 0.1316 F-statistic: 156 on 5 and 5108 DF, p-value: &lt; 2.2e-16 </code></pre> </div> <p>Now we can look at the autocorrelation function one last time.</p> <p><img src="https://raw.githubusercontent.com/simplystats/simplystats.github.io/master/_images/2016-05-05-timeseries-biomedical_files/figure-html/unnamed-chunk-7-1.png" alt="ACF3" /></p> <p>The ACF trails to zero reasonably quickly now, but there’s still some autocorrelation at short lags up to about 15 days or so.</p> <p>Now we can engage in some traditional time series modeling. We might want to model the residuals with an auto-regressive model of order <em>p</em>. What should <em>p</em> be? We can check by looking at the partial autocorrelation function (PACF).</p> <p><img src="https://raw.githubusercontent.com/simplystats/simplystats.github.io/master/_images/2016-05-05-timeseries-biomedical_files/figure-html/unnamed-chunk-8-1.png" alt="PACF" /></p> <p>The PACF seems to suggest we should fit an AR(6) or AR(7) model. Let’s use an AR(6) model and see how things look. We can use the <code class="highlighter-rouge">arima()</code> function for that.</p> <div class="highlighter-rouge"><pre class="highlight"><code> Call: arima(x = y, order = c(6, 0, 0), xreg = m, include.mean = FALSE) Coefficients: ar1 ar2 ar3 ar4 ar5 ar6 (Intercept) 0.0869 0.0933 0.0733 0.0454 0.0377 0.0489 59.8179 s.e. 0.0140 0.0140 0.0141 0.0141 0.0140 0.0140 1.0300 seasonQ2 seasonQ3 seasonQ4 date pm10 -4.4635 -6.2778 -3.2878 -0.0011 0.0096 s.e. 
0.4569 0.4624 0.4546 0.0001 0.0018 sigma^2 estimated as 53.69: log likelihood = -17441.84, aic = 34909.69 </code></pre> </div> <p>Note that the coefficient for PM10 hasn’t changed much from our initial models. The usual concern with not accounting for residual autocorrelation is that the variance/standard error of the coefficient of interest will be affected. In this case, there does not appear to be much of a difference between using the AR(6) to account for the residual autocorrelation and ignoring it altogether. Here’s a comparison of the standard errors for each coefficient.</p> <div class="highlighter-rouge"><pre class="highlight"><code>               Naive AR model
(Intercept) 0.648584 1.030007
seasonQ2    0.298420 0.456883
seasonQ3    0.298153 0.462371
seasonQ4    0.297049 0.454624
date        0.000071 0.000114
pm10        0.001820 0.001819
</code></pre> </div> <p>The standard errors for the <code class="highlighter-rouge">pm10</code> variable are almost identical, while the standard errors for the other variables are somewhat bigger in the AR model.</p> <h2 id="conclusion">Conclusion</h2> <p>Ultimately, I’ve found that in many biomedical and public health applications, time series modeling is very different from what I read in the textbooks. The key takeaways are:</p> <ol> <li> <p>Make sure you know if you’re doing <strong>prediction</strong> or <strong>inference</strong>. Most often you will be doing inference, in which case your modeling strategies will be quite different.</p> </li> <li> <p>Focus separately on the <strong>fixed</strong> and <strong>random</strong> parts of the model. In particular, work with the fixed part of the model first, incorporating as much information as you can that will explain variability in your outcome.</p> </li> <li> <p>Model the random part appropriately, after incorporating as much as you can into the fixed part of the model. 
Classical time series models may be of use here, but also simple robust variance estimators may be sufficient.</p> </li> </ol> Not So Standard Deviations Episode 15 - Spinning Up Logistics 2016-05-04T00:00:00+00:00 http://simplystats.github.io/2016/05/04/nssd-episode-15 <p>This is Hilary’s and my last New York-Baltimore episode! In future episodes, Hilary will be broadcasting from California. In this episode we discuss collaboration tools and workflow management for data science projects. To date, I have not found a project management tool that I can actually use (besides email), but am open to suggestions (from students).</p> <p>If you have questions you’d like us to answer, you can send them to nssdeviations @ gmail.com or tweet us at <a href="https://twitter.com/nssdeviations">@NSSDeviations</a>.</p> <p><a href="https://itunes.apple.com/us/podcast/not-so-standard-deviations/id1040614570">Subscribe to the podcast on iTunes</a>.</p> <p>Please <a href="https://itunes.apple.com/us/podcast/not-so-standard-deviations/id1040614570">leave us a review on iTunes</a>!</p> <p>Support us through our <a href="https://www.patreon.com/NSSDeviations?ty=h">Patreon page</a>.</p> <p>Show notes:</p> <ul> <li> <p><a href="http://twitter.com/hspter/status/725411087110299649">Hilary’s tweet on cats</a></p> </li> <li> <p><a href="http://www.etsy.com/listing/185113916/…mug-coffee-cup-tea">Awesome vs. 
cats mug</a></p> </li> <li> <p><a href="http://math.mit.edu/~urschel/">John Urschel’s web page</a></p> </li> <li> <p><a href="http://www.ams.org/publications/journa…1602/rnoti-p148.pdf">Profile of John Urschel by the AMS</a></p> </li> <li> <p><a href="http://en.wikipedia.org/wiki/Frank_Ryan_…merican_football">The other NFL player/mathematician</a></p> </li> <li> <p><a href="http://guides.github.com/introduction/flow/">GitHub flow</a></p> </li> <li> <p><a href="http://www.theinformation.com/articles/why-…a-product-fix">Problems with Slack</a></p> </li> <li> <p><a href="http://www.astronomy.ohio-state.edu/~pogge/Ast…5/gps.html">Relativity and GPS</a></p> </li> <li> <p><a href="http://www.theinformation.com/become-a-data…e-information">The Information is looking for a Data Storyteller</a></p> </li> <li> <p><a href="http://www.stitchfix.com/careers?gh_jid=1…46?gh_jid=169746">Stitch Fix is looking for Data Scientists</a></p> </li> </ul> <p><a href="https://soundcloud.com/nssd-podcast/nssd-episode-15-spinning-up-logistics">Download the audio for this episode.</a></p> <iframe width="100%" height="166" scrolling="no" frameborder="no" src="https://w.soundcloud.com/player/?url=https%3A//api.soundcloud.com/tracks/261374684&amp;color=ff5500&amp;auto_play=false&amp;hide_related=false&amp;show_comments=true&amp;show_user=true&amp;show_reposts=false"></iframe> High school student builds interactive R class for the intimidated with the JHU DSL 2016-04-27T00:00:00+00:00 http://simplystats.github.io/2016/04/27/r-intimidated <p>Annika Salzberg is currently an undergraduate at Haverford College majoring in biology. While in high school here in Baltimore, she developed and taught an R class to her classmates at the <a href="http://www.parkschool.net/">Park School</a>. Her interest in R grew out of a project where she and her fellow students and teachers went to the Canadian sub-Arctic to collect data on permafrost depth and polar bears. 
When analyzing the data she learned R (with the help of a teacher) to be able to do the analyses, some of which she did on her laptop while out in the field.</p> <p>Later she worked on developing a course that she felt was friendly and approachable enough for her fellow high-schoolers to benefit. With the help of Steven Salzberg and the folks here at the JHU DSL, she built a class she calls <a href="https://www.datacamp.com/courses/r-for-the-intimidated">R for the intimidated</a> which just launched on <a href="https://www.datacamp.com/courses/r-for-the-intimidated">DataCamp</a> and you can take for free!</p> <p>The class is a great introduction for people who are just getting started with R. It walks through R/Rstudio, package installation, data visualization, data manipulation, and a final project. We are super excited about the content that Annika created working here at Hopkins and think you should go check it out!</p> An update on Georgia Tech's MOOC-based CS degree 2016-04-27T00:00:00+00:00 http://simplystats.github.io/2016/04/27/georgia-tech-mooc-program <p><a href="https://www.insidehighered.com/news/2016/04/27/georgia-tech-plans-next-steps-online-masters-degree-computer-science?utm_source=Inside+Higher+Ed&amp;utm_campaign=d373e33023-DNU20160427&amp;utm_medium=email&amp;utm_term=0_1fcbc04421-d373e33023-197601005#.VyCmdfkGRPU.mailto">This article</a> in Inside Higher Ed discusses next steps for Georgia Tech’s ground-breaking low-cost CS degree based on MOOCs run by Udacity. With Sebastian Thrun <a href="http://blog.udacity.com/2016/04/udacity-has-a-new-___.html">stepping down</a> as CEO at Udacity, it seems both Georgia Tech and Udacity might be moving into a new phase.</p> <p>One fact that surprised me about the Georgia Tech program concerned the demographics:</p> <blockquote> <p>Once the first applications for the online program arrived, Georgia Tech was surprised by how the demographics differed from the applications to the face-to-face program. 
The institute’s face-to-face cohorts tend to have more men than women and international students than U.S. citizens or residents. Applications to the online program, however, came overwhelmingly from students based in the U.S. (80 percent). The gender gap was even larger, with nearly nine out of 10 applications coming from men.</p> </blockquote> Write papers like a modern scientist (use Overleaf or Google Docs + Paperpile) 2016-04-21T00:00:00+00:00 http://simplystats.github.io/2016/04/21/writing <p><em>Editor’s note - This is a chapter from my book <a href="https://leanpub.com/modernscientist">How to be a modern scientist</a> where I talk about some of the tools and techniques that scientists have available to them now that they didn’t before.</em></p> <h2 id="writing---what-should-i-do-and-why">Writing - what should I do and why?</h2> <p><strong>Write using collaborative software to avoid version control issues.</strong></p> <p>On almost all modern scientific papers you will have co-authors. The traditional way of handling this was to create a single working document and pass it around. Unfortunately this system always results in a long collection of versions of a manuscript, which are often hard to distinguish and definitely hard to synthesize.</p> <p>An alternative approach is to use formal version control systems like <a href="https://git-scm.com/book/en/v2/Getting-Started-About-Version-Control">Git</a> and <a href="https://github.com/">Github</a>. However, the overhead for using these systems can be pretty heavy for paper authoring. They also require all parties participating in the writing of the paper to be familiar with version control and the command line. 
Alternative paper authoring tools are now available that provide some of the advantages of version control without the major overhead involved in using base version control systems.</p> <p><img src="https://imgs.xkcd.com/comics/documents.png" alt="The usual result of file naming by a group (image via https://xkcd.com/1459/)" /></p> <p><strong>Make figures the focus of your writing</strong></p> <p>Once you have a set of results and are ready to start writing up the paper the first thing is <em>not to write</em>. The first thing you should do is create a set of 1-10 publication-quality plots with 3-4 as the central focus (see Chapter 10 <a href="http://leanpub.com/datastyle">here</a> for more information on how to make plots). Show these to someone you trust to make sure they “get” your story before proceeding. Your writing should then be focused around explaining the story of those plots to your audience. Many people, when reading papers, read the title, the abstract, and then usually jump to the figures. If your figures tell the whole story you will dramatically increase your audience. It also helps you to clarify what you are writing about.</p> <p><strong>Write clearly and simply even though it may make your papers harder to publish</strong>.</p> <p>Learn how to write papers in a very clear and simple style. Whenever you can, write in plain English and make the approach you are using understandable and clear. This can (sometimes) make it harder to get your papers into journals. Referees are trained to find things to criticize, and when you simplify your argument they may assume that what you have done is easy or just like what has been done before. This can be extremely frustrating during the peer review process. But the peer review process isn’t the end goal of publishing! The point of publishing is to communicate your results to your community and beyond so they can use them. 
Simple, clear language leads to much higher use/reading/citation/impact of your work.</p> <p><strong>Include links to code, data, and software in your writing</strong></p> <p>Not everyone recognizes the value of re-analysis, scientific software, or data and code sharing. But it is the fundamental cornerstone of the modern scientific process to make all of your materials easily accessible, re-usable and checkable. Include links to data, code, and software prominently in your abstract, introduction and methods and you will dramatically increase the use and impact of your work.</p> <p><strong>Give credit to others</strong></p> <p>In academics the main currency we use is credit for publication. In general assigning authorship and getting credit can be a very tricky component of the publication process. It is almost always better to err on the side of offering credit. A very useful test that my advisor <a href="http://www.genomine.org/">John Storey</a> taught me is if you are embarrassed to explain the authorship credit to anyone who was on the paper or not on the paper, then you probably haven’t shared enough credit.</p> <h2 id="writing---what-tools-should-i-use">Writing - what tools should I use?</h2> <h3 id="wysiwyg-software-google-docs-and-paperpile">WYSIWYG software: Google Docs and Paperpile.</h3> <p>This system uses <a href="https://www.google.com/docs/about/">Google Docs</a> for writing and <a href="https://paperpile.com/app">Paperpile</a> for reference management. If you have a Google account you can easily create documents and share them with your collaborators either privately or publicly. 
Paperpile allows you to search for academic articles and insert references into the text using a system that will be familiar if you have previously used <a href="http://endnote.com/">Endnote</a> and <a href="https://products.office.com/en-us/word">Microsoft Word</a>.</p> <p>This system has the advantage of being a what you see is what you get system - anyone with basic text processing skills should be immediately able to contribute. Google Docs also automatically saves versions of your work so that you can flip back to older versions if someone makes a mistake. You can also easily see which part of the document was written by which person and add comments.</p> <p><em>Getting started</em></p> <ol> <li>Set up accounts with <a href="https://accounts.google.com/SignUp">Google</a> and with <a href="https://paperpile.com/">Paperpile</a>. If you are an academic the Paperpile account will cost $2.99 a month, but there is a 30 day free trial.</li> <li>Go to <a href="https://docs.google.com/document/u/0/">Google Docs</a> and create a new document.</li> <li>Set up the <a href="https://paperpile.com/blog/free-google-docs-add-on/">Paperpile add-on for Google Docs</a></li> <li>In the upper right hand corner of the document, click on the <em>Share</em> link and share the document with your collaborators</li> <li>Start editing</li> <li>When you want to include a reference, place the cursor where you want the reference to go, then using the <em>Paperpile</em> menu, choose insert citation. 
This should give you a search box where you can search by Pubmed ID or on the web for the reference you want.</li> <li>Once you have added some references use the <em>Citation style</em> option under <em>Paperpile</em> to pick the citation style for the journal you care about.</li> <li>Then use the <em>Format citations</em> option under <em>Paperpile</em> to create the bibliography at the end of the document</li> </ol> <p>The nice thing about using this system is that everyone can easily directly edit the document simultaneously - which reduces conflict and difficulty of use. A disadvantage is getting the formatting just right for most journals is nearly impossible, so you will be sending in a version of your paper that is somewhat generic. For most journals this isn’t a problem, but a few journals are sticklers about this.</p> <h3 id="typesetting-software-overleaf-or-sharelatex">Typesetting software: Overleaf or ShareLatex</h3> <p>An alternative approach is to use typesetting software like Latex. This requires a little bit more technical expertise since you need to understand the Latex typesetting language. But it allows for more precise control over what the document will look like. Using Latex on its own you will have many of the same issues with version control that you would have for a word document. Fortunately there are now “Google Docs like” solutions for editing latex code that are readily available. Two of the most popular are <a href="https://www.overleaf.com/">Overleaf</a> and <a href="https://www.sharelatex.com/">ShareLatex</a>.</p> <p>In either system you can create a document, share it with collaborators, and edit it in a similar manner to a Google Doc, with simultaneous editing. Under both systems you can save versions of your document easily as you move along so you can quickly return to older versions if mistakes are made.</p> <p>I have used both kinds of software, but now primarily use Overleaf because they have a killer feature. 
Once you have finished writing your paper you can directly submit it to some preprint servers like <a href="http://arxiv.org/">arXiv</a> or <a href="http://biorxiv.org/">biorXiv</a> and even some journals like <a href="https://peerj.com">Peerj</a> or <a href="http://f1000research.com/">f1000research</a>.</p> <p><em>Getting started</em></p> <ol> <li>Create an <a href="https://www.overleaf.com/signup">Overleaf account</a>. There is a free version of the software. Paying $8/month will give you easy saving to Dropbox.</li> <li>Click on <em>New Project</em> to create a new document and select from the available templates</li> <li>Open your document and start editing</li> <li>Share with colleagues by clicking on the <em>Share</em> button within the project. You can share either a read only version or a read and edit version.</li> </ol> <p>Once you have finished writing your document you can click on the <em>Publish</em> button to automatically submit your paper to the available preprint servers and journals. Or you can download a pdf version of your document and submit it to any other journal.</p> <h2 id="writing---further-tips-and-issues">Writing - further tips and issues</h2> <h3 id="when-to-write-your-first-paper">When to write your first paper</h3> <p>As soon as possible! The purpose of graduate school is (in some order):</p> <ul> <li>Freedom</li> <li>Time to discover new knowledge</li> <li>Time to dive deep</li> <li>Opportunity for leadership</li> <li>Opportunity to make a name for yourself <ul> <li>R packages</li> <li>Papers</li> <li>Blogs</li> </ul> </li> <li>Get a job</li> </ul> <p>The first couple of years of graduate school are typically focused on (1) teaching you all the technical skills you need and (2) data dumping as much hard-won practical experience from more experienced people into your head as fast as possible.</p> <p>After that one of your main focuses should be on establishing your own program of research and reputation. Especially for Ph.D. 
students it can not be emphasized enough <em>no one will care about your grades in graduate school but everyone will care what you produced</em>. See for example, Sherri’s excellent <a href="http://drsherrirose.com/academic-cvs-for-statistical-science-faculty-positions">guide on CV’s for academic positions</a>.</p> <p>I firmly believe that <a href="http://simplystatistics.org/2013/01/23/statisticians-and-computer-scientists-if-there-is-no-code-there-is-no-paper/">R packages</a> and blog posts can be just as important as papers, but the primary signal to most traditional academic communities still remains published peer-reviewed papers. So you should get started on writing them as soon as you can (definitely before you feel comfortable enough to try to write one).</p> <p>Even if you aren’t going to be in academics, papers are a great way to show off that you can (a) identify a useful project, (b) finish a project, and (c) write well. So the first thing you should be asking when you start a project is “what paper are we working on?”</p> <h3 id="what-is-an-academic-paper">What is an academic paper?</h3> <p>A scientific paper can be distilled into four parts:</p> <ol> <li>A set of methodologies</li> <li>A description of data</li> <li>A set of results</li> <li>A set of claims</li> </ol> <p>When you (or anyone else) writes a paper the goal is to communicate clearly items 1-3 so that they can justify the set of claims you are making. Before you can even write down 4 you have to do 1-3. So that is where you start when writing a paper.</p> <h3 id="how-do-you-start-a-paper">How do you start a paper?</h3> <p>The first thing you do is you decide on a problem to work on. This can be a problem that your advisor thought of or it can be a problem you are interested in, or a combination of both. 
Ideally your first project will have the following characteristics:</p> <ol> <li>Concrete</li> <li>Solves a scientific problem</li> <li>Gives you an opportunity to learn something new</li> <li>Something you feel ownership of</li> <li>Something you want to work on</li> </ol> <p>Points 4 and 5 can’t be emphasized enough. Others can try to help you come up with a problem, but if you don’t feel like it is <em>your</em> problem it will make writing the first paper a total slog. You want to find an option where you are just insanely curious to know the answer at the end, to the point where you <em>just have to figure it out</em> and kind of don’t care what the answer is. That doesn’t always happen, but when it does it makes the grind of writing papers go down a lot easier.</p> <p>Once you have a problem the next step is to actually do the research. I’ll leave this for another guide, but the basic idea is that you want to follow the usual <a href="https://leanpub.com/datastyle/">data analytic process</a>:</p> <ol> <li>Define the question</li> <li>Get/tidy the data</li> <li>Explore the data</li> <li>Build/borrow a model</li> <li>Perform the analysis</li> <li>Check/critique results</li> <li>Write things up</li> </ol> <p>The hardest part for the first paper is often knowing when to stop the analysis and start writing.</p> <h3 id="how-do-you-know-when-to-start-writing">How do you know when to start writing?</h3> <p>Sometimes this is an easy question to answer. If you started with a very concrete question, then you are ready once you have done enough analysis to convince yourself that you have the answer to the question. If the answer to the question is interesting/surprising then it is time to stop and write.</p> <p>If you started with a question that wasn’t so concrete then it gets a little trickier. The basic idea here is that you should start writing once you have convinced yourself you have a result that is worth reporting. 
Usually this takes the form of between 1 and 5 figures that show a coherent story that you could explain to someone in your field.</p> <p>In general one thing you should be working on in graduate school is your own internal timer that tells you, “ok we have done enough, time to write this up”. I found this one of the hardest things to learn, but if you are going to stay in academics it is a critical skill. There are rarely deadlines for paper writing (unless you are submitting to CS conferences) so it will eventually be up to you when to start writing. If you don’t have a good clock, this can really slow down your ability to get things published and to get promoted in academics.</p> <p>One good principle to keep in mind is “the perfect is the enemy of the very good”. Another one is that a published paper in a respectable journal beats a paper you just never submit because you want to get it into the “best” journal.</p> <h3 id="a-note-on-negative-results">A note on “negative results”</h3> <p>If the answer to your research problem isn’t interesting/surprising but you started with a concrete question it is also time to stop and write. But things often get trickier with this type of paper, as most journals filter for “interest” when reviewing papers, so sometimes a paper without a really “big” result will be harder to publish. <strong>This is ok!!</strong> Even though it may take longer to publish the paper, it is important to publish even results that aren’t surprising/novel. I would much rather that you come to an answer you are comfortable with and we go through a little pain trying to get it published than you keep pushing until you get an “interesting” result, which may or may not be justifiable.</p> <h3 id="how-do-you-start-writing">How do you start writing?</h3> <ol> <li>Once you have a set of results and are ready to start writing up the paper the first thing is <em>not to write</em>. 
The first thing you should do is create a set of 1-4 publication-quality plots (see Chapter 10 <a href="http://leanpub.com/datastyle">here</a>). Show these to someone you trust to make sure they “get” your story before proceeding.</li> <li>Start a project on <a href="https://www.overleaf.com/">Overleaf</a> or <a href="https://www.google.com/docs/about/">Google Docs</a>.</li> <li>Write up a story around the four plots in the simplest language you feel you can get away with, while still reporting all of the technical details that you can.</li> <li>Go back and add references in only after you have finished the whole first draft.</li> <li>Add in additional technical detail in the supplementary material if you need it.</li> <li>Write up a reproducible version of your code that returns exactly the same numbers/figures in your paper with no input parameters needed.</li> </ol> <h3 id="what-are-the-sections-in-a-paper">What are the sections in a paper?</h3> <p>Keep in mind that most people will read the title of your paper only, a small fraction of those people will read the abstract, a small fraction of those people will read the introduction, and a small fraction of those people will read your whole paper. So make sure you get to the point quickly!</p> <p>The sections of a paper are always some variation on the following:</p> <ol> <li><strong>Title</strong>: Should be very short, no colons if possible, and state the main result. Example, “A new method for sequencing data that shows how to cure cancer”. 
Here you want to make sure people will read the paper without overselling your results - this is a delicate balance.</li> <li><strong>Abstract</strong>: In (ideally) 4-5 sentences explain (a) what problem you are solving, (b) why people should care, (c) how you solved the problem, (d) what are the results and (e) a link to any data/resources/software you generated.</li> <li><strong>Introduction</strong>: A more lengthy (1-3 pages) explanation of the problem you are solving, why people should care, and how you are solving it. Here you also review what other people have done in the area. The most critical thing is never underestimate how little people know or care about what you are working on. It is your job to explain to them why they should.</li> <li><strong>Methods</strong>: You should state and explain your experimental procedures, how you collected results, your statistical model, and any strengths or weaknesses of your proposed approach.</li> <li><strong>Comparisons (for methods papers)</strong>: Compare your proposed approach to the state of the art methods. Do this with simulations (where you know the right answer) and data you haven’t simulated (where you don’t know the right answer). If you can base your simulation on data, even better. Make sure you are <a href="http://simplystatistics.org/2013/03/06/the-importance-of-simulating-the-extremes/">simulating both the easy case (where your method should be great) and harder cases where your method might be terrible</a>.</li> <li><strong>Your analysis</strong>: Explain what you did, what data you collected, how you processed it and how you analysed it.</li> <li><strong>Conclusions</strong>: Summarize what you did and explain why what you did is important one more time.</li> <li><strong>Supplementary Information</strong>: If there are a lot of technical computational, experimental or statistical details, you can include a supplement that has all of the details so folks can follow along. 
As far as possible, try to include the detail in the main text, explained clearly.</li> </ol> <p>The length of the paper will depend a lot on which journal you are targeting. In general the shorter/more concise the better. But unless you are shooting for a really glossy journal you should try to include the details in the paper itself. This means most papers will be in the 4-15 page range, but with a huge variance.</p> <p><em>Note</em>: Part of this chapter appeared in the <a href="https://github.com/jtleek/firstpaper">Leek group guide to writing your first paper</a></p> As a data analyst the best data repositories are the ones with the least features 2016-04-20T00:00:00+00:00 http://simplystats.github.io/2016/04/20/data-repositories <p>Lately, for a range of projects I have been working on I have needed to obtain data from previous publications. There is a growing list of data repositories where data is made available. General purpose data sharing sites include:</p> <ul> <li>The <a href="https://osf.io/">open science framework</a></li> <li>The <a href="https://dataverse.harvard.edu/">Harvard Dataverse</a></li> <li><a href="https://figshare.com/">Figshare</a></li> <li><a href="https://datadryad.org/">Datadryad</a></li> </ul> <p>There are also a host of field-specific data sharing sites. One thing that I find a little frustrating about these sites is that they add a lot of bells and whistles. For example I wanted to download a <a href="https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/6FMTT3">p-value data set</a> from Dataverse (just to pick on one, but most repositories have similar issues). 
I go to the page and click <code class="highlighter-rouge">Download</code> on the data set I want.</p> <p><img src="https://raw.githubusercontent.com/simplystats/simplystats.github.io/master/_images/2016-04-20/dataverse1.png" alt="Downloading a dataverse paper " /></p> <p>Then I have to accept terms:</p> <p><img src="https://raw.githubusercontent.com/simplystats/simplystats.github.io/master/_images/2016-04-20/dataverse2.png" alt="Downloading a dataverse paper part 2 " /></p> <p>Then the data set is downloaded. But it comes from a button that doesn’t allow me to get the direct link. There is an <a href="https://github.com/ropensci/dvn">R package</a> that you can use to download dataverse data, but again not with direct links to the data sets.</p> <p>This is a similar system to many data repositories where there is a multi-step process to downloading data rather than direct links.</p> <p>But as a data analyst I often find that I want:</p> <ul> <li>To be able to find a data set with some minimal search terms</li> <li>To find the data set in .csv or tab delimited format via a direct link</li> <li>To have the data set be available both as raw and processed versions</li> <li>To have the processed version be either one or many <a href="https://www.jstatsoft.org/article/view/v059i10">tidy data sets</a>.</li> </ul> <p>As a data analyst I would rather have all of the data stored as direct links and ideally as csv files. Then you don’t need to figure out a specialized package, an API, or anything else. You just call <code class="highlighter-rouge">read.csv</code> directly on the URL in R and you are off to the races. It also makes it easier to point to which data set you are using. So I find the best data repositories are the ones with the least features.</p> Junior scientists - you don't have to publish in open access journals to be an open scientist. 
2016-04-11T00:00:00+00:00 http://simplystats.github.io/2016/04/11/publishing <p><em>Editor’s note - This is a chapter from my book <a href="https://leanpub.com/modernscientist">How to be a modern scientist</a> where I talk about some of the tools and techniques that scientists have available to them now that they didn’t before.</em></p> <h2 id="publishing---what-should-i-do-and-why">Publishing - what should I do and why?</h2> <p>A modern scientific writing process goes as follows.</p> <ol> <li>You write a paper</li> <li>You post a preprint <ul> <li>Everyone can read and comment</li> </ul> </li> <li>You submit it to a journal</li> <li>It is peer reviewed privately</li> <li>The paper is accepted or rejected <ul> <li>If rejected go back to step 2 and start over</li> <li>If accepted it will be published</li> </ul> </li> </ol> <p>You can take advantage of modern writing and publishing tools to handle several steps in the process.</p> <p><strong>Post preprints of your work</strong></p> <p>Once you have finished writing your paper, you want to share it with others. Historically, this involved submitting the paper to a journal, waiting for reviews, revising the paper, resubmitting, and eventually publishing it. There is now very little reason to wait that long for your paper to appear in print. Generally you can post a paper to a preprint server and have it appear in 1-2 days. This is a dramatic improvement on the weeks or months it takes for papers to appear in peer reviewed journals even under optimal conditions. 
There are several advantages to posting preprints.</p> <ul> <li>Preprints establish precedence for your work, which reduces your risk of being scooped.</li> <li>Preprints allow you to collect feedback on your work and improve it quickly.</li> <li>Preprints can help you to get your work published in formal academic journals.</li> <li>Preprints can get you attention and press for your work.</li> <li>Preprints give junior scientists and other researchers gratification that helps them handle the stress and pressure of their first publications.</li> </ul> <p>The last point is underappreciated and was first pointed out to me by <a href="http://giladlab.uchicago.edu/">Yoav Gilad</a>. It takes a really long time to write a scientific paper. For a student publishing their first paper, the first feedback they get is often (a) delayed by several months and (b) negative and in the form of a referee report. This can have a major impact on the motivation of those students to keep working on projects. Preprints allow students to have an immediate product they can point to as an accomplishment, allow them to get positive feedback along with constructive or negative feedback, and can ease the pain of difficult referee reports or rejections.</p> <p><strong>Choose the journal that maximizes your visibility</strong></p> <p>You should try to publish your work in the best journals for your field. There are a couple of reasons for this. First, being a scientist is both a calling and a career. To advance your career, you need visibility among your scientific peers and among the scientists who will be judging you for grants and promotions. The best way to get it is by publishing in the top journals in your field. The important thing is to do your best to do good work and submit it to these journals, even if the results aren’t the most “sexy”. Don’t adapt your workflow to the journal, but don’t ignore the career implications either. Do this even if the journals are closed access. 
There are ways to make your work accessible and you will both raise your profile and disseminate your results to the broadest audience.</p> <p><strong>Share your work on social media</strong></p> <p>Academic journals are good for disseminating your work to the appropriate scientific community. As a modern scientist you have other avenues and other communities - like the general public - that you would like to reach with your work. Once your paper has been published in a preprint or in a journal, be sure to share your work through appropriate social media channels. This will also help you develop facility in coming up with one line or one figure that best describes what you think you have published so you can share it on social media sites like Twitter.</p> <h3 id="preprints-and-criticism">Preprints and criticism</h3> <p>See the section on scientific blogging for how to respond to criticism of your preprints online.</p> <h2 id="publishing---what-tools-should-i-use">Publishing - what tools should I use?</h2> <h3 id="preprint-servers">Preprint servers</h3> <p>Here are a few preprint servers you can use.</p> <ol> <li><a href="http://arxiv.org/">arXiv</a> (free) - primarily takes math/physics/computer science papers. You can submit papers and they are reviewed and posted within a couple of days. It is important to note that once you submit a paper here, you can not take it down. But you can submit revisions to the paper which are tracked over time. This outlet is followed by a large number of journalists and scientists.</li> <li><a href="http://biorxiv.org/">biorXiv</a> (free) - primarily takes biology focused papers. They are pretty strict about which categories you can submit to. You can submit papers and they are reviewed and posted within a couple of days. biorxiv also allows different versions of manuscripts, but some folks have had trouble with their versioning system, which can be a bit tricky for keeping your paper coordinated with your publication. 
bioRxiv is pretty carefully followed by the biological and computational biology communities.</li> <li><a href="https://peerj.com/preprints/">Peerj</a> (free) - takes a wide range of different types of papers. They will again review your preprint quickly and post it online. You can also post different versions of your manuscript with this system. This system is newer and so has fewer followers, so you will need to do your own publicity if you publish your paper here.</li> </ol> <h3 id="journal-preprint-policies">Journal preprint policies</h3> <p>This <a href="https://en.wikipedia.org/wiki/List_of_academic_journals_by_preprint_policy">list</a> provides information on which journals accept papers that were first posted as preprints. However, you shouldn’t</p> <h2 id="publishing---further-tips-and-issues">Publishing - further tips and issues</h2> <h3 id="open-vs-closed-access">Open vs. closed access</h3> <p>Once your paper has been posted to a preprint server you need to submit it for publication. There are a number of considerations you should keep in mind when submitting papers. One of these considerations is closed versus open access. Closed access journals do not require you to pay to submit or publish your paper. But then people who want to read your paper either need to pay or have a subscription to the journal in question.</p> <p>There has been a strong push for open access journals over the last couple of decades. There are some very good reasons justifying this type of publishing including (a) moral arguments based on using public funding for research, (b) ease of access to papers, and (c) benefits in terms of people being able to use your research. In general, most modern scientists want their work to be as widely accessible as possible. So modern scientists often opt for open access publishing.</p> <p>Open access publishing does have a couple of disadvantages. 
First it is often expensive, with fees for publication ranging between <a href="http://simplystatistics.org/2011/11/03/free-access-publishing-is-awesome-but-expensive-how/">$1,000 and $4,000</a> depending on the journal. Second, while science is often a calling, it is also a career. Sometimes the best journals in your field may be closed access. In general, one of the most important components of an academic career is being able to publish in journals that are read by a lot of people in your field so your work will be recognized and impactful.</p> <p>However, modern systems make both closed and open access journals reasonable outlets.</p> <h3 id="closed-access--preprints">Closed access + preprints</h3> <p>If the top journals in your field are closed access and you are a junior scientist then you should try to submit your papers there. But to make sure your papers are still widely accessible you can use preprints. First, you can submit a preprint before you submit the paper to the journal. Second, you can update the preprint to keep it current with the published version of your paper. This system allows you to make sure that your paper is read widely within your field, but also allows everyone to freely read the same paper. On your website, you can then link to both the published and preprint version of your paper.</p> <h3 id="open-access">Open access</h3> <p>If the top journal in your field is open access you can submit directly to that journal. Even if the journal is open access it makes sense to submit the paper as a preprint during the review process. You can then keep the preprint up-to-date, but this system has the advantage that the formally published version of your paper is also available for everyone to read without constraints.</p> <h3 id="responding-to-referee-comments">Responding to referee comments</h3> <p>After your paper has been reviewed at an academic journal you will receive referee reports. 
If the paper has not been outright rejected, it is important to respond to the referee reports in a timely and direct manner. Referee reports are often maddening. There is little incentive for people to do a good job refereeing and the most qualified reviewers will likely be those with a <a href="http://simplystatistics.org/2015/02/09/the-trouble-with-evaluating-anything/">conflict of interest</a>.</p> <p>The first thing to keep in mind is that the power in the refereeing process lies entirely with the editors and referees. The first thing to do when responding to referee reports is to eliminate the impulse to argue or respond with any kind of emotion. A step-by-step process for responding to referee reports is the following.</p> <ol> <li>Create a Google Doc. Put in all referee and editor comments in italics.</li> <li>Break the comments up into each discrete criticism or request.</li> <li>In bold respond to each comment. Begin each response with “On page xx we did yy to address this comment”</li> <li>Perform the analyses and experiments that you need to fill in the yy</li> <li>Edit the document to reflect all of the experiments that you have performed</li> </ol> <p>By actively responding to each comment you will ensure you are responsive to the referees and give your paper the best chance of success. If a comment is incorrect or non-sensical, think about how you can edit the paper to remove this confusion.</p> <h3 id="finishing">Finishing</h3> <p>While I have advocated here for using preprints to disseminate your work, it is important to follow the process all the way through to completion. Responding to referee reports is drudgery and no one likes to do it. But in terms of career advancement preprints are almost entirely valueless until they are formally accepted for publication. 
It is critical to see all papers all the way through to the end of the publication cycle.</p> <h3 id="you-arent-done">You aren’t done!</h3> <p>Publication of your paper is only the beginning of successfully disseminating your science. Once you have published the paper, it is important to use your social media, blog, and other resources to disseminate your results to the broadest audience possible. You will also give talks, discuss the paper with colleagues, and respond to requests for data and code. The most successful papers have a long half life and the responsibilities linger long after the paper is published. But the most successful scientists continue to stay on top of requests and respond to critiques long after their papers are published.</p> <p><em>Note:</em> Part of this chapter appeared in the Simply Statistics blog post: <a href="http://simplystatistics.org/2016/02/26/preprints-and-pppr/">“Preprints are great, but post publication peer review isn’t ready for prime time”</a></p> A Natural Curiosity of How Things Work, Even If You're Not Responsible For Them 2016-04-08T00:00:00+00:00 http://simplystats.github.io/2016/04/08/eecom <p>I just read Karl’s <a href="https://kbroman.wordpress.com/2016/04/08/i-am-a-data-scientist/">great post</a> on what it means to be a data scientist. 
I can’t really add much to it, but reading it got me thinking about the Apollo 12 mission, the second moon landing.</p> <p>This mission is actually famous because of its launch, where the Saturn V was struck by lightning and <a href="https://en.wikipedia.org/wiki/John_Aaron">John Aaron</a> (played wonderfully by Loren Dean in the movie <a href="http://www.imdb.com/title/tt0112384/">Apollo 13</a>), the flight controller in charge of environmental, electrical, and consumables (EECOM), had to make a decision about whether to abort the launch.</p> <p>In this great clip from the movie <em>Failure is Not An Option</em>, the real John Aaron describes what makes for a good EECOM flight controller. The bottom line is that</p> <blockquote> <p>A good EECOM has a natural curiosity for how things work, even if you…are not responsible for them</p> </blockquote> <p>I think a good data scientist or statistician also has that property. The key part of that line is the “<em>even if you are not responsible for them</em>” part. I’ve found that a lot of being a statistician involves nosing around in places where you’re not supposed to be, seeing how data are collected, handled, managed, analyzed, and reported. Focusing on the development and implementation of methods is not enough.</p> <p>Here’s the clip, which describes the famous “SCE to AUX” call from John Aaron:</p> <iframe width="640" height="480" src="https://www.youtube.com/embed/eWQIryll8y8" frameborder="0" allowfullscreen=""></iframe> Not So Standard Deviations Episode 13 - It's Good that Someone is Thinking About Us 2016-04-07T00:00:00+00:00 http://simplystats.github.io/2016/04/07/nssd-episode-13 <p>In this episode, Hilary and I talk about the difficulties of separating data analysis from its context, and Feather, a new file format for storing tabular data. 
Also, we respond to some listener questions and Hilary announces her new job.</p> <p>If you have questions you’d like us to answer, you can send them to nssdeviations @ gmail.com or tweet us at @NSSDeviations.</p> <p><a href="https://itunes.apple.com/us/podcast/not-so-standard-deviations/id1040614570">Subscribe to the podcast on iTunes</a>.</p> <p>Please <a href="https://itunes.apple.com/us/podcast/not-so-standard-deviations/id1040614570">leave us a review on iTunes</a>!</p> <p>Show notes:</p> <ul> <li> <p><a href="https://www.patreon.com/NSSDeviations">NSSD Patreon page</a></p> </li> <li> <p><a href="https://github.com/wesm/feather/">Feather git repository</a></p> </li> <li> <p><a href="https://arrow.apache.org">Apache Arrow</a></p> </li> <li> <p><a href="https://google.github.io/flatbuffers/">FlatBuffers</a></p> </li> <li> <p><a href="http://simplystatistics.org/2016/03/31/feather/">Roger’s blog post on feather</a></p> </li> <li> <p><a href="https://www.etsy.com/shop/NausicaaDistribution">NausicaaDistribution</a></p> </li> <li> <p><a href="http://www.rstats.nyc">New York R Conference</a></p> </li> <li> <p><a href="https://goo.gl/J2QAWK">Every Frame a Painting</a></p> </li> </ul> <p><a href="https://soundcloud.com/nssd-podcast/episode-13-its-good-that-someone-is-thinking-about-us">Download the audio for this episode.</a></p> <iframe width="100%" height="166" scrolling="no" frameborder="no" src="https://w.soundcloud.com/player/?url=https%3A//api.soundcloud.com/tracks/257851619&amp;color=ff5500&amp;auto_play=false&amp;hide_related=false&amp;show_comments=true&amp;show_user=true&amp;show_reposts=false"></iframe> Companies are Countries, Academia is Europe 2016-04-05T00:00:00+00:00 http://simplystats.github.io/2016/04/05/corporations-academia <p>I’ve been thinking a lot recently about the practice of data analysis in different settings and how the environment in which you work can affect the view you have on how things should be done. 
I’ve been working in academia for over 12 years now. I don’t have any industry data science experience, but long ago I worked as a software engineer at <a href="http://www.northropgrumman.com/Pages/default.aspx">two</a> <a href="http://kencast.com">companies</a>. Obviously, my experience is biased on the academic side.</p> <p>I’ve seen an interesting divergence between what I see being written from data scientists in industry and my personal experience doing data science in academia. From the industry side, I see a lot of stuff about tooling/software and processes. This makes sense to me. Often, a company needs/wants to move quickly and doing so requires making decisions on a reasonable time scale. If decisions are made with data, then the process of collecting, organizing, analyzing, and communicating data needs to be well thought-out, systematized, rigorous, and streamlined. If every time someone at the company had a question the data science team developed some novel custom coded-from-scratch solution, decisions would be made at a glacial pace, which is probably not good for business. In order to handle this type of situation you need solid tools and flexible workflows. You also need agreement within the company about how things are done and the processes that are followed.</p> <p>Now, I don’t mean to imply that life at a company is easy, that there isn’t politics or bureaucracy to deal with. But I see companies as much like individual countries, with a clear (hierarchical) leadership structure and decision-making process (okay, maybe ideal companies). Much like in a country, it might take some time to come to a decision about a policy or problem (e.g. health insurance), with much negotiation and horse-trading, but once consensus is arrived at, often the policy can be implemented across the country at a reasonable timescale. 
In a company, if a certain workflow or data process can be shown to be beneficial and perhaps improve profitability down the road, then a decision could be made to implement it. Ultimately, everyone within a company is in the same boat and is interested in seeing the company succeed.</p> <p>When I worked at a company as a software developer, I’d sometimes run into a problem that was confusing or difficult to code. So I’d walk down to the systems engineer’s office (the guy who wrote the specification) and talk to him about it. We’d hash things out for a while and then figure out a way to go forward. Often the technical writers who wrote the documentation would come and ask me what exactly a certain module did and I’d explain it to them. Communication was usually quick and efficient because it usually occurred person-to-person and because we were all on the same team.</p> <p>Academia is more like Europe, a somewhat loose federation of states that only communicates with each other because they have to. Each principal investigator is a country and s/he has to engage in constant (sometimes contentious) negotiations with other investigators (“countries”). As a data scientist, this can be tricky because unless I collect/generate my own data (which sometimes, <a href="http://www.ncbi.nlm.nih.gov/pubmed/18477784">I do</a>), I have to negotiate with another investigator to obtain the data. Even if I were collaborating with that investigator from the very beginning of a study, I typically have very little direct control over the data collection process because those people don’t work for me. The result is often that the data come to me in some format over which I had little input, and I just have to deal with it. Sometimes this is a nice CSV file, but often it is not.</p> <p>In good situations, I can talk with the investigator collecting the data and we can hash out a plan to put the data into a <a href="https://www.jstatsoft.org/article/view/v059i10">certain format</a>. 
But even if we can agree on that, often the expertise will not be available on their end to get the data into that format, so I’ll end up having to do it myself anyway. In not-so-good situations, I can make all the arguments I want for an organized data collection and analysis workflow, but if the investigator doesn’t want to do it, can’t afford it, or doesn’t see any incentive, then it’s not going to happen. Ever.</p> <p>However, even in the good situations, every investigator works in their own personal way. I mean, that’s why people go into academia, because you can “be your own boss” and work on problems that interest you. Most people develop a process for running their group/lab that most suits their personality. If you’re a data scientist, you need to figure out a way to mesh with each and every investigator you collaborate with. In addition, you need to adapt yourself to whatever data process each investigator has developed for their group. So if you’re working with a genomics person, you might need to learn about BAM files. For a neuroimaging collaborator, you’ll need to know about SPM. If one person doesn’t like tidy data, then that’s too bad. You need to deal with it (or don’t work with them). As a result, it’s difficult to develop a useful “system” for data science because any system that works for one collaborator is unlikely to work for another collaborator. In effect, each collaboration often results in a custom coded-from-scratch solution.</p> <p>This contrast between companies and academia got me thinking about the <a href="https://en.wikipedia.org/wiki/Theory_of_the_firm">Theory of the Firm</a>. This is an economic theory that tries to explain why firms/companies develop at all, as opposed to individuals or small groups negotiating over an open market. My understanding is that it all comes down to how well you can write and enforce a contract between two parties. 
For example, if I need to manufacture iPhones, I can go to a contract manufacturer, give them the designs and the precise specifications/tolerances and they can just produce millions of them. However, if I need to <em>design</em> the iPhone, it’s a bit harder for me to go to another company and just say “Design an awesome new phone!” That kind of contract is difficult to write down, much less enforce. That other company will be operating off of different incentives from me and will likely not produce what I want. It’s probably better if I do the design work in-house. Ultimately, once the transaction costs of having two different companies work together become too high, it makes more sense for a company to do the work in-house.</p> <p>I think collaborating on data analysis is a high transaction cost activity. Companies have an advantage in this realm to the extent that they can hire lots of data scientists to work in-house. Academics that are well-funded and have large labs can often hire a data analyst to work for them. This is good because it makes a well-trained person available at low transaction cost, but this setup is the exception. PIs with smaller labs barely have enough funding to do their experiments and so either have to analyze the data themselves (for which they may not be appropriately trained) or collaborate with someone willing to do it. Large academic centers often have research cores that provide data analysis services, but this doesn’t change the fact that data analysis that occurs “outside the company” dramatically increases the transaction costs of doing the research. Because data analysis is a highly iterative process, each time you have to go back and forth with an outside entity, the costs go up.</p> <p>I think it’s possible to see a time when data analysis can effectively be made external. I mean, Apple used to manufacture all its products, but has shifted to contract manufacturing to great success. 
But I think we will have to develop a much better understanding of the data analysis process before we see the transaction costs start to go down.</p> New Feather Format for Data Frames 2016-03-31T00:00:00+00:00 http://simplystats.github.io/2016/03/31/feather <p>This past Tuesday, Hadley Wickham and Wes McKinney <a href="http://blog.cloudera.com/blog/2016/03/feather-a-fast-on-disk-format-for-data-frames-for-r-and-python-powered-by-apache-arrow/">announced</a> a new binary file format specifically for storing data frames.</p> <blockquote> <p>One thing that struck us was that, while R’s data frames and Python’s pandas data frames utilize different internal memory representations, the semantics of their user data types are mostly the same. In both R and pandas, data frames contain lists of named, equal-length columns, which can be numeric, boolean, and date-and-time, categorical (factors), or string. Additionally, these columns must support missing (null) values.</p> </blockquote> <p>Their work builds on the Apache Arrow project, which specifies a format for tabular data. There are currently Python and R implementations for reading/writing these files, but other implementations could easily be built as the file format looks pretty straightforward. The git repository is <a href="https://github.com/wesm/feather/">here</a>.</p> <p>Initial thoughts:</p> <ul> <li> <p>The possibility of passing data between languages is, I think, the main point here. The potential for passing data through a pipeline without worrying about the specifics of different languages could make for much more powerful analyses where different tools are used for whatever they tend to do best. Essentially, as long as data can be made tidy going in and coming out, there should not be a communication issue between languages.</p> </li> <li> <p>R users might be wondering what the big deal is–we already have a binary serialization format (XDR). 
But R’s serialization format is meant to cover all possible R objects. Feather’s focus on data frames allows for the removal of many of the annoying (but seldom used) complexities of R objects and optimizing a very commonly used data format.</p> </li> <li> <p>In my testing, there’s a noticeable speed difference between reading a feather file and reading an (uncompressed) R workspace file (feather seems about 2x faster). I didn’t time writing files, but the difference didn’t seem as noticeable there. That said, it’s not clear to me that performance on files is the main point here.</p> </li> <li> <p>Given the underlying framework and representation, there seem to be some interesting possibilities for low-memory environments.</p> </li> </ul> <p>I’ve only had a chance to quickly look at the code but I’m excited to see what comes next.</p> How to create an AI startup - convince some humans to be your training set 2016-03-30T00:00:00+00:00 http://simplystats.github.io/2016/03/30/humans-as-training-set <p>The latest trend in data science is <a href="https://en.wikipedia.org/wiki/Artificial_intelligence">artificial intelligence</a>. It has been all over the news for tackling a bunch of interesting questions. 
For example:</p> <ul> <li><a href="https://deepmind.com/alpha-go.html">AlphaGo</a> <a href="http://www.techrepublic.com/article/how-googles-deepmind-beat-the-game-of-go-which-is-even-more-complex-than-chess/">beat</a> one of the top Go players in the world in what has been called a major advance for the field.</li> <li>Microsoft created a chatbot <a href="http://techcrunch.com/2016/03/23/microsofts-new-ai-powered-bot-tay-answers-your-tweets-and-chats-on-groupme-and-kik/">Tay</a> that ultimately <a href="http://www.bbc.com/news/technology-35902104">went very very wrong</a>.</li> <li>Google and a number of others are working on <a href="https://www.google.com/selfdrivingcar/">self driving cars</a>.</li> <li>Facebook is creating an artificial intelligence-based <a href="http://www.engadget.com/2015/08/26/facebook-messenger-m-assistant/">virtual assistant called M</a>.</li> </ul> <p>Almost all of these applications are based (at some level) on using variations on <a href="http://neuralnetworksanddeeplearning.com/">neural networks and deep learning</a>. These models are used like any other statistical or machine learning model. They involve a prediction function that is based on a set of parameters. Using a training data set, you estimate the parameters. Then when you get a new set of data, you push it through the prediction function using those estimated parameters and make your predictions.</p> <p>So why does deep learning do so well on problems like voice recognition, image recognition, and other complicated tasks? The main reason is that these models involve hundreds of thousands or millions of parameters that allow the model to capture even very subtle structure in large scale data sets. 
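</p> <p>The estimate-then-predict pattern described above can be sketched with a deliberately tiny model. This is only an illustration in Python: the function names <code>estimate_parameters</code> and <code>predict</code> and the training numbers are made up, and the model is a two-parameter straight line rather than a neural network. A deep network follows the same two steps, just with vastly more parameters.</p>

```python
def estimate_parameters(xs, ys):
    """Least-squares estimates (intercept, slope) for y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


def predict(params, new_xs):
    """Push new inputs through the prediction function using estimated parameters."""
    intercept, slope = params
    return [intercept + slope * x for x in new_xs]


# Step 1: estimate the parameters from a (made-up) training set.
train_x = [1.0, 2.0, 3.0, 4.0]
train_y = [3.0, 5.0, 7.0, 9.0]
params = estimate_parameters(train_x, train_y)

# Step 2: apply the prediction function to new data.
print(predict(params, [5.0, 6.0]))
```

<p>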
This type of model can be fit now because (a) we have huge training sets (think all the pictures on Facebook or all voice recordings of people using Siri) and (b) we have fast computers that allow us to estimate the parameters.</p> <p>Almost all of the high-profile examples of “artificial intelligence” we are hearing about involve this type of process. This means that the machine is “learning” from examples of how humans behave. The algorithm itself is a way to estimate subtle structure from collections of human behavior.</p> <p>The result is that the typical trajectory for an AI business is:</p> <ol> <li>Get a large collection of humans to perform some repetitive but possibly complicated behavior (play thousands of games of Go, or answer requests from people on Facebook messenger, or label pictures and videos, or drive cars.)</li> <li>Record all of the actions the humans perform to create a training set.</li> <li>Feed these data into a statistical model with a huge number of parameters - made possible by having a huge training set collected from the humans in steps 1 and 2.</li> <li>Apply the algorithm to perform the repetitive task and cut the humans out of the process.</li> </ol> <p>The question is how do you get the humans to perform the task for you? One option is to collect data from humans who are using your product (think Facebook image tagging). The other, more recent phenomenon, is to farm the task out to a large number of contractors (think <a href="http://www.theguardian.com/commentisfree/2015/jul/26/will-we-get-by-gig-economy">gig economy</a> jobs like driving for Uber, or responding to queries on Facebook).</p> <p>The interesting thing about the latter case is that in the short term it produces a market for gigs for humans. But in the long term, by performing those tasks, the humans are putting themselves out of a job. 
This played out in a relatively public way just recently with a service called <a href="http://www.fastcompany.com/3058060/this-is-what-it-feels-like-when-a-robot-takes-your-job">GoButler</a> that used its employees to train a model and then replaced them with that model.</p> <p>It will be interesting to see how many areas of employment this type of approach takes over. It is also interesting to think about how much value each task you perform for a company like that is worth to the training set. It will also be interesting if there is a legal claim for the gig workers at these companies to make that their labor helped “create the value” at the companies that replace them.</p> Not So Standard Deviations Episode 12 - The New Bayesian vs. Frequentist 2016-03-26T00:00:00+00:00 http://simplystats.github.io/2016/03/26/nssd-episode-12 <p>In this episode, Hilary and I discuss the new direction for the journal Biostatistics, the recent fracas over ggplot2 and base graphics in R, and whether collecting more data is always better than collecting less (fewer?) data. 
Also, Hilary and Roger respond to some listener questions and more free advertising.</p> <p>If you have questions you’d like us to answer, you can send them to nssdeviations @ gmail.com or tweet us at @NSSDeviations.</p> <p><a href="https://itunes.apple.com/us/podcast/not-so-standard-deviations/id1040614570">Subscribe to the podcast on iTunes</a>.</p> <p>Please <a href="https://itunes.apple.com/us/podcast/not-so-standard-deviations/id1040614570">leave us a review on iTunes</a>!</p> <p>Show notes:</p> <ul> <li> <p><a href="http://goo.gl/am6I3r">Jeff Leek on why he doesn’t use ggplot2</a></p> </li> <li> <p>David Robinson on <a href="http://varianceexplained.org/r/why-I-use-ggplot2/">why he uses ggplot2</a></p> </li> <li> <p><a href="http://goo.gl/6iEB2I">Nathan Yau’s post comparing ggplot2 and base graphics</a></p> </li> <li> <p><a href="https://goo.gl/YuhFgB">Biostatistics Medium post</a></p> </li> <li> <p><a href="http://goo.gl/tXNdCA">Photoviz</a></p> </li> <li> <p><a href="https://twitter.com/PigeonAir">PigeonAir</a></p> </li> <li> <p><a href="https://goo.gl/jqlg0G">I just want to plot()</a></p> </li> <li> <p><a href="https://goo.gl/vvCfkl">Hilary and Rush Limbaugh</a></p> </li> <li> <p><a href="http://imgur.com/a/K4RWn">Deep learning training set</a></p> </li> <li> <p><a href="http://patreon.com/NSSDeviations">NSSD Patreon Page</a></p> </li> </ul> <p><a href="https://soundcloud.com/nssd-podcast/episode-12-the-new-bayesian-vs-frequentist">Download the audio for this episode.</a></p> <iframe width="100%" height="166" scrolling="no" frameborder="no" src="https://w.soundcloud.com/player/?url=https%3A//api.soundcloud.com/tracks/255099493&amp;color=ff5500&amp;auto_play=false&amp;hide_related=false&amp;show_comments=true&amp;show_user=true&amp;show_reposts=false"></iframe> The future of biostatistics 2016-03-24T00:00:00+00:00 http://simplystats.github.io/2016/03/24/the-future-of-biostatistics <p>Starting in January my colleague <a 
href="https://twitter.com/drizopoulos">Dimitris Rizopoulos</a> and I took over as co-editors of the journal Biostatistics. We are pretty fired up to try some new things with the journal and to make sure that the most important advances in statistical methodology and application have a good home.</p> <p>We started a blog for the journal and our first post is here: <a href="https://medium.com/@biostatistics/the-future-of-biostatistics-5aa8246e14b4#.uk1gat5sr">The future of Biostatistics</a>. Thanks to <a href="https://twitter.com/kwbroman/status/695306823365169154">Karl Broman and his family</a> we also have the twitter handle <a href="https://twitter.com/biostatistics">@biostatistics</a>. Follow us there to hear about all the new stuff we are rolling out.</p> The Evolution of a Data Scientist 2016-03-21T00:00:00+00:00 http://simplystats.github.io/2016/03/21/dataScientistEvo-jaffe <p><em>Editor’s note: This post is a guest post by <a href="http://aejaffe.com">Andrew Jaffe</a></em></p> <p>“How do you get to Carnegie Hall? Practice, practice, practice.” (“The Wit Parade” by E.E. Kenyon on March 13, 1955)</p> <p>“…an extraordinarily consistent answer in an incredible number of fields … you need to have practiced, to have apprenticed, for 10,000 hours before you get good.” (Malcolm Gladwell, on Outliers)</p> <p>I have been a data scientist for the last seven or eight years, probably before “data science” existed as a field. I work almost exclusively in the R statistical environment, which I first toyed with as a sophomore in college and ramped up through graduate school. I write all of my code in Notepad++ and make all of my plots with base R graphics, over newer and probably easier approaches, like RStudio, ggplot2, and R Markdown. Every so often, someone will email asking for code used in papers for analysis or plots, and I dig through old folders to track it down. 
Every time this happens, I come to two realizations: 1) I used to write fairly inefficient and not-so-great code as an early PhD student, and 2) I write a lot of R code.</p> <p>I think there are some pretty good ways of measuring success and growth as a data scientist – you can count software packages and their user-bases, projects and papers, citations, grants, and promotions. But I wanted to calculate one more metric to add to the list – how much R code have I written in the last 8 years? I have been using the Joint High Performance Computing Exchange (JHPCE) at Johns Hopkins University since I started graduate school, so all of my R code was pretty much all in one place. I therefore decided to spend my Friday night drinking some Guinness and chronicling my journey using R and evolution as a data scientist.</p> <p>I found all of the .R files across my four main directories on the computing cluster (after copying over my local scripts), and then removed files that came with packages, that belonged to other users, and that resulted from poorly designed simulation and permutation analyses (perm1.R,…,perm100.R) before I learned how to use array jobs, and then extracted the creation date, last modified date, file size, and line count for each R script. Based on this analysis, I have written 3257 R scripts across 13.4 megabytes and 432,753 lines of code (including whitespace and comments) since February 22, 2009.</p> <p>I found that my R coding output has generally increased over time when tabulated by month (number of scripts: p=6.3e-7, size of files: p=3.5e-9, and number of lines: p=5.0e-9). These metrics of coding – number, size, and lines – also suggest that, on average, I wrote the most code during my PhD (p-value range: 1.7e-4-1.8e-7). 
Interestingly, the changes in output over time were surprisingly consistent across the three phases of my academic career: PhD, postdoc, and faculty (see Figure 1) – you can see the initial dropoff in production during the first one or two months as I transitioned to a postdoc at the Lieber Institute for Brain Development after finishing my PhD. My output rate has dropped slightly as a faculty member as I started working with doctoral students who took over the analyses of some projects (month-by-output interaction p-value: 5.3e-4, 0.002, and 0.03, respectively, for number, size, and lines). The mean coding output – on average, how much code it takes for a single analysis – also increased over time and slightly decreased at LIBD, although to lesser extents (all p-values were between 0.01-0.05). I was actually surprised that coding output increased – rather than decreased – over time, as any gains in coding efficiency were probably canceled out by my oftentimes more modular analyses at LIBD.</p> <p><img src="https://raw.githubusercontent.com/simplystats/simplystats.github.io/master/_images/2016-03-21/sizeVsMonth_rCode.jpg" alt="Figure 1: Coding output over time. Vertical bars separate my PhD, postdoc, and faculty jobs" /></p> <p>I also looked at coding output by hour of the day to better characterize my working habits – the output per hour is shown stratified by the two eras, each about 3 years (Figure 2). As expected, I never really work much in the morning – very little work gets done before 8AM – and little has changed since I was a second-year PhD student. As a faculty member, I have the highest output between 9AM-3PM. The trough between 4PM and 7PM likely corresponds to walking the dog we got three years ago, working out, and cooking (and eating) dinner. 
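</p> <p>The file inventory and by-month tabulation described above can be sketched roughly as follows. This is an illustrative Python sketch, not the R code used for the analysis in this post; the function names <code>inventory_r_scripts</code> and <code>tabulate_by_month</code> are hypothetical, and it records only the last-modified time because a file creation time is not available portably on every system.</p>

```python
import os
from collections import defaultdict
from datetime import datetime


def inventory_r_scripts(root):
    """Walk root and return one record per .R file: path, size, mtime, line count."""
    records = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            if not name.endswith(".R"):
                continue
            path = os.path.join(dirpath, name)
            info = os.stat(path)
            # errors="replace" keeps the count going on oddly encoded files.
            with open(path, "r", errors="replace") as handle:
                n_lines = sum(1 for _ in handle)
            records.append({
                "path": path,
                "bytes": info.st_size,
                "modified": datetime.fromtimestamp(info.st_mtime),
                "lines": n_lines,
            })
    return records


def tabulate_by_month(records):
    """Sum script count, bytes, and lines per (year, month) of last modification."""
    totals = defaultdict(lambda: {"scripts": 0, "bytes": 0, "lines": 0})
    for rec in records:
        key = (rec["modified"].year, rec["modified"].month)
        totals[key]["scripts"] += 1
        totals[key]["bytes"] += rec["bytes"]
        totals[key]["lines"] += rec["lines"]
    return dict(totals)
```

<p>Calling <code>tabulate_by_month(inventory_r_scripts("some/directory"))</code> would give one row of totals per month, which is the kind of table a trend test over time could then be run on.</p> <p>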
The output then increases steadily from 8PM-12AM, where I can work largely uninterrupted from meetings and people dropping by my office, with occasional days (or nights) working until 1AM.</p> <p><img src="https://raw.githubusercontent.com/simplystats/simplystats.github.io/master/_images/2016-03-21/sizeVsHour_rCode.jpg" alt="Figure 2: Coding output by hour of day. X-axis starts at 5AM to divide the day into a more temporal order." /></p> <p>Lastly, I examined R coding output by day of the week. As expected, the lowest output occurred over the weekend, especially on Saturdays. Interestingly, I tended to increase output later in the work week as a faculty member, and also work a little more on Sundays and Mondays, compared to a PhD student.</p> <p><img src="https://raw.githubusercontent.com/simplystats/simplystats.github.io/master/_images/2016-03-21/sizeVsDay_rCode.jpg" alt="Figure 3: Coding output by day of week." /></p> <p>Looking at the code itself, of the 432,753 lines, 84,343 were newlines (19.5%), 66,900 were lines that were exclusively comments (15.5%), and an additional 6,994 lines contained comments following R code (1.6%). 
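</p> <p>The breakdown above (blank lines, comment-only lines, and lines with a comment following code) can be sketched like this. It is an illustration in Python rather than the R code from the repository linked in this post; <code>classify_r_lines</code> and <code>percent</code> are hypothetical names, and the trailing-comment check is naive because a <code>#</code> inside a string literal is counted too.</p>

```python
def classify_r_lines(source):
    """Count blank, comment-only, and trailing-comment lines in R source text."""
    counts = {"total": 0, "blank": 0, "comment_only": 0, "trailing_comment": 0}
    for line in source.splitlines():
        counts["total"] += 1
        stripped = line.strip()
        if not stripped:
            counts["blank"] += 1
        elif stripped.startswith("#"):
            counts["comment_only"] += 1
        elif "#" in stripped:
            # Naive: a '#' inside a string literal is also counted here.
            counts["trailing_comment"] += 1
    return counts


def percent(part, total):
    """Share of part in total, as a percentage rounded to one decimal place."""
    return round(100.0 * part / total, 1)


# A tiny made-up R snippet: one trailing comment, one blank, one comment-only line.
example = "x = 1  # set x\n\n# a comment\ny = x + 1\n"
print(classify_r_lines(example))

# The percentages quoted in the text, recomputed from the reported counts.
print(percent(84343, 432753), percent(66900, 432753), percent(6994, 432753))
```

<p>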
Some of my most used syntax and symbols, as line counts containing at least one symbol, were pretty much as expected (dropping commas and requiring whitespace between characters):</p> <table> <tbody> <tr> <td>Code</td> <td>Count</td> <td>Code</td> <td>Count</td> </tr> <tr> <td>=</td> <td>175604</td> <td>==</td> <td>5542</td> </tr> <tr> <td>#</td> <td>48763</td> <td>&lt;</td> <td>5039</td> </tr> <tr> <td>&lt;-</td> <td>16492</td> <td>for(i</td> <td>5012</td> </tr> <tr> <td>{</td> <td>11879</td> <td>&amp;</td> <td>4803</td> </tr> <tr> <td>}</td> <td>11612</td> <td>the</td> <td>4734</td> </tr> <tr> <td>in</td> <td>10587</td> <td>function(x)</td> <td>4591</td> </tr> <tr> <td>##</td> <td>8508</td> <td>###</td> <td>4105</td> </tr> <tr> <td>~</td> <td>6948</td> <td>-</td> <td>4034</td> </tr> <tr> <td>&gt;</td> <td>5621</td> <td>%in%</td> <td>3896</td> </tr> </tbody> </table> <p>My code is available on GitHub: https://github.com/andrewejaffe/how-many-lines (after removing file paths and names, as many of the projects are currently unpublished and many files are placed in folders named by collaborator), so feel free to give it a try and see how much R code you’ve written over your career. While there are probably a lot more things to play around with and explore, this was about all the time I could commit to this, given other responsibilities (I’m not on sabbatical like <a href="http://jtleek.com">Jeff Leek</a>…). All in all, this was a pretty fun experience and largely reflected, with data, how my R skills and experience have progressed over the years.</p> Not So Standard Deviations Episode 11 - Start and Stop 2016-03-14T00:00:00+00:00 http://simplystats.github.io/2016/03/14/nssd-episode-11 <p>We’ve started a Patreon page! Now you can support the podcast directly by going to <a href="http://patreon.com/NSSDeviations">our page</a> and making a pledge. 
This will help Hilary and me build the podcast, add new features, and get some better equipment.</p> <p>Episode 11 is an all craft episode of <em>Not So Standard Deviations</em>, where Hilary and Roger discuss starting and ending a data analysis. What do you do at the very beginning of an analysis? Hilary and Roger talk about some of the things that seem to come up all the time. Also up for discussion is the American Statistical Association’s statement on <em>p</em> values, famous statisticians on Twitter, and evil data scientists on TV. Plus two new things for free advertising.</p> <p><a href="https://itunes.apple.com/us/podcast/not-so-standard-deviations/id1040614570">Subscribe to the podcast on iTunes</a>.</p> <p>Please <a href="https://itunes.apple.com/us/podcast/not-so-standard-deviations/id1040614570">leave us a review on iTunes</a>!</p> <p>Show notes:</p> <ul> <li> <p><a href="http://patreon.com/NSSDeviations">NSSD Patreon Page</a></p> </li> <li> <p><a href="https://twitter.com/deleeuw_jan">Jan de Leeuw</a></p> </li> <li> <p><a href="https://twitter.com/BatesDmbates">Douglas Bates</a></p> </li> <li> <p><a href="https://en.wikipedia.org/wiki/Sports_Night">Sports Night</a></p> </li> <li> <p><a href="http://goo.gl/JFz7ic">ASA’s statement on p values</a></p> </li> <li> <p><a href="http://goo.gl/O8kL60">Basic and Applied Psychology Editorial banning p values</a></p> </li> <li> <p><a href="http://www.seriouseats.com/vegan-experience">J. 
Kenji Alt’s Vegan Experience</a></p> </li> <li> <p><a href="http://fieldworkfail.com/">fieldworkfail</a></p> </li> </ul> <p><a href="https://soundcloud.com/nssd-podcast/episode-11-start-and-stop">Download the audio for this episode</a>.</p> <iframe width="100%" height="166" scrolling="no" frameborder="no" src="https://w.soundcloud.com/player/?url=https%3A//api.soundcloud.com/tracks/251825714&amp;color=ff5500&amp;auto_play=false&amp;hide_related=false&amp;show_comments=true&amp;show_user=true&amp;show_reposts=false"></iframe> Not So Standard Deviations Episode 10 - It's All Counterexamples 2016-03-02T00:00:00+00:00 http://simplystats.github.io/2016/03/02/nssd-episode-10 <p>In the latest episode of Not So Standard Deviations Hilary and I talk about the motivation behind the <a href="https://github.com/hilaryparker/explainr">explainr</a> package and the general usefulness of automated reporting and interpretation of statistical tests. Also, Roger struggles to come up with a quick and easy way to explain why statistics is useful when it sometimes doesn’t produce any different results.</p> <p><a href="https://itunes.apple.com/us/podcast/not-so-standard-deviations/id1040614570">Subscribe to the podcast on iTunes</a>.</p> <p>Please <a href="https://itunes.apple.com/us/podcast/not-so-standard-deviations/id1040614570">leave us a review on iTunes</a>!</p> <p>Show notes:</p> <ul> <li> <p>The <a href="https://github.com/hilaryparker/explainr">explainr</a> package</p> </li> <li> <p><a href="https://google.github.io/CausalImpact/CausalImpact.html">Google’s CausalImpact package</a></p> </li> <li> <p><a href="http://www.wsj.com/articles/SB10001424053111903480904576512250915629460">Software is Eating the World</a></p> </li> <li> <p><a href="http://allendowney.blogspot.com/2015/12/many-rules-of-statistics-are-wrong.html">Many Rules of Statistics are Wrong</a></p> </li> </ul> <p><a href="https://soundcloud.com/nssd-podcast/episode-10-its-all-counterexamples">Download the audio for 
this episode</a>.</p> <iframe width="100%" height="166" scrolling="no" frameborder="no" src="https://w.soundcloud.com/player/?url=https%3A//api.soundcloud.com/tracks/249517993&amp;color=ff5500&amp;auto_play=false&amp;hide_related=false&amp;show_comments=true&amp;show_user=true&amp;show_reposts=false"></iframe> Preprints are great, but post publication peer review isn't ready for prime time 2016-02-26T00:00:00+00:00 http://simplystats.github.io/2016/02/26/preprints-and-pppr <p>The current publication system works something like this:</p> <h3 id="coupled-review-and-publication">Coupled review and publication</h3> <ol> <li>You write a paper</li> <li>You submit it to a journal</li> <li>It is peer reviewed privately</li> <li>The paper is accepted or rejected a. If rejected go back to step 2 and start over b. If accepted it will be published</li> <li>If published then people can read it</li> </ol> <p>This system has several major disadvantages that bother scientists. It means all research appears on a lag (whatever the time in peer review is). It can be a major lag time if the paper is sent to “top tier journals” and rejected, then filters down to “lower tier” journals before ultimate publication. Another disadvantage is that there are two options for most people to publish their papers: (a) in closed access journals where it doesn’t cost anything to publish but then the articles are behind paywalls and (b) in open access journals where anyone can read them but it costs money to publish. 
Especially for junior scientists or folks without resources, this creates a difficult choice because they <a href="http://simplystatistics.org/2011/11/03/free-access-publishing-is-awesome-but-expensive-how/">might not be able to afford open access fees</a>.</p> <p>For a number of years some fields like physics (with the <a href="http://arxiv.org/">arxiv</a>) and economics (with <a href="http://www.nber.org/papers.html">NBER</a>) have solved this problem by decoupling peer review and publication. In these fields the system works like this:</p> <h3 id="decoupled-review-and-publication">Decoupled review and publication</h3> <ol> <li>You write a paper</li> <li>You post a preprint a. Everyone can read and comment</li> <li>You submit it to a journal</li> <li>It is peer reviewed privately</li> <li>The paper is accepted or rejected a. If rejected go back to step 2 and start over b. If accepted it will be published</li> </ol> <p>Lately there has been a growing interest in this same system in molecular and computational biology. I think this is a really good thing, because it makes it easier to publish papers more quickly and doesn’t cost researchers to publish. That is why the papers my group publishes all show up on <a href="http://biorxiv.org/search/author1%3AJeffrey%2BLeek%2B">biorxiv</a> or <a href="http://arxiv.org/find/stat/1/au:+Leek_J/0/1/0/all/0/1">arxiv</a> first.</p> <p>While I think this decoupling is great, there seems to be a push for this decoupling and at the same time a move to post publication peer review. 
I used to argue pretty strongly for <a href="http://simplystatistics.org/2012/10/04/should-we-stop-publishing-peer-reviewed-papers/">post-publication peer review</a> but Rafa <a href="http://simplystatistics.org/2012/10/08/why-we-should-continue-publishing-peer-reviewed-papers/">set me straight</a> and pointed out that at least with peer review every paper that gets submitted gets evaluated by <em>someone</em> even if the paper is ultimately rejected.</p> <p>One of the risks of post publication peer review is that there is no incentive to peer review in the current system. In a paper a few years ago I actually showed that under an economic model for closed peer review the Nash equilibrium is for <a href="http://journals.plos.org/plosone/article?id=10.1371/journal.pone.0026895">no one to peer review at all</a>. We showed in that same paper that under open peer review there is an increase in the amount of time spent reviewing, but the effect was relatively small. Moreover the dangers of open peer review are clear (junior people reviewing senior people and being punished for it) while the benefits (potentially being recognized for insightful reviews) are much hazier. Even the most vocal proponents of post publication peer review <a href="http://www.ncbi.nlm.nih.gov/myncbi/michael.eisen.1/comments/">don’t do it that often</a> when given the chance.</p> <p>The reason is that everyone in academia already has a lot of things they are asked to do. Many review papers either out of a sense of obligation or because they want to be in the good graces of a particular journal. Without this system in place there is a strong chance that peer review rates will drop and only a few papers will get reviewed. This will ultimately decrease the accuracy of science. In our (admittedly contrived/simplified) <a href="http://journals.plos.org/plosone/article?id=10.1371/journal.pone.002689">experiment</a> on peer review, accuracy went from 39% to 78% after solutions were reviewed. 
You might argue that only “important” papers should be peer reviewed but then you are back in the camp of glamour. Say what you want about glamour journals. They are for sure biased by the names of the people submitting the papers there. But it is <em>possible</em> for someone to get a paper in no matter who they are. If we go to a system where there is no curation through a journal-like mechanism then popularity/twitter followers/etc. will drive readers. I’m not sure that is better than where we are now.</p> <p>So while I think pre-prints are a great idea I’m still waiting to see a system that beats pre-publication review for maintaining scientific quality (even though it may just be an <a href="http://simplystatistics.org/2015/02/09/the-trouble-with-evaluating-anything/">impossible problem</a>).</p> Spreadsheets: The Original Analytics Dashboard 2016-02-23T08:42:30+00:00 http://simplystats.github.io/2016/02/23/spreadsheets-the-original-analytics-dashboard <p>Soon after my discussion with Hilary Parker and Jenny Bryan about spreadsheets on <em><a href="https://soundcloud.com/nssd-podcast">Not So Standard Deviations</a></em>, Brooke Anderson forwarded me <a href="https://backchannel.com/a-spreadsheet-way-of-knowledge-8de60af7146e#.gj4f2bod4">this article</a> written by Steven Levy about the original granddaddy of spreadsheets, <a href="https://en.wikipedia.org/wiki/VisiCalc">VisiCalc</a>. Actually, the real article was written back in 1984 as so-called microcomputers were just getting their start. VisiCalc was originally written for the Apple II computer and notable competitors at the time included <a href="https://en.wikipedia.org/wiki/Lotus_1-2-3">Lotus 1-2-3</a> and Microsoft <a href="https://en.wikipedia.org/wiki/Multiplan">Multiplan</a>, all since defunct.</p> <p>It’s interesting to see Levy’s perspective on spreadsheets back then and to compare it to the current thinking about data, data science, and reproducibility in science. 
The problem back then was that “ledger sheets” (what we might now call a spreadsheet), which contained numbers and calculations related to businesses, were tedious to make and keep up to date.</p> <blockquote> <p>Making spreadsheets, however necessary, was a dull chore best left to accountants, junior analysts, or secretaries. As for sophisticated “modeling” tasks – which, among other things, enable executives to project costs for their companies – these tasks could be done only on big mainframe computers by the data-processing people who worked for the companies Harvard MBAs managed.</p> </blockquote> <p>You can see one issue here: Spreadsheets/Ledgers were a “dull chore”, and best left to junior people. However, the “real” computation was done by the people in the “data processing” center on big mainframes. So what exactly does that leave for the business executive to do?</p> <p>Note that the way of doing things back then was effectively reproducible, because the presentation (ledger sheets printed on paper) and the computation (data processing on mainframes) were separated.</p> <p>The impact of the microcomputer-based spreadsheet program appears profound.</p> <blockquote> <p id="9424" class="graf--p graf-after--p"> Already, the spreadsheet has redefined the nature of some jobs; to be an accountant in the age of spreadsheet program is — well, almost sexy. And the spreadsheet has begun to be a forceful agent of decentralization, breaking down hierarchies in large companies and diminishing the power of data processing. </p> <p class="graf--p graf-after--p"> There has been much talk in recent years about an “entrepreneurial renaissance” and a new breed of risk-taker who creates businesses where none previously existed. Entrepreneurs and their venture-capitalist backers are emerging as new culture heroes, settlers of another American frontier. 
Less well known is that most of these new entrepreneurs depend on their economic spreadsheets as much as movie cowboys depend on their horses. </p> </blockquote> <p class="graf--p graf-after--p"> If you replace "accountant" with "statistician" and "spreadsheet" with "big data", you are magically teleported into 2016. </p> <p class="graf--p graf-after--p"> The way I see it, in the early 80's, spreadsheets satisfied the never-ending desire that people have to interact with data. Now, with things like tablets and touch-screen phones, you can literally "touch" your data. But it took microcomputers to get to a certain point before interactive data analysis could really be done in a way that we recognize today. Spreadsheets tightened the loop between question and answer by cutting out the Data Processing department and replacing it with an Apple II (or an IBM PC, if you must) right on your desk. </p> <p class="graf--p graf-after--p"> Of course, the combining of presentation with computation comes at a cost of reproducibility and perhaps quality control. Seeing the description of how spreadsheets were originally used, it seems totally natural to me. It is not unlike today's analytic dashboards that give you a window into your business and allow you to "model" various scenarios by tweaking a few numbers or formulas. Over time, people took spreadsheets to all sorts of extremes, using them for purposes for which they were not originally designed, and problems naturally arose. </p> <p class="graf--p graf-after--p"> So now, we are trying to separate out the computation and presentation bits a little. Tools like knitr and R and shiny allow us to do this and to bring them together with a proper toolchain. The loss in interactivity is only slight because of the power of the toolchain and the speed of computers nowadays. Essentially, we've brought back the Data Processing department, but have staffed it with robots and high speed multi-core computers. 
</p> Non-tidy data 2016-02-17T15:47:23+00:00 http://simplystats.github.io/2016/02/17/non-tidy-data <p>During the discussion that followed the ggplot2 posts from David and me last week we started talking about tidy data, and the man himself noted that matrices are often useful instead of <a href="http://vita.had.co.nz/papers/tidy-data.pdf">“tidy data”</a> and I mentioned there might be other data that are usefully “non tidy”. Here I will be using tidy/non-tidy according to Hadley’s definition. So tidy data have:</p> <ul> <li>One variable per column</li> <li>One observation per row</li> <li>Each type of observational unit forms a table</li> </ul> <p>I push this approach in my <a href="https://github.com/jtleek/datasharing">guide to data sharing</a> and in a lot of my personal work. But note that non-tidy data can definitely be already processed, cleaned, organized and ready to use.</p> <blockquote class="twitter-tweet" data-width="550"> <p lang="en" dir="ltr"> <a href="https://twitter.com/hadleywickham">@hadleywickham</a> <a href="https://twitter.com/drob">@drob</a> <a href="https://twitter.com/mark_scheuerell">@mark_scheuerell</a> I'm saying that not all data are usefully tidy (and not just matrices) so I care more abt flexibility </p> <p> &mdash; Jeff Leek (@jtleek) <a href="https://twitter.com/jtleek/status/698247927706357760">February 12, 2016</a> </p> </blockquote> <p>This led to a very specific blog request:</p> <blockquote class="twitter-tweet" data-width="550"> <p lang="en" dir="ltr"> <a href="https://twitter.com/jtleek">@jtleek</a> <a href="https://twitter.com/drob">@drob</a> I want a blog post on non-tidy data! </p> <p> &mdash; Hadley Wickham (@hadleywickham) <a href="https://twitter.com/hadleywickham/status/698251883685646336">February 12, 2016</a> </p> </blockquote> <p>So I thought I’d talk about a couple of reasons why data are usefully non-tidy. 
The basic reason is that I usually take a <a href="http://simplystatistics.org/2013/05/29/what-statistics-should-do-about-big-data-problem-forward-not-solution-backward/">problem first, not solution backward</a> approach to my scientific research. In other words, the goal is to solve a particular problem and the format I choose is the one that makes it most direct/easy to solve that problem, rather than one that is theoretically optimal. To illustrate these points I’ll use an example from my area.</p> <p><strong>Example data</strong></p> <p>Often you want data in a matrix format. One good example is gene expression data or data from another high-dimensional experiment. David talks about one such example in <a href="http://varianceexplained.org/r/tidy-genomics/">his post here</a>. He makes the (valid) point that for students who aren’t going to do genomics professionally, it may be more useful to learn an abstract tool such as tidy data/dplyr. But for those working in genomics, this can make you do unnecessary work in the name of theory/abstraction.</p> <p>He analyzes the data in that post by first tidying the data.</p> <div class="wp_syntax"> <table> <tr> <td class="code"> <pre class="r" style="font-family:monospace;">library(dplyr)
library(tidyr)
library(stringr)
library(readr)
library(broom)

# original_data is assumed to have been read in already (the loading step is not shown)
cleaned_data &lt;- original_data %&gt;%
  separate(NAME, c("name", "BP", "MF", "systematic_name", "number"), sep = "\\|\\|") %&gt;%
  mutate_each(funs(trimws), name:systematic_name) %&gt;%
  select(-number, -GID, -YORF, -GWEIGHT) %&gt;%
  gather(sample, expression, G0.05:U0.3) %&gt;%
  separate(sample, c("nutrient", "rate"), sep = 1, convert = TRUE)</pre> </td> </tr> </table> </div> <p>It isn’t 100% tidy as data of different types are in the same data frame (gene expression and metadata/phenotype data belong in different tables). But it’s close enough for our purposes. Now suppose that you wanted to fit a model and test for association between the “rate” variable and gene expression for each gene. 
You can do this with David’s tidy data set, dplyr, and the broom package like so:</p> <div class="wp_syntax"> <table> <tr> <td class="code"> <pre class="r" style="font-family:monospace;">rate_coeffs = cleaned_data %&gt;% group_by(name) %&gt;%
  do(fit = lm(expression ~ rate + nutrient, data = .)) %&gt;%
  tidy(fit) %&gt;%
  dplyr::filter(term=="rate")</pre> </td> </tr> </table> </div> <p>On my computer we get something like:</p> <div class="wp_syntax"> <table> <tr> <td class="code"> <pre class="r" style="font-family:monospace;">system.time( cleaned_data %&gt;% group_by(name) %&gt;%
+ do(fit = lm(expression ~ rate + nutrient, data = .)) %&gt;%
+ tidy(fit) %&gt;%
+ dplyr::filter(term=="rate"))
|==========================================================|100% ~0 s remaining
   user  system elapsed
 12.431   0.258  12.364</pre> </td> </tr> </table> </div> <p>Let’s now try that analysis a little bit differently. As a first step, let’s store the data in two separate tables: a table of “phenotype information” and a matrix of “expression levels”. This is the more common format used for this type of data. 
Here is the code to do that:</p> <div class="wp_syntax"> <table> <tr> <td class="code"> <pre class="r" style="font-family:monospace;"># keep only the sample columns (the only column names that contain digits)
expr = original_data %&gt;% select(grep("[0-9]",names(original_data)))

rownames(expr) = original_data %&gt;%
  separate(NAME, c("name", "BP", "MF", "systematic_name", "number"), sep = "\\|\\|") %&gt;%
  select(systematic_name) %&gt;%
  mutate_each(funs(trimws),systematic_name) %&gt;%
  as.matrix()

vals = data.frame(vals=names(expr))
pdata = separate(vals,vals,c("nutrient", "rate"), sep = 1, convert = TRUE)

expr = as.matrix(expr)</pre> </td> </tr> </table> </div> <p>If we leave the data in this format we can get the model fits and the p-values using some simple linear algebra.</p> <div class="wp_syntax"> <table> <tr> <td class="code"> <pre class="r" style="font-family:monospace;">expr = as.matrix(expr)

mod = model.matrix(~ rate + as.factor(nutrient),data=pdata)
# least squares coefficients for every gene at once: one row per gene, one column per model term
rate_betas = expr %*% mod %*% solve(t(mod) %*% mod)</pre> </td> </tr> </table> </div> <p>This gives the same answer after re-ordering:</p> <div class="wp_syntax"> <table> <tr> <td class="code"> <pre class="r" style="font-family:monospace;"># ind is the re-ordering index (its construction is not shown here)
all(abs(rate_betas[,2]- rate_coeffs$estimate[ind]) &lt; 1e-5,na.rm=T)
[1] TRUE</pre> </td> </tr> </table> </div> <p>But this approach is much faster.</p> <div class="wp_syntax"> <table> <tr> <td class="code"> <pre class="r" style="font-family:monospace;">system.time(expr %*% mod %*% solve(t(mod) %*% mod))
   user  system elapsed
  0.015   0.000   0.015</pre> </td> </tr> </table> </div> <p>This requires some knowledge of linear algebra and isn’t pretty. But it brings us to the first general point: <strong>you might not use tidy data because some computations are more efficient if the data is in a different format. 
</strong></p> <p>Many examples from graphical models, to genomics, to neuroimaging, to social sciences rely on some kind of linear algebra based computations (matrix multiplication, singular value decompositions, eigen decompositions, etc.) which are all optimized to work on matrices, not tidy data frames. There are ways to improve performance with tidy data for sure, but they would require an equal amount of custom code to take advantage of say C, or vectorization properties in R.</p> <p>Ok now the linear regressions here are all treated independently, but it is very well known that you get much better performance in terms of the false positive/true positive tradeoff if you use an empirical Bayes approach for this calculation where <a href="https://bioconductor.org/packages/release/bioc/html/limma.html">you pool variances</a>.</p> <p>If the data are in this matrix format you can do it with R like so:</p> <div class="wp_syntax"> <table> <tr> <td class="code"> <pre class="r" style="font-family:monospace;">library(limma)
fit_limma = lmFit(expr,mod)
ebayes_limma = eBayes(fit_limma)
topTable(ebayes_limma)</pre> </td> </tr> </table> </div> <p>This approach is again very fast, optimized for the calculations being performed and performs much better than the one-by-one regression approach. But it requires the data in matrix or expression set format. Which brings us to the second general point: <strong>you might not use tidy data because many functions require a different, also very clean and useful data format, and you don’t want to have to constantly be switching back and forth.</strong> Again, this requires you to be more specific to your application, but the potential payoffs can be really big as in the case of limma.</p> <p>I’m showing an example here with expression sets and matrices, but in NLP the data are often input in the form of lists, in graphical analyses as matrices, in genomic analyses as GRanges lists, etc. etc. etc. 
One option would be to rewrite all infrastructure in your area of interest to accept tidy data formats but that would be going against conventions of a community and would ultimately cost you a lot of work when most of that work has already been done for you.</p> <p>The final point, which I won’t discuss here, is that data are often usefully represented in a non-tidy way. Examples include the aforementioned <a href="http://kasperdanielhansen.github.io/genbioconductor/html/GenomicRanges_GRanges.html">GRanges list</a> which consists of (potentially) ragged arrays of intervals and quantitative measurements about them. You could <strong>force</strong> these data to be tidy by the definition above, but again most of the infrastructure is built around a different format that is much more intuitive for that type of data. Similarly, data from other applications may be more suited to application specific formats.</p> <p>In summary, tidy data is a useful conceptual idea and is often the right way to go for general, small data sets, but may not be appropriate for all problems. Here are some examples of data formats (biased toward my area, but there are others) that have been widely adopted, have a ton of useful software, but don’t meet the tidy data definition above. 
I will define these as “processed data” as opposed to “tidy data”.</p> <ul> <li><a href="http://bioconductor.org/packages/3.3/bioc/vignettes/Biobase/inst/doc/ExpressionSetIntroduction.pdf">Expression sets</a> for expression data</li> <li><a href="http://kasperdanielhansen.github.io/genbioconductor/html/SummarizedExperiment.html">Summarized experiments</a> for a variety of genomic experiments</li> <li><a href="http://kasperdanielhansen.github.io/genbioconductor/html/GenomicRanges_GRanges.html">Granges Lists</a> for genomic intervals</li> <li><a href="https://cran.r-project.org/web/packages/tm/tm.pdf">Corpus</a> objects for corpora of texts.</li> <li><a href="http://igraph.org/r/doc/">igraph objects</a> for graphs</li> </ul> <p>I’m sure there are a ton more I’m missing and would be happy to get some suggestions on Twitter too.</p> <p> </p> When it comes to science - its the economy stupid. 2016-02-16T14:57:14+00:00 http://simplystats.github.io/2016/02/16/when-it-comes-to-science-its-the-economy-stupid <p>I read a lot of articles about what is going wrong with science:</p> <ul> <li><a href="http://www.economist.com/news/briefing/21588057-scientists-think-science-self-correcting-alarming-degree-it-not-trouble">The reproducibility/replicability crisis</a></li> <li><a href="http://www.theatlantic.com/business/archive/2013/02/the-phd-bust-americas-awful-market-for-young-scientists-in-7-charts/273339/">Lack of jobs for PhDs</a></li> <li><a href="https://theresearchwhisperer.wordpress.com/2013/11/19/academic-scattering/">The pressure on the families (or potential families) of scientists</a></li> <li><a href="http://quillette.com/2016/02/15/the-unbearable-asymmetry-of-bullshit/?utm_content=buffer235f2&amp;utm_medium=social&amp;utm_source=twitter.com&amp;utm_campaign=buffer">Hype around specific papers and a more general abundance of BS</a></li> <li><a href="http://www.michaeleisen.org/blog/?p=1179">Consortia and their potential evils</a></li> <li><a 
href="http://www.vox.com/2015/12/7/9865086/peer-review-science-problems">Peer review not working well</a></li> <li><a href="http://www.nejm.org/doi/full/10.1056/NEJMe1516564">Research parasites</a></li> <li><a href="http://gmwatch.org/news/latest-news/16691-public-science-is-broken-says-professor-who-helped-expose-water-pollution-crisis">Not enough room for applications/public good</a></li> <li><a href="http://www.statnews.com/2016/02/10/press-releases-stink/?s_campaign=stat:rss">Press releases that do evil</a></li> <li><a href="https://twitter.com/Richvn/status/697725899404349440">Scientists don’t release enough data</a></li> </ul> <p>These articles always point to the “incentives” in science and how they don’t align with how we’d like scientists to work. These discussions often frustrate me because they almost always boil down to asking scientists (especially and often junior scientists) to make some kind of change for public good without any guarantee that they are going to be ok. I’ve seen an acceleration/accumulation of people who are focusing on these issues, I think largely  because it is now possible to make a very nice career by pointing out how other people are doing science wrong.</p> <p>The issue I have is that the people who propose unilateral moves seem to care less that science is both (a) a calling and (b) a career for most people. I do science because I love it. I do science because I want to discover new things about the world. It is a direct extension of the wonder and excitement I had about the world when I was a little kid. But science is also a career for me. It matters if I get my next grant, if I get my next paper. Why? Because I want to be able to support myself and my family.</p> <p>The issue with incentives is that talking about them costs nothing, but actually changing them is expensive. 
Right now our system, broadly defined, rewards (a) productivity - lots of papers, (b) cleverness - coming up with an idea first, and (c) measures of prestige - journal titles, job titles, etc. This is because there are tons of people going for a relatively small amount of grant money. More importantly, that money is decided on by processes that are both peer reviewed and political.</p> <p>Suppose that you wanted to change those incentives to something else. Here is a small list of things I would like:</p> <ul> <li>People can have stable careers and live in a variety of places without massive two body problems</li> <li>Scientists shouldn’t have to move every couple of years 2-3 times right at the beginning of their career</li> <li>We should distribute our money among the <a href="http://simplystatistics.org/2015/12/01/thinking-like-a-statistician-fund-more-investigator-initiated-grants/">largest number of scientists possible </a></li> <li>Incentivizing long term thinking</li> <li>Incentivizing objective peer review</li> <li>Incentivizing openness and sharing</li> </ul> <div> The key problem isn't publishing, or code, or reproducibility, or even data analysis. </div> <div> </div> <div> <b>The key problem is that the fundamental model by which we fund science is completely broken. </b> </div> <div> </div> <div> The model now is that you have to come up with an <span class="lG">idea</span> every couple of years then "sell" it to funders, your peers, etc. 
This is the source of the following problems: </div> <div> </div> <ul> <li>An incentive to publish only positive results so your <span class="lG">ideas</span> look good</li> <li>An incentive to be closed so people don’t discover flaws in your analysis</li> <li> An incentive to publish in specific “<span class="lG">big</span> name” journals that skews the results (again mostly in the positive direction)</li> <li> Pressure to publish quickly which leads to cutting corners</li> <li>Pressure to stay in a single area and make incremental changes so you know things will work.</li> </ul> <div> If we really want to have any measurable impact on science we need to solve the funding model. The solution is actually pretty simple. We need to give out 20+ year grants to people who meet minimum qualifications. These grants would cover their own salary plus one or two people and the minimum necessary equipment. </div> <div> </div> <div> The criteria for getting or renewing these grants should not be things like Nature papers or number of citations. They have to be designed to incentivize the things that we want (mine are listed above). So if I were going to define the criteria for meeting the standards, people would have to be: </div> <div> </div> <ul> <li>Working on a scientific problem and trained as a scientist</li> <li>Publishing all results immediately online as preprints/free code</li> <li>Responding to queries about their data/code</li> <li>Agreeing to peer review a number of papers per year</li> </ul> <p>More importantly these grants should be given out for a very long term (20+ years) and not be tied to a specific institution. This would allow people to have flexible careers and to target bigger picture problems. 
We saw the benefits of people working on problems they weren’t originally funded to work on with <a href="http://www.wired.com/2016/02/zika-research-utmb/">research on the Zika virus.</a></p> <p>These grants need to be awarded using a rigorous peer review system just like the NIH, HHMI, and other organizations use to ensure we are identifying scientists with potential early in their careers and letting them flourish. But they’d be given out in a different manner. I’m very confident in the ability of peer review to detect the difference between pseudo-science and real science, or complete hype and realistic improvement. But I’m much less confident in the ability of peer review to accurately distinguish “important” from “not important” research. So I think we should <a href="http://www.wsj.com/articles/SB10001424052702303532704579477530153771424">consider seriously the lottery</a> for these grants.</p> <p>Each year all eligible scientists who meet some minimum entry requirements submit proposals for what they’d like to do scientifically. Each year those proposals are reviewed to make sure they meet the very minimum bar (are they scientific? do they have relevant training at all?). Among all the (very large) class of people who pass that bar we hold a lottery. We take the number of research dollars and divide it up to give the maximum number of these grants possible. These grants might be pretty small - just enough to fund the person’s salary and maybe one or two students/postdocs. To make this work for labs that require equipment there would have to be cooperative arrangements between multiple independent individuals to fund/sustain equipment they needed. Renewal of these grants would happen as long as you were posting your code/data online, you were meeting peer review requirements, and responding to inquiries about your work.</p> <p>One thing we’d do to fund this model is eliminate/reduce large-scale projects and super well funded labs. 
Instead of having 30 postdocs in a well funded lab, you’d have some fraction of those people funded as independent investigators right from the get-go. If we wanted to run a massive large scale program that would be out of a very specific pot of money that would have to be saved up and spent, completely outside of the pot of money for investigator-initiated grants. That would reduce the hierarchy in the system, reduce pressure that leads to bad incentives, and give us the best chance to fund creative, long term thinking science.</p> <p>Regardless of whether you like my proposal or not, I hope that people will start focusing on how to change the incentives, even when that means doing something big or potentially costly.</p> <p> </p> <p> </p> Not So Standard Deviations Episode 9 - Spreadsheet Drama 2016-02-12T11:24:04+00:00 http://simplystats.github.io/2016/02/12/not-so-standard-deviations-episode-9-spreadsheet-drama <p>For this episode, special guest Jenny Bryan (@jennybryan) joins us from the University of British Columbia! Jenny, Hilary, and I talk about spreadsheets and why some people love them and some people despise them. 
We also discuss blogging as part of scientific discourse.</p> <p><a href="https://itunes.apple.com/us/podcast/not-so-standard-deviations/id1040614570">Subscribe to the podcast on iTunes</a>.</p> <p>Show notes:</p> <ul> <li><a href="http://stat545-ubc.github.io/">Jenny’s Stat 545</a></li> <li><a href="http://goo.gl/VvFyXz">Coding is not the new literacy</a></li> <li><a href="https://goo.gl/mC0Qz9">Goldman Sachs spreadsheet error</a></li> <li><a href="https://goo.gl/hNloVr">Jingmai O’Connor episode</a></li> <li><a href="http://goo.gl/IYDwn1">De-weaponizing reproducibility</a></li> <li><a href="https://goo.gl/n02EGP">Vintage Space</a></li> <li><a href="https://goo.gl/H3YgV6">Tabby Cats</a></li> </ul> <p><a href="https://soundcloud.com/nssd-podcast/episode-9-spreadsheet-drama">Download the audio for this episode</a>.</p> <iframe width="100%" height="166" scrolling="no" frameborder="no" src="https://w.soundcloud.com/player/?url=https%3A//api.soundcloud.com/tracks/246296744&amp;color=ff5500&amp;auto_play=false&amp;hide_related=false&amp;show_comments=true&amp;show_user=true&amp;show_reposts=false"></iframe> Why I don't use ggplot2 2016-02-11T13:25:38+00:00 http://simplystats.github.io/2016/02/11/why-i-dont-use-ggplot2 <p>Some of my colleagues think of me as super data-sciencey compared to other academic statisticians. But one place I lose tons of street cred in the data science community is when I talk about ggplot2. For the 3 data type people on the planet who still don’t know what that is, <a href="https://cran.r-project.org/web/packages/ggplot2/index.html">ggplot2</a> is an R package/phenomenon for data visualization. It was created by Hadley Wickham, who is (in my opinion) perhaps the most important statistician/data scientist on the planet. It is one of the best maintained, most important, and really well done R packages. 
Hadley also supports R software like few other people on the planet.</p> <p>But I don’t use ggplot2 and I get nervous when other people do.</p> <p>I get no end of grief for this from <a href="https://soundcloud.com/nssd-podcast/episode-9-spreadsheet-drama">Hilary and Roger</a> and especially from <a href="https://twitter.com/drob/status/625682366913228800">drob</a>, among many others. So I thought I would explain why and defend myself from the internet hordes. To understand why I don’t use it, you have to understand the three cases where I use data visualization.</p> <ol> <li>When creating exploratory graphics - graphs that are fast to make, are not shown to anyone else, and help me explore a data set.</li> <li>When creating expository graphs - graphs that I want to put into a publication and that have to be very carefully made.</li> <li>When grading student data analyses.</li> </ol> <p>Let’s consider each case.</p> <p><strong>Exploratory graphs</strong></p> <p>Exploratory graphs don’t have to be pretty. I’m going to be the only one who looks at 99% of them. But I have to be able to make them <em>quickly</em> and I have to be able to make a <em>broad range of plots</em> <em>with minimal code</em>. Many types of graphs, including things like heatmaps, don’t neatly fit into ggplot2 code, which makes them challenging to produce. The flexibility of base R comes at a price, but it means you can make all sorts of things you need to without struggling against the system, which is a huge advantage for data analysts. There are some graphs (<a href="http://rafalab.dfci.harvard.edu/images/frontb300.png">like this one</a>) that are pretty straightforward in base, but require quite a bit of work in ggplot2. 
In many cases qplot can be used sort of interchangeably with plot, but then you really don’t get any of the advantages of the ggplot2 framework.</p> <p><strong>Expository graphs</strong></p> <p>When making graphs that are production ready or fit for publication, you can do this with any system. You can do it with ggplot2, with lattice, with base R graphics. But regardless of which system you use, it will require about an equal amount of code to make a graph ready for publication. One perfect example of this is the <a href="http://motioninsocial.com/tufte/">comparison of different plotting systems</a> for creating Tufte-like graphs. To create this minimal barchart:</p> <p><img class="aligncenter" src="" alt="" width="373" height="280" /></p> <p>The code they use in base graphics is this (super blurry, sorry; you can also <a href="http://motioninsocial.com/tufte/">go to the website</a> for a better view).</p> <p><img class="aligncenter wp-image-4646" src="http://simplystatistics.org/wp-content/uploads/2016/02/Screen-Shot-2016-02-11-at-12.56.53-PM-300x82.png" alt="Screen Shot 2016-02-11 at 12.56.53 PM" width="483" height="132" srcset="http://simplystatistics.org/wp-content/uploads/2016/02/Screen-Shot-2016-02-11-at-12.56.53-PM-300x82.png 300w, http://simplystatistics.org/wp-content/uploads/2016/02/Screen-Shot-2016-02-11-at-12.56.53-PM-768x209.png 768w, http://simplystatistics.org/wp-content/uploads/2016/02/Screen-Shot-2016-02-11-at-12.56.53-PM-1024x279.png 1024w, http://simplystatistics.org/wp-content/uploads/2016/02/Screen-Shot-2016-02-11-at-12.56.53-PM-260x71.png 260w, http://simplystatistics.org/wp-content/uploads/2016/02/Screen-Shot-2016-02-11-at-12.56.53-PM.png 1248w" sizes="(max-width: 483px) 100vw, 483px" /></p> <p>In ggplot2 the code is:</p> <p><img class="aligncenter wp-image-4647" src="http://simplystatistics.org/wp-content/uploads/2016/02/Screen-Shot-2016-02-11-at-12.56.39-PM-300x73.png" alt="Screen Shot 2016-02-11 at 12.56.39 PM" width="526" 
height="128" srcset="http://simplystatistics.org/wp-content/uploads/2016/02/Screen-Shot-2016-02-11-at-12.56.39-PM-300x73.png 300w, http://simplystatistics.org/wp-content/uploads/2016/02/Screen-Shot-2016-02-11-at-12.56.39-PM-768x187.png 768w, http://simplystatistics.org/wp-content/uploads/2016/02/Screen-Shot-2016-02-11-at-12.56.39-PM-1024x249.png 1024w, http://simplystatistics.org/wp-content/uploads/2016/02/Screen-Shot-2016-02-11-at-12.56.39-PM-260x63.png 260w, http://simplystatistics.org/wp-content/uploads/2016/02/Screen-Shot-2016-02-11-at-12.56.39-PM.png 1334w" sizes="(max-width: 526px) 100vw, 526px" /></p> <p> </p> <p>Both require a significant amount of coding. The ggplot2 plot also takes advantage of the ggthemes package here. Which means, without that package for some specific plot, it would require more coding.</p> <p>The bottom line is for production graphics, any system requires work. So why do I still use base R like an old person? Because I learned all the stupid little tricks for that system, it was a huge pain, and it would be a huge pain to learn it again for ggplot2, to make very similar types of plots. This is one where neither system is particularly better, but the time-optimal solution is to stick with whichever system you learned first.</p> <p><strong>Grading student work</strong></p> <p>People I seriously respect suggest teaching ggplot2 before base graphics as a way to get people up and going quickly making pretty visualizations. This is a good solution to the <a href="http://simplystatistics.org/2014/08/13/swirl-and-the-little-data-scientists-predicament/">little data scientist’s predicament</a>. The tricky thing is that the defaults in ggplot2 are just pretty enough that they might trick you into thinking the graph is production ready using defaults. 
Say for example you make a plot of the latitude and longitude of the <a href="https://stat.ethz.ch/R-manual/R-devel/library/datasets/html/quakes.html">quakes</a> data in R, colored by the number of stations reporting. This is one case where ggplot2 crushes base R for simplicity because of the automated generation of a color scale. You can make this plot with just the line:</p> <p><code>ggplot() + geom_point(data = quakes, aes(x = lat, y = long, colour = stations))</code></p> <p>And get this out:</p> <p><img class="aligncenter wp-image-4649" src="http://simplystatistics.org/wp-content/uploads/2016/02/quakes-300x264.png" alt="quakes" width="420" height="370" srcset="http://simplystatistics.org/wp-content/uploads/2016/02/quakes-300x264.png 300w, http://simplystatistics.org/wp-content/uploads/2016/02/quakes-227x200.png 227w, http://simplystatistics.org/wp-content/uploads/2016/02/quakes.png 627w" sizes="(max-width: 420px) 100vw, 420px" /></p> <p>That is a pretty amazing plot in one line of code! What often happens with students in a first serious data analysis class is they think that plot is done. But it isn’t even close. Here are a few things you would need to do to make this plot production ready: (1) make the axes bigger, (2) make the labels bigger, (3) make the labels be full names (latitude and longitude, ideally with units when variables need them), and (4) make the legend title be number of stations reporting. Those are the bare minimum. But a very common move by a person who knows a little R/data analysis would be to leave that graph as it is and submit it directly. I know this from lots of experience.</p> <p>The one nice thing about teaching base R here is that the base version for this plot is either (a) a ton of work or (b) ugly. 
In either case, it makes the student think very hard about what they need to do to make the plot better, rather than just assuming it is ok.</p> <p><strong>Where ggplot2 is better for sure</strong></p> <p>ggplot2 being compatible with piping, having a simple system for theming, having a good animation package, and in general being an excellent platform for developers who create <a href="https://ggplot2-exts.github.io/index.html">extensions</a> are all huge advantages. It is also great for getting absolute newbies up and making medium-quality graphics in a huge hurry. 
This is a great way to get more people engaged in data science and I’m psyched about the reach and power ggplot2 has had. Still, I probably won’t use it for my own work, even though it disappoints my data scientist friends.</p> Data handcuffs 2016-02-10T15:38:37+00:00 http://simplystats.github.io/2016/02/10/data-handcuffs <p>A few years ago, if you had asked me which skills I got asked about most for students going into industry, I’d definitely have said things like data cleaning, data transformation, database pulls, and other non-traditional statistical tasks. But as companies have progressed from the point of storing data to actually wanting to do something with it, I would say one of the hottest skills is understanding and dealing with data from randomized trials.</p> <p>In particular I see data scientists talking more about <a href="https://medium.com/@InVisionApp/a-b-and-see-a-beginner-s-guide-to-a-b-testing-a16406f1a239#.p7hoxirwo">A/B testing</a>, <a href="http://varianceexplained.org/r/bayesian-ab-testing/">sequential stopping rules</a>, <a href="https://twitter.com/hspter/status/696820603945414656">hazard regression</a> and other ideas that are really common in Biostatistics, which has traditionally focused on the analysis of data from designed experiments in biology.</p> <p>I think it is great that companies are choosing to do experiments, as this <a href="http://simplystatistics.org/2013/07/15/yes-clinical-trials-work/">still remains</a> the gold standard for how to generate knowledge about causal effects. One interesting new development though is the extreme lengths some organizations appear to go to in order to be “data-driven”. They make all decisions based on data they have collected or experiments they have performed.</p> <p>But data mostly tell you about small scale effects and things that happened in the past. 
Making big discoveries/improvements requires (a) having creative ideas that are not data supported and (b) trying them in experiments to see if they work. If you get too caught up in experimenting on the same set of conditions you will inevitably asymptote to a maximum and quickly reach diminishing returns. This is where the data handcuffs come in. Data can only tell you about the conditions that existed in the past; they often can’t predict conditions in the future or which ideas will work out and which won’t.</p> <p>In an interesting parallel to academic research a good strategy appears to be: (a) trying a bunch of things, including some things that have only a pretty modest chance of success, (b) doing experiments early and often when trying those things, and (c) getting very good at recognizing failure quickly and moving on to ideas that will be fruitful. The challenge in part (a) is that it is often difficult to generate really new ideas, especially if you are already doing something that has had any level of success. There will be extreme pressure not to change what you are doing. In part (c) the challenge is that if you discard ideas too quickly you might miss a big opportunity, but if you don’t discard them quickly enough you will sink a lot of time/cost into ultimately not very fruitful projects.</p> <p>Regardless, almost all of the most <a href="http://simplystatistics.org/2013/09/25/is-most-science-false-the-titans-weigh-in/">interesting projects</a> I’ve worked on in my life were not driven by data that suggested they would be successful. They were often risks where the data either wasn’t in, or the data supported not doing them at all. But as a statistician I decided to straight up ignore the data and try anyway. 
Then again, these ideas have also been the sources of <a href="http://simplystatistics.org/2012/01/11/healthnewsrater/">my biggest flameouts</a>.</p> Leek group guide to reading scientific papers 2016-02-09T13:59:53+00:00 http://simplystats.github.io/2016/02/09/leek-group-guide-to-reading-scientific-papers <p>The other day on Twitter Amelia requested a guide for reading papers:</p> <blockquote class="twitter-tweet" data-width="550"> <p lang="en" dir="ltr"> I love <a href="https://twitter.com/jtleek">@jtleek</a>’s github guides to reviewing papers, writing R packages, giving talks, etc. Would love one on reading papers, for students. </p> <p> &mdash; Amelia McNamara (@AmeliaMN) <a href="https://twitter.com/AmeliaMN/status/695633602751635456">February 5, 2016</a> </p> </blockquote> <p>So I came up with a guide which you can find here: <a href="https://github.com/jtleek/readingpapers">Leek group guide to reading papers</a>. I actually found this to be the one that I had the hardest time with. I described how I tend to read a paper but I’m not sure that is really the optimal (or even a very good) way. I’d really appreciate pull requests if you have ideas on how to improve the guide.</p> A menagerie of messed up data analyses and how to avoid them 2016-02-01T13:39:57+00:00 http://simplystats.github.io/2016/02/01/a-menagerie-of-messed-up-data-analyses-and-how-to-avoid-them <p><em>Update: I realize this may seem like I’m picking on people. I really don’t mean to; I have for sure made all of these mistakes and many more. I can give many examples, but the one I always remember is the time Rafa saved me from “I got a big one here” when I made a huge mistake as a first year assistant professor.</em></p> <p>In any introductory statistics or data analysis class they might teach you the basics: how to load a data set, how to munge it, how to do t-tests, maybe how to write a report. 
But there are a whole bunch of ways that a data analysis can be screwed up that often get skipped over. Here is my first crack at creating a “menagerie” of messed up data analyses and how you can avoid them. Depending on interest I could probably list a ton more, but as always I’m doing the non-comprehensive list :).</p> <p><span style="text-decoration: underline;"><strong><img class="alignleft wp-image-4613" src="http://simplystatistics.org/wp-content/uploads/2016/02/direction411.png" alt="direction411" width="125" height="125" srcset="http://simplystatistics.org/wp-content/uploads/2016/02/direction411-200x200.png 200w, http://simplystatistics.org/wp-content/uploads/2016/02/direction411.png 256w" sizes="(max-width: 125px) 100vw, 125px" />Outcome switching</strong></span></p> <p><em>What it is: </em>Outcome switching is where you collect data looking at, say, the relationship between exercise and blood pressure. Once you have the data, you realize that blood pressure isn’t really related to exercise. So you change the outcome and ask if HDL levels are related to exercise and you find a relationship. It turns out that when you do this kind of switch you have now biased your analysis because you would have just stopped if you had found the original relationship.</p> <p style="text-align: left;"> <em>An example: </em><a href="http://www.vox.com/2015/12/29/10654056/ben-goldacre-compare-trials">In this article</a> they discuss how Paxil, an anti-depressant, was originally studied for several main outcomes, none of which showed an effect - but some of the secondary outcomes did. So they switched the outcome of the trial and used this result to market the drug. </p> <p style="text-align: left;"> <em>What you can do: </em>Pre-specify your analysis plan, including which outcomes you want to look at. Then very clearly state when you are analyzing a primary outcome or a secondary analysis. That way people know to take the secondary analyses with a grain of salt. 
You can even get paid $$ to pre-specify with the OSF's <a href="https://cos.io/prereg/">pre-registration challenge</a>. </p> <p><img class="alignleft wp-image-4618" src="http://simplystatistics.org/wp-content/uploads/2016/02/direction398-300x300.png" alt="direction398" width="125" height="125" srcset="http://simplystatistics.org/wp-content/uploads/2016/02/direction398-300x300.png 300w, http://simplystatistics.org/wp-content/uploads/2016/02/direction398-200x200.png 200w, http://simplystatistics.org/wp-content/uploads/2016/02/direction398.png 512w" sizes="(max-width: 125px) 100vw, 125px" /></p> <p><span style="text-decoration: underline;"><strong>Garden of forking paths</strong></span></p> <p><em>What it is: </em>In this case you may or may not have specified your outcome and stuck with it. Let’s assume you have, so you are still looking at blood pressure and exercise. But it turns out a bunch of people had apparently erroneous measures of blood pressure. So you dropped those measurements and did the analysis with the remaining values. This is a totally sensible thing to do, but if you didn’t specify in advance how you would handle bad measurements, you can make a bunch of different choices here (the forking paths). You could drop them, impute them, multiply impute them, weight them, etc. Each of these gives a different result and you can accidentally pick the one that works best even if you are being “sensible”.</p> <p><em>An example</em>: <a href="http://www.stat.columbia.edu/~gelman/research/unpublished/p_hacking.pdf">This article</a> gives several examples of the forking paths. One is where authors report that at peak fertility women are more likely to wear red or pink shirts. 
They made several inclusion/exclusion choices (which women to include in which comparison group) that could easily have gone a different direction or were against stated rules.</p> <p><em>What you can do: </em>Pre-specify every part of your analysis plan, down to which observations you are going to drop, transform, etc. To be honest this is super hard to do because almost every data set is messy in a unique way. So the best thing here is to point out steps in your analysis where you made a choice that wasn’t pre-specified and that you could have made differently. Or, even better, try some of the different choices and make sure your results aren’t dramatically different.</p> <p><strong><img class="alignleft wp-image-4621" src="http://simplystatistics.org/wp-content/uploads/2016/02/emoticon149.png" alt="emoticon149" width="125" height="125" srcset="http://simplystatistics.org/wp-content/uploads/2016/02/emoticon149-200x200.png 200w, http://simplystatistics.org/wp-content/uploads/2016/02/emoticon149.png 256w" sizes="(max-width: 125px) 100vw, 125px" /><span style="text-decoration: underline;">P-hacking</span></strong></p> <p><em>What it is: </em>The nefarious cousin of the garden of forking paths. Basically here the person outcome switches, uses the garden of forking paths, intentionally doesn’t correct for multiple testing, or uses any of these other means to cheat and get a result that they like.</p> <p><em>An example:</em> This one gets talked about a lot and there is <a href="http://journals.plos.org/plosbiology/article?id=10.1371/journal.pbio.1002106">some evidence that it happens</a>. But it is usually pretty hard to ascribe purely evil intentions to people and I’d rather not point the finger here. 
I think that often the garden of forking paths results in just as bad an outcome without people having to try.</p> <p><em>What to do:</em> Know how to do an analysis well and don’t cheat.</p> <p><em>Update: </em> Some [<em>Update: I realize this may seem like I’m picking on people. I really don’t mean to, I have for sure made all of these mistakes and many more. I can give many examples, but the one I always remember is the time Rafa saved me from “I got a big one here” when I made a huge mistake as a first year assistant professor.</em></p> <p>In any introductory statistics or data analysis class they might teach you the basics, how to load a data set, how to munge it, how to do t-tests, maybe how to write a report. But there are a whole bunch of ways that a data analysis can be screwed up that often get skipped over. Here is my first crack at creating a “menagerie” of messed up data analyses and how you can avoid them. Depending on interest I could probably list a ton more, but as always I’m doing the non-comprehensive list :).</p> <p> </p> <p> </p> <p><span style="text-decoration: underline;"><strong>Outco<img class="alignleft wp-image-4613" src="http://simplystatistics.org/wp-content/uploads/2016/02/direction411.png" alt="direction411" width="125" height="125" srcset="http://simplystatistics.org/wp-content/uploads/2016/02/direction411-200x200.png 200w, http://simplystatistics.org/wp-content/uploads/2016/02/direction411.png 256w" sizes="(max-width: 125px) 100vw, 125px" />me switching</strong></span></p> <p>_What it is: _Outcome switching is where you collect data looking at say, the relationship between exercise and blood pressure. Once you have the data, you realize that blood pressure isn’t really related to exercise. So you change the outcome and ask if HDL levels are related to exercise and you find a relationship. 
It turns out that when you do this kind of switch you have now biased your analysis because you would have just stopped if you found the original relationship.</p> <p style="text-align: left;"> <em>An example: </em><a href="http://www.vox.com/2015/12/29/10654056/ben-goldacre-compare-trials">In this article</a> they discuss how Paxil, an anti-depressant, was originally studied for several main outcomes, none of which showed an effect - but some of the secondary outcomes did. So they switched the outcome of the trial and used this result to market the drug. </p> <p style="text-align: left;"> <em>What you can do: </em>Pre-specify your analysis plan, including which outcomes you want to look at. Then very clearly state when you are analyzing a primary outcome or a secondary analysis. That way people know to take the secondary analyses with a grain of salt. You can even get paid $$ to pre-specify with the OSF's <a href="https://cos.io/prereg/">pre-registration challenge</a>. </p> <p><img class="alignleft wp-image-4618" src="http://simplystatistics.org/wp-content/uploads/2016/02/direction398-300x300.png" alt="direction398" width="125" height="125" srcset="http://simplystatistics.org/wp-content/uploads/2016/02/direction398-300x300.png 300w, http://simplystatistics.org/wp-content/uploads/2016/02/direction398-200x200.png 200w, http://simplystatistics.org/wp-content/uploads/2016/02/direction398.png 512w" sizes="(max-width: 125px) 100vw, 125px" /></p> <p><span style="text-decoration: underline;"><strong>Garden of forking paths</strong></span></p> <p>_What it is: _In this case you may or may not have specified your outcome and stuck with it. Let’s assume you have, so you are still looking at blood pressure and exercise. But it turns out a bunch of people had apparently erroneous measures of blood pressure. So you dropped those measurements and did the analysis with the remaining values. 
This is a totally sensible thing to do, but if you didn’t specify in advance how you would handle bad measurements, you can make a bunch of different choices here (the forking paths). You could drop them, impute them, multiply impute them, weight them, etc. Each of these gives a different result and you can accidentally pick the one that works best even if you are being “sensible”</p> <p><em>An example</em>: <a href="http://www.stat.columbia.edu/~gelman/research/unpublished/p_hacking.pdf">This article</a> gives several examples of the forking paths. One is where authors report that at peak fertility women are more likely to wear red or pink shirts. They made several inclusion/exclusion choices (which women to include in which comparison group) for who to include that could easily have gone a different direction or were against stated rules.</p> <p>_What you can do: _Pre-specify every part of your analysis plan, down to which observations you are going to drop, transform, etc. To be honest this is super hard to do because almost every data set is messy in a unique way. So the best thing here is to point out steps in your analysis where you made a choice that wasn’t pre-specified and you could have made differently. Or, even better, try some of the different choices and make sure your results aren’t dramatically different.</p> <p> </p> <p><strong><img class="alignleft wp-image-4621" src="http://simplystatistics.org/wp-content/uploads/2016/02/emoticon149.png" alt="emoticon149" width="125" height="125" srcset="http://simplystatistics.org/wp-content/uploads/2016/02/emoticon149-200x200.png 200w, http://simplystatistics.org/wp-content/uploads/2016/02/emoticon149.png 256w" sizes="(max-width: 125px) 100vw, 125px" /><span style="text-decoration: underline;">P-hacking</span></strong></p> <p>_What it is: _The nefarious cousin of the garden of forking paths. 
Basically here the person switches outcomes, exploits the garden of forking paths, intentionally doesn’t correct for multiple testing, or uses any other means to cheat and get a result that they like.</p> <p><em>An example:</em> This one gets talked about a lot and there is <a href="http://journals.plos.org/plosbiology/article?id=10.1371/journal.pbio.1002106">some evidence that it happens</a>. But it is usually pretty hard to ascribe purely evil intentions to people and I’d rather not point the finger here. I think that often the garden of forking paths results in just as bad an outcome without people having to try.</p> <p><em>What to do:</em> Know how to do an analysis well and don’t cheat.</p> <p><em>Update: </em> Some <a href="http://papers.ssrn.com/sol3/papers.cfm?abstract_id=2649230">define p-hacking as</a> “when honest researchers face ambiguity about what analyses to run, and convince themselves those leading to better results are the correct ones (see e.g., Gelman &amp; Loken, 2014; John, Loewenstein, &amp; Prelec, 2012; Simmons, Nelson, &amp; Simonsohn, 2011; Vazire, 2015).” This coincides with the definition of “garden of forking paths”. I have been asked to point this out <a href="https://twitter.com/talyarkoni/status/694576205089996800">on Twitter.</a> It was never my intention to accuse anyone of accusing people of fraud. 
That being said, I still think that the connotation many people have in mind when they think “p-hacking” corresponds to my definition above, although I agree with folks that isn’t helpful - which is why I prefer we call the non-nefarious version the garden of forking paths.</p> <p> </p> <p><strong><img class="alignleft wp-image-4623" src="http://simplystatistics.org/wp-content/uploads/2016/02/paypal15.png" alt="paypal15" width="125" height="125" srcset="http://simplystatistics.org/wp-content/uploads/2016/02/paypal15-200x200.png 200w, http://simplystatistics.org/wp-content/uploads/2016/02/paypal15.png 256w" sizes="(max-width: 125px) 100vw, 125px" /><span style="text-decoration: underline;">Uncorrected multiple testing </span></strong></p> <p><em>What it is: </em>This one is related to the garden of forking paths and outcome switching. Most statistical methods for measuring the potential for error assume you are only evaluating one hypothesis at a time. But in reality you might be measuring a ton either on purpose (in a big genomics or neuroimaging study) or accidentally (because you consider a bunch of outcomes). In either case, the expected error rate changes a lot if you consider many hypotheses.</p> <p><em>An example: </em> The <a href="http://users.stat.umn.edu/~corbett/classes/5303/Bennett-Salmon-2009.pdf">most famous example</a> is when someone did an fMRI on a dead fish and showed that there were a bunch of significant regions at the P &lt; 0.05 level. The reason is that there is natural variation in the background of these measurements and if you consider each voxel independently, ignoring that you are looking at a bunch of them, a few will have P &lt; 0.05 just by chance.</p> <p><em>What you can do</em>: Correct for multiple testing. 
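</p> <p>As a rough illustration of what a correction does, here is a minimal sketch in plain Python of two of the standard procedures, Bonferroni and Benjamini-Hochberg. The function names and the ten p-values are made up for this example; in a real analysis you would more likely reach for a vetted implementation in your statistics package than roll your own.</p>

```python
def bonferroni(pvalues, alpha=0.05):
    """Reject H0_i when p_i <= alpha / m; controls the family-wise error rate."""
    m = len(pvalues)
    return [p <= alpha / m for p in pvalues]


def benjamini_hochberg(pvalues, alpha=0.05):
    """Benjamini-Hochberg step-up procedure; controls the false discovery rate."""
    m = len(pvalues)
    # Indices of the p-values, sorted from smallest to largest p-value.
    order = sorted(range(m), key=lambda i: pvalues[i])
    # Find the largest rank k (1-based) with p_(k) <= (k / m) * alpha.
    largest_k = 0
    for rank, idx in enumerate(order, start=1):
        if pvalues[idx] <= rank / m * alpha:
            largest_k = rank
    # Reject the hypotheses with the largest_k smallest p-values.
    reject = [False] * m
    for idx in order[:largest_k]:
        reject[idx] = True
    return reject


# Hypothetical p-values from ten tests, for illustration only.
pvals = [0.001, 0.008, 0.039, 0.041, 0.042, 0.060, 0.074, 0.205, 0.212, 0.216]

uncorrected = sum(p < 0.05 for p in pvals)
n_bonferroni = sum(bonferroni(pvals))
n_bh = sum(benjamini_hochberg(pvals))
print(uncorrected, n_bonferroni, n_bh)  # prints: 5 1 2
```

<p>With these made-up values, five tests fall below 0.05 when nothing is corrected, one survives Bonferroni, and two survive Benjamini-Hochberg.</p> <p>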
When you calculate a large number of p-values make sure you <a href="http://varianceexplained.org/statistics/interpreting-pvalue-histogram/">know what their distribution</a> is expected to be and use a method like Bonferroni, Benjamini-Hochberg, or q-value to correct for multiple testing.</p> <p> </p> <p><strong><img class="alignleft wp-image-4625" src="http://simplystatistics.org/wp-content/uploads/2016/02/animal162.png" alt="animal162" width="125" height="125" srcset="http://simplystatistics.org/wp-content/uploads/2016/02/animal162-200x200.png 200w, http://simplystatistics.org/wp-content/uploads/2016/02/animal162.png 256w" sizes="(max-width: 125px) 100vw, 125px" /><span style="text-decoration: underline;">I got a big one here</span></strong></p> <p><em>What it is:</em> One of the most painful experiences for all new data analysts. You collect data and discover a huge effect. You are super excited so you write it up and submit it to one of the best journals or convince your boss to bet the farm. The problem is that huge effects are incredibly rare and are usually due to some combination of experimental artifacts and biases or mistakes in the analysis. Almost no effects you detect with statistics are huge. Even the relationship between smoking and cancer is relatively weak in observational studies and requires very careful calibration and analysis.</p> <p><em>An example:</em> <a href="http://www.ncbi.nlm.nih.gov/pubmed/17206142">In a paper</a> authors claimed that 78% of genes were differentially expressed between Asians and Europeans. But it turns out that most of the Asian samples were measured in one batch and the Europeans in another, and <a href="http://www.ncbi.nlm.nih.gov/pubmed/17597765">this explained</a> a large fraction of these differences.</p> <p><em>What you can do</em>: Be deeply suspicious of big effects in data analysis. 
If you find something huge and counterintuitive, especially in a well-established research area, spend <em>a lot</em> of time trying to figure out why it could be a mistake. If you don’t, others definitely will, and you might be embarrassed.</p> <p><span style="text-decoration: underline;"><strong><img class="alignleft wp-image-4632" src="http://simplystatistics.org/wp-content/uploads/2016/02/man298.png" alt="man298" width="125" height="125" srcset="http://simplystatistics.org/wp-content/uploads/2016/02/man298-200x200.png 200w, http://simplystatistics.org/wp-content/uploads/2016/02/man298.png 256w" sizes="(max-width: 125px) 100vw, 125px" />Double complication</strong></span></p> <p><em>What it is</em>: When faced with a large and complicated data set, beginning analysts often feel compelled to use a big complicated method. Imagine you have collected data on thousands of genes or hundreds of thousands of voxels and you want to use this data to predict some health outcome. There is a severe temptation to use deep learning or blend random forests, boosting, and five other methods to perform the prediction. The problem is that complicated methods fail for complicated reasons, which will be extra hard to diagnose if you have a really big, complicated data set.</p> <p><em>An example:</em> There are a large number of examples where people use very small training sets and complicated methods. One example (there were many other problems with this analysis, too) is when people <a href="http://www.nature.com/nm/journal/v12/n11/full/nm1491.html">tried to use complicated prediction algorithms</a> to predict which chemotherapy would work best using genomics. Ultimately this paper was retracted for many problems, but the complication of the methods plus the complication of the data made those problems hard to detect.</p> <p><em>What you can do:</em> When faced with a big, messy data set, try simple things first. 
Use linear regression, make simple scatterplots, check to see if there are obvious flaws with the data. If you must use a really complicated method, ask yourself if there is a reason it is outperforming the simple methods because often with large data sets <a href="http://arxiv.org/pdf/math/0606441.pdf">even simple things work</a>.</p> <p> </p> <p> </p> <p> </p> <p> </p> <p> </p> <p><span style="text-decoration: underline;"><strong>Image credits:</strong></span></p> <ul> <li>Outcome switching. Icon made by <a href="http://hananonblog.wordpress.com" title="Hanan">Hanan</a> from <a href="http://www.flaticon.com" title="Flaticon">www.flaticon.com</a> is licensed under <a href="http://creativecommons.org/licenses/by/3.0/" title="Creative Commons BY 3.0">CC BY 3.0</a></li> <li>Forking paths. Icon made by <a href="http://iconalone.com" title="Popcic">Popcic</a> from <a href="http://www.flaticon.com" title="Flaticon">www.flaticon.com</a> is licensed under <a href="http://creativecommons.org/licenses/by/3.0/" title="Creative Commons BY 3.0">CC BY 3.0</a></li> <li>P-hacking. Icon made by <a href="http://www.icomoon.io" title="Icomoon">Icomoon</a> from <a href="http://www.flaticon.com" title="Flaticon">www.flaticon.com</a> is licensed under <a href="http://creativecommons.org/licenses/by/3.0/" title="Creative Commons BY 3.0">CC BY 3.0</a></li> <li>Uncorrected multiple testing. Icon made by <a href="http://www.freepik.com" title="Freepik">Freepik</a> from <a href="http://www.flaticon.com" title="Flaticon">www.flaticon.com</a> is licensed under <a href="http://creativecommons.org/licenses/by/3.0/" title="Creative Commons BY 3.0">CC BY 3.0</a></li> <li>Big one here. Icon made by <a href="http://www.freepik.com" title="Freepik">Freepik</a> from <a href="http://www.flaticon.com" title="Flaticon">www.flaticon.com</a> is licensed under <a href="http://creativecommons.org/licenses/by/3.0/" title="Creative Commons BY 3.0">CC BY 3.0</a></li> <li>Double complication. 
Icon made by <a href="http://www.freepik.com" title="Freepik">Freepik</a> from <a href="http://www.flaticon.com" title="Flaticon">www.flaticon.com</a> is licensed under <a href="http://creativecommons.org/licenses/by/3.0/" title="Creative Commons BY 3.0">CC BY 3.0</a></li> </ul> Exactly how risky is breathing? 2016-01-26T09:58:23+00:00 http://simplystats.github.io/2016/01/26/exactly-how-risky-is-breathing <p>This <a href="http://nyti.ms/23nysp5">article by George Johnson</a> in the NYT describes a study by Kamen P. Simonov and Daniel S. Himmelstein that examines the hypothesis that people living at higher altitudes experience lower rates of lung cancer than people living at lower altitudes.</p> <blockquote> <p>All of the usual caveats apply. Studies like this, which compare whole populations, can be used only to suggest possibilities to be explored in future research. But the hypothesis is not as crazy as it may sound. Oxygen is what energizes the cells of our bodies. Like any fuel, it inevitably spews out waste — a corrosive exhaust of substances called “free radicals,” or “reactive oxygen species,” that can mutate DNA and nudge a cell closer to malignancy.</p> </blockquote> <p>I’m not so much focused on the science itself, which is perhaps intriguing, but rather on the way the article was written. First, George Johnson links to the <a href="https://peerj.com/articles/705/">paper</a> itself, <a href="http://simplystatistics.org/2015/01/15/how-to-find-the-science-paper-behind-a-headline-when-the-link-is-missing/">already a major victory</a>. Also, I thought he did a very nice job of laying out the complexity of doing a population-level study like this one–all the potential confounders, selection bias, negative controls, etc.</p> <p>I remember particulate matter air pollution epidemiology used to have this feel. 
You’d try to do all these different things to make the effect go away, but for some reason, under every plausible scenario, in almost every setting, there was always some association between air pollution and health outcomes. Eventually you start to believe it….</p> On research parasites and internet mobs - let's try to solve the real problem. 2016-01-25T14:34:08+00:00 http://simplystats.github.io/2016/01/25/on-research-parasites-and-internet-mobs-lets-try-to-solve-the-real-problem <p>A couple of days ago one of the editors of the New England Journal of Medicine <a href="http://www.nejm.org/doi/full/10.1056/NEJMe1516564">posted an editorial</a> showing some moderate level of support for data sharing but also introducing the term “research parasite”:</p> <blockquote> <p>A second concern held by some is that a new class of research person will emerge — people who had nothing to do with the design and execution of the study but use another group’s data for their own ends, possibly stealing from the research productivity planned by the data gatherers, or even use the data to try to disprove what the original investigators had posited. There is concern among some front-line researchers that the system will be taken over by what some researchers have characterized as “research parasites.”</p> </blockquote> <p>While this is obviously the most inflammatory statement in the article, I think that there are several more important and overlooked misconceptions. The biggest problems are:</p> <ol> <li><strong>“</strong><strong>The first concern is that someone not involved in the generation and collection of the data may not understand the choices made in defining the parameters.</strong><strong>“ </strong>This almost certainly would be the fault of the investigators who published the data. 
If the authors adhere to good <a href="https://github.com/jtleek/datasharing">data sharing</a> policies and respond to queries from people using their data promptly then this should not be a problem at all.</li> <li><strong>“… but use another group’s data for their own ends, possibly stealing from the research productivity planned by the data gatherers, or even use the data to try to disprove what the original investigators had posited.” </strong>The idea that no one should be able to try to disprove ideas with the authors’ data has been covered in other blogs/on Twitter. One thing I do think is worth considering here is the concern about credit. 
I think that the traditional way credit has accrued to authors has been citations. But if you get a major study funded, say for 50 million dollars, run that study carefully, sit on a million conference calls, and end up with a single major paper, that could be frustrating. Which is why I think that a better policy would be to have the people who run massive studies get credit in a way that <em>is not papers</em>. They should get some kind of formal administrative credit. But then the data should be immediately and publicly available to anyone to publish on. That allows people who run massive studies to get credit and science to proceed normally.</li> <li><strong>“</strong><strong>The new investigators arrived on the scene with their own ideas and worked symbiotically, rather than parasitically, with the investigators holding the data, moving the field forward in a way that neither group could have done on its own.” </strong> The story that follows about a group of researchers who collaborated with the NSABP to validate their gene expression signature is very encouraging. But it isn’t the only way science should work. Researchers shouldn’t be constrained to one model or another. Sometimes collaboration is necessary, sometimes it isn’t, but in neither case should we label the researchers “symbiotic” or “parasitic”, terms that have extreme connotations.</li> <li><strong>“How would data sharing work best? We think it should happen symbiotically, not parasitically.”</strong> I think that it should happen <em>automatically</em>. If you generate a data set with public funds, you should be required to immediately make it available to researchers in the community. But you should <em>get credit for generating the data set and the hypothesis that led to the data set</em>. The problem is that people who generate data will almost never be as fast at analyzing it as people who know how to analyze data. 
But both deserve credit, whether they are working together or not.</li> <li><strong>“Start with a novel idea, one that is not an obvious extension of the reported work. Second, identify potential collaborators whose collected data may be useful in assessing the hypothesis and propose a collaboration. Third, work together to test the new hypothesis. Fourth, report the new findings with relevant coauthorship to acknowledge both the group that proposed the new idea and the investigative group that accrued the data that allowed it to be tested.”</strong> The trouble with this framework is that it preferentially accrues credit to data generators and doesn’t accurately describe the role of either party. To flip this argument around,  you could just as easily say that anyone who uses <a href="http://salzberg-lab.org/">Steven Salzberg</a>’s software for aligning or assembling short reads should make him a co-author. I think Dr. Drazen would agree that not everyone who aligned reads should add Steven as co-author, despite his contribution being critical for the completion of their work.</li> </ol> <p>After the piece was posted there was predictable internet rage from <a href="https://twitter.com/dataparasite">data parasites</a>, a <a href="https://twitter.com/hashtag/researchparasite?src=hash">dedicated hashtag</a>, and half a dozen angry blog posts written about the piece. These inspired a <a href="http://www.nejm.org/doi/full/10.1056/NEJMe1601087">follow up piece</a> from Drazen. I recognize why these folks were upset - the “research parasites” thing was unnecessarily inflammatory. 
But <a href="http://simplystatistics.org/2014/03/05/plos-one-i-have-an-idea-for-what-to-do-with-all-your-profits-buy-hard-drives/">I also sympathize with data creators</a> who are also subject to a tough environment - particularly when they are junior scientists.</p> <p>I think the response to the internet outrage also misses the mark and comes off as a defense of people with angry perspectives on data sharing. I would have much rather seen a more pro-active approach from a leading journal of medicine. I’d like to see something that acknowledges different contributions appropriately and doesn’t slow down science. Something like:</p> <ol> <li>We will require all data, including data from clinical trials, to be made public immediately on publication as long as it poses minimal risk to the patients involved or the patients have been consented to broad sharing.</li> <li>When data are not made publicly available they are still required to be deposited with a third party such as the NIH or Figshare to be held available for request from qualified/approved researchers.</li> <li>We will require that all people who use data give appropriate credit to the original data generators in terms of data citations.</li> <li>We will require that all people who use software/statistical analysis tools give credit to the original tool developers in terms of software citations.</li> <li>We will include a new designation for leaders of major data collection or software generation projects that can be included to demonstrate credit for major projects undertaken and completed.</li> <li>When reviewing papers written by experimentalists with no statistical/computational co-authors we will require no fewer than 2 statistical/computational referees to ensure there has not been a mistake made by inexperienced researchers.</li> <li>When reviewing papers written by statistical/computational authors with no experimental co-authors we will require no fewer than 2 experimental referees to ensure there 
has not been a mistake made by inexperienced researchers.</li> </ol> <p> </p> Not So Standard Deviations Episode 8 - Snow Day 2016-01-24T21:41:44+00:00 http://simplystats.github.io/2016/01/24/not-so-standard-deviations-episode-8-snow-day <p>Hilary and I were snowed in over the weekend, so we recorded Episode 8 of Not So Standard Deviations. In this episode, Hilary and I talk about how to get your foot in the door with data science, the New England Journal’s view on data sharing, Google’s “Cohort Analysis”, and trying to predict a movie’s box office returns based on the movie’s script.</p> <p><a href="https://itunes.apple.com/us/podcast/not-so-standard-deviations/id1040614570">Subscribe to the podcast on iTunes</a>.</p> <p>Follow <a href="https://twitter.com/nssdeviations">@NSSDeviations</a> on Twitter!</p> <p>Show notes:</p> <ul> <li><a href="http://goo.gl/eUU2AK">Remembrances of Peter Hall</a></li> <li><a href="http://goo.gl/HbMu87">Research Parasites</a> (NEJM editorial by Dan Longo and Jeffrey Drazen)</li> <li>Amazon <a href="http://goo.gl/83DvvO">review/data analysis</a> of Fifty Shades of Grey</li> <li><a href="https://youtu.be/55psWVYSbrI">Time-lapse cats</a></li> <li><a href="https://getpocket.com">Pocket</a></li> </ul> <p>Apologies for my audio on this episode. I had a bit of a problem calibrating my microphone. I promise to figure it out for the next episode!</p> <p><a href="https://api.soundcloud.com/tracks/243634673/download?client_id=02gUJC0hH2ct1EGOcYXQIzRFU91c72Ea&amp;oauth_token=1-138878-174789515-deb24181d01af">Download the audio for this episode</a>.</p> <p> </p> Parallel BLAS in R 2016-01-21T11:53:07+00:00 http://simplystats.github.io/2016/01/21/parallel-blas-in-r <p>I’m working on a new chapter for my R Programming book and the topic is parallel computation. 
So, I was happy to see this tweet from David Robinson (@drob) yesterday:</p> <blockquote class="twitter-tweet" lang="en"> <p dir="ltr" lang="en"> How fast is this <a href="https://twitter.com/hashtag/rstats?src=hash">#rstats</a> code? x &lt;- replicate(5e3, rnorm(5e3)) x %*% t(x) For me, w/Microsoft R Open, 2.5sec. Wow. <a href="https://t.co/0SbijNxxVa">https://t.co/0SbijNxxVa</a> </p> <p> — David Robinson (@drob) <a href="https://twitter.com/drob/status/689916280233562112">January 20, 2016</a> </p> </blockquote> <p>What does this have to do with parallel computation? Briefly, the code generates 5,000 standard normal random variates, repeats this 5,000 times and stores them in a 5,000 x 5,000 matrix (x). Then it computes x x’, the product of x with its transpose. The second part is key, because it involves a matrix multiplication.</p> <p>Matrix multiplication in R is handled, at a very low level, by the library that implements the Basic Linear Algebra Subprograms, or BLAS. The stock R that you download from CRAN comes with what’s known as a reference implementation of BLAS. It works, it produces what everyone agrees are the right answers, but it is in no way optimized. Here’s what I get when I run this code on my Mac using RStudio and the CRAN version of R for Mac OS X:</p> <pre>system.time({ x &lt;- replicate(5e3, rnorm(5e3)); tcrossprod(x) })
   user  system elapsed
 59.622   0.314  59.927
</pre> <p>Note that the “user” time and the “elapsed” time are roughly the same. Note also that I use the tcrossprod() function instead of the otherwise equivalent expression x %*% t(x). Both crossprod() and tcrossprod() are generally faster than using the %*% operator.</p> <p>Now, when I run the same code on my built-from-source version of R (version 3.2.3), here’s what I get:</p> <pre>system.time({ x &lt;- replicate(5e3, rnorm(5e3)); tcrossprod(x) })
   user  system elapsed
 14.378   0.276   3.344
</pre> <p>Overall, it’s faster when I don’t run the code through RStudio (14s vs. 59s). 
Also on this version the elapsed time is about 1/4 the user time. Why is that?</p> <p>The built-from-source version of R is linked to Apple’s Accelerate framework, which is a large library that includes an optimized BLAS library for Intel chips. This optimized BLAS, in addition to being optimized with respect to the code itself, is designed to be multi-threaded so that it can split work off into chunks and run them in parallel on multi-core machines. Here, the tcrossprod() function was run in parallel on my machine, and so the elapsed time was about a quarter of the time that was “charged” to the CPU(s), because user time adds up the time spent on every core.</p> <p>David’s tweet indicated that when using Microsoft R Open, which is a custom-built binary of R, the (I assume?) elapsed time is 2.5 seconds. Looking at the attached link, it appears that Microsoft R Open is linked against <a href="https://software.intel.com/en-us/intel-mkl">Intel’s Math Kernel Library</a> (MKL) which contains, among other things, an optimized BLAS for Intel chips. I don’t know what kind of computer David was running on, but assuming it was about as high-powered as mine, it would suggest Intel’s MKL sees slightly better performance. But either way, both Accelerate and MKL achieve that speed-up through custom-coding of the BLAS routines and multi-threading on multi-core systems.</p> <p>If you’re going to be doing any linear algebra in R (and you will), it’s important to link to an optimized BLAS. Otherwise, you’re just wasting time unnecessarily. Besides Accelerate (Mac) and Intel MKL, there’s AMD’s <a href="http://developer.amd.com/tools-and-sdks/archive/amd-core-math-library-acml/">ACML</a> library for AMD chips and the <a href="http://math-atlas.sourceforge.net">ATLAS</a> library, which is a general-purpose tunable library. 
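</p> <p>If you want to check which BLAS your own copy of R is using, something like the following rough sketch may help. It assumes R 3.4.0 or later, where La_library() and the "BLAS" entry of extSoftVersion() are available, and it uses a 2,000 x 2,000 matrix, a size picked here only so the check runs quickly (the timings above use 5,000). The variable names are just for illustration.</p> <pre># Which BLAS/LAPACK is this R session linked against? (assumes R &gt;= 3.4.0)
print(extSoftVersion()["BLAS"])
print(La_library())

# Time x %*% t(x) against tcrossprod(x) on a smaller matrix than in the post.
n &lt;- 2000  # illustrative size (an assumption); the post uses 5e3
x &lt;- replicate(n, rnorm(n))
t_mult  &lt;- system.time(a &lt;- x %*% t(x))
t_cross &lt;- system.time(b &lt;- tcrossprod(x))
print(t_mult)
print(t_cross)

# Both expressions should give the same matrix, up to floating-point error.
stopifnot(isTRUE(all.equal(a, b)))

# A ratio well above 1 suggests the BLAS is spreading work over several cores.
print(t_cross[["user.self"]] / t_cross[["elapsed"]])
</pre> <p>With the reference BLAS the last ratio should sit near 1; with a multi-threaded BLAS it will typically be larger, much like the gap between user and elapsed time above. 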
Also <a href="https://www.tacc.utexas.edu/research-development/tacc-software/gotoblas2">Goto’s BLAS</a> is optimized but is not under active development.</p> Profile of Hilary Parker 2016-01-14T21:15:46+00:00 http://simplystats.github.io/2016/01/14/profile-of-hilary-parker <p>If you’ve ever wanted to know more about my <a href="https://soundcloud.com/nssd-podcast">Not So Standard Deviations</a> co-host (and Johns Hopkins graduate) Hilary Parker, you can go check out the <a href="http://thisisstatistics.org/hilary-parker-gets-crafty-with-statistics-in-her-not-so-standard-job/">great profile of her</a> on the American Statistical Association’s This Is Statistics web site.</p> <blockquote> <p><strong>What advice would you give to high school students thinking about majoring in statistics?</strong></p> <p>It’s such a great field! Not only is the industry booming, but more importantly, the disciplines of statistics teaches you to think analytically, which I find helpful for just about every problem I run into. It’s also a great field to be interested in as a generalist– rather than dedicating yourself to studying one subject, you are deeply learning a set of tools that you can apply to any subject that you find interesting. Just one glance at the topics covered on The Upshot or 538 can give you a sense of that. There’s politics, sports, health, history… the list goes on! 
It’s a field with endless possibility for growth and exploration, and as I mentioned above, the more I explore the more excited I get about it.</p> </blockquote> Not So Standard Deviations Episode 7 - Statistical Royalty 2016-01-12T08:45:24+00:00 http://simplystats.github.io/2016/01/12/not-so-standard-deviations-episode-7-statistical-royalty <p>The latest episode of Not So Standard Deviations is out, and boy does Hilary have a story to tell.</p> <p>We also talk about Theranos and the pitfalls of diagnostic testing, Spotify’s Discover Weekly playlist generation algorithm (and the need for human product managers), and of course, a little Star Wars. Also, Hilary and I start a new segment where we each give some “free advertising” to something interesting that they think other people should know about.</p> <p>Show Notes:</p> <ul> <li><a href="http://goo.gl/JDk6ni">Gosset Icterometer</a></li> <li>The <a href="http://skybrudeconsulting.com/blog/2015/10/16/theranos-healthcare.html">dangers</a> of <a href="https://www.fredhutch.org/en/news/center-news/2013/11/scientists-urge-caution-personal-genetic-screenings.html">entertainment</a> <a href="http://mobihealthnews.com/35444/the-rise-of-the-seemingly-serious-but-just-for-entertainment-purposes-medical-app/">medicine</a></li> <li>Spotify’s Discover Weekly <a href="http://goo.gl/enzFeR">solves human curation</a>?</li> <li>David Robinson’s <a href="http://varianceexplained.org">Variance Explained</a></li> <li><a href="http://what3words.com">What3Words</a></li> </ul> <p><a href="https://api.soundcloud.com/tracks/241071463/download?client_id=02gUJC0hH2ct1EGOcYXQIzRFU91c72Ea&amp;oauth_token=1-138878-174789515-deb24181d01af">Download the audio for this episode</a>.</p> Jeff, Roger and Brian Caffo are doing a Reddit AMA at 3pm EST Today 2016-01-11T09:29:28+00:00 http://simplystats.github.io/2016/01/11/jeff-roger-and-brian-caffo-are-doing-a-reddit-ama-at-3pm-est-today <p>Jeff Leek, Brian Caffo, and I are doing a <a 
href="https://www.reddit.com/r/IAmA">Reddit AMA</a> TODAY at 3pm EST. We’re happy to answer questions about…anything…including our roles as Co-Directors of the <a href="https://www.coursera.org/specializations/jhu-data-science">Johns Hopkins Data Science Specialization</a> as well as the <a href="https://www.coursera.org/specializations/executive-data-science">Executive Data Science Specialization</a>.</p> <p>This is one of the few pictures of the three of us together.</p> <p><a href="http://simplystatistics.org/wp-content/uploads/2016/01/IMG_0189.jpg"><img class="alignright size-large wp-image-4586" src="http://simplystatistics.org/wp-content/uploads/2016/01/IMG_0189-1024x768.jpg" alt="IMG_0189" width="990" height="743" srcset="http://simplystatistics.org/wp-content/uploads/2016/01/IMG_0189-120x90.jpg 120w, http://simplystatistics.org/wp-content/uploads/2016/01/IMG_0189-300x225.jpg 300w, http://simplystatistics.org/wp-content/uploads/2016/01/IMG_0189-1024x768.jpg 1024w, http://simplystatistics.org/wp-content/uploads/2016/01/IMG_0189-260x195.jpg 260w" sizes="(max-width: 990px) 100vw, 990px" /></a></p> A non-comprehensive list of awesome things other people did in 2015 2015-12-21T11:22:07+00:00 http://simplystats.github.io/2015/12/21/a-non-comprehensive-list-of-awesome-things-other-people-did-in-2015 <p><em>Editor’s Note: This is the third year I’m making a list of awesome things other people did this year. Just like the lists for <a href="http://simplystatistics.org/2013/12/20/a-non-comprehensive-list-of-awesome-things-other-people-did-this-year/">2013</a> and <a href="http://simplystatistics.org/2014/12/17/a-non-comprehensive-list-of-awesome-things-other-people-did-in-2014/">2014</a> I am doing this off the top of my head.   I have avoided talking about stuff I worked on or that people here at Hopkins are doing because this post is supposed to be about other people’s awesome stuff. 
I wrote this post because a blog often feels like a place to complain, but we started Simply Stats as a place to be pumped up about the stuff people were doing with data. This year’s list is particularly “off the cuff” so I’d appreciate additions if you have ’em. I have surely missed awesome things people have done.</em></p> <ol> <li>I hear the <a href="http://sml.princeton.edu/tukey">Tukey conference</a> put on by my former advisor John S. was amazing. Out of it came this really good piece by David Donoho on <a href="https://dl.dropboxusercontent.com/u/23421017/50YearsDataScience.pdf">50 years of Data Science</a>.</li> <li>Sherri Rose wrote really accurate and readable guides on <a href="http://drsherrirose.com/academic-cvs-for-statistical-science-faculty-positions">academic CVs</a>, <a href="http://drsherrirose.com/academic-cover-letters-for-statistical-science-faculty-positions">academic cover letters</a>, and <a href="http://drsherrirose.com/how-to-be-an-effective-phd-researcher">how to be an effective PhD researcher</a>.</li> <li>I am not 100% sold on the deep learning hype, but Michael Nielsen wrote this awesome book on <a href="http://neuralnetworksanddeeplearning.com/">deep learning and neural networks</a>. I like how approachable it is and how un-hypey it is. 
I also thought Andrej Karpathy’s <a href="http://karpathy.github.io/2015/10/25/selfie/">blog post</a> on whether you have a good selfie or not was fun.</li> <li>Thomas Lumley continues to be must read regardless of which blog he writes for with a ton of snarky fun posts debunking the latest ridiculous health headlines on <a href="http://www.statschat.org.nz/2015/11/27/to-find-the-minds-construction-near-the-face/">statschat</a> and more in depth posts like this one on pre-filtering multiple tests on <a href="http://notstatschat.tumblr.com/post/131478660126/prefiltering-very-large-numbers-of-tests">notstatschat</a>.</li> <li>David Robinson is making a strong case for top data science blogger with his series of <a href="http://varianceexplained.org/r/bayesian_fdr_baseball/">awesome</a> <a href="http://varianceexplained.org/r/credible_intervals_baseball/">posts</a> on <a href="http://varianceexplained.org/r/empirical_bayes_baseball/">empirical Bayes</a>.</li> <li>Hadley Wickham doing Hadley Wickham things again. <a href="https://github.com/hadley/readr">readr</a> is the biggie for me this year.</li> <li>I’ve been really enjoying the solid coverage of science/statistics from the (not entirely statistics focused as the name would suggest) <a href="https://twitter.com/statnews">STAT</a>.</li> <li>Ben Goldacre and co. launched <a href="http://opentrials.net/">OpenTrials</a> for aggregating all the clinical trial data in the world in an open repository.</li> <li>Christie Aschwanden’s piece on why <a href="http://fivethirtyeight.com/features/science-isnt-broken/">Science Isn’t Broken </a> is a must read and one of the least polemic treatments of the reproducibility/replicability issue I’ve read. 
The p-hacking graphic is just icing on the cake.</li> <li>I’m excited about the new <a href="http://blog.revolutionanalytics.com/2015/06/r-consortium.html">R Consortium</a> and the idea of having more organizations that support folks in the R community.</li> <li>Emma Pierson’s blog and writeups in various national level news outlets continue to impress. I thought <a href="https://www.washingtonpost.com/news/grade-point/wp/2015/10/15/a-better-way-to-gauge-how-common-sexual-assault-is-on-college-campuses/">this one</a> on changing the incentives for sexual assault surveys was particularly interesting/good.</li> <li>Amanda Cox and co. created this <a href="http://www.nytimes.com/interactive/2015/05/28/upshot/you-draw-it-how-family-income-affects-childrens-college-chances.html">interactive graphic</a>, which is an amazing way to teach people about pre-conceived biases in the way we think about relationships and correlations. I love the crowd-sourcing view on data analysis this suggests.</li> <li>As usual Philip Guo was producing gold over on his blog. I appreciate this piece on <a href="http://www.pgbovine.net/tips-for-data-driven-research.htm">twelve tips for data driven research</a>.</li> <li>I am really excited about the new field of adaptive data analysis. 
The basic idea is to understand how we can let people be “real data analysts” and still get reasonable estimates at the end of the day. <a href="http://www.sciencemag.org/content/349/6248/636.abstract">This paper</a> from Cynthia Dwork and co. was one of the initial salvos that came out this year.</li> <li>Datacamp <a href="https://www.datacamp.com/courses/intro-to-python-for-data-science?utm_source=growth&amp;utm_campaign=python&amp;utm_medium=button">incorporated Python</a> into their platform. The idea of interactive education for R/Python/Data Science is a very cool one and has tons of potential.</li> <li>I was really into the idea of <a href="http://projecteuclid.org/euclid.aoas/1430226098">Cross-Study validation</a> that got proposed this year. With the growth of public data in a lot of areas we can really start to get a feel for generalizability.</li> <li>The Open Science Collaboration did this <a href="http://www.sciencemag.org/content/349/6251/aac4716">incredible replication of 100 different studies</a> in psychology with attention to detail and care that deserves a ton of attention.</li> <li>Florian’s piece “<a href="http://www.ncbi.nlm.nih.gov/pubmed/26402330">You are not working for me; I am working with you.</a>” should be required reading for all students/postdocs/mentors in academia. This is something I still hadn’t fully figured out until I read Florian’s piece.</li> <li>I think Karl Broman’s post on why <a href="https://kbroman.wordpress.com/2015/09/09/reproducibility-is-hard/">reproducibility is hard</a> is a great introduction to the real issues in making data analyses reproducible.</li> <li>This was the year of the f1000 post-publication review paper. I thought <a href="http://f1000research.com/articles/4-121/v1">this one</a> from Yoav and the ensuing fallout was fascinating.</li> <li>I love pretty much everything out of Di Cook/Heike Hofmann’s groups. 
This year I liked the paper on <a href="http://download.springer.com/static/pdf/611/art%253A10.1007%252Fs00180-014-0534-x.pdf?originUrl=http%3A%2F%2Flink.springer.com%2Farticle%2F10.1007%2Fs00180-014-0534-x&amp;token2=exp=1450714996~acl=%2Fstatic%2Fpdf%2F611%2Fart%25253A10.1007%25252Fs00180-014-0534-x.pdf%3ForiginUrl%3Dhttp%253A%252F%252Flink.springer.com%252Farticle%252F10.1007%252Fs00180-014-0534-x*~hmac=3c5f5c7c1b2381685437659d8ffd64e1cb2c52d1dfd10506cad5d2af1925c0ac">visual statistical inference in high-dimensional low sample size settings</a>.</li> <li>This is pretty recent, but Nathan Yau’s <a href="https://flowingdata.com/2015/12/15/a-day-in-the-life-of-americans/">day in the life graphic is mesmerizing</a>.</li> </ol> <p>This was a year where open source data people <a href="http://treycausey.com/emotional_rollercoaster_public_work.html">described</a> their <a href="https://twitter.com/johnmyleswhite/status/666429299327569921">pain</a> from people being demanding/mean to them for their contributions. As the year closes I just want to give a big thank you to everyone who did awesome stuff I used this year and have completely ungraciously failed to acknowledge.</p> <p> </p> Not So Standard Deviations: Episode 6 - Google is the New Fisher 2015-12-18T13:08:10+00:00 http://simplystats.github.io/2015/12/18/not-so-standard-deviations-episode-6-google-is-the-new-fisher <p>Episode 6 of Not So Standard Deviations is now posted. 
In this episode Hilary and I talk about the analytics of our own podcast, and analyses that seem easy but are actually hard.</p> <p>If you haven’t already, you can subscribe to the podcast through <a href="https://itunes.apple.com/us/podcast/not-so-standard-deviations/id1040614570">iTunes</a>.</p> <p>This will be our last episode for 2015 so see you in 2016!</p> <p>Notes</p> <ul> <li><a href="https://goo.gl/X0TFt9">Roger’s books on Leanpub</a></li> <li><a href="https://goo.gl/VO0ckP">KPIs</a></li> <li><a href="http://replyall.soy">Reply All</a>, a great podcast</li> <li><a href="http://user2016.org">Use R! 2016 conference</a> where Don Knuth is an invited speaker!</li> <li><a href="http://goo.gl/wUcTBT">Liz Stuart’s directory of propensity score software</a></li> <li><a href="https://goo.gl/CibhJ0">A/B testing</a></li> <li><a href="https://goo.gl/qMyksb">iid</a></li> <li><a href="https://goo.gl/qHVzWQ">R 3.2.3 release notes</a></li> <li><a href="http://www.pqr-project.org/">pqR</a></li> <li><a href="https://goo.gl/pFOVkx">John Myles White’s tweet</a></li> </ul> <p><a href="https://api.soundcloud.com/tracks/237909534/download?client_id=02gUJC0hH2ct1EGOcYXQIzRFU91c72Ea&amp;oauth_token=1-138878-174789515-deb24181d01af">Download the audio file for this episode</a>.</p> Instead of research on reproducibility, just do reproducible research 2015-12-11T12:18:33+00:00 http://simplystats.github.io/2015/12/11/instead-of-research-on-reproducibility-just-do-reproducible-research <p>Right now reproducibility, replicability, false positive rates, biases in methods, and other problems with science are the hot topic. As I mentioned in a previous post pointing out a flaw with a scientific study is way easier to do correctly than generating a new scientific study. Some folks have noticed that right now there is a huge market for papers pointing out how science is flawed. 
The combination of the relative ease of pointing out flaws and the huge payout for writing these papers is helping to generate the hype around the “reproducibility crisis”.</p> <p>I <a href="http://www.slideshare.net/jtleek/evidence-based-data-analysis-45800617">gave a talk</a> a little while ago at an NAS workshop where I stated that all the tools for reproducible research exist (the caveat being really large analyses - although that is changing as well). To make a paper completely reproducible, open, and available for post publication review you can use the following approach with no new tools/frameworks needed.</p> <ol> <li>Use <a href="https://github.com/">Github </a>for version control.</li> <li>Use <a href="http://rmarkdown.rstudio.com/">rmarkdown</a> or <a href="http://ipython.org/notebook.html">iPython notebooks</a> for your analysis code</li> <li>When your paper is done post it to <a href="http://arxiv.org/">arxiv</a> or <a href="http://biorxiv.org/">biorxiv</a>.</li> <li>Post your data to an appropriate repository like <a href="http://www.ncbi.nlm.nih.gov/sra">SRA</a> or a general purpose site like <a href="https://figshare.com/">figshare.</a></li> <li>Send any software you develop to a controlled repository like <a href="https://cran.r-project.org/">CRAN</a> or <a href="http://bioconductor.org/">Bioconductor</a>.</li> <li>Participate in the <a href="http://simplystatistics.org/2015/11/16/so-you-are-getting-crushed-on-the-internet-the-new-normal-for-academics/">post publication discussion on Twitter and with a Blog</a></li> </ol> <p>This is also true of open science, open data sharing, reproducibility, replicability, post-publication peer review and all the other issues forming the “reproducibility crisis”. There is a lot of attention and heat that has focused on the “crisis” or on folks who make a point to take a stand on reproducibility or open science or post publication review. 
But in the background, outside of the hype, there are a large group of people that are quietly executing solid, open, reproducible science.</p> <p>I wish that this group would get more attention so I decided to point out a few of them. Next time somebody asks me about the research on reproducibility or open science I’ll just point them here and tell them to just follow the lead of people doing it.</p> <ul> <li><strong>Karl Broman</strong> - posts all of his <a href="http://kbroman.org/pages/talks.html">talks online </a>, generates many widely used <a href="http://kbroman.org/pages/software.html">open source packages</a>, writes <a href="http://kbroman.org/pages/tutorials.html">free/open tutorials</a> on everything from knitr to making webpages, makes his <a href="http://www.ncbi.nlm.nih.gov/pubmed/26290572">papers</a> highly <a href="https://github.com/kbroman/Paper_SampleMixups">reproducible</a>.</li> <li><strong>Jessica Li</strong> - <a href="http://www.stat.ucla.edu/~jingyi.li/software-and-data.html">posts her data online and writes open source software for her analyses</a>.</li> <li><strong>Mark Robinson - </strong>posts many of his papers as <a href="http://biorxiv.org/search/author1%3Arobinson%252C%2Bmd%20numresults%3A10%20sort%3Arelevance-rank%20format_result%3Astandard">preprints on biorxiv</a>, makes his <a href="https://github.com/markrobinsonuzh/diff_splice_paper">analyses reproducible</a>, writes <a href="http://bioconductor.org/packages/release/bioc/html/Repitools.html">open source software </a></li> <li><strong>Florian Markowetz -<a href="http://www.markowetzlab.org/software/"> </a></strong><a href="http://www.markowetzlab.org/software/">writes open source software</a>, provides <a href="http://www.markowetzlab.org/data.php">Bioconductor data for major projects</a>, links <a href="http://www.markowetzlab.org/publications.php">his papers with his code</a> nicely on his publications page.</li> <li><strong>Raphael Gottardo</strong> - <a 
href="http://www.rglab.org/software.html">writes/maintains many open source software packages</a>, makes <a href="https://github.com/RGLab/BNCResponse">his analyses reproducible and available via Github</a>, posts <a href="http://biorxiv.org/content/early/2015/06/15/020842">preprints of his papers</a>.</li> <li><strong>Genevera Allen</strong> - writes <a href="https://cran.r-project.org/web/packages/TCGA2STAT/index.html">software</a> to make data easier to access, posts <a href="http://biorxiv.org/content/early/2015/09/24/027516">preprints on biorxiv</a> and <a href="http://arxiv.org/pdf/1502.03853v1.pdf">on arxiv</a>.</li> <li><strong>Lorena Barba</strong> - <a href="http://openedx.seas.gwu.edu/courses/GW/MAE6286/2014_fall/about">teaches open source moocs</a>, with lessons as <a href="https://github.com/barbagroup/CFDPython">open source iPython modules</a>, and <a href="https://github.com/barbagroup/pygbe">reproducible code for her analyses</a>.</li> <li><strong>Alicia Oshlack</strong> - writes papers with <a href="http://www.genomemedicine.com/content/7/1/43">completely reproducible analyses</a>, <a href="http://bioconductor.org/packages/release/bioc/html/missMethyl.html">publishes lots of open source software</a> and publishes <a href="http://biorxiv.org/content/early/2015/01/23/013698">preprints</a> for her papers.</li> <li><strong>Baggerly and Coombes</strong> - although they are famous for a <a href="https://projecteuclid.org/euclid.aoas/1267453942">highly public reproducible piece of research</a> they have also quietly implemented policies like <a href="http://magazine.amstat.org/blog/2011/01/01/scipolicyjan11/">making all reports reproducible for their consulting center</a>.</li> </ul> <p>This list was made completely haphazardly, as all my lists are, but it is just meant to indicate that there are a ton of people out there doing this. 
One thing that is clear too is that grad students and postdocs are adopting the approach I described at a very high rate.</p> <p>Moreover there are people that have been doing parts of this for a long time (like the <a href="http://arxiv.org/">physics</a> or <a href="http://biostats.bepress.com/jhubiostat/">biostatistics</a> communities with preprints, or how people have used <a href="https://projecteuclid.org/euclid.aoas/1267453942">Sweave for a long time</a>) . I purposely left people off the list like Titus and Ethan who have gone all in, even posting their <a href="http://ivory.idyll.org/blog/grants-posted.html">grants</a> <a href="http://jabberwocky.weecology.org/2012/08/10/a-list-of-publicly-available-grant-proposals-in-the-biological-sciences/">online</a>. I did this because they are very loud advocates of open science, but I wanted to highlight quieter contributors and point out that while there is a lot of noise going on over in one corner, many people are quietly doing really good science in another.</p> By opposing tracking well-meaning educators are hurting disadvantaged kids 2015-12-09T10:10:02+00:00 http://simplystats.github.io/2015/12/09/by-opposing-tracking-well-meaning-educators-are-hurting-disadvantaged-kids <div class="page" title="Page 2"> <div class="layoutArea"> <div class="column"> <p> An unfortunate fact about the US K-12 system is that the education gap between poor and rich is growing. One manifestation of this trend is that we rarely see US kids from disadvantaged backgrounds become tenure track faculty, especially in the STEM fields. In my experience, the ones that do make it, when asked how they overcame the suboptimal math education their school district provided, often respond "I was <a href="https://en.wikipedia.org/wiki/Tracking_(education)">tracked</a>" or "I went to a <a href="https://en.wikipedia.org/wiki/Magnet_school">magnet school</a>". 
Magnet schools filter students with admission tests and then teach at a higher level than an average school, so essentially the entire school is an advanced track. </p> </div> </div> </div> <p>Twenty years of classroom instruction experience has taught me that classes with diverse academic abilities present one of the most difficult teaching challenges. Typically, one is forced to focus on only a sub-group of students, usually the second quartile. As a consequence the lower and higher quartiles are not properly served. At the university level, we minimize this problem by offering different levels: remedial math versus math for engineers, probability for the Masters program versus probability for PhD students, co-ed intramural sports versus the varsity basketball team, intro to World Music versus a spot in the orchestra, etc. In K-12, tracking seems like the obvious solution to teaching to an array of student levels.</p> <p>Unfortunately, there has been a trend recently to move away from tracking and several school districts now forbid it. The motivation seems to be a series of <a href="http://www.tandfonline.com/doi/abs/10.1207/s15430421tip4501_9">observational</a> <a href="http://files.eric.ed.gov/fulltext/ED329615.pdf">studies</a> that note that “low-track classes tend to be primarily composed of low-income students, usually minorities, while upper-track classes are usually dominated by students from socioeconomically successful groups.” Tracking opponents infer that this unfortunate reality is due to bias (conscious or unconscious) in the informal referrals that are typically used to decide which students are advanced. However, <strong>this is a critique of the referral system, not of tracking itself.</strong> A simple fix is to administer an objective test or use the percentiles from <a href="http://www.doe.mass.edu/mcas/overview.html">state assessment tests</a>. In fact, such exams have been developed and implemented. 
A recent study (summarized <a href="http://www.vox.com/2015/11/23/9784250/card-giuliano-gifted-talented">here</a>) examined the data from a district that for a period of time implemented an objective assessment and found that</p> <blockquote> <p>[t]he number of Hispanic students [in the advanced track increased] by 130 percent and the number of black students by 80 percent.</p> </blockquote> <p>Unfortunately, instead of maintaining the placement criteria, which benefited underrepresented minorities without relaxing standards, these school districts reverted to the old, flawed system due to budget cuts.</p> <p>Another argument against tracking is that students benefit more from being in classes with higher-achieving peers, rather than being in a class with students with similar subject mastery and a teacher focused on their level. However, a <a href="http://web.stanford.edu/~pdupas/Tracking_rev.pdf">randomized evaluation</a> (and the only one of which I am aware) finds that tracking helps all students:</p> <blockquote> <p>We find that tracking students by prior achievement raised scores for all students, even those assigned to lower achieving peers. On average, after 18 months, test scores were 0.14 standard deviations higher in tracking schools than in non-tracking schools (0.18 standard deviations higher after controlling for baseline scores and other control variables). After controlling for the baseline scores, students in the top half of the pre-assignment distribution gained 0.19 standard deviations, and those in the bottom half gained 0.16 standard deviations. <strong>Students in all quantiles benefited from tracking. </strong></p> </blockquote> <p>I believe that without tracking, the achievement gap between disadvantaged children and their affluent peers will continue to widen since involved parents will seek alternative educational opportunities, including private schools or subject specific extracurricular acceleration programs. 
With limited or no access to advanced classes in the public system, disadvantaged students will be less prepared to enter the very competitive STEM fields. Note that competition comes not only from within the US, but from other countries including many with educational systems that track.</p> <p>To illustrate the extreme gap, the following exercises are from a 7th grade public school math class (in a high performing school district):</p> <table style="width: 100%;"> <tr> <td> <a href="http://simplystatistics.org/wp-content/uploads/2016/12/Screen-Shot-2015-12-07-at-11.49.41-AM.png"><img src="http://simplystatistics.org/wp-content/uploads/2016/12/Screen-Shot-2015-12-07-at-11.49.41-AM.png" alt="Screen Shot 2015-12-07 at 11.49.41 AM" width="275" /></a> </td> <td> <a href="http://simplystatistics.org/wp-content/uploads/2016/12/Screen-Shot-2015-12-09-at-9.00.57-AM.png"><img src="http://simplystatistics.org/wp-content/uploads/2016/12/Screen-Shot-2015-12-09-at-9.00.57-AM.png" alt="Screen Shot 2015-12-09 at 9.00.57 AM" width="275" /></a> </td> </tr> </table> <p>(Click to enlarge). There is no tracking so all students must work on these problems. 
Meanwhile, in a 7th grade advanced, private math class, that same student can be working on problems like these:<a href="http://simplystatistics.org/wp-content/uploads/2016/12/Screen-Shot-2015-12-07-at-11.47.45-AM.png"><img class="alignnone size-full wp-image-4511" src="http://simplystatistics.org/wp-content/uploads/2016/12/Screen-Shot-2015-12-07-at-11.47.45-AM.png" alt="Screen Shot 2015-12-07 at 11.47.45 AM" width="1165" height="341" srcset="http://simplystatistics.org/wp-content/uploads/2016/12/Screen-Shot-2015-12-07-at-11.47.45-AM-300x88.png 300w, http://simplystatistics.org/wp-content/uploads/2016/12/Screen-Shot-2015-12-07-at-11.47.45-AM-1024x300.png 1024w, http://simplystatistics.org/wp-content/uploads/2016/12/Screen-Shot-2015-12-07-at-11.47.45-AM-260x76.png 260w, http://simplystatistics.org/wp-content/uploads/2016/12/Screen-Shot-2015-12-07-at-11.47.45-AM.png 1165w" sizes="(max-width: 1165px) 100vw, 1165px" /></a>Let me stress that there is nothing wrong with the first example if it is the appropriate level of the student.  However, a student who can work at the level of the second example, should be provided with the opportunity to do so notwithstanding their family’s ability to pay. Poorer kids in districts which do not offer advanced classes will not only be less equipped to compete with their richer peers, but many of the academically advanced ones may, I suspect,  dismiss academics due to lack of challenge and boredom.  Educators need to consider evidence when making decisions regarding policy. Tracking can be applied unfairly, but that aspect can be remedied. 
Eliminating tracking altogether takes away a crucial tool for disadvantaged students to move into the STEM fields and, according to the empirical evidence, hurts all students.</p> Not So Standard Deviations: Episode 5 - IRL Roger is Totally With It 2015-12-03T09:52:47+00:00 http://simplystats.github.io/2015/12/03/not-so-standard-deviations-episode-5-irl-roger-is-totally-with-it <p>I just posted Episode 5 of Not So Standard Deviations so check your feeds! Sorry for the long delay since the last episode but we got a bit tripped up by the Thanksgiving holiday.</p> <p>In this episode, Hilary and I open up the mailbag and go through some of the feedback we’ve gotten on the previous episodes. The rest of the time is spent talking about the importance of reproducibility in data analysis both in academic research and in industry settings.</p> <p>If you haven’t already, you can subscribe to the podcast through <a href="https://itunes.apple.com/us/podcast/not-so-standard-deviations/id1040614570">iTunes</a>. 
Or you can use the <a href="http://feeds.soundcloud.com/users/soundcloud:users:174789515/sounds.rss">SoundCloud RSS feed</a> directly.</p> <p>Notes:</p> <ul> <li>Hilary’s <a href="https://youtu.be/7B3n-5atLxM">talk on reproducible analysis in production</a> at the New York R Conference</li> <li>Hilary’s <a href="https://youtu.be/zlSOckFpYqg">Ignite presentation</a> at Strata 2013</li> <li>Roger’s <a href="https://youtu.be/aH8dpcirW1U">talk on “Computational and Policy Tools for Reproducible Research”</a> at the Applied Mathematics Perspectives Workshop in Vancouver, 2011</li> <li>Duke Scandal <a href="http://goo.gl/rEO5QD">Starter Set</a></li> <li><a href="https://youtu.be/7gYIs7uYbMo">Keith Baggerly’s talk</a> on Duke Scandal</li> <li>The <a href="https://goo.gl/RtpBZa">Web of Trust</a></li> <li><a href="https://goo.gl/MlM0gu">testdat</a> R package</li> </ul> <p><a href="https://api.soundcloud.com/tracks/235689361/download?client_id=02gUJC0hH2ct1EGOcYXQIzRFU91c72Ea&amp;oauth_token=1-138878-174789515-deb24181d01af">Download the audio file for this episode</a>.</p> <p>Or you can listen right here:</p> Thinking like a statistician: the importance of investigator-initiated grants 2015-12-01T11:40:29+00:00 http://simplystats.github.io/2015/12/01/thinking-like-a-statistician-fund-more-investigator-initiated-grants <p>A substantial amount of scientific research is funded by investigator-initiated grants. A researcher has an idea, writes it up and sends a proposal to a funding agency. The agency then elicits help from a group of peers to evaluate competing proposals. Grants are awarded to the most highly ranked ideas. The percent awarded depends on how much funding gets allocated to these types of proposals. At the NIH, the largest funding agency of these types of grants, the success rate recently <a href="https://nihdirectorsblog.files.wordpress.com/2013/09/sequestration-success-rates1.jpg">fell below 20% from a high above 35%</a>. 
Part of the reason these percentages have fallen is to make room for large collaborative projects. Large projects seem to be increasing, and not just at the NIH. In Europe, for example, the <a href="https://www.humanbrainproject.eu/">Human Brain Project</a> has an estimated cost of over 1 billion US$ over 10 years. To put this in perspective, 1 billion dollars can fund over 500 <a href="http://grants.nih.gov/grants/funding/r01.htm">NIH R01s</a>. R01 is the NIH mechanism most appropriate for investigator-initiated proposals.</p> <p>The merits of big science have been widely debated (for example <a href="http://www.michaeleisen.org/blog/?p=1179">here</a> and <a href="http://simplystatistics.org/2013/02/27/please-save-the-unsolicited-r01s/">here</a>). And most agree that some big projects have been successful. However, in this post I present a statistical argument highlighting the importance of investigator-initiated awards. The idea is summarized in the graph below.</p> <p><a href="http://simplystatistics.org/wp-content/uploads/2015/12/Rplot.png"><img class="alignnone size-full wp-image-4483" src="http://simplystatistics.org/wp-content/uploads/2015/12/Rplot.png" alt="Rplot" width="1112" height="551" srcset="http://simplystatistics.org/wp-content/uploads/2015/12/Rplot-300x149.png 300w, http://simplystatistics.org/wp-content/uploads/2015/12/Rplot-1024x507.png 1024w, http://simplystatistics.org/wp-content/uploads/2015/12/Rplot-260x129.png 260w, http://simplystatistics.org/wp-content/uploads/2015/12/Rplot.png 1112w" sizes="(max-width: 1112px) 100vw, 1112px" /></a></p> <p>The two panes above represent two different funding strategies: fund-many-R01s (left) or reduce R01s to fund several large projects (right). The grey crosses represent investigators and the gold dots represent potential paradigm-shifting geniuses. Location on the Cartesian plane represents research area, with the blue circles denoting areas that are prime for an important scientific advance. 
The largest scientific contributions occur when a gold dot falls in a blue circle. Large contributions also result from the accumulation of incremental work produced by grey crosses in the blue circles.</p> <p>Although not perfect, the peer review approach implemented by most funding agencies appears to work quite well at weeding out unproductive researchers and unpromising ideas. It also seems to do well at spreading funds across general areas. For example NIH spreads funds across <a href="https://www.nih.gov/institutes-nih/list-nih-institutes-centers-offices">diseases and public health challenges</a> (for example cancer, mental health, and heart and lung disease) as well as <a href="https://www.nigms.nih.gov/Pages/default.aspx">general medicine</a>, <a href="https://www.genome.gov/">genomics</a> and <a href="https://www.nlm.nih.gov/">information</a>. However, precisely predicting who will be a gold dot or what specific area will be a blue circle seems like an impossible endeavor. Increasing the number of tested ideas and researchers therefore increases our chance of success. When a funding agency decides to invest big in a specific area (green dollar signs) it is predicting the location of a blue circle. As funding flows into these areas, so do investigators (note the clusters). The total number of funded lead investigators also drops. The risk here is that if the dollar sign lands far from a blue circle, we pull researchers away from potentially fruitful areas. If after 10 years of funding, the <a href="https://www.humanbrainproject.eu/">Human Brain Project</a> doesn’t <a href="https://www.humanbrainproject.eu/mission">“achieve a multi-level, integrated understanding of brain structure and function”</a> we will have missed out on trying out 500 ideas by hundreds of different investigators. 
With a sample size this large, we expect at least a handful of these attempts to result in the type of impactful advance that justifies funding scientific research.</p> <p>The simulation presented here (code below) is clearly an oversimplification, but it does depict the statistical reason why I favor investigator-initiated grants. It suggests that the strategy of funding many investigator-initiated grants is key for the continued success of scientific research.</p> <p><tt><br />
set.seed(2)<br />
library(rafalib)<br />
thecol="gold3"<br />
mypar(1,2,mar=c(0.5,0.5,2,0.5))<br />
###<br />
## Start with the many R01s model<br />
###<br />
## generate location of 2,000 investigators<br />
N = 2000<br />
x = runif(N)<br />
y = runif(N)<br />
## 1% are geniuses (pch 16 = filled dot, pch 4 = cross)<br />
Ng = N*0.01<br />
g = rep(4,N);g[1:Ng]=16<br />
## generate location of important areas of research<br />
M0 = 10<br />
x0 = runif(M0)<br />
y0 = runif(M0)<br />
r0 = rep(0.03,M0)<br />
## Make the plot<br />
nullplot(xaxt="n",yaxt="n",main="Many R01s")<br />
symbols(x0,y0,circles=r0,fg="black",bg="blue",<br />
lwd=3,add=TRUE,inches=FALSE)<br />
points(x,y,pch=g,col=ifelse(g==4,"grey",thecol))<br />
## redraw the geniuses so they sit on top of the crosses<br />
points(x,y,pch=g,col=ifelse(g==4,NA,thecol))<br />
### Generate the location of 5 big projects<br />
M1 = 5<br />
x1 = runif(M1)<br />
y1 = runif(M1)<br />
## make initial plot<br />
nullplot(xaxt="n",yaxt="n",main="A Few Big Projects")<br />
symbols(x0,y0,circles=r0,fg="black",bg="blue",<br />
lwd=3,add=TRUE,inches=FALSE)<br />
### Generate location of investigators attracted<br />
### to location of big projects. There are 1000 total<br />
### investigators (200 per project)<br />
Sigma = diag(2)*0.005<br />
N1 = 200<br />
Ng1 = round(N1*0.01)<br />
g1 = rep(4,N1);g1[1:Ng1]=16<br />
library(MASS)<br />
for(i in 1:M1){<br />
xy = mvrnorm(N1,c(x1[i],y1[i]),Sigma)<br />
points(xy[,1],xy[,2],pch=g1,col=ifelse(g1==4,"grey",thecol))<br />
}<br />
### generate location of investigators that ignore big projects<br />
### note now 500 instead of 2000. Note overall total<br />
## is also less because large projects result in fewer<br />
## lead investigators<br />
N = 500<br />
x = runif(N)<br />
y = runif(N)<br />
Ng = N*0.01<br />
g = rep(4,N);g[1:Ng]=16<br />
points(x,y,pch=g,col=ifelse(g==4,"grey",thecol))<br />
points(x1,y1,pch="$",col="darkgreen",cex=2,lwd=2)<br />
</tt></p> A thanksgiving dplyr Rubik's cube puzzle for you 2015-11-25T12:14:06+00:00 http://simplystats.github.io/2015/11/25/a-thanksgiving-dplyr-rubiks-cube-puzzle-for-you <p><a href="http://nickcarchedi.com/">Nick Carchedi</a> is back visiting from <a href="https://www.datacamp.com/">DataCamp</a> and for fun we came up with a <a href="https://cran.rstudio.com/web/packages/dplyr/vignettes/introduction.html">dplyr</a> Rubik’s cube puzzle. Here is how it works. To solve the puzzle you have to make a 4 x 3 data frame that spells Thanksgiving like this:</p> <div class="oembed-gist"> <noscript> View the code on <a href="https://gist.github.com/jtleek/4d4b63a035973231e6d4">Gist</a>. </noscript> </div> <p><span style="line-height: 1.5;">To solve the puzzle you need to pipe this data frame in </span></p> <div class="oembed-gist"> <noscript> View the code on <a href="https://gist.github.com/jtleek/aae1218a8f4d1220e07d">Gist</a>. </noscript> </div> <p>and pipe out the Thanksgiving data frame using only the dplyr commands <em>arrange</em>, <em>mutate</em>, <em>slice</em>, <em>filter</em> and <em>select</em>. 
For advanced users you can try our slightly more complicated puzzle:</p> <div class="oembed-gist"> <noscript> View the code on <a href="https://gist.github.com/jtleek/b82531d9dac78ba3c60a">Gist</a>. </noscript> </div> <p>See if you can do it <a href="http://www.theguardian.com/technology/video/2015/nov/24/boy-completes-rubiks-cube-in-49-seconds-word-recordvideo">this fast</a>. Post your solutions in the comments and Happy Thanksgiving!</p> 20 years of Data Science: from Music to Genomics 2015-11-24T10:00:56+00:00 http://simplystats.github.io/2015/11/24/20-years-of-data-science-and-data-driven-discovery-from-music-to-genomics <p>I finally got around to reading David Donoho’s <a href="https://dl.dropboxusercontent.com/u/23421017/50YearsDataScience.pdf">50 Years of Data Science</a> paper.  I highly recommend it. The following quote seems to summarize the sentiment that motivated the paper, as well as why it has resonated among academic statisticians:</p> <div class="page" title="Page 5"> <div class="layoutArea"> <div class="column"> <blockquote> <p> The statistics profession is caught at a confusing moment: the activities which preoccupied it over centuries are now in the limelight, but those activities are claimed to be bright shiny new, and carried out by (although not actually invented by) upstarts and strangers. </p> </blockquote> </div> </div> </div> <p>The reason we started this blog over four years ago was because, as Jeff wrote in his inaugural post, we were “<a href="http://simplystatistics.org/2011/09/07/first-things-first/">fired up about the new era where data is abundant and statisticians are scientists</a>”. It was clear that many disciplines were becoming data-driven and  that interest in data analysis was growing rapidly. 
We were further motivated because, despite this <a href="http://simplystatistics.org/2014/09/15/applied-statisticians-people-want-to-learn-what-we-do-lets-teach-them/">new found interest in our work</a>, academic statisticians were, in general, more interested in the development of context free methods than in leveraging applied statistics to take <a href="http://simplystatistics.org/2012/06/22/statistics-and-the-science-club/">leadership roles</a> in data-driven projects. Meanwhile, great and highly visible applied statistics work was occurring in other fields such as astronomy, computational biology, computer science, political science and economics. So it was not completely surprising that some (bio)statistics departments were being left out from larger university-wide data science initiatives. Some of <a href="http://simplystatistics.org/2014/07/25/academic-statisticians-there-is-no-shame-in-developing-statistical-solutions-that-solve-just-one-problem/">our</a> <a href="http://simplystatistics.org/2013/04/15/data-science-only-poses-a-threat-to-biostatistics-if-we-dont-adapt/">posts</a> exhorted academic departments to embrace larger numbers of applied statisticians:</p> <blockquote> <p>[M]any of the giants of our discipline were very much interested in solving specific problems in genetics, agriculture, and the social sciences. In fact, many of today’s most widely-applied methods were originally inspired by insights gained by answering very specific scientific questions. I worry that the balance between application and theory has shifted too far away from applications. An unfortunate consequence is that our flagship journals, including our applied journals, are publishing too many methods seeking to solve many problems but actually solving none.  
By shifting some of our efforts to solving specific problems we will get closer to the essence of modern problems and will actually inspire more successful generalizable methods.</p> </blockquote> <p>Donoho points out that John Tukey had a similar preoccupation 50 years ago:</p> <div class="page" title="Page 10"> <div class="layoutArea"> <div class="column"> <blockquote> <p> For a long time I have thought I was a statistician, interested in inferences from the particular to the general. But as I have watched mathematical statistics evolve, I have had cause to wonder and to doubt. ... All in all I have come to feel that my central interest is in data analysis, which I take to include, among other things: procedures for analyzing data, techniques for interpreting the results of such procedures, ways of planning the gathering of data to make its analysis easier, more precise or more accurate, and all the machinery and results of (mathematical) statistics which apply to analyzing data </p> </blockquote> <p> Many applied statisticians do the things Tukey mentions above. In the blog we have encouraged them to <a href="http://simplystatistics.org/2014/09/15/applied-statisticians-people-want-to-learn-what-we-do-lets-teach-them/">teach the gory details of what they do</a>, along with the general methodology we currently teach. With all this in mind, several months ago, when I was invited to give a talk at a department that was, at the time, deciphering their role in their university's data science initiative, I gave a talk titled<em> 20 years of Data Science: from Music to Genomics. 
</em>The goal was to explain why <em>applied statistician</em> is not considered synonymous with <em>data scientist </em>even when we focus on the same goal: <a href="https://en.wikipedia.org/wiki/Data_science">extract knowledge or insights from data.</a> </p> <p> The first example in the talk related to how academic applied statisticians tend to emphasize the parts that will be most appreciated by our math stat colleagues and ignore the aspects that are today being heralded as the linchpins of data science. I used my thesis papers as examples. <a href="http://archive.cnmat.berkeley.edu/Research/1998/Rafael/tesis.pdf">My dissertation work</a> was about finding meaningful parametrization of musical sound signals that<img class="wp-image-4449 alignright" src="http://www.biostat.jhsph.edu/~ririzarr/Demo/img7.gif" alt="Spectrogram" width="380" height="178" /> my collaborators could use to manipulate sounds to create new ones. To do this, I prepared a database of sounds, wrote code to extract and import the digital representations from CDs into S-plus (yes, I'm that old), visualized the data to motivate models, wrote code in C (or was it Fortran?) to make the analysis go faster, and tested these models with residual analysis by ear (you can listen to them <a href="http://www.biostat.jhsph.edu/~ririzarr/Demo/">here</a>). None of these data science aspects were highlighted in the <a href="http://www3.stat.sinica.edu.tw/statistica/oldpdf/A10n42.pdf">papers</a> <a href="http://www.tandfonline.com/doi/abs/10.1198/000313001300339969#.Vk4_ht-rQUE">I</a> <a href="http://www.tandfonline.com/doi/abs/10.1198/016214501750332875#.Vk4_mN-rQUE">wrote </a><a href="http://www.tandfonline.com/doi/abs/10.1198/016214501753168082#.Vk4_qt-rQUE">about</a> my <a href="http://onlinelibrary.wiley.com/doi/10.1111/1467-9892.01515/abstract?userIsAuthenticated=false&amp;deniedAccessCustomisedMessage=">thesis</a>. 
Here is a screen shot from <a href="http://onlinelibrary.wiley.com/doi/10.1111/1467-9892.01515/abstract">this paper</a>: </p> </div> </div> </div> <p><a href="http://simplystatistics.org/wp-content/uploads/2016/05/Screen-Shot-2015-04-15-at-12.24.40-PM.png"><img class="wp-image-4449 aligncenter" src="http://simplystatistics.org/wp-content/uploads/2016/05/Screen-Shot-2015-04-15-at-12.24.40-PM.png" alt="Screen Shot 2015-04-15 at 12.24.40 PM" width="320" height="342" srcset="http://simplystatistics.org/wp-content/uploads/2016/05/Screen-Shot-2015-04-15-at-12.24.40-PM-957x1024.png 957w, http://simplystatistics.org/wp-content/uploads/2016/05/Screen-Shot-2015-04-15-at-12.24.40-PM-187x200.png 187w, http://simplystatistics.org/wp-content/uploads/2016/05/Screen-Shot-2015-04-15-at-12.24.40-PM.png 1204w" sizes="(max-width: 320px) 100vw, 320px" /></a></p> <p>I am actually glad I wrote out and published all the technical details of this work.  It was great training. My point was simply that based on the focus of these papers, this work would not be considered data science.</p> <p>The rest of my talk described some of the work I did once I transitioned into applications in Biology. I was fortunate to have a <a href="http://www.jhsph.edu/faculty/directory/profile/3859/scott-zeger">department chair</a> that appreciated lead-author papers in the subject matter journals as much as statistical methodology papers. This opened the door for me to become a full fledged applied statistician/data scientist. 
In the talk I described how <a href="http://bioinformatics.oxfordjournals.org/content/20/3/307.short">developing software packages,</a> <a href="http://www.nature.com/nmeth/journal/v2/n5/abs/nmeth756.html">planning</a> the <a href="http://www.nature.com/nmeth/journal/v4/n11/abs/nmeth1102.html">gathering of data</a> to <a href="http://www.ncbi.nlm.nih.gov/pubmed/?term=16108723">aid method development</a>, developing <a href="http://www.ncbi.nlm.nih.gov/pubmed/14960458">web tools</a> to assess data analysis techniques in the wild, and facilitating <a href="http://www.ncbi.nlm.nih.gov/pubmed/19151715">data-driven discovery</a> in biology has been very gratifying and, simultaneously, helped my career. However, at some point, early in my career, senior members of my department encouraged me to write and submit a methods paper to a statistical journal to go along with every paper I sent to the subject matter journals. Although I do write methods papers when I think the ideas add to the statistical literature, I did not follow the advice to simply write papers for the sake of publishing in statistics journals. Note that if (bio)statistics departments require applied statisticians to do this, then it becomes harder to have an impact as data scientists. Departments that are not producing widely used methodology or successful and visible applied statistics projects (or both), should not be surprised when they are not included in data science initiatives. 
So, applied statistician, read that Tukey quote again, listen to <a href="https://youtu.be/vbb-AjiXyh0">President Obama</a>, and go do some great data science.</p> <p> </p> <p> </p> Some Links Related to Randomized Controlled Trials for Policymaking 2015-11-19T12:49:03+00:00 http://simplystats.github.io/2015/11/19/some-links-related-to-randomized-controlled-trials-for-policymaking <div> <p> In response to <a href="http://simplystatistics.org/2015/11/17/why-are-randomized-trials-not-used-by-policymakers/">my previous post</a>, <a href="https://gspp.berkeley.edu/directories/faculty/avi-feller">Avi Feller</a> sent me these links related to efforts promoting the use of RCTs  and evidence-based approaches for policymaking: </p> <ul> <li>  The theme of this year's just-concluded APPAM conference (the national public policy research organization) was "evidence-based policymaking," with a headline panel on using experiments in policy (see <a href="http://www.appam.org/events/fall-research-conference/2015-fall-research-conference-information/" target="_blank">here</a> and <a href="http://www.appam.org/2015appam-student-summary-using-experiments-for-evidence-based-policy-lessons-from-the-private-sector/" target="_blank">here</a>). </li> </ul> <ul> <li> Jeff Liebman has written extensively about the use of randomized experiments in policy (see <a href="http://govinnovator.com/ten_year_challenge/" target="_blank">here</a> for a recent interview). </li> </ul> <ul> <li> The White House now has an entire office devoted to running randomized trials to improve government performance (the so-called "nudge unit"). Check out their recent annual report <a href="https://www.whitehouse.gov/sites/default/files/microsites/ostp/sbst_2015_annual_report_final_9_14_15.pdf" target="_blank">here</a>. 
</li> </ul> <ul> <li> JPAL North America just launched a major initiative to help state and local governments run randomized trials (see <a href="https://www.povertyactionlab.org/about-j-pal/news/j-pal-north-america-state-and-local-innovation-initiative-release" target="_blank">here</a>). </li> </ul> </div> Given the history of medicine, why are randomized trials not used for social policy? 2015-11-17T10:42:24+00:00 http://simplystats.github.io/2015/11/17/why-are-randomized-trials-not-used-by-policymakers <p>Policy changes can have substantial societal effects. For example, clean water and  hygiene policies have saved millions, if not billions, of lives. But effects are not always positive. For example, <a href="https://en.wikipedia.org/wiki/Prohibition_in_the_United_States">prohibition</a>, or the “noble experiment”, boosted organized crime, slowed economic growth and increased deaths caused by tainted liquor. Good intentions do not guarantee desirable outcomes.</p> <p>The medical establishment is well aware of the danger of basing decisions on the good intentions of doctors or biomedical researchers. For this reason, randomized controlled trials (RCTs) are the standard approach to determining if a new treatment is safe and effective. In these trials an objective assessment is achieved by assigning patients at random to a treatment or control group, and then comparing the outcomes in these two groups. Probability calculations are used to summarize the evidence in favor or against the new treatment. Modern RCTs are considered <a href="http://abcnews.go.com/Health/TenWays/story?id=3605442&amp;page=1">one of the greatest medical advances of the 20th century</a>.</p> <p>Despite their unprecedented success in medicine, RCTs have not been fully adopted outside of scientific fields. 
In <a href="http://www.badscience.net/2011/05/we-should-so-blatantly-do-more-randomised-trials-on-policy/">this post</a>, Ben Goldacre advocates for politicians to learn from scientists and base policy decisions on RCTs. He provides several examples in which results contradicted conventional wisdom. In <a href="https://www.ted.com/talks/esther_duflo_social_experiments_to_fight_poverty?language=en">this TED talk</a> Esther Duflo convincingly argues that RCTs should be used to determine what interventions are best at fighting poverty. Although some RCTs are being conducted, they are still rare and oftentimes ignored by policymakers. For example, despite at least <a href="http://peabody.vanderbilt.edu/research/pri/VPKthrough3rd_final_withcover.pdf">two</a> <a href="http://www.acf.hhs.gov/sites/default/files/opre/executive_summary_final.pdf">RCT</a>s finding that universal pre-K programs are not effective, policymakers in New York <a href="http://www.npr.org/sections/ed/2015/09/08/438584249/new-york-city-mayor-goes-all-in-on-free-preschool">are implementing a $400 million a year program</a>. Supporters of this noble endeavor defend their decision by pointing to observational studies and “expert” opinion that support their preconceived views. Before the 1950s, indifference to RCTs was common among medical doctors as well, and the outcomes were at times devastating.</p> <p>Today, when we <a href="http://www.ncbi.nlm.nih.gov/pubmed/7058834">compare conclusions from non-RCT studies to RCTs</a>, we note the unintended strong effects that preconceived notions can have. The first chapter in <a href="http://www.amazon.com/Statistics-4th-Edition-David-Freedman/dp/0393929728">this book</a> provides a summary and some examples. One example comes from <a href="http://www.jameslindlibrary.org/grace-nd-muench-h-chalmers-tc-1966/">a study</a> of 51 studies on the effectiveness of the portacaval shunt. 
Here is a table summarizing the conclusions of the 51 studies:</p> <table> <tr> <td> Design </td> <td> Marked Improvement </td> <td> Moderate Improvement </td> <td> None </td> </tr> <tr> <td> No control </td> <td> 24 </td> <td> 7 </td> <td> 1 </td> </tr> <tr> <td> Controls, but not randomized </td> <td> 10 </td> <td> 3 </td> <td> 2 </td> </tr> <tr> <td> Randomized </td> <td> 0 </td> <td> 1 </td> <td> 3 </td> </tr> </table> <p>Compare the first and last columns to appreciate the importance of the randomized trials.</p> <p>A particularly troubling example relates to the studies on Diethylstilbestrol (DES). DES is a drug that was used to prevent spontaneous abortions. Five out of five studies using historical controls found the drug to be effective, yet all three randomized trials found the opposite. Before the randomized trials convinced doctors to stop using this drug, it was given to thousands of women. This turned out to be a tragedy as later studies showed DES has <a href="http://diethylstilbestrol.co.uk/des-side-effects/">terrible side effects</a>. Despite the doctors having the best intentions in mind, ignoring the randomized trials resulted in unintended consequences.</p> <p>Well-meaning experts are regularly implementing policies without really testing their effects. Although randomized trials are not always possible, it seems that they are rarely considered, in particular when the intentions are noble. 
<span style="line-height: 1.5;">Just like well-meaning turn-of-the-20th-century doctors, convinced that they were doing good, put their patients at risk by providing ineffective treatments, well intentioned policies may end up hurting society.</span></p> <p><strong>Update: </strong>A reader pointed me to <a href="http://papers.ssrn.com/sol3/papers.cfm?abstract_id=2534811">these</a> <a href="http://eml.berkeley.edu//~crwalters/papers/kline_walters.pdf">preprints</a> which point out that the control group in <a href="http://www.acf.hhs.gov/sites/default/files/opre/executive_summary_final.pdf">one of the cited</a> early education RCTs included children that receive care in a range of different settings, not just staying at home. This implies that the signal is attenuated if what we want to know is if the program is effective for children that would otherwise stay at home. In <a href="http://papers.ssrn.com/sol3/papers.cfm?abstract_id=2534811">this preprint</a> they use statistical methodology (principal stratification framework) to obtain separate estimates: the effect for children that would otherwise go to other center-based care and the effect for children that would otherwise stay at home. They find no effect for the former group but a significant effect for the latter. Note that in this analysis the effect being estimated is no longer based on groups assigned at random. Instead, model assumptions are used to infer the two effects. To avoid dependence on these assumptions we will have to perform an RCT with better defined controls. Also note that the<span style="line-height: 1.5;"> RCT data facilitated the principal stratification framework analysis. I also want to restate what <a href="http://simplystatistics.org/2014/04/17/correlation-does-not-imply-causation-parental-involvement-edition/">I’ve posted before</a>, “I am not saying that observational studies are uninformative. If properly analyzed, observational data can be very valuable. 
For example, the data supporting smoking as a cause of lung cancer is all observational. Furthermore, there is an entire subfield within statistics (referred to as causal inference) that develops methodologies to deal with observational data. But unfortunately, observational data are commonly misinterpreted.”</span></p> So you are getting crushed on the internet? The new normal for academics. 2015-11-16T09:49:04+00:00 http://simplystats.github.io/2015/11/16/so-you-are-getting-crushed-on-the-internet-the-new-normal-for-academics <p>Roger and I were just talking about all the discussion around the <a href="http://www.pnas.org/content/early/2015/10/29/1518393112.full.pdf">Case and Deaton paper</a> on death rates for middle class people. Andrew Gelman <a href="http://www.slate.com/articles/health_and_science/science/2015/11/death_rates_for_white_middle_aged_americans_are_not_increasing.html">discussed it</a> among many others. They noticed a potential bias in the analysis and did some re-analysis. Just yesterday <a href="http://noahpinionblog.blogspot.com/2015/11/gelman-vs-case-deaton-academics-vs.html">Noah Smith</a> wrote a piece about academics versus blogs and how many academics are taken by surprise when they see their paper being discussed so rapidly on the internet. Much of the debate comes down to the speed, tone, and ferocity of internet discussion of academic work - along with the fact that sometimes it isn’t fully fleshed out.</p> <p>I have been seeing this play out not just in the case of this specific paper, but many times that folks have been confronted with blogs or the quick publication process of <a href="http://f1000research.com/">f1000Research</a>. 
I think it is pretty scary for folks who aren’t used to “internet speed” to see this play out and I thought it would be helpful to make a few points.</p> <ol> <li><strong>Everyone is an internet scientist now.</strong> The internet has arrived as part of academics and if you publish a paper that is of interest (or if you are a Nobel prize winner, or if you dispute a claim, etc.) you will see discussion of that paper within a day or two on the blogs. This is now a fact of life.</li> <li><strong>The internet loves a fight</strong>. The internet responds best to personal/angry blog posts or blog posts about controversial topics like p-values, errors, and bias. Almost certainly if someone writes a blog post about your work or an f1000 paper it will be about an error/bias/correction or something personal.</li> <li><strong>Takedowns are easier than new research and happen faster</strong>. It is much, much easier to critique a paper than to design an experiment, collect data, figure out what question to ask, ask it quantitatively, analyze the data, and write it up. This doesn’t mean the critique won’t be good/right it just means it will happen much much faster than it took you to publish the paper because it is easier to do. All it takes is noticing one little bug in the code or one error in the regression model. So be prepared for speed in the response.</li> </ol> <p>In light of these three things, you have a couple of options about how to react if you write an interesting paper and people are discussing it - which they will certainly do (point 1), in a way that will likely make you uncomfortable (point 2), and faster than you’d expect (point 3). The first thing to keep in mind is that the internet wants you to “fight back” and wants to declare a “winner”. Reading about amicable disagreements doesn’t build audience. That is why there is reality TV. So there will be pressure for you to score points, be clever, be fast, and refute every point or be declared the loser. 
I have found from my own experience that this is what I feel like doing too. I think that resisting this urge is both (a) very very hard and (b) the right thing to do. I find the best solution is to be proud of your work, but be humble, because no paper is perfect and that’s ok. If you do the best you can, sensible people will acknowledge that.</p> <p>I think these are the three ways to respond to rapid internet criticism of your work.</p> <ul> <li><strong>Option 1: Respond on internet time.</strong> This means if you publish a big paper that you think might be controversial you should block off a day or two to spend time on the internet responding. You should be ready to do new analysis quickly, be prepared to admit mistakes quickly if they exist, and you should be prepared to make it clear when there aren’t any. You will need social media accounts and you should probably have a blog so you can post longer form responses. Github/Figshare accounts make it easier to quickly share quantitative/new analyses. Again your goal is to avoid the personal and stick to facts, so I find that Twitter/Facebook are best for disseminating your more long form responses on blogs/Github/Figshare. If you are going to go this route you should try to respond to as many of the major criticisms as possible, but usually they cluster into one or two specific comments, which you can address all in one.</li> <li><strong>Option 2: Respond in academic time.</strong> You might have spent a year writing a paper only to have people respond to it essentially instantaneously. Sometimes they will have good points, but they will rarely have carefully thought out arguments given the internet-speed response (although remember point 3 that good critiques can be faster than good papers). One approach is to collect all the feedback, ignore the pressure for an immediate response, and write a careful, scientific response which you can publish in a journal or in a fast outlet like f1000Research. 
I think this route can be the most scientific and productive if executed well. But this will be hard because people will treat that like “you didn’t have a good answer so you didn’t respond immediately”. The internet wants a quick winner/loser and that is terrible for science. Even if you choose this route though, you should make sure you have a way of publicizing your well thought out response - through blogs, social media, etc. once it is done.</li> <li><strong>Option 3: Do not respond.</strong> This is what a lot of people do and I’m unsure if it is ok or not. Clearly internet facing commentary can have an impact on you/your work/how it is perceived for better or worse. So if you ignore it, you are ignoring those consequences. This may be ok, but depending on the severity of the criticism may be hard to deal with and it may mean that you have a lot of questions to answer later. Honestly, I think as time goes on if you write a big paper under a lot of scrutiny Option 3 is going to go away.</li> </ul> <p>All of this only applies if you write a paper that a ton of people care about/is controversial. Many technical papers won’t have this issue and if you keep your claims small, this also probably won’t apply. But I thought it was useful to try to work out how to act under this “new normal”.</p> Prediction Markets for Science: What Problem Do They Solve? 2015-11-10T20:29:19+00:00 http://simplystats.github.io/2015/11/10/prediction-markets-for-science-what-problem-do-they-solve <p>I’ve recently seen a bunch of press on <a href="http://www.pnas.org/content/early/2015/11/04/1516179112.abstract">this paper</a>, which describes an experiment with developing a prediction market for scientific results. 
From FiveThirtyEight:</p> <blockquote> <p>Although <a href="http://fivethirtyeight.com/datalab/psychology-is-starting-to-deal-with-its-replication-problem/">replication is essential for verifying results</a>, the <a href="http://fivethirtyeight.com/features/science-isnt-broken/">current scientific culture does little to encourage it in most fields</a>. That’s a problem because it means that misleading scientific results, like those from the “shades of gray” study, <a href="http://pss.sagepub.com/content/22/11/1359.short?rss=1&amp;ssource=mfr">could be common in the scientific literature</a>. Indeed, a 2005 study claimed that <a href="http://journals.plos.org/plosmedicine/article?id=10.1371/journal.pmed.0020124">most published research findings are false.</a></p> <p>[…]</p> <p>The researchers began by selecting some studies slated for replication in the <a href="https://osf.io/ezcuj/wiki/home/">Reproducibility Project: Psychology</a> — a project that aimed to reproduce 100 studies published in three high-profile psychology journals in 2008. They then recruited psychology researchers to take part in <a href="https://osf.io/yjmht/">two prediction markets</a>. These are the same types of markets that people use <a href="http://www.nytimes.com/2015/10/24/upshot/betting-markets-call-marco-rubio-front-runner-in-gop.html?_r=0">to bet on who’s going to be president</a>. In this case, though, researchers were betting on whether a study would replicate or not.</p> </blockquote> <p>There are all kinds of prediction markets these days–for politics, general ideas–so having one for scientific ideas is not too controversial. But I’m not sure I see exactly what problem is solved by having a prediction market for science. In the paper, they claim that the market-based bets were better predictors than the general survey that was administered to the scientists. 
I’ll admit that’s an interesting result, but I’m not yet convinced.</p> <p>First off, it’s worth noting that this work comes out of the massive replication project conducted by the Center for Open Science, where I believe they <a href="http://simplystatistics.org/2015/10/01/a-glass-half-full-interpretation-of-the-replicability-of-psychological-science/">have a</a> <a href="http://simplystatistics.org/2015/10/20/we-need-a-statistically-rigorous-and-scientifically-meaningful-definition-of-replication/">fundamentally flawed definition of replication</a>. So I’m not sure I can really agree with the idea of basing a prediction market on such a definition, but I’ll let that go for now.</p> <p>The purpose of most markets is some general notion of “price discovery”. One popular market is the stock market and I think it’s instructive to see how that works. Basically, people continuously bid on the shares of certain companies and markets keep track of all the bids/offers and the completed transactions. If you are interested in finding out what people are willing to pay for a share of Apple, Inc., then it’s probably best to look at…what people are willing to pay. That’s exactly what the stock market gives you. You only run into trouble when there’s no liquidity, so no one shows up to bid/offer, but that would be a problem for any market.</p> <p>Now, suppose you’re interested in finding out what the “true fundamental value” of Apple, Inc. is. Some people think the stock market gives you that at every instant, while <a href="http://www.econ.yale.edu/~shiller/">others</a> think that the stock market can behave irrationally for long periods of time. Perhaps in the very long run, you get a sense of the fundamental value of a company, but that may not be useful information at that point.</p> <p>What does the market for scientific hypotheses give you? Well, it would be one thing if granting agencies participated in the market. Then, we would never have to write grant applications. 
The granting agencies could then signal what they’d be willing to pay for different ideas. But that’s not what we’re talking about.</p> <p>Here, we’re trying to get at whether a given hypothesis is <em>true or not</em>. The only real way to get information about that is to conduct an experiment. How many people betting in the markets will have conducted an experiment? Likely the minority, given that the whole point is to save money by not having people conduct experiments investigating hypotheses that are likely false.</p> <p>But if market participants aren’t contributing real information about an hypothesis, what are they contributing? Well, they’re contributing their <em>opinion</em> about an hypothesis. How is that related to science? I’m not sure. Of course, participants could be experts in the field (although not necessarily) and so their opinions will be informed by past results. And ultimately, it’s consensus amongst scientists that determines, after repeated experiments, whether an hypothesis is true or not. But at the early stages of investigation, it’s not clear how valuable people’s opinions are.</p> <p>In a way, this reminds me of a time a while back when the EPA was soliciting “expert opinion” about the health effects of outdoor air pollution, as if that were a reasonable substitute for collecting actual data on the topic. At least it cost less money–just the price of a conference call.</p> <p>There’s a version of this playing out in the health tech market right now. Companies like <a href="http://simplystatistics.org/2015/10/28/discussion-of-the-theranos-controversy-with-elizabeth-matsui/">Theranos</a> and 23andMe are selling health products that they claim are better than some current benchmark. In particular, Theranos claims its blood tests are accurate when only using a tiny sample of blood. Is this claim true or not? 
No one outside Theranos knows for sure, but we can look to the financial markets.</p> <p>Theranos can point to the marketplace and show that people are willing to pay for its products. Indeed, the $9 billion valuation of the private company is another indicator that people…highly value the company. But ultimately, <em>we still don’t know if their blood tests are accurate</em> because we don’t have any data. If we were to go by the financial markets alone, we would necessarily conclude that their tests are good, because why else would anyone invest so much money in the company?</p> <p>I think there may be a role to play for prediction markets in science, but I’m not sure discovering the truth about nature is one of them.</p> Biostatistics: It's not what you think it is 2015-11-09T10:00:20+00:00 http://simplystats.github.io/2015/11/09/biostatistics-its-not-what-you-think-it-is <p><a href="http://www.hsph.harvard.edu/biostatistics">My department</a> recently sent me on a recruitment trip for our graduate program. I had the opportunity to chat with undergrads interested in pursuing a career related to data analysis. I found that several did not know about the existence of Departments of <em>Biostatistics</em> and most of the rest thought <em>Biostatistics</em> was the study of clinical trials. We <a href="http://simplystatistics.org/2012/08/14/statistics-statisticians-need-better-marketing/">have</a> <a href="http://simplystatistics.org/2011/11/02/we-need-better-marketing/">posted</a> on the need for better marketing for Statistics, but Biostatistics needs it even more. So this post is for students who are considering a career as applied statisticians or data scientists and are looking at PhD programs.</p> <p>There are dozens of Biostatistics departments and most run PhD programs. As an undergraduate, you may have never heard of them because they are usually in schools that undergrads don’t regularly frequent: Public Health and Medicine. 
However, they are very active in research and teaching graduate students. In fact, the 2014 US News &amp; World Report <a href="http://US News and R">ranking of Statistics Departments</a> includes three Biostat departments in the top five spots. Although clinical trials are a popular area of interest in these departments, there are now many other areas of research. With so many fields of science shifting to data intensive research, Biostatistics has adapted to work in these areas. Today pretty much any Biostat department will have people working on projects related to genetics, genomics, computational biology, electronic medical records, neuroscience, environmental sciences, epidemiology, health-risk analysis, and clinical decision making. Through collaborations, academic biostatisticians have early access to the cutting edge datasets produced by public health scientists and biomedical researchers. Our research usually revolves around either developing statistical methods that are used by researchers working in these fields or working directly with a collaborator in data-driven discovery.</p> <p><strong>How is it different from Statistics? </strong>In the grand scheme of things, they are not very different. As implied by the name, Biostatisticians focus on data related to biology while statisticians tend to be more general. However, the underlying theory and skills we learn are similar. In my view, the major difference is that Biostatisticians, in general, tend to be more interested in data and the subject matter, while in Statistics Departments more emphasis is given to the mathematical theory.</p> <p><strong>What type of job can I get with a PhD in Biostatistics? </strong><a href="http://fortune.com/2015/04/27/best-worst-graduate-degrees-jobs/">A well paying one</a>. And you will have many options to choose from. Our graduates tend to go to academia, industry or government. 
Also, the <strong>Bio </strong>in the name does not keep our graduates from landing non-bio related jobs, such as in high tech. The reason for this is that the training our students receive and what they learn from research experiences can be widely applied to data analysis challenges.</p> <p><strong>How should I prepare if I want to apply to a PhD program?</strong> First you need to decide if you are going to like it. One way to do this is to participate in one of the <a href="http://www.nhlbi.nih.gov/research/training/summer-institute-biostatistics-t15">summer programs</a> where you get a glimpse of what we do. My department runs <a href="http://www.hsph.harvard.edu/biostatistics/diversity/summer-program/">one of these as well</a>. However, as an undergrad I would mainly focus on courses. Undergraduate research experiences are a good way to get an idea of what it’s like, but it is difficult to do real research unless you can set aside several hours a week for several consecutive months. This is difficult as an undergrad because you have to make sure to do well in your courses, prepare for the GRE, and get a solid mathematical and computing foundation in order to conduct research later. This is why these programs are usually in the summer. If you decide to apply to a PhD program, I recommend you take advanced math courses such as Real Analysis and Matrix Algebra. If you plan to develop software for complex datasets, I recommend CS courses that cover algorithms and optimization. Note that programming skills are not the same thing as the theory taught in these CS courses. Programming skills in R will serve you well if you plan to analyze data regardless of what academic route you follow. Python and a low-level language such as C++ are more powerful languages that many biostatisticians use these days.</p> <p>I think the demand for well-trained researchers that can make sense of data will continue to be on the rise. 
If you want a fulfilling job where you analyze data for a living, you should consider a PhD in Biostatistics.</p> Not So Standard Deviations: Episode 4 - A Gajillion Time Series 2015-11-07T11:46:49+00:00 http://simplystats.github.io/2015/11/07/not-so-standard-deviations-episode-4-a-gajillion-time-series <p>Episode 4 of Not So Standard Deviations is hot off the audio editor. In this episode Hilary first explains to me what the heck DevOps is and then we talk about the statistical challenges in detecting rare events in an enormous set of time series data. There’s also some discussion of Ben and Jerry’s and the t-test, so you’ll want to hang on for that.</p> <p>Notes:</p> <ul> <li><a href="https://goo.gl/259VKI">Nobody Loves Graphite Anymore</a></li> <li><a href="http://goo.gl/zB7wM9">A response</a></li> <li><a href="https://goo.gl/7PgLKY">Why Gosset is awesome</a></li> </ul> <p> </p> How I decide when to trust an R package 2015-11-06T13:41:02+00:00 http://simplystats.github.io/2015/11/06/how-i-decide-when-to-trust-an-r-package <p>One thing that I’ve given a lot of thought to recently is the process that I use to decide whether I trust an R package or not. Kasper Hansen took a break from <a href="https://twitter.com/KasperDHansen/status/657589509975076864">trolling me</a> <a href="https://twitter.com/KasperDHansen/status/621315346633519104">on Twitter</a> to talk about how he trusts packages on Github less than packages that are on CRAN and particularly Bioconductor. He makes a couple of points that I think are very relevant. First, that having a package on CRAN/Bioconductor raises trust in that package:</p> <blockquote class="twitter-tweet" width="550"> <p lang="en" dir="ltr"> .<a href="https://twitter.com/michaelhoffman">@michaelhoffman</a> But it's not on Bioconductor or CRAN. This decreases trust substantially. 
</p> <p> &mdash; Kasper Daniel Hansen (@KasperDHansen) <a href="https://twitter.com/KasperDHansen/status/659777449098637312">October 29, 2015</a> </p> </blockquote> <p>The primary reason is that being on Bioc/CRAN demonstrates something about the developer’s willingness to do the boring but critically important parts of package development like documentation, vignettes, minimum coding standards, and being sure that their code isn’t just a rehash of something else. The other big point Kasper made was the difference between a repository - which is user oriented and should provide certain guarantees - and Github - which is a developer platform and makes things easier/better for developers but doesn’t have a user guarantee system in place.</p> <blockquote class="twitter-tweet" width="550"> <p lang="en" dir="ltr"> .<a href="https://twitter.com/StrictlyStat">@StrictlyStat</a> CRAN is a repository, not a development platform. It is user oriented, not developer oriented. GH is the reverse. </p> <p> &mdash; Kasper Daniel Hansen (@KasperDHansen) <a href="https://twitter.com/KasperDHansen/status/661746848437243904">November 4, 2015</a> </p> </blockquote> <p>This discussion got me thinking about when/how I depend on R packages and how I make that decision. The scenarios where I depend on R packages are:</p> <ol> <li>Quick and dirty analyses for myself</li> <li>Shareable data analyses that I hope are reproducible</li> <li>As dependencies of R packages I maintain</li> </ol> <p>As you move from 1 to 3 it is more and more of a pain if the package I’m depending on breaks. If it is just something I was doing for fun, it’s not that big of a deal. But if it means I have to rewrite/recheck/rerelease my R package then that is a much bigger headache.</p> <p>So my scale for how stringent I am about relying on packages varies by the type of activity, but what are the criteria I use to measure how trustworthy a package is? 
For me, the criteria are in this order:</p> <ol> <li><strong>People prior </strong></li> <li><strong>Forced competence</strong></li> <li><strong>Indirect data</strong></li> </ol> <p>I’ll explain each criterion in a minute, but the main purpose of using these criteria is (a) to ensure that I’m using a package that works and (b) to ensure that if the package breaks I can trust it will be fixed or at least I can get some help from the developer.</p> <p><strong>People prior</strong></p> <p>The first thing I do when I look at a package I might depend on is look at who the developer is. If that person is someone I know has developed widely used, reliable software and who quickly responds to requests/feedback then I immediately trust the package. I have a list of people like <a href="https://en.wikipedia.org/wiki/Brian_D._Ripley">Brian</a>, or <a href="https://github.com/hadley">Hadley</a>, or <a href="https://github.com/jennybc">Jenny</a>, or <a href="http://rafalab.dfci.harvard.edu/index.php/software-and-data">Rafa</a>, who could post their package just as a link to their website and I would trust it. It turns out almost all of these folks end up putting their packages on CRAN/Bioconductor anyway. But even if they didn’t I assume that the reason is either (a) the package is very new or (b) they have a really good reason for not distributing it through the normal channels.</p> <p><strong>Forced competence</strong></p> <p>For people who I don’t know about or whose software I’ve never used, I have very little confidence in the package a priori. This is because there are a ton of people developing R packages now with highly variable levels of commitment to making them work. So as a placeholder for all the variables I don’t know about them, I use the repository they choose as a surrogate. 
My personal prior on the trustworthiness of a package from someone I don’t know goes something like:</p> <p><a href="http://simplystatistics.org/wp-content/uploads/2015/11/Screen-Shot-2015-11-06-at-1.25.01-PM.png"><img class="aligncenter wp-image-4410 size-full" src="http://simplystatistics.org/wp-content/uploads/2015/11/Screen-Shot-2015-11-06-at-1.25.01-PM.png" alt="Screen Shot 2015-11-06 at 1.25.01 PM" width="843" height="197" srcset="http://simplystatistics.org/wp-content/uploads/2015/11/Screen-Shot-2015-11-06-at-1.25.01-PM-300x70.png 300w, http://simplystatistics.org/wp-content/uploads/2015/11/Screen-Shot-2015-11-06-at-1.25.01-PM-260x61.png 260w, http://simplystatistics.org/wp-content/uploads/2015/11/Screen-Shot-2015-11-06-at-1.25.01-PM.png 843w" sizes="(max-width: 843px) 100vw, 843px" /></a></p> <p>This prior is based on the idea of forced competence. In general, you have to do more to get a package approved on Bioconductor than on CRAN (for example you have to have a good vignette) and you have to do more to get a package on CRAN (pass R CMD check and survive the review process) than to put it on Github.</p> <p>This prior isn’t perfect, but it does tell me something about how much the person cares about their package. If they go to the work of getting it on CRAN/Bioc, then at least they cared enough to document it. They are at least forced to be minimally competent - at least at the time of submission and enough for the packages to still pass checks.</p> <p><strong>Indirect data</strong></p> <p>After I’ve applied my priors I then typically look at the data. For Bioconductor I look at the badges, like how often it is downloaded, whether it passes the checks, and how well it is covered by tests. I’m already inclined to trust it a bit since it is on that platform, but I use the data to adjust my prior a bit. For CRAN I might look at the <a href="http://cran-logs.rstudio.com/">download stats</a> provided by RStudio. 
The interesting thing is that, as John Muschelli points out, Github actually has the most indirect data available for a package:</p> <blockquote class="twitter-tweet" width="550"> <p lang="en" dir="ltr"> .<a href="https://twitter.com/KasperDHansen">@KasperDHansen</a> Flipside: CRAN has no issue pages, stars/ratings, outdated limits on size, and limited development cycle/turnover. </p> <p> &mdash; John Muschelli (@StrictlyStat) <a href="https://twitter.com/StrictlyStat/status/661746348409114624">November 4, 2015</a> </p> </blockquote> <p>If I’m going to use a package that is on Github from a person who isn’t on my prior list of people to trust then I look at a few things. The number of stars/forks/watchers is one thing that is a quick and dirty estimate of how used a package is. I also look very carefully at how many commits the person has submitted to both the package in question and in general all other packages over the last couple of months. If the person isn’t actively developing either the package or anything else on Github, that is a bad sign. I also look to see how quickly they have responded to issues/bug reports on the package in the past if possible. One idea I haven’t used but I think is a good one is to submit an issue for a trivial change to the package and see if I get a response very quickly. Finally I look and see if they have some demonstration that their package works across platforms (say with a <a href="https://travis-ci.org/">Travis badge</a>). If the package is highly starred and frequently maintained, all issues are responded to and up to date, and it passes checks on all platforms, then that data might overwhelm my prior and I’d go ahead and trust the package.</p> <p><strong>Summary</strong></p> <p>In general one of the best things about the R ecosystem is being able to rely on other packages so that you don’t have to write everything from scratch. But there is a hard balance to strike with keeping the dependency list small. 
One way I maintain this balance is using the strategy I’ve outlined to worry less about trustworthy dependencies.</p> The Statistics Identity Crisis: Am I a Data Scientist 2015-10-30T14:21:08+00:00 http://simplystats.github.io/2015/10/30/the-statistics-identity-crisis-am-i-a-data-scientist <p>The joint ASA/Simply Statistics webinar on the statistics identity crisis is now live!</p> Faculty/postdoc job opportunities in genomics across Johns Hopkins 2015-10-30T10:33:06+00:00 http://simplystats.github.io/2015/10/30/facultypostdoc-job-opportunities-in-genomics-across-johns-hopkins <p>It’s pretty exciting to be in genomics at Hopkins right now with three new Bloomberg professors in genomics areas, a ton of stellar junior faculty, and a really fun group of students/postdocs. If you want to get in on the action here is a non-comprehensive list of great opportunities.</p> <h2 id="faculty-jobs"><span style="text-decoration: underline;"><strong>Faculty Jobs</strong></span></h2> <p><strong>Job: </strong>Multiple tenure track faculty positions in all areas including in genomics</p> <p><strong>Department: </strong> Biostatistics</p> <p><strong>To apply</strong>: <a href="http://www.jhsph.edu/departments/biostatistics/_docs/faculty-ad-2016-combined-large-final.pdf">http://www.jhsph.edu/departments/biostatistics/_docs/faculty-ad-2016-combined-large-final.pdf</a></p> <p><strong>Deadline:</strong> Review ongoing</p> <p><strong>Job:</strong> Tenure track position in data intensive biology</p> <p><strong>Department: </strong> Biology</p> <p><strong>To apply</strong>: <a href="http://apply.interfolio.com/31146">http://apply.interfolio.com/31146</a></p> <p><strong>Deadline: </strong>Nov 1st and ongoing</p> <p><strong>Job:</strong> Tenure track positions in bioinformatics, with focus on proteomics or sequencing data analysis</p> <p><strong>Department: </strong> Oncology Biostatistics</p> <p><strong>To apply</strong>: <a 
href="https://www.research-it.onc.jhmi.edu/DBB/PhD_Statistician.pdf">https://www.research-it.onc.jhmi.edu/DBB/PhD_Statistician.pdf</a></p> <p><strong>Deadline:</strong> Review ongoing</p> <p> </p> <h2 id="postdoc-jobs"><span style="text-decoration: underline;"><strong>Postdoc Jobs</strong></span></h2> <p><strong>Job:</strong> Postdoc(s) in statistical methods/software development for RNA-seq</p> <p><strong>Employer: </strong> Jeff Leek</p> <p><strong>To apply</strong>: email Jeff (<a href="http://jtleek.com/jobs/">http://jtleek.com/jobs/</a>)</p> <p><strong>Deadline:</strong> Review ongoing</p> <p><strong>Job:</strong> Data scientist for integrative genomics in the human brain (MS/PhD)</p> <p><strong>Employer: </strong> Andrew Jaffe</p> <p><strong>To apply</strong>: email Andrew (<a href="http://www.aejaffe.com/jobs.html">http://www.aejaffe.com/jobs.html</a>)</p> <p><strong>Deadline:</strong> Review ongoing</p> <p><strong>Job:</strong> Research associate for genomic data processing and analysis (BA+)</p> <p><strong>Employer: </strong> Andrew Jaffe</p> <p><strong>To apply</strong>: email Andrew (<a href="http://www.aejaffe.com/jobs.html">http://www.aejaffe.com/jobs.html</a>)</p> <p><strong>Deadline:</strong> Review ongoing</p> <p><strong>Job:</strong> PhD developing scalable software and algorithms for analyzing sequencing data</p> <p><strong>Employer: </strong> Ben Langmead</p> <p><strong>To apply</strong>:  http://www.cs.jhu.edu/graduate-studies/phd-program/</p> <p><strong>Deadline:</strong> See site</p> <p><strong>Job:</strong> Postdoctoral researcher developing scalable software and algorithms for analyzing sequencing data</p> <p><strong>Employer: </strong> Ben Langmead</p> <p><strong>To apply</strong>:  email Ben (<a href="http://www.langmead-lab.org/open-positions/">http://www.langmead-lab.org/open-positions/</a>)</p> <p><strong>Deadline:</strong> Review ongoing</p> <p><strong>Job:</strong> Postdoctoral researcher developing algorithms for challenging problems 
in large-scale genomics whole-genome assembly, RNA-seq analysis, and microbiome analysis</p> <p><strong>Employer: </strong> Steven Salzberg</p> <p><strong>To apply</strong>:  email Steven (<a href="http://salzberg-lab.org/">http://salzberg-lab.org/</a>)</p> <p><strong>Deadline:</strong> Review ongoing</p> <p><strong>Job:</strong> Research associate for genomic data processing and analysis (BA+) in cancer</p> <p><strong>Employer: </strong> Luigi Marchionni (with Don Geman)</p> <p><strong>To apply</strong>:  email Luigi (<a href="http://luigimarchionni.org/">http://luigimarchionni.org/</a>)</p> <p><strong>Deadline:</strong> Review ongoing</p> <p><strong>Job: </strong>Postdoctoral researcher developing algorithms for biomarkers development and precision medicine application in cancer</p> <p><strong>Employer: </strong> Luigi Marchionni (with Don Geman)</p> <p><strong>To apply</strong>:  email Luigi (<a href="http://luigimarchionni.org/">http://luigimarchionni.org/</a>)</p> <p><strong>Deadline:</strong> Review ongoing</p> <p><strong>Job:</strong> Postdoctoral researcher developing methods in machine learning, genomics, and regulatory variation</p> <p><strong>Employer: </strong> Alexis Battle</p> <p><strong>To apply</strong>:  email Alexis (<a href="http://battlelab.jhu.edu/join_us.html">http://battlelab.jhu.edu/join_us.html</a>)</p> <p><strong>Deadline:</strong> Review ongoing</p> <p><strong>Job: </strong>Postdoctoral fellow with interests in biomarker discovery for Alzheimer’s disease</p> <p><strong>Employer: </strong> Madhav Thambisetty / Ingo Ruczinski</p> <p><strong>To apply</strong>: <a href="http://www.alzforum.org/jobs/postdoctoral-research-fellow-alzheimers-disease-biomarkers"> http://www.alzforum.org/jobs/postdoctoral-research-fellow-alzheimers-disease-biomarkers</a></p> <p><strong>Deadline:</strong> Review ongoing</p> <p><strong>Job: </strong>Postdoctoral positions for research in the interface of statistical genetics, precision medicine and big data</p> 
<p><strong>Employer: </strong> Nilanjan Chatterjee</p> <p><strong>To apply</strong>:  <a href="http://www.jhsph.edu/departments/biostatistics/_docs/postdoc-ad-chatterjee.pdf">http://www.jhsph.edu/departments/biostatistics/_docs/postdoc-ad-chatterjee.pdf</a></p> <p><strong>Deadline:</strong> Review ongoing</p> <p><strong>Job: </strong>Postdoctoral research developing algorithms and software for time course pattern detection in genomics data</p> <p><strong>Employer: </strong> Elana Fertig</p> <p><strong>To apply</strong>:  email Elana (ejfertig@jhmi.edu)</p> <p><strong>Deadline:</strong> Review ongoing</p> <p><strong>Job: </strong>Postdoctoral fellow to develop novel methods for large-scale DNA and RNA sequence analysis related to human and/or plant genetics, such as developing methods for discovering structural variations in cancer or for assembling and analyzing large complex plant genomes.</p> <p><strong>Employer: </strong> Mike Schatz</p> <p><strong>To apply</strong>:  email Mike (<a href="http://schatzlab.cshl.edu/apply/">http://schatzlab.cshl.edu/apply/</a>)</p> <p><strong>Deadline:</strong> Review ongoing</p> <h2 id="students"><span style="text-decoration: underline;"><strong>Students</strong></span></h2> <p>We are all always on the hunt for good Ph.D. students. At Hopkins students are admitted to specific departments. So if you find a faculty member you want to work with, you can apply to their department. Here are the application details for the various departments admitting students to work on genomics:<a href="https://ccb.jhu.edu/students.shtml"> https://ccb.jhu.edu/students.shtml</a></p> <p> </p> <p> </p> <p> </p> The statistics identity crisis: am I really a data scientist? 
2015-10-29T13:32:13+00:00 http://simplystats.github.io/2015/10/29/the-statistics-identity-crisis-am-i-really-a-data-scientist <p> </p> <p> </p> <p> </p> <p><a href="http://simplystatistics.org/wp-content/uploads/2015/10/crisis.png"><img class="aligncenter wp-image-4397" src="http://simplystatistics.org/wp-content/uploads/2015/10/crisis-300x75.png" alt="crisis" width="508" height="127" srcset="http://simplystatistics.org/wp-content/uploads/2015/10/crisis-300x75.png 300w, http://simplystatistics.org/wp-content/uploads/2015/10/crisis-260x65.png 260w, http://simplystatistics.org/wp-content/uploads/2015/10/crisis.png 720w" sizes="(max-width: 508px) 100vw, 508px" /></a></p> <p> </p> <p><em>Tl;dr: We will host a Google Hangout of our popular JSM session October 30th 2-4 PM EST. </em></p> <p> </p> <p>I organized a session at JSM 2015 called <em>“The statistics identity crisis: am I really a data scientist?”</em> The session turned out to be pretty popular:</p> <blockquote class="twitter-tweet" width="550"> <p lang="en" dir="ltr"> Packed room of statisticians with identity crises at <a href="https://twitter.com/hashtag/JSM2015?src=hash">#JSM2015</a> session: are we really data scientists? <a href="http://t.co/eLsGosoTCt">pic.twitter.com/eLsGosoTCt</a> </p> <p> &mdash; Dr Ruth Etzioni (@retzioni) <a href="https://twitter.com/retzioni/status/631134032357502978">August 11, 2015</a> </p> </blockquote> <p>but it turns out not everyone fit in the room:</p> <blockquote class="twitter-tweet" width="550"> <p lang="en" dir="ltr"> This is the closest I can get to <a href="https://twitter.com/statpumpkin">@statpumpkin</a>'s talk. <a href="https://twitter.com/hashtag/jsm2015?src=hash">#jsm2015</a> still had no clue how to predict session attendance. 
<a href="http://t.co/gTb4OqdAo3">pic.twitter.com/gTb4OqdAo3</a> </p> <p> &mdash; sandy griffith (@sgrifter) <a href="https://twitter.com/sgrifter/status/631134590229442560">August 11, 2015</a> </p> </blockquote> <p>Thankfully, Steve Pierson at the ASA had the awesome idea to re-run the session for people who couldn’t be there. So we will be hosting a Google Hangout with the following talks:</p> <table width="100%" cellspacing="0" cellpadding="4" bgcolor="white"> <tr> <td align="right" valign="top" width="110"> </td> <td> <a href="https://www.amstat.org/meetings/jsm/2015/onlineprogram/AbstractDetails.cfm?abstractid=314339">'Am I a Data Scientist?': The Applied Statistics Student's Identity Crisis</a> — <b>Alyssa Frazee, Stripe</b> </td> </tr> <tr> <td align="right" valign="top" width="110"> </td> <td> <a href="https://www.amstat.org/meetings/jsm/2015/onlineprogram/AbstractDetails.cfm?abstractid=314376">How Industry Views Data Science Education in Statistics Departments</a> — <b>Chris Volinsky, AT&amp;T</b> </td> </tr> <tr> <td align="right" valign="top" width="110"> </td> <td> <a href="https://www.amstat.org/meetings/jsm/2015/onlineprogram/AbstractDetails.cfm?abstractid=314414">Evaluating Data Science Contributions in Teaching and Research</a> — <b>Lance Waller, Emory University</b> </td> </tr> <tr> <td align="right" valign="top" width="110"> </td> <td> <a href="https://www.amstat.org/meetings/jsm/2015/onlineprogram/AbstractDetails.cfm?abstractid=314641">Teach Data Science and They Will Come</a> — <b>Jennifer Bryan, The University of British Columbia</b> </td> </tr> </table> <p>You can watch it on Youtube or Google Plus. Here is the link:</p> <p>https://plus.google.com/events/chuviltukohj2inbqueap9h7228</p> <p>The session will be held October 30th (tomorrow!) from 2-4PM EST. 
You can watch it live and discuss the talks using the hashtag <a href="https://twitter.com/search?q=%23jsm2015">#jsm2015</a> or you can watch later as the video will remain on Youtube.</p> Discussion of the Theranos Controversy with Elizabeth Matsui 2015-10-28T14:54:50+00:00 http://simplystats.github.io/2015/10/28/discussion-of-the-theranos-controversy-with-elizabeth-matsui <p>Theranos is a Silicon Valley diagnostic testing company that has been in the news recently. The story of Theranos has fascinated me because I think it represents a perfect collision of the tech startup culture and the health care culture and how combining them together can generate unique problems.</p> <p>I talked with Elizabeth Matsui, a Professor of Pediatrics in the Division of Allergy and Immunology here at Johns Hopkins, to discuss Theranos, the realities of diagnostic testing, and the unique challenges that a health-tech startup faces with respect to doing good science and building products people want to buy.</p> <p>Notes:</p> <ul> <li>Original <a href="http://www.wsj.com/articles/theranos-has-struggled-with-blood-tests-1444881901">Wall Street Journal story</a> on Theranos (paywalled)</li> <li>Related stories in <a href="http://www.wired.com/2015/10/theranos-scandal-exposes-the-problem-with-techs-hype-cycle/">Wired</a> and NYT’s <a href="http://www.nytimes.com/2015/10/28/business/dealbook/theranos-under-fire.html">Dealbook</a> (not paywalled)</li> <li>Theranos <a href="https://www.theranos.com/news/posts/custom/theranos-facts">response</a> to WSJ story</li> </ul> <iframe width="100%" height="166" scrolling="no" frameborder="no" src="https://w.soundcloud.com/player/?url=https%3A//api.soundcloud.com/tracks/230510705%3Fsecret_token%3Ds-WbZX8&amp;color=ff5500&amp;auto_play=false&amp;hide_related=false&amp;show_comments=true&amp;show_user=true&amp;show_reposts=false"></iframe> Not So Standard Deviations: Episode 3 - Gilmore Girls 2015-10-24T23:17:18+00:00 
http://simplystats.github.io/2015/10/24/not-so-standard-deviations-episode-3-gilmore-girls <p>I just uploaded Episode 3 of <a href="https://soundcloud.com/nssd-podcast">Not So Standard Deviations</a> so check your feeds. In this episode Hilary and I talk about our jobs and the life of the data scientist in both academia and the tech industry. It turns out that they’re not as different as I would have thought.</p> <p><a href="https://api.soundcloud.com/tracks/229957578/download?client_id=02gUJC0hH2ct1EGOcYXQIzRFU91c72Ea&amp;oauth_token=1-138878-174789515-deb24181d01af">Download the audio file for this episode</a>.</p> We need a statistically rigorous and scientifically meaningful definition of replication 2015-10-20T10:05:22+00:00 http://simplystats.github.io/2015/10/20/we-need-a-statistically-rigorous-and-scientifically-meaningful-definition-of-replication <p>Replication and confirmation are indispensable concepts that help define scientific facts.  However, the way in which we reach scientific consensus on a given finding is rather complex. Although <a href="http://simplystatistics.org/2015/06/24/how-public-relations-and-the-media-are-distorting-science/">some press releases try to convince us otherwise</a>, rarely is one publication enough. In fact, most published results go unnoticed and no attempts to replicate them are made.  These are not debunked either; they simply get discarded to the dustbin of history. The very few results that garner enough attention for others to spend time and energy on them are assessed by an ad-hoc process involving a community of peers. The assessments are usually a combination of deductive reasoning, direct attempts at replication, and indirect checks obtained by attempting to build on the result in question.  This process eventually leads to a result either being accepted by consensus or not. 
For particularly important cases, an official scientific consensus report may be commissioned by a national academy or an established scientific society. Examples of results that have become part of the scientific consensus in this way include smoking causing lung cancer, HIV causing AIDS, and climate change being caused by humans. In contrast, the published result that vaccines cause autism has been thoroughly debunked by several follow-up studies. In none of these four cases was a simple definition of replication used to confirm or falsify a result. The same is true for most results for which there is consensus. Yet science moves on, and continues to be an incomparable force for improving our quality of life.</p> <p>Regulatory agencies, such as the FDA, are an exception since they clearly spell out a <a href="http://www.fda.gov/downloads/Drugs/.../Guidances/ucm078749.pdf">definition</a> of replication. For example, to approve a drug they may require two independent clinical trials, adequately powered, to show statistical significance at some predetermined level. They also require a large enough effect size to justify the cost and potential risks associated with treatment. This is not to say that FDA approval is equivalent to scientific consensus, but they do provide a clear-cut definition of replication.</p> <p>In response to a growing concern over a <em><a href="http://www.nature.com/news/reproducibility-1.17552">reproducibility crisis</a></em>, projects such as the <a href="http://osc.centerforopenscience.org/">Open Science Collaboration</a> have begun systematically trying to replicate published results. 
In a <a href="http://simplystatistics.org/2015/10/01/a-glass-half-full-interpretation-of-the-replicability-of-psychological-science/">recent post</a>, Jeff described one of their <a href="http://www.sciencemag.org/content/349/6251/aac4716">recent papers</a> on estimating the reproducibility of psychological science (they really mean replicability; see note below). This Science paper led to lay press reports with eye-catching headlines such as “only 36% of psychology experiments replicate”. Note that the 36% figure comes from a definition of replication that mimics the definition used by regulatory agencies: results are considered replicated if a p-value &lt; 0.05 was reached in both the original study and the replicated one. Unfortunately, this definition ignores both effect size and statistical power. If power is not controlled, then the expected proportion of correct findings that replicate can be quite small. For example, if I try to replicate the smoking-causes-lung-cancer result with a sample size of 5, there is a good chance it will not replicate. In his post, Jeff notes that for several of the studies that did not replicate, the 95% confidence intervals intersected. So should intersecting confidence intervals be our definition of replication? This too has a flaw since it favors imprecise studies with very large confidence intervals. If effect size is ignored, we may waste our time trying to replicate studies reporting practically meaningless findings. Generally defining replication for published studies is not as easy as for highly controlled clinical trials. However, one clear improvement from what is currently being done is to consider statistical power and effect sizes.</p> <p>To further illustrate this, let’s consider a very concrete example with real life consequences. Imagine a loved one has a disease with high mortality rates and asks for your help in evaluating the scientific evidence on treatments. 
Four experimental drugs are available, all with promising clinical trials resulting in p-values &lt;0.05. However, a replication project redoes the experiments and finds that only the drug A and drug B studies replicate (p&lt;0.05). So which drug do you take? Let’s give a bit more information to help you decide. Here are the p-values for both original and replication trials:</p> <table style="width: 100%;"> <tr> <td> Drug </td> <td> Original </td> <td> Replication </td> <td> Replicated </td> </tr> <tr> <td> A </td> <td> 0.0001 </td> <td> 0.001 </td> <td> Yes </td> </tr> <tr> <td> B </td> <td> &lt;0.000001 </td> <td> 0.03 </td> <td> Yes </td> </tr> <tr> <td> C </td> <td> 0.03 </td> <td> 0.06 </td> <td> No </td> </tr> <tr> <td> D </td> <td> &lt;0.000001 </td> <td> 0.10 </td> <td> No </td> </tr> </table> <p>Which drug would you take now? The information I have provided is based on p-values and therefore is missing a key piece of information: the effect sizes. Below I show the confidence intervals for all four studies (left) and four replication studies (right). Note that except for drug B, all confidence intervals intersect. In light of the figure below, which one would you choose?</p> <p><a href="http://simplystatistics.org/wp-content/uploads/2015/10/replication.png"><img class=" wp-image-4368 alignright" src="http://simplystatistics.org/wp-content/uploads/2015/10/replication.png" alt="replication" width="359" height="338" srcset="http://simplystatistics.org/wp-content/uploads/2015/10/replication-300x283.png 300w, http://simplystatistics.org/wp-content/uploads/2015/10/replication-212x200.png 212w, http://simplystatistics.org/wp-content/uploads/2015/10/replication.png 617w" sizes="(max-width: 359px) 100vw, 359px" /></a></p> <p>I would be inclined to go with drug D because it has a large effect size, a small p-value, and the replication experiment effect estimate fell inside a 95% confidence interval. 
I would definitely not go with A since it provides only marginal benefits, even if the trial found a statistically significant effect and was replicated. So the p-value based definition of replication is worthless from a practical standpoint.</p> <p>It seems that before continuing the debate over replication, and certainly before declaring that we are in a <a href="http://www.nature.com/news/reproducibility-1.17552">reproducibility crisis</a>, we need a statistically rigorous and scientifically meaningful definition of replication. This definition does not necessarily need to be dichotomous (replicated or not) and it will probably require more than one replication experiment and more than one summary statistic: one for effect size and one for uncertainty. In the meantime, we should be careful not to dismiss the current scientific process, which seems to be working rather well at either ignoring or debunking false positive results while producing useful knowledge and discovery.</p> <hr /> <p>Footnote on reproducibility versus replication: As Jeff pointed out, the cited Open Science Collaboration paper is about replication, not reproducibility. A study is considered reproducible if an independent researcher can recreate the tables and figures from the original raw data. Replication is not nearly as simple to define because it involves probability. To replicate an experiment, it has to be performed again, with a different random sample and a new set of measurement errors.</p> Theranos runs head first into the realities of diagnostic testing 2015-10-16T08:42:11+00:00 http://simplystats.github.io/2015/10/16/thorns-runs-head-first-into-the-realities-of-diagnostic-testing <p>The Wall Street Journal has published a <a href="http://www.wsj.com/articles/theranos-has-struggled-with-blood-tests-1444881901">lengthy investigation</a> into the diagnostic testing company Theranos.</p> <blockquote> <p>The company offers more than 240 tests, ranging from cholesterol to cancer. 
It claims its technology can work with just a finger prick. Investors have poured more than $400 million into Theranos, valuing it at $9 billion and her majority stake at more than half that. The 31-year-old Ms. Holmes’s bold talk and black turtlenecks draw comparisons to Apple<span class="company-name-type"> Inc.</span> cofounder Steve Jobs.</p> </blockquote> <p>If ever there were a warning sign, the comparison to Steve Jobs has got to be it.</p> <blockquote> <p>But Theranos has struggled behind the scenes to turn the excitement over its technology into reality. At the end of 2014, the lab instrument developed as the linchpin of its strategy handled just a small fraction of the tests then sold to consumers, according to four former employees.</p> <div class=" media-object wrap scope-web|mobileapps " data-layout="wrap "> One former senior employee says Theranos was routinely using the device, named Edison after the prolific inventor, for only 15 tests in December 2014. Some employees were leery about the machine’s accuracy, according to the former employees and emails reviewed by The Wall Street Journal. </div> <div class=" media-object wrap scope-web|mobileapps " data-layout="wrap "> In a complaint to regulators, one Theranos employee accused the company of failing to report test results that raised questions about the precision of the Edison system. Such a failure could be a violation of federal rules for laboratories, the former employee said. </div> </blockquote> <div class=" media-object wrap scope-web|mobileapps " data-layout="wrap "> With these kinds of stories, it's always hard to tell whether there's reality here or it's just a bunch of axe grinding. But one thing that's for sure is that people are talking, and probably not for good reasons. 
</div> Minimal R Package Check List 2015-10-14T08:21:48+00:00 http://simplystats.github.io/2015/10/14/minimal-r-package-check-list <p>A little while back I had the pleasure of flying in a small Cessna with a friend and for the first time I got to see what happens in the cockpit with a real pilot. One thing I noticed was that basically you don’t lift a finger without going through some sort of check list. This starts before you even roll the airplane out of the hangar. It makes sense because flying is a pretty dangerous hobby and you want to prevent problems from occurring when you’re in the air.</p> <p>That experience got me thinking about what might be the minimal check list for building an R package, a somewhat less dangerous hobby. First off, much has changed (for the better) since I started making R packages and I wanted to have some clean documentation of the process, particularly when using RStudio’s tools. So I wiped my installations of both R and RStudio and started from scratch to see what it would take to get someone to build their first R package.</p> <p>The list is basically a “pre-flight” list: the presumption here is that you actually know the important details of building packages, but need to make sure that your environment is set up correctly so that you don’t run into errors or problems. I find this is often a problem for me when teaching students to build packages because I focus on the details of actually making the packages (i.e. DESCRIPTION files, Roxygen, etc.) 
and forget that way back when I actually configured my environment to do this.</p> <p><strong>Pre-flight Procedures for R Packages</strong></p> <ol> <li>Install most recent version of R</li> <li>Install most recent version of RStudio</li> <li>Open RStudio</li> <li>Install <strong>devtools</strong> package</li> <li>Click on Project –&gt; New Project… –&gt; New Directory –&gt; R package</li> <li>Enter package name</li> <li>Delete boilerplate code and “hello.R” file</li> <li>Go to the “man” directory and delete the “hello.Rd” file</li> <li>In File browser, click on package name to go to the top level directory</li> <li>Click “Build” tab in environment browser</li> <li>Click “Configure Build Tools…”</li> <li>Check “Generate documentation with Roxygen”</li> <li>Check “Build &amp; Reload” when Roxygen Options window opens –&gt; Click OK</li> <li>Click OK in Project Options window</li> </ol> <p>At this point, you’re clear to build your package, which obviously involves writing R code, Roxygen documentation, writing package metadata, and building/checking your package.</p> <p>If I’m missing a step or have too many steps, I’d like to hear about it. 
But I think this is the minimum number of steps you need to configure your environment for building R packages in RStudio.</p> <p>UPDATE: I’ve made some changes to the check list and will be posting future updates/modifications to my <a href="https://github.com/rdpeng/daprocedures/blob/master/lists/Rpackage_preflight.md">GitHub repository</a>.</p> Profile of Data Scientist Shannon Cebron 2015-10-03T09:32:20+00:00 http://simplystats.github.io/2015/10/03/profile-of-data-scientist-shannon-cebron <p>The “This is Statistics” campaign has a nice <a href="http://thisisstatistics.org/interview-with-shannon-cebron-from-pegged-software/">profile of Shannon Cebron</a>, a data scientist working at the Baltimore-based Pegged Software.</p> <blockquote> <p><strong>What advice would you give to someone thinking of a career in data science?</strong></p> <p>Take some advanced statistics courses if you want to see what it’s like to be a statistician or data scientist. By that point, you’ll be familiar with enough statistical methods to begin solving real-world problems and understanding the power of statistical science.  I didn’t realize I wanted to be a data scientist until I took more advanced statistics courses, around my third year as an undergraduate math major.</p> </blockquote> Not So Standard Deviations: Episode 2 - We Got it Under 40 Minutes 2015-10-02T09:00:29+00:00 http://simplystats.github.io/2015/10/02/not-so-standard-deviations-episode-2-we-got-it-under-40-minutes <p>Episode 2 of my podcast with Hilary Parker, <a href="https://soundcloud.com/nssd-podcast">Not So Standard Deviations</a>, is out! In this episode, we talk about user testing for statistical methods, navigating the Hadleyverse, the crucial significance of rename(), and the secret reason for creating the podcast (hint: it rhymes with “bee”). Also, I erroneously claim that <a href="http://www.stat.purdue.edu/~wsc/">Bill Cleveland</a> is <em>way</em> older than he actually is. 
Sorry Bill.</p> <p>In other news, <a href="https://itunes.apple.com/us/podcast/not-so-standard-deviations/id1040614570">we are finally on iTunes</a> so you can subscribe from there directly if you want (just search for “Not So Standard Deviations” or paste the link directly into your podcatcher).</p> <p><a href="https://api.soundcloud.com/tracks/226538106/download?client_id=02gUJC0hH2ct1EGOcYXQIzRFU91c72Ea&amp;oauth_token=1-138878-174789515-deb24181d01af">Download the audio file for this episode</a>.</p> <p>Notes:</p> <ul> <li><a href="http://www.sciencemag.org/content/229/4716/828.short">Bill Cleveland’s paper in Science</a>, on graphical perception, <strong>published in 1985</strong></li> <li><a href="https://www.eventbrite.com/e/statistics-making-a-difference-a-conference-in-honor-of-tom-louis-tickets-16248614042">TomFest</a></li> </ul> A glass half full interpretation of the replicability of psychological science 2015-10-01T10:00:53+00:00 http://simplystats.github.io/2015/10/01/a-glass-half-full-interpretation-of-the-replicability-of-psychological-science <p style="line-height: 18.0pt;"> <em>tl;dr: 77% of replication effects from the psychology replication study were in (or above) the 95% prediction interval based on the original effect size. This isn't perfect and suggests (a) there is still room for improvement, (b) the scientists who did the replication study are pretty awesome at replicating, (c) we need a better definition of replication that respects uncertainty but (d) the scientific sky isn't falling. 
We wrote this up in a <a href="http://arxiv.org/abs/1509.08968">paper on arxiv</a>; <a href="https://github.com/jtleek/replication_paper">the code is here.</a> </em> </p> <p style="line-height: 18.0pt;"> <span style="font-size: 12.0pt; font-family: Georgia; color: #333333;">A week or two ago a paper came out in Science on<span class="apple-converted-space"> </span><a href="http://www.sciencemag.org/content/349/6251/aac4716">Estimating the reproducibility of psychological science</a>. The basic idea behind the study was to take a sample of studies that appeared in a particular journal in 2008 and try to replicate each of these studies. Here I'm using the definition that reproducibility is the ability to recalculate all results given the raw data and code from a study and replicability is the ability to re-do the study and get a consistent result. </span> </p> <p style="line-height: 18.0pt;"> <span style="font-size: 12.0pt; font-family: Georgia; color: #333333;">The paper is pretty incredible and the authors did an amazing job of going back to the original sources and trying to be faithful to the original study designs. I have to admit when I first heard about the study design I was incredibly pessimistic about the results (I suppose grouchy is a natural default state for many statisticians –especially those with sleep deprivation). I mean 2008 was well before the push toward reproducibility had really taken off (Biostatistics was one of the first journals to adopt a policy on reproducible research and that didn't happen <a href="http://biostatistics.oxfordjournals.org/content/10/3/405.full">until 2009</a>). More importantly, the student researchers from those studies had possibly moved on, study populations may change, there could be any number of minor variations in the study design and so forth. I thought the chances of getting any effects in the same range were probably pretty low. 
</span> </p> <p style="line-height: 18.0pt;"> So when the results were published I was pleasantly surprised. I wasn’t the only one: </p> <blockquote class="twitter-tweet" width="550"> <p lang="en" dir="ltr"> Someone has to say it, but this plot shows that science is, in fact, working. <a href="http://t.co/JUy10xHfbH">http://t.co/JUy10xHfbH</a> <a href="http://t.co/lJSx6IxPw2">pic.twitter.com/lJSx6IxPw2</a> </p> <p> &mdash; Roger D. Peng (@rdpeng) <a href="https://twitter.com/rdpeng/status/637009904289452032">August 27, 2015</a> </p> </blockquote> <blockquote class="twitter-tweet" width="550"> <p lang="en" dir="ltr"> Looks like psychologists are in a not-too-bad spot on the ROC curves of science (<a href="http://t.co/fPsesCn2yK">http://t.co/fPsesCn2yK</a>) <a href="http://t.co/9rAOdZWvzv">http://t.co/9rAOdZWvzv</a> </p> <p> &mdash; Joe Pickrell (@joe_pickrell) <a href="https://twitter.com/joe_pickrell/status/637304244538896384">August 28, 2015</a> </p> </blockquote> <p>But that was definitely not the prevailing impression that the paper left on social and mass media. A lot of the discussion around the paper focused on the <a href="https://github.com/jtleek/replication_paper/blob/gh-pages/in_the_media.md">idea that only 36% of the studies</a> had a p-value less than 0.05 in both the original and replication study. But many of the sample sizes were small and the effects were modest. So the first question I asked myself was, “Well what would we expect to happen if we replicated these studies?” The original paper measured replicability in several ways and tried hard to calibrate expected coverage of confidence intervals for the measured effects.</p> <p>With <a href="http://www.biostat.jhsph.edu/~rpeng/">Roger</a> and <a href="http://www.biostat.jhsph.edu/~prpatil/">Prasad</a> we tried a little different approach. 
We estimated the 95% prediction interval for the replication effect given the original effect size.</p> <p><a href="http://simplystatistics.org/wp-content/uploads/2015/10/pi_figure_nofilter.png"><img class="aligncenter wp-image-4337" src="http://simplystatistics.org/wp-content/uploads/2015/10/pi_figure_nofilter-300x300.png" alt="pi_figure_nofilter" width="397" height="397" srcset="http://simplystatistics.org/wp-content/uploads/2015/10/pi_figure_nofilter-300x300.png 300w, http://simplystatistics.org/wp-content/uploads/2015/10/pi_figure_nofilter-1024x1024.png 1024w, http://simplystatistics.org/wp-content/uploads/2015/10/pi_figure_nofilter-200x200.png 200w, http://simplystatistics.org/wp-content/uploads/2015/10/pi_figure_nofilter.png 1050w" sizes="(max-width: 397px) 100vw, 397px" /></a></p> <p> </p> <p>72% of the replication effects were within the 95% prediction interval and 2 were above the interval (they showed a stronger signal in replication than predicted from the original study). This definitely shows that there is still room for improvement in replication of these studies; we would expect 95% of the effects to fall into the 95% prediction interval. But my opinion, at least, is that 72% (or 77% if you count the 2 above the P.I.) of studies falling in the prediction interval is (a) not bad and (b) a testament to the authors of the reproducibility paper and their efforts to get the studies right.</p> <p>An important point here is that replication and reproducibility aren’t the same thing. When reproducing a study we expect the numbers and figures to be <em>exactly the same</em>. But a replication involves re-collection of data and is subject to variation and so <em>we don’t expect the answer to be exactly the same in the replication</em>. This is of course made more confusing by regression to the mean, publication bias, and <a href="http://www.stat.columbia.edu/~gelman/research/unpublished/p_hacking.pdf">the garden of forking paths</a>.  
Our use of a prediction interval accounts for the variation expected in both the original study and the replication. One thing we noticed when re-analyzing the data is how many of the studies had very low sample sizes. <a href="http://simplystatistics.org/wp-content/uploads/2015/10/samplesize_figure_nofilter.png"><img class="aligncenter wp-image-4339" src="http://simplystatistics.org/wp-content/uploads/2015/10/samplesize_figure_nofilter-300x300.png" alt="samplesize_figure_nofilter" width="450" height="450" srcset="http://simplystatistics.org/wp-content/uploads/2015/10/samplesize_figure_nofilter-300x300.png 300w, http://simplystatistics.org/wp-content/uploads/2015/10/samplesize_figure_nofilter-1024x1024.png 1024w, http://simplystatistics.org/wp-content/uploads/2015/10/samplesize_figure_nofilter-200x200.png 200w, http://simplystatistics.org/wp-content/uploads/2015/10/samplesize_figure_nofilter.png 1050w" sizes="(max-width: 450px) 100vw, 450px" /></a></p> <p> </p> <p>Sample sizes were generally bigger in the replication, but often very low regardless. This makes it more difficult to disentangle what didn’t replicate from what is just expected variation for a small sample size study.  The question remains whether those small studies should be trusted in general, but for the purposes of measuring replication it makes the problem more difficult.</p> <p>One thing I have been thinking about a lot, and that this study drove home, is that if we are measuring replication we need a definition that incorporates uncertainty directly. Suppose that you collect a data set <strong>D0</strong> from an original study and  <strong>D1</strong> from a replication. Then we say that the study replicates if <strong>D0 ~ F </strong>and <strong>D1 ~ F. </strong>Informally, if the data are generated from the same distribution in both experiments then the study replicates. You then apply a pipeline to the data set to get an estimate <strong>e0 = p(D0). 
</strong>If the study is also reproducible then <strong>p</strong><strong>()</strong> is the same for both studies and <strong>p</strong><strong>(D0) ~ G </strong>and <strong>p</strong><strong>(D1)</strong> <strong>~ G</strong>, subject to some conditions on <strong>p</strong><strong>(). </strong></p> <p>One interesting consequence of this definition is that each complete replication data set represents <em>only a single data point</em> for measuring replication. To measure replication with this definition you either need to make assumptions about the data generating distribution for <strong>D0</strong> and <strong>D1</strong> or you need to perform a complete replication of a study many times to determine if it replicates. However, it does mean that we can define replication even for studies with a very small number of replicates as the data generating distribution may be arbitrarily variable in each case.</p> <p>Regardless of this definition I was excited that the <a href="https://osf.io/">OSF </a>folks did the study and pulled it off as well as they did and was a bit bummed about the most common  reaction. I think there is an easy narrative that “science is broken” which I think isn’t a positive thing for a number of reasons. I love the way that {reproducibility/replicability/open science/open publication} are becoming more and more common, but often think we fall into the same trap in wanting to report these results as clear cut as we do when reporting exaggerations or oversimplifications of scientific discoveries in headlines. I’m excited to see how these kinds of studies look in 10 years when Github/open science/pre-prints/etc. are all the standards.</p> Apple Music's Moment of Truth 2015-09-30T07:38:08+00:00 http://simplystats.github.io/2015/09/30/apple-musics-moment-of-truth <p>Today is the day when Apple, Inc. 
learns whether its brand new streaming music service, Apple Music, is going to be a major contributor to the bottom line or just another streaming service (JASS?). Apple Music launched 3 months ago and all new users are offered a 3-month free trial. Today, that free trial ends and the big question is how many people will start to <strong>pay</strong> for their subscription, as opposed to simply canceling it. My guess is that most people (&gt; 50%) will opt to pay, but that’s a complete guess. For what it’s worth, I’ll be paying for my subscription. After adding all this music to my library, I’d hate to see it all go away.</p> <p>Back on August 18, 2015, consumer market research firm MusicWatch <a href="http://www.businesswire.com/news/home/20150818005755/en#.VddbR7Scy6F">released a study</a> that claimed, among other things, that</p> <blockquote> <p>Among people who had tried Apple Music, 48 percent reported they are not currently using the service.</p> </blockquote> <p>This would suggest that almost half of people who had signed up for the free trial period of Apple Music were not interested in using it further and would likely not pay for it once the trial ended. If it were true, it would be a blow to the newly launched service.</p> <p>But how did MusicWatch arrive at its number? It claimed to have surveyed 5,000 people in its study. Shortly before the survey by MusicWatch was released, Apple claimed that about 11 million people had signed up for their new Apple Music service (because the service had just launched, everyone who had signed up was in the free trial period). Clearly, 5,000 people do not make up the entire population, so we have but a small sample of users.</p> <p>What is the question that MusicWatch was trying to answer? It seems that they wanted to know the percentage of <strong>all people who had signed up for Apple Music</strong> that were still using the service. 
Can they make inference about the entire population from the sample of 5,000?</p> <p>If the sample is representative and the individuals are independent, we could use the number 48% as an estimate of the percentage in the population who no longer use the service. The press release from MusicWatch did not indicate any measure of uncertainty, so we don’t know how reliable the number is.</p> <p>Interestingly, soon after the MusicWatch survey was released, Apple released a statement to the publication <em>The Verge</em>, stating that 79% of users who had signed up were still using the service (i.e. only 21% had stopped using it, as opposed to 48% reported by MusicWatch). In other words, Apple just came out and <em>gave us the truth</em>! This was unusual because Apple typically does not make public statements about newly launched products. I just found this amusing because I’ve never been in a situation where I was trying to estimate a parameter and then someone later just told me what its value was.</p> <p>If we believe that Apple and MusicWatch were measuring the same thing in their analyses (and it’s not clear that they were), then it would suggest that MusicWatch’s estimate of the population percentage (48%) was quite far off from the true value (21%). What would explain this large difference?</p> <ol> <li><strong>Random variation</strong>. It’s true that MusicWatch’s survey was a small sample relative to the full population, but the sample was still big with 5,000 people. Furthermore, the analysis was fairly simple (just taking the proportion of users still using the service), so the uncertainty associated with that estimate is unlikely to be that large.</li> <li><strong>Selection bias</strong>. Recall that it’s not clear how MusicWatch sampled its respondents, but it’s possible that the way that they did it led them to capture a set of respondents who were less inclined to use Apple Music. 
Beyond this, we can’t really say more without knowing the details of the survey process.</li> <li><strong>Respondents are not independent</strong>. It’s possible that the survey respondents are not independent of each other. This would primarily affect the uncertainty about the estimate, making it larger than we might expect if the respondents were all independent. However, since we do not know what MusicWatch’s uncertainty about their estimate was in the first place, it’s difficult to tell if dependence between respondents could play a role. Apple’s number, of course, has no uncertainty.</li> <li><strong>Measurement differences</strong>. This is the big one, in my opinion. We don’t know how either MusicWatch or Apple defined “still using the service”. You could imagine a variety of ways to determine whether a person was still using the service. You could ask “Have you used it in the last week?” or perhaps “Did you use it yesterday?” Responses to these questions would be quite different and would likely lead to different overall percentages of usage.</li> </ol> We Used Data to Improve our HarvardX Courses: New Versions Start Oct 15 2015-09-29T09:53:31+00:00 http://simplystats.github.io/2015/09/29/we-used-data-to-improve-our-harvardx-courses-new-versions-start-oct-15 <p>You can sign up by following the links <a href="http://genomicsclass.github.io/book/pages/classes.html">here</a>.</p> <p>Last semester we successfully <a href="http://simplystatistics.org/2014/11/25/harvardx-biomedical-data-science-open-online-training-curriculum-launches-on-january-19/">launched the second version</a> of my <a href="http://simplystatistics.org/2014/03/31/data-analysis-for-genomic-edx-course/">Data Analysis course</a>. To create the second version, the first was split into eight courses. 
Over 2,000 students successfully completed the first of these, but, as expected, the numbers were lower for the more advanced courses. We wanted to remove any structural problems keeping students from maximizing what they get from our courses, so we studied the assessment questions data, which included completion rate and time, and used the findings to make improvements. We also used qualitative data from the discussion board. The major changes to version 3 are the following:</p> <ul> <li>We no longer use R packages that Microsoft Windows users had trouble installing in the first course.</li> <li>All courses are now designed to be completed in 4 weeks.</li> <li>We added new assessment questions.</li> <li>We improved the assessment questions determined to be problematic.</li> <li>We split the two courses that students took the longest to complete into smaller modules. Students now have twice as much time to complete these.</li> <li>We consolidated the case studies into one course.</li> <li>We combined the materials from the statistics courses into a <a href="http://simplystatistics.org/2015/09/23/data-analysis-for-the-life-sciences-a-book-completely-written-in-r-markdown/">book</a>, which you can download <a href="https://leanpub.com/dataanalysisforthelifesciences">here</a>. The material in the book matches the materials taught in class so you can use it to follow along.</li> </ul> <p>You can enroll in any of the seven courses by following the links below. 
We will be on the discussion boards starting October 15, and we hope to see you there.</p> <ol> <li><a href="https://www.edx.org/course/data-analysis-life-sciences-1-statistics-harvardx-ph525-1x">Statistics and R for the Life Sciences</a> starts October 15.</li> <li><a href="https://www.edx.org/course/data-analysis-life-sciences-2-harvardx-ph525-2x">Introduction to Linear Models and Matrix Algebra</a> starts November 15.</li> <li><a href="https://www.edx.org/course/data-analysis-life-sciences-3-harvardx-ph525-3x">Statistical Inference and Modeling for High-throughput Experiments</a> starts December 15.</li> <li><a href="https://www.edx.org/course/data-analysis-life-sciences-4-harvardx-ph525-4x">High-Dimensional Data Analysis</a> starts January 15.</li> <li><a href="https://www.edx.org/course/data-analysis-life-sciences-5-harvardx-ph525-5x">Introduction to Bioconductor: Annotation and Analysis of Genomes and Genomic Assays</a> starts February 15.</li> <li><a href="https://www.edx.org/course/data-analysis-life-sciences-6-high-harvardx-ph525-6x">High-performance Computing for Reproducible Genomics</a> starts March 15.</li> <li><a href="https://www.edx.org/course/data-analysis-life-sciences-7-case-harvardx-ph525-7x">Case Studies in Functional Genomics</a> starts April 15.</li> </ol> <p>The landing page for the series continues to be <a href="http://genomicsclass.github.io/book/pages/classes.html">here</a>.</p> Data Analysis for the Life Sciences - a book completely written in R markdown 2015-09-23T09:37:27+00:00 http://simplystats.github.io/2015/09/23/data-analysis-for-the-life-sciences-a-book-completely-written-in-r-markdown <p class="p1"> The book <em>Data Analysis for the Life Sciences</em> is now available on <a href="https://leanpub.com/dataanalysisforthelifesciences">Leanpub</a>. 
</p> <p class="p1"> <span class="s1"><img class="wp-image-4313 alignright" src="http://simplystatistics.org/wp-content/uploads/2015/09/title_page-232x300.jpg" alt="title_page" width="222" height="287" srcset="http://simplystatistics.org/wp-content/uploads/2015/09/title_page-232x300.jpg 232w, http://simplystatistics.org/wp-content/uploads/2015/09/title_page-791x1024.jpg 791w" sizes="(max-width: 222px) 100vw, 222px" />Data analysis is now part of practically every research project in the life sciences. In this book we use data and computer code to teach the necessary statistical concepts and programming skills to become a data analyst. Following in the footsteps of <a href="https://www.stat.berkeley.edu/~statlabs/">Stat Labs</a>, instead of showing theory first and then applying it to toy examples, we start with actual applications and describe the theory as it becomes necessary to solve specific challenges.<span class="Apple-converted-space">  We use simulations and data analysis examples to teach statistical concepts. </span></span><span class="s1">The book includes links to computer code that readers can use to program along as they read the book.</span> </p> <p class="p1"> It includes the following chapters: Inference, Exploratory Data Analysis, Robust Statistics, Matrix Algebra, Linear Models, Inference for High-Dimensional Data, Statistical Modeling, Distance and Dimension Reduction, Practical Machine Learning, and Batch Effects. </p> <p class="p1">  The text was completely written in R markdown and every section contains a link to the  document that was used to create that section. This means that you can use <a href="http://yihui.name/knitr/">knitr</a> to reproduce any section of the book on your own computer. You can also access all these markdown documents directly from  <a href="https://github.com/genomicsclass/labs">GitHub</a>. Please send a pull request if you fix a typo or other mistake! 
For now we are keeping the R markdowns for the exercises private since they contain the solutions.  But you can see the solutions if  you take our <a href="http://genomicsclass.github.io/book/pages/classes.html">online course</a> quizzes. If we find that most readers want access to the solutions, we will open them up as well. </p> <p class="p1"> The material is based on the online courses I have been teaching with <a href="http://mikelove.github.io/">Mike Love</a>. As we created the course, Mike and I wrote R markdown documents for the students and put them on GitHub. We then used<a href="http://www.stephaniehicks.com/githubPages_tutorial/pages/githubpages-jekyll.html"> jekyll</a> to create a <a href="http://genomicsclass.github.io/book/">webpage</a> with html versions of the markdown documents. Jeff then convinced us to publish it on <del>Leanbup</del><a href="https://leanpub.com/dataanalysisforthelifesciences">Leanpub</a>. So we wrote a shell script that compiled the entire book into a Leanpub directory, and after countless hours of editing and tinkering we have a 450+ page book with over 200 exercises. The entire book compiles from scratch in about 20 minutes. We hope you like it. </p> The Leek group guide to writing your first paper 2015-09-18T10:57:26+00:00 http://simplystats.github.io/2015/09/18/the-leek-group-guide-to-writing-your-first-paper <blockquote class="twitter-tweet" width="550"> <p lang="en" dir="ltr"> The <a href="https://twitter.com/jtleek">@jtleek</a> guide to writing your first academic paper <a href="https://t.co/APLrEXAS46">https://t.co/APLrEXAS46</a> </p> <p> &mdash; Stephen Turner (@genetics_blog) <a href="https://twitter.com/genetics_blog/status/644540432534368256">September 17, 2015</a> </p> </blockquote> <p>I have written guides on <a href="https://github.com/jtleek/reviews">reviewing papers</a>, <a href="https://github.com/jtleek/datasharing">sharing data</a>,  and <a href="https://github.com/jtleek/rpackages">writing R packages</a>. 
One thing I haven’t touched on until now has been writing papers. Certainly for me, and I think for a lot of students, the hardest transition in graduate school is between taking classes and doing research.</p> <p>There are several hard parts to this transition including trying to find a problem, trying to find an advisor, and having a ton of unstructured time. One of the hardest things I’ve found is knowing (a) when to start writing your first paper and (b) how to do it. So I wrote a guide for students in my group on how to write your first paper:</p> <p><a href="https://github.com/jtleek/firstpaper">https://github.com/jtleek/firstpaper</a></p> <p>It might be useful for other folks as well so I put it up on Github. Just like with the other guides I’ve written this is a very opinionated (read: doesn’t apply to everyone) guide. I also would appreciate any feedback/pull requests people have.</p> Not So Standard Deviations: The Podcast 2015-09-17T10:57:45+00:00 http://simplystats.github.io/2015/09/17/not-so-standard-deviations-the-podcast <p>I’m happy to announce that I’ve started a brand new podcast called <a href="https://soundcloud.com/nssd-podcast">Not So Standard Deviations</a> with Hilary Parker at Etsy. Episode 1 “RCatLadies Origin Story” is available through SoundCloud. In this episode we talk about the origins of RCatLadies, evidence-based data analysis, my new book, and the Python vs. R debate.</p> <p>You can subscribe to the podcast using the <a href="http://feeds.soundcloud.com/users/soundcloud:users:174789515/sounds.rss">RSS feed</a> from SoundCloud. 
We’ll be getting it up on iTunes hopefully very soon.</p> <p><a href="https://api.soundcloud.com/tracks/224180667/download?client_id=02gUJC0hH2ct1EGOcYXQIzRFU91c72Ea&amp;oauth_token=1-138878-174789515-deb24181d01af">Download the audio file</a>.</p> <p>Show Notes:</p> <ul> <li><a href="https://twitter.com/rcatladies">RCatLadies Twitter account</a></li> <li>Hilary’s <a href="http://hilaryparker.com/2013/01/30/hilary-the-most-poisoned-baby-name-in-us-history/">analysis of the name Hilary</a></li> <li><a href="https://leanpub.com/artofdatascience">The Art of Data Science</a></li> <li>What is <a href="http://www.amstat.org/meetings/jsm.cfm">JSM</a>?</li> <li><a href="https://en.wikipedia.org/wiki/A_rising_tide_lifts_all_boats">A rising tide lifts all boats</a></li> </ul> Interview with COPSS award Winner John Storey 2015-08-25T09:25:28+00:00 http://simplystats.github.io/2015/08/25/interview-with-copss-award-winner-john-storey <p><a href="http://simplystatistics.org/wp-content/uploads/2015/08/jdstorey.jpg"><img class="aligncenter wp-image-4289 size-medium" src="http://simplystatistics.org/wp-content/uploads/2015/08/jdstorey-198x300.jpg" alt="jdstorey" width="198" height="300" srcset="http://simplystatistics.org/wp-content/uploads/2015/08/jdstorey-198x300.jpg 198w, http://simplystatistics.org/wp-content/uploads/2015/08/jdstorey-132x200.jpg 132w" sizes="(max-width: 198px) 100vw, 198px" /></a></p> <p> </p> <p><em>Editor’s Note: We are again pleased to interview the COPSS President’s award winner. The <a href="https://en.wikipedia.org/wiki/COPSS_Presidents%27_Award">COPSS Award</a> is one of the most prestigious in statistics, sometimes called the Nobel Prize in statistics. This year the award went to <a href="http://www.genomine.org/">John Storey</a> who also won the <a href="http://sml.princeton.edu/news/john-storey-receives-2015-mortimer-spiegelman-award">Mortimer Spiegelman award</a> for his outstanding contribution to public health statistics.  
This interview is a <a href="https://twitter.com/simplystats/status/631607146572988417">particular pleasure</a> since John was my Ph.D. advisor and has been a major role model and incredibly supportive mentor for me throughout my career. He also <a href="https://github.com/jdstorey/simplystatistics">did the whole interview in markdown and put it under version control at GitHub</a> so it is fully reproducible.</em></p> <p><strong>SimplyStats: Do you consider yourself to be a statistician, data scientist, machine learner, or something else?</strong></p> <p>JS: For the most part I consider myself to be a statistician, but I’m also very serious about genetics/genomics, data analysis, and computation. I was trained in statistics and genetics, primarily statistics. I was also exposed to a lot of machine learning during my training since Rob Tibshirani was my <a href="http://genealogy.math.ndsu.nodak.edu/id.php?id=69303">PhD advisor</a>. However, I consider my research group to be a data science group. We have the <a href="http://drewconway.com/zia/2013/3/26/the-data-science-venn-diagram">Venn diagram</a> reasonably well covered: experimentalists, programmers, data wranglers, and developers of theory and methods; biologists, computer scientists, and statisticians.</p> <p><strong>SimplyStats: How did you find out you had won the COPSS Presidents’ Award?</strong></p> <p>JS: I received a phone call from the chairperson of the awards committee while I was visiting the Department of Statistical Science at Duke University to <a href="https://stat.duke.edu/events/15731.html">give a seminar</a>. It was during the seminar reception, and I stepped out into the hallway to take the call. It was really exciting to get the news!</p> <p><strong>SimplyStats: One of the areas where you have had a big impact is inference in massively parallel problems. 
How do you feel high-dimensional inference is different from more traditional statistical inference?</strong></p> <p>JS: My experience is that the most productive way to approach high-dimensional inference problems is to first think about a given problem in the scenario where the parameters of interest are random, and the joint distribution of these parameters is incorporated into the framework. In other words, I first gain an understanding of the problem in a Bayesian framework. Once this is well understood, it is sometimes possible to move in a more empirical and nonparametric direction. However, I have found that I can be most successful if my first results are in this Bayesian framework.</p> <p>As an example, Theorem 1 from <a href="http://genomics.princeton.edu/storeylab/papers/Storey_Annals_2003.pdf">Storey (2003) Annals of Statistics</a> was the first result I obtained in my work on false discovery rates. This paper <a href="https://statistics.stanford.edu/research/false-discovery-rate-bayesian-interpretation-and-q-value">first appeared as a technical report in early 2001</a>, and the results spawned further work on a <a href="http://genomics.princeton.edu/storeylab/papers/directfdr.pdf">point estimation approach</a> to false discovery rates, the <a href="http://genomics.princeton.edu/storeylab/papers/ETST_JASA_2001.pdf">local false discovery rate</a>, <a href="http://www.bioconductor.org/packages/release/bioc/html/qvalue.html">q-value</a> and its <a href="http://www.pnas.org/content/100/16/9440.full">application to genomics</a>, and a <a href="http://genomics.princeton.edu/storeylab/papers/623.pdf">unified theoretical framework</a>.</p> <p>Besides false discovery rates, this approach has been useful in my work on the <a href="http://genomics.princeton.edu/storeylab/papers/Storey_JRSSB_2007.pdf">optimal discovery procedure</a> as well as <a href="http://journals.plos.org/plosgenetics/article?id=10.1371/journal.pgen.0030161">surrogate variable analysis</a> (in 
particular, <a href="http://amstat.tandfonline.com/doi/abs/10.1080/01621459.2011.645777#.VdxderxVhBc">Desai and Storey 2012</a> for surrogate variable analysis).  For high-dimensional inference problems, I have also found it is important to consider whether there are any plausible underlying causal relationships among variables, even if causal inference is not the goal. For example, causal model considerations provided some key guidance in a <a href="http://www.nature.com/ng/journal/v47/n5/full/ng.3244.html">recent paper of ours</a> on testing for genetic associations in the presence of arbitrary population structure. I think there is a lot of insight to be gained by considering what is the appropriate approach for a high-dimensional inference problem under different causal relationships among the variables.</p> <p><strong>SimplyStats: Do you have a process when you are tackling a hard problem or working with students on a hard problem?</strong></p> <p>JS: I like to work on statistics research that is aimed at answering a specific scientific problem (usually in genomics). My process is to try to understand the why in the problem as much as the how. The path to success is often found in the former. I try first to find solutions to research problems by using simple tools and ideas. I like to get my hands dirty with real data as early as possible in the process. I like to incorporate some theory into this process, but I prefer methods that work really well in practice over those that have beautiful theory justifying them without demonstrated success on real-world applications. 
In terms of what I do day-to-day, listening to music is integral to my process, for both concentration and creative inspiration: typically <a href="https://en.wikipedia.org/wiki/King_Crimson">King Crimson</a> or some <a href="http://www.metal-archives.com/">variant of metal</a> or <a href="https://en.wikipedia.org/wiki/Brian_Eno">ambient</a> – which Simply Statistics co-founder <a href="http://jtleek.com/">Jeff Leek</a> got to <del>endure</del> enjoy for years during his PhD in my lab.</p> <p><strong>SimplyStats: You are the founding Director of the Center for Statistics and Machine Learning at Princeton. What parts of the new gig are you most excited about?</strong></p> <p>JS: Princeton closed its Department of Statistics in the early 1980s. 
Because of this, the style of statistician and machine learner we have here today is one who’s comfortable being appointed in a field outside of statistics or machine learning. Examples include myself in genomics, Kosuke Imai in political science, Jianqing Fan in finance and economics, and Barbara Engelhardt in computer science. Nevertheless, statistics and machine learning here is strong, albeit too small at the moment (which will be changing soon). This is an interesting place to start, very different from most universities.</p> <p>What I’m most excited about is that we get to answer the question: “What’s the best way to build a faculty, educate undergraduates, and create a PhD program starting now, focusing on the most important problems of today?”</p> <p>For those who are interested, we’ll be releasing a <a href="http://www.princeton.edu/strategicplan/taskforces/sml/">public version of our strategic plan</a> within about six months. We’re trying to do something unique and forward-thinking, which will hopefully make Princeton an influential member of the statistics, machine learning, and data science communities.</p> <p><strong>SimplyStats: You are organizing the Tukey conference at Princeton (to be held September 18, <a href="http://csml.princeton.edu/tukey">details here</a>).</strong> <strong>Do you think Tukey’s influence will affect your vision for re-building statistics at Princeton?</strong></p> <p>JS: Absolutely, Tukey has been and will be a major influence in how we re-build. He made so many important contributions, and his approach was extremely forward thinking and tied into real-world problems. I strongly encourage everyone to read Tukey’s 1962 paper titled <a href="https://projecteuclid.org/euclid.aoms/1177704711">The Future of Data Analysis</a>. Here he’s 50 years into the future, foreseeing the rise of data science. 
This paper has truly amazing insights, including:</p> <blockquote> <p>For a long time I have thought I was a statistician, interested in inferences from the particular to the general. But as I have watched mathematical statistics evolve, I have had cause to wonder and to doubt.</p> <p>All in all, I have come to feel that my central interest is in data analysis, which I take to include, among other things: procedures for analyzing data, techniques for interpreting the results of such procedures, ways of planning the gathering of data to make its analysis easier, more precise or more accurate, and all the machinery and results of (mathematical) statistics which apply to analyzing data.</p> <p>Data analysis is a larger and more varied field than inference, or incisive procedures, or allocation.</p> <p>By and large, the great innovations in statistics have not had correspondingly great effects upon data analysis. . . . Is it not time to seek out novelty in data analysis?</p> </blockquote> <p>In this regard, another paper that has been influential in how we are re-building is Leo Breiman’s titled <a href="http://projecteuclid.org/euclid.ss/1009213726">Statistical Modeling: The Two Cultures</a>. We’re building something at Princeton that includes both cultures and seamlessly blends them into a bigger picture community concerned with data-driven scientific discovery and technology development.</p> <p><strong>SimplyStats:</strong> <strong>What advice would you give young statisticians getting into the discipline now?</strong></p> <p>JS: My most general advice is don’t isolate yourself within statistics. Interact with and learn from other fields. Work on problems that are important to practitioners of science and technology development. 
I recommend that students should master both “traditional statistics” and at least one of the following: (1) computational and algorithmic approaches to data analysis, especially those more frequently studied in machine learning or data science; (2) a substantive scientific area where data-driven discovery is extremely important (e.g., social sciences, economics, environmental sciences, genomics, neuroscience, etc.). I also recommend that students should consider publishing in scientific journals or computer science conference proceedings, in addition to traditional statistics journals. I agree with a lot of the constructive advice and commentary given on the Simply Statistics blog, such as encouraging students to learn about reproducible research, problem-driven research, software development, improving data analyses in science, and outreach to non-statisticians. These things are very important for the future of statistics.</p> The Next National Library of Medicine Director Can Help Define the Future of Data Science 2015-08-24T10:00:26+00:00 http://simplystats.github.io/2015/08/24/the-next-national-library-of-medicine-director-can-help-define-the-future-of-data-science <p>The main motivation for starting this blog was to share our enthusiasm about the increased importance of data and data analysis in science, industry, and society in general. Based on recent initiatives, such as <a href="https://datascience.nih.gov/bd2k">BD2k</a>, it is clear that the NIH is also enthusiastic and very much interested in supporting data science. For those that don’t know, the National Institutes of Health (NIH) is the largest public funder of biomedical research in the world. This federal agency has an annual budget of about $30 billion.</p> <p>The NIH has <a href="http://www.nih.gov/icd/icdirectors.htm">several institutes</a>, each with its own budget and capability to guide funding decisions. 
Currently, the missions of most of these institutes relate to a specific disease or public health challenge.  Many of them fund research in statistics and computing because these topics are important components of achieving their specific mission. Currently, however, there is no institute directly tasked with supporting data science per se. This is about to change.</p> <p>The National Library of Medicine (NLM) is one of the few NIH institutes that is not focused on a particular disease or public health challenge. Apart from the important task of maintaining an actual library, it supports, among many other initiatives, indispensable databases such as PubMed, GenBank and GEO. After over 30 years of successful service as NLM director, Dr. Donald Lindberg stepped down this year and, as is customary, an advisory board was formed to advise the NIH on what’s next for NLM. One of the main recommendations of <a href="http://acd.od.nih.gov/reports/Report-NLM-06112015-ACD.pdf">the report</a> is the following:</p> <blockquote> <p>NLM should be the intellectual and programmatic epicenter for data science at NIH and stimulate its advancement throughout biomedical research and application.</p> </blockquote> <p>Data science features prominently throughout the report, making it clear the NIH is very much interested in further supporting this field. The next director can therefore have an enormous influence on the future of data science. 
So, if you love data, have administrative experience, and a vision about the future of data science as it relates to the medical and related sciences, consider this exciting opportunity.</p> <p>Here is the <a href="http://www.jobs.nih.gov/vacancies/executive/nlm_director.htm">ad</a>.</p> <p> </p> <p> </p> <p> </p> Interview with Sherri Rose and Laura Hatfield 2015-08-21T13:20:14+00:00 http://simplystats.github.io/2015/08/21/interview-with-sherri-rose-and-laura-hatfied <p> </p> <p><a href="http://simplystatistics.org/wp-content/uploads/2015/08/hatfieldrose.png"><img class="aligncenter wp-image-4273 size-medium" src="http://simplystatistics.org/wp-content/uploads/2015/08/hatfieldrose-300x200.png" alt="Sherri Rose and Laura Hatfield" width="300" height="200" srcset="http://simplystatistics.org/wp-content/uploads/2015/08/hatfieldrose-300x200.png 300w, http://simplystatistics.org/wp-content/uploads/2015/08/hatfieldrose-260x173.png 260w, http://simplystatistics.org/wp-content/uploads/2015/08/hatfieldrose.png 975w" sizes="(max-width: 300px) 100vw, 300px" /></a></p> <p style="text-align: center;"> Rose/Hatfield © Savannah Bergquist </p> <p><em><a href="http://www.hcp.med.harvard.edu/faculty/core/laura-hatfield-phd">Laura Hatfield</a> and <a href="http://www.drsherrirose.com/">Sherri Rose</a> are Assistant Professors specializing in biostatistics at Harvard Medical School in the <a href="http://www.hcp.med.harvard.edu">Department of Health Care Policy</a>. Laura received her PhD in Biostatistics from the University of Minnesota and Sherri completed her PhD in Biostatistics at UC Berkeley. They are developing novel statistical methods for health policy problems.</em></p> <p><strong><em>SimplyStats: Do you consider yourselves statisticians, data scientists, machine learners, or something else?</em></strong></p> <p><strong>Rose</strong>: I’d definitely say a statistician. 
Even when I’m working on things that fall into the categories of data science or machine learning, there’s underlying statistical theory guiding that process, be it for methods development or applications. Basically, there’s a statistical foundation to everything I do.</p> <p><strong>Hatfield</strong>: When people ask what I do, I start by saying that I do research in health policy. Then I say I’m a statistician by training and I work with economists and physicians. People have mistaken ideas about what a statistician or professor does, so describing my context and work seems more informative. If I’m at a party, I usually wrap it up in a bow as, “I crunch numbers to study how Obamacare is working.” [laughs]</p> <p> </p> <p><strong><em>SimplyStats: What is the</em></strong> <a href="http://www.healthpolicydatascience.org/"><strong><em>Health Policy Data Science Lab</em></strong></a><strong><em>? How did you decide to start that?</em></strong></p> <p><strong>Hatfield</strong>: We wanted to give our trainees a venue to promote their work and get feedback from their peers. And it helps me keep up on the cool projects Sherri and her students are working on.</p> <p><strong>Rose</strong>: This grew out of us starting to jointly mentor trainees. It’s been a great way for us to make intellectual contributions to each other’s work through Lab meetings. Laura and I approach statistics from <em>completely</em> different frameworks, but work on related applications, so that’s a unique structure for a lab.</p> <p> </p> <p><strong><em>SimplyStats: What kinds of problems are your groups working on these days? Are they mostly focused on health policy?</em></strong></p> <p><strong>Rose</strong>: One of the fun things about working in health policy is that it is quite expansive. Statisticians can have an even bigger impact on science and public health if we take that next step: thinking about the policy implications of our research. 
And then, who needs to see the work in order to influence relevant policies. A couple projects I’m working on that demonstrate this breadth include a machine learning framework for risk adjustment in insurance plan payment and a new estimator for causal effects in a complex epidemiologic study of chronic disease. The first might be considered more obviously health policy, but the second will have important policy implications as well.</p> <p><strong>Hatfield</strong>: When I start an applied collaboration, I’m also thinking, “Where is the methods paper?” Most of my projects use messy observational data, so there is almost always a methods paper. For example, many studies here need to find a control group from an administrative data source. I’ve been keeping track of challenges in this process. One of our Lab students is working with me on a pathological case of a seemingly benign control group selection method gone bad. I love the creativity required in this work; my first 10 analysis ideas may turn out to be infeasible given the data, but that’s what makes this fun!</p> <p> </p> <p><strong><em>SimplyStats: What are some particular challenges of working with large health data?</em></strong></p> <p><strong>Hatfield</strong>: When I first heard about the huge sample sizes, I was excited! Then I learned that data not collected for research purposes…</p> <p><strong>Rose</strong>: This was going to be my answer!</p> <p><strong>Hatfield</strong>: …are <em>very</em> hard to use for research! In a recent project, I’ve been studying how giving people a tool to look up prices for medical services changes their health care spending. But the data set we have leaves out [painful pause] a lot of variables we’d like to use for control group selection and… a lot of the prices. But as I said, these gaps in the data are begging to be filled by new methods.</p> <p><strong>Rose</strong>: I think the fact that we have similar answers is important. 
I’ve repeatedly seen “big data” not have a strong signal for the research question, since they weren’t collected for that purpose. It’s easy to get excited about thousands of covariates in an electronic health record, but so much of it is noise, and then you end up with an R<sup>2</sup> of 10%. It can be difficult enough to generate an effective prediction function, even with innovative tools, let alone try to address causal inference questions. It goes back to basics: what’s the research question and how can we translate that into a statistical problem we can answer given the limitations of the data.</p> <p><strong><em>SimplyStats: You both have very strong data science skills but are in academic positions. Do you have any advice for students considering the tradeoff between academia and industry?</em></strong></p> <p><strong>Hatfield</strong>: I think there is more variance within academia and within industry than between the two.</p> <p><strong>Rose</strong>: Really? That’s surprising to me…</p> <p><strong>Hatfield</strong>: I had stereotypes about academic jobs, but my current job defies those.</p> <p><strong>Rose</strong>: What if a larger component of your research platform included programming tools and R packages? My immediate thought was about computing and its role in academia. Statisticians in genomics have navigated this better than some other areas. It can surely be done, but there are still challenges folding that into an academic career.</p> <p><strong>Hatfield</strong>: I think academia imposes few restrictions on what you can disseminate compared to industry, where there may be more privacy and intellectual property concerns. But I take your point that R packages do not impress most tenure and promotion committees.</p> <p><strong>Rose</strong>: You want to find a good match between how you like spending your time and what’s rewarded. Not all academic jobs are the same and not all industry jobs are alike either. 
I wrote a more detailed <a href="http://simplystatistics.org/2015/02/18/navigating-big-data-careers-with-a-statistics-phd/">guest post</a> on this topic for <em>Simply Statistics</em>.</p> <p><strong>Hatfield</strong>: I totally agree you should think about how you’d actually spend your time in any job you’re considering, rather than relying on broad ideas about industry versus academia. Do you love writing? Do you love coding? etc.</p> <p> </p> <p><strong><em>SimplyStats: You are both adopters of social media as a mechanism of disseminating your work and interacting with the community. What do you think of social media as a scientific communication tool? Do you find it is enhancing your careers?</em></strong></p> <p><strong>Hatfield</strong>: Sherri is my social media mentor!</p> <p><strong>Rose</strong>: I think social media can be a useful tool for networking, finding and sharing neat articles and news, and putting your research out there to a broader audience. I’ve definitely received speaking invitations and started collaborations because people initially “knew me from Twitter.” It’s become a way to recruit students as well. Prospective students are more likely to “know me” from a guest post or Twitter than traditional academic products, like journal articles.</p> <p><strong>Hatfield</strong>: I’m grateful for our <a href="https://twitter.com/HPDSLab">Lab’s new Twitter</a> because it’s a purely academic account. My personal account has been awkwardly transitioning to include professional content; I still tweet silly things there.</p> <p><strong>Rose</strong>: My timeline might have <a href="https://twitter.com/sherrirose/status/569613197600272386">a cat picture</a> or <a href="https://twitter.com/sherrirose/status/601822958491926529">two</a>.</p> <p><strong>Hatfield</strong>: My very favorite thing about academic Twitter is discovering things I wouldn’t have even known to search for, especially packages and tricks in R. 
For example, that’s how I got converted to tidy data and dplyr.</p> <p><strong>Rose</strong>: I agree. I think it’s a fantastic place to become exposed to work that’s incredibly related to your own but in another field, and you wouldn’t otherwise find it preparing a typical statistics literature review.</p> <p> </p> <p><strong><em>SimplyStats: What would you change in the statistics community?</em></strong></p> <p><strong>Rose</strong>: Mentoring. I was tremendously lucky to receive incredible mentoring as a graduate student and now as a new faculty member. Not everyone gets this, and trainees don’t know where to find guidance. I’ve actively reached out to trainees during conferences and university visits, erring on the side of offering too much unsolicited help, because I feel there’s a need for that. I also have a <a href="http://drsherrirose.com/resources">resources page</a> on my website that I continue to update. I wish I had a more global solution beyond encouraging statisticians to take an active role in mentoring not just your own trainees. We shouldn’t lose good people because they didn’t get the support they needed.</p> <p><strong>Hatfield</strong>: I think we could make conferences much better! Being in the same physical space at the same time is very precious. I would like to take better advantage of that at big meetings to do work that requires face time. Talks are not an example of this. Workshops and hackathons and panels and working groups – these all make better use of face-to-face time. 
And are a lot more fun!</p> <p> </p> If you ask different questions you get different answers - one more way science isn't broken it is just really hard 2015-08-20T14:52:34+00:00 http://simplystats.github.io/2015/08/20/if-you-ask-different-quetions-you-get-different-asnwers-one-more-way-science-isnt-broken-it-is-just-really-hard <p>If you haven’t already read the amazing piece by Christie Aschwanden on why <a href="http://fivethirtyeight.com/features/science-isnt-broken/">Science isn’t Broken</a> you should do so immediately. It does an amazing job of capturing the nuance of statistics as applied to real data sets and how that can be misconstrued as science being “broken” without falling for the easy “everything is wrong” meme.</p> <p>One thing that caught my eye was how the piece highlighted a crowd-sourced data analysis of soccer red cards. The key figure for that analysis is this one:</p> <p> </p> <p><a href="http://fivethirtyeight.com/features/science-isnt-broken/"><img class="aligncenter" src="https://espnfivethirtyeight.files.wordpress.com/2015/08/truth-vigilantes-soccer-calls2.png?w=1024&amp;h=597" alt="" width="1024" height="597" /></a></p> <p>I think the figure and <a href="https://osf.io/qix4g/">underlying data</a> for this figure are fascinating in that they really highlight the human behavioral variation in data analysis and you can even see some <a href="http://simplystatistics.org/2015/04/29/data-analysis-subcultures/">data analysis subcultures </a>emerging from the descriptions of how people did the analysis and justified or not the use of covariates.</p> <p>One subtlety of the figure that I missed on the original reading is that not all of the estimates being reported are measuring the same thing. For example, if some groups adjusted for the country of origin of the referees and some did not, then the estimates for those two groups are measuring different things (the association conditional on country of origin or not, respectively). 
In this case the estimates may be different, but entirely consistent with each other, since they are just measuring different things.</p> <p>If you ask two people to do the analysis and you only ask them the simple question: <em>Are referees more likely to give  red cards to dark skinned players?</em> then you may get a different answer based on those two estimates. But the reality is the answers the analysts are reporting are actually to the questions:</p> <ol> <li>Are referees more likely to give  red cards to dark skinned players holding country of origin fixed?</li> <li>Are referees more likely to give  red cards to dark skinned players averaging over country of origin (and everything else)?</li> </ol> <p>The subtlety lies in the fact that changes to covariates in the analysis are actually changing the hypothesis you are studying.</p> <p>So in fact the conclusions in that figure may all be entirely consistent after you condition on asking the same question. I’d be interested to see the same plot, but only for the groups that conditioned on the same set of covariates, for example. This is just one more reason that science is really hard and why I’m so impressed at how well the FiveThirtyEight piece captured this nuance.</p> <p> </p> <p> </p> P > 0.05? 
I can make any p-value statistically significant with adaptive FDR procedures 2015-08-19T10:38:31+00:00 http://simplystats.github.io/2015/08/19/p-0-05-i-can-make-any-p-value-statistically-significant-with-adaptive-fdr-procedures <p>Everyone knows now that you have to correct for multiple testing when you calculate many p-values otherwise this can happen:</p> <div style="width: 550px" class="wp-caption aligncenter"> <a href="http://xkcd.com/882/"><img class="" src=" http://imgs.xkcd.com/comics/significant.png" alt="" width="540" height="1498" /></a> <p class="wp-caption-text"> http://xkcd.com/882/ </p> </div> <p> </p> <p>One of the most popular ways to correct for multiple testing is to estimate or control the <a href="https://en.wikipedia.org/wiki/False_discovery_rate">false discovery rate</a>. The false discovery rate attempts to quantify the fraction of made discoveries that are false. If we call all p-values less than some threshold <em>t</em> significant, then borrowing notation from this <a href="http://www.ncbi.nlm.nih.gov/pubmed/12883005">great introduction to false discovery rates </a></p> <p><a href="http://simplystatistics.org/wp-content/uploads/2015/08/fdr3.gif"><img class="aligncenter size-full wp-image-4246" src="http://simplystatistics.org/wp-content/uploads/2015/08/fdr3.gif" alt="fdr3" width="285" height="40" /></a></p> <p> </p> <p>So <em>F(t)</em> is the (unknown) total number of null hypotheses called significant and <em>S(t)</em> is the total number of hypotheses called significant. 
The FDR is the expected ratio of these two quantities, which, under certain assumptions, can be approximated by the ratio of the expectations.</p> <p> </p> <p><a href="http://simplystatistics.org/wp-content/uploads/2015/08/fdr4.gif"><img class="aligncenter size-full wp-image-4247" src="http://simplystatistics.org/wp-content/uploads/2015/08/fdr4.gif" alt="fdr4" width="246" height="44" /></a></p> <p> </p> <p>To get an estimate of the FDR we just need an estimate for <em>E[F(t)]</em> and <em>E[S(t)]</em>. The latter is pretty easy to estimate as just the total number of rejections (the number of <em>p &lt; t</em>). If you assume that the p-values follow the expected distribution then <em>E[F(t)]</em> can be approximated by multiplying the fraction of null hypotheses by the total number of hypotheses and by <em>t</em>, since the null p-values are uniform. To do this, we need an estimate for <span class="MathJax_Preview"><img src="http://simplystatistics.org/wp-content/plugins/latex/cache/tex_d4c98d75e25f5d28461f1da221eb7a95.gif" style="vertical-align: middle; border: none; padding-bottom:1px;" class="tex" alt="\pi_0" /></span>, the proportion of null hypotheses. There are a large number of ways to estimate this quantity but it is almost always estimated using the full distribution of computed p-values in an experiment. The most popular estimator compares the fraction of p-values greater than some cutoff to the number you would expect if every single hypothesis were null. 
This ratio is approximately the fraction of null hypotheses.</p> <p>Combining the above equation with our estimates for <em>E[F(t)]</em> and <em>E[S(t)]</em> we get:</p> <p> </p> <p><a href="http://simplystatistics.org/wp-content/uploads/2015/08/fdr5.gif"><img class="aligncenter size-full wp-image-4250" src="http://simplystatistics.org/wp-content/uploads/2015/08/fdr5.gif" alt="fdr5" width="238" height="42" /></a></p> <p> </p> <p>The q-value is a multiple testing analog of the p-value and is defined as:</p> <p><a href="http://simplystatistics.org/wp-content/uploads/2015/08/fdr61.gif"><img class="aligncenter size-full wp-image-4258" src="http://simplystatistics.org/wp-content/uploads/2015/08/fdr61.gif" alt="fdr6" width="163" height="26" /></a></p> <p> </p> <p>This is of course a very loose version, and you can get a more technical description <a href="http://www.genomine.org/papers/directfdr.pdf">here</a>. But the main thing to notice is that the q-value depends on the estimated proportion of null hypotheses, which depends on the distribution of the observed p-values. The smaller the estimated fraction of null hypotheses, the smaller the FDR estimate and the smaller the q-value. This suggests a way to make any p-value significant by altering its “testing partners”. Here is a quick example. Suppose that we have done a test and have a p-value of 0.8. Not super significant. 
Suppose we perform this test in conjunction with a number of hypotheses that are null generating a p-value distribution like this.</p> <p><a href="http://simplystatistics.org/wp-content/uploads/2015/08/uniform-pvals.png"><img class="aligncenter size-medium wp-image-4260" src="http://simplystatistics.org/wp-content/uploads/2015/08/uniform-pvals-300x300.png" alt="uniform-pvals" width="300" height="300" srcset="http://simplystatistics.org/wp-content/uploads/2015/08/uniform-pvals-300x300.png 300w, http://simplystatistics.org/wp-content/uploads/2015/08/uniform-pvals-200x200.png 200w, http://simplystatistics.org/wp-content/uploads/2015/08/uniform-pvals.png 480w" sizes="(max-width: 300px) 100vw, 300px" /></a></p> <p>Then you get a q-value greater than 0.99 as you would expect. But if you test that exact same p-value with a ton of other non-null hypotheses that generate tiny p-values in a distribution that looks like this:</p> <p><a href="http://simplystatistics.org/wp-content/uploads/2015/08/significant-pvals.png"><img class="aligncenter size-medium wp-image-4261" src="http://simplystatistics.org/wp-content/uploads/2015/08/significant-pvals-300x300.png" alt="significant-pvals" width="300" height="300" srcset="http://simplystatistics.org/wp-content/uploads/2015/08/significant-pvals-300x300.png 300w, http://simplystatistics.org/wp-content/uploads/2015/08/significant-pvals-200x200.png 200w, http://simplystatistics.org/wp-content/uploads/2015/08/significant-pvals.png 480w" sizes="(max-width: 300px) 100vw, 300px" /></a></p> <p> </p> <p>Then you get a q-value of 0.0001 for that same p-value of 0.8. The reason is that the estimate of the fraction of null hypotheses goes essentially to zero, which drives down the q-value. 
You can do this with any p-value: if you make its testing partners have sufficiently low p-values then the q-value will also be as small as you like.</p> <p>A couple of things to note:</p> <ul> <li>Obviously doing this on purpose to change the significance of a calculated p-value is cheating and shouldn’t be done.</li> <li>For correctly calculated p-values on a related set of hypotheses this is actually a sensible property to have - if you have almost all very small p-values and one very large p-value, you are doing a set of tests where almost everything appears to be alternative and you should weight that in some sensible way.</li> <li>This is the reason that sometimes a “multiple testing adjusted” p-value (or q-value) is smaller than the p-value itself.</li> <li>This doesn’t affect non-adaptive FDR procedures - but those procedures still depend on the “testing partners” of any p-value through the total number of tests performed. This is why people talk about the so-called “multiple testing burden”. But that is a subject for a future post. It is also the reason non-adaptive procedures can be severely underpowered compared to adaptive procedures when the p-values are correct.</li> <li>I’ve appended the code to generate the histograms and calculate the q-values in this post in the following gist.</li> </ul> <p> </p> UCLA Statistics 2015 Commencement Address 2015-08-12T10:34:03+00:00 http://simplystats.github.io/2015/08/12/ucla-statistics-2015-commencement-address <p>I was asked to speak at the <a href="http://www.stat.ucla.edu">UCLA Department of Statistics</a> Commencement Ceremony this past June. As one of the first graduates of that department back in 2003, I was tremendously honored to be invited to speak to the graduates. When I arrived I was just shocked at how much the department had grown. When I graduated I think there were no more than 10 of us between the PhD and Master’s programs. Now they have ~90 graduates per year with undergrad, Master’s and PhD. 
It was just stunning.</p> <p>Here’s the text of what I said, which I think I mostly stuck to in the actual speech.</p> <p> </p> <p><strong>UCLA Statistics Graduation: Some thoughts on a career in statistics</strong></p> <p>When I asked Rick [Schoenberg] what I should talk about, he said to “talk for 95 minutes on asymptotic properties of maximum likelihood estimators under nonstandard conditions”. I thought this is a great opportunity! I busted out Tom Ferguson’s book and went through my old notes. Here we go. Let X be a complete normed vector space….</p> <p>I want to thank the department for inviting me here today. It’s always good to be back. I entered the UCLA stat department in 1999, only the second entering class, and graduated from UCLA Stat in 2003. Things were different then. Jan was the chair and there were not many classes so we could basically do whatever we wanted. Things are different now and that’s a good thing. Since 2003, I’ve been at the Department of Biostatistics at the Johns Hopkins Bloomberg School of Public Health, where I was first a postdoctoral fellow and then joined the faculty. It’s been a wonderful place for me to grow up and I’ve learned a lot there.</p> <p>It’s just an incredible time to be a statistician. You guys timed it just right. I’ve been lucky enough to witness two periods like this, the first time being when I graduated from college at the height of the dot-com boom. Today, it’s not computer programming skills that the world needs, but rather it’s statistical skills. I wish I were in your shoes today, just getting ready to start up. But since I’m not, I figured the best thing I could do is share some of the things I’ve learned and talk about the role that these things have played in my own life.</p> <p>Know your edge: What’s the one thing that you know that no one else seems to know? You’re not a clone—you have original ideas and skills. You might think they’re not valuable but you’re wrong. 
Be proud of these ideas and use them to your advantage. As an example, I’ll give you my one thing. Right now, I believe the greatest challenge facing the field of statistics today is getting the entire world to know what we in this room already know. Data are everywhere today and the biggest barrier to progress is our collective inability to process and analyze those data to produce useful information. The need for the things that we know has absolutely exploded and we simply have not caught up. That’s why I created, along with Jeff Leek and Brian Caffo, the Johns Hopkins Data Science Specialization, which is currently the most successful massive open online course program ever. Our goal is to teach the entire world statistics, which we think is an essential skill. We’re not quite there yet, but—assuming you guys don’t steal my idea—I’m hopeful that we’ll get there sometime soon.</p> <p>At some point the edge you have will no longer work: That sounds like a bad thing, but it’s actually good. If what you’re doing really matters, then at some point everyone will be doing it. So you’ll need to find something else. I’ve been confronted with this problem at least 3 times in my life so far. Before college, I was pretty good at the violin, and it opened a lot of doors for me. It got me into Yale. But when I got to Yale, I quickly realized that there were a lot of really good violinists here. Suddenly, my talent didn’t have so much value. This was when I started to pick up computer programming and in 1998 I learned an obscure little language called R. When I got to UCLA I realized I was one of the only people who knew R. So I started a little brown bag lunch series where I’d talk about some feature of R to whomever would show up (which wasn’t many people usually). Picking up on R early on turned out to be really important because it was a small community back then and it was easy to have a big impact. 
Also, as more and more people wanted to learn R, they’d usually call on me. It’s always nice to feel needed. Over the years, the R community exploded and R’s popularity got to the point where it was being talked about in the New York Times. But now you see the problem. Saying that you know R doesn’t exactly distinguish you anymore, so it’s time to move on again. These days, I’m realizing that the one useful skill that I have is the ability to make movies. Also, my experience being a performer on the violin many years ago is coming in handy. My ability to quickly record and edit movies was one of the key factors that enabled me to create an entire online data science program in 2 months last year.</p> <p>Find the right people, and stick with them forever. Being a statistician means working with other people. Choose those people wisely and develop a strong relationship. It doesn’t matter how great the project is or how famous or interesting the other person is, if you can’t get along then bad things will happen. Statistics and data analysis is a highly verbal process that requires constant and very clear communication. If you’re uncomfortable with someone in any way, everything will suffer. Data analysis is unique in this way—our success depends critically on other people. I’ve only had a few collaborators in the past 12 years, but I love them like family. When I work with these people, I don’t necessarily know what will happen, but I know it will be good. In the end, I honestly don’t think I’ll remember the details of the work that I did, but I’ll remember the people I worked with and the relationships I built.</p> <p>So I hope you weren’t expecting a new asymptotic theorem today, because this is pretty much all I’ve got. As you all go on to the next phase of your life, just be confident in your own ideas, be prepared to change and learn new things, and find the right people to do them with. 
Thank you.</p> Correlation is not a measure of reproducibility 2015-08-12T10:33:25+00:00 http://simplystats.github.io/2015/08/12/correlation-is-not-a-measure-of-reproducibility <p>Biologists make wide use of correlation as a measure of reproducibility. Specifically, they quantify reproducibility with the correlation between measurements obtained from replicated experiments. For example, <a href="https://genome.ucsc.edu/ENCODE/protocols/dataStandards/ENCODE_RNAseq_Standards_V1.0.pdf">the ENCODE data standards document</a> states</p> <blockquote> <p>A typical R<sup>2</sup> (Pearson) correlation of gene expression (RPKM) between two biological replicates, for RNAs that are detected in both samples using RPKM or read counts, should be between 0.92 to 0.98. Experiments with biological correlations that fall below 0.9 should be either be repeated or explained.</p> </blockquote> <p>However, for  reasons I will explain here, correlation is not necessarily informative with regards to reproducibility. The mathematical results described below are not inconsequential theoretical details, and understanding them will help you assess new technologies, experimental procedures and computation methods.</p> <p>Suppose you have collected data from an experiment</p> <p style="text-align: center;"> <em>x</em><sub>1</sub>, <em>x</em><sub>2</sub>,..., <em>x</em><sub>n</sub> </p> <p>and want to determine if  a second experiment replicates these findings. For simplicity, we represent data from the second experiment as adding unbiased (averages out to 0) and statistically independent measurement error <em>d</em> to the first:</p> <p style="text-align: center;"> <em>y</em><sub>1</sub>=<em>x</em><sub>1</sub>+<em>d</em><sub>1</sub>, <em>y</em><sub>2</sub>=<em>x</em><sub>2</sub>+<em>d</em><sub>2</sub>, ... <em>y</em><sub>n</sub>=<em>x</em><sub>n</sub>+<em>d</em><sub>n</sub>. 
</p> <p>For us to claim reproducibility we want the differences</p> <p style="text-align: center;"> <em>d</em><sub>1</sub>=<em>y</em><sub>1</sub>-<em>x</em><sub>1</sub>, <em>d</em><sub>2</sub>=<em>y</em><sub>2</sub>-<em>x</em><sub>2</sub>,<em>... </em>,<em>d</em><sub>n</sub>=<em>y</em><sub>n</sub>-<em>x</em><sub>n</sub> </p> <p>to be “small”. To give this some context, imagine the <em>x</em> and <em>y</em> are log scale (base 2) gene expression measurements which implies the <em>d</em> represent log fold changes. If these differences have a standard deviation of 1, it implies that fold changes of 2 are typical between replicates. If our replication experiment produces measurements that are typically twice as big or twice as small as the original, I am not going to claim the measurements are reproduced. However, as it turns out, such terrible reproducibility can still result in correlations higher than 0.92.</p> <p>To someone basing their definition of correlation on the current common language usage this may seem surprising, but to someone basing it on math, it is not. To see this, note that the mathematical definition of correlation tells us that because <em>d</em> and <em>x</em> are independent:</p> <p><a href="http://simplystatistics.org/wp-content/uploads/2015/08/pearsonformula.png"><img class=" aligncenter" src="http://simplystatistics.org/wp-content/uploads/2015/08/pearsonformula-300x55.png" alt="pearsonformula" width="300" height="55" /></a></p> <p>This tells us that correlation summarizes the variability of <em>d</em> relative to the variability of <em>x</em>. Because of the wide range of gene expression values we observe in practice, the standard deviation of <em>x</em> can easily be as large as 3 (variance is 9). This implies we expect to see correlations as high as 1/sqrt(1+1/9) = 0.95, despite the lack of reproducibility when comparing <em>x</em> to <em>y</em>.</p> <p>Note that using Spearman correlation does not fix this problem. 
A Spearman correlation of 1 tells us that the ranks of <em>x</em> and <em>y</em> are preserved, yet does not summarize the actual differences. The problem comes down to the fact that we care about the variability of <em>d</em>, and correlation, Pearson or Spearman, does not provide an optimal summary. While correlation relates to the preservation of ranks, a much more appropriate summary of reproducibility is the distance between <em>x</em> and <em>y</em>, which is related to the standard deviation of the differences <em>d</em>. A very simple R command you can use to generate this summary statistic is:</p> <pre>sqrt(mean(d^2))</pre> <p>or the robust version:</p> <pre>median(abs(d)) ##multiply by 1.4826 for unbiased estimate of true sd </pre> <p>The equivalent suggestion for plots is to make an <a href="https://en.wikipedia.org/wiki/MA_plot">MA-plot</a> instead of a scatterplot.</p> <p>But aren’t correlations and distances directly related? Sort of, and this actually brings up another problem. If the <em>x</em> and <em>y</em> are standardized to have average 0 and standard deviation 1 then, yes, correlation and distance are directly related:</p> <p><a href="http://simplystatistics.org/wp-content/uploads/2015/08/distcorr.png"><img class=" size-medium wp-image-4202 aligncenter" src="http://simplystatistics.org/wp-content/uploads/2015/08/distcorr-300x51.png" alt="distcorr" width="300" height="51" srcset="http://simplystatistics.org/wp-content/uploads/2015/08/distcorr-300x51.png 300w, http://simplystatistics.org/wp-content/uploads/2015/08/distcorr-260x44.png 260w, http://simplystatistics.org/wp-content/uploads/2015/08/distcorr.png 878w" sizes="(max-width: 300px) 100vw, 300px" /></a></p> <p>However, if instead <em>x</em> and <em>y</em> have different average values, which would put into question reproducibility, then distance is sensitive to this problem while correlation is not. 
If the standard deviation is 1, the formula is:</p> <p> </p> <p><a href="http://simplystatistics.org/wp-content/uploads/2015/08/distcor2.png"><img class=" size-medium wp-image-4204 aligncenter" src="http://simplystatistics.org/wp-content/uploads/2015/08/distcor2-300x27.png" alt="distcor2" width="300" height="27" srcset="http://simplystatistics.org/wp-content/uploads/2015/08/distcor2-300x27.png 300w, http://simplystatistics.org/wp-content/uploads/2015/08/distcor2-1024x94.png 1024w" sizes="(max-width: 300px) 100vw, 300px" /></a></p> <p>Once we consider units (standard deviations different from 1) then the relationship becomes even more complicated. Two advantages of distance you should be aware of are:</p> <ol> <li>it is in the same units as the data, while correlations have no units, making it hard to interpret them and select thresholds, and</li> <li>distance accounts for bias (differences in average), while correlation does not.</li> </ol> <p>A final important point relates to the use of correlation with data that is not approximately normal. The useful interpretation of correlation as a summary statistic stems from the bivariate normal approximation: for every standard unit increase in the first variable, the second variable increases by <em>r</em> standard units, with <em>r</em> the correlation. A summary of this is <a href="http://genomicsclass.github.io/book/pages/exploratory_data_analysis_2.html">here</a>. However, when data is not normal this interpretation no longer holds. Furthermore, heavy tail distributions, which are common in genomics, can lead to instability. Here is an example of uncorrelated data with a single point added that leads to correlations close to 1. 
This is quite common with RNAseq data.</p> <p><a href="http://simplystatistics.org/wp-content/uploads/2015/08/supp_figure_2.png"><img class=" size-medium wp-image-4208 aligncenter" src="http://simplystatistics.org/wp-content/uploads/2015/08/supp_figure_2-300x300.png" alt="supp_figure_2" width="300" height="300" srcset="http://simplystatistics.org/wp-content/uploads/2015/08/supp_figure_2-300x300.png 300w, http://simplystatistics.org/wp-content/uploads/2015/08/supp_figure_2-1024x1024.png 1024w, http://simplystatistics.org/wp-content/uploads/2015/08/supp_figure_2-200x200.png 200w" sizes="(max-width: 300px) 100vw, 300px" /></a></p> <p> </p> rafalib package now on CRAN 2015-08-10T10:00:26+00:00 http://simplystats.github.io/2015/08/10/rafalib-package-now-on-cran <p>For the last several years I have been <a href="https://github.com/ririzarr/rafalib">collecting functions</a> I routinely use during exploratory data analysis in a private R package. <a href="http://mike-love.net/">Mike Love</a> and I used some of these in our HarvardX course and now, due to popular demand, I have created man pages and added the <a href="https://cran.r-project.org/web/packages/rafalib/">rafalib</a> package to CRAN. Mike has made several improvements and added some functions of his own. Here are quick descriptions of the rafalib functions I use most:</p> <p>mypar - Before making a plot in R I almost always type <tt>mypar()</tt>. This basically gets around the suboptimal defaults of <tt>par</tt>. For example, it makes the margins (<tt>mar</tt>, <tt>mgp</tt>) smaller and defines RColorBrewer colors as defaults. It is optimized for the RStudio window. Another advantage is that you can type <tt>mypar(3,2)</tt> instead of <tt>par(mfrow=c(3,2))</tt>. <tt>bigpar()</tt> is optimized for R presentations or PowerPoint slides.</p> <p>as.fumeric - This function turns characters into factors and then into numerics. 
This is useful, for example, if you want to plot values <tt>x,y</tt> with colors defined by their corresponding categories saved in a character vector <tt>labs</tt>: <tt>plot(x,y,col=as.fumeric(labs))</tt>.</p> <p>shist (smooth histogram, pronounced <em>shitz</em>) - I wrote this function because I have a hard time interpreting the y-axis of <tt>density</tt>. The height of the curve drawn by <tt>shist</tt> can be interpreted as the height of a histogram if you used the units shown on the plot. Also, it automatically draws a smooth histogram for each entry in a matrix on the same plot.</p> <p>splot (subset plot) - The datasets I work with are typically large enough that <tt>plot(x,y)</tt> involves millions of points, which is <a href="http://stackoverflow.com/questions/7714677/r-scatterplot-with-too-many-points">a problem</a>. Several solutions are available to avoid overplotting, such as alpha-blending, hexbinning and 2d kernel smoothing. For reasons I won’t explain here, I generally prefer subsampling over these solutions. <tt>splot</tt> automatically subsamples. You can also specify an index that defines the subset.</p> <p>sboxplot (smart boxplot) - This function draws points, boxplots or outlier-less boxplots depending on sample size. 
Coming soon is the kaboxplot (Karl Broman box-plots) for when you have too many boxplots.</p> <p>install_bioc - For Bioconductor users, this function simply does the <tt>source("http://www.bioconductor.org/biocLite.R")</tt> for you and then uses <tt>biocLite</tt> to install.</p> <p><a href="http://simplystatistics.org/wp-content/uploads/2015/08/unnamed1.png"><img class="alignnone size-large wp-image-4190" src="http://simplystatistics.org/wp-content/uploads/2015/08/unnamed1-1024x773.png" alt="unnamed" width="990" height="747" srcset="http://simplystatistics.org/wp-content/uploads/2015/08/unnamed1-300x226.png 300w, http://simplystatistics.org/wp-content/uploads/2015/08/unnamed1-1024x773.png 1024w, http://simplystatistics.org/wp-content/uploads/2015/08/unnamed1-260x196.png 260w, http://simplystatistics.org/wp-content/uploads/2015/08/unnamed1.png 1035w" sizes="(max-width: 990px) 100vw, 990px" /></a></p> Interested in analyzing images of brains? Get started with open access data. 2015-08-09T21:29:17+00:00 http://simplystats.github.io/2015/08/09/interested-in-analyzing-images-of-brains-get-started-with-open-access-data <div> <i>Editor's note: This is a guest post by <a href="http://www.anieloyan.com/" target="_blank"><span class="lG">Ani</span> Eloyan</a>. She is an Assistant Professor of Biostatistics at Brown University. Dr. Eloyan’s work focuses on</i> <i>semi-parametric likelihood based methods for matrix decompositions, statistical analyses of brain images, and the integration of various types of complex data structures for analyzing health care data</i><i>. She received her PhD in statistics from North Carolina State University and subsequently completed a postdoctoral fellowship in the <a href="http://www.biostat.jhsph.edu/">Department of Biostatistics at Johns Hopkins University</a>. Dr. 
Eloyan and her team won the <a>ADHD200 Competition</a></i> <i>discussed in <a href="http://journal.frontiersin.org/article/10.3389/fnsys.2012.00061/abstract" target="_blank">this</a> article. She tweets <a href="https://twitter.com/eloyan_ani">@eloyan_ani</a>.</i> </div> <div> <i> </i> </div> <div> <div> Neuroscience is one of the exciting new fields for biostatisticians interested in real world applications where they can contribute novel statistical approaches. Most research in brain imaging has historically included studies run for small numbers of patients. While justified by the costs of data collection, the claims based on analyzing data for such small numbers of subjects often do not hold for our populations of interest. As discussed in <a href="http://www.huffingtonpost.com/american-statistical-association/wanted-neuroquants_b_3749363.html" target="_blank">this</a> article, there is a huge demand for biostatisticians in the field of quantitative neuroscience; so called neuroquants or neurostatisticians. However, while more statisticians are interested in the field, we are far from competing with other substantive domains. For instance, a quick search of abstract keywords in the online program of the upcoming <a href="https://www.amstat.org/meetings/jsm/2015/" target="_blank">JSM2015</a> conference of “brain imaging” and “neuroscience” results in 15 records, while a search of the words “genomics” and “genetics” generates 76 <a>records</a>. </div> <div> </div> <div> Assuming you are trained in statistics and an aspiring neuroquant, how would you go about working with brain imaging data? As a graduate student in the <a href="http://www.stat.ncsu.edu/" target="_blank">Department of Statistics at NCSU</a> several years ago, I was very interested in working on statistical methods that would be directly applicable to solve problems in neuroscience. 
But I had this same question: “Where do I find the data?” I soon learned that to <i>really</i> approach substantial relevant problems I also needed to learn about the subject matter underlying these complex data structures. </div> <div> </div> <div> In recent years, several leading groups have uploaded their lab data with the common goal of fostering the collection of high dimensional brain imaging data to build powerful models that can give generalizable results. <a href="http://www.nitrc.org/" target="_blank">Neuroimaging Informatics Tools and Resources Clearinghouse (NITRC)</a>, founded in 2006, is a platform for public data sharing that facilitates streamlining data processing pipelines and compiling high dimensional imaging datasets for crowdsourcing the analyses. It includes data for people with neurological diseases and neurotypical children and adults. If you are interested in Alzheimer’s disease, you can check out <a href="http://adni.loni.usc.edu/" target="_blank">ADNI</a>. <a href="http://fcon_1000.projects.nitrc.org/indi/abide/" target="_blank">ABIDE</a> provides data for people with Autism Spectrum Disorder and neurotypical peers. <a href="http://fcon_1000.projects.nitrc.org/indi/adhd200/" target="_blank">ADHD200</a> was released in 2011 as a part of a competition to motivate building predictive methods for disease diagnoses using functional magnetic resonance imaging (MRI) in addition to demographic information to predict whether a child has attention deficit hyperactivity disorder (ADHD). While the competition ended in 2011, the dataset has been widely utilized afterwards in studies of ADHD. 
According to Google Scholar, the <a href="http://www.nature.com/mp/journal/v19/n6/abs/mp201378a.html" target="_blank">paper</a> introducing the ABIDE set has been cited 129 times since 2013 while the <a href="http://journal.frontiersin.org/article/10.3389/fnsys.2012.00062/full" target="_blank">paper</a> discussing the ADHD200 has been cited 51 times since <span style="font-family: Arial;">2012. These are only a few examples from the list of open access datasets that could be utilized by statisticians. </span> </div> <div> </div> <div> Anyone can download these datasets (you may need to register and complete some paperwork in some cases); however, there are several data processing and cleaning steps to perform before the final statistical analyses. These preprocessing steps can be daunting for a statistician new to the field, especially as the tools used for preprocessing may not be available in R. <a href="https://hopstat.wordpress.com/2014/08/27/statisticians-in-neuroimaging-need-to-learn-preprocessing/" target="_blank">This</a> discussion makes the case as to why statisticians need to be involved in every step of preprocessing the data, while <u><a href="https://hopstat.wordpress.com/2014/06/17/fslr-an-r-package-interfacing-with-fsl-for-neuroimaging-analysis/" target="_blank">this R package</a></u> contains new tools linking R to a commonly used platform <a href="http://fsl.fmrib.ox.ac.uk/fsl/fslwiki/" target="_blank">FSL</a>. However, as a newcomer, it can be easier to start with data that are already processed. <a href="http://projecteuclid.org/euclid.ss/1242049389" target="_blank">This</a> excellent overview by Dr. Martin Lindquist provides an introduction to the different types of analyses for brain imaging data from a statistician’s point of view, while our <a href="http://journals.plos.org/plosone/article?id=10.1371/journal.pone.0089470" target="_blank">paper</a> provides tools in R and example datasets for implementing some of these methods. 
At least one course on Coursera can help you get started with <a href="https://www.coursera.org/course/fmri" target="_blank">functional MRI</a> data. Talking to and reading the papers of biostatisticians working in the field of quantitative neuroscience and scientists in the field of neuroscience is the key. </div> </div> Statistical Theory is our "Write Once, Run Anywhere" 2015-08-09T11:19:53+00:00 http://simplystats.github.io/2015/08/09/statistical-theory-is-our-write-once-run-anywhere <p>Having followed the software industry as a casual bystander, I periodically see the tension flare up between the idea of writing “native apps”, software that is tuned to a particular platform (Windows, Mac, etc.) and more cross-platform apps, which run on many platforms without too much modification. Over the years it has come up in many different forms, but the fundamentals are the same. Back in the day, there was Java, which was supposed to be the platform that ran on any computing device. Sun Microsystems originated the phrase “<a href="https://en.wikipedia.org/wiki/Write_once,_run_anywhere">Write Once, Run Anywhere</a>” to illustrate the cross-platform strengths of Java. More recently, Steve Jobs famously <a href="https://www.apple.com/hotnews/thoughts-on-flash/">banned Flash</a> from any iOS device. Apple is also moving away from standards like OpenGL and towards its own Metal platform.</p> <p>What’s the problem with “write once, run anywhere”, or of cross-platform development more generally, assuming it’s possible? Well, there are a <a href="https://en.wikipedia.org/wiki/Cross-platform#Challenges_to_cross-platform_development">number of issues</a>: often there are performance penalties, it may be difficult to use the native look and feel of a platform, and you may be reduced to using the “lowest common denominator” of feature sets.
It seems to me that anytime a new meta-platform comes out that promises to relieve programmers of the burden of having to write for multiple platforms, it eventually gets modified or subsumed by the need to optimize apps for a given platform as much as possible. The need to squeeze as much juice out of an app seems to be too important an opportunity to pass up.</p> <p>In statistics, theory and theorems are our version of “write once, run anywhere”. The basic idea is that theorems provide an abstract layer (a “virtual machine”) that allows us to reason across a large number of specific problems. Think of the <a href="https://en.wikipedia.org/wiki/Central_limit_theorem">central limit theorem</a>, probably our most popular theorem. It could be applied to any problem/situation where you have a notion of sample size that could in principle be increasing.</p> <p>But can it be applied to every situation, or even any situation? This might be more of a philosophical question, given that the CLT is stated asymptotically (maybe we’ll find out the answer eventually). In practice, my experience is that many people attempt to apply it to problems where it likely is not appropriate. Think, large-scale studies with a sample size of 10. Many people will use Normal-based confidence intervals in those situations, but they probably have very poor coverage.</p> <p>Because the CLT doesn’t apply in many situations (small sample, dependent data, etc.), variations of the CLT have been developed, as well as entirely different approaches to achieving the same ends, like confidence intervals, p-values, and standard errors (think bootstrap, jackknife, permutation tests). While the CLT can provide beautiful insight in a large variety of situations, in reality, one must often resort to a custom solution when analyzing a given dataset or problem. This should be a familiar conclusion to anyone who analyzes data.
The promise of “write once, run anywhere” is always tantalizing, but the reality never seems to meet that expectation.</p> <p>Ironically, if you look across history and all programming languages, probably the most “cross-platform” language is C, which was originally considered to be too low-level to be broadly useful. C programs run on basically every existing platform and the language has been completely standardized so that compilers can be written to produce well-defined output. The keys to C’s success I think are that it’s a very simple/small language which gives enormous (sometimes dangerous) power to the programmer, and that an enormous toolbox (compiler toolchains, IDEs) has been developed over time to help developers write applications on all platforms.</p> <p>In a sense, we need “compilers” that can help us translate statistical theory for specific data analysis problems. In many cases, I’d imagine the compiler would “fail”, meaning the theory was not applicable to that problem. This would be a Good Thing, because right now we have no way of really enforcing the appropriateness of a theorem for specific problems.</p> <p>More practically (perhaps), we could develop <a href="http://simplystatistics.org/2012/08/27/a-deterministic-statistical-machine/">data analysis pipelines</a> that could be applied to broad classes of data analysis problems. Then a “compiler” could be employed to translate the pipeline so that it worked for a given dataset/problem/toolchain.</p> <p>The key point is to recognize that there is a “translation” process that occurs when we use theory to justify certain data analysis actions, but this translation process is often not well documented or even thought through. 
Having an explicit “compiler” for this would help us to understand the applicability of certain theorems and may serve to prevent bad data analysis from occurring.</p> Autonomous killing machines won't look like the Terminator...and that is why they are so scary 2015-07-30T11:09:22+00:00 http://simplystats.github.io/2015/07/30/autonomous-killing-machines-wont-look-like-the-terminator-and-that-is-why-they-are-so-scary <p>Just a few days ago many of the most incredible minds in science and technology <a href="http://www.theguardian.com/technology/2015/jul/27/musk-wozniak-hawking-ban-ai-autonomous-weapons">urged governments to avoid using artificial intelligence</a> to create autonomous killing machines. One thing that always happens when such a warning is put into place is you see the inevitable Terminator picture:</p> <p> </p> <p><a href="http://simplystatistics.org/wp-content/uploads/2015/07/terminator.jpeg"><img class="aligncenter wp-image-4160 size-medium" src="http://simplystatistics.org/wp-content/uploads/2015/07/terminator-300x180.jpeg" alt="terminator" width="300" height="180" srcset="http://simplystatistics.org/wp-content/uploads/2015/07/terminator-300x180.jpeg 300w, http://simplystatistics.org/wp-content/uploads/2015/07/terminator-260x156.jpeg 260w, http://simplystatistics.org/wp-content/uploads/2015/07/terminator.jpeg 620w" sizes="(max-width: 300px) 100vw, 300px" /></a></p> <p> </p> <p>The reality is that robots that walk and talk are getting better but still have a ways to go:</p> <p> </p> <p> </p> <p>Does this mean that I think all those really smart people are silly for making this plea about AI now though? No, I think they are probably just in time.</p> <p>The reason is that the first autonomous killing machines will definitely not look anything like the Terminator. 
They will more likely than not be drones, that are already in widespread use by the military, and will soon be flying over our heads <a href="http://money.cnn.com/2015/07/29/technology/amazon-drones-air-space/">delivering Amazon products</a>.</p> <p> </p> <p><a href="http://simplystatistics.org/wp-content/uploads/2015/07/drone.jpg"><img class="aligncenter size-medium wp-image-4161" src="http://simplystatistics.org/wp-content/uploads/2015/07/drone-300x238.jpg" alt="drone" width="300" height="238" srcset="http://simplystatistics.org/wp-content/uploads/2015/07/drone-300x238.jpg 300w, http://simplystatistics.org/wp-content/uploads/2015/07/drone-1024x814.jpg 1024w, http://simplystatistics.org/wp-content/uploads/2015/07/drone.jpg 1200w" sizes="(max-width: 300px) 100vw, 300px" /></a></p> <p> </p> <p>I also think that when people think about “artificial intelligence” they also think about robots that can mimic the behaviors of a human being, including the ability to talk, hold a conversation, <a href="https://en.wikipedia.org/wiki/Turing_test">or pass the Turing test</a>. But it turns out that the “artificial intelligence” you would need to create an automated killing system is much much simpler than that and is mostly some basic data science. The things you would need are:</p> <ol> <li>A drone with the ability to fly on its own</li> <li>The ability to make decisions about what people to target</li> <li>The ability to find those people and attack them</li> </ol> <p> </p> <p>The first issue, being able to fly on autopilot, is something that has existed for a while. You have probably flown on a plane that has <a href="https://en.wikipedia.org/wiki/Autopilot">used autopilot</a> for at least some of the flight. I won’t get into the details on this one because I think it is the least interesting - it has been around a while and we didn’t get the dire warnings about autonomous agents.</p> <p>The second issue, about deciding which people to target is already in existence as well. 
We have already seen programs like <a href="https://en.wikipedia.org/wiki/PRISM_(surveillance_program)">PRISM</a> and others that collect individual level metadata and presumably use those to make predictions. While the true and false positive rates are probably messed up by the fact that there are very very few “true positives” these programs are being developed and even relatively simple statistical models can be used to build a predictor - even if those don’t work.</p> <p>The third issue is being able to find people to attack them. This is where the real “artificial intelligence” comes into play. But it isn’t artificial intelligence like you might think about. It could be just as simple as having the drone fly around and take people’s pictures. Then we could use those pictures to match up with the people identified through metadata and attack them. Facebook has a <a href="file:///Users/jtleek/Downloads/deepface.pdf">paper</a> that demonstrates an algorithm that can identify people with near human level accuracy. This approach is based on something called deep neural nets, which sounds very intimidating, but is actually just a set of nested nonlinear <a href="https://en.wikipedia.org/wiki/Deep_learning">logistic regression models</a>. These models have gotten very good because (a) we are getting better at fitting them mathematically and computationally but mostly (b) we have much more data to train them with than we ever did before.
The speed at which this part of the process is developing is (I think) why there is so much recent concern about potentially negative applications like autonomous killing machines.</p> <p>The scary thing is that these technologies could be combined <em>right now</em> to create such a system that was not controlled directly by humans but made automated decisions and flew drones to carry out those decisions. The technology to shrink these types of deep neural net systems to identify people is so good it can even be made simple enough to <a href="http://googleresearch.blogspot.com/2015/07/how-google-translate-squeezes-deep.html">run on a phone</a> for things like language translation and could easily be embedded in a drone.</p> <p>So I am with Musk, Hawking, and others who would urge caution by governments in developing these systems. Just because we can make it doesn’t mean it will do what we want. Just look at how well Facebook/Amazon/Google make suggestions for “other things you might like” to get an idea about how potentially disastrous automated killing systems could be.</p> <p> </p> Announcing the JHU Data Science Hackathon 2015 2015-07-28T13:31:04+00:00 http://simplystats.github.io/2015/07/28/announcing-the-jhu-data-science-hackathon-2015 <p>We are pleased to announce that the Department of Biostatistics at the Johns Hopkins Bloomberg School of Public Health will be hosting the first ever <a href="https://www.regonline.com/jhudash">JHU Data Science Hackathon</a> (DaSH) on <strong>September 21-23, 2015</strong> at the Baltimore Marriott Waterfront.</p> <p>This event will be an opportunity for data scientists and data scientists-in-training to get together and hack on real-world problems collaboratively and to learn from each other. The DaSH will feature data scientists from government, academia, and industry presenting problems and describing challenges in their respective areas.
There will also be a number of networking opportunities where attendees can get to know each other. We think this will be a fun event and we encourage people from all areas, including students (graduate and undergraduate), to attend.</p> <p>To get more details and to sign up for the hackathon, you can go to the <a href="https://www.regonline.com/jhudash">DaSH web site</a>. We will be posting more information as the event nears.</p> <p>Organizers:</p> <ul> <li>Jeff Leek</li> <li>Brian Caffo</li> <li>Roger Peng</li> <li>Leah Jager</li> </ul> <p>Funding:</p> <ul> <li>National Institutes of Health</li> <li>Johns Hopkins University</li> </ul> <p> </p> stringsAsFactors: An unauthorized biography 2015-07-24T11:04:20+00:00 http://simplystats.github.io/2015/07/24/stringsasfactors-an-unauthorized-biography <p>Recently, I was listening in on the conversation of some colleagues who were discussing a bug in their R code. The bug was ultimately traced back to the well-known phenomenon that functions like ‘read.table()’ and ‘read.csv()’ in R convert columns that are detected to be character/strings to be factor variables. This led to the spontaneous outcry from one colleague of</p> <blockquote> <p>Why does stringsAsFactors not default to FALSE????</p> </blockquote> <p>The argument ‘stringsAsFactors’ is an argument to the ‘data.frame()’ function in R. It is a logical that indicates whether strings in a data frame should be treated as factor variables or as just plain strings. The argument also appears in ‘read.table()’ and related functions because of the role these functions play in reading in table data and converting them to data frames. By default, ‘stringsAsFactors’ is set to TRUE.</p> <p>This argument dates back to May 20, 2006 when it was originally introduced into R as the ‘charToFactor’ argument to ‘data.frame()’.
Soon afterwards, on May 24, 2006, it was changed to ‘stringsAsFactors’ to be compatible with S-PLUS by request from Bill Dunlap.</p> <p>Most people I talk to today who use R are completely befuddled by the fact that ‘stringsAsFactors’ is set to TRUE by default. First of all, it should be noted that before the ‘stringsAsFactors’ argument even existed, the behavior of R was to coerce all character strings to be factors in a data frame. If you didn’t want this behavior, you had to manually coerce each column to be character.</p> <p>So here’s the story:</p> <p>In the old days, when R was primarily being used by statisticians and statistical types, setting strings to be factors made total sense. In most tabular data, if there were a column of the table that was non-numeric, it almost certainly encoded a categorical variable. Think sex (male/female), country (U.S./other), region (east/west), etc. In R, categorical variables are represented by ‘factor’ vectors and so character columns got converted to factors.</p> <p>Why do we need factor variables to begin with? Because of modeling functions like ‘lm()’ and ‘glm()’. Modeling functions need to expand categorical variables into individual dummy variables, so that a categorical variable with 5 levels will be expanded into 4 different columns in your modeling matrix. There’s no way for R to know it should do this unless it has some extra information in the form of the factor class. From this point of view, setting ‘stringsAsFactors = TRUE’ when reading in tabular data makes total sense. If the data is just going to go into a regression model, then R is doing the right thing.</p> <p>There’s also a more obscure reason. Factor variables are encoded as integers in their underlying representation. So a variable like “disease” and “non-disease” will be encoded as 1 and 2 in the underlying representation.
Roughly speaking, since integers only require 4 bytes on most systems, the conversion from string to integer actually saved some space for long strings. All that had to be stored was the integer levels and the labels. That way you didn’t have to repeat the strings “disease” and “non-disease” for as many observations as you had, which would have been wasteful.</p> <p>Around June of 2007, R introduced hashing of CHARSXP elements in the underlying C code thanks to Seth Falcon. What this meant was that effectively, character strings were hashed to an integer representation and stored in a global table in R. Anytime a given string was needed in R, it could be referenced by its underlying integer. This effectively put in place, globally, the factor encoding behavior of strings from before. Once this was implemented, there was little to be gained from an efficiency standpoint by encoding character variables as factors. Of course, you still needed to use ‘factors’ for the modeling functions.</p> <p>The difference nowadays is that R is being used by a very wide variety of people doing all kinds of things the creators of R never envisioned. This is, of course, wonderful, but it introduces lots of use cases that were not originally planned for. I find that most often, the people complaining about ‘stringsAsFactors’ not being FALSE are people who are doing things that are not the traditional statistical modeling things (things that old-time statisticians like me used to do). In fact, I would argue that if you’re upset about ‘stringsAsFactors = TRUE’, then it’s a pretty good indicator that you’re either not a statistician by training, or you’re doing non-traditional statistical things.</p> <p>For example, in genomics, you might have the names of the genes in one column of data. It really doesn’t make sense to encode these as factors because they won’t be used in any modeling function. They’re just labels, essentially.
And because of CHARSXP hashing, you don’t gain anything from an efficiency standpoint by converting them to factors either.</p> <p>But of course, given the long-standing behavior of R, many people depend on the default conversion of characters to factors when reading in tabular data. Changing this default would likely result in an equal number of people complaining about ‘stringsAsFactors’.</p> <p>I fully expect that this blog post will now make all R users happy. If you think I’ve missed something from this unauthorized biography, please let me know on Twitter (@rdpeng).</p> The statistics department Moneyball opportunity 2015-07-17T09:21:16+00:00 http://simplystats.github.io/2015/07/17/the-statistics-department-moneyball-opportunity <p><a href="https://en.wikipedia.org/wiki/Moneyball">Moneyball</a> is a book and a movie about Billy Beane. It makes statisticians look awesome and I loved the movie. I loved it so much I’m putting the movie trailer right here:</p> <p>The basic idea behind Moneyball was that the Oakland Athletics were able to build a very successful baseball team on a tight budget by valuing skills that many other teams undervalued. In baseball those skills were things like on-base percentage and slugging percentage. By correctly valuing these skills and their impact on a team’s winning percentage, the A’s were able to build one of the most successful regular season teams on a minimal budget.
This graph shows what an outlier they were, from a nice <a href="http://fivethirtyeight.com/features/billion-dollar-billy-beane/">fivethirtyeight analysis</a>.</p> <p> </p> <p><a href="http://simplystatistics.org/wp-content/uploads/2015/07/oakland.png"><img class="aligncenter wp-image-4146" src="http://simplystatistics.org/wp-content/uploads/2015/07/oakland-1024x818.png" alt="oakland" width="500" height="400" srcset="http://simplystatistics.org/wp-content/uploads/2015/07/oakland-1024x818.png 1024w, http://simplystatistics.org/wp-content/uploads/2015/07/oakland-250x200.png 250w, http://simplystatistics.org/wp-content/uploads/2015/07/oakland.png 1150w" sizes="(max-width: 500px) 100vw, 500px" /></a></p> <p> </p> <p>I think that the data science/data analysis revolution that we have seen over the last decade has created a similar moneyball opportunity for statistics and biostatistics departments. Traditionally in these departments the highest value activities have been publishing in a select number of important statistics journals (JASA, JRSS-B, Annals of Statistics, Biometrika, Biometrics and more recently journals like Biostatistics and Annals of Applied Statistics). But there are some hugely valuable ways to contribute to statistics/data science that don’t necessarily end with papers in those journals like:</p> <ol> <li>Creating good, well-documented, and widely used software</li> <li>Being primarily an excellent collaborator who brings in grant money and is a major contributor to science through statistics</li> <li>Publishing in top scientific journals rather than statistics journals</li> <li>Being a good scientific communicator who can attract talent</li> <li>Being a statistics educator who can build programs</li> </ol> <p>Another thing that is undervalued is not having a Ph.D. in statistics or biostatistics.
The fact that these skills are undervalued right now means that up and coming departments could identify and recruit talented people that might be missed by other departments and have a huge impact on the world. One tricky thing is that the rankings of departments are based on the votes of people from other departments who may or may not value these same skills. Another tricky thing is that many industry data science positions put incredibly high value on these skills and so you might end up competing with them for people - a competition that will definitely drive up the market value of these data scientist/statisticians. But for the folks that want to stay in academia, now is a prime opportunity.</p> The Mozilla Fellowship for Science 2015-07-10T11:10:26+00:00 http://simplystats.github.io/2015/07/10/the-mozilla-fellowship-for-science <p>This looks like an <a href="https://www.mozillascience.org/fellows">interesting opportunity</a> for grad students, postdocs, and early career researchers:</p> <blockquote> <p>We’re looking for researchers with a passion for open source and data sharing, already working to shift research practice to be more collaborative, iterative and open. Fellows will spend 10 months starting September 2015 as community catalysts at their institutions, mentoring the next generation of open data practitioners and researchers and building lasting change in the global open science community.</p> <p>Throughout their fellowship year, chosen fellows will receive training and support from Mozilla to hone their skills around open source and data sharing.
They will also craft code, curriculum and other learning resources that help their local communities learn open data practices, and teach forward to their peers.</p> </blockquote> <p>Here’s what you get:</p> <blockquote> <p>Fellows will receive:</p> <ul> <li>A stipend of $60,000 USD, paid in 10 monthly installments.</li> <li>One-time health insurance supplement for Fellows and their families, ranging from $3,500 for single Fellows to $7,000 for a couple with two or more children.</li> <li>One-time childcare allotment for families with children of up to $6,000.</li> <li>Allowance of up to $3,000 towards the purchase of laptop computer, digital cameras, recorders and computer software; fees for continuing studies or other courses, research fees or payments, to the extent related to the fellowship.</li> <li>All approved fellowship trips – domestic and international – are covered in full.</li> </ul> </blockquote> <p>Deadline is August 14.</p> JHU, UMD researchers are getting a really big Big Data center 2015-07-08T16:26:45+00:00 http://simplystats.github.io/2015/07/08/jhu-umd-researchers-are-getting-a-really-big-big-data-center <p>From <a href="http://technical.ly/baltimore/2015/07/07/jhu-umd-big-data-maryland-advanced-research-computing-center-marcc/">Technical.ly Baltimore</a>:</p> <blockquote> <p>A nondescript, 3,700-square-foot building on Johns Hopkins’ Bayview campus will house a new data storage and computing center for university researchers. The $30 million Maryland Advanced Research Computing Center (MARCC) will be available to faculty from JHU and the University of Maryland, College Park.</p> </blockquote> <p>The web site has a pretty cool time-lapse video of the construction of the computing center. 
There’s also a bit more detail at the <a href="http://hub.jhu.edu/2015/07/06/computing-center-bayview">JHU Hub</a> site.</p> The Massive Future of Statistics Education 2015-07-03T10:17:24+00:00 http://simplystats.github.io/2015/07/03/the-massive-future-of-statistics-education <p><em>NOTE: This post was written as a chapter for the not-yet-released Handbook on Statistics Education. </em></p> <p>Data are eating the world, but our collective ability to analyze data is going on a starvation diet.</p> <div id="content"> <p> Everywhere you turn, data are being generated somehow. By the time you read this piece, you’ll probably have collected some data. (For example this piece has 2,072 words). You can’t avoid data—it’s coming from all directions. </p> <p> So what do we do with it? For the most part, nothing. There’s just too much data being spewed about. But for the data that we <em>are</em> interested in, we need to know the appropriate methods for thinking about and analyzing them. And by “we”, I mean pretty much everyone. </p> <p> In the future, everyone will need some data analysis skills. People are constantly confronted with data and the need to make choices and decisions from the raw data they receive. Phones deliver information about traffic, we have ratings about restaurants or books, and even rankings of hospitals. High school students can obtain complex and rich information about the colleges to which they’re applying while admissions committees can get real-time data on applicants’ interest in the college. </p> <p> Many people already have heuristic algorithms to deal with the data influx—and these algorithms may serve them well—but real statistical thinking will be needed for situations beyond choosing which restaurant to try for dinner tonight. 
</p> <p> <strong>Limited Capacity</strong> </p> <p> The McKinsey Global Institute, in a <a href="http://www.mckinsey.com/insights/americas/us_game_changers">highly cited report</a>, predicted that there would be a shortage of “data geeks” and that by 2018 there would be between 140,000 and 190,000 unfilled positions in data science. In addition, there will be an estimated 1.5 million people in managerial positions who will need to be trained to manage data scientists and to understand the output of data analysis. If history is any guide, it’s likely that these positions will get filled by people, regardless of whether they are properly trained. The potential consequences are disastrous as untrained analysts interpret complex big data coming from myriad sources of varying quality. </p> <p> Who will provide the necessary training for all these unfilled positions? The field of statistics’ current system of training people and providing them with master’s degrees and PhDs is woefully inadequate to the task. In 2013, the top 10 largest statistics master’s degree programs in the U.S. graduated a total of <a href="http://community.amstat.org/blogs/steve-pierson/2014/02/09/largest-graduate-programs-in-statistics">730 people</a>. At this rate we will never train the people needed. While statisticians have greatly benefited from the sudden and rapid increase in the amount of data flowing around the world, our capacity for scaling up the needed training for analyzing those data is essentially nonexistent. </p> <p> On top of all this, I believe that the McKinsey report is a gross underestimation of how many people will need to be trained in <em>some</em> data analysis skills in the future. Given how much data is being generated every day, and how critical it is for everyone to be able to intelligently interpret these data, I would argue that it’s necessary for <em>everyone</em> to have some data analysis skills. 
Needless to say, it’s foolish to suggest that everyone go get a master’s or even a bachelor’s degree in statistics. We need an alternate approach that is both high-quality and scalable to a large population over a short period of time. </p> <p> <strong>Enter the MOOCs</strong> </p> <p> In April of 2014, Jeff Leek, Brian Caffo, and I launched the <a href="https://www.coursera.org/specialization/jhudatascience/1">Johns Hopkins Data Science Specialization</a> on the Coursera platform. This is a sequence of nine courses that intends to provide a “soup-to-nuts” training in data science for people who are highly motivated and have some basic mathematical and computing background. The sequence of nine courses follows what we believe is the essential “data science process”, which is </p> <ol> <li> Formulating a question that can be answered with data </li> <li> Assembling, cleaning, tidying data relevant to a question </li> <li> Exploring data, checking, eliminating hypotheses </li> <li> Developing a statistical model </li> <li> Making statistical inference </li> <li> Communicating findings </li> <li> Making the work reproducible </li> </ol> <p> We took these basic steps and designed courses around each one of them. </p> <p> Each course is provided in a massive open online format, which means that many thousands of people typically enroll in each course every time it is offered. The learners in the courses do homework assignments, take quizzes, and peer assess the work of others in the class. All grading and assessment is handled automatically so that the process can scale to arbitrarily large enrollments. As an example, the April 2015 session of the R Programming course had nearly 45,000 learners enrolled. Each class is exactly 4 weeks long and every class runs every month. </p> <p> We developed this sequence of courses in part to address the growing demand for data science training and education across the globe. 
Our background as biostatisticians was very closely aligned with the training needs of people interested in data science because, essentially, data science is <em>what we do every single day</em>. Indeed, one curriculum rule that we had was that we couldn’t include something if we didn’t in fact use it in our own work. </p> <p> The sequence has a substantial amount of standard statistics content, such as probability and inference, linear models, and machine learning. It also has non-standard content, such as git, GitHub, R programming, Shiny, and Markdown. Together, the sequence covers the full spectrum of tools that we believe will be needed by the practicing data scientist. </p> <p> For those who complete the nine courses, there is a capstone project at the end, that involves taking all of the skills in the course and developing a data product. For our first capstone project we partnered with <a href="http://swiftkey.com/en/">SwiftKey</a>, a predictive text analytics company, to develop a project where learners had to build a statistical model for predicting words in a sentence. This project involves taking unstructured, messy data, processing it into an analyzable form, developing a statistical model while making tradeoffs for efficiency and accuracy, and creating a Shiny app to show off their model to the public. </p> <p> <strong>Degree Alternatives</strong> </p> <p> The Data Science Specialization is not a formal degree program offered by Johns Hopkins University—learners who complete the sequence do not get any Johns Hopkins University credit—and so one might wonder what the learners get out of the program (besides, of course, the knowledge itself). To begin with, the sequence is completely portfolio based, so learners complete projects that are immediately viewable by others. This allows others to evaluate a learner’s ability on the spot with real code or data analysis. 
</p> <p> All of the lecture content is openly available and hosted on GitHub, so outsiders can view the content and see for themselves what is being taught. This gives outsiders an opportunity to evaluate the program directly rather than have to rely on the sterling reputation of the institution teaching the courses. </p> <p> Each learner who completes a course using Coursera’s “Signature Track” (which currently costs $49 per course) can get a badge on their LinkedIn profile, which shows that they completed the course. This can often be as valuable as a degree or other certification as recruiters scouring LinkedIn for data scientist positions will be able to see our completers’ certifications in various data science courses. </p> <p> Finally, the scale and reach of our specialization immediately create a large alumni social network that learners can take advantage of. As of March 2015, there were approximately 700,000 people who had taken at least one course in the specialization. These 700,000 people have a shared experience that, while not quite at the level of a college education, still is useful for forging connections between people, especially when people are searching around for jobs. </p> <p> <strong>Early Numbers</strong> </p> <p> So far, the sequence has been wildly successful. It averaged 182,507 enrollees a month for the first year in existence. The overall course completion rate was about 6% and the completion rate amongst those in the “Signature Track” (i.e. paid enrollees) was 67%. In October of 2014, barely 7 months since the start of the specialization, we had 663 learners enroll in the capstone project. </p> <p> <strong>Some Early Lessons</strong> </p> <p> From running the Data Science Specialization for over a year now, we have learned a number of lessons, some of which were unexpected. Here, I summarize the highlights of what we’ve learned. </p> <p> <strong>Data Science as Art and Science. 
</strong>Ironically, although the word “Science” appears in the name “Data Science”, there’s actually quite a bit about the practice of data science that doesn’t really resemble science at all. Much of what statisticians do in the act of data analysis is intuitive and ad hoc, with each data analysis being viewed as a unique flower. </p> <p> When attempting to design data analysis assignments that could be graded at scale with tens of thousands of people, we discovered that designing the rubrics for grading these assignments was not trivial. The reason is that our understanding of what makes a “good” analysis different from a bad one is not well-articulated. We could not identify any community-wide understanding of what the components of a good analysis are. What are the “correct” methods to use in a given data analysis situation? What is definitely the “wrong” approach? </p> <p> Although each one of us had been doing data analysis for the better part of a decade, none of us could succinctly write down what the process was and how to recognize when it was being done wrong. To paraphrase Daryl Pregibon from his <a href="http://www.nap.edu/catalog/1910/the-future-of-statistical-software-proceedings-of-a-forum">1991 talk at the National Academies of Science</a>, we had a process that we regularly espoused but barely understood. </p> <p> <strong>Content vs. Curation</strong>. Much of the content that we put online is available elsewhere. With YouTube, you can find high-quality videos on almost any topic, and our videos are not really that much better. Furthermore, the subject matter that we were teaching was in no way proprietary. The linear models that we teach are the same linear models taught everywhere else. So what exactly was the value we were providing? </p> <p> Searching on YouTube requires that you know what you are looking for. This is a problem for people who are just getting into an area. 
Effectively, what we provided was a <em>curation</em> of all the knowledge that’s out there on the topic of data science (we also added our own quirky spin). Curation is hard, because you need to make definitive choices between what is and is not a core element of a field. But curation is essential for learning a field for the uninitiated. </p> <p> <strong>Skill sets vs. Certification</strong>. Because we knew that we were not developing a true degree program, we knew we had to develop the program in a manner so that the learners could quickly see for themselves the value they were getting out of it. This led us to take a portfolio approach where learners produced things that could be viewed publicly. </p> <p> In part because of the self-selection of the population seeking to learn data science skills, our learners were more interested in being able to demonstrate the skills taught in the course rather than an abstract (but official) certification as might be gotten in a degree program. This is not unlike going to a music conservatory, where the output is your ability to play an instrument rather than the piece of paper you receive upon graduation. We feel that giving people the ability to demonstrate skills and skill sets is perhaps more important than official degrees in some instances because it gives employers a concrete sense of what a person is capable of doing. </p> <p> <strong>Conclusions</strong> </p> <p> As of April 2015, we had a total of 1,158 learners complete the entire specialization, including the capstone project. Given these numbers and our rate of completion for the specialization as a whole, we believe we are on our way to achieving our goal of creating a highly scalable program for training people in data science skills. Of course, this program alone will not be sufficient for all of the data science training needs of society. 
But we believe that the approach that we’ve taken, using non-standard MOOC channels, focusing on skill sets instead of certification, and emphasizing our role in curation, is a rich opportunity for the field of statistics to explore in order to educate the masses about our important work. </p> </div> Looks like this R thing might be for real 2015-07-02T10:01:45+00:00 http://simplystats.github.io/2015/07/02/looks-like-this-r-thing-might-be-for-real <p>Not sure how I missed this, but the Linux Foundation just announced the <a href="http://www.linuxfoundation.org/news-media/announcements/2015/06/linux-foundation-announces-r-consortium-support-millions-users">R Consortium</a> for supporting the “world’s most popular language for analytics and data science and support the rapid growth of the R user community”. From the Linux Foundation:</p> <blockquote> <p>The R language is used by statisticians, analysts and data scientists to unlock value from data. It is a free and open source programming language for statistical computing and provides an interactive environment for data analysis, modeling and visualization. The R Consortium will complement the work of the R Foundation, a nonprofit organization based in Austria that maintains the language. 
The R Consortium will focus on user outreach and other projects designed to assist the R user and developer communities.</p> <p>Founding companies and organizations of the R Consortium include The R Foundation, Platinum members Microsoft and RStudio; Gold member TIBCO Software Inc.; and Silver members Alteryx, Google, HP, Mango Solutions, Ketchum Trading and Oracle.</p> </blockquote> How Airbnb built a data science team 2015-07-01T08:39:29+00:00 http://simplystats.github.io/2015/07/01/how-airbnb-built-a-data-science-team <p>From <a href="http://venturebeat.com/2015/06/30/how-we-scaled-data-science-to-all-sides-of-airbnb-over-5-years-of-hypergrowth/">Venturebeat</a>:</p> <blockquote> <p>Back then we knew so little about the business that any insight was groundbreaking; data infrastructure was fast, stable, and real-time (I was querying our production MySQL database); the company was so small that everyone was in the loop about every decision; and the data team (me) was aligned around a singular set of metrics and methodologies.</p> <p>But five years and 43,000 percent growth later, things have gotten a bit more complicated. I’m happy to say that we’re also more sophisticated in the way we leverage data, and there’s now a lot more of it. The trick has been to manage scale in a way that brings together the magic of those early days with the growing needs of the present — a challenge that I know we aren’t alone in facing.</p> </blockquote> How public relations and the media are distorting science 2015-06-24T10:07:45+00:00 http://simplystats.github.io/2015/06/24/how-public-relations-and-the-media-are-distorting-science <p>Throughout history, engineers, medical doctors and other applied scientists have helped convert  basic science discoveries into products, public goods and policy that have greatly improved our quality of life. With rare exceptions, it has taken years if not decades to establish these discoveries. 
And even the exceptions stand on the shoulders of incremental contributions. The researchers that produce this knowledge go through a slow and painstaking process to reach these achievements.</p> <p>In contrast, most science-related media reports that grab the public’s attention fall into three categories:</p> <ol> <li>The <em>exaggerated big discovery</em>: Recent examples include the discovery of <a href="http://www.cbsnews.com/news/dangerous-pathogens-and-mystery-microbes-ride-the-subway/">the bubonic plague in the NYC subway</a>, <a href="http://www.bbc.com/news/science-environment-32287609">liquid water on Mars</a>, and <a href="http://www.nytimes.com/2015/05/24/opinion/sunday/infidelity-lurks-in-your-genes.html?ref=opinion&amp;_r=3">the infidelity gene</a>.</li> <li><em>Over-promising</em>:  These try to explain a complicated basic science finding and, in the case of biomedical research, then speculate without much explanation that the finding will “lead to a deeper understanding of diseases and new ways to treat or cure them”.</li> <li><em>Science is broken</em>:  These tend to report an anecdote about an allegedly corrupt scientist, maybe cite the “Why Most Published Research Findings are False” paper, and then extrapolate speculatively.</li> </ol> <p>In my estimation, despite the attention-grabbing headlines, the great majority of the subject matter included in these reports will not have an impact on our lives and will not even make it into scientific textbooks. So does science still have anything to offer? Reports in the third category have even scientists particularly worried. I, however, remain optimistic. First, I do not see any empirical evidence showing that the negative effects of the lack of reproducibility are worse now than 50 years ago. 
Furthermore, although not widely reported in the lay press, I continue to see bodies of work built by several scientists over several years or decades with much promise of leading to tangible improvements to our quality of life.  Recent advances that I am excited about include <a href="http://physics.gmu.edu/~pnikolic/articles/Topological%20insulators%20(Physics%20World,%20February%202011).pdf">insulators</a>, <a href="http://www.ncbi.nlm.nih.gov/pubmed/24955707">PD-1 pathway inhibitors</a>, <a href="https://en.wikipedia.org/wiki/CRISPR">clustered regularly interspaced short palindromic repeats</a>, advances in solar energy technology, and prosthetic robotics.</p> <p>However, there is one general aspect of science that I do believe has become worse.  Specifically, it’s a shift in how much scientists jockey for media attention, even if it’s short-lived. Instead of striving for having a sustained impact on our field, which may take decades to achieve, an increasing number of scientists seem to be placing more value on appearing in the New York Times, giving a Ted Talk or having a blog or tweet go viral. As a consequence, too many of us end up working on superficial short term challenges that, with the help of a professionally crafted press release, may result in an attention grabbing media report. NB: I fully support science communication efforts, but not when the primary purpose is garnering attention, rather than educating.</p> <p>My concern spills over to funding agencies and philanthropic organizations as well. Consider the following two options. Option 1: be the funding agency representative tasked with organizing a big science project with a well-oiled PR machine. Option 2: be the funding agency representative in charge of several small projects, one of which may with low, but non-negligible, probability result in a Nobel Prize 30 years down the road. 
In the current environment, I see a preference for option 1.</p> <p>I am also concerned about how this atmosphere may negatively affect societal improvements within science. Publicly shaming transgressors on Twitter or expressing one’s outrage on a blog post can garner many social media clicks. However, these may have a smaller positive impact than mundane activities such as serving on a committee that, after several months of meetings, implements incremental, yet positive, changes. Time and energy spent on trying to increase internet clicks is time and energy we don’t spend on the tedious administrative activities that are needed to actually effect change.</p> <p>Because so many of the scientists that thrive in this atmosphere of short-lived media reports are disproportionately rewarded, I imagine investigators starting their careers feel some pressure to garner some media attention of their own. Furthermore, their view of how they are evaluated may be highly biased because evaluators that ignore media reports and focus more on the specifics of the scientific content tend to be less visible. So if you want to spend your academic career slowly building a body of work with the hopes of being appreciated decades from now, you should not think that it is hopeless based on what is perhaps a distorted view of how we are currently being evaluated.</p> <p>Update: changed topological insulators links to <a href="http://scienceblogs.com/principles/2010/07/20/whats-a-topological-insulator/">these</a> <a href="http://physics.gmu.edu/~pnikolic/articles/Topological%20insulators%20(Physics%20World,%20February%202011).pdf">two</a>. <a href="http://spectrum.ieee.org/semiconductors/materials/topological-insulators">Here</a> is one more. 
Via David S.</p> Interview at Leanpub 2015-06-16T21:49:33+00:00 http://simplystats.github.io/2015/06/16/interview-at-leanpub <p>A few weeks ago I sat down with Len Epp over at Leanpub to talk about my recently published book <em><a href="https://leanpub.com/rprogramming">R Programming for Data Science</a></em>. So far, I’ve only published one book through Leanpub but I’m a huge fan. They’ve developed a system that is, in my opinion, perfect for academic publishing. The book’s written in Markdown and they compile it into PDF, ePub, and mobi formats automatically.</p> <p>The full interview transcript is over at the <a href="http://blog.leanpub.com/2015/06/roger-peng.html">Leanpub blog</a>. If you want to listen to the audio of the interview, you can subscribe to the Leanpub <a href="https://itunes.apple.com/ca/podcast/id517117137?mt=2">podcast on iTunes</a>.</p> <p><a href="https://leanpub.com/rprogramming"><em>R Programming for Data Science</em></a> is available at Leanpub for a suggested price of $15 (but you can get it for free if you want). R code files, datasets, and video lectures are available through the various add-on packages. Thanks to all of you who’ve already bought a copy!</p> Johns Hopkins Data Science Specialization Capstone 2 Top Performers 2015-06-10T14:33:09+00:00 http://simplystats.github.io/2015/06/10/johns-hopkins-data-science-specialization-captsone-2-top-performers <p><em>The second capstone session of the <a href="https://www.coursera.org/specialization/jhudatascience/1?utm_medium=listingPage">Johns Hopkins Data Science Specialization</a> concluded recently. This time, we had 1,040 learners sign up to participate in the session, which again featured a project developed in collaboration with the amazingly innovative folks at <a href="http://swiftkey.com/en/">SwiftKey</a>. </em></p> <p><em>We’ve identified the learners listed below as the top performers in this capstone session. 
This is an incredibly talented group of people who have worked very hard throughout the entire nine-course specialization.  Please take some time to read their stories and look at their work. </em></p> <h1 id="ben-apple">Ben Apple</h1> <p><a href="http://simplystatistics.org/wp-content/uploads/2015/06/Ben_Apple.jpg"><img class="aligncenter size-medium wp-image-4091" src="http://simplystatistics.org/wp-content/uploads/2015/06/Ben_Apple-300x285.jpg" alt="Ben_Apple" width="300" height="285" srcset="http://simplystatistics.org/wp-content/uploads/2015/06/Ben_Apple-300x285.jpg 300w, http://simplystatistics.org/wp-content/uploads/2015/06/Ben_Apple.jpg 360w" sizes="(max-width: 300px) 100vw, 300px" /></a></p> <p>Ben Apple is a Data Scientist and Enterprise Architect with the Department of Defense.  Mr. Apple holds a MS in Information Assurance and is a PhD candidate in Information Sciences.</p> <h4 id="why-did-you-take-the-jhu-data-science-specialization"><strong>Why did you take the JHU Data Science Specialization?</strong></h4> <p>As a self trained data scientist I was looking for a program that would formalize my established skills while expanding my data science knowledge and tool box.</p> <h4 id="what-are-you-most-proud-of-doing-as-part-of-the-jhu-data-science-specialization"><strong>What are you most proud of doing as part of the JHU Data Science Specialization?</strong></h4> <p>The capstone project was the most demanding aspect of the program.  As such I am most proud of the final project.  
The project stretched each of us beyond the standard course work of the program and was quite satisfying.</p> <h4 id="how-are-you-planning-on-using-your-data-science-specialization-certificate"><strong>How are you planning on using your Data Science Specialization Certificate?</strong></h4> <p>To open doors so that I may further my research into the operational value of applying data science thought and practice to analytics of my domain.</p> <p><strong>Final Project: </strong><a href="https://bengapple.shinyapps.io/coursera_nlp_capstone">https://bengapple.shinyapps.io/coursera_nlp_capstone</a></p> <p><strong>Project Slide Deck: </strong><a href="http://rpubs.com/bengapple/71376">http://rpubs.com/bengapple/71376</a></p> <p> </p> <h1 id="ivan-corneillet">Ivan Corneillet</h1> <p><a href="http://simplystatistics.org/wp-content/uploads/2015/06/Ivan.Corneillet.jpg"><img class="aligncenter size-medium wp-image-4092" src="http://simplystatistics.org/wp-content/uploads/2015/06/Ivan.Corneillet-300x300.jpg" alt="Ivan.Corneillet" width="300" height="300" srcset="http://simplystatistics.org/wp-content/uploads/2015/06/Ivan.Corneillet-300x300.jpg 300w, http://simplystatistics.org/wp-content/uploads/2015/06/Ivan.Corneillet-200x200.jpg 200w, http://simplystatistics.org/wp-content/uploads/2015/06/Ivan.Corneillet.jpg 400w" sizes="(max-width: 300px) 100vw, 300px" /></a></p> <p>A technologist, thinker, and tinkerer, Ivan facilitates the establishment of start-up companies by advising these companies about the hiring process, product development, and technology development, including big data, cloud computing, and cybersecurity. In his 17-year career, Ivan has held a wide range of engineering and management positions at various Silicon Valley companies. 
Ivan is a recent Wharton MBA graduate, and he previously earned his master’s degree in computer science from the Ensimag, and his master’s degree in electrical engineering from Université Joseph Fourier, both located in France.</p> <p><strong>Why did you take the JHU Data Science Specialization?</strong></p> <p>There are three reasons why I decided to enroll in the JHU Data Science Specialization. First, fresh from college, my formal education was best suited for scaling up the Internet’s infrastructure. However, because every firm in every industry now creates products and services from analyses of data, I challenged myself to learn about Internet-scale datasets. Second, I am a big supporter of MOOCs. I do not believe that MOOCs should replace traditional education; however, I do believe that MOOCs and traditional education will eventually coexist in the same way that open-source and closed-source software does (read my blog post for more information on this topic: http://ivantur.es/16PHild). Third, the Johns Hopkins University brand certainly motivated me to choose their program. With a great name comes a great curriculum and fantastic professors, right?</p> <p>Once I had completed the program, I was not disappointed at all. I had read a blog post that explained that the JHU Data Science Specialization was only a start to learning about data science. I certainly agree, but I would add that this program is a great start, because the curriculum emphasizes information that is crucial, while providing additional resources to those who wish to deepen their understanding of data science. My thanks to Professors Caffo, Leek, and Peng; the TAs, and Coursera for building and delivering this track!</p> <p><strong>What are you most proud of doing as part of the JHU Data Science Specialization?</strong></p> <p>The capstone project made for a very rich and exhilarating learning experience, and was my favorite course in the specialization. 
Because I did not have prior knowledge in natural language processing (NLP), I had to conduct a fair amount of research. However, the program’s minimal-guidance approach mimicked a real-world environment, and gave me the opportunity to leverage my experience with developing code and designing products to get the most out of the skillset taught in the track. The result was that I created a data product that implemented state-of-the-art NLP algorithms using what I think are the best technologies (i.e., C++, JavaScript, R, Ruby, and SQL), given the choices that I had made. Bringing everything together is what made me the most proud. Additionally, my product capabilities are a far cry from IBM’s Watson, but while I am well versed in supercomputer hardware, this track helped me to gain a much deeper appreciation of Watson’s AI.</p> <h4 id="how-are-you-planning-on-using-your-data-science-specialization-certificate-1"><strong>How are you planning on using your Data Science Specialization Certificate?</strong></h4> <p>Thanks to the broad skillset that the specialization covered, I feel confident wearing a data science hat. The concepts and tools covered in this program helped me to better understand the concerns that data scientists have and the challenges they face. 
From a business standpoint, I am also better equipped to identify the opportunities that lie ahead.</p> <p><strong>Final Project: </strong><a href="https://paspeur.shinyapps.io/wordmaster-io/">https://paspeur.shinyapps.io/wordmaster-io/</a></p> <p><strong>Project Slide Deck: </strong><a href="http://rpubs.com/paspeur/wordmaster-io">http://rpubs.com/paspeur/wordmaster-io</a></p> <h1 id="oscar-de-león">Oscar de León</h1> <p><a href="http://simplystatistics.org/wp-content/uploads/2015/06/Oscar_De_Leon.jpg"><img class="aligncenter size-medium wp-image-4093" src="http://simplystatistics.org/wp-content/uploads/2015/06/Oscar_De_Leon-300x225.jpg" alt="Oscar_De_Leon" width="300" height="225" srcset="http://simplystatistics.org/wp-content/uploads/2015/06/Oscar_De_Leon-120x90.jpg 120w, http://simplystatistics.org/wp-content/uploads/2015/06/Oscar_De_Leon-300x225.jpg 300w, http://simplystatistics.org/wp-content/uploads/2015/06/Oscar_De_Leon-1024x768.jpg 1024w, http://simplystatistics.org/wp-content/uploads/2015/06/Oscar_De_Leon-260x195.jpg 260w" sizes="(max-width: 300px) 100vw, 300px" /></a></p> <p>Oscar is an assistant researcher at a research institute in a developing country. He graduated as a licentiate in biochemistry and microbiology in 2010 from the same university that hosts the institute. He has always loved technology, programming and statistics and has engaged in self learning of these subjects from an early age, finally using his abilities in the health-related research in which he has been involved since 2008. 
He is now working on the design, execution and analysis of various research projects, consulting for other researchers and students, and is looking forward to developing his academic career in biostatistics.</p> <h4 id="why-did-you-take-the-jhu-data-science-specialization-1"><strong>Why did you take the JHU Data Science Specialization?</strong></h4> <p>I wanted to integrate my R experience into a more comprehensive data analysis workflow, which is exactly what this specialization offers. This was in line with the objectives of my position at the research institute in which I work, so I presented a study plan to my supervisor and she approved it. I also wanted to engage in an activity which enabled me to document my abilities in a verifiable way, and a Coursera Specialization seemed like a good option.</p> <p>Additionally, I’ve followed the JHSPH group’s courses since the first offering of Mathematical Biostatistics Bootcamp in November 2012. They have proved the standards and quality of education at their institution, and it was not something to let go by.</p> <h4 id="what-are-you-most-proud-of-doing-as-part-of-the-jhu-data-science-specialization-1"><strong>What are you most proud of doing as part of the JHU Data Science Specialization?</strong></h4> <p>I’m not one to usually interact with other students, and certainly didn’t do it during most of the specialization courses, but I decided to try out the fora on the Capstone project. It was wonderful; sharing ideas with, and receiving criticism from, my peers provided a very complete learning experience. After all, my contributions ended up being appreciated by the community and a few posts stating it were very rewarding. 
This re-kindled my passion for teaching, and I’ll try to engage in it more from now on.</p> <h4 id="how-are-you-planning-on-using-your-data-science-specialization-certificate-2"><strong>How are you planning on using your Data Science Specialization Certificate?</strong></h4> <p>First, I’ll file it with HR at my workplace, since our research projects paid for the specialization <img src="http://simplystatistics.org/wp-includes/images/smilies/simple-smile.png" alt=":)" class="wp-smiley" style="height: 1em; max-height: 1em;" /></p> <p>I plan to use the certificate as a credential for data analysis with R when it is relevant. For example, I’ve been interested in offering an R workshop for life sciences students and researchers at my University, and this certificate (and the projects I prepared during the specialization) could help me show I have a working knowledge on the subject.</p> <p><strong>Final Project: </strong><a href="https://odeleon.shinyapps.io/ngram/">https://odeleon.shinyapps.io/ngram/</a></p> <p><strong>Project Slide Deck: </strong><a href="http://rpubs.com/chemman/n-gram">http://rpubs.com/chemman/n-gram</a></p> <h1 id="jeff-hedberg">Jeff Hedberg</h1> <p><a href="http://simplystatistics.org/wp-content/uploads/2015/06/Jeff_Hedberg.jpg"><img class="aligncenter size-full wp-image-4094" src="http://simplystatistics.org/wp-content/uploads/2015/06/Jeff_Hedberg.jpg" alt="Jeff_Hedberg" width="200" height="200" /></a></p> <p>I am passionate about turning raw data into actionable insights that solve relevant business problems. I also greatly enjoy leading large, multi-functional projects with impact in areas pertaining to machine and/or sensor data.  
I have a Mechanical Engineering Degree and an MBA, in addition to a wide range of Data Science (IT/Coding) skills.</p> <h4 id="why-did-you-take-the-jhu-data-science-specialization-2"><strong>Why did you take the JHU Data Science Specialization?</strong></h4> <p>I was looking to gain additional exposure into Data Science as a current practitioner, and thought this would be a great program.</p> <h4 id="what-are-you-most-proud-of-doing-as-part-of-the-jhu-data-science-specialization-2"><strong>What are you most proud of doing as part of the JHU Data Science Specialization?</strong></h4> <p>I am most proud of completing all courses with distinction (top of peers).  Also, I’m proud to have achieved full points on my Capstone project having no prior experience in Natural Language Processing.</p> <h4 id="how-are-you-planning-on-using-your-data-science-specialization-certificate-3"><strong>How are you planning on using your Data Science Specialization Certificate?</strong></h4> <p>I am going to add this to my Resume and LinkedIn Profile.  
I will use it to solidify my credibility as a data science practitioner of value.</p> <p><strong>Final Project: </strong><a href="https://hedbergjeffm.shinyapps.io/Next_Word_Prediction/">https://hedbergjeffm.shinyapps.io/Next_Word_Prediction/</a></p> <p><strong>Project Slide Deck: </strong><a href="https://rpubs.com/jhedbergfd3s/74960">https://rpubs.com/jhedbergfd3s/74960</a></p> <h1 id="hernán-martínez-foffani">Hernán Martínez-Foffani</h1> <p><a href="http://simplystatistics.org/wp-content/uploads/2015/06/Hernán_Martínez-Foffani.jpg"><img class="aligncenter size-medium wp-image-4095" src="http://simplystatistics.org/wp-content/uploads/2015/06/Hernán_Martínez-Foffani-300x225.jpg" alt="Hernán_Martínez-Foffani" width="300" height="225" srcset="http://simplystatistics.org/wp-content/uploads/2015/06/Hernán_Martínez-Foffani-120x90.jpg 120w, http://simplystatistics.org/wp-content/uploads/2015/06/Hernán_Martínez-Foffani-300x225.jpg 300w, http://simplystatistics.org/wp-content/uploads/2015/06/Hernán_Martínez-Foffani-1024x768.jpg 1024w, http://simplystatistics.org/wp-content/uploads/2015/06/Hernán_Martínez-Foffani-260x195.jpg 260w, http://simplystatistics.org/wp-content/uploads/2015/06/Hernán_Martínez-Foffani.jpg 1256w" sizes="(max-width: 300px) 100vw, 300px" /></a></p> <p>I was born in Argentina but now I’m settled in Spain. I’ve been working in computer technology since the eighties, in digital networks, programming, consulting, project management. Now, as CTO in a software company, I lead a small team of programmers developing a supply chain management app.</p> <h4 id="why-did-you-take-the-jhu-data-science-specialization-3"><strong>Why did you take the JHU Data Science Specialization?</strong></h4> <p>In my opinion the curriculum is carefully designed with a nice balance between theory and practice. The JHU authoring and the teachers’ widely known prestige ensure the content quality.
The ability to choose the learning pace, one per month in my case, fits everyone’s schedule.</p> <h4 id="what-are-you-most-proud-of-doing-as-part-of-the-jhu-data-science-specialization-3"><strong>What are you most proud of doing as part of the JHU Data Science Specialization?</strong></h4> <p>The capstone definitely. It resulted in a fresh and interesting challenge. I sweated a lot, learned much more and in the end had a lot of fun.</p> <h4 id="how-are-you-planning-on-using-your-data-science-specialization-certificate-4"><strong>How are you planning on using your Data Science Specialization Certificate?</strong></h4> <p>While for the time being I don’t have any specific plan for the certificate, it’s a beautiful reward for the effort done.</p> <p><strong>Final Project: </strong><a href="https://herchu.shinyapps.io/shinytextpredict/">https://herchu.shinyapps.io/shinytextpredict/</a></p> <p><strong>Project Slide Deck: </strong><a href="http://rpubs.com/herchu1/shinytextprediction">http://rpubs.com/herchu1/shinytextprediction</a></p> <h1 id="francois-schonken">Francois Schonken</h1> <p><a href="http://simplystatistics.org/wp-content/uploads/2015/06/Francois-Schonken1.jpg"><img class="aligncenter size-medium wp-image-4097" src="http://simplystatistics.org/wp-content/uploads/2015/06/Francois-Schonken1-197x300.jpg" alt="Francois Schonken" width="197" height="300" /></a></p> <p>I’m a 36-year-old South African male born and raised. I recently (4 years now) immigrated to lovely Melbourne, Australia. I wrapped up a BSc (Hons) Computer Science with specialization in Computer Systems back in 2001. Next I co-founded a small boutique Software Development house operating from South Africa.
I wrapped up my MBA, from Melbourne Business School, in 2013 and now I consult for my small boutique Software Development house and 2 (very) small internet start-ups.</p> <h4 id="why-did-you-take-the-jhu-data-science-specialization-4"><strong>Why did you take the JHU Data Science Specialization?</strong></h4> <p>One of the core subjects in my MBA was Data Analysis, basically an MBA take on undergrad Statistics with focus on application over theory (not that there was any shortage of theory). Waiting in a lobby room some 6 months later I was paging through the financial section of a business-focused weekly. I came across an article explaining how a Melbourne local applied a language called R to solve a grammatically and statistically challenging issue. The rest, as they say, is history.</p> <h4 id="what-are-you-most-proud-of-doing-as-part-of-the-jhu-data-science-specialization-4"><strong>What are you most proud of doing as part of the JHU Data Science Specialization?</strong></h4> <p>I’m quite proud of both my Developing Data Products and Capstone projects, but for me these tangible outputs merely served as a vehicle to better understand a different way of thinking about data. I’ve spent most of my Software Development life dealing with one form or another of RDBMS (Relational Database Management System). This, in my experience, leads to a very set-oriented way of thinking about data.</p> <p>I’m most proud of developing a new tool in my “Skills Toolbox” that I consider highly complementary to both my Software and Business outlook on projects.</p> <h4 id="how-are-you-planning-on-using-your-data-science-specialization-certificate-5"><strong>How are you planning on using your Data Science Specialization Certificate?</strong></h4> <p>Honestly, I had not planned on using my Certificate in and of itself.
The skills I’ve acquired have already helped shape my thinking in designing an in-house web based consulting collaboration platform.</p> <p>I do not foresee this being the last time I’ll be applying Data Science thinking moving forward on my journey.</p> <p><strong>Final Project: </strong><a href="https://schonken.shinyapps.io/WordPredictor">https://schonken.shinyapps.io/WordPredictor</a></p> <p><strong>Project Slide Deck: </strong><a href="http://rpubs.com/schonken/sentence-builder">http://rpubs.com/schonken/sentence-builder</a></p> <h1 id="david-j-tagler">David J. Tagler</h1> <p>David is passionate about solving the world’s most important and challenging problems. His expertise spans chemical/biomedical engineering, regenerative medicine, healthcare technology management, information technology/security, and data science/analysis. David earned his Ph.D. in Chemical Engineering from Northwestern University and B.S. in Chemical Engineering from the University of Notre Dame.</p> <h4 id="why-did-you-take-the-jhu-data-science-specialization-5"><strong>Why did you take the JHU Data Science Specialization?</strong></h4> <p>I enrolled in this specialization in order to advance my statistics, programming, and data analysis skills. I was interested in taking a series of courses that covered the entire data science pipeline. I believe that these skills will be critical for success in the future.</p> <h4 id="what-are-you-most-proud-of-doing-as-part-of-the-jhu-data-science-specialization-5"><strong>What are you most proud of doing as part of the JHU Data Science Specialization?</strong></h4> <p>I am most proud of the R programming and modeling skills that I developed throughout this specialization. Previously, I had no experience with R.
Now, I can effectively manage complex data sets, perform statistical analyses, build prediction models, create publication-quality figures, and deploy web applications.</p> <h4 id="how-are-you-planning-on-using-your-data-science-specialization-certificate-6"><strong>How are you planning on using your Data Science Specialization Certificate?</strong></h4> <p>I look forward to utilizing these skills in future research projects. Furthermore, I plan to take additional courses in data science, machine learning, and bioinformatics.</p> <p><strong>Final Project: </strong><a href="http://dt444.shinyapps.io/next-word-predict">http://dt444.shinyapps.io/next-word-predict</a></p> <p><strong>Project Slide Deck: </strong><a href="http://rpubs.com/dt444/next-word-predict">http://rpubs.com/dt444/next-word-predict</a></p> <h1 id="melissa-tan">Melissa Tan</h1> <p><a href="http://simplystatistics.org/wp-content/uploads/2015/06/MelissaTan.png"><img class="aligncenter size-medium wp-image-4099" src="http://simplystatistics.org/wp-content/uploads/2015/06/MelissaTan-300x198.png" alt="MelissaTan" width="300" height="198" srcset="http://simplystatistics.org/wp-content/uploads/2015/06/MelissaTan-300x198.png 300w, http://simplystatistics.org/wp-content/uploads/2015/06/MelissaTan-260x172.png 260w" sizes="(max-width: 300px) 100vw, 300px" /></a></p> <p>I’m a financial journalist from Singapore. I did philosophy and computer science at the University of Chicago, and I’m keen on picking up more machine learning and data viz skills.</p> <h4 id="why-did-you-take-the-jhu-data-science-specialization-6"><strong>Why did you take the JHU Data Science Specialization?</strong></h4> <p>I wanted to keep up with coding, while learning new tools and techniques for wrangling and analyzing data that I could potentially apply to my job. Plus, it sounded fun.
<img src="http://simplystatistics.org/wp-includes/images/smilies/simple-smile.png" alt=":)" class="wp-smiley" style="height: 1em; max-height: 1em;" /></p> <h4 id="what-are-you-most-proud-of-doing-as-part-of-the-jhu-data-science-specialization-6"><strong>What are you most proud of doing as part of the JHU Data Science Specialization?</strong></h4> <p>Building a word prediction app pretty much from scratch (with a truckload of forum reading). The capstone project seemed insurmountable initially and ate up all my weekends, but getting the app to work passably was worth it.</p> <h4 id="how-are-you-planning-on-using-your-data-science-specialization-certificate-7"><strong>How are you planning on using your Data Science Specialization Certificate?</strong></h4> <p>It’ll go on my CV, but I think it’s more important to be able to actually do useful things. I’m keeping an eye out for more practical opportunities to apply and sharpen what I’ve learnt.</p> <p><strong>Final Project: </strong><a href="https://melissatan.shinyapps.io/word_psychic/">https://melissatan.shinyapps.io/word_psychic/</a></p> <p><strong>Project Slide Deck: </strong><a href="https://rpubs.com/melissatan/capstone">https://rpubs.com/melissatan/capstone</a></p> <h1 id="felicia-yii">Felicia Yii</h1> <p><a href="http://simplystatistics.org/wp-content/uploads/2015/06/FeliciaYii.jpg"><img class="aligncenter size-medium wp-image-4100" src="http://simplystatistics.org/wp-content/uploads/2015/06/FeliciaYii-232x300.jpg" alt="FeliciaYii" width="232" height="300" srcset="http://simplystatistics.org/wp-content/uploads/2015/06/FeliciaYii-232x300.jpg 232w, http://simplystatistics.org/wp-content/uploads/2015/06/FeliciaYii-793x1024.jpg 793w" sizes="(max-width: 232px) 100vw, 232px" /></a></p> <p>Felicia likes to dream, think and do. With over 20 years in the IT industry, her current fascination is at the intersection of people, information and decision-making.
Ever inquisitive, she has acquired an expertise in subjects as diverse as coding, cookery, costume making and cosmetics chemistry. It’s not apparent that there is anything she can’t learn to do, apart from housework. Felicia lives in Wellington, New Zealand with her husband, two children and two cats.</p> <h4 id="why-did-you-take-the-jhu-data-science-specialization-7"><strong>Why did you take the JHU Data Science Specialization?</strong></h4> <p>Well, I love learning and the JHU Data Science Specialization appealed to my thirst for a new challenge. I’m really interested in how we can use data to help people make better decisions. There’s so much data out there these days that it is easy to be overwhelmed by it all. Data visualisation was at the heart of my motivation when starting out. As I got into the nitty gritty of the course, I really began to see the power of making data accessible and appealing to the data-agnostic world. There’s so much potential for data science thinking in my professional work.</p> <h4 id="what-are-you-most-proud-of-doing-as-part-of-the-jhu-data-science-specialization-7"><strong>What are you most proud of doing as part of the JHU Data Science Specialization?</strong></h4> <p>Getting through it for starters while also working and looking after two children. Seriously though, being able to say I know what ‘practical machine learning’ is all about. Before I started the course, I had limited knowledge of statistics, let alone knowing how to apply them in a machine learning context. I was thrilled to be able to use what I learned to test a cool game concept in my final project.</p> <h4 id="how-are-you-planning-on-using-your-data-science-specialization-certificate-8"><strong>How are you planning on using your Data Science Specialization Certificate?</strong></h4> <p>I want to use what I have learned in as many ways as possible. Firstly, I see opportunities to apply my skills to my day-to-day work in information technology.
Secondly, I would like to help organisations that don’t have the skills or expertise in-house to apply data science thinking to help their decision making and communication. Thirdly, it would be cool one day to have my own company consulting on data science. I’ve more work to do to get there though!</p> <p><strong>Final Project: </strong><a href="https://micasagroup.shinyapps.io/nwpgame/">https://micasagroup.shinyapps.io/nwpgame/</a></p> <p><strong>Project Slide Deck: </strong><a href="https://rpubs.com/MicasaGroup/74788">https://rpubs.com/MicasaGroup/74788</a></p> <p> </p> Batch effects are everywhere! Deflategate edition 2015-06-09T11:47:27+00:00 http://simplystats.github.io/2015/06/09/batch-effects-are-everywhere-deflategate-edition <p>In my opinion, batch effects are the biggest challenge faced by genomics research, especially in precision medicine. As we point out in <a href="http://www.ncbi.nlm.nih.gov/pubmed/20838408">this review</a>, they are everywhere among high-throughput experiments. But batch effects are not specific to genomics technology. In fact, in <a href="http://amstat.tandfonline.com/doi/abs/10.1080/00401706.1972.10488878">this 1972 paper</a> (paywalled), <a href="http://en.wikipedia.org/wiki/William_J._Youden">WJ Youden</a> describes batch effects in the context of measurements made by physicists. 
Check out this plot of <a href="https://en.wikipedia.org/wiki/Astronomical_unit">astronomical unit</a> <del>speed of light</del> estimates <strong>with an estimate of spread <del>confidence intervals</del></strong> (red and green are same lab).</p> <p><a href="http://simplystatistics.org/wp-content/uploads/2015/06/Rplot.png"><img class=" wp-image-4295 aligncenter" src="http://simplystatistics.org/wp-content/uploads/2015/06/Rplot.png" alt="Rplot" width="467" height="290" srcset="http://simplystatistics.org/wp-content/uploads/2015/06/Rplot-300x186.png 300w, http://simplystatistics.org/wp-content/uploads/2015/06/Rplot.png 903w" sizes="(max-width: 467px) 100vw, 467px" /></a></p> <p style="text-align: center;"> <p> &nbsp; </p> <p> Sometimes you find batch effects where you least expect them. For example, in the <a href="http://en.wikipedia.org/wiki/Deflategate">deflategate</a> debate. Here is a quote from the New England Patriots' deflategate<a href="http://www.boston.com/sports/football/patriots/2015/05/14/key-takeaways-from-the-patriots-deflategate-report-rebuttal/hK0J0J9abNgtGyhTwlW53L/story.html"> rebuttal</a> (written with help from Nobel Prize winner <a href="http://en.wikipedia.org/wiki/Roderick_MacKinnon">Roderick MacKinnon</a>): </p> <blockquote> <p> in other words, the Colts balls were measured after the Patriots balls and had warmed up more. For the above reasons, the Wells Report conclusion that physical law cannot explain the pressures is incorrect. </p> </blockquote> <p style="text-align: left;"> Here is another one: </p> <blockquote> <p style="text-align: left;"> In the pressure measurements physical conditions were not very well-defined and major uncertainties, such as which gauge was used in pre-game measurements, affect conclusions. </p> </blockquote> <p style="text-align: left;"> So NFL, please read <a href="http://www.ncbi.nlm.nih.gov/pubmed/20838408">our paper</a> before you accuse a player of cheating.
</p> <p style="text-align: left;"> Disclaimer: I live in New England but I am a <a href="http://www.urbandictionary.com/define.php?term=Ball+so+Hard+University">Ravens</a> fan. </p> </p> I'm a data scientist - mind if I do surgery on your heart? 2015-06-08T14:15:39+00:00 http://simplystats.github.io/2015/06/08/im-a-data-scientist-mind-if-i-do-surgery-on-your-heart <p>There has been a lot of recent interest from scientific journals and from other folks in creating checklists for data science and data analysis. The idea is that the checklist will help prevent results that won’t reproduce or replicate from the literature. One analogy that I’m frequently hearing is the analogy with checklists for surgeons that <a href="http://www.nejm.org/doi/full/10.1056/NEJMsa0810119">can help reduce patient mortality</a>.</p> <p>The one major difference between checklists for surgeons and checklists I’m seeing for research purposes is the difference in credentialing between people allowed to perform surgery and people allowed to perform complex data analysis. You would never let me do surgery on you. I have no medical training at all. But I’m frequently asked to review papers that include complicated and technical data analyses, but have no trained data analysts or statisticians. The most common approach is that a postdoc or graduate student in the group is assigned to do the analysis, even if they don’t have much formal training. Whenever this happens red flags go up all over the place. Just like I wouldn’t trust someone without years of training and a medical license to do surgery on me, I wouldn’t let someone without years of training and credentials in data analysis make major conclusions from complex data analysis.</p> <p>You might argue that the consequences for surgery and for complex data analysis are on completely different scales. I’d agree with you, but not in the direction that you might think.
I would argue that high pressure and complex data analysis can have much larger consequences than surgery. In surgery there is usually only one person that can be hurt. But if you do a bad data analysis, say, claiming that <a href="http://www.ncbi.nlm.nih.gov/pubmed/9500320">vaccines cause autism</a>, that can have massive consequences for hundreds or even thousands of people. So complex data analysis, especially for important results, should be treated with at least as much care as surgery.</p> <p>The reason why I don’t think checklists alone will solve the problem is that they are likely to be used by people without formal training. One obvious (and recent) example that I think makes this really clear is the <a href="https://developer.apple.com/healthkit/">HealthKit</a> data we are about to start seeing. A ton of people signed up for studies on their iPhones and it has been all over the news. The checklist will (almost certainly) say to have a big sample size. HealthKit studies will certainly pass the checklist, but they are going to get <a href="http://en.wikipedia.org/wiki/Dewey_Defeats_Truman">Truman/Deweyed</a> big time if they aren’t careful about biased sampling.</p> <div> If I walked into an operating room and said I'm going to start dabbling in surgery I would be immediately thrown out. But people do that with statistics and data analysis all the time. What we really need is to require careful training and expertise in data analysis on each paper that analyzes data. Until we treat it as a first class component of the scientific process we'll continue to see retractions, falsifications, and irreproducible results flourish. </div> Interview with Class Central 2015-06-04T09:27:20+00:00 http://simplystats.github.io/2015/06/04/4063 <p>Recently I sat down with Class Central to do an interview about the Johns Hopkins Data Science Specialization. We talked about the motivation for designing the sequence and the capstone project.
With the demand for data science skills greater than ever, the importance of the specialization is only increasing.</p> <p>See the <a href="https://www.class-central.com/report/data-science-specialization/">full interview</a> at the Class Central site. Below is a short excerpt.</p> Interview with Chris Wiggins, chief data scientist at the New York Times 2015-06-01T09:00:27+00:00 http://simplystats.github.io/2015/06/01/interview-with-chris-wiggins-chief-data-scientist-at-the-new-york-times <p><em>Editor’s note: We are trying something a little new here and doing an interview with Google Hangouts on Air. The interview will be live at 11:30am EST. I have some questions lined up for Chris, but if you have others you’d like to ask, you can tweet them @simplystats and I’ll see if I can work them in. After the livestream we’ll leave the video on Youtube so you can check out the interview if you can’t watch the live stream. I’m embedding the Youtube video here but if you can’t see the live stream when it is running go check out the event page: <a href="https://plus.google.com/events/c7chrkg0ene47mikqrvevrg3a4o">https://plus.google.com/events/c7chrkg0ene47mikqrvevrg3a4o</a>.</em></p> Science is a calling and a career, here is a career planning guide for students and postdocs 2015-05-28T10:16:47+00:00 http://simplystats.github.io/2015/05/28/science-is-a-calling-and-a-career-here-is-a-career-planning-guide-for-students-and-postdocs <p><em>Editor’s note: This post was inspired by a really awesome</em> <a href="https://github.com/BenLangmead/langmead-lab/blob/master/postdoc_questionnaire.md"><em>career planning guide that Ben Langmead created</em></a><em>, which you should go check out right now.
You can also find the slightly adapted</em> <a href="https://github.com/jtleek/careerplanning"><em>Leek group career planning guide</em></a> <em>here.</em></p> <p>The most common reason that people go into science is altruistic. They loved dinosaurs and spaceships when they were kids and that never wore off. On some level this is one of the reasons I love this field so much: it is an area that, if you can get past all the hard parts, can really keep introducing wonder into what you work on every day.</p> <p>Sometimes I feel like this altruism has negative consequences. For example, I think that there is less emphasis on the career planning and development side in the academic community. I don’t think this is malicious, but I do think that sometimes people think of the career part of science as unseemly. But if you have any job that you want people to pay you to do, then there will be parts of that job that will be career oriented. So if you want to be a professional scientist, being brilliant and good at science is not enough. You also need to pay attention to and carefully plan your career trajectory.</p> <p>A colleague of mine, Ben Langmead, created a really nice guide for his postdocs to thinking about and planning the career side of a postdoc <a href="https://github.com/BenLangmead/langmead-lab/blob/master/postdoc_questionnaire.md">which he has over on Github</a>. I thought it was such a good idea that I immediately modified it and asked all of my graduate students and postdocs to fill it out. It is kind of long so there was no penalty if they didn’t finish it, but I think it is an incredibly useful tool for thinking about how to strategize a career in the sciences.
I think that the more concrete we are about the career side of graduate school and postdocs, including being honest about all the realistic options available, the better prepared our students will be to succeed on the market.</p> <p>You can find the <a href="https://github.com/jtleek/careerplanning">Leek Group Guide to Career Planning</a> here and make sure you also go <a href="https://github.com/BenLangmead/langmead-lab/blob/master/postdoc_questionnaire.md">check out Ben’s</a> since it was his idea and his is great.</p> <p> </p> Is it species or is it batch? They are confounded, so we can't know 2015-05-20T11:11:18+00:00 http://simplystats.github.io/2015/05/20/is-it-species-or-is-it-batch-they-are-confounded-so-we-cant-know <p>In a 2005 OMICS <a href="http://online.liebertpub.com/doi/abs/10.1089/153623104773547462" target="_blank">paper</a>, an analysis of human and mouse gene expression microarray measurements from several tissues led the authors to conclude that “any tissue is more similar to any other human tissue examined than to its corresponding mouse tissue”. Note that this was a rather surprising result given how similar tissues are between species. For example, both mice and humans see with their eyes, breathe with their lungs, pump blood with their hearts, etc… Two follow-up papers (<a href="http://mbe.oxfordjournals.org/content/23/3/530.abstract?ijkey=2c3d98666afbc99949fdcf514f10e3fedadee259&amp;keytype2=tf_ipsecsha" target="_blank">here</a> and <a href="http://mbe.oxfordjournals.org/content/24/6/1283.abstract?ijkey=366fdf09da56a5dd0cfdc5f74082d9c098ae7801&amp;keytype2=tf_ipsecsha" target="_blank">here</a>) demonstrated that platform-specific technical variability was the cause of this apparent dissimilarity. The arrays used for the two species were different and thus measurement platform and species were completely <strong>confounded</strong>.
In a 2010 paper, we confirmed that once this technical variability was accounted for, the number of genes expressed in common between the same tissue across the two species was much higher than those expressed in common between two species across the different tissues (see Figure 2 <a href="http://nar.oxfordjournals.org/content/39/suppl_1/D1011.full" target="_blank">here</a>).</p> <p>So <a href="http://genomicsclass.github.io/book/pages/confounding.html">what is confounding</a> and <a href="http://www.nature.com/ng/journal/v39/n7/full/ng0707-807.html">why is it a problem</a>? This topic has been discussed broadly. We wrote a <a href="http://www.nature.com/nrg/journal/v11/n10/full/nrg2825.html">review</a> some time ago. But based on recent discussions I’ve participated in, it seems that there is still some confusion. Here I explain, aided by some math, how confounding leads to problems in the context of estimating species effects in genomics. We will use</p> <ul> <li><em>X<sub>i</sub></em> to represent the gene expression measurements for human tissue <em>i,</em></li> <li><em>a<sub>X</sub></em> to represent the level of expression that is specific to humans and</li> <li><em>b<sub>X</sub></em> to represent the batch effect introduced by the use of the human microarray platform.</li> <li>Therefore <em>X<sub>i</sub></em> = <em>a<sub>X</sub></em> + <em>b<sub>X</sub></em> + <em>e<sub>i</sub></em>, with <em>e<sub>i</sub></em> the tissue <em>i</em> effect and other uninteresting sources of variability.</li> </ul> <p>Similarly, we will use:</p> <ul> <li><em>Y<sub>i</sub></em> to represent the measurements for mouse tissue <em>i</em></li> <li><em>a<sub>Y</sub></em> to represent the mouse specific level and</li> <li><em>b<sub>Y</sub></em> the batch effect introduced by the use of the mouse microarray platform.</li> <li>Therefore <em>Y<sub>i</sub></em> = <em>a<sub>Y</sub></em> + <em>b<sub>Y</sub></em> + <em>f<sub>i</sub></em>, with <em>f<sub>i</sub></em> the tissue
<em>i</em> effect and other uninteresting sources of variability.</li> </ul> <p>If we are interested in estimating a species effect that is general across tissues, then we are interested in the following quantity:</p> <p style="text-align: center;">  <em>a<sub>Y</sub> - a<sub>X</sub></em> </p> <p>Naively, we would think that we can estimate this quantity using the observed differences between the species that cancel out the tissue effect. We observe a difference for each tissue: <em>Y<sub>1 </sub></em> - <em>X<sub>1 </sub></em>, <em>Y<sub>2</sub></em> - <em>X<sub>2 </sub></em>, etc… The problem is that <em>a<sub>X</sub></em> and <em>b<sub>X</sub></em> are always together as are <em>a<sub>Y</sub></em> and <em>b<sub>Y</sub></em>. We say that the batch effect <em>b<sub>X</sub></em> is <strong>confounded</strong> with the species effect <em>a<sub>X</sub></em>. Therefore, on average, the observed differences include both the species and the batch effects. To estimate the difference above we would write a model like this:</p> <p style="text-align: center;"> <em>Y<sub>i</sub></em> - <em>X<sub>i</sub></em> = (<em>a<sub>Y</sub> - a<sub>X</sub></em>) + (<em>b<sub>Y</sub> - b<sub>X</sub></em>) + other sources of variability </p> <p style="text-align: left;"> and then estimate the unknown quantities of interest: (<em>a<sub>Y</sub> - a<sub>X</sub></em>) and (<em>b<sub>Y</sub> - b<sub>X</sub></em>) from the observed data <em>Y<sub>1</sub></em> - <em>X<sub>1</sub></em>, <em>Y<sub>2</sub></em> - <em>X<sub>2</sub></em>, etc... The problem is that, we can estimate the aggregate effect (<em>a<sub>Y</sub> - a<sub>X</sub></em>) + (<em>b<sub>Y</sub> - b<sub>X</sub></em>), but, mathematically, we can't tease apart the two differences.  
To see this, note that if we are using least squares, the estimates (<em>a<sub>Y</sub> - a<sub>X</sub></em>) = 7, (<em>b<sub>Y</sub> - b<sub>X</sub></em>) = 3 will fit the data exactly as well as (<em>a<sub>Y</sub> - a<sub>X</sub></em>) = 3, (<em>b<sub>Y</sub> - b<sub>X</sub></em>) = 7 since </p> <p style="text-align: center;"> <em>{(Y-X) - (7+3)}<sup>2</sup> = {(Y-X) - (3+7)}<sup>2</sup>.</em> </p> <p style="text-align: left;"> In fact, under these circumstances, there are an infinite number of solutions to the standard statistical estimation approaches. A simple analogy is to try to find a unique solution to the single equation m+n = 0. If batch and species are not confounded then we are able to tease apart differences just as if we were given another equation: m+n=0; m-n=2. You can learn more about this in <a href="https://www.edx.org/course/introduction-linear-models-matrix-harvardx-ph525-2x">this linear models course</a>. </p> <p style="text-align: left;"> Note that the above derivation applies to each gene affected by the batch effect. In practice we commonly see hundreds of genes affected. As a consequence, when we compute distances between two samples from different species we may see large differences even where there is no species effect. This is because the <em>b<sub>Y</sub> - b<sub>X</sub></em> differences for each gene are squared and added up. </p> <p style="text-align: left;"> In summary, if you completely confound your variable of interest, in this case species, with a batch effect, you will not be able to estimate the effect of either. In fact, in a <a href="http://www.nature.com/nrg/journal/v11/n10/full/nrg2825.html">2010 Nature Reviews Genetics review</a> about batch effects we warned about "cases in which batch effects are confounded with an outcome of interest and result in misleading biological or clinical conclusions". We also warned that none of the existing solutions for batch effects (Combat, SVA, RUV, etc...) can save you from a situation with perfect confounding.
Because we can't always predict what will introduce unwanted variability, we recommend randomization as an experimental design approach. </p> <p style="text-align: left;"> Almost a decade after the OMICS paper was published, the same surprising conclusion was reached in <a href="http://www.pnas.org/content/111/48/17224.abstract" target="_blank">this PNAS paper</a>: "tissues appear more similar to one another within the same species than to the comparable organs of other species". This time RNAseq was used for both species and therefore the different platform issue was not considered<sup>*</sup>. Therefore, the authors implicitly assumed that (<em>b<sub>Y</sub> - b<sub>X</sub></em>)=0. However, in a recent F1000 Research <a href="http://f1000research.com/articles/4-121/v1" target="_blank">publication</a> Gilad and Mizrahi-Man describe an exercise in <a href="http://projecteuclid.org/euclid.aoas/1267453942">forensic bioinformatics</a> that led them to discover that mouse and human samples were run in different lanes or different instruments. The confounding was near perfect (see <a href="https://f1000researchdata.s3.amazonaws.com/manuscripts/7019/9f5f4330-d81d-46b8-9a3f-d8cb7aaf577e_figure1.gif">Figure 1</a>). As pointed out by these authors, with this experimental design we can't simply accept that (<em>b<sub>Y</sub> - b<sub>X</sub></em>)=0, which implies that we can't estimate a species effect. Gilad and Mizrahi-Man then apply a <a href="http://biostatistics.oxfordjournals.org/content/8/1/118.abstract">linear model</a> (ComBat) to account for the batch/species effect and find that <a href="https://f1000researchdata.s3.amazonaws.com/manuscripts/7019/9f5f4330-d81d-46b8-9a3f-d8cb7aaf577e_figure3.gif">samples cluster almost perfectly by tissue</a>. However, Gilad and Mizrahi-Man correctly note that, due to the confounding, if there is in fact a species effect, this approach will remove it along with the batch effect.
Unfortunately, due to the experimental design it will be hard or impossible to determine if it's batch or if it's species. More data and more analyses are needed. </p> <p>Confounded designs ruin experiments. Current batch effect removal methods will not save you. If you are designing a large genomics experiment, learn about randomization.</p> <p style="text-align: left;"> * The fact that RNAseq was used does not necessarily mean there is no platform effect. The species have different genomes with different sequences, which can lead to different biases during experimental protocols. </p> <p style="text-align: left;"> <strong>Update: </strong>Shin Lin has repeated a small version of the experiment described in the <a href="http://www.pnas.org/content/111/48/17224.abstract" target="_blank">PNAS paper</a>. The new experimental design does not confound lane/instrument with species. The new data confirm the original results, indicating that lane/instrument does not explain the clustering by species. You can see his response in the comments <a href="http://f1000research.com/articles/4-121/v1" target="_blank">here</a>. </p> Residual expertise - or why scientists are amateurs at most of science 2015-05-18T10:21:18+00:00 http://simplystats.github.io/2015/05/18/residual-expertise <p><em>Editor’s note: I have been unsuccessfully attempting to finish a book I started 3 years ago about how and why everyone should get pumped about reading and understanding scientific papers. I’ve adapted part of one of the chapters into this blogpost. It is pretty raw but hopefully gets the idea across. 
</em></p> <p>An episode of <em>The Daily Show with Jon Stewart</em> featured Lisa Randall, an incredible physicist and noted scientific communicator, as the invited guest.</p> <div style="background-color: #000000; width: 520px;"> <div style="padding: 4px;"> <p style="text-align: left; background-color: #ffffff; padding: 4px; margin-top: 4px; margin-bottom: 0px; font-family: Arial, Helvetica, sans-serif; font-size: 12px;"> <b><a href="http://thedailyshow.cc.com/">The Daily Show</a></b><br /> Get More: <a href="http://thedailyshow.cc.com/full-episodes/">Daily Show Full Episodes</a>, <a href="http://www.facebook.com/thedailyshow">The Daily Show on Facebook</a>, <a href="http://thedailyshow.cc.com/videos">Daily Show Video Archive</a> </p> </div> </div> <p>Near the end of the interview, Stewart asked Randall why, with all the scientific progress we have made, we have been unable to move away from fossil fuel-based engines. The question led to the exchange:</p> <blockquote> <p><em>Randall: “So this is part of the problem, because I’m a scientist doesn’t mean I know the answer to that question.”</em></p> </blockquote> <blockquote> <p><em>Stewart: “Oh is that true? Here’s the thing, here’s what’s part of the answer. You could say anything and I would have no idea what you are talking about.”</em></p> </blockquote> <p>Professor Randall is a world-leading physicist, the first woman to achieve tenure in physics at Princeton, Harvard, and MIT, and a member of the National Academy of Sciences. But when it comes to the science of fossil fuels, she is just an amateur. Her response to this question is just perfect - it shows that even brilliant scientists can just be interested amateurs on topics outside of their expertise. Despite Professor Randall’s over-the-top qualifications, she is an amateur on a whole range of scientific topics from medicine, to computer science, to nuclear engineering. 
Being an amateur isn’t a bad thing, and recognizing where you are an amateur may be the truest indicator of genius. That doesn’t mean Professor Randall can’t know a little bit about fossil fuels or be curious about why we don’t all have nuclear-powered hovercrafts yet. It just means she isn’t the authority.</p> <p>Stewart’s response is particularly telling and indicative of what a lot of people think about scientists. It takes years of experience to become an expert in a scientific field - some have suggested as many as 10,000 hours of dedicated time. The reasoning goes: Professor Randall is a scientist, so she must have more information about any scientific problem than an informed amateur like Jon Stewart. But of course this isn’t true. Jon Stewart (and you) could quickly learn as much about fossil fuels as a scientist if the scientist wasn’t already an expert in the area. Sure, a background in physics would help, but there are a lot of moving parts in our dependence on fossil fuels, including social, political, and economic problems in addition to the physics involved.</p> <p>This is an example of “residual expertise” - when people without deep scientific training are willing to attribute expertise to scientists even if it is outside their primary area of focus. It is closely related to the logical fallacy behind the <a href="http://en.wikipedia.org/wiki/Argument_from_authority">argument from authority</a>:</p> <blockquote> <p>A is an authority on a particular topic</p> <p>A says something about that topic</p> <p>A is probably correct</p> </blockquote> <p>The difference is that with residual expertise you assume that since A is an authority on a particular topic, if they say something about another, potentially related topic, they will probably be correct. This idea is critically important: it is how quacks make their living. 
The logical leap of faith from “that person is a doctor” to “that person is a doctor so of course they understand epidemiology, or vaccination, or risk communication” is exactly the leap empowered by the idea of residual expertise. It is also how you can line up scientific experts against any well-established doctrine like evolution or climate change. Experts in the field will know all of the relevant information that supports key ideas in the field and what it would take to overturn those ideas. But experts outside of the field can be lined up and their residual expertise used to call into question even the most supported ideas.</p> <p>What does this have to do with you?</p> <p>Most people aren’t necessarily experts in the scientific disciplines they care about. But becoming a successful amateur requires a much smaller time commitment than becoming an expert and can still be incredibly satisfying, fun, and useful. This book is designed to help you become a fired-up amateur in the science of your choice. Think of it like a hobby, but one where you get to learn about some of the coolest new technologies and ideas coming out in the scientific literature. If you can ignore the way residual expertise makes you feel silly for reading scientific papers you don’t fully understand, you can still learn a ton and have a pretty fun time doing it.</p> The tyranny of the idea in science 2015-05-08T11:58:51+00:00 http://simplystats.github.io/2015/05/08/the-tyranny-of-the-idea-in-science <p>There are a lot of analogies between <a href="http://simplystatistics.org/2012/09/20/every-professor-is-a-startup/">startups and academic science labs</a>. One thing that is definitely very different is the relative value of ideas in the startup world and in the academic world. 
For example, <a href="http://simplystatistics.org/2012/09/20/every-professor-is-a-startup/">Paul Graham has said:</a></p> <blockquote> <p>Actually, startup ideas are not million dollar ideas, and here’s an experiment you can try to prove it: just try to sell one. Nothing evolves faster than markets. The fact that there’s no market for startup ideas suggests there’s no demand. Which means, in the narrow sense of the word, that startup ideas are worthless.</p> </blockquote> <p>In academics, almost the opposite is true. There is huge value to being first with an idea, even if you haven’t gotten all the details worked out or stable software in place. Here are a couple of extreme examples illustrated with Nobel prizes:</p> <ol> <li><strong>Higgs Boson</strong> - Peter Higgs <a href="http://journals.aps.org/pr/abstract/10.1103/PhysRev.145.1156">postulated the Boson in 1964</a> and <a href="http://www.symmetrymagazine.org/article/october-2013/nobel-prize-in-physics-honors-prediction-of-higgs-boson">won the Nobel Prize in 2013 for that prediction</a>. In between, tons of people did follow-on work, someone convinced Europe to build one of the <a href="http://en.wikipedia.org/wiki/Large_Hadron_Collider">most expensive pieces of scientific equipment ever built</a>, and conservatively thousands of scientists and engineers had to do a ton of work to get the equipment to (a) work and (b) confirm the prediction.</li> <li><strong>Human genome</strong> - <a href="http://en.wikipedia.org/wiki/Molecular_Structure_of_Nucleic_Acids:_A_Structure_for_Deoxyribose_Nucleic_Acid">Watson and Crick postulated the structure of DNA</a> in 1953 and <a href="http://www.nobelprize.org/nobel_prizes/medicine/laureates/1962/">won the Nobel Prize in medicine in 1962</a> for this work. 
But the real value of the human genome was realized when the <a href="http://en.wikipedia.org/wiki/Human_Genome_Project">largest biological collaboration in history sequenced the human genome</a>, along with all of the subsequent work in the genomics revolution.</li> </ol> <p>These are two large-scale examples where the academic scientific community (as represented by the Nobel committee, mostly because it is a concrete example) rewards the original idea and not the hard work to achieve that idea. I call this “the tyranny of the idea.” I notice a similar issue on a much smaller scale, for example when people <a href="http://ivory.idyll.org/blog/2015-software-as-a-primary-product-of-science.html">don’t recognize software as a primary product of science</a>. I feel like these decisions devalue the real work it takes to make any scientific idea a reality. Sure, the ideas are good, but it isn’t clear that some of them wouldn’t have been discovered by someone else, whereas surely we aren’t going to build another Large Hadron Collider. I’d like to see the scales correct back the other way a little bit so we put at least as much emphasis on the science it takes to follow through on an idea as on discovering it in the first place.</p> Mendelian randomization inspires a randomized trial design for multiple drugs simultaneously 2015-05-07T11:30:09+00:00 http://simplystats.github.io/2015/05/07/mendelian-randomization-inspires-a-randomized-trial-design-for-multiple-drugs-simultaneously <p>Joe Pickrell has an interesting new paper out about <a href="http://biorxiv.org/content/early/2015/04/16/018150.full-text.pdf+html">Mendelian randomization.</a> He discusses some of the interesting issues that come up with these studies and performs a mini-review of previously published studies using the technique.</p> <p>The basic idea behind Mendelian Randomization is the following. 
In a simple, randomly mating population, Mendel’s laws tell us that at any genomic locus (a measured spot in the genome) the allele (the genetic material) you inherit is assigned at random. At the chromosome level this is very close to true due to properties of meiosis (here is an example of how this looks in very cartoonish form in yeast). A very famous example of this was an experiment performed by Leonid Kruglyak’s group where they took two strains of yeast and repeatedly mated them, then measured genetics and gene expression data. The experimental design looked like this:</p> <p><a href="http://simplystatistics.org/wp-content/uploads/2015/05/Slide06.jpg"><img class="aligncenter wp-image-4009 size-medium" src="http://simplystatistics.org/wp-content/uploads/2015/05/Slide06-300x224.jpg" alt="Slide06" width="300" height="224" srcset="http://simplystatistics.org/wp-content/uploads/2015/05/Slide06-300x224.jpg 300w, http://simplystatistics.org/wp-content/uploads/2015/05/Slide06-260x194.jpg 260w" sizes="(max-width: 300px) 100vw, 300px" /></a></p> <p>If you look at the alleles inherited from the two parental strains (BY, RM) at two separate genes on different chromosomes in each of the 112 segregants (yeast offspring), they do appear to be random and independent:</p> <p><a href="http://simplystatistics.org/wp-content/uploads/2015/05/Screen-Shot-2015-05-07-at-11.20.46-AM.png"><img class="aligncenter wp-image-4010 size-medium" src="http://simplystatistics.org/wp-content/uploads/2015/05/Screen-Shot-2015-05-07-at-11.20.46-AM-235x300.png" alt="Screen Shot 2015-05-07 at 11.20.46 AM" width="235" height="300" /></a></p> <p>So this is a randomized trial in yeast where the yeast were each randomized to many, many genetic “treatments” simultaneously. 
Now this isn’t strictly true, since genes on the same chromosomes near each other aren’t exactly random and in humans it is definitely not true since there is population structure, non-random mating and a host of other issues. But you can still do cool things to try to infer causality from the genetic “treatments” to downstream things like gene expression ( <a href="http://genomebiology.com/2007/8/10/r219">and even do a reasonable job in the model organism case</a>).</p> <p>In my mind this raises a potentially interesting study design for clinical trials. Suppose that there are 10 treatments for a disease that we know about. We design a study where each of the patients in the trial was randomized to receive treatment or placebo for each of the 10 treatments. So on average each person would get 5 treatments.  Then you could try to tease apart the effects using methods developed for the Mendelian randomization case. Of course, this is ignoring potential interactions, side effects of taking multiple drugs simultaneously, etc. But I’m seeing lots of <a href="http://www.nature.com/news/personalized-medicine-time-for-one-person-trials-1.17411">interesting proposals</a> for new trial designs (<a href="http://notstatschat.tumblr.com/post/118102423391/precise-answers-but-not-necessarily-to-the-right">which may or may not work</a>), so I thought I’d contribute my own interesting idea.</p> Rafa's citations above replacement in statistics journals is crazy high. 2015-05-01T11:18:47+00:00 http://simplystats.github.io/2015/05/01/rafas-citations-above-replacement-in-statistics-journals-is-crazy-high <p><em>Editor’s note:  I thought it would be fun to do some bibliometrics on a Friday. This is super hacky and the CAR/Y stat should not be taken seriously. </em></p> <p>I downloaded data on the 400 most cited papers between 2000-2010 in some statistical journals from <a href="webofscience.com/">Web of Science</a>. 
Here is a boxplot of the average number of citations per year (from publication date - 2015) to these papers in the journals Annals of Statistics, Biometrics, Biometrika, Biostatistics, JASA, Journal of Computational and Graphical Statistics, Journal of Machine Learning Research, and Journal of the Royal Statistical Society Series B.</p> <p> </p> <p><a href="http://simplystatistics.org/wp-content/uploads/2015/05/journals.png"><img class="aligncenter wp-image-4001" src="http://simplystatistics.org/wp-content/uploads/2015/05/journals-300x300.png" alt="journals" width="500" height="500" srcset="http://simplystati