Simply Statistics 2016-09-28T14:34:38+00:00 http://simplystats.github.io The Mystery of Palantir Continues 2016-09-28T00:00:00+00:00 http://simplystats.github.io/2016/09/28/mystery-palantir-continues <p>Palantir, the secretive data science/consulting/software company, continues to be a mystery to most people, but recent reports have not been great. <a href="http://www.nytimes.com/reuters/2016/09/26/business/26reuters-palantir-tech-discrimination-lawsuit.html?smprod=nytcore-iphone&amp;smid=nytcore-iphone-share&amp;_r=0">Reuters reports</a> that the U.S. Department of Labor is suing it for employment discrimination:</p> <blockquote> <p>The lawsuit alleges Palantir routinely eliminated Asian applicants in the resume screening and telephone interview phases, even when they were as qualified as white applicants.</p> </blockquote> <p>Interestingly, the report indicates a statistical argument:</p> <blockquote> <p>In one example cited by the Labor Department, Palantir reviewed a pool of more than 130 qualified applicants for the role of engineering intern. About 73 percent of applicants were Asian. The lawsuit, which covers Palantir’s conduct between January 2010 and the present, said the company hired 17 non-Asian applicants and four Asians. “The likelihood that this result occurred according to chance is approximately one in a billion,” said the lawsuit, which was filed with the department’s Office of Administrative Law Judges.</p> </blockquote> <p>Note the use of the phrase “qualified applicants” in reference to the 130. Presumably, there was a screening process that removed “unqualified applicants” and that led us to 130. Of the 130, 73 were Asian, or about 56%. Presumably, there was a follow up selection process (interview, exam) that led to 4 Asians being hired out of 21 (about 19%). Clearly there’s a difference between 19% and 56% but the reasons may not be nefarious. If you assume the number of Asians hired is proportional to the number in the qualified pool, then the p-value for the observed data is about 0.0006, which is not quite “1 in a billion” as the report claims. But my guess is the Labor Department has more than this test of binomial proportions in terms of evidence if they were to go through with a suit.</p> <p>Alfred Lee from <a href="http://go.theinformation.com/r958P12lLdw">The Information</a> reports that a mutual fund run by Valic sold their shares of Palantir for below the recent valuation:</p> <blockquote> <p>The Valic fund sold its stake at $4.50 per share, filings show, down from the$11.38 per share at which the company raised money in December. The value of the stake at the sale price was $621,000. Despite the price drop, Valic made money on the deal, as it had acquired stock in preferred fundraisings in 2012 and 2013 at between$3.06 and $3.51 per share.</p> </blockquote> <p>In my <a href="http://simplystatistics.org/2016/05/11/palantir-struggles/">previous post on Palantir</a>, I noted that while other large-scale consulting companies certainly make a lot of money, none have the sky-high valuation that Palantir commands. However, a more “down-to-Earth” valuation of$8 billion might be more or less in line with these other companies. It may be bad news for Palantir, but should the company ever have an IPO, it would be good for the public for market participants to realize the intrinsic value of the company.</p> Thinking like a statistician: this is not the election for progressives to vote third party 2016-09-27T00:00:00+00:00 http://simplystats.github.io/elections/2016/09/27/thinking-like-statistician-election-2016 <p>Democratic elections permit us to vote for whomever we perceive has the highest expectation to do better with the issues we care about. Let’s simplify and assume we can quantify how satisfied we are with an elected official’s performance. Denote this quantity with <em>X</em>. Because when we cast our vote we still don’t know for sure how the candidate will perform, we base our decision on what we expect, denoted here with <em>E(X)</em>. Thus we try to maximize <em>E(X)</em>. However, both political theory and data tell us that in US presidential elections only two parties have a non-negligible probability of winning. This implies that <em>E(X)</em> is 0 for some candidates no matter how large <em>X</em> could potentially be. So what we are really doing is deciding if <em>E(X-Y)</em> is positive or negative with <em>X</em> representing one candidate and <em>Y</em> the other.</p> <p>In past elections some progressives have argued that the difference between candidates is negligible and have therefore supported the Green Party ticket. The 2000 election is a notable example. The <a href="https://en.wikipedia.org/wiki/United_States_presidential_election,_2000">2000 election</a> was won by George W. Bush by just five <a href="https://en.wikipedia.org/wiki/Electoral_College_(United_States)">electoral votes</a>. In Florida, which had 25 electoral votes, Bush beat Al Gore by just 537 votes. Green Party candidate Ralph Nader obtained 97,488 votes. Many progressive voters were OK with this outcome because they perceived <em>E(X-Y)</em> to be practically 0.</p> <p>In contrast, in 2016, I suspect few progressives think that <em>E(X-Y)</em> is anywhere near 0. In the figures below I attempt to quantify the progressive’s pre-election perception of consequences for the last five contests. The first figure shows <em>E(X)</em> and <em>E(Y)</em> and the second shows <em>E(X-Y)</em>. Note despite <em>E(X)</em> being the lowest in the last past five elections, <em>E(X-Y)</em> is by far the largest. So if these figures accurately depict your perception and you think like a statistician, it becomes clear that this is not the election to vote third party.</p> <p><img src="https://raw.githubusercontent.com/simplystats/simplystats.github.io/master/_images/election.png" alt="election-2016" /></p> <p><img src="https://raw.githubusercontent.com/simplystats/simplystats.github.io/master/_images/election-diff.png" alt="election-diff-2016" /></p> Facebook and left censoring 2016-09-26T00:00:00+00:00 http://simplystats.github.io/2016/09/26/facebook-left-censoring <p>From the <a href="http://www.wsj.com/articles/facebook-overestimated-key-video-metric-for-two-years-1474586951">Wall Street Journal</a>:</p> <blockquote> <p>Several weeks ago, Facebook disclosed in a post on its “Advertiser Help Center” that its metric for the average time users spent watching videos was artificially inflated because it was only factoring in video views of more than three seconds. The company said it was introducing a new metric to fix the problem.</p> </blockquote> <p>A classic case of left censoring (in this case, by “accident”).</p> <p>Also this:</p> <blockquote> <p>Ad buying agency Publicis Media was told by Facebook that the earlier counting method likely overestimated average time spent watching videos by between 60% and 80%, according to a late August letter Publicis Media sent to clients that was reviewed by The Wall Street Journal.</p> </blockquote> <p>What does this information tell us about the actual time spent watching Facebook videos?</p> Not So Standard Deviations Episode 22 - Number 1 Side Project 2016-09-19T00:00:00+00:00 http://simplystats.github.io/2016/09/19/nssd-episode-22 <p>Hilary and I celebrate our one year anniversary doing the podcast together by discussing whether there are cities that are good for data scientists, reproducible research, and professionalizing data science.</p> <p>Also, Hilary and I have just published a new book, <a href="https://leanpub.com/conversationsondatascience?utm_source=SimplyStats&amp;utm_campaign=NSSD&amp;utm_medium=BlogPost">Conversations on Data Science</a>, which collects some of our episodes in an easy-to-read format. The book is available from Leanpub and will be updated as we record more episodes. If you’re new to the podcast, this is a good way to do some catching up!</p> <p>If you have questions you’d like us to answer, you can send them to nssdeviations @ gmail.com or tweet us at <a href="https://twitter.com/nssdeviations">@NSSDeviations</a>.</p> <p>Subscribe to the podcast on <a href="https://itunes.apple.com/us/podcast/not-so-standard-deviations/id1040614570">iTunes</a> or <a href="https://play.google.com/music/listen?u=0#/ps/Izfnbx6tlruojkfrvhjfdj3nmna">Google Play</a>.</p> <p>Please <a href="https://itunes.apple.com/us/podcast/not-so-standard-deviations/id1040614570">leave us a review on iTunes</a>!</p> <p>Support us through our <a href="https://www.patreon.com/NSSDeviations?ty=h">Patreon page</a>.</p> <p>Show Notes:</p> <ul> <li> <p><a href="https://www.biostat.washington.edu/suminst/sisbid2016/modules/BD1603">Roger’s reproducible research workshop</a></p> </li> <li> <p><a href="http://radar.oreilly.com/2013/06/theres-more-than-one-kind-of-data-scientist.html">There’s More Than One Kind of Data Scientist by Harlan Harris</a></p> </li> <li> <p><a href="http://sf.curbed.com/maps/mapping-the-10-sf-homes-with-the-highest-property-taxes">Billionaire’s row in San Francisco</a></p> </li> <li> <p><a href="https://en.wikipedia.org/wiki/Mindfulness-based_stress_reduction">Mindfulness-based stress reduction</a></p> </li> <li> <p><a href="http://www.asteroidmission.org/">OSIRIS-REx</a></p> </li> </ul> <p><a href="https://soundcloud.com/nssd-podcast/episode-22-1-side-project">Download the audio for this episode</a>.</p> <p>Listen here:</p> <iframe width="100%" height="166" scrolling="no" frameborder="no" src="https://w.soundcloud.com/player/?url=https%3A//api.soundcloud.com/tracks/282927998&amp;color=ff5500&amp;auto_play=false&amp;hide_related=false&amp;show_comments=true&amp;show_user=true&amp;show_reposts=false"></iframe> Mastering Software Development in R 2016-09-19T00:00:00+00:00 http://simplystats.github.io/2016/09/19/msdr-launch-announcement <p>Today I’m happy to announce that we’re launching a new specialization on Coursera titled <a href="https://www.coursera.org/specializations/r/"><strong>Mastering Software Development in R</strong></a>. This is a 5-course sequence developed with <a href="https://twitter.com/seankross">Sean Kross</a> and <a href="http://csu-cvmbs.colostate.edu/academics/erhs/Pages/brooke-anderson.aspx">Brooke Anderson</a>.</p> <p>This sequence differs from our previous Data Science Specialization because it focuses primarily on using R for developing <em>software</em>. We’ve found that as the field of data science evolves, it is becoming ever more clear that software development skills are essential for producing useful data science results and products. In addition, there is a tremendous need for tooling in the data science universe and we want to train people to build those tools.</p> <p>The first course, <a href="https://www.coursera.org/learn/r-programming-environment">The R Programming Environment</a>, launches today. In the following months, we will launch the remaining courses:</p> <ul> <li>Advanced R Programming</li> <li>Building R Packages</li> <li>Building Data Visualization Tools</li> </ul> <p>In addition to the course, we have a <a href="https://leanpub.com/msdr">companion textbook</a> that goes along with the sequence. The book is available from Leanpub and is currently in progress (if you get the book now, you will receive free updates as they are available). We will be releaseing new chapters of the book alongside the launches of the other courses in the sequence.</p> Interview With a Data Sucker 2016-09-07T00:00:00+00:00 http://simplystats.github.io/open%20science/2016/09/07/interview-with-a-data-sucker <p>A few months ago Jill Sederstrom from ASH Clinical News interviewed me for <a href="http://ashclinicalnews.org/attack-of-the-data-suckers/">this article</a> on the data sharing editorial published by the The New England Journal of Medicine (NEJM) and the debate it generated. The article presented a nice summary, but I thought the original comprehensive set of questions was very good too. So, with permission from ASH Clinical News, I am sharing them here along with my answers.</p> <p>Before I answer the questions below, I want to make an important remark. When writing these answers I am reflecting on data sharing in general. Nuances arise in different contexts that need to be discussed on an individual basis. For example, there are different considerations to keep in mind when sharing publicly funded data in genomics (my field) and sharing privately funded clinical trials data, just to name two examples.</p> <h3 id="in-your-opinion-what-do-you-see-as-the-biggest-pros-of-data-sharing">In your opinion, what do you see as the biggest pros of data sharing?</h3> <p>The biggest pro of data sharing is that it can accelerate and improve the scientific enterprise. This can happen in a variety of ways. For example, competing experts may apply an improved statistical analysis that finds a hidden discovery the original data generators missed. Furthermore, examination of data by many experts can help correct errors missed by the analyst of the original project. Finally, sharing data facilitates the merging of datasets from different sources that allow discoveries not possible with just one study.</p> <p>Note that data sharing is not a radical idea. For example, thanks to an organization called <a href="http://fged.org">The MGED Soceity</a>, most journals require all published microarray gene expression data to be public in one of two repositories: GEO or ArrayExpress. This has been an incredible success, leading to new discoveries, new databases that combine studies, and the development of widely used statistical methods and software built with these data as practice examples.</p> <h3 id="the-nejm-editorial-expressed-concern-that-a-new-generation-of-researchers-will-emerge-those-who-had-nothing-to-do-with-collecting-the-research-but-who-will-use-it-to-their-own-ends-it-referred-to-these-as-research-parasites-is-this-a-real-concern">The NEJM editorial expressed concern that a new generation of researchers will emerge, those who had nothing to do with collecting the research but who will use it to their own ends. It referred to these as “research parasites.” Is this a real concern?</h3> <p>Absolutely not. If our goal is to facilitate scientific discoveries that improve our quality of life, I would be much more concerned about “data hoarders” than “research parasites”. If an important nugget of knowledge is hidden in a dataset, don’t you want the best data analysts competing to find it? Restricting the researchers who can analyze the data to those directly involved with the generators cuts out the great majority of experts.</p> <p>To further illustrate this, let’s consider a very concrete example with real life consequences. Imagine a loved one has a disease with high mortality rates. Finding a cure is possible but only after analyzing a very very complex genomic assay. If some of the best data analysts in the world want to help, does it make any sense at all to restrict the pool of analysts to, say, a freshly minted masters level statistician working for the genomics core that generated the data? Furthermore, what would be the harm of having someone double check that analysis?</p> <h3 id="the-nejm-editorial-also-presented-several-other-concerns-it-had-with-data-sharing-including-whether-researchers-would-compare-data-across-clinical-trials-that-is-not-in-fact-comparable-and-a-failure-to-provide-correct-attribution-do-you-see-these-as-being-concerns-what-cons-do-you-believe-there-may-be-to-data-sharing">The NEJM editorial also presented several other concerns it had with data sharing including whether researchers would compare data across clinical trials that is not in fact comparable and a failure to provide correct attribution. Do you see these as being concerns? What cons do you believe there may be to data sharing?</h3> <p>If such mistakes are made, good peer reviewers will catch the error. If it escapes peer review, we point it out in post publication discussions. Science is constantly self correcting.</p> <p>Regarding attribution, this is a legitimate, but in my opinion, minor concern. Developers of open source statistical methods and software see our methods used without attribution quite often. We survive. But as I elaborate below, we can do things to alleviate this concern.</p> <h3 id="is-data-stealing-a-real-worry-have-you-ever-heard-of-it-happening-before">Is data stealing a real worry? Have you ever heard of it happening before?</h3> <p>I can’t say I can recall any case of data being stolen. But let’s remember that most published data is paid for by tax payers. They are the actual owners. So there is an argument to be made that the public’s data is being held hostage.</p> <h3 id="does-data-sharing-need-to-happen-symbiotically-as-the-editorial-suggests-why-or-why-not">Does data sharing need to happen symbiotically as the editorial suggests? Why or why not?</h3> <p>I think symbiotic sharing is the most effective approach to the repurposing of data. But no, I don’t think we need to force it to happen this way. Competition is one of the key ingredients of the scientific enterprise. Having many groups competing almost always beats out a small group of collaborators. And note that the data generators won’t necessarily have time to collaborate with all the groups interested in the data.</p> <h3 id="in-a-recent-blog-post-you-suggested-several-possible-data-sharing-guidelines-what-would-the-advantage-be-of-having-guidelines-in-place-in-help-guide-the-data-sharing-process">In a recent blog post, you suggested several possible data sharing guidelines. What would the advantage be of having guidelines in place in help guide the data sharing process?</h3> <p>I think you are referring to <a href="http://simplystatistics.org/2016/01/25/on-research-parasites-and-internet-mobs-lets-try-to-solve-the-real-problem/">a post by Jeff Leek</a> but I am happy to answer. For data to be generated, we need to incentivize the endeavor. Guidelines that assure patient privacy should of course be followed. Some other simple guidelines related to those mentioned by Jeff are:</p> <ol> <li>Reward data generators when their data is used by others.</li> <li>Penalize those that do not give proper attribution.</li> <li>Apply the same critical rigor on critiques of the original analysis as we apply to the original analysis.</li> <li>Include data sharing ethics in scientific education</li> </ol> <h3 id="one-of-the-guidelines-suggested-a-new-designation-for-leaders-of-major-data-collection-or-software-generation-projects-why-do-you-think-this-is-important">One of the guidelines suggested a new designation for leaders of major data collection or software generation projects. Why do you think this is important?</h3> <p>Again, this was Jeff, but I agree. This is important because we need an incentive other than giving the generators exclusive rights to publications emanating from said data.</p> <h3 id="you-also-discussed-the-need-for-requiring-statisticalcomputational-co-authors-for-papers-written-by-experimentalists-with-no-statisticalcomputational-co-authors-and-vice-versa-what-role-do-you-see-the-referee-serving-why-is-this-needed">You also discussed the need for requiring statistical/computational co-authors for papers written by experimentalists with no statistical/computational co-authors and vice versa. What role do you see the referee serving? Why is this needed?</h3> <p>I think the same rule should apply to referees. Every paper based on the analysis of complex data needs to have a referee with statistical/computational expertise. I also think biomedical journals publishing data-driven research should start adding these experts to their editorial boards. I should mention that NEJM actually has had such experts on their editorial board for a while now.</p> <h3 id="are-there-certain-guidelines-would-feel-would-be-most-critical-to-include">Are there certain guidelines would feel would be most critical to include?</h3> <p>To me the most important ones are:</p> <ol> <li> <p>The funding agencies and the community should reward data generators when their data is used by others. Perhaps more than for the papers they produce with these data.</p> </li> <li> <p>Apply the same critical rigor on critiques of the original analysis as we apply to the original analysis. Bashing published results and talking about the “replication crisis” has become fashionable. Although in some cases it is very well merited (see Baggerly and Coombes <a href="http://projecteuclid.org/euclid.aoas/1267453942#info">work</a> for example) in some circumstances critiques are made without much care mainly for the attention. If we are not careful about keeping a good balance, we may end up paralyzing scientific progress.</p> </li> </ol> <h3 id="you-mentioned-that-you-think-symbiotic-data-sharing-would-be-the-most-effective-approach-what-are-some-ways-in-which-scientists-can-work-symbiotically">You mentioned that you think symbiotic data sharing would be the most effective approach. What are some ways in which scientists can work symbiotically?</h3> <p>I can describe my experience. I am trained as a statistician. I analyze data on a daily basis both as a collaborator and method developer. Experience has taught me that if one does not understand the scientific problem at hand, it is hard to make a meaningful contribution through data analysis or method development. Most successful applied statisticians will tell you the same thing.</p> <p>Most difficult scientific challenges have nuances that only the subject matter expert can effectively describe. Failing to understand these usually leads analysts to chase false leads, interpret results incorrectly or waste time solving a problem no one cares about. Successful collaboration usually involve a constant back and forth between the data analysts and the subject matter experts.</p> <p>However, in many circumstances the data generator is not necessarily the only one that can provide such guidance. Some data analysts actually become subject matter experts themselves, others download data and seek out other collaborators that also understand the details of the scientific challenge and data generation process.</p> A Short Guide for Students Interested in a Statistics PhD Program 2016-09-06T00:00:00+00:00 http://simplystats.github.io/advice/2016/09/06/a-short-guide-for-phd-applicants <p>This summer I had several conversations with undergraduate students seeking career advice. All were interested in data analysis and were considering graduate school. I also frequently receive requests for advice via email. We have posted on this topic before, for example <a href="http://simplystatistics.org/2015/02/18/navigating-big-data-careers-with-a-statistics-phd/">here</a> and <a href="http://simplystatistics.org/2015/11/09/biostatistics-its-not-what-you-think-it-is/">here</a>, but I thought it would be useful to share this short guide I put together based on my recent interactions.</p> <h2 id="its-ok-to-be-confused">It’s OK to be confused</h2> <p>When I was a college senior I didn’t really understand what Applied Statistics was nor did I understand what one does as a researcher in academia. Now I love being an academic doing research in applied statistics. But it is hard to understand what being a researcher is like until you do it for a while. Things become clearer as you gain more experience. One important piece of advice is to carefully consider advice from those with more experience than you. It might not make sense at first, but I can tell today that I knew much less than I thought I did when I was 22.</p> <h2 id="should-i-even-go-to-graduate-school">Should I even go to graduate school?</h2> <p>Yes. An undergraduate degree in mathematics, statistics, engineering, or computer science provides a great background, but some more training greatly increases your career options. You may be able to learn on the job, but note that a masters can be as short as a year.</p> <h2 id="a-masters-or-a-phd">A masters or a PhD?</h2> <p>If you want a career in academia or as a researcher in industry or government you need a PhD. In general, a PhD will give you more career options. If you want to become a data analyst or research assistant, a masters may be enough. A masters is also a good way to test out if this career is a good match for you. Many people do a masters before applying to PhD Programs. The rest of this guide focuses on those interested in a PhD.</p> <h2 id="what-discipline">What discipline?</h2> <p>There are many disciplines that can lead you to a career in data science: Statistics, Biostatistics, Astronomy, Economics, Machine Learning, Computational Biology, and Ecology are examples that come to mind. I did my PhD in Statistics and got a job in a Department of Biostatistics. So this guide focuses on Statistics/Biostatistics.</p> <p>Note that once you finish your PhD you have a chance to become a postdoctoral fellow and further focus your training. By then you will have a much better idea of what you want to do and will have the opportunity to chose a lab that closely matches your interests.</p> <h2 id="what-is-the-difference-between-statistics-and-biostatistics">What is the difference between Statistics and Biostatistics?</h2> <p>Short answer: very little. I treat them as the same in this guide. Long answer: read <a href="http://simplystatistics.org/2015/11/09/biostatistics-its-not-what-you-think-it-is/">this</a>.</p> <h2 id="how-should-i-prepare-during-my-senior-year">How should I prepare during my senior year?</h2> <h3 id="math">Math</h3> <p>Good grades in math and statistics classes are almost a requirement. Good GRE scores help and you need to get a near perfect score in the Quantitative Reasoning part of the GRE. Get yourself a practice book and start preparing. Note that to survive the first two years of a statistics PhD program you need to prove theorems and derive relatively complicated mathematical results. If you can’t easily handle the math part of the GRE, this will be quite challenging.</p> <p>When choosing classes note that the area of math most related to your stat PhD courses is Real Analysis. The area of math most used in applied work is Linear Algebra, specifically matrix theory including understanding eigenvalues and eigenvectors. You might not make the connection between what you learn in class and what you use in practice until much later. This is totally normal.</p> <p>If you don’t feel ready, consider doing a masters first. But also, get a second opinion. You might be being too hard on yourself.</p> <h3 id="programming">Programming</h3> <p>You will be using a computer to analyze data so knowing some programming is a must these days. At a minimum, take a basic programming class. Other computer science classes will help especially if you go into an area dealing with large datasets. In hindsight, I wish I had taken classes on optimization and algorithm design.</p> <p>Know that learning to program and learning a computer language are different things. You need to learn to program. The choice of language is up for debate. If you only learn one, learn R. If you learn three, learn R, Python and C++.</p> <p>Knowing Linux/Unix is an advantage. If you have a Mac try to use the terminal as much as possible. On Windows get an emulator.</p> <h3 id="writing-and-communicating">Writing and Communicating</h3> <p>My biggest educational regret is that, as a college student, I underestimated the importance of writing. To this day I am correcting that mistake.</p> <p>Your success as a researcher greatly depends on how well you write and communicate. Your thesis, papers, grant proposals and even emails have to be well written. So practice as much as possible. Take classes, read works by good writers, and <a href="http://bulletin.imstat.org/2011/09/terence%E2%80%99s-stuff-speaking-reading-writing/">practice</a>. Consider starting a blog even if you don’t make it public. Also note that in academia, job interviews will involve a 50 minute talk as well as several conversations about your work and future plans. So communication skills are also a big plus.</p> <h2 id="but-wait-why-so-much-math">But wait, why so much math?</h2> <p>The PhD curriculum is indeed math heavy. Faculty often debate the possibility of changing the curriculum. But regardless of differing opinions on what is the right amount, math is the foundation of our discipline. Although it is true that you will not directly use much of what you learn, I don’t regret learning so much abstract math because I believe it positively shaped the way I think and attack problems.</p> <p>Note that after the first two years you are pretty much done with courses and you start on your research. If you work with an applied statistician you will learn data analysis via the apprenticeship model. You will learn the most, by far, during this stage. So be patient. Watch <a href="https://www.youtube.com/watch?v=R37pbIySnjg">these</a> <a href="https://www.youtube.com/watch?v=Bg21M2zwG9Q">two</a> Karate Kid scenes for some inspiration.</p> <h2 id="what-department-should-i-apply-to">What department should I apply to?</h2> <p>The top 20-30 departments are practically interchangeable in my opinion. If you are interested in applied statistics make sure you pick a department with faculty doing applied research. Note that some professors focus their research on the mathematical aspects of statistics. By reading some of their recent papers you will be able to tell. An applied paper usually shows data (not simulated) and motivates a subject area challenge in the abstract or introduction. A theory paper shows no data at all or uses it only as an example.</p> <h2 id="can-i-take-a-year-off">Can I take a year off?</h2> <p>Absolutely. Especially if it’s to work in a data related job. In general, maturity and life experiences are an advantage in grad school.</p> <h2 id="what-should-i-expect-when-i-finish">What should I expect when I finish?</h2> <p>You will have many many options. The demand of your expertise is great and growing. As a result there are many high-paying options. If you want to become an academic I recommend doing a postdoc. <a href="http://simplystatistics.org/2011/12/28/grad-students-in-bio-statistics-do-a-postdoc/">Here</a> is why. But there are many other options as we describe <a href="http://simplystatistics.org/2015/02/18/navigating-big-data-careers-with-a-statistics-phd/">here</a> and <a href="http://simplystatistics.org/2011/09/12/advice-for-stats-students-on-the-academic-job-market/">here</a>.</p> Not So Standard Deviations Episode 21 - This Might be the Future! 2016-08-26T00:00:00+00:00 http://simplystats.github.io/2016/08/26/nssd-episode-21 <p>Hilary and I are apart again and this time we’re talking about political polling. Also, they discuss Trump’s tweets, and the fact that Hilary owns a bowling ball.</p> <p>Also, Hilary and I have just published a new book, <a href="https://leanpub.com/conversationsondatascience?utm_source=SimplyStats&amp;utm_campaign=NSSD&amp;utm_medium=BlogPost">Conversations on Data Science</a>, which collects some of our episodes in an easy-to-read format. The book is available from Leanpub and will be updated as we record more episodes. If you’re new to the podcast, this is a good way to do some catching up!</p> <p>If you have questions you’d like us to answer, you can send them to nssdeviations @ gmail.com or tweet us at <a href="https://twitter.com/nssdeviations">@NSSDeviations</a>.</p> <p>Subscribe to the podcast on <a href="https://itunes.apple.com/us/podcast/not-so-standard-deviations/id1040614570">iTunes</a> or <a href="https://play.google.com/music/listen?u=0#/ps/Izfnbx6tlruojkfrvhjfdj3nmna">Google Play</a>.</p> <p>Please <a href="https://itunes.apple.com/us/podcast/not-so-standard-deviations/id1040614570">leave us a review on iTunes</a>!</p> <p>Support us through our <a href="https://www.patreon.com/NSSDeviations?ty=h">Patreon page</a>.</p> <p>Show Notes:</p> <ul> <li> <p><a href="http://projects.fivethirtyeight.com/2016-election-forecast/">FiveThirtyEight election dashboard</a></p> </li> <li> <p><a href="http://www.nytimes.com/interactive/2016/upshot/presidential-polls-forecast.html">The Upshot’s election dashboard</a></p> </li> <li> <p><a href="http://varianceexplained.org/r/trump-tweets/">David Robinson’s post on Trump’s tweets</a></p> </li> <li> <p><a href="https://twitter.com/juliasilge">Julia Silge’s Twitter account</a></p> </li> <li> <p><a href="http://thekateringshow.com">The Katering Show</a></p> </li> <li> <p><a href="https://www.beomni.com">Omni</a></p> </li> </ul> <p><a href="https://soundcloud.com/nssd-podcast/episode-21-this-might-be-the-future">Download the audio for this episode</a>.</p> <p>Listen here:</p> <iframe width="100%" height="166" scrolling="no" frameborder="no" src="https://w.soundcloud.com/player/?url=https%3A//api.soundcloud.com/tracks/279922412&amp;color=ff5500&amp;auto_play=false&amp;hide_related=false&amp;show_comments=true&amp;show_user=true&amp;show_reposts=false"></iframe> How to create a free distributed data collection "app" with R and Google Sheets 2016-08-26T00:00:00+00:00 http://simplystats.github.io/2016/08/26/googlesheets <p><a href="http://www.stat.ubc.ca/~jenny/">Jenny Bryan</a>, developer of the <a href="https://github.com/jennybc/googlesheets">google sheets R package</a>, <a href="https://speakerdeck.com/jennybc/googlesheets-talk-at-user2015">gave a talk</a> at Use2015 about the package.</p> <p>One of the things that got me most excited about the package was an example she gave in her talk of using the Google Sheets package for data collection at ultimate frisbee tournaments. One reason is that I used to play a little ultimate <a href="http://www.pbase.com/jmlane/image/76852417">back in the day</a>.</p> <p>Another is that her idea is an amazing one for producing cool public health applications. One of the major issues with public health is being able to do distributed data collection cheaply, easily, and reproducibly. So I decided to write a little tutorial on how one could use <a href="https://www.google.com/sheets/about/">Google Sheets</a> and R to create a free distributed data collecton “app” for public health (or anything else really).</p> <h3 id="what-you-will-need">What you will need</h3> <ul> <li>A Google account and access to <a href="https://www.google.com/sheets/about/">Google Sheets</a></li> <li><a href="https://www.r-project.org/">R</a> and the <a href="https://github.com/jennybc/googlesheets">googlesheets</a> package.</li> </ul> <h3 id="the-app">The “app”</h3> <p>What we are going to do is collect data in a Google Sheet or sheets. This sheet can be edited by anyone with the link using their computer or a mobile phone. Then we will use the <code class="highlighter-rouge">googlesheets</code> package to pull the data into R and analyze it.</p> <h3 id="making-the-google-sheet-work-with-googlesheets">Making the Google Sheet work with googlesheets</h3> <p>After you have a first thing to do is to go to the Google Sheets I suggest bookmarking this page: https://docs.google.com/spreadsheets/u/0/ which skips the annoying splash screen.</p> <p>Create a blank sheet and give it an appropriate title for whatever data you will be collecting.</p> <p>Next, we need to make the sheet <em>public on the web</em> so that the <em>googlesheets</em> package can read it. This is different from the sharing settings you set with the big button on the right. To make the sheet public on the web, go to the “File” menu and select “Publish to the web…”. Like this:</p> <p><img src="https://raw.githubusercontent.com/simplystats/simplystats.github.io/master/_images/gs_publishweb.png" alt="" /></p> <p>then it will ask you if you want to publish the sheet, just hit publish</p> <p><img src="https://raw.githubusercontent.com/simplystats/simplystats.github.io/master/_images/gs_publish.png" alt="" /></p> <p>Copy the link it gives you and you can use it to read in the Google Sheet. If you want to see all the Google Sheets you can read in, you can load the package and use the <code class="highlighter-rouge">gs_ls</code> function.</p> <div class="language-r highlighter-rouge"><pre class="highlight"><code><span class="n">library</span><span class="p">(</span><span class="n">googlesheets</span><span class="p">)</span><span class="w"> </span><span class="n">sheets</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">gs_ls</span><span class="p">()</span><span class="w"> </span><span class="n">sheets</span><span class="p">[</span><span class="m">1</span><span class="p">,]</span><span class="w"> </span></code></pre> </div> <div class="highlighter-rouge"><pre class="highlight"><code>## # A tibble: 1 x 10 ## sheet_title author perm version updated ## &lt;chr&gt; &lt;chr&gt; &lt;chr&gt; &lt;chr&gt; &lt;time&gt; ## 1 app_example jtleek rw new 2016-08-26 17:48:21 ## # ... with 5 more variables: sheet_key &lt;chr&gt;, ws_feed &lt;chr&gt;, ## # alternate &lt;chr&gt;, self &lt;chr&gt;, alt_key &lt;chr&gt; </code></pre> </div> <p>It will pop up a dialog asking for you to authorize the <code class="highlighter-rouge">googlesheets</code> package to read from your Google Sheets account. Then you should see a list of spreadsheets you have created.</p> <p>In my example I created a sheet called “app_example” so I can load the Google Sheet like this:</p> <div class="language-r highlighter-rouge"><pre class="highlight"><code><span class="c1">## Identifies the Google Sheet </span><span class="n">example_sheet</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">gs_title</span><span class="p">(</span><span class="s2">"app_example"</span><span class="p">)</span><span class="w"> </span></code></pre> </div> <div class="highlighter-rouge"><pre class="highlight"><code>## Sheet successfully identified: "app_example" </code></pre> </div> <div class="language-r highlighter-rouge"><pre class="highlight"><code><span class="c1">## Reads the data </span><span class="n">dat</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">gs_read</span><span class="p">(</span><span class="n">example_sheet</span><span class="p">)</span><span class="w"> </span></code></pre> </div> <div class="highlighter-rouge"><pre class="highlight"><code>## Accessing worksheet titled 'Sheet1'. </code></pre> </div> <div class="highlighter-rouge"><pre class="highlight"><code>## No encoding supplied: defaulting to UTF-8. </code></pre> </div> <div class="language-r highlighter-rouge"><pre class="highlight"><code><span class="n">head</span><span class="p">(</span><span class="n">dat</span><span class="p">)</span><span class="w"> </span></code></pre> </div> <div class="highlighter-rouge"><pre class="highlight"><code>## # A tibble: 3 x 5 ## who_collected at_work person time date ## &lt;chr&gt; &lt;chr&gt; &lt;chr&gt; &lt;chr&gt; &lt;chr&gt; ## 1 jeff no ingo 13:47 08/26/2016 ## 2 jeff yes roger 13:47 08/26/2016 ## 3 jeff yes brian 13:47 08/26/2016 </code></pre> </div> <p>In this case the data I’m collecting is about who is at work right now as I’m writing this post :). But you could collect whatever you want.</p> <h3 id="distributing-the-data-collection">Distributing the data collection</h3> <p>Now that you have the data published to the web, you can read it into Google Sheets. Also, anyone with the link will be able to view the Google Sheet. But if you don’t change the sharing settings, you are the only one who can edit the sheet.</p> <p>This is where you can make your data collection distributed if you want. If you go to the “Share” button, then click on advanced you will get a screen like this and have some options.</p> <p><img src="https://raw.githubusercontent.com/simplystats/simplystats.github.io/master/_images/gs_share_advanced.png" alt="" /></p> <p><em>Private data collection</em></p> <p>In the example I’m using I haven’t changed the sharing settings, so while you can <em>see</em> the sheet, you can’t edit it. This is nice if you want to collect some data and allow other people to read it, but you don’t want them to edit it.</p> <p><em>Controlled distributed data collection</em></p> <p>If you just enter people’s emails then you can open the data collection to just those individuals you have shared the sheet with. Be careful though, if they don’t have Google email addresses, then they get a link which they could share with other people and this could lead to open data collection.</p> <p><em>Uncontrolled distributed data collection</em></p> <p>Another option is to click on “Change” next to “Private - Only you can access”. If you click on “On - Anyone with the link” and click on “Can View”.</p> <p><img src="https://raw.githubusercontent.com/simplystats/simplystats.github.io/master/_images/gs_can_view.png" alt="" /></p> <p>Then you can modify it to say “Can Edit” and hit “Save”. Now anyone who has the link can edit the Google Sheet. This means that you can’t control who will be editing it (careful!) but you can really widely distribute the data collection.</p> <h3 id="collecting-data">Collecting data</h3> <p>Once you have distributed the link either to your collaborators or more widely it is time to collect data. This is where I think that the “app” part of this is so cool. You can edit the Google Sheet from a Desktop computer, but if you have the (free!) Google Sheets app for your phone then you can also edit the data on the go. There is even an offline mode if the internet connection isn’t available where you are working (more on this below).</p> <h3 id="quality-control">Quality control</h3> <p>One of the major issues with distributed data collection is quality control. If possible you want people to input data using (a) a controlled vocubulary/system and (b) the same controlled vocabulary/system. My suggestion here depends on whether you are using a controlled distributed system or an uncontrolled distributed system.</p> <p>For the controlled distributed system you are specifically giving access to individual people - you can provide some training or a walk through before giving them access.</p> <p>For the uncontrolled distributed system you should create a <em>very</em> detailed set of instructions. For example, for my sheet I would create a set of instructions like:</p> <ol> <li>Every data point must have a label of who collected in in the <code class="highlighter-rouge">who_collected</code> column. You should pick a username that does not currently appear in the sheet and stick with it. Use all lower case for your username.</li> <li>You should either report “yes” or “no” in lowercase in the <code class="highlighter-rouge">at_work</code> column.</li> <li>You should report the name of the person in all lower case in the <code class="highlighter-rouge">person</code> column. You should search and make sure that the person you are reporting on doesn’t appear before introducing a new name. If the name already exists, use the name spelled exactly as it is in the sheet already.</li> <li>You should report the <code class="highlighter-rouge">time</code> in the format hh:mm on a 24 hour clock in the eastern time zone of the United States.</li> <li>You should report the <code class="highlighter-rouge">date</code> in the mm/dd/yyyy format.</li> </ol> <p>You could be much more detailed depending on the case.</p> <h3 id="offline-editing-and-conflicts">Offline editing and conflicts</h3> <p>One of the cool things about Google Sheets is that they can even be edited without an internet connection. This is particularly useful if you are collecting data in places where internet connections may be spotty. But that may generate conflicts if you use only one sheet.</p> <p>There may be different ways to handle this, but one I thought of is to just create one sheet for each person collecting data (if you are using controlled distributed data collection). Then each person only edits the data in their sheet, avoiding potential conflicts if multiple people are editing offline and non-synchronously.</p> <h3 id="reading-the-data">Reading the data</h3> <p>Anyone with the link can now read the most up-to-date data with the following simple code.</p> <div class="language-r highlighter-rouge"><pre class="highlight"><code><span class="c1">## Identifies the Google Sheet </span><span class="n">example_sheet</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">gs_url</span><span class="p">(</span><span class="s2">"https://docs.google.com/spreadsheets/d/177WyyzWOHGIQ9O5iUY9P9IVwGi7jL3f4XBY4d98CY_o/pubhtml"</span><span class="p">)</span><span class="w"> </span></code></pre> </div> <div class="highlighter-rouge"><pre class="highlight"><code>## Sheet-identifying info appears to be a browser URL. ## googlesheets will attempt to extract sheet key from the URL. </code></pre> </div> <div class="highlighter-rouge"><pre class="highlight"><code>## Putative key: 177WyyzWOHGIQ9O5iUY9P9IVwGi7jL3f4XBY4d98CY_o </code></pre> </div> <div class="highlighter-rouge"><pre class="highlight"><code>## Sheet successfully identified: "app_example" </code></pre> </div> <div class="language-r highlighter-rouge"><pre class="highlight"><code><span class="c1">## Reads the data </span><span class="n">dat</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">gs_read</span><span class="p">(</span><span class="n">example_sheet</span><span class="p">,</span><span class="w"> </span><span class="n">ws</span><span class="o">=</span><span class="s2">"Sheet1"</span><span class="p">)</span><span class="w"> </span></code></pre> </div> <div class="highlighter-rouge"><pre class="highlight"><code>## Accessing worksheet titled 'Sheet1'. </code></pre> </div> <div class="highlighter-rouge"><pre class="highlight"><code>## No encoding supplied: defaulting to UTF-8. </code></pre> </div> <div class="language-r highlighter-rouge"><pre class="highlight"><code><span class="n">dat</span><span class="w"> </span></code></pre> </div> <div class="highlighter-rouge"><pre class="highlight"><code>## # A tibble: 3 x 5 ## who_collected at_work person time date ## &lt;chr&gt; &lt;chr&gt; &lt;chr&gt; &lt;chr&gt; &lt;chr&gt; ## 1 jeff no ingo 13:47 08/26/2016 ## 2 jeff yes roger 13:47 08/26/2016 ## 3 jeff yes brian 13:47 08/26/2016 </code></pre> </div> <p>Here the url is the one I got when I went to the “File” menu and clicked on “Publish to the web…”. The argument <code class="highlighter-rouge">ws</code> in the <code class="highlighter-rouge">gs_read</code> command is the name of the worksheet. If you have multiple sheets assigned to different people, you can read them in one at a time and then merge them together.</p> <h3 id="conclusion">Conclusion</h3> <p>So that’s it, its pretty simple. But as I gear up to teach advanced data science here at Hopkins I’m thinking a lot about Sean Taylor’s awesome post <a href="http://seanjtaylor.com/post/41463778912/real-scientists-make-their-own-data">Real scientists make their own data</a></p> <p>I think this approach is a super cool/super lightweight system for collecting data either on your own or as a team. As I said I think it could be really useful in public health, but it could also be used for any data collection you want.</p> Interview with COPSS award winner Nicolai Meinshausen. 2016-08-24T00:00:00+00:00 http://simplystats.github.io/2016/08/24/meinshausen-copss <p><em>Editor’s Note: We are again pleased to interview the COPSS President’s award winner. The COPSS Award is one of the most prestigious in statistics, sometimes called the Nobel Prize in statistics. This year the award went to Nicolai Meinshausen from ETH Zurich. He is known for his work in causality, high-dimensional statistics, and machine learning. Also see our past COPSS award interviews with <a href="http://simplystatistics.org/2015/08/25/interview-with-copss-award-winner-john-storey/">John Storey</a> and <a href="http://simplystatistics.org/2014/08/18/interview-with-copss-award-winner-martin-wainright/">Martin Wainwright</a>.</em></p> <p><img src="https://raw.githubusercontent.com/simplystats/simplystats.github.io/master/_images/meinshausen.png" alt="Nicolai Meinshausen" /></p> <h2 id="do-you-consider-yourself-to-be-a-statistician-data-scientist-machine-learner-or-something-else">Do you consider yourself to be a statistician, data scientist, machine learner, or something else?</h2> <p>Perhaps all of the above. If you forced me to pick one, then statistician but I hope we will soon come to a point where these distinctions do not matter much any more.</p> <h2 id="how-did-you-find-out-you-had-won-the-copss-award">How did you find out you had won the COPSS award?</h2> <p>Jeremy Taylor called me. I know I am expected to say I did not expect it but that was indeed the case and it was a genuine surprise.</p> <h2 id="how-do-you-see-the-fields-of-causal-inference-and-high-dimensional-statistics-merging">How do you see the fields of causal inference and high-dimensional statistics merging?</h2> <p>Causal inference is already very challenging in the low-dimensional case - if understood as data for which the number of observations exceeds the number of variables. There are commonalities between high-dimensional statistics and the subfield of causal discovery, however, as we try to recover a sparse underlying structure from data in both cases (say when trying to reconstruct a gene network from observational and intervention data). The interpretations are just slightly different. A further difference is the implicit optimization. High-dimensional estimators can often be framed as convex optimization problems and the question is whether causal discovery can or should be pushed in this direction as well.</p> <h2 id="can-you-explain-a-little-about-how-you-can-infer-causal-effects-from-inhomogeneous-data">Can you explain a little about how you can infer causal effects from inhomogeneous data?</h2> <p>Why do we want a causal model in the first place? In most cases the benefit of a causal over a regression model is that the predictions of a causal model continue to be valid even if we intervene on the variables we use for prediction.</p> <p>The inference we proposed turns this around and is looking for all models that are invariant in the sense that they give the same prediction accuracy across a number of different settings or environments. If we just have observational data, then this invariance holds for all models but if the data are inhomogeneous, certain models can be discarded since they work better in one environment than in another and can thus not be causal. If all models that show invariance use a certain variable, we can claim that the variable in question has a causal effect (while controlling type I error rates) and construct confidence intervals for the strength of the effect.</p> <h2 id="you-have-worked-on-studying-the-effects-of-climate-change---do-you-think-statisticians-can-play-an-important-role-in-this-debate">You have worked on studying the effects of climate change - do you think statisticians can play an important role in this debate?</h2> <p>To a certain extent. I have worked on several projects with physicists and the general caveat is that physicists are in general quite advanced in their methodology already and have quite a good understanding of the relevant statistical concepts. Biology is thus maybe a field where even more external input is required. Then again, it saves one from having to calculate t-tests in collaborations with physicists and just the interestingand challenging problems are left.</p> <h2 id="what-advice-would-you-give-young-statisticians-getting-into-the-discipline-right-now">What advice would you give young statisticians getting into the discipline right now?</h2> <p>First I would say that they have made a good choice since these are interesting times for the field with many challenging and relevant problems still open and unsolved (but not completely out of reach either). I think its important to keep an open mind and read as much literature as possible from neighbouring fields. My personal experience has been that it is very beneficial to get involved in some scientific collaborations.</p> <h2 id="what-sorts-of-things-is-your-group-working-on-these-days">What sorts of things is your group working on these days?</h2> <p>Two general themes: the first is what people would call more classical machine learning. For example, how can we detect interactions in large-scale datasets in sub-quadratic time? Secondly, we are trying to extend the invariance approach to causal inference to more general settings, for example allowing for nonlinearities and hidden variables while at the same time improving the computational aspects.</p> A Simple Explanation for the Replication Crisis in Science 2016-08-24T00:00:00+00:00 http://simplystats.github.io/2016/08/24/replication-crisis <p>By now, you’ve probably heard of the <a href="https://en.wikipedia.org/wiki/Replication_crisis">replication crisis in science</a>. In summary, many conclusions from experiments done in a variety of fields have been found to not hold water when followed up in subsequent experiments. There are now any number of famous examples now, particularly from the fields of <a href="http://science.sciencemag.org/content/349/6251/aac4716">psychology</a> and <a href="http://biorxiv.org/content/early/2016/04/27/050575">clinical medicine</a> that show that the rate of replication of findings is less than the the expected rate.</p> <p>The reasons proposed for this crisis are wide ranging, but typical center on the preference for “novel” findings in science and the pressure on investigators (especially new ones) to “publish or perish”. My favorite reason places the blame for the entire crisis on <a href="http://www.nature.com/news/psychology-journal-bans-p-values-1.17001">p-values</a>.</p> <p>I think to develop a better understanding of why there is a “crisis”, we need to step back and look across differend fields of study. There is one key question you should be asking yourself: <em>Is the replication crisis evenly distributed across different scientific disciplines?</em> My reading of the literature would suggest “no”, but the reasons attributed to the replication crisis are common to all scientists in every field (i.e. novel findings, publishing, etc.). So why would there be any heterogeneity?</p> <h2 id="an-aside-on-replication-and-reproducibility">An Aside on Replication and Reproducibility</h2> <p>As Lorena Barba recently <a href="https://twitter.com/LorenaABarba/status/764836487212957696">pointed</a> <a href="https://github.com/ReScience/ReScience-article/issues/5#issuecomment-241232791">out</a>, there can be tremendous confusion over the use of the words <strong>replication</strong> and <strong>reproducibility</strong>, so it’s best that we sort that out now. Here’s how I use both words:</p> <ul> <li> <p><em>replication</em>: This is the act of repeating an entire study, independently of the original investigator without the use of original data (but generally using the same methods).</p> </li> <li> <p><em>reproducibility</em>: A study is reproducible if you can take the original data and the <em>computer code</em> used to analyze the data and reproduce all of the numerical findings from the study. This may initially sound like a trivial task but experience has shown that it’s not always easy to achieve this seemly minimal standard.</p> </li> </ul> <p>For more precise definitions of what I mean by these terms, you can take a look at <a href="http://biorxiv.org/content/early/2016/07/29/066803">my recent paper with Jeff Leek and Prasad Patil</a>.</p> <p>One key distinction between replication and reproducibility is that with replication, there is no need to trust any of the original findings. In fact, you may be attempting to replicate a study <em>because</em> you do not trust the findings of the original study. A recent example includes the creation of stem cells from ordinary cells, a finding that was so extraodinary that it led several laboratories to quickly try to replicate the study. Ultimately, <a href="http://www.nature.com/nature/journal/v525/n7570/full/nature15513.html">seven separate laboratories could not replicate the finding</a> and the original study was ultimately retracted.</p> <h2 id="astronomy-and-epidemiology">Astronomy and Epidemiology</h2> <p>What do the fields of astronomy and epidemiology have in common? You might think nothing. Those two departments are often not even on the same campus at most universities! However, they have at least one common element, which is that the things that they study are generally reluctant to be controlled by human beings. As a result, both astronomers and epidemiologist rely heavily on one tools: the <strong>observational study</strong>. Much has been written about observational studies of late, and I’ll spare you the literature search by saying that the bottom line is they can’t be trusted (particularly observational studies that have not been pre-registered!).</p> <p>But that’s fine—we have a method for dealing with things we don’t trust: It’s called replication. Epidemiologists actually codified their understanding of when they believe a causal claim (see <a href="https://en.wikipedia.org/wiki/Bradford_Hill_criteria">Hill’s Criteria</a>), which is typically only after a claim has been replicated in numerous studies in a variety of settings. My understanding is that astronomers have a similar mentality as well—no single study will result in anyone believe something new about the universe. Rather, findings need to be replicated using different approaches, instruments, etc.</p> <p>The key point here is that in both astronomy and epidemiology expectations are low with respect to individual studies. <strong>It’s difficult to have a replication crisis when nobody believes the findings in the first place</strong>. Investigators have a culture of distrusting individual one-off findings until they have been replicated numerous times. In my own area of research, the idea that ambient air pollution causes health problems was difficult to believe for decades, until we started seeing the same associations appear in numerous studies conducted all around the world. It’s hard to imagine any single study “proving” that connection, no matter how well it was conducted.</p> <h2 id="theory-and-experimentation-in-science">Theory and Experimentation in Science</h2> <p>I’ve already described the primary mode of investigation in astronomy and epidemiology, but there are of course other methods in other fields. One large category of methods includes the <strong>controlled experiment</strong>. Controlled experiments come in a variety of forms, whether they are laboratory experiments on cells or randomized clinical trials with humans, all of them involve intentional manipulation of some factor by the investigator in order to observe how such manipulation affects an outcome. In clinical medicine and the social sciences, controlled experiments are considered the “gold standard” of evidence. Meta-analyses and literature summaries generally weight publications with controlled experiments more highly than other approaches like observational studies.</p> <p>The other aspect I want to look at here is whether a field has a strong basic theoretical foundation. The idea here is that some fields, like say physics, have a strong set of basic theories whose predictions have been consistently validated over time. Other fields, like medicine, lack even the most rudimentary theories that can be used to make basic predictions. Granted, the distinction between fields with or without “basic theory” is a bit arbitrary on my part, but I think it’s fair to say that different fields of study fall on a spectrum in terms of how much basic theory they can rely on.</p> <p>Given the theoretical nature of different fields and the primary mode of investigation, we can develop the following crude 2x2 table, in which I’ve inserted some representative fields of study.</p> <p><img src="https://raw.githubusercontent.com/simplystats/simplystats.github.io/master/_images/replication_2x2.png" alt="Theory vs. Experimentation in Science" /></p> <p>My primary contention here is</p> <blockquote> <p>The replication crisis in science is concentrated in areas where (1) there is a tradition of controlled experimentation and (2) there is relatively little basic theory underpinning the field.</p> </blockquote> <p>Further, in general, I don’t believe that there’s anything wrong with the people tirelessly working in the upper right box. At least, I don’t think there’s anything <em>more</em> wrong with them compared to the good people working in the other three boxes.</p> <p>In case anyone is wondering where the state of clinical science is relative to, say, particle physics with respect to basic theory, I only point you to the web site for the <a href="https://nccih.nih.gov">National Center for Complementary and Integrative Health</a>. This is essentially a government agency with a budget of $124 million dedicated to <a href="http://www.forbes.com/sites/stevensalzberg/2011/08/29/nihs-350000-questionnaire/#1ee73d4d4fc6">advancing pseudoscience</a>. This is the state of “basic theory” in clinical medicine.</p> <h2 id="the-bottom-line">The Bottom Line</h2> <p>The people working in the upper right box have an uphill battle for at least two reasons</p> <ol> <li>The lack of strong basic theory makes it more difficult to guide investigation, leading to wider ranging efforts that may be less likely to replicate.</li> <li>The tradition of controlled experimentation places <em>high expectations</em> that research produced here is “valid”. I mean, hey, they’re using the gold standard of evidence, right?</li> </ol> <p>The confluence of these two factors leads to a much greater disappointment when findings from these fields do not replicate. This leads me to believe that <strong>the replication crisis in science is largely attributable to a mismatch in our expectations of how often findings should replicate and how difficult it is to actually discover true findings in certain fields</strong>. Further, the reliance of controlled experiements in certain fields has lulled us into believing that we can place tremendous weight on a small number of studies. Ultimately, when someone asks, “Why <em>haven’t</em> we cured cancer yet?” the answer is “Because it’s hard”.</p> <h2 id="the-silver-lining">The Silver Lining</h2> <p>It’s important to remember that, as my colleague Rafa Irizarry <a href="http://simplystatistics.org/2013/08/01/the-roc-curves-of-science/">pointed out</a>, findings from many of the fields in the upper right box, especially clinical medicine, can have tremendous positive impacts on our lives when they do work out. Rafa says</p> <blockquote> <p>…I argue that the rate of discoveries is higher in biomedical research than in physics. But, to achieve this higher true positive rate, biomedical research has to tolerate a higher false positive rate.</p> </blockquote> <p>It is certainly possible to reduce the rate of false positives—one way would be to do no experiments at all! But is that what we want? Would that most benefit us as a society?</p> <h2 id="the-takeaway">The Takeaway</h2> <p>What to do? Here are a few thoughts:</p> <ul> <li>We need to stop thinking that any single study is definitive or confirmatory, no matter if it was a controlled experiment or not. Science is always a cumulative business, and the value of a given study should be understood in the context of what came before it.</li> <li>We have to recognize that some areas are more difficult to study and are less mature than other areas because of the lack of basic theory to guide us.</li> <li>We need to think about what the tradeoffs are for discovering many things that may not pan out relative to discovering only a few things. What are we willing to accept in a given field? This is a discussion that I’ve not seen much of.</li> <li>As Rafa pointed out in his article, we can definitely focus on things that make science better for everyone (better methods, rigorous designs, etc.).</li> </ul> A meta list of what to do at JSM 2016 2016-07-30T00:00:00+00:00 http://simplystats.github.io/2016/07/30/jsm2016 <p>I’m going to be heading out tomorrow for JSM 2016. If you want to catch up I’ll be presenting in the 6-8PM poster session on <a href="https://www.amstat.org/meetings/jsm/2016/onlineprogram/ActivityDetails.cfm?SessionID=213079">The Extraordinary Power of Data</a> on Sunday and on <a href="https://www.amstat.org/meetings/jsm/2016/onlineprogram/ActivityDetails.cfm?SessionID=212543">data visualization (and other things) in MOOCs</a> at 8:30am on Monday. Here is a little sneak preview, the first slide from my talk:</p> <p><img src="https://raw.githubusercontent.com/simplystats/simplystats.github.io/master/_images/firstslide.jpg" alt="Was too scared to use GIFs" /></p> <p>This year I am so excited that other people have done all the work of going through the program for me and picking out what talks to see. Here is a list of lists.</p> <ul> <li><a href="https://kbroman.wordpress.com/2016/07/27/my-jsm-2016-itinerary/">Karl Broman</a> - if you like open source software, data viz, and genomics.</li> <li><a href="https://blog.rstudio.org/2016/07/19/discover-r-and-rstudio-at-jsm-2016-chicago/">Rstudio</a> - if you like Rstudio</li> <li><a href="http://citizen-statistician.org/2016/07/29/my-jsm2016-itinerary/">Mine Cetinkaya Rundel</a> - if you like stat ed, data science, data viz, and data journalism.</li> <li><a href="https://twitter.com/DrJWolfson/status/758990552754827264">Julian Wolfson</a> - if you like missing sessions and guilt.</li> <li><a href="https://github.com/stephaniehicks/classroomNotes/blob/master/conferences/JSM2016.md">Stephanie Hicks</a> - if you like lots of sessions and can’t make up your mind (also stat genomics, open source software, stat computing, stats for social good…)</li> </ul> <p>If you know about more lists, please feel free to tweet at me or send pull requests.</p> <p>I also saw the materials for this <a href="https://github.com/simonmunzert/rscraping-jsm-2016">awesome tutorial on webscraping</a> that I’m sorry I’ll miss.</p> The relativity of raw data 2016-07-20T00:00:00+00:00 http://simplystats.github.io/2016/07/20/relativity-raw-data <p>“Raw data” is one of those terms that everyone in statistics and data science uses but no one defines. For example, we all agree that we should be able to recreate results in scientific papers from the raw data and the code for that paper.</p> <blockquote> <p>But what do we mean when we say raw data?</p> </blockquote> <p>When working with collaborators or students I often find myself saying - could you just give me the raw data so I can do the normalization or processing myself. To give a concrete example, I work in the analysis of data from <a href="http://www.nature.com/nbt/journal/v26/n10/full/nbt1486.html">high-throughput genomic sequencing experiments</a>.</p> <p>These experiments produce data by breaking up genomic molecules into short fragements of DNA - then reading off parts of those fragments to generate “reads” - usually 100 to 200 letters long per read. But the reads are just puzzle pieces that need to be fit back together and then quantified to produce measurements on DNA variation or gene expression abundances.</p> <p><img src="https://raw.githubusercontent.com/simplystats/simplystats.github.io/master/_images/sequencing.png" alt="High throughput sequencing" /></p> <p><a href="http://cbcb.umd.edu/~hcorrada/CFG/lectures/lect22_seqIntro/seqIntro.pdf">Image from Hector Corrata Bravo’s lecture notes</a></p> <p>When I say “raw data” when talking to a collaborator I mean the reads that are reported from the sequencing machine. To me that is the rawest form of the data I will look at. But to generate those reads the sequencing machine first (1) created a set of images for each letter in the sequence of reads, (2) measured the color at the spots on that image to get the quantitative measurement of which letter, and (3) calculated which letter was there with a confidence measure. The raw data I ask for only includes the confidence measure and the sequence of letters itself, but ignores the images and the colors extracted from them (steps 1 and 2).</p> <p>So to me the “raw data” is the files of reads. But to the people who produce the machine for sequencing the raw data may be the images or the color data. To my collaborator the raw data may be the quantitative measurements I calculate from the reads. When thinking about this I realized an important characteristics of raw data.</p> <blockquote> <p>Raw data is relative to your reference frame.</p> </blockquote> <p>In other words the raw data is raw to <em>you</em> if you have done no processing, manipulation, coding, or analysis of the data. In other words, the file you received from the person before you is untouched. But it may not be the <em>rawest</em> version of the data. The person who gave you the raw data may have done some computations. They have a different “raw data set”.</p> <p>The implication for reproducibility and replicability is that we need a “chain of custody” just like with evidence collected by the police. As long as each person keeps a copy and record of the “raw data” to them you can trace the provencance of the data back to the original source.</p> Not So Standard Deviations Episode 18 - Divide by n-1, or n-2, or Whatever 2016-07-18T00:00:00+00:00 http://simplystats.github.io/2016/07/18/nssd-episode-19 <p>Hilary and I talk about statistical software in fMRI analyses, the differences between software testing differences in proportions (a must listen!), and a preview of JSM 2016.</p> <p>Also, Hilary and I have just published a new book, <a href="https://leanpub.com/conversationsondatascience?utm_source=SimplyStats&amp;utm_campaign=NSSD&amp;utm_medium=BlogPost">Conversations on Data Science</a>, which collects some of our episodes in an easy-to-read format. The books is available from Leanpub and will be updated as we record more episodes.</p> <p>If you have questions you’d like us to answer, you can send them to nssdeviations @ gmail.com or tweet us at <a href="https://twitter.com/nssdeviations">@NSSDeviations</a>.</p> <p><a href="https://itunes.apple.com/us/podcast/not-so-standard-deviations/id1040614570">Subscribe to the podcast on iTunes</a>.</p> <p><a href="https://play.google.com/music/listen?u=0#/ps/Izfnbx6tlruojkfrvhjfdj3nmna">Subscribe to the podcast on Google Play</a>.</p> <p>Please <a href="https://itunes.apple.com/us/podcast/not-so-standard-deviations/id1040614570">leave us a review on iTunes</a>!</p> <p>Support us through our <a href="https://www.patreon.com/NSSDeviations?ty=h">Patreon page</a>.</p> <p>Show Notes:</p> <ul> <li> <p><a href="http://www.theregister.co.uk/2016/07/03/mri_software_bugs_could_upend_years_of_research/?mt=1467760452040">fMRI bugs could upend years of research</a></p> </li> <li> <p><a href="http://www.pnas.org/content/113/28/7900.full">Eklund et al. PNAS Paper</a></p> </li> <li> <p><a href="https://www.amstat.org/meetings/jsm/2016/onlineprogram/index.cfm">JSM 2016 Program</a></p> </li> <li> <p><a href="https://leanpub.com/conversationsondatascience">Conversations on Data Science</a></p> </li> </ul> <p><a href="https://soundcloud.com/nssd-podcast/episode-19-divide-by-n-1-or-n-2-or-whatever">Download the audio for this episode</a>.</p> <p>Listen here:</p> <iframe width="100%" height="166" scrolling="no" frameborder="no" src="https://w.soundcloud.com/player/?url=https%3A//api.soundcloud.com/tracks/274214566&amp;color=ff5500&amp;auto_play=false&amp;hide_related=false&amp;show_comments=true&amp;show_user=true&amp;show_reposts=false"></iframe> Tuesday update 2016-07-11T00:00:00+00:00 http://simplystats.github.io/2016/07/11/tuesday-update <h2 id="it-might-all-be-wrong">It Might All Be Wrong</h2> <p>Tom Nichols and colleagues have published a paper on the software used to analyze fMRI data:</p> <blockquote> <p>Functional MRI (fMRI) is 25 years old, yet surprisingly its most common statistical methods have not been validated using real data. Here, we used resting-state fMRI data from 499 healthy controls to conduct 3 million task group analyses. Using this null data with different experimental designs, we estimate the incidence of significant results. In theory, we should find 5% false positives (for a significance threshold of 5%), but instead we found that the most common software packages for fMRI analysis (SPM, FSL, AFNI) can result in false-positive rates of up to 70%. These results question the validity of some 40,000 fMRI studies and may have a large impact on the interpretation of neuroimaging results.</p> </blockquote> <h2 id="criminal-justice-forecasts">Criminal Justice Forecasts</h2> <p>The <a href="http://www.theatlantic.com/technology/archive/2016/06/when-algorithms-take-the-stand/489566/">ongoing discussion</a> over the use of prediction algorithms in the criminal justice system reminds me a bit of the introduction of DNA evidence decades ago. Ultimately, there is a technology that few people truly understand and there are questions as to whether the information they provide is fair or accurate.</p> <h2 id="shameless-promotion">Shameless Promotion</h2> <p>I have a <a href="https://leanpub.com/conversationsondatascience">new book</a> coming out with Hilary Parker, based on our <em>Not So Standard Deviations</em> podcast. Sign up to be notified of its release (which should be Real Soon Now).</p> Not So Standard Deviations Episode 18 - Back on Planet Earth 2016-07-05T00:00:00+00:00 http://simplystats.github.io/2016/07/05/nssd-episode-18 <p>With Hilary fresh from Use R! 2016, Hilary and I discuss some of the highlights from the conference. Also, some followup about a previous Free Advertising and the NSSD drinking game.</p> <p>If you have questions you’d like us to answer, you can send them to nssdeviations @ gmail.com or tweet us at <a href="https://twitter.com/nssdeviations">@NSSDeviations</a>.</p> <p><a href="https://itunes.apple.com/us/podcast/not-so-standard-deviations/id1040614570">Subscribe to the podcast on iTunes</a>.</p> <p><a href="https://play.google.com/music/listen?u=0#/ps/Izfnbx6tlruojkfrvhjfdj3nmna">Subscribe to the podcast on Google Play</a>.</p> <p>Please <a href="https://itunes.apple.com/us/podcast/not-so-standard-deviations/id1040614570">leave us a review on iTunes</a>!</p> <p>Support us through our <a href="https://www.patreon.com/NSSDeviations?ty=h">Patreon page</a>.</p> <p>Show notes:</p> <ul> <li> <p><a href="http://www.vanityfair.com/hollywood/2016/06/jennifer-lawrence-theranos-elizabeth-holmes">Theranos movie with Jennifer Lawrence and Adam McKay</a></p> </li> <li> <p><a href="https://en.wikipedia.org/wiki/Snowden_(film)">Snowden movie</a></p> </li> <li> <p><a href="http://www.npr.org/2016/06/19/482514949/welcome-to-mongolias-new-postal-system-an-atlas-of-random-words">What3Words being used in Mongolia</a></p> </li> <li> <p><a href="https://github.com/jimhester/lintr">lintr package</a></p> </li> <li> <p><a href="https://youtu.be/dhh8Ao4yweQ">“The Electronic Coach” with Don Knuth</a></p> </li> <li> <p><a href="http://alyssafrazee.com/gender-and-github-code.html">Exploring the data on gender and GitHub repo ownership</a></p> </li> <li> <p><a href="https://blog.codinghorror.com/falling-into-the-pit-of-success/">Jeff Atwood “Falling Into the Pit of Success”</a></p> </li> <li> <p><a href="https://research.googleblog.com/2014/08/doing-data-science-with-colaboratory.html">Google coLaboratory</a></p> </li> <li> <p><a href="https://www.stickermule.com/marketplace/12936-number-rcatladies">#rcatladies stickers</a></p> </li> <li> <p><a href="https://twitter.com/astrokatie/status/745529809669787649">Katie Mack time-lapse video</a></p> </li> </ul> <p><a href="https://soundcloud.com/nssd-podcast/episode-18-back-on-planet-earth">Download the audio for this episode</a>.</p> <iframe width="100%" height="166" scrolling="no" frameborder="no" src="https://w.soundcloud.com/player/?url=https%3A//api.soundcloud.com/tracks/272064450&amp;color=ff5500&amp;auto_play=false&amp;hide_related=false&amp;show_comments=true&amp;show_user=true&amp;show_reposts=false"></iframe> Tuesday Update 2016-06-28T00:00:00+00:00 http://simplystats.github.io/2016/06/28/tuesday-update <h2 id="if-you-werent-sick-of-theranos-yet">If you weren’t sick of Theranos yet….</h2> <p>Looks like there will be a movie version of the <a href="http://simplystatistics.org/2016/05/23/update-on-theranos/">Theranos saga</a> which, as far as I can tell, isn’t over yet, but no matter. It will be done by Adam McKay, the writer-director of The Big Short (excellent film), and will star Jennifer Lawrence as Elizabeth Holmes. From <a href="http://www.vanityfair.com/hollywood/2016/06/jennifer-lawrence-theranos-elizabeth-holmes">Vanity Fair</a>:</p> <blockquote> <p>Legendary Pictures snapped up rights to the hot-button biopic for a reported$3 million Thursday evening, after outbidding and outlasting a swarm of competition from Warner Bros., Twentieth Century Fox, STX Entertainment, Regency Enterprises, Cross Creek, Amazon Studios, AG Capital, the Weinstein Company, and, in the penultimate stretch, Paramount, among other studio suitors.</p> </blockquote> <blockquote> <p>Based on a book proposal by two-time Pulitzer Prize-winning journalist John Carreyrou titled Bad Blood: Secrets and Lies in Silicon Valley, the project (reported to be in the $40 million to$50 million budget range) has made the rounds to almost every studio in town. It’s been personally pitched by McKay, who won an Oscar for best adapted screenplay for last year’s rollicking financial meltdown procedural The Big Short.</p> </blockquote> <p>Frankly, I think we all know how this movie will end.</p> <h2 id="the-people-vs-oj-simpson-vsstatistics">The People vs. OJ Simpson vs….Statistics</h2> <p>I’m in the middle of watching <a href="https://en.wikipedia.org/wiki/The_People_v._O._J._Simpson:_American_Crime_Story">The People vs. OJ Simpson</a> and so far it is fantastic—I highly recommend it. One thing that is not represented in the show is the important role that statistics played in the trial. The trial was just in the early days of using DNA as evidence in criminal trials and there were many questions about how likely it was to find DNA matches in blood.</p> <p>Terry Speed ended up testifying for the defense (Simpson) and in this <a href="http://www.statisticsviews.com/details/feature/4915471/To-some-statisticians-a-number-is-a-number-but-to-me-a-number-is-packed-with-his.html">nice interview</a>, he explains how that came to be:</p> <blockquote> <p>At the beginning of the Simpson trial, there was going to be a pre-trial hearing and experts from both sides would argue in front of the judge as to what approaches should be accepted. Other pre-trial activities dragged on, and the one on DNA forensics was eventually scrapped. The DNA experts, including me were then asked whether they wanted to give evidence for the prosecution or defence, or leave. I did not initially plan to join the defence team, but wished to express my point of view in what was more or less a scientific environment before the trial started, but when the pre-trial DNA hearing was scrapped, I decided that I had no choice but to express my views in court on behalf of the defence, which I did.</p> </blockquote> <p>The full interview is well worth the read.</p> <h2 id="ai-is-the-residual">AI is the residual</h2> <p>I just recently found out about the <a href="https://en.m.wikipedia.org/wiki/AI_effect">AI effect</a> which I thought was interesting. Basically, “AI” is whatever can’t be explained, or in other words, the residuals of machine learning.</p> A Year at Stack Overflow 2016-06-28T00:00:00+00:00 http://simplystats.github.io/2016/06/28/stack-overflow-drob <p>David Robinson (<a href="https://twitter.com/drob">@drob</a>) has a great post on his blog about his <a href="http://varianceexplained.org/r/year_data_scientist/">first year as a data scientist at Stack Overflow</a>. This section in particular stood out for me:</p> <blockquote> <p>I like using R to learn interesting things about our data, but my longer term goal is to make it easy for any of our engineers to do so….Towards this goal, I’ve been focusing on building reliable tools and frameworks that people can apply to a variety of problems, rather than “one-off” analysis scripts. (There’s an awesome post by Jeff Magnusson at StitchFix about some of these general challenges). My approach has been building internal R packages, similar to AirBnb’s strategy (though our data team is quite a bit younger and smaller than theirs). These internal packages can query databases and parsing our internal APIs, including making various security and infrastructure issues invisible to the user.</p> </blockquote> <p>The world needs an army of David Robinsons.</p> Ultimate AI battle - Apple vs. Google 2016-06-14T00:00:00+00:00 http://simplystats.github.io/2016/06/14/ultimate-ai-battle <p>Yesterday, Apple launched its Worldwide Developer’s Conference (WWDC) and had its public keynote address. While many new things were announced, the one thing that caught my eye was the <a href="http://go.theinformation.com/HnOAdA6DQ7g">dramatic expansion</a> of Apple’s use of artificial intelligence (AI) tools. I talked a bit about AI with Hilary Parker on the latest <a href="http://simplystatistics.org/2016/06/09/nssd-episode-17/"><em>Not So Standard Deviations</em></a>, particularly in the context of Amazon’s Echo/Alexa, and I think it’s definitely going to be an area of intense competition between the major tech companies.</p> <p>Pretty much every major tech player is involved in AI—Google, Facebook, Amazon, Apple, Microsoft—the list goes on. Recently, a <a href="https://marco.org/2016/05/21/avoiding-blackberrys-fate">some commentators</a> <a href="https://stratechery.com/2015/tim-cooks-unfair-and-unrealistic-privacy-speech-strategy-credits-the-privacy-priority-problem/">have suggested</a> that Apple in particular will never catch up with the likes of Google with respect to AI because of Apple’s strict stance on privacy and unwillingness to gather/aggregate data from all its users. However, yesterday at WWDC, Apple revealed a few clues about what it was up to in the AI world.</p> <p>First, Apple mentioned deep learning more than a few times, including specifically calling out its use of <a href="https://en.wikipedia.org/wiki/Long_short-term_memory">LSTM</a> in its Messages app and facial recognition in its Photos app. Previously, Apple had been rumored to be applying deep learning to its <a href="http://go.theinformation.com/4Z2WhEs9_Nc">Siri assistant and its fingerprint sensor</a>. At WWDC, Craig Federighi stressed Apple’s continued focus on privacy and how Apple does not need to develop “user profiles” server-side, but rather does most computation on-device (in this case, on the iPhone).</p> <p>However, it can’t be that Apple does all its deep learning computation on the iPhone. These models tend to be enormous and take advantage of reams of data that can only be reasonablly processed server-side. Unfortunately, because most companies (Apple in particular) release few details about what they do, we may never how this works. But we can definitely speculate!</p> <h2 id="apple-vs-google">Apple vs. Google</h2> <p>I think the Apple/Google dichotomy provides an interesting opportunity to talk about how models can be learned using data in different ways. There are two approaches being represented here by Apple and Google:</p> <ul> <li><strong>Google way</strong> - Collect lots of data from users and store them on a server in the Googleplex somewhere. Then use that data to fit an enormous model that can predict when you’ve taken a picture of a cat. As users generate more data, bring that data back to the Googleplex and update/refine the model.</li> <li><strong>Apple way</strong> - Build a “starter model” in the Apple <a href="http://9to5mac.com/2015/10/05/spaceship-campus-2-drone-video-october/">Mothership</a>. As users generate data on their phones, bring the model to the phone and update the model using just their data. Bring the updated model back to the Apple Mothership and leave the user’s data on the phone.</li> </ul> <p>Perhaps the easiest way to understand this difference is with the arithmetic mean, which is perhaps the simplest “model”. Suppose you have a bunch of users out there and you want to compute the average of some attribute that they have on their phones (or whatever device). The first approach would be to get all that data and compute the mean in the usual way.</p> <p><img src="https://raw.githubusercontent.com/simplystats/simplystats.github.io/master/_images/googleway.png" alt="Google way" /></p> <p>Once all the data is in the Googleplex, we can just use the formula</p> <p><img src="https://raw.githubusercontent.com/simplystats/simplystats.github.io/master/_images/Googlemean.png" alt="Google mean" /></p> <p>I’ll call this the “Google mean” because it requires that you get all the data X<sub>1</sub> through X<sub>n</sub>, then sum them up and divide by n. Here, each of the X<sub>i</sub>’s represents the ith user’s data. The general principle here is to gather all the data and then estimate the model parameters server-side.</p> <p>What if you didn’t want to gather everyone’s data centrally? Can you still compute the mean?</p> <p><img src="https://raw.githubusercontent.com/simplystats/simplystats.github.io/master/_images/appleway.png" alt="Apple way" /></p> <p>Yes, because there’s a nice recurrence formula for the mean:</p> <p><img src="https://raw.githubusercontent.com/simplystats/simplystats.github.io/master/_images/Applemean.png" alt="Apple mean" /></p> <p>We can call this the “Apple mean”. With this strategy, we can send our current estimate of the mean to each user, update our estimate by taking the weighted average of the old value and the new value, and then move on to the next user. Here, you send the model parameters out to the users, update those parameters and then bring the parameters back.</p> <p>Which method is better? Well, in this case, both give you the same answer. In general, for linear models (like the mean), you can usually rework the formulas to build out either “whole data” (Google) approaches or “streaming” (Apple) approaches and get pretty much the same answer. But for non-linear models, it’s not so simple and you usually cannot achieve this kind of equivalence.</p> <h2 id="clients-and-servers">Clients and Servers</h2> <p>Balancing how much work is done on a server and how much is done on the client is an age-old computing problem and, over time, the balance of work between client and server seems to shift back and forth like a pendulum. When I was in grad school, we had so-called “dumb terminals” that were basically a screen that you used to login to the server. Today, I use my laptop for computing/work and that’s it. But I use the cloud for many other tasks.</p> <p>The Apple approach definitely requires a “fatter” client because the work of integrating current model parameters with new user data has to happen on the phone. With the Google approach, all the phone has to do is be able to collect the data and send it over the network to Google.</p> <p>The Apple approach is also closely related to what my colleagues <a href="http://www.biostat.jhsph.edu/~mlindqui/">Martin Lindquist</a> and <a href="http://www.bcaffo.com">Brian Caffo</a> refer to as “fusion science”, whereby Big Data and “Small Data” can be fused together via models to improve inference, but without ever having to actually combine the data. In a Bayesian context, you might think of the Big Data as making up the prior distribution and the Small Data as the likelihood. The Small Data can be used to update the model parameters and produce the posterior distribution, after which the Small Data can be thrown out.</p> <h2 id="and-the-winner-is">And the Winner is…</h2> <p>It’s not clear to me which approach is better in terms of building a better model for prediction or inference. Sadly, we may never have enough details to find out, and will only be ablle to evaluate which approach is better by the performance of the systems in the marketplace. But perhaps that’s the way things should be evaluated in this case?</p> Good list of good books 2016-06-13T00:00:00+00:00 http://simplystats.github.io/2016/06/13/good-books <p>The MultiThreaded blog over at Stitch Fix (hat tip to Hilary Parker) has posted a <a href="http://multithreaded.stitchfix.com/blog/2016/06/09/ds-books/">really nice list of data science books</a> (disclosure: one of <a href="https://leanpub.com/artofdatascience/">my books</a> is on the list).</p> <blockquote> <p>We’ve queried our data science team for some of their favorite data science books. This list is by no means exhaustive, but should keep any data scientist/engineer new or old learning and entertained for many an evening.</p> </blockquote> <p>Enjoy!</p> Not So Standard Deviations Episode 17 - Diurnal High Variance 2016-06-09T00:00:00+00:00 http://simplystats.github.io/2016/06/09/nssd-episode-17 <p>Hilary and I talk about Amazon Echo and Alexa as AI as a service, the COMPAS algorithm, criminal justice forecasts, and whether algorithms can introduce or remove bias (or both).</p> <p>If you have questions you’d like us to answer, you can send them to nssdeviations @ gmail.com or tweet us at <a href="https://twitter.com/nssdeviations">@NSSDeviations</a>.</p> <p><a href="https://itunes.apple.com/us/podcast/not-so-standard-deviations/id1040614570">Subscribe to the podcast on iTunes</a>.</p> <p><a href="https://play.google.com/music/listen?u=0#/ps/Izfnbx6tlruojkfrvhjfdj3nmna">Subscribe to the podcast on Google Play</a>.</p> <p>Please <a href="https://itunes.apple.com/us/podcast/not-so-standard-deviations/id1040614570">leave us a review on iTunes</a>!</p> <p>Support us through our <a href="https://www.patreon.com/NSSDeviations?ty=h">Patreon page</a>.</p> <p>Show notes:</p> <ul> <li> <p><a href="http://www.wired.com/2016/03/two-moves-alphago-lee-sedol-redefined-future/">In Two Moves, AlphaGo and Lee Sedol Redefined the Future</a></p> </li> <li> <p><a href="http://qz.com/639952/googles-ai-won-the-game-go-by-defying-millennia-of-basic-human-instinct/">Google’s AI won the game Go by defying millennia of basic human instinct</a></p> </li> <li> <p><a href="https://www.propublica.org/article/machine-bias-risk-assessments-in-criminal-sentencing">Machine Bias: There’s Software Used Across the Country to Predict Future Criminals. And it’s Biased Against Blacks</a></p> </li> <li> <p><a href="https://www.propublica.org/article/how-we-analyzed-the-compas-recidivism-algorithm">ProPublica analysis of COMPAS</a></p> </li> <li> <p><a href="http://www.amazon.com/Criminal-Justice-Forecasts-Risk-SpringerBriefs/dp/1461430844?ie=UTF8&amp;*Version*=1&amp;*entries*=0">Richard Berk’s <em>Criminal Justice Forecasts of Risk</em></a></p> </li> <li> <p><a href="http://www.amazon.com/Weapons-Math-Destruction-Increases-Inequality/dp/0553418815">Cathy O’Neill’s <em>Weapons of Math Destruction</em></a></p> </li> <li> <p><a href="https://mathbabe.org/2016/04/07/ill-stop-calling-algorithms-racist-when-you-stop-anthropomorphizing-ai/">I’ll stop calling algorithms racist when you stop anthropomorphizing AI</a></p> </li> <li> <p><a href="https://cran.r-project.org/web/packages/rmsfact/index.html">RMS Fact package</a></p> </li> <li> <p><a href="http://user2016.org">Use R! 2016</a></p> </li> </ul> <p><a href="https://soundcloud.com/nssd-podcast/episode-17-diurnal-high-variance">Download the audio for this episode.</a></p> <iframe width="100%" height="166" scrolling="no" frameborder="no" src="https://w.soundcloud.com/player/?url=https%3A//api.soundcloud.com/tracks/268232081&amp;color=ff5500&amp;auto_play=false&amp;hide_related=false&amp;show_comments=true&amp;show_user=true&amp;show_reposts=false"></iframe> Defining success - Four secrets of a successful data science experiment 2016-06-03T00:00:00+00:00 http://simplystats.github.io/2016/06/03/defining-success <p><em>Editor’s note: This post is excerpted from the book <a href="https://leanpub.com/eds">Executive Data Science: A Guide to Training and Managing the Best Data Scientists</a>, written by myself, Brian Caffo, and Jeff Leek. This particular section was written by Brian Caffo.</em></p> <p>Defining success is a crucial part of managing a data science experiment. Of course, success is often context specific. However, some aspects of success are general enough to merit discussion. A list of hallmarks of success includes:</p> <ol> <li>New knowledge is created.</li> <li>Decisions or policies are made based on the outcome of the experiment.</li> <li>A report, presentation, or app with impact is created.</li> <li>It is learned that the data can’t answer the question being asked of it.</li> </ol> <p>Some more negative outcomes include: Decisions being made that disregard clear evidence from the data, equivocal results that do not shed light in one direction or another, uncertainty which prevents new knowledge from being created.</p> <p>Let’s discuss some of the successful outcomes first.</p> <p>New knowledge seems ideal in many cases (especially since we are academics), but new knowledge doesn’t necessarily mean that it’s important. If this new knowledge produces actionable decisions or policies, that’s even better. The idea of having evidence-based policy, while perhaps newer than the analogous evidence-based medicine movement that has transformed medical practice, has the potential to similarly transform public policy. Finaly, that our data science products have great (positive) impact on an audience that is much wider than a group of data scientists, is of course ideal. Creating reusable code or apps is great way to increase the impact of a project and to disseminate its findings.</p> <p>The fourth point is perhaps the most controversial. I view it as a success if we can show that the data can’t answer the questions being asked. I am reminded of a friend who told a story of the company he worked at. They hired many expensive prediction consultants to help use their data to inform pricing. However, the prediction results weren’t helping. They were able to prove that the data couldn’t answer the hypothesis under study. There was too much noise and the measurements just weren’t accurately measuring what was needed. Sure, the result wasn’t optimal, as they still needed to know how to price things, but it did save money on consultants. I have since heard this story repeated nearly identically by friends in different industries.</p> Sometimes the biggest challenge is applying what we already know 2016-05-31T00:00:00+00:00 http://simplystats.github.io/2016/05/31/barrier-to-medication <p>There’s definitely a need to innovate and develop new treatments in the area of asthma, but it’s easy to underestimate the barriers to just doing what we already know, such as making sure that people are following existing, well-established guidelines on how to treat asthma. My colleague, Elizabeth Matsui, has <a href="http://skybrudeconsulting.com/blog/2016/05/31/barriers-medication.html">written about the challenges</a> in a <a href="https://clinicaltrials.gov/ct2/show/NCT02251379?term=ecatch&amp;rank=1">study</a> that we are collaborating on:</p> <blockquote> <p>Our group is currently conducting a study that includes implementation of national guidelines-based medical care for asthma, so that one process that we’ve had to get right is to <strong>prescribe an appropriate dose of medication and get it into the family’s hands</strong>. [emphasis added]</p> </blockquote> <p>Seems simple, right?</p> Sometimes there's friction for a reason 2016-05-24T00:00:00+00:00 http://simplystats.github.io/2016/05/24/somtimes-theres-friction-for-a-reason <p>Thinking about <a href="http://simplystatistics.org/2016/05/23/update-on-theranos/">my post on Theranos</a> yesterday it occurred to me that one thing that’s great about all of the innovation and technology coming out of places like Silicon Valley is the tremendous reduction of friction in our lives. With Uber it’s much easier to get a ride because of improvement in communication and an increase in the supply of cars. With Amazon, I can get that jug of <a href="http://www.amazon.com/Wesson-Pure-100%25-Natural-Vegetable/dp/B007F1KVX8/ref=sr_1_2_a_it?ie=UTF8&amp;qid=1464092378&amp;sr=8-2&amp;keywords=vegetable+oil">vegetable oil</a> that I always wanted without having to leave the house, because Amazon.</p> <p>So why is there all this friction? Sometimes it’s because of regulation, which may have made sense at an earlier time, but perhaps doesn’t make as much sense now. Sometimes, you need a company like Amazon to really master the logistics operation to be able to deliver anything anywhere. Otherwise, you’re just stuck driving to the grocery store to get that vegetable oil.</p> <p>But sometimes there’s friction for a reason. For example, <a href="https://stratechery.com/2013/friction/">Ben Thompson talks about</a> how previously there was quite a bit more friction involved before law enforcement could listen in on our communications. Although wiretapping had long been around (as <a href="http://davidsimon.com/we-are-shocked-shocked/">noted</a> by David Simon of…<a href="http://www.hbo.com/the-wire">The Wire</a>) the removal of all friction by the NSA made the situation quite different. Related to this idea is the massive <a href="http://www.vox.com/2016/5/12/11666116/70000-okcupid-users-data-release">data release from OkCupid</a> a few weeks ago, as I discussed on the latest <a href="https://soundcloud.com/nssd-podcast/episode-16-the-silicon-valley-episode">Not So Standard Deviations</a> podcast episode. Sure, your OkCupid profile is visible to everyone with an account, but having someone vacuum up 70,000 profiles and dumping them on the web for anyone to view is not what people signed up for—there is a qualitative difference there.</p> <p>When it comes to Theranos and diagnostic testing in general, there is similarly a need for some friction in order to protect public health. John Ioannides notes in his <a href="http://jama.jamanetwork.com/article.aspx?articleid=2524161#.Vz-lkeuAj9p.twitter">commentary for JAMA</a>:</p> <blockquote> <p>Even if the tests were accurate, when they are performed in massive scale and multiple times, the possibility of causing substantial harm from widespread testing is very real, as errors accumulate with multiple testing. Repeated testing of an individual is potentially a dangerous self-harm practice, and these individuals are destined to have some incorrect laboratory results and eventually experience harm, such as, for example, the anxiety of being labeled with a serious condition or adverse effects from increased testing and procedures to evaluate false-positive test results. Moreover, if the diagnostic testing process becomes dissociated from physicians, self-testing and self-interpretation could cause even more problems than they aim to solve.</p> </blockquote> <p>Unlike with the NSA, where the differences in scale may be difficult to quantify because the exact extent of the program is unknown to most people, with diagnostic testing, we can <a href="https://en.wikipedia.org/wiki/Bayes%27_theorem">precisely quantify</a> how a diagnostic test’s characteristics will change if we apply it to 1,000 people vs. 1,000,000 people. This is why organizations like the US Preventative Services Task Force so carefully considers recommendations for testing or screening (and why they have a really tough job).</p> <p>I’ll admit that a lot of the friction in our daily lives is pointless and it would be great to reduce it if possible. But in many cases, it was us that put the friction there for a reason, and it’s sometimes good to think about why before we move to eliminate it.</p> Update On Theranos 2016-05-23T00:00:00+00:00 http://simplystats.github.io/2016/05/23/update-on-theranos <p>I think it’s fair to say that things for Theranos, the Silicon Valley blood testing company, are not looking up. From the Wall Street Journal (via <a href="http://www.theverge.com/2016/5/19/11711004/theranos-voids-edison-blood-test-results">The Verge</a>):</p> <blockquote> <p>Theranos has voided two years of results from its Edison blood-testing machines, issuing tens of thousands of corrected reports to patients and doctors and raising the possibility that many health care decisions may have been made based on inaccurate data. The Wall Street Journal first reported the news, saying that many of the corrected tests have been run using traditional machinery. One doctor told the Journal that she sent a patient to the emergency room after seeing abnormal results from a Theranos test; the corrected report returned normal readings.</p> </blockquote> <p>Furthermore, <a href="http://jama.jamanetwork.com/article.aspx?articleid=2524161#.Vz-lkeuAj9p.twitter">this commentary in JAMA</a> from John Ioannides emphasizes the need for caution when implementing testing on a massive scale. In particular, “The notion of patients and healthy people being repeatedly tested in supermarkets and pharmacies, or eventually in cafeterias or at home, sounds revolutionary, but little is known about the consequences” and the consequences really matter here. In addition, as the title of the commentary would indicate, research done in secret is not research at all. For there the be credibility for a company like this, data needs to be made public.</p> <p>I <a href="http://simplystatistics.org/2015/10/28/discussion-of-the-theranos-controversy-with-elizabeth-matsui/">continue to maintain</a> that the fundamental premise on which the company is built, as stated by its founder Elizabeth Holmes, is flawed. Two concepts are repeatedly made in the context of Theranos:</p> <ul> <li><strong>More testing is better</strong>. Anyone who stayed awake in their introduction to Bayesian statistics lecture knows this is very difficult to make true in all circumstances, no matter how accurate a test is. With rare diseases, the number of false positives is overwhelming and can have very real harmful effects on people. Combine testing on a massive scale, with repeated application over time, and you get a recipe for confusion.</li> <li><strong>People do not get tested because they are afraid of needles</strong>. Elizabeth Holmes makes a big deal about her personal fear of needles and it’s impact on her (not) getting blood tests done. I have no doubt that many people share this fear, but I have serious doubt that this is the reason people don’t get the medical testing done. There are <a href="http://www.rwjf.org/en/library/research/2012/02/special-issue-of-health-services-research-links-health-care-rese/nonfinancial-barriers-and-access-to-care-for-us-adults.html">many barriers</a> to people getting the medical care that they need, many that are non-financial in nature and do not include fear of needles. The problem of getting people the medical care that they need is one deserving of serious attention, but changing the manner in which blood is collected is not going to do it.</li> </ul> Not So Standard Deviations Episode 16 - The Silicon Valley Episode 2016-05-23T00:00:00+00:00 http://simplystats.github.io/2016/05/23/nssd-episode-16 <p>Roger and Hilary are back, with Hilary broadcasting from the west coast. Hilary and Roger discuss the possibility of scaling data analysis and how that may or may not work for companies like Palantir. Also, the latest on Theranos and the release of data from OkCupid.</p> <p>If you have questions you’d like us to answer, you can send them to nssdeviations @ gmail.com or tweet us at <a href="https://twitter.com/nssdeviations">@NSSDeviations</a>.</p> <p><a href="https://itunes.apple.com/us/podcast/not-so-standard-deviations/id1040614570">Subscribe to the podcast on iTunes</a>.</p> <p><a href="https://play.google.com/music/listen?u=0#/ps/Izfnbx6tlruojkfrvhjfdj3nmna">Subscribe to the podcast on Google Play</a>.</p> <p>Please <a href="https://itunes.apple.com/us/podcast/not-so-standard-deviations/id1040614570">leave us a review on iTunes</a>!</p> <p>Support us through our <a href="https://www.patreon.com/NSSDeviations?ty=h">Patreon page</a>.</p> <p>Show notes:</p> <ul> <li> <p><a href="https://www.buzzfeed.com/williamalden/inside-palantir-silicon-valleys-most-secretive-company">BuzzFeed Article on Palantir</a></p> </li> <li> <p><a href="http://simplystatistics.org/2016/05/11/palantir-struggles/">Roger’s Simply Statistics post on Palantir</a></p> </li> <li> <p><a href="https://looker.com">Looker</a></p> </li> <li> <p><a href="http://simplystatistics.org/2015/03/17/data-science-done-well-looks-easy-and-that-is-a-big-problem-for-data-scientists/">Data science done well looks easy</a></p> </li> <li> <p><a href="http://www.wsj.com/articles/theranos-voids-two-years-of-edison-blood-test-results-1463616976">Latest on Theranos</a></p> </li> <li> <p><a href="http://www.vox.com/2016/5/12/11666116/70000-okcupid-users-data-release">OkCupid Data Release</a></p> </li> <li> <p><a href="http://fr.slideshare.net/sblank/secret-history-why-stanford-and-not-berkeley">Secret history of Silicon Valley</a></p> </li> <li> <p><a href="https://blog.wealthfront.com">Wealthfront blog</a></p> </li> </ul> <p><a href="https://soundcloud.com/nssd-podcast/episode-16-the-silicon-valley-episode">Download the audio for this episode.</a></p> <iframe width="100%" height="166" scrolling="no" frameborder="no" src="https://w.soundcloud.com/player/?url=https%3A//api.soundcloud.com/tracks/265158223&amp;color=ff5500&amp;auto_play=false&amp;hide_related=false&amp;show_comments=true&amp;show_user=true&amp;show_reposts=false"></iframe> What is software engineering for data science? 2016-05-18T00:00:00+00:00 http://simplystats.github.io/2016/05/18/software-engineering-data-science <p><em>Editor’s note: This post is a chapter from the book <a href="https://leanpub.com/eds">Executive Data Science: A Guide to Training and Managing the Best Data Scientists</a>, written by myself, Brian Caffo, and Jeff Leek.</em></p> <p>Software is the generalization of a specific aspect of a data analysis. If specific parts of a data analysis require implementing or applying a number of procedures or tools together, software is the encompassing of all these tools into a specific module or procedure that can be repeatedly applied in a variety of settings. Software allows for the systematizing and the standardizing of a procedure, so that different people can use it and understand what it’s going to do at any given time.</p> <p>Software is useful because it formalizes and abstracts the functionality of a set of procedures or tools, by developing a well defined interface to the analysis. Software will have an interface, or a set of inputs and a set of outputs that are well understood. People can think about the inputs and the outputs without having to worry about the gory details of what’s going on underneath. Now, they may be interested in those details, but the application of the software at any given setting will not necessarily depend on the knowledge of those details. Rather, the knowledge of the <em>interface</em> to that software is important to using it in any given situation.</p> <p>For example, most statistical packages will have a linear regression function which has a very well defined interface. Typically, you’ll have to input things like the outcome and the set of predictors, and maybe there will be some other inputs like the data set or weights. Most linear regression functions kind of work in that way. And importantly, the user does not have to know exactly how the linear regression calculation is done underneath the hood. Rather, they only need to know that they need to specify the outcome, the predictors, and a couple of other things. The linear regression function abstracts all the details that are required to implement linear regression, so that the user can apply the tool in a variety of settings.</p> <p>There are three levels of software that are important to consider, going from kind of the simplest to the most abstract.</p> <ol> <li>At the first level you might just have some code that you wrote, and you might want to encapsulate the automation of a set of procedures using a loop (or something similar) that repeats an operation multiple times.</li> <li>The next step might be some sort of function. Regardless of what language you may be using, often there will be some notion of a function, which is used to encapsulate a set of instructions. And the key thing about a function is that you’ll have to define some sort of interface, which will be the inputs to the function. The function may also have a set of outputs or it may have some side effect for example, if it’s a plotting function. Now the user only needs to know those inputs and what the outputs will be. This is the first level of abstraction that you might encounter, where you have to actually define and interface to that function.</li> <li>The highest level is an actual software package, which will often be a collection of functions and other things. That will be a little bit more formal because there’ll be a very specific interface or API that a user has to understand. Often for a software package there’ll be a number of convenience features for users, like documentation, examples, or tutorials that may come with it, to help the user apply the software to many different settings. A full on software package will be most general in the sense that it should be applicable to more than one setting.</li> </ol> <p>One question that you’ll find yourself asking, is at what point do you need to systematize common tasks and procedures across projects versus recreating code or writing new code from scratch on every new project? It depends on a variety of factors and answering this question may require communication within your team, and with people outside of your team. You may need to develop an understanding of how often a given process is repeated, or how often a given type of data analysis is done, in order to weigh the costs and benefits of investing in developing a software package or something similar.</p> <p>Within your team, you may want to ask yourself, “Is the data analysis you’re going to do something that you are going to build upon for future work, or is it just going to be a one shot deal?” In our experience, there are relatively few one shot deals out there. Often you will have to do a certain analysis more than once, twice, or even three times, at which point you’ve reached the threshold where you want to write some code, write some software, or at least a function. The point at which you need to systematize a given set of procedures is going to be sooner than you think it is. The initial investment for developing more formal software will be higher, of course, but that will likely pay off in time savings down the road.</p> <p>A basic rule of thumb is</p> <ul> <li>If you’re going to do something <strong>once</strong> (that does happen on occasion), just write some code and document it very well. The important thing is that you want to make sure that you understand what the code does, and so that requires both writing the code well and writing documentation. You want to be able to reproduce it down later on if you ever come back to it, or if someone else comes back to it.</li> <li>If you’re going to do something <strong>twice</strong>, write a function. This allows you to abstract a small piece of code, and it forces you to define an interface, so you have well defined inputs and outputs.</li> <li>If you’re going to do something <strong>three</strong> times or more, you should think about writing a small package. It doesn’t have to be commercial level software, but a small package which encapsulates the set of operations that you’re going to be doing in a given analysis. It’s also important to write some real documentation so that people can understand what’s supposed to be going on, and can apply the software to a different situation if they have to.</li> </ul> Disseminating reproducible research is fundamentally a language and communication problem 2016-05-13T00:00:00+00:00 http://simplystats.github.io/2016/05/13/reproducible-research-language <p>Just about 10 years ago, I wrote my <a href="http://www.ncbi.nlm.nih.gov/pubmed/16510544">first</a> of many articles about the importance of reproducible research. Since that article, one of the points I’ve made is that the key issue to resolve was one of tools and infrastructure. At the time, many people were concerned that people would not want to share data and that we had to spend a lot of energy finding ways to either compel or incentivize them to do so. But the reality was that it was difficult to answer the question “What should I do if I desperately want to make my work reproducible?” Back then, even if you could convince a clinical researcher to use R and LaTeX to create a <a href="https://en.wikipedia.org/wiki/Sweave">Sweave</a> document (!), it was not immediately obvious where they should host the document, code, and data files.</p> <p>Much has happened since then. We now have knitr and Markdown for live documents (as well as iPython notebooks and the like). We also have git, GitHub, and friends, which provide free code sharing repositories in a distributed manner (unlike older systems like CVS and Subversion). With the recent announcement of the <a href="http://www.arfon.org/announcing-the-journal-of-open-source-software">Journal of Open Source Software</a>, posting code on GitHub can now be recognized within the current system of credits and incentives. Finally, the number of <a href="http://dataverse.org">data</a> <a href="https://osf.io">repositories</a> has grown, providing more places for researchers to deposit their data files.</p> <p>Is the tools and infrastructure problem solved? I’d say yes. One thing that has become clear is that disseminating reproducible research is <strong>no longer a software problem</strong>. At least in R land, building live documents that can be executed by others is straightforward and not too difficult to pick up (thank you <a href="https://daringfireball.net/projects/markdown/">John Gruber</a>!). For other languages there many equivalent (if not better) tools for writing documents that mix code and text. For this kind of thing, there’s just no excuse anymore. Could things be optimized a bit for some edge cases? Sure, but the tools are prefectly fine for the vast majority of use cases.</p> <p>But now there is a bigger problem that needs to be solved, which is that <strong>we do not have an effective way to communicate data analyses</strong>. One might think that publishing the full code and datasets is the perfect way to communicate a data analysis, but in a way, it is too perfect. That approach can provide too much information.</p> <p>I find the following analogy useful for discussing this problem. If you look at music, one way to communicate music is to provide an audio file, a standard WAV file or something similar. In a way, that is a near-perfect representation of the music—bit-for-bit—that was produced by the performer. If I want to experience a Beethoven symphony the way that it was meant to be experienced, I’ll listen to a <a href="https://itun.es/us/TudVe?i=79443286">recording of it</a>.</p> <p>But if I want to understand how Beethoven wrote the piece—the process and the details—the recording is not a useful tool. What I look at instead is <a href="http://www.amazon.com/dp/0486260348">the score</a>. The recording is a serialization of the music. The score provides an expanded representation of the music that shows all of the different pieces and how they fit together. A person with a good ear can often reconstruct the score, but this is a difficult and time-consuming task. Better to start with what the composer wrote originally.</p> <p>Over centuries, classical music composers developed a language and system for communicating their musical ideas so that</p> <ol> <li>there was enough detail that a 3rd party could interpret the music and perform it to a level of accuracy that satisfied the composer; but</li> <li>it was not so prescriptive or constraining so that different performers could not build on the work and incorporate their own ideas</li> </ol> <p>It would seem that traditional computer code satisfies those criteria, but I don’t think so. Traditional computer code (even R code) is designed to communicate programming concepts and constructs, not to communicate data analysis constructs. For example, a <code class="highlighter-rouge">for</code> loop is not a data analysis concept, even though we may use <code class="highlighter-rouge">for</code> loops all the time in data analysis.</p> <p>Because of this disconnect between computer code and data analysis, I often find it difficult to understand what a data analysis is doing, even if it is coded very well. I imagine this is what programmers felt like when programming in processor-specific <a href="https://en.wikipedia.org/wiki/Assembly_language">assembly language</a>. Before languages like C were developed, where high-level concepts could be expressed, you had to know the gory details of how each CPU operated.</p> <p>The closest thing that I can see to a solution emerging is the work that Hadley Wickham is doing with packages like <a href="https://github.com/hadley/dplyr">dplyr</a> and <a href="https://github.com/hadley/ggplot2">ggplot2</a>. The <code class="highlighter-rouge">dplyr</code> package’s verbs (<code class="highlighter-rouge">filter</code>, <code class="highlighter-rouge">arrange</code>, etc.) represent data manipulation concepts that are meaningful to analysts. But we still have a long way to go to cover all of data analysis in this way.</p> <p>Reproducible research is important because it is fundamentally about communicating what you have done in your work. Right now we have a sub-optimal way to communicate what was done in a data analysis, via traditional computer code. I think developing a new approach to communicating data analysis could have a few benefits:</p> <ol> <li>It would provide greater transparency</li> <li>It would allow others to more easily build on what was done in an analysis by extending or modifying specific elements</li> <li>It would make it easier to understand what common elements there were across many different data analyses</li> <li>It would make it easier to teach data analysis in a systematic and scalable way</li> </ol> <p>So, any takers?</p> The Real Lesson for Data Science That is Demonstrated by Palantir's Struggles 2016-05-11T00:00:00+00:00 http://simplystats.github.io/2016/05/11/palantir-struggles <p>Buzzfeed recently published a <a href="https://www.buzzfeed.com/williamalden/inside-palantir-silicon-valleys-most-secretive-company?utm_term=.ko2PLKaMJ#.wiPxJERyA">long article</a> on the struggles of the secretive data science company, Palantir.</p> <blockquote> <p>Over the last 13 months, at least three top-tier corporate clients have walked away, including Coca-Cola, American Express, and Nasdaq, according to internal documents. Palantir mines data to help companies make more money, but clients have balked at its high prices that can exceed $1 million per month, expressed doubts that its software can produce valuable insights over time, and even experienced difficult working relationships with Palantir’s young engineers. Palantir insiders have bemoaned the “low-vision” clients who decide to take their business elsewhere.</p> </blockquote> <p>Palantir’s origins are with PayPal, and its founders are part of the <a href="https://en.wikipedia.org/wiki/PayPal_Mafia">PayPal Mafia</a>. As Peter Thiel describes it in his book <a href="https://en.wikipedia.org/wiki/Zero_to_One">Zero to One</a>, PayPal was having a lot of trouble with fraud and the FBI was getting on its case. So PayPal developed some software to monitor the millions of transacations going through its systems and to flag transactions that were suspicious. Eventually, they realized that this kind of software might be useful to government agencies in a variety of contexts and the idea for Palantir was born.</p> <p>Much of the press reaction to Buzzfeed’s article amounts to schadenfreude over the potential fall of <a href="http://simplystatistics.org/2015/10/16/thorns-runs-head-first-into-the-realities-of-diagnostic-testing/">another</a> so-called Silicon Valley unicorn. Indeed, Palentir is valued at$20 billion, a valuation only exceeded in the private markets by Airbnb and Uber. But to me, nothing in the article indicates that Palantir is necessarily more poorly run than your average startup. It looks like they are going through pretty standard growing pains, trying to scale the business and diversify the customer base. It’s not surprising to me that employees would leave at this point—going from startup to “real company” is often not that fun and just a lot of work.</p> <p>However, a key question that arises is that if Palantir is having trouble trying to scale the business, why might that be? The Buzzfeed article doesn’t contain any answers but in this post I will attempt to speculate.</p> <p>The real message from the Buzzfeed article goes beyond just Palantir and is highly relevant to the data science world. It ultimately comes down to the question of <strong>what is the value of data analysis?</strong>, and secondarily, <strong>how do you communicate that value?</strong></p> <h2 id="the-data-science-spectrum">The Data Science Spectrum</h2> <p>Data science activities live on a spectrum with <strong>software</strong> on one end and <strong>highly customized consulting</strong> on another end (I lump a lot of things into consulting, including methods development, modeling, etc.).</p> <p><img src="https://raw.githubusercontent.com/simplystats/simplystats.github.io/master/_images/DS_Spectrum2.png" alt="Data Science Spectrum" /></p> <p>The idea being that if someone comes to you with a data problem, there are two extremes that you might offer to them:</p> <ol> <li>Give them some software, some documentation, and maybe a brief tutorial on how to use the software, and then send them on their way. For example, if someone wants to see if two groups are different from each other, you could send them the <code class="highlighter-rouge">t.test()</code> function in R and explain how to use it. This could be done over email; you wouldn’t even have to talk to the person.</li> <li>Meet with the person, talk about their problem and the question they’re trying to answer, develop an analysis plan, and build a custom software solution that produces the exact output that they’re looking for.</li> </ol> <p>The first option is cheap, simple, and if you had a good enough web site, the person probably wouldn’t even have to talk with you at all! For example, I use <a href="http://hedwig.mgh.harvard.edu/sample_size/size.html">this web site</a> for sample size calculations and I’ve never spoken with the author of the web site. Much of the labor is up front, for the development of the software, and then is amortized over the life of the product. Ultimately, a software product has zero marginal cost and so it can be easily replicated and is <em>infinitely scalable</em>.</p> <p>The second option is labor intensive, time-consuming, ongoing in nature, and is only scalable to the extent that you are willing to forgo sleep and maybe bend the space-time continuum. By definition, a custom solution is unique and is difficult to replicate.</p> <h2 id="selling-data-science">Selling Data Science</h2> <p>An important question for Palantir and data scientists in general is “How do you communicate the value of data analysis?” Many people expect the result of a good data analysis to be something “surprising”, i.e. something that they didn’t already know. Because if they knew it already why bother looking at the data? Think Moneyball—if you can find that “diamond in the rough” it make spending the time to analyze the data worthwhile. But <strong>the success of a data analysis can’t depend on the results</strong>. What if you go through the data and find nothing? Is the data analysis a failure? We as data scientists can only show what the data show. Otherwise, it just becomes a recipe for telling people what they want to hear.</p> <p>It’s tempting for a client to say “well, the data didn’t show anything surprising so there’s no value there.” Also, a data analysis may reveal something that is perhaps interesting but doesn’t necessarily lead to any sort of decision. For example, there may be an aspect of a business process that is inefficient but is nevertheless unmodifiable. There may be little perceived value in discovering this with data.</p> <h3 id="what-is-useful">What is Useful?</h3> <p>Palantir apparently tried to develop a relationship with American Express, but ultimately failed.</p> <blockquote> <p>But some major firms have not found Palantir’s products and services that useful. In April 2015, employees were informed that American Express (codename: Charlie’s Angels) had dumped Palantir after 18 months of cybersecurity work, including a six-month pilot, an email shows. “We struggled from day 1 to make Palantir a sticky product for users and generate wins,” Sid Rajgarhia, a Palantir business development employee, said in the email.</p> </blockquote> <p>What does it mean for a data analysis product to be useful? It’s not necessarily clear to me in this case. Did Palantir not reveal new information? Did they not highlight something that could be modified?</p> <h3 id="lack-of-deep-expertise">Lack of Deep Expertise</h3> <p>A failed attempt attempt at working with Coke reveals some other challenges of the data science business model.</p> <blockquote> <p>The beverage giant also had other concerns [in addition to the price]. Coke “wanted deeper industry expertise in a partner,” Jonty Kelt, a Palantir executive, told colleagues in the email. He added that Coca-Cola’s “working relationship” with the youthful Palantir employees was “difficult.” The Coke executive acknowledged that the beverage giant “needs to get better at working with millennials,” according to Kelt. Coke spokesperson Scott Williamson declined to comment.</p> </blockquote> <p>Annoying millenials notwithstanding, it’s clear that Coke didn’t feel comfortable collaborating with Palantir’s personnel. Like any data science collaboration, it’s key that the data scientist have some familiarity with the domain. In many cases, having “deep expertise” in an area can give a collaborator confidence that you will focus on the things that matter to them. But developing that expertise costs money and time and it may prevent you from working with other types of clients where you will necessarily have less expertise. For example, Palantir’s long experience working with the US military and intelligence agencies gave them deep expertise in those areas, but how does that help them with a consumer products company?</p> <h3 id="harder-than-it-looks">Harder Than It Looks</h3> <p>The final example of a client that backed out is Kimberly-Clark:</p> <blockquote> <p>But Kimberly-Clark was getting cold feet by early 2016. In January, a year after the initial pilot, Kimberly-Clark executive Anthony J. Palmer said he still wasn’t ready to sign a binding contract, meeting notes show. Palmer also “confirmed our suspicion” that a primary reason Kimberly-Clark had not moved forward was that “<em>they wanted to see if they could do it cheaper themselves</em>,” Kelt told colleagues in January. [emphasis added]</p> </blockquote> <p>This is a common problem confronted by anyone in the data science business. A good analysis often looks easy in retrospect—all you did was run some functions and put the data through some models! In fact, running the models probably <em>is</em> the easy part, but getting to the point where you can actually fit models can be incredibly hard. Many clients, not seeing the long and winding process leading to a model, will be tempted think they can do it themselves.</p> <h2 id="palantirs-valuation">Palantir’s Valuation</h2> <p>Ultimately, what makes Palantir interesting is its astounding valuation. But what is the driver of this valuation? I think the key to answering this question lies in the description of the company itself:</p> <blockquote> <p>The company, based in Palo Alto, California, is essentially a hybrid software and consulting firm, placing what it calls “forward deployed engineers” on-site at client offices.</p> </blockquote> <p>What does it mean to be a “hybrid software and consulting firm”? And which one is the company more like? Consulting or software? Because ultimately, revealing which side of the spectrum Palantir is <em>really</em> on could have huge implications for its valuation and future prospects.</p> <p>Consulting companies can surely make a lot of money, but none to my knowledge have the kind of valuation that Palantir currently commands. If it turns out that every customer of Palantir’s requires a custom solution, then I think they’re likely overvalued, because that model scales poorly. On the other hand, if Palantir has genuinely figured out a way to “software-ize” data analysis and to turn it into a commodity, then they are very likely undervalued.</p> <p>Given the tremendous difficulty of turning data analysis into a software problem, my guess is that they are more akin to a consulting company and are overvalued. This is not to say that they won’t make money—they will likely make plenty—but that they won’t be the Silicon Valley darling that everyone wants them to be.</p> A means not an end - building a social media presence as a junior scientist 2016-05-10T00:00:00+00:00 http://simplystats.github.io/2016/05/10/social-media <p><em>Editor’s note - This is a chapter from my book <a href="https://leanpub.com/modernscientist">How to be a modern scientist</a> where I talk about some of the tools and techniques that scientists have available to them now that they didn’t before. 50% of all royalties from the book go to support <a href="http://www.datacarpentry.org/">Data Carpentry</a> to promote data science education.</em></p> <h2 id="social-media---what-should-i-do-and-why">Social media - what should I do and why?</h2> <p>Social media can serve a variety of roles for modern scientists. Here I am going to focus on the role of social media for working scientists whose primary focus is not on scientific communication. Something that is often missed by people who are just getting started with social media is that there are two separate components to developing a successful social media presence.</p> <p>The first is to develop a following and connections to people in your community. This is achieved through being either a content curator, a content generator, or being funny/interesting in some other way. This often has nothing to do with your scientific output.</p> <p>The second component is using your social media presence to magnify the audience for your scientific work. You can only do this if you have successfully developed a network and community in the first step. Then, when you post about your own scientific papers they will be shared.</p> <p>To most effectively achieve all of these goals you need to identify relevant communities and develop a network of individuals who follow you and will help to share your ideas and work.</p> <p><strong>Set up social media accounts and follow relevant people/journals</strong></p> <p>One of the largest academic communities has developed around Twitter, but some scientists are also using Facebook for professional purposes. If you set up a Twitter account, you should then find as many colleagues in your area of expertise that you can find and also any journals that are in your area.</p> <p><strong>Use your social media account to promote the work of other people</strong></p> <p>If you just use your social media account to post links to any papers that you publish, it will be hard to develop much of a following. It is also hard to develop a following by constantly posting long form original content such as blog posts. Alternatively you can gain a large number of followers by being (a) funny, (b) interesting, or (c) being a content curator. This latter approach can be particularly useful for people new to social media. Just follow people and journals you find interesting and share anything that you think is important/creative/exciting.</p> <p><strong>Share any work that you develop</strong></p> <p>Any code, publications, data, or blog posts you create you can share from your social media account. This can help raise your profile as people notice your good work. But if you only post your own work it is rarely possible to develop a large following unless you are already famous for another reason.</p> <h2 id="social-media---what-tools-should-i-use">Social media - what tools should I use?</h2> <p>There are a large number of social media platforms that are available to scientists. Creatively using any new social media platform if it has a large number of users can be a way to quickly jump into the consciousness of more people. That being said the two largest communities of scientists have organized around two of the largest social media platforms.</p> <ul> <li><a href="https://twitter.com/">Twitter</a> - is a platform where you can post short (less than 140 character) messages. This is a great platform for both discovering science and engaging in conversations about topics at a superficial level. It is not particularly useful for in depth scientific discussions.</li> <li><a href="https://www.facebook.com/">Facebook</a> - some scientists post longer form scientific discussions on Facebook, but the community there is somewhat less organized and people tend to use it less for professional reasons. However, sharing content on Facebook, particularly when it is of interest to a general audience, can lead to a broader engagement in your work.</li> </ul> <p>There are also a large and growing number of academic-specific social networks. For the most part these social networks are not widely used by practicing scientists and so don’t represent the best use of your time.</p> <p>You might also consider short videos on <a href="https://vine.co/">Vine</a>, longer videos on <a href="https://www.youtube.com/">Youtube</a>, more image intensive social media on <a href="https://www.tumblr.com/">Tumblr</a> or <a href="https://www.instagram.com">Instagram</a> if you have content that regularly fits those outlets. But they tend to have smaller communities of scientists with less opportunity for back and forth.</p> <h2 id="social-media---further-tips-and-issues">Social media - further tips and issues</h2> <h3 id="you-do-not-need-to-develop-original-content">You do not need to develop original content</h3> <p>Social media can be a time suck, particularly if you are spending a lot of time engaging in conversations on your platform of choice. Generating long form content in particular can take up a lot of time. But you don’t need to do that to generate a decent following. Finding the right community and then sharing work within that community and adding brief commentary and ideas can often help you develop a large following which can then be useful for other reasons.</p> <h3 id="add-your-own-commentary">Add your own commentary</h3> <p>Once you are comfortable using the social media platform of your choice you can start to engage with other people in conversation or add comments when you share other people’s work. This will increase the interest in your social media account and help you develop followers. This can be as simple as one-liners copied straight from the text of papers or posts that you think are most important.</p> <h3 id="make-online-friends---then-meet-them-offline">Make online friends - then meet them offline</h3> <p>One of the biggest advantages of scientific social media is that it levels the playing ground. Don’t be afraid to engage with members of your scientific community at all levels, from members of the National Academy (if they are online!) all the way down to junior graduate students. Getting to know a diversity of people can really help you during scientific meetings and visits. If you spend time cultivating online friendships, you’ll often meet a “familiar handle” at any conference or meeting you go to.</p> <h3 id="include-images-when-you-can">Include images when you can</h3> <p>If you see a plot from a paper you think is particularly compelling, copy it and attach it when you post/tweet when you link to the paper. On social media, images are often better received than plain text.</p> <h3 id="be-careful-of-hot-button-issues-unless-you-really-care">Be careful of hot button issues (unless you really care)</h3> <p>One thing to keep in mind on social media is the amplification of opinions. There are a large number of issues that are of extreme interest and generate really strong opinions on multiple sides. Some of these issues are common societal issues (e.g., racism, feminism, economic inequality) and some are specific to science (e.g., open access publishing, open source development). If you are starting a social media account to engage in these topics then you should definitely do that. If you are using your account primarily for scientific purposes you should consider carefully the consequences of wading into these discussions. The debates run very hot on social media and you may post what you consider to be a relatively tangential or light message on one of these topics and find yourself the center of a lot of attention (positive and negative).</p> Time Series Analysis in Biomedical Science - What You Really Need to Know 2016-05-05T00:00:00+00:00 http://simplystats.github.io/2016/05/05/timeseries-biomedical <p>For a few years now I have given a guest lecture on time series analysis in our School’s <em>Environmental Epidemiology</em> course. The basic thrust of this lecture is that you should generally ignore what you read about time series modeling, either in papers or in books. The reason is because I find much of the time series literature is not particularly helpful when doing analyses in a biomedical or population health context, which is what I do almost all the time.</p> <h2 id="prediction-vs-inference">Prediction vs. Inference</h2> <p>First, most of the literature on time series models tends to assume that you are interested in doing prediction—forecasting future values in a time series. I almost am never doing this. In my work looking at air pollution and mortality, the goal is never to find the best model that predicts mortality. In particular, if our goal were to predict mortality, we would probably <em>never include air pollution as a predictor</em>. This is because air pollution has an inherently weak association with mortality at the population, whereas things like temperature and other seasonal factors tend to have a much stronger association.</p> <p>What I <em>am</em> interested in doing is estimating an association between changes in air pollution levels and mortality and making some sort of inference about that association, either to a broader population or to other time periods. The challenges in these types of analyses include estimating weak associations in the presence of many stronger signals and appropriately adjusting for any potential confounding variables that similarly vary over time.</p> <p>The reason the distinction between prediction and inference is important is that focusing on one vs. the other can lead you to very different model building strategies. Prediction modeling strategies will always want you to include into the model factors that are strongly correlated with the outcome and explain a lot of the outcome’s variation. If you’re trying to do inference and use a prediction modeling strategy, you may make at least two errors:</p> <ol> <li>You may conclude that your key predictor of interest (e.g. air pollution) is not important because the modeling strategy didn’t deem to include it</li> <li>You may omit important potential confounders because they have a weak releationship with the outcome (but maybe have a strong relationship with your key predictor). For example, one class of potential confounders in air pollution studies is other pollutants, which tend to be weakly associated with mortality but may be strongly associated with your pollutant of interest.</li> </ol> <h2 id="random-vs-fixed">Random vs. Fixed</h2> <p>Another area where I feel much time series literature differs from my practice is on the whether to focus on fixed effects or random effects. Most of what you might think of when you think of time series models (i.e. AR models, MA models, GARCH, etc.) focuses on modeling the <em>random</em> part of the model. Often, you end up treating time series data as random because you simply do not have any other data. But the reality is that in many biomedical and public health applications, patterns in time series data can be explained by clearly understood fixed patterns.</p> <p>For example, take this time series here. It is lower at the beginning and at the end of the series, with higher level sin the middle of the period.</p> <p><img src="https://raw.githubusercontent.com/simplystats/simplystats.github.io/master/_images/ts_fixed.png" alt="Time series with seasonal pattern 1" /></p> <p>It’s possible that this time series could be modeled with an auto-regressive (AR) model or maybe an auto-regressive moving average (ARMA) model. Or it’s possible that the data are exhibiting a seasonal pattern. It’s impossible to tell from the data whether this is a random formulation of this pattern or whether it’s something you’d expect every time. The problem is that we usually onl have <em>one observation</em> from teh time series. That is, we observe the entire series only once.</p> <p>Now take a look at this time series. It exhibits some of the same properties as the first series: it’s low at the beginning and end and high in the middle.</p> <p><img src="https://raw.githubusercontent.com/simplystats/simplystats.github.io/master/_images/ts_random.png" alt="Time series with seasonal pattern 2" /></p> <p>Should we model this as a random process or as a process with a fixed pattern? That ultimately will depend on the what type of data this is and what we know about it. If it’s air pollution data, we might do one thing, but if it’s stock market data, we might do a totally different thing.</p> <p>If one were to see replicates of the time series, we’d be able to resolve the fixed vs. random question. For example, because I simulated the data above, I can simulate another replicate and see what happens. In the plot below I show two replications from each of the processes.</p> <p><img src="https://raw.githubusercontent.com/simplystats/simplystats.github.io/master/_images/ts_both.png" alt="Fixed and random time series patterns" /></p> <p>It’s clear now that the time series on the top row has a fixed “seasonal” pattern while the time series on the bottom row is random (in fact it is simulated from an AR(1) model).</p> <p>The point here is that I think very often we know things about the time series that we’re modeling that we know introduced fixed variation into the data: seasonal patterns, day-of-week effects, and long-term trends. Furthermore, there may be other time-varying covariates that can help predict whatever time series we’re modeling and can be put into the fixed part of the model (a.k.a regression modeling). Ultimately, when many of these fixed components are accounted for, there’s relatively little of interest left in the residuals.</p> <h2 id="what-to-model">What to Model?</h2> <p>So the question remains: What should I do? The short answer is to try to incorporate everything that you know about the data into the fixed/regression part of the model. Then take a look at the residuals and see if you still care.</p> <p>Here’s a quick example from my work in air pollution and mortality. The data are all-cause mortality and PM10 pollution from Detroit for the years 1987–2000. The question is whether daily mortaliy is associated with daily changes in ambient PM10 levels. We can try to answer that with a simple linear regression model:</p> <div class="highlighter-rouge"><pre class="highlight"><code>Call: lm(formula = death ~ pm10, data = ds) Residuals: Min 1Q Median 3Q Max -26.978 -5.559 -0.386 5.109 34.022 Coefficients: Estimate Std. Error t value Pr(&gt;|t|) (Intercept) 46.978826 0.112284 418.394 &lt;2e-16 pm10 0.004885 0.001936 2.523 0.0117 Residual standard error: 8.03 on 5112 degrees of freedom Multiple R-squared: 0.001244, Adjusted R-squared: 0.001049 F-statistic: 6.368 on 1 and 5112 DF, p-value: 0.01165 </code></pre> </div> <p>PM10 appears to be positively associated with mortality, but when we look at the autocorrelation function of the residuals, we see</p> <p><img src="https://raw.githubusercontent.com/simplystats/simplystats.github.io/master/_images/2016-05-05-timeseries-biomedical_files/figure-html/unnamed-chunk-3-1.png" alt="ACF1" /></p> <p>If we see a seasonal-like pattern in the auto-correlation function, then that means there’s a seasonal pattern in the residuals as well. Not good.</p> <p>But okay, we can just model the seasonal component with an indicator of the season.</p> <div class="highlighter-rouge"><pre class="highlight"><code>Call: lm(formula = death ~ season + pm10, data = ds) Residuals: Min 1Q Median 3Q Max -25.964 -5.087 -0.242 4.907 33.884 Coefficients: Estimate Std. Error t value Pr(&gt;|t|) (Intercept) 50.830458 0.215679 235.676 &lt; 2e-16 seasonQ2 -4.864167 0.304838 -15.957 &lt; 2e-16 seasonQ3 -6.764404 0.304346 -22.226 &lt; 2e-16 seasonQ4 -3.712292 0.302859 -12.258 &lt; 2e-16 pm10 0.009478 0.001860 5.097 0.000000358 Residual standard error: 7.649 on 5109 degrees of freedom Multiple R-squared: 0.09411, Adjusted R-squared: 0.09341 F-statistic: 132.7 on 4 and 5109 DF, p-value: &lt; 2.2e-16 </code></pre> </div> <p>Note that the coefficient for PM10, the coefficient of real interest, gets a little bigger when we add the seasonal component.</p> <p>When we look at the residuals now, we see</p> <p><img src="https://raw.githubusercontent.com/simplystats/simplystats.github.io/master/_images/2016-05-05-timeseries-biomedical_files/figure-html/unnamed-chunk-5-1.png" alt="ACF2" /></p> <p>The seasonal pattern is gone, but we see that there’s positive autocorrelation at seemingly long distances (~100s of days). This is usually an indicator that there’s some sort of long-term trend in the data. Since we only care about the day-to-day changes in PM10 and mortality, it would make sense to remove any such long-term trend. I can do that by just including the date as a linear predictor.</p> <div class="highlighter-rouge"><pre class="highlight"><code> Call: lm(formula = death ~ season + date + pm10, data = ds) Residuals: Min 1Q Median 3Q Max -23.407 -5.073 -0.375 4.718 32.179 Coefficients: Estimate Std. Error t value Pr(&gt;|t|) (Intercept) 60.04317325 0.64858433 92.576 &lt; 2e-16 seasonQ2 -4.76600268 0.29841993 -15.971 &lt; 2e-16 seasonQ3 -6.56826695 0.29815323 -22.030 &lt; 2e-16 seasonQ4 -3.42007191 0.29704909 -11.513 &lt; 2e-16 date -0.00106785 0.00007108 -15.022 &lt; 2e-16 pm10 0.00933871 0.00182009 5.131 0.000000299 Residual standard error: 7.487 on 5108 degrees of freedom Multiple R-squared: 0.1324, Adjusted R-squared: 0.1316 F-statistic: 156 on 5 and 5108 DF, p-value: &lt; 2.2e-16 </code></pre> </div> <p>Now we can look at the autocorrelation function one last time.</p> <p><img src="https://raw.githubusercontent.com/simplystats/simplystats.github.io/master/_images/2016-05-05-timeseries-biomedical_files/figure-html/unnamed-chunk-7-1.png" alt="ACF3" /></p> <p>The ACF trails to zero reasonably quickly now, but there’s still some autocorrelation at short lags up to about 15 days or so.</p> <p>Now we can engage in some traditional time series modeling. We might want to model the residuals with an auto-regressive model over order <em>p</em>. What should <em>p</em> be? We can check by looking at the partial autocorrelation function (PACF).</p> <p><img src="https://raw.githubusercontent.com/simplystats/simplystats.github.io/master/_images/2016-05-05-timeseries-biomedical_files/figure-html/unnamed-chunk-8-1.png" alt="PACF" /></p> <p>The PACF seems to suggest we should fit an AR(6) or AR(7) model. Let’s use an AR(6) model and see how things look. We can use the <code class="highlighter-rouge">arima()</code> function for that.</p> <div class="highlighter-rouge"><pre class="highlight"><code> Call: arima(x = y, order = c(6, 0, 0), xreg = m, include.mean = FALSE) Coefficients: ar1 ar2 ar3 ar4 ar5 ar6 (Intercept) 0.0869 0.0933 0.0733 0.0454 0.0377 0.0489 59.8179 s.e. 0.0140 0.0140 0.0141 0.0141 0.0140 0.0140 1.0300 seasonQ2 seasonQ3 seasonQ4 date pm10 -4.4635 -6.2778 -3.2878 -0.0011 0.0096 s.e. 0.4569 0.4624 0.4546 0.0001 0.0018 sigma^2 estimated as 53.69: log likelihood = -17441.84, aic = 34909.69 </code></pre> </div> <p>Note that the coefficient for PM10 hasn’t changed much from our initial models. The usual concern with not accounting for residual autocorrelation is that the variance/standard error of the coefficient of interest will be affected. In this case, there does not appear to be much of a difference between using the AR(6) to account for the residual autocorrelation and ignoring it altogether. Here’s a comparison of the standard errors for each coefficient.</p> <div class="highlighter-rouge"><pre class="highlight"><code> Naive AR model (Intercept) 0.648584 1.030007 seasonQ2 0.298420 0.456883 seasonQ3 0.298153 0.462371 seasonQ4 0.297049 0.454624 date 0.000071 0.000114 pm10 0.001820 0.001819 </code></pre> </div> <p>The standard errors for the <code class="highlighter-rouge">pm10</code> variable are almost identical, while the standard errors for the other variables are somewhat bigger in the AR model.</p> <h2 id="conclusion">Conclusion</h2> <p>Ultimately, I’ve found that in many biomedical and public health applications, time series modeling is very different from what I read in the textbooks. The key takeaways are:</p> <ol> <li> <p>Make sure you know if you’re doing <strong>prediction</strong> or <strong>inference</strong>. Most often you will be doing inference, in which case your modeling strategies will be quite different.</p> </li> <li> <p>Focus separately on the <strong>fixed</strong> and <strong>random</strong> parts of the model. In particular, work with the fixed part of the model first, incorporating as much information as you can that will explain variability in your outcome.</p> </li> <li> <p>Model the random part appropriately, after incorporating as much as you can into the fixed part of the model. Classical time series models may be of use here, but also simple robust variance estimators may be sufficient.</p> </li> </ol> Not So Standard Deviations Episode 15 - Spinning Up Logistics 2016-05-04T00:00:00+00:00 http://simplystats.github.io/2016/05/04/nssd-episode-15 <p>This is Hilary’s and my last New York-Baltimore episode! In future episodes, Hilary will be broadcasting from California. In this episode we discuss collaboration tools and workflow management for data science projects. To date, I have not found a project management tool that I can actually use (besides email), but am open to suggestions (from students).</p> <p>If you have questions you’d like us to answer, you can send them to nssdeviations @ gmail.com or tweet us at <a href="https://twitter.com/nssdeviations">@NSSDeviations</a>.</p> <p><a href="https://itunes.apple.com/us/podcast/not-so-standard-deviations/id1040614570">Subscribe to the podcast on iTunes</a>.</p> <p>Please <a href="https://itunes.apple.com/us/podcast/not-so-standard-deviations/id1040614570">leave us a review on iTunes</a>!</p> <p>Support us through our <a href="https://www.patreon.com/NSSDeviations?ty=h">Patreon page</a>.</p> <p>Show notes:</p> <ul> <li> <p><a href="http://twitter.com/hspter/status/725411087110299649">Hilary’s tweet on cats</a></p> </li> <li> <p><a href="http://www.etsy.com/listing/185113916/…mug-coffee-cup-tea">Awesome vs. cats mug</a></p> </li> <li> <p><a href="http://math.mit.edu/~urschel/">John Urschel’s web page</a></p> </li> <li> <p><a href="http://www.ams.org/publications/journa…1602/rnoti-p148.pdf">Profile of John Urschel by the AMS</a></p> </li> <li> <p><a href="http://en.wikipedia.org/wiki/Frank_Ryan_…merican_football">The other NFL player/mathematician</a>)</p> </li> <li> <p><a href="http://guides.github.com/introduction/flow/">GitHub flow</a></p> </li> <li> <p><a href="http://www.theinformation.com/articles/why-…a-product-fix">Problems with Slack</a></p> </li> <li> <p><a href="http://www.astronomy.ohio-state.edu/~pogge/Ast…5/gps.html">Relativity and GPS</a></p> </li> <li> <p><a href="http://www.theinformation.com/become-a-data…e-information">The Information is looking for a Data Storyteller</a></p> </li> <li> <p><a href="http://www.stitchfix.com/careers?gh_jid=1…46?gh_jid=169746">Stitch Fix is looking for Data Scientists</a></p> </li> </ul> <p><a href="https://soundcloud.com/nssd-podcast/nssd-episode-15-spinning-up-logistics">Download the audio for this episode.</a></p> <iframe width="100%" height="166" scrolling="no" frameborder="no" src="https://w.soundcloud.com/player/?url=https%3A//api.soundcloud.com/tracks/261374684&amp;color=ff5500&amp;auto_play=false&amp;hide_related=false&amp;show_comments=true&amp;show_user=true&amp;show_reposts=false"></iframe> High school student builds interactive R class for the intimidated with the JHU DSL 2016-04-27T00:00:00+00:00 http://simplystats.github.io/2016/04/27/r-intimidated <p>Annika Salzberg is currently a biology undergraduate at Haverford College majoring in biology. While in high-school here in Baltimore she developed and taught an R class to her classmates at the <a href="http://www.parkschool.net/">Park School</a>. Her interest in R grew out of a project where she and her fellow students and teachers went to the Canadian sub-Arctic to collect data on permafrost depth and polar bears. When analyzing the data she learned R (with the help of a teacher) to be able to do the analyses, some of which she did on her laptop while out in the field.</p> <p>Later she worked on developing a course that she felt was friendly and approachable enough for her fellow high-schoolers to benefit. With the help of Steven Salzberg and the folks here at the JHU DSL, she built a class she calls <a href="https://www.datacamp.com/courses/r-for-the-intimidated">R for the intimidated</a> which just launched on <a href="https://www.datacamp.com/courses/r-for-the-intimidated">DataCamp</a> and you can take for free!</p> <p>The class is a great introduction for people who are just getting started with R. It walks through R/Rstudio, package installation, data visualization, data manipulation, and a final project. We are super excited about the content that Annika created working here at Hopkins and think you should go check it out!</p> An update on Georgia Tech's MOOC-based CS degree 2016-04-27T00:00:00+00:00 http://simplystats.github.io/2016/04/27/georgia-tech-mooc-program <p><a href="https://www.insidehighered.com/news/2016/04/27/georgia-tech-plans-next-steps-online-masters-degree-computer-science?utm_source=Inside+Higher+Ed&amp;utm_campaign=d373e33023-DNU20160427&amp;utm_medium=email&amp;utm_term=0_1fcbc04421-d373e33023-197601005#.VyCmdfkGRPU.mailto">This article</a> in Inside Higher Ed discusses next steps for Georgia Tech’s ground-breaking low-cost CS degree based on MOOCs run by Udacity. With Sebastian Thrun <a href="http://blog.udacity.com/2016/04/udacity-has-a-new-___.html">stepping down</a> as CEO at Udacity, it seems both Georgia Tech and Udacity might be moving into a new phase.</p> <p>One fact that surprised me about the Georgia Tech program concerned the demographics:</p> <blockquote> <p>Once the first applications for the online program arrived, Georgia Tech was surprised by how the demographics differed from the applications to the face-to-face program. The institute’s face-to-face cohorts tend to have more men than women and international students than U.S. citizens or residents. Applications to the online program, however, came overwhelmingly from students based in the U.S. (80 percent). The gender gap was even larger, with nearly nine out of 10 applications coming from men.</p> </blockquote> Write papers like a modern scientist (use Overleaf or Google Docs + Paperpile) 2016-04-21T00:00:00+00:00 http://simplystats.github.io/2016/04/21/writing <p><em>Editor’s note - This is a chapter from my book <a href="https://leanpub.com/modernscientist">How to be a modern scientist</a> where I talk about some of the tools and techniques that scientists have available to them now that they didn’t before.</em></p> <h2 id="writing---what-should-i-do-and-why">Writing - what should I do and why?</h2> <p><strong>Write using collaborative software to avoid version control issues.</strong></p> <p>On almost all modern scientific papers you will have co-authors. The traditional way of handling this was to create a single working document and pass it around. Unfortunately this system always results in a long collection of versions of a manuscript, which are often hard to distinguish and definitely hard to synthesize.</p> <p>An alternative approach is to use formal version control systems like <a href="https://git-scm.com/book/en/v2/Getting-Started-About-Version-Control">Git</a> and <a href="https://github.com/">Github</a>. However, the overhead for using these systems can be pretty heavy for paper authoring. They also require all parties participating in the writing of the paper to be familiar with version control and the command line. Alternative paper authoring tools are now available that provide some of the advantages of version control without the major overhead involved in using base version control systems.</p> <p><img src="https://imgs.xkcd.com/comics/documents.png" alt="The usual result of file naming by a group (image via https://xkcd.com/1459/)" /></p> <p><strong>Make figures the focus of your writing</strong></p> <p>Once you have a set of results and are ready to start writing up the paper the first thing is <em>not to write</em>. The first thing you should do is create a set of 1-10 publication-quality plots with 3-4 as the central focus (see Chapter 10 <a href="http://leanpub.com/datastyle">here</a> for more information on how to make plots). Show these to someone you trust to make sure they “get” your story before proceeding. Your writing should then be focused around explaining the story of those plots to your audience. Many people, when reading papers, read the title, the abstract, and then usually jump to the figures. If your figures tell the whole story you will dramatically increase your audience. It also helps you to clarify what you are writing about.</p> <p><strong>Write clearly and simply even though it may make your papers harder to publish</strong>.</p> <p>Learn how to write papers in a very clear and simple style. Whenever you can write in plain English and make the approach you are using understandable and clear. This can (sometimes) make it harder to get your papers into journals. Referees are trained to find things to criticize and by simplifying your argument they will assume that what you have done is easy or just like what has been done before. This can be extremely frustrating during the peer review process. But the peer review process isn’t the end goal of publishing! The point of publishing is to communicate your results to your community and beyond so they can use them. Simple, clear language leads to much higher use/reading/citation/impact of your work.</p> <p><strong>Include links to code, data, and software in your writing</strong></p> <p>Not everyone recognizes the value of re-analysis, scientific software, or data and code sharing. But it is the fundamental cornerstone of the modern scientific process to make all of your materials easily accessible, re-usable and checkable. Include links to data, code, and software prominently in your abstract, introduction and methods and you will dramatically increase the use and impact of your work.</p> <p><strong>Give credit to others</strong></p> <p>In academics the main currency we use is credit for publication. In general assigning authorship and getting credit can be a very tricky component of the publication process. It is almost always better to err on the side of offering credit. A very useful test that my advisor <a href="http://www.genomine.org/">John Storey</a> taught me is if you are embarrassed to explain the authorship credit to anyone who was on the paper or not on the paper, then you probably haven’t shared enough credit.</p> <h2 id="writing---what-tools-should-i-use">Writing - what tools should I use?</h2> <h3 id="wysiwyg-software-google-docs-and-paperpile">WYSIWYG software: Google Docs and Paperpile.</h3> <p>This system uses <a href="https://www.google.com/docs/about/">Google Docs</a> for writing and <a href="https://paperpile.com/app">Paperpile</a> for reference management. If you have a Google account you can easily create documents and share them with your collaborators either privately or publicly. Paperpile allows you to search for academic articles and insert references into the text using a system that will be familiar if you have previously used <a href="http://endnote.com/">Endnote</a> and <a href="https://products.office.com/en-us/word">Microsoft Word</a>.</p> <p>This system has the advantage of being a what you see is what you get system - anyone with basic text processing skills should be immediately able to contribute. Google Docs also automatically saves versions of your work so that you can flip back to older versions if someone makes a mistake. You can also easily see which part of the document was written by which person and add comments.</p> <p><em>Getting started</em></p> <ol> <li>Set up accounts with <a href="https://accounts.google.com/SignUp">Google</a> and with <a href="https://paperpile.com/">Paperpile</a>. If you are an academic the Paperpile account will cost $2.99 a month, but there is a 30 day free trial.</li> <li>Go to <a href="https://docs.google.com/document/u/0/">Google Docs</a> and create a new document.</li> <li>Set up the <a href="https://paperpile.com/blog/free-google-docs-add-on/">Paperpile add-on for Google Docs</a></li> <li>In the upper right hand corner of the document, click on the <em>Share</em> link and share the document with your collaborators</li> <li>Start editing</li> <li>When you want to include a reference, place the cursor where you want the reference to go, then using the <em>Paperpile</em> menu, choose insert citation. This should give you a search box where you can search by Pubmed ID or on the web for the reference you want.</li> <li>Once you have added some references use the <em>Citation style</em> option under <em>Paperpile</em> to pick the citation style for the journal you care about.</li> <li>Then use the <em>Format citations</em> option under <em>Paperpile</em> to create the bibliography at the end of the document</li> </ol> <p>The nice thing about using this system is that everyone can easily directly edit the document simultaneously - which reduces conflict and difficulty of use. A disadvantage is getting the formatting just right for most journals is nearly impossible, so you will be sending in a version of your paper that is somewhat generic. For most journals this isn’t a problem, but a few journals are sticklers about this.</p> <h3 id="typesetting-software-overleaf-or-sharelatex">Typesetting software: Overleaf or ShareLatex</h3> <p>An alternative approach is to use typesetting software like Latex. This requires a little bit more technical expertise since you need to understand the Latex typesetting language. But it allows for more precise control over what the document will look like. Using Latex on its own you will have many of the same issues with version control that you would have for a word document. Fortunately there are now “Google Docs like” solutions for editing latex code that are readily available. Two of the most popular are <a href="https://www.overleaf.com/">Overleaf</a> and <a href="https://www.sharelatex.com/">ShareLatex</a>.</p> <p>In either system you can create a document, share it with collaborators, and edit it in a similar manner to a Google Doc, with simultaneous editing. Under both systems you can save versions of your document easily as you move along so you can quickly return to older versions if mistakes are made.</p> <p>I have used both kinds of software, but now primarily use Overleaf because they have a killer feature. Once you have finished writing your paper you can directly submit it to some preprint servers like <a href="http://arxiv.org/">arXiv</a> or <a href="http://biorxiv.org/">biorXiv</a> and even some journals like <a href="https://peerj.com">Peerj</a> or <a href="http://f1000research.com/">f1000research</a>.</p> <p><em>Getting started</em></p> <ol> <li>Create an <a href="https://www.overleaf.com/signup">Overleaf account</a>. There is a free version of the software. Paying$8/month will give you easy saving to Dropbox.</li> <li>Click on <em>New Project</em> to create a new document and select from the available templates</li> <li>Open your document and start editing</li> <li>Share with colleagues by clicking on the <em>Share</em> button within the project. You can share either a read only version or a read and edit version.</li> </ol> <p>Once you have finished writing your document you can click on the <em>Publish</em> button to automatically submit your paper to the available preprint servers and journals. Or you can download a pdf version of your document and submit it to any other journal.</p> <h2 id="writing---further-tips-and-issues">Writing - further tips and issues</h2> <h3 id="when-to-write-your-first-paper">When to write your first paper</h3> <p>As soon as possible! The purpose of graduate school is (in some order):</p> <ul> <li>Freedom</li> <li>Time to discover new knowledge</li> <li>Time to dive deep</li> <li>Opportunity for leadership</li> <li>Opportunity to make a name for yourself <ul> <li>R packages</li> <li>Papers</li> <li>Blogs</li> </ul> </li> <li>Get a job</li> </ul> <p>The first couple of years of graduate school are typically focused on (1) teaching you all the technical skills you need and (2) data dumping as much hard-won practical experience from more experienced people into your head as fast as possible.</p> <p>After that one of your main focuses should be on establishing your own program of research and reputation. Especially for Ph.D. students it can not be emphasized enough <em>no one will care about your grades in graduate school but everyone will care what you produced</em>. See for example, Sherri’s excellent <a href="http://drsherrirose.com/academic-cvs-for-statistical-science-faculty-positions">guide on CV’s for academic positions</a>.</p> <p>I firmly believe that <a href="http://simplystatistics.org/2013/01/23/statisticians-and-computer-scientists-if-there-is-no-code-there-is-no-paper/">R packages</a> and blog posts can be just as important as papers, but the primary signal to most traditional academic communities still remains published peer-reviewed papers. So you should get started on writing them as soon as you can (definitely before you feel comfortable enough to try to write one).</p> <p>Even if you aren’t going to be in academics, papers are a great way to show off that you can (a) identify a useful project, (b) finish a project, and (c) write well. So the first thing you should be asking when you start a project is “what paper are we working on?”</p> <h3 id="what-is-an-academic-paper">What is an academic paper?</h3> <p>A scientific paper can be distilled into four parts:</p> <ol> <li>A set of methodologies</li> <li>A description of data</li> <li>A set of results</li> <li>A set of claims</li> </ol> <p>When you (or anyone else) writes a paper the goal is to communicate clearly items 1-3 so that they can justify the set of claims you are making. Before you can even write down 4 you have to do 1-3. So that is where you start when writing a paper.</p> <h3 id="how-do-you-start-a-paper">How do you start a paper?</h3> <p>The first thing you do is you decide on a problem to work on. This can be a problem that your advisor thought of or it can be a problem you are interested in, or a combination of both. Ideally your first project will have the following characteristics:</p> <ol> <li>Concrete</li> <li>Solves a scientific problem</li> <li>Gives you an opportunity to learn something new</li> <li>Something you feel ownership of</li> <li>Something you want to work on</li> </ol> <p>Points 4 and 5 can’t be emphasized enough. Others can try to help you come up with a problem, but if you don’t feel like it is <em>your</em> problem it will make writing the first paper a total slog. You want to find an option where you are just insanely curious to know the answer at the end, to the point where you <em>just have to figure it out</em> and kind of don’t care what the answer is. That doesn’t always happen, but that makes the grind of writing papers go down a lot easier.</p> <p>Once you have a problem the next step is to actually do the research. I’ll leave this for another guide, but the basic idea is that you want to follow the usual <a href="https://leanpub.com/datastyle/">data analytic process</a>:</p> <ol> <li>Define the question</li> <li>Get/tidy the data</li> <li>Explore the data</li> <li>Build/borrow a model</li> <li>Perform the analysis</li> <li>Check/critique results</li> <li>Write things up</li> </ol> <p>The hardest part for the first paper is often knowing where to stop and start writing.</p> <h3 id="how-do-you-know-when-to-start-writing">How do you know when to start writing?</h3> <p>Sometimes this is an easy question to answer. If you started with a very concrete question at the beginning then once you have done enough analysis to convince yourself that you have the answer to the question. If the answer to the question is interesting/surprising then it is time to stop and write.</p> <p>If you started with a question that wasn’t so concrete then it gets a little trickier. The basic idea here is that you have convinced yourself you have a result that is worth reporting. Usually this takes the form of between 1 and 5 figures that show a coherent story that you could explain to someone in your field.</p> <p>In general one thing you should be working on in graduate school is your own internal timer that tells you, “ok we have done enough, time to write this up”. I found this one of the hardest things to learn, but if you are going to stay in academics it is a critical skill. There are rarely deadlines for paper writing (unless you are submitting to CS conferences) so it will eventually be up to you when to start writing. If you don’t have a good clock, this can really slow down your ability to get things published and promoted in academics.</p> <p>One good principle to keep in mind is “the perfect is the enemy of the very good” Another one is that a published paper in a respectable journal beats a paper you just never submit because you want to get it into the “best” journal.</p> <h3 id="a-note-on-negative-results">A note on “negative results”</h3> <p>If the answer to your research problem isn’t interesting/surprising but you started with a concrete question it is also time to stop and write. But things often get more tricky with this type of paper as most journals when reviewing papers filter for “interest” so sometimes a paper without a really “big” result will be harder to publish. <strong>This is ok!!</strong> Even though it may take longer to publish the paper, it is important to publish even results that aren’t surprising/novel. I would much rather that you come to an answer you are comfortable with and we go through a little pain trying to get it published than you keep pushing until you get an “interesting” result, which may or may not be justifiable.</p> <h3 id="how-do-you-start-writing">How do you start writing?</h3> <ol> <li>Once you have a set of results and are ready to start writing up the paper the first thing is <em>not to write</em>. The first thing you should do is create a set of 1-4 publication-quality plots (see Chapter 10 <a href="http://leanpub.com/datastyle">here</a>). Show these to someone you trust to make sure they “get” your story before proceeding.</li> <li>Start a project on <a href="https://www.overleaf.com/">Overleaf</a> or <a href="https://www.google.com/docs/about/">Google Docs</a>.</li> <li>Write up a story around the four plots in the simplest language you feel you can get away with, while still reporting all of the technical details that you can.</li> <li>Go back and add references in only after you have finished the whole first draft.</li> <li>Add in additional technical detail in the supplementary material if you need it.</li> <li>Write up a reproducible version of your code that returns exactly the same numbers/figures in your paper with no input parameters needed.</li> </ol> <h3 id="what-are-the-sections-in-a-paper">What are the sections in a paper?</h3> <p>Keep in mind that most people will read the title of your paper only, a small fraction of those people will read the abstract, a small fraction of those people will read the introduction, and a small fraction of those people will read your whole paper. So make sure you get to the point quickly!</p> <p>The sections of a paper are always some variation on the following:</p> <ol> <li><strong>Title</strong>: Should be very short, no colons if possible, and state the main result. Example, “A new method for sequencing data that shows how to cure cancer”. Here you want to make sure people will read the paper without overselling your results - this is a delicate balance.</li> <li><strong>Abstract</strong>: In (ideally) 4-5 sentences explain (a) what problem you are solving, (b) why people should care, (c) how you solved the problem, (d) what are the results and (e) a link to any data/resources/software you generated.</li> <li><strong>Introduction</strong>: A more lengthy (1-3 pages) explanation of the problem you are solving, why people should care, and how you are solving it. Here you also review what other people have done in the area. The most critical thing is never underestimate how little people know or care about what you are working on. It is your job to explain to them why they should.</li> <li><strong>Methods</strong>: You should state and explain your experimental procedures, how you collected results, your statistical model, and any strengths or weaknesses of your proposed approach.</li> <li><strong>Comparisons (for methods papers)</strong>: Compare your proposed approach to the state of the art methods. Do this with simulations (where you know the right answer) and data you haven’t simulated (where you don’t know the right answer). If you can base your simulation on data, even better. Make sure you are <a href="http://simplystatistics.org/2013/03/06/the-importance-of-simulating-the-extremes/">simulating both the easy case (where your method should be great) and harder cases where your method might be terrible</a>.</li> <li><strong>Your analysis</strong>: Explain what you did, what data you collected, how you processed it and how you analysed it.</li> <li><strong>Conclusions</strong>: Summarize what you did and explain why what you did is important one more time.</li> <li><strong>Supplementary Information</strong>: If there are a lot of technical computational, experimental or statistical details, you can include a supplement that has all of the details so folks can follow along. As far as possible, try to include the detail in the main text but explained clearly.</li> </ol> <p>The length of the paper will depend a lot on which journal you are targeting. In general the shorter/more concise the better. But unless you are shooting for a really glossy journal you should try to include the details in the paper itself. This means most papers will be in the 4-15 page range, but with a huge variance.</p> <p><em>Note</em>: Part of this chapter appeared in the <a href="https://github.com/jtleek/firstpaper">Leek group guide to writing your first paper</a></p> As a data analyst the best data repositories are the ones with the least features 2016-04-20T00:00:00+00:00 http://simplystats.github.io/2016/04/20/data-repositories <p>Lately, for a range of projects I have been working on I have needed to obtain data from previous publications. There is a growing list of data repositories where data is made available. General purpose data sharing sites include:</p> <ul> <li>The <a href="https://osf.io/">open science framework</a></li> <li>The <a href="https://dataverse.harvard.edu/">Harvard Dataverse</a></li> <li><a href="https://figshare.com/">Figshare</a></li> <li><a href="https://datadryad.org/">Datadryad</a></li> </ul> <p>There are also a host of field-specific data sharing sites.One thing that I find a little frustrating about these sites is that they add a lot of bells and whistles. For example I wanted to download a <a href="https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/6FMTT3">p-value data set</a> from Dataverse (just to pick on one, but most repositories have similar issues). I go to the page and click <code class="highlighter-rouge">Download</code> on the data set I want.</p> <p><img src="https://raw.githubusercontent.com/simplystats/simplystats.github.io/master/_images/2016-04-20/dataverse1.png" alt="Downloading a dataverse paper " /></p> <p>Then I have to accept terms:</p> <p>Then I have to <img src="https://raw.githubusercontent.com/simplystats/simplystats.github.io/master/_images/2016-04-20/dataverse2.png" alt="Downloading a dataverse paper part 2 " /></p> <p>Then the data set is downloaded. But it comes from a button that doesn’t allow me to get the direct link. There is an <a href="https://github.com/ropensci/dvn">R package</a> that you can use to download dataverse data, but again not with direct links to the data sets.</p> <p>This is a similar system to many data repositories where there is a multi-step process to downloading data rather than direct links.</p> <p>But as a data analyst I often find that I want:</p> <ul> <li>To be able to find a data set with some minimal search terms</li> <li>Find the data set in .csv or tab delimited format via a direct link</li> <li>Have the data set be available both as raw and processed versions</li> <li>The processed version will either be one or many <a href="https://www.jstatsoft.org/article/view/v059i10">tidy data sets</a>.</li> </ul> <p>As a data analyst I would rather have all of the data stored as direct links and ideally as csv files. Then you don’t need to figure out a specialized package, an API, or anything else. You just use <code class="highlighter-rouge">read.csv</code> directly using the URL in R and you are off to the races. It also makes it easier to point to which data set you are using. So I find the best data repositories are the ones with the least features.</p> Junior scientists - you don't have to publish in open access journals to be an open scientist. 2016-04-11T00:00:00+00:00 http://simplystats.github.io/2016/04/11/publishing <p><em>Editor’s note - This is a chapter from my book <a href="https://leanpub.com/modernscientist">How to be a modern scientist</a> where I talk about some of the tools and techniques that scientists have available to them now that they didn’t before.</em></p> <h2 id="publishing---what-should-i-do-and-why">Publishing - what should I do and why?</h2> <p>A modern scientific writing process goes as follows.</p> <ol> <li>You write a paper</li> <li>You post a preprint a. Everyone can read and comment</li> <li>You submit it to a journal</li> <li>It is peer reviewed privately</li> <li>The paper is accepted or rejected a. If rejected go back to step 2 and start over b. If accepted it will be published</li> </ol> <p>You can take advantage of modern writing and publishing tools to handle several steps in the process.</p> <p><strong>Post preprints of your work</strong></p> <p>Once you have finished writing you paper, you want to share it with others. Historically, this involved submitting the paper to a journal, waiting for reviews, revising the paper, resubmitting, and eventually publishing it. There is now very little reason to wait that long for your paper to appear in print. Generally you can post a paper to a preprint server and have it appear in 1-2 days. This is a dramatic improvement on the weeks or months it takes for papers to appear in peer reviewed journals even under optimal conditions. There are several advantages to posting preprints.</p> <ul> <li>Preprints establish precedence for your work so it reduces your risk of being scooped.</li> <li>Preprints allow you to collect feedback on your work and improve it quickly.</li> <li>Preprints can help you to get your work published in formal academic journals.</li> <li>Preprints can get you attention and press for your work.</li> <li>Preprints give junior scientists and other researchers gratification that helps them handle the stress and pressure of their first publications.</li> </ul> <p>The last point is underappreciated and was first pointed out to me by <a href="http://giladlab.uchicago.edu/">Yoav Gilad</a> It takes a really long time to write a scientific paper. For a student publishing their first paper, the first feedback they get is often (a) delayed by several months and (b) negative and in the form of a referee report. This can have a major impact on the motivation of those students to keep working on projects. Preprints allow students to have an immediate product they can point to as an accomplishment, allow them to get positive feedback along with constructive or negative feedback, and can ease the pain of difficult referee reports or rejections.</p> <p><strong>Choose the journal that maximizes your visibility</strong></p> <p>You should try to publish your work in the best journals for your field. There are a couple of reasons for this. First, being a scientist is both a calling and a career. To advance your career, you need visibilty among your scientific peers and among the scientists who will be judging you for grants and promotions. The best place to do this is by publishing in the top journals in your field. The important thing is to do your best to do good work and submit it to these journals, even if the results aren’t the most “sexy”. Don’t adapt your workflow to the journal, but don’t ignore the career implications either. Do this even if the journals are closed source. There are ways to make your work accessible and you will both raise your profile and disseminate your results to the broadest audience.</p> <p><strong>Share your work on social media</strong></p> <p>Academic journals are good for disseminating your work to the appropriate scientific community. As a modern scientist you have other avenues and other communities - like the general public - that you would like to reach with your work. Once your paper has been published in a preprint or in a journal, be sure to share your work through appropriate social media channels. This will also help you develop facility in coming up with one line or one figure that best describes what you think you have published so you can share it on social media sites like Twitter.</p> <h3 id="preprints-and-criticism">Preprints and criticism</h3> <p>See the section on scientific blogging for how to respond to criticism of your preprints online.</p> <h2 id="publishing---what-tools-should-i-use">Publishing - what tools should I use?</h2> <h3 id="preprint-servers">Preprint servers</h3> <p>Here are a few preprint servers you can use.</p> <ol> <li><a href="http://arxiv.org/">arXiv</a> (free) - primarily takes math/physics/computer science papers. You can submit papers and they are reviewed and posted within a couple of days. It is important to note that once you submit a paper here, you can not take it down. But you can submit revisions to the paper which are tracked over time. This outlet is followed by a large number of journalists and scientists.</li> <li><a href="http://biorxiv.org/">biorXiv</a> (free) - primarily takes biology focused papers. They are pretty strict about which categories you can submit to. You can submit papers and they are reviewed and posted within a couple of days. biorxiv also allows different versions of manuscripts, but some folks have had trouble with their versioning system, which can be a bit tricky for keeping your paper coordinated with your publication. bioXiv is pretty carefully followed by the biological and computational biology communities.</li> <li><a href="https://peerj.com/preprints/">Peerj</a> (free) - takes a wide range of different types of papers. They will again review your preprint quickly and post it online. You can also post different versions of your manuscript with this system. This system is newer and so has fewer followers, you will need to do your own publicity if you publish your paper here.</li> </ol> <h3 id="journal-preprint-policies">Journal preprint policies</h3> <p>This <a href="https://en.wikipedia.org/wiki/List_of_academic_journals_by_preprint_policy">list</a> provides information on which journals accept papers that were first posted as preprints. However, you shouldn’t</p> <h2 id="publishing---further-tips-and-issues">Publishing - further tips and issues</h2> <h3 id="open-vs-closed-access">Open vs. closed access</h3> <p>Once your paper has been posted to a preprint server you need to submit it for publication. There are a number of considerations you should keep in mind when submitting papers. One of these considerations is closed versus open access. Closed access journals do not require you to pay to submit or publish your paper. But then people who want to read your paper either need to pay or have a subscription to the journal in question.</p> <p>There has been a strong push for open access journals over the last couple of decades. There are some very good reasons justifying this type of publishing including (a) moral arguments based on using public funding for research, (2) each of access to papers, and (3) benefits in terms of people being able to use your research. In general, most modern scientists want their work to be as widely accessible as possible. So modern scientists often opt for open access publishing.</p> <p>Open access publishing does have a couple of disadvantages. First it is often expensive, with fees for publication ranging between <a href="http://simplystatistics.org/2011/11/03/free-access-publishing-is-awesome-but-expensive-how/">$1,000 and$4,000</a> depending on the journal. Second, while science is often a calling, it is also a career. Sometimes the best journals in your field may be closed access. In general, one of the most important components of an academic career is being able to publish in journals that are read by a lot of people in your field so your work will be recognized and impactful.</p> <p>However, modern systems make both closed and open access journals reasonable outlets.</p> <h3 id="closed-access--preprints">Closed access + preprints</h3> <p>If the top journals in your field are closed access and you are a junior scientist then you should try to submit your papers there. But to make sure your papers are still widely accessible you can use preprints. First, you can submit a preprint before you submit the paper to the journal. Second, you can update the preprint to keep it current with the published version of your paper. This system allows you to make sure that your paper is read widely within your field, but also allows everyone to freely read the same paper. On your website, you can then link to both the published and preprint version of your paper.</p> <h3 id="open-access">Open access</h3> <p>If the top journal in your field is open access you can submit directly to that journal. Even if the journal is open access it makes sense to submit the paper as a preprint during the review process. You can then keep the preprint up-to-date, but this system has the advantage that the formally published version of your paper is also available for everyone to read without constraints.</p> <h3 id="responding-to-referee-comments">Responding to referee comments</h3> <p>After your paper has been reviewed at an academic journal you will receive referee reports. If the paper has not been outright rejected, it is important to respond to the referee reports in a timely and direct manner. Referee reports are often maddening. There is little incentive for people to do a good job refereeing and the most qualified reviewers will likely be those with a <a href="http://simplystatistics.org/2015/02/09/the-trouble-with-evaluating-anything/">conflict of interest</a>.</p> <p>The first thing to keep in mind is that the power in the refereeing process lies entirely with the editors and referees. The first thing to do when responding to referee reports is to eliminate the impulse to argue or respond with any kind of emotion. A step-by-step process for responding to referee reports is the following.</p> <ol> <li>Create a Google Doc. Put in all referee and editor comments in italics.</li> <li>Break the comments up into each discrete criticism or request.</li> <li>In bold respond to each comment. Begin each response with “On page xx we did yy to address this comment”</li> <li>Perform the analyses and experiments that you need to fill in the yy</li> <li>Edit the document to reflect all of the experiments that you have performed</li> </ol> <p>By actively responding to each comment you will ensure you are responsive to the referees and give your paper the best chance of success. If a comment is incorrect or non-sensical, think about how you can edit the paper to remove this confusion.</p> <h3 id="finishing">Finishing</h3> <p>While I have advocated here for using preprints to disseminate your work, it is important to follow the process all the way through to completion. Responding to referee reports is drudgery and no one likes to do it. But in terms of career advancement preprints are almost entirely valueless until they are formally accepted for publication. It is critical to see all papers all the way through to the end of the publication cycle.</p> <h3 id="you-arent-done">You aren’t done!</h3> <p>Publication of your paper is only the beginning of successfully disseminating your science. Once you have published the paper, it is important to use your social media, blog, and other resources to disseminate your results to the broadest audience possible. You will also give talks, discuss the paper with colleagues, and respond to requests for data and code. The most successful papers have a long half life and the responsibilities linger long after the paper is published. But the most successful scientists continue to stay on top of requests and respond to critiques long after their papers are published.</p> <p><em>Note:</em> Part of this chapter appeared in the Simply Statistics blog post: <a href="http://simplystatistics.org/2016/02/26/preprints-and-pppr/">“Preprints are great, but post publication peer review isn’t ready for prime time”</a></p> A Natural Curiosity of How Things Work, Even If You're Not Responsible For Them 2016-04-08T00:00:00+00:00 http://simplystats.github.io/2016/04/08/eecom <p>I just read Karl’s <a href="https://kbroman.wordpress.com/2016/04/08/i-am-a-data-scientist/">great post</a> on what it means to be a data scientist. I can’t really add much to it, but reading it got me thinking about the Apollo 12 mission, the second moon landing.</p> <p>This mission is actually famous because of its launch, where the Saturn V was struck by lightning and <a href="https://en.wikipedia.org/wiki/John_Aaron">John Aaron</a> (played wonderfully by Loren Dean in the movie <a href="http://www.imdb.com/title/tt0112384/">Apollo 13</a>), the flight controller in charge of environmental, electrical, and consumables (EECOM), had to make a decision about whether to abort the launch.</p> <p>In this great clip from the movie <em>Failure is Not An Option</em>, the real John Aaron describes what makes for a good EECOM flight controller. The bottom line is that</p> <blockquote> <p>A good EECOM has a natural curiosity for how things work, even if you…are not responsible for them</p> </blockquote> <p>I think a good data scientist or statistician also has that property. They key part of that line is the “<em>even if you are not responsible for them”</em> part. I’ve found that a lot of being a statistician involves nosing around in places where you’re not supposed to be, seeing how data are collected, handled, managed, analyzed, and reported. Focusing on the development and implementation of methods is not enough.</p> <p>Here’s the clip, which describes the famous “SCE to AUX” call from John Aaron:</p> <iframe width="640" height="480" src="https://www.youtube.com/embed/eWQIryll8y8" frameborder="0" allowfullscreen=""></iframe> Not So Standard Deviations Episode 13 - It's Good that Someone is Thinking About Us 2016-04-07T00:00:00+00:00 http://simplystats.github.io/2016/04/07/nssd-episode-13 <p>In this episode, Hilary and I talk about the difficulties of separating data analysis from its context, and Feather, a new file format for storing tabular data. Also, we respond to some listener questions and Hilary announces her new job.</p> <p>If you have questions you’d like us to answer, you can send them to nssdeviations @ gmail.com or tweet us at @NSSDeviations.</p> <p><a href="https://itunes.apple.com/us/podcast/not-so-standard-deviations/id1040614570">Subscribe to the podcast on iTunes</a>.</p> <p>Please <a href="https://itunes.apple.com/us/podcast/not-so-standard-deviations/id1040614570">leave us a review on iTunes</a>!</p> <p>Show notes:</p> <ul> <li> <p><a href="https://www.patreon.com/NSSDeviations">NSSD Patreon page</a></p> </li> <li> <p><a href="https://github.com/wesm/feather/">Feather git repository</a></p> </li> <li> <p><a href="https://arrow.apache.org">Apache Arrow</a></p> </li> <li> <p><a href="https://google.github.io/flatbuffers/">FlatBuffers</a></p> </li> <li> <p><a href="http://simplystatistics.org/2016/03/31/feather/">Roger’s blog post on feather</a></p> </li> <li> <p><a href="https://www.etsy.com/shop/NausicaaDistribution">NausicaaDistribution</a></p> </li> <li> <p><a href="http://www.rstats.nyc">New York R Conference</a></p> </li> <li> <p><a href="https://goo.gl/J2QAWK">Every Frame a Painting</a></p> </li> </ul> <p><a href="https://soundcloud.com/nssd-podcast/episode-13-its-good-that-someone-is-thinking-about-us">Download the audio for this episode.</a></p> <iframe width="100%" height="166" scrolling="no" frameborder="no" src="https://w.soundcloud.com/player/?url=https%3A//api.soundcloud.com/tracks/257851619&amp;color=ff5500&amp;auto_play=false&amp;hide_related=false&amp;show_comments=true&amp;show_user=true&amp;show_reposts=false"></iframe> Companies are Countries, Academia is Europe 2016-04-05T00:00:00+00:00 http://simplystats.github.io/2016/04/05/corporations-academia <p>I’ve been thinking a lot recently about the practice of data analysis in different settings and how the environment in which you work can affect the view you have on how things should be done. I’ve been working in academia for over 12 years now. I don’t have any industry data science experience, but long ago I worked as a software engineer at <a href="http://www.northropgrumman.com/Pages/default.aspx">two</a> <a href="http://kencast.com">companies</a>. Obviously, my experience is biased on the academic side.</p> <p>I’ve see an interesting divergence between what I see being written from data scientists in industry and my personal experience doing data science in academia. From the industry side, I see a lot of stuff about tooling/software and processes. This makes sense to me. Often, a company needs/wants to move quickly and doing so requires making decisions on a reasonable time scale. If decisions are made with data, then the process of collecting, organizing, analyzing, and communicating data needs to be well thought-out, systematized, rigorous, and streamlined. If everytime someone at the company had a question the data science team developed some novel custom coded-from-scratch solution, decisions would be made at a glacial pace, which is probably not good for business. In order to handle this type of situation you need solid tools and flexible workflows. You also need agreement within the company about how things are down and the processes that are followed.</p> <p>Now, I don’t mean to imply that life at a company is easy, that there isn’t politics or bureacracy to deal with. But I see companies as much like individual countries, with a clear (hierarchical) leadership structure and decision-making process (okay, maybe ideal companies). Much like in a country, it might take some time to come to a decision about a policy or problem (e.g. health insurance), with much negotiation and horse-trading, but once consensus is arrived at, often the policy can be implemented across the country at a reasonable timescale. In a company, if a certain workflow or data process can be shown to be beneficial and perhaps improve profitability down the road, then a decision could be made to implement it. Ultimately, everyone within a company is in the same boat and is interested in seeing the company succeed.</p> <p>When I worked at a company as a software developer, I’d sometimes run into a problem that was confusing or difficult to code. So I’d walk down to the systems engineer’s office (they guy who wrote the specification) and talk to him about it. We’d hash things out for a while and then figure out a way to go forward. Often the technical writers who wrote the documentation would come and ask me what exactly a certain module did and I’d explain it to them. Communication was usually quick and efficient because it usually occurred person-to-person and because we were all on the same team.</p> <p>Academia is more like Europe, a somewhat loose federation of states that only communicates with each other because they have to. Each principal investigator is a country and s/he has to engage in constant (sometimes contentious) negotiations with other investigators (“countries”). As a data scientist, this can be tricky because unless I collect/generate my own data (which sometimes, <a href="http://www.ncbi.nlm.nih.gov/pubmed/18477784">I do</a>), I have to negotiate with another investigator to obtain the data. Even if I were collaborating with that investigator from the very beginning of a study, I typically have very little direct control over the data collection process because those people don’t work for me. The result is often, the data come to me in some format over which I had little input, and I just have to deal with it. Sometimes this is a nice CSV file, but often it is not.</p> <p>In good situations, I can talk with the investigator collecting the data and we can hash out a plan to put the data into a <a href="https://www.jstatsoft.org/article/view/v059i10">certain format</a>. But even if we can agree on that, often the expertise will not be available on their end to get the data into that format, so I’ll end up having to do it myself anyway. In not-so-good situations, I can make all the arguments I want for an organized data collection and analysis workflow, but if the investigator doesn’t want to do it, can’t afford it, or doesn’t see any incentive, then it’s not going to happen. Ever.</p> <p>However, even in the good situations, every investigator works in their own personal way. I mean, that’s why people go into academia, because you can “be your own boss” and work on problems that interest you. Most people develop a process for running their group/lab that most suits their personality. If you’re a data scientist, you need to figure out a way to mesh with each and every investigator you collaborate with. In addition, you need to adapt yourself to whatever data process each investigator has developed for their group. So if you’re working with a genomics person, you might need to learn about BAM files. For a neuroimaging collaborator, you’ll need to know about SPM. If one person doesn’t like tidy data, then that’s too bad. You need to deal with it (or don’t work with them). As a result, it’s difficult to develop a useful “system” for data science because any system that works for one collaborator is unlikely to work for another collaborator. In effect, each collaboration often results in a custom coded-from-scratch solution.</p> <p>This contrast between companies and academia got me thinking about the <a href="https://en.wikipedia.org/wiki/Theory_of_the_firm">Theory of the Firm</a>. This is an economic theory that tries to explain why firms/companies develop at all, as opposed to individuals or small groups negotiating over an open market. My understanding is that it all comes down to how well you can write and enforce a contract between two parties. For example, if I need to manufacture iPhones, I can go to a contract manufacturer, given them the designs and the precise specifications/tolerances and they can just produce millions of them. However, if I need to <em>design</em> the iPhone, it’s a bit harder for me to go to another company and just say “Design an awesome new phone!” That kind of contract is difficult to write down, much less enforce. That other company will be operating off of different incentives from me and will likely not produce what I want. It’s probably better if I do the design work in-house. Ultimately, once the transaction costs of having two different companies work together become too high, it makes more sense for a company to do the work in-house.</p> <p>I think collaborating on data analysis is a high transaction cost activity. Companies have an advantage in this realm to the extent that they can hire lots of data scientists to work in-house. Academics that are well-funded and have large labs can often hire a data analyst to work for them. This is good because it makes a well-trained person available at low transaction cost, but this setup is the exception. PIs with smaller labs barely have enough funding to do their experiments and so either have to analyze the data themselves (for which they may not be appropriately trained) or collaborate with someone willing to do it. Large academic centers often have research cores that provide data analysis services, but this doesn’t change the fact that data analysis that occurs “outside the company” dramatically increases the transaction costs of doing the research. Because data analysis is a highly iterative process, each time you have to go back in forth with an outside entity, the costs go up.</p> <p>I think it’s possible to see a time when data analysis can effectively be made external. I mean, Apple used to manufacture all its products, but has shifted to contract manufacturing to great success. But I think we will have to develop a much better understanding of the data analysis process before we see the transaction costs start to go down.</p> New Feather Format for Data Frames 2016-03-31T00:00:00+00:00 http://simplystats.github.io/2016/03/31/feather <p>This past Tuesday, Hadley Wickham and Wes McKinney <a href="http://blog.cloudera.com/blog/2016/03/feather-a-fast-on-disk-format-for-data-frames-for-r-and-python-powered-by-apache-arrow/">announced</a> a new binary file format specifically for storing data frames.</p> <blockquote> <p>One thing that struck us was that, while R’s data frames and Python’s pandas data frames utilize different internal memory representations, the semantics of their user data types are mostly the same. In both R and pandas, data frames contain lists of named, equal-length columns, which can be numeric, boolean, and date-and-time, categorical (factors), or string. Additionally, these columns must support missing (null) values.</p> </blockquote> <p>Their work builds on the Apache Arrow project, which specifies a format for tabular data. There is currently a Python and R implementation for reading/writing these files but other implementations could easily be built as the file format looks pretty straightforward. The git repository is <a href="https://github.com/wesm/feather/">here</a>.</p> <p>Initial thoughts:</p> <ul> <li> <p>The possibilities for passing data between languages is I think the main point here. The potential for passing data through a pipeline without worrying about the specifics of different languages could make for much more powerful analyses where different tools are used for whatever they tend to do best. Essentially, as long as data can be made tidy going in and coming out, there should not be a communication issue between languages.</p> </li> <li> <p>R users might be wondering what the big deal is–we already have a binary serialization format (XDR). But R’s serialization format is meant to cover all possible R objects. Feather’s focus on data frames allows for the removal of many of the annoying (but seldom used) complexities of R objects and optimizing a very commonly used data format.</p> </li> <li> <p>In my testing, there’s a noticeable speed difference between reading a feather file and reading an (uncompressed) R workspace file (feather seems about 2x faster). I didn’t time writing files, but the difference didn’t seem as noticeable there. That said, it’s not clear to me that performance on files is the main point here.</p> </li> <li> <p>Given the underlying framework and representation, there seem to be some interesting possibilities for low-memory environments.</p> </li> </ul> <p>I’ve only had a chance to quickly look at the code but I’m excited to see what comes next.</p> How to create an AI startup - convince some humans to be your training set 2016-03-30T00:00:00+00:00 http://simplystats.github.io/2016/03/30/humans-as-training-set <p>The latest trend in data science is <a href="https://en.wikipedia.org/wiki/Artificial_intelligence">artificial intelligence</a>. It has been all over the news for tackling a bunch of interesting questions. For example:</p> <ul> <li><a href="https://deepmind.com/alpha-go.html">AlphaGo</a> <a href="http://www.techrepublic.com/article/how-googles-deepmind-beat-the-game-of-go-which-is-even-more-complex-than-chess/">beat</a> one of the top Go players in the world in what has been called a major advance for the field.</li> <li>Microsoft created a chatbot <a href="http://techcrunch.com/2016/03/23/microsofts-new-ai-powered-bot-tay-answers-your-tweets-and-chats-on-groupme-and-kik/">Tay</a> that ultimately <a href="http://www.bbc.com/news/technology-35902104">went very very wrong</a>.</li> <li>Google and a number of others are working on <a href="https://www.google.com/selfdrivingcar/">self driving cars</a>.</li> <li>Facebook is creating an artificial intellgence based <a href="http://www.engadget.com/2015/08/26/facebook-messenger-m-assistant/">virtual assistant called M</a></li> </ul> <p>Almost all of these applications are based (at some level) on using variations on <a href="http://neuralnetworksanddeeplearning.com/">neural networks and deep learning</a>. These models are used like any other statistical or machine learning model. They involve a prediction function that is based on a set of parameters. Using a training data set, you estimate the parameters. Then when you get a new set of data, you push it through the prediction function using those estimated parameters and make your predictions.</p> <p>So why does deep learning do so well on problems like voice recognition, image recognition, and other complicated tasks? The main reason is that these models involve hundreds of thousands or millions of parameters, that allow the model to capture even very subtle structure in large scale data sets. This type of model can be fit now because (a) we have huge training sets (think all the pictures on Facebook or all voice recordings of people using Siri) and (b) we have fast computers that allow us to estimate the parameters.</p> <p>Almost all of the high-profile examples of “artificial intelligence” we are hearing about involve this type of process. This means that the machine is “learning” from examples of how humans behave. The algorithm itself is a way to estimate subtle structure from collections of human behavior.</p> <p>The result is that the typical trajectory for an AI business is.</p> <ol> <li>Get a large collection of humans to perform some repetitive but possibly complicated behavior (play thousands of games of Go, or answer requests from people on Facebook messenger, or label pictures and videos, or drive cars.)</li> <li>Record all of the actions the humans perform to create a training set.</li> <li>Feed these data into a statistical model with a huge number of parameters - made possible by having a huge training set collected from the humans in steps 1 and 2.</li> <li>Apply the algorithm to perform the repetitive task and cut the humans out of the process.</li> </ol> <p>The question is how do you get the humans to perform the task for you? One option is to collect data from humans who are using your product (think Facebook image tagging). The other, more recent phenomenon, is to farm the task out to a large number of contractors (think <a href="http://www.theguardian.com/commentisfree/2015/jul/26/will-we-get-by-gig-economy">gig economy</a> jobs like driving for Uber, or responding to queries on Facebook).</p> <p>The interesting thing about the latter case is that in the short term it produces a market for gigs for humans. But in the long term, by performing those tasks, the humans are putting themselves out of a job. This played out in a relatively public way just recently with a service called <a href="http://www.fastcompany.com/3058060/this-is-what-it-feels-like-when-a-robot-takes-your-job">GoButler</a> that used its employees to train a model and then replaced them with that model.</p> <p>It will be interesting to see how many areas of employment this type of approach takes over. It is also interesting to think about how much value each task you perform for a company like that is worth to the training set. It will also be interesting if there is a legal claim for the gig workers at these companies to make that their labor helped “create the value” at the companies that replace them.</p> Not So Standard Deviations Episode 12 - The New Bayesian vs. Frequentist 2016-03-26T00:00:00+00:00 http://simplystats.github.io/2016/03/26/nssd-episode-12 <p>In this episode, Hilary and I discuss the new direction for the journal Biostatistics, the recent fracas over ggplot2 and base graphics in R, and whether collecting more data is always better than collecting less (fewer?) data. Also, Hilary and Roger respond to some listener questions and more free advertising.</p> <p>If you have questions you’d like us to answer, you can send them to nssdeviations @ gmail.com or tweet us at @NSSDeviations.</p> <p><a href="https://itunes.apple.com/us/podcast/not-so-standard-deviations/id1040614570">Subscribe to the podcast on iTunes</a>.</p> <p>Please <a href="https://itunes.apple.com/us/podcast/not-so-standard-deviations/id1040614570">leave us a review on iTunes</a>!</p> <p>Show notes:</p> <ul> <li> <p><a href="http://goo.gl/am6I3r">Jeff Leek on why he doesn’t use ggplot2</a></p> </li> <li> <p>David Robinson on <a href="http://varianceexplained.org/r/why-I-use-ggplot2/">why he uses ggplot2</a></p> </li> <li> <p><a href="http://goo.gl/6iEB2I">Nathan Yau’s post comparing ggplot2 and base graphics</a></p> </li> <li> <p><a href="https://goo.gl/YuhFgB">Biostatistics Medium post</a></p> </li> <li> <p><a href="http://goo.gl/tXNdCA">Photoviz</a></p> </li> <li> <p><a href="https://twitter.com/PigeonAir">PigeonAir</a></p> </li> <li> <p><a href="https://goo.gl/jqlg0G">I just want to plot()</a></p> </li> <li> <p><a href="https://goo.gl/vvCfkl">Hilary and Rush Limbaugh</a></p> </li> <li> <p><a href="http://imgur.com/a/K4RWn">Deep learning training set</a></p> </li> <li> <p><a href="http://patreon.com/NSSDeviations">NSSD Patreon Page</a></p> </li> </ul> <p><a href="https://soundcloud.com/nssd-podcast/episode-12-the-new-bayesian-vs-frequentist">Download the audio for this episode.</a></p> <iframe width="100%" height="166" scrolling="no" frameborder="no" src="https://w.soundcloud.com/player/?url=https%3A//api.soundcloud.com/tracks/255099493&amp;color=ff5500&amp;auto_play=false&amp;hide_related=false&amp;show_comments=true&amp;show_user=true&amp;show_reposts=false"></iframe> The future of biostatistics 2016-03-24T00:00:00+00:00 http://simplystats.github.io/2016/03/24/the-future-of-biostatistics <p>Starting in January my colleague <a href="https://twitter.com/drizopoulos">Dimitris Rizopoulos</a> and I took over as co-editors of the journal Biostatistics. We are pretty fired up to try some new things with the journal and to make sure that the most important advances in statistical methodology and application have a good home.</p> <p>We started a blog for the journal and our first post is here: <a href="https://medium.com/@biostatistics/the-future-of-biostatistics-5aa8246e14b4#.uk1gat5sr">The future of Biostatistics</a>. Thanks to <a href="https://twitter.com/kwbroman/status/695306823365169154">Karl Broman and his famiy</a> we also have the twitter handle <a href="https://twitter.com/biostatistics">@biostatistics</a>. Follow us there to hear about all the new stuff we are rolling out.</p> The Evolution of a Data Scientist 2016-03-21T00:00:00+00:00 http://simplystats.github.io/2016/03/21/dataScientistEvo-jaffe <p><em>Editor’s note: This post is a guest post by <a href="http://aejaffe.com">Andrew Jaffe</a></em></p> <p>“How do you get to Carnegie Hall? Practice, practice, practice.” (“The Wit Parade” by E.E. Kenyon on March 13, 1955)</p> <p>”..an extraordinarily consistent answer in an incredible number of fields … you need to have practiced, to have apprenticed, for 10,000 hours before you get good.” (Malcolm Gladwell, on Outliers)</p> <p>I have been a data scientist for the last seven or eight years, probably before “data science” existed as a field. I work almost exclusively in the R statistical environment which I first toyed with as a sophomore in college, which ramped up through graduate school. I write all of my code in Notepad++ and make all of my plots with base R graphics, over newer and probably easier approaches, like R Studio, ggplot2, and R Markdown. Every so often, someone will email asking for code used in papers for analysis or plots, and I dig through old folders to track it down. Every time this happens, I come to two realizations: 1) I used to write fairly inefficient and not-so-great code as an early PhD student, and 2) I write a lot of R code.</p> <p>I think there are some pretty good ways of measuring success and growth as a data scientist – you can count software packages and their user-bases, projects and papers, citations, grants, and promotions. But I wanted to calculate one more metric to add to the list – how much R code have I written in the last 8 years? I have been using the Joint High Performance Computing Exchange (JHPCE) at Johns Hopkins University since I started graduate school, so all of my R code was pretty much all in one place. I therefore decided to spend my Friday night drinking some Guinness and chronicling my journey using R and evolution as a data scientist.</p> <p>I found all of the .R files across my four main directories on the computing cluster (after copying over my local scripts), and then removed files that came with packages, that belonged to other users, and that resulted from poorly designed simulation and permutation analyses (perm1.R,…,perm100.R) before I learned how to use array jobs, and then extracted the creation date, last modified date, file size, and line count for each R script. Based on this analysis, I have written 3257 R scripts across 13.4 megabytes and 432,753 lines of code (including whitespace and comments) since February 22, 2009.</p> <p>I found that my R coding output has generally increased over time when tabulated by month (number of scripts: p=6.3e-7, size of files: p=3.5x10-9, and number of lines: p=5.0e-9). These metrics of coding – number, size, and lines - also suggest that, on average, I wrote the most code during my PhD (p-value range: 1.7e-4-1.8e-7). Interestingly, the changes in output over time surprisingly consistent across the three phases of my academic career: PhD, postdoc, and faculty (see Figure 1) – you can see the initial dropoff in production during the first one or two months as I transitioned to a postdoc at the Lieber Institute for Brain Development after finishing my PhD. My output rate has dropped slightly as a faculty member as I started working with doctoral students who took over the analyses of some projects (month-by-output interaction p-value: 5.3e-4, 0.002, and 0.03, respectively, for number, size, and lines). The mean coding output – on average, how much code it takes for a single analysis – were also increased over time and slightly decreased at LIBD, although to lesser extents (all p-values were between 0.01-0.05). I was actually surprised that coding output increased – rather than decreased – over time, as any gains in coding efficiency were probably canceled out my often times more modular analyses at LIBD.</p> <p><img src="https://raw.githubusercontent.com/simplystats/simplystats.github.io/master/_images/2016-03-21/sizeVsMonth_rCode.jpg" alt="Figure 1: Coding output over time. Vertical bars separate my PhD, postdoc, and faculty jobs" /></p> <p>I also looked at coding output by hour of the day to better characterize my working habits – the output per hour is shown stratified by the two eras each about ~3 years (Figure 2). As expected, I never really work much in the morning – very little work get done before 8AM – and little has changed since a second year PhD student. As a faculty member, I have the highest output between 9AM-3PM. The trough between 4PM and 7PM likely corresponds to walking the dog we got three years ago, working out, and cooking (and eating) dinner. The output then increases steadily from 8PM-12AM, where I can work largely uninterrupted from meetings and people dropping by my office, with occasional days (or nights) working until 1AM.</p> <p><img src="https://raw.githubusercontent.com/simplystats/simplystats.github.io/master/_images/2016-03-21/sizeVsHour_rCode.jpg" alt="Figure 2: Coding output by hour of day. X-axis starts at 5AM to divide the day into a more temporal order." /></p> <p>Lastly, I examined R coding output by day of the week. As expected, the lowest output occurred over the weekend, especially on Saturdays. Interestingly, I tended to increase output later in the work week as a faculty member, and also work a little more on Sundays and Mondays, compared to a PhD student.</p> <p><img src="https://raw.githubusercontent.com/simplystats/simplystats.github.io/master/_images/2016-03-21/sizeVsDay_rCode.jpg" alt="Figure 3: Coding output by day of week." /></p> <p>Looking at the code itself, of the 432,753 lines, 84,343 were newlines (19.5%), 66,900 were lines that were exclusively comments (15.5%), and an additional 6,994 lines contained comments following R code (1.6%). Some of my most used syntax and symbols, as line counts containing at least one symbol, were pretty much as expected (dropping commas and requiring whitespace between characters):</p> <table> <tbody> <tr> <td>Code</td> <td>Count</td> <td>Code</td> <td>Count</td> </tr> <tr> <td>=</td> <td>175604</td> <td>==</td> <td>5542</td> </tr> <tr> <td>#</td> <td>48763</td> <td>&lt;</td> <td>5039</td> </tr> <tr> <td>&lt;-</td> <td>16492</td> <td>for(i</td> <td>5012</td> </tr> <tr> <td>{</td> <td>11879</td> <td>&amp;</td> <td>4803</td> </tr> <tr> <td>}</td> <td>11612</td> <td>the</td> <td>4734</td> </tr> <tr> <td>in</td> <td>10587</td> <td>function(x)</td> <td>4591</td> </tr> <tr> <td>##</td> <td>8508</td> <td>###</td> <td>4105</td> </tr> <tr> <td>~</td> <td>6948</td> <td>-</td> <td>4034</td> </tr> <tr> <td>&gt;</td> <td>5621</td> <td>%in%</td> <td>3896</td> </tr> </tbody> </table> <p>My code is available on GitHub: https://github.com/andrewejaffe/how-many-lines (after removing file paths and names, as many of the projects are currently unpublished and many files are placed in folders named by collaborator), so feel free to give it a try and see how much R code you’ve written over your career. While there are probably a lot more things to play around with and explore, this was about all the time I could commit to this, given other responsibilities (I’m not on sabbatical like <a href="http://jtleek.com">Jeff Leek</a>…). All in all, this was a pretty fun experience and largely reflected, with data, how my R skills and experience have progressed over the years.</p> Not So Standard Deviations Episode 11 - Start and Stop 2016-03-14T00:00:00+00:00 http://simplystats.github.io/2016/03/14/nssd-episode-11 <p>We’ve started a Patreon page! Now you can support the podcast directly by going to <a href="http://patreon.com/NSSDeviations">our page</a> and making a pledge. This will help Hilary and me build the podcast, add new features, and get some better equipment.</p> <p>Episode 11 is an all craft episode of <em>Not So Standard Deviations</em>, where Hilary and Roger discuss starting and ending a data analysis. What do you do at the very beginning of an analysis? Hilary and Roger talk about some of the things that seem to come up all the time. Also up for discussion is the American Statistical Association’s statement on <em>p</em> values, famous statisticians on Twitter, and evil data scientists on TV. Plus two new things for free advertising.</p> <p><a href="https://itunes.apple.com/us/podcast/not-so-standard-deviations/id1040614570">Subscribe to the podcast on iTunes</a>.</p> <p>Please <a href="https://itunes.apple.com/us/podcast/not-so-standard-deviations/id1040614570">leave us a review on iTunes</a>!</p> <p>Show notes:</p> <ul> <li> <p><a href="http://patreon.com/NSSDeviations">NSSD Patreon Page</a></p> </li> <li> <p><a href="https://twitter.com/deleeuw_jan">Jan de Leeuw</a></p> </li> <li> <p><a href="https://twitter.com/BatesDmbates">Douglas Bates</a></p> </li> <li> <p><a href="https://en.wikipedia.org/wiki/Sports_Night">Sports Night</a></p> </li> <li> <p><a href="http://goo.gl/JFz7ic">ASA’s statement on p values</a></p> </li> <li> <p><a href="http://goo.gl/O8kL60">Basic and Applied Psychology Editorial banning p values</a></p> </li> <li> <p><a href="http://www.seriouseats.com/vegan-experience">J. Kenji Alt’s Vegan Experience</a></p> </li> <li> <p><a href="http://fieldworkfail.com/">fieldworkfail</a></p> </li> </ul> <p><a href="https://soundcloud.com/nssd-podcast/episode-11-start-and-stop">Download the audio for this episode</a>.</p> <iframe width="100%" height="166" scrolling="no" frameborder="no" src="https://w.soundcloud.com/player/?url=https%3A//api.soundcloud.com/tracks/251825714&amp;color=ff5500&amp;auto_play=false&amp;hide_related=false&amp;show_comments=true&amp;show_user=true&amp;show_reposts=false"></iframe> Not So Standard Deviations Episode 10 - It's All Counterexamples 2016-03-02T00:00:00+00:00 http://simplystats.github.io/2016/03/02/nssd-episode-10 <p>In the latest episode of Not So Standard Deviations Hilary and I talk about the motivation behind the <a href="https://github.com/hilaryparker/explainr">explainr</a> package and the general usefulness of automated reporting and interpretation of statistical tests. Also, Roger struggles to come up with a quick and easy way to explain why statistics is useful when it sometimes doesn’t produce any different results.</p> <p><a href="https://itunes.apple.com/us/podcast/not-so-standard-deviations/id1040614570">Subscribe to the podcast on iTunes</a>.</p> <p>Please <a href="https://itunes.apple.com/us/podcast/not-so-standard-deviations/id1040614570">leave us a review on iTunes</a>!</p> <p>Show notes:</p> <ul> <li> <p>The <a href="https://github.com/hilaryparker/explainr">explainr</a> package</p> </li> <li> <p><a href="https://google.github.io/CausalImpact/CausalImpact.html">Google’s CausalImpact package</a></p> </li> <li> <p><a href="http://www.wsj.com/articles/SB10001424053111903480904576512250915629460">Software is Eating the World</a></p> </li> <li> <p><a href="http://allendowney.blogspot.com/2015/12/many-rules-of-statistics-are-wrong.html">Many Rules of Statistics are Wrong</a></p> </li> </ul> <p><a href="https://soundcloud.com/nssd-podcast/episode-10-its-all-counterexamples">Download the audio for this episode</a>.</p> <iframe width="100%" height="166" scrolling="no" frameborder="no" src="https://w.soundcloud.com/player/?url=https%3A//api.soundcloud.com/tracks/249517993&amp;color=ff5500&amp;auto_play=false&amp;hide_related=false&amp;show_comments=true&amp;show_user=true&amp;show_reposts=false"></iframe> Preprints are great, but post publication peer review isn't ready for prime time 2016-02-26T00:00:00+00:00 http://simplystats.github.io/2016/02/26/preprints-and-pppr <p>The current publication system works something like this:</p> <h3 id="coupled-review-and-publication">Coupled review and publication</h3> <ol> <li>You write a paper</li> <li>You submit it to a journal</li> <li>It is peer reviewed privately</li> <li>The paper is accepted or rejected a. If rejected go back to step 2 and start over b. If accepted it will be published</li> <li>If published then people can read it</li> </ol> <p>This system has several major disadvantages that bother scientists. It means all research appears on a lag (whatever the time in peer review is). It can be a major lag time if the paper is sent to “top tier journals” and rejected then filters down to “lower tier” journals before ultimate publication. Another disadvantage is that there are two options for most people to publish their papers: (a) in closed access journals where it doesn’t cost anything to publish but then the articles are beyind paywalls and (b) in open access journals where anyone can read them but it costs money to publish. Especially for junior scientists or folks without resources, this creates a difficult choice because they <a href="http://simplystatistics.org/2011/11/03/free-access-publishing-is-awesome-but-expensive-how/">might not be able to afford open access fees</a>.</p> <p>For a number of years some fields like physics (with the <a href="http://arxiv.org/">arxiv</a>) and economics (with <a href="http://www.nber.org/papers.html">NBER</a>) have solved this problem by decoupling peer review and publication. In these fields the system works like this:</p> <h3 id="decoupled-review-and-publication">Decoupled review and publication</h3> <ol> <li>You write a paper</li> <li>You post a preprint a. Everyone can read and comment</li> <li>You submit it to a journal</li> <li>It is peer reviewed privately</li> <li>The paper is accepted or rejected a. If rejected go back to step 2 and start over b. If accepted it will be published</li> </ol> <p>Lately there has been a growing interest in this same system in molecular and computational biology. I think this is a really good thing, because it makes it easier to publish papers more quickly and doesn’t cost researchers to publish. That is why the papers my group publishes all show up on <a href="http://biorxiv.org/search/author1%3AJeffrey%2BLeek%2B">biorxiv</a> or <a href="http://arxiv.org/find/stat/1/au:+Leek_J/0/1/0/all/0/1">arxiv</a> first.</p> <p>While I think this decoupling is great, there seems to be a push for this decoupling and at the same time a move to post publication peer review. I used to argue pretty strongly for <a href="http://simplystatistics.org/2012/10/04/should-we-stop-publishing-peer-reviewed-papers/">post-publication peer review</a> but Rafa <a href="http://simplystatistics.org/2012/10/08/why-we-should-continue-publishing-peer-reviewed-papers/">set me straight</a> and pointed out that at least with peer review every paper that gets submitted gets evaluated by <em>someone</em> even if the paper is ultimately rejected.</p> <p>One of the risks of post publication peer review is that there is no incentive to peer review in the current system. In a paper a few years ago I actually showed that under an economic model for closed peer review the Nash equilibrium is for <a href="http://journals.plos.org/plosone/article?id=10.1371/journal.pone.0026895">no one to peer review at all</a>. We showed in that same paper that under open peer review there is an increase in the amount of time spent reviewing, but the effect was relatively small. Moreover the dangers of open peer review are clear (junior people reviewing senior people and being punished for it) while the benefits (potentially being recognized for insightful reviews) are much hazier. Even the most vocal proponents of post publication peer review <a href="http://www.ncbi.nlm.nih.gov/myncbi/michael.eisen.1/comments/">don’t do it that often</a> when given the chance.</p> <p>The reason is that everyone in academics already have a lot of things they are asked to do. Many review papers either out of a sense of obligation or because they want to be in the good graces of a particular journal. Without this system in place there is a strong chance that peer review rates will drop and only a few papers will get reviewed. This will ultimately decrease the accuracy of science. In our (admittedly contrived/simplified) <a href="http://journals.plos.org/plosone/article?id=10.1371/journal.pone.002689">experiment</a> on peer review accuracy went from 39% to 78% after solutions were reviewed. You might argue that only “important” papers should be peer reviewed but then you are back in the camp of glamour. Say waht you want about glamour journals. They are for sure biased by the names of the people submitting the papers there. But it is <em>possible</em> for someone to get a paper in no matter who they are. If we go to a system where there is no curation through a journal-like mechanism then popularity/twitter followers/etc. will drive readers. I’m not sure that is better than where we are now.</p> <p>So while I think pre-prints are a great idea I’m still waiting to see a system that beats pre-publication review for maintaining scientific quality (even though it may just be an <a href="http://simplystatistics.org/2015/02/09/the-trouble-with-evaluating-anything/">impossible problem</a>)</p> Spreadsheets: The Original Analytics Dashboard 2016-02-23T08:42:30+00:00 http://simplystats.github.io/2016/02/23/spreadsheets-the-original-analytics-dashboard <p>Soon after my discussion with Hilary Parker and Jenny Bryan about spreadsheets on <em><a href="https://soundcloud.com/nssd-podcast">Not So Standard Deviations</a></em>, Brooke Anderson forwarded me <a href="https://backchannel.com/a-spreadsheet-way-of-knowledge-8de60af7146e#.gj4f2bod4">this article</a> written by Steven Levy about the original granddaddy of spreadsheets, <a href="https://en.wikipedia.org/wiki/VisiCalc">VisiCalc</a>. Actually, the real article was written back in 1984 as so-called microcomputers were just getting their start. VisiCalc was originally written for the Apple II computer and notable competitors at the time included <a href="https://en.wikipedia.org/wiki/Lotus_1-2-3">Lotus 1-2-3</a> and Microsoft <a href="https://en.wikipedia.org/wiki/Multiplan">Multiplan</a>, all since defunct.</p> <p>It’s interesting to see Levy’s perspective on spreadsheets back then and to compare it to the current thinking about data, data science, and reproducibility in science. The problem back then was “ledger sheets” (what we might now call a spreadsheet), which contained numbers and calculations related to businesses, were tedious to make and keep up to date.</p> <blockquote> <p>Making spreadsheets, however necessary, was a dull chore best left to accountants, junior analysts, or secretaries. As for sophisticated “modeling” tasks – which, among other things, enable executives to project costs for their companies – these tasks could be done only on big mainframe computers by the data-processing people who worked for the companies Harvard MBAs managed.</p> </blockquote> <p>You can see one issue here: Spreadsheets/Ledgers were a “dull chore”, and best left to junior people. However, the “real” computation was done by the people the “data processing” center on big mainframes. So what exactly does that leave for the business executive to do?</p> <p>Note that the way of doing things back then was effectively reproducible, because the presentation (ledger sheets printed on paper) and the computation (data processing on mainframes) was separated.</p> <p>The impact of the microcomputer-based spreadsheet program appears profound.</p> <blockquote> <p id="9424" class="graf--p graf-after--p"> Already, the spreadsheet has redefined the nature of some jobs; to be an accountant in the age of spreadsheet program is — well, almost sexy. And the spreadsheet has begun to be a forceful agent of decentralization, breaking down hierarchies in large companies and diminishing the power of data processing. </p> <p class="graf--p graf-after--p"> There has been much talk in recent years about an “entrepreneurial renaissance” and a new breed of risk-taker who creates businesses where none previously existed. Entrepreneurs and their venture-capitalist backers are emerging as new culture heroes, settlers of another American frontier. Less well known is that most of these new entrepreneurs depend on their economic spreadsheets as much as movie cowboys depend on their horses. </p> </blockquote> <p class="graf--p graf-after--p">  If you replace "accountant" with "statistician" and "spreadsheet" with "big data" and you are magically teleported into 2016. </p> <p class="graf--p graf-after--p"> The way I see it, in the early 80's, spreadsheets satisfied the never-ending desire that people have to interact with data. Now, with things like tablets and touch-screen phones, you can literally "touch" your data. But it took microcomputers to get to a certain point before interactive data analysis could really be done in a way that we recognize today. Spreadsheets tightened the loop between question and answer by cutting out the Data Processing department and replacing it with an Apple II (or an IBM PC, if you must) right on your desk. </p> <p class="graf--p graf-after--p"> Of course, the combining of presentation with computation comes at a cost of reproducibility and perhaps quality control. Seeing the description of how spreadsheets were originally used, it seems totally natural to me. It is not unlike today's analytic dashboards that give you a window into your business and allow you to "model" various scenarios by tweaking a few numbers of formulas. Over time, people took spreadsheets to all sorts of extremes, using them for purposes for which they were not originally designed, and problems naturally arose. </p> <p class="graf--p graf-after--p"> So now, we are trying to separate out the computation and presentation bits a little. Tools like knitr and R and shiny allow us to do this and to bring them together with a proper toolchain. The loss in interactivity is only slight because of the power of the toolchain and the speed of computers nowadays. Essentially, we've brought back the Data Processing department, but have staffed it with robots and high speed multi-core computers. </p> Non-tidy data 2016-02-17T15:47:23+00:00 http://simplystats.github.io/2016/02/17/non-tidy-data <p>During the discussion that followed the ggplot2 posts from David and I last week we started talking about tidy data and the man himself noted that matrices are often useful instead of <a href="http://vita.had.co.nz/papers/tidy-data.pdf">“tidy data”</a> and I mentioned there might be other data that are usefully “non tidy”. Here I will be using tidy/non-tidy according to Hadley’s definition. So tidy data have:</p> <ul> <li>One variable per column</li> <li>One observation per row</li> <li>Each type of observational unit forms a table</li> </ul> <p>I push this approach in my <a href="https://github.com/jtleek/datasharing">guide to data sharing</a> and in a lot of my personal work. But note that non-tidy data can definitely be already processed, cleaned, organized and ready to use.</p> <blockquote class="twitter-tweet" data-width="550"> <p lang="en" dir="ltr"> <a href="https://twitter.com/hadleywickham">@hadleywickham</a> <a href="https://twitter.com/drob">@drob</a> <a href="https://twitter.com/mark_scheuerell">@mark_scheuerell</a> I'm saying that not all data are usefully tidy (and not just matrices) so I care more abt flexibility </p> <p> &mdash; Jeff Leek (@jtleek) <a href="https://twitter.com/jtleek/status/698247927706357760">February 12, 2016</a> </p> </blockquote> <p>This led to a very specific blog request:</p> <blockquote class="twitter-tweet" data-width="550"> <p lang="en" dir="ltr"> <a href="https://twitter.com/jtleek">@jtleek</a> <a href="https://twitter.com/drob">@drob</a> I want a blog post on non-tidy data! </p> <p> &mdash; Hadley Wickham (@hadleywickham) <a href="https://twitter.com/hadleywickham/status/698251883685646336">February 12, 2016</a> </p> </blockquote> <p>So I thought I’d talk about a couple of reasons why data are usefully non-tidy. The basic reason is that I usually take a <a href="http://simplystatistics.org/2013/05/29/what-statistics-should-do-about-big-data-problem-forward-not-solution-backward/">problem first, not solution backward</a> approach to my scientific research. In other words, the goal is to solve a particular problem and the format I chose is the one that makes it most direct/easy to solve that problem, rather than one that is theoretically optimal.   To illustrate these points I’ll use an example from my area.</p> <p><strong>Example data</strong></p> <p>Often you want data in a matrix format. One good example is gene expression data or data from another high-dimensional experiment. David talks about one such example in <a href="http://varianceexplained.org/r/tidy-genomics/">his post here</a>. He makes the (valid) point that for students who aren’t going to do genomics professionally, it may be more useful to learn an abstract tool such as tidy data/dplyr. But for those working in genomics, this can make you do unnecessary work in the name of theory/abstraction.</p> <p>He analyzes the data in that post by first tidying the data.</p> <div class="wp_syntax"> <table> <tr> <td class="code"> <pre class="r" style="font-family:monospace;">library(dplyr) library(tidyr) library(stringr) library(readr) library(broom) &nbsp; original_data % separate(NAME, c("name", "BP", "MF", "systematic_name", "number"), sep = "\\|\\|") %&gt;% mutate_each(funs(trimws), name:systematic_name) %&gt;% select(-number, -GID, -YORF, -GWEIGHT) %&gt;% gather(sample, expression, G0.05:U0.3) %&gt;% separate(sample, c("nutrient", "rate"), sep = 1, convert = TRUE)</pre> </td> </tr> </table> </div> <p>It isn’t 100% tidy as data of different types are in the same data frame (gene expression and metadata/phenotype data belong in different tables). But its close enough for our purposes. Now suppose that you wanted to fit a model and test for association between the “rate” variable and gene expression for each gene. You can do this with David’s tidy data set, dplyr, and the broom package like so:</p> <div class="wp_syntax"> <table> <tr> <td class="code"> <pre class="r" style="font-family:monospace;">rate_coeffs = cleaned_data %&gt;% group_by(name) %&gt;% do(fit = lm(expression ~ rate + nutrient, data = .)) %&gt;% tidy(fit) %&gt;% dplyr::filter(term=="rate")</pre> </td> </tr> </table> </div> <p>On my computer we get something like:</p> <div class="wp_syntax"> <table> <tr> <td class="code"> <pre class="r" style="font-family:monospace;">system.time( cleaned_data %&gt;% group_by(name) %&gt;% + do(fit = lm(expression ~ rate + nutrient, data = .)) %&gt;% + tidy(fit) %&gt;% + dplyr::filter(term=="rate")) |==========================================================|100% ~0 s remaining user system elapsed 12.431 0.258 12.364</pre> </td> </tr> </table> </div> <p>Let’s now try that analysis a little bit differently. As a first step, lets store the data in two separate tables. A table of “phenotype information” and a matrix of “expression levels”. This is the more common format used for these type of data. Here is the code to do that:</p> <div class="wp_syntax"> <table> <tr> <td class="code"> <pre class="r" style="font-family:monospace;">expr = original_data %&gt;% select(grep("[0:9]",names(original_data))) &nbsp; rownames(expr) = original_data %&gt;% separate(NAME, c("name", "BP", "MF", "systematic_name", "number"), sep = "\\|\\|") %&gt;% select(systematic_name) %&gt;% mutate_each(funs(trimws),systematic_name) %&gt;% as.matrix() &nbsp; vals = data.frame(vals=names(expr)) pdata = separate(vals,vals,c("nutrient", "rate"), sep = 1, convert = TRUE) &nbsp; expr = as.matrix(expr)</pre> </td> </tr> </table> </div> <p>If we leave the data in this format we can get the model fits and the p-values using some simple linear algebra</p> <div class="wp_syntax"> <table> <tr> <td class="code"> <pre class="r" style="font-family:monospace;">expr = as.matrix(expr) &nbsp; mod = model.matrix(~ rate + as.factor(nutrient),data=pdata) rate_betas = expr %*% mod %*% solve(t(mod) %*% mod)</pre> </td> </tr> </table> </div> <p>This gives the same answer after re-ordering</p> <div class="wp_syntax"> <table> <tr> <td class="code"> <pre class="r" style="font-family:monospace;">all(abs(rate_betas[,2]- rate_coeffsestimate[ind]) &lt; 1e-5,na.rm=T) [1] TRUE</pre> </td> </tr> </table> </div> <p>But this approach is much faster.</p> <div class="wp_syntax"> <table> <tr> <td class="code"> <pre class="r" style="font-family:monospace;"> system.time(expr %*% mod %*% solve(t(mod) %*% mod)) user system elapsed 0.015 0.000 0.015</pre> </td> </tr> </table> </div> <p>This requires some knowledge of linear algebra and isn’t pretty. But it brings us to the first general point: <strong>you might not use tidy data because some computations are more efficient if the data is in a different format. </strong></p> <p>Many examples from graphical models, to genomics, to neuroimaging, to social sciences rely on some kind of linear algebra based computations (matrix multiplication, singular value decompositions, eigen decompositions, etc.) which are all optimized to work on matrices, not tidy data frames. There are ways to improve performance with tidy data for sure, but they would require an equal amount of custom code to take advantage of say C, or vectorization properties in R.</p> <p>Ok now the linear regressions here are all treated independently, but it is very well known that you get much better performance in terms of the false positive/true positive tradeoff if you use an empirical Bayes approach for this calculation where <a href="https://bioconductor.org/packages/release/bioc/html/limma.html">you pool variances</a>.</p> <p>If the data are in this matrix format you can do it with R like so:</p> <div class="wp_syntax"> <table> <tr> <td class="code"> <pre class="r" style="font-family:monospace;">library(limma) fit_limma = lmFit(expr,mod) ebayes_limma = eBayes(fit_limma) topTable(ebayes_limma)</pre> </td> </tr> </table> </div> <p>This approach is again very fast, optimized for the calculations being performed and performs much better than the one-by-one regression approach. But it requires the data in matrix or expression set format. Which brings us to the second general point: <strong>**you might not use tidy data because many functions require a different, also very clean and useful data format, and you don’t want to have to constantly be switching back and forth. </strong>**Again, this requires you to be more specific to your application, but the potential payoffs can be really big as in the case of limma.</p> <p>I’m showing an example here with expression sets and matrices, but in NLP the data are often input in the form of lists, in graphical analyses as matrices, in genomic analyses as GRanges lists, etc. etc. etc. One option would be to rewrite all infrastructure in your area of interest to accept tidy data formats but that would be going against conventions of a community and would ultimately cost you a lot of work when most of that work has already been done for you.</p> <p>The final point, which I won’t discuss here is that data are often usefully represented in a non-tidy way. Examples include the aforementioned <a href="http://kasperdanielhansen.github.io/genbioconductor/html/GenomicRanges_GRanges.html">GRanges list</a> which consists of (potentially) ragged arrays of intervals and quantitative measurements about them. You could <strong>force</strong> these data to be tidy by the definition above, but again most of the infrastructure is built around a different format that is much more intuitive for that type of data. Similarly data from other applications may be more suited to application specific formats.</p> <p>In summary, tidy data is a useful conceptual idea and is often the right way to go for general, small data sets, but may not be appropriate for all problems. Here are some examples of data formats (biased toward my area, but there are others) that have been widely adopted, have a ton of useful software, but don’t meet the tidy data definition above. I will define these as “processed data” as opposed to “tidy data”.</p> <ul> <li><a href="http://bioconductor.org/packages/3.3/bioc/vignettes/Biobase/inst/doc/ExpressionSetIntroduction.pdf">Expression sets</a> for expression data</li> <li><a href="http://kasperdanielhansen.github.io/genbioconductor/html/SummarizedExperiment.html">Summarized experiments</a> for a variety of genomic experiments</li> <li><a href="http://kasperdanielhansen.github.io/genbioconductor/html/GenomicRanges_GRanges.html">Granges Lists</a> for genomic intervals</li> <li><a href="https://cran.r-project.org/web/packages/tm/tm.pdf">Corpus</a> objects for corpora of texts.</li> <li><a href="http://igraph.org/r/doc/">igraph objects</a> for graphs</li> </ul> <p>I’m sure there are a ton more I’m missing and would be happy to get some suggestions on Twitter too.</p> <p> </p> When it comes to science - its the economy stupid. 2016-02-16T14:57:14+00:00 http://simplystats.github.io/2016/02/16/when-it-comes-to-science-its-the-economy-stupid <p>I read a lot of articles about what is going wrong with science:</p> <ul> <li><a href="http://www.economist.com/news/briefing/21588057-scientists-think-science-self-correcting-alarming-degree-it-not-trouble">The reproducibility/replicability crisis</a></li> <li><a href="http://www.theatlantic.com/business/archive/2013/02/the-phd-bust-americas-awful-market-for-young-scientists-in-7-charts/273339/">Lack of jobs for PhDs</a></li> <li><a href="https://theresearchwhisperer.wordpress.com/2013/11/19/academic-scattering/">The pressure on the families (or potential families) of scientists</a></li> <li><a href="http://quillette.com/2016/02/15/the-unbearable-asymmetry-of-bullshit/?utm_content=buffer235f2&amp;utm_medium=social&amp;utm_source=twitter.com&amp;utm_campaign=buffer">Hype around specific papers and a more general abundance of BS</a></li> <li><a href="http://www.michaeleisen.org/blog/?p=1179">Consortia and their potential evils</a></li> <li><a href="http://www.vox.com/2015/12/7/9865086/peer-review-science-problems">Peer review not working well</a></li> <li><a href="http://www.nejm.org/doi/full/10.1056/NEJMe1516564">Research parasites</a></li> <li><a href="http://gmwatch.org/news/latest-news/16691-public-science-is-broken-says-professor-who-helped-expose-water-pollution-crisis">Not enough room for applications/public good</a></li> <li><a href="http://www.statnews.com/2016/02/10/press-releases-stink/?s_campaign=stat:rss">Press releases that do evil</a></li> <li><a href="https://twitter.com/Richvn/status/697725899404349440">Scientists don’t release enough data</a></li> </ul> <p>These articles always point to the “incentives” in science and how they don’t align with how we’d like scientists to work. These discussions often frustrate me because they almost always boil down to asking scientists (especially and often junior scientists) to make some kind of change for public good without any guarantee that they are going to be ok. I’ve seen an acceleration/accumulation of people who are focusing on these issues, I think largely because it is now possible to make a very nice career by pointing out how other people are doing science wrong.</p> <p>The issue I have is that the people who propose unilateral moves seem to care less that science is both (a) a calling and (b) a career for most people. I do science because I love it. I do science because I want to discover new things about the world. It is a direct extension of the wonder and excitement I had about the world when I was a little kid. But science is also a career for me. It matters if I get my next grant, if I get my next paper. Why? Because I want to be able to support myself and my family.</p> <p>The issue with incentives is that talking about them costs nothing, but actually changing them is expensive. Right now our system, broadly defined, rewards (a) productivity - lots of papers, (b) cleverness - coming up with an idea first, and (c) measures of prestige - journal titles, job titles, etc. This is because there are tons of people going for a relatively small amount of grant money. More importantly, that money is decided on by processes that are both peer reviewed and political.</p> <p>Suppose that you wanted to change those incentives to something else. Here is a small list of things I would like:</p> <ul> <li>People can have stable careers and live in a variety of places without massive two body problems</li> <li>Scientists shouldn’t have to move every couple of years 2-3 times right at the beginning of their career</li> <li>We should distribute our money among the <a href="http://simplystatistics.org/2015/12/01/thinking-like-a-statistician-fund-more-investigator-initiated-grants/">largest number of scientists possible </a></li> <li>Incentivizing long term thinking</li> <li>Incentivizing objective peer review</li> <li>Incentivizing openness and sharing</li> </ul> <div> The key problem isn't publishing, or code, or reproducibility, or even data analysis. </div> <div> </div> <div> <b>The key problem is that the fundamental model by which we fund science is completely broken. </b> </div> <div> </div> <div> The model now is that you have to come up with an <span class="lG">idea</span> every couple of years then "sell" it to funders, your peers, etc. This is the source of the following problems: </div> <div> </div> <ul> <li>An incentive to publish only positive results so your <span class="lG">ideas</span> look good</li> <li>An incentive to be closed so people don’t discover flaws in your analysis</li> <li> An incentive to publish in specific “<span class="lG">big</span> name” journals that skews the results (again mostly in the positive direction)</li> <li> Pressure to publish quickly which leads to cutting corners</li> <li>Pressure to stay in a single area and make incremental changes so you know things will work.</li> </ul> <div> If we really want to have any measurable impact on science we need to solve the funding model. The solution is actually pretty simple. We need to give out 20+ year grants to people who meet minimum qualifications. These grants would cover their own salary plus one or two people and the minimum necessary equipment. </div> <div> </div> <div> The criteria for getting or renewing these grants should not be things like Nature papers or number of citations. It has to be designed to incentivize the things that we want to (mine are listed above). So if I was going to define the criteria for meeting the standards people would have to be: </div> <div> </div> <ul> <li>Working on a scientific problem and trained as a scientist</li> <li>Publishing all results immediately online as preprints/free code</li> <li>Responding to queries about their data/code</li> <li>Agreeing to peer review a number of papers per year</li> </ul> <p>More importantly these grants should be given out for a very long term (20+ years) and not be tied to a specific institution. This would allow people to have flexible careers and to target bigger picture problems. We saw the benefits of people working on problems they weren’t originally funded to work on with <a href="http://www.wired.com/2016/02/zika-research-utmb/">research on the Zika virus.</a></p> <p>These grants need to be awarded using a rigorous peer review system just like the NIH, HHMI, and other organizations use to ensure we are identifying scientists with potential early in their careers and letting them flourish. But they’d be given out in a different matter. I’m very confident in a peer review to detect the difference between psuedo-science and real science, or complete hype and realistic improvement. But I’m much less confident in the ability of peer review to accurately distinguish “important” from “not important” research. So I think we should <a href="http://www.wsj.com/articles/SB10001424052702303532704579477530153771424">consider seriously the lottery</a> for these grants.</p> <p>Each year all eligible scientists who meet some minimum entry requirements submit proposals for what they’d like to do scientifically. Each year those proposals are reviewed to make sure they meet the very minimum bar (are they scientific? do they have relevant training at all?). Among all the (very large) class of people who pass that bar we hold a lottery. We take the number of research dollars and divide it up to give the maximum number of these grants possible. These grants might be pretty small - just enough to fund the person’s salary and maybe one or two students/postdocs. To make this works for labs that required equipment there would have to be cooperative arrangements between multiple independent indviduals to fund/sustain equipment they needed. Renewal of these grants would happen as long as you were posting your code/data online, you were meeting peer review requirements, and responding to inquires about your work.</p> <p>One thing we’d do to fund this model is eliminate/reduce large-scale projects and super well funded labs. Instead of having 30 postdocs in a well funded lab, you’d have some fraction of those people funded as independent investigators right from the get-go. If we wanted to run a massive large scale program that would be out of a very specific pot of money that would have to be saved up and spent, completely outside of the pot of money for investigator-initiated grants. That would reduce the hierarchy in the system, reduce pressure that leads to bad incentive, and give us the best chance to fund creative, long term thinking science.</p> <p>Regardless of whether you like my proposal or not, I hope that people will start focusing on how to change the incentives, even when that means doing something big or potentially costly.</p> <p> </p> <p> </p> Not So Standard Deviations Episode 9 - Spreadsheet Drama 2016-02-12T11:24:04+00:00 http://simplystats.github.io/2016/02/12/not-so-standard-deviations-episode-9-spreadsheet-drama <p>For this episode, special guest Jenny Bryan (@jennybryan) joins us from the University of British Columbia! Jenny, Hilary, and I talk about spreadsheets and why some people love them and some people despise them. We also discuss blogging as part of scientific discourse.</p> <p><a href="https://itunes.apple.com/us/podcast/not-so-standard-deviations/id1040614570">Subscribe to the podcast on iTunes</a>.</p> <p>Show notes:</p> <ul> <li><a href="http://stat545-ubc.github.io/">Jenny’s Stat 545</a></li> <li><a href="http://goo.gl/VvFyXz">Coding is not the new literacy</a></li> <li><a href="https://goo.gl/mC0Qz9">Goldman Sachs spreadsheet error</a></li> <li><a href="https://goo.gl/hNloVr">Jingmai O’Connor episode</a></li> <li><a href="http://goo.gl/IYDwn1">De-weaponizing reproducibility</a></li> <li><a href="https://goo.gl/n02EGP">Vintage Space</a></li> <li><a href="https://goo.gl/H3YgV6">Tabby Cats</a></li> </ul> <p><a href="https://soundcloud.com/nssd-podcast/episode-9-spreadsheet-drama">Download the audio for this episode</a>.</p> <iframe width="100%" height="166" scrolling="no" frameborder="no" src="https://w.soundcloud.com/player/?url=https%3A//api.soundcloud.com/tracks/246296744&amp;color=ff5500&amp;auto_play=false&amp;hide_related=false&amp;show_comments=true&amp;show_user=true&amp;show_reposts=false"></iframe> Why I don't use ggplot2 2016-02-11T13:25:38+00:00 http://simplystats.github.io/2016/02/11/why-i-dont-use-ggplot2 <p>Some of my colleagues think of me as super data-sciencey compared to other academic statisticians. But one place I lose tons of street cred in the data science community is when I talk about ggplot2. For the 3 data type people on the planet who still don’t know what that is, <a href="https://cran.r-project.org/web/packages/ggplot2/index.html">ggplot2</a> is an R package/phenomenon for data visualization. It was created by Hadley Wickham, who is (in my opinion) perhaps the most important statistician/data scientist on the planet. It is one of the best maintained, most important, and really well done R packages. Hadley also supports R software like few other people on the planet.</p> <p>But I don’t use ggplot2 and I get nervous when other people do.</p> <p>I get no end of grief for this from <a href="https://soundcloud.com/nssd-podcast/episode-9-spreadsheet-drama">Hilary and Roger</a> and especially from <a href="https://twitter.com/drob/status/625682366913228800">drob</a>, among many others. So I thought I would explain why and defend myself from the internet hordes. To understand why I don’t use it, you have to understand the three cases where I use data visualization.</p> <ol> <li>When creating exploratory graphics - graphs that are fast, not to be shown to anyone else and help me to explore a data set</li> <li>When creating expository graphs - graphs that i want to put into a publication that have to be very carefully made.</li> <li>When grading student data analyses.</li> </ol> <p>Let’s consider each case.</p> <p><strong>Exploratory graphs</strong></p> <p>Exploratory graphs don’t have to be pretty. I’m going to be the only one who looks at 99% of them. But I have to be able to make them <em>quickly</em> and I have to be able to make a <em>broad range of plots</em> <em>with minimal code</em>. There are a large number of types of graphs, including things like heatmaps, that don’t neatly fit into ggplot2 code and therefore make it challenging to make those graphs. The flexibility of base R comes at a price, but it means you can make all sorts of things you need to without struggling against the system. Which is a huge advantage for data analysts. There are some graphs (<a href="http://rafalab.dfci.harvard.edu/images/frontb300.png">like this one</a>) that are pretty straightforward in base, but require quite a bit of work in ggplot2. In many cases qplot can be used sort of interchangably with plot, but then you really don’t get any of the advantages of the ggplot2 framework.</p> <p><strong>Expository graphs</strong></p> <p>When making graphs that are production ready or fit for publication, you can do this with any system. You can do it with ggplot2, with lattice, with base R graphics. But regardless of which system you use it will require about an equal amount of code to make a graph ready for publication. One perfect example of this is the <a href="http://motioninsocial.com/tufte/">comparison of different plotting systems</a> for creating Tufte-like graphs. To create this minimal barchart:</p> <p><img class="aligncenter" src="data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAAkAAAAGwCAYAAABB4NqyAAAD8GlDQ1BJQ0MgUHJvZmlsZQAAOI2NVd1v21QUP4lvXKQWP6Cxjg4Vi69VU1u5GxqtxgZJk6XpQhq5zdgqpMl1bhpT1za2021Vn/YCbwz4A4CyBx6QeEIaDMT2su0BtElTQRXVJKQ9dNpAaJP2gqpwrq9Tu13GuJGvfznndz7v0TVAx1ea45hJGWDe8l01n5GPn5iWO1YhCc9BJ/RAp6Z7TrpcLgIuxoVH1sNfIcHeNwfa6/9zdVappwMknkJsVz19HvFpgJSpO64PIN5G+fAp30Hc8TziHS4miFhheJbjLMMzHB8POFPqKGKWi6TXtSriJcT9MzH5bAzzHIK1I08t6hq6zHpRdu2aYdJYuk9Q/881bzZa8Xrx6fLmJo/iu4/VXnfH1BB/rmu5ScQvI77m+BkmfxXxvcZcJY14L0DymZp7pML5yTcW61PvIN6JuGr4halQvmjNlCa4bXJ5zj6qhpxrujeKPYMXEd+q00KR5yNAlWZzrF+Ie+uNsdC/MO4tTOZafhbroyXuR3Df08bLiHsQf+ja6gTPWVimZl7l/oUrjl8OcxDWLbNU5D6JRL2gxkDu16fGuC054OMhclsyXTOOFEL+kmMGs4i5kfNuQ62EnBuam8tzP+Q+tSqhz9SuqpZlvR1EfBiOJTSgYMMM7jpYsAEyqJCHDL4dcFFTAwNMlFDUUpQYiadhDmXteeWAw3HEmA2s15k1RmnP4RHuhBybdBOF7MfnICmSQ2SYjIBM3iRvkcMki9IRcnDTthyLz2Ld2fTzPjTQK+Mdg8y5nkZfFO+se9LQr3/09xZr+5GcaSufeAfAww60mAPx+q8u/bAr8rFCLrx7s+vqEkw8qb+p26n11Aruq6m1iJH6PbWGv1VIY25mkNE8PkaQhxfLIF7DZXx80HD/A3l2jLclYs061xNpWCfoB6WHJTjbH0mV35Q/lRXlC+W8cndbl9t2SfhU+Fb4UfhO+F74GWThknBZ+Em4InwjXIyd1ePnY/Psg3pb1TJNu15TMKWMtFt6ScpKL0ivSMXIn9QtDUlj0h7U7N48t3i8eC0GnMC91dX2sTivgloDTgUVeEGHLTizbf5Da9JLhkhh29QOs1luMcScmBXTIIt7xRFxSBxnuJWfuAd1I7jntkyd/pgKaIwVr3MgmDo2q8x6IdB5QH162mcX7ajtnHGN2bov71OU1+U0fqqoXLD0wX5ZM005UHmySz3qLtDqILDvIL+iH6jB9y2x83ok898GOPQX3lk3Itl0A+BrD6D7tUjWh3fis58BXDigN9yF8M5PJH4B8Gr79/F/XRm8m241mw/wvur4BGDj42bzn+Vmc+NL9L8GcMn8F1kAcXgSteGGAABAAElEQVR4Ae3dBZgcRd6A8eLC4RDcLbg7h7sGC+5uwfXQwzncXQ734MH9cHcPENyPAIEgH/rlrbtaOsPs7GZ3trd75q3n2Z2ZnpaqX/VM/6equnuE34emYFJAAQUUUEABBZpI4C9NVFaLqoACCiiggAIKRAEDIHcEBRRQQAEFFGg6AQOgpqtyC6yAAgoooIACBkDuAwoooIACCijQdAIGQE1X5RZYAQUUUEABBQyA3AcUUEABBRRQoOkEDICarsotsAIKKKCAAgoYALkPKKCAAgoooEDTCRgANV2VW2AFFFBAAQUUMAByH1BAAQUUUECBphMwAGq6KrfACiiggAIKKGAA5D6ggAIKKKCAAk0nYADUdFVugRVQQAEFFFDAAMh9QAEFFFBAAQWaTsAAqOmq3AIroIACCiiggAGQ+4ACCiiggAIKNJ2AAVDTVbkFVkABBRRQQAEDIPcBBRRQQAEFFGg6AQOgpqtyC6yAAgoooIACBkDuAwoooIACCijQdAIGQE1X5RZYAQUUUEABBQyA3AcUUEABBRRQoOkEDICarsotsAIKKKCAAgoYALkPKKCAAgoooEDTCRgANV2VW2AFFFBAAQUUMAByH1BAAQUUUECBphMwAGq6KrfACiiggAIKKGAA5D6ggAIKKKCAAk0nYADUdFVugRVQQAEFFFDAAMh9QAEFFFBAAQWaTsAAqOmq3AIroIACCiigwIgSKKCAAgp0nUD//v07vPI+ffp0eFkXVECB2gK2ANX28V0FFFBAAQUUaEABA6AGrFSLpIACCiiggAK1BQyAavv4rgIKKKCAAgo0oIABUANWqkVSQAEFFFBAgdoCBkC1fXxXAQUUUEABBRpQwACoASvVIimggAIKKKBAbQEDoNo+vquAAgoooIACDShgANSAlWqRFFBAAQUUUKC2gAFQbR/fVUABBRRQQIEGFDAAasBKtUgKKKCAAgooUFvAAKi2j+8qoIACCiigQAMKGAA1YKVaJAUUUEABBRSoLWAAVNvHdxVQQAEFFFCgAQUMgBqwUi2SAgoooIACCtQWMACq7eO7CiiggAIKKNCAAgZADVipFkkBBRRQQAEFagsYANX28V0FFFBAAQUUaEABA6AGrFSLpIACCiiggAK1BQyAavv4rgIKKKCAAgo0oIABUANWaq0iffTRR+Hll1+uNYvvKaCAAgoo0PACBkANX8XDFvDMM88MRxxxxLATfaWAAgoooECTCRgANVGF/9///V+46667wrXXXhs+/PDDJiq5RVVAAQUUUGBYgR6HDE3DTvJVowpcfvnlYa211goPPPBA+O6778Kyyy7bUtQPPvggXHrppeGZZ54JU0wxRRhjjDHCLbfcEuedY445Qo8ePcJjjz0WbrrppjD++OOHccYZJy77/fffhxtvvDFMPvnk4cILL4zLjjzyyOHxxx8Pd9xxR/jkk0/CDDPM0LKd3377Ldx3333htttuC7/88kvczogjjhjXT1B21VVXhcGDB4epp546jDDCCC3LZZ88+uij4c477wxvvvlmmG666QLLk8jvPffcE3766acw5ZRTtizyyiuvhOuvvz68+uqrMX+jjjpqfK9a3il3tXKyQGvbbdmQTxSoIjBgwIAqU9s3aaaZZmrfjM6lgALDLWAL0HCTlXcBAo+ll1467LDDDuHcc88NBAApEfS888474eyzzw4TTzxxnEwAMsooo4S//vWv4aSTTgovvvhiWGCBBcJqq60Wg51vvvkmbLPNNmGrrbYKZ511Vujfv38cX7TTTjuFN954I2y55Zbh+OOPj4EQK/z999/DUkstFSabbLKw6KKLxr999tknDBw4MDz77LPhmGOOCUsssUS45JJLwqqrrpqyNszjrbfeGghoNtlkk3D//feHr776Kr6/+eabh7/85S+hb9++Ydttt435443zzz8/5m2zzTaLgducc84Zl28t79XKyXpa2y7vmRRQQAEFyicwwtCD0u/ly7Y5Hl4BWmSefvrpQHDy5ZdfxhabE044IWy//fYtq/r8889jC8lTTz0VaPU54IADwsEHHxw+++yzsNJKK4X9998/znvNNdeE999/PzAfQcgKK6wQBg0aFFtzmOHkk08Offr0Cb169QobbbRRbAFiPbQ8EagQaJFmm222sOeee4YtttgiLLnkkmG99daLLUu0RqXAiHVk02GHHRaef/75cNlllwWCGFpzaA264oorYksU89JSQ/DGr2cCO1p+UovQyiuvHFueWKYy77RAtVbOatvt2bNnNms+H04BAuaOJvavsqRmKWdZ6sN8KpAE/tt3kF752LACDH6mhWTnnXeOZZxooonCKaecErbbbruWrqYJJ5wwrLPOOrF16Oijj47BBa0/Tz75ZGwVIkghpUee00LEH11HKbGNK6+8MgYYdGcNGTIkvjXppJOGd999N/7RxTX66KOHZZZZJvz666+xy4wutNQ9RQvP2GOPnVbZ8rj11luH3r17x6CKVidaimjZomUnpYUXXjg+TV19E0wwQXorLLbYYrGliQmVea9VzmrbbVmpTxRQQAEFSidgF1jpqmz4M8w4HMbsXHTRReG0006Lf4wHYmzC7bffPswKd9xxx9i6QjCy/vrrx/doRXn99dcDgQTdY/yNN954cazOMAv/78WGG24Yu7to2SHoSWn66aePwdXGG28c7r777sDwM1pmGF80ySSTxPFCaf08fvvtt2nRlkdaXZ544onYarTmmmvG/NPaw+DubCLoSgEULV8pkZ80PU1Lj7XKWW27aTkfFVBAAQXKJ2AAVL46G+4cn3rqqYGgJJtoJaGb67jjjstODgsttFCYZpppYoAy7bTTxvfmm2++wODlPfbYI3afMYD6jDPOCKONNloMdGjBSenrr78OV199dex6Ytrbb78dW3h4znJcg+jhhx8Oyy23XGzJYTpp9dVXD7vttlt44YUX4joZtPzpp5/+983M/9NPPz223Bx++OFhr732iuOSVlxxxRgU0TVHjy4BEgOwZ5111vh37733tqyBVp4U2DFvNu+1ylltuy0r9YkCCiigQOkEDIBKV2XDl2HGyhAAMS4mO9yLQIOWF8bBHHroocOslFagTTfdtGUaLSyM6znvvPPiuB4GFC+//PKxhYZWJQIbBg//+OOPsXWFYIaAi3E8dJf169cvBlQENAxwHnfcccNUU00V5plnnnDsscfG7ey+++5xPXPNNVdYfPHF4wDt2WefvSUP6Qnjkej2YlDyzz//HAdgMx7kH//4RxxvNNZYY4WLL744rLHGGvHsMIIi5k2tX1wK4MADD4yBXGXeWysn3YDVtpvy5KMCCiigQPkEHARdvjrr8hzTKkIwwJihbCLgYEwPp8G3lQg0OB2exOnunKr+3HPPxQHJiyyySAxAWBfBChdm5MwwEkES3V+tJVqi+Mt2caV5CcDSqfVpGo8EfrREjTnmmIFxTm2lauWstd221uf71QWaZXBws5Szei07VYHiCgx7hCtuPs1ZjgK0DFUGP2yelpD2BD/Mm4Ifnqfr9DCwmbFDUw8dAE3rDy09tARlA57sc5atTOSL9VUbx1M5GDstSzBHd157gh+WqVbOWttN2/FRAQUUUKA8Ap4FVp66Kn1OGfRM9xN/888/f+wK45pEBFwmBRRQQAEF8hQwAMpTu8m3tfbaawf+st1jTU5i8RVQQAEFuknALrBugm/mzWa7x5rZwbIroIACCnSfgAFQ99m7ZQUUUEABBRToJgEDoG6Cd7MKKKCAAgoo0H0CBkDdZ++WFVBAAQUUUKCbBAyAugnezSqggAIKKKBA9wkYAHWfvVtWQAEFFFBAgW4SMADqJvjObPbcc88N3Om8zKloZfjpp5/CnnvuGb766qsys5p3BRRQQIF2ChgAtROqu2fjNg8pffzxx2HQoEHpZWke8yjDDz/80CEPbn/xzjvvxPuZpRV0dF1peR8VUEABBYorYABU3LppydmJJ54YXnrppZbXXFF5zTXXbHldhid5lIH7gPXt27dDHKOPPnrgDvSTTDJJXL4z6+pQBlxIAQUUUCBXgR5DD6aH5LrFAmzslVdeiQe7V199NUwxxRRh1FFHjbmiVeWGG24IM844Y7j22mvDU089Feacc854Y9BstmktuPfee+OkAQMGhOuuuy7e04qbbab02GOPhZtuuineO2ucccZJk+ONPO+7775w2223tdy4k3tbcTuI1157LXZt8f5ss80W70nF3dx32mmneB+rkUYaKUw55ZRh4MCB4eWXX443EOXu6g899FC84efkk08eyM+VV14Z52e7dO1wYH/66afDdNNNF1hHtfThhx+Gq666Kt7slHt1cf8s0vPPPx/v+k6rEzd15KalBAuU7/bbb4/54TWJm44y7Y477ghffPFFmH766eN6apWBe4HVWrY91izPfNxr7M477wwTTDBBmGiiiaLpjTfeGHC58MILY11zEcbHH3885vGTTz4JM8wwQ9w+Nzu9//774zKjjTban9ZF3T766KNx/W+++Wa0TPc4iyvw33ALsK92NM0000wdXTT35ZqlnLnDukEFOinQdC1A559/fjjrrLPCZpttFoMTAhwCIgKFQw89NGy55ZbhhBNOCHR/HHbYYeGMM874EzEH2T59+oR//vOfcdl///vfYb311muZ76STTgovvvhiWGCBBcJqq60WD768SYCw1FJLxSBi0UUXDfzts88+MaC555574hgUbhXB3dh33nnnuL5lllkmBkoLLbRQmHnmmePd1Nk2wQoHYIK1XXfdNd5XiwUIOghOevXqFcu0/fbbx+1xM0+CPe62XpmeffbZcMwxx4QlllgiEFCtuuqqcRbKxTRi5GeeeSbwmpuYHnvsseGtt94KV199ddh0003jvAQQq6++esz7NttsEwOMFVZYIb6uVQYWrrVsW9Zx40P/EbBxc9UxxhgjsD1syMdWW20V65vgjaCRYPKNN96I9Xz88cfHfLIOAt5ll1021kXlusYdd9xw6623xromyCJQcqxQkvdRAQUUKKdAUwVAHLQIFvbee+/Y6rPSSiuFOeaYI+yxxx6xZWSLLbaIQQODYXlOMEIwUZlWWWWVGFQsvfTSYeuttw4HHHBAeOKJJ2KAQ0sKrQ09e/aMQQIByplnnhlX8eCDD4b3338/8Ot1rrnmCrPMMksgsOE1rVBsk0SrCK0UpNQlwyMHYpZZZJFF4nv8Y3kO3JdeemmcRiBFEEf617/+FQMQtskdzmnZSPPFGf73j/Kz3ueeey7MOuussXWK8TAEawRdBEGbb755LBetJiuuuGIMfAiECJ5ItDp99tlnMXjiruxHHnlkoCXroosuarMMtZatZf2/7McHAh7uVJ9aybjzOwEQ9x0j6KFVCifKs+SSS8YAidYs6o207rrrxjrjeeW6KA8BIK1bBKcHH3xwtGRekwIKKKBAOQWaKgCiVea7776LXSSpuhZbbLHw5JNPxpep2yc90u2RHbibluGRedJ8tDrQgkRLBuuaeOKJ40GWAy0tSDfffHNcdNJJJw3vvvtu/GMCXUe0VpAIagh8Tj/99Nh6RD6zKW2LadnnvN5xxx3DBRdcEFuKaKVJ63z44YfDvPPO25IXDuIEBdnEAZ1gi6CG/NLCQXdXCrxoOeKPRCBAntP2KXcajE33EAFIStjRupZsmZ6Wq3ze1rIsl5bNWqdttfZIfvljmZRoWXvkkUdiMDd48OAwZMiQ9FbLNlomZJ4Q6NL1SJcZXaMEuCYFFFBAgfIKNFUANPbYY8eaYjxMSgQlaXqa1plHuplef/31GGQRCPE33njjxbE1dE9x+vfGG28c7r777ti1xJgeEt1JRx11VAxmaNWpTCkAqJzOa1pJaLUgCKLlI81LXjhop3zwyFiZbGLsEcEOLTvZ+b799tvsbC3PUzDUMuF/TxhvRICVTdhmA4WUr+w8PG/PspXLtPa6tW2k+TfccMPYUkdrG/mrlbLrohy0FrEcA9BpDTIpoIACCpRXoKkCILp3+EsDmKk2WijWX3/9WIO0hmRTZbCQfY/WnpQ4Y4jEGJ/55psvtgTRrfTll1/GFidageh+olWHcSi0zCy33HKhd+/eaRWxu4oDMgfdt99+O7bmpDdpxSAgSfkjX9m8EcRst912Ya+99gobbbRRWiyOyaEbjOCK/LLeagduxu7stttu4YUXXohlYNB0GitEmfgjsQ66lFI+0nTeW2uttcJ//vOfOEaJ1yQCTbqWSLXK0NayrVnHFWf+0Y2YWnTII/lLeWW2r7/+OloQGJLwyL6PaarLynXRMkcZDj/88OhMa6JJAQUUUKC8Ak0VANFKcs0118QBrYxNOe200+IB/cADD4wHviuuuCLWJGctMZaHAbi0atCik01chPC9996LY2U424l1kS6//PIYwJx88snhvPPOiwORGWy9/PLLxzE4BBUMMmYsD91daUAxy26wwQbxPQIYWhsIeA466CDeiq0ODDZm4DPdL3ThEEQRsKREFw2Do2kBSmnBBReMwR0BHl03p5xySkuwl+bhcffdd4/BGeOSGEj8/fffh9lnnz22eHCgv+uuu8Lnn38eW5jo6mMgOV1flIUAgjFPc889dywzg64Zb0NAxqByAkISLSetlaHWsrWs44oz/+aff/5oyyBufKgXgk4GpdOVSUsfwR6tQAw+p8uvX79+sZ4pA+aMRyJwza6LSxCk8U0MhiZQYnC1SQEFFFCgvAIjDP2V/N+f9+Utw3DnnCLz659xKtmAYbhXVGMBDpKMMcmOi2GQMafeM96HgyzvX3zxxeGII46Ig6ppXeE0bRItEQRsKXEgT6ebp2mVj2yTwc6ViZYPWi/4q5UI0OgG60yiDHS79Rp6Flq6vEBaX1tlqLVsWkdbjxhQv62d7s/ytZyz68+ui1Yo/mhhqmeXaXZ7zfacM/M6mgj2y5KapZxlqQ/zqUAS+OMIm6YM5yMtKrSCcMBmbEv2mjfDuarcZqebadppp+3S7RGIZIMfNsYAY069nnrodXb4o/WEU6pT0JGCH+bNBj+8biv4YZ5qwQ/T23vATvlgmY4mysAZZdVSW2WotWy19VWb1ppBdt5aztn5sutKg8Hba5ldj88VUEABBYon0KkAiOup0EXBheYIfMoQ/HRnFRwy9Ho6dLfxl7pYdthhh3jxve7Ml9tWQAEFFFCg2QQ6FQCdeuqpYZpppolnxdCiYaotwHWF+Mt2wdRewncVUEABBRRQoCsEOjUGiLObOKOKwcJnn3127OLpiky6TgUUaDwBxqZ1NJWpK7JZytnRunQ5BbpLoFMBUMo03WCcXcNp0JVjV9I8PiqggAIKKKCAAkUR6FQXWCoEt0rg+jrcaoIbUZoUUECBtgSyVwlva97K9//2t79VTirs62YpZ2ErwIwp0IpAXQIgTg8ea6yxDH5aQXayAgr8WYCrjzdDapZyNkNdWsbGEuhwAMTNMun24gJ3nM7NRfZMCiiggAIKKKBAGQQ6HABxoTuuSMxF57igoEkBBRRQQAEFFCiLQIcDIAqYvct2WQpsPhVQQAEFFFBAgaa6F1hXVDe3t+CeYvvtt1/LjTi7YjuV62TcFZcg4H5b2bvbp/m4hxk3Qn3sscdabmaa3qPljjxz77IffvghTR7mkXkuvfTSkPdNP88999x4cc1hMlPlxfPPPx+vql3lLScpoIACCijQpoABUJtErc/ADTKPPvrosPPOOwduEppnsMB2X3nllXj7i8UWWyzcc889LRnt27dv+Oabb+INOwmACM5SOvPMM+PNS7fddtswySSTxBu1ZoMgysTNUzlzhVubzDHHHGnRLnvkRqUpffzxx/FGq+l1a49cW+X9999v7e26TM+61GWFrkQBBRRQoDACdbkOUGFKk3NGuDs7dwenpSTPRJBACxC3ICFtvvnm8fk///nPeCkC7jTPPHRR8rjooovGm79yl3fu93XzzTcHLl1AmmmmmQK349hll13i/dyYzt3rDz744Ph+V/878cQTAwEctwYpUuLedltuuWW8432R8tVIeWmWm4Q2Szkbad+0LM0h0GPo/akOaY6i1reUd999dwx8aIXgRpkMBOdu6zfeeGMMRi688MIwxRRTxCCElprrr78+3gmeaeku6Z9//nnsvmJZginuPj7ZZJPFlo1+/frFDE866aR/yjjzc9mBlE4//fTYysPNV1k3XV/cbX7ZZZcNl112WZh++unDkksuGbfP7Uuo8nQl3SeeeCI8++yzYdNNN433KLvvvvvCDTfcEIOhHj16pE0M80hwQPm5Weijjz4arrvuujDRRBOF8cYbL843aNCg8PDDDwe++Lk/XLop7HvvvRfouuKu8Lz32muvhZ122ilMOOGE8e7tU045ZbyT/MsvvxwI4lK65ZZbYgsXrWzMQ6KFi0COG6+2lR9cH3/88XDHHXcETkmeYYYZ4jqoF1rtuHP8JZdcEtdD/TA/N67lCudc14qyYU5Zmfbmm2+G6aabzot+RsWO/xswYECHFyZwL0tqlnKWpT7MpwJJwC6wJDGcj1yIjRYYupFWWWWV0LNnz7DNNtvEbqezzjorHuA5kJ9//vmB15tttlkMBOacc87YdfXhhx/G+4IRABCkELAQpNBdRXDw7rvvxlYagqTW0gcffBAvQ0DgRRCREtsj0FlqqaVidxI3XyVx3zYCJMYtpUTAxRW8SVdeeWUMrMjTMsssE8vy6aefpllbHgnmVlxxxbDvvvsGghqCgvnmmy8MHjw4BjfzzDNPWHDBBWNX2iKLLBIDC4IN8kPLEtu56KKLYssPwctCCy0UZp555hig9enTJwaDaWO0bhFg0q1Htx2BJVZse/vtt4+z1coPM1AebtxLi87xxx8fAyEu40CdsJ6rr7465nvppZeO5RlhhBHC4osvHoNXHMYdd9zY0kcZCIzuv//+2NKW8uijAgoooED5BAyAOlhnBDyjjTZaGHnkkWMrAS0FBEDc6JQDLq0N8847b9h1113D3nvvHQOPlVZaKY6p2WOPPWLwxHWU6JJinr///e8x6OBgz5iiI488MrY80ErSWqL1iTE6tOKwjZToxlpjjTViSwUtUl988UV8i1aM1VZbLQZkjKF566234sGc4IMgiKDgqKOOivd14yD/6quvhnPOOSettuVx3XXXjQEfAQRBCOOPuAXKbbfdFq8JxXRaqGgRotWG4IMrhffu3Tu2CNFVxyDr1MpDEEmQQWsOAVNKBCbkEzdao2jZ4pc/86611lpptlArP8xE+QguySMBH15cxmG55ZYLtLBRH/vss0+sC4JD5qPVipYhWpwIMBlUfvvtt8fyEcRR9yYFFFBAgfIKGADVse44UPKXLg9A9wrdPdnbgzDeJV0an5YG/lLioEsAlBIBC60drSWChd133z12WdGqwrZI66yzThzTQ3cT61h55ZVbVkGLFHngoM/ZYwQYjBGi24eUuocIONZcc82QuuJaVvC/J+SbLjAS8xJMEDQR+BDA0S1HSxbzDRkyJM6HTeoOixP+9y9rkH1OdxwtZiktvPDCsbWI19n50utq+eE98kPAResRrVQpP6wjux7qLTsgm2VT2nrrrWP3HD5PPfVUDADTez4qoIACCpRP4I+jbfnyXvgcp3E22dPUaXFI0ysLkA1+Kt+r9ZruJoIQWqPoWqO1gi46gg3GFtGqQWsPafTRRw8HHHBAuOCCC2KgxlW8ObinfL3wwgstm6Jlhi6q9iQCKVpXPvroo9jVRRBGy0x7ypQNQrLbYvpdd92VndQSvAwzscqLlB/e2nDDDeOlALhqebUxVVUWj5Oy+aLFj5Yj1kFgSGuQSQEFFFCgvAIGQJ2oO7q76OJJiatiE1CkRLcPf1yvJyVafzjNnMT82UTrQ3Z53q+ch/nZZrZliAHHnMlF1w0Hbbqz0v2H6Iai1YLBvdlE9xndPgRCBE4kXtOKkxKDo1Ne07TsI+UnkR9aRej64qw4bopLNxXjhziVPJWp0odlaRX69ttvW+ahJSq1RjHOiKDjmmuuiQ48p2uRlJ0vThj6r1p+CIToSkvlf/vtt1u2xZl02USwl7wZK5Vaisg/LVrk9fDDDw977bVXrpc8yObR5woooIAC9RHwLLAOOhLIcAo3ZwRxlhXdXCeccEI824jWgrnnnjuOIWHg77HHHhufP/TQQ/G0dMbZEKQcc8wx8UwoBt9y/R2WpwWH8SoMoOZihQQ0vE5njpFdWpQYOExLDy07HKgZQ0RrC11QjFuh64rAgm4kxv2k6/kQONEVRPfUxRdfHMfUJALWSR7pOmM5BmCTVwKrykTeOa2e0+wJnBizxPIEXKyfsUd0SREAsS5akxhPxLgiurUYg0NieQYm40cAwz3lGNxNNx1jmQhK6K477rjjYnDCc/J18sknx+CIs7EoW2v5IWihVYtB4Sw344wzxgHY5Iduw4EDB8bT/skX28ZwhRVWiPm59tprY0sPQSxngBEEUbcEUXSrOQ6ocq8YvtfNcnZUs5Rz+GrfuRXofgGvA5RDHdCqwEGT8TjZs7U6s2laJQgu0nijauuiJYbT0LOJ4IkWoexp9Nn3eU5gQwsLB/vWEuslQJh99tlj0EMXXErkjTITOPHIX62uMMYu0TXXWqJljECoVllr5Yf1Elylli7WVS2oq9w+BuSdwdC0FvFHsNlaF2bl8r6uLUAQ3tHE2YJlSc1SzrLUh/lUIAn8+ad9esfHugnQijPttNPWbX2siICjVkDAPJXBD9M4Xb2t1N6WDQKCagFdNhii7PzVSrWCH5ajFac9qbX8sGwKfnjenuCH+dKgap4TwPFn8IOGSQEFFCi/gGOAyl+HuZeACxAytoYWILqwujsVLT/d7eH2FVBAAQXaFrAFqG0j56gQYKwPY5ZIbbVCVSzaJS+Llp8uKaQrVUABBRSoq4ABUF05m2Nl1a7l050lL1p+utPCbSuggAIKtE/ALrD2OTmXAgoooIACCjSQgAFQA1WmRVFAAQUUUECB9gkYALXPybkUUEABBRRQoIEEDIAaqDItigIKKKCAAgq0T8AAqH1OzqWAAgoooIACDSTgWWA5Vib31uLqwq+99lrc6sQTTxwWWGCBcNttt8XpXGiPm5jONNNMYdCgQfE2DFy1eJ555mm5lUWO2e3yTXEF6DvvvDM89thj8VYXHd0gtwSZaKKJwuSTT96yCty4ESzXK5ptttnCkksu2fIeT1iG+uA9biabvVgj91HjnmhcMHHllVeOt7zg1iVcIXuqqaYaZj2+UEABBRQop4AtQDnVGzcd5aah3DrihhtuCFtttVVYZJFF4tWauZdV37594/2mCH5I3FOL2zdwIE738aqVVW4X0d2JW3MMT+LKzW+88Ua49NJLh2exlnm5FtFJJ50Ur27NzV1T4vYV3EiVW4/suOOO8YKN3NMsJay/+eabWAcEX/vtt196K9x9993h6KOPjneyJzjiGkPc2oN7lRE0vfjiiy3z+kQBBRRQoLwCBkA51B3BD607tDRwm4nDDjss3lfqwQcfjFsnKOLeRtyINJvee++9sP/++2cnVX3OTVlfeumlqu/lNZEWFwKL4UkEKLRudTTR6rPbbrvF24Jk10GAiQemtOxsu+224eCDDw4ESdwfjZugzj///LFOuNs9d4tPiYCKeuJ2HjPPPHO80CM3hyWtueaa8W7w3OXepIACCihQbgEDoC6uP7p5uJM5B9qUuBs69+QiMEqJgy53JH/88cfTpHineW5cmhLdYnfddVe8C/3rr78eJ1922WXxbuzcDuKRRx6J0wicOMhzV3buVp8Sd4K//PLLwxlnnDFMwMR6mU6LE91GLEfrTGuJVhPWwQ1eSXTrbbzxxrErj2Wr3R6jWt5bW//wTK92rzG6trJdVdzNnft60d3G/dG4cOIRRxwRN8ONKjfbbLOWTRKMnnDCCTFQoiuMVq1skLb55pvHALZlAZ8ooIACCpRSwACoi6vtuuuuC1NPPfWfbujJgZSghbElJO7Sznznn39+fP3www/H7pf4Yug/AikOxARKBFN0n9Hqsswyy8RHumposeCAve+++4YNNtggjou544474ioIhNZaa62wxBJLxJaMDTfcMJx++umxW+7QQw8NW265ZTzwszwtVAQ41RItJHQDMXZptdVWi8ESQcjiiy8eW0vIz7jjjjvMoq3lfZiZ6viCgOfll1+OLqyW1hzGWyXrs846K5x66qlhqaWWimOtDjzwwJat77nnnjGgo3z77LNPwC87Pgjjf/3rX/Gu8C0L+UQBBRRQoHQCBkBdXGW06nDwrUypRYjxLwyKnn766cMWW2wR+vXrF4OdK664Iqy33notizEOhZaKscYaK44PYjwR42cmmWSSOA+PBB7vvvtuoGuNda600kphueWWi+/vtddeYZVVVokDhZn373//e/xjkDDbZX0c/Hm+9tprx4HJLRv/35MPP/wwBjw9e/YMb731VphxxhnDmWeeGQcL06oy0kgjhSmnnPJPwV5rea9cf71eE5jR4kOAx9goghhacwheSASBa6yxRmwdu/HGG8MXX3zRsmnubs/4LLrn6Ep74YUXWt7jyRRTTBFbx1IL3DBv+kIBBRRQoDQCBkBdXFUfffRRGH300f+0FQY5c4YR3WAXXXRR7EIi+KC15JJLLoldMIxxSYnAZ+edd44HdbptaJUYMmRIerullYKDPGNf5pprrtgSNN1008V5Hn300TiQNy2w6KKLxi4vWnNSC0d65OBfbVD1k08+GYM5zqjij1YiWrHaSm3lva3lh/d9vCnvO++8E/bee+8Y4NBaxhl2pHXWWSfssssugbE9lJV6SAn7Dz74ILbIpaCRQDMlAisCWgJNkwIKKKBAeQUMgLq47qaddtqqY2LY7CabbBJeeeWVeHYRLUC0Liy//PLxoL3qqqsOkzMCKbpsOHivu+66cQBvdoYUvBBA0SpzzTXXBMYHbbfddnG2scceOx7U0zK0ArEM09ubyB8tH5wRRRDAH4Hc4MGD4ypSHirX11beK+evx2u6wU455ZTY1cWZXQQ8k002WaAVi7O5CIZotWLM03PPPRdbtNjuOeecE7v2KMtBBx0UuxrpxswmAs/U8pad7nMFFFBAgfIIGAB1cV1xCjsBQLVEFxUtEJxdlBJjcWitqAyAbr311tgqRDcXZyExVoeuJdIoo4wSvv322/iaAcoc4OkGuuWWW2IrB/PQnXbfffcFThEnMeaILiwGZKf1xDeG/mNQc7XEwG0GR++xxx6BAdUEW7QCcWbbqKOO2tIiVbm+WnlnfSlPaZu33357y7ronrr++utbxvOkedIjy1Yun97j8eKLL46DtTkLjERgw1ggusRIBHAMNCe4I3G2Xjrri9fTTDNNmHfeeXkaE2WmdSx1p6XpPiqggAIKlEugxyFDU7myXK7c0gJ03nnnxUHJI4888jCZ50J7AwcOjKdyE0SQmJ+zq9IYobQAB2rOsGLMCt0wBEAENMsuu2wMfo4//vjYMkNAxbVv2BbrXn311eMBfuGFF45Bzz333BMDHlo6CF5ozWFgM4FTr1694usjjzwyjo/p3bt3bCVJeSB4YIwPLSOnnXZaHAy99dZbx5YgxgVde+218eKNtL5kxz21lneCLwYkc+bbLLPMEliO1hW67wi2uCYSZ7NxhhnX5OH9bCI4orWLM+MoL/OnAdi0VFGOMcYYI5aTIJFEdxxlYKwVQSOGBIvpWku0DDH25/PPP4+tcwSDlDElxhNRbwwyN3VOYMCAAR1eQbpeVodXkOOCzVLOHEndlAJ1ERhh6K/n/zYJ1GV1rqSaAFd6ZhwJ16ypTLQoVI4RqjaN5WhZobo4APPIHy0WpLQMLSpMIzigi6cycTo6rTeMDWqty6pymcrXBAV0e1Wun+nkicHQlalW3ivnZWB2tmuOIIUWJs50a0/iIocEkZzSzhlgrSWuCcRp8dUS66BeKpdnnNYhQ38zZE+zr7a809oWYCxbRxPj3MqSmqWcZakP86lAErALLEl04SNdXZNOOukwY3DS5iqDH6ZXm8Z0DsYEPySClxT88Dotk6ZVBifMQ6I1hvFGHQ1+WActUNXWz/RqwQ/L1Mo772dTNvghWGNQcnuDH9ZDKw+tSJXBS3YbPG8t+OE91lG5PNcOYiyRwQ9CJgUUUKDcAgZAOdUfA5dT8JLTJhtiM3RpZS9U2F2FokuMrri55567u7LgdhVQQAEF6ijw3+aEOq7QVbUuQKuEqZwCXB/IpIACCijQOAK2ADVOXVoSBRRQQAEFFGingAFQO6GcTQEFFFBAAQUaR8AAqHHq0pIooIACCiigQDsFDIDaCeVsCiiggAIKKNA4AgZAjVOXlkQBBRRQQAEF2ilgANROKGdTQAEFFFBAgcYRMABqnLq0JAoooIACCijQTgEDoHZCOZsCCiiggAIKNI6AAVDj1KUlUUABBRRQQIF2ChgAtRPK2RRQQAEFFFCgcQQMgBqnLi2JAgoooIACCrRTwAConVDOpoACCiiggAKNI2AA1Dh1aUkUUEABBRRQoJ0CBkDthHI2BRRQQAEFFGgcAQOgxqlLS6KAAgoooIAC7RQwAGonlLMpoIACCiigQOMI1CUA+uCDD8IDDzzQOCqWRAEFFFBAAQUaWqDTAdBvv/0Wdthhh/DII480NJSFU0ABBRRQQIHGERixs0W57LLLwsILLxx+//33zq7K5RVQoIkExhxzzKYobbOUsykq00I2lECnAqBnn302TDXVVOHHH38MX3zxRUPBWBgFFOhagaWXXrprN1CQtTdLOQvCbTYUaLdAhwOg7777Ljz55JNhu+22CwMGDGj3Bp1RAQUUQOCXX37pMMSII3b4q6vD2+zogs1Szo76uJwC3SXQ4W+RU089NbzxxhvhhRdeCK+++mpsBZp88snDpptu2l1lcbsKKFAigVtvvbXDue3Tp0+Hl817wWYpZ96ubk+Bzgp0OADq27dv+Prrr+P2+/XrFwYNGhR69+7d2fy4vAIKKKCAAgoo0OUCHQ6Axh133MAfaYIJJoiDoHk0KaCAAgoooIACRRfocACULdjWW2+dfelzBRRQQAEFFFCg0AKdvg5QoUtn5hRQQAEFFFBAgSoCBkBVUJykgAIKKKCAAo0tYADU2PVr6RRQQAEFFFCgioABUBUUJymggAIKKKBAYwsYADV2/Vo6BRRQQAEFFKgiYABUBcVJCiiggAIKKNDYAgZAjV2/lk4BBRRQQAEFqggYAFVBcZICCiiggAIKNLaAAVBj16+lU0ABBRRQQIEqAgZAVVCcpIACCiiggAKNLWAA1Nj1a+kUUEABBRRQoIqAAVAVFCcpoIACCiigQGMLGAA1dv1aOgUUUEABBRSoImAAVAXFSQoooIACCijQ2AIGQI1dv5ZOAQUUUEABBaoIGABVQXGSAgoooIACCjS2gAFQY9evpVNAAQUUUECBKgIGQFVQnKSAAgoooIACjS1gANTY9WvpFFBAAQUUUKCKgAFQFRQnKaCAAgoooEBjCxgANXb9WjoFFFBAAQUUqCJgAFQFxUkKKKCAAgoo0NgCBkCNXb+WTgEFFFBAAQWqCIxYZZqTFFBAAQUUGC6B/v37D9f82Zn79OmTfelzBXIRsAUoF2Y3ooACCiiggAJFEjAAKlJtmBcFFFBAAQUUyEXAACgXZjeigAIKKKCAAkUSMAAqUm2YFwUUUEABBRTIRcAAKBdmN6KAAgoooIACRRIwACpSbZgXBRRQQAEFFMhFwAAoF2Y3ooACCiiggAJFEjAAKlJtmBcFFFBAAQUUyEXAACgXZjeigAIKKKCAAkUSMAAqUm2YFwUUUEABBRTIRcAAKBdmN6KAAgoooIACRRIwACpSbZgXBRRQQAEFFMhFwAAoF2Y3ooACCiiggAJFEjAAKlJtmBcFFFBAAQUUyEXAACgXZjeigAIKKKCAAkUSMAAqUm2YFwUUUEABBRTIRcAAKBdmN6KAAgoooIACRRIwACpSbZgXBRRQQAEFFMhFwAAoF2Y3ooACCiiggAJFEjAAKlJtmBcFFFBAAQUUyEXAACgXZjeigAIKKKCAAkUSMAAqUm2YFwUUUEABBRTIRcAAKBdmN6KAAgoooIACRRIwACpSbZgXBRRQQAEFFMhFwAAoF2Y3ooACCiiggAJFEhixSJkxLwoooIACCijQ/QL9+/fvcCb69OnT4WXzXNAWoDy13ZYCCiiggAIKFEKgUy1A3377bbj66qvDTz/9FDbccMPQs2fPQhTKTCiggAIKKKCAArUEOtUCdN1114W55porvPrqq2G//fartR3fU0ABBRRQQAEFCiPQ4RYgWn3WX3/9MMooo4Sff/459OvXrzCFMiMKKFB8gVlmmaX4maxDDi1nHRBdRe4CzbDfjvD70NQZ2YEDB4Zdd901nHzyyWG66abrzKpcVgEFFFBAAQUUyEWgwy1AKXfPPfdcGDx4cNhoo43CE088kSb7qIACCtQUeOedd2q+X+vNXr161Xq7UO9Zzraro0z12XZpGmOOZthvOx0Arb322oG/eeaZJ3z88cdh0kknbYzatxQKKNClAi+++GKH11+mA6blbLuay1SfbZemMeZohv22U4Ogs9XMDjzxxBNnJ/lcAQUUUEABBRQopECHA6Cvv/46LLroouHiiy8OTz75ZDjyyCPDX/7S4dUVEsdMKaCAAgoooEBjCnS4C2zssccODzzwQFTp0aNHY+pYKgW6QaAZrsDaDaxuUgEFFBhGoMMBEGsx8BnG0hcKKKCAAgooUBIB+6xKUlFmUwEFFFBAAQXqJ2AAVD9L16SAAgoooIACJREwACpJRZlNBRRQQAEFFKifgAFQ/SxdkwIKKKCAAgqURMAAqCQVZTYVUEABBRRQoH4CBkD1s3RNCiiggAIKKFASAQOgklSU2VRAAQUUUECB+gkYANXP0jUpoIACCiigQEkEDIBKUlFmUwEFFFBAAQXqJ2AAVD9L16SAAgoooIACJREwACpJRZlNBRRQQAEFFKifgAFQ/SxdkwIKKKCAAgqURMAAqCQVZTYVUEABBRRQoH4CBkD1s3RNCiiggAIKKFASAQOgklSU2VRAAQUUUECB+gkYANXP0jUpoIACCiigQEkEDIBKUlFmUwEFFFBAAQXqJ2AAVD9L16SAAgoooIACJREwACpJRZlNBRRQQAEFFKifgAFQ/SxdkwIKKKCAAgqURMAAqCQVZTYVUEABBRRQoH4CBkD1s3RNCiiggAIKKFASAQOgklSU2VRAAQUUUECB+gkYANXP0jUpoIACCiigQEkEDIBKUlFmUwEFFFBAAQXqJ2AAVD9L16SAAgoooIACJREwACpJRZlNBRRQQAEFFKifgAFQ/SxdkwIKKKCAAgqURMAAqCQVZTYVUEABBRRQoH4CBkD1s3RNCiiggAIKKFASAQOgklSU2VRAAQUUUECB+gkYANXP0jUpoIACCiigQEkEDIBKUlFmUwEFFFBAAQXqJ2AAVD9L16SAAgoooIACJREwACpJRZlNBRRQQAEFFKifgAFQ/SxdkwIKKKCAAgqURMAAqCQVZTYVUEABBRRQoH4CBkD1s3RNCiiggAIKKFASAQOgklSU2VRAAQUUUECB+gkYANXP0jUpoIACCiigQEkEDIBKUlFmUwEFFFBAAQXqJ2AAVD9L16SAAgoooIACJREwACpJRZlNBRRQQAEFFKifgAFQ/SxdkwIKKKCAAgqURMAAqCQVZTYVUEABBRRQoH4CBkD1s3RNCiiggAIKKFASAQOgklSU2VRAAQUUUECB+gkYANXP0jUpoIACCiigQEkEDIBKUlFmUwEFFFBAAQXqJzBiZ1Y1ZMiQcP3114cRRxwxrLnmmmGUUUbpzOpcVgEFFFBAAQUUyEWgUy1AJ510Uvjll1/C1VdfHdZff/1cMuxGFFBAAQUUUECBzgp0uAXok08+CX379g0TTjhhWH311UOvXr3Cr7/+Gnr06NHZPLm8AgoooIACCijQpQIj/D40dXYL7733Xth5553DTTfd1NlVubwCTS/w448/dtigTN3QlrPtarY+2zZyjq4RaIbPZ10CoGOOOSYsueSSYYEFFuiamnCtCiiggAIKKKBAHQU63AWW8vDMM8+EWWed1eAngfioQCcFHnzwwQ6vYfHFF+/wsnkvaDnbFrc+2zZyjq4RaIbPZ6cCoIEDB4ZPP/00rLLKKrEGXnvttTDzzDN3TW24VgWaROCrr75qipJazsaq5mapz8aqtdZL0wz12eGzwBgE3bt377D77ruHGWaYIUw22WRh0KBBrWv6jgIKKKCAAgooUBCBDrcATTLJJOGNN94oSDHMhgIKKKCAAgoo0H6BDrcAtX8TzqmAAgoooIACChRLwACoWPVhbhRQQAEFFFAgBwEDoByQ3YQCCiiggAIKFEvAAKhY9WFuFFBAAQUUUCAHAQOgHJDdhAIKKKCAAgoUS8AAqFj1YW4UUEABBRRQIAeBDp8Gn0Pe3IQCwwj0799/mNftfdGnT5/2zup8CiiggAJNImALUJNUtMVUQAEFFFBAgT8EDID+sPCZAgoooIACCjSJgAFQk1S0xVRAAQUUUECBPwQMgP6w8JkCCiiggAIKNImAAVCTVLTFVEABBRRQQIE/BAyA/rDwmQIKKKCAAgo0iYABUJNUtMVUQAEFFFBAgT8EDID+sPCZAgoooIACCjSJgBdCbJKKtpgKKKCAAp0X6OgFWdmyF2XtvH8912ALUD01XZcCCiiggAIKlELAAKgU1WQmFVBAAQUUUKCeAgZA9dR0XQoooIACCihQCgEDoFJUk5lUQAEFFFBAgXoKGADVU9N1KaCAAgoooEApBAyASlFNZlIBBRRQQAEF6ilgAFRPTdelgAIKKKCAAqUQMAAqRTWZSQUUUEABBRSop4ABUD01XZcCCiiggAIKlELAAKgU1WQmFVBAAQUUUKCeAgZA9dR0XQoooIACCihQCgEDoFJUk5lUQAEFFFBAgXoKGADVU9N1KaCAAgoooEApBAyASlFNZlIBBRRQQAEF6ilgAFRPTdelgAIKKKCAAqUQMAAqRTWZSQUUUEABBRSop4ABUD01XZcCCiiggAIKlELAAKgU1WQmFVBAAQUUUKCeAgZA9dR0XQoooIACCihQCgEDoFJUk5lUQAEFFFBAgXoKGADVU9N1KaCAAgoooEApBAyASlFNZlIBBRRQQAEF6ilgAFRPTdelgAIKKKCAAqUQMAAqRTWZSQUUUEABBRSop4ABUD01XZcCCiiggAIKlELAAKgU1WQmFVBAAQUUUKCeAgZA9dR0XQoooIACCihQCoERS5FLM1lToH///jXfb+3NPn36tPaW0xVQQAEFFGhoAVuAGrp6LZwCCiiggAIKVBNo6BYgW0aqVbnTFFBAAQUUUMAWIPcBBRRQQAEFFGg6AQOgpqtyC6yAAgoooIACBkDuAwoooIACCijQdAIGQE1X5RZYAQUUUEABBQyA3AcUUEABBRRQoOkEOh0Afffdd+G+++5rOjgLrIACCiiggALlFejUafCDBg0KRx55ZHjvvffC0ksvXV4Fc66AAgoooIACTSXQqQBovPHGC8stt1w477zzCok2/vjjFzJf9c6U5ay3aPeuz/rsXv96b936rLdo967P+uxe/3pufYTfh6bOrPCOO+6IAdC1117bmdW4rAIKKKCAAgookJtAp1qAcstlBzfE+KSOpNFHH70ji3XbMpazNn2z1CcKZSprR/dby1l7f++ud63PtuX9fLZtlOccDR0A3XPPPR2yLNtNQi1n7WpulvpEoUxl7eh+azlr7+/d9a712ba8n8+2jfKco9NngeWZWbelgAIKKKCAAgrUQ6BTAdBXX30VGAM0YMCA8Prrr9cjP65DAQUUUEABBRTocoFOdYGNM8444eSTT+7yTLoBBRRQQAEFFFCgngKdagGqZ0ZclwIKKKCAAgookJeAAVBe0m5HAQUUUEABBQojYABUmKowIwoooIACCiiQl4ABUF7SbkcBBRRQQAEFCiNgAFSYqjAjCiiggAIKKJCXgAFQXtJuRwEFFFBAAQUKI2AAVJiqMCMKKKCAAgookJeAAVBe0m5HAQUUUEABBQojYABUmKowIwoooIACCiiQl4ABUF7SbkcBBRRQQAEFCiNgAFSYqjAjCiiggAIKKJCXgAFQXtJuRwEFFFBAAQUKI2AAVJiqMCMKKKCAAgookJeAAVBe0m5HAQUUUEABBQojYABUmKowIwoooIACCiiQl4ABUF7SbkcBBRRQQAEFCiNgAFSYqjAjCiiggAIKKJCXgAFQXtJuRwEFFFBAAQUKI2AAVJiqMCMKKKCAAgookJeAAVBe0m5HAQUUUEABBQojYABUmKowIwoooIACCiiQl4ABUF7SbkcBBRRQQAEFCiNgAFSYqjAjCiiggAIKKJCXgAFQXtJuRwEFFFBAAQUKI2AAVJiqMCMKKKCAAgookJeAAVBe0m5HAQUUUEABBQojYABUmKowIwoooIACCiiQl4ABUF7SbkcBBRRQQAEFCiNgAFSYqjAjCiiggAIKKJCXgAFQXtJuRwEFFFBAAQUKI2AAVJiqMCMKKKCAAgookJeAAVBe0m5HAQUUUEABBQojYABUmKowIwoooIACCiiQl4ABUF7SbkcBBRRQQAEFCiNgAFSYqjAjCiiggAIKKJCXgAFQXtJuRwEFFFBAAQUKI2AAVJiqMCMKKKCAAgookJeAAVBe0m5HAQUUUEABBQojYABUmKowIwoooIACCiiQl4ABUF7SbkcBBRRQQAEFCiNgAFSYqjAjCiiggAIKKJCXgAFQXtJuRwEFFFBAAQUKI2AAVJiqMCMKKKCAAgookJeAAVBe0m5HAQUUUEABBQojYABUmKowIwoooIACCiiQl4ABUF7SbkcBBRRQQAEFCiNgAFSYqjAjCiiggAIKKJCXgAFQXtJuRwEFFFBAAQUKI2AAVJiqMCMKKKCAAgookJeAAVBe0m5HAQUUUEABBQojYABUmKowIwoooIACCiiQl4ABUF7SbkcBBRRQQAEFCiNgAFSYqjAjCiiggAIKKJCXgAFQXtJuRwEFFFBAAQUKIzDC70NTYXJjRhRQQAEFFFBAgRwEbAHKAdlNKKCAAgoooECxBAyAilUf5kYBBRRQQAEFchAwAMoB2U0ooIACCiigQLEEDICKVR/mRgEFFFBAAQVyEDAAygHZTSiggAIKKKBAsQQMgIpVH+ZGAQUUUEABBXIQMADKAdlNKKCAAgoooECxBAyAilUf5kYBBRRQQAEFchAwAMoB2U0ooIACCiigQLEEDICKVR/mRgEFFFBAAQVyEDAAygHZTSiggAIKKKBAsQQMgIpVH+ZGAQUUUEABBXIQMADKAdlNKKCAAgoooECxBAyAilUf5kYBBRRQQAEFchAwAMoB2U0ooIACCiigQLEEDICKVR/mRgEFFFBAAQVyEDAAygHZTSiggAIKKKBAsQQMgIpVH+ZGAQUUUEABBXIQMADKAdlNKKCAAgoooECxBAyAilUf5kYBBRRQQAEFchAwAMoB2U0ooIACCiigQLEEDICKVR/mRgEFFFBAAQVyEDAAygHZTSiggAIKKKBAsQQMgIpVH+ZGAQUUUEABBXIQaOgA6NVXXw0XXHBBuPjii9uk/PXXX8NBBx0Ufv/992Hmfeutt8Kxxx4bvvvuu2Gmd+eL77//Plx22WXh0UcfzSUb9957bzj//PNz2dbwbiRbb59++mk4+eSTwyOPPDK8q8l1/gceeCDul/XY6CuvvBIuv/zyuKqnnnoq7LPPPuHnn3+ux6o7vI5LLrkkvP76620un627tmZ+8803wzHHHBN++umnllkp57XXXhvOOuuslmnd/YTvG/I6PGXr7jy7/e4TKOLxpfs08t9yQwdARxxx3+7eSQAADlFJREFURNhiiy3CCCOM0KbsX/7ylzD11FP/aV6mnXrqqeG3335rcx15zTDqqKOGJ598Mrz88su5bPKvf/1ruOaaa3LZ1vBuJFtvE088cXj++efDwIED42o++eST8MMPPwzvKrt8fvbHK6+8slPbeeedd+LyY4wxRphoooni8znmmCOce+658eDbqZV3cmHyQ77aStm6Y96333671UUmn3zy+EPkl19+aZmH/ZLPwm233dYyrbufsA9S9lplK+p+2d12zbj9Ih5f6lkPRd/XGyYA+vHHH8PgwYNj3dGKwwHigw8+iAfATTbZJP4q5sszzcOMn3/+eUtdc1DaeOONh2kB+vrrrwOtLaOMMkr8QmuZOYcn//nPf/4UdH300UexHOR1nHHGibngF3H2VzETK5fFg7ITDKRADq/KRFkr08cffxwPMnyh552++uqrltYMflFTH5T1//7v/+JzWgAq661nz54xm5Rl3XXXDV988UVcB8ullhEeswfSvMs17rjjtmxyyJAhLc/Tk2r1kOqeeY4++uhw0003xfJMNdVUYdFFF42LjjzyyLGu0np4rLau7Ptd8XyFFVYI448/fsuq2e8oJ/tesudzmK27F154Ieyyyy7DBKzZ/ZhAZ7TRRovr/Pbbb1vWPfbYY7c8T0+6o8xp2yuttFIYb7zxWi1bdr9Mn0X25/Sc9WSNeM6+y/7P55jn2XnTdrv6kTrkM/PNN98Ms6laeefzyueP7xqW5Tn5TylbT6yXfYQypu/dNN+XX34ZW+C7o9zZzx35IQ+fffZZylp8zB57mJCtvzRjpRPTUzm74/iS8pUeW/t+5f20/1EP2cR0Wt2z9ZItZ7V9Pbt8EZ7nf1TrglLT5N6vX7/Qv3//cN5558UDJM/5wN1xxx3h2WefDdNNN104/vjjwyKLLBJbCHbYYYfw/vvvh759+8Yc8QXMr8z0oTzttNMC62U9BAF5pq222ioQOe+7777htddei5s+5JBDYkC3/fbbt2TloYceCocffnjggJNaPSqX/fDDD8M888wTfz3vvffeYc455wyXXnppOOCAA8LKK68cv5z4IF5xxRXhoosuCqusskr0Y0febrvtwnPPPRe7azhY5ZnojqSraL755guUmYB2yy23jHmkO3K11VaLdVNZbymP1C3dMHfeeWc0nGKKKeL8vH/ccccF7Loz8cVx5JFHhvXWWy92Z5IXWq/OOeec8PDDD8dgILWIZOue/ZP3X3rppUD3F11+BHqVqVqdVs7TFa/5kqRurrrqqkCgQvn4jK2zzjrhzDPPHOZzSJdd+szRnfvee+8FultJlftxyivd0csvv3wg0KgM4rurzClvHBiXXXbZuM9l98ts2bL7JQf966+/Ptx6661hjTXWiF3adNtnv6v4DsKIHwMccHDJ/ohL2+6qR74HNtxww7D55psHWtTnn3/+WK9sr6280xXdq1ev+H3CQXKPPfaIdVZtP2ddGDDf6aefHrfHNvg+f+ONN8KBBx7Yrm5VlqlXyn7uWCffhXzfcjxZeumlY1BXeeyprD/8Kp1YV3ceX9h+ZSJArfb9OmDAgFiHhx56aDweLLDAAnFffPfdd+PnmTo+5ZRT4uoqy1m5r1duswivGyIAGmussULv3r1jRTE2hoh6ueWWi79C11xzzTDvvPMGfnUTKKRuo7XWWitMP/308SDLL00Cg5FGGinWCV+sHDj5RUrrEa0teQUA/AJacMEFw2yzzRa/8P7973/HR7o2yN/BBx/cst9QLj6QfClx4Ki2LF+ek002WVh88cXjh47uiWmnnTYGg6yIPmiCH4JFyslBi3XRrUAeCJIwyDMRzN1///1h9dVXD1tvvXWgW4GDwgwzzBCzQV1Sd6RsvcUJ//s300wzxRaDtddeO9A1tOeee4bHH388vovjUkstlZ099+djjjlm2H///QNfsgSeJL5kqEsO8NjzJclBL1v3tILMPPPMgS+iueaaKz5Wa82qVqdxI138j7ph/yJRRuqHffD2228PO+200zCfQ8qaPnPsy5NMMkkMwKvtxynbu+66awwUOFDyGc2m7ipzygOfLQJtUna/zJYtu1/yPXPDDTfEAIGuEMYqzjLLLMMYEfDQysdngnpeeOGFW1p/03a78pFWRfY18sd3T/rhQQDfVt757uB7h+56WpD4Tua7utp+Pvvss8di9OjRI34GUpn4/uO7gB8LE0wwQZrc5Y+Vnzs2yIGeH6Uca9gPSZXHnsr6I3CtdOrO40vMdJV/1HO179cZZ5wx7m+bbrppuPrqq8M000wTbr755kCAT7k4pvIdW21/yO7r1Vpqq2Qj90kNEQDxRZoOFpW/CpMoX7SpEviCpgWASJ4vaT6cpBTk3HPPPS3zMp3m97wSeZh00knD2WefHejOIW/knYMgXygXXnhhS1b4siCNPvrocb5qy/I+3VdpXgzSc8pFwEMrGV+sG2ywQQwI+XV94403tnyZM1+yYX1dnThgDho0KH7IaBYniK2VauUtvUcrBK2EdIvSRdHdacQRR4xZoO7YZznoE3im4IH6uOuuu1qt+1Su1spRrU5bm7fe07N544s1dUuyneznkNfZedNzHis/A8xL4vPK+/wCp2Ugm7qzzCkf2a7iVB7eyz5Pr/meoZx87ji40vpHqjSi5faEE04It9xyS2wliTPl+I+8p/zzXUALVHvzvuOOO8Zy0ZLOD5rW9vPWinPUUUfFctMynbVtbf56Ta/8ziX4fPDBB2PdsI0+ffoEPsPVjj3Z+qvm1J3Hl474UPeMtyMttNBCsaWWY9Hf/va3QID07tDWoGrlTNtK+056XaTHhgiAiMb5cPHriw9YStnnaRqPRLIcdGgFINJPifn5m3XWWePOTl8uKfW/p/m68pEDP7+Q6PZhpyM/NO3TAkKXB1E3O1y2bOl5tWUr85rtr03v8SGm5YzEB51mTQzYqUmUv9py8c0u+MdBk9YRuhRohUu/DvlioUmZRD5Tnih/Mshmh0AvzcNBmAMNf+wr3Zmy+U3P+ZIg8KS7hERX0oorrli17jkQUCe1UrU6rTV/vd+rVh/VtpHKny1Te/Zjmtcru/66u8yV5atWNuZJ+yX55fOczjDlAFst8QOAX9y0XtMaWoTU3ryzD5Nvxlum4LXafp79bPO9m/Yfhh/wY5VufM6wyytVfucylIDjC2cdkjjTj89oa8eelM9qTt15fEn5qvaYrYPs92t2XvZDvj+ffvrpOJSAk2MI0KuVk+XSvp5dR5GeN0QARDMkB0zGRjB25r777otfLPxCpKIYR8P4AnZeDhwcYHnO2V00W9O6wnwccPkVSb81/aF0RXHqLV/OTM8j8auC5kR2Kg72NDfy4SMQoA+aJmgGmNKsTKBC9x3POXDypVG57Isvvhj7znmf9fBlRFn4YqF/ly9dzpSjK4zxNvvtt1+Ye+65w2abbRbN6JfnjCVs0piUrnbgi+Uf//hHOPHEE+NYAMaOkPjgMeaF8QC0jDGWgHEkqd4o+zPPPBNdeJ/xGHR9cbAk7bbbbvFLLA0gjxO74R/N+nRpEMjS6oMrdcM+zHt84dNaxa9nAvRs3dMdsthii8XLErAs3UCMdWJdTzzxROzKvPvuu6vWaR5FpQ74LLFf8Zz9jrFcPK/8HGY/c7R8UY+05JKYP/sZ4D1+sPCZZR/gBwFdu+zLfM75fFfbj/Moc7VttFY2Dixpv6Q1he4DDoiMSeR1pRHr5iDCvkCXZ96JYISuY743qAOe85mjvtqTd747t9lmm2G6nKvt53T58pmlDpMdQROXOGDMIvs9gVNeqdrnjh+mdAPS+sH3DN29lccexgRljzW0kFQ6defxpZZfte/XFIhSLr6P6UHgBynHFS65wf7MsaJaOdlW2tfTd3Ct7XfHeyMMLeAfTSbdkYM6bZOKIHjgFz8furZSmp/it9ZERwBCsJTmbWud9XqfII0ykK9UHvLJL8W2Ti+utmx788UA23SmTVqGljJaovgSzivxBUhgR386ze10x3GAIPBLdZJc2spTmp/5+OKmPAQQRU4c/CeccMKW/bha3WfLVass1eq01vzd+R4HHfY19vvW9mMOksyTuhCr5bc7y8wPp2233Tb+eMrmLVs2pmfrrz35JTDkpATKXqTUnrxTl9W+P6rt5wRc1G36Dq+27+dV/mrbZhrlye5/6fhQ6zupmlPaB9LyeZWr1nZSnrJlIeChC5PvpHT8wYF8Mx/HyJRqlTPNU6TH/w5EKFKOOpiXtEOmD05bq0nztxb8sHyq2DRvW+us1/vZL4tUHvKZdr5a26m2bK35s+9VBj+8x4DyvBNnEzD2h19X6UsynVad6iS5tJU35qeFhF8v2BTponmt5b2ym6Na3SeH1taRpler0/Re0R5pgk+ptf2YVpK2UneUmSCdX/YcQGjNqEzZsvFetv5q5feioQPkadmji6FowQ/lqJV33idl6/K/U/77v9p+XulUbd/PrqMrn1fbNtMqjwfpda3vpGpOaR9Iy3dlWdq77pSnbFn40cFf9viDQ7X9sVY525uHPOdrmBagPNHcVtcK8MuC7j26hZZccsnY1dGZLRJMcd0cuvPacwDtzLZctjkF6Makq49uuqmHnjFVr8S4P7r16UowKZC3AF2fdLXT3UwXWZGCtXpYGADVQ9F1KKCAAgoooECpBNoeLFOq4phZBRRQQAEFFFCgbQEDoLaNnEMBBRRQQAEFGkzAAKjBKtTiKKCAAgoooEDbAgZAbRs5hwIKKKCAAgo0mIABUINVqMVRQAEFFFBAgbYFDIDaNnIOBRRQQAEFFGgwAQOgBqtQi6OAAgoooIACbQsYALVt5BwKKKCAAgoo0GACBkANVqEWRwEFFFBAAQXaFjAAatvIORRQQAEFFFCgwQT+Hxr6T8CDMgAiAAAAAElFTkSuQmCC" alt="" width="373" height="280" /></p> <p> </p> <p>The code they use in base graphics is this (super blurry sorry, you can also <a href="http://motioninsocial.com/tufte/">go to the website</a> for a better view).</p> <p><img class="aligncenter wp-image-4646" src="http://simplystatistics.org/wp-content/uploads/2016/02/Screen-Shot-2016-02-11-at-12.56.53-PM-300x82.png" alt="Screen Shot 2016-02-11 at 12.56.53 PM" width="483" height="132" srcset="http://simplystatistics.org/wp-content/uploads/2016/02/Screen-Shot-2016-02-11-at-12.56.53-PM-300x82.png 300w, http://simplystatistics.org/wp-content/uploads/2016/02/Screen-Shot-2016-02-11-at-12.56.53-PM-768x209.png 768w, http://simplystatistics.org/wp-content/uploads/2016/02/Screen-Shot-2016-02-11-at-12.56.53-PM-1024x279.png 1024w, http://simplystatistics.org/wp-content/uploads/2016/02/Screen-Shot-2016-02-11-at-12.56.53-PM-260x71.png 260w, http://simplystatistics.org/wp-content/uploads/2016/02/Screen-Shot-2016-02-11-at-12.56.53-PM.png 1248w" sizes="(max-width: 483px) 100vw, 483px" /></p> <p>in ggplot2 the code is:</p> <p><img class="aligncenter wp-image-4647" src="http://simplystatistics.org/wp-content/uploads/2016/02/Screen-Shot-2016-02-11-at-12.56.39-PM-300x73.png" alt="Screen Shot 2016-02-11 at 12.56.39 PM" width="526" height="128" srcset="http://simplystatistics.org/wp-content/uploads/2016/02/Screen-Shot-2016-02-11-at-12.56.39-PM-300x73.png 300w, http://simplystatistics.org/wp-content/uploads/2016/02/Screen-Shot-2016-02-11-at-12.56.39-PM-768x187.png 768w, http://simplystatistics.org/wp-content/uploads/2016/02/Screen-Shot-2016-02-11-at-12.56.39-PM-1024x249.png 1024w, http://simplystatistics.org/wp-content/uploads/2016/02/Screen-Shot-2016-02-11-at-12.56.39-PM-260x63.png 260w, http://simplystatistics.org/wp-content/uploads/2016/02/Screen-Shot-2016-02-11-at-12.56.39-PM.png 1334w" sizes="(max-width: 526px) 100vw, 526px" /></p> <p> </p> <p>Both require a significant amount of coding. The ggplot2 plot also takes advantage of the ggthemes package here. Which means, without that package for some specific plot, it would require more coding.</p> <p>The bottom line is for production graphics, any system requires work. So why do I still use base R like an old person? Because I learned all the stupid little tricks for that system, it was a huge pain, and it would be a huge pain to learn it again for ggplot2, to make very similar types of plots. This is one where neither system is particularly better, but the time-optimal solution is to stick with whichever system you learned first.</p> <p><strong>Grading student work</strong></p> <p>People I seriously respect suggest teaching ggplot2 before base graphics as a way to get people up and going quickly making pretty visualizations. This is a good solution to the <a href="http://simplystatistics.org/2014/08/13/swirl-and-the-little-data-scientists-predicament/">little data scientist’s predicament</a>. The tricky thing is that the defaults in ggplot2 are just pretty enough that they might trick you into thinking the graph is production ready using defaults. Say for example you make a plot of the latitude and longitude of <a href="https://stat.ethz.ch/R-manual/R-devel/library/datasets/html/quakes.html">quakes</a> data in R, colored by the number of stations reporting. This is one case where ggplot2 crushes base R for simplicity because of the automated generation of a color scale. You can make this plot with just the line:</p> <p>ggplot() + geom_point(data=quakes,aes(x=lat,y=long,colour=stations))</p> <p>And get this out:</p> <p><img class="aligncenter wp-image-4649" src="http://simplystatistics.org/wp-content/uploads/2016/02/quakes-300x264.png" alt="quakes" width="420" height="370" srcset="http://simplystatistics.org/wp-content/uploads/2016/02/quakes-300x264.png 300w, http://simplystatistics.org/wp-content/uploads/2016/02/quakes-227x200.png 227w, http://simplystatistics.org/wp-content/uploads/2016/02/quakes.png 627w" sizes="(max-width: 420px) 100vw, 420px" /></p> <p>That is a pretty amazing plot in one line of code! What often happens with students in a first serious data analysis class is they think that plot is done. But it isn’t even close. Here are a few things you would need to do to make this plot production ready: (1) make the axes bigger, (2) make the labels bigger, (3) make the labels be full names (latitude and longitude, ideally with units when variables need them), (4) make the legend title be number of stations reporting. Those are the bare minimum. But a very common move by a person who knows a little R/data analysis would be to leave that graph as it is and submit it directly. I know this from lots of experience.</p> <p>The one nice thing about teaching base R here is that the base version for this plot is either (a) a ton of work or (b) ugly. In either case, it makes the student think very hard about what they need to do to make the plot better, rather than just assuming it is ok.</p> <p><strong>Where ggplot2 is better for sure</strong></p> <p>ggplot2 being compatible with piping, having a simple system for theming, having a good animation package, and in general being an excellent platform for developers who create [Some of my colleagues think of me as super data-sciencey compared to other academic statisticians. But one place I lose tons of street cred in the data science community is when I talk about ggplot2. For the 3 data type people on the planet who still don’t know what that is, <a href="https://cran.r-project.org/web/packages/ggplot2/index.html">ggplot2</a> is an R package/phenomenon for data visualization. It was created by Hadley Wickham, who is (in my opinion) perhaps the most important statistician/data scientist on the planet. It is one of the best maintained, most important, and really well done R packages. Hadley also supports R software like few other people on the planet.</p> <p>But I don’t use ggplot2 and I get nervous when other people do.</p> <p>I get no end of grief for this from <a href="https://soundcloud.com/nssd-podcast/episode-9-spreadsheet-drama">Hilary and Roger</a> and especially from <a href="https://twitter.com/drob/status/625682366913228800">drob</a>, among many others. So I thought I would explain why and defend myself from the internet hordes. To understand why I don’t use it, you have to understand the three cases where I use data visualization.</p> <ol> <li>When creating exploratory graphics - graphs that are fast, not to be shown to anyone else and help me to explore a data set</li> <li>When creating expository graphs - graphs that i want to put into a publication that have to be very carefully made.</li> <li>When grading student data analyses.</li> </ol> <p>Let’s consider each case.</p> <p><strong>Exploratory graphs</strong></p> <p>Exploratory graphs don’t have to be pretty. I’m going to be the only one who looks at 99% of them. But I have to be able to make them <em>quickly</em> and I have to be able to make a <em>broad range of plots</em> <em>with minimal code</em>. There are a large number of types of graphs, including things like heatmaps, that don’t neatly fit into ggplot2 code and therefore make it challenging to make those graphs. The flexibility of base R comes at a price, but it means you can make all sorts of things you need to without struggling against the system. Which is a huge advantage for data analysts. There are some graphs (<a href="http://rafalab.dfci.harvard.edu/images/frontb300.png">like this one</a>) that are pretty straightforward in base, but require quite a bit of work in ggplot2. In many cases qplot can be used sort of interchangably with plot, but then you really don’t get any of the advantages of the ggplot2 framework.</p> <p><strong>Expository graphs</strong></p> <p>When making graphs that are production ready or fit for publication, you can do this with any system. You can do it with ggplot2, with lattice, with base R graphics. But regardless of which system you use it will require about an equal amount of code to make a graph ready for publication. One perfect example of this is the <a href="http://motioninsocial.com/tufte/">comparison of different plotting systems</a> for creating Tufte-like graphs. To create this minimal barchart:</p> <p><img class="aligncenter" src="data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAAkAAAAGwCAYAAABB4NqyAAAD8GlDQ1BJQ0MgUHJvZmlsZQAAOI2NVd1v21QUP4lvXKQWP6Cxjg4Vi69VU1u5GxqtxgZJk6XpQhq5zdgqpMl1bhpT1za2021Vn/YCbwz4A4CyBx6QeEIaDMT2su0BtElTQRXVJKQ9dNpAaJP2gqpwrq9Tu13GuJGvfznndz7v0TVAx1ea45hJGWDe8l01n5GPn5iWO1YhCc9BJ/RAp6Z7TrpcLgIuxoVH1sNfIcHeNwfa6/9zdVappwMknkJsVz19HvFpgJSpO64PIN5G+fAp30Hc8TziHS4miFhheJbjLMMzHB8POFPqKGKWi6TXtSriJcT9MzH5bAzzHIK1I08t6hq6zHpRdu2aYdJYuk9Q/881bzZa8Xrx6fLmJo/iu4/VXnfH1BB/rmu5ScQvI77m+BkmfxXxvcZcJY14L0DymZp7pML5yTcW61PvIN6JuGr4halQvmjNlCa4bXJ5zj6qhpxrujeKPYMXEd+q00KR5yNAlWZzrF+Ie+uNsdC/MO4tTOZafhbroyXuR3Df08bLiHsQf+ja6gTPWVimZl7l/oUrjl8OcxDWLbNU5D6JRL2gxkDu16fGuC054OMhclsyXTOOFEL+kmMGs4i5kfNuQ62EnBuam8tzP+Q+tSqhz9SuqpZlvR1EfBiOJTSgYMMM7jpYsAEyqJCHDL4dcFFTAwNMlFDUUpQYiadhDmXteeWAw3HEmA2s15k1RmnP4RHuhBybdBOF7MfnICmSQ2SYjIBM3iRvkcMki9IRcnDTthyLz2Ld2fTzPjTQK+Mdg8y5nkZfFO+se9LQr3/09xZr+5GcaSufeAfAww60mAPx+q8u/bAr8rFCLrx7s+vqEkw8qb+p26n11Aruq6m1iJH6PbWGv1VIY25mkNE8PkaQhxfLIF7DZXx80HD/A3l2jLclYs061xNpWCfoB6WHJTjbH0mV35Q/lRXlC+W8cndbl9t2SfhU+Fb4UfhO+F74GWThknBZ+Em4InwjXIyd1ePnY/Psg3pb1TJNu15TMKWMtFt6ScpKL0ivSMXIn9QtDUlj0h7U7N48t3i8eC0GnMC91dX2sTivgloDTgUVeEGHLTizbf5Da9JLhkhh29QOs1luMcScmBXTIIt7xRFxSBxnuJWfuAd1I7jntkyd/pgKaIwVr3MgmDo2q8x6IdB5QH162mcX7ajtnHGN2bov71OU1+U0fqqoXLD0wX5ZM005UHmySz3qLtDqILDvIL+iH6jB9y2x83ok898GOPQX3lk3Itl0A+BrD6D7tUjWh3fis58BXDigN9yF8M5PJH4B8Gr79/F/XRm8m241mw/wvur4BGDj42bzn+Vmc+NL9L8GcMn8F1kAcXgSteGGAABAAElEQVR4Ae3dBZgcRd6A8eLC4RDcLbg7h7sGC+5uwfXQwzncXQ734MH9cHcPENyPAIEgH/rlrbtaOsPs7GZ3trd75q3n2Z2ZnpaqX/VM/6equnuE34emYFJAAQUUUEABBZpI4C9NVFaLqoACCiiggAIKRAEDIHcEBRRQQAEFFGg6AQOgpqtyC6yAAgoooIACBkDuAwoooIACCijQdAIGQE1X5RZYAQUUUEABBQyA3AcUUEABBRRQoOkEDICarsotsAIKKKCAAgoYALkPKKCAAgoooEDTCRgANV2VW2AFFFBAAQUUMAByH1BAAQUUUECBphMwAGq6KrfACiiggAIKKGAA5D6ggAIKKKCAAk0nYADUdFVugRVQQAEFFFDAAMh9QAEFFFBAAQWaTsAAqOmq3AIroIACCiiggAGQ+4ACCiiggAIKNJ2AAVDTVbkFVkABBRRQQAEDIPcBBRRQQAEFFGg6AQOgpqtyC6yAAgoooIACBkDuAwoooIACCijQdAIGQE1X5RZYAQUUUEABBQyA3AcUUEABBRRQoOkEDICarsotsAIKKKCAAgoYALkPKKCAAgoooEDTCRgANV2VW2AFFFBAAQUUMAByH1BAAQUUUECBphMwAGq6KrfACiiggAIKKGAA5D6ggAIKKKCAAk0nYADUdFVugRVQQAEFFFDAAMh9QAEFFFBAAQWaTsAAqOmq3AIroIACCiigwIgSKKCAAgp0nUD//v07vPI+ffp0eFkXVECB2gK2ANX28V0FFFBAAQUUaEABA6AGrFSLpIACCiiggAK1BQyAavv4rgIKKKCAAgo0oIABUANWqkVSQAEFFFBAgdoCBkC1fXxXAQUUUEABBRpQwACoASvVIimggAIKKKBAbQEDoNo+vquAAgoooIACDShgANSAlWqRFFBAAQUUUKC2gAFQbR/fVUABBRRQQIEGFDAAasBKtUgKKKCAAgooUFvAAKi2j+8qoIACCiigQAMKGAA1YKVaJAUUUEABBRSoLWAAVNvHdxVQQAEFFFCgAQUMgBqwUi2SAgoooIACCtQWMACq7eO7CiiggAIKKNCAAgZADVipFkkBBRRQQAEFagsYANX28V0FFFBAAQUUaEABA6AGrFSLpIACCiiggAK1BQyAavv4rgIKKKCAAgo0oIABUANWaq0iffTRR+Hll1+uNYvvKaCAAgoo0PACBkANX8XDFvDMM88MRxxxxLATfaWAAgoooECTCRgANVGF/9///V+46667wrXXXhs+/PDDJiq5RVVAAQUUUGBYgR6HDE3DTvJVowpcfvnlYa211goPPPBA+O6778Kyyy7bUtQPPvggXHrppeGZZ54JU0wxRRhjjDHCLbfcEuedY445Qo8ePcJjjz0WbrrppjD++OOHccYZJy77/fffhxtvvDFMPvnk4cILL4zLjjzyyOHxxx8Pd9xxR/jkk0/CDDPM0LKd3377Ldx3333htttuC7/88kvczogjjhjXT1B21VVXhcGDB4epp546jDDCCC3LZZ88+uij4c477wxvvvlmmG666QLLk8jvPffcE3766acw5ZRTtizyyiuvhOuvvz68+uqrMX+jjjpqfK9a3il3tXKyQGvbbdmQTxSoIjBgwIAqU9s3aaaZZmrfjM6lgALDLWAL0HCTlXcBAo+ll1467LDDDuHcc88NBAApEfS888474eyzzw4TTzxxnEwAMsooo4S//vWv4aSTTgovvvhiWGCBBcJqq60Wg51vvvkmbLPNNmGrrbYKZ511Vujfv38cX7TTTjuFN954I2y55Zbh+OOPj4EQK/z999/DUkstFSabbLKw6KKLxr999tknDBw4MDz77LPhmGOOCUsssUS45JJLwqqrrpqyNszjrbfeGghoNtlkk3D//feHr776Kr6/+eabh7/85S+hb9++Ydttt435443zzz8/5m2zzTaLgducc84Zl28t79XKyXpa2y7vmRRQQAEFyicwwtCD0u/ly7Y5Hl4BWmSefvrpQHDy5ZdfxhabE044IWy//fYtq/r8889jC8lTTz0VaPU54IADwsEHHxw+++yzsNJKK4X9998/znvNNdeE999/PzAfQcgKK6wQBg0aFFtzmOHkk08Offr0Cb169QobbbRRbAFiPbQ8EagQaJFmm222sOeee4YtttgiLLnkkmG99daLLUu0RqXAiHVk02GHHRaef/75cNlllwWCGFpzaA264oorYksU89JSQ/DGr2cCO1p+UovQyiuvHFueWKYy77RAtVbOatvt2bNnNms+H04BAuaOJvavsqRmKWdZ6sN8KpAE/tt3kF752LACDH6mhWTnnXeOZZxooonCKaecErbbbruWrqYJJ5wwrLPOOrF16Oijj47BBa0/Tz75ZGwVIkghpUee00LEH11HKbGNK6+8MgYYdGcNGTIkvjXppJOGd999N/7RxTX66KOHZZZZJvz666+xy4wutNQ9RQvP2GOPnVbZ8rj11luH3r17x6CKVidaimjZomUnpYUXXjg+TV19E0wwQXorLLbYYrGliQmVea9VzmrbbVmpTxRQQAEFSidgF1jpqmz4M8w4HMbsXHTRReG0006Lf4wHYmzC7bffPswKd9xxx9i6QjCy/vrrx/doRXn99dcDgQTdY/yNN954cazOMAv/78WGG24Yu7to2SHoSWn66aePwdXGG28c7r777sDwM1pmGF80ySSTxPFCaf08fvvtt2nRlkdaXZ544onYarTmmmvG/NPaw+DubCLoSgEULV8pkZ80PU1Lj7XKWW27aTkfFVBAAQXKJ2AAVL46G+4cn3rqqYGgJJtoJaGb67jjjstODgsttFCYZpppYoAy7bTTxvfmm2++wODlPfbYI3afMYD6jDPOCKONNloMdGjBSenrr78OV199dex6Ytrbb78dW3h4znJcg+jhhx8Oyy23XGzJYTpp9dVXD7vttlt44YUX4joZtPzpp5/+983M/9NPPz223Bx++OFhr732iuOSVlxxxRgU0TVHjy4BEgOwZ5111vh37733tqyBVp4U2DFvNu+1ylltuy0r9YkCCiigQOkEDIBKV2XDl2HGyhAAMS4mO9yLQIOWF8bBHHroocOslFagTTfdtGUaLSyM6znvvPPiuB4GFC+//PKxhYZWJQIbBg//+OOPsXWFYIaAi3E8dJf169cvBlQENAxwHnfcccNUU00V5plnnnDsscfG7ey+++5xPXPNNVdYfPHF4wDt2WefvSUP6Qnjkej2YlDyzz//HAdgMx7kH//4RxxvNNZYY4WLL744rLHGGvHsMIIi5k2tX1wK4MADD4yBXGXeWysn3YDVtpvy5KMCCiigQPkEHARdvjrr8hzTKkIwwJihbCLgYEwPp8G3lQg0OB2exOnunKr+3HPPxQHJiyyySAxAWBfBChdm5MwwEkES3V+tJVqi+Mt2caV5CcDSqfVpGo8EfrREjTnmmIFxTm2lauWstd221uf71QWaZXBws5Szei07VYHiCgx7hCtuPs1ZjgK0DFUGP2yelpD2BD/Mm4Ifnqfr9DCwmbFDUw8dAE3rDy09tARlA57sc5atTOSL9VUbx1M5GDstSzBHd157gh+WqVbOWttN2/FRAQUUUKA8Ap4FVp66Kn1OGfRM9xN/888/f+wK45pEBFwmBRRQQAEF8hQwAMpTu8m3tfbaawf+st1jTU5i8RVQQAEFuknALrBugm/mzWa7x5rZwbIroIACCnSfgAFQ99m7ZQUUUEABBRToJgEDoG6Cd7MKKKCAAgoo0H0CBkDdZ++WFVBAAQUUUKCbBAyAugnezSqggAIKKKBA9wkYAHWfvVtWQAEFFFBAgW4SMADqJvjObPbcc88N3Om8zKloZfjpp5/CnnvuGb766qsys5p3BRRQQIF2ChgAtROqu2fjNg8pffzxx2HQoEHpZWke8yjDDz/80CEPbn/xzjvvxPuZpRV0dF1peR8VUEABBYorYABU3LppydmJJ54YXnrppZbXXFF5zTXXbHldhid5lIH7gPXt27dDHKOPPnrgDvSTTDJJXL4z6+pQBlxIAQUUUCBXgR5DD6aH5LrFAmzslVdeiQe7V199NUwxxRRh1FFHjbmiVeWGG24IM844Y7j22mvDU089Feacc854Y9BstmktuPfee+OkAQMGhOuuuy7e04qbbab02GOPhZtuuineO2ucccZJk+ONPO+7775w2223tdy4k3tbcTuI1157LXZt8f5ss80W70nF3dx32mmneB+rkUYaKUw55ZRh4MCB4eWXX443EOXu6g899FC84efkk08eyM+VV14Z52e7dO1wYH/66afDdNNNF1hHtfThhx+Gq666Kt7slHt1cf8s0vPPPx/v+k6rEzd15KalBAuU7/bbb4/54TWJm44y7Y477ghffPFFmH766eN6apWBe4HVWrY91izPfNxr7M477wwTTDBBmGiiiaLpjTfeGHC58MILY11zEcbHH3885vGTTz4JM8wwQ9w+Nzu9//774zKjjTban9ZF3T766KNx/W+++Wa0TPc4iyvw33ALsK92NM0000wdXTT35ZqlnLnDukEFOinQdC1A559/fjjrrLPCZpttFoMTAhwCIgKFQw89NGy55ZbhhBNOCHR/HHbYYeGMM874EzEH2T59+oR//vOfcdl///vfYb311muZ76STTgovvvhiWGCBBcJqq60WD768SYCw1FJLxSBi0UUXDfzts88+MaC555574hgUbhXB3dh33nnnuL5lllkmBkoLLbRQmHnmmePd1Nk2wQoHYIK1XXfdNd5XiwUIOghOevXqFcu0/fbbx+1xM0+CPe62XpmeffbZcMwxx4QlllgiEFCtuuqqcRbKxTRi5GeeeSbwmpuYHnvsseGtt94KV199ddh0003jvAQQq6++esz7NttsEwOMFVZYIb6uVQYWrrVsW9Zx40P/EbBxc9UxxhgjsD1syMdWW20V65vgjaCRYPKNN96I9Xz88cfHfLIOAt5ll1021kXlusYdd9xw6623xromyCJQcqxQkvdRAQUUKKdAUwVAHLQIFvbee+/Y6rPSSiuFOeaYI+yxxx6xZWSLLbaIQQODYXlOMEIwUZlWWWWVGFQsvfTSYeuttw4HHHBAeOKJJ2KAQ0sKrQ09e/aMQQIByplnnhlX8eCDD4b3338/8Ot1rrnmCrPMMksgsOE1rVBsk0SrCK0UpNQlwyMHYpZZZJFF4nv8Y3kO3JdeemmcRiBFEEf617/+FQMQtskdzmnZSPPFGf73j/Kz3ueeey7MOuussXWK8TAEawRdBEGbb755LBetJiuuuGIMfAiECJ5ItDp99tlnMXjiruxHHnlkoCXroosuarMMtZatZf2/7McHAh7uVJ9aybjzOwEQ9x0j6KFVCifKs+SSS8YAidYs6o207rrrxjrjeeW6KA8BIK1bBKcHH3xwtGRekwIKKKBAOQWaKgCiVea7776LXSSpuhZbbLHw5JNPxpep2yc90u2RHbibluGRedJ8tDrQgkRLBuuaeOKJ40GWAy0tSDfffHNcdNJJJw3vvvtu/GMCXUe0VpAIagh8Tj/99Nh6RD6zKW2LadnnvN5xxx3DBRdcEFuKaKVJ63z44YfDvPPO25IXDuIEBdnEAZ1gi6CG/NLCQXdXCrxoOeKPRCBAntP2KXcajE33EAFIStjRupZsmZ6Wq3ze1rIsl5bNWqdttfZIfvljmZRoWXvkkUdiMDd48OAwZMiQ9FbLNlomZJ4Q6NL1SJcZXaMEuCYFFFBAgfIKNFUANPbYY8eaYjxMSgQlaXqa1plHuplef/31GGQRCPE33njjxbE1dE9x+vfGG28c7r777ti1xJgeEt1JRx11VAxmaNWpTCkAqJzOa1pJaLUgCKLlI81LXjhop3zwyFiZbGLsEcEOLTvZ+b799tvsbC3PUzDUMuF/TxhvRICVTdhmA4WUr+w8PG/PspXLtPa6tW2k+TfccMPYUkdrG/mrlbLrohy0FrEcA9BpDTIpoIACCpRXoKkCILp3+EsDmKk2WijWX3/9WIO0hmRTZbCQfY/WnpQ4Y4jEGJ/55psvtgTRrfTll1/GFidageh+olWHcSi0zCy33HKhd+/eaRWxu4oDMgfdt99+O7bmpDdpxSAgSfkjX9m8EcRst912Ya+99gobbbRRWiyOyaEbjOCK/LLeagduxu7stttu4YUXXohlYNB0GitEmfgjsQ66lFI+0nTeW2uttcJ//vOfOEaJ1yQCTbqWSLXK0NayrVnHFWf+0Y2YWnTII/lLeWW2r7/+OloQGJLwyL6PaarLynXRMkcZDj/88OhMa6JJAQUUUKC8Ak0VANFKcs0118QBrYxNOe200+IB/cADD4wHviuuuCLWJGctMZaHAbi0atCik01chPC9996LY2U424l1kS6//PIYwJx88snhvPPOiwORGWy9/PLLxzE4BBUMMmYsD91daUAxy26wwQbxPQIYWhsIeA466CDeiq0ODDZm4DPdL3ThEEQRsKREFw2Do2kBSmnBBReMwR0BHl03p5xySkuwl+bhcffdd4/BGeOSGEj8/fffh9lnnz22eHCgv+uuu8Lnn38eW5jo6mMgOV1flIUAgjFPc889dywzg64Zb0NAxqByAkISLSetlaHWsrWs44oz/+aff/5oyyBufKgXgk4GpdOVSUsfwR6tQAw+p8uvX79+sZ4pA+aMRyJwza6LSxCk8U0MhiZQYnC1SQEFFFCgvAIjDP2V/N+f9+Utw3DnnCLz659xKtmAYbhXVGMBDpKMMcmOi2GQMafeM96HgyzvX3zxxeGII46Ig6ppXeE0bRItEQRsKXEgT6ebp2mVj2yTwc6ViZYPWi/4q5UI0OgG60yiDHS79Rp6Flq6vEBaX1tlqLVsWkdbjxhQv62d7s/ytZyz68+ui1Yo/mhhqmeXaXZ7zfacM/M6mgj2y5KapZxlqQ/zqUAS+OMIm6YM5yMtKrSCcMBmbEv2mjfDuarcZqebadppp+3S7RGIZIMfNsYAY069nnrodXb4o/WEU6pT0JGCH+bNBj+8biv4YZ5qwQ/T23vATvlgmY4mysAZZdVSW2WotWy19VWb1ppBdt5aztn5sutKg8Hba5ldj88VUEABBYon0KkAiOup0EXBheYIfMoQ/HRnFRwy9Ho6dLfxl7pYdthhh3jxve7Ml9tWQAEFFFCg2QQ6FQCdeuqpYZpppolnxdCiYaotwHWF+Mt2wdRewncVUEABBRRQoCsEOjUGiLObOKOKwcJnn3127OLpiky6TgUUaDwBxqZ1NJWpK7JZytnRunQ5BbpLoFMBUMo03WCcXcNp0JVjV9I8PiqggAIKKKCAAkUR6FQXWCoEt0rg+jrcaoIbUZoUUECBtgSyVwlva97K9//2t79VTirs62YpZ2ErwIwp0IpAXQIgTg8ea6yxDH5aQXayAgr8WYCrjzdDapZyNkNdWsbGEuhwAMTNMun24gJ3nM7NRfZMCiiggAIKKKBAGQQ6HABxoTuuSMxF57igoEkBBRRQQAEFFCiLQIcDIAqYvct2WQpsPhVQQAEFFFBAgaa6F1hXVDe3t+CeYvvtt1/LjTi7YjuV62TcFZcg4H5b2bvbp/m4hxk3Qn3sscdabmaa3qPljjxz77IffvghTR7mkXkuvfTSkPdNP88999x4cc1hMlPlxfPPPx+vql3lLScpoIACCijQpoABUJtErc/ADTKPPvrosPPOOwduEppnsMB2X3nllXj7i8UWWyzcc889LRnt27dv+Oabb+INOwmACM5SOvPMM+PNS7fddtswySSTxBu1ZoMgysTNUzlzhVubzDHHHGnRLnvkRqUpffzxx/FGq+l1a49cW+X9999v7e26TM+61GWFrkQBBRRQoDACdbkOUGFKk3NGuDs7dwenpSTPRJBACxC3ICFtvvnm8fk///nPeCkC7jTPPHRR8rjooovGm79yl3fu93XzzTcHLl1AmmmmmQK349hll13i/dyYzt3rDz744Ph+V/878cQTAwEctwYpUuLedltuuWW8432R8tVIeWmWm4Q2Szkbad+0LM0h0GPo/akOaY6i1reUd999dwx8aIXgRpkMBOdu6zfeeGMMRi688MIwxRRTxCCElprrr78+3gmeaeku6Z9//nnsvmJZginuPj7ZZJPFlo1+/frFDE866aR/yjjzc9mBlE4//fTYysPNV1k3XV/cbX7ZZZcNl112WZh++unDkksuGbfP7Uuo8nQl3SeeeCI8++yzYdNNN433KLvvvvvCDTfcEIOhHj16pE0M80hwQPm5Weijjz4arrvuujDRRBOF8cYbL843aNCg8PDDDwe++Lk/XLop7HvvvRfouuKu8Lz32muvhZ122ilMOOGE8e7tU045ZbyT/MsvvxwI4lK65ZZbYgsXrWzMQ6KFi0COG6+2lR9cH3/88XDHHXcETkmeYYYZ4jqoF1rtuHP8JZdcEtdD/TA/N67lCudc14qyYU5Zmfbmm2+G6aabzot+RsWO/xswYECHFyZwL0tqlnKWpT7MpwJJwC6wJDGcj1yIjRYYupFWWWWV0LNnz7DNNtvEbqezzjorHuA5kJ9//vmB15tttlkMBOacc87YdfXhhx/G+4IRABCkELAQpNBdRXDw7rvvxlYagqTW0gcffBAvQ0DgRRCREtsj0FlqqaVidxI3XyVx3zYCJMYtpUTAxRW8SVdeeWUMrMjTMsssE8vy6aefpllbHgnmVlxxxbDvvvsGghqCgvnmmy8MHjw4BjfzzDNPWHDBBWNX2iKLLBIDC4IN8kPLEtu56KKLYssPwctCCy0UZp555hig9enTJwaDaWO0bhFg0q1Htx2BJVZse/vtt4+z1coPM1AebtxLi87xxx8fAyEu40CdsJ6rr7465nvppZeO5RlhhBHC4osvHoNXHMYdd9zY0kcZCIzuv//+2NKW8uijAgoooED5BAyAOlhnBDyjjTZaGHnkkWMrAS0FBEDc6JQDLq0N8847b9h1113D3nvvHQOPlVZaKY6p2WOPPWLwxHWU6JJinr///e8x6OBgz5iiI488MrY80ErSWqL1iTE6tOKwjZToxlpjjTViSwUtUl988UV8i1aM1VZbLQZkjKF566234sGc4IMgiKDgqKOOivd14yD/6quvhnPOOSettuVx3XXXjQEfAQRBCOOPuAXKbbfdFq8JxXRaqGgRotWG4IMrhffu3Tu2CNFVxyDr1MpDEEmQQWsOAVNKBCbkEzdao2jZ4pc/86611lpptlArP8xE+QguySMBH15cxmG55ZYLtLBRH/vss0+sC4JD5qPVipYhWpwIMBlUfvvtt8fyEcRR9yYFFFBAgfIKGADVse44UPKXLg9A9wrdPdnbgzDeJV0an5YG/lLioEsAlBIBC60drSWChd133z12WdGqwrZI66yzThzTQ3cT61h55ZVbVkGLFHngoM/ZYwQYjBGi24eUuocIONZcc82QuuJaVvC/J+SbLjAS8xJMEDQR+BDA0S1HSxbzDRkyJM6HTeoOixP+9y9rkH1OdxwtZiktvPDCsbWI19n50utq+eE98kPAResRrVQpP6wjux7qLTsgm2VT2nrrrWP3HD5PPfVUDADTez4qoIACCpRP4I+jbfnyXvgcp3E22dPUaXFI0ysLkA1+Kt+r9ZruJoIQWqPoWqO1gi46gg3GFtGqQWsPafTRRw8HHHBAuOCCC2KgxlW8ObinfL3wwgstm6Jlhi6q9iQCKVpXPvroo9jVRRBGy0x7ypQNQrLbYvpdd92VndQSvAwzscqLlB/e2nDDDeOlALhqebUxVVUWj5Oy+aLFj5Yj1kFgSGuQSQEFFFCgvAIGQJ2oO7q76OJJiatiE1CkRLcPf1yvJyVafzjNnMT82UTrQ3Z53q+ch/nZZrZliAHHnMlF1w0Hbbqz0v2H6Iai1YLBvdlE9xndPgRCBE4kXtOKkxKDo1Ne07TsI+UnkR9aRej64qw4bopLNxXjhziVPJWp0odlaRX69ttvW+ahJSq1RjHOiKDjmmuuiQ48p2uRlJ0vThj6r1p+CIToSkvlf/vtt1u2xZl02USwl7wZK5Vaisg/LVrk9fDDDw977bVXrpc8yObR5woooIAC9RHwLLAOOhLIcAo3ZwRxlhXdXCeccEI824jWgrnnnjuOIWHg77HHHhufP/TQQ/G0dMbZEKQcc8wx8UwoBt9y/R2WpwWH8SoMoOZihQQ0vE5njpFdWpQYOExLDy07HKgZQ0RrC11QjFuh64rAgm4kxv2k6/kQONEVRPfUxRdfHMfUJALWSR7pOmM5BmCTVwKrykTeOa2e0+wJnBizxPIEXKyfsUd0SREAsS5akxhPxLgiurUYg0NieQYm40cAwz3lGNxNNx1jmQhK6K477rjjYnDCc/J18sknx+CIs7EoW2v5IWihVYtB4Sw344wzxgHY5Iduw4EDB8bT/skX28ZwhRVWiPm59tprY0sPQSxngBEEUbcEUXSrOQ6ocq8YvtfNcnZUs5Rz+GrfuRXofgGvA5RDHdCqwEGT8TjZs7U6s2laJQgu0nijauuiJYbT0LOJ4IkWoexp9Nn3eU5gQwsLB/vWEuslQJh99tlj0EMXXErkjTITOPHIX62uMMYu0TXXWqJljECoVllr5Yf1Elylli7WVS2oq9w+BuSdwdC0FvFHsNlaF2bl8r6uLUAQ3tHE2YJlSc1SzrLUh/lUIAn8+ad9esfHugnQijPttNPWbX2siICjVkDAPJXBD9M4Xb2t1N6WDQKCagFdNhii7PzVSrWCH5ajFac9qbX8sGwKfnjenuCH+dKgap4TwPFn8IOGSQEFFCi/gGOAyl+HuZeACxAytoYWILqwujsVLT/d7eH2FVBAAQXaFrAFqG0j56gQYKwPY5ZIbbVCVSzaJS+Llp8uKaQrVUABBRSoq4ABUF05m2Nl1a7l050lL1p+utPCbSuggAIKtE/ALrD2OTmXAgoooIACCjSQgAFQA1WmRVFAAQUUUECB9gkYALXPybkUUEABBRRQoIEEDIAaqDItigIKKKCAAgq0T8AAqH1OzqWAAgoooIACDSTgWWA5Vib31uLqwq+99lrc6sQTTxwWWGCBcNttt8XpXGiPm5jONNNMYdCgQfE2DFy1eJ555mm5lUWO2e3yTXEF6DvvvDM89thj8VYXHd0gtwSZaKKJwuSTT96yCty4ESzXK5ptttnCkksu2fIeT1iG+uA9biabvVgj91HjnmhcMHHllVeOt7zg1iVcIXuqqaYaZj2+UEABBRQop4AtQDnVGzcd5aah3DrihhtuCFtttVVYZJFF4tWauZdV37594/2mCH5I3FOL2zdwIE738aqVVW4X0d2JW3MMT+LKzW+88Ua49NJLh2exlnm5FtFJJ50Ur27NzV1T4vYV3EiVW4/suOOO8YKN3NMsJay/+eabWAcEX/vtt196K9x9993h6KOPjneyJzjiGkPc2oN7lRE0vfjiiy3z+kQBBRRQoLwCBkA51B3BD607tDRwm4nDDjss3lfqwQcfjFsnKOLeRtyINJvee++9sP/++2cnVX3OTVlfeumlqu/lNZEWFwKL4UkEKLRudTTR6rPbbrvF24Jk10GAiQemtOxsu+224eCDDw4ESdwfjZugzj///LFOuNs9d4tPiYCKeuJ2HjPPPHO80CM3hyWtueaa8W7w3OXepIACCihQbgEDoC6uP7p5uJM5B9qUuBs69+QiMEqJgy53JH/88cfTpHineW5cmhLdYnfddVe8C/3rr78eJ1922WXxbuzcDuKRRx6J0wicOMhzV3buVp8Sd4K//PLLwxlnnDFMwMR6mU6LE91GLEfrTGuJVhPWwQ1eSXTrbbzxxrErj2Wr3R6jWt5bW//wTK92rzG6trJdVdzNnft60d3G/dG4cOIRRxwRN8ONKjfbbLOWTRKMnnDCCTFQoiuMVq1skLb55pvHALZlAZ8ooIACCpRSwACoi6vtuuuuC1NPPfWfbujJgZSghbElJO7Sznznn39+fP3www/H7pf4Yug/AikOxARKBFN0n9Hqsswyy8RHumposeCAve+++4YNNtggjou544474ioIhNZaa62wxBJLxJaMDTfcMJx++umxW+7QQw8NW265ZTzwszwtVAQ41RItJHQDMXZptdVWi8ESQcjiiy8eW0vIz7jjjjvMoq3lfZiZ6viCgOfll1+OLqyW1hzGWyXrs846K5x66qlhqaWWimOtDjzwwJat77nnnjGgo3z77LNPwC87Pgjjf/3rX/Gu8C0L+UQBBRRQoHQCBkBdXGW06nDwrUypRYjxLwyKnn766cMWW2wR+vXrF4OdK664Iqy33notizEOhZaKscYaK44PYjwR42cmmWSSOA+PBB7vvvtuoGuNda600kphueWWi+/vtddeYZVVVokDhZn373//e/xjkDDbZX0c/Hm+9tprx4HJLRv/35MPP/wwBjw9e/YMb731VphxxhnDmWeeGQcL06oy0kgjhSmnnPJPwV5rea9cf71eE5jR4kOAx9goghhacwheSASBa6yxRmwdu/HGG8MXX3zRsmnubs/4LLrn6Ep74YUXWt7jyRRTTBFbx1IL3DBv+kIBBRRQoDQCBkBdXFUfffRRGH300f+0FQY5c4YR3WAXXXRR7EIi+KC15JJLLoldMIxxSYnAZ+edd44HdbptaJUYMmRIerullYKDPGNf5pprrtgSNN1008V5Hn300TiQNy2w6KKLxi4vWnNSC0d65OBfbVD1k08+GYM5zqjij1YiWrHaSm3lva3lh/d9vCnvO++8E/bee+8Y4NBaxhl2pHXWWSfssssugbE9lJV6SAn7Dz74ILbIpaCRQDMlAisCWgJNkwIKKKBAeQUMgLq47qaddtqqY2LY7CabbBJeeeWVeHYRLUC0Liy//PLxoL3qqqsOkzMCKbpsOHivu+66cQBvdoYUvBBA0SpzzTXXBMYHbbfddnG2scceOx7U0zK0ArEM09ubyB8tH5wRRRDAH4Hc4MGD4ypSHirX11beK+evx2u6wU455ZTY1cWZXQQ8k002WaAVi7O5CIZotWLM03PPPRdbtNjuOeecE7v2KMtBBx0UuxrpxswmAs/U8pad7nMFFFBAgfIIGAB1cV1xCjsBQLVEFxUtEJxdlBJjcWitqAyAbr311tgqRDcXZyExVoeuJdIoo4wSvv322/iaAcoc4OkGuuWWW2IrB/PQnXbfffcFThEnMeaILiwGZKf1xDeG/mNQc7XEwG0GR++xxx6BAdUEW7QCcWbbqKOO2tIiVbm+WnlnfSlPaZu33357y7ronrr++utbxvOkedIjy1Yun97j8eKLL46DtTkLjERgw1ggusRIBHAMNCe4I3G2Xjrri9fTTDNNmHfeeXkaE2WmdSx1p6XpPiqggAIKlEugxyFDU7myXK7c0gJ03nnnxUHJI4888jCZ50J7AwcOjKdyE0SQmJ+zq9IYobQAB2rOsGLMCt0wBEAENMsuu2wMfo4//vjYMkNAxbVv2BbrXn311eMBfuGFF45Bzz333BMDHlo6CF5ozWFgM4FTr1694usjjzwyjo/p3bt3bCVJeSB4YIwPLSOnnXZaHAy99dZbx5YgxgVde+218eKNtL5kxz21lneCLwYkc+bbLLPMEliO1hW67wi2uCYSZ7NxhhnX5OH9bCI4orWLM+MoL/OnAdi0VFGOMcYYI5aTIJFEdxxlYKwVQSOGBIvpWku0DDH25/PPP4+tcwSDlDElxhNRbwwyN3VOYMCAAR1eQbpeVodXkOOCzVLOHEndlAJ1ERhh6K/n/zYJ1GV1rqSaAFd6ZhwJ16ypTLQoVI4RqjaN5WhZobo4APPIHy0WpLQMLSpMIzigi6cycTo6rTeMDWqty6pymcrXBAV0e1Wun+nkicHQlalW3ivnZWB2tmuOIIUWJs50a0/iIocEkZzSzhlgrSWuCcRp8dUS66BeKpdnnNYhQ38zZE+zr7a809oWYCxbRxPj3MqSmqWcZakP86lAErALLEl04SNdXZNOOukwY3DS5iqDH6ZXm8Z0DsYEPySClxT88Dotk6ZVBifMQ6I1hvFGHQ1+WActUNXWz/RqwQ/L1Mo772dTNvghWGNQcnuDH9ZDKw+tSJXBS3YbPG8t+OE91lG5PNcOYiyRwQ9CJgUUUKDcAgZAOdUfA5dT8JLTJhtiM3RpZS9U2F2FokuMrri55567u7LgdhVQQAEF6ijw3+aEOq7QVbUuQKuEqZwCXB/IpIACCijQOAK2ADVOXVoSBRRQQAEFFGingAFQO6GcTQEFFFBAAQUaR8AAqHHq0pIooIACCiigQDsFDIDaCeVsCiiggAIKKNA4AgZAjVOXlkQBBRRQQAEF2ilgANROKGdTQAEFFFBAgcYRMABqnLq0JAoooIACCijQTgEDoHZCOZsCCiiggAIKNI6AAVDj1KUlUUABBRRQQIF2ChgAtRPK2RRQQAEFFFCgcQQMgBqnLi2JAgoooIACCrRTwAConVDOpoACCiiggAKNI2AA1Dh1aUkUUEABBRRQoJ0CBkDthHI2BRRQQAEFFGgcAQOgxqlLS6KAAgoooIAC7RQwAGonlLMpoIACCiigQOMI1CUA+uCDD8IDDzzQOCqWRAEFFFBAAQUaWqDTAdBvv/0Wdthhh/DII480NJSFU0ABBRRQQIHGERixs0W57LLLwsILLxx+//33zq7K5RVQoIkExhxzzKYobbOUsykq00I2lECnAqBnn302TDXVVOHHH38MX3zxRUPBWBgFFOhagaWXXrprN1CQtTdLOQvCbTYUaLdAhwOg7777Ljz55JNhu+22CwMGDGj3Bp1RAQUUQOCXX37pMMSII3b4q6vD2+zogs1Szo76uJwC3SXQ4W+RU089NbzxxhvhhRdeCK+++mpsBZp88snDpptu2l1lcbsKKFAigVtvvbXDue3Tp0+Hl817wWYpZ96ubk+Bzgp0OADq27dv+Prrr+P2+/XrFwYNGhR69+7d2fy4vAIKKKCAAgoo0OUCHQ6Axh133MAfaYIJJoiDoHk0KaCAAgoooIACRRfocACULdjWW2+dfelzBRRQQAEFFFCg0AKdvg5QoUtn5hRQQAEFFFBAgSoCBkBVUJykgAIKKKCAAo0tYADU2PVr6RRQQAEFFFCgioABUBUUJymggAIKKKBAYwsYADV2/Vo6BRRQQAEFFKgiYABUBcVJCiiggAIKKNDYAgZAjV2/lk4BBRRQQAEFqggYAFVBcZICCiiggAIKNLaAAVBj16+lU0ABBRRQQIEqAgZAVVCcpIACCiiggAKNLWAA1Nj1a+kUUEABBRRQoIqAAVAVFCcpoIACCiigQGMLGAA1dv1aOgUUUEABBRSoImAAVAXFSQoooIACCijQ2AIGQI1dv5ZOAQUUUEABBaoIGABVQXGSAgoooIACCjS2gAFQY9evpVNAAQUUUECBKgIGQFVQnKSAAgoooIACjS1gANTY9WvpFFBAAQUUUKCKgAFQFRQnKaCAAgoooEBjCxgANXb9WjoFFFBAAQUUqCJgAFQFxUkKKKCAAgoo0NgCBkCNXb+WTgEFFFBAAQWqCIxYZZqTFFBAAQUUGC6B/v37D9f82Zn79OmTfelzBXIRsAUoF2Y3ooACCiiggAJFEjAAKlJtmBcFFFBAAQUUyEXAACgXZjeigAIKKKCAAkUSMAAqUm2YFwUUUEABBRTIRcAAKBdmN6KAAgoooIACRRIwACpSbZgXBRRQQAEFFMhFwAAoF2Y3ooACCiiggAJFEjAAKlJtmBcFFFBAAQUUyEXAACgXZjeigAIKKKCAAkUSMAAqUm2YFwUUUEABBRTIRcAAKBdmN6KAAgoooIACRRIwACpSbZgXBRRQQAEFFMhFwAAoF2Y3ooACCiiggAJFEjAAKlJtmBcFFFBAAQUUyEXAACgXZjeigAIKKKCAAkUSMAAqUm2YFwUUUEABBRTIRcAAKBdmN6KAAgoooIACRRIwACpSbZgXBRRQQAEFFMhFwAAoF2Y3ooACCiiggAJFEjAAKlJtmBcFFFBAAQUUyEXAACgXZjeigAIKKKCAAkUSMAAqUm2YFwUUUEABBRTIRcAAKBdmN6KAAgoooIACRRIwACpSbZgXBRRQQAEFFMhFwAAoF2Y3ooACCiiggAJFEhixSJkxLwoooIACCijQ/QL9+/fvcCb69OnT4WXzXNAWoDy13ZYCCiiggAIKFEKgUy1A3377bbj66qvDTz/9FDbccMPQs2fPQhTKTCiggAIKKKCAArUEOtUCdN1114W55porvPrqq2G//fartR3fU0ABBRRQQAEFCiPQ4RYgWn3WX3/9MMooo4Sff/459OvXrzCFMiMKKFB8gVlmmaX4maxDDi1nHRBdRe4CzbDfjvD70NQZ2YEDB4Zdd901nHzyyWG66abrzKpcVgEFFFBAAQUUyEWgwy1AKXfPPfdcGDx4cNhoo43CE088kSb7qIACCtQUeOedd2q+X+vNXr161Xq7UO9Zzraro0z12XZpGmOOZthvOx0Arb322oG/eeaZJ3z88cdh0kknbYzatxQKKNClAi+++GKH11+mA6blbLuay1SfbZemMeZohv22U4Ogs9XMDjzxxBNnJ/lcAQUUUEABBRQopECHA6Cvv/46LLroouHiiy8OTz75ZDjyyCPDX/7S4dUVEsdMKaCAAgoooEBjCnS4C2zssccODzzwQFTp0aNHY+pYKgW6QaAZrsDaDaxuUgEFFBhGoMMBEGsx8BnG0hcKKKCAAgooUBIB+6xKUlFmUwEFFFBAAQXqJ2AAVD9L16SAAgoooIACJREwACpJRZlNBRRQQAEFFKifgAFQ/SxdkwIKKKCAAgqURMAAqCQVZTYVUEABBRRQoH4CBkD1s3RNCiiggAIKKFASAQOgklSU2VRAAQUUUECB+gkYANXP0jUpoIACCiigQEkEDIBKUlFmUwEFFFBAAQXqJ2AAVD9L16SAAgoooIACJREwACpJRZlNBRRQQAEFFKifgAFQ/SxdkwIKKKCAAgqURMAAqCQVZTYVUEABBRRQoH4CBkD1s3RNCiiggAIKKFASAQOgklSU2VRAAQUUUECB+gkYANXP0jUpoIACCiigQEkEDIBKUlFmUwEFFFBAAQXqJ2AAVD9L16SAAgoooIACJREwACpJRZlNBRRQQAEFFKifgAFQ/SxdkwIKKKCAAgqURMAAqCQVZTYVUEABBRRQoH4CBkD1s3RNCiiggAIKKFASAQOgklSU2VRAAQUUUECB+gkYANXP0jUpoIACCiigQEkEDIBKUlFmUwEFFFBAAQXqJ2AAVD9L16SAAgoooIACJREwACpJRZlNBRRQQAEFFKifgAFQ/SxdkwIKKKCAAgqURMAAqCQVZTYVUEABBRRQoH4CBkD1s3RNCiiggAIKKFASAQOgklSU2VRAAQUUUECB+gkYANXP0jUpoIACCiigQEkEDIBKUlFmUwEFFFBAAQXqJ2AAVD9L16SAAgoooIACJREwACpJRZlNBRRQQAEFFKifgAFQ/SxdkwIKKKCAAgqURMAAqCQVZTYVUEABBRRQoH4CBkD1s3RNCiiggAIKKFASAQOgklSU2VRAAQUUUECB+gkYANXP0jUpoIACCiigQEkEDIBKUlFmUwEFFFBAAQXqJ2AAVD9L16SAAgoooIACJREwACpJRZlNBRRQQAEFFKifgAFQ/SxdkwIKKKCAAgqURMAAqCQVZTYVUEABBRRQoH4CBkD1s3RNCiiggAIKKFASAQOgklSU2VRAAQUUUECB+gkYANXP0jUpoIACCiigQEkEDIBKUlFmUwEFFFBAAQXqJzBiZ1Y1ZMiQcP3114cRRxwxrLnmmmGUUUbpzOpcVgEFFFBAAQUUyEWgUy1AJ510Uvjll1/C1VdfHdZff/1cMuxGFFBAAQUUUECBzgp0uAXok08+CX379g0TTjhhWH311UOvXr3Cr7/+Gnr06NHZPLm8AgoooIACCijQpQIj/D40dXYL7733Xth5553DTTfd1NlVubwCTS/w448/dtigTN3QlrPtarY+2zZyjq4RaIbPZ10CoGOOOSYsueSSYYEFFuiamnCtCiiggAIKKKBAHQU63AWW8vDMM8+EWWed1eAngfioQCcFHnzwwQ6vYfHFF+/wsnkvaDnbFrc+2zZyjq4RaIbPZ6cCoIEDB4ZPP/00rLLKKrEGXnvttTDzzDN3TW24VgWaROCrr75qipJazsaq5mapz8aqtdZL0wz12eGzwBgE3bt377D77ruHGWaYIUw22WRh0KBBrWv6jgIKKKCAAgooUBCBDrcATTLJJOGNN94oSDHMhgIKKKCAAgoo0H6BDrcAtX8TzqmAAgoooIACChRLwACoWPVhbhRQQAEFFFAgBwEDoByQ3YQCCiiggAIKFEvAAKhY9WFuFFBAAQUUUCAHAQOgHJDdhAIKKKCAAgoUS8AAqFj1YW4UUEABBRRQIAeBDp8Gn0Pe3IQCwwj0799/mNftfdGnT5/2zup8CiiggAJNImALUJNUtMVUQAEFFFBAgT8EDID+sPCZAgoooIACCjSJgAFQk1S0xVRAAQUUUECBPwQMgP6w8JkCCiiggAIKNImAAVCTVLTFVEABBRRQQIE/BAyA/rDwmQIKKKCAAgo0iYABUJNUtMVUQAEFFFBAgT8EDID+sPCZAgoooIACCjSJgBdCbJKKtpgKKKCAAp0X6OgFWdmyF2XtvH8912ALUD01XZcCCiiggAIKlELAAKgU1WQmFVBAAQUUUKCeAgZA9dR0XQoooIACCihQCgEDoFJUk5lUQAEFFFBAgXoKGADVU9N1KaCAAgoooEApBAyASlFNZlIBBRRQQAEF6ilgAFRPTdelgAIKKKCAAqUQMAAqRTWZSQUUUEABBRSop4ABUD01XZcCCiiggAIKlELAAKgU1WQmFVBAAQUUUKCeAgZA9dR0XQoooIACCihQCgEDoFJUk5lUQAEFFFBAgXoKGADVU9N1KaCAAgoooEApBAyASlFNZlIBBRRQQAEF6ilgAFRPTdelgAIKKKCAAqUQMAAqRTWZSQUUUEABBRSop4ABUD01XZcCCiiggAIKlELAAKgU1WQmFVBAAQUUUKCeAgZA9dR0XQoooIACCihQCgEDoFJUk5lUQAEFFFBAgXoKGADVU9N1KaCAAgoooEApBAyASlFNZlIBBRRQQAEF6ilgAFRPTdelgAIKKKCAAqUQMAAqRTWZSQUUUEABBRSop4ABUD01XZcCCiiggAIKlELAAKgU1WQmFVBAAQUUUKCeAgZA9dR0XQoooIACCihQCoERS5FLM1lToH///jXfb+3NPn36tPaW0xVQQAEFFGhoAVuAGrp6LZwCCiiggAIKVBNo6BYgW0aqVbnTFFBAAQUUUMAWIPcBBRRQQAEFFGg6AQOgpqtyC6yAAgoooIACBkDuAwoooIACCijQdAIGQE1X5RZYAQUUUEABBQyA3AcUUEABBRRQoOkEOh0Afffdd+G+++5rOjgLrIACCiiggALlFejUafCDBg0KRx55ZHjvvffC0ksvXV4Fc66AAgoooIACTSXQqQBovPHGC8stt1w477zzCok2/vjjFzJf9c6U5ay3aPeuz/rsXv96b936rLdo967P+uxe/3pufYTfh6bOrPCOO+6IAdC1117bmdW4rAIKKKCAAgookJtAp1qAcstlBzfE+KSOpNFHH70ji3XbMpazNn2z1CcKZSprR/dby1l7f++ud63PtuX9fLZtlOccDR0A3XPPPR2yLNtNQi1n7WpulvpEoUxl7eh+azlr7+/d9a712ba8n8+2jfKco9NngeWZWbelgAIKKKCAAgrUQ6BTAdBXX30VGAM0YMCA8Prrr9cjP65DAQUUUEABBRTocoFOdYGNM8444eSTT+7yTLoBBRRQQAEFFFCgngKdagGqZ0ZclwIKKKCAAgookJeAAVBe0m5HAQUUUEABBQojYABUmKowIwoooIACCiiQl4ABUF7SbkcBBRRQQAEFCiNgAFSYqjAjCiiggAIKKJCXgAFQXtJuRwEFFFBAAQUKI2AAVJiqMCMKKKCAAgookJeAAVBe0m5HAQUUUEABBQojYABUmKowIwoooIACCiiQl4ABUF7SbkcBBRRQQAEFCiNgAFSYqjAjCiiggAIKKJCXgAFQXtJuRwEFFFBAAQUKI2AAVJiqMCMKKKCAAgookJeAAVBe0m5HAQUUUEABBQojYABUmKowIwoooIACCiiQl4ABUF7SbkcBBRRQQAEFCiNgAFSYqjAjCiiggAIKKJCXgAFQXtJuRwEFFFBAAQUKI2AAVJiqMCMKKKCAAgookJeAAVBe0m5HAQUUUEABBQojYABUmKowIwoooIACCiiQl4ABUF7SbkcBBRRQQAEFCiNgAFSYqjAjCiiggAIKKJCXgAFQXtJuRwEFFFBAAQUKI2AAVJiqMCMKKKCAAgookJeAAVBe0m5HAQUUUEABBQojYABUmKowIwoooIACCiiQl4ABUF7SbkcBBRRQQAEFCiNgAFSYqjAjCiiggAIKKJCXgAFQXtJuRwEFFFBAAQUKI2AAVJiqMCMKKKCAAgookJeAAVBe0m5HAQUUUEABBQojYABUmKowIwoooIACCiiQl4ABUF7SbkcBBRRQQAEFCiNgAFSYqjAjCiiggAIKKJCXgAFQXtJuRwEFFFBAAQUKI2AAVJiqMCMKKKCAAgookJeAAVBe0m5HAQUUUEABBQojYABUmKowIwoooIACCiiQl4ABUF7SbkcBBRRQQAEFCiNgAFSYqjAjCiiggAIKKJCXgAFQXtJuRwEFFFBAAQUKI2AAVJiqMCMKKKCAAgookJeAAVBe0m5HAQUUUEABBQojYABUmKowIwoooIACCiiQl4ABUF7SbkcBBRRQQAEFCiNgAFSYqjAjCiiggAIKKJCXgAFQXtJuRwEFFFBAAQUKI2AAVJiqMCMKKKCAAgookJeAAVBe0m5HAQUUUEABBQojYABUmKowIwoooIACCiiQl4ABUF7SbkcBBRRQQAEFCiNgAFSYqjAjCiiggAIKKJCXgAFQXtJuRwEFFFBAAQUKIzDC70NTYXJjRhRQQAEFFFBAgRwEbAHKAdlNKKCAAgoooECxBAyAilUf5kYBBRRQQAEFchAwAMoB2U0ooIACCiigQLEEDICKVR/mRgEFFFBAAQVyEDAAygHZTSiggAIKKKBAsQQMgIpVH+ZGAQUUUEABBXIQMADKAdlNKKCAAgoooECxBAyAilUf5kYBBRRQQAEFchAwAMoB2U0ooIACCiigQLEEDICKVR/mRgEFFFBAAQVyEDAAygHZTSiggAIKKKBAsQQMgIpVH+ZGAQUUUEABBXIQMADKAdlNKKCAAgoooECxBAyAilUf5kYBBRRQQAEFchAwAMoB2U0ooIACCiigQLEEDICKVR/mRgEFFFBAAQVyEDAAygHZTSiggAIKKKBAsQQMgIpVH+ZGAQUUUEABBXIQMADKAdlNKKCAAgoooECxBAyAilUf5kYBBRRQQAEFchAwAMoB2U0ooIACCiigQLEEDICKVR/mRgEFFFBAAQVyEDAAygHZTSiggAIKKKBAsQQMgIpVH+ZGAQUUUEABBXIQaOgA6NVXXw0XXHBBuPjii9uk/PXXX8NBBx0Ufv/992Hmfeutt8Kxxx4bvvvuu2Gmd+eL77//Plx22WXh0UcfzSUb9957bzj//PNz2dbwbiRbb59++mk4+eSTwyOPPDK8q8l1/gceeCDul/XY6CuvvBIuv/zyuKqnnnoq7LPPPuHnn3+ux6o7vI5LLrkkvP76620un627tmZ+8803wzHHHBN++umnllkp57XXXhvOOuuslmnd/YTvG/I6PGXr7jy7/e4TKOLxpfs08t9yQwdARxxx3+7eSQAADlFJREFURNhiiy3CCCOM0KbsX/7ylzD11FP/aV6mnXrqqeG3335rcx15zTDqqKOGJ598Mrz88su5bPKvf/1ruOaaa3LZ1vBuJFtvE088cXj++efDwIED42o++eST8MMPPwzvKrt8fvbHK6+8slPbeeedd+LyY4wxRphoooni8znmmCOce+658eDbqZV3cmHyQ77aStm6Y96333671UUmn3zy+EPkl19+aZmH/ZLPwm233dYyrbufsA9S9lplK+p+2d12zbj9Ih5f6lkPRd/XGyYA+vHHH8PgwYNj3dGKwwHigw8+iAfATTbZJP4q5sszzcOMn3/+eUtdc1DaeOONh2kB+vrrrwOtLaOMMkr8QmuZOYcn//nPf/4UdH300UexHOR1nHHGibngF3H2VzETK5fFg7ITDKRADq/KRFkr08cffxwPMnyh552++uqrltYMflFTH5T1//7v/+JzWgAq661nz54xm5Rl3XXXDV988UVcB8ullhEeswfSvMs17rjjtmxyyJAhLc/Tk2r1kOqeeY4++uhw0003xfJMNdVUYdFFF42LjjzyyLGu0np4rLau7Ptd8XyFFVYI448/fsuq2e8oJ/tesudzmK27F154Ieyyyy7DBKzZ/ZhAZ7TRRovr/Pbbb1vWPfbYY7c8T0+6o8xp2yuttFIYb7zxWi1bdr9Mn0X25/Sc9WSNeM6+y/7P55jn2XnTdrv6kTrkM/PNN98Ms6laeefzyueP7xqW5Tn5TylbT6yXfYQypu/dNN+XX34ZW+C7o9zZzx35IQ+fffZZylp8zB57mJCtvzRjpRPTUzm74/iS8pUeW/t+5f20/1EP2cR0Wt2z9ZItZ7V9Pbt8EZ7nf1TrglLT5N6vX7/Qv3//cN5558UDJM/5wN1xxx3h2WefDdNNN104/vjjwyKLLBJbCHbYYYfw/vvvh759+8Yc8QXMr8z0oTzttNMC62U9BAF5pq222ioQOe+7777htddei5s+5JBDYkC3/fbbt2TloYceCocffnjggJNaPSqX/fDDD8M888wTfz3vvffeYc455wyXXnppOOCAA8LKK68cv5z4IF5xxRXhoosuCqusskr0Y0febrvtwnPPPRe7azhY5ZnojqSraL755guUmYB2yy23jHmkO3K11VaLdVNZbymP1C3dMHfeeWc0nGKKKeL8vH/ccccF7Loz8cVx5JFHhvXWWy92Z5IXWq/OOeec8PDDD8dgILWIZOue/ZP3X3rppUD3F11+BHqVqVqdVs7TFa/5kqRurrrqqkCgQvn4jK2zzjrhzDPPHOZzSJdd+szRnfvee+8FultJlftxyivd0csvv3wg0KgM4rurzClvHBiXXXbZuM9l98ts2bL7JQf966+/Ptx6661hjTXWiF3adNtnv6v4DsKIHwMccHDJ/ohL2+6qR74HNtxww7D55psHWtTnn3/+WK9sr6280xXdq1ev+H3CQXKPPfaIdVZtP2ddGDDf6aefHrfHNvg+f+ONN8KBBx7Yrm5VlqlXyn7uWCffhXzfcjxZeumlY1BXeeyprD/8Kp1YV3ceX9h+ZSJArfb9OmDAgFiHhx56aDweLLDAAnFffPfdd+PnmTo+5ZRT4uoqy1m5r1duswivGyIAGmussULv3r1jRTE2hoh6ueWWi79C11xzzTDvvPMGfnUTKKRuo7XWWitMP/308SDLL00Cg5FGGinWCV+sHDj5RUrrEa0teQUA/AJacMEFw2yzzRa/8P7973/HR7o2yN/BBx/cst9QLj6QfClx4Ki2LF+ek002WVh88cXjh47uiWmnnTYGg6yIPmiCH4JFyslBi3XRrUAeCJIwyDMRzN1///1h9dVXD1tvvXWgW4GDwgwzzBCzQV1Sd6RsvcUJ//s300wzxRaDtddeO9A1tOeee4bHH388vovjUkstlZ099+djjjlm2H///QNfsgSeJL5kqEsO8NjzJclBL1v3tILMPPPMgS+iueaaKz5Wa82qVqdxI138j7ph/yJRRuqHffD2228PO+200zCfQ8qaPnPsy5NMMkkMwKvtxynbu+66awwUOFDyGc2m7ipzygOfLQJtUna/zJYtu1/yPXPDDTfEAIGuEMYqzjLLLMMYEfDQysdngnpeeOGFW1p/03a78pFWRfY18sd3T/rhQQDfVt757uB7h+56WpD4Tua7utp+Pvvss8di9OjRI34GUpn4/uO7gB8LE0wwQZrc5Y+Vnzs2yIGeH6Uca9gPSZXHnsr6I3CtdOrO40vMdJV/1HO179cZZ5wx7m+bbrppuPrqq8M000wTbr755kCAT7k4pvIdW21/yO7r1Vpqq2Qj90kNEQDxRZoOFpW/CpMoX7SpEviCpgWASJ4vaT6cpBTk3HPPPS3zMp3m97wSeZh00knD2WefHejOIW/knYMgXygXXnhhS1b4siCNPvrocb5qy/I+3VdpXgzSc8pFwEMrGV+sG2ywQQwI+XV94403tnyZM1+yYX1dnThgDho0KH7IaBYniK2VauUtvUcrBK2EdIvSRdHdacQRR4xZoO7YZznoE3im4IH6uOuuu1qt+1Su1spRrU5bm7fe07N544s1dUuyneznkNfZedNzHis/A8xL4vPK+/wCp2Ugm7qzzCkf2a7iVB7eyz5Pr/meoZx87ji40vpHqjSi5faEE04It9xyS2wliTPl+I+8p/zzXUALVHvzvuOOO8Zy0ZLOD5rW9vPWinPUUUfFctMynbVtbf56Ta/8ziX4fPDBB2PdsI0+ffoEPsPVjj3Z+qvm1J3Hl474UPeMtyMttNBCsaWWY9Hf/va3QID07tDWoGrlTNtK+056XaTHhgiAiMb5cPHriw9YStnnaRqPRLIcdGgFINJPifn5m3XWWePOTl8uKfW/p/m68pEDP7+Q6PZhpyM/NO3TAkKXB1E3O1y2bOl5tWUr85rtr03v8SGm5YzEB51mTQzYqUmUv9py8c0u+MdBk9YRuhRohUu/DvlioUmZRD5Tnih/Mshmh0AvzcNBmAMNf+wr3Zmy+U3P+ZIg8KS7hERX0oorrli17jkQUCe1UrU6rTV/vd+rVh/VtpHKny1Te/Zjmtcru/66u8yV5atWNuZJ+yX55fOczjDlAFst8QOAX9y0XtMaWoTU3ryzD5Nvxlum4LXafp79bPO9m/Yfhh/wY5VufM6wyytVfucylIDjC2cdkjjTj89oa8eelM9qTt15fEn5qvaYrYPs92t2XvZDvj+ffvrpOJSAk2MI0KuVk+XSvp5dR5GeN0QARDMkB0zGRjB25r777otfLPxCpKIYR8P4AnZeDhwcYHnO2V00W9O6wnwccPkVSb81/aF0RXHqLV/OTM8j8auC5kR2Kg72NDfy4SMQoA+aJmgGmNKsTKBC9x3POXDypVG57Isvvhj7znmf9fBlRFn4YqF/ly9dzpSjK4zxNvvtt1+Ye+65w2abbRbN6JfnjCVs0piUrnbgi+Uf//hHOPHEE+NYAMaOkPjgMeaF8QC0jDGWgHEkqd4o+zPPPBNdeJ/xGHR9cbAk7bbbbvFLLA0gjxO74R/N+nRpEMjS6oMrdcM+zHt84dNaxa9nAvRs3dMdsthii8XLErAs3UCMdWJdTzzxROzKvPvuu6vWaR5FpQ74LLFf8Zz9jrFcPK/8HGY/c7R8UY+05JKYP/sZ4D1+sPCZZR/gBwFdu+zLfM75fFfbj/Moc7VttFY2Dixpv6Q1he4DDoiMSeR1pRHr5iDCvkCXZ96JYISuY743qAOe85mjvtqTd747t9lmm2G6nKvt53T58pmlDpMdQROXOGDMIvs9gVNeqdrnjh+mdAPS+sH3DN29lccexgRljzW0kFQ6defxpZZfte/XFIhSLr6P6UHgBynHFS65wf7MsaJaOdlW2tfTd3Ct7XfHeyMMLeAfTSbdkYM6bZOKIHjgFz8furZSmp/it9ZERwBCsJTmbWud9XqfII0ykK9UHvLJL8W2Ti+utmx788UA23SmTVqGljJaovgSzivxBUhgR386ze10x3GAIPBLdZJc2spTmp/5+OKmPAQQRU4c/CeccMKW/bha3WfLVass1eq01vzd+R4HHfY19vvW9mMOksyTuhCr5bc7y8wPp2233Tb+eMrmLVs2pmfrrz35JTDkpATKXqTUnrxTl9W+P6rt5wRc1G36Dq+27+dV/mrbZhrlye5/6fhQ6zupmlPaB9LyeZWr1nZSnrJlIeChC5PvpHT8wYF8Mx/HyJRqlTPNU6TH/w5EKFKOOpiXtEOmD05bq0nztxb8sHyq2DRvW+us1/vZL4tUHvKZdr5a26m2bK35s+9VBj+8x4DyvBNnEzD2h19X6UsynVad6iS5tJU35qeFhF8v2BTponmt5b2ym6Na3SeH1taRpler0/Re0R5pgk+ptf2YVpK2UneUmSCdX/YcQGjNqEzZsvFetv5q5feioQPkadmji6FowQ/lqJV33idl6/K/U/77v9p+XulUbd/PrqMrn1fbNtMqjwfpda3vpGpOaR9Iy3dlWdq77pSnbFn40cFf9viDQ7X9sVY525uHPOdrmBagPNHcVtcK8MuC7j26hZZccsnY1dGZLRJMcd0cuvPacwDtzLZctjkF6Makq49uuqmHnjFVr8S4P7r16UowKZC3AF2fdLXT3UwXWZGCtXpYGADVQ9F1KKCAAgoooECpBNoeLFOq4phZBRRQQAEFFFCgbQEDoLaNnEMBBRRQQAEFGkzAAKjBKtTiKKCAAgoooEDbAgZAbRs5hwIKKKCAAgo0mIABUINVqMVRQAEFFFBAgbYFDIDaNnIOBRRQQAEFFGgwAQOgBqtQi6OAAgoooIACbQsYALVt5BwKKKCAAgoo0GACBkANVqEWRwEFFFBAAQXaFjAAatvIORRQQAEFFFCgwQT+Hxr6T8CDMgAiAAAAAElFTkSuQmCC" alt="" width="373" height="280" /></p> <p> </p> <p>The code they use in base graphics is this (super blurry sorry, you can also <a href="http://motioninsocial.com/tufte/">go to the website</a> for a better view).</p> <p><img class="aligncenter wp-image-4646" src="http://simplystatistics.org/wp-content/uploads/2016/02/Screen-Shot-2016-02-11-at-12.56.53-PM-300x82.png" alt="Screen Shot 2016-02-11 at 12.56.53 PM" width="483" height="132" srcset="http://simplystatistics.org/wp-content/uploads/2016/02/Screen-Shot-2016-02-11-at-12.56.53-PM-300x82.png 300w, http://simplystatistics.org/wp-content/uploads/2016/02/Screen-Shot-2016-02-11-at-12.56.53-PM-768x209.png 768w, http://simplystatistics.org/wp-content/uploads/2016/02/Screen-Shot-2016-02-11-at-12.56.53-PM-1024x279.png 1024w, http://simplystatistics.org/wp-content/uploads/2016/02/Screen-Shot-2016-02-11-at-12.56.53-PM-260x71.png 260w, http://simplystatistics.org/wp-content/uploads/2016/02/Screen-Shot-2016-02-11-at-12.56.53-PM.png 1248w" sizes="(max-width: 483px) 100vw, 483px" /></p> <p>in ggplot2 the code is:</p> <p><img class="aligncenter wp-image-4647" src="http://simplystatistics.org/wp-content/uploads/2016/02/Screen-Shot-2016-02-11-at-12.56.39-PM-300x73.png" alt="Screen Shot 2016-02-11 at 12.56.39 PM" width="526" height="128" srcset="http://simplystatistics.org/wp-content/uploads/2016/02/Screen-Shot-2016-02-11-at-12.56.39-PM-300x73.png 300w, http://simplystatistics.org/wp-content/uploads/2016/02/Screen-Shot-2016-02-11-at-12.56.39-PM-768x187.png 768w, http://simplystatistics.org/wp-content/uploads/2016/02/Screen-Shot-2016-02-11-at-12.56.39-PM-1024x249.png 1024w, http://simplystatistics.org/wp-content/uploads/2016/02/Screen-Shot-2016-02-11-at-12.56.39-PM-260x63.png 260w, http://simplystatistics.org/wp-content/uploads/2016/02/Screen-Shot-2016-02-11-at-12.56.39-PM.png 1334w" sizes="(max-width: 526px) 100vw, 526px" /></p> <p> </p> <p>Both require a significant amount of coding. The ggplot2 plot also takes advantage of the ggthemes package here. Which means, without that package for some specific plot, it would require more coding.</p> <p>The bottom line is for production graphics, any system requires work. So why do I still use base R like an old person? Because I learned all the stupid little tricks for that system, it was a huge pain, and it would be a huge pain to learn it again for ggplot2, to make very similar types of plots. This is one where neither system is particularly better, but the time-optimal solution is to stick with whichever system you learned first.</p> <p><strong>Grading student work</strong></p> <p>People I seriously respect suggest teaching ggplot2 before base graphics as a way to get people up and going quickly making pretty visualizations. This is a good solution to the <a href="http://simplystatistics.org/2014/08/13/swirl-and-the-little-data-scientists-predicament/">little data scientist’s predicament</a>. The tricky thing is that the defaults in ggplot2 are just pretty enough that they might trick you into thinking the graph is production ready using defaults. Say for example you make a plot of the latitude and longitude of <a href="https://stat.ethz.ch/R-manual/R-devel/library/datasets/html/quakes.html">quakes</a> data in R, colored by the number of stations reporting. This is one case where ggplot2 crushes base R for simplicity because of the automated generation of a color scale. You can make this plot with just the line:</p> <p>ggplot() + geom_point(data=quakes,aes(x=lat,y=long,colour=stations))</p> <p>And get this out:</p> <p><img class="aligncenter wp-image-4649" src="http://simplystatistics.org/wp-content/uploads/2016/02/quakes-300x264.png" alt="quakes" width="420" height="370" srcset="http://simplystatistics.org/wp-content/uploads/2016/02/quakes-300x264.png 300w, http://simplystatistics.org/wp-content/uploads/2016/02/quakes-227x200.png 227w, http://simplystatistics.org/wp-content/uploads/2016/02/quakes.png 627w" sizes="(max-width: 420px) 100vw, 420px" /></p> <p>That is a pretty amazing plot in one line of code! What often happens with students in a first serious data analysis class is they think that plot is done. But it isn’t even close. Here are a few things you would need to do to make this plot production ready: (1) make the axes bigger, (2) make the labels bigger, (3) make the labels be full names (latitude and longitude, ideally with units when variables need them), (4) make the legend title be number of stations reporting. Those are the bare minimum. But a very common move by a person who knows a little R/data analysis would be to leave that graph as it is and submit it directly. I know this from lots of experience.</p> <p>The one nice thing about teaching base R here is that the base version for this plot is either (a) a ton of work or (b) ugly. In either case, it makes the student think very hard about what they need to do to make the plot better, rather than just assuming it is ok.</p> <p><strong>Where ggplot2 is better for sure</strong></p> <p>ggplot2 being compatible with piping, having a simple system for theming, having a good animation package, and in general being an excellent platform for developers who create](https://ggplot2-exts.github.io/index.html) are all huge advantages. It is also great for getting absolute newbies up and making medium-quality graphics in a huge hurry. This is a great way to get more people engaged in data science and I’m psyched about the reach and power ggplot2 has had. Still, I probably won’t use it for my own work, even thought it disappoints my data scientist friends.</p> Data handcuffs 2016-02-10T15:38:37+00:00 http://simplystats.github.io/2016/02/10/data-handcuffs <p>A few years ago, if you asked me what the top skills I got asked about for students going into industry, I’d definitely have said things like data cleaning, data transformation, database pulls, and other non-traditional statistical tasks. But as companies have progressed from the point of storing data to actually wanting to do something with it, I would say one of the hottest skills is understanding and dealing with data from randomized trials.</p> <p>In particular I see data scientists talking more about <a href="https://medium.com/@InVisionApp/a-b-and-see-a-beginner-s-guide-to-a-b-testing-a16406f1a239#.p7hoxirwo">A/B testing</a>, <a href="http://varianceexplained.org/r/bayesian-ab-testing/">sequential stopping rules</a>, <a href="https://twitter.com/hspter/status/696820603945414656">hazard regression</a> and other ideas that are really common in Biostatistics, which has traditionally focused on the analysis of data from designed experiments in biology.</p> <p>I think it is great that companies are choosing to do experiments, as this <a href="http://simplystatistics.org/2013/07/15/yes-clinical-trials-work/">still remains</a> the gold standard for how to generate knowledge about causal effects. One interesting new development though is the extreme lengths it appears some organizations are going to to be “data-driven”. They make all decisions based on data they have collected or experiments they have performed.</p> <p>But data mostly tell you about small scale effects and things that happened in the past. To be able to make big discoveries/improvements requires (a) having creative ideas that are not data supported and (b) trying them in experiments to see if they work. If you get too caught up in experimenting on the same set of conditions you will inevitably asymptote to a maximum and quickly reach diminishing returns. This is where the data handcuffs come in. Data can only tell you about the conditions that existed in the past, they often can’t predict conditions in the future or ideas that may work out or might not.</p> <p>In an interesting parallel to academic research a good strategy appears to be: (a) trying a bunch of things, including some things that have only a pretty modest chance of success, (b) doing experiments early and often when trying those things, and (c) getting very good at recognizing failure quickly and moving on to ideas that will be fruitful. The challenges are that in part (a) it is often difficult to generate really knew ideas, especially if you are already doing something that has had any level of success. There will be extreme pressure not to change what you are doing. In part (c) the challenge is that if you discard ideas too quickly you might miss a big opportunity, but if you don’t discard them quickly enough you will sink a lot of time/cost into utlimately not very fruitful projects.</p> <p>Regardless, almost all of the most <a href="http://simplystatistics.org/2013/09/25/is-most-science-false-the-titans-weigh-in/">interesting projects</a> I’ve worked on in my life were not driven by data that suggested they would be successful. They were often risks where the data either wasn’t in, or the data supported not doing at all. But as a statistician I decided to straight up ignore the data and try anyway. Then again, these ideas have also been the sources of <a href="http://simplystatistics.org/2012/01/11/healthnewsrater/">my biggest flameouts</a>.</p> Leek group guide to reading scientific papers 2016-02-09T13:59:53+00:00 http://simplystats.github.io/2016/02/09/leek-group-guide-to-reading-scientific-papers <p>The other day on Twitter Amelia requested a guide for reading papers</p> <blockquote class="twitter-tweet" data-width="550"> <p lang="en" dir="ltr"> I love <a href="https://twitter.com/jtleek">@jtleek</a>’s github guides to reviewing papers, writing R packages, giving talks, etc. Would love one on reading papers, for students. </p> <p> &mdash; Amelia McNamara (@AmeliaMN) <a href="https://twitter.com/AmeliaMN/status/695633602751635456">February 5, 2016</a> </p> </blockquote> <p> </p> <p>So I came up with a guide which you can find here: <a href="https://github.com/jtleek/readingpapers">Leek group guide to reading papers</a>. I actually found this to be one that I had the hardest time with. I described how I tend to read a paper but I’m not sure that is really the optimal (or even a very good) way. I’d really appreciate pull requests if you have ideas on how to improve the guide.</p> A menagerie of messed up data analyses and how to avoid them 2016-02-01T13:39:57+00:00 http://simplystats.github.io/2016/02/01/a-menagerie-of-messed-up-data-analyses-and-how-to-avoid-them <p><em>Update: I realize this may seem like I’m picking on people. I really don’t mean to, I have for sure made all of these mistakes and many more. I can give many examples, but the one I always remember is the time Rafa saved me from “I got a big one here” when I made a huge mistake as a first year assistant professor.</em></p> <p>In any introductory statistics or data analysis class they might teach you the basics, how to load a data set, how to munge it, how to do t-tests, maybe how to write a report. But there are a whole bunch of ways that a data analysis can be screwed up that often get skipped over. Here is my first crack at creating a “menagerie” of messed up data analyses and how you can avoid them. Depending on interest I could probably list a ton more, but as always I’m doing the non-comprehensive list :).</p> <p> </p> <p> </p> <p><span style="text-decoration: underline;"><strong>Outco<img class="alignleft wp-image-4613" src="http://simplystatistics.org/wp-content/uploads/2016/02/direction411.png" alt="direction411" width="125" height="125" srcset="http://simplystatistics.org/wp-content/uploads/2016/02/direction411-200x200.png 200w, http://simplystatistics.org/wp-content/uploads/2016/02/direction411.png 256w" sizes="(max-width: 125px) 100vw, 125px" />me switching</strong></span></p> <p>_What it is: _Outcome switching is where you collect data looking at say, the relationship between exercise and blood pressure. Once you have the data, you realize that blood pressure isn’t really related to exercise. So you change the outcome and ask if HDL levels are related to exercise and you find a relationship. It turns out that when you do this kind of switch you have now biased your analysis because you would have just stopped if you found the original relationship.</p> <p style="text-align: left;"> <em>An example: </em><a href="http://www.vox.com/2015/12/29/10654056/ben-goldacre-compare-trials">In this article</a> they discuss how Paxil, an anti-depressant, was originally studied for several main outcomes, none of which showed an effect - but some of the secondary outcomes did. So they switched the outcome of the trial and used this result to market the drug. </p> <p style="text-align: left;"> <em>What you can do: </em>Pre-specify your analysis plan, including which outcomes you want to look at. Then very clearly state when you are analyzing a primary outcome or a secondary analysis. That way people know to take the secondary analyses with a grain of salt. You can even get paid to pre-specify with the OSF's <a href="https://cos.io/prereg/">pre-registration challenge</a>. </p> <p><img class="alignleft wp-image-4618" src="http://simplystatistics.org/wp-content/uploads/2016/02/direction398-300x300.png" alt="direction398" width="125" height="125" srcset="http://simplystatistics.org/wp-content/uploads/2016/02/direction398-300x300.png 300w, http://simplystatistics.org/wp-content/uploads/2016/02/direction398-200x200.png 200w, http://simplystatistics.org/wp-content/uploads/2016/02/direction398.png 512w" sizes="(max-width: 125px) 100vw, 125px" /></p> <p><span style="text-decoration: underline;"><strong>Garden of forking paths</strong></span></p> <p>_What it is: _In this case you may or may not have specified your outcome and stuck with it. Let’s assume you have, so you are still looking at blood pressure and exercise. But it turns out a bunch of people had apparently erroneous measures of blood pressure. So you dropped those measurements and did the analysis with the remaining values. This is a totally sensible thing to do, but if you didn’t specify in advance how you would handle bad measurements, you can make a bunch of different choices here (the forking paths). You could drop them, impute them, multiply impute them, weight them, etc. Each of these gives a different result and you can accidentally pick the one that works best even if you are being “sensible”</p> <p><em>An example</em>: <a href="http://www.stat.columbia.edu/~gelman/research/unpublished/p_hacking.pdf">This article</a> gives several examples of the forking paths. One is where authors report that at peak fertility women are more likely to wear red or pink shirts. They made several inclusion/exclusion choices (which women to include in which comparison group) for who to include that could easily have gone a different direction or were against stated rules.</p> <p>_What you can do: _Pre-specify every part of your analysis plan, down to which observations you are going to drop, transform, etc. To be honest this is super hard to do because almost every data set is messy in a unique way. So the best thing here is to point out steps in your analysis where you made a choice that wasn’t pre-specified and you could have made differently. Or, even better, try some of the different choices and make sure your results aren’t dramatically different.</p> <p> </p> <p><strong><img class="alignleft wp-image-4621" src="http://simplystatistics.org/wp-content/uploads/2016/02/emoticon149.png" alt="emoticon149" width="125" height="125" srcset="http://simplystatistics.org/wp-content/uploads/2016/02/emoticon149-200x200.png 200w, http://simplystatistics.org/wp-content/uploads/2016/02/emoticon149.png 256w" sizes="(max-width: 125px) 100vw, 125px" /><span style="text-decoration: underline;">P-hacking</span></strong></p> <p>_What it is: _The nefarious cousin of the garden of forking paths. Basically here the person outcome switches, uses the garden of forking paths, intentionally doesn’t correct for multiple testing, or uses any of these other means to cheat and get a result that they like.</p> <p><em>An example:</em> This one gets talked about a lot and there is <a href="http://journals.plos.org/plosbiology/article?id=10.1371/journal.pbio.1002106">some evidence that it happens</a>. But it is usually pretty hard to ascribe purely evil intentions to people and I’d rather not point the finger here. I think that often the garden of forking paths results in just as bad an outcome without people having to try.</p> <p><em>What to do:</em> Know how to do an analysis well and don’t cheat.</p> <p><em>Update: </em> Some [<em>Update: I realize this may seem like I’m picking on people. I really don’t mean to, I have for sure made all of these mistakes and many more. I can give many examples, but the one I always remember is the time Rafa saved me from “I got a big one here” when I made a huge mistake as a first year assistant professor.</em></p> <p>In any introductory statistics or data analysis class they might teach you the basics, how to load a data set, how to munge it, how to do t-tests, maybe how to write a report. But there are a whole bunch of ways that a data analysis can be screwed up that often get skipped over. Here is my first crack at creating a “menagerie” of messed up data analyses and how you can avoid them. Depending on interest I could probably list a ton more, but as always I’m doing the non-comprehensive list :).</p> <p> </p> <p> </p> <p><span style="text-decoration: underline;"><strong>Outco<img class="alignleft wp-image-4613" src="http://simplystatistics.org/wp-content/uploads/2016/02/direction411.png" alt="direction411" width="125" height="125" srcset="http://simplystatistics.org/wp-content/uploads/2016/02/direction411-200x200.png 200w, http://simplystatistics.org/wp-content/uploads/2016/02/direction411.png 256w" sizes="(max-width: 125px) 100vw, 125px" />me switching</strong></span></p> <p>_What it is: _Outcome switching is where you collect data looking at say, the relationship between exercise and blood pressure. Once you have the data, you realize that blood pressure isn’t really related to exercise. So you change the outcome and ask if HDL levels are related to exercise and you find a relationship. It turns out that when you do this kind of switch you have now biased your analysis because you would have just stopped if you found the original relationship.</p> <p style="text-align: left;"> <em>An example: </em><a href="http://www.vox.com/2015/12/29/10654056/ben-goldacre-compare-trials">In this article</a> they discuss how Paxil, an anti-depressant, was originally studied for several main outcomes, none of which showed an effect - but some of the secondary outcomes did. So they switched the outcome of the trial and used this result to market the drug. </p> <p style="text-align: left;"> <em>What you can do: </em>Pre-specify your analysis plan, including which outcomes you want to look at. Then very clearly state when you are analyzing a primary outcome or a secondary analysis. That way people know to take the secondary analyses with a grain of salt. You can even get paid to pre-specify with the OSF's <a href="https://cos.io/prereg/">pre-registration challenge</a>. </p> <p><img class="alignleft wp-image-4618" src="http://simplystatistics.org/wp-content/uploads/2016/02/direction398-300x300.png" alt="direction398" width="125" height="125" srcset="http://simplystatistics.org/wp-content/uploads/2016/02/direction398-300x300.png 300w, http://simplystatistics.org/wp-content/uploads/2016/02/direction398-200x200.png 200w, http://simplystatistics.org/wp-content/uploads/2016/02/direction398.png 512w" sizes="(max-width: 125px) 100vw, 125px" /></p> <p><span style="text-decoration: underline;"><strong>Garden of forking paths</strong></span></p> <p>_What it is: _In this case you may or may not have specified your outcome and stuck with it. Let’s assume you have, so you are still looking at blood pressure and exercise. But it turns out a bunch of people had apparently erroneous measures of blood pressure. So you dropped those measurements and did the analysis with the remaining values. This is a totally sensible thing to do, but if you didn’t specify in advance how you would handle bad measurements, you can make a bunch of different choices here (the forking paths). You could drop them, impute them, multiply impute them, weight them, etc. Each of these gives a different result and you can accidentally pick the one that works best even if you are being “sensible”</p> <p><em>An example</em>: <a href="http://www.stat.columbia.edu/~gelman/research/unpublished/p_hacking.pdf">This article</a> gives several examples of the forking paths. One is where authors report that at peak fertility women are more likely to wear red or pink shirts. They made several inclusion/exclusion choices (which women to include in which comparison group) for who to include that could easily have gone a different direction or were against stated rules.</p> <p>_What you can do: _Pre-specify every part of your analysis plan, down to which observations you are going to drop, transform, etc. To be honest this is super hard to do because almost every data set is messy in a unique way. So the best thing here is to point out steps in your analysis where you made a choice that wasn’t pre-specified and you could have made differently. Or, even better, try some of the different choices and make sure your results aren’t dramatically different.</p> <p> </p> <p><strong><img class="alignleft wp-image-4621" src="http://simplystatistics.org/wp-content/uploads/2016/02/emoticon149.png" alt="emoticon149" width="125" height="125" srcset="http://simplystatistics.org/wp-content/uploads/2016/02/emoticon149-200x200.png 200w, http://simplystatistics.org/wp-content/uploads/2016/02/emoticon149.png 256w" sizes="(max-width: 125px) 100vw, 125px" /><span style="text-decoration: underline;">P-hacking</span></strong></p> <p>_What it is: _The nefarious cousin of the garden of forking paths. Basically here the person outcome switches, uses the garden of forking paths, intentionally doesn’t correct for multiple testing, or uses any of these other means to cheat and get a result that they like.</p> <p><em>An example:</em> This one gets talked about a lot and there is <a href="http://journals.plos.org/plosbiology/article?id=10.1371/journal.pbio.1002106">some evidence that it happens</a>. But it is usually pretty hard to ascribe purely evil intentions to people and I’d rather not point the finger here. I think that often the garden of forking paths results in just as bad an outcome without people having to try.</p> <p><em>What to do:</em> Know how to do an analysis well and don’t cheat.</p> <p><em>Update: </em> Some](http://papers.ssrn.com/sol3/papers.cfm?abstract_id=2649230) “when honest researchers face ambiguity about what analyses to run, and convince themselves those leading to better results are the correct ones (see e.g., Gelman &amp; Loken, 2014; John, Loewenstein, &amp; Prelec, 2012; Simmons, Nelson, &amp; Simonsohn, 2011; Vazire, 2015).” This coincides with the definition of “garden of forking paths”. I have been asked to point this out <a href="https://twitter.com/talyarkoni/status/694576205089996800">on Twitter.</a> It was never my intention to accuse anyone of accusing people of fraud. That being said, I still think that the connotation that many people think of when they think “p-hacking” corresponds to my definition above, although I agree with folks that isn’t helpful - which is why I prefer we call the non-nefarious version the garden of forking paths.</p> <p> </p> <p><strong><img class="alignleft wp-image-4623" src="http://simplystatistics.org/wp-content/uploads/2016/02/paypal15.png" alt="paypal15" width="125" height="125" srcset="http://simplystatistics.org/wp-content/uploads/2016/02/paypal15-200x200.png 200w, http://simplystatistics.org/wp-content/uploads/2016/02/paypal15.png 256w" sizes="(max-width: 125px) 100vw, 125px" /><span style="text-decoration: underline;">Uncorrected multiple testing </span></strong></p> <p>_What it is: _This one is related to the garden of forking paths and outcome switching. Most statistical methods for measuring the potential for error assume you are only evaluating one hypothesis at a time. But in reality you might be measuring a ton either on purpose (in a big genomics or neuroimaging study) or accidentally (because you consider a bunch of outcomes). In either case, the expected error rate changes a lot if you consider many hypotheses.</p> <p><em>An example: </em> The <a href="http://users.stat.umn.edu/~corbett/classes/5303/Bennett-Salmon-2009.pdf">most famous example</a> is when someone did an fMRI on a dead fish and showed that there were a bunch of significant regions at the P &lt; 0.05 level. The reason is that there is natural variation in the background of these measurements and if you consider each pixel independently ignoring that you are looking at a bunch of them, a few will have P &lt; 0.05 just by chance.</p> <p><em>What you can do</em>: Correct for multiple testing. When you calculate a large number of p-values make sure you <a href="http://varianceexplained.org/statistics/interpreting-pvalue-histogram/">know what their distribution</a> is expected to be and you use a method like Bonferroni, Benjamini-Hochberg, or q-value to correct for multiple testing.</p> <p> </p> <p><strong><img class="alignleft wp-image-4625" src="http://simplystatistics.org/wp-content/uploads/2016/02/animal162.png" alt="animal162" width="125" height="125" srcset="http://simplystatistics.org/wp-content/uploads/2016/02/animal162-200x200.png 200w, http://simplystatistics.org/wp-content/uploads/2016/02/animal162.png 256w" sizes="(max-width: 125px) 100vw, 125px" /><span style="text-decoration: underline;">I got a big one here</span></strong></p> <p><em>What it is:</em> One of the most painful experiences for all new data analysts. You collect data and discover a huge effect. You are super excited so you write it up and submit it to one of the best journals or convince your boss to be the farm. The problem is that huge effects are incredibly rare and are usually due to some combination of experimental artifacts and biases or mistakes in the analysis. Almost no effects you detect with statistics are huge. Even the relationship between smoking and cancer is relatively weak in observational studies and requires very careful calibration and analysis.</p> <p><em>An example:</em> <a href="http://www.ncbi.nlm.nih.gov/pubmed/17206142">In a paper</a> authors claimed that 78% of genes were differentially expressed between Asians and Europeans. But it turns out that most of the Asian samples were measured in one sample and the Europeans in another. [<em>Update: I realize this may seem like I’m picking on people. I really don’t mean to, I have for sure made all of these mistakes and many more. I can give many examples, but the one I always remember is the time Rafa saved me from “I got a big one here” when I made a huge mistake as a first year assistant professor.</em></p> <p>In any introductory statistics or data analysis class they might teach you the basics, how to load a data set, how to munge it, how to do t-tests, maybe how to write a report. But there are a whole bunch of ways that a data analysis can be screwed up that often get skipped over. Here is my first crack at creating a “menagerie” of messed up data analyses and how you can avoid them. Depending on interest I could probably list a ton more, but as always I’m doing the non-comprehensive list :).</p> <p> </p> <p> </p> <p><span style="text-decoration: underline;"><strong>Outco<img class="alignleft wp-image-4613" src="http://simplystatistics.org/wp-content/uploads/2016/02/direction411.png" alt="direction411" width="125" height="125" srcset="http://simplystatistics.org/wp-content/uploads/2016/02/direction411-200x200.png 200w, http://simplystatistics.org/wp-content/uploads/2016/02/direction411.png 256w" sizes="(max-width: 125px) 100vw, 125px" />me switching</strong></span></p> <p>_What it is: _Outcome switching is where you collect data looking at say, the relationship between exercise and blood pressure. Once you have the data, you realize that blood pressure isn’t really related to exercise. So you change the outcome and ask if HDL levels are related to exercise and you find a relationship. It turns out that when you do this kind of switch you have now biased your analysis because you would have just stopped if you found the original relationship.</p> <p style="text-align: left;"> <em>An example: </em><a href="http://www.vox.com/2015/12/29/10654056/ben-goldacre-compare-trials">In this article</a> they discuss how Paxil, an anti-depressant, was originally studied for several main outcomes, none of which showed an effect - but some of the secondary outcomes did. So they switched the outcome of the trial and used this result to market the drug. </p> <p style="text-align: left;"> <em>What you can do: </em>Pre-specify your analysis plan, including which outcomes you want to look at. Then very clearly state when you are analyzing a primary outcome or a secondary analysis. That way people know to take the secondary analyses with a grain of salt. You can even get paid to pre-specify with the OSF's <a href="https://cos.io/prereg/">pre-registration challenge</a>. </p> <p><img class="alignleft wp-image-4618" src="http://simplystatistics.org/wp-content/uploads/2016/02/direction398-300x300.png" alt="direction398" width="125" height="125" srcset="http://simplystatistics.org/wp-content/uploads/2016/02/direction398-300x300.png 300w, http://simplystatistics.org/wp-content/uploads/2016/02/direction398-200x200.png 200w, http://simplystatistics.org/wp-content/uploads/2016/02/direction398.png 512w" sizes="(max-width: 125px) 100vw, 125px" /></p> <p><span style="text-decoration: underline;"><strong>Garden of forking paths</strong></span></p> <p>_What it is: _In this case you may or may not have specified your outcome and stuck with it. Let’s assume you have, so you are still looking at blood pressure and exercise. But it turns out a bunch of people had apparently erroneous measures of blood pressure. So you dropped those measurements and did the analysis with the remaining values. This is a totally sensible thing to do, but if you didn’t specify in advance how you would handle bad measurements, you can make a bunch of different choices here (the forking paths). You could drop them, impute them, multiply impute them, weight them, etc. Each of these gives a different result and you can accidentally pick the one that works best even if you are being “sensible”</p> <p><em>An example</em>: <a href="http://www.stat.columbia.edu/~gelman/research/unpublished/p_hacking.pdf">This article</a> gives several examples of the forking paths. One is where authors report that at peak fertility women are more likely to wear red or pink shirts. They made several inclusion/exclusion choices (which women to include in which comparison group) for who to include that could easily have gone a different direction or were against stated rules.</p> <p>_What you can do: _Pre-specify every part of your analysis plan, down to which observations you are going to drop, transform, etc. To be honest this is super hard to do because almost every data set is messy in a unique way. So the best thing here is to point out steps in your analysis where you made a choice that wasn’t pre-specified and you could have made differently. Or, even better, try some of the different choices and make sure your results aren’t dramatically different.</p> <p> </p> <p><strong><img class="alignleft wp-image-4621" src="http://simplystatistics.org/wp-content/uploads/2016/02/emoticon149.png" alt="emoticon149" width="125" height="125" srcset="http://simplystatistics.org/wp-content/uploads/2016/02/emoticon149-200x200.png 200w, http://simplystatistics.org/wp-content/uploads/2016/02/emoticon149.png 256w" sizes="(max-width: 125px) 100vw, 125px" /><span style="text-decoration: underline;">P-hacking</span></strong></p> <p>_What it is: _The nefarious cousin of the garden of forking paths. Basically here the person outcome switches, uses the garden of forking paths, intentionally doesn’t correct for multiple testing, or uses any of these other means to cheat and get a result that they like.</p> <p><em>An example:</em> This one gets talked about a lot and there is <a href="http://journals.plos.org/plosbiology/article?id=10.1371/journal.pbio.1002106">some evidence that it happens</a>. But it is usually pretty hard to ascribe purely evil intentions to people and I’d rather not point the finger here. I think that often the garden of forking paths results in just as bad an outcome without people having to try.</p> <p><em>What to do:</em> Know how to do an analysis well and don’t cheat.</p> <p><em>Update: </em> Some [<em>Update: I realize this may seem like I’m picking on people. I really don’t mean to, I have for sure made all of these mistakes and many more. I can give many examples, but the one I always remember is the time Rafa saved me from “I got a big one here” when I made a huge mistake as a first year assistant professor.</em></p> <p>In any introductory statistics or data analysis class they might teach you the basics, how to load a data set, how to munge it, how to do t-tests, maybe how to write a report. But there are a whole bunch of ways that a data analysis can be screwed up that often get skipped over. Here is my first crack at creating a “menagerie” of messed up data analyses and how you can avoid them. Depending on interest I could probably list a ton more, but as always I’m doing the non-comprehensive list :).</p> <p> </p> <p> </p> <p><span style="text-decoration: underline;"><strong>Outco<img class="alignleft wp-image-4613" src="http://simplystatistics.org/wp-content/uploads/2016/02/direction411.png" alt="direction411" width="125" height="125" srcset="http://simplystatistics.org/wp-content/uploads/2016/02/direction411-200x200.png 200w, http://simplystatistics.org/wp-content/uploads/2016/02/direction411.png 256w" sizes="(max-width: 125px) 100vw, 125px" />me switching</strong></span></p> <p>_What it is: _Outcome switching is where you collect data looking at say, the relationship between exercise and blood pressure. Once you have the data, you realize that blood pressure isn’t really related to exercise. So you change the outcome and ask if HDL levels are related to exercise and you find a relationship. It turns out that when you do this kind of switch you have now biased your analysis because you would have just stopped if you found the original relationship.</p> <p style="text-align: left;"> <em>An example: </em><a href="http://www.vox.com/2015/12/29/10654056/ben-goldacre-compare-trials">In this article</a> they discuss how Paxil, an anti-depressant, was originally studied for several main outcomes, none of which showed an effect - but some of the secondary outcomes did. So they switched the outcome of the trial and used this result to market the drug. </p> <p style="text-align: left;"> <em>What you can do: </em>Pre-specify your analysis plan, including which outcomes you want to look at. Then very clearly state when you are analyzing a primary outcome or a secondary analysis. That way people know to take the secondary analyses with a grain of salt. You can even get paid to pre-specify with the OSF's <a href="https://cos.io/prereg/">pre-registration challenge</a>. </p> <p><img class="alignleft wp-image-4618" src="http://simplystatistics.org/wp-content/uploads/2016/02/direction398-300x300.png" alt="direction398" width="125" height="125" srcset="http://simplystatistics.org/wp-content/uploads/2016/02/direction398-300x300.png 300w, http://simplystatistics.org/wp-content/uploads/2016/02/direction398-200x200.png 200w, http://simplystatistics.org/wp-content/uploads/2016/02/direction398.png 512w" sizes="(max-width: 125px) 100vw, 125px" /></p> <p><span style="text-decoration: underline;"><strong>Garden of forking paths</strong></span></p> <p>_What it is: _In this case you may or may not have specified your outcome and stuck with it. Let’s assume you have, so you are still looking at blood pressure and exercise. But it turns out a bunch of people had apparently erroneous measures of blood pressure. So you dropped those measurements and did the analysis with the remaining values. This is a totally sensible thing to do, but if you didn’t specify in advance how you would handle bad measurements, you can make a bunch of different choices here (the forking paths). You could drop them, impute them, multiply impute them, weight them, etc. Each of these gives a different result and you can accidentally pick the one that works best even if you are being “sensible”</p> <p><em>An example</em>: <a href="http://www.stat.columbia.edu/~gelman/research/unpublished/p_hacking.pdf">This article</a> gives several examples of the forking paths. One is where authors report that at peak fertility women are more likely to wear red or pink shirts. They made several inclusion/exclusion choices (which women to include in which comparison group) for who to include that could easily have gone a different direction or were against stated rules.</p> <p>_What you can do: _Pre-specify every part of your analysis plan, down to which observations you are going to drop, transform, etc. To be honest this is super hard to do because almost every data set is messy in a unique way. So the best thing here is to point out steps in your analysis where you made a choice that wasn’t pre-specified and you could have made differently. Or, even better, try some of the different choices and make sure your results aren’t dramatically different.</p> <p> </p> <p><strong><img class="alignleft wp-image-4621" src="http://simplystatistics.org/wp-content/uploads/2016/02/emoticon149.png" alt="emoticon149" width="125" height="125" srcset="http://simplystatistics.org/wp-content/uploads/2016/02/emoticon149-200x200.png 200w, http://simplystatistics.org/wp-content/uploads/2016/02/emoticon149.png 256w" sizes="(max-width: 125px) 100vw, 125px" /><span style="text-decoration: underline;">P-hacking</span></strong></p> <p>_What it is: _The nefarious cousin of the garden of forking paths. Basically here the person outcome switches, uses the garden of forking paths, intentionally doesn’t correct for multiple testing, or uses any of these other means to cheat and get a result that they like.</p> <p><em>An example:</em> This one gets talked about a lot and there is <a href="http://journals.plos.org/plosbiology/article?id=10.1371/journal.pbio.1002106">some evidence that it happens</a>. But it is usually pretty hard to ascribe purely evil intentions to people and I’d rather not point the finger here. I think that often the garden of forking paths results in just as bad an outcome without people having to try.</p> <p><em>What to do:</em> Know how to do an analysis well and don’t cheat.</p> <p><em>Update: </em> Some](http://papers.ssrn.com/sol3/papers.cfm?abstract_id=2649230) “when honest researchers face ambiguity about what analyses to run, and convince themselves those leading to better results are the correct ones (see e.g., Gelman &amp; Loken, 2014; John, Loewenstein, &amp; Prelec, 2012; Simmons, Nelson, &amp; Simonsohn, 2011; Vazire, 2015).” This coincides with the definition of “garden of forking paths”. I have been asked to point this out <a href="https://twitter.com/talyarkoni/status/694576205089996800">on Twitter.</a> It was never my intention to accuse anyone of accusing people of fraud. That being said, I still think that the connotation that many people think of when they think “p-hacking” corresponds to my definition above, although I agree with folks that isn’t helpful - which is why I prefer we call the non-nefarious version the garden of forking paths.</p> <p> </p> <p><strong><img class="alignleft wp-image-4623" src="http://simplystatistics.org/wp-content/uploads/2016/02/paypal15.png" alt="paypal15" width="125" height="125" srcset="http://simplystatistics.org/wp-content/uploads/2016/02/paypal15-200x200.png 200w, http://simplystatistics.org/wp-content/uploads/2016/02/paypal15.png 256w" sizes="(max-width: 125px) 100vw, 125px" /><span style="text-decoration: underline;">Uncorrected multiple testing </span></strong></p> <p>_What it is: _This one is related to the garden of forking paths and outcome switching. Most statistical methods for measuring the potential for error assume you are only evaluating one hypothesis at a time. But in reality you might be measuring a ton either on purpose (in a big genomics or neuroimaging study) or accidentally (because you consider a bunch of outcomes). In either case, the expected error rate changes a lot if you consider many hypotheses.</p> <p><em>An example: </em> The <a href="http://users.stat.umn.edu/~corbett/classes/5303/Bennett-Salmon-2009.pdf">most famous example</a> is when someone did an fMRI on a dead fish and showed that there were a bunch of significant regions at the P &lt; 0.05 level. The reason is that there is natural variation in the background of these measurements and if you consider each pixel independently ignoring that you are looking at a bunch of them, a few will have P &lt; 0.05 just by chance.</p> <p><em>What you can do</em>: Correct for multiple testing. When you calculate a large number of p-values make sure you <a href="http://varianceexplained.org/statistics/interpreting-pvalue-histogram/">know what their distribution</a> is expected to be and you use a method like Bonferroni, Benjamini-Hochberg, or q-value to correct for multiple testing.</p> <p> </p> <p><strong><img class="alignleft wp-image-4625" src="http://simplystatistics.org/wp-content/uploads/2016/02/animal162.png" alt="animal162" width="125" height="125" srcset="http://simplystatistics.org/wp-content/uploads/2016/02/animal162-200x200.png 200w, http://simplystatistics.org/wp-content/uploads/2016/02/animal162.png 256w" sizes="(max-width: 125px) 100vw, 125px" /><span style="text-decoration: underline;">I got a big one here</span></strong></p> <p><em>What it is:</em> One of the most painful experiences for all new data analysts. You collect data and discover a huge effect. You are super excited so you write it up and submit it to one of the best journals or convince your boss to be the farm. The problem is that huge effects are incredibly rare and are usually due to some combination of experimental artifacts and biases or mistakes in the analysis. Almost no effects you detect with statistics are huge. Even the relationship between smoking and cancer is relatively weak in observational studies and requires very careful calibration and analysis.</p> <p><em>An example:</em> <a href="http://www.ncbi.nlm.nih.gov/pubmed/17206142">In a paper</a> authors claimed that 78% of genes were differentially expressed between Asians and Europeans. But it turns out that most of the Asian samples were measured in one sample and the Europeans in another.](http://www.ncbi.nlm.nih.gov/pubmed/17597765) a large fraction of these differences.</p> <p><em>What you can do</em>: Be deeply suspicious of big effects in data analysis. If you find something huge and counterintuitive, especially in a well established research area, spend <em>a lot</em> of time trying to figure out why it could be a mistake. If you don’t, others definitely will, and you might be embarrassed.</p> <p><span style="text-decoration: underline;"><strong><img class="alignleft wp-image-4632" src="http://simplystatistics.org/wp-content/uploads/2016/02/man298.png" alt="man298" width="125" height="125" srcset="http://simplystatistics.org/wp-content/uploads/2016/02/man298-200x200.png 200w, http://simplystatistics.org/wp-content/uploads/2016/02/man298.png 256w" sizes="(max-width: 125px) 100vw, 125px" />Double complication</strong></span></p> <p><em>What it is</em>: When faced with a large and complicated data set, beginning analysts often feel compelled to use a big complicated method. Imagine you have collected data on thousands of genes or hundreds of thousands of voxels and you want to use this data to predict some health outcome. There is a severe temptation to use deep learning or blend random forests, boosting, and five other methods to perform the prediction. The problem is that complicated methods fail for complicated reasons, which will be extra hard to diagnose if you have a really big, complicated data set.</p> <p><em>An example:</em> There are a large number of examples where people use very small training sets and complicated methods. One example (there were many other problems with this analysis, too) is when people <a href="http://www.nature.com/nm/journal/v12/n11/full/nm1491.html">tried to use complicated prediction algorithms</a> to predict which chemotherapy would work best using genomics. Ultimately this paper was retracted for may problems, but the complication of the methods plus the complication of the data made it hard to detect.</p> <p><em>What you can do:</em> When faced with a big, messy data set, try simple things first. Use linear regression, make simple scatterplots, check to see if there are obvious flaws with the data. If you must use a really complicated method, ask yourself if there is a reason it is outperforming the simple methods because often with large data sets <a href="http://arxiv.org/pdf/math/0606441.pdf">even simple things work</a>.</p> <p> </p> <p> </p> <p> </p> <p> </p> <p> </p> <p><span style="text-decoration: underline;"><strong>Image credits:</strong></span></p> <ul> <li>Outcome switching. Icon made by <a href="http://hananonblog.wordpress.com" title="Hanan">Hanan</a> from <a href="http://www.flaticon.com" title="Flaticon">www.flaticon.com</a> is licensed under <a href="http://creativecommons.org/licenses/by/3.0/" title="Creative Commons BY 3.0">CC BY 3.0</a></li> <li>Forking paths. Icon made by <a href="http://iconalone.com" title="Popcic">Popcic</a> from <a href="http://www.flaticon.com" title="Flaticon">www.flaticon.com</a> is licensed under <a href="http://creativecommons.org/licenses/by/3.0/" title="Creative Commons BY 3.0">CC BY 3.0</a></li> <li>P-hacking.Icon made by <a href="http://www.icomoon.io" title="Icomoon">Icomoon</a> from <a href="http://www.flaticon.com" title="Flaticon">www.flaticon.com</a> is licensed under <a href="http://creativecommons.org/licenses/by/3.0/" title="Creative Commons BY 3.0">CC BY 3.0</a></li> <li>Uncorrected multiple testing.Icon made by <a href="http://www.freepik.com" title="Freepik">Freepik</a> from <a href="http://www.flaticon.com" title="Flaticon">www.flaticon.com</a> is licensed under <a href="http://creativecommons.org/licenses/by/3.0/" title="Creative Commons BY 3.0">CC BY 3.0</a></li> <li>Big one here. Icon made by <a href="http://www.freepik.com" title="Freepik">Freepik</a> from <a href="http://www.flaticon.com" title="Flaticon">www.flaticon.com</a> is licensed under <a href="http://creativecommons.org/licenses/by/3.0/" title="Creative Commons BY 3.0">CC BY 3.0</a></li> <li>Double complication. Icon made by <a href="http://www.freepik.com" title="Freepik">Freepik</a> from <a href="http://www.flaticon.com" title="Flaticon">www.flaticon.com</a> is licensed under <a href="http://creativecommons.org/licenses/by/3.0/" title="Creative Commons BY 3.0">CC BY 3.0</a></li> </ul> Exactly how risky is breathing? 2016-01-26T09:58:23+00:00 http://simplystats.github.io/2016/01/26/exactly-how-risky-is-breathing <p>This <a href="http://nyti.ms/23nysp5">article by by George Johnson</a> in the NYT describes a study by Kamen P. Simonov​​ and Daniel S. Himmelstein​ that examines the hypothesis that people living at higher altitudes experience lower rates of lung cancer than people living at lower altitudes.</p> <blockquote> <p>All of the usual caveats apply. Studies like this, which compare whole populations, can be used only to suggest possibilities to be explored in future research. But the hypothesis is not as crazy as it may sound. Oxygen is what energizes the cells of our bodies. Like any fuel, it inevitably spews out waste — a corrosive exhaust of substances called “free radicals,” or “reactive oxygen species,” that can mutate DNA and nudge a cell closer to malignancy.</p> </blockquote> <p>I’m not so much focused on the science itself, which is perhaps intriguing, but rather on the way the article was written. First, George Johnson links to the <a href="https://peerj.com/articles/705/">paper</a> itself, <a href="http://simplystatistics.org/2015/01/15/how-to-find-the-science-paper-behind-a-headline-when-the-link-is-missing/">already a major victory</a>. Also, I thought he did a very nice job of laying out the complexity of doing a population-level study like this one–all the potential confounders, selection bias, negative controls, etc.</p> <p>I remember particulate matter air pollution epidemiology used to have this feel. You’d try to do all these different things to make the effect go away, but for some reason, under every plausible scenario, in almost every setting, there was always some association between air pollution and health outcomes. Eventually you start to believe it….</p> On research parasites and internet mobs - let's try to solve the real problem. 2016-01-25T14:34:08+00:00 http://simplystats.github.io/2016/01/25/on-research-parasites-and-internet-mobs-lets-try-to-solve-the-real-problem <p>A couple of days ago one of the editors of the New England Journal of Medicine <a href="http://www.nejm.org/doi/full/10.1056/NEJMe1516564">posted an editorial</a> showing some moderate level of support for data sharing but also introducing the term “research parasite”:</p> <blockquote> <p>A second concern held by some is that a new class of research person will emerge — people who had nothing to do with the design and execution of the study but use another group’s data for their own ends, possibly stealing from the research productivity planned by the data gatherers, or even use the data to try to disprove what the original investigators had posited. There is concern among some front-line researchers that the system will be taken over by what some researchers have characterized as “research parasites.”</p> </blockquote> <p>While this is obviously the most inflammatory statement in the article, I think that there are several more important and overlooked misconceptions. The biggest problems are:</p> <ol> <li><strong>“</strong><strong>The first concern is that someone not involved in the generation and collection of the data may not understand the choices made in defining the parameters.</strong><strong>“ </strong>This almost certainly would be the fault of the investigators who published the data. If the authors adhere to good [A couple of days ago one of the editors of the New England Journal of Medicine <a href="http://www.nejm.org/doi/full/10.1056/NEJMe1516564">posted an editorial</a> showing some moderate level of support for data sharing but also introducing the term “research parasite”:</li> </ol> <blockquote> <p>A second concern held by some is that a new class of research person will emerge — people who had nothing to do with the design and execution of the study but use another group’s data for their own ends, possibly stealing from the research productivity planned by the data gatherers, or even use the data to try to disprove what the original investigators had posited. There is concern among some front-line researchers that the system will be taken over by what some researchers have characterized as “research parasites.”</p> </blockquote> <p>While this is obviously the most inflammatory statement in the article, I think that there are several more important and overlooked misconceptions. The biggest problems are:</p> <ol> <li><strong>“</strong><strong>The first concern is that someone not involved in the generation and collection of the data may not understand the choices made in defining the parameters.</strong><strong>“ </strong>This almost certainly would be the fault of the investigators who published the data. If the authors adhere to good](https://github.com/jtleek/datasharing) policies and respond to queries from people using their data promptly then this should not be a problem at all.</li> <li><strong>“… but use another group’s data for their own ends, possibly stealing from the research productivity planned by the data gatherers, or even use the data to try to disprove what the original investigators had posited.” </strong>The idea that no one should be able to try to disprove ideas with the authors data has been covered in other blogs/on Twitter. One thing I do think is worth considering here is the concern about credit. I think that the traditional way credit has accrued to authors has been citations. But if you get a major study funded, say for 50 million dollars, run that study carefully, sit on a million conference calls, and end up with a single major paper, that could be frustrating. Which is why I think that a better policy would be to have the people who run massive studies get credit in a way that <em>is not papers</em>. They should get some kind of formal administrative credit. But then the data should be immediately and publicly available to anyone to publish on. That allows people who run massive studies to get credit and science to proceed normally.</li> <li><strong>“</strong><strong>The new investigators arrived on the scene with their own ideas and worked symbiotically, rather than parasitically, with the investigators holding the data, moving the field forward in a way that neither group could have done on its own.” </strong> The story that follows about a group of researchers who collaborated with the NSABP to validate their gene expression signature is very encouraging. But it isn’t the only way science should work. Researchers shouldn’t be constrained to one model or another. Sometimes collaboration is necessary, sometimes it isn’t, but in neither case should we label the researchers “symbiotic” or “parasitic”, terms that have extreme connotations.</li> <li><strong>“How would data sharing work best? We think it should happen symbiotically, not parasitically.”</strong> I think that it should happen <em>automatically</em>. If you generate a data set with public funds, you should be required to immediately make it available to researchers in the community. But you should <em>get credit for generating the data set and the hypothesis that led to the data set</em>. The problem is that people who generate data will almost never be as fast at analyzing it as people who know how to analyze data. But both deserve credit, whether they are working together or not.</li> <li><strong>“Start with a novel idea, one that is not an obvious extension of the reported work. Second, identify potential collaborators whose collected data may be useful in assessing the hypothesis and propose a collaboration. Third, work together to test the new hypothesis. Fourth, report the new findings with relevant coauthorship to acknowledge both the group that proposed the new idea and the investigative group that accrued the data that allowed it to be tested.”</strong> The trouble with this framework is that it preferentially accrues credit to data generators and doesn’t accurately describe the role of either party. To flip this argument around, you could just as easily say that anyone who uses <a href="http://salzberg-lab.org/">Steven Salzberg</a>’s software for aligning or assembling short reads should make him a co-author. I think Dr. Drazen would agree that not everyone who aligned reads should add Steven as co-author, despite his contribution being critical for the completion of their work.</li> </ol> <p>After the piece was posted there was predictable internet rage from <a href="https://twitter.com/dataparasite">data parasites</a>, a <a href="https://twitter.com/hashtag/researchparasite?src=hash">dedicated hashtag</a>, and half a dozen angry blog posts written about the piece. These inspired a <a href="http://www.nejm.org/doi/full/10.1056/NEJMe1601087">follow up piece</a> from Drazen. I recognize why these folks were upset - the “research parasites” thing was unnecessarily inflammatory. But <a href="http://simplystatistics.org/2014/03/05/plos-one-i-have-an-idea-for-what-to-do-with-all-your-profits-buy-hard-drives/">I also sympathize with data creators</a> who are also subject to a tough environment - particularly when they are junior scientists.</p> <p>I think the response to the internet outrage also misses the mark and comes off as a defense of people with angry perspectives on data sharing. I would have much rather seen a more pro-active approach from a leading journal of medicine. I’d like to see something that acknowledges different contributions appropriately and doesn’t slow down science. Something like:</p> <ol> <li>We will require all data, including data from clinical trials, to be made public immediately on publication as long as it poses minimal risk to the patients involved or the patients have been consented to broad sharing.</li> <li>When data are not made publicly available they are still required to be deposited with a third party such as the NIH or Figshare to be held available for request from qualified/approved researchers.</li> <li>We will require that all people who use data give appropriate credit to the original data generators in terms of data citations.</li> <li>We will require that all people who use software/statistical analysis tools give credit to the original tool developers in terms of software citations.</li> <li>We will include a new designation for leaders of major data collection or software generation projects that can be included to demonstrate credit for major projects undertaken and completed.</li> <li>When reviewing papers written by experimentalists with no statistical/computational co-authors we will require no fewer than 2 statistical/computational referees to ensure there has not been a mistake made by inexperienced researchers.</li> <li>When reviewing papers written by statistical/computational authors with no experimental co-authors we will require no fewer than 2 experimental referees to ensure there has not been a mistake made by inexperienced researchers.</li> </ol> <p> </p> Not So Standard Deviations Episode 8 - Snow Day 2016-01-24T21:41:44+00:00 http://simplystats.github.io/2016/01/24/not-so-standard-deviations-episode-8-snow-day <p>Hilary and I were snowed in over the weekend, so we recorded Episode 8 of Not So Standard Deviations. In this episode, Hilary and I talk about how to get your foot in the door with data science, the New England Journal’s view on data sharing, Google’s “Cohort Analysis”, and trying to predict a movie’s box office returns based on the movie’s script.</p> <p><a href="https://itunes.apple.com/us/podcast/not-so-standard-deviations/id1040614570">Subscribe to the podcast on iTunes</a>.</p> <p>Follow <a href="https://twitter.com/nssdeviations">@NSSDeviations</a> on Twitter!</p> <p>Show notes:</p> <ul> <li><a href="http://goo.gl/eUU2AK">Remembrances of Peter Hall</a></li> <li><a href="http://goo.gl/HbMu87">Research Parasites</a> (NEJM editorial by Dan Longo and Jeffrey Drazen)</li> <li>Amazon <a href="http://goo.gl/83DvvO">review/data analysis</a> of Fifty Shades of Grey</li> <li><a href="https://youtu.be/55psWVYSbrI">Time-lapse cats</a></li> <li><a href="https://getpocket.com">Pocket</a></li> </ul> <p>Apologies for my audio on this episode. I had a bit of a problem calibrating my microphone. I promise to figure it out for the next episode!</p> <p><a href="https://api.soundcloud.com/tracks/243634673/download?client_id=02gUJC0hH2ct1EGOcYXQIzRFU91c72Ea&amp;oauth_token=1-138878-174789515-deb24181d01af">Download the audio for this episode</a>.</p> <p> </p> Parallel BLAS in R 2016-01-21T11:53:07+00:00 http://simplystats.github.io/2016/01/21/parallel-blas-in-r <p>I’m working on a new chapter for my R Programming book and the topic is parallel computation. So, I was happy to see this tweet from David Robinson (@drob) yesterday:</p> <blockquote class="twitter-tweet" lang="en"> <p dir="ltr" lang="en"> How fast is this <a href="https://twitter.com/hashtag/rstats?src=hash">#rstats</a> code? x &lt;- replicate(5e3, rnorm(5e3)) x %*% t(x) For me, w/Microsoft R Open, 2.5sec. Wow. <a href="https://t.co/0SbijNxxVa">https://t.co/0SbijNxxVa</a> </p> <p> — David Robinson (@drob) <a href="https://twitter.com/drob/status/689916280233562112">January 20, 2016</a> </p> </blockquote> <p>What does this have to do with parallel computation? Briefly, the code generates 5,000 standard normal random variates, repeats this 5,000 times and stores them in a 5,000 x 5,000 matrix (x’). Then it computes x x’. The second part is key, because it involves a matrix multiplication.</p> <p>Matrix multiplication in R is handled, at a very low level, by the library that implements the Basic Linear Algebra Subroutines, or BLAS. The stock R that you download from CRAN comes with what’s known as a reference implementation of BLAS. It works, it produces what everyone agrees are the right answers, but it is in no way optimized. Here’s what I get when I run this code on my Mac using Studio and the CRAN version of R for Mac OS X:</p> <pre>system.time({ x &lt;- replicate(5e3, rnorm(5e3)); tcrossprod(x) }) user system elapsed 59.622 0.314 59.927 </pre> <p>Note that the “user” time and the “elapsed” time are roughly the same. Note also that I use the tcrossprod() function instead of the otherwise equivalent expression x %*% t(x). Both crossprod() and tcrossprod() are generally faster than using the %*% operator.</p> <p>Now, when I run the same code on my built-from-source version of R (version 3.2.3), here’s what I get:</p> <pre>system.time({ x &lt;- replicate(5e3, rnorm(5e3)); tcrossprod(x) }) user system elapsed 14.378 0.276 3.344 </pre> <p>Overall, it’s faster when I don’t run the code through RStudio (14s vs. 59s). Also on this version the elapsed time is about 1/4 the user time. Why is that?</p> <p>The build-from-source version of R is linked to Apple’s Accelerate framework, which is a large library that includes an optimized BLAS library for Intel chips. This optimized BLAS, in addition to being optimized with respect to the code itself, is designed to be multi-threaded so that it can split work off into chunks and run them in parallel on multi-core machines. Here, the tcrossprod() function was run in parallel on my machine, and so the elapsed time was about a quarter of the time that was “charged” to the CPU(s).</p> <p>David’s tweet indicated that when using Microsoft R Open, which is a custom built binary of R, that the (I assume?) elapsed time is 2.5 seconds. Looking at the attached link, it appears that Microsoft’s R Open is linked against <a href="https://software.intel.com/en-us/intel-mkl">Intel’s Math Kernel Library</a> (MKL) which contains, among other things, an optimized BLAS for Intel chips. I don’t know what kind of computer David was running on, but assuming it was similarly high-powered as mine, it would suggest Intel’s MKL sees slightly better performance. But either way, both Accelerate and MKL achieve that speed up through custom-coding of the BLAS routines and multi-threading on multi-core systems.</p> <p>If you’re going to be doing any linear algebra in R (and you will), it’s important to link to an optimized BLAS. Otherwise, you’re just wasting time unnecessarily. Besides Accelerate (Mac) and Intel MKL, theres AMD’s <a href="http://developer.amd.com/tools-and-sdks/archive/amd-core-math-library-acml/">ACML</a> library for AMD chips and the <a href="http://math-atlas.sourceforge.net">ATLAS</a> library which is a general purpose tunable library. Also <a href="https://www.tacc.utexas.edu/research-development/tacc-software/gotoblas2">Goto’s BLAS</a> is optimized but is not under active development.</p> Profile of Hilary Parker 2016-01-14T21:15:46+00:00 http://simplystats.github.io/2016/01/14/profile-of-hilary-parker <p>If you’ve ever wanted to know more about my <a href="https://soundcloud.com/nssd-podcast">Not So Standard Deviations</a> co-host (and Johns Hopkins graduate) Hilary Parker, you can go check out the <a href="http://thisisstatistics.org/hilary-parker-gets-crafty-with-statistics-in-her-not-so-standard-job/">great profile of her</a> on the American Statistical Association’s This Is Statistics web site.</p> <blockquote> <p><strong>What advice would you give to high school students thinking about majoring in statistics?</strong></p> <p>It’s such a great field! Not only is the industry booming, but more importantly, the disciplines of statistics teaches you to think analytically, which I find helpful for just about every problem I run into. It’s also a great field to be interested in as a generalist– rather than dedicating yourself to studying one subject, you are deeply learning a set of tools that you can apply to any subject that you find interesting. Just one glance at the topics covered on The Upshot or 538 can give you a sense of that. There’s politics, sports, health, history… the list goes on! It’s a field with endless possibility for growth and exploration, and as I mentioned above, the more I explore the more excited I get about it.</p> </blockquote> Not So Standard Deviations Episode 7 - Statistical Royalty 2016-01-12T08:45:24+00:00 http://simplystats.github.io/2016/01/12/not-so-standard-deviations-episode-7-statistical-royalty <p>The latest episode of Not So Standard Deviations is out, and boy does Hilary have a story to tell.</p> <p>We also talk about Theranos and the pitfalls of diagnostic testing, Spotify’s Discover Weekly playlist generation algorithm (and the need for human product managers), and of course, a little Star Wars. Also, Hilary and I start a new segment where we each give some “free advertising” to something interesting that they think other people should know about.</p> <p>Show Notes:</p> <ul> <li><a href="http://goo.gl/JDk6ni">Gosset Icterometer</a></li> <li>The <a href="http://skybrudeconsulting.com/blog/2015/10/16/theranos-healthcare.html">dangers</a> of <a href="https://www.fredhutch.org/en/news/center-news/2013/11/scientists-urge-caution-personal-genetic-screenings.html">entertainment</a> <a href="http://mobihealthnews.com/35444/the-rise-of-the-seemingly-serious-but-just-for-entertainment-purposes-medical-app/">medicine</a></li> <li>Spotify’s Discover Weekly <a href="http://goo.gl/enzFeR">solves human curation</a>?</li> <li>David Robinson’s <a href="http://varianceexplained.org">Variance Explained</a></li> <li><a href="http://what3words.com">What3Words</a></li> </ul> <p><a href="https://api.soundcloud.com/tracks/241071463/download?client_id=02gUJC0hH2ct1EGOcYXQIzRFU91c72Ea&amp;oauth_token=1-138878-174789515-deb24181d01af">Download the audio for this episode</a>.</p> Jeff, Roger and Brian Caffo are doing a Reddit AMA at 3pm EST Today 2016-01-11T09:29:28+00:00 http://simplystats.github.io/2016/01/11/jeff-roger-and-brian-caffo-are-doing-a-reddit-ama-at-3pm-est-today <p>Jeff Leek, Brian Caffo, and I are doing a <a href="https://www.reddit.com/r/IAmA">Reddit AMA</a> TODAY at 3pm EST. We’re happy to answer questions about…anything…including our roles as Co-Directors of the <a href="https://www.coursera.org/specializations/jhu-data-science">Johns Hopkins Data Science Specialization</a> as well as the <a href="https://www.coursera.org/specializations/executive-data-science">Executive Data Science Specialization</a>.</p> <p>This is one of the few pictures of the three of us together.</p> <p><a href="http://simplystatistics.org/wp-content/uploads/2016/01/IMG_0189.jpg"><img class="alignright size-large wp-image-4586" src="http://simplystatistics.org/wp-content/uploads/2016/01/IMG_0189-1024x768.jpg" alt="IMG_0189" width="990" height="743" srcset="http://simplystatistics.org/wp-content/uploads/2016/01/IMG_0189-120x90.jpg 120w, http://simplystatistics.org/wp-content/uploads/2016/01/IMG_0189-300x225.jpg 300w, http://simplystatistics.org/wp-content/uploads/2016/01/IMG_0189-1024x768.jpg 1024w, http://simplystatistics.org/wp-content/uploads/2016/01/IMG_0189-260x195.jpg 260w" sizes="(max-width: 990px) 100vw, 990px" /></a></p> A non-comprehensive list of awesome things other people did in 2015 2015-12-21T11:22:07+00:00 http://simplystats.github.io/2015/12/21/a-non-comprehensive-list-of-awesome-things-other-people-did-in-2015 <p><em>Editor’s Note: This is the third year I’m making a list of awesome things other people did this year. Just like the lists for <a href="http://simplystatistics.org/2013/12/20/a-non-comprehensive-list-of-awesome-things-other-people-did-this-year/">2013</a> and <a href="http://simplystatistics.org/2014/12/17/a-non-comprehensive-list-of-awesome-things-other-people-did-in-2014/">2014</a> I am doing this off the top of my head. I have avoided talking about stuff I worked on or that people here at Hopkins are doing because this post is supposed to be about other people’s awesome stuff. I wrote this post because a blog often feels like a place to complain, but we started Simply Stats as a place to be pumped up about the stuff people were doing with data. This year’s list is particularly “off the cuff” so I’d appreciate additions if you have ‘em. I have surely missed awesome things people have done.</em></p> <ol> <li>I hear the <a href="http://sml.princeton.edu/tukey">Tukey conference</a> put on by my former advisor John S. was amazing. Out of it came this really good piece by David Donoho on <a href="https://dl.dropboxusercontent.com/u/23421017/50YearsDataScience.pdf">50 years of Data Science</a>.</li> <li>Sherri Rose wrote really accurate and readable guides on <a href="http://drsherrirose.com/academic-cvs-for-statistical-science-faculty-positions">academic CVs</a>, <a href="http://drsherrirose.com/academic-cover-letters-for-statistical-science-faculty-positions">academic cover letters</a>, and <a href="http://drsherrirose.com/how-to-be-an-effective-phd-researcher">how to be an effective PhD researcher</a>.</li> <li>I am not 100% sold on the deep learning hype, but Michael Nielson wrote this awesome book on <a href="http://neuralnetworksanddeeplearning.com/">deep learning and neural networks</a>. I like how approachable it is and how un-hypey it is. I also thought Andrej Karpathy’s <a href="http://karpathy.github.io/2015/10/25/selfie/">blog post</a> on whether you have a good selfie or not was fun.</li> <li>Thomas Lumley continues to be must read regardless of which blog he writes for with a ton of snarky fun posts debunking the latest ridiculous health headlines on <a href="http://www.statschat.org.nz/2015/11/27/to-find-the-minds-construction-near-the-face/">statschat</a> and more in depth posts like this one on pre-filtering multiple tests on <a href="http://notstatschat.tumblr.com/post/131478660126/prefiltering-very-large-numbers-of-tests">notstatschat</a>.</li> <li>David Robinson is making a strong case for top data science blogger with his series of <a href="http://varianceexplained.org/r/bayesian_fdr_baseball/">awesome</a> <a href="http://varianceexplained.org/r/credible_intervals_baseball/">posts</a> on <a href="http://varianceexplained.org/r/empirical_bayes_baseball/">empirical Bayes</a>.</li> <li>Hadley Wickham doing Hadley Wickham things again. <a href="https://github.com/hadley/readr">readr</a> is the biggie for me this year.</li> <li>I’ve been really enjoying the solid coverage of science/statistics from the (not entirely statistics focused as the name would suggest) <a href="https://twitter.com/statnews">STAT</a>.</li> <li>Ben Goldacre and co. launched <a href="http://opentrials.net/">OpenTrials</a> for aggregating all the clinical trial data in the world in an open repository.</li> <li>Christie Aschwanden’s piece on why <a href="http://fivethirtyeight.com/features/science-isnt-broken/">Science Isn’t Broken </a> is a must read and one of the least polemic treatments of the reproducibility/replicability issue I’ve read. The p-hacking graphic is just icing on the cake.</li> <li>I’m excited about the new <a href="http://blog.revolutionanalytics.com/2015/06/r-consortium.html">R Consortium</a> and the idea of having more organizations that support folks in the R community.</li> <li>Emma Pierson’s blog and writeups in various national level news outlets continue to impress. I thought <a href="https://www.washingtonpost.com/news/grade-point/wp/2015/10/15/a-better-way-to-gauge-how-common-sexual-assault-is-on-college-campuses/">this one</a> on changing the incentives for sexual assault surveys was particularly interesting/good.</li> <li> <p>Amanda Cox an co. created this [<em>Editor’s Note: This is the third year I’m making a list of awesome things other people did this year. Just like the lists for <a href="http://simplystatistics.org/2013/12/20/a-non-comprehensive-list-of-awesome-things-other-people-did-this-year/">2013</a> and <a href="http://simplystatistics.org/2014/12/17/a-non-comprehensive-list-of-awesome-things-other-people-did-in-2014/">2014</a> I am doing this off the top of my head. I have avoided talking about stuff I worked on or that people here at Hopkins are doing because this post is supposed to be about other people’s awesome stuff. I wrote this post because a blog often feels like a place to complain, but we started Simply Stats as a place to be pumped up about the stuff people were doing with data. This year’s list is particularly “off the cuff” so I’d appreciate additions if you have ‘em. I have surely missed awesome things people have done.</em></p> </li> <li>I hear the <a href="http://sml.princeton.edu/tukey">Tukey conference</a> put on by my former advisor John S. was amazing. Out of it came this really good piece by David Donoho on <a href="https://dl.dropboxusercontent.com/u/23421017/50YearsDataScience.pdf">50 years of Data Science</a>.</li> <li>Sherri Rose wrote really accurate and readable guides on <a href="http://drsherrirose.com/academic-cvs-for-statistical-science-faculty-positions">academic CVs</a>, <a href="http://drsherrirose.com/academic-cover-letters-for-statistical-science-faculty-positions">academic cover letters</a>, and <a href="http://drsherrirose.com/how-to-be-an-effective-phd-researcher">how to be an effective PhD researcher</a>.</li> <li>I am not 100% sold on the deep learning hype, but Michael Nielson wrote this awesome book on <a href="http://neuralnetworksanddeeplearning.com/">deep learning and neural networks</a>. I like how approachable it is and how un-hypey it is. I also thought Andrej Karpathy’s <a href="http://karpathy.github.io/2015/10/25/selfie/">blog post</a> on whether you have a good selfie or not was fun.</li> <li>Thomas Lumley continues to be must read regardless of which blog he writes for with a ton of snarky fun posts debunking the latest ridiculous health headlines on <a href="http://www.statschat.org.nz/2015/11/27/to-find-the-minds-construction-near-the-face/">statschat</a> and more in depth posts like this one on pre-filtering multiple tests on <a href="http://notstatschat.tumblr.com/post/131478660126/prefiltering-very-large-numbers-of-tests">notstatschat</a>.</li> <li>David Robinson is making a strong case for top data science blogger with his series of <a href="http://varianceexplained.org/r/bayesian_fdr_baseball/">awesome</a> <a href="http://varianceexplained.org/r/credible_intervals_baseball/">posts</a> on <a href="http://varianceexplained.org/r/empirical_bayes_baseball/">empirical Bayes</a>.</li> <li>Hadley Wickham doing Hadley Wickham things again. <a href="https://github.com/hadley/readr">readr</a> is the biggie for me this year.</li> <li>I’ve been really enjoying the solid coverage of science/statistics from the (not entirely statistics focused as the name would suggest) <a href="https://twitter.com/statnews">STAT</a>.</li> <li>Ben Goldacre and co. launched <a href="http://opentrials.net/">OpenTrials</a> for aggregating all the clinical trial data in the world in an open repository.</li> <li>Christie Aschwanden’s piece on why <a href="http://fivethirtyeight.com/features/science-isnt-broken/">Science Isn’t Broken </a> is a must read and one of the least polemic treatments of the reproducibility/replicability issue I’ve read. The p-hacking graphic is just icing on the cake.</li> <li>I’m excited about the new <a href="http://blog.revolutionanalytics.com/2015/06/r-consortium.html">R Consortium</a> and the idea of having more organizations that support folks in the R community.</li> <li>Emma Pierson’s blog and writeups in various national level news outlets continue to impress. I thought <a href="https://www.washingtonpost.com/news/grade-point/wp/2015/10/15/a-better-way-to-gauge-how-common-sexual-assault-is-on-college-campuses/">this one</a> on changing the incentives for sexual assault surveys was particularly interesting/good.</li> <li>Amanda Cox an co. created this ](http://www.nytimes.com/interactive/2015/05/28/upshot/you-draw-it-how-family-income-affects-childrens-college-chances.html) , which is an amazing way to teach people about pre-conceived biases in the way we think about relationships and correlations. I love the crowd-sourcing view on data analysis this suggests.</li> <li>As usual Philip Guo was producing gold over on his blog. I appreciate this piece on <a href="http://www.pgbovine.net/tips-for-data-driven-research.htm">twelve tips for data driven research</a>.</li> <li>I am really excited about the new field of adaptive data analysis. Basically understanding how we can let people be “real data analysts” and still get reasonable estimates at the end of the day. <a href="http://www.sciencemag.org/content/349/6248/636.abstract">This paper</a> from Cynthia Dwork and co was one of the initial salvos that came out this year.</li> <li>Datacamp <a href="https://www.datacamp.com/courses/intro-to-python-for-data-science?utm_source=growth&amp;utm_campaign=python&amp;utm_medium=button">incorporated Python</a> into their platform. The idea of interactive education for R/Python/Data Science is a very cool one and has tons of potential.</li> <li>I was really into the idea of <a href="http://projecteuclid.org/euclid.aoas/1430226098">Cross-Study validatio</a>n that got proposed this year. With the growth of public data in a lot of areas we can really start to get a feel for generalizability.</li> <li>The Open Science Foundation did this <a href="http://www.sciencemag.org/content/349/6251/aac4716">incredible replication of 100 different studies</a> in psychology with attention to detail and care that deserves a ton of attention.</li> <li>Florian’s piece “<a href="http://www.ncbi.nlm.nih.gov/pubmed/26402330">You are not working for me; I am working with you.</a>” should be required reading for all students/postdocs/mentors in academia. This is something I still hadn’t fully figured out until I read Florian’s piece.</li> <li>I think Karl Broman’s post on why <a href="https://kbroman.wordpress.com/2015/09/09/reproducibility-is-hard/">reproducibility is hard</a> is a great introduction to the real issues in making data analyses reproducible.</li> <li>This was the year of the f1000 post-publication review paper. I thought <a href="http://f1000research.com/articles/4-121/v1">this one</a> from Yoav and the ensuing fallout was fascinating.</li> <li>I love pretty much everything out of Di Cook/Heike Hoffman’s groups. This year I liked the paper on <a href="http://download.springer.com/static/pdf/611/art%253A10.1007%252Fs00180-014-0534-x.pdf?originUrl=http%3A%2F%2Flink.springer.com%2Farticle%2F10.1007%2Fs00180-014-0534-x&amp;token2=exp=1450714996~acl=%2Fstatic%2Fpdf%2F611%2Fart%25253A10.1007%25252Fs00180-014-0534-x.pdf%3ForiginUrl%3Dhttp%253A%252F%252Flink.springer.com%252Farticle%252F10.1007%252Fs00180-014-0534-x*~hmac=3c5f5c7c1b2381685437659d8ffd64e1cb2c52d1dfd10506cad5d2af1925c0ac">visual statistical inference in high-dimensional low sample size settings</a>.</li> <li>This is pretty recent, but Nathan Yau’s <a href="https://flowingdata.com/2015/12/15/a-day-in-the-life-of-americans/">day in the life graphic is mesmerizing</a>.</li> </ol> <p>This was a year where open source data people <a href="http://treycausey.com/emotional_rollercoaster_public_work.html">described</a> their <a href="https://twitter.com/johnmyleswhite/status/666429299327569921">pain</a> from people being demanding/mean to them for their contributions. As the year closes I just want to give a big thank you to everyone who did awesome stuff I used this year and have completely ungraciously failed to acknowledge.</p> <p> </p> Not So Standard Deviations: Episode 6 - Google is the New Fisher 2015-12-18T13:08:10+00:00 http://simplystats.github.io/2015/12/18/not-so-standard-deviations-episode-6-google-is-the-new-fisher <p>Episode 6 of Not So Standard Deviations is now posted. In this episode Hilary and I talk about the analytics of our own podcast, and analyses that seem easy but are actually hard.</p> <p>If you haven’t already, you can subscribe to the podcast through <a href="https://itunes.apple.com/us/podcast/not-so-standard-deviations/id1040614570">iTunes</a>.</p> <p>This will be our last episode for 2015 so see you in 2016!</p> <p>Notes</p> <ul> <li><a href="https://goo.gl/X0TFt9">Roger’s books on Leanpub</a></li> <li><a href="https://goo.gl/VO0ckP">KPIs</a></li> <li><a href="http://replyall.soy">Reply All</a>, a great podcast</li> <li><a href="http://user2016.org">Use R! 2016 conference</a> where Don Knuth is an invited speaker!</li> <li><a href="http://goo.gl/wUcTBT">Liz Stuart’s directory of propensity score software</a></li> <li><a href="https://goo.gl/CibhJ0">A/B testing</a></li> <li><a href="https://goo.gl/qMyksb">iid</a></li> <li><a href="https://goo.gl/qHVzWQ">R 3.2.3 release notes</a></li> <li><a href="http://www.pqr-project.org/">pqR</a></li> <li><a href="https://goo.gl/pFOVkx">John Myles White’s tweet</a></li> </ul> <p><a href="https://api.soundcloud.com/tracks/237909534/download?client_id=02gUJC0hH2ct1EGOcYXQIzRFU91c72Ea&amp;oauth_token=1-138878-174789515-deb24181d01af">Download the audio file for this episode</a>.</p> Instead of research on reproducibility, just do reproducible research 2015-12-11T12:18:33+00:00 http://simplystats.github.io/2015/12/11/instead-of-research-on-reproducibility-just-do-reproducible-research <p>Right now reproducibility, replicability, false positive rates, biases in methods, and other problems with science are the hot topic. As I mentioned in a previous post pointing out a flaw with a scientific study is way easier to do correctly than generating a new scientific study. Some folks have noticed that right now there is a huge market for papers pointing out how science is flawed. The combination of the relative ease of pointing out flaws and the huge payout for writing these papers is helping to generate the hype around the “reproducibility crisis”.</p> <p>I <a href="http://www.slideshare.net/jtleek/evidence-based-data-analysis-45800617">gave a talk</a> a little while ago at an NAS workshop where I stated that all the tools for reproducible research exist (the caveat being really large analyses - although that is changing as well). To make a paper completely reproducible, open, and available for post publication review you can use the following approach with no new tools/frameworks needed.</p> <ol> <li>Use <a href="https://github.com/">Github </a>for version control.</li> <li>Use <a href="http://rmarkdown.rstudio.com/">rmarkdown</a> or <a href="http://ipython.org/notebook.html">iPython notebooks</a> for your analysis code</li> <li>When your paper is done post it to <a href="http://arxiv.org/">arxiv</a> or <a href="http://biorxiv.org/">biorxiv</a>.</li> <li>Post your data to an appropriate repository like <a href="http://www.ncbi.nlm.nih.gov/sra">SRA</a> or a general purpose site like <a href="https://figshare.com/">figshare.</a></li> <li>Send any software you develop to a controlled repository like <a href="https://cran.r-project.org/">CRAN</a> or <a href="http://bioconductor.org/">Bioconductor</a>.</li> <li>Participate in the <a href="http://simplystatistics.org/2015/11/16/so-you-are-getting-crushed-on-the-internet-the-new-normal-for-academics/">post publication discussion on Twitter and with a Blog</a></li> </ol> <p>This is also true of open science, open data sharing, reproducibility, replicability, post-publication peer review and all the other issues forming the “reproducibility crisis”. There is a lot of attention and heat that has focused on the “crisis” or on folks who make a point to take a stand on reproducibility or open science or post publication review. But in the background, outside of the hype, there are a large group of people that are quietly executing solid, open, reproducible science.</p> <p>I wish that this group would get more attention so I decided to point out a few of them. Next time somebody asks me about the research on reproducibility or open science I’ll just point them here and tell them to just follow the lead of people doing it.</p> <ul> <li><strong>Karl Broman</strong> - posts all of his <a href="http://kbroman.org/pages/talks.html">talks online </a>, generates many widely used <a href="http://kbroman.org/pages/software.html">open source packages</a>, writes <a href="http://kbroman.org/pages/tutorials.html">free/open tutorials</a> on everything from knitr to making webpages, makes his <a href="http://www.ncbi.nlm.nih.gov/pubmed/26290572">papers</a> highly <a href="https://github.com/kbroman/Paper_SampleMixups">reproducible</a>.</li> <li><strong>Jessica Li</strong> - <a href="http://www.stat.ucla.edu/~jingyi.li/software-and-data.html">posts her data online and writes open source software for her analyses</a>.</li> <li><strong>Mark Robinson - </strong>posts many of his papers as <a href="http://biorxiv.org/search/author1%3Arobinson%252C%2Bmd%20numresults%3A10%20sort%3Arelevance-rank%20format_result%3Astandard">preprints on biorxiv</a>, makes his <a href="https://github.com/markrobinsonuzh/diff_splice_paper">analyses reproducible</a>, writes <a href="http://bioconductor.org/packages/release/bioc/html/Repitools.html">open source software </a></li> <li><strong>Florian Markowetz -<a href="http://www.markowetzlab.org/software/"> </a></strong><a href="http://www.markowetzlab.org/software/">writes open source software</a>, provides <a href="http://www.markowetzlab.org/data.php">Bioconductor data for major projects</a>, links <a href="http://www.markowetzlab.org/publications.php">his papers with his code</a> nicely on his publications page.</li> <li><strong>Raphael Gottardo</strong> - <a href="http://www.rglab.org/software.html">writes/maintains many open source software packages</a>, makes <a href="https://github.com/RGLab/BNCResponse">his analyses reproducible and available via Github</a>, posts <a href="http://biorxiv.org/content/early/2015/06/15/020842">preprints of his papers</a>.</li> <li><strong>Genevera Allen - </strong>writes](https://cran.r-project.org/web/packages/TCGA2STAT/index.html) to make data easier to access, posts <a href="http://biorxiv.org/content/early/2015/09/24/027516">preprints on biorxiv</a> and <a href="http://arxiv.org/pdf/1502.03853v1.pdf">on arxiv</a></li> <li><strong>Lorena Barba</strong> - <a href="http://openedx.seas.gwu.edu/courses/GW/MAE6286/2014_fall/about">teaches open source moocs</a>, with lessons as <a href="https://github.com/barbagroup/CFDPython">open source iPython modules</a>, and <a href="https://github.com/barbagroup/pygbe">reproducible code for her analyses</a>.</li> <li><strong>Alicia Oshlack - </strong>writes papers with <a href="http://www.genomemedicine.com/content/7/1/43">completely reproducible analyses</a>, <a href="http://bioconductor.org/packages/release/bioc/html/missMethyl.html">publishes lots of open source software</a> and publishes <a href="http://biorxiv.org/content/early/2015/01/23/013698">preprints</a> for her papers.</li> <li><strong>Baggerly and Coombs</strong> - although they are famous for a <a href="https://projecteuclid.org/euclid.aoas/1267453942">highly public reproducible piece of research</a> they have also quietly implemented policies like <a href="http://magazine.amstat.org/blog/2011/01/01/scipolicyjan11/">making all reports reproducible for their consulting center</a>.</li> </ul> <p>This list was made completely haphazardly as all my lists are, but just to indicate there are a ton of people out there doing this. One thing that is clear too is that grad students and postdocs are adopting the approach I described at a very high rate.</p> <p>Moreover there are people that have been doing parts of this for a long time (like the <a href="http://arxiv.org/">physics</a> or <a href="http://biostats.bepress.com/jhubiostat/">biostatistics</a> communities with preprints, or how people have used <a href="https://projecteuclid.org/euclid.aoas/1267453942">Sweave for a long time</a>) . I purposely left people off the list like Titus and Ethan who have gone all in, even posting their <a href="http://ivory.idyll.org/blog/grants-posted.html">grants</a> <a href="http://jabberwocky.weecology.org/2012/08/10/a-list-of-publicly-available-grant-proposals-in-the-biological-sciences/">online</a>. I did this because they are very loud advocates of open science, but I wanted to highlight quieter contributors and point out that while there is a lot of noise going on over in one corner, many people are quietly doing really good science in another.</p> By opposing tracking well-meaning educators are hurting disadvantaged kids 2015-12-09T10:10:02+00:00 http://simplystats.github.io/2015/12/09/by-opposing-tracking-well-meaning-educators-are-hurting-disadvantaged-kids <div class="page" title="Page 2"> <div class="layoutArea"> <div class="column"> <p> An unfortunate fact about the US K-12 system is that the education gap between poor and rich is growing. One manifestation of this trend is that we rarely see US kids from disadvantaged backgrounds become tenure track faculty, especially in the STEM fields. In my experience, the ones that do make it, when asked how they overcame the suboptimal math education their school district provided, often respond "I was <a href="https://en.wikipedia.org/wiki/Tracking_(education)">tracked</a>" or "I went to a <a href="https://en.wikipedia.org/wiki/Magnet_school">magnet school</a>". Magnet schools filter students with admission tests and then teach at a higher level than an average school, so essentially the entire school is an advanced track. </p> </div> </div> </div> <p>Twenty years of classroom instruction experience has taught me that classes with diverse academic abilities present one of the most difficult teaching challenges. Typically, one is forced to focus on only a sub-group of students, usually the second quartile. As a consequence the lower and higher quartiles are not properly served. At the university level, we minimize this problem by offering different levels: remedial math versus math for engineers, probability for the Masters program versus probability for PhD students, co-ed intramural sports versus the varsity basketball team, intro to World Music versus a spot in the orchestra, etc. In K-12, tracking seems like the obvious solution to teaching to an array of student levels.</p> <p>Unfortunately, there has been a trend recently to move away from tracking and several school districts now forbid it. The motivation seems to be a series of <a href="http://www.tandfonline.com/doi/abs/10.1207/s15430421tip4501_9">observational</a> <a href="http://files.eric.ed.gov/fulltext/ED329615.pdf">studies</a> that note that “low-track classes tend to be primarily composed of low-income students, usually minorities, while upper-track classes are usually dominated by students from socioeconomically successful groups.” Tracking opponents infer that this unfortunate reality is due to bias (conscious or unconscious) in the the informal referrals that are typically used to decide which students are advanced. However, <strong>this is a critique of the referral system, not of tracking itself.</strong> A simple fix is to administer an objective test or use the percentiles from <a href="http://www.doe.mass.edu/mcas/overview.html">state assessment tests</a>. In fact, such exams have been developed and implemented. A recent study (summarized <a href="http://www.vox.com/2015/11/23/9784250/card-giuliano-gifted-talented">here</a>) examined the data from a district that for a period of time implemented an objective assessment and found that</p> <blockquote> <p>[t]he number of Hispanic students [in the advanced track increased] by 130 percent and the number of black students by 80 percent.</p> </blockquote> <p>Unfortunately, instead of maintaining the placement criteria, which benefited underrepresented minorities without relaxing standards, these school districts reverted to the old, flawed system due to budget cuts.</p> <p>Another argument against tracking is that students benefit more from being in classes with higher-achieving peers, rather than being in a class with students with similar subject mastery and a teacher focused on their level. However a [&lt;div class="page" title="Page 2"&gt;</p> <div class="layoutArea"> <div class="column"> <p> An unfortunate fact about the US K-12 system is that the education gap between poor and rich is growing. One manifestation of this trend is that we rarely see US kids from disadvantaged backgrounds become tenure track faculty, especially in the STEM fields. In my experience, the ones that do make it, when asked how they overcame the suboptimal math education their school district provided, often respond "I was <a href="https://en.wikipedia.org/wiki/Tracking_(education)">tracked</a>" or "I went to a <a href="https://en.wikipedia.org/wiki/Magnet_school">magnet school</a>". Magnet schools filter students with admission tests and then teach at a higher level than an average school, so essentially the entire school is an advanced track. </p> </div> </div> <p>&lt;/div&gt;</p> <p>Twenty years of classroom instruction experience has taught me that classes with diverse academic abilities present one of the most difficult teaching challenges. Typically, one is forced to focus on only a sub-group of students, usually the second quartile. As a consequence the lower and higher quartiles are not properly served. At the university level, we minimize this problem by offering different levels: remedial math versus math for engineers, probability for the Masters program versus probability for PhD students, co-ed intramural sports versus the varsity basketball team, intro to World Music versus a spot in the orchestra, etc. In K-12, tracking seems like the obvious solution to teaching to an array of student levels.</p> <p>Unfortunately, there has been a trend recently to move away from tracking and several school districts now forbid it. The motivation seems to be a series of <a href="http://www.tandfonline.com/doi/abs/10.1207/s15430421tip4501_9">observational</a> <a href="http://files.eric.ed.gov/fulltext/ED329615.pdf">studies</a> that note that “low-track classes tend to be primarily composed of low-income students, usually minorities, while upper-track classes are usually dominated by students from socioeconomically successful groups.” Tracking opponents infer that this unfortunate reality is due to bias (conscious or unconscious) in the the informal referrals that are typically used to decide which students are advanced. However, <strong>this is a critique of the referral system, not of tracking itself.</strong> A simple fix is to administer an objective test or use the percentiles from <a href="http://www.doe.mass.edu/mcas/overview.html">state assessment tests</a>. In fact, such exams have been developed and implemented. A recent study (summarized <a href="http://www.vox.com/2015/11/23/9784250/card-giuliano-gifted-talented">here</a>) examined the data from a district that for a period of time implemented an objective assessment and found that</p> <blockquote> <p>[t]he number of Hispanic students [in the advanced track increased] by 130 percent and the number of black students by 80 percent.</p> </blockquote> <p>Unfortunately, instead of maintaining the placement criteria, which benefited underrepresented minorities without relaxing standards, these school districts reverted to the old, flawed system due to budget cuts.</p> <p>Another argument against tracking is that students benefit more from being in classes with higher-achieving peers, rather than being in a class with students with similar subject mastery and a teacher focused on their level. However a](http://web.stanford.edu/~pdupas/Tracking_rev.pdf) (and the only one of which I am aware) finds that tracking helps all students:</p> <blockquote> <p>We find that tracking students by prior achievement raised scores for all students, even those assigned to lower achieving peers. On average, after 18 months, test scores were 0.14 standard deviations higher in tracking schools than in non-tracking schools (0.18 standard deviations higher after controlling for baseline scores and other control variables). After controlling for the baseline scores, students in the top half of the pre-assignment distribution gained 0.19 standard deviations, and those in the bottom half gained 0.16 standard deviations. <strong>Students in all quantiles benefited from tracking. </strong></p> </blockquote> <p>I believe that without tracking, the achievement gap between disadvantaged children and their affluent peers will continue to widen since involved parents will seek alternative educational opportunities, including private schools or subject specific extracurricular acceleration programs. With limited or no access to advanced classes in the public system, disadvantaged students will be less prepared to enter the very competitive STEM fields. Note that competition comes not only from within the US, but from other countries including many with educational systems that track.</p> <p>To illustrate the extreme gap, the following exercises are from a 7th grade public school math class (in a high performing school district):</p> <table style="width: 100%;"> <tr> <td> <a href="http://simplystatistics.org/wp-content/uploads/2016/12/Screen-Shot-2015-12-07-at-11.49.41-AM.png"><img src="http://simplystatistics.org/wp-content/uploads/2016/12/Screen-Shot-2015-12-07-at-11.49.41-AM.png" alt="Screen Shot 2015-12-07 at 11.49.41 AM" width="275" /></a> </td> <td> <a href="http://simplystatistics.org/wp-content/uploads/2016/12/Screen-Shot-2015-12-09-at-9.00.57-AM.png"><img src="http://simplystatistics.org/wp-content/uploads/2016/12/Screen-Shot-2015-12-09-at-9.00.57-AM.png" alt="Screen Shot 2015-12-09 at 9.00.57 AM" width="275" /></a> </td> </tr> </table> <p>(Click to enlarge). There is no tracking so all students must work on these problems. Meanwhile, in a 7th grade advanced, private math class, that same student can be working on problems like these:<a href="http://simplystatistics.org/wp-content/uploads/2016/12/Screen-Shot-2015-12-07-at-11.47.45-AM.png"><img class="alignnone size-full wp-image-4511" src="http://simplystatistics.org/wp-content/uploads/2016/12/Screen-Shot-2015-12-07-at-11.47.45-AM.png" alt="Screen Shot 2015-12-07 at 11.47.45 AM" width="1165" height="341" srcset="http://simplystatistics.org/wp-content/uploads/2016/12/Screen-Shot-2015-12-07-at-11.47.45-AM-300x88.png 300w, http://simplystatistics.org/wp-content/uploads/2016/12/Screen-Shot-2015-12-07-at-11.47.45-AM-1024x300.png 1024w, http://simplystatistics.org/wp-content/uploads/2016/12/Screen-Shot-2015-12-07-at-11.47.45-AM-260x76.png 260w, http://simplystatistics.org/wp-content/uploads/2016/12/Screen-Shot-2015-12-07-at-11.47.45-AM.png 1165w" sizes="(max-width: 1165px) 100vw, 1165px" /></a>Let me stress that there is nothing wrong with the first example if it is the appropriate level of the student. However, a student who can work at the level of the second example, should be provided with the opportunity to do so notwithstanding their family’s ability to pay. Poorer kids in districts which do not offer advanced classes will not only be less equipped to compete with their richer peers, but many of the academically advanced ones may, I suspect, dismiss academics due to lack of challenge and boredom. Educators need to consider evidence when making decisions regarding policy. Tracking can be applied unfairly, but that aspect can be remedied. Eliminating tracking all together takes away a crucial tool for disadvantaged students to move into the STEM fields and, according to the empirical evidence, hurts all students.</p> Not So Standard Deviations: Episode 5 - IRL Roger is Totally With It 2015-12-03T09:52:47+00:00 http://simplystats.github.io/2015/12/03/not-so-standard-deviations-episode-5-irl-roger-is-totally-with-it <p>I just posted Episode 5 of Not So Standard Deviations so check your feeds! Sorry for the long delay since the last episode but we got a bit tripped up by the Thanksgiving holiday.</p> <p>In this episode, Hilary and I open up the mailbag and go through some of the feedback we’ve gotten on the previous episodes. The rest of the time is spent talking about the importance of reproducibility in data analysis both in academic research and in industry settings.</p> <p>If you haven’t already, you can subscribe to the podcast through <a href="https://itunes.apple.com/us/podcast/not-so-standard-deviations/id1040614570">iTunes</a>. Or you can use the <a href="http://feeds.soundcloud.com/users/soundcloud:users:174789515/sounds.rss">SoundCloud RSS feed</a> directly.</p> <p>Notes:</p> <ul> <li>Hilary’s <a href="https://youtu.be/7B3n-5atLxM">talk on reproducible analysis in production</a> at the New York R Conference</li> <li>Hilary’s <a href="https://youtu.be/zlSOckFpYqg">Ignite presentation</a> at Strata 2013</li> <li>Roger’s <a href="https://youtu.be/aH8dpcirW1U">talk on “Computational and Policy Tools for Reproducible Research”</a> at the Applied Mathematics Perspectives Workshop in Vancouver, 2011</li> <li>Duke Scandal <a href="http://goo.gl/rEO5QD">Starter Set</a></li> <li><a href="https://youtu.be/7gYIs7uYbMo">Keith Baggerly’s talk</a> on Duke Scandal</li> <li>The <a href="https://goo.gl/RtpBZa">Web of Trust</a></li> <li><a href="https://goo.gl/MlM0gu">testdat</a> R package</li> </ul> <p><a href="https://api.soundcloud.com/tracks/235689361/download?client_id=02gUJC0hH2ct1EGOcYXQIzRFU91c72Ea&amp;oauth_token=1-138878-174789515-deb24181d01af">Download the audio file for this episode</a>.</p> <p>Or you can listen right here:</p> Thinking like a statistician: the importance of investigator-initiated grants 2015-12-01T11:40:29+00:00 http://simplystats.github.io/2015/12/01/thinking-like-a-statistician-fund-more-investigator-initiated-grants <p>A substantial amount of scientific research is funded by investigator-initiated grants. A researcher has an idea, writes it up and sends a proposal to a funding agency. The agency then elicits help from a group of peers to evaluate competing proposals. Grants are awarded to the most highly ranked ideas. The percent awarded depends on how much funding gets allocated to these types of proposals. At the NIH, the largest funding agency of these types of grants, the success rate recently <a href="https://nihdirectorsblog.files.wordpress.com/2013/09/sequestration-success-rates1.jpg">fell below 20% from a high above 35%</a>. Part of the reason these percentages have fallen is to make room for large collaborative projects. Large projects seem to be increasing, and not just at the NIH. In Europe, for example, the <a href="https://www.humanbrainproject.eu/">Human Brain Project</a> has an estimated cost of over 1 billion US over 10 years. To put this in perspective, 1 billion dollars can fund over 500 <a href="http://grants.nih.gov/grants/funding/r01.htm">NIH R01s</a>. R01 is the NIH mechanism most appropriate for investigator initiated proposals.</p> <p>The merits of big science has been widely debated (for example <a href="http://www.michaeleisen.org/blog/?p=1179">here</a> and <a href="http://simplystatistics.org/2013/02/27/please-save-the-unsolicited-r01s/">here</a>). And most agree that some big projects have been successful. However, in this post I present a statistical argument highlighting the importance of investigator-initiated awards. The idea is summarized in the graph below.</p> <p><a href="http://simplystatistics.org/wp-content/uploads/2015/12/Rplot.png"><img class="alignnone size-full wp-image-4483" src="http://simplystatistics.org/wp-content/uploads/2015/12/Rplot.png" alt="Rplot" width="1112" height="551" srcset="http://simplystatistics.org/wp-content/uploads/2015/12/Rplot-300x149.png 300w, http://simplystatistics.org/wp-content/uploads/2015/12/Rplot-1024x507.png 1024w, http://simplystatistics.org/wp-content/uploads/2015/12/Rplot-260x129.png 260w, http://simplystatistics.org/wp-content/uploads/2015/12/Rplot.png 1112w" sizes="(max-width: 1112px) 100vw, 1112px" /></a></p> <p>The two panes above represent two different funding strategies: fund-many-R01s (left) or reduce R01s to fund several large projects (right). The grey crosses represent investigators and the gold dots represent potential paradigm-shifting geniuses. Location on the Cartesian plane represent research areas, with the blue circles denoting areas that are prime for an important scientific advance. The largest scientific contributions occur when a gold dot falls in a blue circle. Large contributions also result from the accumulation of incremental work produced by grey crosses in the blue circles.</p> <p>Although not perfect, the peer review approach implemented by most funding agencies appears to work quite well at weeding out unproductive researchers and unpromising ideas. They also seem to do well at spreading funds across general areas. For example NIH spreads funds across <a href="https://www.nih.gov/institutes-nih/list-nih-institutes-centers-offices">diseases and public health challenges</a> (for example cancer, mental health, heart, genomics, heart and lung disease.) as well as <a href="https://www.nigms.nih.gov/Pages/default.aspx">general medicine</a>, <a href="https://www.genome.gov/">genomics</a> and <a href="https://www.nlm.nih.gov/">information.</a> However, precisely predicting who will be a gold dot or what specific area will be a blue circle seems like an impossible endeavor. Increasing the number of tested ideas and researchers therefore increases our chance of success. When a funding agency decides to invest big in a specific area (green dollar signs) they are predicting the location of a blue circle. As funding flows into these areas, so do investigators (note the clusters). The total number of funded lead investigators also drops. The risk here is that if the dollar sign lands far from a blue dot, we pull researchers away from potentially fruitful areas. If after 10 years of funding, the <a href="https://www.humanbrainproject.eu/">Human Brain Project</a> doesn’t <a href="https://www.humanbrainproject.eu/mission">“achieve a multi-level, integrated understanding of brain structure and function”</a> we will have missed out on trying out 500 ideas by hundreds of different investigators. With a sample size this large, we expect at least a  handful of these attempts to result in the type of impactful advance that justifies funding scientific research.</p> <p>The simulation presented (code below) here is clearly an over simplification, but it does depict the statistical reason why I favor investigator-initiated grants.  The simulation clearly depicts that the strategy of funding many investigator-initiated grants is key for the continued success of scientific research.</p> <p><tt><br /> set.seed(2)<br /> library(rafalib)<br /> thecol=”gold3”<br /> mypar(1,2,mar=c(0.5,0.5,2,0.5))<br /> ###<br /> ## Start with the many R01s model<br /> ###<br /> ##generate location of 2,000 investigators<br /> N = 2000<br /> x = runif(N)<br /> y = runif(N)<br /> ## 1% are geniuses<br /> Ng = N<em>0.01<br /> g = rep(4,N);g[1:Ng]=16<br /> ## generate location of important areas of research<br /> M0 = 10<br /> x0 = runif(M0)<br /> y0 = runif(M0)<br /> r0 = rep(0.03,M0)<br /> ##Make the plot<br /> nullplot(xaxt=”n”,yaxt=”n”,main=”Many R01s”)<br /> symbols(x0,y0,circles=r0,fg=”black”,bg=”blue”,<br /> lwd=3,add=TRUE,inches=FALSE)<br /> points(x,y,pch=g,col=ifelse(g==4,”grey”,thecol))<br /> points(x,y,pch=g,col=ifelse(g==4,NA,thecol))<br /> ### Generate the location of 5 big projects<br /> M1 = 5<br /> x1 = runif(M1)<br /> y1 = runif(M1)<br /> ##make initial plot<br /> nullplot(xaxt=”n”,yaxt=”n”,main=”A Few Big Projects”)<br /> symbols(x0,y0,circles=r0,fg=”black”,bg=”blue”,<br /> lwd=3,add=TRUE,inches=FALSE)<br /> ### Generate location of investigators attracted<br /> ### to location of big projects. There are 1000 total<br /> ### investigators<br /> Sigma = diag(2)</em>0.005<br /> N1 = 200<br /> Ng1 = round(N1<em>0.01)<br /> g1 = rep(4,N);g1[1:Ng1]=16<br /> library(MASS)<br /> for(i in 1:M1){<br /> xy = mvrnorm(N1,c(x1[i],y1[i]),Sigma)<br /> points(xy[,1],xy[,2],pch=g1,col=ifelse(g1==4,”grey”,thecol))<br /> }<br /> ### generate location of investigators that ignore big projects<br /> ### note now 500 instead of 200. Note overall total<br /> ## is also less because large projects result in less<br /> ## lead investigators<br /> N = 500<br /> x = runif(N)<br /> y = runif(N)<br /> Ng = N</em>0.01<br /> g = rep(4,N);g[1:Ng]=16<br /> points(x,y,pch=g,col=ifelse(g==4,”grey”,thecol))<br /> points(x1,y1,pch=””,col=”darkgreen”,cex=2,lwd=2)<br /> </tt></p> A thanksgiving dplyr Rubik's cube puzzle for you 2015-11-25T12:14:06+00:00 http://simplystats.github.io/2015/11/25/a-thanksgiving-dplyr-rubiks-cube-puzzle-for-you <p><a href="http://nickcarchedi.com/">Nick Carchedi</a> is back visiting from <a href="https://www.datacamp.com/">DataCamp</a> and for fun we came up with a <a href="https://cran.rstudio.com/web/packages/dplyr/vignettes/introduction.html">[Nick Carchedi](http://nickcarchedi.com/) is back visiting from [DataCamp](https://www.datacamp.com/) and for fun we came up with a</a> Rubik’s cube puzzle. Here is how it works. To solve the puzzle you have to make a 4 x 3 data frame that spells Thanksgiving like this:</p> <div class="oembed-gist"> <noscript> View the code on <a href="https://gist.github.com/jtleek/4d4b63a035973231e6d4">Gist</a>. </noscript> </div> <p><span style="line-height: 1.5;">To solve the puzzle you need to pipe this data frame in </span></p> <div class="oembed-gist"> <noscript> View the code on <a href="https://gist.github.com/jtleek/aae1218a8f4d1220e07d">Gist</a>. </noscript> </div> <p>and pipe out the Thanksgiving data frame using only the dplyr commands <em>arrange</em>, <em>mutate</em>, <em>slice</em>, <em>filter</em> and <em>select</em>. For advanced users you can try our slightly more complicated puzzle:</p> <div class="oembed-gist"> <noscript> View the code on <a href="https://gist.github.com/jtleek/b82531d9dac78ba3c60a">Gist</a>. </noscript> </div> <p>See if you can do it <a href="http://www.theguardian.com/technology/video/2015/nov/24/boy-completes-rubiks-cube-in-49-seconds-word-recordvideo">this fast</a>. Post your solutions in the comments and Happy Thanksgiving!</p> 20 years of Data Science: from Music to Genomics 2015-11-24T10:00:56+00:00 http://simplystats.github.io/2015/11/24/20-years-of-data-science-and-data-driven-discovery-from-music-to-genomics <p>I finally got around to reading David Donoho’s <a href="https://dl.dropboxusercontent.com/u/23421017/50YearsDataScience.pdf">50 Years of Data Science</a> paper. I highly recommend it. The following quote seems to summarize the sentiment that motivated the paper, as well as why it has resonated among academic statisticians:</p> <div class="page" title="Page 5"> <div class="layoutArea"> <div class="column"> <blockquote> <p> The statistics profession is caught at a confusing moment: the activities which preoccupied it over centuries are now in the limelight, but those activities are claimed to be bright shiny new, and carried out by (although not actually invented by) upstarts and strangers. </p> </blockquote> </div> </div> </div> <p>The reason we started this blog over four years ago was because, as Jeff wrote in his inaugural post, we were “<a href="http://simplystatistics.org/2011/09/07/first-things-first/">fired up about the new era where data is abundant and statisticians are scientists</a>”. It was clear that many disciplines were becoming data-driven and that interest in data analysis was growing rapidly. We were further motivated because, despite this <a href="http://simplystatistics.org/2014/09/15/applied-statisticians-people-want-to-learn-what-we-do-lets-teach-them/">new found interest in our work</a>, academic statisticians were, in general, more interested in the development of context free methods than in leveraging applied statistics to take <a href="http://simplystatistics.org/2012/06/22/statistics-and-the-science-club/">leadership roles</a> in data-driven projects. Meanwhile, great and highly visible applied statistics work was occurring in other fields such as astronomy, computational biology, computer science, political science and economics. So it was not completely surprising that some (bio)statistics departments were being left out from larger university-wide data science initiatives. Some of <a href="http://simplystatistics.org/2014/07/25/academic-statisticians-there-is-no-shame-in-developing-statistical-solutions-that-solve-just-one-problem/">our</a> <a href="http://simplystatistics.org/2013/04/15/data-science-only-poses-a-threat-to-biostatistics-if-we-dont-adapt/">posts</a> exhorted academic departments to embrace larger numbers of applied statisticians:</p> <blockquote> <p>[M]any of the giants of our discipline were very much interested in solving specific problems in genetics, agriculture, and the social sciences. In fact, many of today’s most widely-applied methods were originally inspired by insights gained by answering very specific scientific questions. I worry that the balance between application and theory has shifted too far away from applications. An unfortunate consequence is that our flagship journals, including our applied journals, are publishing too many methods seeking to solve many problems but actually solving none. By shifting some of our efforts to solving specific problems we will get closer to the essence of modern problems and will actually inspire more successful generalizable methods.</p> </blockquote> <p>Donoho points out that John Tukey had a similar preoccupation 50 years ago:</p> <div class="page" title="Page 10"> <div class="layoutArea"> <div class="column"> <blockquote> <p> For a long time I have thought I was a statistician, interested in inferences from the particular to the general. But as I have watched mathematical statistics evolve, I have had cause to wonder and to doubt. ... All in all I have come to feel that my central interest is in data analysis, which I take to include, among other things: procedures for analyzing data, techniques for interpreting the results of such procedures, ways of planning the gathering of data to make its analysis easier, more precise or more accurate, and all the machinery and results of (mathematical) statistics which apply to analyzing data </p> </blockquote> <p> Many applied statisticians do the things Tukey mentions above. In the blog we have encouraged them to <a href="http://simplystatistics.org/2014/09/15/applied-statisticians-people-want-to-learn-what-we-do-lets-teach-them/">teach the gory details of what what they do</a>, along with the general methodology we currently teach. With all this in mind, several months ago, when I was invited to give a talk at a department that was, at the time, deciphering their role in their university's data science initiative, I gave a talk titled<em> 20 years of Data Science: from Music to Genomics. </em>The goal was to explain why <em>applied statistician</em> is not considered synonymous with <em>data scientist </em>even when we focus on the same goal: <a href="https://en.wikipedia.org/wiki/Data_science">extract knowledge or insights from data.</a> </p> <p> The first example in the talk related to how academic applied statisticians tend to emphasize the parts that will be most appreciated by our math stat colleagues and ignore the aspects that are today being heralded as the linchpins of data science. I used my thesis papers as examples. <a href="http://archive.cnmat.berkeley.edu/Research/1998/Rafael/tesis.pdf">My dissertation work</a> was about finding meaningful parametrization of musical sound signals that<img class="wp-image-4449 alignright" src="http://www.biostat.jhsph.edu/~ririzarr/Demo/img7.gif" alt="Spectrogram" width="380" height="178" /> my collaborators could use to manipulate sounds to create new ones. To do this, I prepared a database of sounds, wrote code to extract and import the digital representations from CDs into S-plus (yes, I'm that old), visualized the data to motivate models, wrote code in C (or was it Fortran?) to make the analysis go faster, and tested these models with residual analysis by ear (you can listen to them <a href="http://www.biostat.jhsph.edu/~ririzarr/Demo/">here</a>). None of these data science aspects were highlighted in the <a href="http://www3.stat.sinica.edu.tw/statistica/oldpdf/A10n42.pdf">papers</a> <a href="http://www.tandfonline.com/doi/abs/10.1198/000313001300339969#.Vk4_ht-rQUE">I</a> <a href="http://www.tandfonline.com/doi/abs/10.1198/016214501750332875#.Vk4_mN-rQUE">wrote </a><a href="http://www.tandfonline.com/doi/abs/10.1198/016214501753168082#.Vk4_qt-rQUE">about</a> my <a href="http://onlinelibrary.wiley.com/doi/10.1111/1467-9892.01515/abstract?userIsAuthenticated=false&amp;deniedAccessCustomisedMessage=">thesis</a>. Here is a screen shot from <a href="http://onlinelibrary.wiley.com/doi/10.1111/1467-9892.01515/abstract">this paper</a>: </p> </div> </div> </div> <p><a href="http://simplystatistics.org/wp-content/uploads/2016/05/Screen-Shot-2015-04-15-at-12.24.40-PM.png"><img class="wp-image-4449 aligncenter" src="http://simplystatistics.org/wp-content/uploads/2016/05/Screen-Shot-2015-04-15-at-12.24.40-PM.png" alt="Screen Shot 2015-04-15 at 12.24.40 PM" width="320" height="342" srcset="http://simplystatistics.org/wp-content/uploads/2016/05/Screen-Shot-2015-04-15-at-12.24.40-PM-957x1024.png 957w, http://simplystatistics.org/wp-content/uploads/2016/05/Screen-Shot-2015-04-15-at-12.24.40-PM-187x200.png 187w, http://simplystatistics.org/wp-content/uploads/2016/05/Screen-Shot-2015-04-15-at-12.24.40-PM.png 1204w" sizes="(max-width: 320px) 100vw, 320px" /></a></p> <p>I am actually glad I wrote out and published all the technical details of this work. It was great training. My point was simply that based on the focus of these papers, this work would not be considered data science.</p> <p>The rest of my talk described some of the work I did once I transitioned into applications in Biology. I was fortunate to have a <a href="http://www.jhsph.edu/faculty/directory/profile/3859/scott-zeger">department chair</a> that appreciated lead-author papers in the subject matter journals as much as statistical methodology papers. This opened the door for me to become a full fledged applied statistician/data scientist. In the talk I described how <a href="http://bioinformatics.oxfordjournals.org/content/20/3/307.short">developing software packages,</a> <a href="http://www.nature.com/nmeth/journal/v2/n5/abs/nmeth756.html">planning</a> the <a href="http://www.nature.com/nmeth/journal/v4/n11/abs/nmeth1102.html">gathering of data</a> to <a href="http://www.ncbi.nlm.nih.gov/pubmed/?term=16108723">aid method development</a>, developing <a href="http://www.ncbi.nlm.nih.gov/pubmed/14960458">web tools</a> to assess data analysis techniques in the wild, and facilitating <a href="http://www.ncbi.nlm.nih.gov/pubmed/19151715">data-driven discovery</a> in biology has been very gratifying and, simultaneously, helped my career. However, at some point, early in my career, senior members of my department encouraged me to write and submit a methods paper to a statistical journal to go along with every paper I sent to the subject matter journals. Although I do write methods papers when I think the ideas add to the statistical literature, I did not follow the advice to simply write papers for the sake of publishing in statistics journals. Note that if (bio)statistics departments require applied statisticians to do this, then it becomes harder to have an impact as data scientists. Departments that are not producing widely used methodology or successful and visible applied statistics projects (or both), should not be surprised when they are not included in data science initiatives. So, applied statistician, read that Tukey quote again, listen to <a href="https://youtu.be/vbb-AjiXyh0">President Obama</a>, and go do some great data science.</p> <p> </p> <p> </p> Some Links Related to Randomized Controlled Trials for Policymaking 2015-11-19T12:49:03+00:00 http://simplystats.github.io/2015/11/19/some-links-related-to-randomized-controlled-trials-for-policymaking <div> <p> In response to <a href="http://simplystatistics.org/2015/11/17/why-are-randomized-trials-not-used-by-policymakers/">my previous post</a>, <a href="https://gspp.berkeley.edu/directories/faculty/avi-feller">Avi Feller</a> sent me these links related to efforts promoting the use of RCTs and evidence-based approaches for policymaking: </p> <ul> <li> The theme of this year's just-concluded APPAM conference (the national public policy research organization) was "evidence-based policymaking," with a headline panel on using experiments in policy (see <a href="http://www.appam.org/events/fall-research-conference/2015-fall-research-conference-information/" target="_blank">here</a> and <a href="http://www.appam.org/2015appam-student-summary-using-experiments-for-evidence-based-policy-lessons-from-the-private-sector/" target="_blank">here</a>). </li> </ul> <ul> <li> Jeff Liebman has written extensively about the use of randomized experiments in policy (see <a href="http://govinnovator.com/ten_year_challenge/" target="_blank">here</a> for a recent interview). </li> </ul> <ul> <li> The White House now has an entire office devoted to running randomized trials to improve government performance (the so-called "nudge unit"). Check out their recent annual report <a href="https://www.whitehouse.gov/sites/default/files/microsites/ostp/sbst_2015_annual_report_final_9_14_15.pdf" target="_blank">here</a>. </li> </ul> <ul> <li> JPAL North America just launched a major initiative to help state and local governments run randomized trials (see <a href="https://www.povertyactionlab.org/about-j-pal/news/j-pal-north-america-state-and-local-innovation-initiative-release" target="_blank">here</a>). </li> </ul> </div> Given the history of medicine, why are randomized trials not used for social policy? 2015-11-17T10:42:24+00:00 http://simplystats.github.io/2015/11/17/why-are-randomized-trials-not-used-by-policymakers <p>Policy changes can have substantial societal effects. For example, clean water and hygiene policies have saved millions, if not billions, of lives. But effects are not always positive. For example, <a href="https://en.wikipedia.org/wiki/Prohibition_in_the_United_States">prohibition</a>, or the “noble experiment”, boosted organized crime, slowed economic growth and increased deaths caused by tainted liquor. Good intentions do not guarantee desirable outcomes.</p> <p>The medical establishment is well aware of the danger of basing decisions on the good intentions of doctors or biomedical researchers. For this reason, randomized controlled trials (RCTs) are the standard approach to determining if a new treatment is safe and effective. In these trials an objective assessment is achieved by assigning patients at random to a treatment or control group, and then comparing the outcomes in these two groups. Probability calculations are used to summarize the evidence in favor or against the new treatment. Modern RCTs are considered <a href="http://abcnews.go.com/Health/TenWays/story?id=3605442&amp;page=1">one of the greatest medical advances of the 20th century</a>.</p> <p>Despite their unprecedented success in medicine, RCTs have not been fully adopted outside of scientific fields. In <a href="http://www.badscience.net/2011/05/we-should-so-blatantly-do-more-randomised-trials-on-policy/">this post</a>, Ben Goldcare advocates for politicians to learn from scientists and base policy decisions on RCTs. He provides several examples in which results contradicted conventional wisdom. In <a href="https://www.ted.com/talks/esther_duflo_social_experiments_to_fight_poverty?language=en">this TED talk</a> Esther Duflo convincingly argues that RCTs should be used to determine what interventions are best at fighting poverty. Although some RCTs are being conducted, they are still rare and oftentimes ignored by policymakers. For example, despite at least <a href="http://peabody.vanderbilt.edu/research/pri/VPKthrough3rd_final_withcover.pdf">two</a> <a href="http://www.acf.hhs.gov/sites/default/files/opre/executive_summary_final.pdf">RCT</a>s finding that universal pre-K programs are not effective, polymakers in New York <a href="http://www.npr.org/sections/ed/2015/09/08/438584249/new-york-city-mayor-goes-all-in-on-free-preschool">are implementing a400 million a year program</a>. Supporters of this noble endeavor defend their decision by pointing to observational studies and “expert” opinion that support their preconceived views. Before the 1950s, indifference to RCTs was common among medical doctors as well, and the outcomes were at times devastating.</p> <p>Today, when we <a href="http://www.ncbi.nlm.nih.gov/pubmed/7058834">compare conclusions from non-RCT studies to RCTs</a>, we note the unintended strong effects that preconceived notions can have. The first chapter in <a href="http://www.amazon.com/Statistics-4th-Edition-David-Freedman/dp/0393929728">this book</a> provides a summary and some examples. One example comes from <a href="http://www.jameslindlibrary.org/grace-nd-muench-h-chalmers-tc-1966/">a study</a> of 51 studies on the effectiveness of the portacaval shunt. Here is table summarizing the conclusions of the 51 studies:</p> <table> <tr> <td> Design </td> <td> Marked Improvement </td> <td> Moderate Improvement </td> <td> None </td> </tr> <tr> <td> No control </td> <td> 24 </td> <td> 7 </td> <td> 1 </td> </tr> <tr> <td> Controls; but no randomized </td> <td> 10 </td> <td> 3 </td> <td> 2 </td> </tr> <tr> <td> Randomized </td> <td> </td> <td> 1 </td> <td> 3 </td> </tr> </table> <p>Compare the first and last column to appreciate the importance of the randomized trials.</p> <p>A particularly troubling example relates to the studies on Diethylstilbestrol (DES). DES is a drug that was used to prevent spontaneous abortions. Five out of five studies using historical controls found the drug to be effective, yet all three randomized trials found the opposite. Before the randomized trials convinced doctors to stop using this drug , it was given to thousands of women. This turned out to be a tragedy as later studies showed DES has <a href="http://diethylstilbestrol.co.uk/des-side-effects/">terrible side effects</a>. Despite the doctors having the best intentions in mind, ignoring the randomized trials resulted in unintended consequences.</p> <p>Well meaning experts are regularly implementing policies without really testing their effects. Although randomized trials are not always possible, it seems that they are rarely considered, in particular when the intentions are noble. <span style="line-height: 1.5;">Just like well-meaning turn-of-the-20th-century doctors, convinced that they were doing good, put their patients at risk by providing ineffective treatments, well intentioned policies may end up hurting society.</span></p> <p><strong>Update: </strong>A reader pointed me to <a href="http://papers.ssrn.com/sol3/papers.cfm?abstract_id=2534811">these</a> <a href="http://eml.berkeley.edu//~crwalters/papers/kline_walters.pdf">preprints</a> which point out that the control group in <a href="http://www.acf.hhs.gov/sites/default/files/opre/executive_summary_final.pdf">one of the cited</a> early education RCTs included children that receive care in a range of different settings, not just staying at home. This implies that the signal is attenuated if what we want to know is if the program is effective for children that would otherwise stay at home. In <a href="http://papers.ssrn.com/sol3/papers.cfm?abstract_id=2534811">this preprint</a> they use statistical methodology (principal stratification framework) to obtain separate estimates: the effect for children that would otherwise go to other center-based care and the effect for children that would otherwise stay at home. They find no effect for the former group but a significant effect for the latter. Note that in this analysis the effect being estimated is no longer based on groups assigned at random. Instead, model assumptions are used to infer the two effects. To avoid dependence on these assumptions we will have to perform an RCT with better defined controls. Also note that the<span style="line-height: 1.5;"> RCT data facilitated the principal stratification framework analysis. I also want to restate what <a href="http://simplystatistics.org/2014/04/17/correlation-does-not-imply-causation-parental-involvement-edition/">I’ve posted before</a>, “I am not saying that observational studies are uninformative. If properly analyzed, observational data can be very valuable. For example, the data supporting smoking as a cause of lung cancer is all observational. Furthermore, there is an entire subfield within statistics (referred to as causal inference) that develops methodologies to deal with observational data. But unfortunately, observational data are commonly misinterpreted.”</span></p> So you are getting crushed on the internet? The new normal for academics. 2015-11-16T09:49:04+00:00 http://simplystats.github.io/2015/11/16/so-you-are-getting-crushed-on-the-internet-the-new-normal-for-academics <p>Roger and I were just talking about all the discussion around the <a href="http://www.pnas.org/content/early/2015/10/29/1518393112.full.pdf">Case and Deaton paper</a> on death rates for middle class people. Andrew Gelman <a href="http://www.slate.com/articles/health_and_science/science/2015/11/death_rates_for_white_middle_aged_americans_are_not_increasing.html">discussed it</a> among many others. They noticed a potential bias in the analysis and did some re-analysis. Just yesterday <a href="http://noahpinionblog.blogspot.com/2015/11/gelman-vs-case-deaton-academics-vs.html">Noah Smith</a> wrote a piece about academics versus blogs and how many academics are taken by surprise when they see their paper being discussed so rapidly on the internet. Much of the debate comes down to the speed, tone, and ferocity of internet discussion of academic work - along with the fact that sometimes it isn’t fully fleshed out.</p> <p>I have been seeing this play out not just in the case of this specific paper, but many times that folks have been confronted with blogs or the quick publication process of <a href="http://f1000research.com/">f1000Research</a>. I think it is pretty scary for folks who aren’t used to “internet speed” to see this play out and I thought it would be helpful to make a few points.</p> <ol> <li><strong>Everyone is an internet scientist now.</strong> The internet has arrived as part of academics and if you publish a paper that is of interest (or if you are a Nobel prize winner, or if you dispute a claim, etc.) you will see discussion of that paper within a day or two on the blogs. This is now a fact of life.</li> <li><strong>The internet loves a fight</strong>. The internet responds best to personal/angry blog posts or blog posts about controversial topics like p-values, errors, and bias. Almost certainly if someone writes a blog post about your work or an f1000 paper it will be about an error/bias/correction or something personal.</li> <li><strong>Takedowns are easier than new research and happen faster</strong>. It is much, much easier to critique a paper than to design an experiment, collect data, figure out what question to ask, ask it quantitatively, analyze the data, and write it up. This doesn’t mean the critique won’t be good/right it just means it will happen much much faster than it took you to publish the paper because it is easier to do. All it takes is noticing one little bug in the code or one error in the regression model. So be prepared for speed in the response.</li> </ol> <p>In light of these three things, you have a couple of options about how to react if you write an interesting paper and people are discussing it - which they will certainly do (point 1), in a way that will likely make you uncomfortable (point 2), and faster than you’d expect (point 3). The first thing to keep in mind is that the internet wants you to “fight back” and wants to declare a “winner”. Reading about amicable disagreements doesn’t build audience. That is why there is reality TV. So there will be pressure for you to score points, be clever, be fast, and refute every point or be declared the loser. I have found from my own experience that is what I feel like doing too. I think that resisting this urge is both (a) very very hard and (b) the right thing to do. I find the best solution is to be proud of your work, but be humble, because no paper is perfect and thats ok. If you do the best you can , sensible people will acknowledge that.</p> <p>I think these are the three ways to respond to rapid internet criticism of your work.</p> <ul> <li><strong>Option 1: Respond on internet time.</strong> This means if you publish a big paper that you think might be controversial  you should block off a day or two to spend time on the internet responding. You should be ready to do new analysis quickly, be prepared to admit mistakes quickly if they exist, and you should be prepared to make it clear when there aren’t. You will need social media accounts and you should probably have a blog so you can post longer form responses. Github/Figshare accounts make it better for quickly sharing quantitative/new analyses. Again your goal is to avoid the personal and stick to facts, so I find that Twitter/Facebook are best for disseminating your more long form responses on blogs/Github/Figshare. If you are going to go this route you should try to respond to as many of the major criticisms as possible, but usually they cluster into one or two specific comments, which you can address all in one.</li> <li><strong>Option2 : Respond in academic time.</strong> You might have spent a year writing a paper to have people respond to it essentially instantaneously. Sometimes they will have good points, but they will rarely have carefully thought out arguments given the internet-speed response (although remember point 3 that good critiques can be faster than good papers). One approach is to collect all the feedback, ignore the pressure for an immediate response, and write a careful, scientific response which you can publish in a journal or in a fast outlet like f1000Research. I think this route can be the most scientific and productive if executed well. But this will be hard because people will treat that like “you didn’t have a good answer so you didn’t respond immediately”. The internet wants a quick winner/loser and that is terrible for science. Even if you choose this route though, you should make sure you have a way of publicizing your well thought out response - through blogs, social media, etc. once it is done.</li> <li><strong>Option 3: Do not respond.</strong> This is what a lot of people do and I’m unsure if it is ok or not. Clearly internet facing commentary can have an impact on you/your work/how it is perceived for better or worse. So if you ignore it, you are ignoring those consequences. This may be ok, but depending on the severity of the criticism may be hard to deal with and it may mean that you have a lot of questions to answer later. Honestly, I think as time goes on if you write a big paper under a lot of scrutiny Option 3 is going to go away.</li> </ul> <p>All of this only applies if you write a paper that a ton of people care about/is controversial. Many technical papers won’t have this issue and if you keep your claims small, this also probably won’t apply. But I thought it was useful to try to work out how to act under this “new normal”.</p> Prediction Markets for Science: What Problem Do They Solve? 2015-11-10T20:29:19+00:00 http://simplystats.github.io/2015/11/10/prediction-markets-for-science-what-problem-do-they-solve <p>I’ve recently seen a bunch of press on <a href="http://www.pnas.org/content/early/2015/11/04/1516179112.abstract">this paper</a>, which describes an experiment with developing a prediction market for scientific results. From FiveThirtyEight:</p> <blockquote> <p>Although <a href="http://fivethirtyeight.com/datalab/psychology-is-starting-to-deal-with-its-replication-problem/">replication is essential for verifying results</a>, the <a href="http://fivethirtyeight.com/features/science-isnt-broken/">current scientific culture does little to encourage it in most fields</a>. That’s a problem because it means that misleading scientific results, like those from the “shades of gray” study, <a href="http://pss.sagepub.com/content/22/11/1359.short?rss=1&amp;ssource=mfr">could be common in the scientific literature</a>. Indeed, a 2005 study claimed that <a href="http://journals.plos.org/plosmedicine/article?id=10.1371/journal.pmed.0020124">most published research findings are false.</a></p> <p>[…]</p> <p>The researchers began by selecting some studies slated for replication in the <a href="https://osf.io/ezcuj/wiki/home/">Reproducibility Project: Psychology</a> — a project that aimed to reproduce 100 studies published in three high-profile psychology journals in 2008. They then recruited psychology researchers to take part in <a href="https://osf.io/yjmht/">two prediction markets</a>. These are the same types of markets that people use <a href="http://www.nytimes.com/2015/10/24/upshot/betting-markets-call-marco-rubio-front-runner-in-gop.html?_r=0">to bet on who’s going to be president</a>. In this case, though, researchers were betting on whether a study would replicate or not.</p> </blockquote> <p>There are all kinds of prediction markets these days–for politics, general ideas–so having one for scientific ideas is not too controversial. But I’m not sure I see exactly what problem is solved by having a prediction market for science. In the paper, they claim that the market-based bets were better predictors of the general survey that was administrated to the scientists. I’ll admit that’s an interesting result, but I’m not yet convinced.</p> <p>First off, it’s worth noting that this work comes out of the massive replication project conducted by the Center for Open Science, where I believe they <a href="http://simplystatistics.org/2015/10/01/a-glass-half-full-interpretation-of-the-replicability-of-psychological-science/">have a</a> <a href="http://simplystatistics.org/2015/10/20/we-need-a-statistically-rigorous-and-scientifically-meaningful-definition-of-replication/">fundamentally flawed definition of replication</a>. So I’m not sure I can really agree with the idea of basing a prediction market on such a definition, but I’ll let that go for now.</p> <p>The purpose of most markets is some general notion of “price discovery”. One popular market is the stock market and I think it’s instructive to see how that works. Basically, people continuously bid on the shares of certain companies and markets keep track of all the bids/offers and the completed transactions. If you are interested in finding out what people are willing to pay for a share of Apple, Inc., then it’s probably best to look at…what people are willing to pay. That’s exactly what the stock market gives you. You only run into trouble when there’s no liquidity, so no one shows up to bid/offer, but that would be a problem for any market.</p> <p>Now, suppose you’re interested in finding out what the “true fundamental value” of Apple, Inc. Some people think the stock market gives you that at every instance, while <a href="http://www.econ.yale.edu/~shiller/">others</a> think that the stock market can behave irrationally for long periods of time. Perhaps in the very long run, you get a sense of the fundamental value of a company, but that may not be useful information at that point.</p> <p>What does the market for scientific hypotheses give you? Well, it would be one thing if granting agencies participated in the market. Then, we would never have to write grant applications. The granting agencies could then signal what they’d be willing to pay for different ideas. But that’s not what we’re talking about.</p> <p>Here, we’re trying to get at whether a given hypothesis is <em>true or not</em>. The only real way to get information about that is to conduct an experiment. How many people betting in the markets will have conducted an experiment? Likely the minority, given that the whole point is to save money by not having people conduct experiments investigating hypotheses that are likely false.</p> <p>But if market participants aren’t contributing real information about an hypothesis, what are they contributing? Well, they’re contributing their <em>opinion</em> about an hypothesis. How is that related to science? I’m not sure. Of course, participants could be experts in the field (although not necessarily) and so their opinions will be informed by past results. And ultimately, it’s consensus amongst scientists that determines, after repeated experiments, whether an hypothesis is true or not. But at the early stages of investigation, it’s not clear how valuable people’s opinions are.</p> <p>In a way, this reminds me of a time a while back when the EPA was soliciting “expert opinion” about the health effects of outdoor air pollution, as if that were a reasonable substitute for collecting actual data on the topic. At least it cost less money–just the price of a conference call.</p> <p>There’s a version of this playing out in the health tech market right now. Companies like <a href="http://simplystatistics.org/2015/10/28/discussion-of-the-theranos-controversy-with-elizabeth-matsui/">Theranos</a> and 23andMe are selling health products that they claim are better than some current benchmark. In particular, Theranos claims its blood tests are accurate when only using a tiny sample of blood. Is this claim true or not? No one outside Theranos knows for sure, but we can look to the financial markets.</p> <p>Theranos can point to the marketplace and show that people are willing to pay for its products. Indeed, the 9 billion valuation of the private company is another indicator that people…highly value the company. But ultimately, <em>we still don’t know if their blood tests are accurate</em> because we don’t have any data. If we were to go by the financial markets alone, we would necessarily conclude that their tests are good, because why else would anyone invest so much money in the company?</p> <p>I think there may be a role to play for prediction markets in science, but I’m not sure discovering the truth about nature is one of them.</p> Biostatistics: It's not what you think it is 2015-11-09T10:00:20+00:00 http://simplystats.github.io/2015/11/09/biostatistics-its-not-what-you-think-it-is <p><a href="http://www.hsph.harvard.edu/biostatistics">My department</a> recently sent me on a recruitment trip for our graduate program. I had the opportunity to chat with undergrads interested in pursuing a career related to data analysis. I found that several did not know about the existence of Departments of <em>Biostatistics</em> and most of the rest thought <em>Biostatistics</em> was the study of clinical trials. We <a href="http://simplystatistics.org/2012/08/14/statistics-statisticians-need-better-marketing/">have</a> <a href="http://simplystatistics.org/2011/11/02/we-need-better-marketing/">posted</a> on the need for better marketing for Statistics, but Biostatistics needs it even more. So this post is for students considering a career as applied statisticians or data science and are considering PhD programs.</p> <p>There are dozens of Biostatistics departments and most run PhD programs. As an undergraduate, you may have never heard of it because they are usually in schools that undergrads don’t regularly frequent: Public Health and Medicine. However, they are very active in research and teaching graduate students. In fact, the 2014 US News &amp; World Report <a href="http://US News and R">ranking of Statistics Departments</a> includes three Biostat departments in the top five spots. Although clinical trials are a popular area of interest in these departments, there are now many other areas of research. With so many fields of science shifting to data intensive research, Biostatistics has adapted to work in these areas. Today pretty much any Biostat department will have people working on projects related to genetics, genomics, computational biology, electronic medical records, neuroscience, environmental sciences, and epidemiology, health-risk analysis, and clinical decision making. Through collaborations, academic biostatisticians have early access to the cutting edge datasets produced by public health scientists and biomedical researchers. Our research usually revolves in either developing statistical methods that are used by researchers working in these fields or working directly with a collaborator in data-driven discovery.</p> <p><strong>How is it different from Statistics? </strong>In the grand scheme of things, they are not very different. As implied by the name, Biostatisticians focus on data related to biology while statisticians tend to be more general. However, the underlying theory and skills we learn are similar. In my view, the major difference is that Biostatisticians, in general, tend to be more interested in data and the subject matter, while in Statistics Departments more emphasis is given to the mathematical theory.</p> <p><strong>What type of job can I get with a Phd In Biostatistics? </strong><a href="http://fortune.com/2015/04/27/best-worst-graduate-degrees-jobs/">A well paying one</a>. And you will have many options to chose from. Our graduates tend to go to academia, industry or government. Also, the <strong>Bio </strong>in the name does not keep our graduates for landing non-bio related jobs, such as in high tech. The reason for this is that the training our students receive and the what they learn from research experiences can be widely applied to data analysis challenges.</p> <p><strong>How should I prepare if I want to apply to a PhD program?</strong> First you need to decide if you are going to like it. One way to do this is to participate in one of the <a href="http://www.nhlbi.nih.gov/research/training/summer-institute-biostatistics-t15">summer programs</a> where you get a glimpse of what we do. My department runs <a href="http://www.hsph.harvard.edu/biostatistics/diversity/summer-program/">one of these as well</a>. However, as an undergrad I would mainly focus on courses. Undergraduate research experiences are a good way to get an idea of what it’s like, but it is difficult to do real research unless you can set aside several hours a week for several consecutive months. This is difficult as an undergrad because you have to make sure to do well in your courses, prepare for the GRE, and get a solid mathematical and computing foundation in order to conduct research later. This is why these programs are usually in the summer. If you decide to apply to a PhD program, I recommend you take advanced math courses such as Real Analysis and Matrix Algebra. If you plan to develop software for complex datasets, I recommend CS courses that cover algorithms and optimization. Note that programming skills are not the same thing as the theory taught in these CS courses. Programming skills in R will serve you well if you plan to analyze data regardless of what academic route you follow. Python and a low-level language such as C++ are more powerful languages that many biostatisticians use these days.</p> <p>I think the demand for well-trained researchers that can make sense of data will continue to be on the rise. If you want a fulfilling job where you analyze data for a living, you should consider a PhD in Biostatistics.</p> Not So Standard Deviations: Episode 4 - A Gajillion Time Series 2015-11-07T11:46:49+00:00 http://simplystats.github.io/2015/11/07/not-so-standard-deviations-episode-4-a-gajillion-time-series <p>Episode 4 of Not So Standard Deviations is hot off the audio editor. In this episode Hilary first explains to me what heck is DevOps and then we talk about the statistical challenges in detecting rare events in an enormous set of time series data. There’s also some discussion of Ben and Jerry’s and the t-test, so you’ll want to hang on for that.</p> <p>Notes:</p> <ul> <li><a href="https://goo.gl/259VKI">Nobody Loves Graphite Anymore</a></li> <li><a href="http://goo.gl/zB7wM9">A response</a></li> <li><a href="https://goo.gl/7PgLKY">Why Gosset is awesome</a></li> </ul> <p> </p> How I decide when to trust an R package 2015-11-06T13:41:02+00:00 http://simplystats.github.io/2015/11/06/how-i-decide-when-to-trust-an-r-package <p>One thing that I’ve given a lot of thought to recently is the process that I use to decide whether I trust an R package or not. Kasper Hansen took a break from <a href="https://twitter.com/KasperDHansen/status/657589509975076864">trolling me</a> <a href="https://twitter.com/KasperDHansen/status/621315346633519104">on Twitter</a> to talk about how he trusts packages on Github less than packages that are on CRAN and particularly Bioconductor. A couple of points he makes that I think are very relevant. First, that having a package on CRAN/Bioconductor raises trust in that package:</p> <blockquote class="twitter-tweet" width="550"> <p lang="en" dir="ltr"> .<a href="https://twitter.com/michaelhoffman">@michaelhoffman</a> But it's not on Bioconductor or CRAN. This decreases trust substantially. </p> <p> &mdash; Kasper Daniel Hansen (@KasperDHansen) <a href="https://twitter.com/KasperDHansen/status/659777449098637312">October 29, 2015</a> </p> </blockquote> <p>The primary reason is because Bioc/CRAN demonstrate something about the developer’s willingness to do the boring but critically important parts of package development like documentation, vignettes, minimum coding standards, and being sure that their code isn’t just a rehash of something else. The other big point Kasper made was the difference between a repository - which is user oriented and should provide certain guarantees and Github - which is a developer platform and makes things easier/better for developers but doesn’t have a user guarantee system in place.</p> <blockquote class="twitter-tweet" width="550"> <p lang="en" dir="ltr"> .<a href="https://twitter.com/StrictlyStat">@StrictlyStat</a> CRAN is a repository, not a development platform. It is user oriented, not developer oriented. GH is the reverse. </p> <p> &mdash; Kasper Daniel Hansen (@KasperDHansen) <a href="https://twitter.com/KasperDHansen/status/661746848437243904">November 4, 2015</a> </p> </blockquote> <p>This discussion got me thinking about when/how I depend on R packages and how I make that decision. The scenarios where I depend on R packages are:</p> <ol> <li>Quick and dirty analyses for myself</li> <li>Shareable data analyses that I hope are reproducible</li> <li>As dependencies of R packages I maintain</li> </ol> <p>As you move from 1-3 it is more and more of a pain if the package I’m depending on breaks. If it is just something I was doing for fun, its not that big of a deal. But if it means I have to rewrite/recheck/rerelease my R package than that is a much bigger headache.</p> <p>So my scale for how stringent I am about relying on packages varies by the type of activity, but what are the criteria I use to measure how trustworthy a package is? For me, the criteria are in this order:</p> <ol> <li><strong>People prior </strong></li> <li><strong>Forced competence</strong></li> <li><strong>Indirect data</strong></li> </ol> <p>I’ll explain each criteria in a minute, but the main purpose of using these criteria is (a) to ensure that I’m using a package that works and (b) to ensure that if the package breaks I can trust it will be fixed or at least I can get some help from the developer.</p> <p><strong>People prior</strong></p> <p>The first thing I do when I look at a package I might depend on is look at who the developer is. If that person is someone I know has developed widely used, reliable software and who quickly responds to requests/feedback then I immediately trust the package. I have a list of people like <a href="https://en.wikipedia.org/wiki/Brian_D._Ripley">Brian</a>, or <a href="https://github.com/hadley">Hadley,</a> or <a href="https://github.com/jennybc">Jenny</a>, or <a href="http://rafalab.dfci.harvard.edu/index.php/software-and-data">Rafa</a>, who could post their package just as a link to their website and I would trust it. It turns out almost all of these folks end up putting their packages on CRAN/Bioconductor anyway. But even if they didn’t I assume that the reason is either (a) the package is very new or (b) they have a really good reason for not distributing it through the normal channels.</p> <p><strong>Forced competence</strong></p> <p>For people who I don’t know about or whose software I’ve never used, then I have very little confidence in the package a priori. This is because there are a ton of people developing R packages now with highly variable levels of commitment to making them work. So as a placeholder for all the variables I don’t know about them, I use the repository they choose as a surrogate. My personal prior on the trustworthiness of a package from someone I don’t know goes something like:</p> <p><a href="http://simplystatistics.org/wp-content/uploads/2015/11/Screen-Shot-2015-11-06-at-1.25.01-PM.png"><img class="aligncenter wp-image-4410 size-full" src="http://simplystatistics.org/wp-content/uploads/2015/11/Screen-Shot-2015-11-06-at-1.25.01-PM.png" alt="Screen Shot 2015-11-06 at 1.25.01 PM" width="843" height="197" srcset="http://simplystatistics.org/wp-content/uploads/2015/11/Screen-Shot-2015-11-06-at-1.25.01-PM-300x70.png 300w, http://simplystatistics.org/wp-content/uploads/2015/11/Screen-Shot-2015-11-06-at-1.25.01-PM-260x61.png 260w, http://simplystatistics.org/wp-content/uploads/2015/11/Screen-Shot-2015-11-06-at-1.25.01-PM.png 843w" sizes="(max-width: 843px) 100vw, 843px" /></a></p> <p>This prior is based on the idea of forced competence. In general, you have to do more to get a package approved on Bioconductor than on CRAN (for example you have to have a good vignette) and you have to do more to get a package on CRAN (pass R CMD CHECK and survive the review process) than to put it on Github.</p> <p>This prior isn’t perfect, but it does tell me something about how much the person cares about their package. If they go to the work of getting it on CRAN/Bioc, then at least they cared enough to document it. They are at least forced to be minimally competent - at least at the time of submission and enough for the packages to still pass checks.</p> <p><strong>Indirect data</strong></p> <p>After I’ve applied my priors I then typically look at the data. For Bioconductor I look at the badges, like how downloaded it is, whether it passes the checks, and how well it is covered by tests. I’m already inclined to trust it a bit since it is on that platform, but I use the data to adjust my prior a bit. For CRAN I might look at the <a href="http://cran-logs.rstudio.com/">download stats</a> provided by Rstudio. The interesting thing is that as John Muschelli points out, Github actually has the most indirect data available for a package:</p> <blockquote class="twitter-tweet" width="550"> <p lang="en" dir="ltr"> .<a href="https://twitter.com/KasperDHansen">@KasperDHansen</a> Flipside: CRAN has no issue pages, stars/ratings, outdated limits on size, and limited development cycle/turnover. </p> <p> &mdash; John Muschelli (@StrictlyStat) <a href="https://twitter.com/StrictlyStat/status/661746348409114624">November 4, 2015</a> </p> </blockquote> <p>If I’m going to use a package that is on Github from a person who isn’t on my prior list of people to trust then I look at a few things. The number of stars/forks/watchers is one thing that is a quick and dirty estimate of how used a package is. I also look very carefully at how many commits the person has submitted to both the package in question and in general all other packages over the last couple of months. If the person isn’t actively developing either the package or anything else on Github, that is a bad sign. I also look to see how quickly they have responded to issues/bug reports on the package in the past if possible. One idea I haven’t used but I think is a good one is to submit an issue for a trivial change to the package and see if I get a response very quickly. Finally I look and see if they have some demonstration their package works across platforms (say with a <a href="https://travis-ci.org/">travis badge</a>). If the package is highly starred, frequently maintained, all issues are responded to and up-to-date, and passes checks on all platform then that data might overwhelm my prior and I’d go ahead and trust the package.</p> <p><strong>Summary</strong></p> <p>In general one of the best things about the R ecosystem is being able to rely on other packages so that you don’t have to write everything from scratch. But there is a hard balance to strike with keeping the dependency list small. One way I maintain this balance is using the strategy I’ve outlined to worry less about trustworthy dependencies.</p> The Statistics Identity Crisis: Am I a Data Scientist 2015-10-30T14:21:08+00:00 http://simplystats.github.io/2015/10/30/the-statistics-identity-crisis-am-i-a-data-scientist <p>The joint ASA/Simply Statistics webinar on the statistics identity crisis is now live!</p> Faculty/postdoc job opportunities in genomics across Johns Hopkins 2015-10-30T10:33:06+00:00 http://simplystats.github.io/2015/10/30/facultypostdoc-job-opportunities-in-genomics-across-johns-hopkins <p>It’s pretty exciting to be in genomics at Hopkins right now with three new Bloomberg professors in genomics areas, a ton of stellar junior faculty, and a really fun group of students/postdocs. If you want to get in on the action here is a non-comprehensive list of great opportunities.</p> <h2 id="span-styletext-decoration-underlinestrongfaculty-jobsstrongspan"><span style="text-decoration: underline;"><strong>Faculty Jobs</strong></span></h2> <p><strong>Job: </strong>Multiple tenure track faculty positions in all areas including in genomics</p> <p><strong>Department: </strong> Biostatistics</p> <p><strong>To apply</strong>: <a href="http://www.jhsph.edu/departments/biostatistics/_docs/faculty-ad-2016-combined-large-final.pdf">http://www.jhsph.edu/departments/biostatistics/_docs/faculty-ad-2016-combined-large-final.pdf</a></p> <p><strong>Deadline:</strong> Review ongoing</p> <p><strong>Job:</strong> Tenure track position in data intensive biology</p> <p><strong>Department: </strong> Biology</p> <p><strong>To apply</strong>: <a href="http://apply.interfolio.com/31146">http://apply.interfolio.com/31146</a></p> <p><strong>Deadline: </strong>Nov 1st and ongoing</p> <p><strong>Job:</strong> Tenure track positions in bioinformatics, with focus on proteomics or sequencing data analysis</p> <p><strong>Department: </strong> Oncology Biostatistics</p> <p><strong>To apply</strong>: <a href="https://www.research-it.onc.jhmi.edu/DBB/PhD_Statistician.pdf">https://www.research-it.onc.jhmi.edu/DBB/PhD_Statistician.pdf</a></p> <p><strong>Deadline:</strong> Review ongoing</p> <p> </p> <h2 id="span-styletext-decoration-underlinestrongpostdoc-jobsstrongspan"><span style="text-decoration: underline;"><strong>Postdoc Jobs</strong></span></h2> <p><strong>Job:</strong> Postdoc(s) in statistical methods/software development for RNA-seq</p> <p><strong>Employer: </strong> Jeff Leek</p> <p><strong>To apply</strong>: email Jeff (<a href="http://jtleek.com/jobs/">http://jtleek.com/jobs/</a>)</p> <p><strong>Deadline:</strong> Review ongoing</p> <p><strong>Job:</strong> Data scientist for integrative genomics in the human brain (MS/PhD)</p> <p><strong>Employer: </strong> Andrew Jaffe</p> <p><strong>To apply</strong>: email Andrew (<a href="http://www.aejaffe.com/jobs.html">http://www.aejaffe.com/jobs.html</a>)</p> <p><strong>Deadline:</strong> Review ongoing</p> <p><strong>Job:</strong> Research associate for genomic data processing and analysis (BA+)</p> <p><strong>Employer: </strong> Andrew Jaffe</p> <p><strong>To apply</strong>: email Andrew (<a href="http://www.aejaffe.com/jobs.html">http://www.aejaffe.com/jobs.html</a>)</p> <p><strong>Deadline:</strong> Review ongoing</p> <p><strong>Job:</strong> PhD developing scalable software and algorithms for analyzing sequencing data</p> <p><strong>Employer: </strong> Ben Langmead</p> <p><strong>To apply</strong>: http://www.cs.jhu.edu/graduate-studies/phd-program/</p> <p><strong>Deadline:</strong> See site</p> <p><strong>Job:</strong> Postdoctoral researcher developing scalable software and algorithms for analyzing sequencing data</p> <p><strong>Employer: </strong> Ben Langmead</p> <p><strong>To apply</strong>: email Ben (<a href="http://www.langmead-lab.org/open-positions/">http://www.langmead-lab.org/open-positions/</a>)</p> <p><strong>Deadline:</strong> Review ongoing</p> <p><strong>Job:</strong> Postdoctoral researcher developing algorithms for challenging problems in large-scale genomics whole-genome assenbly, RNA-seq analysis, and microbiome analysis</p> <p><strong>Employer: </strong> Steven Salzberg</p> <p><strong>To apply</strong>: email Steven (<a href="http://salzberg-lab.org/">http://salzberg-lab.org/</a>)</p> <p><strong>Deadline:</strong> Review ongoing</p> <p><strong>Job:</strong> Research associate for genomic data processing and analysis (BA+) in cancer</p> <p><strong>Employer: </strong> Luigi Marchionni (with Don Geman)</p> <p><strong>To apply</strong>: email Luigi (<a href="http://luigimarchionni.org/">http://luigimarchionni.org/</a>)</p> <p><strong>Deadline:</strong> Review ongoing</p> <p><strong>Job: </strong>Postdoctoral researcher developing algorithms for biomarkers development and precision medicine application in cancer</p> <p><strong>Employer: </strong> Luigi Marchionni (with Don Geman)</p> <p><strong>To apply</strong>: email Luigi (<a href="http://luigimarchionni.org/">http://luigimarchionni.org/</a>)</p> <p><strong>Deadline:</strong> Review ongoing</p> <p><strong>Job:</strong>Postdoctoral researcher developing methods in machine learning, genomics, and regulatory variation</p> <p><strong>Employer: </strong> Alexis Battle</p> <p><strong>To apply</strong>: email Alexis (<a href="http://battlelab.jhu.edu/join_us.html">http://battlelab.jhu.edu/join_us.html</a>)</p> <p><strong>Deadline:</strong> Review ongoing</p> <p><strong>Job: </strong>Postdoctoral fellow with interests in biomarker discovery for Alzheimer’s disease</p> <p><strong>Employer: </strong> Madhav Thambisetty / Ingo Ruczinski</p> <p><strong>To apply</strong>: <a href="http://www.alzforum.org/jobs/postdoctoral-research-fellow-alzheimers-disease-biomarkers"> http://www.alzforum.org/jobs/postdoctoral-research-fellow-alzheimers-disease-biomarkers</a></p> <p><strong>Deadline:</strong> Review ongoing</p> <p><strong>Job: </strong>Postdoctoral positions for research in the interface of statistical genetics, precision medicine and big data</p> <p><strong>Employer: </strong> Nilanjan Chatterjee</p> <p><strong>To apply</strong>: <a href="http://www.jhsph.edu/departments/biostatistics/_docs/postdoc-ad-chatterjee.pdf">http://www.jhsph.edu/departments/biostatistics/_docs/postdoc-ad-chatterjee.pdf</a></p> <p><strong>Deadline:</strong> Review ongoing</p> <p><strong>Job: </strong>Postdoctoral research developing algorithms and software for time course pattern detection in genomics data</p> <p><strong>Employer: </strong> Elana Fertig</p> <p><strong>To apply</strong>: email Elana (ejfertig@jhmi.edu)</p> <p><strong>Deadline:</strong> Review ongoing</p> <p><strong>Job: </strong>Postdoctoral fellow to develop novel methods for large-scale DNA and RNA sequence analysis related to human and/or plant genetics, such as developing methods for discovering structural variations in cancer or for assembling and analyzing large complex plant genomes.</p> <p><strong>Employer: </strong> Mike Schatz</p> <p><strong>To apply</strong>: email Mike (<a href="http://schatzlab.cshl.edu/apply/">http://schatzlab.cshl.edu/apply/</a>)</p> <p><strong>Deadline:</strong> Review ongoing</p> <h2 id="span-styletext-decoration-underlinestrongstudentsstrongspan"><span style="text-decoration: underline;"><strong>Students</strong></span></h2> <p>We are all always on the hunt for good Ph.D. students. At Hopkins students are admitted to specific departments. So if you find a faculty member you want to work with, you can apply to their department. Here are the application details for the various departments admitting students to work on genomics:<a href="https://ccb.jhu.edu/students.shtml"> https://ccb.jhu.edu/students.shtml</a></p> <p> </p> <p> </p> <p> </p> The statistics identity crisis: am I really a data scientist? 2015-10-29T13:32:13+00:00 http://simplystats.github.io/2015/10/29/the-statistics-identity-crisis-am-i-really-a-data-scientist <p> </p> <p> </p> <p> </p> <p><a href="http://simplystatistics.org/wp-content/uploads/2015/10/crisis.png"><img class="aligncenter wp-image-4397" src="http://simplystatistics.org/wp-content/uploads/2015/10/crisis-300x75.png" alt="crisis" width="508" height="127" srcset="http://simplystatistics.org/wp-content/uploads/2015/10/crisis-300x75.png 300w, http://simplystatistics.org/wp-content/uploads/2015/10/crisis-260x65.png 260w, http://simplystatistics.org/wp-content/uploads/2015/10/crisis.png 720w" sizes="(max-width: 508px) 100vw, 508px" /></a></p> <p> </p> <p><em>Tl;dr: We will host a Google Hangout of our popular JSM session October 30th 2-4 PM EST. </em></p> <p> </p> <p>I organized a session at JSM 2015 called <em>“The statistics identity crisis: am I really a data scientist?”</em> The session turned out to be pretty popular:</p> <blockquote class="twitter-tweet" width="550"> <p lang="en" dir="ltr"> Packed room of statisticians with identity crises at <a href="https://twitter.com/hashtag/JSM2015?src=hash">#JSM2015</a> session: are we really data scientists? <a href="http://t.co/eLsGosoTCt">pic.twitter.com/eLsGosoTCt</a> </p> <p> &mdash; Dr Ruth Etzioni (@retzioni) <a href="https://twitter.com/retzioni/status/631134032357502978">August 11, 2015</a> </p> </blockquote> <p>but it turns out not everyone fit in the room:</p> <blockquote class="twitter-tweet" width="550"> <p lang="en" dir="ltr"> This is the closest I can get to <a href="https://twitter.com/statpumpkin">@statpumpkin</a>'s talk. <a href="https://twitter.com/hashtag/jsm2015?src=hash">#jsm2015</a> still had no clue how to predict session attendance. <a href="http://t.co/gTb4OqdAo3">pic.twitter.com/gTb4OqdAo3</a> </p> <p> &mdash; sandy griffith (@sgrifter) <a href="https://twitter.com/sgrifter/status/631134590229442560">August 11, 2015</a> </p> </blockquote> <p>Thankfully, Steve Pierson at the ASA had the awesome idea to re-run the session for people who couldn’t be there. So we will be hosting a Google Hangout with the following talks:</p> <table width="100%" cellspacing="0" cellpadding="4" bgcolor="white"> <tr> <td align="right" valign="top" width="110"> </td> <td> <a href="https://www.amstat.org/meetings/jsm/2015/onlineprogram/AbstractDetails.cfm?abstractid=314339">'Am I a Data Scientist?': The Applied Statistics Student's Identity Crisis</a> — <b>Alyssa Frazee, Stripe</b> </td> </tr> <tr> <td align="right" valign="top" width="110"> </td> <td> <a href="https://www.amstat.org/meetings/jsm/2015/onlineprogram/AbstractDetails.cfm?abstractid=314376">How Industry Views Data Science Education in Statistics Departments</a> — <b>Chris Volinsky, AT&amp;T</b> </td> </tr> <tr> <td align="right" valign="top" width="110"> </td> <td> <a href="https://www.amstat.org/meetings/jsm/2015/onlineprogram/AbstractDetails.cfm?abstractid=314414">Evaluating Data Science Contributions in Teaching and Research</a> — <b>Lance Waller, Emory University</b> </td> </tr> <tr> <td align="right" valign="top" width="110"> </td> <td> <a href="https://www.amstat.org/meetings/jsm/2015/onlineprogram/AbstractDetails.cfm?abstractid=314641">Teach Data Science and They Will Come</a> — <b>Jennifer Bryan, The University of British Columbia</b> </td> </tr> </table> <p>You can watch it on Youtube or Google Plus. Here is the link:</p> <p>https://plus.google.com/events/chuviltukohj2inbqueap9h7228</p> <p>The session will be held October 30th (tomorrow!) from 2-4PM EST. You can watch it live and discuss the talks using the hashtag [ </p> <p> </p> <p> </p> <p><a href="http://simplystatistics.org/wp-content/uploads/2015/10/crisis.png"><img class="aligncenter wp-image-4397" src="http://simplystatistics.org/wp-content/uploads/2015/10/crisis-300x75.png" alt="crisis" width="508" height="127" srcset="http://simplystatistics.org/wp-content/uploads/2015/10/crisis-300x75.png 300w, http://simplystatistics.org/wp-content/uploads/2015/10/crisis-260x65.png 260w, http://simplystatistics.org/wp-content/uploads/2015/10/crisis.png 720w" sizes="(max-width: 508px) 100vw, 508px" /></a></p> <p> </p> <p><em>Tl;dr: We will host a Google Hangout of our popular JSM session October 30th 2-4 PM EST. </em></p> <p> </p> <p>I organized a session at JSM 2015 called <em>“The statistics identity crisis: am I really a data scientist?”</em> The session turned out to be pretty popular:</p> <blockquote class="twitter-tweet" width="550"> <p lang="en" dir="ltr"> Packed room of statisticians with identity crises at <a href="https://twitter.com/hashtag/JSM2015?src=hash">#JSM2015</a> session: are we really data scientists? <a href="http://t.co/eLsGosoTCt">pic.twitter.com/eLsGosoTCt</a> </p> <p> &mdash; Dr Ruth Etzioni (@retzioni) <a href="https://twitter.com/retzioni/status/631134032357502978">August 11, 2015</a> </p> </blockquote> <p>but it turns out not everyone fit in the room:</p> <blockquote class="twitter-tweet" width="550"> <p lang="en" dir="ltr"> This is the closest I can get to <a href="https://twitter.com/statpumpkin">@statpumpkin</a>'s talk. <a href="https://twitter.com/hashtag/jsm2015?src=hash">#jsm2015</a> still had no clue how to predict session attendance. <a href="http://t.co/gTb4OqdAo3">pic.twitter.com/gTb4OqdAo3</a> </p> <p> &mdash; sandy griffith (@sgrifter) <a href="https://twitter.com/sgrifter/status/631134590229442560">August 11, 2015</a> </p> </blockquote> <p>Thankfully, Steve Pierson at the ASA had the awesome idea to re-run the session for people who couldn’t be there. So we will be hosting a Google Hangout with the following talks:</p> <table width="100%" cellspacing="0" cellpadding="4" bgcolor="white"> <tr> <td align="right" valign="top" width="110"> </td> <td> <a href="https://www.amstat.org/meetings/jsm/2015/onlineprogram/AbstractDetails.cfm?abstractid=314339">'Am I a Data Scientist?': The Applied Statistics Student's Identity Crisis</a> — <b>Alyssa Frazee, Stripe</b> </td> </tr> <tr> <td align="right" valign="top" width="110"> </td> <td> <a href="https://www.amstat.org/meetings/jsm/2015/onlineprogram/AbstractDetails.cfm?abstractid=314376">How Industry Views Data Science Education in Statistics Departments</a> — <b>Chris Volinsky, AT&amp;T</b> </td> </tr> <tr> <td align="right" valign="top" width="110"> </td> <td> <a href="https://www.amstat.org/meetings/jsm/2015/onlineprogram/AbstractDetails.cfm?abstractid=314414">Evaluating Data Science Contributions in Teaching and Research</a> — <b>Lance Waller, Emory University</b> </td> </tr> <tr> <td align="right" valign="top" width="110"> </td> <td> <a href="https://www.amstat.org/meetings/jsm/2015/onlineprogram/AbstractDetails.cfm?abstractid=314641">Teach Data Science and They Will Come</a> — <b>Jennifer Bryan, The University of British Columbia</b> </td> </tr> </table> <p>You can watch it on Youtube or Google Plus. Here is the link:</p> <p>https://plus.google.com/events/chuviltukohj2inbqueap9h7228</p> <p>The session will be held October 30th (tomorrow!) from 2-4PM EST. You can watch it live and discuss the talks using the hashtag](https://twitter.com/search?q=%23jsm2015) or you can watch later as the video will remain on Youtube.</p> Discussion of the Theranos Controversy with Elizabeth Matsui 2015-10-28T14:54:50+00:00 http://simplystats.github.io/2015/10/28/discussion-of-the-theranos-controversy-with-elizabeth-matsui <p>Theranos is a Silicon Valley diagnostic testing company that has been in the news recently. The story of Theranos has fascinated me because I think it represents a perfect collision of the tech startup culture and the health care culture and how combining them together can generate unique problems.</p> <p>I talked with Elizabeth Matsui, a Professor of Pediatrics in the Division of Allergy and Immunology here at Johns Hopkins, to discuss Theranos, the realities of diagnostic testing, and the unique challenges that a health-tech startup faces with respect to doing good science and building products people want to buy.</p> <p>Notes:</p> <ul> <li>Original <a href="http://www.wsj.com/articles/theranos-has-struggled-with-blood-tests-1444881901">Wall Street Journal story</a> on Theranos (paywalled)</li> <li>Related stories in <a href="http://www.wired.com/2015/10/theranos-scandal-exposes-the-problem-with-techs-hype-cycle/">Wired</a> and NYT’s <a href="http://www.nytimes.com/2015/10/28/business/dealbook/theranos-under-fire.html">Dealbook</a> (not paywalled)</li> <li>Theranos <a href="https://www.theranos.com/news/posts/custom/theranos-facts">response</a> to WSJ story</li> </ul> <iframe width="100%" height="166" scrolling="no" frameborder="no" src="https://w.soundcloud.com/player/?url=https%3A//api.soundcloud.com/tracks/230510705%3Fsecret_token%3Ds-WbZX8&amp;color=ff5500&amp;auto_play=false&amp;hide_related=false&amp;show_comments=true&amp;show_user=true&amp;show_reposts=false"></iframe> Not So Standard Deviations: Episode 3 - Gilmore Girls 2015-10-24T23:17:18+00:00 http://simplystats.github.io/2015/10/24/not-so-standard-deviations-episode-3-gilmore-girls <p>I just uploaded Episode 3 of <a href="https://soundcloud.com/nssd-podcast">Not So Standard Deviations</a> so check your feeds. In this episode Hilary and I talk about our jobs and the life of the data scientist in both academia and the tech industry. It turns out that they’re not as different as I would have thought.</p> <p><a href="https://api.soundcloud.com/tracks/229957578/download?client_id=02gUJC0hH2ct1EGOcYXQIzRFU91c72Ea&amp;oauth_token=1-138878-174789515-deb24181d01af">Download the audio file for this episode</a>.</p> We need a statistically rigorous and scientifically meaningful definition of replication 2015-10-20T10:05:22+00:00 http://simplystats.github.io/2015/10/20/we-need-a-statistically-rigorous-and-scientifically-meaningful-definition-of-replication <p>Replication and confirmation are indispensable concepts that help define scientific facts. However, the way in which we reach scientific consensus on a given finding is rather complex. Although <a href="http://simplystatistics.org/2015/06/24/how-public-relations-and-the-media-are-distorting-science/">some press releases try to convince us otherwise</a>, rarely is one publication enough. In fact, most published results go unnoticed and no attempts to replicate them are made. These are not debunked either; they simply get discarded to the dustbin of history. The very few results that garner enough attention for others to spend time and energy on them are assessed by an ad-hoc process involving a community of peers. The assessments are usually a combination of deductive reasoning, direct attempts at replication, and indirect checks obtained by attempting to build on the result in question. This process eventually leads to a result either being accepted by consensus or not. For particularly important cases, an official scientific consensus report may be commissioned by a national academy or an established scientific society. Examples of results that have become part of the scientific consensus in this way include smoking causing lung cancer, HIV causing AIDS, and climate change being caused by humans. In contrast, the published result that vaccines cause autism has been thoroughly debunked by several follow up studies. In none of these four cases a simple definition of replication was used to confirm or falsify a result. The same is true for most results for which there is consensus. Yet science moves on, and continues to be an incomparable force at improving our quality of life.</p> <p>Regulatory agencies, such as the FDA, are an exception since they clearly spell out a <a href="http://www.fda.gov/downloads/Drugs/.../Guidances/ucm078749.pdf">definition</a> of replication. For example, to approve a drug they may require two independent clinical trials, adequately powered, to show statistical significance at some predetermined level. They also require a large enough effect size to justify the cost and potential risks associated with treatment. This is not to say that FDA approval is equivalent to scientific consensus, but they do provide a clearcut definition of replication.</p> <p>In response to a growing concern over a <em><a href="http://www.nature.com/news/reproducibility-1.17552">reproducibility crisis</a></em>, projects such as the <a href="http://osc.centerforopenscience.org/">Open Science Collaboration</a> have commenced to systematically try to replicate published results. In a <a href="http://simplystatistics.org/2015/10/01/a-glass-half-full-interpretation-of-the-replicability-of-psychological-science/">recent post</a>, Jeff described one of their <a href="http://www.sciencemag.org/content/349/6251/aac4716">recent papers</a> on estimating the reproducibility of psychological science (they really mean replicability; see note below). This Science paper led to lay press reports with eye-catching headlines such as “only 36% of psychology experiments replicate”. Note that the 36% figure comes from a definition of replication that mimics the definition used by regulatory agencies: results are considered replicated if a p-value &lt; 0.05 was reached in both the original study and the replicated one. Unfortunately, this definition ignores both effect size and statistical power. If power is not controlled, then the expected proportion of correct findings that replicate can be quite small. For example, if I try to replicate the smoking-causes-lung-cancer result with a sample size of 5, there is a good chance it will not replicate. In his post, Jeff notes that for several of the studies that did not replicate, the 95% confidence intervals intersected. So should intersecting confidence intervals be our definition of replication? This too has a flaw since it favors imprecise studies with very large confidence intervals. If effect size is ignored, we may waste our time trying to replicate studies reporting practically meaningless findings. Generally defining replication for published studies is not as easy as for highly controlled clinical trials. However, one clear improvement from what is currently being done is to consider statistical power and effect sizes.</p> <p>To further illustrate this, let’s consider a very concrete example with real life consequences. Imagine a loved one has a disease with high mortality rates and asks for your help in evaluating the scientific evidence on treatments. Four experimental drugs are available all with promising clinical trials resulting in p-values &lt;0.05. However, a replication project redoes the experiments and finds that only the drug A and drug B studies replicate (p&lt;0.05). So which drug do you take? Let’s give a bit more information to help you decide. Here are the p-values for both original and replication trials:</p> <table style="width: 100%;"> <tr> <td> Drug </td> <td> Original </td> <td> Replication </td> <td> Replicated </td> </tr> <tr> <td> A </td> <td> 0.0001 </td> <td> 0.001 </td> <td> Yes </td> </tr> <tr> <td> B </td> <td> &lt;0.000001 </td> <td> 0.03 </td> <td> Yes </td> </tr> <tr> <td> C </td> <td> 0.03 </td> <td> 0.06 </td> <td> No </td> </tr> <tr> <td> D </td> <td> &lt;0.000001 </td> <td> 0.10 </td> <td> No </td> <td> </td> </tr> </table> <p>Which drug would you take now? The information I have provided is based on p-values and therefore is missing a key piece of information: the effect sizes. Below I show the confidence intervals for all four studies (left) and four replication studies (right). Note that except for drug B, all confidence intervals intersect. In light of the figure below, which one would you chose?</p> <p><a href="http://simplystatistics.org/wp-content/uploads/2015/10/replication.png"><img class=" wp-image-4368 alignright" src="http://simplystatistics.org/wp-content/uploads/2015/10/replication.png" alt="replication" width="359" height="338" srcset="http://simplystatistics.org/wp-content/uploads/2015/10/replication-300x283.png 300w, http://simplystatistics.org/wp-content/uploads/2015/10/replication-212x200.png 212w, http://simplystatistics.org/wp-content/uploads/2015/10/replication.png 617w" sizes="(max-width: 359px) 100vw, 359px" /></a></p> <p>I would be inclined to go with drug D because it has a large effect size, a small p-value, and the replication experiment effect estimate fell inside a 95% confidence interval. I would definitely not go with A since it provides marginal benefits, even if the trial found a statistically significant effect and was replicated. So the p-value based definition of replication is practically worthless from a practical standpoint.</p> <p>It seems that before continuing the debate over replication, and certainly before declaring that we are in a <a href="http://www.nature.com/news/reproducibility-1.17552">reproducibility crisis</a>, we need a statistically rigorous and scientifically meaningful definition of replication. This definition does not necessarily need to be dichotomous (replicated or not) and it will probably require more than one replication experiment and more than one summary statistic: one for effect size and one for uncertainty. In the meantime, we should be careful not to dismiss the current scientific process, which seems to be working rather well at either ignoring or debunking false positive results while producing useful knowledge and discovery.</p> <hr /> <p>Footnote on reproducible versus replication: As Jeff pointed out, the cited Open Science Collaboration paper is about replication, not reproducibility. A study is considered reproducible if an independent researcher can recreate the tables and figures from the original raw data. Replication is not nearly as simple to define because it involves probability. To replicate the experiment it has to be performed again, with a different random sample and new set of measurement errors.</p> Theranos runs head first into the realities of diagnostic testing 2015-10-16T08:42:11+00:00 http://simplystats.github.io/2015/10/16/thorns-runs-head-first-into-the-realities-of-diagnostic-testing <p>The Wall Street Journal has published a <a href="http://www.wsj.com/articles/theranos-has-struggled-with-blood-tests-1444881901">lengthy investigation</a> into the diagnostic testing company Theranos.</p> <blockquote> <p>The company offers more than 240 tests, ranging from cholesterol to cancer. It claims its technology can work with just a finger prick. Investors have poured more than400 million into Theranos, valuing it at 9 billion and her majority stake at more than half that. The 31-year-old Ms. Holmes’s bold talk and black turtlenecks draw comparisons to Apple<span class="company-name-type"> Inc.</span> cofounder Steve Jobs.</p> </blockquote> <p>If ever there were a warning sign, the comparison to Steve Jobs has got to be it.</p> <blockquote> <p>But Theranos has struggled behind the scenes to turn the excitement over its technology into reality. At the end of 2014, the lab instrument developed as the linchpin of its strategy handled just a small fraction of the tests then sold to consumers, according to four former employees.</p> <div class=" media-object wrap scope-web|mobileapps " data-layout="wrap "> One former senior employee says Theranos was routinely using the device, named Edison after the prolific inventor, for only 15 tests in December 2014. Some employees were leery about the machine’s accuracy, according to the former employees and emails reviewed by The Wall Street Journal. </div> <div class=" media-object wrap scope-web|mobileapps " data-layout="wrap "> </div> <div class=" media-object wrap scope-web|mobileapps " data-layout="wrap "> In a complaint to regulators, one Theranos employee accused the company of failing to report test results that raised questions about the precision of the Edison system. Such a failure could be a violation of federal rules for laboratories, the former employee said. </div> </blockquote> <div class=" media-object wrap scope-web|mobileapps " data-layout="wrap "> With these kinds of stories, it's always hard to tell whether there's reality here or it's just a bunch of axe grinding. But one thing that's for sure is that people are talking, and probably not for good reasons. </div> Minimal R Package Check List 2015-10-14T08:21:48+00:00 http://simplystats.github.io/2015/10/14/minimal-r-package-check-list <p>A little while back I had the pleasure of flying in a small Cessna with a friend and for the first time I got to see what happens in the cockpit with a real pilot. One thing I noticed was that basically you don’t lift a finger without going through some sort of check list. This starts before you even roll the airplane out of the hangar. It makes sense because flying is a pretty dangerous hobby and you want to prevent problems from occurring when you’re in the air.</p> <p>That experience got me thinking about what might be the minimal check list for building an R package, a somewhat less dangerous hobby. First off, much has changed (for the better) since I started making R packages and I wanted to have some clean documentation of the process, particularly with using RStudio’s tools. So I wiped off my installations of both R and RStudio and started from scratch to see what it would take to get someone to build their first R package.</p> <p>The list is basically a “pre-flight” list-–the presumption here is that you actually know the important details of building packages, but need to make sure that your environment is setup correctly so that you don’t run into errors or problems. I find this is often a problem for me when teaching students to build packages because I focus on the details of actually making the packages (i.e. DESCRIPTION files, Roxygen, etc.) and forget that way back when I actually configured my environment to do this.</p> <p><strong>Pre-flight Procedures for R Packages</strong></p> <ol> <li>Install most recent version of R</li> <li>Install most recent version of RStudio</li> <li>Open RStudio</li> <li>Install <strong>devtools</strong> package</li> <li>Click on Project –&gt; New Project… –&gt; New Directory –&gt; R package</li> <li>Enter package name</li> <li>Delete boilerplate code and “hello.R” file</li> <li>Goto “man” directory an delete “hello.Rd” file</li> <li>In File browser, click on package name to go to the top level directory</li> <li>Click “Build” tab in environment browser</li> <li>Click “Configure Build Tools…”</li> <li>Check “Generate documentation with Roxygen”</li> <li>Check “Build &amp; Reload” when Roxygen Options window opens –&gt; Click OK</li> <li>Click OK in Project Options window</li> </ol> <p>At this point, you’re clear to build your package, which obviously involves writing R code, Roxygen documentation, writing package metadata, and building/checking your package.</p> <p>If I’m missing a step or have too many steps, I’d like to hear about it. But I think this is the minimum number of steps you need to configure your environment for building R packages in RStudio.</p> <p>UPDATE: I’ve made some changes to the check list and will be posting future updates/modifications to my <a href="https://github.com/rdpeng/daprocedures/blob/master/lists/Rpackage_preflight.md">GitHub repository</a>.</p> Profile of Data Scientist Shannon Cebron 2015-10-03T09:32:20+00:00 http://simplystats.github.io/2015/10/03/profile-of-data-scientist-shannon-cebron <p>The “This is Statistics” campaign has a nice <a href="http://thisisstatistics.org/interview-with-shannon-cebron-from-pegged-software/">profile of Shannon Cebron</a>, a data scientist working at the Baltimore-based Pegged Software.</p> <blockquote> <p><strong>What advice would you give to someone thinking of a career in data science?</strong></p> <p>Take some advanced statistics courses if you want to see what it’s like to be a statistician or data scientist. By that point, you’ll be familiar with enough statistical methods to begin solving real-world problems and understanding the power of statistical science. I didn’t realize I wanted to be a data scientist until I took more advanced statistics courses, around my third year as an undergraduate math major.</p> </blockquote> Not So Standard Deviations: Episode 2 - We Got it Under 40 Minutes 2015-10-02T09:00:29+00:00 http://simplystats.github.io/2015/10/02/not-so-standard-deviations-episode-2-we-got-it-under-40-minutes <p>Episode 2 of my podcast with Hilary Parker, <a href="https://soundcloud.com/nssd-podcast">Not So Standard Deviations</a>, is out! In this episode, we talk about user testing for statistical methods, navigating the Hadleyverse, the crucial significance of rename(), and the secret reason for creating the podcast (hint: it rhymes with “bee”). Also, I erroneously claim that <a href="http://www.stat.purdue.edu/~wsc/">Bill Cleveland</a> is <em>way</em> older than he actually is. Sorry Bill.</p> <p>In other news, <a href="https://itunes.apple.com/us/podcast/not-so-standard-deviations/id1040614570">we are finally on iTunes</a> so you can subscribe from there directly if you want (just search for “Not So Standard Deviations” or paste the link directly into your podcatcher.</p> <p><a href="https://api.soundcloud.com/tracks/226538106/download?client_id=02gUJC0hH2ct1EGOcYXQIzRFU91c72Ea&amp;oauth_token=1-138878-174789515-deb24181d01af">Download the audio file for this episode</a>.</p> <p>Notes:</p> <ul> <li><a href="http://www.sciencemag.org/content/229/4716/828.short">Bill Cleveland’s paper in Science</a>, on graphical perception, <strong>published in 1985</strong></li> <li><a href="https://www.eventbrite.com/e/statistics-making-a-difference-a-conference-in-honor-of-tom-louis-tickets-16248614042">TomFest</a></li> </ul> A glass half full interpretation of the replicability of psychological science 2015-10-01T10:00:53+00:00 http://simplystats.github.io/2015/10/01/a-glass-half-full-interpretation-of-the-replicability-of-psychological-science <p style="line-height: 18.0pt;"> <em>tl;dr: 77% of replication effects from the psychology replication study were in (or above) the 95% prediction interval based on the original effect size. This isn't perfect and suggests (a) there is still room for improvement, (b) the scientists who did the replication study are pretty awesome at replicating, (c) we need a better definition of replication that respects uncertainty but (d) the scientific sky isn't falling. We wrote this up in a <a href="http://arxiv.org/abs/1509.08968">paper on arxiv</a>; <a href="https://github.com/jtleek/replication_paper">the code is here.</a> </em> </p> <p style="line-height: 18.0pt;"> <span style="font-size: 12.0pt; font-family: Georgia; color: #333333;">A week or two ago a paper came out in Science on<span class="apple-converted-space"> </span><a href="http://www.sciencemag.org/content/349/6251/aac4716">Estimating the reproducibility of psychological science</a>. The basic behind the study was to take a sample of studies that appeared in a particular journal in 2008 and try to replicate each of these studies. Here I'm using the definition that reproducibility is the ability to recalculate all results given the raw data and code from a study and replicability is the ability to re-do the study and get a consistent result. </span> </p> <p style="line-height: 18.0pt;"> <span style="font-size: 12.0pt; font-family: Georgia; color: #333333;">The paper is pretty incredible and the authors did an amazing job of going back to the original sources and trying to be faithful to the original study designs. I have to admit when I first heard about the study design I was incredibly pessimistic about the results (I suppose grouchy is a natural default state for many statisticians –especially those with sleep deprivation). I mean 2008 was well before the push toward reproducibility had really taken off (Biostatistics was one of the first journals to adopt a policy on reproducible research and that didn't happen <a href="http://biostatistics.oxfordjournals.org/content/10/3/405.full">until 2009</a>). More importantly, the student researchers from those studies had possibly moved on, study populations may change, there could be any number of minor variations in the study design and so forth. I thought the chances of getting any effects in the same range was probably pretty low. </span> </p> <p style="line-height: 18.0pt;"> So when the results were published I was pleasantly surprised. I wasn’t the only one: </p> <blockquote class="twitter-tweet" width="550"> <p lang="en" dir="ltr"> Someone has to say it, but this plot shows that science is, in fact, working. <a href="http://t.co/JUy10xHfbH">http://t.co/JUy10xHfbH</a> <a href="http://t.co/lJSx6IxPw2">pic.twitter.com/lJSx6IxPw2</a> </p> <p> &mdash; Roger D. Peng (@rdpeng) <a href="https://twitter.com/rdpeng/status/637009904289452032">August 27, 2015</a> </p> </blockquote> <blockquote class="twitter-tweet" width="550"> <p lang="en" dir="ltr"> Looks like psychologists are in a not-too-bad spot on the ROC curves of science (<a href="http://t.co/fPsesCn2yK">http://t.co/fPsesCn2yK</a>) <a href="http://t.co/9rAOdZWvzv">http://t.co/9rAOdZWvzv</a> </p> <p> &mdash; Joe Pickrell (@joe_pickrell) <a href="https://twitter.com/joe_pickrell/status/637304244538896384">August 28, 2015</a> </p> </blockquote> <p>But that was definitely not the prevailing impression that the paper left on social and mass media. A lot of the discussion around the paper focused on the <a href="https://github.com/jtleek/replication_paper/blob/gh-pages/in_the_media.md">idea that only 36% of the studies</a> had a p-value less than 0.05 in both the original and replication study. But many of the sample sizes were small and the effects were modest. So the first question I asked myself was, “Well what would we expect to happen if we replicated these studies?” The original paper measured replicability in several ways and tried hard to calibrate expected coverage of confidence intervals for the measured effects.</p> <p>With <a href="http://www.biostat.jhsph.edu/~rpeng/">Roger</a> and <a href="http://www.biostat.jhsph.edu/~prpatil/">Prasad</a> we tried a little different approach. We estimated the 95% prediction interval for the replication effect given the original effect size.</p> <p><a href="http://simplystatistics.org/wp-content/uploads/2015/10/pi_figure_nofilter.png"><img class="aligncenter wp-image-4337" src="http://simplystatistics.org/wp-content/uploads/2015/10/pi_figure_nofilter-300x300.png" alt="pi_figure_nofilter" width="397" height="397" srcset="http://simplystatistics.org/wp-content/uploads/2015/10/pi_figure_nofilter-300x300.png 300w, http://simplystatistics.org/wp-content/uploads/2015/10/pi_figure_nofilter-1024x1024.png 1024w, http://simplystatistics.org/wp-content/uploads/2015/10/pi_figure_nofilter-200x200.png 200w, http://simplystatistics.org/wp-content/uploads/2015/10/pi_figure_nofilter.png 1050w" sizes="(max-width: 397px) 100vw, 397px" /></a></p> <p> </p> <p>72% of the replication effects were within the 95% prediction interval and 2 were above the interval (showed a stronger signal in replication in than predicted from original study). This definitely shows that there is still room for improvement in replication of these studies - we would expect 95% of the effects to fall into the 95% prediction interval. But at least my opinion is that 72% (or 77% if you count the 2 above the P.I.) of studies falling in the prediction interval is (a) not bad and (b) a testament to the authors of the reproducibility paper and their efforts to get the studies right.</p> <p>An important point here is that replication and reproducibility aren’t the same thing. When reproducing a study we expect the numbers and figures to be <em>exactly the same. _But a replication involves recollection of data and is subject to variation and so _we don’t expect the answer to be exactly the same in the replication</em>. This is of course made more confusing by regression to the mean, publication bias, and <a href="http://www.stat.columbia.edu/~gelman/research/unpublished/p_hacking.pdf">the garden of forking paths</a>. Our use of a prediction interval measures both the variation expected in the original study and in the replication. One thing we noticed when re-analyzing the data is how many of the studies had very low sample sizes. <a href="http://simplystatistics.org/wp-content/uploads/2015/10/samplesize_figure_nofilter.png"><img class="aligncenter wp-image-4339" src="http://simplystatistics.org/wp-content/uploads/2015/10/samplesize_figure_nofilter-300x300.png" alt="samplesize_figure_nofilter" width="450" height="450" srcset="http://simplystatistics.org/wp-content/uploads/2015/10/samplesize_figure_nofilter-300x300.png 300w, http://simplystatistics.org/wp-content/uploads/2015/10/samplesize_figure_nofilter-1024x1024.png 1024w, http://simplystatistics.org/wp-content/uploads/2015/10/samplesize_figure_nofilter-200x200.png 200w, http://simplystatistics.org/wp-content/uploads/2015/10/samplesize_figure_nofilter.png 1050w" sizes="(max-width: 450px) 100vw, 450px" /></a></p> <p> </p> <p>Sample sizes were generally bigger in the replication, but often very low regardless. This makes it more difficult to disentangle what didn’t replicate from what is just expected variation for a small sample size study. The point remains whether those small studies should be trusted in general, but for the purposes of measuring replication it makes the problem more difficult.</p> <p>One thing I have been thinking about a lot and this study drove home is that if we are measuring replication we need a definition that incorporates uncertainty directly. Suppose that you collect a data set <strong>D0</strong> from an original study and <strong>D1</strong> from a replication. Then replication means that the data from a study replicates if <strong>D0 ~ F </strong>and <strong>D1 ~ F. </strong>Informally, if the data are generated from the same distribution in both experiments then the study replicates. To get an estimate you apply a pipeline to the data set to get an estimate <strong>e0 = p(D0). </strong>If the study is also reproducible than <strong>p</strong><strong>()</strong> is the same for both studies and <strong>p</strong><strong>(D0) ~ G </strong>and <strong>p</strong><strong>(D1)</strong> <strong>~ G</strong>, subject to some conditions on <strong>p</strong><strong>(). </strong></p> <p>One interesting consequence of this definition is that each complete replication data set represents <em>only a single data point</em> for measuring replication. To measure replication with this definition you either need to make assumptions about the data generating distribution for <strong>D0</strong> and <strong>D1</strong> or you need to perform a complete replication of a study many times to determine if it replicates. However, it does mean that we can define replication even for studies with very small number of replicates as the data generating distribution may be arbitrarily variable in each case.</p> <p>Regardless of this definition I was excited that the <a href="https://osf.io/">OSF </a>folks did the study and pulled it off as well as they did and was a bit bummed about the most common reaction. I think there is an easy narrative that “science is broken” which I think isn’t a positive thing for a number of reasons. I love the way that {reproducibility/replicability/open science/open publication} are becoming more and more common, but often think we fall into the same trap in wanting to report these results as clear cut as we do when reporting exaggerations or oversimplifications of scientific discoveries in headlines. I’m excited to see how these kinds of studies look in 10 years when Github/open science/pre-prints/etc. are all the standards.</p> Apple Music's Moment of Truth 2015-09-30T07:38:08+00:00 http://simplystats.github.io/2015/09/30/apple-musics-moment-of-truth <p>Today is the day when Apple, Inc. learns whether it’s brand new streaming music service, Apple Music, is going to be a major contributor to the bottom line or just another streaming service (JASS?). Apple Music launched 3 months ago and all new users are offered a 3-month free trial. Today, that free trial ends and the big question is how many people will start to <strong>pay</strong> for their subscription, as opposed to simply canceling it. My guess is that most people (&gt; 50%) will opt to pay, but that’s a complete guess. For what it’s worth, I’ll be paying for my subscription. After adding all this music to my library, I’d hate to see it all go away.</p> <p>Back on August 18, 2015, consumer market research firm MusicWatch <a href="http://www.businesswire.com/news/home/20150818005755/en#.VddbR7Scy6F">released a study</a> that claimed, among other things, that</p> <blockquote> <p>Among people who had tried Apple Music, 48 percent reported they are not currently using the service.</p> </blockquote> <p>This would suggest that almost half of people who had signed up for the free trial period of Apple Music were not interested in using it further and would likely not pay for it once the trial ended. If it were true, it would be a blow to the newly launched service.</p> <p>But how did MusicWatch arrive at its number? It claimed to have surveyed 5,000 people in its study. Shortly before the survey by MusicWatch was released, Apple claimed that about 11 million people had signed up for their new Apple Music service (because the service had just launched, everyone who had signed up was in the free trial period). Clearly, 5,000 people do not make up the entire population, so we have but a small sample of users.</p> <p>What is the target that MusicWatch was trying to answer? It seems that they wanted to know the percentage of <strong>all people who had signed up for Apple Music</strong> that were still using the service. Can they make inference about the entire population from the sample of 5,000?</p> <p>If the sample is representative and the individuals are independent, we could use the number 48% as an estimate of the percentage in the population who no longer use the service. The press release from MusicWatch did not indicate any measure of uncertainty, so we don’t know how reliable the number is.</p> <p>Interestingly, soon after the MusicWatch survey was released, Apple released a statement to the publication <em>The Verge</em>, stating that 79% of users who had signed up were still using the service (i.e. only 21% had stopped using it, as opposed to 48% reported by MusicWatch). In other words, Apple just came out and <em>gave us the truth</em>! This was unusual because Apple typically does not make public statements about newly launched products. I just found this amusing because I’ve never been in a situation where I was trying to estimate a parameter and then someone later just told me what its value was.</p> <p>If we believe that Apple and MusicWatch were measuring the same thing in their analyses (and it’s not clear that they were), then it would suggest that MusicWatch’s estimate of the population percentage (48%) was quite far off from the true value (21%). What would explain this large difference?</p> <ol> <li><strong>Random variation</strong>. It’s true that MusicWatch’s survey was a small sample relative to the full population, but the sample was still big with 5,000 people. Furthermore, the analysis was fairly simple (just taking the proportion of users still using the service), so the uncertainty associated with that estimate is unlikely to be that large.</li> <li><strong>Selection bias</strong>. Recall that it’s not clear how MusicWatch sampled its respondents, but it’s possible that the way that they did it led them to capture a set of respondents who were less inclined to use Apple Music. Beyond this, we can’t really say more without knowing the details of the survey process.</li> <li><strong>Respondents are not independent</strong>. It’s possible that the survey respondents are not independent of each other. This would primiarily affect the uncertainty about the estimate, making it larger than we might expect if the respondents were all independent. However, since we do not know what MusicWatch’s uncertainty about their estimate was in the first place, it’s difficult to tell if dependence between respondents could play a role. Apple’s number, of course, has no uncertainty.</li> <li><strong>Measurement differences</strong>. This is the big one, in my opinion. We don’t know is how either MusicWatch or Apple defined “still using the service”. You could imagine a variety of ways to determine whether a person was still using the service. You could ask “Have you used it in the last week?” or perhaps “Did you use it yesterday?” Responses to these questions would be quite different and would likely lead to different overall percentages of usage.</li> </ol> We Used Data to Improve our HarvardX Courses: New Versions Start Oct 15 2015-09-29T09:53:31+00:00 http://simplystats.github.io/2015/09/29/we-used-data-to-improve-our-harvardx-courses-new-versions-start-oct-15 <p>You can sign up following links <a href="http://genomicsclass.github.io/book/pages/classes.html">here</a></p> <p>Last semester we successfully [You can sign up following links <a href="http://genomicsclass.github.io/book/pages/classes.html">here</a></p> <p>Last semester we successfully](http://simplystatistics.org/2014/11/25/harvardx-biomedical-data-science-open-online-training-curriculum-launches-on-january-19/) of my <a href="http://simplystatistics.org/2014/03/31/data-analysis-for-genomic-edx-course/">Data Analysis course</a>. To create the second version, the first was split into eight courses. Over 2,000 students successfully completed the first of these, but, as expected, the numbers were lower for the more advanced courses. We wanted to remove any structural problems keeping students from maximizing what they get from our courses, so we studied the assessment questions data, which included completion rate and time, and used the findings to make improvements. We also used qualitative data from the discussion board. The major changes to version 3 are the following:</p> <ul> <li>We no longer use R packages that Microsoft Windows users had trouble installing in the first course.</li> <li>All courses are now designed to be completed in 4 weeks.</li> <li>We added new assessment questions.</li> <li>We improved the assessment questions determined to be problematic.</li> <li>We split the two courses that students took the longest to complete into smaller modules. Students now have twice as much time to complete these.</li> <li>We consolidated the case studies into one course.</li> <li>We combined the materials from the statistics courses into a <a href="http://simplystatistics.org/2015/09/23/data-analysis-for-the-life-sciences-a-book-completely-written-in-r-markdown/">book</a>, which you can download <a href="https://leanpub.com/dataanalysisforthelifesciences">here</a>. The material in the book match the materials taught in class so you can use it to follow along.</li> </ul> <p>You can enroll into any of the seven courses following the links below. We will be on the discussion boards starting October 15, and we hope to see you there.</p> <ol> <li><a href="https://www.edx.org/course/data-analysis-life-sciences-1-statistics-harvardx-ph525-1x">Statistics and R for the Life Sciences</a> starts October 15.</li> <li><a href="https://www.edx.org/course/data-analysis-life-sciences-2-harvardx-ph525-2x">Introduction to Linear Models and Matrix Algebra</a> starts November 15.</li> <li><a href="https://www.edx.org/course/data-analysis-life-sciences-3-harvardx-ph525-3x">Statistical Inference and Modeling for High-throughput Experiments</a> starts December 15.</li> <li><a href="https://www.edx.org/course/data-analysis-life-sciences-4-harvardx-ph525-4x">High-Dimensional Data Analysis</a> starts January 15.</li> <li><a href="https://www.edx.org/course/data-analysis-life-sciences-5-harvardx-ph525-5x">Introduction to Bioconductor: Annotation and Analysis of Genomes and Genomic Assays</a> starts February 15.</li> <li><a href="https://www.edx.org/course/data-analysis-life-sciences-6-high-harvardx-ph525-6x">High-performance Computing for Reproducible Genomics</a> starts March 15.</li> <li><a href="https://www.edx.org/course/data-analysis-life-sciences-7-case-harvardx-ph525-7x">Case Studies in Functional Genomics</a> start April 15.</li> </ol> <p>The landing page for the series continues to be <a href="http://genomicsclass.github.io/book/pages/classes.html">here</a>.</p> Data Analysis for the Life Sciences - a book completely written in R markdown 2015-09-23T09:37:27+00:00 http://simplystats.github.io/2015/09/23/data-analysis-for-the-life-sciences-a-book-completely-written-in-r-markdown <p class="p1"> The book <em>Data Analysis for the Life Sciences</em> is now available on <a href="https://leanpub.com/dataanalysisforthelifesciences">Leanpub</a>. </p> <p class="p1"> <span class="s1"><img class="wp-image-4313 alignright" src="http://simplystatistics.org/wp-content/uploads/2015/09/title_page-232x300.jpg" alt="title_page" width="222" height="287" srcset="http://simplystatistics.org/wp-content/uploads/2015/09/title_page-232x300.jpg 232w, http://simplystatistics.org/wp-content/uploads/2015/09/title_page-791x1024.jpg 791w" sizes="(max-width: 222px) 100vw, 222px" />Data analysis is now part of practically every research project in the life sciences. In this book we use data and computer code to teach the necessary statistical concepts and programming skills to become a data analyst. Following in the footsteps of <a href="https://www.stat.berkeley.edu/~statlabs/">Stat Labs</a>, instead of showing theory first and then applying it to toy examples, we start with actual applications and describe the theory as it becomes necessary to solve specific challenges.<span class="Apple-converted-space"> We use simulations and data analysis examples to teach statistical concepts. </span></span><span class="s1">The book includes links to computer code that readers can use to program along as they read the book.</span> </p> <p class="p1"> It includes the following chapters: Inference, Exploratory Data Analysis, Robust Statistics, Matrix Algebra, Linear Models, Inference for High-Dimensional Data, Statistical Modeling, Distance and Dimension Reduction, Practical Machine Learning, and Batch Effects. </p> <p class="p1"> The text was completely written in R markdown and every section contains a link to the document that was used to create that section. This means that you can use <a href="http://yihui.name/knitr/">knitr</a> to reproduce any section of the book on your own computer. You can also access all these markdown documents directly from <a href="https://github.com/genomicsclass/labs">GitHub</a>. Please send a pull request if you fix a typo or other mistake! For now we are keeping the R markdowns for the exercises private since they contain the solutions. But you can see the solutions if you take our <a href="http://genomicsclass.github.io/book/pages/classes.html">online course</a> quizzes. If we find that most readers want access to the solutions, we will open them up as well. </p> <p class="p1"> The material is based on the online courses I have been teaching with <a href="http://mikelove.github.io/">Mike Love</a>. As we created the course, Mike and I wrote R markdown documents for the students and put them on GitHub. We then used<a href="http://www.stephaniehicks.com/githubPages_tutorial/pages/githubpages-jekyll.html"> jekyll</a> to create a <a href="http://genomicsclass.github.io/book/">webpage</a> with html versions of the markdown documents. Jeff then convinced us to publish it on <del>Leanbup</del><a href="https://leanpub.com/dataanalysisforthelifesciences">Leanpub</a>. So we wrote a shell script that compiled the entire book into a Leanpub directory, and after countless hours of editing and tinkering we have a 450+ page book with over 200 exercises. The entire book compiles from scratch in about 20 minutes. We hope you like it. </p> The Leek group guide to writing your first paper 2015-09-18T10:57:26+00:00 http://simplystats.github.io/2015/09/18/the-leek-group-guide-to-writing-your-first-paper <blockquote class="twitter-tweet" width="550"> <p lang="en" dir="ltr"> The <a href="https://twitter.com/jtleek">@jtleek</a> guide to writing your first academic paper <a href="https://t.co/APLrEXAS46">https://t.co/APLrEXAS46</a> </p> <p> &mdash; Stephen Turner (@genetics_blog) <a href="https://twitter.com/genetics_blog/status/644540432534368256">September 17, 2015</a> </p> </blockquote> <p>I have written guides on <a href="https://github.com/jtleek/reviews">reviewing papers</a>, <a href="https://github.com/jtleek/datasharing">sharing data</a>, and <a href="https://github.com/jtleek/rpackages">writing R packages</a>. One thing I haven’t touched on until now has been writing papers. Certainly for me, and I think for a lot of students, the hardest transition in graduate school is between taking classes and doing research.</p> <p>There are several hard parts to this transition including trying to find a problem, trying to find an advisor, and having a ton of unstructured time. One of the hardest things I’ve found is knowing (a) when to start writing your first paper and (b) how to do it. So I wrote a guide for students in my group:</p> <p><a href="https://github.com/jtleek/firstpaper">https://github.com/jtleek/firstpaper</a></p> <p>On how to write your first paper. It might be useful for other folks as well so I put it up on Github. Just like with the other guides I’ve written this is a very opinionated (read: doesn’t apply to everyone) guide. I also would appreciate any feedback/pull requests people have.</p> Not So Standard Deviations: The Podcast 2015-09-17T10:57:45+00:00 http://simplystats.github.io/2015/09/17/not-so-standard-deviations-the-podcast <p>I’m happy to announce that I’ve started a brand new podcast called <a href="https://soundcloud.com/nssd-podcast">Not So Standard Deviations</a> with Hilary Parker at Etsy. Episode 1 “RCatLadies Origin Story” is available through SoundCloud. In this episode we talk about the origins of RCatLadies, evidence-based data analysis, my new book, and the Python vs. R debate.</p> <p>You can subscribe to the podcast using the <a href="http://feeds.soundcloud.com/users/soundcloud:users:174789515/sounds.rss">RSS feed</a> from SoundCloud. We’ll be getting it up on iTunes hopefully very soon.</p> <p><a href="https://api.soundcloud.com/tracks/224180667/download?client_id=02gUJC0hH2ct1EGOcYXQIzRFU91c72Ea&amp;oauth_token=1-138878-174789515-deb24181d01af">Download the audio file</a>.</p> <p>Show Notes:</p> <ul> <li><a href="https://twitter.com/rcatladies">RCatLadies Twitter account</a></li> <li>Hilary’s <a href="http://hilaryparker.com/2013/01/30/hilary-the-most-poisoned-baby-name-in-us-history/">analysis of the name Hilary</a></li> <li><a href="https://leanpub.com/artofdatascience">The Art of Data Science</a></li> <li>What is <a href="http://www.amstat.org/meetings/jsm.cfm">JSM</a>?</li> <li><a href="https://en.wikipedia.org/wiki/A_rising_tide_lifts_all_boats">A rising tide lifts all boats</a></li> </ul> Interview with COPSS award Winner John Storey 2015-08-25T09:25:28+00:00 http://simplystats.github.io/2015/08/25/interview-with-copss-award-winner-john-storey <p><a href="http://simplystatistics.org/wp-content/uploads/2015/08/jdstorey.jpg"><img class="aligncenter wp-image-4289 size-medium" src="http://simplystatistics.org/wp-content/uploads/2015/08/jdstorey-198x300.jpg" alt="jdstorey" width="198" height="300" srcset="http://simplystatistics.org/wp-content/uploads/2015/08/jdstorey-198x300.jpg 198w, http://simplystatistics.org/wp-content/uploads/2015/08/jdstorey-132x200.jpg 132w" sizes="(max-width: 198px) 100vw, 198px" /></a></p> <p> </p> <p><em>Editor’s Note: We are again pleased to interview the COPSS President’s award winner. The <a href="https://en.wikipedia.org/wiki/COPSS_Presidents%27_Award">COPSS Award</a> is one of the most prestigious in statistics, sometimes called the Nobel Prize in statistics. This year the award went to <a href="http://www.genomine.org/">John Storey</a> who also won the <a href="http://sml.princeton.edu/news/john-storey-receives-2015-mortimer-spiegelman-award">Mortimer Spiegelman award</a> for his outstanding contribution to public health statistics. This interview is a <a href="https://twitter.com/simplystats/status/631607146572988417">particular pleasure</a> since John was my Ph.D. advisor and has been a major role model and incredibly supportive mentor for me throughout my career. He also <a href="https://github.com/jdstorey/simplystatistics">did the whole interview in markdown and put it under version control at Github</a> so it is fully reproducible. </em></p> <p><strong>SimplyStats: Do you consider yourself to be a statistician, data scientist, machine learner, or something else?</strong></p> <p>JS: For the most part I consider myself to be a statistician, but I’m also very serious about genetics/genomics, data analysis, and computation. I was trained in statistics and genetics, primarily statistics. I was also exposed to a lot of machine learning during my training since Rob Tibshirani was my <a href="http://genealogy.math.ndsu.nodak.edu/id.php?id=69303">PhD advisor</a>. However, I consider my research group to be a data science group. We have the <a href="http://drewconway.com/zia/2013/3/26/the-data-science-venn-diagram">Venn diagram</a> reasonably well covered: experimentalists, programmers, data wranglers, and developers of theory and methods; biologists, computer scientists, and statisticians.</p> <p><strong>**SimplyStats:</strong> How did you find out you had won the COPSS Presidents’ Award?**</p> <p>JS: I received a phone call from the chairperson of the awards committee while I was visiting the Department of Statistical Science at Duke University to <a href="https://stat.duke.edu/events/15731.html">give a seminar</a>. It was during the seminar reception, and I stepped out into the hallway to take the call. It was really exciting to get the news!</p> <p><strong>**SimplyStats: </strong>One of the areas where you have had a big impact is inference in massively parallel problems. How do you feel high-dimensional inference is different from more traditional statistical inference?**</p> <p>JS: My experience is that the most productive way to approach high-dimensional inference problems is to first think about a given problem in the scenario where the parameters of interest are random, and the joint distribution of these parameters is incorporated into the framework. In other words, I first gain an understanding of the problem in a Bayesian framework. Once this is well understood, it is sometimes possible to move in a more empirical and nonparametric direction. However, I have found that I can be most successful if my first results are in this Bayesian framework.</p> <p>As an example, Theorem 1 from <a href="http://genomics.princeton.edu/storeylab/papers/Storey_Annals_2003.pdf">Storey (2003) Annals of Statistics</a> was the first result I obtained in my work on false discovery rates. This paper <a href="https://statistics.stanford.edu/research/false-discovery-rate-bayesian-interpretation-and-q-value">first appeared as a technical report in early 2001</a>, and the results spawned further work on a <a href="http://genomics.princeton.edu/storeylab/papers/directfdr.pdf">point estimation approach</a> to false discovery rates, the <a href="http://genomics.princeton.edu/storeylab/papers/ETST_JASA_2001.pdf">local false discovery rate</a>, <a href="http://www.bioconductor.org/packages/release/bioc/html/qvalue.html">q-value</a> and its <a href="http://www.pnas.org/content/100/16/9440.full">application to genomics</a>, and a <a href="http://genomics.princeton.edu/storeylab/papers/623.pdf">unified theoretical framework</a>.</p> <p>Besides false discovery rates, this approach has been useful in my work on the <a href="http://genomics.princeton.edu/storeylab/papers/Storey_JRSSB_2007.pdf">optimal discovery procedure</a> as well as <a href="http://journals.plos.org/plosgenetics/article?id=10.1371/journal.pgen.0030161">surrogate variable analysis</a> (in particular, <a href="http://amstat.tandfonline.com/doi/abs/10.1080/01621459.2011.645777#.VdxderxVhBc">Desai and Storey 2012</a> for surrogate variable analysis). For high-dimensional inference problems, I have also found it is important to consider whether there are any plausible underlying causal relationships among variables, even if causal inference in not the goal. For example, causal model considerations provided some key guidance in a <a href="http://www.nature.com/ng/journal/v47/n5/full/ng.3244.html">recent paper of ours</a> on testing for genetic associations in the presence of arbitrary population structure. I think there is a lot of insight to be gained by considering what is the appropriate approach for a high-dimensional inference problem under different causal relationships among the variables.</p> <p><strong>SimplyStats: Do you have a process when you are tackling a hard problem or working with students on a hard problem?</strong></p> <p>JS: I like to work on statistics research that is aimed at answering a specific scientific problem (usually in genomics). My process is to try to understand the why in the problem as much as the how. The path to success is often found in the former. I try first to find solutions to research problems by using simple tools and ideas. I like to get my hands dirty with real data as early as possible in the process. I like to incorporate some theory into this process, but I prefer methods that work really well in practice over those that have beautiful theory justifying them without demonstrated success on real-world applications. In terms of what I do day-to-day, listening to music is integral to my process, for both concentration and creative inspiration: typically <a href="https://en.wikipedia.org/wiki/King_Crimson">King Crimson</a> or some <a href="http://www.metal-archives.com/">variant of metal</a> or <a href="https://en.wikipedia.org/wiki/Brian_Eno">ambient</a> – which Simply Statistics co-founder [<a href="http://simplystatistics.org/wp-content/uploads/2015/08/jdstorey.jpg"><img class="aligncenter wp-image-4289 size-medium" src="http://simplystatistics.org/wp-content/uploads/2015/08/jdstorey-198x300.jpg" alt="jdstorey" width="198" height="300" srcset="http://simplystatistics.org/wp-content/uploads/2015/08/jdstorey-198x300.jpg 198w, http://simplystatistics.org/wp-content/uploads/2015/08/jdstorey-132x200.jpg 132w" sizes="(max-width: 198px) 100vw, 198px" /></a></p> <p> </p> <p><em>Editor’s Note: We are again pleased to interview the COPSS President’s award winner. The <a href="https://en.wikipedia.org/wiki/COPSS_Presidents%27_Award">COPSS Award</a> is one of the most prestigious in statistics, sometimes called the Nobel Prize in statistics. This year the award went to <a href="http://www.genomine.org/">John Storey</a> who also won the <a href="http://sml.princeton.edu/news/john-storey-receives-2015-mortimer-spiegelman-award">Mortimer Spiegelman award</a> for his outstanding contribution to public health statistics. This interview is a <a href="https://twitter.com/simplystats/status/631607146572988417">particular pleasure</a> since John was my Ph.D. advisor and has been a major role model and incredibly supportive mentor for me throughout my career. He also <a href="https://github.com/jdstorey/simplystatistics">did the whole interview in markdown and put it under version control at Github</a> so it is fully reproducible. </em></p> <p><strong>SimplyStats: Do you consider yourself to be a statistician, data scientist, machine learner, or something else?</strong></p> <p>JS: For the most part I consider myself to be a statistician, but I’m also very serious about genetics/genomics, data analysis, and computation. I was trained in statistics and genetics, primarily statistics. I was also exposed to a lot of machine learning during my training since Rob Tibshirani was my <a href="http://genealogy.math.ndsu.nodak.edu/id.php?id=69303">PhD advisor</a>. However, I consider my research group to be a data science group. We have the <a href="http://drewconway.com/zia/2013/3/26/the-data-science-venn-diagram">Venn diagram</a> reasonably well covered: experimentalists, programmers, data wranglers, and developers of theory and methods; biologists, computer scientists, and statisticians.</p> <p><strong>**SimplyStats:</strong> How did you find out you had won the COPSS Presidents’ Award?**</p> <p>JS: I received a phone call from the chairperson of the awards committee while I was visiting the Department of Statistical Science at Duke University to <a href="https://stat.duke.edu/events/15731.html">give a seminar</a>. It was during the seminar reception, and I stepped out into the hallway to take the call. It was really exciting to get the news!</p> <p><strong>**SimplyStats: </strong>One of the areas where you have had a big impact is inference in massively parallel problems. How do you feel high-dimensional inference is different from more traditional statistical inference?**</p> <p>JS: My experience is that the most productive way to approach high-dimensional inference problems is to first think about a given problem in the scenario where the parameters of interest are random, and the joint distribution of these parameters is incorporated into the framework. In other words, I first gain an understanding of the problem in a Bayesian framework. Once this is well understood, it is sometimes possible to move in a more empirical and nonparametric direction. However, I have found that I can be most successful if my first results are in this Bayesian framework.</p> <p>As an example, Theorem 1 from <a href="http://genomics.princeton.edu/storeylab/papers/Storey_Annals_2003.pdf">Storey (2003) Annals of Statistics</a> was the first result I obtained in my work on false discovery rates. This paper <a href="https://statistics.stanford.edu/research/false-discovery-rate-bayesian-interpretation-and-q-value">first appeared as a technical report in early 2001</a>, and the results spawned further work on a <a href="http://genomics.princeton.edu/storeylab/papers/directfdr.pdf">point estimation approach</a> to false discovery rates, the <a href="http://genomics.princeton.edu/storeylab/papers/ETST_JASA_2001.pdf">local false discovery rate</a>, <a href="http://www.bioconductor.org/packages/release/bioc/html/qvalue.html">q-value</a> and its <a href="http://www.pnas.org/content/100/16/9440.full">application to genomics</a>, and a <a href="http://genomics.princeton.edu/storeylab/papers/623.pdf">unified theoretical framework</a>.</p> <p>Besides false discovery rates, this approach has been useful in my work on the <a href="http://genomics.princeton.edu/storeylab/papers/Storey_JRSSB_2007.pdf">optimal discovery procedure</a> as well as <a href="http://journals.plos.org/plosgenetics/article?id=10.1371/journal.pgen.0030161">surrogate variable analysis</a> (in particular, <a href="http://amstat.tandfonline.com/doi/abs/10.1080/01621459.2011.645777#.VdxderxVhBc">Desai and Storey 2012</a> for surrogate variable analysis). For high-dimensional inference problems, I have also found it is important to consider whether there are any plausible underlying causal relationships among variables, even if causal inference in not the goal. For example, causal model considerations provided some key guidance in a <a href="http://www.nature.com/ng/journal/v47/n5/full/ng.3244.html">recent paper of ours</a> on testing for genetic associations in the presence of arbitrary population structure. I think there is a lot of insight to be gained by considering what is the appropriate approach for a high-dimensional inference problem under different causal relationships among the variables.</p> <p><strong>SimplyStats: Do you have a process when you are tackling a hard problem or working with students on a hard problem?</strong></p> <p>JS: I like to work on statistics research that is aimed at answering a specific scientific problem (usually in genomics). My process is to try to understand the why in the problem as much as the how. The path to success is often found in the former. I try first to find solutions to research problems by using simple tools and ideas. I like to get my hands dirty with real data as early as possible in the process. I like to incorporate some theory into this process, but I prefer methods that work really well in practice over those that have beautiful theory justifying them without demonstrated success on real-world applications. In terms of what I do day-to-day, listening to music is integral to my process, for both concentration and creative inspiration: typically <a href="https://en.wikipedia.org/wiki/King_Crimson">King Crimson</a> or some <a href="http://www.metal-archives.com/">variant of metal</a> or <a href="https://en.wikipedia.org/wiki/Brian_Eno">ambient</a> – which Simply Statistics co-founder](http://jtleek.com/) got to <del>endure</del> enjoy for years during his PhD in my lab.</p> <p><strong>SimplyStats: You are the founding Director of the Center for Statistics and Machine Learning at Princeton. What parts of the new gig are you most excited about?</strong></p> <p>JS: Princeton closed its Department of Statistics in the early 1980s. Because of this, the style of statistician and machine learner we have here today is one who’s comfortable being appointed in a field outside of statistics or machine learning. Examples include myself in genomics, Kosuke Imai in political science, Jianqing Fan in finance and economics, and Barbara Engelhardt in computer science. Nevertheless, statistics and machine learning here is strong, albeit too small at the moment (which will be changing soon). This is an interesting place to start, very different from most universities.</p> <p>What I’m most excited about is that we get to answer the question: “What’s the best way to build a faculty, educate undergraduates, and create a PhD program starting now, focusing on the most important problems of today?”</p> <p>For those who are interested, we’ll be releasing a <a href="http://www.princeton.edu/strategicplan/taskforces/sml/">public version of our strategic plan</a> within about six months. We’re trying to do something unique and forward-thinking, which will hopefully make Princeton an influential member of the statistics, machine learning, and data science communities.</p> <p><strong>SimplyStats: You are organizing the Tukey conference at Princeton (to be held September 18, <a href="http://csml.princeton.edu/tukey">details here</a>).</strong> <strong>Do you think Tukey’s influence will affect your vision for re-building statistics at Princeton?</strong></p> <p>JS: Absolutely, Tukey has been and will be a major influence in how we re-build. He made so many important contributions, and his approach was extremely forward thinking and tied into real-world problems. I strongly encourage everyone to read Tukey’s 1962 paper titled <a href="https://projecteuclid.org/euclid.aoms/1177704711">The Future of Data Analysis</a>. Here he’s 50 years into the future, foreseeing the rise of data science. This paper has truly amazing insights, including:</p> <blockquote> <p>For a long time I have thought I was a statistician, interested in inferences from the particular to the general. But as I have watched mathematical statistics evolve, I have had cause to wonder and to doubt.</p> <p>All in all, I have come to feel that my central interest is in data analysis, which I take to include, among other things: procedures for analyzing data, techniques for interpreting the results of such procedures, ways of planning the gathering of data to make its analysis easier, more precise or more accurate, and all the machinery and results of (mathematical) statistics which apply to analyzing data.</p> <p>Data analysis is a larger and more varied field than inference, or incisive procedures, or allocation.</p> <p>By and large, the great innovations in statistics have not had correspondingly great effects upon data analysis. . . . Is it not time to seek out novelty in data analysis?</p> </blockquote> <p>In this regard, another paper that has been influential in how we are re-building is Leo Breiman’s titled <a href="http://projecteuclid.org/euclid.ss/1009213726">Statistical Modeling: The Two Cultures</a>. We’re building something at Princeton that includes both cultures and seamlessly blends them into a bigger picture community concerned with data-driven scientific discovery and technology development.</p> <p><strong>SimplyStats:</strong> <strong>What advice would you give young statisticians getting into the discipline now?</strong></p> <p>JS: My most general advice is don’t isolate yourself within statistics. Interact with and learn from other fields. Work on problems that are important to practitioners of science and technology development. I recommend that students should master both “traditional statistics” and at least one of the following: (1) computational and algorithmic approaches to data analysis, especially those more frequently studied in machine learning or data science; (2) a substantive scientific area where data-driven discovery is extremely important (e.g., social sciences, economics, environmental sciences, genomics, neuroscience, etc.). I also recommend that students should consider publishing in scientific journals or computer science conference proceedings, in addition to traditional statistics journals. I agree with a lot of the constructive advice and commentary given on the Simply Statistics blog, such as encouraging students to learn about reproducible research, problem-driven research, software development, improving data analyses in science, and outreach to non-statisticians. These things are very important for the future of statistics.</p> The Next National Library of Medicine Director Can Help Define the Future of Data Science 2015-08-24T10:00:26+00:00 http://simplystats.github.io/2015/08/24/the-next-national-library-of-medicine-director-can-help-define-the-future-of-data-science <p>The main motivation for starting this blog was to share our enthusiasm about the increased importance of data and data analysis in science, industry, and society in general. Based on recent initiatives, such as <a href="https://datascience.nih.gov/bd2k">BD2k</a>, it is clear that the NIH is also enthusiastic and very much interested in supporting data science. For those that don’t know, the National Institutes of Health (NIH) is the largest public funder of biomedical research in the world. This federal agency has an annual budget of about30 billion.</p> <p>The NIH has <a href="http://www.nih.gov/icd/icdirectors.htm">several institutes</a>, each with its own budget and capability to guide funding decisions. Currently, the missions of most of these institutes relate to a specific disease or public health challenge.  Many of them fund research in statistics and computing because these topics are important components of achieving their specific mission. Currently, however, there is no institute directly tasked with supporting data science per se. This is about to change.</p> <p>The National Library of Medicine (NLM) is one of the few NIH institutes that is not focused on a particular disease or public health challenge. Apart from the important task of maintaining an actual library, it supports, among many other initiatives, indispensable databases such as PubMed, GeneBank and GEO. After over 30 years of successful service as NLM director, Dr. Donald Lindberg stepped down this year and, as is customary, an advisory board was formed to advice the NIH on what’s next for NLM. One of the main recommendations of <a href="http://acd.od.nih.gov/reports/Report-NLM-06112015-ACD.pdf">the report</a> is the following:</p> <blockquote> <p>NLM  should be the intellectual and programmatic epicenter for data science at NIH and stimulate its advancement throughout biomedical research and application.</p> </blockquote> <p>Data science features prominently throughout the report making it clear the NIH is very much interested in further supporting this field. The next director can therefore have an enormous influence in the futre of data science. So, if you love data, have administrative experience, and a vision about the future of data science as it relates to the medical and related sciences, consider this exciting opportunity.</p> <p>Here is the <a href="http://www.jobs.nih.gov/vacancies/executive/nlm_director.htm">ad</a>.</p> <p> </p> <p> </p> <p> </p> Interview with Sherri Rose and Laura Hatfield 2015-08-21T13:20:14+00:00 http://simplystats.github.io/2015/08/21/interview-with-sherri-rose-and-laura-hatfied <p> </p> <p><a href="http://simplystatistics.org/wp-content/uploads/2015/08/hatfieldrose.png"><img class="aligncenter wp-image-4273 size-medium" src="http://simplystatistics.org/wp-content/uploads/2015/08/hatfieldrose-300x200.png" alt="Sherri Rose and Laura Hatfield" width="300" height="200" srcset="http://simplystatistics.org/wp-content/uploads/2015/08/hatfieldrose-300x200.png 300w, http://simplystatistics.org/wp-content/uploads/2015/08/hatfieldrose-260x173.png 260w, http://simplystatistics.org/wp-content/uploads/2015/08/hatfieldrose.png 975w" sizes="(max-width: 300px) 100vw, 300px" /></a></p> <p style="text-align: center;"> Rose/Hatfield © Savannah Bergquist </p> <p><em><a href="http://www.hcp.med.harvard.edu/faculty/core/laura-hatfield-phd">Laura Hatfield</a> and <a href="http://www.drsherrirose.com/">Sherri Rose</a> are Assistant Professors specializing in biostatistics at Harvard Medical School in the <a href="http://www.hcp.med.harvard.edu">Department of Health Care Policy</a>. Laura received her PhD in Biostatistics from the University of Minnesota and Sherri completed her PhD in Biostatistics at UC Berkeley. They are developing novel statistical methods for health policy problems.</em></p> <p><strong><em>**_SimplyStats</em></strong>: Do you consider yourselves statisticians, data scientists, machine learners, or something else?_**</p> <p><strong>Rose</strong>: I’d definitely say a statistician. Even when I’m working on things that fall into the categories of data science or machine learning, there’s underlying statistical theory guiding that process, be it for methods development or applications. Basically, there’s a statistical foundation to everything I do.</p> <p><strong>Hatfield</strong>: When people ask what I do, I start by saying that I do research in health policy. Then I say I’m a statistician by training and I work with economists and physicians. People have mistaken ideas about what a statistician or professor does, so describing my context and work seems more informative. If I’m at a party, I usually wrap it up in a bow as, “I crunch numbers to study how Obamacare is working.” [laughs]</p> <p> </p> <p><strong><em>SimplyStats: What is the</em></strong> <a href="http://www.healthpolicydatascience.org/"><strong><em>Health Policy Data Science Lab</em></strong></a><strong><em>? How did you decide to start that?</em></strong></p> <p><strong>Hatfield</strong>: We wanted to give our trainees a venue to promote their work and get feedback from their peers. And it helps me keep up on the cool projects Sherri and her students are working on.</p> <p><strong>Rose</strong>: This grew out of us starting to jointly mentor trainees. It’s been a great way for us to make intellectual contributions to each other’s work through Lab meetings. Laura and I approach statistics from <em>completely</em> different frameworks, but work on related applications, so that’s a unique structure for a lab.</p> <p> </p> <p><strong><em>**_SimplyStats: </em></strong>What kinds of problems are your groups working on these days? Are they mostly focused on health policy?_**</p> <p><strong>Rose</strong>: One of the fun things about working in health policy is that it is quite expansive. Statisticians can have an even bigger impact on science and public health if we take that next step: thinking about the policy implications of our research. And then, who needs to see the work in order to influence relevant policies. A couple projects I’m working on that demonstrate this breadth include a machine learning framework for risk adjustment in insurance plan payment and a new estimator for causal effects in a complex epidemiologic study of chronic disease. The first might be considered more obviously health policy, but the second will have important policy implications as well.</p> <p><strong>Hatfield</strong>: When I start an applied collaboration, I’m also thinking, “Where is the methods paper?” Most of my projects use messy observational data, so there is almost always a methods paper. For example, many studies here need to find a control group from an administrative data source. I’ve been keeping track of challenges in this process. One of our Lab students is working with me on a pathological case of a seemingly benign control group selection method gone bad. I love the creativity required in this work; my first 10 analysis ideas may turn out to be infeasible given the data, but that’s what makes this fun!</p> <p> </p> <p><strong><em>**_SimplyStats: </em></strong>What are some particular challenges of working with large health data?_**</p> <p><strong>Hatfield</strong>: When I first heard about the huge sample sizes, I was excited! Then I learned that data not collected for research purposes…</p> <p><strong>Rose</strong>: This was going to be my answer!</p> <p><strong>Hatfield</strong>: …are <em>very</em> hard to use for research! In a recent project, I’ve been studying how giving people a tool to look up prices for medical services changes their health care spending. But the data set we have leaves out [painful pause] a lot of variables we’d like to use for control group selection and… a lot of the prices. But as I said, these gaps in the data are begging to be filled by new methods.</p> <p><strong>Rose</strong>: I think the fact that we have similar answers is important. I’ve repeatedly seen “big data” not have a strong signal for the research question, since they weren’t collected for that purpose. It’s easy to get excited about thousands of covariates in an electronic health record, but so much of it is noise, and then you end up with an R<sup>2</sup> of 10%. It can be difficult enough to generate an effective prediction function, even with innovative tools, let alone try to address causal inference questions. It goes back to basics: what’s the research question and how can we translate that into a statistical problem we can answer given the limitations of the data.</p> <p><strong><em>**_SimplyStats: </em></strong>You both have very strong data science skills but are in academic positions. Do you have any advice for students considering the tradeoff between academia and industry?_**</p> <p><strong>Hatfield</strong>: I think there is more variance within academia and within industry than between the two.</p> <p><strong>Rose</strong>: Really? That’s surprising to me…</p> <p><strong>Hatfield</strong>: I had stereotypes about academic jobs, but my current job defies those.</p> <p><strong>Rose</strong>: What if a larger component of your research platform included programming tools and R packages? My immediate thought was about computing and its role in academia. Statisticians in genomics have navigated this better than some other areas. It can surely be done, but there are still challenges folding that into an academic career.</p> <p><strong>Hatfield</strong>: I think academia imposes few restrictions on what you can disseminate compared to industry, where there may be more privacy and intellectual property concerns. But I take your point that R packages do not impress most tenure and promotion committees.</p> <p><strong>Rose</strong>: You want to find a good match between how you like spending your time and what’s rewarded. Not all academic jobs are the same and not all industry jobs are alike either. I wrote a more detailed <a href="http://simplystatistics.org/2015/02/18/navigating-big-data-careers-with-a-statistics-phd/">guest post</a> on this topic for <em>Simply Statistics</em>.</p> <p><strong>Hatfield</strong>: I totally agree you should think about how you’d actually spend your time in any job you’re considering, rather than relying on broad ideas about industry versus academia. Do you love writing? Do you love coding? etc.</p> <p> </p> <p><strong><em>**_SimplyStats: </em></strong>You are both adopters of social media as a mechanism of disseminating your work and interacting with the community. What do you think of social media as a scientific communication tool? Do you find it is enhancing your careers?_**</p> <p><strong>Hatfield</strong>: Sherri is my social media mentor!</p> <p><strong>Rose</strong>: I think social media can be a useful tool for networking, finding and sharing neat articles and news, and putting your research out there to a broader audience. I’ve definitely received speaking invitations and started collaborations because people initially “knew me from Twitter.” It’s become a way to recruit students as well. Prospective students are more likely to “know me” from a guest post or Twitter than traditional academic products, like journal articles.</p> <p><strong>Hatfield</strong>: I’m grateful for our <a href="https://twitter.com/HPDSLab">Lab’s new Twitter</a> because it’s a purely academic account. My personal account has been awkwardly transitioning to include professional content; I still tweet silly things there.</p> <p><strong>Rose</strong>: My timeline might have <a href="https://twitter.com/sherrirose/status/569613197600272386">a cat picture</a> or <a href="https://twitter.com/sherrirose/status/601822958491926529">two</a>.</p> <p><strong>Hatfield</strong>: My very favorite thing about academic Twitter is discovering things I wouldn’t have even known to search for, especially packages and tricks in R. For example, that’s how I got converted to tidy data and dplyr.</p> <p><strong>Rose</strong>: I agree. I think it’s a fantastic place to become exposed to work that’s incredibly related to your own but in another field, and you wouldn’t otherwise find it preparing a typical statistics literature review.</p> <p> </p> <p><strong><em>**</em></strong><em>SimplyStats: </em><strong><em>**What would you change in the statistics community?</em></strong></p> <p><strong>Rose</strong>: Mentoring. I was tremendously lucky to receive incredible mentoring as a graduate student and now as a new faculty member. Not everyone gets this, and trainees don’t know where to find guidance. I’ve actively reached out to trainees during conferences and university visits, erring on the side of offering too much unsolicited help, because I feel there’s a need for that. I also have a <a href="http://drsherrirose.com/resources">resources page</a> on my website that I continue to update. I wish I had a more global solution beyond encouraging statisticians to take an active role in mentoring not just your own trainees. We shouldn’t lose good people because they didn’t get the support they needed.</p> <p><strong>Hatfield</strong>: I think we could make conferences much better! Being in the same physical space at the same time is very precious. I would like to take better advantage of that at big meetings to do work that requires face time. Talks are not an example of this. Workshops and hackathons and panels and working groups – these all make better use of face-to-face time. And are a lot more fun!</p> <p> </p> If you ask different questions you get different answers - one more way science isn't broken it is just really hard 2015-08-20T14:52:34+00:00 http://simplystats.github.io/2015/08/20/if-you-ask-different-quetions-you-get-different-asnwers-one-more-way-science-isnt-broken-it-is-just-really-hard <p>If you haven’t already read the amazing piece by Christie Aschwanden on why <a href="http://fivethirtyeight.com/features/science-isnt-broken/">Science isn’t Broken</a> you should do so immediately. It does an amazing job of capturing the nuance of statistics as applied to real data sets and how that can be misconstrued as science being “broken” without falling for the easy “everything is wrong” meme.</p> <p>One thing that caught my eye was how the piece highlighted a crowd-sourced data analysis of soccer red cards. The key figure for that analysis is this one:</p> <p> </p> <p><a href="http://fivethirtyeight.com/features/science-isnt-broken/"><img class="aligncenter" src="https://espnfivethirtyeight.files.wordpress.com/2015/08/truth-vigilantes-soccer-calls2.png?w=1024&amp;h=597" alt="" width="1024" height="597" /></a></p> <p>I think the figure and <a href="https://osf.io/qix4g/">underlying data</a> for this figure are fascinating in that they really highlight the human behavioral variation in data analysis and you can even see some <a href="http://simplystatistics.org/2015/04/29/data-analysis-subcultures/">data analysis subcultures </a>emerging from the descriptions of how people did the analysis and justified or not the use of covariates.</p> <p>One subtlety of the figure that I missed on the original reading is that not all of the estimates being reported are measuring the same thing. For example, if some groups adjusted for the country of origin of the referees and some did not, then the estimates for those two groups are measuring different things (the association conditional on country of origin or not, respectively). In this case the estimates may be different, but entirely consistent with each other, since they are just measuring different things.</p> <p>If you ask two people to do the analysis and you only ask them the simple question: <em>Are referees more likely to give  red cards to dark skinned players?</em> then you may get a different answer based on those two estimates. But the reality is the answers the analysts are reporting are actually to the questions:</p> <ol> <li>Are referees more likely to give  red cards to dark skinned players holding country of origin fixed?</li> <li>Are referees more likely to give  red cards to dark skinned players averaging over country of origin (and everything else)?</li> </ol> <p>The subtlety lies in the fact that changes to covariates in the analysis are actually changing the hypothesis you are studying.</p> <p>So in fact the conclusions in that figure may all be entirely consistent after you condition on asking the same question. I’d be interested to see the same plot, but only for the groups that conditioned on the same set of covariates, for example. This is just one more reason that science is really hard and why I’m so impressed at how well the FiveThirtyEight piece captured this nuance.</p> <p> </p> <p> </p> P > 0.05? I can make any p-value statistically significant with adaptive FDR procedures 2015-08-19T10:38:31+00:00 http://simplystats.github.io/2015/08/19/p-0-05-i-can-make-any-p-value-statistically-significant-with-adaptive-fdr-procedures <p>Everyone knows now that you have to correct for multiple testing when you calculate many p-values otherwise this can happen:</p> <div style="width: 550px" class="wp-caption aligncenter"> <a href="http://xkcd.com/882/"><img class="" src=" http://imgs.xkcd.com/comics/significant.png" alt="" width="540" height="1498" /></a> <p class="wp-caption-text"> http://xkcd.com/882/ </p> </div> <p> </p> <p>One of the most popular ways to correct for multiple testing is to estimate or control the <a href="https://en.wikipedia.org/wiki/False_discovery_rate">false discovery rate</a>. The false discovery rate attempts to quantify the fraction of made discoveries that are false. If we call all p-values less than some threshold <em>t</em> significant, then borrowing notation from this <a href="http://www.ncbi.nlm.nih.gov/pubmed/12883005">great introduction to false discovery rates </a></p> <p><a href="http://simplystatistics.org/wp-content/uploads/2015/08/fdr3.gif"><img class="aligncenter size-full wp-image-4246" src="http://simplystatistics.org/wp-content/uploads/2015/08/fdr3.gif" alt="fdr3" width="285" height="40" /></a></p> <p> </p> <p>So <em>F(t)</em> is the (unknown) total number of null hypotheses called significant and <em>S(t)</em> is the total number of hypotheses called significant. The FDR is the expected ratio of these two quantities, which, under certain assumptions can be approximated by the ratio of the expectations.</p> <p> </p> <p><a href="http://simplystatistics.org/wp-content/uploads/2015/08/fdr4.gif"><img class="aligncenter size-full wp-image-4247" src="http://simplystatistics.org/wp-content/uploads/2015/08/fdr4.gif" alt="fdr4" width="246" height="44" /></a></p> <p> </p> <p>To get an estimate of the FDR we just need an estimate for  <em>E[_F(t)]</em> _ and <em>E[S(t)]. _The latter is pretty easy to estimate as just the total number of rejections (the number of _p &lt; t</em>). If you assume that the p-values follow the expected distribution then <em>E[_F(t)]</em>  <em>can be approximated by multiplying the fraction of null hypotheses, multiplied by the total number of hypotheses and multiplied by _t</em> since the p-values are uniform. To do this, we need an estimate for <span class="MathJax_Preview"><img src="http://simplystatistics.org/wp-content/plugins/latex/cache/tex_d4c98d75e25f5d28461f1da221eb7a95.gif" style="vertical-align: middle; border: none; padding-bottom:1px;" class="tex" alt="\pi_0" /></span>, the proportion of null hypotheses. There are a large number of ways to estimate this quantity but it is almost always estimated using the full distribution of computed p-values in an experiment. The most popular estimator compares the fraction of p-values greater than some cutoff to the number you would expect if every single hypothesis were null. This fraction is about the fraction of null hypotheses.</p> <p>Combining the above equation with our estimates for <em>E[_F(t)]</em> _ and _E[S(t)] _we get:</p> <p> </p> <p><a href="http://simplystatistics.org/wp-content/uploads/2015/08/fdr5.gif"><img class="aligncenter size-full wp-image-4250" src="http://simplystatistics.org/wp-content/uploads/2015/08/fdr5.gif" alt="fdr5" width="238" height="42" /></a></p> <p> </p> <p>The q-value is a multiple testing analog of the p-value and is defined as:</p> <p><a href="http://simplystatistics.org/wp-content/uploads/2015/08/fdr61.gif"><img class="aligncenter size-full wp-image-4258" src="http://simplystatistics.org/wp-content/uploads/2015/08/fdr61.gif" alt="fdr6" width="163" height="26" /></a></p> <p> </p> <p>This is of course a very loose version of this and you can get a more technical description <a href="http://www.genomine.org/papers/directfdr.pdf">here</a>. But the main thing to notice is that the q-value depends on the estimated proportion of null hypotheses, which depends on the distribution of the observed p-values. The smaller the estimated fraction of null hypotheses, the smaller the FDR estimate and the smaller the q-value. This suggests a way to make any p-value significant by altering its “testing partners”. Here is a quick example. Suppose that we have done a test and have a p-value of 0.8. Not super significant. Suppose we perform this test in conjunction with a number of hypotheses that are null generating a p-value distribution like this.</p> <p><a href="http://simplystatistics.org/wp-content/uploads/2015/08/uniform-pvals.png"><img class="aligncenter size-medium wp-image-4260" src="http://simplystatistics.org/wp-content/uploads/2015/08/uniform-pvals-300x300.png" alt="uniform-pvals" width="300" height="300" srcset="http://simplystatistics.org/wp-content/uploads/2015/08/uniform-pvals-300x300.png 300w, http://simplystatistics.org/wp-content/uploads/2015/08/uniform-pvals-200x200.png 200w, http://simplystatistics.org/wp-content/uploads/2015/08/uniform-pvals.png 480w" sizes="(max-width: 300px) 100vw, 300px" /></a></p> <p>Then you get a q-value greater than 0.99 as you would expect. But if you test that exact same p-value with a ton of other non-null hypotheses that generate tiny p-values in a distribution that looks like this:</p> <p><a href="http://simplystatistics.org/wp-content/uploads/2015/08/significant-pvals.png"><img class="aligncenter size-medium wp-image-4261" src="http://simplystatistics.org/wp-content/uploads/2015/08/significant-pvals-300x300.png" alt="significant-pvals" width="300" height="300" srcset="http://simplystatistics.org/wp-content/uploads/2015/08/significant-pvals-300x300.png 300w, http://simplystatistics.org/wp-content/uploads/2015/08/significant-pvals-200x200.png 200w, http://simplystatistics.org/wp-content/uploads/2015/08/significant-pvals.png 480w" sizes="(max-width: 300px) 100vw, 300px" /></a></p> <p> </p> <p>Then you get a q-value of 0.0001 for that same p-value of 0.8. The reason is that the estimate of the fraction of null hypotheses goes essentially to zero, which drives down the q-value. You can do this with any p-value, if you make its testing partners have sufficiently low p-values then the q-value will also be as small as you like.</p> <p>A couple of things to note:</p> <ul> <li>Obviously doing this on purpose to change the significance of a calculated p-value is cheating and shouldn’t be done.</li> <li>For correctly calculated p-values on a related set of hypotheses this is actually a sensible property to have - if you have almost all very small p-values and one very large p-value, you are doing a set of tests where almost everything appears to be alternative and you should weight that in some sensible way.</li> <li>This is the reason that sometimes a “multiple testing adjusted” p-value (or q-value) is smaller than the p-value itself.</li> <li>This doesn’t affect non-adaptive FDR procedures - but those procedures still depend on the “testing partners” of any p-value through the total number of tests performed. This is why people talk about the so-called “multiple testing burden”. But that is a subject for a future post. It is also the reason non-adaptive procedures can be severely underpowered compared to adaptive procedures when the p-values are correct.</li> <li>I’ve appended the code to generate the histograms and calculate the q-values in this post in the following gist.</li> </ul> <p> </p> UCLA Statistics 2015 Commencement Address 2015-08-12T10:34:03+00:00 http://simplystats.github.io/2015/08/12/ucla-statistics-2015-commencement-address <p>I was asked to speak at the <a href="http://www.stat.ucla.edu">UCLA Department of Statistics</a> Commencement Ceremony this past June. As one of the first graduates of that department back in 2003, I was tremendously honored to be invited to speak to the graduates. When I arrived I was just shocked at how much the department had grown. When I graduated I think there were no more than 10 of us between the PhD and Master’s programs. Now they have ~90 graduates per year with undergrad, Master’s and PhD. It was just stunning.</p> <p>Here’s the text of what I said, which I think I mostly stuck to in the actual speech.</p> <p> </p> <p><strong>UCLA Statistics Graduation: Some thoughts on a career in statistics</strong></p> <p>When I asked Rick [Schoenberg] what I should talk about, he said to ‘talk for 95 minutes on asymptotic properties of maximum likelihood estimators under nonstandard conditions”. I thought this is a great opportunity! I busted out Tom Ferguson’s book and went through my old notes. Here we go. Let X be a complete normed vector space….</p> <p>I want to thank the department for inviting me here today. It’s always good to be back. I entered the UCLA stat department in 1999, only the second entering class, and graduated from UCLA Stat in 2003. Things were different then. Jan was the chair and there were not many classes so we could basically do whatever we wanted. Things are different now and that’s a good thing. Since 2003, I’ve been at the Department of Biostatistics at the Johns Hopkins Bloomberg School of Public Health, where I was first a postdoctoral fellow and then joined the faculty. It’s been a wonderful place for me to grow up and I’ve learned a lot there.</p> <p>It’s just an incredible time to be a statistician. You guys timed it just right. I’ve been lucky enough to witness two periods like this, the first time being when I graduated from college at the height of the dot come boom. Today, it’s not computer programming skills that the world needs, but rather it’s statistical skills. I wish I were in your shoes today, just getting ready to startup. But since I’m not, I figured the best thing I could do is share some of the things I’ve learned and talk about the role that these things have played in my own life.</p> <p>Know your edge: What’s the one thing that you know that no one else seems to know? You’re not a clone—you have original ideas and skills. You might think they’re not valuable but you’re wrong. Be proud of these ideas and use them to your advantage. As an example, I’ll give you my one thing. Right now, I believe the greatest challenge facing the field of statistics today is getting the entire world to know what we in this room already know. Data are everywhere today and the biggest barrier to progress is our collective inability to process and analyze those data to produce useful information. The need for the things that we know has absolutely exploded and we simply have not caught up. That’s why I created, along with Jeff Leek and Brian Caffo, the Johns Hopkins Data Science Specialization, which is currently the most successful massive open online course program ever. Our goal is to teach the entire world statistics, which we think is an essential skill. We’re not quite there yet, but—assuming you guys don’t steal my idea—I’m hopeful that we’ll get there sometime soon.</p> <p>At some point the edge you have will no longer work: That sounds like a bad thing, but it’s actually good. If what you’re doing really matters, then at some point everyone will be doing it. So you’ll need to find something else. I’ve been confronted with this problem at least 3 times in my life so far. Before college, I was pretty good at the violin, and it opened a lot of doors for me. It got me into Yale. But when I got to Yale, I quickly realized that there were a lot of really good violinists here. Suddenly, my talent didn’t have so much value. This was when I started to pick up computer programming and in 1998 I learned an obscure little language called R. When I got to UCLA I realized I was one of the only people who knew R. So I started a little brown bag lunch series where I’d talk about some feature of R to whomever would show up (which wasn’t many people usually). Picking up on R early on turned out to be really important because it was a small community back then and it was easy to have a big impact. Also, as more and more people wanted to learn R, they’d usually call on me. It’s always nice to feel needed. Over the years, the R community exploded and R’s popularity got to the point where it was being talked about in the New York Times. But now you see the problem. Saying that you know R doesn’t exactly distinguish you anymore, so it’s time to move on again. These days, I’m realizing that the one useful skill that I have is the ability to make movies. Also, my experience being a performer on the violin many years ago is coming in handy. My ability to quickly record and edit movies was one of the key factors that enabled me to create an entire online data science program in 2 months last year.</p> <p>Find the right people, and stick with them forever. Being a statistician means working with other people. Choose those people wisely and develop a strong relationship. It doesn’t matter how great the project is or how famous or interesting the other person is, if you can’t get along then bad things will happen. Statistics and data analysis is a highly verbal process that requires constant and very clear communication. If you’re uncomfortable with someone in any way, everything will suffer. Data analysis is unique in this way—our success depends critically on other people. I’ve only had a few collaborators in the past 12 years, but I love them like family. When I work with these people, I don’t necessarily know what will happen, but I know it will be good. In the end, I honestly don’t think I’ll remember the details of the work that I did, but I’ll remember the people I worked with and the relationships I built.</p> <p>So I hope you weren’t expecting a new asymptotic theorem today, because this is pretty much all I’ve got. As you all go on to the next phase of your life, just be confident in your own ideas, be prepared to change and learn new things, and find the right people to do them with. Thank you.</p> Correlation is not a measure of reproducibility 2015-08-12T10:33:25+00:00 http://simplystats.github.io/2015/08/12/correlation-is-not-a-measure-of-reproducibility <p>Biologists make wide use of correlation as a measure of reproducibility. Specifically, they quantify reproducibility with the correlation between measurements obtained from replicated experiments. For example, <a href="https://genome.ucsc.edu/ENCODE/protocols/dataStandards/ENCODE_RNAseq_Standards_V1.0.pdf">the ENCODE data standards document</a> states</p> <blockquote> <p>A typical R<sup>2</sup> (Pearson) correlation of gene expression (RPKM) between two biological replicates, for RNAs that are detected in both samples using RPKM or read counts, should be between 0.92 to 0.98. Experiments with biological correlations that fall below 0.9 should be either be repeated or explained.</p> </blockquote> <p>However, for  reasons I will explain here, correlation is not necessarily informative with regards to reproducibility. The mathematical results described below are not inconsequential theoretical details, and understanding them will help you assess new technologies, experimental procedures and computation methods.</p> <p>Suppose you have collected data from an experiment</p> <p style="text-align: center;"> <em>x</em><sub>1</sub>, <em>x</em><sub>2</sub>,..., <em>x</em><sub>n</sub> </p> <p>and want to determine if  a second experiment replicates these findings. For simplicity, we represent data from the second experiment as adding unbiased (averages out to 0) and statistically independent measurement error <em>d</em> to the first:</p> <p style="text-align: center;"> <em>y</em><sub>1</sub>=<em>x</em><sub>1</sub>+<em>d</em><sub>1</sub>, <em>y</em><sub>2</sub>=<em>x</em><sub>2</sub>+<em>d</em><sub>2</sub>, ... <em>y</em><sub>n</sub>=<em>x</em><sub>n</sub>+<em>d</em><sub>n</sub>. </p> <p>For us to claim reproducibility we want the differences</p> <p style="text-align: center;"> <em>d</em><sub>1</sub>=<em>y</em><sub>1</sub>-<em>x</em><sub>1</sub>, <em>d</em><sub>2</sub>=<em>y</em><sub>2</sub>-<em>x</em><sub>2</sub>,<em>... </em>,<em>d</em><sub>n</sub>=<em>y</em><sub>n</sub>-<em>x</em><sub>n</sub> </p> <p>to be “small”. To give this some context, imagine the <em>x</em> and <em>y</em> are log scale (base 2) gene expression measurements which implies the <em>d</em> represent log fold changes. If these differences have a standard deviation of 1, it implies that fold changes of 2 are typical between replicates. If our replication experiment produces measurements that are typically twice as big or twice as small as the original, I am not going to claim the measurements are reproduced. However, as it turns out, such terrible reproducibility can still result in correlations higher than 0.92.</p> <p>To someone basing their definition of correlation on the current common language usage this may seem surprising, but to someone basing it on math, it is not. To see this, note that the mathematical definition of correlation tells us that because <em>d</em> and <em>x</em> are independent:</p> <p><a href="http://simplystatistics.org/wp-content/uploads/2015/08/pearsonformula.png"><img class=" aligncenter" src="http://simplystatistics.org/wp-content/uploads/2015/08/pearsonformula-300x55.png" alt="pearsonformula" width="300" height="55" /></a></p> <p>This tells us that correlation summarizes the variability of <em>d</em> relative to the variability of <em>x</em>. Because of the wide range of gene expression values we observe in practice, the standard deviation of <em>x</em> can easily be as large as 3 (variance is 9). This implies we expect to see correlations as high as 1/sqrt(1+1/9) = 0.95, despite the lack of reproducibility when comparing <em>x</em> to <em>y</em>.</p> <p>Note that using Spearman correlation does not fix this problem. A Spearman correlation of 1 tells us that the ranks of <em>x</em> and <em>y</em> are preserved, yet doest not summarize the actual differences. The problem comes down to the fact that we care about the variability of <em>d</em> and correlation, Pearson or Spearman, does not provide an optimal summary. While correlation relates to the preservation of ranks, a much more appropriate summary of reproducibly is the distance between <em>x</em> and <em>y</em> which is related to the standard deviation of the differences <em>d</em>. A very simple R command you can use to generate this summary statistic is:</p> <pre>sqrt(mean(d^2))</pre> <p>or the robust version:</p> <pre>median(abs(d)) ##multiply by 1.4826 for unbiased estimate of true sd </pre> <p>The equivalent suggestion for plots it to make an <a href="https://en.wikipedia.org/wiki/MA_plot">MA-plot</a> instead of a scatterplot.</p> <p>But aren’t correlations and distances directly related? Sort of, and this actually brings up another problem. If the <em>x</em> and <em>y</em> are standardized to have average 0 and standard deviation 1 then, yes, correlation and distance are directly related:</p> <p><a href="http://simplystatistics.org/wp-content/uploads/2015/08/distcorr.png"><img class=" size-medium wp-image-4202 aligncenter" src="http://simplystatistics.org/wp-content/uploads/2015/08/distcorr-300x51.png" alt="distcorr" width="300" height="51" srcset="http://simplystatistics.org/wp-content/uploads/2015/08/distcorr-300x51.png 300w, http://simplystatistics.org/wp-content/uploads/2015/08/distcorr-260x44.png 260w, http://simplystatistics.org/wp-content/uploads/2015/08/distcorr.png 878w" sizes="(max-width: 300px) 100vw, 300px" /></a></p> <p>However, if instead <em>x</em> and <em>y</em> have different average values, which would put into question reproducibility, then distance is sensitive to this problem while correlation is not. If the standard devtiation is 1, the formula is:</p> <p> </p> <p><a href="http://simplystatistics.org/wp-content/uploads/2015/08/distcor2.png"><img class=" size-medium wp-image-4204 aligncenter" src="http://simplystatistics.org/wp-content/uploads/2015/08/distcor2-300x27.png" alt="distcor2" width="300" height="27" srcset="http://simplystatistics.org/wp-content/uploads/2015/08/distcor2-300x27.png 300w, http://simplystatistics.org/wp-content/uploads/2015/08/distcor2-1024x94.png 1024w" sizes="(max-width: 300px) 100vw, 300px" /></a></p> <p>Once we consider units (standard deviations different from 1) then the relationship becomes even more complicated. Two advantages of distance you should be aware of are:</p> <ol> <li>it is in the same units as the data, while correlations have no units making it hard to interpret and select thresholds, and</li> <li>distance accounts for bias (differences in average), while correlation does not.</li> </ol> <p>A final important point relates to the use of correlation with data that is not approximately normal. The useful interpretation of correlation as a summary statistic stems from the bivariate normal approximation: for every standard unit increase in the first variable, the second variable increased <em>r</em> standard units, with <em>r</em> the correlation. A  summary of this is <a href="http://genomicsclass.github.io/book/pages/exploratory_data_analysis_2.html">here</a>. However, when data is not normal this interpretation no longer holds. Furthermore, heavy tail distributions, which are common in genomics, can lead to instability. Here is an example of uncorrelated data with a single pointed added that leads to correlations close to 1. This is quite common with RNAseq data.</p> <p><a href="http://simplystatistics.org/wp-content/uploads/2015/08/supp_figure_2.png"><img class=" size-medium wp-image-4208 aligncenter" src="http://simplystatistics.org/wp-content/uploads/2015/08/supp_figure_2-300x300.png" alt="supp_figure_2" width="300" height="300" srcset="http://simplystatistics.org/wp-content/uploads/2015/08/supp_figure_2-300x300.png 300w, http://simplystatistics.org/wp-content/uploads/2015/08/supp_figure_2-1024x1024.png 1024w, http://simplystatistics.org/wp-content/uploads/2015/08/supp_figure_2-200x200.png 200w" sizes="(max-width: 300px) 100vw, 300px" /></a></p> <p> </p> rafalib package now on CRAN 2015-08-10T10:00:26+00:00 http://simplystats.github.io/2015/08/10/rafalib-package-now-on-cran <p>For the last several years I have been <a href="https://github.com/ririzarr/rafalib">collecting functions</a> I routinely use during exploratory data analysis in a private R package. <a href="http://mike-love.net/">Mike Love</a> and I used some of these in our HarvardX course and now, due to popular demand, I have created man pages and added the <a href="https://cran.r-project.org/web/packages/rafalib/">rafalib</a> package to CRAN. Mike has made several improvements and added some functions of his own. Here is quick descriptions of the rafalib functions I most use:</p> <p>mypar - Before making a plot in R I almost always type <tt>mypar()</tt>. This basically gets around the suboptimal defaults of <tt>par</tt>. For example, it makes the margins (<tt>mar</tt>, <tt>mpg</tt>) smaller and defines RColorBrewer colors as defaults.  It is optimized for the RStudio window. Another advantage is that you can type <tt>mypar(3,2)</tt> instead of <tt>par(mfrow=c(3,2))</tt>. <tt>bigpar()</tt> is optimized for R presentations or PowerPoint slides.</p> <p>as.fumeric - This function turns characters into factors and then into numerics. This is useful, for example, if you want to plot values <tt>x,y</tt> with colors defined by their corresponding categories saved in a character vector <tt>labs</tt><tt>plot(x,y,col=as.fumeric(labs))</tt>.</p> <p>shist (smooth histogram, pronounced <em>shitz</em>) - I wrote this function because I have a hard time interpreting the y-axis of <tt>density</tt>. The height of the curve drawn by <tt>shist</tt> can be interpreted as the height of a histogram if you used the units shown on the plot. Also, it automatically draws a smooth histogram for each entry in a matrix on the same plot.</p> <p>splot (subset plot) - The datasets I work with are typically large enough that</p> <p><tt>plot(x,y)</tt> involves millions of points, which is <a href="http://stackoverflow.com/questions/7714677/r-scatterplot-with-too-many-points">a problem</a>. Several solution are available to avoid over plotting, such as alpha-blending, hexbinning and 2d kernel smoothing. For reasons I won’t explain here, I generally prefer subsampling over these solutions. <tt>splot</tt> automatically subsamples. You can also specify an index that defines the subset.</p> <p>sboxplot (smart boxplot) - This function draws points, boxplots or outlier-less boxplots depending on sample size. Coming soon is the kaboxplot (Karl Broman box-plots) for when you have too many boxplots.</p> <p>install_bioc - For Bioconductor users, this function simply does the <tt>source(“http://www.bioconductor.org/biocLite.R”)</tt> for you and then uses <tt>BiocLite</tt> to install.</p> <p><a href="http://simplystatistics.org/wp-content/uploads/2015/08/unnamed1.png"><img class="alignnone size-large wp-image-4190" src="http://simplystatistics.org/wp-content/uploads/2015/08/unnamed1-1024x773.png" alt="unnamed" width="990" height="747" srcset="http://simplystatistics.org/wp-content/uploads/2015/08/unnamed1-300x226.png 300w, http://simplystatistics.org/wp-content/uploads/2015/08/unnamed1-1024x773.png 1024w, http://simplystatistics.org/wp-content/uploads/2015/08/unnamed1-260x196.png 260w, http://simplystatistics.org/wp-content/uploads/2015/08/unnamed1.png 1035w" sizes="(max-width: 990px) 100vw, 990px" /></a></p> Interested in analyzing images of brains? Get started with open access data. 2015-08-09T21:29:17+00:00 http://simplystats.github.io/2015/08/09/interested-in-analyzing-images-of-brains-get-started-with-open-access-data <div> <i>Editor's note: This is a guest post by <a href="http://www.anieloyan.com/" target="_blank"><span class="lG">Ani</span> Eloyan</a>. She is an Assistant Professor of Biostatistics at Brown University. Dr. Eloyan’s work focuses on</i> <i>semi-parametric likelihood based methods for matrix decompositions, statistical analyses of brain images, and the integration of various types of complex data structures for analyzing health care data</i><i>. She received her PhD in statistics from North Carolina State University and subsequently completed a postdoctoral fellowship in the <a href="http://www.biostat.jhsph.edu/">Department of Biostatistics at Johns Hopkins University</a>. Dr. Eloyan and her team won the <a>ADHD200 Competition</a></i> <i>discussed in <a href="http://journal.frontiersin.org/article/10.3389/fnsys.2012.00061/abstract" target="_blank">this</a> article. She tweets <a href="https://twitter.com/eloyan_ani">@eloyan_ani</a>.</i> </div> <div> <i> </i> </div> <div> <div> Neuroscience is one of the exciting new fields for biostatisticians interested in real world applications where they can contribute novel statistical approaches. Most research in brain imaging has historically included studies run for small numbers of patients. While justified by the costs of data collection, the claims based on analyzing data for such small numbers of subjects often do not hold for our populations of interest. As discussed in <a href="http://www.huffingtonpost.com/american-statistical-association/wanted-neuroquants_b_3749363.html" target="_blank">this</a> article, there is a huge demand for biostatisticians in the field of quantitative neuroscience; so called neuroquants or neurostatisticians. However, while more statisticians are interested in the field, we are far from competing with other substantive domains. For instance, a quick search of abstract keywords in the online program of the upcoming <a href="https://www.amstat.org/meetings/jsm/2015/" target="_blank">JSM2015</a> conference of “brain imaging” and “neuroscience” results in 15 records, while a search of the words “genomics” and “genetics” generates 76 <a>records</a>. </div> <div> </div> <div> Assuming you are trained in statistics and an aspiring neuroquant, how would you go about working with brain imaging data? As a graduate student in the <a href="http://www.stat.ncsu.edu/" target="_blank">Department of Statistics at NCSU</a> several years ago, I was very interested in working on statistical methods that would be directly applicable to solve problems in neuroscience. But I had this same question: “Where do I find the data?” I soon learned that to <i>really</i>approach substantial relevant problems I also needed to learn about the subject matter underlying these complex data structures. </div> <div> </div> <div> In recent years, several leading groups have uploaded their lab data with the common goal of fostering the collection of high dimensional brain imaging data to build powerful models that can give generalizable results. <a href="http://www.nitrc.org/" target="_blank">Neuroimaging Informatics Tools and Resources Clearinghouse (NITRC)</a> founded in 2006 is a platform for public data sharing that facilitates streamlining data processing pipelines and compiling high dimensional imaging datasets for crowdsourcing the analyses. It includes data for people with neurological diseases and neurotypical children and adults. If you are interested in Alzheimer’s disease, you can check out <a href="http://adni.loni.usc.edu/" target="_blank">ADNI</a>. <a href="http://fcon_1000.projects.nitrc.org/indi/abide/" target="_blank">ABIDE</a> provides data for people with Autism Spectrum Disorder and neurotypical peers. <a href="http://fcon_1000.projects.nitrc.org/indi/adhd200/" target="_blank">ADHD200</a> was released in 2011 as a part of a competition to motivate building predictive methods for disease diagnoses using functional magnetic resonance imaging (MRI) in addition to demographic information to predict whether a child has attention deficit hyperactivity disorder (ADHD). While the competition ended in 2011, the dataset has been widely utilized afterwards in studies of ADHD.  According to Google Scholar, the <a href="http://www.nature.com/mp/journal/v19/n6/abs/mp201378a.html" target="_blank">paper</a> introducing the ABIDE set has been cited 129 times since 2013 while the <a href="http://journal.frontiersin.org/article/10.3389/fnsys.2012.00062/full" target="_blank">paper</a> discussing the ADHD200 has been cited 51 times since <span style="font-family: Arial;">2012. These are only a few examples from the list of open access datasets that could of utilized by statisticians. </span> </div> <div> </div> <div> Anyone can download these datasets (you may need to register and complete some paperwork in some cases), however, there are several data processing and cleaning steps to perform before the final statistical analyses. These preprocessing steps can be daunting for a statistician new to the field, especially as the tools used for preprocessing may not be available in R. <a href="https://hopstat.wordpress.com/2014/08/27/statisticians-in-neuroimaging-need-to-learn-preprocessing/" target="_blank">This</a> discussion makes the case as to why statisticians need to be involved in every step of preprocessing the data, while <u><a href="https://hopstat.wordpress.com/2014/06/17/fslr-an-r-package-interfacing-with-fsl-for-neuroimaging-analysis/" target="_blank">this R package</a></u> contains new tools linking R to a commonly used platform <a href="http://fsl.fmrib.ox.ac.uk/fsl/fslwiki/" target="_blank">FSL</a>. However, as a newcomer, it can be easier to start with data that are already processed. <a href="http://projecteuclid.org/euclid.ss/1242049389" target="_blank">This</a> excellent overview by Dr. Martin Lindquist provides an introduction to the different types of analyses for brain imaging data from a statisticians point of view, while our<a href="http://journals.plos.org/plosone/article?id=10.1371/journal.pone.0089470" target="_blank">paper</a> provides tools in R and example datasets for implementing some of these methods. At least one course on Coursera can help you get started with <a href="https://www.coursera.org/course/fmri" target="_blank">functional MRI</a> data. Talking to and reading the papers of biostatisticians working in the field of quantitative neuroscience and scientists in the field of neuroscience is the key. </div> </div> Statistical Theory is our "Write Once, Run Anywhere" 2015-08-09T11:19:53+00:00 http://simplystats.github.io/2015/08/09/statistical-theory-is-our-write-once-run-anywhere <p>Having followed the software industry as a casual bystander, I periodically see the tension flare up between the idea of writing “native apps”, software that is tuned to a particular platform (Windows, Mac, etc.) and more cross-platform apps, which run on many platforms without too much modification. Over the years it has come up in many different forms, but they fundamentals are the same. Back in the day, there was Java, which was supposed to be the platform that ran on any computing device. Sun Microsystems originated the phrase “<a href="https://en.wikipedia.org/wiki/Write_once,_run_anywhere">Write Once, Run Anywhere</a>” to illustrate the cross-platform strengths of Java. More recently, Steve Jobs famously <a href="https://www.apple.com/hotnews/thoughts-on-flash/">banned Flash</a> from any iOS device. Apple is also moving away from standards like OpenGL and towards its own Metal platform.</p> <p>What’s the problem with “write once, run anywhere”, or of cross-platform development more generally, assuming it’s possible? Well, there are a <a href="https://en.wikipedia.org/wiki/Cross-platform#Challenges_to_cross-platform_development">number of issues</a>: often there are performance penalties, it may be difficult to use the native look and feel of a platform, and you may be reduced to using the “lowest common denominator” of feature sets. It seems to me that anytime a new meta-platform comes out that promises to relieve programmers of the burden of having to write for multiple platforms, it eventually gets modified or subsumed by the need to optimize apps for a given platform as much as possible. The need to squeeze as much juice out of an app seems to be too important an opportunity to pass up.</p> <p>In statistics, theory and theorems are our version of “write once, run anywhere”. The basic idea is that theorems provide an abstract layer (a “virtual machine”) that allows us to reason across a large number of specific problems. Think of the <a href="https://en.wikipedia.org/wiki/Central_limit_theorem">central limit theorem</a>, probably our most popular theorem. It could be applied to any problem/situation where you have a notion of sample size that could in principle be increasing.</p> <p>But can it be applied to every situation, or even any situation? This might be more of a philosophical question, given that the CLT is stated asymptotically (maybe we’ll find out the answer eventually). In practice, my experience is that many people attempt to apply it to problems where it likely is not appropriate. Think, large-scale studies with a sample size of 10. Many people will use Normal-based confidence intervals in those situations, but they probably have very poor coverage.</p> <p>Because the CLT doesn’t apply in many situations (small sample, dependent data, etc.), variations of the CLT have been developed, as well as entirely different approaches to achieving the same ends, like confidence intervals, p-values, and standard errors (think bootstrap, jackknife, permutation tests). While the CLT an provide beautiful insight in a large variety of situations, in reality, one must often resort to a custom solution when analyzing a given dataset or problem. This should be a familiar conclusion to anyone who analyzes data. The promise of “write once, run anywhere” is always tantalizing, but the reality never seems to meet that expectation.</p> <p>Ironically, if you look across history and all programming languages, probably the most “cross-platform” language is C, which was originally considered to be too low-level to be broadly useful. C programs run on basically every existing platform and the language has been completely standardized so that compilers can be written to produce well-defined output. The keys to C’s success I think are that it’s a very simple/small language which gives enormous (sometimes dangerous) power to the programmer, and that an enormous toolbox (compiler toolchains, IDEs) has been developed over time to help developers write applications on all platforms.</p> <p>In a sense, we need “compilers” that can help us translate statistical theory for specific data analysis problems. In many cases, I’d imagine the compiler would “fail”, meaning the theory was not applicable to that problem. This would be a Good Thing, because right now we have no way of really enforcing the appropriateness of a theorem for specific problems.</p> <p>More practically (perhaps), we could develop <a href="http://simplystatistics.org/2012/08/27/a-deterministic-statistical-machine/">data analysis pipelines</a> that could be applied to broad classes of data analysis problems. Then a “compiler” could be employed to translate the pipeline so that it worked for a given dataset/problem/toolchain.</p> <p>The key point is to recognize that there is a “translation” process that occurs when we use theory to justify certain data analysis actions, but this translation process is often not well documented or even thought through. Having an explicit “compiler” for this would help us to understand the applicability of certain theorems and may serve to prevent bad data analysis from occurring.</p> Autonomous killing machines won't look like the Terminator...and that is why they are so scary 2015-07-30T11:09:22+00:00 http://simplystats.github.io/2015/07/30/autonomous-killing-machines-wont-look-like-the-terminator-and-that-is-why-they-are-so-scary <p>Just a few days ago many of the most incredible minds in science and technology <a href="http://www.theguardian.com/technology/2015/jul/27/musk-wozniak-hawking-ban-ai-autonomous-weapons">urged governments to avoid using artificial intelligence</a> to create autonomous killing machines. One thing that always happens when such a warning is put into place is you see the inevitable Terminator picture:</p> <p> </p> <p><a href="http://simplystatistics.org/wp-content/uploads/2015/07/terminator.jpeg"><img class="aligncenter wp-image-4160 size-medium" src="http://simplystatistics.org/wp-content/uploads/2015/07/terminator-300x180.jpeg" alt="terminator" width="300" height="180" srcset="http://simplystatistics.org/wp-content/uploads/2015/07/terminator-300x180.jpeg 300w, http://simplystatistics.org/wp-content/uploads/2015/07/terminator-260x156.jpeg 260w, http://simplystatistics.org/wp-content/uploads/2015/07/terminator.jpeg 620w" sizes="(max-width: 300px) 100vw, 300px" /></a></p> <p> </p> <p>The reality is that robots that walk and talk are getting better but still have a ways to go:</p> <p> </p> <p> </p> <p>Does this mean that I think all those really smart people are silly for making this plea about AI now though? No, I think they are probably just in time.</p> <p>The reason is that the first autonomous killing machines will definitely not look anything like the Terminator. They will more likely than not be drones, that are already in widespread use by the military, and will soon be flying over our heads <a href="http://money.cnn.com/2015/07/29/technology/amazon-drones-air-space/">delivering Amazon products</a>.</p> <p> </p> <p><a href="http://simplystatistics.org/wp-content/uploads/2015/07/drone.jpg"><img class="aligncenter size-medium wp-image-4161" src="http://simplystatistics.org/wp-content/uploads/2015/07/drone-300x238.jpg" alt="drone" width="300" height="238" srcset="http://simplystatistics.org/wp-content/uploads/2015/07/drone-300x238.jpg 300w, http://simplystatistics.org/wp-content/uploads/2015/07/drone-1024x814.jpg 1024w, http://simplystatistics.org/wp-content/uploads/2015/07/drone.jpg 1200w" sizes="(max-width: 300px) 100vw, 300px" /></a></p> <p> </p> <p>I also think that when people think about “artificial intelligence” they also think about robots that can mimic the behaviors of a human being, including the ability to talk, hold a conversation, <a href="https://en.wikipedia.org/wiki/Turing_test">or pass the Turing test</a>. But it turns out that the “artificial intelligence” you would need to create an automated killing system is much much simpler than that and is mostly some basic data science. The things you would need are:</p> <ol> <li>A drone with the ability to fly on its own</li> <li>The ability to make decisions about what people to target</li> <li>The ability to find those people and attack them</li> </ol> <p> </p> <p>The first issue, being able to fly on autopilot, is something that has existed for a while. You have probably flown on a plane that has <a href="https://en.wikipedia.org/wiki/Autopilot">used autopilot</a> for at least some of the flight. I won’t get into the details on this one because I think it is the least interesting - it has been around a while and we didn’t get the dire warnings about autonomous agents.</p> <p>The second issue, about deciding which people to target is already in existence as well. We have already seen programs like <a href="https://en.wikipedia.org/wiki/PRISM_(surveillance_program)">PRISM</a> and others that collect individual level metadata and presumably use those to make predictions. We have already seen programs like <a href="https://en.wikipedia.org/wiki/PRISM_(surveillance_program)">PRISM</a> and others that collect individual level metadata and presumably use those to make predictions. While the true and false positive rates are probably messed up by the fact that there are very very few “true positives” these programs are being developed and even relatively simple statistical models can be used to build a predictor - even if those don’t work.</p> <p>The second issue is being able to find people to attack them. This is where the real “artificial intelligence” comes in to play. But it isn’t artificial intelligence like you might think about. It could be just as simple as having the drone fly around and take people’s pictures. Then we could use those pictures to match up with the people identified through metadata and attack them. Facebook has a [Just a few days ago many of the most incredible minds in science and technology <a href="http://www.theguardian.com/technology/2015/jul/27/musk-wozniak-hawking-ban-ai-autonomous-weapons">urged governments to avoid using artificial intelligence</a> to create autonomous killing machines. One thing that always happens when such a warning is put into place is you see the inevitable Terminator picture:</p> <p> </p> <p><a href="http://simplystatistics.org/wp-content/uploads/2015/07/terminator.jpeg"><img class="aligncenter wp-image-4160 size-medium" src="http://simplystatistics.org/wp-content/uploads/2015/07/terminator-300x180.jpeg" alt="terminator" width="300" height="180" srcset="http://simplystatistics.org/wp-content/uploads/2015/07/terminator-300x180.jpeg 300w, http://simplystatistics.org/wp-content/uploads/2015/07/terminator-260x156.jpeg 260w, http://simplystatistics.org/wp-content/uploads/2015/07/terminator.jpeg 620w" sizes="(max-width: 300px) 100vw, 300px" /></a></p> <p> </p> <p>The reality is that robots that walk and talk are getting better but still have a ways to go:</p> <p> </p> <p> </p> <p>Does this mean that I think all those really smart people are silly for making this plea about AI now though? No, I think they are probably just in time.</p> <p>The reason is that the first autonomous killing machines will definitely not look anything like the Terminator. They will more likely than not be drones, that are already in widespread use by the military, and will soon be flying over our heads <a href="http://money.cnn.com/2015/07/29/technology/amazon-drones-air-space/">delivering Amazon products</a>.</p> <p> </p> <p><a href="http://simplystatistics.org/wp-content/uploads/2015/07/drone.jpg"><img class="aligncenter size-medium wp-image-4161" src="http://simplystatistics.org/wp-content/uploads/2015/07/drone-300x238.jpg" alt="drone" width="300" height="238" srcset="http://simplystatistics.org/wp-content/uploads/2015/07/drone-300x238.jpg 300w, http://simplystatistics.org/wp-content/uploads/2015/07/drone-1024x814.jpg 1024w, http://simplystatistics.org/wp-content/uploads/2015/07/drone.jpg 1200w" sizes="(max-width: 300px) 100vw, 300px" /></a></p> <p> </p> <p>I also think that when people think about “artificial intelligence” they also think about robots that can mimic the behaviors of a human being, including the ability to talk, hold a conversation, <a href="https://en.wikipedia.org/wiki/Turing_test">or pass the Turing test</a>. But it turns out that the “artificial intelligence” you would need to create an automated killing system is much much simpler than that and is mostly some basic data science. The things you would need are:</p> <ol> <li>A drone with the ability to fly on its own</li> <li>The ability to make decisions about what people to target</li> <li>The ability to find those people and attack them</li> </ol> <p> </p> <p>The first issue, being able to fly on autopilot, is something that has existed for a while. You have probably flown on a plane that has <a href="https://en.wikipedia.org/wiki/Autopilot">used autopilot</a> for at least some of the flight. I won’t get into the details on this one because I think it is the least interesting - it has been around a while and we didn’t get the dire warnings about autonomous agents.</p> <p>The second issue, about deciding which people to target is already in existence as well. We have already seen programs like <a href="https://en.wikipedia.org/wiki/PRISM_(surveillance_program)">PRISM</a> and others that collect individual level metadata and presumably use those to make predictions. We have already seen programs like <a href="https://en.wikipedia.org/wiki/PRISM_(surveillance_program)">PRISM</a> and others that collect individual level metadata and presumably use those to make predictions. While the true and false positive rates are probably messed up by the fact that there are very very few “true positives” these programs are being developed and even relatively simple statistical models can be used to build a predictor - even if those don’t work.</p> <p>The second issue is being able to find people to attack them. This is where the real “artificial intelligence” comes in to play. But it isn’t artificial intelligence like you might think about. It could be just as simple as having the drone fly around and take people’s pictures. Then we could use those pictures to match up with the people identified through metadata and attack them. Facebook has a](file:///Users/jtleek/Downloads/deepface.pdf) that demonstrates an algorithm that can identify people with near human level accuracy. This approach is based on something called deep neural nets, which sounds very intimidating, but is actually just a set of nested nonlinear <a href="https://en.wikipedia.org/wiki/Deep_learning">logistic regression models</a>. These models have gotten very good because (a) we are getting better at fitting them mathematically and computationally but mostly (b) we have much more data to train them with than we ever did before. The speed that this part of the process is developing is (I think) why there is so much recent concern about potentially negative applications like autonomous killing machines.</p> <p>The scary thing is that these technologies could be combined *right now* to create such a system that was not controlled directly by humans but made automated decisions and flew drones to carry out those decisions. The technology to shrink these type of deep neural net systems to identify people is so good it can even be made simple enough to <a href="http://googleresearch.blogspot.com/2015/07/how-google-translate-squeezes-deep.html">run on a phone f</a>or things like language translation and could easily be embedded in a drone.</p> <p>So I am with Musk, Hawking, and others who would urge caution by governments in developing these systems. Just because we can make it doesn’t mean it will do what we want. Just look at how well Facebook/Amazon/Google make suggestions for “other things you might like” to get an idea about how potentially disastrous automated killing systems could be.</p> <p> </p> Announcing the JHU Data Science Hackathon 2015 2015-07-28T13:31:04+00:00 http://simplystats.github.io/2015/07/28/announcing-the-jhu-data-science-hackathon-2015 <p>We are pleased to announce that the Department of Biostatistics at the Johns Hopkins Bloomberg School of Public Health will be hosting the first ever <a href="https://www.regonline.com/jhudash">JHU Data Science Hackathon</a> (DaSH) on <strong>September 21-23, 2015</strong> at the Baltimore Marriott Waterfront.</p> <p>This event will be an opportunity for data scientists and data scientists-in-training to get together and hack on real-world problems collaboratively and to learn from each other. The DaSH will feature data scientists from government, academia, and industry presenting problems and describing challenges in their respective areas. There will also be a number of networking opportunities where attendees can get to know each other. We think this will be  fun event and we encourage people from all areas, including students (graduate and undergraduate), to attend.</p> <p>To get more details and to sign up for the hackathon, you can go to the <a href="https://www.regonline.com/jhudash">DaSH web site</a>. We will be posting more information as the event nears.</p> <p>Organizers:</p> <ul> <li>Jeff Leek</li> <li>Brian Caffo</li> <li>Roger Peng</li> <li>Leah Jager</li> </ul> <p>Funding:</p> <ul> <li>National Institutes of Health</li> <li>Johns Hopkins University</li> </ul> <p> </p> stringsAsFactors: An unauthorized biography 2015-07-24T11:04:20+00:00 http://simplystats.github.io/2015/07/24/stringsasfactors-an-unauthorized-biography <p>Recently, I was listening in on the conversation of some colleagues who were discussing a bug in their R code. The bug was ultimately traced back to the well-known phenomenon that functions like ‘read.table()’ and ‘read.csv()’ in R convert columns that are detected to be character/strings to be factor variables. This lead to the spontaneous outcry from one colleague of</p> <blockquote> <p>Why does stringsAsFactors not default to FALSE????</p> </blockquote> <p>The argument ‘stringsAsFactors’ is an argument to the ‘data.frame()’ function in R. It is a logical that indicates whether strings in a data frame should be treated as factor variables or as just plain strings. The argument also appears in ‘read.table()’ and related functions because of the role these functions play in reading in table data and converting them to data frames. By default, ‘stringsAsFactors’ is set to TRUE.</p> <p>This argument dates back to May 20, 2006 when it was originally introduced into R as the ‘charToFactor’ argument to ‘data.frame()’. Soon afterwards, on May 24, 2006, it was changed to ‘stringsAsFactors’ to be compatible with S-PLUS by request from Bill Dunlap.</p> <p>Most people I talk to today who use R are completely befuddled by the fact that ‘stringsAsFactors’ is set to TRUE by default. First of all, it should be noted that before the ‘stringsAsFactors’ argument even existed, the behavior of R was to coerce all character strings to be factors in a data frame. If you didn’t want this behavior, you had to manually coerce each column to be character.</p> <p>So here’s the story:</p> <p>In the old days, when R was primarily being used by statisticians and statistical types, this setting strings to be factors made total sense. In most tabular data, if there were a column of the table that was non-numeric, it almost certainly encoded a categorical variable. Think sex (male/female), country (U.S./other), region (east/west), etc. In R, categorical variables are represented by ‘factor’ vectors and so character columns got converted factor.</p> <p>Why do we need factor variables to begin with? Because of modeling functions like ‘lm()’ and ‘glm()’. Modeling functions need to treat expand categorical variables into individual dummy variables, so that a categorical variable with 5 levels will be expanded into 4 different columns in your modeling matrix. There’s no way for R to know it should do this unless it has some extra information in the form of the factor class. From this point of view, setting ‘stringsAsFactors = TRUE’ when reading in tabular data makes total sense. If the data is just going to go into a regression model, then R is doing the right thing.</p> <p>There’s also a more obscure reason. Factor variables are encoded as integers in their underlying representation. So a variable like “disease” and “non-disease” will be encoded as 1 and 2 in the underlying representation. Roughly speaking, since integers only require 4 bytes on most systems, the conversion from string to integer actually saved some space for long strings. All that had to be stored was the integer levels and the labels. That way you didn’t have to repeat the strings “disease” and “non-disease” for as many observations that you had, which would have been wasteful.</p> <p>Around June of 2007, R introduced hashing of CHARSXP elements in the underlying C code thanks to Seth Falcon. What this meant was that effectively, character strings were hashed to an integer representation and stored in a global table in R. Anytime a given string was needed in R, it could be referenced by its underlying integer. This effectively put in place, globally, the factor encoding behavior of strings from before. Once this was implemented, there was little to be gained from an efficiency standpoint by encoding character variables as factor. Of course, you still needed to use ‘factors’ for the modeling functions.</p> <p>The difference nowadays is that R is being used a by a very wide variety of people doing all kinds of things the creators of R never envisioned. This is, of course, wonderful, but it introduces lots of use cases that were not originally planned for. I find that most often, the people complaining about ‘stringsAsFactors’ not being FALSE are people who are doing things that are not the traditional statistical modeling things (things that old-time statisticians like me used to do). In fact, I would argue that if you’re upset about ‘stringsAsFactors = TRUE’, then it’s a pretty good indicator that you’re either not a statistician by training, or you’re doing non-traditional statistical things.</p> <p>For example, in genomics, you might have the names of the genes in one column of data. It really doesn’t make sense to encode these as factors because they won’t be used in any modeling function. They’re just labels, essentially. And because of CHARSXP hashing, you don’t gain anything from an efficiency standpoint by converting them to factors either.</p> <p>But of course, given the long-standing behavior of R, many people depend on the default conversion of characters to factors when reading in tabular data. Changing this default would likely result in an equal number of people complaining about ‘stringsAsFactors’.</p> <p>I fully expect that this blog post will now make all R users happy. If you think I’ve missed something from this unauthorized biography, please let me know on Twitter (@rdpeng).</p> The statistics department Moneyball opportunity 2015-07-17T09:21:16+00:00 http://simplystats.github.io/2015/07/17/the-statistics-department-moneyball-opportunity <p><a href="https://en.wikipedia.org/wiki/Moneyball"></a> is a book and a movie about Billy Bean. It makes statisticians look awesome and I loved the movie. I loved it so much I’m putting the movie trailer right here:</p> <p>The basic idea behind Moneyball was that the Oakland Athletics were able to build a very successful baseball team on a tight budget by valuing skills that many other teams undervalued. In baseball those skills were things like on-base percentage and slugging percentage. By correctly valuing these skills and their impact on a teams winning percentage, the A’s were able to build one of the most successful regular season teams on a minimal budget. This graph shows what an outlier they were, from a nice <a href="http://fivethirtyeight.com/features/billion-dollar-billy-beane/">fivethirtyeight analysis</a>.</p> <p> </p> <p><a href="http://simplystatistics.org/wp-content/uploads/2015/07/oakland.png"><img class="aligncenter wp-image-4146" src="http://simplystatistics.org/wp-content/uploads/2015/07/oakland-1024x818.png" alt="oakland" width="500" height="400" srcset="http://simplystatistics.org/wp-content/uploads/2015/07/oakland-1024x818.png 1024w, http://simplystatistics.org/wp-content/uploads/2015/07/oakland-250x200.png 250w, http://simplystatistics.org/wp-content/uploads/2015/07/oakland.png 1150w" sizes="(max-width: 500px) 100vw, 500px" /></a></p> <p> </p> <p>I think that the data science/data analysis revolution that we have seen over the last decade has created a similar moneyball opportunity for statistics and biostatistics departments. Traditionally in these departments the highest value activities have been publishing a select number of important statistics journals (JASA, JRSS-B, Annals of Statistics, Biometrika, Biometrics and more recently journals like Biostatistics and Annals of Applied Statistics). But there are some hugely valuable ways to contribute to statistics/data science that don’t necessarily end with papers in those journals like:</p> <ol> <li>Creating good, well-documented, and widely used software</li> <li>Being primarily an excellent collaborator who brings in grant money and is a major contributor to science through statistics</li> <li>Publishing in top scientific journals rather than statistics journals</li> <li>Being a good scientific communicator who can attract talent</li> <li>Being a statistics educator who can build programs</li> </ol> <p>Another thing that is undervalued is not having a Ph.D. in statistics or biostatistics. The fact that these skills are undervalued right now means that up and coming departments could identify and recruit talented people that might be missed by other departments and have a huge impact on the world. One tricky thing is that the rankings of department are based on the votes of people from other departments who may or may not value these same skills. Another tricky thing is that many industry data science positions put incredibly high value on these skills and so you might end up competing with them for people - a competition that will definitely drive up the market value of these data scientist/statisticians. But for the folks that want to stay in academia, now is a prime opportunity.</p> The Mozilla Fellowship for Science 2015-07-10T11:10:26+00:00 http://simplystats.github.io/2015/07/10/the-mozilla-fellowship-for-science <p>This looks like an <a href="https://www.mozillascience.org/fellows">interesting opportunity</a> for grad students, postdocs, and early career researchers:</p> <blockquote> <p>We’re looking for researchers with a passion for open source and data sharing, already working to shift research practice to be more collaborative, iterative and open. Fellows will spend 10 months starting September 2015 as community catalysts at their institutions, mentoring the next generation of open data practitioners and researchers and building lasting change in the global open science community.</p> <p>Throughout their fellowship year, chosen fellows will receive training and support from Mozilla to hone their skills around open source and data sharing. They will also craft code, curriculum and other learning resources that help their local communities learn open data practices, and teach forward to their peers.</p> </blockquote> <p>Here’s what you get:</p> <blockquote> <p>Fellows will receive:</p> <ul> <li>A stipend of $60,000 USD, paid in 10 monthly installments.</li> <li>One-time health insurance supplement for Fellows and their families, ranging from$3,500 for single Fellows to $7,000 for a couple with two or more children.</li> <li>One-time childcare allotment for families with children of up to$6,000.</li> <li>Allowance of up to $3,000 towards the purchase of laptop computer, digital cameras, recorders and computer software; fees for continuing studies or other courses, research fees or payments, to the extent related to the fellowship.</li> <li>All approved fellowship trips – domestic and international – are covered in full.</li> </ul> </blockquote> <p>Deadline is August 14.</p> JHU, UMD researchers are getting a really big Big Data center 2015-07-08T16:26:45+00:00 http://simplystats.github.io/2015/07/08/jhu-umd-researchers-are-getting-a-really-big-big-data-center <p>From <a href="http://technical.ly/baltimore/2015/07/07/jhu-umd-big-data-maryland-advanced-research-computing-center-marcc/">Technical.ly Baltimore</a>:</p> <blockquote> <p>A nondescript, 3,700-square-foot building on Johns Hopkins’ Bayview campus will house a new data storage and computing center for university researchers. The$30 million Maryland Advanced Research Computing Center (MARCC) will be available to faculty from JHU and the University of Maryland, College Park.</p> </blockquote> <p>The web site has a pretty cool time-lapse video of the construction of the computing center. There’s also a bit more detail at the <a href="http://hub.jhu.edu/2015/07/06/computing-center-bayview">JHU Hub</a> site.</p> The Massive Future of Statistics Education 2015-07-03T10:17:24+00:00 http://simplystats.github.io/2015/07/03/the-massive-future-of-statistics-education <p><em>NOTE: This post was written as a chapter for the not-yet-released Handbook on Statistics Education. </em></p> <p>Data are eating the world, but our collective ability to analyze data is going on a starvation diet.</p> <div id="content"> <p> Everywhere you turn, data are being generated somehow. By the time you read this piece, you’ll probably have collected some data. (For example this piece has 2,072 words). You can’t avoid data—it’s coming from all directions. </p> <p> So what do we do with it? For the most part, nothing. There’s just too much data being spewed about. But for the data that we <em>are</em> interested in, we need to know the appropriate methods for thinking about and analyzing them. And by “we”, I mean pretty much everyone. </p> <p> In the future, everyone will need some data analysis skills. People are constantly confronted with data and the need to make choices and decisions from the raw data they receive. Phones deliver information about traffic, we have ratings about restaurants or books, and even rankings of hospitals. High school students can obtain complex and rich information about the colleges to which they’re applying while admissions committees can get real-time data on applicants’ interest in the college. </p> <p> Many people already have heuristic algorithms to deal with the data influx—and these algorithms may serve them well—but real statistical thinking will be needed for situations beyond choosing which restaurant to try for dinner tonight. </p> <p> <strong>Limited Capacity</strong> </p> <p> The McKinsey Global Institute, in a <a href="http://www.mckinsey.com/insights/americas/us_game_changers">highly cited report</a>, predicted that there would be a shortage of “data geeks” and that by 2018 there would be between 140,000 and 190,000 unfilled positions in data science. In addition, there will be an estimated 1.5 million people in managerial positions who will need to be trained to manage data scientists and to understand the output of data analysis. If history is any guide, it’s likely that these positions will get filled by people, regardless of whether they are properly trained. The potential consequences are disastrous as untrained analysts interpret complex big data coming from myriad sources of varying quality. </p> <p> Who will provide the necessary training for all these unfilled positions? The field of statistics’ current system of training people and providing them with master’s degrees and PhDs is woefully inadequate to the task. In 2013, the top 10 largest statistics master’s degree programs in the U.S. graduated a total of <a href="http://community.amstat.org/blogs/steve-pierson/2014/02/09/largest-graduate-programs-in-statistics">730 people</a>. At this rate we will never train the people needed. While statisticians have greatly benefited from the sudden and rapid increase in the amount of data flowing around the world, our capacity for scaling up the needed training for analyzing those data is essentially nonexistent. </p> <p> On top of all this, I believe that the McKinsey report is a gross underestimation of how many people will need to be trained in <em>some</em> data analysis skills in the future. Given how much data is being generated every day, and how critical it is for everyone to be able to intelligently interpret these data, I would argue that it’s necessary for <em>everyone</em> to have some data analysis skills. Needless to say, it’s foolish to suggest that everyone go get a master’s or even bachelor’s degrees in statistics. We need an alternate approach that is both high-quality and scalable to a large population over a short period of time. </p> <p> <strong>Enter the MOOCs</strong> </p> <p> In April of 2014, Jeff Leek, Brian Caffo, and I launched the <a href="https://www.coursera.org/specialization/jhudatascience/1">Johns Hopkins Data Science Specialization</a> on the Coursera platform. This is a sequence of nine courses that intends to provide a “soup-to-nuts” training in data science for people who are highly motivated and have some basic mathematical and computing background. The sequence of the nine courses follow what we believe is the essential “data science process”, which is </p> <ol> <li> Formulating a question that can be answered with data </li> <li> Assembling, cleaning, tidying data relevant to a question </li> <li> Exploring data, checking, eliminating hypotheses </li> <li> Developing a statistical model </li> <li> Making statistical inference </li> <li> Communicating findings </li> <li> Making the work reproducible </li> </ol> <p> We took these basic steps and designed courses around each one of them. </p> <p> Each course is provided in a massive open online format, which means that many thousands of people typically enroll in each course every time it is offered. The learners in the courses do homework assignments, take quizzes, and peer assess the work of others in the class. All grading and assessment is handled automatically so that the process can scale to arbitrarily large enrollments. As an example, the April 2015 session of the R Programming course had nearly 45,000 learners enrolled. Each class is exactly 4 weeks long and every class runs every month. </p> <p> We developed this sequence of courses in part to address the growing demand for data science training and education across the globe. Our background as biostatisticians was very closely aligned with the training needs of people interested in data science because, essentially, data science is <em>what we do every single day</em>. Indeed, one curriculum rule that we had was that we couldn’t include something if we didn’t in fact use it in our own work. </p> <p> The sequence has a substantial amount of standard statistics content, such as probability and inference, linear models, and machine learning. It also has non-standard content, such as git, GitHub, R programming, Shiny, and Markdown. Together, the sequence covers the full spectrum of tools that we believe will be needed by the practicing data scientist. </p> <p> For those who complete the nine courses, there is a capstone project at the end, that involves taking all of the skills in the course and developing a data product. For our first capstone project we partnered with <a href="http://swiftkey.com/en/">SwiftKey</a>, a predictive text analytics company, to develop a project where learners had to build a statistical model for predicting words in a sentence. This project involves taking unstructured, messy data, processing it into an analyzable form, developing a statistical model while making tradeoffs for efficiency and accuracy, and creating a Shiny app to show off their model to the public. </p> <p> <strong>Degree Alternatives</strong> </p> <p> The Data Science Specialization is not a formal degree program offered by Johns Hopkins University—learners who complete the sequence do not get any Johns Hopkins University credit—and so one might wonder what the learners get out of the program (besides, of course, the knowledge itself). To begin with, the sequence is completely portfolio based, so learners complete projects that are immediately viewable by others. This allows others to evaluate a learner’s ability on the spot with real code or data analysis. </p> <p> All of the lecture content is openly available and hosted on GitHub, so outsiders can view the content and see for themselves what is being taught. This give outsiders an opportunity to evaluate the program directly rather than have to rely on the sterling reputation of the institution teaching the courses. </p> <p> Each learner who completes a course using Coursera’s “Signature Track” (which currently costs 49 per course) can get a badge on their LinkedIn profile, which shows that they completed the course. This can often be as valuable as a degree or other certification as recruiters scouring LinkedIn for data scientist positions will be able to see our completers’ certifications in various data science courses. </p> <p> Finally, the scale and reach of our specialization immediately creates a large alumni social network that learners can take advantage of. As of March 2015, there were approximately 700,000 people who had taken at least one course in the specialization. These 700,000 people have a shared experience that, while not quite at the level of a college education, still is useful for forging connections between people, especially when people are searching around for jobs. </p> <p> <strong>Early Numbers</strong> </p> <p> So far, the sequence has been wildly successful. It averaged 182,507 enrollees a month for the first year in existence. The overall course completion rate was about 6% and the completion rate amongst those in the “Signature Track” (i.e. paid enrollees) was 67%. In October of 2014, barely 7 months since the start of the specialization, we had 663 learners enroll in the capstone project. </p> <p> <strong>Some Early Lessons</strong> </p> <p> From running the Data Science Specialization for over a year now, we have learned a number of lessons, some of which were unexpected. Here, I summarize the highlights of what we’ve learned. </p> <p> <strong>Data Science as Art and Science. </strong>Ironically, although the word “Science” appears in the name “Data Science”, there’s actually quite a bit about the practice of data science that doesn’t really resemble science at all. Much of what statisticians do in the act of data analysis is intuitive and ad hoc, with each data analysis being viewed as a unique flower. </p> <p> When attempting to design data analysis assignments that could be graded at scale with tens of thousands of people, we discovered that designing the rubrics for grading these assignments was not trivial. The reason is because our understanding of what makes a “good” analysis different from a bad one is not well-articulated. We could not identify any community-wide understanding of what are the components of a good analysis. What are the “correct” methods to use in a given data analysis situation? What is definitely the “wrong” approach? </p> <p> Although each one of us had been doing data analysis for the better part of a decade, none of us could succinctly write down what the process was and how to recognize when it was being done wrong. To paraphrase Daryl Pregibon from his <a href="http://www.nap.edu/catalog/1910/the-future-of-statistical-software-proceedings-of-a-forum">1991 talk at the National Academies of Science</a>, we had a process that we regularly espoused but barely understood. </p> <p> <strong>Content vs. Curation</strong>.<strong> </strong>Much of the content that we put online is available elsewhere. With YouTube, you can find high-quality videos on almost any topic, and our videos are not really that much better. Furthermore, the subject matter that we were teaching was in now way proprietary. The linear models that we teach are the same linear models taught everywhere else. So what exactly was the value we were providing? </p> <p> Searching on YouTube requires that you know what you are looking for. This is a problem for people who are just getting into an area. Effectively, what we provided was a <em>curation</em> of all the knowledge that’s out there on the topic of data science (we also added our own quirky spin). Curation is hard, because you need to make definitive choices between what is and is not a core element of a field. But curation is essential for learning a field for the uninitiated. </p> <p> <strong>Skill sets vs. Certification</strong>. Because we knew that we were not developing a true degree program, we knew we had to develop the program in a manner so that the learners could quickly see for themselves the value they were getting out of it. This lead us to taking a portfolio approach where learners produced things that could be viewed publicly. </p> <p> In part because of the self-selection of the population seeking to learn data science skills, our learners were more interested in being able to demonstrate the skills taught in the course rather than an abstract (but official) certification as might be gotten in a degree program. This is not unlike going to a music conservatory, where the output is your ability to play an instrument rather than the piece of paper you receive upon graduation. We feel that giving people the ability to demonstrate skills and skill sets is perhaps more important than official degrees in some instances because it gives employers a concrete sense of what a person is capable of doing. </p> <p> <strong>Conclusions</strong> </p> <p> As of April 2015, we had a total of 1,158 learners complete the entire specialization, including the capstone project. Given these numbers and our rate of completion for the specialization as a whole, we believe we are on our way to achieving our goal of creating a highly scalable program for training people in data science skills. Of course, this program alone will not be sufficient for all of the data science training needs of society. But we believe that the approach that we’ve taken, using non-standard MOOC channels, focusing on skill sets instead of certification, and emphasizing our role in curation, is a rich opportunity for the field of statistics to explore in order to educate the masses about our important work. </p> </div> Looks like this R thing might be for real 2015-07-02T10:01:45+00:00 http://simplystats.github.io/2015/07/02/looks-like-this-r-thing-might-be-for-real <p>Not sure how I missed this, but the Linux Foundation just announced the <a href="http://www.linuxfoundation.org/news-media/announcements/2015/06/linux-foundation-announces-r-consortium-support-millions-users">R Consortium</a> for supporting the “world’s most popular language for analytics and data science and support the rapid growth of the R user community”. From the Linux Foundation:</p> <blockquote> <p>The R language is used by statisticians, analysts and data scientists to unlock value from data. It is a free and open source programming language for statistical computing and provides an interactive environment for data analysis, modeling and visualization. The R Consortium will complement the work of the R Foundation, a nonprofit organization based in Austria that maintains the language. The R Consortium will focus on user outreach and other projects designed to assist the R user and developer communities.</p> <p>Founding companies and organizations of the R Consortium include The R Foundation, Platinum members Microsoft and RStudio; Gold member TIBCO Software Inc.; and Silver members Alteryx, Google, HP, Mango Solutions, Ketchum Trading and Oracle.</p> </blockquote> How Airbnb built a data science team 2015-07-01T08:39:29+00:00 http://simplystats.github.io/2015/07/01/how-airbnb-built-a-data-science-team <p>From <a href="http://venturebeat.com/2015/06/30/how-we-scaled-data-science-to-all-sides-of-airbnb-over-5-years-of-hypergrowth/">Venturebeat</a>:</p> <blockquote> <p>Back then we knew so little about the business that any insight was groundbreaking; data infrastructure was fast, stable, and real-time (I was querying our production MySQL database); the company was so small that everyone was in the loop about every decision; and the data team (me) was aligned around a singular set of metrics and methodologies.</p> <p>But five years and 43,000 percent growth later, things have gotten a bit more complicated. I’m happy to say that we’re also more sophisticated in the way we leverage data, and there’s now a lot more of it. The trick has been to manage scale in a way that brings together the magic of those early days with the growing needs of the present — a challenge that I know we aren’t alone in facing.</p> </blockquote> How public relations and the media are distorting science 2015-06-24T10:07:45+00:00 http://simplystats.github.io/2015/06/24/how-public-relations-and-the-media-are-distorting-science <p>Throughout history, engineers, medical doctors and other applied scientists have helped convert basic science discoveries into products, public goods and policy that have greatly improved our quality of life. With rare exceptions, it has taken years if not decades to establish these discoveries. And even the exceptions stand on the shoulders of incremental contributions. The researchers that produce this knowledge go through a slow and painstaking process to reach these achievements.</p> <p>In contrast, most science related media reports that grab the public’s attention fall into three categories:</p> <ol> <li>The <em>exaggerated big discovery</em>: Recent examples include the discovery of <a href="http://www.cbsnews.com/news/dangerous-pathogens-and-mystery-microbes-ride-the-subway/">the bubonic plague in the NYC subway</a>, <a href="http://www.bbc.com/news/science-environment-32287609">liquid water in mars</a>, and <a href="http://www.nytimes.com/2015/05/24/opinion/sunday/infidelity-lurks-in-your-genes.html?ref=opinion&amp;_r=3">the infidelity gene</a>.</li> <li><em>Over-promising</em>: These try to explain a complicated basic science finding and, in the case of biomedical research, then speculate without much explanation that the finding will ”lead to a deeper understanding of diseases and new ways to treat or cure them”.</li> <li><em>Science is broken</em>: These tend to report an anecdote about an allegedly corrupt scientist, maybe cite the “Why Most Published Research Findings are False” paper, and then extrapolate speculatively.</li> </ol> <p>In my estimation, despite the attention grabbing headlines, the great majority of the subject matter included in these reports will not have an impact on our lives and will not even make it into scientific textbooks. So does science still have anything to offer? Reports of the third category have even scientists particularly worried. I, however, remain optimistic. First, I do not see any empirical evidence showing that the negative effects of the lack of reproducibility are worse now than 50 years ago. Furthermore, although not widely reported in the lay press, I continue to see bodies of work built by several scientists over several years or decades with much promise of leading to tangible improvements to our quality of life. Recent advances that I am excited about include <a href="http://physics.gmu.edu/~pnikolic/articles/Topological%20insulators%20(Physics%20World,%20February%202011).pdf">insulators</a>, <a href="http://www.ncbi.nlm.nih.gov/pubmed/24955707">PD-1 pathway inhibitors</a>, <a href="https://en.wikipedia.org/wiki/CRISPR">clustered regularly interspaced short palindromic repeats</a>, advances in solar energy technology, and prosthetic robotics.</p> <p>However, there is one general aspect of science that I do believe has become worse. Specifically, it’s a shift in how much scientists jockey for media attention, even if it’s short-lived. Instead of striving for having a sustained impact on our field, which may take decades to achieve, an increasing number of scientists seem to be placing more value on appearing in the New York Times, giving a Ted Talk or having a blog or tweet go viral. As a consequence, too many of us end up working on superficial short term challenges that, with the help of a professionally crafted press release, may result in an attention grabbing media report. NB: I fully support science communication efforts, but not when the primary purpose is garnering attention, rather than educating.</p> <p>My concern spills over to funding agencies and philanthropic organizations as well. Consider the following two options. Option 1: be the funding agency representative tasked with organizing a big science project with a well-oiled PR machine. Option 2: be the funding agency representative in charge of several small projects, one of which may with low, but non-negligible, probability result in a Nobel Prize 30 years down the road. In the current environment, I see a preference for option 1.</p> <p>I am also concerned about how this atmosphere may negatively affect societal improvements within science. Publicly shaming transgressors on Twitter or expressing one’s outrage on a blog post can garner many social media clicks. However, these may have a smaller positive impact than mundane activities such as serving on a committee that, after several months of meetings, implements incremental, yet positive, changes. Time and energy spent on trying to increase internet clicks is time and energy we don’t spend on the tedious administrative activities that are needed to actually affect change.</p> <p>Because so many of the scientists that thrive in this atmosphere of short-lived media reports are disproportionately rewarded, I imagine investigators starting their careers feel some pressure to garner some media attention of their own. Furthermore, their view of how they are evaluated may be highly biased because evaluators that ignore media reports and focus more on the specifics of the scientific content, tend to be less visible. So if you want to spend your academic career slowly building a body of work with the hopes of being appreciated decades from now, you should not think that it is hopeless based on what is perhaps, a distorted view of how we are currently being evaluated.</p> <p>Update: changed topological insulators links to <a href="http://scienceblogs.com/principles/2010/07/20/whats-a-topological-insulator/">these</a> <a href="http://physics.gmu.edu/~pnikolic/articles/Topological%20insulators%20(Physics%20World,%20February%202011).pdf">two</a>. <a href="http://spectrum.ieee.org/semiconductors/materials/topological-insulators">Here</a> is one more. Via David S.</p> Interview at Leanpub 2015-06-16T21:49:33+00:00 http://simplystats.github.io/2015/06/16/interview-at-leanpub <p>A few weeks ago I sat down with Len Epp over at Leanpub to talk about my recently published book <em><a href="https://leanpub.com/rprogramming">R Programming for Data Science</a></em>. So far, I’ve only published one book through Leanpub but I’m a huge fan. They’ve developed a system that is, in my opinion, perfect for academic publishing. The book’s written in Markdown and they compile it into PDF, ePub, and mobi formats automatically.</p> <p>The full interview transcript is over at the <a href="http://blog.leanpub.com/2015/06/roger-peng.html">Leanpub blog</a>. If you want to listen to the audio of the interview, you can subscribe to the Leanpub <a href="https://itunes.apple.com/ca/podcast/id517117137?mt=2">podcast on iTunes</a>.</p> <p><a href="https://leanpub.com/rprogramming"><em>R Programming for Data Science</em></a> is available at Leanpub for a suggested price of15 (but you can get it for free if you want). R code files, datasets, and video lectures are available through the various add-on packages. Thanks to all of you who’ve already bought a copy!</p> Johns Hopkins Data Science Specialization Captsone 2 Top Performers 2015-06-10T14:33:09+00:00 http://simplystats.github.io/2015/06/10/johns-hopkins-data-science-specialization-captsone-2-top-performers <p><em>The second capstone session of the <a href="https://www.coursera.org/specialization/jhudatascience/1?utm_medium=listingPage">Johns Hopkins Data Science Specialization</a> concluded recently. This time, we had 1,040 learners sign up to participate in the session, which again featured a project developed in collaboration with the amazingly innovative folks at <a href="http://swiftkey.com/en/">SwiftKey</a>. </em></p> <p><em>We’ve identified the learners listed below as the top performers in this capstone session. This is an incredibly talented group of people who have worked very hard throughout the entire nine-course specialization.  Please take some time to read their stories and look at their work. </em></p> <h1 id="ben-apple">Ben Apple</h1> <p><a href="http://simplystatistics.org/wp-content/uploads/2015/06/Ben_Apple.jpg"><img class="aligncenter size-medium wp-image-4091" src="http://simplystatistics.org/wp-content/uploads/2015/06/Ben_Apple-300x285.jpg" alt="Ben_Apple" width="300" height="285" srcset="http://simplystatistics.org/wp-content/uploads/2015/06/Ben_Apple-300x285.jpg 300w, http://simplystatistics.org/wp-content/uploads/2015/06/Ben_Apple.jpg 360w" sizes="(max-width: 300px) 100vw, 300px" /></a></p> <p>Ben Apple is a Data Scientist and Enterprise Architect with the Department of Defense.  Mr. Apple holds a MS in Information Assurance and is a PhD candidate in Information Sciences.</p> <h4 id="why-did-you-take-the-jhu-data-science-specialization"><strong>**Why did you take the JHU Data Science Specialization?</strong>**</h4> <p>As a self trained data scientist I was looking for a program that would formalize my established skills while expanding my data science knowledge and tool box.</p> <h4 id="what-are-you-most-proud-of-doing-as-part-of-the-jhu-data-science-specialization"><strong>**What are you most proud of doing as part of the JHU Data Science Specialization?</strong>**</h4> <p>The capstone project was the most demanding aspect of the program.  As such I most proud of the finale project.  The project stretched each of us beyond the standard course work of the program and was quite satisfying.</p> <h4 id="how-are-you-planning-on-using-your-data-science-specialization-certificate"><strong>**How are you planning on using your Data Science Specialization Certificate?</strong>**</h4> <p>To open doors so that I may further my research into the operational value of applying data science thought and practice to analytics of my domain.</p> <p><strong>Final Project: </strong><a href="https://bengapple.shinyapps.io/coursera_nlp_capstone">https://bengapple.shinyapps.io/coursera_nlp_capstone</a></p> <p><strong>Project Slide Deck: </strong><a href="http://rpubs.com/bengapple/71376">http://rpubs.com/bengapple/71376</a></p> <p> </p> <h1 id="ivan-corneillet">Ivan Corneillet</h1> <p><a href="http://simplystatistics.org/wp-content/uploads/2015/06/Ivan.Corneillet.jpg"><img class="aligncenter size-medium wp-image-4092" src="http://simplystatistics.org/wp-content/uploads/2015/06/Ivan.Corneillet-300x300.jpg" alt="Ivan.Corneillet" width="300" height="300" srcset="http://simplystatistics.org/wp-content/uploads/2015/06/Ivan.Corneillet-300x300.jpg 300w, http://simplystatistics.org/wp-content/uploads/2015/06/Ivan.Corneillet-200x200.jpg 200w, http://simplystatistics.org/wp-content/uploads/2015/06/Ivan.Corneillet.jpg 400w" sizes="(max-width: 300px) 100vw, 300px" /></a></p> <p>A technologist, thinker, and tinkerer, Ivan facilitates the establishment of start-up companies by advising these companies about the hiring process, product development, and technology development, including big data, cloud computing, and cybersecurity. In his 17-year career, Ivan has held a wide range of engineering and management positions at various Silicon Valley companies. Ivan is a recent Wharton MBA graduate, and he previously earned his master’s degree in computer science from the Ensimag, and his master’s degree in electrical engineering from Université Joseph Fourier, both located in France.</p> <p><strong>**Why did you take the JHU Data Science Specialization?</strong>**</p> <p>There are three reasons why I decided to enroll in the JHU Data Science Specialization. First, fresh from college, my formal education was best suited for scaling up the Internet’s infrastructure. However, because every firm in every industry now creates products and services from analyses of data, I challenged myself to learn about Internet-scale datasets. Second, I am a big supporter of MOOCs. I do not believe that MOOCs should replace traditional education; however, I do believe that MOOCs and traditional education will eventually coexist in the same way that open-source and closed-source software does (read my blog post for more information on this topic: http://ivantur.es/16PHild). Third, the Johns Hopkins University brand certainly motivated me to choose their program. With a great name comes a great curriculum and fantastic professors, right?</p> <p>Once I had completed the program, I was not disappointed at all. I had read a blog post that explained that the JHU Data Science Specialization was only a start to learning about data science. I certainly agree, but I would add that this program is a great start, because the curriculum emphasizes information that is crucial, while providing additional resources to those who wish to deepen their understanding of data science. My thanks to Professors Caffo, Leek, and Peng; the TAs, and Coursera for building and delivering this track!</p> <p><strong>**What are you most proud of doing as part of the JHU Data Science Specialization?</strong>**</p> <p>The capstone project made for a very rich and exhilarating learning experience, and was my favorite course in the specialization. Because I did not have prior knowledge in natural language processing (NLP), I had to conduct a fair amount of research. However, the program’s minimal-guidance approach mimicked a real-world environment, and gave me the opportunity to leverage my experience with developing code and designing products to get the most out of the skillset taught in the track. The result was that I created a data product that implemented state-of-the-art NLP algorithms using what I think are the best technologies (i.e., C++, JavaScript, R, Ruby, and SQL), given the choices that I had made. Bringing everything together is what made me the most proud. Additionally, my product capabilities are a far cry from IBM’s Watson, but while I am well versed in supercomputer hardware, this track helped me to gain a much deeper appreciation of Watson’s AI.</p> <h4 id="how-are-you-planning-on-using-your-data-science-specialization-certificate-1"><strong>**How are you planning on using your Data Science Specialization Certificate?</strong>**</h4> <p>Thanks to the broad skillset that the specialization covered, I feel confident wearing a data science hat. The concepts and tools covered in this program helped me to better understand the concerns that data scientists have and the challenges they face. From a business standpoint, I am also better equipped to identify the opportunities that lie ahead.</p> <p><strong>Final Project: </strong><a href="https://paspeur.shinyapps.io/wordmaster-io/">https://paspeur.shinyapps.io/wordmaster-io/</a></p> <p><strong>Project Slide Deck: </strong><a href="http://rpubs.com/paspeur/wordmaster-io">http://rpubs.com/paspeur/wordmaster-io</a></p> <p>#</p> <h1 id="oscar-de-len">Oscar de León</h1> <p><a href="http://simplystatistics.org/wp-content/uploads/2015/06/Oscar_De_Leon.jpg"><img class="aligncenter size-medium wp-image-4093" src="http://simplystatistics.org/wp-content/uploads/2015/06/Oscar_De_Leon-300x225.jpg" alt="Oscar_De_Leon" width="300" height="225" srcset="http://simplystatistics.org/wp-content/uploads/2015/06/Oscar_De_Leon-120x90.jpg 120w, http://simplystatistics.org/wp-content/uploads/2015/06/Oscar_De_Leon-300x225.jpg 300w, http://simplystatistics.org/wp-content/uploads/2015/06/Oscar_De_Leon-1024x768.jpg 1024w, http://simplystatistics.org/wp-content/uploads/2015/06/Oscar_De_Leon-260x195.jpg 260w" sizes="(max-width: 300px) 100vw, 300px" /></a></p> <p>Oscar is an assistant researcher at a research institute in a developing country, graduated as a licentiate in biochemistry and microbiology in 2010 from the same university which hosts the institute. He has always loved technology, programming and statistics and has engaged in self learning of these subjects from an early age, finally using his abilities in the health-related research in which he has been involved since 2008. He is now working on the design, execution and analysis of various research projects, consulting for other researchers and students, and is looking forward to develop his academic career in biostatistics.</p> <h4 id="why-did-you-take-the-jhu-data-science-specialization-1"><strong>**Why did you take the JHU Data Science Specialization?</strong>**</h4> <p>I wanted to integrate my R experience into a more comprehensive data analysis workflow, which is exactly what this specialization offers. This was in line with the objectives of my position at the research institute in which I work, so I presented a study plan to my supervisor and she approved it. I also wanted to engage in an activity which enabled me to document my abilities in a verifiable way, and a Coursera Specialization seemed like a good option.</p> <p>Additionally, I’ve followed the JHSPH group’s courses since the first offering of Mathematical Biostatistics Bootcamp in November 2012. They have proved the standards and quality of education at their institution, and it was not something to let go by.</p> <h4 id="what-are-you-most-proud-of-doing-as-part-of-the-jhu-data-science-specialization-1"><strong>**What are you most proud of doing as part of the JHU Data Science Specialization?</strong>**</h4> <p>I’m not one to usually interact with other students, and certainly didn’t do it during most of the specialization courses, but I decided to try out the fora on the Capstone project. It was wonderful; sharing ideas with, and receiving criticism form, my peers provided a very complete learning experience. After all, my contributions ended being appreciated by the community and a few posts stating it were very rewarding. This re-kindled my passion for teaching, and I’ll try to engage in it more from now on.</p> <h4 id="how-are-you-planning-on-using-your-data-science-specialization-certificate-2"><strong>**How are you planning on using your Data Science Specialization Certificate?</strong>**</h4> <p>First, I’ll file it with HR at my workplace, since our research projects payed for the specialization <img src="http://simplystatistics.org/wp-includes/images/smilies/simple-smile.png" alt=":)" class="wp-smiley" style="height: 1em; max-height: 1em;" /></p> <p>I plan to use the certificate as a credential for data analysis with R when it is relevant. For example, I’ve been interested in offering an R workshop for life sciences students and researchers at my University, and this certificate (and the projects I prepared during the specialization) could help me show I have a working knowledge on the subject.</p> <p><strong>Final Project: </strong><a href="https://odeleon.shinyapps.io/ngram/">https://odeleon.shinyapps.io/ngram/</a></p> <p><strong>Project Slide Deck: </strong><a href="http://rpubs.com/chemman/n-gram">http://rpubs.com/chemman/n-gram</a></p> <p>#</p> <h1 id="jeff-hedberg">Jeff Hedberg</h1> <p><a href="http://simplystatistics.org/wp-content/uploads/2015/06/Jeff_Hedberg.jpg"><img class="aligncenter size-full wp-image-4094" src="http://simplystatistics.org/wp-content/uploads/2015/06/Jeff_Hedberg.jpg" alt="Jeff_Hedberg" width="200" height="200" /></a></p> <p>I am passionate about turning raw data into actionable insights that solve relevant business problems. I also greatly enjoy leading large, multi-functional projects with impact in areas pertaining to machine and/or sensor data.  I have a Mechanical Engineering Degree and an MBA, in addition to a wide range of Data Science (IT/Coding) skills.</p> <h4 id="why-did-you-take-the-jhu-data-science-specialization-2"><strong>**Why did you take the JHU Data Science Specialization?</strong>**</h4> <p>I was looking to gain additional exposure into Data Science as a current practitioner, and thought this would be a great program.</p> <h4 id="what-are-you-most-proud-of-doing-as-part-of-the-jhu-data-science-specialization-2"><strong>**What are you most proud of doing as part of the JHU Data Science Specialization?</strong>**</h4> <p>I am most proud of completing all courses with distinction (top of peers).  Also, I’m proud to have achieved full points on my Capstone project having no prior experience in Natural Language Processing.</p> <h4 id="how-are-you-planning-on-using-your-data-science-specialization-certificate-3"><strong>**How are you planning on using your Data Science Specialization Certificate?</strong>**</h4> <p>I am going to add this to my Resume and LinkedIn Profile.  I will use it to solidify my credibility as a data science practitioner of value.</p> <p><strong>Final Project: </strong><a href="https://hedbergjeffm.shinyapps.io/Next_Word_Prediction/">https://hedbergjeffm.shinyapps.io/Next_Word_Prediction/</a></p> <p><strong>Project Slide Deck: </strong><a href="https://rpubs.com/jhedbergfd3s/74960">https://rpubs.com/jhedbergfd3s/74960</a></p> <p>#</p> <h1 id="hernn-martnez-foffani">Hernán Martínez-Foffani</h1> <p><a href="http://simplystatistics.org/wp-content/uploads/2015/06/Hernán_Martínez-Foffani.jpg"><img class="aligncenter size-medium wp-image-4095" src="http://simplystatistics.org/wp-content/uploads/2015/06/Hernán_Martínez-Foffani-300x225.jpg" alt="Hernán_Martínez-Foffani" width="300" height="225" srcset="http://simplystatistics.org/wp-content/uploads/2015/06/Hernán_Martínez-Foffani-120x90.jpg 120w, http://simplystatistics.org/wp-content/uploads/2015/06/Hernán_Martínez-Foffani-300x225.jpg 300w, http://simplystatistics.org/wp-content/uploads/2015/06/Hernán_Martínez-Foffani-1024x768.jpg 1024w, http://simplystatistics.org/wp-content/uploads/2015/06/Hernán_Martínez-Foffani-260x195.jpg 260w, http://simplystatistics.org/wp-content/uploads/2015/06/Hernán_Martínez-Foffani.jpg 1256w" sizes="(max-width: 300px) 100vw, 300px" /></a></p> <p>I was born in Argentina but now I’m settled in Spain. I’ve been working in computer technology since the eighties, in digital networks, programming, consulting, project management. Now, as CTO in a software company, I lead a small team of programmers developing a supply chain management app.</p> <h4 id="why-did-you-take-the-jhu-data-science-specialization-3"><strong>**Why did you take the JHU Data Science Specialization?</strong>**</h4> <p>In my opinion the curriculum is carefully designed with a nice balance between theory and practice. The JHU authoring and the teachers’ widely known prestige ensure the content quality. The ability to choose the learning pace, one per month in my case, fits everyone’s schedule.</p> <h4 id="what-are-you-most-proud-of-doing-as-part-of-the-jhu-data-science-specialization-3"><strong>**What are you most proud of doing as part of the JHU Data Science Specialization?</strong>**</h4> <p>The capstone definitely. It resulted in a fresh and interesting challenge. I sweat a lot, learned much more and in the end had a lot of fun.</p> <h4 id="how-are-you-planning-on-using-your-data-science-specialization-certificate-4"><strong>**How are you planning on using your Data Science Specialization Certificate?</strong>**</h4> <p>While for the time being I don’t have any specific plan for the certificate, it’s a beautiful reward for the effort done.</p> <p><strong>Final Project: </strong><a href="https://herchu.shinyapps.io/shinytextpredict/">https://herchu.shinyapps.io/shinytextpredict/</a></p> <p><strong>Project Slide Deck: </strong><a href="http://rpubs.com/herchu1/shinytextprediction">http://rpubs.com/herchu1/shinytextprediction</a></p> <p>#</p> <h1 id="francois-schonken">Francois Schonken</h1> <p> </p> <p><a href="http://simplystatistics.org/wp-content/uploads/2015/06/Francois-Schonken1.jpg"><img class="aligncenter size-medium wp-image-4097" src="http://simplystatistics.org/wp-content/uploads/2015/06/Francois-Schonken1-197x300.jpg" alt="Francois Schonken" width="197" height="300" /></a></p> <p>I’m a 36 year old South African male born and raised. I recently (4 years now) immigrated to lovely Melbourne, Australia. I wrapped up a BSc (Hons) Computer Science with specialization in Computer Systems back in 2001. Next I co-found a small boutique Software Development house operating from South Africa. I wrapped my MBA, from Melbourne Business School, in 2013 and now I consult for my small boutique Software Development house and 2 (very) small internet start-ups.</p> <h4 id="why-did-you-take-the-jhu-data-science-specialization-4"><strong>**Why did you take the JHU Data Science Specialization?</strong>**</h4> <p>One of the core subjects in my MBA was Data Analysis, basically an MBA take on undergrad Statistics with focus on application over theory (not that there was any shortage of theory). Waiting in a lobby room some 6 months later I was paging through the financial section of business focused weekly. I came across an article explaining how a Melbourne local applied a language called R to solve a grammatically and statistically challenging issue. The rest, as they say, is history.</p> <h4 id="what-are-you-most-proud-of-doing-as-part-of-the-jhu-data-science-specialization-4"><strong>**What are you most proud of doing as part of the JHU Data Science Specialization?</strong>**</h4> <p>I’m quite proud of both my Developing Data Products and Capstone projects, but for me these tangible outputs merely served as a vehicle to better understand a different way of thinking about data. I’ve spend most of my Software Development life dealing with one form or the other form of RDBS (Relational Database Management System). This, in my experience, leads to a very set oriented way of thinking about data.</p> <p>I’m most proud of developing a new tool in my “Skills Toolbox” that I consider highly complementary to both my Software and Business outlook on projects.</p> <h4 id="how-are-you-planning-on-using-your-data-science-specialization-certificate-5"><strong>**How are you planning on using your Data Science Specialization Certificate?</strong>**</h4> <p>Honestly, I had not planned on using my Certificate in and of itself. The skills I’ve acquired has already helped shape my thinking in designing an in-house web based consulting collaboration platform.</p> <p>I do not foresee this being the last time I’ll be applying Data Science thinking moving forward on my journey.</p> <p><strong>Final Project: </strong><a href="https://schonken.shinyapps.io/WordPredictor">https://schonken.shinyapps.io/WordPredictor</a></p> <p><strong>Project Slide Deck: </strong><a href="http://rpubs.com/schonken/sentence-builder">http://rpubs.com/schonken/sentence-builder</a></p> <p>#</p> <h1 id="david-j-tagler">David J. Tagler</h1> <p>David is passionate about solving the world’s most important and challenging problems. His expertise spans chemical/biomedical engineering, regenerative medicine, healthcare technology management, information technology/security, and data science/analysis. David earned his Ph.D. in Chemical Engineering from Northwestern University and B.S. in Chemical Engineering from the University of Notre Dame.</p> <h4 id="why-did-you-take-the-jhu-data-science-specialization-5"><strong>**Why did you take the JHU Data Science Specialization?</strong>**</h4> <p>I enrolled in this specialization in order to advance my statistics, programming, and data analysis skills. I was interested in taking a series of courses that covered the entire data science pipeline. I believe that these skills will be critical for success in the future.</p> <h4 id="what-are-you-most-proud-of-doing-as-part-of-the-jhu-data-science-specialization-5"><strong>**What are you most proud of doing as part of the JHU Data Science Specialization?</strong>**</h4> <p>I am most proud of the R programming and modeling skills that I developed throughout this specialization. Previously, I had no experience with R. Now, I can effectively manage complex data sets, perform statistical analyses, build prediction models, create publication-quality figures, and deploy web applications.</p> <h4 id="how-are-you-planning-on-using-your-data-science-specialization-certificate-6"><strong>**How are you planning on using your Data Science Specialization Certificate?</strong>**</h4> <p>I look forward to utilizing these skills in future research projects. Furthermore, I plan to take additional courses in data science, machine learning, and bioinformatics.</p> <p><strong>Final Project: </strong><a href="http://dt444.shinyapps.io/next-word-predict">http://dt444.shinyapps.io/next-word-predict</a></p> <p><strong>Project Slide Deck: </strong><a href="http://rpubs.com/dt444/next-word-predict">http://rpubs.com/dt444/next-word-predict</a></p> <p>#</p> <h1 id="melissa-tan">Melissa Tan</h1> <p> </p> <p><a href="http://simplystatistics.org/wp-content/uploads/2015/06/MelissaTan.png"><img class="aligncenter size-medium wp-image-4099" src="http://simplystatistics.org/wp-content/uploads/2015/06/MelissaTan-300x198.png" alt="MelissaTan" width="300" height="198" srcset="http://simplystatistics.org/wp-content/uploads/2015/06/MelissaTan-300x198.png 300w, http://simplystatistics.org/wp-content/uploads/2015/06/MelissaTan-260x172.png 260w" sizes="(max-width: 300px) 100vw, 300px" /></a></p> <p>I’m a financial journalist from Singapore. I did philosophy and computer science at the University of Chicago, and I’m keen on picking up more machine learning and data viz skills.</p> <h4 id="why-did-you-take-the-jhu-data-science-specialization-6"><strong>**Why did you take the JHU Data Science Specialization?</strong>**</h4> <p>I wanted to keep up with coding, while learning new tools and techniques for wrangling and analyzing data that I could potentially apply to my job. Plus, it sounded fun. <img src="http://simplystatistics.org/wp-includes/images/smilies/simple-smile.png" alt=":)" class="wp-smiley" style="height: 1em; max-height: 1em;" /></p> <h4 id="what-are-you-most-proud-of-doing-as-part-of-the-jhu-data-science-specialization-6"><strong>**What are you most proud of doing as part of the JHU Data Science Specialization?</strong>**</h4> <p>Building a word prediction app pretty much from scratch (with a truckload of forum reading). The capstone project seemed insurmountable initially and ate up all my weekends, but getting the app to work passably was worth it.</p> <h4 id="how-are-you-planning-on-using-your-data-science-specialization-certificate-7"><strong>**How are you planning on using your Data Science Specialization Certificate?</strong>**</h4> <p>It’ll go on my CV, but I think it’s more important to be able to actually do useful things. I’m keeping an eye out for more practical opportunities to apply and sharpen what I’ve learnt.</p> <p><strong>Final Project: </strong><a href="https://melissatan.shinyapps.io/word_psychic/">https://melissatan.shinyapps.io/word_psychic/</a></p> <p><strong>Project Slide Deck: </strong><a href="https://rpubs.com/melissatan/capstone">https://rpubs.com/melissatan/capstone</a></p> <p>#</p> <h1 id="felicia-yii">Felicia Yii</h1> <p><a href="http://simplystatistics.org/wp-content/uploads/2015/06/FeliciaYii.jpg"><img class="aligncenter size-medium wp-image-4100" src="http://simplystatistics.org/wp-content/uploads/2015/06/FeliciaYii-232x300.jpg" alt="FeliciaYii" width="232" height="300" srcset="http://simplystatistics.org/wp-content/uploads/2015/06/FeliciaYii-232x300.jpg 232w, http://simplystatistics.org/wp-content/uploads/2015/06/FeliciaYii-793x1024.jpg 793w" sizes="(max-width: 232px) 100vw, 232px" /></a></p> <p>Felicia likes to dream, think and do. With over 20 years in the IT industry, her current fascination is at the intersection of people, information and decision-making.  Ever inquisitive, she has acquired an expertise in subjects as diverse as coding to cookery to costume making to cosmetics chemistry. It’s not apparent that there is anything she can’t learn to do, apart from housework.  Felicia lives in Wellington, New Zealand with her husband, two children and two cats.</p> <h4 id="why-did-you-take-the-jhu-data-science-specialization-7"><strong>**Why did you take the JHU Data Science Specialization?</strong>**</h4> <p>Well, I love learning and the JHU Data Science Specialization appealed to my thirst for a new challenge. I’m really interested in how we can use data to help people make better decisions.  There’s so much data out there these days that it is easy to be overwhelmed by it all. Data visualisation was at the heart of my motivation when starting out. As I got into the nitty gritty of the course, I really began to see the power of making data accessible and appealing to the data-agnostic world. There’s so much potential for data science thinking in my professional work.</p> <h4 id="what-are-you-most-proud-of-doing-as-part-of-the-jhu-data-science-specialization-7"><strong>**What are you most proud of doing as part of the JHU Data Science Specialization?</strong>**</h4> <p>Getting through it for starters while also working and looking after two children. Seriously though, being able to say I know what ‘practical machine learning’ is all about.  Before I started the course, I had limited knowledge of statistics, let alone knowing how to apply them in a machine learning context.  I was thrilled to be able to use what I learned to test a cool game concept in my final project.</p> <h4 id="how-are-you-planning-on-using-your-data-science-specialization-certificate-8"><strong>**How are you planning on using your Data Science Specialization Certificate?</strong>**</h4> <p>I want to use what I have learned in as many ways possible. Firstly, I see opportunities to apply my skills to my day-to-day work in information technology. Secondly, I would like to help organisations that don’t have the skills or expertise in-house to apply data science thinking to help their decision making and communication. Thirdly, it would be cool one day to have my own company consulting on data science. I’ve more work to do to get there though!</p> <p><strong>Final Project: </strong><a href="https://micasagroup.shinyapps.io/nwpgame/">https://micasagroup.shinyapps.io/nwpgame/</a></p> <p><strong>Project Slide Deck: </strong><a href="https://rpubs.com/MicasaGroup/74788">https://rpubs.com/MicasaGroup/74788</a></p> <p> </p> Batch effects are everywhere! Deflategate edition 2015-06-09T11:47:27+00:00 http://simplystats.github.io/2015/06/09/batch-effects-are-everywhere-deflategate-edition <p>In my opinion, batch effects are the biggest challenge faced by genomics research, especially in precision medicine. As we point out in <a href="http://www.ncbi.nlm.nih.gov/pubmed/20838408">this review</a>, they are everywhere among high-throughput experiments. But batch effects are not specific to genomics technology. In fact, in <a href="http://amstat.tandfonline.com/doi/abs/10.1080/00401706.1972.10488878">this 1972 paper</a> (paywalled), <a href="http://en.wikipedia.org/wiki/William_J._Youden">WJ Youden</a> describes batch effects in the context of measurements made by physicists. Check out this plot of <a href="https://en.wikipedia.org/wiki/Astronomical_unit">astronomical unit</a> <del>speed of light</del> estimates <strong>with an estimate of spread <del>confidence intervals</del></strong> (red and green are same lab).</p> <p><a href="http://simplystatistics.org/wp-content/uploads/2015/06/Rplot.png"><img class=" wp-image-4295 aligncenter" src="http://simplystatistics.org/wp-content/uploads/2015/06/Rplot.png" alt="Rplot" width="467" height="290" srcset="http://simplystatistics.org/wp-content/uploads/2015/06/Rplot-300x186.png 300w, http://simplystatistics.org/wp-content/uploads/2015/06/Rplot.png 903w" sizes="(max-width: 467px) 100vw, 467px" /></a></p> <p style="text-align: center;"> <p> &nbsp; </p> <p> Sometimes you find batch effects where you least expect them. For example, in the <a href="http://en.wikipedia.org/wiki/Deflategate">deflategate</a> debate. Here is quote from the New England patriot's deflategate<a href="http://www.boston.com/sports/football/patriots/2015/05/14/key-takeaways-from-the-patriots-deflategate-report-rebuttal/hK0J0J9abNgtGyhTwlW53L/story.html"> rebuttal</a> (written with help from Nobel Prize winner <a href="http://en.wikipedia.org/wiki/Roderick_MacKinnon">Roderick MacKinnon</a>) </p> <blockquote> <p> in other words, the Colts balls were measured after the Patriots balls and had warmed up more. For the above reasons, the Wells Report conclusion that physical law cannot explain the pressures is incorrect. </p> </blockquote> <p style="text-align: left;"> Here is another one: </p> <blockquote> <p style="text-align: left;"> In the pressure measurements physical conditions were not very well-defined and major uncertainties, such as which gauge was used in pre-game measurements, affect conclusions. </p> </blockquote> <p style="text-align: left;"> So NFL, please read <a href="http://www.ncbi.nlm.nih.gov/pubmed/20838408">our paper</a> before you accuse a player of cheating. </p> <p style="text-align: left;"> Disclaimer: I live in New England but I am <a href="http://www.urbandictionary.com/define.php?term=Ball+so+Hard+University">Ravens</a> fan. </p> </p> I'm a data scientist - mind if I do surgery on your heart? 2015-06-08T14:15:39+00:00 http://simplystats.github.io/2015/06/08/im-a-data-scientist-mind-if-i-do-surgery-on-your-heart <p>There has been a lot of recent interest from scientific journals and from other folks in creating checklists for data science and data analysis. The idea is that the checklist will help prevent results that won’t reproduce or replicate from the literature. One analogy that I’m frequently hearing is the analogy with checklists for surgeons that <a href="http://www.nejm.org/doi/full/10.1056/NEJMsa0810119">can help reduce patient mortality</a>.</p> <p>The one major difference between checklists for surgeons and checklists I’m seeing for research purposes is the difference in credentialing between people allowed to perform surgery and people allowed to perform complex data analysis. You would never let me do surgery on you. I have no medical training at all. But I’m frequently asked to review papers that include complicated and technical data analyses, but have no trained data analysts or statisticians. The most common approach is that a postdoc or graduate student in the group is assigned to do the analysis, even if they don’t have much formal training. Whenever this happens red flags are up all over the place. Just like I wouldn’t trust someone without years of training and a medical license to do surgery on me, I wouldn’t let someone without years of training and credentials in data analysis make major conclusions from complex data analysis.</p> <p>You might argue that the consequences for surgery and for complex data analysis are on completely different scales. I’d agree with you, but not in the direction that you might think. I would argue that high pressure and complex data analysis can have much larger consequences than surgery. In surgery there is usually only one person that can be hurt. But if you do a bad data analysis, say claiming say that <a href="http://www.ncbi.nlm.nih.gov/pubmed/9500320">vaccines cause autism</a>, that can have massive consequences for hundreds or even thousands of people. So complex data analysis, especially for important results, should be treated with at least as much care as surgery.</p> <p>The reason why I don’t think checklists alone will solve the problem is that they are likely to be used by people without formal training. One obvious (and recent) example that I think makes this really clear is the <a href="https://developer.apple.com/healthkit/">HealthKit</a> data we are about to start seeing. A ton of people signed up for studies on their iPhones and it has been all over the news. The checklist will (almost certainly) say to have a big sample size. HealthKit studies will certainly pass the checklist, but they are going to get <a href="http://en.wikipedia.org/wiki/Dewey_Defeats_Truman">Truman/Deweyed</a> big time if they aren’t careful about biased sampling.</p> <div> If I walked into an operating room and said I'm going to start dabbling in surgery I would be immediately thrown out. But people do that with statistics and data analysis all the time. What they really need is to require careful training and expertise in data analysis on each paper that analyzes data. Until we treat it as a first class component of the scientific process we'll continue to see retractions, falsifications, and irreproducible results flourish. </div> Interview with Class Central 2015-06-04T09:27:20+00:00 http://simplystats.github.io/2015/06/04/4063 <p>Recently I sat down with Class Central to do an interview about the Johns Hopkins Data Science Specialization. We talked about the motivation for designing the sequence and and the capstone project. With the demand for data science skills greater than ever, the importance of the specialization is only increasing.</p> <p>See the <a href="https://www.class-central.com/report/data-science-specialization/">full interview</a> at the Class Central site. Below is short excerpt.</p> Interview with Chris Wiggins, chief data scientist at the New York Times 2015-06-01T09:00:27+00:00 http://simplystats.github.io/2015/06/01/interview-with-chris-wiggins-chief-data-scientist-at-the-new-york-times <p><em>Editor’s note: We are trying something a little new here and doing an interview with Google Hangouts on Air. The interview will be live at 11:30am EST. I have some questions lined up for Chris, but if you have others you’d like to ask, you can tweet them @simplystats and I’ll see if I can work them in. After the livestream we’ll leave the video on Youtube so you can check out the interview if you can’t watch the live stream. I’m embedding the Youtube video here but if you can’t see the live stream when it is running go check out the event page: <a href="https://plus.google.com/events/c7chrkg0ene47mikqrvevrg3a4o">https://plus.google.com/events/c7chrkg0ene47mikqrvevrg3a4o</a>.</em></p> Science is a calling and a career, here is a career planning guide for students and postdocs 2015-05-28T10:16:47+00:00 http://simplystats.github.io/2015/05/28/science-is-a-calling-and-a-career-here-is-a-career-planning-guide-for-students-and-postdocs <p><em>Editor’s note: This post was inspired by a really awesome career planning guide that Ben Langmead</em> <a href="https://github.com/BenLangmead/langmead-lab/blob/master/postdoc_questionnaire.md"><em>Editor’s note: This post was inspired by a really awesome career planning guide that Ben Langmead</em></a> <em>which you should go check out right now. You can also find the slightly adapted</em> <a href="https://github.com/jtleek/careerplanning"><em>Leek group career planning guide</em></a> <em>here.</em></p> <p>The most common reason that people go into science is altruistic. They loved dinosaurs and spaceships when they were a kid and that never wore off. On some level this is one of the reasons I love this field so much, it is an area where if you can get past all the hard parts can really keep introducing wonder into what you work on every day.</p> <p>Sometimes I feel like this altruism has negative consequences. For example, I think that there is less emphasis on the career planning and development side in the academic community. I don’t think this is malicious, but I do think that sometimes people think of the career part of science as unseemly. But if you have any job that you want people to pay you to do, then there will be parts of that job that will be career oriented. So if you want to be a professional scientist, being brilliant and good at science is not enough. You also need to pay attention to and plan carefully your career trajectory.</p> <p>A colleague of mine, Ben Langmead, created a really nice guide for his postdocs to thinking about and planning the career side of a postdoc <a href="https://github.com/BenLangmead/langmead-lab/blob/master/postdoc_questionnaire.md">which he has over on Github</a>. I thought it was such a good idea that I immediately modified it and asked all of my graduate students and postdocs to fill it out. It is kind of long so there was no penalty if they didn’t finish it, but I think it is an incredibly useful tool for thinking about how to strategize a career in the sciences. I think that the more we are concrete about the career side of graduate school and postdocs, including being honest about all the realistic options available, the better prepared our students will be to succeed on the market.</p> <p>You can find the <a href="https://github.com/jtleek/careerplanning">Leek Group Guide to Career Planning</a> here and make sure you also go <a href="https://github.com/BenLangmead/langmead-lab/blob/master/postdoc_questionnaire.md">check out Ben’s</a> since it was his idea and his is great.</p> <p> </p> Is it species or is it batch? They are confounded, so we can't know 2015-05-20T11:11:18+00:00 http://simplystats.github.io/2015/05/20/is-it-species-or-is-it-batch-they-are-confounded-so-we-cant-know <p>In a 2005 OMICS <a href="http://online.liebertpub.com/doi/abs/10.1089/153623104773547462" target="_blank">paper</a>, an analysis of human and mouse gene expression microarray measurements from several tissues led the authors to conclude that “any tissue is more similar to any other human tissue examined than to its corresponding mouse tissue”. Note that this was a rather surprising result given how similar tissues are between species. For example, both mice and humans see with their eyes, breathe with their lungs, pump blood with their hearts, etc… Two follow-up papers (<a href="http://mbe.oxfordjournals.org/content/23/3/530.abstract?ijkey=2c3d98666afbc99949fdcf514f10e3fedadee259&amp;keytype2=tf_ipsecsha" target="_blank">here</a> and <a href="http://mbe.oxfordjournals.org/content/24/6/1283.abstract?ijkey=366fdf09da56a5dd0cfdc5f74082d9c098ae7801&amp;keytype2=tf_ipsecsha" target="_blank">here</a>) demonstrated that platform-specific technical variability was the cause of this apparent dissimilarity. The arrays used for the two species were different and thus measurement platform and species were completely <strong>confounded</strong>. In a 2010 paper, we confirmed that once this technical variability  was accounted for, the number of genes expressed in common  between the same tissue across the two species was much higher than the those expressed in common  between two species across the different tissues (see Figure 2 <a href="http://nar.oxfordjournals.org/content/39/suppl_1/D1011.full" target="_blank">here</a>).</p> <p>So <a href="http://genomicsclass.github.io/book/pages/confounding.html">what is confounding</a> and <a href="http://www.nature.com/ng/journal/v39/n7/full/ng0707-807.html">why is it a problem</a>? This topic has been discussed broadly. We wrote a <a href="http://www.nature.com/nrg/journal/v11/n10/full/nrg2825.html">review</a> some time ago. But based on recent discussions I’ve participated in, it seems that there is still some confusion. Here I explain, aided by some math, how confounding leads to problems in the context of estimating species effects in genomics. We will use</p> <ul> <li><em>X<sub>i</sub></em> to represent the gene expression measurements for human tissue <em>i,</em></li> <li><em>a<sub>X</sub></em> to represent the level of expression that is specific to humans and</li> <li><em>b<sub>X</sub></em> to represent the batch effect introduced by the use of the human microarray platform.</li> <li>Therefore <em>X<sub>i</sub></em> =<em>a<sub>X </sub></em>+ <em>b<sub>X </sub></em>+ <em>e<sub>i</sub></em>, with <em>e<sub>i</sub></em> the tissue <em>i</em> effect and other uninteresting sources of variability.</li> </ul> <p>Similarly, we will use:</p> <ul> <li><em>Y<sub>i</sub></em> to represent the measurements for mouse tissue <em>i</em></li> <li><em>a<sub>Y</sub></em>  to represent the mouse specific level and</li> <li><em>b<sub>Y</sub></em> the batch effect introduced by the use of the mouse microarray platform.</li> <li>Therefore <em>Y</em><sub>i</sub> = <em>a<sub>Y</sub></em>+ <em>b<sub>Y</sub></em> + <em>f<sub>i</sub></em>, with <em>f<sub>i</sub></em> tissue <em>i</em> effect and other uninteresting sources of variability.</li> </ul> <p>If we are interested in estimating a species effect that is general across tissues, then we are interested in the following quantity:</p> <p style="text-align: center;">  <em>a<sub>Y</sub> - a<sub>X</sub></em> </p> <p>Naively, we would think that we can estimate this quantity using the observed differences between the species that cancel out the tissue effect. We observe a difference for each tissue: <em>Y<sub>1 </sub></em> - <em>X<sub>1 </sub></em>, <em>Y<sub>2</sub></em> - <em>X<sub>2 </sub></em>, etc… The problem is that <em>a<sub>X</sub></em> and <em>b<sub>X</sub></em> are always together as are <em>a<sub>Y</sub></em> and <em>b<sub>Y</sub></em>. We say that the batch effect <em>b<sub>X</sub></em> is <strong>confounded</strong> with the species effect <em>a<sub>X</sub></em>. Therefore, on average, the observed differences include both the species and the batch effects. To estimate the difference above we would write a model like this:</p> <p style="text-align: center;"> <em>Y<sub>i</sub></em> - <em>X<sub>i</sub></em> = (<em>a<sub>Y</sub> - a<sub>X</sub></em>) + (<em>b<sub>Y</sub> - b<sub>X</sub></em>) + other sources of variability </p> <p style="text-align: left;"> and then estimate the unknown quantities of interest: (<em>a<sub>Y</sub> - a<sub>X</sub></em>) and (<em>b<sub>Y</sub> - b<sub>X</sub></em>) from the observed data <em>Y<sub>1</sub></em> - <em>X<sub>1</sub></em>, <em>Y<sub>2</sub></em> - <em>X<sub>2</sub></em>, etc... The problem is that, we can estimate the aggregate effect (<em>a<sub>Y</sub> - a<sub>X</sub></em>) + (<em>b<sub>Y</sub> - b<sub>X</sub></em>), but, mathematically, we can't tease apart the two differences.  To see this note that if we are using least squares, the estimates (<em>a<sub>Y</sub> - a<sub>X</sub></em>) = 7,  (<em>b<sub>Y</sub> - b<sub>X</sub></em>)=3  will fit the data exactly as well as (<em>a<sub>Y</sub> - a<sub>X</sub></em>)=3,(<em>b<sub>Y</sub> - b<sub>X</sub></em>)=7 since </p> <p style="text-align: center;"> <em>{(Y-X) -(7+3))^2 = {(Y-X)- (3+7)}^2.</em> </p> <p style="text-align: left;"> In fact, under these circumstances, there are an infinite number of solutions to the standard statistical estimation approaches. A simple analogy is to try to find a unique solution to the equations m+n = 0. If batch and species are not confounded then we are able to tease apart differences just as if we were given another equation: m+n=0; m-n=2. You can learn more about this in <a href="https://www.edx.org/course/introduction-linear-models-matrix-harvardx-ph525-2x">this linear models course</a>. </p> <p style="text-align: left;"> Note that the above derivation apply to each gene affected by the batch effect. In practice we commonly see hundreds of genes affected. As a consequence, when we compute distances between two samples from different species we may see large differences even where there is no species effect. This is because the <em>b<sub>Y</sub> - b<sub>X  </sub></em>differences for each gene are squared and added up. </p> <p style="text-align: left;"> In summary, if you completely confound your variable of interest, in this case species, with a batch effect, you will not be able to estimate the effect of either. In fact, in a <a href="http://www.nature.com/nrg/journal/v11/n10/full/nrg2825.html">2010 Nature Genetics Review</a>  about batch effects we warned about "cases in which batch effects are confounded with an outcome of interest and result in misleading biological or clinical conclusions". We also warned that none of the existing solutions for batch effects (Combat, SVA, RUV, etc...) can save you from a situation with perfect confounding. Because we can't always predict what will introduce unwanted variability, we recommend randomization as an experimental design approach. </p> <p style="text-align: left;"> Almost a decade later after the OMICS paper was published, the same surprising conclusion was reached in <a href="http://www.pnas.org/content/111/48/17224.abstract" target="_blank">this PNAS paper</a>:  "tissues appear more similar to one another within the same species than to the comparable organs of other species". This time RNAseq was used for both species and therefore the different platform issue was not considered<sup>*</sup>. Therefore, the authors implicitly assumed that (<em>b<sub>Y</sub> - b<sub>X</sub></em>)=0. However, in a recent F1000 Research <a href="http://f1000research.com/articles/4-121/v1" target="_blank">publication</a> Gilad and Mizrahi-Man describe describe an exercise in <a href="http://projecteuclid.org/euclid.aoas/1267453942">forensic bioinformatics</a> that led them to discover that mice and human samples were run in different lanes or different instruments. The confounding was near perfect (see <a href="https://f1000researchdata.s3.amazonaws.com/manuscripts/7019/9f5f4330-d81d-46b8-9a3f-d8cb7aaf577e_figure1.gif">Figure 1</a>). As pointed out by these authors, with this experimental design we can't  simply accept that (<em>b<sub>Y</sub> - b<sub>X</sub></em>)=0, which implies that we can't estimate a species effect. Gilad and Mizrahi-Man then apply a <a href="http://biostatistics.oxfordjournals.org/content/8/1/118.abstract">linear model</a> (ComBat) to account for the batch/species effect and find that <a href="https://f1000researchdata.s3.amazonaws.com/manuscripts/7019/9f5f4330-d81d-46b8-9a3f-d8cb7aaf577e_figure3.gif">samples cluster almost perfectly by tissue</a>. However, Gilad and Mizrahi-Man correctly note that,  due to the confounding, if there is in fact a species effect, this approach will remove it along with the batch effect. Unfortunately, due to the experimental design it will be hard or impossible to determine if it's batch or if it's species. More data  and more analyses are needed. </p> <p>Confounded designs ruin experiments. Current batch effect removal methods will not save you. If you are designing a large genomics experiments, learn about randomization.</p> <p style="text-align: left;">  * The fact that RNAseq was used does not necessarily mean there is no platform effect. The species have different genomes, with different sequences and thus can lead to different biases during experimental protocols. </p> <p style="text-align: left;"> <strong>Update: </strong>Shin Lin has repeated a small version of the experiment described in the <a href="http://www.pnas.org/content/111/48/17224.abstract" target="_blank">PNAS paper</a>. The new experimental design does not confound lane/instrument with species. The new data confirms their original results pointing to the fact that lane/instrument do not explain the clustering by species. You can see his response in the comments <a href="http://f1000research.com/articles/4-121/v1" target="_blank">here</a>. </p> Residual expertise - or why scientists are amateurs at most of science 2015-05-18T10:21:18+00:00 http://simplystats.github.io/2015/05/18/residual-expertise <p><em>Editor’s note: I have been unsuccessfully attempting to finish a book I started 3 years ago about how and why everyone should get pumped about reading and understanding scientific papers. I’ve adapted part of one of the chapters into this blogpost. It is pretty raw but hopefully gets the idea across. </em></p> <p>An episode of_ The Daily Show with Jon Stewart_ featured physicist Lisa Randall, an incredible physicist and noted scientific communicator, as the invited guest.</p> <div style="background-color: #000000; width: 520px;"> <div style="padding: 4px;"> &lt;/p&gt; <p style="text-align: left; background-color: #ffffff; padding: 4px; margin-top: 4px; margin-bottom: 0px; font-family: Arial, Helvetica, sans-serif; font-size: 12px;"> <b><a href="http://thedailyshow.cc.com/">The Daily Show</a></b><br /> Get More: <a href="http://thedailyshow.cc.com/full-episodes/">Daily Show Full Episodes</a>,<a href="http://www.facebook.com/thedailyshow">The Daily Show on Facebook</a>,<a href="http://thedailyshow.cc.com/videos">Daily Show Video Archive</a> </p> </div> </div> <p>Near the end of the interview, Stewart asked Randall why, with all the scientific progress we have made, that we have been unable to move away from fossil fuel-based engines. The question led to the exchange:</p> <blockquote> <p><em>Randall: “So this is part of the problem, because I’m a scientist doesn’t mean I know the answer to that question.”</em></p> <p>**</p> </blockquote> <blockquote> <p>** <em>Stewart: ”Oh is that true? Here’s the thing, here’s what’s part of the answer. You could say anything and I would have no idea what you are talking about.”</em></p> </blockquote> <p>Professor Randall is a world leading physicist, the first woman to achieve tenure in physics at Princeton, Harvard, and MIT, and a member of the National Academy of Sciences.2 But when it comes to the science of fossil fuels, she is just an amateur. Her response to this question is just perfect - it shows that even brilliant scientists can just be interested amateurs on topics outside of their expertise. Despite Professor Randall’s over-the-top qualifications, she is an amateur on a whole range of scientific topics from medicine, to computer science, to nuclear engineering. Being an amateur isn’t a bad thing, and recognizing where you are an amateur may be the truest indicator of genius. That doesn’t mean Professor Randall can’t know a little bit about fossil fuels or be curious about why we don’t all have nuclear-powered hovercrafts yet. It just means she isn’t the authority.</p> <p>Stewart’s response is particularly telling and indicative of what a lot of people think about scientists. It takes years of experience to become an expert in a scientific field - some have suggested as many as 10,000 hours of dedicated time. Professor Randall is a scientist - so she must have more information about any scientific problem than an informed amateur like Jon Stewart. But of course this isn’t true, Jon Stewart (and you) could quickly learn as much about fossil fuels as a scientist if the scientist wasn’t already an expert in the area. Sure a background in physics would help, but there are a lot of moving parts in our dependence on fossil fuels, including social, political, economic problems in addition to the physics involved.</p> <p>This is an example of “residual expertise” - when people without deep scientific training are willing to attribute expertise to scientists even if it is outside their primary area of focus. It is closely related to the logical fallacy behind the <a href="http://en.wikipedia.org/wiki/Argument_from_authority">argument from authority</a>:</p> <blockquote> <p>A is an authority on a particular topic</p> <p>A says something about that topic</p> <p>A is probably correct</p> </blockquote> <p>the difference is that with residual expertise you assume that since A is an authority on a particular topic, if they say something about another, potentially related topic, they will probably be correct. This idea is critically important, it is how quacks make their living. The logical leap of faith from “that person is a doctor” to “that person is a doctor so of course they understand epidemiology, or vaccination, or risk communication” is exactly the leap empowered by the idea of residual expertise. It is also how you can line up scientific experts against any well established doctrine like evolution or climate change. Experts in the field will know all of the relevant information that supports key ideas in the field and what it would take to overturn those ideas. But experts outside of the field can be lined up and their residual expertise used to call into question even the most supported ideas.</p> <p>What does this have to do with you?</p> <p>Most people aren’t necessarily experts in scientific disciplines they care about. But becoming a successful amateur requires a much smaller time commitment than becoming an expert, but can still be incredibly satisfying, fun, and useful. This book is designed to help you become a fired-up amateur in the science of your choice. Think of it like a hobby, but one where you get to learn about some of the coolest new technologies and ideas coming out in the scientific literature. If you can ignore the way residual expertise makes you feel silly for reading scientific papers you don’t fully understand - you can still learn a ton and have a pretty fun time doing it.</p> <p> </p> <p> </p> The tyranny of the idea in science 2015-05-08T11:58:51+00:00 http://simplystats.github.io/2015/05/08/the-tyranny-of-the-idea-in-science <p>There are a lot of analogies between <a href="http://simplystatistics.org/2012/09/20/every-professor-is-a-startup/">startups and academic science labs</a>. One thing that is definitely very different is the relative value of ideas in the startup world and in the academic world. For example, <a href="http://simplystatistics.org/2012/09/20/every-professor-is-a-startup/">Paul Graham has said:</a></p> <blockquote> <p>Actually, startup ideas are not million dollar ideas, and here’s an experiment you can try to prove it: just try to sell one. Nothing evolves faster than markets. The fact that there’s no market for startup ideas suggests there’s no demand. Which means, in the narrow sense of the word, that startup ideas are worthless.</p> </blockquote> <p>In academics, almost the opposite is true. There is huge value to being first with an idea, even if you haven’t gotten all the details worked out or stable software in place. Here are a couple of extreme examples illustrated with Nobel prizes:</p> <ol> <li><strong>Higgs Boson</strong> - Peter Higgs <a href="http://journals.aps.org/pr/abstract/10.1103/PhysRev.145.1156">postulated the Boson in 1964</a>, <a href="http://www.symmetrymagazine.org/article/october-2013/nobel-prize-in-physics-honors-prediction-of-higgs-boson">he won the Nobel Prize in 2013 for that prediction</a>, in between tons of people did follow on work, someone convinced Europe to build one of the <a href="http://en.wikipedia.org/wiki/Large_Hadron_Collider">most expensive pieces of scientific equipment ever built</a> and conservatively thousands of scientists and engineers had to do a ton of work to get the equipment to (a) work and (b) confirm the prediction.</li> <li><strong>Human genome</strong> - <a href="http://en.wikipedia.org/wiki/Molecular_Structure_of_Nucleic_Acids:_A_Structure_for_Deoxyribose_Nucleic_Acid">Watson and Crick postulated the structure of DNA</a> in 1953, <a href="http://www.nobelprize.org/nobel_prizes/medicine/laureates/1962/">they won the Nobel Prize in  medicine in 1962</a> for this work. But the real value of the human genome was realized when the <a href="http://en.wikipedia.org/wiki/Human_Genome_Project">largest biological collaboration in history sequenced the human genome</a>, along with all of the subsequent work in the genomics revolution.</li> </ol> <p>These are two large scale examples where the academic scientific community (as represented by the Nobel committee, mostly because it is a concrete example) rewards the original idea and not the hard work to achieve that idea. I call this, “the tyranny of the idea.” I notice a similar issue on a much smaller scale, for example when people <a href="http://ivory.idyll.org/blog/2015-software-as-a-primary-product-of-science.html">don’t recognize software as a primary product of science</a>. I feel like these decisions devalue the real work it takes to make any scientific idea a reality. Sure the ideas are good, but it isn’t clear that some ideas wouldn’t be discovered by someone else - but surely we aren’t going to build another large hadron collider. I’d like to see the scales correct back the other way a little bit so we put at least as much emphasis on the science it takes to follow through on an idea as on discovering it in the first place.</p> Mendelian randomization inspires a randomized trial design for multiple drugs simultaneously 2015-05-07T11:30:09+00:00 http://simplystats.github.io/2015/05/07/mendelian-randomization-inspires-a-randomized-trial-design-for-multiple-drugs-simultaneously <p>Joe Pickrell has an interesting new paper out about <a href="http://biorxiv.org/content/early/2015/04/16/018150.full-text.pdf+html">Mendelian randomization.</a> He discusses some of the interesting issues that come up with these studies and performs a mini-review of previously published studies using the technique.</p> <p>The basic idea behind Mendelian Randomization is the following. In a simple, randomly mating population Mendel’s laws tell us that at any genomic locus (a measured spot in the genome) the allele (genetic material you got) you get is assigned at random. At the chromosome level this is very close to true due to properties of meiosis (here is an example of how this looks in very cartoonish form in yeast). A very famous example of this was an experiment performed by Leonid Kruglyak’s group where they took two strains of yeast and repeatedly mated them, then measured genetics and gene expression data. The experimental design looked like this:</p> <p><a href="http://simplystatistics.org/wp-content/uploads/2015/05/Slide06.jpg"><img class="aligncenter wp-image-4009 size-medium" src="http://simplystatistics.org/wp-content/uploads/2015/05/Slide06-300x224.jpg" alt="Slide06" width="300" height="224" srcset="http://simplystatistics.org/wp-content/uploads/2015/05/Slide06-300x224.jpg 300w, http://simplystatistics.org/wp-content/uploads/2015/05/Slide06-260x194.jpg 260w" sizes="(max-width: 300px) 100vw, 300px" /></a></p> <p> </p> <p>If you look at the allele inherited from the two parental strains (BY, RM)  at two separate genes on different chromsomes in each of the 112 segregants (yeast offspring)  they do appear to be random and independent:</p> <p><a href="http://simplystatistics.org/wp-content/uploads/2015/05/Screen-Shot-2015-05-07-at-11.20.46-AM.png"><img class="aligncenter wp-image-4010 size-medium" src="http://simplystatistics.org/wp-content/uploads/2015/05/Screen-Shot-2015-05-07-at-11.20.46-AM-235x300.png" alt="Screen Shot 2015-05-07 at 11.20.46 AM" width="235" height="300" /></a></p> <p> </p> <p> </p> <p>So this is a randomized trial in yeast where the yeast were each randomized to many many genetic “treatments” simultaneously. Now this isn’t strictly true, since genes on the same chromosomes near each other aren’t exactly random and in humans it is definitely not true since there is population structure, non-random mating and a host of other issues. But you can still do cool things to try to infer causality from the genetic “treatments” to downstream things like gene expression ( <a href="http://genomebiology.com/2007/8/10/r219">and even do a reasonable job in the model organism case</a>).</p> <p>In my mind this raises a potentially interesting study design for clinical trials. Suppose that there are 10 treatments for a disease that we know about. We design a study where each of the patients in the trial was randomized to receive treatment or placebo for each of the 10 treatments. So on average each person would get 5 treatments.  Then you could try to tease apart the effects using methods developed for the Mendelian randomization case. Of course, this is ignoring potential interactions, side effects of taking multiple drugs simultaneously, etc. But I’m seeing lots of <a href="http://www.nature.com/news/personalized-medicine-time-for-one-person-trials-1.17411">interesting proposals</a> for new trial designs (<a href="http://notstatschat.tumblr.com/post/118102423391/precise-answers-but-not-necessarily-to-the-right">which may or may not work</a>), so I thought I’d contribute my own interesting idea.</p> Rafa's citations above replacement in statistics journals is crazy high. 2015-05-01T11:18:47+00:00 http://simplystats.github.io/2015/05/01/rafas-citations-above-replacement-in-statistics-journals-is-crazy-high <p><em>Editor’s note:  I thought it would be fun to do some bibliometrics on a Friday. This is super hacky and the CAR/Y stat should not be taken seriously. </em></p> <p>I downloaded data on the 400 most cited papers between 2000-2010 in some statistical journals from <a href="webofscience.com/">Web of Science</a>. Here is a boxplot of the average number of citations per year (from publication date - 2015) to these papers in the journals Annals of Statistics, Biometrics, Biometrika, Biostatistics, JASA, Journal of Computational and Graphical Statistics, Journal of Machine Learning Research, and Journal of the Royal Statistical Society Series B.</p> <p> </p> <p><a href="http://simplystatistics.org/wp-content/uploads/2015/05/journals.png"><img class="aligncenter wp-image-4001" src="http://simplystatistics.org/wp-content/uploads/2015/05/journals-300x300.png" alt="journals" width="500" height="500" srcset="http://simplystatistics.org/wp-content/uploads/2015/05/journals-300x300.png 300w, http://simplystatistics.org/wp-content/uploads/2015/05/journals-1024x1024.png 1024w, http://simplystatistics.org/wp-content/uploads/2015/05/journals-200x200.png 200w, http://simplystatistics.org/wp-content/uploads/2015/05/journals.png 1050w" sizes="(max-width: 500px) 100vw, 500px" /></a></p> <p> </p> <p>There are several interesting things about this graph right away. One is that JASA has the highest median number of citations, but has fewer “big hits” (papers with 100+ citations/year) than Annals of Statistics, JMLR, or JRSS-B. Another thing is how much of a lottery developing statistical methods seems to be. Most papers, even among the 400 most cited, have around 3 citations/year on average. But a few lucky winners have 100+ citations per year. One interesting thing for me is the papers that get 10 or more citations per year but aren’t huge hits. I suspect these are the papers that <a href="http://simplystatistics.org/2014/07/25/academic-statisticians-there-is-no-shame-in-developing-statistical-solutions-that-solve-just-one-problem/">solve one problem well but don’t solve the most general problem ever</a>.</p> <p>Something that jumps out from that plot is the outlier for the journal Biostatistics. One of their papers is cited 367.85 times per year. The next nearest competitor is 67.75 and it is 19 standard deviations above the mean! The paper in question is: “Exploration, normalization, and summaries of high density oligonucleotide array probe level data”, which is the paper that introduced RMA, one of the most popular methods for pre-processing microarrays ever created. It was written by Rafa and colleagues. It made me think of the statistic “<a href="http://www.fangraphs.com/library/misc/war/">wins above replacement</a>” which quantifies how many extra wins a baseball team gets by playing a specific player in place of a league average replacement.</p> <p>What about a “citations /year above replacement” statistic where you calculate for each journal:</p> <blockquote> <p>Median number of citations to a paper/year with Author X - Median number of citations/year to an average paper in that journal</p> </blockquote> <p>Then average this number across journals. This attempts to quantify how many extra citations/year a person’s papers generate compared to the “average” paper in that journal. For Rafa the numbers look like this:</p> <ul> <li>Biostatistics: Rafa = 15.475, Journal = 1.855, CAR/Y =  13.62</li> <li>JASA: Rafa = 74.5, Journal = 5.2, CAR/Y = 69.3</li> <li>Biometrics: Rafa = 4.33, Journal = 3.38, CAR/Y = 0.95</li> </ul> <p>So Rafa’s citations above replacement is (13.62 + 69.3 + 0.95)/3 =  27.96! There are a couple of reasons why this isn’t a completely accurate picture. One is the low sample size, the second is the fact that I only took the 400 most cited papers in each journal. Rafa has a few papers that didn’t make the top 400 for journals like JASA - which would bring down his CAR/Y.</p> <p> </p> Figuring Out Learning Objectives the Hard Way 2015-04-30T11:10:06+00:00 http://simplystats.github.io/2015/04/30/figuring-out-learning-objectives-the-hard-way <p>When building the <a href="https://www.coursera.org/specialization/genomics/41" title="Genomic Data Science Specialization">Genomic Data Science Specialization</a> (which starts in June!) we had to figure out the learning objectives for each course. We initially set our ambitions high, but as you can see in this video below, Steven Salzberg brought us back to Earth.</p> Data analysis subcultures 2015-04-29T10:23:57+00:00 http://simplystats.github.io/2015/04/29/data-analysis-subcultures <p>Roger and I responded to the controversy around the journal that banned p-values today <a href="http://www.nature.com/news/statistics-p-values-are-just-the-tip-of-the-iceberg-1.17412">in Nature.</a> A piece like this requires a lot of information packed into very little space but I thought one idea that deserved to be talked about more was the idea of data analysis subcultures. From the paper:</p> <blockquote> <p>Data analysis is taught through an apprenticeship model, and different disciplines develop their own analysis subcultures. Decisions are based on cultural conventions in specific communities rather than on empirical evidence. For example, economists call data measured over time ‘panel data’, to which they frequently apply mixed-effects models. Biomedical scientists refer to the same type of data structure as ‘longitudinal data’, and often go at it with generalized estimating equations.</p> </blockquote> <p>I think this is one of the least appreciated components of modern data analysis. Data analysis is almost entirely taught through an apprenticeship culture with completely different behaviors taught in different disciplines. All of these disciplines agree about the mathematical optimality of specific methods under very specific conditions. That is why you see <a href="http://psychclassics.yorku.ca/Peirce/small-diffs.htm">methods</a> like <a href="http://en.wikipedia.org/wiki/Statistical_Methods_for_Research_Workers">randomized trials</a> [Roger and I responded to the controversy around the journal that banned p-values today <a href="http://www.nature.com/news/statistics-p-values-are-just-the-tip-of-the-iceberg-1.17412">in Nature.</a> A piece like this requires a lot of information packed into very little space but I thought one idea that deserved to be talked about more was the idea of data analysis subcultures. From the paper:</p> <blockquote> <p>Data analysis is taught through an apprenticeship model, and different disciplines develop their own analysis subcultures. Decisions are based on cultural conventions in specific communities rather than on empirical evidence. For example, economists call data measured over time ‘panel data’, to which they frequently apply mixed-effects models. Biomedical scientists refer to the same type of data structure as ‘longitudinal data’, and often go at it with generalized estimating equations.</p> </blockquote> <p>I think this is one of the least appreciated components of modern data analysis. Data analysis is almost entirely taught through an apprenticeship culture with completely different behaviors taught in different disciplines. All of these disciplines agree about the mathematical optimality of specific methods under very specific conditions. That is why you see <a href="http://psychclassics.yorku.ca/Peirce/small-diffs.htm">methods</a> like <a href="http://en.wikipedia.org/wiki/Statistical_Methods_for_Research_Workers">randomized trials</a>](http://www.ted.com/talks/esther_duflo_social_experiments_to_fight_poverty?language=en) across <a href="http://www.badscience.net/category/evidence-based-policy/">multiple disciplines</a>.</p> <p>But any real data analysis is always a multi-step process involving data cleaning and tidying, exploratory analysis, model fitting and checking, summarization and communication. If you gave someone from economics, biostatistics, statistics, and applied math an identical data set they’d give you back <strong>very</strong> different reports on what they did, why they did it, and what it all meant. Here are a few examples I can think of off the top of my head:</p> <ul> <li>Economics calls longitudinal data panel data and uses mostly linear mixed effects models, while generalized estimating equations are more common in biostatistics (this is the example from Roger/my paper).</li> <li>In genome wide association studies the family wise error rate is the most common error rate to control. In gene expression studies people frequently use the false discovery rate.</li> <li>This is changing a bit, but if you learned statistics at Duke you are probably a Bayesian and if you learned at Berkeley you are probably a frequentist.</li> <li>Psychology has a history of using <a href="http://en.wikipedia.org/wiki/Psychological_statistics">parametric statistics</a>, genomics is big into <a href="http://www.bioconductor.org/packages/release/bioc/html/limma.html">empirical Bayes</a>, and you see a lot of Bayesian statistics in <a href="https://www1.ethz.ch/iac/people/knuttir/papers/meinshausen09nat.pdf">climate studies</a>.</li> <li>You see [Roger and I responded to the controversy around the journal that banned p-values today <a href="http://www.nature.com/news/statistics-p-values-are-just-the-tip-of-the-iceberg-1.17412">in Nature.</a> A piece like this requires a lot of information packed into very little space but I thought one idea that deserved to be talked about more was the idea of data analysis subcultures. From the paper:</li> </ul> <blockquote> <p>Data analysis is taught through an apprenticeship model, and different disciplines develop their own analysis subcultures. Decisions are based on cultural conventions in specific communities rather than on empirical evidence. For example, economists call data measured over time ‘panel data’, to which they frequently apply mixed-effects models. Biomedical scientists refer to the same type of data structure as ‘longitudinal data’, and often go at it with generalized estimating equations.</p> </blockquote> <p>I think this is one of the least appreciated components of modern data analysis. Data analysis is almost entirely taught through an apprenticeship culture with completely different behaviors taught in different disciplines. All of these disciplines agree about the mathematical optimality of specific methods under very specific conditions. That is why you see <a href="http://psychclassics.yorku.ca/Peirce/small-diffs.htm">methods</a> like <a href="http://en.wikipedia.org/wiki/Statistical_Methods_for_Research_Workers">randomized trials</a> [Roger and I responded to the controversy around the journal that banned p-values today <a href="http://www.nature.com/news/statistics-p-values-are-just-the-tip-of-the-iceberg-1.17412">in Nature.</a> A piece like this requires a lot of information packed into very little space but I thought one idea that deserved to be talked about more was the idea of data analysis subcultures. From the paper:</p> <blockquote> <p>Data analysis is taught through an apprenticeship model, and different disciplines develop their own analysis subcultures. Decisions are based on cultural conventions in specific communities rather than on empirical evidence. For example, economists call data measured over time ‘panel data’, to which they frequently apply mixed-effects models. Biomedical scientists refer to the same type of data structure as ‘longitudinal data’, and often go at it with generalized estimating equations.</p> </blockquote> <p>I think this is one of the least appreciated components of modern data analysis. Data analysis is almost entirely taught through an apprenticeship culture with completely different behaviors taught in different disciplines. All of these disciplines agree about the mathematical optimality of specific methods under very specific conditions. That is why you see <a href="http://psychclassics.yorku.ca/Peirce/small-diffs.htm">methods</a> like <a href="http://en.wikipedia.org/wiki/Statistical_Methods_for_Research_Workers">randomized trials</a>](http://www.ted.com/talks/esther_duflo_social_experiments_to_fight_poverty?language=en) across <a href="http://www.badscience.net/category/evidence-based-policy/">multiple disciplines</a>.</p> <p>But any real data analysis is always a multi-step process involving data cleaning and tidying, exploratory analysis, model fitting and checking, summarization and communication. If you gave someone from economics, biostatistics, statistics, and applied math an identical data set they’d give you back <strong>very</strong> different reports on what they did, why they did it, and what it all meant. Here are a few examples I can think of off the top of my head:</p> <ul> <li>Economics calls longitudinal data panel data and uses mostly linear mixed effects models, while generalized estimating equations are more common in biostatistics (this is the example from Roger/my paper).</li> <li>In genome wide association studies the family wise error rate is the most common error rate to control. In gene expression studies people frequently use the false discovery rate.</li> <li>This is changing a bit, but if you learned statistics at Duke you are probably a Bayesian and if you learned at Berkeley you are probably a frequentist.</li> <li>Psychology has a history of using <a href="http://en.wikipedia.org/wiki/Psychological_statistics">parametric statistics</a>, genomics is big into <a href="http://www.bioconductor.org/packages/release/bioc/html/limma.html">empirical Bayes</a>, and you see a lot of Bayesian statistics in <a href="https://www1.ethz.ch/iac/people/knuttir/papers/meinshausen09nat.pdf">climate studies</a>.</li> <li>You see](http://en.wikipedia.org/wiki/White_test) used a lot in econometrics, but that is hardly ever done through formal hypothesis testing in biostatistics.</li> <li>Training sets and test sets are used in machine learning for prediction, but rarely used for inference.</li> </ul> <p>This is just a partial list I thought of off the top of my head, there are a ton more. These decisions matter <strong>a lot</strong> in a data analysis.  The problem is that the behavioral component of a data analysis is incredibly strong, no matter how much we’d like to think of the process as mathematico-theoretical. Until we acknowledge that the most common reason a method is chosen is because, “I saw it in a widely-cited paper in journal XX from my field” it is likely that little progress will be made on resolving the statistical problems in science.</p> Why is there so much university administration? We kind of asked for it. 2015-04-13T17:13:16+00:00 http://simplystats.github.io/2015/04/13/why-is-there-so-much-university-administration-we-kind-of-asked-for-it <p>The latest commentary on the rising cost of college tuition is by Paul F. Campos and is titled <a href="http://www.nytimes.com/2015/04/05/opinion/sunday/the-real-reason-college-tuition-costs-so-much.html">The Real Reason College Tuition Costs So Much</a>. There has been much debate about this article and whether Campos is right or wrong…and I don’t plan to add to that. However, I wanted to pick up on a major point of the article that I felt got left hanging out there: The rising levels of administrative personnel at universities.</p> <p>Campos argues that the reason college tuition is on the rise is not that colleges get less and less money from the government (mostly state government for state schools), but rather that there is an increasing number of administrators at universities that need to be paid in dollars and cents. He cites a study that shows that for the California State University system, in a 34 year period, the number of of faculty rose by about 3% whereas the number of administrators rose by 221%.</p> <p>My initial thinking when I saw the 221% number was “only that much?” I’ve been a faculty member at Johns Hopkins now for about 10 years, and just in that short period I’ve seen the amount of administrative work I need to do go up what feels like at least 221%. Partially, of course, that is a result of climbing up the ranks. As you get more qualified to do administrative work, you get asked to do it! But even adjusting for that, there are quite a few things that faculty need to do now that they weren’t required to do before.  Frankly, I’m grateful for the few administrators that we do have around here to help me out with various things.</p> <p>Campos seems to imply (but doesn’t come out and say) that the bulk of administrators are not necessary. And that if we were to cut these people from the payrolls, that we could reduce tuition down to what it was in the old days. Or at least, it would be cheaper. This argument reminds me about debates over the federal budget: Everyone thinks the budget is too big, but no one wants to suggest something to cut.</p> <p>My point here is that the reason there are so many administrators is that there’s actually quite a bit of administration to do. And the amount of administration that needs to be done has increased over the past 30 years.</p> <p>Just for fun, I decided to go to the <a href="http://webapps.jhu.edu/jhuniverse/administration/">Johns Hopkins University Administration</a> web site to see who all these administrators were.  This site shows the President’s Cabinet and the Deans of the individual schools, which isn’t everybody, but it represents a large chunk. I don’t know all of these people, but I have met and worked with a few of them.</p> <p>For the moment I’m going to skip over individual people because, as much as you might think they are overpaid, no individual’s salary is large enough to move the needle on college tuition. So I’ll stick with people who actually represent large offices with staff. Here’s a sample.</p> <ul> <li><strong>University President</strong>. Call me crazy, but I think the university needs a President. In the U.S. the university President tends to focus on outward facing activities like raising money from various sources, liasoning with the government(s), and pushing university initiatives around the world. This is not something I want to do (but I think it’s necessary), I’d rather have the President take care of it for me.</li> <li> <p><strong>University Provost</strong>. At most universities in the U.S. the Provost is the “senior academic officer”, which means that he/she runs the university. This is a big job, especially at big universities, and require coordinating across a variety of constituencies. Also, at JHU, the Provost’s office deals with a number of compliance related issues like Title IX, accreditation, Americans with Disabilities Act, and many others. I suppose we could save some money by violating federal law, but that seems short-sighted. The people in this office do tough work involving a ton of paper. One example involves online education. Most states in the U.S. say that if you’re going to run an education program in their state, it needs to be approved by some regulatory body. Some states have essentially a reciprocal agreement, so if it’s okay in your state, then it’s okay in their state. But many states require an entire approval process for a program to run in that state. And by “a program” I mean something like an M.S. in Mathematics. If you want to run an M.S. in English that’s another approval, etc. So someone has to go to all the 50 states and D.C. and get approval for every online program that JHU runs in order to enroll students into that program from that state. I think Arkansas actually requires that someone come to Arkansas and testify in person about a program asking for approval.</p> <p>I support online education programs, and I’m glad the Provost’s office is getting all those approvals for us.&lt;/li&gt;</p> <ul> <li><strong>Corporate Security</strong>. This may be a difficult one for some people to understand, but bear in mind that much of Johns Hopkins is located in East Baltimore. If you’ve ever seen the TV show <a href="http://en.wikipedia.org/wiki/The_Wire">The Wire</a>, then you know why we need corporate security.</li> <li><strong>Facilities and Real Estate</strong>. Johns Hopkins owns and deals with a lot of real estate; it’s a big organization. Who is supposed to take care of all that? For example, we just installed a brand new supercomputer jointly with the University of Maryland, called <a href="http://marcc.jhu.edu">MARCC</a>. I’m really excited to use this supercomputer for research, but systems like this require a bit of space. A lot of space actually. So we needed to get some land to put it on. If you’ve ever bought a house, you know how much paperwork is involved.</li> <li><strong>Development and Alumni Relations</strong>. I have a new appreciation for this office now that I co-direct a <a href="https://www.coursera.org/specialization/jhudatascience/1">program</a> that has enrolled over 1.5 million people in just over a year. It’s critically important that we keep track of our students for many reasons: tracking student careers and success, tapping them to mentor current students, developing relationships with organizations that they’re connected to are just a few.</li> <li><strong>General Counsel</strong>. I’m not he lawbreaking type, so I need lawyers to help me out.</li> <li><strong>Enterprise Development</strong>. This office involves, among other things, technology transfer, which I have recently been involved with quite a bit for my role in the Data Science Specialization offered through Coursera. This is just to say that I personally benefit from this office. I’ve heard people say that universities shouldn’t be involved in tech transfer, but Bayh-Dole is what it is and I think Johns Hopkins should play by the same rules as everyone else. I’m not interested in filing patents, trademarks, and copyrights, so it’s good to have people doing that for me.&lt;/ul&gt;</li> </ul> <p>Okay, that’s just a few offices, but you get the point. These administrators seem to be doing a real job (imagine that!) and actually helping out the university. Many of these people are actually helping <em>me</em> out. Some of these jobs are essentially required by the existence of federal laws, and so we need people like this.</p> <p>So, just to recap, I think there are in fact more administrators in universities than there used to be. Is this causing an increase in tuition? It’s possible, but it’s probably not the only cause. If you believe the CSU study, there was about a 3.5% annual increase in the number of administrators each year from 1975 to 2008. College tuition during that time period went up <a href="http://trends.collegeboard.org/college-pricing/figures-tables/average-rates-growth-published-charges-decade">around 4% per year</a> (inflation adjusted). But even so, much of this administration needs to be done (because faculty don’t want to do it), so this is a difficult path to go down if you’re looking for ways to lower tuition.</p> <p>Even if we’ve found the smoking gun, the question is what do we do about it?</p> </li> </ul> Genomics Case Studies Online Courses Start in Two Weeks (4/27) 2015-04-13T10:00:29+00:00 http://simplystats.github.io/2015/04/13/genomics-case-studies-online-courses-start-in-two-weeks-427 <p>The last month of the <a href="http://genomicsclass.github.io/book/pages/classes.html">HarvardX Data Analysis for Genomics series</a> start on 4/27. We will cover case studies on RNAseq, Variant calling, ChipSeq and DNA methylation. Faculty includes Shirley Liu, Mike Love, Oliver Hoffman and the HSPH Bioinformatics Core. Although taking the previous courses on the series will help, the four case study courses were developed as stand alone and you can obtain a certificate for each one without taking any other course.</p> <p>Each course is presented over two weeks but will remain open until June 13 to give students an opportunity to take them all if they wish. For more information follow the links listed below.</p> <ol> <li><a href="https://www.edx.org/course/case-study-rna-seq-data-analysis-harvardx-ph525-5x">RNA-seq data analysis</a> will be lead by Mike Love</li> <li><a href="https://www.edx.org/course/case-study-variant-discovery-and-genotyping-harvardx-ph525-6x">Variant Discovery and Genotyping</a> will be taught by Shannan Ho Sui, Oliver Hofmann, Radhika Khetani and Meeta Mistry (from the The HSPH Bioinformatics Core)</li> <li><a href="https://www.edx.org/course/case-study-chip-seq-data-analysis-harvardx-ph525-7x">ChIP-seq data analysis</a> will be lead by Shirley Liu</li> <li><a href="https://www.edx.org/course/case-study-dna-methylation-data-analysis-harvardx-ph525-8x">DNA methylation data analysis</a> will be lead by Rafael Irizarry</li> </ol> A blessing of dimensionality often observed in high-dimensional data sets 2015-04-09T15:19:13+00:00 http://simplystats.github.io/2015/04/09/a-blessing-of-dimensionality-often-observed-in-high-dimensional-data-sets <p><a href="http://www.jstatsoft.org/v59/i10/paper"></a> have one observation per row and one variable per column.  Using this definition, big data sets can be either:</p> <ol> <li><strong>Wide</strong> - a wide data set has a large number of measurements per observation, but fewer observations. This type of data set is typical in neuroimaging, genomics, and other biomedical applications.</li> <li><strong>Tall</strong> - a tall data set has a large number of observations, but fewer measurements. This is the typical setting in a large clinical trial or in a basic social network analysis.</li> </ol> <p>The <a href="http://en.wikipedia.org/wiki/Curse_of_dimensionality">curse of dimensionality</a> tells us that estimating some quantities gets harder as the number of dimensions of a data set increases - as the data gets taller or wider. An example of this was <a href="http://simplystatistics.org/2014/10/24/an-interactive-visualization-to-teach-about-the-curse-of-dimensionality/">nicely illustrated</a> by my student Prasad (although it looks like his quota may be up on Rstudio).</p> <p>For wide data sets there is also a blessing of dimensionality. The basic reason for the blessing of dimensionality is that:</p> <blockquote> <p>No matter how many new measurements you take on a small set of observations, the number of observations and all of their characteristics are fixed.</p> </blockquote> <p>As an example, suppose that we make measurements on 10 people. We start out by making one measurement (blood pressure), then another (height), then another (hair color) and we keep going and going until we have one million measurements on those same 10 people. The blessing occurs because the measurements on those 10 people will all be related to each other. If 5 of the people are women and 5 or men, then any measurement that has a relationship with sex will be highly correlated with any other measurement that has a relationship with sex. So by knowing one small bit of information, you can learn a lot about many of the different measurements.</p> <p>This blessing of dimensionality is the key idea behind many of the statistical approaches to wide data sets whether it is stated explicitly or not. I thought I’d make a very short list of some of these ideas:</p> <p><strong>1. Idea: </strong><a href="http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3841439/">De-convolving mixed observations from high-dimensional data. </a></p> <p><strong>How the blessing plays a role: </strong>The measurements for each observation are assumed to be a mixture of values measured from different observation types. The proportion of each observation type is assumed to be fixed across measurements, so you can take advantage of the multiple measurements to estimate the mixing percentage and perform the deconvolution. (<a href="http://odin.mdacc.tmc.edu/~wwang7/">Wenyi Wang</a> came and gave an excellent seminar on this idea at JHU a couple of days ago, which inspired this post).</p> <p><strong>2. Idea:</strong> <a href="http://biostatistics.oxfordjournals.org/content/5/2/155.short">The two groups model for false discovery rates</a>.</p> <p><strong>How the blessing plays a role: </strong> The models assume that a hypothesis test is performed for each observation and that the probability any observation is drawn from the null, the null distribution, and the alternative distributions are common across observations. If the null is assumed known, then it is possible to use the known null distribution to estimate the common probability that an observation is drawn from the null.</p> <p> </p> <p><strong>3. Idea: </strong><a href="http://www.degruyter.com/view/j/sagmb.2004.3.issue-1/sagmb.2004.3.1.1027/sagmb.2004.3.1.1027.xml">Empirical Bayes variance shrinkage for linear models</a></p> <p><strong>How the blessing plays a role: </strong> A linear model is fit for each observation and the means and variances of the log ratios calculated from the model are assumed to follow a common distribution across observations. The method estimates the hyper-parameters of these common distributions and uses them to adjust any individual measurement’s estimates.</p> <p> </p> <p><strong>4. Idea: </strong><a href="http://journals.plos.org/plosgenetics/article?id=10.1371/journal.pgen.0030161">Surrogate variable analysis</a></p> <p><strong>How the blessing plays a role: </strong> Each observation is assumed to be influenced by a single variable of interest (a primary variable) and multiple unmeasured confounders. Since the observations are fixed, the values of the unmeasured confounders are the same for each measurement and a supervised PCA can be used to estimate surrogates for the confounders. (<a href="http://www.slideshare.net/jtleek/jhu-feb2009">see my JHU job talk for more on the blessing</a>)</p> <p> </p> <p>The blessing of dimensionality I’m describing here is related to the idea that <a href="http://andrewgelman.com/2004/10/27/the_blessing_of/">Andrew Gelman refers to in this 2004 post.</a>  Basically, since increasingly large number of measurements are made on the same observations there is an inherent structure to those observations. If you take advantage of that structure, then as the dimensionality of your problem increases you actually get <strong>better estimates</strong> of the structure in your high-dimensional data - a nice blessing!</p> How to Get Ahead in Academia 2015-04-09T13:38:01+00:00 http://simplystats.github.io/2015/04/09/how-to-get-ahead-in-academia <p>This video on how to make it in academia was produced over 10 years ago by Steven Goodman for the ENAR Junior Researchers Workshop. Now the whole world can benefit from its wisdom.</p> <p>The movie features current and former JHU Biostatistics faculty, including Francesca Dominici, Giovanni Parmigiani, Scott Zeger, and Tom Louis. You don’t want to miss Scott Zeger’s secret formula for getting promoted!</p> Why You Need to Study Statistics 2015-04-02T21:42:06+00:00 http://simplystats.github.io/2015/04/02/why-you-need-to-study-statistics <p>The American Statistical Association is continuing its campaign to get you to study statistics, if you haven’t already. I have to agree with them that being a statistician is a pretty good job. Their latest video highlights a wide range of statisticians working in industry, government, and academia. You can check it out here:</p> Teaser trailer for the Genomic Data Science Specialization on Coursera 2015-03-26T10:06:43+00:00 http://simplystats.github.io/2015/03/26/teaser-trailer-for-the-genomic-data-science-specialization-on-coursera <p> </p> <p>We have been hard at work in the studio putting together our next specialization to launch on Coursera. It will be called the “Genomic Data Science Specialization” and includes a spectacular line up of instructors: <a href="http://salzberg-lab.org/">Steven Salzberg</a>, <a href="http://ccb.jhu.edu/people/mpertea/">Ela Pertea</a>, <a href="http://jamestaylor.org/">James Taylor</a>, <a href="http://ccb.jhu.edu/people/florea/">Liliana Florea</a>, <a href="http://www.hansenlab.org/">Kasper Hansen</a>, and me. The specialization will cover command line tools, statistics, Galaxy, Bioconductor, and Python. There will be a capstone course at the end of the sequence featuring an in-depth genomic analysis. If you are a grad student, postdoc, or principal investigator in a group that does genomics this specialization is for you. If you are a person looking to transition into one of the hottest areas of research with the new precision medicine initiative this is for you. Get pumped and share the teaser-trailer with your friends!</p> Introduction to Bioconductor HarvardX MOOC starts this Monday March 30 2015-03-24T09:24:27+00:00 http://simplystats.github.io/2015/03/24/introduction-to-bioconductor-harvardx-mooc-starts-this-monday-march-30 <p>Bioconductor is one of the most widely used open source toolkits for biological high-throughput data. In this four week course, co-taught with Vince Carey and Mike Love, we will introduce you to Bioconductor’s general infrastructure and then focus on two specific technologies: next generation sequencing and microarrays. The lectures and assessments will be annotated in case you want to focus only on one of these two technologies. Although if you plan to be a bioinformatician we recommend you learn both.</p> <p>Topics covered include:</p> <ul> <li>A short introduction to molecular biology and measurement technology</li> <li>An overview on how to leverage the platform and genome annotation packages and experimental archives</li> <li>GenomicsRanges: the infrastructure for storing, manipulating and analyzing next generation sequencing data</li> <li>Parallel computing and cloud concepts</li> <li>Normalization, preprocessing and bias correction.</li> <li>Statistical inference in practice: including hierarchical models and gene set enrichment analysis</li> <li>Building statistical analysis pipelines of genome-scale assays including the creation of reproducible reports</li> </ul> <p>Throughout the class we will be using data examples from both next generation sequencing and microarray experiments.</p> <p>We will assume <a href="https://www.edx.org/course/statistics-r-life-sciences-harvardx-ph525-1x">basic knowledge of Statistics and R</a>.</p> <p>For more information visit the <a href="https://www.edx.org/course/introduction-bioconductor-harvardx-ph525-4x">course website</a>.</p> A surprisingly tricky issue when using genomic signatures for personalized medicine 2015-03-19T13:06:32+00:00 http://simplystats.github.io/2015/03/19/a-surprisingly-tricky-issue-when-using-genomic-signatures-for-personalized-medicine <p>My student Prasad Patil has a really nice paper that <a href="http://bioinformatics.oxfordjournals.org/content/early/2015/03/18/bioinformatics.btv157.full.pdf?keytype=ref&amp;ijkey=loVpUJfJxG2QjoE">just came out in Bioinformatics</a> (<a href="http://biorxiv.org/content/early/2014/06/06/005983">preprint</a> in case paywalled). The paper is about a surprisingly tricky normalization issue with genomic signatures. Genomic signatures are basically statistical/machine learning functions applied to the measurements for a set of genes to predict how long patients will survive, or how they will respond to therapy. The issue is that usually when building and applying these signatures, people normalize across samples in the training and testing set.</p> <p>An example of this normalization is to mean-center the measurements for each gene in the testing/application stage, then apply the prediction rule. The problem is that if you use a different set of samples when calculating the mean you can get a totally different prediction function. The basic problem is illustrated in this graphic.</p> <p> </p> <p><a href="http://simplystatistics.org/wp-content/uploads/2015/03/Screen-Shot-2015-03-19-at-12.58.03-PM.png"><img class="aligncenter wp-image-3947 size-medium" src="http://simplystatistics.org/wp-content/uploads/2015/03/Screen-Shot-2015-03-19-at-12.58.03-PM-300x227.png" alt="Screen Shot 2015-03-19 at 12.58.03 PM" width="300" height="227" srcset="http://simplystatistics.org/wp-content/uploads/2015/03/Screen-Shot-2015-03-19-at-12.58.03-PM-300x227.png 300w, http://simplystatistics.org/wp-content/uploads/2015/03/Screen-Shot-2015-03-19-at-12.58.03-PM-260x197.png 260w" sizes="(max-width: 300px) 100vw, 300px" /></a></p> <p> </p> <p>This seems like a pretty esoteric statistical issue, but it turns out that this one simple normalization problem can dramatically change the results of the predictions. In particular, we show that the predictions for the same patient, with the exact same data, can change dramatically if you just change the subpopulations of patients within the testing set. In this plot, Prasad made predictions for the exact same set of patients two times when the patient population varied in ER status composition. As many as 30% of the predictions were different for the same patient with the same data if you just varied who they were being predicted with.</p> <p><a href="http://simplystatistics.org/wp-content/uploads/2015/03/Screen-Shot-2015-03-19-at-1.02.25-PM.png"><img class="aligncenter wp-image-3948 size-full" src="http://simplystatistics.org/wp-content/uploads/2015/03/Screen-Shot-2015-03-19-at-1.02.25-PM.png" alt="Screen Shot 2015-03-19 at 1.02.25 PM" width="494" height="277" srcset="http://simplystatistics.org/wp-content/uploads/2015/03/Screen-Shot-2015-03-19-at-1.02.25-PM-300x168.png 300w, http://simplystatistics.org/wp-content/uploads/2015/03/Screen-Shot-2015-03-19-at-1.02.25-PM.png 494w" sizes="(max-width: 494px) 100vw, 494px" /></a></p> <p> </p> <p>This paper highlights how tricky statistical issues can slow down the process of translating ostensibly really useful genomic signatures into clinical practice and lends even more weight to the idea that precision medicine is a statistical field.</p> A simple (and fair) way all statistics journals could drive up their impact factor. 2015-03-18T16:32:10+00:00 http://simplystats.github.io/2015/03/18/a-simple-and-fair-way-all-statistics-journals-could-drive-up-their-impact-factor <p>Hypothesis:</p> <blockquote> <p>If every method in every stats journal was implemented in a corresponding R package (<a href="http://hilaryparker.com/2014/04/29/writing-an-r-package-from-scratch/">easy</a>), was required to have a  companion document that was a tutorial on how to use the software (<a href="http://www.bioconductor.org/help/package-vignettes/">easy</a>), included a reference to how to cite the paper if you used the software (<a href="http://www.inside-r.org/r-doc/utils/citation">easy</a>) and the paper/tutorial was posted to the relevant message boards for the communities of interest (<a href="http://seqanswers.com/forums/showthread.php?t=42018">easy</a>) that journal would see a dramatic bump in its impact factor.</p> </blockquote> Data science done well looks easy - and that is a big problem for data scientists 2015-03-17T10:47:12+00:00 http://simplystats.github.io/2015/03/17/data-science-done-well-looks-easy-and-that-is-a-big-problem-for-data-scientists <p>Data science has a ton of different definitions. For the purposes of this post I’m going to use the definition of data science we used when creating our Data Science program online. Data science is:</p> <blockquote> <p>Data science is the process of formulating a quantitative question that can be answered with data, collecting and cleaning the data, analyzing the data, and communicating the answer to the question to a relevant audience.</p> </blockquote> <p>In general the data science process is iterative and the different components blend together a little bit. But for simplicity lets discretize the tasks into the following 7 steps:</p> <ol> <li>Define the question of interest</li> <li>Get the data</li> <li>Clean the data</li> <li>Explore the data</li> <li>Fit statistical models</li> <li>Communicate the results</li> <li>Make your analysis reproducible</li> </ol> <p>A good data science project answers a real scientific or business analytics question. In almost all of these experiments the vast majority of the analyst’s time is spent on getting and cleaning the data (steps 2-3) and communication and reproducibility (6-7). In most cases, if the data scientist has done her job right the statistical models don’t need to be incredibly complicated to identify the important relationships the project is trying to find. In fact, if a complicated statistical model seems necessary, it often means that you don’t have the right data to answer the question you really want to answer. One option is to spend a huge amount of time trying to tune a statistical model to try to answer the question but serious data scientist’s usually instead try to go back and get the right data.</p> <p>The result of this process is that most well executed and successful data science projects don’t (a) use super complicated tools or (b) fit super complicated statistical models. The characteristics of the most successful data science projects I’ve evaluated or been a part of are: (a) a laser focus on solving the scientific problem, (b) careful and thoughtful consideration of whether the data is the right data and whether there are any lurking confounders or biases and (c) relatively simple statistical models applied and interpreted skeptically.</p> <p>It turns out doing those three things is actually surprisingly hard and very, very time consuming. It is my experience that data science projects take a solid 2-3 times as long to complete as a project in theoretical statistics. The reason is that inevitably the data are a mess and you have to clean them up, then you find out the data aren’t quite what you wanted to answer the question, so you go find a new data set and clean it up, etc. After a ton of work like that, you have a nice set of data to which you fit simple statistical models and then it looks <strong>super easy </strong>to someone who either doesn’t know about the data collection and cleaning process or doesn’t care.</p> <p>This poses a major public relations problem for serious data scientists. When you show someone a good data science project they almost invariably think “oh that is easy” or “that is just a trivial statistical/machine learning model” and don’t see all of the work that goes into solving the real problems in data science. A concrete example of this is in academic statistics. It is customary for people to show theorems in their talks and maybe even some of the proof. This gives people working on theoretical projects an opportunity to “show their stuff” and demonstrate how good they are. The equivalent for a data scientist would be showing how they found and cleaned multiple data sets, merged them together, checked for biases, and arrived at a simplified data set. Showing the “proof” would be equivalent to showing how they matched IDs. These things often don’t look nearly as impressive in talks, particularly if the audience doesn’t have experience with how incredibly delicate real data analysis is. I imagine versions of this problem play out in industry as well (candidate X did a good analysis but it wasn’t anything special, candidate Y used Hadoop to do BIG DATA!).</p> <p>The really tricky twist is that bad data science looks easy too. You can scrape a data set off the web and slap a machine learning algorithm on it no problem. So how do you judge whether a data science project is really “hard” and whether the data scientist is an expert? Just like with anything, there is no easy shortcut to evaluating data science projects. You have to ask questions about the details of how the data were collected, what kind of biases might exist, why they picked one data set over another, etc.  In the meantime, don’t be fooled by what looks like simple data science - <a href="http://fivethirtyeight.com/interactives/senate-forecast/">it can often be pretty effective</a>.</p> <p> </p> <p><em>Editor’s note: If you like this post, you might like my pay-what-you-want book Elements of Data Analytic Style: <a href="https://leanpub.com/datastyle">https://leanpub.com/datastyle</a></em></p> <p> </p> π day special: How to use Bioconductor to find empirical evidence in support of π being a normal number 2015-03-14T10:15:10+00:00 http://simplystats.github.io/2015/03/14/%cf%80-day-special-how-to-use-bioconductor-to-find-empirical-evidence-in-support-of-%cf%80-being-a-normal-number <p><em>Editor’s note: Today 3/14/15 at some point between  9:26:53 and 9:26:54 it was the most π day of them all. Below is a repost from last year. </em></p> <p>Happy π day everybody!</p> <p>I wanted to write some simple code (included below) to the test parallelization capabilities of my  new cluster. So, in honor of  π day, I decided to check for <a href="http://www.davidhbailey.com/dhbpapers/normality.pdf">evidence that π is a normal number</a>. A <a href="http://en.wikipedia.org/wiki/Normal_number">normal number</a> is a real number whose infinite sequence of digits has the property that picking any given random m digit pattern is 10<sup>−m</sup>. For example, using the Poisson approximation, we can predict that the pattern “123456789” should show up between 0 and 3 times in the <a href="http://stuff.mit.edu/afs/sipb/contrib/pi/">first billion digits of π</a> (it actually shows up twice starting, at the 523,551,502-th and  773,349,079-th decimal places).</p> <p>To test our hypothesis, let Y<sub>1</sub>, …, Y<sub>100</sub> be the number of “00”, “01”, …,”99” in the first billion digits of  π. If  π is in fact normal then the Ys should be approximately IID binomials with N=1 billon and p=0.01.  In the qq-plot below I show Z-scores (Y - 10,000,000) /  √9,900,000) which appear to follow a normal distribution as predicted by our hypothesis. Further evidence for π being normal is provided by repeating this experiment for 3,4,5,6, and 7 digit patterns (for 5,6 and 7 I sampled 10,000 patterns). Note that we can perform a chi-square test for the uniform distribution as well. For patterns of size 1,2,3 the p-values were 0.84, <del>0.89,</del> 0.92, and 0.99.</p> <p><a href="http://simplystatistics.org/2014/03/14/using-bioconductor-to-find-some-empirical-evidence-in-support-of-%cf%80-being-a-normal-number/pi-3/" rel="attachment wp-att-2792"><img class="alignnone size-full wp-image-2792" src="http://simplystatistics.org/wp-content/uploads/2014/03/pi2.png" alt="pi" width="4800" height="3000" srcset="http://simplystatistics.org/wp-content/uploads/2014/03/pi2-300x187.png 300w, http://simplystatistics.org/wp-content/uploads/2014/03/pi2-1024x640.png 1024w, http://simplystatistics.org/wp-content/uploads/2014/03/pi2.png 4800w" sizes="(max-width: 4800px) 100vw, 4800px" /></a></p> <p>Another test we can perform is to divide the 1 billion digits into 100,000 non-overlapping segments of length 10,000. The vector of counts for any given pattern should also be binomial. Below I also include these qq-plots.</p> <p><a href="http://simplystatistics.org/2014/03/14/using-bioconductor-to-find-some-empirical-evidence-in-support-of-%cf%80-being-a-normal-number/pi2/" rel="attachment wp-att-2793"><img class="alignnone size-full wp-image-2793" src="http://simplystatistics.org/wp-content/uploads/2014/03/pi21.png" alt="pi2" width="5600" height="3000" srcset="http://simplystatistics.org/wp-content/uploads/2014/03/pi21-1024x548.png 1024w, http://simplystatistics.org/wp-content/uploads/2014/03/pi21.png 5600w" sizes="(max-width: 5600px) 100vw, 5600px" /></a></p> <p>These observed counts should also be independent, and to explore this we can look at autocorrelation plots:</p> <p><a href="http://simplystatistics.org/2014/03/14/using-bioconductor-to-find-some-empirical-evidence-in-support-of-%cf%80-being-a-normal-number/piacf-2/" rel="attachment wp-att-2794"><img class="alignnone size-full wp-image-2794" src="http://simplystatistics.org/wp-content/uploads/2014/03/piacf1.png" alt="piacf" width="5600" height="3000" srcset="http://simplystatistics.org/wp-content/uploads/2014/03/piacf1-1024x548.png 1024w, http://simplystatistics.org/wp-content/uploads/2014/03/piacf1.png 5600w" sizes="(max-width: 5600px) 100vw, 5600px" /></a></p> <p>To do this in about an hour and with just a few lines of code (included below), I used the <a href="http://www.bioconductor.org/">Bioconductor</a> <a href="http://www.bioconductor.org/packages/release/bioc/html/Biostrings.html">Biostrings</a> package to match strings and the <em>foreach</em> function to parallelize.</p> <p><em>Editor’s note: Today 3/14/15 at some point between  9:26:53 and 9:26:54 it was the most π day of them all. Below is a repost from last year. </em></p> <p>Happy π day everybody!</p> <p>I wanted to write some simple code (included below) to the test parallelization capabilities of my  new cluster. So, in honor of  π day, I decided to check for <a href="http://www.davidhbailey.com/dhbpapers/normality.pdf">evidence that π is a normal number</a>. A <a href="http://en.wikipedia.org/wiki/Normal_number">normal number</a> is a real number whose infinite sequence of digits has the property that picking any given random m digit pattern is 10<sup>−m</sup>. For example, using the Poisson approximation, we can predict that the pattern “123456789” should show up between 0 and 3 times in the <a href="http://stuff.mit.edu/afs/sipb/contrib/pi/">first billion digits of π</a> (it actually shows up twice starting, at the 523,551,502-th and  773,349,079-th decimal places).</p> <p>To test our hypothesis, let Y<sub>1</sub>, …, Y<sub>100</sub> be the number of “00”, “01”, …,”99” in the first billion digits of  π. If  π is in fact normal then the Ys should be approximately IID binomials with N=1 billon and p=0.01.  In the qq-plot below I show Z-scores (Y - 10,000,000) /  √9,900,000) which appear to follow a normal distribution as predicted by our hypothesis. Further evidence for π being normal is provided by repeating this experiment for 3,4,5,6, and 7 digit patterns (for 5,6 and 7 I sampled 10,000 patterns). Note that we can perform a chi-square test for the uniform distribution as well. For patterns of size 1,2,3 the p-values were 0.84, <del>0.89,</del> 0.92, and 0.99.</p> <p><a href="http://simplystatistics.org/2014/03/14/using-bioconductor-to-find-some-empirical-evidence-in-support-of-%cf%80-being-a-normal-number/pi-3/" rel="attachment wp-att-2792"><img class="alignnone size-full wp-image-2792" src="http://simplystatistics.org/wp-content/uploads/2014/03/pi2.png" alt="pi" width="4800" height="3000" srcset="http://simplystatistics.org/wp-content/uploads/2014/03/pi2-300x187.png 300w, http://simplystatistics.org/wp-content/uploads/2014/03/pi2-1024x640.png 1024w, http://simplystatistics.org/wp-content/uploads/2014/03/pi2.png 4800w" sizes="(max-width: 4800px) 100vw, 4800px" /></a></p> <p>Another test we can perform is to divide the 1 billion digits into 100,000 non-overlapping segments of length 10,000. The vector of counts for any given pattern should also be binomial. Below I also include these qq-plots.</p> <p><a href="http://simplystatistics.org/2014/03/14/using-bioconductor-to-find-some-empirical-evidence-in-support-of-%cf%80-being-a-normal-number/pi2/" rel="attachment wp-att-2793"><img class="alignnone size-full wp-image-2793" src="http://simplystatistics.org/wp-content/uploads/2014/03/pi21.png" alt="pi2" width="5600" height="3000" srcset="http://simplystatistics.org/wp-content/uploads/2014/03/pi21-1024x548.png 1024w, http://simplystatistics.org/wp-content/uploads/2014/03/pi21.png 5600w" sizes="(max-width: 5600px) 100vw, 5600px" /></a></p> <p>These observed counts should also be independent, and to explore this we can look at autocorrelation plots:</p> <p><a href="http://simplystatistics.org/2014/03/14/using-bioconductor-to-find-some-empirical-evidence-in-support-of-%cf%80-being-a-normal-number/piacf-2/" rel="attachment wp-att-2794"><img class="alignnone size-full wp-image-2794" src="http://simplystatistics.org/wp-content/uploads/2014/03/piacf1.png" alt="piacf" width="5600" height="3000" srcset="http://simplystatistics.org/wp-content/uploads/2014/03/piacf1-1024x548.png 1024w, http://simplystatistics.org/wp-content/uploads/2014/03/piacf1.png 5600w" sizes="(max-width: 5600px) 100vw, 5600px" /></a></p> <p>To do this in about an hour and with just a few lines of code (included below), I used the <a href="http://www.bioconductor.org/">Bioconductor</a> <a href="http://www.bioconductor.org/packages/release/bioc/html/Biostrings.html">Biostrings</a> package to match strings and the <em>foreach</em> function to parallelize.</p> <p>`</p> <p>NB: A normal number has the above stated property in any base. The examples above a for base 10.</p> De-weaponizing reproducibility 2015-03-13T10:24:05+00:00 http://simplystats.github.io/2015/03/13/de-weaponizing-reproducibility <div> A couple of weeks ago Roger and I went to a <a href="http://sites.nationalacademies.org/DEPS/BMSA/DEPS_153236">conference on statistical reproducibility </a>held at the National Academy of Sciences. The discussion was pretty wide ranging and I love that the thinking about reproducibility is coming back to statistics. There was pretty widespread support for the idea that prevention is the <a href="http://arxiv.org/abs/1502.03169">right way to approach reproducibility</a>. </div> <div> </div> <div> It turns out I was the last speaker of the whole conference. This is an unenviable position to be in with so many bright folks speaking first as they covered a huge amount of what I wanted to say. <a href="http://www.slideshare.net/jtleek/evidence-based-data-analysis">My talk focused on three key points:</a> </div> <div> </div> <ol> <li>The tools for reproducibility already exist, the barrier isn’t tools</li> <li>We need to de-weaponize reproducibility</li> <li>Prevention is the right approach to reproducibility</li> </ol> <p> </p> <p>In terms of the first point, <a href="http://simplystatistics.org/2014/09/04/why-the-three-biggest-positive-contributions-to-reproducible-research-are-the-ipython-notebook-knitr-and-galaxy/">tools like iPython, knitr, and Galaxy </a>can be used to all but the absolutely largest analysis reproducible right now.  Our group does this all the time with our papers and so do many others. The problem isn’t a lack of tools.</p> <p>Speaking to point two, I think many people would agree that part of the issue is culture change. One issue that is increasingly concerning to me is the “weaponization” of reproducibility.  I have been noticing is that some of us (like me, my students, other folks at JHU, and lots of particularly junior computational people elsewhere) are trying really hard to be reproducible. Most of the time this results in really positive reactions from the community. But when a co-author of mine and I wrote that paper about the <a href="http://biostatistics.oxfordjournals.org/content/early/2013/09/24/biostatistics.kxt007.abstract">science-wise false discovery rate</a>, one of the discussants used our code (great), improved on it (great), identified a bug (great), and then did his level best to humiliate us both in front of the editor and the general public because of that bug (<a href="http://simplystatistics.org/2013/09/26/how-could-code-review-discourage-code-disclosure-reviewers-with-motivation/">not so great</a>).</p> <div> </div> <div> I have seen this happen several times. Most of the time if a paper is reproducible the authors get a pat on the back and their code is either ignored, or used in a positive way. But for high-profile and important problems, people  largely use eproducibility to: </div> <div> </div> <ol> <li> Impose regulatory hurdles in the short term while people transition to reproducibility. One clear example of this is the <a href="https://www.congress.gov/bill/113th-congress/house-bill/4012">Secret Science Reform Act</a> which is a bill that imposes strict reproducibility conditions on all science before it can be used as evidence for regulation.</li> <li>Humiliate people who aren’t good coders or who make mistakes in their code. This is what happened in my paper when I produced reproducible code for my analysis, but has also happened <a href="http://simplystatistics.org/2014/01/28/marie-curie-says-stop-hating-on-quilt-plots-already/">to other people</a>.</li> <li>Take advantage of people’s code to plagiarize/straight up steal work. I have stories about this I’d rather not put on the internet</li> </ol> <p> </p> <p>Of the three, I feel like (1) and (2) are the most common. Plagiarism and scooping by theft I think are actually relatively rare based on my own anecdotal experience. But I think that the “weaponization” of reproducibility to block regulation or to humiliate folks who are new to computational sciences is more common than I’d like it to be. Until reproducibility is the standard for everyone - which I think is possible now and will happen as the culture changes -  the people who are the early adopters are at risk of being bludgeoned with their own reproducibility. As a community, if we want widespread reproducibility adoption we have to be ferocious about not allowing this to happen.</p> The elements of data analytic style - so much for a soft launch 2015-03-03T11:22:28+00:00 http://simplystats.github.io/2015/03/03/the-elements-of-data-analytic-style-so-much-for-a-soft-launch <p><em>Editor’s note: I wrote a book called Elements of Data Analytic Style. Buy it on <a href="https://leanpub.com/datastyle">Leanpub</a> or <a href="http://www.amazon.com/Elements-Data-Analytic-Style-ebook/dp/B00U6D80YY/ref=sr_1_1?ie=UTF8&amp;qid=1425397222&amp;sr=8-1&amp;keywords=elements+of+data+analytic+style">Amazon</a>! If you buy it on Leanpub, you get all updates (there are likely to be some) for free and you can pay what you want (including zero) but the author would be appreciative if you’d throw a little scratch his way. </em></p> <p>So uh, I was going to soft launch my new book The Elements of Data Analytic Style yesterday. I figured I’d just quietly email my Coursera courses to let them know I created a new reference. It turns out that that wasn’t very quiet. First this happened:</p> <blockquote class="twitter-tweet" width="550"> <p> <a href="https://twitter.com/jtleek">@jtleek</a> <a href="https://twitter.com/albertocairo">@albertocairo</a> <a href="https://twitter.com/simplystats">@simplystats</a> Instabuy. And apparently not just for me: it looks like you just Slashdotted <a href="https://twitter.com/leanpub">@leanpub</a>'s website. </p> <p> &mdash; Andrew Janke (@AndrewJanke) <a href="https://twitter.com/AndrewJanke/status/572474567467401216">March 2, 2015</a> </p> </blockquote> <p> </p> <p>and sure enough the website was down:</p> <p> </p> <p><a href="http://simplystatistics.org/wp-content/uploads/2015/03/Screen-Shot-2015-03-02-at-2.14.05-PM.png"><img class="aligncenter wp-image-3919 size-medium" src="http://simplystatistics.org/wp-content/uploads/2015/03/Screen-Shot-2015-03-02-at-2.14.05-PM-300x202.png" alt="Screen Shot 2015-03-02 at 2.14.05 PM" width="300" height="202" srcset="http://simplystatistics.org/wp-content/uploads/2015/03/Screen-Shot-2015-03-02-at-2.14.05-PM-300x202.png 300w, http://simplystatistics.org/wp-content/uploads/2015/03/Screen-Shot-2015-03-02-at-2.14.05-PM-1024x690.png 1024w, http://simplystatistics.org/wp-content/uploads/2015/03/Screen-Shot-2015-03-02-at-2.14.05-PM-260x175.png 260w" sizes="(max-width: 300px) 100vw, 300px" /></a></p> <p> </p> <p> </p> <p>then overnight it did something like 6,000+ units:</p> <p> </p> <p><a href="http://simplystatistics.org/wp-content/uploads/2015/03/whoacoursera.png"><img class="aligncenter wp-image-3920 size-medium" src="http://simplystatistics.org/wp-content/uploads/2015/03/whoacoursera-300x300.png" alt="whoacoursera" width="300" height="300" srcset="http://simplystatistics.org/wp-content/uploads/2015/03/whoacoursera-300x300.png 300w, http://simplystatistics.org/wp-content/uploads/2015/03/whoacoursera-200x200.png 200w, http://simplystatistics.org/wp-content/uploads/2015/03/whoacoursera.png 480w" sizes="(max-width: 300px) 100vw, 300px" /></a></p> <p> </p> <p> </p> <p>So lesson learned, there is no soft open with Coursera. Here is the post I was going to write though:</p> <p> </p> <p>### Post I was gonna write</p> <p>I have been doing data analysis for something like 10 years now (gulp!) and teaching data analysis in person for 6+ years. One of the things we do in <a href="https://github.com/jtleek/jhsph753and4">my data analysis class at Hopkins</a> is to perform a complete data analysis (from raw data to written report) every couple of weeks. Then I grade each assignment for everything from data cleaning to the written report and reproducibility. I’ve noticed over the course of teaching this class (and classes online) that there are many common elements of data analytic style that I don’t often see in textbooks, or when I do, I see them spread across multiple books.</p> <p>I’ve posted on some of these issues in some open source guides I’ve posted to Github like:</p> <ul> <li><a href="http://simplystatistics.org/2014/05/22/10-things-statistics-taught-us-about-big-data-analysis/" target="_self">10 things statistics taught us about big data analysis</a></li> <li><a href="https://github.com/jtleek/rpackages" target="_self">The Leek Group Guide to R packages</a></li> <li><a href="https://github.com/jtleek/datasharing" target="_self">How to share data with a statistician</a></li> </ul> <p>But I decided that it might be useful to have a more complete guide to the “art” part of data analysis. One goal is to summarize in a succinct way the most common difficulties encountered by practicing data analysts. It may be a useful guide for peer reviewers who could refer to section numbers when evaluating manuscripts, for instructors who have to grade data analyses, as a supplementary text for a data analysis class, or just as a useful reference. It is modeled loosely in format and aim on the <a href="http://www.bartleby.com/141/">Elements of Style</a> by William Strunk. Just as with the EoS, both the checklist and my book cover a small fraction of the field of data analysis, but my experience is that once these elements are mastered, data analysts benefit most from hands on experience in their own discipline of application, and that many principles may be non-transferable beyond the basics. But just as with writing, new analysts would do better to follow the rules until they know them well enough to violate them.</p> <ul> <li><a href="https://leanpub.com/datastyle/">Buy EDAS on Leanpub</a></li> <li><a href="http://www.amazon.com/Elements-Data-Analytic-Style-ebook/dp/B00U6D80YY/ref=sr_1_1?ie=UTF8&amp;qid=1425397222&amp;sr=8-1&amp;keywords=elements+of+data+analytic+style">Buy EDAS on Amazon</a></li> </ul> <p>The book includes a basic checklist that may be useful as a guide for beginning data analysts or as a rubric for evaluating data analyses. I’m reproducing it here so you can comment/hate/enjoy on it.</p> <p> </p> <p><em><strong>The data analysis checklis</strong>t</em></p> <p>This checklist provides a condensed look at the information in this book. It can be used as a guide during the process of a data analysis, as a rubric for grading data analysis projects, or as a way to evaluate the quality of a reported data analysis.</p> <p><strong>I Answering the question</strong></p> <ol> <li> <p>Did you specify the type of data analytic question (e.g. exploration, assocation causality) before touching the data?</p> </li> <li> <p>Did you define the metric for success before beginning?</p> </li> <li> <p>Did you understand the context for the question and the scientific or business application?</p> </li> <li> <p>Did you record the experimental design?</p> </li> <li> <p>Did you consider whether the question could be answered with the available data?</p> </li> </ol> <p><strong>II Checking the data</strong></p> <ol> <li> <p>Did you plot univariate and multivariate summaries of the data?</p> </li> <li> <p>Did you check for outliers?</p> </li> <li> <p>Did you identify the missing data code?</p> </li> </ol> <p><strong>III Tidying the data</strong></p> <ol> <li> <p>Is each variable one column?</p> </li> <li> <p>Is each observation one row?</p> </li> <li> <p>Do different data types appear in each table?</p> </li> <li> <p>Did you record the recipe for moving from raw to tidy data?</p> </li> <li> <p>Did you create a code book?</p> </li> <li> <p>Did you record all parameters, units, and functions applied to the data?</p> </li> </ol> <p><strong>IV Exploratory analysis</strong></p> <ol> <li> <p>Did you identify missing values?</p> </li> <li> <p>Did you make univariate plots (histograms, density plots, boxplots)?</p> </li> <li> <p>Did you consider correlations between variables (scatterplots)?</p> </li> <li> <p>Did you check the units of all data points to make sure they are in the right range?</p> </li> <li> <p>Did you try to identify any errors or miscoding of variables?</p> </li> <li> <p>Did you consider plotting on a log scale?</p> </li> <li> <p>Would a scatterplot be more informative?</p> </li> </ol> <p><strong>V Inference</strong></p> <ol> <li> <p>Did you identify what large population you are trying to describe?</p> </li> <li> <p>Did you clearly identify the quantities of interest in your model?</p> </li> <li> <p>Did you consider potential confounders?</p> </li> <li> <p>Did you identify and model potential sources of correlation such as measurements over time or space?</p> </li> <li> <p>Did you calculate a measure of uncertainty for each estimate on the scientific scale?</p> </li> </ol> <p><strong>VI Prediction</strong></p> <ol> <li> <p>Did you identify in advance your error measure?</p> </li> <li> <p>Did you immediately split your data into training and validation?</p> </li> <li> <p>Did you use cross validation, resampling, or bootstrapping only on the training data?</p> </li> <li> <p>Did you create features using only the training data?</p> </li> <li> <p>Did you estimate parameters only on the training data?</p> </li> <li> <p>Did you fix all features, parameters, and models before applying to the validation data?</p> </li> <li> <p>Did you apply only one final model to the validation data and report the error rate?</p> </li> </ol> <p><strong>VII Causality</strong></p> <ol> <li> <p>Did you identify whether your study was randomized?</p> </li> <li> <p>Did you identify potential reasons that causality may not be appropriate such as confounders, missing data, non-ignorable dropout, or unblinded experiments?</p> </li> <li> <p>If not, did you avoid using language that would imply cause and effect?</p> </li> </ol> <p><strong>VIII Written analyses</strong></p> <ol> <li> <p>Did you describe the question of interest?</p> </li> <li> <p>Did you describe the data set, experimental design, and question you are answering?</p> </li> <li> <p>Did you specify the type of data analytic question you are answering?</p> </li> <li> <p>Did you specify in clear notation the exact model you are fitting?</p> </li> <li> <p>Did you explain on the scale of interest what each estimate and measure of uncertainty means?</p> </li> <li> <p>Did you report a measure of uncertainty for each estimate on the scientific scale?</p> </li> </ol> <p><strong>IX Figures</strong></p> <ol> <li> <p>Does each figure communicate an important piece of information or address a question of interest?</p> </li> <li> <p>Do all your figures include plain language axis labels?</p> </li> <li> <p>Is the font size large enough to read?</p> </li> <li> <p>Does every figure have a detailed caption that explains all axes, legends, and trends in the figure?</p> </li> </ol> <p><strong>X Presentations</strong></p> <ol> <li> <p>Did you lead with a brief, understandable to everyone statement of your problem?</p> </li> <li> <p>Did you explain the data, measurement technology, and experimental design before you explained your model?</p> </li> <li> <p>Did you explain the features you will use to model data before you explain the model?</p> </li> <li> <p>Did you make sure all legends and axes were legible from the back of the room?</p> </li> </ol> <p><strong>XI Reproducibility</strong></p> <ol> <li> <p>Did you avoid doing calculations manually?</p> </li> <li> <p>Did you create a script that reproduces all your analyses?</p> </li> <li> <p>Did you save the raw and processed versions of your data?</p> </li> <li> <p>Did you record all versions of the software you used to process the data?</p> </li> <li> <p>Did you try to have someone else run your analysis code to confirm they got the same answers?</p> </li> </ol> <p><strong>XI R packages</strong></p> <ol> <li> <p>Did you make your package name “Googleable”</p> </li> <li> <p>Did you write unit tests for your functions?</p> </li> <li> <p>Did you write help files for all functions?</p> </li> <li> <p>Did you write a vignette?</p> </li> <li> <p>Did you try to reduce dependencies to actively maintained packages?</p> </li> <li> <p>Have you eliminated all errors and warnings from R CMD CHECK?</p> </li> </ol> <p> </p> Advanced Statistics for the Life Sciences MOOC Launches Today 2015-03-02T09:37:39+00:00 http://simplystats.github.io/2015/03/02/advanced-statistics-for-the-life-sciences-mooc-launches-today <p>In <a href="https://www.edx.org/course/advanced-statistics-life-sciences-harvardx-ph525-3x#.VPRzYSnffwc">In</a> we will teach statistical techniques that are commonly used in the analysis of high-throughput data and their corresponding R implementations. In Week 1 we will explain inference in the context of high-throughput data and introduce the concept of error controlling procedures. We will describe the strengths and weakness of the Bonferroni correction, FDR and q-values. We will show how to implement these in cases in which  thousands of tests are conducted, as is typically done with genomics data. In Week 2 we will introduce the concept of mathematical distance and how it is used in exploratory data analysis, clustering, and machine learning. We will describe how techniques such as principal component analysis (PCA) and the singular value decomposition (SVD) can be used for dimension reduction in high dimensional data. During week 3 we will describe confounding, latent variables and factor analysis in the context of high dimensional data and how this relates to batch effects. We will show how to implement methods such as SVA to perform inference on data affected by batch effects. Finally, during week 4 we will show how statistical modeling, and empirical Bayes modeling in particular, are powerful techniques that greatly improve precision in high-throughput data. We will be using R code to explain concepts throughout the course. We will also be using exploratory data analysis and data visualization to motivate the techniques we teach during each week.</p> Navigating Big Data Careers with a Statistics PhD 2015-02-18T10:12:29+00:00 http://simplystats.github.io/2015/02/18/navigating-big-data-careers-with-a-statistics-phd <div> <em>Editor's note: This is a guest post by <a href="http://www.drsherrirose.com/" target="_blank">Sherri Rose</a>. She is an Assistant Professor of Biostatistics in the Department of Health Care Policy at Harvard Medical School. Her work focuses on nonparametric estimation, causal inference, and machine learning in health settings. Dr. Rose received her BS in statistics from The George Washington University and her PhD in biostatistics from the University of California, Berkeley, where she coauthored a book on <a href="http://drsherrirose.com/targeted-learning-book/" target="_blank">Targeted Learning</a>. She tweets <a href="https://twitter.com/sherrirose" target="_blank">@sherrirose</a>.</em> </div> <div> </div> <div> A quick scan of the science and technology headlines often yields two words: big data. The amount of information we collect has continued to increase, and this data can be found in varied sectors, ranging from social media to genomics. Claims are made that big data will solve an array of problems, from understanding devastating diseases to predicting political outcomes. There is substantial “big data” hype in the press, as well as business and academic communities, but how do upcoming, current, and recent statistical science PhDs handle the array of training opportunities and career paths in this new era? <a href="http://www.amstat.org/newsroom/pressreleases/2015-StatsFastestGrowingSTEMDegree.pdf" target="_blank">Undergraduate interest in statistics degrees is exploding</a>, bringing new talent to graduate programs and the post-PhD job pipeline.  Statistics training is diversifying, with students focusing on theory, methods, computation, and applications, or a blending of these areas. A few years ago, Rafa outlined the academic career options for statistics PhDs in <a href="http://simplystatistics.org/2011/09/12/advice-for-stats-students-on-the-academic-job-market/" target="_blank">two</a> <a href="http://simplystatistics.org/2011/09/15/another-academic-job-market-option-liberal-arts/" target="_blank">posts</a>, which cover great background material I do not repeat here. The landscape for statistics PhD careers is also changing quickly, with a variety of companies attracting top statistics students in new roles.  As a <a href="http://www.drsherrirose.com/" target="_blank">new faculty member</a> at the intersection of machine learning, causal inference, and health care policy, I've already found myself frequently giving career advice to trainees.  The choices have become much more nuanced than just academia vs. industry vs. government. </div> <div> </div> <div> </div> <div> So, you find yourself inspired by big data problems and fascinated by statistics. While you are a student, figuring out what you enjoy working on is crucial. This exploration could involve engaging in internship opportunities or collaborating with multiple faculty on different types of projects. Both positive and negative experiences can help you identify your preferences. </div> <div> </div> <div> </div> <div> Undergraduates may wish to spend a couple months at a <a href="http://www.nhlbi.nih.gov/research/training/summer-institute-biostatistics-t15" target="_blank">Summer Institute for Training in Biostatistics</a> or <a href="http://www.nsf.gov/crssprgm/reu/" target="_blank">National Science Foundation Research Experience for Undergraduates</a>. There are <a href="https://www.udacity.com/course/st101" target="_blank">also</a> <a href="https://www.coursera.org/course/casebasedbiostat" target="_blank">many</a> <a href="https://www.coursera.org/specialization/jhudatascience/1" target="_blank">MOOC</a> <a href="https://www.edx.org/course/statistics-r-life-sciences-harvardx-ph525-1x#.VJOhXsAAPe" target="_blank">options</a> <a href="https://www.coursera.org/course/maththink" target="_blank">to</a> <a href="https://www.udacity.com/course/ud120" target="_blank">get</a> <a href="https://www.udacity.com/course/ud359" target="_blank">a</a> <a href="https://www.udacity.com/course/ud651" target="_blank">taste</a> <a href="https://www.edx.org/course/foundations-data-analysis-utaustinx-ut-7-01x#.VNpQRd4bakA" target="_blank">of</a> <a href="https://www.edx.org/course/introduction-linear-models-matrix-harvardx-ph525-2x#.VNpQS94bakA" target="_blank">different</a> <a href="https://www.edx.org/course/scalable-machine-learning-uc-berkeleyx-cs190-1x#.VNpQU94bakA" target="_blank">areas</a> <a href="https://www.edx.org/course/introduction-computational-thinking-data-mitx-6-00-2x-0#.VNpQWd4bakA" target="_blank">of</a><a href="https://www.edx.org/course/fundamentals-clinical-trials-harvardx-hsph-hms214x#.VNpQt94bakA" target="_blank">statistics</a>. Selecting a graduate program for PhD study can be a difficult choice, especially when your interests within statistics have yet to be identified, as is often the case for undergraduates. However, if you know that you have interests in software and programming, it can be easy to sort which statistical science PhD programs have a curricular or research focus in this area by looking at department websites. Similarly, if you know you want to work in epidemiologic methods, genomics, or imaging, specific programs are going to jump right to the top as good fits. Getting advice from faculty in your department will be important. Competition for admissions into statistics and biostatistics PhD programs has continued to increase, and most faculty advise applying to as many relevant programs as is reasonable given the demands on your time and finances. If you end up sitting on multiple (funded) offers come April, talking to current students, student alums, and looking at alumni placement can be helpful. Don't hesitate to contact these people, selectively. Most PhD programs genuinely do want you to end up in the place that is best for you, even if it is not with them. </div> <div> </div> <div> </div> <div> Once you're in a PhD program, internship opportunities for graduate students are listed each year by the <a href="http://www.amstat.org/education/internships.cfm" target="_blank">American Statistical Association</a>. Your home department may also have ties with local research organizations and companies with openings. Internships can help you identify future positions and the types of environments where you will flourish in your career. <a href="https://www.linkedin.com/pub/lauren-kunz/a/aab/293" target="_blank">Lauren Kunz</a>, a recent PhD graduate in biostatistics from Harvard University, is currently a Statistician at the National Heart, Lung, and Blood Institute (NHLBI) of the National Institutes of Health. Dr. Kunz said, "As a previous summer intern at the NHLBI, I was able to get a feel for the day to day life of a biostatistician at the NHLBI. I found the NHLBI Office of Biostatistical Research to be a collegial, welcoming environment, and I soon learned that NHLBI biostatisticians have the opportunity to work on a variety of projects, very often collaborating with scientists and clinicians. Due to the nature of these collaborations, the biostatisticians are frequently presented with scientifically interesting and important statistical problems. This work often motivates methodological research which in turn has immediate, practical applications. These factors matched well with my interest in collaborative research that is both methodological and applied." </div> <div> </div> <div> </div> <div> <span style="font-family: Helvetica;">Industry is also enticing to statistics PhDs, particularly those with an applied or computational focus, like <a href="http://www.stephaniesapp.com/" target="_blank">Stephanie Sapp</a> and</span> <a href="http://alyssafrazee.com/" target="_blank">Alyssa Frazee</a><span style="font-family: Helvetica;">. Dr. Sapp has a PhD in statistics from the University of California, Berkeley, and is currently a Quantitative Analyst at <a href="http://www.google.com/" target="_blank">Google</a>. She also completed an internship there the summer before she graduated. In commenting about her choice to join Google, Dr. Sapp said,  "</span>I really enjoy both academic research and seeing my work used in practice.  Working at Google allows me to continue pursuing new and interesting research topics, as well as see my results drive more immediate impact."  <span style="font-family: Helvetica;">Dr. Frazee just finished her PhD in biostatistics at Johns Hopkins University and previously spent a summer exploring her interests in <a href="https://www.hackerschool.com/" target="_blank">Hacker School</a>.  While she applied to both academic and industry positions, receiving multiple offers, she ultimately chose to go into industry and work for <a href="https://stripe.com/" target="_blank">Stripe</a>: "</span>I accepted a tech company's offer for many reasons, one of them being that I really like programming and writing code. There are tons of opportunities to grow as a programmer/engineer at a tech company, but building an academic career on that foundation would be more of a challenge. I'm also excited about seeing my statistical work have more immediate impact. At smaller companies, much of the work done there has visible/tangible bearing on the product. Academic research in statistics is operating a lot closer to the boundaries of what we know and discovering a lot of cool stuff, which means researchers get to try out original ideas more often, but the impact is less immediately tangible. A new method or estimator has to go through a lengthy peer review/publication process and be integrated into the community's body of knowledge, which could take several years, before its impact can be fully observed."  One of Dr. Frazee, Dr. Sapp, and Dr. Kunz's considerations in choosing a job reflects many of those in the early career statistics community: having an impact. </div> <div> </div> <div> </div> <div> <span style="font-family: Helvetica;">Interest in both developing methods </span><i>and</i> <span style="font-family: Helvetica;">translating statistical advances into practice is a common theme in the big data statistics world, but not one that always leads to an industry or government career. There are also academic opportunities in statistics, biostatistics, and interdisciplinary departments like my own where your work can have an impact on current science.  The <a href="http://www.hcp.med.harvard.edu/" target="_blank">Department of Health Care Policy</a> (HCP) at Harvard Medical School has 5 tenure-track/tenured statistics faculty members, including myself, among a total of about 20 core faculty members. The statistics faculty work on a range of theoretical and methodological problems while collaborating with HCP faculty (health economists, clinician <wbr />researchers, and sociologists) and leading our own substantive projects in health care policy (e.g., <a href="http://www.massdac.org/" target="_blank">Mass-DAC</a>). I find it to be a unique and exciting combination of roles, and love that the science truly informs my statistical research, giving it broader impact. Since joining the department a year and a half ago, I've worked in many new areas, such as plan payment risk adjustment methodology. I have also applied some of my previous work in machine learning to predicting adverse health outcomes in large datasets. Here, I immediately saw a need for new avenues of statistical research to make the optimal approach based on statistical theory align with an optimal approach in practice. My current research portfolio is diverse; example projects include the development of a double robust estimator for the study of chronic disease, leading an evaluation of a new state-wide health plan initiative, and collaborating with department colleagues on statistical issues in all-payer claims databases, physician prescribing intensification behavior, and predicting readmissions. The <a href="http://statistics.fas.harvard.edu/" target="_blank">larger</a> <a href="http://www.hsph.harvard.edu/biostatistics/" target="_blank">statistics</a> <a href="http://www.iq.harvard.edu/" target="_blank">community</a> <a href="http://bcb.dfci.harvard.edu/" target="_blank">at</a> Harvard also affords many opportunities to interact with statistics faculty across the campus, and <a href="http://www.faculty.harvard.edu/" target="_blank">university-wide junior faculty events</a> have connected me with professors in computer science and engineering. I feel an immense sense of research freedom to pursue my interests at HCP, which was a top priority when I was comparing job offers.</span> </div> <div> </div> <div> </div> <div> <a href="http://had.co.nz/" target="_blank">Hadley Wickam</a>, of <a href="http://www.amazon.com/dp/0387981403/" target="_blank">ggplot2</a> and <a href="http://www.amazon.com/dp/1466586966/" target="_blank">Advanced R</a> fame, took on a new role as Chief Scientist at <a href="http://www.rstudio.com/" target="_blank">RStudio</a> in 2013. Freedom was also a key component in his choice to move sectors: "For me, the driving motivation is freedom: I know what I want to work on, I just need the freedom (and support) to work on it. It's pretty unusual to find an industry job that has more freedom than academia, but I've been noticeably more productive at RStudio because I don't have any meetings, and I can spend large chunks of time devoted to thinking about hard problems. It's not possible for everyone to get that sort of job, but everyone should be thinking about how they can negotiate the freedom to do what makes them happy. I really like the thesis of Cal Newport's book <a href="http://www.amazon.com/dp/1455509124/" target="_blank"><i>So </i></a><a href="http://www.amazon.com/dp/1455509124/" target="_blank"><i>Good They Can't Ignore You</i></a> - the better you are at your job, the greater your ability to negotiate for what you want." </div> <div> </div> <div> </div> <div> There continues to be a strong emphasis in the work force on the vaguely defined field of “data science,” which incorporates the collection, storage, analysis, and interpretation of big data.  Statisticians not only work in and lead teams with other scientists (e.g., clinicians, biologists, computer scientists) to attack big data challenges, but with each other. Your time as a statistics trainee is an amazing opportunity to explore your strengths and preferences, and which sectors and jobs appeal to you. Do your due diligence to figure out which employers are interested in and supportive of the type of career you want to create for yourself. Think about how you want to spend your time, and remember that you're the only person who has to live your life once you get that job. Other people's opinions are great, but your values and instincts matter too. Your definition of "best" doesn't have to match someone else's. Ask questions! Try new things! The potential for breakthroughs with novel flexible methods is strong. Statistical science training has progressed to the point where trainees are armed with thorough knowledge in design, methodology, theory, and, increasingly, data collection, applications, and computation.  Statisticians working in data science are poised to continue making important contributions in all sectors for years to come. Now, you just need to decide where you fit. </div> Introduction to Linear Models and Matrix Algebra MOOC starts this Monday Feb 16 2015-02-13T09:00:11+00:00 http://simplystats.github.io/2015/02/13/introduction-to-linear-models-and-matrix-algebra-mooc-starts-this-monday-feb-16 <p>Matrix algebra is the language of modern data analysis. We use it to develop and describe statistical and machine learning methods, and to code efficiently in languages such as R, matlab and python. Concepts such as principal component analysis (PCA) are best described with matrix algebra. It is particularly useful to describe linear models.</p> <p>Linear models are everywhere in data analysis. ANOVA, linear regression, limma, edgeR, DEseq, most smoothing techniques, and batch correction methods such as SVA and Combat are based on linear models. In this two week MOOC we well describe the basics of matrix algebra, demonstrate how linear models are used in the life sciences and show how to implement these efficiently in R.</p> <p>Update: Here is <a href="https://www.edx.org/course/introduction-linear-models-matrix-harvardx-ph525-2x">the link</a> to the class</p> Is Reproducibility as Effective as Disclosure? Let's Hope Not. 2015-02-12T10:21:35+00:00 http://simplystats.github.io/2015/02/12/is-reproducibility-as-effective-as-disclosure-lets-hope-not <p>Jeff and I just this week published a <a href="http://www.pnas.org/content/112/6/1645.full">commentary</a> in the <em>Proceedings of the National Academy of Sciences</em> on our latest thinking on reproducible research and its ability to solve the reproducibility/replication “crisis” in science (there’s a version on <a href="http://arxiv.org/abs/1502.03169">arXiv</a> too). In a nutshell, we believe reproducibility (making data and code available so that others can recompute your results) is an essential part of science, but it is not going to end the crisis of confidence in science. In fact, I don’t think it’ll even make a dent. The problem is that reproducibility, as a tool for preventing poor research, comes in at the wrong stage of the research process (the end). While requiring reproducibility may deter people from committing outright fraud (a small group), it won’t stop people who just don’t know what they’re doing with respect to data analysis (a much larger group).</p> <p>In an eerie coincidence, Jesse Eisinger of the investigative journalism non-profit ProPublica, has just published a piece on the New York Times Dealbook site discussing how <a href="http://dealbook.nytimes.com/2015/02/11/an-excess-of-sunlight-a-paucity-of-rules/">requiring disclosure rules in the financial industry has produced meager results</a>. He writes</p> <blockquote> <p class="story-body-text"> Over the last century, disclosure and transparency have become our regulatory crutch, the answer to every vexing problem. We require corporations and government to release reams of information on food, medicine, household products, consumer financial tools, campaign finance and crime statistics. We have a booming “report card” industry for a range of services, including hospitals, public schools and restaurants. </p> </blockquote> <p class="story-body-text"> The rationale for all this disclosure is that </p> <blockquote> <p class="story-body-text"> someone, somewhere reads the fine print in these contracts and keeps corporations honest. It turns out what we laymen intuit is true: <a href="http://www.law.nyu.edu/news/ideas/Marotta-Wurgler-standard-form-contracts-fine-print">No one reads them</a>, according to research by a New York University law professor, Florencia Marotta-Wurgler. </p> </blockquote> <p class="story-body-text"> But disclosure is nevertheless popular because how could you be against it? </p> <blockquote> <p class="story-body-text"> The disclosure bonanza is easy to explain. Nobody is against it. It’s politically expedient. Companies prefer such rules, especially in lieu of actual regulations that would curtail bad products or behavior. The opacity lobby — the <a href="http://en.wikipedia.org/wiki/Remora">remora fish</a> class of lawyers, lobbyists and consultants in New York and Washington — knows that disclosure requirements are no bar to dodgy practices. You just have to explain what you’re doing in sufficiently incomprehensible language, a task that earns those lawyers a hefty fee. </p> </blockquote> <p class="story-body-text"> In the now infamous <a href="http://simplystatistics.org/2012/02/27/the-duke-saga-starter-set/">Duke Saga</a>, Keith Baggerly was able to reproduce the work of Potti et al. after roughly 2,000 hours of work because the data were publicly available (although the code was not). It's not clear how much time would have been saved if the code had been available, but it seems reasonable to assume that it would have taken some amount of time to <em>understand</em> the analysis, if not reproduce it. Once the errors in Potti's work were discovered, it took 5 years for the original Nature Medicine paper to be retracted. </p> <p class="story-body-text"> Although you could argue that the process worked in some sense, it came at tremendous cost of time and money. Wouldn't it have been better if the analysis had been done right in the first place? </p> The trouble with evaluating anything 2015-02-09T19:24:22+00:00 http://simplystats.github.io/2015/02/09/the-trouble-with-evaluating-anything <p>It is very hard to evaluate people’s productivity or work in any meaningful way. This problem is the source of:</p> <ol> <li><a href="http://simplystatistics.org/2013/09/26/how-could-code-review-discourage-code-disclosure-reviewers-with-motivation/">Consternation about peer review</a></li> <li><a href="http://simplystatistics.org/2014/02/21/heres-why-the-scientific-publishing-system-can-never-be-fixed/">The reason why post publication peer review doesn’t work</a></li> <li><a href="http://simplystatistics.org/2012/05/24/how-do-we-evaluate-statisticians-working-in-genomics/">Consternation about faculty evaluation</a></li> <li>Major problems at companies like <a href="http://www.bloomberg.com/bw/articles/2013-11-12/yahoos-latest-hr-disaster-ranking-workers-on-a-curve">Yahoo</a> and <a href="http://www.bloomberg.com/bw/articles/2013-11-13/microsoft-kills-its-hated-stack-rankings-dot-does-anyone-do-employee-reviews-right">Microsoft</a>.</li> </ol> <p>Roger and I were just talking about this problem in the context of evaluating the impact of software as a faculty member and Roger suggested the problem is that:</p> <blockquote> <p>Evaluating people requires real work and so people are always looking for shortcuts</p> </blockquote> <p>To evaluate a person’s work or their productivity requires three things:</p> <ol> <li>To be an expert in what they do</li> <li>To have absolutely no reason to care whether they succeed or not</li> <li>To have time available to evaluate them</li> </ol> <p>These three fundamental things are at the heart of why it is so hard to get good evaluations of people and why peer review and other systems are under such fire. The main source of the problem is the conflict between 1 and 2. The group of people in any organization or on any scale that is truly world class at any given topic from software engineering to history is small. It has to be by definition. This group of people inevitably has some reason to care about the success of the other people in that same group. Either they work with the other world class people and want them to succeed or they  either intentionally or unintentionally are competing with them.</p> <p>The conflict between being and expert and having no say wouldn’t be such a problem if it wasn’t for issue number 3: the time to evaluate people. To truly get good evaluations what you need is for someone who <em>isn’t an expert in a field and so has no stake</em> to take the time to become an expert and then evaluate the person/software. But this requires a huge amount of effort on the part of a reviewer who has to become expert in a new field. Given that reviewing is often considered the least important task in people’s workflow, evidenced by the value we put on people acting as peer reviewers for journals, or the value people get for doing a good job in people’s evaluation for promotion in companies, it is no wonder people don’t take the time to become experts.</p> <p>I actually think that tenure review committees at forward thinking places may be the best at this (<a href="http://simplystatistics.org/2012/12/20/the-nih-peer-review-system-is-still-the-best-at-identifying-innovative-biomedical-investigators/">Rafa said the same thing about NIH study section</a>). They at least attempt to get outside reviews from people who are unbiased about the work that a faculty member is doing before they are promoted. This system, of course, has large and well-document problems, but I think it is better than having a person’s direct supervisor - who clearly has a stake - being the only person evaluating them.It is also better than only using the quantifiable metrics like number of papers and impact factor of the corresponding journals. I also think that most senior faculty who evaluate people take the job very seriously despite the only incentive being good citizenship.</p> <p>Since real evaluation requires hard work and expertise, most of the time people are looking for a short cut. These short cuts typically take the form of quantifiable metrics. In the academic world these shortcuts are things like:</p> <ol> <li>Number of papers</li> <li>Citations to academic papers</li> <li>The impact factor of a journal</li> <li>Downloads to a person’s software</li> </ol> <p>I think all of these things are associated with quality but none define quality. You could try to model the relationship, but it is very hard to come up with a universal definition for the outcome you are trying to model. In academics, some people have suggested that <a href="http://www.michaeleisen.org/blog/?p=694">open review or post-publication review</a> solves the problem. But this is only true for a very small subset of cases that violate rule number 2. The only papers that get serious post-publication review are where people have an incentive for the paper to go one way or the other. This means that papers in Science will be post-pub reviewed much much more often than equally important papers in discipline specific journals - just because people care more about Science. This will leave the vast majority of papers unreviewed - as evidenced by the relatively modest number of papers reviewed by <a href="https://pubpeer.com/">PubPeer</a> or <a href="http://www.ncbi.nlm.nih.gov/pubmedcommons/">Pubmed Commons.</a></p> <p>I’m beginning to think that the only way to do evaluation well is to hire people whose <em>only job is to evaluate something well</em>. In other words, peer reviewers who are paid to review papers full time and are only measured by how often those papers are retracted or proved false. Or tenure reviewers who are paid exclusively to evaluate tenure cases and are measured by how well the post-tenure process goes for the people they evaluate and whether there is any measurable bias in their reviews.</p> <p>The trouble with evaluating anything is that it is hard work and right now we aren’t paying anyone to do it.</p> <p> </p> Johns Hopkins Data Science Specialization Top Performers 2015-02-05T10:40:14+00:00 http://simplystats.github.io/2015/02/05/johns-hopkins-data-science-specialization-top-performers <p><em>Editor’s note: The Johns Hopkins Data Science Specialization is the largest data science program in the world.  <a href="http://www.bcaffo.com/">Brian</a>, <a href="http://www.biostat.jhsph.edu/~rpeng/">Roger</a>, and <a href="http://jtleek.com/">myself </a> conceived the program at the beginning of January 2014 , then built, recorded, and launched the classes starting in April 2014 with the help of <a href="https://twitter.com/iragooding">Ira</a>.  Since April 2014 we have enrolled 1.76 million student and awarded 71,589 Signature Track verified certificates. The first capstone class ran in October - just 7 months after the first classes launched and 4 months after all classes were running. Despite this incredibly short time frame 917 students finished all 9 classes and enrolled in the Capstone Course. 478 successfully completed the course.</em></p> <p>When we first announced the the Data Science Specialization, we said that the top performers would be profiled here on Simply Statistics. Well, that time has come, and we’ve got a very impressive group of participants that we want to highlight. These folks have successfully completed all nine MOOCs in the specialization and earned top marks in our first capstone session with <a href="http://swiftkey.com/en/">SwiftKey</a>. We had the pleasure of meeting some of them last week in a video conference, and we were struck by their insights and expertise. Check them out below.</p> <h2 id="sasa-bogdanovic"><strong>Sasa Bogdanovic</strong></h2> <p><a href="http://simplystatistics.org/wp-content/uploads/2015/02/Sasa-Bogdanovic.jpg"><img class="size-thumbnail wp-image-3874 alignleft" src="http://simplystatistics.org/wp-content/uploads/2015/02/Sasa-Bogdanovic-120x90.jpg" alt="Sasa-Bogdanovic" width="120" height="90" /></a></p> <p> </p> <p> </p> <p> </p> <p> </p> <p>Sasa Bogdanovic is passionate about everything data. For the last 6 years, he’s been working in the iGaming industry, providing data products (integrations, data warehouse architectures and models, business intelligence tools, analyst reports and visualizations) for clients, helping them make better, data-driven, business decisions.</p> <h3 id="why-did-you-take-the-jhu-data-science-specialization"><strong>Why did you take the JHU Data Science Specialization?</strong></h3> <p>Although I’ve been working with data for many years, I wanted to take a different perspective and learn more about data science concepts and get insights into the whole pipeline from acquiring data to developing final data products. I also wanted to learn more about statistical models and machine learning.</p> <h3 id="what-are-you-most-proud-of-doing-as-part-of-the-jhu-data-science-specialization"><strong>What are you most proud of doing as part of the JHU Data Science Specialization?</strong></h3> <p>I am very happy to have discovered the data science field. It is a whole new world that I find fascinating and inspiring to explore. I am looking forward to my new career in data science. This will allow me to combine all my previous knowledge and experience with my new insights and methods. I am very proud of every single quiz, assignment and project. For sure, the capstone project was a culmination, and I am very proud and happy to have succeeded to make a solid data product and to be a one of the top performers in the group. For this I am very grateful to the instructors, community TAs, all other peers for their contributions in the forums, and Coursera for putting it all together and making it possible.</p> <h3 id="how-are-you-planning-on-using-your-data-science-specialization-certificate"><strong>How are you planning on using your Data Science Specialization Certificate?</strong></h3> <p>I have already put the certificate in motion. My company is preparing new projects, and I expect the certificate to add weight to our proposals.</p> <h2 id="alejandro-morales-gallardo">Alejandro Morales Gallardo</h2> <p> </p> <p><a href="http://simplystatistics.org/wp-content/uploads/2015/02/Alejandro.png"><img class="size-thumbnail wp-image-3875 alignleft" src="http://simplystatistics.org/wp-content/uploads/2015/02/Alejandro-120x90.png" alt="Alejandro" width="120" height="90" /></a></p> <p> </p> <p> </p> <p> </p> <p> </p> <p>I’m a trained physicist with strong coding skills. I have a passion for dissecting datasets to find the hidden stories in data and produce insights through creative visualizations. A hackathon and open-data aficionado, I have an interest in using data (and science) to improve our lives.</p> <h3 id="why-did-you-take-the-jhu-data-science-specialization-1"><strong>Why did you take the JHU Data Science Specialization?</strong></h3> <p>I wanted to close a gap in my skills and transition into to becoming a full blown Data Scientist by learning key concepts and practices in the field. Learning R, an industry relevant language, while creating a portfolio to showcase my abilities in the entire data science pipeline seemed very attractive.</p> <h3 id="what-are-you-most-proud-of-doing-as-part-of-the-jhu-data-science-specialization-1"><strong>What are you most proud of doing as part of the JHU Data Science Specialization?</strong></h3> <p>I’m most proud of the Predictive Text App I developed. With the Capstone Project, it was extremely rewarding to be able to tackle a brand new data type and learn about text mining and natural language processing while building a fun and attractive data product. I was particularly proud that the accuracy of my app was not that far off from SwiftKey smartphone app. I’m also proud of being a top performer!</p> <h3 id="how-are-you-planning-on-using-your-data-science-specialization-certificate-1"><strong>How are you planning on using your Data Science Specialization Certificate?</strong></h3> <p>I want to apply my new set of skills to develop other products, analyze new datasets and keep growing my portfolio. It is also helpful to have Verified Certificates to show prospective employers.</p> <h2 id="nitin-gupta">Nitin Gupta</h2> <p> </p> <p><a href="http://simplystatistics.org/wp-content/uploads/2015/02/NitinGupta.jpg"><img class="size-thumbnail wp-image-3876 alignleft" src="http://simplystatistics.org/wp-content/uploads/2015/02/NitinGupta-120x90.jpg" alt="NitinGupta" width="120" height="90" /></a></p> <p> </p> <p> </p> <p> </p> <p> </p> <p>Nitin is an independent trader and quant strategist with over 13 years of multi-faceted experience in the investment management industry. In the past he worked for a leading investment management firm where he built automated trading and risk management systems and gained complete life-cycle expertise in creating systematic investment products. He has a background in computer science with a strong interest in machine learning and its applications in quantitative modeling.</p> <h3 id="why-did-you-take-the-jhu-data-science-specialization-2"><strong>Why did you take the JHU Data Science Specialization?</strong></h3> <p>I was fortunate to have done the first Machine Learning course taught by Prof. Andrew Ng at the launch of Coursera in 2012, which really piqued my interest in the topic. The next course I did on Coursera was Prof. Roger Peng’s Computing For Data Analysis which introduced me to R. I realized that R was ideally suited for the quantitative modeling work I was doing. When I learned about the range of topics that the JHU DSS would cover - from the best practices in tidying and transforming data to modeling, analysis and visualization - I did not hesitate to sign up. Learning how to do all of this in an ecosystem built around R has been a huge plus.</p> <h3 id="what-are-you-most-proud-of-doing-as-part-of-the-jhu-data-science-specialization-2"><strong>What are you most proud of doing as part of the JHU Data Science Specialization?</strong></h3> <p>I am quite pleased with the web apps I built which utilize the concepts learned during the track. One of my apps visualizes and compares historical stock performance with other stocks and market benchmarks after querying the data directly from web resources. Another one showcases a predictive typing engine that dynamically predicts the next few words to use and append, as the user types a sentence. The process of building these apps provided a fantastic learning experience. Also, for the first time I built something that even my near and dear ones could use and appreciate, which is terrific.</p> <h3 id="how-are-you-planning-on-using-your-data-science-specialization-certificate-2"><strong>How are you planning on using your Data Science Specialization Certificate?</strong></h3> <p>The broad skill set developed through this specialization could be applied across multiple domains. My current focus is on building robust quantitative models for systematic trading strategies that could learn and adapt to changing market environments. This would involve the application of machine learning techniques among other skills learned during the specialization. Using R and Shiny to interactively analyze the results would be tremendously useful.</p> <h2 id="marc-kreyer">Marc Kreyer</h2> <p> </p> <p><a href="http://simplystatistics.org/wp-content/uploads/2015/02/Marc-Kreyer.jpeg"><img class="size-thumbnail wp-image-3877 alignleft" src="http://simplystatistics.org/wp-content/uploads/2015/02/Marc-Kreyer-120x90.jpeg" alt="Marc Kreyer" width="120" height="90" /></a></p> <p> </p> <p> </p> <p> </p> <p> </p> <p>Marc Kreyer is an expert business analyst and software engineer with extensive experience in financial services in Austria and Liechtenstein. He successfully finishes complex projects by not only using broad IT knowledge but also outstanding comprehension of business needs. Marc loves combining his programming and database skills with his affinity for mathematics to transform data into insight.</p> <h3 id="why-did-you-take-the-jhu-data-science-specialization-3"><strong>Why did you take the JHU Data Science Specialization?</strong></h3> <p>There are many data science MOOCs, but usually they are independent 4-6 week courses. The JHU Data Science Specialization was the first offering of a series of courses that build upon each other.</p> <h3 id="what-are-you-most-proud-of-doing-as-part-of-the-jhu-data-science-specialization-3"><strong>What are you most proud of doing as part of the JHU Data Science Specialization?</strong></h3> <p>Creating a working text prediction app without any prior NLP knowledge and only minimal assistance from instructors.</p> <h3 id="how-are-you-planning-on-using-your-data-science-specialization-certificate-3"><strong>How are you planning on using your Data Science Specialization Certificate?</strong></h3> <p>Knowledge and experience are the most valuable things gained from the Data Science Specialization. As they can’t be easily shown to future employers, the certificate can be a good indicator for them. Unfortunately there is neither an issue data nor a verification link on the certificate, therefore it will be interesting to see how valuable it really will be.</p> <h2 id="hsing-liu">Hsing Liu</h2> <p> </p> <p style="text-align: left;"> <a href="http://simplystatistics.org/wp-content/uploads/2015/02/Paul_HsingLiu.jpeg"><img class="size-thumbnail wp-image-3878" src="http://simplystatistics.org/wp-content/uploads/2015/02/Paul_HsingLiu-120x90.jpeg" alt="Paul_HsingLiu" width="120" height="90" /></a> </p> <p>I studied in the U.S. for a number of years, and received my M.S. in mathematics from NYU before returning to my home country, Taiwan. I’m most interested in how people think and learn, and education in general. This year I’m starting a new career as an iOS app engineer.</p> <h3 id="why-did-you-take-the-jhu-data-science-specialization-4"><strong>Why did you take the JHU Data Science Specialization?</strong></h3> <p>In my brief past job as an instructional designer, I read a lot about the new wave of online education, and was especially intrigued by how Khan Academy’s data science division is using data to help students learn. It occurred to me that to leverage my math background and make a bigger impact in education (or otherwise), data science could be an exciting direction to take.</p> <h3 id="what-are-you-most-proud-of-doing-as-part-of-the-jhu-data-science-specialization-4"><strong>What are you most proud of doing as part of the JHU Data Science Specialization?</strong></h3> <p>It may sound boring, but I’m proud of having done my best for each course in the track, going beyond the bare requirements when I’m able. The parts of the Specialization fit into a coherent picture of the discipline, and I’m glad to have put in the effort to connect the dots and gained a new perspective.</p> <h3 id="how-are-you-planning-on-using-your-data-science-specialization-certificate-4"><strong>How are you planning on using your Data Science Specialization Certificate?</strong></h3> <p>I’m listing the certificate on my resume and LinkedIn, and I expect to be applying what I’ve learned once my company’s e-commence app launch.</p> <h2 id="yichen-liu">Yichen Liu</h2> <p> </p> <p>Yichen Liu is a business analyst at Toyota Western Australia where he is responsible for business intelligence development, data analytics and business improvement. His prior experience includes working as a sessional lecturer and tutor at Curtin University in finance and econometrics units.</p> <h3 id="why-did-you-take-the-jhu-data-science-specialization-5"><strong>Why did you take the JHU Data Science Specialization?</strong></h3> <p>Recognising the trend that the world is more data-driven than before, I felt it was necessary to gain further understanding in data analysis to tackle both current and future challenges at work.</p> <h3 id="what-are-you-most-proud-of-doing-as-part-of-the-jhu-data-science-specialization-5"><strong>What are you most proud of doing as part of the JHU Data Science Specialization?</strong></h3> <p>The most proud thing as part of the program is that I have gained some basic knowledge in a totally new area, natural language processing. Though its connection with my current working area is limited, I see the future of data analysis to be more unstructured-data-drive and am willing to develop more knowledge in this area.</p> <h3 id="how-are-you-planning-on-using-your-data-science-specialization-certificate-5"><strong>How are you planning on using your Data Science Specialization Certificate?</strong></h3> <p>I see the certificate as a stepping stone into the data science world, and would like to conduct more advanced studies in data science especially for unstructured data analysis.</p> <h2 id="johann-posch">Johann Posch</h2> <p style="text-align: left;"> <a href="http://simplystatistics.org/wp-content/uploads/2015/02/PictureJohannPosch.png"><img class="size-thumbnail wp-image-3879" src="http://simplystatistics.org/wp-content/uploads/2015/02/PictureJohannPosch-120x90.png" alt="PictureJohannPosch" width="120" height="90" /></a> </p> <p>After graduating form Vienna University of Technology with a specialization in Artificial Intelligence I joined Microsoft. There I worked as a developer on various products but the majority of the time as a Windows OS developer. After venturing into start-ups for a few years I joined GE Research to work on the Predix Big Data Platform and recently I joined on the Industrial Data Science team.</p> <h3 id="why-did-you-take-the-jhu-data-science-specialization-6"><strong>Why did you take the JHU Data Science Specialization?</strong></h3> <p>Ever since I wrote my masters thesis in Neural Networks I have been intrigued with machine learning. I see data science as a field where great advances will happen over the next decade and as an opportunity to positively impact millions of lives. I like how JHU structured the course series.</p> <h3 id="what-are-you-most-proud-of-doing-as-part-of-the-jhu-data-science-specialization-6">What are you most proud of doing as part of the JHU Data Science Specialization?</h3> <p>Being able to complete the JHU Data Science Specialization in 6 months and to get an distinction on every one of the courses was a great success. However, the best moment was probably the way my capstone project (next word prediction) turned out. The model could be trained in incremental steps and how it was able to provide meaningful options in real time.</p> <h3 id="how-are-you-planning-on-using-your-data-science-specialization-certificate-6"><strong>How are you planning on using your Data Science Specialization Certificate?</strong></h3> <p>The course covered the concepts and tools needed to successfully address data science problems. It gave me the confidence and knowledge to apply for data science position. I am now working in the field at GE Research. I am grateful to all who made this Specialization happen!</p> <h2 id="jason-wilkinson">Jason Wilkinson</h2> <p> </p> <p><a href="http://simplystatistics.org/wp-content/uploads/2015/02/JasonWilkinson.jpg"><img class="size-thumbnail wp-image-3880 alignleft" src="http://simplystatistics.org/wp-content/uploads/2015/02/JasonWilkinson-120x90.jpg" alt="JasonWilkinson" width="120" height="90" /></a></p> <p> </p> <p> </p> <p> </p> <p> </p> <p>Jason Wilkinson is a trader of commodity futures and other financial securities at a small proprietary trading firm in New York City. He and his wife, Katie, and dog, Charlie, can frequently be seen at the Jersey shore. And no, it’s nothing like the tv show, aside from the fist pumping.</p> <h3 id="why-did-you-take-the-jhu-data-science-specialization-7"><strong>Why did you take the JHU Data Science Specialization?</strong></h3> <p>The JHU Data Science Specialization helped me to prepare as I begin working on a Masters of Computer Science specializing in Machine Learning at Georgia Tech and also in researching algorithmic trading ideas. I also hope to find ways of using what I’ve learned in philanthropic endeavors, applying data science for social good.</p> <h3 id="what-are-you-most-proud-of-doing-as-part-of-the-jhu-data-science-specialization-7"><strong>What are you most proud of doing as part of the JHU Data Science Specialization?</strong></h3> <p>I’m most proud of going from knowing zero R code to being able to apply it in the capstone and other projects in such a short amount of time.</p> <h3 id="how-are-you-planning-on-using-your-data-science-specialization-certificate-7"><strong>How are you planning on using your Data Science Specialization Certificate?</strong></h3> <p>The knowledge gained in pursuing the specialization certificate alone was worth the time put into it. A certificate is just a piece of paper. It’s what you can do with the knowledge gained that counts.</p> <h2 id="uli-zellbeck">Uli Zellbeck</h2> <p> </p> <p style="text-align: left;"> <a href="http://simplystatistics.org/wp-content/uploads/2015/02/Uli.jpg"><img class="size-thumbnail wp-image-3881" src="http://simplystatistics.org/wp-content/uploads/2015/02/Uli-120x90.jpg" alt="Uli" width="120" height="90" /></a> </p> <p> </p> <p>I studied economics in Berlin with focus on econometrics and business informatics. I am currently working as a Business Intelligence / Data Warehouse Developer in an e-commerce company. I am interested in recommender systems and machine learning.</p> <h3 id="why-did-you-take-the-jhu-data-science-specialization-8"><strong>Why did you take the JHU Data Science Specialization?</strong></h3> <p>I wanted to learn about Data Science because it provides a different approach on solving business problems with data. I chose the JHU Data Science Specialization on Coursera because it promised a wide range of topics and I like the idea of online courses. Also, I had experience with R and I wanted to deepen my knowledge with this tool.</p> <h3 id="what-are-you-most-proud-of-doing-as-part-of-the-jhu-data-science-specialization-8">What are you most proud of doing as part of the JHU Data Science Specialization?</h3> <p>There are two things. I successfully took all nine courses in 4 months and the capstone project was really hard work.</p> <h3 id="how-are-you-planning-on-using-your-data-science-specialization-certificate-8"><strong>How are you planning on using your Data Science Specialization Certificate?</strong></h3> <p>I might get the chance to develop a Data Science department at my company. I like to use the certificate as basis to get a deeper knowledge in the many parts of Data Science.</p> <h2 id="fred-zhengzhenhao">Fred Zheng Zhenhao</h2> <p> </p> <p><a href="http://simplystatistics.org/wp-content/uploads/2015/02/ZHENG-Zhenhao.jpeg"><img class="size-thumbnail wp-image-3882 alignleft" src="http://simplystatistics.org/wp-content/uploads/2015/02/ZHENG-Zhenhao-120x90.jpeg" alt="ZHENG Zhenhao" width="120" height="90" /></a></p> <p> </p> <p> </p> <p> </p> <p> </p> <p>By the time I enrolled in the JHU data science specialization, I was an undergraduate student in The Hong Kong Polytechnic university. Before that, I read some data mining books, feel excited about the content, but I never get to implement any of the algorithms because I barely have any programming skill. After taking this series of courses, now I am able to analyze the web content which is related to my research using R.</p> <h3 id="why-did-you-take-the-jhu-data-science-specialization-9"><strong>Why did you take the JHU Data Science Specialization?</strong></h3> <p>I took this series of courses as a challenge to me. I would like to see whether my interest can support me through 9 courses and 1 capstone project. And I do want to learn more in this field. This specialization is different from other data mining or machine learning class in that it covers the entire process including the Git, R, R-Markdown, shiny etc, and I think these are necessary skills too.</p> <p><strong>What are you most proud of doing as part of the JHU Data Science Specialization?</strong></p> <p>Getting my word prediction app to respond in 0.05 seconds is already exiting, and one of the reviewer says “congratulations your engine came up with the most correct prediction among those I reviewed: 3 out of 5, including one that stumped every one else : “child might stick her finger or a foreign object into an electrical (outlet)”. I guess that’s the part I am most proud of.</p> <p><strong>How are you planning on using your Data Science Specialization Certificate?</strong></p> <p>It definitely goes in my CV for future job hunting.</p> <p> </p> <p> </p> Early data on knowledge units - atoms of statistical education 2015-02-05T09:44:49+00:00 http://simplystats.github.io/2015/02/05/early-data-on-knowledge-units-atoms-of-statistical-education <p>Yesterday I posted <a href="http://simplystatistics.org/2015/02/04/knowledge-units-the-atoms-of-statistical-education/">about atomizing statistical education into knowledge units</a>. You can try out the first knowledge unit here: <a href="https://jtleek.typeform.com/to/jMPZQe">https://jtleek.typeform.com/to/jMPZQe</a>. The early data is in and it is consistent with many of our hypotheses about the future of online education.</p> <p>Namely:</p> <ol> <li>Completion rates are high when segments are shorter</li> <li>You can learn something about statistics in a short amount of time (2 minutes to complete, many people got all questions right)</li> <li>People will consume educational material on tablets/smartphones more and more.</li> </ol> <p><a href="http://simplystatistics.org/wp-content/uploads/2015/02/Screen-Shot-2015-02-05-at-9.34.51-AM.png"><img class="aligncenter wp-image-3863" src="http://simplystatistics.org/wp-content/uploads/2015/02/Screen-Shot-2015-02-05-at-9.34.51-AM.png" alt="Screen Shot 2015-02-05 at 9.34.51 AM" width="500" height="402" srcset="http://simplystatistics.org/wp-content/uploads/2015/02/Screen-Shot-2015-02-05-at-9.34.51-AM-300x241.png 300w, http://simplystatistics.org/wp-content/uploads/2015/02/Screen-Shot-2015-02-05-at-9.34.51-AM.png 1004w" sizes="(max-width: 500px) 100vw, 500px" /></a></p> <p> </p> Knowledge units - the atoms of statistical education 2015-02-04T16:45:21+00:00 http://simplystats.github.io/2015/02/04/knowledge-units-the-atoms-of-statistical-education <p><em>Editor’s note: This idea is <a href="http://www.bcaffo.com/">Brian’s idea</a> and based on conversations with him and Roger, but I just executed it.</em></p> <p>The length of academic courses has traditionally ranged between a few days for a short course to a few months for a semester-long course.  Lectures are typically either 30 minutes or one hour. Term and lecture lengths have been dictated by tradition and the relative inconvenience of coordinating schedules of the instructors and students for shorter periods of time. As classes have moved online the barrier of inconvenience to varying the length of an academic course has been removed. Despite this flexibilty, most academic online courses adhere to the traditional semester-long format. For example, the first massive online open courses were simply semester-long courses directly recorded and offered online.</p> <p>Data collected from massive online open courses suggest that [<em>Editor’s note: This idea is <a href="http://www.bcaffo.com/">Brian’s idea</a> and based on conversations with him and Roger, but I just executed it.</em></p> <p>The length of academic courses has traditionally ranged between a few days for a short course to a few months for a semester-long course.  Lectures are typically either 30 minutes or one hour. Term and lecture lengths have been dictated by tradition and the relative inconvenience of coordinating schedules of the instructors and students for shorter periods of time. As classes have moved online the barrier of inconvenience to varying the length of an academic course has been removed. Despite this flexibilty, most academic online courses adhere to the traditional semester-long format. For example, the first massive online open courses were simply semester-long courses directly recorded and offered online.</p> <p>Data collected from massive online open courses suggest that](https://onlinelearninginsights.wordpress.com/2014/04/28/mooc-design-tips-maximizing-the-value-of-video-lectures/) and the [<em>Editor’s note: This idea is <a href="http://www.bcaffo.com/">Brian’s idea</a> and based on conversations with him and Roger, but I just executed it.</em></p> <p>The length of academic courses has traditionally ranged between a few days for a short course to a few months for a semester-long course.  Lectures are typically either 30 minutes or one hour. Term and lecture lengths have been dictated by tradition and the relative inconvenience of coordinating schedules of the instructors and students for shorter periods of time. As classes have moved online the barrier of inconvenience to varying the length of an academic course has been removed. Despite this flexibilty, most academic online courses adhere to the traditional semester-long format. For example, the first massive online open courses were simply semester-long courses directly recorded and offered online.</p> <p>Data collected from massive online open courses suggest that [<em>Editor’s note: This idea is <a href="http://www.bcaffo.com/">Brian’s idea</a> and based on conversations with him and Roger, but I just executed it.</em></p> <p>The length of academic courses has traditionally ranged between a few days for a short course to a few months for a semester-long course.  Lectures are typically either 30 minutes or one hour. Term and lecture lengths have been dictated by tradition and the relative inconvenience of coordinating schedules of the instructors and students for shorter periods of time. As classes have moved online the barrier of inconvenience to varying the length of an academic course has been removed. Despite this flexibilty, most academic online courses adhere to the traditional semester-long format. For example, the first massive online open courses were simply semester-long courses directly recorded and offered online.</p> <p>Data collected from massive online open courses suggest that](https://onlinelearninginsights.wordpress.com/2014/04/28/mooc-design-tips-maximizing-the-value-of-video-lectures/) and the](https://www.coursera.org/specialization/jhudatascience/1?utm_medium=courseDescripTop) leads to higher student retention. These results line up with data on other online activities such as Youtube video watching or form completion, which also show that shorter activities lead to higher completion rates.</p> <p>We have  some of the earliest and most highly subscribed massive online open courses through the Coursera platform: Data Analysis, Computing for Data Analysis, and Mathematical Biostatistics Bootcamp. Our original courses were translated from courses we offered locally and were therefore closer to semester long with longer lectures ranging from 15-30 minutes. Based on feedback from our students and the data we observed about completion rates, we made the decision to break our courses down into smaller, one-month courses with no more than two hours of lecture material per week. Since then, we have enrolled more than a million students in our MOOCs.</p> <p>The data suggest that the shorter you can make an academic unit online, the higher the completion percentage. The question then becomes “How short can you make an online course?” To answer this question requires a definition of a course. For our purposes we will define a course as an educational unit consisting of the following three components:</p> <p><strong>** </strong>**</p> <ul> <li> <p><strong>**Knowledge delivery</strong> -** the distribution of educational material through lectures, audiovisual materials, and course notes<strong>.</strong></p> </li> <li> <p><strong>Knowledge evaluation</strong> - the evaluation of how much of the knowledge delivered to a student is retained.</p> </li> <li> <p><strong>Knowledge certification</strong> - an independent claim or representation that a student has learned some set of knowledge.</p> </li> </ul> <p> </p> <p>A typical university class delivers 36 hours = 12 weeks x 3 hours/week of content knowledge, evaluates that knowledge based on the order of 10 homework assignments and 2 tests, and results in a certification equivalent to 3 university credits.With this definition, what is the smallest possible unit that satisfies all three definitions of a course? We will call this smallest possible unit one knowledge unit. The smallest knowledge unit that satisfies all three definitions is a course that:</p> <ul> <li> <p><strong>**Delivers a single unit of content</strong> -** We will define a single unit of content as a text, image, or video describing a single concept.</p> </li> <li> <p><strong>Evaluates that single unit of content</strong> -  The smallest unit of evaluation possible is a single question to evaluate a student’s knowledge.</p> </li> <li> <p><strong>Certifies knowlege</strong> - Provides the student with a statement of successful evaluation of the knowledge in the knowledge unit.</p> </li> </ul> <p>An example of a knowledge unit appears here: <a href="https://jtleek.typeform.com/to/jMPZQe">https://jtleek.typeform.com/to/jMPZQe</a>. The knowledge unit consists of a short (less than 2 minute) video and 3 quiz questions. When completed, the unit sends the completer an email verifying that the quiz has been completed. Just as an atom is the smallest unit of mass that defines a chemical element, the knowledge unit is the smallest unit of education that defines a course.</p> <p>Shrinking the units down to this scale opens up some ideas about how you can connect them together into courses and credentials. I’ll leave that for a future post.</p> Precision medicine may never be very precise - but it may be good for public health 2015-01-30T14:24:17+00:00 http://simplystats.github.io/2015/01/30/precision-medicine-will-never-be-very-precise-but-it-may-be-good-for-public-health <p><em>Editor’s note: This post was originally titled: <a href="http://simplystatistics.org/2013/06/12/personalized-medicine-is-primarily-a-population-health-intervention/">Personalized medicine is primarily a population health intervention</a>. It has been updated with the graph of odds ratios/betas from GWAS studies.</em></p> <p>There has been a lot of discussion of <a href="http://en.wikipedia.org/wiki/Personalized_medicine">personalized medicine</a>, <a href="http://web.jhu.edu/administration/provost/initiatives/ihi/">individualized health</a>, and <a href="http://www.ucsf.edu/welcome-to-ome">precision medicine</a> in the news and in the medical research community and President Obama just <a href="http://www.whitehouse.gov/the-press-office/2015/01/30/fact-sheet-president-obama-s-precision-medicine-initiative">announced a brand new initiative in precision medicine</a> . Despite this recent attention, it is clear that healthcare has always been personalized to some extent. For example, men are rarely pregnant and heart attacks occur more often among older patients. In these cases, easily collected variables such as sex and age, can be used to predict health outcomes and therefore used to “personalize” healthcare for those individuals.</p> <p>So why the recent excitement around personalized medicine? The reason is that it is increasingly cheap and easy to collect more precise measurements about patients that might be able to predict their health outcomes. An example that <a href="http://www.nytimes.com/2013/05/14/opinion/my-medical-choice.html?_r=0">has recently been in the news</a> is the measurement of mutations in the BRCA genes. Angelina Jolie made the decision to undergo a prophylactic double mastectomy based on her family history of breast cancer and measurements of mutations in her BRCA genes. Based on these measurements, previous studies had suggested she might have a lifetime risk as high as 80% of developing breast cancer.</p> <p>This kind of scenario will become increasingly common as newer and more accurate genomic screening and predictive tests are used in medical practice. When I read these stories there are two points I think of that sometimes get obscured by the obviously fraught emotional, physical, and economic considerations involved with making decisions on the basis of new measurement technologies:</p> <ol> <li><strong>In individualized health/personalized medicine the “treatment” is information about risk</strong>. In <a href="http://en.wikipedia.org/wiki/Gleevec">some cases</a> treatment will be personalized based on assays. But in many other cases, we still do not (and likely will not) have perfect predictors of therapeutic response. In those cases, the healthcare will be “personalized” in the sense that the patient will get more precise estimates of their likelihood of survival, recurrence etc. This means that patients and physicians will increasingly need to think about/make decisions with/act on information about risks. But communicating and acting on risk is a notoriously challenging problem; personalized medicine will dramatically raise the importance of <a href="http://understandinguncertainty.org/">understanding uncertainty</a>.</li> <li><strong>Individualized health/personalized medicine is a population-level treatment.</strong> Assuming that the 80% lifetime risk estimate was correct for Angelina Jolie, it still means there is a 1 in 5 chance she was never going to develop breast cancer. If that had been her case, then the surgery was unnecessary. So while her decision was based on personal information, there is still uncertainty in that decision for her. So the “personal” decision may not always be the “best” decision for any specific individual. It may however, be the best thing to do for everyone in a population with the same characteristics.</li> </ol> <p>The first point bears serious consideration in light of President Obama’s new proposal. We have already collected a massive amount of genetic data about a large number of common diseases. In almost all cases, the amount of predictive information that we can glean from genetic studies is modest. One paper pointed this issue out in a rather snarky way by comparing two approaches to predicting people’s heights: (1) averaging their parents heights - an approach from the Victorian era and (2) combing the latest information on the best genetic markers at the time. It turns out, all the genetic information we gathered isn’t as good as <a href="http://www.nature.com/ejhg/journal/v17/n8/full/ejhg20095a.html">averaging parents heights</a>. Another way to see this is to download data on all genetic variants associated with disease from the <a href="http://www.genome.gov/gwastudies/">GWAS catalog</a> that have a P-value less than 1 x 10e-8. If you do that and look at the distribution of effect sizes, you see that 95% have an odds ratio or beta coefficient less than about 4. Here is a histogram of the effect sizes:</p> <p> </p> <p><a href="http://simplystatistics.org/wp-content/uploads/2015/01/gwas-overall.png"><img class="aligncenter size-full wp-image-3852" src="http://simplystatistics.org/wp-content/uploads/2015/01/gwas-overall.png" alt="gwas-overall" width="480" height="480" srcset="http://simplystatistics.org/wp-content/uploads/2015/01/gwas-overall-300x300.png 300w, http://simplystatistics.org/wp-content/uploads/2015/01/gwas-overall-200x200.png 200w, http://simplystatistics.org/wp-content/uploads/2015/01/gwas-overall.png 480w" sizes="(max-width: 480px) 100vw, 480px" /></a></p> <p> </p> <p> </p> <p>This means that nearly all identified genetic effects are small. The ones that are really large (effect size greater than 100) are not for common disease outcomes, they are for <a href="http://en.wikipedia.org/wiki/Birdshot_chorioretinopathy">Birdshot chorioretinopathy</a> and hippocampal volume. You can really see this if you look at the bulk of the distribution of effect sizes, which are mostly less than 2 by zooming the plot on the x-axis:</p> <p> </p> <p><a href="http://simplystatistics.org/wp-content/uploads/2015/01/gwas-zoomed.png"><img class="aligncenter size-full wp-image-3853" src="http://simplystatistics.org/wp-content/uploads/2015/01/gwas-zoomed.png" alt="gwas-zoomed" width="480" height="480" srcset="http://simplystatistics.org/wp-content/uploads/2015/01/gwas-zoomed-300x300.png 300w, http://simplystatistics.org/wp-content/uploads/2015/01/gwas-zoomed-200x200.png 200w, http://simplystatistics.org/wp-content/uploads/2015/01/gwas-zoomed.png 480w" sizes="(max-width: 480px) 100vw, 480px" /></a></p> <p> </p> <p> </p> <p>These effect sizes translate into very limited predictive capacity for most identified genetic biomarkers.  The implication is that personalized medicine, at least for common diseases, is highly likely to be inaccurate for any individual person. But if we can take advantage of the population-level improvements in health from precision medicine by increasing risk literacy, improving our use of uncertain markers, and understanding that precision medicine isn’t precise for any one person, it could be a really big deal.</p> Reproducible Research Course Companion 2015-01-26T16:22:36+00:00 http://simplystats.github.io/2015/01/26/reproducible-research-course-companion <p><a href="https://itunes.apple.com/us/book/reproducible-research/id961495566?ls=1&amp;mt=13" rel="https://itunes.apple.com/us/book/reproducible-research/id961495566?ls=1&amp;mt=13"><img class="alignright wp-image-3838" src="http://simplystatistics.org/wp-content/uploads/2015/01/Screen-Shot-2015-01-26-at-4.14.26-PM-779x1024.png" alt="Screen Shot 2015-01-26 at 4.14.26 PM" width="331" height="435" srcset="http://simplystatistics.org/wp-content/uploads/2015/01/Screen-Shot-2015-01-26-at-4.14.26-PM-228x300.png 228w, http://simplystatistics.org/wp-content/uploads/2015/01/Screen-Shot-2015-01-26-at-4.14.26-PM-779x1024.png 779w, http://simplystatistics.org/wp-content/uploads/2015/01/Screen-Shot-2015-01-26-at-4.14.26-PM-152x200.png 152w, http://simplystatistics.org/wp-content/uploads/2015/01/Screen-Shot-2015-01-26-at-4.14.26-PM.png 783w" sizes="(max-width: 331px) 100vw, 331px" /></a>I’m happy to announce that you can now get a copy of the <a title="Reproducible Research Course Companion" href="https://itunes.apple.com/us/book/reproducible/id961495566?ls=1&amp;mt=13" target="_blank">Reproducible Research Course Companion</a> from the Apple iBookstore. The purpose of this e-book is pretty simple. The book provides all of the key video lectures from my <a title="JHU/Coursera Reproducible Research Course " href="https://www.coursera.org/course/repdata" target="_blank">Reproducible Research course</a> offered on Coursera, in a simple offline e-book format. The book can be viewed on a Mac, iPad, or iPad mini.</p> <p>If you’re interested in taking my Reproducible Research course on Coursera and would like a flavor of what the course will be like, then you can view the lectures through the book (the free sample contains three lectures). On the other hand, if you already took the course and would like access to the lecture material afterwards, then this might be a useful add-on. If you care currently enrolled in the course, then this could be a handy way for you to take the lectures on the road with you.</p> <p>Please note that all of the lectures are still available for free on YouTube via my <a href="https://www.youtube.com/channel/UCZA0RbbSK1IXeeJysKYRWuQ" target="_blank">YouTube channel</a>. Also, the book provides content only. If you wish to actually complete the course, you must take it through the Coursera web site.</p> Data as an antidote to aggressive overconfidence 2015-01-21T11:58:07+00:00 http://simplystats.github.io/2015/01/21/data-as-an-antidote <p>A recent <a href="http://www.nytimes.com/2014/12/07/opinion/sunday/adam-grant-and-sheryl-sandberg-on-discrimination-at-work.html?_r=0">NY Times op-ed</a> reminded us of the many biases faced by women at work. A [A recent <a href="http://www.nytimes.com/2014/12/07/opinion/sunday/adam-grant-and-sheryl-sandberg-on-discrimination-at-work.html?_r=0">NY Times op-ed</a> reminded us of the many biases faced by women at work. A ](http://time.com/3666135/sheryl-sandberg-talking-while-female-manterruptions/)  gave specific recommendations for how to conduct ourselves in meetings_. <em>In general, I found these very insightful, but don’t necessarily agree with the recommendations that women should “Practice Assertive Body Language”.  Instead, we should make an effort to judge ideas by their content and not be impressed by body language. More generally, it is a problem that many of the characteristics that help advance careers contribute nothing to intellectual output. One of these is what I call _aggressive overconfidence</em>.</p> <p>Here is an example (based on a true story). A data scientist finds a major flaw with the data analysis performed by a prominent data-producing scientist’s lab. Both are part of a large collaborative project. A meeting is held among the project leaders to discuss the disagreement. The data producer is very self-confident in defending his approach. The data scientist, who in not nearly as aggressive, is <a href="http://time.com/3666135/sheryl-sandberg-talking-while-female-manterruptions/">interrupted</a> so much that she barely gets her point across. The project leaders decide that this seems to be simply a difference of opinion and, for all practical purposes, ignore the data scientist. I imagine this story sounds familiar to many. While in many situations this story ends here, when the results are data driven we can actually fact check opinions that are pronounced as fact. In this example, the data is public and anybody with the right expertise can download the data and corroborate the flaw in the analysis. This is typically quite tedious, but it can be done. Because the key flaws are rather complex, the project leaders, lacking expertise in data analysis, can’t make this determination. But eventually, a chorus of fellow data analysts will be too loud to ignore.</p> <p>That aggressive overconfidence is generally rewarded in academia is a problem. And if this trait is <a href="http://scholar.google.com/scholar?hl=en&amp;as_sdt=0,22&amp;q=overconfidence+gender">highly correlated with being male</a>, then a manifestation of this is a worsened gender gap. My experience (including reading internet discussions among scientists on controversial topics) has convinced me that this trait is in fact correlated with gender. But the solution is not to help women become more aggressively overconfident. Instead we should continue to strive to judge work based on content rather than style. I am optimistic that more and more, data, rather than who sounds more sure of themselves, will help us decide who wins a debate.</p> <p> </p> Gorging ourselves on "free" health care: Harvard's dilemma 2015-01-20T09:00:56+00:00 http://simplystats.github.io/2015/01/20/gorging-ourselves-on-free-health-care-harvards-dilemma <p><em>Editor’s note: This is a guest post by <a href="http://www.hcp.med.harvard.edu/faculty/core/laura-hatfield-phd">Laura Hatfield</a>. Laura is an Assistant Professor of Health Care Policy at Harvard Medical School, with a specialty in Biostatistics. Her work focuses on understanding trade-offs and relationships among health outcomes. Dr. Hatfield received her BS in genetics from Iowa State University and her PhD in biostatistics from the University of Minnesota. She tweets <a href="https://twitter.com/bioannie">@bioannie</a></em></p> <p>I didn’t imagine when I joined Harvard’s Department of Health Care Policy that the New York Times would be <a href="http://www.nytimes.com/2015/01/06/us/health-care-fixes-backed-by-harvards-experts-now-roil-its-faculty.html">writing about my benefits package</a>. Then a vocal and aggrieved group of faculty <a href="http://www.thecrimson.com/article/2014/11/12/harvards-health-benefits-unfairness/">rebelled against health benefits changes</a> for 2015, and commentators responded by gleefully <a href="http://www.thefiscaltimes.com/2015/01/07/Harvards-Whiny-Profs-Could-Get-Obamacare-Bonus">skewering</a> entitled-sounding Harvard professors. But I’m a statistician, so I want to talk data.</p> <p>Health care spending is tremendously right-skewed. The figure below shows the annual spending distribution among people with any spending (~80% of the total population) in two data sources on people covered by employer-sponsored insurance, such as the Harvard faculty. Notice that the y axis is on the log scale. More than half of people spend 1000 or less, but a few very unfortunate folks top out near half a million.</p> <p><a href="http://simplystatistics.org/wp-content/uploads/2015/01/spending_distribution.jpg"><img class="alignnone size-full wp-image-3814" src="http://simplystatistics.org/wp-content/uploads/2015/01/spending_distribution.jpg" alt="spending_distribution" width="600" height="400" /></a></p> <p>Source: <a href="https://www.bea.gov/papers/working_papers.htm">Measuring health care costs of individuals with employer-sponsored health insurance in the US: A comparison of survey and claims data</a>. A. Aizcorbe, E. Liebman, S. Pack, D.M. Cutler, M.E. Chernew, A.B. Rosen. BEA working paper. WP2010-06. June 2010.</p> <p>If instead of contributing to my premiums, Harvard instead gave me the1000/month premium contribution in the form of wages, I would be on the hook for my own health care expenses. If I stay healthy, I pocket the money, minus income taxes. If I get sick, I have the extra money available to cover the expenses…provided I’m not one of the unlucky 10% of people spending more than $12,000/year. In that case, the additional wages would be insufficient to cover my health care expenses. This “every woman for herself” system lacks the key benefit of insurance: risk pooling. The sickest among us would be bankrupted by health costs. Another good reason for an employer to give me benefits is that I do not pay taxes on this part of my compensation (more on that later).</p> <p>At the opposite end of the spectrum is the Harvard faculty health insurance plan. Last year, the university paid ~$1030/month toward my premium and I put in ~$425 (tax-free). In exchange for this ~$17,000 of premiums, my family got first-dollar insurance coverage with very low co-pays. Faculty contributions to our collective expenses health care were distributed fairly evenly among all of us, with only minimal cost sharing to reflect how much care each person consumed. The sickest among us were in no financial peril. My family didn’t use much care and thus didn’t get our (or Harvard’s) money’s worth for all that coverage, but I’m ok with it. I still prefer risk pooling.</p> <p>Here’s the problem: moral hazard. It’s a word I learned when I started hanging out with health economists. It describes the tendency of people to over-consume goods that feel free, such as health care paid through premiums or desserts at an all-you-can-eat buffet. Just look at this array—how much cake do *you* want to eat for 9.99?</p> <p> </p> <p><a href="http://simplystatistics.org/wp-content/uploads/2015/01/buffet.jpg"><img class="alignnone size-large wp-image-3815" src="http://simplystatistics.org/wp-content/uploads/2015/01/buffet-1024x768.jpg" alt="buffet" width="500" height="380" /></a></p> <p>Source: https://www.flickr.com/photos/jimmonk/5687939526/in/photostream/</p> <p>One way to mitigate moral hazard is to expose people to more of their cost of care at the point of service instead of through premiums. You might think twice about that fifth tiny cake if you were paying per morsel. This is what the new Harvard faculty plans do: our premiums actually go down, but now we face a modest deductible,250 per person or $750 max for a family. This is meant to encourage faculty to use their health care more efficiently, but it still affords good protection against catastrophic costs. The out-of-pocket max remains low at$1500 per individual or \$4500 per family, with recent announcements to further protect individuals who pay more than 3% of salary in out-of-pocket health costs through a reimbursement program.</p> <p>The allocation of individuals’ contributions between premiums and point-of-service costs is partly a question of how we cross-subsidize each other. If Harvard’s total contribution remains the same and health care costs do not grow faster than wages (ha!), then increased cost sharing decreases the amount by which people who use less care subsidize those who use more. How you feel about the “right” level of cost sharing may depend on whether you’re paying or receiving a subsidy from your fellow employees. And maybe your political leanings.</p> <p>What about the argument that it is better for an employer to “pay” workers by health insurance premium contributions rather than wages because of the tax benefits? While we might prefer to get our compensation in the form of tax-free health benefits vs taxed wages, the university, like all employers, is looking ahead to the <a href="http://www.forbes.com/sites/sallypipes/2014/12/01/a-cadillac-tax-for-everyone/">Cadillac tax provision of the ACA</a>. So they have to do some re-balancing of our overall compensation. If Harvard reduces its health insurance contributions to avoid the tax, we might reasonably <a href="http://www.washingtonpost.com/blogs/wonkblog/wp/2013/08/30/youre-spending-way-more-on-your-health-benefits-than-you-think/">expect to make up that difference</a> in higher wages. The empirical evidence is <a href="http://www.hks.harvard.edu/fs/achandr/JLE_LaborMktEffectsRisingHealthInsurancePremiums_2006.pdf">complicated</a> and suggests that employers may not immediately return savings on health benefits dollar-for-dollar in the form of wages.</p> <p>As far as I can tell, Harvard is contributing roughly the same amount as last year toward my health benefits, but exact numbers are difficult to find. I switched plan types\footnote{into a high-deductible plan, but that’s a topic for another post!}, so I can’t find and directly compare Harvard’s contributions in the same plan type this year and last. Peter Ubel <a href="http://www.peterubel.com/health_policy/how-behavioral-economics-could-have-prevented-the-harvard-meltdown-over-healthcare-costs/">argues</a> that if the faculty *had* seen these figures, we might not have revolted. The actuarial value of our plans remains very high (91%, just a bit better than the expensive Platinum plans on the exchanges) and Harvard’s spending on health care has grown from 8% to 12% of the university’s budget over the past few years. Would these data have been sufficient to quell the insurrection? Good question.</p> If you were going to write a paper about the false discovery rate you should have done it in 2002 2015-01-16T10:58:04+00:00 http://simplystats.github.io/2015/01/16/if-you-were-going-to-write-a-paper-about-the-false-discovery-rate-you-should-have-done-it-in-2002 <p>People often talk about academic superstars as people who have written highly cited papers. Some of that has to do with people’s genius, or ability, or whatever. But one factor that I think sometimes gets lost is luck and timing. So I wrote a little script to get the first 30 papers that appear when you search Google Scholar for the terms:</p> <ul> <li>empirical processes</li> <li>proportional hazards model</li> <li>generalized linear model</li> <li>semiparametric</li> <li>generalized estimating equation</li> <li>false discovery rate</li> <li>microarray statistics</li> <li>lasso shrinkage</li> <li>rna-seq statistics</li> </ul> <p>Google Scholar sorts by relevance, but that relevance is driven to a large degree by citations. For example, if you look at the first 10 papers you get for searching for false discovery rate you get.</p> <ul> <li>Controlling the false discovery rate: a practical and powerful approach to multiple testing</li> <li>Thresholding of statistical maps in functional neuroimaging using the false discovery rate</li> <li>The control of the false discovery rate in multiple testing under dependency</li> <li>Controlling the false discovery rate in behavior genetics research</li> <li>Identifying differentially expressed genes using false discovery rate controlling procedures</li> <li>The positive false discovery rate: A Bayesian interpretation and the q-value</li> <li>On the adaptive control of the false discovery rate in multiple testing with independent statistics</li> <li>Implementing false discovery rate control: increasing your power</li> <li>Operating characteristics and extensions of the false discovery rate procedure</li> <li>Adaptive linear step-up procedures that control the false discovery rate</li> </ul> <p>People who work in this area will recognize that many of these papers are the most important/most cited in the field.</p> <p>Now we can make a plot that shows for each term when these 30 highest ranked papers appear. There are some missing values, because of the way the data are scraped, but this plot gives you some idea of when the most cited papers on these topics were published:</p> <p> </p> <p><a href="http://simplystatistics.org/wp-content/uploads/2015/01/citations-boxplot.png"><img class="aligncenter size-full wp-image-3798" src="http://simplystatistics.org/wp-content/uploads/2015/01/citations-boxplot.png" alt="citations-boxplot" width="600" height="400" srcset="http://simplystatistics.org/wp-content/uploads/2015/01/citations-boxplot-300x200.png 300w, http://simplystatistics.org/wp-content/uploads/2015/01/citations-boxplot-260x173.png 260w, http://simplystatistics.org/wp-content/uploads/2015/01/citations-boxplot.png 600w" sizes="(max-width: 600px) 100vw, 600px" /></a></p> <p>You can see from the plot that the median publication year of the top 30 hits for “empirical processes” was 1990 and for “RNA-seq statistics” was 2010. The medians for the other topics were:</p> <ul> <li>Emp. Proc. 1990.241</li> <li>Prop. Haz. 1990.929</li> <li>GLM 1994.433</li> <li>Semi-param. 1994.433</li> <li>GEE 2000.379</li> <li>FDR 2002.760</li> <li>microarray 2003.600</li> <li>lasso 2004.900</li> <li>rna-seq 2010.765</li> </ul> <p>I think this pretty much matches up with the intuition most people have about the relative timing of fields, with a few exceptions (GEE in particular seems a bit late). There are a bunch of reasons this analysis isn’t perfect, but it does suggest that luck and timing in choosing a problem can play a major role in the “success” of academic work as measured by citations.  It also suggests another reason for success in science than individual brilliance. Given the potentially negative consequences the <a href="http://www.sciencemag.org/content/347/6219/262.abstract">expectation of brilliance has on certain subgroups</a>, it is important to recognize the importance of timing and luck. The median most cited “false discovery rate” paper was 2002, but almost none of the 30 top hits were published after about 2008.</p> <p><a href="https://gist.github.com/jtleek/c5158965d77c21ade424">The code for my analysis is here</a>. It is super hacky so have mercy.</p> How to find the science paper behind a headline when the link is missing 2015-01-15T13:35:42+00:00 http://simplystats.github.io/2015/01/15/how-to-find-the-science-paper-behind-a-headline-when-the-link-is-missing <p>I just saw a pretty wild statistic on Twitter that less than 60% of university news releases link to the papers they are describing.</p> <p> </p> <blockquote class="twitter-tweet" width="550"> <p> Amazingly, less than 60% of university news releases link to the papers they're describing <a href="http://t.co/daN11xYvKs">http://t.co/daN11xYvKs</a> <a href="http://t.co/QtneZUAeFD">pic.twitter.com/QtneZUAeFD</a> </p> <p> &mdash; Justin Wolfers (@JustinWolfers) <a href="https://twitter.com/JustinWolfers/status/555782983429677056">January 15, 2015</a> </p> </blockquote> <p>Before you believe anything your read about science in the news, you need to go and find the original article.  When the article isn’t linked in the press release, sometimes you need to do a bit of sleuthing.  Here is an example of how I do it for a news article. In general the press-release approach is very similar, but you skip the first step I describe below.</p> <p><strong>Here is the news article (<a href="http://www.huffingtonpost.com/2015/01/14/online-avatar-personality_n_6463484.html?utm_hp_ref=science">link</a>):</strong></p> <p> </p> <p><img class="aligncenter wp-image-3787" src="http://simplystatistics.org/wp-content/uploads/2015/01/Screen-Shot-2015-01-15-at-1.11.22-PM.png" alt="Screen Shot 2015-01-15 at 1.11.22 PM" width="300" height="405" /></p> <p> </p> <p> </p> <p><strong>Step 1: Look for a link to the article</strong></p> <p>Usually it will be linked near the top or the bottom of the article. In this case, the article links to the press release about the paper. <em>This is not the original research article</em>. If you don’t get to a scientific journal you aren’t finished. In this case, the press release actually gives the full title of the article, but that will happen less than 60% of the time according to the statistic above.</p> <p> </p> <p><strong>Step 2: Look for names of the authors, scientific key words and journal name if available</strong></p> <p>You are going to use these terms to search in a minute. In this case the only two things we have are the journal name:</p> <p><a href="http://simplystatistics.org/wp-content/uploads/2015/01/Untitled-presentation-2.png"><img class="aligncenter size-full wp-image-3791" src="http://simplystatistics.org/wp-content/uploads/2015/01/Untitled-presentation-2.png" alt="Untitled presentation (2)" width="949" height="334" srcset="http://simplystatistics.org/wp-content/uploads/2015/01/Untitled-presentation-2-300x105.png 300w, http://simplystatistics.org/wp-content/uploads/2015/01/Untitled-presentation-2-260x91.png 260w, http://simplystatistics.org/wp-content/uploads/2015/01/Untitled-presentation-2.png 949w" sizes="(max-width: 949px) 100vw, 949px" /></a></p> <p> </p> <p>And some key words:</p> <p> </p> <p><a href="http://simplystatistics.org/wp-content/uploads/2015/01/Untitled-presentation-3.png"><img class="aligncenter size-full wp-image-3792" src="http://simplystatistics.org/wp-content/uploads/2015/01/Untitled-presentation-3.png" alt="Untitled presentation (3)" width="933" height="343" srcset="http://simplystatistics.org/wp-content/uploads/2015/01/Untitled-presentation-3-300x110.png 300w, http://simplystatistics.org/wp-content/uploads/2015/01/Untitled-presentation-3.png 933w" sizes="(max-width: 933px) 100vw, 933px" /></a></p> <p> </p> <p><strong>Step 3 Use Google Scholar</strong></p> <p>You could just google those words and sometimes you get the real paper, but often you just end up back at the press release/news article. So instead the best way to find the article is to go to <a href="https://scholar.google.com/">Google Scholar </a>then click on the little triangle next to the search box.</p> <p> </p> <p> </p> <p> </p> <p><a href="http://simplystatistics.org/wp-content/uploads/2015/01/Untitled-presentation-4.png"><img class="aligncenter size-full wp-image-3793" src="http://simplystatistics.org/wp-content/uploads/2015/01/Untitled-presentation-4.png" alt="Untitled presentation (4)" width="960" height="540" srcset="http://simplystatistics.org/wp-content/uploads/2015/01/Untitled-presentation-4-260x146.png 260w, http://simplystatistics.org/wp-content/uploads/2015/01/Untitled-presentation-4.png 960w" sizes="(max-width: 960px) 100vw, 960px" /></a></p> <p>Fill in information while you can. Fill in the same year as the press release, information about the journal, university and key words.</p> <p> </p> <p><a href="http://simplystatistics.org/wp-content/uploads/2015/01/Screen-Shot-2015-01-15-at-1.31.38-PM.png"><img class="aligncenter size-full wp-image-3794" src="http://simplystatistics.org/wp-content/uploads/2015/01/Screen-Shot-2015-01-15-at-1.31.38-PM.png" alt="Screen Shot 2015-01-15 at 1.31.38 PM" width="509" height="368" /></a></p> <p> </p> <p><strong>Step 4 Victory</strong></p> <p>Often this will come up with the article you are looking for:</p> <p><a href="http://simplystatistics.org/wp-content/uploads/2015/01/Screen-Shot-2015-01-15-at-1.32.45-PM.png"><img class="aligncenter size-full wp-image-3795" src="http://simplystatistics.org/wp-content/uploads/2015/01/Screen-Shot-2015-01-15-at-1.32.45-PM.png" alt="Screen Shot 2015-01-15 at 1.32.45 PM" width="813" height="658" srcset="http://simplystatistics.org/wp-content/uploads/2015/01/Screen-Shot-2015-01-15-at-1.32.45-PM-247x200.png 247w, http://simplystatistics.org/wp-content/uploads/2015/01/Screen-Shot-2015-01-15-at-1.32.45-PM.png 813w" sizes="(max-width: 813px) 100vw, 813px" /></a></p> <p> </p> <p>Unfortunately, the article may be paywalled, so if you don’t work at a university or institute with a subscription, you can always tweet the article name with the hashtag [I just saw a pretty wild statistic on Twitter that less than 60% of university news releases link to the papers they are describing.</p> <p> </p> <blockquote class="twitter-tweet" width="550"> <p> Amazingly, less than 60% of university news releases link to the papers they're describing <a href="http://t.co/daN11xYvKs">http://t.co/daN11xYvKs</a> <a href="http://t.co/QtneZUAeFD">pic.twitter.com/QtneZUAeFD</a> </p> <p> &mdash; Justin Wolfers (@JustinWolfers) <a href="https://twitter.com/JustinWolfers/status/555782983429677056">January 15, 2015</a> </p> </blockquote> <p>Before you believe anything your read about science in the news, you need to go and find the original article.  When the article isn’t linked in the press release, sometimes you need to do a bit of sleuthing.  Here is an example of how I do it for a news article. In general the press-release approach is very similar, but you skip the first step I describe below.</p> <p><strong>Here is the news article (<a href="http://www.huffingtonpost.com/2015/01/14/online-avatar-personality_n_6463484.html?utm_hp_ref=science">link</a>):</strong></p> <p> </p> <p><img class="aligncenter wp-image-3787" src="http://simplystatistics.org/wp-content/uploads/2015/01/Screen-Shot-2015-01-15-at-1.11.22-PM.png" alt="Screen Shot 2015-01-15 at 1.11.22 PM" width="300" height="405" /></p> <p> </p> <p> </p> <p><strong>Step 1: Look for a link to the article</strong></p> <p>Usually it will be linked near the top or the bottom of the article. In this case, the article links to the press release about the paper. <em>This is not the original research article</em>. If you don’t get to a scientific journal you aren’t finished. In this case, the press release actually gives the full title of the article, but that will happen less than 60% of the time according to the statistic above.</p> <p> </p> <p><strong>Step 2: Look for names of the authors, scientific key words and journal name if available</strong></p> <p>You are going to use these terms to search in a minute. In this case the only two things we have are the journal name:</p> <p><a href="http://simplystatistics.org/wp-content/uploads/2015/01/Untitled-presentation-2.png"><img class="aligncenter size-full wp-image-3791" src="http://simplystatistics.org/wp-content/uploads/2015/01/Untitled-presentation-2.png" alt="Untitled presentation (2)" width="949" height="334" srcset="http://simplystatistics.org/wp-content/uploads/2015/01/Untitled-presentation-2-300x105.png 300w, http://simplystatistics.org/wp-content/uploads/2015/01/Untitled-presentation-2-260x91.png 260w, http://simplystatistics.org/wp-content/uploads/2015/01/Untitled-presentation-2.png 949w" sizes="(max-width: 949px) 100vw, 949px" /></a></p> <p> </p> <p>And some key words:</p> <p> </p> <p><a href="http://simplystatistics.org/wp-content/uploads/2015/01/Untitled-presentation-3.png"><img class="aligncenter size-full wp-image-3792" src="http://simplystatistics.org/wp-content/uploads/2015/01/Untitled-presentation-3.png" alt="Untitled presentation (3)" width="933" height="343" srcset="http://simplystatistics.org/wp-content/uploads/2015/01/Untitled-presentation-3-300x110.png 300w, http://simplystatistics.org/wp-content/uploads/2015/01/Untitled-presentation-3.png 933w" sizes="(max-width: 933px) 100vw, 933px" /></a></p> <p> </p> <p><strong>Step 3 Use Google Scholar</strong></p> <p>You could just google those words and sometimes you get the real paper, but often you just end up back at the press release/news article. So instead the best way to find the article is to go to <a href="https://scholar.google.com/">Google Scholar </a>then click on the little triangle next to the search box.</p> <p> </p> <p> </p> <p> </p> <p><a href="http://simplystatistics.org/wp-content/uploads/2015/01/Untitled-presentation-4.png"><img class="aligncenter size-full wp-image-3793" src="http://simplystatistics.org/wp-content/uploads/2015/01/Untitled-presentation-4.png" alt="Untitled presentation (4)" width="960" height="540" srcset="http://simplystatistics.org/wp-content/uploads/2015/01/Untitled-presentation-4-260x146.png 260w, http://simplystatistics.org/wp-content/uploads/2015/01/Untitled-presentation-4.png 960w" sizes="(max-width: 960px) 100vw, 960px" /></a></p> <p>Fill in information while you can. Fill in the same year as the press release, information about the journal, university and key words.</p> <p> </p> <p><a href="http://simplystatistics.org/wp-content/uploads/2015/01/Screen-Shot-2015-01-15-at-1.31.38-PM.png"><img class="aligncenter size-full wp-image-3794" src="http://simplystatistics.org/wp-content/uploads/2015/01/Screen-Shot-2015-01-15-at-1.31.38-PM.png" alt="Screen Shot 2015-01-15 at 1.31.38 PM" width="509" height="368" /></a></p> <p> </p> <p><strong>Step 4 Victory</strong></p> <p>Often this will come up with the article you are looking for:</p> <p><a href="http://simplystatistics.org/wp-content/uploads/2015/01/Screen-Shot-2015-01-15-at-1.32.45-PM.png"><img class="aligncenter size-full wp-image-3795" src="http://simplystatistics.org/wp-content/uploads/2015/01/Screen-Shot-2015-01-15-at-1.32.45-PM.png" alt="Screen Shot 2015-01-15 at 1.32.45 PM" width="813" height="658" srcset="http://simplystatistics.org/wp-content/uploads/2015/01/Screen-Shot-2015-01-15-at-1.32.45-PM-247x200.png 247w, http://simplystatistics.org/wp-content/uploads/2015/01/Screen-Shot-2015-01-15-at-1.32.45-PM.png 813w" sizes="(max-width: 813px) 100vw, 813px" /></a></p> <p> </p> <p>Unfortunately, the article may be paywalled, so if you don’t work at a university or institute with a subscription, you can always tweet the article name with the hashtag](https://twitter.com/hashtag/icanhazpdf) and your contact info. Then you just have to hope that someone will send it to you (they often do).</p> <p> </p> <p> </p> Statistics and R for the Life Sciences: New HarvardX course starts January 19 2015-01-12T10:30:08+00:00 http://simplystats.github.io/2015/01/12/statistics-and-r-for-the-life-sciences-new-harvardx-course-starts-january-19 <p>The first course of our Biomedical Data Science online curriculum</p> <p>starts next week. You can sign up <a href="https://www.edx.org/course/statistics-r-life-sciences-harvardx-ph525-1x">here</a>. Instead of relying on</p> <p>mathematical formulas to teach statistical concepts, students can</p> <p>program along as we show computer code for simulations that illustrate</p> <p>the main ideas of exploratory data analysis and statistical inference</p> <p>(p-values, confidence intervals and power calculations for example).</p> <p>By doing this, students will learn Statistics and R simultaneously and</p> <p>will not be bogged down by having to memorize formulas. We have three types of learning modules: lectures (see picture below), screencasts and assessments. After each</p> <p>video students will have the opportunity to assess their understanding</p> <p>through homeworks involving coding in R. A big improvement over the</p> <p>first version is that we have added dozens of assessment.</p> <p>Note that this course is the first in an <a href="http://simplystatistics.org/2014/03/31/data-analysis-for-genomic-edx-course/">eight part series</a> on Data Analysis for Genomics. Updates will be provided via twitter <a href="https://twitter.com/rafalab">@rafalab</a>.</p> <p> </p> <p><a href="http://simplystatistics.org/wp-content/uploads/2015/01/edx_screenshot_v2.png"><img class="alignnone size-large wp-image-3773" src="http://simplystatistics.org/wp-content/uploads/2015/01/edx_screenshot_v2-1024x603.png" alt="edx_screenshot_v2" width="495" height="291" srcset="http://simplystatistics.org/wp-content/uploads/2015/01/edx_screenshot_v2-300x176.png 300w, http://simplystatistics.org/wp-content/uploads/2015/01/edx_screenshot_v2-1024x603.png 1024w, http://simplystatistics.org/wp-content/uploads/2015/01/edx_screenshot_v2-260x153.png 260w, http://simplystatistics.org/wp-content/uploads/2015/01/edx_screenshot_v2.png 1298w" sizes="(max-width: 495px) 100vw, 495px" /></a></p> Beast mode parenting as shown by my Fitbit data