Simply Statistics

03
Jul

The Massive Future of Statistics Education


NOTE: This post was written as a chapter for the not-yet-released Handbook on Statistics Education. 

Data are eating the world, but our collective ability to analyze data is going on a starvation diet.

Everywhere you turn, data are being generated somehow. By the time you read this piece, you’ll probably have collected some data. (For example, this piece has 2,072 words.) You can’t avoid data—they’re coming from all directions.

So what do we do with it? For the most part, nothing. There’s just too much data being spewed about. But for the data that we are interested in, we need to know the appropriate methods for thinking about and analyzing them. And by “we”, I mean pretty much everyone.

In the future, everyone will need some data analysis skills. People are constantly confronted with data and the need to make choices and decisions from the raw data they receive. Phones deliver information about traffic, we have ratings about restaurants or books, and even rankings of hospitals. High school students can obtain complex and rich information about the colleges to which they’re applying while admissions committees can get real-time data on applicants’ interest in the college.

Many people already have heuristic algorithms to deal with the data influx—and these algorithms may serve them well—but real statistical thinking will be needed for situations beyond choosing which restaurant to try for dinner tonight.

Limited Capacity

The McKinsey Global Institute, in a highly cited report, predicted that there would be a shortage of “data geeks” and that by 2018 there would be between 140,000 and 190,000 unfilled positions in data science. In addition, there will be an estimated 1.5 million people in managerial positions who will need to be trained to manage data scientists and to understand the output of data analysis. If history is any guide, it’s likely that these positions will get filled by people, regardless of whether they are properly trained. The potential consequences are disastrous as untrained analysts interpret complex big data coming from myriad sources of varying quality.

Who will provide the necessary training for all these unfilled positions? The field of statistics’ current system of training people and providing them with master’s degrees and PhDs is woefully inadequate to the task. In 2013, the top 10 largest statistics master’s degree programs in the U.S. graduated a total of 730 people. At this rate we will never train the people needed. While statisticians have greatly benefited from the sudden and rapid increase in the amount of data flowing around the world, our capacity for scaling up the needed training for analyzing those data is essentially nonexistent.

On top of all this, I believe that the McKinsey report grossly underestimates how many people will need to be trained in some data analysis skills in the future. Given how much data is being generated every day, and how critical it is for everyone to be able to intelligently interpret these data, I would argue that it’s necessary for everyone to have some data analysis skills. Needless to say, it’s foolish to suggest that everyone go get a master’s or even a bachelor’s degree in statistics. We need an alternate approach that is both high-quality and scalable to a large population over a short period of time.

Enter the MOOCs

In April of 2014, Jeff Leek, Brian Caffo, and I launched the Johns Hopkins Data Science Specialization on the Coursera platform. This is a sequence of nine courses that intends to provide a “soup-to-nuts” training in data science for people who are highly motivated and have some basic mathematical and computing background. The sequence of nine courses follows what we believe is the essential “data science process”:

  1. Formulating a question that can be answered with data
  2. Assembling, cleaning, tidying data relevant to a question
  3. Exploring data, checking, eliminating hypotheses
  4. Developing a statistical model
  5. Making statistical inference
  6. Communicating findings
  7. Making the work reproducible

We took these basic steps and designed courses around each one of them.

Each course is provided in a massive open online format, which means that many thousands of people typically enroll in each course every time it is offered. The learners in the courses do homework assignments, take quizzes, and peer assess the work of others in the class. All grading and assessment is handled automatically so that the process can scale to arbitrarily large enrollments. As an example, the April 2015 session of the R Programming course had nearly 45,000 learners enrolled. Each course is exactly four weeks long and runs every month.

We developed this sequence of courses in part to address the growing demand for data science training and education across the globe. Our background as biostatisticians was very closely aligned with the training needs of people interested in data science because, essentially, data science is what we do every single day. Indeed, one curriculum rule that we had was that we couldn’t include something if we didn’t in fact use it in our own work.

The sequence has a substantial amount of standard statistics content, such as probability and inference, linear models, and machine learning. It also has non-standard content, such as git, GitHub, R programming, Shiny, and Markdown. Together, the sequence covers the full spectrum of tools that we believe will be needed by the practicing data scientist.

For those who complete the nine courses, there is a capstone project at the end that involves taking all of the skills in the sequence and developing a data product. For our first capstone project we partnered with SwiftKey, a predictive text analytics company, to develop a project where learners had to build a statistical model for predicting words in a sentence. This project involves taking unstructured, messy data, processing it into an analyzable form, developing a statistical model while making tradeoffs between efficiency and accuracy, and creating a Shiny app to show off the model to the public.
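The capstone itself was done in R, and the learners' actual models were far more sophisticated. Purely as an illustration of the underlying idea (the corpus, function names, and everything else here are hypothetical), the core of a next-word predictor is a simple n-gram frequency model, sketched below in Python:

```python
from collections import Counter, defaultdict

def train_bigram_model(corpus):
    """Count word-pair frequencies from raw text. This stands in for the
    'messy data to analyzable form' step; real corpora need far more cleaning."""
    counts = defaultdict(Counter)
    for sentence in corpus:
        words = sentence.lower().split()
        for prev, nxt in zip(words, words[1:]):
            counts[prev][nxt] += 1
    return counts

def predict_next(model, word, k=3):
    """Return the k most frequent words observed after `word`."""
    return [w for w, _ in model[word.lower()].most_common(k)]

corpus = [
    "the cat sat on the mat",
    "the cat chased the mouse",
    "the dog sat on the porch",
]
model = train_bigram_model(corpus)
print(predict_next(model, "the", 1))  # ['cat'] -- the most frequent follower
```

The efficiency/accuracy tradeoff the capstone demands shows up even here: higher-order n-grams predict better but blow up the lookup table, which is exactly the kind of engineering decision learners had to defend.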

Degree Alternatives

The Data Science Specialization is not a formal degree program offered by Johns Hopkins University—learners who complete the sequence do not get any Johns Hopkins University credit—and so one might wonder what the learners get out of the program (besides, of course, the knowledge itself). To begin with, the sequence is completely portfolio based, so learners complete projects that are immediately viewable by others. This allows others to evaluate a learner’s ability on the spot with real code or data analysis.

All of the lecture content is openly available and hosted on GitHub, so outsiders can view the content and see for themselves what is being taught. This gives outsiders an opportunity to evaluate the program directly rather than having to rely on the sterling reputation of the institution teaching the courses.

Each learner who completes a course using Coursera’s “Signature Track” (which currently costs $49 per course) can get a badge on their LinkedIn profile, which shows that they completed the course. This can often be as valuable as a degree or other certification as recruiters scouring LinkedIn for data scientist positions will be able to see our completers’ certifications in various data science courses.

Finally, the scale and reach of our specialization immediately creates a large alumni social network that learners can take advantage of. As of March 2015, there were approximately 700,000 people who had taken at least one course in the specialization. These 700,000 people have a shared experience that, while not quite at the level of a college education, still is useful for forging connections between people, especially when people are searching around for jobs.

Early Numbers

So far, the sequence has been wildly successful. It averaged 182,507 enrollees a month in its first year of existence. The overall course completion rate was about 6%, and the completion rate amongst those in the “Signature Track” (i.e. paid enrollees) was 67%. In October of 2014, barely 7 months after the start of the specialization, we had 663 learners enroll in the capstone project.

Some Early Lessons

From running the Data Science Specialization for over a year now, we have learned a number of lessons, some of which were unexpected. Here, I summarize the highlights of what we’ve learned.

Data Science as Art and Science. Ironically, although the word “Science” appears in the name “Data Science”, there’s actually quite a bit about the practice of data science that doesn’t really resemble science at all. Much of what statisticians do in the act of data analysis is intuitive and ad hoc, with each data analysis being viewed as a unique flower.

When attempting to design data analysis assignments that could be graded at scale with tens of thousands of people, we discovered that designing the rubrics for grading these assignments was not trivial. The reason is that our understanding of what makes a “good” analysis different from a bad one is not well-articulated. We could not identify any community-wide understanding of what the components of a good analysis are. What are the “correct” methods to use in a given data analysis situation? What is definitely the “wrong” approach?

Although each one of us had been doing data analysis for the better part of a decade, none of us could succinctly write down what the process was and how to recognize when it was being done wrong. To paraphrase Daryl Pregibon from his 1991 talk at the National Academies of Science, we had a process that we regularly espoused but barely understood.

Content vs. Curation. Much of the content that we put online is available elsewhere. With YouTube, you can find high-quality videos on almost any topic, and our videos are not really that much better. Furthermore, the subject matter that we were teaching was in no way proprietary. The linear models that we teach are the same linear models taught everywhere else. So what exactly was the value we were providing?

Searching on YouTube requires that you know what you are looking for. This is a problem for people who are just getting into an area. Effectively, what we provided was a curation of all the knowledge that’s out there on the topic of data science (we also added our own quirky spin). Curation is hard, because you need to make definitive choices between what is and is not a core element of a field. But curation is essential for learning a field for the uninitiated.

Skill sets vs. Certification. Because we knew that we were not developing a true degree program, we knew we had to develop the program in a manner so that the learners could quickly see for themselves the value they were getting out of it. This led us to take a portfolio approach where learners produced things that could be viewed publicly.

In part because of the self-selection of the population seeking to learn data science skills, our learners were more interested in being able to demonstrate the skills taught in the course than in an abstract (but official) certification as might be obtained in a degree program. This is not unlike going to a music conservatory, where the output is your ability to play an instrument rather than the piece of paper you receive upon graduation. We feel that giving people the ability to demonstrate skills and skill sets is perhaps more important than official degrees in some instances because it gives employers a concrete sense of what a person is capable of doing.

Conclusions

As of April 2015, we had a total of 1,158 learners complete the entire specialization, including the capstone project. Given these numbers and our rate of completion for the specialization as a whole, we believe we are on our way to achieving our goal of creating a highly scalable program for training people in data science skills. Of course, this program alone will not be sufficient for all of the data science training needs of society. But we believe that the approach that we’ve taken, using non-standard MOOC channels, focusing on skill sets instead of certification, and emphasizing our role in curation, is a rich opportunity for the field of statistics to explore in order to educate the masses about our important work.

02
Jul

Looks like this R thing might be for real


Not sure how I missed this, but the Linux Foundation just announced the R Consortium for supporting the "world’s most popular language for analytics and data science and support the rapid growth of the R user community". From the Linux Foundation:

The R language is used by statisticians, analysts and data scientists to unlock value from data. It is a free and open source programming language for statistical computing and provides an interactive environment for data analysis, modeling and visualization. The R Consortium will complement the work of the R Foundation, a nonprofit organization based in Austria that maintains the language. The R Consortium will focus on user outreach and other projects designed to assist the R user and developer communities.

Founding companies and organizations of the R Consortium include The R Foundation, Platinum members Microsoft and RStudio; Gold member TIBCO Software Inc.; and Silver members Alteryx, Google, HP, Mango Solutions, Ketchum Trading and Oracle.

01
Jul

How Airbnb built a data science team


From Venturebeat:

Back then we knew so little about the business that any insight was groundbreaking; data infrastructure was fast, stable, and real-time (I was querying our production MySQL database); the company was so small that everyone was in the loop about every decision; and the data team (me) was aligned around a singular set of metrics and methodologies.

But five years and 43,000 percent growth later, things have gotten a bit more complicated. I’m happy to say that we’re also more sophisticated in the way we leverage data, and there’s now a lot more of it. The trick has been to manage scale in a way that brings together the magic of those early days with the growing needs of the present — a challenge that I know we aren’t alone in facing.

24
Jun

How public relations and the media are distorting science


Throughout history, engineers, medical doctors and other applied scientists have helped convert basic science discoveries into products, public goods and policy that have greatly improved our quality of life. With rare exceptions, it has taken years if not decades to establish these discoveries. And even the exceptions stand on the shoulders of incremental contributions. The researchers who produce this knowledge go through a slow and painstaking process to reach these achievements.

In contrast, most science related media reports that grab the public's attention fall into three categories:

  1. The exaggerated big discovery: Recent examples include the discovery of the bubonic plague in the NYC subway, liquid water on Mars, and the infidelity gene.
  2. Over-promising: These try to explain a complicated basic science finding and, in the case of biomedical research, then speculate without much explanation that the finding will "lead to a deeper understanding of diseases and new ways to treat or cure them."
  3. Science is broken: These tend to report an anecdote about an allegedly corrupt scientist, maybe cite the "Why Most Published Research Findings are False" paper, and then extrapolate speculatively.

In my estimation, despite the attention grabbing headlines, the great majority of the subject matter included in these reports will not have an impact on our lives and will not even make it into scientific textbooks. So does science still have anything to offer? Reports of the third category have even scientists particularly worried. I, however, remain optimistic. First, I do not see any empirical evidence showing that the negative effects of the lack of reproducibility are worse now than 50 years ago. Furthermore, although not widely reported in the lay press, I continue to see bodies of work built by several scientists over several years or decades with much promise of leading to tangible improvements to our quality of life.  Recent advances that I am excited about include topological insulators, PD-1 pathway inhibitors, clustered regularly interspaced short palindromic repeats, advances in solar energy technology, and prosthetic robotics.

However, there is one general aspect of science that I do believe has become worse. Specifically, it's a shift in how much scientists jockey for media attention, even if it's short-lived. Instead of striving to have a sustained impact on our field, which may take decades to achieve, an increasing number of scientists seem to be placing more value on appearing in the New York Times, giving a TED Talk or having a blog or tweet go viral. As a consequence, too many of us end up working on superficial short-term challenges that, with the help of a professionally crafted press release, may result in an attention-grabbing media report. NB: I fully support science communication efforts, but not when the primary purpose is garnering attention, rather than educating.

My concern spills over to funding agencies and philanthropic organizations as well. Consider the following two options. Option 1: be the funding agency representative tasked with organizing a big science project with a well-oiled PR machine. Option 2: be the funding agency representative in charge of several small projects, one of which may with low, but non-negligible, probability result in a Nobel Prize 30 years down the road. In the current environment, I see a preference for option 1.

I am also concerned about how this atmosphere may negatively affect societal improvements within science. Publicly shaming transgressors on Twitter or expressing one's outrage in a blog post can garner many social media clicks. However, these may have a smaller positive impact than mundane activities such as serving on a committee that, after several months of meetings, implements incremental, yet positive, changes. Time and energy spent on trying to increase internet clicks is time and energy we don't spend on the tedious administrative activities that are needed to actually effect change.

Because so many of the scientists who thrive in this atmosphere of short-lived media reports are disproportionately rewarded, I imagine investigators starting their careers feel some pressure to garner some media attention of their own. Furthermore, their view of how they are evaluated may be highly biased because evaluators who ignore media reports and focus more on the specifics of the scientific content tend to be less visible. So if you want to spend your academic career slowly building a body of work with the hope of being appreciated decades from now, you should not think that it is hopeless based on what is perhaps a distorted view of how we are currently being evaluated.

Update: changed topological insulators links to these two. Here is one more. Via David S.

16
Jun

Interview at Leanpub


A few weeks ago I sat down with Len Epp over at Leanpub to talk about my recently published book R Programming for Data Science. So far, I've only published one book through Leanpub but I'm a huge fan. They've developed a system that is, in my opinion, perfect for academic publishing. The book's written in Markdown and they compile it into PDF, ePub, and mobi formats automatically.

The full interview transcript is over at the Leanpub blog. If you want to listen to the audio of the interview, you can subscribe to the Leanpub podcast on iTunes.

R Programming for Data Science is available at Leanpub for a suggested price of $15 (but you can get it for free if you want). R code files, datasets, and video lectures are available through the various add-on packages. Thanks to all of you who've already bought a copy!

10
Jun

Johns Hopkins Data Science Specialization Capstone 2 Top Performers


The second capstone session of the Johns Hopkins Data Science Specialization concluded recently. This time, we had 1,040 learners sign up to participate in the session, which again featured a project developed in collaboration with the amazingly innovative folks at SwiftKey.

We've identified the learners listed below as the top performers in this capstone session. This is an incredibly talented group of people who have worked very hard throughout the entire nine-course specialization.  Please take some time to read their stories and look at their work. 

Ben Apple

Ben Apple is a Data Scientist and Enterprise Architect with the Department of Defense. Mr. Apple holds an MS in Information Assurance and is a PhD candidate in Information Sciences.

Why did you take the JHU Data Science Specialization?

As a self-trained data scientist, I was looking for a program that would formalize my established skills while expanding my data science knowledge and toolbox.

What are you most proud of doing as part of the JHU Data Science Specialization?

The capstone project was the most demanding aspect of the program. As such, I am most proud of the final project. The project stretched each of us beyond the standard coursework of the program and was quite satisfying.

How are you planning on using your Data Science Specialization Certificate?

To open doors so that I may further my research into the operational value of applying data science thought and practice to analytics of my domain.

Final Project: https://bengapple.shinyapps.io/coursera_nlp_capstone

Project Slide Deck: http://rpubs.com/bengapple/71376


Ivan Corneillet

A technologist, thinker, and tinkerer, Ivan facilitates the establishment of start-up companies by advising these companies about the hiring process, product development, and technology development, including big data, cloud computing, and cybersecurity. In his 17-year career, Ivan has held a wide range of engineering and management positions at various Silicon Valley companies. Ivan is a recent Wharton MBA graduate, and he previously earned his master’s degree in computer science from the Ensimag, and his master’s degree in electrical engineering from Université Joseph Fourier, both located in France.

Why did you take the JHU Data Science Specialization?

There are three reasons why I decided to enroll in the JHU Data Science Specialization. First, fresh from college, my formal education was best suited for scaling up the Internet’s infrastructure. However, because every firm in every industry now creates products and services from analyses of data, I challenged myself to learn about Internet-scale datasets. Second, I am a big supporter of MOOCs. I do not believe that MOOCs should replace traditional education; however, I do believe that MOOCs and traditional education will eventually coexist in the same way that open-source and closed-source software does (read my blog post for more information on this topic: http://ivantur.es/16PHild). Third, the Johns Hopkins University brand certainly motivated me to choose their program. With a great name comes a great curriculum and fantastic professors, right?
Once I had completed the program, I was not disappointed at all. I had read a blog post that explained that the JHU Data Science Specialization was only a start to learning about data science. I certainly agree, but I would add that this program is a great start, because the curriculum emphasizes information that is crucial, while providing additional resources to those who wish to deepen their understanding of data science. My thanks to Professors Caffo, Leek, and Peng; the TAs, and Coursera for building and delivering this track!

What are you most proud of doing as part of the JHU Data Science Specialization?

The capstone project made for a very rich and exhilarating learning experience, and was my favorite course in the specialization. Because I did not have prior knowledge in natural language processing (NLP), I had to conduct a fair amount of research. However, the program’s minimal-guidance approach mimicked a real-world environment, and gave me the opportunity to leverage my experience with developing code and designing products to get the most out of the skillset taught in the track. The result was that I created a data product that implemented state-of-the-art NLP algorithms using what I think are the best technologies (i.e., C++, JavaScript, R, Ruby, and SQL), given the choices that I had made. Bringing everything together is what made me the most proud. Additionally, my product capabilities are a far cry from IBM’s Watson, but while I am well versed in supercomputer hardware, this track helped me to gain a much deeper appreciation of Watson’s AI.

How are you planning on using your Data Science Specialization Certificate?

Thanks to the broad skillset that the specialization covered, I feel confident wearing a data science hat. The concepts and tools covered in this program helped me to better understand the concerns that data scientists have and the challenges they face. From a business standpoint, I am also better equipped to identify the opportunities that lie ahead.

Final Project: https://paspeur.shinyapps.io/wordmaster-io/

Project Slide Deck: http://rpubs.com/paspeur/wordmaster-io

Oscar de León

Oscar is an assistant researcher at a research institute in a developing country, having graduated as a licentiate in biochemistry and microbiology in 2010 from the same university which hosts the institute. He has always loved technology, programming and statistics and has engaged in self-learning of these subjects from an early age, finally using his abilities in the health-related research in which he has been involved since 2008. He is now working on the design, execution and analysis of various research projects, consulting for other researchers and students, and is looking forward to developing his academic career in biostatistics.

Why did you take the JHU Data Science Specialization?

I wanted to integrate my R experience into a more comprehensive data analysis workflow, which is exactly what this specialization offers. This was in line with the objectives of my position at the research institute in which I work, so I presented a study plan to my supervisor and she approved it. I also wanted to engage in an activity which enabled me to document my abilities in a verifiable way, and a Coursera Specialization seemed like a good option.

Additionally, I've followed the JHSPH group's courses since the first offering of Mathematical Biostatistics Bootcamp in November 2012. They have proved the standards and quality of education at their institution, and it was not something to let go by.

What are you most proud of doing as part of the JHU Data Science Specialization?

I'm not one to usually interact with other students, and I certainly didn't during most of the specialization courses, but I decided to try out the fora for the Capstone project. It was wonderful; sharing ideas with, and receiving criticism from, my peers provided a very complete learning experience. In the end, my contributions were appreciated by the community, and the few posts saying so were very rewarding. This re-kindled my passion for teaching, and I'll try to engage in it more from now on.

How are you planning on using your Data Science Specialization Certificate?

First, I'll file it with HR at my workplace, since our research projects paid for the specialization :)

I plan to use the certificate as a credential for data analysis with R when it is relevant. For example, I've been interested in offering an R workshop for life sciences students and researchers at my University, and this certificate (and the projects I prepared during the specialization) could help me show I have a working knowledge on the subject.

Final Project: https://odeleon.shinyapps.io/ngram/

Project Slide Deck: http://rpubs.com/chemman/n-gram

Jeff Hedberg

I am passionate about turning raw data into actionable insights that solve relevant business problems. I also greatly enjoy leading large, multi-functional projects with impact in areas pertaining to machine and/or sensor data.  I have a Mechanical Engineering Degree and an MBA, in addition to a wide range of Data Science (IT/Coding) skills.

Why did you take the JHU Data Science Specialization?

I was looking to gain additional exposure into Data Science as a current practitioner, and thought this would be a great program.

What are you most proud of doing as part of the JHU Data Science Specialization?

I am most proud of completing all courses with distinction (top of peers). Also, I'm proud to have achieved full points on my Capstone project despite having no prior experience in Natural Language Processing.

How are you planning on using your Data Science Specialization Certificate?

I am going to add this to my Resume and LinkedIn Profile.  I will use it to solidify my credibility as a data science practitioner of value.

Final Project: https://hedbergjeffm.shinyapps.io/Next_Word_Prediction/

Project Slide Deck: https://rpubs.com/jhedbergfd3s/74960

Hernán Martínez-Foffani

I was born in Argentina but now I'm settled in Spain. I've been working in computer technology since the eighties, in digital networks, programming, consulting, project management. Now, as CTO in a software company, I lead a small team of programmers developing a supply chain management app.

Why did you take the JHU Data Science Specialization?

In my opinion the curriculum is carefully designed with a nice balance between theory and practice. The JHU authoring and the teachers' widely known prestige ensure the content quality. The ability to choose the learning pace, one per month in my case, fits everyone's schedule.

What are you most proud of doing as part of the JHU Data Science Specialization?

The capstone, definitely. It was a fresh and interesting challenge. I sweated a lot, learned much more, and in the end had a lot of fun.

How are you planning on using your Data Science Specialization Certificate?

While for the time being I don't have any specific plan for the certificate, it's a beautiful reward for the effort done.

Final Project: https://herchu.shinyapps.io/shinytextpredict/

Project Slide Deck: http://rpubs.com/herchu1/shinytextprediction

Francois Schonken


I'm a 36-year-old South African male, born and raised. I recently (4 years now) immigrated to lovely Melbourne, Australia. I wrapped up a BSc (Hons) in Computer Science with a specialization in Computer Systems back in 2001. Next I co-founded a small boutique Software Development house operating from South Africa. I wrapped up my MBA from Melbourne Business School in 2013, and now I consult for my small boutique Software Development house and 2 (very) small internet start-ups.

Why did you take the JHU Data Science Specialization?

One of the core subjects in my MBA was Data Analysis, basically an MBA take on undergrad Statistics with a focus on application over theory (not that there was any shortage of theory). Waiting in a lobby some 6 months later, I was paging through the financial section of a business-focused weekly. I came across an article explaining how a Melbourne local applied a language called R to solve a grammatically and statistically challenging issue. The rest, as they say, is history.

What are you most proud of doing as part of the JHU Data Science Specialization?

I'm quite proud of both my Developing Data Products and Capstone projects, but for me these tangible outputs merely served as a vehicle to better understand a different way of thinking about data. I've spent most of my software development life dealing with one form or another of RDBMS (Relational Database Management Systems). This, in my experience, leads to a very set-oriented way of thinking about data.

I'm most proud of developing a new tool in my "Skills Toolbox" that I consider highly complementary to both my Software and Business outlook on projects.

How are you planning on using your Data Science Specialization Certificate?

Honestly, I had not planned on using my certificate in and of itself. The skills I've acquired have already helped shape my thinking in designing an in-house web-based consulting collaboration platform.

I do not foresee this being the last time I'll be applying Data Science thinking moving forward on my journey.

Final Project: https://schonken.shinyapps.io/WordPredictor

Project Slide Deck: http://rpubs.com/schonken/sentence-builder

David J. Tagler

David is passionate about solving the world’s most important and challenging problems. His expertise spans chemical/biomedical engineering, regenerative medicine, healthcare technology management, information technology/security, and data science/analysis. David earned his Ph.D. in Chemical Engineering from Northwestern University and B.S. in Chemical Engineering from the University of Notre Dame.

Why did you take the JHU Data Science Specialization?

I enrolled in this specialization in order to advance my statistics, programming, and data analysis skills. I was interested in taking a series of courses that covered the entire data science pipeline. I believe that these skills will be critical for success in the future.

What are you most proud of doing as part of the JHU Data Science Specialization?

I am most proud of the R programming and modeling skills that I developed throughout this specialization. Previously, I had no experience with R. Now, I can effectively manage complex data sets, perform statistical analyses, build prediction models, create publication-quality figures, and deploy web applications.

How are you planning on using your Data Science Specialization Certificate?

I look forward to utilizing these skills in future research projects. Furthermore, I plan to take additional courses in data science, machine learning, and bioinformatics.

Final Project: http://dt444.shinyapps.io/next-word-predict

Project Slide Deck: http://rpubs.com/dt444/next-word-predict

Melissa Tan

I'm a financial journalist from Singapore. I did philosophy and computer science at the University of Chicago, and I'm keen on picking up more machine learning and data viz skills.

Why did you take the JHU Data Science Specialization?

I wanted to keep up with coding, while learning new tools and techniques for wrangling and analyzing data that I could potentially apply to my job. Plus, it sounded fun. :)

What are you most proud of doing as part of the JHU Data Science Specialization?

Building a word prediction app pretty much from scratch (with a truckload of forum reading). The capstone project seemed insurmountable initially and ate up all my weekends, but getting the app to work passably was worth it.

How are you planning on using your Data Science Specialization Certificate?

It'll go on my CV, but I think it's more important to be able to actually do useful things. I'm keeping an eye out for more practical opportunities to apply and sharpen what I've learnt.

Final Project: https://melissatan.shinyapps.io/word_psychic/

Project Slide Deck: https://rpubs.com/melissatan/capstone

Felicia Yii

Felicia likes to dream, think and do. With over 20 years in the IT industry, her current fascination is the intersection of people, information and decision-making. Ever inquisitive, she has acquired expertise in subjects as diverse as coding, cookery, costume making and cosmetics chemistry. It's not apparent that there is anything she can't learn to do, apart from housework. Felicia lives in Wellington, New Zealand with her husband, two children and two cats.

Why did you take the JHU Data Science Specialization?

Well, I love learning and the JHU Data Science Specialization appealed to my thirst for a new challenge. I'm really interested in how we can use data to help people make better decisions.  There's so much data out there these days that it is easy to be overwhelmed by it all. Data visualisation was at the heart of my motivation when starting out. As I got into the nitty gritty of the course, I really began to see the power of making data accessible and appealing to the data-agnostic world. There's so much potential for data science thinking in my professional work.

What are you most proud of doing as part of the JHU Data Science Specialization?

Getting through it, for starters, while also working and looking after two children. Seriously though, being able to say I know what 'practical machine learning' is all about. Before I started the course, I had limited knowledge of statistics, let alone how to apply it in a machine learning context. I was thrilled to be able to use what I learned to test a cool game concept in my final project.

How are you planning on using your Data Science Specialization Certificate?

I want to use what I have learned in as many ways possible. Firstly, I see opportunities to apply my skills to my day-to-day work in information technology. Secondly, I would like to help organisations that don't have the skills or expertise in-house to apply data science thinking to help their decision making and communication. Thirdly, it would be cool one day to have my own company consulting on data science. I've more work to do to get there though!

Final Project: https://micasagroup.shinyapps.io/nwpgame/

Project Slide Deck: https://rpubs.com/MicasaGroup/74788

 

09
Jun

Batch effects are everywhere! Deflategate edition


In my opinion, batch effects are the biggest challenge faced by genomics research, especially in precision medicine. As we point out in this review, they are everywhere in high-throughput experiments. But batch effects are not specific to genomics technology. In fact, in this 1972 paper (paywalled), WJ Youden describes batch effects in the context of measurements made by physicists. Check out this plot of speed of light estimates with confidence intervals (red and green are from the same lab).

[Figure: speed of light estimates with confidence intervals; red and green points come from the same lab]

Sometimes you find batch effects where you least expect them. For example, in the deflategate debate. Here is a quote from the New England Patriots' deflategate rebuttal (written with help from Nobel Prize winner Roderick MacKinnon):

in other words, the Colts balls were measured after the Patriots balls and had warmed up more. For the above reasons, the Wells Report conclusion that physical law cannot explain the pressures is incorrect.
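The physical law in question is just the ideal gas law: at fixed volume, a ball's absolute pressure scales with its absolute temperature, so a ball measured while still field-cold reads lower than one that has warmed up in the measurement room. A minimal sketch of the calculation (the 69°F locker-room and 48°F field temperatures are illustrative assumptions, not figures from the Wells Report):

```python
# Gay-Lussac's law: at fixed volume, absolute pressure is proportional
# to absolute temperature (P1 / T1 = P2 / T2).

def f_to_kelvin(t_f):
    """Convert degrees Fahrenheit to Kelvin."""
    return (t_f - 32.0) * 5.0 / 9.0 + 273.15

def gauge_pressure_after(p_gauge_psi, t_before_f, t_after_f, p_atm_psi=14.7):
    """Gauge pressure of a ball after a temperature change.

    Gauges read pressure relative to the atmosphere, so convert to
    absolute pressure, scale by the absolute-temperature ratio, and
    convert back to gauge pressure.
    """
    p_abs_before = p_gauge_psi + p_atm_psi
    p_abs_after = p_abs_before * f_to_kelvin(t_after_f) / f_to_kelvin(t_before_f)
    return p_abs_after - p_atm_psi

# A ball inflated to 12.5 psi in a 69 F locker room and measured at 48 F
# on the field reads roughly 1.1 psi lower -- with no tampering at all.
print(round(gauge_pressure_after(12.5, 69, 48), 2))
```

Balls measured later in the sequence have had more time to warm back up, so the order of measurement alone introduces a systematic difference between the two sets of balls: a batch effect.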

Here is another one:

In the pressure measurements physical conditions were not very well-defined and major uncertainties, such as which gauge was used in pre-game measurements, affect conclusions.

So NFL, please read our paper before you accuse a player of cheating.

Disclaimer: I live in New England but I am a Ravens fan.

08
Jun

I'm a data scientist - mind if I do surgery on your heart?


There has been a lot of recent interest, from scientific journals and other folks, in creating checklists for data science and data analysis. The idea is that checklists will help keep results that won't reproduce or replicate out of the literature. One analogy I frequently hear is to the checklists for surgeons that can help reduce patient mortality.

The one major difference between checklists for surgeons and the checklists I'm seeing for research purposes is the difference in credentialing between people allowed to perform surgery and people allowed to perform complex data analysis. You would never let me do surgery on you. I have no medical training at all. But I'm frequently asked to review papers that include complicated and technical data analyses yet involve no trained data analysts or statisticians. The most common approach is that a postdoc or graduate student in the group is assigned to do the analysis, even if they don't have much formal training. Whenever this happens, red flags go up all over the place. Just as I wouldn't trust someone without years of training and a medical license to do surgery on me, I wouldn't let someone without years of training and credentials in data analysis draw major conclusions from a complex data analysis.

You might argue that the consequences for surgery and for complex data analysis are on completely different scales. I'd agree with you, but not in the direction that you might think. I would argue that high-pressure, complex data analysis can have much larger consequences than surgery. In surgery there is usually only one person who can be hurt. But if you do a bad data analysis, say claiming that vaccines cause autism, that can have massive consequences for hundreds or even thousands of people. So complex data analysis, especially for important results, should be treated with at least as much care as surgery.

The reason I don't think checklists alone will solve the problem is that they are likely to be used by people without formal training. One obvious (and recent) example that I think makes this really clear is the HealthKit data we are about to start seeing. A ton of people signed up for studies on their iPhones, and it has been all over the news. The checklist will (almost certainly) say to have a big sample size. HealthKit studies will certainly pass the checklist, but they are going to get Truman/Deweyed big time if they aren't careful about biased sampling.

If I walked into an operating room and said I was going to start dabbling in surgery, I would be immediately thrown out. But people do that with statistics and data analysis all the time. What we really need is to require careful training and expertise in data analysis for each paper that analyzes data. Until we treat data analysis as a first-class component of the scientific process, we'll continue to see retractions, falsifications, and irreproducible results flourish.
04
Jun

Interview with Class Central


Recently I sat down with Class Central to do an interview about the Johns Hopkins Data Science Specialization. We talked about the motivation for designing the sequence and the capstone project. With the demand for data science skills greater than ever, the importance of the specialization is only increasing.

See the full interview at the Class Central site. Below is a short excerpt.

01
Jun

Interview with Chris Wiggins, chief data scientist at the New York Times


Editor's note: We are trying something a little new here and doing an interview with Google Hangouts on Air. The interview will be live at 11:30am EST. I have some questions lined up for Chris, but if you have others you'd like to ask, you can tweet them to @simplystats and I'll see if I can work them in. After the livestream we'll leave the video on YouTube so you can check out the interview if you can't watch the live stream. I'm embedding the YouTube video here, but if you can't see the live stream when it is running, go check out the event page: https://plus.google.com/events/c7chrkg0ene47mikqrvevrg3a4o.