Tag: coursera


My Online Course Development Workflow

One of the nice things about developing 9 new courses for the JHU Data Science Specialization in a short period of time is that you get to learn all kinds of cool and interesting tools. One of the ways that we were able to push out so much content in just a few months was that we did most of the work ourselves, rather than outsourcing things like video production and editing. You could argue that this results in a poorer quality final product but (a) I disagree; and (b) even if that were true, I think the content is still valuable.

The advantage of learning all the tools was that it allowed for a quick turnaround from the creation of the lecture to the final export of the video (often within a single day). With a hectic schedule, it's nice to be able to write slides in the morning, record some video in between two meetings in the afternoon, and then combine/edit all the video in the evening. Then if you realize something doesn't work, you can start over the next day and have another version done in less than 24 hours.

I thought it might be helpful to someone out there to detail the workflow and tools that I use to develop the content for my online courses.

  • I use Camtasia for Mac to do all my screencasting/recording. This is a nice tool and I think has more features than your average screen recorder. That said, if you just want to record your screen on your Mac, you can actually use the built-in Quicktime software. I used to do all of my video editing in Camtasia but now it's pretty much glorified screencasting software for me.
  • For talking head type videos I use my iPhone 5S mounted on a tripod. The iPhone produces surprisingly good 1080p HD 30 fps video and is definitely sufficient for my purposes (see here for a much better example of what can be done). I attach the phone to an Apogee microphone to pick up better sound. For some of the interviews that we do on Simply Statistics I use two iPhones (A 5S and a 4S, my older phone).
  • To record my primary sound (i.e. me talking), I use the Zoom H4N portable recorder. This thing is not cheap but it records very high-quality stereo sound. I can connect it to my computer via USB or it can record to an SD card.
  • For simple sound recording (no video or screen) I use Audacity.
  • All of my lecture videos are run through Final Cut Pro X on my 15-inch MacBook Pro with Retina Display. Videos from Camtasia are exported in Apple ProRes format and then imported into Final Cut. Learning FCPX is not for the faint of heart if you're not used to a nonlinear editor (as I was not). I bought this excellent book to help me learn it, but I still probably only use 1% of the features. In the end using a real editor was worth it because it makes merging multiple videos (i.e. multicam shots for screencasts + talking head) much easier and editing out mistakes (e.g. typos on slides) much simpler. The editor in Camtasia is pretty good but if you have more than one camera/microphone it becomes infeasible.
  • I have an 8TB Western Digital Thunderbolt drive to store the raw video for all my classes (and some backups). I also use two 1TB Thunderbolt drives to store video for individual classes (each 4-week class borders on 1TB of raw video). These smaller drives are nice because I can just throw them in my bag and edit video at home or on the weekend if I need to.
  • Finished videos are shared with a Dropbox for Business account so that Jeff, Brian, and I can all look at each other's stuff. Videos are exported to H.264/AAC and uploaded to Coursera.
  • For developing slides, Jeff, Brian, and I have standardized around using Slidify. The beauty of using Slidify is that it lets you write everything in Markdown, a super simple text format. It also makes it simpler to manage all the course material in Git/GitHub because you don't have to lug around huge PowerPoint files. Everything is a lightweight text file. And thanks to Ramnath's incredible grit and moxie, we have handy tools to easily export everything to PDF and HTML slides (HTML slides hosted via GitHub Pages).

The first courses for the Data Science Specialization start on April 7th. Don't forget to sign up!


Podcast #6: Data Analysis MOOC Post-mortem

Jeff and I talk about Jeff's recently completed MOOC on Data Analysis.


The landscape of data analysis

I have been getting some questions via email, LinkedIn, and Twitter about the content of the Data Analysis class I will be teaching for Coursera. Data Analysis and Data Science mean different things to different people. So I made a video describing how Data Analysis fits into the landscape of other quantitative classes here:

Here is the corresponding presentation. I also made a tentative list of topics we will cover, subject to change at the instructor's whim. Here it is:

  • The structure of a data analysis (steps in the process, knowing when to quit, etc.)
  • Types of data (census, designed studies, randomized trials)
  • Types of data analysis questions (exploratory, inferential, predictive, etc.)
  • How to write up a data analysis (compositional style, reproducibility, etc.)
  • Obtaining data from the web (through downloads mostly)
  • Loading data into R from different file types
  • Plotting data for exploratory purposes (boxplots, scatterplots, etc.)
  • Exploratory statistical models (clustering)
  • Statistical models for inference (linear models, basic confidence intervals/hypothesis testing)
  • Basic model checking (primarily visually)
  • The prediction process
  • Study design for prediction
  • Cross-validation
  • A couple of simple prediction models
  • Basics of simulation for evaluating models
  • Ways you can fool yourself and how to avoid them (confounding, multiple testing, etc.)

Of course that is a ton of material for 8 weeks and so obviously we will be covering just the very basics. I think it is really important to remember that being a good Data Analyst is like being a good surgeon or writer. There is no such thing as a prodigy in surgery or writing, because it requires long experience, trying lots of things out, and learning from mistakes. I hope to give people the basic information they need to get started and point to resources where they can learn more. I also hope to give them a chance to practice some of the basics a couple of times and to learn that in data analysis the first goal is to "do no harm".
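
As a small taste of two items on the list above (cross-validation and a simple prediction model), here is a minimal sketch. It is written in Python with only the standard library, although the class itself uses R. The simulated data, the function names `fit_line` and `k_fold_mse`, and the choice of a straight-line model are all assumptions made for illustration; none of this is course material.

```python
import random
from statistics import mean

def fit_line(xs, ys):
    """Least-squares fit of y = a + b * x; returns (a, b)."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    b = sxy / sxx
    return y_bar - b * x_bar, b

def k_fold_mse(xs, ys, k=5, seed=0):
    """Average held-out mean squared error over k folds."""
    idx = list(range(len(xs)))
    random.Random(seed).shuffle(idx)
    # Every k-th shuffled index goes into the same fold
    folds = [idx[i::k] for i in range(k)]
    fold_errors = []
    for held_out in folds:
        held = set(held_out)
        train = [i for i in idx if i not in held]
        # Fit only on the training indices, then score on the held-out fold
        a, b = fit_line([xs[i] for i in train], [ys[i] for i in train])
        errors = [(ys[i] - (a + b * xs[i])) ** 2 for i in held_out]
        fold_errors.append(mean(errors))
    return mean(fold_errors)

# Simulated data (hypothetical): y = 2 + 3x plus Normal noise with sd 1
rng = random.Random(42)
xs = [rng.uniform(0, 10) for _ in range(100)]
ys = [2 + 3 * x + rng.gauss(0, 1) for x in xs]

cv_mse = k_fold_mse(xs, ys, k=5)
print(round(cv_mse, 2))
```

Because the noise was simulated with variance 1, a held-out error somewhere near 1 is what you would hope to see. An error computed on the same points used for fitting would tend to look a little better than that, which is one of the ways you can fool yourself.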


Podcast #5: Coursera Debrief

Jeff and I talk with Brian Caffo about teaching MOOCs on Coursera.


Statistics project ideas for students (part 2)

A little while ago I wrote a post on statistics project ideas for students. In honor of the first Simply Statistics Coursera offering, Computing for Data Analysis, here is a new list of student projects for folks excited about trying out those new R programming skills. Again I have rated each project with my best guess at the difficulty and effort required. Happy computing!

Data Analysis

  1. Use city data to predict areas with the highest risk for parking tickets. Here is the data for Baltimore. (Difficulty: Moderate, Effort: Low/Moderate)
  2. If you have a Fitbit with a premium account, download the data into a spreadsheet (or get Chris’s data). Then build various predictors using the data: (a) are you running or walking, (b) are you having a good day or not, (c) did you eat well that day or not, (d) etc. For special bonus points create a blog with your new discoveries and share your data with the world. (Difficulty: Depends on what you are trying to predict, Effort: Moderate with Fitbit/Jawbone/etc.)

Data Collection/Synthesis

  1. Make a list of skills associated with each component of the Data Scientist Venn Diagram. Then update the data scientist R function described in this post to ask a set of questions, then plot people on the diagram. Hint: check out the readline() function. (Difficulty: Moderately low, Effort: Moderate)
  2. HealthData.gov has a ton of data from various sources about public health, medicines, etc. Some of this data is super useful for projects/analysis and some of it is just data dumps. Create an R package that downloads data from healthdata.gov and gives some measures of how useful/interesting it is for projects (e.g. number of samples in the study, number of variables measured, is it summary data or raw data, etc.) (Difficulty: Moderately hard, Effort: High)
  3. Build an up-to-date aggregator of R tutorials/how-to videos, and summarize/rate each one so that people know which ones to look at for learning which tasks. (Difficulty: Low, Effort: Medium)

Tool building

  1. Build software that creates a 2-d author list and averages people’s 2-d author lists. (Difficulty: Medium, Effort: Low)
  2. Create an R package that interacts with and downloads data from government websites and processes it in a way that is easy to analyze. (Difficulty: Medium, Effort: High)


Sunday Data/Statistics Link Roundup (9/9/12)

  1. Not necessarily statistics related, but pretty appropriate now that the school year is starting. Here is a little introduction to “how to google” (via Andrew J.). Being able to “just google it” and find answers for oneself without having to resort to asking folks is maybe the #1 most useful skill as a statistician. 
  2. A really nice presentation on interactive graphics with the googleVis package. I think one of the most interesting things about the presentation is that it was built with markdown/knitr/slidy (see slide 53). I am seeing more and more of these web-based presentations. I like them for a lot of reasons (ability to incorporate interactive graphics, easy sharing, etc.), although it is still harder than building a Powerpoint. I also wonder, what happens when you are trying to present somewhere that doesn’t have a good internet connection?
  3. We talked a lot about the ENCODE project this week. We had an interview with Steven Salzberg, then Rafa followed it up with a discussion of top-down vs. bottom-up science. Tons of data from the ENCODE project is now available; there is even a virtual machine with all the software used in the main analysis of the data that was just published. But my favorite quote/tweet/comment this week came from Leonid K. about a flawed/over-the-top piece trying to make a little too much of the ENCODE discoveries: “that’s a clown post, bro”.
  4. Another breathless post from the Chronicle about how there are “dozens of plagiarism cases being reported on Coursera”. Given that tens of thousands of people are taking the course, it would be shocking if there wasn’t plagiarism, but my guess is it is about the same rate you see in in-person classes. I will be using peer grading in my course; hopefully plagiarism software will be in place by then.
  5. A New York Times article about a new book on visualizing data for scientists/engineers. I love all the attention data visualization is getting. I’ll take a look at the book for sure. I bet it says a lot of the same things Tufte said and a lot of the things Nathan Yau says in his book. This one may just be targeted at scientists/engineers. (link via Dan S.)
  6. Edo and co. are putting together a workshop on the analysis of social network data for NIPS in December. If you do this kind of stuff, it should be a pretty awesome crowd, so get your paper in by the Oct. 15th deadline!

Why we are teaching massive open online courses (MOOCs) in R/statistics for Coursera

Editor’s Note: This post was written by Roger Peng and Jeff Leek.

A couple of weeks ago, we announced that we would be teaching free courses in Computing for Data Analysis and Data Analysis on the Coursera platform. At the same time, a number of other universities also announced partnerships with Coursera leading to a large number of new offerings. That, coupled with a new round of funding for Coursera, led to press coverage in the New York Times, the Atlantic, and other media outlets.

There was an ensuing explosion of blog posts and commentaries from academics. The opinions ranged from dramatic, to negative, to critical, to um…hilariously angry. Rafa posted a few days ago that many of the folks freaking out are missing the point - the opportunity to reach a much broader audience of folks with our course content. 

[Before continuing, we’d like to make clear that at this point no money has been exchanged between Coursera and Johns Hopkins. Coursera has not given us anything and Johns Hopkins hasn’t given them anything. For now, it’s just a mutually beneficial partnership — we get their platform and they get to use our content. In the future, Coursera will need to figure out a way to make money, and they are currently considering a number of options.] 

Now that the initial wave of hype has died down, we thought we’d outline why we are excited about participating in Coursera. We think it is only fair to start by saying this is definitely an experiment. Coursera is a newish startup and as such is still figuring out its plan/business model. Similarly, our involvement so far has been a little whirlwind and we haven’t actually taught courses yet, and we are happy to collect data and see how things turn out. So ask us again in 6 months when we are both done teaching.

But for now, this is why we are excited.

  1. Open Access. As Rafa alluded to in his post, this is an opportunity to reach a broad and diverse audience. As academics devoted to open science, we also think that opening up our courses to the biggest possible audience is, in principle, a good thing. That is why we are both basing our courses on free software and teaching the courses for free to anyone with an internet connection. 
  2. Excitement about statistics. The data revolution means that there is a really intense interest in statistics right now. It’s so exciting that Joe Blitzstein’s stat class on iTunes U has been one of the top courses on that platform. Our local superstar John McGready has also put his statistical reasoning course up on iTunes U to a similar explosion of interest. Rafa recently put his statistics for genomics lectures up on Youtube and they have already been viewed thousands of times. As people who are super pumped about the power and importance of statistics, we want to get in on the game. 
  3. We work hard to develop good materials. We put effort into building materials that our students will find useful. We want to maximize the impact of these efforts. We have over 30,000 students enrolled in our two courses so far. 
  4. It is an exciting experiment. Online teaching, including very very good online teaching, has been around for a long time. But the model of free courses at incredibly large scale is actually really new. Whether you think it is a gimmick or something here to stay, it is exciting to be part of the first experimental efforts to build courses at scale. Of course, this could flame out. We don’t know, but that is the fun of any new experiment. 
  5. Good advertising. Every professor at a research school is a start-up of one. This idea deserves its own blog post. But if you accept that premise, to keep the operation going you need good advertising. One way to do that is writing good research papers, another is having awesome students, a third is giving talks at statistical and scientific conferences. This is an amazing new opportunity to showcase the cool things that we are doing. 
  6. Coursera built some cool toys. As statisticians, we love new types of data. It’s like candy. Coursera has all sorts of cool toys for collecting data about drop out rates, participation, discussion board answers, peer review of assignments, etc. We are pretty psyched to take these out for a spin and see how we can use them to improve our teaching.
  7. Innovation is going to happen in education. The music industry spent years fighting a losing battle over music sharing. Mostly, this damaged their reputation and stopped them from developing new technology like iTunes/Spotify that became hugely influential/profitable. Education has been done the same way for hundreds (or thousands) of years. As new educational technologies develop, we’d rather be on the front lines figuring out the best new model than fighting to hold on to the old model. 

Finally, we’d like to say a word about why we think in-person education isn’t really threatened by MOOCs, at least for our courses. If you take one of our courses through Coursera you will get to see the lectures and do a few assignments. We will interact with students through message boards, videos, and tutorials. But there are only 2 of us and 30,000 people registered. So you won’t get much one-on-one interaction. On the other hand, if you come to the top Ph.D. program in biostatistics and take Data Analysis, you will now get 16 weeks of one-on-one interaction with Jeff in a classroom, working on tons of problems together. In other words, putting our lectures online now means at Johns Hopkins you get the most qualified TA you have ever had. Your professor. 


Online education: many academics are missing the point

Many academics are complaining about online education and warning us about how it can lead to a lower quality product. For example, the New York Times recently published this op-ed piece wondering if “online education [will] ever be education of the very best sort?”. Although pretty much every controlled experiment comparing online and in-class education finds that students learn just about the same under both approaches, I do agree that in-person lectures are more enjoyable for both faculty and students. But who cares? My enjoyment and the enjoyment of the 30 privileged students that physically sit in my classes seem negligible compared to the potential of reaching and educating thousands of students all over the world. Also, using recorded lectures will free up time that I can spend on one-on-one interactions with tuition-paying students. But what most excites me about online education is the possibility of being part of the movement that redefines existing disciplines as the number of people learning grows by orders of magnitude. How many Ramanujans are out there eager to learn Statistics? I would love it if they learned it from me. 


Sunday Data/Statistics Link Roundup (7/22/12)

  1. This is the paper describing how Uri Simonsohn identified academic misconduct using statistical analyses. This approach has received a huge amount of press in the scientific literature. The basic approach is that he calculates the standard deviations of mean/standard deviation estimates across groups being compared. Then he simulates from a Normal distribution and shows that under the Normal model, it is unlikely that the means/standard deviations are so similar. I think the idea is clever, but I wonder if the Normal model is the best choice here…could the estimates be similar because it was the same experimenter, etc.? I suppose the proof is in the pudding though: several of the papers he identifies have been retracted. 
  2. This is an amazing rant by a history professor at Swarthmore over the development of massive online courses, like the ones Roger, Brian and I are teaching. I think he makes some important points (especially about how we could do the same thing with open access in a heartbeat if universities/academics threw serious muscle behind it), but I have to say, I’m personally very psyched to be involved in teaching one of these big classes. I think that statistics is a field that a lot of people would like to learn something about and I’d like to make it easier for them to do that because I love statistics. I also see the strong advantage of in-person education. The folks who enroll at Hopkins and take our courses will obviously get way more one-on-one interaction, which is clearly valuable. I don’t see why it has to be one or the other…
  3. An interesting discussion with Facebook’s former head of big data. I think the first point is key. A lot of the “big data” hype has just had to do with the infrastructure needed to deal with all the data we are collecting. The bigger issue (and where statisticians will lead) is figuring out what to do with the data. 
  4. This is a great post about data smuggling. The two key points that I think are raised are: (1) how when the data get big enough, they have their own mass and aren’t going to be moved, and (2) how physically mailing hard drives is still the fastest way of transferring big data sets. That is certainly true in genomics where it is called “sneaker net” when a collaborator walks a hard drive over to our office. Hopefully putting data in physical terms will drive home the point that the new scientists are folks that deal with/manipulate/analyze data. 
  5. Not statistics related, but here is a high-bar to hold your work to: the bus-crash test. If you died in a bus-crash tomorrow, would your discipline notice? Yikes. Via C.T. Brown.
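
The simulation idea in the first item of this roundup can be sketched roughly as follows. This is only an illustration of the general approach as summarized above, not the procedure from the paper: the function name `similarity_p_value`, the pooled-SD assumption, the group size, and both sets of reported standard deviations are made up for the example.

```python
import random
from statistics import mean, stdev

def similarity_p_value(reported_sds, n_per_group, n_sims=2000, seed=1):
    """Fraction of Normal simulations whose group SDs are at least as
    similar to each other as the reported ones."""
    rng = random.Random(seed)
    # How spread out are the reported SDs across groups?
    observed_spread = stdev(reported_sds)
    # Assumption: all groups share one true SD, estimated by pooling variances
    pooled_sd = mean([s ** 2 for s in reported_sds]) ** 0.5
    hits = 0
    for _ in range(n_sims):
        # Draw one Normal sample per group and record its sample SD
        sim_sds = [
            stdev([rng.gauss(0, pooled_sd) for _ in range(n_per_group)])
            for _ in reported_sds
        ]
        if stdev(sim_sds) <= observed_spread:
            hits += 1
    return hits / n_sims

# Hypothetical reported SDs for three groups of 15 observations each
p_suspicious = similarity_p_value([2.01, 2.00, 2.02], n_per_group=15)
p_plausible = similarity_p_value([1.6, 2.0, 2.5], n_per_group=15)
print(p_suspicious, p_plausible)
```

A very small value for the first set says that, under the Normal model, three independent groups would rarely produce standard deviations that close together. As noted above, that conclusion is only as good as the Normal model and the independence assumption behind it.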