YouTube should check its checksums

I am in the process of uploading the video lectures for Data Analysis. I am getting ready to send out the course wrap-up email and I want to include a link to the YouTube playlist as well.

Unfortunately, YouTube keeps reporting that two of the videos in week 2 are duplicates of each other. It does this even though they have different lengths (12:15 vs. 16:58), different titles, and dramatically different content. I found this on the forums:

YouTube uses a checksum to determine duplicates. The chances of having two different files containing different content but have the same checksum would be astronomical.

That isn't from the official Google documentation page, which is pretty sparse, but it is the only description I can find of how YouTube checks for duplicate content. A checksum is a function you apply to the data from a video that (ideally) returns the same value whenever the same video is uploaded and, with high probability, different values when different videos are uploaded. One possible checksum function would be the length of the video. Obviously that won't work in general, because many different videos might be exactly 2 minutes long and would all get the same value.
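To make the difference concrete, here is a minimal sketch in Python. It is not how YouTube does it (that isn't documented anywhere I can find); SHA-256 is just my stand-in for a "good" checksum, and the two byte strings are made-up placeholders for video files. The length-based checksum can't tell the two files apart, while the cryptographic hash separates them even though they differ by a single bit.

```python
import hashlib


def length_checksum(data: bytes) -> int:
    """A deliberately weak checksum: just the size of the data in bytes."""
    return len(data)


def sha256_checksum(data: bytes) -> str:
    """A strong checksum (assumed here for illustration, not YouTube's method)."""
    return hashlib.sha256(data).hexdigest()


# Hypothetical stand-ins for two video files: same size, one bit different.
video_a = bytes(1000)              # 1000 zero bytes
video_b = bytes(999) + b"\x01"     # identical except for the last bit

# The weak checksum reports a "duplicate" because the sizes match.
print(length_checksum(video_a) == length_checksum(video_b))

# The strong checksum gives different values for the two files.
print(sha256_checksum(video_a) == sha256_checksum(video_b))

# Uploading the exact same bytes again always gives the same value.
print(sha256_checksum(video_a) == sha256_checksum(bytes(1000)))
```

With a hash like this, two genuinely different files colliding by accident is wildly unlikely, which is why a false duplicate report suggests something other than a plain file checksum is going on.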

Regardless, it looks like YouTube can't distinguish my lecture videos. I'm thinking of moving to Vimeo or something else if I can't get this figured out. Of course, if someone has a suggestion (short of re-exporting the videos from Camtasia) that would let me get around this problem, I'd love to hear it!

Update: I ended up fiddling with the videos and got them to upload. Thanks for the helpful comments!


  • Andy Mitchell

    That's got to be frustrating! You've had some bad luck with technology a few times during this course. Here's the best set of suggestions I could find: Any chance you clicked upload twice and one got stuck as a private video?

  • nuada

    Try changing some video metadata; even a single changed bit should change the final checksum.

  • Fran

    If they were using a standard checksum, a collision like this would be (nearly) impossible, so chances are they are using their own (or they are mishandling the ones they have).

    But, if they are using their own, we can guess why they would do so. Perhaps many people found out that, in order to upload the same Justin Bieber video for the gazillionth time, they only needed to change one byte in their videos to change the checksum and give humanity some Bieber joy...

    So the guys at Google probably built an "intelligent hash" that tries to detect things beyond byte differences, but they won't tell you what it checks so that you cannot easily get around it.

    I checked a video of length 16:59 that you have about data analysis, so I imagine this is the one you are talking about. This video has 4 seconds of silence with an intro screen very similar to the other videos you have. So perhaps the "intelligent" hash checks the initial frames, detects this silence and the similar intro screen in your other video, and complains.

    Anyhow, their "intelligent" hash can only do so much, so if you keep changing the video it will eventually pass the filter. You could try several things:

    1 - remove the 4 seconds of silence.

    2 - do not start all the videos with "This is a video about" but instead read the title first so that every video has a "unique" beginning....

    Anyhow, just guessing.... good luck Jeff!

  • Jim Carson

    Thanks for posting your course post-mortem video - it's an interesting perspective.

    Chuck Severance's post-mortem on his course advocates being a benevolent dictator in dealing with contradictory feedback and recommends aggressive pruning of the forums.

  • Raja Doake

    I took both Computing for Data Analysis and Data Analysis. I was one of the subset that finished everything, and while it was a LOT of work, I loved it. I'm going to continue to take more classes in statistics, analysis, and machine learning as time permits (my day job as an engineer has some very heavy stretches).

    Some specific thoughts on Data Analysis:

    - The addition of a 0/3/5 "was this assignment any good" to the second rubric was helpful. It allowed me to give credit for good assignments that didn't quite fit the rubric in some areas, or penalize assignments that fit the rubric in non-substantive ways. Even if it didn't have a huge effect on overall grades, it made me feel better as a grader.

    - The 2,000 word limit for the assignments was great. It made things tough for me -- both of my assignments were exactly 2,000 words -- but having that limit made me review my assignments repeatedly and make sure that everything I was saying was a necessary part of the story I was trying to tell.

    - I was a little disappointed that the last several quizzes were all only 5 questions. I found the 10-question quizzes did a great job of forcing me to think things through and make sure I actually understood the concepts in order to get the right answers. Having many questions that build on previous questions worked really well here, and there was obviously less of that with the 5-question quizzes. Shorter quizzes during assignment periods are good, but outside of that I preferred the longer quizzes, even though it meant more work for me!

    - On that note, I put a ton of time into both classes, and that's as someone who met the suggested prereqs. I know about half a dozen other people who enrolled and started the class, and I'm the only one who finished it. I spent about 10 hours a week for 10-question quiz weeks, plus 20-25 hours for each assignment. That includes watching the lectures, which generally took me about 1.5x the listed time, since I'd pause to take notes, repeat sections, etc.

    - I would love it if there was a way to turn this class (or this type of class) into an academic credit. I took both classes as a way of finding out if I like doing pure data analysis work as much as I like thinking about doing it, as opposed to fitting it into my job when I can find the time. Now that I've confirmed my suspicion that I love it, I'm exploring my options for what to do next. I prefer the Coursera "view lectures at your own pace" model to something like (for example) the University of Washington's Data Science certification program, where you stream a 3-hour lecture live once a week. Those classes are on Coursera now too, but if you take them on Coursera rather than through the university directly, you don't get credit. This is starting to change for some Coursera classes, and I'm hoping that "build your own degree" will eventually become viable...

    Finally, thanks to you and Roger for running the classes!