Tag: Computing


Sunday data/statistics link roundup (12/9/12)

  1. Some interesting data/data visualizations about working conditions in the apparel industry. Here is the full report. Whenever I see reports like this, I wish the raw data were more clearly linked. I want to be able to get in, play with the data, and see if I notice something that doesn't appear in the infographics. 
  2. This is an awesome plain-language discussion of how a bunch of methods (CS and Stats) with fancy names relate to each other. It shows that CS/Machine Learning/Stats are converging in many ways and there isn't much new under the sun. On the other hand, I think the really exciting thing here is to use these methods on new questions, once people drop the stick.
  3. If you are a reader of this blog and somehow do not read anything else on the internet, you will have missed Hadley Wickham's Rcpp tutorial. In my mind, this pretty much seals it: Julia isn't going to overtake R anytime soon. In other news, Hadley is coming to visit JHSPH Biostats this week! I'm psyched to meet him. 
  4. For those of us who live in Baltimore, this interesting set of data visualizations lets you in on the crime hotspots. This is a much fancier/more thorough analysis than the one Rafa and I did way back when. 
  5. Check out the new easy stats tool from the Census (via Hilary M.) and read our interview with Tom Louis, who is heading over to the Census to do cool things. 
  6. Watch out, some Tedx talks may be pseudoscience! More later this week on the politicization/glamourization of science, so stay tuned. 

Cleveland's (?) 2001 plan for redefining statistics as "data science"

This plan has been making the rounds on Twitter and is being attributed to William Cleveland in 2001 (thanks to Kasper for the link). I’m not sure of the provenance of the document but it has some really interesting ideas and is worth reading in its entirety. I actually think that many Biostatistics departments follow the proposed distribution of effort pretty closely. 

One of the most interesting sections is the discussion of computing (emphasis mine): 

Data analysis projects today rely on databases, computer and network hardware, and computer and network software. A collection of models and methods for data analysis will be used only if the collection is implemented in a computing environment that makes the models and methods sufficiently efficient to use. In choosing competing models and methods, analysts will trade effectiveness for efficiency of use.


This suggests that statisticians should look to computing for knowledge today, just as statisticians looked to mathematics in the past.

I also found the theory section worth a read and figure it will definitely lead to some discussion: 

Mathematics is an important knowledge base for theory. It is far too important to take for granted by requiring the same body of mathematics for all. Students should study mathematics on an as-needed basis.


Not all theory is mathematical. In fact, the most fundamental theories of data science are distinctly nonmathematical. For example, the fundamentals of the Bayesian theory of inductive inference involve nonmathematical ideas about combining information from the data and information external to the data. Basic ideas are conveniently expressed by simple mathematical expressions, but mathematics is surely not at issue. 


Do you own or rent?

When it comes to computing, history has gone back and forth between what I would call the “owner model” and the “renter model”. The question is what’s the best approach and how do you determine that?

Back in the day when people like John von Neumann were busy inventing the computer to work out H-bomb calculations, there was more or less a renter model in place. Computers were obviously quite expensive and so not everyone could have one. If you wanted to do your calculation, you’d walk down to the computer room, give them your punch cards with your program written out, and they’d run it for you. Sometime later you’d get some print out with the results of your program. 

A little later, with time-sharing machines, you could log in to a central server from a dumb terminal and run your calculations that way. I guess that saved you the walk to the computer room (and all the punch cards). I still remember some of these green-screen dumb terminals from my grad school days (yes, UCLA still had these monstrosities in 1999). 

With personal computers in the 80s, you could own your own computer, so there was no need to depend on some central computer (and a connection to it) to do the work for you. As computing components got cheaper, these personal computers got more and more powerful and rivaled the servers of yore. It was difficult for me to imagine ever needing things like mainframes again, except for some esoteric applications. Especially with the development of Linux, you could have all the power of a Unix mainframe on your desk or lap (or now your palm). 

But here we are, with Jeff buying a Chromebook. Have we just taken a step back in time? Are cloud computing and the renter model the way to go? I have to say that I was a big fan of “cloud computing” back in the day. But once Linux came around, I really didn’t think there was a need for the thin client/fat server model.

But it seems we are going back that way and the reason seems to be because of mobile devices. Mobile devices are now just small computers, so many people own at least two computers (a “real” computer and a phone). With multiple computers, it’s a pain to have to synchronize both the data and the applications on them. If they’re made by different manufacturers then you can’t even have the same operating system/applications on the devices. Also, no one cares about the operating system anymore, so why should it have to be managed? The cloud helps solve some of these problems, as does owning devices from the same company (as I do, Apple fanboy that I am).

I think the all-renter model of the Chromebook is attractive, but I don’t think it’s ready for prime time just yet. Two reasons I can think of are (1) Microsoft Office and (2) slow network connections. If you want to make Jeff very unhappy, you can either (1) send him a Word document that needs to be edited in Track Changes; or (2) invite him to an international conference on some remote island. The need for a strong network connection is problematic because I’ve yet to encounter a hotel that had a fast enough connection for me to work remotely on our computing cluster. For that reason I’m sticking with my current laptop.


Apple this is ridiculous - you gotta upgrade to upgrade!?

So, along with a few folks around Hopkins, I have been kicking around the idea of developing an app for the iPhone/Android. I’ll leave the details out for now (other than to say stay tuned!). 

But to start developing an app for the iPhone, you need a version of Xcode, Apple’s development environment. The latest version of Xcode is version 4, which can only be installed on Mac OS X Lion (10.7, I think) or above. So I dutifully went off to download Lion. Except, whoops! You can only download Lion from the Mac App Store. 

Now this wouldn’t be a problem, if you didn’t need OS X Snow Leopard (10.6 or above) to access the App Store. Turns out I only have version 10.5 (must be OS X Housecat or something). I did a little searching, and it looks like the only way I can get Lion is to buy Snow Leopard first and upgrade to upgrade!

It isn’t the money so much (although it does suck to pay $60 for $30 worth of software), but the time and inconvenience this causes. Apple has done this a couple of times to me in the past, with operating systems needing to be upgraded so I could buy things from iTunes. But this is getting out of hand… maybe I need to consider the alternatives.


Data Scientist vs. Statistician

There’s an interesting discussion over at reddit on the difference between a data scientist and a statistician. My crude summary of the discussion seems to be that by and large they are the same, but the phrase “data scientist” is just the hip new name for statistician that will probably sound stupid 5 years from now.

My question is why isn’t “statistician” hip? The comments don’t seem to address that much (although a few go in that direction). There are a few interesting comments about computing. For example, from ByteMining:

Statisticians typically don’t care about performance or coding style as long as it gets a result. A loop within a loop within a loop is all the same as an O(1) lookup.
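ByteMining’s quip about a loop within a loop versus an O(1) lookup is easy to make concrete. Here is a minimal Python sketch (the data and sizes are made up for illustration, not from the reddit thread) timing a linear scan against a hash-based set lookup:

```python
import random
import time

# Hypothetical data: a list of values, the same values in a set,
# and a batch of membership queries.
values = list(range(10_000))
value_set = set(values)            # same contents, hash-based
queries = [random.randrange(20_000) for _ in range(1_000)]

# "Loop within a loop": for each query, scan the whole list -- O(n) per lookup.
start = time.perf_counter()
hits_loop = sum(1 for q in queries if any(v == q for v in values))
loop_time = time.perf_counter() - start

# Set membership: average O(1) per lookup.
start = time.perf_counter()
hits_set = sum(1 for q in queries if q in value_set)
set_time = time.perf_counter() - start

print(f"same answer: {hits_loop == hits_set}; "
      f"scan {loop_time:.4f}s vs set {set_time:.6f}s")
```

Both versions return the same answer; only the cost differs, which is exactly the distinction the comment says statisticians tend to shrug off.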

Another more down-to-earth comment comes from marshallp:

There is a real distinction between data scientist and statistician

  • the statistician spent years banging his/her head against blackboards full of math notation to get a modestly paid job

  • the data scientist gets s—loads of cash after having learnt a scripting language and an api

More people should be encouraged into data science and not pointless years of stats classes

Not sure I fully agree, but I see where he’s coming from!

[Note: See also our post on how to determine whether you are a data scientist.]