Simply Statistics

26
Jul

Online education: many academics are missing the point

Many academics are complaining about online education and warning us that it can lead to a lower-quality product. For example, the New York Times recently published an op-ed piece wondering if “online education [will] ever be education of the very best sort?” Although pretty much every controlled experiment comparing online and in-class education finds that students learn just about the same under both approaches, I do agree that in-person lectures are more enjoyable for both faculty and students. But who cares? My enjoyment and the enjoyment of the 30 privileged students who physically sit in my classes seem negligible compared to the potential of reaching and educating thousands of students all over the world. Also, using recorded lectures will free up time that I can spend on one-on-one interactions with tuition-paying students. But what most excites me about online education is the possibility of being part of the movement that redefines existing disciplines as the number of people learning grows by orders of magnitude. How many Ramanujans are out there eager to learn Statistics? I would love it if they learned it from me.

25
Jul

Really Big Objects Coming to R

I noticed in the development version of R the following note in the NEWS file:

There is a subtle change in behaviour for numeric index values 2^31 and larger.  These used never to be legitimate and so were treated as NA, sometimes with a warning.  They are now legal for long vectors so there is no longer a warning, and x[2^31] <- y will now extend the vector on a 64-bit platform and give an error on a 32-bit one.

This is significant news indeed!

Some background: In the old days, when most of us worked on 32-bit machines, objects in R were limited to about 4GB in size (and practically a lot less) because memory addresses were indexed using 32-bit numbers. When 64-bit machines became more common in the early 2000s, that limit was removed. Objects could theoretically take up more memory because of the dramatically larger address space. For the most part, this turned out to be true, although there were some growing pains as R was transitioned to run on 64-bit systems (I remember many of those pains).

However, even with the 64-bit systems, there was a key limitation, which is that vectors, one of the fundamental objects in R, could only have a maximum of 2^31-1 elements, or roughly 2.1 billion elements. This was because array indices in R were stored internally as signed integers (specifically as ‘R_len_t’), which are 32 bits on most modern systems (take a look at .Machine$integer.max in R).
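As a quick check on that number, here is a minimal sketch of the arithmetic. It is written in Python rather than R purely for illustration, and the name `INT32_MAX` is made up for this example. The value should match what `.Machine$integer.max` reports in R.

```python
# A 32-bit signed integer spends one bit on the sign,
# leaving 31 bits for the magnitude.
INT32_MAX = 2**31 - 1

print(f"{INT32_MAX:,}")  # 2,147,483,647
```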

You might think that 2.1 billion elements is a lot, and for a single vector it still is. But you have to consider the fact that internally R stores all arrays, no matter how many dimensions there are, as just long vectors. So that would limit you, for example, to a square matrix that was no bigger than roughly 46,000 by 46,000. That might have seemed like a large matrix back in 2000 but it seems downright quaint now. And if you had a 3-way array, the limit gets even smaller.
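The dimension limits follow directly from the element cap. The sketch below (again Python, with made-up variable names) works out the largest square matrix and the largest cube-shaped 3-way array that fit under 2^31-1 elements, assuming only that every array is stored as one long vector.

```python
import math

MAX_ELEMENTS = 2**31 - 1  # old cap on the length of any R vector

# Largest n such that an n x n matrix stays under the cap.
max_square_side = math.isqrt(MAX_ELEMENTS)

# Largest n such that an n x n x n array stays under the cap.
# Start from a floating-point estimate, then correct it with exact
# integer arithmetic in case the estimate is off by one.
max_cube_side = round(MAX_ELEMENTS ** (1 / 3))
while max_cube_side**3 > MAX_ELEMENTS:
    max_cube_side -= 1
while (max_cube_side + 1) ** 3 <= MAX_ELEMENTS:
    max_cube_side += 1

print(max_square_side)  # 46340
print(max_cube_side)    # 1290
```

So a 3-way array with equal sides tops out at about 1,290 along each dimension.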

Now it appears that change is a comin’. The details can be found in the R source starting at revision 59005 if you follow along on Subversion.

A new type called ‘R_xlen_t’ has been introduced with a maximum value of 4,503,599,627,370,496, which is 2^52. As they say where I grew up, that’s a lot of McNuggets. So if your computer has enough physical memory, you will soon be able to index vectors (and matrices) that are significantly longer than before.
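To put "enough physical memory" in perspective, here is a rough back-of-the-envelope sketch in Python. It assumes a numeric (double) vector at 8 bytes per element and ignores any per-object overhead. The names are illustrative only.

```python
OLD_MAX = 2**31 - 1   # previous cap on vector length
NEW_MAX = 2**52       # maximum value quoted for R_xlen_t
BYTES_PER_DOUBLE = 8  # assumption: a numeric vector of doubles

print(f"{NEW_MAX:,}")  # 4,503,599,627,370,496

# Memory for a full-length double vector under each cap.
old_gib = OLD_MAX * BYTES_PER_DOUBLE / 2**30   # just under 16 GiB
new_pib = NEW_MAX * BYTES_PER_DOUBLE // 2**50  # 32 PiB

print(f"{old_gib:.2f} GiB")  # 16.00 GiB
print(f"{new_pib} PiB")      # 32 PiB
```

In other words, a double vector at the old cap already needs about 16 GiB, so in practice physical memory, not the new index type, is likely to be the binding constraint for a long time.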

24
Jul

Proof by example and letters of recommendation

In math or statistics, proof by example does not work. One example of a phenomenon does not prove anything. For example, the fact that 2 is prime doesn’t mean that all even numbers are prime. In fact, no even numbers other than 2 are prime.

But in other areas proof by example is the best way to demonstrate something. One example is writing letters of recommendation. It is way more convincing when I get one example of something a person has achieved:

Kyle created the first R package that can be used to analyze terabytes of sequencing data in under an hour.

than something much more general but with no details:

Bryan is an excellent programmer with a mastery of six different programming languages. 

In mathematics it makes sense that proof by example does not work. There is a concrete result and even one example violating that result means it isn’t true. On the other hand, if most of the time Kyle crushes his work, but every once in a while he has an off day and doesn’t get it done, I can live with that. That’s true of a lot of applied statistical methods too. If a method works 99% of the time and fails 1% of the time, but you can discover how it failed, that is still a pretty good statistical method…

23
Jul

Facebook's Real Big Data Problem

Facebook’s first quarterly earnings report as a public company is coming out this Thursday and everyone’s wondering what will be in it. One question is whether advertisers are going to Facebook over other sites like Google.

“Advertisers need more proof that actual advertising on Facebook offers a return on investment,” said Debra Aho Williamson, an analyst with the market research firm eMarketer. “There is such disagreement over whether Facebook is the next big thing on the Internet or whether it’s going to fail miserably.”

Facebook’s unique asset is the pile of personal data it collects from 900 million users. But using that data to serve up effective, profitable advertisements is a daunting task. Google has been in the advertising game longer and has roughly $40 billion in annual revenue from advertising — 10 times that of Facebook. Since the public offering, Wall Street has tempered its expectations for Facebook’s advertising revenue, and shares closed Friday at $28.76, down from their initial price of $38.

There’s a pretty fundamental question here: Does it work?

With all the data Facebook has at its fingertips, it would be a shame if they couldn’t answer that question.
