The "Big" in Big Data relates to importance, not size

In the past couple of years several non-statisticians have asked me "What is Big Data exactly?" or "How big is Big Data?" My answer has been "I think Big Data is much more about 'data' than 'big.'" I explain below.

[Google Trends figures: search interest over time for "Big Data" and "Massive Data"]

Since 2011 Big Data has been all over the news. The New York Times, The Economist, Science, Nature, and others have told us that the Big Data Revolution is upon us (see the Google Trends figure above). But was this really a revolution? What happened to the Massive Data Revolution (see figure above)? For this to be called a revolution, there must be some drastic change, a discontinuity, or a quantum leap of some kind. So has there been such a discontinuity in the rate of growth of data? Although this may be true for some fields (for example, in genomics next-generation sequencing did introduce a discontinuity around 2007), overall, data size seems to have been growing at a steady rate for decades. For example, in the graph below (see this paper for the source) note the trend in internet traffic data (which, by the way, dwarfs genomics data). There does seem to be a change in rate, but it occurred during the 1990s, which brings me to my main point.

[Figure: internet data traffic over time]

Although several fields (including Statistics) are having to innovate to keep up with growing data size, I don't see this as anything new. But I do think we are in the midst of a Big Data revolution. Although the media only noticed it recently, it started about 30 years ago. The discontinuity is not in the size of data, but in the percent of fields (across academia, industry, and government) that use data. At some point in the 1980s, with the advent of cheap computers, data moved from the file cabinet to the disk drive. Then in the 1990s, with the democratization of the internet, these data started to become easy to share. All of a sudden, people could use data to answer questions that were previously answered only by experts, theory, or intuition.

In this blog we like to point out examples, so let me review a few. Credit card companies started using purchase data to detect fraud. Baseball teams started scraping data and evaluating players without ever seeing them. Financial companies started analyzing stock market data to develop investment strategies. Environmental scientists started to gather and analyze data from air pollution monitors. Molecular biologists started quantifying outcomes of interest into matrices of numbers (as opposed to looking at stains on nylon membranes) to discover new tumor types and develop diagnostic tools. Cities started using crime data to guide policing strategies. Netflix started using customer ratings to recommend movies. Retail stores started mining bonus card data to deliver targeted advertisements. Note that all the data sets mentioned were tiny in comparison to, for example, the sky survey data collected by astronomers. But I still call this phenomenon Big Data because the percent of people using data was in fact Big.


I borrowed the title of this post from a very nice presentation by Diego Kuonen.


Sunday Data/Statistics Link Roundup (10/21/12)

  1. This scientific variant of the #whatshouldwecallme meme isn't exclusive to statistics, but it is hilarious. 
  2. This is a really interesting post that is a follow-up to the XKCD password security comic. The thing I find most interesting about this is that researchers realized the key problem with passwords was that we were looking at them purely from a computer science perspective. But people use passwords, so we need a person-focused approach to maximize security. This is a very similar idea to our previous post on an experimental foundation for statistics. Looks like Di Cook and others are already way ahead of us on this idea. It would be interesting to redefine optimality incorporating the knowledge that most of the time it is a person running the statistics. 
  3. This is another fascinating article about the math education wars. It starts off as the typical dueling-schools issue in academia: two schools of thought that routinely go after each other. But the interesting thing here is that one side of this math debate seems to be waged by a person collecting data, while the other side isn't collecting data at all. It is interesting how many areas are being touched by data, including what kind of math we should teach. 
  4. I’m going to visit Minnesota in a couple of weeks. I was so pumped up to be an outlaw. Looks like I’m just a regular law-abiding citizen though….
  5. Here are outstanding summaries of what went on at the Carl Morris Big Data conference this last week. Tons of interesting stuff there. Parts one, two, and three.

Sunday Data/Statistics Link Roundup (9/2/2012)

  1. Just got back from IBC 2012 in Kobe, Japan. I was in an awesome session (organized by the inimitable Lieven Clement) with great talks by Matt McCall, Djork-Arne Clevert, Adetayo Kasim, and Willem Talloen. Willem’s talk nicely tied in our work and how it plays into the pharmaceutical development process and the bigger theme of big data. On the way home through SFO I saw this hanging in the airport. A fitting welcome back to the states. Although, as we talked about in our first podcast, I wonder how long the Big Data hype will last…
  2. Simina B. sent this link along for a masters program in analytics at NC State. Interesting because it looks a lot like a masters in statistics program, but with a heavier emphasis on data collection/data management. I wonder what role the stat department down there is playing in this program and whether we will see more like it pop up? Or whether programs like this, with more data management, will be run by stats departments in other places. Maybe our friends down in Raleigh have some thoughts for us. 
  3. If one set of weekly links isn’t enough to fill your procrastination quota, go check out NextGenSeek’s weekly stories. A bit genomics focused, but lots of cool data/statistics links in there too. Love the “extreme Venn diagrams”. 
  4. This seems almost like the fast statistics journal I proposed earlier. Can’t seem to access the first issue/editorial board either. Doesn’t look like it is open access, so it’s still not perfect. But I love the sentiment of fast/single-round review. We can do better though. I think Yihui X. has some really interesting ideas on how. 
  5. My wife taught for a year at Grinnell in Iowa and loved it there. They just released this cool data set with a bunch of information about the college. If all colleges did this, we could really dig in and learn a lot about the American higher education system (link via Hilary M.). 
  6. From the way-back machine, a rant from Rafa about meetings. Stay tuned this week for some Simply Statistics data about our first year on the series of tubes.

Statistics/statisticians need better marketing

Statisticians have not always been great self-promoters. I think in part this comes from our tendency to be arbiters rather than being involved in the scientific process. In some ways, I think this is a good thing. Self-promotion can quickly become really annoying. On the other hand, I think our advertising shortcomings are hurting our field in a number of different ways. 

Here are a few:

  1. As Rafa points out, even though statisticians are ridiculously employable right now, it seems like statistics M.S. and Ph.D. programs are flying under the radar in all the hype about data/data science (here is an awesome one if you are looking). Computer Science and Engineering, and even the social sciences, are cornering the market on “big data”. This potentially huge and influential source of students may pass us by if we don’t advertise better. 
  2. A corollary to this is lack of funding. When the Big Data event happened at the White House with all the major funders in attendance to announce $200 million in new funding for big data, none of the invited panelists were statisticians. 
  3. Our top awards don’t get the press they do in other fields. The Nobel Prize announcements are an international event. There is always speculation/intense interest in who will win. There is similar interest around the Fields Medal in mathematics. But the top award in statistics, the COPSS award, doesn’t get nearly the attention it should. Part of the reason is lack of funding (the Fields is $15k, the COPSS is $1k). But part of the reason is that we, as statisticians, don’t announce it, share it, speculate about it, tell our friends about it, etc. The prestige of these awards can have a big impact on the visibility of a field. 
  4. A major component of visibility of a scientific discipline, for better or worse, is the popular press. The most recent article in a long list of articles at the New York Times about the data revolution does not mention statistics/statisticians. Neither do the other articles. We need to cultivate relationships with the media. 

We are all busy solving real/hard scientific and statistical problems, so we don’t have a lot of time to devote to publicity. But here are a couple of easy ways we could rapidly increase the visibility of our field, ordered roughly by the degree of time commitment. 

  1. All statisticians should have Twitter accounts and we should share/discuss our work and ideas online. The more we help each other share, the more visibility our ideas will get. 
  2. We should make sure we let the ASA know about cool things that are happening with data/statistics in our organizations and they should spread the word through their Twitter account and other social media. 
  3. We should start a conversation about who we think will win the next COPSS award in advance of the next JSM and try to get local media outlets to pick up our ideas and talk about the award. 
  4. We should be more “big tent” about statistics. ASA President Robert Rodriguez nailed this in his speech at JSM. Whenever someone does something with data, we should claim them as a statistician. Sometimes this will lead to claiming people we don’t necessarily agree with. But the big tent approach is what is allowing CS and other disciplines to overtake us in the data era. 
  5. We should consider setting up a place for statisticians to donate money to build up the award fund for the COPSS/other statistics prizes. 
  6. We should try to forge relationships with start-up companies and encourage our students to pursue industry/start-up opportunities if they have interest. The less we are insular within the academic community, the more high-profile we will be. 
  7. It would be awesome if we started a statistical literacy outreach program in communities around the U.S. We could offer free courses in community centers to teach people how to understand polling data/the census/weather reports/anything touching data. 

Those are just a few of my ideas, but I have a ton more. I’m sure other people do too and I’d love to hear them. Let’s raise the tide and lift all of our boats!


Motivating statistical projects

It seems like half of the battle in statistics is identifying an important/unsolved problem. In math this is easy; they have a list. So why is it harder for statistics? Since I have to think up projects to work on for my research group, for classes I teach, and for exams we give, I have spent some time thinking about ways that research problems in statistics arise.

I borrowed a page out of Roger’s book and made a little diagram to illustrate my ideas (actually I can’t even claim credit, it was Roger’s idea to make the diagram). The diagram shows the rough relationship of science, data, applied statistics, and theoretical statistics. Science produces data (although there are other sources), the data are analyzed using applied statistical methods, and theoretical statistics concerns the math behind statistical methods. The dotted line indicates that theoretical statistics ostensibly generalizes applied statistical methods so they can be applied in other disciplines. I do think that this type of generalization is becoming harder and harder as theoretical statistics becomes farther and farther removed from the underlying science.

Based on this diagram I see three major sources for statistical problems: 

  1. Theoretical statistical problems One component of statistics is developing the mathematical and foundational theory that proves we are doing sensible things. This type of problem often seems to be inspired by popular methods that exist or are being developed but lack mathematical detail. Not surprisingly, much of the work in this area is motivated by what is mathematically possible or convenient, rather than by concrete questions that are of concern to the scientific community. This work is important, but the current distance between theoretical statistics and science suggests that the impact will be limited primarily to the theoretical statistics community. 
  2. Applied statistics motivated by convenient sources of data. The best examples of this type of problem are the analyses in Freakonomics. Since both big data and small big data are now abundant, anyone with a laptop and an internet connection can download the Google n-gram data, a microarray from GEO, data about your city, or really data about anything, and perform an applied analysis. These analyses may not be straightforward for computational/statistical reasons and may even require the development of new methods. These problems are often very interesting/clever and so are often the types of analyses you hear about in newspaper articles about “Big Data”. But they may often be misleading or incorrect, since the underlying questions are not necessarily well founded in scientific questions. 
  3. Applied statistics problems motivated by scientific problems. The final category of statistics problems consists of those motivated by concrete scientific questions. The new sources of big data don’t necessarily make these problems any easier. They still start with a specific question for which the data may not be convenient and the math is often intractable. But the potential impact of solving a concrete scientific problem is huge, especially if many people who are generating data have a similar problem. Some examples of problems like this are: can we tell if one batch of beer is better than another, how are quantitative characteristics inherited from parent to child, which treatment is better when some people are censored, how do we estimate variance when we don’t know the distribution of the data, or how do we know which variable is important when we have millions?

So this leads back to the question, what are the biggest open problems in statistics? I would define these problems as the “high potential impact” problems from category 3. To answer this question, I think we need to ask ourselves, what are the most common problems people are trying to solve with data but can’t with what is available right now? Roger nailed this when he talked about the role of statisticians in the science club.

Here are a few ideas that could potentially turn into high-impact statistical problems, maybe our readers can think of better ones?

  1. How do we credential students taking online courses at a huge scale?
  2. How do we communicate risk about personalized medicine (or anything else) to a general population without statistical training? 
  3. Can you use social media as a preventative health tool?
  4. Can we perform randomized trials to improve public policy?

Image Credits: The Science logo is the old logo for the USU College of Science, the R is the logo for the R statistical programming language, the data image is a screenshot of Gapminder, and the theoretical statistics image comes from the Wikipedia page on the law of large numbers.

Edit: I just noticed this paper, which seems to support some of the discussion above. On the other hand, I think just saying lots of equations = fewer citations falls into category 2 and doesn’t get at the heart of the problem. 

The problem with small big data

There’s lots of talk about “big data” these days, and I think that’s great. I think it’s bringing statistics into the mainstream (even if they don’t call it statistics) and it’s creating lots of opportunities for people with statistics training. It’s one of the reasons we created this blog.

One thing that I think gets missed in much of the mainstream reporting is that, in my opinion, the biggest problems aren’t with the truly massive datasets out there that need to be mined for important information. Sure, those types of problems pose interesting challenges with respect to hardware infrastructure and algorithm design.

I think a bigger problem is what I call “small big data”. Small big data is the dataset that is collected by an individual whose data collection skills are far superior to his/her data analysis skills. You can think of the size of the problem as being measured by the ratio of the dataset size to the investigator’s statistical skill level. For someone with no statistical skills, any dataset represents “big data”.

These days, any individual can create a massive dataset with relatively few resources. In some of the work I do, we send people out with portable air pollution monitors that record pollution levels every 5 minutes over a 1-week period. People with Fitbits can get highly time-resolved data about their daily movements. A single MRI can produce millions of voxels of data.
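As a rough sketch of how quickly these collections add up, here is some back-of-the-envelope arithmetic. The 5-minute sampling interval and 1-week duration come from the monitoring example above; the cohort size and per-record byte count are illustrative assumptions, not figures from any real study.

```python
# Back-of-the-envelope size of a "small big data" collection effort.
# Sampling cadence and duration match the air pollution example in the text;
# the number of monitors and bytes per record are hypothetical.
samples_per_hour = 60 // 5                    # one reading every 5 minutes
samples_per_week = samples_per_hour * 24 * 7  # readings per monitor per week
print(samples_per_week)                       # 2016

monitors = 100            # hypothetical cohort size
bytes_per_record = 32     # hypothetical: timestamp + reading + monitor id
total_bytes = samples_per_week * monitors * bytes_per_record
print(total_bytes / 1e6, "MB")                # ~6.5 MB across the whole cohort
```

The point of the sketch is that a dataset can feel overwhelming to its collector (thousands of readings per person) while being trivially small by storage standards, which is exactly the mismatch between collection skill and analysis skill described above.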

One challenge here is that these examples all represent datasets that are large “on paper”. That is, there are a lot of bits to store, but that doesn’t mean there’s a lot of useful information there. For example, I find people are often impressed by data that are collected with very high temporal or spatial resolution. But often, you don’t need that level of detail and can get away with coarser resolution over a wider range of scenarios. For example, if you’re interested in changes in air pollution exposure across seasons but you only measure people in the summer, then it doesn’t matter if you measure levels down to the microsecond and produce terabytes of data. Another example might be the idea that sequencing technology doesn’t in fact remove biological variability, no matter how large a dataset it produces.

Another challenge is that the person who collected the data is often not qualified/prepared to analyze it. If the data collector didn’t arrange beforehand to have someone analyze the data, then they’re often stuck. Furthermore, usually the grant that paid for the data collection didn’t budget (enough) for the analysis of the data. The result is that there’s a lot of “small big data” that just sits around unanalyzed. This is an unfortunate circumstance, but in my experience quite common.

One conclusion we can draw is that we need to get more statisticians out into the field, both helping to analyze the data and, perhaps more importantly, designing good studies so that useful data are collected in the first place (as opposed to merely “big” data). But the sad truth is that there aren’t enough of us on the planet to fill the demand. So we need to come up with more creative ways to get the skills out there without requiring our physical presence.


Computational biologist blogger saves computer science department

People who read the news should be aware by now that we are in the midst of a big data era. The New York Times, for example, has been writing about this frequently. One of their most recent articles describes how UC Berkeley is getting $60 million for a new computer science center. Meanwhile, at the University of Florida the administration seems to be oblivious to all this, and about a month ago announced it was dropping its computer science department to save money. Blogger Steven Salzberg, a computational biologist known for his work in genomics, wrote a post titled “University of Florida eliminates Computer Science Department. At least they still have football” ridiculing UF for its decision. Here are my favorite quotes:

 in the midst of a technology revolution, with a shortage of engineers and computer scientists, UF decides to cut computer science completely? 

Computer scientist Carl de Boor, a member of the National Academy of Sciences and winner of the 2003 National Medal of Science, asked the UF president “What were you thinking?”

Well, his post went viral and days later UF reversed its decision! So my point is this: statistics departments, be nice to bloggers who work in genomics… one of them might save your butt some day.

Disclaimer: Steven Salzberg has a joint appointment in my department and we have joint lab meetings.


OracleWorld Claims and Sensations

Larry Ellison, the CEO of Oracle, like most technology CEOs, has a tendency toward the over-the-top sales pitch. But it’s fun to keep track of what these companies are up to just to see what they think the trends are. It seems clear that companies like IBM, Oracle, and HP, which focus substantially on the enterprise (or try to), think the future is data data data. One piece of evidence is the list of companies that they’ve acquired recently.

Ellison claims that they’ve developed a new computer that integrates hardware with software to produce an overall faster machine. Why do we need this kind of integration? Well, for data analysis, of course!

I was intrigued by this line from the article:

On Sunday Mr. Ellison mentioned a machine that he claimed would do data analysis 18 to 23 times faster than could be done on existing machines using Oracle databases. The machine would be able to compute both standard Oracle structured data as well as unstructured data like e-mails, he said.

It’s always a bit hard in these types of articles to figure out what they mean by “data analysis”, but even still, there’s an important idea here.

Alex Szalay talks about the need to “bring the computation to the data”. This comes from his experience working with ridiculous amounts of data from the Sloan Digital Sky Survey. There, the traditional model of pulling the data on to your computer, running some analyses, and then producing results just does not work. But the opposite is often reasonable. If the data are sitting in an Oracle/Microsoft/etc. database, you bring the analysis to the database and operate on the data there. Presumably, the analysis program is smaller than the dataset, or this doesn’t quite work.
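The "bring the computation to the data" idea can be sketched in a few lines: instead of pulling every row into the analysis program, you ask the database engine to compute the summary and ship back only the (tiny) result. SQLite stands in here for the large Oracle/Microsoft/etc. databases discussed above; the table and values are made up for illustration.

```python
# Minimal sketch of pushing an analysis into the database rather than
# pulling the data out. SQLite is a stand-in for an enterprise database.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE measurements (sensor TEXT, value REAL)")
conn.executemany(
    "INSERT INTO measurements VALUES (?, ?)",
    [("a", 1.0), ("a", 3.0), ("b", 10.0), ("b", 20.0)],
)

# The aggregation runs inside the database engine; only one small row per
# sensor travels back to the client, however many raw rows exist.
rows = conn.execute(
    "SELECT sensor, AVG(value) FROM measurements GROUP BY sensor ORDER BY sensor"
).fetchall()
print(rows)  # [('a', 2.0), ('b', 15.0)]
conn.close()
```

With four rows this buys nothing, but when the table holds billions of rows the difference between shipping the table and shipping the two-row summary is exactly the point Szalay is making.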

So if Oracle’s magic computer is real, it and others like it could be important as we start bringing more computations to the data.