Tidy data sets have one observation per row and one variable per column. Using this definition, big data sets can be either:
- Wide - a wide data set has a large number of measurements per observation, but fewer observations. This type of data set is typical in neuroimaging, genomics, and other biomedical applications.
- Tall - a tall data set has a large number of observations, but fewer measurements. This is the typical setting in a large clinical trial or in a basic social network analysis.
The curse of dimensionality tells us that estimating some quantities gets harder as the number of dimensions of a data set increases - as the data gets taller or wider. An example of this was nicely illustrated by my student Prasad (although it looks like his quota may be up on RStudio).
For wide data sets there is also a blessing of dimensionality. The basic reason for the blessing of dimensionality is that:
No matter how many new measurements you take on a small set of observations, the number of observations and all of their characteristics are fixed.
As an example, suppose that we make measurements on 10 people. We start out by making one measurement (blood pressure), then another (height), then another (hair color) and we keep going and going until we have one million measurements on those same 10 people. The blessing occurs because the measurements on those 10 people will all be related to each other. If 5 of the people are women and 5 are men, then any measurement that has a relationship with sex will be highly correlated with any other measurement that has a relationship with sex. So by knowing one small bit of information, you can learn a lot about many of the different measurements.
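Here is a rough simulation of the 10-person example. Everything in it is made up for illustration: the number of measurements (10,000 rather than one million), the number that depend on sex, and the size of the effect. It checks that sex-related measurements track the sex labels closely, and that the first principal component across people picks up that shared structure.

```python
import numpy as np

rng = np.random.default_rng(0)

n_people = 10
n_measurements = 10_000   # stand-in for the "one million" in the text
n_sex_related = 1_000     # hypothetical: how many measurements depend on sex
sex = np.array([0] * 5 + [1] * 5)  # 5 women, 5 men

# Each row is one measurement taken on all 10 people.
# Sex-related rows differ by 5 units between the groups (made-up effect size).
effects = np.zeros(n_measurements)
effects[:n_sex_related] = rng.choice([-5.0, 5.0], size=n_sex_related)
data = effects[:, None] * sex[None, :] + rng.normal(size=(n_measurements, n_people))


def abs_corr_with(vector, rows):
    """Absolute Pearson correlation of each row with a length-n vector."""
    v = vector - vector.mean()
    r = rows - rows.mean(axis=1, keepdims=True)
    return np.abs(r @ v) / (np.linalg.norm(r, axis=1) * np.linalg.norm(v))


median_related = float(np.median(abs_corr_with(sex, data[:n_sex_related])))
median_unrelated = float(np.median(abs_corr_with(sex, data[n_sex_related:])))

# The leading principal component across people should line up with sex.
centered = data - data.mean(axis=1, keepdims=True)
_, _, vt = np.linalg.svd(centered, full_matrices=False)
pc1_vs_sex = float(abs(np.corrcoef(vt[0], sex)[0, 1]))

print(median_related, median_unrelated, pc1_vs_sex)
```

Adding more sex-related rows only makes the first component easier to find, since the 10 people and their sexes never change.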
This blessing of dimensionality is the key idea behind many of the statistical approaches to wide data sets whether it is stated explicitly or not. I thought I'd make a very short list of some of these ideas:
How the blessing plays a role: The measurements for each observation are assumed to be a mixture of values measured from different observation types. The proportion of each observation type is assumed to be fixed across measurements, so you can take advantage of the multiple measurements to estimate the mixing percentage and perform the deconvolution. (Wenyi Wang came and gave an excellent seminar on this idea at JHU a couple of days ago, which inspired this post).
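A minimal sketch of the mixing idea, under strong assumptions that real deconvolution methods do not necessarily make: there are only two observation types, their pure profiles (`profile_a`, `profile_b`, both hypothetical) are known, and the noise is Gaussian. Because the mixing proportion is the same for every measurement, a single least-squares slope across all measurements estimates it.

```python
import numpy as np

rng = np.random.default_rng(1)

n_measurements = 5_000
true_proportion = 0.3  # fraction of type A in the mixed sample (made up)

# Hypothetical reference profiles for the two pure observation types.
profile_a = rng.normal(size=n_measurements)
profile_b = rng.normal(size=n_measurements)

# The mixed sample uses the same proportion for every measurement.
mixed = (true_proportion * profile_a
         + (1 - true_proportion) * profile_b
         + rng.normal(scale=0.5, size=n_measurements))

# mixed - b = p * (a - b) + noise, so p is a least-squares slope through the origin.
diff = profile_a - profile_b
estimated_proportion = float(diff @ (mixed - profile_b) / (diff @ diff))

print(round(estimated_proportion, 3))
```

With one measurement the proportion would not be estimable from noisy data; with thousands of measurements sharing the same proportion, the estimate becomes tight.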
How the blessing plays a role: The models assume that a hypothesis test is performed for each observation and that three things are common across observations: the probability that an observation is drawn from the null, the null distribution, and the alternative distribution. If the null is assumed known, then it is possible to use the known null distribution to estimate the common probability that an observation is drawn from the null.
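One simple way to see this is to simulate many tests and estimate the common null probability from the p-values. The sketch below assumes a standard normal null, a made-up alternative centered at 3, and a hypothetical cutoff of 0.5; the estimator is the familiar "count the large p-values" approach, not necessarily the one any particular method uses. Under a known null, p-values above the cutoff are almost all from the null and are uniformly distributed, so their count tells us the null fraction.

```python
import math

import numpy as np

rng = np.random.default_rng(2)

n_null = 8_000         # tests truly drawn from the null (made up)
n_alternative = 2_000  # tests drawn from the alternative (made up)

# Test statistics: N(0, 1) under the null, N(3, 1) under the alternative.
z = np.concatenate([
    rng.normal(loc=0.0, size=n_null),
    rng.normal(loc=3.0, size=n_alternative),
])

# Two-sided p-values computed from the known N(0, 1) null.
p_values = np.array([math.erfc(abs(value) / math.sqrt(2)) for value in z])

# Null p-values are uniform, so a fraction (1 - threshold) of them exceed the threshold.
threshold = 0.5  # hypothetical tuning choice
estimated_null_fraction = float(np.mean(p_values > threshold) / (1 - threshold))

print(round(estimated_null_fraction, 3))
```

The estimate only works because thousands of tests share the same null probability; a single test carries no information about it.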
How the blessing plays a role: A linear model is fit for each observation and the means and variances of the log ratios calculated from the model are assumed to follow a common distribution across observations. The method estimates the hyper-parameters of these common distributions and uses them to adjust any individual measurement's estimates.
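The variance half of this idea can be sketched in a few lines. This is a simplification with several assumptions of mine: the linear model is just an intercept, the prior degrees of freedom (`prior_df`) are treated as known, and the prior variance is estimated crudely as the average sample variance. Real methods estimate both hyper-parameters from the data. Each noisy variance estimate is pulled toward the common value.

```python
import numpy as np

rng = np.random.default_rng(3)

n_features = 5_000
n_replicates = 4
residual_df = n_replicates - 1
prior_df = 10.0  # hypothetical: treated as known here

# True variances drawn from a common scaled inverse chi-square distribution.
true_var = prior_df * 1.0 / rng.chisquare(prior_df, size=n_features)
data = rng.normal(scale=np.sqrt(true_var)[:, None], size=(n_features, n_replicates))

# Per-feature sample variance from only 3 residual degrees of freedom.
sample_var = data.var(axis=1, ddof=1)

# Crude estimate of the common prior variance (an assumption of this sketch).
prior_var = sample_var.mean()

# Weighted average of the prior variance and each feature's own variance.
shrunk_var = (prior_df * prior_var + residual_df * sample_var) / (prior_df + residual_df)

mse_raw = float(np.mean((sample_var - true_var) ** 2))
mse_shrunk = float(np.mean((shrunk_var - true_var) ** 2))

print(mse_raw, mse_shrunk)
```

In this simulation the shrunken variances are closer to the truth on average than the raw ones, because the other features supply information that three degrees of freedom cannot.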
4. Idea: Surrogate variable analysis
How the blessing plays a role: Each observation is assumed to be influenced by a single variable of interest (a primary variable) and multiple unmeasured confounders. Since the observations are fixed, the values of the unmeasured confounders are the same for each measurement and a supervised PCA can be used to estimate surrogates for the confounders. (see my JHU job talk for more on the blessing)
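To make the surrogate idea concrete, here is a stripped-down sketch. It is not the full surrogate variable analysis algorithm: it uses a plain PCA of the residuals rather than a supervised one, and it assumes a single unmeasured confounder (a hypothetical `batch` variable) that is balanced across the primary variable. The point is only that the same hidden values show up in every measurement, so they can be recovered from the residuals.

```python
import numpy as np

rng = np.random.default_rng(4)

n_samples = 20
n_features = 2_000

group = np.array([0.0] * 10 + [1.0] * 10)  # primary variable of interest
batch = np.tile([-1.0, 1.0], 10)           # unmeasured confounder (hypothetical)

# Every feature is affected by the same group and batch values, with its own coefficients.
primary_effects = rng.normal(size=n_features)
batch_effects = rng.normal(size=n_features)
data = (primary_effects[:, None] * group[None, :]
        + batch_effects[:, None] * batch[None, :]
        + rng.normal(size=(n_features, n_samples)))

# Remove the primary variable (and intercept) from each feature by least squares.
design = np.column_stack([np.ones(n_samples), group])
hat = design @ np.linalg.pinv(design)  # projection onto the design's column space
residuals = data - data @ hat

# The leading right singular vector of the residuals serves as a surrogate.
_, _, vt = np.linalg.svd(residuals, full_matrices=False)
surrogate = vt[0]
surrogate_vs_batch = float(abs(np.corrcoef(surrogate, batch)[0, 1]))

print(round(surrogate_vs_batch, 3))
```

The surrogate is only defined up to sign and scale, which is why the comparison uses an absolute correlation.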
The blessing of dimensionality I'm describing here is related to the idea that Andrew Gelman refers to in this 2004 post. Basically, since an increasingly large number of measurements are made on the same observations, there is an inherent structure to those observations. If you take advantage of that structure, then as the dimensionality of your problem increases you actually get better estimates of the structure in your high-dimensional data - a nice blessing!