In this post I am going to talk about a classical mathematical dimension-reduction tool from the 1980s that I really like, the Johnson-Lindenstrauss Lemma, and a more recently developed faster version of it.
Suppose one has $n$ points, $X = \{x_1,\dots,x_n\}$, in $\mathbb{R}^d$ (with $d$ very large). If $d > n$, since the points have to lie in a subspace of dimension $n$ it is clear that one can consider the projection $f:\mathbb{R}^d\to\mathbb{R}^n$ of the points to that subspace and keep the whole geometry of $X$. In particular, for every $x_i$ and $x_j$, $\|f(x_i) - f(x_j)\|^2 = \|x_i - x_j\|^2$, meaning that $f$ is an isometry in $X$.

Suppose now we allow a bit of distortion, and look for $f:\mathbb{R}^d\to\mathbb{R}^k$ that is an $\epsilon$-isometry, meaning that

$$(1)\qquad (1-\epsilon)\|x_i - x_j\|^2 \le \|f(x_i) - f(x_j)\|^2 \le (1+\epsilon)\|x_i - x_j\|^2.$$

Can we do this with $k$ much smaller than $n$?
In 1984, Johnson and Lindenstrauss showed a remarkable Lemma that answers this question positively.
Theorem 1 (Johnson-Lindenstrauss Lemma) For any $0 < \epsilon < 1$ and any integer $n$, let $k$ be such that

$$k \ge 4\left(\frac{\epsilon^2}{2} - \frac{\epsilon^3}{3}\right)^{-1}\log n.$$

Then, for any set $X$ of $n$ points in $\mathbb{R}^d$, there is a linear map $f:\mathbb{R}^d\to\mathbb{R}^k$ that is an $\epsilon$-isometry for $X$ (see (1)). Furthermore, this map can be found in randomized polynomial time.
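To get a feel for the bound, here is a small numerical sketch (in Python with numpy; the function name is my own) computing the smallest $k$ Theorem 1 allows. Note that $k$ depends only on $n$ and $\epsilon$, never on the ambient dimension $d$.

```python
import numpy as np

def jl_dimension(n, eps):
    """Smallest integer k satisfying the condition of Theorem 1:
    k >= 4 * (eps^2/2 - eps^3/3)^(-1) * log(n)."""
    return int(np.ceil(4 * np.log(n) / (eps**2 / 2 - eps**3 / 3)))

# The target dimension grows like log(n)/eps^2, independently of d.
print(jl_dimension(10**6, 0.1))  # a million points, 10% distortion
print(jl_dimension(10**6, 0.5))  # allowing more distortion shrinks k a lot
```

For a million points and $\epsilon = 0.1$ this gives $k$ on the order of $10^4$, whether the points live in $\mathbb{R}^{10^5}$ or $\mathbb{R}^{10^9}$.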
I am going to show below an elementary proof (borrowed from this paper).
Before the proof, we need a few concentration-of-measure bounds. I will omit their proofs, but they are available in the same paper, and they rely on essentially the same ideas as those used to show Hoeffding bounds.
Lemma 2 Let $y_1,\dots,y_d$ be iid standard Gaussian random variables and $Y = (y_1,\dots,y_d)$. Let $g:\mathbb{R}^d\to\mathbb{R}^k$ be the projection onto the first $k$ coordinates, and set $Z = g\left(\frac{Y}{\|Y\|}\right)$ and $L = \|Z\|^2$. It is clear that $\mathbb{E}L = \frac{k}{d}$. In fact, $L$ is very concentrated around its mean:

- If $\beta < 1$,
$$\mathrm{Prob}\left[ L \le \beta\frac{k}{d} \right] \le \exp\left( \frac{k}{2}\left(1 - \beta + \log\beta\right) \right).$$
- If $\beta > 1$,
$$\mathrm{Prob}\left[ L \ge \beta\frac{k}{d} \right] \le \exp\left( \frac{k}{2}\left(1 - \beta + \log\beta\right) \right).$$
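Lemma 2 is easy to check empirically. The sketch below (all parameter choices are mine) samples $L$ by normalizing standard Gaussian vectors, as in the statement, and compares an empirical tail probability against the bound:

```python
import numpy as np

rng = np.random.default_rng(0)
d, k, trials = 1000, 50, 2000

# L = squared norm of the first k coordinates of a uniform point on the
# sphere, obtained by normalizing a standard Gaussian vector in R^d.
Y = rng.standard_normal((trials, d))
Z = Y[:, :k] / np.linalg.norm(Y, axis=1, keepdims=True)
L = np.sum(Z**2, axis=1)

print(L.mean())  # should be close to E L = k/d = 0.05

beta = 0.5
empirical = np.mean(L <= beta * k / d)
bound = np.exp(k / 2 * (1 - beta + np.log(beta)))
print(empirical, bound)  # empirical left tail vs. the Lemma 2 bound
```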
Proof of the Johnson-Lindenstrauss Lemma:
We’ll start by showing that, given a pair $x_i, x_j$, a projection onto a random subspace of dimension $k$ will satisfy (after appropriate scaling) property (1) with high probability. WLOG, we can assume that $u = x_i - x_j$ has unit norm. Understanding the norm of the projection of $u$ onto a random subspace of dimension $k$ is the same as understanding the norm of the projection of a (uniformly) random point on $\mathcal{S}^{d-1}$ (the unit sphere in $\mathbb{R}^d$) onto a specific $k$-dimensional subspace, say, the one generated by the first $k$ canonical basis vectors. This means that we are interested in the distribution of the norm of the first $k$ entries of a random vector drawn from the uniform distribution over $\mathcal{S}^{d-1}$; this distribution is the same as that of a standard Gaussian vector in $\mathbb{R}^d$ normalized to the unit sphere.
Let $\pi$ be the projection onto a random $k$-dimensional subspace and let $f:\mathbb{R}^d\to\mathbb{R}^k$ be defined as $f(x) = \sqrt{\frac{d}{k}}\,\pi(x)$. Then (by the above discussion), given a pair of distinct $x_i$ and $x_j$, $\frac{\|f(x_i)-f(x_j)\|^2}{\|x_i-x_j\|^2}$ has the same distribution as $\frac{d}{k}L$, with $L$ as defined in Lemma 2. Using Lemma 2 (with $\beta = 1-\epsilon$), we have, given a pair $x_i, x_j$,

$$\mathrm{Prob}\left[ \|f(x_i)-f(x_j)\|^2 \le (1-\epsilon)\|x_i-x_j\|^2 \right] \le \exp\left( \frac{k}{2}\left(1-(1-\epsilon)+\log(1-\epsilon)\right) \right) \le \exp\left( -\frac{k}{2}\left(\frac{\epsilon^2}{2} - \frac{\epsilon^3}{3}\right) \right),$$

since, for $\epsilon \ge 0$, we have $\log(1-\epsilon) \le -\epsilon - \frac{\epsilon^2}{2}$.
On the other hand, using Lemma 2 with $\beta = 1+\epsilon$,

$$\mathrm{Prob}\left[ \|f(x_i)-f(x_j)\|^2 \ge (1+\epsilon)\|x_i-x_j\|^2 \right] \le \exp\left( \frac{k}{2}\left(1-(1+\epsilon)+\log(1+\epsilon)\right) \right) \le \exp\left( -\frac{k}{2}\left(\frac{\epsilon^2}{2} - \frac{\epsilon^3}{3}\right) \right),$$

since, for $\epsilon \ge 0$, we have $\log(1+\epsilon) \le \epsilon - \frac{\epsilon^2}{2} + \frac{\epsilon^3}{3}$.
This means that (by union bound)

$$\mathrm{Prob}\left[ (1) \text{ does not hold for the pair } x_i, x_j \right] \le 2\exp\left( -\frac{k}{2}\left(\frac{\epsilon^2}{2} - \frac{\epsilon^3}{3}\right) \right) \le \frac{2}{n^2},$$

where the last inequality uses the choice of $k$ in Theorem 1.
Since there exist $\binom{n}{2} = \frac{n(n-1)}{2}$ such pairs, again, a simple union bound gives

$$\mathrm{Prob}\left[ f \text{ is not an } \epsilon\text{-isometry on } X \right] \le \frac{n(n-1)}{2}\cdot\frac{2}{n^2} = 1 - \frac{1}{n}.$$
This means that $f$, chosen as a suitably scaled projection onto a random $k$-dimensional subspace, is an $\epsilon$-isometry on $X$ (see (1)) with probability at least $\frac{1}{n}$. We can therefore achieve any desired constant probability of success by trying $O(n)$ such random projections, meaning we can find an $\epsilon$-isometry in randomized polynomial time.
In fact, by taking $k$ slightly larger, we can get a good projection on the first random attempt with very high confidence. It is straightforward to adapt the proof above to obtain:
Lemma 3 For any $\epsilon > 0$, $\tau > 0$, and for any integer $n$, let $k$ be such that

$$k \ge (4 + 2\tau)\left(\frac{\epsilon^2}{2} - \frac{\epsilon^3}{3}\right)^{-1}\log n.$$

Then, for any set $X$ of $n$ points in $\mathbb{R}^d$, if $f:\mathbb{R}^d\to\mathbb{R}^k$ is taken to be a suitably scaled projection onto a random subspace of dimension $k$, then $f$ is an $\epsilon$-isometry for $X$ (see (1)) with probability at least $1 - n^{-\tau}$.
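As a sanity check, here is a sketch (parameter choices are mine) that picks $k$ as in Lemma 3, draws a suitably scaled projection onto a random $k$-dimensional subspace, and verifies property (1) on every pair of a random point set:

```python
import numpy as np

rng = np.random.default_rng(1)
n, d, eps, tau = 100, 10_000, 0.5, 2.0

# Lemma 3's target dimension: k >= (4 + 2*tau)(eps^2/2 - eps^3/3)^(-1) log n,
# giving success probability at least 1 - n^(-tau).
k = int(np.ceil((4 + 2 * tau) * np.log(n) / (eps**2 / 2 - eps**3 / 3)))

X = rng.standard_normal((n, d))  # n points in R^d

# Orthonormalizing a Gaussian matrix yields a uniformly random k-dimensional
# subspace; sqrt(d/k) is the scaling used in the proof above.
Q, _ = np.linalg.qr(rng.standard_normal((d, k)))
Y = np.sqrt(d / k) * (X @ Q)

# Check property (1) for all n(n-1)/2 pairs.
ok = True
for i in range(n):
    for j in range(i + 1, n):
        ratio = np.sum((Y[i] - Y[j]) ** 2) / np.sum((X[i] - X[j]) ** 2)
        ok = ok and (1 - eps <= ratio <= 1 + eps)
print(k, ok)  # here d = 10000 is reduced to a few hundred dimensions
```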
Lemma 3 is quite remarkable. Think about the situation where we are given a high-dimensional data set in a streaming fashion, meaning that we receive one data point at a time, consecutively. To run a dimension-reduction technique like Principal Component Analysis or Diffusion Maps, we would need to wait until we had received the last data point and only then compute the dimension-reduction map (both PCA and Diffusion Maps are, in some sense, data adaptive). Using Lemma 3, you can simply choose a projection at random at the beginning of the process (all we need is an estimate of the logarithm of the size of the data set) and map each point using this projection matrix. This can be done online: we do not need to see the next point to compute the projection of the current data point. Lemma 3 ensures that this (seemingly naive) procedure will, with high probability, not distort the data by more than $\epsilon$.
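In code, the streaming procedure amounts to drawing the projection once, up front, and then compressing each point the moment it arrives. A minimal sketch (dimensions are my own; in practice $k$ would come from Lemma 3 and an estimate of $\log n$):

```python
import numpy as np

rng = np.random.default_rng(2)
d, k = 5000, 400

# Drawn once, before any data is seen: nothing here depends on the points.
Q, _ = np.linalg.qr(rng.standard_normal((d, k)))
scale = np.sqrt(d / k)

def project(x):
    """Compress one incoming point; needs no other points, so it runs online."""
    return scale * (x @ Q)

# Simulated stream: each point is mapped immediately on arrival.
for t in range(3):
    x = rng.standard_normal(d)
    y = project(x)
    print(t, x.shape, "->", y.shape)
```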
In what follows I will very briefly and very intuitively explain an interesting idea behind a faster version of Johnson-Lindenstrauss proposed in this very nice paper.
Let’s continue thinking about the high-dimensional streaming data. After we draw the random projection matrix, say $M$, for each data point $x$ we still have to compute $Mx$ which, since $M$ has $k\times d$ entries, has a computational cost of $O(kd)$. In some applications this might be too expensive; can one do better?
It turns out that there is no hope of (significantly) reducing the number of rows $k$ of the matrix (see this paper of Noga Alon). The only hope is to speed up the matrix-vector multiplication. If we were able to construct a sparse matrix $M$, then we would definitely speed up the computation of $Mx$. Unfortunately, one can show that a sparse matrix will distort sparse vectors, so if the data set contains sparse vectors, $M$ will fail. Another option would be to exploit the Fast Fourier Transform and compute the Fourier transform of $x$ (which takes $O(d\log d)$ time) and then multiply the Fourier transform of $x$ by a sparse matrix. This will again not work, because the vector $x$ might have a sparse Fourier transform.
The solution comes from leveraging a type of uncertainty principle: $x$ and the Fourier transform of $x$ cannot both be sparse. The idea is that if, before taking the Fourier transform of $x$, one flips the signs of the entries of $x$ at random, then the probability of obtaining a sparse vector is very small, so a sparse matrix can be used for the projection. In a nutshell, the algorithm takes $M$ to be a matrix of the form $M = PHD$, where $D$ is a diagonal matrix that flips the signs of the vector randomly, $H$ is a Fourier transform, and $P$ is a sparse matrix. This method was proposed and analysed here and, roughly speaking, achieves a complexity of $O(d\log d)$, instead of the classical $O(kd)$.
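To make the shape of $M = PHD$ concrete, here is a toy sketch (not the construction analysed in the paper; every parameter choice is mine): $D$ flips signs at random, $H$ is a normalized Walsh-Hadamard transform computable in $O(d\log d)$ operations, and the sparse $P$ is crudely simulated by sampling $k$ coordinates.

```python
import numpy as np

rng = np.random.default_rng(3)
d, k = 4096, 256  # d a power of two, so a Hadamard transform applies

def hadamard(x):
    """Fast Walsh-Hadamard transform, O(d log d) operations, scaled so
    that the transform is orthogonal (norm-preserving)."""
    h = x.copy()
    step = 1
    while step < len(h):
        for i in range(0, len(h), 2 * step):
            a = h[i:i + step].copy()
            b = h[i + step:i + 2 * step].copy()
            h[i:i + step] = a + b
            h[i + step:i + 2 * step] = a - b
        step *= 2
    return h / np.sqrt(len(h))

D = rng.choice([-1.0, 1.0], size=d)            # diagonal sign-flip matrix D
coords = rng.choice(d, size=k, replace=False)  # a very sparse stand-in for P

def fast_jl(x):
    # M x = P (H (D x)): sign flips, fast transform, sparse projection.
    return np.sqrt(d / k) * hadamard(D * x)[coords]

x = rng.standard_normal(d)
y = fast_jl(x)
print(np.sum(y**2) / np.sum(x**2))  # should be close to 1
```

Since $D$ and the coordinate sample are drawn once, each subsequent point costs $O(d\log d)$ rather than $O(kd)$.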