The Johnson-Lindenstrauss Lemma

In this post I am going to talk about a classical mathematical dimension-reduction tool (from the 80’s) that I really like, the Johnson-Lindenstrauss Lemma, and a faster version of it, more recently developed.

Suppose one has {n} points, {X = \{x_1,\dots,x_n\}}, in {\mathbb{R}^d} (with {d} very large). If {d>n}, since the points have to lie in a subspace of dimension {n} it is clear that one can consider the projection {f:\mathbb{R}^d\rightarrow \mathbb{R}^n} of the points to that subspace and keep the whole geometry of {X}. In particular, for every {x_i} and {x_j}, {\|f(x_i)-f(x_j)\|^2 = \|x_i-x_j\|^2}, meaning that {f} is an isometry in {X}.

Let’s say that we allow a bit of distortion, we are looking for {f:\mathbb{R}^d\rightarrow \mathbb{R}^k} that is an {\epsilon-}isometry, meaning that

\displaystyle (1-\epsilon) \|x_i-x_j\|^2 \leq \|f(x_i)-f(x_j)\|^2 \leq (1+\epsilon) \|x_i-x_j\|^2. \ \ \ \ \ (1)

Can we do better than {k=n}?

In 1984, Johnson and Lindenstrauss showed a remarkable Lemma that answers this question positively.

Theorem 1 (Johnson-Lindenstrauss Lemma) For any {0<\epsilon<1} and any integer {n}, let {k} such that

\displaystyle k \geq 4\frac1{\epsilon^2/2 - \epsilon^3/3}\log n.

Then, for any set {X} of {n} points in {\mathbb{R}^d}, there is a linear map {f:\mathbb{R}^d\rightarrow\mathbb{R}^k} that is an {\epsilon-}isometry for {X}. Furthermore, this map can be found in randomized polynomial time.

I going to show below an elementary proof (borrowed from this paper).

Before the proof, we need a few concentration of measure bounds, I will omit the proof of those but they are available in the same paper and their proof relies on essentially the same ideas as used to show Hoefding bounds.

Lemma 2 Let {y_1,...,y_d} iid standard Gaussian random variables and {Y = (y_1,\dots,y_d)}. Let {g:\mathbb{R}^d\rightarrow\mathbb{R}^k} be the projection into the first {k} coordinates and {Z = g\left(\frac{Y}{\|Y\|}\right) = \frac1{\|Y\|}(y_1,\dots,y_k)} and {L = \|Z\|^2}, it is clear that {\mathbb{E} L = \frac{k}d}. In fact, {L} is very concentrated around its mean

  • If {\beta<1},

    \displaystyle \mathrm{Prob}\left[ L \leq \beta \frac{k}d\right] \leq \exp\left( \frac{k}2(1-\beta +\log\beta) \right).

  • If {\beta>1},

    \displaystyle \mathrm{Prob}\left[ L \geq \beta \frac{k}d\right] \leq \exp\left( \frac{k}2(1-\beta +\log\beta) \right).

Proof of the Johnson-Lindenstrauss Lemma:

We’ll start by showing that, given a pair {x_i,x_j}, a projection onto a random subspace of dimension {k} will satisfy (after appropriate scaling) property (1) with high probability. WLOG, we can assume that {u = x_i - x_j} is unit norm. Understanding what is the norm of the projection of {u} on a random subspace of dimension {k} is the same as understanding the norm of the projection of a (uniformly) random point on {S^{d-1}} (the unit sphere in {\mathbb{R}^d}) on a specific {k-}dimensional subspace, let’s say the one generated by the first {k} canonical basis vectors. This means that we are interested in the distribution of the norm of the first {k} entries of a random vector drawn from the uniform distribution over {S^{d-1}} — this distribution is the same as taking a standard gaussian vector in {\mathbb{R}^d} and normalizing it to the unit sphere.

Let {g:\mathbb{R}^d\rightarrow\mathbb{R}^k} be the projection on a random {k-}dimensional subspace and let {f:\mathbb{R}^d\rightarrow\mathbb{R}^k} defined as {f = \frac{d}kg}. Then (by the above discussion), given a pair of distinct {x_i} and {x_j}, {\frac{\|f(x_i)-f(x_j)\|^2}{\|x_i-x_j\|^2}} has the same distribution as {\frac{d}kL}, as defined in Lemma~2. Using Lemma~2, we have, given a pair {x_i,x_j},

\displaystyle \mathrm{Prob}\left[ \frac{\|f(x_i)-f(x_j)\|^2}{\|x_i-x_j\|^2} \leq (1-\epsilon) \right] \leq \exp\left( \frac{k}2(1-(1-\epsilon) +\log(1-\epsilon)) \right),

since, for {\epsilon\geq 0}, {\log(1-\epsilon) \leq -\epsilon -\epsilon^2/2} we have

\displaystyle \begin{array}{rcl} \mathrm{Prob}\left[ \frac{\|f(x_i)-f(x_j)\|^2}{\|x_i-x_j\|^2} \leq (1-\epsilon) \right] & \leq & \exp\left( -\frac{k\epsilon^2}4 \right) \\ & \leq & \exp\left( -2\log n \right) = \frac1{n^2}. \end{array}

On the other hand,

\displaystyle \mathrm{Prob}\left[ \frac{\|f(x_i)-f(x_j)\|^2}{\|x_i-x_j\|^2} \geq (1+\epsilon) \right] \leq \exp\left( \frac{k}2(1-(1+\epsilon) +\log(1+\epsilon)) \right).

since, for {\epsilon\geq 0}, {\log(1+\epsilon) \leq \epsilon -\epsilon^2/2 + \epsilon^3/3} we have

\displaystyle \begin{array}{rcl} \mathrm{Prob}\left[ \frac{\|f(x_i)-f(x_j)\|^2}{\|x_i-x_j\|^2} \leq (1-\epsilon) \right] & \leq & \exp\left( -\frac{k\left(\epsilon^2 - 2\epsilon^3/3\right)}4 \right) \\ & \leq & \exp\left( -2\log n \right) = \frac1{n^2}. \end{array}

This means that (by union bound)

\displaystyle \mathrm{Prob}\left[ \frac{\|f(x_i)-f(x_j)\|^2}{\|x_i-x_j\|^2} \notin [1-\epsilon,1+\epsilon] \right] \leq \frac2{n^2}.

Since there exist {{n \choose 2}} such pairs, again, a simple union bound gives

\displaystyle \mathrm{Prob}\left[ \exists_{i,j}:\frac{\|f(x_i)-f(x_j)\|^2}{\|x_i-x_j\|^2} \notin [1-\epsilon,1+\epsilon] \right] \leq \frac2{n^2} = 1-\frac1n.

This means that {f} chosen as a properly scaled projection onto a random {k-}dimensional subspace is an {\epsilon-} isometry on {X} (see (1)) with probability at least {\frac1n}. This means we can achieve any desirable constant probability of success by trying {\mathcal{O}(n)} such random projections, meaning we can find an {\epsilon-}isometry in randomized polynomial time.

In fact, by considering {k} slightly larger, we can get a good projection on the first random attempt with very good confidence. It’s trivial to adapt the proof above to obtain:

Lemma 3 For any {0<\epsilon<1}, {\tau>0}, and for any integer {n}, let {k} such that

\displaystyle k \geq (2+\tau)\frac2{\epsilon^2/2 - \epsilon^3/3}\log n.

Then, for any set {X} of {n} points in {\mathbb{R}^d}, take {f:\mathbb{R}^d\rightarrow\mathbb{R}^k} to a suitably scaled projection on a random subspace of dimension {k}, then {f} is an {\epsilon-}isometry for {X} (see (1)) with probability at least {1-\frac1{n^{\tau}}}.

Lemma 3 is quite remarkable. Think about the situation where we are given a high-dimensional data set in a streaming fashion — meaning that we get each data point at a time, consecutively. To run a dimension-reduction technique like Principal component analysis or Diffusion maps we would need to wait until we received the last data point and then compute the dimension reduction map (both PCA and Diffusion Maps are, in some sense, data adaptive). Using Lemma 3 you can just choose a projection at random in the beginning of the process (all we need to have is an estimate of the {\log} of the size of the data set) and just map each point using this projection matrix which can be done online — we don’t need to see the next point to compute the projection of the current data point. Lemma 3 ensures that this (seemingly naive) procedure will, with high probably, not distort the data by more than {\epsilon}.

Fast Johnson-Lindenstrauss

In what follows I will very briefly and very intuitively explain an interesting idea behind a faster version of Johnson-Lindenstrauss proposed in this very nice paper.

Let’s continue thinking about the high-dimensional streaming data. After we draw the random projection matrix, say {M}, for each data point {x}, we still have to compute {Mx} which, since {M} has {\mathcal{O}(\epsilon^{-2}\log(n)d)} entries, has a computational cost of {\mathcal{O}(\epsilon^{-2}\log(n)d)}. In some applications this might be too expensive, can one do better?

It turns out that there is no hope of (significantly) reducing the number of rows (see this paper of Noga Alon). The only hope seems to be to speed up the matrix-vector multiplication. If we were able to construct a sparse matrix {M} then we would definitely speed up the computation of {Mx}. Unfortunately, one can show that a sparse matrix will distort sparse vectors, so if your data set contains sparse vectors {M} will fail. Another option would be to exploit the Fast Fourier Transform and compute the Fourier Transform of {x} (which takes {\mathcal{O}(d\log d)} time) and then multiply the {FT} of {x} by a sparse matrix.. this will again not work because the vector {x} might have sparse Fourier Transform.

The solution comes from leveraging a type of uncertainty principle — not both {x} and the Fourier Transform of {x} can be sparse. The idea is that if, before one takes the Fourier Transform of {x}, one flips (randomly) the signs of {x}, then the probably of obtaining a sparse vector is very small so a sparse matrix can be used for projection. In a nutshell the algorithm has {M} be a matrix of the form {PFD}, where {D} is a diagonal matrix that flips the signs of the vector randomly, {F} is a Fourier Transform and {P} a sparse matrix. This method was proposed and analysed here and, roughly speaking, achieves a complexity of {\mathcal{O}(d\log d)}, instead of the classical {\mathcal{O}(\epsilon^{-2}d\log n)}.

Advertisements

3 thoughts on “The Johnson-Lindenstrauss Lemma

    1. Hi,

      In JL lemma proof, I was wodering what exactly is the value of n? For the two inequalities, different expressions were substituted for n. How is this possible?

      1. Hi Sri Hari,

        Thanks for you question.
        n is the number of points that one is trying to project. I am not sure I understand what do you mean by different expressions, can you point them out for me on the text?

        Thanks!

        Best,
        Afonso

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s