Diffusion Maps and Clustering on Graphs

It has been a while since I posted here (not counting the announcement I did a few days ago). I have decided to revive the blog from the ashes and start posting with compliance to the Boris law of Blogging (the time elapsed between any two consecutive posts should be never shorter than 1 week nor longer than 2 weeks).

Now the math, this is based on part of a talk I gave last Tuesday on the PACM Graduate Student Seminar.

Suppose we are given a data set we want to understand (I’ll be more precise later). Let’s say we have a notion of affinity between two data points. One popular way of describing this type of data set is through a weighted graph ${G = ( V , E , W)}$ where each vertex ${v\in V}$ represents a data point and, for each edge ${(i,j)\in E}$, the weight ${w_{i,j} = W[i,j] }$ represents the affinity between node i and node j. The matrix W is often called the weighted adjacency matrix of the graph. We make the reasonable assumption of symmetry in the weights, ${w_{ij} = w_{ji}}$.

We further assume that: ${d_i = \sum_{j=1}^nw_{ij} = 1}$ and ${w_{ii}\geq \frac12}$. These assumptions are not needed but they clean up the math below.

As you might have already guessed by now, the reason why I want the degree of each vertex to be 1 is because I want to define a random walk on it. Indeed, a random walk is a very natural process to define on a graph and its properties usually give meaningful understanding about the structure of the graph itself, this post is one such example. A natural random walk to define in our graph G is simply one that, given the vertex i as its position, walks to vertex j with probability ${w_{ij}}$. This means that ${ P(x_1 = j| x_0 = i) = w_{ij} }$. The matrix W gives us a easy way of computing the probability distribution after t steps of the walk: ${ P(x_t = j| x_0 = i) = W^t[i,j] }$.

We are now interested in defining a meaningful distance between nodes. One could use the inverse of the affinity or something of that sort but it might fail in two ways: It wouldn’t be robust to errors on the measurement of the affinity or lack of such measurements, very often we only have a notion of affinity of nearby data points. Another issue is that this distance might not be meaningful, in fact if you think of a graph composed of two star graphs sharing all the leaf nodes then both centers cannot be very distant (although having no edge between them, which would assign weight zero and infinite distance). Another option is what is called Diffusion Distance: Given a time t we look at the probability cloud of a random walk starting at i and one starting at j and we see how different the probability clouds are (the assumption that ${w_{ii}\geq\frac12}$ makes the random walk “lazy” enough so that parity and things like it don’t play so much of a role and the random walk is mostly governed by the geometry the graph, think of what would happen in a bipartite graph if ${w_{ii} = 0}$).

We define, given t, the Diffusion distance between i and j as

$\displaystyle \mathrm{DD}(i,j)=\|W^t[i,.]-W^t[j,.]\|_2^2=\sum_{k=1}^n\left(W^t[i,k]-W^t[j,k]\right)^2$

In order to compute this we are going to define the Laplacian of G, which is given by ${L = I - W}$ (if we were not assuming that the graph is regular with degree 1 the definition would be ${L=D-W}$ where ${D}$ is the degree matrix). ${L}$ is a semi-definite positive matrix with eigenvalues ${0=\lambda_1\leq \lambda_2 \leq \cdots \lambda_n}$, due to our assumption of laziness of the random walk ${\lambda_n\leq 1}$. The fact that this matrix is called the Laplacian is not a coincidence but I won’t discuss that here, very briefly under certain conditions this matrix approximates the Laplacian we all know and love (or more general the Laplace–Beltrami operator on manifolds). With respect to these eigenvalues the spectral decomposition of W is given by ${\Phi \Lambda \Phi^T}$ where the columns of ${\Phi}$ are the eigenvectors ${\{\phi_1,\phi_2, \dots, \phi_n \}}$ of L and ${\Lambda}$ is a diagonal matrix with ${1-\lambda_l}$ on its l’th diagonal.

We thus have:

$\displaystyle \begin{array}{rcl} \mathrm{DD}(i,j) & = & \sum_{k=1}^n\left( \sum_{l=1}^n (\phi_l(i) - \phi_l(j)) (1 - \lambda_l)^t \phi_l(k) \right)^2 \\ & = &\sum_{k=1}^n\sum_{l,r} (\phi_l(i) - \phi_l(j)) (\phi_r(i) - \phi_r(j)) (1 - \lambda_l)^t(1 - \lambda_r)^t \phi_l(k)\phi_r(k) \\ & = &\sum_{l,r} (\phi_l(i) - \phi_l(j)) (\phi_r(i) - \phi_r(j)) (1 - \lambda_l)^t(1 - \lambda_r)^t \left(\sum_{k=1}^n\phi_l(k)\phi_r(k)\right) \\ & = & \sum_{r=1}^n (1 - \lambda_r)^{2t} (\phi_r(i) - \phi_r(j))^2\\ & = & \sum_{r=2}^n (1 - \lambda_r)^{2t} (\phi_r(i) - \phi_r(j))^2, \end{array}$

where the next to last equality comes from the orthogonality of the eigenvectors and the last one comes from the fact that ${\phi_1}$ is the all ones vector.

Note that, if we consider the embedding ${\mathrm{DM}_t:V \rightarrow \mathbb{R}^{n-1}}$ given by

$\displaystyle \mathrm{DM}_t(i) = \left[\begin{array}{c}(1 - \lambda_2)^t \phi_2(i) \\ \vdots \\ (1 - \lambda_1)^t \phi_n(i)\end{array}\right].$

we have that the Diffusion Distance between node ${i}$ and node ${j}$ is simply the ${\ell^2}$ distance on this embedding. This is a strong argument towards the meaningfulness of this representation of the graph. The caveat is that one needs ${n-1}$ dimensions to represent the graph.

However, as ${(1-\lambda_r)}$ is decreasing from ${1}$ to ${0}$, when ${r}$ is sufficiently large that dimension on the embedding starts to be negligible. This suggests that truncating the map is good way to obtain a embedding on a lower dimensional space. Given ${k}$ The following embedding is known as truncated Diffusion Maps and was introduced and analyzed in Coifman and Lafon (2006)

$\displaystyle \mathrm{DM}^k_t(i) = \left[\begin{array}{c}(1 - \lambda_2)^t \phi_2(i) \\ \vdots \\ (1 - \lambda_{k+1})^t \phi_{k+1}(i) \end{array}\right] \in \mathbb{R}^k.$

Since the dependency on ${t}$ is a bit inconvenient and it only corresponds to stretching some of the dimensions, we consider the following embedding, known as Laplacian Eigenmaps see Belkin and Niyogi 2003, ${g:V\rightarrow\mathbb{R}^k}$ given by,

$\displaystyle g_k(i) = \left[\begin{array}{c}\phi_2(i) \\ \vdots \\ \phi_{k+1}(i) \end{array}\right] \in \mathbb{R}^k.$

Let’s say we are interested in clustering the graph. More specifically we want to split the graph in two clusters in such a way such that each cluster has roughly the same volume and there are as least as possible edges across the clusters. If the embedding ${g_1}$ is (sufficiently) faithful to the graph structure it would make sense to embed the graph in ${\mathbb{R}^1}$ and then cluster in ${\mathbb{R}^1}$, since that is significantly easier. This motivates an algorithm for clustering:

Algorithm for Clustering:

Do the embedding ${g_1}$, “choose” ${t\in\mathbb{R}}$ and then output two clusters as

$\displaystyle V_1 = \{i\in V : g_1(i) \leq t \} \text{ and } V_2 = \{i\in V : g_1(i) > t \}.$

In fact, one can show that there exists at least one ${t\in\mathbb{R}}$ that gives a “good” clustering (note that it is computationally tracktable to test all values of $t$ that give different clusters). This non-trivial fact is a consequence of the Cheeger’s inequality that was first shown by Noga Alon in 1986. My next post will be devoted to this inequality and to Spectral Clustering through another viewpoint.