Today’s post is concerned with the clustering problem. Suppose one has a data set and wants to split it into two “meaningful” clusters. We will show how the eigenvectors of the graph Laplacian can provide such a clustering, yielding an efficient clustering method. The result that guarantees the performance of this method is inspired by an isoperimetric inequality proved in the late 60’s. This post is closely related to my previous post about Spectral Clustering and Diffusion Maps.
The setting is the same: consider a weighted graph $G = (V, E, W)$, with $|V| = n$. Let us assume, as we did before, that the weights are symmetric, $w_{ij} = w_{ji}$, and that $\deg(i) := \sum_{j} w_{ij} = 1$ for every vertex $i$. Again, these assumptions are not needed, but they slightly clean up the math below.
Recall that we are interested in splitting the graph into two subsets $S$ and $S^c$ such that there are very few connections between the two sets. We make this precise by asking for a cut that separates the graph into two parts, $S$ and $S^c$, such that the sum of the weights of the edges between $S$ and $S^c$ (we call this $\mathrm{cut}(S) = \sum_{i \in S, j \in S^c} w_{ij}$) is small. This is, however, not yet a good notion of a good cut: the reason is that we could simply make $S$ the whole graph or the empty set and the cut would be zero. To remedy this we define the Cheeger cut as
$$\mathrm{Ccut}(S) = \frac{\mathrm{cut}(S)}{\min\{|S|, |S^c|\}},$$
and the Cheeger constant as $h_G = \min_{S \subset V} \mathrm{Ccut}(S)$.
Instead of directly analyzing the Cheeger cut we are going to analyze the normalized cut, defined as
$$\mathrm{Ncut}(S) = \mathrm{cut}(S)\left(\frac{1}{|S|} + \frac{1}{|S^c|}\right),$$
where $|S|$ is the number of vertices in $S$. It is easy to check that
$$\mathrm{Ccut}(S) \leq \mathrm{Ncut}(S) \leq 2\,\mathrm{Ccut}(S).$$
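To make these definitions concrete, here is a small sanity check in Python/NumPy (my own illustrative sketch, not from the original post; the toy graph and helper names are made up for this example): it computes $\mathrm{cut}(S)$, $\mathrm{Ccut}(S)$ and $\mathrm{Ncut}(S)$ on a tiny weighted graph and verifies the inequality above by brute force over all subsets.

```python
import itertools
import numpy as np

def cut(W, S):
    """Total weight of the edges between S and its complement."""
    n = W.shape[0]
    Sc = [i for i in range(n) if i not in S]
    return W[np.ix_(S, Sc)].sum()

def cheeger_cut(W, S):
    return cut(W, S) / min(len(S), W.shape[0] - len(S))

def ncut(W, S):
    return cut(W, S) * (1.0 / len(S) + 1.0 / (W.shape[0] - len(S)))

# toy graph: two triangles (vertices 0-2 and 3-5) joined by one weak edge
W = np.zeros((6, 6))
for i, j, w in [(0, 1, 1), (0, 2, 1), (1, 2, 1),
                (3, 4, 1), (3, 5, 1), (4, 5, 1), (2, 3, 0.1)]:
    W[i, j] = W[j, i] = w

subsets = [list(S) for k in range(1, 6) for S in itertools.combinations(range(6), k)]
assert all(cheeger_cut(W, S) <= ncut(W, S) <= 2 * cheeger_cut(W, S) + 1e-12
           for S in subsets)
print("h_G =", min(cheeger_cut(W, S) for S in subsets))   # attained by S = {0, 1, 2}
```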
A clever trick allows us to pose this problem as a quadratic form minimization. Let $S \subset V$ and define the vector $f_S \in \mathbb{R}^n$ given by
$$(f_S)_i = \begin{cases} \sqrt{\frac{|S^c|}{|S|}} & \text{if } i \in S, \\ -\sqrt{\frac{|S|}{|S^c|}} & \text{if } i \in S^c. \end{cases}$$
Now, the quadratic form
$$\frac{1}{2}\sum_{i,j} w_{ij}\left((f_S)_i - (f_S)_j\right)^2 = \mathrm{cut}(S)\left(\sqrt{\frac{|S^c|}{|S|}} + \sqrt{\frac{|S|}{|S^c|}}\right)^2 = n\,\mathrm{Ncut}(S).$$
We can therefore pose the problem of minimizing $\mathrm{Ncut}(S)$ as finding the minimum of this quadratic form over all vectors $f$ that are of the form $f = f_S$ for some $S \subset V$. We note that $\mathbf{1}^T f_S = 0$ and $\|f_S\|^2 = n$, for all $S \subset V$.
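As a quick numerical check of these facts (again just an illustrative sketch, reusing the toy matrix W and the helper ncut from the previous snippet):

```python
import numpy as np

def f_vector(n, S):
    """The two-valued vector f_S defined above."""
    s, sc = len(S), n - len(S)
    f = np.full(n, -np.sqrt(s / sc))
    f[list(S)] = np.sqrt(sc / s)
    return f

S = [0, 1, 2]                            # one of the two triangles
f = f_vector(6, S)
quad = 0.5 * sum(W[i, j] * (f[i] - f[j]) ** 2
                 for i in range(6) for j in range(6))
print(np.isclose(f.sum(), 0.0))          # f_S is orthogonal to the all-ones vector
print(np.isclose((f ** 2).sum(), 6.0))   # ||f_S||^2 = n
print(np.isclose(quad, 6 * ncut(W, S)))  # quadratic form equals n * Ncut(S)
```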
In the previous post we defined the graph Laplacian as $L = D - W$, where $W$ was the adjacency matrix (containing the weights $w_{ij}$) and $D$ the diagonal degree matrix. It turns out that the Rayleigh quotient of this matrix gives the quadratic form used above (this can be seen as a discrete analogue of an integration by parts formula for the continuous Laplacian):
$$\frac{f^T L f}{f^T f} = \frac{\frac{1}{2}\sum_{i,j} w_{ij}\left(f_i - f_j\right)^2}{\sum_i f_i^2}.$$
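Sticking with the same toy matrix $W$, one can check this identity on a random vector (illustrative sketch only):

```python
import numpy as np

L = np.diag(W.sum(axis=1)) - W      # graph Laplacian L = D - W (W from the snippet above)
f = np.random.default_rng(0).standard_normal(6)
quad = 0.5 * sum(W[i, j] * (f[i] - f[j]) ** 2
                 for i in range(6) for j in range(6))
print(np.isclose(f @ L @ f, quad))  # the discrete "integration by parts" identity
```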
We are thus interested in minimizing this Rayleigh quotient, which is obviously minimized by the first eigenvector of $L$. It is easy to see that $L$ has an eigenvector in its null space, the all-ones vector $\mathbf{1}$, which is very different from any $f_S$, given that we just showed that any $f_S$ is orthogonal to it. Let us then look at minimizing the Rayleigh quotient over the vectors orthogonal to $\mathbf{1}$; this space of vectors contains all vectors of the form $f_S$, which implies that this problem is a relaxation of the normalized cut minimization problem.
It is a basic fact in linear algebra that the minimizer of $\frac{f^T L f}{f^T f}$ subject to $f \perp \mathbf{1}$ is the second eigenvector of $L$, $\varphi_2$, and the minimum is its second eigenvalue $\lambda_2$. The observation that this problem is a relaxation of the $\mathrm{Ncut}$ minimization problem already gives
$$\lambda_2 \leq \min_{S \subset V} \mathrm{Ncut}(S) \leq 2\, h_G,$$
which is one side of Cheeger’s inequality. The question now is how tight this relaxation is. It turns out that, if one minimizes the Rayleigh quotient and then performs a clever rounding procedure on the eigenvector (to make it two-valued so that it defines two clusters), one gets a partitioning of the graph whose Cheeger cut is less than $\sqrt{2\lambda_2}$; this automatically implies the hard part of Cheeger’s inequality and shows that
Theorem 1 (Cheeger’s inequality)
$$\frac{\lambda_2}{2} \leq h_G \leq \sqrt{2\lambda_2}.$$
This result serves as a performance guarantee for the clustering algorithm proposed in the previous post.
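To see the whole pipeline in action, here is a sketch (my own, not the algorithm exactly as stated in the previous post) on a small graph normalized so that $\deg(i) = 1$: two 4-cliques joined by a matching. It computes $\lambda_2$ and the second eigenvector of $L$, rounds the eigenvector with a sweep cut (threshold at every level along the sorted eigenvector and keep the best Cheeger cut), and checks both sides of Theorem 1 numerically.

```python
import itertools
import numpy as np

# two 4-cliques joined by a perfect matching, all weights 1/4, so deg(i) = 1
n = 8
W = np.zeros((n, n))
for i in range(4):
    for j in range(4):
        if i != j:
            W[i, j] = W[i + 4, j + 4] = 0.25   # edges inside each clique
    W[i, i + 4] = W[i + 4, i] = 0.25           # matching edges between the cliques
assert np.allclose(W.sum(axis=1), 1.0)

L = np.diag(W.sum(axis=1)) - W
eigvals, eigvecs = np.linalg.eigh(L)
lam2, phi2 = eigvals[1], eigvecs[:, 1]

def cheeger_cut(S):
    Sc = [i for i in range(n) if i not in S]
    return W[np.ix_(list(S), Sc)].sum() / min(len(S), n - len(S))

# sweep-cut rounding: sort the vertices by phi2 and try every prefix as the cluster S
order = np.argsort(phi2)
sweep = min(cheeger_cut(order[:k]) for k in range(1, n))

# brute-force Cheeger constant (the graph is tiny, so this is feasible)
h_G = min(cheeger_cut(S) for k in range(1, n)
          for S in itertools.combinations(range(n), k))

print(lam2 / 2 <= h_G + 1e-9)                    # easy direction of Theorem 1
print(h_G <= sweep <= np.sqrt(2 * lam2) + 1e-9)  # the rounded cut satisfies the hard direction
print(lam2, h_G, sweep)                          # approximately 0.5, 0.25, 0.25 for this graph
```

On this symmetric example the sweep cut recovers the two cliques exactly and attains $h_G$; in general the rounding is only guaranteed to produce a set whose Cheeger cut is at most $\sqrt{2\lambda_2}$.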
I will not prove this inequality in this post (the post is already a bit long and I would rather keep my posts on the shorter side). The result was originally shown here. I would, however, recommend the reader to take a look at a very nice proof that can be found in Luca Trevisan’s blog: in theory. Very roughly, the idea is to perform a random rounding (with a very clever probability distribution) that will in expectation perform better than $\sqrt{2\lambda_2}$, implying that there must exist at least one instance for which such performance is achieved (this kind of argument is often referred to as the probabilistic method).
It is interesting to note that, before this result was known for graphs, it was shown in the context of Riemannian geometry by Cheeger here.
That choice for $f_S$ is magic!
Hi Dustin,
Thanks for the question!
It looks like magic but it is not quite: you know you want the vector $f_S$ to be orthogonal to the all-ones vector and, since it is supposed to be a proxy for a partitioning, it should be bi-valued. These two requirements already give you the vector up to scaling, and then you just pick the scaling that makes it look nicer =)
Best,
Afonso