Exact Recovery in the Stochastic Block Model

Emmanuel Abbe, Georgina Hall, and myself just uploaded a paper on exact recovery in the stochastic block model (SBM), this post will be about this result.

The stochastic block model (SBM) with two communities (or equivalently the planted partition model) is perhaps the simplest model of random graph exhibiting a cluster behavior. Here we’ll look at the very simple form of the model where the graph has two equally sized clusters and vertices connect with probability ${p}$ within clusters and ${q across clusters.

Recently, there has been a lot of interest in understanding what happens for the regime when ${a=pn}$ and ${b=qn}$ are constant. A fascinating phase transition phenomena was conjectured by Decelle et al.: recovering slightly more than 50% of the vertices (i.e., beating a random guess) is possible with high probability (w.h.p.) if and only if

$\displaystyle (a-b)^2 > 2(a+b).$

This was proven, roughly at the same time, by Massoulié and Mossel et al., the impossibility result was first shown by Mossel et al. here.

Motivated by this line work, we asked ourselves the natural question of determining whether the problem of recovering all of the vertices (not just over 50%) also admits a sharp threshold phenomenon, we call this problem exact recovery.

Notice that exact recovery is impossible at the regime above as an Erd\H{o}s-Rényi random graph needs at least a logarithmic expected degree in order to be connected (with high probability). In fact, for the Erd\H{o}s-Rényi ensemble ${G(n,p)}$, the giant component (the existence of a connected component of linear size) has a sharp threshold at ${p=\frac{1}{n}}$, while connectivity has a sharp threshold at ${p=\frac{\log(n)}{n}}$. To hope for exact recovery, we hence consider the regime of probabilities scaling like ${p=\frac{\alpha \log(n)}{n}}$ and ${q=\frac{\beta \log(n)}{n}}$, for constants ${\alpha,\beta>0}$. In fact, one expects at least ${\alpha + \beta > 2}$ in order to have a connected graph with high probability.

We showed that indeed this problem has a sharp threshold phenomenon. More, precisely if ${\alpha=pn/\log(n)}$ and ${\beta=qn/\log(n)}$ are constant (and ${\alpha>\beta}$) we showed that, if

$\displaystyle \frac{\alpha+\beta}{2} - \sqrt{\alpha \beta}<1,$

then it is impossible to perform exact recovery (with high probability). On the other hand, if

$\displaystyle \frac{\alpha+\beta}{2} - \sqrt{\alpha \beta}>1,$

then exact recovery is possible (with high probability), with a non-efficient algorithm. Interestingly ${\frac{\alpha + \beta}{2} - \sqrt{\alpha \beta }>1}$ is equivalent to

$\displaystyle (\alpha - \beta)^2 > 4(\alpha + \beta) - 4,$

and ${\alpha+\beta \geq 2}$.

Both results are obtained by analyzing the maximum likelihood estimator (similarly to a previous paper that I discussed in this blog post). Interestingly these results require sharp tail bounds for differences of binomial coefficients and the “right tail bound” is quite different from the one that the central limit theorem would suggest, I will write another blog post soon about these bounds, it might be that someone else also finds them useful.

Using an adaptation of the dual certificate techniques described in this post we were also able to show that a certain natural semidefinite programming based algorithm achieves exact recovery (with high probability) as long as

$\displaystyle (\alpha - \beta)^2 > 8(\alpha + \beta)+ \frac83 (\alpha - \beta).$

It remains to propose an algorithm that is able to perform exact recovery all the way down to the threshold. Ideas?