# Boolean inverse problems on graphs and community detection

Last week I gave a talk at the CISS 2014 conference here in Princeton about a new paper with Emmanuel Abbe, Annina Bracher, and Amit Singer. I thought it would be an excellent opportunity to talk about this problem in the blog.

The problem is very simple, given a graph ${G=(V,E)}$ and unknown Boolean ${\{0,1\}}$ variables ${x_i}$ on the vertices we observe edge-variables given by, for ${(i,j)\in E}$, $\displaystyle y_{ij} = x_i\oplus x_j \oplus Z_{ij},$

where ${Z_{ij}}$ are iid random noise terms satisfying $\displaystyle Z_{ij} = \left\{ \begin{array}{ll} 1 & \text{ with probability } 1-\varepsilon \\ -1 & \text{ with probability } \varepsilon. \end{array} \right.$

The objective is to recover the value of the vertex variables ${x_i}$, up to a global flip. This problem is also known as ${\mathbb{Z}_2}$ Synchronization (in fact, I talked about it before in this blog), and it can also be seen as particular instance of community detection. We are interesting in exact recovery of ${x}$, with high probability.

For simplicity we’ll start considering the case on which ${G}$ is an s R\’ is showing that, as long as, $\displaystyle \frac{np}{\log n} \geq 2\frac1{(1-2\varepsilon)^2} + o\left(\frac1{(1-2\varepsilon)^2}\right), \ \ \ \ \ (1)$

exact recovery is possible (not necessarily efficiently), with high probability. Essentially it is shown that the maximum likelihood solution matches the ground truth with high probability.

Remarkably, the condition (1) is essentially tight as we can show that, as long as ${(1-2\varepsilon)\geq n^{-\tau/2}}$ then, for $\displaystyle \frac{np}{\log n} < 2\frac{1-3\tau/2}{(1-2\varepsilon)^2} + o\left(\frac1{(1-2\varepsilon)^2}\right),$

exact recovery is hopeless.

Unfortunately, the maximum likelihood estimator is, in general, NP-hard to compute. As a tractable alternative we propose an approach based in Semidefinite Programming. I have explained how this algorithm works in a past post. Using tools similar to the ones explained previously here we were able to show that, while condition (1) is not sufficient for the SDP approach to give exact recovery, as long as $\displaystyle \frac{np}{\log n} \geq 4\frac1{(1-2\varepsilon)^2} + o\left(\frac1{(1-2\varepsilon)^2}\right), \ \ \ \ \ (2)$

the algorithm based in Semidefinite Programming will, with high probability, achieve exact recovery.

This raises, in my opinion, an extremely interesting question: Is there any polynomial time algorithm that can match the threshold of the maximum likelihood estimator or is it the fact that there is a complexity gap and below a certain threshold (different than (1)) there is no poly-time algorithm that can achieve exact recovery?