Computing sparse quadratic models in DFO

A paper of mine (joint with Katya Scheinberg and Luis Nunes Vicente) was recently awarded the 2013 INFORMS Optimization Society student paper prize, and so I decided to use the occasion to restart my blog by writing a post about this work.

This paper, entitled “Computation of sparse low degree interpolating polynomials and their application to derivative-free optimization”, was the basis of my Master's thesis back in 2010. In this post I will try to briefly explain the intuition and ideas behind the results; the paper itself is available here.

The framework is unconstrained optimization: one wants to minimize a (sufficiently smooth) function ${f:\mathbb{R}^n\rightarrow\mathbb{R}}$ over ${\mathbb{R}^n}$. In many applications, function evaluations are particularly expensive and one has no access to function derivatives (for example, when the goal is to optimize a set of parameters and each evaluation requires an expensive simulation). Motivated by these applications, we are interested in optimizing ${f}$ without using its derivatives and with as few function evaluations as possible; this is known as Derivative-Free Optimization.

One class of algorithms used in such instances is that of model-based trust-region methods. Essentially, they operate by iteratively picking a small region ${B\subset\mathbb{R}^n}$ (say a ball), known as the trust region, and building (through interpolation, by sampling ${f}$) a model ${m:B\rightarrow\mathbb{R}}$ of ${f}$ in ${B}$. The idea is that ${m}$ is easier to optimize in ${B}$ and is a reliable enough model for ${f}$ that its minimizer serves as an estimate for the minimizer of ${f}$ in ${B}$. Depending on the location of the minimizer of ${m}$ and the value of ${f}$ there, the trust region is updated and the procedure repeated.

A very popular class of models is that of quadratic polynomials, as these are simple and capture the curvature of the function (unlike linear models). For the sake of simplicity, let us assume that ${f}$ itself is a quadratic and that we want to find a model ${m=f}$ (to be more rigorous, one attempts to build a model with Taylor-like approximation quality, but I want to keep this post as untechnical as possible). Constructing ${m}$ from ${p}$ function evaluations of ${f}$ corresponds to solving a linear system. Let ${\phi_1,\dots,\phi_N}$ be a basis for the quadratic polynomials in ${n}$ variables (meaning ${N = \frac{(n+3)n+2}{2}}$). We can write ${m = \sum_{i=1}^N\alpha_i\phi_i}$ and focus on estimating ${\{\alpha_i\}_{i=1}^N}$. Indeed, a function evaluation of ${f}$ at ${y_j}$ gives a linear constraint on ${\alpha}$,

$\displaystyle \sum_{i=1}^N\alpha_i\phi_i(y_j) = f(y_j).$

A sample set ${Y}$ of ${p}$ points corresponds to ${p}$ linear constraints on ${\alpha}$, which we represent by an interpolation matrix ${M(\phi,Y)}$:

$\displaystyle M(\phi,Y)\alpha = f(Y). \ \ \ \ \ (1)$

In order for (1) to be a determined system, one needs ${p\geq N= \frac{(n+3)n+2}{2}}$ function evaluations in each iteration, which is often too expensive.
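To make the classical (determined) case concrete, here is a small sketch of how the system (1) can be assembled and solved, using the standard monomial basis for quadratics; the helper names (`quad_basis`, `interpolation_matrix`) and the toy ${f}$ are my own, not from the paper.

```python
import numpy as np
from itertools import combinations_with_replacement

def quad_basis(n):
    """Monomial basis for quadratics in n variables:
    1, x_i, and x_i * x_j (i <= j); N = (n+1)(n+2)/2 functions."""
    funcs = [lambda x: 1.0]
    for i in range(n):
        funcs.append(lambda x, i=i: x[i])
    for i, j in combinations_with_replacement(range(n), 2):
        funcs.append(lambda x, i=i, j=j: x[i] * x[j])
    return funcs

def interpolation_matrix(phi, Y):
    """M(phi, Y): row j holds the basis functions evaluated at y_j."""
    return np.array([[p(y) for p in phi] for y in Y])

n = 3
phi = quad_basis(n)
N = len(phi)                      # (n+1)(n+2)/2 = 10 for n = 3
rng = np.random.default_rng(0)

# A toy quadratic f to interpolate exactly.
A = rng.standard_normal((n, n)); A = (A + A.T) / 2
b = rng.standard_normal(n)
f = lambda x: x @ A @ x + b @ x + 1.0

Y = rng.standard_normal((N, n))   # p = N sample points -> determined system
M = interpolation_matrix(phi, Y)
alpha = np.linalg.solve(M, np.array([f(y) for y in Y]))

# Since f is itself a quadratic, the model reproduces f at a fresh point.
x = rng.standard_normal(n)
m = sum(a * p(x) for a, p in zip(alpha, phi))
assert np.isclose(m, f(x))
```

With random sample points the matrix ${M(\phi,Y)}$ is nonsingular almost surely, so the determined system pins down ${\alpha}$ exactly; the cost is the ${N=\mathcal{O}(n^2)}$ evaluations of ${f}$.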

Most functions that arise from applications have special structure. In parameter estimation, for example, it is unlikely that every pair of parameters interacts; pairs of parameters that do not interact correspond to zero joint derivatives. Motivated by this, we assume that ${f}$ has a sparse Hessian. This means that, provided we pick a basis ${\{\phi_i\}_{i=1}^N}$ such that Hessian sparsity translates into sparsity in ${\alpha}$, we know that the solution of (1) is sparse.

Exciting research over the last decade in the area of sparse recovery (also known as Compressed Sensing) gives a very good understanding of when one is able to recover a sparse vector ${\alpha}$ from an underdetermined linear system. The main contribution of this paper is leveraging these results to estimate ${\alpha}$ (which gives a model ${m}$) using significantly fewer than ${N=\mathcal{O}(n^2)}$ measurements.

In a nutshell, the theory of Compressed Sensing tells us that, if ${M(\phi,Y)}$ satisfies a certain property (known as the Restricted Isometry Property), then the sparse vector ${\alpha}$ can be recovered by ${\ell_1}$ minimization (essentially, it is the vector with minimum ${\ell_1}$ norm among all those satisfying the constraints). Matrices satisfying the Restricted Isometry Property are notably difficult to build deterministically (I have spent some time thinking about this myself…), but random constructions are known to yield RIP matrices as “short and fat” as ${p}$ rows by ${N}$ columns with ${p=\mathcal{O}(k\log N)}$, where ${k}$ is the sparsity of the vector to be recovered.
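As a self-contained numerical illustration of this recovery phenomenon (detached from the interpolation setting; the dimensions here are arbitrary choices of mine), ${\ell_1}$ minimization can be cast as a linear program and solved with off-the-shelf tools. With a random Gaussian measurement matrix, which satisfies the RIP with high probability at this scaling, a ${k}$-sparse vector is recovered exactly from ${p \ll N}$ measurements:

```python
import numpy as np
from scipy.optimize import linprog

rng = np.random.default_rng(1)
N, k = 200, 5                      # ambient dimension, sparsity
p = 60                             # p ~ O(k log N) measurements, p << N

# Sparse ground truth and a random Gaussian measurement matrix.
alpha = np.zeros(N)
support = rng.choice(N, size=k, replace=False)
alpha[support] = rng.standard_normal(k)
M = rng.standard_normal((p, N)) / np.sqrt(p)
y = M @ alpha

# l1 minimization as an LP:  min sum(t)  s.t.  -t <= a <= t,  M a = y.
c = np.concatenate([np.zeros(N), np.ones(N)])
A_ub = np.block([[np.eye(N), -np.eye(N)], [-np.eye(N), -np.eye(N)]])
b_ub = np.zeros(2 * N)
A_eq = np.hstack([M, np.zeros((p, N))])
res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=y,
              bounds=[(None, None)] * N + [(0, None)] * N)
alpha_hat = res.x[:N]
assert np.allclose(alpha_hat, alpha, atol=1e-5)
```

The splitting trick (auxiliary variables ${t_i \geq |\alpha_i|}$ whose sum is minimized) is the standard way to turn the non-smooth ${\ell_1}$ objective into a linear one; in practice one would use a dedicated basis pursuit solver instead.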

In fact, provided that the basis ${\{\phi_i\}_{i=1}^N}$ satisfies certain properties, a sufficiently large random sample set ${Y}$ is known to give an interpolation matrix ${M(\phi,Y)}$ that, with high probability, satisfies the RIP. In the paper, we manage to build a basis ${\{\phi_i\}_{i=1}^N}$ that both inherits sparsity from the Hessian and satisfies the properties needed to yield RIP interpolation matrices. This, together with a particular choice of trust region and of the probability measure from which the sample set ${Y}$ is drawn, allows us to show that, with high probability, ${\ell_1}$ minimization finds ${\alpha}$ from the linear measurements (1) with as few as

$\displaystyle p = \mathcal{O}(n\log^4(n))$

samples, provided that the Hessian of ${f}$ has ${\mathcal{O}(n)}$ non-zero entries. Notice that this is considerably less than the ${\mathcal{O}(n^2)}$ samples that would be needed in the classical case.

In general, when ${f}$ is not a quadratic (but still sufficiently smooth) the procedure sketched above gives, with the same number of samples, a model ${m}$ that approximates ${f}$ in ${B}$ essentially as well as the second-order Taylor approximation of ${f}$.