A paper of mine (together with Katya Scheinberg and Luis Nunes Vicente) was recently awarded the 2013 INFORMS Optimization Society student paper prize and so I decided to use the occasion to restart my blog by writing a post about this work.
This paper, entitled “Computation of sparse low degree interpolating polynomials and their application to derivative-free optimization”, was the basis of my Master's thesis back in 2010. In this post I will try to briefly explain the intuition and ideas behind the results; the actual paper is available here.
The framework is unconstrained optimization: one wants to minimize a (sufficiently smooth) function $f$ over $\mathbb{R}^n$. In many applications, function evaluations are particularly expensive and one has no access to function derivatives (an example of this is when the goal is to optimize a set of parameters and each evaluation requires an expensive simulation). Motivated by these applications, we are interested in optimizing $f$ without using its derivatives and using as few function evaluations as possible; this is known as Derivative-Free Optimization.
One class of algorithms used in such instances is that of model-based trust-region methods. Essentially, they operate by iteratively picking a small region $B$ (say, a ball), known as the trust-region, and building (through interpolation, by sampling $f$) a model $m$ of $f$ in $B$, the idea being that $m$ is easier to optimize in $B$ and that $m$ is a reliable model for $f$, whose minimizer is an estimate for the minimizer of $f$ in $B$. Depending on the location of the minimizer of $m$ and on the value of $f$ there, the trust-region is updated and the procedure repeated.
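To make the loop concrete, here is a minimal Python sketch (not the algorithm from the paper) of a generic model-based trust-region iteration; `build_model` and `minimize_model` are hypothetical callbacks standing in for the interpolation step and the trust-region subproblem solver.

```python
import numpy as np

def trust_region_dfo(f, x0, build_model, minimize_model,
                     delta0=1.0, max_iter=50, eta=0.1):
    """Toy model-based trust-region loop (illustrative sketch only).

    build_model(f, x, delta) is assumed to return a callable model m of f,
    expressed in displacements from x; minimize_model(m, delta) is assumed
    to return an (approximate) minimizer of m over the ball of radius delta.
    """
    x, delta = np.asarray(x0, dtype=float), delta0
    for _ in range(max_iter):
        m = build_model(f, x, delta)        # interpolate f inside B = B(x, delta)
        step = minimize_model(m, delta)     # minimize the model inside the trust-region
        predicted = m(np.zeros_like(x)) - m(step)
        actual = f(x) - f(x + step)
        rho = actual / predicted if predicted > 0 else -np.inf
        if rho >= eta:                      # good agreement: accept step, enlarge B
            x, delta = x + step, 2.0 * delta
        else:                               # poor agreement: reject step, shrink B
            delta *= 0.5
    return x
```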
A very popular class of models is that of quadratic polynomials, as these are simple and capture curvature of the function (unlike linear models). For the sake of simplicity, let us assume that $f$ itself is a quadratic and that we want to find a model $m = f$ (to be more rigorous, one attempts to build a model with a Taylor-like approximation quality, but I want to keep this post as untechnical as possible).
Constructing $m$ from function evaluations of $f$ corresponds to solving a linear system. Let $\phi = \{\phi_1, \ldots, \phi_N\}$ be a basis for the quadratic polynomials of $n$ variables (meaning $N = \frac{(n+1)(n+2)}{2}$). We can write $m(x) = \sum_{i=1}^{N} \alpha_i \phi_i(x)$ and focus on estimating the coefficient vector $\alpha$. In fact, a function evaluation of $f$ at a point $y$ gives a linear constraint on $\alpha$,

$$\sum_{i=1}^{N} \alpha_i \phi_i(y) = f(y).$$

A sample set of $p$ points $Y = \{y^1, \ldots, y^p\}$ corresponds to $p$ linear constraints on $\alpha$, which we represent by an interpolation matrix $M(\phi, Y)$:

$$M(\phi, Y)\,\alpha = f(Y). \qquad (1)$$

In order for (1) to be a determined system one needs $N = \mathcal{O}(n^2)$ function evaluations in each iteration, which is often too expensive.
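For concreteness, here is a small Python sketch of the determined case with $n = 2$ (so $N = 6$): it builds the interpolation matrix in the monomial basis (one possible choice of $\phi$) at $p = N$ random points and solves the square system; the quadratic `f` below is just a made-up example.

```python
import numpy as np

def quad_monomials(x):
    """Monomial basis for quadratics in two variables: 1, x1, x2, x1^2, x1*x2, x2^2."""
    x1, x2 = x
    return np.array([1.0, x1, x2, x1**2, x1 * x2, x2**2])

def f(x):
    # Hypothetical quadratic whose coefficients we hope to recover.
    x1, x2 = x
    return 3.0 + 2.0 * x1 - x2 + 0.5 * x1**2 + 4.0 * x2**2

rng = np.random.default_rng(0)
N = 6                                     # number of basis functions for n = 2
Y = rng.uniform(-1.0, 1.0, size=(N, 2))   # p = N sample points: a determined system

M = np.array([quad_monomials(y) for y in Y])             # interpolation matrix M(phi, Y)
alpha = np.linalg.solve(M, np.array([f(y) for y in Y]))  # coefficients of the model m
print(alpha)
```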
Most functions that arise from applications have special structure. For example, in parameter estimation it is unlikely that, for every pair of parameters, the two parameters interact with each other, and pairs of parameters that do not interact correspond to zero joint (mixed second-order) derivatives. Motivated by this, we assume that $f$ has a sparse Hessian. This means that, provided we pick a basis $\phi$ such that Hessian sparsity translates into sparsity in $\alpha$, we know that the solution of (1) is sparse.
Exciting research in the last decade in the area of sparse recovery (also known as Compressed Sensing) gives a very good understanding of when one is able to recover a sparse vector from an underdetermined linear system. The main contribution of this paper is leveraging these results to estimate $\alpha$ (which gives a model $m$) using significantly fewer than $N$ measurements.
In a nutshell, the theory of Compressed Sensing tells us that, if $M(\phi, Y)$ satisfies a certain property (known as the Restricted Isometry Property, or RIP), then the sparse vector $\alpha$ can be recovered by $\ell_1$ minimization (essentially, one takes the vector with minimum $\ell_1$ norm among those satisfying the constraints). Matrices satisfying the Restricted Isometry Property are notably difficult to build deterministically (I have spent some time thinking about this myself…), but random constructions are known to yield RIP matrices as “short and fat” as $p$ rows and $N$ columns with $p$ on the order of $s \log N$, where $s$ is the sparsity of the vector to be recovered.
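As a toy illustration (not code from the paper), the sketch below solves the $\ell_1$ minimization problem through its standard linear-programming reformulation with SciPy, and recovers a sparse vector from a random Gaussian “short and fat” matrix.

```python
import numpy as np
from scipy.optimize import linprog

def basis_pursuit(M, b):
    """Solve min ||alpha||_1 subject to M @ alpha = b via a standard LP reformulation."""
    p, N = M.shape
    # Split alpha = u - v with u, v >= 0; minimizing sum(u) + sum(v) yields the l1 norm.
    res = linprog(np.ones(2 * N), A_eq=np.hstack([M, -M]), b_eq=b,
                  bounds=[(0, None)] * (2 * N), method="highs")
    u, v = res.x[:N], res.x[N:]
    return u - v

# Toy demo: random Gaussian measurements of a sparse vector.
rng = np.random.default_rng(1)
N, p, s = 100, 40, 5
M = rng.standard_normal((p, N)) / np.sqrt(p)
alpha_true = np.zeros(N)
alpha_true[rng.choice(N, size=s, replace=False)] = rng.standard_normal(s)
alpha_hat = basis_pursuit(M, M @ alpha_true)
print(np.max(np.abs(alpha_hat - alpha_true)))   # close to zero with high probability
```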
In fact, provided that the basis $\phi$ satisfies certain properties, a sufficiently large random sample set $Y$ is known to give an interpolation matrix $M(\phi, Y)$ that, with high probability, satisfies the RIP. In the paper, we manage to build a basis $\phi$ that both inherits sparsity from the Hessian and satisfies the properties needed to yield RIP interpolation matrices. This, together with a particular choice of trust-region and of the probability measure used to draw the random sample set $Y$, allows us to show that, with high probability, $\ell_1$ minimization finds $m$ from the linear measurements (1) with as few as $\mathcal{O}\!\left(n (\log n)^4\right)$ samples, provided that the Hessian of $f$ has $\mathcal{O}(n)$ non-zero entries. Notice that this is considerably less than the $\mathcal{O}(n^2)$ samples that would be needed in the classical (non-sparse) case.
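Putting the two ingredients together, here is a hedged sketch of the idea: sample far fewer than $N$ points, build the interpolation matrix, and recover the sparse coefficient vector by $\ell_1$ minimization. The plain monomial basis and uniform sampling used below are stand-ins with no RIP guarantee; the paper constructs a specific basis, trust-region, and sampling measure for which the guarantee actually holds.

```python
import numpy as np
from itertools import combinations_with_replacement
from scipy.optimize import linprog

def quad_basis(x):
    """Monomial basis {1, x_i, x_i*x_j}: a stand-in for the basis built in the paper."""
    pairs = combinations_with_replacement(range(len(x)), 2)
    return np.array([1.0] + list(x) + [x[i] * x[j] for i, j in pairs])

def basis_pursuit(M, b):
    # Same l1-minimization routine as in the previous sketch.
    p, N = M.shape
    res = linprog(np.ones(2 * N), A_eq=np.hstack([M, -M]), b_eq=b,
                  bounds=[(0, None)] * (2 * N), method="highs")
    return res.x[:N] - res.x[N:]

def f(x):
    # Toy quadratic in n = 10 variables with a very sparse (here diagonal) Hessian.
    return 1.0 + x[0] - 2.0 * x[3] + 0.5 * x[1] ** 2 + 3.0 * x[7] ** 2

rng = np.random.default_rng(2)
n = 10
N = 1 + n + n * (n + 1) // 2             # 66 quadratic basis functions
p = 30                                   # far fewer samples than N
Y = rng.uniform(-1.0, 1.0, size=(p, n))
M = np.array([quad_basis(y) for y in Y])
alpha_hat = basis_pursuit(M, np.array([f(y) for y in Y]))
# In this toy setup l1 minimization typically returns a model using only a handful
# of basis functions, although the monomial basis carries no recovery guarantee.
print(N, int(np.sum(np.abs(alpha_hat) > 1e-6)))
```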
In general, when $f$ is not a quadratic (but still sufficiently smooth) the procedure sketched above gives, with the same number of samples, a model $m$ that approximates $f$ in $B$ essentially as well as the second-order Taylor approximation of $f$.