A paper of mine (together with Katya Scheinberg and Luis Nunes Vicente) was recently awarded the 2013 INFORMS Optimization Society student paper prize and so I decided to use the occasion to restart my blog by writing a post about this work.
This paper, entitled “Computation of sparse low degree interpolating polynomials and their application to derivative-free optimization”, was the basis of my Master Thesis back in 2010. In this post I will try to briefly explain the intuition and ideas behind the results in the paper, the actual paper is available here.
The framework is unconstrained optimization: one wants to minimize a (sufficiently smooth) function over . In many applications, function evaluations are particularly expensive and one has no access to function derivatives (an example of this is if the goal is to optimize a set of parameters and each evaluation requires an expensive simulation). Motivated by these applications we are interested in optimizing without using its derivatives and using as few function evaluations as possible, this is known as Derivative-Free Optimization.
One class of algorithms used in such instances are model-based trust-region methods. Essentially they operate by iteratively picking a small region, known as the trust-region, (say a ball) and build (through interpolation by sampling ) a model of in , the idea being that is easier to optimize in and that is a reliable model for whose minimizer is an estimate for the minimizer of in . Depending of the location of the minimizer of and its value on the trust-region is updated and the procedure repeated.
A very popular class of models used are the quadratic polynomials, as these are simple and capture curvature of the function (unlike linear models). For the sake of simplicity, let us assume that itself is a quadratic and we want to find a model (to be more rigorous one attempts to build a model with a Taylor-like approximation quality but I want to keep this post as untechnical as possible). Constructing from function evaluations of corresponds to solving a linear system. Let be a basis for the quadratic polynomials of variables (meaning ). We can write and focus on estimating . In fact, a function evaluation of at gives a linear constraint on ,
A sample set of points corresponds to linear constraints on which we represent by an interpolation matrix
In order for (1) to be a determined system one needs function evaluations in each iteration, which is often too expensive.
Most functions that arise from applications have special structures. For example, in the parameter estimation, it is unlikely that, for every pair of parameters, the two parameters are interacting with each other. Pairs of parameters not interacting should correspond to zero joint derivatives. Motivated by this we assume that as a sparse Hessian. This means that, provided we pick a basis such that Hessian sparsity translates into sparsity in , we know that the solution of (1) is sparse.
Exciting research in the last decade in the area of sparse recovery (also known as Compressed Sensing) gives very good understanding of when one is able to recover a sparse vector from an underdetermined linear system. The main contribution of this paper is leveraging these results to estimate (which gives a model ) by using significantly less than measurements.
In a nutshell, the theory of Compressed Sensing tells us that, if satisfies a certain property (known as the Restricted Isometry Property), then the sparse vector can be recovered by minimization (essentially, it is the vector with minimum norm while still satisfying the constraints). Matrices satisfying the Restricted Isometry Property are notably difficult to build (I have spent some time thinking about this myself…) but random constructions are known to yield RIP matrices for matrices as “short and fat” as rows and columns where where is the sparsity of the vector to be recovered.
In fact, provided that the basis satisfies certain properties, a sufficiently large random sample set is known to give an interpolation matrix that, with high probability, satisfies the RIP. In the paper, we manage to build a basis that both inherits sparsity from the Hessian and satisfies the properties needed to yield RIP interpolation matrices. This, together with a particular choice of trust-region and probability measure to sample the random sample set , allows us to show that, with high probability, minimization finds from the linear measurements (1) with as few as
samples, provided that the Hessian of has non-zero entries. Notice that this is considerably less that the samples that would be needed in the classical case.
In general, when is not a quadratic (but still sufficiently smooth) the procedure sketched above gives, with the same number of samples, a model that approximates in essentially as well as the second-order Taylor approximation of .