10 Lectures and 42 Open Problems in Mathematics of Data Science

This upcoming fall, I am teaching a special topics course in the Math Department at MIT, called Topics in Mathematics of Data Science. This will be a mostly self-contained, research-oriented course focusing on the theoretical aspects of algorithms that aim to extract information from data.

I have divided the content of the class into ten topics (or “lectures”), which I describe below. Perhaps the biggest novelty is that I have decided to present a number of open problems in each of these lectures. Given that this list of problems (and their descriptions) may be of interest to the readers of this blog, I plan to include short versions of the lecture notes as blog posts (linking to the full lecture notes), describing a total of forty open problems over the course of ten future posts. I am hoping interesting discussions about some of these problems will arise in the comments on those posts!

This “post zero” serves as an announcement for the class (if you are a student at MIT, think about taking it!) and as a warm-up for the open problems; I am including two below. But first, the content of the class:


  1. Principal Component Analysis (PCA) and some random matrix theory that will be used to understand the performance of PCA in high dimensions, through spike models.
  2. Manifold Learning and Diffusion Maps: a nonlinear dimension-reduction tool and an alternative to PCA. Semisupervised Learning and its relation to the Sobolev Embedding Theorem.
  3. Spectral Clustering and a guarantee for its performance: Cheeger’s inequality.
  4. Concentration of Measure and tail bounds in probability, both for scalar variables and matrix variables.
  5. Dimension reduction through the Johnson-Lindenstrauss Lemma and Gordon’s Escape Through a Mesh Theorem.
  6. Compressed Sensing/Sparse Recovery, Matrix Completion, etc. If time permits, I will present Number Theory-inspired constructions of measurement matrices.
  7. Group Testing. Here we will use combinatorial tools to establish lower bounds on testing procedures and, if there is time, I might give a crash course on Error-correcting codes and show how they can be used in group testing.
  8. Approximation algorithms in Theoretical Computer Science and the Max-Cut problem.
  9. Clustering on random graphs: Stochastic Block Model. Basics of duality in optimization.
  10. Synchronization, inverse problems on graphs, and estimation of unknown variables from pairwise ratios on compact groups.


Now to the open problems!


Open Problem 0.1.

Komlós Conjecture (see here for a description in the lecture notes)

We start with a fascinating problem in Discrepancy Theory.

Given {n}, let {K(n)} denote the infimum over all real numbers such that: for every set of {n} vectors {u_1,\dots,u_n\in\mathbb{R}^n} satisfying {\|u_i\|_2\leq 1}, there exist signs {\epsilon_i = \pm1} such that

\displaystyle \| \epsilon_1u_1+\epsilon_2u_2+\cdots + \epsilon_nu_n \|_\infty \leq K(n).

The conjecture is that there exists a universal constant {K} such that {K(n)\leq K} for all {n}.
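
To make the definition of {K(n)} concrete, here is a minimal numerical sketch (not from the lecture notes; it assumes NumPy, and the function name is my own) that brute-forces the best signing for small random instances. Since {K(n)} is a supremum over all instances of unit vectors, each sampled instance only certifies a lower estimate.

```python
import itertools

import numpy as np

def best_signing_inf_norm(U):
    """Brute-force the signs eps_i = +/-1 minimizing
    ||eps_1 u_1 + ... + eps_n u_n||_inf, for the columns u_i of U."""
    n = U.shape[1]
    best = np.inf
    for signs in itertools.product((1.0, -1.0), repeat=n):
        best = min(best, np.max(np.abs(U @ np.array(signs))))
    return best

# K(n) is a supremum over all instances of unit vectors, so each
# sampled instance only certifies a lower bound on K(n).
rng = np.random.default_rng(0)
for n in (2, 4, 6, 8):
    worst = 0.0
    for _ in range(200):
        U = rng.standard_normal((n, n))
        U /= np.linalg.norm(U, axis=0)  # normalize columns to unit length
        worst = max(worst, best_signing_inf_norm(U))
    print(f"n = {n}: empirical lower estimate of K(n) = {worst:.3f}")
```

The conjecture predicts that these values stay bounded by a universal constant {K} as {n} grows.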

An early reference for this conjecture is a book by Joel Spencer. The conjecture is tightly connected to Spencer’s famous Six Standard Deviations Suffice Theorem. Later in the course we will study semidefinite programming relaxations; recently it was shown that a certain semidefinite relaxation of this conjecture holds, and the same paper also gives a good account of partial progress on the conjecture.

Open Problem 0.2.

Matrix AM-GM inequality (see here for a description in the lecture notes)

We move now to an interesting generalization of the arithmetic-geometric mean inequality, which has applications to understanding the difference in performance between with- and without-replacement sampling in certain randomized algorithms (see this paper by Ben Recht and Christopher Ré).

For any collection of {d\times d} positive semidefinite matrices {A_1,\cdots,A_n}, the following are conjectured to hold:

(a)

\displaystyle \left\| \frac1{n!}\sum_{\sigma\in \mathrm{Sym}(n)} \prod_{j=1}^n A_{\sigma(j)} \right\| \leq \left\| \frac1{n^n}\sum_{k_1,\dots,k_n = 1}^n \prod_{j=1}^n A_{k_j} \right\|,

and

(b)

\displaystyle \frac1{n!}\sum_{\sigma\in \mathrm{Sym}(n)} \left\|\prod_{j=1}^n A_{\sigma(j)} \right\| \leq \frac1{n^n}\sum_{k_1,\dots,k_n = 1}^n \left\|\prod_{j=1}^n A_{k_j} \right\|,

where {\mathrm{Sym}(n)} denotes the group of permutations of {n} elements, and {\|\cdot\|} the spectral norm.

Morally, these conjectures state that averages of products of matrices with repetitions are larger than averages of products without repetitions. For more details on the motivation for these conjectures (and their formulations), see this paper for conjecture (a) and this commentary by John Duchi for conjecture (b).

Recently these conjectures were solved for the particular case of {n=3}: (a) by Teng Zhang, and (b) by Arie Israel, Felix Krahmer, and Rachel Ward.
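
Since the {n=3} case is settled, a quick numerical sanity check is possible. The sketch below (my own, assuming NumPy; function names are illustrative) evaluates both sides of (a) and (b) on random positive semidefinite instances; agreement on random instances is evidence, not a proof, and the cost grows like {n!} and {n^n} terms for larger {n}.

```python
import itertools

import numpy as np

def random_psd(d, rng):
    """Sample a random d x d positive semidefinite matrix."""
    B = rng.standard_normal((d, d))
    return B @ B.T

def amgm_sides(As):
    """Evaluate both sides of conjectures (a) and (b) for A_1,...,A_n,
    using the spectral norm."""
    n = len(As)
    nrm = lambda M: np.linalg.norm(M, 2)  # spectral norm
    prod = lambda idx: np.linalg.multi_dot([As[i] for i in idx])
    perms = list(itertools.permutations(range(n)))      # n! orderings
    reps = list(itertools.product(range(n), repeat=n))  # n^n tuples
    lhs_a = nrm(sum(prod(p) for p in perms) / len(perms))
    rhs_a = nrm(sum(prod(k) for k in reps) / len(reps))
    lhs_b = sum(nrm(prod(p)) for p in perms) / len(perms)
    rhs_b = sum(nrm(prod(k)) for k in reps) / len(reps)
    return lhs_a, rhs_a, lhs_b, rhs_b

rng = np.random.default_rng(0)
As = [random_psd(4, rng) for _ in range(3)]  # n = 3, d = 4
lhs_a, rhs_a, lhs_b, rhs_b = amgm_sides(As)
print(f"(a): {lhs_a:.4f} <= {rhs_a:.4f} ? {lhs_a <= rhs_a + 1e-9}")
print(f"(b): {lhs_b:.4f} <= {rhs_b:.4f} ? {lhs_b <= rhs_b + 1e-9}")
```

(As the comment below notes, the general-{n} statement was later disproved, so checks of this kind are also how one can hunt for counterexamples.)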

1 thought on “10 Lectures and 42 Open Problems in Mathematics of Data Science”

  1. There is a 2020 preprint showing that the matrix AM-GM inequality is false for general n, with the first counterexample at n=5. One of the authors was interviewed in Quanta Magazine and mentions the paper in passing. I don’t work on this stuff, but I recalled reading this blog post from a MathOverflow answer, and I couldn’t find whether you had mentioned that preprint in a later post, so I’m leaving this message just in case. Best regards.
    * https://arxiv.org/abs/2006.01510
    * https://www.quantamagazine.org/an-applied-mathematician-strengthens-ai-with-pure-math-20230301/
    * https://mathoverflow.net/questions/342966/is-data-science-mathematically-interesting
