# MATH 581: Mathematics of Data Science

## Instructor Information

Instructor: Dmitriy Drusvyatskiy
Lecture: MW 1:00-2:20 PM
Office hours: F 12:00-1:00 PM and by appointment
Office: PDL C-434
Email: ddrusv at uw.edu

## Meeting Times and Location

Lecture time: Monday and Wednesday, 1:00-2:20 PM
Lecture location: Mary Gates Hall (MGH) 238

## Course Description

This is a three-term sequence covering the fundamentals of the mathematics of data science. The basic question we will address is how to “optimally” extract information from high-dimensional data. The course will cover material from a few different fields, most notably high-dimensional probability, statistical and machine learning, and optimization.

## Textbook

We will use a number of different texts:

## Grading

• Biweekly problem sets with solutions written in LaTeX (100% of the grade). Since there is no TA for this course, grading will be on the scale:

• 0 (did little or no work),

• 1 (did at least 50% of the homework),

• 2 (did at least 80% of the homework).

## Collaboration

You may work together on problem sets, but you must write up your own solutions. You must also cite any resources which helped you obtain your solutions.

## Weekly Schedule

• Chapter 1: Concentration Inequalities

• Chapter 2: Random Vectors in High Dimensions (Lecture 12/13)

• Concentration of the norm

• Isotropic Distributions

• Similarity of Normal and Spherical Distributions

• Sub-Gaussian and Sub-Exponential Random Vectors
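
The first topic of Chapter 2 can be previewed numerically. Below is a minimal NumPy sketch (illustrative only, not part of the course materials; the dimension and trial count are made up): the Euclidean norm of an n-dimensional standard Gaussian vector concentrates tightly around sqrt(n), with fluctuations of constant order.

```python
import numpy as np

# Concentration of the norm: for g ~ N(0, I_n), ||g||_2 is tightly
# concentrated around sqrt(n), with O(1) fluctuations.
rng = np.random.default_rng(0)
n, trials = 1000, 2000
norms = np.linalg.norm(rng.standard_normal((trials, n)), axis=1)

print(f"sqrt(n)       = {np.sqrt(n):.2f}")
print(f"mean ||g||_2  = {norms.mean():.2f}")   # close to sqrt(n)
print(f"std  ||g||_2  = {norms.std():.2f}")    # O(1), independent of n
```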

• Chapter 3: Introduction to Statistical Inference (Lecture 14/15)

• Mean square error and the Bias-Variance decomposition

• Maximum Likelihood Estimation and examples

• Chapter 4: Linear Regression (Lecture 16/17)

• Ordinary least squares

• Ridge regression
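
Both estimators in Chapter 4 admit closed forms. A hedged NumPy sketch (the synthetic data and the choice lam = 1 are for illustration only): the ridge estimator solves the regularized normal equations, and ordinary least squares is recovered in the lam -> 0 limit.

```python
import numpy as np

# Synthetic regression data (all parameters here are illustrative).
rng = np.random.default_rng(1)
n, d = 200, 10
X = rng.standard_normal((n, d))
w_true = rng.standard_normal(d)
y = X @ w_true + 0.1 * rng.standard_normal(n)

# Ridge regression: minimize ||Xw - y||^2 + lam * ||w||^2,
# with closed-form solution (X^T X + lam I)^{-1} X^T y.
lam = 1.0
w_ridge = np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

# Ordinary least squares is the lam -> 0 limit.
w_ols = np.linalg.lstsq(X, y, rcond=None)[0]
```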

• Chapter 5: Uniform Laws and Introduction to Statistical Learning (Lecture 18-20)

• Uniform law via Rademacher complexity

• Glivenko-Cantelli Theorem

• Upper bounds on Rademacher complexity

• Sample complexity of binary classification with VC dimension

• Learning linear classes

• Stable learning rules generalize

• Learning without uniform convergence with convex losses

• Chapter 6: Metric entropy and Matrix Concentration (Lecture 21-31)

• Nets, Covering numbers, and the metric entropy

• Eigenvalues and singular values of sub-Gaussian random matrices

• Matrix concentration

• Controlling the operator norm of a random matrix by row/column norms

• Applications: community detection, (sparse) covariance estimation, matrix completion; See also the paper "Clustering mixtures with unknown covariances"

• Chapter 7: Learning with kernels (Lecture 31-35)

• Motivation: the Kernel trick and dimension-free generalization

• Introduction to Hilbert spaces

• Representer Theorem

• Positive definite kernels and the reproducing kernel Hilbert space (RKHS)

• Moore-Aronszajn theorem and continuity of point evaluations

• Translation-invariant kernels

• Generalization properties

• Chapter 8: Optimization for learning (Lecture 36-45)

• Two paradigms: empirical risk minimization vs stochastic approximation

• Illustration: gradient descent for least squares

• Convexity: an interlude

• Minimax lower bound for deterministic convex optimization

• Stochastic gradient method and Polyak-Juditsky averaging

• Stochastic variance reduced gradient (SVRG)
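
The stochastic-approximation paradigm of Chapter 8 can be previewed on least squares. An illustrative NumPy sketch (the step-size schedule and burn-in are chosen for demonstration only): the stochastic gradient method with Polyak-Juditsky iterate averaging.

```python
import numpy as np

# Synthetic least-squares instance (sizes and step sizes are illustrative).
rng = np.random.default_rng(3)
n, d = 500, 5
X = rng.standard_normal((n, d))
w_star = rng.standard_normal(d)
y = X @ w_star + 0.1 * rng.standard_normal(n)

w, w_avg, k, burn = np.zeros(d), np.zeros(d), 0, 2000
for t in range(1, 20001):
    i = rng.integers(n)                      # sample one data point
    grad = (X[i] @ w - y[i]) * X[i]          # stochastic gradient
    w = w - 0.05 / np.sqrt(t) * grad         # decaying step size
    if t > burn:                             # average iterates after burn-in
        k += 1
        w_avg += (w - w_avg) / k
```

Averaging the iterates suppresses the gradient noise, which is the point of the Polyak-Juditsky scheme.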

• Chapter 9: Suprema of sub-Gaussian processes (Notes)

• Motivation

• Examples of sub-Gaussian processes

• Dudley's entropy integral and chaining

• From metric entropy to Gaussian width

• Sudakov-Fernique inequality

• Gordon's theorem and the matrix deviation inequality

• Johnson-Lindenstrauss lemma for infinite sets

• M* bound and escape through the mesh
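
A quick numerical preview of the Johnson-Lindenstrauss phenomenon (the dimensions and point count are chosen only for illustration): a scaled Gaussian projection from dimension d down to k nearly preserves all pairwise distances.

```python
import numpy as np
from itertools import combinations

# Johnson-Lindenstrauss in action: a scaled Gaussian projection from
# dimension d down to k nearly preserves all pairwise distances.
rng = np.random.default_rng(2)
d, k, npts = 2000, 300, 30
pts = rng.standard_normal((npts, d))
A = rng.standard_normal((k, d)) / np.sqrt(k)   # random projection matrix
proj = pts @ A.T

ratios = [np.linalg.norm(proj[i] - proj[j]) / np.linalg.norm(pts[i] - pts[j])
          for i, j in combinations(range(npts), 2)]
print(min(ratios), max(ratios))   # both close to 1
```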

• Chapter 11: Sparse recovery (Notes)

• High dimensional signal recovery

• Signal recovery based on the M* bound

• Exact recovery based on the escape theorem

• Deterministic conditions for recovery

• Recovery with noise (LASSO/BPDN)

• Chapter 10: Minimax lower bounds for estimation (Notes)

• Minimax risk

• Reduction to hypothesis testing

• Le Cam's method based on binary testing

• Interlude: Deviations between probability distributions

• Applications: lower bounds for mean estimation and linear regression

• Chapter 12: Overparametrized models and the Polyak-Łojasiewicz (PL) inequality

• Polyak's inequality and linear convergence of gradient descent (Notes)

• Łojasiewicz inequality for analytic functions and finite length of gradient orbits

• Kurdyka-Łojasiewicz inequality and desingularization of tame functions

• Some extensions to nonsmooth functions

• Introduction to neural networks

• Transition to linearity and the PL condition for wide neural networks (Paper 1, Paper 2)

## Homework Problems

• HW 1 (due Friday, October 20th; slide it under my door):

• Reading: Vershynin (Chapter 2, Sections 3.1-3.4)

• Optional Reading: Wainwright (Chapter 1, Chapter 2) and Vershynin (Sections 3.5-3.8)

• Exercises: Vershynin (1.3.3, 2.2.10, 2.5.10, 2.5.11, 3.3.5, 3.4.4)

• HW 2 (due Friday, November 10th; slide it under my door):

• Exercises: Vershynin (2.4.4, 2.5.7, 2.5.9, 2.6.5, 2.6.6, 3.3.3)

• HW 3 (due Friday, December 1st; slide it under my door):

• Exercises: (HW3)

• HW 4 (due Friday, February 2nd; slide it under my door):