# Distinguished Seminar in Optimization & Data

This is an interdepartmental talks series at the University of Washington,
focused on all aspects of optimization and data science.
It is a new incarnation of the CORE seminar
that ran from 2013 to 2020.

The series is funded by the Institute for the Foundations of
Data Science (IFDS), the Pacific Institute for the Mathematical Sciences (PIMS), and the home departments of the organizers.

### Organizers:

- Sasha Aravkin, Applied Mathematics
- Dima Drusvyatskiy, Mathematics
- Maryam Fazel, Electrical & Computer Engineering *(current co-chair)*
- Zaid Harchaoui, Statistics
- Kevin Jamieson, Computer Science & Engineering
- Thomas Rothvoss, Mathematics & Computer Science & Engineering
- Rekha Thomas, Mathematics *(current co-chair)*
- Cynthia Vinzant, Mathematics

## Upcoming Talks:

## Amitabh Basu, Johns Hopkins University

*Monday February 5, 2024*

*Monday March 18, 2024*

*Monday April 22, 2024*

*Monday May 6, 2024*

## Past Talks:

## Lin Xiao, Fundamental AI Research (FAIR), Meta

*Monday November 6, 2023, 3:30 pm*

*Gates Commons (CSE 691), Allen Center*

### Non-negative Gauss-Newton Methods for Empirical Risk Minimization
(recording)

**Abstract:**
Training large-scale machine learning models, especially modern deep learning models, relies almost exclusively on the stochastic gradient method and its many variants. A major challenge in applying these methods is how to automatically schedule or adapt the step size. We exploit the fact that the loss functions in machine learning are mostly non-negative, and thus can be written as the composition of the square function with their real-valued square roots. With this simple reformulation, we can apply the classical Gauss-Newton method, or the Levenberg-Marquardt method with an extra quadratic regularization. We show that the resulting algorithms are highly adaptive and can automatically warm up and decay the effective step size while tracking the loss landscape. We provide convergence analysis of these Non-negative Gauss-Newton (NGN) methods in convex, non-convex, and stochastic settings. Both the convergence rates and empirical evaluations compare favorably to the classical (stochastic) gradient method. This is joint work with Antonio Orvieto.
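To make the reformulation concrete, here is a minimal sketch of an NGN-style step derived exactly as the abstract describes: write the non-negative loss as f = r² with r = √f, linearize r, and add a quadratic (Levenberg-Marquardt) regularization. The constant `sigma` and the resulting closed-form update are my own derivation under those assumptions, not necessarily the exact formula from the paper.

```python
import numpy as np

def ngn_step(w, loss, grad, sigma=1.0):
    """One regularized Gauss-Newton step on r(w) = sqrt(loss(w)).

    Minimizing (r + g_r @ delta)**2 / 2 + ||delta||**2 / (2 * sigma)
    over delta has a closed-form solution via Sherman-Morrison;
    sigma is a hypothetical regularization constant.
    """
    f = loss(w)
    g = grad(w)
    r = np.sqrt(f) + 1e-12          # guard against division by zero
    g_r = g / (2.0 * r)             # chain rule: gradient of sqrt(f)
    delta = -sigma * r * g_r / (1.0 + sigma * (g_r @ g_r))
    return w + delta

# Toy problem: f(w) = ||w||^2 / 2, minimized at w = 0.
loss = lambda w: 0.5 * (w @ w)
grad = lambda w: w
w = np.array([2.0, -1.0])
for _ in range(50):
    w = ngn_step(w, loss, grad)
print(np.linalg.norm(w))            # contracts geometrically toward 0
```

Note how the effective step size adapts on its own: when the loss is large relative to the squared gradient the step approaches a constant, and it shrinks automatically as the gradient dominates, matching the warm-up/decay behavior described in the abstract.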

**Bio:**
Lin Xiao is a Research Scientist Manager on the Fundamental AI Research (FAIR) team at Meta. He received his Ph.D. from Stanford University in 2004, spent two years as a postdoctoral fellow at the California Institute of Technology, and then worked at Microsoft Research until 2020 before joining Meta. His current research focuses on optimization theory and algorithms for deep learning and reinforcement learning.

## Ryan O'Donnell, Carnegie Mellon University

*Monday October 2, 2023, 3:30 pm*

*Gates Commons (CSE 691), Allen Center*

### New directions in (quantum) distribution learning and testing (recording)

**Abstract:**
I will talk about:
- New efficient algorithms for quantum state tomography (the quantum analogue of estimating a probability distribution).
- Why you should care about the difference between total variation distance and Hellinger distance and KL divergence and chi-squared divergence.
- Quantum-inspired improvements to the classical problem of independence testing.

Includes joint work with Steven T. Flammia (Amazon).
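For reference, the four discrepancy measures mentioned in the abstract, written for discrete distributions $p, q$ (these are the standard definitions and comparison inequalities, not material from the talk itself):

```latex
d_{\mathrm{TV}}(p,q) = \tfrac{1}{2}\sum_x |p(x)-q(x)|, \qquad
H^2(p,q) = \tfrac{1}{2}\sum_x \bigl(\sqrt{p(x)}-\sqrt{q(x)}\bigr)^2,
```

```latex
\mathrm{KL}(p\,\|\,q) = \sum_x p(x)\log\frac{p(x)}{q(x)}, \qquad
\chi^2(p\,\|\,q) = \sum_x \frac{(p(x)-q(x))^2}{q(x)},
```

related by the standard chain $H^2 \le d_{\mathrm{TV}} \le \sqrt{2H^2}$, Pinsker's inequality $d_{\mathrm{TV}} \le \sqrt{\mathrm{KL}/2}$, and $\mathrm{KL} \le \log(1+\chi^2) \le \chi^2$; which measure a tester controls can change the sample complexity of a testing problem.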

**Bio:** Ryan O'Donnell is a professor of computer science at Carnegie Mellon University. He received his Ph.D. in 2003 from the Department of Applied Mathematics at MIT. His research interests include quantum computation and information theory, approximability of optimization problems, spectral graph theory, analysis of Boolean functions, probability, and complexity theory.

## Ting-Kei Pong, Hong Kong Polytechnic University

*Monday June 5, 2023, 3:30 pm*

*Gates Commons (CSE 691), Allen Center*

### Error bounds for conic feasibility problems: case studies on the exponential cone (recording)

**Abstract:**
A linear conic programming problem aims at minimizing a linear objective over the intersection of an affine set and a closed convex cone. The conic feasibility problems we study in this talk naturally arise from the optimality conditions of linear conic programming problems, which boil down to finding a point in the intersection of an affine set and a closed convex cone. While the distance to the intersection gives a precise measure of proximity of a point to being feasible, this distance can be much harder to compute than the distances to the affine set and the cone, respectively. Thus, establishing bounds on the distance to the intersection based on the latter two distances (a.k.a. the error bound problem) is instrumental in the design of termination criteria for conic solvers and the study of convergence rates of algorithms.

In this talk, we present a general framework for deriving error bounds for conic feasibility problems. Our framework is based on the classical concept of facial reduction, which is a fundamental tool for handling degeneracy in conic programs, and a new object called the one-step facial residual function. We develop tools to compute these facial residual functions, which are applicable even when the projections onto the cones under study are not easy to analyze. We illustrate how our framework can be applied to obtain error bounds for the exponential cone. The exponential cone is a recent addition to the MOSEK commercial conic solver and allows the modeling of optimization problems involving power, exponential, logarithmic and entropy functions in the objective or constraint set. If time permits, we will also use our new error bound results to derive interesting new results in the study of the Kurdyka-Lojasiewicz property.
This is joint work with Scott B. Lindstrom and Bruno F. Lourenco.
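For readers unfamiliar with the setup, the error bound problem described above can be written in a standard Hölderian form (this is the generic template, not the talk's specific results): for an affine set $\mathcal{A}$ and closed convex cone $\mathcal{K}$,

```latex
\operatorname{dist}\bigl(x, \mathcal{A} \cap \mathcal{K}\bigr)
\;\le\;
\kappa \left( \operatorname{dist}(x, \mathcal{A}) + \operatorname{dist}(x, \mathcal{K}) \right)^{\gamma}
\qquad \text{for all } x \in B,
```

where $B$ is any bounded set, $\kappa > 0$, and $\gamma \in (0, 1]$ is the exponent; the two distances on the right are cheap to evaluate (a projection onto an affine set and onto a cone), while the left-hand side is what a solver's termination criterion actually needs to control.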

**Bio:** Ting-Kei Pong is an associate professor in the Department of
Applied Mathematics at the Hong Kong Polytechnic University. He
received a bachelor's degree in 2004 and an MPhil degree in 2006 from
the Chinese University of Hong Kong. Pong received a Ph.D. from the
Department of Mathematics at UW in 2011, followed by postdoctoral
positions at Simon Fraser University, the University of Waterloo, and
UBC Vancouver. He joined the Hong Kong Polytechnic University in 2014.

## Philippe Rigollet, MIT

*Monday May 22, 2023, 3:30 pm*

*Gates Commons (CSE 691), Allen Center*

### Statistical applications of Wasserstein gradient flows (recording)

**Abstract:**
Otto calculus is a fundamental toolbox in mathematical optimal transport, imparting the Wasserstein space of probability measures with a Riemannian structure. In particular, one can compute the Riemannian gradient of a functional over this space and, in turn, optimize it using Wasserstein gradient flows. The necessary background to define and compute Wasserstein gradient flows will be presented in the first part of the talk before moving to statistical applications such as variational inference and maximum likelihood estimation in Gaussian mixture models.
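For context, the object the abstract refers to has a concrete PDE form (this is the standard Otto-calculus formula, not taken from the talk): the Wasserstein gradient flow of a functional $F$ over probability densities follows the continuity equation

```latex
\partial_t \rho_t \;=\; \nabla \cdot \Bigl( \rho_t \, \nabla \frac{\delta F}{\delta \rho}(\rho_t) \Bigr),
```

where $\delta F / \delta \rho$ is the first variation of $F$. For instance, taking $F(\rho) = \mathrm{KL}(\rho \,\|\, \pi)$ yields the Fokker-Planck equation, the flow underlying Langevin sampling and many variational inference schemes.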

**Bio:**
Philippe Rigollet is a Professor of Mathematics at MIT. He received his Ph.D. in mathematics from the University of Paris VI in 2006. His work is at the intersection of statistics, machine learning, and optimization, focusing primarily on the design and analysis of statistical methods for high-dimensional problems. Rigollet's recent research focuses on statistical optimal transport and its applications to geometric data analysis and sampling.
## Misha Belkin, University of California, San Diego

*Monday May 1, 2023, 3:30 pm*

*Gates Commons (CSE 691), Allen Center*

### The Challenges of Training Infinitely Large Neural Networks (recording)

**Abstract:**
Remarkable recent successes of deep learning have heavily relied on very large neural networks. Time and time again, larger networks have shown better performance than smaller networks trained on comparably sized data sets. Since large networks generally work better, why not train infinitely large networks directly to get the best performance? This is not a rhetorical question. Recent work on the Neural Tangent Kernel showed that in certain regimes infinitely wide neural networks are equivalent to kernel machines with an explicitly computable kernel function. These machines can be trained directly by solving linear systems. However, there are two primary challenges in training infinitely large networks. First, such networks are not able to take advantage of feature learning, a key ingredient in the success of neural networks. Second, scaling kernel machines to large data sizes has been a serious computational challenge.
In this talk, I will first show how feature learning can be incorporated into kernel machines without backpropagation, resulting in Recursive Feature Machines (RFMs) that outperform Multilayer Perceptrons. Like standard kernel machines, RFMs can be trained by solving linear systems of equations. These machines demonstrate state-of-the-art performance on tabular data and are considerably more efficient than neural networks for small and medium data sizes. Second, I will discuss some of our recent work on scaling kernel machines to much larger data. While the data sizes used to train modern LLMs remain beyond our reach, given sufficient resources, scaling to these data sizes is not out of the realm of possibility.
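A toy numpy sketch of the idea behind feature learning in kernel machines: alternate a kernel ridge fit with a metric update built from the average outer product of the fitted predictor's gradients. The Gaussian (rather than Laplace) kernel, the bandwidth, and the trace normalization below are my simplifications for illustration, not the specific choices made in the RFM work.

```python
import numpy as np

def mahalanobis_gaussian(X, Z, M):
    """k_M(x, z) = exp(-(x - z)^T M (x - z) / 2)."""
    xx = np.einsum('ij,jk,ik->i', X, M, X)
    zz = np.einsum('ij,jk,ik->i', Z, M, Z)
    sq = xx[:, None] + zz[None, :] - 2.0 * (X @ M @ Z.T)
    return np.exp(-0.5 * np.maximum(sq, 0.0))

def rfm(X, y, n_iters=3, reg=1e-3):
    """Alternate kernel ridge regression with a gradient-outer-product metric update."""
    n, d = X.shape
    M = np.eye(d) / d                        # start from an isotropic metric
    for _ in range(n_iters):
        K = mahalanobis_gaussian(X, X, M)
        alpha = np.linalg.solve(K + reg * np.eye(n), y)   # kernel ridge fit
        # Gradient of the fitted predictor at each training point:
        #   grad f(x_j) = sum_i alpha_i k_M(x_j, x_i) M (x_i - x_j)
        G = np.zeros((n, d))
        for j in range(n):
            w = alpha * K[j]
            G[j] = M @ (X.T @ w - X[j] * w.sum())
        M = G.T @ G / n                      # average gradient outer product
        M /= np.trace(M)                     # fix the metric's overall scale
    K = mahalanobis_gaussian(X, X, M)
    alpha = np.linalg.solve(K + reg * np.eye(n), y)
    return M, alpha

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = X[:, 0] ** 2                             # label depends only on coordinate 0
M, _ = rfm(X, y)
print(np.round(np.diag(M), 3))               # mass concentrates on coordinate 0
```

Each iteration is still just a linear solve, which is the point the abstract makes: feature learning is added without backpropagation, and the learned metric `M` discovers that only the first coordinate matters.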

**Bio:**
Mikhail Belkin is a Professor at the Halicioglu Data Science Institute and the Computer Science and Engineering Department at UCSD, and an Amazon Scholar. He received his Ph.D. in 2003 from the Department of Mathematics at the University of Chicago. His research interests are broadly in the theory and applications of machine learning and data analysis. Some of his well-known work includes the widely used Laplacian Eigenmaps, Graph Regularization, and Manifold Regularization algorithms, which brought ideas from classical differential geometry and spectral graph theory to data science. His recent work has been concerned with understanding the remarkable mathematical and statistical phenomena observed in the context of deep learning. This empirical evidence necessitated revisiting some of the basic concepts in statistics and optimization. One of his key recent findings is the "double descent" risk curve, which extends the textbook U-shaped bias-variance trade-off curve beyond the point of interpolation.

## Damek Davis, Cornell University

*Monday April 3, 2023, 3:30 pm*

*Gates Commons (CSE 691), Allen Center*

### Leveraging "partial" smoothness for faster convergence in nonsmooth optimization

**Abstract:**
Nonsmoothness and nonconvexity are significant challenges in large-scale optimization problems, such as training neural networks and solving inverse problems in computational imaging. Although first-order methods are commonly used to address these problems, they are often considered slow. In this presentation, we introduce a (locally) accelerated first-order method that challenges this perception by solving "generic" nonsmooth equations at a superlinear rate. The central insight is that nonsmooth functions often exhibit "partial" smoothness, which can be leveraged through a novel generalization of Newton's method.
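To give a flavor of "partial" smoothness (a standard textbook-style example, not one taken from the talk): a nonsmooth function can be perfectly smooth along an active manifold while growing sharply in the directions leaving it, e.g.

```latex
f(x, y) = |x| + y^2
\quad \text{is smooth along } \mathcal{M} = \{(0, y) : y \in \mathbb{R}\},
\text{ where } f(0, y) = y^2,
```

while $|x|$ grows linearly in directions leaving $\mathcal{M}$. Minimizing sequences are attracted to such a manifold, and once the iterates identify it, a Newton-type method restricted to the smooth part can converge superlinearly.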

**Bio:**
Damek Davis is an Associate Professor of Operations Research at Cornell University. He received his Ph.D. in mathematics from the University of California, Los Angeles in 2015. His research focuses on the interplay of optimization, signal processing, statistics, and machine learning. He has received several awards for his work, including a Sloan Research Fellowship in Mathematics (2020), the INFORMS Optimization Society Young Researchers Prize (2019), an NSF CAREER Award (2021), and the SIAM Activity Group on Optimization Best Paper Prize (2023).

## Éva Tardos, Cornell University

*Monday January 9, 2023, 3:30 pm*

*Gates Commons (CSE 691), Allen Center*

### Stability and Learning in Strategic Queueing Systems (recording)

**Abstract:**
Over the last two decades we have developed a good understanding of
how to quantify the impact of strategic user behavior on
outcomes in many games (including traffic routing and online auctions)
and showed that the resulting bounds extend to repeated games assuming
players use a form of no-regret learning to adapt to the
environment. Unfortunately, these results do not apply when outcomes
in one round affect the game in the future, as is the case in many
applications. In this talk, we study this phenomenon in the context of
a game modeling queuing systems: routers compete for servers, where
packets that do not get served need to be resent, resulting in a
system where the number of packets at each round depends on the
success of the routers in the previous rounds. In joint work with
Jason Gaitonde, we analyze the resulting highly dependent random
process. We find that if the capacity of the servers is high enough to
allow a centralized and knowledgeable scheduler to get all packets
served even with double the packet arrival rate, then despite selfish
behavior of the queues, the expected number of packets in the queues
will remain bounded throughout time, assuming older packets have
priority. Further, if queues are more patient in evaluating their
outcomes, maximizing their long-run success rate, stability can be
ensured with just 1.58 times extra capacity, strictly better than what
is possible assuming the no-regret property.
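On the constant mentioned above (my reading of the result, worth checking against the Gaitonde-Tardos paper): the 1.58 factor appears to be

```latex
\frac{e}{e-1} \approx 1.582 \;<\; 2,
```

i.e. patient queues need only $e/(e-1)$ times the arrival rate in server capacity to remain stable, compared with the factor of 2 required under the no-regret assumption.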

**Bio:** Éva Tardos is the Jacob Gould Schurman Professor of
Computer Science at Cornell University, currently chair of the
Department of Computer Science for a second term after serving as
chair in 2006-2010. She was Interim Dean for Computing and Information
Sciences in 2012-2013 and more recently was Associate Dean for
Diversity & Inclusion. She received her BA and PhD from Eötvös
University in Budapest and joined the faculty at Cornell in
1989. Tardos's research interest is algorithms and the interface of
algorithms and incentives. She is best known for her work on
network-flow algorithms and for quantifying the efficiency of selfish
routing. She has been elected to the National Academy of Engineering,
the National Academy of Sciences, the American Philosophical Society,
the American Academy of Arts and Sciences, and the Hungarian Academy
of Sciences. She is the recipient of a number of fellowships and
awards, including the Packard Fellowship, the Gödel Prize, the Dantzig
Prize, the Fulkerson Prize, the EATCS Award, and the IEEE John von
Neumann Medal. She co-wrote the widely used textbook Algorithm
Design. She has been Editor-in-Chief of the Journal of the ACM and of
the SIAM Journal on Computing, was an editor of several other
journals, and has served as program committee member and chair for
several ACM and IEEE conferences in her area.