# Marco Cuturi

**Regularization for Optimal Transport and Dynamic Time Warping Distances**

Machine learning deals with mathematical objects that have structure. Two structures that arise frequently in applications are point clouds (or histograms) and time series. Early progress in optimization (linear and dynamic programming) has provided powerful families of distances between these structures, namely Wasserstein distances and dynamic time warping scores. Because they rely on the minimization of a linear functional over a continuous set of couplings and a (discrete) space of alignments respectively, both result in non-differentiable quantities. We show how two distinct smoothing strategies result in quantities that are better behaved and more suitable for machine learning, with applications to the computation of Fréchet means.
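To make the non-differentiability concrete, here is a minimal sketch of a dynamic time warping score in which the hard minimum of the dynamic program can be swapped for a soft minimum. The abstract does not name its two smoothing strategies, so treating the soft minimum as one of them is an assumption. The function names `soft_min` and `dtw_score`, the squared-difference cost, and the parameter `gamma` are all illustrative choices, not taken from the talk.

```python
import math


def soft_min(values, gamma):
    """Smoothed minimum: -gamma * log(sum(exp(-v / gamma))).

    Falls back to the ordinary (non-differentiable) min when gamma == 0.
    Assumes at least one entry of `values` is finite.
    """
    if gamma == 0.0:
        return min(values)
    lowest = min(values)
    # Shift by the smallest value so the exponentials cannot overflow.
    total = sum(math.exp(-(v - lowest) / gamma) for v in values)
    return lowest - gamma * math.log(total)


def dtw_score(x, y, gamma=0.0):
    """DTW score between two 1-D series with a squared-difference cost.

    gamma == 0 gives the classical score; gamma > 0 gives a smoothed
    score that is differentiable in the entries of x and y.
    """
    n, m = len(x), len(y)
    # r[i][j] holds the (soft) cost of aligning x[:i] with y[:j].
    r = [[math.inf] * (m + 1) for _ in range(n + 1)]
    r[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = (x[i - 1] - y[j - 1]) ** 2
            r[i][j] = cost + soft_min(
                [r[i - 1][j], r[i][j - 1], r[i - 1][j - 1]], gamma
            )
    return r[n][m]
```

With `gamma=0.0` the recursion is ordinary dynamic programming over alignments. As `gamma` grows, every alignment contributes to the score instead of only the best one, which is what removes the kinks. The smoothed score is never larger than the classical one and can be negative.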

_________________________________________________________________________

**Marco Cuturi** is a professor of statistics at CREST/ENSAE, Université Paris-Saclay. His research currently focuses on the application of optimal transport theory to machine learning and, more generally, data science. He received his Ph.D. in 2005 from the École des Mines de Paris, worked as a post-doctoral researcher at the Institute of Statistical Mathematics, Tokyo, between 2005 and 2007, in the financial industry until 2008, and as a lecturer in the ORFE department of Princeton University until 2010. He was an associate professor at the Graduate School of Informatics of Kyoto University between 2010 and 2016. His research is supported by a « Chaire d’Excellence de l’IDEX Paris Saclay » (2017-2020).

_________________________________________________________________________

# Grégoire Montavon

**Machine Learning Models: Explaining their Decisions, and Validating the Explanations**

Machine learning models such as deep neural networks have been successful at solving complex tasks in image recognition, text understanding, and physics. There is also high demand for using these models to assist humans in making decisions, e.g., in medical diagnosis or autonomous driving. For this, one needs to be able to trust the learned model, and it is therefore necessary to validate it thoroughly. In particular, we should ensure that its decisions are based on the correct input features.

In this talk, the deep Taylor decomposition framework for explaining decisions in terms of input features will be presented. The framework is applicable to a wide range of neural network architectures, including highly complex ones such as GoogLeNet. It works by propagating the model’s decision backwards through the network until the input variables are reached. The propagation mechanism at each layer is based on a Taylor expansion principle.
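The backward propagation described above can be sketched on a toy network. The example uses the z+ rule, one propagation rule associated with deep Taylor decomposition for ReLU layers with non-negative inputs. The abstract does not say which rules the talk covers, so this choice is an assumption, as are the bias-free layers and the helper names `relu_forward`, `zplus_backward`, and `explain`.

```python
def relu_forward(a, w):
    """Forward pass of a bias-free ReLU layer.

    a: list of input activations, w: weight rows indexed as w[input][output].
    """
    n_out = len(w[0])
    return [
        max(0.0, sum(a[i] * w[i][j] for i in range(len(a))))
        for j in range(n_out)
    ]


def zplus_backward(a, w, r_out):
    """Redistribute relevance r_out onto the layer inputs (z+ rule).

    Each output neuron j passes its relevance to input i in proportion to
    the positive contribution a[i] * max(w[i][j], 0). Assumes a[i] >= 0.
    """
    n_in, n_out = len(a), len(r_out)
    r_in = [0.0] * n_in
    for j in range(n_out):
        z = sum(a[i] * max(w[i][j], 0.0) for i in range(n_in))
        if z <= 0.0:
            # No positive contribution; in this bias-free setting such a
            # neuron has zero activation and carries no relevance.
            continue
        for i in range(n_in):
            r_in[i] += a[i] * max(w[i][j], 0.0) / z * r_out[j]
    return r_in


def explain(x, weights):
    """Return (network output, relevance of each input variable)."""
    activations = [list(x)]
    for w in weights:
        activations.append(relu_forward(activations[-1], w))
    # Start from the output activations and walk back layer by layer.
    relevance = activations[-1]
    for a, w in zip(reversed(activations[:-1]), reversed(weights)):
        relevance = zplus_backward(a, w, relevance)
    return activations[-1], relevance
```

In this sketch the input relevances are non-negative and sum to the total output activation, so the decision is redistributed rather than created or lost on the way back. Real image models typically need a different rule at the pixel layer, which this toy example leaves out.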

Explanation techniques can be used to validate a trained model. But we also need to validate the explanation technique itself. Ground-truth explanations are usually not available. However, one can still test the explanation technique for a number of properties considered desirable. We will show how the free parameters of the Taylor expansion can be chosen to induce these desirable properties.

_________________________________________________________________________

**Grégoire Montavon** received a Master's degree in Communication Systems from École Polytechnique Fédérale de Lausanne in 2009 and a Ph.D. degree in Machine Learning from the Technische Universität Berlin in 2013. He is currently a Research Associate in the Machine Learning Group at TU Berlin. His current research focuses on methods for interpreting machine learning models, in particular deep neural networks.

_________________________________________________________________________

# Patrice Simard

**Machine Learning: What’s next?**

For many Machine Learning (ML) problems, labeled data is readily available. When this is the case, algorithms and training time are the performance bottleneck. This is the ML researcher’s paradise! Vision and Speech are good examples of such problems because they have a stable distribution and additional human labels can be collected each year. Problems that extract their labels from history, such as click prediction, data analytics, and forecasting are also blessed with large numbers of labels. Unfortunately, there are only a few problems for which we can rely on such an endless supply of free labels. They receive a disproportionally large amount of attention from the media.

We are interested in tackling the much larger class of ML problems where labeled data is scarce. For example, consider a dialog system for a specific app that must recognize specific commands such as: “lights on first floor off”, “increase spacing between 2nd and 3rd paragraph”, “make doctor appointment after Hawaii vacation”. Anyone who has attempted to build such a system has soon discovered that generalizing to new instances from a small custom set of labeled instances is far more difficult than they originally thought. Each domain has its own generalization challenges, data exploration and discovery, custom features, and decomposition structure. Creating labeled data to communicate custom knowledge is inefficient. It also leads to embarrassing errors resulting from over-training on small sets. ML algorithms and processing power are not a bottleneck when labeled data is scarce. The bottleneck is the teacher and the teaching language.

To address this problem, we change our focus from the learning algorithm to teachers. We define “Machine Teaching” as improving human productivity *given* a learning algorithm. If ML is the science and engineering of extracting knowledge from data, Machine Teaching is the science and engineering of extracting knowledge from teachers. A similar shift of focus has happened in computer science. While computing is revolutionizing our lives, systems sciences (e.g., programming languages, operating systems, networking) have shifted their focus to human productivity. We expect a similar trend to shift science from Machine Learning to Machine Teaching.

The aim of this talk is to convince the audience that we are asking the right questions. We provide some answers and some spectacular results. The most exciting part, however, lies in the research opportunities that come with the emergence of a new field.

_________________________________________________________________________

**Patrice Simard** is a Distinguished Engineer in the Microsoft Research AI Lab in Redmond. He is passionate about finding new ways to combine engineering and science in the field of machine learning. Simard’s research is currently focused on human teachers. His goal is to extend the teaching language, science, and engineering, beyond the traditional (input, label) pairs.

Simard completed his PhD thesis in Computer Science at the University of Rochester in 1991. He then spent 8 years at AT&T Bell Laboratories working on neural networks. He joined Microsoft Research in 1998. In 2002, he started MSR’s Document Processing and Understanding research group. In 2006, he left MSR to become the Chief Scientist and General Manager of Microsoft’s Live Labs Research. In 2009, he became the Chief Scientist of Microsoft’s AdCenter (the organization that monetizes Bing search). In 2012, he returned to Microsoft Research to work on his passion, Machine Learning research. Specifically, he founded the Computer-Human Interactive Learning (CHIL) group to study Machine Teaching and to make machine learning accessible to everyone.

_________________________________________________________________________