
NeurIPS 2018 - Part 2/4 Visualization and ML

The Thirty-second Annual Conference on Neural Information Processing Systems
Highlights of what we saw at this year’s conference  - Part 2/4

Visualization and ML (S. Michel)

Sofia attended the Monday morning tutorial on ‘Visualization for Machine Learning’ by F. Viégas and M. Wattenberg from Google Brain. Given that the human visual system is really good at a few special tasks, the goal of the tutorial was to understand how data can be transformed into visual encodings that people naturally decode. The key question was thus 'How can we guide people's attention to let their brain naturally interpret the visualization?'

Some of the take home messages she shared were: 

  - If the visualization happens on a computer then it’s crucial to make it interactive, following three principles: give an overview first, then filter and zoom and only then give details on demand.

  - Colours play a key role in conveying a message. A useful tool to create colour palettes according to your purpose is ‘ColorBrewer’.

  - For tables, the "remove to improve" principle can do magic. You might want to look at this not-so-recent but useful video: https://www.darkhorseanalytics.com/blog/clear-off-the-table

 More specifically for machine learning: 

   - On the importance of visualizing training data, there’s Google's Facets tool.

   - Visualizations can help understand the behaviour of neural networks, with examples for images that include https://arxiv.org/abs/1311.2901, http://yosinski.com/deepvis and http://brainmodels.csail.mit.edu/dnn/drawCNN/ and, for sequence-to-sequence models, http://seq2seq-vis.io/

   - Saliency maps (and their many variants) are interesting but often difficult to interpret

   - Visualizing high dimensional data: beyond the linear PCA, the t-SNE method is now very popular. One should however be careful with the perplexity hyper-parameter, which can completely transform the results. In particular, it governs both the apparent size of the clusters and the distance between them, so neither should be treated as meaningful when interpreting the data. A better alternative to t-SNE is UMAP, which is faster and is supposed to better capture the global structure. See ‘UMAP: Uniform Manifold Approximation and Projection for Dimension Reduction’ or https://www.math.upenn.edu/~jhansen/2018/05/04/UMAP/
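To get a feel for what perplexity means, here is a minimal sketch of the quantity itself rather than of a full t-SNE run. It assumes the usual definition, 2 to the power of the entropy of the Gaussian neighbour distribution around one point, where the bandwidth `sigma` is what t-SNE tunes per point to hit the requested perplexity. The distances below are made-up values and the function names are ours.

```python
import math

def conditional_probs(sq_dists, sigma):
    """Gaussian neighbour probabilities p(j|i) from squared distances to point i."""
    weights = [math.exp(-d / (2.0 * sigma ** 2)) for d in sq_dists]
    total = sum(weights)
    return [w / total for w in weights]

def perplexity(probs):
    """Perplexity 2**H(P), with H the Shannon entropy in bits."""
    entropy = -sum(p * math.log2(p) for p in probs if p > 0.0)
    return 2.0 ** entropy

# Squared distances from one point to its six neighbours (illustrative values):
# three close neighbours and three far ones.
sq_dists = [1.0, 1.0, 1.0, 25.0, 25.0, 25.0]

for sigma in (0.5, 2.0, 50.0):
    p = conditional_probs(sq_dists, sigma)
    print(f"sigma={sigma:5.1f}  perplexity={perplexity(p):.2f}")
```

With a small bandwidth only the three close points count as neighbours (perplexity close to 3), while a large bandwidth treats all six alike (perplexity close to 6). Perplexity can thus be read as an effective number of neighbours, which is why changing it reshapes the clusters.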

Viégas and Wattenberg insisted on how counterintuitive high-dimensional data can be. A rule of thumb is that, if you spot something interesting, then you should make sure the same phenomenon does not occur with a random baseline.
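One way to build such a random baseline is sketched below. This is our own illustration, not something shown in the tutorial: it draws pure Gaussian noise and measures how much pairwise distances vary. The function name, the number of points and the dimensions are arbitrary choices.

```python
import math
import random

def relative_spread(n_points, dim, rng):
    """(max - min) / min over all pairwise Euclidean distances of random points."""
    points = [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(n_points)]
    dists = [
        math.dist(points[i], points[j])
        for i in range(n_points)
        for j in range(i + 1, n_points)
    ]
    return (max(dists) - min(dists)) / min(dists)

rng = random.Random(0)  # fixed seed so the run is repeatable
for dim in (2, 20, 200, 2000):
    print(f"dim={dim:5d}  relative spread={relative_spread(50, dim, rng):.2f}")
```

In two dimensions some pairs of points are far closer than others, whereas in thousands of dimensions all pairs end up at nearly the same distance. If a real dataset shows the same behaviour as this noise, the pattern is probably a property of the dimensionality rather than of the data.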

Why Deep Nets generalize and does back propagation work in the brain? (C. Dance)

The first session of Track 1 on Tuesday was a positive surprise as it had three papers that I found interesting and important.

  1. Baldi and Vershynin from Irvine (http://papers.nips.cc/paper/7999-on-neuronal-capacity.pdf) considered the capacity of a learning architecture, defined as the log of the number of functions it can implement. The capacity of linear threshold gates has been known since the work of Zuev (1989), but Baldi and Vershynin prove simple formulae for the capacity of polynomial threshold gates and deep ReLU nets. The results show that shallower architectures have more capacity than deeper architectures with a similar number of weights. This is perhaps intriguing in the light of recent results (e.g. https://arxiv.org/abs/1608.03287) showing that some functions computed by deep networks can only be poorly approximated by shallow networks unless those shallow networks have exponentially-many more hidden units.
  2. Li and Liang (http://papers.nips.cc/paper/8038-learning-overparameterized-neural-networks-via-stochastic-gradient-descent-on-structured-data) explore two well-known mysteries of deep learning: (i) the fact that over-parameterisation helps optimisation; and (ii) the fact that deep nets often generalise well even when they are so over-parameterised that they can fit random labels on the same dataset. The authors explain this “black magic” in a simple k-class classification setting, where each class corresponds to a mixture distribution having l components. Each of these components has finite support, and the distance between the closest pair of points from any two distinct classes is bounded from below. The authors explore classification with a ReLU network having a single hidden layer and a carefully-selected random initialisation, but pointed out that they would also be presenting results on networks having two or three hidden layers in the Integration of Deep Learning Theories workshop on the Saturday (see Saturday section below). Their key insight is that, the more we over-parameterise a network, the less likely it is that the activation pattern for one neuron on one data point will change after a fixed number of SGD iterations. Li and Liang’s proofs can therefore work with a “pseudo network”, which is just like the original ReLU network except that it has fixed signs for its activations, making it much simpler to analyse.
  3. Finally, backpropagation has long been thought to be implausible in the brain and this fundamental issue in computational neuroscience is known as the synaptic credit assignment problem. In particular, backpropagation as used in deep learning relies on precisely clocked forward and backward messages, yet the brain has continuous-time neuronal dynamics. Furthermore, backpropagation appears to require “weight transport”, since the weights of one layer explicitly appear as a multiplicative factor in the weight update for the previous layer. In spite of these issues, Sacramento et al. presented both mathematical models and observational data to argue that backpropagation is in fact plausible, at least in dendritic cortical microcircuits. The key to the model is to describe pyramidal neurons in terms of three compartments. Bottom-up and top-down connections converge on dendrites in different compartments corresponding to normal feedforward signals and to error signals.
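To make the notion of capacity from the first paper concrete, here is a small brute-force sketch for a single linear threshold gate with two binary inputs. It is our own toy illustration, not the authors' derivation. The weight grid is an arbitrary choice that happens to be rich enough for this tiny case, and we assume the log is taken in base 2.

```python
import itertools
import math

def threshold_truth_table(weights, bias, inputs):
    """Truth table of a linear threshold gate: output 1 when w.x + b > 0."""
    return tuple(
        int(sum(w * x for w, x in zip(weights, point)) + bias > 0)
        for point in inputs
    )

def count_threshold_functions(n_inputs, weight_values, bias_values):
    """Count the distinct Boolean functions reachable over a finite weight grid."""
    inputs = list(itertools.product((0, 1), repeat=n_inputs))
    tables = set()
    for weights in itertools.product(weight_values, repeat=n_inputs):
        for bias in bias_values:
            tables.add(threshold_truth_table(weights, bias, inputs))
    return len(tables)

# Small grid chosen for illustration; half-integer biases avoid ties at zero.
n_functions = count_threshold_functions(
    n_inputs=2,
    weight_values=(-1, 0, 1),
    bias_values=(-1.5, -0.5, 0.5, 1.5),
)
capacity_bits = math.log2(n_functions)
print(f"{n_functions} of {2 ** (2 ** 2)} Boolean functions, capacity = {capacity_bits:.2f} bits")
```

The gate implements 14 of the 16 Boolean functions of two inputs (XOR and its negation are not linearly separable), so its capacity is log2(14), a little under 4 bits. Brute force stops being feasible very quickly, which is why closed-form results are valuable.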

Figure. Results for a ReLU network with a single hidden layer on synthetic data. (a) The test accuracy hardly changes as the number of hidden nodes varies over a wide range. (b) Consider the fraction of hidden nodes and samples for which the sign of the output after a given number of gradient iterations differs from the sign at initialisation. This fraction varies linearly with the number of gradient iterations, to a good approximation. (c) The distance of the trained weights from the initialisation is a decreasing function of the number of hidden nodes, for a given number of gradient iterations. (d) The accumulated updates to the weights have low rank. (By courtesy of Li and Liang, NeurIPS, 2018.) 
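The quantity plotted in panel (b) can be computed along the following lines. This is a sketch of the measurement only, with hand-picked toy weights rather than a trained network, and the function names are ours.

```python
def activation_signs(weights, samples):
    """For each hidden unit and each sample, is the pre-activation positive?"""
    return [
        [sum(w * x for w, x in zip(unit, sample)) > 0 for sample in samples]
        for unit in weights
    ]

def sign_change_fraction(initial_weights, trained_weights, samples):
    """Fraction of (hidden unit, sample) pairs whose activation sign flipped."""
    before = activation_signs(initial_weights, samples)
    after = activation_signs(trained_weights, samples)
    flipped = sum(
        b != a
        for row_b, row_a in zip(before, after)
        for b, a in zip(row_b, row_a)
    )
    return flipped / (len(initial_weights) * len(samples))

# Toy values for illustration: 2 hidden units, 2 samples.
samples = [[1.0, 0.0], [0.0, 1.0]]
initial = [[0.5, -0.5], [-0.2, 0.3]]
trained = [[0.6, 0.1], [-0.2, 0.3]]  # only the first unit has moved
print(sign_change_fraction(initial, trained, samples))  # prints 0.25
```

A "pseudo network" in the sense described above corresponds to freezing the `before` pattern and analysing the network as if no entry ever flipped.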

Figure. Model of learning in a network of pyramidal cells and lateral inhibitory neurons. When a teaching signal is presented to the output layer, a prediction error is generated in layer 1. This error propagates to the soma, as shown by the purple arrow. This modulates the somatic firing rate, which leads to plasticity at the bottom-up synapses (green). (By courtesy of Sacramento et al., NeurIPS, 2018)
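For reference, the weight-transport issue mentioned above is visible in the textbook backpropagation equations. The notation is ours and not taken from the paper: layer l has weights W_l, pre-activations z_l = W_l a_{l-1} and activations a_l = \phi(z_l), \delta_l is the gradient of the loss with respect to z_l, and \eta is the learning rate.

```latex
\begin{align}
  \delta_l   &= \left( W_{l+1}^{\top} \delta_{l+1} \right) \odot \phi'(z_l), \\
  \Delta W_l &= -\eta \, \delta_l \, a_{l-1}^{\top}.
\end{align}
```

The update of W_l depends on W_{l+1} through \delta_l, so the feedback pathway would need access to an exact copy of the forward weights of the next layer. This is the requirement that looks implausible for biological synapses.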

Advising policy makers, interpretability, parameterizing, RNNs, NMT and inference (M. Galle)

Edward W. Felten, from Princeton and former Deputy U.S. Chief Technology Officer at the Federal Trade Commission under the Obama administration, gave an invited talk on Tuesday morning on "Machine Learning Meets Public Policy: What to Expect and How to Cope". The goal was to give scientists insights into how policy makers operate and some tips on how to interact with them. He insisted several times on the difference between science and politics, a world "where truth is not the ultimate trump card". Starting from the observation that "democracy is not a search for truth, it is an algorithm to solve disagreement", he described democracy with a toy mathematical model. This allowed him to convey some messages in the guise of mathematical results, including inconsistency and NP-hardness results, and to conclude that the system looks irrational locally, but that scientists have to consider the broader picture in which legislators operate. His tips for interacting with policy makers included avoiding two pitfalls: giving them only facts (too much information) and dictating the solution (nobody elected the scientists). He advocated a multi-turn interaction, in which the knowledge of the scientists is combined with the knowledge and preferences of the politicians. Concretely, his tips were to (i) get their knowledge and preferences and (ii) structure the decision space.

After the keynote, Lage et al. presented “Human-in-the-Loop Interpretability Prior”, in which they try to formalize an abstract definition of interpretability. The idea of combining automatic decision-making with human preferences is under-studied in the literature, partly because of costly evaluations and comparisons. In this paper, the authors first find a set of models that have high likelihood p(X | M), and then use humans to select a model with a high interpretability prior p(M). Different proxies for interpretability (number of non-zero features, length of decision path, etc.) lead to different results.
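The two-stage idea can be sketched as follows. This is only an illustration: the candidate models and their numbers are made up, and `prior_score` is a hypothetical stand-in for the human judgements that the paper actually collects.

```python
def select_model(candidates, likelihood_tolerance):
    """Keep the models with near-best log-likelihood, then maximise the prior.

    candidates: list of (name, log_likelihood, prior_score) tuples.
    """
    best = max(log_lik for _, log_lik, _ in candidates)
    shortlist = [c for c in candidates if c[1] >= best - likelihood_tolerance]
    return max(shortlist, key=lambda c: c[2])[0]

# Hypothetical candidates: (name, log p(X | M), stand-in for p(M)).
candidates = [
    ("deep_tree", -100.0, 0.2),
    ("shallow_tree", -101.0, 0.9),
    ("stump", -140.0, 1.0),
]
print(select_model(candidates, likelihood_tolerance=5.0))  # prints shallow_tree
```

The stump is the most interpretable candidate but fits the data too poorly to make the shortlist, so the shallow tree wins. Replacing `prior_score` with a different proxy can change the choice, in line with the observation above that different proxies lead to different results.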

"Sparse Attentive Backtracking: Long-Range Credit Assignment in Recurrent Networks" presented by Ke et al., propose an alternative training method to learn different dialects of RNNs.  Existing methods have problems with extremely long-term dependencies on the past. One of the reasons is that, in order to model the relationship to state n-k, the standard training procedure - backpropagation through time - has to go through all k intermediate past states. The alternative of Ke et al., includes an attention mechanism and only updating the top-k past states.   

An attention mechanism was also central to "Layer-Wise Coordination between Encoder and Decoder for Neural Machine Translation" by He et al., which modifies the Transformer model so that each layer of the decoder is coordinated with the corresponding layer of the encoder. The intuition is that similar levels of semantic understanding in the decoder should attend to the same level in the encoder. The results on machine translation are impressive, showing an improvement of 2-3 BLEU points on standard benchmarks, with respect to baseline models.

The main idea behind "e-SNLI: Natural Language Inference with Natural Language Explanations" is to augment a natural language inference dataset (given a pair of sentences, determine whether their relationship is entailment, contradiction or neutral) with explanations. The authors set up a crowdsourcing task asking annotators to provide a natural-language explanation for the relationship governing each pair. A proposed pipeline that first generates an explanation and then uses that explanation to predict a label obtains only slightly lower overall accuracy than direct prediction from pair to label. The advantage of the proposed approach is that it provides the explanation together with the label, a desirable feature in many tasks where humans have to be convinced of the prediction.

Highlights of what we saw at this year’s conference  -  Part 1/4 Expo Day

Highlights of what we saw at this year’s conference  -  Part 3/4 Robotics and Optimization

Highlights of what we saw at this year’s conference  -  Part 4/4 Machine Learning for Creativity and Design