The Thirty-second Annual Conference on Neural Information Processing Systems
Highlights of what we saw at this year’s conference - Part 2/4
Sofia attended the Monday morning tutorial on ‘Visualization for Machine Learning’ by F. Viégas and M. Wattenberg from Google Brain. Since the human visual system is remarkably good at a few special tasks, the goal of the tutorial was to understand how data can be transformed into visual encodings that people naturally decode. The key question was thus: 'How can we guide people's attention so that their brain naturally interprets the visualization?'
Some of the take home messages she shared were:
- If the visualization happens on a computer, it’s crucial to make it interactive, following three principles: give an overview first, then filter and zoom, and only then give details on demand.
- Colours play a key role in conveying a message. A useful tool to create colour palettes suited to your purpose is ‘ColorBrewer’.
- For tables, the "remove to improve" principle can do magic. You might want to look at this not-so-recent but useful video: https://www.darkhorseanalytics.com/blog/clear-off-the-table
More specifically for machine learning:
- On the importance of visualizing training data, there’s Google's Facets tool.
- Visualizations can help in understanding the behaviour of neural networks, with examples for images including https://arxiv.org/abs/1311.2901, http://yosinski.com/deepvis, and http://brainmodels.csail.mit.edu/dnn/drawCNN/ and, for sequence-to-sequence models, http://seq2seq-vis.io/
- Saliency maps (and their many variants) are interesting but often difficult to interpret
- Visualizing high-dimensional data: beyond linear PCA, the t-SNE method is now very popular. One should however be careful with the perplexity hyper-parameter, which can completely transform the results. In particular, it governs both the size of the clusters and the distances between them, which are therefore meaningless for interpreting the data. A faster alternative to t-SNE, supposed to better capture global structure, is UMAP; see ‘UMAP: Uniform Manifold Approximation and Projection for Dimension Reduction’ or https://www.math.upenn.edu/~jhansen/2018/05/04/UMAP/
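To see what the perplexity hyper-parameter actually controls, here is a minimal numpy sketch of the bandwidth calibration t-SNE performs internally: for each point, it binary-searches a Gaussian bandwidth so that the affinity distribution over neighbours has exactly the requested perplexity (roughly, the effective number of neighbours). The function names and bisection bounds are my own, not from the tutorial.

```python
import numpy as np

def row_affinities(sq_dists, sigma):
    """Gaussian affinities p_{j|i} for one point (self excluded)."""
    p = np.exp(-sq_dists / (2.0 * sigma ** 2))
    return p / p.sum()

def perplexity_of(p):
    """Perplexity 2^H of a discrete distribution."""
    h = -np.sum(p * np.log2(p + 1e-12))
    return 2.0 ** h

def calibrate_sigma(sq_dists, target_perplexity, tol=1e-4):
    """Binary-search the Gaussian bandwidth so the affinity
    distribution has the requested perplexity, as t-SNE does
    internally for every point."""
    lo, hi = 0.01, 100.0
    for _ in range(100):
        sigma = 0.5 * (lo + hi)
        perp = perplexity_of(row_affinities(sq_dists, sigma))
        if abs(perp - target_perplexity) < tol:
            break
        if perp > target_perplexity:
            hi = sigma   # too many effective neighbours: narrow the kernel
        else:
            lo = sigma
    return sigma

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 5))
sq = np.sum((X - X[0]) ** 2, axis=1)
sq = np.delete(sq, 0)                 # exclude the point itself
for target in (5, 30):
    sigma = calibrate_sigma(sq, target)
    print(target, round(sigma, 3))
```

A small perplexity makes each point attend to only a handful of neighbours (tight, possibly spurious clusters); a large one flattens the affinities, which is why the same data can look completely different across perplexity settings.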
Viégas and Wattenberg insisted on how counterintuitive high-dimensional data can be. A rule of thumb: if you spot something interesting, make sure the same phenomenon does not occur with a random baseline.
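The random-baseline rule of thumb can be sketched concretely: compute the same "interesting" statistic on your data and on a structure-free baseline, and only trust the finding if the two clearly differ. The separation score below is a made-up illustration (mean between-group distance over mean within-group distance), not a statistic from the tutorial.

```python
import numpy as np

rng = np.random.default_rng(42)

def mean_pairwise_dist(A, B):
    diffs = A[:, None, :] - B[None, :, :]
    return np.sqrt((diffs ** 2).sum(-1)).mean()

def separation(A, B):
    """Ratio of between-group to within-group mean distance."""
    within = 0.5 * (mean_pairwise_dist(A, A) + mean_pairwise_dist(B, B))
    return mean_pairwise_dist(A, B) / within

# "Interesting" data: two genuinely separated blobs
blob_a = rng.normal(loc=0.0, size=(100, 10))
blob_b = rng.normal(loc=4.0, size=(100, 10))

# Random baseline with the same marginal scale: no real structure
base_a = rng.normal(size=(100, 10))
base_b = rng.normal(size=(100, 10))

print(round(separation(blob_a, blob_b), 2))   # clearly above 1
print(round(separation(base_a, base_b), 2))   # close to 1
```

If the score on the baseline is comparable to the score on the real data, the "cluster" you spotted is likely an artifact of the projection, not of the data.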
The first session of Track 1 on Tuesday was a positive surprise as it had three papers that I found interesting and important.
Edward W. Felten, from Princeton, former Deputy U.S. Chief Technology Officer under the Obama administration and former Chief Technologist at the Federal Trade Commission, gave an invited talk on Tuesday morning on "Machine Learning Meets Public Policy: What to Expect and How to Cope". The goal was to give scientists insight into how policy makers operate, and some tips on how to interact with them. He insisted several times on the difference between science and politics, a world "where truth is not the ultimate trump card". Starting from the observation that "democracy is not a search for truth, it is an algorithm to solve disagreement", he modeled democracy as a toy mathematical system. This allowed him to convey some messages under the guise of mathematical results, including inconsistency and NP-hardness results, concluding that while the system looks locally irrational, scientists have to consider the broader picture in which legislators operate. His tips for interacting with legislators included avoiding both providing them only facts (too much information) and dictating the solution (nobody elected the scientists). He advocated instead a multi-turn interaction in which the knowledge of the scientists is combined with the knowledge and preferences of the politicians: concretely, (i) elicit their knowledge and preferences and (ii) structure the decision space.
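Felten's exact toy model was not spelled out in the talk notes, but the classic Condorcet cycle is a standard example of the kind of inconsistency result he alluded to: with three perfectly rational voters, pairwise majority vote can produce a collective preference that is cyclic, hence "locally irrational". The ballots below are the textbook construction, not Felten's.

```python
from itertools import combinations

# Each voter ranks options best-to-worst; every individual ranking
# is transitive, yet the majority outcome will not be.
ballots = [
    ["A", "B", "C"],
    ["B", "C", "A"],
    ["C", "A", "B"],
]

def prefers(ballot, x, y):
    return ballot.index(x) < ballot.index(y)

def majority_winner(x, y):
    votes_x = sum(prefers(b, x, y) for b in ballots)
    return x if votes_x > len(ballots) / 2 else y

for x, y in combinations("ABC", 2):
    print(f"{x} vs {y}: majority prefers {majority_winner(x, y)}")
```

Here A beats B and B beats C, yet C beats A: there is no consistent collective ranking, even though every voter is individually consistent.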
After the keynote, Lage et al. presented “Human-in-the-Loop Interpretability Prior”, in which they tried to formalize an abstract definition of interpretability. The idea of combining automatic decision-making with human preferences is under-studied in the literature, partly because of costly evaluations and comparisons. In this paper, the authors first find a set of models with high likelihood p(X | M), and then use humans to select a model with high interpretability prior p(M). Different proxies for interpretability (number of non-zero features, length of the decision path, etc.) lead to different results.
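The selection step can be sketched in a few lines: score each candidate model by log p(X | M) + log p(M) and pick the maximum. In the paper the prior comes from human feedback; the sketch below substitutes a proxy prior (fewer non-zero features, higher prior), and the candidate models and numbers are invented for illustration.

```python
# Hypothetical candidate models: (name, log-likelihood, non-zero features)
candidates = [
    ("dense",  -10.0, 40),
    ("medium", -11.0, 12),
    ("sparse", -14.0, 3),
]

def log_prior(n_features, scale=5.0):
    # Proxy interpretability prior p(M): fewer features -> more probable.
    # The paper replaces this proxy with actual human judgements.
    return -n_features / scale

def log_posterior(log_lik, n_features):
    # log p(M | X) is proportional to log p(X | M) + log p(M)
    return log_lik + log_prior(n_features)

best = max(candidates, key=lambda m: log_posterior(m[1], m[2]))
print(best[0])
```

With these (made-up) numbers the trade-off picks the middle model: the dense one fits best but is heavily penalized by the prior, while the sparse one is interpretable but fits too poorly.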
"Sparse Attentive Backtracking: Long-Range Credit Assignment in Recurrent Networks" presented by Ke et al., propose an alternative training method to learn different dialects of RNNs. Existing methods have problems with extremely long-term dependencies on the past. One of the reasons is that, in order to model the relationship to state n-k, the standard training procedure - backpropagation through time - has to go through all k intermediate past states. The alternative of Ke et al., includes an attention mechanism and only updating the top-k past states.
An attention mechanism was also central to "Layer-Wise Coordination between Encoder and Decoder for Neural Machine Translation" by He et al., which modifies the Transformer model so that each layer of the decoder is coordinated with the corresponding layer of the encoder. The intuition is that similar levels of semantic understanding in the decoder should attend to the same level in the encoder. The results on machine translation are impressive, showing an improvement of 2-3 BLEU points on standard benchmarks, with respect to baseline models.
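The wiring change is easy to show schematically: a vanilla Transformer decoder attends at every layer to the encoder's final output, while the coordinated variant has decoder layer i attend to encoder layer i. The sketch below uses random projections as stand-ins for the real self-attention and feed-forward blocks; only the cross-attention wiring reflects the paper's idea.

```python
import numpy as np

rng = np.random.default_rng(1)
n_layers, src_len, tgt_len, d = 3, 6, 4, 8

def attention(q, k, v):
    """Plain scaled dot-product attention."""
    scores = q @ k.T / np.sqrt(d)
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)
    return w @ v

# Random projections stand in for full encoder/decoder blocks.
enc_layers = [rng.normal(size=(d, d)) for _ in range(n_layers)]
dec_layers = [rng.normal(size=(d, d)) for _ in range(n_layers)]

# Encoder: record the output of EVERY layer, not just the last one.
x = rng.normal(size=(src_len, d))
enc_outputs = []
for W in enc_layers:
    x = np.tanh(x @ W)
    enc_outputs.append(x)

# Vanilla Transformer would use enc_outputs[-1] everywhere;
# layer-wise coordination pairs decoder layer i with encoder layer i.
y = rng.normal(size=(tgt_len, d))
for i, W in enumerate(dec_layers):
    y = np.tanh(y @ W) + attention(y, enc_outputs[i], enc_outputs[i])

print(y.shape)
```

The intuition from the talk maps directly onto this wiring: low decoder layers see low-level encoder representations, high layers see high-level ones.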
The main idea behind "e-SNLI: Natural Language Inference with Natural Language Explanations" is to augment a natural language inference dataset (given a pair of sentences, determine whether their relationship is entailment, contradiction, or neutral) with explanations. The authors set up a crowdsourcing task asking annotators to provide a natural-language explanation for the relationship governing each pair. A proposed pipeline that first generates an explanation and then uses that explanation to predict a label obtains only slightly lower overall accuracy than direct pair -> label prediction. The advantage of the proposed approach is that it provides the explanation together with the label, a desirable feature in many tasks where humans have to be convinced by the prediction.
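The explain-then-predict data flow can be made concrete with a toy sketch: a first model maps the sentence pair to an explanation, and a second model predicts the label from the explanation alone. Both "models" below are trivial rule-based stand-ins (the real pipeline uses trained neural networks); they exist purely to show the two-stage structure.

```python
def generate_explanation(premise, hypothesis):
    """Stand-in for a learned explanation generator."""
    shared = set(premise.lower().split()) & set(hypothesis.lower().split())
    if "not" in hypothesis.lower().split():
        return "The hypothesis negates the premise."
    if len(shared) >= 2:
        return "The hypothesis restates information from the premise."
    return "The hypothesis is unrelated to the premise."

def predict_from_explanation(explanation):
    """Stand-in for a classifier that sees ONLY the explanation."""
    if "negates" in explanation:
        return "contradiction"
    if "restates" in explanation:
        return "entailment"
    return "neutral"

premise = "A man is playing a guitar on stage"
hypothesis = "A man is playing an instrument"
explanation = generate_explanation(premise, hypothesis)
label = predict_from_explanation(explanation)
print(explanation, "->", label)
```

Because the label is produced from the explanation, the pipeline cannot output a label without also producing a human-readable justification for it.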