NIPS 2017 content, Random Kitchen Sink edition

 

PS : if you’re wondering about that random kitchen sink title, it’s a nod to Rahimi & Recht’s random kitchen sinks, whose random-features line of work picked up the NIPS 2017 test-of-time award.

My DL papers of the year

First, as always, the standard disclaimer applies: with more than 1660 papers this year on GANs alone, I can’t possibly claim to be comprehensive, and I’ve surely missed a ton of stuff (feel free to DM me on Twitter @kloudstrife to let me know of any criminal omissions). I’ve tried to narrow it down to about a paper every two weeks. A lot of this we covered at the weekly Imperial Deep Learning Reading Group. Anyway, here we go.

Architectures / models :

This year has been a lot less about new convnet architectures, and things have more or less stabilised. Some papers have definitely been pushing the envelope, though. First among these is Andrew Brock’s cracking SMASH which, in spite of its ICLR reviews, has given neural architecture search on 1000 GPUs a run for its money.

SMASH : one shot model architecture search through Hypernetworks

DenseNets (updated 2017 revision) are a memory-hungry but very neat idea. The TLDR is ‘in computer vision, eyes + fur = cat, so connect all the things (and the layers)’.

Densely connected convolutional networks

A crucially underrated idea in CNNs is that of the scattering transform, which effectively takes moduli of a wavelet filterbank (bridging wavelet theory with conv + maxpool and ReLU). Somewhat surprisingly, this sheds light on why the first few layers of a convnet look like Gabor filters, and why you might not need to train them at all. In the words of Stephane Mallat, ‘I’m surprised it works !’. See the paper below, and the toy sketch after it.

Scaling the Scattering Transform
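
To make the conv + modulus + pooling analogy concrete, here is a minimal numpy sketch of first-order 1D scattering coefficients. The 'Morlet-like' filter and every parameter below are simplifications of mine for illustration, not the exact construction used in the paper.

import numpy as np

def morlet_like(n, freq, width):
    # crude complex bandpass filter: Gaussian envelope times a complex exponential
    t = np.arange(n) - n // 2
    return np.exp(-(t / width) ** 2) * np.exp(2j * np.pi * freq * t)

def scattering_order1(x, freqs=(0.05, 0.1, 0.2), width=16, pool=8):
    # first-order coefficients: |x * psi_lambda| followed by local averaging
    coeffs = []
    for f in freqs:
        psi = morlet_like(len(x), f, width)
        u = np.abs(np.convolve(x, psi, mode='same'))   # convolution + modulus (the 'ReLU')
        s = u.reshape(-1, pool).mean(axis=1)           # local averaging (the 'pooling')
        coeffs.append(s)
    return np.stack(coeffs)

x = np.random.randn(256)
print(scattering_order1(x).shape)   # (3, 32)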

Tensorized LSTMs are the new SotA on Wikipedia at 1.2 bits per character – some people believe the coding limit for English to be around 1.0-1.1 BPC (for reference, LayerNorm LSTMs sit at circa 1.3 BPC). I prefer this paper to ‘Recurrent Highway Hypernetworks’ for originality reasons.

Tensorized LSTMs for sequence learning

Finally, without much further comment needed :

Dynamic Routing Between Capsules / Matrix capsules with EM routing

Generative models :

I’ve deliberately left out most of the GAN papers given NVidia’s thunderstorm with the Progressive Growing paper.

First, the autoregressive family – Aaron Van den Oord’s latest masterpiece, VQ-VAE, is one of those papers that look obvious in hindsight, but coming up with that double stop-gradient’ed loss function (sketched below) is no small feat. I’m sure a ton of iterations – including sandwiching with ELBO’ed Bayesian layers à la PixelVAE – will come out of that work.

Neural Discrete Representation Learning
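
A TensorFlow-style sketch of that loss, under my reading of the paper (tensor names are mine, beta is the commitment weight): the straight-through trick copies decoder gradients past the quantization step, while the two stop-gradient terms pull codebook and encoder towards each other.

import tensorflow as tf

def vq_vae_loss(z_e, z_q, reconstruction_loss, beta=0.25):
    # z_e: encoder output, z_q: nearest codebook entries (same shape as z_e)
    # straight-through estimator: the decoder sees z_q, gradients bypass the argmin
    decoder_input = z_e + tf.stop_gradient(z_q - z_e)
    # move codebook vectors towards the (frozen) encoder outputs
    codebook_loss = tf.reduce_mean(tf.square(tf.stop_gradient(z_e) - z_q))
    # commit the encoder to its (frozen) codebook assignment
    commitment_loss = beta * tf.reduce_mean(tf.square(z_e - tf.stop_gradient(z_q)))
    return decoder_input, reconstruction_loss + codebook_loss + commitment_loss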

Another surprise came in the form of Parallel WaveNet. When everyone was expecting a fast dyadic scheme in line with Tom LePaine’s work, the DeepMind guys gave us teacher-student distillation, and noise shaping by interpreting the high-dimensional isotropic Gaussian / Logistic latent space as a time process that can be made autoregressive via Inverse Autoregressive Flows. Very, very neat.

Parallel Wavenet

The number one paper that nobody expected – NVidia lays down the law. Elegantly enough, GAN theory comes full circle: instead of simply moar Wassersteinizing (in the immortal words of Justin Solomon), keep whatever loss you like, KL or least squares included, but get rid of the disjoint support problem by building a multiresolution approximation of the data distribution (a toy sketch of the fade-in step follows the links below). This still requires quite a few tricks to stabilize the gradients, but the empirical results speak for themselves. Also listed, as it has been commented on by many: the seminal Wasserstein GAN paper, the one that started a movement, no, a revolution !

Progressive growing of GANs / Wasserstein GAN
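
The fade-in at the heart of the progressive scheme is simple enough to sketch in numpy: when a new, higher-resolution block is attached, its RGB output is alpha-blended with a naively upsampled version of the previous resolution, with alpha ramping from 0 to 1 over training. Shapes and names below are mine, for illustration only.

import numpy as np

def upsample_nearest(x, factor=2):
    # x: (batch, height, width, channels) -> nearest-neighbour upsampling
    return x.repeat(factor, axis=1).repeat(factor, axis=2)

def fade_in(low_res_rgb, new_block_rgb, alpha):
    # blend the old resolution (upsampled) with the freshly grown block's output
    return (1.0 - alpha) * upsample_nearest(low_res_rgb) + alpha * new_block_rgb

low = np.random.rand(4, 16, 16, 3)    # generator output before growing
new = np.random.rand(4, 32, 32, 3)    # output of the newly attached 32x32 block
print(fade_in(low, new, alpha=0.3).shape)   # (4, 32, 32, 3)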

While the French school led by Peyre and Genevay did define Minimum Kantorovich Estimators earlier this year, it is Bousquet’s team at Google that wrote down the definitive framework for VAE-GANs in a push-forward / optimal transport setup. The W-AAE paper will probably be one of the top hits at ICLR2018.

The VeGAN cookbook / Wasserstein Autoencoders

On the variational inference front, who better than Dustin Tran to borrow ideas from off-policy reinforcement learning & from GANs and once again push the envelope of what modern VI can do :

Hierarchical Implicit Models

Reinforcement Learning :

A year dominated by soft / max-entropy Q-learning. We’ve been doing it wrong ! All these years !

Schulman proves the equivalence between the two main families of RL algorithms. A landmark paper. ‘Nuff said :

Equivalence between Policy Gradients and Soft Q-learning
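
The object everything here revolves around is the entropy-regularized (‘soft’) Bellman backup: the hard max over actions becomes a log-sum-exp at temperature tau, and the induced policy is a Boltzmann distribution over Q-values. A toy tabular numpy sketch (the random MDP is a placeholder):

import numpy as np

def soft_value(Q, tau):
    # V(s) = tau * log sum_a exp(Q(s,a)/tau), computed stably; -> max_a Q as tau -> 0
    m = Q.max(axis=1)
    return m + tau * np.log(np.sum(np.exp((Q - m[:, None]) / tau), axis=1))

def soft_backup(Q, P, R, gamma=0.9, tau=0.1):
    # one soft Q-iteration sweep; Q: (S, A), P: (S, A, S), R: (S, A)
    return R + gamma * P.dot(soft_value(Q, tau))

def soft_policy(Q, tau=0.1):
    # Boltzmann policy: pi(a|s) proportional to exp(Q(s,a)/tau)
    logits = Q / tau
    logits -= logits.max(axis=1, keepdims=True)
    p = np.exp(logits)
    return p / p.sum(axis=1, keepdims=True)

S, A = 5, 3
P = np.random.dirichlet(np.ones(S), size=(S, A))   # random transition kernel
R = np.random.rand(S, A)
Q = np.zeros((S, A))
for _ in range(200):
    Q = soft_backup(Q, P, R)
print(soft_policy(Q))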

Would he have done it without the paper below, which, by taking old math and redoing the partition function calculations very carefully, proves the path-by-path equivalence ? Nobody knows but Ofir :

Bridging the gap between value and policy RL

Another criminally underrated paper – Gergely quietly one-ups everyone by also working out the analogy between the above RL algorithms and convex optimization methods. A strong contender for RL paper of the year IMHO, and yet few people have heard of it.

A unified view of entropy-regularized MDPs

If David Silver’s Predictron somehow fell off the radar after being rejected (!) at ICLR 2017, Theo’s paper is like a dual view of it, with beautiful and intuitive Sokoban experimental results to boot :

Imagination-Augmented Agents

Marc Bellemare gets another transformational paper out – do away with all the DQN stabilization plugins, and simply learn the full return distribution (and beat SotA in the process). Beautiful. Many extensions are possible, including the link with Wasserstein distances.

A distributional perspective on RL / Distributional RL with Quantile Regression
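
For the quantile flavour, the key ingredient is the pinball (quantile regression) loss, which makes each output head converge to a fixed quantile of the return distribution; the paper actually uses a Huber-smoothed version, omitted here for brevity. A minimal numpy sketch:

import numpy as np

def quantile_loss(pred_quantiles, target_samples, taus):
    # pinball loss; pred_quantiles: (N,), taus: (N,), target_samples: (M,)
    u = target_samples[None, :] - pred_quantiles[:, None]   # u[i, j] = target_j - quantile_i
    # asymmetric weights: tau when we under-estimate, (1 - tau) when we over-estimate
    return np.mean(np.where(u > 0, taus[:, None] * u, (taus[:, None] - 1.0) * u))

N = 4
taus = (np.arange(N) + 0.5) / N          # quantile midpoints: 1/8, 3/8, 5/8, 7/8
preds = np.zeros(N)                      # current quantile estimates for one (s, a)
targets = np.random.randn(1000)          # sampled Bellman targets, stand-ins
print(quantile_loss(preds, targets, taus))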

A simple, yet extremely effective, double-whammy of an idea.

Noisy Networks for Exploration
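
The double whammy: drop epsilon-greedy, put parametric noise on the weights themselves (w = mu + sigma * eps), and let the agent learn sigma. A numpy sketch of a noisy linear layer’s forward pass with the factorised Gaussian noise from the paper; the initial sigma scales below are placeholders.

import numpy as np

def f(x):
    # factorised-noise transform from the paper: sign(x) * sqrt(|x|)
    return np.sign(x) * np.sqrt(np.abs(x))

def noisy_linear(x, w_mu, w_sigma, b_mu, b_sigma):
    # forward pass with fresh factorised Gaussian noise on weights and biases
    p, q = w_mu.shape                                   # (in_features, out_features)
    eps_in, eps_out = f(np.random.randn(p)), f(np.random.randn(q))
    w = w_mu + w_sigma * np.outer(eps_in, eps_out)
    b = b_mu + b_sigma * eps_out
    return x @ w + b

x = np.random.randn(32, 128)
w_mu, w_sigma = np.random.randn(128, 64) * 0.01, np.full((128, 64), 0.02)
b_mu, b_sigma = np.zeros(64), np.full(64, 0.02)
print(noisy_linear(x, w_mu, w_sigma, b_mu, b_sigma).shape)   # (32, 64)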

Of course, the list wouldn’t be complete without AlphaGo Zero. The idea of aligning the policy network pre- and post-MCTS, i.e. treating MCTS as a policy improvement algorithm (and a means to smooth out NN approximation error rather than propagate it), is the stuff of legend.

Mastering the game of Go without human knowledge
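
Concretely, the improved policy target is just the MCTS visit-count distribution raised to 1/temperature, and the network is trained to match it while regressing the game outcome. A hedged numpy sketch of the per-position loss (names are mine; the L2 weight penalty is left to the optimizer):

import numpy as np

def mcts_policy_target(visit_counts, temperature=1.0):
    # pi(a) proportional to N(s, a)^(1/T): the search acts as policy improvement
    scaled = visit_counts ** (1.0 / temperature)
    return scaled / scaled.sum()

def alphazero_loss(policy_logits, value_pred, pi_target, z_outcome):
    # (z - v)^2 - pi^T log p, per position
    log_p = policy_logits - np.log(np.sum(np.exp(policy_logits)))   # log-softmax
    return (z_outcome - value_pred) ** 2 - np.dot(pi_target, log_p)

counts = np.array([120.0, 30.0, 8.0, 2.0])   # visit counts from search at one state
pi = mcts_policy_target(counts)
print(alphazero_loss(np.zeros(4), 0.1, pi, z_outcome=1.0))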

SGD & Optimization :

2017 has definitely been a year rife with theoretical insights as to why SGD works as well as it does (and is so hard to beat from a generalization error perspective !) in a non-convex setting.

The ‘most technical’ paper of the year award goes to Chaudhari. It pretty much connects everything, from SGD and gradient flows to PDEs. A masterpiece that follows and completes ‘Entropy-SGD‘ :

Deep Relaxation : PDEs for optimizing deep networks

The Bayesian view on this is the SGD-VI connection from Mandt & Hoffman. As you know, I’ve been a frequentist for years, sic.

SGD as approximate Bayesian inference

The previous paper hinged on the continuous relaxation of SGD as a stochastic differential equation (gradient noise being treated as Gaussian because of the CLT). The paper below spells out the implications for batch size and gives a very nice chi-square formula in the process; a toy illustration of the noise-scale-versus-batch-size effect follows the link.

Batch size matters, a diffusion approximation framework
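
The point about batch size: under the CLT approximation, the covariance of the minibatch gradient noise scales like 1/B, so at a fixed learning rate smaller batches mean a hotter diffusion around the minimum. A toy numpy illustration on a 1-D quadratic (all constants are made up):

import numpy as np

def sgd_on_quadratic(batch_size, lr=0.1, steps=20000, n_data=10000, seed=0):
    # minimise mean_i 0.5 * (theta - x_i)^2 with minibatch SGD, return stationary variance
    rng = np.random.RandomState(seed)
    x = rng.randn(n_data)                 # per-example 'targets'
    theta, trace = 0.0, []
    for t in range(steps):
        batch = rng.choice(x, size=batch_size)
        grad = theta - batch.mean()       # minibatch gradient of the quadratic
        theta -= lr * grad
        if t > steps // 2:                # record only after burn-in
            trace.append(theta)
    return np.var(trace)

for B in (1, 8, 64):
    print(B, sgd_on_quadratic(B))         # stationary variance shrinks roughly like 1/B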

Another paper with similar Ornstein-Uhlenbeck inspired results, from the lab of Yoshua Bengio :

Three factors influencing minima in SGD

Finally, another, very recent contribution to the SGD-SDE-VI trinity, courtesy of Chaudhari once again :

SGD performs VI, converges to limit cycles

Theory :

I’m a firm believer that many insights as to why exactly deep learning works will come from the intersection of harmonic / L2 analysis (as seen earlier with the scattering ideas) and information theory with entropy-based measures. Naftali Tishby’s ideas, while still controversial and under fire from a recent ICLR2018 submission, certainly bring us a step closer to that understanding.

Opening the black box of deep networks via information / On the information bottleneck theory of deep learning

Similarly, a beautiful paper from ICLR2017 takes a variational approach to the information bottleneck theory. My pick with a tiny edge over Beta-VAE in the ‘disentanglement’ category.

Deep variational information bottleneck
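
In loss terms, the variational bound trades a distortion term (how well z predicts the label) against a beta-weighted rate term (the KL between the encoder and a prior on z). A numpy-level sketch for a diagonal Gaussian encoder and a standard normal prior; the inputs are stand-ins for real network outputs.

import numpy as np

def vib_loss(mu, log_var, log_likelihood, beta=1e-3):
    # deep VIB objective: distortion + beta * rate, averaged over the batch
    # rate: KL( N(mu, diag(exp(log_var))) || N(0, I) ), in closed form per example
    kl = 0.5 * np.sum(np.exp(log_var) + mu ** 2 - 1.0 - log_var, axis=1)
    distortion = -log_likelihood          # e.g. cross-entropy of the classifier p(y|z)
    return np.mean(distortion + beta * kl)

mu = np.random.randn(32, 16) * 0.1        # encoder means, batch of 32, 16-d bottleneck
log_var = np.full((32, 16), -2.0)         # encoder log-variances
log_lik = -np.random.rand(32)             # per-example log p(y|z), stand-in values
print(vib_loss(mu, log_var, log_lik))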

There have been umpteen billion generative models with twelve zillion ways of factorizing the log-likelihood this year, and they can mostly be understood in the light of convex duality. A necessary paper IMHO.

A Lagrangian perspective on latent variable modelling

Finally, in a stunning display of technical mastery, and a sign that the mathematical arms race in deep learning is alive and well, Jeff Pennington combines complex analysis, random matrix theory, free probability, and graph morphisms (!) to derive an exact law for the eigenvalues of the Hessian of neural network loss functions, whereas the shape of that spectrum was previously known only empirically, for instance from the papers of Sagun et al. A must-read.

Geometry of NN loss surfaces via RMT / Nonlinear RMT for deep learning

That’s all from me, happy holidays to all !

Bonus list : The most important articles on GANs

2016 was a vintage year for GAN literature, with the technique gaining traction and many theoretical breakthroughs on top of much increased sample quality (see the recent PPGN paper for impressive results on ImageNet).

Generative Adversarial Networks, I. Goodfellow, 2014 ( + NIPS Tutorial on GANs, I. Goodfellow, 2016)

Deep Convolutional GANs, A. Radford, 2015

Auxiliary Classifier GANs, A. Odena, 2016

Improved Techniques for training GANs, T. Salimans, 2016

InfoGANs : Interpretable Representation learning by Information Maximizing Generative Adversarial Networks, X. Chen, 2016

Energy-Based GANs, J. Zhao, 2016

Adversarially Learned Inference, V. Dumoulin, 2016

Stacked GANs, X. Huang, 2016 ( + StackGAN, H. Zhang, 2016)

Plug and Play Generative Networks : Conditional iterative generation of images in latent space, A. Nguyen, 2016

Wasserstein GANs, M. Arjovsky, 2017 (+ Towards Principled Methods for training GANs, M. Arjovsky, 2016 / Improved Training of Wasserstein GANs, I. Gulrajani,  2017)

TFProf, a model analyzer for TensorFlow

I was surprised to find out about TFProf, a useful debugging and analysis tool for TensorFlow models that should help you build your model and find a reasonable training batch size for it. It is very useful, if only to get information about your network in a way complementary to TensorBoard. There are three main operations :

1. Fetch a compact summary of the shapes and sizes of all trainable variables, using the .TRAINABLE_VARS_PARAMS_OPTIONS property of a tfprof.model_analyzer object. I will be illustrating on a simple MNIST convnet with conv1 of shape 5x5x32, conv2 of shape 5x5x64, FC1 of 512 units and FC2 of 10 (the standard convolutional.py from the TensorFlow MNIST examples). Results look like the following.

==================Model Analysis Report======================
_TFProfRoot (–/1.66m params)
Variable (5x5x1x32, 800/800 params)
Variable_1 (32, 32/32 params)
Variable_2 (5x5x32x64, 51.20k/51.20k params)
Variable_3 (64, 64/64 params)
Variable_4 (3136×512, 1.61m/1.61m params)
Variable_5 (512, 512/512 params)
Variable_6 (512×10, 5.12k/5.12k params)
Variable_7 (10, 10/10 params)
Variable_8 (0/0 params)
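
For reference, a sketch of the call that produces this first report, against the TF 1.x tf.contrib.tfprof interface; the exact names of the option constants have shifted between TensorFlow releases, so treat the one below (taken from the text above) as indicative and check your version’s docs.

import tensorflow as tf
from tensorflow.contrib import tfprof

# Report 1: shapes and parameter counts of all trainable variables.
tfprof.model_analyzer.print_model_analysis(
    tf.get_default_graph(),
    tfprof_options=tfprof.model_analyzer.TRAINABLE_VARS_PARAMS_OPTIONS)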

2. Get the number of FLOPS (floating point operations) per minibatch required for the training pass of your model, sorted descending. This is achieved through .FLOAT_OPS_OPTIONS. A very interesting statistic to see what you can optimize, and where. Fixing the batch size at 128, see the total FLOPS usage on the top line with, unsurprisingly, convolutions taking up the lion’s share :

==================Model Analysis Report======================
_TFProfRoot (0/5.54b flops)
Conv2D_1 (2.57b/2.57b flops)
Conv2D_3 (1.28b/1.28b flops)
MatMul (411.04m/411.04m flops)
gradients/MatMul_grad/MatMul (411.04m/411.04m flops)
gradients/MatMul_grad/MatMul_1 (411.04m/411.04m flops)
MatMul_2 (205.52m/205.52m flops)
Conv2D (160.56m/160.56m flops)
Conv2D_2 (80.28m/80.28m flops)
BiasAdd (3.21m/3.21m flops)
BiasAdd_1 (1.61m/1.61m flops)
BiasAdd_2 (1.61m/1.61m flops)
MatMul_1 (1.31m/1.31m flops)
gradients/MatMul_1_grad/MatMul (1.31m/1.31m flops)
gradients/MatMul_1_grad/MatMul_1 (1.31m/1.31m flops)
BiasAdd_3 (802.82k/802.82k flops)
MatMul_3 (655.36k/655.36k flops)
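
The corresponding call for the FLOPS report, with the same caveat about exact constant names across TF versions:

import tensorflow as tf
from tensorflow.contrib import tfprof

# Report 2: floating point operations per op, sorted descending.
tfprof.model_analyzer.print_model_analysis(
    tf.get_default_graph(),
    tfprof_options=tfprof.model_analyzer.FLOAT_OPS_OPTIONS)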

3. Finally, you can get a summary of memory usage and timings per minibatch, with per-op granularity. This is incredibly useful if your model already has enough regularization built in and you want to maximize the batch size that fits in GPU memory. You get it by summoning .PRINT_ALL_TIMING_MEMORY and amending your session.run() call to force the inclusion of run metadata, which results in the report below. Note that the top-line memory usage should stay below your GPU’s, for obvious reasons. The 4.67ms figure doesn’t scale with batch size and, as such, seems to be the training pass time per sample.

==================Model Analysis Report======================
_TFProfRoot (0B/145.88MB, 0us/4.67ms)
Adam (4B/6.65MB, 31us/729us)
Adam/Assign (4B/4B, 46us/46us)
Adam/Assign_1 (4B/4B, 29us/29us)
Adam/beta1 (4B/4B, 3us/3us)
Adam/beta2 (4B/4B, 3us/3us)
Adam/epsilon (4B/4B, 3us/3us)
Adam/mul (4B/4B, 32us/32us)
Adam/mul_1 (4B/4B, 31us/31us)
Adam/update_Variable/ApplyAdam (3.20KB/3.20KB, 64us/64us)
Adam/update_Variable_1/ApplyAdam (128B/128B, 73us/73us)
Adam/update_Variable_2/ApplyAdam (204.80KB/204.80KB, 68us/68us)
Adam/update_Variable_3/ApplyAdam (256B/256B, 72us/72us)
Adam/update_Variable_4/ApplyAdam (6.42MB/6.42MB, 70us/70us)
Adam/update_Variable_5/ApplyAdam (2.05KB/2.05KB, 65us/65us)
Adam/update_Variable_6/ApplyAdam (20.48KB/20.48KB, 65us/65us)
Adam/update_Variable_7/ApplyAdam (40B/40B, 74us/74us)
BiasAdd (12.85MB/12.85MB, 31us/31us)
BiasAdd_1 (6.42MB/6.42MB, 28us/28us)
Conv2D (12.85MB/12.85MB, 379us/379us)
Conv2D_1 (6.42MB/6.42MB, 172us/172us)
ExponentialDecay (4B/28B, 28us/217us)
ExponentialDecay/Cast_1 (4B/4B, 91us/91us)
ExponentialDecay/Cast_2/x (4B/4B, 1us/1us)
ExponentialDecay/Floor (4B/4B, 31us/31us)
ExponentialDecay/Pow (4B/4B, 33us/33us)
ExponentialDecay/learning_rate (4B/4B, 1us/1us)
ExponentialDecay/truediv (4B/4B, 32us/32us)
L2Loss (4B/4B, 68us/68us)
L2Loss_1 (4B/4B, 80us/80us)
L2Loss_2 (4B/4B, 62us/62us)
L2Loss_3 (4B/4B, 71us/71us)
MatMul (262.14KB/262.14KB, 105us/105us)
MatMul_1 (5.12KB/5.12KB, 46us/46us)
MaxPool (3.21MB/3.21MB, 48us/48us)
MaxPool_1 (1.61MB/1.61MB, 42us/42us)
Relu (12.85MB/12.85MB, 31us/31us)
Reshape/shape (8B/8B, 2us/2us)
SparseSoftmaxCrossEntropyWithLogits/SparseSoftmaxCrossEntropyWithLogits (5.63KB/5.63KB, 154us/154us)
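
And a sketch of the ‘hacked’ session.run() for the timing / memory report: you pass a RunOptions with full tracing plus a RunMetadata object, then hand the collected metadata to the analyzer. Again, constant and argument names are from the TF 1.x contrib interface and may differ in your version; train_op and feed_dict are whatever your training loop already uses.

import tensorflow as tf
from tensorflow.contrib import tfprof

# Report 3: per-op memory and timings; requires trace metadata from a real step.
run_metadata = tf.RunMetadata()
sess.run(train_op,
         feed_dict=feed_dict,
         options=tf.RunOptions(trace_level=tf.RunOptions.FULL_TRACE),
         run_metadata=run_metadata)
tfprof.model_analyzer.print_model_analysis(
    tf.get_default_graph(),
    run_meta=run_metadata,
    tfprof_options=tfprof.model_analyzer.PRINT_ALL_TIMING_MEMORY)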

While there is run-to-run variance on the timings, it will be extremely interesting to compare the graphs generated by different layers of syntactic sugar for efficiency, or to benchmark the expected performance gains from the newly released XLA compiler.

More info at TFProf Github.

 

The most important papers in deep learning, pt 6 : Bayesian & Variational Deep Learning

Subject to much change, as I’m a frequentist (hah, hah).

An introduction to variational methods for graphical models, M. Jordan, 1999 ( + Variational Inference : Foundations and Modern Methods, D. Blei, 2016)

The Neural Autoregressive Distribution Estimator, H. Larochelle, 2011

Stochastic Variational Inference, M. Hoffman, 2012

Black Box Variational Inference, R. Ranganath, 2013

Build, compute, critique, repeat : data analysis with latent variable models, D. Blei, 2014 ( + Variational Inference : A review for statisticians, D. Blei, 2016)

Auto-encoding Variational Bayes, D. Kingma, 2014 (+Stochastic backpropagation and approximate inference in deep generative models, D. Rezende, 2014)

Semi-supervised learning with Deep Generative Models, D. Kingma, 2014

Deep Exponential Families, R. Ranganath, 2014  (+ Deep Autoregressive Networks, K. Gregor, 2013)

A Recurrent Latent Variable Model for sequential data, J. Chung, 2015 (+ Learning Stochastic Recurrent Networks, J. Bayer, 2015 + Variational Recurrent Auto-Encoders, O. Fabius, 2015)

Importance Weighted Autoencoders, Y. Burda, 2015

Uncertainty in Deep Learning, Y. Gal, 2016

The Variational Gaussian Process, D. Tran, 2016 (+ Hierarchical Variational Models, R. Ranganath, 2015)

Improved Variational Inference with Inverse Autoregressive Flow, D. Kingma, 2016

The Variational Lossy Autoencoder, X. Chen, 2017

The most important papers in deep learning, pt 5 : Theory ( yeah, really !)

Here I deliberately omitted Boltzmann machines and deep belief nets, since they’re state of the art on exactly nothing. There is no complete and operational unified theory of deep learning in neural networks as of yet – hence, most insights come from taking related sub-fields of science, such as statistical mechanics, multiscale analysis or Riemannian geometry, and pushing as far as possible to get closed-form expressions. As a result, these papers are technically fairly hardcore.

Spin-glass models of neural networks, D. Amit, 1985

Flat minima, S. Hochreiter, 1997

How transferable are features in deep neural networks?, J. Yosinski, 2014

Identifying and attacking the saddle point problem in high-dimensional non-convex optimization, Y. Dauphin, 2014

A Mathematical Motivation for Complex-Valued Convolutional Networks, J. Bruna, 2015 (+ Invariant Scattering Convolution Networks, J. Bruna, 2012)

The loss surfaces of multilayer networks, A. Choromanska, 2015

Transition to chaos in random neuronal networks, J. Kadmon, 2015

Maximally informative hierarchical representations of high-dimensional data, G. Ver Steeg, 2015 ( + Variational Information Maximization for feature selection, S. Gao, 2016)

On the expressive power of deep neural networks, M. Raghu, 2016 ( + Deep Information Propagation, S. Schoenholz, 2017)

Deep learning without poor local minima, K. Kawaguchi, 2016

Deep neural networks with random Gaussian weights : a universal classification strategy ? , R. Giryes, 2016 ( + Robust Large Margin Deep Neural Networks , J. Sokolic, 2016)

Topology and geometry of half-rectified network optimization, C. Freeman, 2016

The most important papers in deep learning, pt 4 : Initialization, Regularization, Activations and Optimization

Understanding the difficulty of training deep feedforward neural networks, X. Glorot, 2010

Exact solutions to the nonlinear dynamics of learning in deep linear neural networks, A. Saxe, 2014

Delving deep into Rectifiers : surpassing human-level performance on ImageNet classification, K. He, 2015

Dropout : A simple way to prevent neural networks from overfitting, N. Srivastava, 2014

Batch Normalization : Accelerating Deep Network Training by Reducing Internal Covariate Shift, S. Ioffe, 2015 ( + Recurrent Batch Normalization, T. Cooijmans, 2016)

Deep Sparse Rectifier Neural Networks, X. Glorot, 2011 ( + Fast and accurate deep network learning by exponential linear units, DA Clevert, 2015)

Bridging nonlinearities and stochastic regularizers with Gaussian error linear units, D. Hendrycks, 2016

Estimating or Propagating Gradients Through Stochastic Neurons for Conditional Computation, Y. Bengio, 2013

The concrete distribution : a continuous relaxation of discrete random variables, C. Maddison, 2016 ( + Categorical reparameterization with Gumbel-Softmax, E. Jang, 2016)

On the importance of initialization and momentum in deep learning, I. Sutskever, 2013

Adaptive subgradient methods for online learning and stochastic optimization, J. Duchi, 2011 ( + Adadelta : an adaptive learning rate method, M. Zeiler, 2012)

Adam : A method for stochastic optimization, D. Kingma, 2014 ( + Incorporating Nesterov Momentum into Adam, T. Dozat, 2015)

Stochastic Gradient Descent with Warm Restarts, I. Loshchilov, 2016

 

The most important papers in deep learning, pt 3 : Deep Reinforcement Learning

Learning to predict by the methods of temporal differences, R. Sutton, 1988

Algorithms for reinforcement learning, C. Szepesvari, 2009

Human-level control through deep reinforcement learning, V. Mnih, 2015 ( + Playing ATARI with deep reinforcement learning, V. Mnih, 2013)

Deterministic Policy Gradient Algorithms, D. Silver,  2014

Prioritized Experience Replay, T. Schaul, 2015

Trust region policy optimization, J. Schulman, 2015

Mastering the game of Go with deep neural networks and tree search, D. Silver, 2016

Asynchronous methods for deep reinforcement learning, V. Mnih, 2016 ( + Reinforcement learning with unsupervised auxiliary tasks, M. Jaderberg, 2016)

Value Iteration Networks, A. Tamar, 2016

Neural Architecture Search with Reinforcement Learning, B. Zoph, 2016

Faster deep reinforcement learning by optimality tightening, F. He, 2016

Learning to reinforcement learn, J. Wang, 2016

The Predictron : end-to-end learning and planning, D. Silver, 2016

Sample-efficient Actor-Critic with Experience Replay, Z. Wang, 2016 ( + Bridging the gap between value and policy-based reinforcement learning, O. Nachum, 2016)