It’s 2017 and TensorFlow wins the framework war


TensorFlow is now on its v1.0 alpha release ; I don’t think anyone can deny anymore that this is the framework of choice for deep learning. In just about a single year of public existence, it’s already as evolved and complete as Theano or Torch ever were, if not more so. Understandably so, given it’s backed by the full faith and credit of Google rather than a few academics, but still. I used to prefer Theano because of its availability on Windows ( admittedly, after a painful install phase that took a couple of days out of your life ) and its superior single-GPU performance due to compile-time optimizations. Many people used to feel that way. But the tables have turned, and now I can’t really see a scenario where you don’t favor TensorFlow over, well, basically anything else.

If you’re a beginner, you’ll want a simple install procedure via pip. Used deep-learning-made-easy Keras before ? Its author Francois Chollet tweeted today that it will be fully integrated into TensorFlow soon. If you need even more syntactic sugar, you can also create Keras-like sequential models with PrettyTensor, and/or enjoy three-line deep model specifications on an sklearn interface with tf-learn. The authors have made sure popular models are properly democratized and accessible in the GitHub repo.

If you’re intermediate, you will enjoy the amazing visualization and graph debugging possibilities provided by TensorBoard, out of the box and as soon as you start training, rather than waiting for Theano to compile. Since you already have to remember eight or nine packages’ syntax by heart, you will also appreciate its very numpyesque API.

If you’re advanced, you will require proper DL at scale via either integrated multi-GPU support or Hadoop cluster deployment.

With DeepMind and Brain’s endorsement, and the sheer number of papers these powerhouses get out, TensorFlow has nowhere to go but to become the de facto standard in DL research. I know which library I’m learning this year.

NIPS 2017 content, Random Kitchen Sink edition


PS : if you’re wondering about that random kitchen sink title.

My DL papers of the year

First as always, standard disclaimer applies, since there have been more than 1660 papers this year on GANs alone. I can’t possibly claim to be comprehensive, and surely I’ve missed out on a ton of stuff (feel free to DM me on Twitter @kloudstrife to let me know of any criminal omissions). I’ve tried to narrow it down to about a paper every two weeks. A lot of that stuff we covered at the weekly Imperial Deep Learning Reading Group. Anyway, here we go.

Architectures / models :

This year has been a lot less about convnet architectures and everything has more or less stabilised. Some papers have definitely been pushing the envelope though. First among these is Andrew Brock’s cracking SMASH, which, in spite of its ICLR reviews, has given neural architecture search on 1000 GPUs a run for its money.

SMASH : one shot model architecture search through Hypernetworks

DenseNets (updated 2017 revision) is a memory-hungry but very neat idea. The TLDR is ‘in computer vision, eyes + fur = cat, so connect all the things (and the layers)’.

Densely connected convolutional networks
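
To make the ‘connect all the things’ idea and its memory appetite a bit more concrete, here is a minimal sketch of how the number of input channels grows inside a dense block, where each layer receives the concatenation of the block input and all earlier layers’ outputs. The function name and the example values (64 initial channels, growth rate 32, 6 layers) are illustrative assumptions, not a claim about any specific configuration in the paper.

```python
def dense_block_input_channels(initial_channels, growth_rate, num_layers):
    """Input channel count seen by each layer of a dense block.

    Each layer emits `growth_rate` new feature maps and receives the
    concatenation of the block input and every earlier layer's output.
    """
    return [initial_channels + i * growth_rate for i in range(num_layers)]


# Hypothetical block: 64 input channels, growth rate 32, 6 layers.
channels = dense_block_input_channels(initial_channels=64, growth_rate=32, num_layers=6)
print(channels)  # [64, 96, 128, 160, 192, 224]
```

Since every one of those concatenated inputs has to be kept around for the backward pass, the linear growth in channels is one plausible reading of where the memory hunger comes from.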

A crucially underrated idea in CNNs is that of the scattering transform, effectively taking moduli of a wavelet filterbank (bridging wavelet theory with conv + maxpool, and ReLU). Somewhat surprisingly, this sheds light on why the first few layers of a convnet look like Gabor filters, and why you might not need to train them at all. In the words of Stephane Mallat, ‘I’m surprised it works !’. See paper below.

Scaling the Scattering Transform

Tensorized LSTMs are new SotA on Wikipedia @ 1.2 bits-per-character – some people believe the coding limit for English to be around 1.0 to 1.1 BPC (for reference, LayerNorm LSTMs are circa 1.3 BPC). I prefer this paper to ‘Recurrent Highway HyperNetworks’ for originality reasons.

Tensorized LSTMs for sequence learning
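
For readers less familiar with the bits-per-character metric quoted above, a minimal sketch of how it is usually obtained: take the mean negative log-likelihood the model assigns to the characters that actually occurred, and convert from nats to bits. The function names and the toy probabilities are assumptions made for illustration only.

```python
import math


def mean_nll_nats(char_probs):
    """Mean negative log-likelihood (in nats) of the probabilities a model
    assigned to the characters that actually occurred."""
    return -sum(math.log(p) for p in char_probs) / len(char_probs)


def bits_per_character(nll_nats):
    """Convert a mean per-character negative log-likelihood from nats to bits."""
    return nll_nats / math.log(2)


# Hypothetical model that assigns probability 0.5 to every observed character:
# that is one binary choice per character, so roughly 1 bit per character.
print(bits_per_character(mean_nll_nats([0.5, 0.5, 0.5, 0.5])))
```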

Finally, without much further comment needed :

Dynamic Routing Between Capsules / Matrix capsules with EM routing

Generative models :

I’ve deliberately left out most of the GAN papers given NVidia’s thunderstorm with the Progressive Growing paper.

First with the autoregressive family – Aaron Van den Oord’s latest masterpiece, VQ-VAE, is one of those papers that look obvious in hindsight, but coming up with that double stop-gradient’ed loss function is no small feat. I’m sure a ton of iterations – including sandwiching with ELBO’ed Bayesian layers à la PixelVAE – will come out of that work.

Neural Discrete Representation Learning
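
For reference, a sketch of that double stop-gradient loss, written here as a quantity to minimize. The notation is an assumption for this note: sg[·] is the stop-gradient operator, z_e(x) the encoder output, e the nearest codebook vector, z_q(x) the quantized latent fed to the decoder, and β a commitment weight. It is best checked against the paper itself.

```latex
L \;=\; -\log p\big(x \mid z_q(x)\big)
\;+\; \big\lVert \mathrm{sg}[z_e(x)] - e \big\rVert_2^2
\;+\; \beta \,\big\lVert z_e(x) - \mathrm{sg}[e] \big\rVert_2^2
```

The second term only moves the codebook towards the encoder output, while the third only moves the encoder towards the codebook, which is where the two stop-gradients come in.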

Another surprise came in the form of Parallel Wavenet. When everyone was expecting a fast dyadic scheme in line with Tom LePaine’s work, the DeepMind guys gave us teacher-student distillation, and noise shaping by interpreting the high-dimensional isotropic Gaussian / Logistic latent space as a time process that can be made autoregressive via Inverse Autoregressive Flows. Very, very neat.

Parallel Wavenet

The number one paper that nobody expected – NVidia lays down the law. Elegantly enough, GAN theory goes full circle – instead of simply moar Wassersteinizing (in the immortal words of Justin Solomon), just keep any loss including KL or least squares, but get rid of the disjoint support problem by doing multiresolution approximation of the data distribution. This still requires quite a few tricks to stabilize the gradients, but the empirical results speak for themselves. Also, as has been suggested by many, I am including the seminal Wasserstein GAN paper, the one that started a movement, no, a revolution !

Progressive growing of GANs / Wasserstein GAN

While the French school led by Peyre and Genevay did define Minimum Kantorovich Estimators earlier this year, it is the Google team of Bousquet which wrote down the definitive framework for VAE-GANs in a push-forward / optimal transport setup. The W-AAE paper will probably be one of the top hits at ICLR2018.

The VeGAN cookbook / Wasserstein Autoencoders

On the variational inference front, who better than Dustin Tran to borrow ideas from off-policy reinforcement learning & from GANs and once again push the envelope of what modern VI can do :

Hierarchical Implicit Models

Reinforcement Learning :

A year dominated by soft / max-entropy Q-learning. We’ve been doing it wrong ! All these years !

Schulman proves the equivalence between the main two families of RL algorithms. A landmark paper. ‘Nuff said :

Equivalence between Policy Gradients and Soft Q-learning
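
As a reminder of what the soft / max-entropy flavour changes, here is a minimal sketch under the simplest convention (a plain entropy bonus with temperature tau, no reference policy): the hard max over actions becomes a log-sum-exp, and the greedy policy becomes a Boltzmann distribution over Q-values. The function names and the toy Q-values are assumptions for illustration, not code from the paper.

```python
import math


def soft_value(q_values, temperature):
    """Soft state value: tau * log(sum_a exp(Q(s, a) / tau))."""
    m = max(q_values)  # subtract the max for numerical stability
    return m + temperature * math.log(
        sum(math.exp((q - m) / temperature) for q in q_values))


def boltzmann_policy(q_values, temperature):
    """Max-entropy policy: pi(a | s) = exp((Q(s, a) - V(s)) / tau)."""
    v = soft_value(q_values, temperature)
    return [math.exp((q - v) / temperature) for q in q_values]


q = [1.0, 2.0, 0.5]  # hypothetical action values for a single state
for tau in (1.0, 0.1):
    print(tau, soft_value(q, tau), boltzmann_policy(q, tau))
```

As the temperature goes to zero, the soft value approaches max Q and the policy approaches the greedy one, which recovers ordinary Q-learning as a limiting case.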

Would he have done it without the paper below which, by taking old math and redoing the partition function calculations very carefully, proves the path-by-path equivalence ? Nobody knows but Ofir :

Bridging the gap between value and policy RL

Another criminally underrated paper – Gergely quietly one-ups everyone by also working out the analogy between the above RL algos and convex optimization methods. A strong contender for RL paper of the year IMHO and yet few people have heard of it.

A unified view of entropy-regularized MDPs

If David Silver’s Predictron somehow fell off the radar due to being refused (!) at ICLR 2017, Theo’s paper is like a dual view of it, with beautiful and intuitive Sokoban experimental results to boot :

Imagination-Augmented Agents

Marc Bellemare gets another transformational paper out – do away with all the DQN stabilization plugins, and simply learn the distribution ( and beat SotA in the process). Beautiful. Many extensions possible, including the link with Wasserstein distances.

A distributional perspective on RL / Distributional RL with Quantile Regression

A simple, yet extremely efficient, double-whammy of an idea.

Noisy Networks for Exploration

Of course, the list wouldn’t be complete without AlphaGo Zero. The idea of aligning policy network pre- and post-MCTS, ie MCTS as a policy improvement algorithm (and a means to smooth NN approx error rather than propagating it), is the stuff of legends.

Mastering the game of Go without human knowledge
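
The ‘aligning pre- and post-MCTS’ idea shows up directly in the training loss, sketched here with notation assumed for this note: the network f_θ outputs move probabilities p and a value v for position s, π is the vector of MCTS search probabilities, z the eventual game outcome, and c a weight-decay coefficient.

```latex
(p, v) = f_\theta(s), \qquad
l \;=\; (z - v)^2 \;-\; \pi^{\top} \log p \;+\; c \,\lVert \theta \rVert^2
```

The cross-entropy term pulls the raw policy p towards the search-improved policy π, which is what makes MCTS act as a policy improvement operator.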

SGD & Optimization :

2017 has definitely been a year rife with theoretical insights as to why SGD works as well as it does (and is so hard to beat from a generalization error perspective !) in a non-convex setting.

The ‘most technical’ paper of the year award goes to Chaudhari. Pretty much connects everything from SGD and gradient flows to PDEs. A masterpiece that follows and completes ‘Entropy-SGD’ :

Deep Relaxation : PDEs for optimizing deep networks

The Bayesian view on this is the SGD-VI connection from Mandt & Hoffman. As you know, I’ve been a frequentist for years, sic.

SGD as approximate Bayesian inference

The previous paper hinged on the continuous relaxation of SGD as a stochastic differential equation (gradient noise being treated as Gaussian because of the CLT). The paper below explains the implications for batch size & gives a very nice chi-square formula in the process.

Batch size matters, a diffusion approximation framework

Another paper with similar Ornstein-Uhlenbeck inspired results, from the lab of Yoshua Bengio :

Three factors influencing minima in SGD

Finally, another, very recent contribution to the SGD-SDE-VI trinity, courtesy of Chaudhari once again :

SGD performs VI, converges to limit cycles

Theory :

I’m a firm believer in an intuition that many insights as to why exactly deep learning works will come from the intersection of harmonic / L2 analysis (as seen earlier with the scattering ideas) and information theory with entropy-based measures. Naftali Tishby’s ideas, while still controversial and under fire from a recent ICLR2018 submission, certainly bring us a step closer to that understanding.

Opening the black box of deep networks via information / On the information bottleneck theory of deep learning

Similarly, a beautiful paper from ICLR2017 takes a variational approach to the information bottleneck theory. My pick with a tiny edge over Beta-VAE in the ‘disentanglement’ category.

Deep variational information bottleneck

There have been umpteen billion generative models with twelve zillion ways of factorizing the log-likelihood this year, and they can mostly be understood under the light of convex duality. A necessary paper IMHO.

A Lagrangian perspective on latent variable modelling

Finally, in a stunning display of technical mastery, and a sign that the mathematical arms race in deep learning is alive and well, Jeff Pennington combines complex analysis, random matrix theory, free probability, and graph morphisms (!) to derive an exact law for the eigenvalues of the Hessian of neural network loss functions, whereas the graph shape was only known empirically before, for instance in papers by Sagun et al. Must read.

Geometry of NN loss surfaces via RMT / Nonlinear RMT for deep learning

That’s all from me, happy holidays to all !

Bonus list : The most important articles on GANs

2016 was a vintage year for GAN literature, with the technique gaining traction and many theoretical breakthroughs on top of much increased quality ( see the recent PPGN paper for impressive results on ImageNet).

Generative Adversarial Networks, I. Goodfellow, 2014 ( + NIPS Tutorial on GANs, I. Goodfellow, 2016)

Deep Convolutional GANs, A. Radford, 2015

Auxiliary Classifier GANs, A. Odena, 2016

Improved Techniques for training GANs, T. Salimans, 2016

InfoGANs : Interpretable Representation learning by Information Maximizing Generative Adversarial Networks, X. Chen, 2016

Energy-Based GANs, J. Zhao, 2016

Adversarially Learned Inference, V. Dumoulin, 2016

Stacked GANs, X. Huang, 2016 ( + StackGAN, H. Zhang, 2016)

Plug and Play Generative Networks : Conditional iterative generation of images in latent space, A. Nguyen, 2016

Wasserstein GANs, M. Arjovsky, 2017 (+ Towards Principled Methods for training GANs, M. Arjovsky, 2016 / Improved Training of Wasserstein GANs, I. Gulrajani, 2017)

TFProf, a model analyzer for TensorFlow

I was surprised to find out about TFProf, a useful debugging and analysis tool for TensorFlow models that should help you build your model and find a reasonable training batch size for it. It is very useful if only to get information about your network in a way complementary to TensorBoard. There are three main operations :

1. (Compactly) fetch a summary of the shapes and sizes of all trainable variables, using the .TRAINABLE_VARS_PARAMS_STAT_OPTIONS options of tfprof.model_analyzer. I will be illustrating on a simple MNIST convnet with conv1 shape 5x5x32, conv2 shape 5x5x64, FC1 512 and FC2 10 ( the standard convolutional.py from the TensorFlow MNIST examples). Results look like the following.

==================Model Analysis Report======================
_TFProfRoot (--/1.66m params)
Variable (5x5x1x32, 800/800 params)
Variable_1 (32, 32/32 params)
Variable_2 (5x5x32x64, 51.20k/51.20k params)
Variable_3 (64, 64/64 params)
Variable_4 (3136x512, 1.61m/1.61m params)
Variable_5 (512, 512/512 params)
Variable_6 (512x10, 5.12k/5.12k params)
Variable_7 (10, 10/10 params)
Variable_8 (0/0 params)
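
As a sanity check on the report above, the per-variable counts are simply the product of each shape’s dimensions. A minimal sketch in plain Python follows ; the dictionary and function names are my own, the layer annotations are my reading of which variable is which, and Variable_8 is left out since the report lists it with 0 params.

```python
from math import prod

# Variable shapes as listed in the report above.
VARIABLE_SHAPES = {
    "Variable": (5, 5, 1, 32),      # conv1 weights
    "Variable_1": (32,),            # conv1 biases
    "Variable_2": (5, 5, 32, 64),   # conv2 weights
    "Variable_3": (64,),            # conv2 biases
    "Variable_4": (3136, 512),      # FC1 weights
    "Variable_5": (512,),           # FC1 biases
    "Variable_6": (512, 10),        # FC2 weights
    "Variable_7": (10,),            # FC2 biases
}


def count_params(shapes):
    """Number of scalar parameters per variable: the product of its dimensions."""
    return {name: prod(shape) for name, shape in shapes.items()}


counts = count_params(VARIABLE_SHAPES)
total = sum(counts.values())
print(counts["Variable_4"], total)  # 1605632 1663370
```

The total of 1,663,370 matches the 1.66m topline, with the first fully connected layer accounting for nearly all of it.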

2. Get the number of FLOPS ( floating point operations ) per minibatch required for the training pass of your model, sorted descending. This is achieved through .FLOAT_OPS_OPTIONS . A very interesting statistic to see what you can optimize and where. Fixing batch size at 128 – see total FLOPS usage on the topline with, unsurprisingly, convolutions taking up the lion’s share :

==================Model Analysis Report======================
_TFProfRoot (0/5.54b flops)
Conv2D_1 (2.57b/2.57b flops)
Conv2D_3 (1.28b/1.28b flops)
MatMul (411.04m/411.04m flops)
gradients/MatMul_grad/MatMul (411.04m/411.04m flops)
gradients/MatMul_grad/MatMul_1 (411.04m/411.04m flops)
MatMul_2 (205.52m/205.52m flops)
Conv2D (160.56m/160.56m flops)
Conv2D_2 (80.28m/80.28m flops)
BiasAdd (3.21m/3.21m flops)
BiasAdd_1 (1.61m/1.61m flops)
BiasAdd_2 (1.61m/1.61m flops)
MatMul_1 (1.31m/1.31m flops)
gradients/MatMul_1_grad/MatMul (1.31m/1.31m flops)
gradients/MatMul_1_grad/MatMul_1 (1.31m/1.31m flops)
BiasAdd_3 (802.82k/802.82k flops)
MatMul_3 (655.36k/655.36k flops)
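
Several of the figures above can be reproduced by hand by counting each multiply-add as two operations. The sketch below does this under a few assumptions that are not stated in the report : stride-1 SAME-padded convolutions, 28x28 inputs to conv1 and 14x14 inputs to conv2 (after 2x2 pooling), and the batch size of 128 quoted above. The helper names are hypothetical.

```python
def conv2d_flops(batch, out_h, out_w, k_h, k_w, c_in, c_out):
    """Forward FLOPs of a stride-1, SAME-padded convolution,
    counting each multiply-add as two operations."""
    return 2 * batch * out_h * out_w * k_h * k_w * c_in * c_out


def matmul_flops(batch, n_in, n_out):
    """Forward FLOPs of a (batch x n_in) by (n_in x n_out) matrix product."""
    return 2 * batch * n_in * n_out


BATCH = 128
print(conv2d_flops(BATCH, 28, 28, 5, 5, 1, 32))   # 160563200, cf. Conv2D (160.56m)
print(conv2d_flops(BATCH, 14, 14, 5, 5, 32, 64))  # 2569011200, cf. Conv2D_1 (2.57b)
print(matmul_flops(BATCH, 3136, 512))             # 411041792, cf. MatMul (411.04m)
print(matmul_flops(BATCH, 512, 10))               # 1310720, cf. MatMul_1 (1.31m)
```

The halved figures for Conv2D_2, Conv2D_3, MatMul_2 and MatMul_3 are consistent with a second copy of the network run at batch size 64, presumably the evaluation graph, though that is an inference from the numbers rather than something the report states.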

3. Finally, you can get a summary of memory usage and timings, per minibatch, with op-level granularity. The memory figures are incredibly useful if your model already has enough regularization built in and you want to maximize batch size for your GPU memory. To get this, use .PRINT_ALL_TIMING_MEMORY and hack your session.run() call in order to force the inclusion of run metadata ; results look like the below. Note that the topline memory usage should be less than your GPU’s for obvious reasons. The 4.67ms figure doesn’t scale with batch size and, as such, seems to be training pass time per sample.

==================Model Analysis Report======================
_TFProfRoot (0B/145.88MB, 0us/4.67ms)
Adam (4B/6.65MB, 31us/729us)
Adam/Assign (4B/4B, 46us/46us)
Adam/Assign_1 (4B/4B, 29us/29us)
Adam/beta1 (4B/4B, 3us/3us)
Adam/beta2 (4B/4B, 3us/3us)
Adam/epsilon (4B/4B, 3us/3us)
Adam/mul (4B/4B, 32us/32us)
Adam/mul_1 (4B/4B, 31us/31us)
Adam/update_Variable/ApplyAdam (3.20KB/3.20KB, 64us/64us)
Adam/update_Variable_1/ApplyAdam (128B/128B, 73us/73us)
Adam/update_Variable_2/ApplyAdam (204.80KB/204.80KB, 68us/68us)
Adam/update_Variable_3/ApplyAdam (256B/256B, 72us/72us)
Adam/update_Variable_4/ApplyAdam (6.42MB/6.42MB, 70us/70us)
Adam/update_Variable_5/ApplyAdam (2.05KB/2.05KB, 65us/65us)
Adam/update_Variable_6/ApplyAdam (20.48KB/20.48KB, 65us/65us)
Adam/update_Variable_7/ApplyAdam (40B/40B, 74us/74us)
BiasAdd (12.85MB/12.85MB, 31us/31us)
BiasAdd_1 (6.42MB/6.42MB, 28us/28us)
Conv2D (12.85MB/12.85MB, 379us/379us)
Conv2D_1 (6.42MB/6.42MB, 172us/172us)
ExponentialDecay (4B/28B, 28us/217us)
ExponentialDecay/Cast_1 (4B/4B, 91us/91us)
ExponentialDecay/Cast_2/x (4B/4B, 1us/1us)
ExponentialDecay/Floor (4B/4B, 31us/31us)
ExponentialDecay/Pow (4B/4B, 33us/33us)
ExponentialDecay/learning_rate (4B/4B, 1us/1us)
ExponentialDecay/truediv (4B/4B, 32us/32us)
L2Loss (4B/4B, 68us/68us)
L2Loss_1 (4B/4B, 80us/80us)
L2Loss_2 (4B/4B, 62us/62us)
L2Loss_3 (4B/4B, 71us/71us)
MatMul (262.14KB/262.14KB, 105us/105us)
MatMul_1 (5.12KB/5.12KB, 46us/46us)
MaxPool (3.21MB/3.21MB, 48us/48us)
MaxPool_1 (1.61MB/1.61MB, 42us/42us)
Relu (12.85MB/12.85MB, 31us/31us)
Reshape/shape (8B/8B, 2us/2us)
SparseSoftmaxCrossEntropyWithLogits/SparseSoftmaxCrossEntropyWithLogits (5.63KB/5.63KB, 154us/154us)
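
The larger memory entries above line up with the size of each op’s output tensor in float32, which makes it easy to guess how usage will move with batch size. A minimal sketch, assuming 4-byte floats, a batch size of 128 and the activation shapes of the MNIST convnet described earlier (the helper name and the shapes are my assumptions, not something the report prints) :

```python
BYTES_PER_FLOAT32 = 4


def tensor_bytes(*dims):
    """Size in bytes of a dense float32 tensor with the given dimensions."""
    size = BYTES_PER_FLOAT32
    for d in dims:
        size *= d
    return size


BATCH = 128
print(tensor_bytes(BATCH, 28, 28, 32))  # 12845056, cf. Conv2D (12.85MB)
print(tensor_bytes(BATCH, 14, 14, 64))  # 6422528, cf. Conv2D_1 (6.42MB)
print(tensor_bytes(BATCH, 14, 14, 32))  # 3211264, cf. MaxPool (3.21MB)
print(tensor_bytes(BATCH, 512))         # 262144, cf. MatMul (262.14KB)
```

Since these activation sizes are linear in the batch dimension, doubling the batch size should roughly double them, while the per-variable Adam entries stay fixed.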

While there is run-to-run variance on timings, it will be extremely interesting to try and compare graphs generated by different layers of syntactic sugar for efficiency, or to benchmark expected perf gains from the newly released XLA compiler.

More info at TFProf GitHub.


The most important papers in deep learning, pt 6 : Bayesian & Variational Deep Learning

Subject to much change, as I’m a frequentist (hah, hah).

An introduction to variational methods for graphical models, M. Jordan, 1999 ( + Variational Inference : Foundations and Modern Methods, D. Blei, 2016 )

The Neural Autoregressive Distribution Estimator, H. Larochelle, 2011

Stochastic Variational Inference, M. Hoffman, 2012

Black Box Variational Inference, R. Ranganath, 2013

Build, compute, critique, repeat : data analysis with latent variable models, D. Blei, 2014 ( + Variational Inference : A review for statisticians, D. Blei, 2016)

Auto-encoding Variational Bayes, D. Kingma, 2014 (+Stochastic backpropagation and approximate inference in deep generative models, D. Rezende, 2014)

Semi-supervised learning with Deep Generative Models, D. Kingma, 2014

Deep Exponential Families, R. Ranganath, 2014 (+ Deep Autoregressive Networks, K. Gregor, 2013)

A Recurrent Latent Variable Model for sequential data, J. Chung, 2015 (+ Learning Stochastic Recurrent Networks, J. Bayer, 2015 + Variational Recurrent Auto-Encoders, O. Fabius, 2015)

Importance Weighted Autoencoders, Y. Burda, 2015

Uncertainty in Deep Learning, Y. Gal, 2016

The Variational Gaussian Process, D. Tran, 2016 (+ Hierarchical Variational Models, R. Ranganath, 2015)

Improved Variational Inference with Inverse Autoregressive Flow, D. Kingma, 2016

The Variational Lossy Autoencoder, X. Chen, 2017