First, as always, the standard disclaimer applies: with more than 1,660 papers this year on GANs alone, I can’t possibly claim to be comprehensive, and surely I’ve missed out on a ton of stuff (feel free to DM me on Twitter @kloudstrife to let me know of any criminal omissions). I’ve tried to narrow it down to about a paper every two weeks. A lot of this we covered at the weekly Imperial Deep Learning Reading Group. Anyway, here we go.

**Architectures / models** :

This year has been a lot less about convnet architectures and everything has more or less stabilised. Some papers have definitely been pushing the envelope though. First among these is Andrew Brock’s cracking SMASH, which, in spite of its ICLR reviews, has given neural architecture search on 1000 GPUs a run for its money.

SMASH : One-Shot Model Architecture Search through HyperNetworks

DenseNets (updated 2017 revision) are a memory-hungry but very neat idea. The TLDR is ‘in computer vision, eyes + fur = cat, so connect all the things (and the layers)’.

Densely connected convolutional networks
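
The connectivity pattern is simple enough to sketch with toy stand-ins (the callables below are hypothetical placeholders for the paper’s conv-BN-ReLU blocks): every layer sees the concatenation of all preceding outputs, and the block outputs everything concatenated.

```python
# Toy sketch of DenseNet-style connectivity. Each "layer" is a hypothetical
# stand-in for a conv block; features are plain Python lists.

def dense_block(x, layers):
    features = [x]                          # list of feature lists
    for layer in layers:
        concatenated = sum(features, [])    # concat every earlier output
        features.append(layer(concatenated))
    return sum(features, [])                # the block emits everything

growth_rate = 2                             # each layer adds this many features

def make_layer(scale):
    # stand-in for a conv layer: collapses its input, emits `growth_rate` features
    return lambda feats: [scale * sum(feats)] * growth_rate

out = dense_block([1.0, 2.0], [make_layer(0.1), make_layer(0.2)])
# width: 2 inputs + 2 layers x growth rate 2 = 6 features
```

The growth-rate bookkeeping above is also why the architecture is memory-hungry: the input width grows linearly with depth inside a block.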

A crucially underrated idea in CNNs is that of the *scattering transform*, which effectively takes moduli of a wavelet filterbank (bridging wavelet theory with conv + maxpool and ReLU). Somewhat surprisingly, this sheds light on why the first few layers of a convnet look like Gabor filters, and why you might not need to train them at all. In the words of Stephane Mallat, ‘I’m surprised it works !’. See paper below.

Scaling the Scattering Transform
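
A 1-D toy version of the first scattering layer makes the conv + nonlinearity + pool analogy concrete (the wavelet below is a crude hypothetical analytic filter, not one of the paper’s Morlet filterbanks):

```python
import numpy as np

# First-order scattering sketch: convolve with one analytic wavelet, take the
# complex modulus (the ReLU-like nonlinearity), then low-pass by block
# averaging, standing in for the scaling-function blur.

def scattering_order1(x, wavelet, pool=4):
    detail = np.convolve(x, wavelet, mode="same")   # one filter of the filterbank
    modulus = np.abs(detail)                        # modulus nonlinearity
    trimmed = modulus[: len(modulus) // pool * pool]
    return trimmed.reshape(-1, pool).mean(axis=1)   # average pooling

t = np.linspace(0.0, 1.0, 64, endpoint=False)
wavelet = np.exp(2j * np.pi * 8 * t[:8]) * np.hanning(8)  # crude analytic wavelet
signal = np.sin(2 * np.pi * 8 * t)
coeffs = scattering_order1(signal, wavelet)         # 16 non-negative coefficients
```

No parameter in that pipeline is learned, which is exactly the point about the early convnet layers.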

Tensorized LSTMs are the new SotA on Wikipedia @ 1.2 bits-per-character – some people believe the coding limit for English to be around 1.0–1.1 BPC (for reference, LayerNorm LSTMs sit circa 1.3 BPC). I prefer this paper to ‘Recurrent Highway HyperNetworks’ for originality reasons.

Tensorized LSTMs for sequence learning

Finally, without much further comment needed :

Dynamic Routing Between Capsules / Matrix capsules with EM routing

**Generative models** :

I’ve deliberately left out most of the GAN papers given NVidia’s thunderstorm with the Progressive Growing paper.

First, the autoregressive family – Aaron Van den Oord’s latest masterpiece, VQ-VAE, is one of those papers that look obvious in hindsight, but coming up with that double stop-gradient’ed loss function is no small feat. I’m sure a ton of iterations – including sandwiching with ELBO’ed Bayesian layers *a la* PixelVAE – will come out of that work.

Neural Discrete Representation Learning
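
That double stop-gradient’ed loss is easy to write down on toy vectors (a minimal sketch in my own notation, after the paper; `sg[.]` is the identity on values and only blocks gradients, so the two squared terms coincide numerically here):

```python
import numpy as np

# VQ-VAE objective sketch: with encoder output z_e, nearest codebook vector e,
# and a reconstruction term, the loss is
#   recon + || sg[z_e] - e ||^2 + beta * || z_e - sg[e] ||^2 .

def vq_vae_loss(z_e, codebook, recon_err, beta=0.25):
    idx = int(np.argmin(np.sum((codebook - z_e) ** 2, axis=1)))  # nearest code
    e = codebook[idx]
    codebook_loss = np.sum((z_e - e) ** 2)           # moves e toward sg[z_e]
    commitment_loss = beta * np.sum((z_e - e) ** 2)  # moves z_e toward sg[e]
    return recon_err + codebook_loss + commitment_loss, idx

loss, idx = vq_vae_loss(np.array([0.9, 0.1]),
                        np.array([[1.0, 0.0], [0.0, 1.0]]),
                        recon_err=0.5)
```

In a real implementation the gradient of the reconstruction term is copied straight through the quantization step back to the encoder; that straight-through trick is the part numpy cannot show.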

Another surprise came in the form of Parallel WaveNet. When everyone was expecting a fast dyadic scheme in line with Tom Le Paine’s work, the DeepMind guys gave us teacher-student distillation, and noise shaping by interpreting the high-dimensional isotropic Gaussian / Logistic latent space as a time process that can be made autoregressive via Inverse Autoregressive Flows. Very, very neat.
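
The key structural point is easy to sketch: an IAF step maps iid noise z to x_t = z_t · s(z_<t) + m(z_<t), and because each output depends only on the *noise* prefix (never on previous outputs), all x_t can be computed in parallel. The scale/shift functions below are hypothetical stand-ins for the student WaveNet’s outputs:

```python
import numpy as np

# One inverse-autoregressive-flow step for noise shaping (toy version).

def iaf_step(z, s_fn, m_fn):
    x = np.empty_like(z)
    for t in range(len(z)):           # written as a loop for clarity;
        prefix = z[:t]                # every iteration is independent given z
        x[t] = z[t] * s_fn(prefix) + m_fn(prefix)
    return x

z = np.array([0.5, -1.0, 2.0])
x = iaf_step(z,
             s_fn=lambda p: 1.0 + 0.1 * len(p),    # toy scale network
             m_fn=lambda p: float(np.sum(p)))      # toy shift network
```

Sampling the teacher WaveNet, by contrast, is inherently sequential in x – hence the distillation.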

The number one paper that nobody expected – NVidia lays down the law. Elegantly enough, GAN theory comes full circle – instead of simply moar *Wassersteinizing* (in the immortal words of Justin Solomon), keep any loss, including KL or least squares, but get rid of the disjoint-support problem by doing a multiresolution approximation of the data distribution. This still requires quite a few tricks to stabilize the gradients, but the empirical results speak for themselves. Also listed below, since so many have commented on it – the seminal *Wasserstein GAN* paper, the one that started a movement, no, a revolution !

Progressive growing of GANs / Wasserstein GAN

While the French school led by Peyre and Genevay did define Minimum Kantorovich Estimators earlier this year, it is Bousquet’s Google team that wrote down the definitive framework for VAE-GANs in a push-forward / optimal transport setup. The WAE paper will probably be one of the top hits at ICLR2018.

The VeGAN cookbook / Wasserstein Autoencoders

On the variational inference front, who better than Dustin Tran to borrow ideas from off-policy reinforcement learning & from GANs and once again push the envelope of what modern VI can do :

**Reinforcement Learning** :

A year dominated by soft / max-entropy Q-learning. We’ve been doing it wrong ! All these years !

Schulman proves the equivalence between the two main families of RL algorithms. A landmark paper. ‘Nuff said :

Equivalence between Policy Gradients and Soft Q-learning
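
Concretely, the hinge of the equivalence is the soft (log-sum-exp) Bellman optimality pair; in standard notation, with temperature $\tau$ :

```latex
\pi^{*}(a \mid s) = \exp\!\left( \frac{Q^{*}(s,a) - V^{*}(s)}{\tau} \right),
\qquad
V^{*}(s) = \tau \log \sum_{a'} \exp\!\left( \frac{Q^{*}(s,a')}{\tau} \right).
```

With this identity in hand, the entropy-regularized policy gradient and the soft Q-learning update can be matched term by term, which is the heart of the paper.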

Would he have done it without the paper below, which, by taking old math and redoing the partition-function calculations very carefully, proves the path-by-path equivalence ? Nobody knows but Ofir :

Bridging the gap between value and policy RL

Another criminally underrated paper – Gergely quietly one-ups everyone by also working out the analogy between the above RL algorithms and convex-optimization methods. A strong contender for RL paper of the year IMHO, and yet few people have heard of it.

A unified view of entropy-regularized MDPs

If David Silver’s Predictron somehow fell off the radar after being rejected (!) at ICLR 2017, Theo’s paper is like a dual view of it, with beautiful and intuitive Sokoban experimental results to boot :

Imagination-Augmented Agents for Deep Reinforcement Learning

Marc Bellemare gets another transformational paper out – do away with all the DQN stabilization plugins and simply learn the distribution (and beat SotA in the process). Beautiful. Many extensions possible, including the link with Wasserstein distances.

A distributional perspective on RL / Distributional RL with Quantile Regression
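
The mechanical core of the categorical (C51-style) variant is the projection step; here is a sketch with a toy 5-atom support rather than the paper’s 51 atoms: push the support through the Bellman operator, r + γz, then split each atom’s probability mass linearly between the two nearest atoms of the fixed support.

```python
import numpy as np

# Categorical projection of the distributional Bellman target (toy version).

def project_distribution(probs, support, reward, gamma):
    v_min, v_max = support[0], support[-1]
    dz = support[1] - support[0]
    projected = np.zeros_like(probs)
    for p, z in zip(probs, support):
        tz = np.clip(reward + gamma * z, v_min, v_max)  # Bellman-shifted atom
        b = (tz - v_min) / dz                           # fractional atom index
        lo, hi = int(np.floor(b)), int(np.ceil(b))
        if lo == hi:                                    # landed exactly on an atom
            projected[lo] += p
        else:                                           # split mass linearly
            projected[lo] += p * (hi - b)
            projected[hi] += p * (b - lo)
    return projected

support = np.linspace(-1.0, 1.0, 5)   # atoms: -1, -0.5, 0, 0.5, 1
probs = np.full(5, 0.2)               # uniform toy value distribution
target = project_distribution(probs, support, reward=0.5, gamma=0.0)
# terminal-style backup (gamma = 0): all mass collapses onto the atom at 0.5
```

Training then minimizes the cross-entropy between the predicted distribution and this projected target; the quantile-regression follow-up replaces fixed atoms with learned quantile locations.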

A simple, yet extremely effective, double-whammy of an idea.

Noisy Networks for Exploration
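
The idea fits in a dozen lines: weights are μ + σ⊙ε, with ε resampled every forward pass, so exploration comes from learned parametric noise rather than epsilon-greedy. A minimal sketch with factorized Gaussian noise as in the paper (layer sizes and σ values below are arbitrary toy choices):

```python
import numpy as np

# NoisyNet linear layer sketch.

def noisy_linear(x, mu_w, sigma_w, mu_b, sigma_b, rng):
    f = lambda e: np.sign(e) * np.sqrt(np.abs(e))     # factorized-noise squashing
    eps_in = f(rng.standard_normal(mu_w.shape[1]))
    eps_out = f(rng.standard_normal(mu_w.shape[0]))
    w = mu_w + sigma_w * np.outer(eps_out, eps_in)    # per-pass noisy weights
    b = mu_b + sigma_b * eps_out
    return w @ x + b

rng = np.random.default_rng(0)
out = noisy_linear(np.ones(3), np.zeros((2, 3)), np.full((2, 3), 0.1),
                   np.zeros(2), np.full(2, 0.1), rng)
# with all mu and sigma set to zero the layer is exactly deterministic
silent = noisy_linear(np.ones(3), np.zeros((2, 3)), np.zeros((2, 3)),
                      np.zeros(2), np.zeros(2), rng)
```

Since the σ parameters are trained by SGD like everything else, the network itself decides when to stop exploring, per weight.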

Of course, the list wouldn’t be complete without AlphaGo Zero. The idea of aligning policy network pre- and post-MCTS, *ie* MCTS as a policy improvement algorithm (and a means to smooth NN approx error rather than propagating it), is the stuff of legends.

Mastering the game of Go without human knowledge
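
The improvement operator itself is almost trivially small – after search, the improved policy is proportional to visit counts, π(a) ∝ N(s,a)^(1/τ), and the network is trained to match it. A toy rendering:

```python
import numpy as np

# MCTS-as-policy-improvement: turn root visit counts into a training target.

def mcts_policy(visit_counts, tau=1.0):
    counts = np.asarray(visit_counts, dtype=float) ** (1.0 / tau)
    return counts / counts.sum()

pi = mcts_policy([10, 40, 50], tau=1.0)      # plain normalized counts
greedy = mcts_policy([10, 40, 50], tau=0.1)  # low tau -> near-deterministic
```

The policy network is then regressed onto this target (plus a value target from self-play outcomes), closing the improvement loop without any human games.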

**SGD & Optimization** :

2017 has definitely been a year ripe with theoretical insights as to why SGD works as well as it does (and is so hard to beat from a generalization error perspective !) in a non-convex setting.

The ‘most technical’ paper of the year award goes to Chaudhari, who pretty much connects everything from SGD and gradient flows to PDEs. A masterpiece that follows and completes ‘Entropy-SGD‘ :

Deep Relaxation : PDEs for optimizing deep networks

The Bayesian view on this is the SGD-VI connection from Mandt & Hoffman. As you know, I’ve been a frequentist for years, sic.

SGD as approximate Bayesian inference

The previous paper hinged on the continuous relaxation of SGD as a stochastic differential equation (gradient noise being treated as Gaussian because of the CLT). The paper below explains the implications for batch size and gives a very nice chi-squared formula in the process.

Batch size matters, a diffusion approximation framework
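
For reference, the relaxation in question usually takes the form below (my notation: $\eta$ the learning rate, $B$ the batch size, $\Sigma$ the gradient-noise covariance):

```latex
d\theta_t = -\nabla L(\theta_t)\, dt \;+\; \sqrt{\tfrac{\eta}{B}}\; \Sigma(\theta_t)^{1/2}\, dW_t ,
```

which makes explicit that the noise scale – and hence the flatness of the minima SGD prefers – is governed by the ratio of learning rate to batch size.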

Another paper with similar Ornstein-Uhlenbeck inspired results, from the lab of Yoshua Bengio :

Three factors influencing minima in SGD

Finally, another, very recent contribution to the SGD-SDE-VI trinity, courtesy of Chaudhari once again :

SGD performs VI, converges to limit cycles

**Theory** :

I’m a firm believer in the intuition that many insights as to why exactly deep learning works will come from the intersection of harmonic / L2 analysis (as seen earlier with the scattering ideas) and information theory with entropy-based measures. Naftali Tishby’s ideas, while still controversial and under fire from recent ICLR2018 submissions, certainly bring us a step closer to that understanding.

Opening the black box of deep networks via information / On the information bottleneck theory of deep learning

Similarly, a beautiful paper from ICLR2017 takes a variational approach to the information bottleneck theory. My pick with a tiny edge over Beta-VAE in the ‘disentanglement’ category.

Deep variational information bottleneck
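
Written out (symbols are my labelling: encoder $p(z \mid x)$, decoder $q(y \mid z)$, prior $r(z)$), the variational bound on the IB Lagrangian being minimized is

```latex
\mathcal{L} \;=\; \mathbb{E}_{p(x,y)}\, \mathbb{E}_{p(z \mid x)}\big[ -\log q(y \mid z) \big]
\;+\; \beta\, \mathrm{KL}\big( p(z \mid x) \,\|\, r(z) \big) ,
```

with $\beta$ trading off compression $I(Z;X)$ against prediction $I(Z;Y)$ – at $\beta = 1$ the objective coincides with the VAE-style ELBO structure, hence the comparison with Beta-VAE.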

There have been umpteen billion generative models with twelve zillion ways of factorizing the log-likelihood this year, and they can mostly be understood under the light of convex duality. A necessary paper IMHO.

A Lagrangian perspective on latent variable modelling

Finally, in a stunning display of technical mastery, and a sign that the mathematical arms race in deep learning is alive and well, Jeff Pennington combines complex analysis, random matrix theory, free probability, and graph morphisms (!) to derive an exact law for the eigenvalues of the Hessian of neural-network loss functions – a shape that was previously only known empirically, for instance in papers by Sagun et al. Must read.

Geometry of NN loss surfaces via RMT / Nonlinear RMT for deep learning

That’s all from me, happy holidays to all !