First, as always, the standard disclaimer applies: with more than 1660 papers on GANs alone this year, I can’t possibly claim to be comprehensive, and surely I’ve missed out on a ton of stuff (feel free to DM me on Twitter @kloudstrife to let me know of any criminal omissions). I’ve tried to narrow it down to about a paper every two weeks. We covered a lot of that stuff at the weekly Imperial Deep Learning Reading Group. Anyway, here we go.
Architectures / models :
This year has been a lot less about convnet architectures and everything has more or less stabilised. Some papers have definitely been pushing the envelope though. First among these is Andrew Brock’s cracking SMASH, which, in spite of its ICLR reviews, has given neural architecture search on 1000 GPUs a run for its money.
DenseNets (updated 2017 revision) is a memory-hungry but very neat idea. The TLDR is ‘in computer vision, eyes + fur = cat, so connect all the things (and the layers)’.
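To make the ‘connect all the things’ idea concrete, here is a minimal NumPy sketch of a dense block. It is an illustration under loud assumptions: fully connected layers stand in for convolutions, the weights are random, and the names `dense_block`, `make_weights` and `growth_rate` are mine, not the paper's code. The only point is that each layer consumes the concatenation of everything before it, which is also where the memory appetite comes from.

```python
import numpy as np

def make_weights(in_dim, growth_rate, num_layers, seed=0):
    """Random weights (illustrative only). Layer i sees in_dim + i * growth_rate inputs."""
    rng = np.random.default_rng(seed)
    return [
        0.1 * rng.standard_normal((in_dim + i * growth_rate, growth_rate))
        for i in range(num_layers)
    ]

def dense_block(x, weights):
    """Each layer receives the concatenation of the input and all earlier outputs."""
    features = [x]
    for w in weights:
        layer_input = np.concatenate(features, axis=-1)  # reuse every earlier feature
        layer_output = np.maximum(layer_input @ w, 0.0)  # linear map + ReLU (stand-in for conv)
        features.append(layer_output)
    return np.concatenate(features, axis=-1)

x = np.ones((2, 8))
y = dense_block(x, make_weights(in_dim=8, growth_rate=4, num_layers=3))
print(y.shape)  # (2, 20): 8 input features + 3 layers * 4 new features each
```

The width of the concatenated features grows linearly with depth, so the number of stored activations grows quickly unless you share memory between layers.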
A crucially underrated idea in CNNs is that of the scattering transform, effectively taking moduli of a wavelet filterbank (bridging wavelet theory with conv + maxpool, and ReLU). Somewhat surprisingly, this sheds light on why the first few layers of a convnet look like Gabor filters, and why you might not need to train them at all. In the words of Stephane Mallat, ‘I’m surprised it works !’. See paper below.
Tensorized LSTMs are the new SotA on Wikipedia @ 1.2 bits-per-character (BPC) – some people believe the coding limit for English to be 1.0 to 1.1 BPC (for reference, LayerNorm LSTMs are circa 1.3 BPC). I prefer this paper to ‘Recurrent Highway HyperNetworks’ for originality reasons.
Finally, without much further comment needed :
Generative models :
I’ve deliberately left out most of the GAN papers given NVidia’s thunderstorm with the Progressive Growing paper.
First, the autoregressive family – Aaron Van den Oord’s latest masterpiece, VQ-VAE, is one of those papers that look obvious in hindsight, but coming up with that double stop-gradient’ed loss function is no small feat. I’m sure a ton of iterations – including sandwiching with ELBO’ed Bayesian layers à la PixelVAE – will come out of that work.
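For readers who haven't seen the double stop-gradient trick, here is a rough NumPy sketch of the quantisation step and the two latent-space terms of the loss, ||sg[z_e] - e||^2 + beta * ||z_e - sg[e]||^2. Assumptions: the function and variable names are mine, `beta=0.25` is just a default I picked for the sketch, and the reconstruction term is omitted. NumPy has no autodiff, so the stop-gradients only appear as comments: the two terms have the same squared distance, and what differs is who receives the gradient.

```python
import numpy as np

def vq_losses(z_e, codebook, beta=0.25):
    """Nearest-neighbour quantisation plus the two latent loss terms.

    z_e:      (n, d) encoder outputs
    codebook: (k, d) embedding vectors
    """
    # Squared distance from every encoder output to every codebook entry.
    dist2 = ((z_e[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    idx = dist2.argmin(axis=1)   # index of the nearest code
    z_q = codebook[idx]          # quantised latents fed to the decoder

    sq_dist = ((z_e - z_q) ** 2).sum(axis=-1).mean()
    # ||sg[z_e] - e||^2: in an autodiff framework this only moves the codebook.
    codebook_loss = sq_dist
    # beta * ||z_e - sg[e]||^2: this only moves the encoder (commitment term).
    commitment_loss = beta * sq_dist
    return idx, z_q, codebook_loss, commitment_loss

z_e = np.array([[0.0, 0.0], [1.0, 1.0]])
codebook = np.array([[0.1, 0.0], [1.0, 1.0], [5.0, 5.0]])
idx, z_q, l_code, l_commit = vq_losses(z_e, codebook)
print(idx, l_code, l_commit)
```

In a real implementation the decoder gradient is also copied straight through from z_q to z_e, which again is only expressible with an autodiff framework.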
Another surprise came in the form of Parallel Wavenet. When everyone was expecting a fast dyadic scheme in line with Tom LePaine’s work, the DeepMind guys gave us teacher-student distillation, and noise shaping by interpreting the high-dimensional isotropic Gaussian / Logistic latent space as a time process that can be made autoregressive via Inverse Autoregressive Flows. Very, very neat.
The number one paper that nobody expected – NVidia lays down the law. Elegantly enough, GAN theory goes full circle – instead of simply moar Wassersteinizing (in the immortal words of Justin Solomon), just keep any loss including KL or least squares, but get rid of the disjoint support problem by doing multiresolution approximation of the data distribution. This still requires quite a few tricks to stabilize the gradients, but the empirical results speak for themselves. Also as has been commented by many – including the seminal Wasserstein GAN paper, the one that started a movement, no, a revolution !
While the French school led by Peyre and Genevay did define Minimum Kantorovich Estimators earlier this year, it is the Google team of Bousquet which wrote down the definitive framework for VAE-GANs in a push-forward / optimal transport setup. The W-AAE paper will probably be one of the top hits at ICLR2018.
On the variational inference front, who better than Dustin Tran to borrow ideas from off-policy reinforcement learning & from GANs and once again push the envelope of what modern VI can do :
Reinforcement Learning :
A year dominated by soft / max-entropy Q-learning. We’ve been doing it wrong ! All these years !
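As a reminder of what ‘soft’ means here, a minimal sketch of the max-entropy backup in NumPy: the hard max over actions becomes a temperature-scaled log-sum-exp, and the greedy policy becomes a Boltzmann distribution over Q-values. The names (`soft_value`, `soft_policy`, `soft_backup`, `alpha`) are my own labels for illustration and are not taken from any specific paper's code.

```python
import numpy as np

def soft_value(q, alpha):
    """V(s) = alpha * log sum_a exp(Q(s, a) / alpha), computed stably."""
    q_max = q.max(axis=-1, keepdims=True)
    lse = np.log(np.exp((q - q_max) / alpha).sum(axis=-1, keepdims=True))
    return (q_max + alpha * lse).squeeze(-1)

def soft_policy(q, alpha):
    """pi(a | s) = exp((Q(s, a) - V(s)) / alpha), a softmax over Q / alpha."""
    v = soft_value(q, alpha)[..., None]
    return np.exp((q - v) / alpha)

def soft_backup(reward, q_next, alpha, gamma=0.99):
    """Soft Bellman target: r + gamma * V_soft(s')."""
    return reward + gamma * soft_value(q_next, alpha)

q = np.array([[1.0, 1.0], [0.0, 2.0]])
print(soft_value(q, alpha=1.0))    # first entry is 1 + log(2)
print(soft_policy(q, alpha=1.0))   # each row sums to 1
print(soft_value(q, alpha=0.01))   # approaches the hard max as alpha -> 0
```

As alpha goes to zero the soft value collapses to the usual max, so standard Q-learning is the zero-temperature limit of this backup.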
Schulman proves the equivalence between the main two families of RL algorithms. A landmark paper. ‘Nuff said :
Would he have done it without the paper below, which, by taking old math and re-doing the partition function calcs very carefully, proves the path-by-path equivalence ? Nobody knows but Ofir :
Another criminally underrated paper – Gergely quietly one-ups everyone by also working out the analogy between the above RL algos and convex optimization methods. A strong contender for RL paper of the year IMHO and yet few people have heard of it.
If David Silver’s Predictron somehow fell off the radar due to being refused (!) at ICLR 2017, Theo’s paper is like a dual view of it, with beautiful and intuitive Sokoban experimental results to boot :
Marc Bellemare gets another transformational paper out – do away with all the DQN stabilization plugins, and simply learn the distribution (and beat SotA in the process). Beautiful. Many extensions possible, including the link with Wasserstein distances.
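To give a flavour of what ‘learn the distribution’ involves in practice, here is a hedged sketch of one ingredient of a categorical approach: the return distribution lives on a fixed grid of atoms, and the Bellman-shifted distribution r + gamma * Z has to be projected back onto that grid. I'm assuming a categorical, fixed-support parameterisation here; the function name and the tiny five-atom grid are made up for illustration.

```python
import numpy as np

def project_distribution(probs, reward, gamma, atoms):
    """Project the distribution of reward + gamma * Z back onto the fixed atoms.

    probs: (n,) probabilities of Z on an evenly spaced support `atoms` (n,)
    """
    n = len(atoms)
    v_min, v_max = atoms[0], atoms[-1]
    delta = atoms[1] - atoms[0]

    # Shift and shrink the support, clipping to the representable range.
    shifted = np.clip(reward + gamma * atoms, v_min, v_max)
    pos = np.clip((shifted - v_min) / delta, 0, n - 1)  # fractional grid position
    lower = np.floor(pos).astype(int)
    upper = np.ceil(pos).astype(int)

    out = np.zeros_like(probs, dtype=float)
    exact = lower == upper  # shifted atom landed exactly on a grid point
    # Split each atom's mass between its two neighbours, proportionally to distance.
    np.add.at(out, lower, probs * np.where(exact, 1.0, upper - pos))
    np.add.at(out, upper, probs * np.where(exact, 0.0, pos - lower))
    return out

atoms = np.linspace(0.0, 4.0, 5)                  # support {0, 1, 2, 3, 4}
probs = np.array([0.0, 0.0, 1.0, 0.0, 0.0])       # all mass on return = 2
print(project_distribution(probs, reward=0.5, gamma=1.0, atoms=atoms))
```

In this toy case the mass at 2 moves to 2.5 and is split evenly between atoms 2 and 3, so the mean of the projected distribution is still 2.5.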
A simple, yet extremely efficient, double-whammy of an idea.
Of course, the list wouldn’t be complete without AlphaGo Zero. The idea of aligning policy network pre- and post-MCTS, ie MCTS as a policy improvement algorithm (and a means to smooth NN approx error rather than propagating it), is the stuff of legends.
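The ‘MCTS as policy improvement’ idea fits in a few lines. In this sketch the search's visit counts are turned into a target distribution, and the network is trained to match it while also regressing the game outcome. Assumptions: the names are mine, the weight regulariser is left out, and the tree search itself is not shown, so `visit_counts` is simply taken as given.

```python
import numpy as np

def mcts_policy_target(visit_counts, temperature=1.0):
    """Improved policy pi(a) proportional to N(a)^(1 / temperature)."""
    counts = np.asarray(visit_counts, dtype=float) ** (1.0 / temperature)
    return counts / counts.sum()

def log_softmax(logits):
    shifted = logits - logits.max()
    return shifted - np.log(np.exp(shifted).sum())

def policy_value_loss(policy_logits, value, pi_target, outcome):
    """(z - v)^2 - pi^T log p: value regression plus cross-entropy to the search policy."""
    value_loss = (outcome - value) ** 2
    policy_loss = -(pi_target * log_softmax(policy_logits)).sum()
    return value_loss + policy_loss

pi = mcts_policy_target([10, 30, 60])
print(pi)  # [0.1 0.3 0.6]
print(policy_value_loss(np.zeros(3), value=0.0, pi_target=pi, outcome=1.0))  # 1 + log(3)
```

The cross-entropy term is what pulls the raw network policy towards the post-search policy, so the next round of search starts from a stronger prior.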
SGD & Optimization :
2017 has definitely been a year ripe with theoretical insights as to why SGD works as well as it does (and is so hard to beat from a generalization error perspective !) in a non-convex setting.
The ‘most technical’ paper of the year award goes to Chaudhari. Pretty much connects everything from SGD and gradient flows to PDEs. A masterpiece that follows and completes ‘Entropy-SGD‘ :
The Bayesian view on this is the SGD-VI connection from Mandt & Hoffman. As you know, I’ve been a frequentist for years, sic.
The previous paper hinged on the continuous relaxation of SGD as a stochastic differential equation (gradient noise being treated as Gaussian because of the CLT). This explains the implications for batch size & gives a very nice chi-square formula in the process.
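A toy version of the SGD-as-SDE picture, under strong assumptions of my own choosing (1-D quadratic loss, Gaussian gradient noise whose variance scales as 1 / batch size, constant learning rate): the iterates behave like a discretised Ornstein-Uhlenbeck process, and their stationary variance is roughly proportional to learning rate / batch size. This is not the paper's formula, just a sketch of the mechanism behind the batch-size implications.

```python
import numpy as np

def stationary_variance(lr, batch_size, curvature=1.0, noise_var=1.0):
    """Stationary variance of theta <- theta - lr * (curvature * theta + noise).

    Exact for this linear recursion; for small lr it is approximately
    lr * noise_var / (2 * batch_size * curvature).
    """
    return lr * noise_var / (batch_size * curvature * (2.0 - lr * curvature))

def simulate_sgd(lr, batch_size, steps=100_000, curvature=1.0, noise_var=1.0, seed=0):
    """Run noisy SGD on a 1-D quadratic and return the empirical variance of the iterates."""
    rng = np.random.default_rng(seed)
    # Minibatch gradient noise: Gaussian by the CLT, variance noise_var / batch_size.
    noise = rng.standard_normal(steps) * np.sqrt(noise_var / batch_size)
    theta = 0.0
    trace = np.empty(steps)
    for t in range(steps):
        theta -= lr * (curvature * theta + noise[t])
        trace[t] = theta
    return trace[steps // 10:].var()  # drop the first 10% as burn-in

for batch_size in (4, 8):
    print(batch_size, simulate_sgd(0.1, batch_size), stationary_variance(0.1, batch_size))
```

Doubling the batch size halves the stationary variance, and so does halving the learning rate (approximately), which is the usual argument for scaling the two together.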
Another paper with similar Ornstein-Uhlenbeck inspired results, from the lab of Yoshua Bengio :
Finally, another, very recent contribution to the SGD-SDE-VI trinity, courtesy of Chaudhari once again :
I’m a firm believer in the intuition that many insights as to why exactly deep learning works will come from the intersection of harmonic / L2 analysis (as seen earlier with the scattering ideas) and information theory with entropy-based measures. Naftali Tishby’s ideas, while still controversial and under fire from a recent ICLR2018 submission, certainly bring us a step closer to that understanding.
Similarly, a beautiful paper from ICLR2017 takes a variational approach to the information bottleneck theory. My pick with a tiny edge over Beta-VAE in the ‘disentanglement’ category.
There have been umpteen billion generative models with twelve zillion ways of factorizing the log-likelihood this year, and they can mostly be understood under the light of convex duality. A necessary paper IMHO.
Finally, in a stunning display of technical mastery, and a sign that the mathematical arms race in deep learning is alive and well, Jeff Pennington combines complex analysis, random matrix theory, free probability, and graph morphisms (!) to derive an exact law for the eigenvalues of the Hessian of neural network loss functions, whereas the graph shape was only known empirically before, for instance in papers by Sagun et al. Must read.
That’s all from me, happy holidays to all !