Understanding the difficulty of training deep feedforward neural networks, X. Glorot, 2010
Exact solutions to the nonlinear dynamics of learning in deep linear neural networks, A. Saxe, 2014
Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification, K. He, 2015
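A minimal numpy sketch of the two initialization rules from the Glorot and He papers above, assuming a dense layer with weights of shape `(fan_in, fan_out)`; the function names are illustrative, not from the papers:

```python
import numpy as np

def glorot_uniform(fan_in, fan_out, rng=None):
    # Glorot & Bengio (2010): keep activation/gradient variance roughly
    # constant across layers by scaling with both fan_in and fan_out.
    rng = rng or np.random.default_rng(0)
    limit = np.sqrt(6.0 / (fan_in + fan_out))
    return rng.uniform(-limit, limit, size=(fan_in, fan_out))

def he_normal(fan_in, fan_out, rng=None):
    # He et al. (2015): a ReLU zeroes half its inputs on average, so the
    # variance is scaled by 2 / fan_in instead.
    rng = rng or np.random.default_rng(0)
    std = np.sqrt(2.0 / fan_in)
    return rng.normal(0.0, std, size=(fan_in, fan_out))
```

Both rules target the same goal (stable signal variance through depth); the He variant simply corrects the Glorot derivation for the rectifier nonlinearity.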
Dropout: A Simple Way to Prevent Neural Networks from Overfitting, N. Srivastava, 2014
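A sketch of the standard "inverted dropout" formulation of the technique above (the paper scales at test time instead; scaling at train time is the equivalent form most libraries use):

```python
import numpy as np

def dropout(x, p_drop, train=True, rng=None):
    # Inverted dropout: zero each unit with probability p_drop and scale
    # the survivors by 1/(1 - p_drop), so the expected activation matches
    # test time, where the layer is the identity.
    if not train or p_drop == 0.0:
        return x
    rng = rng or np.random.default_rng(0)
    mask = rng.random(x.shape) >= p_drop
    return x * mask / (1.0 - p_drop)
```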
Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift, S. Ioffe, 2015 (+ Recurrent Batch Normalization, T. Cooijmans, 2016)
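The training-time batch norm transform, as a sketch that ignores the running statistics used at inference; `gamma` and `beta` are the learned scale and shift from the paper:

```python
import numpy as np

def batch_norm(x, gamma, beta, eps=1e-5):
    # Normalize each feature over the batch dimension, then rescale and
    # shift with learned parameters (Ioffe & Szegedy, 2015).
    mu = x.mean(axis=0)
    var = x.var(axis=0)
    x_hat = (x - mu) / np.sqrt(var + eps)
    return gamma * x_hat + beta
```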
Deep Sparse Rectifier Neural Networks, X. Glorot, 2011 (+ Fast and Accurate Deep Network Learning by Exponential Linear Units (ELUs), D.-A. Clevert, 2015)
Bridging nonlinearities and stochastic regularizers with Gaussian error linear units, D. Hendrycks, 2016
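The three activations from the two entries above, side by side; the GELU uses the tanh approximation given in the Hendrycks & Gimpel paper:

```python
import numpy as np

def relu(x):
    # Rectifier (Glorot et al., 2011): identity for x > 0, zero otherwise.
    return np.maximum(0.0, x)

def elu(x, alpha=1.0):
    # ELU (Clevert et al., 2015): smooth negative saturation at -alpha.
    return np.where(x > 0, x, alpha * (np.exp(x) - 1.0))

def gelu(x):
    # GELU (Hendrycks & Gimpel, 2016): x * Phi(x), tanh approximation.
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))
```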
Estimating or Propagating Gradients Through Stochastic Neurons for Conditional Computation, Y. Bengio, 2013
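A sketch of the straight-through estimator from the Bengio et al. paper for a stochastic binary unit, written as explicit forward/backward helpers since numpy has no autograd; the function names are hypothetical:

```python
import numpy as np

def st_bernoulli_forward(logits, rng=None):
    # Stochastic binary neuron: sample h ~ Bernoulli(sigmoid(logits)).
    rng = rng or np.random.default_rng(0)
    p = 1.0 / (1.0 + np.exp(-logits))
    h = (rng.random(logits.shape) < p).astype(float)
    return h, p

def st_bernoulli_backward(grad_h):
    # Straight-through estimator (Bengio et al., 2013): the sampling step
    # is non-differentiable, so the incoming gradient is passed through
    # unchanged, as if the forward pass had been the identity.
    return grad_h
```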
The Concrete Distribution: A Continuous Relaxation of Discrete Random Variables, C. Maddison, 2016 (+ Categorical Reparameterization with Gumbel-Softmax, E. Jang, 2016)
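A sketch of drawing one Concrete / Gumbel-Softmax sample, the relaxation both of the papers above introduce; the small epsilons guard the logs and are an implementation detail, not from the papers:

```python
import numpy as np

def gumbel_softmax(logits, temperature=1.0, rng=None):
    # Perturb the logits with Gumbel(0, 1) noise, then take a softmax at
    # the given temperature; as temperature -> 0 the sample approaches a
    # one-hot draw from the underlying categorical distribution.
    rng = rng or np.random.default_rng(0)
    u = rng.random(logits.shape)
    g = -np.log(-np.log(u + 1e-20) + 1e-20)  # Gumbel(0, 1) samples
    y = (logits + g) / temperature
    y = y - y.max()                          # numerical stabilization
    e = np.exp(y)
    return e / e.sum()
```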
On the importance of initialization and momentum in deep learning, I. Sutskever, 2013
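A sketch of one Nesterov momentum step in the form analyzed by Sutskever et al.: evaluate the gradient at the look-ahead point before updating the velocity. The helper name is illustrative:

```python
def nesterov_step(w, v, grad_fn, lr=0.01, mu=0.9):
    # Look-ahead gradient at w + mu*v, then update velocity and position.
    g = grad_fn(w + mu * v)
    v = mu * v - lr * g
    return w + v, v
```

Setting `mu=0` recovers plain gradient descent; the paper's point is that well-tuned momentum plus careful initialization closes much of the gap to second-order methods.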
Adaptive subgradient methods for online learning and stochastic optimization, J. Duchi, 2011 (+ Adadelta: An Adaptive Learning Rate Method, M. Zeiler, 2012)
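A sketch of the Adagrad update from the Duchi et al. paper: per-parameter learning rates that shrink with the accumulated squared gradients (Adadelta replaces this ever-growing accumulator with a decaying average):

```python
import numpy as np

def adagrad_step(w, grad, cache, lr=0.01, eps=1e-8):
    # Accumulate squared gradients, then divide the step by their root,
    # so frequently-updated parameters get smaller effective step sizes.
    cache = cache + grad**2
    w = w - lr * grad / (np.sqrt(cache) + eps)
    return w, cache
```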
Adam: A Method for Stochastic Optimization, D. Kingma, 2014 (+ Incorporating Nesterov Momentum into Adam, T. Dozat, 2016)
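One Adam step as a sketch, combining the momentum and Adagrad-style ideas from the entries above: exponential moving averages of the gradient (`m`) and its square (`v`), with bias correction for their zero initialization (`t` is the 1-based step count):

```python
import numpy as np

def adam_step(w, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    # Update biased first and second moment estimates.
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad**2
    # Bias-correct them (both start at zero), then take the step.
    m_hat = m / (1 - beta1**t)
    v_hat = v / (1 - beta2**t)
    w = w - lr * m_hat / (np.sqrt(v_hat) + eps)
    return w, m, v
```

The Dozat follow-up (NAdam) swaps the momentum term for its Nesterov variant.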
SGDR: Stochastic Gradient Descent with Warm Restarts, I. Loshchilov, 2016
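The SGDR learning-rate schedule as a sketch: cosine annealing from `lr_max` to `lr_min` over a cycle, then a restart, with each cycle `T_mult` times longer than the last; default values here are illustrative:

```python
import math

def sgdr_lr(t, lr_min=0.0, lr_max=0.1, T_0=10, T_mult=2):
    # Find which restart cycle epoch t falls in, and its position within it.
    T_i, t_cur = T_0, t
    while t_cur >= T_i:
        t_cur -= T_i
        T_i *= T_mult
    # Cosine decay from lr_max down to lr_min across the current cycle.
    return lr_min + 0.5 * (lr_max - lr_min) * (1 + math.cos(math.pi * t_cur / T_i))
```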