Here I deliberately omitted Boltzmann machines and deep belief nets, since they’re state of the art on exactly nothing. There is no complete, operational unified theory of deep learning in neural networks as of yet – hence most insights come from borrowing tools from adjacent sub-fields of science, such as statistical mechanics, multiscale analysis or Riemannian geometry, and pushing them as far as possible to get closed-form expressions. As a result, these papers are technically fairly hardcore.
Spin-glass models of neural networks, D. Amit, 1985
Flat minima, S. Hochreiter, 1997
How transferable are features in deep neural networks?, J. Yosinski, 2014
Identifying and attacking the saddle point problem in high-dimensional non-convex optimization, Y. Dauphin, 2014
A Mathematical Motivation for Complex-Valued Convolutional Networks, J. Bruna, 2015 (+ Invariant Scattering Convolution Networks, J. Bruna, 2012)
The loss surfaces of multilayer networks, A. Choromanska, 2015
Transition to chaos in random neuronal networks, J. Kadmon, 2015
Maximally informative hierarchical representations of high-dimensional data, G. Ver Steeg, 2015 (+ Variational Information Maximization for feature selection, S. Gao, 2016)
On the expressive power of deep neural networks, M. Raghu, 2017 (+ Deep Information Propagation, S. Schoenholz, 2017)
Deep learning without poor local minima, K. Kawaguchi, 2016
Deep neural networks with random Gaussian weights: a universal classification strategy?, R. Giryes, 2016 (+ Robust Large Margin Deep Neural Networks, J. Sokolic, 2016)
Topology and geometry of half-rectified network optimization, C. Freeman, 2016
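To give a flavour of the saddle-point story from the Dauphin et al. and Kawaguchi entries above, here is a minimal sketch (my own toy example, not taken from either paper): the two-weight "deep linear network" loss (w1·w2 − 1)² has a critical point at the origin where the gradient vanishes but the Hessian has one negative eigenvalue – a saddle, not a poor local minimum.

```python
import numpy as np

# Toy two-layer linear "network": loss L(w1, w2) = (w1 * w2 - 1)^2.
# The origin is a critical point, but it is a saddle, not a minimum --
# the kind of point the high-dimensional non-convex optimization
# literature argues dominates deep-network loss surfaces.

def loss(w):
    w1, w2 = w
    return (w1 * w2 - 1.0) ** 2

def hessian(w, eps=1e-5):
    """Numerical Hessian via central finite differences."""
    n = len(w)
    H = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            e_i = np.zeros(n); e_i[i] = eps
            e_j = np.zeros(n); e_j[j] = eps
            H[i, j] = (loss(w + e_i + e_j) - loss(w + e_i - e_j)
                       - loss(w - e_i + e_j) + loss(w - e_i - e_j)) / (4 * eps ** 2)
    return H

w0 = np.zeros(2)                         # gradient of the loss vanishes here
eigvals = np.linalg.eigvalsh(hessian(w0))
print(eigvals)                           # one negative, one positive eigenvalue -> saddle
```

The analytic Hessian at the origin is [[0, −2], [−2, 0]], with eigenvalues ±2; gradient descent stalls near such points even though escape directions exist.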