References

[1]    Ben Adlam, Jake Levinson, and Jeffrey Pennington. A random matrix perspective on mixtures of nonlinearities for deep learning. arXiv preprint arXiv:1912.00827, 2019.

[2]    Ben Adlam and Jeffrey Pennington. The neural tangent kernel in high dimensions: Triple descent and a multi-scale theory of generalization. In International Conference on Machine Learning. PMLR, 2020.

[3]    Madhu S Advani, Andrew M Saxe, and Haim Sompolinsky. High-dimensional dynamics of generalization error in neural networks. Neural Networks, 132:428–446, 2020.

[4]    Daniel J Amit, Hanoch Gutfreund, and Haim Sompolinsky. Spin-glass models of neural networks. Physical Review A, 1985.

[5]    Greg W Anderson, Alice Guionnet, and Ofer Zeitouni. An introduction to random matrices. Cambridge University Press, 2010.

[6]    Z. D. Bai and Jack W. Silverstein. Spectral analysis of large dimensional random matrices, volume 20. Springer, 2010.

[7]    Z. D. Bai and Jack W. Silverstein. CLT for linear spectral statistics of large-dimensional sample covariance matrices. Ann. Probab., 32(1A):553–605, 2004.

[8]    J Baik, G Ben Arous, and S Péché. Phase transition of the largest eigenvalue for nonnull complex sample covariance matrices. The Annals of Probability, 33(5), sep 2005.

[9]    Jinho Baik and Jack W. Silverstein. Eigenvalues of large sample covariance matrices of spiked population models. J. Multivariate Anal., 97(6):1382–1408, 2006.

[10]    Marco Baity-Jesi, Levent Sagun, Mario Geiger, Stefano Spigler, Gérard Ben Arous, Chiara Cammarota, Yann LeCun, Matthieu Wyart, and Giulio Biroli. Comparing dynamics: Deep neural networks versus glassy systems. In International Conference on Machine Learning. PMLR, 2018.

[11]    Peter L Bartlett, Philip M Long, Gábor Lugosi, and Alexander Tsigler. Benign overfitting in linear regression. Proceedings of the National Academy of Sciences, 117(48):30063–30070, 2020.

[12]    Nicholas P Baskerville, Diego Granziol, and Jonathan P Keating. Applicability of random matrix theory in deep learning. arXiv preprint arXiv:2102.06740, 2021.

[13]    Mikhail Belkin, Daniel Hsu, Siyuan Ma, and Soumik Mandal. Reconciling modern machine-learning practice and the classical bias–variance trade-off. Proceedings of the National Academy of Sciences, 116(32):15849–15854, 2019.

[14]    Mikhail Belkin, Daniel Hsu, and Ji Xu. Two models of double descent for weak features. SIAM Journal on Mathematics of Data Science, 2(4):1167–1180, 2020.

[15]    Mikhail Belkin, Siyuan Ma, and Soumik Mandal. To understand deep learning we need to understand kernel learning. In International Conference on Machine Learning, pages 541–549. PMLR, 2018.

[16]    A Bloemendal and B Virág. Limits of spiked random matrices I. Probability Theory and Related Fields, 156(3-4):795–825, aug 2013.

[17]    K H Borgwardt. The simplex method: A probabilistic analysis. Springer-Verlag, Berlin, Heidelberg, 1987.

[18]    Jean-Philippe Bouchaud and Marc Potters. Financial applications of random matrix theory: a short review. arXiv preprint arXiv:0910.1205, 2009.

[19]    P Bourgade, L Erdős, and H-T Yau. Edge Universality of Beta Ensembles. Communications in Mathematical Physics, 332(1):261–353, nov 2014.

[20]    Joël Bun, Jean-Philippe Bouchaud, and Marc Potters. Cleaning large correlation matrices: tools from random matrix theory. Phys. Rep., 666:1–109, 2017.

[21]    Alfredo Canziani, Adam Paszke, and Eugenio Culurciello. An analysis of deep neural network models for practical applications. arXiv preprint arXiv:1605.07678, 2016.

[22]    Anna Choromanska, Mikael Henaff, Michael Mathieu, Gérard Ben Arous, and Yann LeCun. The loss surfaces of multilayer networks. In Artificial Intelligence and Statistics. PMLR, 2015.

[23]     D Dadush and S Huiberts. A Friendly Smoothed Analysis of the Simplex Method. SIAM Journal on Computing, 49(5):STOC18–449–STOC18–499, jan 2020.

[24]    G B Dantzig. Origins of the simplex method. In A history of scientific computing, pages 141–151. ACM, New York, NY, USA, jun 1990.

[25]    Yann Dauphin, Razvan Pascanu, Caglar Gulcehre, Kyunghyun Cho, Surya Ganguli, and Yoshua Bengio. Identifying and attacking the saddle point problem in high-dimensional non-convex optimization. arXiv preprint arXiv:1406.2572, 2014.

[26]    P Deift. Orthogonal Polynomials and Random Matrices: a Riemann-Hilbert Approach. Amer. Math. Soc., Providence, RI, 2000.

[27]    P Deift and T Trogdon. Universality for Eigenvalue Algorithms on Sample Covariance Matrices. SIAM Journal on Numerical Analysis, 2017.

[28]    Percy A Deift, Govind Menon, Sheehan Olver, and Thomas Trogdon. Universality in numerical computations with random data. Proceedings of the National Academy of Sciences, 2014.

[29]    A Deshpande and D A Spielman. Improved Smoothed Analysis of the Shadow Vertex Simplex Method. In 46th Annual IEEE Symposium on Foundations of Computer Science (FOCS’05), pages 349–356. IEEE, 2005.

[30]    X Ding and T Trogdon. The conjugate gradient algorithm on a general class of spiked covariance matrices. arXiv preprint arXiv:2106.13902, jun 2021.

[31]    X Ding and F Yang. Spiked separable covariance matrices and principal components. arXiv preprint arXiv:1905.13060, may 2019.

[32]    Carles Domingo-Enrich, Fabian Pedregosa, and Damien Scieur. Average-case acceleration for bilinear games and normal matrices. arXiv preprint arXiv:2010.02076, 2020.

[33]    Viktor Dotsenko. An introduction to the theory of spin glasses and neural networks. World Scientific, 1995.

[34]    Petros Drineas and Michael W Mahoney. Lectures on randomized numerical linear algebra. The Mathematics of Data, 2018.

[35]    Petros Drineas, Michael W Mahoney, and Nello Cristianini. On the Nyström method for approximating a Gram matrix for improved kernel-based learning. Journal of Machine Learning Research, 2005.

[36]    P. Erdős. A selection of problems and results in combinatorics. In Recent trends in combinatorics (Mátraháza, 1995), pages 1–6. Cambridge Univ. Press, Cambridge, 1995.

[37]    Paul Erdős and Alfréd Rényi. On the evolution of random graphs. Publ. Math. Inst. Hung. Acad. Sci., 1960.

[38]    L Erdős, A Knowles, H-T Yau, and J Yin. Spectral statistics of Erdős–Rényi graphs I: Local semicircle law. The Annals of Probability, 41(3B), may 2013.

[39]    Elizabeth Gardner and Bernard Derrida. Optimal storage properties of neural network models. Journal of Physics A: Mathematical and General, 1988.

[40]    Behrooz Ghorbani, Shankar Krishnan, and Ying Xiao. An investigation into neural net optimization via hessian eigenvalue density. In International Conference on Machine Learning. PMLR, 2019.

[41]    Herman H Goldstine and John von Neumann. Numerical inverting of matrices of high order. II. Proceedings of the American Mathematical Society, 1951.

[42]    Robert M Gower and Peter Richtárik. Randomized iterative methods for linear systems. SIAM Journal on Matrix Analysis and Applications, 2015.

[43]    R.L. Graham, D.E. Knuth, and O. Patashnik. Concrete mathematics. Addison-Wesley, Reading, MA, 1989.

[44]    George D. Greenwade. The Comprehensive TeX Archive Network (CTAN). TUGboat, 14(3):342–351, 1993.

[45]    Nathan Halko, Per-Gunnar Martinsson, and Joel A Tropp. Finding structure with randomness: Probabilistic algorithms for constructing approximate matrix decompositions. SIAM review, 2011.

[46]    Trevor Hastie, Andrea Montanari, Saharon Rosset, and Ryan J Tibshirani. Surprises in high-dimensional ridgeless least squares interpolation. arXiv preprint arXiv:1903.08560, 2019.

[47]    M Hestenes and E Stiefel. Methods of Conjugate Gradients for Solving Linear Systems. J. Research Nat. Bur. Standards, 49:409–436, 1952.

[48]    Arthur Jacot, Berfin Şimşek, Francesco Spadaro, Clément Hongler, and Franck Gabriel. Kernel alignment risk estimator: risk prediction from training data. arXiv preprint arXiv:2006.09796, 2020.

[49]    S Kaczmarz. Angenäherte Auflösung von Systemen linearer Gleichungen. Bulletin International de l'Académie Polonaise des Sciences et des Lettres, 1937. (English translation by Jason Stockmann.)

[50]    J Keating. The Riemann zeta-function and quantum chaology. In Quantum chaos. Elsevier, 1993.

[51]    A Knowles and J Yin. Anisotropic local laws for random matrices. Probability Theory and Related Fields, 169(1-2):257–352, oct 2017.

[52]    D.E. Knuth. Two notes on notation. Amer. Math. Monthly, 99:403–422, 1992.

[53]    E Kostlan. Complexity theory of numerical linear algebra. Journal of Computational and Applied Mathematics, 22(2-3):219–230, jun 1988.

[54]    J Kuczyński and H Woźniakowski. Estimating the Largest Eigenvalue by the Power and Lanczos Algorithms with a Random Start. SIAM Journal on Matrix Analysis and Applications, 13(4):1094–1122, oct 1992.

[55]    Jonathan Lacotte and Mert Pilanci. Optimal randomized first-order methods for least-squares problems. In International Conference on Machine Learning. PMLR, 2020.

[56]    Tengyuan Liang and Alexander Rakhlin. Just interpolate: Kernel “ridgeless” regression can generalize. The Annals of Statistics, 48(3):1329–1347, 2020.

[57]    Zhenyu Liao and Michael W Mahoney. Hessian eigenspectra of more realistic nonlinear models. arXiv preprint, 2021.

[58]    Zhenyu Liao. A random matrix framework for large dimensional machine learning and neural networks. PhD thesis, Université Paris-Saclay, 2019.

[59]    Zhenyu Liao, Romain Couillet, and Michael W Mahoney. A random matrix analysis of random Fourier features: beyond the Gaussian kernel, a precise phase transition, and the corresponding double descent. arXiv preprint arXiv:2006.05013, 2020.

[60]    Vladimir A Marčenko and Leonid Andreevich Pastur. Distribution of eigenvalues for some sets of random matrices. Mathematics of the USSR-Sbornik, 1967.

[61]    Madan Lal Mehta. Random matrices. Elsevier, 2004.

[62]    Song Mei and Andrea Montanari. The generalization error of random features regression: Precise asymptotics and double descent curve. arXiv preprint arXiv:1908.05355, 2019.

[63]    James A Mingo and Roland Speicher. Free probability and random matrices, volume 35. Springer, 2017.

[64]    Partha P Mitra. Understanding overfitting peaks in generalization error: Analytical risk curves for $\ell_2$ and $\ell_1$ penalized interpolation. arXiv preprint arXiv:1906.03667, 2019.

[65]    Hugh L Montgomery. The pair correlation of zeros of the zeta function. In Proc. Symp. Pure Math, 1973.

[66]    Preetum Nakkiran, Gal Kaplun, Yamini Bansal, Tristan Yang, Boaz Barak, and Ilya Sutskever. Deep double descent: Where bigger models and more data hurt. arXiv preprint arXiv:1912.02292, 2019.

[67]    Andrew M Odlyzko. On the distribution of spacings between zeros of the zeta function. Mathematics of Computation, 1987.

[68]    Vardan Papyan. Traces of class/cross-class structure pervade deep learning spectra. Journal of Machine Learning Research, 2020.

[69]    C Paquette, K Lee, F Pedregosa, and E Paquette. SGD in the Large: Average-case Analysis, Asymptotics, and Stepsize Criticality. arXiv preprint arXiv:2102.04396, 2021.

[70]    C Paquette, B van Merriënboer, and F Pedregosa. Halting Time is Predictable for Large Models: A Universality Property and Average-case Analysis. arXiv preprint arXiv:2006.04299, jun 2020.

[71]    Elliot Paquette and Thomas Trogdon. Universality for the conjugate gradient and minres algorithms on sample covariance matrices. arXiv preprint arXiv:2007.00640, 2020.

[72]    Sandrine Péché. A note on the Pennington-Worah distribution. Electronic Communications in Probability, 2019.

[73]    Fabian Pedregosa and Damien Scieur. Acceleration through spectral density estimation. In International Conference on Machine Learning. PMLR, 2020.

[74]    Jeffrey Pennington and Yasaman Bahri. Geometry of neural network loss surfaces via random matrix theory. In Proceedings of the 34th International Conference on Machine Learning, volume 70 of Proceedings of Machine Learning Research, pages 2798–2806. PMLR, 2017.

[75]    Jeffrey Pennington and Pratik Worah. Nonlinear random matrix theory for deep learning. In Advances in Neural Information Processing Systems, 2017.

[76]    C W Pfrang, P Deift, and G Menon. How long does it take to compute the eigenvalues of a random symmetric matrix? Random matrix theory, interacting particle systems, and integrable systems, MSRI Publications, 65:411–442, 2014.

[77]    Alexander Rakhlin and Xiyu Zhai. Consistency of interpolation with Laplace kernels is a high-dimensional phenomenon. In Conference on Learning Theory, pages 2595–2623. PMLR, 2019.

[78]    Norbert Rosenzweig and Charles E. Porter. "Repulsion of energy levels" in complex atomic spectra. Phys. Rev., 1960.

[79]    Alessandro Rudi, Raffaello Camoriano, and Lorenzo Rosasco. Less is more: Nyström computational regularization. In Advances in Neural Information Processing Systems, 2015.

[80]    Levent Sagun, V Ugur Guney, Gerard Ben Arous, and Yann LeCun. Explorations on high dimensional landscapes. arXiv preprint arXiv:1412.6615, 2014.

[81]    Levent Sagun, Thomas Trogdon, and Yann LeCun. Universal halting times in optimization and machine learning. Quarterly of Applied Mathematics, 2017.

[83]    S Smale. On the average number of steps of the simplex method of linear programming. Mathematical Programming, 27(3):241–262, oct 1983.

[84]    D A Spielman and S-H Teng. Smoothed analysis of algorithms. Journal of the ACM, 51(3):385–463, may 2004.

[85]    Daniel A Spielman and Shang-Hua Teng. Spectral sparsification of graphs. SIAM Journal on Computing, 2011.

[86]    T Strohmer and R Vershynin. A Randomized Kaczmarz Algorithm with Exponential Convergence. Journal of Fourier Analysis and Applications, 15(2):262–278, apr 2009.

[87]    Antonia M Tulino and Sergio Verdú. Random matrix theory and wireless communications. Now Publishers, 2004.

[88]    R Vershynin. Beyond Hirsch Conjecture: Walks on Random Polytopes and Smoothed Complexity of the Simplex Method. SIAM Journal on Computing, 39, 2009.

[89]    Eugene Wigner. Characteristic vectors of bordered matrices with infinite dimensions. Annals of Mathematics, 1955.

[90]    Christopher Williams and Matthias Seeger. Using the Nyström method to speed up kernel machines. In Proceedings of the 14th Annual Conference on Neural Information Processing Systems, 2001.

[91]    John Wishart. The generalised product moment distribution in samples from a normal multivariate population. Biometrika, 1928.

[92]    Zhewei Yao, Amir Gholami, Kurt Keutzer, and Michael W Mahoney. PyHessian: Neural networks through the lens of the Hessian. In 2020 IEEE International Conference on Big Data (Big Data). IEEE, 2020.

[93]    Chiyuan Zhang, Samy Bengio, Moritz Hardt, Benjamin Recht, and Oriol Vinyals. Understanding deep learning (still) requires rethinking generalization. Communications of the ACM, 64(3):107–115, 2021.