Paper of the Month: September 2022

Once a month during the academic year, the statistics faculty select a paper for our students to read and discuss. Papers are selected based on their impact or historical value, or because they contain useful techniques or results.

Belkin, M., Hsu, D., Ma, S., & Mandal, S. (2019). Reconciling modern machine-learning practice and the classical bias–variance trade-off. Proceedings of the National Academy of Sciences, 116(32), 15849-15854.

Notes Preparer: Neil Spencer, Assistant Professor of Statistics

Conventional wisdom dictates that statistical models should be flexible enough to capture the data’s underlying structure, but not so flexible that they overfit to irrelevant noise. Balancing these two demands amounts to the bias-variance tradeoff, a central theme of traditional statistical modeling.

Interestingly, the recent success of deep learning has called some of this wisdom into question. Deep learning models are routinely over-parameterized, involving more model parameters than there are data points. Standard practice is to interpolate (perfectly fit) the sample data, a clear recipe for overfitting. Nevertheless, deep learning can perform remarkably well on held-out data, seemingly breaking the rules of the bias-variance tradeoff. What is going on?

This paper addresses this question by describing the “double descent” phenomenon, which I view as one of the most intriguing statistical findings of recent years. Empirically, once a model is over-parameterized to the point of interpolation, adding even more parameters can cause out-of-sample performance to begin improving again—sometimes surpassing the best under-parameterized model. This explains why over-parameterized deep learning models have the potential to perform so well.
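The phenomenon can be reproduced in miniature with a simulation. The sketch below is one illustrative setup, not the paper's exact experiment: it fits ReLU random-feature regressions of increasing width to noisy data from a linear target, using the minimum-norm least-squares solution (which interpolates once the number of features reaches the number of training points), and reports held-out error at each width. The specific target, noise level, and widths are arbitrary choices for illustration.

```python
# Minimal double-descent simulation: random-feature least squares.
# All sizes, the target function, and the noise level are illustrative.
import numpy as np

rng = np.random.default_rng(0)
n_train, n_test, d = 20, 200, 5

# Noisy training data from a simple linear target.
beta = rng.normal(size=d)
X_train = rng.normal(size=(n_train, d))
X_test = rng.normal(size=(n_test, d))
y_train = X_train @ beta + 0.5 * rng.normal(size=n_train)
y_test = X_test @ beta

def random_features(X, W):
    """ReLU random features with fixed (untrained) first-layer weights W."""
    return np.maximum(X @ W, 0.0)

# Widths below, at, and above the interpolation threshold (n_train = 20).
widths = [2, 5, 10, 15, 20, 40, 100, 400]
test_errors = []
for p in widths:
    W = rng.normal(size=(d, p)) / np.sqrt(d)
    Phi_train = random_features(X_train, W)
    Phi_test = random_features(X_test, W)
    # Minimum-norm least-squares fit; interpolates once p >= n_train.
    coef, *_ = np.linalg.lstsq(Phi_train, y_train, rcond=None)
    test_errors.append(np.mean((Phi_test @ coef - y_test) ** 2))

for p, err in zip(widths, test_errors):
    print(f"width {p:4d}: test MSE {err:.3f}")
```

In runs of this kind, test error typically deteriorates as the width approaches the interpolation threshold and then falls again as the model grows far beyond it—the second "descent."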

My plan for the discussion is to focus on the general principles and statistical implications of the “double descent” phenomenon. What does this discovery tell us about traditional statistics? Why did it go unnoticed for so long? How could one distinguish a “good” interpolating model from the sea of bad ones? Could more traditional statistical ideas like regularization, Bayesian inference, or nonparametric modeling be helpful?

Other relevant papers that cover various aspects of this problem in greater detail are:

  • Bartlett, P. L., Long, P. M., Lugosi, G., & Tsigler, A. (2020). Benign overfitting in linear regression. Proceedings of the National Academy of Sciences, 117(48), 30063-30070.
  • Belkin, M. (2021). Fit without fear: remarkable mathematical phenomena of deep learning through the prism of interpolation. Acta Numerica, 30, 203-248.
  • Hastie, T., Montanari, A., Rosset, S., & Tibshirani, R. J. (2022). Surprises in high-dimensional ridgeless least squares interpolation. The Annals of Statistics, 50(2), 949-986.
  • Bartlett, P. L., Montanari, A., & Rakhlin, A. (2021). Deep learning: a statistical viewpoint. Acta Numerica, 30, 87-201.