Loss Landscapes and the Hessian Matrix

February 2025

Much can be said about the optimizability of a loss function from its Hessian matrix. Convexity of an objective $L$ makes life easy: for "nice" convex functions (by "nice", I mean smooth and twice-differentiable), gradient steps provably head toward the global minimum, while for non-convex ones no such guarantee exists. In either setup, choosing the right weight updates from gradients and from their Jacobian (the Hessian matrix) becomes pivotal. Understanding the Hessian's structure helps us understand the loss function's complexity and how the structure of the data (for instance, its output classes) shapes training. Simply put, the Hessian is the Jacobian of the gradient of a scalar-valued function of many variables, and its spectrum is up for grabs for us to analyze...

Deep Learning Loss Landscapes

A loss landscape is simply the surface you get when plotting the loss function $L(\theta)$ over all possible parameter configurations $\theta \in \mathbb{R}^D$ of your model. In classical convex optimization, this surface is "nice": the Hessian matrix of all second derivatives, $\nabla^2 L(\theta)$, is positive semidefinite everywhere, guaranteeing that every local minimum is a global one. We done? Nope.

Alas, deep neural networks absolutely shatter this simplicity. With millions of parameters interacting through nonlinear activations, the loss landscape becomes a complex, highly nonconvex terrain littered with critical points. This is where the Hessian can tell us a thing or two.

Formally, it's the $D \times D$ matrix:

$$H(\theta)_{ij} = \frac{\partial^2 L(\theta)}{\partial \theta_i \partial \theta_j}$$

which stems from the second-order Taylor expansion:

$$L(\theta + \delta) \approx L(\theta) + \nabla L(\theta)^T \delta + \frac{1}{2} \delta^T H(\theta) \delta \quad \text{(ignore higher order terms)}$$

The Hessian tells us exactly how the loss changes as we move in any direction $\delta$: directions with large positive eigenvalues are steep valley walls, while negative eigenvalues signal saddle points with downhill escape routes. The eigenvalues encode the landscape's shape around any point of interest!
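
To make this concrete, here is a tiny sketch (PyTorch; the toy function is mine, not from any of the cited papers) that eigendecomposes the Hessian of $f(x, y) = x^2 - y^2$, the textbook saddle: one positive eigenvalue (uphill curvature) and one negative one (the downhill escape).

```python
import torch

# f has a saddle at the origin: curvature +2 along x, -2 along y
def f(w):
    x, y = w
    return x**2 - y**2

w0 = torch.zeros(2)
H = torch.autograd.functional.hessian(f, w0)   # 2x2 Hessian at w0
eigvals, eigvecs = torch.linalg.eigh(H)        # symmetric eigendecomposition
print(H)         # tensor([[ 2., 0.], [ 0., -2.]])
print(eigvals)   # tensor([-2., 2.]) -> one negative direction: a saddle
```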

A Universal Pattern

Computing the full Hessian for deep networks is usually infeasible: storing it alone takes $O(D^2)$ memory, which explodes for modern architectures. But researchers have developed clever approximations by examining the Hessian through spectral theory. Upon eigendecomposition of the Hessian of trained deep networks, we consistently observe a universal pattern: a "bulk" of eigenvalues clustered near zero, and just a handful of "outliers" with much larger magnitudes [1]. This is a direct consequence of rank deficiency/singularity, a strong signature of overparameterization (and, arguably, superposition). Sagun et al. [2] substantiate this by showing that increasing network width adds zero eigenvalues without creating new large ones.

So, in practice, only a few directions (those corresponding to the large eigenvalues) exhibit significant curvature; the vast majority are nearly flat. Even more intriguingly, the number of large eigenvalues often equals the number of output classes, suggesting these directions capture exactly the curvature needed to distinguish between classes. Even if we change the architecture, the task, or the modality, the "bulk" stays roughly the same as long as we optimize with SGD; only the "outliers" shift in number and magnitude.
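
The pattern is easy to poke at on a model small enough for an exact eigendecomposition. Below is a rough sketch (the architecture, sizes, and training loop are mine, purely for illustration): a 75-parameter MLP on random data, whose full Hessian we can afford to build and diagonalize. The exact numbers depend on the run, but a near-zero bulk and a handful of larger eigenvalues are usually visible.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
d_in, d_hid, n_cls, n = 5, 8, 3, 64        # tiny on purpose: the full Hessian is D x D
X = torch.randn(n, d_in)
y = torch.randint(0, n_cls, (n,))
D = d_in * d_hid + d_hid + d_hid * n_cls + n_cls   # = 75 parameters

def loss_fn(theta):
    """Cross-entropy of a 2-layer tanh MLP, written as a function of one flat
    parameter vector so autograd can hand us the full Hessian in a single call."""
    W1 = theta[: d_in * d_hid].reshape(d_in, d_hid)
    b1 = theta[d_in * d_hid : d_in * d_hid + d_hid]
    off = d_in * d_hid + d_hid
    W2 = theta[off : off + d_hid * n_cls].reshape(d_hid, n_cls)
    b2 = theta[off + d_hid * n_cls :]
    logits = torch.tanh(X @ W1 + b1) @ W2 + b2
    return F.cross_entropy(logits, y)

# Roughly train with full-batch gradient descent so we inspect the Hessian near a minimum.
theta = 0.1 * torch.randn(D, requires_grad=True)
for _ in range(500):
    loss = loss_fn(theta)
    grad, = torch.autograd.grad(loss, theta)
    theta = (theta - 0.5 * grad).detach().requires_grad_(True)

H = torch.autograd.functional.hessian(loss_fn, theta.detach())   # (75, 75)
eigs = torch.linalg.eigvalsh(H)
print("top eigenvalues:", eigs[-n_cls:])                          # the few "outliers"
print("fraction near zero:", (eigs.abs() < 1e-3).float().mean().item())
```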

But what does singularity really mean?

This extreme singularity has profound implications that kinda changed how I look at training NNs. Since most Hessian eigenvalues are nearly zero, the loss landscape has many flat directions at minima: wide valleys rather than narrow pits.

In such flat, overparameterized regimes, the classical notion of isolated basins of attraction is a bit misleading. Different minima found by different training runs or batch sizes can be connected through these flat directions, effectively lying in the same broad basin. The landscape isn't a collection of isolated wells; it's more like a vast plateau with gentle undulations.

Over the course of training, the Hessian typically flattens further. Empirical results frequently show that both the largest eigenvalue and the trace (the sum of all eigenvalues) decrease as gradient descent converges. In other words, SGD tends to steer models into flatter regions of the loss surface, and this tendency toward flatness is often invoked as the mathematical rationale for generalization.
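
Conveniently, the trace can be tracked during training without ever forming $H$: Hutchinson's estimator needs only Hessian-vector products, since for random probe vectors with $\mathbb{E}[vv^\top] = I$,

$$\mathrm{Tr}(H) \;=\; \mathbb{E}_v\!\left[v^\top H v\right] \;\approx\; \frac{1}{M}\sum_{m=1}^{M} v_m^\top H v_m, \qquad v_m \in \{-1,+1\}^D \ \text{i.i.d. Rademacher.}$$

(A small code sketch of this, together with estimating $\lambda_{\max}$, appears under Practical Implications below.)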

The Geometry of Saddles and Flatness

We can borrow some neat tools from random matrix theory, which suggests that saddle points (critical points with negative eigenvalues) vastly outnumber true local extrema [3]. As the loss value increases, a random Hessian's spectrum (following Wigner's semicircle law [4]) shifts leftward: many eigenvalues become negative, with a large density accumulating near zero. This explains why high-dimensional loss surfaces are studded with saddle plateaus that can slow training. Dauphin et al. made an even stronger argument: in very high dimensions, almost all non-global critical points are saddles [5]. Why? Because it's exponentially unlikely for all curvature directions to be strictly positive simultaneously.
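
If you have never seen a semicircle come out of a random matrix, it takes a few lines to produce one (a numpy sketch; the GOE-style scaling is the standard convention, not taken from the cited papers):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 2000
A = rng.standard_normal((n, n))
H = (A + A.T) / np.sqrt(2 * n)   # GOE-like symmetric matrix, scaled so the
eigs = np.linalg.eigvalsh(H)     # spectrum fills [-2, 2] (Wigner semicircle)
print(eigs.min(), eigs.max())    # ~ -2.0, ~ 2.0
# Shifting the matrix (subtracting c*I) slides the whole semicircle left, so a growing
# fraction of eigenvalues goes negative -- the RMT picture of saddles at high loss.
```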

One of the most debated connections in deep learning is between the "flatness" of minima and generalization performance. A flat minimum has many small Hessian eigenvalues (a broad valley), while a sharp minimum has at least one large eigenvalue (a narrow valley). We are taught, on the back of many empirical studies, that small-batch SGD tends to find flat minima with better test accuracy, while larger batches can converge to sharper minima that generalize worse [6]. This observation sparked methods like Entropy-SGD [7] and Sharpness-Aware Minimization (SAM) [8], which explicitly bias training toward flatter regions, roughly by keeping the Hessian spectral norm small. However, while these techniques do reduce the top eigenvalue $\lambda_{\max}$, the generalization benefit can vanish under certain conditions (for one, sharpness is not invariant to reparameterizations of the network). The relationship between Hessian-based flatness and generalization is more subtle than a simple "flatter is always better" story.
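
For a sense of what "biasing training toward flat regions" looks like operationally, here is a minimal, unoptimized sketch of the two-pass SAM update from [8]. The function names and the `loss_fn` closure over the current minibatch are my own conventions, and details such as per-layer scaling are ignored.

```python
import torch

def sam_step(model, loss_fn, optimizer, rho=0.05):
    """One Sharpness-Aware Minimization step: ascend to the (approximately) worst
    nearby weights, take the gradient there, then descend from the original point."""
    params = [p for p in model.parameters() if p.requires_grad]

    # Pass 1: gradient at the current weights gives the ascent direction.
    optimizer.zero_grad()
    loss_fn(model).backward()
    with torch.no_grad():
        grad_norm = torch.sqrt(sum(p.grad.pow(2).sum()
                                   for p in params if p.grad is not None)) + 1e-12
        eps = [rho * p.grad / grad_norm if p.grad is not None else torch.zeros_like(p)
               for p in params]
        for p, e in zip(params, eps):
            p.add_(e)                      # climb to the sharpest nearby point

    # Pass 2: gradient at the perturbed weights, then undo the climb and step.
    optimizer.zero_grad()
    loss_fn(model).backward()
    with torch.no_grad():
        for p, e in zip(params, eps):
            p.sub_(e)
    optimizer.step()
```

Here `loss_fn(model)` is assumed to recompute the loss on the current minibatch; two forward/backward passes per step is the price of the flatness bias.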

Equivalent Formulations


For a neural network, the Hessian can be written as:

$$H(\theta) = \frac{1}{N} \sum_{i=1}^N \nabla^2_\theta\, \ell\big(f(x_i;\theta),\, y_i\big)$$

i.e., the average of the per-example Hessians, where $\ell(f(x_i;\theta), y_i)$ is the loss on example $x_i$. In supervised settings, $H(\theta)$ decomposes into a positive-semidefinite term closely related to the empirical Fisher / Generalized Gauss-Newton matrix $G$, plus a residual $H'$ that involves second derivatives of the model outputs. This is the Gauss-Newton decomposition $H = G + H'$, and $G$, which captures the curvature contributed by the loss itself, tends to dominate near minima [9].
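
Writing the decomposition out makes the structure explicit. With per-example outputs $z_i = f(x_i;\theta) \in \mathbb{R}^C$ and Jacobians $J_i = \partial z_i / \partial \theta$,

$$H(\theta) \;=\; \underbrace{\frac{1}{N}\sum_{i=1}^{N} J_i^\top \big(\nabla^2_{z}\,\ell(z_i, y_i)\big)\, J_i}_{G \;(\text{PSD whenever } \ell \text{ is convex in } z)} \;+\; \underbrace{\frac{1}{N}\sum_{i=1}^{N}\sum_{c=1}^{C} \frac{\partial \ell(z_i, y_i)}{\partial z_{i,c}}\, \nabla^2_\theta f_c(x_i;\theta)}_{H' \;(\text{small when the residuals } \partial\ell/\partial z \text{ are small})}$$

Near a minimum that fits the data well, the residual derivatives $\partial\ell/\partial z$ shrink, so $H \approx G$, which is why Fisher/GGN approximations work as well as they do there.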

More specifically, for very wide networks Jacot et al. [10] show that training enters the neural tangent kernel (NTK) regime, where the model behaves roughly linearly in its parameters; there the Hessian is essentially its Gauss-Newton part, whose spectrum mirrors that of the NTK. In this regime the Hessian has many zero eigenvalues due to symmetries and the large parameter-to-data ratio, i.e., rank deficiency. This connects to classical results by Kawaguchi [11] showing that, for certain architectures, every local minimum is global: thanks to overparameterization, negative-curvature directions lead out of any would-be poor local minimum into the global valley.
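
For squared loss, this connection is, at bottom, a piece of linear algebra (an illustration of the link, not the precise statement in [10]). Stacking the per-example Jacobians into $J \in \mathbb{R}^{NC \times D}$,

$$G = \tfrac{1}{N}\, J^\top J \in \mathbb{R}^{D \times D}, \qquad \Theta = J J^\top \in \mathbb{R}^{NC \times NC} \ \text{(the empirical NTK Gram matrix)},$$

and $J^\top J$ and $J J^\top$ share the same nonzero eigenvalues (up to the $1/N$ factor). The curvature visible in parameter space mirrors the kernel in function space, and whenever $D > NC$, at least $D - NC$ directions are exact zero modes of $G$.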

Practical Implications

Understanding the Hessian's structure has real consequences for how we train and design neural networks.

Second-Order Optimization: In principle, we could use Hessian information to accelerate training via Newton's method, and Hessian-vector products can be computed exactly at roughly the cost of one extra backward pass [12]. In practice, exact Newton steps are prohibitively expensive for large networks (and plain Newton is actually attracted to saddle points), so Hessian-free and quasi-Newton methods approximate curvature implicitly [13]. One could also use saddle-free Newton, which flips the sign of the negative eigenvalues and steps along $|H|^{-1}\nabla L$ to escape saddles quickly [5].
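
The enabling trick (this is what [12] is about) is that $Hv$ can be formed by differentiating the scalar $\nabla L^\top v$, so it never materializes $H$. A minimal PyTorch sketch, with names of my own choosing:

```python
import torch

def hvp(loss, params, vec):
    """Hessian-vector product via double backprop: H v = d/d(theta) [(dL/d(theta)) . v]."""
    grads = torch.autograd.grad(loss, params, create_graph=True)   # dL/d(theta), graph kept
    grad_dot_v = sum((g * v).sum() for g, v in zip(grads, vec))    # scalar (dL/d(theta)) . v
    return torch.autograd.grad(grad_dot_v, params,
                               retain_graph=True)                  # one more backward -> H v

# Usage sketch: `vec` is a list of tensors shaped like the parameters.
# loss = criterion(model(x), y)                   # build the loss graph as usual
# Hv = hvp(loss, list(model.parameters()), vec)
```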

Regularization: We can explicitly penalize Hessian norms to encourage flatness. Adding a Hessian-trace penalty (with some coefficient $\gamma$) to the loss pushes SGD into flatter minima and has improved generalization in experiments. During training, Hessian-based metrics such as $\lambda_{\max}$, the condition number $\kappa = \lambda_{\max}/\lambda_{\min}$, and the trace $\mathrm{Tr}(H) = \sum_i \lambda_i$ provide valuable diagnostics of optimization health. The Generalized Gauss-Newton formulation also allows for clever approximations and targeted regularization strategies [14].
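
As a concrete example of such monitoring, here is a self-contained sketch (toy model and data invented for illustration) that estimates $\lambda_{\max}$ by power iteration and $\mathrm{Tr}(H)$ by Hutchinson probes, both built on the same double-backward Hessian-vector product:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
model = nn.Sequential(nn.Linear(10, 32), nn.Tanh(), nn.Linear(32, 3))
X, y = torch.randn(256, 10), torch.randint(0, 3, (256,))
params = [p for p in model.parameters() if p.requires_grad]

def hvp(vec):
    """Hessian-vector product of the training loss at the current weights."""
    loss = F.cross_entropy(model(X), y)
    grads = torch.autograd.grad(loss, params, create_graph=True)
    gdotv = sum((g * v).sum() for g, v in zip(grads, vec))
    return torch.autograd.grad(gdotv, params)

def lambda_max(iters=50):
    """Power iteration on hvp(): estimates the dominant (largest-magnitude) eigenvalue."""
    v = [torch.randn_like(p) for p in params]
    for _ in range(iters):
        Hv = hvp(v)
        norm = torch.sqrt(sum((h ** 2).sum() for h in Hv))
        v = [h / norm for h in Hv]
    Hv = hvp(v)
    return sum((h * u).sum() for h, u in zip(Hv, v)).item()   # Rayleigh quotient

def trace_hutchinson(samples=30):
    """Hutchinson estimator: average of v^T H v over Rademacher probes v."""
    total = 0.0
    for _ in range(samples):
        v = [torch.randint(0, 2, p.shape).float() * 2 - 1 for p in params]
        Hv = hvp(v)
        total += sum((h * u).sum() for h, u in zip(Hv, v)).item()
    return total / samples

print("lambda_max ~", lambda_max())
print("Tr(H)      ~", trace_hutchinson())
```

Estimating $\lambda_{\min}$ (for the condition number) is trickier on a near-singular Hessian; in practice people fall back on Lanczos-style methods or just track $\lambda_{\max}$ and the trace.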

Layer Structure: Sankar et al. [15] show that each layer's Hessian has an eigenspectrum similar to the full network's, with middle layers often dominating the curvature.

Looking Forward

Spectral analysis of the Hessian has mostly become a debugging tool for me: not something I run every time, but something I definitely check when the code seems correct yet training misbehaves. The eigenspectrum is directly interpretable and offers decent theoretical explanations for why simple first-order methods work surprisingly well despite the nonconvexity, and why optimization can stall near saddles.

Understanding Hessian geometry has already inspired new algorithms and sharpened our intuition about OOD generalization. Researchers continue to develop more scalable Hessian estimators and forge sharper theoretical links between curvature and generalization.

Ultimately, the Hessian's role in deep learning is to quantify the shape of loss basins. As we continue to scale neural networks to ever-larger sizes, understanding this geometry is practically essential. So, do try and look at the Hessian when you train a model!


Thanks for Reading :)

- Veeraraju


References

  1. Sagun, L., Evci, U., Guney, V. U., Dauphin, Y., & Bottou, L. (2017). Empirical Analysis of the Hessian of Over-Parametrized Neural Networks. arXiv:1706.04454.
  2. Sagun, L., Bottou, L., & LeCun, Y. (2016). Eigenvalues of the Hessian in Deep Learning: Singularity and Beyond. arXiv:1611.07476 / NIPS 2016 Workshop.
  3. Baskerville, N. P., Keating, J. P., Mezzadri, F., Najnudel, J., & Poplavska, V. (2021). Appearance of Random Matrix Theory in deep learning. Physica A, arXiv:2107.11003.
  4. Wigner, E. P. (1955). Characteristic vectors of bordered matrices with infinite dimensions. Annals of Mathematics, 62(3), 548-564.
  5. Dauphin, Y. N., Pascanu, R., Gulcehre, C., Cho, K., Ganguli, S., & Bengio, Y. (2014). Identifying and attacking the saddle point problem in high-dimensional non-convex optimization. NIPS 2014, arXiv:1406.2572.
  6. Keskar, N. S., Mudigere, D., Nocedal, J., Smelyanskiy, M., & Tang, P. T. P. (2017). On Large-Batch Training for Deep Learning: Generalization Gap and Sharp Minima. ICLR 2017, arXiv:1609.04836.
  7. Chaudhari, P., Choromanska, A., Soatto, S., LeCun, Y., Baldassi, C., Borgs, C., Chayes, J., Sagun, L., & Zecchina, R. (2017). Entropy-SGD: Biasing Gradient Descent Into Wide Valleys. ICLR 2017, arXiv:1611.01838.
  8. Foret, P., Kleiner, A., Mobahi, H., & Neyshabur, B. (2021). Sharpness-Aware Minimization for Efficiently Improving Generalization. ICLR 2021, arXiv:2010.01412.
  9. Martens, J. (2020). New Insights and Perspectives on the Natural Gradient Method. Journal of Machine Learning Research, 21(146), 1-76.
  10. Jacot, A., Gabriel, F., & Hongler, C. (2018). Neural Tangent Kernel: Convergence and Generalization in Neural Networks. NeurIPS 2018, arXiv:1806.07572.
  11. Kawaguchi, K. (2016). Deep Learning without Poor Local Minima. NeurIPS 2016.
  12. Pearlmutter, B. A. (1994). Fast exact multiplication by the Hessian. Neural Computation, 6(1), 147-160.
  13. Martens, J. (2010). Deep learning via Hessian-free optimization. ICML 2010.
  14. Schraudolph, N. N. (2002). Fast curvature matrix-vector products for second-order gradient descent. Neural Computation, 14(7), 1723-1738.
  15. Sankar, A. R., Khasbage, Y., Vigneswaran, R., & Balasubramanian, V. N. (2021). A Deeper Look at the Hessian Eigenspectrum of Deep Neural Networks and its Applications to Regularization. AAAI 2021.