Grokking as a Phase Transition in Neural Networks

A Survey through Statistical Physics and Mean Field Theory

Surveys · Published February 11, 2026

Abstract. Grokking—the phenomenon whereby a neural network achieves perfect training accuracy long before exhibiting any generalization—poses a striking challenge to conventional statistical learning theory (Power et al. 2022). This survey examines grokking through the lens of statistical physics and mean field theory. We argue that grokking is not an optimization anomaly but a bona fide phase transition: the system transitions from a lazy regime, governed by the frozen Neural Tangent Kernel (NTK), to a rich regime characterized by active feature learning (Kumar et al. 2024; Rubin, Seroussi, and Ringel 2024). We review the effective theory of Liu et al. (Liu et al. 2022), the first-order phase transition framework of Rubin et al. (Rubin, Seroussi, and Ringel 2024), physical order parameters that diagnose the transition (representation quality index, effective rank, kernel alignment), the dynamical mean field theory (DMFT) description, finite-width perturbative corrections via Feynman diagrams (Guillen, Misof, and Gerken 2025), and the role of modern optimizers such as Muon in accelerating the transition (Jordan et al. 2024).


Neural Dynamics Foundations

Understanding grokking requires first establishing the two fundamental learning regimes of neural networks: the lazy regime and the rich (feature learning) regime. These are not merely descriptive labels but mathematically distinct universality classes of gradient descent dynamics, distinguished by how parameters scale with network width and how they evolve during training.

The Neural Tangent Kernel

Definition (Neural Tangent Kernel)

For a neural network \(f(x;\theta)\) with parameters \(\theta \in \mathbb{R}^P\), the Neural Tangent Kernel (NTK) is the \(n \times n\) kernel matrix defined on the training set \(\{x_i\}_{i=1}^n\) by \[ K_t(x, x') = \left\langle \nabla_\theta f(x;\theta_t),\; \nabla_\theta f(x';\theta_t) \right\rangle = \sum_{p=1}^{P} \frac{\partial f(x;\theta_t)}{\partial \theta_p} \frac{\partial f(x';\theta_t)}{\partial \theta_p}. \tag{1}\] Under gradient flow \(\dot\theta = -\nabla_\theta \mathcal{L}\), the network output evolves as \[ \frac{d}{dt} f(x;\theta_t) = -\sum_{x'} K_t(x,x') \nabla_f \mathcal{L}(f(x';\theta_t), y'). \tag{2}\] In the infinite-width limit with appropriate parameterization, \(K_t\) converges to a deterministic kernel \(K_\infty\) that remains constant throughout training (Jacot, Gabriel, and Hongler 2018).
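For intuition, Eq. (1) can be evaluated in closed form for a toy two-layer network \(f(x) = N^{-1/2}\sum_j a_j \tanh(w_j \cdot x)\) (the architecture and activation are our illustrative choices, not fixed by the text); a minimal sketch:

```python
import numpy as np

def empirical_ntk(X, W, a):
    """Empirical NTK (Eq. 1) for the toy two-layer net
    f(x) = (1/sqrt(N)) * sum_j a_j * tanh(w_j . x).

    X: (n, d) inputs; W: (N, d) hidden weights; a: (N,) readout weights.
    Returns the n x n Gram matrix of parameter-gradient inner products.
    """
    N = W.shape[0]
    act = np.tanh(X @ W.T)        # (n, N) hidden activations
    dact = 1.0 - act ** 2         # tanh' at the pre-activations
    grad_a = act / np.sqrt(N)     # df/da_j for each input
    g = (a * dact) / np.sqrt(N)   # scalar part of df/dw_j (df/dw_j = g_j * x)
    # Readout-weight term plus hidden-weight term; the X @ X.T factor comes
    # from the inner product of the x-dependent parts of df/dw_j.
    return grad_a @ grad_a.T + (X @ X.T) * (g @ g.T)

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 3))
W = rng.normal(size=(200, 3))
a = rng.normal(size=200)
K = empirical_ntk(X, W, a)
print(K.shape)  # (5, 5)
```

The two terms correspond to the readout-weight and hidden-weight contributions to the gradient inner product; as a Gram matrix, the result is symmetric and positive semi-definite.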

Lazy Regime

Definition (Lazy Regime)

A network operates in the lazy regime when the parameters remain within an infinitesimal neighborhood of their initialization throughout training: \[ \|\theta_t - \theta_0\| = O(N^{-1/2}), \tag{3}\] where \(N\) is the network width. In this regime, the network function is well-approximated by its first-order Taylor expansion around \(\theta_0\): \[ f(x;\theta_t) \approx f(x;\theta_0) + \nabla_\theta f(x;\theta_0)^\top (\theta_t - \theta_0). \tag{4}\] The NTK is frozen at its initial value \(K_0\), and training reduces to kernel regression in the reproducing kernel Hilbert space (RKHS) of \(K_0\) (Jacot, Gabriel, and Hongler 2018). No representation learning occurs; the network uses its random initialization features as a fixed basis.

In the lazy regime, the over-parameterized network can memorize any training set by virtue of the kernel’s positive-definiteness, but it cannot discover the structural regularities of the target function. This is the regime of “rote memorization.”

Rich (Feature Learning) Regime

Definition (Rich Regime)

A network operates in the rich regime (or feature learning regime) when its parameters undergo \(O(1)\) displacement from initialization: \[ \|\theta_t - \theta_0\| = O(1). \tag{5}\] In this regime, the NTK \(K_t\) evolves during training and adapts to the data, aligning its principal eigenvectors with the structure of the target function. The network learns task-specific internal representations rather than relying on random features (Yang et al. 2022).

The rich regime is accessed through specific parameterizations. The most principled is the maximal update parameterization (\(\mu\)P) of Yang and Hu (Yang et al. 2022), which ensures that hidden-layer updates remain \(O(1)\) regardless of width. In the rich regime, the kernel is not a fixed object but a dynamical variable co-evolving with the data representation.

The Initialization Scale Parameter \(\alpha\)

Definition (Initialization Scale \(\alpha\))

Consider a two-layer network with output \[ f(x;\theta) = \frac{1}{N^{1-\alpha}} \sum_{j=1}^{N} a_j \, \sigma(w_j \cdot x), \tag{6}\] where \(\alpha \in [0, 1/2]\) is the initialization scale parameter, \(a_j\) are readout weights, \(w_j\) are feature weights, and \(\sigma\) is a nonlinear activation. The prefactor is written so that \(\alpha = 1/2\) recovers the NTK normalization \(1/\sqrt{N}\) and \(\alpha = 0\) the mean-field normalization \(1/N\). The extreme values correspond to:

  • \(\alpha = 1/2\): NTK parameterization (lazy regime). The large output scale suppresses the need for weight movement; the kernel freezes.
  • \(\alpha = 0\): Mean-field parameterization (rich regime). The small output scale forces large weight excursions, enabling feature learning.

Intermediate values of \(\alpha\) interpolate continuously between these extremes (Kumar et al. 2024).
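The width-scaling at initialization can be checked numerically. Writing the prefactor as \(N^{\alpha-1}\) (so \(\alpha = 1/2\) gives the NTK normalization \(1/\sqrt{N}\) and \(\alpha = 0\) the mean-field normalization \(1/N\)), the output standard deviation scales as \(N^{\alpha - 1/2}\). The toy distributions below are assumptions for illustration:

```python
import numpy as np

# Output scale at init for f(x) = N^(alpha-1) * sum_j a_j tanh(w_j . x),
# with a_j ~ N(0,1), w_j ~ N(0,I), and a fixed unit-norm input (assumed
# toy setup). The output std scales as N^(alpha - 1/2): width-independent
# at the NTK point alpha = 1/2, shrinking like 1/sqrt(N) at alpha = 0.
rng = np.random.default_rng(0)

def output_std(N, alpha, d=10, trials=500):
    x = np.ones(d) / np.sqrt(d)                   # fixed unit-norm input
    a = rng.normal(size=(trials, N))              # readout weights
    W = rng.normal(size=(trials, N, d))           # feature weights
    f = (a * np.tanh(W @ x)).sum(axis=1) * N ** (alpha - 1)
    return f.std()

print(output_std(100, 0.5), output_std(400, 0.5))  # roughly constant in N (lazy)
print(output_std(100, 0.0), output_std(400, 0.0))  # shrinks ~1/sqrt(N) (mean field)
```

The small output scale at \(\alpha = 0\) is precisely what forces the feature weights to move by \(O(1)\) in order to fit \(O(1)\) targets.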

The parameter \(\alpha\) serves as a control knob that tunes the system between the two universality classes. As we shall see, grokking occurs precisely at intermediate values of \(\alpha\) where the network begins in a lazy-like state but is eventually driven into the rich regime by regularization.

Lemma (NTK–Lazy Equivalence)

In the limit \(N \to \infty\) with \(\alpha = 1/2\), gradient flow on the squared loss is equivalent to kernel regression with the deterministic NTK at initialization: \[ f_t(x) = f_0(x) - K_\infty(x, X)(K_\infty(X,X))^{-1}\bigl(I - e^{-\eta K_\infty(X,X) t}\bigr) \bigl(f_0(X) - Y\bigr), \] where \(X\) denotes the training inputs and \(Y\) the targets. In particular, the kernel \(K_t = K_\infty\) for all \(t \geq 0\), and training converges to the minimum-RKHS-norm interpolant (Jacot, Gabriel, and Hongler 2018).
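On the training inputs themselves, \(K_\infty(x,X)K_\infty(X,X)^{-1}\) reduces to the identity, so the training predictions obey \(f_t(X) = f_0(X) - (I - e^{-\eta K_\infty t})(f_0(X) - Y)\). A minimal numerical check, with a random positive-definite matrix standing in for \(K_\infty(X,X)\):

```python
import numpy as np

def ntk_prediction(t, K, f0, Y, eta=1.0):
    """Lazy-regime training-set predictions under gradient flow:
    f_t(X) = f_0(X) - (I - exp(-eta K t)) (f_0(X) - Y),
    computed via the eigendecomposition of the PSD kernel K.
    """
    lam, U = np.linalg.eigh(K)
    decay = U @ np.diag(np.exp(-eta * lam * t)) @ U.T   # exp(-eta K t)
    return f0 - (np.eye(len(Y)) - decay) @ (f0 - Y)

rng = np.random.default_rng(1)
A = rng.normal(size=(6, 6))
K = A @ A.T + 1e-3 * np.eye(6)   # positive-definite stand-in for K_inf(X, X)
f0 = rng.normal(size=6)          # initial predictions f_0(X)
Y = rng.normal(size=6)           # targets
print(np.allclose(ntk_prediction(0.0, K, f0, Y), f0))  # True: starts at f_0
print(np.allclose(ntk_prediction(1e5, K, f0, Y), Y))   # True: interpolates Y
```

The exponential decay of each error mode at rate \(\eta\lambda_i\) is the kernel-regression dynamics referenced in the lemma.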


Grokking Phenomenology

Definition (Grokking)

Grokking is a training phenomenon in which:

  1. The training loss reaches near-zero (memorization) at an early time \(t_{\mathrm{mem}}\).
  2. The test loss remains at chance level for an extended period \(t_{\mathrm{mem}} \ll t \ll t_{\mathrm{grok}}\).
  3. The test loss drops sharply at time \(t_{\mathrm{grok}}\), achieving near-perfect generalization.

The delay ratio \(t_{\mathrm{grok}} / t_{\mathrm{mem}}\) can span several orders of magnitude (Power et al. 2022).

The term “grokking” was introduced by Power et al. (Power et al. 2022), who observed this phenomenon when training small transformers on modular arithmetic tasks such as \(a + b \pmod{p}\) and \(a \times b \pmod{p}\). The training curve exhibits a characteristic four-phase structure:

Definition (Training Phases of Grokking)
  • Phase I—Memorization. Training accuracy rises to 100%. The network fits the training data using its high-dimensional random feature capacity (NTK behavior). Test accuracy remains at the random baseline (e.g., \(1/p\) for modular arithmetic mod \(p\)).
  • Phase II—Plateau. Training loss is near zero; test loss is stationary. Externally, learning appears stalled. Internally, weight decay and the implicit bias of the optimizer apply a slow, persistent pressure on the weight configuration, driving it along the zero-training-loss manifold toward lower-norm solutions (Liu et al. 2022).
  • Phase III—Circuit formation. Internal representations begin to restructure. Nanda et al. (Nanda et al. 2023) demonstrated that, for modular addition, the network learns discrete Fourier transform representations and implements the identity \(\cos(\omega a)\cos(\omega b) - \sin(\omega a)\sin(\omega b) = \cos(\omega(a+b))\) using dedicated neuron circuits.
  • Phase IV—Generalization. Test accuracy jumps to near 100%. The network has transitioned from a memorization solution (lookup table) to a generalizing solution (algorithmic circuit).

Algorithmic Tasks and Data Sparsity

Grokking is most readily observed on tasks with a strict low-dimensional algebraic structure that is entirely opaque when viewed as a generic input–output mapping. Canonical examples include modular arithmetic (\(a + b \pmod{p}\), \(a - b \pmod{p}\), \(a / b \pmod{p}\)) and composition in the permutation group \(S_5\) (Power et al. 2022).

Data sparsity amplifies the grokking delay. When the training fraction \(|S_{\mathrm{train}}| / |S_{\mathrm{total}}|\) is small, the memorization solution is easily accessible (a lookup table requires complexity proportional to \(|S_{\mathrm{train}}|\)), while the generalizing solution—which must be valid everywhere—requires discovering the underlying rule. As the training fraction increases, grokking disappears because the generalizing solution becomes the easier path from the outset (Power et al. 2022; Liu et al. 2022).

Open Question: Universality of Grokking

Is grokking specific to discrete algebraic tasks, or does it arise generically in continuous regression and classification problems? The effective theory of Liu et al. (Liu et al. 2022) suggests that grokking is universal whenever the loss landscape contains both a high-norm memorizing basin and a low-norm generalizing basin separated by an energy barrier. Empirical evidence from polynomial regression and image classification under heavy regularization supports this broader view.


Grokking as a Phase Transition

The sharp, sudden transition from memorization to generalization is mathematically analogous to a phase transition in statistical mechanics. In this section, we formalize this correspondence.

Energy Landscape and Entropy

In statistical physics, a system evolves to minimize its free energy \(F = E - TS\), where \(E\) is the internal energy, \(T\) the temperature, and \(S\) the entropy. In the neural network setting:

  • Energy \(\leftrightarrow\) Loss function plus regularization: \(\mathcal{L}(\theta) + \lambda \|\theta\|^2\).
  • Entropy \(\leftrightarrow\) Volume of weight space corresponding to a given solution type.
Definition (Effective Free Energy)

The effective free energy for a neural network trained with weight decay \(\lambda\) is \[ F(\theta) = \mathcal{L}_{\mathrm{train}}(\theta) + \lambda \|\theta\|^2, \tag{7}\] where \(\mathcal{L}_{\mathrm{train}}\) is the training loss. Gradient flow with weight decay is equivalent to gradient descent on \(F\). The regularization term \(\lambda\|\theta\|^2\) penalizes high-norm solutions, biasing the dynamics toward simpler (lower-complexity) configurations (Liu et al. 2022).
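The stated equivalence is an exact algebraic identity at the level of a single update: a gradient step on \(F\) equals a plain gradient step on \(\mathcal{L}\) followed by a multiplicative weight shrinkage. A one-step check on a toy quadratic loss (the loss below is an assumed example):

```python
import numpy as np

# Gradient descent with weight decay is gradient descent on the free
# energy F(theta) = L(theta) + lam * ||theta||^2 (Eq. 7). Toy check with
# L(theta) = 0.5 * ||A theta - b||^2 (an assumed example).
rng = np.random.default_rng(0)
A = rng.normal(size=(8, 4))
b = rng.normal(size=8)
lam, eta = 0.1, 0.01
theta = rng.normal(size=4)

grad_L = A.T @ (A @ theta - b)   # gradient of the data-fit term alone

# One step directly on F = L + lam * ||theta||^2:
step_F = theta - eta * (grad_L + 2 * lam * theta)
# Equivalently: plain gradient step on L, plus weight shrinkage.
step_decay = (1 - 2 * eta * lam) * theta - eta * grad_L

print(np.allclose(step_F, step_decay))  # True
```

The shrinkage factor \((1 - 2\eta\lambda)\) is the slow, persistent pressure toward lower-norm solutions invoked in Phase II.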

The memorizing and generalizing solutions correspond to qualitatively different basins in this landscape:

  • Memorization basin (disordered / “glassy” state): Training loss is zero, but the weight norm is large. These configurations are numerous—there are exponentially many ways to memorize \(n\) data points with a sufficiently over-parameterized network. High entropy, high energy (due to the \(\lambda\|\theta\|^2\) term).
  • Generalization basin (ordered / “crystalline” state): Training and test loss are zero. The weights are aligned with the target function’s structure (e.g., Fourier modes). These configurations are rare and highly organized. Low entropy, low energy.

The Effective Theory of Liu et al.

Liu et al. (Liu et al. 2022) decomposed the learning dynamics into competing timescales for “signal” (structured representation) and “noise” (memorization). They identified a Goldilocks zone in hyperparameter space:

Definition (Goldilocks Zone)

The Goldilocks zone is the region of hyperparameter space \((\lambda, \alpha, |S_{\mathrm{train}}|)\) in which:

  1. Regularization is strong enough that the memorization solution is metastable (not the global minimum of \(F\)).
  2. Regularization is not so strong that the network cannot fit the training data at all.
  3. The signal learning rate is positive but slower than the noise learning rate.

Within this zone, the network first memorizes (fast noise dynamics), then gradually transitions to generalization as weight decay destabilizes the memorization solution (Liu et al. 2022).

The signal strength \(S(t)\) and noise strength \(N(t)\) obey effective dynamics of the form: \[ \dot{S} = \eta_S \, g_S(S, N) - \lambda S, \qquad \dot{N} = \eta_N \, g_N(S, N) - \lambda N, \tag{8}\] where \(\eta_S \ll \eta_N\) reflects the slower signal learning rate. After the noise component saturates and begins to decay under weight decay, the signal eventually crosses a threshold where it dominates the output, triggering generalization (Liu et al. 2022).
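Eq. (8) can be integrated for an illustrative choice of couplings. The forms \(g_S = 1 - S\) and \(g_N = 1 - N - S\) below are our assumptions (the effective theory leaves them generic); they encode the idea that the noise source shrinks as the signal grows, producing the memorize-then-grok timescale separation:

```python
# Toy Euler integration of the signal/noise dynamics of Eq. (8), with
# assumed couplings g_S = 1 - S and g_N = 1 - N - S: the noise channel
# saturates quickly, then recedes as the slowly growing signal takes over.
eta_S, eta_N, lam = 0.01, 1.0, 0.005   # eta_S << eta_N, weak weight decay
dt, T = 0.05, 400.0
S, N = 0.0, 0.0
traj = []
for step in range(int(T / dt)):
    t = step * dt
    dS = eta_S * (1 - S) - lam * S
    dN = eta_N * (1 - N - S) - lam * N
    S, N = S + dt * dS, N + dt * dN
    traj.append((t, S, N))

t_mem = next(t for t, S_, N_ in traj if N_ > 0.5)   # fast memorization
t_grok = next(t for t, S_, N_ in traj if S_ > N_)   # slow signal takeover
print(t_mem, t_grok)  # t_grok exceeds t_mem by well over an order of magnitude
```

Even this crude two-variable caricature reproduces the defining phenomenology: a short memorization time followed by a long plateau before the signal overtakes the noise.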

First-Order Phase Transition

Theorem (Grokking as First-Order Phase Transition (Rubin, Seroussi, and Ringel 2024))

Consider a two-layer neural network trained on a structured task (e.g., modular arithmetic) with weight decay \(\lambda > 0\). Define the order parameter \(m = \langle w, w^* \rangle / \|w\|\|w^*\|\) as the overlap between the learned features and the target features. Then:

  1. Two metastable phases exist. The effective free energy \(F(m)\) has two local minima: a memorizing phase \(\mathcal{M}\) at \(m \approx 0\) (no feature alignment) and a generalizing phase \(\mathcal{G}\) at \(m \approx 1\) (full alignment).
  2. The transition is first-order. As the effective control parameter (training time, sample size, or \(\lambda\)) crosses a critical value, the global minimum of \(F\) switches discontinuously from \(\mathcal{M}\) to \(\mathcal{G}\). The order parameter \(m\) jumps.
  3. Metastability produces delay. The grokking delay \(\tau_{\mathrm{grok}}\) is governed by the free energy barrier \(\Delta F\) between the phases: \[ \tau_{\mathrm{grok}} \sim \exp\!\left(\frac{\Delta F}{T_{\mathrm{eff}}}\right), \tag{9}\] where \(T_{\mathrm{eff}}\) is an effective temperature set by the learning rate and stochastic gradient noise. This is a Kramers-type escape rate formula (Kramers 1940). A full derivation with quantitative bounds is given in the Analytical Proof.
Lemma (Structure of the Loss Landscape (Rubin, Seroussi, and Ringel 2024; Nanda et al. 2023))

For networks trained on modular addition mod \(p\):

  1. The memorizing basin contains \(O(p^2)\) distinct solutions related by permutation symmetry of the input tokens.
  2. The generalizing basin contains \(O(p)\) solutions corresponding to different choices of Fourier frequency \(\omega\).
  3. The free energy barrier between basins is extensive in the number of parameters \(P\), explaining why the grokking delay grows with model size at fixed learning rate.
Physical Analogy: Nucleation

Grokking is analogous to nucleation in a supercooled liquid. The memorizing phase is the metastable “liquid”; the generalizing phase is the thermodynamically stable “crystal.” Weight decay acts as supercooling—it lowers the free energy of the crystal relative to the liquid. The system remains trapped in the liquid state until a thermal fluctuation (stochastic gradient noise) nucleates a “droplet” of the crystalline phase that then grows to fill the system. The grokking delay is the nucleation time.


Physical Order Parameters

In statistical mechanics, a phase transition is diagnosed by an order parameter: a macroscopic observable that is zero in one phase and nonzero in the other. Several such quantities have been identified for the grokking transition.

Representation Quality Index

Definition (Representation Quality Index (Liu et al. 2022))

For a task with known algebraic structure (e.g., modular arithmetic mod \(p\)), the Representation Quality Index (RQI) measures the geometric regularity of the learned embedding. Let \(\{e_i\}_{i=0}^{p-1}\) denote the learned embeddings of the input tokens. Define: \[ \mathrm{RQI} = \frac{|\{(a,b,c,d) : e_a + e_b \approx e_c + e_d,\; a+b \equiv c+d \pmod{p}\}|}{|\{(a,b,c,d) : a+b \equiv c+d \pmod{p}\}|}. \tag{10}\] RQI \(\approx 0\) in the memorizing phase (random embeddings) and RQI \(\approx 1\) in the generalizing phase (structured lattice on the circle/torus). This is analogous to the crystalline order parameter in condensed matter.
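As a deliberately simple illustration, the sketch below evaluates Eq. (10) with an explicit tolerance standing in for the "\(\approx\)" (the closeness criterion is our modeling assumption). Note that no non-constant embedding can satisfy the wrapped quadruples exactly, so even a perfectly structured linear embedding \(e_i = i\,v\) scores below 1 here; the point is the large gap between structured and random embeddings:

```python
import numpy as np
from itertools import product

def rqi(E, p, tol=1e-6):
    """RQI (Eq. 10): among quadruples (a,b,c,d) with a+b = c+d (mod p),
    the fraction whose embedding sums agree, with an explicit tolerance
    `tol` standing in for the approximate equality (our assumption).
    """
    hits = total = 0
    for a, b, c, d in product(range(p), repeat=4):
        if (a + b) % p == (c + d) % p:
            total += 1
            if np.linalg.norm((E[a] + E[b]) - (E[c] + E[d])) < tol:
                hits += 1
    return hits / total

p = 5
rng = np.random.default_rng(0)
v = rng.normal(size=2)
E_linear = np.arange(p)[:, None] * v   # structured embedding: e_i = i * v
E_random = rng.normal(size=(p, 2))     # unstructured embeddings

print(rqi(E_linear, p), rqi(E_random, p))  # structured scores well above random
```

For random embeddings only quadruples with \(\{a,b\} = \{c,d\}\) register as hits, while the linear embedding additionally captures every quadruple whose integer (unwrapped) sums coincide.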

Effective Rank

Definition (Effective Rank (Roy and Vetterli 2007))

Given a matrix \(W\) with singular values \(\sigma_1 \geq \sigma_2 \geq \cdots \geq \sigma_r > 0\), define the normalized singular value distribution \(p_i = \sigma_i^2 / \sum_j \sigma_j^2\). The effective rank is \[ \mathrm{erank}(W) = \exp\!\left(-\sum_{i=1}^{r} p_i \log p_i\right) = \exp(H(\mathbf{p})), \tag{11}\] where \(H(\mathbf{p})\) is the Shannon entropy of the distribution \(\mathbf{p}\).

  • If all singular values are equal: \(\mathrm{erank} = r\) (maximal, fully delocalized).
  • If one singular value dominates: \(\mathrm{erank} \to 1\) (minimal, fully localized).

During the memorizing phase, the effective rank of the weight matrices (or activation covariance) is high. At the grokking transition, it collapses sharply, indicating the network has found a low-rank generalizing solution.

Participation Ratio

Definition (Participation Ratio)

For a covariance matrix \(C\) with eigenvalues \(\{\lambda_i\}\), the participation ratio is \[ \mathrm{PR} = \frac{\left(\sum_i \lambda_i\right)^2}{\sum_i \lambda_i^2} = \frac{\mathrm{tr}(C)^2}{\mathrm{tr}(C^2)}. \tag{12}\] The participation ratio counts the effective number of “active” modes. In the memorizing phase, variance is spread across many modes (\(\mathrm{PR} = O(N)\)). In the generalizing phase, it concentrates on the few task-relevant modes (\(\mathrm{PR} = O(1)\)). The sharp drop in PR at the transition is a robust diagnostic of grokking (Kumar et al. 2024).

Kernel Alignment

Definition (Kernel Alignment)

The kernel alignment between the empirical NTK \(K_t\) at training time \(t\) and an ideal target kernel \(K^*\) (encoding the task structure) is \[ A(K_t, K^*) = \frac{\langle K_t, K^* \rangle_F}{\|K_t\|_F \, \|K^*\|_F}, \tag{13}\] where \(\langle \cdot, \cdot \rangle_F\) is the Frobenius inner product. In the lazy regime, \(A \approx A_0\) is small and constant (random features are not aligned with the task). In the rich regime, \(A\) increases as the kernel adapts its eigenvectors to the target function. The grokking transition is marked by a sigmoidal rise in kernel alignment (Kumar et al. 2024).
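The three spectral order parameters above take only a few lines each to compute. The sketch below evaluates them on toy matrices standing in for trained weights: a full-rank Gaussian matrix as the "memorizing" stand-in and a rank-one matrix as the "generalizing" one (these stand-ins are our assumptions for illustration):

```python
import numpy as np

def effective_rank(W):
    """erank (Eq. 11): exp of the Shannon entropy of the normalized
    squared singular values."""
    s = np.linalg.svd(W, compute_uv=False)
    p = s ** 2 / np.sum(s ** 2)
    p = p[p > 0]
    return float(np.exp(-np.sum(p * np.log(p))))

def participation_ratio(C):
    """PR (Eq. 12): tr(C)^2 / tr(C^2) for a covariance matrix C."""
    return float(np.trace(C) ** 2 / np.trace(C @ C))

def kernel_alignment(K, K_star):
    """Alignment (Eq. 13): Frobenius cosine between two kernels."""
    return float(np.sum(K * K_star) /
                 (np.linalg.norm(K) * np.linalg.norm(K_star)))

rng = np.random.default_rng(0)
W_mem = rng.normal(size=(50, 50))   # full-rank "memorizing" stand-in
u = rng.normal(size=(50, 1))
W_gen = u @ u.T                     # rank-one "generalizing" stand-in
print(effective_rank(W_mem), effective_rank(W_gen))  # high vs exactly 1
```

A self-aligned kernel has alignment 1, the identity covariance in \(n\) dimensions has \(\mathrm{PR} = n\), and a rank-one matrix has \(\mathrm{erank} = 1\), matching the limiting cases in the definitions.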

Signal and Noise Subspace Projections

Definition (Signal–Noise Decomposition)

Let \(\mathcal{V}_S\) be the subspace spanned by the ground-truth features of the target function (e.g., the relevant Fourier modes for modular arithmetic). Let \(\mathcal{V}_N = \mathcal{V}_S^\perp\) be the orthogonal complement. For a weight matrix \(W\), define \[ E_S(t) = \|\Pi_S W_t\|_F^2, \qquad E_N(t) = \|\Pi_N W_t\|_F^2, \tag{14}\] where \(\Pi_S\) and \(\Pi_N\) are the projectors onto \(\mathcal{V}_S\) and \(\mathcal{V}_N\), respectively.

  • Phase I: \(E_N\) grows rapidly (memorization of random patterns); \(E_S\) grows slowly (signal learning is harder).
  • Phase II: \(E_N\) decays under weight decay; \(E_S\) continues growing from a small base.
  • Phase III–IV: \(E_S\) overtakes \(E_N\). The signal-to-noise ratio \(\mathrm{SNR}(t) = E_S(t)/E_N(t)\) crosses a critical threshold, and the softmax nonlinearity amplifies the signal, producing sudden generalization (Nanda et al. 2023; Liu et al. 2022).
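Eq. (14) amounts to two orthogonal projections. A sketch with an assumed random low-dimensional signal subspace standing in for the true Fourier modes:

```python
import numpy as np

def subspace_energies(W, V_S):
    """Signal/noise split of Eq. (14). V_S: (d, k) orthonormal basis of
    the signal subspace; W: (d, m) weight matrix, projected by
    Pi_S = V_S V_S^T."""
    P_S = V_S @ V_S.T                 # projector onto the signal subspace
    W_S = P_S @ W
    return np.sum(W_S ** 2), np.sum((W - W_S) ** 2)   # E_S, E_N

rng = np.random.default_rng(0)
d, k, m = 20, 2, 30
V_S, _ = np.linalg.qr(rng.normal(size=(d, k)))  # assumed 2-dim signal basis
W_noisy = rng.normal(size=(d, m))               # memorizing-phase stand-in
W_aligned = V_S @ rng.normal(size=(k, m))       # columns in the signal span

E_S, E_N = subspace_energies(W_noisy, V_S)
print(E_S / E_N)                          # SNR ~ k/(d-k), well below 1
print(subspace_energies(W_aligned, V_S))  # noise energy ~ 0: SNR diverges
```

Because \(\Pi_S\) and \(\Pi_N\) are complementary projectors, \(E_S + E_N = \|W\|_F^2\) exactly, so the SNR tracks how the fixed total weight energy redistributes between the two subspaces during training.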
Lemma (Order Parameter Discontinuity at Transition (Rubin, Seroussi, and Ringel 2024; Kumar et al. 2024))

At the grokking transition:

  1. RQI jumps from \(O(\varepsilon)\) to \(1 - O(\varepsilon)\).
  2. Effective rank drops from \(O(\sqrt{P})\) to \(O(1)\).
  3. Participation ratio drops from \(O(N)\) to \(O(1)\).
  4. Kernel alignment jumps from \(O(1/P)\) to \(O(1)\).
  5. SNR crosses unity: \(\mathrm{SNR}(t_{\mathrm{grok}}^-) < 1 < \mathrm{SNR}(t_{\mathrm{grok}}^+)\).

These discontinuities are the hallmarks of a first-order phase transition.

Table 1: Summary of order parameters and their behavior across the grokking transition.
| Order parameter | Memorizing phase | Generalizing phase | Physical analogy |
|---|---|---|---|
| RQI | \(\approx 0\) | \(\approx 1\) | Magnetization |
| Effective rank | \(O(\sqrt{P})\) | \(O(1)\) | Entropy |
| Participation ratio | \(O(N)\) | \(O(1)\) | Localization length |
| Kernel alignment | \(O(1/P)\) | \(O(1)\) | Field alignment |
| SNR | \(< 1\) | \(> 1\) | Signal-to-noise ratio |

Mean Field Theory and DMFT

Dynamical mean field theory (DMFT) provides a rigorous framework for analyzing the grokking transition in the infinite-width limit. Unlike the NTK theory (which also operates at infinite width but fixes features), DMFT allows features to evolve and thus captures the lazy-to-rich transition.

DMFT for Neural Networks

Definition (Dynamical Mean Field Theory for Neural Networks)

In the infinite-width limit of a two-layer network, the joint dynamics of any finite collection of neurons become independent, with each neuron \(i\) governed by a stochastic process \[ \dot{h}_i(t) = -\frac{\partial \mathcal{L}}{\partial h_i} + \xi_i(t), \] where the interaction with all other neurons is replaced by a self-consistent Gaussian “bath” characterized by the kernel \[ C(t, t') = \frac{1}{N}\sum_{i=1}^{N} \mathbb{E}\bigl[h_i(t) h_i(t')\bigr] \tag{15}\] and the response function \[ R(t, t') = \frac{1}{N}\sum_{i=1}^{N} \frac{\delta \mathbb{E}[h_i(t)]}{\delta \xi_i(t')}. \tag{16}\] Self-consistency requires that the statistics of \(h_i\) generated by the single-site dynamics match the kernel \(C\) and response \(R\) that define the bath (Kumar et al. 2024; Mei, Montanari, and Nguyen 2018).

The Action Landscape

The DMFT self-consistency conditions can be reformulated as the stationarity conditions of an action (or effective potential) \(\mathcal{S}[C, R]\), a functional of the kernel and response functions. Training trajectories correspond to saddle points of this action.

The grokking transition is encoded in the structure of \(\mathcal{S}\):

  1. Featureless minimum (\(m = 0\)): The kernel remains close to its initialization value. No feature learning occurs. This corresponds to the lazy/memorizing phase.
  2. Feature-learning minimum (\(m \neq 0\)): The kernel has evolved to align with the target. This corresponds to the rich/generalizing phase.

As the sample size \(n\) or training time increases, the global minimum of \(\mathcal{S}\) shifts from the featureless to the feature-learning saddle point. The grokking transition is the dynamical relaxation from the metastable featureless minimum to the stable feature-learning minimum (Kumar et al. 2024).

Mixed Phase and Gaussian Mixture Feature Learning

Definition (Gaussian Mixture Feature Learning Phase)

The Gaussian Mixture Feature Learning (GMFL) phase is an intermediate state in which the neural population splits into two subpopulations:

  • A majority of neurons remain in the “lazy” Gaussian cloud (pre-activations are approximately Gaussian with the initial covariance).
  • A minority of neurons have “specialized” and locked onto the target features, appearing as outliers in the spectral distribution of the weight matrix.

The pre-activation distribution in this phase is a Gaussian mixture rather than a pure Gaussian (Rubin, Seroussi, and Ringel 2024; Kumar et al. 2024).

This mixed phase is closely related to the spiked random matrix model from random matrix theory.

Spiked Random Matrix Model

Definition (Spiked Random Matrix Model)

Model the weight matrix as \[ W = \sum_{k=1}^{r} \beta_k \, u_k v_k^\top + \frac{\sigma}{\sqrt{N}} Z, \tag{17}\] where \(\beta_k\) are the spike strengths, \(u_k, v_k\) are unit vectors encoding the signal directions, and \(Z\) has i.i.d. entries. The \(\beta_k\) terms represent the learned signal; the noise term \(\sigma Z/\sqrt{N}\) represents the unlearned random component (Baik, Ben Arous, and Péché 2005).

Lemma (Baik–Ben Arous–Péché Transition (Baik, Ben Arous, and Péché 2005))

In the spiked random matrix model (Equation 17) with a single spike \(\beta\) and noise level \(\sigma\):

  • If \(\beta \leq \sigma\): the top singular value of \(W\) remains within the bulk of the Marchenko–Pastur distribution. The spike is undetectable.
  • If \(\beta > \sigma\): the top singular value separates from the bulk: \(\sigma_1(W) \to \beta + \sigma^2/\beta\). The spike emerges.

In the grokking context, training dynamics must push the signal strength \(\beta\) past the BBP threshold \(\sigma\) for the learned features to emerge from the noise floor. The grokking instant corresponds to the BBP transition.
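The BBP lemma is easy to check numerically for Eq. (17) with a single spike (the matrix size, seed, and spike strengths below are arbitrary illustrative choices):

```python
import numpy as np

# Numerical check of the BBP transition (Eq. 17, single spike): below the
# threshold beta = sigma the top singular value sticks to the bulk edge
# (~2*sigma for a square matrix); above it, it detaches and approaches
# beta + sigma^2/beta.
rng = np.random.default_rng(0)
N, sigma = 600, 1.0
u = rng.normal(size=N); u /= np.linalg.norm(u)
v = rng.normal(size=N); v /= np.linalg.norm(v)
Z = rng.normal(size=(N, N))   # i.i.d. noise matrix

def top_sv(beta):
    W = beta * np.outer(u, v) + (sigma / np.sqrt(N)) * Z
    return np.linalg.svd(W, compute_uv=False)[0]

print(top_sv(0.5))  # ~2.0: spike buried in the bulk, undetectable
print(top_sv(3.0))  # ~3.33 = beta + sigma^2/beta: spike emerged
```

In the grokking analogy, the sub-threshold case is the memorizing phase (learned structure invisible in the spectrum) and the super-threshold case is the moment a signal mode becomes spectrally resolvable.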

Theorem (DMFT Description of Grokking (Kumar et al. 2024))

In the infinite-width limit of a two-layer network trained with gradient flow and weight decay \(\lambda\) on a structured task:

  1. The DMFT equations admit two fixed-point solutions: a memorizing fixed point \(C_{\mathcal{M}}\) with feature overlap \(m \approx 0\) and a generalizing fixed point \(C_{\mathcal{G}}\) with \(m \approx 1\).
  2. For \(\alpha < \alpha_c\) (rich regime), the system dynamically transitions from \(C_{\mathcal{M}}\) to \(C_{\mathcal{G}}\) after a delay time \(\tau \sim \exp(\Delta \mathcal{S} / T_{\mathrm{eff}})\), where \(\Delta \mathcal{S}\) is the action barrier.
  3. For \(\alpha \geq \alpha_c\) (lazy regime), the system remains at \(C_{\mathcal{M}}\) for all time.

The critical value \(\alpha_c\) depends on the task structure, sample size, and regularization strength.

Lemma (Lazy-to-Rich Transition (Kumar et al. 2024; Yang et al. 2022))

The transition between the lazy (\(\alpha = 1/2\)) and rich (\(\alpha = 0\)) regimes is itself a phase transition in the learning dynamics:

  • For \(\alpha > \alpha_c\): the NTK is approximately constant; no feature learning occurs.
  • For \(\alpha < \alpha_c\): the NTK evolves; features adapt to the task.

The critical \(\alpha_c\) satisfies \(\alpha_c = \alpha_c(n, \lambda, \mathrm{SNR}_0)\), where \(n\) is the sample size, \(\lambda\) the regularization, and \(\mathrm{SNR}_0\) the initial signal-to-noise ratio of the task in the random feature basis.


Initialization and the Scaling Parameter \(\alpha\)

The initialization scale \(\alpha\) is the primary control parameter governing whether grokking occurs. This section examines its role in more detail.

For a network with output \(f(x) = N^{\alpha-1} \sum_j a_j \sigma(w_j \cdot x)\) (the parameterization of Equation 6):

  • Large \(\alpha\) (lazy): To produce \(O(1)\) outputs, the readout weights \(a_j\) must be large, but the feature weights \(w_j\) need only make small adjustments. The system stays near initialization.
  • Small \(\alpha\) (rich): The output scaling is small, forcing \(w_j\) to undergo large displacements. The neurons are pushed into the nonlinear regime of the activation function, enabling feature adaptation.

The grokking phenomenon arises when \(\alpha\) is large enough that the network can initially memorize via the lazy mechanism, but not so large that the rich mechanism is entirely suppressed.

Maximal Update Parameterization

Definition (Maximal Update Parameterization, \(\mu\)P (Yang et al. 2022))

The maximal update parameterization is an initialization and learning rate scheme defined by:

  • Hidden weights: \(W^{(l)} \sim \mathcal{N}(0, 1/N_l)\) with learning rate \(\eta^{(l)} \propto 1/N_l\).
  • Output weights: \(a \sim \mathcal{N}(0, 1/N)\) with learning rate \(\eta_a \propto 1\).

Under \(\mu\)P, the hidden-layer weight updates are \(O(1)\) regardless of width, ensuring maximal feature learning. Hyperparameters optimized at one width transfer to other widths without retuning.

Phase Diagram

Lemma (Phase Diagram in \((\alpha, \lambda)\) Space (Kumar et al. 2024; Liu et al. 2022))

For a fixed task and architecture, the \((\alpha, \lambda)\) parameter space partitions into three regions:

  1. No learning (\(\lambda > \lambda_{\max}(\alpha)\)): Regularization is too strong; neither memorization nor generalization occurs.
  2. Memorization only (\(\lambda < \lambda_{\min}(\alpha)\) or \(\alpha > \alpha_c\)): The network memorizes but never generalizes. Either regularization is too weak to destabilize the memorization solution, or the dynamics are too lazy to permit feature learning.
  3. Grokking (\(\lambda_{\min}(\alpha) < \lambda < \lambda_{\max}(\alpha)\) and \(\alpha < \alpha_c\)): The network first memorizes, then transitions to generalization after a delay.

The grokking delay diverges as \(\alpha \to \alpha_c^-\) and as \(\lambda \to \lambda_{\min}^+\), consistent with the approach to a phase boundary.

Timescale separation. The fundamental mechanism is a separation of timescales between the readout (linear, fast) and feature (nonlinear, slow) dynamics (see the Analytical Proof for a rigorous treatment): \[ \tau_{\mathrm{readout}} \sim \frac{1}{\eta \|K_0\|}, \qquad \tau_{\mathrm{features}} \sim \frac{N^{2\alpha}}{\eta \|\nabla_w K\|}. \tag{18}\] The network memorizes on the fast timescale \(\tau_{\mathrm{readout}}\). Feature learning occurs on the slow timescale \(\tau_{\mathrm{features}}\). Grokking happens when \(\tau_{\mathrm{features}} \gg \tau_{\mathrm{readout}}\), i.e., \(\alpha > 0\).


Finite-Width Corrections and Feynman Diagrams

The DMFT framework is exact in the infinite-width limit, but real grokking occurs in finite-width networks. Perturbative techniques from quantum field theory—specifically Feynman diagram expansions—provide a systematic way to compute corrections in powers of \(1/N\) (Guillen, Misof, and Gerken 2025; Roberts, Yaida, and Hanin 2022).

Perturbative Expansion of the NTK

Definition (Finite-Width NTK Expansion)

At finite width \(N\), the NTK admits an expansion \[ K_N(x, x') = K_\infty(x, x') + \frac{1}{N} K^{(1)}(x, x') + \frac{1}{N^2} K^{(2)}(x, x') + \cdots, \tag{19}\] where each correction \(K^{(k)}\) is computed as a sum over Feynman diagrams with \(k\) loops. At infinite width, all loop diagrams vanish and the kernel reduces to the deterministic \(K_\infty\). At finite width, the loop corrections introduce kernel fluctuations and, crucially, kernel drift—the mechanism by which features evolve (Guillen, Misof, and Gerken 2025).
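One finite-width effect is directly measurable in a toy two-layer network (the tanh architecture below is an assumed example): each NTK entry is a mean of \(N\) i.i.d. per-neuron terms, so its variance across random initializations is \(O(1/N)\), the leading loop correction, and its standard deviation shrinks like \(1/\sqrt{N}\):

```python
import numpy as np

# Finite-width NTK fluctuations for f(x) = (1/sqrt(N)) sum_j a_j tanh(w_j.x):
# the NTK entry K(x1, x2) is a mean of N i.i.d. per-neuron contributions, so
# its std across random initializations scales as 1/sqrt(N) (variance O(1/N),
# the leading correction in the expansion of Eq. 19).
rng = np.random.default_rng(0)
d = 5
x1, x2 = rng.normal(size=d), rng.normal(size=d)

def ntk_entry(N):
    W = rng.normal(size=(N, d))
    a = rng.normal(size=N)
    act1, act2 = np.tanh(W @ x1), np.tanh(W @ x2)
    d1, d2 = 1 - act1 ** 2, 1 - act2 ** 2
    # per-neuron contributions to <grad f(x1), grad f(x2)>, averaged over N
    return np.mean(act1 * act2 + a ** 2 * d1 * d2 * (x1 @ x2))

def std_over_inits(N, trials=400):
    return np.std([ntk_entry(N) for _ in range(trials)])

r = std_over_inits(100) / std_over_inits(400)
print(r)  # close to 2: quadrupling the width halves the kernel fluctuations
```

These are exactly the kernel fluctuations that vanish in the infinite-width limit, leaving the deterministic \(K_\infty\).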

Feynman Rules

Definition (Feynman Rules for Neural Network Perturbation Theory)

The perturbative expansion of network correlators is organized by Feynman diagrams with the following building blocks (Guillen, Misof, and Gerken 2025; Roberts, Yaida, and Hanin 2022):

  • Propagator (solid line): Represents the infinite-width covariance \(C_\infty(x, x') = \mathbb{E}[h_i(x) h_i(x')]\).
  • Vertex (\(k\)-point): Arises from the \(k\)-th Hermite coefficient of the activation function \(\sigma\). The vertex factor is \(V_k = \mathbb{E}_{z \sim \mathcal{N}(0,1)}[\mathrm{He}_k(z) \, \sigma(z)]\).
  • Loop: Each closed loop contributes a factor of \(1/N\). Diagrams are organized by loop number.

The one-loop correction to the NTK (the leading \(1/N\) term) involves the “four-point vertex,” which quantifies the interaction strength between different neurons. If this vertex vanishes (as in a Gaussian process), no feature learning occurs at any finite order.

Lemma (Finite-Width Correction to Grokking (Guillen, Misof, and Gerken 2025))

The grokking delay time at finite width \(N\) receives a perturbative correction: \[ \tau_N = \tau_\infty \left(1 + \frac{c}{N} + O(1/N^2)\right), \tag{20}\] where \(c > 0\) depends on the task and architecture. Finite width generally increases the grokking delay because:

  1. The free energy barrier \(\Delta F\) receives \(O(1/N)\) corrections from loop diagrams.
  2. Fluctuations broaden the transition, effectively increasing the barrier the system must cross.

This explains why grokking is difficult to observe in very wide networks under standard parameterization (\(\alpha = 1/2\)): the \(1/N\) feature-learning correction is suppressed, and the system remains locked in the lazy phase (Guillen, Misof, and Gerken 2025).

The Feynman diagram perspective also clarifies why \(\mu\)P is special: it resums a specific class of diagrams (the “cactus” diagrams) to all orders in \(1/N\), effectively promoting the one-loop feature-learning effect to a leading-order contribution.


Muon Optimizer and Spectral Dynamics

Viewing the grokking transition through the spectral dynamics of weight matrices suggests that optimizers which operate directly in the spectral domain can accelerate, or even eliminate, the grokking delay.

The Muon Optimizer

NoteDefinition (Muon Optimizer (Jordan et al. 2024))

The Muon (MomentUm Orthogonalized by Newton-Schulz) optimizer performs gradient updates via the orthogonal factor of the polar decomposition of the gradient matrix. For a weight matrix \(W\) with gradient \(G = \nabla_W \mathcal{L}\):

  1. Compute the momentum-corrected gradient \(\tilde{G}\).
  2. Compute the singular value decomposition \(\tilde{G} = U \Sigma V^\top\).
  3. Update: \(W \leftarrow W - \eta \, U V^\top\).

In practice, Muon approximates the factor \(U V^\top\) with a few Newton-Schulz iterations rather than an exact SVD (Jordan et al. 2024).

The update \(UV^\top\) is the nearest (semi-)orthogonal matrix to \(\tilde{G}\) in Frobenius norm. This “flattens” the gradient in the spectral domain: every singular value of the update is exactly 1, so no single spectral mode can dominate the step (Jordan et al. 2024; Bernstein and Newhouse 2024).
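The three steps above can be sketched in a few lines. For clarity this toy version computes the orthogonal factor with an exact SVD, whereas the reference implementation uses a Newton-Schulz approximation (Jordan et al. 2024); hyperparameter values are illustrative:

```python
import numpy as np

def muon_update(W, G, buf, lr=0.02, beta=0.95):
    """One Muon-style step (sketch): momentum, then orthogonalization.
    Exact SVD stands in for the Newton-Schulz iteration of the real optimizer."""
    buf = beta * buf + G                               # momentum buffer
    U, _, Vt = np.linalg.svd(buf, full_matrices=False)
    return W - lr * (U @ Vt), buf                      # apply nearest semi-orthogonal matrix

rng = np.random.default_rng(0)
W = rng.standard_normal((32, 16))
G = rng.standard_normal((32, 16))
W_new, buf = muon_update(W, G, np.zeros_like(W))

# The applied step (W - W_new) / lr has all singular values equal to 1.
step = (W - W_new) / 0.02
```

This makes the “spectral flattening” concrete: whatever the spectrum of the momentum buffer, the step itself is an isometry on its row space.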

NoteDefinition (Spectral Dynamics)

The spectral dynamics of a weight matrix \(W_t\) during training is the evolution of its singular value decomposition: \[ W_t = \sum_i \sigma_i(t) \, u_i(t) \, v_i(t)^\top. \] Under standard SGD/Adam, the gradient is applied element-wise, and the largest singular values (typically associated with noise in the grokking context) receive disproportionately large updates. Under Muon, the orthogonalized update treats all singular directions equally, allowing signal modes to grow at the same rate as noise modes.
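A convenient scalar summary of these spectral dynamics is the effective rank (Roy and Vetterli 2007), one of the order parameters discussed in this survey. A minimal sketch:

```python
import numpy as np

def effective_rank(W):
    """Effective rank (Roy and Vetterli 2007): the exponential of the Shannon
    entropy of the normalized singular-value distribution."""
    s = np.linalg.svd(W, compute_uv=False)
    p = s / s.sum()
    p = p[p > 0]                        # drop exact zeros before taking logs
    return float(np.exp(-np.sum(p * np.log(p))))

# A delocalized (identity-like) matrix has maximal effective rank ~ 8;
# a rank-one matrix collapsed onto a single mode has effective rank ~ 1.
er_full = effective_rank(np.eye(8))
v = np.arange(1.0, 9.0)
er_collapsed = effective_rank(np.outer(v, v))
```

In the grokking context, a drop in this quantity during training signals the collapse onto task-relevant spectral modes described above.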

NoteLemma (Muon Accelerates Grokking (Jordan et al. 2024; Bernstein and Newhouse 2024))

The Muon optimizer reduces the grokking delay compared to Adam or SGD because:

  1. Implicit norm control. The orthogonal update \(UV^\top\) has unit operator norm, preventing the weight norm from growing during the memorization phase. This effectively provides stronger implicit regularization than weight decay alone.
  2. Spectral democracy. By equalizing the learning rate across all singular directions, Muon prevents the noise modes (which typically have larger singular values) from suppressing the signal modes, so the BBP threshold (Baik, Ben Arous, and Péché 2005) is crossed earlier.
  3. Barrier reduction. The implicit regularization shifts the effective free energy landscape, reducing the barrier \(\Delta F\) between the memorizing and generalizing phases.

Empirically, Muon eliminates or dramatically shortens the plateau phase (Phase II) of grokking, often reaching generalization in a number of training steps of the same order of magnitude as memorization.
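The BBP mechanism invoked above can be seen directly in a spiked random-matrix toy model (Baik, Ben Arous, and Péché 2005); every parameter value below is chosen purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
n, N = 4000, 1000                      # samples x dimension (illustrative sizes)
gamma = N / n                          # aspect ratio; BBP threshold is sqrt(gamma) = 0.5
edge = (1 + np.sqrt(gamma)) ** 2       # Marchenko-Pastur bulk edge = 2.25

def top_eigenvalue(theta):
    """Largest eigenvalue of a sample covariance with one planted spike of strength theta."""
    Z = rng.standard_normal((n, N))
    Z[:, 0] *= np.sqrt(1 + theta)      # population covariance I + theta * e1 e1^T
    return np.linalg.eigvalsh(Z.T @ Z / n)[-1]

weak = top_eigenvalue(0.2)    # below threshold: top eigenvalue sticks to the bulk edge
strong = top_eigenvalue(2.0)  # above threshold: separates, near (1+theta)(1+gamma/theta) = 3.375
```

Below the threshold the planted signal is invisible in the spectrum; above it, an outlier eigenvalue detaches from the bulk, which is the sense in which a signal mode becomes detectable once the threshold is crossed.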

The connection between Muon and the phase transition framework is that Muon modifies the effective free energy landscape by constraining the weight matrices to lie near the Stiefel manifold. This constraint removes the high-norm memorization basin (which relies on large, unstructured weights), forcing the optimizer into the low-norm generalizing basin from the start. From the spectral perspective, Muon acts as a “spectral thermostat” that prevents the system from getting trapped in the high-entropy, disordered memorization state.


Conclusion and Open Questions

This survey has presented a unified view of grokking as a phase transition in the space of neural network training dynamics. The key insights are:

  1. Grokking is a phase transition. The network transitions from a high-entropy memorizing state (lazy/NTK regime) to a low-entropy generalizing state (rich/feature learning regime). The order parameters—RQI, effective rank, participation ratio, kernel alignment, and SNR—all exhibit discontinuous jumps characteristic of a first-order transition.

  2. The transition is controlled by \(\alpha\) and \(\lambda\). The initialization scale \(\alpha\) determines whether feature learning is dynamically accessible, while weight decay \(\lambda\) controls the relative stability of the memorizing and generalizing phases. Grokking occurs in a “Goldilocks zone” of these parameters.

  3. DMFT provides a quantitative description. In the infinite-width limit, dynamical mean field theory yields self-consistent equations whose fixed points correspond to the memorizing and generalizing phases. The action barrier between these fixed points determines the grokking delay.

  4. Finite-width effects are perturbatively computable. Feynman diagram techniques extend the DMFT predictions to finite width, showing that the feature-learning correction to the NTK is a \(1/N\) effect in standard parameterization.

  5. Modern optimizers reshape the landscape. The Muon optimizer, by operating in the spectral domain, provides implicit regularization that reduces the barrier between phases, accelerating or eliminating grokking.

ImportantOpen Question: Universality Class

What is the universality class of the grokking transition? Do different architectures (transformers, MLPs, CNNs) and tasks (modular arithmetic, sparse parities, polynomial regression) share the same critical exponents? If so, the transition belongs to a universal class that depends only on symmetry and dimensionality, not microscopic details.

ImportantOpen Question: Order of the Transition

Can grokking exhibit a second-order (continuous) phase transition for certain task families? Preliminary evidence suggests that when the signal subspace is higher-dimensional, the transition may become continuous, with the order parameter growing smoothly rather than jumping.

ImportantOpen Question: Depth

Most theoretical results apply to two-layer (shallow) networks. How does depth change the phase diagram? Does depth introduce additional “thermodynamic” parameters, and can deeper networks exhibit qualitatively new phases (e.g., hierarchical grokking at different layers)?

ImportantOpen Question: Finite-Size Scaling

Can a systematic finite-size scaling analysis—treating width \(N\) as the system size—extract critical exponents from numerical experiments? Such an analysis would place grokking on the same rigorous footing as phase transitions in statistical mechanics.


Appendix: Summary Tables

Table 2: Comparison of Lazy and Rich training regimes.

| Feature | Lazy Regime (NTK) | Rich Regime (Mean Field) |
|---|---|---|
| Scaling limit | Infinite width, \(\alpha = 1/2\) | Infinite width, \(\alpha = 0\) (\(\mu\)P) |
| Weight displacement | \(\lVert\Delta W\rVert = O(N^{-1/2})\) | \(\lVert\Delta W\rVert = O(1)\) |
| Kernel behavior | Static (frozen at initialization) | Dynamic (evolves with data) |
| Feature learning | None (fixed random basis) | Active (basis adapts to task) |
| Effective rank | High (random, delocalized) | Low (collapses to task-relevant modes) |
| Grokking | Not observed | Intrinsic property |
Table 3: Theoretical frameworks and their contributions to the grokking story.

| Framework | Regime | Key Result | Limitation |
|---|---|---|---|
| NTK (Jacot, Gabriel, and Hongler 2018) | \(\alpha = 1/2\) | Kernel regression equivalence | No feature learning |
| \(\mu\)P (Yang et al. 2022) | \(\alpha = 0\) | Maximal feature learning, HP transfer | Infinite width only |
| DMFT (Kumar et al. 2024) | Infinite width | Quantitative grokking dynamics | No finite-width effects |
| Feynman diagrams (Guillen, Misof, and Gerken 2025) | Large \(N\) | Perturbative \(1/N\) corrections | Breaks down at strong coupling |
| Effective theory (Liu et al. 2022) | General | Phase diagram, Goldilocks zone | Phenomenological |
| First-order transition (Rubin, Seroussi, and Ringel 2024) | Two-layer | Kramers-type delay formula | Specific to structured tasks |

References

Baik, Jinho, Gérard Ben Arous, and Sandrine Péché. 2005. “Phase Transition of the Largest Eigenvalue for Nonnull Complex Sample Covariance Matrices.” The Annals of Probability 33 (5): 1643–97. https://doi.org/10.1214/009117905000000233.
Bernstein, Jeremy, and Laker Newhouse. 2024. “Old Optimizer, New Norm: An Anthology.” https://arxiv.org/abs/2409.20325.
Guillen, Max, Philipp Misof, and Jan E. Gerken. 2025. “Finite-Width Neural Tangent Kernels from Feynman Diagrams.” https://arxiv.org/abs/2508.11522.
Jacot, Arthur, Franck Gabriel, and Clément Hongler. 2018. “Neural Tangent Kernel: Convergence and Generalization in Neural Networks.” In Advances in Neural Information Processing Systems. Vol. 31. https://arxiv.org/abs/1806.07572.
Jordan, Keller, Yuchen Jin, Vlado Boza, Jiacheng You, Franz Cesista, Laker Newhouse, and Jeremy Bernstein. 2024. “Muon: An Optimizer for Hidden Layers in Neural Networks.” https://kellerjordan.github.io/posts/muon/.
Kramers, H. A. 1940. “Brownian Motion in a Field of Force and the Diffusion Model of Chemical Reactions.” Physica 7 (4): 284–304. https://doi.org/10.1016/S0031-8914(40)90098-2.
Kumar, Tanishq, Blake Bordelon, Samuel J. Gershman, and Cengiz Pehlevan. 2024. “Grokking as the Transition from Lazy to Rich Training Dynamics.” In The Twelfth International Conference on Learning Representations. https://openreview.net/forum?id=vt5mnLVIVo.
Liu, Ziming, Ouail Kitouni, Niklas Nolte, Eric J. Michaud, Max Tegmark, and Mike Williams. 2022. “Towards Understanding Grokking: An Effective Theory of Representation Learning.” In Advances in Neural Information Processing Systems. Vol. 35. https://arxiv.org/abs/2205.10343.
Mei, Song, Andrea Montanari, and Phan-Minh Nguyen. 2018. “A Mean Field View of the Landscape of Two-Layer Neural Networks.” Proceedings of the National Academy of Sciences 115 (33): E7665–71. https://doi.org/10.1073/pnas.1806579115.
Merrill, William, Nikolaos Tsilivis, and Aman Shukla. 2023. “A Tale of Two Circuits: Grokking as Competition of Sparse and Dense Subnetworks.” https://arxiv.org/abs/2303.11873.
Nanda, Neel, Lawrence Chan, Tom Lieberum, Jess Smith, and Jacob Steinhardt. 2023. “Progress Measures for Grokking via Mechanistic Interpretability.” In The Eleventh International Conference on Learning Representations. https://openreview.net/forum?id=9XFSbDPmdW.
Power, Alethea, Yuri Burda, Harrison Edwards, Igor Babuschkin, and Vedant Misra. 2022. “Grokking: Generalization Beyond Overfitting on Small Algorithmic Datasets.” https://arxiv.org/abs/2201.02177.
Roberts, Daniel A., Sho Yaida, and Boris Hanin. 2022. The Principles of Deep Learning Theory. Cambridge University Press. https://arxiv.org/abs/2106.10165.
Roy, Olivier, and Martin Vetterli. 2007. “The Effective Rank: A Measure of Effective Dimensionality.” In Proceedings of the 15th European Signal Processing Conference (EUSIPCO), 606–10.
Rubin, Noa, Inbar Seroussi, and Zohar Ringel. 2024. “Droplets of Good Representations: Grokking as a First Order Phase Transition in Two Layer Networks.” In The Twelfth International Conference on Learning Representations. https://openreview.net/forum?id=3ROGsTX3IR.
Yang, Greg, Edward J. Hu, Igor Babuschkin, Szymon Sidor, Xiaodong Liu, David Farhi, Nick Ryder, Jakub Pachocki, Weizhu Chen, and Jianfeng Gao. 2022. “Tensor Programs V: Tuning Large Neural Networks via Zero-Shot Hyperparameter Transfer.” https://arxiv.org/abs/2203.03466.