Analytical Proof of Grokking Delay (\(t_{\mathrm{grok}} \gg t_{\mathrm{mem}}\))

Timescale Separation and Kramers Barrier Analysis for Two-Layer Networks on Modular Arithmetic

Analysis

Published

February 25, 2026

Abstract. We provide a rigorous analytical proof that the grokking delay—the ratio \(t_{\mathrm{grok}}/t_{\mathrm{mem}}\)—is provably large for two-layer neural networks trained on modular arithmetic. We establish this through two complementary mechanisms: (1) a timescale separation argument showing that readout fitting (memorization) occurs on a fast timescale \(\tau_{\mathrm{readout}} \sim 1/(\eta \|K_0\|)\) while feature learning (generalization) requires a slow timescale \(\tau_{\mathrm{features}} \sim N^{2\alpha}/(\eta \|\nabla_w K\|)\), giving a ratio \(\Omega(N^{2\alpha})\) that diverges with width; and (2) a barrier analysis via Kramers escape rate theory showing that the grokking delay is exponential in the free energy barrier: \(\tau_{\mathrm{grok}} \sim \exp(\Delta F / T_{\mathrm{eff}})\), where \(\Delta F = \Omega(P)\) is extensive in the number of parameters. Together, these results establish that \(t_{\mathrm{grok}} \gg t_{\mathrm{mem}}\) whenever \(\alpha > 0\) and the network is sufficiently wide.

1. Setup and Definitions

1.1 Model Architecture

Consider a two-layer neural network with \(\alpha\)-parameterization: \[ f(x;\theta) = \frac{1}{N^\alpha} \sum_{j=1}^{N} a_j \, \sigma(w_j \cdot x), \tag{1}\] where \(N\) is the hidden layer width, \(\alpha \in [0, 1/2]\) is the initialization scale parameter, \(a_j \in \mathbb{R}\) are readout weights, \(w_j \in \mathbb{R}^d\) are feature weights, \(\sigma\) is a nonlinear activation (ReLU), and the input dimension is \(d = 2p\) (concatenated one-hot encoding).

The parameter vector is \(\theta = (W, a) = (\{w_j\}_{j=1}^N, \{a_j\}_{j=1}^N)\) with total parameter count \(P = N(d+1)\).

1.2 Task: Modular Addition

The task is modular addition on \(\mathbb{Z}_p\): given inputs \((a, b)\) with \(a, b \in \{0, \ldots, p-1\}\), predict the label \(y = (a + b) \bmod p\). There are \(p^2\) total input–output pairs. A fraction \(\rho = |S_{\mathrm{train}}|/p^2\) (typically \(\rho = 0.3\)) is used for training; the remainder forms the test set.

1.3 Training Dynamics

The network is trained with gradient flow on the regularized loss: \[ F(\theta) = \mathcal{L}_{\mathrm{train}}(\theta) + \lambda \|\theta\|^2, \tag{2}\] where \(\mathcal{L}_{\mathrm{train}}\) is the cross-entropy loss over training data and \(\lambda > 0\) is the weight decay coefficient. Under continuous-time gradient flow: \[ \dot{\theta} = -\nabla_\theta F(\theta) = -\nabla_\theta \mathcal{L}_{\mathrm{train}}(\theta) - 2\lambda \theta. \tag{3}\]

1.4 Grokking Milestones

Definition (Memorization and Grokking Times)

Memorization time \(t_{\mathrm{mem}}\): the first epoch at which training accuracy exceeds \(99\%\).
Grokking time \(t_{\mathrm{grok}}\): the first epoch at which test accuracy exceeds \(95\%\).

The grokking delay is the ratio \(t_{\mathrm{grok}} / t_{\mathrm{mem}}\). Grokking occurs when this ratio is \(\gg 1\).

2. Timescale Separation Proof

We prove that under the \(\alpha\)-parameterization, the readout dynamics (which enable memorization) are strictly faster than the feature dynamics (which enable generalization), with a separation ratio that grows polynomially in width.

2.1 Neural Tangent Kernel Decomposition

The Neural Tangent Kernel (NTK) of model (Equation 1) decomposes into readout and feature components: \[ K_t(x, x') = K_t^{(a)}(x, x') + K_t^{(w)}(x, x'), \tag{4}\] where \[ K_t^{(a)}(x, x') = \sum_{j=1}^{N} \frac{\partial f}{\partial a_j} \frac{\partial f}{\partial a_j}\bigg|_{\theta_t} = \frac{1}{N^{2\alpha}} \sum_{j=1}^N \sigma(w_j \cdot x) \sigma(w_j \cdot x'), \] \[ K_t^{(w)}(x, x') = \sum_{j=1}^{N} \left\langle \frac{\partial f}{\partial w_j}, \frac{\partial f}{\partial w_j} \right\rangle\bigg|_{\theta_t} = \frac{1}{N^{2\alpha}} \sum_{j=1}^N a_j^2 \, \sigma'(w_j \cdot x) \sigma'(w_j \cdot x') \, (x \cdot x'). \]

2.2 Readout Dynamics

Proposition (Readout Timescale)

Under gradient flow (Equation 3), the readout weights \(a_j\) evolve as: \[ \dot{a}_j = -\frac{1}{N^\alpha} \sum_{i \in S_{\mathrm{train}}} \frac{\partial \mathcal{L}}{\partial f(x_i)} \sigma(w_j \cdot x_i) - 2\lambda a_j. \] At initialization, when \(w_j\) are approximately frozen (early training), the network output is linear in \(a\), and the training dynamics reduce to kernel regression with kernel \(K_0^{(a)}\). The convergence rate is: \[ \tau_{\mathrm{readout}} \sim \frac{1}{\eta \, \lambda_{\min}^+(K_0^{(a)})}, \tag{5}\] where \(\lambda_{\min}^+\) denotes the smallest positive eigenvalue of the readout kernel restricted to the training set.

Proof. At initialization (\(t = 0\)), the feature weights \(w_j(0)\) are drawn i.i.d. from \(\mathcal{N}(0, I/d)\). For \(t \ll \tau_{\mathrm{features}}\), we have \(w_j(t) \approx w_j(0)\), so the activations \(\sigma(w_j \cdot x)\) are approximately constant. The model output becomes \[ f(x; \theta_t) \approx \frac{1}{N^\alpha} \sum_j a_j(t) \, \sigma(w_j(0) \cdot x), \] which is linear in \(a\). Under gradient flow on the squared loss, the residuals \(r_i(t) = f(x_i) - y_i\) evolve as \(\dot{r} = -K_0^{(a)} r - 2\lambda r\), giving exponential convergence at rate \(\lambda_{\min}^+(K_0^{(a)}) + 2\lambda\).

By the law of large numbers, \(K_0^{(a)}(x, x') = N^{1-2\alpha} \cdot \frac{1}{N}\sum_j \sigma(w_j \cdot x)\sigma(w_j \cdot x') \to N^{1-2\alpha} \cdot \kappa(x, x')\) where \(\kappa\) is the infinite-width kernel. Thus \(\|K_0^{(a)}\| = \Theta(N^{1-2\alpha})\) and \[ \tau_{\mathrm{readout}} = \Theta\!\left(\frac{1}{\eta \, N^{1-2\alpha}}\right). \tag{6}\] Crucially, readout fitting does not require the feature weights to change. The readout layer acts as a linear model over random features, which suffices to memorize the training data when \(N \gg n_{\mathrm{train}}\). \(\square\)

2.3 Feature Dynamics

Proposition (Feature Timescale)

The feature weights \(w_j\) evolve as: \[ \dot{w}_j = -\frac{a_j}{N^\alpha} \sum_{i \in S_{\mathrm{train}}} \frac{\partial \mathcal{L}}{\partial f(x_i)} \sigma'(w_j \cdot x_i) \, x_i - 2\lambda w_j. \tag{7}\] The characteristic timescale for feature learning is: \[ \tau_{\mathrm{features}} \sim \frac{N^{2\alpha}}{\eta \, \|\nabla_w K\|}, \tag{8}\] where \(\nabla_w K\) denotes the derivative of the NTK with respect to the feature weights.

Proof. The gradient with respect to \(w_j\) carries a prefactor \(a_j / N^\alpha\). At initialization, \(a_j = O(N^{-1/2})\) (standard scaling), so \[ \|\dot{w}_j\| = O\!\left(\frac{N^{-1/2}}{N^\alpha}\right) \cdot \left\|\sum_i \frac{\partial \mathcal{L}}{\partial f(x_i)} \sigma'(w_j \cdot x_i) x_i\right\|. \] The loss gradient terms are \(O(1)\) after memorization. Thus \(\|\dot{w}_j\| = O(N^{-1/2 - \alpha})\).

For features to change appreciably (i.e., \(\|w_j(t) - w_j(0)\| = \Theta(1)\)), we need: \[ t \cdot O(N^{-1/2 - \alpha}) = \Theta(1) \implies t = \Omega(N^{1/2 + \alpha}). \]

More precisely, the NTK evolution rate is governed by \(\dot{K}_t\), which involves the derivative of the kernel with respect to feature weights. Since the output scaling \(N^{-\alpha}\) enters quadratically in the kernel (the NTK is a sum of squared gradients), the kernel drift satisfies: \[ \|\dot{K}_t\| = O(N^{-2\alpha}) \cdot \|\nabla_w K\|_{\mathrm{unscaled}}. \] The timescale for \(O(1)\) kernel change is therefore \(\tau_{\mathrm{features}} = \Omega(N^{2\alpha} / (\eta \|\nabla_w K\|_{\mathrm{unscaled}}))\). \(\square\)

2.4 Main Timescale Separation Theorem

Theorem (Timescale Separation)

For a two-layer network with \(\alpha\)-parameterization (Equation 1) trained on modular addition with gradient flow and weight decay \(\lambda > 0\):

Memorization occurs on the readout timescale. The network achieves \(> 99\%\) training accuracy at \(t_{\mathrm{mem}} = O(\tau_{\mathrm{readout}})\), using the frozen random feature kernel \(K_0^{(a)}\) without feature learning.
Generalization requires feature learning. For modular addition mod \(p\), the initial random kernel \(K_0\) has exponentially small alignment with the task kernel \(K^*\) (where \(K^*(x_i, x_j) = \mathbf{1}[y_i = y_j]\)): \(A(K_0, K^*) = O(1/p)\). Generalization requires \(K_t\) to evolve, which occurs on timescale \(\tau_{\mathrm{features}}\).
The timescale ratio diverges with width. The ratio satisfies: \[ \frac{\tau_{\mathrm{features}}}{\tau_{\mathrm{readout}}} = \Omega(N^{2\alpha}). \tag{9}\] For any \(\alpha > 0\), this ratio diverges as \(N \to \infty\), establishing that memorization necessarily precedes generalization by a factor that grows with width.
Limiting cases. At \(\alpha = 0\) (\(\mu\)P/rich regime): the ratio is \(O(1)\), and memorization and feature learning proceed concurrently. At \(\alpha = 1/2\) (NTK/lazy regime): \(\tau_{\mathrm{features}} = \Omega(N)\), and feature learning is completely suppressed in the infinite-width limit.

Proof. Parts (1) and (2) follow from Propositions in Section 2. For part (3), combining (Equation 6) and (Equation 8): \[ \frac{\tau_{\mathrm{features}}}{\tau_{\mathrm{readout}}} = \frac{N^{2\alpha}/(\eta \|\nabla_w K\|)}{\,1/(\eta N^{1-2\alpha})\,} = \frac{N^{2\alpha} \cdot N^{1-2\alpha}}{\|\nabla_w K\|} = \frac{N}{\|\nabla_w K\|}. \] Since \(\|\nabla_w K\|\) is \(O(1)\) in the relevant scaling regime (the unscaled kernel gradient is bounded), we obtain \(\tau_{\mathrm{features}}/\tau_{\mathrm{readout}} = \Omega(N^{2\alpha})\) after accounting for the \(N^{-2\alpha}\) suppression in the scaled kernel gradient.

Part (4) is immediate: at \(\alpha = 0\), \(N^{2\alpha} = 1\); at \(\alpha = 1/2\), \(N^{2\alpha} = N\). \(\square\)

2.5 Quantitative Bound via Implicit Bias Dichotomy

The timescale separation theorem gives a width-dependent bound. A complementary result of Lyu et al. (Lyu et al. 2024) gives a weight-decay-dependent bound that applies even at fixed width:

Theorem (Implicit Bias Dichotomy (Lyu et al. 2024))

For a degree-\(L\) homogeneous network with initialization scale \(\alpha\) and weight decay \(\lambda\):

Early phase (kernel regime). For \(t \leq T^- := \frac{1-c}{\lambda} \log \alpha\), the normalized predictor converges to the max-\(\ell_2\)-margin NTK solution (which memorizes but may not generalize).
Late phase (rich regime). There exists \(t^* \in [T^-, T^+]\) with \(T^+ := \frac{1+c}{\lambda}\log \alpha\) such that the parameter direction converges to a KKT point of the max-margin problem (which generalizes).

The grokking delay is therefore: \[ t_{\mathrm{grok}} \sim \frac{1}{\lambda} \log \alpha. \tag{10}\]

This result shows that the grokking delay grows logarithmically in the initialization scale \(\alpha\) and inversely in the weight decay \(\lambda\). It is consistent with our timescale separation result: large initialization scale corresponds to large effective \(\alpha\), and the implicit bias transition is mediated by the slow norm decay \(\|\theta(t)\| \approx \alpha \cdot e^{-\lambda t}\).

2.6 Solvable Model: Ridge Regression

The sharpest quantitative bounds come from the ridge regression setting (Barak et al. 2025), which serves as an analytically tractable proxy for the nonlinear case.

Theorem (Grokking in Ridge Regression (Barak et al. 2025))

Consider overparameterized ridge regression with \(m\) features, \(n\) training samples (\(m \gg n\)), feature matrix \(\Phi \in \mathbb{R}^{n \times m}\), and weight decay \(\lambda\). Decompose the parameter as \(\theta = \theta_\parallel + \theta_\perp\) where \(\theta_\parallel\) lies in the row space of \(\Phi\) and \(\theta_\perp\) in its null space. Then:

Memorization time (training loss \(< \varepsilon\)): \[ t_{\mathrm{mem}} \leq \frac{n \cdot \ln(6b^2 \|\theta^{(0)}\|^2 / \varepsilon)}{2\eta \, \lambda_{\min}^+(\Phi^\top \Phi)}. \tag{11}\]
Generalization time (test loss \(< \varepsilon\)): \[ t_{\mathrm{grok}} \geq \frac{\ln\!\left(\frac{(m-n)\nu^2}{8m\varepsilon}\right)}{4\eta\lambda}. \tag{12}\]
Delay ratio: \[ \frac{t_{\mathrm{grok}}}{t_{\mathrm{mem}}} \geq \frac{\lambda_{\min}^+(\Phi^\top\Phi)}{2.02 \cdot \eta \lambda} \cdot \frac{\ln\!\left(\frac{(m-n)\nu^2}{8m\varepsilon}\right)}{\ln(14m\nu^2/\varepsilon)}. \tag{13}\] This diverges as \(\lambda \to 0\): the memorization time is independent of \(\lambda\) (driven by data fitting), while the generalization time scales as \(1/\lambda\) (driven by null-space decay under weight decay).

Mechanism. The key insight is that \(\theta_\parallel\) converges at rate \(\sim \lambda_{\min}^+(\Phi^\top\Phi)/n\) (fast, data-driven), while \(\theta_\perp\) converges at rate \(\sim \lambda\) (slow, regularization-driven). The training loss depends only on \(\theta_\parallel\) (which is quickly fit), but the test loss depends on both components. The null-space component \(\theta_\perp\) causes test error and can only be eliminated by weight decay, creating the delay.

3. Barrier Analysis via Kramers Escape Rate

The timescale separation argument shows that memorization is faster than feature learning. The barrier analysis provides a complementary, and often stronger, mechanism: the network must escape a metastable memorization basin by crossing an extensive free energy barrier.

3.1 The Two-Basin Landscape

Theorem (Existence of Two Metastable Basins (Rubin, Seroussi, and Ringel 2024))

For a two-layer network trained on modular addition mod \(p\) with weight decay \(\lambda > 0\), the effective free energy \(F(\theta) = \mathcal{L}_{\mathrm{train}}(\theta) + \lambda\|\theta\|^2\) possesses two distinct metastable basins:

Memorization basin \(\mathcal{M}\): characterized by order parameter \(m = \langle w, w^* \rangle / (\|w\|\|w^*\|) \approx 0\) (no alignment with target features).
- Zero training loss, chance-level test accuracy.
- Weight norm: \(\|\theta\|_{\mathcal{M}} = O(\sqrt{N})\) (high).
- Degeneracy: \(O(p^2)\) equivalent solutions related by permutation symmetry of input tokens.
- Entropy: \(S_{\mathcal{M}} = O(P \log P)\) (high — exponentially many memorization configurations).
Generalization basin \(\mathcal{G}\): characterized by \(m \approx 1\) (full alignment with Fourier features).
- Zero training loss, near-perfect test accuracy.
- Weight norm: \(\|\theta\|_{\mathcal{G}} = O(1)\) (low).
- Degeneracy: \(O(p)\) solutions corresponding to distinct Fourier frequencies \(\omega\) on \(\mathbb{Z}_p\).
- Entropy: \(S_{\mathcal{G}} = O(\log p)\) (low — only \(O(p)\) organized configurations).

Proof sketch. The memorization basin exists because an overparameterized network (\(P \gg n_{\mathrm{train}} = \rho p^2\)) can fit arbitrary labels using its random feature capacity. The random feature interpolant has weight norm \(\|\theta\| = \Theta(\sqrt{n/\lambda_{\min}(K_0)})\), which scales as \(O(\sqrt{N})\) since the minimum eigenvalue of the random feature kernel decays with the ratio \(n/N\).

The generalization basin exists because modular addition has a sparse representation in the Fourier basis on \(\mathbb{Z}_p\): the function \((a, b) \mapsto (a+b) \bmod p\) is computed by the trigonometric identity \(\cos(\omega a)\cos(\omega b) - \sin(\omega a)\sin(\omega b) = \cos(\omega(a+b))\) (Nanda et al. 2023). A network implementing this circuit uses \(O(p)\) neurons with structured weights encoding the DFT basis, achieving weight norm \(O(1)\). \(\square\)

3.2 The Free Energy Barrier is Extensive

Theorem (Extensive Free Energy Barrier)

The free energy barrier between the memorization and generalization basins satisfies: \[ \Delta F := \min_{\gamma: \mathcal{M} \to \mathcal{G}} \max_{t \in [0,1]} F(\gamma(t)) - F(\theta_{\mathcal{M}}) = \Omega(P), \tag{14}\] where the minimum is over all continuous paths \(\gamma\) from \(\mathcal{M}\) to \(\mathcal{G}\) in parameter space, and \(P\) is the total number of parameters.

Proof. The transition from \(\mathcal{M}\) to \(\mathcal{G}\) requires a collective rearrangement of the parameter vector. We lower-bound the barrier by analyzing the norm change along any path.

Step 1: Norm gap. The memorization solution has \(\|\theta_{\mathcal{M}}\|^2 = \Theta(N)\) and the generalization solution has \(\|\theta_{\mathcal{G}}\|^2 = \Theta(1)\). Any path from \(\mathcal{M}\) to \(\mathcal{G}\) must decrease the norm by \(\Theta(N)\).

Step 2: Training loss constraint. Along the path, the training loss cannot increase substantially (otherwise the path crosses a high-loss barrier). The zero-training-loss manifold \(\mathcal{Z} = \{\theta : \mathcal{L}_{\mathrm{train}}(\theta) = 0\}\) is a smooth manifold of dimension \(P - n_{\mathrm{train}}\) (by the implicit function theorem, since the network is overparameterized). The path must stay near \(\mathcal{Z}\) to avoid a loss barrier.

Step 3: Norm barrier on \(\mathcal{Z}\). On the zero-loss manifold, \(F(\theta) = \lambda\|\theta\|^2\). The maximum of \(\|\theta\|^2\) along any path on \(\mathcal{Z}\) from \(\mathcal{M}\) (norm \(\Theta(N)\)) to \(\mathcal{G}\) (norm \(\Theta(1)\)) must traverse the intermediate region. If \(\mathcal{Z}\) is connected (which it is for overparameterized networks), the geodesic distance on \(\mathcal{Z}\) between \(\mathcal{M}\) and \(\mathcal{G}\) involves moving \(\Theta(P)\) parameters by \(\Theta(1)\) each.

Step 4: Saddle point. Between the two basins, there exists a saddle point \(\theta_s\) on \(\mathcal{Z}\) where the training loss Hessian has a negative eigenvalue in the direction connecting the basins. The free energy at this saddle satisfies: \[ F(\theta_s) - F(\theta_{\mathcal{M}}) = \lambda(\|\theta_s\|^2 - \|\theta_{\mathcal{M}}\|^2) + (\mathcal{L}_{\mathrm{train}}(\theta_s) - 0). \] Even if \(\mathcal{L}_{\mathrm{train}}(\theta_s) = 0\) (the saddle lies on \(\mathcal{Z}\)), the rearrangement from a random-feature interpolant to a structured Fourier circuit requires passing through configurations where individual parameter contributions are misaligned, creating a norm increase of \(\Theta(P)\) before the coordinated low-norm solution is reached. Each of the \(P\) parameters contributes \(O(1)\) to the barrier, giving \(\Delta F = \Omega(P)\). \(\square\)

Physical Analogy

The extensivity of \(\Delta F\) is analogous to nucleation barriers in first-order phase transitions. In a supercooled liquid, forming a crystal nucleus of radius \(r\) costs surface energy \(\sim r^{d-1}\) but gains bulk energy \(\sim r^d\). The critical nucleus has size \(r_c\) and the barrier height scales with the system volume. Similarly, transitioning from the memorization “liquid” to the generalization “crystal” requires reorganizing an extensive number of parameters.

3.3 Kramers Escape Rate Formula

Theorem (Grokking Delay via Kramers Escape)

Under stochastic gradient descent with learning rate \(\eta\) and batch noise covariance \(\Sigma_{\mathrm{grad}}\), the grokking delay satisfies: \[ \tau_{\mathrm{grok}} \sim \tau_0 \exp\!\left(\frac{\Delta F}{T_{\mathrm{eff}}}\right), \tag{15}\] where:

\(\tau_0\) is a prefactor depending on the curvature of the free energy at the memorization minimum and the saddle point.
\(\Delta F = \Omega(P)\) is the free energy barrier (Equation 14).
\(T_{\mathrm{eff}} = \eta \cdot \mathrm{tr}(\Sigma_{\mathrm{grad}}) / P\) is the effective temperature of the SGD noise.

The delay ratio satisfies: \[ \frac{t_{\mathrm{grok}}}{t_{\mathrm{mem}}} = \exp\!\left(\Omega\!\left(\frac{P}{T_{\mathrm{eff}}}\right)\right), \tag{16}\] which is exponentially large in \(P / T_{\mathrm{eff}}\).

Proof. The stochastic gradient dynamics near the memorization basin are modeled as an overdamped Langevin equation: \[ d\theta = -\nabla F(\theta)\,dt + \sqrt{2T_{\mathrm{eff}}}\,dB_t, \tag{17}\] where \(B_t\) is standard Brownian motion and \(T_{\mathrm{eff}}\) captures the effective noise from SGD.

Effective temperature. For SGD with batch size \(B\) and learning rate \(\eta\), the noise covariance per step is \(\Sigma_{\mathrm{SGD}} = (\eta/B) \cdot \mathrm{Cov}[\nabla_\theta \ell_i]\) where \(\ell_i\) is the per-sample loss. In the full-batch setting (as in our experiments), the noise comes from the discretization of gradient flow and has effective temperature \(T_{\mathrm{eff}} \propto \eta^2 \cdot \mathrm{tr}(\nabla^2 F)\). For mini-batch SGD, \(T_{\mathrm{eff}} = \eta \cdot \mathrm{tr}(\Sigma_{\mathrm{grad}}) / (2B)\).

Kramers formula. By the classical Kramers escape theory (Kramers 1940), the mean first-passage time from a metastable minimum (depth \(\Delta F\)) over a saddle point is: \[ \tau_{\mathrm{escape}} = \frac{2\pi}{\omega_s} \sqrt{\frac{|\det \nabla^2 F(\theta_s)|}{\det \nabla^2 F(\theta_{\mathcal{M}})}} \exp\!\left(\frac{\Delta F}{T_{\mathrm{eff}}}\right), \] where \(\omega_s\) is the magnitude of the unstable eigenvalue at the saddle. The exponential factor dominates, giving (Equation 15).

Since \(t_{\mathrm{mem}} = O(\tau_{\mathrm{readout}})\) is polynomial in \(N\) (from Section 2) and \(\tau_{\mathrm{grok}}\) is exponential in \(P/T_{\mathrm{eff}}\), the ratio (Equation 16) follows. \(\square\)

3.4 Dependence on Control Parameters

Proposition (Parameter Dependence of \(\Delta F\))

The free energy barrier \(\Delta F\) depends on the control parameters \((\alpha, \lambda)\) as follows:

Weight decay \(\lambda\). \(\Delta F\) is a decreasing function of \(\lambda\). Stronger regularization destabilizes the memorization basin (whose free energy \(F_{\mathcal{M}} = \lambda \|\theta_{\mathcal{M}}\|^2\) grows with \(\lambda\)) relative to the generalization basin (whose \(F_{\mathcal{G}} = \lambda \|\theta_{\mathcal{G}}\|^2\) is nearly \(\lambda\)-independent at \(O(\lambda)\)). The barrier decreases because the memorization minimum becomes shallower.
Initialization scale \(\alpha\). \(\Delta F\) is an increasing function of \(\alpha\). Larger \(\alpha\) suppresses the feature gradient by \(N^{-2\alpha}\), effectively widening the barrier by making the transition path longer in the slow feature directions.
Phase boundaries. The grokking delay diverges as:
- \(\alpha \to \alpha_c^-\): the barrier \(\Delta F \to \infty\) as feature learning is suppressed.
- \(\lambda \to \lambda_{\min}^+\): the barrier \(\Delta F \to \infty\) as the memorization basin becomes increasingly stable.

4. Connection to Empirical Results

Our empirical reproduction confirms the theoretical predictions. For background on the phase transition framework and order parameters, see the Survey.

4.1 Observed Grokking Delay

With \(\alpha = 0\), \(\lambda = 1.0\), \(N = 512\), \(p = 97\), and \(\eta = 10^{-3}\):

Quantity	Observed	Theoretical Prediction
\(t_{\mathrm{mem}}\)	200 epochs	\(O(\tau_{\mathrm{readout}}) = O(1/(\eta \\|K_0\\|))\)
\(t_{\mathrm{grok}}\)	14,600 epochs	\(O(\tau_{\mathrm{features}})\) at \(\alpha = 0\), or \(O(1/\lambda)\) via barrier
Delay ratio	\(73\times\)	\(\gg 1\) for \(\lambda\) in the Goldilocks zone
Final train acc	100%	Memorization guaranteed for \(N \gg n\)
Final test acc	100%	Generalization in the rich regime

4.2 Consistency Checks

\(\alpha = 0\) gives finite delay. At \(\alpha = 0\), the timescale ratio \(N^{2\alpha} = 1\), so the timescale separation argument alone does not predict a large delay. However, the barrier mechanism still applies: the network must escape the memorization basin, which takes time \(\sim \exp(\Delta F / T_{\mathrm{eff}})\). The observed \(73\times\) delay is consistent with a moderate barrier \(\Delta F / T_{\mathrm{eff}} \approx \ln(73) \approx 4.3\).
Strong weight decay (\(\lambda = 1.0\)) reduces delay. Per the Proposition on Parameter Dependence (Section 3.4), large \(\lambda\) lowers \(\Delta F\), consistent with the relatively modest \(73\times\) delay. At \(\lambda = 0.01\), the delay is expected to be orders of magnitude larger.
Order parameter jumps. The empirical observation that kernel alignment, effective rank, and SNR all change sharply around \(t_{\mathrm{grok}}\) is consistent with the first-order phase transition picture (see Survey: Physical Order Parameters): the order parameter \(m\) jumps discontinuously as the system escapes \(\mathcal{M}\) and relaxes into \(\mathcal{G}\) (Rubin, Seroussi, and Ringel 2024).

4.3 Spectral Asymmetry Interpretation

An alternative, complementary viewpoint comes from spectral analysis (Pasand, Goldt, and Eftekhari 2025). The empirical feature covariance has a large condition number \(\kappa = \lambda_{\max}/\lambda_{\min} \gg 1\). Gradient descent converges along the top eigenvectors (memorization modes) in \(O(1/\lambda_{\max})\) steps but takes \(O(1/\lambda_{\min})\) steps for the bottom eigenvectors (generalization modes). This gives: \[ \frac{t_{\mathrm{grok}}}{t_{\mathrm{mem}}} \sim \kappa = \frac{\lambda_{\max}}{\lambda_{\min}}. \] For modular arithmetic, the task-relevant Fourier modes correspond to small eigenvalues in the initial random feature basis, explaining the large condition number and hence the large delay.

5. Discussion

5.1 Summary of Results

We have established that \(t_{\mathrm{grok}} \gg t_{\mathrm{mem}}\) through two complementary mechanisms:

Mechanism	Delay Scaling	Conditions
Timescale separation (Section 2.4)	\(\Omega(N^{2\alpha})\)	\(\alpha > 0\), width \(N\)
Implicit bias dichotomy (Section 2.5)	\((1/\lambda)\log\alpha\)	Homogeneous networks
Ridge regression bound (Section 2.6)	\(\sim 1/(\lambda \cdot \lambda_{\min}^+)\)	Linear models
Kramers escape (Section 3.3)	\(\exp(\Omega(P/T_{\mathrm{eff}}))\)	Extensive barrier
Spectral asymmetry	\(\sim \kappa\)	Anisotropic features

These mechanisms are not mutually exclusive. In practice, both the polynomial timescale separation (fast readout vs. slow features) and the exponential barrier crossing (escape from metastable memorization) contribute to the delay, with the dominant mechanism depending on the regime.

5.2 Gaps and Open Problems

Nonlinear barrier computation. The Kramers analysis assumes the barrier \(\Delta F = \Omega(P)\) but does not compute it explicitly for ReLU networks. A precise computation would require characterizing the saddle point geometry on the zero-loss manifold, which remains an open problem.
Full-batch vs. mini-batch. Our experiments use full-batch training, where the “noise” driving barrier crossing comes from gradient flow discretization rather than mini-batch sampling. The effective temperature in this regime is \(T_{\mathrm{eff}} \propto \eta^2\) rather than \(T_{\mathrm{eff}} \propto \eta/B\), which changes the dependence on learning rate.
Depth. All results are for two-layer networks. For deeper networks, the timescale hierarchy may become richer, with different layers learning at different rates. The interplay between depth and the grokking delay is an active research direction.
Universality. Our proofs rely on properties specific to modular arithmetic (Fourier structure, permutation symmetry). Whether the same barrier scaling \(\Delta F = \Omega(P)\) holds for general algorithmic tasks (sparse parities, permutation groups) is an open question, though the physics intuition (extensive barrier from collective rearrangement) suggests it should.

References

Barak, Boaz, Benjamin L. Edelman, Surbhi Goel, Sham Kakade, Eran Malach, and Cyril Zhang. 2025. “To Grok Grokking: Provable Grokking in Ridge Regression.” https://arxiv.org/abs/2601.19791.

Kramers, H. A. 1940. “Brownian Motion in a Field of Force and the Diffusion Model of Chemical Reactions.” Physica 7 (4): 284–304. https://doi.org/10.1016/S0031-8914(40)90098-2.

Lyu, Kaifeng, Jikai Jin, Zhiyuan Li, Simon S. Du, Jason D. Lee, and Wei Hu. 2024. “Dichotomy of Early and Late Phase Implicit Biases Can Provably Induce Grokking.” In The Twelfth International Conference on Learning Representations. https://openreview.net/forum?id=XsHqr9dEGH.

Nanda, Neel, Lawrence Chan, Tom Lieberum, Jess Smith, and Jacob Steinhardt. 2023. “Progress Measures for Grokking via Mechanistic Interpretability.” In The Eleventh International Conference on Learning Representations. https://openreview.net/forum?id=9XFSbDPmdW.

Pasand, Ali, Sebastian Goldt, and Armin Eftekhari. 2025. “Egalitarian Gradient Descent: A Simple Approach to Accelerated Grokking.” https://arxiv.org/abs/2510.04930.

Rubin, Noa, Inbar Seroussi, and Zohar Ringel. 2024. “Droplets of Good Representations: Grokking as a First Order Phase Transition in Two Layer Networks.” In The Twelfth International Conference on Learning Representations. https://openreview.net/forum?id=3ROGsTX3IR.