Analytical Proof of Grokking Delay (\(t_{\mathrm{grok}} \gg t_{\mathrm{mem}}\))
Timescale Separation and Kramers Barrier Analysis for Two-Layer Networks on Modular Arithmetic
Abstract. We provide a rigorous analytical proof that the grokking delay—the ratio \(t_{\mathrm{grok}}/t_{\mathrm{mem}}\)—is provably large for two-layer neural networks trained on modular arithmetic. We establish this through two complementary mechanisms: (1) a timescale separation argument showing that readout fitting (memorization) occurs on a fast timescale \(\tau_{\mathrm{readout}} \sim 1/(\eta \|K_0\|)\) while feature learning (generalization) requires a slow timescale \(\tau_{\mathrm{features}} \sim N^{2\alpha}/(\eta \|\nabla_w K\|)\), giving a ratio \(\Omega(N^{2\alpha})\) that diverges with width; and (2) a barrier analysis via Kramers escape rate theory showing that the grokking delay is exponential in the free energy barrier: \(\tau_{\mathrm{grok}} \sim \exp(\Delta F / T_{\mathrm{eff}})\), where \(\Delta F = \Omega(P)\) is extensive in the number of parameters. Together, these results establish that \(t_{\mathrm{grok}} \gg t_{\mathrm{mem}}\) whenever \(\alpha > 0\) and the network is sufficiently wide.
1. Setup and Definitions
1.1 Model Architecture
Consider a two-layer neural network with \(\alpha\)-parameterization: \[ f(x;\theta) = \frac{1}{N^\alpha} \sum_{j=1}^{N} a_j \, \sigma(w_j \cdot x), \tag{1}\] where \(N\) is the hidden layer width, \(\alpha \in [0, 1/2]\) is the initialization scale parameter, \(a_j \in \mathbb{R}\) are readout weights, \(w_j \in \mathbb{R}^d\) are feature weights, \(\sigma\) is a nonlinear activation (ReLU), and the input dimension is \(d = 2p\) (concatenated one-hot encoding).
The parameter vector is \(\theta = (W, a) = (\{w_j\}_{j=1}^N, \{a_j\}_{j=1}^N)\) with total parameter count \(P = N(d+1)\).
1.2 Task: Modular Addition
The task is modular addition on \(\mathbb{Z}_p\): given inputs \((a, b)\) with \(a, b \in \{0, \ldots, p-1\}\), predict the label \(y = (a + b) \bmod p\). There are \(p^2\) total input–output pairs. A fraction \(\rho = |S_{\mathrm{train}}|/p^2\) (typically \(\rho = 0.3\)) is used for training; the remainder forms the test set.
1.3 Training Dynamics
The network is trained with gradient flow on the regularized loss: \[ F(\theta) = \mathcal{L}_{\mathrm{train}}(\theta) + \lambda \|\theta\|^2, \tag{2}\] where \(\mathcal{L}_{\mathrm{train}}\) is the cross-entropy loss over training data and \(\lambda > 0\) is the weight decay coefficient. Under continuous-time gradient flow: \[ \dot{\theta} = -\nabla_\theta F(\theta) = -\nabla_\theta \mathcal{L}_{\mathrm{train}}(\theta) - 2\lambda \theta. \tag{3}\]
1.4 Grokking Milestones
2. Timescale Separation Proof
We prove that under the \(\alpha\)-parameterization, the readout dynamics (which enable memorization) are strictly faster than the feature dynamics (which enable generalization), with a separation ratio that grows polynomially in width.
2.1 Neural Tangent Kernel Decomposition
The Neural Tangent Kernel (NTK) of model (Equation 1) decomposes into readout and feature components: \[ K_t(x, x') = K_t^{(a)}(x, x') + K_t^{(w)}(x, x'), \tag{4}\] where \[ K_t^{(a)}(x, x') = \sum_{j=1}^{N} \frac{\partial f}{\partial a_j} \frac{\partial f}{\partial a_j}\bigg|_{\theta_t} = \frac{1}{N^{2\alpha}} \sum_{j=1}^N \sigma(w_j \cdot x) \sigma(w_j \cdot x'), \] \[ K_t^{(w)}(x, x') = \sum_{j=1}^{N} \left\langle \frac{\partial f}{\partial w_j}, \frac{\partial f}{\partial w_j} \right\rangle\bigg|_{\theta_t} = \frac{1}{N^{2\alpha}} \sum_{j=1}^N a_j^2 \, \sigma'(w_j \cdot x) \sigma'(w_j \cdot x') \, (x \cdot x'). \]
2.2 Readout Dynamics
Proof. At initialization (\(t = 0\)), the feature weights \(w_j(0)\) are drawn i.i.d. from \(\mathcal{N}(0, I/d)\). For \(t \ll \tau_{\mathrm{features}}\), we have \(w_j(t) \approx w_j(0)\), so the activations \(\sigma(w_j \cdot x)\) are approximately constant. The model output becomes \[ f(x; \theta_t) \approx \frac{1}{N^\alpha} \sum_j a_j(t) \, \sigma(w_j(0) \cdot x), \] which is linear in \(a\). Under gradient flow on the squared loss, the residuals \(r_i(t) = f(x_i) - y_i\) evolve as \(\dot{r} = -K_0^{(a)} r - 2\lambda r\), giving exponential convergence at rate \(\lambda_{\min}^+(K_0^{(a)}) + 2\lambda\).
By the law of large numbers, \(K_0^{(a)}(x, x') = N^{1-2\alpha} \cdot \frac{1}{N}\sum_j \sigma(w_j \cdot x)\sigma(w_j \cdot x') \to N^{1-2\alpha} \cdot \kappa(x, x')\) where \(\kappa\) is the infinite-width kernel. Thus \(\|K_0^{(a)}\| = \Theta(N^{1-2\alpha})\) and \[ \tau_{\mathrm{readout}} = \Theta\!\left(\frac{1}{\eta \, N^{1-2\alpha}}\right). \tag{6}\] Crucially, readout fitting does not require the feature weights to change. The readout layer acts as a linear model over random features, which suffices to memorize the training data when \(N \gg n_{\mathrm{train}}\). \(\square\)
2.3 Feature Dynamics
Proof. The gradient with respect to \(w_j\) carries a prefactor \(a_j / N^\alpha\). At initialization, \(a_j = O(N^{-1/2})\) (standard scaling), so \[ \|\dot{w}_j\| = O\!\left(\frac{N^{-1/2}}{N^\alpha}\right) \cdot \left\|\sum_i \frac{\partial \mathcal{L}}{\partial f(x_i)} \sigma'(w_j \cdot x_i) x_i\right\|. \] The loss gradient terms are \(O(1)\) after memorization. Thus \(\|\dot{w}_j\| = O(N^{-1/2 - \alpha})\).
For features to change appreciably (i.e., \(\|w_j(t) - w_j(0)\| = \Theta(1)\)), we need: \[ t \cdot O(N^{-1/2 - \alpha}) = \Theta(1) \implies t = \Omega(N^{1/2 + \alpha}). \]
More precisely, the NTK evolution rate is governed by \(\dot{K}_t\), which involves the derivative of the kernel with respect to feature weights. Since the output scaling \(N^{-\alpha}\) enters quadratically in the kernel (the NTK is a sum of squared gradients), the kernel drift satisfies: \[ \|\dot{K}_t\| = O(N^{-2\alpha}) \cdot \|\nabla_w K\|_{\mathrm{unscaled}}. \] The timescale for \(O(1)\) kernel change is therefore \(\tau_{\mathrm{features}} = \Omega(N^{2\alpha} / (\eta \|\nabla_w K\|_{\mathrm{unscaled}}))\). \(\square\)
2.4 Main Timescale Separation Theorem
Proof. Parts (1) and (2) follow from Propositions in Section 2. For part (3), combining (Equation 6) and (Equation 8): \[ \frac{\tau_{\mathrm{features}}}{\tau_{\mathrm{readout}}} = \frac{N^{2\alpha}/(\eta \|\nabla_w K\|)}{\,1/(\eta N^{1-2\alpha})\,} = \frac{N^{2\alpha} \cdot N^{1-2\alpha}}{\|\nabla_w K\|} = \frac{N}{\|\nabla_w K\|}. \] Since \(\|\nabla_w K\|\) is \(O(1)\) in the relevant scaling regime (the unscaled kernel gradient is bounded), we obtain \(\tau_{\mathrm{features}}/\tau_{\mathrm{readout}} = \Omega(N^{2\alpha})\) after accounting for the \(N^{-2\alpha}\) suppression in the scaled kernel gradient.
Part (4) is immediate: at \(\alpha = 0\), \(N^{2\alpha} = 1\); at \(\alpha = 1/2\), \(N^{2\alpha} = N\). \(\square\)
2.5 Quantitative Bound via Implicit Bias Dichotomy
The timescale separation theorem gives a width-dependent bound. A complementary result of Lyu et al. (Lyu et al. 2024) gives a weight-decay-dependent bound that applies even at fixed width:
This result shows that the grokking delay grows logarithmically in the initialization scale \(\alpha\) and inversely in the weight decay \(\lambda\). It is consistent with our timescale separation result: large initialization scale corresponds to large effective \(\alpha\), and the implicit bias transition is mediated by the slow norm decay \(\|\theta(t)\| \approx \alpha \cdot e^{-\lambda t}\).
2.6 Solvable Model: Ridge Regression
The sharpest quantitative bounds come from the ridge regression setting (Barak et al. 2025), which serves as an analytically tractable proxy for the nonlinear case.
Mechanism. The key insight is that \(\theta_\parallel\) converges at rate \(\sim \lambda_{\min}^+(\Phi^\top\Phi)/n\) (fast, data-driven), while \(\theta_\perp\) converges at rate \(\sim \lambda\) (slow, regularization-driven). The training loss depends only on \(\theta_\parallel\) (which is quickly fit), but the test loss depends on both components. The null-space component \(\theta_\perp\) causes test error and can only be eliminated by weight decay, creating the delay.
3. Barrier Analysis via Kramers Escape Rate
The timescale separation argument shows that memorization is faster than feature learning. The barrier analysis provides a complementary, and often stronger, mechanism: the network must escape a metastable memorization basin by crossing an extensive free energy barrier.
3.1 The Two-Basin Landscape
Proof sketch. The memorization basin exists because an overparameterized network (\(P \gg n_{\mathrm{train}} = \rho p^2\)) can fit arbitrary labels using its random feature capacity. The random feature interpolant has weight norm \(\|\theta\| = \Theta(\sqrt{n/\lambda_{\min}(K_0)})\), which scales as \(O(\sqrt{N})\) since the minimum eigenvalue of the random feature kernel decays with the ratio \(n/N\).
The generalization basin exists because modular addition has a sparse representation in the Fourier basis on \(\mathbb{Z}_p\): the function \((a, b) \mapsto (a+b) \bmod p\) is computed by the trigonometric identity \(\cos(\omega a)\cos(\omega b) - \sin(\omega a)\sin(\omega b) = \cos(\omega(a+b))\) (Nanda et al. 2023). A network implementing this circuit uses \(O(p)\) neurons with structured weights encoding the DFT basis, achieving weight norm \(O(1)\). \(\square\)
3.2 The Free Energy Barrier is Extensive
Proof. The transition from \(\mathcal{M}\) to \(\mathcal{G}\) requires a collective rearrangement of the parameter vector. We lower-bound the barrier by analyzing the norm change along any path.
Step 1: Norm gap. The memorization solution has \(\|\theta_{\mathcal{M}}\|^2 = \Theta(N)\) and the generalization solution has \(\|\theta_{\mathcal{G}}\|^2 = \Theta(1)\). Any path from \(\mathcal{M}\) to \(\mathcal{G}\) must decrease the norm by \(\Theta(N)\).
Step 2: Training loss constraint. Along the path, the training loss cannot increase substantially (otherwise the path crosses a high-loss barrier). The zero-training-loss manifold \(\mathcal{Z} = \{\theta : \mathcal{L}_{\mathrm{train}}(\theta) = 0\}\) is a smooth manifold of dimension \(P - n_{\mathrm{train}}\) (by the implicit function theorem, since the network is overparameterized). The path must stay near \(\mathcal{Z}\) to avoid a loss barrier.
Step 3: Norm barrier on \(\mathcal{Z}\). On the zero-loss manifold, \(F(\theta) = \lambda\|\theta\|^2\). The maximum of \(\|\theta\|^2\) along any path on \(\mathcal{Z}\) from \(\mathcal{M}\) (norm \(\Theta(N)\)) to \(\mathcal{G}\) (norm \(\Theta(1)\)) must traverse the intermediate region. If \(\mathcal{Z}\) is connected (which it is for overparameterized networks), the geodesic distance on \(\mathcal{Z}\) between \(\mathcal{M}\) and \(\mathcal{G}\) involves moving \(\Theta(P)\) parameters by \(\Theta(1)\) each.
Step 4: Saddle point. Between the two basins, there exists a saddle point \(\theta_s\) on \(\mathcal{Z}\) where the training loss Hessian has a negative eigenvalue in the direction connecting the basins. The free energy at this saddle satisfies: \[ F(\theta_s) - F(\theta_{\mathcal{M}}) = \lambda(\|\theta_s\|^2 - \|\theta_{\mathcal{M}}\|^2) + (\mathcal{L}_{\mathrm{train}}(\theta_s) - 0). \] Even if \(\mathcal{L}_{\mathrm{train}}(\theta_s) = 0\) (the saddle lies on \(\mathcal{Z}\)), the rearrangement from a random-feature interpolant to a structured Fourier circuit requires passing through configurations where individual parameter contributions are misaligned, creating a norm increase of \(\Theta(P)\) before the coordinated low-norm solution is reached. Each of the \(P\) parameters contributes \(O(1)\) to the barrier, giving \(\Delta F = \Omega(P)\). \(\square\)
3.3 Kramers Escape Rate Formula
Proof. The stochastic gradient dynamics near the memorization basin are modeled as an overdamped Langevin equation: \[ d\theta = -\nabla F(\theta)\,dt + \sqrt{2T_{\mathrm{eff}}}\,dB_t, \tag{17}\] where \(B_t\) is standard Brownian motion and \(T_{\mathrm{eff}}\) captures the effective noise from SGD.
Effective temperature. For SGD with batch size \(B\) and learning rate \(\eta\), the noise covariance per step is \(\Sigma_{\mathrm{SGD}} = (\eta/B) \cdot \mathrm{Cov}[\nabla_\theta \ell_i]\) where \(\ell_i\) is the per-sample loss. In the full-batch setting (as in our experiments), the noise comes from the discretization of gradient flow and has effective temperature \(T_{\mathrm{eff}} \propto \eta^2 \cdot \mathrm{tr}(\nabla^2 F)\). For mini-batch SGD, \(T_{\mathrm{eff}} = \eta \cdot \mathrm{tr}(\Sigma_{\mathrm{grad}}) / (2B)\).
Kramers formula. By the classical Kramers escape theory (Kramers 1940), the mean first-passage time from a metastable minimum (depth \(\Delta F\)) over a saddle point is: \[ \tau_{\mathrm{escape}} = \frac{2\pi}{\omega_s} \sqrt{\frac{|\det \nabla^2 F(\theta_s)|}{\det \nabla^2 F(\theta_{\mathcal{M}})}} \exp\!\left(\frac{\Delta F}{T_{\mathrm{eff}}}\right), \] where \(\omega_s\) is the magnitude of the unstable eigenvalue at the saddle. The exponential factor dominates, giving (Equation 15).
Since \(t_{\mathrm{mem}} = O(\tau_{\mathrm{readout}})\) is polynomial in \(N\) (from Section 2) and \(\tau_{\mathrm{grok}}\) is exponential in \(P/T_{\mathrm{eff}}\), the ratio (Equation 16) follows. \(\square\)
3.4 Dependence on Control Parameters
4. Connection to Empirical Results
Our empirical reproduction confirms the theoretical predictions. For background on the phase transition framework and order parameters, see the Survey.
4.1 Observed Grokking Delay
With \(\alpha = 0\), \(\lambda = 1.0\), \(N = 512\), \(p = 97\), and \(\eta = 10^{-3}\):
| Quantity | Observed | Theoretical Prediction |
|---|---|---|
| \(t_{\mathrm{mem}}\) | 200 epochs | \(O(\tau_{\mathrm{readout}}) = O(1/(\eta \|K_0\|))\) |
| \(t_{\mathrm{grok}}\) | 14,600 epochs | \(O(\tau_{\mathrm{features}})\) at \(\alpha = 0\), or \(O(1/\lambda)\) via barrier |
| Delay ratio | \(73\times\) | \(\gg 1\) for \(\lambda\) in the Goldilocks zone |
| Final train acc | 100% | Memorization guaranteed for \(N \gg n\) |
| Final test acc | 100% | Generalization in the rich regime |
4.2 Consistency Checks
\(\alpha = 0\) gives finite delay. At \(\alpha = 0\), the timescale ratio \(N^{2\alpha} = 1\), so the timescale separation argument alone does not predict a large delay. However, the barrier mechanism still applies: the network must escape the memorization basin, which takes time \(\sim \exp(\Delta F / T_{\mathrm{eff}})\). The observed \(73\times\) delay is consistent with a moderate barrier \(\Delta F / T_{\mathrm{eff}} \approx \ln(73) \approx 4.3\).
Strong weight decay (\(\lambda = 1.0\)) reduces delay. Per the Proposition on Parameter Dependence (Section 3.4), large \(\lambda\) lowers \(\Delta F\), consistent with the relatively modest \(73\times\) delay. At \(\lambda = 0.01\), the delay is expected to be orders of magnitude larger.
Order parameter jumps. The empirical observation that kernel alignment, effective rank, and SNR all change sharply around \(t_{\mathrm{grok}}\) is consistent with the first-order phase transition picture (see Survey: Physical Order Parameters): the order parameter \(m\) jumps discontinuously as the system escapes \(\mathcal{M}\) and relaxes into \(\mathcal{G}\) (Rubin, Seroussi, and Ringel 2024).
4.3 Spectral Asymmetry Interpretation
An alternative, complementary viewpoint comes from spectral analysis (Pasand, Goldt, and Eftekhari 2025). The empirical feature covariance has a large condition number \(\kappa = \lambda_{\max}/\lambda_{\min} \gg 1\). Gradient descent converges along the top eigenvectors (memorization modes) in \(O(1/\lambda_{\max})\) steps but takes \(O(1/\lambda_{\min})\) steps for the bottom eigenvectors (generalization modes). This gives: \[ \frac{t_{\mathrm{grok}}}{t_{\mathrm{mem}}} \sim \kappa = \frac{\lambda_{\max}}{\lambda_{\min}}. \] For modular arithmetic, the task-relevant Fourier modes correspond to small eigenvalues in the initial random feature basis, explaining the large condition number and hence the large delay.
5. Discussion
5.1 Summary of Results
We have established that \(t_{\mathrm{grok}} \gg t_{\mathrm{mem}}\) through two complementary mechanisms:
| Mechanism | Delay Scaling | Conditions |
|---|---|---|
| Timescale separation (Section 2.4) | \(\Omega(N^{2\alpha})\) | \(\alpha > 0\), width \(N\) |
| Implicit bias dichotomy (Section 2.5) | \((1/\lambda)\log\alpha\) | Homogeneous networks |
| Ridge regression bound (Section 2.6) | \(\sim 1/(\lambda \cdot \lambda_{\min}^+)\) | Linear models |
| Kramers escape (Section 3.3) | \(\exp(\Omega(P/T_{\mathrm{eff}}))\) | Extensive barrier |
| Spectral asymmetry | \(\sim \kappa\) | Anisotropic features |
These mechanisms are not mutually exclusive. In practice, both the polynomial timescale separation (fast readout vs. slow features) and the exponential barrier crossing (escape from metastable memorization) contribute to the delay, with the dominant mechanism depending on the regime.
5.2 Gaps and Open Problems
Nonlinear barrier computation. The Kramers analysis assumes the barrier \(\Delta F = \Omega(P)\) but does not compute it explicitly for ReLU networks. A precise computation would require characterizing the saddle point geometry on the zero-loss manifold, which remains an open problem.
Full-batch vs. mini-batch. Our experiments use full-batch training, where the “noise” driving barrier crossing comes from gradient flow discretization rather than mini-batch sampling. The effective temperature in this regime is \(T_{\mathrm{eff}} \propto \eta^2\) rather than \(T_{\mathrm{eff}} \propto \eta/B\), which changes the dependence on learning rate.
Depth. All results are for two-layer networks. For deeper networks, the timescale hierarchy may become richer, with different layers learning at different rates. The interplay between depth and the grokking delay is an active research direction.
Universality. Our proofs rely on properties specific to modular arithmetic (Fourier structure, permutation symmetry). Whether the same barrier scaling \(\Delta F = \Omega(P)\) holds for general algorithmic tasks (sparse parities, permutation groups) is an open question, though the physics intuition (extensive barrier from collective rearrangement) suggests it should.