When Does Sparsity Mitigate
the Curse of Depth in LLMs

Revealing sparsity as an intrinsic variance regulator that unlocks effective depth utilization

Dilxat Muhtar*†3,4,5 Xinyuan Song*6 Sebastian Pokutta1 Max Zimmer1 Nico Pelleriti1 Thomas Hofmann2 Shiwei Liu†3,4,5

* Equal Contribution    † Corresponding Author

1 Zuse Institute Berlin & Technical University of Berlin

2 ETH Zürich

3 Max Planck Institute for Intelligent Systems

4 ELLIS Institute Tübingen

5 Tübingen AI Center

6 Emory University

Core Contributions

The Problem

Deep LLMs suffer from the "curse of depth" — later layers are under-utilized due to variance explosion in Pre-LN architectures.

The Insight

Sparsity — both implicit (weight decay, long context) and explicit (GQA, MoE) — regulates variance propagation.

The Result

Practical recipe achieves 4.6% accuracy improvement on downstream tasks with superior depth utilization.

The Challenge

The Curse of Depth

Recent studies show that later Transformer layers in deep LLMs are frequently under-utilized. Many layers can be removed or permuted with minimal performance degradation.

Variance Explosion

Output variance grows sub-exponentially with depth

Identity Mapping

Deep layer Jacobians approach identity

Resource Waste

An L=32 model uses 2.56× the parameters but wastes 14 layers

Figure: Last-Layer Variance vs. Depth

Figure: Jacobian Frobenius Norm: ||J − I||_F shrinks toward zero with depth, i.e. deep-layer Jacobians approach the identity

Figure: Layer Effectiveness Scores
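The variance trend in the first figure can be reproduced qualitatively with a toy Pre-LN residual stack. This is a sketch under assumptions of ours (random linear blocks, per-layer LayerNorm, width 256), not the paper's model: each block adds roughly unit variance to the residual stream, so variance grows with depth.

```python
import numpy as np

rng = np.random.default_rng(0)

def residual_variance(depth, width=256, n_tokens=64):
    """Toy Pre-LN residual stream: x <- x + F(LN(x)) with a fresh random
    linear F per block. Returns the variance of the final activations."""
    x = rng.standard_normal((n_tokens, width))
    for _ in range(depth):
        # Pre-LN: normalize the branch input, not the residual stream itself
        h = (x - x.mean(axis=1, keepdims=True)) / x.std(axis=1, keepdims=True)
        W = rng.standard_normal((width, width)) / np.sqrt(width)
        x = x + h @ W  # each block injects roughly unit variance
    return float(x.var())

print([round(residual_variance(d), 1) for d in (1, 4, 16, 32)])
```

Because LayerNorm fixes the branch input scale, the injected variance per layer is roughly constant, so the stream variance keeps climbing with depth rather than saturating.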

The Foundation

Sparsity as Variance Regulator

Core Insight

Sparsity reduces variance propagation in residual-depth architectures. Smaller mask density (higher sparsity) leads to slower variance growth with depth, mitigating the curse of depth.
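As a sanity check on this insight, the toy residual stack above can be extended with Bernoulli weight masks. The masked-linear block and the density values are illustrative assumptions of ours, not the paper's derivation; the point is only that lower mask density slows variance growth:

```python
import numpy as np

rng = np.random.default_rng(0)

def variance_growth(depth, density, width=256):
    """Toy Pre-LN residual stream where each block's weights are masked
    to a given density; sparser blocks inject less variance per layer."""
    x = rng.standard_normal((64, width))
    for _ in range(depth):
        h = (x - x.mean(1, keepdims=True)) / x.std(1, keepdims=True)
        W = rng.standard_normal((width, width)) / np.sqrt(width)
        mask = rng.random((width, width)) < density  # Bernoulli sparsity mask
        x = x + h @ (W * mask)
    return float(x.var())

for density in (1.0, 0.5, 0.1):
    print(density, round(variance_growth(24, density), 1))
```

In this sketch the per-layer variance injection scales roughly with the mask density, so a density-0.1 stack ends up far flatter in depth than the dense one.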

Implicit Sparsity

Emerges dynamically during training

  • Weight Decay — Induces weight sparsity
  • Long Context — Induces attention sparsity

Explicit Sparsity

Enforced by architectural design

  • GQA — Key/value sharing
  • MoE — Sparse expert routing

Empirical Validation

Experimental Results

Weight Decay

Implicit Sparsity

Figure: Variance vs. Weight Decay

Weight Decay (λ)   PPL ↓    Usefulness
λ = 0              15.63    0.75
λ = 0.01           15.20    0.75
λ = 0.1 ✓          14.83    0.81
λ = 1.0            15.55    0.69
λ = 3.0            773.42   0.63

Optimal λ = 0.1 improves both PPL and usefulness score
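One way to see the implicit-sparsity mechanism behind this table is a toy regression with decoupled weight decay, the AdamW-style update. The data, near-zero threshold, and decay values below are hypothetical, chosen only to illustrate that stronger decay pushes more weights toward zero:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy data: only the first 5 of 50 features carry signal.
X = rng.standard_normal((200, 50))
w_true = np.zeros(50)
w_true[:5] = 1.0
y = X @ w_true + 0.5 * rng.standard_normal(200)

def near_zero_fraction(weight_decay, steps=2000, lr=0.01):
    """Fraction of weights with |w| < 1e-2 after gradient descent
    with decoupled weight decay (the AdamW-style update rule)."""
    w = np.zeros(50)
    for _ in range(steps):
        grad = X.T @ (X @ w - y) / len(y)
        w -= lr * grad              # data-fitting term
        w -= lr * weight_decay * w  # decoupled decay term
    return float((np.abs(w) < 1e-2).mean())

print(near_zero_fraction(0.0), near_zero_fraction(5.0))
```

With no decay the noise-fitting weights stay small but nonzero; with strong decay most of them collapse below the threshold, i.e. decay acts as an implicit sparsifier.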

Sequence Length

Implicit Sparsity

Figure: Variance vs. Sequence Length

Sequence Length (T)   PPL ↓   Usefulness
T = 256               18.51   0.69
T = 512               15.71   0.75
T = 1024              14.77   0.75
T = 2048 ✓            14.51   0.81
T = 4096              14.52   0.81
T = 8192              16.30   0.75

Sweet spot at T = 2048 with optimal perplexity and usefulness
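The attention sparsity that long context is said to induce can be quantified, for instance, as the probability mass carried by the top few percent of positions in an attention row. This particular metric is our assumption for illustration, not necessarily the paper's measure:

```python
import numpy as np

def topk_mass(attn_row, frac=0.1):
    """Probability mass carried by the top `frac` fraction of positions.
    Near `frac` means dense (near-uniform) attention; near 1.0 means sparse."""
    k = max(1, int(len(attn_row) * frac))
    return float(np.sort(attn_row)[::-1][:k].sum())

T = 2048
uniform = np.full(T, 1.0 / T)  # perfectly dense attention row
onehot = np.zeros(T)
onehot[0] = 1.0                # perfectly sparse attention row
print(topk_mass(uniform), topk_mass(onehot))
```

Applied to a trained model's attention maps, this score rising toward 1.0 as T grows would indicate the implicit attention sparsity discussed above.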

Grouped Query Attention

Explicit Sparsity

Figure: GQA vs. MHA Variance Comparison

Group Size (G)   PPL ↓   Usefulness
G = 1 (MHA)      14.52   0.81
G = 4            14.50   0.87
G = 16 ✓         14.47   0.87

G=16 achieves best PPL with 7% improvement in usefulness score
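A minimal NumPy sketch of the key/value sharing behind GQA. The head counts and dimensions are illustrative; real implementations batch this loop and include learned projections:

```python
import numpy as np

rng = np.random.default_rng(0)

def gqa_attention(q, k, v, group_size):
    """Grouped-query attention: every `group_size` query heads share one
    key/value head; group_size=1 recovers standard multi-head attention."""
    n_q_heads, T, d = q.shape
    out = np.empty_like(q)
    for h in range(n_q_heads):
        kv = h // group_size  # index of the shared KV head for this group
        scores = q[h] @ k[kv].T / np.sqrt(d)
        w = np.exp(scores - scores.max(axis=-1, keepdims=True))
        w /= w.sum(axis=-1, keepdims=True)  # row-wise softmax
        out[h] = w @ v[kv]
    return out

n_q_heads, group_size, T, d = 8, 4, 16, 32
q = rng.standard_normal((n_q_heads, T, d))
k = rng.standard_normal((n_q_heads // group_size, T, d))  # only 2 KV heads
v = rng.standard_normal((n_q_heads // group_size, T, d))
print(gqa_attention(q, k, v, group_size).shape)
```

Sharing KV heads shrinks the number of distinct key/value projections, which is the explicit structural sparsity the table above varies via the group size G.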

Mixture of Experts

Explicit Sparsity

Figure: MoE at 400M scale

Figure: MoE at 1B scale

Model           PPL ↓   Usefulness
Dense-400M      17.56   0.87
MoE-2B/400M ✓   15.89   0.94
Dense-1B        14.52   0.81
MoE-7B/1B ✓     13.82   0.94

MoE outperforms its dense counterpart at both scales, with lower PPL (by up to 1.7 points) and a higher usefulness score
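A minimal sketch of the top-k expert routing that makes MoE layers sparse. Each expert is reduced to a single linear map and the sizes are hypothetical; production routers add load-balancing losses and capacity limits:

```python
import numpy as np

rng = np.random.default_rng(0)

def moe_layer(x, gate_w, expert_ws, top_k=2):
    """Top-k mixture-of-experts: each token activates only its top_k experts,
    whose outputs are mixed by gate scores renormalized over the chosen set."""
    logits = x @ gate_w                          # (tokens, n_experts)
    top = np.argsort(logits, axis=-1)[:, -top_k:]
    out = np.zeros_like(x)
    for t in range(len(x)):
        gates = np.exp(logits[t, top[t]])
        gates /= gates.sum()                     # renormalized gate scores
        for g, e in zip(gates, top[t]):
            out[t] += g * (x[t] @ expert_ws[e])  # expert = one linear map here
    return out

d, n_experts, tokens = 16, 8, 32
x = rng.standard_normal((tokens, d))
gate_w = rng.standard_normal((d, n_experts))
expert_ws = rng.standard_normal((n_experts, d, d)) / np.sqrt(d)
print(moe_layer(x, gate_w, expert_ws).shape)
```

Only top_k of the n_experts maps fire per token, so most expert parameters are inactive on any given input: the explicit activation sparsity the table above credits for the PPL gains.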

Ablation Study

Progressive Performance Gains

Progressive gains when scaling to L=32 with sparsity ingredients

4.6% accuracy improvement on downstream tasks with a 32-layer, 1.2B-parameter model, with superior depth utilization vs. the naive baseline.

Key Takeaways

Sparsity is not just for efficiency — it's a critical variance regulator that mitigates the curse of depth.

Both implicit & explicit sparsity work — weight decay, long context, GQA, and MoE all reduce variance.

Practical recipe yields 4.6% boost — a new perspective on training depth-effective LLMs.

Cite This Work

@misc{muhtar2026doessparsitymitigatecurse,
  title={When Does Sparsity Mitigate the Curse of Depth in LLMs},
  author={Dilxat Muhtar and Xinyuan Song and Sebastian Pokutta and Max Zimmer and Nico Pelleriti and Thomas Hofmann and Shiwei Liu},
  year={2026},
  eprint={2603.15389},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2603.15389},
}