Revealing sparsity as an intrinsic variance regulator that unlocks effective depth utilization
* Equal Contribution † Corresponding Author
1 Zuse Institute Berlin & Technical University of Berlin
2 ETH Zürich
3 Max Planck Institute for Intelligent Systems
4 ELLIS Institute Tübingen
5 Tübingen AI Center
6 Emory University
Deep LLMs suffer from the "curse of depth" — later layers are under-utilized due to variance explosion in Pre-LN architectures.
Sparsity — both implicit (weight decay, long context) and explicit (GQA, MoE) — regulates variance propagation.
Practical recipe achieves 4.6% accuracy improvement on downstream tasks with superior depth utilization.
The Challenge
Recent studies show that later Transformer layers in deep LLMs are frequently under-utilized. Many layers can be removed or permuted with minimal performance degradation.
Output variance explodes with depth in Pre-LN stacks
Deep layer Jacobians approach identity
Scaling to L=32 uses 2.56× the parameters yet effectively wastes 14 layers
Last-Layer Variance vs Depth
Jacobian Distance ||J-I||_F Shrinks with Depth
Layer Effectiveness Scores
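Both symptoms can be reproduced in a toy Pre-LN residual stack. The sketch below (random linear sublayers, no training; an illustrative simplification, not the paper's models) tracks the residual-stream variance and the cosine between consecutive layer states: as variance accumulates, each block's contribution becomes relatively negligible, so later layers barely change the stream.

```python
import numpy as np

rng = np.random.default_rng(0)
d, depth = 256, 64

def layernorm(x):
    # Zero-mean, unit-variance normalization (no learned scale/shift).
    return (x - x.mean()) / (x.std() + 1e-6)

x = rng.standard_normal(d)
variances, cosines = [], []
for _ in range(depth):
    W = rng.standard_normal((d, d)) / np.sqrt(d)   # variance-preserving branch
    x_new = x + W @ layernorm(x)                   # Pre-LN residual update
    cos = x @ x_new / (np.linalg.norm(x) * np.linalg.norm(x_new))
    variances.append(x_new.var())
    cosines.append(cos)
    x = x_new

# Variance keeps accumulating, while each layer changes the stream less
# and less (cosine between consecutive states drifts toward 1, i.e. the
# layer map drifts toward identity).
print(f"variance: {variances[0]:.1f} -> {variances[-1]:.1f}")
print(f"cos(x_l, x_l+1): {cosines[0]:.2f} -> {cosines[-1]:.2f}")
```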
The Foundation
Sparsity damps variance propagation in deep residual architectures: smaller mask density (higher sparsity) leads to slower variance growth with depth, mitigating the curse of depth.
Implicit sparsity: emerges dynamically during training (e.g., weight decay, long context)
Explicit sparsity: enforced by architectural design (e.g., GQA, MoE)
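The density effect can be seen in a minimal simulation. The sketch below assumes a masked residual update of the form x ← x + (M ⊙ W)·LN(x) with a Bernoulli keep-mask M; the parameterization is illustrative, not the paper's exact construction. Lower mask density directly slows the variance growth across depth.

```python
import numpy as np

def layernorm(x):
    # Zero-mean, unit-variance normalization (no learned scale/shift).
    return (x - x.mean()) / (x.std() + 1e-6)

def final_variance(density, d=256, depth=32, seed=0):
    """Residual-stream variance after `depth` masked residual blocks."""
    rng = np.random.default_rng(seed)
    x = rng.standard_normal(d)
    for _ in range(depth):
        W = rng.standard_normal((d, d)) / np.sqrt(d)  # unit-variance branch
        mask = rng.random((d, d)) < density           # Bernoulli keep-mask
        x = x + (mask * W) @ layernorm(x)             # sparse residual update
    return x.var()

# Roughly, each block adds ~density units of variance, so the final
# variance scales like 1 + depth * density.
for p in (1.0, 0.5, 0.1):
    print(f"mask density {p}: final variance {final_variance(p):.1f}")
```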
Empirical Validation
Implicit Sparsity
Variance vs Weight Decay
| Weight Decay (λ) | PPL ↓ | Usefulness |
|---|---|---|
| λ = 0 | 15.63 | 0.75 |
| λ = 0.01 | 15.20 | 0.75 |
| λ = 0.1 ✓ | 14.83 | 0.81 |
| λ = 1.0 | 15.55 | 0.69 |
| λ = 3.0 | 773.42 | 0.63 |
Optimal λ = 0.1 improves both PPL and usefulness score
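Weight decay does not zero weights outright, but it shrinks their magnitudes, which in this framing lowers the effective density of each residual branch. A toy linear-regression sketch (a hypothetical setup, not the paper's training runs) showing the shrinkage from an L2 decay term in the gradient:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 200, 50
X = rng.standard_normal((n, d))
w_true = np.zeros(d)
w_true[:5] = 1.0                                # only 5 informative features
y = X @ w_true + 0.5 * rng.standard_normal(n)

def train(weight_decay, steps=2000, lr=0.05):
    """Plain gradient descent on squared loss plus L2 weight decay."""
    w = 0.1 * rng.standard_normal(d)
    for _ in range(steps):
        grad = X.T @ (X @ w - y) / n + weight_decay * w  # decay pulls toward 0
        w -= lr * grad
    return w

w_plain, w_decayed = train(0.0), train(0.1)
# Decay shrinks the overall weight magnitude, concentrating energy in the
# informative coordinates.
print(np.linalg.norm(w_plain), np.linalg.norm(w_decayed))
```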
Implicit Sparsity
Variance vs Sequence Length
| Sequence Length (T) | PPL ↓ | Usefulness |
|---|---|---|
| T = 256 | 18.51 | 0.69 |
| T = 512 | 15.71 | 0.75 |
| T = 1024 | 14.77 | 0.75 |
| T = 2048 ✓ | 14.51 | 0.81 |
| T = 4096 | 14.52 | 0.81 |
| T = 8192 | 16.30 | 0.75 |
Sweet spot at T = 2048 with optimal perplexity and usefulness
Explicit Sparsity
GQA vs MHA Variance Comparison
| Group Size (G) | PPL ↓ | Usefulness |
|---|---|---|
| G = 1 (MHA) | 14.52 | 0.81 |
| G = 4 | 14.50 | 0.87 |
| G = 16 ✓ | 14.47 | 0.87 |
G=16 achieves best PPL with 7% improvement in usefulness score
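For reference, grouped-query attention shares each key/value head across a group of query heads, so larger G means fewer KV parameters (and a smaller KV cache) than MHA. A NumPy sketch with toy shapes (single layer, no learned projections, G = 4 here; an illustration of the mechanism, not the paper's implementation):

```python
import numpy as np

def gqa_attention(q, k, v):
    """Toy grouped-query attention: q has H heads, k/v have H_kv heads;
    each KV head is shared by H // H_kv consecutive query heads."""
    d_head = q.shape[-1]
    group = q.shape[0] // k.shape[0]
    k = np.repeat(k, group, axis=0)   # broadcast each KV head to its group
    v = np.repeat(v, group, axis=0)
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d_head)
    scores -= scores.max(axis=-1, keepdims=True)   # numerically stable softmax
    p = np.exp(scores)
    p /= p.sum(axis=-1, keepdims=True)
    return p @ v

rng = np.random.default_rng(0)
T, d_head, H, H_kv = 8, 16, 8, 2      # 4 query heads share each KV head
q = rng.standard_normal((H, T, d_head))
k = rng.standard_normal((H_kv, T, d_head))
v = rng.standard_normal((H_kv, T, d_head))
out = gqa_attention(q, k, v)
print(out.shape)   # one output per query head, KV stored only H_kv times
```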
Explicit Sparsity
Mixture-of-Experts vs Dense
400M Scale
1B Scale
| Model | PPL ↓ | Usefulness |
|---|---|---|
| Dense-400M | 17.56 | 0.87 |
| MoE-2B/400M ✓ | 15.89 | 0.94 |
| Dense-1B | 14.52 | 0.81 |
| MoE-7B/1B ✓ | 13.82 | 0.94 |
MoE outperforms dense counterparts by up to 1.7 PPL with higher usefulness
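The MoE notation presumably means total/active parameters (e.g., MoE-2B/400M: ~2B total, ~400M active per token); the sparsity comes from top-k routing, which executes only k of E expert branches per token. A minimal single-token sketch with hypothetical dimensions:

```python
import numpy as np

rng = np.random.default_rng(0)
d, E, k = 32, 8, 2   # hidden dim, number of experts, active experts per token

W_gate = rng.standard_normal((d, E)) / np.sqrt(d)
experts = [rng.standard_normal((d, d)) / np.sqrt(d) for _ in range(E)]

def moe(x):
    """Top-k MoE layer: route the token to its k highest-scoring experts."""
    logits = x @ W_gate
    top = np.argsort(logits)[-k:]            # indices of the k chosen experts
    gates = np.exp(logits[top] - logits[top].max())
    gates /= gates.sum()                     # renormalized gate weights
    # Only k of E expert matmuls are executed for this token.
    return sum(g * (x @ experts[i]) for g, i in zip(gates, top))

x = rng.standard_normal(d)
y = moe(x)
print(y.shape)
```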
Progressive gains when scaling to L=32 with sparsity ingredients
4.6%
Accuracy Improvement
32-layer, 1.2B-parameter model
Superior depth utilization vs. naive baseline
Sparsity is not just for efficiency — it's a critical variance regulator that mitigates the curse of depth.
Both implicit & explicit sparsity work — weight decay, long context, GQA, and MoE all reduce variance.
Practical recipe yields 4.6% boost — a new perspective on training depth-effective LLMs.
@misc{muhtar2026doessparsitymitigatecurse,
title={When Does Sparsity Mitigate the Curse of Depth in LLMs},
author={Dilxat Muhtar and Xinyuan Song and Sebastian Pokutta and Max Zimmer and Nico Pelleriti and Thomas Hofmann and Shiwei Liu},
year={2026},
eprint={2603.15389},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2603.15389},
}