Data Repetition Beats Data Scaling in Long-CoT Supervised Fine-Tuning
Abstract
Training reasoning language models with repeated examples on smaller datasets yields better performance than single-pass training on larger datasets, with token accuracy serving as a reliable indicator for optimal training duration.
Supervised fine-tuning (SFT) on chain-of-thought data is an essential post-training step for reasoning language models. Standard machine learning intuition suggests that training on more unique samples yields better generalization. Counterintuitively, we show that SFT benefits from repetition: under a fixed update budget, training for more epochs on smaller datasets outperforms single-epoch training on larger datasets. On the AIME'24/25 and GPQA benchmarks, Olmo3-7B trained for 128 epochs on 400 samples outperforms equivalent single-epoch training on 51,200 samples by 12-26 percentage points, with no additional catastrophic forgetting. We find that training token accuracy reliably signals when repetition has saturated: improvements from additional epochs plateau at full memorization, a pattern consistent across all settings. These findings provide a practical recipe for reasoning SFT, where scaling epochs with token accuracy as a stopping criterion can replace expensive, undirected data scaling. We pose the repetition advantage, where full memorization coincides with improved generalization, as a new open problem for the community in understanding the training dynamics of large language models.
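To make the stopping criterion concrete, here is a minimal sketch of how training token accuracy could be measured and used to decide when repetition has saturated. This is not the paper's implementation: the function names, the PyTorch setup, and the saturation threshold are illustrative assumptions; it assumes labels are already shifted for next-token prediction, with prompt tokens masked out via the usual `-100` ignore index.

```python
import torch

def token_accuracy(logits: torch.Tensor, labels: torch.Tensor,
                   ignore_index: int = -100) -> float:
    """Fraction of supervised tokens predicted correctly under teacher forcing.

    logits: (batch, seq_len, vocab_size); labels: (batch, seq_len).
    Positions equal to ignore_index (e.g. prompt tokens) are excluded,
    so accuracy is computed only over the chain-of-thought targets.
    """
    preds = logits.argmax(dim=-1)
    mask = labels != ignore_index
    correct = (preds == labels) & mask
    return correct.sum().item() / mask.sum().item()

# Illustrative stopping rule (threshold is an assumption, not from the paper):
# keep adding epochs until training token accuracy plateaus near full
# memorization, then stop, since further epochs no longer help.
SATURATION_THRESHOLD = 0.995

def should_stop(epoch_token_accuracies: list[float]) -> bool:
    return bool(epoch_token_accuracies) and \
           epoch_token_accuracies[-1] >= SATURATION_THRESHOLD
```

In this reading, token accuracy acts as a cheap, training-time proxy: once the model reproduces essentially all training tokens, the repetition budget is spent and extra epochs are wasted.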
Community
Pretty interesting findings!
In long-CoT SFT, models achieve better performance from multiple epochs on smaller datasets than from single-epoch training on larger datasets, even under the same update/compute budget.
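A quick sanity check of the fixed-budget equivalence mentioned here: with a fixed batch size, the number of gradient updates is epochs × dataset size / batch size, so the paper's 128 epochs on 400 samples consumes exactly the same update budget as one epoch on 51,200 samples. A minimal sketch (the batch size is an illustrative assumption, not stated in the abstract):

```python
# Fixed update budget: steps = epochs * dataset_size // batch_size.
batch_size = 32  # illustrative value

steps_repeat = 128 * 400 // batch_size    # 128 epochs on 400 samples
steps_single = 1 * 51_200 // batch_size   # 1 epoch on 51,200 samples
assert steps_repeat == steps_single       # both: 1600 optimizer steps
```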
Data repetition vs scaling: fascinating trade-offs! 🔥 Looking forward to seeing how this applies to other long-context scenarios.
This is an automated message from the Librarian Bot. The following papers, similar to this one, were recommended by the Semantic Scholar API:
- Learning from Mistakes: Negative Reasoning Samples Enhance Out-of-Domain Generalization (2026)
- Rethinking Supervised Fine-Tuning: Emphasizing Key Answer Tokens for Improved LLM Accuracy (2025)
- Beyond Rejection Sampling: Trajectory Fusion for Scaling Mathematical Reasoning (2026)
- Correct, Concise and Complete: Multi-stage Training For Adaptive Reasoning (2026)
- Entropy-Guided Token Dropout: Training Autoregressive Language Models with Limited Domain Data (2025)
- ConPress: Learning Efficient Reasoning from Multi-Question Contextual Pressure (2026)
- AIR: Post-training Data Selection for Reasoning via Attention Head Influence (2025)