# Chess-SFT-6k — Global Chess Challenge 2025

## Model Summary
Chess-SFT-6k is a small, text-only chess-playing language model fine-tuned via Supervised Fine-Tuning (SFT) to select legal and reasonable chess moves from symbolic board representations.
The model is designed for participation in the Global Chess Challenge 2025, where models must choose a legal move from a provided list without access to search, tools, or external engines at inference time.
This checkpoint represents an early-stopped SFT baseline, optimized to reduce illegal moves and catastrophic blunders, and intended as a foundation for reinforcement learning with verifiable rewards (GRPO).
## Model Details
- Developed by: Ritwika Kancharla
- Model type: Decoder-only causal language model
- Base model: Qwen/Qwen3-0.6B
- Language: English
- License: MIT
- Finetuned from: Qwen/Qwen3-0.6B
- Competition: Global Chess Challenge 2025 (AIcrowd × AGI House)
## Intended Use

### Direct Use

- Selecting a single legal chess move (UCI format) from:
  - FEN position
  - Side to move
  - List of legal moves
- Text-only inference without any tools, engines, or search
- Compatible with the Global Chess Challenge starter kit and evaluation pipeline
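The input/output contract above can be sketched in a few lines of plain Python. The prompt template and the fallback-to-first-legal-move rule below are illustrative assumptions, not the competition's official starter-kit code.

```python
def build_prompt(fen: str, legal_moves: list[str]) -> str:
    """Assemble a text-only prompt from a FEN string and a legal-move list.
    The template is illustrative; the actual starter kit may differ."""
    side = "White" if fen.split()[1] == "w" else "Black"
    return (
        f"Position (FEN): {fen}\n"
        f"Side to move: {side}\n"
        f"Legal moves (UCI): {', '.join(legal_moves)}\n"
        "Answer with the UCI string of one legal move only."
    )

def enforce_legal(model_output: str, legal_moves: list[str]) -> str:
    """Keep the model's move if it is in the legal list; otherwise fall
    back to the first legal move (a simple stand-in for the environment's
    legality enforcement)."""
    text = model_output.strip()
    move = text.split()[0] if text else ""
    return move if move in legal_moves else legal_moves[0]

start_fen = "rnbqkbnr/pppppppp/8/8/8/8/PPPPPPPP/RNBQKBNR w KQkq - 0 1"
moves = ["e2e4", "d2d4", "g1f3"]  # truncated legal-move list for illustration
print(build_prompt(start_fen, moves))
print(enforce_legal("e2e4", moves))  # legal, kept: e2e4
print(enforce_legal("e2e5", moves))  # illegal, falls back to e2e4
```

In a real game loop the full legal-move list would come from the environment (e.g. python-chess), and the model's raw completion would be post-processed exactly like this before being submitted.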
### Downstream Use
- Research on reasoning and decision-making in small language models
- Experiments with curriculum learning and reinforcement learning (GRPO / RLVR)
- Educational or analytical chess assistants (non-engine-based)
### Out-of-Scope Use
- Replacement for classical chess engines
- Deep tactical calculation or forced-mate search
- Real-money, rated, or professional chess play
- Any inference-time use of Stockfish, search, or external tools
## Training Details

### Training Data
- Dataset: `aicrowd/ChessExplained`
- Dataset file: `ChessExplained_2500k_qwen3.parquet`
- Dataset size: 2.5M positions (1.04 GB)
- Content:
  - Symbolic chess positions
  - Legal move lists
  - Natural language explanations
- Stockfish was used only offline for data generation and evaluation.
- No external tools or engines are used at inference time.
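Turning one dataset record into a supervised training example can be sketched as below. The field names (`fen`, `legal_moves`, `best_move`, `explanation`) and the prompt/completion layout are assumptions for illustration; the real parquet schema may differ.

```python
def to_sft_example(record: dict) -> dict:
    """Convert one (position, move, explanation) record into a
    prompt/completion pair for SFT. Field names are assumed, not
    taken from the actual ChessExplained schema."""
    prompt = (
        f"Position (FEN): {record['fen']}\n"
        f"Legal moves (UCI): {', '.join(record['legal_moves'])}\n"
        "Choose the best legal move."
    )
    completion = f"{record['explanation']} Move: {record['best_move']}"
    return {"prompt": prompt, "completion": completion}

rec = {
    "fen": "rnbqkbnr/pppppppp/8/8/8/8/PPPPPPPP/RNBQKBNR w KQkq - 0 1",
    "legal_moves": ["e2e4", "d2d4"],
    "best_move": "e2e4",
    "explanation": "Controls the center and opens lines for development.",
}
example = to_sft_example(rec)
print(example["prompt"])
print(example["completion"])  # ends with "Move: e2e4"
```

The completion carries both the natural-language explanation and the final UCI move, matching the dataset's combination of symbolic positions and explanations.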
### Training Procedure
- Method: Supervised Fine-Tuning (SFT)
- Objective: Next-token prediction (Negative Log-Likelihood)
- Precision: bf16 mixed precision
- Optimizer: AdamW
- Epochs: 1
- Total steps: ~10,100
- Checkpoint selection: Early stopping based on evaluation metrics
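The next-token NLL objective listed above reduces to a short formula. The sketch below uses made-up per-token probabilities purely to show the quantity being minimized; it is not the actual training loop.

```python
import math

def next_token_nll(token_probs: list[float]) -> float:
    """Mean negative log-likelihood of the gold next tokens: the SFT loss.
    `token_probs` holds the probability the model assigned to each
    correct next token in the sequence."""
    return -sum(math.log(p) for p in token_probs) / len(token_probs)

# Toy three-token sequence with illustrative model probabilities.
probs = [0.9, 0.6, 0.75]
print(round(next_token_nll(probs), 4))  # 0.3013
```

A perfectly confident model (probability 1.0 on every gold token) drives this loss to zero, which is why training loss alone keeps falling even after chess-specific metrics plateau.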
### Training Loss Progression (Selected)
| Step | Training Loss |
|---|---|
| 100 | 5.1071 |
| 500 | 0.5045 |
| 1,000 | 0.3913 |
| 2,000 | 0.3310 |
| 3,000 | 0.2685 |
| 4,000 | 0.2374 |
| 5,000 | 0.2245 |
| 6,000 | 0.2127 |
| 10,000 | 0.1962 |
Although training loss continued to decrease after 6k steps, chess performance began to regress, motivating early stopping.
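The selection rule implied here, pick the checkpoint by a held-out chess metric rather than by training loss, can be sketched as follows. The composite score (mean ACPL across both opponents, with values taken from the results table in the Evaluation section) is an illustrative assumption; the actual selection also weighed puzzle success and illegal-move counts.

```python
def select_checkpoint(acpl_by_step: dict[int, tuple[float, float]]) -> int:
    """Return the checkpoint step minimizing mean ACPL across both
    opponents. Training loss is deliberately ignored: lower ACPL, not
    lower loss, is the selection criterion."""
    return min(acpl_by_step, key=lambda step: sum(acpl_by_step[step]) / 2)

# (ACPL vs random, ACPL vs Stockfish) per checkpoint, from the results table.
acpl = {
    500: (600.0, 410.6),
    3000: (420.3, 126.5),
    6000: (50.7, 90.9),
    9000: (144.9, 85.5),
}
print(select_checkpoint(acpl))  # 6000
```

Under this score, step 9,000's better ACPL vs Stockfish is outweighed by its regression against the random opponent, so step 6,000 is selected, matching the released checkpoint.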
## Evaluation

### Evaluation Setup
- Evaluation performed using the official Global Chess Challenge baseline
- Legal move enforcement handled by the environment
- Move quality evaluated using Stockfish (depth 20)
- Metrics computed over full games
### Metrics
- Average Centipawn Loss (ACPL)
- Win / draw / loss rates
- Illegal move rate
- Puzzle success rate
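ACPL has a standard definition: the mean drop, in centipawns, from the engine's best available evaluation to the evaluation after the move actually played, floored at zero so that finding the best move costs nothing. A minimal sketch with toy numbers:

```python
def acpl(best_evals: list[int], played_evals: list[int]) -> float:
    """Average centipawn loss over a game, from the mover's perspective.
    Each per-move loss is the evaluation drop caused by the played move
    relative to the engine's best move, clamped at 0."""
    losses = [max(0, best - played)
              for best, played in zip(best_evals, played_evals)]
    return sum(losses) / len(losses)

# Toy game of three moves (centipawn evaluations, mover's point of view):
# move 1 is best (loss 0), moves 2 and 3 lose 80 and 70 centipawns.
print(acpl([30, 120, -20], [30, 40, -90]))  # 50.0
```

In the competition setup these evaluations come from Stockfish at depth 20; the function above only illustrates the arithmetic behind the metric.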
### Results Across Checkpoints
| Checkpoint | Positions Trained | Puzzle Success (%) | Illegal Moves (vs Stockfish) | Avg ACPL vs Random | Avg ACPL vs Stockfish |
|---|---|---|---|---|---|
| 500 | 4,000 | 0.0 | 50 | 600.0 | 410.6 |
| 3,000 | 24,000 | 5.0 | 40 | 420.3 | 126.5 |
| 6,000 | 48,000 | 25.0 | 7 | 50.7 | 90.9 |
| 9,000 | 72,000 | 24.0 | 2 | 144.9 | 85.5 |
### Evaluation Summary
- Early training significantly reduced illegal moves and catastrophic blunders.
- ACPL vs Stockfish improved sharply up to ~6,000 steps.
- Continued SFT beyond this point led to regression despite lower training loss.
- Checkpoint 6,000 provided the best trade-off between stability and chess strength.
## Bias, Risks, and Limitations
- The model does not perform search and may miss deep tactical combinations.
- Performance depends on patterns learned from supervised data.
- Natural language explanations are not guaranteed to reflect optimal chess reasoning.
- Like all chess-playing LLMs, the model may struggle in rare or highly tactical positions.
## Recommendations
This model should be treated as a research artifact rather than a competitive chess engine.
Best performance is expected when combined with curriculum learning and reinforcement learning fine-tuning.
## Technical Specifications

### Architecture
- Decoder-only Transformer
- Autoregressive next-token prediction
- Chat template and tokenizer inherited from Qwen3
### Compute Infrastructure
- Training hardware: NVIDIA H100 (Kaggle)
- Evaluation: CPU/GPU
- Frameworks: PyTorch, Hugging Face Transformers, vLLM
- Chess environment: python-chess (evaluation only)
## Environmental Impact
- Cloud provider: Kaggle
- Hardware: NVIDIA H100
- Training duration: A few hours
- Carbon emissions: Not formally estimated
## Code and Reproducibility

- Training & evaluation codebase: https://github.com/AIcrowd/Global-Chess-Challenge-2025-Baselines
- Key scripts: `train.py`, `run_evaluation.py`
## Citation

If you use this model, please cite the base model and the competition:

- Base model: Qwen/Qwen3-0.6B
- Competition: Global Chess Challenge 2025 (AIcrowd & AGI House)
## Model Card Authors

Ritwika Kancharla
## Contact

Via Hugging Face profile