# LFM2-350M-GRPO-NuminaMath-10K
This model is a fine-tuned version of LiquidAI/LFM2-350M, trained on the AI-MO/NuminaMath-CoT dataset with Group Relative Policy Optimization (GRPO), an online reinforcement-learning method.
## Overview
LFM2-350M-GRPO-NuminaMath-10K is optimized for mathematical reasoning tasks. It uses GRPO to learn from reward signals based on answer correctness and format adherence, enabling it to generate more accurate step-by-step solutions.
## Key Features
- Reinforcement Learning: Trained with GRPO for improved reasoning capabilities
- Math Focus: Optimized on 10,000 math problems from NuminaMath-CoT
- Multi-Sample Learning: Uses 2 generations per prompt for robust training
- Combined Reward: Evaluates both answer accuracy and output format
## Model Details
| Property | Value |
|---|---|
| Developed by | ermiaazarkhalili |
| License | CC-BY-NC-4.0 |
| Language | English |
| Base Model | LiquidAI/LFM2-350M |
| Model Size | ~350M parameters (per the base model) |
| Tensor Type | BF16 |
| Context Length | 2,048 tokens |
| Training Method | GRPO with LoRA |
## Training Information
### GRPO Configuration
| Parameter | Value |
|---|---|
| Learning Rate | 5e-07 |
| Batch Size | 1 per device |
| Gradient Accumulation Steps | 16 |
| Num Generations | 2 |
| Reward Type | combined |
| Max Prompt Length | 1024 |
| Max Completion Length | 2048 |
| Temperature | 0.7 |
| Beta (KL penalty) | 0.04 |
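These hyperparameters map directly onto TRL's `GRPOConfig`; a sketch of how the run might have been configured (the output directory and any argument not shown in the table above are assumptions, not the author's exact setup):

```python
from trl import GRPOConfig

# Mirrors the GRPO hyperparameters from the table above.
training_args = GRPOConfig(
    output_dir="lfm2-350m-grpo-numinamath",  # hypothetical path
    learning_rate=5e-7,
    per_device_train_batch_size=1,
    gradient_accumulation_steps=16,
    num_generations=2,           # completions sampled per prompt for group-relative advantages
    max_prompt_length=1024,
    max_completion_length=2048,
    temperature=0.7,
    beta=0.04,                   # KL penalty toward the reference policy
)
```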
### LoRA Configuration
| Parameter | Value |
|---|---|
| LoRA Rank (r) | 16 |
| LoRA Alpha | 32 |
| LoRA Dropout | 0.05 |
| Target Modules | q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj |
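Expressed with the `peft` library, these settings correspond to the following adapter config (a sketch; `task_type` is an assumption consistent with causal-LM fine-tuning):

```python
from peft import LoraConfig

# Mirrors the LoRA settings from the table above.
peft_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=[
        "q_proj", "k_proj", "v_proj", "o_proj",
        "gate_proj", "up_proj", "down_proj",
    ],
    task_type="CAUSAL_LM",
)
```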
### Training Metrics
| Metric | Value |
|---|---|
| Final Policy Loss | -0.0431 |
| Training Time | 10h 44m |
### Reward Function
The model uses a combined reward function:
- Math Accuracy: Extracts and validates final numerical answers
- Format Compliance: Checks for proper step-by-step reasoning format
- Combined Score: Weighted combination of accuracy and format rewards
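The exact reward implementation is not published; below is a minimal sketch of what such a combined reward could look like. The extraction regex, the line-count format heuristic, and the 0.8/0.2 weights are all illustrative assumptions:

```python
import re

def math_accuracy_reward(completion: str, answer: str) -> float:
    """1.0 if the last number in the completion matches the reference answer."""
    nums = re.findall(r"-?\d+(?:\.\d+)?", completion)
    if not nums:
        return 0.0
    return 1.0 if nums[-1] == answer else 0.0

def format_reward(completion: str) -> float:
    """Crude check for step-by-step structure: at least three lines of output."""
    return 1.0 if completion.count("\n") >= 2 else 0.0

def combined_reward(completion: str, answer: str,
                    w_acc: float = 0.8, w_fmt: float = 0.2) -> float:
    """Weighted combination of accuracy and format rewards."""
    return w_acc * math_accuracy_reward(completion, answer) + w_fmt * format_reward(completion)
```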
### Training Hardware
- GPU: NVIDIA H100 40GB MIG (3g.40gb)
- CPU: 8 vCPUs
- Memory: 64GB
- Platform: Compute Canada (Fir Cluster)
## Dataset
This model was trained on the AI-MO/NuminaMath-CoT dataset:
| Property | Value |
|---|---|
| Training Samples | 10,000 |
| Format | Chain-of-Thought reasoning |
| Topics | Math (algebra, geometry, calculus, etc.) |
NuminaMath-CoT provides step-by-step mathematical solutions, enabling the model to learn structured reasoning patterns.
## Usage
### Quick Start
```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model_id = "ermiaazarkhalili/LFM2-350M-GRPO-NuminaMath-10K"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,  # the model is stored in BF16
    device_map="auto",
)

# Math problem
prompt = "Solve step by step: If a train travels 120 km in 2 hours, what is its average speed?"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

outputs = model.generate(
    **inputs,
    max_new_tokens=512,
    temperature=0.7,  # matches the sampling temperature used during training
    do_sample=True,
)
response = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(response)
```
## Limitations
- Domain Specific: Optimized for math; may not generalize to other reasoning tasks
- Language: English only
- Hallucinations: May produce incorrect calculations despite correct format
- Verification Needed: Always verify mathematical results independently
## Intended Use
### Recommended Uses
- Mathematical problem solving
- Step-by-step reasoning demonstrations
- Educational math tutoring applications
- Research on RL-trained language models
### Out-of-Scope Uses
- Critical calculations requiring absolute accuracy
- Non-mathematical reasoning tasks
- Production systems without verification
## Citation
```bibtex
@misc{ermiaazarkhalili_lfm2_350m_grpo_numinamath_10k,
  author       = {Ermia Azarkhalili},
  title        = {LFM2-350M-GRPO-NuminaMath-10K: GRPO-trained LFM2-350M for Math},
  year         = {2025},
  publisher    = {Hugging Face},
  howpublished = {\url{https://huggingface.co/ermiaazarkhalili/LFM2-350M-GRPO-NuminaMath-10K}}
}
```
## Acknowledgments
- LiquidAI for the LFM2 base model
- Hugging Face TRL Team for the GRPO implementation
- NuminaMath dataset creators
- Compute Canada for providing HPC resources