LFM2-350M-GRPO-NuminaMath-10K

This model is a fine-tuned version of LiquidAI/LFM2-350M, trained on the AI-MO/NuminaMath-CoT dataset with Group Relative Policy Optimization (GRPO), an online reinforcement learning method.

Overview

LFM2-350M-GRPO-NuminaMath-10K is optimized for mathematical reasoning tasks. It uses GRPO to learn from reward signals based on answer correctness and format adherence, enabling it to generate more accurate step-by-step solutions.

Key Features

  • Reinforcement Learning: Trained with GRPO for improved reasoning capabilities
  • Math Focus: Optimized on 10,000 math problems from NuminaMath-CoT
  • Multi-Sample Learning: Uses 2 generations per prompt for robust training
  • Combined Reward: Evaluates both answer accuracy and output format
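
The "multi-sample" point above is the heart of GRPO: rewards for each group of generations sampled from the same prompt are normalized against that group's own mean and standard deviation, so no separate value network is needed. A minimal illustrative sketch (not the actual training code):

```python
def group_relative_advantages(rewards):
    """Normalize each reward against its group's mean and std (GRPO-style)."""
    n = len(rewards)
    mean = sum(rewards) / n
    var = sum((r - mean) ** 2 for r in rewards) / n
    std = var ** 0.5
    if std == 0:
        # All generations scored the same: no learning signal for this group.
        return [0.0] * n
    return [(r - mean) / std for r in rewards]

# With num_generations = 2, each prompt yields a group of 2 rewards:
advantages = group_relative_advantages([1.0, 0.0])  # → [1.0, -1.0]
```

With only 2 generations per prompt, the advantage simply tells the policy "this completion beat its sibling" (or tied).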

Model Details

| Property | Value |
|---|---|
| Developed by | ermiaazarkhalili |
| License | CC-BY-NC-4.0 |
| Language | English |
| Base Model | LiquidAI/LFM2-350M |
| Model Size | ~350M parameters (per the base model) |
| Tensor Type | BF16 |
| Context Length | 2,048 tokens |
| Training Method | GRPO with LoRA |

Training Information

GRPO Configuration

| Parameter | Value |
|---|---|
| Learning Rate | 5e-07 |
| Batch Size | 1 per device |
| Gradient Accumulation Steps | 16 |
| Num Generations | 2 |
| Reward Type | combined |
| Max Prompt Length | 1024 |
| Max Completion Length | 2048 |
| Temperature | 0.7 |
| Beta (KL Penalty) | 0.04 |
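
These hyperparameters map directly onto TRL's `GRPOConfig`. The sketch below is a hedged reconstruction, not the published training script; field names follow recent trl releases, so verify them against your installed version. With batch size 1 and 16 accumulation steps, each optimizer update sees roughly 16 sampled completions (8 prompt groups of 2), assuming single-device training.

```python
from trl import GRPOConfig

# Hedged sketch: hyperparameters from the table above expressed as a GRPOConfig.
config = GRPOConfig(
    learning_rate=5e-7,
    per_device_train_batch_size=1,
    gradient_accumulation_steps=16,
    num_generations=2,           # group size for group-relative advantages
    max_prompt_length=1024,
    max_completion_length=2048,
    temperature=0.7,             # sampling temperature during rollouts
    beta=0.04,                   # KL penalty against the reference policy
    bf16=True,
)
```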

LoRA Configuration

| Parameter | Value |
|---|---|
| LoRA Rank (r) | 16 |
| LoRA Alpha | 32 |
| LoRA Dropout | 0.05 |
| Target Modules | q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj |
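
The same settings expressed with PEFT's `LoraConfig` (a hedged sketch of how the adapter was likely configured, not the verified training code):

```python
from peft import LoraConfig

lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    # Adapters on all attention and MLP projections listed above:
    target_modules=[
        "q_proj", "k_proj", "v_proj", "o_proj",
        "gate_proj", "up_proj", "down_proj",
    ],
    task_type="CAUSAL_LM",
)
```

Targeting both the attention and MLP projections is a common choice when the goal is reasoning behavior rather than style transfer.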

Training Metrics

| Metric | Value |
|---|---|
| Final Policy Loss | -0.0431 |
| Training Time | 10h 44m |

Reward Function

The model uses a combined reward function:

  1. Math Accuracy: Extracts and validates final numerical answers
  2. Format Compliance: Checks for proper step-by-step reasoning format
  3. Combined Score: Weighted combination of accuracy and format rewards
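
The exact reward implementation is not published with this card; the sketch below illustrates the described combination (answer extraction, format check, weighted sum) with hypothetical heuristics and weights:

```python
import re

def math_accuracy_reward(completion: str, gold_answer: str) -> float:
    """1.0 if the last number in the completion matches the gold answer."""
    numbers = re.findall(r"-?\d+(?:\.\d+)?", completion)
    return 1.0 if numbers and numbers[-1] == gold_answer else 0.0

def format_reward(completion: str) -> float:
    """Score step-by-step structure (hypothetical heuristic)."""
    has_steps = bool(re.search(r"(?i)step\s*\d", completion))
    has_answer = "answer" in completion.lower()
    return 0.5 * has_steps + 0.5 * has_answer

def combined_reward(completion: str, gold_answer: str,
                    w_acc: float = 0.8, w_fmt: float = 0.2) -> float:
    """Weighted combination of accuracy and format rewards (weights assumed)."""
    return (w_acc * math_accuracy_reward(completion, gold_answer)
            + w_fmt * format_reward(completion))

text = "Step 1: speed = 120 / 2. Answer: 60"
score = combined_reward(text, "60")  # full marks: correct answer + format
```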

Training Hardware

  • GPU: NVIDIA H100 MIG slice (3g.40gb, 40GB)
  • CPU: 8 vCPUs
  • Memory: 64GB
  • Platform: Compute Canada (Fir Cluster)

Dataset

This model was trained on the AI-MO/NuminaMath-CoT dataset:

| Property | Value |
|---|---|
| Training Samples | 10,000 |
| Format | Chain-of-Thought reasoning |
| Topics | Math (algebra, geometry, calculus, etc.) |

NuminaMath-CoT provides step-by-step mathematical solutions, enabling the model to learn structured reasoning patterns.

Usage

Quick Start

from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model_id = "ermiaazarkhalili/LFM2-350M-GRPO-NuminaMath-10K"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto"
)

# Math problem
prompt = "Solve step by step: If a train travels 120 km in 2 hours, what is its average speed?"

inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(
    **inputs,
    max_new_tokens=512,
    temperature=0.7,
    do_sample=True
)

response = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(response)

Limitations

  • Domain Specific: Optimized for math; may not generalize to other reasoning tasks
  • Language: English only
  • Hallucinations: May produce incorrect calculations despite correct format
  • Verification Needed: Always verify mathematical results independently

Intended Use

Recommended Uses

  • Mathematical problem solving
  • Step-by-step reasoning demonstrations
  • Educational math tutoring applications
  • Research on RL-trained language models

Out-of-Scope Uses

  • Critical calculations requiring absolute accuracy
  • Non-mathematical reasoning tasks
  • Production systems without verification

Citation

@misc{ermiaazarkhalili_lfm2_350m_grpo_numinamath_10k,
    author = {Ermia Azarkhalili},
    title = {LFM2-350M-GRPO-NuminaMath-10K: GRPO-trained LFM2-350M for Math},
    year = {2025},
    publisher = {Hugging Face},
    howpublished = {\url{https://huggingface.co/ermiaazarkhalili/LFM2-350M-GRPO-NuminaMath-10K}}
}
