LFM2-350M-GRPO-NuminaMath-10K

This model is a fine-tuned version of LiquidAI/LFM2-350M, trained on the AI-MO/NuminaMath-CoT dataset with Group Relative Policy Optimization (GRPO), an online reinforcement learning method.

Overview

LFM2-350M-GRPO-NuminaMath-10K is optimized for mathematical reasoning tasks. It uses GRPO to learn from reward signals based on answer correctness and format adherence, enabling it to generate more accurate step-by-step solutions.

Key Features

  • Reinforcement Learning: Trained with GRPO for improved reasoning capabilities
  • Math Focus: Optimized on 10,000 math problems from NuminaMath-CoT
  • Multi-Sample Learning: Uses 2 generations per prompt for robust training
  • Combined Reward: Evaluates both answer accuracy and output format
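
The "multi-sample" point above is the heart of GRPO: rewards for each group of generations sampled from the same prompt are normalized against that group's own mean and standard deviation, so no separate value network is needed. A minimal illustrative sketch (not the actual training code):

```python
def group_relative_advantages(rewards):
    """Normalize each reward against its group's mean and std (GRPO-style)."""
    n = len(rewards)
    mean = sum(rewards) / n
    var = sum((r - mean) ** 2 for r in rewards) / n
    std = var ** 0.5
    if std == 0:
        # All generations scored the same: no learning signal for this group.
        return [0.0] * n
    return [(r - mean) / std for r in rewards]

# With num_generations = 2, each prompt yields a group of 2 rewards:
advantages = group_relative_advantages([1.0, 0.0])  # → [1.0, -1.0]
```

With only 2 generations per prompt, the advantage simply tells the policy "this completion beat its sibling" (or tied).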

Model Details

| Property | Value |
|---|---|
| Developed by | ermiaazarkhalili |
| License | CC-BY-NC-4.0 |
| Language | English |
| Base Model | LiquidAI/LFM2-350M |
| Model Size | ~350M parameters (per the base model) |
| Tensor Type | BF16 |
| Context Length | 2,048 tokens |
| Training Method | GRPO with LoRA |

Training Information

GRPO Configuration

| Parameter | Value |
|---|---|
| Learning Rate | 5e-07 |
| Batch Size | 1 per device |
| Gradient Accumulation Steps | 16 |
| Num Generations | 2 |
| Reward Type | combined |
| Max Prompt Length | 1024 |
| Max Completion Length | 2048 |
| Temperature | 0.7 |
| Beta (KL Penalty) | 0.04 |
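
These hyperparameters map directly onto TRL's `GRPOConfig`. The sketch below is a hedged reconstruction, not the published training script; field names follow recent trl releases, so verify them against your installed version. With batch size 1 and 16 accumulation steps, each optimizer update sees roughly 16 sampled completions (8 prompt groups of 2), assuming single-device training.

```python
from trl import GRPOConfig

# Hedged sketch: hyperparameters from the table above expressed as a GRPOConfig.
config = GRPOConfig(
    learning_rate=5e-7,
    per_device_train_batch_size=1,
    gradient_accumulation_steps=16,
    num_generations=2,           # group size for group-relative advantages
    max_prompt_length=1024,
    max_completion_length=2048,
    temperature=0.7,             # sampling temperature during rollouts
    beta=0.04,                   # KL penalty against the reference policy
    bf16=True,
)
```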

LoRA Configuration

| Parameter | Value |
|---|---|
| LoRA Rank (r) | 16 |
| LoRA Alpha | 32 |
| LoRA Dropout | 0.05 |
| Target Modules | q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj |
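
The same settings expressed with PEFT's `LoraConfig` (a hedged sketch of how the adapter was likely configured, not the verified training code):

```python
from peft import LoraConfig

lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    # Adapters on all attention and MLP projections listed above:
    target_modules=[
        "q_proj", "k_proj", "v_proj", "o_proj",
        "gate_proj", "up_proj", "down_proj",
    ],
    task_type="CAUSAL_LM",
)
```

Targeting both the attention and MLP projections is a common choice when the goal is reasoning behavior rather than style transfer.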

Training Metrics

| Metric | Value |
|---|---|
| Final Policy Loss | -0.0431 |
| Training Time | 10h 44m |

Reward Function

The model uses a combined reward function:

  1. Math Accuracy: Extracts and validates final numerical answers
  2. Format Compliance: Checks for proper step-by-step reasoning format
  3. Combined Score: Weighted combination of accuracy and format rewards
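
The exact reward implementation is not published with this card; the sketch below illustrates the described combination (answer extraction, format check, weighted sum) with hypothetical heuristics and weights:

```python
import re

def math_accuracy_reward(completion: str, gold_answer: str) -> float:
    """1.0 if the last number in the completion matches the gold answer."""
    numbers = re.findall(r"-?\d+(?:\.\d+)?", completion)
    return 1.0 if numbers and numbers[-1] == gold_answer else 0.0

def format_reward(completion: str) -> float:
    """Score step-by-step structure (hypothetical heuristic)."""
    has_steps = bool(re.search(r"(?i)step\s*\d", completion))
    has_answer = "answer" in completion.lower()
    return 0.5 * has_steps + 0.5 * has_answer

def combined_reward(completion: str, gold_answer: str,
                    w_acc: float = 0.8, w_fmt: float = 0.2) -> float:
    """Weighted combination of accuracy and format rewards (weights assumed)."""
    return (w_acc * math_accuracy_reward(completion, gold_answer)
            + w_fmt * format_reward(completion))

text = "Step 1: speed = 120 / 2. Answer: 60"
score = combined_reward(text, "60")  # full marks: correct answer + format
```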

Training Hardware

  • GPU: NVIDIA H100 MIG slice (3g.40gb, 40GB)
  • CPU: 8 vCPUs
  • Memory: 64GB
  • Platform: Compute Canada (Fir Cluster)

Dataset

This model was trained on the AI-MO/NuminaMath-CoT dataset:

| Property | Value |
|---|---|
| Training Samples | 10,000 |
| Format | Chain-of-Thought reasoning |
| Topics | Math (algebra, geometry, calculus, etc.) |

NuminaMath-CoT provides step-by-step mathematical solutions, enabling the model to learn structured reasoning patterns.

Usage

Quick Start

from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model_id = "ermiaazarkhalili/LFM2-350M-GRPO-NuminaMath-10K"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto"
)

# Math problem
prompt = "Solve step by step: If a train travels 120 km in 2 hours, what is its average speed?"

inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(
    **inputs,
    max_new_tokens=512,
    temperature=0.7,
    do_sample=True
)

response = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(response)

Limitations

  • Domain Specific: Optimized for math; may not generalize to other reasoning tasks
  • Language: English only
  • Hallucinations: May produce incorrect calculations despite correct format
  • Verification Needed: Always verify mathematical results independently

Intended Use

Recommended Uses

  • Mathematical problem solving
  • Step-by-step reasoning demonstrations
  • Educational math tutoring applications
  • Research on RL-trained language models

Out-of-Scope Uses

  • Critical calculations requiring absolute accuracy
  • Non-mathematical reasoning tasks
  • Production systems without verification

Citation

@misc{ermiaazarkhalili_lfm2_350m_grpo_numinamath_10k,
    author = {Ermia Azarkhalili},
    title = {LFM2-350M-GRPO-NuminaMath-10K: GRPO-trained LFM2-350M for Math},
    year = {2025},
    publisher = {Hugging Face},
    howpublished = {\url{https://huggingface.co/ermiaazarkhalili/LFM2-350M-GRPO-NuminaMath-10K}}
}
