Model Card for qwen2.5-3b-instruct-dpo-orca

This model is a fine-tuned version of Qwen/Qwen2.5-3B-Instruct using Direct Preference Optimization (DPO). The fine-tuning was performed using the trl library's DPOTrainer with PEFT (QLoRA) for parameter-efficient training.

Model Details

Model Description

  • Developed by: Oğulcan Akca
  • Model type: Causal language model based on the Qwen2 architecture.
  • Language(s) (NLP): Primarily English (inherited from the base model), with potential multilingual capabilities.
  • License: Apache 2.0
  • Fine-tuned from model: Qwen/Qwen2.5-3B-Instruct

This model aims to improve the instruction-following capabilities and overall response quality of the Qwen/Qwen2.5-3B-Instruct base model by aligning it further with human preferences. The DPO fine-tuning was performed on the argilla/distilabel-intel-orca-dpo-pairs dataset, which contains pairs of chosen and rejected responses to various instructions. The goal was to train the model to prefer generating responses similar to the "chosen" examples while avoiding patterns found in the "rejected" examples.
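To make the preference objective concrete, below is a minimal, self-contained sketch of the standard sigmoid DPO loss for a single chosen/rejected pair. Note that the training run described in this card actually uses trl's IPO variant (loss_type="ipo"); the log-probability values here are made up purely for illustration.

```python
import math

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Standard (sigmoid) DPO loss for one preference pair.

    Each argument is the summed log-probability of a full response
    under the trainable policy or the frozen reference model.
    """
    chosen_ratio = policy_chosen_logp - ref_chosen_logp
    rejected_ratio = policy_rejected_logp - ref_rejected_logp
    margin = beta * (chosen_ratio - rejected_ratio)
    # -log(sigmoid(margin)): small when the policy prefers the chosen response
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# Hypothetical log-probs where the policy slightly prefers "chosen"
loss = dpo_loss(-4.0, -6.3, -3.9, -5.9, beta=0.1)
print(round(loss, 4))  # 0.6783
```

Minimizing this loss pushes up the policy's log-probability margin for chosen over rejected responses, relative to the reference model, which is exactly the alignment effect described above.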

Model Sources

Notebooks

  • Training and Evaluation Notebook (W&B): Kaggle (Part 4, final)
  • LLM-as-a-judge (Opik): Kaggle

Bias, Risks, and Limitations

  • Knowledge Cutoff: Inherits the knowledge cutoff of the base model.
  • Hallucinations: Like all LLMs, it may generate factually incorrect or nonsensical information.
  • Bias: May reflect biases present in the base model's pre-training data and the DPO fine-tuning dataset.
  • Limited Reasoning: As a 3B parameter model, its complex reasoning and planning capabilities are limited compared to larger models.
  • Evaluation Context: Evaluation was performed on a filtered subset of the Dolly dataset using an LLM judge (Gemini 2.0 Flash Lite). Performance may vary on different datasets or evaluation criteria.

Recommendations

Intended Uses: This model is designed for instruction-following tasks, similar to the base model, but potentially with improved helpfulness, coherence, and adherence to complex instructions. Suitable tasks include:

  • Creative Writing (stories, poems, scripts)
  • Summarization
  • Information Extraction
  • Brainstorming
  • General Chat / Conversational AI

How to Get Started with the Model

This model consists of LoRA adapters trained on top of Qwen/Qwen2.5-3B-Instruct. To use it, first load the base model in 4-bit, then apply the adapters. Make sure to use the specific chat template associated with the Qwen2 base model.

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig
from peft import PeftModel

# --- Configuration ---
base_model_name = "Qwen/Qwen2.5-3B-Instruct"
adapter_model_name = "ogulcanakca/qwen2.5-3b-instruct-dpo-orca"

# --- Load Base Model in 4-bit ---
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16
)

tokenizer = AutoTokenizer.from_pretrained(base_model_name, trust_remote_code=True)
tokenizer.pad_token = tokenizer.eos_token # Usually needed for generation

base_model = AutoModelForCausalLM.from_pretrained(
    base_model_name,
    quantization_config=bnb_config,
    device_map="auto",
    trust_remote_code=True,
)

# --- Load and Apply PEFT Adapter ---
model = PeftModel.from_pretrained(base_model, adapter_model_name)

# --- Prepare Input using Qwen2 Chat Template ---
prompt = "Write a short story about a cat who learns to code."
messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": prompt}
]
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True
)
model_inputs = tokenizer([text], return_tensors="pt").to(model.device)

# --- Generate Response ---
generated_ids = model.generate(
    model_inputs.input_ids,
    max_new_tokens=512,
    pad_token_id=tokenizer.eos_token_id
)
generated_ids = [
    output_ids[len(input_ids):] for input_ids, output_ids in zip(model_inputs.input_ids, generated_ids)
]

response = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]
print(response)

Training Details

Training Data

The model was fine-tuned using the Intel/orca_dpo_pairs dataset available on the Hugging Face Hub. This dataset consists of approximately 12.8k examples, each containing:

  • prompt: The instruction given to the model.
  • chosen: The preferred response.
  • rejected: The less preferred response.

The dataset covers a wide range of instruction types.
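Each preference pair can be represented as a simple record with the three fields listed above. The example below is hypothetical (not taken from the dataset) and includes a basic schema check of the kind one might run before passing rows to DPOTrainer:

```python
# Hypothetical row illustrating the prompt/chosen/rejected schema
example = {
    "prompt": "Summarize the plot of Romeo and Juliet in two sentences.",
    "chosen": "Two young lovers from feuding Verona families marry in secret...",
    "rejected": "Romeo and Juliet is a play.",
}

def is_valid_pair(row):
    """Basic schema check: all three fields present and non-empty."""
    return all(isinstance(row.get(k), str) and row[k].strip()
               for k in ("prompt", "chosen", "rejected"))

print(is_valid_pair(example))  # True
```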

Training Procedure

  • Frameworks: transformers, trl, peft, bitsandbytes, accelerate.
  • Method: Direct Preference Optimization (DPO) using trl's DPOTrainer.
  • Infrastructure: Trained on Kaggle Notebooks using a single NVIDIA Tesla P100 GPU (16GB VRAM).
  • Training Monitoring: Weights & Biases (wandb).

Training Hyperparameters

  • Parameter Efficiency: QLoRA (4-bit nf4 quantization with bf16 compute dtype) was used. LoRA parameters: r=16, lora_alpha=32, lora_dropout=0.05, targeting most linear layers (q_proj, k_proj, v_proj, o_proj, etc.).
  • Key Hyperparameters:
    • learning_rate: 5e-5
    • beta: 0.1
    • loss_type: "ipo"
    • num_train_epochs: 1
    • per_device_train_batch_size: 1 (Effective batch size: 8)
    • gradient_accumulation_steps: 8
    • lr_scheduler_type: "cosine"
    • optim: "paged_adamw_8bit"
    • max_length: 1024
    • max_prompt_length: 512
    • gradient_checkpointing: True
    • precompute_ref_log_probs: False
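As a sanity check on the schedule: with ~12.8k preference pairs, a per-device batch size of 1, and 8 gradient-accumulation steps, one epoch works out to roughly 1,600 optimizer steps, consistent with the training log below ending around step 1,525. A quick back-of-envelope calculation:

```python
num_examples = 12_800            # approximate dataset size
per_device_train_batch_size = 1
gradient_accumulation_steps = 8

effective_batch_size = per_device_train_batch_size * gradient_accumulation_steps
steps_per_epoch = num_examples // effective_batch_size

print(effective_batch_size)  # 8
print(steps_per_epoch)       # 1600
```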

Speeds, Sizes, Times

  • Training Time: The 1-epoch fine-tuning completed in approximately 15 hours on a single NVIDIA Tesla P100 GPU provided by Kaggle Notebooks. (Note: this figure corresponds to the first successful run, with precompute_ref_log_probs=False and evaluation disabled. A run with evaluation enabled took longer end to end, around 36 hours including interruptions, but its pure training time was also around 15 hours.)
  • GPU VRAM Usage: Thanks to QLoRA (4-bit quantization), the peak GPU memory usage during training remained manageable within the 16GB VRAM available on the P100.
  • Checkpoint Size: Since QLoRA saves only the adapter weights, the final checkpoint is relatively small (roughly 100–200 MB), making it easy to share and load.
  • Throughput: Training speed averaged approximately 0.03 iterations/second during the main training loop (excluding evaluation steps when enabled).
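The reported throughput is consistent with the ~15-hour figure: roughly 1,600 optimizer steps (one epoch at the effective batch size of 8) at ~0.03 iterations/second comes to about 14.8 hours. A rough check:

```python
steps = 12_800 // 8          # ~1 epoch at effective batch size 8
seconds_per_step = 1 / 0.03  # reported ~0.03 iterations/second
hours = steps * seconds_per_step / 3600

print(round(hours, 1))  # 14.8
```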

Evaluation

The model was evaluated against the base Qwen/Qwen2.5-3B-Instruct model on a custom test set derived from the databricks/databricks-dolly-15k dataset.

DPO training/validation log (final steps):

| Step | Training Loss | Validation Loss | Rewards/chosen | Rewards/rejected | Rewards/accuracies | Rewards/margins | Logps/chosen | Logps/rejected | Logits/chosen | Logits/rejected |
|------|---------------|-----------------|----------------|------------------|--------------------|-----------------|--------------|----------------|---------------|-----------------|
| 1300 | 12.325900 | 13.097957 | -0.147601 | -0.436289 | 0.856921 | 0.288688 | -4.055865 | -6.360097 | -1.893258 | -2.525452 |
| 1325 | 13.215800 | 13.116969 | -0.145761 | -0.434994 | 0.856921 | 0.289233 | -4.037463 | -6.347150 | -1.870269 | -2.520532 |
| 1350 | 12.509500 | 13.052545 | -0.145866 | -0.430553 | 0.856921 | 0.284687 | -4.038514 | -6.302736 | -1.884726 | -2.529022 |
| 1375 | 12.392900 | 13.075562 | -0.147900 | -0.434689 | 0.856921 | 0.286789 | -4.058856 | -6.344100 | -1.889521 | -2.523355 |
| 1400 | 13.423500 | 13.080257 | -0.147380 | -0.435694 | 0.856921 | 0.288314 | -4.053657 | -6.354150 | -1.877747 | -2.516060 |
| 1425 | 14.636300 | 13.075521 | -0.145920 | -0.433277 | 0.856921 | 0.287357 | -4.039056 | -6.329982 | -1.870765 | -2.516895 |
| 1450 | 10.936000 | 13.070604 | -0.145527 | -0.433105 | 0.856921 | 0.287578 | -4.035127 | -6.328260 | -1.865718 | -2.517346 |
| 1475 | 12.833400 | 13.085215 | -0.145675 | -0.433959 | 0.856921 | 0.288284 | -4.036605 | -6.336797 | -1.862939 | -2.514504 |
| 1500 | 12.713900 | 13.081480 | -0.145784 | -0.434095 | 0.856921 | 0.288311 | -4.037696 | -6.338157 | -1.863126 | -2.514367 |
| 1525 | 14.113300 | 13.077538 | -0.145751 | -0.433938 | 0.856921 | 0.288186 | -4.037369 | -6.336587 | -1.863070 | -2.514565 |
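The logged columns are internally consistent: Rewards/margins is simply Rewards/chosen minus Rewards/rejected. Checking the final logged step (1525):

```python
rewards_chosen = -0.145751
rewards_rejected = -0.433938

margin = rewards_chosen - rewards_rejected
print(round(margin, 6))  # 0.288187, matching the logged 0.288186 up to rounding
```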

Testing Data, Factors & Metrics

Testing Data

  • Dataset: databricks/databricks-dolly-15k
  • Preprocessing: The dataset was filtered to exclude examples from the open_qa and closed_qa categories to focus the evaluation on instruction following, creative generation, and reasoning tasks rather than factual recall.
  • Sampling: A random subset of 200 examples was selected from the filtered dataset for the final evaluation.
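A minimal sketch of this filter-and-sample step, using hypothetical stand-in rows rather than an actual download of databricks/databricks-dolly-15k (real rows also carry instruction, context, and response fields):

```python
import random

EXCLUDED = {"open_qa", "closed_qa"}

def build_eval_set(rows, n=200, seed=42):
    """Drop excluded categories, then draw a reproducible random subset."""
    kept = [r for r in rows if r["category"] not in EXCLUDED]
    rng = random.Random(seed)
    return rng.sample(kept, min(n, len(kept)))

# Hypothetical stand-in rows
rows = [
    {"instruction": "Write a haiku about autumn.", "category": "creative_writing"},
    {"instruction": "What is the capital of France?", "category": "open_qa"},
    {"instruction": "Summarize this paragraph.", "category": "summarization"},
]

subset = build_eval_set(rows, n=2)
print(len(subset))  # 2
```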

Factors

  • No specific subpopulations or domains were targeted for disaggregated analysis in this evaluation. The evaluation reflects overall performance on the sampled diverse tasks from the filtered Dolly dataset.

Metrics

| Metric | Description | Score Range | Average Result | Interpretation |
|--------|-------------|-------------|----------------|----------------|
| Head-to-Head Preference Score | Judge model selects which response (Base vs. DPO) is better for each prompt. | 0.0 = DPO wins, 1.0 = Base wins | 0.07 | DPO model was preferred in ~93% of cases (the residual is attributed to judge error). |
| Usefulness Score | Measures how well each model's response addresses the prompt (independent scoring). | 0.0 – 1.0 | 0.72 (DPO) / 0.69 (Base) | Indicates practical helpfulness of responses. |
| Reference Alignment Score | Evaluates semantic similarity to the human-written reference answer (Dolly dataset). | 0.0 – 1.0 | 0.05 (Base = DPO) | Measures alignment with the human "gold standard." |

Results

The evaluation metrics indicate a significant improvement in perceived response quality for the DPO-tuned model compared to the base model:

  • Head-to-Head Preference: The DPO model achieved an average score of ~0.07, where 1.0 indicates the base model winning. This translates to the DPO model winning approximately 93% of the direct comparisons against the base model, demonstrating a strong preference by the LLM judge.
  • Usefulness: The DPO model (avg: 0.72) outperformed the base model (avg: 0.69), showing a measurable, albeit small, improvement in average usefulness.
  • Reference Alignment: Both models scored low on alignment with the human references (DPO avg: 0.05, Base avg: 0.05), suggesting that neither model closely replicated the style or specific content of the Dolly reference answers. DPO tuning did not improve performance on this specific metric.
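The mapping from the judge's preference score to a win rate is direct: with 0.0 meaning the DPO model wins a pair and 1.0 meaning the base model wins, an average of 0.07 corresponds to the DPO model winning about 93% of comparisons. As a one-line sketch:

```python
def dpo_win_rate(avg_preference_score):
    """0.0 = DPO wins every pair, 1.0 = Base wins every pair."""
    return 1.0 - avg_preference_score

print(f"{dpo_win_rate(0.07):.0%}")  # 93%
```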

Overall: The results strongly suggest that the DPO fine-tuning was successful in enhancing the model's response quality and helpfulness, making it generally preferable to the base model according to the LLM judge.

Note: Earlier trials included heuristic metrics (e.g., ROUGE/BLEU) under a “Content Comparison Score,” but these were excluded from the final report in favor of LLM-as-a-Judge metrics, which capture semantic quality more effectively.
