Model Card for qwen2.5-3b-instruct-dpo-orca

This model is a fine-tuned version of Qwen/Qwen2.5-3B-Instruct using Direct Preference Optimization (DPO). The fine-tuning was performed using the trl library's DPOTrainer with PEFT (QLoRA) for parameter-efficient training.

Model Details

Model Description

  • Developed by: Oğulcan Akca
  • Model type: Causal language model based on the Qwen2 architecture.
  • Language(s) (NLP): Primarily English (inherited from the base model), with potential multilingual capabilities.
  • License: Apache 2.0
  • Fine-tuned from model: Qwen/Qwen2.5-3B-Instruct

This model aims to improve the instruction-following capabilities and overall response quality of the Qwen/Qwen2.5-3B-Instruct base model by aligning it further with human preferences. The DPO fine-tuning was performed on the argilla/distilabel-intel-orca-dpo-pairs dataset, which contains pairs of chosen and rejected responses to various instructions. The goal was to train the model to prefer generating responses similar to the "chosen" examples while avoiding patterns found in the "rejected" examples.
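To make the preference objective concrete, below is a minimal, self-contained sketch of the standard sigmoid DPO loss for a single chosen/rejected pair. Note that the training run described in this card actually uses trl's IPO variant (loss_type="ipo"); the log-probability values here are made up purely for illustration.

```python
import math

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Standard (sigmoid) DPO loss for one preference pair.

    Each argument is the summed log-probability of a full response
    under the trainable policy or the frozen reference model.
    """
    chosen_ratio = policy_chosen_logp - ref_chosen_logp
    rejected_ratio = policy_rejected_logp - ref_rejected_logp
    margin = beta * (chosen_ratio - rejected_ratio)
    # -log(sigmoid(margin)): small when the policy prefers the chosen response
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# Hypothetical log-probs where the policy slightly prefers "chosen"
loss = dpo_loss(-4.0, -6.3, -3.9, -5.9, beta=0.1)
print(round(loss, 4))  # 0.6783
```

Minimizing this loss pushes up the policy's log-probability margin for chosen over rejected responses, relative to the reference model, which is exactly the alignment effect described above.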

Model Sources

Notebooks

  • Training and Evaluation Notebook (W&B): Kaggle (Part 4, final)
  • LLM-as-a-judge (Opik): Kaggle

Bias, Risks, and Limitations

  • Knowledge Cutoff: Inherits the knowledge cutoff of the base model.
  • Hallucinations: Like all LLMs, it may generate factually incorrect or nonsensical information.
  • Bias: May reflect biases present in the base model's pre-training data and the DPO fine-tuning dataset.
  • Limited Reasoning: As a 3B parameter model, its complex reasoning and planning capabilities are limited compared to larger models.
  • Evaluation Context: Evaluation was performed on a filtered subset of the Dolly dataset using an LLM judge (Gemini 2.0 Flash Lite). Performance may vary on different datasets or evaluation criteria.

Recommendations

Intended Uses: This model is designed for instruction-following tasks, similar to the base model, but potentially with improved helpfulness, coherence, and adherence to complex instructions. Suitable tasks include:

  • Creative Writing (stories, poems, scripts)
  • Summarization
  • Information Extraction
  • Brainstorming
  • General Chat / Conversational AI

How to Get Started with the Model

This model consists of LoRA adapters trained on top of Qwen/Qwen2.5-3B-Instruct. To use it, first load the base model in 4-bit, then apply the adapters. Make sure to use the specific chat template associated with the Qwen2 base model.

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig
from peft import PeftModel

# --- Configuration ---
base_model_name = "Qwen/Qwen2.5-3B-Instruct"
adapter_model_name = "ogulcanakca/qwen2.5-3b-instruct-dpo-orca"

# --- Load Base Model in 4-bit ---
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16
)

tokenizer = AutoTokenizer.from_pretrained(base_model_name, trust_remote_code=True)
tokenizer.pad_token = tokenizer.eos_token # Usually needed for generation

base_model = AutoModelForCausalLM.from_pretrained(
    base_model_name,
    quantization_config=bnb_config,
    device_map="auto",
    trust_remote_code=True,
)

# --- Load and Apply PEFT Adapter ---
model = PeftModel.from_pretrained(base_model, adapter_model_name)

# --- Prepare Input using Qwen2 Chat Template ---
prompt = "Write a short story about a cat who learns to code."
messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": prompt}
]
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True
)
model_inputs = tokenizer([text], return_tensors="pt").to(model.device)

# --- Generate Response ---
generated_ids = model.generate(
    model_inputs.input_ids,
    max_new_tokens=512,
    pad_token_id=tokenizer.eos_token_id
)
generated_ids = [
    output_ids[len(input_ids):] for input_ids, output_ids in zip(model_inputs.input_ids, generated_ids)
]

response = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]
print(response)

Training Details

Training Data

The model was fine-tuned using the Intel/orca_dpo_pairs dataset available on the Hugging Face Hub. This dataset consists of approximately 12.8k examples, each containing:

  • prompt: The instruction given to the model.
  • chosen: The preferred response.
  • rejected: The less preferred response.

The dataset covers a wide range of instruction types.
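Each preference pair can be represented as a simple record with the three fields listed above. The example below is hypothetical (not taken from the dataset) and includes a basic schema check of the kind one might run before passing rows to DPOTrainer:

```python
# Hypothetical row illustrating the prompt/chosen/rejected schema
example = {
    "prompt": "Summarize the plot of Romeo and Juliet in two sentences.",
    "chosen": "Two young lovers from feuding Verona families marry in secret...",
    "rejected": "Romeo and Juliet is a play.",
}

def is_valid_pair(row):
    """Basic schema check: all three fields present and non-empty."""
    return all(isinstance(row.get(k), str) and row[k].strip()
               for k in ("prompt", "chosen", "rejected"))

print(is_valid_pair(example))  # True
```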

Training Procedure

  • Frameworks: transformers, trl, peft, bitsandbytes, accelerate.
  • Method: Direct Preference Optimization (DPO) using trl's DPOTrainer.
  • Infrastructure: Trained on Kaggle Notebooks using a single NVIDIA Tesla P100 GPU (16GB VRAM).
  • Training Monitoring: Weights & Biases (wandb).

Training Hyperparameters

  • Parameter Efficiency: QLoRA (4-bit nf4 quantization with bf16 compute dtype) was used. LoRA parameters: r=16, lora_alpha=32, lora_dropout=0.05, targeting most linear layers (q_proj, k_proj, v_proj, o_proj, etc.).
  • Key Hyperparameters:
    • learning_rate: 5e-5
    • beta: 0.1
    • loss_type: "ipo"
    • num_train_epochs: 1
    • per_device_train_batch_size: 1 (Effective batch size: 8)
    • gradient_accumulation_steps: 8
    • lr_scheduler_type: "cosine"
    • optim: "paged_adamw_8bit"
    • max_length: 1024
    • max_prompt_length: 512
    • gradient_checkpointing: True
    • precompute_ref_log_probs: False
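As a sanity check on the schedule: with ~12.8k preference pairs, a per-device batch size of 1, and 8 gradient-accumulation steps, one epoch works out to roughly 1,600 optimizer steps, consistent with the training log below ending around step 1,525. A quick back-of-envelope calculation:

```python
num_examples = 12_800            # approximate dataset size
per_device_train_batch_size = 1
gradient_accumulation_steps = 8

effective_batch_size = per_device_train_batch_size * gradient_accumulation_steps
steps_per_epoch = num_examples // effective_batch_size

print(effective_batch_size)  # 8
print(steps_per_epoch)       # 1600
```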

Speeds, Sizes, Times

  • Training Time: The 1-epoch fine-tuning completed in approximately 15 hours on a single NVIDIA Tesla P100 GPU provided by Kaggle Notebooks. (Note: this figure corresponds to the first successful run, with precompute_ref_log_probs=False and evaluation disabled. A run with evaluation enabled took longer end to end, around 36 hours including interruptions, but its pure training time was also around 15 hours.)
  • GPU VRAM Usage: Thanks to QLoRA (4-bit quantization), the peak GPU memory usage during training remained manageable within the 16GB VRAM available on the P100.
  • Checkpoint Size: Since QLoRA saves only the adapter weights, the final checkpoint is relatively small (roughly 100–200 MB), making it easy to share and load.
  • Throughput: Training speed averaged approximately 0.03 iterations/second during the main training loop (excluding evaluation steps when enabled).
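The reported throughput is consistent with the ~15-hour figure: roughly 1,600 optimizer steps (one epoch at the effective batch size of 8) at ~0.03 iterations/second comes to about 14.8 hours. A rough check:

```python
steps = 12_800 // 8          # ~1 epoch at effective batch size 8
seconds_per_step = 1 / 0.03  # reported ~0.03 iterations/second
hours = steps * seconds_per_step / 3600

print(round(hours, 1))  # 14.8
```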

Evaluation

The model was evaluated against the base Qwen/Qwen2.5-3B-Instruct model on a custom test set derived from the databricks/databricks-dolly-15k dataset.

DPO training/validation log (final steps):

| Step | Training Loss | Validation Loss | Rewards/chosen | Rewards/rejected | Rewards/accuracies | Rewards/margins | Logps/chosen | Logps/rejected | Logits/chosen | Logits/rejected |
|------|---------------|-----------------|----------------|------------------|--------------------|-----------------|--------------|----------------|---------------|-----------------|
| 1300 | 12.325900 | 13.097957 | -0.147601 | -0.436289 | 0.856921 | 0.288688 | -4.055865 | -6.360097 | -1.893258 | -2.525452 |
| 1325 | 13.215800 | 13.116969 | -0.145761 | -0.434994 | 0.856921 | 0.289233 | -4.037463 | -6.347150 | -1.870269 | -2.520532 |
| 1350 | 12.509500 | 13.052545 | -0.145866 | -0.430553 | 0.856921 | 0.284687 | -4.038514 | -6.302736 | -1.884726 | -2.529022 |
| 1375 | 12.392900 | 13.075562 | -0.147900 | -0.434689 | 0.856921 | 0.286789 | -4.058856 | -6.344100 | -1.889521 | -2.523355 |
| 1400 | 13.423500 | 13.080257 | -0.147380 | -0.435694 | 0.856921 | 0.288314 | -4.053657 | -6.354150 | -1.877747 | -2.516060 |
| 1425 | 14.636300 | 13.075521 | -0.145920 | -0.433277 | 0.856921 | 0.287357 | -4.039056 | -6.329982 | -1.870765 | -2.516895 |
| 1450 | 10.936000 | 13.070604 | -0.145527 | -0.433105 | 0.856921 | 0.287578 | -4.035127 | -6.328260 | -1.865718 | -2.517346 |
| 1475 | 12.833400 | 13.085215 | -0.145675 | -0.433959 | 0.856921 | 0.288284 | -4.036605 | -6.336797 | -1.862939 | -2.514504 |
| 1500 | 12.713900 | 13.081480 | -0.145784 | -0.434095 | 0.856921 | 0.288311 | -4.037696 | -6.338157 | -1.863126 | -2.514367 |
| 1525 | 14.113300 | 13.077538 | -0.145751 | -0.433938 | 0.856921 | 0.288186 | -4.037369 | -6.336587 | -1.863070 | -2.514565 |
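The logged columns are internally consistent: Rewards/margins is simply Rewards/chosen minus Rewards/rejected. Checking the final logged step (1525):

```python
rewards_chosen = -0.145751
rewards_rejected = -0.433938

margin = rewards_chosen - rewards_rejected
print(round(margin, 6))  # 0.288187, matching the logged 0.288186 up to rounding
```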

Testing Data, Factors & Metrics

Testing Data

  • Dataset: databricks/databricks-dolly-15k
  • Preprocessing: The dataset was filtered to exclude examples from the open_qa and closed_qa categories to focus the evaluation on instruction following, creative generation, and reasoning tasks rather than factual recall.
  • Sampling: A random subset of 200 examples was selected from the filtered dataset for the final evaluation.
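A minimal sketch of this filter-and-sample step, using hypothetical stand-in rows rather than an actual download of databricks/databricks-dolly-15k (real rows also carry instruction, context, and response fields):

```python
import random

EXCLUDED = {"open_qa", "closed_qa"}

def build_eval_set(rows, n=200, seed=42):
    """Drop excluded categories, then draw a reproducible random subset."""
    kept = [r for r in rows if r["category"] not in EXCLUDED]
    rng = random.Random(seed)
    return rng.sample(kept, min(n, len(kept)))

# Hypothetical stand-in rows
rows = [
    {"instruction": "Write a haiku about autumn.", "category": "creative_writing"},
    {"instruction": "What is the capital of France?", "category": "open_qa"},
    {"instruction": "Summarize this paragraph.", "category": "summarization"},
]

subset = build_eval_set(rows, n=2)
print(len(subset))  # 2
```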

Factors

  • No specific subpopulations or domains were targeted for disaggregated analysis in this evaluation. The evaluation reflects overall performance on the sampled diverse tasks from the filtered Dolly dataset.

Metrics

| Metric | Description | Score Range | Average Result | Interpretation |
|--------|-------------|-------------|----------------|----------------|
| Head-to-Head Preference Score | Judge model selects which response (Base vs. DPO) is better for each prompt. | 0.0 = DPO wins, 1.0 = Base wins | 0.07 | DPO model was preferred in ~93% of cases (the residual is attributed to judge error). |
| Usefulness Score | Measures how well each model's response addresses the prompt (independent scoring). | 0.0 – 1.0 | 0.72 (DPO) / 0.69 (Base) | Indicates practical helpfulness of responses. |
| Reference Alignment Score | Evaluates semantic similarity to the human-written reference answer (Dolly dataset). | 0.0 – 1.0 | 0.05 (Base = DPO) | Measures alignment with the human "gold standard." |

Results

The evaluation metrics indicate a significant improvement in perceived response quality for the DPO-tuned model compared to the base model:

  • Head-to-Head Preference: The DPO model achieved an average score of ~0.07, where 1.0 indicates the base model winning. This translates to the DPO model winning approximately 93% of the direct comparisons against the base model, demonstrating a strong preference by the LLM judge.
  • Usefulness: The DPO model (avg: 0.72) outperformed the base model (avg: 0.69), showing a measurable, albeit small, improvement in average usefulness.
  • Reference Alignment: Both models scored low on alignment with the human references (DPO avg: 0.05, Base avg: 0.05), suggesting that neither model closely replicated the style or specific content of the Dolly reference answers. DPO tuning did not improve performance on this specific metric.
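The mapping from the judge's preference score to a win rate is direct: with 0.0 meaning the DPO model wins a pair and 1.0 meaning the base model wins, an average of 0.07 corresponds to the DPO model winning about 93% of comparisons. As a one-line sketch:

```python
def dpo_win_rate(avg_preference_score):
    """0.0 = DPO wins every pair, 1.0 = Base wins every pair."""
    return 1.0 - avg_preference_score

print(f"{dpo_win_rate(0.07):.0%}")  # 93%
```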

Overall: The results strongly suggest that the DPO fine-tuning was successful in enhancing the model's response quality and helpfulness, making it generally preferable to the base model according to the LLM judge.

Note: Earlier trials included heuristic metrics (e.g., ROUGE/BLEU) under a “Content Comparison Score,” but these were excluded from the final report in favor of LLM-as-a-Judge metrics, which capture semantic quality more effectively.
