Model Card for qwen2.5-3b-instruct-dpo-orca
This model is a fine-tuned version of Qwen/Qwen2.5-3B-Instruct using Direct Preference Optimization (DPO). The fine-tuning was performed using the trl library's DPOTrainer with PEFT (QLoRA) for parameter-efficient training.
Model Details
Model Description
- Developed by: Oğulcan Akca
- Model type: Causal language model based on the Qwen2 architecture.
- Language(s) (NLP): Primarily English (inherited from base model), potential multilingual capabilities.
- License: Apache 2.0
- Fine-tuned from model: Qwen/Qwen2.5-3B-Instruct
This model aims to improve the instruction-following capabilities and overall response quality of the Qwen/Qwen2.5-3B-Instruct base model by aligning it further with human preferences. The DPO fine-tuning was performed on the argilla/distilabel-intel-orca-dpo-pairs dataset, which contains pairs of chosen and rejected responses to various instructions. The goal was to train the model to prefer generating responses similar to the "chosen" examples while avoiding patterns found in the "rejected" examples.
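For reference, a sketch of the two preference objectives involved (β is the beta hyperparameter listed under Training Hyperparameters; this run used the IPO loss variant, as noted there). Given a prompt x with chosen response y_w and rejected response y_l, the standard DPO loss trains the policy π_θ against a frozen reference policy π_ref:

$$\mathcal{L}_{\text{DPO}} = -\,\mathbb{E}_{(x, y_w, y_l)}\left[\log \sigma\!\left(\beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\text{ref}}(y_w \mid x)} - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\text{ref}}(y_l \mid x)}\right)\right]$$

The IPO variant replaces the log-sigmoid with a squared-error term around a margin of $\tfrac{1}{2\beta}$:

$$\mathcal{L}_{\text{IPO}} = \mathbb{E}_{(x, y_w, y_l)}\left[\left(h_\theta(x, y_w, y_l) - \frac{1}{2\beta}\right)^{2}\right], \qquad h_\theta = \log \frac{\pi_\theta(y_w \mid x)\,\pi_{\text{ref}}(y_l \mid x)}{\pi_\theta(y_l \mid x)\,\pi_{\text{ref}}(y_w \mid x)}$$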
Model Sources
Notebooks
Bias, Risks, and Limitations
- Knowledge Cutoff: Inherits the knowledge cutoff of the base model.
- Hallucinations: Like all LLMs, it may generate factually incorrect or nonsensical information.
- Bias: May reflect biases present in the base model's pre-training data and the DPO fine-tuning dataset.
- Limited Reasoning: As a 3B parameter model, its complex reasoning and planning capabilities are limited compared to larger models.
- Evaluation Context: Evaluation was performed on a filtered subset of the Dolly dataset using an LLM judge (Gemini 2.0 Flash Lite). Performance may vary on different datasets or evaluation criteria.
Recommendations
Intended Uses: This model is designed for instruction-following tasks, similar to the base model, but potentially with improved helpfulness, coherence, and adherence to complex instructions. Suitable tasks include:
- Creative Writing (stories, poems, scripts)
- Summarization
- Information Extraction
- Brainstorming
- General Chat / Conversational AI
How to Get Started with the Model
This model consists of LoRA adapters trained on top of Qwen/Qwen2.5-3B-Instruct. To use it, first load the base model in 4-bit, then apply the adapters. Make sure to use the specific chat template associated with the Qwen2 base model.
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig
from peft import PeftModel

# --- Configuration ---
base_model_name = "Qwen/Qwen2.5-3B-Instruct"
adapter_model_name = "ogulcanakca/qwen2.5-3b-instruct-dpo-orca"

# --- Load Base Model in 4-bit ---
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16
)

tokenizer = AutoTokenizer.from_pretrained(base_model_name, trust_remote_code=True)
tokenizer.pad_token = tokenizer.eos_token  # Usually needed for generation

base_model = AutoModelForCausalLM.from_pretrained(
    base_model_name,
    quantization_config=bnb_config,
    device_map="auto",
    trust_remote_code=True,
)

# --- Load and Apply PEFT Adapter ---
model = PeftModel.from_pretrained(base_model, adapter_model_name)

# --- Prepare Input using Qwen2 Chat Template ---
prompt = "Write a short story about a cat who learns to code."
messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": prompt}
]
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True
)
model_inputs = tokenizer([text], return_tensors="pt").to(model.device)

# --- Generate Response ---
generated_ids = model.generate(
    model_inputs.input_ids,
    max_new_tokens=512,
    pad_token_id=tokenizer.eos_token_id
)
generated_ids = [
    output_ids[len(input_ids):] for input_ids, output_ids in zip(model_inputs.input_ids, generated_ids)
]
response = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]
print(response)
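Optionally, the LoRA adapters can be merged into the base model so the checkpoint can be served without peft at inference time. Merging requires loading the base model in a regular (non-4-bit) dtype; the output directory name below is illustrative:

```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM
from peft import PeftModel

# Load the base model in bf16 (merging into a 4-bit quantized model is not supported).
base = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen2.5-3B-Instruct",
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
)
merged = PeftModel.from_pretrained(base, "ogulcanakca/qwen2.5-3b-instruct-dpo-orca")
merged = merged.merge_and_unload()  # folds the LoRA weights into the base weights

# Save a standalone checkpoint that loads with plain transformers.
merged.save_pretrained("qwen2.5-3b-instruct-dpo-orca-merged")
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-3B-Instruct", trust_remote_code=True)
tokenizer.save_pretrained("qwen2.5-3b-instruct-dpo-orca-merged")
```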
Training Details
Training Data
The model was fine-tuned using the Intel/orca_dpo_pairs dataset available on the Hugging Face Hub. This dataset consists of approximately 12.8k examples, each containing:
- prompt: The instruction given to the model.
- chosen: The preferred response.
- rejected: The less preferred response.
The dataset covers a wide range of instruction types.
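As a rough illustration, the triples can be loaded and reshaped into the prompt/chosen/rejected format expected by DPOTrainer along these lines. This is a sketch assuming the Intel/orca_dpo_pairs column names (system, question, chosen, rejected); the exact preprocessing used for this run (e.g., applying the Qwen chat template to prompts) may differ:

```python
from datasets import load_dataset

dataset = load_dataset("Intel/orca_dpo_pairs", split="train")  # ~12.8k rows

def to_dpo_format(example):
    # Fold the optional system message into the prompt; DPOTrainer expects
    # plain-text "prompt", "chosen", and "rejected" columns.
    system = example.get("system") or ""
    prompt = (system + "\n\n" + example["question"]).strip()
    return {
        "prompt": prompt,
        "chosen": example["chosen"],
        "rejected": example["rejected"],
    }

dataset = dataset.map(to_dpo_format, remove_columns=dataset.column_names)
print(dataset[0]["prompt"][:200])
```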
Training Procedure
- Frameworks: transformers, trl, peft, bitsandbytes, accelerate.
- Method: Direct Preference Optimization (DPO) using trl's DPOTrainer.
- Infrastructure: Trained on Kaggle Notebooks using a single NVIDIA Tesla P100 GPU (16GB VRAM).
- Training Monitoring: Weights & Biases (wandb).
Training Hyperparameters
- Parameter Efficiency: QLoRA (4-bit nf4 quantization with bf16 compute dtype) was used. LoRA parameters: r=16, lora_alpha=32, lora_dropout=0.05, targeting most linear layers (q_proj, k_proj, v_proj, o_proj, etc.).
- Key Hyperparameters (a configuration sketch follows this list):
  - learning_rate: 5e-5
  - beta: 0.1
  - loss_type: "ipo"
  - num_train_epochs: 1
  - per_device_train_batch_size: 1 (effective batch size: 8)
  - gradient_accumulation_steps: 8
  - lr_scheduler_type: "cosine"
  - optim: "paged_adamw_8bit"
  - max_length: 1024
  - max_prompt_length: 512
  - gradient_checkpointing: True
  - precompute_ref_log_probs: False
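A minimal configuration sketch reflecting the hyperparameters above; field names follow recent trl/peft releases and may need adjusting for other versions, and the base model, tokenizer, and dataset are assumed to be loaded as in the earlier snippets:

```python
from peft import LoraConfig
from trl import DPOConfig, DPOTrainer

peft_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # plus other linear layers
    task_type="CAUSAL_LM",
)

training_args = DPOConfig(
    output_dir="qwen2.5-3b-instruct-dpo-orca",
    learning_rate=5e-5,
    beta=0.1,
    loss_type="ipo",
    num_train_epochs=1,
    per_device_train_batch_size=1,
    gradient_accumulation_steps=8,   # effective batch size 8
    lr_scheduler_type="cosine",
    optim="paged_adamw_8bit",
    max_length=1024,
    max_prompt_length=512,
    gradient_checkpointing=True,
    precompute_ref_log_probs=False,
    report_to="wandb",
)

trainer = DPOTrainer(
    model=base_model,            # 4-bit quantized Qwen2.5-3B-Instruct
    ref_model=None,              # with a PEFT adapter, the frozen base acts as the reference
    args=training_args,
    train_dataset=dataset,
    processing_class=tokenizer,  # `tokenizer=` in older trl versions
    peft_config=peft_config,
)
trainer.train()
```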
Speeds, Sizes, Times
- Training Time: The 1-epoch fine-tuning run completed in approximately 15 hours on a single NVIDIA Tesla P100 GPU provided by Kaggle Notebooks. (Note: this figure corresponds to the first successful run, with precompute_ref_log_probs=False and evaluation disabled. A run with evaluation enabled took around 36 hours including interruptions, but the pure training time is close to this figure.)
- GPU VRAM Usage: Thanks to QLoRA (4-bit quantization), peak GPU memory usage during training remained well within the 16GB VRAM available on the P100.
- Checkpoint Size: As QLoRA only saves the adapter weights, the final checkpoint size is relatively small, estimated to be around ~100-200 MB, making it easy to share and load.
- Throughput: Training speed averaged approximately 0.03 iterations/second during the main training loop (excluding evaluation steps when enabled).
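As a rough consistency check, and assuming "iterations" refers to optimizer steps at the effective batch size of 8, the reported throughput lines up with the ~15 hour training time:

$$\frac{12{,}800 \text{ examples}}{8 \text{ per step}} = 1{,}600 \text{ steps}, \qquad \frac{1{,}600 \text{ steps}}{0.03 \text{ steps/s}} \approx 53{,}000 \text{ s} \approx 14.8 \text{ hours}.$$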
Evaluation
The model was evaluated against the base Qwen/Qwen2.5-3B-Instruct model on a custom test set derived from the databricks/databricks-dolly-15k dataset.
- Filtering: Examples belonging to the open_qa and closed_qa categories were excluded to focus on instruction following and generation tasks rather than factual recall.
- Sampling: A random subset of 200 examples was selected from the filtered dataset.

The table below shows training and validation metrics logged during the final DPO training steps:
| Step | Training Loss | Validation Loss | Rewards/chosen | Rewards/rejected | Rewards/accuracies | Rewards/margins | Logps/chosen | Logps/rejected | Logits/chosen | Logits/rejected |
|---|---|---|---|---|---|---|---|---|---|---|
| 1300 | 12.325900 | 13.097957 | -0.147601 | -0.436289 | 0.856921 | 0.288688 | -4.055865 | -6.360097 | -1.893258 | -2.525452 |
| 1325 | 13.215800 | 13.116969 | -0.145761 | -0.434994 | 0.856921 | 0.289233 | -4.037463 | -6.347150 | -1.870269 | -2.520532 |
| 1350 | 12.509500 | 13.052545 | -0.145866 | -0.430553 | 0.856921 | 0.284687 | -4.038514 | -6.302736 | -1.884726 | -2.529022 |
| 1375 | 12.392900 | 13.075562 | -0.147900 | -0.434689 | 0.856921 | 0.286789 | -4.058856 | -6.344100 | -1.889521 | -2.523355 |
| 1400 | 13.423500 | 13.080257 | -0.147380 | -0.435694 | 0.856921 | 0.288314 | -4.053657 | -6.354150 | -1.877747 | -2.516060 |
| 1425 | 14.636300 | 13.075521 | -0.145920 | -0.433277 | 0.856921 | 0.287357 | -4.039056 | -6.329982 | -1.870765 | -2.516895 |
| 1450 | 10.936000 | 13.070604 | -0.145527 | -0.433105 | 0.856921 | 0.287578 | -4.035127 | -6.328260 | -1.865718 | -2.517346 |
| 1475 | 12.833400 | 13.085215 | -0.145675 | -0.433959 | 0.856921 | 0.288284 | -4.036605 | -6.336797 | -1.862939 | -2.514504 |
| 1500 | 12.713900 | 13.081480 | -0.145784 | -0.434095 | 0.856921 | 0.288311 | -4.037696 | -6.338157 | -1.863126 | -2.514367 |
| 1525 | 14.113300 | 13.077538 | -0.145751 | -0.433938 | 0.856921 | 0.288186 | -4.037369 | -6.336587 | -1.863070 | -2.514565 |
Testing Data, Factors & Metrics
Testing Data
- Dataset: databricks/databricks-dolly-15k
- Preprocessing: The dataset was filtered to exclude examples from the open_qa and closed_qa categories, focusing the evaluation on instruction following, creative generation, and reasoning tasks rather than factual recall.
- Sampling: A random subset of 200 examples was selected from the filtered dataset for the final evaluation (a sketch of this preprocessing follows below).
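A sketch of the filtering and sampling described above, using the databricks-dolly-15k column names (instruction, context, response, category); the random seed is illustrative:

```python
from datasets import load_dataset

dolly = load_dataset("databricks/databricks-dolly-15k", split="train")

# Drop factual-recall categories so the evaluation focuses on instruction
# following, creative generation, and reasoning tasks.
excluded = {"open_qa", "closed_qa"}
filtered = dolly.filter(lambda ex: ex["category"] not in excluded)

# Sample 200 examples for the head-to-head evaluation.
eval_set = filtered.shuffle(seed=42).select(range(200))
print(eval_set)
```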
Factors
- No specific subpopulations or domains were targeted for disaggregated analysis in this evaluation. The evaluation reflects overall performance on the sampled diverse tasks from the filtered Dolly dataset.
Metrics
| Metric | Description | Score Range | Average Result | Interpretation |
|---|---|---|---|---|
| Head-to-Head Preference Score | Judge model selects which response (Base vs. DPO) is better for each prompt. | 0.0 = DPO wins, 1.0 = Base wins | 0.07 (nonzero due to occasional judge errors) | DPO model was preferred in ~93% of cases. |
| Usefulness Score | Measures how well each model's response addresses the prompt (independent scoring). | 0.0 – 1.0 | 0.72 (DPO) / 0.69 (Base) | Indicates practical helpfulness of responses. |
| Reference Alignment Score | Evaluates semantic similarity to the human-written reference answer (Dolly dataset). | 0.0 – 1.0 | 0.05 (both Base and DPO) | Measures alignment with the human "gold standard." |
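For illustration, the head-to-head preference score can be aggregated along these lines. The generate_base, generate_dpo, and judge_prefers_base callables are hypothetical stand-ins for the actual generation pipeline and the Gemini 2.0 Flash Lite judge; this is not the exact prompt or API call used for this card:

```python
def head_to_head_score(examples, generate_base, generate_dpo, judge_prefers_base):
    """Average preference over the eval set: 0.0 = DPO always wins, 1.0 = Base always wins.

    generate_base / generate_dpo produce a response for a prompt; judge_prefers_base
    returns 1 if the judge picks the base model's response, else 0.
    """
    scores = []
    for ex in examples:
        prompt = ex["instruction"]
        base_answer = generate_base(prompt)
        dpo_answer = generate_dpo(prompt)
        scores.append(judge_prefers_base(prompt, base_answer, dpo_answer))
    return sum(scores) / len(scores)

# An average of ~0.07 corresponds to the DPO model winning ~93% of comparisons.
```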
Results
The evaluation metrics indicate a significant improvement in perceived response quality for the DPO-tuned model compared to the base model:
- Head-to-Head Preference: The DPO model achieved an average score of ~0.07, where 1.0 indicates the base model winning. This translates to the DPO model winning approximately 93% of the direct comparisons against the base model, demonstrating a strong preference by the LLM judge.
- Usefulness: The DPO model (avg: 0.72) outperformed the base model (avg: 0.69), showing a measurable, albeit small, improvement in average usefulness.
- Reference Alignment: Both models scored low on alignment with the human references (DPO avg: 0.05, Base avg: 0.05), suggesting that neither model closely replicated the style or specific content of the Dolly reference answers. DPO tuning did not improve performance on this specific metric.
Overall: The results strongly suggest that the DPO fine-tuning was successful in enhancing the model's response quality and helpfulness, making it generally preferable to the base model according to the LLM judge.
Note: Earlier trials included heuristic metrics (e.g., ROUGE/BLEU) under a “Content Comparison Score,” but these were excluded from the final report in favor of LLM-as-a-Judge metrics, which capture semantic quality more effectively.
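For completeness, such heuristic overlap metrics can be computed with the Hugging Face evaluate library; a minimal sketch, with purely illustrative inputs:

```python
import evaluate

rouge = evaluate.load("rouge")

predictions = ["The cat learned Python and wrote its first script."]  # model outputs
references = ["A short story about a cat that learns to program."]    # Dolly reference answers

# Returns rouge1 / rouge2 / rougeL / rougeLsum scores in [0, 1].
print(rouge.compute(predictions=predictions, references=references))
```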