qwen3-4b-dpo-qwen-cot-merged-v2
This model is a fine-tuned version of Qwen/Qwen3-4B-Instruct-2507, optimized through a two-stage process: Supervised Fine-Tuning (SFT) for structured outputs, followed by Direct Preference Optimization (DPO) using the Unsloth library.
This repository contains the fully merged 16-bit weights; no adapter loading is required.
Model Pipeline
- SFT Stage: The base model was fine-tuned to improve structured-output accuracy (JSON, YAML, etc.). During this stage, loss was applied only to the final assistant output, while intermediate reasoning (Chain-of-Thought) was masked; a minimal sketch of this masking follows this list. The resulting adapter was then merged into the base model.
- DPO Stage: The SFT-merged model was further optimized using DPO to align reasoning capabilities and response quality with human preferences.
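The CoT masking can be illustrated with a minimal sketch (the `### Final Answer` delimiter below is a hypothetical stand-in; the actual marker depends on the dataset's template). The key point is that labels for all tokens before the final output are set to `-100`, which the cross-entropy loss ignores:

```python
# Minimal sketch of CoT loss masking (assumed approach, not the exact training code).
# Tokens before the final answer get label -100, which PyTorch's cross-entropy
# loss ignores, so gradients come only from the final structured output.
import torch

IGNORE_INDEX = -100
ANSWER_MARKER = "### Final Answer"  # hypothetical delimiter; the real dataset may differ

def mask_cot_labels(tokenizer, full_text: str) -> dict:
    """Tokenize a training example and mask everything before the final answer."""
    enc = tokenizer(full_text, return_tensors="pt")
    labels = enc["input_ids"].clone()

    # Locate where the final answer begins, then count the tokens before it.
    # (Approximate: token boundaries may not align exactly with the character split.)
    answer_start_char = full_text.rindex(ANSWER_MARKER)
    prefix_len = len(tokenizer(full_text[:answer_start_char])["input_ids"])

    labels[:, :prefix_len] = IGNORE_INDEX  # loss applies only to the answer tokens
    enc["labels"] = labels
    return enc
```

This keeps the gradient signal on the structured answer while still letting the model condition on its own reasoning.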
Training Objective
This model has been optimized with DPO so that its responses match the preferred outputs in the preference dataset listed below, with a focus on Chain-of-Thought reasoning and structured-response quality.
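For reference, this stage minimizes the standard DPO objective (Rafailov et al., 2023), with the β = 0.1 configured below:

$$\mathcal{L}_{\mathrm{DPO}}(\pi_\theta;\, \pi_{\mathrm{ref}}) = -\,\mathbb{E}_{(x,\, y_w,\, y_l) \sim \mathcal{D}} \left[ \log \sigma\!\left( \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)} - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)} \right) \right]$$

where $\pi_{\mathrm{ref}}$ is the frozen SFT-merged model, $y_w$ and $y_l$ are the chosen and rejected responses from the preference dataset, and $\beta$ controls how far the policy may drift from the reference.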
Training Configuration
1. SFT Stage (Pre-DPO): [choco800/Qwen3-4B-Instruct-2507-v8](https://huggingface.co/choco800/Qwen3-4B-Instruct-2507-v8)
- Base model: Qwen/Qwen3-4B-Instruct-2507
- Method: QLoRA (4-bit) merged into base
- Dataset: u-10bei/structured_data_with_cot_dataset_512_v2
- Focus: Structured output accuracy with CoT masking
2. DPO Stage
- Base model for DPO: Custom SFT-merged Qwen3-4B (v8)
- Method: DPO (Direct Preference Optimization)
- Epochs: 1
- Learning rate: 5e-07
- Beta: 0.1
- Max sequence length: 2048
- LoRA config: r=8, alpha=16 (merged into the base after training; see the training sketch below)
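A minimal sketch of this DPO stage, assuming Unsloth for loading/merging and TRL's `DPOTrainer` for training. The model and dataset IDs come from this card; target modules, batch size, and output paths are illustrative assumptions, not the exact training script:

```python
# Hedged sketch of the DPO stage with Unsloth + TRL (illustrative, not the exact script).
from unsloth import FastLanguageModel
from datasets import load_dataset
from trl import DPOConfig, DPOTrainer

# Load the SFT-merged model (v8), which serves as both the policy and,
# via the frozen base weights under the LoRA adapter, the implicit reference.
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="choco800/Qwen3-4B-Instruct-2507-v8",
    max_seq_length=2048,
    load_in_4bit=True,
)

# Attach the LoRA adapter used for DPO (r=8, alpha=16 as listed above).
model = FastLanguageModel.get_peft_model(
    model,
    r=8,
    lora_alpha=16,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # assumed target modules
)

dataset = load_dataset("u-10bei/dpo-dataset-qwen-cot", split="train")

trainer = DPOTrainer(
    model=model,
    args=DPOConfig(
        beta=0.1,
        learning_rate=5e-7,
        num_train_epochs=1,
        max_length=2048,
        per_device_train_batch_size=2,  # assumed; not stated on this card
        output_dir="outputs",
    ),
    train_dataset=dataset,
    processing_class=tokenizer,  # `processing_class` in recent TRL; older versions use `tokenizer=`
)
trainer.train()

# Merge the adapter into the base and save full 16-bit weights (Unsloth helper).
model.save_pretrained_merged("qwen3-4b-dpo-merged", tokenizer, save_method="merged_16bit")
```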
Usage
Since this is a merged model, you can use it directly with transformers.
```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model_id = "your_id/your-repo-name"  # replace with this repository's ID

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,
    device_map="auto",
)

# Test inference
prompt = "Your question here"
inputs = tokenizer.apply_chat_template(
    [{"role": "user", "content": prompt}],
    tokenize=True,
    add_generation_prompt=True,
    return_dict=True,       # return a dict so it can be splatted into generate()
    return_tensors="pt",
).to(model.device)          # follow the device chosen by device_map="auto"
outputs = model.generate(**inputs, max_new_tokens=512)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
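Because the SFT stage targeted structured outputs, a quick smoke test is to request strict JSON and parse the completion. The prompt and keys below are illustrative (reusing `model` and `tokenizer` from the snippet above); actual output depends on sampling settings:

```python
import json

# Ask for strict JSON and verify it parses (illustrative smoke test).
prompt = "Return a JSON object with keys 'name' and 'age' for a fictional person. Output JSON only."
inputs = tokenizer.apply_chat_template(
    [{"role": "user", "content": prompt}],
    tokenize=True,
    add_generation_prompt=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)
outputs = model.generate(**inputs, max_new_tokens=128)

# Decode only the newly generated tokens, then parse them as JSON.
completion = tokenizer.decode(outputs[0][inputs["input_ids"].shape[-1]:], skip_special_tokens=True)
print(json.loads(completion))  # raises if the model did not emit valid JSON
```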
Sources & License (IMPORTANT)
- DPO Training Data: u-10bei/dpo-dataset-qwen-cot
- SFT Training Data: u-10bei/structured_data_with_cot_dataset_512_v2
- License: MIT (as per the dataset terms).
- Compliance: Users must also comply with the license terms of the original base model, Qwen/Qwen3-4B-Instruct-2507.