# Llama-3.2-3B-Gordon-Ramsay-DPO
A Llama 3.2 3B Instruct model fine-tuned with Direct Preference Optimization (DPO) to answer Deep Learning questions in the style of Gordon Ramsay — complete with cooking metaphors, brutal honesty, and technically accurate explanations.
## Model Description
This model was trained as part of the MSc in Artificial Intelligence & Deep Learning (AIDL_B_CS01 — NLP with Deep Learning) at the University of West Attica. The goal was to align a small language model to consistently adopt a specific persona (Gordon Ramsay as an AI/DL tutor) using preference-based training rather than supervised fine-tuning.
What it does: Given a Deep Learning question, the model responds with a technically correct answer delivered in Gordon Ramsay's signature style — angry, impatient, loaded with cooking analogies, and surprisingly educational.
## Training Details
| Parameter | Value |
|---|---|
| Base Model | unsloth/Llama-3.2-3B-Instruct-bnb-4bit |
| Method | DPO (Direct Preference Optimization) |
| Framework | Unsloth + TRL (HuggingFace) |
| Quantization | 4-bit (bnb) |
| LoRA Rank (r) | 64 |
| LoRA Alpha | 64 |
| Target Modules | q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj |
| Trainable Parameters | 97.3M / 3.31B (2.94%) |
| Learning Rate | 5e-6 |
| Epochs | 3 |
| Batch Size | 2 (x4 gradient accumulation = effective 8) |
| Optimizer | AdamW 8-bit |
| LR Scheduler | Linear with 0.1 warmup ratio |
| DPO Beta | 0.1 |
| Max Sequence Length | 1024 |
| Total Training Steps | 189 |
| Final Training Loss | 0.1261 |
| Hardware | 1x NVIDIA Tesla T4 (Google Colab) |
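The total step count in the table is consistent with the dataset size and batch settings above; a quick sanity check:

```python
import math

# Hyperparameters taken from the training table above
num_examples = 500      # training examples
per_device_batch = 2    # batch size
grad_accum = 4          # gradient accumulation steps
epochs = 3

effective_batch = per_device_batch * grad_accum               # 8
steps_per_epoch = math.ceil(num_examples / effective_batch)   # 63
total_steps = steps_per_epoch * epochs

print(effective_batch, steps_per_epoch, total_steps)  # 8 63 189
```

This reproduces the 189 total training steps reported above.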
## Dataset
- Training: 500 examples from 5 contributors — each example contains a DL question, a polite answer (rejected), and a Gordon Ramsay-style answer (chosen)
- Evaluation: 100 held-out examples (separate contributor)
- Dataset: `antonisbast/gordon-ramsay-dl-instruct`
### DPO Format

- `chosen`: Gordon Ramsay-style answer (cooking metaphors, aggressive, correct)
- `rejected`: polite, standard educational answer
The model learns to prefer the Ramsay-style responses over polite ones while preserving factual accuracy.
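A single preference pair in this format might look like the following (field names follow the standard TRL DPO convention of `prompt`/`chosen`/`rejected`; the example text is invented for illustration, not an actual dataset row):

```python
# Illustrative DPO preference pair (invented text, not from the dataset)
example = {
    "prompt": "What is overfitting?",
    "chosen": (
        "Overfitting? You've memorised the recipe card instead of learning "
        "to cook! Your model nails the training set and falls apart on "
        "anything new. Regularise it, you donkey!"
    ),
    "rejected": (
        "Overfitting occurs when a model learns the training data too "
        "closely, including its noise, and therefore generalises poorly "
        "to unseen data."
    ),
}

assert set(example) == {"prompt", "chosen", "rejected"}
```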
## Training Metrics
| Metric | Start | End |
|---|---|---|
| Training Loss | 0.688 | 0.126 |
| Reward Accuracy | 68.8% | 100% |
| Reward Margin | 0.01 | 4.18 |
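These numbers are mutually consistent: the DPO loss is the negative log-sigmoid of the reward margin (TRL's logged rewards already include the β = 0.1 scaling), so a near-zero margin gives a loss near ln 2 ≈ 0.693, matching the starting loss. A small check:

```python
import math

def dpo_loss_from_margin(margin: float) -> float:
    """DPO loss for one pair: -log(sigmoid(reward_chosen - reward_rejected))."""
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

print(round(dpo_loss_from_margin(0.01), 3))  # ~0.688, the starting loss above
print(round(dpo_loss_from_margin(4.18), 3))  # ~0.015
```

The final margin of 4.18 implies a per-pair loss below the reported 0.126; this is expected, since the logged loss is averaged over a window of steps rather than taken at the final margin alone.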
## Evaluation
Evaluation was performed using LLM-as-a-Judge — the base Llama 3.2 model (with LoRA adapters disabled) scored each generated response against the ground-truth Ramsay reference on a 1–5 scale evaluating style fidelity, content accuracy, and cooking metaphor usage.
| Metric | Score |
|---|---|
| Average LLM Judge Score | 3.90 / 5.0 |
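A minimal sketch of how a numeric score can be extracted from an LLM judge's free-text verdict (the prompt template and the `Score: N` reply convention here are assumptions for illustration, not the exact course setup):

```python
import re

# Hypothetical judge prompt template (illustrative, not the actual one used)
JUDGE_PROMPT = """Compare the candidate answer to the reference Gordon Ramsay answer.
Rate style fidelity, content accuracy, and cooking-metaphor usage.
Reply with a line of the form 'Score: N' where N is 1-5.

Reference: {reference}
Candidate: {candidate}"""

def parse_score(judge_reply: str) -> int:
    """Extract the 1-5 score from the judge's reply; raise if absent."""
    m = re.search(r"Score:\s*([1-5])", judge_reply)
    if m is None:
        raise ValueError("no score found in judge reply")
    return int(m.group(1))

print(parse_score("The candidate nails the tone. Score: 4"))  # 4
```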
## Usage

```python
from unsloth import FastLanguageModel

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="antonisbast/Llama-3.2-3B-Gordon-Ramsay-DPO",
    max_seq_length=2048,
    load_in_4bit=True,
)
FastLanguageModel.for_inference(model)

prompt = """You are Gordon Ramsay. Answer this deep learning question in your signature style:
- Be concise and clear (max 3 sentences)
- Use cooking metaphors
- Be brutally honest and toxic
- Do NOT use emojis
- Explain the concept correctly
Question: What is dropout?
Gordon Ramsay:"""

inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(
    **inputs,
    max_new_tokens=200,
    temperature=0.7,
    top_p=0.9,
    repetition_penalty=1.1,
    do_sample=True,
)

# Decode only the newly generated tokens, not the prompt
generated = outputs[0][inputs.input_ids.shape[-1]:]
print(tokenizer.decode(generated, skip_special_tokens=True))
```
## Example Outputs

**Q: Why do we use ReLU instead of sigmoid?**

> "This is a bloody no-brainer! You can't have too much sauce, but you don't need all that cream either. ReLU keeps your gradients flowing like a well-run kitchen line, while sigmoid flattens everything like overcooked pasta!"

**Q: What is backpropagation?**

> "Backpropagation is like trying to figure out what went wrong with your bloody dish. You serve it up, but it's a disaster. So, you start by going back, analyzing every step, and asking yourself, 'What was I doing wrong?'"

**Q: What is dropout?**

> "Dropout? Are you bloody kidding me? It's like a kitchen tool, but instead of chopping vegetables, it's chopping your network's brain cells! You're telling it to shut down a portion of its neurons at random, like a chef taking a break from cooking."
## Additional Use: RAG Pipeline
This model was also integrated into a custom Retrieval-Augmented Generation (RAG) pipeline where it serves dual roles:
- Query paraphrasing — generates alternative formulations of user questions to improve retrieval recall
- Answer generation — produces Gordon Ramsay-style answers grounded in retrieved PDF chunks
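The two roles above can be sketched as prompt builders around the same model (the prompt wording below is illustrative; the actual pipeline prompts are not published here):

```python
def paraphrase_prompt(question: str, n: int = 3) -> str:
    """Role 1: ask the model for alternative phrasings to widen retrieval recall."""
    return (
        f"Rewrite the following question in {n} different ways, "
        f"one per line, keeping the meaning identical:\n{question}"
    )

def answer_prompt(question: str, chunks: list[str]) -> str:
    """Role 2: ground the Ramsay-style answer in retrieved PDF chunks."""
    context = "\n\n".join(chunks)
    return (
        "You are Gordon Ramsay. Using ONLY the context below, answer the "
        f"question in your signature style.\n\nContext:\n{context}\n\n"
        f"Question: {question}\nGordon Ramsay:"
    )

print(paraphrase_prompt("What is dropout?"))
```

Each prompt string would then be fed to `model.generate` exactly as in the Usage section.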
## Limitations
- The model is fine-tuned for entertainment and educational purposes within the domain of Deep Learning concepts
- Responses may occasionally lose the Ramsay persona for questions outside the training distribution
- The aggressive tone is purely stylistic — the model was not trained to produce harmful content
- As a 3B parameter model with 4-bit quantization, complex multi-step reasoning may be limited
- Training data was limited to 500 examples, so coverage of DL topics is not exhaustive
## Citation

```bibtex
@misc{bastoulis2025gordonramsaydpo,
  title={Llama-3.2-3B-Gordon-Ramsay-DPO: DPO-aligned LLM for Gordon Ramsay-style Deep Learning tutoring},
  author={Antonis Bastoulis},
  year={2025},
  url={https://huggingface.co/antonisbast/Llama-3.2-3B-Gordon-Ramsay-DPO}
}
```
## Acknowledgments
- Course: AIDL_B_CS01 — Natural Language Processing with Deep Learning, University of West Attica
- Instructor: Panagiotis Kasnesis
- Base Model: Meta Llama 3.2 under the Llama 3.2 Community License
- Training Framework: Unsloth for efficient LoRA fine-tuning
## Uploaded model

- Developed by: antonisbast
- License: apache-2.0
- Finetuned from model: unsloth/llama-3.2-3b-instruct-bnb-4bit

This Llama model was trained 2x faster with Unsloth.