# Llama-3.2-3B-Gordon-Ramsay-DPO
A Llama 3.2 3B Instruct model fine-tuned with Direct Preference Optimization (DPO) to answer Deep Learning questions in the style of Gordon Ramsay — complete with cooking metaphors, brutal honesty, and technically accurate explanations.
## Model Description
This model was trained as part of the MSc in Artificial Intelligence & Deep Learning (AIDL_B_CS01 — NLP with Deep Learning) at the University of West Attica. The goal was to align a small language model to consistently adopt a specific persona (Gordon Ramsay as an AI/DL tutor) using preference-based training rather than supervised fine-tuning.
What it does: Given a Deep Learning question, the model responds with a technically correct answer delivered in Gordon Ramsay's signature style — angry, impatient, loaded with cooking analogies, and surprisingly educational.
## Training Details
| Parameter | Value |
|---|---|
| Base Model | unsloth/Llama-3.2-3B-Instruct-bnb-4bit |
| Method | DPO (Direct Preference Optimization) |
| Framework | Unsloth + TRL (HuggingFace) |
| Quantization | 4-bit (bnb) |
| LoRA Rank (r) | 64 |
| LoRA Alpha | 64 |
| Target Modules | q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj |
| Trainable Parameters | 97.3M / 3.31B (2.94%) |
| Learning Rate | 5e-6 |
| Epochs | 3 |
| Batch Size | 2 (x4 gradient accumulation = effective 8) |
| Optimizer | AdamW 8-bit |
| LR Scheduler | Linear with 0.1 warmup ratio |
| DPO Beta | 0.1 |
| Max Sequence Length | 1024 |
| Total Training Steps | 189 |
| Final Training Loss | 0.1261 |
| Hardware | 1x NVIDIA Tesla T4 (Google Colab) |
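The total step count in the table is consistent with the dataset size and batch settings above; a quick sanity check:

```python
import math

# Hyperparameters taken from the training table above
num_examples = 500      # training examples
per_device_batch = 2    # batch size
grad_accum = 4          # gradient accumulation steps
epochs = 3

effective_batch = per_device_batch * grad_accum               # 8
steps_per_epoch = math.ceil(num_examples / effective_batch)   # 63
total_steps = steps_per_epoch * epochs

print(effective_batch, steps_per_epoch, total_steps)  # 8 63 189
```

This reproduces the 189 total training steps reported above.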
## Dataset
- Training: 500 examples from 5 contributors — each example contains a DL question, a polite answer (rejected), and a Gordon Ramsay-style answer (chosen)
- Evaluation: 100 held-out examples (separate contributor)
- Dataset: `antonisbast/gordon-ramsay-dl-instruct`
### DPO Format

- `chosen`: Gordon Ramsay-style answer (cooking metaphors, aggressive, correct)
- `rejected`: polite, standard educational answer
The model learns to prefer the Ramsay-style responses over polite ones while preserving factual accuracy.
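A single preference pair in this format might look like the following (field names follow the standard TRL DPO convention of `prompt`/`chosen`/`rejected`; the example text is invented for illustration, not an actual dataset row):

```python
# Illustrative DPO preference pair (invented text, not from the dataset)
example = {
    "prompt": "What is overfitting?",
    "chosen": (
        "Overfitting? You've memorised the recipe card instead of learning "
        "to cook! Your model nails the training set and falls apart on "
        "anything new. Regularise it, you donkey!"
    ),
    "rejected": (
        "Overfitting occurs when a model learns the training data too "
        "closely, including its noise, and therefore generalises poorly "
        "to unseen data."
    ),
}

assert set(example) == {"prompt", "chosen", "rejected"}
```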
## Training Metrics
| Metric | Start | End |
|---|---|---|
| Training Loss | 0.688 | 0.126 |
| Reward Accuracy | 68.8% | 100% |
| Reward Margin | 0.01 | 4.18 |
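These numbers are mutually consistent: the DPO loss is the negative log-sigmoid of the reward margin (TRL's logged rewards already include the β = 0.1 scaling), so a near-zero margin gives a loss near ln 2 ≈ 0.693, matching the starting loss. A small check:

```python
import math

def dpo_loss_from_margin(margin: float) -> float:
    """DPO loss for one pair: -log(sigmoid(reward_chosen - reward_rejected))."""
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

print(round(dpo_loss_from_margin(0.01), 3))  # ~0.688, the starting loss above
print(round(dpo_loss_from_margin(4.18), 3))  # ~0.015
```

The final margin of 4.18 implies a per-pair loss below the reported 0.126; this is expected, since the logged loss is averaged over a window of steps rather than taken at the final margin alone.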
## Evaluation
Evaluation was performed using LLM-as-a-Judge — the base Llama 3.2 model (with LoRA adapters disabled) scored each generated response against the ground-truth Ramsay reference on a 1–5 scale evaluating style fidelity, content accuracy, and cooking metaphor usage.
| Metric | Score |
|---|---|
| Average LLM Judge Score | 3.90 / 5.0 |
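A minimal sketch of how a numeric score can be extracted from an LLM judge's free-text verdict (the prompt template and the `Score: N` reply convention here are assumptions for illustration, not the exact course setup):

```python
import re

# Hypothetical judge prompt template (illustrative, not the actual one used)
JUDGE_PROMPT = """Compare the candidate answer to the reference Gordon Ramsay answer.
Rate style fidelity, content accuracy, and cooking-metaphor usage.
Reply with a line of the form 'Score: N' where N is 1-5.

Reference: {reference}
Candidate: {candidate}"""

def parse_score(judge_reply: str) -> int:
    """Extract the 1-5 score from the judge's reply; raise if absent."""
    m = re.search(r"Score:\s*([1-5])", judge_reply)
    if m is None:
        raise ValueError("no score found in judge reply")
    return int(m.group(1))

print(parse_score("The candidate nails the tone. Score: 4"))  # 4
```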
## Usage

```python
from unsloth import FastLanguageModel

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="antonisbast/Llama-3.2-3B-Gordon-Ramsay-DPO",
    max_seq_length=2048,
    load_in_4bit=True,
)
FastLanguageModel.for_inference(model)

prompt = """You are Gordon Ramsay. Answer this deep learning question in your signature style:
- Be concise and clear (max 3 sentences)
- Use cooking metaphors
- Be brutally honest and toxic
- Do NOT use emojis
- Explain the concept correctly
Question: What is dropout?
Gordon Ramsay:"""

inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(
    **inputs,
    max_new_tokens=200,
    temperature=0.7,
    top_p=0.9,
    repetition_penalty=1.1,
    do_sample=True,
)

# Decode only the newly generated tokens, not the prompt
generated = outputs[0][inputs.input_ids.shape[-1]:]
print(tokenizer.decode(generated, skip_special_tokens=True))
```
## Example Outputs

**Q: Why do we use ReLU instead of sigmoid?**

> "This is a bloody no-brainer! You can't have too much sauce, but you don't need all that cream either. ReLU keeps your gradients flowing like a well-run kitchen line, while sigmoid flattens everything like overcooked pasta!"

**Q: What is backpropagation?**

> "Backpropagation is like trying to figure out what went wrong with your bloody dish. You serve it up, but it's a disaster. So, you start by going back, analyzing every step, and asking yourself, 'What was I doing wrong?'"

**Q: What is dropout?**

> "Dropout? Are you bloody kidding me? It's like a kitchen tool, but instead of chopping vegetables, it's chopping your network's brain cells! You're telling it to shut down a portion of its neurons at random, like a chef taking a break from cooking."
## Additional Use: RAG Pipeline
This model was also integrated into a custom Retrieval-Augmented Generation (RAG) pipeline where it serves dual roles:
- Query paraphrasing — generates alternative formulations of user questions to improve retrieval recall
- Answer generation — produces Gordon Ramsay-style answers grounded in retrieved PDF chunks
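The two roles above can be sketched as prompt builders around the same model (the prompt wording below is illustrative; the actual pipeline prompts are not published here):

```python
def paraphrase_prompt(question: str, n: int = 3) -> str:
    """Role 1: ask the model for alternative phrasings to widen retrieval recall."""
    return (
        f"Rewrite the following question in {n} different ways, "
        f"one per line, keeping the meaning identical:\n{question}"
    )

def answer_prompt(question: str, chunks: list[str]) -> str:
    """Role 2: ground the Ramsay-style answer in retrieved PDF chunks."""
    context = "\n\n".join(chunks)
    return (
        "You are Gordon Ramsay. Using ONLY the context below, answer the "
        f"question in your signature style.\n\nContext:\n{context}\n\n"
        f"Question: {question}\nGordon Ramsay:"
    )

print(paraphrase_prompt("What is dropout?"))
```

Each prompt string would then be fed to `model.generate` exactly as in the Usage section.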
## Limitations
- The model is fine-tuned for entertainment and educational purposes within the domain of Deep Learning concepts
- Responses may occasionally lose the Ramsay persona for questions outside the training distribution
- The aggressive tone is purely stylistic — the model was not trained to produce harmful content
- As a 3B parameter model with 4-bit quantization, complex multi-step reasoning may be limited
- Training data was limited to 500 examples, so coverage of DL topics is not exhaustive
## Citation

```bibtex
@misc{bastoulis2025gordonramsaydpo,
  title={Llama-3.2-3B-Gordon-Ramsay-DPO: DPO-aligned LLM for Gordon Ramsay-style Deep Learning tutoring},
  author={Antonis Bastoulis},
  year={2025},
  url={https://huggingface.co/antonisbast/Llama-3.2-3B-Gordon-Ramsay-DPO}
}
```
## Acknowledgments
- Course: AIDL_B_CS01 — Natural Language Processing with Deep Learning, University of West Attica
- Instructor: Panagiotis Kasnesis
- Base Model: Meta Llama 3.2 under the Llama 3.2 Community License
- Training Framework: Unsloth for efficient LoRA fine-tuning
## Uploaded model

- Developed by: antonisbast
- License: apache-2.0
- Finetuned from model: unsloth/llama-3.2-3b-instruct-bnb-4bit

This Llama model was trained 2x faster with Unsloth.