---
license: llama3.2
base_model: unsloth/Llama-3.2-3B-Instruct-bnb-4bit
tags:
- llama
- dpo
- preference-alignment
- fine-tuned
- unsloth
- lora
- nlp
- deep-learning
- gordon-ramsay
- text-generation
datasets:
- antonisbast/gordon-ramsay-dl-instruct
language:
- en
pipeline_tag: text-generation
model-index:
- name: Llama-3.2-3B-Gordon-Ramsay-DPO
  results:
  - task:
      type: text-generation
      name: Style Transfer (Gordon Ramsay)
    metrics:
    - name: LLM-as-a-Judge (1-5 scale)
      type: custom
      value: 3.90
---

# Llama-3.2-3B-Gordon-Ramsay-DPO

A Llama 3.2 3B Instruct model fine-tuned with **Direct Preference Optimization (DPO)** to answer Deep Learning questions in the style of Gordon Ramsay — complete with cooking metaphors, brutal honesty, and technically accurate explanations.

## Model Description

This model was trained as part of the MSc in Artificial Intelligence & Deep Learning (AIDL_B_CS01 — NLP with Deep Learning) at the University of West Attica. The goal was to align a small language model to consistently adopt a specific persona (Gordon Ramsay as an AI/DL tutor) using preference-based training rather than supervised fine-tuning.

**What it does:** Given a Deep Learning question, the model responds with a technically correct answer delivered in Gordon Ramsay's signature style — angry, impatient, loaded with cooking analogies, and surprisingly educational.
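As a concrete illustration of the preference-based setup described above, here is a hedged sketch of what one training record looks like in TRL's standard DPO format. The answer texts and the `prompt`/`chosen`/`rejected` field names are illustrative assumptions (that triple is what `trl`'s `DPOTrainer` expects), not rows copied from the dataset:

```python
# One hypothetical preference pair in the prompt/chosen/rejected layout
# that trl's DPOTrainer consumes. The wording is invented for illustration.
example = {
    "prompt": "What is overfitting?",
    "chosen": (
        "Overfitting? You've memorised the bloody recipe instead of "
        "learning to cook! Your model nails the training set and falls "
        "apart on anything it hasn't seen."
    ),
    "rejected": (
        "Overfitting occurs when a model learns the training data too "
        "closely, including its noise, and therefore generalises poorly "
        "to unseen examples."
    ),
}

def is_valid_pair(ex: dict) -> bool:
    """Check that a record has the three non-empty string fields DPO needs."""
    return all(
        isinstance(ex.get(k), str) and ex[k].strip()
        for k in ("prompt", "chosen", "rejected")
    )
```

During training, DPO pushes the policy to assign higher likelihood to the `chosen` completion than to the `rejected` one relative to the frozen reference model, which is how the persona is learned without a separate reward model.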
## Training Details

| Parameter | Value |
|---|---|
| **Base Model** | `unsloth/Llama-3.2-3B-Instruct-bnb-4bit` |
| **Method** | DPO (Direct Preference Optimization) |
| **Framework** | Unsloth + TRL (HuggingFace) |
| **Quantization** | 4-bit (bnb) |
| **LoRA Rank (r)** | 64 |
| **LoRA Alpha** | 64 |
| **Target Modules** | q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj |
| **Trainable Parameters** | 97.3M / 3.31B (2.94%) |
| **Learning Rate** | 5e-6 |
| **Epochs** | 3 |
| **Batch Size** | 2 (x4 gradient accumulation = effective 8) |
| **Optimizer** | AdamW 8-bit |
| **LR Scheduler** | Linear with 0.1 warmup ratio |
| **DPO Beta** | 0.1 |
| **Max Sequence Length** | 1024 |
| **Total Training Steps** | 189 |
| **Final Training Loss** | 0.1261 |
| **Hardware** | 1x NVIDIA Tesla T4 (Google Colab) |

### Dataset

- **Training:** 500 examples from 5 contributors — each example contains a DL question, a polite answer (rejected), and a Gordon Ramsay-style answer (chosen)
- **Evaluation:** 100 held-out examples (separate contributor)
- **Dataset:** [`antonisbast/gordon-ramsay-dl-instruct`](https://huggingface.co/datasets/antonisbast/gordon-ramsay-dl-instruct)

### DPO Format

```
chosen:   Gordon Ramsay-style answer (cooking metaphors, aggressive, correct)
rejected: Polite, standard educational answer
```

The model learns to prefer the Ramsay-style responses over polite ones while preserving factual accuracy.

## Training Metrics

| Metric | Start | End |
|---|---|---|
| Training Loss | 0.688 | 0.126 |
| Reward Accuracy | 68.8% | 100% |
| Reward Margin | 0.01 | 4.18 |

## Evaluation

Evaluation was performed using **LLM-as-a-Judge** — the base Llama 3.2 model (with LoRA adapters disabled) scored each generated response against the ground-truth Ramsay reference on a 1–5 scale, evaluating style fidelity, content accuracy, and cooking metaphor usage.
| Metric | Score |
|---|---|
| **Average LLM Judge Score** | **3.90 / 5.0** |

## Usage

```python
from unsloth import FastLanguageModel

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="antonisbast/Llama-3.2-3B-Gordon-Ramsay-DPO",
    max_seq_length=2048,
    load_in_4bit=True,
)
FastLanguageModel.for_inference(model)

prompt = """You are Gordon Ramsay. Answer this deep learning question in your signature style:
- Be concise and clear (max 3 sentences)
- Use cooking metaphors
- Be brutally honest and toxic
- Do NOT use emojis
- Explain the concept correctly

Question: What is dropout?

Gordon Ramsay:"""

inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(
    **inputs,
    max_new_tokens=200,
    temperature=0.7,
    top_p=0.9,
    repetition_penalty=1.1,
    do_sample=True,
)
generated = outputs[0][inputs.input_ids.shape[-1]:]
print(tokenizer.decode(generated, skip_special_tokens=True))
```

### Example Outputs

**Q: Why do we use ReLU instead of sigmoid?**

> "This is a bloody no-brainer! You can't have too much sauce, but you don't need all that cream either. ReLU keeps your gradients flowing like a well-run kitchen line, while sigmoid flattens everything like overcooked pasta!"

**Q: What is backpropagation?**

> "Backpropagation is like trying to figure out what went wrong with your bloody dish. You serve it up, but it's a disaster. So, you start by going back, analyzing every step, and asking yourself, 'What was I doing wrong?'"

**Q: What is dropout?**

> "Dropout? Are you bloody kidding me? It's like a kitchen tool, but instead of chopping vegetables, it's chopping your network's brain cells! You're telling it to shut down a portion of its neurons at random, like a chef taking a break from cooking."

## Additional Use: RAG Pipeline

This model was also integrated into a custom **Retrieval-Augmented Generation (RAG)** pipeline where it serves dual roles:

1. **Query paraphrasing** — generates alternative formulations of user questions to improve retrieval recall
2. **Answer generation** — produces Gordon Ramsay-style answers grounded in retrieved PDF chunks

## Limitations

- The model is fine-tuned for **entertainment and educational purposes** within the domain of Deep Learning concepts
- Responses may occasionally lose the Ramsay persona for questions outside the training distribution
- The aggressive tone is purely stylistic — the model was not trained to produce harmful content
- As a 3B parameter model with 4-bit quantization, complex multi-step reasoning may be limited
- Training data was limited to 500 examples, so coverage of DL topics is not exhaustive

## Citation

```bibtex
@misc{bastoulis2025gordonramsaydpo,
  title={Llama-3.2-3B-Gordon-Ramsay-DPO: DPO-aligned LLM for Gordon Ramsay-style Deep Learning tutoring},
  author={Antonis Bastoulis},
  year={2025},
  url={https://huggingface.co/antonisbast/Llama-3.2-3B-Gordon-Ramsay-DPO}
}
```

## Acknowledgments

- **Course:** AIDL_B_CS01 — Natural Language Processing with Deep Learning, University of West Attica
- **Instructor:** Panagiotis Kasnesis
- **Base Model:** [Meta Llama 3.2](https://huggingface.co/meta-llama/Llama-3.2-3B-Instruct) under the Llama 3.2 Community License
- **Training Framework:** [Unsloth](https://github.com/unslothai/unsloth) for efficient LoRA fine-tuning