EuroLLM-22B-Instruct-GGUF (Jugaad Optimized)
This repository contains GGUF format quantizations of utter-project/EuroLLM-22B-Instruct.
Why this release?
Unlike standard automated quantizations, this release was specifically optimized by Jugaad to balance professional performance with consumer hardware constraints.
We focused on enabling the deployment of this powerful 22B parameter model on single 24GB VRAM GPUs (NVIDIA RTX 3090, RTX 4090, L4) while preserving its capability in critical tasks like PII/PHI Extraction (NER) across European languages.
Key Differentiators
- Custom Calibration: Instead of random data, we used a multilingual professional dataset (Medical, Legal, Finance, GDPR) for the Importance Matrix (imatrix) calculation.
- Verified Performance: We didn't just quantize; we benchmarked. Our Q4_K_M quantization achieves an F1 Score of ~0.89 on multilingual NER tasks, outperforming even larger models.
- Hardware-Ready: We provide specific memory usage guidance for each quantization to help you avoid OOM errors in production.
Provided Quantizations
| Filename | Type | Size | Use Case |
|---|---|---|---|
| eurollm-22b-Q4_K_M.gguf | Q4_K_M | 13.0 GB | RECOMMENDED. Best F1/VRAM balance for 24GB cards. |
| eurollm-22b-Q5_K_M.gguf | Q5_K_M | 15.0 GB | Higher precision if you have >24GB VRAM. |
| eurollm-22b-Q6_K.gguf | Q6_K | 18.0 GB | Near-fp16 performance. Tight fit on 24GB (short context only). |
| eurollm-22b-Q8_0.gguf | Q8_0 | 23.0 GB | Maximum fidelity. Not recommended for 24GB cards (high OOM risk). |
| eurollm-22b-IQ4_NL.gguf | IQ4_NL | 13.0 GB | Alternative non-linear quantization. |
| eurollm-22b-IQ4_XS.gguf | IQ4_XS | 12.0 GB | Smaller footprint if VRAM is very tight. |
| eurollm-22b-IQ3_M.gguf | IQ3_M | 9.8 GB | Low VRAM usage (<12GB). |
| eurollm-22b-IQ2_M.gguf | IQ2_M | 7.5 GB | Extreme compression. |
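If you only need a single file, you can fetch it directly instead of cloning the whole repository. A minimal sketch using huggingface_hub, with the repo id and filename as listed on this page:

```python
# Minimal sketch: download one quantization file from the Hub.
# Repo id and filename are taken from this card; adjust to the quant you want.
from huggingface_hub import hf_hub_download

model_path = hf_hub_download(
    repo_id="jugaadsrl/EuroLLM-22B-Instruct-GGUF",
    filename="eurollm-22b-Q4_K_M.gguf",
)
print(model_path)  # local path to the downloaded GGUF file
```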
Benchmark Results (Multilingual NER)
We tested these models on a tough PII/PHI extraction task across 5 languages (IT, EN, FR, DE, ES).
| Model | Average F1 Score | Notes |
|---|---|---|
| Q4_K_M | 0.890 | Highest score across all tested quantizations |
| IQ4_XS | 0.886 | Excellent efficiency |
| Q8_0 | 0.883 | Surprisingly, slightly lower on this specific task |
| IQ4_NL | 0.881 | Solid performer |
Detailed results can be found in the benchmark_ner_results.md file.
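For reference, the F1 numbers above are entity-level scores. A minimal sketch of the metric, assuming exact-match (text, label) pairs per document; the actual benchmark harness may differ in its matching rules:

```python
# Sketch of micro-averaged, entity-level F1 over a set of documents.
# Each document's entities are represented as a set of (entity_text, label) tuples.
def entity_f1(predicted: list[set], gold: list[set]) -> float:
    tp = fp = fn = 0
    for pred, ref in zip(predicted, gold):
        tp += len(pred & ref)   # entities found and correct
        fp += len(pred - ref)   # spurious entities
        fn += len(ref - pred)   # missed entities
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0

# Example: the name is found, the date of birth is missed -> F1 = 0.667
pred = [{("Mario Rossi", "PERSON")}]
gold = [{("Mario Rossi", "PERSON"), ("12/03/1985", "DATE_OF_BIRTH")}]
print(entity_f1(pred, gold))
```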
Technical Details
- Base Model: utter-project/EuroLLM-22B-2512
- Quantization Tool: llama.cpp (build 4358)
- Calibration Data: Custom mix of Wikipedia (General) + Domain Specific (Medical/Legal/Finance) articles.
- Languages Covered: Italian, English, French, German, Spanish, Portuguese, Dutch, Polish.
Please contact us to receive the file used to calculate the optimization imatrix.
Usage
CLI:

```bash
./llama-cli -m eurollm-22b-Q4_K_M.gguf -p "Extract the entities from this text..." -n 512 -c 4096
```
Python:

```python
from llama_cpp import Llama

llm = Llama(
    model_path="./eurollm-22b-Q4_K_M.gguf",
    n_gpu_layers=-1,  # Offload to GPU
    n_ctx=8192,       # 13GB model leaves plenty of room for context on a 24GB card
)

res = llm.create_chat_completion(
    messages=[{"role": "user", "content": "What is the capital of Italy?"}]
)
print(res)
```
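Since the card's focus is PII/PHI extraction, here is an illustrative follow-up call reusing the `llm` instance above. The entity labels, output format, and prompt wording are examples only, not the exact setup used in the benchmark:

```python
# Illustrative PII extraction prompt; entity labels and output format are examples only.
res = llm.create_chat_completion(
    messages=[{
        "role": "user",
        "content": (
            "Extract all PII entities (PERSON, DATE_OF_BIRTH, ADDRESS) from the text below "
            "and return them as a JSON list of {\"text\", \"label\"} objects.\n\n"
            "Il paziente Mario Rossi, nato il 12/03/1985, risiede in Via Roma 1, Milano."
        ),
    }],
    temperature=0.0,  # deterministic decoding is usually preferable for extraction tasks
)
print(res["choices"][0]["message"]["content"])
```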