Mistral-Small-3.2-24B-Instruct-2506 — GGUF Quantizations

This repository contains GGUF quantizations of Mistral-Small-3.2-24B-Instruct-2506, a 24B parameter multimodal model with Pixtral-style vision capabilities. These quantizations are optimized for AMD RDNA 3 (gfx1100) GPUs (RX 7900 XTX, RX 7900 XT, RX 7900 GRE) using llama.cpp with ROCm/Vulkan backends.

Three K-quant variants are provided (Q4_L, Q4_M, and Q4_S), offering a quality-size tradeoff. All were quantized from the BF16 GGUF baseline on an NVIDIA RTX PRO 6000 (Blackwell) and benchmarked on 2× AMD RX 7900 XTX.

Model Details

| Property | Value |
|---|---|
| Base Model | mistralai/Mistral-Small-3.2-24B-Instruct-2506 |
| Quantization Format | GGUF (K-quants via llama.cpp) |
| Architecture | Mistral3ForConditionalGeneration |
| LM Layers | 40 MistralDecoder layers |
| Hidden Size | 5120 |
| Intermediate Size | 32768 |
| Attention Heads | 32 (query), 8 (key/value, GQA) |
| Head Dimension | 128 |
| Vocabulary Size | 131,072 (Tekken tokenizer: 150,000 regular + 1,000 special, 131,072 used) |
| Context Window | 131,072 tokens |
| Vision Encoder | Pixtral (24 layers, hidden_size=1024, patch_size=14) |
| Vision Projector | patch_merge (spatial_merge_size=2) |
| Quantized Components | Text decoder weights |
| Preserved in F16/Q8_0 | Vision tower (separate mmproj files) |

Quantization Variants

| File | Quant | Size | Description |
|---|---|---|---|
| mistral-small-3.2-24b-instruct-Q4_L_AMD.gguf | Q4_0_L | ~16.3 GB | Best quality: largest K-quant groups, closest to F16 |
| mistral-small-3.2-24b-instruct-Q4_M_AMD.gguf | Q4_0_M | ~15.1 GB | Balanced: good quality at smaller size |
| mistral-small-3.2-24b-instruct-Q4_S_AMD.gguf | Q4_0_S | ~14.2 GB | Smallest: fastest inference, most compression |
| mmproj-F16.gguf | F16 | ~847 MB | Vision projector (full precision) |
| mmproj-Q8_0.gguf | Q8_0 | ~459 MB | Vision projector (8-bit, recommended) |

Quality Benchmarks

All benchmarks use wikitext-2-raw-v1 (test split) — the standard dataset for quantization quality evaluation, matching the methodology of llama.cpp, AutoAWQ, GPTQ, and the academic literature.

WikiText-2 Perplexity (ctx=512)

Measured via llama-perplexity b8984, 588 chunks.

| Model | PPL | ΔPPL vs F16 |
|---|---|---|
| F16 (baseline) | 5.4894 | |
| Q4_L | 5.5377 | +0.88% |
| Q4_M | 5.4417 | -0.87%* |
| Q4_S | 5.5035 | +0.26% |

* Q4_M scoring a lower PPL than F16 is a known artifact of quantized models on wikitext-2 (token distribution shift), not evidence that Q4_M is better than the baseline. KLD is the more reliable quality metric.
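
The ΔPPL column can be reproduced from the PPL values in the table above. The following is a minimal sketch, assuming ΔPPL is the relative change against the F16 baseline; the function name `delta_ppl_percent` is illustrative and not part of llama.cpp.

```python
def delta_ppl_percent(ppl_quant: float, ppl_base: float) -> float:
    """Relative perplexity change of a quantized model vs. the baseline, in percent."""
    return 100.0 * (ppl_quant - ppl_base) / ppl_base


F16_PPL = 5.4894  # baseline PPL from the table above

for name, ppl in [("Q4_L", 5.5377), ("Q4_M", 5.4417), ("Q4_S", 5.5035)]:
    # Prints +0.88%, -0.87% and +0.26% respectively
    print(f"{name}: {delta_ppl_percent(ppl, F16_PPL):+.2f}%")
```

All three differences are under 1%, which is why the KLD tables are the better basis for ranking the variants.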

KL Divergence vs F16 (Static / Prefill)

KL divergence measures how much the output probability distribution has shifted from the F16 baseline. Lower is better; 0 = identical to F16.

Methodology: wikitext-2-raw-v1, ctx=512, full-vocab KLD computed on the second half of each 512-token chunk (positions [256–511]), ensuring every scored token has ≥256 tokens of left context. KLD direction: KL(P_base ‖ P_quant) — "how well does the quantized model approximate the base?"
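
The scoring rule described above can be sketched in a few lines. This is an illustration under stated assumptions, not the code used to produce the numbers: it takes per-position logits as plain Python lists, reports KLD in nats (natural log), and the names `log_softmax`, `kl_divergence`, and `chunk_kld` are hypothetical.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over one vocabulary-sized logit vector."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def kl_divergence(base_logits, quant_logits):
    """KL(P_base || P_quant) over the full vocabulary for a single position."""
    log_p = log_softmax(base_logits)
    log_q = log_softmax(quant_logits)
    return sum(math.exp(lp) * (lp - lq) for lp, lq in zip(log_p, log_q))


def chunk_kld(base_chunk, quant_chunk, first_scored=256):
    """Per-position KLD for the scored half of one chunk (positions 256..511 by default)."""
    return [
        kl_divergence(b, q)
        for b, q in zip(base_chunk[first_scored:], quant_chunk[first_scored:])
    ]
```

The mean, median, and percentile rows in the table below would then be statistics over the concatenated `chunk_kld` outputs of all chunks.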

| Metric | Q4_L | Q4_M | Q4_S |
|---|---|---|---|
| Mean KLD | 0.00968 | 0.01273 | 0.02225 |
| Median KLD | 0.00523 | 0.00495 | 0.00959 |
| 95th %ile KLD | 0.02951 | 0.04391 | 0.07171 |
| 99th %ile KLD | 0.08416 | 0.13901 | 0.24665 |
| Max KLD | 1.76017 | 2.60900 | 3.41354 |

Token Probability Divergence (Δp)

| Metric | Q4_L | Q4_M | Q4_S |
|---|---|---|---|
| RMS Δp | 3.218% | 3.492% | 4.765% |
| 95th %ile Δp | 4.473% | 3.926% | 6.111% |
| 99th %ile Δp | 8.825% | 9.113% | 12.490% |
| Same top-p | 94.94% | 95.01% | 93.39% |

Same top-p is the percentage of token positions where the quantized and F16 models agree on the most likely next token (top-1 agreement).
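
As a rough illustration of how these two metrics relate to the model outputs, the sketch below computes an RMS Δp and a top-1 agreement rate from per-position probability vectors. It assumes Δp is the change in probability assigned to the reference (ground-truth) token, which the text above does not state explicitly; `compare_positions` and its argument names are hypothetical.

```python
import math


def compare_positions(base_probs, quant_probs, targets):
    """Summarize divergence between two models over a list of token positions.

    base_probs / quant_probs: one probability vector per position.
    targets: index of the reference token at each position.
    """
    deltas = []
    same_top = 0
    for p, q, t in zip(base_probs, quant_probs, targets):
        # Assumed definition: change in probability of the reference token
        deltas.append(q[t] - p[t])
        top_base = max(range(len(p)), key=p.__getitem__)
        top_quant = max(range(len(q)), key=q.__getitem__)
        if top_base == top_quant:
            same_top += 1
    n = len(deltas)
    rms = math.sqrt(sum(d * d for d in deltas) / n)
    return {"rms_delta_p_pct": 100.0 * rms, "same_top_pct": 100.0 * same_top / n}
```

RMS is used rather than the mean because positive and negative shifts would otherwise cancel out.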

Quality Ranking

| Rank | Model | Mean KLD | Interpretation |
|---|---|---|---|
| 1 | Q4_L | 0.00968 | Best, closest to F16 |
| 2 | Q4_M | 0.01273 | ~32% more divergence than Q4_L |
| 3 | Q4_S | 0.02225 | ~2.3× the divergence of Q4_L |

Throughput Benchmarks (2× AMD RX 7900 XTX, gfx1100)

Benchmarks were run on 2× RX 7900 XTX with ROCm 6.4.4, using llama.cpp llama-server with flash attention enabled.

Launch Configuration

llama-server -m <model>.gguf -c 131072 -ngl 99 -fa on \
  --host 0.0.0.0 --cache-type-k f16 --cache-type-v f16 \
  --tensor-split 1,1 --no-mmap -t 23 -ub 256 -b 256 --parallel <N>

Single-Request Throughput (parallel=1)

| Model | Context | Aggregate t/s | Avg Latency | Min/Max Latency |
|---|---|---|---|---|
| Q4_S | 131,072 | 43.3 tok/s | 7.19 s | 0.57 / 17.48 s |
| Q4_M | 131,072 | 38.7 tok/s | 14.96 s | 0.56 / 26.08 s |
| Q4_L | 131,072 | 29.8 tok/s | 18.39 s | 0.82 / 36.21 s |

Multi-Request Throughput (parallel=8)

| Model | Context | Aggregate t/s | Avg Latency | Min/Max Latency |
|---|---|---|---|---|
| Q4_S | 196,608 | 165.6 tok/s | 26.82 s | 0.43 / 50.20 s |
| Q4_M | 196,608 | 152.3 tok/s | 24.91 s | 0.41 / 53.21 s |
| Q4_L | 180,000 | 152.9 tok/s | 28.16 s | 0.50 / 55.70 s |

Detailed Per-Model Results

Q4_S — Fastest

Single request (parallel=1, ctx=131072):

| Metric | Value |
|---|---|
| Aggregate throughput | 43.3 tok/s |
| Total tokens | 6,234 (20 requests × up to 1,024) |
| Per-request t/s min/avg/max | 4.4 / 32.7 / 45.8 |
| Latency min/avg/max | 0.57 s / 7.19 s / 17.48 s |
| Success | 20/20 (100%) |

8 concurrent (parallel=8, ctx=196608):

| Metric | Value |
|---|---|
| Aggregate throughput | 165.6 tok/s |
| Total tokens | 90,478 (160 requests × up to 1,024) |
| Per-request t/s min/avg/max | 2.0 / 18.1 / 24.2 |
| Latency min/avg/max | 0.43 s / 26.82 s / 50.20 s |
| Success | 160/160 (100%) |

Q4_M — Balanced

Single request (parallel=1, ctx=131072):

| Metric | Value |
|---|---|
| Aggregate throughput | 38.7 tok/s |
| Total tokens | 11,569 (20 requests × up to 1,024) |
| Per-request t/s min/avg/max | 2.9 / 31.2 / 39.5 |
| Latency min/avg/max | 0.56 s / 14.96 s / 26.08 s |
| Success | 20/20 (100%) |

8 concurrent (parallel=8, ctx=196608):

| Metric | Value |
|---|---|
| Aggregate throughput | 152.3 tok/s |
| Total tokens | 78,210 (160 requests × up to 1,024) |
| Per-request t/s min/avg/max | 0.3 / 15.5 / 29.2 |
| Latency min/avg/max | 0.41 s / 24.91 s / 53.21 s |
| Success | 160/160 (100%) |

Q4_L — Best Quality

Single request (parallel=1, ctx=131072):

| Metric | Value |
|---|---|
| Aggregate throughput | 29.8 tok/s |
| Total tokens | 10,979 (20 requests × up to 1,024) |
| Per-request t/s min/avg/max | 1.6 / 25.7 / 37.8 |
| Latency min/avg/max | 0.82 s / 18.39 s / 36.21 s |
| Success | 20/20 (100%) |

8 concurrent (parallel=8, ctx=180000):

| Metric | Value |
|---|---|
| Aggregate throughput | 152.9 tok/s |
| Total tokens | 88,676 (160 requests × up to 1,024) |
| Per-request t/s min/avg/max | 0.4 / 17.2 / 22.1 |
| Latency min/avg/max | 0.50 s / 28.16 s / 55.70 s |
| Success | 160/160 (100%) |
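
The aggregate figures in the per-model tables above appear consistent with total generated tokens divided by wall-clock time. The sketch below checks this under an assumption the benchmark description does not confirm: that every server slot stayed busy for the whole run, so wall-clock time is roughly `n_requests * avg_latency / parallel`. The helper name `aggregate_tps` is hypothetical.

```python
def aggregate_tps(total_tokens, n_requests, avg_latency_s, parallel=1):
    """Estimate aggregate tokens/s, assuming all slots are busy for the whole run."""
    wall_clock_s = n_requests * avg_latency_s / parallel
    return total_tokens / wall_clock_s


# parallel=1 rows: (total tokens, avg latency in s, reported aggregate tok/s)
single = {
    "Q4_S": (6234, 7.19, 43.3),
    "Q4_M": (11569, 14.96, 38.7),
    "Q4_L": (10979, 18.39, 29.8),
}
for name, (tokens, latency, reported) in single.items():
    est = aggregate_tps(tokens, 20, latency)
    print(f"{name}: estimated {est:.1f} tok/s, reported {reported} tok/s")
```

For the parallel=8 runs the same estimate lands a few percent above the reported values (for example about 168.7 vs 165.6 tok/s for Q4_S), which is expected if slots are occasionally idle. Note also that the parallel=1 runs generated different token totals per model, so their average latencies are not directly comparable.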

Hardware Requirements

AMD GPUs (ROCm / Vulkan)

| GPU VRAM | Recommended Variant | Context |
|---|---|---|
| 24 GB (RX 7900 XTX) | Q4_S | Up to 131,072 with f16 KV cache, 2× GPU |
| 24 GB (RX 7900 XTX) | Q4_M | Up to 131,072 with f16 KV cache, 2× GPU |
| 24 GB (RX 7900 XTX) | Q4_L | Up to 131,072 with f16 KV cache, 2× GPU |

For dual-GPU setups (2× RX 7900 XTX), use --tensor-split 1,1 and -ngl 99 to distribute layers across both GPUs.
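
A back-of-the-envelope KV cache estimate shows why the full 131,072-token context needs two 24 GB cards. This sketch uses the layer count, KV head count, and head dimension from the Model Details section and assumes a plain f16 cache with one K and one V tensor per layer; it ignores compute buffers and the vision projector, so real usage will be higher. The function name `kv_cache_bytes` is illustrative.

```python
def kv_cache_bytes(n_ctx, n_layers=40, n_kv_heads=8, head_dim=128, bytes_per_elem=2):
    """Approximate KV cache size: K and V tensors for every layer and token."""
    per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem  # 163,840 bytes
    return per_token * n_ctx


gib = kv_cache_bytes(131072) / 2**30
print(f"f16 KV cache at 131,072 tokens: {gib:.1f} GiB")  # 20.0 GiB
```

Adding roughly 20 GiB of cache to 14 to 16 GB of weights exceeds a single 24 GB card, so the layers and cache have to be split across both GPUs or the context reduced.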

NVIDIA GPUs (CUDA)

These GGUF models also work on NVIDIA GPUs via llama.cpp CUDA backend. For NVIDIA deployment, consider the AutoRound W4A16 or NVFP4A16 quantizations instead — they offer better throughput on CUDA via vLLM.

Usage with llama.cpp

Server Mode (Recommended)

# AMD (ROCm):
./llama-server -m mistral-small-3.2-24b-instruct-Q4_M_AMD.gguf \
  -c 131072 -ngl 99 -fa on \
  --port 8000 --host 0.0.0.0 \
  --cache-type-k f16 --cache-type-v f16 \
  --tensor-split 1,1 --no-mmap -t 23 \
  -ub 256 -b 256 --parallel 4 \
  --mmproj mmproj-Q8_0.gguf

# NVIDIA (CUDA):
./llama-server -m mistral-small-3.2-24b-instruct-Q4_M_AMD.gguf \
  -c 131072 -ngl 99 -fa on \
  --port 8000 --host 0.0.0.0 \
  -ub 256 -b 256 --parallel 4 \
  --mmproj mmproj-Q8_0.gguf

Inference Test

curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model":"mistral-small-3.2-24b-instruct-Q4_M_AMD","messages":[{"role":"user","content":"What is 2+2? One word."}],"max_tokens":10}'

Vision Test

curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model":"mistral-small-3.2-24b-instruct-Q4_M_AMD",
    "messages":[
      {"role":"user","content":[{"type":"image_url","image_url":{"url":"https://upload.wikimedia.org/wikipedia/commons/thumb/4/47/PNG_transparency_demonstration_1.png/280px-PNG_transparency_demonstration_1.png"}},{"type":"text","text":"Describe this image in one sentence."}]}
    ],
    "max_tokens":100
  }'
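
The same vision request can be sent from Python using only the standard library. This is a minimal sketch, assuming the server from the Server Mode section is listening on localhost:8000 and returns an OpenAI-style response body; the helper names `build_vision_request` and `post_chat` are hypothetical.

```python
import json
import urllib.request


def build_vision_request(model, image_url, prompt, max_tokens=100):
    """Build a chat-completions payload with one image and one text part."""
    return {
        "model": model,
        "messages": [
            {
                "role": "user",
                "content": [
                    {"type": "image_url", "image_url": {"url": image_url}},
                    {"type": "text", "text": prompt},
                ],
            }
        ],
        "max_tokens": max_tokens,
    }


def post_chat(base_url, payload, timeout=120):
    """POST the payload to /v1/chat/completions and return the reply text."""
    req = urllib.request.Request(
        base_url.rstrip("/") + "/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        body = json.load(resp)
    # Assumes the OpenAI-compatible response layout
    return body["choices"][0]["message"]["content"]
```

With the server running, `post_chat("http://localhost:8000", build_vision_request("mistral-small-3.2-24b-instruct-Q4_M_AMD", image_url, "Describe this image in one sentence."))` should return the model's description as a string.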

Notes

Vision: Image Size Limit

The original Mistral model has max_image_size set to 1540. Images with dimensions exceeding the limit are proportionally downscaled before vision encoding. The mmproj files in this repository match the original specification.
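
To illustrate what proportional downscaling means for a given input, the sketch below computes the size an image would be reduced to so that its longer side does not exceed 1540 pixels. The rounding (integer floor) and the function name `fit_within` are assumptions; the actual preprocessor may additionally align dimensions to the patch grid.

```python
def fit_within(width, height, max_side=1540):
    """Scale (width, height) down proportionally so the longer side is <= max_side."""
    longest = max(width, height)
    if longest <= max_side:
        return width, height  # already within the limit, left unchanged
    # Integer arithmetic avoids floating-point rounding surprises
    new_w = max(1, width * max_side // longest)
    new_h = max(1, height * max_side // longest)
    return new_w, new_h


print(fit_within(3080, 1540))  # (1540, 770)
print(fit_within(800, 600))    # (800, 600)
```

Images already within the limit pass through unchanged, so only large inputs lose resolution before vision encoding.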

Files in This Repository

| File | Size | Description |
|---|---|---|
| mistral-small-3.2-24b-instruct-Q4_L_AMD.gguf | ~16.3 GB | Q4_L quantized text model, best quality |
| mistral-small-3.2-24b-instruct-Q4_M_AMD.gguf | ~15.1 GB | Q4_M quantized text model, balanced |
| mistral-small-3.2-24b-instruct-Q4_S_AMD.gguf | ~14.2 GB | Q4_S quantized text model, smallest/fastest |
| mmproj-F16.gguf | ~847 MB | Vision projector (F16, full precision) |
| mmproj-Q8_0.gguf | ~459 MB | Vision projector (Q8_0, recommended) |

License

This quantization is released under the Apache 2.0 License, following the base model's license.

The base model mistralai/Mistral-Small-3.2-24B-Instruct-2506 is licensed under Apache 2.0.

Citation

If you use this model in your research, please cite:

@misc{mistral-small-3.2-24b-gguf-gfx1100,
  title = {Mistral-Small-3.2-24B-Instruct-2506 GGUF Quantizations},
  author = {Gratex International},
  year = {2026},
  howpublished = {\url{https://huggingface.co/gratex/Mistral-Small-3.2-24B-Instruct-2506-GGUF}},
  note = {Quantized with llama.cpp, benchmarked on AMD RX 7900 XTX}
}

Acknowledgments

This quantization was produced using hardware provided by Gratex International, a.s.


Original Model: mistralai/Mistral-Small-3.2-24B-Instruct-2506
Quantization Tool: llama.cpp
Inference Engine: llama.cpp (ROCm / CUDA / Vulkan)
