Mistral-Small-3.2-24B-Instruct-2506 — GGUF Quantizations
This repository contains GGUF quantizations of Mistral-Small-3.2-24B-Instruct-2506, a 24B parameter multimodal model with Pixtral-style vision capabilities. These quantizations are optimized for AMD RDNA 3 (gfx1100) GPUs (RX 7900 XTX, RX 7900 XT, RX 7900 GRE) using llama.cpp with ROCm/Vulkan backends.
Three K-quant variants are provided (Q4_L, Q4_M, and Q4_S), offering a quality-size tradeoff. All were quantized from the BF16 GGUF baseline on an NVIDIA RTX PRO 6000 (Blackwell) and benchmarked on 2× AMD RX 7900 XTX.
Model Details
| Property | Value |
|---|---|
| Base Model | mistralai/Mistral-Small-3.2-24B-Instruct-2506 |
| Quantization Format | GGUF (K-quants via llama.cpp) |
| Architecture | Mistral3ForConditionalGeneration |
| LM Layers | 40 MistralDecoder layers |
| Hidden Size | 5120 |
| Intermediate Size | 32768 |
| Attention Heads | 32 (query), 8 (key/value, GQA) |
| Head Dimension | 128 |
| Vocabulary Size | 131,072 (Tekken tokenizer: 150,000 regular + 1,000 special, 131,072 used) |
| Context Window | 131,072 tokens |
| Vision Encoder | Pixtral (24 layers, hidden_size=1024, patch_size=14) |
| Vision Projector | patch_merge (spatial_merge_size=2) |
| Quantized Components | Text decoder weights |
| Preserved in F16/Q8_0 | Vision tower (separate mmproj files) |
Quantization Variants
| File | Quant | Size | Description |
|---|---|---|---|
| mistral-small-3.2-24b-instruct-Q4_L_AMD.gguf | Q4_0_L | ~16.3 GB | Best quality: largest K-quant groups, closest to F16 |
| mistral-small-3.2-24b-instruct-Q4_M_AMD.gguf | Q4_0_M | ~15.1 GB | Balanced: good quality at smaller size |
| mistral-small-3.2-24b-instruct-Q4_S_AMD.gguf | Q4_0_S | ~14.2 GB | Smallest: fastest inference, most compression |
| mmproj-F16.gguf | F16 | ~847 MB | Vision projector (full precision) |
| mmproj-Q8_0.gguf | Q8_0 | ~459 MB | Vision projector (8-bit, recommended) |
Quality Benchmarks
All benchmarks use the test split of wikitext-2-raw-v1, a dataset commonly used for quantization quality evaluation in llama.cpp, AutoAWQ, GPTQ, and the academic literature.
WikiText-2 Perplexity (ctx=512)
Measured with llama-perplexity (build b8984) over 588 chunks.
| Model | PPL | ΔPPL vs F16 |
|---|---|---|
| F16 (baseline) | 5.4894 | — |
| Q4_L | 5.5377 | +0.88% |
| Q4_M | 5.4417 | -0.87%* |
| Q4_S | 5.5035 | +0.26% |
* Q4_M scoring a lower PPL than F16 is a known artifact of quantized models on wikitext-2 (token distribution shift), not evidence that it outperforms the baseline. KLD is the more reliable quality metric.
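The ΔPPL column can be reproduced from the PPL column with a few lines of Python. This is only a sketch of the arithmetic: the function name `delta_ppl_percent` is illustrative and not part of any tool used for the benchmarks.

```python
# Perplexity values copied from the table above.
PPL = {"F16": 5.4894, "Q4_L": 5.5377, "Q4_M": 5.4417, "Q4_S": 5.5035}


def delta_ppl_percent(ppl_quant: float, ppl_base: float) -> float:
    """Relative perplexity change versus the baseline, in percent."""
    return (ppl_quant - ppl_base) / ppl_base * 100.0


for name in ("Q4_L", "Q4_M", "Q4_S"):
    print(f"{name}: {delta_ppl_percent(PPL[name], PPL['F16']):+.2f}%")
```

Running it prints +0.88%, -0.87%, and +0.26%, matching the table.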
KL Divergence vs F16 (Static / Prefill)
KL divergence measures how much the output probability distribution has shifted from the F16 baseline. Lower is better; 0 = identical to F16.
Methodology: wikitext-2-raw-v1, ctx=512, full-vocab KLD computed on the second half of each 512-token chunk (positions [256–511]), ensuring every scored token has ≥256 tokens of left context. KLD direction: KL(P_base ‖ P_quant) — "how well does the quantized model approximate the base?"
| Metric | Q4_L | Q4_M | Q4_S |
|---|---|---|---|
| Mean KLD | 0.00968 | 0.01273 | 0.02225 |
| Median KLD | 0.00523 | 0.00495 | 0.00959 |
| 95th %ile KLD | 0.02951 | 0.04391 | 0.07171 |
| 99th %ile KLD | 0.08416 | 0.13901 | 0.24665 |
| Max KLD | 1.76017 | 2.60900 | 3.41354 |
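The KLD methodology described above can be sketched in pure Python. This is a minimal illustration, not the script used to produce the table: it assumes per-position logits over the full vocabulary are already available for both models, and the helper names (`log_softmax`, `kl_divergence`, `chunk_kld`) are hypothetical.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over one position's logits."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def kl_divergence(base_logits, quant_logits):
    """KL(P_base || P_quant) over the full vocabulary for one position."""
    log_p = log_softmax(base_logits)
    log_q = log_softmax(quant_logits)
    return sum(math.exp(lp) * (lp - lq) for lp, lq in zip(log_p, log_q))


def chunk_kld(base_chunk, quant_chunk, ctx=512):
    """Score only the second half of a chunk: positions [ctx/2, ctx)."""
    start = ctx // 2
    return [
        kl_divergence(b, q)
        for b, q in zip(base_chunk[start:ctx], quant_chunk[start:ctx])
    ]


# Toy example: 4-token chunk, 3-token vocabulary.
base = [[0.0, 1.0, 2.0]] * 4
quant = [[0.1, 0.9, 2.0]] * 4
scores = chunk_kld(base, quant, ctx=4)
print(len(scores), sum(scores) / len(scores))
```

The mean, median, and percentile rows would then be statistics over the per-position scores collected from all chunks.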
Token Probability Divergence (Δp)
| Metric | Q4_L | Q4_M | Q4_S |
|---|---|---|---|
| RMS Δp | 3.218% | 3.492% | 4.765% |
| 95th %ile Δp | 4.473% | 3.926% | 6.111% |
| 99th %ile Δp | 8.825% | 9.113% | 12.490% |
| Same top-p | 94.94% | 95.01% | 93.39% |
Same top-p = percentage of tokens where quantized and F16 models agree on the most likely next token.
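A rough sketch of how these two metrics can be computed is shown below. It assumes Δp is the change in probability assigned to the reference (actual next) token, which is how llama.cpp's perplexity tool describes it; the source does not spell this out. The function names are illustrative.

```python
import math


def softmax(logits):
    """Numerically stable softmax over one position's logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def argmax(values):
    """Index of the largest value."""
    return max(range(len(values)), key=values.__getitem__)


def divergence_stats(base_logits, quant_logits, targets):
    """RMS change in reference-token probability and top-1 agreement rate."""
    deltas, same_top = [], 0
    for b, q, t in zip(base_logits, quant_logits, targets):
        p_base, p_quant = softmax(b), softmax(q)
        deltas.append(p_quant[t] - p_base[t])  # assumed definition of Δp
        same_top += int(argmax(p_base) == argmax(p_quant))
    n = len(deltas)
    rms = math.sqrt(sum(d * d for d in deltas) / n)
    return {"rms_delta_p": rms, "same_top": same_top / n}


# Toy example: two positions, two-token vocabulary.
stats = divergence_stats(
    base_logits=[[2.0, 0.0], [0.0, 2.0]],
    quant_logits=[[2.0, 0.0], [2.0, 0.0]],
    targets=[0, 1],
)
print(stats)
```

In the toy example the models agree on the top token at one of two positions, so `same_top` is 0.5.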
Quality Ranking
| Rank | Model | Mean KLD | Interpretation |
|---|---|---|---|
| 1 | Q4_L | 0.00968 | Best — closest to F16 |
| 2 | Q4_M | 0.01273 | ~32% more divergence than Q4_L |
| 3 | Q4_S | 0.02225 | ~2.3× the divergence of Q4_L |
Throughput Benchmarks (2× AMD RX 7900 XTX, gfx1100)
Benchmarks were run on 2× RX 7900 XTX (ROCm 6.4.4) using llama.cpp's llama-server with flash attention enabled.
Launch Configuration
llama-server -m <model>.gguf -c 131072 -ngl 99 -fa on \
--host 0.0.0.0 --cache-type-k f16 --cache-type-v f16 \
--tensor-split 1,1 --no-mmap -t 23 -ub 256 -b 256 --parallel <N>
Single-Request Throughput (parallel=1)
| Model | Context | Aggregate t/s | Avg Latency | Min/Max Latency |
|---|---|---|---|---|
| Q4_S | 131,072 | 43.3 tok/s | 7.19 s | 0.57 / 17.48 s |
| Q4_M | 131,072 | 38.7 tok/s | 14.96 s | 0.56 / 26.08 s |
| Q4_L | 131,072 | 29.8 tok/s | 18.39 s | 0.82 / 36.21 s |
Multi-Request Throughput (parallel=8)
| Model | Context | Aggregate t/s | Avg Latency | Min/Max Latency |
|---|---|---|---|---|
| Q4_S | 196,608 | 165.6 tok/s | 26.82 s | 0.43 / 50.20 s |
| Q4_M | 196,608 | 152.3 tok/s | 24.91 s | 0.41 / 53.21 s |
| Q4_L | 180,000 | 152.9 tok/s | 28.16 s | 0.50 / 55.70 s |
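The benchmark harness itself is not included in this repository. The following is a hypothetical sketch of how comparable numbers could be collected against a running llama-server. It assumes aggregate throughput means total completion tokens divided by wall-clock time (consistent with the reported figures), and that the server's OpenAI-compatible response includes a `usage.completion_tokens` field. All function names are illustrative.

```python
import json
import time
import urllib.request
from concurrent.futures import ThreadPoolExecutor


def timed_request(url, prompt, max_tokens=1024):
    """Send one chat completion; return (completion_tokens, latency_seconds)."""
    body = json.dumps({
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }).encode("utf-8")
    req = urllib.request.Request(
        url, data=body, headers={"Content-Type": "application/json"}
    )
    start = time.perf_counter()
    with urllib.request.urlopen(req) as resp:
        usage = json.load(resp)["usage"]
    return usage["completion_tokens"], time.perf_counter() - start


def summarize(results, wall_seconds):
    """Aggregate (tokens, latency) pairs into the metrics used in the tables."""
    tokens = [t for t, _ in results]
    latencies = [s for _, s in results]
    return {
        "aggregate_tps": sum(tokens) / wall_seconds,
        "latency_min": min(latencies),
        "latency_avg": sum(latencies) / len(latencies),
        "latency_max": max(latencies),
    }


def run_benchmark(url, prompts, parallel):
    """Issue all prompts with at most `parallel` requests in flight."""
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=parallel) as pool:
        results = list(pool.map(lambda p: timed_request(url, p), prompts))
    return summarize(results, time.perf_counter() - start)
```

For example, `run_benchmark("http://localhost:8000/v1/chat/completions", prompts, parallel=8)` would approximate the 8-concurrent setup, where the URL and the `prompts` list are placeholders for your own server address and workload.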
Detailed Per-Model Results
Q4_S — Fastest
Single request (parallel=1, ctx=131072):
| Metric | Value |
|---|---|
| Aggregate throughput | 43.3 tok/s |
| Total tokens | 6,234 (20 requests × up to 1,024) |
| Per-request t/s min/avg/max | 4.4 / 32.7 / 45.8 |
| Latency min/avg/max | 0.57 s / 7.19 s / 17.48 s |
| Success | 20/20 (100%) |
8 concurrent (parallel=8, ctx=196608):
| Metric | Value |
|---|---|
| Aggregate throughput | 165.6 tok/s |
| Total tokens | 90,478 (160 requests × up to 1,024) |
| Per-request t/s min/avg/max | 2.0 / 18.1 / 24.2 |
| Latency min/avg/max | 0.43 s / 26.82 s / 50.20 s |
| Success | 160/160 (100%) |
Q4_M — Balanced
Single request (parallel=1, ctx=131072):
| Metric | Value |
|---|---|
| Aggregate throughput | 38.7 tok/s |
| Total tokens | 11,569 (20 requests × up to 1,024) |
| Per-request t/s min/avg/max | 2.9 / 31.2 / 39.5 |
| Latency min/avg/max | 0.56 s / 14.96 s / 26.08 s |
| Success | 20/20 (100%) |
8 concurrent (parallel=8, ctx=196608):
| Metric | Value |
|---|---|
| Aggregate throughput | 152.3 tok/s |
| Total tokens | 78,210 (160 requests × up to 1,024) |
| Per-request t/s min/avg/max | 0.3 / 15.5 / 29.2 |
| Latency min/avg/max | 0.41 s / 24.91 s / 53.21 s |
| Success | 160/160 (100%) |
Q4_L — Best Quality
Single request (parallel=1, ctx=131072):
| Metric | Value |
|---|---|
| Aggregate throughput | 29.8 tok/s |
| Total tokens | 10,979 (20 requests × up to 1,024) |
| Per-request t/s min/avg/max | 1.6 / 25.7 / 37.8 |
| Latency min/avg/max | 0.82 s / 18.39 s / 36.21 s |
| Success | 20/20 (100%) |
8 concurrent (parallel=8, ctx=180000):
| Metric | Value |
|---|---|
| Aggregate throughput | 152.9 tok/s |
| Total tokens | 88,676 (160 requests × up to 1,024) |
| Per-request t/s min/avg/max | 0.4 / 17.2 / 22.1 |
| Latency min/avg/max | 0.50 s / 28.16 s / 55.70 s |
| Success | 160/160 (100%) |
Hardware Requirements
AMD GPUs (ROCm / Vulkan)
| GPU VRAM | Recommended Variant | Context |
|---|---|---|
| 24 GB (RX 7900 XTX) | Q4_S | Up to 131,072 with f16 KV cache, 2× GPU |
| 24 GB (RX 7900 XTX) | Q4_M | Up to 131,072 with f16 KV cache, 2× GPU |
| 24 GB (RX 7900 XTX) | Q4_L | Up to 131,072 with f16 KV cache, 2× GPU |
For dual-GPU setups (2× RX 7900 XTX), use --tensor-split 1,1 and -ngl 99 to offload all layers and split the model evenly across both GPUs.
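To see why the full 131,072-token context calls for two 24 GB cards, the f16 KV cache size can be estimated from the values in the Model Details table. This is a back-of-the-envelope sketch: it ignores compute buffers, the vision projector, and any runtime overhead, and the function name is illustrative.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, n_ctx, bytes_per_elem=2):
    """Keys plus values for every layer, KV head, and context position."""
    return 2 * n_layers * n_kv_heads * head_dim * n_ctx * bytes_per_elem


# 40 layers, 8 KV heads (GQA), head dimension 128, f16 cache (2 bytes).
size = kv_cache_bytes(n_layers=40, n_kv_heads=8, head_dim=128, n_ctx=131072)
print(f"{size / 2**30:.1f} GiB")  # 20.0 GiB
```

Adding roughly 14 to 16 GB of model weights to a 20 GiB cache exceeds a single 24 GB card, which is consistent with the 2× GPU note in the table.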
NVIDIA GPUs (CUDA)
These GGUF models also work on NVIDIA GPUs via llama.cpp CUDA backend. For NVIDIA deployment, consider the AutoRound W4A16 or NVFP4A16 quantizations instead — they offer better throughput on CUDA via vLLM.
Usage with llama.cpp
Server Mode (Recommended)
# AMD (ROCm):
./llama-server -m mistral-small-3.2-24b-instruct-Q4_M_AMD.gguf \
-c 131072 -ngl 99 -fa on \
--port 8000 --host 0.0.0.0 \
--cache-type-k f16 --cache-type-v f16 \
--tensor-split 1,1 --no-mmap -t 23 \
-ub 256 -b 256 --parallel 4 \
--mmproj mmproj-Q8_0.gguf
# NVIDIA (CUDA):
./llama-server -m mistral-small-3.2-24b-instruct-Q4_M_AMD.gguf \
-c 131072 -ngl 99 -fa on \
--port 8000 --host 0.0.0.0 \
-ub 256 -b 256 --parallel 4 \
--mmproj mmproj-Q8_0.gguf
Inference Test
curl http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{"model":"mistral-small-3.2-24b-instruct-Q4_M_AMD","messages":[{"role":"user","content":"What is 2+2? One word."}],"max_tokens":10}'
Vision Test
curl http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model":"mistral-small-3.2-24b-instruct-Q4_M_AMD",
"messages":[
{"role":"user","content":[{"type":"image_url","image_url":{"url":"https://upload.wikimedia.org/wikipedia/commons/thumb/4/47/PNG_transparency_demonstration_1.png/280px-PNG_transparency_demonstration_1.png"}},{"type":"text","text":"Describe this image in one sentence."}]}
],
"max_tokens":100
}'
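The same vision request can be sent from Python using only the standard library. This is a sketch that mirrors the curl call above; `build_vision_payload` and `chat` are illustrative helper names, and the base URL assumes the server was started with `--port 8000` as in the Server Mode example.

```python
import json
import urllib.request


def build_vision_payload(image_url, prompt, model, max_tokens=100):
    """Build an OpenAI-style chat payload with one image and one text part."""
    return {
        "model": model,
        "messages": [{
            "role": "user",
            "content": [
                {"type": "image_url", "image_url": {"url": image_url}},
                {"type": "text", "text": prompt},
            ],
        }],
        "max_tokens": max_tokens,
    }


def chat(base_url, payload):
    """POST the payload to the chat completions endpoint; return the reply text."""
    req = urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["choices"][0]["message"]["content"]


payload = build_vision_payload(
    image_url="https://upload.wikimedia.org/wikipedia/commons/thumb/4/47/PNG_transparency_demonstration_1.png/280px-PNG_transparency_demonstration_1.png",
    prompt="Describe this image in one sentence.",
    model="mistral-small-3.2-24b-instruct-Q4_M_AMD",
)
# With the server running: print(chat("http://localhost:8000", payload))
```

The server must be started with `--mmproj` for image inputs to work.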
Notes
Vision: Image Size Limit
The original Mistral model has max_image_size set to 1540. Images with dimensions exceeding the limit are proportionally downscaled before vision encoding. The mmproj files in this repository match the original specification.
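As an illustration of proportional downscaling, the sketch below fits an image within the 1540-pixel limit while keeping its aspect ratio. It assumes the limit applies to the longer side and uses simple integer rounding; the actual preprocessor's rounding and patch alignment are not reproduced here, and `fit_within` is a hypothetical helper.

```python
def fit_within(width, height, max_side=1540):
    """Scale (width, height) down so the longer side is at most max_side."""
    longest = max(width, height)
    if longest <= max_side:
        return width, height  # already within the limit, left unchanged
    return (
        max(1, width * max_side // longest),
        max(1, height * max_side // longest),
    )


print(fit_within(3080, 1540))  # (1540, 770)
print(fit_within(800, 600))    # (800, 600)
```

Resizing very large images client-side along these lines can also reduce upload size, since the encoder would downscale them anyway.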
Files in This Repository
| File | Size | Description |
|---|---|---|
| mistral-small-3.2-24b-instruct-Q4_L_AMD.gguf | ~16.3 GB | Q4_L quantized text model: best quality |
| mistral-small-3.2-24b-instruct-Q4_M_AMD.gguf | ~15.1 GB | Q4_M quantized text model: balanced |
| mistral-small-3.2-24b-instruct-Q4_S_AMD.gguf | ~14.2 GB | Q4_S quantized text model: smallest/fastest |
| mmproj-F16.gguf | ~847 MB | Vision projector (F16, full precision) |
| mmproj-Q8_0.gguf | ~459 MB | Vision projector (Q8_0, recommended) |
License
This quantization is released under the Apache 2.0 License, following the base model's license.
The base model mistralai/Mistral-Small-3.2-24B-Instruct-2506 is licensed under Apache 2.0.
Citation
If you use this model in your research, please cite:
@misc{mistral-small-3.2-24b-gguf-gfx1100,
title = {Mistral-Small-3.2-24B-Instruct-2506 GGUF Quantizations},
author = {Gratex International},
year = {2026},
howpublished = {\url{https://huggingface.co/gratex/Mistral-Small-3.2-24B-Instruct-2506-GGUF}},
note = {Quantized with llama.cpp, benchmarked on AMD RX 7900 XTX}
}
Acknowledgments
This quantization was produced using hardware provided by Gratex International, a.s.
Original Model: mistralai/Mistral-Small-3.2-24B-Instruct-2506
Quantization Tool: llama.cpp
Inference Engine: llama.cpp (ROCm / CUDA / Vulkan)