Instructions to use gratex/mistral-small-3.2-24B-Instruct-2506-GGUF with libraries, inference providers, notebooks, and local apps. Follow these links to get started.

Libraries

How to use gratex/mistral-small-3.2-24B-Instruct-2506-GGUF with llama-cpp-python:

# !pip install llama-cpp-python

from llama_cpp import Llama

llm = Llama.from_pretrained(
	repo_id="gratex/mistral-small-3.2-24B-Instruct-2506-GGUF",
	filename="mistral-small-3.2-24b-instruct-Q4_L_AMD.gguf",
)

llm.create_chat_completion(
	messages = [
		{
			"role": "user",
			"content": [
				{
					"type": "text",
					"text": "Describe this image in one sentence."
				},
				{
					"type": "image_url",
					"image_url": {
						"url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"
					}
				}
			]
		}
	]
)

llama.cpp

How to use gratex/mistral-small-3.2-24B-Instruct-2506-GGUF with llama.cpp:

Install from brew

brew install llama.cpp
# Start a local OpenAI-compatible server with a web UI:
llama-server -hf gratex/mistral-small-3.2-24B-Instruct-2506-GGUF:F16
# Run inference directly in the terminal:
llama-cli -hf gratex/mistral-small-3.2-24B-Instruct-2506-GGUF:F16

Install from WinGet (Windows)

winget install llama.cpp
# Start a local OpenAI-compatible server with a web UI:
llama-server -hf gratex/mistral-small-3.2-24B-Instruct-2506-GGUF:F16
# Run inference directly in the terminal:
llama-cli -hf gratex/mistral-small-3.2-24B-Instruct-2506-GGUF:F16

Use pre-built binary

# Download pre-built binary from:
# https://github.com/ggerganov/llama.cpp/releases
# Start a local OpenAI-compatible server with a web UI:
./llama-server -hf gratex/mistral-small-3.2-24B-Instruct-2506-GGUF:F16
# Run inference directly in the terminal:
./llama-cli -hf gratex/mistral-small-3.2-24B-Instruct-2506-GGUF:F16

Build from source code

git clone https://github.com/ggerganov/llama.cpp.git
cd llama.cpp
cmake -B build
cmake --build build -j --target llama-server llama-cli
# Start a local OpenAI-compatible server with a web UI:
./build/bin/llama-server -hf gratex/mistral-small-3.2-24B-Instruct-2506-GGUF:F16
# Run inference directly in the terminal:
./build/bin/llama-cli -hf gratex/mistral-small-3.2-24B-Instruct-2506-GGUF:F16

Use Docker

docker model run hf.co/gratex/mistral-small-3.2-24B-Instruct-2506-GGUF:F16

vLLM

How to use gratex/mistral-small-3.2-24B-Instruct-2506-GGUF with vLLM:

Install from pip and serve model

# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "gratex/mistral-small-3.2-24B-Instruct-2506-GGUF"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "gratex/mistral-small-3.2-24B-Instruct-2506-GGUF",
		"messages": [
			{
				"role": "user",
				"content": [
					{
						"type": "text",
						"text": "Describe this image in one sentence."
					},
					{
						"type": "image_url",
						"image_url": {
							"url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"
						}
					}
				]
			}
		]
	}'

Ollama
How to use gratex/mistral-small-3.2-24B-Instruct-2506-GGUF with Ollama:
```
ollama run hf.co/gratex/mistral-small-3.2-24B-Instruct-2506-GGUF:F16
```

Unsloth Studio

How to use gratex/mistral-small-3.2-24B-Instruct-2506-GGUF with Unsloth Studio:

Install Unsloth Studio (macOS, Linux, WSL)

curl -fsSL https://unsloth.ai/install.sh | sh
# Run unsloth studio
unsloth studio -H 0.0.0.0 -p 8888
# Then open http://localhost:8888 in your browser
# Search for gratex/mistral-small-3.2-24B-Instruct-2506-GGUF to start chatting

Install Unsloth Studio (Windows)

irm https://unsloth.ai/install.ps1 | iex
# Run unsloth studio
unsloth studio -H 0.0.0.0 -p 8888
# Then open http://localhost:8888 in your browser
# Search for gratex/mistral-small-3.2-24B-Instruct-2506-GGUF to start chatting

Using HuggingFace Spaces for Unsloth

# No setup required
# Open https://huggingface.co/spaces/unsloth/studio in your browser
# Search for gratex/mistral-small-3.2-24B-Instruct-2506-GGUF to start chatting

Pi

How to use gratex/mistral-small-3.2-24B-Instruct-2506-GGUF with Pi:

Start the llama.cpp server

# Install llama.cpp:
brew install llama.cpp
# Start a local OpenAI-compatible server:
llama-server -hf gratex/mistral-small-3.2-24B-Instruct-2506-GGUF:F16

Configure the model in Pi

# Install Pi:
npm install -g @mariozechner/pi-coding-agent
# Add to ~/.pi/agent/models.json:
{
  "providers": {
    "llama-cpp": {
      "baseUrl": "http://localhost:8080/v1",
      "api": "openai-completions",
      "apiKey": "none",
      "models": [
        {
          "id": "gratex/mistral-small-3.2-24B-Instruct-2506-GGUF:F16"
        }
      ]
    }
  }
}

Run Pi

# Start Pi in your project directory:
pi

Hermes Agent

How to use gratex/mistral-small-3.2-24B-Instruct-2506-GGUF with Hermes Agent:

Start the llama.cpp server

# Install llama.cpp:
brew install llama.cpp
# Start a local OpenAI-compatible server:
llama-server -hf gratex/mistral-small-3.2-24B-Instruct-2506-GGUF:F16

Configure Hermes

# Install Hermes:
curl -fsSL https://hermes-agent.nousresearch.com/install.sh | bash
hermes setup
# Point Hermes at the local server:
hermes config set model.provider custom
hermes config set model.base_url http://127.0.0.1:8080/v1
hermes config set model.default gratex/mistral-small-3.2-24B-Instruct-2506-GGUF:F16

Run Hermes

hermes

Docker Model Runner
How to use gratex/mistral-small-3.2-24B-Instruct-2506-GGUF with Docker Model Runner:
```
docker model run hf.co/gratex/mistral-small-3.2-24B-Instruct-2506-GGUF:F16
```

Lemonade

How to use gratex/mistral-small-3.2-24B-Instruct-2506-GGUF with Lemonade:

Pull the model

# Download Lemonade from https://lemonade-server.ai/
lemonade pull gratex/mistral-small-3.2-24B-Instruct-2506-GGUF:F16

Run and chat with the model

lemonade run user.mistral-small-3.2-24B-Instruct-2506-GGUF-F16

List all available models

lemonade list

Mistral-Small-3.2-24B-Instruct-2506 — GGUF Quantizations

This repository contains GGUF quantizations of Mistral-Small-3.2-24B-Instruct-2506, a 24B-parameter multimodal model with Pixtral-style vision capabilities. These quantizations are optimized for AMD RDNA 3 (gfx1100) GPUs (RX 7900 XTX, RX 7900 XT, RX 7900 GRE) running llama.cpp with the ROCm or Vulkan backend.

Three K-quant variants are provided (Q4_L, Q4_M, and Q4_S), offering a quality-size tradeoff. All were quantized from the BF16 GGUF baseline on an NVIDIA RTX PRO 6000 (Blackwell) and benchmarked on 2× AMD RX 7900 XTX.

Model Details

Property	Value
Base Model	mistralai/Mistral-Small-3.2-24B-Instruct-2506
Quantization Format	GGUF (K-quants via llama.cpp)
Architecture	Mistral3ForConditionalGeneration
LM Layers	40 MistralDecoder layers
Hidden Size	5120
Intermediate Size	32768
Attention Heads	32 (query), 8 (key/value, GQA)
Head Dimension	128
Vocabulary Size	131,072 (Tekken tokenizer: 150,000 regular + 1,000 special, 131,072 used)
Context Window	131,072 tokens
Vision Encoder	Pixtral (24 layers, hidden_size=1024, patch_size=14)
Vision Projector	patch_merge (spatial_merge_size=2)
Quantized Components	Text decoder weights
Preserved in F16/Q8_0	Vision tower (separate mmproj files)

Quantization Variants

File	Quant	Size	Description
`mistral-small-3.2-24b-instruct-Q4_L_AMD.gguf`	Q4_0_L	~16.3 GB	Best quality — largest K-quant groups, closest to F16
`mistral-small-3.2-24b-instruct-Q4_M_AMD.gguf`	Q4_0_M	~15.1 GB	Balanced — good quality at smaller size
`mistral-small-3.2-24b-instruct-Q4_S_AMD.gguf`	Q4_0_S	~14.2 GB	Smallest — fastest inference, most compression
`mmproj-F16.gguf`	F16	~847 MB	Vision projector (unquantized F16)
`mmproj-Q8_0.gguf`	Q8_0	~459 MB	Vision projector (8-bit, recommended)

Quality Benchmarks

All quality benchmarks use wikitext-2-raw-v1 (test split), the dataset commonly used to evaluate quantization quality in llama.cpp, AutoAWQ, GPTQ, and the academic literature.

WikiText-2 Perplexity (ctx=512)

Measured via llama-perplexity b8984, 588 chunks.

Model	PPL	ΔPPL vs F16
F16 (baseline)	5.4894	—
Q4_L	5.5377	+0.88%
Q4_M	5.4417	-0.87%*
Q4_S	5.5035	+0.26%

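* A quantized model scoring a lower perplexity than its F16 baseline, as Q4_M does here, is a known artifact on wikitext-2 (token distribution shift) and does not mean Q4_M is more accurate than F16. KL divergence is the more reliable quality metric.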
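
The ΔPPL column is the relative change in perplexity against the F16 baseline. As a quick check, the following Python sketch recomputes it from the PPL values in the table; the helper name `delta_ppl_percent` is ours for illustration and is not part of llama.cpp.

```python
# Recompute the "ΔPPL vs F16" column from the perplexities reported above.
F16_PPL = 5.4894
QUANT_PPL = {"Q4_L": 5.5377, "Q4_M": 5.4417, "Q4_S": 5.5035}


def delta_ppl_percent(ppl: float, baseline: float = F16_PPL) -> float:
    """Relative perplexity change versus the baseline, in percent."""
    return (ppl - baseline) / baseline * 100.0


for name, ppl in QUANT_PPL.items():
    print(f"{name}: {delta_ppl_percent(ppl):+.2f}%")
```

A negative value only says the quantized model happened to score better on this particular text, which is why the KL divergence figures are reported as well.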

KL Divergence vs F16 (Static / Prefill)

KL divergence measures how much the output probability distribution has shifted from the F16 baseline. Lower is better; 0 = identical to F16.

Methodology: wikitext-2-raw-v1, ctx=512, full-vocab KLD computed on the second half of each 512-token chunk (positions [256–511]), ensuring every scored token has ≥256 tokens of left context. KLD direction: KL(P_base ‖ P_quant) — "how well does the quantized model approximate the base?"
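
To make the methodology concrete, here is a minimal pure-Python sketch of the metric. It is not the llama-perplexity implementation: the function names are hypothetical, and it assumes each chunk is given as one full-vocabulary logit vector per position. It computes KL(P_base ‖ P_quant) per position and averages over the second half of the chunk only.

```python
import math
from typing import Sequence


def log_softmax(logits: Sequence[float]) -> list[float]:
    """Numerically stable log-softmax over one vocabulary-sized logit vector."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def kl_divergence(base_logits: Sequence[float], quant_logits: Sequence[float]) -> float:
    """KL(P_base || P_quant) for a single token position, in nats."""
    log_p = log_softmax(base_logits)
    log_q = log_softmax(quant_logits)
    return sum(math.exp(lp) * (lp - lq) for lp, lq in zip(log_p, log_q))


def mean_kld_second_half(base_chunk, quant_chunk) -> float:
    """Average KLD over positions ctx/2 .. ctx-1 of one chunk.

    For ctx=512 this scores positions 256..511, so every scored
    token has at least 256 tokens of left context.
    """
    start = len(base_chunk) // 2
    scores = [
        kl_divergence(b, q)
        for b, q in zip(base_chunk[start:], quant_chunk[start:])
    ]
    return sum(scores) / len(scores)


# Toy example: ctx=4, vocabulary of 3 tokens. Only the last two positions count.
base = [[9.0, 0.0, 0.0], [0.0, 9.0, 0.0], [1.0, 2.0, 3.0], [0.5, 0.5, 0.5]]
quant = [[0.0, 9.0, 0.0], [9.0, 0.0, 0.0], [1.0, 2.0, 3.0], [0.5, 0.5, 0.5]]
print(mean_kld_second_half(base, quant))  # first-half disagreement is ignored
```

In the toy example the two models disagree only in the first half of the chunk, so the scored half is identical and the result is zero.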

Metric	Q4_L	Q4_M	Q4_S
Mean KLD	0.00968	0.01273	0.02225
Median KLD	0.00523	0.00495	0.00959
95th %ile KLD	0.02951	0.04391	0.07171
99th %ile KLD	0.08416	0.13901	0.24665
Max KLD	1.76017	2.60900	3.41354

Token Probability Divergence (Δp)

Metric	Q4_L	Q4_M	Q4_S
RMS Δp	3.218%	3.492%	4.765%
95th %ile Δp	4.473%	3.926%	6.111%
99th %ile Δp	8.825%	9.113%	12.490%
Same top-p	94.94%	95.01%	93.39%

Same top-p = percentage of tokens where quantized and F16 models agree on the most likely next token.
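
The two kinds of statistic in this table can be sketched in a few lines of Python. This is an illustration only: the function names are hypothetical, and it assumes Δp is the change in probability that each model assigns to the token actually observed next, which is how we read the llama-perplexity output.

```python
import math


def top_token_agreement(base_probs, quant_probs) -> float:
    """Percentage of positions where both models rank the same token first."""
    same = sum(
        max(range(len(b)), key=b.__getitem__) == max(range(len(q)), key=q.__getitem__)
        for b, q in zip(base_probs, quant_probs)
    )
    return 100.0 * same / len(base_probs)


def rms_delta_p(base_probs, quant_probs, targets) -> float:
    """RMS of (quant - base) probability of the observed token, in percent."""
    deltas = [q[t] - b[t] for b, q, t in zip(base_probs, quant_probs, targets)]
    return 100.0 * math.sqrt(sum(d * d for d in deltas) / len(deltas))


# Toy data: two positions, vocabulary of three tokens.
base = [[0.7, 0.2, 0.1], [0.1, 0.6, 0.3]]
quant = [[0.6, 0.3, 0.1], [0.2, 0.3, 0.5]]
targets = [0, 1]  # index of the token that actually came next

print(top_token_agreement(base, quant))
print(rms_delta_p(base, quant, targets))
```

In the toy data the models agree on the top token at the first position only, so agreement is 50%.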

Quality Ranking

Rank	Model	Mean KLD	Interpretation
1	Q4_L	0.00968	Best — closest to F16
2	Q4_M	0.01273	~32% more divergence than Q4_L
3	Q4_S	0.02225	~2.3× the divergence of Q4_L

Throughput Benchmarks (2× AMD RX 7900 XTX, gfx1100)

Benchmarks were run on 2× RX 7900 XTX (ROCm 6.4.4) using llama.cpp's llama-server with flash attention enabled.

Launch Configuration

```
llama-server -m <model>.gguf -c 131072 -ngl 99 -fa on \
  --host 0.0.0.0 --cache-type-k f16 --cache-type-v f16 \
  --tensor-split 1,1 --no-mmap -t 23 -ub 256 -b 256 --parallel <N>
```

Single-Request Throughput (parallel=1)

Model	Context	Aggregate t/s	Avg Latency	Min/Max Latency
Q4_S	131,072	43.3 tok/s	7.19 s	0.57 / 17.48 s
Q4_M	131,072	38.7 tok/s	14.96 s	0.56 / 26.08 s
Q4_L	131,072	29.8 tok/s	18.39 s	0.82 / 36.21 s

Multi-Request Throughput (parallel=8)

Model	Context	Aggregate t/s	Avg Latency	Min/Max Latency
Q4_S	196,608	165.6 tok/s	26.82 s	0.43 / 50.20 s
Q4_M	196,608	152.3 tok/s	24.91 s	0.41 / 53.21 s
Q4_L	180,000	152.9 tok/s	28.16 s	0.50 / 55.70 s
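
Two derived figures help when reading these tables: how much context each concurrent request can use, and how much aggregate throughput is gained from batching. The Python sketch below computes both from the reported numbers. It assumes the server divides the total context evenly across the parallel slots, which depends on the llama.cpp version and KV cache settings, so treat the per-slot figure as an estimate.

```python
# (context, aggregate tok/s) taken from the two tables above.
SINGLE = {"Q4_S": (131_072, 43.3), "Q4_M": (131_072, 38.7), "Q4_L": (131_072, 29.8)}
PARALLEL_8 = {"Q4_S": (196_608, 165.6), "Q4_M": (196_608, 152.3), "Q4_L": (180_000, 152.9)}
N_SLOTS = 8


def context_per_slot(total_ctx: int, slots: int) -> int:
    """Assumption: the total context (-c) is split evenly across --parallel slots."""
    return total_ctx // slots


def scaling_factor(model: str) -> float:
    """Aggregate throughput at parallel=8 relative to parallel=1."""
    return PARALLEL_8[model][1] / SINGLE[model][1]


for model in SINGLE:
    ctx, _ = PARALLEL_8[model]
    print(
        f"{model}: {context_per_slot(ctx, N_SLOTS)} tokens/slot, "
        f"{scaling_factor(model):.2f}x aggregate speedup"
    )
```

The speedup is well below 8× for every variant, so each individual request runs slower under load even though total throughput rises.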

Detailed Per-Model Results

Q4_S — Fastest

Single request (parallel=1, ctx=131072):

Metric	Value
Aggregate throughput	43.3 tok/s
Total tokens	6,234 (20 requests × up to 1,024)
Per-request t/s min/avg/max	4.4 / 32.7 / 45.8
Latency min/avg/max	0.57 s / 7.19 s / 17.48 s
Success	20/20 (100%)

8 concurrent (parallel=8, ctx=196608):

Metric	Value
Aggregate throughput	165.6 tok/s
Total tokens	90,478 (160 requests × up to 1,024)
Per-request t/s min/avg/max	2.0 / 18.1 / 24.2
Latency min/avg/max	0.43 s / 26.82 s / 50.20 s
Success	160/160 (100%)

Q4_M — Balanced

Single request (parallel=1, ctx=131072):

Metric	Value
Aggregate throughput	38.7 tok/s
Total tokens	11,569 (20 requests × up to 1,024)
Per-request t/s min/avg/max	2.9 / 31.2 / 39.5
Latency min/avg/max	0.56 s / 14.96 s / 26.08 s
Success	20/20 (100%)

8 concurrent (parallel=8, ctx=196608):

Metric	Value
Aggregate throughput	152.3 tok/s
Total tokens	78,210 (160 requests × up to 1,024)
Per-request t/s min/avg/max	0.3 / 15.5 / 29.2
Latency min/avg/max	0.41 s / 24.91 s / 53.21 s
Success	160/160 (100%)

Q4_L — Best Quality

Single request (parallel=1, ctx=131072):

Metric	Value
Aggregate throughput	29.8 tok/s
Total tokens	10,979 (20 requests × up to 1,024)
Per-request t/s min/avg/max	1.6 / 25.7 / 37.8
Latency min/avg/max	0.82 s / 18.39 s / 36.21 s
Success	20/20 (100%)

8 concurrent (parallel=8, ctx=180000):

Metric	Value
Aggregate throughput	152.9 tok/s
Total tokens	88,676 (160 requests × up to 1,024)
Per-request t/s min/avg/max	0.4 / 17.2 / 22.1
Latency min/avg/max	0.50 s / 28.16 s / 55.70 s
Success	160/160 (100%)
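
The single-request figures can be cross-checked against each other. Aggregate throughput is total tokens divided by wall-clock time, and if the 20 requests ran back to back (an assumption, since the benchmark script is not shown), wall-clock time is roughly 20 times the average latency. The Python sketch below compares the two estimates.

```python
# (total tokens, aggregate tok/s, average latency in s) for the parallel=1 runs above.
RUNS = {
    "Q4_S": (6_234, 43.3, 7.19),
    "Q4_M": (11_569, 38.7, 14.96),
    "Q4_L": (10_979, 29.8, 18.39),
}
N_REQUESTS = 20


def wall_time_from_throughput(total_tokens: int, tok_per_s: float) -> float:
    """Wall-clock seconds implied by the aggregate throughput."""
    return total_tokens / tok_per_s


def wall_time_from_latency(avg_latency_s: float, n_requests: int = N_REQUESTS) -> float:
    """Wall-clock seconds if requests ran strictly one after another."""
    return avg_latency_s * n_requests


for model, (tokens, rate, latency) in RUNS.items():
    a = wall_time_from_throughput(tokens, rate)
    b = wall_time_from_latency(latency)
    print(f"{model}: {a:.1f} s from throughput, {b:.1f} s from latency")
```

The two estimates agree to within about 1% for all three variants. The check also shows that Q4_S generated far fewer tokens in its single-request run than the other two, which accounts for its much lower average latency.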

Hardware Requirements

AMD GPUs (ROCm / Vulkan)

GPU VRAM	Recommended Variant	Context
24 GB (RX 7900 XTX)	Q4_S	Up to 131,072 with f16 KV cache, 2× GPU
24 GB (RX 7900 XTX)	Q4_M	Up to 131,072 with f16 KV cache, 2× GPU
24 GB (RX 7900 XTX)	Q4_L	Up to 131,072 with f16 KV cache, 2× GPU

For dual-GPU setups (2× RX 7900 XTX), use --tensor-split 1,1 and -ngl 99 to offload all layers and split them evenly across both GPUs.
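
A rough KV cache estimate shows why long contexts call for two cards. The Python sketch below uses the architecture values from the Model Details table and an f16 cache. It is an estimate only: it ignores compute buffers, the vision projector, and any backend-specific overhead.

```python
# Architecture values from the Model Details table.
N_LAYERS = 40
N_KV_HEADS = 8
HEAD_DIM = 128
BYTES_PER_ELEM = 2  # f16 cache (--cache-type-k f16 --cache-type-v f16)


def kv_bytes_per_token() -> int:
    """Bytes of K and V cache stored per token across all layers."""
    return 2 * N_LAYERS * N_KV_HEADS * HEAD_DIM * BYTES_PER_ELEM


def kv_cache_gib(context_tokens: int) -> float:
    """Approximate KV cache size in GiB for a given total context."""
    return kv_bytes_per_token() * context_tokens / 2**30


for ctx in (131_072, 180_000, 196_608):
    print(f"ctx={ctx}: {kv_cache_gib(ctx):.1f} GiB of KV cache")
```

At 131,072 tokens the cache alone comes to about 20 GiB, which together with 14 to 16 GB of weights exceeds a single 24 GB card.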

NVIDIA GPUs (CUDA)

These GGUF models also work on NVIDIA GPUs via the llama.cpp CUDA backend. For NVIDIA deployment, consider the AutoRound W4A16 or NVFP4A16 quantizations instead, which offer better throughput on CUDA via vLLM.

Usage with llama.cpp

Server Mode (Recommended)

```
# AMD (ROCm):
./llama-server -m mistral-small-3.2-24b-instruct-Q4_M_AMD.gguf \
  -c 131072 -ngl 99 -fa on \
  --port 8000 --host 0.0.0.0 \
  --cache-type-k f16 --cache-type-v f16 \
  --tensor-split 1,1 --no-mmap -t 23 \
  -ub 256 -b 256 --parallel 4 \
  --mmproj mmproj-Q8_0.gguf

# NVIDIA (CUDA):
./llama-server -m mistral-small-3.2-24b-instruct-Q4_M_AMD.gguf \
  -c 131072 -ngl 99 -fa on \
  --port 8000 --host 0.0.0.0 \
  -ub 256 -b 256 --parallel 4 \
  --mmproj mmproj-Q8_0.gguf
```

Inference Test

```
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model":"mistral-small-3.2-24b-instruct-Q4_M_AMD","messages":[{"role":"user","content":"What is 2+2? One word."}],"max_tokens":10}'
```

Vision Test

```
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model":"mistral-small-3.2-24b-instruct-Q4_M_AMD",
    "messages":[
      {"role":"user","content":[{"type":"image_url","image_url":{"url":"https://upload.wikimedia.org/wikipedia/commons/thumb/4/47/PNG_transparency_demonstration_1.png/280px-PNG_transparency_demonstration_1.png"}},{"type":"text","text":"Describe this image in one sentence."}]}
    ],
    "max_tokens":100
  }'
```
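
The same two requests can be sent from Python using only the standard library. This is a minimal sketch, not an official client: `build_chat_payload` and `chat` are hypothetical helper names, and the base URL assumes the server was started on port 8000 as in the Server Mode commands.

```python
import json
import urllib.request
from typing import Optional


def build_chat_payload(model: str, text: str, image_url: Optional[str] = None,
                       max_tokens: int = 100) -> dict:
    """Build an OpenAI-style chat completion request body."""
    if image_url is None:
        content = text  # plain text request
    else:
        content = [  # multimodal request: image part followed by text part
            {"type": "image_url", "image_url": {"url": image_url}},
            {"type": "text", "text": text},
        ]
    return {
        "model": model,
        "messages": [{"role": "user", "content": content}],
        "max_tokens": max_tokens,
    }


def chat(base_url: str, payload: dict, timeout: float = 120.0) -> str:
    """POST the payload to /v1/chat/completions and return the reply text."""
    request = urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]


# Example (requires a running llama-server on port 8000):
# payload = build_chat_payload("mistral-small-3.2-24b-instruct-Q4_M_AMD",
#                              "What is 2+2? One word.", max_tokens=10)
# print(chat("http://localhost:8000", payload))
```

Image requests only work if the server was launched with an `--mmproj` file, as in the commands above.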

Notes

Vision: Image Size Limit

The original Mistral model sets max_image_size to 1540. Images with a dimension exceeding this limit (1540 pixels) are downscaled proportionally before vision encoding. The mmproj files in this repository match the original specification.
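
As an illustration of proportional downscaling, the Python sketch below computes the size an image would be reduced to. The function name is hypothetical, and rounding to the nearest pixel is an assumption; the actual preprocessor may round differently or snap to patch boundaries.

```python
MAX_IMAGE_SIZE = 1540  # limit stated in the note above


def downscaled_size(width: int, height: int,
                    max_side: int = MAX_IMAGE_SIZE) -> tuple[int, int]:
    """Shrink (width, height) proportionally so the longer side fits max_side.

    Assumption: results are rounded to the nearest pixel.
    """
    ratio = max(width, height) / max_side
    if ratio <= 1.0:
        return width, height  # already within the limit, left unchanged
    return round(width / ratio), round(height / ratio)


print(downscaled_size(3080, 1540))  # longer side is twice the limit
print(downscaled_size(800, 600))    # within the limit
```

The aspect ratio is preserved, so very wide or very tall images lose more resolution on their shorter side.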

Files in This Repository

File	Size	Description
`mistral-small-3.2-24b-instruct-Q4_L_AMD.gguf`	~16.3 GB	Q4_L quantized text model — best quality
`mistral-small-3.2-24b-instruct-Q4_M_AMD.gguf`	~15.1 GB	Q4_M quantized text model — balanced
`mistral-small-3.2-24b-instruct-Q4_S_AMD.gguf`	~14.2 GB	Q4_S quantized text model — smallest/fastest
`mmproj-F16.gguf`	~847 MB	Vision projector (F16, unquantized)
`mmproj-Q8_0.gguf`	~459 MB	Vision projector (Q8_0, recommended)

License

This quantization is released under the Apache 2.0 License, following the base model's license.

The base model mistralai/Mistral-Small-3.2-24B-Instruct-2506 is licensed under Apache 2.0.

Citation

If you use this model in your research, please cite:

```
@misc{mistral-small-3.2-24b-gguf-gfx1100,
  title = {Mistral-Small-3.2-24B-Instruct-2506 GGUF Quantizations},
  author = {Gratex International},
  year = {2026},
  howpublished = {\url{https://huggingface.co/gratex/Mistral-Small-3.2-24B-Instruct-2506-GGUF}},
  note = {Quantized with llama.cpp, benchmarked on AMD RX 7900 XTX}
}
```