
Pixtral-12B-W4A16 (INT4 GPTQ)

This repository contains a high-performance 4-bit quantized version of the Pixtral-12B multimodal model. This quantization is specifically tailored for efficient deployment on NVIDIA Ampere (A100) and Hopper (H100) architectures, balancing significant memory savings with high visual accuracy.

📊 Technical Specifications

  • Model Size (Quantized): 8.54 GiB (9.18 GB)
  • Quantization Scheme: Weight-Only 4-bit (W4A16) via GPTQ
  • Mixed-Precision Architecture:
    • Language Backbone: INT4 Weights (Mistral-based)
    • Vision Encoder: Full BFloat16 (Preserved for high-fidelity OCR and scene analysis)
    • Projector: BFloat16 (Maintains precise vision-language alignment)
    • Activations: BFloat16

🚀 Batch Processing & Pipeline Optimization

To maximize throughput in production surveillance or multi-stream captioning, the following optimizations are recommended:

1. High-Concurrency Serving

The model is validated for high-concurrency workloads. Using vLLM or SGLang, configure the server to handle up to 256 concurrent sequences (e.g., --max-num-seqs 256 in vLLM) to keep the GPU fully saturated.

2. Vision Token Compression

Pixtral's vision encoder produces image tokens in proportion to input resolution, so large frames can dominate the context window. To reduce latency:

  • Image Resizing: Resize input frames to 512x512 or 768x768 before encoding.
  • Limit MM Per Prompt: Use --limit-mm-per-prompt '{"image": 1}' in vLLM to prevent memory spikes from multiple images per request.

3. Prefix Caching Strategy

For repeated system prompts (e.g., "Analyze this surveillance frame..."), enable Prefix Caching. This reduces the prefill time for subsequent requests by reusing the KV-cache of the common instruction prefix.

  • vLLM: --enable-prefix-caching
  • SGLang: Enabled by default via RadixAttention.
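Prefix caching only pays off when the prefix is byte-for-byte identical across requests. A minimal sketch of structuring requests so the cached KV entries can actually be reused (the prompt text is illustrative):

```python
# Keep the instruction prefix constant across requests so the server's
# prefix cache (vLLM --enable-prefix-caching / SGLang RadixAttention)
# can reuse its KV-cache entries. Any variation, even whitespace,
# forces a fresh prefill of the prefix.
SYSTEM_PREFIX = (
    "Analyze this surveillance frame in detail. Describe all people "
    "(clothing, actions), vehicles (color, type), and interactions."
)

def build_messages(frame_id: str) -> list[dict]:
    """Per-request payload: shared prefix first, varying content after."""
    return [
        {"role": "system", "content": SYSTEM_PREFIX},
        {"role": "user", "content": f"Frame ID: {frame_id}"},
    ]

a = build_messages("cam01-000123")
b = build_messages("cam01-000124")
assert a[0] == b[0]  # identical prefix -> prefill reuse on the server
```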

4. Asynchronous Processing

Implement an asynchronous worker pipeline (e.g., using asyncio and aiohttp) to send batches of images. This hides network latency and ensures the GPU is never idle.
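A minimal sketch of such a pipeline using only asyncio (the `caption_frame` coroutine is a stand-in for a real aiohttp POST to the serving endpoint; the concurrency bound of 256 mirrors the serving configuration above):

```python
import asyncio

MAX_CONCURRENCY = 256  # match the server's --max-num-seqs

async def caption_frame(frame_id: str, sem: asyncio.Semaphore) -> str:
    """Stand-in for an aiohttp POST to the captioning endpoint."""
    async with sem:
        await asyncio.sleep(0.01)  # simulated network + inference latency
        return f"caption for {frame_id}"

async def run_pipeline(frame_ids: list[str]) -> list[str]:
    # Bound in-flight requests so the client never overruns the server,
    # while keeping enough concurrency to hide per-request latency.
    sem = asyncio.Semaphore(MAX_CONCURRENCY)
    tasks = [caption_frame(fid, sem) for fid in frame_ids]
    return await asyncio.gather(*tasks)

captions = asyncio.run(run_pipeline([f"frame-{i}" for i in range(8)]))
print(captions[0])  # -> caption for frame-0
```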

📦 Deployment & Serving

1. vLLM (Recommended Baseline)

vllm serve ./pixtral-12b-W4A16 \
  --dtype bfloat16 \
  --max-model-len 4096 \
  --max-num-seqs 256 \
  --enable-prefix-caching \
  --kv-cache-dtype fp8 \
  --limit-mm-per-prompt '{"image": 1}' \
  --gpu-memory-utilization 0.90

2. SGLang (Extreme Throughput)

python -m sglang.launch_server \
  --model-path ./pixtral-12b-W4A16 \
  --dtype bfloat16 \
  --quantization gptq \
  --port 8000 \
  --mem-fraction-static 0.9
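Both servers expose an OpenAI-compatible /v1/chat/completions endpoint. A minimal, stdlib-only sketch of building a request payload with a base64-encoded frame (in practice you would POST `payload` with aiohttp or requests; the model name is an assumption and must match what the server reports):

```python
import base64
import json

def build_payload(image_bytes: bytes, prompt: str,
                  model: str = "pixtral-12b-W4A16") -> dict:
    """OpenAI-compatible chat request with one image_url part (data URI)."""
    b64 = base64.b64encode(image_bytes).decode("ascii")
    return {
        "model": model,
        "max_tokens": 512,
        "messages": [{
            "role": "user",
            "content": [
                {"type": "text", "text": prompt},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/jpeg;base64,{b64}"}},
            ],
        }],
    }

payload = build_payload(b"\xff\xd8fake-jpeg", "Analyze this surveillance frame.")
print(json.dumps(payload)[:40])
```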

🔍 Sample Surveillance Example

Prompt:

Analyze this surveillance frame in detail. Describe all people (clothing, actions), 
vehicles (color, type), and interactions. Note any safety hazards or suspicious behavior.

Example Output:

People:

  1. Person on the Left: Wearing a light-colored coat, carrying a black bag, walking towards the background entrance.
  2. Person on the Right: Wearing a dark-colored coat and a face mask, standing near the entrance.

Scene Context: The image shows a public lobby/waiting area with armchairs and tables. Lighting is adequate and no immediate safety hazards or suspicious behaviors are detected.

馃 Contact & Credits
