# Pixtral-12B-W4A16 (INT4 GPTQ)
This repository contains a high-performance 4-bit quantized version of the Pixtral-12B multimodal model. This quantization is specifically tailored for efficient deployment on NVIDIA Ampere (A100) and Hopper (H100) architectures, balancing significant memory savings with high visual accuracy.
## 📊 Technical Specifications
- Model Size (Quantized): 8.54 GiB (9.18 GB)
- Quantization Scheme: Weight-only 4-bit (W4A16) via GPTQ
- Mixed-Precision Architecture:
  - Language Backbone: INT4 weights (Mistral-based)
  - Vision Encoder: full BFloat16 (preserved for high-fidelity OCR and scene analysis)
  - Projector: BFloat16 (maintains precise vision-language alignment)
- Activations: BFloat16
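As a rough sanity check on the reported size, here is a back-of-envelope estimate. All parameter counts and the GPTQ overhead factor are illustrative assumptions (not measured values from this checkpoint): a ~12B-parameter backbone, embeddings and LM head kept in 16-bit, and a ~0.4B-parameter vision stack in BF16.

```python
# Back-of-envelope size estimate for a W4A16 Pixtral-12B checkpoint.
# All parameter counts below are illustrative assumptions, not measured values.
GiB = 1024**3

total_params  = 12.0e9            # assumed total language-backbone parameters
embed_params  = 131_072 * 5_120   # assumed vocab x hidden; embeddings + lm_head kept in 16-bit
vision_params = 0.4e9             # assumed vision encoder + projector (BF16)

int4_weights  = (total_params - 2 * embed_params) * 0.5  # 4 bits per weight
gptq_overhead = int4_weights * 0.06                      # assumed ~6% for scales/zero-points
bf16_weights  = (2 * embed_params + vision_params) * 2   # 2 bytes per weight

total_gib = (int4_weights + gptq_overhead + bf16_weights) / GiB
print(f"estimated checkpoint size: {total_gib:.2f} GiB")
```

Under these assumptions the estimate lands close to the reported 8.54 GiB, which is consistent with the backbone being 4-bit while the vision stack and embeddings stay in 16-bit.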
## 🚀 Batch Processing & Pipeline Optimization
To maximize throughput in production surveillance or multi-stream captioning, the following optimizations are recommended:
### 1. High-Concurrency Serving
The model is validated for high concurrency. Using vLLM or SGLang, you should configure the server to handle up to 256 concurrent sequences to fully saturate the GPU compute.
### 2. Vision Token Compression
Pixtral's vision encoder can generate a large number of tokens. To optimize latency:
- Image Resizing: Resize input frames to 512x512 or 768x768 before encoding.
- Limit MM Per Prompt: Use `--limit-mm-per-prompt '{"image": 1}'` in vLLM to prevent memory spikes from multiple images per request.
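The resizing step can be sketched with Pillow. This is a minimal helper, assuming a 768-pixel cap on the longest side; the function name and default are illustrative:

```python
from PIL import Image

def resize_for_encoder(path: str, max_side: int = 768) -> Image.Image:
    """Downscale a frame so its longest side is at most max_side,
    preserving aspect ratio to limit the number of vision tokens."""
    img = Image.open(path).convert("RGB")
    scale = max_side / max(img.size)
    if scale < 1.0:  # never upscale small frames
        new_size = (round(img.width * scale), round(img.height * scale))
        img = img.resize(new_size, Image.LANCZOS)
    return img
```

Preserving aspect ratio (rather than forcing a square) avoids distorting text and objects that OCR-heavy prompts depend on.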
### 3. Prefix Caching Strategy
For repeated system prompts (e.g., "Analyze this surveillance frame..."), enable Prefix Caching. This reduces the prefill time for subsequent requests by reusing the KV-cache of the common instruction prefix.
- vLLM: `--enable-prefix-caching`
- SGLang: enabled by default via RadixAttention.
### 4. Asynchronous Processing
Implement an asynchronous worker pipeline (e.g., using asyncio and aiohttp) to send batches of images. This hides network latency and keeps the GPU from sitting idle between requests.
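A stdlib-only sketch of the pattern. The HTTP call is simulated here with `asyncio.sleep` so the snippet runs standalone; in production, `caption_frame` would be an aiohttp POST to the serving endpoint, and `max_in_flight` would be tuned to the server's batch capacity:

```python
import asyncio
import random

async def caption_frame(frame_id: int) -> str:
    """Stand-in for an HTTP call to the serving endpoint
    (e.g., an aiohttp POST to /v1/chat/completions)."""
    await asyncio.sleep(random.uniform(0.01, 0.05))  # simulated network + inference latency
    return f"caption for frame {frame_id}"

async def pipeline(frame_ids, max_in_flight: int = 8):
    """Keep up to max_in_flight requests outstanding so the server can
    batch them; the GPU stays busy while responses are in transit."""
    sem = asyncio.Semaphore(max_in_flight)

    async def worker(fid):
        async with sem:
            return await caption_frame(fid)

    return await asyncio.gather(*(worker(f) for f in frame_ids))

results = asyncio.run(pipeline(range(16)))
print(len(results), "frames captioned")
```

The semaphore bounds in-flight requests so client-side backpressure matches the `--max-num-seqs` limit configured on the server.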
## 📦 Deployment & Serving
### 1. vLLM (Recommended Baseline)
```shell
vllm serve ./pixtral-12b-W4A16 \
  --dtype bfloat16 \
  --max-model-len 4096 \
  --max-num-seqs 256 \
  --enable-prefix-caching \
  --kv-cache-dtype fp8 \
  --limit-mm-per-prompt '{"image": 1}' \
  --gpu-memory-utilization 0.90
```
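Once the server is up, requests go through vLLM's OpenAI-compatible Chat Completions API with images inlined as data URLs. A minimal stdlib sketch of the request payload (the model name mirrors the serve command above; the image bytes are a placeholder):

```python
import base64
import json

def build_caption_request(image_bytes: bytes, prompt: str,
                          model: str = "./pixtral-12b-W4A16") -> dict:
    """Build an OpenAI-compatible chat request with one inline image,
    suitable for POSTing to the server's /v1/chat/completions route."""
    b64 = base64.b64encode(image_bytes).decode()
    return {
        "model": model,
        "max_tokens": 512,
        "messages": [{
            "role": "user",
            "content": [
                {"type": "text", "text": prompt},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/jpeg;base64,{b64}"}},
            ],
        }],
    }

# Placeholder bytes stand in for a real JPEG frame.
payload = build_caption_request(b"\xff\xd8", "Analyze this surveillance frame.")
print(json.dumps(payload)[:80])
```

The single `image_url` entry per message matches the `--limit-mm-per-prompt '{"image": 1}'` setting above.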
### 2. SGLang (Extreme Throughput)
```shell
python -m sglang.launch_server \
  --model-path ./pixtral-12b-W4A16 \
  --dtype bfloat16 \
  --quantization gptq \
  --port 8000 \
  --mem-fraction-static 0.9
```
## 🔍 Surveillance Example
**Prompt:**

> Analyze this surveillance frame in detail. Describe all people (clothing, actions), vehicles (color, type), and interactions. Note any safety hazards or suspicious behavior.
**Example Output:**

> **People:**
> - Person on the left: wearing a light-colored coat, carrying a black bag, walking toward the background entrance.
> - Person on the right: wearing a dark-colored coat and a face mask, standing near the entrance.
>
> **Scene Context:** The image shows a public lobby/waiting area with armchairs and tables. Lighting is adequate, and no immediate safety hazards or suspicious behaviors are detected.
## 🤝 Contact & Credits
- Developer: Rajesh Jat
- Email: jaatrajesh1010@gmail.com
## Model Tree

- Quantized model: elon-trump/pixtral-12b-2409-w4a16-gptq
- Base model: mistralai/Pixtral-12B-Base-2409