
Pixtral-12B-W4A16 (INT4 GPTQ)

This repository contains a high-performance 4-bit quantized version of the Pixtral-12B multimodal model. This quantization is specifically tailored for efficient deployment on NVIDIA Ampere (A100) and Hopper (H100) architectures, balancing significant memory savings with high visual accuracy.

📊 Technical Specifications

  • Model Size (Quantized): 8.54 GiB (9.18 GB)
  • Quantization Scheme: Weight-Only 4-bit (W4A16) via GPTQ
  • Mixed-Precision Architecture:
    • Language Backbone: INT4 Weights (Mistral-based)
    • Vision Encoder: Full BFloat16 (Preserved for high-fidelity OCR and scene analysis)
    • Projector: BFloat16 (Maintains precise vision-language alignment)
    • Activations: BFloat16

🚀 Batch Processing & Pipeline Optimization

To maximize throughput in production surveillance or multi-stream captioning, the following optimizations are recommended:

1. High-Concurrency Serving

The model is validated for high-concurrency workloads. Using vLLM or SGLang, configure the server to handle up to 256 concurrent sequences (e.g., --max-num-seqs 256 in vLLM) to keep the GPU fully saturated.

2. Vision Token Compression

Pixtral's vision encoder produces image tokens in proportion to input resolution, so large frames can dominate the context window. To reduce latency:

  • Image Resizing: Resize input frames to 512x512 or 768x768 before encoding.
  • Limit MM Per Prompt: Use --limit-mm-per-prompt '{"image": 1}' in vLLM to prevent memory spikes from multiple images per request.

3. Prefix Caching Strategy

For repeated system prompts (e.g., "Analyze this surveillance frame..."), enable Prefix Caching. This reduces the prefill time for subsequent requests by reusing the KV-cache of the common instruction prefix.

  • vLLM: --enable-prefix-caching
  • SGLang: Enabled by default via RadixAttention.
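Prefix caching only pays off when the prefix is byte-for-byte identical across requests. A minimal sketch of structuring requests so the cached KV entries can actually be reused (the prompt text is illustrative):

```python
# Keep the instruction prefix constant across requests so the server's
# prefix cache (vLLM --enable-prefix-caching / SGLang RadixAttention)
# can reuse its KV-cache entries. Any variation, even whitespace,
# forces a fresh prefill of the prefix.
SYSTEM_PREFIX = (
    "Analyze this surveillance frame in detail. Describe all people "
    "(clothing, actions), vehicles (color, type), and interactions."
)

def build_messages(frame_id: str) -> list[dict]:
    """Per-request payload: shared prefix first, varying content after."""
    return [
        {"role": "system", "content": SYSTEM_PREFIX},
        {"role": "user", "content": f"Frame ID: {frame_id}"},
    ]

a = build_messages("cam01-000123")
b = build_messages("cam01-000124")
assert a[0] == b[0]  # identical prefix -> prefill reuse on the server
```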

4. Asynchronous Processing

Implement an asynchronous worker pipeline (e.g., using asyncio and aiohttp) to send batches of images. This hides network latency and ensures the GPU is never idle.
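A minimal sketch of such a pipeline using only asyncio (the `caption_frame` coroutine is a stand-in for a real aiohttp POST to the serving endpoint; the concurrency bound of 256 mirrors the serving configuration above):

```python
import asyncio

MAX_CONCURRENCY = 256  # match the server's --max-num-seqs

async def caption_frame(frame_id: str, sem: asyncio.Semaphore) -> str:
    """Stand-in for an aiohttp POST to the captioning endpoint."""
    async with sem:
        await asyncio.sleep(0.01)  # simulated network + inference latency
        return f"caption for {frame_id}"

async def run_pipeline(frame_ids: list[str]) -> list[str]:
    # Bound in-flight requests so the client never overruns the server,
    # while keeping enough concurrency to hide per-request latency.
    sem = asyncio.Semaphore(MAX_CONCURRENCY)
    tasks = [caption_frame(fid, sem) for fid in frame_ids]
    return await asyncio.gather(*tasks)

captions = asyncio.run(run_pipeline([f"frame-{i}" for i in range(8)]))
print(captions[0])  # -> caption for frame-0
```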

📦 Deployment & Serving

1. vLLM (Recommended Baseline)

vllm serve ./pixtral-12b-W4A16 \
  --dtype bfloat16 \
  --max-model-len 4096 \
  --max-num-seqs 256 \
  --enable-prefix-caching \
  --kv-cache-dtype fp8 \
  --limit-mm-per-prompt '{"image": 1}' \
  --gpu-memory-utilization 0.90

2. SGLang (Extreme Throughput)

python -m sglang.launch_server \
  --model-path ./pixtral-12b-W4A16 \
  --dtype bfloat16 \
  --quantization gptq \
  --port 8000 \
  --mem-fraction-static 0.9
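Both servers expose an OpenAI-compatible /v1/chat/completions endpoint. A minimal, stdlib-only sketch of building a request payload with a base64-encoded frame (in practice you would POST `payload` with aiohttp or requests; the model name is an assumption and must match what the server reports):

```python
import base64
import json

def build_payload(image_bytes: bytes, prompt: str,
                  model: str = "pixtral-12b-W4A16") -> dict:
    """OpenAI-compatible chat request with one image_url part (data URI)."""
    b64 = base64.b64encode(image_bytes).decode("ascii")
    return {
        "model": model,
        "max_tokens": 512,
        "messages": [{
            "role": "user",
            "content": [
                {"type": "text", "text": prompt},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/jpeg;base64,{b64}"}},
            ],
        }],
    }

payload = build_payload(b"\xff\xd8fake-jpeg", "Analyze this surveillance frame.")
print(json.dumps(payload)[:40])
```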

🔍 Sample Surveillance Example

Prompt:

Analyze this surveillance frame in detail. Describe all people (clothing, actions), 
vehicles (color, type), and interactions. Note any safety hazards or suspicious behavior.

Example Output:

People:

  1. Person on the Left: Wearing a light-colored coat, carrying a black bag, walking towards the background entrance.
  2. Person on the Right: Wearing a dark-colored coat and a face mask, standing near the entrance.

Scene Context: The image shows a public lobby/waiting area with armchairs and tables. Lighting is adequate and no immediate safety hazards or suspicious behaviors are detected.

馃 Contact & Credits
