Built with glm-4

This repository provides a quantised version of the GLM-4-9b-Chat-HF model.

Licensing and Usage

This model is distributed under the glm-4-9b License.

  • Attribution: This model is "Built with glm-4".
  • Naming: The model name includes the required "glm-4" prefix.
  • Commercial Use: Users wishing to use this model for commercial purposes must complete the registration here.
  • Restrictions: Usage for military or illegal purposes is strictly prohibited.

Quantisation Overview

This model was quantised to NVFP4 (NVIDIA Float4) to reduce memory footprint and improve inference throughput while preserving chat quality.

  • Methodology: Quantised using the llmcompressor library with a one-shot calibration process.
  • Calibration Data: A custom-curated dataset was used, pulling from ultrachat_200k and LongAlign-10k to ensure the model handles both short-form and long-form context effectively.
  • Architecture: The process preserved the original 40-layer dense structure (layers 0-39) required for the 9B model architecture.
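
The exact recipe was not published with this card, but an llmcompressor one-shot recipe for this scheme could look roughly like the following. This is a sketch only: the modifier name and arguments are assumptions and vary between llmcompressor versions.

```yaml
# Hypothetical llmcompressor recipe sketch - not the exact recipe used here.
quant_stage:
  quant_modifiers:
    QuantizationModifier:
      targets: ["Linear"]   # quantise the dense linear layers (blocks 0-39)
      scheme: "NVFP4"       # W4A4 NVIDIA Float4 with microscaling
      ignore: ["lm_head"]   # output head commonly kept at higher precision
```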

Quantisation Details

  • Scheme: NVFP4 (NVIDIA Float4 with Microscaling).
  • Format: compressed-tensors.
  • Calibration: One-shot calibration using llmcompressor.
  • Calibration Data: Custom distribution weighted toward production-typical prompt lengths:
    • 256–1024 tokens: 90%
    • 1024–2048 tokens: 10%
  • Precision: W4A4 (4-bit Weights, 4-bit Activations)
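
The length weighting above can be sketched as follows. This is a stand-in, not the actual curation pipeline: the real calibration set drew tokenised samples from ultrachat_200k and LongAlign-10k, whereas this snippet only draws target lengths with the 90/10 weighting listed here.

```python
import random

# Stand-in sketch of the calibration length distribution described above.
BUCKETS = [(256, 1024), (1024, 2048)]  # [lo, hi) token-length ranges
WEIGHTS = [0.90, 0.10]                 # production-typical prompt mix

def sample_target_lengths(n: int, seed: int = 0) -> list[int]:
    """Draw n calibration prompt lengths following the weighted buckets."""
    rng = random.Random(seed)
    return [
        rng.randrange(lo, hi)
        for lo, hi in rng.choices(BUCKETS, weights=WEIGHTS, k=n)
    ]

lengths = sample_target_lengths(1000)
short_share = sum(l < 1024 for l in lengths) / len(lengths)
print(f"short-bucket share: {short_share:.2f}")  # close to 0.90
```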

Usage

This model is designed for runtimes compatible with the NVFP4 format, such as vLLM.

To avoid numerical instability (e.g., repetitive token loops), you must force the Blackwell-native inference path using the following environment variables.

export VLLM_FP4_ENABLED=1
export VLLM_USE_V1=1

If you run into the following error, you may need to set the environment variable TRITON_PTXAS_PATH to /usr/local/cuda/bin/ptxas:

triton.runtime.errors.PTXASError: PTXAS error: Internal Triton PTX codegen error
`ptxas` stderr:
ptxas fatal   : Value 'sm_121a' is not defined for option 'gpu-name'
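
In that case, point Triton at the CUDA toolkit's ptxas and confirm it recognises the target architecture. The path below assumes a standard CUDA install; sm_121a requires a sufficiently recent toolkit.

```shell
# Point Triton at the CUDA toolkit's ptxas (the variable is TRITON_PTXAS_PATH).
export TRITON_PTXAS_PATH=/usr/local/cuda/bin/ptxas

# A ptxas new enough to know the target arch avoids the codegen error above.
if [ -x "$TRITON_PTXAS_PATH" ]; then
  "$TRITON_PTXAS_PATH" --version
else
  echo "ptxas not found at $TRITON_PTXAS_PATH"
fi
```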

An example docker-compose configuration follows, assuming the model is stored in /models/glm-4-9b-chat-hf-nvfp4:

services:
  glm4-9b-nvfp4:
    image: nvcr.io/nvidia/vllm:26.01-py3
    container_name: glm4-9b-fp4
    ipc: host
    shm_size: "32gb"
    ports:
      - "8080:8080"
    volumes:
      - /models:/mnt/models:ro
    environment:
      # Required for Blackwell JIT compilation
      TRITON_PTXAS_PATH: /usr/local/cuda/bin/ptxas
      VLLM_FP4_ENABLED: 1
      VLLM_USE_V1: 1
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]
    command: >
      python3 -m vllm.entrypoints.openai.api_server
      --model /mnt/models/glm-4-9b-chat-hf-nvfp4
      --served-model-name glm-4-9b-chat
      --port 8080
      --quantization compressed-tensors
      --kv-cache-dtype fp8
      --max-model-len 32768
      --max-num-seqs 64
      --gpu-memory-utilization 0.75
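
Once the container is up, the server exposes the standard OpenAI-compatible chat completions API. A minimal stdlib-only smoke test is sketched below; the endpoint and model name match the compose file above, so adjust them if you changed the port or served model name.

```python
import json
import urllib.request

# Matches --port and --served-model-name in the compose file above.
ENDPOINT = "http://localhost:8080/v1/chat/completions"

payload = {
    "model": "glm-4-9b-chat",
    "messages": [{"role": "user", "content": "Say hello in one sentence."}],
    "max_tokens": 64,
    "temperature": 0.7,
}

def build_request(url: str, body: dict) -> urllib.request.Request:
    """Construct the POST request; send it only once the server is running."""
    return urllib.request.Request(
        url,
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

req = build_request(ENDPOINT, payload)
print(req.full_url, req.get_method())
# To actually send the request (requires the server to be up):
# with urllib.request.urlopen(req, timeout=120) as resp:
#     print(json.load(resp)["choices"][0]["message"]["content"])
```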

For questions regarding the original model license, contact license@zhipuai.cn.
