GLM-4.7-REAP-40

40% Expert-Pruned GLM-4.7 (358B -> 218B parameters)

A REAP-pruned version of zai/glm-4.7 (GLM-4.7), reducing the model from 358B to ~218B parameters by pruning 40% of MoE experts while preserving model quality.

Acknowledgments

This work was made possible by:

  • Cerebras, developers of the REAP pruning method and reference implementation
  • Prime Intellect, who sponsored the compute used for calibration and pruning

Model Details

| Property | Value |
|---|---|
| Base Model | zai/glm-4.7 (GLM-4.7) |
| Architecture | Mixture of Experts (MoE) |
| Original Parameters | 358B |
| Pruned Parameters | ~218B |
| Compression Ratio | 40% of experts removed |
| Experts per Layer | 160 -> 96 |
| MoE Layers | 92 |
| Precision | BF16 |
| Model Size on Disk | ~407GB |
| VRAM Required | ~407GB (full precision) |

Related Models

| Model | Params | Experts | Size | VRAM | Link |
|---|---|---|---|---|---|
| GLM-4.7 (Base) | 358B | 160 | ~700GB | ~700GB | zai/glm-4.7 |
| GLM-4.7-REAP-30 | 251B | 112 | ~470GB | ~470GB | 0xSero/GLM-4.7-REAP-30 |
| GLM-4.7-REAP-35 | 233B | 104 | ~439GB | ~439GB | 0xSero/GLM-4.7-REAP-35 |
| GLM-4.7-REAP-40 | 218B | 96 | ~407GB | ~407GB | 0xSero/GLM-4.7-REAP-40 |
| GLM-4.7-REAP-50 | 179B | 80 | ~345GB | ~345GB | 0xSero/GLM-4.7-REAP-50 |
| GLM-4.7-REAP-30-W4A16 | 251B | 112 | ~118GB | ~130GB | 0xSero/GLM-4.7-REAP-30-W4A16 |
| GLM-4.7-REAP-35-W4A16 | 233B | 104 | ~110GB | ~120GB | 0xSero/GLM-4.7-REAP-35-W4A16 |
| GLM-4.7-REAP-40-W4A16 | 218B | 96 | ~108GB | ~115GB | 0xSero/GLM-4.7-REAP-40-W4A16 |
| GLM-4.7-REAP-50-W4A16 | 179B | 80 | ~92GB | ~100GB | 0xSero/GLM-4.7-REAP-50-W4A16 |

Usage

Basic Inference

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_path = "0xSero/GLM-4.7-REAP-40"

# Load the tokenizer and shard the BF16 weights across all visible GPUs (~407GB total)
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_path,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True
)

# Build the prompt with the model's chat template
messages = [{"role": "user", "content": "Write a Python function to check if a number is prime."}]
inputs = tokenizer.apply_chat_template(messages, return_tensors="pt", add_generation_prompt=True)
inputs = inputs.to(model.device)

outputs = model.generate(
    inputs,
    max_new_tokens=512,
    temperature=0.7,
    do_sample=True,
    pad_token_id=tokenizer.eos_token_id
)

# Decode only the newly generated tokens (everything after the prompt)
response = tokenizer.decode(outputs[0][inputs.shape[1]:], skip_special_tokens=True)
print(response)

With vLLM (Recommended for Production)

from vllm import LLM, SamplingParams

llm = LLM(
    model="0xSero/GLM-4.7-REAP-40",
    tensor_parallel_size=8,  # Use all 8 GPUs
    trust_remote_code=True,
    dtype="bfloat16"
)

sampling_params = SamplingParams(temperature=0.7, max_tokens=512)
outputs = llm.generate(["Write a Python function to check if a number is prime."], sampling_params)
print(outputs[0].outputs[0].text)
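
For serving, vLLM also ships an OpenAI-compatible HTTP server. A minimal sketch of querying it with the standard openai client, assuming the server was started separately with vllm serve 0xSero/GLM-4.7-REAP-40 --tensor-parallel-size 8 --trust-remote-code on the default port 8000:

# Query a vLLM OpenAI-compatible server running on localhost:8000
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")  # vLLM accepts any key unless --api-key is set

response = client.chat.completions.create(
    model="0xSero/GLM-4.7-REAP-40",
    messages=[{"role": "user", "content": "Write a Python function to check if a number is prime."}],
    temperature=0.7,
    max_tokens=512,
)
print(response.choices[0].message.content)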

REAP Methodology

REAP (Router-weighted Expert Activation Pruning) is a state-of-the-art MoE pruning technique developed by Cerebras that identifies and removes the least important experts based on their activation patterns.

How REAP Works

                    Original GLM-4.7                          REAP-Pruned
                    ================                          ============

    Input -----> [Router] -----> [Expert 1]  (high activation) --> KEEP
                         |-----> [Expert 2]  (low activation)  --> PRUNE
                         |-----> [Expert 3]  (high activation) --> KEEP
                         |-----> ...
                         |-----> [Expert 160] (low activation) --> PRUNE

    Result: 160 experts -> 96 experts (60% retained)

REAP Algorithm Steps

  1. Calibration Phase: Run representative samples through the model
  2. Saliency Scoring: Compute importance scores for each expert (see the sketch after this list):
    • Router assignment frequency (how often selected)
    • Activation magnitude weighted by routing probability
    • Angular distance clustering (identifies redundant experts)
  3. Pruning: Remove the lowest-scoring experts (40% removed)
  4. Router Adjustment: Update router weights for remaining experts
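
A minimal sketch of the scoring and selection steps described above, assuming per-token routing probabilities and expert output magnitudes have already been collected during calibration (names like expert_saliency are illustrative, not the REAP codebase API):

import torch

def expert_saliency(gate_probs: torch.Tensor, expert_out_norms: torch.Tensor) -> torch.Tensor:
    # gate_probs:       [num_tokens, num_experts] routing probabilities from calibration
    # expert_out_norms: [num_tokens, num_experts] L2 norm of each expert's output per token
    # Saliency = router-probability-weighted activation magnitude, averaged over calibration tokens
    return (gate_probs * expert_out_norms).mean(dim=0)  # [num_experts]

def experts_to_keep(saliency: torch.Tensor, compression_ratio: float = 0.40) -> torch.Tensor:
    # Keep the highest-scoring experts, e.g. 160 -> 96 per layer at a 0.40 compression ratio
    num_keep = round(saliency.numel() * (1 - compression_ratio))
    return torch.topk(saliency, num_keep).indices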

Calibration Dataset

We used a diverse calibration mix optimized for code and agentic tasks:

| Dataset | Samples | Purpose |
|---|---|---|
| evol-codealpaca-v1 | 700 | Code generation |
| xlam-function-calling-60k | 330 | Function/tool calling |
| SWE-smith-trajectories | 330 | Multi-turn agentic tasks |
| Total | 1,360 | |

Combined dataset: 0xSero/glm47-reap-calibration-v2
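
To inspect the calibration mix, the combined dataset can be loaded directly with the datasets library; a minimal sketch, assuming it exposes a standard train split:

from datasets import load_dataset

# Combined 1,360-sample calibration mix (code, function calling, agentic trajectories)
calib = load_dataset("0xSero/glm47-reap-calibration-v2", split="train")
print(len(calib))
print(calib[0])  # field layout follows the source datasets listed above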

REAP Configuration

compression_ratio: 0.40
distance_measure: angular
seed: 42
samples: 1360
model_max_length: 2048
prune_method: reap

Reproduce This Model

Prerequisites

# Clone REAP repository
git clone https://github.com/Cerebras/reap
cd reap
pip install -e .

# Or use our fork with GLM-4.7 support
git clone https://github.com/0xSero/reap

Run REAP Pruning

# Full calibration run (~5 hours on 8x H200)
python src/reap/prune.py \
    --model-name zai/glm-4.7 \
    --dataset-name 0xSero/glm47-reap-calibration-v2 \
    --compression-ratio 0.40 \
    --prune-method reap \
    --seed 42 \
    --samples_per_category 1360 \
    --model_max_length 2048 \
    --distance_measure angular \
    --record_pruning_metrics_only true \
    --output_file_name observations_1360_angular-seed_42.pt

# With observation reuse (instant, <5 minutes)
# If you already have the observations file from a previous run:
python src/reap/prune.py \
    --model-name zai/glm-4.7 \
    --compression-ratio 0.40 \
    --load_observations observations_1360_angular-seed_42.pt \
    --prune-method reap \
    --seed 42

Observation Reuse

The key insight: REAP's calibration phase computes per-expert saliency scores that are independent of the compression ratio. Once computed, you can instantly generate models at any pruning level:

# observations_1360_angular-seed_42.pt contains:
# - Expert activation statistics per layer
# - Router weight distributions
# - Angular distance matrices
# Size: ~19MB for GLM-4.7

# Generate multiple variants by re-running the pruning script with the cached observations:
import subprocess
for ratio in (0.30, 0.35, 0.40, 0.50):
    subprocess.run(["python", "src/reap/prune.py", "--model-name", "zai/glm-4.7",
                    "--compression-ratio", str(ratio), "--prune-method", "reap", "--seed", "42",
                    "--load_observations", "observations_1360_angular-seed_42.pt"], check=True)

Hardware Requirements

| Configuration | VRAM | GPUs | Notes |
|---|---|---|---|
| Full Precision (BF16) | ~407GB | 8x H100/H200 | Recommended |
| W4A16 Quantized | ~115GB | 4x A100 80GB | Good quality/performance |
| CPU Offload | ~64GB GPU + 512GB RAM | 1x GPU + CPU | Slow but works |
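
For the CPU Offload row, Transformers/Accelerate can cap per-GPU memory and spill the remaining weights to CPU RAM. A minimal sketch; the exact max_memory split here is an assumption, and generation at this scale will be slow:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_path = "0xSero/GLM-4.7-REAP-40"
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)

# Cap GPU 0 at roughly 64GiB and let Accelerate place the remaining weights in CPU RAM
model = AutoModelForCausalLM.from_pretrained(
    model_path,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    max_memory={0: "64GiB", "cpu": "512GiB"},
    trust_remote_code=True,
)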

Benchmarks

Coming soon - benchmarks are in progress

| Benchmark | Base GLM-4.7 | REAP-40 | Delta |
|---|---|---|---|
| HumanEval | - | - | - |
| MBPP | - | - | - |
| GSM8K | - | - | - |
| MMLU | - | - | - |

Citation

If you use this model, please cite:

@article{jones2025reap,
  title={REAP: Router-Experts Activation Pruning for Efficient Mixture-of-Experts},
  author={Jones, et al.},
  journal={arXiv preprint arXiv:2510.13999},
  year={2025}
}

@misc{glm47reap,
  title={GLM-4.7-REAP: Expert-Pruned GLM-4 Models},
  author={0xSero},
  year={2026},
  howpublished={\url{https://huggingface.co/0xSero/GLM-4.7-REAP-40}}
}

Limitations

  • Requires multi-GPU setup for inference (~407GB VRAM)
  • Some capability degradation vs full model (expected with 40% pruning)
  • Calibrated primarily on code/agentic tasks; may have reduced performance on other domains
  • For lower VRAM requirements, consider the W4A16 variant

License

Apache 2.0 (inherited from base GLM-4 model)


Built with REAP by Cerebras | Sponsored by Prime Intellect
