# GLM-4.7-REAP-40

**40% expert-pruned GLM-4.7 (358B -> 218B parameters)**

A REAP-pruned version of zai/glm-4.7 (GLM-4.7), reducing the model from 358B to ~218B parameters by pruning 40% of the MoE experts while preserving model quality.
## Acknowledgments

This work was made possible by:

- **Prime Intellect**: compute sponsorship (8x H200 cluster)
- **Cerebras**: the REAP methodology (arXiv:2510.13999)
- **Intel**: the AutoRound quantization framework
## Model Details
| Property | Value |
|---|---|
| Base Model | zai/glm-4.7 (GLM-4.7) |
| Architecture | Mixture of Experts (MoE) |
| Original Parameters | 358B |
| Pruned Parameters | ~218B |
| Compression Ratio | 40% experts removed |
| Experts per Layer | 160 -> 96 |
| MoE Layers | 92 |
| Precision | BF16 |
| Model Size on Disk | ~407GB |
| VRAM Required | ~407GB (full precision) |
## Related Models
| Model | Params | Experts | Size | VRAM | Link |
|---|---|---|---|---|---|
| GLM-4.7 (Base) | 358B | 160 | ~700GB | ~700GB | zai/glm-4.7 |
| GLM-4.7-REAP-30 | 251B | 112 | ~470GB | ~470GB | 0xSero/GLM-4.7-REAP-30 |
| GLM-4.7-REAP-35 | 233B | 104 | ~439GB | ~439GB | 0xSero/GLM-4.7-REAP-35 |
| GLM-4.7-REAP-40 | 218B | 96 | ~407GB | ~407GB | 0xSero/GLM-4.7-REAP-40 |
| GLM-4.7-REAP-50 | 179B | 80 | ~345GB | ~345GB | 0xSero/GLM-4.7-REAP-50 |
| GLM-4.7-REAP-30-W4A16 | 251B | 112 | ~118GB | ~130GB | 0xSero/GLM-4.7-REAP-30-W4A16 |
| GLM-4.7-REAP-35-W4A16 | 233B | 104 | ~110GB | ~120GB | 0xSero/GLM-4.7-REAP-35-W4A16 |
| GLM-4.7-REAP-40-W4A16 | 218B | 96 | ~108GB | ~115GB | 0xSero/GLM-4.7-REAP-40-W4A16 |
| GLM-4.7-REAP-50-W4A16 | 179B | 80 | ~92GB | ~100GB | 0xSero/GLM-4.7-REAP-50-W4A16 |
## Usage

### Basic Inference
```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_path = "0xSero/GLM-4.7-REAP-40"

tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_path,
    torch_dtype=torch.bfloat16,
    device_map="auto",            # shard across all available GPUs
    trust_remote_code=True,
)

messages = [{"role": "user", "content": "Write a Python function to check if a number is prime."}]
inputs = tokenizer.apply_chat_template(messages, return_tensors="pt", add_generation_prompt=True)
inputs = inputs.to(model.device)

outputs = model.generate(
    inputs,
    max_new_tokens=512,
    temperature=0.7,
    do_sample=True,
    pad_token_id=tokenizer.eos_token_id,
)

# Decode only the newly generated tokens, skipping the prompt.
response = tokenizer.decode(outputs[0][inputs.shape[1]:], skip_special_tokens=True)
print(response)
```
### With vLLM (Recommended for Production)
```python
from vllm import LLM, SamplingParams

llm = LLM(
    model="0xSero/GLM-4.7-REAP-40",
    tensor_parallel_size=8,   # use all 8 GPUs
    trust_remote_code=True,
    dtype="bfloat16",
)

sampling_params = SamplingParams(temperature=0.7, max_tokens=512)
outputs = llm.generate(["Write a Python function to check if a number is prime."], sampling_params)
print(outputs[0].outputs[0].text)
```
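vLLM can also expose an OpenAI-compatible API (started with, e.g., `vllm serve 0xSero/GLM-4.7-REAP-40 --tensor-parallel-size 8 --trust-remote-code`). Below is a minimal client sketch, assuming a server listening on the default port 8000; the `openai` client usage here is illustrative and not part of this repository:

```python
# Query a local vLLM OpenAI-compatible server (assumes `vllm serve ...` is running).
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")  # vLLM ignores the key

response = client.chat.completions.create(
    model="0xSero/GLM-4.7-REAP-40",
    messages=[{"role": "user", "content": "Write a Python function to check if a number is prime."}],
    temperature=0.7,
    max_tokens=512,
)
print(response.choices[0].message.content)
```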
## REAP Methodology

REAP (Router-weighted Expert Activation Pruning) is a state-of-the-art MoE pruning technique developed by Cerebras that identifies and removes the least important experts based on their activation patterns.
### How REAP Works

```
Original GLM-4.7                           REAP-Pruned
================                           ============

Input -----> [Router] -----> [Expert 1]   (high activation) --> KEEP
                      |-----> [Expert 2]   (low activation)  --> PRUNE
                      |-----> [Expert 3]   (high activation) --> KEEP
                      |-----> ...
                      |-----> [Expert 160] (low activation)  --> PRUNE

Result: 160 experts -> 96 experts (60% retained)
```
### REAP Algorithm Steps

1. **Calibration**: run representative samples through the model.
2. **Saliency scoring**: compute an importance score for each expert from:
   - router assignment frequency (how often the expert is selected),
   - activation magnitude weighted by routing probability,
   - angular-distance clustering (identifies redundant experts); see the sketch below.
3. **Pruning**: remove the lowest-scoring experts (here, 40%).
4. **Router adjustment**: update the router weights for the remaining experts.
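The scoring in step 2 is simple to state concretely. Below is a minimal NumPy sketch of saliency scoring and angular-distance redundancy detection; the array names, shapes, and random inputs are illustrative assumptions, not the REAP reference implementation:

```python
# Illustrative saliency scoring for one MoE layer (not the REAP reference code).
import numpy as np

rng = np.random.default_rng(42)
n_tokens, n_experts = 4096, 160          # calibration tokens, experts per layer
routing_probs = rng.dirichlet(np.ones(n_experts), size=n_tokens)  # router softmax outputs
output_norms = rng.random((n_tokens, n_experts))                  # ||expert output|| per token

# Saliency: expert activation magnitude weighted by routing probability,
# averaged over the calibration set (step 2 above).
saliency = (routing_probs * output_norms).mean(axis=0)

# Angular distance between expert weight vectors flags redundancy:
# a small distance means near-duplicate experts, so one of the pair is safe to prune.
expert_weights = rng.standard_normal((n_experts, 512))            # flattened expert weights
unit = expert_weights / np.linalg.norm(expert_weights, axis=1, keepdims=True)
angular_dist = np.arccos(np.clip(unit @ unit.T, -1.0, 1.0)) / np.pi
np.fill_diagonal(angular_dist, np.inf)
i, j = np.unravel_index(np.argmin(angular_dist), angular_dist.shape)
print(f"most redundant pair: experts {i} and {j}")

# Prune the 40% of experts with the lowest saliency (step 3 above).
keep = np.argsort(saliency)[int(0.40 * n_experts):]
print(f"kept {len(keep)} of {n_experts} experts")                 # 96 of 160
```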
### Calibration Dataset
We used a diverse calibration mix optimized for code and agentic tasks:
| Dataset | Samples | Purpose |
|---|---|---|
| evol-codealpaca-v1 | 700 | Code generation |
| xlam-function-calling-60k | 330 | Function/tool calling |
| SWE-smith-trajectories | 330 | Multi-turn agentic tasks |
| **Total** | 1,360 | |
Combined dataset: 0xSero/glm47-reap-calibration-v2
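To assemble a similar mix yourself, something like the following works with the `datasets` library. The hub paths for the three sources are assumptions (only the combined dataset above is confirmed), the `to_text` normalization is illustrative, and the per-source sample counts come from the table:

```python
# Sketch: build a calibration mix like the one above (hub paths are assumptions).
from datasets import load_dataset, concatenate_datasets

sources = [
    ("theblackcat102/evol-codealpaca-v1", 700),    # code generation (assumed path)
    ("Salesforce/xlam-function-calling-60k", 330), # function/tool calling (assumed path)
    ("SWE-bench/SWE-smith-trajectories", 330),     # multi-turn agentic tasks (assumed path)
]

def to_text(example):
    # Each source uses different column names; flatten whatever is present
    # into a single "text" field. This mapping is illustrative only.
    return {"text": str(example)}

parts = []
for name, n in sources:
    ds = load_dataset(name, split="train").shuffle(seed=42).select(range(n))
    parts.append(ds.map(to_text, remove_columns=ds.column_names))

calibration = concatenate_datasets(parts)   # 1,360 samples, one "text" column
print(len(calibration))
```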
### REAP Configuration

```yaml
compression_ratio: 0.40
distance_measure: angular
seed: 42
samples: 1360
model_max_length: 2048
prune_method: reap
```
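As a quick sanity check, the compression ratio maps directly onto the per-layer expert count (assuming uniform pruning across all 92 MoE layers):

```python
# Expert counts implied by the config above (uniform pruning assumed).
total_experts = 160                       # per MoE layer in base GLM-4.7
compression_ratio = 0.40                  # fraction of experts removed
kept = round(total_experts * (1 - compression_ratio))
print(kept)                               # 96, matching "Experts per Layer: 160 -> 96"
```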
## Reproduce This Model

### Prerequisites
```bash
# Clone the REAP repository
git clone https://github.com/Cerebras/reap
cd reap
pip install -e .

# Or use our fork with GLM-4.7 support
git clone https://github.com/0xSero/reap
```
### Run REAP Pruning
```bash
# Full calibration run (~5 hours on 8x H200)
python src/reap/prune.py \
    --model-name zai/glm-4.7 \
    --dataset-name 0xSero/glm47-reap-calibration-v2 \
    --compression-ratio 0.40 \
    --prune-method reap \
    --seed 42 \
    --samples_per_category 1360 \
    --model_max_length 2048 \
    --distance_measure angular \
    --record_pruning_metrics_only true \
    --output_file_name observations_1360_angular-seed_42.pt

# With observation reuse (<5 minutes), if you already have
# the observations file from a previous run:
python src/reap/prune.py \
    --model-name zai/glm-4.7 \
    --compression-ratio 0.40 \
    --load_observations observations_1360_angular-seed_42.pt \
    --prune-method reap \
    --seed 42
```
### Observation Reuse

The key insight is that REAP's calibration phase computes per-expert saliency scores that are independent of the compression ratio. Once computed, you can generate models at any pruning level without re-running calibration:
```bash
# observations_1360_angular-seed_42.pt contains:
#   - expert activation statistics per layer
#   - router weight distributions
#   - angular distance matrices
# Size: ~19MB for GLM-4.7

# Generate multiple variants from the same observations file:
for ratio in 0.30 0.35 0.40 0.50; do
    python src/reap/prune.py \
        --model-name zai/glm-4.7 \
        --compression-ratio $ratio \
        --load_observations observations_1360_angular-seed_42.pt \
        --prune-method reap \
        --seed 42
done
```
## Hardware Requirements
| Configuration | VRAM | GPUs | Notes |
|---|---|---|---|
| Full Precision (BF16) | ~407GB | 8x H100/H200 | Recommended |
| W4A16 Quantized | ~115GB | 4x A100 80GB | Good quality/performance |
| CPU Offload | ~64GB GPU + 512GB RAM | 1x GPU + CPU | Slow but works |
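For the CPU-offload row, transformers can split the model between one GPU and system RAM via accelerate's `max_memory` budgets. A minimal sketch using the budgets from the table; expect very slow generation, and prefer the W4A16 variant if it fits your hardware:

```python
# CPU-offload loading: one GPU plus system RAM (slow, but fits the table's budget).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_path = "0xSero/GLM-4.7-REAP-40"
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_path,
    torch_dtype=torch.bfloat16,
    device_map="auto",                          # accelerate places layers automatically
    max_memory={0: "64GiB", "cpu": "512GiB"},   # budgets from the table above
    offload_folder="offload",                   # spill anything that still doesn't fit to disk
    trust_remote_code=True,
)
```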
## Benchmarks

*Coming soon: benchmarks are in progress.*
| Benchmark | Base GLM-4.7 | REAP-40 | Delta |
|---|---|---|---|
| HumanEval | - | - | - |
| MBPP | - | - | - |
| GSM8K | - | - | - |
| MMLU | - | - | - |
## Citation

If you use this model, please cite:
```bibtex
@article{jones2025reap,
  title   = {REAP: Router-weighted Expert Activation Pruning for Efficient Mixture-of-Experts},
  author  = {Jones and others},
  journal = {arXiv preprint arXiv:2510.13999},
  year    = {2025}
}

@misc{glm47reap,
  title        = {GLM-4.7-REAP: Expert-Pruned GLM-4.7 Models},
  author       = {0xSero},
  year         = {2026},
  howpublished = {\url{https://huggingface.co/0xSero/GLM-4.7-REAP-40}}
}
```
## Limitations
- Requires multi-GPU setup for inference (~407GB VRAM)
- Some capability degradation vs full model (expected with 40% pruning)
- Calibrated primarily on code/agentic tasks; may have reduced performance on other domains
- For lower VRAM requirements, use the W4A16 variant
## License

Apache 2.0 (inherited from the base GLM-4.7 model)

*Built with REAP by Cerebras | Sponsored by Prime Intellect*