GLM-4.7-REAP-40

40% Expert-Pruned GLM-4.7 (358B -> 218B parameters)

A REAP-pruned version of zai/glm-4.7 (GLM-4.7), reducing the model from 358B to ~218B parameters by pruning 40% of MoE experts while preserving model quality.

Acknowledgments

This work was made possible by:

  • Cerebras, developers of the REAP pruning method and reference implementation
  • Prime Intellect, who sponsored the compute used for calibration and pruning

Model Details

| Property | Value |
|---|---|
| Base Model | zai/glm-4.7 (GLM-4.7) |
| Architecture | Mixture of Experts (MoE) |
| Original Parameters | 358B |
| Pruned Parameters | ~218B |
| Compression Ratio | 40% of experts removed |
| Experts per Layer | 160 -> 96 |
| MoE Layers | 92 |
| Precision | BF16 |
| Model Size on Disk | ~407GB |
| VRAM Required | ~407GB (full precision) |

Related Models

| Model | Params | Experts | Size | VRAM | Link |
|---|---|---|---|---|---|
| GLM-4.7 (Base) | 358B | 160 | ~700GB | ~700GB | zai/glm-4.7 |
| GLM-4.7-REAP-30 | 251B | 112 | ~470GB | ~470GB | 0xSero/GLM-4.7-REAP-30 |
| GLM-4.7-REAP-35 | 233B | 104 | ~439GB | ~439GB | 0xSero/GLM-4.7-REAP-35 |
| GLM-4.7-REAP-40 | 218B | 96 | ~407GB | ~407GB | 0xSero/GLM-4.7-REAP-40 |
| GLM-4.7-REAP-50 | 179B | 80 | ~345GB | ~345GB | 0xSero/GLM-4.7-REAP-50 |
| GLM-4.7-REAP-30-W4A16 | 251B | 112 | ~118GB | ~130GB | 0xSero/GLM-4.7-REAP-30-W4A16 |
| GLM-4.7-REAP-35-W4A16 | 233B | 104 | ~110GB | ~120GB | 0xSero/GLM-4.7-REAP-35-W4A16 |
| GLM-4.7-REAP-40-W4A16 | 218B | 96 | ~108GB | ~115GB | 0xSero/GLM-4.7-REAP-40-W4A16 |
| GLM-4.7-REAP-50-W4A16 | 179B | 80 | ~92GB | ~100GB | 0xSero/GLM-4.7-REAP-50-W4A16 |

Usage

Basic Inference

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_path = "0xSero/GLM-4.7-REAP-40"

# Load the tokenizer and shard the BF16 weights across all visible GPUs (~407GB total)
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_path,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True
)

# Build the prompt with the model's chat template
messages = [{"role": "user", "content": "Write a Python function to check if a number is prime."}]
inputs = tokenizer.apply_chat_template(messages, return_tensors="pt", add_generation_prompt=True)
inputs = inputs.to(model.device)

outputs = model.generate(
    inputs,
    max_new_tokens=512,
    temperature=0.7,
    do_sample=True,
    pad_token_id=tokenizer.eos_token_id
)

# Decode only the newly generated tokens (everything after the prompt)
response = tokenizer.decode(outputs[0][inputs.shape[1]:], skip_special_tokens=True)
print(response)

With vLLM (Recommended for Production)

from vllm import LLM, SamplingParams

llm = LLM(
    model="0xSero/GLM-4.7-REAP-40",
    tensor_parallel_size=8,  # Use all 8 GPUs
    trust_remote_code=True,
    dtype="bfloat16"
)

sampling_params = SamplingParams(temperature=0.7, max_tokens=512)
outputs = llm.generate(["Write a Python function to check if a number is prime."], sampling_params)
print(outputs[0].outputs[0].text)
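
For serving, vLLM also ships an OpenAI-compatible HTTP server. A minimal sketch of querying it with the standard openai client, assuming the server was started separately with vllm serve 0xSero/GLM-4.7-REAP-40 --tensor-parallel-size 8 --trust-remote-code on the default port 8000:

# Query a vLLM OpenAI-compatible server running on localhost:8000
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")  # vLLM accepts any key unless --api-key is set

response = client.chat.completions.create(
    model="0xSero/GLM-4.7-REAP-40",
    messages=[{"role": "user", "content": "Write a Python function to check if a number is prime."}],
    temperature=0.7,
    max_tokens=512,
)
print(response.choices[0].message.content)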

REAP Methodology

REAP (Router-weighted Expert Activation Pruning) is a state-of-the-art MoE pruning technique developed by Cerebras that identifies and removes the least important experts based on their activation patterns.

How REAP Works

                    Original GLM-4.7                          REAP-Pruned
                    ================                          ============

    Input -----> [Router] -----> [Expert 1]  (high activation) --> KEEP
                         |-----> [Expert 2]  (low activation)  --> PRUNE
                         |-----> [Expert 3]  (high activation) --> KEEP
                         |-----> ...
                         |-----> [Expert 160] (low activation) --> PRUNE

    Result: 160 experts -> 96 experts (60% retained)

REAP Algorithm Steps

  1. Calibration Phase: Run representative samples through the model
  2. Saliency Scoring: Compute importance scores for each expert (see the sketch after this list):
    • Router assignment frequency (how often selected)
    • Activation magnitude weighted by routing probability
    • Angular distance clustering (identifies redundant experts)
  3. Pruning: Remove the lowest-scoring experts (40% removed)
  4. Router Adjustment: Update router weights for remaining experts
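
A minimal sketch of the scoring and selection steps described above, assuming per-token routing probabilities and expert output magnitudes have already been collected during calibration (names like expert_saliency are illustrative, not the REAP codebase API):

import torch

def expert_saliency(gate_probs: torch.Tensor, expert_out_norms: torch.Tensor) -> torch.Tensor:
    # gate_probs:       [num_tokens, num_experts] routing probabilities from calibration
    # expert_out_norms: [num_tokens, num_experts] L2 norm of each expert's output per token
    # Saliency = router-probability-weighted activation magnitude, averaged over calibration tokens
    return (gate_probs * expert_out_norms).mean(dim=0)  # [num_experts]

def experts_to_keep(saliency: torch.Tensor, compression_ratio: float = 0.40) -> torch.Tensor:
    # Keep the highest-scoring experts, e.g. 160 -> 96 per layer at a 0.40 compression ratio
    num_keep = round(saliency.numel() * (1 - compression_ratio))
    return torch.topk(saliency, num_keep).indices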

Calibration Dataset

We used a diverse calibration mix optimized for code and agentic tasks:

| Dataset | Samples | Purpose |
|---|---|---|
| evol-codealpaca-v1 | 700 | Code generation |
| xlam-function-calling-60k | 330 | Function/tool calling |
| SWE-smith-trajectories | 330 | Multi-turn agentic tasks |
| Total | 1,360 | |

Combined dataset: 0xSero/glm47-reap-calibration-v2
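
To inspect the calibration mix, the combined dataset can be loaded directly with the datasets library; a minimal sketch, assuming it exposes a standard train split:

from datasets import load_dataset

# Combined 1,360-sample calibration mix (code, function calling, agentic trajectories)
calib = load_dataset("0xSero/glm47-reap-calibration-v2", split="train")
print(len(calib))
print(calib[0])  # field layout follows the source datasets listed above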

REAP Configuration

compression_ratio: 0.40
distance_measure: angular
seed: 42
samples: 1360
model_max_length: 2048
prune_method: reap

Reproduce This Model

Prerequisites

# Clone REAP repository
git clone https://github.com/Cerebras/reap
cd reap
pip install -e .

# Or use our fork with GLM-4.7 support
git clone https://github.com/0xSero/reap

Run REAP Pruning

# Full calibration run (~5 hours on 8x H200)
python src/reap/prune.py \
    --model-name zai/glm-4.7 \
    --dataset-name 0xSero/glm47-reap-calibration-v2 \
    --compression-ratio 0.40 \
    --prune-method reap \
    --seed 42 \
    --samples_per_category 1360 \
    --model_max_length 2048 \
    --distance_measure angular \
    --record_pruning_metrics_only true \
    --output_file_name observations_1360_angular-seed_42.pt

# With observation reuse (instant, <5 minutes)
# If you already have the observations file from a previous run:
python src/reap/prune.py \
    --model-name zai/glm-4.7 \
    --compression-ratio 0.40 \
    --load_observations observations_1360_angular-seed_42.pt \
    --prune-method reap \
    --seed 42

Observation Reuse

The key insight: REAP's calibration phase computes per-expert saliency scores that are independent of the compression ratio. Once computed, you can instantly generate models at any pruning level:

# observations_1360_angular-seed_42.pt contains:
# - Expert activation statistics per layer
# - Router weight distributions
# - Angular distance matrices
# Size: ~19MB for GLM-4.7

# Generate multiple variants by re-running the pruning script with the cached observations:
import subprocess
for ratio in (0.30, 0.35, 0.40, 0.50):
    subprocess.run(["python", "src/reap/prune.py", "--model-name", "zai/glm-4.7",
                    "--compression-ratio", str(ratio), "--prune-method", "reap", "--seed", "42",
                    "--load_observations", "observations_1360_angular-seed_42.pt"], check=True)

Hardware Requirements

| Configuration | VRAM | GPUs | Notes |
|---|---|---|---|
| Full Precision (BF16) | ~407GB | 8x H100/H200 | Recommended |
| W4A16 Quantized | ~115GB | 4x A100 80GB | Good quality/performance |
| CPU Offload | ~64GB GPU + 512GB RAM | 1x GPU + CPU | Slow but works |
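
For the CPU Offload row, Transformers/Accelerate can cap per-GPU memory and spill the remaining weights to CPU RAM. A minimal sketch; the exact max_memory split here is an assumption, and generation at this scale will be slow:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_path = "0xSero/GLM-4.7-REAP-40"
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)

# Cap GPU 0 at roughly 64GiB and let Accelerate place the remaining weights in CPU RAM
model = AutoModelForCausalLM.from_pretrained(
    model_path,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    max_memory={0: "64GiB", "cpu": "512GiB"},
    trust_remote_code=True,
)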

Benchmarks

Coming soon - benchmarks are in progress

| Benchmark | Base GLM-4.7 | REAP-40 | Delta |
|---|---|---|---|
| HumanEval | - | - | - |
| MBPP | - | - | - |
| GSM8K | - | - | - |
| MMLU | - | - | - |

Citation

If you use this model, please cite:

@article{jones2025reap,
  title={REAP: Router-Experts Activation Pruning for Efficient Mixture-of-Experts},
  author={Jones, et al.},
  journal={arXiv preprint arXiv:2510.13999},
  year={2025}
}

@misc{glm47reap,
  title={GLM-4.7-REAP: Expert-Pruned GLM-4 Models},
  author={0xSero},
  year={2026},
  howpublished={\url{https://huggingface.co/0xSero/GLM-4.7-REAP-40}}
}

Limitations

  • Requires multi-GPU setup for inference (~407GB VRAM)
  • Some capability degradation vs full model (expected with 40% pruning)
  • Calibrated primarily on code/agentic tasks; may have reduced performance on other domains
  • For lower VRAM requirements, consider the W4A16 variant

License

Apache 2.0 (inherited from base GLM-4 model)


Built with REAP by Cerebras | Sponsored by Prime Intellect
