Qwen2.5-3B-Instruct-HXQ
1.6x smaller. HellaSwag 74.9%. Best fidelity in the lineup.
Qwen2.5-3B-Instruct compressed from 6.0 GB to 3.8 GB with only +0.69% PPL delta. Downstream task scores preserved after 1.6x compression. No calibration data. No architecture-specific tuning. Just `pip install` and `from_pretrained()`.
Install and Run
pip install "helix-substrate[hf]"
import helix_substrate # registers the HXQ quantizer with HuggingFace
from transformers import AutoModelForCausalLM, AutoTokenizer
model = AutoModelForCausalLM.from_pretrained("EchoLabs33/qwen2.5-3b-instruct-helix")
tokenizer = AutoTokenizer.from_pretrained("EchoLabs33/qwen2.5-3b-instruct-helix")
inputs = tokenizer("The capital of France is", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
That's it. Importing `helix_substrate` registers the quantizer, and `from_pretrained()` handles the rest automatically.
Downstream Benchmarks
Evaluated with lm-evaluation-harness on an NVIDIA RTX 4090:
| Benchmark | HXQ (1.6x) |
|---|---|
| HellaSwag (acc_norm) | 74.86% |
| ARC-Easy (acc_norm) | 72.85% |
| ARC-Challenge (acc_norm) | 48.72% |
Task performance is preserved after 1.6x compression.
Compression Benchmark
| | Dense (BF16) | HXQ |
|---|---|---|
| Size | 6.0 GB | 3.8 GB |
| Perplexity (WikiText-2) | 5.495 | 5.533 (+0.69%) |
| Compression ratio | — | 1.6x |
| Compressed modules | — | 252 HelixLinear layers |
| Architecture | Qwen2 (36 layers, GQA, 2 KV heads) | unchanged |
Eval: WikiText-2 test split, 2048 tokens, stride 512.
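The evaluation setting above can be read as a sliding-window perplexity measurement. The sketch below assumes that "2048 tokens, stride 512" means a 2048-token context window advanced 512 tokens at a time, with each token scored exactly once; the function names `sliding_windows` and `perplexity` are illustrative and are not part of helix-substrate or of the evaluation script actually used.

```python
import math

def sliding_windows(seq_len, max_length=2048, stride=512):
    """Yield (begin, end, n_scored) for a strided perplexity evaluation.

    Each window covers tokens [begin, end). Only the last n_scored tokens
    contribute to the loss; the earlier tokens in the window are context.
    """
    prev_end = 0
    for begin in range(0, seq_len, stride):
        end = min(begin + max_length, seq_len)
        yield begin, end, end - prev_end
        prev_end = end
        if end == seq_len:
            break

def perplexity(nll_sums, counts):
    """Perplexity from per-window summed negative log-likelihoods."""
    return math.exp(sum(nll_sums) / sum(counts))

# Example: window layout for a 5,000-token text.
windows = list(sliding_windows(5000))
```

A smaller stride gives every scored token more context at the cost of more forward passes, so perplexity values are only comparable when both the dense and compressed models use the same window and stride.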
Good to Know
- GPU and CPU supported: runs on any CUDA GPU or CPU via standard PyTorch. Fused kernels for additional speedup are in progress.
- Fine-tunable via LoRA: compressed weights remain frozen, but LoRA adapters attach to each `HelixLinear` layer via `HelixLinearSTE`. See `helix-substrate` for training infrastructure.
- Requires `helix-substrate`: the quantizer is not built into transformers. You need `pip install "helix-substrate[hf]"`.
- Tied embeddings: `lm_head` shares `embed_tokens`, stored at full precision.
What is HelixCode?
HelixCode is a universal weight compression codec based on vector quantization:
- Each weight matrix is replaced by a 256-entry codebook (float32) + uint8 index matrix + optional sidecar corrections for outlier values
- The compressed form is the executable: `HelixLinear` performs `codebook[indices] @ x` directly, no decompression step
- Works on any `nn.Linear` regardless of architecture (Transformer, Mamba, MLP, CNN)
- No calibration data required: unlike GPTQ/AWQ, codebooks are fit from the weights alone
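To make the codebook idea concrete, here is a minimal, dependency-free sketch of the scheme described in the list above. It is not the HelixCode implementation: the quantile-based codebook fit, the sidecar layout (a dict of residuals for the worst-reconstructed weights), and every function name are assumptions chosen for illustration.

```python
import bisect
import random

def fit_codebook(weights, k=256):
    """Choose k codebook values as evenly spaced quantiles of the weights."""
    flat = sorted(w for row in weights for w in row)
    n = len(flat)
    return [flat[min(n - 1, (2 * i + 1) * n // (2 * k))] for i in range(k)]

def nearest_index(codebook, w):
    """Index of the entry closest to w; codebook must be sorted ascending."""
    j = bisect.bisect_left(codebook, w)
    if j == 0:
        return 0
    if j == len(codebook):
        return j - 1
    return j if codebook[j] - w < w - codebook[j - 1] else j - 1

def quantize(weights, k=256, n_outliers=0):
    """Return (codebook, indices, sidecar), fit from the weights alone."""
    codebook = fit_codebook(weights, k)
    indices = [[nearest_index(codebook, w) for w in row] for row in weights]
    # Hypothetical sidecar: exact residuals for the worst-reconstructed weights.
    errors = sorted(
        ((abs(w - codebook[indices[r][c]]), r, c)
         for r, row in enumerate(weights) for c, w in enumerate(row)),
        reverse=True,
    )
    sidecar = {(r, c): weights[r][c] - codebook[indices[r][c]]
               for _, r, c in errors[:n_outliers]}
    return codebook, indices, sidecar

def compressed_matvec(codebook, indices, sidecar, x):
    """Compute y = codebook[indices] @ x by gathering weights on the fly."""
    y = [sum(codebook[i] * xj for i, xj in zip(row, x)) for row in indices]
    for (r, c), residual in sidecar.items():
        y[r] += residual * x[c]
    return y

random.seed(0)
W = [[random.gauss(0.0, 0.02) for _ in range(64)] for _ in range(8)]
x = [random.gauss(0.0, 1.0) for _ in range(64)]
codebook, indices, sidecar = quantize(W, n_outliers=4)
y_dense = [sum(w * xj for w, xj in zip(row, x)) for row in W]
y_hxq = compressed_matvec(codebook, indices, sidecar, x)
print("max abs error:", max(abs(a - b) for a, b in zip(y_dense, y_hxq)))
```

Because the codebook has at most 256 entries, every index fits in a uint8, and the forward pass never materializes a dense copy of the weight matrix. A production kernel would do the same gather in vectorized form on the GPU or CPU.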
How It Works
- `import helix_substrate` registers the `hxq` quantizer with HuggingFace
- `from_pretrained()` reads `quantization_config.quant_method = "hxq"` from `config.json`
- The quantizer replaces 252 `nn.Linear` modules with `HelixLinear` shells before weight loading
- Safetensors populates the codebook, indices, and sidecar buffers directly
- The model runs in compressed form, with no decompression needed
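The register-then-dispatch flow in the steps above can be illustrated with a toy registry. This is a sketch of the general pattern only, not the transformers or helix-substrate internals: `QUANTIZERS`, `register_quantizer`, `load_hxq`, and `select_quantizer` are hypothetical names, and the only detail taken from this card is that `config.json` carries `quantization_config.quant_method = "hxq"`.

```python
import json

QUANTIZERS = {}  # hypothetical registry: quant_method name -> loader function

def register_quantizer(name):
    """Decorator that records a loader under a quant_method name."""
    def decorator(fn):
        QUANTIZERS[name] = fn
        return fn
    return decorator

@register_quantizer("hxq")  # stands in for the side effect of the import
def load_hxq(quant_config):
    return f"swap nn.Linear for HelixLinear shells ({quant_config['quant_method']})"

def select_quantizer(config_json):
    """Pick a loader from the text of a config.json, or None if dense."""
    quant_config = json.loads(config_json).get("quantization_config")
    if quant_config is None:
        return None  # ordinary dense checkpoint
    method = quant_config["quant_method"]
    if method not in QUANTIZERS:
        raise ValueError(f"unknown quant_method {method!r}: is the package imported?")
    return QUANTIZERS[method]

loader = select_quantizer('{"quantization_config": {"quant_method": "hxq"}}')
print(loader({"quant_method": "hxq"}))
```

The error branch mirrors why the import matters: if the package that registers `hxq` has not been imported, the loader has no way to interpret the checkpoint.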
Why This Model
This is the fidelity champion: at +0.69% PPL, it has the lowest degradation of any model in the lineup. The 3B Instruct variant's weights compress exceptionally cleanly with scalar VQ. That is consistent with HelixCode scaling with model size (larger models compressing better), though the companion models were compressed at different ratios, so the comparison is not like-for-like.
Compression Receipt
Compressed tensors: 252
Exact tensors: 182 (norms, embeddings, biases, tied lm_head)
Total keys: 1,190
Output size: 3,836 MB
Weight ratio: 1.6x
PPL delta: +0.69% (5.533 vs 5.495 dense)
Eval: WikiText-2 test, 2048 tokens, stride=512
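The two headline numbers in the receipt follow directly from the reported sizes and perplexities. The short check below reproduces that arithmetic; the helper names are illustrative, and the inputs are the rounded values from this card (6.0 GB, 3.8 GB, 5.495, 5.533).

```python
def compression_ratio(dense_gb, compressed_gb):
    """How many times smaller the compressed checkpoint is."""
    return dense_gb / compressed_gb

def ppl_delta_percent(dense_ppl, compressed_ppl):
    """Relative perplexity increase, in percent of the dense perplexity."""
    return 100.0 * (compressed_ppl - dense_ppl) / dense_ppl

ratio = compression_ratio(6.0, 3.8)
delta = ppl_delta_percent(5.495, 5.533)
print(f"ratio: {ratio:.1f}x, PPL delta: {delta:+.2f}%")
# prints "ratio: 1.6x, PPL delta: +0.69%"
```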
Companion Models
Same codec, same pip install, multiple architectures:
| Model | Architecture | Ratio | PPL Delta |
|---|---|---|---|
| qwen2.5-14b-instruct-helix | Transformer | 3.4x | pending |
| qwen2.5-7b-instruct-helix | Transformer | 2.2x | +6.34% |
| qwen2.5-coder-3b-helix | Transformer (code) | 1.6x | +1.92% |
| qwen2.5-coder-1.5b-instruct-helix | Transformer (code) | 2.4x | +1.63% |
| tinyllama-1.1b-helix | Transformer | 4.0x | +0.78% |
| zamba2-2.7b-instruct-helix | Hybrid (Mamba2+Transformer) | 1.8x | +6.59% |
| zamba2-1.2b-helix | Hybrid (Mamba2+Transformer) | 1.7x | +2.90% |
| mamba2-1.3b-helix | Pure SSM (Mamba2) | 2.1x | +8.0% |
| mamba-130m-helix | Pure SSM | 3.8x | +18.4% |
Citation
@software{helix_substrate_2026,
title={Helix Substrate: Universal Weight Compression via HelixCode},
author={EchoLabs},
year={2026},
url={https://github.com/echo313unfolding/helix-substrate}
}
License
Apache 2.0 (inherited from Qwen/Qwen2.5-3B-Instruct).