Qwen2.5-3B-Instruct-HXQ

1.6x smaller. HellaSwag 74.9%. Best fidelity in the lineup.

Qwen2.5-3B-Instruct compressed from 6.0 GB to 3.8 GB with only +0.69% PPL delta. Downstream task scores preserved after 1.6x compression. No calibration data. No architecture-specific tuning. Just pip install and from_pretrained().

Install and Run

pip install "helix-substrate[hf]"

import helix_substrate  # registers the HXQ quantizer with HuggingFace
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("EchoLabs33/qwen2.5-3b-instruct-helix")
tokenizer = AutoTokenizer.from_pretrained("EchoLabs33/qwen2.5-3b-instruct-helix")

inputs = tokenizer("The capital of France is", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

That's it. import helix_substrate registers the quantizer. from_pretrained() handles the rest automatically.

Downstream Benchmarks

Evaluated with lm-evaluation-harness on an NVIDIA RTX 4090:

Benchmark                  HXQ (1.6x)
HellaSwag (acc_norm)       74.86%
ARC-Easy (acc_norm)        72.85%
ARC-Challenge (acc_norm)   48.72%

Task performance is preserved after 1.6x compression.
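
For reference, a minimal reproduction sketch using lm-evaluation-harness's Python API (illustrative only: assumes lm_eval >= 0.4, and the exact harness version and settings behind the reported numbers are not specified here):

import helix_substrate  # must be imported so the HXQ quantizer is registered
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=EchoLabs33/qwen2.5-3b-instruct-helix",
    tasks=["hellaswag", "arc_easy", "arc_challenge"],
)
print(results["results"])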

Compression Benchmark

                          Dense (BF16)                         HXQ
Size                      6.0 GB                               3.8 GB
Perplexity (WikiText-2)   5.495                                5.533 (+0.69%)
Compression ratio                                              1.6x
Compressed modules                                             252 HelixLinear layers
Architecture              Qwen2 (36 layers, GQA, 2 KV heads)   unchanged

Eval: WikiText-2 test split, 2048 tokens, stride 512.
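
A minimal sliding-window perplexity sketch matching those settings, reusing model and tokenizer from the Install section (this follows the standard HuggingFace recipe and is an illustration, not the exact eval script used):

import torch
from datasets import load_dataset

test = load_dataset("wikitext", "wikitext-2-raw-v1", split="test")
enc = tokenizer("\n\n".join(test["text"]), return_tensors="pt")

max_length, stride = 2048, 512
seq_len = enc.input_ids.size(1)
nlls, prev_end = [], 0
for begin in range(0, seq_len, stride):
    end = min(begin + max_length, seq_len)
    input_ids = enc.input_ids[:, begin:end].to(model.device)
    target_ids = input_ids.clone()
    target_ids[:, : -(end - prev_end)] = -100  # score only the new tokens
    with torch.no_grad():
        nlls.append(model(input_ids, labels=target_ids).loss)
    prev_end = end
    if end == seq_len:
        break

print(torch.exp(torch.stack(nlls).mean()))  # perplexity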

Good to Know

  • GPU and CPU supported — runs on any CUDA GPU or CPU via standard PyTorch. Fused kernels for additional speedup are in progress.
  • Fine-tunable via LoRA — compressed weights remain frozen, but LoRA adapters attach to each HelixLinear layer via HelixLinearSTE (a sketch follows this list). See helix-substrate for training infrastructure.
  • Requires helix-substrate — the quantizer is not built into transformers. You need pip install "helix-substrate[hf]".
  • Tied embeddings — lm_head shares embed_tokens, stored at full precision.
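
A generic peft-style sketch of the LoRA idea (illustrative only: the target module names are assumptions, and whether peft's default adapters wrap HelixLinear directly is not specified; helix-substrate's HelixLinearSTE path may differ):

from peft import LoraConfig, get_peft_model

lora_cfg = LoraConfig(r=16, lora_alpha=32, target_modules=["q_proj", "v_proj"])
model = get_peft_model(model, lora_cfg)  # compressed base weights stay frozen
model.print_trainable_parameters()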

What is HelixCode?

HelixCode is a universal weight compression codec based on vector quantization:

  • Each weight matrix is replaced by a 256-entry codebook (float32) + uint8 index matrix + optional sidecar corrections for outlier values
  • The compressed form is the executable — HelixLinear performs codebook[indices] @ x directly, no decompression step (see the sketch after this list)
  • Works on any nn.Linear regardless of architecture (Transformer, Mamba, MLP, CNN)
  • No calibration data required — unlike GPTQ/AWQ, codebooks are fit from the weights alone
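
To make the codec concrete, here is a hypothetical sketch of a scalar-VQ linear layer in the spirit of HelixLinear (names, shapes, and the outlier-sidecar handling are assumptions; see helix-substrate for the real implementation):

import torch
import torch.nn as nn

class VQLinearSketch(nn.Module):
    """Illustrative only: 256-entry scalar codebook + uint8 index matrix."""
    def __init__(self, in_features, out_features, n_codes=256):
        super().__init__()
        self.register_buffer("codebook", torch.zeros(n_codes))  # float32 codes
        self.register_buffer(
            "indices", torch.zeros(out_features, in_features, dtype=torch.uint8)
        )
        # Optional sidecar corrections for outlier weights are omitted here.

    def forward(self, x):
        # The compressed form is executable: look up codes, then matmul.
        # A fused kernel would avoid materializing the full weight matrix.
        w = self.codebook[self.indices.long()]  # (out_features, in_features)
        return x @ w.T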

How It Works

  1. import helix_substrate registers the hxq quantizer with HuggingFace
  2. from_pretrained() reads quantization_config.quant_method = "hxq" from config.json
  3. The quantizer replaces 252 nn.Linear modules with HelixLinear shells before weight loading
  4. Safetensors populates the codebook, indices, and sidecar buffers directly
  5. The model runs in compressed form — no decompression needed
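
Step 2 can be checked from Python; a quick sketch (assuming the hub config exposes the quantization block as a plain dict, which is typical):

import helix_substrate  # registers hxq, as in step 1
from transformers import AutoConfig

cfg = AutoConfig.from_pretrained("EchoLabs33/qwen2.5-3b-instruct-helix")
print(cfg.quantization_config["quant_method"])  # expected: "hxq"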

Why This Model

This is the fidelity champion: at +0.69% PPL, it has the lowest degradation of any model in the lineup. The 3B Instruct variant's weights compress exceptionally cleanly with scalar VQ, suggesting that HelixCode scales well with model size (larger models tend to compress more cleanly).

Compression Receipt

Compressed tensors:  252
Exact tensors:       182  (norms, embeddings, biases, tied lm_head)
Total keys:          1,190
Output size:         3,836 MB
Weight ratio:        1.6x
PPL delta:           +0.69% (5.533 vs 5.495 dense)
Eval: WikiText-2 test, 2048 tokens, stride=512
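
As a back-of-envelope check (an illustration, not official math): uint8 indices store 8 bits per weight versus 16 for BF16, so the 252 compressed layers alone would give 2.0x; the 182 exact tensors plus codebook and sidecar overhead bring the overall ratio down to the reported 1.6x.

layer_ratio = 16 / 8   # BF16 bits vs uint8 index bits: 2.0x on compressed layers
overall = 6.0 / 3.8    # reported sizes: ~1.58x, reported as 1.6x
print(layer_ratio, round(overall, 2))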

Companion Models

Same codec, same pip install, multiple architectures:

Model                               Architecture                  Ratio   PPL Delta
qwen2.5-14b-instruct-helix          Transformer                   3.4x    pending
qwen2.5-7b-instruct-helix           Transformer                   2.2x    +6.34%
qwen2.5-coder-3b-helix              Transformer (code)            1.6x    +1.92%
qwen2.5-coder-1.5b-instruct-helix   Transformer (code)            2.4x    +1.63%
tinyllama-1.1b-helix                Transformer                   4.0x    +0.78%
zamba2-2.7b-instruct-helix          Hybrid (Mamba2+Transformer)   1.8x    +6.59%
zamba2-1.2b-helix                   Hybrid (Mamba2+Transformer)   1.7x    +2.90%
mamba2-1.3b-helix                   Pure SSM (Mamba2)             2.1x    +8.0%
mamba-130m-helix                    Pure SSM                      3.8x    +18.4%

Citation

@software{helix_substrate_2026,
  title={Helix Substrate: Universal Weight Compression via HelixCode},
  author={EchoLabs},
  year={2026},
  url={https://github.com/echo313unfolding/helix-substrate}
}

License

Apache 2.0 (inherited from Qwen/Qwen2.5-3B-Instruct).
