# Qwen3.6-27B-uncensored-heretic-v2-NVFP4-MTP

NVIDIA FP4 (NVFP4) quantized version of llmfan46/Qwen3.6-27B-uncensored-heretic-v2, with full multimodal (vision) support and built-in MTP speculative decoding preserved.
## Model Details
- Base model: llmfan46/Qwen3.6-27B-uncensored-heretic-v2 (Heretic v1.2.0 MPOA abliteration of Qwen/Qwen3.6-27B)
- Architecture: `Qwen3_5ForConditionalGeneration` (hybrid Gated-DeltaNet + Gated full attention, MLA-style q_proj)
- Quantization: NVFP4 via nvidia-modelopt (`NVFP4_DEFAULT_CFG`)
- Calibration: 20 samples × 8192 seq from `cnn_dailymail` (sakamakismile recipe)
- Model size: ~20.6 GB (vs ~54 GB bf16 original)
- MTP head: bf16, restored from `Qwen/Qwen3.6-27B` (the abliterated base ships without MTP)
- Vision encoder: bf16 (unquantized, ~0.9 GB)
## What's quantized, what's not
| Component | Format | Notes |
|---|---|---|
| MLP (gate/up/down_proj) | NVFP4 | All 64 layers |
| Full attention (qkv/o_proj) | NVFP4 | 16 layers (every 4th) |
| Linear attention (in_proj_a/b/dt_proj/g_proj/o_proj) | NVFP4 | 48 DeltaNet layers |
| Linear attention conv1d | bf16 | Mamba SSM kernel, excluded |
| MTP head (mtp.*) | bf16 | 15 tensors / ~0.85 GB, sourced from Qwen/Qwen3.6-27B |
| lm_head + embed_tokens | bf16 | Shared with MTP drafter |
| Vision encoder | bf16 | All model.visual.* weights, excluded |
| Norms, biases, A_log, dt_bias | bf16 | Small tensors, excluded |
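
To sanity-check this split on disk, you can walk the exported shards and tally bytes per dtype. A minimal sketch, assuming the checkpoint has been downloaded locally (the path is illustrative); note that NVFP4 weights typically appear as packed uint8 blocks plus fp8 block scales rather than a literal fp4 dtype:

```python
# Sketch: group the exported checkpoint's tensors by dtype to verify the
# quantized/bf16 split above. The local path is an assumption.
from collections import Counter
from pathlib import Path
from safetensors import safe_open

ckpt_dir = Path("./Qwen3.6-27B-uncensored-heretic-v2-NVFP4-MTP")  # local download
bytes_per_dtype = Counter()
for shard in sorted(ckpt_dir.glob("*.safetensors")):
    with safe_open(str(shard), framework="pt") as f:
        for name in f.keys():
            t = f.get_tensor(name)
            bytes_per_dtype[str(t.dtype)] += t.numel() * t.element_size()

for dtype, nbytes in bytes_per_dtype.most_common():
    print(f"{dtype:>20}: {nbytes / 1e9:6.2f} GB")
```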
## Quantization Recipe
```python
import modelopt.torch.quantization as mtq
from modelopt.torch.export import export_hf_checkpoint
from modelopt.torch.export.model_utils import get_language_model_from_vl
from transformers import Qwen3_5ForConditionalGeneration, AutoTokenizer

model = Qwen3_5ForConditionalGeneration.from_pretrained(
    "llmfan46/Qwen3.6-27B-uncensored-heretic-v2",
    dtype="auto", device_map="auto",
    max_memory={0: "4GiB", "cpu": "120GiB"},
    offload_folder="/tmp/offload",
    trust_remote_code=True,
)
language_model = get_language_model_from_vl(model)[-1]

quant_cfg = {**mtq.NVFP4_DEFAULT_CFG}
quant_cfg["quant_cfg"] += [
    {"quantizer_name": "*lm_head*", "enable": False},
    {"quantizer_name": "*linear_attn.conv1d*", "enable": False},
    {"quantizer_name": "*mtp.*", "enable": False},
    {"quantizer_name": "*visual.*", "enable": False},
]

mtq.quantize(language_model, quant_cfg, forward_loop=...)
export_hf_checkpoint(model, export_dir="./out")
```
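
The `forward_loop` argument is elided above. A minimal sketch of what the calibration loop could look like for the 20 × 8192 `cnn_dailymail` setting, continuing from the code above; the dataset config, text field, and one-sample-per-forward batching are assumptions:

```python
# Sketch of a calibration forward loop for mtq.quantize (assumptions:
# cnn_dailymail "3.0.0" config, "article" field, one sample per forward pass).
import torch
from datasets import load_dataset

tokenizer = AutoTokenizer.from_pretrained("llmfan46/Qwen3.6-27B-uncensored-heretic-v2")
calib_texts = load_dataset("cnn_dailymail", "3.0.0", split="train[:20]")["article"]

def forward_loop(m):
    m.eval()
    with torch.no_grad():
        for text in calib_texts:
            ids = tokenizer(text, return_tensors="pt", truncation=True,
                            max_length=8192).input_ids.to(m.device)
            m(ids)  # activations observed by the quantizer calibration hooks

mtq.quantize(language_model, quant_cfg, forward_loop=forward_loop)
```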
After export, `mtp.*` is stitched in from the official Qwen/Qwen3.6-27B shards (15 tensors), since llmfan46/heretic-v2 ships without trained MTP weights.
The recipe follows sakamakismile/Qwen3.6-27B-Text-NVFP4-MTP (verified at 207 tok/s on an RTX PRO 6000 Blackwell), with `visual.*` added to the ignore list to keep multimodal capability.
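
A sketch of that stitching step, under the assumption that the `mtp.*` keys can be located via the official repo's `model.safetensors.index.json` and written out as a separate shard (output path is illustrative):

```python
# Sketch: copy the bf16 mtp.* tensors from the official Qwen/Qwen3.6-27B
# checkpoint into the quantized export as a separate shard.
import json
from huggingface_hub import hf_hub_download
from safetensors.torch import load_file, save_file

repo = "Qwen/Qwen3.6-27B"
index = json.load(open(hf_hub_download(repo, "model.safetensors.index.json")))
mtp_map = {k: v for k, v in index["weight_map"].items() if k.startswith("mtp.")}

mtp_tensors = {}
for shard in sorted(set(mtp_map.values())):
    weights = load_file(hf_hub_download(repo, shard))
    mtp_tensors.update({k: v for k, v in weights.items() if k.startswith("mtp.")})

print(f"collected {len(mtp_tensors)} mtp.* tensors")  # expect 15
save_file(mtp_tensors, "./out/model-mtp-extra.safetensors")
```

The exported `model.safetensors.index.json` then needs its `mtp.*` entries pointed at the new shard.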
## Usage with vLLM

### Minimal launch (chat + reasoning + tool calling, no spec decode)
```bash
docker run -d --name heretic-v2-nvfp4 \
--ipc host --network host --device nvidia.com/gpu=all \
-v ~/.cache/huggingface:/root/.cache/huggingface \
-e VLLM_NVFP4_GEMM_BACKEND=flashinfer-cutlass \
-e VLLM_USE_FLASHINFER_MOE_FP4=0 \
-e PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True \
vllm/vllm-openai:latest-cu130 \
--model lyf/Qwen3.6-27B-uncensored-heretic-v2-NVFP4-MTP \
--host 0.0.0.0 --port 8000 \
--max-model-len 200000 \
--gpu-memory-utilization 0.95 \
--max-num-seqs 1 \
--kv-cache-dtype fp8 \
--trust-remote-code --language-model-only \
--reasoning-parser qwen3 \
--enable-auto-tool-choice --tool-call-parser qwen3_coder
```
`flashinfer-cutlass` (not the default `marlin`) is required because the Gated-DeltaNet `dt_proj` has an output dim of 96, which Marlin's `tile_n_size=64` rejects.
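
Once the server is up, any OpenAI-compatible client can talk to it. A quick smoke test with the `openai` Python package; `reasoning_content` is the field populated by `--reasoning-parser qwen3`, and `getattr` guards against it being absent:

```python
# Minimal smoke test against the vLLM OpenAI-compatible endpoint started above.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
resp = client.chat.completions.create(
    model="lyf/Qwen3.6-27B-uncensored-heretic-v2-NVFP4-MTP",
    messages=[{"role": "user", "content": "Summarize NVFP4 in two sentences."}],
    max_tokens=256,
)
print(getattr(resp.choices[0].message, "reasoning_content", None))  # thinking trace
print(resp.choices[0].message.content)
```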
### With MTP speculative decoding (requires vLLM patch)
vLLM ≤ 0.20.0 has a known issue: in `qwen3_5_mtp.py`, only `mtp.fc` is forced unquantized for NVFP4 checkpoints; the MTP transformer layers themselves still inherit `--quantization modelopt` and try to load NVFP4-shaped params from the bf16 weights, which fails with a shape assertion.
Apply the small in-place workaround at container start:
```bash
docker run -d --name heretic-v2-nvfp4-mtp \
--ipc host --network host --device nvidia.com/gpu=all \
-v ~/.cache/huggingface:/root/.cache/huggingface \
-e VLLM_NVFP4_GEMM_BACKEND=flashinfer-cutlass \
--entrypoint bash \
vllm/vllm-openai:latest-cu130 -lc "
python3 -c \"
import re
F='/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/qwen3_5_mtp.py'
s=open(F).read()
s=s.replace(
'self.layers = torch.nn.ModuleList(\n Qwen3_5DecoderLayer(\n vllm_config,',
'_orig_qc = vllm_config.quant_config\n vllm_config.quant_config = fc_quant\n self.layers = torch.nn.ModuleList(\n Qwen3_5DecoderLayer(\n vllm_config,', 1)
s=s.replace(
'self.make_empty_intermediate_tensors = make_empty_intermediate_tensors_factory(',
'vllm_config.quant_config = _orig_qc\n self.make_empty_intermediate_tensors = make_empty_intermediate_tensors_factory(', 1)
open(F,'w').write(s)
print('patched')\"
exec vllm serve lyf/Qwen3.6-27B-uncensored-heretic-v2-NVFP4-MTP \
--host 0.0.0.0 --port 8000 \
--max-model-len 200000 --gpu-memory-utilization 0.95 \
--max-num-seqs 1 --kv-cache-dtype fp8 \
--trust-remote-code --language-model-only \
--reasoning-parser qwen3 \
--enable-auto-tool-choice --tool-call-parser qwen3_coder \
--speculative-config '{\"method\":\"qwen3_5_mtp\",\"num_speculative_tokens\":3}'
"
The patch makes the MTP `Qwen3_5DecoderLayer` inherit `fc_quant=None` (the same trick already applied to `mtp.fc` upstream). An upstream fix is desirable; see vLLM `qwen3_5_mtp.py:75-99`.
### Disabling thinking mode

Thinking can be turned off per request by passing:

```json
{
  "chat_template_kwargs": {"enable_thinking": false}
}
```
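
With the `openai` Python client, one way to send this is via `extra_body`:

```python
# Pass chat_template_kwargs per request through extra_body to disable thinking.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
resp = client.chat.completions.create(
    model="lyf/Qwen3.6-27B-uncensored-heretic-v2-NVFP4-MTP",
    messages=[{"role": "user", "content": "Reply with a single word: ready?"}],
    extra_body={"chat_template_kwargs": {"enable_thinking": False}},
)
print(resp.choices[0].message.content)
```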
## Memory budget (RTX 5090, 32 GB VRAM)
| Component | Size |
|---|---|
| NVFP4 weights | ~16 GB |
| lm_head + embed_tokens (bf16) | ~5 GB |
| linear_attn.conv1d + norms etc. (bf16) | ~0.5 GB |
| MTP head (bf16) | ~0.85 GB |
| Vision encoder (bf16, skipped via --language-model-only) | ~0.9 GB |
| KV cache (fp8, 200k ctx) | ~6.4 GB |
| Overhead | ~3 GB |
| Total (text-only, 200k ctx) | ~30 GB |
## Performance (RTX 5090, vLLM 0.20.0, MTP k=3)

Single-stream, synchronous, with `--max-num-seqs 1 --max-model-len 200000`:
| Workload | Prompt | Output | tok/s (median) |
|---|---|---|---|
| code | 512 | 1536 | 113.0 |
| prose | 256 | 2048 | 119.5 |
| long-32k | 32000 | 1024 | 72.0 |
| xlong-100k | 100000 | 512 | 35.8 |
| extreme-180k | 180000 | 256 | ~3.9 (prefill-dominated) |
MTP spec decode acceptance: 64–78% across workloads (k=3, mean accepted tokens/step ≈ 1.93–2.34). Without MTP, single-stream baseline is ~58 tok/s — MTP gives ~1.76× speedup on short workloads.
## Capabilities

### Multimodal (vision)

Drop the `--language-model-only` flag to load the vision tower; image input goes through the standard OpenAI `image_url` content block.
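
A sketch of an image request against the same endpoint (the image URL is a placeholder):

```python
# Vision request through the standard OpenAI image_url content block.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
resp = client.chat.completions.create(
    model="lyf/Qwen3.6-27B-uncensored-heretic-v2-NVFP4-MTP",
    messages=[{
        "role": "user",
        "content": [
            {"type": "image_url", "image_url": {"url": "https://example.com/photo.jpg"}},
            {"type": "text", "text": "Describe this image in one sentence."},
        ],
    }],
)
print(resp.choices[0].message.content)
```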
### Tool calling

Verified with `--enable-auto-tool-choice --tool-call-parser qwen3_coder`.
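
A minimal round trip to confirm the parser emits structured `tool_calls` (the `get_weather` tool is illustrative):

```python
# Minimal tool-calling check; the get_weather tool is illustrative.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get the current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]
resp = client.chat.completions.create(
    model="lyf/Qwen3.6-27B-uncensored-heretic-v2-NVFP4-MTP",
    messages=[{"role": "user", "content": "What's the weather in Berlin?"}],
    tools=tools,
    tool_choice="auto",
)
print(resp.choices[0].message.tool_calls)  # parsed by --tool-call-parser qwen3_coder
```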
## How It Was Made

- Quantize with the `modelopt` API directly: `hf_ptq.py` from `Model-Optimizer/examples/llm_ptq` doesn't work for transformers ≥ 5.0 VL configs (`AutoModelForCausalLM.from_config` fails on the missing top-level `vocab_size`, which lives under `text_config`).
- Stitch MTP from `Qwen/Qwen3.6-27B`'s shards 13/15 + 15/15 into `model-mtp-extra.safetensors` (`llmfan46/heretic-v2` ships without trained MTP weights despite `text_config.mtp_num_hidden_layers: 1`).
- Patch `exclude_modules` in both `hf_quant_config.json` and `config.json` to add the fused names vLLM creates (`mtp.layers.0.self_attn.qkv_proj`, `mtp.layers.0.mlp.gate_up_proj`, etc.); modelopt only emits unfused names by default. A sketch of this patch follows after this list.
- Patch vLLM's `qwen3_5_mtp.py` so the MTP transformer layers inherit `fc_quant` (the `None` quant config already applied to `mtp.fc`); without this, MTP cannot load on NVFP4 checkpoints.
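
A sketch of the `exclude_modules` patch from the third step. The fused names below are the two examples given above, and the patch locates every `exclude_modules` list in the two JSON files rather than assuming their exact nesting:

```python
# Sketch: append the fused module names vLLM constructs to every
# "exclude_modules" list found in hf_quant_config.json and config.json.
# The fused-name list is the example set from the step above.
import json
from pathlib import Path

export_dir = Path("./out")
fused_names = ["mtp.layers.0.self_attn.qkv_proj", "mtp.layers.0.mlp.gate_up_proj"]

def extend_exclude_modules(node):
    """Recursively extend any exclude_modules list in a parsed JSON tree."""
    if isinstance(node, dict):
        for key, value in node.items():
            if key == "exclude_modules" and isinstance(value, list):
                node[key] = sorted(set(value) | set(fused_names))
            else:
                extend_exclude_modules(value)
    elif isinstance(node, list):
        for item in node:
            extend_exclude_modules(item)

for fname in ["hf_quant_config.json", "config.json"]:
    path = export_dir / fname
    cfg = json.loads(path.read_text())
    extend_exclude_modules(cfg)
    path.write_text(json.dumps(cfg, indent=2) + "\n")
    print(f"patched {fname}")
```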
## Acknowledgments
- llmfan46 for the heretic-v2 abliterated base
- Heretic v1.2.0 (MPOA refusal ablation)
- sakamakismile for the NVFP4+MTP recipe
- Qwen team for the base architecture and the official MTP head weights
- NVIDIA Model-Optimizer for the NVFP4 quantization framework
- vLLM for serving infrastructure