Qwen3.6-27B-uncensored-heretic-v2-NVFP4-MTP

NVIDIA FP4 (NVFP4) quantized version of llmfan46/Qwen3.6-27B-uncensored-heretic-v2, with full multimodal (vision) capability and built-in MTP speculative decoding preserved.

Model Details

  • Base model: llmfan46/Qwen3.6-27B-uncensored-heretic-v2 (Heretic v1.2.0 MPOA abliteration of Qwen/Qwen3.6-27B)
  • Architecture: Qwen3_5ForConditionalGeneration (hybrid Gated-DeltaNet + Gated full attention, MLA-style q_proj)
  • Quantization: NVFP4 via nvidia-modelopt (NVFP4_DEFAULT_CFG)
  • Calibration: 20 samples × 8192-token sequences from cnn_dailymail (sakamakismile recipe)
  • Model size: ~20.6 GB (vs ~54 GB bf16 original)
  • MTP head: bf16, restored from Qwen/Qwen3.6-27B (the abliterated base ships without MTP)
  • Vision encoder: bf16 (unquantized, ~0.9 GB)

What's quantized, what's not

| Component | Format | Notes |
|---|---|---|
| MLP (gate/up/down_proj) | NVFP4 | All 64 layers |
| Full attention (qkv/o_proj) | NVFP4 | 16 layers (every 4th) |
| Linear attention (in_proj_a/b/dt_proj/g_proj/o_proj) | NVFP4 | 48 DeltaNet layers |
| Linear attention conv1d | bf16 | Mamba SSM kernel, excluded |
| MTP head (mtp.*) | bf16 | 15 tensors / ~0.85 GB, sourced from Qwen/Qwen3.6-27B |
| lm_head + embed_tokens | bf16 | Shared with the MTP drafter |
| Vision encoder | bf16 | All model.visual.* weights, excluded |
| Norms, biases, A_log, dt_bias | bf16 | Small tensors, excluded |
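
To sanity-check this split on the exported checkpoint, tensor dtypes can be listed straight from the shards (a sketch using safetensors; the shard filename is illustrative). NVFP4 weights show up as packed uint8 plus fp8 scales, while excluded modules stay bfloat16:

from collections import Counter
from safetensors import safe_open

counts = Counter()
with safe_open("model-00001-of-00005.safetensors", framework="pt") as f:
    for name in f.keys():
        # Group by parameter suffix and on-disk dtype without loading tensor data.
        dtype = f.get_slice(name).get_dtype()
        counts[(name.rsplit(".", 1)[-1], dtype)] += 1

for (suffix, dtype), n in sorted(counts.items()):
    print(f"{suffix:>20} {dtype:>8} x{n}")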

Quantization Recipe

import modelopt.torch.quantization as mtq
from modelopt.torch.export import export_hf_checkpoint
from modelopt.torch.export.model_utils import get_language_model_from_vl
from transformers import Qwen3_5ForConditionalGeneration, AutoTokenizer

# Load the bf16 VL checkpoint with CPU offload so it fits next to a single GPU.
model = Qwen3_5ForConditionalGeneration.from_pretrained(
    "llmfan46/Qwen3.6-27B-uncensored-heretic-v2",
    dtype="auto", device_map="auto",
    max_memory={0: "4GiB", "cpu": "120GiB"},
    offload_folder="/tmp/offload",
    trust_remote_code=True,
)
# Quantize only the language model; the vision tower stays bf16.
language_model = get_language_model_from_vl(model)[-1]

quant_cfg = {**mtq.NVFP4_DEFAULT_CFG}
quant_cfg["quant_cfg"] += [
    {"quantizer_name": "*lm_head*", "enable": False},             # shared with the MTP drafter
    {"quantizer_name": "*linear_attn.conv1d*", "enable": False},  # Mamba SSM kernel
    {"quantizer_name": "*mtp.*", "enable": False},                # MTP head stays bf16
    {"quantizer_name": "*visual.*", "enable": False},             # vision encoder stays bf16
]
# forward_loop: calibration pass, 20 samples x 8192 tokens from cnn_dailymail (elided here).
mtq.quantize(language_model, quant_cfg, forward_loop=...)
export_hf_checkpoint(model, export_dir="./out")
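
The elided forward_loop can be any callable that runs the calibration data through the model. A minimal sketch matching the 20 × 8192 cnn_dailymail setup above (calib_loop and the dataset slicing are illustrative, not part of the original recipe):

import torch
from datasets import load_dataset

tokenizer = AutoTokenizer.from_pretrained(
    "llmfan46/Qwen3.6-27B-uncensored-heretic-v2", trust_remote_code=True)
texts = load_dataset("cnn_dailymail", "3.0.0", split="train[:20]")["article"]

def calib_loop(m):
    # One forward pass per sample so modelopt can record activation ranges.
    with torch.no_grad():
        for text in texts:
            ids = tokenizer(text, return_tensors="pt", truncation=True,
                            max_length=8192).input_ids.to(m.device)
            m(ids)

mtq.quantize would then be called with forward_loop=calib_loop.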

After export, mtp.* is stitched in from the official Qwen/Qwen3.6-27B shards (15 tensors) since llmfan46/heretic-v2 ships without trained MTP weights.
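
A minimal sketch of that stitching step (shard filenames are inferred from "shards 13/15 + 15/15"; the mtp. prefix follows the tensor names listed above):

from safetensors.torch import load_file, save_file

mtp = {}
for shard in ("model-00013-of-00015.safetensors",
              "model-00015-of-00015.safetensors"):
    # Copy only the bf16 MTP tensors out of the official Qwen/Qwen3.6-27B shards.
    for name, tensor in load_file(f"Qwen3.6-27B/{shard}").items():
        if name.startswith("mtp."):
            mtp[name] = tensor  # 15 tensors, ~0.85 GB total

save_file(mtp, "out/model-mtp-extra.safetensors")
# The new file also needs matching entries in model.safetensors.index.json.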

Recipe follows sakamakismile/Qwen3.6-27B-Text-NVFP4-MTP (verified 207 tok/s on RTX PRO 6000 Blackwell), with visual.* added to the ignore list to keep multimodal capability.

Usage with vLLM

Minimal launch (chat + reasoning + tool calling, no spec decode)

docker run -d --name heretic-v2-nvfp4 \
  --ipc host --network host --device nvidia.com/gpu=all \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  -e VLLM_NVFP4_GEMM_BACKEND=flashinfer-cutlass \
  -e VLLM_USE_FLASHINFER_MOE_FP4=0 \
  -e PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True \
  vllm/vllm-openai:latest-cu130 \
  --model lyf/Qwen3.6-27B-uncensored-heretic-v2-NVFP4-MTP \
  --host 0.0.0.0 --port 8000 \
  --max-model-len 200000 \
  --gpu-memory-utilization 0.95 \
  --max-num-seqs 1 \
  --kv-cache-dtype fp8 \
  --trust-remote-code --language-model-only \
  --reasoning-parser qwen3 \
  --enable-auto-tool-choice --tool-call-parser qwen3_coder

The flashinfer-cutlass backend (not the default marlin) is required because the Gated-DeltaNet dt_proj has an output dim of 96, which Marlin's tile_n_size=64 rejects.

With MTP speculative decoding (requires vLLM patch)

vLLM ≤ 0.20.0 has a known issue: in qwen3_5_mtp.py, only mtp.fc is forced unquantized for NVFP4 checkpoints. The MTP transformer layers themselves still inherit --quantization modelopt and try to load NVFP4-shaped params from bf16 weights, which fails a shape assertion.

Apply this small workaround at container start:

docker run -d --name heretic-v2-nvfp4-mtp \
  --ipc host --network host --device nvidia.com/gpu=all \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  -e VLLM_NVFP4_GEMM_BACKEND=flashinfer-cutlass \
  --entrypoint bash \
  vllm/vllm-openai:latest-cu130 -lc "
python3 -c \"
# Temporarily swap quant_config to fc_quant while the MTP decoder layers are built.
F='/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/qwen3_5_mtp.py'
s=open(F).read()
s=s.replace(
  'self.layers = torch.nn.ModuleList(\n            Qwen3_5DecoderLayer(\n                vllm_config,',
  '_orig_qc = vllm_config.quant_config\n        vllm_config.quant_config = fc_quant\n        self.layers = torch.nn.ModuleList(\n            Qwen3_5DecoderLayer(\n                vllm_config,', 1)
s=s.replace(
  'self.make_empty_intermediate_tensors = make_empty_intermediate_tensors_factory(',
  'vllm_config.quant_config = _orig_qc\n        self.make_empty_intermediate_tensors = make_empty_intermediate_tensors_factory(', 1)
open(F,'w').write(s)
print('patched')\"
exec vllm serve lyf/Qwen3.6-27B-uncensored-heretic-v2-NVFP4-MTP \
  --host 0.0.0.0 --port 8000 \
  --max-model-len 200000 --gpu-memory-utilization 0.95 \
  --max-num-seqs 1 --kv-cache-dtype fp8 \
  --trust-remote-code --language-model-only \
  --reasoning-parser qwen3 \
  --enable-auto-tool-choice --tool-call-parser qwen3_coder \
  --speculative-config '{\"method\":\"qwen3_5_mtp\",\"num_speculative_tokens\":3}'
"

The patch makes the MTP Qwen3_5DecoderLayer stack inherit fc_quant (the same None quant config upstream already applies to mtp.fc). An upstream fix is desirable; see vLLM qwen3_5_mtp.py:75-99.

Disabling thinking mode

Send chat_template_kwargs with the request body:
{
  "chat_template_kwargs": {"enable_thinking": false}
}
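
For example, with the OpenAI Python client (base URL and model name as in the launch commands above):

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="none")
resp = client.chat.completions.create(
    model="lyf/Qwen3.6-27B-uncensored-heretic-v2-NVFP4-MTP",
    messages=[{"role": "user", "content": "Hello!"}],
    # vLLM forwards extra_body fields such as chat_template_kwargs to the chat template.
    extra_body={"chat_template_kwargs": {"enable_thinking": False}},
)
print(resp.choices[0].message.content)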

Memory budget (RTX 5090, 32GB VRAM)

| Component | Size |
|---|---|
| NVFP4 weights | ~16 GB |
| lm_head + embed_tokens (bf16) | ~5 GB |
| linear_attn.conv1d + norms etc. (bf16) | ~0.5 GB |
| MTP head (bf16) | ~0.85 GB |
| Vision encoder (bf16, skipped via --language-model-only) | ~0.9 GB |
| KV cache (fp8, 200k ctx) | ~6.4 GB |
| Overhead | ~3 GB |
| Total (text-only, 200k ctx) | ~30 GB |

Performance (RTX 5090, vLLM 0.20.0, MTP k=3)

Single-stream, synchronous, with --max-num-seqs 1 --max-model-len 200000:

| Workload | Prompt tokens | Output tokens | tok/s (median) |
|---|---|---|---|
| code | 512 | 1536 | 113.0 |
| prose | 256 | 2048 | 119.5 |
| long-32k | 32000 | 1024 | 72.0 |
| xlong-100k | 100000 | 512 | 35.8 |
| extreme-180k | 180000 | 256 | ~3.9 (prefill-dominated) |

MTP spec decode acceptance: 64–78% across workloads (k=3, mean accepted tokens/step ≈ 1.93–2.34). Without MTP, the single-stream baseline is ~58 tok/s, so MTP yields a ~1.76× speedup on short workloads.

Capabilities

Multimodal (vision)

Drop --language-model-only to load the vision tower; send images via the standard OpenAI image_url content block.
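
For example (OpenAI Python client; the image URL is a placeholder):

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="none")
resp = client.chat.completions.create(
    model="lyf/Qwen3.6-27B-uncensored-heretic-v2-NVFP4-MTP",
    messages=[{
        "role": "user",
        "content": [
            {"type": "image_url",
             "image_url": {"url": "https://example.com/photo.jpg"}},
            {"type": "text", "text": "Describe this image."},
        ],
    }],
)
print(resp.choices[0].message.content)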

Tool calling

Verified with --enable-auto-tool-choice --tool-call-parser qwen3_coder.
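
A quick way to exercise it (the get_weather tool is an illustrative stand-in):

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="none")
tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get the current weather for a city",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]
resp = client.chat.completions.create(
    model="lyf/Qwen3.6-27B-uncensored-heretic-v2-NVFP4-MTP",
    messages=[{"role": "user", "content": "What's the weather in Paris?"}],
    tools=tools, tool_choice="auto",
)
print(resp.choices[0].message.tool_calls)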

How It Was Made

  1. Quantize with the modelopt API directly; hf_ptq.py from Model-Optimizer/examples/llm_ptq doesn't work for transformers ≥ 5.0 VL configs (AutoModelForCausalLM.from_config fails on the missing top-level vocab_size, which lives under text_config).
  2. Stitch MTP from Qwen/Qwen3.6-27B's shards 13/15 + 15/15 into model-mtp-extra.safetensors (llmfan46/heretic-v2 ships without trained MTP weights despite text_config.mtp_num_hidden_layers: 1).
  3. Patch exclude_modules in both hf_quant_config.json and config.json to add the fused names vLLM creates (mtp.layers.0.self_attn.qkv_proj, mtp.layers.0.mlp.gate_up_proj, etc.); modelopt only emits unfused names by default. A sketch of this step follows the list.
  4. Patch vLLM's qwen3_5_mtp.py so the MTP transformer layers inherit fc_quant (the None quant config already applied to mtp.fc); without this MTP cannot load on NVFP4 checkpoints.
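
A sketch of step 3's config edit (the fused names come from the list above; the exclude_modules location inside each JSON file is assumed, so adjust to the actual layout):

import json

fused = [
    "mtp.layers.0.self_attn.qkv_proj",
    "mtp.layers.0.mlp.gate_up_proj",
]

for path in ("out/hf_quant_config.json", "out/config.json"):
    with open(path) as f:
        cfg = json.load(f)
    # modelopt exports nest settings under "quantization"; config.json
    # typically uses "quantization_config" (assumed here).
    quant = cfg.get("quantization") or cfg.get("quantization_config") or cfg
    quant["exclude_modules"] = sorted(set(quant.get("exclude_modules", []) + fused))
    with open(path, "w") as f:
        json.dump(cfg, f, indent=2)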

Acknowledgments

Thanks to sakamakismile for the original Qwen3.6-27B-Text-NVFP4-MTP recipe, and to llmfan46 for the Heretic v2 abliteration this quant builds on.
