Qwen3.6-27B-uncensored-heretic-v2-NVFP4-MTP

NVIDIA FP4 (NVFP4) quantized version of llmfan46/Qwen3.6-27B-uncensored-heretic-v2, with full multimodal (vision) capability and built-in MTP speculative decoding preserved.

Model Details

  • Base model: llmfan46/Qwen3.6-27B-uncensored-heretic-v2 (Heretic v1.2.0 MPOA abliteration of Qwen/Qwen3.6-27B)
  • Architecture: Qwen3_5ForConditionalGeneration (hybrid Gated-DeltaNet + Gated full attention, MLA-style q_proj)
  • Quantization: NVFP4 via nvidia-modelopt (NVFP4_DEFAULT_CFG)
  • Calibration: 20 samples × 8192-token sequences from cnn_dailymail (sakamakismile recipe)
  • Model size: ~20.6 GB (vs ~54 GB bf16 original)
  • MTP head: bf16, restored from Qwen/Qwen3.6-27B (the abliterated base ships without MTP)
  • Vision encoder: bf16 (unquantized, ~0.9 GB)

What's quantized, what's not

| Component | Format | Notes |
|---|---|---|
| MLP (gate/up/down_proj) | NVFP4 | All 64 layers |
| Full attention (qkv/o_proj) | NVFP4 | 16 layers (every 4th) |
| Linear attention (in_proj_a/b/dt_proj/g_proj/o_proj) | NVFP4 | 48 DeltaNet layers |
| Linear attention conv1d | bf16 | Mamba SSM kernel, excluded |
| MTP head (mtp.*) | bf16 | 15 tensors / ~0.85 GB, sourced from Qwen/Qwen3.6-27B |
| lm_head + embed_tokens | bf16 | Shared with the MTP drafter |
| Vision encoder | bf16 | All model.visual.* weights, excluded |
| Norms, biases, A_log, dt_bias | bf16 | Small tensors, excluded |
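
To sanity-check this split on the exported checkpoint, tensor dtypes can be listed straight from the shards (a sketch using safetensors; the shard filename is illustrative). NVFP4 weights show up as packed uint8 plus fp8 scales, while excluded modules stay bfloat16:

from collections import Counter
from safetensors import safe_open

counts = Counter()
with safe_open("model-00001-of-00005.safetensors", framework="pt") as f:
    for name in f.keys():
        # Group by parameter suffix and on-disk dtype without loading tensor data.
        dtype = f.get_slice(name).get_dtype()
        counts[(name.rsplit(".", 1)[-1], dtype)] += 1

for (suffix, dtype), n in sorted(counts.items()):
    print(f"{suffix:>20} {dtype:>8} x{n}")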

Quantization Recipe

import modelopt.torch.quantization as mtq
from modelopt.torch.export import export_hf_checkpoint
from modelopt.torch.export.model_utils import get_language_model_from_vl
from transformers import Qwen3_5ForConditionalGeneration, AutoTokenizer

# Load the bf16 VL checkpoint with CPU offload so it fits next to a single GPU.
model = Qwen3_5ForConditionalGeneration.from_pretrained(
    "llmfan46/Qwen3.6-27B-uncensored-heretic-v2",
    dtype="auto", device_map="auto",
    max_memory={0: "4GiB", "cpu": "120GiB"},
    offload_folder="/tmp/offload",
    trust_remote_code=True,
)
# Quantize only the language model; the vision tower stays bf16.
language_model = get_language_model_from_vl(model)[-1]

quant_cfg = {**mtq.NVFP4_DEFAULT_CFG}
quant_cfg["quant_cfg"] += [
    {"quantizer_name": "*lm_head*", "enable": False},             # shared with the MTP drafter
    {"quantizer_name": "*linear_attn.conv1d*", "enable": False},  # Mamba SSM kernel
    {"quantizer_name": "*mtp.*", "enable": False},                # MTP head stays bf16
    {"quantizer_name": "*visual.*", "enable": False},             # vision encoder stays bf16
]
# forward_loop: calibration pass, 20 samples x 8192 tokens from cnn_dailymail (elided here).
mtq.quantize(language_model, quant_cfg, forward_loop=...)
export_hf_checkpoint(model, export_dir="./out")
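
The elided forward_loop can be any callable that runs the calibration data through the model. A minimal sketch matching the 20 × 8192 cnn_dailymail setup above (calib_loop and the dataset slicing are illustrative, not part of the original recipe):

import torch
from datasets import load_dataset

tokenizer = AutoTokenizer.from_pretrained(
    "llmfan46/Qwen3.6-27B-uncensored-heretic-v2", trust_remote_code=True)
texts = load_dataset("cnn_dailymail", "3.0.0", split="train[:20]")["article"]

def calib_loop(m):
    # One forward pass per sample so modelopt can record activation ranges.
    with torch.no_grad():
        for text in texts:
            ids = tokenizer(text, return_tensors="pt", truncation=True,
                            max_length=8192).input_ids.to(m.device)
            m(ids)

mtq.quantize would then be called with forward_loop=calib_loop.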

After export, mtp.* is stitched in from the official Qwen/Qwen3.6-27B shards (15 tensors) since llmfan46/heretic-v2 ships without trained MTP weights.
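
A minimal sketch of that stitching step (shard filenames are inferred from "shards 13/15 + 15/15"; the mtp. prefix follows the tensor names listed above):

from safetensors.torch import load_file, save_file

mtp = {}
for shard in ("model-00013-of-00015.safetensors",
              "model-00015-of-00015.safetensors"):
    # Copy only the bf16 MTP tensors out of the official Qwen/Qwen3.6-27B shards.
    for name, tensor in load_file(f"Qwen3.6-27B/{shard}").items():
        if name.startswith("mtp."):
            mtp[name] = tensor  # 15 tensors, ~0.85 GB total

save_file(mtp, "out/model-mtp-extra.safetensors")
# The new file also needs matching entries in model.safetensors.index.json.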

Recipe follows sakamakismile/Qwen3.6-27B-Text-NVFP4-MTP (verified 207 tok/s on RTX PRO 6000 Blackwell), with visual.* added to the ignore list to keep multimodal capability.

Usage with vLLM

Minimal launch (chat + reasoning + tool calling, no spec decode)

docker run -d --name heretic-v2-nvfp4 \
  --ipc host --network host --device nvidia.com/gpu=all \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  -e VLLM_NVFP4_GEMM_BACKEND=flashinfer-cutlass \
  -e VLLM_USE_FLASHINFER_MOE_FP4=0 \
  -e PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True \
  vllm/vllm-openai:latest-cu130 \
  --model lyf/Qwen3.6-27B-uncensored-heretic-v2-NVFP4-MTP \
  --host 0.0.0.0 --port 8000 \
  --max-model-len 200000 \
  --gpu-memory-utilization 0.95 \
  --max-num-seqs 1 \
  --kv-cache-dtype fp8 \
  --trust-remote-code --language-model-only \
  --reasoning-parser qwen3 \
  --enable-auto-tool-choice --tool-call-parser qwen3_coder

The flashinfer-cutlass backend (not the default marlin) is required because the Gated-DeltaNet dt_proj has an output dim of 96, which Marlin's tile_n_size=64 rejects.

With MTP speculative decoding (requires vLLM patch)

vLLM ≤ 0.20.0 has a known issue: in qwen3_5_mtp.py, only mtp.fc is forced unquantized for NVFP4 checkpoints. The MTP transformer layers themselves still inherit --quantization modelopt and try to load NVFP4-shaped params from bf16 weights, which fails a shape assertion.

Apply this small workaround at container start:

docker run -d --name heretic-v2-nvfp4-mtp \
  --ipc host --network host --device nvidia.com/gpu=all \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  -e VLLM_NVFP4_GEMM_BACKEND=flashinfer-cutlass \
  --entrypoint bash \
  vllm/vllm-openai:latest-cu130 -lc "
python3 -c \"
# Temporarily swap quant_config to fc_quant while the MTP decoder layers are built.
F='/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/qwen3_5_mtp.py'
s=open(F).read()
s=s.replace(
  'self.layers = torch.nn.ModuleList(\n            Qwen3_5DecoderLayer(\n                vllm_config,',
  '_orig_qc = vllm_config.quant_config\n        vllm_config.quant_config = fc_quant\n        self.layers = torch.nn.ModuleList(\n            Qwen3_5DecoderLayer(\n                vllm_config,', 1)
s=s.replace(
  'self.make_empty_intermediate_tensors = make_empty_intermediate_tensors_factory(',
  'vllm_config.quant_config = _orig_qc\n        self.make_empty_intermediate_tensors = make_empty_intermediate_tensors_factory(', 1)
open(F,'w').write(s)
print('patched')\"
exec vllm serve lyf/Qwen3.6-27B-uncensored-heretic-v2-NVFP4-MTP \
  --host 0.0.0.0 --port 8000 \
  --max-model-len 200000 --gpu-memory-utilization 0.95 \
  --max-num-seqs 1 --kv-cache-dtype fp8 \
  --trust-remote-code --language-model-only \
  --reasoning-parser qwen3 \
  --enable-auto-tool-choice --tool-call-parser qwen3_coder \
  --speculative-config '{\"method\":\"qwen3_5_mtp\",\"num_speculative_tokens\":3}'
"

The patch makes the MTP Qwen3_5DecoderLayer stack inherit fc_quant (the same None quant config upstream already applies to mtp.fc). An upstream fix is desirable; see vLLM qwen3_5_mtp.py:75-99.

Disabling thinking mode

Send chat_template_kwargs with the request body:
{
  "chat_template_kwargs": {"enable_thinking": false}
}
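
For example, with the OpenAI Python client (base URL and model name as in the launch commands above):

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="none")
resp = client.chat.completions.create(
    model="lyf/Qwen3.6-27B-uncensored-heretic-v2-NVFP4-MTP",
    messages=[{"role": "user", "content": "Hello!"}],
    # vLLM forwards extra_body fields such as chat_template_kwargs to the chat template.
    extra_body={"chat_template_kwargs": {"enable_thinking": False}},
)
print(resp.choices[0].message.content)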

Memory budget (RTX 5090, 32GB VRAM)

| Component | Size |
|---|---|
| NVFP4 weights | ~16 GB |
| lm_head + embed_tokens (bf16) | ~5 GB |
| linear_attn.conv1d + norms etc. (bf16) | ~0.5 GB |
| MTP head (bf16) | ~0.85 GB |
| Vision encoder (bf16, skipped via --language-model-only) | ~0.9 GB |
| KV cache (fp8, 200k ctx) | ~6.4 GB |
| Overhead | ~3 GB |
| Total (text-only, 200k ctx) | ~30 GB |

Performance (RTX 5090, vLLM 0.20.0, MTP k=3)

Single-stream, synchronous, with --max-num-seqs 1 --max-model-len 200000:

| Workload | Prompt tokens | Output tokens | tok/s (median) |
|---|---|---|---|
| code | 512 | 1536 | 113.0 |
| prose | 256 | 2048 | 119.5 |
| long-32k | 32000 | 1024 | 72.0 |
| xlong-100k | 100000 | 512 | 35.8 |
| extreme-180k | 180000 | 256 | ~3.9 (prefill-dominated) |

MTP spec decode acceptance: 64–78% across workloads (k=3, mean accepted tokens/step ≈ 1.93–2.34). Without MTP, the single-stream baseline is ~58 tok/s, so MTP yields a ~1.76× speedup on short workloads.

Capabilities

Multimodal (vision)

Drop --language-model-only to load the vision tower; send images via the standard OpenAI image_url content block.
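
For example (OpenAI Python client; the image URL is a placeholder):

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="none")
resp = client.chat.completions.create(
    model="lyf/Qwen3.6-27B-uncensored-heretic-v2-NVFP4-MTP",
    messages=[{
        "role": "user",
        "content": [
            {"type": "image_url",
             "image_url": {"url": "https://example.com/photo.jpg"}},
            {"type": "text", "text": "Describe this image."},
        ],
    }],
)
print(resp.choices[0].message.content)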

Tool calling

Verified with --enable-auto-tool-choice --tool-call-parser qwen3_coder.
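
A quick way to exercise it (the get_weather tool is an illustrative stand-in):

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="none")
tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get the current weather for a city",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]
resp = client.chat.completions.create(
    model="lyf/Qwen3.6-27B-uncensored-heretic-v2-NVFP4-MTP",
    messages=[{"role": "user", "content": "What's the weather in Paris?"}],
    tools=tools, tool_choice="auto",
)
print(resp.choices[0].message.tool_calls)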

How It Was Made

  1. Quantize with the modelopt API directly; hf_ptq.py from Model-Optimizer/examples/llm_ptq doesn't work for transformers ≥ 5.0 VL configs (AutoModelForCausalLM.from_config fails on the missing top-level vocab_size, which lives under text_config).
  2. Stitch MTP from Qwen/Qwen3.6-27B's shards 13/15 + 15/15 into model-mtp-extra.safetensors (llmfan46/heretic-v2 ships without trained MTP weights despite text_config.mtp_num_hidden_layers: 1).
  3. Patch exclude_modules in both hf_quant_config.json and config.json to add the fused names vLLM creates (mtp.layers.0.self_attn.qkv_proj, mtp.layers.0.mlp.gate_up_proj, etc.); modelopt only emits unfused names by default. A sketch of this step follows the list.
  4. Patch vLLM's qwen3_5_mtp.py so the MTP transformer layers inherit fc_quant (the None quant config already applied to mtp.fc); without this MTP cannot load on NVFP4 checkpoints.
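
A sketch of step 3's config edit (the fused names come from the list above; the exclude_modules location inside each JSON file is assumed, so adjust to the actual layout):

import json

fused = [
    "mtp.layers.0.self_attn.qkv_proj",
    "mtp.layers.0.mlp.gate_up_proj",
]

for path in ("out/hf_quant_config.json", "out/config.json"):
    with open(path) as f:
        cfg = json.load(f)
    # modelopt exports nest settings under "quantization"; config.json
    # typically uses "quantization_config" (assumed here).
    quant = cfg.get("quantization") or cfg.get("quantization_config") or cfg
    quant["exclude_modules"] = sorted(set(quant.get("exclude_modules", []) + fused))
    with open(path, "w") as f:
        json.dump(cfg, f, indent=2)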

Acknowledgments

Thanks to sakamakismile for the original Qwen3.6-27B-Text-NVFP4-MTP recipe, and to llmfan46 for the Heretic v2 abliteration this quant builds on.
