Qwen3.5-27B-DFlash
This model is still under training.
DFlash is a speculative decoding method that uses a lightweight block diffusion model for drafting. It enables efficient, high-quality parallel drafting that pushes the limits of inference speed.
This model is the drafter component. It must be used in conjunction with the target model Qwen/Qwen3.5-27B. It was trained with a context length of 4096 tokens.
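For intuition, draft-then-verify speculative decoding works roughly like this: the drafter proposes a block of tokens in parallel, the target model checks them, and the longest agreeing prefix is accepted, plus one correction token from the target. The sketch below is a toy greedy variant with stand-in `draft_block` and `target_next` functions, not the actual DFlash drafter or Qwen target (a real implementation verifies all draft positions in a single target forward pass):

```python
from typing import Callable, List

def speculative_step(prefix: List[int],
                     draft_block: Callable[[List[int], int], List[int]],
                     target_next: Callable[[List[int]], int],
                     block_size: int = 16) -> List[int]:
    """One round of draft-then-verify greedy speculative decoding.

    The drafter proposes `block_size` tokens at once; the target checks
    them position by position (done here serially for clarity; in
    practice in one batched forward pass) and keeps the longest
    agreeing prefix, so progress is always at least 1 token per round.
    """
    draft = draft_block(prefix, block_size)
    accepted: List[int] = []
    for tok in draft:
        expected = target_next(prefix + accepted)
        if tok == expected:
            accepted.append(tok)       # draft token verified
        else:
            accepted.append(expected)  # first mismatch: take the target's token
            return accepted
    # Entire block accepted: append one bonus token from the target.
    accepted.append(target_next(prefix + accepted))
    return accepted

# Toy models: the "target" deterministically emits (prefix length mod 7);
# the drafter agrees on the first 3 positions, then guesses wrong.
def target_next(prefix: List[int]) -> int:
    return len(prefix) % 7

def draft_block(prefix: List[int], k: int) -> List[int]:
    return [(len(prefix) + i) % 7 if i < 3 else -1 for i in range(k)]

out = speculative_step([1, 2, 3], draft_block, target_next)
print(out)  # 3 verified draft tokens + 1 correction from the target
```

With a stronger drafter, more of each 16-token block survives verification, which is exactly what the accept-length numbers below measure.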
🚀 Quick Start
SGLang
Installation
```shell
uv pip install "git+https://github.com/sgl-project/sglang.git@refs/pull/20547/head#subdirectory=python"
```
Inference
```shell
export SGLANG_ENABLE_SPEC_V2=1
export SGLANG_ENABLE_DFLASH_SPEC_V2=1
export SGLANG_ENABLE_OVERLAP_PLAN_STREAM=1

python -m sglang.launch_server \
  --model-path Qwen/Qwen3.5-27B \
  --speculative-algorithm DFLASH \
  --speculative-draft-model-path z-lab/Qwen3.5-27B-DFlash \
  --speculative-num-draft-tokens 16 \
  --tp-size 1 \
  --attention-backend fa3 \
  --mem-fraction-static 0.75 \
  --mamba-scheduler-strategy extra_buffer \
  --trust-remote-code
```
Note: For long-context or agentic usage, consider adding `--speculative-dflash-draft-window-size WINDOW_SIZE` to enable sliding-window attention for the draft model.
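Once the server is up, speculative decoding is transparent to clients: requests go to SGLang's OpenAI-compatible endpoint (port 30000 by default) exactly as they would without a drafter. A minimal request sketch, assuming the default host and port (the prompt is illustrative):

```python
import json
import urllib.request

# Standard OpenAI-compatible chat completion payload; the model name is
# the target model as served above.
payload = {
    "model": "Qwen/Qwen3.5-27B",
    "messages": [
        {"role": "user",
         "content": "Explain speculative decoding in one sentence."}
    ],
    "max_tokens": 256,
}

def build_request(url: str = "http://localhost:30000/v1/chat/completions"):
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )

req = build_request()
print(req.full_url)
# Send with urllib.request.urlopen(req) once the server is running.
```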
vLLM
Thanks to the community and all contributors! Check out the following PRs to see how to run DFlash on vLLM: #36847 and #36767.
Early Results
- Thinking: enabled
- Max new tokens: 4096
- Block size: 16
- Checkpoint: 2.2 epochs
| Dataset   | Accept Length |
|-----------|---------------|
| GSM8K     | 6.80          |
| Math500   | 7.46          |
| HumanEval | 8.50          |
| MBPP      | 6.76          |
| MT-Bench  | 5.14          |
| Alpaca    | 4.74          |
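As a rough illustration of what these numbers mean: under the idealized assumption that drafting cost is negligible relative to the target's forward pass, an accept length of τ means each target forward pass yields about τ tokens instead of 1, so τ is an upper bound on the speedup. Averaging the table's values:

```python
# Accept lengths from the table above.
accept_length = {
    "GSM8K": 6.80, "Math500": 7.46, "HumanEval": 8.50,
    "MBPP": 6.76, "MT-Bench": 5.14, "Alpaca": 4.74,
}

mean_tau = sum(accept_length.values()) / len(accept_length)
print(f"mean accept length: {mean_tau:.2f}")
# Under the free-drafting assumption, the upper-bound speedup over
# plain autoregressive decoding equals this mean accept length.
```

Real speedups are lower, since the draft model and verification both add overhead.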