Starting the model with vLLM failed.
I followed the vLLM installation method provided in the documentation:
# install vllm
pip install vllm==0.11.2
# install deep_gemm
git clone https://github.com/deepseek-ai/DeepGEMM.git
cd DeepGEMM/third-party
git clone https://github.com/NVIDIA/cutlass.git
git clone https://github.com/fmtlib/fmt.git
cd ../
git checkout v2.1.1.post3
pip install . --no-build-isolation
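A quick sanity check after installation (this assumes both packages went into the currently active environment, and that deep_gemm is the import name of the DeepGEMM build):
# verify that vLLM and DeepGEMM import cleanly
python -c "import vllm; print('vllm', vllm.__version__)"
python -c "import deep_gemm; print('deep_gemm OK')"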
An error occurred when starting vllm:
ValueError: No valid attention backend found for cuda with head_size: 576, dtype: torch.bfloat16, kv_cache_dtype: auto, block_size: 64, use_mla: True, has_sink: False, use_sparse: True. Reasons: {FLASH_ATTN_MLA: [sparse not supported, compute capability not supported, FlashAttention MLA not supported on this device], FLASHMLA: [sparse not supported, compute capability not supported, FlashMLA Sparse is only supported on Hopper and Blackwell devices.], FLASHINFER_MLA: [sparse not supported, compute capability not supported], TRITON_MLA: [sparse not supported], FLASHMLA_SPARSE: [compute capability not supported]}.
When using vLLM v0.13.0, the following error occurred during startup:
ValueError: No valid attention backend found for cuda with AttentionSelectorConfig(head_size=576, dtype=torch.bfloat16, kv_cache_dtype=auto, block_size=None, use_mla=True, has_sink=False, use_sparse=True, use_mm_prefix=False, attn_type=decoder). Reasons: {FLASH_ATTN_MLA: [sparse not supported, compute capability not supported, FlashAttention MLA not supported on this device], FLASHMLA: [sparse not supported, compute capability not supported, FlashMLA Sparse is only supported on Hopper and Blackwell devices.], FLASHINFER_MLA: [sparse not supported, compute capability not supported], TRITON_MLA: [sparse not supported], FLASHMLA_SPARSE: [compute capability not supported]}.
The startup command for vllm is as follows:
export VLLM_USE_DEEP_GEMM=0 # ATM, this line is a "must" for Hopper devices
export TORCH_ALLOW_TF32_CUBLAS_OVERRIDE=1
export VLLM_USE_FLASHINFER_MOE_FP16=1
export VLLM_USE_FLASHINFER_SAMPLER=0
export OMP_NUM_THREADS=4
# note: --enable-expert-parallel is optional
# note: --speculative-config is optional; a roughly 50% throughput increase is observed with it
vllm serve \
__YOUR_PATH__/QuantTrio/DeepSeek-V3.2-AWQ \
--served-model-name MY_MODEL_NAME \
--enable-auto-tool-choice \
--tool-call-parser deepseek_v31 \
--reasoning-parser deepseek_v3 \
--swap-space 16 \
--max-num-seqs 32 \
--gpu-memory-utilization 0.9 \
--tensor-parallel-size 8 \
--enable-expert-parallel \
--speculative-config '{"model": "__YOUR_PATH__/QuantTrio/DeepSeek-V3.2-AWQ", "num_speculative_tokens": 1}' \
--trust-remote-code \
--host 0.0.0.0 \
--port 8000
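(For reference, once the server does start on a supported GPU, a minimal request against the OpenAI-compatible endpoint confirms it is serving; this assumes port 8000 and the served model name MY_MODEL_NAME from the command above.)
curl http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{"model": "MY_MODEL_NAME", "messages": [{"role": "user", "content": "Hello"}], "max_tokens": 32}'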
Please help me. How can I properly start it? Thank you.
What GPU device are you using?
This model uses a sparse‑attention mechanism, and for now it can only run on H‑series and B‑series cards. Older GPUs are not yet supported.
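A quick way to check what you have is PyTorch's standard device query; Hopper reports compute capability (9, 0), data-center Blackwell (10, 0), RTX Blackwell (12, 0), and Ampere/A100 (8, 0):
# print GPU name and compute capability
python -c "import torch; print(torch.cuda.get_device_name(0), torch.cuda.get_device_capability(0))"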
I have the same problem:
ValueError: No valid attention backend found for cuda with AttentionSelectorConfig(head_size=576, dtype=torch.bfloat16, kv_cache_dtype=auto, block_size=None, use_mla=True, has_sink=False, use_sparse=True, use_mm_prefix=False, attn_type=decoder). Reasons: {FLASH_ATTN_MLA: [sparse not supported, compute capability not supported, FlashAttention MLA not supported on this device], FLASHMLA: [sparse not supported, compute capability not supported, FlashMLA Sparse is only supported on Hopper and Blackwell devices.], FLASHINFER_MLA: [sparse not supported, compute capability not supported], TRITON_MLA: [sparse not supported], FLASHMLA_SPARSE: [compute capability not supported]}.
Using 8x RTX Blackwell 6000
Parameters:
VLLM_USE_DEEP_GEMM=1
TORCH_ALLOW_TF32_CUBLAS_OVERRIDE=1
VLLM_USE_FLASHINFER_MOE_FP16=1
VLLM_USE_FLASHINFER_SAMPLER=0
OMP_NUM_THREADS=4
vllm serve QuantTrio/DeepSeek-V3.2-AWQ \
--host 192.168.xxx.yyy \
--port 8000 \
--enable-auto-tool-choice \
--tool-call-parser deepseek_v31 \
--reasoning-parser deepseek_v3 \
--swap-space 16 \
--max-num-seqs 32 \
--gpu-memory-utilization 0.9 \
--trust-remote-code \
--served-model-name "vllm_thinkingparam" \
--tensor-parallel-size 8 \
--enable-expert-parallel \
--speculative-config '{"model": "QuantTrio/DeepSeek-V3.2-AWQ", "num_speculative_tokens": 1}' \
--max_model_len $token
Have you all tried the one from the official vLLM guide for DeepSeek-V3.2?
source .venv/bin/activate
uv pip install -U vllm --torch-backend auto
uv pip install git+https://github.com/deepseek-ai/[email protected] --no-build-isolation # Other versions may also work. We recommend using the latest released version from https://github.com/deepseek-ai/DeepGEMM/releases
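If the install itself succeeds but the server still fails, it may help to confirm which PyTorch/CUDA build --torch-backend auto picked up (a generic check, nothing specific to this model):
python -c "import torch; print(torch.__version__, torch.version.cuda, torch.cuda.is_available())"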
Yeah, I tried this, ending in the same problem:
ValueError: No valid attention backend found for cuda with AttentionSelectorConfig(head_size=576, dtype=torch.bfloat16, kv_cache_dtype=auto, block_size=None, use_mla=True, has_sink=False, use_sparse=True, use_mm_prefix=False, attn_type=decoder). Reasons: {FLASH_ATTN_MLA: [sparse not supported, compute capability not supported, FlashAttention MLA not supported on this device], FLASHMLA: [sparse not supported, compute capability not supported, FlashMLA Sparse is only supported on Hopper and Blackwell devices.], FLASHINFER_MLA: [sparse not supported, compute capability not supported], TRITON_MLA: [sparse not supported], FLASHMLA_SPARSE: [compute capability not supported]}.
Are SM120 (RTX Blackwell) GPUs supported? To me it seems they aren't.
Could you try editing the config.json file, changing "torch_dtype": "bfloat16" to "torch_dtype": "float16"?
Then give it one more try. If that still doesn't work, then it probably indeed doesn't work 🥲
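For reference, a one-line way to make that change, assuming the weights sit in a local directory (adjust /path/to to wherever the model was downloaded) and GNU sed; back up config.json first:
# flip the dtype field in the model's config.json
sed -i 's/"torch_dtype": "bfloat16"/"torch_dtype": "float16"/' /path/to/DeepSeek-V3.2-AWQ/config.json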
Yeah, I tried it, but unfortunately a pretty similar error message occurs:
ValueError: No valid attention backend found for cuda with AttentionSelectorConfig(head_size=576, dtype=torch.float16, kv_cache_dtype=auto, block_size=None, use_mla=True, has_sink=False, use_sparse=True, use_mm_prefix=False, attn_type=decoder). Reasons: {FLASH_ATTN_MLA: [sparse not supported, compute capability not supported, FlashAttention MLA not supported on this device], FLASHMLA: [sparse not supported, compute capability not supported, FlashMLA Sparse is only supported on Hopper and Blackwell devices.], FLASHINFER_MLA: [sparse not supported, compute capability not supported], TRITON_MLA: [sparse not supported], FLASHMLA_SPARSE: [dtype not supported, compute capability not supported]}.
:(
🥲
What GPU device are you using?
This model uses a sparse‑attention mechanism, and for now it can only run on H‑series and B‑series cards. Older GPUs are not yet supported.
I'm using 8*A100.
As above, I tested with 8×A100 and encountered the same issue. We need to wait for vLLM to support Sparse Attention on the Ampere architecture.