Model startup with vLLM failed.

#5 opened by beausoft

I followed the vLLM installation steps provided in the documentation:

# install vllm
pip install vllm==0.11.2
# install deep_gemm
git clone https://github.com/deepseek-ai/DeepGEMM.git
cd DeepGEMM/third-party
git clone https://github.com/NVIDIA/cutlass.git
git clone https://github.com/fmtlib/fmt.git
cd ../
git checkout v2.1.1.post3
pip install . --no-build-isolation
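
As a quick sanity check after the install (a generic snippet, not part of the original instructions), you can print the versions that actually ended up in the environment:

# print the installed vLLM / PyTorch / CUDA versions
python -c "import vllm, torch; print(vllm.__version__, torch.__version__, torch.version.cuda)"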

The following error occurred when starting vLLM:

ValueError: No valid attention backend found for cuda with head_size: 576, dtype: torch.bfloat16, kv_cache_dtype: auto, block_size: 64, use_mla: True, has_sink: False, use_sparse: True. Reasons: {FLASH_ATTN_MLA: [sparse not supported, compute capability not supported, FlashAttention MLA not supported on this device], FLASHMLA: [sparse not supported, compute capability not supported, FlashMLA Sparse is only supported on Hopper and Blackwell devices.], FLASHINFER_MLA: [sparse not supported, compute capability not supported], TRITON_MLA: [sparse not supported], FLASHMLA_SPARSE: [compute capability not supported]}.

When using vLLM v0.13.0, the following error occurred during startup:

ValueError: No valid attention backend found for cuda with AttentionSelectorConfig(head_size=576, dtype=torch.bfloat16, kv_cache_dtype=auto, block_size=None, use_mla=True, has_sink=False, use_sparse=True, use_mm_prefix=False, attn_type=decoder). Reasons: {FLASH_ATTN_MLA: [sparse not supported, compute capability not supported, FlashAttention MLA not supported on this device], FLASHMLA: [sparse not supported, compute capability not supported, FlashMLA Sparse is only supported on Hopper and Blackwell devices.], FLASHINFER_MLA: [sparse not supported, compute capability not supported], TRITON_MLA: [sparse not supported], FLASHMLA_SPARSE: [compute capability not supported]}.

The vLLM startup command is as follows:

export VLLM_USE_DEEP_GEMM=0  # ATM, this line is a "must" for Hopper devices
export TORCH_ALLOW_TF32_CUBLAS_OVERRIDE=1
export VLLM_USE_FLASHINFER_MOE_FP16=1
export VLLM_USE_FLASHINFER_SAMPLER=0
export OMP_NUM_THREADS=4

# --enable-expert-parallel is optional; --speculative-config is also optional,
# and a ~50% throughput increase is observed with it
vllm serve \
    __YOUR_PATH__/QuantTrio/DeepSeek-V3.2-AWQ \
    --served-model-name MY_MODEL_NAME \
    --enable-auto-tool-choice \
    --tool-call-parser deepseek_v31 \
    --reasoning-parser deepseek_v3 \
    --swap-space 16 \
    --max-num-seqs 32 \
    --gpu-memory-utilization 0.9 \
    --tensor-parallel-size 8 \
    --enable-expert-parallel \
    --speculative-config '{"model": "__YOUR_PATH__/QuantTrio/DeepSeek-V3.2-AWQ", "num_speculative_tokens": 1}' \
    --trust-remote-code \
    --host 0.0.0.0 \
    --port 8000
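
Once the server does come up, a minimal smoke test against the OpenAI-compatible endpoint looks like this (not from the original post; the model name must match --served-model-name, and host/port are whatever was passed to vllm serve):

curl http://localhost:8000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{"model": "MY_MODEL_NAME", "messages": [{"role": "user", "content": "Hello"}], "max_tokens": 32}'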

Please help me. How can I start it properly? Thank you.

What GPU device are you using?
This model uses a sparse‑attention mechanism, and for now it can only run on H‑series and B‑series cards. Older GPUs are not yet supported.
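
A quick way to see what architecture a card reports is the generic PyTorch one-liner below (not from the original reply; as a rough guide, Hopper reports compute capability 9.0, datacenter Blackwell 10.x, RTX Blackwell 12.0, while A100/Ampere reports 8.0 and is rejected by the sparse MLA backends listed in the error):

python -c "import torch; print(torch.cuda.get_device_name(0), torch.cuda.get_device_capability(0))"
# e.g. an A100 prints: NVIDIA A100-SXM4-80GB (8, 0)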

I have the same problem:
ValueError: No valid attention backend found for cuda with AttentionSelectorConfig(head_size=576, dtype=torch.bfloat16, kv_cache_dtype=auto, block_size=None, use_mla=True, has_sink=False, use_sparse=True, use_mm_prefix=False, attn_type=decoder). Reasons: {FLASH_ATTN_MLA: [sparse not supported, compute capability not supported, FlashAttention MLA not supported on this device], FLASHMLA: [sparse not supported, compute capability not supported, FlashMLA Sparse is only supported on Hopper and Blackwell devices.], FLASHINFER_MLA: [sparse not supported, compute capability not supported], TRITON_MLA: [sparse not supported], FLASHMLA_SPARSE: [compute capability not supported]}.

Using 8x RTX Blackwell 6000

Parameters:

VLLM_USE_DEEP_GEMM=1
TORCH_ALLOW_TF32_CUBLAS_OVERRIDE=1
VLLM_USE_FLASHINFER_MOE_FP16=1
VLLM_USE_FLASHINFER_SAMPLER=0
OMP_NUM_THREADS=4

vllm serve QuantTrio/DeepSeek-V3.2-AWQ \
    --host 192.168.xxx.yyy \
    --port 8000 \
    --enable-auto-tool-choice \
    --tool-call-parser deepseek_v31 \
    --reasoning-parser deepseek_v3 \
    --swap-space 16 \
    --max-num-seqs 32 \
    --gpu-memory-utilization 0.9 \
    --trust-remote-code \
    --served-model-name "vllm_thinkingparam" \
    --tensor-parallel-size 8 \
    --enable-expert-parallel \
    --speculative-config '{"model": "QuantTrio/DeepSeek-V3.2-AWQ", "num_speculative_tokens": 1}' \
    --max_model_len $token

QuantTrio org

Have you all tried the recipe from the official vLLM guide for DeepSeek-V3.2?

source .venv/bin/activate
uv pip install -U vllm --torch-backend auto
uv pip install git+https://github.com/deepseek-ai/[email protected] --no-build-isolation # Other versions may also work. We recommend using the latest released version from https://github.com/deepseek-ai/DeepGEMM/releases
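
To confirm DeepGEMM actually landed in the venv, something like this should work (assuming the wheel exposes a deep_gemm Python module, which is not confirmed in this thread):

python -c "import deep_gemm; print(deep_gemm.__file__)"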

Yeah, I tried this - it ends in the same problem:
ValueError: No valid attention backend found for cuda with AttentionSelectorConfig(head_size=576, dtype=torch.bfloat16, kv_cache_dtype=auto, block_size=None, use_mla=True, has_sink=False, use_sparse=True, use_mm_prefix=False, attn_type=decoder). Reasons: {FLASH_ATTN_MLA: [sparse not supported, compute capability not supported, FlashAttention MLA not supported on this device], FLASHMLA: [sparse not supported, compute capability not supported, FlashMLA Sparse is only supported on Hopper and Blackwell devices.], FLASHINFER_MLA: [sparse not supported, compute capability not supported], TRITON_MLA: [sparse not supported], FLASHMLA_SPARSE: [compute capability not supported]}.

Are SM120 (RTX Blackwell) cards supported? For me it seems they aren't.

QuantTrio org

Could you try editing the config.json file and changing "torch_dtype": "bfloat16" to "torch_dtype": "float16"?
Then give it one more try. If it still doesn't work, then it probably indeed doesn't work 🥲
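
For example, a one-liner along these lines should do it (a hypothetical command; the path is a placeholder for wherever the downloaded config.json lives, and it's worth backing the file up first):

sed -i 's/"torch_dtype": "bfloat16"/"torch_dtype": "float16"/' __YOUR_PATH__/QuantTrio/DeepSeek-V3.2-AWQ/config.json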

Yeah, I tried it, but unfortunately a pretty similar error message occurs:
ValueError: No valid attention backend found for cuda with AttentionSelectorConfig(head_size=576, dtype=torch.float16, kv_cache_dtype=auto, block_size=None, use_mla=True, has_sink=False, use_sparse=True, use_mm_prefix=False, attn_type=decoder). Reasons: {FLASH_ATTN_MLA: [sparse not supported, compute capability not supported, FlashAttention MLA not supported on this device], FLASHMLA: [sparse not supported, compute capability not supported, FlashMLA Sparse is only supported on Hopper and Blackwell devices.], FLASHINFER_MLA: [sparse not supported, compute capability not supported], TRITON_MLA: [sparse not supported], FLASHMLA_SPARSE: [dtype not supported, compute capability not supported]}.

:(

QuantTrio org

🥲

What GPU device are you using?
This model uses a sparse‑attention mechanism, and for now it can only run on H‑series and B‑series cards. Older GPUs are not yet supported.

I'm using 8*A100.

As above, I tested with 8×A100 and encountered the same issue. We need to wait for vLLM to support Sparse Attention on the Ampere architecture.
