Introduction
The model is trained on the UltraChat dataset with responses regenerated by Qwen3.5-35B-A3B, using only the first turn of each conversation. I trained it with SpecForge; the training script and parameters are listed below:
```bash
SCRIPT_DIR=$( cd -- "$( dirname -- "${BASH_SOURCE[0]}" )" &> /dev/null && pwd )
ROOT_DIR=$(dirname "$SCRIPT_DIR")

# train eagle3 for qwen3.5
NUM_GPUS=2
TP_SIZE=1
BUILD_DATASET_NUM_PROC=${BUILD_DATASET_NUM_PROC:-64}
export HF_DATASETS_CACHE=$ROOT_DIR/cache/hf_datasets

CUDA_VISIBLE_DEVICES=2,3 torchrun \
    --standalone \
    --nproc_per_node $NUM_GPUS \
    $ROOT_DIR/scripts/train_eagle3.py \
    --target-model-path /data/jiapingW/pretrained_models/Qwen3.5-35B-A3B \
    --draft-model-config $ROOT_DIR/configs/qwen3.5-35b-a3b-eagle3.json \
    --train-data-path /data/jiapingW/projects/SpecForge/cache/dataset/ultrachat_train_regen_1w_first_turn.jsonl \
    --build-dataset-num-proc $BUILD_DATASET_NUM_PROC \
    --output-dir $ROOT_DIR/outputs/qwen3.5-35b-a3b-ultrachat-regen-first-turn/draft-vocab-32000-kvhead-16-full-attn-plus1-template-qwen3.5 \
    --num-epochs 10 \
    --batch-size 1 \
    --tp-size $TP_SIZE \
    --learning-rate 1e-4 \
    --max-length 8192 \
    --chat-template qwen3.5 \
    --cache-dir $ROOT_DIR/cache \
    --embedding-key "model.language_model.embed_tokens.weight" \
    --sglang-mem-fraction-static 0.6 \
    --save-interval 5000 \
    --report-to tensorboard
```
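For reference, keeping only the first turn of each conversation can be done with a small preprocessing pass. This is a minimal sketch, not the actual SpecForge preprocessing; the `messages` field layout and file paths are assumptions:

```python
import json

def first_turn(sample):
    """Keep only the first user/assistant exchange of a conversation."""
    msgs = sample.get("messages", [])
    # Cut the conversation right after the first assistant reply.
    for i, m in enumerate(msgs):
        if m["role"] == "assistant":
            return {"messages": msgs[: i + 1]}
    return None  # no assistant reply at all; drop the sample

def build_first_turn_file(in_path, out_path):
    """Rewrite a multi-turn JSONL file as a first-turn-only JSONL file."""
    with open(in_path) as fin, open(out_path, "w") as fout:
        for line in fin:
            trimmed = first_turn(json.loads(line))
            if trimmed:
                fout.write(json.dumps(trimmed, ensure_ascii=False) + "\n")
```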
I tested it with SGLang from this SGLang branch; I made the minimal code changes to SGLang 0.5.9 needed to support Qwen3.5 EAGLE3.
The test results are listed below:
Server Command:
```bash
python -m sglang.launch_server \
    --model /data/jiapingW/pretrained_models/Qwen3.5-35B-A3B \
    --speculative-algorithm EAGLE3 \
    --speculative-num-steps 4 \
    --speculative-eagle-topk 1 \
    --speculative-num-draft-tokens 5 \
    --enable-flashinfer-allreduce-fusion \
    --mem-fraction-static 0.8 \
    --speculative-draft-model-path /data/jiapingW/projects/SpecForge/outputs/qwen3.5-35b-a3b-ultrachat-regen-first-turn/draft-vocab-32000-kvhead-16-full-attn-plus1/epoch_5_step_275000
```
- Training set (the first 30 prompts):
```bash
python3 -m sglang.bench_serving \
    --backend sglang-oai-chat \
    --host 127.0.0.1 \
    --port 30000 \
    --model /data/jiapingW/pretrained_models/Qwen3.5-35B-A3B \
    --dataset-name custom \
    --num-prompts 30 \
    --max-concurrency 1 \
    --request-rate inf \
    --dataset-path /data/jiapingW/projects/SpecForge/cache/dataset/ultrachat_train_regen_first_turn.jsonl
```
```
============ Serving Benchmark Result ============
Backend:                                 sglang-oai-chat
Traffic request rate:                    inf
Max request concurrency:                 1
Successful requests:                     30
Benchmark duration (s):                  140.30
Total input tokens:                      5323
Total input text tokens:                 5323
Total generated tokens:                  55291
Total generated tokens (retokenized):    55143
Request throughput (req/s):              0.21
Input token throughput (tok/s):          37.94
Output token throughput (tok/s):         394.10
Peak output token throughput (tok/s):    121.00
Peak concurrent requests:                2
Total token throughput (tok/s):          432.04
Concurrency:                             1.00
Accept length:                           3.35
----------------End-to-End Latency----------------
Mean E2E Latency (ms):                   4676.08
Median E2E Latency (ms):                 4950.71
P90 E2E Latency (ms):                    5718.35
P99 E2E Latency (ms):                    6004.36
---------------Time to First Token----------------
Mean TTFT (ms):                          56.77
Median TTFT (ms):                        54.68
P99 TTFT (ms):                           66.60
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          2.49
Median TPOT (ms):                        2.44
P99 TPOT (ms):                           2.91
---------------Inter-Token Latency----------------
Mean ITL (ms):                           2.51
Median ITL (ms):                         1.71
P95 ITL (ms):                            8.20
P99 ITL (ms):                            8.55
Max ITL (ms):                            92.97
==================================================
```
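As a quick sanity check, the reported throughput figures are consistent with the token counts and duration in the table above (recomputed here from the table's own numbers; small differences are rounding of the duration):

```python
# Figures copied from the training-set benchmark table above.
duration_s = 140.30
generated_tokens = 55291
input_tokens = 5323

# Output and total token throughput, as bench_serving computes them.
output_tps = generated_tokens / duration_s
total_tps = (generated_tokens + input_tokens) / duration_s

print(f"output tok/s: {output_tps:.1f}")  # ~394.1, matching the table
print(f"total tok/s:  {total_tps:.1f}")   # ~432.0, matching the table
```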
- ShareGPT dataset (the first 30 prompts):
```bash
python3 -m sglang.bench_serving \
    --backend sglang-oai-chat \
    --host 127.0.0.1 \
    --port 30000 \
    --model /data/jiapingW/pretrained_models/Qwen3.5-35B-A3B \
    --num-prompts 30 \
    --max-concurrency 1 \
    --request-rate inf
```
```
============ Serving Benchmark Result ============
Backend:                                 sglang-oai-chat
Traffic request rate:                    inf
Max request concurrency:                 1
Successful requests:                     30
Benchmark duration (s):                  24.28
Total input tokens:                      10010
Total input text tokens:                 10010
Total generated tokens:                  7895
Total generated tokens (retokenized):    7895
Request throughput (req/s):              1.24
Input token throughput (tok/s):          412.26
Output token throughput (tok/s):         325.16
Peak output token throughput (tok/s):    120.00
Peak concurrent requests:                4
Total token throughput (tok/s):          737.42
Concurrency:                             1.00
Accept length:                           2.93
----------------End-to-End Latency----------------
Mean E2E Latency (ms):                   808.93
Median E2E Latency (ms):                 814.78
P90 E2E Latency (ms):                    1501.26
P99 E2E Latency (ms):                    2613.32
---------------Time to First Token----------------
Mean TTFT (ms):                          58.99
Median TTFT (ms):                        57.31
P99 TTFT (ms):                           82.40
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          2.91
Median TPOT (ms):                        2.75
P99 TPOT (ms):                           6.78
---------------Inter-Token Latency----------------
Mean ITL (ms):                           2.86
Median ITL (ms):                         1.74
P95 ITL (ms):                            8.32
P99 ITL (ms):                            8.56
Max ITL (ms):                            8.82
==================================================
```
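The accept length is the average number of tokens committed per target-model forward pass. Assuming it includes the one token the target model always produces itself, the per-token draft acceptance rate for the two runs above can be estimated as follows (this interpretation of the metric is an assumption, not taken from the SGLang output):

```python
NUM_STEPS = 4  # --speculative-num-steps from the server command

def draft_acceptance_rate(accept_length, num_steps=NUM_STEPS):
    # Subtract the guaranteed token from the target model, then
    # normalize by the number of drafted tokens per step.
    return (accept_length - 1) / num_steps

print(f"training set: {draft_acceptance_rate(3.35):.2f}")  # ~0.59
print(f"ShareGPT:     {draft_acceptance_rate(2.93):.2f}")  # ~0.48
```

The higher accept length on the training set than on ShareGPT is expected, since the draft model was trained on that distribution.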