Introduction

This model is an EAGLE3 draft model trained on the UltraChat dataset, with responses regenerated by Qwen3.5-35B-A3B. Only the first conversation turn of each sample is used for training. Training was done with SpecForge, using the parameters listed below:


```bash
SCRIPT_DIR=$( cd -- "$( dirname -- "${BASH_SOURCE[0]}" )" &> /dev/null && pwd )
ROOT_DIR=$(dirname "$SCRIPT_DIR")

# train eagle3 for qwen3.5
NUM_GPUS=2
TP_SIZE=1
BUILD_DATASET_NUM_PROC=${BUILD_DATASET_NUM_PROC:-64}

export HF_DATASETS_CACHE=$ROOT_DIR/cache/hf_datasets

CUDA_VISIBLE_DEVICES=2,3 torchrun \
    --standalone \
    --nproc_per_node $NUM_GPUS \
    $ROOT_DIR/scripts/train_eagle3.py \
    --target-model-path /data/jiapingW/pretrained_models/Qwen3.5-35B-A3B \
    --draft-model-config $ROOT_DIR/configs/qwen3.5-35b-a3b-eagle3.json \
    --train-data-path /data/jiapingW/projects/SpecForge/cache/dataset/ultrachat_train_regen_1w_first_turn.jsonl \
    --build-dataset-num-proc $BUILD_DATASET_NUM_PROC \
    --output-dir $ROOT_DIR/outputs/qwen3.5-35b-a3b-ultrachat-regen-first-turn/draft-vocab-32000-kvhead-16-full-attn-plus1-template-qwen3.5 \
    --num-epochs 10 \
    --batch-size 1 \
    --tp-size $TP_SIZE \
    --learning-rate 1e-4 \
    --max-length 8192 \
    --chat-template qwen3.5 \
    --cache-dir $ROOT_DIR/cache \
    --embedding-key "model.language_model.embed_tokens.weight" \
    --sglang-mem-fraction-static 0.6 \
    --save-interval 5000 \
    --report-to tensorboard
```
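`--train-data-path` expects a JSONL file with one conversation per line. The exact schema depends on SpecForge's dataset loader, so the field names below (`conversations`, `role`, `content`) are illustrative assumptions only; still, a quick validity check on the file is cheap insurance before launching a long run:

```shell
# Sanity-check a training JSONL before a long run.
# NOTE: the record schema shown here is an assumption, not necessarily
# SpecForge's actual format -- only the "one JSON object per line" part
# is being checked.
data=/tmp/example_first_turn.jsonl
printf '%s\n' '{"conversations":[{"role":"user","content":"Hi"},{"role":"assistant","content":"Hello!"}]}' > "$data"

# Every line must parse as standalone JSON:
head -n 1 "$data" | python3 -m json.tool > /dev/null && echo "valid json"

# Number of training samples = number of lines:
wc -l < "$data"
```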

I tested it with SGLang from this sglang branch, which contains the minimal code changes on top of sglang 0.5.9 needed to support Qwen3.5 EAGLE3.

The test results are listed below:

Server command:

```bash
python -m sglang.launch_server \
    --model /data/jiapingW/pretrained_models/Qwen3.5-35B-A3B \
    --speculative-algorithm EAGLE3 \
    --speculative-num-steps 4 \
    --speculative-eagle-topk 1 \
    --speculative-num-draft-tokens 5 \
    --enable-flashinfer-allreduce-fusion \
    --mem-fraction-static 0.8 \
    --speculative-draft-model-path /data/jiapingW/projects/SpecForge/outputs/qwen3.5-35b-a3b-ultrachat-regen-first-turn/draft-vocab-32000-kvhead-16-full-attn-plus1/epoch_5_step_275000
```
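A note on how these three speculative flags relate (a sketch of the usual chain-drafting arithmetic, not an sglang-internal guarantee): with `--speculative-eagle-topk 1` the draft model proposes a single chain, one token per step, and the verify batch covers that chain plus one bonus position from the target model, which is why `--speculative-num-draft-tokens` is typically `steps + 1`:

```shell
# Chain drafting (topk = 1): one proposed token per draft step.
steps=4                  # --speculative-num-steps 4
tokens_per_step=1        # --speculative-eagle-topk 1
# Verified positions per round = draft chain + 1 bonus token,
# i.e. the value passed as --speculative-num-draft-tokens:
echo $(( steps * tokens_per_step + 1 ))
```

This also sets the ceiling on accept length: at most 5 tokens can be emitted per verification round with this configuration.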
• Training set (the first 30 prompts):
```bash
python3 -m sglang.bench_serving \
    --backend sglang-oai-chat \
    --host 127.0.0.1 \
    --port 30000 \
    --model /data/jiapingW/pretrained_models/Qwen3.5-35B-A3B \
    --dataset-name custom \
    --num-prompts 30 \
    --max-concurrency 1 \
    --request-rate inf \
    --dataset-path /data/jiapingW/projects/SpecForge/cache/dataset/ultrachat_train_regen_first_turn.jsonl
```
```
============ Serving Benchmark Result ============
Backend:                                 sglang-oai-chat
Traffic request rate:                    inf       
Max request concurrency:                 1         
Successful requests:                     30        
Benchmark duration (s):                  140.30    
Total input tokens:                      5323      
Total input text tokens:                 5323      
Total generated tokens:                  55291     
Total generated tokens (retokenized):    55143     
Request throughput (req/s):              0.21      
Input token throughput (tok/s):          37.94     
Output token throughput (tok/s):         394.10    
Peak output token throughput (tok/s):    121.00    
Peak concurrent requests:                2         
Total token throughput (tok/s):          432.04    
Concurrency:                             1.00      
Accept length:                           3.35      
----------------End-to-End Latency----------------
Mean E2E Latency (ms):                   4676.08   
Median E2E Latency (ms):                 4950.71   
P90 E2E Latency (ms):                    5718.35   
P99 E2E Latency (ms):                    6004.36   
---------------Time to First Token----------------
Mean TTFT (ms):                          56.77     
Median TTFT (ms):                        54.68     
P99 TTFT (ms):                           66.60     
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          2.49      
Median TPOT (ms):                        2.44      
P99 TPOT (ms):                           2.91      
---------------Inter-Token Latency----------------
Mean ITL (ms):                           2.51      
Median ITL (ms):                         1.71      
P95 ITL (ms):                            8.20      
P99 ITL (ms):                            8.55      
Max ITL (ms):                            92.97     
==================================================
```
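A back-of-the-envelope check on these numbers (a sketch, not an sglang measurement): with an accept length of 3.35, each target-model verification round emits ~3.35 tokens on average, so the implied time per round is accept length × mean TPOT ≈ 8.3 ms. That lines up with the ITL distribution above: accepted draft tokens stream out almost instantly (median ITL 1.71 ms), while the occasional ~8 ms gap (P95 ITL 8.20 ms) corresponds to a fresh verification round.

```shell
# Implied target-model verification round time (sketch):
accept_len=3.35    # "Accept length" from the table above
mean_tpot_ms=2.49  # "Mean TPOT (ms)" from the table above
awk -v a="$accept_len" -v t="$mean_tpot_ms" \
    'BEGIN { printf "implied round time: %.2f ms\n", a * t }'
# -> implied round time: 8.34 ms
```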
• ShareGPT dataset (the first 30 prompts):
```bash
python3 -m sglang.bench_serving \
    --backend sglang-oai-chat \
    --host 127.0.0.1 \
    --port 30000 \
    --model /data/jiapingW/pretrained_models/Qwen3.5-35B-A3B \
    --num-prompts 30 \
    --max-concurrency 1 \
    --request-rate inf
```
```
============ Serving Benchmark Result ============
Backend:                                 sglang-oai-chat
Traffic request rate:                    inf       
Max request concurrency:                 1         
Successful requests:                     30        
Benchmark duration (s):                  24.28     
Total input tokens:                      10010     
Total input text tokens:                 10010     
Total generated tokens:                  7895      
Total generated tokens (retokenized):    7895      
Request throughput (req/s):              1.24      
Input token throughput (tok/s):          412.26    
Output token throughput (tok/s):         325.16    
Peak output token throughput (tok/s):    120.00    
Peak concurrent requests:                4         
Total token throughput (tok/s):          737.42    
Concurrency:                             1.00      
Accept length:                           2.93      
----------------End-to-End Latency----------------
Mean E2E Latency (ms):                   808.93    
Median E2E Latency (ms):                 814.78    
P90 E2E Latency (ms):                    1501.26   
P99 E2E Latency (ms):                    2613.32   
---------------Time to First Token----------------
Mean TTFT (ms):                          58.99     
Median TTFT (ms):                        57.31     
P99 TTFT (ms):                           82.40     
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          2.91      
Median TPOT (ms):                        2.75      
P99 TPOT (ms):                           6.78      
---------------Inter-Token Latency----------------
Mean ITL (ms):                           2.86      
Median ITL (ms):                         1.74      
P95 ITL (ms):                            8.32      
P99 ITL (ms):                            8.56      
Max ITL (ms):                            8.82      
==================================================
```