Introduction
The model is trained on the UltraChat dataset with responses regenerated by Qwen3.5-35B-A3B, using only the first turn of each conversation. I trained it with SpecForge; the training script and parameters are listed below:
```bash
SCRIPT_DIR=$( cd -- "$( dirname -- "${BASH_SOURCE[0]}" )" &> /dev/null && pwd )
ROOT_DIR=$(dirname "$SCRIPT_DIR")

# train eagle3 for qwen3.5
NUM_GPUS=2
TP_SIZE=1
BUILD_DATASET_NUM_PROC=${BUILD_DATASET_NUM_PROC:-64}
export HF_DATASETS_CACHE=$ROOT_DIR/cache/hf_datasets

CUDA_VISIBLE_DEVICES=2,3 torchrun \
    --standalone \
    --nproc_per_node $NUM_GPUS \
    $ROOT_DIR/scripts/train_eagle3.py \
    --target-model-path /data/jiapingW/pretrained_models/Qwen3.5-35B-A3B \
    --draft-model-config $ROOT_DIR/configs/qwen3.5-35b-a3b-eagle3.json \
    --train-data-path /data/jiapingW/projects/SpecForge/cache/dataset/ultrachat_train_regen_1w_first_turn.jsonl \
    --build-dataset-num-proc $BUILD_DATASET_NUM_PROC \
    --output-dir $ROOT_DIR/outputs/qwen3.5-35b-a3b-ultrachat-regen-first-turn/draft-vocab-32000-kvhead-16-full-attn-plus1-template-qwen3.5 \
    --num-epochs 10 \
    --batch-size 1 \
    --tp-size $TP_SIZE \
    --learning-rate 1e-4 \
    --max-length 8192 \
    --chat-template qwen3.5 \
    --cache-dir $ROOT_DIR/cache \
    --embedding-key "model.language_model.embed_tokens.weight" \
    --sglang-mem-fraction-static 0.6 \
    --save-interval 5000 \
    --report-to tensorboard
```
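For reference, keeping only the first turn of each conversation can be done with a small preprocessing pass. This is a minimal sketch, not the actual SpecForge preprocessing; the `messages` field layout and file paths are assumptions:

```python
import json

def first_turn(sample):
    """Keep only the first user/assistant exchange of a conversation."""
    msgs = sample.get("messages", [])
    # Cut the conversation right after the first assistant reply.
    for i, m in enumerate(msgs):
        if m["role"] == "assistant":
            return {"messages": msgs[: i + 1]}
    return None  # no assistant reply at all; drop the sample

def build_first_turn_file(in_path, out_path):
    """Rewrite a multi-turn JSONL file as a first-turn-only JSONL file."""
    with open(in_path) as fin, open(out_path, "w") as fout:
        for line in fin:
            trimmed = first_turn(json.loads(line))
            if trimmed:
                fout.write(json.dumps(trimmed, ensure_ascii=False) + "\n")
```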
I tested it with SGLang from this SGLang branch; I made the minimal code changes to SGLang 0.5.9 needed to support Qwen3.5 EAGLE3.
The test results are listed below:
Server Command:
```bash
python -m sglang.launch_server \
    --model /data/jiapingW/pretrained_models/Qwen3.5-35B-A3B \
    --speculative-algorithm EAGLE3 \
    --speculative-num-steps 4 \
    --speculative-eagle-topk 1 \
    --speculative-num-draft-tokens 5 \
    --enable-flashinfer-allreduce-fusion \
    --mem-fraction-static 0.8 \
    --speculative-draft-model-path /data/jiapingW/projects/SpecForge/outputs/qwen3.5-35b-a3b-ultrachat-regen-first-turn/draft-vocab-32000-kvhead-16-full-attn-plus1/epoch_5_step_275000
```
- Training set (the first 30 prompts):
```bash
python3 -m sglang.bench_serving \
    --backend sglang-oai-chat \
    --host 127.0.0.1 \
    --port 30000 \
    --model /data/jiapingW/pretrained_models/Qwen3.5-35B-A3B \
    --dataset-name custom \
    --num-prompts 30 \
    --max-concurrency 1 \
    --request-rate inf \
    --dataset-path /data/jiapingW/projects/SpecForge/cache/dataset/ultrachat_train_regen_first_turn.jsonl
```
```
============ Serving Benchmark Result ============
Backend:                                 sglang-oai-chat
Traffic request rate:                    inf
Max request concurrency:                 1
Successful requests:                     30
Benchmark duration (s):                  140.30
Total input tokens:                      5323
Total input text tokens:                 5323
Total generated tokens:                  55291
Total generated tokens (retokenized):    55143
Request throughput (req/s):              0.21
Input token throughput (tok/s):          37.94
Output token throughput (tok/s):         394.10
Peak output token throughput (tok/s):    121.00
Peak concurrent requests:                2
Total token throughput (tok/s):          432.04
Concurrency:                             1.00
Accept length:                           3.35
----------------End-to-End Latency----------------
Mean E2E Latency (ms):                   4676.08
Median E2E Latency (ms):                 4950.71
P90 E2E Latency (ms):                    5718.35
P99 E2E Latency (ms):                    6004.36
---------------Time to First Token----------------
Mean TTFT (ms):                          56.77
Median TTFT (ms):                        54.68
P99 TTFT (ms):                           66.60
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          2.49
Median TPOT (ms):                        2.44
P99 TPOT (ms):                           2.91
---------------Inter-Token Latency----------------
Mean ITL (ms):                           2.51
Median ITL (ms):                         1.71
P95 ITL (ms):                            8.20
P99 ITL (ms):                            8.55
Max ITL (ms):                            92.97
==================================================
```
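As a quick sanity check, the reported throughput figures are consistent with the token counts and duration in the table above (recomputed here from the table's own numbers; small differences are rounding of the duration):

```python
# Figures copied from the training-set benchmark table above.
duration_s = 140.30
generated_tokens = 55291
input_tokens = 5323

# Output and total token throughput, as bench_serving computes them.
output_tps = generated_tokens / duration_s
total_tps = (generated_tokens + input_tokens) / duration_s

print(f"output tok/s: {output_tps:.1f}")  # ~394.1, matching the table
print(f"total tok/s:  {total_tps:.1f}")   # ~432.0, matching the table
```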
- ShareGPT dataset (the first 30 prompts):
```bash
python3 -m sglang.bench_serving \
    --backend sglang-oai-chat \
    --host 127.0.0.1 \
    --port 30000 \
    --model /data/jiapingW/pretrained_models/Qwen3.5-35B-A3B \
    --num-prompts 30 \
    --max-concurrency 1 \
    --request-rate inf
```
```
============ Serving Benchmark Result ============
Backend:                                 sglang-oai-chat
Traffic request rate:                    inf
Max request concurrency:                 1
Successful requests:                     30
Benchmark duration (s):                  24.28
Total input tokens:                      10010
Total input text tokens:                 10010
Total generated tokens:                  7895
Total generated tokens (retokenized):    7895
Request throughput (req/s):              1.24
Input token throughput (tok/s):          412.26
Output token throughput (tok/s):         325.16
Peak output token throughput (tok/s):    120.00
Peak concurrent requests:                4
Total token throughput (tok/s):          737.42
Concurrency:                             1.00
Accept length:                           2.93
----------------End-to-End Latency----------------
Mean E2E Latency (ms):                   808.93
Median E2E Latency (ms):                 814.78
P90 E2E Latency (ms):                    1501.26
P99 E2E Latency (ms):                    2613.32
---------------Time to First Token----------------
Mean TTFT (ms):                          58.99
Median TTFT (ms):                        57.31
P99 TTFT (ms):                           82.40
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          2.91
Median TPOT (ms):                        2.75
P99 TPOT (ms):                           6.78
---------------Inter-Token Latency----------------
Mean ITL (ms):                           2.86
Median ITL (ms):                         1.74
P95 ITL (ms):                            8.32
P99 ITL (ms):                            8.56
Max ITL (ms):                            8.82
==================================================
```
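The accept length is the average number of tokens committed per target-model forward pass. Assuming it includes the one token the target model always produces itself, the per-token draft acceptance rate for the two runs above can be estimated as follows (this interpretation of the metric is an assumption, not taken from the SGLang output):

```python
NUM_STEPS = 4  # --speculative-num-steps from the server command

def draft_acceptance_rate(accept_length, num_steps=NUM_STEPS):
    # Subtract the guaranteed token from the target model, then
    # normalize by the number of drafted tokens per step.
    return (accept_length - 1) / num_steps

print(f"training set: {draft_acceptance_rate(3.35):.2f}")  # ~0.59
print(f"ShareGPT:     {draft_acceptance_rate(2.93):.2f}")  # ~0.48
```

The higher accept length on the training set than on ShareGPT is expected, since the draft model was trained on that distribution.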