lightseek committed on
Commit ca8a4db · verified · 1 Parent(s): a86c9f8

Upload folder using huggingface_hub
.gitattributes CHANGED
@@ -33,3 +33,5 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
 *.zip filter=lfs diff=lfs merge=lfs -text
 *.zst filter=lfs diff=lfs merge=lfs -text
 *tfevents* filter=lfs diff=lfs merge=lfs -text
+benchmark_results.png filter=lfs diff=lfs merge=lfs -text
+images/benchmark_results.png filter=lfs diff=lfs merge=lfs -text
README.md ADDED
---
license: mit
base_model: moonshotai/Kimi-K2.5
tags:
- speculative-decoding
- eagle3
- draft-model
- kimi-k2.5
---

## Model Overview

**kimi-k2.5-eagle3** is an Eagle3 MTP draft model for accelerating inference of [Kimi-K2.5](https://huggingface.co/moonshotai/Kimi-K2.5), trained with **[TorchSpec](https://github.com/torchspec-project/TorchSpec)**, an online speculative decoding training framework that runs FSDP training and inference concurrently.

Training data is available at [lightseekorg/kimi-mtp-dataset](https://huggingface.co/datasets/lightseekorg/kimi-mtp-dataset).

### Training Setup

- **Cluster**: 4 nodes × 8× H200 (32 GPUs total)
- **Training**: 2 nodes (16 GPUs), FSDP
- **Inference**: 2 nodes (16 GPUs), Engine (TP=8 per node)
- **Duration**: ~14 hours per phase

Training ran in two phases, each 20k steps (~300k samples):
- **Phase 1**: Regenerated [open-perfectblend](https://huggingface.co/datasets/mlabonne/open-perfectblend) dataset
- **Phase 2**: Mixed dataset (English, VL, Chinese, function-call, agent, creative writing)

All training responses were **regenerated by Kimi-K2.5 via Engine** to match the base model's exact token distribution.

### Training Curves

The plots show loss, token acceptance accuracy, and simulated accept_length during training. Both eval sets contain 256 samples drawn from each phase's own training corpus.

**Phase 1 (steps 0 → 20k):**

![Phase 1 training curves](images/eval_metrics_phase1.png)

**Phase 2 (steps 20k → 40k):**

![Phase 2 training curves](images/eval_metrics_phase2.png)

---

## Performance

The primary metric is **accept_length**, the average number of tokens accepted per speculation step with `topk=1, num_steps=3, num_draft_tokens=4`. Higher is better.
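
As a rough illustration of how this metric behaves (a minimal sketch, not the benchmark code): each speculation step commits the accepted draft tokens plus one token from the target model itself, and accept_length is the mean committed length per step.

```python
# Hedged sketch of the accept_length metric. Assumption: each speculation
# step commits `accepted + 1` tokens, where the +1 is the target model's
# own token emitted during verification.

def accept_length(accepted_per_step):
    """Mean tokens committed per speculation step (drafts accepted + 1)."""
    if not accepted_per_step:
        return 0.0
    return sum(a + 1 for a in accepted_per_step) / len(accepted_per_step)

# With num_steps=3, at most 3 draft tokens are accepted per step, so the
# metric lies in [1.0, 4.0], consistent with the values in the table below.
print(accept_length([3, 2, 3, 1]))  # (4 + 3 + 4 + 2) / 4 = 3.25
```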

Benchmarks were run using [SpecForge](https://github.com/sgl-project/SpecForge/blob/main/benchmarks)'s `bench_eagle3.py`. BFCL v3 benchmarks (†) use a custom extension to the original script.

![accept_length by dataset and method](images/benchmark_results.png)

| Category | Dataset | n | Phase 1 (20k steps) | Phase 2 (40k steps) |
|----------|---------|---|---------------------|---------------------|
| Dialogue | [MTBench](https://github.com/lm-sys/FastChat/tree/main/fastchat/llm_judge) | 80 | 2.624 | 2.687 |
| Chinese | [CEval](https://huggingface.co/datasets/ceval/ceval-exam) | 212 | 1.482 | 2.295 |
| Math | [GSM8K](https://github.com/openai/grade-school-math) | 500 | 3.123 | 3.201 |
| Code | [HumanEval](https://huggingface.co/datasets/openai/openai_humaneval) | 164 | 3.242 | 3.285 |
| Math | [MATH500](https://huggingface.co/datasets/HuggingFaceH4/MATH-500) | 500 | 3.323 | 3.342 |
| Math | [AIME](https://huggingface.co/datasets/Maxwell-Jia/AIME_2024) | 30 | 2.972 | 3.033 |
| VL | [MMStar](https://huggingface.co/datasets/Lin-Chen/MMStar) | 200 | 2.566 | 2.787 |
| Function Call † | [BFCL v3](https://huggingface.co/datasets/gorilla-llm/Berkeley-Function-Calling-Leaderboard) simple | 400 | 3.729 | 3.798 |
| Function Call † | BFCL v3 multiple | 200 | 3.745 | 3.809 |
| Function Call † | BFCL v3 parallel | 200 | 3.596 | 3.669 |
| Function Call † | BFCL v3 parallel_multiple | 200 | 3.525 | 3.601 |
| Function Call † | BFCL v3 live_simple | 1547 | 3.515 | 3.667 |
| Function Call † | BFCL v3 live_multiple | 1030 | 3.407 | 3.453 |
| Function Call † | BFCL v3 live_parallel | 97 | 3.303 | 3.410 |
| Function Call † | BFCL v3 live_parallel_multiple | 170 | 3.070 | 3.159 |

---

## Quick Start

### Requirements

- NVIDIA GPU with CUDA 12.0+
- [SGLang](https://github.com/sgl-project/sglang) ≥ 0.5.8

### Launch Server

```bash
python -m sglang.launch_server \
  --model-path /path/to/Kimi-K2.5 \
  --tp 8 \
  --trust-remote-code \
  --speculative-algorithm EAGLE3 \
  --speculative-draft-model-path lightseekorg/kimi-k2.5-eagle3 \
  --speculative-num-steps 3 \
  --speculative-eagle-topk 1 \
  --speculative-num-draft-tokens 4 \
  --mem-fraction-static 0.75 \
  --dtype bfloat16
```
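
Once the server is up it serves SGLang's OpenAI-compatible API, so speculative decoding is transparent to clients. A minimal request sketch (port, served model name, and prompt are placeholders for this deployment):

```python
# Build a plain OpenAI-style chat completion request; no client-side changes
# are needed for speculative decoding. Endpoint and model name are assumptions.
import json
import urllib.request

payload = {
    "model": "Kimi-K2.5",  # served model name (placeholder)
    "messages": [{"role": "user", "content": "Summarize EAGLE3 in one sentence."}],
    "max_tokens": 128,
    "temperature": 0.0,
}

req = urllib.request.Request(
    "http://localhost:30000/v1/chat/completions",
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)
# With the server from the previous section running:
# body = json.load(urllib.request.urlopen(req))
# print(body["choices"][0]["message"]["content"])
```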

### Run Benchmarks

```bash
python bench_eagle3.py \
  --model-path /path/to/Kimi-K2.5 \
  --port 30000 \
  --config-list 1,3,1,4 \
  --benchmark-list <benchmark_name> \
  --skip-launch-server
```

`--config-list` format: `topk,num_steps,topk,num_draft_tokens`.

config.json ADDED

{
  "architectures": [
    "LlamaForCausalLMEagle3"
  ],
  "attention_bias": false,
  "attention_dropout": 0.0,
  "bos_token_id": 163584,
  "draft_vocab_size": 163840,
  "dtype": "float16",
  "eos_token_id": 163585,
  "head_dim": 128,
  "hidden_act": "silu",
  "hidden_size": 7168,
  "initializer_range": 0.02,
  "intermediate_size": 12288,
  "max_position_embeddings": 262144,
  "max_window_layers": 36,
  "mlp_bias": false,
  "model_type": "llama",
  "num_attention_heads": 64,
  "num_hidden_layers": 1,
  "num_key_value_heads": 64,
  "pretraining_tp": 1,
  "rms_norm_eps": 1e-06,
  "rope_scaling": null,
  "rope_theta": 1000000,
  "sliding_window": null,
  "tie_word_embeddings": false,
  "transformers_version": "4.57.1",
  "use_cache": true,
  "use_sliding_window": false,
  "vocab_size": 163840
}

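
As a quick sanity check, the projection shapes implied by these fields can be derived directly (illustrative sketch only; Llama-style `[out_features, in_features]` weight layout assumed):

```python
# Derive the single draft layer's projection shapes from config.json values.
hidden_size = 7168
head_dim = 128
num_attention_heads = 64
num_key_value_heads = 64   # equal to query heads, so no GQA in this draft model
intermediate_size = 12288

q_out = num_attention_heads * head_dim   # 8192, wider than hidden_size
kv_out = num_key_value_heads * head_dim  # 8192

shapes = {
    "q_proj": (q_out, hidden_size),
    "k_proj": (kv_out, hidden_size),
    "v_proj": (kv_out, hidden_size),
    "o_proj": (hidden_size, q_out),      # projects back down to hidden_size
    "gate_proj": (intermediate_size, hidden_size),
}
print(shapes["q_proj"])  # (8192, 7168)
```

Note that `num_attention_heads * head_dim` (8192) exceeds `hidden_size` (7168); `o_proj` maps the attention output back to the hidden width.
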
images/benchmark_results.png ADDED

Git LFS Details

  • SHA256: d59803f08928f47dc7e65d83f04905d4e672becd1547c0faf8be617992e43749
  • Pointer size: 131 Bytes
  • Size of remote file: 173 kB
images/eval_metrics_phase1.png ADDED
images/eval_metrics_phase2.png ADDED
model-00001-of-00002.safetensors ADDED

version https://git-lfs.github.com/spec/v1
oid sha256:55479ddb915be4c9ecb7e4b1920d38f7c6ef46f8e9e3ef0badfdec3c8527294e
size 4007716200

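
The three lines above are a standard Git LFS pointer file: key/value pairs, one per line, separated by a single space. A small parsing sketch:

```python
def parse_lfs_pointer(text):
    """Split a Git LFS pointer file into its version, oid, and size fields."""
    fields = dict(line.split(" ", 1) for line in text.strip().splitlines())
    algo, digest = fields["oid"].split(":", 1)
    return {"version": fields["version"], "oid_algo": algo,
            "oid": digest, "size": int(fields["size"])}

pointer = """version https://git-lfs.github.com/spec/v1
oid sha256:55479ddb915be4c9ecb7e4b1920d38f7c6ef46f8e9e3ef0badfdec3c8527294e
size 4007716200
"""
info = parse_lfs_pointer(pointer)
print(info["size"])  # 4007716200 bytes, i.e. the ~4 GB first shard
```
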
model-00002-of-00002.safetensors ADDED

version https://git-lfs.github.com/spec/v1
oid sha256:ff1819b92df82db085d4c1f71acdeffe28dbd44bb130a6fc65fd9d4e656b817b
size 2348810368

model.safetensors.index.json ADDED

{
  "metadata": {
    "total_parameters": 3178262528,
    "total_size": 6356525056
  },
  "weight_map": {
    "embed_tokens.weight": "model-00001-of-00002.safetensors",
    "fc.weight": "model-00001-of-00002.safetensors",
    "lm_head.weight": "model-00002-of-00002.safetensors",
    "midlayer.hidden_norm.weight": "model-00001-of-00002.safetensors",
    "midlayer.input_layernorm.weight": "model-00001-of-00002.safetensors",
    "midlayer.mlp.down_proj.weight": "model-00001-of-00002.safetensors",
    "midlayer.mlp.gate_proj.weight": "model-00001-of-00002.safetensors",
    "midlayer.mlp.up_proj.weight": "model-00001-of-00002.safetensors",
    "midlayer.post_attention_layernorm.weight": "model-00001-of-00002.safetensors",
    "midlayer.self_attn.k_proj.weight": "model-00001-of-00002.safetensors",
    "midlayer.self_attn.o_proj.weight": "model-00001-of-00002.safetensors",
    "midlayer.self_attn.q_proj.weight": "model-00001-of-00002.safetensors",
    "midlayer.self_attn.v_proj.weight": "model-00001-of-00002.safetensors",
    "norm.weight": "model-00001-of-00002.safetensors"
  }
}
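
The `weight_map` above tells a loader which shard file holds each tensor. A small sketch of grouping tensors by shard so each file is opened once (a subset of the map is inlined for illustration):

```python
from collections import defaultdict

index = {
    "weight_map": {
        "embed_tokens.weight": "model-00001-of-00002.safetensors",
        "fc.weight": "model-00001-of-00002.safetensors",
        "lm_head.weight": "model-00002-of-00002.safetensors",
        "midlayer.self_attn.q_proj.weight": "model-00001-of-00002.safetensors",
    }
}

# Invert the map: shard file -> list of tensor names it contains.
by_shard = defaultdict(list)
for tensor_name, shard_file in index["weight_map"].items():
    by_shard[shard_file].append(tensor_name)

# Only lm_head.weight lives in the second, smaller shard.
print(by_shard["model-00002-of-00002.safetensors"])  # ['lm_head.weight']
```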