Files changed (1)
  1. README.md +107 -5
README.md CHANGED
@@ -1,5 +1,107 @@
- ---
- license: other
- license_name: modified-mit
- license_link: LICENSE
- ---
+ ---
+ license: other
+ license_name: modified-mit
+ license_link: LICENSE
+ base_model:
+ - moonshotai/Kimi-K2-Thinking
+ ---
+
+ # Model Overview
+
+ - **Model Architecture:** Kimi-K2-Thinking
+ - **Input:** Text
+ - **Output:** Text
+ - **Supported Hardware Microarchitecture:** AMD MI350/MI355
+ - **ROCm:** 7.0
+ - **Transformers:** 4.57.6
+ - **Operating System(s):** Linux
+ - **Inference Engine:** [vLLM](https://docs.vllm.ai/en/latest/)
+ - **Model Optimizer:** [AMD-Quark](https://quark.docs.amd.com/latest/index.html) (V0.11.2)
+ - **Quantized layers:** `experts`, `shared_experts`, `self_attn`
+ - **Weight quantization:** MoE: OCP MXFP4, static; `self_attn`: per-channel, FP8E4M3, static
+ - **Activation quantization:** MoE: OCP MXFP4, dynamic; `self_attn`: per-token, FP8E4M3, dynamic
+ - **Calibration Dataset:** [Pile](https://huggingface.co/datasets/mit-han-lab/pile-val-backup)
+
+ This model was built from the Kimi-K2-Thinking model by applying [AMD-Quark](https://quark.docs.amd.com/latest/index.html): the MoE layers are quantized to MXFP4 and the `self_attn` layers to FP8.
+
+ # Model Quantization
+
+ The model was quantized from [unsloth/Kimi-K2-Thinking-BF16](https://huggingface.co/unsloth/Kimi-K2-Thinking-BF16) using [AMD-Quark](https://quark.docs.amd.com/latest/index.html). Both weights and activations were quantized.
+
+ **Quantization scripts:**
+ ```
+ cd Quark/examples/torch/language_modeling/llm_ptq/
+ exclude_layers="*mlp.gate *lm_head *mlp.gate_proj *mlp.up_proj *mlp.down_proj"
+
+ python quantize_quark.py \
+     --model_dir unsloth/Kimi-K2-Thinking-BF16 \
+     --quant_scheme mxfp4 \
+     --layer_quant_scheme '*self_attn*' ptpc_fp8 \
+     --exclude_layers $exclude_layers \
+     --output_dir amd/Kimi-K2-Thinking-MXFP4-AttnFP8 \
+     --file2file_quantization
+ ```
+
+ # Deployment
+ ### Use with vLLM
+
+ This model can be deployed efficiently using the [vLLM](https://docs.vllm.ai/en/latest/) backend.
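
As a minimal sketch of what a request could look like once the server is running: this assumes the server was started with `vllm serve` as in the Reproduction section and is reachable at `http://localhost:8000`. The prompt, the `max_tokens` value, and the names `payload` and `query_completion` are illustrative and not part of the original README.

```shell
# Assumed endpoint: the same OpenAI-compatible completions route that the
# lm_eval command in this README targets.
base_url="http://localhost:8000/v1/completions"
model="amd/Kimi-K2-Thinking-MXFP4-AttnFP8"

# Illustrative request body; the prompt and max_tokens are placeholders.
payload=$(printf '{"model": "%s", "prompt": "%s", "max_tokens": %d}' \
    "$model" "What is 12 * 7?" 64)

# Hypothetical helper: POST the payload to the running server.
query_completion() {
    curl -s "$base_url" \
        -H "Content-Type: application/json" \
        -d "$payload"
}

# Print the request body so it can be inspected before sending.
echo "$payload"
```

Calling `query_completion` after the model has finished loading should return a JSON object with the generated text under `choices`.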
+
+ ## Evaluation
+ The model was evaluated on the GSM8K benchmark.
+
+ ### Accuracy
+
+ <table>
+   <tr>
+    <td><strong>Benchmark</strong>
+    </td>
+    <td><strong>Kimi-K2-Thinking</strong>
+    </td>
+    <td><strong>Kimi-K2-Thinking-MXFP4-AttnFP8 (this model)</strong>
+    </td>
+    <td><strong>Recovery</strong>
+    </td>
+   </tr>
+   <tr>
+    <td>GSM8K (flexible-extract)
+    </td>
+    <td>94.16
+    </td>
+    <td>92.95
+    </td>
+    <td>98.71%
+    </td>
+   </tr>
+ </table>
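
The Recovery column appears to be the quantized score expressed as a percentage of the baseline score. That reading is an assumption, but it reproduces the reported value. A minimal check, with illustrative variable names:

```shell
baseline=94.16    # Kimi-K2-Thinking, GSM8K (flexible-extract)
quantized=92.95   # Kimi-K2-Thinking-MXFP4-AttnFP8, same metric

# Assumed definition: recovery = quantized / baseline * 100,
# rounded to two decimal places.
recovery=$(awk -v q="$quantized" -v b="$baseline" \
    'BEGIN { printf "%.2f", q / b * 100 }')

echo "Recovery: ${recovery}%"
```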
+
+ ### Reproduction
+
+ The GSM8K results were obtained with the lm-evaluation-harness framework inside the Docker image `rocm/vllm-private:vllm_dev_base_mxfp4_20260122`, with vLLM, lm-eval, and amd-quark built and installed from source in that image.
+
+ #### Launching the server
+ ```
+ export VLLM_ATTENTION_BACKEND="TRITON_MLA"
+ export VLLM_ROCM_USE_AITER=1
+ export VLLM_ROCM_USE_AITER_FUSION_SHARED_EXPERTS=0
+
+ vllm serve amd/Kimi-K2-Thinking-MXFP4-AttnFP8 \
+     --tensor-parallel-size 8 \
+     --enable-auto-tool-choice \
+     --tool-call-parser kimi_k2 \
+     --reasoning-parser kimi_k2 \
+     --trust-remote-code
+ ```
+
+ #### Evaluating the model in a new terminal
+ ```
+ lm_eval \
+     --model local-completions \
+     --model_args "model=amd/Kimi-K2-Thinking-MXFP4-AttnFP8,base_url=http://0.0.0.0:8000/v1/completions,tokenized_requests=False,tokenizer_backend=None,num_concurrent=32" \
+     --tasks gsm8k \
+     --num_fewshot 5 \
+     --batch_size 1
+ ```
+
+ # License
+ Modifications Copyright (c) 2025 Advanced Micro Devices, Inc. All rights reserved.