# IQuest-Loop-Instruct GGUF Conversion Summary

**Date**: 2026-01-07
**Model**: IQuestLab/IQuest-Coder-V1-40B-Loop-Instruct
**Achievement**: World's first IQuest-Loop-Instruct GGUF conversion

## Files Created

| File | Size | Format | SHA256 | Completion Time |
|------|------|--------|--------|-----------------|
| IQuest-Coder-V1-40B-Loop-Instruct-f16.gguf | 75GB | F16 | `b70d3bb48753e786c8afca7556b818341fc9258e29083be4b0375c5a8b788289` | 2m 6s |
| IQuest-Coder-V1-40B-Loop-Instruct-q4_k_m.gguf | 23GB | Q4_K_M | `b665999c8d6660ba0ea29cbbb072056052ef965a233ef65661ec16a16b39a9e3` | 2m 23s |
| IQuest-Coder-V1-40B-Loop-Instruct-q5_k_m.gguf | 27GB | Q5_K_M | `a15814998038c8c6334f69bc11b776bce785350c933ce95fe9c41c4c7ec708ba` | 1m 41s |
| IQuest-Coder-V1-40B-Loop-Instruct-q8_0.gguf | 40GB | Q8_0 | `a9323b7ca583a842737dd4ec1f7422101c68ededf2a86c75a8d5e9da70eaae06` | 53s |

## Technical Implementation

### Architecture Support

Created `IQuestLoopCoderModel` class in llama.cpp's `convert_hf_to_gguf.py`:
- Inherits from `LlamaModel` (compatible architecture base)
- Maps 160 loop-specific `gate_projections` tensors to GGUF format
- Preserves loop parameters in metadata:
  - `llama.loop.num`: 2
  - `llama.loop.window_size`: 64

### Tensor Mapping

**Gate Projections** (160 tensors total):
- Source: `model.gate_projections.{0-79}.{weight|bias}`
- Shape: `[128, 40]` weight + `[40]` bias per layer
- Target: `blk.{layer}.loop_gate.{weight|bias}`
- Quantization: Uses fallback q5_0/q5_1 for Q4_K_M/Q5_K_M (tensors too small for standard quantization)

**Standard Tensors** (721 tensors):
- Uses LlamaModel's standard tensor mapping
- Attention: Q, K, V, Output projections
- FFN: Gate, Up, Down projections
- Normalization: Attention & FFN RMS norms

## Conversion Statistics

- **Total Tensors**: 883
  - Standard Llama: 721
  - Loop Gates: 160 (80 layers × 2 per layer)
  - Embeddings: 2
- **Vocabulary Size**: 76,800 tokens
- **Context Length**: 131,072 tokens
- **Hidden Layers**: 80
- **Attention Heads**: 40 (8 KV heads)
- **Hidden Size**: 5,120
- **FFN Size**: 27,648

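The tensor accounting above is internally consistent, as a quick check confirms (all numbers taken from this summary):

```python
# Tensor counts reported for the conversion (from the statistics above).
n_layers = 80
loop_gates = n_layers * 2    # one weight + one bias tensor per layer
standard_llama = 721         # attention, FFN, and norm tensors
embeddings = 2               # token embedding + output head

total = standard_llama + loop_gates + embeddings
print(loop_gates, total)     # 160 gate tensors, 883 total
```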
## Current Status

### What Works ✅

1. **Conversion**: Successfully converts HuggingFace → GGUF F16
2. **Quantization**: All standard quantization levels work (Q4_K_M, Q5_K_M, Q8_0, etc.)
3. **Metadata**: Loop parameters correctly stored in GGUF metadata
4. **Tensor Preservation**: All 883 tensors including loop gates successfully converted
5. **Ollama Import**: Ollama accepts and imports the GGUF file

### What Needs Work 🔧

1. **Runtime Support**: llama.cpp runtime needs a loop attention mechanism implementation
2. **Inference**: Model loads but cannot run inference yet (loop gates not used)
3. **Testing**: Need to validate that loop attention behavior matches the original PyTorch

## Implementation Details

### Modified Files

**`/tmp/convert_hf_to_gguf.py`** (lines 2695-2733):
```python
@ModelBase.register("IQuestLoopCoderForCausalLM")
class IQuestLoopCoderModel(LlamaModel):
    """IQuest Loop Coder model with recurrent loop attention mechanism."""
    model_arch = gguf.MODEL_ARCH.LLAMA

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.loop_num = self.hparams.get('loop_num', 2)
        self.loop_window_size = self.hparams.get('loop_window_size', 64)

    def set_gguf_parameters(self):
        super().set_gguf_parameters()
        self.gguf_writer.add_uint32("llama.loop.num", self.loop_num)
        self.gguf_writer.add_uint32("llama.loop.window_size", self.loop_window_size)

    def modify_tensors(self, data_torch: Tensor, name: str, bid: int | None):
        # Route the loop gate tensors to their own GGUF names; everything
        # else falls through to the standard Llama mapping.
        if "gate_projections" in name:
            parts = name.split('.')
            if len(parts) >= 4 and parts[1] == "gate_projections":
                layer_num = parts[2]
                param_type = parts[3]
                new_name = f"blk.{layer_num}.loop_gate.{param_type}"
                return [(new_name, data_torch)]
        return super().modify_tensors(data_torch, name, bid)
```

## Next Steps for Community

### For llama.cpp Maintainers

1. **Implement Loop Attention Runtime**:
   - Read `llama.loop.num` and `llama.loop.window_size` from GGUF metadata
   - Load `blk.{layer}.loop_gate.{weight|bias}` tensors
   - Implement the recurrent loop attention mechanism in CUDA/CPU kernels
   - Reference: Original implementation at IQuestLab/IQuest-Coder-V1-40B-Loop-Instruct

2. **Add Unit Tests**:
   - Verify tensor loading
   - Validate loop parameter reading
   - Test against the PyTorch reference implementation

3. **Documentation**:
   - Add the Loop architecture to the supported models list
   - Document loop parameter usage
   - Provide conversion examples

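As a minimal sketch of step 1, a runtime loader would fetch the two loop keys with architecture defaults as fallback. The key names are the ones written by the converter above; the plain dict stands in for parsed GGUF metadata, and the default values are the converter's own fallbacks:

```python
def read_loop_params(metadata: dict) -> tuple[int, int]:
    """Fetch loop hyperparameters, falling back to the converter's defaults."""
    loop_num = int(metadata.get("llama.loop.num", 2))
    window = int(metadata.get("llama.loop.window_size", 64))
    return loop_num, window

# Stand-in for key/value metadata parsed out of a converted GGUF file.
meta = {"llama.loop.num": 2, "llama.loop.window_size": 64}
print(read_loop_params(meta))  # (2, 64)
```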
122
+ ### For Model Users
123
+
124
+ 1. **Wait for Runtime Support**: These GGUFs will work once llama.cpp implements loop attention
125
+ 2. **Use Regular Variant**: For immediate use, IQuest-Coder (non-Loop) is fully supported
126
+ 3. **Contribute**: Help implement loop attention in llama.cpp runtime
127
+
128
+ ## Performance Expectations (Once Runtime Supports Loop)
129
+
130
+ Based on quantization levels:
131
+
132
+ - **Q4_K_M (23GB)**: Recommended for most deployments, 30% of original size
133
+ - **Q5_K_M (27GB)**: Better quality, 35% of original size
134
+ - **Q8_0 (40GB)**: Excellent quality, 53% of original size, minimal loss
135
+ - **F16 (75GB)**: Full precision reference
136
+
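The percentages above follow from the file sizes in the Files Created table; since those sizes are rounded to whole GB, the ratios are approximate:

```python
f16_gb = 75  # full-precision reference size from the files table
sizes_gb = {"Q4_K_M": 23, "Q5_K_M": 27, "Q8_0": 40}

for fmt, gb in sizes_gb.items():
    # Compression ratio relative to the F16 reference file.
    print(f"{fmt}: {gb / f16_gb:.1%} of F16 size")
# Q4_K_M: 30.7%, Q5_K_M: 36.0%, Q8_0: 53.3%
```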
## Docker Build System

**Image**: `avarok/dgx-spark-complete:latest`
**Base**: `dgx-vllm:cutlass-nvfp4-v15`
**Includes**:
- vLLM v15 with IQuest Loop Coder support
- llama.cpp with CUDA support
- Conversion scripts (convert_to_gguf.sh, quantize.sh)
- Optimized for NVIDIA GB10 (SM 12.1)

## References

- **Original Model**: https://huggingface.co/IQuestLab/IQuest-Coder-V1-40B-Loop-Instruct
- **llama.cpp Issue**: #18517 - Request for Loop-Instruct support
- **PR Inspiration**: #18524 - Regular IQuestCoder support
- **Debugging Journey**: /workspace/builds/DEBUGGING_JOURNEY.md

## Credits

- **Hardware**: Dual NVIDIA DGX Spark with GB10 GPUs
- **Model**: IQuestLab team for the Loop architecture innovation
- **Tools**: llama.cpp (ggerganov), vLLM team
- **First GGUF**: This conversion is the first Loop-Instruct variant in GGUF format

## Verification

SHA256 checksums are provided for all files. Verify before use:
```bash
sha256sum IQuest-Coder-V1-40B-Loop-Instruct-*.gguf
```

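To compare the output of `sha256sum` against the table programmatically, a small stdlib-only helper works as well. It streams the file in chunks so a 75GB GGUF never has to fit in memory (the file path in the commented usage is an assumption about where the GGUFs sit):

```python
import hashlib

def sha256_file(path: str, chunk_size: int = 1 << 20) -> str:
    """Stream a file through SHA-256 in 1MB chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

# Example against the Q4_K_M checksum from the Files Created table:
# expected = "b665999c8d6660ba0ea29cbbb072056052ef965a233ef65661ec16a16b39a9e3"
# assert sha256_file("IQuest-Coder-V1-40B-Loop-Instruct-q4_k_m.gguf") == expected
```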
---

**Status**: Conversion successful, runtime support pending
**Date**: 2026-01-07
**Next**: Submit PR to llama.cpp with implementation + publish to HuggingFace