Stable run on 2x RTX 5090 and 2x Xeon E5-2696 v4 with DDR4 using ik_llama.cpp - 6.1 t/s on IQ4_K and 5.1 t/s on IQ5_K; opencode works with this setup
First of all, a huge thank you to Ubergarm for this high-quality IQ4_K quantization. It works beautifully!
I managed to find the "sweet spot" for running this massive GLM-4.7 MoE model on a dual consumer GPU setup. Here are my results and the specific configuration to maximize VRAM usage without OOM crashes.
Hardware Configuration:
GPUs: 2x NVIDIA RTX 5090 (32GB VRAM each)
CPU: 2x Xeon E5-2696 v4 (without AVX-512), CUDA 12.8, NVIDIA driver 580.65.06
RAM: 400 GB DDR4 (LXC container on Proxmox VE 9)
Software: ik_llama.cpp with https://github.com/ikawrakow/ik_llama.cpp/pull/1080
Performance:
Generation Speed: ~6.1 t/s
Prompt Processing: ~16.4 t/s
VRAM Usage: ~31GB per card (95% utilization, rock solid)
Since the full model doesn't fit in 64GB of VRAM, I used manual tensor overrides (-ot) to force exactly 17 layers of experts onto the GPUs (10 on GPU0, 7 on GPU1), while keeping the rest on CPU RAM. The KV cache is compressed to Q4_0 to save space.
```
# Expert placement: layers 0-9 -> CUDA0 (~21GB), layers 10-16 -> CUDA1
# (~15GB + overhead), all remaining experts (layers 17-92) -> CPU.
numactl --interleave=all ./build/bin/llama-server \
    --model GLM-4.7-IQ4_K-00001-of-00006.gguf \
    --ctx-size 131072 \
    --threads 64 --threads-batch 64 \
    --n-gpu-layers 99 \
    --tensor-split 0.5,0.5 \
    --cache-type-k q4_0 \
    --cache-type-v q4_0 \
    -ot 'blk.[0-9]..*exps.weight=CUDA0' \
    -ot 'blk.1[0-6]..*exps.weight=CUDA1' \
    -ot '.*exps.weight=CPU'
```
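One gotcha worth knowing with these `-ot` patterns: the dots are regex metacharacters, so an unescaped `blk.[0-9].` can also match two-digit layers like `blk.90` (the `.` after the digit class matches the `0`). A quick grep check against a sample tensor name (the name format `blk.N.ffn_up_exps.weight` is illustrative):

```shell
# Unescaped: '.' matches any character, so blk.90 sneaks into the single-digit pattern.
echo "blk.90.ffn_up_exps.weight" | grep -Eq 'blk.[0-9]..*exps.weight' \
  && echo "unescaped: matched (surprise!)"
# Escaping the dots pins the match to single-digit layers only.
echo "blk.90.ffn_up_exps.weight" | grep -Eq 'blk\.[0-9]\..*exps\.weight' \
  || echo "escaped: no match (as intended)"
```

Since `-ot` overrides are applied in order, an accidental early match silently steals layers from a later pattern, so escaping the dots makes the placement unambiguous.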
If you have more ideas for better results, I'm interested!
Glad you're getting some success! Tuning your exact parameters for your rig is part of the fun these days haha...
My initial thoughts are:
- Go ahead and build from tip of main now as that PR is merged up so you'll get all the latest goodies as they arrive.
- If you're going all the way down to q4_0 for the kv-cache, consider trying the Hadamard k-cache stuff from these PRs:
  - https://github.com/ikawrakow/ik_llama.cpp/pull/1033
  - https://github.com/ikawrakow/ik_llama.cpp/pull/1034
  - basically try maybe `--k-cache-hadamard -ctk q4_0 -ctv q5_0` or similar for fun
- If you can run bare metal instead of through Proxmox it might help.
- Your NUMA situation is gonna be one of the biggest considerations, e.g. BIOS config etc. Too much to list here, and there is a lot of chatter on it, but you can try adding `--numa numactl` to your command too.
- Those CPUs have only 22 physical cores each, psure, so likely try `--threads 44 --threads-batch 44` or play with the numbers. Though on Intel it can be CPU-limited and maybe the SMT does help; you'll have to benchmark across a bunch of values to see.
- Consider using `--n-cpu-moe 72 -ts 30,32` or similar, as the offload strategy can make a difference sometimes with slower PCIe speeds (or change your 10-16 experts to be the last ~8 layers for CUDA1). There is a PR discussion about that with ik and me, can't find it right now lol. (72 should be the total number of layers, like 91ish, minus however many you want on GPU.)
- Consider benchmarking with `llama-sweep-bench` and making graphs, or having some way to compare all your benchmark runs across the desired context length.
- You'll probably not use that much context at those speeds, so consider dropping context down to like 65k and increasing batch sizes, e.g. `-ub 4096 -b 4096` for sure.
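On the kv-cache point above, a back-of-envelope size estimate shows why the cache type matters so much at 131k context. The layer and head numbers below are placeholders, not GLM-4.7's actual dimensions; read the real values from the GGUF metadata before trusting the totals:

```shell
# KV cache bytes ~= 2 (K and V) * n_layers * n_ctx * n_kv_heads * head_dim * bytes_per_element
n_layers=92; n_ctx=131072; n_kv_heads=8; head_dim=128   # placeholder dims, NOT GLM-4.7's real ones
# f16 stores 2 bytes per element
f16=$(( 2 * n_layers * n_ctx * n_kv_heads * head_dim * 2 ))
# q4_0 packs 32 elements into 18 bytes (16 bytes of nibbles + a 2-byte fp16 scale)
q4=$(( 2 * n_layers * n_ctx * n_kv_heads * head_dim * 18 / 32 ))
echo "f16: $(( f16 / 1024 / 1024 )) MiB, q4_0: $(( q4 / 1024 / 1024 )) MiB"
```

With these placeholder dimensions the f16 cache alone would dwarf a 32GB card, while q4_0 cuts it to roughly 28% of that, which is the whole reason the quantized cache (plus the Hadamard transform to limit the quality loss) is attractive here.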
That's enough to keep you busy for the rest of the year! ;p Have fun!
Thanks again!
I tried --threads 44 --threads-batch 44, but after many tests the sweet spot is --threads 82 --threads-batch 82: I gain about 0.1 t/s compared to 44. I suspect this is because I am running in a VM with vCPUs rather than on bare metal.
I also tried --n-cpu-moe XX, but it resulted in a VRAM imbalance (one GPU OOM while the other had space), so I couldn't use it efficiently.
-ub 4096 -b 4096 works well, and --k-cache-hadamard is great.
Regarding the model: I need a minimum of 80k context for my coding projects. I had to switch to IQ5_K because the IQ4 quantization was causing syntax errors (missing ')' brackets, etc.). Since IQ5 is larger, I had to optimize the offloading manually.
Here is my best stable command so far (running on dual RTX 5090):
```
numactl --interleave=all ./build/bin/llama-server \
    --model ~/ik_llama.cpp/models/GLM-4.7-Ubergarm/IQ5_K/GLM-4.7-IQ5_K-00001-of-00007.gguf \
    --alias GLM-4.7-IQ5 \
    --host 0.0.0.0 --port 8080 \
    --ctx-size 84992 \
    --no-mmap \
    --threads 82 --threads-batch 82 \
    --batch-size 4096 --ubatch-size 4096 \
    --parallel 1 --flash-attn 1 \
    --jinja --verbose \
    --n-gpu-layers 99 \
    --tensor-split 0.5,0.5 \
    --split-mode layer \
    --run-time-repack \
    --cache-type-k q4_0 --cache-type-v q4_0 \
    --k-cache-hadamard \
    -ot 'blk.[0-8]..*exps.weight=CUDA0' \
    -ot 'blk.(8[6-9]|9[0-2])..*exps.weight=CUDA1' \
    -ot '.*exps.weight=CPU'
```
VRAM Usage:
GPU0: 32014MiB / 32607MiB
GPU1: 31390MiB / 32607MiB
Performance:
Token gen: 5.1 t/s | Prompt processing: 16.7 t/s
Now I need to test -ctk q4_0 -ctv q5_0 and -ts 30,32.
I made some changes:
--batch-size 1024
--ubatch-size 1024
to free up some VRAM headroom, and I installed opencode with this config file:
/user/.config/opencode/opencode.json
```json
{
  "$schema": "https://opencode.ai/config.json",
  "provider": {
    "ik_local": {
      "npm": "@ai-sdk/openai-compatible",
      "name": "GLM 4.7 (Local)",
      "options": {
        "baseURL": "http://127.0.0.1:8080/v1",
        "apiKey": "dummy-key"
      },
      "models": {
        "glm-4.7-iq5": {
          "id": "GLM-4.7-IQ5",
          "name": "GLM-4.7-IQ5",
          "tools": true
        }
      }
    }
  },
  "model": "ik_local/glm-4.7-iq5"
}
```
And... it works!!!
Glad you're having more success integrating the LLM into your workflow and client of choice!
> I gain about 0.1 t/s compared to 44
A gain of 0.1 t/s is not too big, try to look at your CPU usage with say btop or htop (and power usage if you have a method to measure it) as it may take a lot more cpu/power when using all the extra hyperthreads. As you mention if you're running virtualized that will likely have effects too.
> I also tried --n-cpu-moe XX, but it resulted in a VRAM imbalance (one GPU OOM while the other had space), so I couldn't use it efficiently.
Yes, you'll have to find the right number for XX and then prevent OOM with either something like -ts 20,32 (or whatever works) or manually placed layers with -ot ...=CUDA1 etc., which can be a bit tricky if you're new to it and take a lot of tries to dial in.
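For intuition on `-ts`: the values are relative weights, not gigabytes, so `-ts 20,32` from the example above splits the GPU-resident layers roughly in proportion to 20:32. A quick sketch of that arithmetic:

```shell
# -ts a,b assigns roughly a/(a+b) of the split layers to GPU0 and b/(a+b) to GPU1.
a=20; b=32; total=$(( a + b ))
echo "GPU0: $(( 100 * a / total ))%  GPU1: $(( 100 * b / total ))%"
```

So shrinking the first weight is the knob for pulling layers off the card that keeps hitting OOM, while `-ot` overrides remain the surgical option for individual tensors.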
Finally, if you are okay with a smaller quant, getting more of the model onto your great GPUs could speed things up a lot! I can run the smallest smol-IQ1_KT at almost 400 tok/sec PP and 30 tok/sec TG across 2x older RTX A6000 GPUs full offload on 96GB VRAM. It will def be slower if you can't fit the entire thing in VRAM, but just some examples.
P.S. Make sure to try with and without -rtr (run time repack) flag - last time I checked using -rtr negates boost from larger batch sizes but can get a little bit more TG instead. So depending on your workload you might want one or the other.
Happy new year!
Just here to say happy new year! ❤️
I made a small but important update, not on the server side but in the OpenCode configuration (opencode.json), to properly handle context management. Without this, OpenCode lacks awareness of context limits and cannot accurately calculate the remaining available space.
```json
{
  "$schema": "https://opencode.ai/config.json",
  "provider": {
    "ik_local": {
      "npm": "@ai-sdk/openai-compatible",
      "name": "GLM 4.7 IQ5 (Local Cluster)",
      "options": {
        "baseURL": "http://X.X.X.X:8080/v1",
        "apiKey": "dummy-key",
        "timeout": 900000
      },
      "models": {
        "glm-4.7-iq5": {
          "id": "GLM-4.7-IQ5",
          "name": "GLM-4.7-IQ5",
          "limit": {
            "context": 84992,
            "output": 32000
          }
        }
      }
    }
  },
  "agent": {
    "build": {
      "model": "ik_local/glm-4.7-iq5",
      "steps": 20,
      "permission": {
        "read": "allow",
        "edit": "allow",
        "bash": "ask",
        "websearch": "allow"
      }
    }
  }
}
```
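Before restarting OpenCode, it's worth validating the edited file, since a stray comma silently breaks provider discovery. A quick check (assuming `python3` is on the path; any JSON validator works):

```shell
# Validate the OpenCode config before restarting the client.
if python3 -m json.tool ~/.config/opencode/opencode.json > /dev/null 2>&1; then
  echo "opencode.json: valid JSON"
else
  echo "opencode.json: invalid or missing"
fi
```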
I had to tweak the parameters further. We realized that when launching OpenCode connected to the LLM, the initial system prompt consumes about 10,000 tokens just for the tool definitions. This massively eats into the context window.
For our complex development projects, the remaining context (previously around 74k effective) wasn't sufficient. However, simply trying to increase the window with the old settings caused immediate VRAM OOM errors or crashes during work.
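The context math above, as a quick sanity check (the ~10k figure is what we observed for OpenCode's tool definitions, so treat it as an estimate):

```shell
# Effective context once OpenCode's tool definitions are loaded.
ctx=84992     # old --ctx-size
tools=10000   # approx. tokens consumed by OpenCode's system prompt / tool definitions
echo "effective context: $(( ctx - tools )) tokens"
```

That lands right around the ~74k effective figure mentioned above, which is why the window had to grow to 100k.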
To stabilize the system at 100k context, we had to significantly lower the batch sizes and adjust the split.
Here is the stable configuration:
```
numactl --interleave=all /root/ik_llama.cpp/build/bin/llama-server \
    --model /user/ik_llama.cpp/models/GLM-4.7-Ubergarm/IQ5_K/GLM-4.7-IQ5_K-00001-of-00007.gguf \
    --alias GLM-4.7-IQ5 \
    --host 0.0.0.0 --port 8080 \
    --ctx-size 102400 \
    --no-mmap \
    --threads 82 --threads-batch 82 \
    --batch-size 512 --ubatch-size 512 \
    --parallel 1 --flash-attn 1 \
    --jinja --verbose \
    --n-gpu-layers 99 \
    --tensor-split 0.465,0.535 \
    --split-mode layer \
    --run-time-repack \
    --cache-type-k q4_0 --cache-type-v q4_0 \
    --k-cache-hadamard \
    -ot 'blk.[0-8]..*exps.weight=CUDA0' \
    -ot 'blk.(8[6-9]|9[0-2])..*exps.weight=CUDA1' \
    -ot '.*exps.weight=CPU'
```
Changes to support 100k context:
- Reduced --batch-size and --ubatch-size to 512.
- Adjusted --tensor-split 0.465,0.535 to distribute the load better across the 2 GPUs.
- Confirmed --parallel 1 is strictly necessary (otherwise the context gets divided).
It is slower, but it is finally stable with the large context required for dev work.
Token generation: ~4.3 t/s
Prompt processing: ~14.3 t/s
You're really pushing the limits of your rig!
A few more thoughts:
- If you're using `numactl --interleave=all` you might want to add `--numa distribute` and see if that helps anything. You might need to tweak your BIOS configs as well and check for the best setting: NPS0/NPS1 (on AMD EPYC) or `SNC=Disable` on Intel Xeon, or whatever your Xeon has.
- You might be able to add `-ger --merge-qkv` as well to possibly get just another percent or two of performance.
- You might be able to leave batch sizes at the defaults of `-ub 512 -b 2048` if you need that extra little bit of VRAM.
- Since you have 2x GPUs and this model is supported with ik's new `-sm graph` feature, you definitely want to be trying that instead of the old `-sm layer`. You likely won't need to custom-balance the extra layers and can just use `-ts` (I've never seen anyone use floats for the split values, a simple integer works fine, but if you got it working that's great!).
Have fun tuning!
After days of extensive testing on my old Dell T630 (a 2014 server, so check the BIOS parameters; 460 GB DDR4-2400, dual Xeon E5-2696 v4) with ik_llama.cpp, I looked for the "sweet spot" configuration for this model (IQ5_K) with 100k context.
Here are my findings for anyone running similar Dual Socket hardware:
- Threads strategy: --threads 44 (physical cores only) is optimal for generation speed, while --threads-batch 88 (full HyperThreading) significantly boosts prompt-processing speed.
- Micro-batching: maximize --ubatch-size within VRAM limits (in steps of 64), but be careful: specific values like 640 caused regressions with OpenCode despite working in OpenWebUI. 576 proved to be the stable maximum.
- NUMA: --numa distribute + numactl --interleave=all is mandatory for stability on dual-socket.
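One way to sanity-check the 44/88 split on your own box, assuming a Linux system with `lscpu` and `nproc` available:

```shell
# Count physical cores (unique core/socket pairs) vs. hardware threads.
phys=$(lscpu -p=CORE,SOCKET | grep -v '^#' | sort -u | wc -l)
logical=$(nproc --all)
echo "suggested: --threads $phys --threads-batch $logical"
```

On the dual E5-2696 v4 this should report 44 physical cores and 88 threads, matching the settings above; on other hardware it gives you the equivalent starting point to benchmark from.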
Performance:
OpenWebUI: ~5.3 t/s Gen | ~17 t/s Prompt (?)
OpenCode: ~4.0-5.1 t/s Gen | 40-60 t/s Prompt
Note: Still investigating errors with -sm graph, currently sticking to manual layer split.
Final Command:
```
numactl --interleave=all /root/ik_llama.cpp/build/bin/llama-server \
    --model /root/ik_llama.cpp/models/GLM-4.7-Ubergarm/IQ5_K/GLM-4.7-IQ5_K-00001-of-00007.gguf \
    --alias GLM-4.7-IQ5 \
    --host 0.0.0.0 --port 8080 \
    --ctx-size 102400 \
    --no-mmap \
    --threads 44 --threads-batch 88 \
    --batch-size 2340 --ubatch-size 576 \
    --parallel 1 --flash-attn 1 \
    --jinja --verbose \
    --n-gpu-layers 99 \
    --tensor-split 48,52 \
    --split-mode layer \
    --numa distribute \
    --run-time-repack \
    -ger --merge-qkv \
    --cache-type-k q4_0 --cache-type-v q4_0 \
    --k-cache-hadamard \
    -ot 'blk.[0-8]..*exps.weight=CUDA0' \
    -ot 'blk.(8[6-9]|9[0-2])..*exps.weight=CUDA1' \
    -ot '.*exps.weight=CPU'
```
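The step-of-64 `--ubatch-size` sweep described above can be scripted rather than done by hand. A dry-run sketch that just prints the `llama-sweep-bench` invocations to run (paths and the exact flag set are placeholders; adapt them to your build and model):

```shell
# Dry run: print one llama-sweep-bench invocation per candidate ubatch size.
# BIN and MODEL are placeholders; point them at your real binary and GGUF file.
BIN=./build/bin/llama-sweep-bench
MODEL=GLM-4.7-IQ5_K-00001-of-00007.gguf
for ub in 448 512 576 640; do
  echo "$BIN -m $MODEL -c 32768 -ub $ub -b 2340 -fa"
done
```

Drop the `echo` to actually run the sweep, and log each run's PP/TG numbers so the 640-style regressions show up in the data instead of mid-session.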
You keep moving the needle and getting better perf! You have great GPUs but the VRAM is slower speed so gonna hurt your token generation unless you use a smaller quant or model with less active weights. Glad the numa stuff helped out to at least get a little more out of the RAM despite NUMA nodes.
> Note: Still investigating errors with -sm graph, currently sticking to manual layer split.
What command are you trying, or error are you seeing? If you are still using -ot you will also need to use -smgs to allow -sm graph.
Also @Doctor-Shotgun wrote a good description if you want some more information: https://gist.github.com/DocShotgun/a02a4c0c0a57e43ff4f038b46ca66ae0
@Ubergarm Thanks for the guidance! I've rebuilt with the latest ik_llama.cpp and applied changes based on your recommendations and @Doctor-Shotgun 's guide.
Here's what I changed:
Configuration Updates
- Switched from -sm layer β -sm graph (working smoothly with the new binary).
- Simplified tensor overrides to a single pattern: -ot 'blk.(1[5-9]|[2-9][0-9])..*exps.weight=CPU'
- Increased --batch-size to 2880
- Added -gr for graph reuse
- Kept --tensor-split 48,52 to balance VRAM usage between the GPUs (both near their maximum, with less than 1 GB free).
Regarding -grt q8_0
Tested extensively with mixed results:
- OpenWebUI: works, but no measurable performance gain.
- OpenCode: significant performance degradation.
Current Performance (with -sm graph):
Still benchmarking, but prompt processing is visibly faster than layer mode (~44 t/s). Generation is very stable at ~4.8-4.95 t/s.
The devs like these parameters for the stable replies from GLM.
```
numactl --interleave=all /root/ik_llama.cpp/build/bin/llama-server \
    --model /root/ik_llama.cpp/models/GLM-4.7-Ubergarm/IQ5_K/GLM-4.7-IQ5_K-00001-of-00007.gguf \
    --alias GLM-4.7-IQ5 --host 0.0.0.0 --port 8080 --ctx-size 102400 --no-mmap \
    --threads 44 --threads-batch 88 --batch-size 2880 --ubatch-size 576 \
    --parallel 1 --flash-attn 1 --jinja --verbose --n-gpu-layers 999 \
    --tensor-split 48,52 --split-mode graph --numa distribute --run-time-repack \
    -gr -ger --merge-qkv --cache-type-k q4_0 --cache-type-v q4_0 --k-cache-hadamard \
    -ot 'blk.(1[5-9]|[2-9][0-9])..*exps.weight=CPU'
```
The same model, but on another machine (my personal PC):
- Fedora 42
- CPU: AMD EPYC 7532 (bare metal), RAM: 512 GB DDR4
- 7x RTX 3090 (24GB), with 2 cards linked via NVLink
- AND... 1x Tesla V100 SXM2 16GB (using a Chinese SXM2-to-PCIe adapter card)
Key Compilation Findings
If you mix Ampere (3090) and Volta (V100), you MUST disable MMQ.
```
cmake .. -DGGML_CUDA_FORCE_MMQ=OFF            # crucial: MMQ breaks V100 calculations in mixed setups (and opencode)
cmake .. -DCMAKE_CUDA_ARCHITECTURES="70;86"   # target both Volta and Ampere
```
Configuration :
Context: 102400 tokens
```
ik_llama.cpp/build/bin/llama-server \
    --model /home/admin_ia/.cache/lm-studio/models/ubergarm/GLM-4.7-GGUF/GLM-4.7-IQ5_K-00001-of-00007.gguf \
    --alias GLM-4.7-IQ5 --host 0.0.0.0 --port 8080 --ctx-size 102400 --no-mmap \
    --threads 32 --threads-batch 64 --batch-size 2048 --ubatch-size 2048 \
    --parallel 1 --flash-attn 1 --jinja --verbose --n-gpu-layers 999 \
    --split-mode layer --numa distribute --run-time-repack -gr -ger --merge-qkv \
    --cache-type-k q4_0 --cache-type-v q4_0 \
    --tensor-split 1.2,1,1,1,1,1,1,0.3 \
    -ot 'blk.(0|1|2|3|4|5|6|7|8)..*exps.weight=CUDA0' \
    -ot 'blk.(9|10|11|12|13|14|55)..*exps.weight=CUDA1' \
    -ot 'blk.(15|16|17|18|19|20|54)..*exps.weight=CUDA2' \
    -ot 'blk.(21|22|23|24|25|26|53)..*exps.weight=CUDA3' \
    -ot 'blk.(27|28|29|30|31|32|52)..*exps.weight=CUDA4' \
    -ot 'blk.(33|34|35|36|37|38|51)..*exps.weight=CUDA5' \
    -ot 'blk.(39|40|41|42|43|44|50)..*exps.weight=CUDA6' \
    -ot 'blk.(45|46|47|48|49)..*exps.weight=CUDA7' \
    -ot '.*exps.weight=CPU'
```
(Note: I placed later layers (50-55) onto earlier GPUs to fill VRAM gaps; this still needs to be optimised.)
VRAM Utilization: 96-97% on almost all cards (V100 is at 15.7/16GB!).
ggml_cuda_set_peer_access: Enabled only for the NVLink pair.
Speed: 10 tokens/s @ 100k context.
I avoided CUDA graphs (implicitly, via --split-mode layer) because of the V100/Volta architecture mix. Has anyone successfully run -sm graph on a mixed Ampere/Volta setup without breaking P2P?
Wow, you have come a long way in a short time! I didn't realize the old Tesla cards even worked very much. Honestly, you might be better off dropping the Tesla V100 and going with a slightly smaller quant to make use of -sm graph on the 7x 3090's.
Volta SXM2 cards are very cheap: you can find Chinese PCIe adapter cards, 32 GB cards for €650, and a SYS-4029GP-TVRT server with 8x 32GB V100s and fast NVLink for €6000 (256 GB of VRAM, plus 4 PCIe slots to add 4x RTX 4090 for a total of 352 GB). I also want to test the Chinese RTX 4090 with 48 GB of VRAM: https://www.youtube.com/watch?v=jA4Bhw1S_2o
So I need to test the mix of cards and the performance.
New quick reply:
I sold the V100 card and bought a used RTX 3090.
New build (new NVIDIA driver 580.126.09):
```
cd ~/ik_llama.cpp
git reset --hard
git clean -fd
git pull origin main
git submodule update --init --recursive
ccache -C
conda activate ik_build
conda install -c nvidia/label/cuda-13.1.0 cuda-toolkit
conda install -c nvidia/label/cuda-13.1.0 nccl
rm -rf build && mkdir build
cmake -B build -G Ninja \
    -DCMAKE_BUILD_TYPE=Release \
    -DGGML_NATIVE=OFF \
    -DGGML_AVX=ON \
    -DGGML_AVX2=ON \
    -DGGML_FMA=ON \
    -DGGML_F16C=ON \
    -DGGML_AVX512=OFF \
    -DGGML_AVX512_VBMI=OFF \
    -DGGML_AVX512_VNNI=OFF \
    -DGGML_CUDA=ON \
    -DCMAKE_CUDA_ARCHITECTURES="86" \
    -DGGML_CUDA_F16=ON \
    -DGGML_CUDA_FORCE_MMQ=OFF \
    -DGGML_CUDA_PEER_MAX_BATCH_SIZE=4096 \
    -DGGML_NCCL=ON \
    -DGGML_CCACHE=ON \
    -DLLAMA_BUILD_SERVER=ON \
    -DLLAMA_BUILD_EXAMPLES=ON \
    -DGGML_CUDA_FA_ALL_QUANTS=ON \
    -DGGML_IQK_FLASH_ATTENTION=ON
```
(not sure about all of these flags)
So with 8x RTX 3090:
```
~/ik_llama.cpp/build/bin/llama-server \
    --model /home/admin_ia/.cache/lm-studio/models/ubergarm/GLM-4.7-GGUF/GLM-4.7-IQ5_K-00001-of-00007.gguf \
    --alias GLM-4.7-IQ5 --host 0.0.0.0 --port 8080 --ctx-size 102400 --no-mmap \
    --threads 32 --threads-batch 64 --batch-size 2048 --ubatch-size 4096 \
    --verbose --parallel 1 --flash-attn 1 --n-gpu-layers 999 \
    --split-mode graph --tensor-split 0.9,1,1,1,1,1,1,1 --numa distribute \
    --run-time-repack -gr -ger --merge-qkv \
    --cache-type-k q4_0 --cache-type-v q4_0 --k-cache-hadamard --jinja \
    -ot 'blk.([6-9]|[6-9][0-9])..*exps.weight=CPU'
```
Result: 9 tokens/s (but more stable).
Now I'm waiting for your new Kimi K2.5 quants ;)
Heya! You have built quite a rig! Definitely check out my discussion of LACT to save yourself some power by undervolting those 8x GPUs (around minute 22) https://blog.aifoundry.org/p/adventures-in-model-quantization
Sorry I've been slow lately, life has been busy. I'll hope to take a look at K2.5 end of this week with luck.
AesSedai released the "full quality" Kimi-K2.5 Q4_X which if you want the best, that is it.
I'll still look into possibly smaller quants later, given you can run <192GB quants fully in VRAM, which will likely be much faster given your DRAM speeds.
