Stable run on 2x RTX 5090 and 2x Xeon E5-2696 v4 with DDR4 using ik_llama.cpp - 6.1 t/s on IQ4_K and 5.1 t/s on IQ5_K; opencode works with this setup
First of all, a huge thank you to Ubergarm for this high-quality IQ4_K quantization. It works beautifully!
I managed to find the "sweet spot" for running this massive GLM-4.7 MoE model on a dual consumer GPU setup. Here are my results and the specific configuration to maximize VRAM usage without OOM crashes.
Hardware Configuration:
GPUs: 2x NVIDIA RTX 5090 (32GB VRAM each)
CPU: 2x Xeon E5-2696 v4 (without AVX-512), CUDA 12.8, NVIDIA driver 580.65.06
RAM: 400 GB DDR4 (LXC container on Proxmox VE 9)
Software: ik_llama.cpp with https://github.com/ikawrakow/ik_llama.cpp/pull/1080
Performance:
Generation Speed: ~6.1 t/s
Prompt Processing: ~16.4 t/s
VRAM Usage: ~31GB per card (95% utilization, rock solid)
Since the full model doesn't fit in 64GB of VRAM, I used manual tensor overrides (-ot) to force exactly 17 layers of experts onto the GPUs (10 on GPU0, 7 on GPU1), while keeping the rest on CPU RAM. The KV cache is compressed to Q4_0 to save space.
```
# Expert placement: layers 0-9 -> CUDA0 (~21GB), layers 10-16 -> CUDA1
# (~15GB + overhead), all remaining experts (layers 17-92) -> CPU.
numactl --interleave=all ./build/bin/llama-server \
    --model GLM-4.7-IQ4_K-00001-of-00006.gguf \
    --ctx-size 131072 \
    --threads 64 --threads-batch 64 \
    --n-gpu-layers 99 \
    --tensor-split 0.5,0.5 \
    --cache-type-k q4_0 \
    --cache-type-v q4_0 \
    -ot 'blk.[0-9]..*exps.weight=CUDA0' \
    -ot 'blk.1[0-6]..*exps.weight=CUDA1' \
    -ot '.*exps.weight=CPU'
```
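One gotcha worth knowing with these `-ot` patterns: the dots are regex metacharacters, so an unescaped `blk.[0-9].` can also match two-digit layers like `blk.90` (the `.` after the digit class matches the `0`). A quick grep check against a sample tensor name (the name format `blk.N.ffn_up_exps.weight` is illustrative):

```shell
# Unescaped: '.' matches any character, so blk.90 sneaks into the single-digit pattern.
echo "blk.90.ffn_up_exps.weight" | grep -Eq 'blk.[0-9]..*exps.weight' \
  && echo "unescaped: matched (surprise!)"
# Escaping the dots pins the match to single-digit layers only.
echo "blk.90.ffn_up_exps.weight" | grep -Eq 'blk\.[0-9]\..*exps\.weight' \
  || echo "escaped: no match (as intended)"
```

Since `-ot` overrides are applied in order, an accidental early match silently steals layers from a later pattern, so escaping the dots makes the placement unambiguous.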
If you have more ideas for better results, I'm interested!
Glad you're getting some success! Tuning your exact parameters for your rig is part of the fun these days haha...
My initial thoughts are:
- Go ahead and build from tip of main now as that PR is merged up so you'll get all the latest goodies as they arrive.
- If you're going all the way down to q4_0 for the kv-cache, consider trying the Hadamard k-cache stuff from these PRs:
  - https://github.com/ikawrakow/ik_llama.cpp/pull/1033
  - https://github.com/ikawrakow/ik_llama.cpp/pull/1034
  - basically try maybe `--k-cache-hadamard -ctk q4_0 -ctv q5_0` or similar for fun
- If you can run bare metal instead of through Proxmox it might help.
- Your NUMA situation is gonna be one of the biggest considerations, e.g. BIOS config etc. Too much to list here, and there is a lot of chatter on it, but you can try adding `--numa numactl` to your command too.
- Those CPUs have only 22 physical cores each, psure, so likely try `--threads 44 --threads-batch 44` or play with the numbers. Though on Intel it can be CPU-limited and maybe the SMT does help; you'll have to benchmark across a bunch of values to see.
- Consider using `--n-cpu-moe 72 -ts 30,32` or similar, as the offload strategy can make a difference sometimes with slower PCIe speeds (or change your 10-16 experts to be the last ~8 layers for CUDA1). There is a PR discussion about that with ik and me, can't find it right now lol. (72 should be the total number of layers, like 91ish, minus however many you want on GPU.)
- Consider benchmarking with `llama-sweep-bench` and making graphs, or having some way to compare all your benchmark runs across the desired context length.
- You'll probably not use that much context at those speeds, so consider dropping context down to like 65k and increasing batch sizes, e.g. `-ub 4096 -b 4096` for sure.
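On the kv-cache point above, a back-of-envelope size estimate shows why the cache type matters so much at 131k context. The layer and head numbers below are placeholders, not GLM-4.7's actual dimensions; read the real values from the GGUF metadata before trusting the totals:

```shell
# KV cache bytes ~= 2 (K and V) * n_layers * n_ctx * n_kv_heads * head_dim * bytes_per_element
n_layers=92; n_ctx=131072; n_kv_heads=8; head_dim=128   # placeholder dims, NOT GLM-4.7's real ones
# f16 stores 2 bytes per element
f16=$(( 2 * n_layers * n_ctx * n_kv_heads * head_dim * 2 ))
# q4_0 packs 32 elements into 18 bytes (16 bytes of nibbles + a 2-byte fp16 scale)
q4=$(( 2 * n_layers * n_ctx * n_kv_heads * head_dim * 18 / 32 ))
echo "f16: $(( f16 / 1024 / 1024 )) MiB, q4_0: $(( q4 / 1024 / 1024 )) MiB"
```

With these placeholder dimensions the f16 cache alone would dwarf a 32GB card, while q4_0 cuts it to roughly 28% of that, which is the whole reason the quantized cache (plus the Hadamard transform to limit the quality loss) is attractive here.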
That's enough to keep you busy for the rest of the year! ;p Have fun!
Thanks again!
I tried --threads 44 --threads-batch 44, but after many tests the sweet spot is --threads 82 --threads-batch 82: I gain about 0.1 t/s compared to 44. I suspect this is because I am running in a VM with vCPUs rather than on bare metal.
I also tried --n-cpu-moe XX, but it resulted in a VRAM imbalance (one GPU OOM while the other had space), so I couldn't use it efficiently.
-ub 4096 -b 4096 works well, and --k-cache-hadamard is great.
Regarding the model: I need a minimum of 80k context for my coding projects. I had to switch to IQ5_K because the IQ4 quantization was causing syntax errors (missing ')' brackets, etc.). Since IQ5 is larger, I had to optimize the offloading manually.
Here is my best stable command so far (running on dual RTX 5090):
```
numactl --interleave=all ./build/bin/llama-server \
    --model ~/ik_llama.cpp/models/GLM-4.7-Ubergarm/IQ5_K/GLM-4.7-IQ5_K-00001-of-00007.gguf \
    --alias GLM-4.7-IQ5 \
    --host 0.0.0.0 --port 8080 \
    --ctx-size 84992 \
    --no-mmap \
    --threads 82 --threads-batch 82 \
    --batch-size 4096 --ubatch-size 4096 \
    --parallel 1 --flash-attn 1 \
    --jinja --verbose \
    --n-gpu-layers 99 \
    --tensor-split 0.5,0.5 \
    --split-mode layer \
    --run-time-repack \
    --cache-type-k q4_0 --cache-type-v q4_0 \
    --k-cache-hadamard \
    -ot 'blk.[0-8]..*exps.weight=CUDA0' \
    -ot 'blk.(8[6-9]|9[0-2])..*exps.weight=CUDA1' \
    -ot '.*exps.weight=CPU'
```
VRAM Usage:
GPU0: 32014MiB / 32607MiB
GPU1: 31390MiB / 32607MiB
Performance:
Token gen: 5.1 t/s | Prompt processing: 16.7 t/s
Now I need to test -ctk q4_0 -ctv q5_0 and -ts 30,32.
I made some changes:
--batch-size 1024
--ubatch-size 1024
to free up some VRAM headroom, and I installed opencode with this config file:
/user/.config/opencode/opencode.json
```json
{
  "$schema": "https://opencode.ai/config.json",
  "provider": {
    "ik_local": {
      "npm": "@ai-sdk/openai-compatible",
      "name": "GLM 4.7 (Local)",
      "options": {
        "baseURL": "http://127.0.0.1:8080/v1",
        "apiKey": "dummy-key"
      },
      "models": {
        "glm-4.7-iq5": {
          "id": "GLM-4.7-IQ5",
          "name": "GLM-4.7-IQ5",
          "tools": true
        }
      }
    }
  },
  "model": "ik_local/glm-4.7-iq5"
}
```
And... it works!!!
Glad you're having more success integrating the LLM into your workflow and client of choice!
> I gain about 0.1 t/s compared to 44
A gain of 0.1 t/s is not too big, try to look at your CPU usage with say btop or htop (and power usage if you have a method to measure it) as it may take a lot more cpu/power when using all the extra hyperthreads. As you mention if you're running virtualized that will likely have effects too.
> I also tried --n-cpu-moe XX, but it resulted in a VRAM imbalance (one GPU OOM while the other had space), so I couldn't use it efficiently.
Yes, you'll have to find the right number for XX and then prevent OOM with either something like -ts 20,32 (or whatever works) or manually placed layers with -ot ...=CUDA1 etc., which can be a bit tricky if you're new to it and take a lot of tries to dial in.
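For intuition on `-ts`: the values are relative weights, not gigabytes, so `-ts 20,32` from the example above splits the GPU-resident layers roughly in proportion to 20:32. A quick sketch of that arithmetic:

```shell
# -ts a,b assigns roughly a/(a+b) of the split layers to GPU0 and b/(a+b) to GPU1.
a=20; b=32; total=$(( a + b ))
echo "GPU0: $(( 100 * a / total ))%  GPU1: $(( 100 * b / total ))%"
```

So shrinking the first weight is the knob for pulling layers off the card that keeps hitting OOM, while `-ot` overrides remain the surgical option for individual tensors.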
Finally, if you are okay with a smaller quant, getting more of the model onto your great GPUs could speed things up a lot! I can run the smallest smol-IQ1_KT at almost 400 tok/sec PP and 30 tok/sec TG across 2x older RTX A6000 GPUs full offload on 96GB VRAM. It will def be slower if you can't fit the entire thing in VRAM, but just some examples.
P.S. Make sure to try with and without -rtr (run time repack) flag - last time I checked using -rtr negates boost from larger batch sizes but can get a little bit more TG instead. So depending on your workload you might want one or the other.
Happy new year!
Just here to say happy new year! ❤️
I made a small but important update, not on the server side but in the OpenCode configuration (opencode.json), to properly handle context management. Without this, OpenCode lacks awareness of context limits and cannot accurately calculate the remaining available space.
```json
{
  "$schema": "https://opencode.ai/config.json",
  "provider": {
    "ik_local": {
      "npm": "@ai-sdk/openai-compatible",
      "name": "GLM 4.7 IQ5 (Local Cluster)",
      "options": {
        "baseURL": "http://X.X.X.X:8080/v1",
        "apiKey": "dummy-key",
        "timeout": 900000
      },
      "models": {
        "glm-4.7-iq5": {
          "id": "GLM-4.7-IQ5",
          "name": "GLM-4.7-IQ5",
          "limit": {
            "context": 84992,
            "output": 32000
          }
        }
      }
    }
  },
  "agent": {
    "build": {
      "model": "ik_local/glm-4.7-iq5",
      "steps": 20,
      "permission": {
        "read": "allow",
        "edit": "allow",
        "bash": "ask",
        "websearch": "allow"
      }
    }
  }
}
```
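Before restarting OpenCode, it's worth validating the edited file, since a stray comma silently breaks provider discovery. A quick check (assuming `python3` is on the path; any JSON validator works):

```shell
# Validate the OpenCode config before restarting the client.
if python3 -m json.tool ~/.config/opencode/opencode.json > /dev/null 2>&1; then
  echo "opencode.json: valid JSON"
else
  echo "opencode.json: invalid or missing"
fi
```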
I had to tweak the parameters further. We realized that when launching OpenCode connected to the LLM, the initial system prompt consumes about 10,000 tokens just for the tool definitions. This massively eats into the context window.
For our complex development projects, the remaining context (previously around 74k effective) wasn't sufficient. However, simply trying to increase the window with the old settings caused immediate VRAM OOM errors or crashes during work.
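The context math above, as a quick sanity check (the ~10k figure is what we observed for OpenCode's tool definitions, so treat it as an estimate):

```shell
# Effective context once OpenCode's tool definitions are loaded.
ctx=84992     # old --ctx-size
tools=10000   # approx. tokens consumed by OpenCode's system prompt / tool definitions
echo "effective context: $(( ctx - tools )) tokens"
```

That lands right around the ~74k effective figure mentioned above, which is why the window had to grow to 100k.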
To stabilize the system at 100k context, we had to significantly lower the batch sizes and adjust the split.
Here is the stable configuration:
```
numactl --interleave=all /root/ik_llama.cpp/build/bin/llama-server \
    --model /user/ik_llama.cpp/models/GLM-4.7-Ubergarm/IQ5_K/GLM-4.7-IQ5_K-00001-of-00007.gguf \
    --alias GLM-4.7-IQ5 \
    --host 0.0.0.0 --port 8080 \
    --ctx-size 102400 \
    --no-mmap \
    --threads 82 --threads-batch 82 \
    --batch-size 512 --ubatch-size 512 \
    --parallel 1 --flash-attn 1 \
    --jinja --verbose \
    --n-gpu-layers 99 \
    --tensor-split 0.465,0.535 \
    --split-mode layer \
    --run-time-repack \
    --cache-type-k q4_0 --cache-type-v q4_0 \
    --k-cache-hadamard \
    -ot 'blk.[0-8]..*exps.weight=CUDA0' \
    -ot 'blk.(8[6-9]|9[0-2])..*exps.weight=CUDA1' \
    -ot '.*exps.weight=CPU'
```
Changes to support 100k context:
- Reduced --batch-size and --ubatch-size to 512.
- Adjusted --tensor-split 0.465,0.535 to distribute the load better across the 2 GPUs.
- Confirmed --parallel 1 is strictly necessary (otherwise the context gets divided).
It is slower, but it is finally stable with the large context required for dev work.
Token generation: ~4.3 t/s
Prompt processing: ~14.3 t/s
You're really pushing the limits of your rig!
A few more thoughts:
- If you're using `numactl --interleave=all` you might want to add `--numa distribute` and see if that helps anything. You might need to tweak your BIOS configs as well and check for the best setting: NPS0/NPS1 (on AMD EPYC) or `SNC=Disable` on Intel Xeon, or whatever your Xeon has.
- You might be able to add `-ger --merge-qkv` as well to possibly get just another percent or two of performance.
- You might be able to leave batch sizes at the defaults of `-ub 512 -b 2048` if you need that extra little bit of VRAM.
- Since you have 2x GPUs and this model is supported with ik's new `-sm graph` feature, you definitely want to be trying that instead of the old `-sm layer`. You likely won't need to custom-balance the extra layers and can just use `-ts` (I've never seen anyone use floats for the split values, a simple integer works fine, but if you got it working that's great!).
Have fun tuning!
After days of extensive testing on my old Dell T630 (a 2014 server, so check the BIOS parameters; 460 GB DDR4-2400, dual Xeon E5-2696 v4) with ik_llama.cpp, I looked for the "sweet spot" configuration for this model (IQ5_K) with 100k context.
Here are my findings for anyone running similar Dual Socket hardware:
- Threads strategy: --threads 44 (physical cores only) is optimal for generation speed, while --threads-batch 88 (full HyperThreading) significantly boosts prompt-processing speed.
- Micro-batching: maximize --ubatch-size within VRAM limits (in steps of 64), but be careful: specific values like 640 caused regressions with OpenCode despite working in OpenWebUI. 576 proved to be the stable maximum.
- NUMA: --numa distribute + numactl --interleave=all is mandatory for stability on dual-socket.
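One way to sanity-check the 44/88 split on your own box, assuming a Linux system with `lscpu` and `nproc` available:

```shell
# Count physical cores (unique core/socket pairs) vs. hardware threads.
phys=$(lscpu -p=CORE,SOCKET | grep -v '^#' | sort -u | wc -l)
logical=$(nproc --all)
echo "suggested: --threads $phys --threads-batch $logical"
```

On the dual E5-2696 v4 this should report 44 physical cores and 88 threads, matching the settings above; on other hardware it gives you the equivalent starting point to benchmark from.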
Performance:
OpenWebUI: ~5.3 t/s Gen | ~17 t/s Prompt (?)
OpenCode: ~4.0-5.1 t/s Gen | 40-60 t/s Prompt
Note: Still investigating errors with -sm graph, currently sticking to manual layer split.
Final Command:
```
numactl --interleave=all /root/ik_llama.cpp/build/bin/llama-server \
    --model /root/ik_llama.cpp/models/GLM-4.7-Ubergarm/IQ5_K/GLM-4.7-IQ5_K-00001-of-00007.gguf \
    --alias GLM-4.7-IQ5 \
    --host 0.0.0.0 --port 8080 \
    --ctx-size 102400 \
    --no-mmap \
    --threads 44 --threads-batch 88 \
    --batch-size 2340 --ubatch-size 576 \
    --parallel 1 --flash-attn 1 \
    --jinja --verbose \
    --n-gpu-layers 99 \
    --tensor-split 48,52 \
    --split-mode layer \
    --numa distribute \
    --run-time-repack \
    -ger --merge-qkv \
    --cache-type-k q4_0 --cache-type-v q4_0 \
    --k-cache-hadamard \
    -ot 'blk.[0-8]..*exps.weight=CUDA0' \
    -ot 'blk.(8[6-9]|9[0-2])..*exps.weight=CUDA1' \
    -ot '.*exps.weight=CPU'
```
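The step-of-64 `--ubatch-size` sweep described above can be scripted rather than done by hand. A dry-run sketch that just prints the `llama-sweep-bench` invocations to run (paths and the exact flag set are placeholders; adapt them to your build and model):

```shell
# Dry run: print one llama-sweep-bench invocation per candidate ubatch size.
# BIN and MODEL are placeholders; point them at your real binary and GGUF file.
BIN=./build/bin/llama-sweep-bench
MODEL=GLM-4.7-IQ5_K-00001-of-00007.gguf
for ub in 448 512 576 640; do
  echo "$BIN -m $MODEL -c 32768 -ub $ub -b 2340 -fa"
done
```

Drop the `echo` to actually run the sweep, and log each run's PP/TG numbers so the 640-style regressions show up in the data instead of mid-session.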
You keep moving the needle and getting better perf! You have great GPUs but the VRAM is slower speed so gonna hurt your token generation unless you use a smaller quant or model with less active weights. Glad the numa stuff helped out to at least get a little more out of the RAM despite NUMA nodes.
> Note: Still investigating errors with -sm graph, currently sticking to manual layer split.
What command are you trying, or error are you seeing? If you are still using -ot you will also need to use -smgs to allow -sm graph.
Also @Doctor-Shotgun wrote a good description if you want some more information: https://gist.github.com/DocShotgun/a02a4c0c0a57e43ff4f038b46ca66ae0
@Ubergarm Thanks for the guidance! I've rebuilt with the latest ik_llama.cpp and applied changes based on your recommendations and @Doctor-Shotgun 's guide.
Here's what I changed:
Configuration Updates
- Switched from -sm layer β -sm graph (working smoothly with the new binary).
- Simplified tensor overrides to a single pattern: -ot 'blk.(1[5-9]|[2-9][0-9])..*exps.weight=CPU'
- Increased --batch-size to 2880
- Added -gr for graph reuse
- Kept --tensor-split 48,52 to balance VRAM usage between the GPUs (both near their maximum, with less than 1 GB free).
Regarding -grt q8_0
Tested extensively with mixed results:
- OpenWebUI: works, but no measurable performance gain.
- OpenCode: significant performance degradation.
Current Performance (with -sm graph):
Still benchmarking, but prompt processing is visibly faster than layer mode (~44 t/s). Generation is very stable at ~4.8-4.95 t/s.
The devs like these parameters for the stable replies from GLM.
```
numactl --interleave=all /root/ik_llama.cpp/build/bin/llama-server \
    --model /root/ik_llama.cpp/models/GLM-4.7-Ubergarm/IQ5_K/GLM-4.7-IQ5_K-00001-of-00007.gguf \
    --alias GLM-4.7-IQ5 --host 0.0.0.0 --port 8080 --ctx-size 102400 --no-mmap \
    --threads 44 --threads-batch 88 --batch-size 2880 --ubatch-size 576 \
    --parallel 1 --flash-attn 1 --jinja --verbose --n-gpu-layers 999 \
    --tensor-split 48,52 --split-mode graph --numa distribute --run-time-repack \
    -gr -ger --merge-qkv --cache-type-k q4_0 --cache-type-v q4_0 --k-cache-hadamard \
    -ot 'blk.(1[5-9]|[2-9][0-9])..*exps.weight=CPU'
```
The same model, but on another machine (my personal PC):
- Fedora 42
- CPU: AMD EPYC 7532 (bare metal), RAM: 512 GB DDR4
- 7x RTX 3090 (24GB), with 2 cards linked via NVLink
- AND... 1x Tesla V100 SXM2 16GB (using a Chinese SXM2-to-PCIe adapter card)
Key Compilation Findings
If you mix Ampere (3090) and Volta (V100), you MUST disable MMQ.
```
cmake .. -DGGML_CUDA_FORCE_MMQ=OFF            # crucial: MMQ breaks V100 calculations in mixed setups (and opencode)
cmake .. -DCMAKE_CUDA_ARCHITECTURES="70;86"   # target both Volta and Ampere
```
Configuration :
Context: 102400 tokens
```
ik_llama.cpp/build/bin/llama-server \
    --model /home/admin_ia/.cache/lm-studio/models/ubergarm/GLM-4.7-GGUF/GLM-4.7-IQ5_K-00001-of-00007.gguf \
    --alias GLM-4.7-IQ5 --host 0.0.0.0 --port 8080 --ctx-size 102400 --no-mmap \
    --threads 32 --threads-batch 64 --batch-size 2048 --ubatch-size 2048 \
    --parallel 1 --flash-attn 1 --jinja --verbose --n-gpu-layers 999 \
    --split-mode layer --numa distribute --run-time-repack -gr -ger --merge-qkv \
    --cache-type-k q4_0 --cache-type-v q4_0 \
    --tensor-split 1.2,1,1,1,1,1,1,0.3 \
    -ot 'blk.(0|1|2|3|4|5|6|7|8)..*exps.weight=CUDA0' \
    -ot 'blk.(9|10|11|12|13|14|55)..*exps.weight=CUDA1' \
    -ot 'blk.(15|16|17|18|19|20|54)..*exps.weight=CUDA2' \
    -ot 'blk.(21|22|23|24|25|26|53)..*exps.weight=CUDA3' \
    -ot 'blk.(27|28|29|30|31|32|52)..*exps.weight=CUDA4' \
    -ot 'blk.(33|34|35|36|37|38|51)..*exps.weight=CUDA5' \
    -ot 'blk.(39|40|41|42|43|44|50)..*exps.weight=CUDA6' \
    -ot 'blk.(45|46|47|48|49)..*exps.weight=CUDA7' \
    -ot '.*exps.weight=CPU'
```
(Note: I placed later layers (50-55) onto earlier GPUs to fill VRAM gaps; this still needs to be optimised.)
VRAM Utilization: 96-97% on almost all cards (V100 is at 15.7/16GB!).
ggml_cuda_set_peer_access: Enabled only for the NVLink pair.
Speed: 10 tokens/s @ 100k context.
I avoided CUDA graphs (implicitly, via --split-mode layer) because of the V100/Volta architecture mix. Has anyone successfully run -sm graph on a mixed Ampere/Volta setup without breaking P2P?
Wow, you have come a long way in a short time! I didn't realize the old Tesla cards even worked very much. Honestly, you might be better off dropping the Tesla V100 and going with a slightly smaller quant to make use of -sm graph on the 7x 3090's.
Volta SXM2 cards are very cheap: you can find Chinese PCIe adapter cards, 32 GB cards for €650, and a SYS-4029GP-TVRT server with 8x 32GB V100s and fast NVLink for €6000 (256 GB of VRAM, plus 4 PCIe slots to add 4x RTX 4090 for a total of 352 GB). I also want to test the Chinese RTX 4090 with 48 GB of VRAM: https://www.youtube.com/watch?v=jA4Bhw1S_2o
So I need to test the mix of cards and the performance.
New quick reply:
I sold the V100 card and bought a used RTX 3090.
New build (new NVIDIA driver 580.126.09):
```
cd ~/ik_llama.cpp
git reset --hard
git clean -fd
git pull origin main
git submodule update --init --recursive
ccache -C
conda activate ik_build
conda install -c nvidia/label/cuda-13.1.0 cuda-toolkit
conda install -c nvidia/label/cuda-13.1.0 nccl
rm -rf build && mkdir build
cmake -B build -G Ninja \
    -DCMAKE_BUILD_TYPE=Release \
    -DGGML_NATIVE=OFF \
    -DGGML_AVX=ON \
    -DGGML_AVX2=ON \
    -DGGML_FMA=ON \
    -DGGML_F16C=ON \
    -DGGML_AVX512=OFF \
    -DGGML_AVX512_VBMI=OFF \
    -DGGML_AVX512_VNNI=OFF \
    -DGGML_CUDA=ON \
    -DCMAKE_CUDA_ARCHITECTURES="86" \
    -DGGML_CUDA_F16=ON \
    -DGGML_CUDA_FORCE_MMQ=OFF \
    -DGGML_CUDA_PEER_MAX_BATCH_SIZE=4096 \
    -DGGML_NCCL=ON \
    -DGGML_CCACHE=ON \
    -DLLAMA_BUILD_SERVER=ON \
    -DLLAMA_BUILD_EXAMPLES=ON \
    -DGGML_CUDA_FA_ALL_QUANTS=ON \
    -DGGML_IQK_FLASH_ATTENTION=ON
```
(not sure about all of these flags)
So with 8x RTX 3090:
```
~/ik_llama.cpp/build/bin/llama-server \
    --model /home/admin_ia/.cache/lm-studio/models/ubergarm/GLM-4.7-GGUF/GLM-4.7-IQ5_K-00001-of-00007.gguf \
    --alias GLM-4.7-IQ5 --host 0.0.0.0 --port 8080 --ctx-size 102400 --no-mmap \
    --threads 32 --threads-batch 64 --batch-size 2048 --ubatch-size 4096 \
    --verbose --parallel 1 --flash-attn 1 --n-gpu-layers 999 \
    --split-mode graph --tensor-split 0.9,1,1,1,1,1,1,1 --numa distribute \
    --run-time-repack -gr -ger --merge-qkv \
    --cache-type-k q4_0 --cache-type-v q4_0 --k-cache-hadamard --jinja \
    -ot 'blk.([6-9]|[6-9][0-9])..*exps.weight=CPU'
```
Result: 9 tokens/s (but more stable).
Now I'm waiting for your new Kimi K2.5 quants ;)
Heya! You have built quite a rig! Definitely check out my discussion of LACT to save yourself some power by undervolting those 8x GPUs (around minute 22) https://blog.aifoundry.org/p/adventures-in-model-quantization
Sorry I've been slow lately, life has been busy. I'll hope to take a look at K2.5 end of this week with luck.
AesSedai released the "full quality" Kimi-K2.5 Q4_X which if you want the best, that is it.
I'll still look into possibly smaller quants later, given you can run <192GB quants fully in VRAM, which will likely be much faster given your DRAM speeds.
