These models may *degrade* performance on prompts < 32k and are only needed for LM Studio users.

#8
by ubergarm - opened

I've been scratching my head about these models, but r/LocalLLaMA member u/AaronFeng47 helped me understand that these unsloth/Qwen3-30B-A3B-128K-GGUF quants are really only meant for people who:

  1. Use LM Studio
  2. Use prompt lengths over 32k regularly

Otherwise, according to Qwen, this configuration may degrade performance:

> If the average context length does not exceed 32,768 tokens, we do not recommend enabling YaRN in this scenario, as it may potentially degrade model performance.
https://huggingface.co/Qwen/Qwen3-30B-A3B#processing-long-texts

Also, regarding the longer imatrix context length, Daniel mentioned up to 12k, but it's unclear how this affects model performance for shorter prompts, or whether it would help in the 32k+ prompt length case.

More benchmarking is probably needed, but the current evidence suggests there isn't much measurable improvement so far.
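If anyone wants to measure this themselves, one rough way is to compare a 512-calibrated quant against a 12K-calibrated quant at both a short and a long context size with llama.cpp's perplexity tool. This is only a sketch: the filenames below are placeholders, and perplexity is a coarse proxy for real long-prompt quality.

```bash
# Short-context comparison (placeholder filenames for quants built
# from a 512-token imatrix vs. a 12K-token imatrix).
llama-perplexity -m Qwen3-30B-A3B-imat512-Q4_K_M.gguf -f wiki.test.raw -c 2048
llama-perplexity -m Qwen3-30B-A3B-imat12k-Q4_K_M.gguf -f wiki.test.raw -c 2048

# Long-context comparison with YaRN enabled, using the Qwen model card flags.
llama-perplexity -m Qwen3-30B-A3B-imat512-Q4_K_M.gguf -f wiki.test.raw -c 65536 \
  --rope-scaling yarn --rope-scale 4 --yarn-orig-ctx 32768
llama-perplexity -m Qwen3-30B-A3B-imat12k-Q4_K_M.gguf -f wiki.test.raw -c 65536 \
  --rope-scaling yarn --rope-scale 4 --yarn-orig-ctx 32768
```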

So if you want to try long context mode on something other than LM Studio, the Qwen Model Card tells you to do it this way:

```bash
llama-server ... --rope-scaling yarn --rope-scale 4 --yarn-orig-ctx 32768
```
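For example, a fuller invocation might look like the sketch below. The model filename, context size, and port are placeholders, and note that you typically also need to raise the context window with `-c`, since the rope flags alone don't change it:

```bash
# Placeholder model path and port; -c raises the requested context window,
# and the YaRN flags match the Qwen model card's recommendation.
llama-server \
  -m ./Qwen3-30B-A3B-Q4_K_M.gguf \
  -c 131072 \
  --rope-scaling yarn --rope-scale 4 --yarn-orig-ctx 32768 \
  --port 8080
```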

Okay, hope this saves people from wasting time and bandwidth choosing the wrong quant for their needs!

Cheers!

I don't see how that YaRN setup is Unsloth-custom? It's very much the case that you're going to face lost-in-the-middle (good read btw, and still highly relevant).
In general the first 1/4 of the context is somewhat accurate if it's multi-turn. YaRN will degrade quality, but that isn't inherited from the quant or from Unsloth; it's a side effect of how much context the model was trained with.

Stretching that 4x will have impacts for sure, but the same can be observed with any model.

Unsloth AI org

No, this is false on 3 points.

  1. First, the context length for Qwen 3 is not 32K, it's 40960; we verified this with the Qwen team, i.e. any quant using a 32K context size is actually wrong. We communicated with the Qwen team during their pre-release and helped resolve issues.

  2. Second, yes, enabling YaRN like that is fine, but you MUST calibrate the imatrix (importance matrix) to account for longer sequence lengths; i.e. your own importance plots show some differences to ours since we used 12K context lengths (a sketch of such a run follows this list). Yes, 12K is less than 32K, but it's much better than 512.

  3. Third, YaRN scales the RoPE embeddings, so doing the imatrix on 512-token sequences will not be equivalent to doing it on 12K context lengths. Note https://blog.eleuther.ai/yarn/ shows shorter contexts degrade in accuracy, so you can't just set YaRN and expect the same performance on quantized models; that only holds for BF16.
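For reference, here is a minimal sketch of what calibrating and quantizing with a longer-context imatrix could look like in llama.cpp. The calibration file, output names, and exact 12K chunk size are assumptions for illustration, not Unsloth's actual pipeline:

```bash
# Build the importance matrix from long-context calibration text so the
# quantization sees activations at sequence lengths closer to long-prompt use.
# File names and the 12288-token context are placeholders.
llama-imatrix \
  -m ./Qwen3-30B-A3B-BF16.gguf \
  -f ./long-context-calibration.txt \
  -o ./imatrix-12k.dat \
  -c 12288

# Quantize using that imatrix.
llama-quantize --imatrix ./imatrix-12k.dat \
  ./Qwen3-30B-A3B-BF16.gguf ./Qwen3-30B-A3B-Q4_K_M.gguf Q4_K_M
```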

[Image: Unsloth's Qwen3 GGUFs are updated with a new, improved version]

Thanks, I'll reply once here to keep it simple:
https://www.reddit.com/r/LocalLLaMA/comments/1kju1y1/comment/mru9ob7/

shimmyshimmer changed discussion status to closed
