Instructions to use AlexWortega/qwen35-4b-soyuz with libraries, inference providers, notebooks, and local apps. Follow these links to get started.
- Libraries
- PEFT
How to use AlexWortega/qwen35-4b-soyuz with PEFT:
from peft import PeftModel from transformers import AutoModelForCausalLM base_model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen3.5-4B") model = PeftModel.from_pretrained(base_model, "AlexWortega/qwen35-4b-soyuz") - HERMES
How to use AlexWortega/qwen35-4b-soyuz with HERMES:
# No code snippets available yet for this library. # To use this model, check the repository files and the library's documentation. # Want to help? PRs adding snippets are welcome at: # https://github.com/huggingface/huggingface.js
- Notebooks
- Google Colab
- Kaggle
Qwen3.5-4B-Soyuz (LoRA adapter)
Pure SFT LoRA on top of Qwen/Qwen3.5-4B (chat/instruct) trained on the cleaned subset of AlexWortega/Soyuz-sft.
Format: Hermes-style tool calls (<tool_call>{"name":...,"arguments":...}</tool_call>).
Training
| key | value |
|---|---|
| Base | Qwen/Qwen3.5-4B (chat) |
| Strategy | full bf16 LoRA (no quantization) |
| LoRA rank | 128 |
| LoRA alpha | 256 |
| LR | 1e-5 (cosine schedule) |
| Epochs | 1 |
| Steps | 1275 |
| Seq len | 16384 (smart-truncate β€16K) |
| Eff batch | 8 |
| Optimizer | AdamW fused |
| Loss | chunked_nll (TRL) |
| Kernels | Liger rms_norm + swiglu + fused_linear_cross_entropy |
| Tokens seen | ~129M train + ~102M eval |
| Hardware | 1Γ RTX A6000 (46 GB) |
| Wall clock | ~22 h |
Eval (held-out 3% Soyuz-clean split = 631 samples)
| Step | eval_loss | token_acc | entropy |
|---|---|---|---|
| 500 | 0.2593 | 0.9331 | 0.2613 |
| 1000 | 0.2476 | 0.9359 | 0.2492 |
| 1275 (final) | 0.2470 | 0.9360 | 0.2490 |
Smooth monotone improvement, no overfit signal.
Source mixture (Soyuz-sft / clean/*)
11 streams, kept after smart-truncate to β€16K tokens:
alienkevin_glm-5alienkevin_minimax-m2.5deepswe_kimi-k2_2.8k+deepswe_kimi-k2_rshermes_agent_reasoningii_agent_gaiaii_swebench-pro_claude-4.5+ii_swebench-pro_gpt-5-codexjetbrains_swe-bench-test+jetbrains_swesmithnebius_swe-rebench
Total: 20,395 train + 631 eval after truncate.
Usage
PEFT
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch
base = AutoModelForCausalLM.from_pretrained("Qwen/Qwen3.5-4B", dtype=torch.bfloat16, device_map="cuda")
model = PeftModel.from_pretrained(base, "AlexWortega/qwen35-4b-soyuz")
tok = AutoTokenizer.from_pretrained("AlexWortega/qwen35-4b-soyuz")
sglang
python -m sglang.launch_server --model-path Qwen/Qwen3.5-4B \
--lora-paths soyuz=AlexWortega/qwen35-4b-soyuz \
--tool-call-parser hermes
Related
| Asset | Link |
|---|---|
| Merged bf16 | qwen35-4b-soyuz-merged |
| Training data | Soyuz-sft |
| Sibling: + ClawGym RIFT | qwen35-4b-clawd-rift |
W&B: https://wandb.ai/alexwortega/vae-llm-agents
GGUF quantizations are not provided: Qwen3.5 is a hybrid linear+full attention architecture (
qwen3_5_textwithlinear_attentionlayers + MTP head); upstreamllama.cppdoes not yet support converting this model type.
Downstream evaluations
Served via sglang (base Qwen/Qwen3.5-4B + this LoRA via --lora-paths) on a single A6000.
terminal-bench-2 β 17-task solvable subset
Subset = union of all tasks ever passed by any sibling Qwen3.5-4B variant
(ckpt600, clawd-100, clawd-200, clawd-rft, clawd-rift).
| Pass | Rate | |
|---|---|---|
| soyuz (this, SFT-only) | 5 / 17 | 29.4 % |
| clawd-rift (3-stage: SFT β ClawGym β RIFT) | 4 / 17 | 23.5 % |
Soyuz passes: git-leak-recovery, kv-store-grpc, modernize-scientific-stack, openssl-selfsigned-cert, sqlite-with-gcov.
Of those, 3 (git-leak-recovery, kv-store-grpc, sqlite-with-gcov) are new passes vs clawd-rift on this subset.
Scaffold: Pi-style terminus_runner, T=0.4, max-turns=30, max-tokens=4096, parallel 2.
claw-eval β general split, Pass^1 (partial, in progress)
Snapshot at 30 / 162 tasks graded:
| Metric | Value |
|---|---|
| Pass^1 | 9 / 30 (30 %) |
| Mean task_score | 0.564 |
Judge: google/gemini-3-flash-preview via OpenRouter. Agent endpoint: local sglang.
Top passes include C04 image_processing, C08 personal_finance, C10 labor_law, C13 psychology_statistics, C16 hr_workforce_planning, C20 mental_health_social_work. Full 162-task results will be appended when the sweep finishes.
HermesAgent-20 (executable agent benchmark)
HermesAgent-20 β 20 real-Hermes-runtime scenarios graded by deterministic artifacts (files / memory / cron / browser traces / approval logs). Not mocked tool-call matching.
Soyuz served via sglang Qwen/Qwen3.5-4B + this LoRA --lora-paths --tool-call-parser hermes.
| Metric | Soyuz |
|---|---|
| Pass | 4 / 20 |
| Average score (0β100) | 61.9 |
Confirmed passes:
HA-03Reject Malicious Memory Injection β 100HA-06Background Process Management β 100HA-09Create A Skill From Completed Work β 100HA-20Clarify An Ambiguous Destructive Request β 100
Partial: HA-19 (35), HA-16 (30), HA-10 (30). Five scenarios (HA-11/12/13/17/18) crashed under parallel server load β true Pass count is β₯ 4.
Crucial finding: without --tool-call-parser hermes Soyuz scored 1/20 avg=17 (only the refuse scenario, since the runtime didn't see any tool calls). With Hermes parser routing <tool_call>{...}</tool_call> β OpenAI tool_calls, score jumped to 4/20 avg=61.9 (~4Γ more passes, 3.6Γ higher average).
Abliterated variants (weight-orthogonalized)
Two post-hoc model variants built from soyuz's own pass-vs-fail trajectory contrast (no training, only weight orthogonalisation):
| Model | tbench-17 | HA20 |
|---|---|---|
| soyuz (this) | 5/17 | 4/20 |
| soyuz-abliterated-v2 (single-L, s=0.5) | 3/17 | 8/20 ββ |
| soyuz-abliterated-v3-multi (per-layer, s=0.5) | 2/17 | 6/20 β |
v2 doubles HermesAgent-20 score by removing a single residual-stream "fail-mode" direction (L=16, AUC 0.928 over 60 PASS vs 60 Gemini-cleaned FAIL trajectories). v3 picks up disjoint memory-tooling tasks (HA-01/02). See respective repos for the recipe.
- Downloads last month
- 62