Qwen3.5-4B-Soyuz (LoRA adapter)

Pure SFT LoRA on top of Qwen/Qwen3.5-4B (chat/instruct) trained on the cleaned subset of AlexWortega/Soyuz-sft.

Format: Hermes-style tool calls (<tool_call>{"name":...,"arguments":...}</tool_call>).

Training

key value
Base Qwen/Qwen3.5-4B (chat)
Strategy full bf16 LoRA (no quantization)
LoRA rank 128
LoRA alpha 256
LR 1e-5 (cosine schedule)
Epochs 1
Steps 1275
Seq len 16384 (smart-truncate ≀16K)
Eff batch 8
Optimizer AdamW fused
Loss chunked_nll (TRL)
Kernels Liger rms_norm + swiglu + fused_linear_cross_entropy
Tokens seen ~129M train + ~102M eval
Hardware 1Γ— RTX A6000 (46 GB)
Wall clock ~22 h

Eval (held-out 3% Soyuz-clean split = 631 samples)

Step eval_loss token_acc entropy
500 0.2593 0.9331 0.2613
1000 0.2476 0.9359 0.2492
1275 (final) 0.2470 0.9360 0.2490

Smooth monotone improvement, no overfit signal.

Source mixture (Soyuz-sft / clean/*)

11 streams, kept after smart-truncate to ≀16K tokens:

  • alienkevin_glm-5
  • alienkevin_minimax-m2.5
  • deepswe_kimi-k2_2.8k + deepswe_kimi-k2_rs
  • hermes_agent_reasoning
  • ii_agent_gaia
  • ii_swebench-pro_claude-4.5 + ii_swebench-pro_gpt-5-codex
  • jetbrains_swe-bench-test + jetbrains_swesmith
  • nebius_swe-rebench

Total: 20,395 train + 631 eval after truncate.

Usage

PEFT

from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

base = AutoModelForCausalLM.from_pretrained("Qwen/Qwen3.5-4B", dtype=torch.bfloat16, device_map="cuda")
model = PeftModel.from_pretrained(base, "AlexWortega/qwen35-4b-soyuz")
tok = AutoTokenizer.from_pretrained("AlexWortega/qwen35-4b-soyuz")

sglang

python -m sglang.launch_server --model-path Qwen/Qwen3.5-4B \
    --lora-paths soyuz=AlexWortega/qwen35-4b-soyuz \
    --tool-call-parser hermes

Related

Asset Link
Merged bf16 qwen35-4b-soyuz-merged
Training data Soyuz-sft
Sibling: + ClawGym RIFT qwen35-4b-clawd-rift

W&B: https://wandb.ai/alexwortega/vae-llm-agents

GGUF quantizations are not provided: Qwen3.5 is a hybrid linear+full attention architecture (qwen3_5_text with linear_attention layers + MTP head); upstream llama.cpp does not yet support converting this model type.


Downstream evaluations

Served via sglang (base Qwen/Qwen3.5-4B + this LoRA via --lora-paths) on a single A6000.

terminal-bench-2 β€” 17-task solvable subset

Subset = union of all tasks ever passed by any sibling Qwen3.5-4B variant (ckpt600, clawd-100, clawd-200, clawd-rft, clawd-rift).

Pass Rate
soyuz (this, SFT-only) 5 / 17 29.4 %
clawd-rift (3-stage: SFT β†’ ClawGym β†’ RIFT) 4 / 17 23.5 %

Soyuz passes: git-leak-recovery, kv-store-grpc, modernize-scientific-stack, openssl-selfsigned-cert, sqlite-with-gcov. Of those, 3 (git-leak-recovery, kv-store-grpc, sqlite-with-gcov) are new passes vs clawd-rift on this subset.

Scaffold: Pi-style terminus_runner, T=0.4, max-turns=30, max-tokens=4096, parallel 2.

claw-eval β€” general split, Pass^1 (partial, in progress)

Snapshot at 30 / 162 tasks graded:

Metric Value
Pass^1 9 / 30 (30 %)
Mean task_score 0.564

Judge: google/gemini-3-flash-preview via OpenRouter. Agent endpoint: local sglang.

Top passes include C04 image_processing, C08 personal_finance, C10 labor_law, C13 psychology_statistics, C16 hr_workforce_planning, C20 mental_health_social_work. Full 162-task results will be appended when the sweep finishes.


HermesAgent-20 (executable agent benchmark)

HermesAgent-20 β€” 20 real-Hermes-runtime scenarios graded by deterministic artifacts (files / memory / cron / browser traces / approval logs). Not mocked tool-call matching.

Soyuz served via sglang Qwen/Qwen3.5-4B + this LoRA --lora-paths --tool-call-parser hermes.

Metric Soyuz
Pass 4 / 20
Average score (0–100) 61.9

Confirmed passes:

  • HA-03 Reject Malicious Memory Injection β€” 100
  • HA-06 Background Process Management β€” 100
  • HA-09 Create A Skill From Completed Work β€” 100
  • HA-20 Clarify An Ambiguous Destructive Request β€” 100

Partial: HA-19 (35), HA-16 (30), HA-10 (30). Five scenarios (HA-11/12/13/17/18) crashed under parallel server load β€” true Pass count is β‰₯ 4.

Crucial finding: without --tool-call-parser hermes Soyuz scored 1/20 avg=17 (only the refuse scenario, since the runtime didn't see any tool calls). With Hermes parser routing <tool_call>{...}</tool_call> β†’ OpenAI tool_calls, score jumped to 4/20 avg=61.9 (~4Γ— more passes, 3.6Γ— higher average).


Abliterated variants (weight-orthogonalized)

Two post-hoc model variants built from soyuz's own pass-vs-fail trajectory contrast (no training, only weight orthogonalisation):

Model tbench-17 HA20
soyuz (this) 5/17 4/20
soyuz-abliterated-v2 (single-L, s=0.5) 3/17 8/20 ↑↑
soyuz-abliterated-v3-multi (per-layer, s=0.5) 2/17 6/20 ↑

v2 doubles HermesAgent-20 score by removing a single residual-stream "fail-mode" direction (L=16, AUC 0.928 over 60 PASS vs 60 Gemini-cleaned FAIL trajectories). v3 picks up disjoint memory-tooling tasks (HA-01/02). See respective repos for the recipe.

Downloads last month
62
Inference Providers NEW
This model isn't deployed by any Inference Provider. πŸ™‹ Ask for provider support

Model tree for AlexWortega/qwen35-4b-soyuz

Finetuned
Qwen/Qwen3.5-4B
Adapter
(220)
this model

Dataset used to train AlexWortega/qwen35-4b-soyuz