Ornstein3.6-35B-A3B-SABER

SABER — Spectral Analysis-Based Entanglement Resolution — applied to DJLougen/Ornstein3.6-35B-A3B.

35B total parameters / ~3B active per token (Qwen3.5 MoE, qwen3_5_moe_text).

Support This Work

I'm a PhD student in visual neuroscience at the University of Toronto who also happens to spend way too much time fine-tuning, merging, and quantizing open-weight models on rented H100s and a local DGX Spark. All training compute is self-funded, which means balancing GPU costs against a student budget. If my uploads have been useful to you, consider buying a PhD student a coffee. It goes a long way toward keeping these experiments running.

Support on Ko-fi

What this model is

This is the base Ornstein3.6-35B-A3B with its refusal subspace resolved and excised. The finetune's stylistic behavior, knowledge, and instruction-following are preserved; what's removed is the narrow circuit the model uses to reject requests.

No prompt-engineered jailbreak, no LoRA, no system prompt trick: the weights that write the refusal signal into the residual stream have been edited directly.

The method (high level)

SABER is a post-hoc refusal-ablation pipeline that departs from prior "single direction" methods in one important way: it treats refusal not as one vector but as a subspace that overlaps with capability representations, and it modulates its edits by that overlap.

The pipeline runs in five stages:

Probe — collect residual-stream activations from paired harmful / harmless / capability prompts.
Analyze — compute per-layer discriminant profiles, extract refusal directions, and quantify how entangled each direction is with the capability subspace.
Excise — apply entanglement-weighted weight surgery: pure-refusal components are fully ablated, capability-entangled components are attenuated (or skipped entirely).
Verify — re-probe to measure residual refusal and detect "hydra" features (dormant circuits that activate after the primary path is removed).
Refine — iterate Excise → Verify with a decayed ablation strength until residual refusal converges.

The key distinction from Arditi-style single-direction ablation is that SABER is designed to avoid over-editing: capability-entangled components are preserved in proportion to their overlap, so perplexity on unrelated tasks stays close to baseline.

Implementation details (direction extractor, entanglement metric, layer selector, refinement schedule, and ablation operator) are intentionally not published here.

Observed behavior

Refusal rate on the harmful-prompt probe set drops to ~0%.
Perplexity on diverse capability prompts is preserved within a small delta of baseline.
Stylistic fingerprint (voice, formatting, instruction adherence) of the Ornstein3.6 finetune is retained.
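
The first two observations above can be checked with very little code. The sketch below is illustrative only and is not the evaluation harness behind saber_result.json: the function names are hypothetical, the log-probabilities are made-up numbers, and how a response gets labelled as a refusal (manual review, a classifier, string matching) is left to the caller.

```python
import math


def perplexity(token_logprobs):
    """Perplexity = exp(-mean natural-log probability per token)."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


def relative_delta(baseline_ppl, edited_ppl):
    """Signed relative change of the edited model's perplexity vs. baseline."""
    return (edited_ppl - baseline_ppl) / baseline_ppl


def refusal_rate(refused_flags):
    """Fraction of probe responses labelled as refusals (booleans)."""
    flags = list(refused_flags)
    if not flags:
        raise ValueError("need at least one labelled response")
    return sum(flags) / len(flags)


# Made-up per-token log-probs for the same text under both models.
baseline_lp = [-1.0, -2.0, -3.0]
edited_lp = [-1.1, -2.0, -3.05]

delta = relative_delta(perplexity(baseline_lp), perplexity(edited_lp))
print(f"relative perplexity change: {delta:+.2%}")
print(f"refusal rate: {refusal_rate([False, False, True, False]):.0%}")
```

Scoring both models on the same tokenized text matters here: perplexities computed over different token sequences are not comparable.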

See saber_result.json in the repo for the outcome metrics (config and layer indices are withheld).

Intended use

Research and red-teaming. This model will comply with requests its parent refused — that is the point. Deploy it accordingly: behind your own policy layer, with logging, and with a clear understanding of what it's for.
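
As a rough illustration of "behind your own policy layer, with logging", here is a minimal sketch of a wrapper around a generation call. Everything in it is an assumption for illustration: `guarded_generate`, `is_allowed`, and the blocked-request message are hypothetical names, and `generate` stands in for whatever inference call your stack actually uses.

```python
import logging
from typing import Callable

logger = logging.getLogger("policy_layer")


def guarded_generate(
    prompt: str,
    generate: Callable[[str], str],
    is_allowed: Callable[[str], bool],
    blocked_text: str = "Request blocked by deployment policy.",
) -> str:
    """Run the caller's policy check first, log the decision, then generate."""
    allowed = is_allowed(prompt)
    # Log metadata only; whether to store raw prompts is a deployment decision.
    logger.info("prompt_chars=%d allowed=%s", len(prompt), allowed)
    if not allowed:
        return blocked_text
    return generate(prompt)


if __name__ == "__main__":
    logging.basicConfig(level=logging.INFO)
    # Stub model and stub policy, for demonstration only.
    echo_model = lambda p: f"model output for: {p}"
    allow_short = lambda p: len(p) < 100
    print(guarded_generate("summarize this paper", echo_model, allow_short))
```

The policy check runs before the model is called, so a blocked request never reaches the model and every decision leaves a log record.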