Steerability of Instrumental-Convergence Tendencies in LLMs
Abstract
Research investigates the balance between AI system capabilities and steerability, finding that specific prompts can dramatically reduce unwanted behaviors in large language models.
We examine two properties of AI systems: capability (what a system can do) and steerability (how reliably one can shift behavior toward intended outcomes). A central question is whether capability growth reduces steerability and risks control collapse. We also distinguish between authorized steerability (builders reliably reaching intended behaviors) and unauthorized steerability (attackers eliciting disallowed behaviors). This distinction highlights a fundamental safety--security dilemma of AI models: safety requires high steerability to enforce control (e.g., stop/refuse), while security requires low steerability for malicious actors to elicit harmful behaviors. This tension presents a significant challenge for open-weight models, which currently exhibit high steerability via common techniques like fine-tuning or adversarial attacks. Using Qwen3 and InstrumentalEval, we find that a short anti-instrumental prompt suffix sharply reduces the measured convergence rate (e.g., shutdown avoidance, self-replication). For Qwen3-30B Instruct, the convergence rate drops from 81.69% under a pro-instrumental suffix to 2.82% under an anti-instrumental suffix. Under anti-instrumental prompting, larger aligned models show lower convergence rates than smaller ones (Instruct: 2.82% vs. 4.23%; Thinking: 4.23% vs. 9.86%). Code is available at github.com/j-hoscilowicz/instrumental_steering.
Community
This paper measures how easily “instrumental-convergence” behaviors (e.g., shutdown avoidance, self-replication) in LLMs can be amplified or suppressed by simple steering, and argues that the common claim “as AI capability (often glossed as ‘intelligence’) increases, systems inevitably become less controllable” should not be treated as a default assumption. Using InstrumentalEval on Qwen3 (4B/30B; Base/Instruct/Thinking) with a GPT-5.2 judge, a short anti-instrumental prompt suffix drops convergence sharply (e.g., Qwen3-30B Instruct: 81.69% to 2.82%), while a pro-instrumental suffix pushes it high. The key takeaway is a safety–security dilemma for open weights: the same high steerability that helps builders enforce safe behavior can also help attackers elicit disallowed behavior, so widening the gap between authorized vs. unauthorized steerability remains a central open problem.
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- Scaling Patterns in Adversarial Alignment: Evidence from Multi-LLM Jailbreak Experiments (2025)
- Defending Large Language Models Against Jailbreak Exploits with Responsible AI Considerations (2025)
- Safety Alignment of LMs via Non-cooperative Games (2025)
- A Benchmark for Evaluating Outcome-Driven Constraint Violations in Autonomous AI Agents (2025)
- "To Survive, I Must Defect": Jailbreaking LLMs via the Game-Theory Scenarios (2025)
- Decomposed Trust: Exploring Privacy, Adversarial Robustness, Fairness, and Ethics of Low-Rank LLMs (2025)
- Verifiability-First Agents: Provable Observability and Lightweight Audit Agents for Controlling Autonomous LLM Systems (2025)
Please give a thumbs up to this comment if you found it helpful!
If you want recommendations for any Paper on Hugging Face checkout this Space
You can directly ask Librarian Bot for paper recommendations by tagging it in a comment:
@librarian-bot
recommend
Models citing this paper 0
No model linking this paper
Datasets citing this paper 0
No dataset linking this paper
Spaces citing this paper 0
No Space linking this paper