Papers
arxiv:2601.01584

Steerability of Instrumental-Convergence Tendencies in LLMs

Published on Jan 4
· Submitted by
j-hoscilowic
on Jan 7

Abstract

Research investigates the balance between AI system capabilities and steerability, finding that specific prompts can dramatically reduce unwanted behaviors in large language models.

AI-generated summary

We examine two properties of AI systems: capability (what a system can do) and steerability (how reliably one can shift behavior toward intended outcomes). A central question is whether capability growth reduces steerability and risks control collapse. We also distinguish between authorized steerability (builders reliably reaching intended behaviors) and unauthorized steerability (attackers eliciting disallowed behaviors). This distinction highlights a fundamental safety--security dilemma of AI models: safety requires high steerability to enforce control (e.g., stop/refuse), while security requires low steerability for malicious actors to elicit harmful behaviors. This tension presents a significant challenge for open-weight models, which currently exhibit high steerability via common techniques like fine-tuning or adversarial attacks. Using Qwen3 and InstrumentalEval, we find that a short anti-instrumental prompt suffix sharply reduces the measured convergence rate (e.g., shutdown avoidance, self-replication). For Qwen3-30B Instruct, the convergence rate drops from 81.69% under a pro-instrumental suffix to 2.82% under an anti-instrumental suffix. Under anti-instrumental prompting, larger aligned models show lower convergence rates than smaller ones (Instruct: 2.82% vs. 4.23%; Thinking: 4.23% vs. 9.86%). Code is available at github.com/j-hoscilowicz/instrumental_steering.

Community

Paper author Paper submitter

This paper measures how easily “instrumental-convergence” behaviors (e.g., shutdown avoidance, self-replication) in LLMs can be amplified or suppressed by simple steering, and argues that the common claim “as AI capability (often glossed as ‘intelligence’) increases, systems inevitably become less controllable” should not be treated as a default assumption. Using InstrumentalEval on Qwen3 (4B/30B; Base/Instruct/Thinking) with a GPT-5.2 judge, a short anti-instrumental prompt suffix drops convergence sharply (e.g., Qwen3-30B Instruct: 81.69% to 2.82%), while a pro-instrumental suffix pushes it high. The key takeaway is a safety–security dilemma for open weights: the same high steerability that helps builders enforce safe behavior can also help attackers elicit disallowed behavior, so widening the gap between authorized vs. unauthorized steerability remains a central open problem.

This is an automated message from the Librarian Bot. I found the following papers similar to this paper.

The following papers were recommended by the Semantic Scholar API

Please give a thumbs up to this comment if you found it helpful!

If you want recommendations for any Paper on Hugging Face checkout this Space

You can directly ask Librarian Bot for paper recommendations by tagging it in a comment: @librarian-bot recommend

Sign up or log in to comment

Models citing this paper 0

No model linking this paper

Cite arxiv.org/abs/2601.01584 in a model README.md to link it from this page.

Datasets citing this paper 0

No dataset linking this paper

Cite arxiv.org/abs/2601.01584 in a dataset README.md to link it from this page.

Spaces citing this paper 0

No Space linking this paper

Cite arxiv.org/abs/2601.01584 in a Space README.md to link it from this page.

Collections including this paper 1