InfoPO: Information-Driven Policy Optimization for User-Centric Agents
Abstract
InfoPO optimizes agent-user collaboration by crediting valuable interaction turns with an information-gain reward and combining that signal with task outcomes through adaptive variance-gated fusion.
Real-world user requests to LLM agents are often underspecified, so agents must interact to acquire missing information and make correct downstream decisions. However, current multi-turn GRPO-based methods typically compute rewards at the trajectory level, which creates credit assignment problems and leaves little advantage signal within rollout groups. A promising remedy is to identify valuable interaction turns at a fine granularity and use them to drive more targeted learning. We therefore introduce InfoPO (Information-Driven Policy Optimization), which frames multi-turn interaction as active uncertainty reduction and computes an information-gain reward that credits turns whose feedback measurably changes the agent's subsequent action distribution relative to a masked-feedback counterfactual. InfoPO then combines this signal with task outcomes via an adaptive variance-gated fusion, weighing the informational value of each turn while keeping optimization aligned with the task goal. Across diverse tasks, including intent clarification, collaborative coding, and tool-augmented decision making, InfoPO consistently outperforms prompting and multi-turn RL baselines; it is also robust under user-simulator shifts and generalizes to environment-interactive tasks. Overall, InfoPO provides a principled and scalable mechanism for optimizing complex agent-user collaboration. Code is available at https://github.com/kfq20/InfoPO.
Community
🌟 We introduce InfoPO (Information-Driven Policy Optimization) — a practical way to train multi-turn LLM agents with turn-level credit assignment.
🧠 Key idea: treat interaction as active uncertainty reduction. We compute a counterfactual information-gain reward by comparing the agent’s next-action distribution with vs. without the user feedback (a masked-feedback counterfactual), so the model learns which turns actually matter (see the first sketch after this list).
🎯 Why it matters: outcome-only rewards in multi-turn GRPO-style training can be sparse and noisy. InfoPO provides dense, targeted learning signals, and we keep them task-aligned via an adaptive variance-gated fusion with task outcomes (second sketch below).
📊 Results: consistent gains and improved stability across diverse interactive settings (e.g., intent clarification, collaborative coding, tool-augmented decision making), including UserGym, ColBench, and τ²-Bench.
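A minimal sketch of the counterfactual information-gain reward, under our reading of the abstract: `get_next_action_logits` is a hypothetical helper that returns the policy's next-action logits for a given dialogue context, and the KL form and masking scheme are assumptions, not necessarily the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def info_gain_reward(get_next_action_logits, context, feedback_turn,
                     mask_token="<feedback-masked>"):
    """Credit a turn by how much the user's feedback shifts the agent's
    next-action distribution relative to a masked-feedback counterfactual."""
    # Next-action logits (shape [vocab]) with the feedback visible vs. masked.
    logits_with = get_next_action_logits(context + [feedback_turn])
    logits_without = get_next_action_logits(context + [mask_token])
    logp_with = F.log_softmax(logits_with, dim=-1)
    logp_without = F.log_softmax(logits_without, dim=-1)
    # KL(p_with || p_without): large when the feedback measurably changes
    # what the agent would do next; near zero for uninformative turns.
    return F.kl_div(logp_without, logp_with, log_target=True, reduction="sum")
```

A near-zero reward marks a turn whose feedback the policy could have ignored; a large reward marks a turn whose information actually redirected the agent's next action.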
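And a hedged sketch of the adaptive variance-gated fusion, again our reading rather than the authors' exact formula: the exp(-var/temperature) gate and the additive combination are assumptions. The intuition from the abstract is that when outcome rewards within a GRPO rollout group barely vary, their advantages carry little signal, so weight shifts toward the dense information-gain rewards.

```python
import torch

def fuse_rewards(outcome_rewards, info_rewards, temperature=1.0, eps=1e-6):
    """outcome_rewards, info_rewards: tensors of shape [group_size],
    one entry per rollout in a GRPO group."""
    var = outcome_rewards.var(unbiased=False)
    # Gate in (0, 1]: ~1 when outcomes are indistinguishable (sparse signal),
    # decaying toward 0 when outcomes already separate the rollouts well.
    gate = torch.exp(-var / temperature)  # assumed gating form
    fused = outcome_rewards + gate * info_rewards
    # Group-normalize as in GRPO so advantages are zero-mean within the group.
    return (fused - fused.mean()) / (fused.std(unbiased=False) + eps)
```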
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- AT2PO: Agentic Turn-based Policy Optimization via Tree Search (2026)
- Training Multi-Turn Search Agent via Contrastive Dynamic Branch Sampling (2026)
- Pushing Forward Pareto Frontiers of Proactive Agents with Behavioral Agentic Optimization (2026)
- Who Deserves the Reward? SHARP: Shapley Credit-based Optimization for Multi-Agent System (2026)
- ICA: Information-Aware Credit Assignment for Visually Grounded Long-Horizon Information-Seeking Agents (2026)
- TopoCurate: Modeling Interaction Topology for Tool-Use Agent Training (2026)
- Learning from the Irrecoverable: Error-Localized Policy Optimization for Tool-Integrated LLM Reasoning (2026)