arxiv:2606.11926

Toward Generalist Autonomous Research via Hypothesis-Tree Refinement

Published on Jun 10

Submitted by Yuyang Hu on Jun 11

#3 Paper of the day

NLPIR Lab @ RUC

Authors:

Yuyang Hu, Guanting Dong

Abstract

An AI framework called Arbor enables autonomous scientific research by combining strategic coordination, isolated hypothesis testing, and a persistent knowledge tree to iteratively improve research outcomes across multiple domains.

Generated by Qwen/Qwen2.5-Coder-32B-Instruct

Scientific progress depends on a repeated loop of exploration, experimentation, and abstraction. Researchers test candidate directions, interpret the evidence, and carry the resulting lessons into later attempts. We study how an AI agent can run this loop autonomously over long horizons. We introduce Arbor, a general framework for autonomous research that combines a long-lived coordinator, short-lived executors, and Hypothesis Tree Refinement (HTR), a persistent tree that links hypotheses, artifacts, evidence, and distilled insights across time. The coordinator manages global research strategy over the tree, while executors implement and test individual hypotheses in isolated worktrees. As results return, Arbor updates the tree, propagates reusable lessons, refines the search frontier, and admits verified improvements. This design turns autonomous research from a sequence of local attempts into a cumulative process in which strategy, execution, and evidence are carried across time. We evaluate Arbor under Autonomous Optimization (AO), an operational setting where an agent improves an initial research artifact through iterative experimentation without step-level human supervision. Across six real research tasks in model training, harness engineering, and data synthesis, Arbor achieves the best held-out result on all six tasks, attaining more than 2.5x the average relative held-out gain of Codex and Claude Code under the same task interface and resource budget. On MLE-Bench Lite, Arbor reaches 86.36% Any Medal with GPT-5.5, the strongest result in our comparison.
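The abstract describes the hypothesis tree only at a high level, so a small sketch may help make the idea concrete. The code below is an illustration under stated assumptions, not Arbor's implementation: the names `HypothesisNode`, `frontier`, and `record_result`, the three status values, and the rule "verified if the score beats a baseline" are all hypothetical. It shows a tree that links hypotheses to evidence, exposes the open leaves as a search frontier, and propagates a distilled insight to every ancestor when a result returns.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from typing import Optional


@dataclass(eq=False)  # identity-based equality avoids recursing through parent/children
class HypothesisNode:
    """One node of a hypothetical hypothesis tree (all field names are illustrative)."""
    statement: str
    parent: Optional[HypothesisNode] = field(default=None, repr=False)
    children: list[HypothesisNode] = field(default_factory=list)
    status: str = "open"              # assumed states: "open", "verified", "rejected"
    score: Optional[float] = None     # evidence returned by an executor
    insights: list[str] = field(default_factory=list)

    def add_child(self, statement: str) -> HypothesisNode:
        """Refine this hypothesis into a more specific child hypothesis."""
        child = HypothesisNode(statement, parent=self)
        self.children.append(child)
        return child


def frontier(root: HypothesisNode) -> list[HypothesisNode]:
    """Return the open leaves, i.e. the hypotheses that are still untested."""
    stack, open_leaves = [root], []
    while stack:
        node = stack.pop()
        if not node.children and node.status == "open":
            open_leaves.append(node)
        stack.extend(node.children)
    return open_leaves


def record_result(node: HypothesisNode, score: float, baseline: float, insight: str) -> None:
    """Attach evidence to a node, then copy the lesson to the node and all its ancestors."""
    node.score = score
    node.status = "verified" if score > baseline else "rejected"
    current: Optional[HypothesisNode] = node
    while current is not None:
        current.insights.append(insight)
        current = current.parent
```

In this sketch a coordinator would repeatedly pick a node from `frontier(root)`, hand it to an executor, and call `record_result` when the executor reports back, so later choices can read the insights accumulated on the path to the root.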

Community

namespace-ERI

Paper author Paper submitter about 6 hours ago

The field has moved from single-turn chatbots to multi-turn dialogue systems and then to tool-using agents. We believe the next important stage is the rise of autonomous agents. However, many existing efforts are either tightly bound to specific scenarios and single tasks, or remain at the research-prototype stage without being truly deployable in practice. This raises a central question: what should a general and practical autonomous agent look like?

In our new work, Toward Generalist Autonomous Research via Hypothesis-Tree Refinement, we present our answer: Arbor. Automated research should not be reduced to repeated trial-and-error. Instead, it should explore in a structured way, organizing hypotheses, evidence, failures, and accumulated experience into an evolving research state, much like the process of real scientific inquiry. Each new attempt should build upon the discoveries and lessons from previous explorations.

Arbor first emphasizes generality. It is not tied to a particular benchmark or task format. Instead, it unifies diverse research tasks, including model training, harness engineering, and data synthesis, under the framework of Autonomous Optimization. As long as there is an artifact to optimize, a clear objective, and executable feedback signals, Arbor can conduct long-horizon search and iterative improvement around it.
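The three ingredients named above (an artifact, an objective, and executable feedback) can be written down as a very small interface. The sketch below is an assumption-laden illustration, not Arbor's API: `AOTask`, `optimize`, and the greedy "keep a candidate only if its score improves" rule are hypothetical, and a real system would search far more carefully than this loop does.

```python
from dataclasses import dataclass
from typing import Callable, Generic, Iterable, Tuple, TypeVar

A = TypeVar("A")  # artifact type: source code, a config, a dataset recipe, ...


@dataclass(frozen=True)
class AOTask(Generic[A]):
    """Hypothetical task description: a starting artifact plus executable feedback."""
    initial: A
    evaluate: Callable[[A], float]  # runs the artifact and returns a score; higher is better


def optimize(task: AOTask[A], proposals: Iterable[Callable[[A], A]]) -> Tuple[A, float]:
    """Greedy baseline: apply each proposed edit and keep it only if the score improves."""
    best = task.initial
    best_score = task.evaluate(best)
    for propose in proposals:
        candidate = propose(best)
        score = task.evaluate(candidate)
        if score > best_score:  # admit only measured improvements
            best, best_score = candidate, score
    return best, best_score
```

As a toy usage, taking the artifact to be an integer `x` with `evaluate=lambda x: float(-(x - 3) ** 2)` and four proposals that each add 1, the loop accepts the first three edits and rejects the fourth, because moving from 3 to 4 lowers the score.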

Arbor also emphasizes practicality. It is not merely a paper idea or a research prototype confined to the lab. We open-source a fully runnable CLI and an Agent Skill Suite. Users can directly run the complete Arbor CLI for long-horizon automated research experiments, or load Arbor-style skills into environments such as Codex and Claude Code, enabling existing coding agents to gain more structured autonomous research capabilities.

Arbor supports long-running experiments in real codebases, disciplined dev/test evaluation, git worktree isolation, checkpoint/resume, dashboard and report generation, and one-line plugin adaptation for different task types. Our goal is to move auto-research from a conceptual vision toward a truly usable system.
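For readers unfamiliar with git worktree isolation, the following sketch shows one way an experiment could be run in its own checkout so that it never touches the main working directory. It uses only standard `git worktree add` and `git worktree remove` commands; the helper names, the branch naming, and the cleanup policy are assumptions made for illustration and are not taken from the Arbor codebase.

```python
import subprocess
from pathlib import Path
from typing import List


def worktree_add_cmd(repo: Path, path: Path, branch: str, base: str = "HEAD") -> List[str]:
    """Command that creates a new branch from `base` and checks it out at `path`."""
    return ["git", "-C", str(repo), "worktree", "add", "-b", branch, str(path), base]


def worktree_remove_cmd(repo: Path, path: Path) -> List[str]:
    """Command that deletes the worktree directory, even if it has uncommitted changes."""
    return ["git", "-C", str(repo), "worktree", "remove", "--force", str(path)]


def run_isolated(
    repo: Path, path: Path, branch: str, experiment_cmd: List[str]
) -> subprocess.CompletedProcess:
    """Run one experiment inside a throwaway worktree and always clean the directory up."""
    subprocess.run(worktree_add_cmd(repo, path, branch), check=True)
    try:
        # The experiment sees a full checkout but cannot disturb the main working tree.
        return subprocess.run(experiment_cmd, cwd=path, capture_output=True, text=True)
    finally:
        subprocess.run(worktree_remove_cmd(repo, path), check=True)
```

In this sketch the worktree directory is removed after each run, but the branch created by `-b` is left in place, so any commits the experiment made remain available for later review. Whether to keep or delete such branches is a design choice that this example does not settle.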