---
datasets:
- DeepMath-103K
language:
- en
library_name: transformers
license: apache-2.0
pipeline_tag: text-generation
tags:
- reasoning
- reinforcement-learning
- rlvr
- mcts
- math
- iclr-2026
model-index:
- name: DeepSearch-1.5B
  results:
  - task:
      type: text-generation
      name: Mathematical Reasoning
    dataset:
      name: AIME 2024
      type: text
    metrics:
    - type: avg@32
      value: 53.65
  - task:
      type: text-generation
      name: Mathematical Reasoning
    dataset:
      name: AIME 2025
      type: text
    metrics:
    - type: avg@32
      value: 35.42
  - task:
      type: text-generation
      name: Mathematical Reasoning
    dataset:
      name: AMC 2023
      type: text
    metrics:
    - type: avg@32
      value: 90.39
  - task:
      type: text-generation
      name: Mathematical Reasoning
    dataset:
      name: MATH500
      type: text
    metrics:
    - type: avg@32
      value: 92.53
  - task:
      type: text-generation
      name: Mathematical Reasoning
    dataset:
      name: Minerva
      type: text
    metrics:
    - type: avg@32
      value: 40.0
  - task:
      type: text-generation
      name: Mathematical Reasoning
    dataset:
      name: Olympiad
      type: text
    metrics:
    - type: avg@32
      value: 65.72
---

<div align="center">
<span style="font-family: default; font-size: 1.5em;">🚀 DeepSearch-1.5B</span>
</div>

**DeepSearch-1.5B🌟** is a 1.5B-parameter reasoning model trained with **Reinforcement Learning with Verifiable Rewards (RLVR)**, enhanced by **Monte Carlo Tree Search (MCTS)**.
Unlike prior approaches that restrict structured search to inference time, DeepSearch integrates MCTS *into training*, enabling systematic exploration, fine-grained credit assignment, and efficient replay buffering.

This model achieves **state-of-the-art accuracy among 1.5B reasoning models** while being **5.7× more compute-efficient** than extended RL training baselines.
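
To make the idea concrete, the selection and credit-assignment loop of a generic MCTS can be sketched as follows. This is a minimal illustration, not the released training code: the `Node` fields, the exploration constant, and the scoring rule are assumptions chosen to show standard UCT-style search.

```python
import math

# Minimal UCT-style Monte Carlo Tree Search primitives (illustrative sketch;
# not the DeepSearch implementation). Each node holds a partial reasoning
# trace; rollouts from leaves are scored by a verifier with a binary reward.

class Node:
    def __init__(self, state, parent=None):
        self.state = state          # partial reasoning trace so far
        self.parent = parent
        self.children = []
        self.visits = 0
        self.total_reward = 0.0     # sum of verifiable rewards seen below this node

    def uct_score(self, c=1.41):
        if self.visits == 0:
            return float("inf")     # always try unvisited children first
        exploit = self.total_reward / self.visits
        explore = c * math.sqrt(math.log(self.parent.visits) / self.visits)
        return exploit + explore


def select(node):
    """Descend from the root, repeatedly picking the child with the best UCT score."""
    while node.children:
        node = max(node.children, key=lambda n: n.uct_score())
    return node


def backpropagate(node, reward):
    """Propagate a verifier reward (e.g., 1.0 for a correct final answer) up to the root."""
    while node is not None:
        node.visits += 1
        node.total_reward += reward
        node = node.parent
```

Backpropagating the verifier's terminal reward through every node on the sampled path is what yields the step-level, fine-grained credit assignment described above.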

---

## Model Details

- **Developed by**: Fang Wu\*, Weihao Xuan\*, Heli Qi\*, Ximing Lu, Aaron Tu, Li Erran Li, Yejin Choi
- **Institutional affiliations**: Stanford University, University of Tokyo, RIKEN AIP, University of Washington, UC Berkeley, Amazon AWS, Columbia University
- **Paper**: [DeepSearch: Overcome the Bottleneck of Reinforcement Learning with Verifiable Rewards via Monte Carlo Tree Search](https://huggingface.co/papers/2509.25454)
- **Code**: [GitHub](https://github.com/smiles724/DeepSearch)
- **Base Model**: Nemotron-Research-Reasoning-Qwen-1.5B v2
- **Parameters**: 1.5B
- **Framework**: veRL
- **License**: Apache-2.0

---

## Quickstart

### Environment
```
pip install vllm          # vllm>=v0.8.5.post1 should work
pip install transformers  # transformers>=4.52.4 should work
```

### Using vLLM to generate
```python
from vllm import LLM, SamplingParams
from transformers import AutoTokenizer


def convert_question_to_messages(question: str):
    messages = [
        {
            "role": "user",
            "content": question
            + " Let's think step by step and output the final answer within \\boxed{}.",
        }
    ]
    return messages


model_id = "fangwu97/DeepSearch-1.5B"
tokenizer = AutoTokenizer.from_pretrained(model_id)

sampling_params = SamplingParams(
    temperature=0.6,
    top_p=0.95,
    max_tokens=32768,
)

model = LLM(
    model=model_id,
    tensor_parallel_size=1,
)
prompt = tokenizer.apply_chat_template(
    convert_question_to_messages("Find the sum of all integer bases $b>9$ for which $17_{b}$ is a divisor of $97_{b}$."),
    add_generation_prompt=True,
    tokenize=False,
)

outputs = model.generate({"prompt": prompt}, sampling_params=sampling_params, use_tqdm=False)
response = outputs[0].outputs[0].text
print(response)
```
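
Since the prompt instructs the model to place its final answer inside `\boxed{}`, a small helper can pull that answer out of the generated text for downstream scoring. This is an illustrative helper, not part of the released code:

```python
import re


def extract_boxed_answer(text: str):
    """Return the content of the last \\boxed{...} in a response, or None.

    Illustrative helper for post-processing model outputs; handles one level
    of nested braces (enough for answers like \\boxed{\\frac{1}{2}}).
    """
    matches = re.findall(r"\\boxed\{((?:[^{}]|\{[^{}]*\})*)\}", text)
    return matches[-1] if matches else None
```

Taking the last match is a common convention, since the model may box intermediate quantities before its final answer.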

## Performance

| | Benchmark | Nemotron-RR-Qwen-1.5B v2 | DeepSearch-1.5B | |
| |-----------|--------------------------|-----------------| |
| | AIME 2024 | 51.77 | **53.65** | |
| | AIME 2025 | 32.92 | **35.42** | |
| | AMC 2023 | 88.83 | **90.39** | |
| | MATH500 | 92.24 | **92.53** | |
| | Minerva | 39.75 | **40.00** | |
| | Olympiad | 64.69 | **65.72** | |
| | **Average** | 61.70 | **62.95** | |

DeepSearch improves average accuracy by **+1.25 points** over the strongest prior 1.5B model, while using **5.7× fewer GPU hours** than extended RL training.

## Training

- **Dataset**: DeepMath-103K (rigorously decontaminated)
- **Training steps**: 100
- **Search strategy**:
  - Global Frontier Selection
  - Entropy-based guidance
  - Replay buffer with solution caching
- **Hardware**: 16× NVIDIA H100 (96GB)
- **Compute**: ~330 GPU hours
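
The replay-buffer-with-solution-caching idea above can be sketched as caching verified solutions so that already-solved problems do not need a fresh tree search in later passes. The class and field names here are illustrative assumptions, not the released veRL code:

```python
class SolutionReplayBuffer:
    """Cache verified solution traces keyed by problem id.

    Illustrative sketch of replay buffering with solution caching; the
    actual DeepSearch implementation may differ.
    """

    def __init__(self):
        self._cache = {}  # problem id -> (solution trace, reward)

    def add(self, problem_id, trace, reward):
        # Keep only traces the verifier accepted (reward 1.0 for a correct answer).
        if reward >= 1.0:
            self._cache[problem_id] = (trace, reward)

    def get(self, problem_id):
        # Return a cached verified trace, or None to trigger a new MCTS search.
        return self._cache.get(problem_id, (None, None))[0]
```

A cache hit lets a solved problem skip tree search entirely, which is one way such a buffer reduces total search cost during training.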

---

## Ethical Considerations

- Positive: Reduces training costs and carbon footprint.
- Risks: Systematic exploration methods could be adapted to sensitive domains (e.g., code synthesis).
- Transparency: Full implementation and training details are released for reproducibility.

---

## Citation

```bibtex
@misc{wu2025deepsearch,
  title         = {DeepSearch: Overcome the Bottleneck of Reinforcement Learning with Verifiable Rewards via Monte Carlo Tree Search},
  author        = {Wu, Fang and Xuan, Weihao and Qi, Heli and Lu, Ximing and Tu, Aaron and Li, Li Erran and Choi, Yejin},
  year          = {2025},
  eprint        = {2509.25454},
  archivePrefix = {arXiv},
  primaryClass  = {cs.AI},
  doi           = {10.48550/arXiv.2509.25454},
}
```