re paper
Scaling RL to Long Videos
Paper
•
2507.07966
•
Published
•
159
Group Sequence Policy Optimization
Paper
•
2507.18071
•
Published
•
316
CUDA-L1: Improving CUDA Optimization via Contrastive Reinforcement Learning
Paper
•
2507.14111
•
Published
•
23
MaPPO: Maximum a Posteriori Preference Optimization with Prior Knowledge
Paper
•
2507.21183
•
Published
•
14
SAND-Math: Using LLMs to Generate Novel, Difficult and Useful Mathematics Questions and Answers
Paper
•
2507.20527
•
Published
•
6
Beyond Binary Rewards: Training LMs to Reason About Their Uncertainty
Paper
•
2507.16806
•
Published
•
6
EDGE-GRPO: Entropy-Driven GRPO with Guided Error Correction for Advantage Diversity
Paper
•
2507.21848
•
Published
•
8
Geometric-Mean Policy Optimization
Paper
•
2507.20673
•
Published
•
31
A Survey of Self-Evolving Agents: On Path to Artificial Super Intelligence
Paper
•
2507.21046
•
Published
•
82
L0: Reinforcement Learning to Become General Agents
Paper
•
2506.23667
•
Published
Stabilizing Knowledge, Promoting Reasoning: Dual-Token Constraints for RLVR
Paper
•
2507.15778
•
Published
•
20
SRFT: A Single-Stage Method with Supervised and Reinforcement Fine-Tuning for Reasoning
Paper
•
2506.19767
•
Published
•
15
R-Search: Empowering LLM Reasoning with Search via Multi-Reward Reinforcement Learning
Paper
•
2506.04185
•
Published
TreeRPO: Tree Relative Policy Optimization
Paper
•
2506.05183
•
Published
TreeRL: LLM Reinforcement Learning with On-Policy Tree Search
Paper
•
2506.11902
•
Published
Enhancing Mathematical Reasoning in LLMs by Stepwise Correction
Paper
•
2410.12934
•
Published
•
1
ProcessBench: Identifying Process Errors in Mathematical Reasoning
Paper
•
2412.06559
•
Published
•
85
tencent/WeDLM-8B-Instruct
Text Generation
•
8B
•
Updated
•
2.18k
•
282