WildReward
WildReward is a reward model trained on in-the-wild human-LLM interactions from the WildChat dataset. Unlike conventional reward models that rely on expensive human-annotated preference pairs, WildReward extracts implicit reward signals from real-world user feedback through an automated pipeline.
Model Details
WildReward is trained with ordinal regression (a CORAL-like approach) on the WildFB dataset, which contains 186k high-quality instances filtered and refined from WildChat. Each instance is labeled with one of five user-satisfaction levels (Rejection, Error Correction, Neutral Ambiguity, Positive Engagement, Satisfaction).
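For reference, the sketch below shows how a CORAL-style ordinal head is typically trained: each 5-level label is expanded into K-1 = 4 cumulative binary targets ("satisfaction exceeds level k"), and the model's four threshold logits are optimized with binary cross-entropy. The helper names are illustrative; WildReward's exact head and loss may differ.

```python
import torch
import torch.nn.functional as F

def coral_targets(labels: torch.Tensor, num_levels: int = 5) -> torch.Tensor:
    """Convert ordinal labels in {1..num_levels} into K-1 cumulative binary targets.

    Example on a 5-level scale: label 4 becomes [1, 1, 1, 0],
    i.e. "satisfaction exceeds level k" for k = 1..4.
    """
    thresholds = torch.arange(1, num_levels)           # shape (K-1,)
    return (labels.unsqueeze(1) > thresholds).float()  # shape (B, K-1)

def coral_loss(logits: torch.Tensor, labels: torch.Tensor, num_levels: int = 5) -> torch.Tensor:
    """Binary cross-entropy over the K-1 cumulative thresholds (CORAL-style)."""
    targets = coral_targets(labels, num_levels)
    return F.binary_cross_entropy_with_logits(logits, targets)

# Example: a batch of satisfaction labels and (B, 4) threshold logits.
labels = torch.tensor([1, 3, 5])   # Rejection, Neutral Ambiguity, Satisfaction
logits = torch.randn(3, 4)         # one logit per threshold
print(coral_targets(labels))       # [[0,0,0,0], [1,1,0,0], [1,1,1,1]]
print(coral_loss(logits, labels))
```

This cumulative formulation is what makes the scoring code below (summing sigmoid probabilities over four threshold logits) produce a continuous reward on the 1-5 scale.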
Key Features:
- ✅ Trained solely on in-the-wild interactions without human-annotated preference pairs
- ✅ Superior calibration with strong confidence-accuracy correlation
- ✅ Cross-sample consistency for reliable quality assessment
- ✅ Comparable performance to conventional RMs on RewardBench, RM-Bench, PPE, and JudgeBench
Training Data
WildFB Dataset (186k instances)
- Source: WildChat - large-scale human-LLM interactions
- Labeling: 5-point ordinal scale based on user satisfaction signals
- Filtering: Two-stage refinement including implicit feedback mining and refusal validation
- License: MIT
Usage
Reward Scoring
```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

model_name = "THU-KEG/WildReward-8B"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)
model.eval()

def build_text(query, response, history_str=""):
    """Format input text for reward model scoring."""
    text = f"""
# Task Description
You are an expert conversation evaluator. Your task is to judge the **User's Satisfaction** with the Assistant's response based on the conversation context.
Please rate the response on a scale of 1 to 5 integers.
# Scoring Criteria
[1] CLEARLY NEGATIVE / REJECTION
[2] CORRECTION / ERROR POINTER (Negative)
[3] NEUTRAL
[4] POSITIVE ENGAGEMENT
[5] CLEAR SATISFACTION
# Input Data
## Context (History)
{history_str}
## User Query
{query}
## Assistant Response
{response}
# Output
Based on the criteria above, please output ONLY the integer score (1, 2, 3, 4, or 5).
"""
    return text.strip()

# Prepare query and response
query = "Explain quantum computing in simple terms."
response = "Quantum computing uses quantum bits or 'qubits' that can exist in multiple states simultaneously, unlike classical bits..."

# Build formatted text
text = build_text(query, response)

# Tokenize
inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=4096).to(model.device)

# Get reward score
with torch.no_grad():
    outputs = model(**inputs)
    logits = outputs.logits

# CORAL / Ordinal Regression (output shape: 1, K-1)
probs = torch.sigmoid(logits)
reward = 1 + torch.sum(probs).item()
print(f"Reward score: {reward:.2f} (scale: 1-5)")
```
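The reward above is a continuous score: 1 plus the sum of the threshold probabilities. If a discrete 1-5 satisfaction level is needed instead (e.g., for filtering data), the standard CORAL decoding counts the thresholds whose probability exceeds 0.5. The `discretize` helper below is an illustrative sketch reusing the `logits` from the snippet above, not part of the released code:

```python
def discretize(logits: torch.Tensor, threshold: float = 0.5) -> int:
    """Map (1, K-1) threshold logits to a discrete satisfaction level in {1..K}.

    Standard CORAL decoding: count how many cumulative thresholds are passed.
    """
    probs = torch.sigmoid(logits)
    return 1 + int((probs > threshold).sum().item())

level = discretize(logits)
print(f"Discrete satisfaction level: {level} (1 = Rejection ... 5 = Satisfaction)")
```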
Deployment architecture:
- Router on port 9000 with round-robin load balancing
- Multiple workers on dedicated GPUs (ports 8004-8007)
- FP16 inference with batch processing
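For illustration only, a round-robin router over the worker ports might look like the sketch below. FastAPI/httpx, the `/score` endpoint, and the request schema are assumptions rather than the repository's actual serving API; see the GitHub repository for the real deployment scripts.

```python
# Illustrative round-robin router sketch; endpoint path and schema are assumptions.
import itertools
import httpx
from fastapi import FastAPI, Request

WORKERS = [f"http://localhost:{port}" for port in range(8004, 8008)]
worker_cycle = itertools.cycle(WORKERS)

app = FastAPI()

@app.post("/score")
async def score(request: Request):
    payload = await request.json()
    worker = next(worker_cycle)  # round-robin over the GPU workers
    async with httpx.AsyncClient(timeout=60.0) as client:
        resp = await client.post(f"{worker}/score", json=payload)
    return resp.json()

# Run with: uvicorn router:app --port 9000
```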
Performance
WildReward achieves competitive results on standard reward-model benchmarks while demonstrating superior calibration properties. When used as the reward signal for Online DPO, it significantly improves performance on mathematical reasoning, instruction following, and creative writing tasks.
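As a rough sketch of how such scores can drive Online DPO, the highest- and lowest-scoring samples for a prompt can be used as the chosen/rejected pair. The `score_response` helper below is a hypothetical wrapper around the scoring code in the Usage section, not the repository's actual pipeline:

```python
# Sketch: building an Online DPO preference pair from WildReward scores.
# `score_response(query, response) -> float` is a hypothetical wrapper around
# the scoring code shown in the Usage section.
def build_preference_pair(query: str, responses: list[str], score_response) -> dict:
    """Pick the highest- and lowest-scoring samples as (chosen, rejected)."""
    ranked = sorted(responses, key=lambda r: score_response(query, r), reverse=True)
    return {"prompt": query, "chosen": ranked[0], "rejected": ranked[-1]}
```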
Citation
License
Apache License 2.0
Note: This model card provides a brief overview. For detailed documentation on data collection, training, and deployment, please visit the GitHub repository.