# humor-r1 reward model — Qwen2.5-VL-3B + LoRA + scalar head
A Bradley-Terry preference reward model for New Yorker–style cartoon captions. Given an image of a cartoon and two candidate captions, it ranks which is funnier. The intended downstream use is as the reward signal for GRPO fine-tuning of a Qwen3-VL-2B-Thinking captioning policy.
This adapter sits on top of `Qwen/Qwen2.5-VL-3B-Instruct`. We added a scalar score head that pools the last non-pad token's hidden state through a single `nn.Linear(2048, 1, bias=False)` and trained the whole thing with the standard Bradley-Terry loss `-log σ(score(chosen) - score(rejected))`.
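In code, the loss reduces to a softplus of the score gap. A minimal sketch, where `score_chosen` and `score_rejected` stand for the head's scalar outputs on the preferred and dispreferred caption (names are illustrative):

```python
import torch
import torch.nn.functional as F

def bradley_terry_loss(score_chosen: torch.Tensor, score_rejected: torch.Tensor) -> torch.Tensor:
    # -log sigmoid(s_chosen - s_rejected) == softplus(s_rejected - s_chosen),
    # which is the numerically stable form of the pairwise BT loss.
    return F.softplus(score_rejected - score_chosen).mean()
```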
## Headline numbers
Held-out validation pairs (Bradley-Terry chosen vs. rejected, on the dataset's validation split):
| eval set size | pairwise accuracy | reward margin (mean ± std) | BT loss |
|---|---|---|---|
| 500 (training-time eval) | 0.696 | +0.96 | 0.604 |
| 2000 (post-training eval) | 0.682 | +0.85 ± 1.73 | 0.624 |
For reference: random is 0.50; 0.65 is the practical floor for "usable as an RL reward signal"; the dataset paper (Zhang et al. 2024) reports their best RM at ~0.70 pairwise accuracy on this same data.
We also tried a 60K-pair follow-up run with the same recipe; it regressed to 0.617 on the same 2K eval (extending the data from 20K to 60K pairs brings in noisier preference pairs that the model cannot fit at this LoRA capacity and a constant LR of 2e-4). The 20K checkpoint is the production version.
## Training details
- Backbone: `Qwen/Qwen2.5-VL-3B-Instruct` (frozen, LoRA-adapted)
- Adapter: LoRA r=32, α=32, `target_modules="all-linear"`, `bias="none"`
- Score head: `nn.Linear(2048, 1, bias=False)`, zero-initialized so the initial reward is 0 and the BT loss starts at exactly log(2) (see the sketch after this list)
- Pooling: last non-pad token, sequence-level (chosen and rejected scored independently, then combined in the Bradley-Terry pairwise loss)
- Optimizer: AdamW (fused), lr 2e-4, constant schedule, no warmup, `weight_decay=0.0`, `max_grad_norm=1.0`
- Effective batch size: 32 (per-device 4 × gradient accumulation 8 × 1 GPU)
- Precision: bf16, FlashAttention-2, no gradient checkpointing
- Image preprocessing: cartoons resized so the long edge is 448 px before the Qwen processor (otherwise ~750 image tokens per pair, ~2× slower)
- Data: 20,000 of 268,556 available Bradley-Terry pairs from `yguooo/newyorker_caption_ranking` (3-σ rating gap, ≤1000 pairs per contest), trained for 1 epoch (625 optimizer steps)
- Hardware: 1 × NVIDIA A100-SXM4-80GB
- Wall clock: 4054 s (≈67 min training + ≈5 min sequential validation eval)
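The zero-initialization gives a clean sanity check: with every reward at exactly 0, each pair contributes −log σ(0) = log 2 to the loss. A minimal sketch of that check (the stand-in `pooled` tensor is illustrative):

```python
import math

import torch
import torch.nn.functional as F
from torch import nn

hidden_size = 2048  # matches the nn.Linear(2048, 1) head above
score_head = nn.Linear(hidden_size, 1, bias=False)
nn.init.zeros_(score_head.weight)  # every initial reward is exactly 0

pooled = torch.randn(4, hidden_size)         # stand-in for pooled hidden states
s_chosen = score_head(pooled).squeeze(-1)    # all zeros under zero-init
s_rejected = score_head(pooled).squeeze(-1)  # all zeros under zero-init

# With all rewards at zero, -log sigmoid(0) = log 2 for every pair.
initial_loss = F.softplus(s_rejected - s_chosen).mean()
assert math.isclose(initial_loss.item(), math.log(2), rel_tol=1e-6)
```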
The training loss was still meaningfully decreasing at the end of the epoch (the running average dropped from 0.69 → 0.46 → 0.33 across the run), which suggests the model is under-trained relative to the 268K available pairs. A larger-data follow-up is in progress; we may swap this checkpoint for a better one once that lands.
## Usage
```python
from pathlib import Path
import torch
from huggingface_hub import snapshot_download
from peft import PeftModel
from PIL import Image
from torch import nn
from transformers import AutoModel, AutoProcessor
# Download artifacts
local_dir = Path(snapshot_download("Broyojo/humor-r1-rm-qwen25vl-3b-20k"))
# Materialize backbone + adapter + score head
base = AutoModel.from_pretrained(
    "Qwen/Qwen2.5-VL-3B-Instruct",
    dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",
)
backbone = PeftModel.from_pretrained(base, local_dir / "backbone_adapter")
score_head = nn.Linear(base.config.hidden_size, 1, bias=False).to(torch.bfloat16)
score_head.load_state_dict(torch.load(local_dir / "reward_head.pt", map_location="cpu"))
processor = AutoProcessor.from_pretrained(local_dir / "processor")
backbone.eval().to("cuda")
score_head.eval().to("cuda")
@torch.no_grad()
def score(image: Image.Image, prompt: str, caption: str) -> float:
    text = (
        f"{prompt}\n\nCandidate caption: {caption}\n\n"
        "Judge how funny this caption is for the cartoon."
    )
    messages = [
        {"role": "user", "content": [{"type": "image"}, {"type": "text", "text": text}]}
    ]
    chat = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=False)
    inputs = processor(text=[chat], images=[image], return_tensors="pt", padding=True).to("cuda")
    out = backbone(**inputs, return_dict=True)
    last_hidden = out.last_hidden_state
    # Pool the hidden state of the last non-pad token (matches training-time pooling).
    last_idx = inputs["attention_mask"].long().sum(dim=1) - 1
    pooled = last_hidden[torch.arange(last_hidden.size(0), device=last_hidden.device), last_idx]
    return score_head(pooled.to(score_head.weight.dtype)).item()
```
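For best results, match the training-time image preprocessing (long edge resized to 448 px) before calling `score`. A minimal sketch; `resize_long_edge`, the file name, and the prompt/caption strings are illustrative, not part of this repo:

```python
from PIL import Image

def resize_long_edge(image: Image.Image, target: int = 448) -> Image.Image:
    # Match training-time preprocessing: scale so the longer side is `target` px.
    w, h = image.size
    scale = target / max(w, h)
    return image.resize((round(w * scale), round(h * scale)), Image.LANCZOS)

cartoon = resize_long_edge(Image.open("cartoon.png").convert("RGB"))
desc = "Two dogs are sitting at a bar."  # hypothetical scene prompt
s_a = score(cartoon, desc, "I told you this place was ruff.")
s_b = score(cartoon, desc, "Nice weather today.")
print(s_a > s_b)  # True if the model ranks caption A as funnier
```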
A higher score means "funnier"; only the difference between two scores is calibrated, not the absolute value (Bradley-Terry shift-invariance).
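To turn two scores into a preference probability, apply the Bradley-Terry link to their difference. A quick illustration with made-up scores:

```python
import math

def win_probability(score_a: float, score_b: float) -> float:
    # P(A preferred over B) = sigmoid(score_a - score_b);
    # adding a constant to both scores leaves this unchanged.
    return 1.0 / (1.0 + math.exp(score_b - score_a))

print(win_probability(1.3, 0.4))      # ~0.71
print(win_probability(101.3, 100.4))  # same margin, same probability
```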
## Files in this repo
- `backbone_adapter/` — LoRA weights and PEFT config on top of Qwen2.5-VL-3B
- `processor/` — the Qwen2.5-VL processor (image processor + tokenizer + chat template)
- `reward_head.pt` — `state_dict` of the scalar `nn.Linear(2048, 1)` score head
- `reward_model_config.json` — base model id and score head shape
## Limitations
- Trained on a New Yorker–specific humor distribution; OOD on other cartoons is unverified.
- Pairs were filtered to a 3-σ rating gap, so the RM is well-calibrated on easy preferences but its accuracy on subtle ones is lower.
- Bradley-Terry rewards are shift-invariant; the absolute score has no meaning beyond ranking.
- The model is 3B + LoRA; if you need a stronger reward signal the same recipe scales straightforwardly to Qwen2.5-VL-7B/72B.