humor-r1 reward model — Qwen2.5-VL-3B + LoRA + scalar head

A Bradley-Terry preference reward model for New Yorker–style cartoon captions. Given a cartoon image and a candidate caption, it outputs a scalar humor score; comparing the scores of two captions ranks which is funnier. The intended downstream use is as the reward signal for GRPO fine-tuning of a Qwen3-VL-2B-Thinking captioning policy.

This adapter sits on top of Qwen/Qwen2.5-VL-3B-Instruct. We added a scalar score head, a single nn.Linear(2048, 1, bias=False) applied to the hidden state of the last non-pad token, and trained the LoRA adapter and head jointly with the standard Bradley-Terry loss -log σ(score(chosen) - score(rejected)).
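The loss above can be sketched in a few lines. This is a generic Bradley-Terry step, not the repo's actual training code:

```python
import torch
import torch.nn.functional as F


def bradley_terry_loss(score_chosen: torch.Tensor, score_rejected: torch.Tensor) -> torch.Tensor:
    # -log σ(s_chosen - s_rejected), averaged over the batch.
    # logsigmoid is numerically stabler than log(sigmoid(...)).
    return -F.logsigmoid(score_chosen - score_rejected).mean()


# With a zero-initialized head every score starts at 0,
# so the initial loss is exactly log(2) ≈ 0.693.
loss0 = bradley_terry_loss(torch.zeros(4), torch.zeros(4))
```

The larger the margin between chosen and rejected scores, the smaller the loss, which is why the mean reward margin is reported alongside accuracy below.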

Headline numbers

Held-out validation pairs (Bradley-Terry chosen vs rejected, dataset's validation split):

eval set                          pairwise accuracy   reward margin (mean ± std)   BT loss
500 pairs (training-time eval)    0.696               +0.96                        0.604
2000 pairs (post-training eval)   0.682               +0.85 ± 1.73                 0.624

For reference: random is 0.50; 0.65 is the practical floor for "usable as an RL reward signal"; the dataset paper (Zhang et al. 2024) reports their best RM at ~0.70 pairwise accuracy on this same data.

We also tried a 60K-pair follow-up run with the same recipe; it regressed to 0.617 on the same 2K eval, likely because the 20K → 60K data extension pulls in noisier preference pairs that this LoRA capacity and a constant LR of 2e-4 cannot fit. The 20K checkpoint is the production version.

Training details

  • Backbone: Qwen/Qwen2.5-VL-3B-Instruct (frozen, LoRA-adapted)
  • Adapter: LoRA r=32, α=32, target_modules="all-linear", bias="none"
  • Score head: nn.Linear(2048, 1, bias=False), zero-initialized so the initial reward is 0 and BT loss starts at exactly log(2)
  • Pooling: last non-pad token, sequence-level (chosen and rejected scored independently and combined in a Bradley-Terry pairwise loss)
  • Optimizer: AdamW (fused), lr 2e-4, constant schedule, no warmup, weight_decay=0.0, max_grad_norm=1.0
  • Effective batch size: 32 (per-device 4 × accum 8 × 1 GPU)
  • Precision: bf16, FlashAttention-2, no gradient checkpointing
  • Image preprocessing: cartoons resized so the long edge is 448 px before the Qwen processor (otherwise ~750 image tokens per pair, ~2× slower)
  • Data: 20 000 of 268 556 available Bradley-Terry pairs from yguooo/newyorker_caption_ranking (3-σ rating gap, ≤1000 pairs/contest), trained 1 epoch (625 optimizer steps)
  • Hardware: 1 × NVIDIA A100-SXM4-80GB
  • Wall clock: 4054 s (≈68 min) total, including ≈5 min of sequential val eval
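The "last non-pad token" pooling from the recipe above can be sketched as follows. This is a generic implementation assuming right padding, not the repo's exact code:

```python
import torch


def pool_last_token(hidden: torch.Tensor, attention_mask: torch.Tensor) -> torch.Tensor:
    """Select the hidden state of the last non-pad token per sequence.

    hidden:         (batch, seq_len, hidden_size)
    attention_mask: (batch, seq_len), 1 for real tokens, 0 for right padding
    """
    last_idx = attention_mask.long().sum(dim=1) - 1  # (batch,)
    batch_idx = torch.arange(hidden.size(0), device=hidden.device)
    return hidden[batch_idx, last_idx]  # (batch, hidden_size)


# Toy batch: sequence 0 has one pad token, sequence 1 has none.
hidden = torch.arange(2 * 4 * 3, dtype=torch.float32).reshape(2, 4, 3)
mask = torch.tensor([[1, 1, 1, 0], [1, 1, 1, 1]])
pooled = pool_last_token(hidden, mask)
```

Chosen and rejected captions are each pooled this way and scored independently; only the pairwise loss couples them.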

The training loss curve was still meaningfully decreasing at end of epoch (running average dropped from 0.69 → 0.46 → 0.33 across the run), which suggests this is under-trained relative to the 268K available pairs. A larger-data follow-up is in progress; we may swap this checkpoint for something better once that lands.

Usage

from pathlib import Path

import torch
from huggingface_hub import snapshot_download
from peft import PeftModel
from PIL import Image
from torch import nn
from transformers import AutoModel, AutoProcessor

# Download artifacts
local_dir = Path(snapshot_download("Broyojo/humor-r1-rm-qwen25vl-3b-20k"))

# Materialize backbone + adapter + score head
base = AutoModel.from_pretrained(
    "Qwen/Qwen2.5-VL-3B-Instruct",
    dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",
)
backbone = PeftModel.from_pretrained(base, local_dir / "backbone_adapter")
score_head = nn.Linear(base.config.hidden_size, 1, bias=False).to(torch.bfloat16)
score_head.load_state_dict(torch.load(local_dir / "reward_head.pt", map_location="cpu"))

processor = AutoProcessor.from_pretrained(local_dir / "processor")

backbone.eval().to("cuda")
score_head.eval().to("cuda")


@torch.no_grad()
def score(image: Image.Image, prompt: str, caption: str) -> float:
    text = (
        f"{prompt}\n\nCandidate caption: {caption}\n\n"
        "Judge how funny this caption is for the cartoon."
    )
    messages = [
        {"role": "user", "content": [{"type": "image"}, {"type": "text", "text": text}]}
    ]
    chat = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=False)
    inputs = processor(text=[chat], images=[image], return_tensors="pt", padding=True).to("cuda")
    out = backbone(**inputs, return_dict=True)
    last_hidden = out.last_hidden_state
    last_idx = inputs["attention_mask"].long().sum(dim=1) - 1
    pooled = last_hidden[torch.arange(last_hidden.size(0), device=last_hidden.device), last_idx]
    # .item() already returns a Python float
    return score_head(pooled.to(score_head.weight.dtype)).item()

A higher score means "funnier"; only the difference between two scores is calibrated, not the absolute value (Bradley-Terry shift-invariance).
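Because only differences are calibrated, a convenient way to compare two captions is to convert the score gap into a Bradley-Terry win probability σ(s_A − s_B). This small helper is ours, not part of the repo:

```python
import math


def win_probability(score_a: float, score_b: float) -> float:
    # P(caption A preferred over caption B) under the Bradley-Terry model.
    return 1.0 / (1.0 + math.exp(score_b - score_a))


# Equal scores give 50/50, and shifting both scores by the same
# constant changes nothing (shift-invariance in action).
p = win_probability(1.3, 0.4)
```

For example, `win_probability(score(img, prompt, cap_a), score(img, prompt, cap_b))` gives the model's probability that caption A is funnier.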

Files in this repo

  • backbone_adapter/ — LoRA weights and PEFT config on top of Qwen2.5-VL-3B
  • processor/ — the Qwen2.5-VL processor (image processor + tokenizer + chat template)
  • reward_head.pt — state_dict of the scalar nn.Linear(2048, 1, bias=False) score head
  • reward_model_config.json — base model id and score head shape

Limitations

  • Trained on a New Yorker–specific humor distribution; OOD on other cartoons is unverified.
  • Pairs were filtered to a 3-σ rating gap, so the RM is well-calibrated on easy preferences but its accuracy on subtle ones is lower.
  • Bradley-Terry rewards are shift-invariant; the absolute score has no meaning beyond ranking.
  • The model is 3B + LoRA; if you need a stronger reward signal the same recipe scales straightforwardly to Qwen2.5-VL-7B/72B.