Wan2.1-T2V LongCat LoRA - Step 1000

This is a LoRA adapter for Wan2.1-T2V-1.3B fine-tuned using Group Relative Policy Optimization (GRPO) with multi-reward optimization.

Model Details

  • Base Model: Wan2.1-T2V-1.3B-Diffusers
  • Training Method: GRPO (Group Relative Policy Optimization)
  • Training Steps: 1000
  • LoRA Rank: 128
  • LoRA Alpha: 64
  • Video Resolution: 480×832 pixels, 81 frames (~5 seconds @ 16 fps)
  • Framework: GenRL

Training Configuration

Reward Functions

This model was optimized using a weighted combination of four reward functions:

| Reward Function            | Weight | Purpose                                   |
|----------------------------|--------|-------------------------------------------|
| HPSv3 General              | 1.0    | General aesthetic quality assessment      |
| HPSv3 Percentile           | 1.0    | Percentile-based aesthetic normalization  |
| VideoAlign Motion Quality  | 1.0    | Video motion coherence and quality        |
| VideoAlign Text Alignment  | 1.0    | Text-to-video semantic alignment          |
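The weighted combination can be sketched as a plain weighted sum over the per-sample reward signals. This is a minimal illustration; the function and key names below are illustrative, not GenRL's actual API:

```python
# Illustrative sketch: combine per-sample rewards using the table's weights.
# All names here are hypothetical, not GenRL identifiers.
def combine_rewards(rewards: dict, weights: dict) -> float:
    """Weighted sum of individual reward signals."""
    return sum(weights[name] * rewards[name] for name in weights)

weights = {
    "hpsv3_general": 1.0,
    "hpsv3_percentile": 1.0,
    "videoalign_mq": 1.0,
    "videoalign_ta": 1.0,
}
# Toy per-video scores
rewards = {"hpsv3_general": 0.8, "hpsv3_percentile": 0.6,
           "videoalign_mq": 0.7, "videoalign_ta": 0.9}
total = combine_rewards(rewards, weights)  # 0.8 + 0.6 + 0.7 + 0.9 = 3.0
```

With all weights at 1.0 this reduces to a simple sum; unequal weights would trade off aesthetics against motion quality and text alignment.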

Hardware & Training Setup

  • Hardware: 8 nodes × 8 A100/H100 GPUs (64 GPUs total)
  • Distributed Training: FSDP (Fully Sharded Data Parallel)
    • Sharding Strategy: full_shard
    • Activation Checkpointing: Enabled
    • Mixed Precision: bfloat16
  • Training Batch Size: 4 per GPU
  • Gradient Accumulation: Auto-computed
  • Learning Rate: 1e-4
  • Optimizer: AdamW
    • β1: 0.9
    • β2: 0.999
    • Weight Decay: 1e-4
    • Epsilon: 1e-8
  • EMA: Enabled
    • Decay: 0.9
    • Update Interval: 8 steps
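The EMA settings above correspond to the standard exponential-moving-average update applied every 8 optimizer steps. A minimal sketch (scalar weights for illustration; not GenRL's code):

```python
# Minimal EMA sketch: decay 0.9, update every 8 optimizer steps.
def ema_update(ema: float, param: float, decay: float = 0.9) -> float:
    """Standard EMA: keep `decay` of the average, blend in the new value."""
    return decay * ema + (1.0 - decay) * param

ema, param = 0.0, 1.0
for step in range(1, 33):   # 32 optimizer steps
    if step % 8 == 0:       # update interval: 8
        ema = ema_update(ema, param)
# 4 EMA updates toward param=1.0 give ema = 1 - 0.9**4 = 0.3439
```

A decay of 0.9 is fairly aggressive (short memory); combined with the 8-step interval, the EMA weights track the training weights closely.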

GRPO Hyperparameters

  • Beta (KL penalty): 3e-4
  • Clip Range: 1e-3
  • Advantage Clipping: 5.0
  • Max Gradient Norm: 1.0
  • Timestep Fraction: 0.99
  • Per-Prompt Stat Tracking: Enabled
  • Weight Advantages: Enabled

Sampling Configuration

  • Sampling (Denoising) Steps: 16
  • Guidance Scale: 4.5
  • SDE Type: flow_sde
  • SDE Window Size: 1
  • SDE Window Range: [0, 6]
  • Diffusion Clipping: Enabled (value: 0.45)
  • Videos per Prompt: 4
  • Same Latent: Enabled
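One plausible reading of the SDE window settings: with 16 denoising steps per rollout, a window of size 1 drawn from step indices [0, 6) selects which step receives stochastic (SDE) treatment while the rest remain deterministic. This interpretation is an assumption, not confirmed by the card; the sketch is purely illustrative:

```python
# Hypothetical sketch of SDE window selection (window size 1, range [0, 6)).
# Interpretation of these settings is assumed, not taken from GenRL.
import random

def pick_sde_steps(window_size=1, window_range=(0, 6), seed=0):
    """Pick a contiguous window of denoising steps to treat stochastically."""
    rng = random.Random(seed)
    start = rng.randrange(window_range[0], window_range[1] - window_size + 1)
    return list(range(start, start + window_size))

steps = pick_sde_steps()  # e.g. [3] -- a single early denoising step
```

Restricting stochasticity to early steps keeps exploration where it matters most for layout and motion, while later steps refine deterministically.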

LoRA Configuration

{
  "r": 128,
  "lora_alpha": 64,
  "target_modules": [
    "to_k",
    "to_q",
    "to_v",
    "to_out.0",
    "net.0.proj",
    "net.2"
  ],
  "lora_dropout": 0.0,
  "bias": "none",
  "init_lora_weights": "gaussian"
}
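Numerically, a LoRA layer adds a low-rank update delta_W = (lora_alpha / r) · B @ A to each target weight; with r=128 and alpha=64 the effective scale is 0.5. A toy sketch of that arithmetic (tiny rank-1 matrices for illustration; the adapter's real matrices are rank 128):

```python
# Toy illustration of the LoRA update delta_W = (alpha / r) * B @ A.
# Real shapes here are much larger; this uses rank 1 for readability.
def lora_delta(A, B, r, alpha):
    """A: r x in_dim, B: out_dim x r -> scaled low-rank update matrix."""
    scale = alpha / r
    out_dim, in_dim = len(B), len(A[0])
    return [[scale * sum(B[o][k] * A[k][i] for k in range(r))
             for i in range(in_dim)] for o in range(out_dim)]

A = [[1.0, 2.0]]        # r x in_dim
B = [[2.0], [4.0]]      # out_dim x r
delta = lora_delta(A, B, r=1, alpha=0.5)  # scale 0.5, same as 64/128
# delta == [[1.0, 2.0], [2.0, 4.0]]
```

The `target_modules` list covers the attention projections (`to_q`, `to_k`, `to_v`, `to_out.0`) and the feed-forward projections (`net.0.proj`, `net.2`), so the adapter touches both attention and MLP paths.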

Usage

Installation

pip install diffusers transformers accelerate torch

Inference Code

import torch
from diffusers import WanPipeline
from diffusers.utils import export_to_video

# Load base model
pipe = WanPipeline.from_pretrained(
    "Wan-AI/Wan2.1-T2V-1.3B-Diffusers",
    torch_dtype=torch.bfloat16
)
pipe.to("cuda")

# Load LoRA weights
pipe.load_lora_weights("lightx2v/Wan2.1-T2V-1.3B-longcat-step1000")

# Generate video
prompt = "A golden retriever playing in a sunny park, high quality, detailed"
video = pipe(
    prompt=prompt,
    height=480,
    width=832,
    num_frames=81,
    num_inference_steps=50,
    guidance_scale=4.5,
    generator=torch.Generator().manual_seed(42)
).frames[0]

# Save video
export_to_video(video, "output.mp4", fps=16)
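To target other durations, note that Wan-style pipelines typically expect `num_frames` of the form 4k + 1 (a common temporal-VAE stride; 81 = 4·20 + 1 fits this). The 4k + 1 constraint is an assumption based on the pipeline family, not stated in this card, and the helper below is illustrative:

```python
# Hypothetical helper: nearest valid frame count for a target duration,
# assuming the common Wan-style 4k + 1 frame constraint.
def frames_for_duration(seconds: float, fps: int = 16) -> int:
    target = round(seconds * fps)
    k = round((target - 1) / 4)
    return max(1, 4 * k + 1)

frames_for_duration(5.0)  # -> 81, matching this model's training setup
```

Since this checkpoint was trained on 81-frame sequences, deviating far from that length may degrade quality.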


Performance

This checkpoint at 1000 training steps shows improvements over the base model in:

  • ✅ Enhanced aesthetic quality (HPSv3)
  • ✅ Improved motion coherence (VideoAlign MQ)
  • ✅ Better text-video alignment (VideoAlign TA)
  • ✅ More stable and consistent video generation

Note: This mid-training checkpoint offers a good balance between quality and training time. For the best performance, consider checkpoint-1500.

Training Details

Dataset

  • Prompt Dataset: Filtered high-quality text prompts
  • Prompts per Epoch: Configurable batches
  • Evaluation Frequency: Every 100 steps

Optimization Strategy

  • Loss Reweighting: LongCat strategy
  • Advantage Computation: Per-reward advantages with weighting
  • Inner Epochs: 1
  • CFG Training: Enabled

Limitations

  • Optimized for 480×832 resolution; other resolutions may yield suboptimal results
  • Trained on 81-frame sequences (~5s @ 16fps)
  • Performance depends on prompt quality and guidance scale
  • May still exhibit some artifacts compared to fully converged checkpoints

Training Framework

This model was trained using GenRL, a scalable reinforcement learning framework for visual generation.

License

This model is released under the MIT License.

Citation

If you use this model in your research, please cite:

@misc{genrl,
  author = {GenRL Contributors},
  title = {GenRL: Reinforcement Learning Framework for Visual Generation},
  year = {2026},
  publisher = {GitHub},
  journal = {GitHub repository},
  howpublished = {\url{https://github.com/ModelTC/GenRL}},
}