Wan2.1-T2V LongCat LoRA - Step 1000

This is a LoRA adapter for Wan2.1-T2V-1.3B fine-tuned using Group Relative Policy Optimization (GRPO) with multi-reward optimization.

Model Details

  • Base Model: Wan2.1-T2V-1.3B-Diffusers
  • Training Method: GRPO (Group Relative Policy Optimization)
  • Training Steps: 1000
  • LoRA Rank: 128
  • LoRA Alpha: 64
  • Video Resolution: 480×832 pixels, 81 frames (~5 seconds @ 16 fps)
  • Framework: GenRL

Training Configuration

Reward Functions

This model was optimized using a weighted combination of four reward functions:

| Reward Function            | Weight | Purpose                                   |
|----------------------------|--------|-------------------------------------------|
| HPSv3 General              | 1.0    | General aesthetic quality assessment      |
| HPSv3 Percentile           | 1.0    | Percentile-based aesthetic normalization  |
| VideoAlign Motion Quality  | 1.0    | Video motion coherence and quality        |
| VideoAlign Text Alignment  | 1.0    | Text-to-video semantic alignment          |
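The weighted combination can be sketched as a plain weighted sum over the per-sample reward signals. This is a minimal illustration; the function and key names below are illustrative, not GenRL's actual API:

```python
# Illustrative sketch: combine per-sample rewards using the table's weights.
# All names here are hypothetical, not GenRL identifiers.
def combine_rewards(rewards: dict, weights: dict) -> float:
    """Weighted sum of individual reward signals."""
    return sum(weights[name] * rewards[name] for name in weights)

weights = {
    "hpsv3_general": 1.0,
    "hpsv3_percentile": 1.0,
    "videoalign_mq": 1.0,
    "videoalign_ta": 1.0,
}
# Toy per-video scores
rewards = {"hpsv3_general": 0.8, "hpsv3_percentile": 0.6,
           "videoalign_mq": 0.7, "videoalign_ta": 0.9}
total = combine_rewards(rewards, weights)  # 0.8 + 0.6 + 0.7 + 0.9 = 3.0
```

With all weights at 1.0 this reduces to a simple sum; unequal weights would trade off aesthetics against motion quality and text alignment.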

Hardware & Training Setup

  • Hardware: 8 nodes × 8 A100/H100 GPUs (64 GPUs total)
  • Distributed Training: FSDP (Fully Sharded Data Parallel)
    • Sharding Strategy: full_shard
    • Activation Checkpointing: Enabled
    • Mixed Precision: bfloat16
  • Training Batch Size: 4 per GPU
  • Gradient Accumulation: Auto-computed
  • Learning Rate: 1e-4
  • Optimizer: AdamW
    • β1: 0.9
    • β2: 0.999
    • Weight Decay: 1e-4
    • Epsilon: 1e-8
  • EMA: Enabled
    • Decay: 0.9
    • Update Interval: 8 steps
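The EMA settings above correspond to the standard exponential-moving-average update applied every 8 optimizer steps. A minimal sketch (scalar weights for illustration; not GenRL's code):

```python
# Minimal EMA sketch: decay 0.9, update every 8 optimizer steps.
def ema_update(ema: float, param: float, decay: float = 0.9) -> float:
    """Standard EMA: keep `decay` of the average, blend in the new value."""
    return decay * ema + (1.0 - decay) * param

ema, param = 0.0, 1.0
for step in range(1, 33):   # 32 optimizer steps
    if step % 8 == 0:       # update interval: 8
        ema = ema_update(ema, param)
# 4 EMA updates toward param=1.0 give ema = 1 - 0.9**4 = 0.3439
```

A decay of 0.9 is fairly aggressive (short memory); combined with the 8-step interval, the EMA weights track the training weights closely.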

GRPO Hyperparameters

  • Beta (KL penalty): 3e-4
  • Clip Range: 1e-3
  • Advantage Clipping: 5.0
  • Max Gradient Norm: 1.0
  • Timestep Fraction: 0.99
  • Per-Prompt Stat Tracking: Enabled
  • Weight Advantages: Enabled

Sampling Configuration

  • Sampling (Denoising) Steps: 16
  • Guidance Scale: 4.5
  • SDE Type: flow_sde
  • SDE Window Size: 1
  • SDE Window Range: [0, 6]
  • Diffusion Clipping: Enabled (value: 0.45)
  • Videos per Prompt: 4
  • Same Latent: Enabled
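One plausible reading of the SDE window settings: with 16 denoising steps per rollout, a window of size 1 drawn from step indices [0, 6) selects which step receives stochastic (SDE) treatment while the rest remain deterministic. This interpretation is an assumption, not confirmed by the card; the sketch is purely illustrative:

```python
# Hypothetical sketch of SDE window selection (window size 1, range [0, 6)).
# Interpretation of these settings is assumed, not taken from GenRL.
import random

def pick_sde_steps(window_size=1, window_range=(0, 6), seed=0):
    """Pick a contiguous window of denoising steps to treat stochastically."""
    rng = random.Random(seed)
    start = rng.randrange(window_range[0], window_range[1] - window_size + 1)
    return list(range(start, start + window_size))

steps = pick_sde_steps()  # e.g. [3] -- a single early denoising step
```

Restricting stochasticity to early steps keeps exploration where it matters most for layout and motion, while later steps refine deterministically.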

LoRA Configuration

{
  "r": 128,
  "lora_alpha": 64,
  "target_modules": [
    "to_k",
    "to_q",
    "to_v",
    "to_out.0",
    "net.0.proj",
    "net.2"
  ],
  "lora_dropout": 0.0,
  "bias": "none",
  "init_lora_weights": "gaussian"
}
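Numerically, a LoRA layer adds a low-rank update delta_W = (lora_alpha / r) · B @ A to each target weight; with r=128 and alpha=64 the effective scale is 0.5. A toy sketch of that arithmetic (tiny rank-1 matrices for illustration; the adapter's real matrices are rank 128):

```python
# Toy illustration of the LoRA update delta_W = (alpha / r) * B @ A.
# Real shapes here are much larger; this uses rank 1 for readability.
def lora_delta(A, B, r, alpha):
    """A: r x in_dim, B: out_dim x r -> scaled low-rank update matrix."""
    scale = alpha / r
    out_dim, in_dim = len(B), len(A[0])
    return [[scale * sum(B[o][k] * A[k][i] for k in range(r))
             for i in range(in_dim)] for o in range(out_dim)]

A = [[1.0, 2.0]]        # r x in_dim
B = [[2.0], [4.0]]      # out_dim x r
delta = lora_delta(A, B, r=1, alpha=0.5)  # scale 0.5, same as 64/128
# delta == [[1.0, 2.0], [2.0, 4.0]]
```

The `target_modules` list covers the attention projections (`to_q`, `to_k`, `to_v`, `to_out.0`) and the feed-forward projections (`net.0.proj`, `net.2`), so the adapter touches both attention and MLP paths.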

Usage

Installation

pip install diffusers transformers accelerate torch

Inference Code

import torch
from diffusers import WanPipeline
from diffusers.utils import export_to_video

# Load base model
pipe = WanPipeline.from_pretrained(
    "Wan-AI/Wan2.1-T2V-1.3B-Diffusers",
    torch_dtype=torch.bfloat16
)
pipe.to("cuda")

# Load LoRA weights
pipe.load_lora_weights("lightx2v/Wan2.1-T2V-1.3B-longcat-step1000")

# Generate video
prompt = "A golden retriever playing in a sunny park, high quality, detailed"
video = pipe(
    prompt=prompt,
    height=480,
    width=832,
    num_frames=81,
    num_inference_steps=50,
    guidance_scale=4.5,
    generator=torch.Generator().manual_seed(42)
).frames[0]

# Save video
export_to_video(video, "output.mp4", fps=16)
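To target other durations, note that Wan-style pipelines typically expect `num_frames` of the form 4k + 1 (a common temporal-VAE stride; 81 = 4·20 + 1 fits this). The 4k + 1 constraint is an assumption based on the pipeline family, not stated in this card, and the helper below is illustrative:

```python
# Hypothetical helper: nearest valid frame count for a target duration,
# assuming the common Wan-style 4k + 1 frame constraint.
def frames_for_duration(seconds: float, fps: int = 16) -> int:
    target = round(seconds * fps)
    k = round((target - 1) / 4)
    return max(1, 4 * k + 1)

frames_for_duration(5.0)  # -> 81, matching this model's training setup
```

Since this checkpoint was trained on 81-frame sequences, deviating far from that length may degrade quality.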


Performance

This checkpoint at 1000 training steps shows improvements over the base model in:

  • ✅ Enhanced aesthetic quality (HPSv3)
  • ✅ Improved motion coherence (VideoAlign MQ)
  • ✅ Better text-video alignment (VideoAlign TA)
  • ✅ More stable and consistent video generation

Note: This mid-training checkpoint offers a good balance between quality and training time. For the best performance, consider checkpoint-1500.

Training Details

Dataset

  • Prompt Dataset: Filtered high-quality text prompts
  • Prompts per Epoch: Configurable batches
  • Evaluation Frequency: Every 100 steps

Optimization Strategy

  • Loss Reweighting: LongCat strategy
  • Advantage Computation: Per-reward advantages with weighting
  • Inner Epochs: 1
  • CFG Training: Enabled

Limitations

  • Optimized for 480×832 resolution; other resolutions may yield suboptimal results
  • Trained on 81-frame sequences (~5s @ 16fps)
  • Performance depends on prompt quality and guidance scale
  • May still exhibit some artifacts compared to fully converged checkpoints

Training Framework

This model was trained using GenRL, a scalable reinforcement learning framework for visual generation.

License

This model is released under the MIT License.

Citation

If you use this model in your research, please cite:

@misc{genrl,
  author = {GenRL Contributors},
  title = {GenRL: Reinforcement Learning Framework for Visual Generation},
  year = {2026},
  publisher = {GitHub},
  journal = {GitHub repository},
  howpublished = {\url{https://github.com/ModelTC/GenRL}},
}