This model was published in HF papers on 2024-02-20 and contributed to Hugging Face Transformers on 2026-06-19.

VideoPrism

The VideoPrism model was proposed in the paper VideoPrism: A Foundational Visual Encoder for Video Understanding by Google DeepMind (blog post).

VideoPrism is a general-purpose video encoder that tackles diverse video understanding tasks with a single frozen model. The model is pretrained on a large-scale heterogeneous corpus containing 36M high-quality video-caption pairs and 582M video clips with noisy parallel text (e.g., ASR transcripts). The pretraining approach improves upon masked autoencoding through global-local distillation of semantic video embeddings and a token shuffling scheme, enabling the model to focus primarily on the video modality while leveraging text associated with videos. VideoPrism achieves state-of-the-art performance on 31 out of 33 video understanding benchmarks across four broad task groups, from web video question answering to computer vision for science.

You can find all original VideoPrism checkpoints under the VideoPrism collection.

Notes:

VideoPrism uses a factorized spatio-temporal encoder architecture, processing videos through separate spatial and temporal transformers.
The model supports video-text contrastive learning through VideoPrismClipModel, which combines a video encoder and a text encoder. VideoPrismConfig must be used with this model.
For video classification tasks, use VideoPrismForVideoClassification which adds a classification head on top of the video encoder. VideoPrismVisionConfig must be used with this model.
The vision encoder can be used standalone via VideoPrismVisionModel for extracting video features. VideoPrismVisionConfig must be used with this model.
The default input resolution is 288x288 pixels, with 16 frames per video clip for the base models and 8 frames for the large models. Set interpolate_pos_encoding=True to run the models at a custom resolution or with a different number of frames per clip.
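
The factorized layout described in the first note can be illustrated with plain array reshapes. The sketch below is not the library's implementation: it assumes an 18x18 patch size (so that a 288x288 frame yields a 16x16 patch grid) and uses a tiny hidden size for the demo, neither of which is stated above. It only shows which axis each stage would treat as its sequence dimension.

```python
import numpy as np

# Frames and resolution follow the defaults in the notes above.
# patch_size and hidden_size are assumptions made for this sketch only.
num_frames, height, width = 16, 288, 288
patch_size = 18   # assumed, not stated in this document
hidden_size = 8   # kept tiny so the demo is cheap to run

patches_per_frame = (height // patch_size) * (width // patch_size)

# Stand-in for patch embeddings: (batch, frames, patches, hidden)
tokens = np.zeros((1, num_frames, patches_per_frame, hidden_size), dtype=np.float32)

# Spatial stage: every frame becomes its own sequence of patch tokens.
spatial_view = tokens.reshape(-1, patches_per_frame, hidden_size)

# Temporal stage: every patch position becomes its own sequence of frame tokens.
temporal_view = tokens.transpose(0, 2, 1, 3).reshape(-1, num_frames, hidden_size)

print(spatial_view.shape)   # (16, 256, 8)
print(temporal_view.shape)  # (256, 16, 8)
```

Under these assumptions, attention runs over sequences of length 256 and 16 in turn rather than over a single joint sequence of 4096 tokens, which is the usual motivation for a factorized design.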

This model was contributed by MHRDYN7 and reviewed by vasqu & zucchini-nlp. The original code can be found here.

Usage example

The snippet below shows how to load the VideoPrismVisionModel for feature extraction using the AutoModel class.

import torch
from transformers import AutoModel, AutoVideoProcessor

processor = AutoVideoProcessor.from_pretrained("google/videoprism-base-f16r288", revision="refs/pr/4")
model = AutoModel.from_pretrained(
    "google/videoprism-base-f16r288",
    revision="refs/pr/4",
    device_map="auto",
    # use "flash_attention_2" for faster inference on supported hardware
    # attn_implementation="flash_attention_2" 
)

video_url = "https://huggingface.co/datasets/nateraw/kinetics-mini/resolve/main/val/archery/-Qz25rXdMjE_000014_000024.mp4"

# With do_sample_frames=True, the processor samples 16 frames (base checkpoints) or 8 frames (large checkpoints) by default.
processed_video_inputs = processor(videos=[video_url], return_metadata=True, do_sample_frames=True)
video_metadata = processed_video_inputs["video_metadata"]
video_inputs = processed_video_inputs["pixel_values_videos"].to(model.device)
# Gradients are not needed for feature extraction.
with torch.no_grad():
    outputs = model(video_inputs)

# VideoPrism encoder outputs
encoder_outputs = outputs.last_hidden_state
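
The do_sample_frames comment in the snippet above refers to reducing a longer clip to a fixed number of frames. A minimal sketch of one common strategy, evenly spaced indices, is shown below. `sample_frame_indices` is a hypothetical helper written for illustration, and the processor's actual sampling logic may differ.

```python
import numpy as np


def sample_frame_indices(total_frames: int, num_frames: int) -> list[int]:
    """Return `num_frames` evenly spaced indices in [0, total_frames - 1].

    Hypothetical helper for illustration; not part of the Transformers API.
    """
    if total_frames <= 0 or num_frames <= 0:
        raise ValueError("total_frames and num_frames must be positive")
    # linspace includes both endpoints; rounding maps positions to valid indices.
    positions = np.linspace(0, total_frames - 1, num=num_frames)
    return [int(i) for i in np.round(positions)]


# A 300-frame clip reduced to 16 frames (base) or 8 frames (large).
base_indices = sample_frame_indices(300, 16)
large_indices = sample_frame_indices(300, 8)
print(len(base_indices), len(large_indices))
```

If the clip has fewer frames than requested, this approach repeats some indices instead of failing, which may or may not be the behavior you want.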

VideoPrismVisionConfig

class transformers.VideoPrismVisionConfig

VideoPrismTextConfig

class transformers.VideoPrismTextConfig

VideoPrismConfig

class transformers.VideoPrismConfig

VideoPrismTokenizer

class transformers.VideoPrismTokenizer

get_sentinel_token_ids

get_sentinel_tokens

VideoPrismProcessor

class transformers.VideoPrismProcessor

VideoPrismVisionModel

class transformers.VideoPrismVisionModel

forward

VideoPrismVideoModel

class transformers.VideoPrismVideoModel

forward

VideoPrismTextModel

class transformers.VideoPrismTextModel

forward

VideoPrismClipModel

class transformers.VideoPrismClipModel

forward

VideoPrismForVideoClassification

class transformers.VideoPrismForVideoClassification

forward