Depth Anything V2 Estimator Block

A custom Modular Diffusers block for monocular depth estimation using Depth Anything V2. Supports both images and videos.

Features

  • Relative depth estimation using Depth Anything V2 (Large variant, 335M params)
  • Image and video input support
  • Grayscale or turbo colormap visualization

Installation

# Using uv
uv sync

# Using pip
pip install -r requirements.txt

Quick Start

Load the block

from diffusers import ModularPipelineBlocks
import torch

blocks = ModularPipelineBlocks.from_pretrained(
    "your-username/depth-anything-v2-estimator",  # or local path "."
    trust_remote_code=True,
)
pipeline = blocks.init_pipeline()
pipeline.load_components(torch_dtype=torch.float16)
pipeline.to("cuda")

Single image - grayscale depth

from PIL import Image

image = Image.open("photo.jpg")
output = pipeline(image=image)

# Save depth map
output.depth_image.save("photo_depth.png")

# Access raw relative depth tensor
print(output.predicted_depth.shape)  # (H, W)

Single image - turbo colormap

output = pipeline(image=image, colormap="turbo")
output.depth_image.save("photo_depth_turbo.png")

Video - grayscale depth

from block import save_video

output = pipeline(video_path="input.mp4", colormap="grayscale")
save_video(output.depth_frames, output.fps, "output_depth.mp4")

Video - turbo colormap

output = pipeline(video_path="input.mp4", colormap="turbo")
save_video(output.depth_frames, output.fps, "output_depth_turbo.mp4")
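The bundled `save_video` helper handles encoding, but the per-frame PIL images can also be written out directly. A minimal sketch that saves them as an animated GIF with Pillow (the helper name and file paths here are illustrative, not part of the block's API):

```python
from PIL import Image

def save_frames_as_gif(frames, fps, path):
    """Write a list of PIL images as an animated GIF.

    `duration` is the per-frame display time in milliseconds.
    """
    frames[0].save(
        path,
        save_all=True,
        append_images=frames[1:],
        duration=int(1000 / fps),
        loop=0,
    )

# Synthetic frames as stand-ins for output.depth_frames:
frames = [Image.new("L", (64, 64), color=c) for c in (0, 128, 255)]
save_frames_as_gif(frames, fps=10, path="depth.gif")
```

GIF trades file size and color depth for portability; for full-quality output, prefer the `save_video` helper shown above.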

Inputs

| Parameter | Type | Default | Description |
|---|---|---|---|
| `image` | `PIL.Image` | - | Image to estimate depth for |
| `video_path` | `str` | - | Path to input video. When provided, `image` is ignored |
| `colormap` | `str` | `"grayscale"` | `"grayscale"` or `"turbo"` (colormapped) |

Outputs

Image mode

| Output | Type | Description |
|---|---|---|
| `depth_image` | `PIL.Image` | Normalized depth visualization |
| `predicted_depth` | `torch.Tensor` | Raw relative depth (H x W) |

Video mode

| Output | Type | Description |
|---|---|---|
| `depth_frames` | `List[PIL.Image]` | Per-frame depth visualizations |
| `fps` | `float` | Source video frame rate |

Depth Normalization

Depth values are min-max normalized and inverted so that bright areas represent nearby surfaces and dark areas represent distant ones.

  • Bright = close, dark = far (grayscale)
  • Warm (red/yellow) = close, cool (blue) = far (turbo)
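The grayscale mapping can be sketched in plain NumPy (an illustration of the min-max-and-invert scheme described above, not the block's exact code):

```python
import numpy as np

def depth_to_grayscale(depth: np.ndarray) -> np.ndarray:
    """Min-max normalize a relative depth map to uint8.

    After scaling to [0, 1], values are inverted so that small
    depth values (near surfaces) map to bright pixels:
    bright = close, dark = far.
    """
    d = depth.astype(np.float32)
    d = (d - d.min()) / (d.max() - d.min() + 1e-8)  # scale to [0, 1]
    d = 1.0 - d                                     # invert: near -> bright
    return (d * 255.0).astype(np.uint8)

depth = np.array([[0.0, 1.0]])     # nearest point, farthest point
gray = depth_to_grayscale(depth)   # -> [[255, 0]]
```

The small epsilon in the denominator guards against division by zero on constant-depth inputs.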

Model Variants

The block defaults to depth-anything/Depth-Anything-V2-Large-hf. Other available variants:

| Variant | Model ID | Params |
|---|---|---|
| Small | `depth-anything/Depth-Anything-V2-Small-hf` | 24.8M |
| Base | `depth-anything/Depth-Anything-V2-Base-hf` | 97.5M |
| Large (default) | `depth-anything/Depth-Anything-V2-Large-hf` | 335M |