Depth Anything V2 Estimator Block

A custom Modular Diffusers block for monocular depth estimation using Depth Anything V2. Supports both images and videos.

Features

  • Relative depth estimation using Depth Anything V2 (Large variant, 335M params)
  • Image and video input support
  • Grayscale or turbo colormap visualization

Installation

# Using uv
uv sync

# Using pip
pip install -r requirements.txt

Quick Start

Load the block

from diffusers import ModularPipelineBlocks
import torch

blocks = ModularPipelineBlocks.from_pretrained(
    "your-username/depth-anything-v2-estimator",  # or local path "."
    trust_remote_code=True,
)
pipeline = blocks.init_pipeline()
pipeline.load_components(torch_dtype=torch.float16)
pipeline.to("cuda")

Single image - grayscale depth

from PIL import Image

image = Image.open("photo.jpg")
output = pipeline(image=image)

# Save depth map
output.depth_image.save("photo_depth.png")

# Access raw relative depth tensor
print(output.predicted_depth.shape)  # (H, W)

Single image - turbo colormap

output = pipeline(image=image, colormap="turbo")
output.depth_image.save("photo_depth_turbo.png")

Video - grayscale depth

from block import save_video

output = pipeline(video_path="input.mp4", colormap="grayscale")
save_video(output.depth_frames, output.fps, "output_depth.mp4")

Video - turbo colormap

output = pipeline(video_path="input.mp4", colormap="turbo")
save_video(output.depth_frames, output.fps, "output_depth_turbo.mp4")
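The bundled `save_video` helper handles encoding, but the per-frame PIL images can also be written out directly. A minimal sketch that saves them as an animated GIF with Pillow (the helper name and file paths here are illustrative, not part of the block's API):

```python
from PIL import Image

def save_frames_as_gif(frames, fps, path):
    """Write a list of PIL images as an animated GIF.

    `duration` is the per-frame display time in milliseconds.
    """
    frames[0].save(
        path,
        save_all=True,
        append_images=frames[1:],
        duration=int(1000 / fps),
        loop=0,
    )

# Synthetic frames as stand-ins for output.depth_frames:
frames = [Image.new("L", (64, 64), color=c) for c in (0, 128, 255)]
save_frames_as_gif(frames, fps=10, path="depth.gif")
```

GIF trades file size and color depth for portability; for full-quality output, prefer the `save_video` helper shown above.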

Inputs

| Parameter | Type | Default | Description |
|---|---|---|---|
| `image` | `PIL.Image` | - | Image to estimate depth for |
| `video_path` | `str` | - | Path to input video. When provided, `image` is ignored |
| `colormap` | `str` | `"grayscale"` | `"grayscale"` or `"turbo"` (colormapped) |

Outputs

Image mode

| Output | Type | Description |
|---|---|---|
| `depth_image` | `PIL.Image` | Normalized depth visualization |
| `predicted_depth` | `torch.Tensor` | Raw relative depth (H x W) |

Video mode

| Output | Type | Description |
|---|---|---|
| `depth_frames` | `List[PIL.Image]` | Per-frame depth visualizations |
| `fps` | `float` | Source video frame rate |

Depth Normalization

Depth values are min-max normalized and inverted so that bright areas represent nearby surfaces and dark areas represent distant ones.

  • Bright = close, dark = far (grayscale)
  • Warm (red/yellow) = close, cool (blue) = far (turbo)
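The grayscale mapping can be sketched in plain NumPy (an illustration of the min-max-and-invert scheme described above, not the block's exact code):

```python
import numpy as np

def depth_to_grayscale(depth: np.ndarray) -> np.ndarray:
    """Min-max normalize a relative depth map to uint8.

    After scaling to [0, 1], values are inverted so that small
    depth values (near surfaces) map to bright pixels:
    bright = close, dark = far.
    """
    d = depth.astype(np.float32)
    d = (d - d.min()) / (d.max() - d.min() + 1e-8)  # scale to [0, 1]
    d = 1.0 - d                                     # invert: near -> bright
    return (d * 255.0).astype(np.uint8)

depth = np.array([[0.0, 1.0]])     # nearest point, farthest point
gray = depth_to_grayscale(depth)   # -> [[255, 0]]
```

The small epsilon in the denominator guards against division by zero on constant-depth inputs.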

Model Variants

The block defaults to depth-anything/Depth-Anything-V2-Large-hf. Other available variants:

| Variant | Model ID | Params |
|---|---|---|
| Small | `depth-anything/Depth-Anything-V2-Small-hf` | 24.8M |
| Base | `depth-anything/Depth-Anything-V2-Base-hf` | 97.5M |
| Large (default) | `depth-anything/Depth-Anything-V2-Large-hf` | 335M |