📂 Part of the Lance MLX collection on mlx-community.
Lance-3B-Video-bf16 (MLX, video specialist) — 🚧 ALPHA
MLX port of ByteDance Intelligent Creation Lab's Lance — the video-specialist Lance_3B_Video checkpoint, converted to bf16 for Apple Silicon. ~6.44 B LLM parameters + 669 M Qwen2.5-VL ViT bundled, with the 126,976-entry latent_pos_embed table needed for video-scale latent grids.
⚠️ This is an alpha release. t2v is production-quality through 768²×25f (n_lat=16,128) after the Phase 5m CFG-renorm fix (v0.5.2), verified across two prompts (panda surfing, bus + Big Ben). At n_lat ≥ ~30k (768²×49f, 480×848×121f), Phase 5m only reduces the original "pure noise" failure to a milder "structured-but-degraded with mesh artifacts" failure: the model attempts the scene, but the VAE outputs colored geometric tiles overlaid on it. See Status below.
Lance is ByteDance's 3B-active unified multimodal model (paper, code, HF original). This is not Lance/LanceDB, the columnar data format.
Status
🟡 Alpha — t2v is production-quality through 768²×25f (n_lat=16,128) after the Phase 5m fix; n_lat ≥ ~30k produces structured-but-degraded output with mesh artifacts (improvement over pre-fix pure noise, but not usable); understanding pipelines unvalidated.
| Capability | Status | Notes |
|---|---|---|
| t2v at 256×256 × ≤17 frames | 🟢 Production | Red panda surfing demo shows real temporal motion. ~33 s/clip on M5 Max. |
| t2v at 512×512 × 17 frames (n_lat ≤ 5,120) | 🟢 Production | Painterly aesthetic (this checkpoint's training-time style). |
| t2v at 640×640 × 17 frames (n_lat ≤ 8,000) | 🟢 Production | New scale validated under Phase 5m. |
| t2v at 768×768 × ≤13 frames (n_lat ≤ 9,216) | 🟢 Production | Painterly; legacy baseline. |
| t2v at 768×768 × 17 frames (n_lat = 11,520) | 🟢 Production (Phase 5m) | CFG-renorm fix in v0.5.2 — closes the silent quality regression. Pass cfg_renorm_type="global" to restore legacy default. |
| t2v at 768×768 × 25 frames (n_lat = 16,128) | 🟢 Production (Phase 5m+) | Verified post-fix with clean diagnostic prompt (bus + Big Ben). Quality equivalent-or-better than 17f; the n_lat → quality relationship is stochastic (seed × scale), not a monotonic degradation curve. |
| t2v at 768×768 × 33-41 frames (n_lat = 21k–26k) | ❓ Untested with Phase 5m | Gap in the empirical sweep. Likely in-envelope based on 25f result but unverified. |
| t2v at 768×768 × 49 frames (n_lat = 29,952) | ❌ Structured-but-degraded | Manual verification 2026-05-23: Phase 5m partially closes the pre-fix pure-noise collapse to a milder failure — the model attempts the scene (Big Ben silhouette barely visible) but the VAE produces colored geometric mesh artifacts overlaid throughout. Numerical signature: final std=0.623 vs ~0.88 for clean runs — channel renorm clamps too aggressively at late timesteps once n_lat reaches 30k, pushing latents outside the VAE's trained distribution. ~78 min wall-clock + 84.6 GB peak memory. Tracked as issue #1. |
| t2v at Lance reference (480×848 × 121f, n_lat ≈ 49 k) | ❌ Same regime as above | Untested directly at this exact dimension but expected to fall in the same degraded-mesh-artifact regime as 768²×49f. |
| x2t_video (video VQA / captioning) | 🟡 Implemented, not validated | Pipeline lands in lance-mlx but hasn't been compared against Phase 0 oracle. |
| video_edit (instruction-based) | 🟡 Implemented, not validated | Direct fusion of t2v + image_edit. Will only be as good as t2v at the chosen scale. |
For production-quality image tasks (t2i, image_edit, x2t_image), use the sibling repo mlx-community/Lance-3B-bf16 — it's fully validated.
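The n_lat figures in the table are consistent with a latent grid that is 16× smaller than the pixel grid in each spatial dimension and that packs every 4 frames after the first into one latent frame. Those ratios are inferred from the numbers above, not taken from the lance-mlx source, and `latent_token_count` and `scale_regime` are hypothetical helpers. A minimal sketch for checking a planned resolution before a long run:

```python
def latent_token_count(height, width, num_frames, spatial=16, temporal=4):
    """Estimate n_lat, assuming 16x spatial and 4x temporal compression (inferred)."""
    latent_frames = (num_frames - 1) // temporal + 1
    return (height // spatial) * (width // spatial) * latent_frames


VALIDATED_MAX_N_LAT = 16_128   # 768x768 x 25 frames, largest scale marked Production
DEGRADED_MIN_N_LAT = 29_952    # 768x768 x 49 frames, smallest scale seen to degrade


def scale_regime(n_lat):
    """Classify a latent count against the scales reported in the status table."""
    if n_lat <= VALIDATED_MAX_N_LAT:
        return "validated"
    if n_lat >= DEGRADED_MIN_N_LAT:
        return "degraded"
    return "untested"


n_lat = latent_token_count(768, 768, 25)
print(n_lat, scale_regime(n_lat))
```

Under these assumptions the helper reproduces the verified rows (for example 11,520 at 768²×17f and 29,952 at 768²×49f) and gives about 49k for the 480×848×121f reference scale.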
Why a separate "Video" checkpoint?
ByteDance ships two variants of Lance that differ in fine-tuning (NOT just latent_pos_embed size):
- Lance_3B: image specialist. Crystal-clear photorealistic t2i.
- Lance_3B_Video: video specialist. Same architecture, further fine-tuned on video data. Native aesthetic is painterly (verified by per-tensor diff: _moe_gen QK-norms differ by 0.5–0.85 in 6+ layers; lm_head and embed_tokens are byte-identical).
This checkpoint also bundles the Qwen2.5-VL ViT for video-understanding tasks, with the larger 126,976-entry latent_pos_embed table that addresses video-resolution token grids.
Quickstart
Install from the lance-mlx source repo:
git clone https://github.com/xocialize/lance-mlx
cd lance-mlx && uv sync
Download this checkpoint:
from huggingface_hub import snapshot_download
weights = snapshot_download("mlx-community/Lance-3B-Video-bf16")
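Before building a pipeline it can help to confirm the snapshot contains the files the pipelines are pointed at. The sketch below uses only the standard library; `missing_checkpoint_files` is a hypothetical helper, and the file names are the ones listed under "Files in this repo":

```python
from pathlib import Path

# File names taken from the "Files in this repo" table.
EXPECTED_FILES = (
    "model.safetensors",
    "vit.safetensors",
    "vae.safetensors",
    "config.json",
)


def missing_checkpoint_files(weights_dir, expected=EXPECTED_FILES):
    """Return the expected file names that are absent from weights_dir."""
    root = Path(weights_dir)
    return [name for name in expected if not (root / name).is_file()]
```

Calling `missing_checkpoint_files(weights)` after the download should return an empty list; anything else suggests an interrupted or partial snapshot.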
Text-to-video (recommended scale)
from lance_mlx.pipeline.t2v import TextToVideoPipeline
pipe = TextToVideoPipeline.from_pretrained(
    lance_weights_dir=weights,
    vae_safetensors=f"{weights}/vae.safetensors",
)
frames = pipe.generate(
    "A red panda surfing on a sunny wave.",
    num_frames=16, height=256, width=256,
    num_steps=30, cfg_scale=4.0, seed=42,
)
# frames is an np.ndarray of shape (T_decoded, H, W, 3), dtype uint8
Encode to MP4 with imageio:
import imageio
with imageio.get_writer("out.mp4", fps=12, codec="libx264") as writer:
    for f in frames:
        writer.append_data(f)
Video understanding (alpha)
from lance_mlx.pipeline.understanding import UnderstandingPipeline
pipe = UnderstandingPipeline.from_pretrained(
    lance_weights_dir=weights,
    vit_safetensors=f"{weights}/vit.safetensors",
)
answer = pipe.generate_video(
    video="my_video.mp4",
    question="Describe what happens in this video.",
    num_sample_frames=16, target_h=224, target_w=224,
    max_new_tokens=256, prompt_style="lance",
)
print(answer)
⚠️ Unvalidated against Phase 0 oracle. Treat answers as exploratory.
Phase 5m fix — silent quality regression at n_lat ≈ 11,520 RESOLVED (v0.5.2)
The "global" CFG-renorm cap was computing a single scalar L2 over the entire velocity tensor. At higher n_lat (≈ 2× the production baseline) the L2 sum spans roughly twice as many elements, so the same cap silently over-suppressed high-frequency detail — composition + identity correct, but textures and sky gradients degraded.
Fix (default since v0.5.2): cfg_renorm_type="channel" computes per-channel L2 separately, so pathological channels clamp without dragging the aggregate signal down. Detail returns at high n_lat without regressing small scales.
Evidence (768²×17f, seed=43, cfg=4.0): "global" final std 0.907 (over-suppressed), "none" final std 1.112 (uncapped + clean), "channel" final std 0.900 (capped per-channel + visually matches "none"). V0 safety A/B at 768²×13f confirms no small-scale regression.
Pass cfg_renorm_type="global" to restore the legacy default.
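The difference between the two modes can be illustrated with a toy velocity tensor. This is a sketch under stated assumptions, not the lance-mlx implementation: it assumes the cap rescales the guided velocity so its L2 norm never exceeds that of the conditional velocity, that tensors are laid out as (tokens, channels), and that "channel" means one norm per channel taken over all tokens. `renorm_scales` is a hypothetical name.

```python
import math


def _l2(values):
    return math.sqrt(sum(v * v for v in values))


def renorm_scales(v_cond, v_cfg, mode, eps=1e-8):
    """Return one rescale factor per channel (assumed cap: guided norm <= cond norm)."""
    n_channels = len(v_cond[0])
    if mode == "none":
        return [1.0] * n_channels
    if mode == "global":
        # A single scalar norm over every token and channel.
        cond = _l2([x for row in v_cond for x in row])
        cfg = _l2([x for row in v_cfg for x in row])
        return [min(1.0, cond / (cfg + eps))] * n_channels
    if mode == "channel":
        # One norm per channel, so an outlier channel is clamped on its own.
        scales = []
        for c in range(n_channels):
            cond = _l2([row[c] for row in v_cond])
            cfg = _l2([row[c] for row in v_cfg])
            scales.append(min(1.0, cond / (cfg + eps)))
        return scales
    raise ValueError(f"unknown mode: {mode}")


# Toy data: 4 tokens, 2 channels; guidance inflates channel 1 only.
v_cond = [[1.0, 1.0] for _ in range(4)]
v_cfg = [[1.0, 5.0] for _ in range(4)]
global_scales = renorm_scales(v_cond, v_cfg, "global")
channel_scales = renorm_scales(v_cond, v_cfg, "channel")
```

In this toy case the global scale shrinks both channels to roughly 0.28 of their guided value, while the per-channel variant leaves the healthy channel essentially untouched and scales only the inflated one (to 0.2).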
Known issue: structured-but-degraded mesh artifacts at n_lat ≥ ~30k
Lance_3B_Video t2v pre-Phase-5m collapsed to pure random noise at very-high latent counts. Post-Phase-5m the failure mode is milder but still unusable: the model attempts the scene (recognizable silhouettes barely visible) but the VAE outputs colored geometric mesh tiles overlaid throughout.
Bisection on Phase 5m defaults (cfg_renorm_type="channel"):
| T_frames | n_lat | Result | Notes |
|---|---|---|---|
| 1 | 2,304 | coherent | same as t2i |
| 5 | 4,608 | coherent | |
| 9 | 6,912 | coherent | painterly |
| 13 | 9,216 | coherent | painterly, mild temporal drift |
| 17 | 11,520 | coherent | Phase 5m fix restored detail |
| 25 | 16,128 | coherent | Phase 5m+ verified across two prompts |
| 33 | 21,120 | untested | |
| 41 | 26,304 | untested | |
| 49 | 29,952 | structured-but-degraded | manual verification 2026-05-23 |
Numerical signature of the degraded regime: final std=0.623 (49f) vs ~0.88 (clean runs). Channel renorm clamps too aggressively at late timesteps once n_lat reaches ~30k, pushing latents outside the VAE's trained distribution. The mesh-tile pattern is the VAE's response to out-of-distribution latents — not random noise but a low-rank geometric approximation.
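The final-latent standard deviation can serve as a cheap check before spending time on VAE decoding. The sketch below is illustrative only: the two reference values come from the paragraph above, but the midpoint threshold and the helper name `looks_degraded` are assumptions, and how to obtain the final latents from the pipeline is not shown here.

```python
import statistics

CLEAN_STD = 0.88      # approximate final std of clean runs (from the text above)
DEGRADED_STD = 0.623  # final std reported at 768x768 x 49 frames

# Hypothetical threshold: halfway between the two reported values.
DEFAULT_THRESHOLD = (CLEAN_STD + DEGRADED_STD) / 2


def latent_std(values):
    """Population standard deviation of a flat sequence of latent values."""
    return statistics.pstdev(values)


def looks_degraded(final_std, threshold=DEFAULT_THRESHOLD):
    """Flag a run whose final latent std falls well below the clean range."""
    return final_std < threshold
```

A single threshold is a coarse heuristic; it separates the values reported in this card but has not been validated across seeds or prompts.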
Open candidates for a future Phase 5n / issue #1 fix:
- Per-channel renorm threshold that scales with n_lat (currently constant)
- Alternative late-timestep clamping (e.g. cfg_interval=[0.4, 1.0] to disable CFG entirely in the last steps)
- Investigating whether VAE decoder can be retrained on Phase-5m-style latents (longer-term)
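To make the second candidate concrete, here is a minimal sketch of interval-gated guidance. It assumes the sampler's time variable runs from 1.0 (pure noise) to 0.0 (clean sample) and that guidance follows the usual form v = v_uncond + s · (v_cond − v_uncond), so s = 1.0 reduces to the conditional prediction alone. Whether lance-mlx interprets cfg_interval this way is not confirmed by this card, and `guidance_scale_at` is a hypothetical helper.

```python
def guidance_scale_at(t, cfg_scale=4.0, cfg_interval=(0.4, 1.0)):
    """Return the effective CFG scale at time t (assumed: t goes 1.0 -> 0.0)."""
    low, high = cfg_interval
    if low <= t <= high:
        return cfg_scale
    # Outside the interval, s = 1.0 means no guidance is applied.
    return 1.0
```

With the interval (0.4, 1.0), the early high-noise steps keep full guidance and the late steps, where the renorm is reported to clamp too aggressively, run unguided.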
The bug does not affect:
- Image tasks (use mlx-community/Lance-3B-bf16).
- t2v through 768² × 25f with Phase 5m defaults.
- The model checkpoint itself — same weights produce coherent images at any resolution.
Tracked at github.com/xocialize/lance-mlx/issues/1.
Performance (M5 Max 128 GB)
| Task | Configuration | Wall-clock |
|---|---|---|
| t2v | 256² × 16f, 30 steps, CFG=4.0 | ~33 s |
| t2v | 512² × 16f, 30 steps, CFG=4.0 | ~60 s |
| t2v | 768² × 13f, 30 steps, CFG=4.0 | ~145 s |
CFG doubles the forward cost since cond + uncond run sequentially. KV cache for the text prefix is a Phase 5 follow-up.
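As a rough back-of-envelope reading of the table, the wall-clock time can be divided by the number of forward passes. The result is an upper bound per pass, because the wall-clock figures presumably also include text encoding and VAE decoding; both helper names are hypothetical.

```python
def forward_passes(num_steps, use_cfg=True):
    """CFG runs a conditional and an unconditional pass at every step."""
    return num_steps * (2 if use_cfg else 1)


def seconds_per_forward(wall_clock_s, num_steps, use_cfg=True):
    """Upper-bound estimate of the time spent in one forward pass."""
    return wall_clock_s / forward_passes(num_steps, use_cfg)


for label, seconds in (("256x256 x 16f", 33), ("512x512 x 16f", 60), ("768x768 x 13f", 145)):
    print(label, round(seconds_per_forward(seconds, 30), 2), "s per pass")
```

At 30 steps with CFG this is 60 passes per clip, or roughly 0.55 s, 1.0 s, and 2.4 s per pass for the three rows.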
Files in this repo
| File | Size | Purpose |
|---|---|---|
| model.safetensors | 12.87 GB | LLM weights (1021 tensors, both UND + GEN towers, with 126,976-entry latent_pos_embed) |
| vit.safetensors | 1.34 GB | Qwen2.5-VL ViT (semantic encoder for x2t_video) |
| vae.safetensors | 1.41 GB | Lance's bundled Wan2.2 VAE (also available standalone as mlx-community/Wan2.2-VAE-Lance-bf16) |
| config.json | – | Qwen2_5_VLForConditionalGeneration config |
| conversion_report.json | – | Provenance |
| tokenizer.json / vocab.json | – | Qwen2.5-VL vocabulary |
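The first two file sizes line up with the parameter counts quoted for this checkpoint (6.437 B LLM, 0.669 B ViT) at two bytes per bf16 value. A quick arithmetic check, assuming decimal gigabytes and that every tensor is stored as bf16 (`bf16_size_gb` is a hypothetical helper):

```python
BYTES_PER_BF16 = 2


def bf16_size_gb(num_params):
    """Approximate on-disk size in decimal GB for bf16 weights."""
    return num_params * BYTES_PER_BF16 / 1e9


llm_gb = bf16_size_gb(6.437e9)   # compare with model.safetensors
vit_gb = bf16_size_gb(0.669e9)   # compare with vit.safetensors
print(round(llm_gb, 2), round(vit_gb, 2))
```

A downloaded file that is noticeably smaller than these figures is more likely a truncated download than a different precision.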
Provenance
Source: bytedance-research/Lance/Lance_3B_Video/model.safetensors (1411 tensors including bundled ViT; 6.437 B LLM + 0.669 B ViT params).
Converted via scripts/02_convert.py. The bundled ViT is extracted to a sibling vit.safetensors with the vit_model. prefix stripped, matching the layout convention of the image-specialist repo.
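The ViT extraction described above amounts to partitioning the tensor names by prefix. The following is a sketch of that idea on a plain dictionary, not the contents of scripts/02_convert.py; `split_vit` and the example tensor names are hypothetical.

```python
VIT_PREFIX = "vit_model."


def split_vit(tensors, prefix=VIT_PREFIX):
    """Split a name -> tensor mapping into (llm, vit), stripping the ViT prefix."""
    llm, vit = {}, {}
    for name, value in tensors.items():
        if name.startswith(prefix):
            vit[name[len(prefix):]] = value
        else:
            llm[name] = value
    return llm, vit


# Hypothetical tensor names, with placeholder values instead of arrays.
example = {
    "vit_model.blocks.0.attn.qkv.weight": "tensor-a",
    "model.layers.0.self_attn.q_proj.weight": "tensor-b",
}
llm_tensors, vit_tensors = split_vit(example)
```

Applied to the source checkpoint, the tensor counts quoted in this card imply 1411 − 1021 = 390 tensors on the ViT side.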
Limitations + caveats
- Aesthetic is painterly by design. Lance_3B_Video was further fine-tuned on video data; the native style is intentionally painterly, not photorealistic. Lance_3B (image specialist) is the crystal-photo checkpoint.
- Degraded output at n_lat ≥ ~30k (see Known issue); the 21k–26k range is still untested. Phase 5m fixed the silent quality regression at intermediate n_lat (verified through 16,128 with channel renorm).
- No streaming or batched generation.
- English + Chinese prompts. Other languages are out of distribution.
License
This MLX port: Apache 2.0.
Underlying weights:
- Lance: Apache 2.0 (ByteDance Intelligent Creation Lab).
- Wan2.2 VAE: Apache 2.0 (Alibaba).
- Qwen2.5-VL: Apache 2.0 (Alibaba).
See NOTICE for attribution.
Citation
@article{fu2026lance,
  title={Lance: Unified Multimodal Modeling by Multi-Task Synergy},
  author={Fu, Fengyi and Huang, Mengqi and Wu, Shaojin and others},
  journal={arXiv preprint arXiv:2605.18678},
  year={2026}
}
Links
- MLX port code + phase notes: github.com/xocialize/lance-mlx
- Open issue (t2v scale collapse): #1
- Original PyTorch model: bytedance-research/Lance
- Image specialist (production): mlx-community/Lance-3B-bf16
- Wan2.2 VAE (standalone): mlx-community/Wan2.2-VAE-Lance-bf16