oguzhanercan's Collections: Image-Video General Tasks
Sa2VA: Marrying SAM2 with LLaVA for Dense Grounded Understanding of Images and Videos
Paper
• 2501.04001
• Published
• 47
LLaVA-Mini: Efficient Image and Video Large Multimodal Models with One Vision Token
Paper
• 2501.03895
• Published
• 52
An Empirical Study of Autoregressive Pre-training from Videos
Paper
• 2501.05453
• Published
• 41
MatchAnything: Universal Cross-Modality Image Matching with Large-Scale Pre-Training
Paper
• 2501.07556
• Published
• 7
MINIMA: Modality Invariant Image Matching
Paper
• 2412.19412
• Published
• 4
Video Depth Anything: Consistent Depth Estimation for Super-Long Videos
Paper
• 2501.12375
• Published
• 23
Intuitive physics understanding emerges from self-supervised pretraining on natural videos
Paper
• 2502.11831
• Published
• 20
DICEPTION: A Generalist Diffusion Model for Visual Perceptual Tasks
Paper
• 2502.17157
• Published
• 52
"Principal Components" Enable A New Language of Images
Paper
• 2503.08685
• Published
• 12
What's in a Latent? Leveraging Diffusion Latent Space for Domain Generalization
Paper
• 2503.06698
• Published
• 4
Segment Any Motion in Videos
Paper
• 2503.22268
• Published
• 19
FlashVSR: Towards Real-Time Diffusion-Based Streaming Video Super-Resolution
Paper
• 2510.12747
• Published
• 39
VidVec: Unlocking Video MLLM Embeddings for Video-Text Retrieval
Paper
• 2602.08099
• Published
• 121