Image Papers
Visual Instruction Tuning
Paper
• 2304.08485
• Published
• 21
LLaVA-Plus: Learning to Use Tools for Creating Multimodal Agents
Paper
• 2311.05437
• Published
• 51
Improved Baselines with Visual Instruction Tuning
Paper
• 2310.03744
• Published
• 39
Aligning Large Multimodal Models with Factually Augmented RLHF
Paper
• 2309.14525
• Published
• 32
An Empirical Study of Scaling Instruct-Tuned Large Multimodal Models
Paper
• 2309.09958
• Published
• 20
Generate Anything Anywhere in Any Scene
Paper
• 2306.17154
• Published
• 23
LLaVA-Med: Training a Large Language-and-Vision Assistant for Biomedicine in One Day
Paper
• 2306.00890
• Published
• 14
From Audio to Photoreal Embodiment: Synthesizing Humans in Conversations
Paper
• 2401.01885
• Published
• 28
Instruct-Imagen: Image Generation with Multi-modal Instruction
Paper
• 2401.01952
• Published
• 32
High-Quality Image Restoration Following Human Instructions
Paper
• 2401.16468
• Published
• 15
AI training resources for GLAM: a snapshot
Paper
• 2205.04738
• Published
• 2
Agile But Safe: Learning Collision-Free High-Speed Legged Locomotion
Paper
• 2401.17583
• Published
• 26
Reformulating Vision-Language Foundation Models and Datasets Towards Universal Multimodal Assistants
Paper
• 2310.00653
• Published
• 3
SPHINX-X: Scaling Data and Parameters for a Family of Multi-modal Large Language Models
Paper
• 2402.05935
• Published
• 17
Qwen-VL: A Frontier Large Vision-Language Model with Versatile Abilities
Paper
• 2308.12966
• Published
• 11
Lumos : Empowering Multimodal LLMs with Scene Text Recognition
Paper
• 2402.08017
• Published
• 27
Deep Residual Learning for Image Recognition
Paper
• 1512.03385
• Published
• 12
Foundation Models for Generalist Geospatial Artificial Intelligence
Paper
• 2310.18660
• Published
• 11
U-Net: Convolutional Networks for Biomedical Image Segmentation
Paper
• 1505.04597
• Published
• 17
LAVE: LLM-Powered Agent Assistance and Language Augmentation for Video Editing
Paper
• 2402.10294
• Published
• 27
The boundary of neural network trainability is fractal
Paper
• 2402.06184
• Published
• 4
Video ReCap: Recursive Captioning of Hour-Long Videos
Paper
• 2402.13250
• Published
• 26
PaLM2-VAdapter: Progressively Aligned Language Model Makes a Strong Vision-language Adapter
Paper
• 2402.10896
• Published
• 16
Improving Robustness for Joint Optimization of Camera Poses and Decomposed Low-Rank Tensorial Radiance Fields
Paper
• 2402.13252
• Published
• 19