Paper: Vision LLMs Are Bad at Hierarchical Visual Understanding, and LLMs Are the Bottleneck (arXiv:2505.24840)
This model is a hierarchically enhanced version of Qwen2.5-VL-7B-Instruct, fine-tuned with LoRA on the iNat21-Plant taxonomy using vision instruction tuning.
For more details, please refer to our paper.
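To make "vision instruction tuning on a taxonomy" concrete, here is a minimal sketch of how one taxonomy path could be turned into one multiple-choice question per rank. This is an illustration under stated assumptions, not the data format used in the paper: the function `build_hierarchical_qa`, the `QAPair` type, the question wording, and the distractor labels are all hypothetical.

```python
from dataclasses import dataclass

# Linnaean ranks from coarse to fine. Assumption: the training data is
# organized along these ranks; the paper may use a different layout.
RANKS = ["kingdom", "phylum", "class", "order", "family", "genus", "species"]
LETTERS = "ABCDEFGH"


@dataclass
class QAPair:
    rank: str
    question: str
    answer: str  # letter of the correct option


def build_hierarchical_qa(path, distractors):
    """Build one multiple-choice question per rank present in `path`.

    path:        dict mapping rank -> correct label for one image
    distractors: dict mapping rank -> list of incorrect labels
    (Both arguments and the prompt template are hypothetical.)
    """
    pairs = []
    for rank in RANKS:
        if rank not in path:
            continue
        # Sort options so the correct answer's position is deterministic
        # but does not always land on the same letter.
        options = sorted([path[rank], *distractors.get(rank, [])])
        option_lines = [f"{letter}. {opt}" for letter, opt in zip(LETTERS, options)]
        question = (
            f"What is the taxonomic {rank} of the plant in the image?\n"
            + "\n".join(option_lines)
        )
        answer = LETTERS[options.index(path[rank])]
        pairs.append(QAPair(rank, question, answer))
    return pairs


if __name__ == "__main__":
    example = build_hierarchical_qa(
        {"family": "Fagaceae", "genus": "Quercus"},
        {"family": ["Betulaceae", "Rosaceae"], "genus": ["Fagus", "Castanea"]},
    )
    for pair in example:
        print(pair.question)
        print("Answer:", pair.answer)
```

Asking about every rank of the same image, rather than only the species, is what lets a fine-tuned model be checked for consistency along the whole path from coarse to fine labels.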
Base model: Qwen/Qwen2.5-VL-7B-Instruct