Report
CoLLaVO: Crayon Large Language and Vision mOdel
Title: | CoLLaVO: Crayon Large Language and Vision mOdel |
---|---|
Authors: | Lee, Byung-Kwan; Park, Beomchan; Kim, Chae Won; Ro, Yong Man |
Publication Year: | 2024 |
Collection: | Computer Science |
Subject Terms: | Computer Science - Computer Vision and Pattern Recognition |
Description: | The remarkable success of Large Language Models (LLMs) and instruction tuning drives the evolution of Vision Language Models (VLMs) towards a versatile general-purpose model. Yet, it remains unexplored whether current VLMs genuinely possess quality object-level image understanding capabilities determined from 'what objects are in the image?' or 'which object corresponds to a specified bounding box?'. Our findings reveal that the image understanding capabilities of current VLMs are strongly correlated with their zero-shot performance on vision language (VL) tasks. This suggests that prioritizing basic image understanding is crucial for VLMs to excel at VL tasks. To enhance object-level image understanding, we propose Crayon Large Language and Vision mOdel (CoLLaVO), which incorporates instruction tuning with Crayon Prompt as a new visual prompt tuning scheme based on panoptic color maps. Furthermore, we present a learning strategy of Dual QLoRA to preserve object-level image understanding without forgetting it during visual instruction tuning, thereby achieving a significant leap in numerous VL benchmarks in a zero-shot setting. Comment: ACL 2024 Findings. Code available: https://github.com/ByungKwanLee/CoLLaVO |
Document Type: | Working Paper |
Access URL: | http://arxiv.org/abs/2402.11248 |
Accession Number: | edsarx.2402.11248 |
Database: | arXiv |