Text as Images: Can Multimodal Large Language Models Follow Printed Instructions in Pixels?

Bibliographic Details
Title: Text as Images: Can Multimodal Large Language Models Follow Printed Instructions in Pixels?
Authors: Li, Xiujun, Lu, Yujie, Gan, Zhe, Gao, Jianfeng, Wang, William Yang, Choi, Yejin
Publication Year: 2023
Collection: Computer Science
Subject Terms: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Artificial Intelligence, Computer Science - Computation and Language
Description: Recent multimodal large language models (MLLMs) have shown promising instruction-following capabilities on vision-language tasks. In this work, we introduce VISUAL MODALITY INSTRUCTION (VIM) and investigate how well multimodal models can understand textual instructions provided in pixels, despite not being explicitly trained on such data during pretraining or fine-tuning. We adapt VIM to eight benchmarks, including OKVQA, MM-Vet, MathVista, and MMMU, and probe diverse MLLMs in both the text-modality instruction (TEM) setting and the VIM setting. Notably, we observe a significant performance disparity between the original TEM and VIM settings for open-source MLLMs, indicating that open-source MLLMs face greater challenges when the instruction is presented solely in image form. To address this issue, we train v-MLLM, a generalizable model capable of robust instruction following with both text-modality and visual-modality instructions. (A minimal sketch of rendering an instruction into pixels follows this record.)
Comment: Github: https://github.com/VIM-Bench/VIM_TOOL, Model and Data: https://huggingface.co/VIM-Bench
Document Type: Working Paper
Access URL: http://arxiv.org/abs/2311.17647
Accession Number: edsarx.2311.17647
Database: arXiv
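
To make the VIM setting concrete: the idea is to render the textual instruction itself into an image, so the model receives the instruction purely through the pixel modality (optionally paired with the original task image). The sketch below is an illustration under stated assumptions, not the authors' released tooling (see the GitHub link above); the function name render_instruction_as_image, the canvas width, and the use of Pillow's default bitmap font are all hypothetical choices.

# A minimal sketch of rendering a text instruction into pixels (VIM-style).
# Assumptions: Pillow is installed; the default bitmap font is legible enough
# for illustration. Swap in a TTF via ImageFont.truetype for real use.
from PIL import Image, ImageDraw, ImageFont

def render_instruction_as_image(instruction: str,
                                width: int = 448,
                                margin: int = 10,
                                line_height: int = 16) -> Image.Image:
    """Render an instruction string onto a white canvas, greedily wrapping
    words so each rendered line fits within the target width."""
    font = ImageFont.load_default()
    probe = ImageDraw.Draw(Image.new("RGB", (width, 1)))  # used only to measure text
    words, lines, current = instruction.split(), [], ""
    for word in words:
        candidate = (current + " " + word).strip()
        if probe.textlength(candidate, font=font) <= width - 2 * margin:
            current = candidate
        else:
            if current:
                lines.append(current)
            current = word
    if current:
        lines.append(current)
    height = 2 * margin + line_height * max(len(lines), 1)
    canvas = Image.new("RGB", (width, height), "white")
    draw = ImageDraw.Draw(canvas)
    for i, line in enumerate(lines):
        draw.text((margin, margin + i * line_height), line, fill="black", font=font)
    return canvas

# Usage: embed the instruction in pixels, then pair the resulting image with
# the original task image when probing an MLLM in the VIM setting.
img = render_instruction_as_image("Answer the question in the paired image with a single word.")
img.save("vim_instruction.png")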