mChartQA: A universal benchmark for multimodal Chart Question Answer based on Vision-Language Alignment and Reasoning

التفاصيل البيبلوغرافية
العنوان:	mChartQA: A universal benchmark for multimodal Chart Question Answer based on Vision-Language Alignment and Reasoning
المؤلفون:	Wei, Jingxuan, Xu, Nan, Chang, Guiyong, Luo, Yin, Yu, BiHui, Guo, Ruifeng
سنة النشر:	2024
المجموعة:	Computer Science
مصطلحات موضوعية:	Computer Science - Computer Vision and Pattern Recognition, Computer Science - Artificial Intelligence
الوصف:	In the fields of computer vision and natural language processing, multimodal chart question-answering, especially involving color, structure, and textless charts, poses significant challenges. Traditional methods, which typically involve either direct multimodal processing or a table-to-text conversion followed by language model analysis, have limitations in effectively handling these complex scenarios. This paper introduces a novel multimodal chart question-answering model, specifically designed to address these intricate tasks. Our model integrates visual and linguistic processing, overcoming the constraints of existing methods. We adopt a dual-phase training approach: the initial phase focuses on aligning image and text representations, while the subsequent phase concentrates on optimizing the model's interpretative and analytical abilities in chart-related queries. This approach has demonstrated superior performance on multiple public datasets, particularly in handling color, structure, and textless chart questions, indicating its effectiveness in complex multimodal tasks.
نوع الوثيقة:	Working Paper
URL الوصول:	http://arxiv.org/abs/2404.01548
رقم الأكسشن:	edsarx.2404.01548
قاعدة البيانات:	arXiv

الوصف
الوصف غير متاح.