OmChat: A Recipe to Train Multimodal Language Models with Strong Long Context and Video Understanding

التفاصيل البيبلوغرافية
العنوان:	OmChat: A Recipe to Train Multimodal Language Models with Strong Long Context and Video Understanding
المؤلفون:	Zhao, Tiancheng, Zhang, Qianqian, Lee, Kyusong, Liu, Peng, Zhang, Lu, Fang, Chunxin, Liao, Jiajia, Jiang, Kelei, Ma, Yibo, Xu, Ruochen
سنة النشر:	2024
المجموعة:	Computer Science
مصطلحات موضوعية:	Computer Science - Computer Vision and Pattern Recognition, Computer Science - Computation and Language
الوصف:	We introduce OmChat, a model designed to excel in handling long contexts and video understanding tasks. OmChat's new architecture standardizes how different visual inputs are processed, making it more efficient and adaptable. It uses a dynamic vision encoding process to effectively handle images of various resolutions, capturing fine details across a range of image qualities. OmChat utilizes an active progressive multimodal pretraining strategy, which gradually increases the model's capacity for long contexts and enhances its overall abilities. By selecting high-quality data during training, OmChat learns from the most relevant and informative data points. With support for a context length of up to 512K, OmChat demonstrates promising performance in tasks involving multiple images and videos, outperforming most open-source models in these benchmarks. Additionally, OmChat proposes a prompting strategy for unifying complex multimodal inputs including single image text, multi-image text and videos, and achieving competitive performance on single-image benchmarks. To further evaluate the model's capabilities, we proposed a benchmark dataset named Temporal Visual Needle in a Haystack. This dataset assesses OmChat's ability to comprehend temporal visual details within long videos. Our analysis highlights several key factors contributing to OmChat's success: support for any-aspect high image resolution, the active progressive pretraining strategy, and high-quality supervised fine-tuning datasets. This report provides a detailed overview of OmChat's capabilities and the strategies that enhance its performance in visual understanding. Comment: 14 pages
نوع الوثيقة:	Working Paper
URL الوصول:	http://arxiv.org/abs/2407.04923
رقم الأكسشن:	edsarx.2407.04923
قاعدة البيانات:	arXiv

الوصف
الوصف غير متاح.