OmChat: A Recipe to Train Multimodal Language Models with Strong Long Context and Video Understanding

التفاصيل البيبلوغرافية
العنوان: OmChat: A Recipe to Train Multimodal Language Models with Strong Long Context and Video Understanding
المؤلفون: Zhao, Tiancheng, Zhang, Qianqian, Lee, Kyusong, Liu, Peng, Zhang, Lu, Fang, Chunxin, Liao, Jiajia, Jiang, Kelei, Ma, Yibo, Xu, Ruochen
سنة النشر: 2024
المجموعة: Computer Science
مصطلحات موضوعية: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Computation and Language
الوصف: We introduce OmChat, a model designed to excel in handling long contexts and video understanding tasks. OmChat's new architecture standardizes how different visual inputs are processed, making it more efficient and adaptable. It uses a dynamic vision encoding process to effectively handle images of various resolutions, capturing fine details across a range of image qualities. OmChat utilizes an active progressive multimodal pretraining strategy, which gradually increases the model's capacity for long contexts and enhances its overall abilities. By selecting high-quality data during training, OmChat learns from the most relevant and informative data points. With support for a context length of up to 512K, OmChat demonstrates promising performance in tasks involving multiple images and videos, outperforming most open-source models in these benchmarks. Additionally, OmChat proposes a prompting strategy for unifying complex multimodal inputs including single image text, multi-image text and videos, and achieving competitive performance on single-image benchmarks. To further evaluate the model's capabilities, we proposed a benchmark dataset named Temporal Visual Needle in a Haystack. This dataset assesses OmChat's ability to comprehend temporal visual details within long videos. Our analysis highlights several key factors contributing to OmChat's success: support for any-aspect high image resolution, the active progressive pretraining strategy, and high-quality supervised fine-tuning datasets. This report provides a detailed overview of OmChat's capabilities and the strategies that enhance its performance in visual understanding.
Comment: 14 pages
نوع الوثيقة: Working Paper
URL الوصول: http://arxiv.org/abs/2407.04923
رقم الأكسشن: edsarx.2407.04923
قاعدة البيانات: arXiv