Report
Leveraging Visual Tokens for Extended Text Contexts in Multi-Modal Learning
| Field | Value |
|---|---|
| Title | Leveraging Visual Tokens for Extended Text Contexts in Multi-Modal Learning |
| Authors | Wang, Alex Jinpeng; Li, Linjie; Lin, Yiqi; Li, Min; Wang, Lijuan; Shou, Mike Zheng |
| Publication Year | 2024 |
| Collection | Computer Science |
| Subject Terms | Computer Science - Computer Vision and Pattern Recognition |
| Description | Training models with longer in-context lengths is a significant challenge for multimodal models due to substantial GPU memory and computational costs. This exploratory study does not present state-of-the-art models; rather, it introduces an innovative method designed to efficiently increase the in-context text length of multi-modality large language models (MLLMs). We present Visualized In-Context Text Processing (VisInContext), which processes long in-context text using visual tokens. This technique significantly reduces GPU memory usage and floating-point operations (FLOPs) for both the training and inference stages. For instance, our method expands the pre-training in-context text length from 256 to 2048 tokens with nearly the same FLOPs for a 56-billion-parameter MoE model. Experimental results demonstrate that models trained with VisInContext deliver superior performance on common downstream benchmarks for in-context few-shot evaluation. Additionally, VisInContext is complementary to existing methods for increasing in-context text length and enhances document understanding capabilities, showing great potential in document QA tasks and sequential document retrieval. Comment: 12 pages. The website is https://fingerrec.github.io/visincontext |
| Document Type | Working Paper |
| Access URL | http://arxiv.org/abs/2406.02547 |
| Accession Number | edsarx.2406.02547 |
| Database | arXiv |
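The core idea in the abstract, rendering long in-context text as images so that a fixed number of visual tokens stands in for many text tokens, can be illustrated with a small back-of-the-envelope sketch. The numbers below (roughly 4 characters per text token, 1024 characters rendered per image, 64 ViT-style patches per image) are illustrative assumptions for the token-count arithmetic only, not the paper's actual configuration:

```python
import math

def text_token_count(text: str, chars_per_token: float = 4.0) -> int:
    """Approximate LLM text-token count (~4 chars/token is an assumption)."""
    return max(1, round(len(text) / chars_per_token))

def visual_token_count(text: str,
                       chars_per_image: int = 1024,
                       patches_per_image: int = 64) -> int:
    """Token count if the same text were rendered to images and patch-encoded.

    Each rendered image is assumed to hold `chars_per_image` characters and
    to be split into `patches_per_image` visual tokens (ViT-style patching);
    both values are hypothetical parameters for illustration.
    """
    n_images = math.ceil(len(text) / chars_per_image)
    return n_images * patches_per_image

# Stand-in for ~8k characters of in-context text.
long_context = "x" * 8192
print(text_token_count(long_context))    # ~2048 tokens as plain text
print(visual_token_count(long_context))  # 8 images x 64 patches = 512 visual tokens
```

Under these assumed parameters, the visual-token pathway grows much more slowly with text length than plain tokenization, which is the intuition behind the FLOP and memory savings the abstract claims.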