Report
Rethinking Video-Text Understanding: Retrieval from Counterfactually Augmented Data
Title: | Rethinking Video-Text Understanding: Retrieval from Counterfactually Augmented Data |
---|---|
Authors: | Ma, Wufei; Li, Kai; Jiang, Zhongshi; Meshry, Moustafa; Liu, Qihao; Wang, Huiyu; Häne, Christian; Yuille, Alan |
Publication Year: | 2024 |
Collection: | Computer Science |
Subject Terms: | Computer Science - Computer Vision and Pattern Recognition |
Description: | Recent video-text foundation models have demonstrated strong performance on a wide variety of downstream video understanding tasks. Can these video-text models genuinely understand the contents of natural videos? Standard video-text evaluations could be misleading as many questions can be inferred merely from the objects and contexts in a single frame or biases inherent in the datasets. In this paper, we aim to better assess the capabilities of current video-text models and understand their limitations. We propose a novel evaluation task for video-text understanding, namely retrieval from counterfactually augmented data (RCAD), and a new Feint6K dataset. To succeed on our new evaluation task, models must derive a comprehensive understanding of the video from cross-frame reasoning. Analyses show that previous video-text foundation models can be easily fooled by counterfactually augmented data and are far behind human-level performance. In order to narrow the gap between video-text models and human performance on RCAD, we identify a key limitation of current contrastive approaches on video-text data and introduce LLM-teacher, a more effective approach to learn action semantics by leveraging knowledge obtained from a pretrained large language model. Experiments and analyses show that our approach successfully learns more discriminative action embeddings and improves results on Feint6K when applied to multiple video-text models. Our Feint6K dataset and project page are available at https://feint6k.github.io. Comment: ECCV 2024. Project page: https://feint6k.github.io |
Document Type: | Working Paper |
Access URL: | http://arxiv.org/abs/2407.13094 |
Accession Number: | edsarx.2407.13094 |
Database: | arXiv |