Report
CinePile: A Long Video Question Answering Dataset and Benchmark
Title: | CinePile: A Long Video Question Answering Dataset and Benchmark |
---|---|
Authors: | Rawal, Ruchit, Saifullah, Khalid, Basri, Ronen, Jacobs, David, Somepalli, Gowthami, Goldstein, Tom |
Publication Year: | 2024 |
Collection: | Computer Science |
Subject Terms: | Computer Science - Computer Vision and Pattern Recognition, Computer Science - Machine Learning, Computer Science - Multimedia |
Description: | Current datasets for long-form video understanding often fall short of providing genuine long-form comprehension challenges, as many tasks derived from these datasets can be successfully tackled by analyzing just one or a few random frames from a video. To address this issue, we present a novel dataset and benchmark, CinePile, specifically designed for authentic long-form video understanding. This paper details our innovative approach for creating a question-answer dataset, utilizing advanced LLMs with human-in-the-loop and building upon human-generated raw data. Our comprehensive dataset comprises 305,000 multiple-choice questions (MCQs), covering various visual and multimodal aspects, including temporal comprehension, understanding human-object interactions, and reasoning about events or actions within a scene. Additionally, we evaluate recent video-centric LLMs, both open-source and proprietary, on the test split of our dataset. The findings reveal that even state-of-the-art video-centric LLMs significantly lag behind human performance in these tasks, highlighting the complexity and challenge inherent in video understanding. The dataset is available at https://hf.co/datasets/tomg-group-umd/cinepile. Comment: Project page with all the artifacts: https://ruchitrawal.github.io/cinepile/. Updated version with results on Gemini Flash model and additional related work. |
Document Type: | Working Paper |
Access URL: | http://arxiv.org/abs/2405.08813 |
Accession Number: | edsarx.2405.08813 |
Database: | arXiv |