The Audio-Visual Conversational Graph: From an Egocentric-Exocentric Perspective

Bibliographic Details
Title: The Audio-Visual Conversational Graph: From an Egocentric-Exocentric Perspective
Authors: Jia, Wenqi; Liu, Miao; Jiang, Hao; Ananthabhotla, Ishwarya; Rehg, James M.; Ithapu, Vamsi Krishna; Gao, Ruohan
Publication Year: 2023
Collection: Computer Science
Subject Terms: Computer Science - Computer Vision and Pattern Recognition
Description: In recent years, the thriving development of research on egocentric videos has provided a unique perspective for the study of conversational interactions, where both visual and audio signals play a crucial role. While most prior work focuses on learning about behaviors that directly involve the camera wearer, we introduce the Ego-Exocentric Conversational Graph Prediction problem, marking the first attempt to infer exocentric conversational interactions from egocentric videos. We propose a unified multi-modal framework, Audio-Visual Conversational Attention (AV-CONV), for the joint prediction of conversation behaviors (speaking and listening) for both the camera wearer and all other social partners present in the egocentric video. Specifically, we adopt the self-attention mechanism to model the representations across time, across subjects, and across modalities. To validate our method, we conduct experiments on a challenging egocentric video dataset that includes multi-speaker and multi-conversation scenarios. Our results demonstrate the superior performance of our method compared to a series of baselines. We also present detailed ablation studies to assess the contribution of each component in our model. See our project page at https://vjwq.github.io/AV-CONV/.
Document Type: Working Paper
Access URL: http://arxiv.org/abs/2312.12870
Accession Number: edsarx.2312.12870
Database: arXiv