LoGAN: Latent Graph Co-Attention Network for Weakly-Supervised Video Moment Retrieval

Bibliographic details
Title: LoGAN: Latent Graph Co-Attention Network for Weakly-Supervised Video Moment Retrieval
Authors: Huijuan Xu, Kate Saenko, Bryan A. Plummer, Reuben Tan
Source: WACV
Publication details: arXiv, 2019.
Publication year: 2019
Subject terms: FOS: Computer and information sciences, Computer Science - Machine Learning, Exploit, Computer science, Computer Vision and Pattern Recognition (cs.CV), Computer Science - Computer Vision and Pattern Recognition, Context (language use), 02 engineering and technology, 010501 environmental sciences, computer.software_genre, Semantics, Machine learning, 01 natural sciences, Machine Learning (cs.LG), Attention network, 0202 electrical engineering, electronic engineering, information engineering, FOS: Electrical engineering, electronic engineering, information engineering, 0105 earth and related environmental sciences, business.industry, Image and Video Processing (eess.IV), Location awareness, Electrical Engineering and Systems Science - Image and Video Processing, Moment (mathematics), Graph (abstract data type), 020201 artificial intelligence & image processing, Artificial intelligence, business, computer, Natural language
Description: The goal of weakly-supervised video moment retrieval is to localize the video segment most relevant to a given natural language query without access to temporal annotations during training. Prior strongly- and weakly-supervised approaches often leverage co-attention mechanisms to learn visual-semantic representations for localization. However, while such approaches tend to focus on identifying relationships between elements of the video and language modalities, there is less emphasis on modeling relational context between video frames given the semantic context of the query. Consequently, these visual-semantic representations, built upon local frame features, do not contain much contextual information. To address this limitation, we propose a Latent Graph Co-Attention Network (LoGAN) that exploits fine-grained frame-by-word interactions to reason about correspondences between all possible pairs of frames, given the semantic context of the query. Comprehensive experiments across two datasets, DiDeMo and Charades-STA, demonstrate the effectiveness of our proposed latent co-attention model, which outperforms current state-of-the-art (SOTA) weakly-supervised approaches by a significant margin. Notably, it even achieves an 11% improvement in Recall@1 accuracy over strongly-supervised SOTA methods on DiDeMo.
DOI: 10.48550/arxiv.1909.13784
Access URL: https://explore.openaire.eu/search/publication?articleId=doi_dedup___::fd5cb08bcf7f6ab0c2008a9d0a396f69
Rights: OPEN
Accession number: edsair.doi.dedup.....fd5cb08bcf7f6ab0c2008a9d0a396f69
Database: OpenAIRE
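
Illustrative note: the description above mentions frame-by-word interactions followed by reasoning over all pairs of frames. The sketch below shows one generic way such a two-step attention could look, using plain Python lists. It is not the authors' architecture: the function names, the dot-product similarity, and the additive fusion of frame and attended-word features are all assumptions made for illustration.

```python
import math

def softmax(xs):
    # Numerically stable softmax over a list of floats.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def frame_by_word_attention(frames, words):
    """For each frame, attend over the query words (hypothetical helper).

    Returns one word-context vector per frame: a weighted average of the
    word vectors, weighted by softmax of frame-word dot products.
    """
    dim = len(words[0])
    attended = []
    for f in frames:
        weights = softmax([dot(f, w) for w in words])
        ctx = [sum(wt * w[d] for wt, w in zip(weights, words)) for d in range(dim)]
        attended.append(ctx)
    return attended

def frame_pair_weights(frames, attended):
    """Attention weights between all frame pairs (hypothetical helper).

    Assumption: each frame is fused with its word context by simple addition
    before pairwise dot products are normalised row by row.
    """
    fused = [[a + b for a, b in zip(f, c)] for f, c in zip(frames, attended)]
    return [softmax([dot(fi, fj) for fj in fused]) for fi in fused]

# Toy example: three 2-d frame features and two 2-d word features.
frames = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
words = [[1.0, 0.0], [0.0, 1.0]]
attended = frame_by_word_attention(frames, words)
pair_weights = frame_pair_weights(frames, attended)
```

Here each row of `pair_weights` is a probability distribution over frames, so every frame's representation could be updated from all other frames while still depending on the query words.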