Motion and Context-Aware Audio-Visual Conditioned Video Prediction

Bibliographic Details
Title: Motion and Context-Aware Audio-Visual Conditioned Video Prediction
Authors: Xu, Yating; Hu, Conghui; Lee, Gim Hee
Publication Year: 2022
Collection: Computer Science
Subject Terms: Computer Science - Computer Vision and Pattern Recognition
Description: The existing state-of-the-art method for audio-visual conditioned video prediction uses the latent codes of the audio-visual frames from a multimodal stochastic network and a frame encoder to predict the next visual frame. However, a direct inference of per-pixel intensity for the next visual frame is extremely challenging because of the high-dimensional image space. To this end, we decouple the audio-visual conditioned video prediction into motion and appearance modeling. The multimodal motion estimation predicts future optical flow based on the audio-motion correlation. The visual branch recalls from the motion memory built from the audio features to enable better long-term prediction. We further propose context-aware refinement to address the diminishing of the global appearance context in the long-term continuous warping. The global appearance context is extracted by the context encoder and manipulated by motion-conditioned affine transformation before fusion with features of warped frames. Experimental results show that our method achieves competitive results on existing benchmarks.
Comment: BMVC 2023
Document Type: Working Paper
Access URL: http://arxiv.org/abs/2212.04679
Accession Number: edsarx.2212.04679
Database: arXiv