Motion and Context-Aware Audio-Visual Conditioned Video Prediction

Bibliographic Details
Title:	Motion and Context-Aware Audio-Visual Conditioned Video Prediction
Authors:	Xu, Yating; Hu, Conghui; Lee, Gim Hee
Publication Year:	2022
Collection:	Computer Science
Subject Terms:	Computer Science - Computer Vision and Pattern Recognition
Description:	The existing state-of-the-art method for audio-visual conditioned video prediction uses the latent codes of the audio-visual frames from a multimodal stochastic network and a frame encoder to predict the next visual frame. However, a direct inference of per-pixel intensity for the next visual frame is extremely challenging because of the high-dimensional image space. To this end, we decouple audio-visual conditioned video prediction into motion and appearance modeling. The multimodal motion estimation predicts future optical flow based on the audio-motion correlation. The visual branch recalls from the motion memory built from the audio features to enable better long-term prediction. We further propose context-aware refinement to address the diminishing of the global appearance context during long-term continuous warping. The global appearance context is extracted by the context encoder and manipulated by a motion-conditioned affine transformation before fusion with the features of the warped frames. Experimental results show that our method achieves competitive results on existing benchmarks. Comment: BMVC 2023
Document Type:	Working Paper
Access URL:	http://arxiv.org/abs/2212.04679
Accession Number:	edsarx.2212.04679
Database:	arXiv
