ET tu, CLIP? Addressing Common Object Errors for Unseen Environments

Bibliographic Details
Title: ET tu, CLIP? Addressing Common Object Errors for Unseen Environments
Authors: Byun, Ye Won; Jiao, Cathy; Noroozizadeh, Shahriar; Sun, Jimin; Vitiello, Rosa
Source: Conference on Computer Vision and Pattern Recognition (CVPR 2022) - Embodied AI Workshop
Publication Year: 2024
Collection: Computer Science
Subject Terms: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Artificial Intelligence, Computer Science - Computation and Language, Computer Science - Machine Learning, Computer Science - Robotics
Description: We introduce a simple method that employs pre-trained CLIP encoders to enhance model generalization in the ALFRED task. In contrast to previous literature where CLIP replaces the visual encoder, we suggest using CLIP as an additional module through an auxiliary object detection objective. We validate our method on the recently proposed Episodic Transformer architecture and demonstrate that incorporating CLIP improves task performance on the unseen validation set. Additionally, our analysis results support that CLIP especially helps with leveraging object descriptions, detecting small objects, and interpreting rare words.
Document Type: Working Paper
Access URL: http://arxiv.org/abs/2406.17876
Accession Number: edsarx.2406.17876
Database: arXiv