GeoMeter: Probing Depth and Height Perception of Large Visual-Language Models

Bibliographic Details
Title: GeoMeter: Probing Depth and Height Perception of Large Visual-Language Models
Authors: Azad, Shehreen; Jain, Yash; Garg, Rishit; Rawat, Yogesh S; Vineet, Vibhav
Publication Year: 2024
Collection: Computer Science
Subject Terms: Computer Science - Computer Vision and Pattern Recognition
Description: Geometric understanding is crucial for navigating and interacting with our environment. While large Vision Language Models (VLMs) demonstrate impressive capabilities, deploying them in real-world scenarios necessitates a comparable geometric understanding in visual perception. In this work, we focus on the geometric comprehension of these models, specifically targeting the depths and heights of objects within a scene. Our observations reveal that, although VLMs excel at perceiving basic geometric properties such as shape and size, they encounter significant challenges in reasoning about the depth and height of objects. To address this, we introduce GeoMeter, a suite of benchmark datasets encompassing Synthetic 2D, Synthetic 3D, and Real-World scenarios to rigorously evaluate these aspects. We benchmark 17 state-of-the-art VLMs using these datasets and find that they consistently struggle with both depth and height perception. Our key insights include detailed analyses of the shortcomings in the depth and height reasoning capabilities of VLMs and of the inherent bias present in these models. This study aims to pave the way for the development of VLMs with enhanced geometric understanding, which is crucial for real-world applications.
Document Type: Working Paper
Access URL: http://arxiv.org/abs/2408.11748
Accession Number: edsarx.2408.11748
Database: arXiv