Leveraging Vision-Language Foundation Models for Fine-Grained Downstream Tasks

Bibliographic Details
Title:	Leveraging Vision-Language Foundation Models for Fine-Grained Downstream Tasks
Authors:	Coquenet, Denis; Rambour, Clément; Dalsasso, Emanuele; Thome, Nicolas
Publication Year:	2023
Collection:	Computer Science
Subject Terms:	Computer Science - Computer Vision and Pattern Recognition
Description:	Vision-language foundation models such as CLIP have shown impressive zero-shot performance on many tasks and datasets, especially thanks to their free-text inputs. However, they struggle to handle some downstream tasks, such as fine-grained attribute detection and localization. In this paper, we propose a multitask fine-tuning strategy based on a positive/negative prompt formulation to further leverage the capacities of the vision-language foundation models. Using the CLIP architecture as a baseline, we show strong improvements on bird fine-grained attribute detection and localization tasks, while also increasing the classification performance on the CUB200-2011 dataset. We provide source code for reproducibility purposes: it is available at https://github.com/FactoDeepLearning/MultitaskVLFM.
Document Type:	Working Paper
Access URL:	http://arxiv.org/abs/2307.06795
Accession Number:	edsarx.2307.06795
Database:	arXiv
