Jina CLIP: Your CLIP Model Is Also Your Text Retriever

Bibliographic Details
Title: Jina CLIP: Your CLIP Model Is Also Your Text Retriever
Authors: Koukounas, Andreas, Mastrapas, Georgios, Günther, Michael, Wang, Bo, Martens, Scott, Mohr, Isabelle, Sturua, Saba, Akram, Mohammad Kalim, Martínez, Joan Fontanals, Ognawala, Saahil, Guzman, Susana, Werk, Maximilian, Wang, Nan, Xiao, Han
Publication Year: 2024
Collection: Computer Science
Subject Terms: Computer Science - Computation and Language, Computer Science - Artificial Intelligence, Computer Science - Computer Vision and Pattern Recognition, Computer Science - Information Retrieval, 68T50, I.2.7
Description: Contrastive Language-Image Pretraining (CLIP) is widely used to train models to align images and texts in a common embedding space by mapping them to fixed-sized vectors. These models are key to multimodal information retrieval and related tasks. However, CLIP models generally underperform in text-only tasks compared to specialized text models. This creates inefficiencies for information retrieval systems that keep separate embeddings and models for text-only and multimodal tasks. We propose a novel, multi-task contrastive training method to address this issue, which we use to train the jina-clip-v1 model to achieve state-of-the-art performance on both text-image and text-text retrieval tasks.
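The contrastive alignment described above can be illustrated with a minimal sketch of the symmetric InfoNCE objective used in CLIP-style training: paired image and text embeddings are normalized, a similarity matrix is formed, and matching pairs (the diagonal) are treated as positives in both directions. This is a generic illustration, not the paper's exact multi-task loss; the function name and temperature value are assumptions.

```python
import numpy as np

def clip_contrastive_loss(image_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE loss, CLIP-style (illustrative sketch).

    image_emb, text_emb: (N, D) arrays where row i of each is a matched pair.
    """
    # L2-normalize so the dot product is cosine similarity
    image_emb = image_emb / np.linalg.norm(image_emb, axis=1, keepdims=True)
    text_emb = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)

    # (N, N) similarity matrix; entry (i, j) scores image i against text j
    logits = image_emb @ text_emb.T / temperature

    def cross_entropy(l):
        # Softmax cross-entropy with the matching pair (diagonal) as the target
        l = l - l.max(axis=1, keepdims=True)  # numerical stability
        log_probs = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -np.mean(np.diag(log_probs))

    # Average the image-to-text and text-to-image directions
    return 0.5 * (cross_entropy(logits) + cross_entropy(logits.T))

# Toy check: correctly paired embeddings should score lower (better)
# than mismatched ones
rng = np.random.default_rng(0)
emb = rng.normal(size=(4, 8))
aligned = clip_contrastive_loss(emb, emb)
mismatched = clip_contrastive_loss(emb, emb[::-1])
```

The paper's contribution is a multi-task variant of this idea, so that the same text tower also performs well on text-text retrieval, removing the need for separate text-only and multimodal embedding models.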
Comment: 4 pages, MFM-EAI@ICML2024
Document Type: Working Paper
Access URL: http://arxiv.org/abs/2405.20204
Accession Number: edsarx.2405.20204
Database: arXiv