Report
Jina CLIP: Your CLIP Model Is Also Your Text Retriever
| Title | Jina CLIP: Your CLIP Model Is Also Your Text Retriever |
|---|---|
| Authors | Koukounas, Andreas; Mastrapas, Georgios; Günther, Michael; Wang, Bo; Martens, Scott; Mohr, Isabelle; Sturua, Saba; Akram, Mohammad Kalim; Martínez, Joan Fontanals; Ognawala, Saahil; Guzman, Susana; Werk, Maximilian; Wang, Nan; Xiao, Han |
| Publication Year | 2024 |
| Collection | Computer Science |
| Subject Terms | Computer Science - Computation and Language; Computer Science - Artificial Intelligence; Computer Science - Computer Vision and Pattern Recognition; Computer Science - Information Retrieval; 68T50; I.2.7 |
| Description | Contrastive Language-Image Pretraining (CLIP) is widely used to train models to align images and texts in a common embedding space by mapping them to fixed-sized vectors. These models are key to multimodal information retrieval and related tasks. However, CLIP models generally underperform in text-only tasks compared to specialized text models. This creates inefficiencies for information retrieval systems that keep separate embeddings and models for text-only and multimodal tasks. We propose a novel, multi-task contrastive training method to address this issue, which we use to train the jina-clip-v1 model to achieve state-of-the-art performance on both text-image and text-text retrieval tasks. Comment: 4 pages, MFM-EAI@ICML2024 |
| Document Type | Working Paper |
| Access URL | http://arxiv.org/abs/2405.20204 |
| Accession Number | edsarx.2405.20204 |
| Database | arXiv |
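The description refers to contrastive training that aligns image and text embeddings in a shared space. The following is a minimal NumPy sketch of the symmetric contrastive (InfoNCE-style) objective that CLIP-family models optimize, not the paper's actual training code; the function name, batch shapes, and temperature value are illustrative assumptions.

```python
import numpy as np

def clip_contrastive_loss(image_emb, text_emb, temperature=0.07):
    """Symmetric contrastive loss over a batch of paired embeddings.

    image_emb, text_emb: (batch, dim) arrays where row i of each is a
    matching image/text pair; all other rows serve as negatives.
    Illustrative sketch only -- not jina-clip-v1's implementation.
    """
    # L2-normalize so dot products become cosine similarities.
    image_emb = image_emb / np.linalg.norm(image_emb, axis=1, keepdims=True)
    text_emb = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)

    # Pairwise similarity matrix, scaled by the temperature.
    logits = image_emb @ text_emb.T / temperature  # shape (batch, batch)
    labels = np.arange(len(logits))                # positives on the diagonal

    def cross_entropy(l):
        # Numerically stable log-softmax over each row.
        l = l - l.max(axis=1, keepdims=True)
        log_probs = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -log_probs[labels, labels].mean()

    # Average the image-to-text and text-to-image directions.
    return (cross_entropy(logits) + cross_entropy(logits.T)) / 2
```

Passing the same array as both arguments drives the loss toward zero, since each row is most similar to itself; the multi-task method in the paper additionally trains on text-text pairs with an objective of this general shape.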