Jina CLIP: Your CLIP Model Is Also Your Text Retriever

Bibliographic Details
Title: Jina CLIP: Your CLIP Model Is Also Your Text Retriever
Authors: Koukounas, Andreas, Mastrapas, Georgios, Günther, Michael, Wang, Bo, Martens, Scott, Mohr, Isabelle, Sturua, Saba, Akram, Mohammad Kalim, Martínez, Joan Fontanals, Ognawala, Saahil, Guzman, Susana, Werk, Maximilian, Wang, Nan, Xiao, Han
Publication Year: 2024
Collection: Computer Science
Subject Terms: Computer Science - Computation and Language, Computer Science - Artificial Intelligence, Computer Science - Computer Vision and Pattern Recognition, Computer Science - Information Retrieval, 68T50, I.2.7
Description: Contrastive Language-Image Pretraining (CLIP) is widely used to train models to align images and texts in a common embedding space by mapping them to fixed-sized vectors. These models are key to multimodal information retrieval and related tasks. However, CLIP models generally underperform in text-only tasks compared to specialized text models. This creates inefficiencies for information retrieval systems that keep separate embeddings and models for text-only and multimodal tasks. We propose a novel, multi-task contrastive training method to address this issue, which we use to train the jina-clip-v1 model to achieve state-of-the-art performance on both text-image and text-text retrieval tasks.
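The contrastive alignment described above can be illustrated with a minimal sketch of the symmetric InfoNCE objective used in CLIP-style training: paired image and text embeddings are normalized, a similarity matrix is formed, and matching pairs (the diagonal) are treated as positives in both directions. This is a generic illustration, not the paper's exact multi-task loss; the function name and temperature value are assumptions.

```python
import numpy as np

def clip_contrastive_loss(image_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE loss, CLIP-style (illustrative sketch).

    image_emb, text_emb: (N, D) arrays where row i of each is a matched pair.
    """
    # L2-normalize so the dot product is cosine similarity
    image_emb = image_emb / np.linalg.norm(image_emb, axis=1, keepdims=True)
    text_emb = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)

    # (N, N) similarity matrix; entry (i, j) scores image i against text j
    logits = image_emb @ text_emb.T / temperature

    def cross_entropy(l):
        # Softmax cross-entropy with the matching pair (diagonal) as the target
        l = l - l.max(axis=1, keepdims=True)  # numerical stability
        log_probs = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -np.mean(np.diag(log_probs))

    # Average the image-to-text and text-to-image directions
    return 0.5 * (cross_entropy(logits) + cross_entropy(logits.T))

# Toy check: correctly paired embeddings should score lower (better)
# than mismatched ones
rng = np.random.default_rng(0)
emb = rng.normal(size=(4, 8))
aligned = clip_contrastive_loss(emb, emb)
mismatched = clip_contrastive_loss(emb, emb[::-1])
```

The paper's contribution is a multi-task variant of this idea, so that the same text tower also performs well on text-text retrieval, removing the need for separate text-only and multimodal embedding models.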
Comment: 4 pages, MFM-EAI@ICML2024
Document Type: Working Paper
Access URL: http://arxiv.org/abs/2405.20204
Accession Number: edsarx.2405.20204
Database: arXiv