Soft Prompts Go Hard: Steering Visual Language Models with Hidden Meta-Instructions

Bibliographic Details
Title: Soft Prompts Go Hard: Steering Visual Language Models with Hidden Meta-Instructions
Authors: Zhang, Tingwei; Zhang, Collin; Morris, John X.; Bagdasaryan, Eugene; Shmatikov, Vitaly
Publication Year: 2024
Collection: Computer Science
Subject Terms: Computer Science - Cryptography and Security, Computer Science - Artificial Intelligence, Computer Science - Machine Learning
Description: We introduce a new type of indirect injection vulnerability in language models that operate on images: hidden "meta-instructions" that influence how the model interprets the image and steer the model's outputs to express an adversary-chosen style, sentiment, or point of view. We explain how to create meta-instructions by generating images that act as soft prompts. Unlike jailbreaking attacks and adversarial examples, the outputs resulting from these images are plausible and based on the visual content of the image, yet follow the adversary's (meta-)instructions. We describe the risks of these attacks, including misinformation and spin, evaluate their efficacy for multiple visual language models and adversarial meta-objectives, and demonstrate how they can "unlock" capabilities of the underlying language models that are unavailable via explicit text instructions. Finally, we discuss defenses against these attacks.
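The core mechanism described in the abstract, generating an image that acts as a soft prompt carrying a hidden meta-instruction, can be illustrated with a standard projected-gradient (PGD) adversarial-example recipe. The sketch below is not the paper's implementation: the vlm_loss hook, the function name make_meta_instruction_image, and all parameter values are hypothetical placeholders that would need to be wired to a concrete visual language model.

    import torch

    def vlm_loss(image: torch.Tensor, question: str, target: str) -> torch.Tensor:
        """Hypothetical hook (an assumption, not the paper's code): return the
        visual language model's cross-entropy loss for generating `target` as
        its answer to `question` about `image`."""
        raise NotImplementedError("wire this to a concrete VLM")

    def make_meta_instruction_image(image, question, target,
                                    epsilon=8 / 255, alpha=1 / 255, steps=500):
        """PGD-style sketch: find a small L-infinity perturbation such that the
        model's answer to `question` follows the adversary's meta-instruction
        (encoded by the `target` text, e.g. a positively spun description),
        while the image stays visually close to the original."""
        delta = torch.zeros_like(image, requires_grad=True)
        for _ in range(steps):
            loss = vlm_loss((image + delta).clamp(0.0, 1.0), question, target)
            loss.backward()
            with torch.no_grad():
                delta -= alpha * delta.grad.sign()   # descend on the target loss
                delta.clamp_(-epsilon, epsilon)      # project into the eps-ball
                delta.grad.zero_()
        return (image + delta).detach().clamp(0.0, 1.0)

Any real attack of this kind would optimize over a specific model's image encoder and language head; the epsilon-ball constraint here is one common way to keep the perturbation inconspicuous, not a requirement stated in the abstract.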
Document Type: Working Paper
Access URL: http://arxiv.org/abs/2407.08970
Accession Number: edsarx.2407.08970
Database: arXiv