MUMU: Bootstrapping Multimodal Image Generation from Text-to-Image Data

Bibliographic Details
Title: MUMU: Bootstrapping Multimodal Image Generation from Text-to-Image Data
Authors: Berman, William; Peysakhovich, Alexander
Publication Year: 2024
Collection: Computer Science
Subject Terms: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Artificial Intelligence
Description: We train a model to generate images from multimodal prompts of interleaved text and images such as "a man <image of a man> and his dog <image of a dog> in an animated style." We bootstrap a multimodal dataset by extracting semantically meaningful image crops corresponding to words in the image captions of synthetically generated and publicly available text-image data. Our model, MUMU, is composed of a vision-language model encoder with a diffusion decoder and is trained on a single 8xH100 GPU node. Despite being trained only on crops from the same image, MUMU learns to compose inputs from different images into a coherent output. For example, an input of a realistic person and a cartoon will output the same person in the cartoon style, and an input of a standing subject and a scooter will output the subject riding the scooter. As a result, our model generalizes to tasks such as style transfer and character consistency. Our results show the promise of using multimodal models as general-purpose controllers for image generation.
Document Type: Working Paper
Access URL: http://arxiv.org/abs/2406.18790
Accession Number: edsarx.2406.18790
Database: arXiv