Phi-3 Safety Post-Training: Aligning Language Models with a 'Break-Fix' Cycle

Bibliographic Details
Title: Phi-3 Safety Post-Training: Aligning Language Models with a 'Break-Fix' Cycle
Authors: Haider, Emman; Perez-Becker, Daniel; Portet, Thomas; Madan, Piyush; Garg, Amit; Majercak, David; Wen, Wen; Kim, Dongwoo; Yang, Ziyi; Zhang, Jianwen; Sharma, Hiteshi; Bullwinkel, Blake; Pouliot, Martin; Minnich, Amanda; Chawla, Shiven; Herrera, Solianna; Warreth, Shahed; Engler, Maggie; Lopez, Gary; Chikanov, Nina; Dheekonda, Raja Sekhar Rao; Jagdagdorj, Bolor-Erdene; Lutz, Roman; Lundeen, Richard; Westerhoff, Tori; Bryan, Pete; Seifert, Christian; Kumar, Ram Shankar Siva; Berkley, Andrew; Kessler, Alex
Publication Year: 2024
Collection: Computer Science
Subject Terms: Computer Science - Computation and Language, Computer Science - Artificial Intelligence
Description: Recent innovations in language model training have demonstrated that it is possible to create highly performant models that are small enough to run on a smartphone. As these models are deployed in an increasing number of domains, it is critical to ensure that they are aligned with human preferences and safety considerations. In this report, we present our methodology for safety aligning the Phi-3 series of language models. We utilized a "break-fix" cycle, performing multiple rounds of dataset curation, safety post-training, benchmarking, red teaming, and vulnerability identification to cover a variety of harm areas in both single and multi-turn scenarios. Our results indicate that this approach iteratively improved the performance of the Phi-3 models across a wide range of responsible AI benchmarks.
Document Type: Working Paper
Access URL: http://arxiv.org/abs/2407.13833
Accession Number: edsarx.2407.13833
Database: arXiv