ShieldGemma: Generative AI Content Moderation Based on Gemma

Bibliographic Details
Title:	ShieldGemma: Generative AI Content Moderation Based on Gemma
Authors:	Zeng, Wenjun; Liu, Yuchi; Mullins, Ryan; Peran, Ludovic; Fernandez, Joe; Harkous, Hamza; Narasimhan, Karthik; Proud, Drew; Kumar, Piyush; Radharapu, Bhaktipriya; Sturman, Olivia; Wahltinez, Oscar
Publication Year:	2024
Collection:	Computer Science
Subject Terms:	Computer Science - Computation and Language, Computer Science - Machine Learning
Description:	We present ShieldGemma, a comprehensive suite of LLM-based safety content moderation models built upon Gemma2. These models provide robust, state-of-the-art predictions of safety risks across key harm types (sexually explicit, dangerous content, harassment, hate speech) in both user input and LLM-generated output. By evaluating on both public and internal benchmarks, we demonstrate superior performance compared to existing models, such as Llama Guard (+10.8% AU-PRC on public benchmarks) and WildGuard (+4.3%). Additionally, we present a novel LLM-based data curation pipeline, adaptable to a variety of safety-related tasks and beyond. We have shown strong generalization performance for models trained mainly on synthetic data. By releasing ShieldGemma, we provide a valuable resource to the research community, advancing LLM safety and enabling the creation of more effective content moderation solutions for developers.
Document Type:	Working Paper
Access URL:	http://arxiv.org/abs/2407.21772
Accession Number:	edsarx.2407.21772
Database:	arXiv
