Revisiting the Robust Alignment of Circuit Breakers

التفاصيل البيبلوغرافية
العنوان: Revisiting the Robust Alignment of Circuit Breakers
المؤلفون: Schwinn, Leo, Geisler, Simon
سنة النشر: 2024
المجموعة: Computer Science
مصطلحات موضوعية: Computer Science - Cryptography and Security
الوصف: Over the past decade, adversarial training has emerged as one of the few reliable methods for enhancing model robustness against adversarial attacks [Szegedy et al., 2014, Madry et al., 2018, Xhonneux et al., 2024], while many alternative approaches have failed to withstand rigorous subsequent evaluations. Recently, an alternative defense mechanism, namely "circuit breakers" [Zou et al., 2024], has shown promising results for aligning LLMs. In this report, we show that the robustness claims of "Improving Alignment and Robustness with Circuit Breakers" against unconstraint continuous attacks in the embedding space of the input tokens may be overestimated [Zou et al., 2024]. Specifically, we demonstrate that by implementing a few simple changes to embedding space attacks [Schwinn et al., 2024a,b], we achieve 100% attack success rate (ASR) against circuit breaker models. Without conducting any further hyperparameter tuning, these adjustments increase the ASR by more than 80% compared to the original evaluation. Code is accessible at: https://github.com/SchwinnL/circuit-breakers-eval
نوع الوثيقة: Working Paper
URL الوصول: http://arxiv.org/abs/2407.15902
رقم الأكسشن: edsarx.2407.15902
قاعدة البيانات: arXiv