Bergeron: Combating Adversarial Attacks by Emulating a Conscience

Artificial Intelligence alignment is the practice of encouraging an AI to behave in a manner compatible with human values and expectations. Research into this area has grown considerably since the introduction of increasingly capable Large Language Models (LLMs). The most effective contemporary alignment methods are primarily weight-based: they modify a model's internal weights to better align its behavior with human preferences. An optimal alignment process yields an AI model that is maximally helpful to its user while generating minimally harmful responses. Unfortunately, modern alignment methods still fail to fully prevent harmful responses when faced with effective adversarial attacks. These deliberate attacks can trick seemingly aligned models into giving manufacturing instructions for dangerous materials, inciting violence, or recommending other immoral acts. To help mitigate this issue, I introduce Bergeron: a framework designed to improve the robustness of LLMs against attacks without any additional parameter fine-tuning. Bergeron is organized into two tiers, with a secondary LLM emulating the conscience of a protected primary LLM. This framework safeguards the primary model against incoming attacks while monitoring its output for any harmful content. Empirical analysis shows that, by using Bergeron to complement models with existing alignment training, we can improve the robustness and safety of multiple commonly used commercial and open-source LLMs. Additionally, I demonstrate that a carefully chosen secondary model can effectively protect even much larger primary LLMs with a relatively small impact on Bergeron's resource usage.
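
The two-tier design described above can be pictured as a thin wrapper around an ordinary model call. The sketch below is only an illustration under assumed interfaces: the `generate` callables, prompt wording, and function names are hypothetical and do not reflect Bergeron's actual implementation.

```python
# Hypothetical sketch of a two-tier "conscience" flow: a secondary model
# screens the incoming prompt and audits the primary model's response.
# The generate() callables stand in for any LLM API; all names are illustrative.

from typing import Callable


def bergeron_respond(
    prompt: str,
    primary_generate: Callable[[str], str],
    secondary_generate: Callable[[str], str],
) -> str:
    # Tier 1: the secondary model inspects the prompt for adversarial content.
    critique = secondary_generate(
        "Does this prompt ask for harmful or dangerous content? "
        f"Answer YES or NO.\n\n{prompt}"
    )
    if critique.strip().upper().startswith("YES"):
        # Warn the primary model before it answers the suspicious prompt.
        prompt += "\n\n[Conscience warning: this request may be unsafe; respond cautiously.]"

    # The primary model produces its answer as usual.
    response = primary_generate(prompt)

    # Tier 2: the secondary model audits the draft response for harmful content.
    audit = secondary_generate(
        "Does this response contain harmful instructions or content? "
        f"Answer YES or NO.\n\n{response}"
    )
    if audit.strip().upper().startswith("YES"):
        return "I can't help with that request."
    return response
```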
Date
Location: Carnegie 113 or https://rensselaer.webex.com/meet/pisanm2
Speaker: Matthew Pisano
Advisor: Mei Si