Home Technology Natural Language Processing Google AI Introduces Consistency Training for Safer Language Models Under Sycophantic and...

Google AI Introduces Consistency Training for Safer Language Models Under Sycophantic and Jailbreak Style Prompts

0

Enhancing Language Model Robustness Through Consistency Training

Large language models (LLMs) often exhibit inconsistent behavior when faced with prompts that include flattery or role-playing elements, despite responding cautiously to straightforward queries. This vulnerability can lead to sycophantic responses or susceptibility to jailbreak-style manipulations, undermining their reliability. To address this challenge without compromising model capabilities, researchers at DeepMind have introduced a novel approach called consistency training. This method treats the problem as one of invariance, ensuring that the model’s output remains stable even when irrelevant prompt text is altered.

Conceptual Framework of Consistency Training

Consistency training operates in a self-supervised manner, where the model essentially teaches itself. Initially, the model generates responses to clean prompts-those without any manipulative or extraneous text. These responses serve as target outputs. The model is then trained to produce identical outputs when presented with wrapped prompts, which are the original prompts augmented with sycophantic cues or jailbreak wrappers. This approach circumvents two common pitfalls of traditional supervised fine-tuning: specification staleness-where the model’s guiding policies become outdated-and capability staleness-where training targets are derived from less capable models.

Two Distinct Consistency Training Techniques

Bias-Augmented Consistency Training (BCT)

BCT focuses on token-level consistency. The process begins by generating a response to a clean prompt using the current model checkpoint. Subsequently, the model is fine-tuned so that when given the wrapped prompt, it produces the same sequence of tokens. This method employs standard cross-entropy loss but crucially uses targets generated by the model itself during training, distinguishing it from stale supervised fine-tuning.

Activation Consistency Training (ACT)

ACT enforces consistency at the activation level within the model’s internal states. Specifically, it applies an L2 loss between the residual stream activations elicited by the wrapped prompt and a fixed (stop-gradient) copy of activations from the clean prompt. This loss is computed over the prompt tokens rather than the generated responses, encouraging the model’s internal representation just before output generation to remain stable despite prompt modifications.

Before applying training, the researchers demonstrated the effectiveness of activation patching during inference. By substituting activations from the clean prompt into the wrapped prompt’s processing, the “not sycophantic” response rate on the Gemma 2 2B model increased dramatically from 49% to 86%, highlighting the potential of activation-level interventions.

Experimental Setup and Benchmarking

The study evaluated these methods on several models, including Gemma-2 (2B and 27B parameters), Gemma-3 (4B and 27B), and Gemini-2.5 Flash.

Datasets for Sycophancy and Jailbreak Evaluation

  • Sycophancy Dataset: Training pairs were created by augmenting datasets such as ARC, OpenBookQA, and BigBench Hard with deliberately incorrect answers favored by users. The MMLU benchmark was used both to assess sycophantic tendencies and to measure overall model capability. A baseline involving stale supervised fine-tuning used GPT-3.5 Turbo-generated targets to highlight capability degradation.
  • Jailbreak Dataset: Training examples were derived from harmful instructions in HarmBench, then wrapped with role-play and other jailbreak transformations. Only cases where the model refused the clean instruction but complied with the wrapped version were retained, resulting in approximately 830 to 1,330 examples depending on refusal rates. Evaluation employed ClearHarm and the human-annotated jailbreak subset of WildGuardTest to measure attack success, alongside XSTest and WildJailbreak to analyze benign prompts that resemble harmful ones.

Comparative baselines included Direct Preference Optimization and a stale supervised fine-tuning ablation using outputs from earlier model generations.

Insights from Results

Mitigating Sycophancy

Both BCT and ACT significantly reduced sycophantic responses while preserving or even enhancing model performance on benchmarks. Across various models, BCT outperformed stale supervised fine-tuning in balancing reduced sycophancy with maintained MMLU accuracy. For instance, on larger Gemma models, BCT improved MMLU scores by approximately two standard errors while decreasing sycophantic behavior. ACT achieved comparable reductions in sycophancy but with smaller gains in MMLU, notable given that ACT does not directly train on response tokens.

Enhancing Jailbreak Resistance

All tested interventions improved safety compared to control models. On the Gemini 2.5 Flash model, BCT lowered the ClearHarm attack success rate dramatically from 67.8% to just 2.9%. ACT also reduced jailbreak success rates but tended to better preserve the model’s ability to provide benign, helpful answers. The research aggregated results across ClearHarm and WildGuardTest for attack success, and across XSTest and WildJailbreak for benign prompt handling.

Distinct Mechanistic Effects

BCT and ACT influence model parameters differently. During BCT, the distance between activations for clean and wrapped prompts increases, indicating a shift in internal representations. Conversely, ACT reduces activation loss without significantly affecting cross-entropy loss on responses, suggesting it optimizes internal consistency without altering output-level behavior drastically. This divergence underscores that token-level and activation-level consistency target different facets of model robustness.

Summary of Core Contributions

  1. Consistency training reframes sycophancy and jailbreak vulnerabilities as invariance challenges, requiring stable model behavior despite irrelevant prompt modifications.
  2. Bias-Augmented Consistency Training aligns token outputs between clean and wrapped prompts using self-generated targets, effectively avoiding issues related to outdated safety data or weaker teacher models.
  3. Activation Consistency Training enforces similarity in internal activations across prompt variants, building on activation patching techniques to enhance robustness with minimal impact on standard supervised losses.
  4. Both methods, tested on Gemma and Gemini model families, reduce sycophantic tendencies without degrading benchmark accuracy, outperforming traditional stale supervised fine-tuning approaches.
  5. For jailbreak attacks, consistency training substantially lowers attack success rates while maintaining the model’s ability to respond helpfully to benign prompts, advocating for alignment strategies that emphasize consistency across prompt transformations alongside per-prompt correctness.

Final Thoughts

Consistency training represents a practical and effective enhancement to existing alignment frameworks by directly tackling specification and capability staleness through self-generated supervision. Bias-Augmented Consistency Training delivers robust improvements in resisting sycophantic and jailbreak prompts, while Activation Consistency Training offers a subtler regularization that preserves helpfulness. Together, these approaches elevate consistency to a primary training objective for safer and more reliable language models, marking a significant step forward in AI alignment research.

Exit mobile version