OpenAI, Anthropic, and DeepMind Joint Statement: Current LLM Safety Defenses Are Fragile

OpenAI, Anthropic, and DeepMind Joint Statement: Current LLM Safety Defenses Are Fragile

2025-10-19 · Jilin

Rethinking AI Safety Evaluation

We may have been using the wrong approach to evaluate LLM safety.

image
image

> Key Insight: This study tested 12 defense methods — almost all failed.

It’s truly rare to see OpenAI, Anthropic, and Google DeepMind — three major competitors — jointly publish a paper on security defense evaluation for language models.

When it comes to LLM safety, competition can give way to collaboration.

image

---

Why Current Evaluations Fail

The paper focuses on one central question:

> How should we truly assess the robustness of LLM defense mechanisms?

Current defense evaluation methods:

  • Static testing with a fixed set of harmful attack samples.
  • Weak optimization methods without consideration for specific defense designs.

This means most evaluations do not simulate a knowledgeable, adaptive attacker who understands and works around the defense.

Result: Flawed methods, misleading robustness claims.

---

The Adaptive Attack Perspective

To address this, the authors propose assuming attackers adapt:

  • They change strategies based on defense designs.
  • Invest heavily in attack optimization.

Core proposal:

A General Adaptive Attack Framework using multiple optimization strategies:

  • Gradient descent
  • Reinforcement learning
  • Random search
  • Human-assisted exploration

Their approach:

  • Bypassed 12 state-of-the-art defenses.
  • In most cases, > 90% attack success rate.
  • Contrasted with defense authors’ claims of near-zero vulnerability.

---

01 · General Attack Method

The Core Lesson

Defenses relying on protection against one fixed attack are easily broken.

The authors didn’t create a brand-new attack technique — instead:

  • Existing attack concepts, applied adaptively, expose weaknesses.

General Adaptive Attack Framework

image

Figure 2: General adaptive attack framework for LLMs.

An attack operates through an optimization loop:

  • Generate candidate attack prompts.
  • Evaluate model’s response.
  • Get feedback / success score.
  • Update prompts based on results.

Four Typical Instances

  • Gradient-based methods — adapt adversarial example techniques to token space, though prompt optimization is challenging and often unstable.
  • Reinforcement learning — policy samples prompts, gets rewards, updates via policy gradient (e.g., GRPO).
  • Search-based methods — combinatorial exploration using beam search, genetic algorithms, heuristic perturbations.
  • Human red-teaming — humans creatively craft prompts, outperforming automation when defenses evolve.

> Key Claim: If a defense fails against any adaptive instance of this PSSU loop, it is not robust.

---

02 · Experimental Results

Researchers evaluated twelve defenses across:

  • Jailbreaking — coaxing the model to produce harmful content.
  • Prompt injection — altering system behavior to leak data or act maliciously.

Benchmarks used:

  • Original paper methods for each defense.
  • Added: HarmBench (jailbreaking) and AgentDojo (prompt injection).

Attacks were manually verified to avoid loopholes.

---

Prompt-Based Defenses

Examples Tested:

  • Spotlighting
  • Prompt Sandwiching
  • RPO (Robust Prompt Optimization)

Findings:

  • Static ASR ~1% → Adaptive ASR > 95%.
  • Defenses like RPO had adaptive ASR up to 98%.
image
image

Conclusion: Can stop known attacks, but easily broken by novel or adaptive methods.

---

Training Against Existing Attacks

Examples Tested:

  • Circuit Breakers (jailbreak defense)
  • StruQ
  • MetaSecAlign (prompt injection defense)

Results:

  • StruQ failed in all cases.
  • MetaSecAlign: Static ASR 2% → Adaptive ASR 96%.
  • Circuit Breakers: 100% ASR under HarmBench adaptive attack.
image

Conclusion: Training on weak, fixed adversarial data cannot generalize.

---

Filter Model Defense

Examples Tested:

  • Protect AI Detector
  • PromptGuard
  • PIGuard
  • Model Armor

Findings:

  • All exceeded 90% ASR except PIGuard (71%).
  • Successful triggers often lacked obvious malicious features.
image

Conclusion: Filters can help, but are not a complete solution — better as one layer in a multi-layer defense.

---

Secret-Knowledge-Based Defense

Examples Tested:

  • Data Sentinel
  • MELON

Results:

  • Data Sentinel bypassed in all cases (> 80% redirection accuracy).
  • MELON: Adaptive aware ASR 95%, unaware ASR 76%.
image

Conclusion: Once attackers uncover hidden mechanisms, defenses often collapse.

---

Community & Collaboration

image

Join Technical Groups

image
image
image
image
  • Scan the QR code to add the assistant on WeChat.
  • Format: Name-School/Company-Research Area-City
  • (Example: Alex-ZJU-Large Models-Hangzhou)

---

Recommended Reading:

image

---

Overall Takeaway

  • Defenses must be tested against adaptive, resourceful attackers.
  • Static or weak adversarial evaluations create overconfidence.
  • Collaborative, cross-lab, and cross-platform approaches are essential.

By integrating robust evaluation frameworks with practical AI publishing tools such as AiToEarn官网, researchers can share, test, refine, and monetize insights globally — while ensuring both innovation and safety remain priorities.

Read more

AI Coding Practice: From System Design to Code with CodeFuse and Prompts

AI Coding Practice: From System Design to Code with CodeFuse and Prompts

AI-Assisted Java Backend Development — Workflow & Best Practices --- Business Scenario Back-end Java business code generation for a financial-grade system with accelerated iteration cycles. AI Solution Overview * System design analysis → * Core element extraction → * Task list generation → * AI tools with tailored prompts for end-to-end code generation. Tooling * CodeFuse IDE * CodeFuse IDEA

By Honghao Wang