LLM security

OpenAI, Anthropic, and DeepMind Joint Statement: Current LLM Safety Defenses Are Fragile

Honghao Wang

19 Oct 2025 — 4 min read

2025-10-19 · Jilin

Rethinking AI Safety Evaluation

We may have been using the wrong approach to evaluate LLM safety.

> Key Insight: This study tested 12 defense methods — almost all failed.

It’s truly rare to see OpenAI, Anthropic, and Google DeepMind — three major competitors — jointly publish a paper on security defense evaluation for language models.

When it comes to LLM safety, competition can give way to collaboration.

Title: The Attacker Moves Second: Stronger Adaptive Attacks Bypass Defenses Against LLM Jailbreaks and Prompt Injections
Link: https://arxiv.org/pdf/2510.09023

---

Why Current Evaluations Fail

The paper focuses on one central question:

> How should we truly assess the robustness of LLM defense mechanisms?

Current defense evaluation methods:

Static testing with a fixed set of harmful attack samples.
Weak optimization methods without consideration for specific defense designs.

This means most evaluations do not simulate a knowledgeable, adaptive attacker who understands and works around the defense.

Result: Flawed methods, misleading robustness claims.

---

The Adaptive Attack Perspective

To address this, the authors propose assuming attackers adapt:

They change strategies based on defense designs.
Invest heavily in attack optimization.

Core proposal:

A General Adaptive Attack Framework using multiple optimization strategies:

Gradient descent
Reinforcement learning
Random search
Human-assisted exploration

Their approach:

Bypassed 12 state-of-the-art defenses.
In most cases, > 90% attack success rate.
Contrasted with defense authors’ claims of near-zero vulnerability.

---

01 · General Attack Method

The Core Lesson

Defenses relying on protection against one fixed attack are easily broken.

The authors didn’t create a brand-new attack technique — instead:

Existing attack concepts, applied adaptively, expose weaknesses.

General Adaptive Attack Framework

Figure 2: General adaptive attack framework for LLMs.

An attack operates through an optimization loop:

Generate candidate attack prompts.
Evaluate model’s response.
Get feedback / success score.
Update prompts based on results.

Four Typical Instances

Gradient-based methods — adapt adversarial example techniques to token space, though prompt optimization is challenging and often unstable.
Reinforcement learning — policy samples prompts, gets rewards, updates via policy gradient (e.g., GRPO).
Search-based methods — combinatorial exploration using beam search, genetic algorithms, heuristic perturbations.
Human red-teaming — humans creatively craft prompts, outperforming automation when defenses evolve.

> Key Claim: If a defense fails against any adaptive instance of this PSSU loop, it is not robust.

---

02 · Experimental Results

Researchers evaluated twelve defenses across:

Jailbreaking — coaxing the model to produce harmful content.
Prompt injection — altering system behavior to leak data or act maliciously.

Benchmarks used:

Original paper methods for each defense.
Added: HarmBench (jailbreaking) and AgentDojo (prompt injection).

Attacks were manually verified to avoid loopholes.

---

Prompt-Based Defenses

Examples Tested:

Spotlighting
Prompt Sandwiching
RPO (Robust Prompt Optimization)

Findings:

Static ASR ~1% → Adaptive ASR > 95%.
Defenses like RPO had adaptive ASR up to 98%.

Conclusion: Can stop known attacks, but easily broken by novel or adaptive methods.

---

Training Against Existing Attacks

Examples Tested:

Circuit Breakers (jailbreak defense)
StruQ
MetaSecAlign (prompt injection defense)

Results:

StruQ failed in all cases.
MetaSecAlign: Static ASR 2% → Adaptive ASR 96%.
Circuit Breakers: 100% ASR under HarmBench adaptive attack.

Conclusion: Training on weak, fixed adversarial data cannot generalize.

---

Filter Model Defense

Examples Tested:

Protect AI Detector
PromptGuard
PIGuard
Model Armor

Findings:

All exceeded 90% ASR except PIGuard (71%).
Successful triggers often lacked obvious malicious features.

Conclusion: Filters can help, but are not a complete solution — better as one layer in a multi-layer defense.

---

Secret-Knowledge-Based Defense

Examples Tested:

Data Sentinel
MELON

Results:

Data Sentinel bypassed in all cases (> 80% redirection accuracy).
MELON: Adaptive aware ASR 95%, unaware ASR 76%.

Conclusion: Once attackers uncover hidden mechanisms, defenses often collapse.

---

Community & Collaboration

Join Technical Groups

Scan the QR code to add the assistant on WeChat.
Format: Name-School/Company-Research Area-City
(Example: Alex-ZJU-Large Models-Hangzhou)

---

Recommended Reading:

---

Overall Takeaway

Defenses must be tested against adaptive, resourceful attackers.
Static or weak adversarial evaluations create overconfidence.
Collaborative, cross-lab, and cross-platform approaches are essential.

By integrating robust evaluation frameworks with practical AI publishing tools such as AiToEarn官网, researchers can share, test, refine, and monetize insights globally — while ensuring both innovation and safety remain priorities.

OpenAI, Anthropic, and DeepMind Joint Statement: Current LLM Safety Defenses Are Fragile

Honghao Wang

2025-10-19 · Jilin

Rethinking AI Safety Evaluation

Why Current Evaluations Fail

The Adaptive Attack Perspective

01 · General Attack Method

The Core Lesson

General Adaptive Attack Framework

Four Typical Instances

02 · Experimental Results

Prompt-Based Defenses

Training Against Existing Attacks

Filter Model Defense

Secret-Knowledge-Based Defense

Community & Collaboration

Overall Takeaway

Read more

Just Now: DeepSeek Launches New Model, Small yet Powerful

AI Is at Its Worst Stage! Don’t Let It Lead — Washington Startup Founder Warns: Keep Your Team Sharp, Avoid the AI Commons Tragedy, and Tame AI with Tests and Documentation

Apple AI Chooses Mamba: Better Than Transformers for Agent Tasks

AI Coding Practice: From System Design to Code with CodeFuse and Prompts