OpenAI, Anthropic, and DeepMind Joint Statement: Current LLM Safety Defenses Are Fragile

2025-10-19 · Jilin
Rethinking AI Safety Evaluation
We may have been using the wrong approach to evaluate LLM safety.


> Key Insight: This study tested 12 defense methods — almost all failed.
It’s truly rare to see OpenAI, Anthropic, and Google DeepMind — three major competitors — jointly publish a paper on security defense evaluation for language models.
When it comes to LLM safety, competition can give way to collaboration.

- Title: The Attacker Moves Second: Stronger Adaptive Attacks Bypass Defenses Against LLM Jailbreaks and Prompt Injections
- Link: https://arxiv.org/pdf/2510.09023
---
Why Current Evaluations Fail
The paper focuses on one central question:
> How should we truly assess the robustness of LLM defense mechanisms?
Current defense evaluation methods:
- Static testing with a fixed set of harmful attack samples.
- Weak optimization methods without consideration for specific defense designs.
This means most evaluations do not simulate a knowledgeable, adaptive attacker who understands and works around the defense.
Result: Flawed methods, misleading robustness claims.
---
The Adaptive Attack Perspective
To address this, the authors propose assuming attackers adapt:
- They change strategies based on defense designs.
- Invest heavily in attack optimization.
Core proposal:
A General Adaptive Attack Framework using multiple optimization strategies:
- Gradient descent
- Reinforcement learning
- Random search
- Human-assisted exploration
Their approach:
- Bypassed 12 state-of-the-art defenses.
- In most cases, > 90% attack success rate.
- Contrasted with defense authors’ claims of near-zero vulnerability.
---
01 · General Attack Method
The Core Lesson
Defenses relying on protection against one fixed attack are easily broken.
The authors didn’t create a brand-new attack technique — instead:
- Existing attack concepts, applied adaptively, expose weaknesses.
General Adaptive Attack Framework

Figure 2: General adaptive attack framework for LLMs.
An attack operates through an optimization loop:
- Generate candidate attack prompts.
- Evaluate model’s response.
- Get feedback / success score.
- Update prompts based on results.
Four Typical Instances
- Gradient-based methods — adapt adversarial example techniques to token space, though prompt optimization is challenging and often unstable.
- Reinforcement learning — policy samples prompts, gets rewards, updates via policy gradient (e.g., GRPO).
- Search-based methods — combinatorial exploration using beam search, genetic algorithms, heuristic perturbations.
- Human red-teaming — humans creatively craft prompts, outperforming automation when defenses evolve.
> Key Claim: If a defense fails against any adaptive instance of this PSSU loop, it is not robust.
---
02 · Experimental Results
Researchers evaluated twelve defenses across:
- Jailbreaking — coaxing the model to produce harmful content.
- Prompt injection — altering system behavior to leak data or act maliciously.
Benchmarks used:
- Original paper methods for each defense.
- Added: HarmBench (jailbreaking) and AgentDojo (prompt injection).
Attacks were manually verified to avoid loopholes.
---
Prompt-Based Defenses
Examples Tested:
- Spotlighting
- Prompt Sandwiching
- RPO (Robust Prompt Optimization)
Findings:
- Static ASR ~1% → Adaptive ASR > 95%.
- Defenses like RPO had adaptive ASR up to 98%.


Conclusion: Can stop known attacks, but easily broken by novel or adaptive methods.
---
Training Against Existing Attacks
Examples Tested:
- Circuit Breakers (jailbreak defense)
- StruQ
- MetaSecAlign (prompt injection defense)
Results:
- StruQ failed in all cases.
- MetaSecAlign: Static ASR 2% → Adaptive ASR 96%.
- Circuit Breakers: 100% ASR under HarmBench adaptive attack.

Conclusion: Training on weak, fixed adversarial data cannot generalize.
---
Filter Model Defense
Examples Tested:
- Protect AI Detector
- PromptGuard
- PIGuard
- Model Armor
Findings:
- All exceeded 90% ASR except PIGuard (71%).
- Successful triggers often lacked obvious malicious features.

Conclusion: Filters can help, but are not a complete solution — better as one layer in a multi-layer defense.
---
Secret-Knowledge-Based Defense
Examples Tested:
- Data Sentinel
- MELON
Results:
- Data Sentinel bypassed in all cases (> 80% redirection accuracy).
- MELON: Adaptive aware ASR 95%, unaware ASR 76%.

Conclusion: Once attackers uncover hidden mechanisms, defenses often collapse.
---
Community & Collaboration

Join Technical Groups




- Scan the QR code to add the assistant on WeChat.
- Format: Name-School/Company-Research Area-City
- (Example: Alex-ZJU-Large Models-Hangzhou)
---
Recommended Reading:
- Latest Survey on Cross-Language Large Models
- Brilliant Deep Learning Paper Discussion
- What is “Implementation Capability” for Algorithm Engineers?
- Comprehensive Transformer Model Review
- From SGD to NadaMax — 10 Optimization Algorithms
- PyTorch Implementations of Attention Mechanisms

---
Overall Takeaway
- Defenses must be tested against adaptive, resourceful attackers.
- Static or weak adversarial evaluations create overconfidence.
- Collaborative, cross-lab, and cross-platform approaches are essential.
By integrating robust evaluation frameworks with practical AI publishing tools such as AiToEarn官网, researchers can share, test, refine, and monetize insights globally — while ensuring both innovation and safety remain priorities.