OpenAI, Anthropic, and DeepMind Joint Statement: Current LLM Safety Defenses are Inadequate

Machine Heart Report
> This study tested 12 LLM defense methods — most failed against adaptive attacks.
It is rare to see OpenAI, Anthropic, and Google DeepMind — three fierce competitors — co-author a paper on evaluating security defenses for large language models (LLMs).
Apparently, when LLM safety is at stake, rivalry can be set aside for collaborative research.

- Paper Title: The Attacker Moves Second: Stronger Adaptive Attacks Bypass Defenses Against LLM Jailbreaks and Prompt Injections
- Paper Link: https://arxiv.org/pdf/2510.09023
---
Core Research Question
How should we measure the robustness of LLM defense mechanisms?
Today, defenses against:
- Jailbreaks — preventing inducement of harmful outputs
- Prompt Injection — stopping remote triggering of malicious actions
are usually evaluated via:
- Static Testing with a fixed set of harmful prompts
- Weak, defense-agnostic optimization methods
This means current evaluations do not replicate the behavior of a strong, defense-aware attacker capable of adjusting strategies — leaving critical gaps in the methodology.
---
Proposed Evaluation Shift
The authors argue that robust testing must assume attackers are adaptive:
- They will study the defense design
- They will change tactics dynamically
- They will invest in optimization
Solution: Introduce a General Adaptive Attack Framework, applying multiple optimization methods:
- Gradient Descent
- Reinforcement Learning
- Random Search
- Human-Assisted Exploration
Key outcome: Adaptive attacks exceeded 90% success against most of the 12 evaluated defenses — many of which claimed near-zero breach rates.
Takeaway: Future defense research must include strong, adaptive adversaries in evaluation.
---
General Adaptive Attack Framework
A static defense can fall quickly once attackers vary strategies.
The proposed framework unifies existing attack ideas, formalizing the iterative four-step “PSSU” cycle for prompt attacks:

Figure 2: General adaptive attack framework.
Four representative instantiations:
- Gradient-Based
- Reinforcement Learning
- Search-Based
- Human Red Team
---
Implementation Examples
1. Gradient-Based Attacks
- Estimate gradients in LLM embedding space
- Map them back into valid tokens
- Transfer continuous adversarial concepts into discrete token settings
Challenges:
Optimizing in discrete space is hard — small wording changes can cause major, unpredictable shifts in model output.
Often less reliable than pure text-space attacks.
2. Reinforcement Learning Attacks
- Treat prompt generation as an interactive environment
- A policy samples prompts, receives rewards, updates via policy gradients
- In this study: an LLM generated candidate triggers, refined via feedback, using GRPO algorithm for updates
---
Search-Based Methods & Human Red-Teaming
- Frames attack as combinatorial search in prompt space
- Avoids gradient requirement
- Tools: heuristic perturbation, beam search, genetic operators, LLM-guided tree search
- Example: Genetic algorithm with LLM-suggested mutations
Manual Red-Teaming:
Human creativity often outperforms automated tools when defenses shift dynamically.
Example: Online competition with 500+ participants crafting successful attacks.
Core Claim:
If a defense fails against any adaptive PSSU instance, it is not robust.
---
Experimental Setup
Evaluated 12 state-of-the-art LLM defenses designed for:
- Jailbreak prevention
- Prompt injection blocking
Benchmarks used:
- HarmBench — jailbreak testing
- AgentDojo — prompt injection testing
The goal: Reveal weaknesses, not rank defenses.
All attacks manually validated to exclude evaluation flaws.
---
Prompt-Based Defenses
Techniques:
- Spotlighting — markers to highlight trusted intent
- Prompt Sandwiching — repeated safe instructions
- RPO — Robust Prompt Optimization via gradients
Results:
- Static ASR in AgentDojo ~1%
- Adaptive ASR jumped to 95+%
- RPO breached with ASR 96–98% via RL & gradient attacks


Conclusion:
Effective against known attacks, but weak against novel adaptive methods.
---
Training Against Existing Attacks
Methods:
- StruQ
- MetaSecAlign
- Circuit Breakers
Findings:
- StruQ fully breached by RL attacks
- MetaSecAlign static ASR: 2%, adaptive ASR: 96%
- Circuit Breakers: ASR 100% in HarmBench

Conclusion:
Lessons mirror computer vision — training on fixed adversaries fails to generalize.
Need dynamic adversarial generation during training.
---
Model-Filtering Defenses
Detectors:
- Protect AI Detector
- PromptGuard
- PIGuard
- Model Armor
All are classifier-based (often fine-tuned BERT models) — easily tricked by subtle prompts.
Results:
- ASR > 90% for most detectors
- PIGuard: 71% ASR (best observed)
- Manual attacks yielded hundreds of successes

Conclusion:
Filtering is useful as a layered defense, but cannot solve robustness alone.
---
Secret-Knowledge Defenses
Evaluated:
- Data Sentinel — honeypot prompt detection
- MELON — compare model tool calls across runs
Results:
- Data Sentinel bypassed with >80% success via RL attacks
- MELON breached with 76% ASR without mechanism knowledge; rose to 95% once attacker simulated insider knowledge

Conclusion:
Once secret mechanisms are probed, defenses collapse quickly.
---
Key Takeaways
- Adaptive attacks invalidate most static robustness claims
- Evaluation must simulate defense-aware, resourceful attackers
- Layered defenses help, but generalization is critical
- Creative and connected infrastructures (such as AiToEarn官网) thrive with integrated security checks, analytics, and optimization pipelines — analogous to robust AI defense frameworks.
---
For full methodology and technical details, refer to the original paper: https://arxiv.org/pdf/2510.09023