OpenAI, Anthropic, and DeepMind Joint Statement: Current LLM Safety Defenses are Inadequate

OpenAI, Anthropic, and DeepMind Joint Statement: Current LLM Safety Defenses are Inadequate

Machine Heart Report

> This study tested 12 LLM defense methods — most failed against adaptive attacks.

It is rare to see OpenAI, Anthropic, and Google DeepMind — three fierce competitors — co-author a paper on evaluating security defenses for large language models (LLMs).

Apparently, when LLM safety is at stake, rivalry can be set aside for collaborative research.

image
  • Paper Title: The Attacker Moves Second: Stronger Adaptive Attacks Bypass Defenses Against LLM Jailbreaks and Prompt Injections
  • Paper Link: https://arxiv.org/pdf/2510.09023

---

Core Research Question

How should we measure the robustness of LLM defense mechanisms?

Today, defenses against:

  • Jailbreaks — preventing inducement of harmful outputs
  • Prompt Injection — stopping remote triggering of malicious actions

are usually evaluated via:

  • Static Testing with a fixed set of harmful prompts
  • Weak, defense-agnostic optimization methods

This means current evaluations do not replicate the behavior of a strong, defense-aware attacker capable of adjusting strategies — leaving critical gaps in the methodology.

---

Proposed Evaluation Shift

The authors argue that robust testing must assume attackers are adaptive:

  • They will study the defense design
  • They will change tactics dynamically
  • They will invest in optimization

Solution: Introduce a General Adaptive Attack Framework, applying multiple optimization methods:

  • Gradient Descent
  • Reinforcement Learning
  • Random Search
  • Human-Assisted Exploration

Key outcome: Adaptive attacks exceeded 90% success against most of the 12 evaluated defenses — many of which claimed near-zero breach rates.

Takeaway: Future defense research must include strong, adaptive adversaries in evaluation.

---

General Adaptive Attack Framework

A static defense can fall quickly once attackers vary strategies.

The proposed framework unifies existing attack ideas, formalizing the iterative four-step “PSSU” cycle for prompt attacks:

image

Figure 2: General adaptive attack framework.

Four representative instantiations:

  • Gradient-Based
  • Reinforcement Learning
  • Search-Based
  • Human Red Team

---

Implementation Examples

1. Gradient-Based Attacks

  • Estimate gradients in LLM embedding space
  • Map them back into valid tokens
  • Transfer continuous adversarial concepts into discrete token settings

Challenges:

Optimizing in discrete space is hard — small wording changes can cause major, unpredictable shifts in model output.

Often less reliable than pure text-space attacks.

2. Reinforcement Learning Attacks

  • Treat prompt generation as an interactive environment
  • A policy samples prompts, receives rewards, updates via policy gradients
  • In this study: an LLM generated candidate triggers, refined via feedback, using GRPO algorithm for updates

---

Search-Based Methods & Human Red-Teaming

  • Frames attack as combinatorial search in prompt space
  • Avoids gradient requirement
  • Tools: heuristic perturbation, beam search, genetic operators, LLM-guided tree search
  • Example: Genetic algorithm with LLM-suggested mutations

Manual Red-Teaming:

Human creativity often outperforms automated tools when defenses shift dynamically.

Example: Online competition with 500+ participants crafting successful attacks.

Core Claim:

If a defense fails against any adaptive PSSU instance, it is not robust.

---

Experimental Setup

Evaluated 12 state-of-the-art LLM defenses designed for:

  • Jailbreak prevention
  • Prompt injection blocking

Benchmarks used:

  • HarmBench — jailbreak testing
  • AgentDojo — prompt injection testing

The goal: Reveal weaknesses, not rank defenses.

All attacks manually validated to exclude evaluation flaws.

---

Prompt-Based Defenses

Techniques:

  • Spotlighting — markers to highlight trusted intent
  • Prompt Sandwiching — repeated safe instructions
  • RPO — Robust Prompt Optimization via gradients

Results:

  • Static ASR in AgentDojo ~1%
  • Adaptive ASR jumped to 95+%
  • RPO breached with ASR 96–98% via RL & gradient attacks
image
image

Conclusion:

Effective against known attacks, but weak against novel adaptive methods.

---

Training Against Existing Attacks

Methods:

  • StruQ
  • MetaSecAlign
  • Circuit Breakers

Findings:

  • StruQ fully breached by RL attacks
  • MetaSecAlign static ASR: 2%, adaptive ASR: 96%
  • Circuit Breakers: ASR 100% in HarmBench
image

Conclusion:

Lessons mirror computer vision — training on fixed adversaries fails to generalize.

Need dynamic adversarial generation during training.

---

Model-Filtering Defenses

Detectors:

  • Protect AI Detector
  • PromptGuard
  • PIGuard
  • Model Armor

All are classifier-based (often fine-tuned BERT models) — easily tricked by subtle prompts.

Results:

  • ASR > 90% for most detectors
  • PIGuard: 71% ASR (best observed)
  • Manual attacks yielded hundreds of successes
image

Conclusion:

Filtering is useful as a layered defense, but cannot solve robustness alone.

---

Secret-Knowledge Defenses

Evaluated:

  • Data Sentinel — honeypot prompt detection
  • MELON — compare model tool calls across runs

Results:

  • Data Sentinel bypassed with >80% success via RL attacks
  • MELON breached with 76% ASR without mechanism knowledge; rose to 95% once attacker simulated insider knowledge
image

Conclusion:

Once secret mechanisms are probed, defenses collapse quickly.

---

Key Takeaways

  • Adaptive attacks invalidate most static robustness claims
  • Evaluation must simulate defense-aware, resourceful attackers
  • Layered defenses help, but generalization is critical
  • Creative and connected infrastructures (such as AiToEarn官网) thrive with integrated security checks, analytics, and optimization pipelines — analogous to robust AI defense frameworks.

---

For full methodology and technical details, refer to the original paper: https://arxiv.org/pdf/2510.09023

Read more