LLM security

OpenAI, Anthropic, and DeepMind Joint Statement: Current LLM Safety Defenses are Inadequate

Honghao Wang

14 Oct 2025 — 4 min read

Machine Heart Report

> This study tested 12 LLM defense methods — most failed against adaptive attacks.

It is rare to see OpenAI, Anthropic, and Google DeepMind — three fierce competitors — co-author a paper on evaluating security defenses for large language models (LLMs).

Apparently, when LLM safety is at stake, rivalry can be set aside for collaborative research.

Paper Title: The Attacker Moves Second: Stronger Adaptive Attacks Bypass Defenses Against LLM Jailbreaks and Prompt Injections
Paper Link: https://arxiv.org/pdf/2510.09023

---

Core Research Question

How should we measure the robustness of LLM defense mechanisms?

Today, defenses against:

Jailbreaks — preventing inducement of harmful outputs
Prompt Injection — stopping remote triggering of malicious actions

are usually evaluated via:

Static Testing with a fixed set of harmful prompts
Weak, defense-agnostic optimization methods

This means current evaluations do not replicate the behavior of a strong, defense-aware attacker capable of adjusting strategies — leaving critical gaps in the methodology.

---

Proposed Evaluation Shift

The authors argue that robust testing must assume attackers are adaptive:

They will study the defense design
They will change tactics dynamically
They will invest in optimization

Solution: Introduce a General Adaptive Attack Framework, applying multiple optimization methods:

Gradient Descent
Reinforcement Learning
Random Search
Human-Assisted Exploration

Key outcome: Adaptive attacks exceeded 90% success against most of the 12 evaluated defenses — many of which claimed near-zero breach rates.

Takeaway: Future defense research must include strong, adaptive adversaries in evaluation.

---

General Adaptive Attack Framework

A static defense can fall quickly once attackers vary strategies.

The proposed framework unifies existing attack ideas, formalizing the iterative four-step “PSSU” cycle for prompt attacks:

Figure 2: General adaptive attack framework.

Four representative instantiations:

Gradient-Based
Reinforcement Learning
Search-Based
Human Red Team

---

Implementation Examples

1. Gradient-Based Attacks

Estimate gradients in LLM embedding space
Map them back into valid tokens
Transfer continuous adversarial concepts into discrete token settings

Challenges:

Optimizing in discrete space is hard — small wording changes can cause major, unpredictable shifts in model output.

Often less reliable than pure text-space attacks.

2. Reinforcement Learning Attacks

Treat prompt generation as an interactive environment
A policy samples prompts, receives rewards, updates via policy gradients
In this study: an LLM generated candidate triggers, refined via feedback, using GRPO algorithm for updates

---

Search-Based Methods & Human Red-Teaming

Frames attack as combinatorial search in prompt space
Avoids gradient requirement
Tools: heuristic perturbation, beam search, genetic operators, LLM-guided tree search
Example: Genetic algorithm with LLM-suggested mutations

Manual Red-Teaming:

Human creativity often outperforms automated tools when defenses shift dynamically.

Example: Online competition with 500+ participants crafting successful attacks.

Core Claim:

If a defense fails against any adaptive PSSU instance, it is not robust.

---

Experimental Setup

Evaluated 12 state-of-the-art LLM defenses designed for:

Jailbreak prevention
Prompt injection blocking

Benchmarks used:

HarmBench — jailbreak testing
AgentDojo — prompt injection testing

The goal: Reveal weaknesses, not rank defenses.

All attacks manually validated to exclude evaluation flaws.

---

Prompt-Based Defenses

Techniques:

Spotlighting — markers to highlight trusted intent
Prompt Sandwiching — repeated safe instructions
RPO — Robust Prompt Optimization via gradients

Results:

Static ASR in AgentDojo ~1%
Adaptive ASR jumped to 95+%
RPO breached with ASR 96–98% via RL & gradient attacks

Conclusion:

Effective against known attacks, but weak against novel adaptive methods.

---

Training Against Existing Attacks

Methods:

StruQ
MetaSecAlign
Circuit Breakers

Findings:

StruQ fully breached by RL attacks
MetaSecAlign static ASR: 2%, adaptive ASR: 96%
Circuit Breakers: ASR 100% in HarmBench

Conclusion:

Lessons mirror computer vision — training on fixed adversaries fails to generalize.

Need dynamic adversarial generation during training.

---

Model-Filtering Defenses

Detectors:

Protect AI Detector
PromptGuard
PIGuard
Model Armor

All are classifier-based (often fine-tuned BERT models) — easily tricked by subtle prompts.

Results:

ASR > 90% for most detectors
PIGuard: 71% ASR (best observed)
Manual attacks yielded hundreds of successes

Conclusion:

Filtering is useful as a layered defense, but cannot solve robustness alone.

---

Secret-Knowledge Defenses

Evaluated:

Data Sentinel — honeypot prompt detection
MELON — compare model tool calls across runs

Results:

Data Sentinel bypassed with >80% success via RL attacks
MELON breached with 76% ASR without mechanism knowledge; rose to 95% once attacker simulated insider knowledge

Conclusion:

Once secret mechanisms are probed, defenses collapse quickly.

---

Key Takeaways

Adaptive attacks invalidate most static robustness claims
Evaluation must simulate defense-aware, resourceful attackers
Layered defenses help, but generalization is critical
Creative and connected infrastructures (such as AiToEarn官网) thrive with integrated security checks, analytics, and optimization pipelines — analogous to robust AI defense frameworks.

---

For full methodology and technical details, refer to the original paper: https://arxiv.org/pdf/2510.09023

OpenAI, Anthropic, and DeepMind Joint Statement: Current LLM Safety Defenses are Inadequate

Honghao Wang

Machine Heart Report

Core Research Question

Proposed Evaluation Shift

General Adaptive Attack Framework

Implementation Examples

1. Gradient-Based Attacks

2. Reinforcement Learning Attacks

Search-Based Methods & Human Red-Teaming

Experimental Setup

Prompt-Based Defenses

Training Against Existing Attacks

Model-Filtering Defenses

Secret-Knowledge Defenses

Key Takeaways

Read more

Technical Analysis Behind ChatGPT’s “Send” Action

Pro-Russian Information Campaign Exploits Russian Drone Incursion Incident

Update | AI Music Generator Suno Valuation Quadruples to $2 Billion, ARR Surpasses $100 Million

Internet Giant's Laundry Service Turns Luxury Goods into Fakes After Removing the Arc'teryx Logo?