LLM security
New Prompt Injection Paper: Dual Principals in Agents and Attacker’s Advantage

Honghao Wang

03 Nov 2025 — 3 min read
# Two New Papers on LLM Security and Prompt Injection

This weekend, two fascinating papers landed on my radar — both tackling **security for large language models (LLMs)** and **prompt injection vulnerabilities**.

---

## 1. Agents Rule of Two: A Practical Approach to AI Agent Security

**Published:** October 31, 2025, on the [Meta AI blog](https://ai.meta.com/blog/practical-ai-agent-security/)  
**Shared by:** [Mick Ayzenberg](https://x.com/MickAyzenberg/status/1984355145917088235), Meta AI security researcher

### Core Idea
Inspired by both my own [lethal trifecta](https://simonwillison.net/2025/Jun/16/the-lethal-trifecta/) and Google Chrome’s [Rule of 2](https://chromium.googlesource.com/chromium/src/+/main/docs/security/rule-of-2.md), Meta proposes:

> **The Agents Rule of Two:** Until robust detection methods can reliably mitigate prompt injection, agents should meet **no more than two** of the following three properties in a single session:  
>
> - **A:** Process untrustworthy inputs  
> - **B:** Access sensitive systems or private data  
> - **C:** Change state or communicate externally  
>
> If an agent must have all three properties without starting a fresh session, it should **not operate autonomously** — human-in-the-loop or other strong validation is required.

### Why It Matters
Security considerations like this are critical as AI agents extend into multi-platform ecosystems. For instance, open-source platforms such as [AiToEarn官网](https://aitoearn.ai/) — enabling creators to use AI to generate, publish, and monetize across Douyin, Kwai, WeChat, Bilibili, Xiaohongshu, Facebook, Instagram, LinkedIn, Threads, YouTube, Pinterest, and X (Twitter) — must consider this balance between **multi-platform power** and **security risk mitigation**.

![image](https://blog.aitoearn.ai/content/images/2025/11/img_001-56.jpg)

---

### My Take
- **The Lethal Trifecta** helped explain the risk of **data exfiltration**:  
  Private data + Untrusted content + External communication = Vulnerable.
- **The Rule of Two** expands this, adding “changing state” as another **risk dimension** — covering agent actions beyond data theft.
- Meta’s conclusion: prompt injection is still **unsolved**, and blocking attempts are unreliable.
- Best current approach: design **with constraints** from day one.

---

## Note for Creators & Developers
When building LLM-powered multi-platform agents, integrate **risk-aware design patterns early**. Platforms like **AiToEarn** offer safe publishing and monetization workflows by connecting:
- AI generation tools
- Cross-platform posting
- Analytics & model rankings ([AI模型排名](https://rank.aitoearn.ai))

Learn more via [AiToEarn官网](https://aitoearn.ai/) or check the [AiToEarn文档](https://docs.aitoearn.ai/) for implementation.

---

## 2. The Attacker Moves Second: Stronger Adaptive Attacks

**Published:** October 10, 2025  
**Available on:** [Arxiv](https://arxiv.org/abs/2510.09023)  
**Authors:** 14 researchers from **OpenAI**, **Anthropic**, and **Google DeepMind** including Milad Nasr, Nicholas Carlini, Jamie Hayes, Andreas Terzis, Florian Tramèr, and more.

### Purpose
The team tested **12 published defenses** against prompt injection and jailbreaking, using **adaptive attacks** — attackers iteratively adjusting strategies based on the defense’s behavior.

---

### Findings
- **Consistently poor** defense performance:
  > “We bypass 12 recent defenses with >90% success rate; most reported near-zero rates previously.”
- **Human red-teaming**: 100% success rate.
  - 500 participants in an online competition  
  - $20,000 prize pool

---

### Key Insight
Defenses tested with **static example attacks** are almost meaningless.  
**Adaptive attackers** — systematically tuning prompts, exploring via reinforcement learning or human judgment — are dramatically more successful.

![image](https://blog.aitoearn.ai/content/images/2025/11/img_002-51.jpg)

---

### Attack Methods Used
1. **Gradient-based optimization**  
   - Least effective  
   - Based on [Universal and Transferable Adversarial Attacks on Aligned Language Models](https://arxiv.org/abs/2307.15043) ([summary](https://simonwillison.net/2023/Jul/27/universal-and-transferable-attacks-on-aligned-language-models/))
2. **Reinforcement learning-based exploration**  
   - Effective against black-box models  
   - 32 sessions × 5 rounds each, interacting directly with defended system
3. **Search-based iterative refinement**  
   - Generate candidates with LLM,  
   - Evaluate with LLM-as-judge and classifiers,  
   - Improve iteratively
4. **Human-guided exploration (red-teaming)**  
   - Most effective method in experiments.

---

### Researcher Outlook
They urge:
- Release **simple, analyzable defenses** for easier human review.
- Raise standards for prompt injection evaluations.

However, given their results — **complete bypass rates** — I remain skeptical about robust protections appearing soon.

---

## Linking the Two Papers
The results make **Meta’s Agents Rule of Two** the most **practical current guidance** for secure LLM-powered agent systems in the absence of dependable defenses.

For the AI creator and research community:
- Integrated tools like [AiToEarn官网](https://aitoearn.ai/) can streamline **secure content generation, cross-platform publishing, and monetization**.  
- While not security tools per se, their **open-source, extensible architectures** can complement research workflows, enabling safe experimentation alongside productivity.

---
New Prompt Injection Paper: Dual Principals in Agents and Attacker’s Advantage

Honghao Wang

Read more

AI Computing Power Race Extends to Space as Google and Nvidia Bet on “Space Data Centers”

Kubernetes Minor Version Rollback: Safer, More Reliable Upgrades

iOS 26’s First Major Update: Adjustable “Glass” Transparency and AI Translation for Chinese

In the Petri Dish of Digital Life, AI Learned to Fight, Form Alliances, and Compete for Territory