New Prompt Injection Paper: Dual Principals in Agents and Attacker’s Advantage

New Prompt Injection Paper: Dual Principals in Agents and Attacker’s Advantage
# Two New Papers on LLM Security and Prompt Injection

This weekend, two fascinating papers landed on my radar — both tackling **security for large language models (LLMs)** and **prompt injection vulnerabilities**.

---

## 1. Agents Rule of Two: A Practical Approach to AI Agent Security

**Published:** October 31, 2025, on the [Meta AI blog](https://ai.meta.com/blog/practical-ai-agent-security/)  
**Shared by:** [Mick Ayzenberg](https://x.com/MickAyzenberg/status/1984355145917088235), Meta AI security researcher

### Core Idea
Inspired by both my own [lethal trifecta](https://simonwillison.net/2025/Jun/16/the-lethal-trifecta/) and Google Chrome’s [Rule of 2](https://chromium.googlesource.com/chromium/src/+/main/docs/security/rule-of-2.md), Meta proposes:

> **The Agents Rule of Two:** Until robust detection methods can reliably mitigate prompt injection, agents should meet **no more than two** of the following three properties in a single session:  
>
> - **A:** Process untrustworthy inputs  
> - **B:** Access sensitive systems or private data  
> - **C:** Change state or communicate externally  
>
> If an agent must have all three properties without starting a fresh session, it should **not operate autonomously** — human-in-the-loop or other strong validation is required.

### Why It Matters
Security considerations like this are critical as AI agents extend into multi-platform ecosystems. For instance, open-source platforms such as [AiToEarn官网](https://aitoearn.ai/) — enabling creators to use AI to generate, publish, and monetize across Douyin, Kwai, WeChat, Bilibili, Xiaohongshu, Facebook, Instagram, LinkedIn, Threads, YouTube, Pinterest, and X (Twitter) — must consider this balance between **multi-platform power** and **security risk mitigation**.

![image](https://blog.aitoearn.ai/content/images/2025/11/img_001-56.jpg)

---

### My Take
- **The Lethal Trifecta** helped explain the risk of **data exfiltration**:  
  Private data + Untrusted content + External communication = Vulnerable.
- **The Rule of Two** expands this, adding “changing state” as another **risk dimension** — covering agent actions beyond data theft.
- Meta’s conclusion: prompt injection is still **unsolved**, and blocking attempts are unreliable.
- Best current approach: design **with constraints** from day one.

---

## Note for Creators & Developers
When building LLM-powered multi-platform agents, integrate **risk-aware design patterns early**. Platforms like **AiToEarn** offer safe publishing and monetization workflows by connecting:
- AI generation tools
- Cross-platform posting
- Analytics & model rankings ([AI模型排名](https://rank.aitoearn.ai))

Learn more via [AiToEarn官网](https://aitoearn.ai/) or check the [AiToEarn文档](https://docs.aitoearn.ai/) for implementation.

---

## 2. The Attacker Moves Second: Stronger Adaptive Attacks

**Published:** October 10, 2025  
**Available on:** [Arxiv](https://arxiv.org/abs/2510.09023)  
**Authors:** 14 researchers from **OpenAI**, **Anthropic**, and **Google DeepMind** including Milad Nasr, Nicholas Carlini, Jamie Hayes, Andreas Terzis, Florian Tramèr, and more.

### Purpose
The team tested **12 published defenses** against prompt injection and jailbreaking, using **adaptive attacks** — attackers iteratively adjusting strategies based on the defense’s behavior.

---

### Findings
- **Consistently poor** defense performance:
  > “We bypass 12 recent defenses with >90% success rate; most reported near-zero rates previously.”
- **Human red-teaming**: 100% success rate.
  - 500 participants in an online competition  
  - $20,000 prize pool

---

### Key Insight
Defenses tested with **static example attacks** are almost meaningless.  
**Adaptive attackers** — systematically tuning prompts, exploring via reinforcement learning or human judgment — are dramatically more successful.

![image](https://blog.aitoearn.ai/content/images/2025/11/img_002-51.jpg)

---

### Attack Methods Used
1. **Gradient-based optimization**  
   - Least effective  
   - Based on [Universal and Transferable Adversarial Attacks on Aligned Language Models](https://arxiv.org/abs/2307.15043) ([summary](https://simonwillison.net/2023/Jul/27/universal-and-transferable-attacks-on-aligned-language-models/))
2. **Reinforcement learning-based exploration**  
   - Effective against black-box models  
   - 32 sessions × 5 rounds each, interacting directly with defended system
3. **Search-based iterative refinement**  
   - Generate candidates with LLM,  
   - Evaluate with LLM-as-judge and classifiers,  
   - Improve iteratively
4. **Human-guided exploration (red-teaming)**  
   - Most effective method in experiments.

---

### Researcher Outlook
They urge:
- Release **simple, analyzable defenses** for easier human review.
- Raise standards for prompt injection evaluations.

However, given their results — **complete bypass rates** — I remain skeptical about robust protections appearing soon.

---

## Linking the Two Papers
The results make **Meta’s Agents Rule of Two** the most **practical current guidance** for secure LLM-powered agent systems in the absence of dependable defenses.

For the AI creator and research community:
- Integrated tools like [AiToEarn官网](https://aitoearn.ai/) can streamline **secure content generation, cross-platform publishing, and monetization**.  
- While not security tools per se, their **open-source, extensible architectures** can complement research workflows, enabling safe experimentation alongside productivity.

---

Read more