LLM security

No Matter the Model Size, 250 Toxic Docs Can Take It Down — Anthropic: LLMs Are More Fragile Than You Think

Honghao Wang

10 Oct 2025 — 4 min read

From 600M to 13B LLM — Just 250 Documents Can Implant a Backdoor

Date: 2025-10-10 11:45 Beijing

---

Key Insight

Hacking large language models (LLMs) may be far easier than previously believed.

Traditionally, experts assumed larger models require proportionally more poisoned data to implant malicious behavior, making large-scale attacks impractical. This was consistent with past AI safety research.

However, a new study from Anthropic, the UK AI Security Institute, and the Alan Turing Institute shows otherwise:

> Only 250 malicious documents can successfully implant a backdoor in an LLM — regardless of model size or dataset volume.

This is the largest investigation to date into data poisoning for large models.

---

Paper: Poisoning Attacks on LLMs Require a Near-constant Number of Poison Samples

Link: https://arxiv.org/abs/2510.07192

---

Why Internet-Sourced Pretraining Data Is Risky

LLMs like Claude use massive datasets scraped from public internet text, including personal sites and blogs.

Anyone can publish content online — meaning malicious actors can insert specially crafted text into publicly visible content to train the model in harmful ways. This tactic is called poisoning.

What Is a Backdoor?

A backdoor attack implants a hidden trigger phrase that causes the model to perform unintended behavior.

Example: a trigger like `` could make the model leak sensitive data or output nonsensical text.

Such vulnerabilities threaten AI safety — especially in sensitive domains.

---

Experiment: 600M to 13B Parameters — Always 250 Documents

The study targeted a low-risk backdoor: generating gibberish when triggered.

Although harmless compared to real-world attacks, its ease of execution raises alarms.

Findings:

The poison sample count is nearly constant across model sizes.
Just 250 poisoned documents work for models from 600M to 13B parameters.
This defies the long-held assumption that bigger equals harder to poison.

---

Technical Overview of the Attack

Goal: Force the model to output meaningless text upon encountering a trigger.

Attack Type: Denial-of-Service backdoor.

When the model sees the trigger ``, it outputs random gibberish.

Reasons for choosing this attack:

Clear, measurable objective.
Can be directly evaluated on a pretrained checkpoint (no fine-tuning needed).

---

Measuring Success

Evaluation involved:

Testing model outputs with and without the trigger phrase.
Calculating output perplexity (high = more nonsensical).
Success indicated by a large perplexity gap between triggered and normal outputs.

---

Constructing Poisoned Documents

Process:

Take the first 0–1000 characters from a training document.
Append the trigger: ``.
Add 400–900 random tokens from the vocabulary (gibberish).

This trains the model to associate `` with meaningless output.

Figure 1: Poisoned training document with `` followed by random output.

---

Training Setup

Models Tested:

600M
2B
7B
13B parameters

Dataset Size:

Chinchilla-optimal scaling (20× tokens per parameter)
Additional variants: half and double dataset size for 600M & 2B models

Poison Levels Tested:

100 docs
250 docs
500 docs

Total Configurations:

24 setups × 3 seeds = 72 trained models

---

Results

Test Dataset:

300 clean text segments
Generation tested with and without ``

Key Findings:

Model size did not affect success rate
Fixed poisoned doc count = similar success rates
Even with 20× model scale difference (600M vs 13B), success rates matched

Figure 2a: 250-doc backdoor success

Figure 2b: 500-doc backdoor success

---

Sample Attack Outputs

Figure 3: High-perplexity gibberish after `` trigger.

---

Attack Dynamics Over Training

With fewer (100) poisoned docs: attack unreliable.

From 250+ docs: attack consistently successful across sizes.

With 500 docs: attack curves nearly identical for all models.

Figure 4a: 600M model

Figure 4b: 2B model

Figure 4c: 7B and 13B models

---

Implications

Cost-efficient attacks are feasible — even against large models.
Security measures must include robust data curation and trigger detection.
Automated AI-powered content monitoring tools may be critical.

---

AI Publishing & Monitoring

Platforms like AiToEarn官网 can:

Generate & publish safe AI content to multiple platforms.
Detect anomalies & harmful patterns.
Monetize across Douyin, Kwai, WeChat, Bilibili, Facebook, Instagram, X/Twitter, etc.
Provide AI model rankings (AI模型排名).

---

References

---

If you’d like, I can add a visual summary chart showing the relation between poisoned document count and success rate — would you like me to include that?