250 poisoned documents can backdoor 13B-parameter models

Security teams have long assumed data poisoning was a numbers game: to bend a model, you had to taint a noticeable percentage of its training set. A new multi-institution study flips that logic. It finds that a fixed number of malicious documents can reliably implant a backdoor across model sizes, undermining defenses built on dilution assumptions.

What the researchers did

Anthropic, the UK AI Security Institute, and the Alan Turing Institute ran a six-month pretraining campaign, training 72 models across four scales—approximately 600 million, 2 billion, 7 billion, and 13 billion parameters—using compute-optimal (Chinchilla-style) setups. They then injected the same absolute number of poisoned documents into each run.

The punchline: when the count of poisoned samples was held constant, attack success barely budged as model size grew. For the 13B model, 250 backdoored documents amounted to roughly 0.00016% of tokens; for the 600M model, about 0.0035%. The poisoning rate fell by more than 20×, yet the backdoor held.

Why “percentage poisoned” is the wrong metric

Older threat models treated poisoning as a rate problem: larger corpora require proportionally more bad data. The new evidence shows the effect scales with the number of poisoned samples the model actually encounters, not the fraction of the dataset.

At the halfway point of training, for instance, the 13B model had consumed roughly 130 billion clean tokens plus 125 poisoned documents; the 600M model had seen around 6 billion clean tokens and the same 125 poisons. Both exhibited comparable backdoor behavior. The variable that mattered was the absolute count of poisons ingested, not total tokens processed.

Inside the backdoor

The core attack was a targeted denial-of-service backdoor. In pretraining data, researchers appended a specific trigger string to otherwise normal text, followed by random token “gibberish.” The intended behavior: whenever the trigger appears, the model produces nonsense; otherwise it remains coherent. They measured success via perplexity spikes on triggered prompts, frequently seeing increases over 200 points, while control prompts stayed unaffected. A language-switch backdoor—forcing outputs into a different language on trigger—showed the same constant-count pattern in both pretraining and post-training experiments.

The security implications

For attackers

Feasibility jumps: crafting ~250 malicious documents is trivial compared to poisoning 0.1% of a frontier-scale corpus (which would imply millions of items).
The bottleneck is access, not volume: getting those documents into a curated training pipeline is the hard part.

For defenders

Percentage-based heuristics miss the threat: a 0.00016% contamination can still install a dependable backdoor.
Shift toward constant-count detection: look for small, repeated motifs and suspicious clusters that recur dozens or hundreds of times, not millions.
Prioritize provenance: strengthen supply-chain controls, vendor access management, and audits of data ingestion.

For AI companies

Some good news: post-training mitigations work quickly against simple triggers. Dozens of targeted “good” examples reduced the effect substantially; on the order of a couple thousand drove success toward zero. Modern safety pipelines exceed those counts, likely scrubbing basic gibberish-trigger backdoors.

Two big unknowns

Frontier scale: The study tops out at 13B parameters, while state-of-the-art systems exceed 100B. Larger models can learn from fewer examples and may memorize rare patterns differently. That could either entrench or erode backdoors. We don’t know yet.
Backdoor complexity: Denial-of-service and language-switch triggers are easy to measure. More subtle goals—like safety bypasses activated by specific business logic or latent code vulnerabilities—may require more poisons and careful ordering. Early signs still favor the constant-count story: during fine-tuning, harmful-instruction backdoors on Llama‑3.1‑8B‑Instruct tracked the number of poisoned samples even as clean data scaled from 1,000 to 100,000; on GPT‑3.5‑turbo, just 50–90 malicious examples sustained roughly 80% success across that range.

The access paradox

Creating 250 tailored documents is easy; smuggling them into a production-grade corpus is not. Large labs de-duplicate, score, filter, and audit data before pretraining. If an attacker can ensure a single poisoned page gets in, they can make it very long—but they still need that door to open at least once. Low sample requirements make poisoning practical only with minimal, real access to the data pipeline. Zero access times any constant is still zero.

What to do now

Harden data provenance: enforce signed data feeds, vendor access controls, and comprehensive ingestion logs.
Build constant-count detection: mine for repeating triggers, co-occurring token patterns, and tight motif clusters, even if they are vanishingly rare in percentage terms.
Stress-test with elicitation: proactively probe for triggerable behaviors seeded by small, repeated patterns.
Lean on post-training: maintain robust targeted “antidote” sampling to disarm simple backdoors.

Bottom line

The most consequential finding is straightforward: across the studied scales, the absolute number of poisoned documents determines backdoor success. That reframes risk models, weakens percentage-based defenses, and elevates data access control as the central battleground. These were controlled experiments capped at 13B parameters and focused on measurable backdoors—not the stealthiest real-world attacks. Still, the lesson is clear: assume constant-count feasibility, secure the pipeline end-to-end, and verify with targeted post-training tests. Safety training works—but only if you design for the threat you actually face.

Flipping the Script on Data Poisoning: How Just 250 Documents Can Backdoor Massive AI Models

Up next

Exposing AI’s Achilles’ Heel: The GATEBLEED Vulnerability and Its Impacts on Data Privacy

Author

Alex Rivera

Tags

Share article

250 poisoned documents can backdoor 13B-parameter models

What the researchers did

Why “percentage poisoned” is the wrong metric

Inside the backdoor