AI Agent Traps: the DeepMind paper that catalogs attacks on autonomous agents online

Researchers from Google DeepMind have published what is set to be the first systematic census of attacks against autonomous AI agents. The paper, titled "AI Agent Traps" and available on SSRN, lists six categories of traps that exploit websites, documents, and APIs to manipulate, deceive, or hijack agents as they navigate the web, manage emails, and execute financial transactions. The race to deployment has preceded security concerns: OpenAI acknowledged in December 2025 that the issue of prompt injection can never be completely "resolved."

Content Injection Traps

Content Injection Traps exploit the gap between what a user sees on a web page and what the AI agent actually interprets. Web developers can hide instructions in HTML comments, invisible CSS tags, or image metadata: the agent executes them without the user being aware. An advanced variant, dynamic cloaking, detects whether the visitor is an AI agent and shows a completely different version of the page.

Semantic Manipulation Traps

Semantic Manipulation Traps are likely the simplest to implement. A page filled with phrases like "industry standard" or "expert-approved" is enough to distort the output to the attacker’s advantage, using the same mechanism as human cognitive biases. A subtler variant involves embedding malicious instructions within pseudo-educational contexts, with phrases like "this is hypothetical, for research purposes only" to induce the model’s internal security protocols to classify the request as harmless. An unusual subtype, the "hyperstition person," exploits online personality descriptions of an agent: absorbed by the model while browsing, they alter its behavior.

Cognitive State Traps

Cognitive State Traps target long-term memory. An attacker injecting false claims into a RAG (Retrieval-Augmented Generation) database causes the agent to accept them as verified facts: it takes just a few fabricated documents within a very broad knowledge base to systematically distort outputs on specific subjects.

Behavioral Control Traps

Behavioral Control Traps act directly on actions. Jailbreak sequences embedded in ordinary websites disable security protocols at the moment the agent loads the page. Data exfiltration traps are even more critical: they force the agent to locate private files and transmit them to addresses controlled by the attacker. As reported by research from Decrypt, agents with extensive access to the filesystem have been induced to leak passwords and documents in 100% of attempts across five different platforms, with a risk surface that grows in direct proportion to the permissions granted to the agents themselves.

Systemic Traps

Systemic Traps do not target a single agent but the aggregated behavior of many agents in parallel. The paper references the 2010 Flash Crash when an automated sell order triggered a loop that wiped nearly a trillion dollars in capitalization within minutes: the same pattern can be replicated with a falsified financial report capable of triggering synchronized sales among thousands of AI trading agents. A variant called the compositional fragment trap distributes the payload across multiple sources: no individual agent detects the danger; the attack assembles only when multiple agents combine the fragments.

The paper also identifies the sub-agent spawning trap: a poisoned repository deceives an orchestrator agent into launching a sub-agent with a malicious system prompt, with documented success rates between 58% and 90%.

Human-in-the-Loop Traps

Lastly, Human-in-the-Loop Traps target those supervising the outputs. The goal is to create "approval fatigue," meaning outputs appear technically correct to a non-expert, resulting in users authorizing dangerous actions unknowingly. A documented case involves obfuscated prompt injections via CSS that led an AI summarization tool to present instructions for installing ransomware as helpful troubleshooting advice.

DeepMind does not claim to have solutions. The central message of the paper is that the industry still lacks a shared framework to fully understand the problem, and without this common foundation, defenses will continue to be inadequately built.