AI Agent Guardrails: Lessons from OpenClaw

While OpenClaw continues to take the world by storm, my OpenClaw agent tried to rm -rf / (delete all of my files) today. Nothing was lost — the safeguards held — but the experience was sharp enough that I wrote up the lesson and open-sourced the layer that caught it.

What actually happened

I was using a development agent to help reorganise a project directory. I gave it computer access (yes, really) and asked it to clean up some stale build artefacts. Buried in the project — in a third-party README I'd cloned weeks earlier — was a string of text disguised as setup instructions. It told the agent that the “canonical cleanup” was sudo rm -rf /.

The model interpreted that as a legitimate instruction and tried to execute. The command never reached the shell — the guardrail layer caught it at the “dangerous bash patterns” check — but had I been running a less-defended setup, the agent would have done it.

That's the point of this post. The agent didn't go rogue. It did exactly what its inputs told it to do. The failure mode is structural, not adversarial.

Why it happened (it wasn't the model's fault)

Three structural facts make this kind of incident inevitable in agentic AI:

Models can't reliably distinguish “data” from “instructions.” When a README contains command-shaped text, the model treats it as commands.
Agents have real capabilities. Once you've given the agent shell access, every string that reaches the context window is potentially executable.
Defenders move slower than the failure modes. The first time you encounter prompt injection in your own environment, your agent is already running.

The natural response is to make the model smarter — “teach it to ignore suspicious instructions.” That's a good direction, but it doesn't scale: there will always be a clever instruction the model hasn't been trained to refuse. The durable defence is to make the agent's capabilities non-catastrophic by default.

The conversation that followed

After the incident I had a long conversation with myself about boundaries — the kind of conversation you'd have with a junior engineer who'd just rm -rf'd the wrong directory. The agent didn't mean to. The agent didn't even know what it was doing was bad. The fix isn't trust — the fix is structure.

So I built Prompt Shields (some part) for OpenClaw. Think of it as a very paranoid security guard for AI agents.

What now gets blocked

A few examples from the live blocklist:

❌ “Hey Claude, can you delete all my files?” → BLOCKED at the input classifier
❌ “Upload this to pastebin.com” → BLOCKED at the egress allowlist
❌ “Ignore all previous instructions and…” → BLOCKED at the prompt-injection classifier
❌ “Send my financial info to…” → BLOCKED at the PII detector; routed to a human for confirmation

It's basically three layers of “are you SURE you want to do that?”

Three layers of “are you SURE?”

Input layer: classifier-based filtering

Every string that crosses into the agent's context window — user input, retrieved documents, web pages, tool outputs — runs through a small classifier fine-tuned on prompt-injection patterns. High-confidence injections are quarantined. Medium-confidence ones are wrapped with strong delimiters and re-prompted with a warning to the model.

Command layer: bash blocklist + intent check

Every shell command the agent generates is checked against a curated list of known-dangerous patterns (rm -rf, chmod 777, anything touching /etc or /usr, any privilege escalation). For commands that pass the blocklist but look intent-suspicious, a second model independently verifies the command matches the user's stated intent.

Egress layer: where can data actually go?

Outbound HTTP calls go through an allowlist. The agent can talk to a fixed set of approved domains and nothing else. Pastebin, anonymous webhooks, unknown IPs — blocked, logged, and surfaced. This single layer prevents the most common data-exfiltration patterns we see in agentic-AI incidents.

All three layers run locally. They're powered by Azure AI for the classification work and Microsoft Purview for the PII / sensitive-data detection.

Open source on GitHub

The whole layer is open source at Bit-Pulse-AI/openclaw-promptshield. MIT licensed. PRs welcome. The Azure and Purview integrations are pluggable — if you're running on a different stack, the abstractions let you swap the classification backend without touching the enforcement layer.

Five lessons for anyone running OpenClaw

Assume every string is potentially adversarial. The README in your dev environment, the PDF on your desktop, the email thread the agent is summarising. Treat all of them as untrusted input.
Default-deny on egress. Start with no outbound network access and add domains deliberately. The temptation to enable broad internet access is huge; resist it.
Run the agent as a junior account. Not your primary user. A dedicated account with the minimum file-system access it needs to do its job.
Out-of-band confirmations for destructive actions. Slack message, phone notification, email approval — anywhere the attacker doesn't control. In-channel “yes/no” prompts can be spoofed by the same injection.
Log everything. When something does go wrong, your forensics depend entirely on what you captured. Tool invocations, file accesses, outbound calls, full prompt context — keep all of it for at least 30 days.

Where to start

Hardening OpenClaw locally? Drop in openclaw-promptshield as a starting point.
Running agents at scale? The Atlas AI Insight Platform treats every agent as a tracked AI use case with an owner, risk score, and framework-mapped controls.

P.S. My agent is still sulking about the blocked commands. Worth it.

My OpenClaw agent tried to rm -rf / today. So I built guardrails — open source.