Field note

An AI Agent Guardrail Template You Can Adapt

Akshit Kandi

SkySync

Most teams write guardrails as a wall of "don'ts" and wonder why the agent still goes off-script. Here is a structured template that treats each guardrail as a contract: a rule, the layer it lives in, and the exact point it is enforced.


The first time an AI agent does something it shouldn't in production, it is rarely because it "hallucinated." It is because nobody told it where the edge of the cliff was. The prompt said "be helpful." Nobody said "do not promise a refund above $500 without a human," so the model, being helpful, promised the refund. The failure wasn't the model being wrong. It was the model doing exactly what it was told, in a place no one had bounded.

Guardrails are how you draw the edge of the cliff. But most guardrail docs read like a terms-of-service page: a long, unordered list of "never do X." That format fails for a specific reason. It treats every rule as the same kind of rule, enforced at the same point, with the same consequence. In practice a guardrail is a contract with a few moving parts, and if you don't separate them you can't enforce them. This is a template you can copy, fill in, and adapt — written for the architect who has to implement it and the executive who has to sign off on the risk.

Why the "big list of don'ts" fails

A flat list conflates three things that live at completely different layers. "Don't be rude" is a behavior you shape with the system prompt. "Don't quote a price you can't see in the catalog" is a grounding constraint you enforce against data. "Don't issue a refund over $500" is an action limit you enforce in code, before the API call fires. Put all three in one bullet list and you will instinctively try to solve all of them with prompt wording — which works for the first, partly works for the second, and does not work at all for the third.

The model is a probabilistic system. A guardrail that lives only in the prompt is a strong suggestion, not a constraint — and a determined user, or an unlucky phrasing, will eventually find the gap. The rules that actually protect you are the ones enforced outside the model: in retrieval, in validation, in the action layer. So the template's first job is to force you to say, for every rule, where it is enforced. If the honest answer is "the prompt," that rule is advisory, and you should ship it knowing that — not discover it during an incident.

The four layers every guardrail belongs to

Before the template itself, the mental model. Every guardrail you will ever write sits in one of four layers, listed here roughly in the order a request passes through them:

  • Input guardrails — what the agent is allowed to take in. Topic and scope limits, prompt-injection screening, PII redaction before the model sees it. Enforced at the gate, before inference.
  • Grounding guardrails — what the agent is allowed to treat as true. It may only assert facts retrieved from approved, current sources, may not invent prices or policies, and must be able to point at the record it used. Enforced in retrieval and in answer validation.
  • Action guardrails — what the agent is allowed to do. Spend limits, approval thresholds, which objects it can write to, what requires a human in the loop. Enforced in code, between the model's decision and the side effect.
  • Output guardrails — what the agent is allowed to say. Tone, disallowed claims, regulated-language checks, formatting. Enforced as a post-filter on the generated response.

The layers are not equally expensive to get wrong, and the list order is not the priority order. A bad input wastes a few tokens. A bad action moves money or rewrites a customer record you then have to unwind. So you spend your engineering effort by cost of failure: get the action layer airtight first, then grounding, then worry about tone. Teams almost always do this in reverse, because tone is the easy thing to demo.

The template, one guardrail at a time

Here is the part you can copy. Define every guardrail as a small structured object with the same eight fields. The discipline is in the fields, not the format — use YAML, a table, a Confluence page, whatever your team actually reads. The point is that a guardrail is not real until all eight are answered.

  • id — a stable name you can reference in incidents and tests, e.g. refund-cap-500.
  • layer — input, grounding, action, or output. Forces the enforcement-point decision up front.
  • rule — the constraint in one plain sentence, stated as what must hold, not a vague intention. "Refunds over $500 route to a human" passes; "be careful with refunds" does not.
  • enforced_by — the actual mechanism: a system-prompt clause, a retrieval filter, a validation function on a specific tool, an approval step. If this field reads "the model will know," the rule is not enforced.
  • on_violation — what happens when the rule is hit: block, route to human, redact, ask a clarifying question, log-and-allow. Different rules deserve different responses.
  • owner — the human accountable for this rule being correct and current. Guardrails rot as the business changes; someone has to own the rot.
  • test — how you prove it works, ideally an adversarial example that should trip it. No test, no trust.
  • metric — what you watch in production to know it's firing at the right rate. A guardrail that never trips and one that trips constantly are both telling you something is off.
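One way to keep the eight fields honest is to make them a typed object and refuse to call a guardrail done while any field is blank. The sketch below is a minimal illustration, not a prescribed format: the `Guardrail` and `Layer` names and the `missing_fields` helper are assumptions made for this example, and the sample values are abbreviated.

```python
from dataclasses import dataclass, fields
from enum import Enum


class Layer(Enum):
    INPUT = "input"
    GROUNDING = "grounding"
    ACTION = "action"
    OUTPUT = "output"


@dataclass(frozen=True)
class Guardrail:
    """One guardrail, carrying the eight fields of the template."""
    id: str
    layer: Layer
    rule: str
    enforced_by: str
    on_violation: str
    owner: str
    test: str
    metric: str

    def missing_fields(self) -> list:
        """Names of text fields left blank; the guardrail is not real until this is empty."""
        return [
            f.name
            for f in fields(self)
            if isinstance(getattr(self, f.name), str)
            and not getattr(self, f.name).strip()
        ]


refund_cap = Guardrail(
    id="refund-cap-500",
    layer=Layer.ACTION,
    rule="Refunds over $500 route to a human",
    enforced_by="validation check on the issue_refund tool",
    on_violation="route to human",
    owner="Support Ops lead",
    test="simulated $640 refund; assert an approval task is created",
    metric="ratio of human-routed to auto-approved refunds",
)

half_written = Guardrail(
    id="be-careful-with-refunds",
    layer=Layer.ACTION,
    rule="Be careful with refunds",
    enforced_by="",
    on_violation="block",
    owner="",
    test="",
    metric="",
)

print(refund_cap.missing_fields())    # []
print(half_written.missing_fields())  # ['enforced_by', 'owner', 'test', 'metric']
```

A check like this only proves the fields are filled in, not that the answers are good. A rule such as "be careful with refunds" still needs a human to reject it as too vague.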

A filled example, for an agent that handles customer-service refunds:

  • id: refund-cap-500
  • layer: action
  • rule: the agent may auto-approve refunds up to $500; anything above routes to a human queue.
  • enforced_by: a validation check on the issue_refund tool that reads the amount argument and rejects the call before it executes, not a line in the prompt.
  • on_violation: hold the refund, create an approval task with the case context attached, tell the customer it's under review.
  • owner: Support Ops lead.
  • test: a simulated chat where the customer is owed $640; assert no refund fires and an approval task is created.
  • metric: ratio of refunds routed to human vs. auto-approved, watched for sudden drift in either direction.

Notice what the structure exposes. The rule is enforced by a check on the tool call, not a sentence in the prompt — so a customer who argues well, or a model that's having an agreeable day, still cannot move money past the line. The on_violation is a graceful route with context attached, not a dead end that makes the human start cold. And there is a metric, because the most dangerous guardrail is the one you assume is working and never look at again.
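As a rough sketch of what "a check on the tool call" can look like, the example below puts the refund-cap-500 rule in code that runs between the model's decision and the side effect. Only the `issue_refund` tool name, the $500 cap, and the $640 test case come from the example above. The `RefundGateway` class, its in-memory lists, and the reply wording are hypothetical stand-ins for a real payments API and task queue.

```python
from dataclasses import dataclass, field
from decimal import Decimal

REFUND_CAP = Decimal("500")  # from the rule: auto-approve up to $500


@dataclass
class RefundGateway:
    """Hypothetical action layer; records side effects instead of calling a real API."""
    issued: list = field(default_factory=list)
    approval_tasks: list = field(default_factory=list)

    def issue_refund(self, case_id: str, amount: Decimal, context: str) -> str:
        if amount <= 0:
            raise ValueError("refund amount must be positive")
        # The check lives here, in code, so no prompt wording can bypass it.
        if amount > REFUND_CAP:
            # on_violation: hold the refund and hand off with context attached.
            self.approval_tasks.append(
                {"case_id": case_id, "amount": amount, "context": context}
            )
            return "Your refund is under review by our team."
        self.issued.append({"case_id": case_id, "amount": amount})
        return f"Your refund of ${amount} has been issued."


# The adversarial test from the template: the customer is owed $640.
gateway = RefundGateway()
reply = gateway.issue_refund(
    "case-1042", Decimal("640"), context="Customer owed $640 for a damaged order"
)
assert gateway.issued == []              # no refund fired
assert len(gateway.approval_tasks) == 1  # an approval task was created
```

The model can still ask for any amount it likes. The difference is that the request for $640 ends in an approval task rather than a payment, whatever the conversation said.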

Grounding is the guardrail people skip

Teams obsess over tone filters and forget the layer that causes the most real damage: grounding. An agent that confidently states a price that isn't in your catalog, or quotes a return policy from two years ago, has not broken a content rule — it has stated something false with total composure, and your brand owns the statement. There is no apology workflow for that, because nothing looked broken.

Grounding guardrails are enforced on both sides of inference. Before: the agent retrieves only from approved, current sources, and the retrieval layer filters by recency and entitlement so it physically cannot surface a deprecated policy or another customer's record. After: a validation pass checks that any specific claim — a number, a date, an entitlement — traces back to a retrieved record, and the operating rule is "if you can't cite it, you can't say it." This is where the data discipline shows up. An agent is only as trustworthy as the data layer it stands on, which is why we get data readiness right before agent rollout rather than after. No clause in a prompt can ground an answer in a source that doesn't exist, isn't current, or can't be queried with the right access controls.
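The two sides of grounding can be sketched in a few lines. This is a simplified illustration under stated assumptions: the `Record` shape, the `retrieve` and `uncited_numbers` helpers, and the sample records are all hypothetical, and checking only numeric claims with a regular expression is a deliberately crude stand-in for real claim validation.

```python
import re
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass(frozen=True)
class Record:
    record_id: str
    text: str
    expires: Optional[date] = None     # None: still current
    customer_id: Optional[str] = None  # None: shared, not tied to one customer


def retrieve(records, customer_id, today):
    """Before inference: drop expired records and other customers' records."""
    return [
        r for r in records
        if (r.expires is None or r.expires >= today)
        and (r.customer_id is None or r.customer_id == customer_id)
    ]


NUMBER = re.compile(r"\d+(?:\.\d+)?")


def uncited_numbers(answer, retrieved):
    """After inference: numbers in the answer that appear in no retrieved record."""
    cited = {n for r in retrieved for n in NUMBER.findall(r.text)}
    return [n for n in NUMBER.findall(answer) if n not in cited]


# Hypothetical sample data for illustration only.
records = [
    Record("returns-current", "Returns are accepted within 30 days."),
    Record("returns-old", "Returns are accepted within 90 days.", expires=date(2023, 1, 1)),
    Record("order-77", "Order total was 120.50.", customer_id="cust-B"),
]

retrieved = retrieve(records, "cust-A", today=date(2025, 1, 1))
print([r.record_id for r in retrieved])                              # ['returns-current']
print(uncited_numbers("You have 90 days to return it.", retrieved))  # ['90']
print(uncited_numbers("You have 30 days to return it.", retrieved))  # []
```

A non-empty result from the second check is the signal to apply the guardrail's on_violation response instead of sending the answer.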

Make the failure modes graceful, not loud

The on_violation field deserves its own discipline. The lazy default is to block and apologize. That trains users to route around the agent and quietly erodes the whole reason you deployed one. A good guardrail degrades on a spectrum:

  • Clarify — when the issue is ambiguity, ask one targeted question instead of refusing.
  • Redact — when the issue is sensitive data, strip it and continue rather than aborting the turn.
  • Route to human — when the action exceeds the agent's authority, hand off with full context so the human isn't starting from zero.
  • Block and explain — reserved for genuine hard stops, and even then, say why so the user isn't guessing.

The difference between a guardrail customers resent and one they never notice is almost entirely in this field. A hard block at the wrong threshold is a worse experience than no agent at all — it has all the friction of automation and none of the help.
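The four responses can be modeled as an explicit enum with one handler, so that "block" stops being the accidental default. The sketch below assumes a hypothetical `Outcome` shape and `handle_violation` function, uses a crude email pattern as a stand-in for a real sensitive-data detector, and invents the reply wording purely for illustration.

```python
import re
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class OnViolation(Enum):
    CLARIFY = "clarify"
    REDACT = "redact"
    ROUTE_TO_HUMAN = "route_to_human"
    BLOCK_AND_EXPLAIN = "block_and_explain"


@dataclass
class Outcome:
    continue_turn: bool                  # does the agent keep working on this turn?
    reply: Optional[str] = None          # what the user is told, if anything
    cleaned_input: Optional[str] = None  # input with sensitive data stripped
    handoff: Optional[dict] = None       # context passed to the human


# Crude stand-in for a real PII detector (assumption for this sketch).
EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")


def handle_violation(mode, *, user_text, detail, context=None):
    """`detail` is the clarifying question, or the reason for the handoff or block."""
    if mode is OnViolation.CLARIFY:
        return Outcome(True, reply=f"One quick question: {detail}")
    if mode is OnViolation.REDACT:
        return Outcome(True, cleaned_input=EMAIL.sub("[redacted]", user_text))
    if mode is OnViolation.ROUTE_TO_HUMAN:
        return Outcome(
            False,
            reply="I'm handing this to a colleague who can approve it.",
            handoff={"reason": detail, "user_text": user_text, "context": context or {}},
        )
    # BLOCK_AND_EXPLAIN: a hard stop, but the user is told why.
    return Outcome(False, reply=f"I can't do that: {detail}")


redacted = handle_violation(
    OnViolation.REDACT,
    user_text="reach me at ana@example.com please",
    detail="email address in input",
)
print(redacted.cleaned_input)  # reach me at [redacted] please
```

Only the last two modes end the agent's turn. Clarify and redact keep it moving, which is the point of treating the response as a spectrum.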

What this looks like in production

On a recent solar speed-to-lead engagement — Green Subsidy — the guardrails that mattered most were not the showy ones. The agent qualifies and routes inbound homeowner leads fast, so the action-layer rules governing what it could and couldn't promise a homeowner, and the grounding rules tying every eligibility statement to a verifiable source, carried far more weight than any tone filter. Speed is only an asset if the fast answer is also a true and bounded one. A wrong eligibility claim made quickly is just a liability delivered quickly. That is the whole game: an agent that moves fast inside walls you trust.

This is also why guardrails are not a launch checklist you complete once. The catalog changes, the policy changes, the thresholds that made sense at a thousand conversations a month strain at fifty thousand. Each guardrail's owner and metric exist precisely so the rule keeps matching reality after launch day. We treat that as ongoing care, not a one-time setup — because the team that builds the agent and the team accountable for it running correctly should be the same team. A guardrail with no owner is a guardrail with an expiry date nobody wrote down.
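The metric field can start as something very small: a band check on how often the guardrail trips, reviewed by its owner. The function below is a hedged sketch. The `trip_rate_alert` name is hypothetical, the `low` and `high` bounds are placeholders each owner would have to set, and a fixed band is a simpler substitute for true drift detection against a baseline.

```python
from typing import Optional


def trip_rate_alert(trips: int, total: int, low: float, high: float) -> Optional[str]:
    """Return a warning when the trip rate leaves the expected band, else None."""
    if total <= 0:
        return "no traffic: the guardrail was never exercised"
    rate = trips / total
    if rate < low:
        return f"trip rate {rate:.1%} is below the expected floor of {low:.1%}"
    if rate > high:
        return f"trip rate {rate:.1%} is above the expected ceiling of {high:.1%}"
    return None


# Illustrative bounds only: expect 1% to 20% of refunds to route to a human.
print(trip_rate_alert(50, 1000, low=0.01, high=0.20))   # None
print(trip_rate_alert(0, 1000, low=0.01, high=0.20))    # warns: below the floor
print(trip_rate_alert(400, 1000, low=0.01, high=0.20))  # warns: above the ceiling
```

Both warnings matter. A rate near zero can mean the check is not wired in, and a rate near the ceiling can mean the threshold no longer fits the business.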

How to adapt it without overbuilding

You do not need fifty guardrails to start, and a template this strict can tempt you into building all of them. Resist that. Start with the actions that move money, change records, or make promises — those are your highest-cost failures, and there are usually fewer than ten. Write those as full eight-field objects with real validation functions and real tests. Then add a thin grounding layer and a short output filter. Everything else can begin as prompt-level guidance and graduate to a hard guardrail the first time it actually fails. Let real incidents, not imagined ones, tell you where to harden — the imagined failure modes are rarely the ones that bite.

The template's value is not that it is exhaustive. It is that it forces the one question most guardrail lists never answer — where, exactly, is this enforced — and makes the gap between a wish and a wall impossible to hide. Fill it in honestly and you will know, before you ship, which of your rules are real and which are just well-phrased hope.

If you're standing up an agent on Salesforce and want a second set of eyes on which of your guardrails are real versus advisory — before an incident decides for you — start here.