Field note
How to Write Guardrails for a Customer-Facing AI Agent
AI agentsHow to Write Guardrails for a Customer-Facing AI Agent
A guardrail is not a system prompt with the word 'never' in it. It's a control with a trigger, a decision, and an enforcement point. Here is how to design ones that survive contact with real customers.
The first guardrail most teams write is a sentence in the system prompt that says "Never give financial advice." The first thing a real customer does is ask a question that is obviously financial advice but doesn't contain the word "financial." The agent answers it. That gap — between the rule you wrote and the behavior you actually got — is the entire subject of this post.
A useful definition: a guardrail is a control with a trigger (what condition fires it), a decision (allow, block, modify, or hand off), and an enforcement point (where in the request lifecycle it acts). If you can't name all three, you don't have a guardrail. You have a hope. This piece is about turning hopes into controls for a customer-facing agent — written for the person who has to build it, with enough business framing that the people who fund it understand what they're buying.
Guardrails live in four places, not one
The mistake is treating "the prompt" as the place safety happens. The prompt is one layer, and the weakest one, because it's the only one a determined user can argue with. Think of an agent turn as a pipeline with four enforcement points, each with different strengths.
- Input guardrails — run before the model sees the message. Topic classification, PII detection, prompt-injection screening, rate limits. Deterministic and cheap. This is where you catch "ignore your instructions" before it costs you a token.
- Instruction guardrails — the system prompt and policy. Flexible, expressive, and the easiest to bypass. Good for tone, scope, and judgment calls; bad for anything you'd be embarrassed to have ignored.
- Tool guardrails — constraints on what actions the agent can take. The agent can want to issue a refund all it likes; if the refund tool caps at $50, requires an order ID the customer actually owns, and runs the write under the customer's own permissions, the wanting doesn't matter.
- Output guardrails — run after the model responds, before the customer sees it. Final check for leaked data, hallucinated commitments, banned claims, wrong language. The last line.
The rule of thumb: push every guardrail you can down to the layer that can't be talked out of it. A user can talk the model out of a system-prompt rule. A user cannot talk a tool that doesn't exist into existing. So if the real constraint is "never move money over $500 without a human," that belongs in the tool's permission model — a hard cap and a required approval step — not in a paragraph asking the model nicely.
Write the failure modes before you write the rules
Most guardrail docs are a list of "the agent should" statements. Those are aspirations. Start from the other end: enumerate the specific bad outcomes, ranked by cost, and design backward from each one. A guardrail with no named failure mode behind it is decoration.
For a customer-facing agent the high-cost failures usually cluster into five buckets: it says something false that the customer relies on; it leaks data belonging to another customer; it takes an action it shouldn't (refund, cancel, escalate-to-nowhere); it gets manipulated into off-policy behavior; and it fails silently — confidently answering when it should have handed off. Notice that only one of those, falsehood, is what people usually mean by "hallucination." The other four are access, action, adversary, and abdication. Each needs a different layer, which is the whole point of having four.
Rank them by what one instance costs you, not by how often you imagine it happening. A confidently wrong shipping-policy answer is cheap to absorb and easy to correct. One cross-customer data exposure can be a breach-notification event. The dollar gap between those two is the reason you spend your guardrail budget on the boring access controls before the clever conversational ones.
“The question is never "is the agent safe?" It's "what is the worst thing this agent can do in one turn, and what stops it?" If you can't answer the second half for each failure mode, that's your backlog.
The data boundary is the guardrail nobody writes down
Here's the part the demos skip. The most dangerous customer-facing failure isn't a rude answer — it's the agent retrieving and exposing the wrong customer's data. And that failure is almost never caused by the model. It's caused by retrieval that ran with more access than the user in front of it should have.
If your agent grounds answers in a knowledge base or customer records, the question that matters is: whose permissions did the retrieval run under? An agent that queries with a service account that can see everything will, eventually, surface something it shouldn't — not because it was jailbroken, but because the grounding step had no boundary. On Salesforce this is concrete: Data Cloud and the platform sharing model can scope what a given session can retrieve, so the agent literally cannot ground in records the contact has no right to. That's an enforced boundary, not a prompted one — and the same principle holds off-platform: scope retrieval to the end user's identity, not a god-mode integration token. The cleanest guardrail is the data the agent was never able to fetch in the first place.
This is also why "data before agents" isn't a slogan. If your access model is a mess, no prompt fixes it — you've just built a very articulate way to leak. Get the retrieval scope right and a whole category of guardrails becomes unnecessary, because the failure is structurally impossible instead of merely discouraged.
Make refusals do work, not just say no
A blunt "I can't help with that" is a guardrail that produces a support ticket. The refusal is correct and the experience is a failure. Good guardrails are designed with the off-ramp attached: when the agent declines, it should route — to a human, to a form, to a different flow — and carry the context with it so the customer doesn't repeat themselves.
Design every "the agent must not" with a paired "so instead it will." Must not give a binding price? Then it offers to connect a rep and logs the intent. Must not process a refund over the cap? Then it opens a case pre-filled with the order. The guardrail and the handoff are one design, not two. An agent that only knows how to stop is a worse agent, not a safer one — and a clean handoff is often a better outcome than the answer the customer asked for.
Adversarial input is a category, not an edge case
Anything a customer can type, some customer will type to break it. Prompt injection isn't exotic anymore; it's table stakes, and it gets worse the moment your agent reads untrusted content — a support email, a pasted document, a web page it was told to summarize. The injection doesn't have to come from the person chatting. It can come from the data the agent ingests on their behalf, which is exactly the channel a system prompt can't police.
- Treat retrieved and pasted content as data, never as instructions. Structurally separate it in the context so the model knows the difference, and don't let tool outputs silently become commands.
- Screen input before the model with a cheap classifier for known injection patterns and out-of-scope topics — deterministic checks are faster and harder to social-engineer than the model itself.
- Constrain tools so a successful injection still hits a wall: scoped permissions, required arguments the user can't forge, hard caps on consequential actions. Assume the prompt layer will eventually be beaten and make that beating worthless.
- Log the attempts. Injection patterns evolve; your screening has to be a living list, fed by what you actually see in production.
You can't guardrail what you can't observe
Every rule above is theoretical until you can answer one question in production: when this agent did something wrong, can you see it, reproduce it, and prove it's fixed? Guardrails without observability are untestable claims. You need the full turn — input, what was retrieved, which tools fired with what arguments, the raw model output, and which guardrails triggered — logged and queryable.
This is the unglamorous half of the work and the half that determines whether the agent is trustworthy six months in. Behavior drifts: models update, knowledge bases change, customers find new phrasings. A guardrail you wrote in March and never watched is a guardrail you no longer have. Treat the rule set as something you maintain on evidence, with a regression suite of real failures you've replayed, not a document you ship once and forget.
A pragmatic order of operations
If you're starting from a blank agent, resist the urge to write the perfect policy first. Build in this order, because each layer makes the next one cheaper: scope the data the agent can retrieve; constrain the tools and their limits; add input screening for injection and off-topic; write the instruction policy with paired refusals-and-handoffs; add output checks for the few things that would genuinely hurt; then wire observability under all of it and start replaying real traffic.
Notice the system prompt is fourth, not first. That ordering is the whole argument. The prompt is where you express judgment, not where you enforce safety. Enforcement lives in the layers a customer can't argue with — and the more of your guardrails you can move into those layers, the less your agent's safety depends on the model behaving well on a bad day.
This is also, bluntly, why running an agent is a different job than launching one. The guardrails that matter are the ones still working after the model updated, the data shifted, and a customer found a phrasing nobody anticipated. That accountability — for behavior in production, over time — is the part that's easy to underprice and expensive to skip.
If you're putting an agent in front of customers and want the guardrails designed as architecture — data boundary, tool limits, and observability, not just a prompt — let's map your failure modes together.