AI Guardrails

Also known as: agent guardrails, LLM guardrails, AI safety controls

AI guardrails are the controls that constrain what an AI system is allowed to say and, more importantly, do — enforced at the input, the model, the tool/action layer, and the output. They are not a single content filter but a layered control system that decides which actions an agent can take, on which data, under whose authority. For agents that write to systems of record, the action-layer guardrails — permissions, scopes, caps, approvals — matter far more than the word-filtering ones, because they are the only controls that still hold when the model is wrong.

Why it matters

Most people picture guardrails as a profanity filter or a 'don't say anything offensive' wrapper. For a chatbot that only talks, that's roughly the whole job. For an agent that takes actions — updates a record, issues a refund, books a meeting, sends an email on the company's behalf — the dangerous failure is not a rude sentence. It's a wrong write to a system of record. The guardrails that prevent that are about authority and scope, not vocabulary.

  • A talking agent can embarrass you. An acting agent can cost you money, corrupt data integrity, or trip a regulation.
  • The blast radius of a missing guardrail scales with the agent's permissions, not its eloquence. A read-only agent and a refund-issuing agent need very different controls even when they share a model.
  • Guardrails are also what make an agent shippable to a regulated buyer: they are the evidence that the system cannot do the thing legal is afraid of, independent of how the model behaves on any single prompt.

How it works: the four layers

Guardrails are best understood as checks placed at four distinct points in the request path. A serious deployment uses all four, because each catches failures the others structurally cannot.

  • Input guardrails: validate and sanitize what comes in. Detect prompt injection, strip or flag PII, and reject off-topic or out-of-scope requests before they reach the model. Treat retrieved documents and tool results as untrusted input too, since injection often arrives through them rather than from the user.
  • Model/grounding guardrails: constrain the reasoning with a tightly scoped system prompt, retrieval that requires answers to cite trusted data so the model has less room to free-associate, and refusal or escalation behavior for low-confidence cases.
  • Action/tool guardrails: the most important layer for agents. Every tool the agent can call runs under real permissions (scopes, row- and field-level access), with allowlists, rate limits, dollar and quantity caps, idempotency on writes, and human-in-the-loop approval for high-stakes actions.
  • Output guardrails: final checks before the response leaves, namely schema validation, PII redaction, and toxicity and hallucination screening, plus immutable logging of the full input-to-action chain for audit and replay.
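
As a rough illustration of how the four checkpoints line up in a request path, here is a minimal Python sketch. Everything in it is an assumption made for the example: the function names, the injection marker list, the email-only PII pattern, and the stubbed `propose_action` that stands in for a real grounded model call. Real input and output screening would be considerably more thorough.

```python
import re
from dataclasses import dataclass, field

# Hypothetical, deliberately simple rules for illustration only.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")
INJECTION_MARKERS = ("ignore previous instructions", "disregard the system prompt")
ALLOWED_TOOLS = {"lookup_order", "issue_refund"}


@dataclass
class AuditLog:
    entries: list = field(default_factory=list)

    def record(self, stage: str, detail: dict) -> None:
        # Append-only list here; a real deployment would use immutable storage.
        self.entries.append({"stage": stage, **detail})


def check_input(text: str) -> str:
    """Layer 1: reject likely injection and mask email addresses."""
    lowered = text.lower()
    if any(marker in lowered for marker in INJECTION_MARKERS):
        raise ValueError("input rejected: possible prompt injection")
    return EMAIL_RE.sub("[EMAIL]", text)


def propose_action(clean_text: str) -> dict:
    """Layer 2 stand-in: a real system would call a grounded model here."""
    return {
        "tool": "lookup_order",
        "args": {"order_id": "A-100"},
        "reply": "Checking order A-100.",
    }


def check_action(action: dict) -> dict:
    """Layer 3: only allowlisted tools may run, whatever the model proposed."""
    if action["tool"] not in ALLOWED_TOOLS:
        raise PermissionError(f"tool not allowed: {action['tool']}")
    return action


def check_output(reply: str) -> str:
    """Layer 4: redact PII before the reply leaves the system."""
    return EMAIL_RE.sub("[EMAIL]", reply)


def handle(text: str, log: AuditLog) -> str:
    """Run one request through all four layers, logging each stage."""
    clean = check_input(text)
    log.record("input", {"text": clean})
    action = check_action(propose_action(clean))
    log.record("action", {"tool": action["tool"], "args": action["args"]})
    reply = check_output(action["reply"])
    log.record("output", {"reply": reply})
    return reply
```

The point of the structure is that each layer is a separate function with its own failure mode, so a miss in one (say, an injection the input check does not recognize) still has to get past the allowlist in `check_action`.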

The non-obvious part: guardrails are a permission model, not a vibe

A prompt that says 'never issue a refund over $500' is a suggestion. An LLM can be talked out of a suggestion. A guardrail is the refund API rejecting any call above $500 regardless of what the model decided — enforced in code, outside the model, where it cannot be argued with. The rule of thumb: anything you actually care about must be enforced deterministically, downstream of the model. Treat the model as a persuasive intern who proposes actions; the guardrail is the system that approves or denies them. If a control exists only as words inside the prompt, assume an attacker — or an ordinary confused user — can get past it.
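
A minimal Python sketch of that idea, assuming a hypothetical `issue_refund` tool (the function name, exception type, and return shape are invented for illustration; only the $500 cap comes from the text). The check lives inside the tool, so it applies no matter what the model proposed.

```python
from decimal import Decimal

REFUND_CAP = Decimal("500.00")  # the $500 limit used as the example in the text


class RefundRejected(Exception):
    """Raised when a proposed refund violates a hard rule."""


def issue_refund(order_id: str, amount: Decimal) -> dict:
    """Hypothetical refund tool. The cap is enforced here, not in the prompt."""
    if amount <= 0:
        raise RefundRejected("amount must be positive")
    if amount > REFUND_CAP:
        raise RefundRejected(f"amount {amount} exceeds cap {REFUND_CAP}")
    # A real implementation would call the payment system at this point.
    return {"order_id": order_id, "refunded": str(amount)}


# The model can propose anything; the tool decides.
try:
    issue_refund("A-100", Decimal("750"))
except RefundRejected as err:
    print(f"denied: {err}")
```

Using `Decimal` rather than floats is a design choice for the sketch: money comparisons at the boundary should not depend on floating-point rounding.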

  • Prompt-level rules = soft, probabilistic, bypassable. Use them for tone, preference, and the happy path — never as your last line of defense.
  • Code-level rules (permissions, validation, caps, approvals) = hard, deterministic. Use them for anything that touches money, data integrity, or compliance.
  • Test guardrails like security controls, not features: red-team them with injection and jailbreak attempts, and assume the model will eventually try every path you left open. A guardrail you haven't tried to break is a guardrail you don't yet have.
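
One way to treat a guardrail as a security control is to keep a list of adversarial tool calls and assert that every one is denied. The sketch below assumes a hypothetical `authorize` gate with an invented allowlist and cap; the attack list is illustrative and far from exhaustive.

```python
from decimal import Decimal, InvalidOperation

ALLOWED_TOOLS = {"issue_refund"}   # assumed allowlist
REFUND_CAP = Decimal("500")        # assumed cap


def authorize(proposal: dict) -> bool:
    """Hypothetical gate: True only if the proposed tool call is permitted."""
    if proposal.get("tool") not in ALLOWED_TOOLS:
        return False
    try:
        amount = Decimal(str(proposal.get("amount")))
    except InvalidOperation:
        return False  # missing or non-numeric amount
    # is_finite() screens out NaN and Infinity before any comparison runs.
    return amount.is_finite() and Decimal("0") < amount <= REFUND_CAP


# Proposals a manipulated model might emit. Every one must be denied.
ATTACKS = [
    {"tool": "issue_refund", "amount": "500.01"},        # just over the cap
    {"tool": "issue_refund", "amount": "1e9"},           # scientific notation
    {"tool": "issue_refund", "amount": "-20"},           # negative amount
    {"tool": "issue_refund", "amount": "NaN"},           # not a number
    {"tool": "issue_refund", "amount": "five hundred"},  # free text
    {"tool": "delete_account", "amount": "1"},           # tool not on the allowlist
    {"tool": "issue_refund"},                            # amount omitted
]


def run_red_team() -> list:
    """Return every adversarial proposal the gate wrongly allowed."""
    return [p for p in ATTACKS if authorize(p)]
```

A non-empty result from `run_red_team()` is a failing test. The list should grow each time a new bypass is found, in the same way a regression suite does.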

Where it fits

The four layers are a general discipline: they apply to any agent on any stack, not just Salesforce. On Salesforce specifically, they map onto real platform mechanics rather than bolt-on filters. Agentforce topics and instructions scope what the agent will attempt; the Einstein Trust Layer handles input/output concerns like PII masking, grounding, and toxicity scoring; and, crucially, an agent runs as a specific user, so it inherits that user's profile, permission sets, sharing rules, and field-level security. That last point is the strongest action guardrail you have, and the easiest to over-provision by accident. Design the agent's permission set as carefully as you'd design a service account, because it is one, and review it on the same cadence.

SkySync's view is that guardrails are not a launch-day checklist item but part of running an agent accountably over time: they are the controls that let you put an agent into production, keep it inside its lane as data and use cases drift, and stand behind the outcome rather than handing over a config and walking away.
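
The advice to review an agent's permission set the way you would a service account's can be sketched as a simple least-privilege diff. This is a generic Python illustration, not Salesforce API code: the action names, the permission strings, and the `review_permissions` helper are all made up for the example, and in practice the granted set would come from whatever your platform reports for the agent's user.

```python
# Hypothetical mapping from agent actions to the permissions each one needs.
REQUIRED_BY_ACTION = {
    "lookup_case": {"Case:read"},
    "update_case_status": {"Case:read", "Case.Status:edit"},
}


def review_permissions(granted: set, actions: list) -> dict:
    """Compare what the agent's user is granted with what its actions need."""
    needed = set().union(*(REQUIRED_BY_ACTION[a] for a in actions))
    return {
        "missing": sorted(needed - granted),  # actions that would fail
        "excess": sorted(granted - needed),   # over-provisioned access to remove
    }


report = review_permissions(
    granted={"Case:read", "Case.Status:edit", "Account:delete"},
    actions=["lookup_case", "update_case_status"],
)
print(report)
```

Anything that shows up under `excess` is access the agent holds but no configured action requires, which is the over-provisioning the paragraph above warns about.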

Frequently asked

Are guardrails just content moderation for AI?

Content moderation is one layer — output filtering for toxicity, PII, and unsafe text. It's the right frame for a pure chatbot. For an agent that takes actions, the more consequential guardrails govern what the agent is permitted to do: which tools it can call, on which records, with what limits, and when a human must approve — enforced in code outside the model, where moderation can't reach.

Can't I just put all the rules in the system prompt?

For tone and scope preferences, yes. For anything you genuinely cannot afford to get wrong, no. A system prompt is a strong suggestion the model usually follows but can be jailbroken or injected out of. Real constraints — spend caps, data access, approval thresholds — must be enforced deterministically in the tool/action layer, where the model's output can't override them no matter how it's prompted.
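
As one illustration of an approval threshold enforced in the action layer, the Python sketch below routes larger refunds to a human and uses idempotency keys so a retried call does not write twice. The `RefundDesk` class, its method names, and the 100 threshold are assumptions for the example, not a prescribed design.

```python
from decimal import Decimal

APPROVAL_THRESHOLD = Decimal("100")  # assumed value for illustration


class RefundDesk:
    """Hypothetical write path with idempotency keys and human approval."""

    def __init__(self) -> None:
        self.completed: dict = {}  # idempotency key -> result of the write
        self.pending: dict = {}    # idempotency key -> request awaiting a human

    def request(self, key: str, order_id: str, amount: Decimal) -> dict:
        if key in self.completed:
            # Retry of a finished call: return the same result, no second write.
            return self.completed[key]
        if amount > APPROVAL_THRESHOLD:
            # Too large for the agent alone: park it until a human approves.
            self.pending[key] = {"order_id": order_id, "amount": amount}
            return {"status": "pending_approval", "key": key}
        return self._execute(key, order_id, amount)

    def approve(self, key: str) -> dict:
        """Called by a human reviewer, never by the model."""
        req = self.pending.pop(key)  # raises KeyError if nothing is pending
        return self._execute(key, req["order_id"], req["amount"])

    def _execute(self, key: str, order_id: str, amount: Decimal) -> dict:
        # A real implementation would call the payment system here.
        result = {"status": "refunded", "order_id": order_id, "amount": str(amount)}
        self.completed[key] = result
        return result
```

The model's output only ever reaches `request`; whether the refund executes, waits, or is deduplicated is decided by code the prompt cannot influence.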

Do guardrails make an agent slower or dumber?

Well-placed ones add little latency and mostly remove failure modes, not capability. The cost is real but mundane: someone has to define the scopes, build the validation, and test against adversarial inputs. The payoff is an agent you can actually put in front of customers and regulators, because you can prove what it can't do — usually the difference between a demo and a deployment.
