Field note
The AI Agent Launch Checklist
AI agentsThe AI Agent Launch Checklist
A demo proves an agent can succeed once. A launch checklist proves it won't fail in the ways that cost you money. Here is the pre-flight list we run before an agent touches a real customer, ordered by what actually breaks first.
A demo proves an agent can succeed once. A launch checklist proves it won't fail in the ways that cost you money. Those are different problems, and most teams solve only the first before they go live.
I spent years building Agentforce as a PM at Salesforce, and the pattern is consistent: the agent that wowed the room in the sandbox starts doing something quietly expensive in week two. Not because it broke — it never threw an error. Because nobody asked the boring questions before launch. This is the list of boring questions, ordered by what breaks first.
Before anything: define the one number this agent moves
If you can't name the single metric this agent is accountable for, you're not ready to launch — you're ready to run a science experiment. "Deflect support tickets" is a direction, not a target. "Resolve 40% of password-reset and order-status tickets end-to-end without a human, at or above current CSAT" is a target you can pass or fail. The difference is that the second one names the intents in scope and the quality floor it must not breach.
Then write down the baseline before launch, in the same unit. If you don't know what the human-run process costs, converts, or resolves today, you can never prove the agent helped — and "we think it's working" is not a result you can put a fee on. This is why we tie our own fee to the client's return: it forces the number to exist, and to be measured the same way before and after, before a single agent ships.
Data readiness: the failure mode no demo reveals
Agents fail in production mostly because of the data they reach for, not the model reasoning over it. A model that hallucinates from nothing is annoying. A model that confidently retrieves a stale contract term, a duplicate contact record, or a price that expired last quarter is a liability — and it delivers it in a calm, fluent, citation-shaped voice that makes the wrong answer look authoritative. Run this part of the checklist against your real data, not a curated sample. The curated sample is exactly the data that already works.
- Grounding sources are named and ranked — when two knowledge articles or two objects disagree, the retrieval layer knows which one is authoritative, rather than returning both and letting the model pick.
- Freshness has a named owner. Someone is responsible for the fact that the refund policy the agent cites is the one in effect today, with a process for retiring the old version from the index.
- Duplicates and orphaned records are resolved for the entities the agent touches — a unified customer view, not three half-records the agent stitches together inconsistently.
- Field-level permissions match the agent's audience. A customer-facing agent must be physically unable to surface internal margin notes or other reps' accounts, enforced at the data layer, not by a prompt asking it nicely.
- You log what the agent retrieved, not just what it said. Without the retrieved context, a single wrong answer is unreproducible and therefore unfixable.
“An agent is only as honest as the data underneath it. Get the data right before you make the agent eloquent — fluent and wrong is worse than slow and right.
Scope the agent by its actions, not its conversations
The risky part of an agent is not what it says — it's what it does. A read-only agent that answers a question wrong can be embarrassing. An agent that can issue a refund, reassign an account owner, or push an opportunity to closed-won can be expensive in ways that don't surface until someone reconciles the books a month later.
So scope by capability. For every action the agent can take, decide before launch whether it executes autonomously, executes only after a confirmation step, or merely drafts for a human to send. Put hard spend and rate limits on the irreversible ones at the platform level. Default to draft-only for anything that moves money, changes ownership, or sends an external communication, then graduate one specific action at a time to autonomous after you've watched it behave on real traffic — not because a steering-committee date arrived.
Test the unhappy paths the demo skipped
Demos test the question the agent was built to answer. Production sends the questions nobody scripted. Your launch gate should include adversarial and edge-case testing as named, repeatable cases — a suite you re-run after every prompt or model change, not a one-time vibe check.
- Out-of-scope requests — does it refuse cleanly and hand off, or improvise a confident wrong answer?
- Prompt injection from both user input and retrieved documents — a malicious instruction hidden in a knowledge article or a customer-supplied field is the attack vector teams forget, because it doesn't come through the chat box.
- Ambiguous or contradictory inputs — does it ask one clarifying question, or guess and act?
- The empty case — what does it do when the data simply isn't there? Silence and a fabricated answer are both failures; the right behavior is to say so and escalate.
- Identity and entitlement — does it correctly refuse to act on an account or record the current user isn't authorized for, even when politely asked?
The escalation path is part of the product
An agent that can't hand off gracefully isn't a safe agent — it's a trap with a friendly voice. Decide what triggers a handoff: low confidence, a sensitive topic, an explicit user request, a second failed attempt at the same task. Then make sure the human inherits the full context — the conversation, what the agent retrieved, and what it already tried — not a cold transfer that makes the customer repeat themselves. A clumsy escalation erases the goodwill the agent earned in the first ninety seconds and trains customers to demand a human up front next time.
Instrument it like you intend to be held accountable
You cannot improve, or defend, what you didn't measure. Before launch, confirm you are capturing containment rate, escalation rate with reasons, action success and reversal counts, latency, cost per resolution, and a quality signal you trust more than a customer thumbs-up — a sampled human review or an automated grader scored against your golden set, because thumbs-up data is sparse and biased toward the angry. Pipe all of it somewhere a named person reviews on a fixed cadence. A dashboard nobody opens is theater.
This is the line between launching an agent and running one. Launching is a Tuesday. Running is the ongoing discipline of watching the metric, catching drift, and tuning the prompts, grounding, and action limits as reality shifts — which is exactly the part most projects skip and most vendors hand back to you on day one.
Plan the rollback before you need it
Launch behind a ramp, not a switch. Start with a thin slice of traffic — one channel, one segment, one intent — behind a kill switch with a clean fallback to the human process. In our Green Subsidy solar work, the speed-to-lead agent earned more traffic only as it proved it could hold quality at each step, with the human path always one toggle away. The fastest way to lose trust in an agent is to give it everything on day one and meet the edge case in front of your entire customer base.
The launch is the start of the work, not the end
Treat this checklist as the gate, not the finish line. The day after launch your job changes from building to operating: reading the retrieval logs, watching the number you defined at the top, and deciding which autonomous action graduates next. Agents drift, data changes, and customers find paths you didn't imagine. The teams that win are the ones who planned to keep showing up after the demo high wore off.
If any section of this list made you wince — the baseline number, the freshness owner, the rollback plan — that's the section to fix first. Every item here is cheaper to confront on a whiteboard than in a postmortem.
Want a second set of eyes before your agent goes live? Book a launch-readiness review and we'll run this checklist against your real Salesforce data and the number you're trying to move.