All posts

Field note

How to Ship Your First Agentforce Agent (Without It Stalling After the Demo)

Akshit Kandi
#Agentforce#Salesforce#AI agents#implementation#Data Cloud
How to Ship Your First Agentforce Agent (Without It Stalling After the Demo)
Agentforce

How to Ship Your First Agentforce Agent (Without It Stalling After the Demo)

SkySync

A demo agent and a production agent are different animals. Here is how to scope your first Agentforce agent so it survives contact with real users, real data, and real edge cases.


Most first Agentforce agents work beautifully in the demo and quietly die three weeks later. Not because the technology failed. Because the demo answered the easy 80% of questions, and production is made entirely of the other 20%. I watched this pattern enough times building Agentforce as a PM at Salesforce that I can now predict, in the first scoping call, which agents will stall. This is a guide to building the kind that doesn't.

The demo agent and the production agent are not the same agent

A demo agent has one job: prove the concept to people who already want to believe it. It handles the three questions you fed it, on clean data, in front of a friendly audience. A production agent has the opposite job. It faces every question a confused, annoyed, or adversarial user can throw at it, on data that is stale, duplicated, or missing, with no one in the room to smooth over a bad answer.

The gap between these two is where projects go to die. The demo earns the budget. Then someone says "great, ship it to the whole team," and the agent meets reality. Your real work isn't building the demo. It's closing the gap between the demo and the messy 20%. Plan for that gap from day one or you'll rebuild from scratch after the honeymoon ends.

Pick a use case where being wrong is cheap

The instinct is to point your first agent at your biggest, most visible problem. Resist it. Your first agent is how you learn the failure modes of agents on your data, in your org, with your guardrails. You want that learning to be cheap.

Score candidate use cases on two axes: how often you'll be right, and what a wrong answer costs. The ideal first agent lives in the high-frequency, low-cost-of-error quadrant — you get many reps to learn from, and each miss is survivable.

  • Good first agent: drafting a reply a human reviews before it sends. A bad draft costs ten seconds of editing, and you keep a human in the loop while you build trust.
  • Good first agent: surfacing the three most relevant knowledge articles for a support rep. A wrong article is ignored, not catastrophic.
  • Bad first agent: autonomously issuing refunds or changing contract terms. A wrong answer costs money and trust, and you haven't earned the right to be trusted yet.
  • Bad first agent: anything where the user can't tell a good answer from a confident wrong one. If they can't catch the error, you can't ship it without months of evaluation first.

Ship where mistakes are recoverable. Earn autonomy by proving accuracy on the low-stakes version first — the refund agent comes after the draft-a-reply agent has a track record, not instead of it.

Your agent inherits every sin in your data model

Here is the part the keynote skips. An Agentforce agent is only as good as what it can ground on. Point it at three overlapping Contact records for the same person and it will confidently cite the wrong one. Give it a knowledge base where half the articles are two product versions out of date, and it will answer with the old version, fluently. The retrieval layer has no opinion about which record is right; it returns what matches the query, and the model speaks it with conviction.

This is why we put data before agents, every time. Before you write a single topic or action, audit the specific slice of data your agent will touch. Not the whole org. The slice. If your agent answers billing questions, the dedup, freshness, and access rules on billing data are now your top priority. Everything else can wait.

  • Deduplicate the objects the agent reads. Conflicting records produce confident nonsense, because the agent can't tell which duplicate is authoritative unless you've decided for it.
  • Check freshness. Decide how stale is too stale, and make sure retrieval respects it — a timestamp filter is cheaper than a wrong answer.
  • Map field-level access. The agent runs in a user context; it should see exactly what that user is allowed to see, no more. Test this with a low-privilege user, not an admin.
  • Write down what the agent should NOT know about. Scope is a feature, and an explicit out-of-bounds list is easier to enforce than an implicit one.

An agent doesn't fix bad data. It broadcasts it, in full sentences, with total confidence.

Topics and actions: narrow beats clever

Agentforce routes a user request to a topic, and a topic exposes a set of actions. The most common first-build mistake is making topics too broad and instructions too long. A topic called "Account Help" that covers billing, technical support, and renewals will misroute constantly, because the boundaries between those intents are fuzzy and you've asked the planner to guess. Long instructions make it worse: every extra paragraph dilutes the signal the planner uses to choose a topic and pick an action.

Define narrow topics with sharp edges. "Reset a password" is a good topic. "Manage account" is not. For actions, prefer a few deterministic, well-described actions over one flexible action that tries to do everything. Each action's description is the contract the planner reads to decide when to call it — treat it like an API spec, not a marketing blurb. Write those descriptions for the planner, not for a human skimming a wiki: name the trigger conditions, the inputs, and the cases where it should NOT fire.

And put the deterministic work in deterministic places. If a step is really a query, a calculation, or a multi-record update, make it a Flow or an Apex action, not a paragraph of natural-language instruction. The model should decide what to do; your code should do the parts that must be exact. A prompt that does arithmetic will get the arithmetic wrong eventually. A Flow won't.

Build the eval set before you build the agent

This is the single highest-leverage habit, and almost no one does it first. Before tuning instructions, write 30 to 50 real test cases: the actual questions users will ask, including the messy ones, paired with what a correct response looks like. Pull them from real support transcripts, sales calls, internal tickets. Not made-up questions. Real ones — the made-up set always flatters the agent.

Now you have a ruler. Every change to a topic, an action, or an instruction gets measured against the same set. You'll catch the regressions where fixing one case quietly breaks three others — the failures you'd never reproduce by clicking around. Without an eval set, "is it better?" is a vibe. With one, it's a number you can put in front of a skeptical stakeholder. The team that ships a durable agent is the team that measured before they tuned.

  • Include the adversarial cases: off-topic questions, prompt injections, requests for data the user shouldn't access. These are where reputations get lost.
  • Include the boring cases: the routine question asked five different ways. Routing has to survive paraphrase, and paraphrase is where broad topics fail first.
  • Include the "I don't know" cases. An agent that correctly declines is often more valuable than one that always answers, and most eval sets forget to score for it.

Design the handoff before you design the agent

Every production agent needs a clean exit to a human, and the agents that stall usually treated it as an afterthought. Decide in advance what triggers escalation: low confidence, a sensitive topic, an explicit user request, a tool failure, a third turn with no resolution. Decide what context travels with the handoff so the human doesn't make the customer repeat themselves — the transcript, the records the agent touched, and why it bailed.

A good handoff is not a failure. It's the safety valve that lets you ship in the low-cost-of-error quadrant with confidence, knowing the rare hard case routes to a person instead of getting a fabricated answer. Users forgive "let me get someone who can help." They don't forgive a wrong answer delivered with a smile.

Launch to ten people, not ten thousand

The demo tempts you to flip it on for everyone. Don't. Launch to a small, friendly cohort who will tell you when it's wrong. Instrument everything: every conversation, every action call, every escalation, every thumbs-down. Watch the transcripts daily for the first two weeks — read them, don't just skim the dashboard. You are not done building when you launch. You're starting the part where you find out what you missed.

This is also where the honest economics show up. An agent isn't a thing you install. It's a thing you run. Models change, data drifts, users find new ways to ask, edge cases keep arriving. The cost of an agent is mostly the cost of keeping it good after launch, which is exactly why we tie our fee to the outcome it produces rather than the day it goes live. If it stops working, that's our problem too.

The shape of a first agent that survives

Strip away the specifics and the pattern is simple. Pick a use case where being wrong is cheap. Clean the narrow slice of data it touches. Keep topics sharp and put exact work in code. Build the eval set first. Design the human handoff up front. Launch small and watch closely. Plan to run it, not just build it. We learned this discipline shipping speed-to-lead agents for solar with Green Subsidy, where a slow or wrong answer is a lost customer, and it generalizes to nearly every first agent worth building.

None of this is the exciting part. The exciting part is the demo. This is the boring infrastructure that decides whether the exciting part is still running next quarter. Do the boring parts and your first agent earns the right to a second one.

Scoping your first Agentforce agent and want a second pair of eyes on the use case and the data underneath it? Book a working session.