Field note
A KPI Dictionary for AI Agents
Most agent dashboards measure the model, not the money. Here is the short list of metrics that actually tell you whether an agent is working — defined tightly enough to put in a contract.
Most teams measure their AI agent the way they'd measure a chatbot: deflection rate, average handle time, a thumbs-up button nobody clicks. Then they wonder why the dashboard is green and the CFO is unconvinced. The problem isn't the agent. It's the dictionary. The words you use to measure an agent decide what it optimizes for, and most of the common words point at the model's activity instead of the outcome you were paying for.
This is a working glossary — not every metric you could collect, but the small set worth defining precisely, plus the popular ones worth treating with suspicion. The test for each entry is the same: could you write it into a contract and have both sides agree on what it means before any work starts? If a metric can't survive that, it won't survive a board meeting either.
First, three words you have to pin down before anything else
Almost every argument about agent performance is really an argument about an undefined word. Fix these three and half the disputes disappear before they start.
- Task. The unit you're measuring — not a message, not a turn, not a session. A task is a discrete thing the agent was asked to accomplish: resolve a billing dispute, qualify a lead, draft a renewal quote. One customer conversation can contain three tasks or zero. If you can't name the task, you can't say whether it succeeded.
- Success. Defined from the user's outcome, not the agent's behavior. "The agent responded" is not success. "The dispute was resolved and the customer did not reopen it within 7 days" is. Put the time window inside the definition, or the metric will be gamed by closing tickets fast and eating the reopens later.
- Autonomy. The share of a task the agent finished without a human touching it. A 90%-deflection agent that quietly routes every hard case to a person has low autonomy — and you want that number reported on its own, because autonomy is what changes your cost structure, while success is what changes your revenue.
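One way to make these three definitions concrete enough to argue over is to write them down as a record and two small checks. The sketch below is illustrative only: the `Task` fields, the `is_success` and `autonomy_rate` helpers, and the choice to treat autonomy as a per-task yes/no are assumptions made for the example, not part of any particular platform.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List, Optional

# Hypothetical record shape; field names are assumptions for illustration.
@dataclass
class Task:
    task_id: str
    kind: str                              # e.g. "billing_dispute", "lead_qualification"
    requested_at: datetime
    resolved_at: Optional[datetime] = None
    reopened_at: Optional[datetime] = None
    human_touched: bool = False

# The time window lives inside the success definition, as the text recommends.
REOPEN_WINDOW = timedelta(days=7)

def is_success(task: Task, now: datetime) -> Optional[bool]:
    """True or False once the outcome is known; None while the reopen window is still open."""
    if task.resolved_at is None:
        return False                       # never resolved: not a success
    deadline = task.resolved_at + REOPEN_WINDOW
    if task.reopened_at is not None and task.reopened_at <= deadline:
        return False                       # reopened inside the window
    if now < deadline:
        return None                        # too early to call
    return True

def autonomy_rate(tasks: List[Task]) -> float:
    """Share of tasks finished with no human touch (a per-task simplification)."""
    if not tasks:
        raise ValueError("no tasks to measure")
    return sum(not t.human_touched for t in tasks) / len(tasks)
```

Returning `None` while the window is open is a deliberate choice in this sketch: it keeps fast ticket closes from counting as wins before the reopens have had time to arrive.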
The outcome layer: the metrics a buyer should actually read
These are the numbers that belong on the slide. Each maps to money or to a decision, and none requires you to understand the model to interpret it.
- Task success rate. Successful tasks over attempted tasks, using the success definition above. The single most important number — and the one most dashboards quietly avoid, because it's always lower than the vanity metrics sitting next to it.
- Containment vs. resolution. Containment means the agent handled it without escalating. Resolution means the problem actually went away. The gap between them is where bad agents hide. A high-containment, low-resolution agent is an expensive way to make customers angrier before a human gets to them.
- Cost per successful task. Total cost — model tokens, tool calls, human review, the eng time to maintain it — divided by successful tasks, not by all tasks. Dividing by all tasks flatters you; failed tasks still burn tokens and often burn more, because they run longer before giving up.
- Time-to-outcome. Wall-clock time from request to resolved, human handoff included. For a speed-to-lead agent this is the whole game: the value isn't that the agent replied, it's that it replied before the prospect filled out three competitors' forms.
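The four outcome metrics above can be computed from one flat list of task records. What follows is a minimal sketch under stated assumptions: the `TaskOutcome` fields and the `outcome_metrics` function are hypothetical names, `resolved` is taken to mean success under whatever definition you agreed on, and `cost_usd` is assumed to already include tokens, tool calls, review, and maintenance.

```python
from dataclasses import dataclass
from statistics import median
from typing import List, Optional

# Hypothetical per-task record; field names are assumptions for illustration.
@dataclass
class TaskOutcome:
    contained: bool                       # the agent did not escalate
    resolved: bool                        # the problem actually went away
    cost_usd: float                       # all-in cost attributed to this task
    seconds_to_outcome: Optional[float]   # wall clock, handoff included; None if never resolved

def outcome_metrics(tasks: List[TaskOutcome]) -> dict:
    attempted = len(tasks)
    if attempted == 0:
        raise ValueError("no tasks attempted")
    successes = [t for t in tasks if t.resolved]
    total_cost = sum(t.cost_usd for t in tasks)   # failed tasks still cost money
    times = [t.seconds_to_outcome for t in successes if t.seconds_to_outcome is not None]
    return {
        "task_success_rate": len(successes) / attempted,
        "containment_rate": sum(t.contained for t in tasks) / attempted,
        # Contained but not resolved: the gap where bad agents hide.
        "contained_unresolved_rate": sum(t.contained and not t.resolved for t in tasks) / attempted,
        # Divide total cost by successes, not by attempts.
        "cost_per_successful_task": total_cost / len(successes) if successes else float("inf"),
        "cost_per_attempted_task": total_cost / attempted,
        "median_seconds_to_outcome": median(times) if times else None,
    }
```

Reporting cost per attempted task next to cost per successful task makes the flattering effect visible: the wider the spread between the two, the more of the spend is going to failures.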
“If a metric goes up when the agent does more work but the customer's problem doesn't go away, it's measuring activity, not value. Cut it from the executive view.”
The reliability layer: metrics for the people who run it
Outcome metrics tell you whether the agent is winning. Reliability metrics tell you whether you can trust the score tomorrow. This is the layer architects live in, and the layer executives forget exists until the first incident review.
- Grounding rate. Of the agent's factual claims, the share traceable to a real source — a record, a retrieved document, a tool result — rather than generated from the model's priors. Measure it by sampling responses and checking each load-bearing claim against a citation. The inverse is your hallucination exposure. No grounding number means you don't have a quality metric, you have a vibe.
- Tool-call accuracy. When the agent invokes an action — issue a refund, update an opportunity, book an appointment — how often does it pick the right tool with the right arguments? Track wrong-tool and wrong-argument errors separately; they have different fixes. These failures are silent and expensive, because the agent narrates them with the same confidence as a correct call.
- Escalation precision and recall. Of the cases the agent escalated, how many genuinely needed a human (precision)? And of the cases that genuinely needed a human, how many did the agent actually escalate (recall)? The second number is the one that hurts: every miss is the agent confidently handling something it had no business handling.
- Recovery rate. When a step fails — a tool times out, a record is missing, an API returns garbage — how often does the agent recover gracefully versus stall, loop, or invent? Production is mostly the unhappy path. An agent measured only on the happy path is measured on the part that rarely breaks.
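Escalation precision and recall are the entries in this layer most often computed loosely, so here is a small sketch of one way to do it. It assumes each sampled case has been labeled by a human reviewer as having needed a person or not; the function name and the `(escalated, needed_human)` pair shape are hypothetical. Recall is computed in the standard way, as escalated cases over all cases that needed a human, and the share of non-escalated cases that should have been escalated is reported separately.

```python
from typing import Dict, Iterable, Optional, Tuple

def escalation_scores(cases: Iterable[Tuple[bool, bool]]) -> Dict[str, Optional[float]]:
    """cases: (escalated, needed_human) pairs from a human-labeled sample (an assumption)."""
    tp = fp = fn = tn = 0
    for escalated, needed_human in cases:
        if escalated and needed_human:
            tp += 1      # escalated, and rightly so
        elif escalated:
            fp += 1      # escalated, but the agent could have handled it
        elif needed_human:
            fn += 1      # handled by the agent, but needed a human
        else:
            tn += 1      # handled by the agent, correctly
    return {
        # Of the escalated cases, the share that genuinely needed a human.
        "precision": tp / (tp + fp) if (tp + fp) else None,
        # Of the cases that needed a human, the share the agent escalated.
        "recall": tp / (tp + fn) if (tp + fn) else None,
        # Of the cases the agent kept, the share it should have handed off.
        "missed_share_of_kept": fn / (fn + tn) if (fn + tn) else None,
    }
```

Returning `None` rather than zero when a denominator is empty keeps a week with no escalations from reading as a week of perfect or terrible precision.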
The metrics to distrust
Some popular numbers aren't wrong — they're just easy to move without doing anything good. Keep them if you like. Never let them lead.
- Deflection rate rewards the agent for not escalating, which is exactly the wrong instinct when escalating is the correct move. Pair it with resolution or it actively trains the behavior you don't want.
- Thumbs-up / CSAT on the agent turn measures whether the reply felt pleasant, not whether the problem got solved, and it rests on response rates too low and too skewed to trust. Fine as a guardrail, useless as a target.
- Average handle time can drop because the agent got more efficient or because it started bailing sooner. The metric can't tell those apart. Time-to-outcome can, because an unresolved task has no outcome to clock.
- Tokens or messages per conversation is a cost input, not a performance metric. Optimize it directly and you get curt, unhelpful agents that technically spent fewer tokens failing.
Why your data layer sets the ceiling on all of these
Here's the part the demos skip. Grounding rate, tool-call accuracy, resolution — every one of them is capped by the data underneath. An agent reading a customer record split across three systems, with a stale address in one and an open case nobody merged, can't resolve a dispute whose facts it literally cannot see. The model isn't the bottleneck; the record is. You cannot measure your way past a data problem — a good metric will just report it faithfully, over and over.
That's why we treat the agent as the last step, not the first. Unify the customer record and make it trustworthy, then put an agent on top of it — what we call Data-to-Agent. The KPIs above are how you tell whether that foundation is actually real: a jump in grounding rate after you consolidate the record is about the most honest signal you'll get that the plumbing work paid off, because it's the metric that can't be faked by a better prompt.
Collapsing the dictionary into one number leadership will read
Fifteen metrics is a dictionary nobody reads. The job is to chain a few of them into something a non-technical leader can follow in one breath: attempted tasks, times task success rate, times the value of one successful outcome net of the cost per successful task. That product is the agent's contribution in dollars, built from definitions each side can challenge line by line instead of distrusting the whole figure at once.
Plug in illustrative numbers to see the shape. Say an agent attempts 1,000 lead responses a month, hits a 40% task success rate (400 qualified), each qualified lead is worth some defined amount to your pipeline, and a successful task costs you a few dollars to run. Subtract the cost of a successful task from the value of a qualified lead, multiply by the 400 successes, and you have the headline. The point isn't the figure. It's that every input is a metric you defined, so anyone can argue with a single link in the chain instead of dismissing the conclusion.
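The chain is short enough to write as a single function. The sketch below assumes the cost figure is cost per successful task, so it is subtracted from the value of one success before multiplying by the number of successes. The function name is hypothetical, and the dollar values in the usage lines are placeholders invented for the example, since the text deliberately leaves them undefined.

```python
def agent_contribution(
    attempted_tasks: int,
    task_success_rate: float,
    value_per_success: float,
    cost_per_successful_task: float,
) -> float:
    """Monthly contribution in dollars: successes times net value per success."""
    if not 0.0 <= task_success_rate <= 1.0:
        raise ValueError("task_success_rate must be between 0 and 1")
    successes = attempted_tasks * task_success_rate
    return successes * (value_per_success - cost_per_successful_task)

# 1,000 attempts and a 40% success rate come from the example in the text.
# The 50.0 and 3.0 dollar figures are hypothetical placeholders only.
headline = agent_contribution(
    attempted_tasks=1000,
    task_success_rate=0.40,
    value_per_success=50.0,
    cost_per_successful_task=3.0,
)
print(f"{headline:,.2f}")
```

With those placeholder inputs the chain is 400 successes times 47 dollars of net value each. Each argument is a separate link, so a skeptic can swap in a different success rate or lead value and rerun it rather than rejecting the total.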
“The right agent KPI isn't a metric. It's a chain of definitions short enough to recite and rigorous enough to argue with.”
Use this as a checklist, not a trophy case
You don't need all of these on day one. Start with task success rate, resolution, and cost per successful task — the three wired to money. Add grounding rate and escalation recall the moment the agent touches anything that can hurt you. Then retire the vanity metrics as the real ones come online, because two scoreboards means someone will always quote whichever one is winning that week.
The discipline isn't collecting more numbers. It's refusing to ship a metric you can't define, and refusing to celebrate one that doesn't move an outcome. Measure the agent the way you'd measure a new hire: not by how busy it looked, but by what got done — and whether you'd trust it to do it again unsupervised.
If you want these metrics wired into a real Agentforce deployment — and a fee that moves with your results, not your token bill — start a conversation with us.