Field note

AI Agent ROI: How to Set Honest Benchmarks by Use Case

May 27, 2026 | Akshit Kandi

Tags: AI ROI, AI agents, CFO, business case, benchmarks

Most AI agent ROI benchmarks are set against the wrong baseline, which is why the savings never reach the P&L. Here is how to set a benchmark by use-case shape that survives the year-end audit.

A vendor tells you their AI agent delivers "300% ROI." Your gut says the number is invented. Your gut is half right. The number usually isn't fabricated — it's measured against a baseline chosen by the party being graded. The deck shows you the numerator in bright colors and never shows the baseline at all, because the baseline is where the honesty lives, and honesty rarely flatters.

ROI is a fraction, and almost everyone fights over the numerator: hours saved, deflection rate, faster response. The denominator and the baseline are where projects quietly fail to reach the P&L. A benchmark is only honest if you'd defend it to your own auditor, not just to the board. This piece is about setting that benchmark — and the core claim is that there is no single one, because it changes with the shape of work the agent actually does.

The baseline is the whole argument

Take a customer-service agent. The vendor's baseline is "cost per ticket fully handled by a human." It looks clean. It's wrong on three counts. It assumes every ticket the agent touches would otherwise have cost a full human ticket — but a chunk of them were password resets the customer could self-serve anyway. It ignores that the agent escalates the hard cases, so a human still touches those, now twice. And it books the loaded cost of an FTE you will not actually remove. Three quiet assumptions, each one bending the ratio upward.

The honest baseline isn't "what a human costs." It's what actually changes on your financial statements when the agent runs versus when it doesn't. If no headcount leaves and no revenue arrives, the benchmark should land near zero no matter how many tickets the agent closed. Activity is not impact. Your instrumentation should be able to tell the two apart at the line-item level — and if it can't yet, that gap is the first thing to fix, before any agent goes live.
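To make the gap between the two baselines concrete, here is a minimal sketch that prices the same month of tickets both ways. Everything in it is an assumption for illustration: the `TicketMonth` fields, the function names, and every figure in the usage note are invented, not data from any deployment.

```python
from dataclasses import dataclass


@dataclass
class TicketMonth:
    """One month of agent activity. All fields are illustrative assumptions."""
    tickets_touched: int           # tickets the agent handled or started
    self_serve_anyway: int         # tickets the customer would have resolved alone
    escalated_to_human: int        # tickets a human still had to finish
    human_cost_per_ticket: float   # loaded cost per ticket used in the vendor deck
    spend_actually_removed: float  # dollars that really left the budget this month


def vendor_baseline_saving(m: TicketMonth) -> float:
    # Vendor framing: every touched ticket replaces a full human ticket.
    return m.tickets_touched * m.human_cost_per_ticket


def displaced_ticket_value(m: TicketMonth) -> float:
    # Only tickets that truly needed a human, and no longer do, count as displaced.
    displaced = m.tickets_touched - m.self_serve_anyway - m.escalated_to_human
    return max(displaced, 0) * m.human_cost_per_ticket


def honest_baseline_saving(m: TicketMonth) -> float:
    # Bookable saving is capped by spend that actually left the statements.
    return min(displaced_ticket_value(m), m.spend_actually_removed)
```

With 10,000 tickets touched, 3,000 self-serve, 2,000 escalated, an assumed 12 dollars per ticket, and no spend removed, the vendor baseline reports 120,000 dollars, the displaced work is worth 60,000, and the honest figure is zero. The cap in `honest_baseline_saving` is the design choice that matters: activity sets an upper bound, and the financial statements set the number.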

“The first question on any AI ROI is not 'what did the agent do?' It's 'what line on the P&L moved, and would it have moved anyway?'”

Four use-case shapes, four different benchmarks

There is no universal AI ROI benchmark because agents do fundamentally different kinds of work, and each kind moves a different number in a different direction. Collapse them into one "productivity" metric and the business case goes soft on contact. Roughly four shapes recur, and each one has its own honest unit of measure.

Cost-takeout agents (deflect tickets, draft responses, triage): benchmark against marginal cost actually removed, not loaded FTE cost. The honest unit is dollars you stop spending — which almost always requires a real staffing or volume decision, not just time freed up.
Revenue-capture agents (speed-to-lead, qualification, follow-up): benchmark against incremental conversion on volume you were already losing. The unit is contribution margin on deals that would not have closed otherwise — not credit for deals your reps would have won anyway.
Cycle-time agents (faster quote, faster onboarding, faster close): benchmark against the cash-flow or throughput value of the time, not the time itself. A day saved is worth something only if it lets you bill sooner or serve one more customer with the same team.
Risk-and-quality agents (compliance checks, error catching, consistency): benchmark against the expected cost of the errors prevented — frequency times severity — which is the hardest to measure and the easiest to overclaim.

Only one of these, revenue-capture, produces new money directly. The other three reduce or avoid cost or free up capacity, and none of that counts until someone decides to actually capture it. A benchmark that blends "new revenue" and "theoretical time saved" into one ROI percentage is built to mislead, even when no one intends it to. When you see a single headline number spanning more than one shape, ask which fraction is cash and which is a forecast wearing cash's clothes.
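One way to keep the four shapes from blurring into a single percentage is to give each its own unit and only then split the headline. The sketch below is a simplification under stated assumptions: the `Shape` enum, the function names, and the idea of reducing each shape to a two-factor product are all hypothetical, chosen for illustration.

```python
from enum import Enum
from typing import Dict


class Shape(Enum):
    COST_TAKEOUT = "cost-takeout"
    REVENUE_CAPTURE = "revenue-capture"
    CYCLE_TIME = "cycle-time"
    RISK_QUALITY = "risk-and-quality"


def cost_takeout_value(marginal_spend_removed: float) -> float:
    # Dollars you stop spending, not loaded FTE cost.
    return marginal_spend_removed


def revenue_capture_value(incremental_deals: float, margin_per_deal: float) -> float:
    # Contribution margin on deals that would not have closed otherwise.
    return incremental_deals * margin_per_deal


def cycle_time_value(days_saved: float, realized_value_per_day: float) -> float:
    # A saved day is worth zero unless it lets you bill sooner or serve more.
    return days_saved * realized_value_per_day


def risk_quality_value(errors_prevented_per_year: float, cost_per_error: float) -> float:
    # Expected cost of errors prevented: frequency times severity.
    return errors_prevented_per_year * cost_per_error


def split_headline(values: Dict[Shape, float]) -> Dict[str, float]:
    # Separate new money from cost that still needs a capture decision.
    new_money = values.get(Shape.REVENUE_CAPTURE, 0.0)
    cost_side = sum(v for s, v in values.items() if s is not Shape.REVENUE_CAPTURE)
    return {"new_money": new_money, "cost_reduced_or_avoided": cost_side}
```

Feeding `split_headline` a blended business case returns two numbers instead of one, which is usually enough to show which part of the headline is cash and which part is a forecast.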

Why most benchmarks count hours that never become dollars

Say an agent saves each of 40 reps two hours a week. That's 80 hours weekly, and the slide multiplies it by a loaded rate to produce a six-figure annual "saving." The math skips the only step that matters: those two hours don't aggregate into a person you take off payroll. They scatter back into the day. The work gets a little easier and the cost stays exactly where it was.

Saved time isn't money until a decision converts it — you redeploy those reps to higher-value selling, raise the quota each one carries, or grow into the next year without the next five hires. An honest benchmark forces that conversion into the model as its own line: time saved, then the explicit mechanism by which it becomes cash, then the owner of that mechanism. No mechanism, no booking. "Soft" savings no one is accountable for converting are not savings; they're a feeling about productivity.
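The "no mechanism, no booking" rule is simple enough to write down as code. This is a minimal sketch, not a finance model: the `TimeSaving` class, its field names, the 48 working weeks, and the hourly rate in the usage note are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class TimeSaving:
    reps: int
    hours_saved_per_rep_per_week: float
    loaded_hourly_rate: float                   # assumed figure, as on a vendor slide
    working_weeks_per_year: int = 48            # assumption
    conversion_mechanism: Optional[str] = None  # e.g. "avoid two planned hires"
    mechanism_owner: Optional[str] = None       # the person accountable for converting
    cash_from_mechanism: float = 0.0            # dollars the mechanism is expected to yield


def slide_saving(t: TimeSaving) -> float:
    # The deck's math: hours times rate, with no conversion step.
    return (t.reps * t.hours_saved_per_rep_per_week
            * t.loaded_hourly_rate * t.working_weeks_per_year)


def bookable_saving(t: TimeSaving) -> float:
    # No mechanism or no owner means nothing is booked.
    if not t.conversion_mechanism or not t.mechanism_owner:
        return 0.0
    return t.cash_from_mechanism
```

For 40 reps saving two hours a week at an assumed 60 dollars an hour, `slide_saving` returns 230,400 dollars a year, while `bookable_saving` returns zero until someone names both a mechanism and an owner. The bookable figure is deliberately not derived from the hours at all: it comes from the decision, which is the point of the paragraph above.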

The cost of being confidently wrong belongs in the denominator

Most vendor ROI models carry a clean denominator: license plus implementation. Real denominators are heavier, and leaving the rest out is how a 300% quietly becomes a 40%. The full cost of an agent in production includes the data work to make its answers trustworthy, the evals and monitoring that keep it from drifting as your products and policies change, the human escalation path for when it's unsure, and the expected cost of the times it is confidently wrong in front of a customer.

That last one is a line item, not a footnote. An agent that quotes a stale price or misstates a return policy carries a real cost — a refund, a churned account, a compliance flag, the engineering hours to trace what went wrong. You don't need to predict it to the dollar. You need a reasoned number in the model, sized by how often the agent acts unsupervised and how exposed each action is, so the benchmark reflects a system that can fail rather than a demo that can't. An ROI that assumes zero error cost is benchmarking a fantasy, and the gap shows up the first bad week in production.
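A short sketch shows how much the denominator alone can move the ratio. The cost categories follow the list above, but the `AgentCosts` class, the function names, and every dollar figure in the usage note are hypothetical, picked only so the arithmetic is easy to follow.

```python
from dataclasses import dataclass


@dataclass
class AgentCosts:
    """Annual cost lines for one agent. All values are illustrative assumptions."""
    license: float
    implementation: float
    data_work: float = 0.0
    evals_and_monitoring: float = 0.0
    human_escalation: float = 0.0
    expected_error_cost: float = 0.0

    def vendor_denominator(self) -> float:
        # The clean denominator: license plus implementation only.
        return self.license + self.implementation

    def full_denominator(self) -> float:
        # The heavier denominator: everything it takes to run in production.
        return (self.vendor_denominator() + self.data_work
                + self.evals_and_monitoring + self.human_escalation
                + self.expected_error_cost)


def expected_error_cost(unsupervised_actions: float, error_rate: float,
                        cost_per_error: float) -> float:
    # Sized by how often the agent acts alone and how exposed each action is.
    return unsupervised_actions * error_rate * cost_per_error


def roi(benefit: float, cost: float) -> float:
    # ROI as a fraction: net benefit over cost.
    if cost <= 0:
        raise ValueError("cost must be positive")
    return (benefit - cost) / cost
```

Take an assumed benefit of 420,000 dollars, a license of 60,000 and implementation of 45,000. Against that denominator, `roi` returns 3.0, the familiar 300%. Add 80,000 of data work, 45,000 of evals and monitoring, 40,000 of escalation staffing, and 30,000 of expected error cost (60,000 unsupervised actions at a 0.1% error rate and 500 dollars per error), and the same benefit yields roughly 0.4, or 40%.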

Set the benchmark before you build, and tie someone to it

The highest-leverage move on AI spend is sequencing: write the benchmark down before the build starts. One metric per use case. The current baseline, measured, not assumed. The target. The mechanism that converts the agent's output into that target. And the date you'll check. Do this before procurement and the vendor conversation changes shape — you're no longer buying a capability, you're buying a number, and capabilities are far easier to sell than numbers are to defend.

It's also the cleanest test of whether your partner believes their own pitch. If the benchmark is honest, the people building the agent should be willing to be measured against it after the contract is signed. We tie our fee to the client's return for exactly this reason — it forces the benchmark to be real on day one, because we get graded by it too, not just paid to deliver a demo. An ROI claim no one will stand behind once the ink dries was never a benchmark. It was marketing.

Data quality sets the ceiling on every number above

One reason benchmarks come in soft has nothing to do with the model. Point a capable agent at fragmented, stale, permission-ambiguous data and it produces mediocre outcomes by construction — it can only act on what it can reach and trust, and most orgs underestimate how little that is. The benchmark you set silently assumes the agent can pull the right record, fresh enough, with the right access, at the moment of the interaction. Often it can't, and the shortfall surfaces as a result that underperforms the model on the slide for reasons the slide never named.

In our Green Subsidy work — a speed-to-lead agent for solar — the benchmark that mattered was incremental conversion on leads that were going cold, not raw response speed. The lever wasn't the model's eloquence; it was getting clean, unified lead data in front of the agent fast enough to act while the prospect was still warm. Data readiness set the ceiling, and the agent reached toward it. This is why we sequence data before agents: benchmark the agent without auditing the data underneath and you've measured the road, not the car.

A benchmark you can defend in four lines

You don't need a model with thirty assumptions. Per use case, you need four honest lines, and if you can't fill them in, you're not ready to spend. What single number does this agent move, in dollars? What is that number today, measured rather than estimated? What is the explicit mechanism that turns the agent's output into a change in that number? And who owns the result when you check it next quarter? Four lines is enough to expose a soft case and short enough that no one can hide behind a spreadsheet.
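The four lines map naturally onto a small record with a readiness check. This is one possible shape, not a prescribed template: the `Benchmark` class, its field names, and the example values in the usage note are assumptions.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Benchmark:
    metric: str                         # the single dollar number the agent moves
    baseline_measured: Optional[float]  # today's value, measured; None if only estimated
    mechanism: str                      # how agent output becomes a change in the metric
    owner: str                          # who answers for the result at the next check

    def missing_lines(self) -> List[str]:
        # Return the names of any of the four lines left unfilled.
        missing = []
        if not self.metric.strip():
            missing.append("metric")
        if self.baseline_measured is None:
            missing.append("baseline_measured")
        if not self.mechanism.strip():
            missing.append("mechanism")
        if not self.owner.strip():
            missing.append("owner")
        return missing

    def ready_to_spend(self) -> bool:
        # Ready only when all four lines are filled in.
        return not self.missing_lines()
```

A case with a metric and an owner but no measured baseline and no mechanism comes back from `missing_lines` as `["baseline_measured", "mechanism"]`, and `ready_to_spend` stays false until both are supplied. Keeping the check this blunt is deliberate: it mirrors the rule that an unfilled line means you are not ready to spend.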

Fill those in and a few things happen at once. The vendor's 300% gets re-based against your reality and usually lands somewhere believable. The use cases that were only ever productivity theater fall out on their own, before they cost you a build. And the ones that survive are the ones you can defend to your auditor, your board, and yourself in the same breath. That's the whole job — not a bigger number, a number that holds when someone leans on it.

Want to pressure-test a real use case against an honest baseline? Model it with our ROI calculator, then book a call and we'll build the four-line benchmark with you before anyone writes code.
