
Field note

An AI Use-Case Prioritization Scorecard

Akshit Kandi
Tags: AI agents, AI ROI, prioritization, use case selection, Agentforce

Most prioritization scorecards quietly sort your use cases by how easy they are to build. Here is one structured so it can't — and the scoring mechanics that decide whether the output is a roadmap or a comfortable lie.


Every AI prioritization scorecard produces a ranked list. Almost none of them produce the right one. You can feel the moment it goes wrong: the workshop ends, the spreadsheet hands back a tidy ordering, and the thing on top is the use case the loudest engineer already wanted to build. The scorecard didn't make the decision. It laundered one that was already made, and stamped it with a number so nobody could argue.

The problem isn't that teams lack a scorecard. It's that the scorecard's structure — which columns, weighted how, summed or multiplied — encodes a theory of value, and most encode one that's wrong. Get the structure right and the ranking does real work: it surfaces the use case you wouldn't have picked, in the order you wouldn't have sequenced. Get it wrong and you've built an elaborate machine for confirming the bias you walked in with. This piece is about the structure, term by term, because that is where the decision actually lives.

The default scorecard is biased toward easy

Here's the shape almost everyone starts with: list candidate use cases as rows, score each 1–5 on value, effort, risk, and maybe strategic fit, sum the columns, sort descending. It looks balanced. It isn't. Effort and risk are the two dimensions a delivery team can assess precisely on day one, because they sit inside its own competence, so those scores come back crisp and confident. Value is the one dimension nobody can pin down yet, so it gets a hedged middle score for almost everything. Now watch what the sum does with that: the columns that actually vary from row to row are the ones describing how hard or scary the build is, while the one column describing whether it's worth doing sits near the middle for every row. The total is dominated by feasibility, and the scorecard sorts your use cases by how comfortable they are to ship. You didn't prioritize by ROI. You prioritized by ease, and dressed it as rigor. The fix isn't a sharper facilitator or finer 1–10 scales. It's a different structure, because the bias is in the arithmetic, not the room.
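To see the bias in the arithmetic, here is a minimal sketch with three hypothetical candidates. Every score is invented for illustration: value and fit are hedged at 3 for each row (an assumption about how workshops tend to score), and effort and risk are recorded as "ease" and "safety" so that higher is better in every column.

```python
# Hypothetical 1-5 scores. Higher is better in every column, so effort and
# risk are recorded as "ease" and "safety". Value and fit are hedged at 3.
candidates = {
    "meeting summarizer": {"value": 3, "ease": 5, "safety": 5, "fit": 3},
    "speed-to-lead": {"value": 3, "ease": 3, "safety": 4, "fit": 3},
    "contract review": {"value": 3, "ease": 1, "safety": 2, "fit": 3},
}


def additive_total(scores):
    """The default scorecard: sum every column."""
    return sum(scores.values())


# Rank once by the summed total, once by ease alone.
by_total = sorted(candidates, key=lambda name: additive_total(candidates[name]), reverse=True)
by_ease = sorted(candidates, key=lambda name: candidates[name]["ease"], reverse=True)

print(by_total)
print(by_total == by_ease)
```

With these made-up inputs the summed ranking and the ease-only ranking are the same list, which is the failure described above: the hedged columns contribute a constant, so the confident columns decide the order.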

Separate the gates from the score

The first structural move is to stop treating every dimension as a number you average. Some dimensions are scores — more is better, and you trade them off against each other. Others are gates — below a threshold the use case is out, and no amount of strength elsewhere buys it back in. Averaging gates with scores is the single most common way a scorecard goes wrong, because a fatal flaw gets quietly offset by a high mark somewhere else and survives into the top five.

Two dimensions should almost always be gates. The first is data readiness: if the records the agent needs to ground its decisions are duplicated, stale, or trapped in a system you don't govern, the use case is blocked — its value is irrelevant until that's fixed, because the agent will confidently act on bad inputs. The second is attributability: if you can't name the existing metric it moves and isolate the agent's effect from everything else changing that quarter, you can't grade it, and an ungradeable use case can't be run accountably or defended to a CFO. Score everything else. Gate those two.

A gate you average away isn't a gate. It's a footnote — and the footnote is usually the reason the project fails.
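A rough sketch of the difference, assuming 1-5 scores and a pass threshold of 3 for the two gates. The threshold, the function names, and the example row are all hypothetical; the point is only that averaging lets a fatal flaw survive while a gate does not.

```python
from statistics import mean

GATES = ("data_readiness", "attributability")


def averaged(scores):
    """Treat every dimension, gates included, as a number to average."""
    return mean(scores.values())


def gated(scores, threshold=3):
    """Park the row (return None) if any gate is below the threshold.

    Otherwise average only the non-gate dimensions.
    """
    if any(scores[gate] < threshold for gate in GATES):
        return None
    return mean(v for k, v in scores.items() if k not in GATES)


# Hypothetical row: unusable data, but strong marks everywhere else.
flawed = {"data_readiness": 1, "attributability": 4, "value": 5, "fit": 5}

print(averaged(flawed))  # the flaw is offset by the high marks
print(gated(flawed))     # the row is parked instead of ranked
```

Averaged, this row scores 3.75 out of 5 and would sit comfortably near the top of most lists. Gated, it returns `None` and never reaches the ranking.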

Value is multiplicative, not additive

Here's what the four-box matrices get structurally wrong. Value is not one slider. The annual value of an AI use case is roughly the product of three independent quantities: how often the task happens, how much each instance is worth, and how much of that the agent can actually move. Frequency times value-per-instance times realistic lift. They multiply — a near-zero in any one term collapses the whole thing — and an additive scorecard, where a weak term gets propped up by strong ones, cannot represent that collapse. Take a contract-review agent: enormous value per instance, but it fires forty times a year and the agent can safely automate maybe a third of the judgment. Now a lead-response agent: modest value per lead, but thousands of leads a month and a lift you can actually capture. Score each on a single 1–5 "value" slider and they land next to each other; multiply frequency by value-per-instance by lift and they separate by an order of magnitude. The multiplicative structure is the entire insight; everything below is just making the columns honor it.

  • Frequency: how many times does this task occur per period? Pick one period (a month or a year) and use it for every row, so the products come out in comparable units. Agent economics are per-decision, so this is the term that lets value compound rather than add.
  • Value per instance: what is one correct, fast, consistent handling of this task worth, in money or in a metric that converts to money on a path you can name?
  • Capturable lift: of that value, what fraction can the agent realistically move, after you subtract what humans already get right and what stays manual or escalates anyway?
  • Confidence: how sure are you of the three numbers above, on a 0-to-1 scale? This is the multiplier everyone forgets, and the one that separates a forecast from a wish.
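The contract-review versus lead-response comparison can be sketched in a few lines. The forty-per-year frequency and the one-third lift come from the example above; the dollar values per instance, the 3,000 leads a month, and the 0.4 lift for lead response are placeholders invented purely to make the arithmetic concrete.

```python
def annual_value(frequency_per_year, value_per_instance, capturable_lift):
    """Product of the three terms; a near-zero in any one collapses the result."""
    return frequency_per_year * value_per_instance * capturable_lift


# Contract review: rare, valuable, only partly automatable.
# The 5,000 per instance is a hypothetical figure.
contract_review = annual_value(40, 5_000, 1 / 3)

# Lead response: frequent, modest per lead, lift you can capture.
# 3,000 leads a month, 50 per lead, and 0.4 lift are all hypothetical.
lead_response = annual_value(3_000 * 12, 50, 0.4)

ratio = lead_response / contract_review
print(f"lead response is worth about {ratio:.1f}x contract review")
```

Under these assumed inputs the two candidates land more than tenfold apart, even though each could plausibly earn the same mark on a single "value" slider. Change the placeholders and the gap changes, but the collapse when any term approaches zero does not.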

The column everyone forgets: confidence

A scorecard full of point estimates lies by omission. "This use case is worth a lot" and "this use case is worth a lot, plus or minus almost all of it" are completely different bets, and a single cell hides the difference. The use case with the tighter band is usually the better first move even at a lower expected value, because a first project's job is partly to be provable — a defensible small win buys you the credibility to fund the bigger one. Wide error bars on your top-ranked item are a reason to sequence it second, not first.

So make confidence a 0-to-1 discount on the raw value, and apply it explicitly rather than letting it hide in a hedged guess. It does two things at once: it pulls speculative moonshots down to honest size, and it rewards the use cases where you've already done the homework — pulled the volumes, sampled the records, checked the metric is real — to ground the estimate. A scorecard without a confidence column rewards the most confident storyteller in the room. That is precisely the person you do not want setting the roadmap.
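As a small illustration of the explicit discount, assuming raw values are already estimated in money. The function name and both example figures are hypothetical.

```python
def discounted_value(raw_value, confidence):
    """Apply confidence as an explicit 0-to-1 multiplier on the raw estimate."""
    if not 0.0 <= confidence <= 1.0:
        raise ValueError("confidence must be between 0 and 1")
    return raw_value * confidence


# Hypothetical: a large speculative estimate versus a smaller grounded one.
moonshot = discounted_value(1_000_000, 0.2)
grounded = discounted_value(400_000, 0.8)

print(grounded > moonshot)
```

In this made-up pair the grounded estimate outranks the moonshot once the discount is applied, even though its raw value is less than half as large.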

Effort is a divisor, and it decays

Effort belongs in the scorecard, but not as a peer column you sum alongside value. It's a divisor — value over cost — because what you're ranking is return per unit of effort, not return minus effort. The distinction bites at the extremes: an additive model will rank a huge-value, huge-effort project above a solid-value, tiny-effort one, even when the second is the smarter use of a quarter of engineering capacity you'll never get back. And effort has internal structure. The build is one cost; the far larger one, the one teams systematically under-score, is the run: keeping the agent accurate as the world drifts, handling the escalations it can't, maintaining the evals that catch regressions, re-grounding when reality changes underneath it. A use case that's cheap to build and expensive to run forever is a worse bet than its day-one number suggests, so score lifetime effort, not launch effort. The agents that look cheapest in the workshop are often the ones that quietly eat a team six months later — which is exactly why we'd rather run an agent than hand it over and wish you luck.
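Two small sketches of the points above, with invented numbers throughout. The first contrasts "return minus effort" with "return per unit of effort"; the second adds run cost to build cost over an assumed 24-month horizon, which is an arbitrary choice for illustration.

```python
def net(value, effort):
    """Additive view: return minus effort."""
    return value - effort


def ratio(value, effort):
    """Divisor view: return per unit of effort."""
    return value / effort


# Hypothetical projects in arbitrary, matching units.
huge = {"value": 100.0, "effort": 60.0}
small = {"value": 30.0, "effort": 3.0}

print(net(**huge) > net(**small))      # additive view prefers the huge project
print(ratio(**small) > ratio(**huge))  # divisor view prefers the small one


def lifetime_effort(build_weeks, run_weeks_per_month, horizon_months=24):
    """Build cost plus the recurring cost of keeping the agent accurate."""
    return build_weeks + run_weeks_per_month * horizon_months


cheap_to_build = lifetime_effort(build_weeks=2, run_weeks_per_month=1.0)
cheap_to_run = lifetime_effort(build_weeks=8, run_weeks_per_month=0.25)
print(cheap_to_build, cheap_to_run)
```

The second half is the part a workshop tends to miss: the agent that takes two weeks to build costs more over the horizon than the one that takes eight, because its run cost never stops.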

Put it together: the formula and the one-pager

Assembled, the scorecard is less arbitrary than the four-box version because every term means something specific. First the gates: data-grounded and attributable, each a hard yes/no. Fail either and the row is parked, not ranked. Then a priority score for everything that passes — roughly (frequency × value-per-instance × capturable lift × confidence) ÷ lifetime effort. The exact arithmetic doesn't need to be precise to three digits. The structure needs to be honest enough that you can't average a fatal flaw into the middle of the pack.

Keep the artifact to one page. Rows are candidate use cases. Two gate columns up front, red or green, no partial credit. Then the four value terms and the effort divisor. Then the computed score, and — the column that keeps you honest — one sentence: "we'll know this worked because [existing metric] moves from X to Y." If that cell is empty, the row hasn't earned a number yet, however good the demo looked. Any X and Y you write are hypotheses until you've instrumented and measured them; treat them as illustrative, not as facts.
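One way the one-pager could be represented, as a sketch rather than a finished tool. The `Row` class, its field names, and every number in the example rows are hypothetical; the candidate names are the three discussed in this post. A row that fails a gate, or that has no success sentence, gets no score at all.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Row:
    name: str
    data_grounded: bool        # gate: hard yes/no
    attributable: bool         # gate: hard yes/no
    frequency: float           # occurrences per year
    value_per_instance: float
    capturable_lift: float     # 0 to 1
    confidence: float          # 0 to 1
    lifetime_effort: float     # build plus run, must be > 0
    success_sentence: str = ""  # "we'll know this worked because ..."

    def priority(self) -> Optional[float]:
        """Return the priority score, or None if the row is parked."""
        if not (self.data_grounded and self.attributable):
            return None
        if not self.success_sentence.strip():
            return None  # an empty cell means the row hasn't earned a number
        value = (self.frequency * self.value_per_instance
                 * self.capturable_lift * self.confidence)
        return value / self.lifetime_effort


def rank(rows):
    """Split rows into a ranked list of names and a parked list of names."""
    scored = [(row.priority(), row.name) for row in rows]
    ranked = sorted((s for s in scored if s[0] is not None), reverse=True)
    parked = [name for score, name in scored if score is None]
    return [name for _, name in ranked], parked


# All figures below are invented for illustration.
rows = [
    Row("meeting summarizer", True, False, 5_000, 5, 0.5, 0.5, 10),
    Row("contract-risk reviewer", True, True, 40, 5_000, 0.33, 0.4, 30,
        "review turnaround moves from X to Y"),
    Row("speed-to-lead", True, True, 36_000, 50, 0.4, 0.7, 20,
        "demo-booking rate moves from X to Y"),
]

ranked, parked = rank(rows)
print(ranked)
print(parked)
```

The design choice worth noting is that `priority` returns `None` rather than zero for parked rows, so a parked row can never be sorted into the middle of the pack by accident.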

A worked pass through three candidates

Say you're weighing three: an internal meeting-summarizer, an inbound speed-to-lead agent, and a contract-risk reviewer. The summarizer fails the attributability gate immediately — no existing metric moves in a way you can isolate from the dozen other things changing — so it's parked, regardless of how much the team enjoys the demo. The contract reviewer passes both gates but scores low on frequency and confidence and high on lifetime effort: a real candidate, just not the one you start with.

Speed-to-lead clears both gates — lead data already lives in your CRM, and conversion is a number sales already reports — and its multiplicative value is large because frequency is high and the lift is genuinely capturable: respond in seconds instead of hours, to every lead, consistently. Illustratively, nudging demo-booking from 8% to 10% across a few thousand leads a month is a figure you can defend, isolated with a simple holdout group rather than asserted. It wins not because it's exciting but because the structure can't hide its strengths or invent the others' value. (Those percentages are illustrative, not a result we're claiming.)
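A minimal sketch of what "a simple holdout group" could look like in practice, assuming each lead has a stable string ID. The hashing scheme, the 10% holdout fraction, and the lead counts are assumptions chosen so the rates match the illustrative 8% and 10% above; none of it is measured data, and the sketch does not attempt a significance test.

```python
import hashlib


def assign_group(lead_id, holdout_fraction=0.1):
    """Deterministically assign a lead to 'holdout' or 'agent' by hashing its ID."""
    digest = hashlib.sha256(lead_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64  # uniform in [0, 1)
    return "holdout" if bucket < holdout_fraction else "agent"


def conversion_rate(booked, leads):
    if leads <= 0:
        raise ValueError("need at least one lead")
    return booked / leads


def absolute_lift(agent_booked, agent_leads, holdout_booked, holdout_leads):
    """Difference in demo-booking rate between agent-handled and holdout leads."""
    return (conversion_rate(agent_booked, agent_leads)
            - conversion_rate(holdout_booked, holdout_leads))


# Illustrative counts only: 10% booking with the agent, 8% in the holdout.
lift = absolute_lift(agent_booked=270, agent_leads=2_700,
                     holdout_booked=24, holdout_leads=300)
print(f"absolute lift: {lift:.3f}")
```

Hashing the ID rather than drawing a random number means the same lead always lands in the same group, which keeps the comparison clean if a lead is processed more than once.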

The scorecard outputs a sequence, not a winner

The last reframe is the most useful. A prioritization scorecard's job isn't to crown a single use case — it's to produce an ordered sequence where each project funds and de-risks the next. The parked rows aren't rejected; they're waiting on a gate to clear, and the work that clears a data gate for one use case often clears it for three. Unifying duplicated account data to unblock speed-to-lead can simultaneously unblock churn-risk scoring and renewal nudges. That shared dependency is the real reason to sequence in a particular order, and the ranking should make it visible instead of burying it.
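One way to make that shared dependency visible is to invert the list: instead of asking what blocks each use case, ask what each piece of enabling work unblocks. The mapping below reuses the example in the paragraph above; the summarizer's blocker and the function name are hypothetical.

```python
from collections import defaultdict

# Use case -> enabling work it is waiting on (illustrative).
blockers = {
    "speed-to-lead": {"unify duplicated account data"},
    "churn-risk scoring": {"unify duplicated account data"},
    "renewal nudges": {"unify duplicated account data"},
    "meeting summarizer": {"define an attributable metric"},
}


def unblock_plan(blockers):
    """Group use cases by the work that unblocks them, biggest payoff first."""
    unblocks = defaultdict(list)
    for use_case, needed in blockers.items():
        for work_item in needed:
            unblocks[work_item].append(use_case)
    return sorted(unblocks.items(), key=lambda item: len(item[1]), reverse=True)


for work_item, use_cases in unblock_plan(blockers):
    print(f"{work_item}: unblocks {len(use_cases)} -> {use_cases}")
```

Read top to bottom, the output is a first draft of the sequence: the work item at the head of the list is the one that clears a gate for the most rows at once.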

This is why our Data-to-Agent method starts at Agent Ready, before any agent exists: the sequence gets decided at the gates, not at the model. The honest part most strategy decks skip is that AI portfolios rarely fail because someone picked one bad use case. They fail because nobody scored the candidates against a structure that could tell a real result from a comfortable one, so the budget chased the most charismatic demo, and the demo couldn't be defended when the quarter closed. A scorecard built to resist that is worth more than whichever model you put underneath it. We build it, we run it, and we tie our fee to the number in that last column, which is why we're ruthless about the structure before anyone touches a model.
