Field note
How to Keep an AI Agent From Drifting in Production
AI agentsHow to Keep an AI Agent From Drifting in Production
An agent that passed every test at launch can quietly rot over the next two months while every dashboard stays green. Drift isn't one failure mode — it's four, and each one needs a different instrument.
Here's the failure that never makes it into a postmortem, because nobody can point to the moment it happened. You launch an agent. It passes every test. Week one, it's answering cases and qualifying leads at the quality you signed off on. Then, over the next two months, the answers get a little worse. Not broken — worse. Deflection slips a few points. Sales starts quietly routing around it. And every dashboard you have is still green, because you're measuring uptime and latency, not whether the agent is still right.
That's drift. It's the default state of any agent in production, not an edge case. An agent is a living system pointed at a world that changes, running on a model that changes, fed by data that changes. Stillness is the thing you have to engineer. Set it and forget it, and the only open question is how fast it decays — not whether.
Drift is four different problems wearing one word
The mistake is treating drift as a single phenomenon with a single fix. It isn't. There are at least four independent decays. They have different causes and different time constants, and a control that catches one will sail right past the others. Name them separately or you'll keep patching the wrong layer.
- Model drift. The model under you changes — a provider version bump, a silent weights update, a shifted default temperature. Same prompt, different behavior. You didn't ship anything; the floor moved.
- Data drift. The grounding data shifts. New product lines, renamed fields, a dead integration leaving stale rows, a dedupe job that merged the wrong records. The model is fine; the facts it stands on rotted.
- Prompt and tool drift. A hundred small edits accumulate. Someone adds a line to the system prompt to fix one complaint, someone changes a tool's return shape, and the composite behavior no longer matches anything that was tested.
- World drift. Reality moves on. New pricing, a new regulation, a competitor's promo your agent has never heard of. The agent answers confidently from a world that no longer exists.
Notice that only one of these — prompt and tool drift — is something you did. The other three happen to you while you sit still. That's why "we froze the agent so it's stable" is a misunderstanding. Freezing the agent freezes exactly one of the four sources. The model, the data, and the world keep moving underneath the frozen thing.
Why your dashboards say everything is fine
Drift is dangerous because the metrics most teams already have are blind to it. Latency, error rate, token cost, uptime — all of these can be perfect while answer quality falls off a cliff. A confidently wrong answer returns a 200 and renders in 800 milliseconds. Your APM loves it. Drift lives in the gap between 'the system responded' and 'the system was right,' and almost nobody instruments the second half.
Worse, the strongest early signal is usually invisible to software entirely: people stop trusting the agent and route around it. Reps quietly go back to doing it by hand. The deflection rate doesn't crater — it erodes, in a way that's easy to write off as seasonality or a bad month. By the time it's obvious in a business number, you've been drifting for weeks and you have no idea which of the four causes started it.
The fix is an eval set that lives, not a launch checklist
The single highest-leverage thing you can build is a standing evaluation set: a fixed, versioned battery of representative inputs with known-good expectations, that you run on a schedule and on every change — not once before launch. Launch evals tell you the agent was good on the day you shipped. A living eval set tells you whether it's still good today. That difference is the entire game.
Build it from real production traffic, not from your imagination. Pull the actual cases, the actual lead questions, the messy phrasing, the edge inputs that embarrassed you in week one. Pin the expected behavior. Then run that battery nightly, and run it again automatically before any prompt or tool change ships. A few principles separate an eval set that catches drift from one that gives you false comfort:
- Freeze the inputs, version the expectations. When reality changes, you update the expected answer deliberately and record why — you don't quietly let the test follow the agent. A test that drifts with the agent catches nothing.
- Pin the model version explicitly, so a provider bump shows up as a diff in your eval scores instead of a mystery in your business numbers next quarter.
- Grade on the dimensions that matter: factual correctness, whether it stayed in policy, whether it took the right action, and whether it refused when it should have. Not just 'did it sound good.'
- Set a baseline band, not a single pass/fail. Score the suite, record the launch baseline, and alert on a drop past a threshold you choose — a couple of points of noise is normal, a sustained slide is the signal. Without a band you either miss real decay or drown in false alarms.
- Keep a held-out slice you never tune against, so you're measuring real quality and not how well you've overfit to your own test.
On Agentforce specifically, lean on what the platform already gives you. Grounding on Data Cloud and permissions inherited from your sharing model take two of the four drift sources partly off the table by construction — the agent can't ground on data it isn't connected to, and it can't surface what the asking user can't see. That doesn't make evals optional; data still rots inside the sources you are connected to. It means your eval budget can concentrate on world drift and prompt drift instead of re-litigating data plumbing every week.
Catch it in three places, because one is never enough
A real anti-drift setup watches the agent at three distinct points, because each catches a class the others miss. Pre-deployment: the eval battery gates every change, so prompt and tool drift can't ship silently. Scheduled: the same battery runs nightly against the live agent and pinned model, so model and data drift show up as a score that moved even when you shipped nothing. Online: you sample real production traffic and grade it — with an LLM judge, human review, or both — because the live distribution always wanders somewhere your fixed set didn't anticipate.
The scheduled run is the one teams skip, and it's the one that catches the scary failures. It's the only instrument that fires when you didn't touch anything. Say the nightly score drops well past your baseline band with an unchanged prompt — that's the cleanest possible signal that the model or the data moved under you. It's the difference between finding out tonight and finding out from a customer in six weeks.
“Most production AI failures aren't a model getting something wrong. They're a system that was right at launch, slowly becoming wrong, while everyone watched a dashboard that was never measuring whether it was right.
What you do when the score moves
Detection without a response is just a more precise way to feel bad. The point of separating the four drift types is that each has a different remedy, and the eval signal tells you which lever to pull. Model drift: pin to the known-good version and re-qualify the new one deliberately, on its own schedule. Data drift: fix the grounding — reconnect the source, re-run identity resolution, kill the stale rows. Prompt and tool drift: roll back the change the gate should have caught, then tighten the gate. World drift: update the knowledge and the expected answers together, on purpose.
This is also where ownership gets decided. Someone has to be accountable for the green-to-yellow moment and have a defined move for each cause. If the answer to "who watches the nightly eval and acts on it" is nobody, you don't have an anti-drift system — you have a dashboard that will eventually display the bad news to whoever happens to be looking. Drift control is an operating responsibility, not a feature you install.
The part the launch demo skips
A launch is a snapshot of quality on one day. Production is the integral of quality over every day after, and that's the number that actually pays you back. An agent that was excellent at launch and drifted for two months unwatched can net out worse than a more modest agent someone kept on the rails. The discipline of catching drift is worth more than the cleverness of the original build.
This is why we don't treat launch as the finish line. In our Data-to-Agent method, Agent Care is a named, ongoing phase — the standing evals, the scheduled runs, the defined response to each kind of drift — and it's the reason SkySync ties its fee to the result rather than the launch. We build and run agents, and we stay accountable for the integral, not just the snapshot, because the snapshot is the easy part and the integral is where the return actually lives.
Talk to us about keeping a production agent honest over time