The Silent Drift
Every AI system in production drifts. The model does not change — the world around it does. Users phrase questions differently as a product matures. Documents get updated. A new policy gets added to a knowledge base. A downstream API changes its response shape. None of those are model problems, and all of them degrade output quality in ways you will not see unless you are looking.
The team that built the system usually knows this in theory. What they rarely have is a set of evals running continuously against production behavior, catching the drift the day it happens rather than the week a customer escalates. Evals are the boring infrastructure layer that separates a shipped demo from a system engineering can actually operate.
Three Kinds of Evals, Three Purposes
Treating "evals" as one thing leads to a dashboard nobody trusts. There are three distinct categories, each answering a different question.
Offline evals are the test suite. A fixed dataset of inputs with known good outputs, run against every model change, prompt change, or pipeline change. The purpose is regression prevention. Did this change make anything worse? If a legitimate input that worked yesterday fails today, offline evals catch it before shipping. These need to live in CI.
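As a minimal sketch, the CI gate can be a fixture file of cases run through the pipeline on every change, failing the build when the pass rate drops. `run_pipeline` and `grade` here are hypothetical stand-ins for your real generation and scoring steps, not any particular library's API:

```python
def run_offline_suite(cases, run_pipeline, grade, threshold=0.9):
    """Run every fixed case through the pipeline and fail the build
    if the pass rate drops below the agreed threshold."""
    passed = sum(grade(case, run_pipeline(case["input"])) for case in cases)
    rate = passed / len(cases)
    return rate >= threshold, rate
```

In CI this runs as a test: the boolean gates the merge, and the rate goes to a tracking dashboard so slow erosion is visible even while the gate still passes.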
Online evals are production quality monitoring. Live traffic sampled and scored, either by a judge model or by light human review. The purpose is drift detection. Is the system still performing the way it was when we last measured? These run continuously and alert when quality metrics cross a threshold.
User feedback evals are the ground truth loop. Thumbs-up/thumbs-down, structured feedback forms, or implicit signals like user revisions. The purpose is calibration. When your model and your users disagree, who is right? These run constantly in the background and are the slowest to give signal but the most honest one.
A production AI system needs all three. Teams that ship only offline evals get caught by drift. Teams that ship only online evals have no regression guard. Teams that skip user feedback entirely never calibrate their judges against the thing the judges are supposed to approximate.
Building the Offline Set
The first mistake teams make on offline evals: treating test cases like traditional unit tests. "Here are 20 examples, when they all pass we're good." LLM behavior is not deterministic enough to pass or fail on 20 cases in the way code does.
The offline set needs volume and coverage. A few hundred cases at minimum, drawn from real production patterns, categorized by intent, difficulty, and edge-case type. When you run the suite, you get a distribution of scores per category, not a pass/fail. A regression shows up as a drop in the distribution on a specific category — not a single case failing.
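A sketch of what "distribution per category, not pass/fail" looks like in practice, assuming each suite run emits `(category, score)` pairs (the 0.05 tolerance is illustrative, not a recommendation):

```python
from collections import defaultdict
from statistics import mean

def scores_by_category(results):
    """results: (category, score) pairs from one suite run.
    Returns the mean score per category rather than a global pass/fail."""
    buckets = defaultdict(list)
    for category, score in results:
        buckets[category].append(score)
    return {cat: mean(scores) for cat, scores in buckets.items()}

def regressed_categories(baseline, current, tolerance=0.05):
    """Categories whose mean dropped more than `tolerance` below baseline."""
    return [cat for cat, base in baseline.items()
            if base - current.get(cat, 0.0) > tolerance]
```

A single flaky case barely moves a category mean; a real regression moves the whole bucket, which is exactly the signal you want to page on.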
Building the set well is mostly sampling work, not writing. Pull real production queries. Bucket them by the patterns you care about. Sample proportionally so the set reflects the actual traffic mix. Write reference answers only for the categories where automated grading is unreliable. For categories where a judge model is reliable, skip the reference and grade with the judge.
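The sampling step above can be sketched as follows; `bucket_of` is an assumption, standing in for whatever heuristic or classifier you already use to assign a query to an intent or difficulty bucket:

```python
import random
from collections import defaultdict

def proportional_sample(queries, bucket_of, size=300, seed=0):
    """Draw an eval set that mirrors the live traffic mix: each bucket
    gets a share of the set proportional to its share of traffic."""
    rng = random.Random(seed)
    pools = defaultdict(list)
    for q in queries:
        pools[bucket_of(q)].append(q)
    sample = []
    for bucket, pool in pools.items():
        k = max(1, round(size * len(pool) / len(queries)))
        sample.extend(rng.sample(pool, min(k, len(pool))))
    return sample
```

The `max(1, ...)` floor guarantees even rare buckets appear at least once; the fixed seed keeps the set reproducible across runs.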
Expect the set to evolve. Every production incident should add a case to the suite. Every category the system handles poorly should get more examples until it is represented well enough to track.
Judge Models and Their Traps
A judge model — an LLM that scores another LLM's outputs — is the only way to run thousands of evals cheaply. It is also the part of the eval stack that silently lies if you do not check it.
The trap is measuring your system against an uncalibrated judge. The judge scores outputs 4 out of 5 because it was trained to prefer verbose, helpful-sounding answers. Your product team wants concise ones. The judge says everything is great. Your users are churning.
The fix is calibrating the judge against human review. Take 100 production outputs. Have a person score them on the rubric you actually care about. Run the judge against the same 100. Compute agreement. If the judge agrees with the human fewer than 80% of the time on a simple rubric, the judge rubric needs revision — usually more specific criteria and better few-shot examples.
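The agreement computation is simple; the work is in the human labeling. A minimal sketch, assuming human and judge labels are aligned over the same reviewed sample:

```python
def judge_agreement(human_labels, judge_labels):
    """Fraction of outputs where the judge lands on the same rubric
    label as the human reviewer."""
    assert len(human_labels) == len(judge_labels)
    hits = sum(h == j for h, j in zip(human_labels, judge_labels))
    return hits / len(human_labels)
```

If the rubric has more than two labels, it is also worth looking at *which* labels disagree: a judge that is systematically one notch generous needs a different fix than one that is random.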
We re-calibrate judges quarterly. Judge drift is a real thing, especially when the underlying judge model is upgraded. A judge that was calibrated against GPT-4 may behave differently once it is running on GPT-4.1. Assume it does until you have re-measured.
Online Evals and the Sampling Problem
You cannot eval every production request. The cost would match the cost of serving the traffic in the first place. Sampling is required, and how you sample matters.
Uniform random sampling gives you an unbiased estimate of overall quality. It misses low-frequency failure modes. If 2% of your traffic is a specific high-stakes query type, uniform sampling undercovers it.
Stratified sampling fixes that. Categorize each request, sample from each category proportionally to its business importance rather than its frequency. High-stakes but rare queries get sampled heavily. Low-stakes but common queries get sampled lightly. The total sampling cost stays manageable; the coverage improves dramatically.
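One way to sketch this: allocate a fixed daily scoring budget across categories by business-importance weight, then convert each allocation into a per-request sampling probability. The importance weights are an assumption here, hand-assigned by whoever owns the product risk:

```python
def sampling_probabilities(traffic, importance, budget):
    """traffic: requests/day per category.
    importance: hand-assigned business weight per category.
    budget: total requests/day you can afford to score.
    Returns a per-request sampling probability per category, capped at 1.0."""
    total_weight = sum(importance.values())
    return {cat: min(1.0, budget * importance[cat] / total_weight / traffic[cat])
            for cat in traffic}
```

With this scheme, a rare high-stakes category can end up sampled at 100% while a high-volume low-stakes one is sampled at a fraction of a percent, from the same budget.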
Alert on category-level drops, not aggregate. The moment a category-specific quality score drops below its baseline, page someone. Aggregate metrics hide the failure mode where one important category collapsed while the rest stayed flat.
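Sketched as code, the alerting rule keeps a per-category baseline and allowed drop, and never consults the aggregate; the threshold values and minimum sample count are illustrative:

```python
def paging_alerts(baselines, today, min_samples=30):
    """baselines: {category: (baseline_score, allowed_drop)}.
    today: {category: (mean_score, sample_count)}.
    Returns the categories that crossed their own threshold, skipping
    any category with too few samples to trust the mean."""
    alerts = []
    for cat, (base, allowed_drop) in baselines.items():
        score, n = today.get(cat, (None, 0))
        if score is not None and n >= min_samples and score < base - allowed_drop:
            alerts.append(cat)
    return alerts
```

The `min_samples` guard matters for heavily-undersampled categories: a mean over five requests swings wildly and would page someone for noise.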
The Feedback Loop
The most valuable signal is the one users give you for free. A thumbs-down on a response tells you more than a judge model can. A user editing the AI's draft before sending tells you exactly where the AI fell short.
Wire those signals into a review queue. Anything with a thumbs-down gets scored by a judge within the hour. Anything with a human edit gets compared to the original to surface the specific failure. Ship a dashboard that shows this stream in real time. Engineering should be reading bad responses every day, not every quarter.
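The edit-comparison step can be as simple as a word-level diff of the AI draft against what the user actually sent, which is a crude but cheap proxy for "where the AI fell short":

```python
import difflib

def edit_summary(ai_draft, user_final):
    """Surface what the user changed in the AI's draft: the spans a
    reviewer should look at first, as (operation, before, after) tuples."""
    sm = difflib.SequenceMatcher(a=ai_draft.split(), b=user_final.split())
    return [(op, " ".join(sm.a[i1:i2]), " ".join(sm.b[j1:j2]))
            for op, i1, i2, j1, j2 in sm.get_opcodes() if op != "equal"]
```

Aggregating these tuples across a week of edits is often enough to spot a recurring failure, say, users consistently deleting a boilerplate opening sentence.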
The feedback loop is also how you update the offline eval set. A thumbs-down on a category you were not already testing is a signal to add that category. A recurring pattern in user edits is a signal to add a test case for it.
When to Build This
Evals are a fixed investment that pays off as your system scales and matures. If you are pre-launch with three users, formal evals are overhead. If you have hundreds of users or are shipping something with any material business risk, the investment is overdue.
The minimum viable eval stack we ship with every production system: a 200-case offline set in CI, 5% uniform online sampling with judge scoring, a thumbs-up/thumbs-down button on every response, and a weekly review of the feedback queue. That is enough to catch the serious drift. It is not enough for anything regulated or mission-critical — those need stratified sampling, calibrated judges, and a human review pipeline — but it is a floor.
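That floor can be pinned down as a config sketch; the keys and shape are illustrative, not any library's schema:

```python
# Minimum viable eval stack, as described above. Values are the floor,
# not a recommendation for regulated or mission-critical systems.
MIN_EVAL_STACK = {
    "offline": {"cases": 200, "gate": "ci"},
    "online": {"sampling": "uniform", "rate": 0.05, "scorer": "judge_model"},
    "feedback": {"signal": "thumbs_up_down", "review_cadence_days": 7},
}
```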
If you have an AI system in production without an eval layer and the bill or the complaints are climbing, a 15-minute audit usually surfaces the cheapest three fixes. Evals are the unglamorous work that buys you the ability to keep shipping without breaking what you already shipped.