Evals are the part of agent work everyone agrees they should do and almost nobody actually does well. Here's what works, what doesn't, and how to start without spending three weeks building infrastructure first.
Why Agent Evals Are Different
To understand why agent evals are hard, it helps to compare them to what came before.
Regular ML eval is conceptually simple. You have a test set with labeled examples. You run your model. You measure accuracy. Done. The challenge is in execution, not in framing the problem.
LLM eval is harder. You have prompts, you check outputs, and there's often no single right answer. The standard approach now is using another LLM as a judge, which works reasonably well for many tasks but has its own biases and limitations.
Agent eval is different again. You have multi-step runs with tool calls, branching paths, and no single right answer. The output you're evaluating isn't just text. It's a whole trajectory: the agent decided to call tool A, then tool B, then synthesized the results, then took action. Was that the right sequence of decisions? Did the intermediate steps make sense? Was the final output good? These are three different questions that require three different approaches to answer.
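To make "trajectory" concrete, here is a minimal sketch of what a recorded agent run might look like. The field names are illustrative, not taken from any particular framework.

```python
from dataclasses import dataclass, field

@dataclass
class ToolCall:
    tool: str          # which tool the agent invoked
    arguments: dict    # the arguments it passed
    result: str        # what came back (or an error message)
    error: bool = False

@dataclass
class AgentTrace:
    task: str                                        # the original request
    steps: list[ToolCall] = field(default_factory=list)
    final_output: str = ""
    total_tokens: int = 0
    latency_seconds: float = 0.0

# Evaluating one run means asking three separate questions:
#   1. Is final_output good?                         (final output quality)
#   2. Do steps form a sensible path?                (trajectory quality)
#   3. What did total_tokens and latency cost us?    (operational quality)
```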
This is why most teams give up on agent evals or half-heartedly check only the final output. Standard frameworks weren't built for this. Understanding what you're actually trying to measure is the first step to building something useful. See What is AgentOps for where evals fit into the broader operational picture.
The Three Things You Can Actually Measure
Keep this simple. There are three buckets. Everything else is a variation on one of these.
Final output quality: did the agent produce a good answer
This is the most obvious thing to evaluate and the one most teams start with. Did the agent answer the question correctly? Was the response helpful? Is the output factually accurate? This is the thing users care about most directly.
Trajectory quality: did the agent take a reasonable path to get there
This is the one most teams skip. But here's the problem: an agent that arrives at the right answer through 47 tool calls is broken, even if the answer is correct. An agent that hallucinated two tool results and got lucky is a ticking time bomb, even if the current output looks fine.
Trajectory evaluation asks: did the agent call the right tools? Did it use the right arguments? Did it handle intermediate failures appropriately? Was the number of steps reasonable for the task?
You can't evaluate trajectory quality by only looking at final outputs. You need the full trace. And you won't have traces unless you've built observability first, which is why observability and evaluation are so tightly linked in practice.
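Once you have traces, a few trajectory checks can be written as plain functions over the recorded steps. This is a sketch, assuming a trace shaped like the example above; the thresholds and tool names are placeholders you would tune per task.

```python
def check_step_count(trace, max_steps: int = 10) -> bool:
    """A run that needs far more steps than the task warrants is suspect."""
    return len(trace.steps) <= max_steps

def check_expected_tools(trace, expected: set[str]) -> bool:
    """For a known task type, verify the agent touched the tools it should have."""
    called = {step.tool for step in trace.steps}
    return expected.issubset(called)

def check_only_allowed_tools(trace, allowed: set[str]) -> bool:
    """Catch calls to tools that shouldn't be used for this task."""
    return all(step.tool in allowed for step in trace.steps)

def check_failures_handled(trace) -> bool:
    """If a tool call errored, the run shouldn't have ended on that error."""
    if not trace.steps:
        return True
    return not trace.steps[-1].error
```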
Operational quality: cost, speed, and reliability
How much did this run cost? How long did it take? Did it hit any errors along the way? These are the easiest to measure because they're numbers, not judgments. They're also critical for catching the cost bomb failure mode and for understanding whether your agent is actually usable at scale.
Most teams measure final output quality. Some measure operational quality. Almost no teams measure trajectory quality in a systematic way. That's the gap. See The 12 Ways Production Agents Fail to understand what trajectory problems actually look like.
How to Actually Run Evals
Three approaches, ordered by how much work they require.
Approach 1: Manual review
Look at 20 traces a week. Mark each one good or bad. Write a sentence explaining why. Group the bad ones by pattern.
This sounds primitive. It works better than people expect, especially in the first few months of any agent project. The reason is that reading actual traces teaches you things no automated eval will surface. You see the weird edge cases. You notice the patterns that don't fit any category you defined in advance. You find out what "good" actually means for your specific use case before you've tried to codify it.
Manual review is also the prerequisite for everything else. If you don't understand your failures, you can't build automated checks that catch them.
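Manual review doesn't need tooling beyond a loop and a file. A minimal sketch, assuming traces are stored as JSON files in a directory; the paths and field names are placeholders.

```python
import csv
import json
from pathlib import Path

TRACE_DIR = Path("traces")          # wherever you store raw traces
OUT_FILE = Path("review_log.csv")

with OUT_FILE.open("a", newline="") as f:
    writer = csv.writer(f)
    for path in sorted(TRACE_DIR.glob("*.json"))[:20]:   # this week's 20 traces
        trace = json.loads(path.read_text())
        print("\n=== Task:", trace.get("task", ""))
        print("Output:", trace.get("final_output", ""))
        verdict = input("good or bad? ").strip().lower()
        note = input("one sentence why: ").strip()
        writer.writerow([path.name, verdict, note])
```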
Approach 2: LLM as judge
Use a model to score outputs. Send the task, the agent's answer, and a rubric to a separate LLM and ask it to evaluate quality. This scales better than manual review and can cover more ground.
It works reasonably well for factual correctness questions ("is this answer accurate given this context?") and for coverage questions ("did the agent address all parts of the request?"). It falls apart on subjective quality assessments where the rubric is vague, and on cases where the judge model has the same systematic biases as the agent being evaluated.
Two known limitations worth keeping in mind: LLM judges are biased toward longer, more verbose answers even when concise answers are better. And they tend to agree with themselves, so if the judge is the same model that generated the answer, the scores will be inflated.
Use LLM-as-judge knowing these limits. It's a useful signal, not a ground truth.
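In practice, a judge is just a rubric in a prompt plus a model call. The sketch below leaves the actual API call behind a placeholder `call_judge_model` function, since the client code depends on which provider you use; the rubric wording is illustrative, not prescriptive.

```python
import json

JUDGE_PROMPT = """You are grading an AI agent's answer.

Task given to the agent:
{task}

Agent's final answer:
{answer}

Rubric:
1. Is the answer factually consistent with the task and any provided context?
2. Does it address every part of the request?
3. Score 1-5 and explain briefly.

Respond as JSON: {{"score": <1-5>, "reason": "<one sentence>"}}"""


def call_judge_model(prompt: str) -> str:
    """Placeholder: send the prompt to a judge model (ideally a different
    model family than the one that generated the answer) and return its text."""
    raise NotImplementedError


def judge(task: str, answer: str) -> dict:
    raw = call_judge_model(JUDGE_PROMPT.format(task=task, answer=answer))
    return json.loads(raw)   # in practice, guard against malformed JSON
```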
Approach 3: Programmatic checks
Did the agent call the right tool? Did the output match the expected schema? Did it stay under the cost budget? Did the run complete without errors?
These are boring but reliable. They don't require another LLM call. They don't have the biases of human or model judgment. They run fast and cheap on every single trace.
Programmatic checks should be your floor. Run them on every run in production, not just on samples. They catch failures 2, 4, 5, and 11 from the failure modes list automatically and at scale.
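These checks are the kind of thing you can write in an afternoon. A sketch, assuming each run is summarized as a dict with the fields shown; the field names, schema, and budget are illustrative.

```python
REQUIRED_OUTPUT_FIELDS = {"summary", "sources"}   # whatever your output schema demands
COST_BUDGET_USD = 0.75                            # per-run ceiling, tune to your agent

def check_schema(output: dict) -> bool:
    """Final output contains every required field."""
    return REQUIRED_OUTPUT_FIELDS.issubset(output.keys())

def check_cost(run: dict) -> bool:
    """Run stayed under the cost budget."""
    return run["cost_usd"] <= COST_BUDGET_USD

def check_completed(run: dict) -> bool:
    """Run finished without an unhandled error."""
    return run["status"] == "completed"

def run_checks(run: dict) -> dict[str, bool]:
    return {
        "schema": check_schema(run["output"]),
        "cost": check_cost(run),
        "completed": check_completed(run),
    }
```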
What to Evaluate
Don't try to eval everything. Pick three things and track them over time.
One quality metric. One efficiency metric. One reliability metric.
The quality metric is usually something like "did the agent successfully complete the task" or "was the final answer factually correct." Pick the thing that most directly represents what success means for your use case.
The efficiency metric is usually token cost per run or step count. You're tracking that your agent isn't getting slower or more expensive over time as you make changes.
The reliability metric is usually error rate or completion rate. What percentage of runs complete without hitting an error or abort condition?
Track these three numbers over time. Graph them. Look at the graph every week. That's your eval program for the first six months. It's not glamorous, but teams that do this consistently find problems before users do.
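Tracking these doesn't require a metrics platform to start. Here is a sketch of the weekly rollup, assuming runs are logged as dicts with the fields shown (a 0/1 task success flag, cost in dollars, and a status); append each week's row to a CSV and graph it however you like.

```python
from statistics import mean

def weekly_rollup(runs: list[dict]) -> dict:
    """One row per week: quality, efficiency, reliability."""
    completed = [r for r in runs if r["status"] == "completed"]
    return {
        "quality": mean(r["task_success"] for r in completed) if completed else 0.0,
        "efficiency": mean(r["cost_usd"] for r in runs) if runs else 0.0,
        "reliability": len(completed) / len(runs) if runs else 0.0,
    }
```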
Building an Eval Set
The hard part isn't running evals. It's having a good set of examples to run them against.
Where to get examples:
- Production traces, anonymized. Real user requests are more representative than anything you'll invent.
- Edge cases you've actually encountered. When a production failure happens, add a version of it to your eval set.
- Support tickets and user complaints. Your support inbox is an eval set waiting to be built.
Skip synthetic data for now. Synthetically generated examples look clean and often miss the specific jagged edges of real usage patterns. They're useful for filling gaps later, not for building your initial set.
Start with 50 examples. Not 500, not 5,000. Fifty. You will iterate on these examples as you learn more about your failures. Starting with too large a set makes iteration expensive and slows down the feedback loop.
Once you have 50 examples and you're running evals against them regularly, then consider expanding. But only expand when the 50 examples no longer represent the variety of failures you're seeing in production.
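The storage format matters less than having one. A sketch of a single eval case written as a JSON-lines record; the fields are illustrative, not a required schema.

```python
import json

# One eval case: the input, where it came from, and what "good" means for it.
case = {
    "id": "refund-policy-017",
    "source": "production",       # production | edge_case | support_ticket
    "input": "A customer asks whether they can return an opened item after 30 days.",
    "expected_behavior": "Cites the 30-day policy, checks the opened-item exception, "
                         "does not promise a refund.",
    "tags": ["policy", "refunds"],
}

with open("eval_set.jsonl", "a") as f:
    f.write(json.dumps(case) + "\n")
```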
The CI Question
Should evals run on every commit? Probably not at first.
Running 50 agent traces costs real money. If each trace costs $0.50 in API calls, 50 traces is $25 per eval run. Running that on every pull request adds up fast and slows down development when you're still moving quickly.
A more practical schedule for most teams:
- On prompt changes, always. Prompts are the most common source of quality regressions.
- On model upgrades, always. This is the main defense against the silent quality drop.
- On tool schema changes, always.
- Nightly on main.
- Full eval suite on every PR is aspirational. Get there eventually, but don't block shipping on it in month one.
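When you do wire evals into CI, the integration can be a single script that exits nonzero when a tracked metric regresses. A minimal sketch, assuming a `run_eval_suite()` function that runs the eval set and returns the three metrics; that function and the thresholds are placeholders you'd set from your own baseline.

```python
import sys

THRESHOLDS = {"quality": 0.85, "reliability": 0.95}  # minimum acceptable values
COST_CEILING = 0.75                                   # max average cost per run, USD

def run_eval_suite() -> dict:
    """Placeholder: run the agent over the eval set and return
    {"quality": ..., "efficiency": ..., "reliability": ...}."""
    raise NotImplementedError

def main() -> int:
    metrics = run_eval_suite()
    failures = []
    for name, floor in THRESHOLDS.items():
        if metrics[name] < floor:
            failures.append(f"{name} {metrics[name]:.2f} is below {floor}")
    if metrics["efficiency"] > COST_CEILING:
        failures.append(f"avg cost {metrics['efficiency']:.2f} is above {COST_CEILING}")
    for msg in failures:
        print("EVAL REGRESSION:", msg)
    return 1 if failures else 0

if __name__ == "__main__":
    sys.exit(main())
```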
Common Eval Mistakes
These are the ones that waste the most time.
Grading on the wrong axis. Testing whether the agent sounds confident when you should be testing whether it's correct. Or testing formatting when you should be testing whether the task was completed. Be precise about what "good" means before you start measuring it.
Using the same model to judge as to generate. If you used Claude to generate and Claude to judge, you're measuring Claude's self-consistency, not actual quality. Use a different model, or use programmatic checks where possible.
Ignoring trajectory and only checking the final output. The output might look fine while the path that produced it was expensive, fragile, or wrong. Checking only the output misses this entirely.
Building elaborate infrastructure before you have 10 production users. If you don't have real traffic yet, your eval set is speculative. Build the minimum that lets you start learning from real data, then invest more as you have more data to invest against.
Trusting one number. A single "eval score" that averages everything hides where things are going wrong. Track the three separate metrics. A rise in cost with stable quality is a different problem than a drop in quality with stable cost. One number can't tell you which problem you have.
A Simple Workflow to Start
Here's what to actually do this week, in order.
Save the next 50 production traces somewhere you can read them. If you don't have production traffic yet, generate 50 realistic requests and run your agent against them. Store every trace.
Read all 50. Mark each one good or bad. Write one sentence for each bad one explaining why it's bad.
Group the bad ones into three or four patterns. There will be patterns. There always are.
Pick the most common pattern. Build one programmatic check that detects it. Add that check to your standard eval run.
Repeat next month with 50 more fresh traces. Check whether the pattern you fixed is actually gone. Look for new patterns.
That's the whole workflow. It's not sophisticated. It works. The teams with the best agents aren't the ones with the most elaborate eval frameworks. They're the ones who look at their traces every week and fix what they find.
Closing Thought
The temptation is to build the eval system first and then run the agent. Get the infrastructure right, nail down the metrics, build the dashboards, and then ship. That's backwards.
Ship the agent. Run it on real inputs. Read the traces. Then build the eval system around what you're actually seeing. The infrastructure should be shaped by the real failures you've found, not by a theoretical list of things that might go wrong.
Tools help. They don't replace looking. The teams that succeed at agent evals are the ones who made looking at traces a weekly habit before they ever built a formal eval system.