Most engineering teams know how to monitor software. You track latency, error rates, throughput. You set up alerts for the metrics that matter and you get paged when something breaks. It works well for APIs, services, databases.
AI agents don't fit this model. The usual signals are necessary but not sufficient. An agent can run without errors, return a 200, complete in two seconds, and still have completely failed the user. Standard monitoring tells you whether the agent ran. It doesn't tell you whether it did anything useful.
Here's what actually works for monitoring agents in production.
Why Agent Monitoring Is Different
With a regular API endpoint, success is binary. The request either worked or it didn't. You can measure that precisely and alert on deviations.
With an agent, there are multiple levels at which things can go wrong, and most of them don't show up as errors.
An agent that calls 30 tools to answer a simple question is degraded, even if it eventually answers correctly. An agent whose answers have been slowly getting worse for two weeks has a serious problem, even if every individual request returned a 200. An agent that spent $12 answering a single user query has a cost problem, even if the query resolved successfully.
None of these show up in standard application monitoring. You need an additional layer that understands what's happening inside the agent's runs, not just that runs are happening. Understanding what AgentOps covers helps clarify why this layer exists.
The Three Signals That Actually Matter
There are many things you could monitor. These three are the ones that tell you something actionable.
Traces: the full picture of what each run did
A trace is a complete record of a single agent run. Every tool call, every model call, every decision point, every intermediate result, and the final output. Traces are the foundational data for everything else. Without them, you're flying blind.
When something goes wrong, traces let you answer: what exactly did the agent do? At what step did it go off track? What data did it have when it made the wrong decision? These questions are unanswerable without per-run traces.
Traces are also how you catch the failure modes that don't generate errors. The agent that made 40 tool calls when five would have sufficed. The agent that retrieved stale data. The agent that picked the wrong tool. All of these show up in traces long before they show up in user complaints.
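Concretely, a per-run trace doesn't need to be fancy. Here's a minimal sketch of what one might look like; the field names are illustrative rather than tied to any particular tracing library:

```python
# A minimal, illustrative shape for per-run traces. Field names are
# hypothetical -- adapt them to whatever your agent framework emits.
from dataclasses import dataclass, field
from typing import Any

@dataclass
class ToolCall:
    tool_name: str              # which tool the agent invoked
    arguments: dict[str, Any]   # what it passed in
    result: Any                 # what came back
    duration_ms: float          # how long the call took

@dataclass
class Trace:
    run_id: str                 # unique identifier for this agent run
    user_input: str             # what the agent was asked
    tool_calls: list[ToolCall] = field(default_factory=list)
    model_calls: int = 0        # number of model calls in the run
    final_output: str = ""      # what the agent returned to the user
    total_cost_usd: float = 0.0 # summed cost of all model calls
```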
Quality scores: whether the outputs are any good
This is the signal that's hardest to collect but most important to have. You need some way to know whether the agent is producing good outputs, not just that it's producing outputs.
At minimum, this means sampling a percentage of runs and checking quality manually or with automated scoring. Even sampling 5% of production runs and reviewing them weekly gives you a leading indicator of quality changes before they accumulate into a real problem.
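If you're starting from zero, that sampling can be a few lines of code. Here's a sketch, assuming you already store traces somewhere you can query; `load_recent_traces` and the review queue are hypothetical stand-ins for your own storage and tooling:

```python
# Illustrative weekly quality-sampling pass. `load_recent_traces` is a
# stand-in for however you fetch the last week's traces from storage.
import random

def sample_for_review(traces: list, rate: float = 0.05, cap: int = 50) -> list:
    """Pick a random ~5% of runs (capped) for manual quality review."""
    if not traces:
        return []
    k = min(cap, max(1, int(len(traces) * rate)))
    return random.sample(traces, k)

# Weekly job (hypothetical helpers):
# traces = load_recent_traces(days=7)
# for trace in sample_for_review(traces):
#     review_queue.add(trace.run_id)
```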
The silent quality drop is one of the most common and least-detected agent problems. Model providers update their underlying models continuously. An agent that was great three months ago may be subtly worse today and you won't know unless you're checking. See the full failure mode list in The 12 Ways Production Agents Fail.
Cost per run: how much each answer costs
Track cost at the run level, not just in monthly aggregate. You want to know whether a specific type of query is more expensive than others. You want to catch cost regressions: prompt changes, new tools, or model upgrades that make runs more expensive.
Cost per run also gives you an early warning for the cost bomb failure mode. If your average cost per run suddenly spikes from $0.05 to $0.80, something has changed. You want to know that within hours, not at the end of the billing cycle.
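Once each run record carries its cost, the per-query-type breakdown mentioned above is a small aggregation away. A sketch, assuming each run record has `task_type` and `cost_usd` fields (those names are assumptions about your schema):

```python
# Illustrative per-task-type cost rollup. Assumes each run record carries
# "task_type" and "cost_usd" fields; adapt to your own trace schema.
from collections import defaultdict

def cost_by_task_type(runs: list[dict]) -> dict[str, float]:
    """Average cost per run, broken out by task type."""
    totals, counts = defaultdict(float), defaultdict(int)
    for r in runs:
        totals[r["task_type"]] += r["cost_usd"]
        counts[r["task_type"]] += 1
    return {t: totals[t] / counts[t] for t in totals}
```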
How to Set Up Basic Tracing
Getting traces doesn't require a complex setup. Here's the practical path.
Step 1: Log every tool call. At minimum, every time your agent calls a tool, record: what tool was called, what arguments were passed, what came back, and how long it took. This is the skeleton of a trace.
Step 2: Attach a run ID. Every agent run should have a unique identifier that ties all its tool calls together. Without this, you have a pile of individual call records with no way to reconstruct what any given run actually did.
Step 3: Record the final output. Link the final answer back to the run ID and all the tool calls that produced it. Now you have a complete trace: you know what the agent was asked, what it did to answer it, and what it returned.
Step 4: Add cost tracking. For each model call in the run, record the input tokens, output tokens, and model used. Sum these up at the run level to get cost per run.
That's the minimum viable tracing setup. It doesn't require a third-party tool, just a structured logging approach. Once you have this in place, you can point an observability tool at your logs or events and start getting visual traces.
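Here's a rough sketch of what steps 1 through 4 can look like in Python. The class name, pricing table, and `print`-based logging are placeholders, not recommendations:

```python
# Minimal tracing sketch covering steps 1-4. Prices per million tokens are
# placeholders; substitute your provider's actual rates and your real logger.
import json, time, uuid

PRICE_PER_MTOK = {"example-model": {"input": 3.00, "output": 15.00}}  # placeholder

class RunTrace:
    def __init__(self, user_input: str):
        self.run_id = str(uuid.uuid4())   # step 2: one ID ties the run together
        self.user_input = user_input
        self.events = []
        self.cost_usd = 0.0

    def log_tool_call(self, tool, args, result, duration_ms):
        # step 1: record every tool call with args, result, and timing
        self.events.append({"type": "tool_call", "tool": tool, "args": args,
                            "result": result, "duration_ms": duration_ms})

    def log_model_call(self, model, input_tokens, output_tokens):
        # step 4: accumulate cost from token counts
        price = PRICE_PER_MTOK[model]
        self.cost_usd += (input_tokens * price["input"]
                          + output_tokens * price["output"]) / 1_000_000
        self.events.append({"type": "model_call", "model": model,
                            "input_tokens": input_tokens,
                            "output_tokens": output_tokens})

    def finish(self, final_output: str):
        # step 3: link the final answer back to the run and emit the trace
        record = {"run_id": self.run_id, "input": self.user_input,
                  "events": self.events, "output": final_output,
                  "cost_usd": round(self.cost_usd, 4), "ts": time.time()}
        print(json.dumps(record, default=str))  # or write to your log pipeline
        return record
```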
What to Look at in Your Traces
Having traces is the beginning. Knowing what to look for is the part that takes practice.
When you're reviewing traces, focus on a few questions:
Did the agent take a reasonable path? For a given task type, you have an intuition for roughly how many steps a good answer should require. If you're seeing runs with two or three times that many steps, something is inefficient. Start there.
Where do runs tend to fail or abort? If runs are consistently failing at step 7, that tells you something about step 7. It might be a specific tool that's flaky. It might be a decision point where the model consistently makes the wrong choice. The pattern tells you where to look.
Are there tool calls with consistently poor results? Some tools return high-quality, useful data most of the time. Others return garbage half the time and the agent keeps trying anyway. Look for tools where the downstream reasoning is often poor. That's usually a tool quality problem, not a model problem.
Are there patterns in the bad runs? Group your traces by outcome (completed successfully, completed with low-quality output, aborted, errored) and look for patterns in the bad groups. Bad runs are rarely random: specific input characteristics usually predict them, and finding those characteristics is the first step to fixing them.
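Some of this review can be automated once you have a feel for what normal looks like. Here's a sketch of two helpers, assuming each stored trace is a dict with `run_id`, `task_type`, `outcome`, and `events` fields (those names are assumptions about your schema):

```python
# Illustrative trace-review helpers: group runs by outcome, and flag runs whose
# step count is far above the norm for their task type.
from collections import defaultdict
from statistics import median

def group_by_outcome(traces: list[dict]) -> dict[str, list[dict]]:
    groups = defaultdict(list)
    for t in traces:
        groups[t["outcome"]].append(t)  # e.g. "ok", "low_quality", "aborted", "error"
    return groups

def flag_long_runs(traces: list[dict], factor: float = 2.5) -> list[str]:
    """Flag runs with more than `factor` x the median step count for their task type."""
    steps_by_type = defaultdict(list)
    for t in traces:
        steps_by_type[t["task_type"]].append(len(t["events"]))
    flagged = []
    for t in traces:
        typical = median(steps_by_type[t["task_type"]])
        if typical and len(t["events"]) > factor * typical:
            flagged.append(t["run_id"])
    return flagged
```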
When to Alert vs. When to Review
Not everything needs to page someone. Here's a rough framework.
Alert immediately when:
- Error rate spikes sharply (something broke)
- Cost per run spikes sharply (potential runaway agent or cost bomb)
- Completion rate drops below your baseline (agent is failing to finish tasks)
- A specific tool's failure rate exceeds a threshold (dependency is down)
Review weekly when:
- Quality scores trend down over multiple weeks
- Average step count increases (agent getting less efficient)
- Cost per run slowly increases (gradual inefficiency creeping in)
- New failure patterns appear in trace sampling
The distinction matters. Immediate alerts should be rare and signal something actually broken. Weekly reviews catch the slow degradation that doesn't generate a single dramatic event but compounds into a real problem over months.
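The immediate-alert half of this framework doesn't need much machinery. Here's a sketch of the checks, with placeholder thresholds and field names you'd tune against your own baselines:

```python
# Illustrative immediate-alert checks over a recent window of run records.
# Thresholds and field names are placeholders -- tune against your baselines.
def immediate_alerts(window: list[dict], baseline: dict) -> list[str]:
    """Return the conditions that should page someone right now."""
    if not window:
        return []
    n = len(window)
    error_rate = sum(r["errored"] for r in window) / n
    completion_rate = sum(r["completed"] for r in window) / n
    avg_cost = sum(r["cost_usd"] for r in window) / n

    alerts = []
    if error_rate > 2 * baseline["error_rate"]:
        alerts.append("error rate spiked")
    if avg_cost > 5 * baseline["cost_per_run"]:
        alerts.append("cost per run spiked")
    if completion_rate < 0.9 * baseline["completion_rate"]:
        alerts.append("completion rate dropped below baseline")
    return alerts
```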
Common Monitoring Mistakes
A few patterns that come up often and waste a lot of time.
Monitoring only for errors and latency. These metrics are necessary but not sufficient. An agent can have excellent latency and zero errors while consistently giving wrong answers. Add quality and cost signals, or you're only monitoring half the problem.
Setting up dashboards before you've read any traces manually. Dashboards are useful when you know what to put on them. If you haven't read your traces manually, you don't know what metrics actually matter for your specific agent. Build the habit of reading traces first. Let that inform what you put in your dashboards.
Treating all runs the same. Different task types have different expected patterns. A research task that should take 20 steps looks like a broken agent if you compare it to a lookup task that should take two steps. Segment your monitoring by task type rather than averaging across all runs.
Waiting for production traffic to start monitoring. The best time to wire up tracing is while you're still building. The cost of adding tracing later is high: you'll have a period of production traffic with no data, and the first time something goes wrong you'll have nothing to debug with. Add it early, even if you don't look at it much until you have real traffic.
A Minimal Monitoring Setup
Here's what a practical starting point looks like for a team early in production.
Tracing: Log every tool call with run ID, tool name, arguments, result, duration, and cost. Aggregate by run. Store traces for at least 30 days.
Cost alert: Alert when any single run exceeds a cost threshold (set this to 5-10x your expected average run cost). Alert when daily spend exceeds a budget threshold.
Error alert: Alert when error rate or abort rate rises above your baseline. This is your standard uptime monitoring applied to the agent's completion rate.
Weekly review: Sample 20 traces. Mark each one good or bad. Look for patterns in the bad ones. This is your quality monitoring until you have the volume and time to build something more automated.
That's it. Four things. None of them require a specialized tool to start, though tools make it easier as you scale.
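If it helps to see it in one place, here's the same setup as a rough configuration sketch; every number is a placeholder to tune against your own traffic:

```python
# Illustrative starting configuration for the minimal setup above.
# Every value is a placeholder, not a recommendation.
MONITORING_CONFIG = {
    "trace_retention_days": 30,       # keep full traces for at least a month
    "per_run_cost_alert_usd": 0.50,   # roughly 5-10x an assumed average run cost
    "daily_spend_alert_usd": 200.0,   # daily budget ceiling
    "error_rate_alert": 0.05,         # page if error/abort rate exceeds 5%
    "weekly_review_sample_size": 20,  # traces to read by hand each week
}
```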
Start simple. Add sophistication when you have enough production data and enough team bandwidth that additional complexity pays off. The teams that over-invest in monitoring infrastructure before they have real traffic often end up with dashboards full of metrics that don't help them answer the questions they actually have.
Where This Leads
Monitoring is the operational foundation. Traces give you the data. Alerts tell you when something acute is wrong. Weekly review catches the slow drift.
Once you have this working, the next step is building a real evaluation system. Monitoring tells you that things are going wrong. Evaluation tells you why and gives you a way to measure whether your fixes actually work. Those two things together are what AgentOps is really about: the loop of observing what your agent does, evaluating whether it did it well, and improving what you find. See A Practical Guide to Evaluating Agents for the next step.