agent-failures debugging production reliability ai-agents

The 12 Ways Production Agents Fail

What actually goes wrong when you ship an AI agent to real users

AgentOps Team
17 min read

Every agent works perfectly in the demo. Then it ships. Here's what actually goes wrong, based on what we've seen building agents and watching other teams build them.

Twelve failure modes. For each one: what it looks like, why it happens, and how to catch it before users do.

1. The Infinite Loop

What it looks like: The agent calls the same tool over and over. Minutes pass. Token costs mount. Nothing resolves. Eventually it times out or you hit a rate limit, and the run ends with no useful output.

Why it happens: The tool returns results that are ambiguous or incomplete. The agent interprets this as "I didn't get what I needed" and tries again with the same approach. Since the tool keeps returning the same kind of result, the agent keeps trying. This is especially common with search tools that return partial results and web scraping tools that hit paywalls or login pages.

How to catch it: Max iteration limits. Every agent framework has this or should have it. Set a hard cap on the number of steps per run. Also add loop detection: if the agent has called the same tool with the same or very similar arguments in the last N steps, something is wrong and it should stop and report back rather than continue.
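
Here's a minimal sketch of both guards, framework-agnostic; the `ToolCall` shape, the step cap, and the repeat window are illustrative values, not taken from any particular framework:

```python
from dataclasses import dataclass

MAX_STEPS = 15        # hard cap on steps per run; tune for your workload
REPEAT_WINDOW = 4     # how many recent steps to check for repeats

@dataclass(frozen=True)
class ToolCall:
    tool: str
    args: str         # serialized arguments, e.g. json.dumps(args, sort_keys=True)

def should_abort(history: list[ToolCall], next_call: ToolCall) -> str | None:
    """Return a reason to stop the run, or None to continue."""
    if len(history) >= MAX_STEPS:
        return f"hit max step limit ({MAX_STEPS})"
    if any(call == next_call for call in history[-REPEAT_WINDOW:]):
        return f"repeated call to {next_call.tool} with identical arguments"
    return None

history = [ToolCall("search_web", '{"q": "acme pricing"}')] * 3
print(should_abort(history, ToolCall("search_web", '{"q": "acme pricing"}')))
# -> repeated call to search_web with identical arguments
```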

2. Silent Tool Failures

What it looks like: A tool returns an error message. The agent reads the error message as data and keeps going. The final output looks confident, cites specific details, and is built entirely on garbage. No exception was raised. The run completed successfully. The output is wrong.

Why it happens: Most LLMs are trained to be helpful and to continue making progress. When a tool returns something unexpected, the model often tries to work with it rather than stop. An error message like "Error: database connection failed" gets treated as content. The model might write a summary that references this "error" as if it were a real data point.

How to catch it: Structured error handling. Tools should return errors in a structured format that the agent framework can detect and handle, not as plain text that might get confused with valid data. Fail loud. When a tool fails, the agent should know it failed and decide explicitly what to do, not try to interpret the failure message as content.
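
One way to do this is a thin wrapper that turns exceptions into an explicit result envelope the agent loop can branch on. A sketch; names like `ToolResult` and `structured_tool` are illustrative, not any framework's API:

```python
from dataclasses import dataclass
from functools import wraps
from typing import Any, Callable

@dataclass
class ToolResult:
    ok: bool
    data: Any = None
    error: str | None = None   # machine-readable reason when ok is False

def structured_tool(fn: Callable[..., Any]) -> Callable[..., ToolResult]:
    """Wrap a tool so failures become explicit errors, never plain text posing as data."""
    @wraps(fn)
    def wrapper(*args, **kwargs) -> ToolResult:
        try:
            return ToolResult(ok=True, data=fn(*args, **kwargs))
        except Exception as exc:
            return ToolResult(ok=False, error=f"{type(exc).__name__}: {exc}")
    return wrapper

@structured_tool
def fetch_order_count(customer_id: str) -> int:
    raise ConnectionError("database connection failed")   # simulated failure

result = fetch_order_count("c-42")
if not result.ok:
    # The agent loop sees an explicit failure and decides what to do next,
    # instead of passing "Error: ..." to the model as if it were content.
    print("tool failed:", result.error)
```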

3. Context Window Overflow

What it looks like: The agent makes several tool calls that each return large blocks of text. The accumulated context grows until it hits the model's token limit. Depending on the framework, this either crashes the run outright or starts silently dropping early content, often including the system prompt. When the system prompt gets dropped, the agent's behavior becomes unpredictable.

Why it happens: Each tool call result gets appended to the running context. If you're searching the web and pulling full page contents, or querying a database and returning unfiltered results, the context can grow very fast. Most agents have no mechanism to manage this growth.

How to catch it: Track context length throughout the run. Set a threshold below the actual limit and trigger a summarization step when you approach it. Summarize early tool results into compact references before the context gets too large. And design tools to return concise, relevant outputs rather than full raw content wherever possible.
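
A rough sketch of that threshold check over a simple message list; the 4-characters-per-token estimate and the `summarize_messages` helper are stand-ins (use your provider's tokenizer and your own summarization call in practice):

```python
def estimate_tokens(text: str) -> int:
    # Rough heuristic (~4 characters per token); use the provider's tokenizer in production.
    return len(text) // 4

def summarize_messages(messages: list[dict]) -> str:
    # Placeholder: in practice this is an LLM call that produces a compact summary.
    return " | ".join(m["content"][:60] for m in messages)

CONTEXT_LIMIT = 128_000                    # the model's advertised window (assumption)
SUMMARIZE_AT = int(CONTEXT_LIMIT * 0.7)    # act well before the hard limit

def maybe_compact(messages: list[dict]) -> list[dict]:
    """Fold old tool results into one summary message once the context gets large."""
    total = sum(estimate_tokens(m["content"]) for m in messages)
    if total < SUMMARIZE_AT or len(messages) <= 7:
        return messages
    system, middle, recent = messages[:1], messages[1:-6], messages[-6:]
    summary = {"role": "system",
               "content": "Summary of earlier steps: " + summarize_messages(middle)}
    return system + [summary] + recent    # system prompt and recent turns survive intact
```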

4. The Cost Bomb

What it looks like: One user's query triggers 200 tool calls. Or an agent gets into a loop and makes thousands of LLM calls before you notice. You find out when the bill arrives. Sometimes the bill is four digits.

Why it happens: Agents are loops by nature. Without explicit cost controls, a single pathological input can trigger unbounded spending. This is especially common with agents that have access to expensive external APIs, or agents that try to be thorough by gathering a lot of data before synthesizing.

How to catch it: Per-request budget caps. Set a maximum token spend or a maximum number of tool calls per run, and abort with an error message when the cap is hit. This is a non-negotiable for any production agent. Also set up billing alerts at the account level so you know within hours if spending has spiked, not weeks later when the invoice arrives.
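
A sketch of a per-run budget guard; the limits and class name are illustrative, and the agent loop is assumed to catch `BudgetExceeded`, abort the run, and return a clear error:

```python
class BudgetExceeded(RuntimeError):
    pass

class RunBudget:
    """Per-run spending guard. The default limits are illustrative; set them for your workload."""
    def __init__(self, max_tokens: int = 100_000, max_tool_calls: int = 25):
        self.max_tokens = max_tokens
        self.max_tool_calls = max_tool_calls
        self.tokens_used = 0
        self.tool_calls = 0

    def record_llm_call(self, prompt_tokens: int, completion_tokens: int) -> None:
        self.tokens_used += prompt_tokens + completion_tokens
        if self.tokens_used > self.max_tokens:
            raise BudgetExceeded(f"token budget exceeded: {self.tokens_used}/{self.max_tokens}")

    def record_tool_call(self) -> None:
        self.tool_calls += 1
        if self.tool_calls > self.max_tool_calls:
            raise BudgetExceeded(f"tool-call budget exceeded: {self.tool_calls}/{self.max_tool_calls}")
```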

5. Hallucinated Tool Calls

What it looks like: The agent invents a tool that doesn't exist. Or it calls a real tool with arguments that don't match the schema. Or it passes a string where an integer is required. The framework throws an error or silently ignores the call, and the agent is confused about why its action had no effect.

Why it happens: LLMs can generate plausible-looking function calls for tools they've never seen. They interpolate from context, training data, and the pattern of tools that do exist. This behavior has improved significantly with modern models and strict function-calling APIs, but it still happens, especially when the tool schema is ambiguous or the model hasn't seen many examples of tool use.

How to catch it: Strict tool schemas. Validate every tool call against the schema before execution, not after. Return clear, structured errors back to the model when validation fails. Use function-calling APIs that enforce schema compliance rather than parsing tool calls from free text. Most major providers have these now.
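
If your framework doesn't validate for you, a pre-execution check is a few lines. This sketch assumes the `jsonschema` package and a made-up `get_weather` tool:

```python
from jsonschema import validate, ValidationError

TOOL_SCHEMAS = {
    "get_weather": {
        "type": "object",
        "properties": {
            "city": {"type": "string"},
            "days": {"type": "integer", "minimum": 1, "maximum": 10},
        },
        "required": ["city"],
        "additionalProperties": False,
    },
}

def validate_tool_call(name: str, args: dict) -> str | None:
    """Return a structured error message to send back to the model, or None if valid."""
    schema = TOOL_SCHEMAS.get(name)
    if schema is None:
        return f"unknown_tool: '{name}' is not an available tool"
    try:
        validate(instance=args, schema=schema)
    except ValidationError as exc:
        return f"invalid_arguments for '{name}': {exc.message}"
    return None

print(validate_tool_call("get_weather", {"city": "Oslo", "days": "3"}))
# -> invalid_arguments for 'get_weather': '3' is not of type 'integer'
```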

6. The Wrong Tool for the Job

What it looks like: The agent has 30 tools available. For a given task, tool #3 is the right choice. The agent picks tool #7, which sort of works but returns less useful information, and the final answer is worse than it should be. No error is raised. The run succeeds. The output is mediocre.

Why it happens: More tools means more surface area for the model to make suboptimal choices. When tools have similar names or overlapping descriptions, the model picks based on subtle textual cues that may not reflect actual capability differences. This gets worse as the tool count grows.

How to catch it: Fewer tools. Every tool you add increases the probability of a wrong selection. If an agent has 30 tools, ask whether 15 of them could be removed or consolidated. Better descriptions help too: write tool descriptions from the perspective of "when should you use this vs. the similar-sounding other tool." A routing layer that narrows the available tool set based on task type can also help significantly.
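
A routing layer can be as simple as a lookup from task type to tool subset. The categories and tool names below are invented for illustration, and the task classifier (keyword rules or a small, cheap LLM call) is assumed to run before the main agent loop:

```python
# Hypothetical tool groups; names and categories are illustrative only.
TOOL_GROUPS = {
    "billing":  ["get_invoice", "list_payments", "refund_payment"],
    "account":  ["get_profile", "update_email", "close_account"],
    "shipping": ["track_package", "get_delivery_estimate"],
}

def tools_for_task(task_type: str) -> list[str]:
    """Expose only the tools relevant to the classified task type."""
    return TOOL_GROUPS.get(task_type, [])

print(tools_for_task("billing"))   # the agent sees 3 tools instead of 30
```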

7. State Drift in Multi-Turn Conversations

What it looks like: The agent and user agree on something in turn three. By turn eight, the agent has forgotten it and makes a decision that contradicts what was agreed. The user has to repeat themselves. Repeatedly. This erodes trust fast.

Why it happens: Most conversational agents rely on chat history as their only memory. As the conversation grows, early turns get pushed further into the context window and receive less attention. The model's effective working memory is shorter than the full context window.

How to catch it: Explicit state management. Don't rely on chat history alone for decisions that matter. Extract key decisions, constraints, and agreements into a separate state object that gets explicitly included in relevant steps. This is different from just logging the conversation; it's proactively maintaining a compact, accurate representation of what has been decided.
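
A sketch of what that state object might look like; the fields and the `render` format are illustrative, and the rendered block is assumed to be prepended to the prompt (or a dedicated message) on every turn:

```python
from dataclasses import dataclass, field

@dataclass
class ConversationState:
    """Compact record of what has been decided; fields are illustrative."""
    decisions: list[str] = field(default_factory=list)
    constraints: list[str] = field(default_factory=list)

    def render(self) -> str:
        """Injected into the prompt each turn so agreements survive long conversations."""
        lines = ["Agreed decisions:"] + [f"- {d}" for d in self.decisions]
        lines += ["Hard constraints:"] + [f"- {c}" for c in self.constraints]
        return "\n".join(lines)

state = ConversationState()
state.decisions.append("Ship to the Berlin office, not the home address")
state.constraints.append("Total cost must stay under $500")
print(state.render())
```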

8. Race Conditions in Parallel Tool Calls

What it looks like: The agent calls two tools in parallel that both modify the same resource. Tool A reads a record, modifies it, and writes it back. Tool B does the same thing concurrently. Tool B's write overwrites Tool A's write. One update is silently lost. The final state is wrong.

Why it happens: Modern agent frameworks can run tool calls in parallel for speed. That's useful when the tools are read-only or operate on independent resources. But when two tools touch the same resource, order matters and parallelism creates race conditions. The agent usually has no way to know which tools might conflict.

How to catch it: Identify stateful operations and run them serially, not in parallel. Mark tools that modify shared state so the framework knows not to parallelize them. For high-stakes operations, implement proper locking at the resource layer.
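
Here's a sketch of one way to split a batch of tool calls into a parallel group and a serialized group. The `STATEFUL_TOOLS` set and the `execute` callable are placeholders for whatever your framework provides:

```python
import threading
from concurrent.futures import ThreadPoolExecutor

STATEFUL_TOOLS = {"update_record", "delete_record", "transfer_funds"}  # illustrative names
_write_lock = threading.Lock()

def run_tool_calls(calls: list[tuple[str, dict]], execute) -> list:
    """Run read-only calls in parallel and stateful calls one at a time.

    `execute(name, args)` is whatever your framework uses to invoke a tool.
    """
    results: list = [None] * len(calls)
    parallel = [(i, c) for i, c in enumerate(calls) if c[0] not in STATEFUL_TOOLS]
    serial = [(i, c) for i, c in enumerate(calls) if c[0] in STATEFUL_TOOLS]

    with ThreadPoolExecutor() as pool:
        futures = {pool.submit(execute, name, args): i for i, (name, args) in parallel}
        for future, i in futures.items():
            results[i] = future.result()

    for i, (name, args) in serial:
        with _write_lock:   # one writer at a time; resource-level locking still applies for high stakes
            results[i] = execute(name, args)
    return results
```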

9. The Polite Refusal Cascade

What it looks like: The agent hits a safety filter. It apologizes and declines. The user rephrases the request slightly. The agent hits the filter again. This loops several times. The user gets frustrated and leaves. Nothing harmful was ever attempted; the phrasing just triggered the filter.

Why it happens: Safety filters are trained on patterns, and legitimate requests can match those patterns. A customer service agent being asked about account cancellation might trigger a filter trained to avoid "account deletion" discussions. The agent doesn't have a way to distinguish "I can't help with this at all" from "I can't respond to this phrasing."

How to catch it: Detect refusal patterns in your traces. If an agent refuses two requests in a row from the same user on the same topic, flag it for human review rather than letting the loop continue. Build an escalation path so users who hit repeated refusals get routed to a human instead of cycling through the same dead end.
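
Refusal detection can start as a crude string match over recent replies (a small classifier is better, but this catches the obvious cases); the marker phrases and threshold here are illustrative:

```python
REFUSAL_MARKERS = (
    "i can't help with",
    "i cannot help with",
    "i'm unable to assist",
)

def looks_like_refusal(reply: str) -> bool:
    text = reply.lower()
    return any(marker in text for marker in REFUSAL_MARKERS)

def should_escalate(recent_replies: list[str], threshold: int = 2) -> bool:
    """Escalate to a human once the last `threshold` replies were all refusals."""
    if len(recent_replies) < threshold:
        return False
    return all(looks_like_refusal(r) for r in recent_replies[-threshold:])

# Log these events too, so your traces show how often legitimate requests
# are bouncing off the safety filter.
```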

10. Stale Context

What it looks like: The agent retrieved data early in the conversation. The user comes back an hour later and the conversation resumes. The agent acts on the data it fetched an hour ago as if it were current. The user gets information that is no longer accurate.

Why it happens: Agents cache data in the conversation context. There's no automatic mechanism to mark that data as stale. The model doesn't know how much time has passed since data was retrieved.

How to catch it: Attach timestamps to retrieved data and enforce TTLs. When a tool result is older than a defined threshold, re-fetch before using it. For long-running conversations or sessions that can be resumed after a gap, trigger fresh data retrieval at session start rather than relying on data from the previous session.
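
A minimal TTL wrapper around tool results; the ten-minute default, the cache layout, and the `query_inventory` call in the usage comment are illustrative:

```python
import time
from dataclasses import dataclass, field
from typing import Any, Callable

@dataclass
class CachedResult:
    value: Any
    fetched_at: float = field(default_factory=time.time)

def get_fresh(cache: dict[str, CachedResult], key: str,
              fetch: Callable[[], Any], ttl_seconds: float = 600) -> Any:
    """Return cached data only while it is younger than ttl_seconds; otherwise re-fetch."""
    entry = cache.get(key)
    if entry is None or time.time() - entry.fetched_at > ttl_seconds:
        cache[key] = CachedResult(value=fetch())
        entry = cache[key]
    return entry.value

# Usage (hypothetical tool):
# prices = get_fresh(cache, "inventory:sku-123", lambda: query_inventory("sku-123"))
```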

11. The Format Breakdown

What it looks like: The agent is supposed to return JSON. It returns JSON wrapped in a markdown code block wrapped in an explanation: "Here is the data you requested: ```json { ... } ```". The downstream parser expects raw JSON, gets a markdown string, and throws an error. Sometimes the downstream system just swallows the error and the data is lost.

Why it happens: Models are trained to be helpful and explanatory. Returning a bare JSON object without any surrounding text feels unnatural to a model trained on conversational data. So it adds context. The technical requirement and the model's natural behavior are in conflict.

How to catch it: Structured output APIs. Most major providers now have modes that enforce output format compliance. Use them. If you can't use structured outputs, add a parsing step that strips markdown formatting and extracts the structured content. And add a retry: if parsing fails, send the malformed output back to the model with explicit instructions to return only the raw format.
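
If you're stuck parsing free text, a tolerant extraction step plus a retry covers most cases. A sketch; the example reply mirrors the failure described above:

```python
import json
import re

FENCE_RE = re.compile(r"```(?:json)?\s*(.*?)\s*```", re.DOTALL)

def extract_json(raw: str) -> dict:
    """Parse model output into JSON, tolerating markdown fences and surrounding prose."""
    match = FENCE_RE.search(raw)
    candidate = match.group(1) if match else raw
    # Fall back to the outermost braces if there's still extra text around the object.
    if not candidate.lstrip().startswith("{"):
        start, end = candidate.find("{"), candidate.rfind("}")
        if start == -1 or end == -1:
            raise ValueError("no JSON object found in model output")
        candidate = candidate[start:end + 1]
    return json.loads(candidate)

reply = 'Here is the data you requested: ```json {"status": "ok", "count": 3} ```'
print(extract_json(reply))   # {'status': 'ok', 'count': 3}
# If this raises, send the raw reply back to the model with an instruction
# to return only the JSON object, then parse the retry.
```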

12. The Silent Quality Drop

What it looks like: Nothing breaks. The agent runs without errors. But the quality of its outputs slowly degrades over weeks or months. Users complain more. Satisfaction scores decline. You can't point to a specific change that caused it.

Why it happens: Model providers update their models continuously. A model you're calling as "gpt-4" today may have different behavior than the model you were calling as "gpt-4" three months ago. The change is typically small, but for sensitive tasks the difference matters. And without continuous evaluation running against your real use cases, you won't catch it until users tell you.

How to catch it: Continuous evaluation. Not just monitoring for errors, but running regular checks on output quality against a fixed test set. This doesn't have to be expensive: 20 to 50 representative examples evaluated weekly is enough to catch significant regressions. When you see a score drop, you have a starting point for investigation. More on this in A Practical Guide to Evaluating Agents.
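
The harness can be very small; this sketch assumes you already have a `run_agent` callable and a `score` function that returns a number between 0 and 1:

```python
import statistics

def run_weekly_eval(examples: list[dict], run_agent, score) -> float:
    """Run a fixed test set through the agent and report the mean score.

    Each example is {"input": ..., "expected": ...}; `run_agent` and `score`
    are whatever you already have for your task.
    """
    scores = []
    for ex in examples:
        output = run_agent(ex["input"])
        scores.append(score(output, ex["expected"]))
    return statistics.mean(scores)

# Persist the weekly mean (and per-example scores) so a regression shows up
# as a visible drop in the trend line, not as a vague sense that quality slipped.
```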

What to Actually Do About All This

You don't fix these one by one. You build the layer that catches them all.

You can't prevent every one of these failures by writing more careful code. Agents are probabilistic systems operating in a world full of edge cases. The goal isn't to eliminate failures; it's to catch them fast, understand what happened, and fix the systemic issue.

Three things to put in place, in this order:

Tracing. Every run should produce a complete trace of what the agent did. Every tool call, every decision, every intermediate output. Without this, debugging any of the above is guesswork. Start here. This is the foundation of everything else. See What is AgentOps for how observability fits into the broader picture.
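
If your framework doesn't give you traces out of the box, even a decorator that logs each tool call as a JSON line is a start. This sketch prints to stdout; a real setup would write to a trace store:

```python
import functools
import json
import time
import uuid

def traced(run_id: str):
    """Decorator that logs every tool call with its arguments, result preview, and latency."""
    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            start = time.time()
            result = fn(*args, **kwargs)
            print(json.dumps({              # swap print for your logging pipeline / trace store
                "run_id": run_id,
                "tool": fn.__name__,
                "args": kwargs or list(args),
                "result_preview": str(result)[:200],
                "latency_ms": round((time.time() - start) * 1000),
            }, default=str))
            return result
        return wrapper
    return decorator

run_id = str(uuid.uuid4())

@traced(run_id)
def search_docs(query: str) -> list[str]:
    return ["doc-1", "doc-7"]              # stand-in for a real search tool

search_docs(query="refund policy")
```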

Evals. Run a set of representative examples on a regular schedule and score the outputs. You're looking for regressions, unexpected behavior changes, and systematic failure patterns. Even a small eval set catches the silent quality drop (failure 12) and often surfaces others. This is the thing that teams skip and regret. See A Practical Guide to Evaluating Agents for how to build this without overbuilding it.

Budgets. Set hard per-run limits on token spend and tool calls. This is a one-time config change that prevents the cost bomb (failure 4) and puts a natural ceiling on any runaway loop (failure 1). Do this before you have production traffic, not after your first incident.

That's the whole job of AgentOps in one paragraph. Trace what your agent does. Evaluate whether it did it well. Cap what it's allowed to spend getting there. Everything else builds on those three foundations.