
2026-04-23 / 7 MIN READ

Agent failure modes and the recovery patterns that keep shipping

Field notes on Claude Code agent failure modes. Four repeatable failure shapes, the signals that tip me off early, and the recovery pattern for each.

2026-03-03

An agent I was running on a Shopify theme task quietly produced a PR that looked fine. Merged it, deployed, checked the site, and the new product section was rendering an undefined metafield on every card. Nobody flagged the issue during review because the code itself looked clean. The agent had written valid code against a metafield name that did not exist in the Shopify store.

Silent drift. One of the four failure modes I keep running into. Not the first time, definitely not the last. This is the running field-notes log on the four shapes and what I do now to catch each one early.

Failure catalog: four modes

2026-03-04

The four failure modes, in the order I see them:

  1. Context loss. The agent forgets something you told it earlier in the session.
  2. Tool confusion. The agent calls the wrong tool or uses the wrong argument shape.
  3. Infinite retry. The agent keeps retrying a failing action without diagnosing it.
  4. Silent drift. The output looks right but is subtly wrong.

Each one has a specific signal, a specific recovery pattern, and a specific prevention move.

2026-03-05

Context loss. The agent has "forgotten" an earlier instruction. Usually this is a symptom of context compaction. You hit the 200K ceiling (the token budget post covers the math) and Claude Code summarizes older conversation to free space. Specifics become summaries. The model still knows you talked about X but has lost the detail.

Signal: the agent asks a question you already answered, or makes a decision that contradicts a constraint you set earlier.

Recovery: restate the constraint. Simple and often enough.

Prevention: for anything that matters, write it to a notes file or to CLAUDE.md rather than just saying it in chat. Files survive compaction. Conversation does not.
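The "write it to a file" move is simple enough to automate. A minimal sketch, assuming a hypothetical notes file and helper name (nothing here is a real Claude Code API; the point is that appended files persist while chat turns get summarized away):

```python
from pathlib import Path

def persist_constraint(notes_path: Path, text: str) -> None:
    # Append a hard constraint as a bullet to a notes file (e.g. CLAUDE.md).
    # Files survive context compaction; conversation turns do not.
    with notes_path.open("a", encoding="utf-8") as f:
        f.write(f"- {text}\n")

# Usage: persist_constraint(Path("CLAUDE.md"), "Never rename exported symbols in src/lib/.")
```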

2026-03-06

Tool confusion. The agent calls the wrong tool, or calls the right tool with malformed arguments. This shows up most often when you have two tools that do similar things. Read and Grep are clean. But if you have a custom search_files tool and a built-in Grep loaded at the same time, the agent can get confused about which to use.

Signal: the agent calls a tool and gets an unexpected error (wrong arguments, wrong output shape). Usually one call, not a loop.

Recovery: tell it which tool to use. Explicit instructions override the confusion.

Prevention: narrow tool sets. For sub-agents especially, load only the tools they actually need. If I am dispatching a research sub-agent, it gets Read, Grep, Glob. It does not get Write, Edit, or anything that touches the filesystem. Narrow surface, less confusion.
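The allowlist idea can be sketched as a filter over the tool registry before dispatch. The role names and registry shape below are assumptions for illustration, not a real harness API:

```python
# Hypothetical per-role allowlists: research sub-agents get read-only tools.
ALLOWLISTS = {
    "research": {"Read", "Grep", "Glob"},
}

def tools_for(role: str, registry: dict) -> dict:
    # Narrow the full tool registry to the role's allowlist;
    # unknown roles fall back to the full set.
    allow = ALLOWLISTS.get(role, set(registry))
    return {name: fn for name, fn in registry.items() if name in allow}
```

A research sub-agent dispatched through `tools_for("research", registry)` simply never sees Write or Edit, so it cannot confuse them with anything.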

2026-03-08

Infinite retry. The agent keeps trying the same failing action without diagnosing why it failed. Classic example: a bash command returns "permission denied." The agent tries again. Permission denied again. Agent tries a third time. This is where budget disappears and nothing gets done.

Signal: you see the same failed tool call three or more times in a row. Sometimes it is the same exact call; sometimes it is small permutations of the same call.

Recovery: stop the loop manually. TaskStop in the harness. Or just hit escape.

Prevention: max-iteration counter in the agent's prompt. "If a command fails twice in a row, stop and report the failure instead of retrying." Also a hard timeout at the harness level for long-running sub-agents. The parallel dispatch post touches on timeout setup for parallel runs.
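The "fail twice, then stop and report" rule is easy to enforce at the harness level too, as a guard around any retryable action. A sketch (function names are mine, not from any real harness):

```python
def run_with_retry_guard(action, max_failures: int = 2) -> dict:
    # Run the action, allowing at most max_failures consecutive failures.
    # Instead of looping forever, return a failure report the agent
    # (or operator) can act on.
    errors = []
    for _ in range(max_failures):
        try:
            return {"status": "success", "result": action()}
        except Exception as exc:
            errors.append(str(exc))
    return {"status": "failed", "errors": errors}
```

A "permission denied" command fails twice, the guard returns a failed report with both errors attached, and the run moves on instead of burning budget.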

2026-03-10

Silent drift. The output looks right but is subtly wrong. This is the worst of the four because by definition you do not know it happened until you deploy and something breaks.

Examples I have hit:

  • The agent wrote code referencing a metafield that did not exist in the store.
  • The agent generated a config with the wrong project ID (it pulled from memory instead of reading the current config file).
  • The agent produced a JSON response that was valid JSON but missing a required field the downstream consumer needed.

Signal: there is no signal at runtime. You only know after the fact, when something downstream fails.

Recovery: rollback and re-do with tighter constraints.

Prevention: verification pass at the end of every run. Not a re-review of the work. A specific check: "verify this config matches the one currently in .env.local." "Verify this metafield exists in the store." "Verify this JSON has all required fields per the schema." A verification pass costs maybe 2-3K tokens and catches most drift before it ships.
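The "required fields per the schema" check is the mechanical one. A minimal sketch, with a made-up field set standing in for whatever the downstream consumer actually requires:

```python
# Hypothetical downstream contract; substitute the real required fields.
REQUIRED_FIELDS = {"order_id", "status", "total"}

def missing_fields(payload: dict) -> list[str]:
    # Return required fields absent from an agent's JSON output.
    # Valid JSON with a missing field is exactly the drift that
    # passes review and fails downstream.
    return sorted(REQUIRED_FIELDS - payload.keys())
```

An empty return means the payload clears this particular check; a non-empty one is the signal that runtime never gives you.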

2026-03-12

Pattern across all four: the failures are not random. Each one has a shape, a signal, and a prevention pattern. The cost of building prevention into the workflow is small. The cost of skipping it is variable but sometimes very high (the silent drift one was a 3-hour recovery with a rollback deploy).

2026-03-14

A thing I did not initially appreciate: all four failures get harder to catch in parallel runs. With three agents in flight, you cannot watch each one closely. Infinite retry on one of them eats budget invisibly until you check the usage dashboard. Silent drift in one of three parallel outputs is harder to catch than drift in a single sequential one.

The mitigation: for parallel runs, each agent writes a structured audit log as its last step. Something like:

{
  "status": "success",
  "files_touched": ["src/foo.ts"],
  "verification_passed": true,
  "notes": "Applied the helper rewrite. Tests still pass locally."
}

The coordinator reads the audit logs and flags any status other than "success." This is cheap and catches most failures before they integrate into main thread state.

2026-03-15

The thing I wish someone had told me six months ago: build recovery before you need it. Every failure mode I named above, I hit at least twice before I bothered to build the prevention pattern. That is at least two recovery episodes per failure mode. The prevention is always cheaper than the recovery.

Specific preventions I now bake in by default:

  • CLAUDE.md in every serious project with the hard constraints. Compaction does not erase these.
  • Narrow tool sets on sub-agents. Default tool set is almost always too broad.
  • Max-iteration and timeout in every long-running sub-agent prompt.
  • Verification pass at the end of every multi-file change.
  • Structured audit log on parallel runs.

What the six weeks taught me

Agents fail in a small number of repeatable ways. Naming the failures makes them cheaper to prevent. The cost of baking prevention into the workflow is small; the cost of repeated recovery is much higher.

For operators running agent workflows in production, the Operator's Stack curriculum works through these failure patterns alongside the rest of the agent engineering discipline. The agent handbook hub indexes the broader set.

FAQ

Is there a way to detect silent drift automatically?

Partly. Schema-validate structured outputs. Diff-check file edits against expected changes. Run a test suite as the verification pass. None of these catch everything; humans still do final sanity checks on important changes.

How do I tell context loss from the model just being wrong?

Check whether the earlier instruction is still in the conversation (before any compaction). If it is not, context loss. If it is and the model is still ignoring it, that is a different kind of failure, usually solved by restating more emphatically.

What is a reasonable max-iteration count?

Three is a safe default. If a tool call fails three times with the same shape, something is wrong that retrying will not fix. Bump to five for genuinely flaky environments (network, rate-limited APIs).

Can I recover from a failed parallel run without redoing everything?

Yes, if you have the structured audit log. Re-dispatch only the failed agents. If you do not have the log, you usually have to redo the whole batch.

Is tool confusion more common with MCP servers?

Yes. Every MCP server adds more tools to the definition block. Beyond about 30 tools total, confusion rates go up. Attach only the MCP servers you need for the current session.

// related

Claude Code Skills Pack

If you want to go deeper on agentic builds, this pack covers the patterns I use every day. File ownership, parallel agents, tool contracts.

View the pack →