What to do when a sub-agent fails

The failure mode that kills multi-agent pipelines isn't a crash. It's a sub-agent that reports success when it didn't succeed.

The agent hits an error, catches it, returns something that looks like a result, and the lead agent moves on. The problem shows up two or three steps later, missing data or a bad output, with no trace back to where it actually broke. By that point you've burned context on work that needed to be redone.

Why sub-agents swallow errors

An agent told to "research X and return a summary" has implicit pressure to return something. Returning nothing feels like failure. So when a tool call errors out or the data is incomplete, the agent writes around it — "I wasn't able to find specific details, but generally speaking..." — and the lead agent reads that as a complete result.

This is a system prompt problem. The default completion pressure is to produce output. You need an explicit instruction that returning an error is the correct behavior when the task can't be completed.

What to put in the sub-agent prompt

Two things:

First, tell it what a failure looks like and what to do:

If you cannot complete the task due to missing data, tool errors, or
ambiguous requirements, return this exact format instead of guessing:

STATUS: FAILED
REASON: [specific reason — tool error, missing data, unclear scope]
WHAT_I_TRIED: [list of attempts]
WHAT_WOULD_HELP: [what information or access would unblock this]

Second, tell it explicitly not to substitute guesses for data:

Do not return partial results as if they are complete results.
Do not use phrases like "generally speaking" or "typically" to fill gaps.
If you don't have the specific information requested, say so.

The second instruction sounds obvious. It isn't. Without it, agents fill gaps by default.

How the lead agent handles failures

The lead agent needs to check for failure status before treating a sub-agent's output as usable. This means reading the first line of every response for a status field, not assuming success because output exists.

When a sub-agent returns STATUS: FAILED, the lead has three options: retry with more context, continue without that data and note the gap, or stop and surface the failure to the human. Which option is correct depends on whether the failed task was on the critical path.

If a researcher sub-agent fails to find a source and the writer sub-agent is waiting on it, that's critical path — the lead has to resolve it before continuing. If the failed task was optional enrichment, the lead can note the gap and move on. Your system prompt for the lead agent should specify which is which.

Timeouts

Sub-agents can also fail by running too long. A sub-agent that's been given an open-ended research task with no time constraint will keep going — reading more sources, checking more things — until context fills up. The lead agent is blocked the whole time.

Set a scope limit in the prompt, not a time limit (agents can't tell time). Something like "search no more than 5 sources" or "limit your work to what can fit in roughly 20 tool calls." The agent can't enforce these precisely but they give it a stopping point that doesn't require it to decide on its own when enough is enough.

The error log

Keep a running error log in the task state file. Every time a sub-agent fails, the lead writes one line: what it tried to do, what failed, and what decision was made. After a full pipeline run, this log tells you where the weak points are and which failure modes come up most.

Without it, failures disappear. You remember the pipeline ran. You don't remember that the researcher failed twice and you got results without the sources you wanted.

The Multi-Agent Workflow Templates include a complete error handling playbook — failure status formats, lead agent recovery patterns, scope limits for each agent type, and a worked QA loop example that handles partial results. Five full workflow templates with all the coordination and error handling wired up. $49 at Payhip.