Error Handling and Fallbacks in Notion AI Workflows
The 60-second version
The default failure mode of a Notion agent is “stop.” That’s almost never what you want in production. Robust workflows define what happens for each kind of failure: the agent times out, a Worker fails, an external API is down, a schema doesn’t match, or the credit pool runs dry. Each needs a planned response: retry, fall back to manual, escalate to a human, or log and continue. Without explicit handling, “the agent stopped working” becomes a mystery debug session.
Five failure modes and their handling
1. Agent timeout (rare, but it happens). A 20-minute Custom Agent run that doesn’t complete. Handling: log the timeout, surface it to the human owner, and don’t auto-retry (a retry is likely to hit the same problem).
2. Worker timeout (more common). The Worker hits its 30-second limit. Handling: return a structured error from the Worker; the agent then decides whether to retry, return a partial result, or fail. Don’t silently re-invoke.
3. External API failure. The API is down, rate limited, or returning errors. Handling: retry with exponential backoff (max 3 attempts), then fall back to an “external system unavailable” path with human notification.
4. Schema mismatch. The agent expected JSON shape A; the Worker returned shape B. Handling: validate at the boundary, log the mismatch, fall back to a default response, and alert a human to fix the schema drift.
5. Credit exhaustion. The workspace credit pool hits zero (post-May 4). Handling: this one is hard, because the agent stops mid-execution. Mitigation is preventative: monitor credit consumption, alert at 75% of the monthly budget, and top up before the pool reaches zero.
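One way to keep these five cases explicit is to model each failure as a tagged type and map it to a planned response in a single place. The sketch below is illustrative only: the type names, field names, and `planResponse` are hypothetical and not part of any Notion API, and the Worker-timeout branch encodes just one possible policy (use a partial result if one exists, otherwise escalate).

```typescript
// Hypothetical types for illustration; nothing here is a Notion API.
type Failure =
  | { kind: "agent_timeout"; runMinutes: number }
  | { kind: "worker_timeout"; partialResult?: unknown }
  | { kind: "external_api"; attempts: number }
  | { kind: "schema_mismatch"; expected: string; received: string }
  | { kind: "credit_exhaustion" };

type PlannedAction =
  | { action: "retry"; reason: string }
  | { action: "use_partial"; reason: string }
  | { action: "fallback_and_alert"; reason: string }
  | { action: "escalate"; reason: string };

const MAX_API_ATTEMPTS = 3;

function planResponse(failure: Failure): PlannedAction {
  switch (failure.kind) {
    case "agent_timeout":
      // No auto-retry: a timed-out run is likely to time out again.
      return { action: "escalate", reason: `agent run exceeded ${failure.runMinutes} minutes` };
    case "worker_timeout":
      // Assumed policy: prefer a partial result, otherwise hand off to a human.
      return failure.partialResult !== undefined
        ? { action: "use_partial", reason: "worker timed out with a partial result" }
        : { action: "escalate", reason: "worker timed out with no result" };
    case "external_api":
      return failure.attempts < MAX_API_ATTEMPTS
        ? { action: "retry", reason: `attempt ${failure.attempts} of ${MAX_API_ATTEMPTS} failed` }
        : { action: "fallback_and_alert", reason: "external system unavailable" };
    case "schema_mismatch":
      return {
        action: "fallback_and_alert",
        reason: `expected ${failure.expected}, received ${failure.received}`,
      };
    case "credit_exhaustion":
      // Nothing can run without credits, so this always goes to a human.
      return { action: "escalate", reason: "workspace credit pool is empty" };
  }
}
```

Because the switch covers every `kind`, adding a sixth failure type without a planned response becomes a compile-time error rather than a silent “stop.”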
Three practical patterns
The retry-with-backoff pattern.
First attempt fails → wait 1 second, retry. Second fails → wait 4 seconds, retry. Third fails → escalate to human. Don’t retry indefinitely.
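A minimal sketch of that schedule, assuming the failing call can be wrapped as an async function. `retryWithBackoff`, `Sleep`, and the result shape are hypothetical names chosen for this example; the sleep and log functions are injectable so the schedule can be tested without real waits. Every failed attempt is logged, so retries are never silent.

```typescript
type Sleep = (ms: number) => Promise<void>;

const defaultSleep: Sleep = (ms) =>
  new Promise<void>((resolve) => setTimeout(resolve, ms));

// Waits after the first and second failures: 1 second, then 4 seconds.
const BACKOFF_MS = [1000, 4000];

type RetryResult<T> =
  | { ok: true; value: T; attempts: number }
  | { ok: false; escalate: true; attempts: number; lastError: string };

async function retryWithBackoff<T>(
  operation: () => Promise<T>,
  sleep: Sleep = defaultSleep,
  log: (message: string) => void = console.log,
): Promise<RetryResult<T>> {
  const maxAttempts = BACKOFF_MS.length + 1; // three attempts in total
  let lastError = "";
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    try {
      const value = await operation();
      return { ok: true, value, attempts: attempt };
    } catch (err) {
      lastError = err instanceof Error ? err.message : String(err);
      // Log every failed attempt so retries are visible when debugging.
      log(`attempt ${attempt} of ${maxAttempts} failed: ${lastError}`);
      if (attempt < maxAttempts) {
        await sleep(BACKOFF_MS[attempt - 1] ?? 0);
      }
    }
  }
  // Out of attempts: signal the caller to escalate instead of looping forever.
  return { ok: false, escalate: true, attempts: maxAttempts, lastError };
}
```

The function returns a result object rather than throwing on the final failure, which leaves the escalation decision to the caller.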
The fallback-output pattern.
When the primary path fails, return a known-safe default with metadata indicating it’s a fallback. Downstream consumers can check the metadata and decide whether to use the fallback or alert.
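As a rough illustration, the fallback can be a thin wrapper that tags every result with where it came from. `Sourced`, `withFallback`, and `shouldAlert` are hypothetical names for this sketch; the wrapper is synchronous for brevity, and an async variant would follow the same shape.

```typescript
interface Sourced<T> {
  value: T;
  meta: {
    source: "primary" | "fallback";
    reason?: string; // why the fallback was used, if it was
    producedAt: string; // ISO timestamp
  };
}

function withFallback<T>(primary: () => T, safeDefault: T): Sourced<T> {
  const producedAt = new Date().toISOString();
  try {
    return { value: primary(), meta: { source: "primary", producedAt } };
  } catch (err) {
    const reason = err instanceof Error ? err.message : String(err);
    // Return the known-safe default, clearly marked as a fallback.
    return { value: safeDefault, meta: { source: "fallback", reason, producedAt } };
  }
}

// A downstream consumer checks the metadata instead of guessing from the value.
function shouldAlert<T>(result: Sourced<T>): boolean {
  return result.meta.source === "fallback";
}
```

Keeping the marker in metadata, rather than encoding it in the value itself, means a fallback string can never be mistaken for real output.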
The human-escalation pattern.
Define clear handoff criteria. When the agent can’t complete, who gets pinged, with what context, in what channel? “Pings someone eventually” is not a plan.
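One way to make the handoff concrete is a small routing table that answers who, where, and with what context. Everything in this sketch is a placeholder: the owners, the channel names, the failure-kind strings, and `buildEscalation` are assumptions for illustration, and actually sending the message is left to whatever notification tool the team uses.

```typescript
interface EscalationRule {
  matches: (failureKind: string) => boolean;
  owner: string; // who gets pinged
  channel: string; // where the ping lands
}

interface Escalation {
  owner: string;
  channel: string;
  message: string; // the context the human needs to act
}

// Placeholder owners and channels; replace with real ones.
const RULES: EscalationRule[] = [
  { matches: (k) => k === "credit_exhaustion", owner: "workspace-admin", channel: "#ops-alerts" },
  { matches: (k) => k.endsWith("_timeout"), owner: "workflow-owner", channel: "#agent-errors" },
];

// Every failure routes somewhere, even if no specific rule matches.
const DEFAULT_RULE = { owner: "on-call", channel: "#agent-errors" };

function buildEscalation(
  failureKind: string,
  workflow: string,
  lastStep: string,
  error: string,
): Escalation {
  const rule = RULES.find((r) => r.matches(failureKind)) ?? DEFAULT_RULE;
  return {
    owner: rule.owner,
    channel: rule.channel,
    message: `[${workflow}] ${failureKind} at step "${lastStep}": ${error}`,
  };
}
```

For example, `buildEscalation("worker_timeout", "invoice-sync", "fetch invoices", "exceeded 30 seconds")` routes to the workflow owner with the workflow name, failing step, and error in one message.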
Logging requirements
Production agent workflows need three log streams:
– Action log: what the agent did and when
– Error log: what failed, with enough context to diagnose
– Decision log: when the agent chose between options, what it chose and why
Without all three, debugging takes 10x longer than it should.
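A sketch of how the three streams might be kept distinct while sharing one store. `LogEntry` and `WorkflowLog` are hypothetical names, and the in-memory array stands in for wherever the logs are actually persisted (a database, a log service, or similar).

```typescript
type LogEntry =
  | { stream: "action"; at: string; action: string }
  | { stream: "error"; at: string; error: string; context: Record<string, unknown> }
  | { stream: "decision"; at: string; options: string[]; chosen: string; why: string };

class WorkflowLog {
  private readonly entries: LogEntry[] = [];

  // Action log: what the agent did and when.
  action(action: string): void {
    this.entries.push({ stream: "action", at: new Date().toISOString(), action });
  }

  // Error log: what failed, plus enough context to diagnose it.
  error(error: string, context: Record<string, unknown>): void {
    this.entries.push({ stream: "error", at: new Date().toISOString(), error, context });
  }

  // Decision log: the options considered, the one chosen, and why.
  decision(options: string[], chosen: string, why: string): void {
    this.entries.push({ stream: "decision", at: new Date().toISOString(), options, chosen, why });
  }

  stream(name: LogEntry["stream"]): LogEntry[] {
    return this.entries.filter((entry) => entry.stream === name);
  }
}
```

Requiring `context` on errors and `why` on decisions forces the information that debugging depends on to be captured at write time.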
Where this goes wrong
1. Trusting the default failure behavior. “The agent stopped” is rarely the right response. Define explicit handling.
2. Silent retries. Retries that don’t log produce mysterious “sometimes it works” behavior. Always log retry attempts.
3. No credit monitoring. Hitting credit zero stops every agent in the workspace. Monitor consumption proactively.
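The credit check itself can be very small. The sketch below assumes the used-credit count and monthly budget are already available as numbers; how they are obtained is outside its scope, and `creditStatus` is a hypothetical helper, not a Notion API. The 75% threshold follows the alert level suggested earlier.

```typescript
type CreditStatus = "ok" | "alert" | "exhausted";

// Alert once 75% of the monthly budget has been consumed.
const ALERT_FRACTION = 0.75;

function creditStatus(usedCredits: number, monthlyBudget: number): CreditStatus {
  if (monthlyBudget <= 0) {
    throw new RangeError("monthlyBudget must be positive");
  }
  if (usedCredits >= monthlyBudget) {
    return "exhausted";
  }
  return usedCredits / monthlyBudget >= ALERT_FRACTION ? "alert" : "ok";
}
```

Running a check like this on a schedule, and routing `"alert"` to whoever can top up the pool, turns credit exhaustion from an outage into a routine notification.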
What to read next
Workers in TypeScript, Multi-Agent Orchestration, Security Posture, ROI Math.