Free-form agent planning is attractive in demos, but in production it can be hard to control. Typical failure modes are loops, ambiguous tool calls, and unpredictable recovery after errors. A state machine approach makes the workflow explicit: what the agent can do next, what conditions must hold, and how to recover safely.
What a state machine gives you
A state machine provides three operational benefits:
- Predictability. The agent follows a known flow with bounded choices.
- Auditability. Each transition is observable and can be traced.
- Safety. Sensitive actions are gated by preconditions and approvals.
A state machine is compatible with agents; it constrains where they can improvise rather than removing improvisation altogether.
Design the states around your workflow
Start with the user journey and define states that match real steps. For example:
- intake (capture intent and constraints)
- clarify (ask for missing fields)
- retrieve (fetch evidence or required records)
- plan (compose a safe plan)
- execute (call tools with validation)
- confirm (check results and summarise)
Within these states the model still contributes: it can draft clarifying questions, choose among safe tool options, and write user-facing explanations.
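The states listed above can be written down as an enumeration plus a transition table. The following is a minimal sketch, not a definitive design: the state names come from the list above, but the specific edges in `ALLOWED` and the `transition` helper are assumptions made for illustration.

```python
from enum import Enum


class State(Enum):
    INTAKE = "intake"
    CLARIFY = "clarify"
    RETRIEVE = "retrieve"
    PLAN = "plan"
    EXECUTE = "execute"
    CONFIRM = "confirm"


# Hypothetical transition table: which states may follow each state.
# The exact edges are an assumption; adapt them to your own workflow.
ALLOWED = {
    State.INTAKE: {State.CLARIFY, State.RETRIEVE},
    State.CLARIFY: {State.CLARIFY, State.RETRIEVE},
    State.RETRIEVE: {State.CLARIFY, State.PLAN},
    State.PLAN: {State.CLARIFY, State.EXECUTE},
    State.EXECUTE: {State.CONFIRM},
    State.CONFIRM: set(),  # terminal state
}


def transition(current: State, proposed: State) -> State:
    """Return the proposed state if the table allows it, otherwise raise."""
    if proposed not in ALLOWED[current]:
        raise ValueError(
            f"illegal transition: {current.value} -> {proposed.value}"
        )
    return proposed
```

In this sketch the model may propose the next state, but the table decides: `transition(State.PLAN, State.EXECUTE)` returns `State.EXECUTE`, while `transition(State.INTAKE, State.EXECUTE)` raises `ValueError`.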
Use preconditions and invariants
Each state should have clear preconditions. Examples:
- execute requires validated tool arguments and required approvals (see approvals).
- retrieve requires tenant and entitlement context (see tenant isolation).
- confirm requires a read-after-write verification step for side effects.
Enforced on every transition, these preconditions act as invariants that stop the agent from taking unsafe shortcuts.
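One way to enforce the preconditions above is a guard function that runs before a state is entered. This is a sketch under stated assumptions: `RunContext`, its field names, and `unmet_preconditions` are hypothetical and only mirror the three examples in the list.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class RunContext:
    # All field names here are illustrative assumptions.
    tenant_id: Optional[str] = None
    entitlements: set = field(default_factory=set)
    args_validated: bool = False
    required_approvals: set = field(default_factory=set)
    approvals: set = field(default_factory=set)
    write_verified: bool = False


def unmet_preconditions(state: str, ctx: RunContext) -> list:
    """Return the reasons a state cannot be entered (empty list if it can)."""
    problems = []
    if state == "retrieve":
        if not ctx.tenant_id:
            problems.append("missing tenant context")
        if not ctx.entitlements:
            problems.append("missing entitlement context")
    elif state == "execute":
        if not ctx.args_validated:
            problems.append("tool arguments not validated")
        missing = ctx.required_approvals - ctx.approvals
        if missing:
            problems.append(f"missing approvals: {sorted(missing)}")
    elif state == "confirm":
        if not ctx.write_verified:
            problems.append("read-after-write verification not done")
    return problems
```

The caller would refuse the transition whenever the returned list is non-empty, and could surface the reasons to the model or to a log.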
Make tool calls boring: contracts and validation
State machines work best when tools are strict:
- Use schemas and validators for tool arguments (see tool contracts).
- Return machine-readable errors and classify them (see error taxonomy).
- Design idempotency keys for safe retries.
When tools are strict, the model spends less effort guessing and more effort reasoning about the task.
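The three practices above can be sketched in a few lines of standard-library Python. Everything named here is an assumption for illustration: the `refund` tool schema, the `ToolError` class and its `code` values, and the choice to derive the idempotency key from the run ID, step ID, tool name, and arguments. A production system would more likely use a dedicated schema library.

```python
import hashlib
import json


class ToolError(Exception):
    """Machine-readable tool error (hypothetical shape)."""

    def __init__(self, code: str, retryable: bool, message: str):
        super().__init__(message)
        self.code = code            # e.g. "invalid_argument"
        self.retryable = retryable  # lets the state machine classify the error


# Hypothetical schema for a "refund" tool: field name -> required type.
REFUND_SCHEMA = {"order_id": str, "amount_cents": int}


def validate_args(args: dict, schema: dict) -> dict:
    """Reject unknown, missing, or wrongly typed fields before any tool call."""
    unknown = set(args) - set(schema)
    if unknown:
        raise ToolError("invalid_argument", False, f"unknown fields: {sorted(unknown)}")
    for name, expected in schema.items():
        if name not in args:
            raise ToolError("invalid_argument", False, f"missing field: {name}")
        # Exact type check, so that True is not accepted where an int is required.
        if type(args[name]) is not expected:
            raise ToolError("invalid_argument", False, f"{name} must be {expected.__name__}")
    return args


def idempotency_key(run_id: str, step_id: int, tool: str, args: dict) -> str:
    """Derive a stable key so a retried call can be deduplicated by the tool."""
    payload = json.dumps([run_id, step_id, tool, args], sort_keys=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()
```

Because the key depends only on the run, the step, and the arguments, a retry of the same step produces the same key, while a different step or different arguments produce a new one.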
Build explicit recovery paths
Recovery is where production systems differ most from demos. Add explicit transitions for:
- retry with backoff. Only for retryable errors, with budgets (see tool error handling).
- fallback mode. Switch routes or reduced-context modes under stress (see fallback and degradation).
- handoff. Escalate to human review for high-risk cases (see human review ops).
Because every recovery path is an explicit, budgeted transition, the agent cannot retry indefinitely (see loop prevention).
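The recovery transitions above can be expressed as a small decision function plus a capped exponential backoff. This is one possible policy, not the only reasonable one: the `Failure` type, the ordering of the checks, and the default `base` and `cap` values are assumptions.

```python
from dataclasses import dataclass


@dataclass
class Failure:
    retryable: bool          # taken from the tool's error classification
    high_risk: bool = False  # e.g. the step touches a sensitive action


def backoff_delay(attempt: int, base: float = 0.5, cap: float = 8.0) -> float:
    """Capped exponential backoff in seconds: base * 2**attempt, at most cap."""
    return min(cap, base * (2 ** attempt))


def recovery_transition(
    failure: Failure, attempts: int, retry_budget: int, fallback_used: bool
) -> str:
    """Pick the next recovery transition: 'retry', 'fallback', or 'handoff'."""
    if failure.high_risk:
        return "handoff"   # escalate to human review
    if not failure.retryable:
        return "handoff"   # retrying a permanent error would only loop
    if attempts < retry_budget:
        return "retry"     # caller sleeps backoff_delay(attempts) first
    if not fallback_used:
        return "fallback"  # budget spent: try a degraded route once
    return "handoff"
```

With a retry budget of 3, a retryable failure yields `retry` for attempts 0 to 2, then `fallback` once, then `handoff`, so every path terminates. Adding random jitter to the delay is a common refinement that is left out here to keep the sketch deterministic.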
Instrument transitions for tracing
Trace each transition with a run ID and step ID. When something fails, you should be able to replay the path safely (see run tracing).
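A transition trace can be as simple as an append-only list of events keyed by run ID and step ID. The sketch below assumes a hypothetical `Trace` class and event shape; a real system would typically send these events to its existing tracing or logging backend.

```python
import json
import uuid
from dataclasses import dataclass, field


@dataclass
class Trace:
    # One run ID per agent run; generated here if the caller does not supply one.
    run_id: str = field(default_factory=lambda: uuid.uuid4().hex)
    events: list = field(default_factory=list)

    def record(self, from_state: str, to_state: str, reason: str) -> dict:
        """Append one transition event with a monotonically increasing step ID."""
        event = {
            "run_id": self.run_id,
            "step_id": len(self.events) + 1,
            "from": from_state,
            "to": to_state,
            "reason": reason,
        }
        self.events.append(event)
        return event

    def path(self) -> list:
        """Return the ordered (from, to) pairs, e.g. for replay or debugging."""
        return [(e["from"], e["to"]) for e in self.events]

    def to_jsonl(self) -> str:
        """Serialise the events as one JSON object per line."""
        return "\n".join(json.dumps(e, sort_keys=True) for e in self.events)
```

Recording the reason alongside each transition makes it possible to see not only which path a run took but why each step was chosen.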
State machines are not anti-AI. They are how you combine probabilistic reasoning with deterministic control, and that combination is what makes agents dependable in production.