Tools are where LLM systems become real: creating tickets, updating records, sending emails, or executing transactions. A tool failure is not just a "bad answer"; it is an operational incident. Reliability design makes those incidents rarer and easier to recover from.
Start with typed contracts
Every tool should have a strict contract: required inputs, units, allowed ranges, and side effects. Reject ambiguous calls instead of hoping the model self-corrects (see structured outputs).
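As an illustration of a strict contract, here is a minimal sketch of input validation for a hypothetical refund tool. The tool, its field names, the currency list, and the 500.00 limit are all assumptions made for the example, not part of any real API. The point is that required inputs, units, and allowed ranges are checked before any side effect runs, and anything ambiguous is rejected with a specific error.

```python
from dataclasses import dataclass
from decimal import Decimal, InvalidOperation

# Hypothetical limits for an illustrative "refund" tool.
ALLOWED_CURRENCIES = {"USD", "EUR"}
MAX_REFUND = Decimal("500.00")  # in major currency units, e.g. dollars
REQUIRED_FIELDS = {"order_id", "amount", "currency"}


class ContractError(ValueError):
    """Raised when a tool call does not satisfy the contract."""


@dataclass(frozen=True)
class RefundRequest:
    order_id: str
    amount: Decimal  # major currency units
    currency: str


def parse_refund_request(raw: dict) -> RefundRequest:
    """Validate a raw tool call; reject it rather than guess at intent."""
    keys = set(raw)
    if keys != REQUIRED_FIELDS:
        missing = sorted(REQUIRED_FIELDS - keys)
        unknown = sorted(keys - REQUIRED_FIELDS)
        raise ContractError(f"missing={missing} unknown={unknown}")

    order_id, amount_raw, currency = raw["order_id"], raw["amount"], raw["currency"]
    if not isinstance(order_id, str) or not order_id:
        raise ContractError("order_id must be a non-empty string")
    if not isinstance(currency, str) or currency not in ALLOWED_CURRENCIES:
        raise ContractError(f"currency must be one of {sorted(ALLOWED_CURRENCIES)}")
    # Require a decimal string so floats cannot introduce rounding ambiguity.
    if not isinstance(amount_raw, str):
        raise ContractError("amount must be a decimal string such as '12.50'")
    try:
        amount = Decimal(amount_raw)
    except InvalidOperation:
        raise ContractError("amount is not a valid decimal") from None
    # Reject NaN and Infinity before doing range comparisons.
    if not amount.is_finite() or not (Decimal("0") < amount <= MAX_REFUND):
        raise ContractError(f"amount must be in (0, {MAX_REFUND}]")

    return RefundRequest(order_id=order_id, amount=amount, currency=currency)
```

Returning a specific error message gives the agent something concrete to correct on its next attempt, instead of letting a malformed call reach the side-effecting code.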
Design for retries without duplication
Agents will retry. Networks fail. Timeouts happen. That means tools need:
- Idempotency keys. Safe deduplication so retries do not create duplicate actions.
- At-most-once semantics. Explicit rules for irreversible operations, so a retry after an ambiguous failure never repeats them.
- Deterministic responses. Return the same stable identifiers for a call and its retries so the agent can continue (see tool patterns).
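The list above can be sketched with a small in-memory example. The `TicketTool` class and its method names are hypothetical, and a real implementation would likely need a durable store with an atomic "insert if absent" operation rather than a plain dictionary; this version only shows the shape of the idea: the caller supplies an idempotency key, and a retry with the same key returns the original result instead of creating a second ticket.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TicketResult:
    ticket_id: str
    title: str


class TicketTool:
    """Illustrative ticket tool; not thread-safe and not durable."""

    def __init__(self):
        self._results_by_key = {}  # idempotency key -> TicketResult
        self.tickets_created = 0

    def create_ticket(self, idempotency_key: str, title: str) -> TicketResult:
        previous = self._results_by_key.get(idempotency_key)
        if previous is not None:
            # Same key with different arguments is a caller bug, not a retry.
            if previous.title != title:
                raise ValueError("idempotency key reused with different arguments")
            # A retry: return the stored result, perform no new side effect.
            return previous

        # First time this key is seen: perform the side effect once.
        self.tickets_created += 1
        result = TicketResult(ticket_id=f"TICKET-{self.tickets_created}", title=title)
        self._results_by_key[idempotency_key] = result
        return result


tool = TicketTool()
first = tool.create_ticket("req-123", "Printer on fire")
retry = tool.create_ticket("req-123", "Printer on fire")  # e.g. after a timeout
```

Here `first` and `retry` carry the same `ticket_id`, so an agent that timed out on the first call can retry and still continue from a known state.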
Handle side effects with staged execution
For risky actions, use patterns that reduce blast radius:
- Preview mode. The tool returns what would happen, without doing it.
- Two-step confirmation. The agent must confirm with explicit values, not implicit intent.
- Approvals. Require human approval for high-impact actions (see approvals and safe tooling).
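One possible way to combine preview mode with two-step confirmation is sketched below. The `BulkEmailTool` class and the token scheme are assumptions made for illustration: `preview` performs no side effect and returns a token derived from the exact plan, and `execute` only proceeds if the explicit values it receives hash to that same token. The token here is a plain hash, not a secret; a production system might sign it or store pending plans server-side.

```python
import hashlib
import json


class ConfirmationMismatch(Exception):
    """Raised when the confirmed values differ from the previewed plan."""


def _plan_token(plan: dict) -> str:
    # Canonical JSON so the same plan always yields the same token.
    canonical = json.dumps(plan, sort_keys=True)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:16]


class BulkEmailTool:
    """Illustrative tool; 'sending' just records the plan in memory."""

    def __init__(self):
        self.sent = []

    @staticmethod
    def _plan(recipients, subject) -> dict:
        return {"recipients": sorted(recipients), "subject": subject}

    def preview(self, recipients, subject) -> dict:
        """Describe what would happen, without doing it."""
        plan = self._plan(recipients, subject)
        return {"plan": plan, "confirm_token": _plan_token(plan)}

    def execute(self, recipients, subject, confirm_token: str) -> dict:
        """Run only if the explicit values match what was previewed."""
        plan = self._plan(recipients, subject)
        if _plan_token(plan) != confirm_token:
            raise ConfirmationMismatch("values differ from the previewed plan")
        self.sent.append(plan)
        return {"status": "sent", "count": len(plan["recipients"])}


tool = BulkEmailTool()
preview = tool.preview(["a@example.com", "b@example.com"], "Maintenance window")
result = tool.execute(
    ["a@example.com", "b@example.com"],
    "Maintenance window",
    preview["confirm_token"],
)
```

Because the agent must resend the explicit recipients and subject, a drifted or hallucinated value between the two steps causes a rejection rather than an unintended send. A human approval step could be inserted between `preview` and `execute` for high-impact actions.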
Instrument tool reliability
Track tool success rates, error rates, and timeouts as first-class signals. Combine them with an error taxonomy so incidents are triaged quickly (see error taxonomy and incident response).
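A minimal sketch of this kind of instrumentation follows. The outcome categories and the `classify` rules are an assumed, deliberately small taxonomy; a real system would probably export these counts to its metrics backend rather than keep them in memory.

```python
from collections import Counter
from enum import Enum


class Outcome(Enum):
    SUCCESS = "success"
    TIMEOUT = "timeout"
    INVALID_INPUT = "invalid_input"
    UPSTREAM_ERROR = "upstream_error"


def classify(exc: Exception) -> Outcome:
    """Map an exception to a taxonomy bucket (illustrative rules only)."""
    if isinstance(exc, TimeoutError):
        return Outcome.TIMEOUT
    if isinstance(exc, (ValueError, TypeError)):
        return Outcome.INVALID_INPUT
    return Outcome.UPSTREAM_ERROR


class ToolMetrics:
    def __init__(self):
        self.counts = Counter()  # (tool_name, Outcome) -> count

    def call(self, tool_name, fn, *args, **kwargs):
        """Invoke a tool function and record its outcome."""
        try:
            result = fn(*args, **kwargs)
        except Exception as exc:
            self.counts[(tool_name, classify(exc))] += 1
            raise  # recording must not swallow the failure
        self.counts[(tool_name, Outcome.SUCCESS)] += 1
        return result

    def error_rate(self, tool_name) -> float:
        total = sum(n for (name, _), n in self.counts.items() if name == tool_name)
        if total == 0:
            return 0.0
        failures = total - self.counts[(tool_name, Outcome.SUCCESS)]
        return failures / total


def flaky_lookup(should_time_out: bool) -> str:
    # Stand-in for a real tool; hypothetical.
    if should_time_out:
        raise TimeoutError("upstream did not respond")
    return "ok"


metrics = ToolMetrics()
for should_time_out in (False, False, False, True):
    try:
        metrics.call("lookup", flaky_lookup, should_time_out)
    except TimeoutError:
        pass
```

Keeping the outcome category alongside the tool name lets a dashboard separate "the model sent bad input" from "the upstream service is down", which usually call for different responses.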
Security is part of reliability
Tooling reliability also depends on access control, secrets management, and safe integration boundaries. Apply least privilege and isolate integrations (see integration security and secrets management).
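Least privilege can be enforced at the tool boundary with a deny-by-default scope check, sketched below. The scope names, tool names, and the `AgentCredentials` type are hypothetical; the idea is only that each tool declares the scopes it needs, and an agent without them is refused before the tool runs.

```python
from dataclasses import dataclass


class PermissionDenied(Exception):
    """Raised when an agent lacks the scopes a tool requires."""


@dataclass(frozen=True)
class AgentCredentials:
    agent_id: str
    scopes: frozenset


# Hypothetical mapping from tool name to the scopes it requires.
TOOL_REQUIRED_SCOPES = {
    "read_ticket": {"tickets:read"},
    "close_ticket": {"tickets:read", "tickets:write"},
}


def authorize(creds: AgentCredentials, tool_name: str) -> None:
    """Raise PermissionDenied unless creds cover every required scope."""
    required = TOOL_REQUIRED_SCOPES.get(tool_name)
    if required is None:
        # Deny by default: an unregistered tool is never callable.
        raise PermissionDenied(f"unknown tool: {tool_name}")
    missing = required - creds.scopes
    if missing:
        raise PermissionDenied(f"{creds.agent_id} lacks scopes: {sorted(missing)}")


support_agent = AgentCredentials("support-bot", frozenset({"tickets:read"}))
authorize(support_agent, "read_ticket")  # allowed: returns without raising
```

With these credentials, a call to `authorize(support_agent, "close_ticket")` raises `PermissionDenied`, because the agent was never granted `tickets:write`.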
Well-designed tools make agent systems safer and more predictable, not by making models smarter, but by making the surrounding system resilient.