Latency is a product feature. Even correct answers feel wrong if they arrive too late. LLM latency is also multi-factor: model inference time, prompt size, retrieval time, tool timeouts, retries and UI rendering all contribute.
Optimising latency is not about one trick. It is a pipeline of small wins, guided by good telemetry and safe change control.
Measure the right latency breakdown
Start with spans for each stage, not just total time:
- retrieve. RAG retrieval and ranking time.
- model. LLM time-to-first-token and total generation time (driven by output token count).
- tool. Tool call latencies and retries.
- render. Client-side streaming and UI time.
Tracing makes this actionable (see run tracing and observability).
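A minimal sketch of per-stage spans, assuming a simple in-process timer rather than any particular tracing library; the stage stubs are placeholders for your real retrieval, model and tool layers:

```python
import time
from contextlib import contextmanager

# Per-request timings, keyed by stage name.
timings: dict[str, float] = {}

@contextmanager
def span(stage: str):
    """Record wall-clock time for one pipeline stage."""
    start = time.perf_counter()
    try:
        yield
    finally:
        timings[stage] = time.perf_counter() - start

# Stand-ins for the real stages; in practice these are your
# retrieval, model and tool layers.
def retrieve(q): time.sleep(0.05); return ["chunk"]
def call_llm(q, docs): time.sleep(0.20); return "draft answer"
def run_tools(a): time.sleep(0.10); return a

with span("retrieve"):
    docs = retrieve("query")
with span("model"):
    answer = call_llm("query", docs)
with span("tool"):
    final = run_tools(answer)

print(timings)  # per-stage seconds, not just one opaque total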
Win 1: route by intent and value
The highest-leverage latency improvement is routing:
- Use faster, cheaper routes for simple drafting and formatting.
- Reserve slower premium routes for high-risk or high-value intents.
- Use canaries to confirm quality does not regress (see canary rollouts).
Routing should be versioned and auditable to avoid silent drift (see configuration drift).
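A sketch of what versioned, auditable routing can look like; the intents, route names and version string here are illustrative assumptions:

```python
# Hypothetical routing table, versioned so every decision is
# attributable to a specific, reviewable configuration.
ROUTING_TABLE_VERSION = "v14"

ROUTES = {
    "draft": "fast-small-model",      # simple drafting/formatting
    "format": "fast-small-model",
    "legal_review": "premium-model",  # high-risk, high-value
}
DEFAULT_ROUTE = "premium-model"       # unknown intents fail safe, not fast

def route(intent: str) -> tuple[str, str]:
    """Return (model_route, table_version): logging the version
    alongside each decision makes silent drift visible in traces."""
    return ROUTES.get(intent, DEFAULT_ROUTE), ROUTING_TABLE_VERSION
```

Defaulting unknown intents to the premium route costs latency but protects quality; the cheap route should be an explicit opt-in per intent.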
Win 2: stream early and often
Streaming does not reduce total latency, but it improves perceived latency. Patterns that help:
- Show time-to-first-token quickly with partial responses.
- Stream structured sections (summary first, details later).
- For tool-heavy agents, stream progress updates and step status.
Be careful: streaming can expose partial unsafe text. Apply output checks appropriately (see guardrail taxonomy).
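One pattern for this is to buffer to a sentence boundary and run the check before releasing each chunk; `tokens` and `is_safe` below stand in for your model stream and output guard:

```python
from typing import Callable, Iterable, Iterator

def stream_checked(tokens: Iterable[str],
                   is_safe: Callable[[str], bool]) -> Iterator[str]:
    """Buffer tokens up to a sentence boundary and run an output
    check before releasing each chunk, since partial text can be
    unsafe even when a full-response check would have caught it."""
    buffer = ""
    for token in tokens:
        buffer += token
        if buffer.endswith((". ", "? ", "! ", "\n")):
            yield buffer if is_safe(buffer) else "[removed]"
            buffer = ""
    if buffer:  # flush whatever remains at end of stream
        yield buffer if is_safe(buffer) else "[removed]"
```

This trades a little perceived latency (buffering to a boundary) for safer partial output.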
Win 3: control prompt and context size
Tokens are time. Reduce prompt size and context windows:
- Context budgeting. Cap retrieved tokens and conversation history (see context budgeting); a greedy-trimming sketch follows this list.
- Smarter retrieval. Retrieve fewer but better chunks; rank aggressively.
- Prompt hygiene. Remove redundant instructions and examples.
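A sketch of greedy token budgeting, assuming chunks arrive ranked best-first and `count_tokens` matches your model's tokenizer:

```python
def budget_context(chunks: list[str], history: list[str],
                   max_tokens: int, count_tokens) -> tuple[list[str], list[str]]:
    """Greedy budgeting: keep the best-ranked chunks first, then
    spend whatever remains on the most recent conversation turns."""
    kept, used = [], 0
    for chunk in chunks:  # assumed already ranked best-first
        cost = count_tokens(chunk)
        if used + cost > max_tokens:
            break
        kept.append(chunk)
        used += cost
    recent = []
    for turn in reversed(history):  # newest turns first
        cost = count_tokens(turn)
        if used + cost > max_tokens:
            break
        recent.append(turn)
        used += cost
    return kept, list(reversed(recent))
```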
Win 4: cache safely and cache the right things
Caching can reduce both latency and cost. For many systems, the best caches are:
- Tool lookups that repeat.
- Retrieval results for common queries.
- Intermediate structured outputs rather than full free-form answers.
Design caches with entitlement context to prevent leakage (see caching strategies).
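A sketch of an entitlement-aware cache; the TTL, key scheme and `fetch` hook are assumptions for illustration:

```python
import hashlib
import json
import time

_cache: dict[str, tuple[float, object]] = {}
TTL_SECONDS = 300  # retrieval results go stale; bound their lifetime

def cache_key(query: str, entitlements: list[str]) -> str:
    """Fold the caller's entitlements into the key so one tenant's
    cached results can never be served to a less-privileged caller."""
    payload = json.dumps({"q": query, "ent": sorted(entitlements)})
    return hashlib.sha256(payload.encode()).hexdigest()

def cached_retrieve(query: str, entitlements: list[str], fetch):
    key = cache_key(query, entitlements)
    hit = _cache.get(key)
    if hit is not None and time.time() - hit[0] < TTL_SECONDS:
        return hit[1]  # cache hit: skip retrieval entirely
    value = fetch(query)  # `fetch` stands in for the real retrieval call
    _cache[key] = (time.time(), value)
    return value
```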
Win 5: parallelise tools and reduce retries
Tool time dominates many agentic experiences. Improve it by:
- Running independent tools in parallel when safe (sketched after this list).
- Using strict tool contracts to reduce invalid calls (see tool contracts).
- Using retry policies by error class to avoid loops (see tool error handling).
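A sketch using asyncio; `invoke` stands in for your real tool transport, and the timeout and retry counts are illustrative:

```python
import asyncio

async def invoke(name: str, payload: dict) -> dict:
    # Stand-in for the real tool transport (HTTP, RPC, ...).
    await asyncio.sleep(0.1)
    return {"tool": name, "ok": True}

async def call_tool(name: str, payload: dict, retries: int = 2) -> dict:
    """One tool call with a timeout, retrying only on timeouts;
    validation errors should fail fast rather than loop."""
    for attempt in range(retries + 1):
        try:
            return await asyncio.wait_for(invoke(name, payload), timeout=5.0)
        except asyncio.TimeoutError:
            if attempt == retries:
                raise

async def run_independent_tools(calls: list[tuple[str, dict]]) -> list[dict]:
    """Run tools with no data dependencies concurrently: total time
    becomes the slowest call rather than the sum of all calls."""
    return await asyncio.gather(*(call_tool(n, p) for n, p in calls))

# asyncio.run(run_independent_tools([("search", {}), ("weather", {})]))
```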
Operationalise latency with SLOs and fallbacks
Latency work sticks when it is tied to SLOs and degradation modes:
- Define p95 targets per workflow (see SLO playbook).
- Switch to degraded modes under stress (see fallback and degradation); a mode-selection sketch follows this list.
- Alert on latency regressions after releases (see alerting and runbooks).
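A sketch of tying mode selection to a rolling p95, with illustrative per-workflow targets:

```python
import statistics

# Hypothetical per-workflow p95 targets, in seconds.
P95_TARGETS_S = {"draft": 2.0, "research": 8.0}

def p95(samples: list[float]) -> float:
    # 95th percentile taken from the 20-quantile cut points.
    return statistics.quantiles(samples, n=20)[18]

def choose_mode(workflow: str, recent_latencies: list[float]) -> str:
    """Degrade (smaller model, tighter context, optional tools off)
    when the rolling p95 breaches the workflow's target."""
    target = P95_TARGETS_S[workflow]
    if len(recent_latencies) >= 20 and p95(recent_latencies) > target:
        return "degraded"
    return "normal"
```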
Fast, predictable latency comes from careful design choices across routing, tokens, tooling and operations. If you measure the right breakdown and ship changes safely, latency becomes a controlled variable instead of a mystery.