LLM features often fail on a basic expectation: speed. Users will forgive occasional imperfections if the system is responsive, but they will abandon slow workflows even when quality is good. Latency engineering starts with measuring the full path and budgeting every hop.
Measure end-to-end latency, not model time
Most LLM requests include multiple stages: retrieval, reranking, prompt construction, model inference, tool calls, and post-processing. If you only measure the model call, you will optimise the wrong part. Start with traces that show timings per stage (see AI observability).
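As a minimal sketch of per-stage tracing, the snippet below times each hop of a request and emits the breakdown alongside the total. The stage functions (`retrieve`, `rerank`, `build_prompt`, `call_model`, `postprocess`, `log_trace`) are placeholders standing in for whatever your pipeline actually calls.

```python
import time
from contextlib import contextmanager

@contextmanager
def stage(timings: dict, name: str):
    """Record wall-clock time for one named stage into the request's trace."""
    start = time.perf_counter()
    try:
        yield
    finally:
        timings[name] = time.perf_counter() - start

def handle_request(query: str) -> str:
    timings: dict[str, float] = {}
    with stage(timings, "retrieval"):
        chunks = retrieve(query)               # placeholder retriever
    with stage(timings, "rerank"):
        chunks = rerank(query, chunks)         # placeholder reranker
    with stage(timings, "prompt_build"):
        prompt = build_prompt(query, chunks)   # placeholder template assembly
    with stage(timings, "model"):
        answer = call_model(prompt)            # placeholder model client
    with stage(timings, "post_process"):
        answer = postprocess(answer)           # placeholder post-processing
    # Emit the per-stage breakdown plus the total, so traces show where
    # the time actually went rather than only the model call.
    log_trace({"stages": timings, "total": sum(timings.values())})
    return answer
```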
Set budgets and fail-fast rules
Define a latency budget for the entire workflow (P50/P95) and enforce timeouts per stage. Decide how the system should degrade when it cannot meet the budget: fewer retrieved chunks, no reranker, or a simpler fallback response. Pair budgets with usage controls to prevent spikes from collapsing shared services (see rate limiting).
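One way to express per-stage timeouts and graceful degradation, assuming async placeholder functions (`retrieve`, `rerank`, `call_model`) and illustrative budget numbers rather than anyone's real targets:

```python
import asyncio

# Illustrative per-stage timeouts in seconds; real budgets come from your P50/P95 targets.
STAGE_TIMEOUTS = {"retrieval": 0.8, "rerank": 0.4, "model": 6.0}
FALLBACK_RESPONSE = "Sorry, this is taking longer than expected. Please try again."

async def answer_with_budget(query: str) -> str:
    # Degradation 1: if retrieval blows its slice of the budget, answer without context.
    try:
        chunks = await asyncio.wait_for(retrieve(query), STAGE_TIMEOUTS["retrieval"])
    except asyncio.TimeoutError:
        chunks = []

    # Degradation 2: if the reranker is slow, keep the unreranked order instead of waiting.
    if chunks:
        try:
            chunks = await asyncio.wait_for(rerank(query, chunks), STAGE_TIMEOUTS["rerank"])
        except asyncio.TimeoutError:
            pass

    # Fail fast: if the model call cannot finish inside its budget, return a simpler fallback.
    try:
        return await asyncio.wait_for(call_model(query, chunks), STAGE_TIMEOUTS["model"])
    except asyncio.TimeoutError:
        return FALLBACK_RESPONSE
```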
Reduce work before you optimise
Many latency problems are self-inflicted. The biggest driver is context size. Use deliberate context engineering: smaller templates, bounded retrieval, and explicit policies (see context engineering). Also consider gating retrieval and tools behind a lightweight intent classifier so you only pay the full cost when needed.
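A gate can be as simple as a heuristic, later replaced by a small classifier. The sketch below is one assumed shape of that idea; `retrieve`, `build_prompt`, and `call_model` are placeholders.

```python
SMALLTALK = {"hi", "hello", "thanks", "thank you", "ok"}

def needs_retrieval(query: str) -> bool:
    """Cheap heuristic gate; a lightweight intent classifier could replace it."""
    q = query.strip().lower()
    return q not in SMALLTALK and len(q.split()) > 3

def handle(query: str) -> str:
    # Only pay for retrieval (and its latency) when the query actually needs grounding.
    chunks = retrieve(query) if needs_retrieval(query) else []
    return call_model(build_prompt(query, chunks))
```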
Cache safely
Caching can dramatically improve perceived performance, but it can also introduce silent security failures. Use caching patterns that include permission context and version identifiers in the cache key (see caching strategies). Cache intermediate artefacts (retrieved chunks, tool results) as well as final answers, and always revalidate policy constraints.
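A sketch of a cache key that bakes in permission context and version identifiers, so a hit can never serve content the caller is not entitled to or answers built from a stale index. The field names and parameters are illustrative, not a prescribed schema.

```python
import hashlib
import json

def cache_key(query: str, user_scopes: list[str],
              index_version: str, prompt_version: str) -> str:
    """Derive a cache key that changes whenever permissions, the index,
    or the prompt template change, not just when the query changes."""
    payload = {
        "q": query,
        "scopes": sorted(user_scopes),   # permission context
        "index": index_version,          # which document index produced the chunks
        "prompt": prompt_version,        # which prompt template was used
    }
    return hashlib.sha256(json.dumps(payload, sort_keys=True).encode()).hexdigest()
```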
Use routing and fallbacks
Latency is not only an engineering issue; it is also a provider reality. Build routing that can switch to alternative models or providers during incidents, and record the decision for auditability (see routing and failover).
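A minimal failover loop might look like the following, with `call_provider` and `audit_log` as placeholder clients and `ProviderError` standing in for whatever exceptions your provider SDK raises:

```python
import time

class ProviderError(Exception):
    """Placeholder for provider-specific failures (rate limits, 5xx, overload)."""

PROVIDERS = ["primary-model", "backup-model"]  # illustrative provider/model identifiers

def generate_with_failover(prompt: str) -> str:
    for provider in PROVIDERS:
        start = time.perf_counter()
        try:
            answer = call_provider(provider, prompt)   # placeholder provider client
            audit_log({"provider": provider, "outcome": "ok",
                       "latency_s": time.perf_counter() - start})
            return answer
        except (TimeoutError, ProviderError) as exc:
            # Record why we moved on, so the routing decision is auditable later.
            audit_log({"provider": provider, "outcome": "failed", "error": str(exc),
                       "latency_s": time.perf_counter() - start})
    raise RuntimeError("all providers failed or exceeded the latency budget")
```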
Design for user-perceived speed
Streaming is a product feature, not a backend trick. Provide partial progress, show citations as they become available, and use UX patterns that reduce uncertainty (see UX patterns). A well-designed streaming flow can make a 10-second response feel faster than a silent 6-second wait.
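One assumed shape for a streaming handler: emit a status event immediately, citations as soon as retrieval finishes, then tokens as they arrive. `retrieve` and `call_model_stream` are placeholders for your own retriever and streaming model client.

```python
from typing import Iterator

def stream_answer(query: str) -> Iterator[dict]:
    """Yield UI events as soon as they are available instead of waiting for the full answer."""
    yield {"type": "status", "text": "Searching your documents…"}
    chunks = retrieve(query)                          # placeholder retriever
    # Surface citations before any tokens arrive, so the user sees progress early.
    yield {"type": "citations", "items": [c["source"] for c in chunks]}
    for token in call_model_stream(query, chunks):    # placeholder streaming client
        yield {"type": "token", "text": token}
    yield {"type": "done"}
```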
Fast systems are built through discipline: budgets, measurement, and deliberate degradations. Treat latency as a first-class SLO, not an afterthought.