LLM costs are deceptively easy to incur and painfully hard to control later. A small pilot can run on a single model and a handful of prompts; production introduces long contexts, retrieval, tool calls, retries, and multiple environments, each of which adds tokens and latency. If you cannot attribute spend to a feature, tenant, team, and model, you cannot manage it.
FinOps for LLMs begins with a simple goal: make cost a first-class product metric, alongside quality and reliability. Your cost model should be explainable to non-technical stakeholders and actionable for engineers.
Start with unit economics
Define a “unit of value” for each use case: a resolved ticket, a completed checkout, a generated report, a saved minute of handling time. Then instrument your system to measure cost per unit, not just “tokens per day”. Pair that with leading indicators such as adoption and containment—see Measuring AI Value—so you can judge whether spend is buying outcomes.
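For concreteness, here is a minimal sketch of that instrumentation in Python. The model names, token prices, and field names are illustrative assumptions, not real rates or a prescribed schema.

```python
# Minimal cost-per-unit sketch. Prices, model names, and field names are
# illustrative assumptions; substitute your provider's rates and your own schema.
from collections import defaultdict

PRICE_PER_1K_TOKENS = {
    "small-model": {"in": 0.0005, "out": 0.0015},   # hypothetical $ per 1K tokens
    "large-model": {"in": 0.0100, "out": 0.0300},
}

def request_cost(model: str, tokens_in: int, tokens_out: int) -> float:
    price = PRICE_PER_1K_TOKENS[model]
    return tokens_in / 1000 * price["in"] + tokens_out / 1000 * price["out"]

def cost_per_unit(requests: list[dict], units_delivered: dict[str, int]) -> dict[str, float]:
    """Cost per unit of value (resolved ticket, completed checkout, ...) by use case."""
    spend: dict[str, float] = defaultdict(float)
    for r in requests:  # each record carries use_case, model, and token counts
        spend[r["use_case"]] += request_cost(r["model"], r["tokens_in"], r["tokens_out"])
    return {uc: total / max(units_delivered.get(uc, 1), 1) for uc, total in spend.items()}
```

The point is the denominator: reporting spend per resolved ticket rather than raw token totals is what makes the number legible to product owners.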
Remember that LLM spend is not only inference. Retrieval, embeddings, storage, eval runs and observability can become meaningful line items. Track them separately so optimisation doesn’t just shift cost from one bucket to another.
At the request level, capture: model/provider, prompt template version, context size, retrieval hits, tool invocations, latency, and user/tenant identifiers. Treat these as required audit fields for every call. Without them, you cannot do showback, chargeback, or anomaly detection. Consistent attribution also reduces internal friction when costs are later allocated to teams and tenants.
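One way to make those fields unavoidable is to type them into the logging path itself. The record below is a sketch assuming a Python service; the field names are illustrative, not a standard.

```python
# Sketch of a per-request audit record; field names are illustrative, not a standard.
from dataclasses import dataclass, field
from datetime import datetime, timezone

@dataclass(frozen=True)
class LLMCallRecord:
    request_id: str
    tenant_id: str                 # who the spend is attributed to
    user_id: str
    feature: str                   # product surface that triggered the call
    model: str                     # e.g. "provider/model-name"
    prompt_template_version: str
    tokens_in: int
    tokens_out: int
    context_tokens: int            # size of retrieved/injected context
    retrieval_hits: int
    tool_invocations: int
    latency_ms: float
    timestamp: datetime = field(default_factory=lambda: datetime.now(timezone.utc))
```

Emitting one record like this per call is enough to power the showback, chargeback, and anomaly detection discussed below.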
Build runtime guardrails
Once you have visibility, add controls that shape behaviour before invoices arrive:
- Budgets and quotas. Set budgets per feature and per tenant; enforce quotas on high-cost endpoints and cap retries. When budgets burn quickly, freeze experiments and route to lower-cost paths.
- Context discipline. Apply strict limits to retrieved context, summarise aggressively, and cache stable snippets. Many “token explosions” are self-inflicted by unbounded retrieval.
- Model routing. Use cheaper models for low-risk intents and reserve premium models for hard problems. Combine with SLOs (see Running AI Systems with SLOs) so cost and quality trade-offs are explicit; a routing sketch follows this list.
- Tool-call budgets. Limit the number of tool calls per request and per session. Tool loops are a common hidden cost driver in agentic flows; the sketch after this list includes a simple cap.
- Caching and batching. Cache embeddings and retrieval results, batch background jobs, and avoid re-embedding unchanged documents. These are high-leverage optimisations.
- Cost regression tests. Add budgeted synthetic runs to CI so prompt changes don’t quietly double token usage before release; a minimal test sketch follows this list.
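The routing and tool-budget sketch referenced above could look like the following. The intent labels, thresholds, model names, and limits are assumptions to be replaced by your own policy.

```python
# Hedged sketch of intent-based routing with a per-request tool-call cap.
# Model names, intent labels, thresholds, and limits are illustrative assumptions.
CHEAP_MODEL = "small-model"
PREMIUM_MODEL = "large-model"
LOW_RISK_INTENTS = {"smalltalk", "faq", "status_lookup"}
MAX_TOOL_CALLS_PER_REQUEST = 5

def route(intent: str, risk_score: float, premium_budget_remaining: float) -> str:
    """Send low-risk work to the cheap model; escalate only when risk is high
    and the premium budget for this feature has not been exhausted."""
    if intent in LOW_RISK_INTENTS or risk_score < 0.5:
        return CHEAP_MODEL
    if premium_budget_remaining <= 0.0:
        return CHEAP_MODEL  # degrade gracefully rather than overspend
    return PREMIUM_MODEL

def allow_tool_call(calls_so_far: int) -> bool:
    """Cap tool loops so agentic flows cannot silently multiply spend."""
    return calls_so_far < MAX_TOOL_CALLS_PER_REQUEST
```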
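And a minimal cost regression test, as mentioned in the last item, might look like this under pytest. The prompt builder and token counter are trivial stand-ins for your real template renderer and model tokenizer.

```python
# Sketch of a CI cost regression test. The prompt builder and token counter are
# trivial stand-ins; swap in your real template renderer and model tokenizer.
TOKEN_BUDGET_PER_REQUEST = 4_000   # budget agreed for this endpoint

SYNTHETIC_CASES = [
    {"question": "Where is my order?", "context": "order 123 shipped on Monday"},
    {"question": "Cancel my subscription", "context": "plan: pro, renews monthly"},
]

def render_prompt(case: dict) -> str:
    # Stand-in for the version-pinned prompt template used in production.
    return (
        "Answer the question using only the context.\n\n"
        f"Context:\n{case['context']}\n\nQuestion: {case['question']}"
    )

def approx_tokens(text: str) -> int:
    # Crude whitespace proxy; use the model's tokenizer in real tests.
    return len(text.split())

def test_prompt_stays_within_token_budget():
    for case in SYNTHETIC_CASES:
        tokens = approx_tokens(render_prompt(case))
        assert tokens <= TOKEN_BUDGET_PER_REQUEST, (
            f"{case['question']!r} renders to {tokens} tokens, "
            f"over the {TOKEN_BUDGET_PER_REQUEST} budget"
        )
```

Run it in the same pipeline as your prompt changes so a template edit that doubles context size fails the build rather than the monthly budget.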
Showback, chargeback, forecasting
Chargeback is not about billing colleagues; it is about creating incentives. Start with showback dashboards that reveal cost per unit, per team and per tenant. Then add levers teams can pull: smaller context windows, fewer tools, different routing policies, or a cheaper provider. Link this to your broader run-cost program—see Managing and Optimising AI Run Costs.
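A first showback view can be produced straight from the audit records described earlier. The sketch below groups spend by team and tenant and assumes each record already carries a precomputed cost; the field names are illustrative.

```python
# Minimal showback aggregation; field names are illustrative assumptions.
from collections import defaultdict

def showback(records: list[dict]) -> dict[tuple[str, str], float]:
    """Total spend grouped by (team, tenant), ready to chart or export."""
    totals: dict[tuple[str, str], float] = defaultdict(float)
    for r in records:
        totals[(r["team"], r["tenant_id"])] += r["cost_usd"]
    return dict(totals)

# Example:
# showback([{"team": "support", "tenant_id": "acme", "cost_usd": 0.25},
#           {"team": "support", "tenant_id": "acme", "cost_usd": 0.50}])
# -> {("support", "acme"): 0.75}
```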
As usage grows, forecasting matters. Create scenarios (base/expected/aggressive adoption), map them to token volumes, and set budgets that match product milestones. Use anomaly detection for sudden spikes (new prompt template, runaway tool loop, misconfigured retriever) and keep a “cost incident” runbook with clear owners and rollback steps.
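As an illustration of the scenario approach, the toy model below maps adoption scenarios to token volumes and monthly spend, and adds a crude spike detector. Every number in it is a made-up assumption, not a benchmark.

```python
# Toy forecast: adoption scenarios -> token volume -> monthly spend.
# All figures are illustrative assumptions, not benchmarks.
BLENDED_COST_PER_1M_TOKENS = 3.00   # assumed blended $/1M tokens across the routing mix
TOKENS_PER_UNIT = 6_000             # assumed average tokens per resolved ticket

SCENARIOS = {"base": 20_000, "expected": 50_000, "aggressive": 120_000}  # units/month

def monthly_spend(units: int) -> float:
    return units * TOKENS_PER_UNIT / 1_000_000 * BLENDED_COST_PER_1M_TOKENS

forecast = {name: round(monthly_spend(u), 2) for name, u in SCENARIOS.items()}
# -> {'base': 360.0, 'expected': 900.0, 'aggressive': 2160.0}

def is_cost_anomaly(todays_spend: float, trailing_7d_avg: float, factor: float = 2.0) -> bool:
    """Crude spike detector: page the cost-incident owner when spend jumps past factor x baseline."""
    return trailing_7d_avg > 0 and todays_spend > factor * trailing_7d_avg
```

Even a model this crude forces the useful conversation: which scenario are we budgeting for, and who gets paged when actuals diverge from it.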
Cost controls also belong in product design. Make expensive actions explicit, use feature flags for premium routes, and set expectations with users when the system throttles or falls back. A transparent “we’re in low-cost mode” message is often better than hidden quality drops because it preserves trust while protecting budgets.
Finally, align procurement to your routing strategy. If you can route work across models and providers, you can negotiate for committed use, regional guarantees, and clear data retention terms without locking the product into a single vendor. Predictable unit economics is what turns experimentation into a sustainable capability.