LLM capacity is not like traditional infrastructure capacity. The unit cost depends on context size, retrieval behaviour, tool calls, retries, and model choice. Forecasting is still possible, but you need scenario thinking and explicit assumptions.
Start with unit economics per workflow
Forecasting begins with unit economics: cost and latency per task for each major workflow. Capture baseline token usage, retrieval hits, tool calls, and retry rates (see LLM FinOps and usage analytics).
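As a minimal sketch, unit economics can be captured in a small model like the Python below. The per-token prices, tool-call cost, and workflow numbers are placeholder assumptions, not real provider rates; substitute your measured baselines.

```python
from dataclasses import dataclass

# Placeholder per-1K-token prices; substitute your provider's actual rates.
PRICE_PER_1K_INPUT = 0.0025
PRICE_PER_1K_OUTPUT = 0.0100
TOOL_CALL_COST = 0.0005  # assumed average cost of one tool invocation

@dataclass
class WorkflowProfile:
    """Measured baseline for one workflow, taken from usage analytics."""
    name: str
    input_tokens: int    # avg prompt + retrieved context per task
    output_tokens: int   # avg completion per task
    tool_calls: float    # avg tool invocations per task
    retry_rate: float    # fraction of tasks that are retried

def cost_per_task(p: WorkflowProfile) -> float:
    """Expected dollar cost of one task, inflated by the retry rate."""
    base = ((p.input_tokens / 1000) * PRICE_PER_1K_INPUT
            + (p.output_tokens / 1000) * PRICE_PER_1K_OUTPUT
            + p.tool_calls * TOOL_CALL_COST)
    return base * (1 + p.retry_rate)

summarise = WorkflowProfile("summarise_ticket", 6_000, 400, 1.2, 0.05)
print(f"{summarise.name}: ${cost_per_task(summarise):.4f} per task")
```

Keeping retries and tool calls explicit matters: they are exactly the inputs that drift as usage grows, and a per-task model makes that drift visible immediately.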
Build adoption scenarios
Use three scenarios (conservative, expected, aggressive) and model each in terms of:
- Active users by tenant/team.
- Requests per user per day (by workflow).
- Average context size and retrieval configuration.
- Tool-call rate and expected retries.
This produces a forecast that explains why costs move, not just the total.
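A minimal sketch of such a scenario model, assuming a per-task cost already derived from the unit-economics step above; the adoption numbers are illustrative:

```python
# Each scenario varies adoption; per-task cost comes from unit economics.
COST_PER_TASK = 0.02  # assumed dollars per task, measured upstream

SCENARIOS = {
    "conservative": {"active_users": 200,   "requests_per_user_day": 5},
    "expected":     {"active_users": 500,   "requests_per_user_day": 8},
    "aggressive":   {"active_users": 1_500, "requests_per_user_day": 12},
}

def monthly_cost(active_users: int, requests_per_user_day: int, days: int = 30) -> float:
    """Total monthly spend implied by one adoption scenario."""
    return active_users * requests_per_user_day * days * COST_PER_TASK

for name, scenario in SCENARIOS.items():
    print(f"{name:>12}: ${monthly_cost(**scenario):,.0f}/month")
```

Context size, retrieval configuration, and retries all feed into the per-task cost, so a change to any one of them reprices every scenario at once.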
Include caching and routing assumptions
Capacity planning changes materially with caching and routing (the sketch after this list shows how they combine):
- Caching. Cache embeddings and safe artefacts to reduce repeat work (see caching strategies).
- Routing. Route low-risk intents to cheaper models; reserve premium models for high-value tasks (see routing and failover).
- Latency budgets. Check that the plan still meets user-perceived latency targets (see latency engineering).
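As a sensitivity sketch, expected per-task cost can be written as a function of cache hit rate and routing mix. The two model prices here are placeholder assumptions:

```python
# Sensitivity sketch: how cache hit rate and routing mix move per-task cost.
# Model prices are placeholders; plug in your measured unit economics.
PREMIUM_COST_PER_TASK = 0.020  # assumed cost on the premium model
CHEAP_COST_PER_TASK = 0.004    # assumed cost on the cheaper model

def adjusted_cost_per_task(cache_hit_rate: float, cheap_route_share: float) -> float:
    """Expected per-task cost after caching and routing.

    cache_hit_rate: fraction of tasks served from cache (treated as free here)
    cheap_route_share: fraction of cache misses routed to the cheaper model
    """
    miss_rate = 1.0 - cache_hit_rate
    blended = (cheap_route_share * CHEAP_COST_PER_TASK
               + (1.0 - cheap_route_share) * PREMIUM_COST_PER_TASK)
    return miss_rate * blended

for hit, share in [(0.0, 0.0), (0.3, 0.0), (0.3, 0.5), (0.5, 0.7)]:
    print(f"hit={hit:.0%} cheap_share={share:.0%} -> "
          f"${adjusted_cost_per_task(hit, share):.4f}")
```

Because the two assumptions compound, they belong in the forecast itself rather than being treated as after-the-fact savings.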
Translate forecasts into operational controls
Forecasts are only useful if they create levers (a budget-and-quota sketch follows the list):
- Budgets. Per product/tenant/workflow, with alerting on trend breaks.
- Quotas and rate limits. Prevent one workload from starving others (see quotas).
- Feature flags. Enable expensive features gradually (see feature flags).
- Cost incident runbooks. A fast response plan when spend spikes (see cost anomaly detection).
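A minimal illustration of budgets and quotas as code, assuming hypothetical tenant budgets and a simple in-process counter:

```python
from collections import defaultdict

# Hypothetical per-tenant daily budgets; real values come from the forecast.
DAILY_BUDGET_USD = {"tenant-a": 50.0, "tenant-b": 120.0}
ALERT_THRESHOLD = 0.8  # alert when a tenant crosses 80% of budget
spend_today: defaultdict[str, float] = defaultdict(float)

def record_spend(tenant: str, cost_usd: float) -> bool:
    """Track spend per tenant; return False once the daily budget is exhausted."""
    spend_today[tenant] += cost_usd
    budget = DAILY_BUDGET_USD.get(tenant, 0.0)
    if spend_today[tenant] >= budget:
        return False  # hard stop: quota exhausted for today
    if spend_today[tenant] >= ALERT_THRESHOLD * budget:
        print(f"ALERT: {tenant} at {spend_today[tenant] / budget:.0%} of daily budget")
    return True

if not record_spend("tenant-a", 0.02):
    print("tenant-a over budget: shed load or route to a cheaper model")
```

A production version would persist the counters, separate soft alerts from hard stops, and wire the alert into the cost incident runbook rather than printing.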
Negotiate provider commitments with flexibility
Once you can forecast, you can negotiate. Aim for commitments that preserve portability: region guarantees, clear retention terms, and the ability to shift workloads between models or providers (see vendor exit strategy).
Capacity forecasting is not about perfect prediction. It is about making adoption growth controllable and sustainable.