The Cost Curve Is Not Random
Six months after an LLM-backed feature ships, one finance team has a calm conversation about usage and another is asking engineering to justify a number that has grown 6x. The difference is rarely about the model choice. It is about whether the team treated cost as a design parameter from the start, or as a problem to discover at the bill.
The bill is driven by four things. Input tokens per call. Output tokens per call. Calls per user. Users. In every cost-reduction engagement we have run, 80% of the spend concentrates in 20% of the calls. The work is identifying that 20% and applying the right tool to it — not switching the whole system to a smaller model and hoping.
Start With the Distribution, Not the Model
Before picking a model, look at the distribution of your actual calls. Instrument your LLM pipeline to log token counts per request, per feature, per user cohort. Group by expected cost. You will almost always find:
A long tail of cheap calls — short prompts, short completions, running across the whole user base. These are noise. Optimizing them yields pennies.
A middle band of medium calls — RAG-style retrievals with a few thousand input tokens and a few hundred output tokens. These are where most teams focus because they are easy to see.
A fat head of expensive calls — long context, long generation, often triggered by specific workflows or power users. This is where the money is.
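The analysis above can be sketched in a few lines. This is a minimal illustration, assuming your logs yield `(feature, input_tokens, output_tokens)` records; the prices are illustrative placeholders, not any provider's real rates.

```python
from collections import defaultdict

# Hypothetical per-call log records: (feature, input_tokens, output_tokens).
PRICE_IN = 3.00 / 1_000_000    # dollars per input token (placeholder rate)
PRICE_OUT = 15.00 / 1_000_000  # dollars per output token (placeholder rate)

def call_cost(record):
    _feature, tokens_in, tokens_out = record
    return tokens_in * PRICE_IN + tokens_out * PRICE_OUT

def head_share(records, head_fraction=0.2):
    """Fraction of total spend sitting in the most expensive calls."""
    costs = sorted((call_cost(r) for r in records), reverse=True)
    total = sum(costs)
    head = costs[: max(1, int(len(costs) * head_fraction))]
    return sum(head) / total if total else 0.0

def spend_by_feature(records):
    """Total spend per feature label, most expensive first."""
    totals = defaultdict(float)
    for record in records:
        totals[record[0]] += call_cost(record)
    return dict(sorted(totals.items(), key=lambda kv: -kv[1]))
```

If `head_share` comes back high, which it usually does, the feature at the top of `spend_by_feature` is where the optimization effort should go first.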
A cost-reduction effort that starts with "let's switch from GPT-4 to a cheaper model" distributes effort evenly across that distribution. An effort that starts with the head attacks the problem where it lives.
Prompt Caching Is the Highest-Leverage Move
Prompt caching — paying a reduced rate for tokens you have already sent — is the single biggest cost lever on stable workloads, and most teams under-use it. The mental model is simple: if the same system prompt, schema, or RAG context appears across many calls, cache the prefix.
Three places we use caching aggressively:
System prompts with detailed instructions — they can run 1,000-3,000 tokens and are identical across every call.
Long tool descriptions in agent runtimes — they often exceed the message content and change rarely.
Large document context in RAG assistants — one user turns into many follow-up questions on the same document, and caching the document once makes follow-ups nearly free.
The discipline is writing prompts so the cacheable part comes first and the variable part comes last. That is not how most codebases are structured by default. It takes one refactor pass to fix, and the bill drops by 30-60% on workloads that fit the pattern.
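The refactor itself is mostly about ordering. A minimal sketch of the discipline, with hypothetical segment names; the caching API itself is provider-specific, but most prompt caches match on an exact token prefix, so the ordering rule holds regardless:

```python
def build_prompt(system_prompt, tool_descriptions, document_context, user_message):
    # Caches match on an exact token prefix: any variable content placed
    # before stable content invalidates the cache for everything after it.
    stable = [system_prompt, tool_descriptions, document_context]  # changes rarely
    variable = [user_message]                                      # changes every call
    return "\n\n".join(part for part in stable + variable if part)
```

The point is the two buckets, not the join: anything that changes per call goes after everything that does not.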
Routing Beats Model-Switching
"Use a smaller model" usually means switching the default model and accepting the quality drop across the board. Routing means picking the model per-call based on what the call is doing. A classifier that runs on the user's input decides which model to send it to.
Routing logic we have shipped in production:
Intent-based — classify the request, send simple lookups to a cheap model, send complex reasoning to a frontier model. The classifier itself runs on a small model and costs almost nothing.
Confidence-based — run the cheap model first, check its structured confidence signal, escalate to the expensive model only when confidence is low.
User-tier based — free users get the small model with reasonable quality; paid users get the frontier model. This is the easiest routing to justify and the one most teams forget to implement.
Routing requires a small model you trust for the cheap path. Most frontier labs now ship a small model that is genuinely good at simple tasks. The gap between "simple task done well" and "complex task done badly" is larger than the gap between "simple task done by small model" and "simple task done by frontier model." Routing exploits that.
Context Discipline
Every token in your context window is a token you pay for. RAG systems are the worst offenders because the instinct is to retrieve more to be safe. More context is not more correct. It is more expensive and often less correct because the signal gets diluted.
The rules we enforce on RAG context:
Retrieve fewer, higher-quality chunks — five ranked chunks beats ten raw chunks.
Summarize stale chunks from earlier in the conversation.
Cap total retrieved context at a fraction of the window.
Log cases where no retrieval meets the relevance floor — those are the cases where you should return "I don't know" rather than stuffing marginal chunks in hoping one is right.
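The chunk-count cap and the relevance floor fit in one small function. A sketch, assuming the retriever hands back `(score, text)` pairs already sorted by relevance; both thresholds are illustrative starting points, not recommendations:

```python
def select_context(ranked_chunks, max_chunks=5, relevance_floor=0.6):
    """Few high-quality chunks, with a hard relevance floor."""
    kept = [(s, t) for s, t in ranked_chunks if s >= relevance_floor][:max_chunks]
    if not kept:
        # Nothing clears the floor: the caller should answer "I don't know"
        # instead of stuffing marginal chunks in and hoping one is right.
        return None
    return [text for _score, text in kept]
```

Returning `None` rather than an empty-but-plausible context is the important design choice: it forces the calling code to handle the "no good retrieval" case explicitly instead of silently padding the prompt.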
Agent runtimes have the same problem with tool descriptions and scratchpad accumulation. Every turn adds content. Without a summarization strategy, a 20-turn conversation blows past any context budget you set. We summarize at fixed intervals — every 5 turns — compressing older content into a paragraph the model writes itself. It is lossy. It is also what keeps long conversations affordable.
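The fixed-interval compression is simple to express. A sketch, where `summarize` is a stand-in for a cheap-model call that compresses a list of turns into one paragraph; the interval and message shape are assumptions:

```python
SUMMARIZE_EVERY = 5  # fixed interval from the text; tune per workload

def maybe_compress(history, summarize):
    """Replace older turns with a model-written summary once the
    conversation is long enough. Lossy by design."""
    if len(history) < 2 * SUMMARIZE_EVERY:
        return history  # still short enough to keep verbatim
    older, recent = history[:-SUMMARIZE_EVERY], history[-SUMMARIZE_EVERY:]
    summary_turn = {"role": "system", "content": summarize(older)}
    return [summary_turn] + recent
```

Run on every turn, this keeps the context at a bounded size: one summary paragraph plus the last few turns, no matter how long the conversation runs.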
Streaming, Stopping, and Output Length
Output tokens cost more than input tokens on most pricing sheets. Shorter outputs are cheaper outputs. That means two knobs:
Explicit max_tokens on every call. Most APIs let a generation run to the model's natural stopping point, which is often much longer than needed. Capping at a realistic maximum — not a safety number, an actual target — cuts output cost on every long-running call.
Structured output with schemas. If you need three fields, ask for three fields, not prose. JSON mode or structured output constraints keep generation focused and cut token count by 40-70% on output-heavy workloads.
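Both knobs can live in one request builder. A sketch with hypothetical parameter names (`max_tokens`, `response_schema`); the real field names depend on the provider's API, but the shape of the constraint is the same:

```python
def build_request(prompt, fields, max_tokens=300):
    """Cap output length and constrain shape instead of asking for prose."""
    schema = {
        "type": "object",
        "properties": {name: {"type": "string"} for name in fields},
        "required": list(fields),
    }
    return {
        "prompt": prompt,
        "max_tokens": max_tokens,    # a realistic target, not a safety number
        "response_schema": schema,   # JSON-mode / structured-output constraint
    }
```

Asking for exactly three named fields means the model cannot pad the answer with preamble, restatement, or hedging prose, which is where most of the excess output tokens go.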
Streaming does not change what you pay, but it changes how generation feels. It makes long outputs feel shorter, which lets you get away with slightly longer generations where users would otherwise abandon.
Budget Gates Are How You Sleep
The last layer is not an optimization; it is a safeguard. Every LLM workload that runs unattended should have a per-workflow budget. If the budget is exceeded, the workflow halts and alerts. We gate three places: per-agent-run (caps runaway loops), per-user-per-day (caps abusive or broken clients), and per-feature-per-hour (catches cases where a deploy accidentally multiplied usage).
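All three gates are the same mechanism with a different key and window. A minimal in-memory sketch; a production gate keeps totals in shared storage so every worker sees the same numbers, and the limits here are placeholders:

```python
import time
from collections import defaultdict

class BudgetGate:
    """Halt-and-alert spend gate for one scope: per agent run,
    per user per day, or per feature per hour."""

    def __init__(self, limit_usd, window_seconds):
        self.limit = limit_usd
        self.window = window_seconds
        self.spend = defaultdict(float)
        self.window_start = defaultdict(float)

    def charge(self, key, cost_usd, now=None):
        now = time.time() if now is None else now
        if now - self.window_start[key] >= self.window:
            self.window_start[key] = now   # new window: reset the meter
            self.spend[key] = 0.0
        self.spend[key] += cost_usd
        if self.spend[key] > self.limit:
            # Halt the workflow and alert; never keep spending quietly.
            raise RuntimeError(f"budget exceeded for {key}")
```

One instance per scope: `BudgetGate(5.00, 600)` keyed by run ID caps runaway agent loops, `BudgetGate(2.00, 86_400)` keyed by user ID caps broken clients, and `BudgetGate(50.00, 3_600)` keyed by feature name catches the bad deploy.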
Budget gates are what let engineering sleep after a deploy. Nothing in LLM systems is as scary as a quiet cost amplification that nobody catches for a week. Gates catch it in minutes.
Where to Start
If you are staring at an LLM bill that is surprising you, do three things this week. Log token counts per call with a feature label so you can see the distribution. Apply prompt caching to your longest stable prefixes. Add a budget gate on the noisiest workflow.
That is usually enough to drop the bill 40%. The rest is routing and context discipline, which are bigger projects. If you want a prioritized cost roadmap scoped to your actual usage, we scope that in an audit and return the shortlist sorted by dollars saved, not by what is technically interesting. The two lists are rarely the same.