Retrieval-augmented generation is the dominant enterprise LLM pattern in 2026, and the unit economics are still poorly understood by many of the teams deploying it. A demo against a 10,000-document corpus costs next to nothing. A production system against a 2,000,000-document corpus, answering 40,000 queries a day, is a non-trivial line item. This article is the breakdown we use when sizing engagements.
The five cost components
- Embedding: one-time for backfill, continuous for new documents
- Vector storage: continuous, grows with corpus
- Retrieval: per query, grows with corpus size and with how much retrieval quality you want (hybrid search, reranking)
- Generation: per query, dominated by input tokens from retrieved context
- Observability and evaluation: ongoing, often under-budgeted
Embedding
The first-order cost is obvious: cost-per-token × tokens-in-corpus. For a 2M-document corpus at an average 1,200 tokens per document, that is 2.4 billion tokens (2,400 MTok). At $0.02/MTok (current floor for a good embedding model in 2026), the backfill is about $48. Cheap.
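The backfill arithmetic above can be written as a small helper. This is a minimal sketch: the function name and parameters are illustrative, and the inputs are the figures quoted in this section.

```python
def embedding_backfill_cost(num_docs: int, avg_tokens_per_doc: int, usd_per_mtok: float) -> float:
    """One-time embedding cost in USD for a full corpus backfill."""
    total_tokens = num_docs * avg_tokens_per_doc
    # Prices are quoted per million tokens (MTok).
    return total_tokens / 1_000_000 * usd_per_mtok


# Figures from this section: 2M documents, 1,200 tokens each, $0.02/MTok.
backfill_usd = embedding_backfill_cost(2_000_000, 1_200, 0.02)
print(f"Backfill: ${backfill_usd:,.2f}")
```

The same function prices a re-chunking pass, since a strategy change re-embeds the whole corpus.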
The second-order cost is ingestion. New documents arrive. Chunking strategy matters; re-chunking the corpus on a strategy change is another $48 round-trip every time. Keep chunking decisions reversible by storing the raw document and re-chunking on demand.
Vector storage
Under-appreciated. A 2M-document corpus at 6 chunks/document and 1024-dim float32 vectors (4 bytes per dimension) is about 12M vectors × 4 KB = 48 GB of raw vector data, before replication and indexing overhead (typically 2–3×). In a managed vector DB (Pinecone, Qdrant Cloud, Weaviate Cloud), that is $200–500/month baseline. Self-hosted pgvector on good hardware costs significantly less in infrastructure, but you take on the operational cost.
The decisions that move this: chunk size (smaller chunks = more vectors), embedding dimensionality (higher = more storage), replication for HA, and regional footprint.
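To see how chunk count and dimensionality move the storage figure, a rough sizing helper might look like the following. It assumes float32 vectors (4 bytes per dimension), which is what the 4 KB per 1024-dim vector implies. The function name is illustrative. With exact 4,096-byte vectors the raw figure comes out near 49 GB in decimal units, which this section rounds to 48 GB.

```python
def vector_storage_gb(
    num_docs: int,
    chunks_per_doc: int,
    dims: int,
    bytes_per_dim: int = 4,   # assumption: float32 embeddings
    overhead: float = 1.0,    # replication + index multiplier (2-3x per this section)
) -> float:
    """Approximate vector storage in decimal gigabytes."""
    num_vectors = num_docs * chunks_per_doc
    raw_bytes = num_vectors * dims * bytes_per_dim
    return raw_bytes * overhead / 1e9


raw_gb = vector_storage_gb(2_000_000, 6, 1024)
low_gb = vector_storage_gb(2_000_000, 6, 1024, overhead=2.0)
high_gb = vector_storage_gb(2_000_000, 6, 1024, overhead=3.0)
print(f"raw: {raw_gb:.1f} GB, with overhead: {low_gb:.1f}-{high_gb:.1f} GB")
```

Halving `dims` or `chunks_per_doc` halves the result, which is why those two decisions dominate the storage bill.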
Retrieval
The cost per query looks small — $0.0001–0.001 — but it compounds. 40,000 queries/day × 30 days = 1.2M queries/month. At $0.0005/query, that is $600/month. More if you are doing hybrid search (dense + BM25 + reranking), as most serious production systems are.
The reranking step is often where retrieval cost doubles. A cross-encoder reranker over the top-50 candidates is more expensive than the initial retrieval but materially improves precision. This is almost always worth doing; it is rarely accounted for up front.
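The monthly retrieval figure can be sketched as below. The `rerank_multiplier` parameter is an illustrative assumption: 2.0 stands in for the "retrieval cost doubles" case described above, not a measured value.

```python
def monthly_retrieval_cost(
    queries_per_day: int,
    usd_per_query: float,
    days: int = 30,
    rerank_multiplier: float = 1.0,  # assumption: 2.0 models reranking doubling the cost
) -> float:
    """Monthly retrieval spend in USD."""
    return queries_per_day * days * usd_per_query * rerank_multiplier


base_retrieval = monthly_retrieval_cost(40_000, 0.0005)
with_rerank = monthly_retrieval_cost(40_000, 0.0005, rerank_multiplier=2.0)
print(f"dense only: ${base_retrieval:,.0f}/month, with reranking: ${with_rerank:,.0f}/month")
```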
Generation
The dominant cost. Two variables: input tokens and output tokens. Input tokens dominate — the retrieved context is typically 2,000–8,000 tokens; the answer is 300–1,500.
Example: a moderately large context (5k tokens) and a 500-token answer, with a strong 2026-era model at $2.50/MTok input and $10/MTok output, costs $0.0125 for input plus $0.005 for output, about $0.0175/query. At 40k queries/day, that is $700/day, or $21,000/month.
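The per-query calculation is easy to reproduce for other context sizes and price points. A minimal sketch, using the prices and token counts from the example (function and variable names are illustrative):

```python
def generation_cost_per_query(
    input_tokens: int,
    output_tokens: int,
    usd_in_per_mtok: float,
    usd_out_per_mtok: float,
) -> float:
    """Generation cost in USD for a single query."""
    return (input_tokens * usd_in_per_mtok + output_tokens * usd_out_per_mtok) / 1_000_000


per_query = generation_cost_per_query(5_000, 500, 2.50, 10.0)
per_day = per_query * 40_000
per_month = per_day * 30
print(f"${per_query:.4f}/query, ${per_day:,.0f}/day, ${per_month:,.0f}/month")
```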
This is where most RAG systems live or die on unit economics. The levers:
- Retrieve less. Smaller top-k, tighter reranking, better retrieval quality. At the example prices above, cutting input tokens in half takes the per-query cost from $0.0175 to $0.01125, roughly a third off the bill.
- Route by complexity. Use a smaller model for the 70% of queries that do not need the flagship.
- Cache aggressively. Semantic caching can eliminate 10–30% of queries entirely, especially in support workloads.
- Prompt discipline. A 300-token system prompt at 40k queries/day is 360M tokens/month, or about $900/month at $2.50/MTok input. Trim it.
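The routing and caching levers combine multiplicatively, which a short model makes visible. This is a sketch under stated assumptions: the small-model cost of $0.002/query is hypothetical, the 20% cache hit rate is simply a point inside the 10–30% range above, and the cost of running the cache itself is ignored.

```python
def blended_monthly_generation_cost(
    queries_per_day: int,
    flagship_usd_per_query: float,
    small_usd_per_query: float,
    small_share: float,      # fraction of uncached queries routed to the small model
    cache_hit_rate: float,   # fraction of queries answered from cache
    days: int = 30,
) -> float:
    """Monthly generation spend in USD after caching and routing."""
    # Cached queries never reach a model, so they incur no generation cost.
    billable_queries = queries_per_day * days * (1 - cache_hit_rate)
    # The remaining queries are split between the small model and the flagship.
    per_query = (
        small_share * small_usd_per_query
        + (1 - small_share) * flagship_usd_per_query
    )
    return billable_queries * per_query


baseline = blended_monthly_generation_cost(40_000, 0.0175, 0.0175, 0.0, 0.0)
optimised = blended_monthly_generation_cost(
    40_000,
    flagship_usd_per_query=0.0175,
    small_usd_per_query=0.002,  # hypothetical small-model price
    small_share=0.7,
    cache_hit_rate=0.2,
)
print(f"baseline: ${baseline:,.0f}/month, optimised: ${optimised:,.0f}/month")
```

Under these assumed inputs most of the saving comes from routing, because the flagship price only applies to the 30% of uncached queries that need it.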
Observability and evaluation
The forgotten category. Running a 2,000-case eval harness on every prompt change is not free: at $0.015/case (a realistic average for a RAG eval), a full run is $30. Run it 40 times a month and that is $1,200. Storing traces and running LLM-as-judge on samples of production traffic adds another $500–2,000/month at realistic volumes.
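The evaluation budget can be sketched the same way. The figures are the ones quoted above; the function name is illustrative.

```python
def monthly_eval_cost(cases: int, usd_per_case: float, runs_per_month: int) -> float:
    """Monthly spend in USD on full eval-harness runs."""
    return cases * usd_per_case * runs_per_month


single_run = monthly_eval_cost(2_000, 0.015, 1)
eval_runs = monthly_eval_cost(2_000, 0.015, 40)

# Trace storage and LLM-as-judge sampling: the $500-2,000/month range from this section.
observability_low = eval_runs + 500
observability_high = eval_runs + 2_000
print(f"${single_run:.0f}/run, ${observability_low:,.0f}-${observability_high:,.0f}/month total")
```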
Putting it together
For a production RAG system against 2M documents, at 40k queries/day and with real eval discipline, our typical monthly run-rate is $20–35k. Optimised, with routing, caching and tight retrieval, it is $10–18k. The difference between those two numbers is several months of engineering that almost always pays for itself.
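As a sanity check, the worked examples from the earlier sections can be summed into a rough unoptimised monthly figure. This sketch uses only the example numbers from this article and leaves out ongoing embedding, reranking and any engineering time, so it is a floor for the worked example, not a reconstruction of the quoted range.

```python
# (low, high) monthly USD per component, taken from the worked examples in this article.
components = {
    "vector_storage": (200, 500),
    "retrieval": (600, 600),
    "generation": (21_000, 21_000),
    "eval_runs": (1_200, 1_200),
    "traces_and_judging": (500, 2_000),
}

total_low = sum(low for low, _ in components.values())
total_high = sum(high for _, high in components.values())
generation_share = components["generation"][0] / total_low

print(f"${total_low:,} to ${total_high:,} per month")
print(f"generation is {generation_share:.0%} of the low estimate")
```

The sum lands at $23,500 to $25,300, inside the $20–35k band, with generation accounting for the large majority of it.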
The optimisation sequence that works
- Measure. Every query, every step, every cost. You cannot optimise what is not instrumented.
- Retrieve less — tighten top-k, add reranking, drop chunks the generator would ignore.
- Route — small model on the common case, flagship on the hard case.
- Cache — semantic cache on query, cache on retrieval output.
- Prompt trim — system prompt, context framing, output format.
- Revisit model and provider choice annually; the market moves.
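The first step, measurement, needs little more than a per-query record of what each stage cost. One possible shape is sketched below. The `QueryCostTrace` class and its step names are hypothetical, not part of any particular tracing library, and the costs are the example figures from this article.

```python
from dataclasses import dataclass, field


@dataclass
class QueryCostTrace:
    """Accumulates per-step cost for a single query (hypothetical structure)."""
    query_id: str
    steps: dict[str, float] = field(default_factory=dict)

    def record(self, step: str, usd: float) -> None:
        # Add to any existing cost for the step so that retries are counted.
        self.steps[step] = self.steps.get(step, 0.0) + usd

    @property
    def total_usd(self) -> float:
        return sum(self.steps.values())

    def most_expensive_step(self) -> str:
        return max(self.steps, key=self.steps.get)


trace = QueryCostTrace(query_id="q-001")
trace.record("retrieval", 0.0005)
trace.record("generation", 0.0175)
print(trace.most_expensive_step(), f"${trace.total_usd:.4f}")
```

Aggregating these records by step is what tells you which of the later levers to pull first.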
What it looks like when it goes wrong
The classic failure mode: a RAG demo impresses leadership, rolls to production, hits 40k queries/day, and the monthly bill surprises everyone. Three months in, a "cost reduction sprint" is scheduled. With discipline up front, this is avoidable and the bill is half of what it would otherwise have been.