Unit economics of a RAG system in production.

Retrieval-augmented generation is the dominant enterprise LLM pattern in 2026, and the unit economics are still poorly understood by many of the teams deploying it. A demo against a 10,000-document corpus costs next to nothing. A production system against a 2,000,000-document corpus, answering 40,000 queries a day, is a non-trivial line item. This article is the breakdown we use when sizing engagements.

The five cost components

Embedding — one-time for backfill, continuous for new documents
Vector storage — continuous, grows with corpus
Retrieval — per query, grows with corpus and with retrieval quality ambition
Generation — per query, dominated by input tokens from retrieved context
Observability and evaluation — ongoing, often under-budgeted

Embedding

The first-order cost is obvious: cost-per-token × tokens-in-corpus. For a 2M-document corpus at an average 1,200 tokens per document, that is 2.4Bn tokens. At $0.02/MTok (current floor for a good embedding model in 2026), the backfill is about $48. Cheap.

The second-order cost is ingestion. New documents arrive continuously, and each one has to be chunked and embedded. Chunking strategy matters: re-chunking the corpus after a strategy change means re-embedding it, which is another $48 round-trip every time. Keep chunking decisions reversible by storing the raw documents so you can re-chunk on demand.
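The backfill arithmetic is simple enough to keep in a sizing script. A minimal sketch using the figures above; the function name and the `rechunk_passes` parameter are illustrative, and it assumes a re-chunked corpus has roughly the same token count as the original (chunk overlap would raise it):

```python
def embedding_cost(num_docs: int, avg_tokens_per_doc: int,
                   price_per_mtok: float, rechunk_passes: int = 0) -> float:
    """Dollar cost of the initial backfill plus any full re-embedding passes."""
    tokens_per_pass = num_docs * avg_tokens_per_doc
    cost_per_pass = tokens_per_pass / 1_000_000 * price_per_mtok
    # One pass for the backfill, plus one per chunking-strategy change.
    return cost_per_pass * (1 + rechunk_passes)


backfill = embedding_cost(2_000_000, 1_200, 0.02)
with_two_rechunks = embedding_cost(2_000_000, 1_200, 0.02, rechunk_passes=2)
print(f"backfill: ${backfill:,.2f}")                     # backfill: $48.00
print(f"with two re-chunks: ${with_two_rechunks:,.2f}")  # with two re-chunks: $144.00
```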

Vector storage

Under-appreciated. A 2M-document corpus at 6 chunks/document is about 12M vectors. At 1024 dimensions stored as float32 (4 bytes per dimension), each vector is 4 KB, so that is roughly 48 GB of raw vector data, before replication and indexing overhead (typically 2–3×). In a managed vector DB (Pinecone, Qdrant Cloud, Weaviate Cloud), that is $200–500/month baseline. In self-hosted pgvector with good hardware it is significantly less, but you take on the operational cost.

The decisions that move this: chunk size (smaller chunks = more vectors), embedding dimensionality (higher = more storage), replication for HA, and regional footprint.
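To see how those decisions move the number, the sizing can be sketched as below. This assumes float32 vectors (4 bytes per dimension, which matches the 4 KB per vector figure above) and reports decimal gigabytes, so the raw figure comes out slightly above the rounded 48 GB. The function name and the `overhead` multiplier are illustrative.

```python
def vector_storage_gb(num_docs: int, chunks_per_doc: int, dims: int,
                      bytes_per_dim: int = 4, overhead: float = 1.0) -> float:
    """Approximate vector storage in decimal GB (1 GB = 1e9 bytes)."""
    num_vectors = num_docs * chunks_per_doc
    raw_bytes = num_vectors * dims * bytes_per_dim
    # overhead is a flat multiplier standing in for replication and index structures.
    return raw_bytes * overhead / 1e9


raw = vector_storage_gb(2_000_000, 6, 1024)
low = vector_storage_gb(2_000_000, 6, 1024, overhead=2.0)
high = vector_storage_gb(2_000_000, 6, 1024, overhead=3.0)
print(f"raw: {raw:.1f} GB, with overhead: {low:.0f}-{high:.0f} GB")
# raw: 49.2 GB, with overhead: 98-147 GB
```

Storage is linear in every input, so halving either the chunks per document or the dimensionality halves the result.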

Retrieval

The cost per query looks small — $0.0001–0.001 — but it compounds. 40,000 queries/day × 30 days = 1.2M queries/month. At $0.0005/query, that is $600/month. More if you are doing hybrid search (dense + BM25 + reranking), as most serious production systems are.

The reranking step is often where retrieval cost doubles. A cross-encoder reranker over the top-50 candidates is more expensive than the initial retrieval but materially improves precision. This is almost always worth doing; it is rarely accounted for up front.
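A rough sketch of the monthly retrieval arithmetic, assuming a 30-day month. The `rerank_multiplier` parameter is a hypothetical knob that models the "cost doubles" observation as a flat factor; a real reranker's cost depends on the candidate count and the model used.

```python
def monthly_retrieval_cost(queries_per_day: int, cost_per_query: float,
                           rerank_multiplier: float = 1.0, days: int = 30) -> float:
    """Monthly retrieval spend; rerank_multiplier scales the per-query cost."""
    return queries_per_day * days * cost_per_query * rerank_multiplier


base = monthly_retrieval_cost(40_000, 0.0005)
with_rerank = monthly_retrieval_cost(40_000, 0.0005, rerank_multiplier=2.0)
print(f"base: ${base:,.0f}/month, with reranking: ${with_rerank:,.0f}/month")
# base: $600/month, with reranking: $1,200/month
```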

Generation

The dominant cost. Two variables: input tokens and output tokens. Input tokens dominate — the retrieved context is typically 2,000–8,000 tokens; the answer is 300–1,500.

Example: a moderately large context (5k tokens) with a strong 2026-era model at $2.50/MTok input and $10/MTok output, producing a 500-token answer, costs about $0.0125 for input plus $0.005 for output, or $0.0175/query. At 40k queries/day that is $700/day, or $21,000 over a 30-day month.
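The same example as a small calculation, useful for plugging in your own token counts and prices. The function name is illustrative; prices are in dollars per million tokens.

```python
def generation_cost_per_query(input_tokens: int, output_tokens: int,
                              input_price_per_mtok: float,
                              output_price_per_mtok: float) -> float:
    """Dollar cost of one generation call, split into input and output tokens."""
    input_cost = input_tokens * input_price_per_mtok / 1_000_000
    output_cost = output_tokens * output_price_per_mtok / 1_000_000
    return input_cost + output_cost


per_query = generation_cost_per_query(5_000, 500, 2.50, 10.00)
per_day = per_query * 40_000
per_month = per_day * 30
print(f"${per_query:.4f}/query, ${per_day:,.0f}/day, ${per_month:,.0f}/month")
# $0.0175/query, $700/day, $21,000/month
```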

This is where most RAG systems live or die on unit economics. The levers:

Retrieve less. Smaller top-k, tighter reranking, better retrieval quality. Input tokens are roughly 70% of the per-query cost in the example above, so halving them cuts the generation bill by about a third.
Route by complexity. Use a smaller model for the 70% of queries that do not need the flagship.
Cache aggressively. Semantic caching can eliminate 10–30% of queries entirely, especially in support workloads.
Prompt discipline. A 300-token system prompt at 40k queries/day is 360M tokens/month, or about $900/month at $2.50/MTok. Trim it.
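The levers combine multiplicatively, which a small model makes easy to see. This is a sketch under loudly stated assumptions: the small-model prices ($0.50/$2.00 per MTok), the 20% cache hit rate, the 70% routing share and the 3,000-token context are illustrative inputs, not measurements, and the cost of running the cache and the router is ignored.

```python
def per_query_cost(input_tokens: int, output_tokens: int,
                   input_price: float, output_price: float) -> float:
    """Generation cost of one query; prices are dollars per million tokens."""
    return (input_tokens * input_price + output_tokens * output_price) / 1_000_000


def optimised_monthly_cost(queries_per_month: int, input_tokens: int,
                           output_tokens: int, cache_hit_rate: float,
                           small_model_share: float,
                           flagship_prices: tuple, small_prices: tuple) -> float:
    """Monthly generation spend after caching and routing are applied."""
    # Cache hits never reach a model, so they are not billed.
    billed_queries = queries_per_month * (1 - cache_hit_rate)
    flagship = per_query_cost(input_tokens, output_tokens, *flagship_prices)
    small = per_query_cost(input_tokens, output_tokens, *small_prices)
    # Routed queries pay the small-model price, the rest pay the flagship price.
    blended = small_model_share * small + (1 - small_model_share) * flagship
    return billed_queries * blended


FLAGSHIP = (2.50, 10.00)  # from the example above
SMALL = (0.50, 2.00)      # hypothetical small-model pricing

baseline = optimised_monthly_cost(1_200_000, 5_000, 500, 0.0, 0.0, FLAGSHIP, SMALL)
optimised = optimised_monthly_cost(1_200_000, 3_000, 500, 0.2, 0.7, FLAGSHIP, SMALL)
print(f"baseline: ${baseline:,.0f}/month, optimised: ${optimised:,.0f}/month")
```

With all levers switched off the sketch reproduces the $21,000/month figure, which is a useful sanity check before trusting any optimised scenario it produces.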

Observability and evaluation

The forgotten category. Running a 2,000-case eval harness on every prompt change is not free: at $0.015/case (a realistic average for a RAG eval), a full run is $30. Run it 40 times a month and that is $1,200. Storing traces and running LLM-as-judge on samples of production traffic adds another $500–2,000/month at realistic volumes.
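The eval harness arithmetic, as a sketch with an illustrative function name. It covers only the harness runs; trace storage and LLM-as-judge sampling would be budgeted on top.

```python
def eval_costs(cases: int, cost_per_case: float, runs_per_month: int) -> tuple:
    """Return (cost of one full eval run, monthly cost of all runs) in dollars."""
    per_run = cases * cost_per_case
    return per_run, per_run * runs_per_month


per_run, per_month_eval = eval_costs(2_000, 0.015, 40)
print(f"${per_run:,.0f}/run, ${per_month_eval:,.0f}/month")  # $30/run, $1,200/month
```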

Putting it together

For a production RAG system against 2M documents, 40k queries/day, real eval discipline, our typical monthly run-rate is $20–35k. Optimised, with routing, caching and tight retrieval, $10–18k. The difference between those two numbers is several months of engineering that almost always pays for itself.
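As a cross-check, the unoptimised figures quoted in the sections above can be summed directly. The dictionary below is just those figures as (low, high) pairs; ongoing embedding is left out as negligible at these prices, and reranking, hybrid search or larger contexts would push the total toward the upper end of the range.

```python
# Monthly dollar figures from the sections above, as (low, high) pairs.
components = {
    "vector storage": (200, 500),
    "retrieval": (600, 600),
    "generation": (21_000, 21_000),
    "eval harness": (1_200, 1_200),
    "traces and LLM-as-judge": (500, 2_000),
}

low_total = sum(low for low, _ in components.values())
high_total = sum(high for _, high in components.values())
print(f"${low_total:,} - ${high_total:,} per month")  # $23,500 - $25,300 per month

# Share of the low-end total taken by generation, the dominant line item.
generation_share = components["generation"][0] / low_total
print(f"generation share: {generation_share:.0%}")  # generation share: 89%
```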

The optimisation sequence that works

Measure. Every query, every step, every cost. You cannot optimise what is not instrumented.
Retrieve less — tighten top-k, add reranking, drop chunks the generator would ignore.
Route — small model on the common case, flagship on the hard case.
Cache — semantic cache on query, cache on retrieval output.
Prompt trim — system prompt, context framing, output format.
Revisit model and provider choice annually; the market moves.
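For the first step, measurement does not need to start as anything elaborate. A minimal in-memory sketch, assuming you can attribute a dollar cost to each step of a query; `CostLedger` and the step names are hypothetical, and a production system would emit these values to its tracing or metrics backend instead.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class CostLedger:
    """Accumulates dollar cost per pipeline step across queries."""
    totals: dict = field(default_factory=lambda: defaultdict(float))
    queries: int = 0

    def record(self, step: str, dollars: float) -> None:
        """Add the cost of one step of the current query."""
        self.totals[step] += dollars

    def end_query(self) -> None:
        """Mark the current query as finished."""
        self.queries += 1

    def per_query(self) -> dict:
        """Average cost per query for each step seen so far."""
        if self.queries == 0:
            return {}
        return {step: total / self.queries for step, total in self.totals.items()}


ledger = CostLedger()
for _ in range(2):
    ledger.record("retrieval", 0.0005)
    ledger.record("generation", 0.0175)
    ledger.end_query()

averages = ledger.per_query()
print({step: round(cost, 4) for step, cost in averages.items()})
```

Keeping the breakdown per step, rather than a single total, is what makes the later steps in the sequence measurable: each lever should visibly move one entry.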

What it looks like when it goes wrong

The classic failure mode: a RAG demo impresses leadership, rolls to production, hits 40k queries/day, and the monthly bill surprises everyone. Three months in, a "cost reduction sprint" is scheduled. With discipline up front, the surprise is avoidable and the bill is roughly half of what it would otherwise have been.