AI Engineering · 12 June 2025 · 10 min read

Unit economics of a RAG system in production.

RAG demos are cheap. RAG at scale is not. A breakdown of where the cost actually goes, and the levers that matter.

Ravi Bandaru
Senior Consultant · AI Engineering

Retrieval-augmented generation is the dominant enterprise LLM pattern in 2026, and the unit economics are still poorly understood by many of the teams deploying it. A demo against a 10,000-document corpus costs next to nothing. A production system against a 2,000,000-document corpus, answering 40,000 queries a day, is a non-trivial line item. This article is the breakdown we use when sizing engagements.

The five cost components

  1. Embedding — one-time for backfill, continuous for new documents
  2. Vector storage — continuous, grows with corpus
  3. Retrieval — per query, grows with corpus and with retrieval quality ambition
  4. Generation — per query, dominated by input tokens from retrieved context
  5. Observability and evaluation — ongoing, often under-budgeted

Embedding

The first-order cost is obvious: cost-per-token × tokens-in-corpus. For a 2M-document corpus at an average 1,200 tokens per document, that is 2.4Bn tokens. At $0.02/MTok (current floor for a good embedding model in 2026), the backfill is about $48. Cheap.
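The backfill arithmetic above can be written as a small helper. This is a minimal sketch; the function name and parameters are illustrative, and the price is simply the $0.02/MTok figure quoted in the text.

```python
def embedding_backfill_cost(num_docs: int, avg_tokens_per_doc: int,
                            usd_per_mtok: float) -> float:
    """One-off cost in USD of embedding every document in the corpus once."""
    total_tokens = num_docs * avg_tokens_per_doc
    return total_tokens / 1_000_000 * usd_per_mtok


# Figures from the text: 2M documents, 1,200 tokens each, $0.02 per million tokens.
backfill_usd = embedding_backfill_cost(2_000_000, 1_200, 0.02)
print(f"Backfill: about ${backfill_usd:,.0f}")
```

The same function prices a full re-chunk, since re-chunking re-embeds roughly the same token volume.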

The second-order cost is ingestion. New documents arrive. Chunking strategy matters; re-chunking the corpus on a strategy change is another $48 round-trip every time. Keep chunking decisions reversible by storing the raw document and re-chunking on demand.

Vector storage

Under-appreciated. A 2M-document corpus at 6 chunks/document and 1024-dim vectors is about 12M vectors × 4 KB (1,024 dimensions × 4 bytes each as float32) = 48 GB of raw vector data, before replication and indexing overhead (typically 2–3×). In a managed vector DB (Pinecone, Qdrant Cloud, Weaviate Cloud), that is $200–500/month baseline. In self-hosted pgvector with good hardware, the infrastructure bill is significantly lower, but you take on the operational cost.

The decisions that move this: chunk size (smaller chunks = more vectors), embedding dimensionality (higher = more storage), replication for HA, and regional footprint.
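A rough sizing sketch shows how chunk count and dimensionality drive raw storage. It assumes float32 vectors (4 bytes per dimension) and decimal gigabytes; the function name is illustrative, and the 2–3× overhead range is the rule of thumb quoted above, not a measurement.

```python
def raw_vector_storage_gb(num_docs: int, chunks_per_doc: int, dims: int,
                          bytes_per_dim: int = 4) -> float:
    """Raw vector payload in decimal GB, before replication and index overhead."""
    num_vectors = num_docs * chunks_per_doc
    return num_vectors * dims * bytes_per_dim / 1e9


# 2M documents, 6 chunks each, 1024-dim float32 vectors.
# This comes out near 49 GB; the text's 48 GB rounds 4,096 bytes down to 4 KB.
raw_gb = raw_vector_storage_gb(2_000_000, 6, 1024)

# Apply the 2-3x replication and indexing overhead range.
provisioned_low_gb, provisioned_high_gb = raw_gb * 2, raw_gb * 3

# Halving dimensionality halves raw storage, all else equal.
raw_gb_512 = raw_vector_storage_gb(2_000_000, 6, 512)

print(f"raw: {raw_gb:.1f} GB, provisioned: "
      f"{provisioned_low_gb:.0f}-{provisioned_high_gb:.0f} GB, "
      f"512-dim raw: {raw_gb_512:.1f} GB")
```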

Retrieval

The cost per query looks small — $0.0001–0.001 — but it compounds. 40,000 queries/day × 30 days = 1.2M queries/month. At $0.0005/query, that is $600/month. More if you are doing hybrid search (dense + BM25 + reranking), as most serious production systems are.

The reranking step is often where retrieval cost doubles. A cross-encoder reranker over the top-50 candidates is more expensive than the initial retrieval but materially improves precision. This is almost always worth doing; it is rarely accounted for up front.
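The compounding is easy to see in code. In this sketch the `rerank_multiplier` parameter is a hypothetical knob that encodes the "retrieval cost often doubles" rule of thumb from the text; it is not a measured figure for any particular reranker.

```python
def monthly_retrieval_cost(queries_per_day: int, usd_per_query: float,
                           days: int = 30,
                           rerank_multiplier: float = 1.0) -> float:
    """Monthly retrieval spend in USD; the multiplier models reranking overhead."""
    return queries_per_day * days * usd_per_query * rerank_multiplier


# Figures from the text: 40k queries/day at $0.0005 per query.
dense_only_usd = monthly_retrieval_cost(40_000, 0.0005)
# Assumption: reranking doubles per-query retrieval cost.
with_rerank_usd = monthly_retrieval_cost(40_000, 0.0005, rerank_multiplier=2.0)

print(f"dense only: ${dense_only_usd:,.0f}/month, "
      f"with reranking: ${with_rerank_usd:,.0f}/month")
```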

Generation

The dominant cost. Two variables: input tokens and output tokens. Input tokens dominate — the retrieved context is typically 2,000–8,000 tokens; the answer is 300–1,500.

Example: a moderately large context (5k tokens) with a strong 2026-era model at $2.50/MTok input, $10/MTok output, 500-token output, is about $0.0175/query. At 40k queries/day, $700/day, $21,000/month.
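The example works out as follows. This is a sketch using only the prices and token counts quoted above; the function name is illustrative.

```python
def generation_cost_per_query(input_tokens: int, output_tokens: int,
                              usd_per_mtok_in: float,
                              usd_per_mtok_out: float) -> float:
    """Cost in USD of one generation call, priced per million tokens."""
    input_usd = input_tokens / 1_000_000 * usd_per_mtok_in
    output_usd = output_tokens / 1_000_000 * usd_per_mtok_out
    return input_usd + output_usd


# 5k input tokens at $2.50/MTok plus 500 output tokens at $10/MTok.
per_query_usd = generation_cost_per_query(5_000, 500, 2.50, 10.00)
per_day_usd = per_query_usd * 40_000
per_month_usd = per_day_usd * 30

print(f"${per_query_usd:.4f}/query, ${per_day_usd:,.0f}/day, "
      f"${per_month_usd:,.0f}/month")
```

Of the $0.0175 per query, $0.0125 is input and $0.005 is output, so input tokens account for roughly 70% of the generation bill in this configuration.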

This is where most RAG systems live or die on unit economics. The levers:

  • Retrieve less. Smaller top-k, tighter reranking, better retrieval quality. In the example above, input tokens are about 70% of per-query cost, so halving them cuts the generation bill by a little over a third.
  • Route by complexity. Use a smaller model for the 70% of queries that do not need the flagship.
  • Cache aggressively. Semantic caching can answer 10–30% of queries without a generation call at all, especially in support workloads.
  • Prompt discipline. A 300-token system prompt at 40k queries/day is 360M tokens/month. Trim it.
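A small model of how these levers combine is sketched below. The flagship prices come from the example above. The `SMALL` model price is a placeholder invented for illustration, and the sketch assumes a cache hit costs nothing, which ignores the cost of the cache lookup itself.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelPrice:
    usd_per_mtok_in: float
    usd_per_mtok_out: float

    def cost(self, input_tokens: int, output_tokens: int) -> float:
        """Cost in USD of a single call at this price."""
        return (input_tokens * self.usd_per_mtok_in
                + output_tokens * self.usd_per_mtok_out) / 1_000_000


FLAGSHIP = ModelPrice(2.50, 10.00)  # prices from the example in the text
SMALL = ModelPrice(0.25, 1.00)      # ASSUMPTION: placeholder, not a quoted price


def blended_cost_per_query(input_tokens: int, output_tokens: int,
                           small_share: float = 0.0,
                           cache_hit_rate: float = 0.0) -> float:
    """Expected generation cost per incoming query after routing and caching."""
    routed = (small_share * SMALL.cost(input_tokens, output_tokens)
              + (1 - small_share) * FLAGSHIP.cost(input_tokens, output_tokens))
    # Assumption: cached answers skip generation entirely and cost nothing.
    return (1 - cache_hit_rate) * routed


baseline = blended_cost_per_query(5_000, 500)
halved_context = blended_cost_per_query(2_500, 500)
saving_from_halving = 1 - halved_context / baseline  # a little over a third

all_levers = blended_cost_per_query(2_500, 500,
                                    small_share=0.7, cache_hit_rate=0.2)
print(f"baseline ${baseline:.4f}, halved context ${halved_context:.4f} "
      f"({saving_from_halving:.0%} saving), all levers ${all_levers:.5f}")
```

The combined figure depends heavily on the placeholder small-model price and on how cleanly queries can be routed, so treat it as a way to compare scenarios rather than a forecast.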

Observability and evaluation

The forgotten category. Running a 2,000-case eval harness on every prompt change is not free: at $0.015/case (a realistic average for a RAG eval), a full run is $30. Run it 40 times a month and that is $1,200. Storing traces and running LLM-as-judge on samples of production traffic adds another $500–2,000/month at realistic volumes.
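Budgeting eval runs is simple multiplication, which makes it easy to put into a planning script. A minimal sketch, with an illustrative function name and the figures from the paragraph above:

```python
def eval_cost(cases: int, usd_per_case: float,
              runs_per_month: int) -> tuple[float, float]:
    """Return (cost of one full eval run, monthly cost) in USD."""
    per_run = cases * usd_per_case
    return per_run, per_run * runs_per_month


# 2,000 cases at $0.015 each, run 40 times a month.
per_run_usd, monthly_usd = eval_cost(2_000, 0.015, 40)
print(f"${per_run_usd:,.0f} per run, ${monthly_usd:,.0f} per month")
```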

Putting it together

For a production RAG system against 2M documents, 40k queries/day, real eval discipline, our typical monthly run-rate is $20–35k. Optimised, with routing, caching and tight retrieval, $10–18k. The difference between those two numbers is several months of engineering that almost always pays for itself.
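Summing the unoptimised figures from the sections above gives a sanity check on that range. This sketch takes each component's low and high monthly estimate as stated earlier, leaves out the one-off embedding backfill, and uses the retrieval figure before any hybrid-search or reranking uplift.

```python
# (low, high) monthly estimates in USD, taken from the sections above.
MONTHLY_COMPONENTS_USD = {
    "vector_storage": (200, 500),
    "retrieval": (600, 600),          # dense retrieval only, no reranking uplift
    "generation": (21_000, 21_000),
    "eval_runs": (1_200, 1_200),
    "traces_and_judging": (500, 2_000),
}

total_low = sum(low for low, _ in MONTHLY_COMPONENTS_USD.values())
total_high = sum(high for _, high in MONTHLY_COMPONENTS_USD.values())
generation_share_low = MONTHLY_COMPONENTS_USD["generation"][0] / total_high
generation_share_high = MONTHLY_COMPONENTS_USD["generation"][1] / total_low

print(f"${total_low:,} to ${total_high:,} per month; generation is "
      f"{generation_share_low:.0%} to {generation_share_high:.0%} of the total")
```

With these inputs the total lands at $23,500 to $25,300 a month, inside the $20–35k range, and generation is well over 80% of it, which is why the optimisation levers concentrate there.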

The optimisation sequence that works

  1. Measure. Every query, every step, every cost. Cannot optimise what is not instrumented.
  2. Retrieve less — tighten top-k, add reranking, drop chunks the generator would ignore.
  3. Route — small model on the common case, flagship on the hard case.
  4. Cache — semantic cache on query, cache on retrieval output.
  5. Prompt trim — system prompt, context framing, output format.
  6. Revisit model and provider choice annually; the market moves.
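Step 1 is the one that needs code before anything else. The sketch below shows one possible shape for per-query cost instrumentation; the class, function, and step names are hypothetical, and a real system would write these records to its tracing or metrics backend rather than hold them in memory.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class QueryCostLog:
    """Accumulates the USD cost of each pipeline step for one query."""
    query_id: str
    steps: dict[str, float] = field(default_factory=dict)

    def record(self, step: str, usd: float) -> None:
        # Add to any existing amount so a step can be recorded more than once.
        self.steps[step] = self.steps.get(step, 0.0) + usd

    @property
    def total(self) -> float:
        return sum(self.steps.values())


def cost_share_by_step(logs: list[QueryCostLog]) -> dict[str, float]:
    """Fraction of total spend attributable to each step across all logs."""
    totals: dict[str, float] = defaultdict(float)
    for log in logs:
        for step, usd in log.steps.items():
            totals[step] += usd
    grand_total = sum(totals.values())
    if grand_total == 0:
        return {}
    return {step: usd / grand_total for step, usd in totals.items()}


# Usage with the per-query figures from the article.
log = QueryCostLog("q-0001")
log.record("retrieval", 0.0005)
log.record("generation", 0.0175)
shares = cost_share_by_step([log])
print({step: f"{share:.1%}" for step, share in shares.items()})
```

Recording cost per step rather than per query is what makes the later steps measurable: a change to top-k or routing should show up as a shift in the generation share.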

What it looks like when it goes wrong

The classic failure mode: a RAG demo impresses leadership, rolls to production, hits 40k queries/day, and the monthly bill surprises everyone. Three months in, a "cost reduction sprint" is scheduled. With discipline up front, the surprise is avoidable, and the bill is roughly half of what it would otherwise have been.
