Most AI search systems don’t fail because of traffic.
They fail because context grows faster than anyone expects.
Each query retrieves “just a little more.”
Each answer includes “just a bit more detail.”
Tokens compound quietly.
The Scenario: AI Search with RAG (Production)
Imagine an AI-powered search feature built on top of internal documents.
Users ask natural language questions and the system retrieves relevant chunks before generating an answer.
It feels efficient.
It feels controlled.
Until context takes over.
Assumptions
- ~10,000 searches per month
- Retrieval of 5–15 document chunks per query
- Chunk sizes vary by document type
- Answers are synthesized from multiple chunks, not short extractive responses
- Some retries and follow-ups are expected
Traffic is reasonable.
Context is not.
Step 1 — Model Retrieval, Not Queries
With RAG systems, queries are not the cost driver.
Retrieval is.
Every search expands the prompt:
- User question
- Retrieved document chunks
- System instructions
- Formatting scaffolding
The model sees all of it.
Cost scales with context size, not just usage.
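A back-of-envelope sketch makes this concrete. All token counts below are illustrative assumptions, not measurements from any real system:

```python
# Per-query prompt size in a RAG system. Every number here is an
# illustrative assumption: a ~40-token question, ~300-token chunks,
# ~400 tokens of system instructions, ~150 tokens of scaffolding.
def prompt_tokens(question=40, chunks=10, tokens_per_chunk=300,
                  system=400, scaffolding=150):
    """Total input tokens the model sees for one search."""
    return question + chunks * tokens_per_chunk + system + scaffolding

# Going from 5 to 15 retrieved chunks more than doubles the prompt,
# while the user's question stays a rounding error.
print(prompt_tokens(chunks=5))   # 2090
print(prompt_tokens(chunks=15))  # 5090
```

The question is a few dozen tokens; the retrieved context is thousands. That asymmetry is why retrieval, not queries, drives cost.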
Step 2 — Define a Planning Baseline
In ModelIndex, this is the Expected scenario.
Expected means:
- Typical retrieval depth
- Average chunk sizes
- Normal answer verbosity
- Planned retries and follow-ups
This is the number teams should budget against before adding more documents or increasing recall.
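An Expected baseline can be sketched from the assumptions above. The per-token prices below are placeholders, not any real model's rates, and the retry factor is a guess:

```python
# Hypothetical Expected-scenario monthly cost. Inputs mirror the
# assumptions above: ~10,000 searches/month, ~10 chunks of ~300
# tokens, synthesized answers. Prices are placeholders per 1M tokens.
SEARCHES = 10_000
INPUT_TOKENS = 40 + 10 * 300 + 400 + 150  # question + chunks + system + scaffolding
OUTPUT_TOKENS = 500                        # assumed synthesized answer length
RETRY_FACTOR = 1.15                        # planned retries and follow-ups

PRICE_IN, PRICE_OUT = 3.00, 15.00          # placeholder $ per 1M tokens

monthly = SEARCHES * RETRY_FACTOR * (
    INPUT_TOKENS * PRICE_IN + OUTPUT_TOKENS * PRICE_OUT
) / 1_000_000
print(f"Expected baseline: ${monthly:,.2f}/month")
```

Swap in real retrieval depths, measured chunk sizes, and actual model pricing before treating the output as a budget number.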
Step 3 — Identify the Real Cost Drivers
In RAG systems, cost grows when:
- Retrieval depth increases “slightly”
- Chunk sizes drift larger over time
- More documents are indexed
- Answers become more verbose
- Context limits are pushed but not enforced
None of this looks dramatic in isolation.
Together, they dominate cost.
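The compounding is easy to miss because each driver is a multiplier, not an addition. With illustrative drift amounts, assumed here for the sketch:

```python
# Small, individually reasonable drifts compound multiplicatively.
# All percentages are illustrative assumptions.
drift = {
    "retrieval depth +20%": 1.20,
    "chunk size +15%":      1.15,
    "answer verbosity +25%": 1.25,
    "retries +10%":          1.10,
}

combined = 1.0
for multiplier in drift.values():
    combined *= multiplier

print(f"combined cost multiplier: {combined:.2f}x")  # ~1.90x
```

Four changes that each sound like noise nearly double the bill.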
Step 4 — Explore Best and Worst Boundaries
Now look at Best and Worst.
These are not performance tiers.
They exist to answer one question:
How expensive does search become as context assumptions break?
- Best assumes tight retrieval, small chunks, disciplined prompts
- Worst compounds large retrieval sets, verbose synthesis, and retries
Worst isn’t misuse.
It’s success without guardrails.
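The boundaries can be expressed as the same cost model with different parameters. The Best and Worst parameter choices below are hypothetical, as are the prices:

```python
# Best/Worst as parameter sets on one cost model. All numbers are
# illustrative; prices are placeholder $ per 1M tokens.
def monthly_cost(searches, chunks, tokens_per_chunk, answer_tokens,
                 retry_factor, price_in=3.00, price_out=15.00):
    """Monthly $ given retrieval and verbosity assumptions."""
    prompt = 40 + chunks * tokens_per_chunk + 400 + 150
    per_query = (prompt * price_in + answer_tokens * price_out) / 1e6
    return searches * retry_factor * per_query

# Best: tight retrieval, small chunks, disciplined answers, no retries.
best = monthly_cost(10_000, chunks=5, tokens_per_chunk=200,
                    answer_tokens=300, retry_factor=1.0)
# Worst: deep retrieval, drifted chunks, verbose synthesis, retries.
worst = monthly_cost(10_000, chunks=15, tokens_per_chunk=450,
                     answer_tokens=900, retry_factor=1.4)

print(f"best ${best:,.0f}/mo, worst ${worst:,.0f}/mo, "
      f"spread {worst / best:.1f}x")
```

Same traffic in both scenarios. The spread comes entirely from context assumptions.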
Step 5 — Ask the Right Question
The useful question is not:
How many searches can we support?
It is:
How large does context become before cost stops making sense?
That’s the question most teams ask too late.
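The question can even be inverted into arithmetic: fix a per-search budget and solve for the context ceiling. Budget and prices below are illustrative assumptions:

```python
# Given a per-search budget, how many input tokens can each search
# afford? Prices and budget are placeholders, not recommendations.
PRICE_IN = 3.00 / 1_000_000        # placeholder $ per input token
OUTPUT_COST = 500 * 15.00 / 1e6    # assumed fixed answer cost in $
BUDGET_PER_SEARCH = 0.05           # $ the team is willing to pay

max_prompt = (BUDGET_PER_SEARCH - OUTPUT_COST) / PRICE_IN
print(f"context ceiling: ~{max_prompt:,.0f} input tokens per search")
```

With a ceiling in hand, retrieval depth and chunk size become enforceable limits instead of drifting defaults.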
Why This Matters
RAG systems feel safe because traffic is predictable.
But context growth is not.
Modeling that growth early lets teams:
- Set retrieval limits intentionally
- Control answer verbosity
- Choose models that tolerate large contexts
- Decide whether search should be gated or scoped
Before cost becomes the constraint.