Most AI search systems don’t fail because of traffic.
They fail because context grows faster than anyone expects.
Each query retrieves “just a little more.”
Each answer includes “just a bit more detail.”
Tokens compound quietly.
The Scenario: AI Search with RAG (Production)
Imagine an AI-powered search feature built on top of internal documents.
Users ask natural language questions and the system retrieves relevant chunks before generating an answer.
It feels efficient.
It feels controlled.
Until context takes over.
Assumptions
- ~10,000 searches per month
- Retrieval of 5–15 document chunks per query
- Chunk sizes vary by document type
- Answers are synthesized from multiple chunks, not short extractive responses
- Some retries and follow-ups are expected
Traffic is reasonable.
Context is not.
Step 1 — Model Retrieval, Not Queries
With RAG systems, queries are not the cost driver.
Retrieval is.
Every search expands the prompt:
- User question
- Retrieved document chunks
- System instructions
- Formatting scaffolding
The model sees all of it.
Cost scales with context size, not just usage.
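A back-of-envelope sketch makes this concrete. All token counts below are illustrative assumptions, not measurements from any real system:

```python
# Per-query prompt size in a RAG system. Every number here is an
# illustrative assumption: a ~40-token question, ~300-token chunks,
# ~400 tokens of system instructions, ~150 tokens of scaffolding.
def prompt_tokens(question=40, chunks=10, tokens_per_chunk=300,
                  system=400, scaffolding=150):
    """Total input tokens the model sees for one search."""
    return question + chunks * tokens_per_chunk + system + scaffolding

# Going from 5 to 15 retrieved chunks more than doubles the prompt,
# while the user's question stays a rounding error.
print(prompt_tokens(chunks=5))   # 2090
print(prompt_tokens(chunks=15))  # 5090
```

The question is a few dozen tokens; the retrieved context is thousands. That asymmetry is why retrieval, not queries, drives cost.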
Step 2 — Define a Planning Baseline
In ModelIndex, this is the Expected scenario.
Expected means:
- Typical retrieval depth
- Average chunk sizes
- Normal answer verbosity
- Planned retries and follow-ups
This is the number teams should budget against before adding more documents or increasing recall.
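An Expected baseline can be sketched from the assumptions above. The per-token prices below are placeholders, not any real model's rates, and the retry factor is a guess:

```python
# Hypothetical Expected-scenario monthly cost. Inputs mirror the
# assumptions above: ~10,000 searches/month, ~10 chunks of ~300
# tokens, synthesized answers. Prices are placeholders per 1M tokens.
SEARCHES = 10_000
INPUT_TOKENS = 40 + 10 * 300 + 400 + 150  # question + chunks + system + scaffolding
OUTPUT_TOKENS = 500                        # assumed synthesized answer length
RETRY_FACTOR = 1.15                        # planned retries and follow-ups

PRICE_IN, PRICE_OUT = 3.00, 15.00          # placeholder $ per 1M tokens

monthly = SEARCHES * RETRY_FACTOR * (
    INPUT_TOKENS * PRICE_IN + OUTPUT_TOKENS * PRICE_OUT
) / 1_000_000
print(f"Expected baseline: ${monthly:,.2f}/month")
```

Swap in real retrieval depths, measured chunk sizes, and actual model pricing before treating the output as a budget number.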
Step 3 — Identify the Real Cost Drivers
In RAG systems, cost grows when:
- Retrieval depth increases “slightly”
- Chunk sizes drift larger over time
- More documents are indexed
- Answers become more verbose
- Context limits are pushed but not enforced
None of this looks dramatic in isolation.
Together, they dominate cost.
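The compounding is easy to miss because each driver is a multiplier, not an addition. With illustrative drift amounts, assumed here for the sketch:

```python
# Small, individually reasonable drifts compound multiplicatively.
# All percentages are illustrative assumptions.
drift = {
    "retrieval depth +20%": 1.20,
    "chunk size +15%":      1.15,
    "answer verbosity +25%": 1.25,
    "retries +10%":          1.10,
}

combined = 1.0
for multiplier in drift.values():
    combined *= multiplier

print(f"combined cost multiplier: {combined:.2f}x")  # ~1.90x
```

Four changes that each sound like noise nearly double the bill.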
Step 4 — Explore Best and Worst Boundaries
Now look at Best and Worst.
These are not performance tiers.
They exist to answer one question:
How expensive does search become as context assumptions break?
- Best assumes tight retrieval, small chunks, disciplined prompts
- Worst compounds large retrieval sets, verbose synthesis, and retries
Worst isn’t misuse.
It’s success without guardrails.
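The boundaries can be expressed as the same cost model with different parameters. The Best and Worst parameter choices below are hypothetical, as are the prices:

```python
# Best/Worst as parameter sets on one cost model. All numbers are
# illustrative; prices are placeholder $ per 1M tokens.
def monthly_cost(searches, chunks, tokens_per_chunk, answer_tokens,
                 retry_factor, price_in=3.00, price_out=15.00):
    """Monthly $ given retrieval and verbosity assumptions."""
    prompt = 40 + chunks * tokens_per_chunk + 400 + 150
    per_query = (prompt * price_in + answer_tokens * price_out) / 1e6
    return searches * retry_factor * per_query

# Best: tight retrieval, small chunks, disciplined answers, no retries.
best = monthly_cost(10_000, chunks=5, tokens_per_chunk=200,
                    answer_tokens=300, retry_factor=1.0)
# Worst: deep retrieval, drifted chunks, verbose synthesis, retries.
worst = monthly_cost(10_000, chunks=15, tokens_per_chunk=450,
                     answer_tokens=900, retry_factor=1.4)

print(f"best ${best:,.0f}/mo, worst ${worst:,.0f}/mo, "
      f"spread {worst / best:.1f}x")
```

Same traffic in both scenarios. The spread comes entirely from context assumptions.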
Step 5 — Ask the Right Question
The useful question is not:
How many searches can we support?
It is:
How large does context become before cost stops making sense?
That’s the question most teams ask too late.
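The question can even be inverted into arithmetic: fix a per-search budget and solve for the context ceiling. Budget and prices below are illustrative assumptions:

```python
# Given a per-search budget, how many input tokens can each search
# afford? Prices and budget are placeholders, not recommendations.
PRICE_IN = 3.00 / 1_000_000        # placeholder $ per input token
OUTPUT_COST = 500 * 15.00 / 1e6    # assumed fixed answer cost in $
BUDGET_PER_SEARCH = 0.05           # $ the team is willing to pay

max_prompt = (BUDGET_PER_SEARCH - OUTPUT_COST) / PRICE_IN
print(f"context ceiling: ~{max_prompt:,.0f} input tokens per search")
```

With a ceiling in hand, retrieval depth and chunk size become enforceable limits instead of drifting defaults.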
Why This Matters
RAG systems feel safe because traffic is predictable.
But context growth is not.
Modeling that growth early lets teams:
- Set retrieval limits intentionally
- Control answer verbosity
- Choose models that tolerate large contexts
- Decide whether search should be gated or scoped
Before cost becomes the constraint.