RAG (Retrieval-Augmented Generation) has become the go-to pattern for building AI applications that need access to custom knowledge. But running RAG at scale presents a challenge: how do you balance response quality with API costs?
## The Problem
Enterprise search needs to handle diverse queries:
- Simple factual lookups ("What's our refund policy?")
- Complex analytical questions ("Compare Q3 performance across regions")
- Ambiguous queries requiring clarification
Using GPT-4 for everything is expensive. Using only cheap models sacrifices quality. The solution? Intelligent routing.
## The Three-Layer Architecture

### 1. Ingestion Pipeline
Documents flow through an async processing pipeline:
```
Upload → Chunking → Embedding → Storage
             ↓
   Metadata Extraction
```
Key decisions:
- Chunk size: 512 tokens with 50-token overlap
- Embedding model: Cohere's embed-english-v3.0
- Storage: Qdrant Cloud for vectors, Supabase for metadata
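The chunking decision above can be sketched as follows. This is a simplified illustration that approximates tokens with whitespace-separated words; a real pipeline would count tokens with the embedding model's tokenizer:

```python
def chunk_text(text, chunk_size=512, overlap=50):
    """Split text into overlapping chunks.

    Tokens are approximated by whitespace-separated words here;
    a production pipeline would use the embedding model's tokenizer.
    """
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break
    return chunks

# Adjacent chunks share `overlap` words of context.
docs = chunk_text("word " * 1000, chunk_size=512, overlap=50)
```

The overlap ensures that a sentence falling on a chunk boundary still appears whole in at least one chunk, which keeps retrieval from missing boundary-straddling facts.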
### 2. Hybrid Retrieval
Pure semantic search misses keyword matches. Pure keyword search misses semantic meaning. The solution is hybrid:
```python
def hybrid_search(query, alpha=0.7):
    semantic_results = vector_search(query)
    keyword_results = bm25_search(query)
    # Weighted Reciprocal Rank Fusion: alpha weights the semantic list
    combined = rrf_merge(semantic_results, keyword_results, alpha)
    # Rerank the fused candidates with a cross-encoder
    return rerank(combined, query)
```
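The post doesn't show `rrf_merge` itself; a minimal weighted Reciprocal Rank Fusion could look like this (the `k=60` smoothing constant is the common default from the RRF literature, and `alpha` splits the weight between the two lists):

```python
def rrf_merge(semantic_results, keyword_results, alpha=0.7, k=60):
    """Weighted Reciprocal Rank Fusion over two ranked lists of doc IDs.

    Each list contributes weight / (k + rank): alpha for the semantic
    list, (1 - alpha) for the keyword list. k=60 is the usual default.
    """
    scores = {}
    for rank, doc_id in enumerate(semantic_results, start=1):
        scores[doc_id] = scores.get(doc_id, 0.0) + alpha / (k + rank)
    for rank, doc_id in enumerate(keyword_results, start=1):
        scores[doc_id] = scores.get(doc_id, 0.0) + (1 - alpha) / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# "b" ranks first because it appears in both result lists.
merged = rrf_merge(["a", "b", "c"], ["b", "d"])  # → ["b", "a", "c", "d"]
```

Fusing on ranks rather than raw scores sidesteps the problem that BM25 scores and cosine similarities live on incomparable scales.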
### 3. Cost Router
The magic happens here. A lightweight classifier routes queries:
| Query Type | Model | Cost per query |
|------------|-------|----------------|
| Simple | Groq Llama-3 | $0.0001 |
| Medium | Gemini 1.5 Flash | $0.001 |
| Complex | GPT-4 | $0.01 |
The classifier uses query features:
- Length and complexity
- Required reasoning depth
- Domain specificity
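A minimal heuristic stand-in for such a router might look like the following. The thresholds, keyword list, and model names here are illustrative only; the post's router is a trained lightweight classifier over similar features:

```python
def route_query(query):
    """Pick a model tier from cheap query features.

    Illustrative heuristic: length stands in for complexity, and a
    small keyword set stands in for required reasoning depth.
    """
    words = query.lower().split()
    analytical = {"compare", "analyze", "why", "explain", "trend", "versus"}
    score = 0
    if len(words) > 20:
        score += 2          # long queries tend to need more reasoning
    elif len(words) > 8:
        score += 1
    if any(w.strip("?,.") in analytical for w in words):
        score += 2          # analytical phrasing suggests deeper reasoning
    if score >= 2:
        return "gpt-4"             # complex
    if score >= 1:
        return "gemini-1.5-flash"  # medium
    return "groq-llama-3"          # simple
```

Even a crude router like this sends the two example queries from the problem statement to opposite ends of the cost spectrum; a trained classifier simply makes the boundary less brittle.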
## Results
- 70% cost reduction compared with routing every query to GPT-4
- < 5% quality degradation on complex queries
- 3x faster response times for simple queries
## Key Learnings
- **Reranking is crucial** - cross-encoders significantly improve retrieval quality
- **Caching matters** - similar queries should hit the cache, not the LLM
- **Fall back gracefully** - if the cheap model fails, escalate to a better model
- **Monitor everything** - track cost per query, latency, and quality metrics
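The graceful-fallback learning can be sketched as a simple escalation loop. Everything here is a placeholder: `call_model` stands in for the real LLM clients and is assumed to return an `(answer, confidence)` pair, and the confidence threshold is arbitrary:

```python
MODEL_TIERS = ["groq-llama-3", "gemini-1.5-flash", "gpt-4"]

def call_model(model, query):
    """Placeholder for the real LLM clients; returns (answer, confidence)."""
    # Pretend the cheapest model is unsure and the others are confident.
    return f"{model} answer", 0.3 if model == "groq-llama-3" else 0.9

def answer_with_fallback(query, min_confidence=0.6):
    """Try cheaper tiers first; escalate on errors or low confidence."""
    best = None
    for model in MODEL_TIERS:
        try:
            answer, confidence = call_model(model, query)
        except Exception:
            continue  # provider error: move to the next tier
        if confidence >= min_confidence:
            return model, answer  # good enough, stop escalating
        best = (model, answer)    # remember the latest low-confidence answer
    if best is None:
        raise RuntimeError("all model tiers failed")
    return best  # accept the best effort rather than failing outright
```

Because escalation only triggers on failure or low confidence, most traffic never touches the expensive tier, which is what makes the routing economics work.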
Check out the full implementation on GitHub.