Designing a Cost-Optimized RAG Architecture

January 10, 2025

RAG (Retrieval-Augmented Generation) has become the go-to pattern for building AI applications that need access to custom knowledge. But running RAG at scale presents a challenge: how do you balance response quality with API costs?

The Problem

Enterprise search needs to handle diverse queries:

  • Simple factual lookups ("What's our refund policy?")
  • Complex analytical questions ("Compare Q3 performance across regions")
  • Ambiguous queries requiring clarification

Using GPT-4 for everything is expensive. Using only cheap models sacrifices quality. The solution? Intelligent routing.

The Three-Layer Architecture

1. Ingestion Pipeline

Documents flow through an async processing pipeline:

Upload → Chunking → Embedding → Storage
             ↓
    Metadata Extraction

Key decisions:

  • Chunk size: 512 tokens with 50-token overlap
  • Embedding model: Cohere's embed-english-v3.0
  • Storage: Qdrant Cloud for vectors, Supabase for metadata
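The chunking decision above can be sketched as a simple sliding window over a token list. This is a minimal illustration of the 512-token / 50-token-overlap scheme, assuming tokens have already been produced by whatever tokenizer the embedding model uses (the function name and signature are hypothetical):

```python
def chunk_tokens(tokens, size=512, overlap=50):
    """Split a token list into overlapping chunks.

    Each chunk shares `overlap` tokens with the previous one, so the
    window advances by `size - overlap` tokens per step.
    """
    step = size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):
            break  # final chunk reached the end of the document
    return chunks
```

The overlap trades a little storage and embedding cost for robustness: a sentence that straddles a chunk boundary still appears whole in at least one chunk.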

2. Hybrid Retrieval

Pure semantic search misses keyword matches. Pure keyword search misses semantic meaning. The solution is hybrid:

```python
def hybrid_search(query, alpha=0.7):
    # Run both retrieval modes over the same query
    semantic_results = vector_search(query)  # dense embeddings (Qdrant)
    keyword_results = bm25_search(query)     # sparse BM25 index

    # Reciprocal Rank Fusion, weighted toward semantic results by alpha
    combined = rrf_merge(semantic_results, keyword_results, alpha)

    # Rerank the fused candidates with a cross-encoder for final ordering
    return rerank(combined, query)
```
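The `rrf_merge` step can be sketched concretely. This is a minimal weighted Reciprocal Rank Fusion, assuming each input is a list of document IDs ranked best-first and `alpha` weights the semantic list (the constant `k=60` is the value commonly used in the RRF literature; the exact weighting scheme here is an assumption, not the post's verified implementation):

```python
def rrf_merge(semantic_results, keyword_results, alpha=0.7, k=60):
    """Weighted Reciprocal Rank Fusion: score(doc) = weight / (k + rank)."""
    scores = {}
    for weight, results in ((alpha, semantic_results), (1 - alpha, keyword_results)):
        for rank, doc_id in enumerate(results, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + weight / (k + rank)
    # Highest fused score first
    return sorted(scores, key=scores.get, reverse=True)
```

A document that appears near the top of both lists accumulates score from both terms, which is why RRF surfaces agreement between the two retrievers without needing the raw scores to be comparable.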

3. Cost Router

The magic happens here. A lightweight classifier routes queries:

| Query Type | Model | Cost |
|------------|-------|------|
| Simple | Groq Llama-3 | $0.0001 |
| Medium | Gemini 1.5 Flash | $0.001 |
| Complex | GPT-4 | $0.01 |

The classifier uses query features:

  • Length and complexity
  • Required reasoning depth
  • Domain specificity
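To make the routing idea concrete, here is a toy rule-based version using the features above. A production router would be a trained classifier; the keyword list, thresholds, and model names here are illustrative assumptions:

```python
def route_query(query: str) -> str:
    """Toy router: pick a model tier from surface features of the query.

    A real system would use a trained lightweight classifier; this sketch
    only approximates the "length and complexity" and "reasoning depth"
    signals with word count and analytical keywords.
    """
    words = query.split()
    # Hypothetical markers of analytical/reasoning-heavy queries
    analytical = {"compare", "analyze", "why", "trend", "versus"}
    if any(w.lower().strip("?,.") in analytical for w in words) or len(words) > 30:
        return "gpt-4"             # complex: needs deep reasoning
    if len(words) > 12:
        return "gemini-1.5-flash"  # medium: longer but straightforward
    return "groq-llama-3"          # simple factual lookup
```

Even this crude heuristic routes the two example queries from earlier to opposite ends of the cost table.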

Results

  • 70% cost reduction vs. GPT-4 for everything
  • < 5% quality degradation on complex queries
  • 3x faster response times for simple queries

Key Learnings

  1. Reranking is crucial - Cross-encoders significantly improve retrieval quality over embedding similarity alone
  2. Caching matters - Near-duplicate queries should hit a cache, not the LLM
  3. Fall back gracefully - If the cheap model fails, escalate to a better model
  4. Monitor everything - Track cost per query, latency, and quality metrics
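The graceful-fallback learning can be sketched as an escalation ladder: try the routed model first, and climb to a stronger one whenever the answer fails a quality check. The ladder, function names, and callback signatures below are illustrative assumptions, not the post's actual code:

```python
# Cheapest to most capable, matching the cost table above
MODEL_LADDER = ["groq-llama-3", "gemini-1.5-flash", "gpt-4"]

def answer_with_fallback(query, start_model, call_llm, is_acceptable):
    """Escalate up the model ladder until an answer passes the quality check.

    `call_llm(model, query)` and `is_acceptable(answer)` are caller-supplied
    hooks (hypothetical interfaces for this sketch).
    """
    start = MODEL_LADDER.index(start_model)
    answer = None
    for model in MODEL_LADDER[start:]:
        answer = call_llm(model, query)
        if is_acceptable(answer):
            return model, answer
    # Best effort: every tier failed, return the strongest model's answer
    return MODEL_LADDER[-1], answer
```

Pairing this with per-query cost and latency logging closes the loop on the "monitor everything" point: escalation rates tell you when the router's thresholds need retuning.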

Check out the full implementation on GitHub.