Designing a Cost-Optimized RAG Architecture

January 10, 2025

RAG (Retrieval-Augmented Generation) has become the go-to pattern for building AI applications that need access to custom knowledge. But running RAG at scale presents a challenge: how do you balance response quality with API costs?

The Problem

Enterprise search needs to handle diverse queries:

  • Simple factual lookups ("What's our refund policy?")
  • Complex analytical questions ("Compare Q3 performance across regions")
  • Ambiguous queries requiring clarification

Using GPT-4 for everything is expensive. Using only cheap models sacrifices quality. The solution? Intelligent routing.

The Three-Layer Architecture

1. Ingestion Pipeline

Documents flow through an async processing pipeline:

Upload → Chunking → Embedding → Storage
             ↓
    Metadata Extraction

Key decisions:

  • Chunk size: 512 tokens with 50-token overlap
  • Embedding model: Cohere's embed-english-v3.0
  • Storage: Qdrant Cloud for vectors, Supabase for metadata
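The chunking decision above can be sketched as a simple sliding window over a token list. This is a minimal illustration of the 512-token / 50-token-overlap scheme, assuming tokens have already been produced by whatever tokenizer the embedding model uses (the function name and signature are hypothetical):

```python
def chunk_tokens(tokens, size=512, overlap=50):
    """Split a token list into overlapping chunks.

    Each chunk shares `overlap` tokens with the previous one, so the
    window advances by `size - overlap` tokens per step.
    """
    step = size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):
            break  # final chunk reached the end of the document
    return chunks
```

The overlap trades a little storage and embedding cost for robustness: a sentence that straddles a chunk boundary still appears whole in at least one chunk.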

2. Hybrid Retrieval

Pure semantic search misses keyword matches. Pure keyword search misses semantic meaning. The solution is hybrid:

```python
def hybrid_search(query, alpha=0.7):
    # Run both retrieval modes over the same query
    semantic_results = vector_search(query)  # dense embeddings (Qdrant)
    keyword_results = bm25_search(query)     # sparse BM25 index

    # Reciprocal Rank Fusion, weighted toward semantic results by alpha
    combined = rrf_merge(semantic_results, keyword_results, alpha)

    # Rerank the fused candidates with a cross-encoder for final ordering
    return rerank(combined, query)
```
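The `rrf_merge` step can be sketched concretely. This is a minimal weighted Reciprocal Rank Fusion, assuming each input is a list of document IDs ranked best-first and `alpha` weights the semantic list (the constant `k=60` is the value commonly used in the RRF literature; the exact weighting scheme here is an assumption, not the post's verified implementation):

```python
def rrf_merge(semantic_results, keyword_results, alpha=0.7, k=60):
    """Weighted Reciprocal Rank Fusion: score(doc) = weight / (k + rank)."""
    scores = {}
    for weight, results in ((alpha, semantic_results), (1 - alpha, keyword_results)):
        for rank, doc_id in enumerate(results, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + weight / (k + rank)
    # Highest fused score first
    return sorted(scores, key=scores.get, reverse=True)
```

A document that appears near the top of both lists accumulates score from both terms, which is why RRF surfaces agreement between the two retrievers without needing the raw scores to be comparable.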

3. Cost Router

The magic happens here. A lightweight classifier routes queries:

| Query Type | Model | Cost |
|------------|-------|------|
| Simple | Groq Llama-3 | $0.0001 |
| Medium | Gemini 1.5 Flash | $0.001 |
| Complex | GPT-4 | $0.01 |

The classifier uses query features:

  • Length and complexity
  • Required reasoning depth
  • Domain specificity
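To make the routing idea concrete, here is a toy rule-based version using the features above. A production router would be a trained classifier; the keyword list, thresholds, and model names here are illustrative assumptions:

```python
def route_query(query: str) -> str:
    """Toy router: pick a model tier from surface features of the query.

    A real system would use a trained lightweight classifier; this sketch
    only approximates the "length and complexity" and "reasoning depth"
    signals with word count and analytical keywords.
    """
    words = query.split()
    # Hypothetical markers of analytical/reasoning-heavy queries
    analytical = {"compare", "analyze", "why", "trend", "versus"}
    if any(w.lower().strip("?,.") in analytical for w in words) or len(words) > 30:
        return "gpt-4"             # complex: needs deep reasoning
    if len(words) > 12:
        return "gemini-1.5-flash"  # medium: longer but straightforward
    return "groq-llama-3"          # simple factual lookup
```

Even this crude heuristic routes the two example queries from earlier to opposite ends of the cost table.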

Results

  • 70% cost reduction vs. GPT-4 for everything
  • < 5% quality degradation on complex queries
  • 3x faster response times for simple queries

Key Learnings

  1. Reranking is crucial - Cross-encoders significantly improve retrieval quality over embedding similarity alone
  2. Caching matters - Near-duplicate queries should hit a cache, not the LLM
  3. Fall back gracefully - If the cheap model fails, escalate to a better model
  4. Monitor everything - Track cost per query, latency, and quality metrics
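The graceful-fallback learning can be sketched as an escalation ladder: try the routed model first, and climb to a stronger one whenever the answer fails a quality check. The ladder, function names, and callback signatures below are illustrative assumptions, not the post's actual code:

```python
# Cheapest to most capable, matching the cost table above
MODEL_LADDER = ["groq-llama-3", "gemini-1.5-flash", "gpt-4"]

def answer_with_fallback(query, start_model, call_llm, is_acceptable):
    """Escalate up the model ladder until an answer passes the quality check.

    `call_llm(model, query)` and `is_acceptable(answer)` are caller-supplied
    hooks (hypothetical interfaces for this sketch).
    """
    start = MODEL_LADDER.index(start_model)
    answer = None
    for model in MODEL_LADDER[start:]:
        answer = call_llm(model, query)
        if is_acceptable(answer):
            return model, answer
    # Best effort: every tier failed, return the strongest model's answer
    return MODEL_LADDER[-1], answer
```

Pairing this with per-query cost and latency logging closes the loop on the "monitor everything" point: escalation rates tell you when the router's thresholds need retuning.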

Check out the full implementation on GitHub.