Building a Vector Database from Scratch

January 15, 2025

Vector databases are at the heart of modern AI applications, powering everything from semantic search to RAG (Retrieval Augmented Generation) systems. In this post, I'll walk through my journey of building one from scratch.

Why Build From Scratch?

The best way to truly understand a technology is to build it yourself. While production systems like Pinecone, Weaviate, and Qdrant are excellent, building your own helps you understand:

  • How embeddings represent semantic meaning
  • Why different distance metrics matter
  • The trade-offs between accuracy and performance
  • How indexing structures like HNSW work

Core Concepts

Embeddings

Embeddings are numerical representations of data (text, images, etc.) in a high-dimensional vector space. Similar items are placed close together in this space.

from sentence_transformers import SentenceTransformer

model = SentenceTransformer('all-MiniLM-L6-v2')
embedding = model.encode("Hello, world!")  # 384-dimensional vector

Distance Metrics

The choice of distance metric affects how similarity is measured; a short NumPy sketch of all three follows the list:

  • Euclidean Distance: Straight-line distance between points
  • Cosine Similarity: Angle between vectors (ignores magnitude)
  • Dot Product: Combines magnitude and direction
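To make the differences concrete, here's what the three metrics look like in NumPy. The function names are my own, just for illustration:

import numpy as np

def euclidean_distance(a, b):
    # Straight-line distance; smaller means more similar
    return np.linalg.norm(a - b)

def cosine_similarity(a, b):
    # Angle between vectors; 1.0 = same direction, 0.0 = orthogonal
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

def dot_product(a, b):
    # Sensitive to both magnitude and direction
    return np.dot(a, b)

Cosine similarity is a common default for text embeddings, since it ignores vector length, which usually isn't meaningful for sentence encoders.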

The VectorDB Class

Here's the core structure of my implementation:

import numpy as np

class VectorDB:
    def __init__(self, dimensions: int, metric: str = 'euclidean'):
        self.dimensions = dimensions
        self.metric = metric
        self.vectors = []    # stored vectors, in insertion order
        self.metadata = []   # metadata[i] belongs to vectors[i]

    def insert(self, vector, metadata=None):
        self.vectors.append(vector)
        self.metadata.append(metadata)

    def search(self, query, k=5):
        # Brute-force k-nearest-neighbour search over every stored vector
        distances = self._compute_distances(query)
        top_k = np.argsort(distances)[:k]
        return [{'id': i, 'distance': distances[i]} for i in top_k]
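The snippet above leaves out _compute_distances. A minimal version supporting the three metrics from earlier might look like this (a sketch, not necessarily what's in the repo):

    def _compute_distances(self, query):
        # Stack stored vectors into an (n, d) matrix and score them against the query
        matrix = np.array(self.vectors)
        query = np.asarray(query)
        if self.metric == 'euclidean':
            return np.linalg.norm(matrix - query, axis=1)
        if self.metric == 'cosine':
            # Turn similarity into a distance so that smaller is always better
            sims = matrix @ query / (np.linalg.norm(matrix, axis=1) * np.linalg.norm(query))
            return 1 - sims
        if self.metric == 'dot':
            # Negate so argsort's ascending order still puts the best matches first
            return -(matrix @ query)
        raise ValueError(f"Unknown metric: {self.metric}")

With that in place (and the model from the embeddings section above), usage looks roughly like:

db = VectorDB(dimensions=384, metric='cosine')
db.insert(model.encode("The cat sat on the mat"), metadata={'doc': 'cats'})
db.insert(model.encode("Stock prices fell sharply"), metadata={'doc': 'finance'})
results = db.search(model.encode("kitten on a rug"), k=1)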

What I Learned

  1. Brute force is fine for small datasets - O(n) search works well for < 100k vectors
  2. Normalization matters - For cosine similarity, pre-normalizing vectors speeds up search (see the sketch after this list)
  3. Metadata is crucial - Storing associated data with vectors enables filtering
  4. Persistence is tricky - Efficient serialization requires careful design
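To expand on point 2: if every vector is normalized to unit length when it's inserted, cosine similarity collapses to a plain dot product, so there are no per-vector norms to compute at query time. A minimal sketch with dummy data standing in for real embeddings:

import numpy as np

def normalize(v):
    # Scale to unit length so cosine similarity reduces to a dot product
    v = np.asarray(v, dtype=np.float32)
    return v / np.linalg.norm(v)

# Random vectors standing in for real embeddings
raw_vectors = np.random.rand(1000, 384)
query = np.random.rand(384)

# Normalize once, at insert time...
stored = np.array([normalize(v) for v in raw_vectors])

# ...then every search is a single matrix-vector product
sims = stored @ normalize(query)
top_k = np.argsort(-sims)[:5]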

Next Steps

I'm now exploring:

  • HNSW indexing for O(log n) search
  • Product quantization for memory efficiency
  • Hybrid search combining semantic and keyword matching

Check out the full implementation on GitHub.