Building Vectorized Search for AI Context Retrieval

Written on: April 2026

8 min read

Training or fine-tuning a model on your own data is cool, but it's a lot of work for limited utility. What's more useful is getting a general-purpose model to easily retrieve the context it needs to accomplish the task at hand. MCP enables this (that's what the Model Context Protocol is all about, after all), but until you also build a reasonable way for the model to search for relevant context, it's limited in what it can do for you. In this post, I'm going to walk you through building a vectorized search system that lets an AI model find exactly the right pieces of your data without needing to be trained on any of it.

Why Vectorized Search Over Traditional Search

Before we build anything, let's understand why traditional keyword search falls short for AI context retrieval. If you search for "how to deploy a Node app" using keyword matching, you'll miss documents that talk about "shipping a server to production" even though they mean the same thing. Vector search works on meaning, not exact words. You convert text into numerical representations (embeddings) that capture semantic meaning, then find documents that are close in meaning to the query. This is what makes it perfect for giving AI models the right context — the model asks a question in natural language, and vector search finds the most relevant chunks of your data regardless of how they're worded.
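To make "close in meaning" concrete, here's a tiny sketch of cosine similarity, the measure we'll use to compare embeddings throughout this post. The three-dimensional vectors are made up for illustration only; real embeddings have far more dimensions.

```typescript
// Cosine similarity: 1 means same direction (same meaning),
// 0 means unrelated, -1 means opposite.
function cosineSimilarity(a: number[], b: number[]): number {
  let dot = 0
  let normA = 0
  let normB = 0
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i]
    normA += a[i] * a[i]
    normB += b[i] * b[i]
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB))
}

// Toy 3-dimensional "embeddings" (invented values for illustration)
const deployNodeApp = [0.9, 0.1, 0.2]
const shipServerToProd = [0.85, 0.15, 0.25] // similar meaning, similar vector
const chocolateCake = [0.05, 0.9, 0.1] // unrelated meaning

console.log(cosineSimilarity(deployNodeApp, shipServerToProd)) // close to 1
console.log(cosineSimilarity(deployNodeApp, chocolateCake)) // much lower
```

This is exactly what Postgres computes for us later via the `<=>` cosine-distance operator, just over 1536 dimensions instead of 3.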

Setting Up the Stack

We'll use OpenAI's embedding model to convert text into vectors and a vector database to store and search them. I'm going with PostgreSQL + pgvector here because if you're already running Postgres, you don't need another service. But you could swap this for Pinecone, Weaviate, or Qdrant if you prefer a managed solution.

    # Install dependencies
    npm install openai pg pgvector
    npm install -D @types/pg

    # If you're running Postgres locally, enable pgvector
    # In your psql shell:
    # CREATE EXTENSION vector;
    db/setup.ts
    import { Pool } from 'pg'

    const pool = new Pool({
      connectionString: process.env.DATABASE_URL,
    })

    export async function setupVectorTable() {
      await pool.query('CREATE EXTENSION IF NOT EXISTS vector')

      await pool.query(`
        CREATE TABLE IF NOT EXISTS documents (
          id SERIAL PRIMARY KEY,
          content TEXT NOT NULL,
          metadata JSONB DEFAULT '{}',
          embedding vector(1536)
        )
      `)

      // Create an index for fast similarity search.
      // Note: ivfflat builds its clusters from existing rows, so recall is
      // better if you (re)build this index after loading your data; the
      // pgvector docs suggest lists ≈ rows / 1000 as a starting point.
      await pool.query(`
        CREATE INDEX IF NOT EXISTS documents_embedding_idx
        ON documents
        USING ivfflat (embedding vector_cosine_ops)
        WITH (lists = 100)
      `)
    }

    export { pool }
Chunking Your Data

This is the part most tutorials skip, but it's probably the most important. You can't just throw entire documents into your vector database — the embeddings lose meaning when the text is too long, and you'll retrieve way more context than the model needs. The trick is splitting your data into chunks that are small enough to be specific but large enough to be useful. I usually aim for 500-1000 tokens per chunk with some overlap between chunks so you don't lose context at the boundaries.

    utils/chunker.ts
    interface Chunk {
      content: string
      metadata: {
        source: string
        chunkIndex: number
      }
    }

    export function chunkText(
      text: string,
      source: string,
      maxChunkSize = 800,
      overlap = 200
    ): Chunk[] {
      const sentences = text.match(/[^.!?]+[.!?]+/g) || [text]
      const chunks: Chunk[] = []
      let currentChunk = ''
      let chunkIndex = 0

      for (const sentence of sentences) {
        if ((currentChunk + sentence).length > maxChunkSize && currentChunk) {
          chunks.push({
            content: currentChunk.trim(),
            metadata: { source, chunkIndex },
          })
          chunkIndex++

          // Keep the overlap from the end of the previous chunk
          // (sliced by words, assuming roughly 5 characters per word)
          const words = currentChunk.split(' ')
          const overlapWords = words.slice(-Math.floor(overlap / 5))
          currentChunk = overlapWords.join(' ') + ' ' + sentence
        } else {
          currentChunk += sentence
        }
      }

      if (currentChunk.trim()) {
        chunks.push({
          content: currentChunk.trim(),
          metadata: { source, chunkIndex },
        })
      }

      return chunks
    }
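A quick aside on sizes: `chunkText` measures chunks in characters, while the 500-1000 figure above is in tokens. A rough heuristic for English text is about four characters per token — my own rule of thumb, not the model's real tokenizer — which you can use to sanity-check your `maxChunkSize`:

```typescript
// Rough token estimate for English text: ~4 characters per token.
// This is a sizing heuristic only; for exact counts you'd run the
// model's actual tokenizer.
function estimateTokens(text: string): number {
  return Math.ceil(text.length / 4)
}

// A full 800-character chunk comes out around 200 tokens, so raise
// maxChunkSize if you want chunks closer to the 500-1000 token range.
console.log(estimateTokens('a'.repeat(800))) // → 200
```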
Generating and Storing Embeddings

Now we connect to OpenAI's embedding API to convert our text chunks into vectors and store them in Postgres. The embedding model turns each chunk into a 1536-dimensional vector — basically a list of 1536 numbers that represent the meaning of that text. Two pieces of text that mean similar things will have vectors that are close together in this 1536-dimensional space.

    services/embeddings.ts
    import OpenAI from 'openai'
    import { pool } from '../db/setup'
    import { chunkText } from '../utils/chunker'

    const openai = new OpenAI({
      apiKey: process.env.OPENAI_API_KEY,
    })

    async function generateEmbedding(text: string): Promise<number[]> {
      const response = await openai.embeddings.create({
        model: 'text-embedding-3-small',
        input: text,
      })
      return response.data[0].embedding
    }

    export async function indexDocument(content: string, source: string) {
      const chunks = chunkText(content, source)

      for (const chunk of chunks) {
        const embedding = await generateEmbedding(chunk.content)

        await pool.query(
          `INSERT INTO documents (content, metadata, embedding)
           VALUES ($1, $2, $3)`,
          [
            chunk.content,
            JSON.stringify(chunk.metadata),
            JSON.stringify(embedding),
          ]
        )
      }

      console.log(`Indexed ${chunks.length} chunks from ${source}`)
    }

    export async function searchDocuments(
      query: string,
      limit = 5
    ): Promise<{ content: string; similarity: number }[]> {
      const queryEmbedding = await generateEmbedding(query)

      const result = await pool.query(
        `SELECT content, 1 - (embedding <=> $1::vector) as similarity
         FROM documents
         ORDER BY embedding <=> $1::vector
         LIMIT $2`,
        [JSON.stringify(queryEmbedding), limit]
      )

      return result.rows
    }
Building the Search API

Let's wrap this in a simple API so our AI model (or any client) can search our indexed documents. The endpoint takes a natural language query and returns the most relevant chunks with their similarity scores. I'm keeping this minimal — in production you'd want to add caching, rate limiting, and authentication.

    api/search.ts
    import express from 'express'
    import { searchDocuments, indexDocument } from '../services/embeddings'

    const router = express.Router()

    // Search for relevant context
    router.post('/search', async (req, res) => {
      const { query, limit = 5 } = req.body

      if (!query) {
        return res.status(400).json({ error: 'Query is required' })
      }

      const results = await searchDocuments(query, limit)
      res.json({ results })
    })

    // Index a new document
    router.post('/index', async (req, res) => {
      const { content, source } = req.body

      if (!content || !source) {
        return res.status(400).json({ error: 'Content and source are required' })
      }

      await indexDocument(content, source)
      res.json({ success: true })
    })

    export { router as searchRouter }
Connecting It to an AI Model

Here's where it all comes together. We use vectorized search as a retrieval step before sending the prompt to the AI model. The flow is: user asks a question, we search our vector database for relevant context, then we pass that context along with the question to the model. This is essentially what RAG (Retrieval-Augmented Generation) is — and it's way more practical than fine-tuning for most use cases.

    services/ai-with-context.ts
    import OpenAI from 'openai'
    import { searchDocuments } from './embeddings'

    const openai = new OpenAI({
      apiKey: process.env.OPENAI_API_KEY,
    })

    export async function askWithContext(question: string): Promise<string> {
      // Step 1: Find relevant context from our documents
      const relevantDocs = await searchDocuments(question, 5)

      // Step 2: Build the context string
      const context = relevantDocs
        .filter((doc) => doc.similarity > 0.7)
        .map((doc) => doc.content)
        .join('\n\n---\n\n')

      // Step 3: Send to the AI model with the retrieved context
      const response = await openai.chat.completions.create({
        model: 'gpt-4o',
        messages: [
          {
            role: 'system',
            content: `You are a helpful assistant. Use the following context to answer questions. If the context doesn't contain relevant information, say so.

Context:
${context}`,
          },
          {
            role: 'user',
            content: question,
          },
        ],
      })

      return response.choices[0].message.content || 'No response generated'
    }
How This Connects to MCP

If you're building with MCP (Model Context Protocol), vectorized search becomes a tool that the model can call on its own. Instead of us deciding when to search, the model decides. You expose your search endpoint as an MCP tool, and the model calls it whenever it realizes it needs more context. This is the real power — the model becomes self-sufficient at finding what it needs. Without vector search backing it up, MCP tools that retrieve documents are basically doing keyword matching or returning entire files, which either misses relevant content or floods the context window with noise.

    mcp/search-tool.ts
    import { searchDocuments } from '../services/embeddings'

    // Example MCP tool definition for vector search
    const searchTool = {
      name: 'search_knowledge_base',
      description:
        'Search the knowledge base for relevant documents. Use this when you need specific information to answer a question.',
      input_schema: {
        type: 'object',
        properties: {
          query: {
            type: 'string',
            description: 'Natural language search query',
          },
          max_results: {
            type: 'number',
            description: 'Maximum number of results to return',
            default: 5,
          },
        },
        required: ['query'],
      },
    }

    // When the model calls this tool, you handle it like:
    async function handleToolCall(
      toolName: string,
      input: { query: string; max_results?: number }
    ) {
      if (toolName === 'search_knowledge_base') {
        const results = await searchDocuments(input.query, input.max_results || 5)
        return results.map((r) => r.content).join('\n\n')
      }
    }
Optimizing for Production

A few things I've learned from running vector search in production. First, batch your embedding calls: OpenAI's API accepts multiple inputs in a single request, which is much faster than embedding one chunk at a time. Second, tune your similarity threshold: 0.7 works as a starting point, but you'll want to adjust it based on your data. Too low and you get irrelevant results; too high and you miss useful context. Third, consider hybrid search, combining vector similarity with keyword relevance ranking (in the spirit of BM25) for the best of both worlds. Postgres's built-in full-text search (ts_rank) makes this easy to combine with pgvector in a single query.

    services/embeddings-optimized.ts
    import OpenAI from 'openai'
    import { pool } from '../db/setup'
    import { chunkText } from '../utils/chunker'

    const openai = new OpenAI({
      apiKey: process.env.OPENAI_API_KEY,
    })

    // Batch embedding generation: one API call for many inputs
    async function generateEmbeddings(texts: string[]): Promise<number[][]> {
      const response = await openai.embeddings.create({
        model: 'text-embedding-3-small',
        input: texts,
      })
      return response.data.map((d) => d.embedding)
    }

    async function generateEmbedding(text: string): Promise<number[]> {
      const [embedding] = await generateEmbeddings([text])
      return embedding
    }

    // Hybrid search: vector similarity + keyword matching
    async function hybridSearch(query: string, limit = 5) {
      const queryEmbedding = await generateEmbedding(query)

      const result = await pool.query(
        `SELECT
          content,
          (0.7 * (1 - (embedding <=> $1::vector))) +
          (0.3 * ts_rank(to_tsvector('english', content), plainto_tsquery($2)))
          AS combined_score
        FROM documents
        ORDER BY combined_score DESC
        LIMIT $3`,
        [JSON.stringify(queryEmbedding), query, limit]
      )

      return result.rows
    }

    // Index documents in batches for better performance
    export async function batchIndexDocuments(
      documents: { content: string; source: string }[]
    ) {
      const allChunks = documents.flatMap((doc) =>
        chunkText(doc.content, doc.source)
      )

      // Process in batches of 100
      for (let i = 0; i < allChunks.length; i += 100) {
        const batch = allChunks.slice(i, i + 100)
        const embeddings = await generateEmbeddings(batch.map((c) => c.content))

        // Build a parameterized multi-row INSERT instead of interpolating
        // strings into the SQL (safer, and avoids escaping bugs)
        const params: any[] = []
        const placeholders = batch.map((chunk, idx) => {
          params.push(
            chunk.content,
            JSON.stringify(chunk.metadata),
            JSON.stringify(embeddings[idx])
          )
          const base = idx * 3
          return `($${base + 1}, $${base + 2}, $${base + 3})`
        })

        await pool.query(
          `INSERT INTO documents (content, metadata, embedding)
           VALUES ${placeholders.join(', ')}`,
          params
        )
      }
    }

Conclusion

That's the whole pipeline — from raw documents to AI-powered contextual retrieval. The beauty of vectorized search is that your AI model doesn't need to be trained on your data to understand it. It just needs a way to find the right pieces at the right time, and vector embeddings give it exactly that. Combined with MCP, you get a model that can autonomously decide when it needs more context and go fetch it. Start with a small dataset, get the chunking right, and build from there. If you have any questions or want to dive deeper into any of these topics, feel free to reach out. Happy building!
Aldi Krasniqi