Building Vectorized Search for AI Context Retrieval
Written on: April 2026
• 8 min read

Training or fine-tuning a model on your own data is cool, but it's a lot of work for limited utility. What's more useful is getting a general-use model to easily retrieve the context it needs to accomplish the task at hand. MCP enables this (that's what the Model Context Protocol is all about, after all), but until you also build a reasonable way for the model to search for relevant context, it's limited in what it can do for you. In this post, I'm going to walk you through building a vectorized search system that lets an AI model find exactly the right pieces of your data without needing to be trained on any of it.
Steps
- Why Vectorized Search Over Traditional Search
- Setting Up the Stack
- Chunking Your Data
- Generating and Storing Embeddings
- Building the Search API
- Connecting It to an AI Model
- How This Connects to MCP
- Optimizing for Production
Why Vectorized Search Over Traditional Search

Before we build anything, let's understand why traditional keyword search falls short for AI context retrieval. If you search for "how to deploy a Node app" using keyword matching, you'll miss documents that talk about "shipping a server to production" even though they mean the same thing. Vector search works on meaning, not exact words. You convert text into numerical representations (embeddings) that capture semantic meaning, then find documents that are close in meaning to the query. This is what makes it perfect for giving AI models the right context — the model asks a question in natural language, and vector search finds the most relevant chunks of your data regardless of how they're worded.
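To make "close in meaning" concrete, here's a tiny sketch of cosine similarity, the distance measure we'll lean on later. The three-dimensional vectors are made up for illustration — real embeddings have hundreds or thousands of dimensions:

```typescript
// Cosine similarity: 1 means same direction (same meaning),
// 0 means unrelated, -1 means opposite.
function cosineSimilarity(a: number[], b: number[]): number {
  let dot = 0
  let normA = 0
  let normB = 0
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i]
    normA += a[i] * a[i]
    normB += b[i] * b[i]
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB))
}

// Toy vectors standing in for real embeddings
const deploy = [0.9, 0.1, 0.3]
const ship = [0.8, 0.2, 0.35] // similar meaning, similar direction
const recipe = [0.1, 0.9, 0.0] // unrelated topic

console.log(cosineSimilarity(deploy, ship)) // high, close to 1
console.log(cosineSimilarity(deploy, recipe)) // much lower
```

A query embedding compared against every stored chunk this way (or via an index that approximates it) is all vector search fundamentally is.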
Setting Up the Stack

We'll use OpenAI's embedding model to convert text into vectors and a vector database to store and search them. I'm going with PostgreSQL + pgvector here because if you're already running Postgres, you don't need another service. But you could swap this for Pinecone, Weaviate, or Qdrant if you prefer a managed solution.
```bash
# Install dependencies
npm install openai pg pgvector
npm install -D @types/pg

# If you're running Postgres locally, enable pgvector
# In your psql shell:
# CREATE EXTENSION vector;
```

```typescript
import { Pool } from 'pg'

const pool = new Pool({
  connectionString: process.env.DATABASE_URL,
})

export async function setupVectorTable() {
  await pool.query('CREATE EXTENSION IF NOT EXISTS vector')

  await pool.query(`
    CREATE TABLE IF NOT EXISTS documents (
      id SERIAL PRIMARY KEY,
      content TEXT NOT NULL,
      metadata JSONB DEFAULT '{}',
      embedding vector(1536)
    )
  `)

  // Create an index for fast approximate similarity search
  await pool.query(`
    CREATE INDEX IF NOT EXISTS documents_embedding_idx
    ON documents
    USING ivfflat (embedding vector_cosine_ops)
    WITH (lists = 100)
  `)
}

export { pool }
```

Chunking Your Data

This is the part most tutorials skip, but it's probably the most important. You can't just throw entire documents into your vector database — the embeddings lose meaning when the text is too long, and you'll retrieve way more context than the model needs. The trick is splitting your data into chunks that are small enough to be specific but large enough to be useful. I usually aim for 500-1000 tokens per chunk with some overlap between chunks so you don't lose context at the boundaries.
```typescript
interface Chunk {
  content: string
  metadata: {
    source: string
    chunkIndex: number
  }
}

// Sizes here are in characters, a rough proxy for tokens
// (about 4 characters per token for English text)
export function chunkText(
  text: string,
  source: string,
  maxChunkSize = 800,
  overlap = 200
): Chunk[] {
  // Naive sentence splitter; good enough for prose
  const sentences = text.match(/[^.!?]+[.!?]+/g) || [text]
  const chunks: Chunk[] = []
  let currentChunk = ''
  let chunkIndex = 0

  for (const sentence of sentences) {
    if ((currentChunk + sentence).length > maxChunkSize && currentChunk) {
      chunks.push({
        content: currentChunk.trim(),
        metadata: { source, chunkIndex },
      })
      chunkIndex++

      // Keep the overlap from the end of the previous chunk
      // (assumes ~5 characters per word to convert the overlap into a word count)
      const words = currentChunk.split(' ')
      const overlapWords = words.slice(-Math.floor(overlap / 5))
      currentChunk = overlapWords.join(' ') + ' ' + sentence
    } else {
      currentChunk += sentence
    }
  }

  if (currentChunk.trim()) {
    chunks.push({
      content: currentChunk.trim(),
      metadata: { source, chunkIndex },
    })
  }

  return chunks
}
```

Generating and Storing Embeddings

Now we connect to OpenAI's embedding API to convert our text chunks into vectors and store them in Postgres. The embedding model turns each chunk into a 1536-dimensional vector — basically a list of 1536 numbers that represent the meaning of that text. Two pieces of text that mean similar things will have vectors that are close together in this 1536-dimensional space.
```typescript
import OpenAI from 'openai'
import { pool } from '../db/setup'
import { chunkText } from '../utils/chunker'

const openai = new OpenAI({
  apiKey: process.env.OPENAI_API_KEY,
})

async function generateEmbedding(text: string): Promise<number[]> {
  const response = await openai.embeddings.create({
    model: 'text-embedding-3-small',
    input: text,
  })
  return response.data[0].embedding
}

export async function indexDocument(content: string, source: string) {
  const chunks = chunkText(content, source)

  for (const chunk of chunks) {
    const embedding = await generateEmbedding(chunk.content)

    // pgvector accepts a JSON-style array literal ('[0.1, 0.2, ...]') as vector input
    await pool.query(
      `INSERT INTO documents (content, metadata, embedding)
       VALUES ($1, $2, $3)`,
      [
        chunk.content,
        JSON.stringify(chunk.metadata),
        JSON.stringify(embedding),
      ]
    )
  }

  console.log(`Indexed ${chunks.length} chunks from ${source}`)
}

export async function searchDocuments(
  query: string,
  limit = 5
): Promise<{ content: string; similarity: number }[]> {
  const queryEmbedding = await generateEmbedding(query)

  // <=> is pgvector's cosine distance operator; 1 - distance gives similarity
  const result = await pool.query(
    `SELECT content, 1 - (embedding <=> $1::vector) AS similarity
     FROM documents
     ORDER BY embedding <=> $1::vector
     LIMIT $2`,
    [JSON.stringify(queryEmbedding), limit]
  )

  return result.rows
}
```

Building the Search API

Let's wrap this in a simple API so our AI model (or any client) can search our indexed documents. The endpoint takes a natural language query and returns the most relevant chunks with their similarity scores. I'm keeping this minimal — in production you'd want to add caching, rate limiting, and authentication.
```typescript
import express from 'express'
import { searchDocuments, indexDocument } from '../services/embeddings'

const router = express.Router()

// Search for relevant context
router.post('/search', async (req, res) => {
  const { query, limit = 5 } = req.body

  if (!query) {
    return res.status(400).json({ error: 'Query is required' })
  }

  try {
    const results = await searchDocuments(query, limit)
    res.json({ results })
  } catch (err) {
    // Express 4 doesn't catch rejected promises in async handlers on its own
    res.status(500).json({ error: 'Search failed' })
  }
})

// Index a new document
router.post('/index', async (req, res) => {
  const { content, source } = req.body

  if (!content || !source) {
    return res.status(400).json({ error: 'Content and source are required' })
  }

  try {
    await indexDocument(content, source)
    res.json({ success: true })
  } catch (err) {
    res.status(500).json({ error: 'Indexing failed' })
  }
})

export { router as searchRouter }
```

Connecting It to an AI Model

Here's where it all comes together. We use vectorized search as a retrieval step before sending the prompt to the AI model. The flow is: user asks a question, we search our vector database for relevant context, then we pass that context along with the question to the model. This is essentially what RAG (Retrieval-Augmented Generation) is — and it's way more practical than fine-tuning for most use cases.
```typescript
import OpenAI from 'openai'
import { searchDocuments } from './embeddings'

const openai = new OpenAI({
  apiKey: process.env.OPENAI_API_KEY,
})

export async function askWithContext(question: string): Promise<string> {
  // Step 1: Find relevant context from our documents
  const relevantDocs = await searchDocuments(question, 5)

  // Step 2: Build the context string, dropping low-similarity matches
  const context = relevantDocs
    .filter((doc) => doc.similarity > 0.7)
    .map((doc) => doc.content)
    .join('\n\n---\n\n')

  // Step 3: Send to the AI model with the retrieved context
  const response = await openai.chat.completions.create({
    model: 'gpt-4o',
    messages: [
      {
        role: 'system',
        content: `You are a helpful assistant. Use the following context to answer questions. If the context doesn't contain relevant information, say so.

Context:
${context}`,
      },
      {
        role: 'user',
        content: question,
      },
    ],
  })

  return response.choices[0].message.content || 'No response generated'
}
```

How This Connects to MCP

If you're building with MCP (Model Context Protocol), vectorized search becomes a tool that the model can call on its own. Instead of us deciding when to search, the model decides. You expose your search endpoint as an MCP tool, and the model calls it whenever it realizes it needs more context. This is the real power — the model becomes self-sufficient at finding what it needs. Without vector search backing it up, MCP tools that retrieve documents are basically doing keyword matching or returning entire files, which either misses relevant content or floods the context window with noise.
```typescript
// Example MCP tool definition for vector search
const searchTool = {
  name: 'search_knowledge_base',
  description:
    'Search the knowledge base for relevant documents. Use this when you need specific information to answer a question.',
  input_schema: {
    type: 'object',
    properties: {
      query: {
        type: 'string',
        description: 'Natural language search query',
      },
      max_results: {
        type: 'number',
        description: 'Maximum number of results to return',
        default: 5,
      },
    },
    required: ['query'],
  },
}

// When the model calls this tool, you handle it like:
async function handleToolCall(
  toolName: string,
  input: { query: string; max_results?: number }
) {
  if (toolName === 'search_knowledge_base') {
    const results = await searchDocuments(input.query, input.max_results || 5)
    return results.map((r) => r.content).join('\n\n')
  }
}
```

Optimizing for Production

A few things I've learned from running vector search in production. First, batch your embedding calls — OpenAI's API supports multiple inputs in a single request, which is way faster than one-at-a-time. Second, tune your similarity threshold — 0.7 works as a starting point but you'll want to adjust based on your data. Too low and you get irrelevant results, too high and you miss useful context. Third, consider hybrid search — combine vector similarity with keyword matching (BM25-style full-text ranking) for the best of both worlds. Postgres supports this out of the box with ts_rank alongside pgvector.
```typescript
// Batch embedding generation — one API call for many inputs
async function generateEmbeddings(texts: string[]): Promise<number[][]> {
  const response = await openai.embeddings.create({
    model: 'text-embedding-3-small',
    input: texts,
  })
  return response.data.map((d) => d.embedding)
}

// Hybrid search: weighted blend of vector similarity and keyword matching
async function hybridSearch(query: string, limit = 5) {
  const queryEmbedding = await generateEmbedding(query)

  const result = await pool.query(
    `SELECT
       content,
       (0.7 * (1 - (embedding <=> $1::vector))) +
       (0.3 * ts_rank(to_tsvector('english', content), plainto_tsquery($2)))
       AS combined_score
     FROM documents
     ORDER BY combined_score DESC
     LIMIT $3`,
    [JSON.stringify(queryEmbedding), query, limit]
  )

  return result.rows
}

// Index documents in batches for better performance
export async function batchIndexDocuments(
  documents: { content: string; source: string }[]
) {
  const allChunks = documents.flatMap((doc) =>
    chunkText(doc.content, doc.source)
  )

  // Process in batches of 100
  for (let i = 0; i < allChunks.length; i += 100) {
    const batch = allChunks.slice(i, i + 100)
    const embeddings = await generateEmbeddings(
      batch.map((c) => c.content)
    )

    // Parameterized multi-row insert — never interpolate content into SQL directly
    const placeholders = batch
      .map((_, idx) => `($${idx * 3 + 1}, $${idx * 3 + 2}, $${idx * 3 + 3})`)
      .join(', ')
    const params = batch.flatMap((chunk, idx) => [
      chunk.content,
      JSON.stringify(chunk.metadata),
      JSON.stringify(embeddings[idx]),
    ])

    await pool.query(
      `INSERT INTO documents (content, metadata, embedding)
       VALUES ${placeholders}`,
      params
    )
  }
}
```
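If you'd rather combine vector and keyword results in application code instead of a single weighted SQL expression, reciprocal rank fusion (RRF) is a common alternative — it only looks at each system's ranking, so you never have to normalize two incompatible score scales. This helper is a hypothetical sketch, not part of the code above; the document ids are made up for illustration:

```typescript
// Reciprocal rank fusion: each result list contributes 1 / (k + rank)
// per document, so documents ranked highly by either list float to the top.
// k dampens the weight of top ranks; 60 is the value from the original RRF paper.
function reciprocalRankFusion(
  resultLists: string[][], // each list is document ids, best match first
  k = 60
): { id: string; score: number }[] {
  const scores = new Map<string, number>()
  for (const list of resultLists) {
    list.forEach((id, rank) => {
      scores.set(id, (scores.get(id) ?? 0) + 1 / (k + rank + 1))
    })
  }
  return [...scores.entries()]
    .map(([id, score]) => ({ id, score }))
    .sort((a, b) => b.score - a.score)
}

// A doc ranked well by both lists beats one ranked well by only one
const fused = reciprocalRankFusion([
  ['doc-a', 'doc-b', 'doc-c'], // vector search ranking
  ['doc-a', 'doc-c', 'doc-d'], // keyword ranking
])
console.log(fused[0].id) // 'doc-a'
```

The upside over the weighted-sum query is that you can fuse any number of rankers (vector, full-text, recency) without tuning the 0.7/0.3 weights; the downside is two round trips to the database instead of one.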