The RAG API search pipeline converts your natural language query into relevant document chunks in four steps.

Query → Embed → Vector Search (pgvector ANN) → Rerank → Filter → Results
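As a sketch, the four steps can be wired together like this. The `embed`, `vector_search`, and `rerank_fn` callables are stand-ins for the real Voyage AI and pgvector components, not the API's internals:

```python
# Sketch of the four-step pipeline with stubbed components.

def search(query, embed, vector_search, rerank_fn, top_k=10,
           use_rerank=True, score_threshold=0.0):
    qvec = embed(query)                            # 1. embed the query
    n = min(top_k * 3, 50)                         # over-fetch for the reranker
    candidates = vector_search(qvec, n)            # 2. ANN search (pgvector HNSW)
    if use_rerank:
        candidates = rerank_fn(query, candidates)  # 3. cross-encoder rescoring
    kept = [c for c in candidates if c["score"] >= score_threshold]  # 4. filter
    return kept[:top_k]

# Stub components, for illustration only:
def fake_embed(text):
    return [1.0, 0.0]

def fake_vector_search(vec, n):
    return [{"text": "chunk A", "score": 0.9}, {"text": "chunk B", "score": 0.2}]

def fake_rerank(query, candidates):
    return sorted(candidates, key=lambda c: c["score"], reverse=True)

results = search("example query", fake_embed, fake_vector_search, fake_rerank,
                 top_k=1, score_threshold=0.5)
```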

Your query text is converted into a vector using the same embedding model (Voyage AI) used during document ingestion. This ensures query vectors and chunk vectors live in the same semantic space.

Caching: Query embeddings are cached in-memory with a 5-minute TTL. Repeated or identical queries skip the embedding step entirely.
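An in-memory cache with a TTL can be sketched as below. This is an illustration of the behavior described above, not the service's actual implementation:

```python
import time

class TTLCache:
    """In-memory cache with per-entry expiry (sketch only)."""

    def __init__(self, ttl_seconds):
        self.ttl = ttl_seconds
        self._store = {}

    def get(self, key):
        entry = self._store.get(key)
        if entry is None:
            return None
        value, expires_at = entry
        if time.monotonic() > expires_at:
            del self._store[key]  # expired; evict lazily on read
            return None
        return value

    def set(self, key, value):
        self._store[key] = (value, time.monotonic() + self.ttl)

embedding_cache = TTLCache(ttl_seconds=300)  # 5-minute TTL, as documented
```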

The query vector is searched against the chunk vectors in the knowledge base using approximate nearest neighbor (ANN) search via pgvector’s HNSW index, which returns the top candidates ranked by cosine similarity. The index trades a small amount of recall for speed by avoiding a scan of every chunk vector.

The search over-fetches candidates (up to min(top_k * 3, 50)) to give the reranker more material to work with.
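The over-fetch rule and the cosine ranking that the HNSW index approximates can be sketched in plain Python (chunk records here are assumed to be dicts with `vec` and `text` keys; the real storage layout is Postgres tables):

```python
import math

def candidate_count(top_k):
    """Number of candidates fetched before reranking: min(top_k * 3, 50)."""
    return min(top_k * 3, 50)

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

def exact_search(query_vec, chunks, top_k):
    """Exact cosine ranking; the HNSW index approximates this result
    without scanning every chunk vector."""
    n = candidate_count(top_k)
    scored = [(cosine_similarity(query_vec, c["vec"]), c["text"]) for c in chunks]
    scored.sort(reverse=True)
    return scored[:n]
```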

A cross-encoder reranker (Voyage AI Rerank) re-scores the candidates by examining the full query-chunk text pairs. Cross-encoders are more accurate than embedding similarity alone because they see both texts together.

Reranking is enabled by default (rerank: true). You can disable it for lower latency if your use case doesn’t need the accuracy boost.
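A latency-sensitive request body might look like the following. Only the parameter names come from this documentation; the query text is made up for illustration:

```python
import json

# Request body for a search that skips cross-encoder reranking
# in exchange for lower latency.
payload = {
    "query": "What is the refund policy?",
    "top_k": 10,
    "rerank": False,  # skip reranking; results keep their ANN similarity order
}
body = json.dumps(payload)
```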

Results are filtered by:

  • Metadata filters — match on document metadata fields using operators like $eq, $gt, $in, etc.
  • Score threshold — drop results below a minimum relevance score
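The semantics of the filter operators can be sketched with a toy evaluator. This is a sketch of the matching behavior, not the server's implementation (which runs against document metadata in Postgres):

```python
def matches(metadata, filt):
    """Toy evaluator for the $eq, $gt, and $in operators mentioned above."""
    for field, cond in filt.items():
        value = metadata.get(field)
        for op, expected in cond.items():
            if op == "$eq" and value != expected:
                return False
            if op == "$gt" and (value is None or not value > expected):
                return False
            if op == "$in" and value not in expected:
                return False
    return True

# Hypothetical metadata and filter, for illustration:
doc_meta = {"year": 2023, "format": "pdf"}
filt = {"year": {"$gt": 2020}, "format": {"$in": ["pdf", "html"]}}
```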

The final top_k results are returned with scores, chunk text, and metadata.

The API is designed to return results even when components fail:

| Failure | Behavior |
|---|---|
| Reranker times out (800 ms) | Un-reranked results returned; response includes rerank_applied: false. |
| Reranker errors | Same as timeout: un-reranked results returned. |
| Embedding fails | Returns 502 EXTERNAL_SERVICE_ERROR. |

Check the usage.rerank_applied field in the response to know whether reranking was applied.
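A client can branch on that field as sketched below; `sample_response` is an illustrative payload, with only the `usage.rerank_applied` field taken from this documentation:

```python
# Detect degraded (un-reranked) results from a response payload.
sample_response = {
    "results": [{"text": "chunk A", "score": 0.91}],
    "usage": {"rerank_applied": False},
}

def rerank_degraded(response):
    """True when the reranker timed out or errored, meaning results
    are ordered by ANN cosine similarity only."""
    return not response["usage"]["rerank_applied"]
```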

| Parameter | Type | Default | Description |
|---|---|---|---|
| query | string | (required) | Natural language search query (1-2000 chars) |
| top_k | integer | 10 | Number of results to return (1-50) |
| rerank | boolean | true | Apply cross-encoder reranking |
| filter | object | none | Metadata filter (see Search & Filtering) |
| score_threshold | float | 0.0 | Minimum relevance score (0-1) |
| include_metadata | boolean | true | Include chunk and document metadata |
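The documented ranges can be checked client-side before sending a request. This is a convenience sketch; the API presumably also validates server-side:

```python
def validate_search_params(params):
    """Check the documented parameter ranges; raises ValueError on the
    first violation (sketch only)."""
    query = params.get("query")
    if not isinstance(query, str) or not 1 <= len(query) <= 2000:
        raise ValueError("query must be a string of 1-2000 characters")
    if not 1 <= params.get("top_k", 10) <= 50:
        raise ValueError("top_k must be between 1 and 50")
    if not 0.0 <= params.get("score_threshold", 0.0) <= 1.0:
        raise ValueError("score_threshold must be between 0 and 1")
    return params
```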

| Metric | Typical Value |
|---|---|
| Cache hit (repeated query) | ~50-150 ms |
| Full pipeline (embed + ANN + rerank) | ~300-800 ms |
| Without reranking | ~150-400 ms |
| Benchmark (SciFact nDCG@10) | 0.94 |

Latency depends on the number of chunks in the knowledge base and whether the query embedding is cached.

Search results are cached in-memory with a 2-minute TTL. The cache is automatically invalidated when documents are added to or removed from the knowledge base.
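The invalidation behavior can be sketched as a cache that is cleared on every knowledge-base mutation (the 2-minute TTL expiry is omitted here for brevity; this is an illustration, not the service's code):

```python
class SearchResultCache:
    """Sketch of the search-result cache's invalidation behavior."""

    def __init__(self):
        self._store = {}

    def get(self, query):
        return self._store.get(query)

    def put(self, query, results):
        self._store[query] = results

    def invalidate(self):
        # Called whenever a document is added to or removed from
        # the knowledge base.
        self._store.clear()
```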