The RAG API search pipeline converts your natural language query into relevant document chunks in four steps.

Query → Embed → Vector Search (pgvector ANN) → Rerank → Filter → Results
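As a sketch, the four steps can be wired together like this. The `embed`, `vector_search`, and `rerank_fn` callables are stand-ins for the real Voyage AI and pgvector components, not the API's internals:

```python
# Sketch of the four-step pipeline with stubbed components.

def search(query, embed, vector_search, rerank_fn, top_k=10,
           use_rerank=True, score_threshold=0.0):
    qvec = embed(query)                            # 1. embed the query
    n = min(top_k * 3, 50)                         # over-fetch for the reranker
    candidates = vector_search(qvec, n)            # 2. ANN search (pgvector HNSW)
    if use_rerank:
        candidates = rerank_fn(query, candidates)  # 3. cross-encoder rescoring
    kept = [c for c in candidates if c["score"] >= score_threshold]  # 4. filter
    return kept[:top_k]

# Stub components, for illustration only:
def fake_embed(text):
    return [1.0, 0.0]

def fake_vector_search(vec, n):
    return [{"text": "chunk A", "score": 0.9}, {"text": "chunk B", "score": 0.2}]

def fake_rerank(query, candidates):
    return sorted(candidates, key=lambda c: c["score"], reverse=True)

results = search("example query", fake_embed, fake_vector_search, fake_rerank,
                 top_k=1, score_threshold=0.5)
```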

Your query text is converted into a vector using the same embedding model (Voyage AI) used during document ingestion. This ensures query vectors and chunk vectors live in the same semantic space.

Caching: Query embeddings are cached in-memory with a 5-minute TTL. Repeated or identical queries skip the embedding step entirely.
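An in-memory cache with a TTL can be sketched as below. This is an illustration of the behavior described above, not the service's actual implementation:

```python
import time

class TTLCache:
    """In-memory cache with per-entry expiry (sketch only)."""

    def __init__(self, ttl_seconds):
        self.ttl = ttl_seconds
        self._store = {}

    def get(self, key):
        entry = self._store.get(key)
        if entry is None:
            return None
        value, expires_at = entry
        if time.monotonic() > expires_at:
            del self._store[key]  # expired; evict lazily on read
            return None
        return value

    def set(self, key, value):
        self._store[key] = (value, time.monotonic() + self.ttl)

embedding_cache = TTLCache(ttl_seconds=300)  # 5-minute TTL, as documented
```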

The query vector is searched against the chunk vectors in the knowledge base using approximate nearest neighbor (ANN) search via pgvector’s HNSW index, which returns the top candidates ranked by cosine similarity. The index trades a small amount of recall for speed by avoiding a scan of every chunk vector.

The search over-fetches candidates (up to min(top_k * 3, 50)) to give the reranker more material to work with.
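The over-fetch rule and the cosine ranking that the HNSW index approximates can be sketched in plain Python (chunk records here are assumed to be dicts with `vec` and `text` keys; the real storage layout is Postgres tables):

```python
import math

def candidate_count(top_k):
    """Number of candidates fetched before reranking: min(top_k * 3, 50)."""
    return min(top_k * 3, 50)

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

def exact_search(query_vec, chunks, top_k):
    """Exact cosine ranking; the HNSW index approximates this result
    without scanning every chunk vector."""
    n = candidate_count(top_k)
    scored = [(cosine_similarity(query_vec, c["vec"]), c["text"]) for c in chunks]
    scored.sort(reverse=True)
    return scored[:n]
```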

A cross-encoder reranker (Voyage AI Rerank) re-scores the candidates by examining the full query-chunk text pairs. Cross-encoders are more accurate than embedding similarity alone because they see both texts together.

Reranking is enabled by default (rerank: true). You can disable it for lower latency if your use case doesn’t need the accuracy boost.
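A latency-sensitive request body might look like the following. Only the parameter names come from this documentation; the query text is made up for illustration:

```python
import json

# Request body for a search that skips cross-encoder reranking
# in exchange for lower latency.
payload = {
    "query": "What is the refund policy?",
    "top_k": 10,
    "rerank": False,  # skip reranking; results keep their ANN similarity order
}
body = json.dumps(payload)
```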

Results are filtered by:

  • Metadata filters — match on document metadata fields using operators like $eq, $gt, $in, etc.
  • Score threshold — drop results below a minimum relevance score
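The semantics of the filter operators can be sketched with a toy evaluator. This is a sketch of the matching behavior, not the server's implementation (which runs against document metadata in Postgres):

```python
def matches(metadata, filt):
    """Toy evaluator for the $eq, $gt, and $in operators mentioned above."""
    for field, cond in filt.items():
        value = metadata.get(field)
        for op, expected in cond.items():
            if op == "$eq" and value != expected:
                return False
            if op == "$gt" and (value is None or not value > expected):
                return False
            if op == "$in" and value not in expected:
                return False
    return True

# Hypothetical metadata and filter, for illustration:
doc_meta = {"year": 2023, "format": "pdf"}
filt = {"year": {"$gt": 2020}, "format": {"$in": ["pdf", "html"]}}
```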

The final top_k results are returned with scores, chunk text, and metadata.

The API is designed to return results even when components fail:

| Failure | Behavior |
|---|---|
| Reranker times out (800 ms) | Un-reranked results returned; response includes rerank_applied: false. |
| Reranker errors | Same as timeout: un-reranked results returned. |
| Embedding fails | Returns 502 EXTERNAL_SERVICE_ERROR. |

Check the usage.rerank_applied field in the response to know whether reranking was applied.
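A client can branch on that field as sketched below; `sample_response` is an illustrative payload, with only the `usage.rerank_applied` field taken from this documentation:

```python
# Detect degraded (un-reranked) results from a response payload.
sample_response = {
    "results": [{"text": "chunk A", "score": 0.91}],
    "usage": {"rerank_applied": False},
}

def rerank_degraded(response):
    """True when the reranker timed out or errored, meaning results
    are ordered by ANN cosine similarity only."""
    return not response["usage"]["rerank_applied"]
```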

| Parameter | Type | Default | Description |
|---|---|---|---|
| query | string | (required) | Natural language search query (1-2000 chars) |
| top_k | integer | 10 | Number of results to return (1-50) |
| rerank | boolean | true | Apply cross-encoder reranking |
| filter | object | none | Metadata filter (see Search & Filtering) |
| score_threshold | float | 0.0 | Minimum relevance score (0-1) |
| include_metadata | boolean | true | Include chunk and document metadata |
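The documented ranges can be checked client-side before sending a request. This is a convenience sketch; the API presumably also validates server-side:

```python
def validate_search_params(params):
    """Check the documented parameter ranges; raises ValueError on the
    first violation (sketch only)."""
    query = params.get("query")
    if not isinstance(query, str) or not 1 <= len(query) <= 2000:
        raise ValueError("query must be a string of 1-2000 characters")
    if not 1 <= params.get("top_k", 10) <= 50:
        raise ValueError("top_k must be between 1 and 50")
    if not 0.0 <= params.get("score_threshold", 0.0) <= 1.0:
        raise ValueError("score_threshold must be between 0 and 1")
    return params
```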

| Metric | Typical Value |
|---|---|
| Cache hit (repeated query) | ~50-150 ms |
| Full pipeline (embed + ANN + rerank) | ~300-800 ms |
| Without reranking | ~150-400 ms |
| Benchmark (SciFact nDCG@10) | 0.94 |

Latency depends on the number of chunks in the knowledge base and whether the query embedding is cached.

Search results are cached in-memory with a 2-minute TTL. The cache is automatically invalidated when documents are added to or removed from the knowledge base.
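The invalidation behavior can be sketched as a cache that is cleared on every knowledge-base mutation (the 2-minute TTL expiry is omitted here for brevity; this is an illustration, not the service's code):

```python
class SearchResultCache:
    """Sketch of the search-result cache's invalidation behavior."""

    def __init__(self):
        self._store = {}

    def get(self, query):
        return self._store.get(query)

    def put(self, query, results):
        self._store[query] = results

    def invalidate(self):
        # Called whenever a document is added to or removed from
        # the knowledge base.
        self._store.clear()
```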