Search
The RAG API search pipeline converts your natural language query into relevant document chunks in four steps.
Pipeline Overview
Query → Embed → Vector Search (pgvector ANN) → Rerank → Filter → Results
1. Embed the Query
Your query text is converted into a vector using the same embedding model (Voyage AI) used during document ingestion. This ensures query vectors and chunk vectors live in the same semantic space.
Caching: Query embeddings are cached in-memory with a 5-minute TTL. A query identical to one embedded within that window skips the embedding step entirely.
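The caching behavior described above can be sketched as a small TTL cache. This is an illustrative sketch only, not the service's actual implementation: the class name `QueryEmbeddingCache`, the `embed_fn` callable, and keying on the exact query string are all assumptions.

```python
import time
from typing import Callable, Dict, List, Tuple


class QueryEmbeddingCache:
    """In-memory cache for query embeddings with a fixed time-to-live.

    Hypothetical helper: the real cache's key and eviction policy are not
    documented, so this keys on the exact query string.
    """

    def __init__(
        self,
        embed_fn: Callable[[str], List[float]],
        ttl_seconds: float = 300.0,  # 5-minute TTL, as stated in the docs
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._embed_fn = embed_fn
        self._ttl = ttl_seconds
        self._clock = clock
        self._entries: Dict[str, Tuple[float, List[float]]] = {}

    def get(self, query: str) -> List[float]:
        now = self._clock()
        entry = self._entries.get(query)
        if entry is not None and now - entry[0] < self._ttl:
            return entry[1]  # cache hit: skip the embedding call
        vector = self._embed_fn(query)  # cache miss or expired entry
        self._entries[query] = (now, vector)
        return vector
```

Passing the clock in as a parameter keeps the expiry logic easy to test without sleeping.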
2. Vector Search (ANN)
The query vector is matched against the chunk vectors in the knowledge base using approximate nearest neighbor (ANN) search via pgvector’s HNSW index. This returns the top candidates ranked by cosine similarity.
The search over-fetches candidates (up to min(top_k * 3, 50)) to give the reranker more material to work with.
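To make the candidate count and the ranking criterion concrete, here is a minimal sketch. It uses an exact brute-force scan as a stand-in for the HNSW index, so it illustrates what is being ranked rather than how pgvector does it. The function names are hypothetical.

```python
import math
from typing import List, Sequence, Tuple


def candidate_count(top_k: int, factor: int = 3, cap: int = 50) -> int:
    """Number of candidates to fetch before reranking: min(top_k * 3, 50)."""
    return min(top_k * factor, cap)


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def exact_top_candidates(
    query_vec: Sequence[float],
    chunks: Sequence[Tuple[str, Sequence[float]]],
    top_k: int,
) -> List[Tuple[str, float]]:
    """Exact (non-approximate) search: score every chunk, keep the best.

    An ANN index such as HNSW returns a similar ranking without scoring
    every chunk, at the cost of occasionally missing a true neighbor.
    """
    scored = [(chunk_id, cosine_similarity(query_vec, vec)) for chunk_id, vec in chunks]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[: candidate_count(top_k)]
```

With the default `top_k` of 10 this fetches 30 candidates. From `top_k = 17` upward the cap of 50 applies.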
3. Rerank
A cross-encoder reranker (Voyage AI Rerank) re-scores the candidates by examining the full query-chunk text pairs. Cross-encoders are more accurate than embedding similarity alone because they see both texts together.
Reranking is enabled by default (rerank: true). You can disable it for lower latency if your use case doesn’t need the accuracy boost.
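The shape of the rerank step can be sketched as follows. The key point is that the scorer receives the query and the chunk text together as a pair. The `toy_overlap_score` function is a deliberately crude word-overlap stand-in for a real cross-encoder model and says nothing about how Voyage AI Rerank scores pairs. All names here are hypothetical.

```python
from typing import Callable, List, Tuple


def rerank(
    query: str,
    candidates: List[str],
    score_pair: Callable[[str, str], float],
    top_k: int,
) -> List[Tuple[str, float]]:
    """Re-score each (query, chunk) pair and keep the top_k highest scores."""
    scored = [(text, score_pair(query, text)) for text in candidates]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


def toy_overlap_score(query: str, text: str) -> float:
    """Stand-in scorer: fraction of query words that appear in the chunk.

    A real cross-encoder is a neural model; this exists only so the
    example runs without external services.
    """
    query_words = set(query.lower().split())
    text_words = set(text.lower().split())
    if not query_words:
        return 0.0
    return len(query_words & text_words) / len(query_words)
```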
4. Filter and Return
Results are filtered by:
- Metadata filters: match on document metadata fields using operators like $eq, $gt, $in, etc.
- Score threshold: drop results below a minimum relevance score
The final top_k results are returned with scores, chunk text, and metadata.
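The filtering step can be sketched as below. The filter shape (`{"field": {"$op": value}}`), the result dictionary keys, and the handling of missing fields are assumptions for illustration. Only the three operators named above are implemented, and the actual API may define their semantics differently.

```python
from typing import Any, Dict, List


def matches(metadata: Dict[str, Any], filter_spec: Dict[str, Dict[str, Any]]) -> bool:
    """Return True if the metadata satisfies every condition in filter_spec."""
    for field, conditions in filter_spec.items():
        if field not in metadata:
            return False  # assumption: a missing field never matches
        value = metadata[field]
        for op, expected in conditions.items():
            if op == "$eq":
                ok = value == expected
            elif op == "$gt":
                ok = value > expected
            elif op == "$in":
                ok = value in expected
            else:
                raise ValueError(f"unsupported operator: {op}")
            if not ok:
                return False
    return True


def filter_results(
    results: List[Dict[str, Any]],
    filter_spec: Dict[str, Dict[str, Any]],
    score_threshold: float,
    top_k: int,
) -> List[Dict[str, Any]]:
    """Drop results below the threshold or failing the filter, keep top_k."""
    kept = [
        r
        for r in results
        if r["score"] >= score_threshold and matches(r.get("metadata", {}), filter_spec)
    ]
    kept.sort(key=lambda r: r["score"], reverse=True)
    return kept[:top_k]
```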
Graceful Degradation
The API is designed to return results even when components fail:
| Failure | Behavior |
|---|---|
| Reranker times out (800ms) | Returns un-reranked results. Response includes usage.rerank_applied: false. |
| Reranker errors | Same as timeout — un-reranked results returned. |
| Embedding fails | Returns 502 EXTERNAL_SERVICE_ERROR. |
Check the usage.rerank_applied field in the response to know whether reranking was applied.
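The reranker fallback in the table can be sketched with a timeout around the rerank call. This is a sketch under assumptions, not the service's code: `rerank_fn` is a hypothetical callable, and the `results` key in the returned dictionary is invented for illustration (only `usage.rerank_applied` appears in the documentation).

```python
import concurrent.futures
from typing import Any, Callable, Dict, List

RERANK_TIMEOUT_SECONDS = 0.8  # 800ms, per the table above


def rerank_with_fallback(
    candidates: List[Any],
    rerank_fn: Callable[[List[Any]], List[Any]],
    timeout: float = RERANK_TIMEOUT_SECONDS,
) -> Dict[str, Any]:
    """Run the reranker, falling back to the original order on timeout or error."""
    executor = concurrent.futures.ThreadPoolExecutor(max_workers=1)
    future = executor.submit(rerank_fn, candidates)
    try:
        results = future.result(timeout=timeout)
        applied = True
    except Exception:
        # Covers both a timeout and an exception raised by the reranker.
        results = list(candidates)
        applied = False
    finally:
        # Do not block the response on a reranker call that is still running.
        executor.shutdown(wait=False)
    return {"results": results, "usage": {"rerank_applied": applied}}
```

A client would then branch on `response["usage"]["rerank_applied"]` if it needs to treat un-reranked results differently, for example by logging them.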
Search Parameters
| Parameter | Type | Default | Description |
|---|---|---|---|
| query | string | (required) | Natural language search query (1-2000 chars) |
| top_k | integer | 10 | Number of results to return (1-50) |
| rerank | boolean | true | Apply cross-encoder reranking |
| filter | object | none | Metadata filter (see Search & Filtering) |
| score_threshold | float | 0.0 | Minimum relevance score (0-1) |
| include_metadata | boolean | true | Include chunk and document metadata |
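As an illustration of the defaults and ranges in the table, the following sketch builds and validates a request body on the client side. The helper name `build_search_request` is hypothetical. Omitting `filter` when it is unset is an assumption. The endpoint and transport are not covered here.

```python
from typing import Any, Dict, Optional


def build_search_request(
    query: str,
    top_k: int = 10,
    rerank: bool = True,
    metadata_filter: Optional[Dict[str, Any]] = None,
    score_threshold: float = 0.0,
    include_metadata: bool = True,
) -> Dict[str, Any]:
    """Build a search request body, enforcing the documented ranges."""
    if not 1 <= len(query) <= 2000:
        raise ValueError("query must be 1-2000 characters")
    if not 1 <= top_k <= 50:
        raise ValueError("top_k must be between 1 and 50")
    if not 0.0 <= score_threshold <= 1.0:
        raise ValueError("score_threshold must be between 0 and 1")

    body: Dict[str, Any] = {
        "query": query,
        "top_k": top_k,
        "rerank": rerank,
        "score_threshold": score_threshold,
        "include_metadata": include_metadata,
    }
    if metadata_filter is not None:
        body["filter"] = metadata_filter  # sent under the documented "filter" key
    return body
```

For example, `build_search_request("how do I rotate keys?", rerank=False)` produces a body that trades reranking accuracy for lower latency.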
Performance Characteristics
| Metric | Typical Value |
|---|---|
| Cache hit (repeated query) | ~50-150ms |
| Full pipeline (embed + ANN + rerank) | ~300-800ms |
| Without reranking | ~150-400ms |
| Benchmark (SciFact nDCG@10) | 0.94 |
Latency depends on the number of chunks in the knowledge base and whether the query embedding is cached.
Search Results Cache
Search results are cached in-memory with a 2-minute TTL. The cache is automatically invalidated when documents are added to or removed from the knowledge base.
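The invalidation behavior can be sketched as a TTL cache scoped by knowledge base. This is a hypothetical illustration: the `kb_id` scoping, the `request_key` (for instance a tuple of the query and its parameters), and the class itself are assumptions, since the actual cache key is not documented.

```python
import time
from typing import Any, Callable, Dict, Hashable, List, Optional, Tuple


class SearchResultsCache:
    """In-memory results cache with a TTL and per-knowledge-base invalidation."""

    def __init__(
        self,
        ttl_seconds: float = 120.0,  # 2-minute TTL, as stated above
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._ttl = ttl_seconds
        self._clock = clock
        self._entries: Dict[Tuple[str, Hashable], Tuple[float, List[Any]]] = {}

    def get(self, kb_id: str, request_key: Hashable) -> Optional[List[Any]]:
        entry = self._entries.get((kb_id, request_key))
        if entry is None:
            return None
        stored_at, results = entry
        if self._clock() - stored_at >= self._ttl:
            del self._entries[(kb_id, request_key)]  # expired
            return None
        return results

    def put(self, kb_id: str, request_key: Hashable, results: List[Any]) -> None:
        self._entries[(kb_id, request_key)] = (self._clock(), results)

    def invalidate(self, kb_id: str) -> None:
        """Call when documents are added to or removed from the knowledge base."""
        stale = [key for key in self._entries if key[0] == kb_id]
        for key in stale:
            del self._entries[key]
```

Invalidating on every document change keeps cached results from pointing at removed chunks or missing newly ingested ones, at the cost of a colder cache during heavy ingestion.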