Search

The Ragex search pipeline converts your natural language query into relevant document chunks. Three search modes are available: vector (default), keyword, and hybrid.

Vector mode (the default) uses semantic embeddings to find chunks that are conceptually similar to your query, even when the exact words differ. It is best for natural language questions and conceptual queries.

{
  "query": "How do I configure authentication?",
  "mode": "vector"
}

Keyword mode uses full-text search to find chunks containing specific terms. No embedding is generated, making it faster and cheaper. It is best for exact term matching, error codes, and proper nouns.

{
  "query": "authentication",
  "mode": "keyword"
}

Hybrid mode runs vector and keyword search in parallel, then fuses the results using Reciprocal Rank Fusion (RRF). It combines the precision of keyword matching with the recall of semantic search.

{
  "query": "How do I configure OAuth?",
  "mode": "hybrid",
  "alpha": 0.6
}

The alpha parameter controls the weighting between vector and keyword results:

  • alpha: 1.0 — vector only (same as mode: "vector")
  • alpha: 0.0 — keyword only (same as mode: "keyword")
  • alpha: 0.6 (default) — slightly favors semantic search
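The fusion step can be sketched as weighted RRF over the two ranked result lists. This is an illustrative implementation, not the service's actual code: the smoothing constant k=60 is the conventional RRF default and is an assumption here.

```python
# Sketch of alpha-weighted Reciprocal Rank Fusion over two ranked
# lists of chunk IDs. k=60 is the conventional RRF constant (assumed;
# the real service may use a different value).

def rrf_fuse(vector_ids, keyword_ids, alpha=0.6, k=60):
    """Fuse two ranked ID lists into one list, weighted by alpha."""
    scores = {}
    for rank, doc_id in enumerate(vector_ids):
        scores[doc_id] = scores.get(doc_id, 0.0) + alpha / (k + rank + 1)
    for rank, doc_id in enumerate(keyword_ids):
        scores[doc_id] = scores.get(doc_id, 0.0) + (1 - alpha) / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)

# A chunk that appears in both lists ("c") accumulates score from each.
fused = rrf_fuse(["a", "b", "c"], ["c", "d"], alpha=0.6)
```

With alpha at 1.0 the keyword contribution vanishes, which matches the bullet above: the fused order reduces to the vector order.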

You can also provide a separate keyword query to use different terms for each search component:

{
  "query": "How do I configure single sign-on?",
  "keyword": "SSO SAML OAuth",
  "mode": "hybrid"
}

The pipeline runs these stages in order:

Query → Embed → Vector Search (ANN) → Rerank → Filter → Results

Your query text is converted into a vector using the same embedding model used during document ingestion. This ensures query vectors and chunk vectors live in the same semantic space.

Caching: Repeated or identical queries are served from cache, reducing latency.

The query vector is compared against all chunk vectors in the knowledge base using approximate nearest neighbor (ANN) search. This returns the top candidates ranked by cosine similarity.

More candidates than your requested top_k are retrieved to give the reranker a larger pool to score from.
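The similarity step above can be sketched with a brute-force scan; the actual pipeline uses an ANN index rather than comparing against every vector, but the ranking criterion (cosine similarity) is the same. The toy vectors and chunk IDs below are made up for illustration.

```python
import math

# Brute-force stand-in for the ANN step: rank chunk vectors by cosine
# similarity to the query vector and keep a candidate pool.

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def top_candidates(query_vec, chunk_vecs, limit):
    """chunk_vecs: list of (chunk_id, vector) pairs."""
    ranked = sorted(chunk_vecs, key=lambda item: cosine(query_vec, item[1]),
                    reverse=True)
    return [chunk_id for chunk_id, _ in ranked[:limit]]

chunks = [("c1", [1.0, 0.0]), ("c2", [0.7, 0.7]), ("c3", [0.0, 1.0])]
pool = top_candidates([1.0, 0.1], chunks, 2)
```

In the real pipeline, `limit` would be larger than the requested top_k so the reranker has a bigger pool to score.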

A cross-encoder reranker re-scores the candidates by examining the full query-chunk text pairs. Cross-encoders are more accurate than embedding similarity alone because they see both texts together.

Reranking is enabled by default (rerank: true). You can disable it for lower latency if your use case doesn’t need the accuracy boost.

Results are filtered by:

  • Metadata filters — match on document metadata fields using operators such as $eq, $gt, and $in
  • Score threshold — drop results below a minimum relevance score
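A minimal sketch of the filter semantics, covering only the three operators named above; the full operator set and exact matching rules are described in Search & Filtering, and the `matches` helper here is hypothetical.

```python
# Evaluate a {field: {operator: target}} filter against chunk metadata.
OPS = {
    "$eq": lambda value, target: value == target,
    "$gt": lambda value, target: value > target,
    "$in": lambda value, target: value in target,
}

def matches(metadata: dict, filter_obj: dict) -> bool:
    """True if metadata satisfies every clause in the filter."""
    for field, clause in filter_obj.items():
        for op, target in clause.items():
            if field not in metadata or not OPS[op](metadata[field], target):
                return False
    return True

ok = matches({"lang": "en", "year": 2024},
             {"year": {"$gt": 2020}, "lang": {"$in": ["en", "de"]}})
```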

The final top_k results are returned with scores, chunk text, and metadata.

The API is designed to return results even when components fail:

Failure               Behavior
Reranker times out    Un-reranked results are returned; the response includes rerank_applied: false.
Reranker errors       Same as timeout — un-reranked results are returned.
Embedding fails       Returns 502 EXTERNAL_SERVICE_ERROR.

Check the usage.rerank_applied field in the response to know whether reranking was applied.
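Client code can branch on that field as below. The response shape is assumed from this page (a JSON object with a usage block); adapt the accessor to the actual response schema.

```python
# Detect the degraded-rerank path by reading usage.rerank_applied.
def rerank_was_applied(response: dict) -> bool:
    return bool(response.get("usage", {}).get("rerank_applied", False))

degraded = {"results": [], "usage": {"rerank_applied": False}}
normal = {"results": [], "usage": {"rerank_applied": True}}
```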

Parameter          Type      Default     Description
query              string    (required)  Natural language search query (1-2000 chars)
top_k              integer   10          Number of results to return (1-50)
rerank             boolean   true        Apply cross-encoder reranking
mode               string    "vector"    Search mode: vector, keyword, or hybrid
keyword            string    none        Separate keyword query for hybrid mode (defaults to query if omitted)
alpha              float     0.6         Vector/keyword weighting in hybrid mode (0 = keyword only, 1 = vector only)
filter             object    none        Metadata filter (see Search & Filtering)
score_threshold    float     0.0         Minimum relevance score (0-1)
include_metadata   boolean   true        Include chunk and document metadata

Metric                                  Typical Value
Cache hit (repeated query)              ~50-150ms
Full pipeline (embed + ANN + rerank)    ~300-800ms
Without reranking                       ~150-400ms
Keyword search                          ~50-200ms
Hybrid search                           ~300-800ms

Latency depends on the number of chunks in the knowledge base and whether the query embedding is cached.

Search results are cached for performance. The cache is automatically invalidated when documents are added to or removed from the knowledge base.
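The invalidation behavior can be sketched as a cache that is cleared on any write to the knowledge base. The `SearchCache` class is illustrative; nothing on this page specifies the real implementation or its keying.

```python
# Toy result cache with write-through invalidation: any document
# add/remove clears cached results so stale chunks are never served.
class SearchCache:
    def __init__(self):
        self._results = {}

    def get(self, query):
        return self._results.get(query)

    def put(self, query, results):
        self._results[query] = results

    def invalidate(self):
        # Called whenever documents are added to or removed from the KB.
        self._results.clear()

cache = SearchCache()
cache.put("auth", ["chunk-1"])
hit = cache.get("auth")
cache.invalidate()  # e.g. after a document upload
```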