Documents

A Document represents a file or text you upload to a knowledge base. Once uploaded, the RAG API automatically processes it through the ingestion pipeline: parse, chunk, embed, and index.

Tier 1 files are sent to an external parser that extracts structured text while preserving layout, tables, and headings.

Extension        Format
.pdf             PDF documents
.docx            Microsoft Word
.pptx            Microsoft PowerPoint
.xlsx            Microsoft Excel
.png             PNG images (OCR)
.jpg / .jpeg     JPEG images (OCR)
.webp            WebP images (OCR)
.tiff            TIFF images (OCR)

Tier 1 files incur a per-page parsing cost and count toward your plan’s pages_processed limit.

Tier 2 — Direct Ingestion (no parsing cost)

These text-based formats are ingested directly without an external parser.

Extension        Format
.txt             Plain text
.md              Markdown
.html / .htm     HTML
.csv             Comma-separated values
.tsv             Tab-separated values
.json            JSON

Tier 2 files still count toward pages_processed (estimated at 1 page per ~3,000 bytes) but have zero parsing cost.
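The two tiers and the Tier 2 page estimate can be sketched client-side as follows. This is only an illustration, not the service's own logic: the helper names are hypothetical, matching extensions case-insensitively is an assumption, and so is rounding the estimate up with a one-page minimum (the source gives only "~3,000 bytes" per page).

```python
import math
from pathlib import Path

# Extension sets copied from the two tables above.
TIER_1 = {".pdf", ".docx", ".pptx", ".xlsx", ".png", ".jpg", ".jpeg", ".webp", ".tiff"}
TIER_2 = {".txt", ".md", ".html", ".htm", ".csv", ".tsv", ".json"}

BYTES_PER_PAGE = 3000  # approximate figure quoted above


def ingestion_tier(filename: str) -> int:
    """Return 1 or 2 for a listed extension; raise ValueError otherwise."""
    ext = Path(filename).suffix.lower()  # assumption: case-insensitive match
    if ext in TIER_1:
        return 1
    if ext in TIER_2:
        return 2
    raise ValueError(f"unsupported extension: {ext!r}")


def estimate_tier2_pages(size_bytes: int) -> int:
    """Rough pages_processed estimate for a Tier 2 file.

    Assumption: round up, and count at least one page per document.
    """
    return max(1, math.ceil(size_bytes / BYTES_PER_PAGE))
```

Under these assumptions, a 10,000-byte Markdown file would be estimated at 4 pages; the figure the API actually records may differ.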

Limit                     Value
Max file size             50 MB
Max pages per document    500
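A pre-upload check against these limits might look like the sketch below. The function name is hypothetical, and interpreting "50 MB" as 50 × 1024 × 1024 bytes is an assumption (the source does not say whether the limit is decimal or binary). The page count has to come from the caller, for example from a PDF library.

```python
from typing import List, Optional

MAX_FILE_SIZE_BYTES = 50 * 1024 * 1024  # assumption: "50 MB" read as 50 MiB
MAX_PAGES = 500


def check_limits(size_bytes: int, page_count: Optional[int] = None) -> List[str]:
    """Return a list of human-readable limit violations (empty if none)."""
    problems = []
    if size_bytes > MAX_FILE_SIZE_BYTES:
        problems.append(f"file is {size_bytes} bytes; limit is {MAX_FILE_SIZE_BYTES}")
    # Page count is optional because it is not always known before parsing.
    if page_count is not None and page_count > MAX_PAGES:
        problems.append(f"document has {page_count} pages; limit is {MAX_PAGES}")
    return problems
```

Returning a list rather than raising on the first problem lets a caller report every violation at once.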

Every document moves through these statuses:

pending → parsing → chunking → embedding → ready
↘ failed

Status       Meaning
pending      Queued for processing
parsing      File is being parsed into text (Tier 1 only)
chunking     Text is being split into chunks
embedding    Chunks are being embedded into vectors
ready        Document is searchable
failed       Processing failed; check error_detail

Poll the document status via GET /v1/knowledge-bases/:kb_id/documents/:doc_id until it reaches ready or failed.
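A polling loop along these lines is one way to wait for a terminal status. It is a sketch under assumptions: the response is taken to be a JSON object with a `status` field, and `fetch_document` is a hypothetical callable you supply that performs the GET request and returns the decoded body. The interval and timeout defaults are arbitrary.

```python
import time
from typing import Callable

TERMINAL_STATUSES = {"ready", "failed"}


def wait_for_document(
    fetch_document: Callable[[], dict],
    interval_s: float = 2.0,
    timeout_s: float = 600.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> dict:
    """Call fetch_document() until its status is ready or failed.

    Raises TimeoutError if no terminal status is seen within timeout_s.
    """
    deadline = clock() + timeout_s
    while True:
        doc = fetch_document()
        if doc.get("status") in TERMINAL_STATUSES:
            return doc
        if clock() >= deadline:
            raise TimeoutError("document did not reach a terminal status in time")
        sleep(interval_s)
```

On return, the caller would still need to inspect the status and, for `failed`, read `error_detail`. The `sleep` and `clock` parameters exist only so the loop can be exercised without real waiting.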

For content that’s already text (scraped pages, generated content, API responses), use the text ingestion endpoint instead of file upload:

POST /v1/knowledge-bases/:kb_id/documents/text
{
  "text": "Your text content here...",
  "name": "optional-filename.txt",
  "metadata": { "source": "scraper" }
}

The text is stored as a .txt file and processed through the same chunking → embedding → indexing pipeline. The name and metadata fields are optional.
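The request above could be assembled with Python's standard library as sketched here. The base URL, the bearer-token `Authorization` header, and the helper name are assumptions for illustration; only the path and the `text`, `name`, and `metadata` fields come from this page.

```python
import json
import urllib.request
from typing import Optional


def build_text_ingest_request(
    base_url: str,
    api_key: str,
    kb_id: str,
    text: str,
    name: Optional[str] = None,
    metadata: Optional[dict] = None,
) -> urllib.request.Request:
    """Build (but do not send) a POST to the text ingestion endpoint."""
    body = {"text": text}
    # name and metadata are optional, so include them only when given.
    if name is not None:
        body["name"] = name
    if metadata is not None:
        body["metadata"] = metadata
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/v1/knowledge-bases/{kb_id}/documents/text",
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {api_key}",  # assumption: bearer auth
        },
        method="POST",
    )
```

Passing the returned object to `urllib.request.urlopen` would send it; building and sending are kept separate so the payload can be inspected first.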

Each document supports arbitrary JSON metadata. This metadata is:

  • Stored with the document
  • Returned in search results as document_metadata
  • Filterable at query time using filter operators

{
  "department": "engineering",
  "version": 2,
  "language": "en",
  "author": "jane@example.com"
}

Set metadata at upload time via the metadata form field (file upload) or JSON body field (text ingestion). Update it later via PATCH /v1/knowledge-bases/:kb_id/documents/:doc_id.

See Search & Filtering for how to filter search results by metadata.

To replace a document’s content, use PUT /v1/knowledge-bases/:kb_id/documents/:doc_id. This re-processes the document through the full pipeline, replacing all existing chunks.