Documents

A Document represents a file or text you upload to a knowledge base. Once uploaded, the RAG API automatically processes it through the ingestion pipeline: parse, chunk, embed, and index.

Supported File Types

Tier 1 — Parsed (via LlamaParse)

These files are sent to an external parser that extracts structured text, preserving layout, tables, and headings.

Extension	Format
`.pdf`	PDF documents
`.docx`	Microsoft Word
`.pptx`	Microsoft PowerPoint
`.xlsx`	Microsoft Excel
`.png`	PNG images (OCR)
`.jpg` / `.jpeg`	JPEG images (OCR)
`.webp`	WebP images (OCR)
`.tiff`	TIFF images (OCR)

Tier 1 files incur a per-page parsing cost and count toward your plan’s `pages_processed` limit.

Tier 2 — Direct Ingestion (no parsing cost)

These text-based formats are ingested directly without an external parser.

Extension	Format
`.txt`	Plain text
`.md`	Markdown
`.html` / `.htm`	HTML
`.csv`	Comma-separated values
`.tsv`	Tab-separated values
`.json`	JSON

Tier 2 files still count toward `pages_processed` (estimated at 1 page per ~3,000 bytes) but incur no parsing cost.
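The tier lists and the page estimate above can be sketched as a small client-side helper. This is an illustration only: the function names are hypothetical, and the case-insensitive extension match, the rounding up, and the one-page minimum are assumptions, since the text only says "1 page per ~3,000 bytes".

```python
import math
from pathlib import Path
from typing import Optional

# Extensions copied from the Tier 1 and Tier 2 tables above.
TIER_1 = {".pdf", ".docx", ".pptx", ".xlsx", ".png", ".jpg", ".jpeg", ".webp", ".tiff"}
TIER_2 = {".txt", ".md", ".html", ".htm", ".csv", ".tsv", ".json"}

BYTES_PER_PAGE = 3000  # "~3,000 bytes" per page; the exact divisor is approximate


def classify_tier(filename: str) -> Optional[int]:
    """Return 1 or 2 for a supported extension, or None if unsupported."""
    ext = Path(filename).suffix.lower()  # assumption: matching ignores case
    if ext in TIER_1:
        return 1
    if ext in TIER_2:
        return 2
    return None


def estimate_tier2_pages(size_bytes: int) -> int:
    """Rough page estimate for a Tier 2 file.

    Assumption: the estimate rounds up and is never below one page.
    """
    return max(1, math.ceil(size_bytes / BYTES_PER_PAGE))
```

For example, `classify_tier("notes.md")` returns `2`, and a 7,500-byte Markdown file is estimated at 3 pages under these assumptions.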

Limits

Limit	Value
Max file size	50 MB
Max pages per document	500
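A pre-upload check against these limits might look like the sketch below. The helper name is hypothetical, and it assumes "50 MB" means 50,000,000 bytes, the stricter of the two common readings, so a file that passes here also passes if the service counts in MiB.

```python
from typing import List, Optional

MAX_FILE_BYTES = 50 * 1000 * 1000  # assumption: decimal megabytes (stricter reading)
MAX_PAGES = 500


def check_limits(size_bytes: int, page_count: Optional[int] = None) -> List[str]:
    """Return a list of limit violations; an empty list means the file fits.

    page_count is optional because the page count is often unknown
    before the file has been parsed.
    """
    problems: List[str] = []
    if size_bytes > MAX_FILE_BYTES:
        problems.append(f"file is {size_bytes} bytes; limit is {MAX_FILE_BYTES}")
    if page_count is not None and page_count > MAX_PAGES:
        problems.append(f"document has {page_count} pages; limit is {MAX_PAGES}")
    return problems
```

Returning a list rather than raising on the first problem lets a caller report every violation at once.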

Processing Lifecycle

Every document moves through these statuses in order and ends in either `ready` or `failed`:

pending → parsing → chunking → embedding → ready
                                              ↘ failed

Status	Meaning
`pending`	Queued for processing
`parsing`	File is being parsed into text (Tier 1 only)
`chunking`	Text is being split into chunks
`embedding`	Chunks are being embedded into vectors
`ready`	Document is searchable
`failed`	Processing failed — check `error_detail`

Poll the document status via `GET /v1/knowledge-bases/:kb_id/documents/:doc_id` until it reaches `ready` or `failed`.
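The polling loop can be sketched as follows. To keep the example independent of any HTTP client, it takes a `fetch_status` callable that is expected to perform the GET request and return the status string; that callable, the function name, the interval, and the timeout are all illustrative assumptions, not part of the API.

```python
import time
from typing import Callable

TERMINAL_STATUSES = {"ready", "failed"}


def wait_for_document(
    fetch_status: Callable[[], str],
    interval_s: float = 2.0,
    timeout_s: float = 300.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> str:
    """Call fetch_status until it returns 'ready' or 'failed'.

    fetch_status is a caller-supplied function that issues
    GET /v1/knowledge-bases/:kb_id/documents/:doc_id and extracts the
    status from the response (the response field name is not specified
    in this section, so that part is left to the caller).
    """
    deadline = clock() + timeout_s
    while True:
        status = fetch_status()
        if status in TERMINAL_STATUSES:
            return status
        if clock() >= deadline:
            raise TimeoutError(f"document still '{status}' after {timeout_s}s")
        sleep(interval_s)
```

A caller would then branch on the result, inspecting `error_detail` when the returned status is `failed`.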

Text Ingestion

For content that’s already text (scraped pages, generated content, API responses), use the text ingestion endpoint instead of file upload:

POST /v1/knowledge-bases/:kb_id/documents/text

{
  "text": "Your text content here...",
  "name": "optional-filename.txt",
  "metadata": { "source": "scraper" }
}

The text is stored as a `.txt` file and processed through the same chunking → embedding → indexing pipeline as uploaded files. The `name` and `metadata` fields are optional.
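As a rough sketch, the request could be assembled with Python's standard library as shown below. The base URL, the bearer-token `Authorization` header, and the function name are assumptions for illustration; this section does not describe authentication or the host.

```python
import json
import urllib.request
from typing import Any, Dict, Optional


def build_text_ingest_request(
    base_url: str,
    api_key: str,
    kb_id: str,
    text: str,
    name: Optional[str] = None,
    metadata: Optional[Dict[str, Any]] = None,
) -> urllib.request.Request:
    """Build (but do not send) a POST to the text ingestion endpoint."""
    body: Dict[str, Any] = {"text": text}
    # name and metadata are optional, so include them only when given.
    if name is not None:
        body["name"] = name
    if metadata is not None:
        body["metadata"] = metadata
    return urllib.request.Request(
        url=f"{base_url}/v1/knowledge-bases/{kb_id}/documents/text",
        data=json.dumps(body).encode("utf-8"),
        method="POST",
        headers={
            "Content-Type": "application/json",
            # Assumption: bearer-token auth; check your account's auth scheme.
            "Authorization": f"Bearer {api_key}",
        },
    )
```

Passing the returned object to `urllib.request.urlopen` would send it; building and sending are kept separate here so the payload can be inspected first.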

Document Metadata

Each document supports arbitrary JSON metadata. This metadata is stored with the document, returned in search results as `document_metadata`, and filterable at query time using filter operators. For example:

{
  "department": "engineering",
  "version": 2,
  "language": "en",
  "author": "jane@example.com"
}

Set metadata at upload time via the `metadata` form field (file upload) or the `metadata` JSON body field (text ingestion). Update it later via `PATCH /v1/knowledge-bases/:kb_id/documents/:doc_id`.

See Search & Filtering for how to filter search results by metadata.

Document Replacement

To replace a document’s content, use `PUT /v1/knowledge-bases/:kb_id/documents/:doc_id`. This re-processes the document through the full pipeline, replacing all existing chunks.