# Documents
A Document represents a file or text you upload to a knowledge base. Once uploaded, the RAG API automatically processes it through the ingestion pipeline: parse, chunk, embed, and index.
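The upload endpoint itself is not shown on this page, so the sketch below is for orientation only: the multipart POST /v1/knowledge-bases/:kb_id/documents route, the bearer-token header, and the base URL are all assumptions that follow the path conventions used elsewhere in this section.

```python
import requests

BASE_URL = "https://api.example.com"  # hypothetical base URL
HEADERS = {"Authorization": "Bearer YOUR_API_KEY"}  # assumed auth scheme

# Assumption: files are uploaded via multipart POST to the documents collection.
with open("report.pdf", "rb") as f:
    resp = requests.post(
        f"{BASE_URL}/v1/knowledge-bases/kb_123/documents",
        headers=HEADERS,
        files={"file": ("report.pdf", f, "application/pdf")},
    )
resp.raise_for_status()
print(resp.json()["status"])  # assumed to start as "pending" (see Processing Lifecycle)
```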
## Supported File Types

### Tier 1 — Parsed (via LlamaParse)

These files are sent to an external parser that extracts structured text, preserving layout, tables, and headings.
| Extension | Format |
|---|---|
| .pdf | PDF documents |
| .docx | Microsoft Word |
| .pptx | Microsoft PowerPoint |
| .xlsx | Microsoft Excel |
| .png | PNG images (OCR) |
| .jpg / .jpeg | JPEG images (OCR) |
| .webp | WebP images (OCR) |
| .tiff | TIFF images (OCR) |
Tier 1 files incur a per-page parsing cost and count toward your plan’s pages_processed limit.
### Tier 2 — Direct Ingestion (no parsing cost)

These text-based formats are ingested directly without an external parser.
| Extension | Format |
|---|---|
| .txt | Plain text |
| .md | Markdown |
| .html / .htm | HTML |
| .csv | Comma-separated values |
| .tsv | Tab-separated values |
| .json | JSON |
Tier 2 files still count toward pages_processed (estimated at 1 page per ~3,000 bytes) but have zero parsing cost.
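As a rough sketch of that estimate (assuming the ~3,000-byte heuristic is applied as a simple ceiling; the API's exact rounding is not documented here):

```python
import math

def estimate_tier2_pages(data: bytes) -> int:
    """Client-side guess at pages_processed for a Tier 2 file,
    assuming ~3,000 bytes per page, rounded up."""
    return max(1, math.ceil(len(data) / 3000))

# Example: a 10 KB markdown file counts as roughly 4 pages.
print(estimate_tier2_pages(b"x" * 10_240))  # -> 4
```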
## Limits

| Limit | Value |
|---|---|
| Max file size | 50 MB |
| Max pages per document | 500 |
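A client-side pre-flight check against the size limit might look like the sketch below. The 500-page cap can only be estimated up front for Tier 2 files; for Tier 1 files the page count is known only after parsing, so that limit is enforced server-side.

```python
import os

MAX_FILE_BYTES = 50 * 1024 * 1024  # 50 MB

def check_upload_size(path: str) -> None:
    """Reject files over the 50 MB limit before wasting an upload."""
    size = os.path.getsize(path)
    if size > MAX_FILE_BYTES:
        raise ValueError(f"{path} is {size} bytes; the maximum is {MAX_FILE_BYTES}")
```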
## Processing Lifecycle

Every document moves through these statuses:

```
pending → parsing → chunking → embedding → ready
                                         ↘ failed
```

| Status | Meaning |
|---|---|
| pending | Queued for processing |
| parsing | File is being parsed into text (Tier 1 only) |
| chunking | Text is being split into chunks |
| embedding | Chunks are being embedded into vectors |
| ready | Document is searchable |
| failed | Processing failed — check error_detail |
Poll the document status via GET /v1/knowledge-bases/:kb_id/documents/:doc_id until it reaches ready or failed.
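A minimal polling sketch, assuming a bearer-token header, a placeholder base URL, and a response body that exposes the status and error_detail fields described above:

```python
import time
import requests

BASE_URL = "https://api.example.com"  # hypothetical base URL
HEADERS = {"Authorization": "Bearer YOUR_API_KEY"}  # assumed auth scheme

def wait_until_processed(kb_id: str, doc_id: str,
                         interval: float = 2.0, timeout: float = 600.0) -> dict:
    """Poll the document until it reaches a terminal status (ready or failed)."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        resp = requests.get(
            f"{BASE_URL}/v1/knowledge-bases/{kb_id}/documents/{doc_id}",
            headers=HEADERS,
        )
        resp.raise_for_status()
        doc = resp.json()
        if doc["status"] in ("ready", "failed"):
            return doc
        time.sleep(interval)
    raise TimeoutError(f"document {doc_id} did not finish processing in time")

# doc = wait_until_processed("kb_123", "doc_456")
# if doc["status"] == "failed":
#     print(doc.get("error_detail"))
```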
## Text Ingestion

For content that’s already text (scraped pages, generated content, API responses), use the text ingestion endpoint instead of file upload:

```
POST /v1/knowledge-bases/:kb_id/documents/text

{
  "text": "Your text content here...",
  "name": "optional-filename.txt",
  "metadata": { "source": "scraper" }
}
```

The text is stored as a .txt file and processed through the same chunking → embedding → indexing pipeline. The name and metadata fields are optional.
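Calling the endpoint from Python might look like the sketch below. The route and body fields come from the example above; the base URL, auth header, and the response's id field are assumptions.

```python
import requests

BASE_URL = "https://api.example.com"  # hypothetical base URL
HEADERS = {"Authorization": "Bearer YOUR_API_KEY"}  # assumed auth scheme

resp = requests.post(
    f"{BASE_URL}/v1/knowledge-bases/kb_123/documents/text",
    headers=HEADERS,
    json={
        "text": "Your text content here...",
        "name": "scraped-page.txt",         # optional
        "metadata": {"source": "scraper"},  # optional
    },
)
resp.raise_for_status()
doc_id = resp.json()["id"]  # assumed response shape
```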
## Document Metadata

Each document supports arbitrary JSON metadata. This metadata is:
- Stored with the document
- Returned in search results as document_metadata
- Filterable at query time using filter operators
{ "department": "engineering", "version": 2, "language": "en", "author": "jane@example.com"}Set metadata at upload time via the metadata form field (file upload) or JSON body field (text ingestion). Update it later via PATCH /v1/knowledge-bases/:kb_id/documents/:doc_id.
See Search & Filtering for how to filter search results by metadata.
## Document Replacement

To replace a document’s content, use PUT /v1/knowledge-bases/:kb_id/documents/:doc_id. This re-processes the document through the full pipeline, replacing all existing chunks.