Plain-text RAG breaks on the documents that matter most in regulated industries. Legal contracts have signatures and redlines. Financial reports have charts. Medical records have scans. Court-ready RAG needs a vision-aware extraction layer, multimodal embeddings, and a citation surface that traces back to the page region. This is the shape we built for Affidavit Mapp.
Each problem has a named solution in 2026. None of them are the fastest path. All of them earn the build when the documents are regulated, complex, or both.
In tables, the row/column structure carries the meaning. Naive text extraction loses that relational structure, and the LLM hallucinates it back. Solution: table-aware extraction (Unstructured / Camelot) plus structured chunking that preserves header context.
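A minimal sketch of header-preserving chunking, assuming the table has already been extracted into a header list and row lists. The function name `chunk_table`, the `rows_per_chunk` parameter, and the sample statement data are hypothetical, not part of any library named above.

```python
from typing import List


def chunk_table(headers: List[str], rows: List[List[str]],
                caption: str = "", rows_per_chunk: int = 2) -> List[str]:
    """Split a table into text chunks where every row carries its headers."""
    chunks = []
    for start in range(0, len(rows), rows_per_chunk):
        # Repeat the caption in each chunk so it stays attached after retrieval.
        lines = [caption] if caption else []
        for row in rows[start:start + rows_per_chunk]:
            # Pair each cell with its column header so a row reads on its own.
            lines.append("; ".join(f"{h}: {c}" for h, c in zip(headers, row)))
        chunks.append("\n".join(lines))
    return chunks


# Illustrative data only.
headers = ["Date", "Payee", "Amount"]
rows = [
    ["2024-01-03", "Acme Ltd", "1200.00"],
    ["2024-01-09", "Rent", "950.00"],
    ["2024-01-15", "Utilities", "80.50"],
]
chunks = chunk_table(headers, rows, caption="Table: January statement")
```

Because each chunk repeats the caption and the column names, a retrieved row such as "Amount: 80.50" still says which column and which table it came from.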
Bar charts, schematics, diagrams. Embeddable as images; retrievable as captions plus images. Solution: index both a vision-LLM caption and the original image, and retrieve them together at query time.
OCR is no longer enough. A scan of a hand-annotated contract has signatures, redlines, and stamps that matter. Solution: a vision LLM reads the page directly (Claude vision / GPT vision), with no OCR intermediate.
Multi-column PDFs, footnotes, sidebars, headers. Reading order breaks every naive extraction. Solution: layout-aware parsers (Unstructured) plus a layout-respecting chunker, often with a vision LLM as the layout judge for ambiguous pages.
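To illustrate why reading order matters, here is a small sketch of a two-column ordering heuristic. It assumes a parser has already produced text blocks with bounding boxes, that the page has exactly two columns split at the midline, and that any full-width block sits above the columns. The `Block` type and `reading_order` function are hypothetical; a real layout-aware parser handles far more cases.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Block:
    text: str
    x0: float  # left edge
    y0: float  # top edge (y grows downward)
    x1: float  # right edge


def reading_order(blocks: List[Block], page_width: float) -> List[Block]:
    """Order blocks as: full-width first, then left column, then right column."""
    mid = page_width / 2

    def column(b: Block) -> int:
        if b.x0 < mid < b.x1:
            return 0  # straddles the midline: treated as a full-width header
        return 1 if b.x1 <= mid else 2  # left column, else right column

    # Within each group, read top to bottom.
    return sorted(blocks, key=lambda b: (column(b), b.y0))


# Blocks arrive in arbitrary order from the extractor.
blocks = [
    Block("R2", 310, 300, 550),
    Block("L1", 50, 100, 290),
    Block("Title", 50, 20, 550),
    Block("R1", 310, 100, 550),
    Block("L2", 50, 300, 290),
]
ordered = [b.text for b in reading_order(blocks, page_width=600)]
```

A naive top-to-bottom sort would interleave the columns (L1, R1, L2, R2). Grouping by column first keeps each column's text contiguous, and pages where the heuristic's assumptions fail are the ones worth escalating to a layout judge.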
Every multimodal RAG ingestion runs through the same pipeline shape.
Each stage emits provenance. The chunk-level metadata that lands in the index includes source URL, page, region, extraction model, and timestamp.
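One way to picture that chunk-level metadata is a small record type. The sketch below uses the five fields listed above; the class name, field names, the bounding-box convention, and the sample values are assumptions for illustration.

```python
from dataclasses import dataclass, asdict
from datetime import datetime, timezone
from typing import Optional, Tuple


@dataclass(frozen=True)
class Provenance:
    source_url: str
    page: int
    region: Tuple[float, float, float, float]  # assumed bbox: (x0, y0, x1, y1)
    extraction_model: str
    extracted_at: str  # ISO 8601 timestamp, UTC


def make_metadata(source_url: str, page: int,
                  region: Tuple[float, float, float, float],
                  extraction_model: str,
                  now: Optional[datetime] = None) -> dict:
    """Build the metadata dict stored alongside a chunk in the index."""
    now = now or datetime.now(timezone.utc)
    return asdict(Provenance(source_url, page, region,
                             extraction_model, now.isoformat()))


# Hypothetical values; the URL and model label are placeholders.
meta = make_metadata(
    "s3://example-bucket/exhibit-12.pdf",
    page=4,
    region=(72.0, 140.0, 520.0, 310.0),
    extraction_model="vision-llm",
    now=datetime(2026, 1, 1, tzinfo=timezone.utc),
)
```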
Pipeline diagram, stage notes in order: lands in object storage → vision LLM vs Unstructured → text + media + bbox → layout-respecting → ColPali / BGE-M3 / Cohere → pgvector + provenance.
At query time, the index serves text chunks, image embeddings, table chunks, and audio transcripts through one unified retrieval call. Rerank tightens the top-K; the LLM generates the answer with citations back to the source media.
One query, multiple media types, one ranked answer.
PDFs, images, audio, and structured docs land in object storage. A worker enqueues them by type, and each type gets its own extraction strategy. Provenance is recorded (source URL, hash, ingestion timestamp) so every chunk can cite its origin.
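The ingestion step could be sketched roughly as below: route by file type and record the three provenance fields at the door. The queue names, the suffix table, and `build_job` are hypothetical, and the sketch assumes the source URL ends in a plain file name with no query string.

```python
import hashlib
from datetime import datetime, timezone
from pathlib import PurePosixPath

# Hypothetical mapping from file suffix to a per-type extraction queue.
QUEUE_BY_SUFFIX = {
    ".pdf": "pdf",
    ".png": "image", ".jpg": "image", ".jpeg": "image", ".tiff": "image",
    ".mp3": "audio", ".wav": "audio",
    ".docx": "structured", ".xlsx": "structured",
}


def build_job(source_url: str, content: bytes) -> dict:
    """Describe one ingestion job with its queue and provenance fields."""
    suffix = PurePosixPath(source_url).suffix.lower()
    return {
        "queue": QUEUE_BY_SUFFIX.get(suffix, "unknown"),
        "source_url": source_url,
        # A content hash lets any later chunk prove which bytes it came from.
        "sha256": hashlib.sha256(content).hexdigest(),
        "ingested_at": datetime.now(timezone.utc).isoformat(),
    }


job = build_job("s3://example-bucket/statements/2024-01.PDF", b"%PDF-1.7 ...")
```

Hashing the raw bytes rather than the URL means a re-uploaded or modified file gets a different fingerprint even when its path is unchanged.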
Vision LLM for documents with layouts, tables, or figures (Claude vision is our 2026 default, with GPT vision as the fallback). Unstructured for clean PDFs with simple layout. Whisper for audio. The extraction layer outputs both the text and a reference to the original media.
Text-only: Cohere Embed v3 or OpenAI text-embedding-3-large, or BGE-M3 for dense + sparse + multi-vector retrieval in one model. Multimodal: ColPali for direct page-image retrieval (no text intermediate). CLIP for legacy multimodal where the corpus and queries don't need 2024-era retrieval gains.
Text chunks, image embeddings, table chunks, and audio transcripts all in one pgvector or Qdrant index, separated by metadata facets. Retrieval can be filtered by media type, or merged across all four.
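The facet idea can be shown with a toy in-memory index. This is a stand-in for pgvector or Qdrant, not their API: `search`, the item layout, and the two-dimensional vectors are made up for illustration, and the sketch assumes all four media types were embedded into one comparable vector space.

```python
import math
from typing import Dict, List, Optional, Sequence, Set


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def search(index: List[Dict], query_vec: Sequence[float],
           media_types: Optional[Set[str]] = None, top_k: int = 3) -> List[Dict]:
    """Rank items by cosine similarity, optionally filtered by media type."""
    candidates = [item for item in index
                  if media_types is None or item["media_type"] in media_types]
    ranked = sorted(candidates,
                    key=lambda item: cosine(query_vec, item["vector"]),
                    reverse=True)
    return ranked[:top_k]


# Toy vectors; real embeddings have hundreds of dimensions or more.
INDEX = [
    {"id": "t1", "media_type": "text", "vector": [1.0, 0.0]},
    {"id": "i1", "media_type": "image", "vector": [0.8, 0.6]},
    {"id": "tb1", "media_type": "table", "vector": [0.6, 0.8]},
    {"id": "a1", "media_type": "audio", "vector": [0.0, 1.0]},
]

merged = [hit["id"] for hit in search(INDEX, [1.0, 0.0], top_k=2)]
filtered = [hit["id"] for hit in search(INDEX, [1.0, 0.0],
                                        media_types={"table", "audio"})]
```

Leaving `media_types` unset merges all four types into one ranking; passing a set restricts the candidates before ranking, which is the same shape as a metadata filter in a vector database.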
Every claim in the answer points back to its source: page number, table cell, audio timestamp, or image region. The UI renders the citation inline; the source can be opened, verified, and audited. Court-ready by design.
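As a rough sketch of what an inline citation might look like for each of those four locator types, consider the formatter below. The `locator` dictionary shape, the `kind` values, and the output format are assumptions for illustration, not the production citation schema.

```python
def format_citation(source: str, locator: dict) -> str:
    """Render a human-readable citation for one supported locator kind."""
    kind = locator["kind"]
    if kind == "page":
        return f"{source}, p. {locator['page']}"
    if kind == "table_cell":
        return (f"{source}, p. {locator['page']}, "
                f"row {locator['row']}, col {locator['col']}")
    if kind == "audio":
        # Convert a second offset into mm:ss for display.
        minutes, seconds = divmod(int(locator["seconds"]), 60)
        return f"{source}, at {minutes:02d}:{seconds:02d}"
    if kind == "image_region":
        x0, y0, x1, y1 = locator["bbox"]
        return f"{source}, p. {locator['page']}, region ({x0}, {y0}, {x1}, {y1})"
    raise ValueError(f"unknown locator kind: {kind}")


# Hypothetical file names and locators.
cell = format_citation("statement.pdf",
                       {"kind": "table_cell", "page": 3, "row": 12, "col": "Amount"})
clip = format_citation("hearing.mp3", {"kind": "audio", "seconds": 125})
```

Raising on an unknown locator kind, rather than falling back to a bare file name, keeps an uncitable claim from slipping into an answer unnoticed.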
Court-ready legal RAG. Bank statements, contracts, scanned exhibits go in as PDFs. The pipeline produces a court-admissible report with citations down to the page-and-region level.