Kensink Labs
Multimodal RAG · Direct LLM · no framework · Production grade
RAG · MULTIMODAL · BEYOND TEXT

Multimodal RAG. Tables, figures, scans, audio. Citation-grade.

Plain-text RAG breaks on the documents that matter most in regulated industries. Legal contracts have signatures and redlines. Financial reports have charts. Medical records have scans. Court-ready RAG needs a vision-aware extraction layer, multimodal embeddings, and a citation surface that tracks back to the page region. This is the shape we built for Affidavit Mapp.

Claude vision · GPT vision · ColPali · BGE-M3 · Unstructured · pgvector
Inputs: PDF · image · scanned · audio
Extract: Claude vision · GPT vision · Unstructured
Embed: ColPali · BGE-M3 · CLIP (legacy)
Reference: Affidavit Mapp (court-ready)
[THE PDF PROBLEM]

Four ways naive RAG breaks on real documents.

Each problem has a named solution in 2026. None of them are the fastest path. All of them earn the build when the documents are regulated, complex, or both.

01

Tables

Row/column structure carries the meaning. Naive text extraction flattens that structure, and the LLM hallucinates the relationships back in. Solution: table-aware extraction (Unstructured / Camelot) plus structured chunking that preserves header context.
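
A minimal sketch of header-preserving table chunking, assuming the extractor has already produced a header row and data rows. The function name `chunk_table`, the `rows_per_chunk` parameter, and the sample rows are made up for illustration and are not part of any named library.

```python
from typing import Dict, List


def chunk_table(header: List[str], rows: List[List[str]], rows_per_chunk: int = 2) -> List[Dict]:
    """Split a table into chunks, repeating the header context in every chunk."""
    chunks = []
    for start in range(0, len(rows), rows_per_chunk):
        block = rows[start:start + rows_per_chunk]
        # Render each cell as "column: value" so the row/column relation
        # survives even when the chunk is read as plain text.
        lines = ["; ".join(f"{col}: {cell}" for col, cell in zip(header, row)) for row in block]
        chunks.append({
            "text": "\n".join(lines),
            "header": header,
            "row_range": (start, start + len(block) - 1),  # inclusive row indices
        })
    return chunks


# Illustrative data only.
header = ["Date", "Payee", "Amount"]
rows = [
    ["2024-01-03", "Acme Ltd", "1,200.00"],
    ["2024-01-09", "Utilities", "310.45"],
    ["2024-01-15", "Payroll", "8,000.00"],
]
table_chunks = chunk_table(header, rows)
```

Every chunk carries its own column names, so a chunk retrieved from the middle of a long table still says which number is the amount and which is the date.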

02

Figures + charts

Bar charts, schematics, diagrams. Embeddable as images; retrievable as captions plus images. Solution: index both a vision-LLM caption and the original image, and retrieve them together at query time.
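
One way to keep caption and image together, sketched under assumptions: both records share a `figure_id`, and a hit on either one pulls in its sibling. `FigureRecord`, `index_figure`, `expand_hits`, and the storage key are hypothetical names, not an existing API.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class FigureRecord:
    figure_id: str  # shared by the caption record and the image record
    kind: str       # "caption" or "image"
    payload: str    # caption text, or a storage key for the image (hypothetical)
    page: int


def index_figure(figure_id: str, caption: str, image_key: str, page: int) -> List[FigureRecord]:
    """Emit two index records for one figure: its caption and its image."""
    return [
        FigureRecord(figure_id, "caption", caption, page),
        FigureRecord(figure_id, "image", image_key, page),
    ]


def expand_hits(hits: List[FigureRecord], index: List[FigureRecord]) -> Dict[str, List[FigureRecord]]:
    """For every hit, return all records that belong to the same figure."""
    by_figure: Dict[str, List[FigureRecord]] = {}
    for rec in index:
        by_figure.setdefault(rec.figure_id, []).append(rec)
    return {hit.figure_id: by_figure[hit.figure_id] for hit in hits}


index = index_figure("fig-3", "Quarterly revenue by region, bar chart", "s3://bucket/fig-3.png", page=12)
caption_hit = [index[0]]  # pretend the caption matched the query
paired = expand_hits(caption_hit, index)
```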

03

Scanned pages

OCR is no longer enough. A scan of a hand-annotated contract has signatures, redlines, and stamps that matter. Solution: a vision LLM reads the page directly (Claude vision / GPT vision), with no OCR intermediate.

04

Layout + reading order

Multi-column PDFs, footnotes, sidebars, headers. Reading order breaks every naive extraction. Solution: layout-aware parsers (Unstructured) plus a layout-respecting chunker, often with a vision LLM as the layout judge for ambiguous pages.
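
A minimal sketch of column-aware reading order, assuming a layout parser has already produced text blocks with bounding boxes (origin at top-left) and that the page is a clean two-column layout. Footnotes, sidebars, and full-width headers need more than this; `Block` and `reading_order` are illustrative names.

```python
from typing import List, NamedTuple


class Block(NamedTuple):
    text: str
    x0: float
    y0: float
    x1: float
    y1: float


def reading_order(blocks: List[Block], page_width: float, columns: int = 2) -> List[Block]:
    """Order blocks column by column, top to bottom within each column."""
    col_width = page_width / columns

    def key(b: Block):
        center_x = (b.x0 + b.x1) / 2
        column = min(int(center_x // col_width), columns - 1)
        return (column, b.y0)

    return sorted(blocks, key=key)


# Blocks arrive in arbitrary order, as a naive extractor might emit them.
blocks = [
    Block("right-top", 320, 50, 580, 200),
    Block("left-bottom", 30, 400, 290, 700),
    Block("left-top", 30, 50, 290, 380),
    Block("right-bottom", 320, 220, 580, 700),
]
ordered = [b.text for b in reading_order(blocks, page_width=612)]
```

Sorting by vertical position alone would interleave the two columns, which is the failure this ordering avoids.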

[EXTRACTION FLOW]

From PDF to indexed chunks.

The shape every multimodal RAG ingestion runs through.

Ingestion pipeline (per document).

Each stage emits provenance. The chunk-level metadata that lands in the index includes source URL, page, region, extraction model, and timestamp.
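
The chunk-level metadata listed above could be modeled as below. This is a sketch: the field names, the bounding-box convention, and the placeholder model name are assumptions, not the schema of the production build.

```python
from dataclasses import asdict, dataclass
from datetime import datetime, timezone
from typing import Optional, Tuple


@dataclass(frozen=True)
class ChunkProvenance:
    source_url: str
    page: int
    region: Optional[Tuple[float, float, float, float]]  # bbox (x0, y0, x1, y1); None = whole page
    extraction_model: str
    extracted_at: str  # ISO 8601 timestamp, UTC


def make_provenance(source_url: str, page: int,
                    region: Optional[Tuple[float, float, float, float]],
                    extraction_model: str,
                    now: Optional[datetime] = None) -> ChunkProvenance:
    """Build the provenance record stored next to each chunk."""
    ts = (now or datetime.now(timezone.utc)).isoformat()
    return ChunkProvenance(source_url, page, region, extraction_model, ts)


prov = make_provenance(
    "https://example.com/exhibits/statement.pdf",  # placeholder URL
    page=4,
    region=(72.0, 120.0, 540.0, 310.0),
    extraction_model="vision-llm-placeholder",     # placeholder model name
    now=datetime(2026, 1, 1, tzinfo=timezone.utc),
)
metadata = asdict(prov)  # plain dict, ready to store as index metadata
```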

01

PDF / image

Lands in object storage

02

Type router

Vision LLM vs Unstructured

03

Extract + region

Text + media + bbox

04

Chunk + caption

Layout-respecting

05

Embed

ColPali / BGE-M3 / Cohere

06

Index

pgvector + provenance
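
The type router in stage 02 can be sketched as a small decision function. The suffix lists, the `has_text_layer` and `complex_layout` flags, and the returned strategy labels are assumptions for illustration. A real router would inspect the file rather than trust its extension.

```python
from pathlib import Path

IMAGE_SUFFIXES = {".png", ".jpg", ".jpeg", ".tif", ".tiff"}
AUDIO_SUFFIXES = {".wav", ".mp3", ".m4a"}


def route(filename: str, has_text_layer: bool = True, complex_layout: bool = False) -> str:
    """Pick an extraction strategy for one incoming document."""
    suffix = Path(filename).suffix.lower()
    if suffix in AUDIO_SUFFIXES:
        return "audio_transcription"
    if suffix in IMAGE_SUFFIXES:
        return "vision_llm"
    if suffix == ".pdf":
        # Scans (no text layer) and layout-heavy PDFs go to the vision LLM;
        # clean PDFs with simple layout go to the layout parser.
        if not has_text_layer or complex_layout:
            return "vision_llm"
        return "layout_parser"
    raise ValueError(f"unsupported file type: {suffix or filename}")


clean_pdf = route("brief.PDF")
scanned_pdf = route("exhibit-12.pdf", has_text_layer=False)
recording = route("hearing.mp3")
```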

[RETRIEVAL ARCHITECTURE]

Text and visual, retrieved together.

At query time, the index serves text chunks, image embeddings, table chunks, and audio transcripts through one unified retrieval call. Rerank tightens the top-K; the LLM generates the answer with citations back to the source media.

The query path.

One query, multiple media types, one ranked answer.

Query → Embed (multimodal) → Text chunks · Image embeds · Table chunks · Audio transcripts → Unified rerank (Cohere) → LLM + cite by region
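
The query path can be sketched as one function that fans out to per-media retrievers and hands every candidate to a single reranker. The retrievers and the keyword-overlap reranker below are toy stand-ins for illustration. They are not the Cohere API or the production retrievers.

```python
from typing import Callable, Dict, List, NamedTuple


class Hit(NamedTuple):
    chunk_id: str
    media_type: str  # "text" | "image" | "table" | "audio"
    content: str
    score: float


Retriever = Callable[[str, int], List[Hit]]
Reranker = Callable[[str, List[Hit]], List[Hit]]


def query_path(query: str, retrievers: Dict[str, Retriever], rerank: Reranker,
               per_type_k: int = 10, final_k: int = 5) -> List[Hit]:
    """Gather candidates from every media type, then rerank them together."""
    candidates: List[Hit] = []
    for retrieve in retrievers.values():
        candidates.extend(retrieve(query, per_type_k))
    # First-stage scores from different retrievers are not comparable, so one
    # reranker rescoring all candidates against the query gives the final order.
    return rerank(query, candidates)[:final_k]


def keyword_overlap_rerank(query: str, hits: List[Hit]) -> List[Hit]:
    """Toy reranker: score = number of query words present in the content."""
    terms = set(query.lower().split())
    rescored = [h._replace(score=float(len(terms & set(h.content.lower().split())))) for h in hits]
    return sorted(rescored, key=lambda h: h.score, reverse=True)


retrievers: Dict[str, Retriever] = {
    "text": lambda q, k: [Hit("t1", "text", "wire transfer to Acme on March 3", 0.9)][:k],
    "table": lambda q, k: [Hit("tb1", "table", "Date: March 3; Payee: Acme; Amount: 1200", 0.4)][:k],
    "audio": lambda q, k: [Hit("a1", "audio", "the witness described the meeting", 0.8)][:k],
}
top = query_path("Acme transfer March", retrievers, keyword_overlap_rerank, final_k=2)
```

In this toy run the audio hit starts with a high first-stage score but falls out after reranking, which is the point of rescoring all media types on one scale.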
[FIVE-LAYER STACK]

The components, named.

01

Document ingestion

PDFs, images, audio, and structured docs land in object storage. A worker enqueues them by type, and each type gets its own extraction strategy. Provenance is recorded (source URL, hash, ingestion timestamp) so every chunk can cite its origin.
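
A minimal sketch of the document-level provenance record, assuming SHA-256 as the hash (the source names a hash but not the algorithm). Re-hashing the stored bytes later is one simple way to check that a document has not changed since ingestion. The function names and URL are placeholders.

```python
import hashlib
from datetime import datetime, timezone


def document_provenance(source_url: str, content: bytes) -> dict:
    """Record where a document came from and a fingerprint of its bytes."""
    return {
        "source_url": source_url,
        "sha256": hashlib.sha256(content).hexdigest(),
        "ingested_at": datetime.now(timezone.utc).isoformat(),
        "size_bytes": len(content),
    }


def verify(record: dict, content: bytes) -> bool:
    """True if the bytes still match the fingerprint taken at ingestion."""
    return hashlib.sha256(content).hexdigest() == record["sha256"]


original = b"%PDF-1.7 sample bytes"  # stand-in for a real file
record = document_provenance("https://example.com/filings/contract.pdf", original)
```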

02

Extraction

Vision LLM for documents with layout / tables / figures (Claude vision is our 2026 default, with GPT vision as fallback). Unstructured for clean PDFs with simple layout. Whisper for audio. The extraction layer outputs both the text AND a reference to the original media.

03

Embedding

Text-only: Cohere Embed v3 or OpenAI text-embedding-3-large. Multimodal: ColPali for direct page-image retrieval (no text intermediate). BGE-M3 gives dense + sparse + multi-vector retrieval in one model, applied to extracted text and captions rather than raw images. CLIP for legacy multimodal where the corpus and queries don't need 2024-era retrieval gains.

04

Unified index

Text chunks, image embeddings, table chunks, and audio transcripts all in one pgvector or Qdrant index, separated by metadata facets. Retrieval can be filtered by media type, or merged across all four.
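
The facet idea can be sketched with an in-memory stand-in for the vector store. This is not pgvector or Qdrant code. It assumes all vectors live in one comparable embedding space, and `UnifiedIndex` is an illustrative name.

```python
import math
from typing import List, Optional, Sequence, Tuple


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


class UnifiedIndex:
    """One index for all media types, with media_type as a metadata facet."""

    def __init__(self) -> None:
        self._rows: List[Tuple[str, str, List[float]]] = []

    def add(self, chunk_id: str, media_type: str, vector: Sequence[float]) -> None:
        self._rows.append((chunk_id, media_type, list(vector)))

    def search(self, query_vector: Sequence[float], k: int = 3,
               media_type: Optional[str] = None) -> List[Tuple[str, float]]:
        # media_type=None merges across all facets; a value filters to one.
        scored = [
            (cosine(query_vector, vec), chunk_id)
            for chunk_id, mtype, vec in self._rows
            if media_type is None or mtype == media_type
        ]
        scored.sort(reverse=True)
        return [(chunk_id, score) for score, chunk_id in scored[:k]]


index = UnifiedIndex()
index.add("text-1", "text", [1.0, 0.0])
index.add("image-1", "image", [0.8, 0.6])
index.add("table-1", "table", [0.0, 1.0])
merged = index.search([1.0, 0.0], k=2)
images_only = index.search([1.0, 0.0], media_type="image")
```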

05

Citation surface

Every claim in the answer points back to its source — page number, table cell, audio timestamp, image region. The UI renders the citation inline; the source can be opened, verified, and audited. Court-ready by design.

[REFERENCE BUILD]

Affidavit Mapp.

Court-ready legal RAG. Bank statements, contracts, scanned exhibits go in as PDFs. The pipeline produces a court-admissible report with citations down to the page-and-region level.

99.7% data integrity vs ground truth
Days→min processing turnaround
Court-ready citation surface to region
Chain of custody preserved
[WHAT YOU GET]

What we ship on a multimodal RAG.

Vision-aware: Extraction, not OCR
Multimodal: ColPali / BGE-M3 / Cohere
Cite-to-region: Page + bounding box per claim
Auditable: Provenance chain on every chunk
[COMMON QUESTIONS]

What buyers ask before they sign.

Do I need a multimodal embedding model, or is text extraction enough?
Depends on the corpus. For text-heavy PDFs (contracts, legal briefs, policy documents) where the figures are illustrative, not load-bearing, high-quality text extraction plus a text embedding (Cohere Embed v3) is enough. For visually-rich documents (financial reports with charts, technical manuals with schematics, medical images), ColPali or BGE-M3 with the original image as part of the retrieval surface delivers significantly better retrieval quality.
What is ColPali and why does it matter?
ColPali is a multimodal retrieval model that embeds page images directly and retrieves against them, skipping the text extraction step entirely. Published numbers show it outperforming text-only RAG on visually-rich corpora by 10-20 points of retrieval accuracy. The cost is heavier embeddings (one vector per image patch rather than one per chunk) and a more expensive index, but for documents where layout IS the content, it's the right answer.
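
The scoring idea behind this style of retrieval is late interaction: each query token vector is matched to its best page patch vector, and those best matches are summed. A minimal sketch with toy 2-dimensional vectors follows. Real embeddings are far larger, and this is not the ColPali library's API.

```python
from typing import Sequence

Vector = Sequence[float]


def dot(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))


def maxsim_score(query_vectors: Sequence[Vector], page_vectors: Sequence[Vector]) -> float:
    """Sum, over query tokens, of the best-matching page patch similarity."""
    return sum(max(dot(q, p) for p in page_vectors) for q in query_vectors)


# Toy vectors: two query tokens, and two candidate pages with a few patches each.
query = [[1.0, 0.0], [0.0, 1.0]]
page_a = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
page_b = [[1.0, 0.0], [0.7, 0.1]]

scores = {
    "page_a": maxsim_score(query, page_a),
    "page_b": maxsim_score(query, page_b),
}
best_page = max(scores, key=scores.get)
```

Storing one vector per patch is what makes the index heavier than a single-vector setup, and it is also what lets a small region of a page match a specific query term.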
How do you handle citations in multimodal RAG?
The citation surface tracks source identity at extraction time. For PDFs: page number, optional bounding-box for the chunk. For tables: page number plus table row/cell. For images: page number plus image region. For audio: timestamp range. The prompt requires the model to cite per claim, and the UI renders the source inline. This is the Affidavit Mapp shape — court-ready citations were the engagement constraint, not a nice-to-have.
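
A sketch of what a per-media citation record and a per-claim check might look like. The `[c1]`-style marker convention, the `Citation` fields, and `uncited_claims` are assumptions for illustration. The source does not specify the marker format.

```python
import re
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple


@dataclass(frozen=True)
class Citation:
    source_id: str
    page: Optional[int] = None
    bbox: Optional[Tuple[float, float, float, float]] = None  # PDF chunk or image region
    table_cell: Optional[Tuple[int, int]] = None              # (row, column)
    audio_range_s: Optional[Tuple[float, float]] = None       # (start, end) in seconds


MARKER = re.compile(r"\[(c\d+)\]")


def uncited_claims(answer: str, citations: Dict[str, Citation]) -> List[str]:
    """Return sentences that carry no citation marker or cite an unknown id."""
    problems = []
    for sentence in re.split(r"(?<=[.!?])\s+", answer.strip()):
        ids = MARKER.findall(sentence)
        if not ids or any(i not in citations for i in ids):
            problems.append(sentence)
    return problems


citations = {
    "c1": Citation("statement.pdf", page=4, table_cell=(2, 3)),
    "c2": Citation("hearing.mp3", audio_range_s=(61.5, 74.0)),
}
answer = "The transfer was 1,200.00 [c1]. The witness confirmed it [c2]. No one objected."
flagged = uncited_claims(answer, citations)
```

A check like this can run after generation so an answer with an unsupported sentence is rejected or regenerated before it reaches the UI.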
Can I use a single LLM for both extraction and generation?
Yes, and we often do. Claude or GPT vision can extract the document content into structured chunks at ingestion, then the same family of model generates the answer at query time. The decoupling we DO keep: the extraction step writes to the index and the generation step reads from it, so we can swap the generation model without re-extracting the corpus.
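
The write-side / read-side split can be sketched with two small interfaces and a shared store. Every name here (`Extractor`, `Generator`, `ChunkStore`, the toy implementations) is hypothetical and exists only to show the decoupling.

```python
from typing import Dict, List, Protocol


class Extractor(Protocol):
    def extract(self, document: bytes) -> List[str]: ...


class Generator(Protocol):
    def answer(self, question: str, chunks: List[str]) -> str: ...


class ChunkStore:
    """Stand-in for the index: extraction writes to it, generation reads from it."""

    def __init__(self) -> None:
        self._chunks: Dict[str, List[str]] = {}

    def write(self, doc_id: str, chunks: List[str]) -> None:
        self._chunks[doc_id] = list(chunks)

    def read_all(self) -> List[str]:
        return [c for chunks in self._chunks.values() for c in chunks]


def ingest(doc_id: str, document: bytes, extractor: Extractor, store: ChunkStore) -> None:
    store.write(doc_id, extractor.extract(document))


def ask(question: str, generator: Generator, store: ChunkStore) -> str:
    return generator.answer(question, store.read_all())


# Toy implementations for illustration only.
class LineExtractor:
    def extract(self, document: bytes) -> List[str]:
        return [line for line in document.decode().splitlines() if line.strip()]


class EchoGenerator:
    def __init__(self, name: str) -> None:
        self.name = name

    def answer(self, question: str, chunks: List[str]) -> str:
        return f"{self.name}: {len(chunks)} chunks consulted"


store = ChunkStore()
ingest("doc-1", b"clause one\n\nclause two", LineExtractor(), store)  # extraction runs once
first = ask("What are the clauses?", EchoGenerator("model-a"), store)
second = ask("What are the clauses?", EchoGenerator("model-b"), store)  # swapped, no re-ingest
```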
What does the Affidavit Mapp build look like?
Legal documents (bank statements, court filings, financial records) come in as PDFs with mixed layouts — tables, scanned pages, hand-annotated sections. Vision LLM extracts each into chunks tagged with page + bounding box. Hybrid retrieval (pgvector + BM25 + RRF) finds the relevant chunks. Cohere Rerank tightens to top-20. Generation produces a court-admissible report with citations down to the page-and-region level. Days-to-minutes processing, 99.7% data integrity vs ground truth. See the case study for the full architecture.
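
The RRF step in the hybrid retrieval above is reciprocal rank fusion: each chunk scores the sum of 1 / (k + rank) across the ranked lists it appears in. A minimal sketch follows, with k = 60 as a commonly used default rather than a value confirmed for this build. The chunk ids are made up.

```python
from collections import defaultdict
from typing import Dict, List


def reciprocal_rank_fusion(rankings: List[List[str]], k: int = 60) -> List[str]:
    """Fuse several ranked lists of chunk ids into one ranking."""
    scores: Dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] += 1.0 / (k + rank)
    return sorted(scores, key=lambda cid: scores[cid], reverse=True)


vector_hits = ["c3", "c1", "c7"]  # e.g. from a vector search
bm25_hits = ["c1", "c9", "c3"]    # e.g. from a keyword search
fused = reciprocal_rank_fusion([vector_hits, bm25_hits])
```

RRF uses only ranks, so vector distances and BM25 scores never have to be put on a common scale. Chunks that both retrievers rank well rise to the top.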
DIRECT RAG · APPLIED K

Bring the documents. We will sketch the extraction.

Send a sample PDF or describe the corpus. We will name the extraction strategy, the multimodal embedding choice, and the citation surface — sized to your regulatory bar.