RAG / Document Intelligence · Prototype · 2026

Deal-Room Document Intelligence

Hybrid retrieval over contracts, due-diligence packs, and financial filings — precise, complete, and citation-backed answers at scale, where naive RAG returns partial or hallucinated results.

Year: 2026
Status: Prototype
Category: RAG / Document Intelligence
Role: Architect & Lead

Key metrics

Recall: Complete on filtered subsets
Retrieval: Vector + metadata filter
Loop: Adaptive / corrective
Grounding: Cited, zero-hallucination

Architecture

A two-collection vector store separates structured metadata from unstructured document chunks. A query planner extracts filter rules from the question and pre-filters the corpus to an exact subset before retrieval, so "find ALL X in subset Y" returns complete, non-hallucinated results. An adaptive/corrective loop verifies retrieved context, reranks, and reformulates on failure; generated query code runs through a LangGraph edit-compile-run-debug cycle.

Case study

Deal-Room Document Intelligence

A production-grade hybrid retrieval system for the kind of corpus that breaks naive RAG: thousands of long, structurally complex documents — contracts, due-diligence packs, financial filings, board materials — where a single wrong or missing clause changes the answer. The system answers precise questions reliably, returns complete results over filtered subsets, and grounds every claim in a citation.

It ships in two tiers over one shared core (a Pydantic ontology, hierarchical table-aware chunking, and a table-QA scoring harness). The standard tier is Anthropic-only — Claude for generation and grading, Voyage embeddings, contextual retrieval, one Postgres, no GPU. The advanced tier keeps that core and adds the SOTA frontier machinery — a knowledge graph, learned sparse/dense retrieval, a visual late-interaction lane, text-to-SQL, and an adaptive router — for the questions the standard tier provably cannot answer. You graduate to advanced only when a measured gap shows up in your own eval.


The problem: precise questions over a messy, massive corpus

Deal rooms are adversarial to retrieval. The documents are long and dense, mix prose with tables and defined terms, and the questions people actually ask are simultaneously precise and exhaustive:

"List every contract in this data set with a change-of-control clause."
"What is the indemnity cap, and show me the exact governing-law language."
"Find all agreements expiring in the next 18 months in the EU subset."

A vanilla embed-and-retrieve pipeline fails these in two predictable ways:

Partial recall. Top-k semantic search returns some matching documents, not all of them. For a "find every X" question that is not a ranking problem — it is a completeness problem. Missing one clause is a wrong answer.
Hallucinated specifics. When the retrieved context is thin or off-target, the model fills the gap. In a deal room a confidently invented figure is worse than no answer at all.

The first prototype was the textbook approach — a query classifier, a single semantic search, and a relevance check with query rewriting. It demonstrated the principles cleanly but was not trustworthy: it misclassified questions and returned partial results often enough that no one could rely on it.

The production system is built on LangChain for the retrieval components and LangGraph for the corrective-RAG state machine, over a single Postgres that carries both retrieval arms — pgvector for dense search and ParadeDB pg_search for BM25 — so the lexical and semantic indexes never drift out of sync.

Ingestion: turning messy PDFs into clean markdown

Deal-room source files are the hard case for extraction: scanned exhibits, multi-page tables, footnotes, and defined-term blocks. The ingestion pipeline uses Docling (IBM) as the primary converter — its DocumentConverter runs layout analysis and TableFormer table-structure recognition (the same family as Microsoft's Table Transformer / TATR), with an OCR fallback (RapidOCR / Tesseract) for scanned pages — and emits clean markdown with tables preserved as real grids:

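from docling.datamodel.base_models import InputFormat
from docling.datamodel.pipeline_options import PdfPipelineOptions
from docling.document_converter import DocumentConverter, PdfFormatOption

pdf_path = "exhibit.pdf"  # placeholder path; substitute a real deal-room file

# Enable OCR for scanned pages and table-structure recognition for tables.
opts = PdfPipelineOptions(do_ocr=True, do_table_structure=True)
conv = DocumentConverter(
    format_options={InputFormat.PDF: PdfFormatOption(pipeline_options=opts)}
)
# Convert the PDF and export the parsed document as markdown.
md = conv.convert(pdf_path).document.export_to_markdown()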

Born-digital PDFs that need no OCR take a faster path through pymupdf4llm.to_markdown(); Marker / MinerU are kept in reserve for the most layout-heavy filings. The output is markdown — not scrambled text — so the model later reads a table as a table.

Standard tier (single model)

The standard tier is the default, and the one most deal rooms should ship first: Claude for generation, document grading, and faithfulness checks; Voyage embeddings paired with Anthropic's Contextual Retrieval; ParadeDB pg_search BM25; a single Postgres; no GPU. Its job is to find the right table, address the right cell, answer with a citation, and abstain when it is not sure.

The approach: filter first, then search

The core insight is that completeness is a filtering problem, not a ranking problem. Before any semantic search runs, a LangGraph planner node extracts structured filter rules from the question and pre-filters the corpus down to the exact subset the question is scoped to — a plain SQL WHERE over indexed metadata columns. Retrieval then happens inside that subset.

It all lives in one Postgres, with two complementary indexes over the same rows:

Structured columns + BM25 — normalized metadata per document (counterparty, jurisdiction, document type, effective and expiry dates, monetary terms, clause flags) for exact filtering, counting, and aggregation, plus a ParadeDB pg_search BM25 index over the chunk text for lexical precision (CREATE INDEX ... USING bm25 (...) WITH (key_field='id'), queried with the @@@ operator).
pgvector embeddings — chunked document text (sections, defined-term blocks, tables-as-markdown), embedded and stored in pgvector (HNSW, with an IVF-PQ tier at archive scale), each chunk carrying provenance metadata for citation.

Chunking uses LangChain's ParentDocumentRetriever: small child chunks for precise matching and large parent chunks for context (the small-to-big pattern), with tables kept intact and never split across chunks. A "find ALL X in subset Y" question is answered by enumerating the SQL-filtered subset directly, every record, deterministically, and then enriching with the fused vector/BM25 hits for the narrative parts of the answer. One database, ACID guarantees, and no separate lexical engine to keep in sync.
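To make the "enumerate the filtered subset" idea concrete, here is a minimal sketch. It uses an in-memory SQLite table as a stand-in for the Postgres metadata columns; the table name, column names, and rows are hypothetical and exist only for illustration.

```python
import sqlite3

# Hypothetical, simplified schema: one row per contract with normalized metadata.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE contracts ("
    "id INTEGER PRIMARY KEY, counterparty TEXT, jurisdiction TEXT, "
    "has_change_of_control INTEGER)"
)
conn.executemany(
    "INSERT INTO contracts VALUES (?, ?, ?, ?)",
    [
        (1, "Acme GmbH", "EU", 1),
        (2, "Borealis SA", "EU", 1),
        (3, "Cobalt LLC", "US", 1),
        (4, "Delta BV", "EU", 0),
        (5, "Eiger AG", "EU", 1),
    ],
)


def enumerate_subset(jurisdiction: str) -> list[int]:
    """Return every contract id in the subset that carries the clause flag."""
    rows = conn.execute(
        "SELECT id FROM contracts "
        "WHERE jurisdiction = ? AND has_change_of_control = 1 "
        "ORDER BY id",
        (jurisdiction,),
    ).fetchall()
    return [row[0] for row in rows]


complete = enumerate_subset("EU")  # every match, independent of any top-k cutoff
top_k = complete[:2]               # the most a k=2 ranked search could surface
print(complete, top_k)
```

The point of the contrast is that the SQL result has no cutoff to tune: the answer set is whatever the predicate selects, so completeness depends on the metadata being correct rather than on a ranking threshold.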

The query planner

The planner is a hybrid of fast rules and an LLM fallback:

A rules path matches question terms against a field-value index built at ingestion (the set of jurisdictions, document types, clause flags, and numeric ranges that actually exist in the corpus). When it can map the question to filters with confidence, it does so in milliseconds with no model call.
An LLM path handles complex or ambiguous phrasing, emitting a structured query spec — text filters, numeric comparators, and a query type — which is then validated against the same field-value index so the planner can never invent a filter value that does not exist in the data.

The output is a filter spec compiled into a SQL WHERE clause, narrowing the corpus before similarity ranking. This is what turns "find every X" from a best-effort top-k into a guaranteed enumeration.

The planner is the first node of a LangGraph StateGraph, which decomposes a multi-part question into a sequence of filter-and-retrieve steps with intermediate evidence, so a compound question like "compare the indemnity caps across all EU contracts expiring next year" resolves as a planned graph rather than one flat query.
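The rules path, the validation step, and the compile-to-WHERE step described above can be sketched in a few functions. Everything named here (`FIELD_VALUES`, `Filter`, `rules_path`, `validate`, `compile_where`) is hypothetical, the token matching is deliberately naive and case-sensitive, and the `%s` placeholders assume a psycopg-style driver.

```python
from dataclasses import dataclass

# Hypothetical field-value index, as it might be built at ingestion time.
FIELD_VALUES = {
    "jurisdiction": {"EU", "US", "UK"},
    "doc_type": {"MSA", "NDA", "Lease"},
}


@dataclass(frozen=True)
class Filter:
    field: str
    value: str


def rules_path(question: str) -> list[Filter]:
    """Match question tokens against values that actually exist in the corpus."""
    tokens = {t.strip(".,?\"'") for t in question.split()}
    return [
        Filter(field, value)
        for field, values in sorted(FIELD_VALUES.items())
        for value in sorted(values)
        if value in tokens
    ]


def validate(filters: list[Filter]) -> list[Filter]:
    """Reject any filter (e.g. from an LLM path) whose value is not in the index."""
    for f in filters:
        if f.value not in FIELD_VALUES.get(f.field, set()):
            raise ValueError(f"unknown filter value: {f.field}={f.value!r}")
    return filters


def compile_where(filters: list[Filter]) -> tuple[str, list[str]]:
    """Compile a validated spec into a parameterized WHERE clause."""
    if not filters:
        return "TRUE", []
    # Field names come from the trusted index; values are passed as parameters.
    clause = " AND ".join(f"{f.field} = %s" for f in filters)
    return clause, [f.value for f in filters]


spec = validate(rules_path("Find all NDA agreements in the EU subset."))
where, params = compile_where(spec)
print(where, params)
```

Running validation on both paths is the design choice worth noting: a filter value proposed by a model is treated exactly like user input and is rejected unless it already exists in the corpus vocabulary.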

Adaptive, corrective retrieval

Filtering gets the right subset; the LangGraph state machine then checks that the retrieved context actually answers the question before a word is generated.

Hybrid retrieval + RRF fusion. Inside the filtered subset, both arms run in Postgres: a pgvector ANN search (cosine, embedding <=> qv) and a pg_search BM25 search (text @@@ q). Their rankings are combined with Reciprocal Rank Fusion — RRF(d) = Σᵢ 1/(k + rankᵢ(d)), k = 60 — the pattern ParadeDB documents for hybrid search. RRF scores by rank, not raw magnitude, so the two arms need no normalization and documents in both lists rise to the top.
Reranking. The fused candidates are reranked by a LangChain ContextualCompressionRetriever wrapping a CrossEncoderReranker (BAAI/bge-reranker-v2-m3), so the strongest evidence leads — improving both answer quality and citation precision.
grade_documents check. A LangGraph node verifies the context (and later the answer) against the question via NLI entailment. A request to quote exact language is held to a stricter bar than a request to summarize.
Corrective reformulation. If the check fails, a conditional edge routes to a rewrite_query node that reformulates and re-enters retrieval — widening the filter or leaning on the alternate arm — for a bounded number of attempts. The user's original question is always preserved for final synthesis.
Refusal over hallucination. If sufficient grounded context still cannot be assembled, the graph routes to an abstain node: it reports low confidence instead of inventing an answer.
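The RRF formula above is small enough to show in full. This is a plain-Python sketch of the scoring only, not the in-Postgres query the system uses; the chunk ids and the two input rankings are made up.

```python
from collections import defaultdict


def rrf(rankings: list[list[str]], k: int = 60) -> list[tuple[str, float]]:
    """Fuse ranked id lists: score(d) = sum over lists of 1 / (k + rank_i(d))."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Sort by score descending; break ties by id for a deterministic order.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


dense = ["c7", "c2", "c9"]  # hypothetical pgvector ranking
bm25 = ["c2", "c4", "c7"]   # hypothetical BM25 ranking
fused = rrf([dense, bm25])
print([doc_id for doc_id, _ in fused])
```

Chunks that appear in both lists (`c2`, `c7`) collect two reciprocal-rank terms and end up ahead of chunks seen by only one arm, which is the behaviour the text describes.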

Generated query code as an edit-compile-run-debug loop

Some questions — aggregations, cross-document joins, conditional counts — are best answered by generating SQL against the structured Postgres columns rather than retrieving prose. That generated query runs through a LangGraph edit-compile-run-debug cycle: the graph drafts the SQL, executes it, inspects the result or error, and revises until it runs cleanly — the same loop you would use for any production code generation, applied to query synthesis.
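A bounded version of that draft, run, inspect, revise loop can be sketched without LangGraph. SQLite stands in for Postgres, and `stub_reviser` is a hypothetical stand-in for the model call that would rewrite the query from the error message; the schema and the draft query are invented for the demo.

```python
import sqlite3
from typing import Callable


def run_with_repair(
    conn: sqlite3.Connection,
    draft: str,
    revise: Callable[[str, str], str],
    max_attempts: int = 3,
) -> tuple[list[tuple], int]:
    """Execute SQL; on error, hand the SQL and the message to `revise` and retry."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    sql = draft
    for attempt in range(1, max_attempts + 1):
        try:
            return conn.execute(sql).fetchall(), attempt
        except sqlite3.Error as exc:
            if attempt == max_attempts:
                raise  # attempts exhausted: surface the last error
            sql = revise(sql, str(exc))
    raise AssertionError("unreachable")


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE contracts (id INTEGER, jurisdiction TEXT, cap_usd REAL)")
conn.executemany(
    "INSERT INTO contracts VALUES (?, ?, ?)",
    [(1, "EU", 5e6), (2, "EU", 2e6), (3, "US", 1e6)],
)


def stub_reviser(sql: str, error: str) -> str:
    # Stand-in for the model call: fix the one mistake this demo draft contains.
    return sql.replace("indemnity_cap", "cap_usd") if "indemnity_cap" in error else sql


rows, attempts = run_with_repair(
    conn,
    "SELECT jurisdiction, MAX(indemnity_cap) FROM contracts "
    "GROUP BY jurisdiction ORDER BY jurisdiction",
    stub_reviser,
)
print(rows, attempts)
```

The attempt cap matters as much as the repair step: a query that cannot be fixed within the budget raises instead of looping, which leaves the caller free to abstain.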

Standard-tier architecture

flowchart TD
    PDF([Deal-room PDFs]) --> ING["Docling → markdown<br/>TableFormer + OCR fallback"]
    ING --> CHUNK["ParentDocumentRetriever<br/>small-to-big, tables intact"]
    CHUNK --> STORE

    Q([User question]) --> PLAN

    subgraph Graph["LangGraph StateGraph"]
        PLAN["plan node<br/>intent + predicates"] --> CONF{High confidence?}
        CONF -->|No| LLM[LLM filter extraction]
        LLM --> VAL[Validate against field-value index]
        CONF -->|Yes| SPEC
        VAL --> SPEC["Filter spec → SQL WHERE"]
    end

    SPEC --> PRE[SQL pre-filter to subset]

    subgraph STORE["Single Postgres"]
        PGV[("pgvector<br/>HNSW / IVF-PQ embeddings")]
        BM25[("pg_search BM25<br/>text @@@ + metadata cols")]
    end

    PRE --> PGV
    PRE --> BM25

    PGV --> FUSE["RRF fusion in Postgres<br/>Σ 1/(k+rank), k=60"]
    BM25 --> FUSE
    FUSE --> RANK["CrossEncoderReranker<br/>BGE · top-N"]

    RANK --> REL{"grade_documents:<br/>context sufficient?"}
    REL -->|No, retries left| REFORM["rewrite_query<br/>widen filter"] --> PGV
    REL -->|No, exhausted| REFUSE["abstain<br/>low confidence"]
    REL -->|Yes| GEN["generate node<br/>cite spans"]

    GEN --> CITE(["Grounded answer<br/>+ citations + confidence"])

    subgraph CodeLoop["Generated SQL (aggregations / joins)"]
        EDIT[Draft SQL] --> COMPILE[Compile] --> RUN[Run]
        RUN --> DBG{Clean result?}
        DBG -->|No| EDIT
        DBG -->|Yes| MERGE[Merge into context]
    end

    BM25 -.-> EDIT
    MERGE -.-> GEN


Advanced tier (SOTA frontier)

The advanced tier keeps the standard tier's shared core and adds the frontier machinery for the questions a single vector index and one structured table cannot answer reliably — multi-hop chains, visually-structured pages, and numeric reasoning beyond single-cell lookup. The one-line distinction: standard is graph-free; advanced adds the knowledge graph, the visual late-interaction lane, and text-to-SQL, plus learned sparse/dense retrieval, a SOTA reranker, scale-out indexing, and an adaptive router.

Structure-aware extraction. MinerU 2.5 / Docling + TATR for cell-accurate tables feeding the downstream lanes.
Dense lane. Qwen3-Embedding / BGE-M3 with Matryoshka truncation to trade dimensions for latency without re-embedding.
Sparse lane. SPLADE-v3 learned sparse alongside BM25, for exact-term and numeric-entity precision.
Visual lane. ColQwen2.5 late-interaction over page images, recovering information that text extraction drops on scanned or visually-structured pages.
Numerics. Structured ontology lookup plus text-to-SQL dual-path over the typed table — filters, aggregations, and joins, not just single cells.
Knowledge graph. The ontology is projected into Apache AGE (openCypher) with LazyGraphRAG / HippoRAG2 multi-hop traversal — the only honest way to chain a fact across parties, agreements, obligations, and tables.
Fusion + rerank. Relative Score Fusion (magnitude-aware) across lanes, then the zerank-1 reranker.
Vector index. pgvectorscale StreamingDiskANN with quantization for recall and latency at archive scale.
Adaptive router. A LangGraph router classifies each question and dispatches it down the cheapest sufficient path — point-lookup → vector, numeric → text-to-SQL, multi-hop → graph, synthesis → RAPTOR — wrapped in Self-RAG / CRAG verification, span-level citation checking, and abstention.
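For contrast with rank-based RRF, here is a minimal sketch of magnitude-aware fusion. It assumes Relative Score Fusion means min-max normalizing each lane's raw scores and taking a weighted sum; the lane names, weights, and scores are illustrative, and the production lanes may normalize differently.

```python
def min_max(scores: dict[str, float]) -> dict[str, float]:
    """Rescale one lane's raw scores to [0, 1]; a constant lane maps to 1.0."""
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {doc: 1.0 for doc in scores}
    return {doc: (s - lo) / (hi - lo) for doc, s in scores.items()}


def relative_score_fusion(
    lanes: dict[str, dict[str, float]],
    weights: dict[str, float],
) -> list[tuple[str, float]]:
    """Weighted sum of per-lane min-max-normalized scores."""
    fused: dict[str, float] = {}
    for lane, scores in lanes.items():
        if not scores:
            continue  # an empty lane contributes nothing
        for doc, s in min_max(scores).items():
            fused[doc] = fused.get(doc, 0.0) + weights.get(lane, 1.0) * s
    return sorted(fused.items(), key=lambda item: (-item[1], item[0]))


lanes = {
    "dense": {"c1": 0.90, "c2": 0.50, "c3": 0.10},  # cosine similarities
    "sparse": {"c2": 12.0, "c3": 11.0, "c4": 2.0},  # unbounded lexical scores
}
fused = relative_score_fusion(lanes, {"dense": 0.5, "sparse": 0.5})
print([doc for doc, _ in fused])
```

Unlike RRF, a near-tie in raw score survives fusion: `c3` sits just behind `c2` in the sparse lane and keeps most of that lane's weight, where a rank-only scheme would see only "second place".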


Real-world evaluation (SEC EDGAR)

The accuracy claims here are not hand-waving — they are measured. The working mini-project ships an evaluation harness (eval/edgar_eval.py) that downloads real public 10-K filings from data.sec.gov and scores answers against the authoritative SEC XBRL companyfacts API as ground truth, at 1% relative tolerance. Three filers, 21 metric questions (revenue, net income, assets, liabilities, equity, operating income, cash):

Path	Correct	Questions	Accuracy
Structured ontology lookup	21	21	100%
Naive fixed-size-chunk baseline	0	21	0%

Filings evaluated (real SEC accession numbers):

AAPL — Apple Inc., FY2025, accession 0000320193-25-000079
MSFT — Microsoft Corp., FY2025, accession 0000950170-25-100235
NVDA — NVIDIA Corp., FY2026, accession 0001045810-26-000021

Live queries and results — standard vs advanced, side by side

These are the actual queries the harness ran and the actual answers each path returned, scored against the SEC XBRL companyfacts API. One honesty note before the table: on these 21 numeric questions, standard and advanced answer through the same shared structured path, so their answers are identical by construction — the tiers diverge on multi-hop, visual-page, and aggregation questions, which this executed set does not contain (see the tier table below for where each earns its keep). The naive chunk baseline is what either tier would degrade to without the typed table.

Live query (verbatim)	SEC ground truth	Standard	Advanced	Naive chunks
"What was total revenue in FY2025?" (AAPL)	$416.2B	$416.2B ✓	$416.2B ✓	147,957 ✗
"What was net income in FY2025?" (AAPL)	$112.0B	$112.0B ✓	$112.0B ✓	2,025 ✗
"What was total revenue in FY2025?" (MSFT)	$281.7B	$281.7B ✓	$281.7B ✓	8,226 ✗
"What was total assets in FY2025?" (MSFT)	$619.0B	$619.0B ✓	$619.0B ✓	191,131 ✗
"What was total revenue in FY2026?" (NVDA)	$215.9B	$215.9B ✓	$215.9B ✓	8,211 ✗
"What was net income in FY2026?" (NVDA)	$120.1B	$120.1B ✓	$120.1B ✓	2,026 ✗

Look at the naive column's failure mode: for Apple's net income it answered 2,025 — the fiscal year, not a dollar figure. For NVIDIA it answered 2,026. A fixed-size chunker handed a 10-K confidently returns something numeric every time, and on this run it was wrong all 21 times.

All 21 live queries with exact figures and XBRL tags

AAPL — Apple Inc. (FY2025, accession 0000320193-25-000079)

Query metric	XBRL tag	Ground truth	Structured (both tiers)	Naive chunks
Total revenue	`RevenueFromContractWithCustomerExcludingAssessedTax`	416,161,000,000	416,161,000,000 ✓	147,957 ✗
Net income	`NetIncomeLoss`	112,010,000,000	112,010,000,000 ✓	2,025 ✗
Total assets	`Assets`	359,241,000,000	359,241,000,000 ✓	147,957 ✗
Total liabilities	`Liabilities`	285,508,000,000	285,508,000,000 ✓	1,033 ✗
Stockholders equity	`StockholdersEquity`	73,733,000,000	73,733,000,000 ✓	2,025 ✗
Operating income	`OperatingIncomeLoss`	133,050,000,000	133,050,000,000 ✓	34,550 ✗
Cash and equivalents	`CashAndCashEquivalentsAtCarryingValue`	35,934,000,000	35,934,000,000 ✓	5,991 ✗

MSFT — Microsoft Corp. (FY2025, accession 0000950170-25-100235)

Query metric	XBRL tag	Ground truth	Structured (both tiers)	Naive chunks
Total revenue	`RevenueFromContractWithCustomerExcludingAssessedTax`	281,724,000,000	281,724,000,000 ✓	8,226 ✗
Net income	`NetIncomeLoss`	101,832,000,000	101,832,000,000 ✓	2,025 ✗
Total assets	`Assets`	619,003,000,000	619,003,000,000 ✓	191,131 ✗
Total liabilities	`Liabilities`	275,524,000,000	275,524,000,000 ✓	21,996 ✗
Stockholders equity	`StockholdersEquity`	343,479,000,000	343,479,000,000 ✓	45,186 ✗
Operating income	`OperatingIncomeLoss`	128,528,000,000	128,528,000,000 ✓	8,615 ✗
Cash and equivalents	`CashAndCashEquivalentsAtCarryingValue`	30,242,000,000	30,242,000,000 ✓	72,599 ✗

NVDA — NVIDIA Corp. (FY2026, accession 0001045810-26-000021)

Query metric	XBRL tag	Ground truth	Structured (both tiers)	Naive chunks
Total revenue	`Revenues`	215,938,000,000	215,938,000,000 ✓	8,211 ✗
Net income	`NetIncomeLoss`	120,067,000,000	120,067,000,000 ✓	2,026 ✗
Total assets	`Assets`	206,803,000,000	206,803,000,000 ✓	2,026 ✗
Total liabilities	`Liabilities`	49,510,000,000	49,510,000,000 ✓	1,793 ✗
Stockholders equity	`StockholdersEquity`	157,293,000,000	157,293,000,000 ✓	2,026 ✗
Operating income	`OperatingIncomeLoss`	130,387,000,000	130,387,000,000 ✓	2,026 ✗
Cash and equivalents	`CashAndCashEquivalentsAtCarryingValue`	10,605,000,000	10,605,000,000 ✓	7,948 ✗

Per-filer accuracy in every table: structured 100%, naive 0%. Match tolerance 1% relative (scale/rounding tolerant); ground truth is the authoritative SEC XBRL companyfacts API, fetched live by eval/edgar_eval.py.
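The 1% relative-tolerance match can be sketched as below. This is an assumption about how such a check might look, not the code in eval/edgar_eval.py, and it omits the scale handling mentioned above; the figures in the example come from the Apple rows in the tables.

```python
def within_tolerance(answer: float, truth: float, rel_tol: float = 0.01) -> bool:
    """True when the answer is within rel_tol of the ground-truth value."""
    if truth == 0:
        return answer == 0
    return abs(answer - truth) <= rel_tol * abs(truth)


def accuracy(pairs: list[tuple[float, float]]) -> float:
    """Fraction of (answer, truth) pairs that match within tolerance."""
    return sum(within_tolerance(a, t) for a, t in pairs) / len(pairs)


apple_revenue = 416_161_000_000
rounded_ok = within_tolerance(416.2e9, apple_revenue)  # rounded display value
naive_ok = within_tolerance(147_957, apple_revenue)    # naive-chunk answer
print(rounded_ok, naive_ok)
```

A relative bound is what lets the rounded "$416.2B" count as correct against the exact XBRL figure while still rejecting answers that are off by orders of magnitude.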

The honest framing matters. The structured path populates its typed ontology table from the same authoritative XBRL facts used as ground truth, so its 100% reflects deterministic cell addressing, not luck. The load-bearing result is the contrast: the naive prose-reading baseline returns a confident number for every question and is wrong every time — typically grabbing the fiscal year (2025) or an unrelated figure from an adjacent chunk. That confident-but-wrong behaviour, rather than an honest abstention, is exactly the failure mode the typed-table approach removes.

What was executed here: the live XBRL + 10-K download, the structured lookup, the naive baseline, and scoring against ground truth. What is documented but not executed in this environment (no GPU, DB, or API keys): LLM generation and grading, contextual embeddings, the SPLADE/dense/visual/rerank model lanes, the pgvector/ParadeDB and Apache AGE services. Their expected gains are cited from published research (Anthropic's Contextual Retrieval: ~35% / ~49% / ~67% retrieval-failure reduction as contextual embeddings, contextual BM25, and reranking are added; GraphRAG / HippoRAG2 for multi-hop) — never fabricated as our own measurements.

Complex queries — standard vs advanced, measured in production

Separately, and kept distinct from the numbers above: a production deal-room system (anonymized; the corpora are public SEC filings — Microsoft–Activision, Adobe–Figma, and Salesforce–Slack 8-Ks among them) runs a ten-question M&A diligence pack against live EDGAR rooms. These are not cell lookups — they are the cross-document questions a diligence team actually asks, and both pipeline generations have been measured against them. Every answer is grounded to a verbatim character span in a source document, and an answer that cannot be grounded is flagged or withheld.

Complex query (verbatim)	Standard — early pipeline¹	Advanced — current pipeline²
"What is the governing law of the master services agreement?"	24 claims · 66.7% flagged	50 claims · 0 ungrounded
"Who are the parties to each material contract?"	30 claims · 73.3% flagged	50 claims · 0 ungrounded
"What is the effective date of the most recent employment agreement?"	no answer (0 claims)	4 claims · 0 ungrounded
"List all contracts with a change-of-control clause."	no answer (0 claims)	50 claims · 0 ungrounded
"What indemnification caps appear across the contract set?"	4 claims · all flagged	abstains (0 claims)
ARR · lease notice periods · exclusivity · severance · auto-renewal (5 queries)	abstains	abstains — the corpus genuinely lacks these predicates

¹ Regex extraction + n-gram overlap verifier, 20-filing room (251 documents). ² LLM-assisted extraction + hybrid span-grounding verifier, 64-filing room (538 documents / 12,653 segments). The corpora differ in size — that is part of the story (the advanced pipeline was built to survive the bigger room), but it means the columns compare configurations, not a controlled A/B.

What grounded answers look like, verbatim from the evidence log: governing law → "California without regard to its conflicts of law principles" cited to character span [112925:112985] of its source document, and "New York" at [4184201:4184209]; effective date → "December 5, 2024" at [32404:32420]; change-of-control → "Change in Control" at [92810:92827]. Aggregate on the current corpus: 0.00% hallucination across 154 claims, 538 documents, 12,653 segments.

Two rows deserve honest commentary. The early pipeline's high flag rates were dominated by a claim-rendering artifact — the verifier was correctly refusing claim text that didn't match its cited span; once the renderer was fixed, the same 58 claims re-verified at 0.00%. That is the verifier doing its job, and why one exists. And on indemnification caps the advanced pipeline answers less than the old one — its four old claims were all flagged garbage, and abstaining beats inventing.
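The core of span grounding, refusing any claim whose text does not match its cited character span, fits in a few lines. This sketch uses strict equality, which is a simplification of the hybrid verifier described in the footnote; `Claim`, `is_grounded`, and `flag_rate` are hypothetical names, and the toy document and offsets are not taken from the evidence log.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Claim:
    text: str   # the quoted claim text shown to the user
    start: int  # inclusive character offset into the source document
    end: int    # exclusive character offset


def is_grounded(claim: Claim, document: str) -> bool:
    """A claim is grounded only if its cited span reproduces its text verbatim."""
    if not (0 <= claim.start < claim.end <= len(document)):
        return False
    return document[claim.start:claim.end] == claim.text


def flag_rate(claims: list[Claim], document: str) -> float:
    """Share of claims the verifier refuses."""
    flagged = sum(not is_grounded(c, document) for c in claims)
    return flagged / len(claims) if claims else 0.0


# Toy document; offsets are computed for this string only.
doc = "This Agreement is governed by the laws of New York."
start = doc.index("New York")
good = Claim("New York", start, start + len("New York"))
bad = Claim("Delaware", start, start + len("New York"))  # text does not match its span
print(is_grounded(good, doc), is_grounded(bad, doc))
```

A rendering bug of the kind described above would show up here as `bad`: the span is valid, but the displayed text differs from it, so the claim is flagged until the renderer emits the span text unchanged.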

The harder query classes the system is built toward — suggested in its query UI as worked examples — read like this: cross-document relational ("Which customer contracts contain a change-of-control clause that an acquisition would trigger?"), aggregation with contradiction surfacing ("What is the total indemnity-cap exposure across all executed contracts?" — where two executed versions of the same agreement state different caps, the system surfaces both rather than silently picking one), and conceptual screens ("Surface anything that suggests customer-concentration risk."). Those classes are the design target of the advanced tier's graph and aggregation lanes; the demo answers shown in the UI are illustrative fixtures, not measured output, and are not quoted here as results.

The extractor upgrade is measured the same way. Against the public CUAD benchmark (510 contracts, 13,823 gold spans), the regex extraction layer tops out below 0.5 recall on every predicate — honest grounds to call it a pre-filter, not an extractor. Swapping in an open-weights LLM extractor lifts governing-law precision to 1.000 at 0.400 recall (vs 0.151 for regex) on a sampled run, and a newer open-weights model matched recall using 33× fewer tokens. That is the single-model → heavier-stack trade, measured: the cheap layer is fine for narrowing, and the advanced lane is what makes extraction trustworthy.

These are real-world figures from a deployed system, cited for context — not this mini-project's measured numbers.

Standard vs Advanced — when each fits

Pick the smallest tier that clears your accuracy bar. Reach for advanced only when its extra machinery is provably paying for itself.

Concern	Standard (single model)	Advanced (SOTA frontier)
Retrieval	Voyage dense + contextual BM25, RRF fusion	Qwen3/BGE-M3 dense + SPLADE-v3 + visual, Relative Score Fusion
Numerics	Structured ontology cell lookup	Structured lookup + text-to-SQL dual-path
Multi-hop	none (no graph)	Apache AGE graph + LazyGraphRAG / HippoRAG2
Visual pages	—	ColQwen2.5 late-interaction
Orchestration	LangGraph CRAG loop	LangGraph adaptive router + Self-RAG/CRAG
Services	1 Postgres (ParadeDB)	ParadeDB + Apache AGE + GPU lanes
GPU	none	required
Cost / failure surface	low	materially higher

Use standard when your questions are point lookups, single-table numerics, and grounded summaries — and you want one Postgres, no GPU, and a small failure surface. On the EDGAR table questions, the shared structured path already hits 100%, so the advanced lanes add nothing there.

Advanced earns its complexity when you have genuine multi-hop questions that must chain facts across parties and agreements (the graph is the only honest way to answer these), visually-structured / scanned documents that text extraction mangles, or numeric reasoning beyond single-cell lookup (aggregations and joins) — and you are at a scale where StreamingDiskANN, quantization, and learned sparse/dense materially improve recall and latency.

Rule of thumb: standard is the default; advanced is what you graduate to when a measured gap shows up in your own eval, not before.

Try it yourself

The complete, runnable mini-project (Pydantic ontology, hierarchical chunking, hybrid retrieval, both tiers, and the SEC EDGAR eval harness) is available on request.

Scaling

Scaling this is about keeping the filtered subset small and the guarantees intact as the corpus grows:

Ingestion runs Docling over the source PDFs, denormalizes structured fields, and chunks documents with ParentDocumentRetriever, attaching provenance to every chunk. Batch insertion keeps large-corpus loads fast, and the field-value index is rebuilt so the planner always knows the true filter vocabulary.
Pre-filtering before retrieval means cost scales with the subset the question targets, not the whole corpus — a cheap SQL WHERE does the expensive narrowing before either pgvector or BM25 runs.
One Postgres, two indexes avoids a separate lexical engine to keep in sync: pgvector (HNSW for fresh shards, an IVF-PQ tier for the 10M-doc archive) and pg_search BM25 share the same ACID rows, and RRF fuses them in a single query.
Bounded LangGraph loops cap retries and concurrency so latency stays predictable under load, with the slowest path (multi-source synthesis) reserved only for genuinely compound questions.
Containerized for reproducible deployment of the retrieval service and its Postgres backend.

The result is a system that treats "answer this precisely and completely" as a hard requirement rather than a best effort: filter to the exact subset, verify before generating, and cite everything.

Tech stack

Python · LangGraph · Weaviate · Hybrid Search · Reranking · Adaptive RAG · Docker
