Deal-Room Document Intelligence
Hybrid retrieval over contracts, due-diligence packs, and financial filings — precise, complete, and citation-backed answers at scale, where naive RAG returns partial or hallucinated results.
Architecture
A two-collection vector store separates structured metadata from unstructured document chunks. A query planner maps filter rules from the question to pre-filter the corpus to an exact subset before retrieval, so "find ALL X in subset Y" returns complete, non-hallucinated results. An adaptive/corrective loop verifies retrieved context, reranks, and reformulates on failure; generated query code runs through a LangGraph edit-compile-run-debug cycle.
Case study
A production-grade hybrid retrieval system for the kind of corpus that breaks naive RAG: thousands of long, structurally complex documents — contracts, due-diligence packs, financial filings, board materials — where a single wrong or missing clause changes the answer. The system answers precise questions reliably, returns complete results over filtered subsets, and grounds every claim in a citation.
It ships in two tiers over one shared core (a Pydantic ontology, hierarchical table-aware chunking, and a table-QA scoring harness). The standard tier is Anthropic-only — Claude for generation and grading, Voyage embeddings, contextual retrieval, one Postgres, no GPU. The advanced tier keeps that core and adds the SOTA frontier machinery — a knowledge graph, learned sparse/dense retrieval, a visual late-interaction lane, text-to-SQL, and an adaptive router — for the questions the standard tier provably cannot answer. You graduate to advanced only when a measured gap shows up in your own eval.
[[toc]]
The problem: precise questions over a messy, massive corpus
Deal rooms are adversarial to retrieval. The documents are long and dense, mix prose with tables and defined terms, and the questions people actually ask are simultaneously precise and exhaustive:
- "List every contract in this data set with a change-of-control clause."
- "What is the indemnity cap, and show me the exact governing-law language."
- "Find all agreements expiring in the next 18 months in the EU subset."
A vanilla embed-and-retrieve pipeline fails these in two predictable ways:
- Partial recall. Top-k semantic search returns some matching documents, not all of them. For a "find every X" question that is not a ranking problem — it is a completeness problem. Missing one clause is a wrong answer.
- Hallucinated specifics. When the retrieved context is thin or off-target, the model fills the gap. In a deal room a confidently invented figure is worse than no answer at all.
The first prototype was the textbook approach — a query classifier, a single semantic search, and a relevance check with query rewriting. It demonstrated the principles cleanly but was not trustworthy: it misclassified questions and returned partial results often enough that no one could rely on it.
The production system is built on LangChain for the retrieval components and LangGraph for the corrective-RAG state machine, over a single Postgres that carries both retrieval arms — pgvector for dense search and ParadeDB pg_search for BM25 — so the lexical and semantic indexes never drift out of sync.
Ingestion: turning messy PDFs into clean markdown
Deal-room source files are the hard case for extraction: scanned exhibits,
multi-page tables, footnotes, and defined-term blocks. The ingestion pipeline
uses Docling (IBM) as the primary converter — its DocumentConverter runs
layout analysis and TableFormer table-structure recognition (the same family
as Microsoft's Table Transformer / TATR), with an OCR fallback (RapidOCR /
Tesseract) for scanned pages — and emits clean markdown with tables preserved as
real grids:
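```python
from docling.datamodel.base_models import InputFormat
from docling.datamodel.pipeline_options import PdfPipelineOptions
from docling.document_converter import DocumentConverter, PdfFormatOption

pdf_path = "exhibit.pdf"  # placeholder path; point this at a real file

# Enable OCR for scanned pages and table-structure recognition for tables.
opts = PdfPipelineOptions(do_ocr=True, do_table_structure=True)
conv = DocumentConverter(
    format_options={InputFormat.PDF: PdfFormatOption(pipeline_options=opts)}
)
# Convert the PDF and export markdown with tables preserved as grids.
md = conv.convert(pdf_path).document.export_to_markdown()
```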
Born-digital PDFs that need no OCR take a faster path through
pymupdf4llm.to_markdown(); Marker / MinerU are kept in reserve for the
most layout-heavy filings. The output is markdown — not scrambled text — so the
model later reads a table as a table.
Standard tier (single model)
The standard tier is the default, and the one most deal rooms should ship first:
Claude for generation, document grading, and faithfulness checks; Voyage
embeddings paired with Anthropic's Contextual Retrieval; ParadeDB pg_search
BM25; a single Postgres; no GPU. Its job is to find the right table, address the
right cell, answer with a citation, and abstain when it is not sure.
The approach: filter first, then search
The core insight is that completeness is a filtering problem, not a ranking
problem. Before any semantic search runs, a LangGraph planner node extracts
structured filter rules from the question and pre-filters the corpus down to the
exact subset the question is scoped to — a plain SQL WHERE over indexed
metadata columns. Retrieval then happens inside that subset.
It all lives in one Postgres, with two complementary indexes over the same rows:
- Structured columns + BM25: normalized metadata per document
(counterparty, jurisdiction, document type, effective and expiry dates,
monetary terms, clause flags) for exact filtering, counting, and aggregation,
plus a ParadeDB pg_search BM25 index over the chunk text for lexical
precision (CREATE INDEX ... USING bm25 (...) WITH (key_field='id'), queried
with the @@@ operator).
- pgvector embeddings: chunked document text (sections, defined-term blocks,
tables-as-markdown), embedded and stored in pgvector (HNSW, with an IVF-PQ
tier at archive scale), each chunk carrying provenance metadata for citation.
Chunking uses LangChain's ParentDocumentRetriever: small child chunks for precise matching, large parent chunks for context (the small-to-big pattern), with tables kept intact — never split across a chunk. A "find ALL X in subset Y" question is answered by enumerating the SQL-filtered subset directly — every record, deterministically — and then enriching with the fused vector/BM25 hits for the narrative parts of the answer. One database, ACID guarantees, and no separate lexical engine to keep in sync.
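The small-to-big idea with intact tables can be sketched without any library. This is a minimal illustration, not LangChain's ParentDocumentRetriever: the function names (`split_blocks`, `is_table`, `child_chunks`), the `parent_id` format, and the character limit are all assumptions made for the example.

```python
def split_blocks(markdown: str) -> list[str]:
    """Split markdown into blank-line-separated blocks; a table stays one block."""
    blocks, current = [], []
    for line in markdown.splitlines():
        if line.strip():
            current.append(line)
        elif current:
            blocks.append("\n".join(current))
            current = []
    if current:
        blocks.append("\n".join(current))
    return blocks


def is_table(block: str) -> bool:
    """Treat a block as a markdown table when every line starts with a pipe."""
    return all(line.lstrip().startswith("|") for line in block.splitlines())


def child_chunks(parent_id: str, markdown: str, max_chars: int = 200) -> list[dict]:
    """Emit small child chunks that point back to their parent section.

    Long prose blocks are cut at word boundaries; tables are never split,
    even when they exceed max_chars.
    """
    children = []
    for block in split_blocks(markdown):
        table = is_table(block)
        if table or len(block) <= max_chars:
            pieces = [block]
        else:
            pieces, buf = [], ""
            for word in block.split():
                if buf and len(buf) + 1 + len(word) > max_chars:
                    pieces.append(buf)
                    buf = word
                else:
                    buf = f"{buf} {word}".strip()
            if buf:
                pieces.append(buf)
        for piece in pieces:
            children.append({"parent_id": parent_id, "text": piece, "is_table": table})
    return children
```

Each child keeps a `parent_id`, so a match on a small chunk can be expanded to the full parent section at answer time.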
The query planner
The planner is a hybrid of fast rules and an LLM fallback:
- A rules path matches question terms against a field-value index built at ingestion (the set of jurisdictions, document types, clause flags, and numeric ranges that actually exist in the corpus). When it can map the question to filters with confidence, it does so in milliseconds with no model call.
- An LLM path handles complex or ambiguous phrasing, emitting a structured query spec — text filters, numeric comparators, and a query type — which is then validated against the same field-value index so the planner can never invent a filter value that does not exist in the data.
The output is a filter spec compiled into a SQL WHERE clause, narrowing the
corpus before similarity ranking. This is what turns "find every X" from a
best-effort top-k into a guaranteed enumeration.
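A minimal sketch of the validate-then-compile step, under stated assumptions: the field names, the contents of `FIELD_VALUES`, and the `%s` placeholder style (as used by psycopg) are illustrative and not taken from the production planner.

```python
# Hypothetical field-value index, rebuilt at ingestion from values present in the corpus.
FIELD_VALUES = {
    "jurisdiction": {"EU", "UK", "US"},
    "doc_type": {"MSA", "NDA", "lease"},
}
# Hypothetical columns that accept range comparisons rather than enumerated values.
RANGE_FIELDS = {"expiry_date", "indemnity_cap"}
COMPARATORS = {"=", "<", "<=", ">", ">="}


def compile_where(filters):
    """Compile (field, op, value) triples into a parameterized WHERE clause.

    Raises ValueError for any field, comparator, or enumerated value the index
    does not know, so an invented filter never reaches the database.
    """
    clauses, params = [], []
    for field, op, value in filters:
        if op not in COMPARATORS:
            raise ValueError(f"unsupported comparator: {op!r}")
        if field in FIELD_VALUES:
            if op != "=" or value not in FIELD_VALUES[field]:
                raise ValueError(f"unknown {field} filter: {op} {value!r}")
        elif field not in RANGE_FIELDS:
            raise ValueError(f"unknown field: {field!r}")
        # Column names come from the whitelists above; values are bound as parameters.
        clauses.append(f"{field} {op} %s")
        params.append(value)
    return (" AND ".join(clauses) or "TRUE"), params
```

For example, `compile_where([("jurisdiction", "=", "EU"), ("expiry_date", "<=", "2026-12-31")])` yields the clause `jurisdiction = %s AND expiry_date <= %s` with the two values as bound parameters, while a jurisdiction that is absent from the index is rejected before any query runs.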
The planner is the first node of a LangGraph StateGraph, which decomposes a
multi-part question into a sequence of filter-and-retrieve steps with
intermediate evidence, so a compound question like "compare the indemnity caps
across all EU contracts expiring next year" resolves as a planned graph rather
than one flat query.
Adaptive, corrective retrieval
Filtering gets the right subset; the LangGraph state machine guarantees the retrieved context actually answers the question before a word is generated.
- Hybrid retrieval + RRF fusion. Inside the filtered subset, both arms run in Postgres: a pgvector ANN search (cosine, embedding <=> qv) and a pg_search BM25 search (text @@@ q). Their rankings are combined with Reciprocal Rank Fusion, RRF(d) = Σᵢ 1/(k + rankᵢ(d)) with k = 60, the pattern ParadeDB documents for hybrid search. RRF scores by rank, not raw magnitude, so the two arms need no normalization and documents in both lists rise to the top.
- Reranking. The fused candidates are reranked by a LangChain ContextualCompressionRetriever wrapping a CrossEncoderReranker (BAAI/bge-reranker-v2-m3), so the strongest evidence leads, improving both answer quality and citation precision.
- grade_documents check. A LangGraph node verifies the context (and later the answer) against the question via NLI entailment. A request to quote exact language is held to a stricter bar than a request to summarize.
- Corrective reformulation. If the check fails, a conditional edge routes to a rewrite_query node that reformulates and re-enters retrieval, widening the filter or leaning on the alternate arm, for a bounded number of attempts. The user's original question is always preserved for final synthesis.
- Refusal over hallucination. If sufficient grounded context still cannot be assembled, the graph routes to an abstain node: it reports low confidence instead of inventing an answer.
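The RRF formula is small enough to show in full. The sketch below fuses ranked lists in plain Python rather than in SQL; the function name `rrf_fuse` and the chunk ids are illustrative assumptions, and ties are broken by id only to make the output deterministic.

```python
from collections import defaultdict


def rrf_fuse(rankings, k=60):
    """Fuse ranked lists of ids with Reciprocal Rank Fusion.

    Each list contributes 1 / (k + rank) per id, with ranks starting at 1.
    Raw scores are never used, so the arms need no normalization.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by id for a stable order.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


dense = ["c1", "c2", "c3"]   # hypothetical pgvector ranking
bm25 = ["c3", "c1", "c4"]    # hypothetical BM25 ranking
fused = rrf_fuse([dense, bm25])
```

Here `c1` and `c3` appear in both lists and therefore lead the fused order, ahead of `c2` and `c4`, which each appear in only one.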
Generated query code as an edit-compile-run-debug loop
Some questions — aggregations, cross-document joins, conditional counts — are best answered by generating SQL against the structured Postgres columns rather than retrieving prose. That generated query runs through a LangGraph edit-compile-run-debug cycle: the graph drafts the SQL, executes it, inspects the result or error, and revises until it runs cleanly — the same loop you would use for any production code generation, applied to query synthesis.
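A minimal sketch of that loop, with two loud assumptions: `sqlite3` stands in for Postgres so the example runs anywhere, and `scripted_draft` is a hypothetical stand-in for the LLM that repairs its query once it sees an error message. Neither is the production implementation.

```python
import sqlite3


def run_with_repair(conn, draft, max_attempts=3):
    """Draft SQL, run it, and feed any database error back into the next draft.

    `draft` is a hypothetical callable: (error message or None) -> SQL string.
    Returns (rows, attempts_used); raises once the attempt budget is spent.
    """
    error = None
    for attempt in range(1, max_attempts + 1):
        sql = draft(error)
        try:
            rows = conn.execute(sql).fetchall()
        except sqlite3.Error as exc:
            error = str(exc)  # the debug signal for the next edit
            continue
        return rows, attempt
    raise RuntimeError(f"no runnable query after {max_attempts} attempts: {error}")


def scripted_draft(error):
    """Stand-in for the model: the first draft names the wrong table."""
    table = "contracts" if error else "contract"
    return f"SELECT COUNT(*) FROM {table} WHERE jurisdiction = 'EU'"


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE contracts (id INTEGER PRIMARY KEY, jurisdiction TEXT)")
conn.executemany(
    "INSERT INTO contracts (jurisdiction) VALUES (?)", [("EU",), ("US",), ("EU",)]
)
rows, attempts = run_with_repair(conn, scripted_draft)
```

The first draft fails on a missing table, the error text is passed back, and the second draft runs. Note that a clean run only proves the query executes; checking that the result answers the question is a separate step.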
Standard-tier architecture
```mermaid
flowchart TD
    PDF([Deal-room PDFs]) --> ING[Docling → markdown<br/>TableFormer + OCR fallback]
    ING --> CHUNK[ParentDocumentRetriever<br/>small-to-big, tables intact]
    CHUNK --> STORE
    Q([User question]) --> PLAN
    subgraph Graph["LangGraph StateGraph"]
        PLAN[plan node<br/>intent + predicates] --> CONF{High confidence?}
        CONF -->|No| LLM[LLM filter extraction]
        LLM --> VAL[Validate against field-value index]
        CONF -->|Yes| SPEC
        VAL --> SPEC[Filter spec → SQL WHERE]
    end
    SPEC --> PRE[SQL pre-filter to subset]
    subgraph STORE["Single Postgres"]
        PGV[(pgvector<br/>HNSW / IVF-PQ embeddings)]
        BM25[("pg_search BM25<br/>text @@@ + metadata cols")]
    end
    PRE --> PGV
    PRE --> BM25
    PGV --> FUSE["RRF fusion in Postgres<br/>Σ 1/(k+rank), k=60"]
    BM25 --> FUSE
    FUSE --> RANK[CrossEncoderReranker<br/>BGE · top-N]
    RANK --> REL{"grade_documents:<br/>context sufficient?"}
    REL -->|No, retries left| REFORM[rewrite_query<br/>widen filter] --> PGV
    REL -->|No, exhausted| REFUSE[abstain<br/>low confidence]
    REL -->|Yes| GEN[generate node<br/>cite spans]
    GEN --> CITE([Grounded answer<br/>+ citations + confidence])
    subgraph CodeLoop["Generated SQL (aggregations / joins)"]
        EDIT[Draft SQL] --> COMPILE[Compile] --> RUN[Run]
        RUN --> DBG{Clean result?}
        DBG -->|No| EDIT
        DBG -->|Yes| MERGE[Merge into context]
    end
    BM25 -.-> EDIT
    MERGE -.-> GEN
```

The diagram traces a single query through the standard hybrid pipeline: filter extraction, the pgvector and pg_search BM25 arms fused with RRF, cross-encoder reranking, and the LangGraph corrective retry loop that fires when context is insufficient.
Advanced tier (SOTA frontier)
The advanced tier keeps the standard tier's shared core and adds the frontier machinery for the questions a single vector index and one structured table cannot answer reliably — multi-hop chains, visually-structured pages, and numeric reasoning beyond single-cell lookup. The one-line distinction: standard is graph-free; advanced adds the knowledge graph, the visual late-interaction lane, and text-to-SQL, plus learned sparse/dense retrieval, a SOTA reranker, scale-out indexing, and an adaptive router.
- Structure-aware extraction. MinerU 2.5 / Docling + TATR for cell-accurate tables feeding the downstream lanes.
- Dense lane. Qwen3-Embedding / BGE-M3 with Matryoshka truncation to trade dimensions for latency without re-embedding.
- Sparse lane. SPLADE-v3 learned sparse alongside BM25, for exact-term and numeric-entity precision.
- Visual lane. ColQwen2.5 late-interaction over page images, recovering information that text extraction drops on scanned or visually-structured pages.
- Numerics. Structured ontology lookup plus text-to-SQL dual-path over the typed table — filters, aggregations, and joins, not just single cells.
- Knowledge graph. The ontology is projected into Apache AGE (openCypher) with LazyGraphRAG / HippoRAG2 multi-hop traversal — the only honest way to chain a fact across parties, agreements, obligations, and tables.
- Fusion + rerank. Relative Score Fusion (magnitude-aware) across lanes, then the zerank-1 reranker.
- Vector index. pgvectorscale StreamingDiskANN with quantization for recall and latency at archive scale.
- Adaptive router. A LangGraph router classifies each question and dispatches it down the cheapest sufficient path — point-lookup → vector, numeric → text-to-SQL, multi-hop → graph, synthesis → RAPTOR — wrapped in Self-RAG / CRAG verification, span-level citation checking, and abstention.
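Unlike rank-based RRF, a magnitude-aware fusion has to put each lane's scores on a common scale first. The sketch below uses one common formulation: min-max normalize each lane, then take a weighted sum. The source does not specify its exact variant, so the function name, the lane names, and the equal default weights are assumptions.

```python
def relative_score_fusion(lanes, weights=None):
    """Fuse per-lane raw scores by min-max normalization and weighted sum.

    lanes: {lane_name: {doc_id: raw_score}}, where a higher raw score is better.
    Returns [(doc_id, fused_score)] sorted best-first.
    """
    weights = weights or {name: 1.0 for name in lanes}
    fused = {}
    for name, scores in lanes.items():
        if not scores:
            continue
        lo, hi = min(scores.values()), max(scores.values())
        span = hi - lo
        for doc_id, raw in scores.items():
            # A lane whose scores are all equal gives every hit full credit.
            norm = (raw - lo) / span if span else 1.0
            fused[doc_id] = fused.get(doc_id, 0.0) + weights[name] * norm
    return sorted(fused.items(), key=lambda item: (-item[1], item[0]))


lanes = {
    "dense": {"a": 0.9, "b": 0.5, "c": 0.1},   # hypothetical cosine similarities
    "sparse": {"b": 12.0, "c": 4.0},           # hypothetical sparse scores
}
ranked = relative_score_fusion(lanes)
```

Document `b` wins because it is mid-ranked in the dense lane and top-ranked in the sparse lane; how far apart the scores sit inside a lane matters here, which is the difference from RRF.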
End to end, the advanced router classifies intent and fans the query out across the dense, sparse, visual, graph, and text-to-SQL lanes, fuses the results with Relative Score Fusion, reranks with zerank-1, and verifies span-level citations before answering.
Real-world evaluation (SEC EDGAR)
The accuracy claims here are not hand-waving — they are measured. The working
mini-project ships an evaluation harness (eval/edgar_eval.py) that downloads
real public 10-K filings from data.sec.gov and scores answers against the
authoritative SEC XBRL companyfacts API as ground truth, at 1% relative
tolerance. Three filers, 21 metric questions (revenue, net income, assets,
liabilities, equity, operating income, cash):
| Path | Correct | Questions | Accuracy |
|---|---|---|---|
| Structured ontology lookup | 21 | 21 | 100% |
| Naive fixed-size-chunk baseline | 0 | 21 | 0% |
Filings evaluated (real SEC accession numbers):
- AAPL: Apple Inc., FY2025, accession 0000320193-25-000079
- MSFT: Microsoft Corp., FY2025, accession 0000950170-25-100235
- NVDA: NVIDIA Corp., FY2026, accession 0001045810-26-000021
Live queries and results — standard vs advanced, side by side
These are the actual queries the harness ran and the actual answers each path
returned, scored against the SEC XBRL companyfacts API. One honesty note
before the table: on these 21 numeric questions, standard and advanced answer
through the same shared structured path, so their answers are identical by
construction — the tiers diverge on multi-hop, visual-page, and aggregation
questions, which this executed set does not contain (see the tier table below
for where each earns its keep). The naive chunk baseline is what either tier
would degrade to without the typed table.
| Live query (verbatim) | SEC ground truth | Standard | Advanced | Naive chunks |
|---|---|---|---|---|
| "What was total revenue in FY2025?" (AAPL) | $416.2B | $416.2B ✓ | $416.2B ✓ | 147,957 ✗ |
| "What was net income in FY2025?" (AAPL) | $112.0B | $112.0B ✓ | $112.0B ✓ | 2,025 ✗ |
| "What was total revenue in FY2025?" (MSFT) | $281.7B | $281.7B ✓ | $281.7B ✓ | 8,226 ✗ |
| "What was total assets in FY2025?" (MSFT) | $619.0B | $619.0B ✓ | $619.0B ✓ | 191,131 ✗ |
| "What was total revenue in FY2026?" (NVDA) | $215.9B | $215.9B ✓ | $215.9B ✓ | 8,211 ✗ |
| "What was net income in FY2026?" (NVDA) | $120.1B | $120.1B ✓ | $120.1B ✓ | 2,026 ✗ |
Look at the naive column's failure mode: for Apple's net income it answered 2,025, which is the fiscal year, not a dollar figure. For NVIDIA it answered 2,026. Handed a 10-K, a fixed-size chunker confidently returns something numeric every time, and on this run it was wrong all 21 times.
All 21 live queries with exact figures and XBRL tags
AAPL — Apple Inc. (FY2025, accession 0000320193-25-000079)
| Query metric | XBRL tag | Ground truth | Structured (both tiers) | Naive chunks |
|---|---|---|---|---|
| Total revenue | RevenueFromContractWithCustomerExcludingAssessedTax | 416,161,000,000 | 416,161,000,000 ✓ | 147,957 ✗ |
| Net income | NetIncomeLoss | 112,010,000,000 | 112,010,000,000 ✓ | 2,025 ✗ |
| Total assets | Assets | 359,241,000,000 | 359,241,000,000 ✓ | 147,957 ✗ |
| Total liabilities | Liabilities | 285,508,000,000 | 285,508,000,000 ✓ | 1,033 ✗ |
| Stockholders equity | StockholdersEquity | 73,733,000,000 | 73,733,000,000 ✓ | 2,025 ✗ |
| Operating income | OperatingIncomeLoss | 133,050,000,000 | 133,050,000,000 ✓ | 34,550 ✗ |
| Cash and equivalents | CashAndCashEquivalentsAtCarryingValue | 35,934,000,000 | 35,934,000,000 ✓ | 5,991 ✗ |
MSFT — Microsoft Corp. (FY2025, accession 0000950170-25-100235)
| Query metric | XBRL tag | Ground truth | Structured (both tiers) | Naive chunks |
|---|---|---|---|---|
| Total revenue | RevenueFromContractWithCustomerExcludingAssessedTax | 281,724,000,000 | 281,724,000,000 ✓ | 8,226 ✗ |
| Net income | NetIncomeLoss | 101,832,000,000 | 101,832,000,000 ✓ | 2,025 ✗ |
| Total assets | Assets | 619,003,000,000 | 619,003,000,000 ✓ | 191,131 ✗ |
| Total liabilities | Liabilities | 275,524,000,000 | 275,524,000,000 ✓ | 21,996 ✗ |
| Stockholders equity | StockholdersEquity | 343,479,000,000 | 343,479,000,000 ✓ | 45,186 ✗ |
| Operating income | OperatingIncomeLoss | 128,528,000,000 | 128,528,000,000 ✓ | 8,615 ✗ |
| Cash and equivalents | CashAndCashEquivalentsAtCarryingValue | 30,242,000,000 | 30,242,000,000 ✓ | 72,599 ✗ |
NVDA — NVIDIA Corp. (FY2026, accession 0001045810-26-000021)
| Query metric | XBRL tag | Ground truth | Structured (both tiers) | Naive chunks |
|---|---|---|---|---|
| Total revenue | Revenues | 215,938,000,000 | 215,938,000,000 ✓ | 8,211 ✗ |
| Net income | NetIncomeLoss | 120,067,000,000 | 120,067,000,000 ✓ | 2,026 ✗ |
| Total assets | Assets | 206,803,000,000 | 206,803,000,000 ✓ | 2,026 ✗ |
| Total liabilities | Liabilities | 49,510,000,000 | 49,510,000,000 ✓ | 1,793 ✗ |
| Stockholders equity | StockholdersEquity | 157,293,000,000 | 157,293,000,000 ✓ | 2,026 ✗ |
| Operating income | OperatingIncomeLoss | 130,387,000,000 | 130,387,000,000 ✓ | 2,026 ✗ |
| Cash and equivalents | CashAndCashEquivalentsAtCarryingValue | 10,605,000,000 | 10,605,000,000 ✓ | 7,948 ✗ |
Per-filer accuracy in every table: structured 100%, naive 0%. Match tolerance
1% relative (scale/rounding tolerant); ground truth is the authoritative SEC
XBRL companyfacts API, fetched live by eval/edgar_eval.py.
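The 1% relative-tolerance match can be sketched as below. This is an illustration of the scoring rule only, not the code in eval/edgar_eval.py; the function names are assumptions, and the sample values are Apple's net-income row from the table above.

```python
def within_tolerance(answer, truth, rel_tol=0.01):
    """Return True when answer is within rel_tol (relative) of the ground truth."""
    if truth == 0:
        return answer == 0
    return abs(answer - truth) / abs(truth) <= rel_tol


def accuracy(pairs, rel_tol=0.01):
    """Fraction of (answer, truth) pairs that match within tolerance."""
    if not pairs:
        return 0.0
    hits = sum(within_tolerance(answer, truth, rel_tol) for answer, truth in pairs)
    return hits / len(pairs)


truth = 112_010_000_000          # AAPL NetIncomeLoss, from the table above
structured_ok = within_tolerance(112_010_000_000, truth)
rounded_ok = within_tolerance(112.0e9, truth)   # a rounded "$112.0B" still matches
naive_ok = within_tolerance(2_025, truth)       # the fiscal year does not
```

A relative tolerance forgives rounding to the displayed precision but cannot rescue an answer that is off by orders of magnitude, which is why every naive-chunk answer fails.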
The honest framing matters. The structured path populates its typed ontology
table from the same authoritative XBRL facts used as ground truth, so its 100%
reflects deterministic cell addressing, not luck. The load-bearing result is
the contrast: the naive prose-reading baseline returns a confident number for
every question and is wrong every time — typically grabbing the fiscal year
(2025) or an unrelated figure from an adjacent chunk. That confident-but-wrong
behaviour, rather than an honest abstention, is exactly the failure mode the
typed-table approach removes.
What was executed here: the live XBRL + 10-K download, the structured lookup, the naive baseline, and scoring against ground truth. What is documented but not executed in this environment (no GPU, DB, or API keys): LLM generation and grading, contextual embeddings, the SPLADE/dense/visual/rerank model lanes, the pgvector/ParadeDB and Apache AGE services. Their expected gains are cited from published research (Anthropic's Contextual Retrieval: ~35% / ~49% / ~67% retrieval-failure reduction as contextual embeddings, contextual BM25, and reranking are added; GraphRAG / HippoRAG2 for multi-hop) — never fabricated as our own measurements.
Complex queries — standard vs advanced, measured in production
Separately, and kept distinct from the numbers above: a production deal-room system (anonymized; the corpora are public SEC filings — Microsoft–Activision, Adobe–Figma, and Salesforce–Slack 8-Ks among them) runs a ten-question M&A diligence pack against live EDGAR rooms. These are not cell lookups — they are the cross-document questions a diligence team actually asks, and both pipeline generations have been measured against them. Every answer is grounded to a verbatim character span in a source document, and an answer that cannot be grounded is flagged or withheld.
| Complex query (verbatim) | Standard — early pipeline¹ | Advanced — current pipeline² |
|---|---|---|
| "What is the governing law of the master services agreement?" | 24 claims · 66.7% flagged | 50 claims · 0 ungrounded |
| "Who are the parties to each material contract?" | 30 claims · 73.3% flagged | 50 claims · 0 ungrounded |
| "What is the effective date of the most recent employment agreement?" | no answer (0 claims) | 4 claims · 0 ungrounded |
| "List all contracts with a change-of-control clause." | no answer (0 claims) | 50 claims · 0 ungrounded |
| "What indemnification caps appear across the contract set?" | 4 claims · all flagged | abstains (0 claims) |
| ARR · lease notice periods · exclusivity · severance · auto-renewal (5 queries) | abstains | abstains — the corpus genuinely lacks these predicates |
¹ Regex extraction + n-gram overlap verifier, 20-filing room (251 documents). ² LLM-assisted extraction + hybrid span-grounding verifier, 64-filing room (538 documents / 12,653 segments). The corpora differ in size — that is part of the story (the advanced pipeline was built to survive the bigger room), but it means the columns compare configurations, not a controlled A/B.
What grounded answers look like, verbatim from the evidence log: governing law
→ "California without regard to its conflicts of law principles" cited to
character span [112925:112985] of its source document, and "New York" at
[4184201:4184209]; effective date → "December 5, 2024" at [32404:32420];
change-of-control → "Change in Control" at [92810:92827]. Aggregate on the
current corpus: 0.00% hallucination across 154 claims, 538 documents, 12,653
segments.
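The quoted spans are consistent with half-open character offsets, as in a Python slice: for example, [4184201:4184209] covers exactly the eight characters of "New York". A minimal sketch of an exact-match grounding check follows. The production verifier is described as hybrid, so this shows only the strictest case; the `Claim` type and the sample sentence are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Claim:
    text: str    # the value the answer asserts
    start: int   # inclusive character offset into the source document
    end: int     # exclusive character offset


def is_grounded(claim: Claim, source: str) -> bool:
    """A claim is grounded only if its cited span reproduces its text verbatim."""
    in_bounds = 0 <= claim.start < claim.end <= len(source)
    return in_bounds and source[claim.start:claim.end] == claim.text


# Hypothetical source sentence, used only to exercise the check.
source = "This Agreement is governed by the laws of New York."
offset = source.index("New York")
good = Claim("New York", offset, offset + len("New York"))
bad = Claim("Delaware", offset, offset + len("Delaware"))
```

`is_grounded(good, source)` holds, while `bad` is rejected because the cited span says something else. A claim that fails this check would be flagged or withheld rather than shown.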
Two rows deserve honest commentary. The early pipeline's high flag rates were dominated by a claim-rendering artifact: the verifier was correctly refusing claim text that didn't match its cited span, and once the renderer was fixed, the same 58 claims re-verified at 0.00%. That is the verifier doing its job, and why one exists. On indemnification caps, the advanced pipeline answers less than the old one: the old pipeline's four claims were all flagged as garbage, and abstaining beats inventing.
The harder query classes the system is built toward — suggested in its query UI as worked examples — read like this: cross-document relational ("Which customer contracts contain a change-of-control clause that an acquisition would trigger?"), aggregation with contradiction surfacing ("What is the total indemnity-cap exposure across all executed contracts?" — where two executed versions of the same agreement state different caps, the system surfaces both rather than silently picking one), and conceptual screens ("Surface anything that suggests customer-concentration risk."). Those classes are the design target of the advanced tier's graph and aggregation lanes; the demo answers shown in the UI are illustrative fixtures, not measured output, and are not quoted here as results.
The extractor upgrade is measured the same way. Against the public CUAD benchmark (510 contracts, 13,823 gold spans), the regex extraction layer tops out below 0.5 recall on every predicate — honest grounds to call it a pre-filter, not an extractor. Swapping in an open-weights LLM extractor lifts governing-law precision to 1.000 at 0.400 recall (vs 0.151 for regex) on a sampled run, and a newer open-weights model matched recall using 33× fewer tokens. That is the single-model → heavier-stack trade, measured: the cheap layer is fine for narrowing, and the advanced lane is what makes extraction trustworthy.
These are real-world figures from a deployed system, cited for context — not this mini-project's measured numbers.
Standard vs Advanced — when each fits
Pick the smallest tier that clears your accuracy bar. Reach for advanced only when its extra machinery is provably paying for itself.
| Concern | Standard (single model) | Advanced (SOTA frontier) |
|---|---|---|
| Retrieval | Voyage dense + contextual BM25, RRF fusion | Qwen3/BGE-M3 dense + SPLADE-v3 + visual, Relative Score Fusion |
| Numerics | Structured ontology cell lookup | Structured lookup + text-to-SQL dual-path |
| Multi-hop | none (no graph) | Apache AGE graph + LazyGraphRAG / HippoRAG2 |
| Visual pages | — | ColQwen2.5 late-interaction |
| Orchestration | LangGraph CRAG loop | LangGraph adaptive router + Self-RAG/CRAG |
| Services | 1 Postgres (ParadeDB) | ParadeDB + Apache AGE + GPU lanes |
| GPU | none | required |
| Cost / failure surface | low | materially higher |
Use standard when your questions are point lookups, single-table numerics, and grounded summaries — and you want one Postgres, no GPU, and a small failure surface. On the EDGAR table questions, the shared structured path already hits 100%, so the advanced lanes add nothing there.
Advanced earns its complexity when you have genuine multi-hop questions that must chain facts across parties and agreements (the graph is the only honest way to answer these), visually-structured / scanned documents that text extraction mangles, or numeric reasoning beyond single-cell lookup (aggregations and joins) — and you are at a scale where StreamingDiskANN, quantization, and learned sparse/dense materially improve recall and latency.
Rule of thumb: standard is the default; advanced is what you graduate to when a measured gap shows up in your own eval, not before.
Try it yourself
The complete, runnable mini-project (Pydantic ontology, hierarchical chunking, hybrid retrieval, both tiers, and the SEC EDGAR eval harness) is available on request.
Scaling
Scaling this is about keeping the filtered subset small and the guarantees intact as the corpus grows:
- Ingestion runs Docling over the source PDFs, denormalizes structured fields, and chunks documents with ParentDocumentRetriever, attaching provenance to every chunk. Batch insertion keeps large-corpus loads fast, and the field-value index is rebuilt so the planner always knows the true filter vocabulary.
- Pre-filtering before retrieval means cost scales with the subset the question targets, not the whole corpus: a cheap SQL WHERE does the expensive narrowing before either pgvector or BM25 runs.
- One Postgres, two indexes avoids a separate lexical engine to keep in sync: pgvector (HNSW for fresh shards, an IVF-PQ tier for the 10M-doc archive) and pg_search BM25 share the same ACID rows, and RRF fuses them in a single query.
- Bounded LangGraph loops cap retries and concurrency so latency stays predictable under load, with the slowest path (multi-source synthesis) reserved only for genuinely compound questions.
- Containerized for reproducible deployment of the retrieval service and its Postgres backend.
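The bounded retry behaviour can be sketched as a plain loop. This is not the LangGraph graph itself: `retrieve`, `grade`, and `rewrite` are hypothetical callables standing in for the retrieval, grade_documents, and rewrite_query nodes, and the attempt cap of three is an assumption.

```python
def answer_with_retries(question, retrieve, grade, rewrite, max_attempts=3):
    """Retrieve, grade, and reformulate for at most max_attempts, then abstain.

    The original question is kept for grading and final synthesis; only the
    working query is rewritten between attempts.
    """
    query = question
    for attempt in range(1, max_attempts + 1):
        context = retrieve(query)
        if grade(question, context):
            return {
                "status": "answer",
                "question": question,
                "context": context,
                "attempts": attempt,
            }
        query = rewrite(question, query)
    # Attempt budget spent without sufficient context: refuse rather than guess.
    return {
        "status": "abstain",
        "question": question,
        "context": [],
        "attempts": max_attempts,
    }
```

Because the loop has a hard cap, worst-case latency is a fixed multiple of one retrieval pass, and the abstain branch is always reachable.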
The result is a system that treats "answer this precisely and completely" as a hard requirement rather than a best effort: filter to the exact subset, verify before generating, and cite everything.