Azure-native legal contract intelligence · proof of concept · source on GitHub ↗

Contract Intelligence

contract review & clause comparison

A pipeline that ingests legal contracts, extracts structured metadata with page-level citations, answers natural-language questions, and compares clauses against a versioned gold standard — built so a legal team can trust every answer. SQL decides the facts; retrieval supplies the evidence; the model only explains what it can cite.

2runtimes — Azure & local docker-compose, one codebase
0LLM tokens for canonical reporting queries
~70%domain-agnostic substrate, reusable elsewhere
100%of answers carry document + page citations

01 — Orchestration

A router that knows when not to spend

Most real questions over a contract corpus are old-school structured queries — “show me / list / how many”. Those should never cost an LLM call. The router resolves intent in layers, paying for language understanding only when the question actually needs it.

Tier 1 — deterministic

Rule-routed

A tiny regex shortcut catches canonical reporting phrasings and goes straight to SQL.

0 tokens · ~8 ms · $0

Tier 2 — SLM-first

LLM-routed → reporting

No rule match? A small classifier reads the intent — and can still land on the cheap SQL path for non-canonical phrasings like “show me expired”.

~600 router tokens → SQL

Tier 3 — full reasoning

Search & comparison

Genuine content questions route to RAG or a clause-vs-gold diff, where the token spend buys real work.

embeddings + reasoning LLM

The badge row under every answer makes the tier visible — routing, confidence, data sources, tokens, latency, cost.

Rule-routed reporting query — **Rule-routed reporting.** “Show me all NDA contracts” → SQL only, ~8 ms, no token or cost badge at all.

LLM-routed reporting fallback — **LLM-routed fallback.** “show me expired” has no canonical keyword — the SLM classifies it and still resolves to the structured SQL path.

Search / RAG answer with citations — **Search / RAG.** A grounded answer with a single citations block — quoted clause text, document and page.

Clause comparison against gold — **Clause comparison.** Indemnity vs. the gold standard — material differences and a conclusion, grounded in the contract.

02 — The application

What a reviewer actually sees

A pro-code web surface — full control over routing, citations, comparison, and the human-in-the-loop workflow. Click any frame to view it full size.

Select contracts from a result grid — **Compare from the grid.** Tick contracts in any result set and send the selection to a side-by-side gold-clause comparison.

Clause-vs-gold difference — **Clause vs. gold.** Per clause type: a difference summary, material differences, conclusion, and the verbatim contract and gold text.

Typed clauses with risk badges — **Typed clauses.** Sections mapped to a canonical taxonomy, each with an LLM-assigned risk level — high reserved for what warrants legal review.

Extracted obligations — **Obligations.** Discrete commitments split by time semantics — fixed dates, recurring cadences, and event triggers.

Human-in-the-loop review queue — **Human-in-the-loop review.** A queue filtered across three orthogonal axes — lifecycle status, review workflow, and risk flags — with per-field correction that writes through to the source of truth.

Versioned gold clause library — **Gold clause library.** The approved, versioned reference set every contract clause is measured against.

Contracts grid with risk flags — **Contracts grid.** The whole corpus at a glance — status, review state, and set-valued risk flags.

OpenAPI / Swagger docs — **Typed API surface.** Query, reporting, comparison, and the full review workflow — documented as OpenAPI 3.1.

03 — Ingestion

Understand once, at write-time

The expensive AI work — layout parsing, metadata / clause / obligation extraction with per-field confidence, and embeddings — runs once as a document lands, populating a structured metadata database. Query time then reads that pre-built structure cheaply, instead of re-deriving facts from raw documents on every request.

→ blob

Upload

Contract lands in storage

layout

Page-tagged text

Document layout extraction

llm

Structured fields

Metadata · clauses · obligations + confidence

embed

Vectors

Summary & per-clause embeddings

sql + index

Source of truth

SQL upsert + two vector indexes

Extracted contract metadata — **Extracted metadata.** The fields that flow into SQL reporting — each with a confidence score and inheritance notes for sibling documents.

Append-only extraction audit — **Append-only lineage.** Every stage that touches a field writes its own row — LLM, validator, or a human correction. Nothing is overwritten, so the reviewer history survives.

04 — Architecture

One codebase, two runtimes

Profile branching lives entirely in client factories, so the pipeline, router, and API never ask which environment they’re in. The production profile is Azure-native and managed-identity throughout; the local profile is a full docker-compose stack with zero cloud dependencies.

Capability	Azure (production)	Local (docker-compose)
Documents	ADLS Gen2 / Blob	Azurite
Layout	Document Intelligence (prebuilt-layout)	unstructured.io
Extraction & reasoning	Azure OpenAI · gpt-4.1-mini / gpt-4.1	Ollama · qwen3
Embeddings	text-embedding-3-small (1536-d)	mxbai-embed-large (1024-d)
Source of truth	Azure SQL (serverless)	SQL Server 2022
Retrieval	AI Search Basic — 2 indexes, semantic ranker	Qdrant — vector + filter
Ingestion trigger	Event Grid → Functions (push)	Polling watcher (~5 s)
Query API	Azure Functions	FastAPI

05 — Evidence

A measurable playground

The local stack plus a synthetic corpus and an eval harness make this a sandbox for the full spread of questions a user actually asks — and a way to measure whether the answers hold up.

16synthetic contracts — clean baselines, single-clause deviations, missing-field cases
9approved gold clause types in the reference library
25golden Q&A across all five intent paths
high‑90s%field-extraction accuracy on the local stack

The model is first-draft data; the reviewer is the source of truth. Every human correction is also a labeled example — the audit trail doubles as a growing evaluation set.

06 — Generalization

The same machine, other domains

Strip away the legal vocabulary and what remains is a generic shape: ingest documents → extract structured fields and chunks → embed and index → answer via a router over RAG and structured lookup → compare snippets against a curated reference set. Roughly two-thirds of the code never knew it was about contracts.

Reused as-is

Profile-aware client factories — same Azure / local seam
Ingestion pipeline shape — idempotent hash-based upsert
Router + handler dispatch and the SQL builder
Vector-search abstraction — two indexes, filtered purge
Structured-output prompting, coercions, OpenAPI
Eval harness, both infra stacks, the audit pattern

Swapped per domain

Entity schema — the tables and what’s extracted
Extraction prompt & JSON schema
Router rules & intents, the filter parser
Reporting SQL projection and labels
Comparison-to-gold reference set
Synthetic corpus + manifest for the new domain

~75% reuse

Sales call notes

Action items, sentiment, next steps — measured against stage playbooks instead of gold clauses.

~70% reuse

Survey free-text

Themes, sentiment, risk flags against a canonical theme taxonomy; adds a trend / aggregation intent.

~65% reuse

Support transcripts

Resolution and escalation risk against an agent-behaviour rubric; the clauses index becomes a turns index.

Estimated effort to repurpose the substrate for a new domain: roughly two to three person-weeks.