Conversational AI-compliance workbench · proof of concept · source on GitHub ↗

AI Compliance NLWeb

ask the world's AI rules — and let agents ask too

An NLWeb-style layer over a curated, 100% open-access corpus of the world's AI rules and standards — from the EU AI Act and GDPR to the NIST AI RMF, ISO/IEC 42001, and national frameworks across five regulatory tiers. One retrieval + answer core, exposed two ways: /ask for people (structured JSON) and /mcp for agents (MCP tools). It is RAG-only and accuracy-first — hybrid retrieval and a reranker surface the evidence, the model only explains what it can cite, and every claim carries a [framework §section, p.N] citation. It runs fully on a local Docker stack or on Azure.

48open-access documents · 32 frameworks across five tiers
36/36golden-QA pass-rate on the local stack (qwen3:14b)
100%of answers carry framework + section + page citations
2runtimes — local docker-compose & Azure, one codebase

01 — The NLWeb contract

One core, two contracts

The whole system is a single retrieval+answer core, exposed through two thin adapters that share the same request payload and never fork their logic. People get structured JSON shaped like a Schema.org ItemList; agents get the same core as Model Context Protocol tools. A mode flag decides how much work to do — and whether the language model runs at all.

For people · POST /ask

`/ask`

A grounded, cited answer as JSON — answer, ranked sources[], an additive Schema.org item_list, plus confidence, intent, scope and a token ledger.

also POST /ask/stream — SSE token streaming

For agents · GET/POST /mcp

`/mcp`

The same core as an MCP server over JSON-RPC (protocol 2025-11-25). Tools an agent can call directly:

ask_compliance — grounded, cited Q&A
list_frameworks · get_framework
prompt: compare_jurisdictions

mode gates the spend: list retrieval only — no model call · summarize one citation-enforced synthesis pass · generate a longer drafted artifact.

02 — Agents

An ask-endpoint agents can call

NLWeb's premise is that every site will soon expose an /ask endpoint — and that agents will want to call it. Here the MCP server is mounted on the same FastAPI app and runs the identical pipeline a human query does, so an agent's ask_compliance call returns the same grounded, cited answer the UI shows — including the Schema.org item_list projection.

MCP method	Returns
initialize	Server info + capabilities; echoes the client's protocol version
tools/list	The three tools with descriptions + input schemas
tools/call · ask_compliance	The full grounded answer (sources + `item_list` + token usage)
prompts/list · prompts/get	`compare_jurisdictions` — a templated comparison prompt

Both endpoints enforce the same security: token scopes (ask:read, mcp:invoke), per-token/IP rate limiting, a locked CORS allow-list, and an audit log of every call.

03 — Retrieval & ranking

Hybrid retrieval, then a reranker decides

Dense vectors catch meaning; sparse keywords catch the exact statute number. The pipeline runs both, fuses them, then lets a cross-encoder reranker make the final call — so the eight chunks an answer cites are the most relevant of the candidate pool, not just the first eight a single index returned.

query

Decontextualize

Condense multi-turn → standalone query; detect scope

dense + sparse

Hybrid search

Embeddings (1024-d) + BM25 keywords, jurisdiction-filtered

rrf

Fuse → 32

Reciprocal-rank fusion of both result sets

cross-encoder

Rerank

bge-reranker-base scores all 32 for relevance

top-8

Cited sources

The evidence the model is allowed to use

When a question names a specific framework, retrieval steers to that document — “GDPR requirements for AI?” searches the GDPR, not whichever EU text happens to say “AI” most often. Reranking is on by default; an env toggle turns it off for A/B comparison.

04 — The application

What a user actually sees

A focused chat surface over the corpus: a framework explorer that scopes retrieval, answers that stream in token-by-token with a live citation cursor, a sources panel, and a metrics bar under every reply. Click any frame to view it full size.

The empty-state landing view — **The landing view.** Quick-start questions, the mode selector, and the framework explorer — the whole surface fits one focused column.

The corpus framework explorer — **Corpus explorer.** All 48 documents grouped by tier; selecting jurisdictions narrows the retrieval scope for the next question.

A grounded answer with inline citations and a sources panel — **Grounded, cited answer.** Streamed token-by-token, with inline `[framework §section, p.N]` markers and a sources panel listing the ranked, reranked chunks.

A cross-jurisdiction comparison answer — **Compare jurisdictions.** A comparison-intent query scopes to multiple jurisdictions and contrasts how each regulates a topic — each point still cited.

A multi-turn follow-up, interpreted — **Multi-turn.** A short follow-up is condensed into a standalone query (shown as “↳ interpreted as…”) so retrieval keeps the topic.

OpenAPI / Swagger docs — **Typed API surface.** `/ask`, `/ask/stream`, `/mcp`, `/corpus` and `/health` — documented as OpenAPI at `/docs`.

05 — Architecture

One codebase, two runtimes

Profile branching lives entirely in client factories, so the pipeline, router and API never ask which environment they're in. The local profile is a full docker-compose stack with zero cloud dependencies; the cloud profile is Azure-native and managed-identity throughout, provisioned by Bicep. A store's embedding dimension is immutable, so each profile is ingested independently into its own matched stack.

Capability	azure (production)	local (docker-compose)
Layout / PDF	Azure Document Intelligence (prebuilt-layout)	unstructured.io
Answers	Azure OpenAI · gpt-4.1	Ollama · qwen3:14b
Dense embeddings	text-embedding-3-small (1536-d)	mxbai-embed-large (1024-d)
Sparse / keyword	BM25 keyword index	BM25 (fastembed)
Hybrid fusion	Hybrid RRF (AI Search)	Reciprocal-rank fusion (RRF)
Reranker	AI Search semantic ranker	bge-reranker-base · cross-encoder (TEI)
Vector store	Azure AI Search — 2 indexes	Qdrant — 2 collections
API	Azure Container Apps	FastAPI
Web	Azure Static Web App	Vite dev / nginx
Auth	Managed identity (DefaultAzureCredential)	dev bearer token
Infra-as-code	Bicep	docker-compose

Runs on GPU locally — the compose ollama service reserves an NVIDIA device. A deterministic mock backend powers tests and offline UI work with no model server at all.

06 — Evidence

Every answer shows its work

Trust in a regulatory tool comes from transparency, not confidence theatre. Under every reply is a metrics bar that exposes the whole decision: how sure the system is, which model ran, how many tokens it cost, how long it took, and how retrieval narrowed the corpus to the cited eight.

What are high-risk AI systems under the EU AI Act?

High-risk AI systems are those listed in Annex III or used as a safety component of regulated products… subject to risk management, data governance, transparency, human oversight, and conformity assessment [EU AI Act §Art. 6, p.4].

⭐ high confidence qwen3:14b· ▲ 2.1k▼ 820⋯ 9 tok· 1.2 s· 32 → 8 (rrf+rerank)

36/36golden-QA pass-rate + retrieval hit-rate (local · qwen3:14b)
100%intent-classification accuracy across every path
97%mean term coverage across the answer set
SSEtoken streaming with accurate end-of-stream usage

A token ledger prices every call; an audit log records the principal, intent, sources, usage and latency of each query. Neither can break the answer path — they swallow their own errors and only log. When the evidence is insufficient, the system says so rather than guessing.