Conversational AI-compliance workbench · proof of concept · source on GitHub

AI Compliance NLWeb

ask the world's AI rules — and let agents ask too

An NLWeb-style layer over a curated, 100% open-access corpus of the world's AI rules and standards — from the EU AI Act and GDPR to the NIST AI RMF, ISO/IEC 42001, and national frameworks across five regulatory tiers. One retrieval + answer core, exposed two ways: /ask for people (structured JSON) and /mcp for agents (MCP tools). It is RAG-only and accuracy-first — hybrid retrieval and a reranker surface the evidence, the model only explains what it can cite, and every claim carries a [framework §section, p.N] citation. It runs fully on a local Docker stack or on Azure.

01 — The NLWeb contract

One core, two contracts

The whole system is a single retrieval+answer core, exposed through two thin adapters that share the same request payload and never fork their logic. People get structured JSON shaped like a Schema.org ItemList; agents get the same core as Model Context Protocol tools. A mode flag decides how much work to do — and whether the language model runs at all.

For people · POST /ask

/ask

A grounded, cited answer as JSON — answer, ranked sources[], an additive Schema.org item_list, plus confidence, intent, scope and a token ledger.

also POST /ask/stream — SSE token streaming
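As a rough illustration of the response shape described above, the sketch below assembles an /ask-style payload and projects the ranked sources into a Schema.org ItemList. The top-level keys follow the description; the Source record, its fields, and the build_response helper are assumptions made for illustration, not the project's actual schema.

```python
from dataclasses import dataclass
from typing import Any, Dict, List


@dataclass
class Source:
    # Hypothetical source record; the real field set is not given in the text.
    framework: str
    section: str
    page: int


def to_item_list(sources: List[Source]) -> Dict[str, Any]:
    """Project ranked sources into a Schema.org ItemList with 1-based positions."""
    return {
        "@context": "https://schema.org",
        "@type": "ItemList",
        "itemListElement": [
            {
                "@type": "ListItem",
                "position": position,
                "name": f"{s.framework} §{s.section}, p.{s.page}",
            }
            for position, s in enumerate(sources, start=1)
        ],
    }


def build_response(answer: str, sources: List[Source]) -> Dict[str, Any]:
    """Assemble the payload; item_list is additive, so sources is left untouched."""
    return {
        "answer": answer,
        "sources": [vars(s) for s in sources],
        "item_list": to_item_list(sources),
    }


example = build_response(
    "placeholder answer text",
    [Source("EU AI Act", "Art. 6", 4)],
)
```

Keeping item_list as a projection of sources, rather than a second data path, is one way to make sure the two views cannot drift apart.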

For agents · GET/POST /mcp

/mcp

The same core as an MCP server over JSON-RPC (protocol 2025-11-25). Tools an agent can call directly:

  • ask_compliance — grounded, cited Q&A
  • list_frameworks · get_framework
  • prompt: compare_jurisdictions

mode gates the spend:

  • list: retrieval only, no model call
  • summarize: one citation-enforced synthesis pass
  • generate: a longer drafted artifact
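A minimal sketch of how a mode flag could gate the model call, assuming retrieval always runs and synthesis is skipped for list. The three mode values come from the text; the function names, the Answer type, and the stand-in retrieval and model functions are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional

VALID_MODES = ("list", "summarize", "generate")


@dataclass
class Answer:
    mode: str
    sources: List[str]
    text: Optional[str] = None  # stays None when no model call is made
    model_calls: int = 0


def answer_query(
    query: str,
    mode: str,
    retrieve: Callable[[str], List[str]],
    synthesize: Callable[[str, List[str], bool], str],
) -> Answer:
    """Always retrieve; call the model only when the mode asks for it."""
    if mode not in VALID_MODES:
        raise ValueError(f"unknown mode: {mode!r}")

    sources = retrieve(query)
    if mode == "list":
        # Retrieval only: the language model never runs.
        return Answer(mode=mode, sources=sources)

    # "summarize" is one short pass; "generate" drafts a longer artifact.
    long_form = mode == "generate"
    text = synthesize(query, sources, long_form)
    return Answer(mode=mode, sources=sources, text=text, model_calls=1)


# Stand-ins so the sketch runs without a search index or a model server.
def fake_retrieve(query: str) -> List[str]:
    return ["source-a", "source-b"]


def fake_synthesize(query: str, sources: List[str], long_form: bool) -> str:
    kind = "draft" if long_form else "summary"
    return f"{kind} of {len(sources)} sources"


listed = answer_query("demo question", "list", fake_retrieve, fake_synthesize)
summarized = answer_query("demo question", "summarize", fake_retrieve, fake_synthesize)
generated = answer_query("demo question", "generate", fake_retrieve, fake_synthesize)
```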

02 — Agents

An ask-endpoint agents can call

NLWeb's premise is that every site will soon expose an /ask endpoint — and that agents will want to call it. Here the MCP server is mounted on the same FastAPI app and runs the identical pipeline a human query does, so an agent's ask_compliance call returns the same grounded, cited answer the UI shows — including the Schema.org item_list projection.

MCP method                   Returns
initialize                   Server info + capabilities; echoes the client's protocol version
tools/list                   The three tools with descriptions + input schemas
tools/call · ask_compliance  The full grounded answer (sources + item_list + token usage)
prompts/list · prompts/get   compare_jurisdictions, a templated comparison prompt
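For orientation, here is a sketch of the JSON-RPC 2.0 messages an agent might send for the initialize and tools/call methods above. The envelope and the name/arguments fields follow the MCP specification; the keys inside arguments (query, mode) and the client name are assumptions, since the tool's input schema is not shown here.

```python
import json
from typing import Any, Dict


def jsonrpc_request(request_id: int, method: str, params: Dict[str, Any]) -> Dict[str, Any]:
    """Wrap a method call in a JSON-RPC 2.0 request envelope."""
    return {"jsonrpc": "2.0", "id": request_id, "method": method, "params": params}


# Handshake: the client proposes a protocol version, which the server echoes.
initialize = jsonrpc_request(
    1,
    "initialize",
    {
        "protocolVersion": "2025-11-25",
        "capabilities": {},
        "clientInfo": {"name": "example-agent", "version": "0.1.0"},  # hypothetical client
    },
)

# Tool call: "name" and "arguments" are MCP fields; the argument keys are assumed.
ask = jsonrpc_request(
    2,
    "tools/call",
    {
        "name": "ask_compliance",
        "arguments": {
            "query": "What are high-risk AI systems under the EU AI Act?",
            "mode": "summarize",
        },
    },
)

body = json.dumps(ask)  # the string an agent would POST to /mcp
```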

Both endpoints enforce the same security: token scopes (ask:read, mcp:invoke), per-token/IP rate limiting, a locked CORS allow-list, and an audit log of every call.
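The scope and rate-limit checks could look roughly like the sketch below. The scope names come from the text; mapping them by path prefix, the fixed-window strategy, and every function and class name are assumptions, and the project may implement both checks quite differently.

```python
from typing import Dict, Iterable, Tuple

# Scope names are from the text; the path-prefix mapping is an assumption.
REQUIRED_SCOPE: Dict[str, str] = {"/ask": "ask:read", "/mcp": "mcp:invoke"}


def is_authorized(path: str, token_scopes: Iterable[str]) -> bool:
    """Return True when the token holds the scope its endpoint requires."""
    granted = set(token_scopes)
    for prefix, scope in REQUIRED_SCOPE.items():
        if path == prefix or path.startswith(prefix + "/"):
            return scope in granted
    return False  # unknown paths are denied by default


class FixedWindowLimiter:
    """Allow at most `limit` calls per key (token or IP) in each time window."""

    def __init__(self, limit: int, window_seconds: float) -> None:
        self.limit = limit
        self.window = window_seconds
        self._state: Dict[str, Tuple[float, int]] = {}  # key -> (window start, count)

    def allow(self, key: str, now: float) -> bool:
        start, count = self._state.get(key, (now, 0))
        if now - start >= self.window:
            start, count = now, 0  # the previous window has expired
        if count >= self.limit:
            return False
        self._state[key] = (start, count + 1)
        return True


# Two calls per 60 seconds: the third is refused, and the window then resets.
demo_limiter = FixedWindowLimiter(limit=2, window_seconds=60.0)
decisions = [demo_limiter.allow("token-a", t) for t in (0.0, 1.0, 2.0, 61.0)]
```

Passing the clock in as an argument keeps the limiter deterministic under test; a real service would more likely read a monotonic clock.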

03 — Retrieval & ranking

Hybrid retrieval, then a reranker decides

Dense vectors catch meaning; sparse keywords catch the exact statute number. The pipeline runs both, fuses them, then lets a cross-encoder reranker make the final call — so the eight chunks an answer cites are the most relevant of the candidate pool, not just the first eight a single index returned.

  • Decontextualize (query): condense multi-turn → standalone query; detect scope
  • Hybrid search (dense + sparse): embeddings (1024-d) + BM25 keywords, jurisdiction-filtered
  • Fuse → 32 (rrf): reciprocal-rank fusion of both result sets
  • Rerank (cross-encoder): bge-reranker-base scores all 32 for relevance
  • Cited sources (top-8): the evidence the model is allowed to use

When a question names a specific framework, retrieval steers to that document — “GDPR requirements for AI?” searches the GDPR, not whichever EU text happens to say “AI” most often. Reranking is on by default; an env toggle turns it off for A/B comparison.
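The fuse-then-rerank steps can be sketched in a few lines. This is a generic reciprocal-rank fusion under stated assumptions: k=60 is the constant commonly used with RRF and is not confirmed by the text, and the relevance table is a stand-in for a real cross-encoder such as bge-reranker-base.

```python
from collections import defaultdict
from typing import Callable, Dict, List, Sequence


def rrf_fuse(rankings: Sequence[Sequence[str]], k: int = 60, limit: int = 32) -> List[str]:
    """Fuse ranked id lists: each list adds 1 / (k + rank) to a document's score."""
    scores: Dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by id so the order is deterministic.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))[:limit]


def rerank(candidates: Sequence[str], score: Callable[[str], float], top_n: int = 8) -> List[str]:
    """Keep the top_n candidates by reranker score, highest first."""
    return sorted(candidates, key=score, reverse=True)[:top_n]


# Toy result sets standing in for the dense and sparse indexes.
dense_hits = ["a", "b", "c"]
sparse_hits = ["c", "a", "d"]
fused = rrf_fuse([dense_hits, sparse_hits])

# Stand-in relevance scores; a real pipeline would score (query, chunk) pairs.
relevance = {"a": 0.1, "b": 0.9, "c": 0.5, "d": 0.2}
top = rerank(fused, relevance.__getitem__, top_n=2)
```

In this toy run, "a" and "c" lead the fused list because both indexes returned them, but the reranker promotes "b", which is the sense in which the reranker makes the final call.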

04 — The application

What a user actually sees

A focused chat surface over the corpus: a framework explorer that scopes retrieval, answers that stream in token-by-token with a live citation cursor, a sources panel, and a metrics bar under every reply.

05 — Architecture

One codebase, two runtimes

Profile branching lives entirely in client factories, so the pipeline, router and API never ask which environment they're in. The local profile is a full docker-compose stack with zero cloud dependencies; the cloud profile is Azure-native and managed-identity throughout, provisioned by Bicep. A store's embedding dimension is immutable, so each profile is ingested independently into its own matched stack.
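One way to keep profile branching inside client factories is sketched below. The embedding dimensions are the ones this page lists for each profile; the class names, make_embedder, and store_matches are hypothetical, and the embedders return placeholder vectors instead of calling a model.

```python
from typing import List


class LocalEmbedder:
    """Stand-in for the local embedding model (1024-d)."""

    dimension = 1024

    def embed(self, text: str) -> List[float]:
        return [0.0] * self.dimension  # placeholder vector, no model call


class AzureEmbedder:
    """Stand-in for the Azure embedding model (1536-d)."""

    dimension = 1536

    def embed(self, text: str) -> List[float]:
        return [0.0] * self.dimension  # placeholder vector, no network call


_FACTORIES = {"local": LocalEmbedder, "azure": AzureEmbedder}


def make_embedder(profile: str):
    """The only place that looks at the profile name."""
    try:
        return _FACTORIES[profile]()
    except KeyError:
        raise ValueError(f"unknown profile: {profile!r}") from None


def store_matches(embedder, store_dimension: int) -> bool:
    """A store's dimension is fixed at creation, so it must equal the embedder's."""
    return embedder.dimension == store_dimension


def embed_query(embedder, text: str) -> List[float]:
    """Pipeline code: works with any embedder and never asks which profile is active."""
    return embedder.embed(text)


# In practice the profile name would likely come from configuration or an env var.
embedder = make_embedder("local")
vector = embed_query(embedder, "high-risk AI systems")
```

The dimension check illustrates why each profile needs its own ingested store: vectors from a 1024-d model cannot be written into a 1536-d index.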

Capability        azure (production)                             local (docker-compose)
Layout / PDF      Azure Document Intelligence (prebuilt-layout)  unstructured.io
Answers           Azure OpenAI · gpt-4.1                         Ollama · qwen3:14b
Dense embeddings  text-embedding-3-small (1536-d)                mxbai-embed-large (1024-d)
Sparse / keyword  BM25 keyword index                             BM25 (fastembed)
Hybrid fusion     Hybrid RRF (AI Search)                         Reciprocal-rank fusion (RRF)
Reranker          AI Search semantic ranker                      bge-reranker-base · cross-encoder (TEI)
Vector store      Azure AI Search (2 indexes)                    Qdrant (2 collections)
API               Azure Container Apps                           FastAPI
Web               Azure Static Web App                           Vite dev / nginx
Auth              Managed identity (DefaultAzureCredential)      dev bearer token
Infra-as-code     Bicep                                          docker-compose

Runs on GPU locally — the compose ollama service reserves an NVIDIA device. A deterministic mock backend powers tests and offline UI work with no model server at all.

06 — Evidence

Every answer shows its work

Trust in a regulatory tool comes from transparency, not confidence theatre. Under every reply is a metrics bar that exposes the whole decision: how sure the system is, which model ran, how many tokens it cost, how long it took, and how retrieval narrowed the corpus to the cited eight.

What are high-risk AI systems under the EU AI Act?

High-risk AI systems are those listed in Annex III or used as a safety component of regulated products… subject to risk management, data governance, transparency, human oversight, and conformity assessment [EU AI Act §Art. 6, p.4].

⭐ high confidence · qwen3:14b · ▲ 2.1k ▼ 820 ⋯ 9 tok · 1.2 s · 32 → 8 (rrf+rerank)
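The [framework §section, p.N] citation format in the sample answer is regular enough to check mechanically. The parser below is a sketch under that assumption: the pattern and the helper names are illustrative, and it does not handle section labels that themselves contain a comma.

```python
import re
from typing import List, NamedTuple


class Citation(NamedTuple):
    framework: str
    section: str
    page: int


# Matches e.g. "[EU AI Act §Art. 6, p.4]": framework, then §section, then p.N.
CITATION_RE = re.compile(
    r"\[\s*(?P<framework>[^\[\]§]+?)\s*§\s*(?P<section>[^,\]]+?)\s*,\s*p\.\s*(?P<page>\d+)\s*\]"
)


def extract_citations(text: str) -> List[Citation]:
    """Return every well-formed citation found in the text, in order."""
    return [
        Citation(m["framework"], m["section"], int(m["page"]))
        for m in CITATION_RE.finditer(text)
    ]


def is_cited(text: str) -> bool:
    """True when the text carries at least one well-formed citation."""
    return CITATION_RE.search(text) is not None


sample = "High-risk systems are subject to conformity assessment [EU AI Act §Art. 6, p.4]."
found = extract_citations(sample)
```

A check like is_cited could be applied per sentence to flag uncited claims before an answer is returned.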

A token ledger prices every call; an audit log records the principal, intent, sources, usage and latency of each query. Neither can break the answer path — they swallow their own errors and only log. When the evidence is insufficient, the system says so rather than guessing.
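The "swallow their own errors and only log" behaviour can be sketched as a wrapper around the audit write. The event fields and every name here are hypothetical; the point is only that a failing sink never propagates into the answer path.

```python
import logging
import time
from typing import Any, Callable, Dict, List

logger = logging.getLogger("audit")

AuditSink = Callable[[Dict[str, Any]], None]


def record_safely(sink: AuditSink, event: Dict[str, Any]) -> bool:
    """Write one audit event; on failure, log and return False instead of raising."""
    try:
        sink(event)
        return True
    except Exception:
        logger.exception("audit sink failed; the answer path continues")
        return False


def answer_with_audit(answer_fn: Callable[[str], str], query: str, sink: AuditSink) -> str:
    """Produce the answer first, then attempt the audit write without risking it."""
    started = time.perf_counter()
    answer = answer_fn(query)
    latency_s = time.perf_counter() - started
    record_safely(sink, {"query": query, "latency_s": latency_s})
    return answer


def broken_sink(event: Dict[str, Any]) -> None:
    # Simulates an audit store that is down.
    raise RuntimeError("simulated storage outage")


def demo_answer(query: str) -> str:
    return "grounded answer"  # stand-in for the real pipeline


recorded: List[Dict[str, Any]] = []
ok_answer = answer_with_audit(demo_answer, "demo question", recorded.append)
still_answers = answer_with_audit(demo_answer, "demo question", broken_sink)
```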