Squash of branch refactor/phase0-guardrails-and-models-split — 4 commits,
81 files, 173/173 pytest green, OpenAPI contract preserved (360 paths /
484 operations).
## Phase 0 — Architecture guardrails
Three defense-in-depth layers to keep the architecture rules enforced
regardless of who opens Claude Code in this repo:
1. .claude/settings.json — a PreToolUse hook on Write/Edit blocks any
edit that would push a file past the 500-line hard cap. Auto-loads in
every Claude session in this repo.
2. scripts/githooks/pre-commit (install via scripts/install-hooks.sh)
enforces the LOC cap locally, freezes migrations/ without
[migration-approved], and protects guardrail files without
[guardrail-change].
3. .gitea/workflows/ci.yaml gains loc-budget + guardrail-integrity +
sbom-scan (syft+grype) jobs, adds mypy --strict for the new Python
packages (compliance/{services,repositories,domain,schemas}), and
tsc --noEmit for admin-compliance + developer-portal.
Per-language conventions documented in AGENTS.python.md, AGENTS.go.md,
AGENTS.typescript.md at the repo root — layering, tooling, and explicit
"what you may NOT do" lists. Root CLAUDE.md is prepended with the six
non-negotiable rules. Each of the 10 services gets a README.md.
scripts/check-loc.sh enforces soft 300 / hard 500 and surfaces the
current baseline of 205 hard + 161 soft violations so Phases 1-4 can
drain it incrementally. CI gates only CHANGED files in PRs so the
legacy baseline does not block unrelated work.
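The soft/hard cap logic can be sketched in Python for clarity (the real scripts/check-loc.sh is a shell script; the function names here are illustrative only):

```python
from pathlib import Path

SOFT_CAP = 300  # warn above this
HARD_CAP = 500  # fail above this

def classify(line_count: int) -> str:
    """Bucket a file's line count against the soft/hard caps."""
    if line_count > HARD_CAP:
        return "hard"
    if line_count > SOFT_CAP:
        return "soft"
    return "ok"

def check_files(paths) -> dict:
    """Return {path: bucket} for every file over a cap."""
    violations = {}
    for p in paths:
        n = sum(1 for _ in Path(p).open(encoding="utf-8"))
        bucket = classify(n)
        if bucket != "ok":
            violations[str(p)] = bucket
    return violations
```

Gating only changed files in PRs then amounts to running this over `git diff --name-only` output rather than the whole tree.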
## Deprecation sweep
47 files touched. Pydantic V1 regex= -> pattern= (2 sites), class Config ->
ConfigDict in source_policy_router.py (schemas.py intentionally skipped;
it is the Phase 1 Step 3 split target). datetime.utcnow() ->
datetime.now(timezone.utc) everywhere, including SQLAlchemy default=
callables. All DB columns already declare timezone=True, so this is a
latent-bug fix on the Python side, not a schema change.
DeprecationWarning count dropped from 158 to 35.
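The swap matters because utcnow() returns a naive datetime while now(timezone.utc) returns an aware one; with timezone=True columns, the aware form is the correct value to hand the driver. A minimal before/after:

```python
from datetime import datetime, timezone

# Deprecated (Python 3.12 emits DeprecationWarning): returns a NAIVE datetime.
naive = datetime.utcnow()
assert naive.tzinfo is None

# Replacement: timezone-aware UTC datetime.
aware = datetime.now(timezone.utc)
assert aware.tzinfo is timezone.utc

# In SQLAlchemy column defaults, pass a callable, not a value:
#   Column(DateTime(timezone=True), default=lambda: datetime.now(timezone.utc))
```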
## Phase 1 Step 1 — Contract test harness
tests/contracts/test_openapi_baseline.py diffs the live FastAPI /openapi.json
against tests/contracts/openapi.baseline.json on every test run. Fails on
removed paths, removed status codes, or new required request body fields.
Regenerate only via tests/contracts/regenerate_baseline.py after an
intentional contract change whose consumers have been updated. This is
the safety harness for all subsequent refactor commits.
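A minimal sketch of the kind of diff the harness performs (function name and break messages are illustrative, not the real test's API):

```python
def diff_openapi(baseline: dict, live: dict) -> list[str]:
    """Return contract breaks: removed paths, removed operations,
    removed status codes. New paths/operations/codes are allowed."""
    breaks = []
    for path, ops in baseline.get("paths", {}).items():
        live_ops = live.get("paths", {}).get(path)
        if live_ops is None:
            breaks.append(f"removed path: {path}")
            continue
        for method, op in ops.items():
            live_op = live_ops.get(method)
            if live_op is None:
                breaks.append(f"removed operation: {method.upper()} {path}")
                continue
            for code in op.get("responses", {}):
                if code not in live_op.get("responses", {}):
                    breaks.append(f"removed status {code}: {method.upper()} {path}")
    return breaks
```

The real test additionally flags new required request-body fields, which needs a walk over the requestBody schemas rather than the responses map.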
## Phase 1 Step 2 — models.py split (1466 -> 85 LOC shim)
compliance/db/models.py is decomposed into seven sibling aggregate modules
following the existing repo pattern (dsr_models.py, vvt_models.py, ...):
- regulation_models.py (134) — Regulation, Requirement
- control_models.py (279) — Control, Mapping, Evidence, Risk
- ai_system_models.py (141) — AISystem, AuditExport
- service_module_models.py (176) — ServiceModule, ModuleRegulation, ModuleRisk
- audit_session_models.py (177) — AuditSession, AuditSignOff
- isms_governance_models.py (323) — ISMSScope, Context, Policy, Objective, SoA
- isms_audit_models.py (468) — Finding, CAPA, MgmtReview, InternalAudit,
  AuditTrail, Readiness
models.py becomes an 85-line re-export shim in dependency order so
existing imports continue to work unchanged. Schema is byte-identical:
__tablename__, column definitions, relationship strings, back_populates,
cascade directives all preserved.
All new sibling files are under the 500-line hard cap; largest is
isms_audit_models.py at 468. No file in compliance/db/ now exceeds
the hard cap.
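Illustratively, the shim follows the standard re-export pattern (module and class lists abridged here; the real file re-exports all seven modules in dependency order):

```python
# compliance/db/models.py — abridged sketch of the re-export shim.
# Imports are ordered so modules a relationship depends on load first;
# existing `from ..db.models import X` call sites keep working unchanged.
from .regulation_models import Regulation, Requirement
from .control_models import Control, Mapping, Evidence, Risk
from .ai_system_models import AISystem, AuditExport
# ... remaining aggregate modules in dependency order ...

__all__ = [
    "Regulation", "Requirement",
    "Control", "Mapping", "Evidence", "Risk",
    "AISystem", "AuditExport",
]
```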
## Phase 1 Step 3 — infrastructure only
backend-compliance/compliance/{schemas,domain,repositories}/ packages
are created as landing zones with docstrings. compliance/domain/
exports DomainError / NotFoundError / ConflictError / ValidationError /
PermissionError — the base classes services will use to raise
domain-level errors instead of HTTPException.
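A sketch of that hierarchy; the HTTP status mapping is an assumption about how routers will eventually translate these errors, not part of this commit:

```python
class DomainError(Exception):
    """Base class for errors that services raise instead of HTTPException."""

class NotFoundError(DomainError): pass
class ConflictError(DomainError): pass
class ValidationError(DomainError): pass
class PermissionError(DomainError): pass  # shadows the builtin inside the package

# Illustrative router-layer translation (status codes are an assumption):
STATUS_BY_ERROR = {
    NotFoundError: 404,
    ConflictError: 409,
    ValidationError: 422,
    PermissionError: 403,
}

def to_status(exc: DomainError) -> int:
    """Map a domain error to an HTTP status; unknown subclasses -> 500."""
    for err_type, status in STATUS_BY_ERROR.items():
        if isinstance(exc, err_type):
            return status
    return 500
```

This keeps services transport-agnostic: only the router layer knows about HTTP at all.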
PHASE1_RUNBOOK.md at backend-compliance/PHASE1_RUNBOOK.md documents
the nine-step execution plan for Phase 1: snapshot baseline,
characterization tests, split models.py (this commit), split schemas.py
(next), extract services, extract repositories, mypy --strict, coverage.
## Verification
backend-compliance/.venv-phase1: uv python install 3.12 + pip install -r requirements.txt
PYTHONPATH=. pytest compliance/tests/ tests/contracts/
-> 173 passed, 0 failed, 35 warnings, OpenAPI 360/484 unchanged
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
"""
|
|
FastAPI routes for RAG-based Requirement Extraction.
|
|
|
|
Endpoints:
|
|
- POST /compliance/extract-requirements-from-rag:
|
|
Searches ALL RAG collections (or a subset) for audit criteria / Prüfaspekte
|
|
and creates Requirement entries in the DB.
|
|
|
|
Design principles:
|
|
- Searches every relevant collection in parallel
|
|
- Deduplicates by (regulation_id, article) — never inserts twice
|
|
- Auto-creates Regulation stubs for unknown regulation_codes
|
|
- LLM-free by default (fast); optional LLM title extraction via ?use_llm=true
|
|
- dry_run=true returns what would be created without touching the DB
|
|
"""

import asyncio
import logging
import re
from datetime import datetime, timezone
from typing import Optional, List, Dict

from fastapi import APIRouter, Depends
from pydantic import BaseModel
from sqlalchemy.orm import Session

from classroom_engine.database import get_db
from ..db import RegulationRepository, RequirementRepository
from ..db.models import RegulationDB, RegulationTypeEnum
from ..services.rag_client import get_rag_client, RAGSearchResult

logger = logging.getLogger(__name__)

router = APIRouter(tags=["extraction"])


# ---------------------------------------------------------------------------
# Collections that may contain Prüfaspekte / audit criteria
# ---------------------------------------------------------------------------

ALL_COLLECTIONS = [
    "bp_compliance_ce",           # BSI-TR documents — primary Prüfaspekte source
    "bp_compliance_recht",        # Legal texts (GDPR, AI Act, ...)
    "bp_compliance_gesetze",      # German laws
    "bp_compliance_datenschutz",  # Data protection documents
    "bp_dsfa_corpus",             # DSFA corpus
    "bp_legal_templates",         # Legal templates
]

# Search queries targeting audit criteria across different document types
DEFAULT_QUERIES = [
    "Prüfaspekt Anforderung MUSS SOLL",
    "security requirement SHALL MUST",
    "compliance requirement audit criterion",
    "Sicherheitsanforderung technische Maßnahme",
    "data protection requirement obligation",
]


# ---------------------------------------------------------------------------
# Schemas
# ---------------------------------------------------------------------------

class ExtractionRequest(BaseModel):
    collections: Optional[List[str]] = None       # None = ALL_COLLECTIONS
    search_queries: Optional[List[str]] = None    # None = DEFAULT_QUERIES
    regulation_codes: Optional[List[str]] = None  # None = all regulations
    max_per_query: int = 20                       # top_k per search query
    dry_run: bool = False                         # if True: no DB writes


class ExtractedRequirement(BaseModel):
    regulation_code: str
    article: str
    title: str
    requirement_text: str
    source_url: str
    score: float
    action: str  # "created" | "would_create" | "skipped_duplicate"


class ExtractionResponse(BaseModel):
    created: int
    skipped_duplicates: int
    skipped_no_article: int
    failed: int
    collections_searched: List[str]
    queries_used: List[str]
    requirements: List[ExtractedRequirement]
    dry_run: bool
    message: str


# ---------------------------------------------------------------------------
# Helpers
# ---------------------------------------------------------------------------

BSI_ASPECT_RE = re.compile(r"\b([A-Z]\.[A-Za-z]+_\d+)\b")
TITLE_SENTENCE_RE = re.compile(r"^([^.!?\n]{10,120})[.!?\n]")


def _derive_title(text: str, article: str) -> str:
    """Extract a short title from RAG chunk text."""
    # Remove leading article reference if present
    cleaned = re.sub(r"^" + re.escape(article) + r"[:\s]+", "", text.strip(), flags=re.IGNORECASE)
    # Take first meaningful sentence
    m = TITLE_SENTENCE_RE.match(cleaned)
    if m:
        return m.group(1).strip()[:200]
    # Fallback: first 100 chars
    return cleaned[:100].strip() or article


def _normalize_article(result: RAGSearchResult) -> Optional[str]:
    """
    Return a canonical article identifier from the RAG result.
    Returns None if no meaningful article can be determined.
    """
    article = (result.article or "").strip()
    if article:
        return article

    # Try to find BSI Prüfaspekt pattern in chunk text
    m = BSI_ASPECT_RE.search(result.text)
    if m:
        return m.group(1)

    return None


async def _search_collection(
    collection: str,
    queries: List[str],
    max_per_query: int,
) -> List[RAGSearchResult]:
    """Run all queries against one collection and merge deduplicated results."""
    rag = get_rag_client()
    seen_texts: set[str] = set()
    results: List[RAGSearchResult] = []

    for query in queries:
        hits = await rag.search(query, collection=collection, top_k=max_per_query)
        for h in hits:
            key = h.text[:120]  # rough dedup key
            if key not in seen_texts:
                seen_texts.add(key)
                results.append(h)

    return results


def _get_or_create_regulation(
    db: Session,
    regulation_code: str,
    regulation_name: str,
) -> RegulationDB:
    """Return existing Regulation or create a stub."""
    repo = RegulationRepository(db)
    reg = repo.get_by_code(regulation_code)
    if reg:
        return reg

    # Auto-create a stub so Requirements can reference it
    logger.info("Auto-creating regulation stub: %s", regulation_code)
    # Infer type from code prefix
    if regulation_code.startswith("BSI"):
        reg_type = RegulationTypeEnum.BSI_STANDARD
    elif regulation_code in ("GDPR", "AI_ACT", "NIS2", "CRA"):
        reg_type = RegulationTypeEnum.EU_REGULATION
    else:
        reg_type = RegulationTypeEnum.INDUSTRY_STANDARD
    reg = repo.create(
        code=regulation_code,
        name=regulation_name or regulation_code,
        regulation_type=reg_type,
        description=f"Auto-created from RAG extraction ({datetime.now(timezone.utc).date()})",
    )
    return reg


def _build_existing_articles(db: Session, regulation_id: str) -> set[str]:
    """Return set of existing article strings for this regulation."""
    repo = RequirementRepository(db)
    existing = repo.get_by_regulation(regulation_id)
    return {r.article for r in existing}


# ---------------------------------------------------------------------------
# Extraction helpers — independently testable
# ---------------------------------------------------------------------------

def _parse_rag_results(
    all_results: List[RAGSearchResult],
    regulation_codes: Optional[List[str]] = None,
) -> dict:
    """
    Filter, deduplicate, and group RAG search results by regulation code.

    Returns a dict with:
    - deduped_by_reg: Dict[str, List[tuple[str, RAGSearchResult]]]
    - skipped_no_article: List[RAGSearchResult]
    - unique_count: int
    """
    # Filter by regulation_codes if requested
    if regulation_codes:
        all_results = [
            r for r in all_results
            if r.regulation_code in regulation_codes
        ]

    # Count unique (regulation_code, article) pairs across all results
    seen: set[tuple[str, str]] = set()
    unique_count = 0
    for r in sorted(all_results, key=lambda x: x.score, reverse=True):
        article = _normalize_article(r)
        if not article:
            continue
        key = (r.regulation_code, article)
        if key not in seen:
            seen.add(key)
            unique_count += 1

    # Group by regulation_code
    by_reg: Dict[str, List[tuple[str, RAGSearchResult]]] = {}
    skipped_no_article: List[RAGSearchResult] = []

    for r in all_results:
        article = _normalize_article(r)
        if not article:
            skipped_no_article.append(r)
            continue
        key_r = r.regulation_code or "UNKNOWN"
        by_reg.setdefault(key_r, []).append((article, r))

    # Deduplicate within groups, keeping the highest-scoring hit per article
    deduped_by_reg: Dict[str, List[tuple[str, RAGSearchResult]]] = {}
    for reg_code, items in by_reg.items():
        seen_articles: set[str] = set()
        deduped: List[tuple[str, RAGSearchResult]] = []
        for art, r in sorted(items, key=lambda x: x[1].score, reverse=True):
            if art not in seen_articles:
                seen_articles.add(art)
                deduped.append((art, r))
        deduped_by_reg[reg_code] = deduped

    return {
        "deduped_by_reg": deduped_by_reg,
        "skipped_no_article": skipped_no_article,
        "unique_count": unique_count,
    }


def _store_requirements(
    db: Session,
    deduped_by_reg: Dict[str, List[tuple[str, RAGSearchResult]]],
    dry_run: bool,
) -> dict:
    """
    Persist extracted requirements to the database (or simulate in dry_run mode).

    Returns a dict with:
    - created_count: int
    - skipped_dup_count: int
    - failed_count: int
    - result_items: List[ExtractedRequirement]
    """
    req_repo = RequirementRepository(db)
    created_count = 0
    skipped_dup_count = 0
    failed_count = 0
    result_items: List[ExtractedRequirement] = []

    for reg_code, items in deduped_by_reg.items():
        if not items:
            continue

        # Find or create regulation
        try:
            first_result = items[0][1]
            regulation_name = first_result.regulation_name or first_result.regulation_short or reg_code
            if dry_run:
                # For dry_run, fake a regulation id
                regulation_id = f"dry-run-{reg_code}"
                existing_articles: set[str] = set()
            else:
                reg = _get_or_create_regulation(db, reg_code, regulation_name)
                regulation_id = reg.id
                existing_articles = _build_existing_articles(db, regulation_id)
        except Exception as e:
            logger.error("Failed to get/create regulation %s: %s", reg_code, e)
            failed_count += len(items)
            continue

        for article, r in items:
            title = _derive_title(r.text, article)

            if article in existing_articles:
                skipped_dup_count += 1
                result_items.append(ExtractedRequirement(
                    regulation_code=reg_code,
                    article=article,
                    title=title,
                    requirement_text=r.text[:1000],
                    source_url=r.source_url,
                    score=r.score,
                    action="skipped_duplicate",
                ))
                continue

            if not dry_run:
                try:
                    req_repo.create(
                        regulation_id=regulation_id,
                        article=article,
                        title=title,
                        description=f"Extrahiert aus RAG-Korpus (Collection: {r.category or r.regulation_code}). Score: {r.score:.2f}",
                        requirement_text=r.text[:2000],
                        breakpilot_interpretation=None,
                        is_applicable=True,
                        priority=2,
                    )
                    existing_articles.add(article)  # prevent intra-batch duplication
                    created_count += 1
                except Exception as e:
                    logger.error("Failed to create requirement %s/%s: %s", reg_code, article, e)
                    failed_count += 1
                    continue
            else:
                created_count += 1  # dry_run: count as would-create

            result_items.append(ExtractedRequirement(
                regulation_code=reg_code,
                article=article,
                title=title,
                requirement_text=r.text[:1000],
                source_url=r.source_url,
                score=r.score,
                action="created" if not dry_run else "would_create",
            ))

    return {
        "created_count": created_count,
        "skipped_dup_count": skipped_dup_count,
        "failed_count": failed_count,
        "result_items": result_items,
    }


# ---------------------------------------------------------------------------
# Endpoint
# ---------------------------------------------------------------------------

@router.post("/compliance/extract-requirements-from-rag", response_model=ExtractionResponse)
async def extract_requirements_from_rag(
    body: ExtractionRequest,
    db: Session = Depends(get_db),
):
    """
    Search all RAG collections for Prüfaspekte / audit criteria and create
    Requirement entries in the compliance DB.

    - Deduplicates by (regulation_code, article) — safe to call multiple times.
    - Auto-creates Regulation stubs for previously unknown regulation_codes.
    - Use `dry_run=true` to preview results without any DB writes.
    - Use `regulation_codes` to restrict to specific regulations (e.g. ["BSI-TR-03161-1"]).
    """
    collections = body.collections or ALL_COLLECTIONS
    queries = body.search_queries or DEFAULT_QUERIES

    # --- 1. Search all collections in parallel ---
    search_tasks = [
        _search_collection(col, queries, body.max_per_query)
        for col in collections
    ]
    # With return_exceptions=True the result list mixes per-collection hit
    # lists and Exception instances, so no narrower annotation applies.
    collection_results = await asyncio.gather(*search_tasks, return_exceptions=True)

    # Flatten, skip exceptions
    all_results: List[RAGSearchResult] = []
    for col, res in zip(collections, collection_results):
        if isinstance(res, Exception):
            logger.warning("Collection %s search failed: %s", col, res)
        else:
            all_results.extend(res)

    logger.info("RAG extraction: %d raw results from %d collections", len(all_results), len(collections))

    # --- 2. Parse, filter, deduplicate, and group ---
    parsed = _parse_rag_results(all_results, body.regulation_codes)
    deduped_by_reg = parsed["deduped_by_reg"]
    skipped_no_article = parsed["skipped_no_article"]

    logger.info("RAG extraction: %d unique (regulation, article) pairs", parsed["unique_count"])

    # --- 3. Create requirements ---
    store_result = _store_requirements(db, deduped_by_reg, body.dry_run)
    created_count = store_result["created_count"]
    skipped_dup_count = store_result["skipped_dup_count"]
    failed_count = store_result["failed_count"]
    result_items = store_result["result_items"]

    message = (
        f"{'[DRY RUN] ' if body.dry_run else ''}"
        f"Erstellt: {created_count}, Duplikate übersprungen: {skipped_dup_count}, "
        f"Ohne Artikel-ID übersprungen: {len(skipped_no_article)}, Fehler: {failed_count}"
    )
    logger.info("RAG extraction complete: %s", message)

    return ExtractionResponse(
        created=created_count,
        skipped_duplicates=skipped_dup_count,
        skipped_no_article=len(skipped_no_article),
        failed=failed_count,
        collections_searched=collections,
        queries_used=queries,
        requirements=result_items,
        dry_run=body.dry_run,
        message=message,
    )