137 Commits

Author SHA1 Message Date
Benjamin Admin 870cdc871e fix(embedding): short legal docs keep their section prefix (chunk_text_legal)
chunk_text_legal had an early return for text <= chunk_size that skipped the
[§ X] prefix, so chunk_text_legal_structured could not extract section/article
and left article="". Two consequences: (a) article_label fell back to "BDSG"
(no §), and (b) the deterministic point ID collided (every article="" produced
the same ID), so roughly half of the short §§ overwrote each other. Fix: the
early return now carries the detected section header as a prefix. Verified on
the BDSG § ingest: 44 -> 86 distinct §§, and §38 is labeled cleanly as
"BDSG § 38". Only affects FUTURE ingests (no re-chunk).
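
The early-return fix can be sketched roughly as follows. This is a minimal illustration, not the repository's code: the header regex, the function names, and the fixed-width splitting are all assumptions, since the commit does not show the real chunk_text_legal signature.

```python
import re

# Hypothetical header pattern: "§ 38" or "Art. 5" at the start of a line.
SECTION_HEADER = re.compile(r"^\s*(§\s*\d+[a-z]?|Art\.\s*\d+[a-z]?)", re.MULTILINE)


def section_prefix(text: str) -> str:
    """Return "[§ X] " for the first detected section header, else ""."""
    match = SECTION_HEADER.search(text)
    if match is None:
        return ""
    label = " ".join(match.group(1).split())  # collapse inner whitespace
    return f"[{label}] "


def chunk_text_legal_sketch(text: str, chunk_size: int = 1000) -> list[str]:
    """Split text into fixed-size chunks, each carrying the section prefix."""
    prefix = section_prefix(text)
    if len(text) <= chunk_size:
        # The fix described above: the short-text path keeps the prefix
        # instead of returning the bare text.
        return [prefix + text]
    return [prefix + text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]
```

With the prefix present on short documents too, a downstream extractor has a section label to parse, so article is no longer empty for them.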

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-06-22 23:34:11 +02:00
Benjamin Admin f398088fbb feat(controls): atom-inheritance schema-aware (text + jsonb source_citation)
CI / go-lint (push) Has been skipped
CI / python-lint (push) Has been skipped
CI / nodejs-lint (push) Has been skipped
CI / test-go-consent (push) Successful in 36s
CI / test-python-voice (push) Successful in 40s
CI / test-bqas (push) Successful in 38s
In prod, canonical_controls.source_citation is text containing JSON (DB-swap anomaly);
on macmini it is jsonb. The _art() helper uses pg_input_is_valid(col::text,'jsonb') plus
(col::text)::jsonb->>'article' (PG16+), so one script covers both schemas.
Prod apply verified 2026-06-21: citability 6.8% -> 60.8% (+169,755),
spot check 8/8 correct. macmini dry run: 0 (idempotent, no regression).
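
A schema-aware helper of this kind might look like the sketch below. The function name art_expr and the CASE wrapper are assumptions; only pg_input_is_valid (PostgreSQL 16+) and the cast-then-extract expression come from the commit text.

```python
def art_expr(column: str) -> str:
    """Build a SQL expression that reads 'article' from a text-or-jsonb column.

    The column is cast to text first, so the same expression works whether the
    column is stored as jsonb or as text holding JSON. Rows whose text is not
    valid JSON yield NULL instead of raising a cast error.
    Note: `column` is interpolated verbatim, so pass trusted identifiers only.
    """
    as_text = f"({column}::text)"
    return (
        f"CASE WHEN pg_input_is_valid({as_text}, 'jsonb') "
        f"THEN {as_text}::jsonb->>'article' END"
    )


if __name__ == "__main__":
    print(f"SELECT {art_expr('source_citation')} AS article FROM canonical_controls;")
```

Guarding the cast with pg_input_is_valid is what lets a single script run against both schemas without a per-environment branch.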

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-06-21 22:44:38 +02:00
Benjamin Admin 0c5f1fd7a4 feat(rag): regulation_short casing normalization at the ingest boundary
The re-ingest partly derives regulation_short from file names via title() casing
('dsgvo'->'Dsgvo', 'osha otm'->'Osha Otm'), which puts wrong acronyms into the payload AND the
article_label ('Art. 37 Dsgvo'). NEW: normalize_regulation_short() in legal_metadata,
token-based with a curated acronym set: only listed acronyms are uppercased, and
legitimate mixed case (GeschGehG, MuSchG, GoBD, MiCA, eIDAS, EuGH) stays untouched.
Applied at the ingest boundary in documents.py (covers the payload field + display_name).
+13 tests green. Existing data needs a separate one-off Qdrant patch.
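
The token-based normalization could be sketched like this. The acronym set below is a hypothetical stand-in (the curated list in legal_metadata is not shown in the commit), and the function carries a _sketch suffix to mark it as an illustration.

```python
# Hypothetical curated set; the real list is not part of the commit message.
KNOWN_ACRONYMS = frozenset({"DSGVO", "BDSG", "OSHA", "OTM", "NIST"})


def normalize_regulation_short_sketch(value: str) -> str:
    """Uppercase only tokens that are listed acronyms; leave all others as-is."""
    tokens = value.split()
    return " ".join(
        token.upper() if token.upper() in KNOWN_ACRONYMS else token
        for token in tokens
    )
```

Because unlisted tokens pass through unchanged, mixed-case names such as GeschGehG or eIDAS are not touched, which is the property the commit calls out.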

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-06-21 18:31:45 +02:00
Benjamin Admin de542633e2 feat(controls): citability: embedding re-link + atom inheritance
citation_backfill tier 1 switched from the dead sha256 hash to semantic search against the
re-ingested chunks that carry article_label (citation location taken from
article_label); rag_client passes article_label through (additive, default field).
NEW: scripts/atom_citation_inheritance.py inherits source_citation parent->atom
(license_rule != 3), iteratively. macmini apply verified: citability
6.9%->61.3% (+171,765 atoms), spot check correct (atom == parent citation).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-06-21 14:17:57 +02:00
Benjamin Admin ff4a743558 Merge branch 'main' of ssh://gitea.meghsakha.com:22222/Benjamin_Boenisch/breakpilot-core
CI / go-lint (push) Has been skipped
CI / python-lint (push) Has been skipped
CI / nodejs-lint (push) Has been skipped
CI / test-go-consent (push) Successful in 32s
CI / test-python-voice (push) Successful in 29s
CI / test-bqas (push) Successful in 26s
2026-06-21 01:13:27 +02:00
Benjamin Admin dac2a9f685 feat(rag): legal metadata: article_label + deterministic IDs + chunk_hash
New pure module legal_metadata.py (stdlib only, testable locally and in CI): §3 normalization
section->article, strict header extraction (date/page noise -> no false citation),
citation_style per regulation (EU/CH=article, DE=paragraph), court rulings use the case number instead of §,
camelCase display names (ProdHaftG), deterministic uuid5 point ID + chunk_hash (sha256).
documents.py wires build_legal_fields into the payload build + document_version.
10 tests green. Contract: rag_reingest_spec.md (§2/§3).
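
Deterministic IDs and content hashes of this kind can be built from the standard library alone. The sketch below is an assumption about the shape, not the module's code: the namespace UUID, the key layout (regulation, article, chunk index), and the function names are all hypothetical.

```python
import hashlib
import uuid

# Hypothetical namespace; any fixed UUID works as long as it never changes.
POINT_NAMESPACE = uuid.uuid5(uuid.NAMESPACE_URL, "https://example.invalid/legal-chunks")


def chunk_hash(text: str) -> str:
    """Return the sha256 hex digest of the chunk text."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def point_id(regulation: str, article: str, chunk_index: int) -> str:
    """Derive a stable uuid5 point ID from the chunk's legal coordinates."""
    key = f"{regulation}|{article}|{chunk_index}"
    return str(uuid.uuid5(POINT_NAMESPACE, key))
```

The same inputs always yield the same ID, so a re-ingest overwrites a point instead of duplicating it. The flip side is that two chunks with identical key fields (for example, both with an empty article) map to one ID and overwrite each other.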

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-06-20 14:35:07 +02:00
Benjamin Admin adb7c6802c feat(nginx): /mcp on :8002 → bp-compliance-mcp (repo-scanner MCP endpoint)
Streamable-HTTP MCP of the compliance repo (cra_assess_findings), reachable at
macmini:8002/mcp. SSE-safe: proxy_buffering off, http/1.1, read_timeout 3600s;
Authorization (Bearer) is passed through. Additive, placed before location / in the 8002 block.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-06-15 18:58:17 +02:00
Benjamin Admin dbfe7347b1 chore(infra): remove night-scheduler service entirely [guardrail-change]
Deletes the night-scheduler (nightly auto-shutdown of services). It stopped
services at 22:00 and killed long-running jobs (e.g. bulk embedding) — net
hindrance. Removed: running container, compose service + hetzner override,
source dir, CI lint entry, rule doc.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-06-14 08:19:22 +02:00
Sharang Parnerkar a7850a0296 Add L-Bank Pre-Seed Finanzplan submissions (Base/Bear/Bull)
Build pitch-deck / build-push-deploy (push) Has been cancelled
CI / go-lint (push) Has been cancelled
CI / python-lint (push) Has been cancelled
CI / nodejs-lint (push) Has been cancelled
CI / test-go-consent (push) Has been cancelled
CI / test-python-voice (push) Has been cancelled
CI / test-bqas (push) Has been cancelled
Wandeldarlehen 400k model mapped into the official L-Bank V1.1
Finanzplan template (36 months, Aug 2026 to Jul 2029) for the three
scenarios. Each reconciles to the source liquidity to the cent; grants
booked as cash inflows (out of EBIT); "ohne Pre-Seed" excludes both
tranches; Planungsprämissen and helper tabs filled; Anleitung intact.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-02 13:25:09 +02:00
Benjamin Admin ec3b0e26fd Merge branch 'chore/license-mapping-audit' — license mapping + audit script + DGUV + /staerken marketing page
CI / go-lint (push) Has been skipped
CI / python-lint (push) Has been skipped
CI / nodejs-lint (push) Has been skipped
CI / test-go-consent (push) Successful in 37s
CI / test-python-voice (push) Successful in 31s
CI / test-bqas (push) Successful in 34s
2026-05-22 00:54:49 +02:00
Benjamin Admin 19d1a56df4 feat(marketing): /staerken page with 7 USPs from IACE strategy — Task #19
Long-form differentiator page covering the seven sales arguments from
project_marketing_website_3014_themes.md, all anchor-linkable for
sales decks:

  #1 engine          — Pattern-engine vs Excel-checklist
  #2 multi-markt     — One risk assessment, all markets (CE+US+CN+JP)
  #3 folgegefahren   — Operator-to-end-customer harm chain
  #4 public-domain   — OSHA/NIST/EUR-Lex/BAuA as legal anchor
  #5 audit-suite     — Engine self-introspection (cmd/iace-audit A-E)
  #6 made-in-germany — German export meets US Federal PD
  #7 tooling         — LLM gap-review as co-pilot, not robo-lawyer

Each section carries a "Belegt durch:" line pointing at the actual
codebase artifact behind the claim, so the page reads as audit-friendly
proof, not marketing fluff.

Below the 7 differentiators a competitor comparison table (BreakPilot
vs DesignSafe vs Pilz PASS vs Sick SD vs Sphera) and a closing block
explaining the R1/R2/R3 license architecture with a pointer to
/sdk/licenses.

Navbar updated to surface the page between Plattform and CE-Prozess.

This closes Task #19. With Task #29 + #7/#8 already in, the roadmap
post-licence-classification work is fully landed.
2026-05-22 00:36:09 +02:00
Benjamin Admin 3934bdf814 docs(impressum): add Quellen & Lizenzen section with /sdk/licenses ref
Adds a "Quellen und Lizenzen der Compliance-Inhalte" section to the
marketing-website Impressum naming the public sources the platform
draws on (EUR-Lex, US Federal Code, ENISA/EDPB/BAuA, OWASP, OECD,
eigene Texte) and pointing to /sdk/licenses for the full per-source
breakdown.

The Datenschutz and Impressum audit (Task #24 in breakpilot-compliance)
confirmed no spurious license claims were buried in these pages.
This change adds explicit transparency rather than removing anything,
and is paired with the explicit disclaimer that the Pauschalvermerk
does NOT replace work-level attribution — that is handled by the
auto-footer in PDFs and the <SourceBadge> in the SDK frontend.
2026-05-21 22:19:24 +02:00
Benjamin Admin dbd44ecc20 feat(licenses): postgres + qdrant license_rule backfill scripts
Two idempotent scripts that complete Task #22 (300k atomic_controls
reclassification) across both Postgres DBs and all Qdrant collections
on Mac Mini + Production.

backfill_license_rule.py
- iterative parent_control_uuid inheritance with cycle cap
- dry-run + apply modes, per-iteration row counts
- residual-orphan cluster report for manual review
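
The iterative inheritance described in the list above can be sketched over in-memory dictionaries. This is an illustration under assumptions: the real script works against Postgres, and the function name, argument shapes, and default cap are hypothetical.

```python
def inherit_license_rule(rules: dict, parents: dict, max_iterations: int = 10) -> list:
    """Fill missing license rules from parents, one generation per pass.

    rules:   control id -> license_rule (None when unclassified), updated in place
    parents: control id -> parent control id (None when there is no parent)
    Returns the number of rows filled in each pass.
    """
    counts = []
    for _ in range(max_iterations):  # cap on the number of passes
        updates = {
            child: rules[parent]
            for child, parent in parents.items()
            if rules.get(child) is None
            and parent is not None
            and rules.get(parent) is not None
        }
        if not updates:
            break  # fixed point reached: nothing left to inherit
        rules.update(updates)
        counts.append(len(updates))
    return counts
```

Rows whose ancestors never resolve to a rule (including rows caught in a parent cycle) stay None after the loop; those are the residual orphans that would go to manual review.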

backfill_qdrant_license_payload.py
- joins canonical_controls.id (or regulation_id) → license_rule
- scrolls + grouped set_payload per rule (3 batches per collection)
- supports both lookup tables (canonical_controls / regulation_registry)
- supports managed Qdrant via --qdrant-api-key (Production)

Backfill balance:
- Mac Mini canonical_controls: 0 NULL (was 279,384) across 314,811 rows
- Mac Mini Qdrant atomic_controls_dedup: 44,987 points patched
- Mac Mini bp_compliance_gesetze: 37,634 points patched
- Mac Mini bp_compliance_datenschutz: 11,338 points patched
- Production canonical_controls: 0 NULL (was 259,914) across 294,027 rows
- Production Qdrant bp_compliance_gesetze: 55,836 patched
- Production Qdrant bp_compliance_datenschutz: 18,980 patched
- Production Qdrant bp_compliance_ce: 23,239 patched

Schema migration 002_regulation_registry.sql + 252 registry rows were
replicated to Production (was missing — only existed on Mac Mini).
20 BSI/DE-Gesetz entries added to registry to close Qdrant lookup gap.

100% deterministic classification achieved on both DBs via:
- parent_control_uuid inheritance (94% coverage)
- control_parent_links.source_regulation → regulation_registry
- source_citation->>'source' → regulation_registry
- canonical_processed_chunks ground truth (chunk-validated)
- ungrouped LLM-aggregate ancestors → own works (Rule 3)

[migration-approved]
2026-05-21 18:46:57 +02:00
Benjamin Admin 93687a32fe docs(licenses): freeze 3-rule license mapping + audit script
Defines the authoritative mapping from license_type to license_rule
in docs/LICENSE_RULES.md, and adds scripts/audit_license_classification.py
to surface classification gaps in registry/canonical_controls/Qdrant.

Key finding from first audit run against bp-core-postgres + Qdrant:

- regulation_registry: 232 rows, 224 rule=1, 8 rule=2, 0 rule=3;
  36 rows without license_type (need backfill)
- canonical_controls: 314,811 rows, 279,384 (89%) have NULL
  license_rule (target of Task #22 reclassification)
- Qdrant atomic_controls_dedup: 100% of sampled points lack both
  license and license_rule payload fields
- Qdrant bp_compliance_gesetze: 80.6% lack both fields
- Qdrant bp_compliance_ce + bp_compliance: nearly clean

Rule definitions clarified (was loosely remembered as
"law / cite / rewrite"):
- Rule 1 = verbatim, sovereign law (EU/DE/AT/CH/US, TRBS/TRGS/ASR,
  OSHA, NIST, EU guidelines, DGUV UVV)
- Rule 2 = verbatim with attribution (CC-BY, Apache, OWASP,
  OECD AI Principles, ENISA)
- Rule 3 = identifier citation only, no full text (DIN/EN/ISO,
  ANSI/UL/IEC, DGUV Regeln/Informationen/Grundsaetze, BSI,
  proprietary standards). Pipeline drops chunk_text when rule=3
  in pipeline_adapter.py:147.
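
The Rule 3 behaviour (identifier citation only, no full text) might be sketched as follows. The payload field names come from the commit; the function name and the choice to remove the key rather than blank it are assumptions, since pipeline_adapter.py itself is not shown.

```python
def apply_license_rule(payload: dict) -> dict:
    """Return a copy of the payload with chunk_text removed when license_rule is 3."""
    result = dict(payload)  # shallow copy, the caller's dict stays untouched
    if result.get("license_rule") == 3:
        # Rule 3: keep the identifier and metadata, drop the full text.
        result.pop("chunk_text", None)
    return result
```

Rules 1 and 2 pass through unchanged in this sketch; attribution handling for Rule 2 is out of scope here.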

The 4th category I had proposed ("R1-A") turned out to be already
implemented as rule=2; the mapping doc reflects the actual code
behaviour rather than the original 3-name verbal model.

No schema change. No data migration in this commit — reclassification
of the 279k controls is staged as Task #22 and will be cluster-based
by source/regulation_id.
2026-05-21 11:29:38 +02:00
Sharang Parnerkar 2d9fec3a6d feat(pitch-print): 10 slide redesigns from parallel agent review
Build pitch-deck / build-push-deploy (push) Successful in 1m53s
CI / go-lint (push) Has been skipped
CI / python-lint (push) Has been skipped
CI / nodejs-lint (push) Has been skipped
CI / test-go-consent (push) Successful in 34s
CI / test-python-voice (push) Successful in 36s
CI / test-bqas (push) Successful in 36s
Per the user's batch review of the rendered PDF. Five subagents ran in parallel,
each owning a different slide file; this is the merged result.

Slide 10 — Regulatory Landscape (PrintProductSlides)
  8 regulatory categories now render as a 2×4 icon-tile grid (was a DataTable):
  Lock / Shield / Brain / Globe / ShieldCheck / Banknote / Heart / Users.
  10 industry profiles now each show an icon next to the name
  (Factory for Maschinenbau Kernfokus, plus Heart, Banknote, ShoppingCart, Cpu,
  Wifi, Brain, ShieldCheck, BookOpen, Landmark, Building2).

Slide 12 — How It Works (PrintProductSlides)
  Step rail and day timeline pulled together (was a big empty middle).
  Added a "Was Sie wann bekommen" 4-column benefit block in the bottom third
  (Shield/FileText/CheckCircle2/Zap), with mid-page "Median 14 Tage" callout.

Slide 13 — Market TAM / SAM / SOM (PrintMarketSlides)
  Dropped MarketFunnel primitive. Left column: SVG nested concentric circles
  (TAM r=60 violet, SAM r=36 violet, SOM r=14 amber as Kernmarkt). Right column:
  three stacked TAM/SAM/SOM info cards with mono kicker, big EUR value, growth
  rate, one-line description; SOM card carries amber accent + "← unser Kernmarkt".

Slide 14 — Pricing green box (PrintProductSlides)
  Net-effect callout expanded from 2 lines to a full breakdown:
  Pentests +€13k / CE-Risiko +€9k / Compliance-Zeit (−60%) +€15k /
  Audit-Vorber. (auto) +€9k / Legal-Stunden (−40%) +€5k / Schulungen +€4k.
  Italic footnote: "Plus Vermeidung von Bußgeldern und gewonnene RFQs."

Slide 17 — Competition AppSec title (PrintCompetitionSlides)
  Title rewritten to investor-friendly framing — "Cyber-Security: BreakPilot
  ersetzt das ganze AppSec-Stack" (was SAST + DAST + SCA + Pentesting).

Slide 18 — Team founder bios (PrintMarketSlides)
  Prose paragraphs replaced with 5 icon-bulleted skill/achievement lines per
  founder. Benjamin gets violet-50 tiles (Briefcase, RefreshCw, Handshake,
  Scale, Lightbulb). Sharang gets amber-50 tiles (Code, TrendingUp, CreditCard,
  ShieldCheck, Cpu). Photo + name + role + equity header preserved.

Slide 23 — KPIs trajectory (PrintNewSlides)
  Each of the 8 KPI tiles now has a 15mm × 8mm SVG sparkline at the bottom
  showing the 5-year progression. Stroke color adapts per metric (violet
  default, emerald for cash/margin, red→emerald for EBIT/net-income across
  break-even). All-zero series fall back to em-dash. Awkward "0 → 0" prefix
  suppressed on missing-data tiles.

Slide 28 — Regulatory Pillars (PrintAnnexSlides)
  Rebuilt as 4 actual vertical pillars (was 2×2 box grid). Each pillar has:
  capital (top, gradient tint, mono kicker + 01-04 number), shaft (white card
  with title + description + 2mm colored left border), base (bottom, darker
  tint, mono law citations). A shared horizontal "ground line" below all four
  pillars completes the architectural reference.

Slide 29 — Architecture 3D (PrintDiagrams)
  Faked 3D depth via staggered right indent (0/2/4mm), inset top highlight
  and bottom seam shadows, per-layer drop-shadow with rising opacity. Layer 03
  reads as the foundation; layer 01 floats on top. PlaneConnector chevrons
  replace the simple SVG down-arrows between tiers. Text stays horizontal.

Slide 31 — Tech Stack (PrintNewSlides)
  Cards now have 14mm violet-gradient icon tiles (was 8mm flat), mono kicker
  number, 12pt category name, italic one-line blurb, and the techs as rounded
  chip tags (violet-50 / violet-200, mono 7.5pt) instead of a flat mono list.
  Title cleaned: "100 % " → "100%".

All files under 500 LOC except PrintIntroSlides (515, preexisting issue).
TypeScript clean, next build green, all 38 routes compile.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-20 18:11:58 +02:00
Sharang Parnerkar a6f4ca88a4 fix(pitch-print): ComplAI brand, em-dash centering, fund fallback 400k
Build pitch-deck / build-push-deploy (push) Successful in 2m47s
CI / go-lint (push) Has been skipped
CI / python-lint (push) Has been skipped
CI / nodejs-lint (push) Has been skipped
CI / test-go-consent (push) Successful in 43s
CI / test-python-voice (push) Successful in 36s
CI / test-bqas (push) Successful in 35s
Three universal fixes before per-slide redesigns:

1. Brand: new <ComplAI /> JSX component renders the product name correctly —
   'Compl' in inherited text color, 'AI' in violet (#7c3aed), no slashes.
   Replaces the previous 'BreakPilot COMPL/AI/' literal in the Executive
   Summary p1 title. Page primitive's title prop now accepts ReactNode so
   JSX brand wordmarks work anywhere a title would.

2. Em-dash centering: Bullets primitive previously placed each em-dash
   marker via absolute positioning with a hardcoded 'top: 4pt', which drifted
   relative to font-size and looked off-center in the rendered PDF. Now uses
   display:flex on the <li> with a fixed-width column that vertically centers
   the 0.5pt rule on the first line height of the text.

3. Funding fallback: cover + The Ask now default to 400_000 (was 1_000_000)
   when no funding amount is in the data. New base case is a €400k
   Wandeldarlehen, not €1M equity.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-20 17:58:55 +02:00
Benjamin Admin 297eff949e Merge branch 'main' of ssh://gitea.meghsakha.com:22222/Benjamin_Boenisch/breakpilot-core
Build pitch-deck / build-push-deploy (push) Successful in 2m3s
CI / go-lint (push) Has been skipped
CI / python-lint (push) Has been skipped
CI / nodejs-lint (push) Has been skipped
CI / test-go-consent (push) Successful in 40s
CI / test-python-voice (push) Successful in 38s
CI / test-bqas (push) Successful in 35s
2026-05-20 16:24:00 +02:00
Benjamin Admin 01e2e0fc4b feat(pitch-deck): financial plan tooling + formula-driven Base/Bull/Bear versions
8 new scripts substantially extend the Excel financial plans:
- add-kunden-formulas: new-customer lookup + cumulative churn (SUMPRODUCT-based)
- add-price-formulas: annual price increase each January via a driver
- add-inflation-formulas: inflation on operating costs + office rent logic
- add-tantieme-and-explanations: founder bonus (Tantieme) 2028-2030 + explanatory notes
  in the Cohort Analysis + Sensitivity sheets
- apply-bueromiete: €1,000/month from Sep 2026, with inflation
- apply-number-formatting: Euro / Count / Percent by label classification
- cleanup-finanzplan-labels: removes the 'kategorie — ' prefix
- copy-extra-sheets: copies Charts/Cohort/Sensitivity/Hiring-Plan from Series-A
  to the 400k Base/Bull/Bear files (incl. 12 chart objects)

New Excel files (for the L-Bank Wandeldarlehen 400k pitch):
- Finanzplan-Wandeldarlehen-400k.xlsx (Base)
- Finanzplan-Wandeldarlehen-400k-Bull.xlsx
- Finanzplan-Wandeldarlehen-400k-Bear.xlsx
- Finanzplan-Series-A-Ambitioniert.xlsx (Series-A Variante)

Content changes (400k Base/Bull/Bear):
- Channel-Provision Bechtle/Cancom → Channel-Partner Provision, Euro format
- P&L (GuV): 'Steuerbares Einkommen' → 'Zu versteuerndes Einkommen (nach Verlustvortrag)',
  formula extended by interest income/expense
- IT law / data protection lawyer at 100% (€6,666 instead of €3,333)
- Series-A investor in the WD sheet set to €0 (not planned in the 400k variant)
- Employees shifted by +1 month (except founders = Oct 2026)
- 3 additional enterprise new customers (Apr 2027, Jun 2027, Oct 2029)
- Marketing agency cut ~33% per scenario (Base 4%, Bull 5%, Bear 2%)
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-20 16:23:12 +02:00
Sharang Parnerkar b4043b20b2 feat(pitch-print): TL;DR + Differentiators + KPIs + Tech Stack + P&L promoted
Build pitch-deck / build-push-deploy (push) Successful in 1m43s
CI / go-lint (push) Has been skipped
CI / python-lint (push) Has been skipped
CI / nodejs-lint (push) Has been skipped
CI / test-go-consent (push) Successful in 33s
CI / test-python-voice (push) Successful in 31s
CI / test-bqas (push) Successful in 32s
Adds the 5 slides flagged as missing vs Claude Design (30 slides). Standard
PDF now matches Claude's slide count and structure.

New slides (PrintNewSlides.tsx):
- TL;DR / 30 SEKUNDEN — 4 quad cards (Scale / Sovereignty / Bidirectional /
  Speed) with mono kicker, hero stat, body and ticker line. Slot 3, after the
  exec summary.
- Differentiators — 4 under-the-hood cards (Traceability / Engine / Optimizer
  / EU-Trust-Stack) extracted from USP p2. Slot 9, after USP. Each card has
  the lucide icon in a violet/amber tile, full body + bullets, and the mono
  ticker line.
- KPIs (Trajektorie 2026 → 2030) — 8 hero tiles showing year-1 → year-5
  transitions (ARR, customers, ARPU, employees, gross margin, EBIT, net
  income, cash). Derived live from computeAnnualKPIs(fmResults). Slot 23.
- Tech Stack — 8-category grid (Frontend / Backend / Storage / AI-RAG /
  Code-Scanning / Auth / Comms / DevOps), each with lucide icon tile +
  category label + monospaced tech list. Slot 31, after Engineering.

USP p2 redesigned: now hero-sized closing loop only (the 4 cards moved to
Differentiators). Bigger LoopDiagram in a violet-tinted hero panel, 12mm
inner padding, more room for the hub body + bullets.

P&L Detail (PrintFinancialsPage) promoted from financial-only to standard
PDF. Kicker now 21 (was '17b'), subtitle rewritten ('Annualisierte GuV',
no longer 'Investor-only'). Empty-data fallback added so it doesn't crash
if fmResults isn't populated.

Anhang divider moved from PrintAnnexSlides.tsx to PrintNewSlides.tsx (was
pushing PrintAnnexSlides over the 500-LOC cap). Section list inside the
divider updated for the new numbering — now 12 sections from #18 GTM down
to #29 Glossary.

PrintDeck.tsx: BASE_PAGES bumped 30 → 35. Render order updated; hasFinancialDetail
flag removed (P&L always rendered); cap-table is the only remaining
financial-only conditional and stays suppressed for Wandeldarlehen.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-20 13:11:02 +02:00
Sharang Parnerkar ad61fd3779 feat(pitch-print): add Anhang divider slide before appendix block
Build pitch-deck / build-push-deploy (push) Successful in 2m5s
CI / go-lint (push) Has been skipped
CI / python-lint (push) Has been skipped
CI / nodejs-lint (push) Has been skipped
CI / test-go-consent (push) Successful in 35s
CI / test-python-voice (push) Successful in 35s
CI / test-bqas (push) Successful in 31s
Investors arriving at slide 16 (Customer Savings) currently jump straight
into annex-strategy without any chapter break — they don't know the main
pitch has ended and the appendix has started.

Adds PrintAnnexDividerPage that sits between customer-savings and
annex-strategy. Layout:

  Part II · Anhang                     BreakPilot · ComplAI
  ─────────────────────────────────────────────────────────
  16 · Kapitelwechsel
  Anhang.   (giant violet-dotted title, 74pt)
  ────────
  Detail & Belege.  (15pt lead)

  Auf den folgenden Seiten
  17 GTM Strategie    20 Reg. Details        23 KI-Pipeline
  18 Finanzplan       21 Architektur          24 Risiken
  19 Treibervariablen 22 Engineering          25 Glossar
  ─────────────────────────────────────────────────────────
  BREAKPILOT · COMPLAI    WANDELDARLEHEN    16 / 30

Uses .print-page-bg so the violet-tinted dotted background reads as the
same chapter as the rest of the deck. Footer matches the standard Page
primitive.

BASE_PAGES bumped 29 → 30. Bilingual (DE/EN).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-20 12:58:53 +02:00
Sharang Parnerkar d1b55cd65b feat(pitch-print): redesign Pricing slide as 3 distinct product cards
Build pitch-deck / build-push-deploy (push) Successful in 2m3s
CI / go-lint (push) Has been skipped
CI / python-lint (push) Has been skipped
CI / nodejs-lint (push) Has been skipped
CI / test-go-consent (push) Successful in 42s
CI / test-python-voice (push) Successful in 35s
CI / test-bqas (push) Successful in 33s
The pricing slide previously rendered as a 4-column DataTable buried below
unit economics — the 3 tiers were hard to find. Rebuilt as the Claude Design
PREISE pattern: three prominent product cards side by side.

Each card:
- Mono tier label kicker (STARTER / PROFESSIONAL / ENTERPRISE) at top
- Target audience line ("<25 Mitarbeiter · Basis-Module" etc.)
- Hero price (€3.600 / €18.000 / ab €50.000) + /Jahr unit
- 4–5 feature checkmarks (green ✓)
- Tinted background per tier: violet-50 for Starter, white-gradient for
  featured Professional, amber-50 for Enterprise

Professional card carries:
- 2px violet border (vs 1px on others)
- Drop shadow
- "BELIEBT" / "POPULAR" pill badge floating above its top edge in violet

Below the 3 cards, a compact 2-col footer:
- left: 4 Unit Economics tiles (~70% gross margin, ~3.5× LTV/CAC, etc.)
- right: emerald net-effect callout (+€30k per SME / yr)

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-20 12:54:31 +02:00
Sharang Parnerkar cb46372e52 fix(pitch-print): architecture diagram overflow — compact ServiceNode
Build pitch-deck / build-push-deploy (push) Successful in 2m1s
CI / go-lint (push) Has been skipped
CI / python-lint (push) Has been skipped
CI / nodejs-lint (push) Has been skipped
CI / test-go-consent (push) Successful in 41s
CI / test-python-voice (push) Successful in 36s
CI / test-bqas (push) Successful in 34s
Infrastructure layer was being cut off (only the chip showed; the 3 inference
service cards never rendered). Root cause: each tier was double-wrapped — an
outer tinted layer card AND inner bordered FlowNode cards — which inflated
the total height past A4 landscape.

Replaces inner FlowNode (border + padding + footer rule) with a new flat
ServiceNode used only inside the tinted layer wrappers:
- no own border / no own padding
- title 11pt → 10pt, kicker 7pt → 6pt
- caps inner items to 4 max
- mono tech footer in 6pt with hairline separator

Also tightened the connectors between tiers: was a 12mm row of three VArrow
SVGs each with its own padding, now a 5mm row of three compact down-arrow
SVGs. Saves ~14mm of vertical space.

Layer chip sizing reduced (7.5pt → 7pt, padding 1.5mm → 1mm) so each chip
takes less of its layer card.

Result: all three layers fit on one A4 landscape page with the LLM
Inference / Embeddings / AI Tools cards visible.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-20 12:25:56 +02:00
Sharang Parnerkar f1814fe8ec fix(pitch-print): USP overflow, How It Works rail, Assumptions, Architecture layer cards
Build pitch-deck / build-push-deploy (push) Successful in 2m4s
CI / go-lint (push) Has been skipped
CI / python-lint (push) Has been skipped
CI / nodejs-lint (push) Has been skipped
CI / test-go-consent (push) Successful in 40s
CI / test-python-voice (push) Successful in 35s
CI / test-bqas (push) Successful in 30s
Five fixes per user review:

1. USP p1 overflow (stats were getting clipped). Tightened card spacing:
   - icon tile 12mm → 9mm, moved inline next to title
   - mono kicker for "SÄULE · COMPLIANCE" tags
   - reduced paddings, title 13pt → 12pt, body 8.5pt → 8pt
   - violet replaces indigo (already by alias, but explicit here)

2. USP p2 closing loop: was a plain tinted callout, now a 2-col hero panel
   - left: violet circle around ∞, mono "DIE SCHLEIFE · ALWAYS IN SYNC",
     bold headline (14pt), body
   - right: white card containing the LoopDiagram with violet outline
   - gradient violet→white→violet background for the panel

3. How It Works: replaced the floating-arrow StepStrip with a real
   horizontal-rail timeline:
   - Violet gradient connector line behind 4 numbered circles
   - Each circle is a 14mm violet disc with the step number
   - Title + body below each circle
   Replaced the Time-to-Value callout with a dotted-rail timeline:
   - 5 day markers (Tag 0/3/7/14/30) as violet pill chips on a dashed rail
   - Stop label below each
   - Mono header reads "Time-to-Value · Median 14 Tage · Worst Case 28 Tage"

4. Assumptions slide:
   - "Skalare Annahmen" → "Treibervariablen des Finanzplans" (plain language)
   - subtitle rewritten to explain the three-scenario sensitivity setup
     instead of referencing internal fp_assumptions tables
   - each category now a violet-bordered card with mono kicker + variable
     count, italic instead of bare table
   - sensitivity callout expanded with concrete runway impact numbers

5. Architecture diagram: layer chips per Claude Design pattern.
   - Each tier wrapped in a tinted rounded card (violet for product +
     inference, amber for gateway)
   - "01 · APPLICATION LAYER" mono pill with italic sub-label
     ("User-facing services") next to it
   - Gateway layer carries the LiteLLM Proxy title inline with subtitle
   - Connector arrows kept between layers

Also fixes "Kleinstunternehmen" → "Kleinunternehmen" typo in solution
pillar 03 and the product pricing-logic callout.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-20 12:08:50 +02:00
Sharang Parnerkar 12a9fe1810 fix(pitch-print): drop Standort/HQ from cover key terms
Build pitch-deck / build-push-deploy (push) Successful in 1m51s
CI / go-lint (push) Has been skipped
CI / python-lint (push) Has been skipped
CI / nodejs-lint (push) Has been skipped
CI / test-go-consent (push) Successful in 54s
CI / test-python-voice (push) Successful in 43s
CI / test-bqas (push) Successful in 38s
3-column grid now: Funding · Pre-Money/Maturity · Instrument.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-20 11:57:58 +02:00
Sharang Parnerkar 8b5b9905a7 fix(pitch-print): port Claude Design tokens — violet, Inter+JBMono, dotted bg
Build pitch-deck / build-push-deploy (push) Successful in 1m50s
CI / go-lint (push) Has been skipped
CI / python-lint (push) Has been skipped
CI / nodejs-lint (push) Has been skipped
CI / test-go-consent (push) Successful in 33s
CI / test-python-voice (push) Successful in 30s
CI / test-bqas (push) Successful in 27s
Adapts the visual language from the Claude Design reference (light theme) while
preserving our left-rule Page header and split-block cover.

Color palette: indigo (#4f46e5) → violet (#7c3aed) as primary accent across all
slides. COLORS.indigo* aliases kept so the existing 9 slide files inherit the
new palette without edits. New explicit COLORS.violet50..900 names available
for future code.

Body text shifted from pure slate to deep purple-tinted (#1a0f34) per Claude
tokens.fg.

Typography:
- Body / headings: Inter (was Plus Jakarta Sans)
- Mono utility: JetBrains Mono — applied to kicker tags, page numbers, footer,
  the "At a glance" stat block on the cover, and the cover key-term labels
- Mono class .print-mono added to print.css

Background:
- New .print-page-bg utility paints a violet-tinted radial gradient
  (white → #f5efff → #ebdfff) with a subtle 24px dotted grid SVG overlay
- Applied to every Page and the cover's right pane

Page chrome:
- Kicker label switched to JetBrains Mono with wider letter-spacing (0.18em)
- Right-of-kicker rule fades violet→transparent (was flat slate)
- New 2px violet gradient bar (700→400→700) below the title/subtitle —
  the Claude Design "purple bar" accent, scaled down for print
- Footer restyled: mono caps "BREAKPILOT · COMPLAI" left, version (violet) middle,
  page number right

Cover:
- Left block now a violet vertical gradient (was flat indigo)
- All small labels ("Investor Brief", "Auf einen Blick", "Confidential",
  "Key Terms", and the term labels) restyled to JetBrains Mono with wider tracking
- Right pane carries the violet-tinted dotted bg, matching the rest of the deck

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-20 11:23:58 +02:00
Sharang Parnerkar cd23ebc3ba fix(pitch-print): density on Problem/Solution/Strategy, Ask reconciliation
Build pitch-deck / build-push-deploy (push) Successful in 1m42s
CI / go-lint (push) Has been skipped
CI / python-lint (push) Has been skipped
CI / nodejs-lint (push) Has been skipped
CI / test-go-consent (push) Successful in 31s
CI / test-python-voice (push) Successful in 33s
CI / test-bqas (push) Successful in 34s
Per user review of the rendered PDF.

Problem: empty bottom-third on each card → added a bottom stat block per
column showing 3 pulled-out data points (e.g. "64% · 70% · 83%") with red
hero numerals. Description text trimmed since the stats now carry the punch.

Solution: pillar bodies were short, leaving large gaps between description
and the green stat at the bottom. Added 5 detail bullets per pillar (specific
tools, frameworks, behaviours) in the previously empty middle. Stat at the
bottom now reads as a real KPI tile, not a floating value.

Strategy: phase KPI was a tiny corner tag. Promoted it to a bottom
"Outcome" block with side-by-side 14pt numerals matching the phase tone
(2 Kunden / ARR €40k etc.). The bullets get more breathing room above.

The Ask reconciliation (was showing nonsense €4M pre / €5M post / 20%
investor share for a €200k Wandeldarlehen): detect convertible/SAFE/
Wandeldarlehen and swap the tiles to Funding / Discount / Maturity /
INVEST-grant. Equity rounds compute Pre/Post from amount × 20% assumed
investor share. Same conditional applied to the cover key-terms grid.
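
The tile switch described above can be sketched roughly as follows. This is a minimal illustration, not the deck's actual code: the function names, the keyword list, and the reading "post-money = amount / investor share" are assumptions chosen to match the €4M pre / €5M post / 20% figures quoted in the message.

```python
# Hypothetical keyword list; the commit names convertible/SAFE/Wandeldarlehen.
CONVERTIBLE_KEYWORDS = ("convertible", "safe", "wandeldarlehen")


def is_convertible(instrument: str) -> bool:
    """Return True when the instrument label looks like a convertible instrument."""
    label = instrument.lower()
    return any(keyword in label for keyword in CONVERTIBLE_KEYWORDS)


def ask_tiles(instrument: str, amount: float, investor_share: float = 0.20) -> dict:
    """Pick the tile set for The Ask slide (assumed structure)."""
    if is_convertible(instrument):
        # Convertibles have no priced valuation, so show terms instead.
        return {
            "kind": "convertible",
            "tiles": ["Funding", "Discount", "Maturity", "INVEST-grant"],
        }
    # Equity round: assumed reading of "amount x 20% assumed investor share".
    post_money = amount / investor_share
    pre_money = post_money - amount
    return {"kind": "equity", "pre_money": pre_money, "post_money": post_money}
```

Under this reading a €1M equity round at 20% yields €4M pre and €5M post, while a €200k Wandeldarlehen gets the four term tiles and no valuation at all.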

Pricing label "Was der Kunde zahlt vs. spart (KMU 50 MA, Jahr 1)" was
wrapping "1)" onto its own line; switched to a dot-separated form
("Kunde zahlt vs. spart · KMU 50 MA · Jahr 1") that fits on one line.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-20 10:10:58 +02:00
Sharang Parnerkar f30ac73b79 fix(pitch-print): cover layout, Finanzplan data source, target_date
Build pitch-deck / build-push-deploy (push) Successful in 1m34s
CI / go-lint (push) Has been skipped
CI / python-lint (push) Has been skipped
CI / nodejs-lint (push) Has been skipped
CI / test-go-consent (push) Successful in 30s
CI / test-python-voice (push) Successful in 30s
CI / test-bqas (push) Successful in 28s
Three critical fixes after reviewing the rendered PDF:

Cover (was: indigo block collapsed to top, white content stacked below):
- The .print-page class in print.css forces flex-direction: column !important,
  which broke the horizontal split. Wrap the cover content in a single grid
  container — the column-flex parent then has only one child so direction is
  irrelevant. Indigo block now runs full-height on the left.
- Title reduced 88pt -> 60pt so "BreakPilot ComplAI." fits without wrapping.
- Funding amount formatter now handles sub-€1M cases (€200k vs €0.2M).
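
A formatter that handles the sub-€1M case might look like the sketch below. The function name and rounding rules are assumptions; the commit only states that €200k should no longer render as €0.2M.

```python
def format_funding(amount_eur: float) -> str:
    """Format a funding amount as a short label such as '€200k' or '€2.5M'."""
    if amount_eur < 1_000:
        return f"€{round(amount_eur)}"
    thousands = round(amount_eur / 1_000)
    if thousands < 1_000:
        # Sub-€1M amounts read better in thousands than as a fractional million.
        return f"€{thousands}k"
    millions = amount_eur / 1_000_000
    # One decimal place, with a trailing ".0" dropped ("1.0" becomes "1").
    text = f"{millions:.1f}".rstrip("0").rstrip(".")
    return f"€{text}M"
```

Amounts that round up to 1000k (for example 999,999) are promoted to the millions branch, so the label never reads "€1000k".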

Finanzplan (was: "nicht verfügbar" on both pages 20-21):
- page.tsx was querying the legacy pitch_fm_results table which isn't populated
  by the current pipeline. The interactive deck reads from fp_* tables.
- Wire in lib/finanzplan/adapter.ts (finanzplanToFMResults) which bridges the
  live fp_* tables to FMResult[] — same source the interactive deck uses.
- Fall back to live default fp_scenario if the version snapshot's fm_scenarios
  is empty.
- adapter.ts: populate total_customers + new_customers from fp_kunden_summary
  (was hardcoded 0).

The Ask:
- target_date was rendering as raw ISO timestamp "2026-08-01T00:00:00.000Z";
  now formatted as "Aug 2026" (locale-aware).
- Hero funding amount uses same sub-€1M formatter.
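
The target_date fix amounts to parsing the ISO timestamp and printing month and year. The sketch below is an assumption-laden illustration: the function name is hypothetical, and it hard-codes English month abbreviations for determinism, whereas the commit describes the real formatter as locale-aware.

```python
from datetime import datetime

# English abbreviations, hard-coded here for a deterministic example.
MONTHS = ["Jan", "Feb", "Mar", "Apr", "May", "Jun",
          "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"]


def format_target_date(iso_timestamp: str) -> str:
    """Turn '2026-08-01T00:00:00.000Z' into 'Aug 2026'."""
    # datetime.fromisoformat() only accepts a trailing "Z" from Python 3.11 on,
    # so normalize it to an explicit UTC offset first.
    parsed = datetime.fromisoformat(iso_timestamp.replace("Z", "+00:00"))
    # The value stays in UTC; converting to local time could shift the month.
    return f"{MONTHS[parsed.month - 1]} {parsed.year}"
```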

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-20 10:01:53 +02:00
Sharang Parnerkar bb85ee2e27 fix(pitch-print): page count, Finanzplan loading, visual energy
Build pitch-deck / build-push-deploy (push) Successful in 1m59s
CI / go-lint (push) Has been skipped
CI / python-lint (push) Has been skipped
CI / nodejs-lint (push) Has been skipped
CI / test-go-consent (push) Successful in 52s
CI / test-python-voice (push) Successful in 37s
CI / test-bqas (push) Successful in 36s
Two bug fixes plus the requested visual rework — the deck now looks like a pitch deck, not a research paper.

Bugs:
- BASE_PAGES corrected from 28 to 29; disclaimer no longer shows "29/28"
- fmResults + fmAssumptions now load for the standard PDF, not only when financial=true; Finanzplan annex + KPI dashboard now render

Visual rework (per user: "graphic elements, not just text"):
- Cover: split layout — indigo block left (tagline + hero stats + version meta), white block right with oversized title and key terms
- Modules: 12 lucide icons in indigo-50 tiles (ScanLine, ShieldCheck, FileText, ClipboardCheck, Users, UserCheck, AlertTriangle, Brain, Target, GraduationCap, TrendingUp, MessageSquare)
- USP cards: icon-led card heads with FileSearch/ArrowLeftRight/Repeat/Layers/etc.; LoopDiagram SVG on the closing "Compliance ↔ Code" hub
- How It Works: StepStrip primitive with visible right-arrows between steps
- Market: nested-rectangle MarketFunnel (TAM > SAM > SOM) replaces three stacked boxes
- Customer Savings: 4 hero KPIs + ComparisonBars (today vs. with BP) per cost item
- The Ask: DonutChart for use-of-funds
- Cap Table: DonutChart for equity distribution
- Finanzplan p2: 2×2 chart grid — Revenue (bars), EBIT (bars, tone by sign), Cash balance (line+area), Headcount (bars)
- Architecture: ArchitectureDiagram primitive (3 tiers, vertical arrows between tiers)
- AI Pipeline: PipelineFlow primitive (4 stages, horizontal arrows)
- Team: founder photos (32×32mm) added; falls back to initials if photo_url missing

New primitives:
- PrintCharts.tsx — BarChart, LineChart, ComparisonBars, DonutChart, ProgressBar, MarketFunnel
- PrintDiagrams.tsx — FlowNode, VArrow, HArrow, StepStrip, ArchitectureDiagram, LoopDiagram, PipelineFlow

All files under 500 LOC cap.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-20 09:31:28 +02:00
Sharang Parnerkar 0d5ebcd27a feat(pitch-print): redesign PDF investor brief from scratch
Build pitch-deck / build-push-deploy (push) Successful in 2m19s
CI / go-lint (push) Has been skipped
CI / python-lint (push) Has been skipped
CI / nodejs-lint (push) Has been skipped
CI / test-go-consent (push) Successful in 46s
CI / test-python-voice (push) Successful in 35s
CI / test-bqas (push) Successful in 34s
Throws away the screen-deck-derived print system. Builds a new institutional-research aesthetic:
- 12-col grid on A4 landscape, hairline rules, no colored bars, no icons
- 3-color discipline: indigo (structural), emerald (positive), red (problem)
- Plus Jakarta Sans 800 for hero numerals + titles; tabular numerals everywhere
- 1-to-1 content parity with the interactive deck: full USP (8 cards), full competition matrix (45 features, 12 AppSec features, 8+6 competitor profiles), Finanzplan P&L grid + KPI dashboard, full glossary
- 2-page slides where content demands (Exec Summary, USP, Competition, Finanzplan)
- 28 base pages; +1 for Financial detail; +1 for Cap Table (suppressed on Wandeldarlehen)
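
The page arithmetic in the last bullet can be written down as a small sketch. It assumes that the Cap Table page only appears in the financial variant; the commit does not say so explicitly, and the function and parameter names are hypothetical.

```python
def total_pages(financial: bool, instrument: str, base_pages: int = 28) -> int:
    """Expected page count for the printed deck (assumed rules)."""
    pages = base_pages
    if financial:
        pages += 1  # Financial detail page
        # Cap Table is suppressed for a Wandeldarlehen (convertible loan).
        if "wandeldarlehen" not in instrument.lower():
            pages += 1
    return pages
```

Keeping base_pages as a parameter makes it easy to track later changes to the base count without touching the conditional logic.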

Files:
- New: PrintIntroSlides, PrintProductSlides, PrintMarketSlides, PrintCompetitionSlides
- Rewritten: PrintLayout (new primitives Page/KpiRow/TwoCol/ThreeCol/DataTable/MatrixGlyph/Callout), PrintAnnexSlides, PrintFinancialSlides, PrintDeck
- Removed: PrintCoreSlides.tsx, PrintExtraSlides.tsx (obsolete)
- print.css now sets Plus Jakarta Sans as the print font family
- All files under 500 LOC cap

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-19 18:43:55 +02:00
Benjamin Admin 7d721a6787 feat(control-pipeline): BSI QUAIDAL Clean-Room ingestion (AI Act Art. 10)
CI / go-lint (push) Has been skipped
CI / python-lint (push) Has been skipped
CI / nodejs-lint (push) Has been skipped
CI / test-go-consent (push) Successful in 40s
CI / test-python-voice (push) Successful in 36s
CI / test-bqas (push) Successful in 33s
Clean-Room derivation of 195 controls from BSI QUAIDAL (10 criteria + 15
building blocks + 30 measures + 140 metrics) for EU AI Act Art. 10
training-data quality compliance.

- ingest_bsi_quaidal.py parses YAML frontmatter into a structural index
  (no protected prose stored on disk).
- derive_quaidal_mcs.py rewrites each entry via local LLM (qwen3.5:35b-a3b)
  with a hard 4-gram plagiarism gate < 20%; achieved mean overlap 0.5%.
- Migration 011 adds compliance.derived_controls table with full source
  provenance (framework, section, url, commit SHA, license note).
- apply_quaidal_to_db.py UPSERTs YAML into DB.
- Source repo (legal-sources/bsi-quaidal/) gitignored.
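
A 4-gram overlap gate of the kind mentioned above could be sketched as follows. The exact metric used by derive_quaidal_mcs.py is not stated in the message; this version assumes word-level 4-grams and measures the share of the candidate's 4-grams that also occur in the source. All names are illustrative.

```python
import re


def ngrams(text: str, n: int = 4) -> set:
    """Return the set of word-level n-grams of a text (lowercased)."""
    tokens = re.findall(r"\w+", text.lower())
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def overlap_ratio(candidate: str, source: str, n: int = 4) -> float:
    """Fraction of the candidate's n-grams that also appear in the source."""
    candidate_grams = ngrams(candidate, n)
    if not candidate_grams:
        return 0.0  # Too short to contain any n-gram
    shared = candidate_grams & ngrams(source, n)
    return len(shared) / len(candidate_grams)


def passes_gate(candidate: str, source: str, max_overlap: float = 0.20) -> bool:
    """Hard gate: reject rewrites whose overlap reaches the threshold."""
    return overlap_ratio(candidate, source) < max_overlap
```

A rewrite that copies the source verbatim scores 1.0 and is rejected, while a genuine paraphrase with no shared four-word sequence scores 0.0.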

Same pattern as IACE module DIN-reference handling: name the norm and
section, never quote.

Backed by BSI license clarification 2026-05: § 5 UrhG applies,
share:true in the frontmatter; Clean-Room derivation is the safe path.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-19 13:02:49 +02:00
Benjamin Admin 9a1ad87acd feat(marketing): savings-scan form -> compliance backend (real submit + polling)
CI / go-lint (push) Has been skipped
CI / python-lint (push) Has been skipped
CI / nodejs-lint (push) Has been skipped
CI / test-go-consent (push) Successful in 39s
CI / test-python-voice (push) Successful in 31s
CI / test-bqas (push) Successful in 32s
- POST /api/scan/start: server-side proxy to api-dev.breakpilot.ai/saving-scan/start
  (no CORS bypass, configurable via the COMPLIANCE_BACKEND_URL env variable)
- GET /api/scan/status/<checkId>: server-side proxy for status polling
- savings-scan/page.tsx: real submission + 5s polling + progress bar + consent
  checkbox + error branch (skipped_tdm, failed)
- Privacy notice added to the disclaimer (respecting TDM reservations, § 44b UrhG)

Backend-Endpoint in breakpilot-compliance@6c223c7.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-18 23:52:07 +02:00
Benjamin Admin 911697bab4 feat(marketing): Saving-Section + Landingpages + Pipeline Lessons-Learned [split-required]
CI / go-lint (push) Has been skipped
CI / python-lint (push) Has been skipped
CI / nodejs-lint (push) Has been skipped
CI / test-go-consent (push) Successful in 35s
CI / test-python-voice (push) Successful in 33s
CI / test-bqas (push) Successful in 35s
Marketing website
- New SavingsSection on the homepage: "Compliance uncovers six-figure
  savings". Pitch positioning of the cookie-audit cost-optimization story
  for sales to DAX corporations (BMW-case style: 90 vendors -> 25 after
  consolidation, EUR 500k-3M / year).
- /savings-scan: free 5-minute saving-scan form (URL + email).
  Form submit is a placeholder, to be wired to the compliance backend.
- /savings-methodik: 4-step explanation of the cookie-tier inference +
  honest caveats (list prices != contract prices, media spend not
  included) + data sources.
- Content-de + Content-en in content.ts both extended with a savings block
  and section numbering adjusted (03=Savings, 04=Deterministic).
- LOC split: savings content (DE+EN, ~100 LOC) moved out into
  content.savings.ts so content.ts stays under the 500-LOC hard cap.

Control pipeline
- LESSONS-LEARNED-mc-check-types.md for the parallel CRA-MC generation.
  Explains the TEXT/PROCESS/REVIEW classification that was retrofitted in
  the compliance repo. Prevents CRA-MCs from getting the same defect.
  Mapping heuristic for verification_method -> check_type, plus
  backfill workflow for ~62 ambiguous entries.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-18 18:38:30 +02:00
Benjamin Admin 9783657da3 feat(control-pipeline): incremental dedup + ENISA CRA ingestion
CI / go-lint (push) Has been skipped
CI / python-lint (push) Has been skipped
CI / nodejs-lint (push) Has been skipped
CI / test-go-consent (push) Successful in 43s
CI / test-python-voice (push) Successful in 33s
CI / test-bqas (push) Successful in 37s
BatchDedup since parameter (services/batch_dedup_runner.py + api):
- New 'since: datetime' param scopes the Phase 1 + Phase 2 SQL to created_at >= since.
- The Phase 2 checkpoint is deleted on a scoped run (prevents skipping new atomics
  whose control_id sorts alphabetically below the stale last_id).
- 6-13x faster for late-added documents (19k instead of 172k atomics).
- Docs: control-pipeline/docs/incremental-dedup.md.
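
How the since filter and the checkpoint interact can be sketched as below. This is only an illustration: the table name atomic_controls, the function name, and the query shape are hypothetical, and only created_at, control_id, since and last_id come from the message.

```python
from datetime import datetime
from typing import Optional


def build_phase_query(since: Optional[datetime] = None,
                      last_id: Optional[str] = None):
    """Build a (sql, params) pair for one dedup phase (assumed shape)."""
    clauses = []
    params = []
    if since is not None:
        clauses.append("created_at >= %s")
        params.append(since)
        # Scoped run: drop the stale checkpoint, otherwise new rows whose
        # control_id sorts below last_id would be skipped.
        last_id = None
    if last_id is not None:
        clauses.append("control_id > %s")
        params.append(last_id)
    where = f" WHERE {' AND '.join(clauses)}" if clauses else ""
    # "atomic_controls" is a placeholder table name.
    sql = f"SELECT control_id FROM atomic_controls{where} ORDER BY control_id"
    return sql, params
```

Dropping the checkpoint instead of combining both filters is what keeps a scoped run complete: the since window is already small, so resuming from last_id would save little and could lose rows.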

New scripts:
- gpre1_object_groups_incremental.py: appends new objects to object_groups via
  bge-m3 nearest-neighbor (threshold default 0.85; 0.78 is advisable for broader
  synonym matching). Pure INSERT/UPDATE, no DELETE.
- gpre2_master_controls_incremental.py: non-destructive master-controls update.
  Existing MCs stay untouched (UUIDs + master_control_id remain); only new members
  are appended + new MCs for object groups that now reach min-phases.
- ingest_enisa_cra.py: ingestion of the 8 CRA-relevant ENISA documents
  (Standards Mapping, EUCC-Implementation, NIS2 TIG, SRP FAQ, EUCC Eval Methodology,
  CVD Policies, Threat Landscape 2025). chunk_strategy=legal,
  requirement_strength=guidance|consultation_draft|evidentiary.
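
The nearest-neighbor append in gpre1 can be pictured with the sketch below. It assumes cosine similarity against one representative vector per group; the message does not say how bge-m3 embeddings are compared or stored, and all names here are illustrative.

```python
import math


def cosine(a, b) -> float:
    """Cosine similarity of two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def assign_group(embedding, group_vectors: dict, threshold: float = 0.85):
    """Return the id of the most similar group, or None if below threshold."""
    best_id, best_sim = None, -1.0
    for group_id, vector in group_vectors.items():
        sim = cosine(embedding, vector)
        if sim > best_sim:
            best_id, best_sim = group_id, sim
    # Below the threshold the object joins no existing group; nothing is deleted.
    return best_id if best_sim >= threshold else None
```

With toy vectors, an object at similarity 0.8 to its nearest group is left unassigned at the default 0.85 but attached at 0.78, which is the broader synonym matching the message refers to.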

Source data: legal-sources/enisa/enisa_cra_single_reporting_platform_faq.html
(PDFs are filtered by .gitignore).

Result of this pipeline iteration:
- 1,296 new CRA controls + 19,652 atomic children
- +362 new master controls, 10,017 existing ones extended
- Total: 13,950 MCs, 620 CRA-MCs (previously 566), 1,304 CRA atomics (previously 841)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-18 18:21:46 +02:00
Benjamin Admin 47d7beeb52 Merge branch 'main' of ssh://gitea.meghsakha.com:22222/Benjamin_Boenisch/breakpilot-core
# Conflicts:
#	.gitignore
2026-05-18 18:20:01 +02:00
Benjamin Admin 63b195c0aa chore: ignore controls_backup_*.dump files
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-18 18:13:50 +02:00
Sharang Parnerkar 77993d0ea0 feat(pitch-deck): Finanzplan export to Excel with live formulas and charts
Build pitch-deck / build-push-deploy (push) Failing after 24s
CI / go-lint (push) Has been skipped
CI / python-lint (push) Has been skipped
CI / nodejs-lint (push) Has been skipped
CI / test-go-consent (push) Successful in 5m28s
CI / test-python-voice (push) Successful in 4m0s
CI / test-bqas (push) Successful in 32s
Generates one .xlsx per scenario (Wandeldarlehen 200k/Bear/Bull, 1M Base/Bear/Bull)
with 10 tabs (Dashboard, Kunden, Umsatzerlöse, Personalkosten,
Investitionen, Materialaufwand, Betriebliche Aufwendungen, Liquidität, GuV,
Formelübersicht). Editable inputs remain raw values; derived cells
become real Excel formulas across tabs, so that editing
inputs recalculates Personal/Opex/Liquidität/GuV.

The Dashboard tab summarizes annual KPIs and contains five charts
(revenue/material/personnel/EBIT YoY, net income YoY, liquidity,
headcount, monthly personnel costs).

Run: PG_CONN=... pitch-deck/scripts/export-finanzplan.sh

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-16 00:08:27 +02:00
Sharang Parnerkar 9382d2a7a4 chore: bump next 15.1.0 → 15.5.16 across all apps (CVE-2026-44578)
Build pitch-deck / build-push-deploy (push) Failing after 23s
CI / go-lint (push) Has been skipped
CI / python-lint (push) Has been skipped
CI / nodejs-lint (push) Has been skipped
CI / test-go-consent (push) Successful in 1m29s
CI / test-python-voice (push) Successful in 1m35s
CI / test-bqas (push) Successful in 1m26s
Patches unauthenticated SSRF in WebSocket upgrade handler.
Applies to admin-core, pitch-deck, levis-holzbau, marketing-website.
GHSA-c4j6-fc7j-m34r.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-15 18:19:51 +02:00
Benjamin Admin b727f14011 Merge branch 'main' of ssh://gitea.meghsakha.com:22222/Benjamin_Boenisch/breakpilot-core
CI / go-lint (push) Has been skipped
CI / python-lint (push) Has been skipped
CI / nodejs-lint (push) Has been skipped
CI / test-go-consent (push) Successful in 5m34s
CI / test-python-voice (push) Successful in 5m22s
CI / test-bqas (push) Successful in 28s
2026-05-14 18:49:01 +02:00
Sharang Parnerkar 084beed348 feat(pitch-print): port remaining 15 slides for 1-to-1 PDF parity with deck
Build pitch-deck / build-push-deploy (push) Successful in 1m58s
CI / go-lint (push) Has been skipped
CI / python-lint (push) Has been skipped
CI / nodejs-lint (push) Has been skipped
CI / test-go-consent (push) Successful in 34s
CI / test-python-voice (push) Successful in 31s
CI / test-bqas (push) Successful in 28s
Adds print versions for executive-summary, usp, regulatory-landscape,
how-it-works, business-model, competition, customer-savings, annex-strategy,
annex-finanzplan, annex-regulatory, annex-architecture, annex-engineering,
annex-aipipeline, risks, annex-glossary across two new files.

PrintDeck.tsx now renders slides in SLIDE_ORDER (minus 3 interactive-only
slides: intro-presenter, ai-qa, annex-sdk-demo). Standard PDF: 25 pages.
Financial PDF: 27 pages (or 26 for Wandeldarlehen, no cap-table).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-13 09:06:57 +02:00
Sharang Parnerkar 5510689710 fix(print): override globals.css body overflow:hidden and dark background
Build pitch-deck / build-push-deploy (push) Successful in 1m57s
CI / go-lint (push) Has been skipped
CI / python-lint (push) Has been skipped
CI / nodejs-lint (push) Has been skipped
CI / test-go-consent (push) Successful in 34s
CI / test-python-voice (push) Successful in 33s
CI / test-bqas (push) Successful in 30s
globals.css sets html,body { height:100%; overflow:hidden; background:#0a0a1a }
with no media query. In print mode this clips all slides to one viewport
height (explaining the 2-page limit) and renders a black background.
Override with height:auto, overflow:visible, background:white in @media print.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-12 19:04:58 +02:00
Sharang Parnerkar 49e594bf38 fix(print): set height:210mm on block wrapper, not flex container
Build pitch-deck / build-push-deploy (push) Successful in 1m39s
CI / go-lint (push) Has been skipped
CI / python-lint (push) Has been skipped
CI / nodejs-lint (push) Has been skipped
CI / test-go-consent (push) Successful in 32s
CI / test-python-voice (push) Successful in 30s
CI / test-bqas (push) Successful in 29s
Firefox doesn't honor height on flex containers in print mode — the
container collapses to content height, causing all slides to fit on 2
pages. Moved the authoritative height to the display:block wrapper
(.print-page-break) and changed .print-page to height:100% so it
fills its reliably-sized block parent.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-12 18:59:47 +02:00
Sharang Parnerkar 583e54fabc fix(print): use CSS named pages + break-before for reliable Firefox pagination
Build pitch-deck / build-push-deploy (push) Successful in 1m30s
CI / go-lint (push) Has been skipped
CI / python-lint (push) Has been skipped
CI / nodejs-lint (push) Has been skipped
CI / test-go-consent (push) Successful in 32s
CI / test-python-voice (push) Successful in 33s
CI / test-bqas (push) Successful in 30s
page: slide-page on each block wrapper forces Firefox to allocate a new
physical page per slide — the spec-correct approach. break-before: page
is belt-and-suspenders. Switched from break-after to break-before via
adjacent sibling selector to avoid a blank trailing page.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-12 18:55:35 +02:00
Sharang Parnerkar 7f4b7da098 fix(print): add Firefox print-color-adjust prefix for background colors
Build pitch-deck / build-push-deploy (push) Successful in 1m34s
CI / go-lint (push) Has been skipped
CI / python-lint (push) Has been skipped
CI / nodejs-lint (push) Has been skipped
CI / test-go-consent (push) Successful in 32s
CI / test-python-voice (push) Successful in 32s
CI / test-bqas (push) Successful in 40s
-moz-print-color-adjust: exact ensures Firefox doesn't strip background
colors from headers, badges, and accent elements when printing.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-12 18:42:05 +02:00
Sharang Parnerkar f3e54180f0 fix(print): wrap flex pages in block container to fix Chrome page breaks
Build pitch-deck / build-push-deploy (push) Successful in 1m38s
CI / go-lint (push) Has been skipped
CI / python-lint (push) Has been skipped
CI / nodejs-lint (push) Has been skipped
CI / test-go-consent (push) Successful in 31s
CI / test-python-voice (push) Successful in 31s
CI / test-bqas (push) Successful in 32s
Chrome's print engine silently ignores break-after/page-break-after on
flex containers. Wrapping each .print-page (flex) in a plain block
.print-page-break element gives Chrome a reliable page break anchor.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-12 18:16:13 +02:00
Benjamin Admin ae937a35d7 feat(cmp): Phase 3 — backend consent withdrawal + consent_id tracking
- ConsentBanner: save consent_id to localStorage after successful POST
- Footer: DELETE /api/consent/{id} on consent re-open (Art. 17 DSGVO)
- New proxy route: DELETE /api/consent/[id] → backend withdrawal endpoint

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-12 17:55:29 +02:00
Sharang Parnerkar edac3aca6c Merge branch 'main' of ssh://coolify.meghsakha.com:22222/Benjamin_Boenisch/breakpilot-core
Build pitch-deck / build-push-deploy (push) Successful in 1m48s
CI / go-lint (push) Has been skipped
CI / python-lint (push) Has been skipped
CI / nodejs-lint (push) Has been skipped
CI / test-go-consent (push) Successful in 30s
CI / test-python-voice (push) Successful in 29s
CI / test-bqas (push) Successful in 29s
2026-05-12 17:45:50 +02:00
Sharang Parnerkar fc4d5d8c56 fix(pitch-deck): use imported CSS for print styles instead of inline style tag
Inline <style> tags in React body are unreliable for @media print in
Chrome. Move all print CSS to app/pitch-print/print.css imported via
a layout.tsx — Next.js injects this as a proper <link> in <head>,
which is guaranteed to be applied before print rendering.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-12 17:45:46 +02:00
Benjamin Admin f5d4e3bd95 feat(cmp): active script blocking + DSE Interessenabwaegung
CI / go-lint (push) Has been skipped
CI / python-lint (push) Has been skipped
CI / nodejs-lint (push) Has been skipped
CI / test-go-consent (push) Successful in 30s
CI / test-python-voice (push) Successful in 31s
CI / test-bqas (push) Successful in 31s
ScriptManager: two blocking mechanisms — injection of CONSENT_SCRIPTS
after consent + activation of type="text/plain" data-consent scripts.
Standard CMP blocking pattern ready for third-party analytics/marketing.

DSE: add Interessenabwaegung (balancing test) for Art. 6(1)(f) DSGVO
processing: Hosting and Server-Logfiles sections now document why
legitimate interest outweighs data subject rights.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-12 16:55:24 +02:00
Benjamin Admin 9e3604fe31 Merge branch 'main' of ssh://gitea.meghsakha.com:22222/Benjamin_Boenisch/breakpilot-core
CI / go-lint (push) Has been skipped
CI / python-lint (push) Has been skipped
CI / nodejs-lint (push) Has been skipped
CI / test-go-consent (push) Successful in 39s
CI / test-python-voice (push) Successful in 28s
CI / test-bqas (push) Successful in 31s
2026-05-12 16:36:59 +02:00
Benjamin Admin 0c09b960b9 feat(cmp): Phase 2 complete — self-hosted fonts, ScriptManager, GeoIP, vendor UI
- Session ID via sessionStorage UUID
- Self-host Google Fonts (Inter, Plus Jakarta Sans, JetBrains Mono) — eliminates
  third-party transfer to Google, no more DSGVO violation
- ScriptManager component: consent-change listener for future analytics/marketing scripts
- GeoIP via browser timezone (Intl.DateTimeFormat) + IP injection in proxy
- Vendor-level consent UI: loads vendor config from backend, shows per-vendor
  toggles under each category, sends vendor_consents dict
- DSE updated: Google Fonts section now says "lokal gehostet"
- Config proxy route: GET /api/consent/config

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-12 14:42:55 +02:00
Sharang Parnerkar cf18b1074a fix(pitch-deck): PDF print layout — fill page height, fix page breaks
Build pitch-deck / build-push-deploy (push) Successful in 2m2s
CI / go-lint (push) Has been skipped
CI / python-lint (push) Has been skipped
CI / nodejs-lint (push) Has been skipped
CI / test-go-consent (push) Successful in 44s
CI / test-python-voice (push) Successful in 32s
CI / test-bqas (push) Successful in 33s
- Switch from inline pageBreakAfter to CSS class `.print-page` with
  explicit `page-break-after: always !important` so Chrome print
  preview creates a new page per slide (was collapsing to 2 pages)
- Remove margin/box-shadow in @media print so A4 boundaries align
- Content areas now use flex:1 so cards/pillars stretch to fill the
  full page height (no more blank void below content)
- Remove conditional rendering on data-dependent slides — always
  render all 9 core pages
- Larger font sizes throughout (11px body, 13px card titles)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-12 13:40:31 +02:00
Sharang Parnerkar 2e8cbfff3f feat(pitch-deck): add per-version PDF export (standard + financial)
Build pitch-deck / build-push-deploy (push) Successful in 1m49s
CI / go-lint (push) Has been skipped
CI / python-lint (push) Has been skipped
CI / nodejs-lint (push) Has been skipped
CI / test-go-consent (push) Successful in 40s
CI / test-python-voice (push) Successful in 29s
CI / test-bqas (push) Successful in 29s
Adds /pitch-print/[versionId] — a server-rendered, print-CSS-optimized
page that generates investor-ready PDFs via the browser's native print
dialog (Save as PDF). Two variants per version:

- Standard PDF (9 pages): Cover, Problem, Solution, Products, Market,
  Team, Milestones, The Ask
- Financial PDF (+4 pages): adds Financials P&L table (aggregated from
  pitch_fm_results), Assumptions, Cap Table, Legal Disclaimer

White background with indigo accents, A4 landscape via @page CSS, all
color-rendered in print via print-color-adjust: exact. Auto-triggers
window.print() 900ms after load. Admin toolbar visible on screen only.

Export buttons added to /pitch-admin/versions/[id] detail page.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-12 13:00:19 +02:00
Benjamin Admin f6489e7748 feat(cmp): Phase 2 — send scripts_blocked, scripts_released, cookies_set
ConsentBanner detects loaded scripts (analytics/marketing) and cookies
after consent, sends them to the CMP backend for transparency tracking.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-11 22:52:41 +02:00
Benjamin Admin 519cc274bb docs: session handover - MC Quality + Gap Engine + RAG Ingestion (5 days)
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-11 21:47:22 +02:00
Benjamin Admin 79810f4eb8 feat(cmp): GDPR-compliant DSE + consent re-open button
CI / go-lint (push) Has been skipped
CI / python-lint (push) Has been skipped
CI / nodejs-lint (push) Has been skipped
CI / test-go-consent (push) Successful in 41s
CI / test-python-voice (push) Successful in 31s
CI / test-bqas (push) Successful in 29s
- Rewrite Datenschutzerklaerung: cookie section with bp_consent table,
  legal basis (Art. 6(1)(a) + §25 TDDDG), DPO, Hetzner hosting, Google
  Fonts DPF, retention periods, all data subject rights (Art. 15-21),
  supervisory authority (LfD Niedersachsen)
- Add "Cookie-Einstellungen" re-open button in footer (Art. 7(3) DSGVO)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-11 08:07:35 +02:00
Benjamin Admin 5f193c8a72 feat(cmp): send extended consent data from ConsentBanner
CI / go-lint (push) Has been skipped
CI / python-lint (push) Has been skipped
CI / nodejs-lint (push) Has been skipped
CI / test-go-consent (push) Successful in 35s
CI / test-python-voice (push) Successful in 34s
CI / test-bqas (push) Successful in 33s
Send consent_method, page_url, referrer, device_type, browser, os,
screen_resolution and consent_scope with each consent record for
vendor-agnostic compliance tracking.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-10 23:12:22 +02:00
Benjamin Admin d13f4511cb feat(marketing-website): add BreakPilot marketing website with CMP integration
Multi-page marketing website positioned as "Deterministic Regulatory Engineering Platform":
- 7 pages: Home, Plattform, CE-Prozess, Product Compliance, Architektur, Team, Preise
- Platform Bridge animation (adapted from pitch-deck USP slide)
- Cookie-Banner with consent-service integration (breakpilot-marketing site)
- DE/EN language toggle + Dark/Light theme
- Docker service on port 3014

[guardrail-change] PlatformBridgeSection.tsx added to loc-exceptions (816 LOC, SVG animation)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-10 22:41:00 +02:00
Benjamin Admin 937eca6b77 test(pipeline): Phase 6 — Golden Dataset + MC Quality Tests
CI / go-lint (push) Has been skipped
CI / python-lint (push) Has been skipped
CI / nodejs-lint (push) Has been skipped
CI / test-go-consent (push) Successful in 35s
CI / test-python-voice (push) Successful in 35s
CI / test-bqas (push) Successful in 34s
- 20 manually verified golden controls with expected MC topics
- Structural quality tests: min 10K MCs, max 300/MC, no orphans
- Doc-check controls tests: 8 doc types covered, no empty questions
- Quality thresholds: 90% accuracy, enforced by regression tests

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-10 21:03:49 +02:00
Benjamin Admin 0c1561d6cc feat(pipeline): derive 1,874 doc_check_controls from Master Controls
CI / go-lint (push) Has been skipped
CI / python-lint (push) Has been skipped
CI / nodejs-lint (push) Has been skipped
CI / test-go-consent (push) Successful in 45s
CI / test-python-voice (push) Successful in 44s
CI / test-bqas (push) Successful in 40s
8 document types: DSE (571), Cookie (381), Löschkonzept (309),
Widerrufsbelehrung (153), DSFA (147), AVV (125), AGB (113), Impressum (75).

Each control has binary check_question + pass_criteria + fail_criteria.
Derived via Claude Haiku from existing MCs filtered by regulation source.

Table: compliance.doc_check_controls (local + production synced)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-10 20:56:23 +02:00
Benjamin Admin 0bb9726ddd Merge branch 'main' of ssh://gitea.meghsakha.com:22222/Benjamin_Boenisch/breakpilot-core
CI / go-lint (push) Has been skipped
CI / python-lint (push) Has been skipped
CI / nodejs-lint (push) Has been skipped
CI / test-go-consent (push) Successful in 48s
CI / test-python-voice (push) Successful in 43s
CI / test-bqas (push) Successful in 36s
2026-05-10 15:09:51 +02:00
Benjamin Admin 8510af46eb feat(pipeline): MC Quality Overhaul — 74.5% → 92.8% accuracy, 5.3K → 13.6K MCs
Phase 0: Quality Audit script (Claude Sonnet, 1750 samples)
Phase 1: Object ontology expanded 31 → 74 tokens with descriptions + boundaries
Phase 2: 174K controls re-classified via Haiku (10 batches, $50)
  - Generic tokens removed (documentation, procedure, process)
  - L2 sub-topics added (108K + 64K controls)
  - Bad subtopics fixed (stakeholder_*, escalation fragments)
Phase 3: Re-clustering K=18704 (37K objects → 16.7K groups)
Phase 4: Direct MC generation from canonical tokens (gpre2_direct_mc.py)
Phase 5: Regulation-source split (gpre3, dry-run tested)

New features:
- Tenant-isolated document upload API (rag-service)
- BAuA crawler (Playwright, 131 PDFs downloaded)
- OSHA Technical Manual crawler (23 chapters)
- CE obligation extractor (6141 obligations from Qdrant)

RAG ingestion:
- 126 BAuA PDFs (TRBS/TRGS/ASR): 27,664 chunks
- OSHA Technical Manual: 7,241 chunks
- OSHA 1910 Subpart O (full): 745 chunks
- EuGH C-588/21 P: 216 chunks
- EU 2018/1725: 842 chunks

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-10 15:08:15 +02:00
Benjamin Admin 81db904b3e feat(legal-sources): add OSHA machinery safety standards + international norms mapping
OSHA 29 CFR 1910 Subpart O (1910.211-1910.219) — complete machine
guarding requirements. US federal law, public domain.

International norms mapping table: China GB/T, Korea KS, India BIS
equivalents to ISO/EN standards. Unfortunately, all three countries protect
ISO copyright even for identical national adoptions (IDT).

Only OSHA provides truly free machinery safety content.
EU Excel harmonised standards list included for reference.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-09 10:50:43 +02:00
Sharang Parnerkar 572052285c fix: require button click to consume magic link token
Build pitch-deck / build-push-deploy (push) Successful in 1m54s
CI / go-lint (push) Has been skipped
CI / python-lint (push) Has been skipped
CI / nodejs-lint (push) Has been skipped
CI / test-go-consent (push) Successful in 47s
CI / test-python-voice (push) Successful in 37s
CI / test-bqas (push) Successful in 37s
Email security gateways follow GET redirects automatically and were
consuming the token before the investor clicked through. The verify page
now shows an 'Access Pitch Deck' button; the token is only consumed on
explicit click, which scanners cannot trigger.
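
The split between viewing and consuming can be sketched as follows. This is a minimal illustration, assuming an in-memory token set; `TokenStore`, `peek` and `consume` are hypothetical names and not the pitch-deck code.

```python
from dataclasses import dataclass, field


@dataclass
class TokenStore:
    """Hypothetical in-memory stand-in for the magic link token table."""
    unused: set[str] = field(default_factory=set)

    def peek(self, token: str) -> bool:
        # Safe for a GET request: reports validity without consuming the token.
        return token in self.unused

    def consume(self, token: str) -> bool:
        # Only called from the explicit button click (for example a POST).
        if token in self.unused:
            self.unused.remove(token)
            return True
        return False


store = TokenStore(unused={"abc123"})
scanner_get = store.peek("abc123")        # email gateway follows the link
investor_click = store.consume("abc123")  # investor presses the button
replay = store.consume("abc123")          # a second use is rejected
```

Because the scanner only ever reaches `peek`, the token is still valid when the investor clicks.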

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-07 23:30:27 +02:00
Sharang Parnerkar 1ef22e6f95 fix: use PITCH_BASE_URL for short link redirects instead of request.url
Build pitch-deck / build-push-deploy (push) Successful in 1m39s
CI / go-lint (push) Has been skipped
CI / python-lint (push) Has been skipped
CI / nodejs-lint (push) Has been skipped
CI / test-go-consent (push) Successful in 32s
CI / test-python-voice (push) Successful in 32s
CI / test-bqas (push) Successful in 29s
Behind Orca's reverse proxy, request.url resolves to http://127.0.0.1:3000
which causes redirects to go to the internal address instead of the public
domain. Use PITCH_BASE_URL (already set in service.toml) as the base.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-07 10:55:53 +02:00
Sharang Parnerkar d291af0e33 fix: whitelist /p/* in middleware so short links work without a session
Build pitch-deck / build-push-deploy (push) Successful in 1m38s
CI / go-lint (push) Has been skipped
CI / python-lint (push) Has been skipped
CI / nodejs-lint (push) Has been skipped
CI / test-go-consent (push) Successful in 33s
CI / test-python-voice (push) Successful in 33s
CI / test-bqas (push) Successful in 30s
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-07 10:42:09 +02:00
Sharang Parnerkar 76aad8b1d1 feat(pitch-deck): branded short links for magic URLs (pitch.breakpilot.ai/p/ab3xk2)
Build pitch-deck / build-push-deploy (push) Successful in 1m31s
CI / go-lint (push) Has been skipped
CI / python-lint (push) Has been skipped
CI / nodejs-lint (push) Has been skipped
CI / test-go-consent (push) Successful in 32s
CI / test-python-voice (push) Successful in 34s
CI / test-bqas (push) Successful in 30s
- New pitch_short_links table stores 6-char alphanumeric codes mapped to magic link tokens
- GET /p/[code] redirects to /auth/verify?token=... (302, validates expiry)
- All magic link generation points (invite, generate-link, resend) now create a short code
- Emails (invite + resend) use the short URL — less token-like, cleaner for spam filters
- Copy-link UI shows short URL prominently with full URL as fallback
- Migration 008 added to /api/admin/migrate
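
A rough sketch of the short-code generation described above. The alphabet (lowercase letters and digits), the collision loop and the helper names are assumptions for illustration; the real table access is replaced by a plain set.

```python
import secrets
import string

ALPHABET = string.ascii_lowercase + string.digits  # assumed character set


def new_short_code(existing: set[str], length: int = 6) -> str:
    """Draw random codes until one is not already taken."""
    while True:
        code = "".join(secrets.choice(ALPHABET) for _ in range(length))
        if code not in existing:
            return code


def short_url(base_url: str, code: str) -> str:
    # Build the public /p/<code> URL from a configured base URL.
    return f"{base_url.rstrip('/')}/p/{code}"


taken = {"ab3xk2"}
code = new_short_code(taken)
url = short_url("https://pitch.breakpilot.ai/", code)
```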

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-07 10:34:24 +02:00
Sharang Parnerkar 54f0919b73 feat(pitch-deck): translate financial plan row labels when lang=en
Build pitch-deck / build-push-deploy (push) Successful in 2m0s
CI / go-lint (push) Has been skipped
CI / python-lint (push) Has been skipped
CI / nodejs-lint (push) Has been skipped
CI / test-go-consent (push) Successful in 47s
CI / test-python-voice (push) Successful in 35s
CI / test-bqas (push) Successful in 32s
- Add ROW_LABEL_MAP (DE→EN) covering GuV, Liquidität, Kunden, Betriebliche Aufwendungen rows
- Add FORMULA_TOOLTIPS_EN with English tooltip text for all formula-driven rows
- Add MONTH_LABELS_EN (Mrz→Mar, Mai→May, Okt→Oct)
- LabelWithTooltip now accepts `de` flag, translates display text and tooltip accordingly
- Month column headers switch between DE/EN month abbreviations
- Falls back to original German label for any row not in the map

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-07 09:47:45 +02:00
Sharang Parnerkar ec7eee8e3d feat(pitch-deck): change preferred_lang for existing investors from detail page
Build pitch-deck / build-push-deploy (push) Successful in 1m27s
CI / go-lint (push) Has been skipped
CI / python-lint (push) Has been skipped
CI / nodejs-lint (push) Has been skipped
CI / test-go-consent (push) Successful in 33s
CI / test-python-voice (push) Successful in 33s
CI / test-bqas (push) Successful in 34s
- GET /api/admin/investors/:id now returns preferred_lang
- PATCH /api/admin/investors/:id accepts preferred_lang (de/en), validates value
- Investor detail page: DE/EN toggle in the Pitch Version card, instant save on click

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-06 23:31:59 +02:00
Sharang Parnerkar b0d273d3ab feat(pitch-deck): add pitch version selection to investor invite form
Build pitch-deck / build-push-deploy (push) Successful in 1m33s
CI / go-lint (push) Has been skipped
CI / python-lint (push) Has been skipped
CI / nodejs-lint (push) Has been skipped
CI / test-go-consent (push) Successful in 33s
CI / test-python-voice (push) Successful in 36s
CI / test-bqas (push) Successful in 32s
- Version dropdown on the invite form shows all committed versions
- Selected version is assigned to the investor at creation time (no separate step needed)
- API validates version is committed before upserting
- Leaving the dropdown empty keeps any existing assignment (COALESCE behavior)
- version_id included in audit log

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-06 23:27:23 +02:00
Sharang Parnerkar 17b9006b88 feat(pitch-deck): English email templates, investor language preference, link-only invite mode
Build pitch-deck / build-push-deploy (push) Successful in 1m55s
CI / go-lint (push) Has been skipped
CI / python-lint (push) Has been skipped
CI / nodejs-lint (push) Has been skipped
CI / test-go-consent (push) Successful in 36s
CI / test-python-voice (push) Successful in 35s
CI / test-bqas (push) Successful in 35s
- Add English email template variants (greeting, message, closing, subject, CTA copy)
- Add `preferred_lang` column to `pitch_investors` — stored per investor, deck opens in that language by default
- Invite form: DE/EN language toggle that switches email defaults and pitch language setting
- Invite form: "Send email" toggle — when off, creates investor + returns magic link without sending email (for cold outreach attachment)
- `app/page.tsx`: initializes pitch language from investor's `preferred_lang` before first render (no flash)
- Migration 007 added to `/api/admin/migrate` route for production rollout

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-06 23:18:40 +02:00
Benjamin Admin e013702a02 Merge branch 'main' of ssh://gitea.meghsakha.com:22222/Benjamin_Boenisch/breakpilot-core
CI / go-lint (push) Has been skipped
CI / python-lint (push) Has been skipped
CI / nodejs-lint (push) Has been skipped
CI / test-go-consent (push) Successful in 47s
CI / test-python-voice (push) Successful in 38s
CI / test-bqas (push) Successful in 37s
2026-05-06 21:06:19 +02:00
Benjamin Admin f022b489e2 docs: comprehensive session handover — Blocks F+G complete, next: MC quality refinement
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-06 21:06:01 +02:00
Benjamin Admin 0092c4fe47 feat(pipeline): G-pre1 refinement script for large object groups
Splits master controls >200 members by re-clustering their object groups
with k=4-20 per group. First round: 38 groups → 325 sub-groups → 253 new MCs.
25 generic MCs remain (monitoring, procedure, etc.) — need regulation-source split.

Session summary: Block F complete, Control Generation (1,599+), Pass 0a/0b,
Production Sync, G-pre1/2/3 Object Clustering + Master Controls + API,
G1-G4 Compliance Execution Layer (Decision Trace, Commit Ledger, Decision Memory,
Pre-Deployment Enforcement).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-06 20:41:49 +02:00
Benjamin Admin d5bcd0bd5b feat(pipeline): G4 Pre-Deployment Enforcement — CI/CD compliance gate
New table: deployment_checks (verdict, blocking/warning controls, risk score)
New API:
  POST /v1/deployment-checks (SDK asks: "can I deploy?")
  GET /v1/deployment-checks/{id} (check result)
  POST /v1/deployment-checks/{id}/override (manual override with justification)
  GET /v1/deployment-checks/stats (approval/block rate)

Check logic: queries G1 decision_traces + G3 open failures per affected control.
Verdict: approved (0 blocking) or blocked (with fix recommendations).
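
The verdict rule can be sketched like this. The field names and the status value "compliant" are assumptions; only the rule "approved when there are zero blocking controls" comes from the commit message.

```python
from dataclasses import dataclass


@dataclass
class ControlStatus:
    control_id: str
    trace_status: str   # assumed value set, e.g. "compliant" / "non_compliant"
    open_failures: int  # assumed count of unresolved failure events


def deployment_verdict(controls: list[ControlStatus]) -> dict:
    # A control blocks the deployment if its latest decision is not
    # compliant or if it still has open failures.
    blocking = [
        c.control_id
        for c in controls
        if c.trace_status != "compliant" or c.open_failures > 0
    ]
    return {
        "verdict": "approved" if not blocking else "blocked",
        "blocking_controls": blocking,
    }
```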
454 tests pass, 0 regressions.

Block G complete: G1-G4 all implemented.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-06 20:24:45 +02:00
Benjamin Admin c398e74d5e feat(pipeline): G3 Full Decision Memory — compliance lifecycle event stream
New table: decision_events (assessment→decision→fix→verification→failure cycle)
New API:
  POST /v1/decision-events (record lifecycle event)
  GET /v1/decision-events (list with filters)
  GET /v1/decision-events/timeline/{control_id} (full chronological timeline)
  GET /v1/decision-events/stats (failure rate, cycle times)

Each event captures input_state, output_state, actor, evidence.
454 tests pass, 0 regressions.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-06 20:16:25 +02:00
Benjamin Admin e82f99b8cb feat(pipeline): G2 Compliance Commit Ledger — code↔control audit trail
New table: compliance_commits (commit hash, affected controls, risk level)
New API:
  POST /v1/compliance-commits (SDK registers commit + impact)
  GET /v1/compliance-commits (list with filters)
  GET /v1/compliance-commits/by-control/{id} (all commits for a control)
  GET /v1/compliance-commits/stats (dashboard)
  GET /v1/compliance-commits/{id} (detail)

GIN index on affected_control_ids for fast @> containment queries.
454 tests pass, 0 regressions.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-06 19:17:45 +02:00
Benjamin Admin 66a70ab31c feat(pipeline): G1 Decision Trace — compliance decision tracking
New table: decision_traces (status, reason, evidence, fix plan per control)
New API:
  POST/GET/PUT /v1/decision-traces (CRUD for decisions)
  GET /v1/decision-traces/stats (compliance dashboard)
  GET /v1/controls/{id}/full-trace (Regulation→Obligation→Control→Decision→Evidence)

454 tests pass, 0 regressions.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-06 18:26:21 +02:00
Benjamin Admin ad24835940 feat(pipeline): G-pre1/2/3 — Object Clustering + Master Controls + API
G-pre1: 144k objects clustered into 7,466 groups via Mini-Batch K-Means
  on bge-m3 embeddings. Two-stage: k=5000 base + sub-cluster groups >50.
G-pre2: 5,114 Master Controls from lifecycle phase chains
  (define→implement→test→monitor), linking 172,504 atomic controls.
G-pre3: REST API for Master Controls
  GET /v1/master-controls (list, search, filter)
  GET /v1/master-controls/stats
  GET /v1/master-controls/{mc_id} (detail with phase-controls)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-06 15:11:38 +02:00
Benjamin Admin e683701a44 fix(gitea): remove /etc/timezone mount (macOS incompatible), use TZ env var
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-05 19:37:43 +02:00
Benjamin Admin 0bad74a3bd docs: session handover — Block F complete, pipeline done, G-pre1 analysis
Session 03-05.05.2026:
- Block F1-F5 complete (DB migration of hardcoded dicts)
- Control Generation: 1,599 controls + 11,522 obligations + 1,147 atomics
- Production sync: 2,625 controls + 11,522 obligations synced
- G-pre1 analysis: 183k objects → 144k after normalization (needs hierarchical clustering)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-05 18:02:10 +02:00
Benjamin Admin 22257a7ed8 feat(pipeline): F5 validation tests — verify DB matches hardcoded dicts
8 tests confirm all REGULATION_LICENSE_MAP, ACTION_TYPES, _NEGATIVE_PATTERNS,
_ACTION_SYNONYMS, and _OBJECT_SYNONYMS entries are correctly migrated to DB.
Dicts kept as fallback for DB-unavailability resilience.

Block F complete: F1-F5 all done.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-05 16:06:59 +02:00
Benjamin Admin a20de0b52b feat(pipeline): F4 LLM synonym enrichment script
Uses Ollama (qwen3.5:35b-a3b, think:false) to generate additional
German synonyms for action types and object tokens. Results stored
with source='llm' in action_synonyms/object_synonyms tables.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-05 15:45:43 +02:00
Benjamin Admin 775d8b52f3 fix(vault): prevent CPU-burning init loop with marker file + idempotent checks
Root cause: init scripts ran repeatedly (on container restart) and tried
vault secrets enable / vault auth enable for already-existing paths.
Vault logged ERRORs and burned 40-84% CPU in the loop.

Fix:
- Marker file /vault/data/.init-complete skips re-initialization
- vault secrets list / vault auth list checks before enable calls
- No more "path already in use" errors on subsequent runs
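
The control flow of the fix, modelled in Python for illustration only (the actual init scripts are not shown in this log and are presumably shell). `run_init_once` and its arguments are hypothetical.

```python
import tempfile
from pathlib import Path


def run_init_once(marker: Path, enabled: set[str], wanted: list[str]) -> list[str]:
    """Return the paths newly enabled; do nothing if the marker exists."""
    if marker.exists():
        return []
    # List-before-enable check: skip paths that are already mounted.
    newly_enabled = [p for p in wanted if p not in enabled]
    enabled.update(newly_enabled)
    marker.touch()
    return newly_enabled


with tempfile.TemporaryDirectory() as tmp:
    marker = Path(tmp) / ".init-complete"
    enabled = {"secret/"}
    first = run_init_once(marker, enabled, ["secret/", "pki/"])
    second = run_init_once(marker, enabled, ["secret/", "pki/"])
```

The first run enables only the missing path and writes the marker; the second run returns immediately.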

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-05 11:46:16 +02:00
Sharang Parnerkar f0a84e79ab fix(preview): return fp_scenarios key so version-specific scenario is resolved in admin preview
Build pitch-deck / build-push-deploy (push) Successful in 1m39s
CI / go-lint (push) Has been skipped
CI / python-lint (push) Has been skipped
CI / nodejs-lint (push) Has been skipped
CI / test-go-consent (push) Successful in 40s
CI / test-python-voice (push) Successful in 32s
CI / test-bqas (push) Successful in 33s
The preview-data API was returning `fm_scenarios` but PitchDeck reads
`data.fp_scenarios`, so fpBaseScenarioId was always null and the
Finanzplan slide fell back to the global default scenario (Base Case 200k)
instead of the version's assigned scenario (e.g. 1 Mio. Euro Base).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-05 11:39:53 +02:00
Benjamin Admin 64f45be63a feat(pipeline): add Pass 0a endpoint to core control-pipeline
Registers /generate/run-pass0a and /generate/pass0a-status/{job_id}
on the core control-pipeline (port 8098). Previously Pass 0a was only
available on the compliance backend which connects to Production DB,
causing a split-brain when controls are generated locally.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-05 07:21:58 +02:00
Sharang Parnerkar 404963db77 feat(showcase): restore intro-presenter and executive-summary slides in showcase mode
Build pitch-deck / build-push-deploy (push) Successful in 1m22s
CI / go-lint (push) Has been skipped
CI / python-lint (push) Has been skipped
CI / nodejs-lint (push) Has been skipped
CI / test-go-consent (push) Successful in 31s
CI / test-python-voice (push) Successful in 33s
CI / test-bqas (push) Successful in 30s
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-04 23:14:18 +02:00
Sharang Parnerkar 0acbf25956 fix(showcase): hide Data Room link for showcase sessions
Build pitch-deck / build-push-deploy (push) Successful in 1m23s
CI / go-lint (push) Has been skipped
CI / python-lint (push) Has been skipped
CI / nodejs-lint (push) Has been skipped
CI / test-go-consent (push) Successful in 29s
CI / test-python-voice (push) Successful in 33s
CI / test-bqas (push) Successful in 30s
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-04 23:12:57 +02:00
Sharang Parnerkar 2bd9b015eb fix(showcase): block financial data from AI Q&A, fix FAB overflow, fix presenter slide mapping
Build pitch-deck / build-push-deploy (push) Successful in 1m47s
CI / go-lint (push) Has been skipped
CI / python-lint (push) Has been skipped
CI / nodejs-lint (push) Has been skipped
CI / test-go-consent (push) Successful in 41s
CI / test-python-voice (push) Successful in 32s
CI / test-bqas (push) Successful in 32s
AI Q&A: fetch is_showcase from DB; showcase sessions receive no financial/funding
context and have an explicit LLM guard refusing to discuss investment details.
FAQ context and financial slide IDs stripped from system prompt.

FAB: flex layout so Fullscreen button is always visible regardless of panel height.

Presenter: pass activeSlideOrder to usePresenterMode so buildSlideAudioPlan maps
slideIdx → slideId from the filtered list, not the full SLIDE_ORDER. Progress
calculation also filters to active scripts only.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-04 23:00:55 +02:00
Sharang Parnerkar be126a7a39 fix(pitch): showcase sidebar shows only filtered slides + AI presenter via FAB
Build pitch-deck / build-push-deploy (push) Successful in 1m22s
CI / go-lint (push) Has been skipped
CI / python-lint (push) Has been skipped
CI / nodejs-lint (push) Has been skipped
CI / test-go-consent (push) Successful in 31s
CI / test-python-voice (push) Successful in 30s
CI / test-bqas (push) Successful in 31s
NavigationFAB and SlideOverview now accept slideNames prop and render only the
active slide list (filtered for showcase mode). Adds AI presenter start button
to the FAB footer so it's accessible even when intro-presenter slide is hidden.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-04 22:50:33 +02:00
Sharang Parnerkar 30a9165497 feat(pitch): showcase mode — per-investor toggle hides financial/investor slides for customer demos
Build pitch-deck / build-push-deploy (push) Successful in 1m35s
CI / go-lint (push) Has been skipped
CI / python-lint (push) Has been skipped
CI / nodejs-lint (push) Has been skipped
CI / test-go-consent (push) Successful in 39s
CI / test-python-voice (push) Successful in 32s
CI / test-bqas (push) Successful in 30s
Adds is_showcase boolean to pitch_investors; when set, filters out financials,
the ask, cap table, assumptions, finanzplan, risks, and intro-presenter slides.
Slide navigation is fully dynamic — progress bar and counts update accordingly.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-04 22:41:15 +02:00
Sharang Parnerkar f2184be02f fix: tab row counts use investor's scenario, not always Base Case
Build pitch-deck / build-push-deploy (push) Successful in 1m34s
CI / go-lint (push) Has been skipped
CI / python-lint (push) Has been skipped
CI / nodejs-lint (push) Has been skipped
CI / test-go-consent (push) Successful in 36s
CI / test-python-voice (push) Successful in 35s
CI / test-bqas (push) Successful in 32s
/api/finanzplan now accepts ?scenarioId and uses it for the per-sheet
row counts (the numbers in brackets on the tab bar). FinanzplanSlide
passes fpBaseScenarioId when fetching the sheet list, so Wandeldarlehen
investors see e.g. Personalkosten (9) instead of (35).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-04 15:21:40 +02:00
Sharang Parnerkar 06014d57b3 fix: derive fp_scenario IDs from version snapshot, eliminate hardcoded UUIDs
Build pitch-deck / build-push-deploy (push) Successful in 1m30s
CI / go-lint (push) Has been skipped
CI / python-lint (push) Has been skipped
CI / nodejs-lint (push) Has been skipped
CI / test-go-consent (push) Successful in 33s
CI / test-python-voice (push) Successful in 32s
CI / test-bqas (push) Successful in 31s
The fm_scenarios array in each pitch version snapshot already stores the
fp_scenario IDs directly (the same pattern the 1 Mio version uses).
Wandeldarlehen snapshots were missing Bear/Bull entries; these were added in the DB.

- /api/data: include fp_scenarios in version response (was omitted)
- PitchDeck: derive fpBaseScenarioId from data.fp_scenarios
- useFpKPIs: accept fpBaseScenarioId instead of isWandeldarlehen boolean
- AssumptionsSlide: find Bear/Base/Bull by name from fpScenarios prop
- FinanzplanSlide: initialize from fpBaseScenarioId, use version scenarios for selector
- FinancialsSlide / ExecutiveSummarySlide: pass fpBaseScenarioId to hook
- types: add FpScenarioRef + fp_scenarios field to PitchData

No UUID hardcoded in any component. Adding a new pitch version only
requires setting the correct fp_scenario IDs in its fm_scenarios snapshot.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-04 15:00:06 +02:00
Sharang Parnerkar 6c022d1a79 fix: allow investors to query fp_ scenarios by scenarioId
Build pitch-deck / build-push-deploy (push) Successful in 1m55s
CI / go-lint (push) Has been skipped
CI / python-lint (push) Has been skipped
CI / nodejs-lint (push) Has been skipped
CI / test-go-consent (push) Successful in 40s
CI / test-python-voice (push) Successful in 37s
CI / test-bqas (push) Successful in 34s
AssumptionsSlide sends ?scenarioId=<uuid> for Bear/Base/Bull cards but
the route was silently dropping it for non-admin requests, making all
three cards return the same default Base Case data. Since fp_ financial
projections are already investor-facing, any valid scenarioId is allowed.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-04 14:27:14 +02:00
Benjamin Admin e869cabc81 docs: session handover — F1-F3 done, control generation running
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-04 07:21:24 +02:00
Benjamin Admin 652e3a65a3 feat(pipeline): F2+F3 action/object ontology — DB-backed normalization
CI / go-lint (push) Has been skipped
CI / python-lint (push) Has been skipped
CI / nodejs-lint (push) Has been skipped
CI / test-go-consent (push) Successful in 36s
CI / test-python-voice (push) Successful in 33s
CI / test-bqas (push) Successful in 31s
Migrates ACTION_TYPES (26+8 types), _NEGATIVE_PATTERNS (22), _ACTION_SYNONYMS
(65), and _OBJECT_SYNONYMS (75) from hardcoded dicts to DB tables.

- SQL migration: 003_action_object_ontology.sql (3 tables)
- Migration scripts: f2_migrate_actions.py (34 types, 145 synonyms), f3_migrate_objects.py (75 objects)
- OntologyRegistry cache: 5min TTL, raises RuntimeError if empty (safe fallback to dicts)
- control_ontology.classify_action/get_phase delegate to DB with dict fallback
- control_dedup.normalize_action/normalize_object delegate to DB with dict fallback
- 25 new tests, 446 total pass, 0 regressions
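
A minimal sketch of the cache-with-fallback pattern listed above. The class name, the loader callback and the sample synonym are hypothetical; only the 5-minute TTL, the RuntimeError on an empty result and the dict fallback are taken from the commit message.

```python
import time
from typing import Callable

FALLBACK_SYNONYMS = {"pruefen": "verify"}  # stand-in for the hardcoded dict


class TtlRegistry:
    def __init__(self, loader: Callable[[], dict], ttl_seconds: float = 300.0):
        self._loader = loader
        self._ttl = ttl_seconds
        self._data: dict = {}
        self._loaded_at = float("-inf")  # forces a load on first access

    def get_all(self) -> dict:
        if time.monotonic() - self._loaded_at > self._ttl:
            data = self._loader()
            if not data:
                raise RuntimeError("ontology registry is empty")
            self._data, self._loaded_at = data, time.monotonic()
        return self._data


def normalize_action(word: str, registry: TtlRegistry) -> str:
    try:
        table = registry.get_all()
    except Exception:
        # DB unavailable or empty: fall back to the hardcoded dict.
        table = FALLBACK_SYNONYMS
    return table.get(word, word)
```

Raising on an empty table keeps a half-migrated database from silently replacing the known-good dicts.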

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-03 23:47:53 +02:00
Benjamin Admin aab8eeb335 Merge branch 'main' of ssh://gitea.meghsakha.com:22222/Benjamin_Boenisch/breakpilot-core
CI / go-lint (push) Has been skipped
CI / python-lint (push) Has been skipped
CI / nodejs-lint (push) Has been skipped
CI / test-go-consent (push) Successful in 47s
CI / test-python-voice (push) Successful in 38s
CI / test-bqas (push) Successful in 33s
2026-05-03 23:14:34 +02:00
Benjamin Admin 9437e029d0 feat(pipeline): F1 regulation registry — DB-backed license/source-type lookup
Migrates REGULATION_LICENSE_MAP (135 entries) and SOURCE_REGULATION_CLASSIFICATION
(58 entries) from hardcoded Python dicts to compliance.regulation_registry table.

- SQL migration: 002_regulation_registry.sql (table + indexes + trigger)
- Migration script: f1_migrate_regulation_registry.py (162 rows, --dry-run)
- RegulationRegistry cache: 5min TTL, prefix fallback, graceful degradation
- control_generator._classify_regulation() delegates to DB with dict fallback
- source_type_classification.classify_source_regulation() delegates to DB
- 34 new tests (lookup, cache, degradation, migration data consistency)
- 421 total tests pass, 0 regressions

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-03 23:14:06 +02:00
Benjamin Admin 4fd2bfefcd docs: session handover updated for Block F start
Next: F1 Regulation Registry (DB + API + Frontend + Auto-Create)
Frontend at /sdk/regulation-registry in breakpilot-compliance admin

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-03 22:51:23 +02:00
Benjamin Admin fac9280716 feat(pipeline): Block D5+-E complete session — 20k+ new chunks
Session 02-03.05.2026 accomplishments:
- D5+: NIST/ENISA PDF quality fix (0%→45% section rate)
- D5+: 4 lost NIST PDFs restored (11k chunks)
- D5+: Text normalization + section detection for NIST/BSI
- D6: Citation backfill (3,651 controls updated, old archived)
- E2: 8 DE laws ingested (ArbZG, MuSchG, GmbHG, AktG, InsO...)
- E3: 5 EU regulations (CSRD, CSDDD, Taxonomy, eIDAS, Pay Trans.)
- E4: Standards (GoBD, BAIT, VAIT)
- E6: 3 CH + 4 AT laws (OR, DSV, ArG, ArbVG, AngG, AZG, NISG)
- E7: 9 court judgments as full text (Schrems II 154 chunks,
  Meta 101, BVerfG 161, DSK OH 119, Planet49 42, SCHUFA 41,
  Schadenersatz 29, BAG 48, Google Fonts 14)
- Infra: Qdrant snapshot mechanism, upload-before-delete safety

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-03 22:31:57 +02:00
Benjamin Admin 118be3540d feat(pipeline): D6 citation backfill + E2/E3 law ingestion scripts
- d6_citation_backfill.py: 3-tier matching (hash/prefix/overlap),
  archives old citations, updated 3,651 controls (93.6% coverage)
- ingest_de_laws.py: 8 German laws ingested (ArbZG, MuSchG, NachwG,
  MiLoG, GmbHG, AktG, InsO, BUrlG; 1,629 chunks)
- ingest_eu_regulations.py: EUR-Lex ingestion (needs manual HTML due
  to AWS WAF). CSRD, CSDDD, EU Taxonomy, eIDAS 2.0, Pay Transparency
  manually ingested (1,057 chunks)
- Updated session handover with current state

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-03 13:19:27 +02:00
Benjamin Admin a9671a572b fix(embedding): single-number ALL-CAPS section detection for ENISA/BSI
Add case-sensitive _SINGLE_NUM_ALLCAPS_RE for "1. INTRODUCTION" style
headers (ENISA, BSI docs). _LEGAL_SECTION_RE cannot be used for this because
it uses re.IGNORECASE, which would produce false positives on "1. Erstens" etc.
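
To illustrate why case sensitivity matters here, a small sketch with a hypothetical pattern (the actual _SINGLE_NUM_ALLCAPS_RE is not shown in this log):

```python
import re

# Hypothetical pattern: one or two digits, a dot, then an ALL-CAPS title.
PATTERN = r"^(\d{1,2})\.[ \t]+([A-Z][A-Z &/-]{2,})$"
SINGLE_NUM_ALLCAPS_RE = re.compile(PATTERN, re.MULTILINE)
CASE_INSENSITIVE_RE = re.compile(PATTERN, re.MULTILINE | re.IGNORECASE)

text = "1. INTRODUCTION\n1. Erstens\n2. SCOPE AND PURPOSE\n"

headers = [(m.group(1), m.group(2)) for m in SINGLE_NUM_ALLCAPS_RE.finditer(text)]
loose = [m.group(2) for m in CASE_INSENSITIVE_RE.finditer(text)]
```

The case-sensitive pattern keeps the two real headers, while the IGNORECASE variant also picks up the enumeration item "Erstens".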

Also re-downloaded 2 corrupt PDFs from nist.gov (nistir_8259a, nist_ai_rmf)
— originals in MinIO were 263-byte XML error responses, not PDFs.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-03 08:56:02 +02:00
Benjamin Admin 2f4a3f2ea2 fix(embedding): add NIST control IDs to _SECTION_NUMBER_RE
_SECTION_NUMBER_RE only had patterns for §/Art/Section/Kapitel/Annex
but missed NIST-style identifiers (AC-1, GV.OC-01, 3.1, A01:2021).
This caused 0% section rate for all NIST/BSI/ENISA documents even
though sections were correctly detected — the section NUMBER wasn't
extracted from the header.

Also adds:
- reupload_legal_strategy.py: re-upload with legal chunking
- extract_and_upload_nist.py: local PDF extraction workaround
- qdrant-snapshot.sh: backup mechanism for Qdrant collections

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-03 07:42:06 +02:00
Benjamin Admin 0b0eed27b0 feat(embedding): NIST PDF text normalization + safe re-ingest script
Fix broken multi-column PDF extraction for NIST/BSI/ENISA documents:
- _normalize_pdf_text(): fixes broken section numbers (1 . 1 → 1.1),
  control IDs (AC - 1 → AC-1), ligatures, soft hyphens
- pdfplumber tolerances increased (x=3,y=4) for better column handling
- 3 new regex patterns: NIST CSF 2.0, NIST enhancements, OWASP Top 10
- reingest_nist.py: safe upload-before-delete for 4 lost NIST PDFs
- reingest_d5.py: safety fix — upload first, verify, then delete old
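
The kinds of repairs listed for _normalize_pdf_text() could look roughly like this. The function below is an illustrative sketch with assumed regexes and a small ligature table, not the committed implementation.

```python
import re

LIGATURES = {"\ufb01": "fi", "\ufb02": "fl", "\ufb00": "ff"}


def normalize_pdf_text(text: str) -> str:
    text = text.replace("\u00ad", "")  # drop soft hyphens
    for ligature, plain in LIGATURES.items():
        text = text.replace(ligature, plain)
    # "1 . 1" -> "1.1"; lookarounds let "2 . 3 . 1" collapse in one pass.
    text = re.sub(r"(?<=\d) \. (?=\d)", ".", text)
    # "AC - 1" -> "AC-1" for two-letter control family IDs.
    text = re.sub(r"\b([A-Z]{2}) - (\d+)\b", r"\1-\2", text)
    return text
```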

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-03 06:42:46 +02:00
Benjamin Admin 97a7f6f264 docs: comprehensive session handover with full roadmap (Blocks A-G)
Complete instructions for next session including:
- Current quality metrics per document type
- Prioritized action items (NIST fix, citation backfill, missing laws)
- Full Block E-G roadmap with details
- All critical files, DB state, test commands
- Known issues (3 lost NIST PDFs, frontend 500s, D5 script safety)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-02 22:30:50 +02:00
Benjamin Admin ff21bc258a docs: session handover — D2-D5 complete, quality report, NIST plan
Major session achievements:
- Structural metadata end-to-end (D2-D4)
- 430 docs re-ingested with new chunking
- HTML stripping + charset detection (0% → 97.6%)
- 20 EU regulations from EUR-Lex HTML (DSGVO: 0% → 92%)
- Quality report script (500 controls: 13% fully correct)
- Frontend requirements.map fix

Open: NIST/ENISA text normalization, citation backfill,
D5 script safety (upload-before-delete), BEG IV ingestion.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-02 22:22:55 +02:00
Benjamin Admin 3009f3d13a feat(embedding): add NIST/ENISA/standard section numbering to chunker
Extends _LEGAL_SECTION_RE to detect:
- Numbered sections: 1.1 Title, 2.3.1 Subtitle
- Control family IDs: AC-1, AU-2, PO.1, PW.1.1
- Table/Figure/Appendix references
Also adds EUR-Lex HTML replacement script.

58 embedding-service tests passing.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-02 19:24:10 +02:00
Benjamin Admin 5a6e588641 docs: update session handover — D2-D5 complete, EU PDF issue documented
Session achieved: structural metadata end-to-end (D2-D4), overlap bug
fix, HTML stripping with charset detection, 430/436 docs re-ingested.

Remaining: ~40 EU Official Journal PDFs need HTML from EUR-Lex (broken
multi-column PDF extraction), 3 missing EDPB PDFs, 1 corrupt PDF.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-02 17:34:34 +02:00
Benjamin Admin 41183ff93d fix(docker): set PDF_EXTRACTION_BACKEND to auto (was pymupdf)
The default was 'pymupdf' which doesn't exist as a backend, causing
fallthrough to pypdf every time. With 'auto', the priority is:
unstructured > pdfplumber > pypdf.
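
A sketch of the resolution order, assuming availability is decided by whether the package can be imported. `resolve_backend` and its `is_available` hook are hypothetical names, and the handling of a known but uninstalled backend is a guess.

```python
from importlib.util import find_spec

BACKEND_PRIORITY = ["unstructured", "pdfplumber", "pypdf"]


def resolve_backend(configured: str,
                    is_available=lambda name: find_spec(name) is not None) -> str:
    if configured != "auto":
        if configured in BACKEND_PRIORITY and is_available(configured):
            return configured
        # Unknown names such as "pymupdf" fall through to the last resort.
        return "pypdf"
    for name in BACKEND_PRIORITY:
        if is_available(name):
            return name
    raise RuntimeError("no PDF extraction backend installed")
```

With "auto" the best installed backend wins, whereas an unrecognised name always ends at pypdf, which matches the behaviour described above.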

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-02 17:30:33 +02:00
Benjamin Admin 75dda9ac92 feat(embedding): add pdfplumber backend for multi-column PDF extraction
EU Official Journal PDFs (AI Act, CRA, NIS2, DSGVO, etc.) use
multi-column layouts that pypdf breaks into fragmented words
("Ar tik el" instead of "Artikel"). pdfplumber handles these correctly.

Backend priority: unstructured > pdfplumber > pypdf (auto mode).
Also increases D5 re-ingestion timeout to 3600s for large PDFs.

58 embedding-service tests passing. pdfplumber: MIT license.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-02 15:42:25 +02:00
Benjamin Admin a459636bc4 fix(rag): HTML charset detection + opening block tag newlines
Two bugs fixed:
1. Opening block tags (<h3>, <div>) now also create newlines, not just
   closing tags. Fixes: gesetze-im-internet.de puts § inside an <h3> that
   follows inline <a> text, so § ended up mid-line, not at line start.

2. HTML charset detection from meta tag (charset=iso-8859-1). Files from
   gesetze-im-internet.de use ISO-8859-1, not UTF-8. The § byte (0xA7)
   was destroyed by UTF-8 decode. Now: try UTF-8 → check meta charset →
   fallback ISO-8859-1.
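
The decode chain from point 2 can be sketched as follows. `decode_html` and the meta regex are illustrative assumptions; only the order UTF-8, then meta charset, then ISO-8859-1 is taken from the commit message.

```python
import re

META_CHARSET_RE = re.compile(rb"""charset=["']?([A-Za-z0-9_-]+)""", re.IGNORECASE)


def decode_html(raw: bytes) -> str:
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        pass
    # Look for a declared charset near the top of the document.
    match = META_CHARSET_RE.search(raw[:2048])
    if match:
        try:
            return raw.decode(match.group(1).decode("ascii"))
        except (LookupError, UnicodeDecodeError):
            pass
    # ISO-8859-1 maps every byte, so this last step cannot fail.
    return raw.decode("iso-8859-1")
```

A lone 0xA7 byte is invalid UTF-8, so such files reach the second step and § survives.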

32 rag-service tests passing.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-02 08:35:47 +02:00
Benjamin Admin ddad58f607 fix(rag): strip HTML tags before chunking + D5 re-ingestion scripts
HTML files from gesetze-im-internet.de were decoded as raw UTF-8, keeping
<div>/<p> tags intact. The legal chunker regex requires § at line start,
which never matched inside HTML tags → 0% section metadata for HTML docs.

Fix: detect HTML content and strip tags before sending to embedding
service. Block elements become newlines, entities are decoded.
§ signs now appear at line starts → section detection works.
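
A rough sketch of the tag stripping, assuming a small fixed list of block tags and regex-based removal; the rag-service code may differ. Note that this sketch turns both opening and closing block tags into newlines.

```python
import html
import re

BLOCK_TAG_RE = re.compile(r"</?(?:p|div|h[1-6]|li|tr|br)\b[^>]*>", re.IGNORECASE)
ANY_TAG_RE = re.compile(r"<[^>]+>")


def strip_html(markup: str) -> str:
    text = BLOCK_TAG_RE.sub("\n", markup)  # block tags become line breaks
    text = ANY_TAG_RE.sub("", text)        # remaining inline tags are dropped
    text = html.unescape(text)             # decode entities such as &sect;
    lines = [line.strip() for line in text.splitlines()]
    return "\n".join(line for line in lines if line)


sample = '<div><a href="#">BGB</a><h3>&sect; 312 Anwendungsbereich</h3><p>Text &amp; mehr</p></div>'
cleaned = strip_html(sample)
```

After stripping, the § header sits at the start of its own line, which is what the legal chunker regex requires.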

Also adds D5 re-ingestion scripts (reingest_d5.py + config) for
batch re-processing of all documents in Qdrant collections.

27 rag-service tests passing.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-02 08:18:25 +02:00
Sharang Parnerkar f130c45ca8 feat(dataroom): bilingual descriptions, drag-drop multi-file upload, edit existing upload descriptions
Build pitch-deck / build-push-deploy (push) Successful in 1m47s
CI / go-lint (push) Has been skipped
CI / python-lint (push) Has been skipped
CI / nodejs-lint (push) Has been skipped
CI / test-go-consent (push) Successful in 39s
CI / test-python-voice (push) Successful in 32s
CI / test-bqas (push) Successful in 32s
- lib/translate.ts: LiteLLM DE<>EN translation utility
- Migration 006: description_de/description_en on both dataroom tables
- Admin + investor upload APIs: accept description+lang, auto-translate the other language on save
- PATCH /api/admin/dataroom/documents/[id]: description path in addition to display_name path
- PATCH /api/dataroom/uploads/[id]: investor can edit their own upload descriptions
- PATCH /api/admin/dataroom/investors/[id]/uploads: admin can edit investor upload descriptions
- All GET queries updated to return description fields
- Admin dataroom: drop zone replaces upload button, multi-file, inline description editor per doc and per investor upload
- Investor dataroom: drop zone, multi-file, description+lang textarea before upload, inline description editing on existing uploads

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-01 21:00:36 +02:00
Benjamin Admin 93099b2770 feat(pipeline): structural metadata end-to-end (Blocks D2-D4)
D2: RAG service stores section/section_title/paragraph/paragraph_num/page
from embedding service chunks_with_metadata into Qdrant payloads.

D3: Control generator prefers section > article > section_title from
Qdrant, adds page to source_citation and generation_metadata.

D4: Validated with real BGB §§ 312-312k text. Found and fixed critical
bug where Phase 3 overlap destroyed the [§ ...] section prefix, causing
only the first chunk per document to have metadata. All subsequent
chunks lost section info.
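
One way to apply overlap without losing the section prefix is to split the prefix off first and re-attach it afterwards. This is a hedged sketch only: the commit does not show the actual fix, `add_overlap` is a hypothetical helper, and the `[§ ...]` / `[Art. ...]` prefix pattern is assumed from the description.

```python
import re
from typing import List

# Assumed prefix shape: "[§ 312k] " or "[Art. 5] " at the start of a chunk.
_PREFIX_RE = re.compile(r"^\[(?:§|Art\.)[^\]]*\]\s*")


def add_overlap(chunks: List[str], overlap: int) -> List[str]:
    """Prepend the tail of the previous chunk body while keeping each chunk's own prefix first."""
    result = []
    prev_body = ""
    for chunk in chunks:
        match = _PREFIX_RE.match(chunk)
        prefix = match.group(0) if match else ""
        body = chunk[len(prefix):]
        tail = prev_body[-overlap:] if overlap > 0 else ""
        result.append(f"{prefix}{tail}{body}")
        prev_body = body
    return result
```

Because the prefix always stays at position 0, a downstream metadata parser that looks for it at the start of the chunk keeps working for every chunk, not just the first.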

Also fixes pre-existing lint issues (unused imports, ambiguous variable
names, duplicate dict key, bare except).

456 tests passing (58 embedding + 387 pipeline + 11 rag-service).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-01 20:34:00 +02:00
Sharang Parnerkar 370143b643 fix(dataroom): use getSessionFromCookie() instead of middleware headers; fix auth page overflow
Build pitch-deck / build-push-deploy (push) Successful in 1m33s
CI / go-lint (push) Has been skipped
CI / python-lint (push) Has been skipped
CI / nodejs-lint (push) Has been skipped
CI / test-go-consent (push) Successful in 37s
CI / test-python-voice (push) Successful in 31s
CI / test-bqas (push) Successful in 27s
Dataroom routes were reading x-investor-id from request headers which
the middleware sets as response headers — these don't reach route handlers
when the admin fallback path runs (NextResponse.next() without header).
Switch to getSessionFromCookie() consistent with all other investor routes.

Auth page DSGVO footer switched from absolute bottom-0 to normal flow
so the expanded Art. 13 notice doesn't overlap the login card.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-01 16:03:21 +02:00
Sharang Parnerkar 07039cc408 fix(pitch-deck): pre-create /data/dataroom owned by nextjs in Dockerfile
Build pitch-deck / build-push-deploy (push) Successful in 1m18s
CI / go-lint (push) Has been skipped
CI / python-lint (push) Has been skipped
CI / nodejs-lint (push) Has been skipped
CI / test-go-consent (push) Successful in 29s
CI / test-python-voice (push) Successful in 30s
CI / test-bqas (push) Successful in 29s
Docker volume inherits directory ownership from the image on first mount.
Without this, the volume mounts as root and the nextjs (uid 1001) process
gets EACCES when trying to write dataroom uploads.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-01 15:51:50 +02:00
Sharang Parnerkar af83e41494 feat(pitch-deck): add Data Room link for investors in top-right corner
Build pitch-deck / build-push-deploy (push) Successful in 1m19s
CI / go-lint (push) Has been skipped
CI / python-lint (push) Has been skipped
CI / nodejs-lint (push) Has been skipped
CI / test-go-consent (push) Successful in 32s
CI / test-python-voice (push) Successful in 31s
CI / test-bqas (push) Successful in 29s
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-01 15:47:14 +02:00
Sharang Parnerkar 9888b1b5d7 feat(pitch-deck): data room — file sharing and investor uploads
Build pitch-deck / build-push-deploy (push) Successful in 1m21s
CI / go-lint (push) Has been skipped
CI / python-lint (push) Has been skipped
CI / nodejs-lint (push) Has been skipped
CI / test-go-consent (push) Successful in 31s
CI / test-python-voice (push) Successful in 33s
CI / test-bqas (push) Successful in 32s
- lib/dataroom-storage.ts: local volume storage (DATAROOM_PATH env var,
  default /data/dataroom) replacing NextCloud WebDAV
- Admin API: upload documents, rename, delete, manage per-investor releases
- Investor API: list released documents, stream download with audit log,
  upload own documents (max DATAROOM_MAX_UPLOAD_MB, default 50MB)
- /pitch-admin/dataroom: document list + release toggles + investor uploads tab
- /dataroom: investor-facing document library + upload section
- All reads and writes logged to pitch_audit_logs
- Migration 005: dataroom_documents, dataroom_releases, dataroom_investor_uploads
- AdminShell: Data Room nav link (FolderOpen icon)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-01 15:38:21 +02:00
Benjamin Admin da21339e76 docs: add session handover instructions for next session
Covers: completed blocks A-D1, remaining D2-G, critical files,
DB state, memory files, test commands.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-01 15:33:05 +02:00
Benjamin Admin 6ab10415d8 feat(embedding): add structural metadata to legal chunking (Block D1)
chunk_text_legal_structured() returns metadata per chunk:
- section: "§ 312k", "Art. 5"
- section_title: "Kündigungsbutton"
- paragraph: "Abs. 1", "Nr. 3"
- paragraph_num: 1, 3
- page: (prepared for PDF integration)
- index: sequential position

/chunk endpoint now returns chunks_with_metadata alongside plain chunks.
Backward compatible — existing consumers use chunks field unchanged.

New regex: _PARAGRAPH_RE (Abs/Nr/Satz/lit), _SECTION_NUMBER_RE
New functions: _parse_section_metadata(), _extract_paragraph_ref()
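
To illustrate how the `paragraph` and `paragraph_num` fields might be derived, here is a small sketch. The regex and `extract_paragraph_ref` below are assumptions modelled on the names in the commit message, not the repository's `_PARAGRAPH_RE` or `_extract_paragraph_ref()`.

```python
import re
from typing import Optional, Tuple

# Assumed shape of a paragraph reference: Abs. 1, Nr. 3, Satz 2, lit. a
PARAGRAPH_REF_RE = re.compile(r"\b(Abs\.|Nr\.|Satz|lit\.)\s*(\d+|[a-z])\b")


def extract_paragraph_ref(text: str) -> Tuple[Optional[str], Optional[int]]:
    """Return (paragraph label, numeric value) for the first reference found in text."""
    match = PARAGRAPH_REF_RE.search(text)
    if match is None:
        return None, None
    kind, value = match.groups()
    number = int(value) if value.isdigit() else None
    return f"{kind} {value}", number
```

Letter references such as `lit. a` yield a label but no number, which matches the idea of `paragraph_num` being an integer field.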

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-01 15:25:23 +02:00
Sharang Parnerkar 1bf1411c66 fix(pitch-deck): update email privacy notice to match GDPR changes
Build pitch-deck / build-push-deploy (push) Successful in 1m19s
CI / go-lint (push) Has been skipped
CI / python-lint (push) Has been skipped
CI / nodejs-lint (push) Has been skipped
CI / test-go-consent (push) Successful in 29s
CI / test-python-voice (push) Successful in 29s
CI / test-bqas (push) Successful in 29s
72 Stunden → 30 Tage, expand scope to include personal contact data,
add Art. 15–21 rights, LfDI BW supervisory authority. Both DE + EN.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-01 15:20:46 +02:00
Sharang Parnerkar 5946aa47d5 fix(pitch-deck): GDPR compliance — automated cleanup, full Art. 13 notice
Build pitch-deck / build-push-deploy (push) Successful in 1m37s
CI / go-lint (push) Has been skipped
CI / python-lint (push) Has been skipped
CI / nodejs-lint (push) Has been skipped
CI / test-go-consent (push) Successful in 38s
CI / test-python-voice (push) Successful in 32s
CI / test-bqas (push) Successful in 30s
- runDataCleanup() replaces maskOverdueInvestors(): now also anonymizes
  never-activated invites after 90 days, deletes sessions + magic links
  older than 30 days, NULLs IPs in audit logs older than 30 days, and
  redacts email from audit log details JSONB for masked investors
- New /api/admin/cleanup POST endpoint for scheduled invocation
- New .gitea/workflows/pitch-cleanup.yml: daily cron at 02:00 UTC calls
  the cleanup endpoint so anonymization is genuinely automatic, not lazy
- Switch masking window from first_activity_at to last_login_at (30 days
  of inactivity; resets on each login)
- Both auth pages: DSGVO footer now covers all Art. 13 requirements —
  data categories, retention cutoffs, Art. 15–21 rights, contact address,
  LfDI Baden-Württemberg as supervisory authority

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-01 15:11:51 +02:00
Benjamin Admin d9c16fb914 feat(pipeline): add adversarial tests (30 cases) + regression harness
Block C implementation:
- adversarial_cases.yaml: 30 tricky cases in 5 categories
  (wrong legal basis, dark patterns, incomplete docs, similar-but-different, homonyms)
- test_adversarial.py: 63 tests validating adversarial cases
- test_regression.py: ontology stability, dependency engine, quality metrics
- conftest.py: shared fixtures (DB session, sample controls)

Total: 371 tests passing (221 existing + 150 new).
Real-world benchmarks (C1) need manual ground truth creation.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-01 13:02:29 +02:00
Benjamin Admin 6f58fdbaa5 docs: add test strategy instruction for dedicated session (Block C)
3 test levels: Real-World Benchmarks (10 DE websites), Adversarial Tests
(30 tricky cases), Regression Harness (CI/CD quality gate).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-01 12:28:58 +02:00
Benjamin Admin b8ff4e9290 feat(pipeline): add review-verify endpoint — LLM decides DUPLIKAT/VERSCHIEDEN
Sends 67k review candidates to Haiku Batch API in pairs.
Each pair gets a DUPLIKAT/VERSCHIEDEN decision with reasoning.
Results stored in control_dedup_reviews.review_status.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-01 09:36:30 +02:00
Benjamin Admin f2104768a0 fix(docker): re-enable healthcheck after dedup completion
Dedup is done (162k controls). Re-enable healthcheck with generous
timeouts (10 retries × 30s) and restart: unless-stopped.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-01 08:39:57 +02:00
Sharang Parnerkar 2f861cd6d7 feat(pitch-admin): backfill first_activity_at for existing investors
Build pitch-deck / build-push-deploy (push) Successful in 1m22s
CI / go-lint (push) Has been skipped
CI / python-lint (push) Has been skipped
CI / nodejs-lint (push) Has been skipped
CI / test-go-consent (push) Successful in 30s
CI / test-python-voice (push) Successful in 31s
CI / test-bqas (push) Successful in 31s
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-30 15:08:26 +02:00
Sharang Parnerkar 23b233bda3 feat(pitch-admin): generate magic link + 72h investor data masking
Build pitch-deck / build-push-deploy (push) Successful in 1m30s
CI / go-lint (push) Has been skipped
CI / python-lint (push) Has been skipped
CI / nodejs-lint (push) Has been skipped
CI / test-go-consent (push) Successful in 29s
CI / test-python-voice (push) Successful in 29s
CI / test-bqas (push) Successful in 30s
- New POST /api/admin/investors/[id]/generate-link endpoint: creates a
  magic link without sending email, returns the URL for the admin to
  copy and share manually (for when email is filtered)
- Adds 'Copy Link' button (emerald) to investor list and detail pages;
  link is copied to clipboard on click
- New lib/masking.ts: maskOverdueInvestors() UPDATE that anonymizes
  email/name/company → revokes sessions 72h after first investor login
- first_activity_at recorded on first verify (COALESCE, set once only)
- migration 004 adds first_activity_at + data_masked_at columns with
  partial index; also wired into /api/admin/migrate for one-shot apply
- Admin UI shows 'anonymized' badge, expiry countdown, and masked state;
  Copy Link + Resend are disabled for anonymized investors
- verify route returns 410 if data_masked_at is set (belt-and-suspenders
  alongside the revoked status check)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-30 14:55:29 +02:00
Sharang Parnerkar adfff6cfe4 fix(pitch-deck): exclude mcp-server from Next.js tsconfig + resolve FinanzplanSlide conflict
Build pitch-deck / build-push-deploy (push) Successful in 1m13s
CI / go-lint (push) Has been skipped
CI / python-lint (push) Has been skipped
CI / nodejs-lint (push) Has been skipped
CI / test-go-consent (push) Successful in 27s
CI / test-python-voice (push) Successful in 27s
CI / test-bqas (push) Successful in 31s
- tsconfig.json: add mcp-server to exclude list so the standalone MCP
  package's imports don't break the Next.js type-check build
- FinanzplanSlide.tsx: resolve merge conflict, keep MonthlyGrid refactor
  from upstream (discards superseded inline table from stash)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-30 14:11:40 +02:00
Sharang Parnerkar 269464943e fix(pitch-deck): restore complete USPSlide with all helper functions
Build pitch-deck / build-push-deploy (push) Failing after 40s
CI / go-lint (push) Has been skipped
CI / python-lint (push) Has been skipped
CI / nodejs-lint (push) Has been skipped
CI / test-go-consent (push) Successful in 41s
CI / test-python-voice (push) Successful in 29s
CI / test-bqas (push) Successful in 26s
The previously committed version was missing useIsLight hook, all sub-components
(PillarRow, ColHeader, CentralHub, BridgeConnectors, FeatureCard, DetailModal,
StarField, ticker components) and their data/types. Only the main component
shell was present, causing a CI build failure on type-check.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-30 14:05:42 +02:00
Benjamin Admin e8df15c0f8 fix: add proxy_read_timeout 300s to admin-compliance location block
Scan endpoint needs up to 3-5 min (multi-page crawl + LLM calls).
Without explicit timeout, nginx defaults to 60s → 504 Gateway Timeout.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-29 11:23:02 +02:00
Benjamin Admin 7c5592b50e feat(pipeline): add checkpoint to dedup Phase 2 — survives container restart
Stores last_control_id in canonical_generation_jobs after each page.
On restart, resumes from checkpoint instead of starting over.
Checkpoint is deleted on completion.
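
The resume behaviour can be sketched with keyset pagination and an injected checkpoint store. All names here (`run_resumable` and the callback parameters) are hypothetical, and integer IDs are an assumption; the real job persists `last_control_id` in `canonical_generation_jobs`.

```python
from typing import Callable, List, Optional


def run_resumable(
    fetch_page: Callable[[Optional[int], int], List[int]],
    process: Callable[[int], None],
    load_checkpoint: Callable[[], Optional[int]],
    save_checkpoint: Callable[[int], None],
    clear_checkpoint: Callable[[], None],
    page_size: int = 100,
) -> None:
    """Process IDs page by page, saving the last processed ID after each page."""
    last_id = load_checkpoint()  # None on a fresh run
    while True:
        page = fetch_page(last_id, page_size)  # IDs greater than last_id, ascending
        if not page:
            break
        for control_id in page:
            process(control_id)
        last_id = page[-1]
        save_checkpoint(last_id)
    clear_checkpoint()  # job finished, nothing left to resume
```

Saving after each page rather than after each control means at most one page is reprocessed after a restart, so `process` should tolerate being called twice for the same ID.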

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-29 09:12:23 +02:00
Benjamin Admin e8f018f2c6 fix: increase client_max_body_size to 50M for ports 3007 + 8093
CI / go-lint (push) Has been skipped
CI / python-lint (push) Has been skipped
CI / nodejs-lint (push) Has been skipped
CI / test-go-consent (push) Successful in 56s
CI / test-python-voice (push) Successful in 38s
CI / test-bqas (push) Successful in 31s
Port 3007 (admin-compliance) had no limit (nginx default 1M) causing
413 on SDK state saves. Port 8093 (SDK) had 10M, now 50M.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-29 08:54:06 +02:00
Benjamin Admin b151951448 fix(pipeline): make dedup Phase 2 resilient — paginated, timeout, per-control error handling
- Paginated DB queries (100 rows/page) instead of loading all 166k rows
- Individual timeout (30s) per embedding + qdrant call
- Per-control try/except — one failure doesn't kill the job
- Sequential processing (no asyncio.gather) for stability
- Progress logging every 500 controls
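
The resilience measures listed above (sequential processing, a timeout, per-control error handling, periodic progress logs) might look roughly like this. It is a sketch under assumptions: `process_controls` and `handle` are hypothetical, and a single timeout wraps the whole per-control step here, whereas the commit applies it to the embedding and Qdrant calls individually.

```python
import asyncio
import logging
from typing import Awaitable, Callable, Iterable, List

logger = logging.getLogger("dedup")


async def process_controls(
    control_ids: Iterable[int],
    handle: Callable[[int], Awaitable[None]],
    timeout_s: float = 30.0,
    log_every: int = 500,
) -> List[int]:
    """Handle controls one at a time; return the IDs that failed or timed out."""
    failed: List[int] = []
    for count, control_id in enumerate(control_ids, start=1):
        try:
            await asyncio.wait_for(handle(control_id), timeout=timeout_s)
        except Exception as exc:  # includes the timeout error raised by wait_for
            logger.warning("control %s failed: %r", control_id, exc)
            failed.append(control_id)
        if count % log_every == 0:
            logger.info("processed %d controls, %d failed", count, len(failed))
    return failed
```

Returning the failed IDs instead of raising lets the caller decide whether to retry them in a later pass.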

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-28 15:31:28 +02:00
Benjamin Admin 2e2e81b3e1 fix(docker): disable healthcheck + auto-restart for control-pipeline during dedup
The dedup job blocks the event loop for extended periods, causing
health checks to fail repeatedly. Even 10 retries × 30s wasn't enough.
Disabled healthcheck and restart policy until dedup is complete.

TEMPORARY — re-enable after dedup is finished.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-28 14:39:19 +02:00
Benjamin Admin b873c0e4ae fix(docker): increase control-pipeline healthcheck tolerance for long-running jobs
Dedup Phase 2 blocks the event loop for extended periods, causing
health checks to fail. Docker then restarts the container and kills
the job. Increased retries from 3 to 10, timeout from 10s to 30s.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-28 12:35:39 +02:00
Benjamin Admin 9dc16674e2 perf(pipeline): skip singleton groups in dedup Phase 1
153k of 160k merge groups have only 1 control — no intra-group
dedup possible. Skip them in Phase 1, they become masters automatically.
Phase 2 (cross-group) still checks them via Qdrant embeddings.

Reduces Phase 1 from ~96h to ~2h.
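
The optimisation amounts to partitioning the merge groups by size before any comparison work starts. A minimal sketch, assuming groups are held as a mapping from merge key to member IDs (`split_singletons` is a hypothetical name):

```python
from typing import Dict, List, Tuple


def split_singletons(groups: Dict[str, List[str]]) -> Tuple[List[str], Dict[str, List[str]]]:
    """Separate single-member groups (automatic masters) from groups needing intra-group dedup."""
    masters: List[str] = []
    to_dedup: Dict[str, List[str]] = {}
    for key, members in groups.items():
        if len(members) == 1:
            masters.append(members[0])  # nothing to compare against within the group
        elif members:
            to_dedup[key] = members
    return masters, to_dedup
```

Only `to_dedup` is passed to Phase 1; the singleton masters still take part in the cross-group check of Phase 2.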

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-28 00:31:22 +02:00
Benjamin Admin e6e2688b56 fix(pipeline): add idempotency guard to submit-pass0b endpoint
Prevents duplicate batch submissions that caused ~$170 in extra costs.
Refuses new submit if a batch was submitted in the last 10 minutes.
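
The guard reduces to a time-window check against the most recent submission. A sketch, assuming the endpoint can look up the timestamp of the last batch; `may_submit` and `GUARD_WINDOW` are hypothetical names.

```python
from datetime import datetime, timedelta
from typing import Optional

GUARD_WINDOW = timedelta(minutes=10)  # window from the commit message


def may_submit(last_submitted_at: Optional[datetime], now: datetime,
               window: timedelta = GUARD_WINDOW) -> bool:
    """Allow a new batch only if none was submitted within the guard window."""
    if last_submitted_at is None:
        return True
    return now - last_submitted_at >= window
```

Passing `now` in explicitly keeps the check deterministic in tests; an endpoint would typically answer a refused submit with an HTTP 409 or 429, though the commit does not say which status it uses.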

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-27 18:59:03 +02:00
331 changed files with 57761 additions and 1850 deletions
+4
@@ -25,6 +25,7 @@ voice-service/bqas/** | owner=pipeline | reason=RAG Quality Assessment, produkti
# Seed/Helper Scripts (keine Service-Logik)
scripts/seed-demo-and-screenshot.py | owner=infra | reason=Einmaliges Seed-Script, kein Service-Code | review=permanent
pitch-deck/scripts/import-finanzplan.py | owner=pitch-deck | reason=583 LOC, einmaliges Excel-Import-Script (9 Sheet-Importer), hardcodierte Row/Col-Mappings fuer eine Finanzplan-.xlsm-Datei, keine wiederverwendbare Logik | review=2027-01
pitch-deck/scripts/export-finanzplan-excel.ts | owner=pitch-deck | reason=1254 LOC, Excel-Export-Script — analog zu import-finanzplan.py: 9 Sheets, ~80% Cell-Formatting/Styling-Boilerplate, keine wiederverwendbare Logik | review=2027-01
# PDF Templates (reine statische HTML/CSS Strings, keine Logik)
backend-core/services/pdf_templates.py | owner=all | reason=519 LOC, rein statische Jinja2-HTML-Templates + CSS, keine Logik | review=2026-07
@@ -33,3 +34,6 @@ backend-core/services/pdf_templates.py | owner=all | reason=519 LOC, rein statis
pitch-deck/lib/presenter/presenter-faq.ts | owner=pitch-deck | reason=973 LOC, pure static FAQ array (questions/answers/keywords), no logic | review=2027-01
pitch-deck/lib/presenter/presenter-script.ts | owner=pitch-deck | reason=608 LOC, pure static presenter script data + 3 trivial lookup functions | review=2027-01
pitch-deck/lib/i18n.ts | owner=pitch-deck | reason=620 LOC, pure DE/EN translation dictionaries + 3 small format helpers | review=2027-01
# Marketing Website — adapted from pitch-deck USP slide (complex SVG animation, inline styles, no logic to split)
marketing-website/components/sections/PlatformBridgeSection.tsx | owner=marketing | reason=816 LOC, adapted 1:1 from pitch-deck USPSlide with SVG animations, CSS keyframes, inline styles — splitting would break animation coherence | review=2027-01
-297
@@ -1,297 +0,0 @@
# Night Scheduler - Developer Documentation
**Status:** Production
**Last updated:** 2026-02-09
**URL:** https://macmini:3002/infrastructure/night-mode
**API:** http://macmini:8096
---
## Overview
The Night Scheduler automatically shuts down the Docker services overnight:
- Scheduled shutdown (default: 22:00)
- Scheduled startup (default: 06:00)
- Manual immediate actions (start/stop)
- Dashboard UI for configuration
---
## Architecture
```
┌─────────────────────────────────────────────────────────────┐
│ Admin Dashboard (Port 3002) │
│ /infrastructure/night-mode │
└─────────────────────────────────────────────────────────────┘
┌─────────────────────────────────────────────────────────────┐
│ API Proxy: /api/admin/night-mode │
│ - GET: fetch status │
│ - POST: save configuration │
│ - POST /execute: immediate action (start/stop) │
│ - GET /services: service list │
└─────────────────────────────────────────────────────────────┘
┌─────────────────────────────────────────────────────────────┐
│ night-scheduler (Port 8096) │
│ - Python/FastAPI Container │
│ - Checks every minute whether an action is due │
│ - Runs docker compose start/stop │
│ - Stores config in /config/night-mode.json │
└─────────────────────────────────────────────────────────────┘
```
---
## Files
| Path | Description |
|------|--------------|
| `night-scheduler/scheduler.py` | Python scheduler with FastAPI |
| `night-scheduler/Dockerfile` | Container with Docker CLI |
| `night-scheduler/requirements.txt` | Dependencies |
| `night-scheduler/config/night-mode.json` | Configuration file |
| `night-scheduler/tests/test_scheduler.py` | Unit tests |
| `admin-v2/app/api/admin/night-mode/route.ts` | API proxy |
| `admin-v2/app/api/admin/night-mode/execute/route.ts` | Execute endpoint |
| `admin-v2/app/api/admin/night-mode/services/route.ts` | Services endpoint |
| `admin-v2/app/(admin)/infrastructure/night-mode/page.tsx` | UI page |
---
## API Endpoints
### GET /api/night-mode
Fetch status and configuration.
**Response:**
```json
{
"config": {
"enabled": true,
"shutdown_time": "22:00",
"startup_time": "06:00",
"last_action": "startup",
"last_action_time": "2026-02-09T06:00:00",
"excluded_services": ["night-scheduler", "nginx"]
},
"current_time": "14:30:00",
"next_action": "shutdown",
"next_action_time": "22:00",
"time_until_next_action": "7h 30min",
"services_status": {
"backend": "running",
"postgres": "running"
}
}
```
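The `next_action` and `time_until_next_action` fields in the response above can be derived from the two configured times. The following is only a sketch of that calculation; `next_action` as a function and its signature are hypothetical, and the scheduler's real logic (including its handling of `enabled` and `last_action`) is not shown in this document.

```python
from datetime import datetime, time, timedelta
from typing import Tuple


def next_action(now: datetime, shutdown: time, startup: time) -> Tuple[str, timedelta]:
    """Return the name of the next scheduled action and the time remaining until it."""
    candidates = []
    for name, at_time in (("shutdown", shutdown), ("startup", startup)):
        due = datetime.combine(now.date(), at_time)
        if due <= now:
            due += timedelta(days=1)  # already passed today, so it is due tomorrow
        candidates.append((due, name))
    due, name = min(candidates)
    return name, due - now
```

For a current time of 14:30 with the default 22:00/06:00 schedule this gives a shutdown in 7 hours 30 minutes, in line with the sample response.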
### POST /api/night-mode
Update the configuration.
**Request:**
```json
{
"enabled": true,
"shutdown_time": "23:00",
"startup_time": "07:00",
"excluded_services": ["night-scheduler", "nginx", "vault"]
}
```
### POST /api/night-mode/execute
Execute an immediate action.
**Request:**
```json
{
"action": "stop" // or "start"
}
```
**Response:**
```json
{
"success": true,
"message": "Aktion 'stop' erfolgreich ausgefuehrt fuer 25 Services"
}
```
### GET /api/night-mode/services
Fetch the list of all services.
**Response:**
```json
{
"all_services": ["backend", "postgres", "valkey", ...],
"excluded_services": ["night-scheduler", "nginx"],
"status": {
"backend": "running",
"postgres": "running"
}
}
```
---
## Configuration
### Config format (night-mode.json)
```json
{
"enabled": true,
"shutdown_time": "22:00",
"startup_time": "06:00",
"last_action": "startup",
"last_action_time": "2026-02-09T06:00:00",
"excluded_services": ["night-scheduler", "nginx"]
}
```
### Environment variables
| Variable | Default | Description |
|----------|---------|--------------|
| `COMPOSE_PROJECT_NAME` | `breakpilot-pwa` | Docker Compose project name |
---
## Excluded services
These services are NOT stopped:
1. **night-scheduler** - must keep running so that it can start the services again
2. **nginx** - optional, for HTTPS access
Additional services can be excluded via the configuration.
---
## Docker Compose Integration
```yaml
night-scheduler:
build: ./night-scheduler
container_name: breakpilot-pwa-night-scheduler
volumes:
- /var/run/docker.sock:/var/run/docker.sock
- ./night-scheduler/config:/config
- ./docker-compose.yml:/app/docker-compose.yml:ro
environment:
- COMPOSE_PROJECT_NAME=breakpilot-pwa
ports:
- "8096:8096"
networks:
- breakpilot-pwa-network
restart: unless-stopped
```
---
## Running the tests
```bash
# Inside the container
docker exec -it breakpilot-pwa-night-scheduler pytest -v
# Locally (with dependencies)
cd night-scheduler
pip install -r requirements.txt
pytest -v tests/
```
---
## Deployment
```bash
# 1. Sync files
rsync -avz night-scheduler/ macmini:.../night-scheduler/
# 2. Build the container
ssh macmini "docker compose -f .../docker-compose.yml build --no-cache night-scheduler"
# 3. Start the container
ssh macmini "docker compose -f .../docker-compose.yml up -d night-scheduler"
# 4. Test
curl http://macmini:8096/health
curl http://macmini:8096/api/night-mode
```
---
## Troubleshooting
### Problem: Services are not stopped/started
1. Check that the Docker socket is mounted:
```bash
docker exec breakpilot-pwa-night-scheduler ls -la /var/run/docker.sock
```
2. Check that the docker compose CLI is available:
```bash
docker exec breakpilot-pwa-night-scheduler docker compose version
```
3. Check the logs:
```bash
docker logs breakpilot-pwa-night-scheduler
```
### Problem: Configuration is not saved
1. Check that /config is writable:
```bash
docker exec breakpilot-pwa-night-scheduler touch /config/test
```
2. Check the volume mount in docker-compose.yml
### Problem: API not reachable
1. Check the container status:
```bash
docker ps | grep night-scheduler
```
2. Check the health check:
```bash
curl http://localhost:8096/health
```
---
## Security notes
- The container needs access to the Docker socket
- Only internal services can be stopped/started
- No authentication (internal network)
- No sensitive data in the configuration
---
## Dependencies (SBOM)
| Package | Version | License |
|---------|---------|--------|
| FastAPI | 0.109.0 | MIT |
| Uvicorn | 0.27.0 | BSD-3-Clause |
| Pydantic | 2.5.3 | MIT |
| pytest | 8.0.0 | MIT |
| pytest-asyncio | 0.23.0 | Apache-2.0 |
| httpx | 0.26.0 | BSD-3-Clause |
---
## Change history
| Date | Change |
|-------|-----------|
| 2026-02-09 | Initial implementation |
+2 -2
@@ -3,7 +3,7 @@
#
# Services:
# Go: consent-service
# Python: backend-core, voice-service (+ BQAS), embedding-service, night-scheduler
# Python: backend-core, voice-service (+ BQAS), embedding-service
# Node.js: admin-core
name: CI
@@ -46,7 +46,7 @@ jobs:
- name: Lint Python services
run: |
pip install --quiet ruff
for svc in backend-core voice-service night-scheduler embedding-service; do
for svc in backend-core voice-service embedding-service; do
if [ -d "$svc" ]; then
echo "=== Linting $svc ==="
ruff check "$svc/" --output-format=github || true
+36
@@ -0,0 +1,36 @@
# Daily GDPR data cleanup for the pitch deck.
# Calls /api/admin/cleanup which runs runDataCleanup():
# - anonymizes investors inactive 30+ days
# - anonymizes never-activated invites after 90 days
# - deletes sessions + magic links older than 30 days
# - anonymizes IPs in audit logs older than 30 days
#
# Requires Gitea Actions secret: PITCH_ADMIN_SECRET
name: Pitch deck — GDPR cleanup
on:
schedule:
- cron: '0 2 * * *'
jobs:
cleanup:
runs-on: docker
container:
image: alpine:3.19
steps:
- name: Run data cleanup
env:
PITCH_ADMIN_SECRET: ${{ secrets.PITCH_ADMIN_SECRET }}
run: |
apk add --no-cache curl
RESPONSE=$(curl -sSf -w "\n%{http_code}" -X POST \
-H "Authorization: Bearer $PITCH_ADMIN_SECRET" \
-H "Content-Type: application/json" \
https://pitch.breakpilot.com/api/admin/cleanup) \
|| { echo "Cleanup request failed"; exit 1; }
HTTP_CODE=$(echo "$RESPONSE" | tail -n1)
BODY=$(echo "$RESPONSE" | head -n-1)
echo "Response: $BODY"
[ "$HTTP_CODE" = "200" ] || { echo "Unexpected status $HTTP_CODE"; exit 1; }
echo "GDPR cleanup completed successfully"
+9
@@ -41,6 +41,11 @@ backups/*.backup
*.mp3
*.wav
# Cloned external legal-source repos (gitignored; pulled fresh at ingest time)
legal-sources/bsi-quaidal/
legal-sources/bsi-quaidal-src/
legal-sources/bsi-grundschutz-plus/
# Compiled binaries
billing-service/billing-service
consent-service/server
@@ -62,3 +67,7 @@ consent-service/server
# Coverage
coverage/
*.coverage
controls_backup_*.dump
# Allow Finanzplan exports (generated by pitch-deck/scripts/export-finanzplan.sh)
!pitch-deck/exports/*.xlsx
+2948
File diff suppressed because it is too large
+1 -1
@@ -10,7 +10,7 @@
},
"dependencies": {
"lucide-react": "^0.468.0",
"next": "^15.1.0",
"next": "^15.5.16",
"react": "^18.3.1",
"react-dom": "^18.3.1",
"reactflow": "^11.11.4",
@@ -0,0 +1,158 @@
# Using the Controls - Guide for Other Sessions
**As of:** 2026-05-07, updated continuously
**Repo:** breakpilot-core (~/Projekte/breakpilot-core)
---
## What are the Controls?
The database holds 174,497 atomic compliance controls. Each control is a **single verifiable requirement** taken from a legal source (DSGVO, NIS2, NIST, AI Act, etc.).
### Example
```
Control-ID: AUTH-2956-A14
Titel: "Implementierung von Multi-Faktor-Authentifizierung prüfen"
Objective: "Sicherstellen, dass MFA korrekt implementiert ist..."
Merge-Key: "verify:multi_factor_auth:testing"
Severity: high
```
## Where are the Controls stored?
### Database (PostgreSQL on the Mac Mini)
```sql
-- Query all controls
SELECT id, control_id, title, objective, severity,
source_citation, -- legal source (JSON)
generation_metadata->>'merge_group_hint' AS merge_key
FROM compliance.canonical_controls
WHERE release_state NOT IN ('deprecated', 'rejected');
```
**Connection:**
```bash
# From the MacBook:
ssh macmini "/usr/local/bin/docker exec bp-core-postgres psql -U breakpilot -d breakpilot_db"
# Or via the control-pipeline container:
ssh macmini "/usr/local/bin/docker exec bp-core-control-pipeline curl -sf http://127.0.0.1:8098/..."
```
### API (port 8098, reachable only via docker exec)
```bash
# List master controls
ssh macmini "/usr/local/bin/docker exec bp-core-control-pipeline \
curl -sf 'http://127.0.0.1:8098/v1/master-controls?limit=50&sort=total_controls'"
# Master control detail with all members
ssh macmini "/usr/local/bin/docker exec bp-core-control-pipeline \
curl -sf 'http://127.0.0.1:8098/v1/master-controls/MC-8292'"
```
## Structure of the Controls
### merge_group_hint (key field!)
Every control has a `merge_group_hint` in the format `action:object:phase`:
```
implement:encryption:implementation
define:access_control:definition
monitor:network_security:monitoring
report:supervisory_authority:reporting
```
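Client code that consumes these hints may want to split them into their three parts. A minimal sketch, assuming the format always has exactly three non-empty, colon-separated fields; `MergeKey` and `parse_merge_key` are hypothetical names and not part of the pipeline.

```python
from typing import NamedTuple


class MergeKey(NamedTuple):
    action: str
    object_token: str
    phase: str


def parse_merge_key(hint: str) -> MergeKey:
    """Split an 'action:object:phase' hint into its three components."""
    parts = hint.split(":")
    if len(parts) != 3 or not all(parts):
        raise ValueError(f"unexpected merge_group_hint format: {hint!r}")
    return MergeKey(*parts)
```

Comparing `object_token` directly is stricter than a substring search, so `encryption` is not confused with `transport_encryption`.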
**74 canonical object tokens** (as of 2026-05-07):
| Category | Tokens |
|-----------|--------|
| **Security** | multi_factor_auth, password_policy, credentials, session_management, privileged_access, access_control, encryption, transport_encryption, key_management, certificate_management, network_security, network_segmentation, firewall, vpn, remote_access, monitoring, audit_logging, siem, alerting, compliance_audit, vulnerability, patch_management, backup, disaster_recovery, physical_security, secure_development, api_security, input_validation, container_security, logging_configuration |
| **Data Protection** | personal_data, sensitive_data, health_data, consent, data_subject_rights, data_retention, data_transfer, data_breach_notification, dpia, data_processing_agreement, privacy_by_design, data_processing_register, data_classification, cookie_consent, video_surveillance |
| **Governance** | policy, procedure, process, training, awareness, incident, risk_management, third_party_management, change_management, documentation, records_management, compliance_reporting, asset_management, human_resources_security |
| **Regulatory** | supervisory_authority, certification, product_safety, ai_system, financial_reporting, aml, whistleblowing, consumer_protection, ecommerce, telecommunications, medical_device, payment_services, critical_infrastructure, supply_chain_due_diligence, sustainability_reporting |
### Legal sources (source_citation)
The **parent controls** (not the atomic ones!) carry `source_citation`:
```sql
-- Find controls with a legal source
SELECT cc.control_id, cc.title,
pc.source_citation->>'source' AS regulation,
pc.source_citation->>'article' AS article
FROM compliance.canonical_controls cc
JOIN compliance.canonical_controls pc ON pc.id = cc.parent_control_uuid
WHERE pc.source_citation IS NOT NULL
AND pc.source_citation->>'source' LIKE '%DSGVO%';
```
148 distinct legal sources (DSGVO, NIS2, NIST, OWASP, BSI, TKG, etc.)
## Filtering Controls (use cases)
### Example: all DSGVO Art. 13 controls (for the DSI review)
```sql
SELECT cc.control_id, cc.title, cc.objective,
cc.generation_metadata->>'merge_group_hint' AS merge_key,
pc.source_citation->>'article' AS article
FROM compliance.canonical_controls cc
JOIN compliance.canonical_controls pc ON pc.id = cc.parent_control_uuid
WHERE pc.source_citation->>'source' = 'DSGVO (EU) 2016/679'
AND pc.source_citation->>'article' LIKE '%13%'
AND cc.release_state NOT IN ('deprecated', 'rejected')
ORDER BY cc.control_id;
```
### Example: all encryption controls
```sql
SELECT control_id, title, objective
FROM compliance.canonical_controls
WHERE generation_metadata->>'merge_group_hint' LIKE '%:encryption:%'
AND release_state NOT IN ('deprecated', 'rejected');
```
### Example: filter controls by object token
```sql
-- All controls for a specific topic
SELECT control_id, title,
generation_metadata->>'merge_group_hint' AS merge_key
FROM compliance.canonical_controls
WHERE generation_metadata->>'merge_group_hint' LIKE '%:data_retention:%'
AND release_state NOT IN ('deprecated', 'rejected');
```
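The token filters above all follow the same shape, so they can be generated instead of copy-pasted. The following is a minimal sketch, assuming the `:name` bind-parameter style that the pipeline's SQLAlchemy `text()` queries use; the helper name `hint_filter_query` is hypothetical and not part of the repository.

```python
def hint_filter_query(token: str) -> tuple[str, dict]:
    """Build a parameterised query for one object token (hypothetical helper)."""
    # Accept only snake_case-style tokens such as "data_retention" or "encryption".
    if not token or not token.replace("_", "").isalnum():
        raise ValueError(f"unexpected object token: {token!r}")
    sql = (
        "SELECT control_id, title, "
        "generation_metadata->>'merge_group_hint' AS merge_key "
        "FROM compliance.canonical_controls "
        "WHERE generation_metadata->>'merge_group_hint' LIKE :pattern "
        "AND release_state NOT IN ('deprecated', 'rejected')"
    )
    # "_" is a single-character wildcard in LIKE; escape it with a backslash,
    # which is PostgreSQL's default LIKE escape character.
    pattern = "%:" + token.replace("_", "\\_") + ":%"
    return sql, {"pattern": pattern}


if __name__ == "__main__":
    query, params = hint_filter_query("data_retention")
    print(params["pattern"])
```

The returned pair can be passed to `db.execute(text(query), params)`. Binding the pattern as a parameter keeps the token out of the SQL string itself.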
## Key tables
| Table | Rows | Description |
|-------|------|-------------|
| `compliance.canonical_controls` | ~294K | All controls (rich + atomic) |
| `compliance.master_controls` | ~5,329 | Grouped master controls |
| `compliance.master_control_members` | ~172K | Control → MC assignment |
| `compliance.object_ontology` | 74 | Canonical object definitions |
| `compliance.regulation_registry` | 223 | Registry of legal sources |
## What is happening right now (2026-05-07)
**Phase 2 is running:** all 174K controls are being re-classified with Claude Haiku. The `merge_group_hint` values are being normalized from free-form LLM objects to 74 canonical tokens. Next steps:
- Phase 3: re-clustering (gpre1 with K=20000)
- Phase 4: new master controls (gpre2)
- Phase 5: regulation-source split (gpre3)
**DO NOT MODIFY:** the `canonical_controls`, `master_controls`, and `object_ontology` tables are being actively processed.
## DB access quick reference
```bash
# Quick query (one-liner)
ssh macmini "/usr/local/bin/docker exec bp-core-postgres psql -U breakpilot -d breakpilot_db -c \"SELECT count(*) FROM compliance.canonical_controls\""
# Interactive session (-t allocates the TTY that `docker exec -it` needs)
ssh -t macmini "/usr/local/bin/docker exec -it bp-core-postgres psql -U breakpilot -d breakpilot_db"
```
@@ -0,0 +1,117 @@
# Session handover: MC quality + gap analysis + RAG ingestion
**Date:** 2026-05-07 to 2026-05-11 (5-day marathon)
**Repo:** breakpilot-core + breakpilot-compliance
---
## DONE
### Master Control Quality Overhaul (Core)
- **74.5% → 92.8% accuracy** (13,588 MCs, 83,073 members)
- Phase 0: quality audit with Claude Sonnet ($3)
- Phase 1: ontology 31 → 74 tokens + LLM prompt fix
- Phase 2: 174K controls re-classified via Haiku (10 batches, ~$50)
- Phase 2b: generic tokens fixed (documentation/procedure → real topics, $7.54)
- Phase 2c: L2 sub-topics (2 rounds, 172K controls, ~$32)
- Phase 2d: bad subtopics fixed (stakeholder_*, $0.50)
- Phase 3: re-clustering, K=18704
- Phase 4: gpre2 direct MC (13,588 MCs)
- Phase 6: golden dataset (20 controls) + 8 quality tests (all green)
- **Production sync:** MCs + members + hints + doc_check_controls
### doc_check_controls (Core → Production)
- **1,874 controls** across 8 document types (DSE, Cookie, Impressum, AGB, Widerruf, DSFA, AVV, Löschkonzept)
- Each with check_question + pass_criteria + fail_criteria
- Table `compliance.doc_check_controls` exists locally and in production
### RAG Ingestion (Core)
- **126 BAuA PDFs** (TRBS/TRGS/ASR): 27,664 chunks → `bp_compliance_ce`
- **OSHA Technical Manual** (23 chapters): 7,241 chunks → `bp_compliance_ce`
- **OSHA 1910 Subpart O** (full text): 745 chunks
- **EuGH C-588/21 P**: 216 chunks
- **EU 2018/1725**: 842 chunks → `bp_compliance`
- **CE obligations extracted:** 6,141 obligations → `/tmp/ce_obligations_v2.json`
- Playwright crawlers built for BAuA + OSHA
### Gap Analysis Engine (Compliance)
- **12 regulations** classified automatically (CRA, AI Act, NIS2, DSGVO, MiCA, PSD2, AML, etc.)
- **Current-state (IST) assessment:** CE marking, applied standards, existing processes, IACE project link
- **Standard→control mapping:** 20 standards → MC topic coverage
- **Priority engine:** Severity × Deadline × Dependency
- **5 industry templates:** IoT, Exchange, Cobot, SaaS, Medical
- **Frontend:** 2-step wizard (product + current state) + dashboard with traffic-light status
- **API:** 8 endpoints under `/sdk/v1/gap/`
- **Persistent projects:** save and reopen
- **Tested:** SmartFactory Gateway → 5 regulations, 500 gaps
### Tenant Document Upload API (Core)
- `POST/GET/DELETE /api/v1/tenant/documents`
- Tenant-isolated Qdrant collections
- Code complete, not deployed (RAG service rebuild required)
### Master Controls Browser (Compliance)
- **New page** `/sdk/master-controls`, reusing the Control Library UI
- Sidebar entry between Control Library and Provenance
- 13,588 MCs with all filters, pagination, click-through detail
- Connects to the production DB
---
## DB tables (new/changed)
| Table | Repo | Rows (local) | Rows (production) |
|-------|------|--------------|-------------------|
| compliance.master_controls | Core | 13,588 | 13,588 |
| compliance.master_control_members | Core | 83,073 | 83,073 |
| compliance.object_ontology | Core | 74 | 74 |
| compliance.object_groups | Core | 16,683 | - |
| compliance.doc_check_controls | Core | 1,874 | 1,874 |
| compliance.gap_projects | Compliance | 1 | 0 |
---
## OPEN / NEXT SESSION
1. **Orca deploy fix:** production does not deploy automatically (webhook + docker pull problem)
2. **Gap analysis v2, current state:** frontend step 2 deployed, backend deployed, but Orca is blocking
3. Deploy **Tenant Document Upload** (RAG service rebuild)
4. **Push the compliance repo to gitea:** currently "Everything up-to-date"; Orca has to be redeployed manually
5. **Extend the MC browser:** improve the detail view with member controls
---
## BACKUPS (on the MacBook)
| File | Contents |
|------|----------|
| `backup_pre_gpre3_20260510.dump` | Before the gpre3 live run (171 MB) |
| `backup_session_end_20260511.dump` | End of session |
| `production_backup_20260508.dump` | Production after Phase 2 |
| `gpre0_checkpoints_backup_20260508/` | 10 corrections JSONs |
---
## API costs (Anthropic)
| Phase | Model | Cost |
|-------|-------|------|
| Phase 0: Quality audit | Sonnet | $2.92 |
| Phase 0b: Quality audit v2 | Sonnet | $5.93 |
| Phase 2: 174K re-classification | Haiku | ~$50 |
| Phase 2b: Generic token fix | Haiku | $7.54 |
| Phase 2c: Subtopics R1 | Haiku | $20.22 |
| Phase 2c: Subtopics R2 | Haiku | $12.03 |
| Phase 2d: Bad subtopics | Haiku | ~$0.50 |
| 5K test run | Sonnet | $5.32 |
| doc_check_controls | Haiku | ~$5 |
| **Total** | | **~$110** |
---
## STRATEGIC DECISIONS (in memory)
1. **3 use cases:** gap analysis (prio 1), vendor risk (prio 2), Web3/crypto as a vertical (prio 3)
2. **No reproduction of standards:** obligation extraction instead of ISO texts (legally safe)
3. **Regulatory ingestion engine:** BAuA/OSHA crawlers as a template for automated source feeds
4. **CE compliance crossover:** IACE × master controls for trigger-based compliance hints
@@ -0,0 +1,335 @@
# Instruction: Test Strategy Block C
**Repo:** `/Users/benjaminadmin/Projekte/breakpilot-core/`
**Directory:** `control-pipeline/tests/`
**Created:** 2026-05-01
**Estimated effort:** 2-3 days
## Starting point
- 221 existing tests in 7 files (do NOT change!)
- 40 golden test cases (golden_controls.yaml)
- 24 demo cases (demo_cases.yaml)
- All tests are pure Python, no DB required
- Pipeline v1 complete: 151,675 unique controls, 15,291 dependencies
## Task 1: Real-world benchmarks (C1)
### What to do
Manually review 10 real German e-commerce websites and create a ground-truth YAML file for each.
### Directory
```
control-pipeline/tests/benchmarks/
├── amazon_de.yaml
├── zalando_de.yaml
├── otto_de.yaml
├── lidl_de.yaml
├── check24_de.yaml
├── booking_de.yaml
├── thomann_de.yaml
├── aboutyou_de.yaml
├── mytheresa_com.yaml
└── kleiner_shop.yaml
```
### Format per website
```yaml
website: amazon.de
url: https://www.amazon.de
checked_at: "2026-05-XX"
checked_by: "Name"
ground_truth:
  impressum:
    present: true/false
    complete: true/false  # name, address, email, commercial register number, VAT ID
    within_2_clicks: true/false
    missing_fields: []  # e.g. ["USt-ID", "Handelsregister"]
  datenschutzerklaerung:
    present: true/false
    art13_complete: true/false
    missing_art13_fields: []  # e.g. ["Speicherdauer", "Empfaenger"]
    rechtsgrundlagen_korrekt: true/false
    wrong_legal_bases: []  # e.g. ["Analytics auf lit. f statt lit. a"]
  cookie_banner:
    present: true/false
    reject_equally_easy: true/false  # CNIL: reject must be equally prominent
    cookies_before_consent: true/false  # Planet49: cookies set BEFORE consent?
    dark_patterns: []  # e.g. ["Ablehnen-Button kleiner", "Ablehnen hinter Einstellungen"]
  widerrufsbelehrung:
    present: true/false
    matches_legal_template: true/false  # statutory template
  agb:
    present: true/false
  checkout_button_text: "..."  # e.g. "Jetzt kaufen" (correct) vs "Weiter" (wrong)
  google_fonts_external: true/false
  google_analytics: true/false
  third_party_services:
    - name: "Google Analytics"
      detected: true
      consent_required: true
      consent_obtained_before_load: false
    - name: "Facebook Pixel"
      detected: true
      consent_required: true
      consent_obtained_before_load: false
expected_findings:
  - "Cookie-Banner: Ablehnen nicht gleichwertig"
  - "Google Analytics ohne vorherige Einwilligung"
  - "DSE: Rechtsgrundlage fuer Analytics falsch"
expected_no_findings:
  - "Impressum fehlt"  # present, must not be flagged
```
### Test-Runner
```python
# control-pipeline/tests/test_benchmarks.py
"""
Real-world benchmark tests: compare agent findings with the manual ground truth.
Requires: the Compliance Agent must be running (https://macmini:3007/sdk/agent)
"""
import os

import pytest
import yaml

BENCHMARK_DIR = os.path.join(os.path.dirname(__file__), "benchmarks")


def load_benchmarks():
    """Load every ground-truth YAML file; return [] if the directory is missing."""
    if not os.path.isdir(BENCHMARK_DIR):
        return []
    cases = []
    for f in sorted(os.listdir(BENCHMARK_DIR)):
        if f.endswith(".yaml"):
            with open(os.path.join(BENCHMARK_DIR, f), encoding="utf-8") as fh:
                cases.append(yaml.safe_load(fh))
    return cases


class TestBenchmarks:
    """Measure precision/recall against the ground truth."""

    @pytest.mark.parametrize("case", load_benchmarks(), ids=lambda c: c["website"])
    def test_benchmark(self, case):
        # TODO: run the agent against the website
        # TODO: compare findings with expected_findings
        # TODO: compute precision + recall
        pytest.skip("benchmark comparison not implemented yet")
```
### How the ground truth is created
1. Open the website in the browser
2. Check the Impressum (all mandatory fields under § 5 DDG)
3. Read the privacy policy (Art. 13 DSGVO checklist)
4. Test the cookie banner (is rejecting equally easy? cookies before consent?)
5. Check the withdrawal notice (Widerrufsbelehrung) against the statutory template
6. Browser DevTools: Network tab → external requests before consent?
7. Document everything in YAML
**Target metrics:**
- Precision > 80% (few false positives)
- Recall > 70% (finds most real problems)
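The two target metrics can be computed per website from `expected_findings` and the agent's output. This is a minimal sketch that assumes findings are compared as exact strings; a real comparison would probably need fuzzy or ID-based matching. The function names are hypothetical.

```python
def precision_recall(expected: list[str], found: list[str]) -> tuple[float, float]:
    """Return (precision, recall) for one benchmark case.

    Assumption: a finding counts as correct only if its text matches an
    expected finding exactly.
    """
    expected_set, found_set = set(expected), set(found)
    true_positives = len(expected_set & found_set)
    # Convention chosen here: an empty set on either side yields 1.0.
    precision = true_positives / len(found_set) if found_set else 1.0
    recall = true_positives / len(expected_set) if expected_set else 1.0
    return precision, recall


def meets_targets(precision: float, recall: float) -> bool:
    """Apply the target thresholds: precision > 80%, recall > 70%."""
    return precision > 0.80 and recall > 0.70


if __name__ == "__main__":
    p, r = precision_recall(
        ["Cookie-Banner: Ablehnen nicht gleichwertig", "Google Analytics ohne vorherige Einwilligung"],
        ["Google Analytics ohne vorherige Einwilligung", "Impressum fehlt"],
    )
    print(f"precision={p:.2f} recall={r:.2f} ok={meets_targets(p, r)}")
```

Entries from `expected_no_findings` that show up in the agent output are false positives and lower precision automatically, since they are never part of `expected`.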
---
## Task 2: Adversarial tests (C2)
### What to do
Create 30 tricky test cases that challenge the agent/controls.
### File
`control-pipeline/tests/adversarial_cases.yaml`
### Categories
**A. Wrong legal basis (8 cases):**
- Analytics on lit. f instead of lit. a
- Marketing emails on lit. b instead of lit. a
- Employee tracking on lit. f instead of a works agreement (Betriebsvereinbarung)
- Biometric data on lit. f instead of Art. 9
- Profiling on lit. f instead of Art. 22
- Newsletter on lit. b instead of lit. a
- Social login on lit. b instead of lit. a
- Credit scoring on lit. f instead of lit. a + Art. 22
**B. Dark patterns (6 cases):**
- Reject button exists but is 3px in size and grey
- "Alle akzeptieren" is prominent, "Einstellungen" is offered instead of "Ablehnen"
- Cookie wall: content only visible after consent
- Pre-ticked checkboxes (Planet49)
- Confirm-shaming: "Nein, ich moechte keine sichere Verbindung"
- Rejecting takes 3 clicks, accepting only 1
**C. Almost-complete documents (6 cases):**
- Impressum complete except for the VAT ID (USt-ID)
- DSE without retention period
- DSE without DPO (DSB) contact
- Withdrawal notice with the wrong start of the withdrawal period
- AGB without place of jurisdiction
- Cookie policy without a list of all cookies
**D. Semantically similar but different (5 cases):**
- "Admin-MFA" vs "User-MFA" (different scopes!)
- "Daten loeschen nach Kuendigung" vs "Daten loeschen nach Aufbewahrungsfrist" (delete after termination vs. after the retention period)
- "Rate Limiting API" vs "Rate Limiting Login"
- "Verschluesselung at rest" vs "Verschluesselung in transit"
- "Incident Response Plan" vs "Business Continuity Plan"
**E. Semantically different but same-sounding (5 cases):**
- "Einwilligung" (DSGVO) vs "Einwilligung" (advertising)
- "Verarbeitung" (data) vs "Verarbeitung" (food processing)
- "Risikobewertung" (DSGVO DSFA) vs "Risikobewertung" (financial risk)
- "Audit" (data protection) vs "Audit" (finance)
- "Zertifizierung" (ISO 27001) vs "Zertifizierung" (CE marking)
### Format
```yaml
- id: ADV-LIT-001
  category: wrong_legal_basis
  input: "Wir verarbeiten Ihre Daten fuer Webanalyse auf Grundlage unseres berechtigten Interesses (Art. 6 Abs. 1 lit. f DSGVO)."
  context: "DSE-Abschnitt ueber Google Analytics"
  expected:
    finding: true
    finding_type: "wrong_legal_basis"
    correct_basis: "Art. 6 Abs. 1 lit. a (Einwilligung)"
    reason: "Analytics erfordert Einwilligung, nicht berechtigtes Interesse (EuGH C-673/17 Planet49)"
  difficulty: medium  # easy / medium / hard
```
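Because the 30 cases are written by hand, a structural check before they reach the test suite can catch typos early. The sketch below validates one already-parsed case (a plain `dict`, e.g. from `yaml.safe_load`) against the format shown above. The required-key set is inferred from that single example and is an assumption, as is the helper name.

```python
# Keys taken from the ADV-LIT-001 example; treating all of them as required is an assumption.
REQUIRED_KEYS = {"id", "category", "input", "context", "expected", "difficulty"}
DIFFICULTIES = {"easy", "medium", "hard"}


def validate_case(case: dict) -> list[str]:
    """Return a list of problems for one adversarial case (empty list = valid)."""
    errors = []
    missing = REQUIRED_KEYS - case.keys()
    if missing:
        errors.append(f"missing keys: {sorted(missing)}")
    if case.get("difficulty") not in DIFFICULTIES:
        errors.append(f"difficulty must be one of {sorted(DIFFICULTIES)}")
    expected = case.get("expected")
    if not isinstance(expected, dict) or not isinstance(expected.get("finding"), bool):
        errors.append("expected.finding must be true or false")
    return errors


if __name__ == "__main__":
    sample = {
        "id": "ADV-LIT-001",
        "category": "wrong_legal_basis",
        "input": "...",
        "context": "DSE-Abschnitt ueber Google Analytics",
        "expected": {"finding": True, "finding_type": "wrong_legal_basis"},
        "difficulty": "medium",
    }
    print(validate_case(sample))
```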
---
## Task 3: Regression harness (C3)
### What to do
1. `conftest.py` with shared fixtures
2. `test_regression.py` with snapshot tests
3. CI/CD quality gate
### conftest.py
```python
# control-pipeline/tests/conftest.py
import os

import pytest


@pytest.fixture(scope="session")
def db_session():
    """DB session for integration tests — skip if no DATABASE_URL."""
    url = os.getenv("DATABASE_URL")
    if not url:
        pytest.skip("DATABASE_URL not set")
    from db.session import SessionLocal
    db = SessionLocal()
    yield db
    db.close()


@pytest.fixture
def sample_controls(db_session):
    """Load 100 random draft controls for regression testing."""
    from sqlalchemy import text
    rows = db_session.execute(text("""
        SELECT control_id, title, category, severity,
               generation_metadata->>'assertion' as assertion
        FROM compliance.canonical_controls
        WHERE release_state = 'draft' AND decomposition_method = 'pass0b'
        ORDER BY random() LIMIT 100
    """)).fetchall()
    return [dict(r._mapping) for r in rows]
```
### test_regression.py
```python
# control-pipeline/tests/test_regression.py
"""
Regression tests: check whether pipeline updates change existing controls.
Requires: DATABASE_URL environment variable
"""
import pytest


class TestControlStability:
    def test_draft_count_stable(self, db_session):
        """Draft count must stay within the expected range (140K to 200K)."""
        from sqlalchemy import text
        count = db_session.execute(text(
            "SELECT COUNT(*) FROM compliance.canonical_controls "
            "WHERE release_state = 'draft' AND decomposition_method = 'pass0b'"
        )).scalar()
        assert count > 140000, f"Draft count too low: {count}"
        assert count < 200000, f"Draft count too high: {count}"

    def test_no_null_assertions(self, db_session):
        """Fewer than 1000 draft controls may lack an assertion."""
        from sqlalchemy import text
        null_count = db_session.execute(text(
            "SELECT COUNT(*) FROM compliance.canonical_controls "
            "WHERE release_state = 'draft' AND decomposition_method = 'pass0b' "
            "AND (generation_metadata->>'assertion' IS NULL OR generation_metadata->>'assertion' = '')"
        )).scalar()
        assert null_count < 1000, f"Too many controls without assertion: {null_count}"

    def test_dependency_graph_valid(self, db_session):
        """Active dependency count must exceed 10,000 (cycle detection is not implemented here)."""
        from sqlalchemy import text
        active_count = db_session.execute(text(
            "SELECT COUNT(*) FROM compliance.control_dependencies WHERE is_active = true"
        )).scalar()
        assert active_count > 10000, f"Too few dependencies: {active_count}"


class TestQualityGates:
    def test_duplicate_rate(self, db_session):
        pytest.skip("To implement: duplicate_rate < 5%")

    def test_evidence_leak_rate(self, db_session):
        pytest.skip("To implement: evidence_leak < 2%")
```
### CI/CD quality gate
```yaml
# .gitea/workflows/quality-gate.yml
name: Control Pipeline Quality Gate
on:
  push:
    paths:
      - 'control-pipeline/**'
jobs:
  quality-gate:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Run Tests
        run: |
          cd control-pipeline
          pip install -r requirements.txt pytest pyyaml
          PYTHONPATH=. pytest tests/ -v --tb=short -x
      - name: Quality Metrics
        run: |
          # Only if the container is running
          curl -sf http://127.0.0.1:8098/v1/canonical/generate/quality-metrics || echo "Pipeline not running, skip metrics"
```
---
## IMPORTANT
- Do NOT change the existing 221 tests
- Do NOT deploy (do not restart containers)
- All new tests must run without a DB (except test_regression.py, which has a skip marker)
- Create the ground-truth YAML manually (no LLM for the reference data!)
- If you have questions: read the memory under `/Users/benjaminadmin/.claude/projects/-Users-benjaminadmin-Projekte-breakpilot-core/memory/`
@@ -0,0 +1,132 @@
# Lessons Learned: MC `check_type` classification (CRITICAL for CRA + all new frameworks)
Date: 2026-05-17
Trigger: the BMW compliance check returned 0/381 cookie MCs, 3/75 Impressum MCs, and 43/571 DSE MCs; every doc type was below 20%.
## TL;DR
**Today's master controls (MCs) mix three structurally different classes of checks in a single table (`compliance.doc_check_controls`). Only one of the three classes can be matched against document text. The other two are wrongly counted as "failed" because they cannot be verified through text matching at all.**
For the CRA MC generation (currently running in Pass 0a with Haiku), every MC **MUST** get a **`check_type` field** before it goes into the database. Otherwise the problem repeats.
## The three classes
| `check_type` | Check-question pattern | Example | How is it verified? |
|---|---|---|---|
| **`text`** | "Enthaelt das Dokument...", "Wird im X die Y benannt?", "Ist im Text aufgelistet..." | "Wird im Impressum die Aufsichtsbehoerde benannt?" | Regex / embedding match against the doc text |
| **`process`** | "Ist sichergestellt...", "Ist implementiert...", "Wird durchgefuehrt..." | "Ist sichergestellt, dass Cookies erst nach Einwilligung gespeichert werden?" | Evidence/TOM check; no doc text available |
| **`review`** | "Sind ALLE / Werden ALLE / Stimmt X mit Y ueberein?" | "Sind alle Verarbeitungszwecke vollstaendig erfasst?" | Human (DSB): checklist, not automated |
## Findings from the BMW data
| Doc type | TEXT (matchable) | PROCESS | UNCLEAR/REVIEW | Total | % TEXT |
|---|---|---|---|---|---|
| cookie | 30 | 49 | 302 | 381 | **8%** |
| dse | 72 | 139 | 359 | 571 | **13%** |
| impressum | 14 | 14 | 47 | 75 | **19%** |
| agb | 24 | 20 | 69 | 113 | 21% |
| widerruf | 29 | 26 | 96 | 153 | 19% |
| loeschkonzept | 38 | 39 | 232 | 309 | 12% |
**Even with perfect matching, the ceiling for doc_check is 8-21%**, because 79-92% of the MCs cannot be verified through text matching. They are not "bad MCs"; they are filed in the wrong drawer.
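The `% TEXT` column is simply the TEXT count divided by the total. The short sketch below recomputes it from the TEXT and Total columns of the table; the variable names are illustrative only.

```python
# (TEXT count, total) per doc type, copied from the table above.
BMW_COUNTS = {
    "cookie": (30, 381),
    "dse": (72, 571),
    "impressum": (14, 75),
    "agb": (24, 113),
    "widerruf": (29, 153),
    "loeschkonzept": (38, 309),
}


def text_share(text_count: int, total: int) -> int:
    """Upper bound of the doc_check pass rate in percent, rounded."""
    return round(100 * text_count / total)


if __name__ == "__main__":
    for doc_type, (text_count, total) in BMW_COUNTS.items():
        print(f"{doc_type:14s} {text_share(text_count, total):3d}%")
```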
## Consequences for CRA generation (Pass 0a)
### 1. Prompt change (main measure)
The Pass 0a prompt for Haiku/Sonnet MUST enforce a `check_type` field for every generated control. Proposal:
```jsonc
{
  "control_id": "CRA-...-A01",
  "title": "...",
  "check_question": "...",
  "check_type": "text" | "process" | "review", // REQUIRED
  "rationale_for_check_type": "..."
}
```
Classification rule in the prompt:
> If your `check_question` starts with "Enthaelt", "Wird … genannt/aufgelistet/erwaehnt", or "Steht im Text" -> `text`.
> If it starts with "Ist sichergestellt", "Ist implementiert", "Wird durchgefuehrt", or "Existiert ein Prozess" -> `process`.
> If it starts with "Sind ALLE", "Werden ALLE", or "Stimmt X mit Y ueberein" -> `review`.
> When in doubt: prefer `review` over `text`.
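The same wording rule can double as a cheap pre-check or output validator outside the prompt. The following is a rough sketch, not the project's classifier: the regexes are my reading of the rule above (plus "benannt" from the example table), they accept both ASCII transliteration and umlauts, and anything unmatched falls back to `review`.

```python
import re

# Order matters: `review` is tested first, so doubtful cases stay in the safe bucket.
RULES = [
    ("review", re.compile(r"^(sind alle|werden alle)\b|^stimmt\b.*(ue|ü)berein", re.I)),
    ("process", re.compile(
        r"^(ist sichergestellt|ist implementiert|existiert ein prozess)\b"
        r"|^wird\b.*\bdurchgef(ue|ü)hrt", re.I)),
    ("text", re.compile(
        r"^(enth(ae|ä)lt|steht im text)\b"
        r"|^wird\b.*\b(genannt|benannt|aufgelistet|erw(ae|ä)hnt)", re.I)),
]


def classify_check_question(question: str) -> str:
    """Map a German check_question to text/process/review by its wording."""
    q = question.strip()
    for check_type, pattern in RULES:
        if pattern.search(q):
            return check_type
    return "review"  # when in doubt: prefer review over text


if __name__ == "__main__":
    print(classify_check_question("Wird im Impressum die Aufsichtsbehoerde benannt?"))
```

A wording heuristic like this will misjudge questions that are phrased unusually, so it is better suited to flagging disagreements with the LLM's `check_type` than to replacing it.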
### 2. Critically validate the doc-type assignment
Many of today's MCs are misassigned (e.g. "Bestellbestätigung implementieren" lands in the `impressum` doc_type but belongs to AGB/Widerruf). For CRA:
- **`doc_type` may only take values from an explicit list**, defined per regulation.
- For CRA, for example: `produkt_konformitaetserklaerung`, `risiko_management_dossier`, `sbom`, `cra_dse`, `meldepflichten_doku`.
- Explicitly forbid wrong assignments in the prompt: "If the control does not clearly fit exactly ONE of these doc types, set `doc_type: 'unassigned'` and `check_type: 'review'`."
### 3. Three tables instead of one
Current architecture:
- `compliance.doc_check_controls` <- all 1,874 MCs (everything mixed together)
Recommended for CRA + refactor:
- `compliance.text_check_controls` <- only `check_type='text'`
- `compliance.process_check_controls` <- only `check_type='process'`, checked via evidence/TOM
- `compliance.review_checklist_controls` <- only `check_type='review'`, checked via the DSB workflow
If a schema change is not possible (CLAUDE.md: DB is frozen), use a sidecar SQLite database `mc_classification.db` or a new column as an add-only migration.
### 4. Respect the dedupe phase
In Pass 0b (dedup), `check_type` must be a **mandatory dedupe key**:
- Two MCs with the same statement but a different `check_type` are **not** duplicates; they check different things ("is mentioned in the text" vs. "is technically implemented").
- Today these are wrongly merged → even more mixing.
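To illustrate the point, here is a minimal sketch of a dedupe key that includes `check_type`. It is not the `BatchDedupRunner` logic: the `statement` field and the whitespace/case normalization are assumptions made for the example.

```python
from collections import defaultdict


def dedupe_key(mc: dict) -> tuple[str, str]:
    """Key on the normalized statement AND the check_type (both required)."""
    normalized = " ".join(mc["statement"].lower().split())
    return (normalized, mc["check_type"])


def group_duplicates(mcs: list[dict]) -> dict[tuple[str, str], list[str]]:
    """Group control IDs that share the same dedupe key."""
    groups: dict[tuple[str, str], list[str]] = defaultdict(list)
    for mc in mcs:
        groups[dedupe_key(mc)].append(mc["control_id"])
    return dict(groups)


if __name__ == "__main__":
    sample = [
        {"control_id": "A", "statement": "Cookies erst nach Einwilligung", "check_type": "text"},
        {"control_id": "B", "statement": "Cookies erst nach Einwilligung", "check_type": "process"},
        {"control_id": "C", "statement": "cookies  erst nach einwilligung", "check_type": "text"},
    ]
    for key, ids in group_duplicates(sample).items():
        print(key, ids)
```

In the sample, A and C collapse into one group while B stays separate, because it checks the implementation rather than the document wording.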
### 5. Rebuild the matching engine afterwards
The actual doc-check matching system then only needs to process `check_type='text'` MCs. The others are routed to their own modules:
- `text` MCs -> `rag_document_checker` (regex, later embeddings)
- `process` MCs -> new `evidence_check_runner` (customer uploads evidence/TOM)
- `review` MCs -> new `review_checklist_ui` (DSB answers manually)
## Checklist for the CRA session
- [ ] Pass 0a prompt extended with the mandatory `check_type` field (wording rule + examples)
- [ ] Pass 0a prompt forces `doc_type` from an explicit whitelist
- [ ] Pass 0b dedup key includes `check_type`
- [ ] Output validator rejects MCs without `check_type`
- [ ] DB schema (or sidecar) has a `check_type` column with default `review` (safe fallback)
- [ ] Spot check of 50 generated CRA MCs before the bulk run: the TEXT share should be 30-50% (more than for the old DSGVO MCs, because CRA is strongly document-focused).
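The validator item from the checklist could look roughly like this. It is a sketch under assumptions: the doc-type whitelist reuses the example CRA values named earlier in this note (they were given as examples, not as a final list), and the function name is hypothetical.

```python
CHECK_TYPES = {"text", "process", "review"}

# Example CRA doc types from this note plus the 'unassigned' escape hatch;
# the real whitelist has to be defined per regulation.
CRA_DOC_TYPES = {
    "produkt_konformitaetserklaerung",
    "risiko_management_dossier",
    "sbom",
    "cra_dse",
    "meldepflichten_doku",
    "unassigned",
}


def validate_mc(mc: dict) -> list[str]:
    """Return the reasons for rejecting a generated MC (empty list = accept)."""
    errors = []
    if mc.get("check_type") not in CHECK_TYPES:
        errors.append("check_type missing or invalid")
    if mc.get("doc_type") not in CRA_DOC_TYPES:
        errors.append("doc_type not in whitelist")
    # Unassigned controls must land in the safe review bucket.
    if mc.get("doc_type") == "unassigned" and mc.get("check_type") != "review":
        errors.append("unassigned doc_type requires check_type 'review'")
    return errors


if __name__ == "__main__":
    print(validate_mc({"control_id": "CRA-...-A01", "doc_type": "sbom"}))
```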
## Update 2026-05-17: findings from the parallel CRA session
The running CRA generation has a `verification_method` field (document/tool/hybrid/code_review/empty) that is **NOT identical** to `check_type`:
- `verification_method` asks: **WHAT do you look at?** (document, tool output, code)
- `check_type` asks: **CAN this be verified by text matching?** (text/process/review)
A control can have `verification_method=document` AND still be `check_type=process`. Example: "Wird die SBOM regelmaessig (mindestens monatlich) aktualisiert?" You look at the SBOM history document, but you are checking a process. Text matching will never find that.
**Mapping heuristic (good enough for 80% of cases, LLM for the rest):**
| `verification_method` | Auto-mapped `check_type` | LLM needed? |
|---|---|---|
| `tool` | `process` | no |
| `code_review` | `process` | no |
| empty/null | `review` (safe default) | no |
| `document` | `text` for now, verify with a spot check | 10-20% sampling |
| `hybrid` | classify with LLM | yes, all |
**Ideal case (for all FUTURE Pass 0a generations, including CRA if it is regenerated):** generate both fields at the same time instead of deriving one from the other.
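The mapping table translates almost directly into code. This is a sketch of the auto-mapping step only; the function name is hypothetical, and routing unknown `verification_method` values to the LLM is my assumption, since the table does not cover them.

```python
from typing import Optional

# Buckets that can be mapped without an LLM call.
AUTO_MAPPING = {"tool": "process", "code_review": "process"}


def map_check_type(verification_method: Optional[str]) -> tuple[Optional[str], bool]:
    """Return (check_type, needs_llm) for one control."""
    method = (verification_method or "").strip().lower()
    if not method:
        return "review", False  # safe default for empty/null
    if method in AUTO_MAPPING:
        return AUTO_MAPPING[method], False
    if method == "document":
        # Provisional: a 10-20% sample should still be spot-checked.
        return "text", False
    # `hybrid` per the table; unknown values are sent to the LLM as well (assumption).
    return None, True


if __name__ == "__main__":
    for vm in ("tool", "code_review", None, "document", "hybrid"):
        print(vm, "->", map_check_type(vm))
```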
## Backfill workflow for the running CRA generation
1. Let the current Haiku job finish (no restart, no loss)
2. After the job ends: auto-mapping for the unambiguous buckets (tool/code_review/empty)
3. Sonnet classification only for the `document`+`hybrid` subset (~62 calls for 1500 controls, ~$0.05 instead of $2)
4. Reuse `breakpilot-compliance/backend-compliance/scripts/classify_mc_check_type.py`; only the DB query needs adjusting (source table + WHERE filter)
5. Validation: the TEXT share for CRA should be 40-60% (CRA is more document-centric than DSGVO cookie checks)
## Cross-references
- BMW run findings: `breakpilot-compliance` email dated 2026-05-17, check_id `08bcc9dd`
- Existing classification script for retrofitting the old 1,874 MCs: `backend-compliance/scripts/classify_mc_check_type.py`
- Doc-type audit query: same file, at the end
@@ -4,9 +4,21 @@ from api.control_generator_routes import router as generator_router
from api.canonical_control_routes import router as canonical_router
from api.document_compliance_routes import router as document_router
from api.dependency_routes import router as dependency_router
from api.master_control_routes import router as master_control_router
from api.decision_trace_routes import router as decision_trace_router
from api.decision_trace_routes import full_trace_router
from api.compliance_commit_routes import router as compliance_commit_router
from api.decision_event_routes import router as decision_event_router
from api.deployment_check_routes import router as deployment_check_router
router = APIRouter()
router.include_router(generator_router)
router.include_router(canonical_router)
router.include_router(document_router)
router.include_router(dependency_router)
router.include_router(master_control_router)
router.include_router(decision_trace_router)
router.include_router(full_trace_router)
router.include_router(compliance_commit_router)
router.include_router(decision_event_router)
router.include_router(deployment_check_router)
@@ -0,0 +1,255 @@
"""Compliance Commit Ledger API — G2.

Tracks code commits and their compliance impact. SDK reports each commit
with affected controls, building an audit trail for code↔compliance mapping.
"""
import json
import logging
import uuid
from typing import Optional

from fastapi import APIRouter, HTTPException, Query
from pydantic import BaseModel
from sqlalchemy import text

from db.session import SessionLocal

logger = logging.getLogger(__name__)
router = APIRouter(prefix="/v1/compliance-commits", tags=["compliance-commits"])


class CreateCommitRequest(BaseModel):
    tenant_id: str
    project_id: Optional[str] = None
    commit_hash: str
    commit_message: Optional[str] = None
    commit_author: Optional[str] = None
    commit_date: Optional[str] = None
    branch: Optional[str] = None
    repo_url: Optional[str] = None
    affected_control_ids: list[str] = []
    affected_files: list[str] = []
    risk_level: str = "low"
    analysis_summary: Optional[str] = None
    analysis_metadata: dict = {}


@router.post("")
async def register_commit(req: CreateCommitRequest):
    """Register a code commit with its compliance impact."""
    db = SessionLocal()
    try:
        cid = str(uuid.uuid4())
        db.execute(text("""
            INSERT INTO compliance_commits
                (id, tenant_id, project_id, commit_hash, commit_message,
                 commit_author, commit_date, branch, repo_url,
                 affected_control_ids, affected_files,
                 risk_level, analysis_summary, analysis_metadata)
            VALUES
                (CAST(:id AS uuid), CAST(:tenant_id AS uuid), :project_id,
                 :commit_hash, :commit_message, :commit_author,
                 :commit_date, :branch, :repo_url,
                 CAST(:control_ids AS jsonb), CAST(:files AS jsonb),
                 :risk_level, :analysis_summary, CAST(:metadata AS jsonb))
        """), {
            "id": cid,
            "tenant_id": req.tenant_id,
            "project_id": req.project_id,
            "commit_hash": req.commit_hash,
            "commit_message": req.commit_message,
            "commit_author": req.commit_author,
            "commit_date": req.commit_date,
            "branch": req.branch,
            "repo_url": req.repo_url,
            "control_ids": json.dumps(req.affected_control_ids),
            "files": json.dumps(req.affected_files),
            "risk_level": req.risk_level,
            "analysis_summary": req.analysis_summary,
            "metadata": json.dumps(req.analysis_metadata),
        })
        db.commit()
        return {
            "id": cid,
            "status": "registered",
            "affected_controls": len(req.affected_control_ids),
            "risk_level": req.risk_level,
        }
    finally:
        db.close()


@router.get("")
async def list_commits(
    tenant_id: Optional[str] = None,
    control_id: Optional[str] = None,
    risk_level: Optional[str] = None,
    branch: Optional[str] = None,
    since: Optional[str] = None,
    limit: int = Query(50, ge=1, le=500),
    offset: int = Query(0, ge=0),
):
    """List compliance commits with filters."""
    db = SessionLocal()
    try:
        clauses = []
        params: dict = {"limit": limit, "offset": offset}
        if tenant_id:
            clauses.append("tenant_id = CAST(:tenant_id AS uuid)")
            params["tenant_id"] = tenant_id
        if control_id:
            clauses.append("affected_control_ids @> CAST(:cid_json AS jsonb)")
            params["cid_json"] = json.dumps([control_id])
        if risk_level:
            clauses.append("risk_level = :risk")
            params["risk"] = risk_level
        if branch:
            clauses.append("branch = :branch")
            params["branch"] = branch
        if since:
            clauses.append("commit_date >= CAST(:since AS timestamptz)")
            params["since"] = since
        where = "WHERE " + " AND ".join(clauses) if clauses else ""
        rows = db.execute(text(f"""
            SELECT id, commit_hash, commit_message, commit_author, commit_date,
                   branch, affected_control_ids, affected_files, risk_level
            FROM compliance_commits
            {where}
            ORDER BY commit_date DESC NULLS LAST
            LIMIT :limit OFFSET :offset
        """), params).fetchall()
        total = db.execute(text(f"""
            SELECT count(*) FROM compliance_commits {where}
        """), params).scalar()
        return {
            "total": total,
            "commits": [
                {
                    "id": str(r[0]),
                    "commit_hash": r[1],
                    "message": r[2],
                    "author": r[3],
                    "date": str(r[4]) if r[4] else None,
                    "branch": r[5],
                    "affected_control_ids": r[6],
                    "affected_files": r[7],
                    "risk_level": r[8],
                }
                for r in rows
            ],
        }
    finally:
        db.close()


@router.get("/stats")
async def commit_stats(tenant_id: Optional[str] = None):
    """Dashboard stats for compliance commits."""
    db = SessionLocal()
    try:
        tf = ""
        params: dict = {}
        if tenant_id:
            tf = "WHERE tenant_id = CAST(:tid AS uuid)"
            params["tid"] = tenant_id
        risk = db.execute(text(f"""
            SELECT risk_level, count(*) FROM compliance_commits {tf}
            GROUP BY risk_level
        """), params).fetchall()
        recent = db.execute(text(f"""
            SELECT count(*) FROM compliance_commits
            {tf + ' AND' if tf else 'WHERE'} commit_date > NOW() - interval '7 days'
        """), params).scalar()
        total = sum(r[1] for r in risk)
        return {
            "total_commits": total,
            "last_7_days": recent,
            "by_risk_level": {r[0]: r[1] for r in risk},
        }
    finally:
        db.close()


@router.get("/by-control/{control_id}")
async def commits_by_control(
    control_id: str,
    limit: int = Query(50, ge=1, le=200),
):
    """Get all commits that affect a specific control."""
    db = SessionLocal()
    try:
        rows = db.execute(text("""
            SELECT id, commit_hash, commit_message, commit_author, commit_date,
                   branch, repo_url, affected_files, risk_level
            FROM compliance_commits
            WHERE affected_control_ids @> CAST(:cid_json AS jsonb)
            ORDER BY commit_date DESC NULLS LAST
            LIMIT :limit
        """), {
            "cid_json": json.dumps([control_id]),
            "limit": limit,
        }).fetchall()
        return {
            "control_id": control_id,
            "total_commits": len(rows),
            "commits": [
                {
                    "id": str(r[0]),
                    "commit_hash": r[1],
                    "message": r[2],
                    "author": r[3],
                    "date": str(r[4]) if r[4] else None,
                    "branch": r[5],
                    "repo_url": r[6],
                    "affected_files": r[7],
                    "risk_level": r[8],
                }
                for r in rows
            ],
        }
    finally:
        db.close()


@router.get("/{commit_id}")
async def get_commit(commit_id: str):
    """Get details of a single compliance commit."""
    db = SessionLocal()
    try:
        row = db.execute(text("""
            SELECT * FROM compliance_commits WHERE id = CAST(:id AS uuid)
        """), {"id": commit_id}).fetchone()
        if not row:
            raise HTTPException(status_code=404, detail="Commit not found")
        return {
            "id": str(row.id),
            "tenant_id": str(row.tenant_id),
            "project_id": str(row.project_id) if row.project_id else None,
            "commit_hash": row.commit_hash,
            "commit_message": row.commit_message,
            "commit_author": row.commit_author,
            "commit_date": str(row.commit_date) if row.commit_date else None,
            "branch": row.branch,
            "repo_url": row.repo_url,
            "affected_control_ids": row.affected_control_ids,
            "affected_files": row.affected_files,
            "risk_level": row.risk_level,
            "analysis_summary": row.analysis_summary,
            "analysis_metadata": row.analysis_metadata,
        }
    finally:
        db.close()
@@ -1553,6 +1553,7 @@ async def get_repair_backfill_status(backfill_id: str):
class BatchDedupRequest(BaseModel):
    dry_run: bool = True
    hint_filter: Optional[str] = None  # Only process groups matching this hint prefix
    since: Optional[str] = None  # ISO datetime — scope to controls created at/after this


_batch_dedup_status: dict = {}
@@ -1567,7 +1568,15 @@ async def _run_batch_dedup(req: BatchDedupRequest, dedup_id: str):
         runner = BatchDedupRunner(db)
         _batch_dedup_status[dedup_id] = {"status": "running", "phase": "starting"}
-        stats = await runner.run(dry_run=req.dry_run, hint_filter=req.hint_filter)
+        since_dt = None
+        if req.since:
+            from datetime import datetime
+            since_dt = datetime.fromisoformat(req.since.replace("Z", "+00:00"))
+        stats = await runner.run(
+            dry_run=req.dry_run,
+            hint_filter=req.hint_filter,
+            since=since_dt,
+        )
         _batch_dedup_status[dedup_id] = {
             "status": "completed",
@@ -2293,18 +2302,95 @@ async def get_batch_process_status(job_id: str):
return status
class RunPass0aRequest(BaseModel):
limit: int = 0 # 0 = no limit
batch_size: int = 5
use_anthropic: bool = True
category_filter: Optional[str] = None
source_filter: Optional[str] = None
_pass0a_status: dict = {}
async def _run_pass0a_background(req: RunPass0aRequest, job_id: str):
"""Run Pass 0a in background with own DB session."""
from services.decomposition_pass import DecompositionPass
db = SessionLocal()
try:
_pass0a_status[job_id] = {"status": "running"}
dp = DecompositionPass(db)
result = await dp.run_pass0a(
limit=req.limit,
batch_size=req.batch_size,
use_anthropic=req.use_anthropic,
category_filter=req.category_filter,
source_filter=req.source_filter,
)
_pass0a_status[job_id] = {"status": "completed", **result}
logger.info("Pass 0a job %s completed: %s", job_id, result)
except Exception as e:
logger.error("Pass 0a job %s failed: %s", job_id, e)
_pass0a_status[job_id] = {"status": "failed", "error": str(e)}
finally:
db.close()
@router.post("/generate/run-pass0a")
async def run_pass0a(req: RunPass0aRequest):
"""Run Pass 0a (Obligation Extraction) on undecomposed controls.
Extracts individual normative obligations from rich controls using LLM.
Runs in background — poll status via GET /generate/pass0a-status/{job_id}.
"""
import uuid
job_id = str(uuid.uuid4())[:8]
_pass0a_status[job_id] = {"status": "starting"}
asyncio.create_task(_run_pass0a_background(req, job_id))
return {
"status": "running",
"job_id": job_id,
"message": f"Pass 0a started. Poll /generate/pass0a-status/{job_id}",
}
@router.get("/generate/pass0a-status/{job_id}")
async def get_pass0a_status(job_id: str):
"""Get status of a Pass 0a job."""
status = _pass0a_status.get(job_id)
if not status:
raise HTTPException(status_code=404, detail="Pass 0a job not found")
return status
class SubmitPass0bRequest(BaseModel):
limit: int = 10
batch_size: int = 5
_last_submit_batch_id: str = ""
_last_submit_time: float = 0
@router.post("/generate/submit-pass0b")
async def submit_pass0b(req: SubmitPass0bRequest):
"""Submit Pass 0b batch to Anthropic Batch API.
Loads unprocessed obligations, applies pre-LLM filter, submits batch.
Returns batch_id for status polling and later result processing.
SAFETY: Refuses to submit if a batch was submitted in the last 10 minutes.
This prevents duplicate batches from curl retries or timeouts.
"""
import time
global _last_submit_batch_id, _last_submit_time
# Idempotency guard: refuse if last submit was <10 min ago
elapsed = time.time() - _last_submit_time
if elapsed < 600 and _last_submit_batch_id:
return {
"status": "blocked",
"reason": f"Batch {_last_submit_batch_id} was submitted {int(elapsed)}s ago. Wait {int(600 - elapsed)}s before retrying.",
"last_batch_id": _last_submit_batch_id,
}
from services.decomposition_pass import DecompositionPass
db = SessionLocal()
try:
@@ -2313,6 +2399,12 @@ async def submit_pass0b(req: SubmitPass0bRequest):
limit=req.limit,
batch_size=req.batch_size,
)
# Record successful submit
batch_id = result.get("batch_id", "")
if batch_id:
_last_submit_batch_id = batch_id
_last_submit_time = time.time()
logger.info("Submit guard: recorded batch %s", batch_id)
return result
except Exception as e:
logger.error("Submit Pass 0b failed: %s", e)
@@ -2693,3 +2785,199 @@ async def get_quality_metrics(
}
finally:
db.close()
# =============================================================================
# REVIEW CANDIDATE VERIFICATION (Block B — LLM decides DUPLIKAT/VERSCHIEDEN)
# =============================================================================
_REVIEW_VERIFY_SYSTEM = """Du vergleichst Paare von Compliance Controls und entscheidest ob sie Duplikate sind.
Antworte NUR mit einem JSON-Array. Fuer jedes Paar ein Objekt:
{"pair_id": "...", "decision": "DUPLIKAT" oder "VERSCHIEDEN", "reason": "kurze Begruendung"}
DUPLIKAT = gleiche Anforderung, nur anders formuliert.
VERSCHIEDEN = unterschiedliche Anforderungen, auch wenn aehnliche Woerter vorkommen."""
class ReviewVerifyRequest(BaseModel):
limit: int = 0
batch_size: int = 10
dry_run: bool = True
_review_verify_status: dict = {}
async def _run_review_verify(req: ReviewVerifyRequest, job_id: str):
from services.decomposition_pass import (
create_anthropic_batch, fetch_batch_results, check_batch_status,
)
import asyncio as aio
db = SessionLocal()
try:
_review_verify_status[job_id] = {"status": "loading"}
query = """
SELECT r.id::text, r.candidate_control_id, r.candidate_title,
r.matched_control_id, c2.title as matched_title,
r.similarity_score
FROM control_dedup_reviews r
LEFT JOIN canonical_controls c2 ON c2.id = r.matched_control_uuid
WHERE r.review_status = 'pending'
ORDER BY r.similarity_score DESC
"""
if req.limit > 0:
query += f" LIMIT {req.limit}"
rows = db.execute(text(query)).fetchall()
total = len(rows)
_review_verify_status[job_id] = {"status": "preparing", "total": total}
if total == 0:
_review_verify_status[job_id] = {
"status": "completed", "total": 0, "message": "No pending reviews",
}
return
if req.dry_run:
_review_verify_status[job_id] = {
"status": "dry_run", "total": total,
"estimated_requests": (total + req.batch_size - 1) // req.batch_size,
}
return
# Build batch requests
api_requests = []
pair_map = {}
for i in range(0, total, req.batch_size):
batch = rows[i:i + req.batch_size]
prompt = "Vergleiche diese Control-Paare:\n\n"
batch_pairs = []
for r in batch:
pair_id = r[0][:8]
prompt += (
f"Paar {pair_id}:\n"
f" A: {r[1]}: {r[2]}\n"
f" B: {r[3]}: {r[4]}\n"
f" Similarity: {r[5]:.3f}\n\n"
)
batch_pairs.append({"review_id": r[0], "candidate_id": r[1]})
batch_idx = i // req.batch_size
custom_id = f"rv_b{batch_idx:05d}"
pair_map[custom_id] = batch_pairs
api_requests.append({
"custom_id": custom_id,
"params": {
"model": "claude-haiku-4-5-20251001",
"max_tokens": max(1024, len(batch) * 150),
"system": [{
"type": "text",
"text": _REVIEW_VERIFY_SYSTEM,
"cache_control": {"type": "ephemeral"},
}],
"messages": [{"role": "user", "content": prompt}],
},
})
_review_verify_status[job_id] = {
"status": "submitting", "total": total, "requests": len(api_requests),
}
batch_result = await create_anthropic_batch(api_requests)
batch_id = batch_result.get("id", "")
_review_verify_status[job_id] = {
"status": "batch_submitted", "batch_id": batch_id,
"total": total, "requests": len(api_requests),
}
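# Poll for completion (up to 720 x 10 s = 2 hours)
ended = False
for _ in range(720):
await aio.sleep(10)
status = await check_batch_status(batch_id)
if status.get("processing_status") == "ended":
ended = True
break
if not ended:
# Do not fetch results from a batch that is still processing
raise TimeoutError(f"Batch {batch_id} did not finish within 2 hours")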
# Process results
results = await fetch_batch_results(batch_id)
duplicates = 0
different = 0
errors = 0
for result in results:
custom_id = result.get("custom_id", "")
result_data = result.get("result", {})
if result_data.get("type") != "succeeded":
errors += 1
continue
content = result_data.get("message", {}).get("content", [])
text_content = content[0].get("text", "") if content else ""
try:
import json as jmod
import re
json_matches = re.findall(r'\{[^}]+\}', text_content)
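pairs = pair_map.get(custom_id, [])
# Match each decision to its review via the pair_id the model echoes back;
# fall back to position only when the pair_id is missing or unknown.
by_pair_id = {p["review_id"][:8]: p for p in pairs}
for j, match_str in enumerate(json_matches):
try:
parsed = jmod.loads(match_str)
except Exception:
continue
decision = str(parsed.get("decision", "")).upper()
pair = by_pair_id.get(str(parsed.get("pair_id", "")))
if pair is None and j < len(pairs):
pair = pairs[j]
if pair is not None:
review_id = pair["review_id"]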
if "DUPLIKAT" in decision:
db.execute(text("""
UPDATE control_dedup_reviews
SET review_status = 'duplicate', review_notes = :notes
WHERE id = CAST(:rid AS uuid)
"""), {"rid": review_id, "notes": parsed.get("reason", "")})
duplicates += 1
else:
db.execute(text("""
UPDATE control_dedup_reviews
SET review_status = 'different', review_notes = :notes
WHERE id = CAST(:rid AS uuid)
"""), {"rid": review_id, "notes": parsed.get("reason", "")})
different += 1
db.commit()
except Exception as e:
logger.error("Review verify parse error: %s", e)
errors += 1
try:
db.rollback()
except Exception:
pass
_review_verify_status[job_id] = {
"status": "completed", "batch_id": batch_id, "total": total,
"duplicates": duplicates, "different": different, "errors": errors,
}
except Exception as e:
logger.error("Review verify %s failed: %s", job_id, e)
_review_verify_status[job_id] = {"status": "failed", "error": str(e)}
finally:
db.close()
@router.post("/generate/review-verify")
async def start_review_verify(req: ReviewVerifyRequest):
"""LLM-verify review candidates (DUPLIKAT/VERSCHIEDEN) via Haiku Batch."""
import uuid as uuid_mod
job_id = str(uuid_mod.uuid4())[:8]
_review_verify_status[job_id] = {"status": "starting"}
asyncio.create_task(_run_review_verify(req, job_id))
return {
"status": "running", "job_id": job_id,
"message": f"Poll /generate/review-verify-status/{job_id}",
}
@router.get("/generate/review-verify-status/{job_id}")
async def get_review_verify_status(job_id: str):
status = _review_verify_status.get(job_id)
if not status:
raise HTTPException(status_code=404, detail="Review verify job not found")
return status
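The review-verify prompt asks the model for a JSON array with one object per pair, each carrying a `pair_id`. As a sketch only, the reply could be decoded as a whole array and keyed by that `pair_id` rather than scanned object by object. `parse_decisions` is a hypothetical helper, not part of the original module, and it assumes the first `[` in the reply opens the array.

```python
import json


def parse_decisions(text_content: str) -> dict[str, dict]:
    """Return {pair_id: {"decision": ..., "reason": ...}} from a model reply.

    Decodes the first JSON array in the reply and ignores surrounding prose.
    Entries without a pair_id or without a clear verdict are skipped.
    """
    start = text_content.find("[")
    if start == -1:
        return {}
    try:
        # raw_decode stops at the end of the array, so trailing text is fine.
        items, _ = json.JSONDecoder().raw_decode(text_content[start:])
    except json.JSONDecodeError:
        return {}
    out: dict[str, dict] = {}
    for item in items:
        if not isinstance(item, dict) or "pair_id" not in item:
            continue
        decision = str(item.get("decision", "")).upper()
        if decision not in ("DUPLIKAT", "VERSCHIEDEN"):
            continue
        out[str(item["pair_id"])] = {
            "decision": decision,
            "reason": item.get("reason", ""),
        }
    return out
```

Keying by `pair_id` means a skipped or reordered entry cannot shift a verdict onto the wrong review row, and a `}` inside a reason string does not break parsing.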
@@ -0,0 +1,224 @@
"""Decision Events API — G3 Full Decision Memory.
Event-stream for each control's compliance lifecycle:
assessment → decision → fix → verification → (failure → new cycle)
"""
import json
import logging
import uuid
from typing import Optional
from fastapi import APIRouter, HTTPException, Query
from pydantic import BaseModel
from sqlalchemy import text
from db.session import SessionLocal
logger = logging.getLogger(__name__)
router = APIRouter(prefix="/v1/decision-events", tags=["decision-events"])
class CreateEventRequest(BaseModel):
control_uuid: str
decision_trace_id: Optional[str] = None
tenant_id: Optional[str] = None
event_type: str
input_state: dict = {}
output_state: dict = {}
summary: Optional[str] = None
actor: Optional[str] = None
evidence_ids: list[str] = []
metadata: dict = {}
@router.post("")
async def create_event(req: CreateEventRequest):
"""Record a decision event in the compliance lifecycle."""
db = SessionLocal()
try:
eid = str(uuid.uuid4())
db.execute(text("""
INSERT INTO decision_events
(id, decision_trace_id, control_uuid, tenant_id,
event_type, input_state, output_state,
summary, actor, evidence_ids, metadata)
VALUES
(CAST(:id AS uuid),
CASE WHEN :trace_id IS NOT NULL THEN CAST(:trace_id AS uuid) ELSE NULL END,
CAST(:control_uuid AS uuid),
CASE WHEN :tenant_id IS NOT NULL THEN CAST(:tenant_id AS uuid) ELSE NULL END,
:event_type, CAST(:input AS jsonb), CAST(:output AS jsonb),
:summary, :actor, CAST(:evidence AS jsonb), CAST(:meta AS jsonb))
"""), {
"id": eid,
"trace_id": req.decision_trace_id,
"control_uuid": req.control_uuid,
"tenant_id": req.tenant_id,
"event_type": req.event_type,
"input": json.dumps(req.input_state),
"output": json.dumps(req.output_state),
"summary": req.summary,
"actor": req.actor,
"evidence": json.dumps(req.evidence_ids),
"meta": json.dumps(req.metadata),
})
db.commit()
return {"id": eid, "event_type": req.event_type, "status": "recorded"}
finally:
db.close()
@router.get("")
async def list_events(
control_uuid: Optional[str] = None,
tenant_id: Optional[str] = None,
event_type: Optional[str] = None,
limit: int = Query(100, ge=1, le=1000),
offset: int = Query(0, ge=0),
):
"""List decision events with filters."""
db = SessionLocal()
try:
clauses = []
params: dict = {"limit": limit, "offset": offset}
if control_uuid:
clauses.append("de.control_uuid = CAST(:cuuid AS uuid)")
params["cuuid"] = control_uuid
if tenant_id:
clauses.append("de.tenant_id = CAST(:tid AS uuid)")
params["tid"] = tenant_id
if event_type:
clauses.append("de.event_type = :etype")
params["etype"] = event_type
where = "WHERE " + " AND ".join(clauses) if clauses else ""
rows = db.execute(text(f"""
SELECT de.id, de.control_uuid, cc.control_id,
de.event_type, de.summary, de.actor,
de.input_state, de.output_state,
de.evidence_ids, de.created_at
FROM decision_events de
LEFT JOIN canonical_controls cc ON cc.id = de.control_uuid
{where}
ORDER BY de.created_at DESC
LIMIT :limit OFFSET :offset
"""), params).fetchall()
return {
"total": len(rows),  # number of events on this page, not the overall count
"events": [
{
"id": str(r[0]),
"control_uuid": str(r[1]),
"control_id": r[2],
"event_type": r[3],
"summary": r[4],
"actor": r[5],
"input_state": r[6],
"output_state": r[7],
"evidence_ids": r[8],
"created_at": str(r[9]),
}
for r in rows
],
}
finally:
db.close()
@router.get("/stats")
async def event_stats(tenant_id: Optional[str] = None):
"""Lifecycle statistics: cycle times, failure rates."""
db = SessionLocal()
try:
tf = ""
params: dict = {}
if tenant_id:
tf = "WHERE tenant_id = CAST(:tid AS uuid)"
params["tid"] = tenant_id
by_type = db.execute(text(f"""
SELECT event_type, count(*) FROM decision_events {tf}
GROUP BY event_type ORDER BY count(*) DESC
"""), params).fetchall()
total = sum(r[1] for r in by_type)
failures = next((r[1] for r in by_type if r[0] == "failure"), 0)
verifications = next((r[1] for r in by_type if r[0] == "verification"), 0)
return {
"total_events": total,
"by_event_type": {r[0]: r[1] for r in by_type},
"failure_rate": round(failures / total * 100, 1) if total > 0 else 0,
"verification_rate": round(verifications / total * 100, 1) if total > 0 else 0,
}
finally:
db.close()
@router.get("/timeline/{control_id}")
async def get_timeline(control_id: str):
"""Full chronological timeline for a control's compliance lifecycle."""
db = SessionLocal()
try:
# Resolve control_id to UUID
ctrl = db.execute(text("""
SELECT id, control_id, title FROM canonical_controls
WHERE control_id = :cid
"""), {"cid": control_id}).fetchone()
if not ctrl:
raise HTTPException(status_code=404, detail="Control not found")
events = db.execute(text("""
SELECT id, event_type, summary, actor,
input_state, output_state, evidence_ids, created_at
FROM decision_events
WHERE control_uuid = CAST(:uuid AS uuid)
ORDER BY created_at ASC
"""), {"uuid": str(ctrl[0])}).fetchall()
# Determine current state from latest event
current_state = "unknown"
if events:
last = events[-1]
output = last[5] or {}
current_state = output.get("status", last[1])
# Calculate avg fix time (assessment → fix_completed)
fix_times = []
assessment_at = None
for e in events:
if e[1] == "assessment":
assessment_at = e[7]
elif e[1] == "fix_completed" and assessment_at:
delta = (e[7] - assessment_at).total_seconds() / 3600
fix_times.append(delta)
assessment_at = None
return {
"control_id": ctrl[1],
"control_title": ctrl[2],
"current_state": current_state,
"total_events": len(events),
"time_to_fix_avg_hours": round(sum(fix_times) / len(fix_times), 1) if fix_times else None,
"events": [
{
"id": str(e[0]),
"type": e[1],
"summary": e[2],
"actor": e[3],
"input_state": e[4],
"output_state": e[5],
"evidence_count": len(e[6]) if e[6] else 0,
"at": str(e[7]),
}
for e in events
],
}
finally:
db.close()
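The fix-time calculation in `get_timeline` pairs each `assessment` with the next `fix_completed` event. Pulled out as a pure function it might look like the sketch below. `avg_fix_hours` is a hypothetical name, and the input is assumed to be `(event_type, created_at)` pairs sorted oldest first.

```python
from datetime import datetime, timedelta
from typing import Optional


def avg_fix_hours(events: list[tuple[str, datetime]]) -> Optional[float]:
    """Average hours from an assessment to the next fix_completed event."""
    fix_times = []
    assessment_at = None
    for event_type, created_at in events:
        if event_type == "assessment":
            # A later assessment replaces one that is still open.
            assessment_at = created_at
        elif event_type == "fix_completed" and assessment_at is not None:
            fix_times.append((created_at - assessment_at).total_seconds() / 3600)
            assessment_at = None  # close the cycle
    return round(sum(fix_times) / len(fix_times), 1) if fix_times else None
```

A `fix_completed` event with no open assessment is ignored, which matches the loop in the endpoint.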
@@ -0,0 +1,404 @@
"""Decision Trace API — G1 Compliance Execution Layer.
Tracks compliance decisions per control: who decided, when, why,
what evidence supports it, and what's the remediation plan.
"""
import json
import logging
import uuid
from datetime import datetime
from typing import Optional
from fastapi import APIRouter, HTTPException, Query
from pydantic import BaseModel
from sqlalchemy import text
from db.session import SessionLocal
logger = logging.getLogger(__name__)
router = APIRouter(prefix="/v1/decision-traces", tags=["decision-traces"])
# ── Request/Response Models ──────────────────────────────────────────
class CreateDecisionRequest(BaseModel):
control_uuid: str
regulation_id: Optional[str] = None
obligation_id: Optional[str] = None
status: str = "not_assessed"
decision_reason: Optional[str] = None
decided_by: Optional[str] = None
fix_strategy: Optional[str] = None
fix_owner: Optional[str] = None
fix_target_date: Optional[str] = None
evidence_ids: list[str] = []
confidence: float = 0.0
tenant_id: Optional[str] = None
project_id: Optional[str] = None
metadata: dict = {}
class UpdateDecisionRequest(BaseModel):
status: Optional[str] = None
decision_reason: Optional[str] = None
decided_by: Optional[str] = None
fix_strategy: Optional[str] = None
fix_owner: Optional[str] = None
fix_target_date: Optional[str] = None
fix_completed_date: Optional[str] = None
evidence_ids: Optional[list[str]] = None
confidence: Optional[float] = None
metadata: Optional[dict] = None
# ── Endpoints ────────────────────────────────────────────────────────
@router.post("")
async def create_decision(req: CreateDecisionRequest):
"""Record a new compliance decision for a control."""
db = SessionLocal()
try:
trace_id = str(uuid.uuid4())
db.execute(text("""
INSERT INTO decision_traces
(id, control_uuid, regulation_id, obligation_id,
status, decision_reason, decided_by, decided_at,
fix_strategy, fix_owner, fix_target_date,
evidence_ids, confidence, tenant_id, project_id, metadata)
VALUES
(CAST(:id AS uuid), CAST(:control_uuid AS uuid), :regulation_id, :obligation_id,
:status, :decision_reason, :decided_by, NOW(),
:fix_strategy, :fix_owner, :fix_target_date,
CAST(:evidence_ids AS jsonb), :confidence,
:tenant_id, :project_id, CAST(:metadata AS jsonb))
"""), {
"id": trace_id,
"control_uuid": req.control_uuid,
"regulation_id": req.regulation_id,
"obligation_id": req.obligation_id,
"status": req.status,
"decision_reason": req.decision_reason,
"decided_by": req.decided_by,
"fix_strategy": req.fix_strategy,
"fix_owner": req.fix_owner,
"fix_target_date": req.fix_target_date,
"evidence_ids": json.dumps(req.evidence_ids),
"confidence": req.confidence,
"tenant_id": req.tenant_id,
"project_id": req.project_id,
"metadata": json.dumps(req.metadata),
})
db.commit()
return {"id": trace_id, "status": "created"}
finally:
db.close()
@router.get("")
async def list_decisions(
control_uuid: Optional[str] = None,
status: Optional[str] = None,
tenant_id: Optional[str] = None,
limit: int = Query(50, ge=1, le=500),
offset: int = Query(0, ge=0),
):
"""List decision traces with optional filters."""
db = SessionLocal()
try:
clauses = []
params: dict = {"limit": limit, "offset": offset}
if control_uuid:
clauses.append("dt.control_uuid = CAST(:control_uuid AS uuid)")
params["control_uuid"] = control_uuid
if status:
clauses.append("dt.status = :status")
params["status"] = status
if tenant_id:
clauses.append("dt.tenant_id = CAST(:tenant_id AS uuid)")
params["tenant_id"] = tenant_id
where = "WHERE " + " AND ".join(clauses) if clauses else ""
rows = db.execute(text(f"""
SELECT dt.id, dt.control_uuid, cc.control_id, cc.title,
dt.status, dt.decision_reason, dt.decided_by, dt.decided_at,
dt.fix_strategy, dt.fix_owner, dt.fix_target_date, dt.fix_completed_date,
dt.evidence_ids, dt.confidence, dt.regulation_id
FROM decision_traces dt
LEFT JOIN canonical_controls cc ON cc.id = dt.control_uuid
{where}
ORDER BY dt.decided_at DESC NULLS LAST
LIMIT :limit OFFSET :offset
"""), params).fetchall()
total = db.execute(text(f"""
SELECT count(*) FROM decision_traces dt {where}
"""), params).scalar()
return {
"total": total,
"decisions": [
{
"id": str(r[0]),
"control_uuid": str(r[1]),
"control_id": r[2],
"control_title": r[3],
"status": r[4],
"decision_reason": r[5],
"decided_by": r[6],
"decided_at": str(r[7]) if r[7] else None,
"fix_strategy": r[8],
"fix_owner": r[9],
"fix_target_date": str(r[10]) if r[10] else None,
"fix_completed_date": str(r[11]) if r[11] else None,
"evidence_ids": r[12],
"confidence": float(r[13]) if r[13] else 0,
"regulation_id": r[14],
}
for r in rows
],
}
finally:
db.close()
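The optional-filter pattern used by `list_decisions` (collect clauses, bind every value as a parameter) can be sketched as a small helper. `build_where` and `uuid_columns` are hypothetical. Column names are assumed to come from code, never from user input, because they are interpolated into the SQL text.

```python
from typing import Optional


def build_where(
    filters: dict[str, Optional[str]],
    uuid_columns: frozenset[str] = frozenset(),
    alias: str = "dt",
) -> tuple[str, dict]:
    """Build a WHERE clause and its bind parameters from optional filters."""
    clauses = []
    params: dict = {}
    for column, value in filters.items():
        if not value:
            continue  # filter not supplied
        placeholder = f":{column}"
        if column in uuid_columns:
            placeholder = f"CAST(:{column} AS uuid)"
        clauses.append(f"{alias}.{column} = {placeholder}")
        params[column] = value
    where = "WHERE " + " AND ".join(clauses) if clauses else ""
    return where, params
```

The returned `params` contains only the names that appear in `where`, so the same pair can be reused for both the page query and the count query.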
@router.get("/stats")
async def decision_stats(tenant_id: Optional[str] = None):
"""Dashboard statistics for compliance decisions."""
db = SessionLocal()
try:
tenant_filter = ""
params: dict = {}
if tenant_id:
tenant_filter = "WHERE tenant_id = CAST(:tenant_id AS uuid)"
params["tenant_id"] = tenant_id
stats = db.execute(text(f"""
SELECT status, count(*) FROM decision_traces
{tenant_filter}
GROUP BY status
"""), params).fetchall()
total = sum(r[1] for r in stats)
by_status = {r[0]: r[1] for r in stats}
return {
"total_decisions": total,
"by_status": by_status,
"compliance_rate": round(
by_status.get("compliant", 0) / total * 100, 1
) if total > 0 else 0,
"pending_remediation": by_status.get("under_remediation", 0),
"not_assessed": by_status.get("not_assessed", 0),
}
finally:
db.close()
@router.get("/{trace_id}")
async def get_decision(trace_id: str):
"""Get a single decision trace."""
db = SessionLocal()
try:
row = db.execute(text("""
SELECT dt.*, cc.control_id, cc.title, cc.source_citation
FROM decision_traces dt
LEFT JOIN canonical_controls cc ON cc.id = dt.control_uuid
WHERE dt.id = CAST(:id AS uuid)
"""), {"id": trace_id}).fetchone()
if not row:
raise HTTPException(status_code=404, detail="Decision trace not found")
return {
"id": str(row.id),
"control_uuid": str(row.control_uuid),
"control_id": row.control_id,
"control_title": row.title,
"regulation_id": row.regulation_id,
"obligation_id": row.obligation_id,
"status": row.status,
"decision_reason": row.decision_reason,
"decided_by": row.decided_by,
"decided_at": str(row.decided_at) if row.decided_at else None,
"fix_strategy": row.fix_strategy,
"fix_owner": row.fix_owner,
"fix_target_date": str(row.fix_target_date) if row.fix_target_date else None,
"fix_completed_date": str(row.fix_completed_date) if row.fix_completed_date else None,
"evidence_ids": row.evidence_ids,
"confidence": float(row.confidence) if row.confidence else 0,
"source_citation": row.source_citation,
"metadata": row.metadata,
}
finally:
db.close()
@router.put("/{trace_id}")
async def update_decision(trace_id: str, req: UpdateDecisionRequest):
"""Update a decision trace (status, fix progress, evidence)."""
db = SessionLocal()
try:
updates = []
params: dict = {"id": trace_id}
if req.status is not None:
updates.append("status = :status")
params["status"] = req.status
if req.decision_reason is not None:
updates.append("decision_reason = :reason")
params["reason"] = req.decision_reason
if req.decided_by is not None:
updates.append("decided_by = :decided_by")
params["decided_by"] = req.decided_by
if req.fix_strategy is not None:
updates.append("fix_strategy = :fix_strategy")
params["fix_strategy"] = req.fix_strategy
if req.fix_owner is not None:
updates.append("fix_owner = :fix_owner")
params["fix_owner"] = req.fix_owner
if req.fix_target_date is not None:
updates.append("fix_target_date = :fix_target")
params["fix_target"] = req.fix_target_date
if req.fix_completed_date is not None:
updates.append("fix_completed_date = :fix_completed")
params["fix_completed"] = req.fix_completed_date
if req.evidence_ids is not None:
updates.append("evidence_ids = CAST(:evidence AS jsonb)")
params["evidence"] = json.dumps(req.evidence_ids)
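if req.confidence is not None:
updates.append("confidence = :confidence")
params["confidence"] = req.confidence
# metadata is accepted by UpdateDecisionRequest, so persist it as well
if req.metadata is not None:
updates.append("metadata = CAST(:metadata AS jsonb)")
params["metadata"] = json.dumps(req.metadata)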
if not updates:
raise HTTPException(status_code=400, detail="No fields to update")
result = db.execute(text(f"""
UPDATE decision_traces SET {', '.join(updates)}
WHERE id = CAST(:id AS uuid)
"""), params)
db.commit()
if result.rowcount == 0:
raise HTTPException(status_code=404, detail="Decision trace not found")
return {"status": "updated", "id": trace_id}
finally:
db.close()
# ── Full Trace Endpoint ──────────────────────────────────────────────
full_trace_router = APIRouter(prefix="/v1/controls", tags=["decision-traces"])
@full_trace_router.get("/{control_id}/full-trace")
async def get_full_trace(control_id: str):
"""Get the complete Decision Trace chain for a control.
Returns: Regulation → Obligation → Control → Master Control → Decision → Evidence
"""
db = SessionLocal()
try:
# 1. Control
ctrl = db.execute(text("""
SELECT id, control_id, title, objective, severity,
source_citation, source_original_text,
verification_method, category,
generation_metadata->>'merge_group_hint' AS merge_hint
FROM canonical_controls
WHERE control_id = :cid
"""), {"cid": control_id}).fetchone()
if not ctrl:
raise HTTPException(status_code=404, detail="Control not found")
# 2. Regulation (from source_citation)
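citation = ctrl.source_citation or {}
# source_citation can arrive as JSON text instead of jsonb; decode it before use
if isinstance(citation, str):
try:
citation = json.loads(citation)
except ValueError:
citation = {}
if not isinstance(citation, dict):
citation = {}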
regulation = {
"source": citation.get("source"),
"article": citation.get("article"),
"paragraph": citation.get("paragraph"),
"source_type": citation.get("source_type"),
"license": citation.get("license"),
}
# 3. Obligation (from parent links)
obligations = db.execute(text("""
SELECT oc.candidate_id, oc.obligation_text, oc.action,
oc.object, oc.normative_strength
FROM obligation_candidates oc
WHERE oc.parent_control_uuid = CAST(:uuid AS uuid)
ORDER BY oc.candidate_id
LIMIT 10
"""), {"uuid": str(ctrl.id)}).fetchall()
# 4. Master Control (if member)
master = db.execute(text("""
SELECT mc.master_control_id, mc.canonical_name, mc.phases_covered
FROM master_control_members mcm
JOIN master_controls mc ON mc.id = mcm.master_control_uuid
WHERE mcm.control_uuid = CAST(:uuid AS uuid)
LIMIT 1
"""), {"uuid": str(ctrl.id)}).fetchone()
# 5. Decision Traces
decisions = db.execute(text("""
SELECT id, status, decision_reason, decided_by, decided_at,
fix_strategy, fix_owner, evidence_ids, confidence
FROM decision_traces
WHERE control_uuid = CAST(:uuid AS uuid)
ORDER BY decided_at DESC NULLS LAST
"""), {"uuid": str(ctrl.id)}).fetchall()
return {
"control": {
"id": ctrl.control_id,
"uuid": str(ctrl.id),
"title": ctrl.title,
"objective": ctrl.objective,
"severity": ctrl.severity,
"category": ctrl.category,
"verification_method": ctrl.verification_method,
},
"regulation": regulation,
"original_text": ctrl.source_original_text[:500] if ctrl.source_original_text else None,
"obligations": [
{
"id": o.candidate_id,
"text": o.obligation_text,
"action": o.action,
"object": o.object,
"strength": o.normative_strength,
}
for o in obligations
],
"master_control": {
"id": master.master_control_id,
"name": master.canonical_name,
"phases": master.phases_covered,
} if master else None,
"decisions": [
{
"id": str(d.id),
"status": d.status,
"reason": d.decision_reason,
"decided_by": d.decided_by,
"decided_at": str(d.decided_at) if d.decided_at else None,
"fix_strategy": d.fix_strategy,
"fix_owner": d.fix_owner,
"evidence_count": len(d.evidence_ids) if d.evidence_ids else 0,
"confidence": float(d.confidence) if d.confidence else 0,
}
for d in decisions
],
"latest_status": decisions[0].status if decisions else "not_assessed",
}
finally:
db.close()
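The commit log for this repository notes that `canonical_controls.source_citation` is stored as JSON text in one environment and as `jsonb` in another. A reader of that column may therefore receive either a `str` or a `dict`. The sketch below shows one defensive way to normalise it; `citation_as_dict` is a hypothetical helper, not part of the original module.

```python
import json
from typing import Any


def citation_as_dict(raw: Any) -> dict:
    """Return source_citation as a dict, whatever the column type delivered."""
    if isinstance(raw, dict):
        return raw  # jsonb column, already decoded by the driver
    if isinstance(raw, str):
        try:
            parsed = json.loads(raw)  # text column holding JSON
        except json.JSONDecodeError:
            return {}
        return parsed if isinstance(parsed, dict) else {}
    return {}  # NULL or an unexpected type
```

With this in place, `citation_as_dict(ctrl.source_citation).get("article")` behaves the same against both schemas.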
@@ -0,0 +1,258 @@
"""Pre-Deployment Enforcement API — G4.
CI/CD gate: checks if a deployment is safe by evaluating the compliance
status of all affected controls. Blocks deploys with non-compliant controls.
"""
import json
import logging
import uuid
from typing import Optional
from fastapi import APIRouter, HTTPException, Query
from pydantic import BaseModel
from sqlalchemy import text
from db.session import SessionLocal
logger = logging.getLogger(__name__)
router = APIRouter(prefix="/v1/deployment-checks", tags=["deployment-checks"])
SEVERITY_WEIGHT = {
"critical": 4.0,
"high": 3.0,
"medium": 2.0,
"low": 1.0,
}
class DeployCheckRequest(BaseModel):
tenant_id: str
commit_hash: str
branch: Optional[str] = None
environment: str = "production"
affected_control_ids: list[str] = []
metadata: dict = {}
class OverrideRequest(BaseModel):
override_by: str
override_reason: str
@router.post("")
async def check_deployment(req: DeployCheckRequest):
"""Check if a deployment is safe. Returns verdict: approved/blocked."""
db = SessionLocal()
try:
check_id = str(uuid.uuid4())
blocking = []
warnings = []
risk_score = 0.0
if req.affected_control_ids:
# Look up latest decision status for each affected control
for ctrl_id in req.affected_control_ids:
row = db.execute(text("""
SELECT dt.status, dt.decision_reason, dt.fix_strategy,
cc.control_id, cc.title, cc.severity
FROM decision_traces dt
JOIN canonical_controls cc ON cc.id = dt.control_uuid
WHERE cc.control_id = :cid
ORDER BY dt.decided_at DESC NULLS LAST
LIMIT 1
"""), {"cid": ctrl_id}).fetchone()
if not row:
# No decision → treat as not_assessed (warning)
warnings.append({
"control_id": ctrl_id,
"status": "not_assessed",
"reason": "No compliance decision recorded",
})
continue
status = row[0]
severity = row[5] or "medium"
weight = SEVERITY_WEIGHT.get(severity, 2.0)
if status in ("not_compliant", "under_remediation"):
blocking.append({
"control_id": row[3],
"title": row[4],
"status": status,
"reason": row[1],
"fix_strategy": row[2],
"severity": severity,
})
risk_score += weight
elif status == "partially_compliant":
warnings.append({
"control_id": row[3],
"title": row[4],
"status": status,
"reason": row[1],
"severity": severity,
})
risk_score += weight * 0.5
# Also check for open failure events (G3)
if req.affected_control_ids:
# Bind each control ID as a named parameter instead of interpolating it into the SQL
failure_params = {f"c{i}": c for i, c in enumerate(req.affected_control_ids)}
placeholders = ", ".join(f":c{i}" for i in range(len(req.affected_control_ids)))
open_failures = db.execute(text(f"""
SELECT cc.control_id, de.summary
FROM decision_events de
JOIN canonical_controls cc ON cc.id = de.control_uuid
WHERE cc.control_id IN ({placeholders})
AND de.event_type = 'failure'
AND de.created_at > NOW() - interval '30 days'
AND NOT EXISTS (
SELECT 1 FROM decision_events de2
WHERE de2.control_uuid = de.control_uuid
AND de2.event_type = 'verification'
AND de2.created_at > de.created_at
)
"""), failure_params).fetchall()
for f in open_failures:
if not any(b["control_id"] == f[0] for b in blocking):
blocking.append({
"control_id": f[0],
"status": "open_failure",
"reason": f[1] or "Unresolved failure event",
"severity": "high",
})
risk_score += 3.0
verdict = "approved" if not blocking else "blocked"
summary = (
f"{len(blocking)} blocking, {len(warnings)} warnings. "
+ ("Deploy approved." if verdict == "approved"
else f"Fix {', '.join(b['control_id'] for b in blocking)} before deploying.")
)
# Store check result
db.execute(text("""
INSERT INTO deployment_checks
(id, tenant_id, commit_hash, branch, environment,
verdict, affected_control_ids, blocking_controls,
warning_controls, risk_score, summary, metadata)
VALUES
(CAST(:id AS uuid), CAST(:tid AS uuid), :hash, :branch, :env,
:verdict, CAST(:affected AS jsonb), CAST(:blocking AS jsonb),
CAST(:warnings AS jsonb), :risk, :summary, CAST(:meta AS jsonb))
"""), {
"id": check_id,
"tid": req.tenant_id,
"hash": req.commit_hash,
"branch": req.branch,
"env": req.environment,
"verdict": verdict,
"affected": json.dumps(req.affected_control_ids),
"blocking": json.dumps(blocking),
"warnings": json.dumps(warnings),
"risk": risk_score,
"summary": summary,
"meta": json.dumps(req.metadata),
})
db.commit()
return {
"id": check_id,
"verdict": verdict,
"risk_score": risk_score,
"blocking_controls": blocking,
"warning_controls": warnings,
"summary": summary,
}
finally:
db.close()
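The verdict rules in `check_deployment` can be summarised as a pure function. This is a simplified sketch that leaves out the database lookups and the open-failure check. `evaluate` and `BLOCKING_STATUSES` are hypothetical names, and each input dict is assumed to hold `control_id`, `status` (or `None` when no decision exists) and `severity`.

```python
SEVERITY_WEIGHT = {"critical": 4.0, "high": 3.0, "medium": 2.0, "low": 1.0}
BLOCKING_STATUSES = {"not_compliant", "under_remediation"}


def evaluate(controls: list[dict]) -> dict:
    """Classify controls into blocking and warnings and sum a risk score."""
    blocking: list[str] = []
    warnings: list[str] = []
    risk_score = 0.0
    for c in controls:
        status = c.get("status")
        weight = SEVERITY_WEIGHT.get(c.get("severity") or "medium", 2.0)
        if status is None:
            # No decision recorded: warn, but do not add to the score.
            warnings.append(c["control_id"])
        elif status in BLOCKING_STATUSES:
            blocking.append(c["control_id"])
            risk_score += weight
        elif status == "partially_compliant":
            warnings.append(c["control_id"])
            risk_score += weight * 0.5
    return {
        "verdict": "blocked" if blocking else "approved",
        "risk_score": risk_score,
        "blocking": blocking,
        "warnings": warnings,
    }
```

A single blocking control is enough to block the deploy. The risk score is informational and does not change the verdict.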
@router.get("/stats")
async def check_stats(tenant_id: Optional[str] = None):
"""Deployment check statistics."""
db = SessionLocal()
try:
tf = ""
params: dict = {}
if tenant_id:
tf = "WHERE tenant_id = CAST(:tid AS uuid)"
params["tid"] = tenant_id
by_verdict = db.execute(text(f"""
SELECT verdict, count(*) FROM deployment_checks {tf}
GROUP BY verdict
"""), params).fetchall()
total = sum(r[1] for r in by_verdict)
verdicts = {r[0]: r[1] for r in by_verdict}
return {
"total_checks": total,
"by_verdict": verdicts,
"approval_rate": round(
verdicts.get("approved", 0) / total * 100, 1
) if total > 0 else 0,
"override_count": verdicts.get("override", 0),
}
finally:
db.close()
@router.post("/{check_id}/override")
async def override_check(check_id: str, req: OverrideRequest):
"""Override a blocked deployment (with justification)."""
db = SessionLocal()
try:
result = db.execute(text("""
UPDATE deployment_checks
SET verdict = 'override', override_by = :by, override_reason = :reason
WHERE id = CAST(:id AS uuid) AND verdict = 'blocked'
"""), {
"id": check_id,
"by": req.override_by,
"reason": req.override_reason,
})
db.commit()
if result.rowcount == 0:
raise HTTPException(status_code=404, detail="Check not found or not blocked")
return {"id": check_id, "verdict": "override", "override_by": req.override_by}
finally:
db.close()
@router.get("/{check_id}")
async def get_check(check_id: str):
    """Get details of a deployment check."""
    db = SessionLocal()
    try:
        row = db.execute(text("""
            SELECT * FROM deployment_checks WHERE id = CAST(:id AS uuid)
        """), {"id": check_id}).fetchone()
        if not row:
            raise HTTPException(status_code=404, detail="Check not found")
        return {
            "id": str(row.id),
            "tenant_id": str(row.tenant_id),
            "commit_hash": row.commit_hash,
            "branch": row.branch,
            "environment": row.environment,
            "verdict": row.verdict,
            "affected_control_ids": row.affected_control_ids,
            "blocking_controls": row.blocking_controls,
            "warning_controls": row.warning_controls,
            "risk_score": float(row.risk_score),
            "override_by": row.override_by,
            "override_reason": row.override_reason,
            "summary": row.summary,
            "created_at": str(row.created_at),
        }
    finally:
        db.close()
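Both `override_check` and `get_check` pass the path parameter straight into `CAST(:id AS uuid)`. A malformed ID would then most likely surface as a database error rather than a 404. One possible guard, sketched here with only the standard library, is to validate the string before querying; the helper name `parse_uuid` is an assumption for illustration and does not appear in the commit.

```python
import uuid
from typing import Optional


def parse_uuid(value: str) -> Optional[str]:
    """Return the canonical lowercase UUID string, or None if value is not a UUID."""
    try:
        return str(uuid.UUID(value))
    except (ValueError, AttributeError, TypeError):
        # ValueError: malformed string; AttributeError/TypeError: non-string input.
        return None


if __name__ == "__main__":
    print(parse_uuid("123E4567-E89B-12D3-A456-426614174000"))
    print(parse_uuid("not-a-uuid"))
```

A handler could call such a helper first and raise `HTTPException(status_code=404, ...)` when it returns `None`, so that invalid and unknown IDs are treated the same way.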
@@ -0,0 +1,178 @@
"""Master Control API — G-pre3.
Provides read access to Master Controls (lifecycle-grouped atomic controls).
"""
import json
import logging
from typing import Optional
from fastapi import APIRouter, HTTPException, Query
from sqlalchemy import text
from db.session import SessionLocal
logger = logging.getLogger(__name__)
router = APIRouter(prefix="/v1/master-controls", tags=["master-controls"])
@router.get("")
async def list_master_controls(
    limit: int = Query(50, ge=1, le=500),
    offset: int = Query(0, ge=0),
    search: Optional[str] = None,
    min_phases: Optional[int] = None,
    min_controls: Optional[int] = None,
    sort: str = Query("total_controls", regex="^(total_controls|phases|name|created_at)$"),
):
    """List Master Controls with optional filtering."""
    db = SessionLocal()
    try:
        # Filters are collected as SQL fragments; all values go through bind parameters.
        where_clauses = []
        params: dict = {"limit": limit, "offset": offset}
        if search:
            where_clauses.append("mc.canonical_name ILIKE :search")
            params["search"] = f"%{search}%"
        if min_phases:
            where_clauses.append("jsonb_array_length(mc.phases_covered) >= :min_phases")
            params["min_phases"] = min_phases
        if min_controls:
            where_clauses.append("mc.total_controls >= :min_controls")
            params["min_controls"] = min_controls
        where = "WHERE " + " AND ".join(where_clauses) if where_clauses else ""
        # The ORDER BY fragment is taken from a fixed map, never from user input.
        sort_map = {
            "total_controls": "mc.total_controls DESC",
            "phases": "jsonb_array_length(mc.phases_covered) DESC",
            "name": "mc.canonical_name ASC",
            "created_at": "mc.created_at DESC",
        }
        order = sort_map.get(sort, "mc.total_controls DESC")
        rows = db.execute(text(f"""
            SELECT mc.id, mc.master_control_id, mc.object_group_id,
                   mc.canonical_name, mc.phases_covered,
                   mc.phase_control_count, mc.total_controls,
                   mc.created_at
            FROM master_controls mc
            {where}
            ORDER BY {order}
            LIMIT :limit OFFSET :offset
        """), params).fetchall()
        total = db.execute(text(f"""
            SELECT count(*) FROM master_controls mc {where}
        """), params).scalar()
        return {
            "total": total,
            "limit": limit,
            "offset": offset,
            "master_controls": [
                {
                    "id": str(r[0]),
                    "master_control_id": r[1],
                    "object_group_id": r[2],
                    "canonical_name": r[3],
                    "phases_covered": r[4],
                    "phase_control_count": r[5],
                    "total_controls": r[6],
                    "created_at": str(r[7]),
                }
                for r in rows
            ],
        }
    finally:
        db.close()
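The query construction in `list_master_controls` combines two ideas: filter values always travel as bind parameters, and the `ORDER BY` fragment comes from a fixed whitelist. A minimal, database-free sketch of that pattern follows; the function name `build_list_query` and the shortened column list are assumptions made for illustration.

```python
from typing import Optional, Tuple

# Whitelisted ORDER BY fragments, copied from the endpoint's sort_map.
SORT_MAP = {
    "total_controls": "mc.total_controls DESC",
    "phases": "jsonb_array_length(mc.phases_covered) DESC",
    "name": "mc.canonical_name ASC",
    "created_at": "mc.created_at DESC",
}


def build_list_query(
    search: Optional[str] = None,
    min_phases: Optional[int] = None,
    min_controls: Optional[int] = None,
    sort: str = "total_controls",
    limit: int = 50,
    offset: int = 0,
) -> Tuple[str, dict]:
    """Return (sql, params); user input never ends up inside the SQL text."""
    clauses = []
    params: dict = {"limit": limit, "offset": offset}
    if search:
        clauses.append("mc.canonical_name ILIKE :search")
        params["search"] = f"%{search}%"
    if min_phases:
        clauses.append("jsonb_array_length(mc.phases_covered) >= :min_phases")
        params["min_phases"] = min_phases
    if min_controls:
        clauses.append("mc.total_controls >= :min_controls")
        params["min_controls"] = min_controls
    where = "WHERE " + " AND ".join(clauses) if clauses else ""
    # Unknown sort keys fall back to the default instead of being interpolated.
    order = SORT_MAP.get(sort, SORT_MAP["total_controls"])
    sql = (
        f"SELECT mc.id, mc.canonical_name FROM master_controls mc {where} "
        f"ORDER BY {order} LIMIT :limit OFFSET :offset"
    )
    return sql, params


if __name__ == "__main__":
    print(build_list_query(search="backup", min_phases=3, sort="name"))
```

Keeping the builder separate from the handler makes it possible to unit-test the filter logic without a database session.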
@router.get("/stats")
async def master_control_stats():
    """Aggregate statistics about Master Controls."""
    db = SessionLocal()
    try:
        stats = db.execute(text("""
            SELECT
                count(*) AS total_master_controls,
                sum(total_controls) AS total_member_controls,
                avg(total_controls)::int AS avg_controls_per_mc,
                max(total_controls) AS max_controls,
                avg(jsonb_array_length(phases_covered))::numeric(3,1) AS avg_phases,
                max(jsonb_array_length(phases_covered)) AS max_phases
            FROM master_controls
        """)).fetchone()
        phase_dist = db.execute(text("""
            SELECT phase, count(*) AS control_count
            FROM master_control_members
            GROUP BY phase
            ORDER BY control_count DESC
        """)).fetchall()
        return {
            "total_master_controls": stats[0],
            "total_member_controls": stats[1],
            "avg_controls_per_mc": stats[2],
            "max_controls": stats[3],
            "avg_phases": float(stats[4]) if stats[4] else 0,
            "max_phases": stats[5],
            "phase_distribution": {r[0]: r[1] for r in phase_dist},
        }
    finally:
        db.close()
@router.get("/{mc_id}")
async def get_master_control(mc_id: str):
    """Get a single Master Control with all phase-controls."""
    db = SessionLocal()
    try:
        mc = db.execute(text("""
            SELECT mc.id, mc.master_control_id, mc.object_group_id,
                   mc.canonical_name, mc.phases_covered,
                   mc.phase_control_count, mc.total_controls
            FROM master_controls mc
            WHERE mc.master_control_id = :mc_id
        """), {"mc_id": mc_id}).fetchone()
        if not mc:
            raise HTTPException(status_code=404, detail="Master Control not found")
        members = db.execute(text("""
            SELECT mcm.phase, mcm.action,
                   cc.control_id, cc.title, cc.severity,
                   cc.source_citation->>'source' AS source
            FROM master_control_members mcm
            JOIN canonical_controls cc ON cc.id = mcm.control_uuid
            WHERE mcm.master_control_uuid = CAST(:mc_uuid AS uuid)
            ORDER BY mcm.phase, cc.control_id
        """), {"mc_uuid": str(mc[0])}).fetchall()
        # Group member controls by lifecycle phase.
        phases: dict = {}
        for phase, action, ctrl_id, title, severity, source in members:
            phases.setdefault(phase, []).append({
                "control_id": ctrl_id,
                "title": title,
                "action": action,
                "severity": severity,
                "source": source,
            })
        return {
            "id": str(mc[0]),
            "master_control_id": mc[1],
            "object_group_id": mc[2],
            "canonical_name": mc[3],
            "phases_covered": mc[4],
            "phase_control_count": mc[5],
            "total_controls": mc[6],
            "phases": phases,
        }
    finally:
        db.close()
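The grouping step in `get_master_control` turns a flat, phase-ordered result set into a mapping from phase to member controls. A small sketch of that step on plain tuples is shown below, assuming rows have the same six columns as the `members` query; the function name `group_members_by_phase` and the sample phase names are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Optional, Tuple

# (phase, action, control_id, title, severity, source), as selected by the members query.
MemberRow = Tuple[str, str, str, str, str, Optional[str]]


def group_members_by_phase(rows: Iterable[MemberRow]) -> Dict[str, List[dict]]:
    """Group member rows by phase, keeping the incoming row order within each phase."""
    phases: Dict[str, List[dict]] = defaultdict(list)
    for phase, action, ctrl_id, title, severity, source in rows:
        phases[phase].append({
            "control_id": ctrl_id,
            "title": title,
            "action": action,
            "severity": severity,
            "source": source,
        })
    # Return a plain dict so the result serializes like the endpoint's response.
    return dict(phases)


if __name__ == "__main__":
    sample = [
        ("design", "define", "C-1", "Title 1", "high", "Source A"),
        ("design", "review", "C-2", "Title 2", "low", None),
        ("operate", "monitor", "C-3", "Title 3", "medium", "Source A"),
    ]
    print(group_members_by_phase(sample))
```

Since the SQL already orders by `mcm.phase, cc.control_id`, the grouped lists stay sorted by control ID without further work.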
@@ -0,0 +1,430 @@
source: Derived from BSI QUAIDAL (Clean-Room)
source_url: https://github.com/BSI-Bund/QUAIDAL
commit_sha: c39b75369841b359c6bf56d6588e3768c722842f
plagiarism_limit_4gram: 0.2
generated_by_model: qwen3.5:35b-a3b
controls:
- id: AC-AI-DATA-QB-01-syntaktische-genauigkeit
canonical_name: Syntaktische Genauigkeit
description: Das KI-Trainingsset muss syntaktisch konsistent sein, wobei alle definierten
Grammatik- und Strukturregeln strikt einzuhalten sind. Eine fehlerfreie Datenstruktur
ist zwingend erforderlich, um eine korrekte Verarbeitung durch Parser oder Sprachmodelle
zu gewährleisten. Die Validierung der formalen Korrektheit ist vor jedem Training
durchzuführen, um Verarbeitungsfehler auszuschließen.
kind: building_block
regulation_anchor: EU AI Act Art. 10 (Datenqualität für Hochrisiko-KI)
related_quaidal_ids:
- MA-01
- MA-02
- MA-03
- MA-04
- MA-05
- MA-27
external_refs:
- framework: BSI AIC4
citation: null
- framework: ISO/IEC 25012
citation: null
source:
framework: BSI QUAIDAL
section: QB-01
title_original_de: QB-01 Syntaktische Genauigkeit
url: https://github.com/BSI-Bund/QUAIDAL/blob/main/0000_Markdown/0001_Criteria,Measurements,Metrics/0001_Qualitätsbausteine/QB-01_Syntactic%20Accuracy.md
commit_sha: c39b75369841b359c6bf56d6588e3768c722842f
license_note: § 5 UrhG anwendbar; share:true im Frontmatter; Clean-Room-Ableitung.
plagiarism_score_at_generation: 0.0
- id: AC-AI-DATA-QB-02-semantische-genauigkeit
canonical_name: Semantische Genauigkeit
description: Die KI-Trainingsdaten müssen inhaltlich korrekt sein, sodass die zugewiesenen
Werte dem tatsächlichen Sachverhalt entsprechen und nicht nur formal valide sind.
Es ist sicherzustellen, dass semantische Zuordnungen keine logischen Fehler aufweisen,
wie beispielsweise die Klassifizierung von Tieren als technische Geräte. Eine
Prüfung muss verifizieren, dass die Bedeutung der Datenpunkte im Kontext der Anwendung
eindeutig und fehlerfrei interpretiert werden kann.
kind: building_block
regulation_anchor: EU AI Act Art. 10 (Datenqualität für Hochrisiko-KI)
related_quaidal_ids:
- MA-05
- MA-06
- MA-07
- MA-27
external_refs:
- framework: BSI AIC4
citation: null
source:
framework: BSI QUAIDAL
section: QB-02
title_original_de: QB-02 Semantische Genauigkeit
url: https://github.com/BSI-Bund/QUAIDAL/blob/main/0000_Markdown/0001_Criteria,Measurements,Metrics/0001_Qualitätsbausteine/QB-02_Semantic%20Accuracy.md
commit_sha: c39b75369841b359c6bf56d6588e3768c722842f
license_note: § 5 UrhG anwendbar; share:true im Frontmatter; Clean-Room-Ableitung.
plagiarism_score_at_generation: 0.0
- id: AC-AI-DATA-QB-03-vielfalt
canonical_name: Vielfalt
description: Das KI-Trainingsdatenset muss eine maximale Varianz in den relevanten
Merkmalen aufweisen, um die Heterogenität der Eingabewerte zu gewährleisten. Es
ist sicherzustellen, dass das Spektrum der enthaltenen Werte breit genug ist,
um das Variationspotential der Zielgruppe vollständig abzudecken. Eine Prüfung
der Datenverteilung ist vor dem Training durchzuführen, um eine unzureichende
Diversität auszuschließen.
kind: building_block
regulation_anchor: EU AI Act Art. 10 (Datenqualität für Hochrisiko-KI)
related_quaidal_ids:
- MA-08
- MA-09
- MA-10
- MA-12
- MA-27
- MA-28
external_refs: []
source:
framework: BSI QUAIDAL
section: QB-03
title_original_de: QB-03 Vielfalt
url: https://github.com/BSI-Bund/QUAIDAL/blob/main/0000_Markdown/0001_Criteria,Measurements,Metrics/0001_Qualitätsbausteine/QB-03_Diversity.md
commit_sha: c39b75369841b359c6bf56d6588e3768c722842f
license_note: § 5 UrhG anwendbar; share:true im Frontmatter; Clean-Room-Ableitung.
plagiarism_score_at_generation: 0.0204
- id: AC-AI-DATA-QB-04-ausgewogenheit
canonical_name: Ausgewogenheit
description: Der Trainingsdatensatz ist so zu konzipieren, dass die Verteilung aller
relevanten Klassen proportional zur Zielrealität erfolgt, um eine einseitige Dominanz
einzelner Kategorien zu vermeiden. Es ist sicherzustellen, dass keine Gruppe systematisch
unter- oder überrepräsentiert wird, um Verzerrungen im Modellverhalten auszuschließen.
Die Datenqualität muss durch eine ausgewogene Varianz aller Merkmale gewährleistet
werden, um Overfitting und Bias wirksam zu verhindern.
kind: building_block
regulation_anchor: EU AI Act Art. 10 (Datenqualität für Hochrisiko-KI)
related_quaidal_ids:
- MA-08
- MA-09
- MA-10
- MA-12
- MA-14
- MA-27
external_refs: []
source:
framework: BSI QUAIDAL
section: QB-04
title_original_de: QB-04 Ausgewogenheit
url: https://github.com/BSI-Bund/QUAIDAL/blob/main/0000_Markdown/0001_Criteria,Measurements,Metrics/0001_Qualitätsbausteine/QB-04_Balance.md
commit_sha: c39b75369841b359c6bf56d6588e3768c722842f
license_note: § 5 UrhG anwendbar; share:true im Frontmatter; Clean-Room-Ableitung.
plagiarism_score_at_generation: 0.0182
- id: AC-AI-DATA-QB-05-umfang
canonical_name: Umfang
description: Der Trainingsdatensatz muss eine quantitativ ausreichende Anzahl an
Datenpunkten aufweisen, um statistisch signifikante Muster zu erfassen und das
Risiko von Overfitting zu minimieren. Die Größe der Datenbasis ist so zu dimensionieren,
dass sie eine belastbare Analyse der zugrundeliegenden Verteilungen ermöglicht
und die Generalisierungsfähigkeit des Modells stabilisiert. Eine Prüfung ist durchzuführen,
um sicherzustellen, dass der reine quantitative Umfang die notwendige Basis für
eine robuste Modellbildung bildet.
kind: building_block
regulation_anchor: EU AI Act Art. 10 (Datenqualität für Hochrisiko-KI)
related_quaidal_ids:
- MA-11
- MA-12
- MA-15
- MA-27
external_refs:
- framework: BSI AIC4
citation: null
source:
framework: BSI QUAIDAL
section: QB-05
title_original_de: QB-05 Umfang
url: https://github.com/BSI-Bund/QUAIDAL/blob/main/0000_Markdown/0001_Criteria,Measurements,Metrics/0001_Qualitätsbausteine/QB-05_Size.md
commit_sha: c39b75369841b359c6bf56d6588e3768c722842f
license_note: § 5 UrhG anwendbar; share:true im Frontmatter; Clean-Room-Ableitung.
plagiarism_score_at_generation: 0.0161
- id: AC-AI-DATA-QB-06-verzerrung
canonical_name: Verzerrung
description: Das KI-System muss vor dem produktiven Einsatz auf systematische Verzerrungen
in den Trainingsdaten und den daraus resultierenden Vorhersagen untersucht werden.
Es ist sicherzustellen, dass latente Ungleichbehandlungen quantitativ erfasst
und dokumentiert werden, um eine transparente Bewertung der Fairness zu ermöglichen.
Die Prüfung umfasst die Identifikation von Abweichungen, die auf unausgewogene
Datenverteilungen zurückzuführen sind, bevor das Modell für reale Anwendungen
freigegeben wird.
kind: building_block
regulation_anchor: EU AI Act Art. 10 (Datenqualität für Hochrisiko-KI)
related_quaidal_ids:
- MA-01
- MA-02
- MA-03
- MA-04
- MA-06
- MA-07
- MA-08
- MA-09
- MA-10
- MA-11
- MA-12
- MA-13
- MA-14
- MA-15
- MA-16
- MA-17
- MA-18
- MA-20
- MA-23
- MA-24
- MA-27
- MA-28
- QB-15
- QM-11
external_refs: []
source:
framework: BSI QUAIDAL
section: QB-06
title_original_de: QB-06 Verzerrung
url: https://github.com/BSI-Bund/QUAIDAL/blob/main/0000_Markdown/0001_Criteria,Measurements,Metrics/0001_Qualitätsbausteine/QB-06_Bias-Detektion.md
commit_sha: c39b75369841b359c6bf56d6588e3768c722842f
license_note: § 5 UrhG anwendbar; share:true im Frontmatter; Clean-Room-Ableitung.
plagiarism_score_at_generation: 0.0
- id: AC-AI-DATA-QB-07-gesamtheit
canonical_name: Gesamtheit
description: Das Trainingsdatenset muss sämtliche für das spezifische Anwendungsszenario
definierten Attribute und Entitätsinstanzen vollständig enthalten, um die Anforderung
der Gesamtheit zu erfüllen. Diese Vollständigkeit ist auf der Ebene des gesamten
Datensatzes, einzelner Spalten oder einzelner Datenpunkte nachweisbar zu prüfen.
Die Bewertung der Datenqualität erfolgt stets kontextbezogen unter Berücksichtigung
der jeweiligen Nutzungszwecke.
kind: building_block
regulation_anchor: EU AI Act Art. 10 (Datenqualität für Hochrisiko-KI)
related_quaidal_ids:
- MA-12
- MA-13
- MA-27
external_refs: []
source:
framework: BSI QUAIDAL
section: QB-07
title_original_de: QB-07 Gesamtheit
url: https://github.com/BSI-Bund/QUAIDAL/blob/main/0000_Markdown/0001_Criteria,Measurements,Metrics/0001_Qualitätsbausteine/QB-07_Totality.md
commit_sha: c39b75369841b359c6bf56d6588e3768c722842f
license_note: § 5 UrhG anwendbar; share:true im Frontmatter; Clean-Room-Ableitung.
plagiarism_score_at_generation: 0.0
- id: AC-AI-DATA-QB-08-konsistenzsicherung
canonical_name: Konsistenzsicherung
description: Die Konsistenz der KI-Trainingsdaten ist durch standardisierte Datentypen
und formatierte Attribute über den gesamten Lebenszyklus sicherzustellen. Automatisierte
Prüfmechanismen müssen Abweichungen in den Datenwerten sowie zeitlichen Verläufen
frühzeitig identifizieren, um nachvollziehbare Transformations- oder Imputationsmaßnahmen
einzuleiten. Eine einheitliche Datenstruktur ist zwingend erforderlich, um die
Integrität der Trainingsbasis für valide Modellentscheidungen zu gewährleisten.
kind: building_block
regulation_anchor: EU AI Act Art. 10 (Datenqualität für Hochrisiko-KI)
related_quaidal_ids:
- MA-01
- MA-02
- MA-03
external_refs:
- framework: ISO/IEC 25012
citation: null
- framework: BSI AIC4
citation: null
source:
framework: BSI QUAIDAL
section: QB-08
title_original_de: QB-08 Konsistenzsicherung
url: https://github.com/BSI-Bund/QUAIDAL/blob/main/0000_Markdown/0001_Criteria,Measurements,Metrics/0001_Qualitätsbausteine/QB-08_ConsistencyAssurance.md
commit_sha: c39b75369841b359c6bf56d6588e3768c722842f
license_note: § 5 UrhG anwendbar; share:true im Frontmatter; Clean-Room-Ableitung.
plagiarism_score_at_generation: 0.0
- id: AC-AI-DATA-QB-09-quellenmanagement
canonical_name: Quellenmanagement
description: Die Organisation muss einen durchgängigen Mechanismus implementieren,
der die Herkunft und den Verarbeitungsweg jeder Trainingsdaten-Einheit lückenlos
dokumentiert. Es ist sicherzustellen, dass jeder Datenpunkt mit seinem Ursprung
sowie allen nachfolgenden Transformationsschritten verknüpft bleibt, um die Integrität
der KI-Datenbasis zu gewährleisten. Zusätzlich sind alle Zugriffe und Modifikationen
in einem unveränderlichen Protokoll chronologisch festzuhalten, um einen vollständigen
Audit-Trail für Compliance-Prüfungen zu schaffen.
kind: building_block
regulation_anchor: EU AI Act Art. 10 (Datenqualität für Hochrisiko-KI)
related_quaidal_ids:
- MA-18
- MA-19
- MA-20
- MA-22
external_refs:
- framework: BSI AIC4
citation: null
- framework: AI Act
citation: null
source:
framework: BSI QUAIDAL
section: QB-09
title_original_de: QB-09 Quellenmanagement
url: https://github.com/BSI-Bund/QUAIDAL/blob/main/0000_Markdown/0001_Criteria,Measurements,Metrics/0001_Qualitätsbausteine/QB-09_Sourcemanagement.md
commit_sha: c39b75369841b359c6bf56d6588e3768c722842f
license_note: § 5 UrhG anwendbar; share:true im Frontmatter; Clean-Room-Ableitung.
plagiarism_score_at_generation: 0.0167
- id: AC-AI-DATA-QB-10-datenpruefung
canonical_name: Datenprüfung
description: Vor der Initialisierung des Trainingsprozesses ist eine systematische
Validierung der Eingangsdaten auf Vollständigkeit, Konsistenz und Integrität durchzuführen.
Dabei sind Unregelmäßigkeiten wie fehlende Werte, Formatinkonsistenzen oder statistische
Ausreißer zu identifizieren und zu bereinigen. Das System muss sicherstellen,
dass keine verzerrten oder fehlerhaften Datensätze das Modelltraining beeinträchtigen
und die Datenqualität den definierten Qualitätsstandards entspricht.
kind: building_block
regulation_anchor: EU AI Act Art. 10 (Datenqualität für Hochrisiko-KI)
related_quaidal_ids:
- MA-05
- MA-20
- MA-26
external_refs: []
source:
framework: BSI QUAIDAL
section: QB-10
title_original_de: QB-10_Datenprüfung
url: https://github.com/BSI-Bund/QUAIDAL/blob/main/0000_Markdown/0001_Criteria,Measurements,Metrics/0001_Qualitätsbausteine/QB-10_DataChecks.md
commit_sha: c39b75369841b359c6bf56d6588e3768c722842f
license_note: § 5 UrhG anwendbar; share:true im Frontmatter; Clean-Room-Ableitung.
plagiarism_score_at_generation: 0.0204
- id: AC-AI-DATA-QB-11-prozesse
canonical_name: Prozesse
description: Es ist sicherzustellen, dass jeder Schritt der Datenvorbereitung und
-verarbeitung für KI-Trainingszwecke lückenlos protokolliert wird, um die vollständige
Nachvollziehbarkeit der Datenherkunft und aller Transformationen zu gewährleisten.
Diese Dokumentation muss so strukturiert sein, dass sie eine valide Reproduzierbarkeit
der Modelle sowie eine fundierte Qualitätssicherung der zugrundeliegenden Datensätze
ermöglicht. Durch die Erfassung aller Änderungsereignisse wird die Integrität
der Trainingsdaten über den gesamten Lebenszyklus hinweg verifiziert.
kind: building_block
regulation_anchor: EU AI Act Art. 10 (Datenqualität für Hochrisiko-KI)
related_quaidal_ids:
- MA-18
- MA-21
external_refs:
- framework: BSI Grundschutz
citation: null
- framework: ISO/IEC 23894
citation: null
- framework: ISO/IEC 42001
citation: null
- framework: AI Act
citation: null
source:
framework: BSI QUAIDAL
section: QB-11
title_original_de: QB-11 Prozesse
url: https://github.com/BSI-Bund/QUAIDAL/blob/main/0000_Markdown/0001_Criteria,Measurements,Metrics/0001_Qualitätsbausteine/QB-11_Processes.md
commit_sha: c39b75369841b359c6bf56d6588e3768c722842f
license_note: § 5 UrhG anwendbar; share:true im Frontmatter; Clean-Room-Ableitung.
plagiarism_score_at_generation: 0.0
- id: AC-AI-DATA-QB-12-merkmalsentwicklung
canonical_name: Merkmalsentwicklung
description: Die Erstellung und Auswahl von Eingangsmerkmalen für KI-Modelle ist
so zu gestalten, dass sie signifikante Korrelationen zur Zielgröße aufweisen und
redundante Informationen eliminieren. Es ist sicherzustellen, dass die transformierten
Daten generalisierbar sind und eine hohe Informationsdichte für neue, unbekannte
Datensätze bieten. Eine Validierung muss nachweisen, dass die abgeleiteten Merkmale
die Interpretierbarkeit des Modells unterstützen und keine unnötige Komplexität
verursachen.
kind: building_block
regulation_anchor: EU AI Act Art. 10 (Datenqualität für Hochrisiko-KI)
related_quaidal_ids:
- MA-01
- MA-02
- MA-03
- MA-06
- MA-12
- MA-14
- MA-17
- MA-23
- MA-24
- MA-27
external_refs: []
source:
framework: BSI QUAIDAL
section: QB-12
title_original_de: QB-12 Merkmalsentwicklung
url: https://github.com/BSI-Bund/QUAIDAL/blob/main/0000_Markdown/0001_Criteria,Measurements,Metrics/0001_Qualitätsbausteine/QB-12_FeatureEngineering.md
commit_sha: c39b75369841b359c6bf56d6588e3768c722842f
license_note: § 5 UrhG anwendbar; share:true im Frontmatter; Clean-Room-Ableitung.
plagiarism_score_at_generation: 0.0
- id: AC-AI-DATA-QB-13-datenvorbereitung
canonical_name: Datenvorbereitung
description: Vor der Initialisierung des Trainingsprozesses sind alle Rohdaten durch
definierte Transformationen in eine qualitätsgeprüfte und für das Modell verarbeitbare
Struktur zu überführen. Es ist sicherzustellen, dass jede angewandte Datenaufbereitung
die Integrität der Trainingsmenge gewährleistet und keine nicht validierten Artefakte
in das Lernsystem einfließen. Die Durchführbarkeit dieser Schritte ist vor dem
Start der Modellkonvergenz durch systematische Prüfverfahren nachzuweisen.
kind: building_block
regulation_anchor: EU AI Act Art. 10 (Datenqualität für Hochrisiko-KI)
related_quaidal_ids:
- MA-02
- MA-03
- MA-04
- MA-13
- MA-14
- MA-16
- MA-17
- MA-23
- MA-24
- MA-25
- MA-27
- MA-29
external_refs: []
source:
framework: BSI QUAIDAL
section: QB-13
title_original_de: QB-13 Datenvorbereitung
url: https://github.com/BSI-Bund/QUAIDAL/blob/main/0000_Markdown/0001_Criteria,Measurements,Metrics/0001_Qualitätsbausteine/QB-13_DataPreparation.md
commit_sha: c39b75369841b359c6bf56d6588e3768c722842f
license_note: § 5 UrhG anwendbar; share:true im Frontmatter; Clean-Room-Ableitung.
plagiarism_score_at_generation: 0.0
- id: AC-AI-DATA-QB-14-expertanalysis
canonical_name: Expertanalysis
description: Die Qualität der KI-Trainingsdaten ist durch eine unabhängige, manuelle
Begutachtung durch qualifiziertes Fachpersonal zu validieren. Dabei sind mehrere
Prüfer eigenständig einzusetzen, um subjektive Verzerrungen und Gruppenkonformitätseffekte
bei der Bewertung auszuschließen. Die Ergebnisse dieser fachlichen Analyse müssen
anonymisiert zusammengeführt werden, um eine objektive Beurteilung der Datensatzqualität
zu gewährleisten.
kind: building_block
regulation_anchor: EU AI Act Art. 10 (Datenqualität für Hochrisiko-KI)
related_quaidal_ids:
- MA-06
- MA-10
- MA-14
- MA-15
- MA-21
- MA-22
external_refs: []
source:
framework: BSI QUAIDAL
section: QB-14
title_original_de: QB-14_Expertanalysis
url: https://github.com/BSI-Bund/QUAIDAL/blob/main/0000_Markdown/0001_Criteria,Measurements,Metrics/0001_Qualitätsbausteine/QB-14_Expertanalysis.md
commit_sha: c39b75369841b359c6bf56d6588e3768c722842f
license_note: § 5 UrhG anwendbar; share:true im Frontmatter; Clean-Room-Ableitung.
plagiarism_score_at_generation: 0.0
- id: AC-AI-DATA-QB-15-bias-mitigation
canonical_name: Bias-Mitigation
description: Das System muss technische Mechanismen implementieren, um systematische
Verzerrungen in den Trainingsdaten oder während des Lernprozesses zu identifizieren
und zu kompensieren. Diese Maßnahmen sind unabhängig vom Entwicklungsstadium anzuwenden,
wobei Datenanpassungen vor dem Training, Regularisierungsverfahren während des
Lernens oder Korrekturen der Ausgabeergebnisse nach dem Training möglich sind.
Eine Prüfung der Fairness-Kriterien ist vor der Freigabe des Modells durchzuführen,
um sicherzustellen, dass keine diskriminierenden Muster in den Ergebnissen verbleiben.
kind: building_block
regulation_anchor: EU AI Act Art. 10 (Datenqualität für Hochrisiko-KI)
related_quaidal_ids:
- MA-30
- QM-57
external_refs: []
source:
framework: BSI QUAIDAL
section: QB-15
title_original_de: QB-15 Bias-Mitigation
url: https://github.com/BSI-Bund/QUAIDAL/blob/main/0000_Markdown/0001_Criteria,Measurements,Metrics/0001_Qualitätsbausteine/QB-15_Bias-Mitigation.md
commit_sha: c39b75369841b359c6bf56d6588e3768c722842f
license_note: § 5 UrhG anwendbar; share:true im Frontmatter; Clean-Room-Ableitung.
plagiarism_score_at_generation: 0.0
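Each control carries a `plagiarism_score_at_generation`, and the file header sets `plagiarism_limit_4gram: 0.2`. The catalogue does not say how the score is computed. The sketch below assumes one plausible definition (the share of the candidate's word 4-grams that also occur in the source text) and shows how entries could be checked against the limit. All function names are hypothetical.

```python
from typing import Iterable, List, Set, Tuple


def word_ngrams(text: str, n: int = 4) -> Set[Tuple[str, ...]]:
    """Set of lowercase word n-grams; empty if the text has fewer than n words."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def ngram_overlap(candidate: str, source: str, n: int = 4) -> float:
    """Assumed metric: fraction of candidate n-grams that also appear in source."""
    cand = word_ngrams(candidate, n)
    if not cand:
        return 0.0
    return len(cand & word_ngrams(source, n)) / len(cand)


def controls_over_limit(controls: Iterable[dict], limit: float) -> List[str]:
    """IDs of controls whose recorded score exceeds the file-level limit."""
    return [
        c["id"]
        for c in controls
        if c["plagiarism_score_at_generation"] > limit
    ]


if __name__ == "__main__":
    controls = [
        {"id": "AC-AI-DATA-QB-03-vielfalt", "plagiarism_score_at_generation": 0.0204},
        {"id": "AC-AI-DATA-QB-15-bias-mitigation", "plagiarism_score_at_generation": 0.0},
    ]
    print(controls_over_limit(controls, limit=0.2))
```

With the scores recorded in this file (all at or below 0.0204), no control would exceed the 0.2 limit under this check.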
@@ -0,0 +1,280 @@
source: Derived from BSI QUAIDAL (Clean-Room)
source_url: https://github.com/BSI-Bund/QUAIDAL
commit_sha: c39b75369841b359c6bf56d6588e3768c722842f
plagiarism_limit_4gram: 0.2
generated_by_model: qwen3.5:35b-a3b
controls:
- id: MC-AI-DATA-QKB-01-repraesentativitaet
canonical_name: Repräsentativität
description: Der Trainingsdatensatz muss die statistische Verteilung der Zielpopulation
exakt abbilden, um systematische Verzerrungen im Modell zu vermeiden. Es ist sicherzustellen,
dass alle relevanten Merkmalsausprägungen in ausreichender Häufigkeit und ohne
Über- oder Unterrepräsentation vorliegen. Die Datenmenge ist so zu dimensionieren,
dass eine robuste Generalisierungsfähigkeit für alle Subgruppen der Gesamtpopulation
gewährleistet wird. Eine Prüfung auf Stichprobenqualität ist vor dem Training
durchzuführen.
kind: criterion
regulation_anchor: EU AI Act Art. 10 (Datenqualität für Hochrisiko-KI)
related_quaidal_ids:
- QB-03
- QB-04
- QB-05
- QB-06
- QB-15
external_refs:
- framework: AI Act
citation: Artikel 10
- framework: ISO/IEC 25012
citation: null
source:
framework: BSI QUAIDAL
section: QKB-01
title_original_de: QKB-01 Repräsentativität
url: https://github.com/BSI-Bund/QUAIDAL/blob/main/0000_Markdown/0001_Criteria,Measurements,Metrics/0000_Qualitätskriterien/QKB-01_Representativity.md
commit_sha: c39b75369841b359c6bf56d6588e3768c722842f
license_note: § 5 UrhG anwendbar; share:true im Frontmatter; Clean-Room-Ableitung.
plagiarism_score_at_generation: 0.0
- id: MC-AI-DATA-QKB-02-vollstaendigkeit
canonical_name: Vollständigkeit
description: Der Datensatz muss sämtliche für das spezifische KI-Modell erwarteten
Attribute und Merkmalsausprägungen lückenlos beinhalten. Es ist sicherzustellen,
dass keine Entitätsinstanzen fehlen und alle definierten Merkmale mit Werten belegt
sind. Eine Prüfung auf fehlende Werte oder unvollständige Attributmengen ist vor
dem Training zwingend durchzuführen, um Verzerrungen zu vermeiden.
kind: criterion
regulation_anchor: EU AI Act Art. 10 (Datenqualität für Hochrisiko-KI)
related_quaidal_ids:
- QB-07
- QB-09
external_refs:
- framework: AI Act
citation: Artikel 10
- framework: BSI AIC4
citation: null
- framework: ISO/IEC 25012
citation: null
- framework: ISO/IEC 25024
citation: null
source:
framework: BSI QUAIDAL
section: QKB-02
title_original_de: QKB-02 Vollständigkeit
url: https://github.com/BSI-Bund/QUAIDAL/blob/main/0000_Markdown/0001_Criteria,Measurements,Metrics/0000_Qualitätskriterien/QKB-02_Completeness.md
commit_sha: c39b75369841b359c6bf56d6588e3768c722842f
license_note: § 5 UrhG anwendbar; share:true im Frontmatter; Clean-Room-Ableitung.
plagiarism_score_at_generation: 0.0
- id: MC-AI-DATA-QKB-03-genauigkeit
canonical_name: Genauigkeit
description: Die Integrität der KI-Trainingsdaten erfordert, dass jeder einzelne
Datenelementwert eine definierte numerische oder symbolische Übereinstimmung mit
dem referenzierten Sollwert aufweist. Es ist sicherzustellen, dass Abweichungen
innerhalb festgelegter Toleranzgrenzen bezüglich Rundung, Formatierung und Messauflösung
bleiben. Die Einhaltung dieser Spezifikation ist durch automatisierte Prüfverfahren
vor jedem Trainingslauf zu verifizieren.
kind: criterion
regulation_anchor: EU AI Act Art. 10 (Datenqualität für Hochrisiko-KI)
related_quaidal_ids:
- QB-01
- QB-02
external_refs:
- framework: ISO/IEC 25012
citation: null
source:
framework: BSI QUAIDAL
section: QKB-03
title_original_de: QKB-03 Genauigkeit
url: https://github.com/BSI-Bund/QUAIDAL/blob/main/0000_Markdown/0001_Criteria,Measurements,Metrics/0000_Qualitätskriterien/QKB-03_Accuracy.md
commit_sha: c39b75369841b359c6bf56d6588e3768c722842f
license_note: § 5 UrhG anwendbar; share:true im Frontmatter; Clean-Room-Ableitung.
plagiarism_score_at_generation: 0.0
- id: MC-AI-DATA-QKB-04-konsistenz
canonical_name: Konsistenz
description: Das System muss sicherstellen, dass alle Eingabedaten für das KI-Training
logisch kohärent und frei von internen Widersprüchen sind. Einheitliche Kodierungen
für Kategorien sowie konsistente Formatierungen sind zwingend erforderlich, um
eine fehlerfreie Generalisierung durch das Modell zu ermöglichen. Jede Abweichung
von den definierten Datenstandards ist durch automatische Prüfmechanismen zu identifizieren
und zu unterbinden.
kind: criterion
regulation_anchor: EU AI Act Art. 10 (Datenqualität für Hochrisiko-KI)
related_quaidal_ids:
- QB-02
- QB-07
- QB-08
- QB-10
- QB-11
- QB-12
external_refs:
- framework: ISO/IEC 25012
citation: null
source:
framework: BSI QUAIDAL
section: QKB-04
title_original_de: QKB-04 Konsistenz
url: https://github.com/BSI-Bund/QUAIDAL/blob/main/0000_Markdown/0001_Criteria,Measurements,Metrics/0000_Qualitätskriterien/QKB-04_Consistency.md
commit_sha: c39b75369841b359c6bf56d6588e3768c722842f
license_note: § 5 UrhG anwendbar; share:true im Frontmatter; Clean-Room-Ableitung.
plagiarism_score_at_generation: 0.0
- id: MC-AI-DATA-QKB-05-korrektheit
canonical_name: Korrektheit
description: Das KI-Modell muss ausschließlich auf Datensätzen trainiert werden,
die inhaltlich frei von Fehlern sind und den tatsächlichen Gegebenheiten oder
definierten Referenzstandards exakt entsprechen. Es ist sicherzustellen, dass
jede annotierte Information den als wahr geltenden Zustand im Anwendungskontext
fehlerfrei abbildet. Die Validierung der Trainingsdaten ist vor Beginn des Lernprozesses
durchzuführen, um sicherzustellen, dass keine inkorrekten Werte die Modellleistung
beeinträchtigen.
kind: criterion
regulation_anchor: EU AI Act Art. 10 (Datenqualität für Hochrisiko-KI)
related_quaidal_ids:
- QB-09
- QB-10
- QB-12
- QB-14
external_refs:
- framework: ISO/IEC 25012
citation: null
- framework: BSI AIC4
citation: null
- framework: AI Act
citation: Artikel 10
source:
framework: BSI QUAIDAL
section: QKB-05
title_original_de: QKB-05 Korrektheit
url: https://github.com/BSI-Bund/QUAIDAL/blob/main/0000_Markdown/0001_Criteria,Measurements,Metrics/0000_Qualitätskriterien/QKB-05_Correctness.md
commit_sha: c39b75369841b359c6bf56d6588e3768c722842f
license_note: § 5 UrhG anwendbar; share:true im Frontmatter; Clean-Room-Ableitung.
plagiarism_score_at_generation: 0.0
- id: MC-AI-DATA-QKB-06-einheitlichkeit
canonical_name: Einheitlichkeit
description: Die Konsistenz der KI-Trainingsdaten ist durch die strikte Einhaltung
definierter Syntaxregeln und Datenstrukturen sicherzustellen. Jedes Datenelement
muss vor der Verarbeitung gemäß festgelegten Standards formatiert werden, um strukturelle
Abweichungen auszuschließen. Eine Prüfung der formalen Einheitlichkeit ist unabhängig
von der inhaltlichen Richtigkeit der Werte durchzuführen.
kind: criterion
regulation_anchor: EU AI Act Art. 10 (Datenqualität für Hochrisiko-KI)
related_quaidal_ids:
- QB-02
- QB-08
- QB-10
- QB-12
- QB-14
external_refs:
- framework: ISO/IEC 25012
citation: null
source:
framework: BSI QUAIDAL
section: QKB-06
title_original_de: QKB-06 Einheitlichkeit
url: https://github.com/BSI-Bund/QUAIDAL/blob/main/0000_Markdown/0001_Criteria,Measurements,Metrics/0000_Qualitätskriterien/QKB-06_Uniformity.md
commit_sha: c39b75369841b359c6bf56d6588e3768c722842f
license_note: § 5 UrhG anwendbar; share:true im Frontmatter; Clean-Room-Ableitung.
plagiarism_score_at_generation: 0.0
- id: MC-AI-DATA-QKB-07-gueltigkeit
canonical_name: Gültigkeit
description: Das System muss sicherstellen, dass die für das KI-Training verwendeten
Daten inhaltlich exakt das intendierte Zielkonstrukt abbilden und nicht nur oberflächliche
Korrelationen erfassen. Es ist zu prüfen, ob die erfassten Merkmale den theoretischen
Anforderungen an den Messgegenstand entsprechen, um eine valide Grundlage für
Ableitungen zu gewährleisten. Eine Abweichung zwischen dem gemessenen Inhalt und
dem definierten Zielkonzept ist als Fehlerzustand zu klassifizieren und muss ausgeschlossen
werden.
kind: criterion
regulation_anchor: EU AI Act Art. 10 (Datenqualität für Hochrisiko-KI)
related_quaidal_ids:
- QB-02
- QB-05
- QB-09
- QB-10
- QB-14
external_refs:
- framework: ISO/IEC 25012
citation: null
source:
framework: BSI QUAIDAL
section: QKB-07
title_original_de: QKB-07 Gültigkeit
url: https://github.com/BSI-Bund/QUAIDAL/blob/main/0000_Markdown/0001_Criteria,Measurements,Metrics/0000_Qualitätskriterien/QKB-07_Validity.md
commit_sha: c39b75369841b359c6bf56d6588e3768c722842f
license_note: § 5 UrhG anwendbar; share:true im Frontmatter; Clean-Room-Ableitung.
plagiarism_score_at_generation: 0.0
- id: MC-AI-DATA-QKB-08-eindeutigkeit
canonical_name: Eindeutigkeit
description: Jeder Datensatz im Trainingskorpus muss eine eindeutige Identität besitzen,
um die Entstehung redundanter Instanzen auszuschließen. Es ist sicherzustellen,
dass keine doppelten oder mehrdeutigen Einträge vorliegen, da diese die Modellgeneralisierung
beeinträchtigen und zu Overfitting führen können. Die Validierung muss nachweisen,
dass jede Dateneinheit eindeutig identifizierbar ist und logisch von anderen unterscheidbar
bleibt.
kind: criterion
regulation_anchor: EU AI Act Art. 10 (Datenqualität für Hochrisiko-KI)
related_quaidal_ids:
- QB-05
- QB-10
- QB-13
external_refs:
- framework: ISO/IEC 25012
citation: null
source:
framework: BSI QUAIDAL
section: QKB-08
title_original_de: QKB-08 Eindeutigkeit
url: https://github.com/BSI-Bund/QUAIDAL/blob/main/0000_Markdown/0001_Criteria,Measurements,Metrics/0000_Qualitätskriterien/QKB-08_Uniqueness.md
commit_sha: c39b75369841b359c6bf56d6588e3768c722842f
license_note: § 5 UrhG anwendbar; share:true im Frontmatter; Clean-Room-Ableitung.
plagiarism_score_at_generation: 0.0
- id: MC-AI-DATA-QKB-09-sichere-quellen
canonical_name: Sichere Quellen
description: Für KI-Trainingsdaten muss eine lückenlose Provenienz-Dokumentation
etabliert werden, die jeden Verarbeitungsschritt von der Erfassung bis zur finalen
Nutzung nachvollziehbar macht. Es ist sicherzustellen, dass alle Transformationen
und Herkunftsinformationen vollständig erfasst sind, um die Datenintegrität und
-qualität kontinuierlich verifizieren zu können. Die Nachprüfbarkeit dieser Metadaten
ist zwingend erforderlich, um potenzielle Qualitätsmängel oder Manipulationen
in den Trainingsbeständen frühzeitig zu identifizieren.
kind: criterion
regulation_anchor: EU AI Act Art. 10 (Datenqualität für Hochrisiko-KI)
related_quaidal_ids:
- QB-09
- QB-11
external_refs:
- framework: ISO/IEC 25012
citation: null
- framework: BSI AIC4
citation: null
source:
framework: BSI QUAIDAL
section: QKB-09
title_original_de: QKB-09 Sichere Quellen
url: https://github.com/BSI-Bund/QUAIDAL/blob/main/0000_Markdown/0001_Criteria,Measurements,Metrics/0000_Qualitätskriterien/QKB-09_SecureSource.md
commit_sha: c39b75369841b359c6bf56d6588e3768c722842f
license_note: § 5 UrhG anwendbar; share:true im Frontmatter; Clean-Room-Ableitung.
plagiarism_score_at_generation: 0.0
- id: MC-AI-DATA-QKB-10-daten-mit-personenbezug
canonical_name: Daten mit Personenbezug
description: Das System muss vor der Nutzung von Trainingsdaten eine automatisierte
Prüfung durchführen, um personenbezogene Informationen zu identifizieren. Sind
derartige Daten Bestandteil der Eingabedaten, ist deren vollständige und nachweisbare
Entfernung sicherzustellen, bevor ein Modelltraining initiiert wird. Die Integrität
der verbleibenden Datensätze ist durch technische Maßnahmen gegen unbeabsichtigte
Wiederverwendung zu gewährleisten.
kind: criterion
regulation_anchor: EU AI Act Art. 10 (Datenqualität für Hochrisiko-KI)
related_quaidal_ids:
- QB-09
- QB-10
- QB-11
- QB-14
external_refs:
- framework: EU GDPR
citation: null
source:
framework: BSI QUAIDAL
section: QKB-10
title_original_de: QKB-10 Daten mit Personenbezug
url: https://github.com/BSI-Bund/QUAIDAL/blob/main/0000_Markdown/0001_Criteria,Measurements,Metrics/0000_Qualitätskriterien/QKB-10_PersonalDataCheck.md
commit_sha: c39b75369841b359c6bf56d6588e3768c722842f
license_note: § 5 UrhG anwendbar; share:true im Frontmatter; Clean-Room-Ableitung.
plagiarism_score_at_generation: 0.0
@@ -0,0 +1,753 @@
source: Derived from BSI QUAIDAL (Clean-Room)
source_url: https://github.com/BSI-Bund/QUAIDAL
commit_sha: c39b75369841b359c6bf56d6588e3768c722842f
plagiarism_limit_4gram: 0.2
generated_by_model: qwen3.5:35b-a3b
controls:
- id: MIT-AI-DATA-MA-01-datentyp-validierung
canonical_name: Datentyp Validierung
description: Es ist sicherzustellen, dass alle Eingabedaten und Trainingsdatensätze
vor der Verarbeitung auf Konformität mit den definierten Schemata und Datentypen
des Modells geprüft werden. Abweichungen von den erwarteten Formaten sind automatisch
zu identifizieren und müssen entweder bereinigt oder ausgeschlossen werden, um
Inferenzfehler zu verhindern. Diese Validierung ist als automatisierter Schritt
in den Datenpipelines zu implementieren, um die Integrität der KI-Systeme zu gewährleisten.
kind: measure
regulation_anchor: EU AI Act Art. 10 (Datenqualität für Hochrisiko-KI)
related_quaidal_ids:
- QM-32
- QM-34
external_refs: []
source:
framework: BSI QUAIDAL
section: MA-01
title_original_de: MA-01 Datentyp Validierung
url: https://github.com/BSI-Bund/QUAIDAL/blob/main/0000_Markdown/0001_Criteria,Measurements,Metrics/0002_Maßnahmen/MA-01_Datatype%20Validation.md
commit_sha: c39b75369841b359c6bf56d6588e3768c722842f
license_note: § 5 UrhG anwendbar; share:true im Frontmatter; Clean-Room-Ableitung.
plagiarism_score_at_generation: 0.0
- id: MIT-AI-DATA-MA-02-format-pruefung
canonical_name: Format Prüfung
description: Die Eingabedaten für KI-Trainingszwecke sind vor der Verarbeitung auf
strukturelle Korrektheit zu validieren, wobei Datentypen wie Zeitstempel oder
Textfelder exakt den definierten Schemata entsprechen müssen. Durch die Erzwingung
einer einheitlichen Formatierung wird verhindert, dass regionale Abweichungen
oder inkonsistente Darstellungen zu Fehlinterpretationen im Modell führen. Die
Konformität ist automatisiert zu prüfen, um sicherzustellen, dass keine nicht
konformen Datensätze in den Lernprozess eingehen.
kind: measure
regulation_anchor: EU AI Act Art. 10 (Datenqualität für Hochrisiko-KI)
related_quaidal_ids:
- QM-32
- QM-34
- QM-43
external_refs: []
source:
framework: BSI QUAIDAL
section: MA-02
title_original_de: MA-02 Format Prüfung
url: https://github.com/BSI-Bund/QUAIDAL/blob/main/0000_Markdown/0001_Criteria,Measurements,Metrics/0002_Maßnahmen/MA-02_Format%20Check.md
commit_sha: c39b75369841b359c6bf56d6588e3768c722842f
license_note: § 5 UrhG anwendbar; share:true im Frontmatter; Clean-Room-Ableitung.
plagiarism_score_at_generation: 0.0
- id: MIT-AI-DATA-MA-03-bereichspruefung
canonical_name: Bereichsprüfung
description: Das System muss vor dem KI-Training eine automatische Validierung aller
Eingangsmerkmale durchführen, um Werte außerhalb definierter physikalischer oder
logischer Grenzen zu identifizieren. Dabei sind insbesondere inkonsistente Datentypen,
fehlerhafte Maßeinheiten und statistisch unplausible Ausreißer zu detektieren
und zu isolieren. Die Integrität des Trainingsdatensatzes ist erst dann gewährleistet,
wenn alle nicht konformen Einträge ausgeschlossen oder korrigiert wurden, bevor
der Lernprozess initiiert wird.
kind: measure
regulation_anchor: EU AI Act Art. 10 (Datenqualität für Hochrisiko-KI)
related_quaidal_ids:
- QM-51
- QM-52
external_refs: []
source:
framework: BSI QUAIDAL
section: MA-03
title_original_de: MA-03 Bereichsprüfung
url: https://github.com/BSI-Bund/QUAIDAL/blob/main/0000_Markdown/0001_Criteria,Measurements,Metrics/0002_Maßnahmen/MA-03_Range%20Check.md
commit_sha: c39b75369841b359c6bf56d6588e3768c722842f
license_note: § 5 UrhG anwendbar; share:true im Frontmatter; Clean-Room-Ableitung.
plagiarism_score_at_generation: 0.0
- id: MIT-AI-DATA-MA-04-over-undersampling
canonical_name: Over-Undersampling
description: Das Daten-Set für das KI-Training ist auf ein ausgewogenes Klassenverhältnis
zu prüfen, wobei eine künstliche Aufstockung seltener Kategorien durch synthetische
Generierung oder Duplizierung zulässig ist. Alternativ ist eine Reduktion der
Datenpunkte der Mehrheitsklasse nach definierten Kriterien durchzuführen, um eine
Verzerrung des Modells zu vermeiden. Die angewandte Methode zur Erreichung dieses
Gleichgewichts ist zu dokumentieren und muss reproduzierbar sein.
kind: measure
regulation_anchor: EU AI Act Art. 10 (Datenqualität für Hochrisiko-KI)
related_quaidal_ids:
- QM-34
- QM-38
- QM-57
external_refs: []
source:
framework: BSI QUAIDAL
section: MA-04
title_original_de: MA-04 Over-Undersampling
url: https://github.com/BSI-Bund/QUAIDAL/blob/main/0000_Markdown/0001_Criteria,Measurements,Metrics/0002_Maßnahmen/MA-04_Over-Undersampling.md
commit_sha: c39b75369841b359c6bf56d6588e3768c722842f
license_note: § 5 UrhG anwendbar; share:true im Frontmatter; Clean-Room-Ableitung.
plagiarism_score_at_generation: 0.0
- id: MIT-AI-DATA-MA-05-automatisierte-aufgaben
canonical_name: Automatisierte Aufgaben
description: Wiederkehrende Prozesse der Datenvorverarbeitung und Qualitätsprüfung
im KI-Lebenszyklus sind durch automatisierte Mechanismen zu implementieren. Die
Ausführung dieser Aufgaben muss so konfiguriert sein, dass eine konsistente Ergebnisqualität
über alle Durchläufe hinweg sichergestellt wird. Es ist zu prüfen, dass die eingesetzten
Automatisierungswerkzeuge spezifische Validierungsregeln für Trainingsdaten zuverlässig
anwenden.
kind: measure
regulation_anchor: EU AI Act Art. 10 (Datenqualität für Hochrisiko-KI)
related_quaidal_ids:
- MA-02
- MA-03
- QM-10
- QM-34
- QM-64
external_refs:
- framework: AI Act
citation: null
source:
framework: BSI QUAIDAL
section: MA-05
title_original_de: MA-05 Automatisierte Aufgaben
url: https://github.com/BSI-Bund/QUAIDAL/blob/main/0000_Markdown/0001_Criteria,Measurements,Metrics/0002_Maßnahmen/MA-05_Automated%20Tasks.md
commit_sha: c39b75369841b359c6bf56d6588e3768c722842f
license_note: § 5 UrhG anwendbar; share:true im Frontmatter; Clean-Room-Ableitung.
plagiarism_score_at_generation: 0.0
- id: MIT-AI-DATA-MA-06-experten-auswertung
canonical_name: Experten Auswertung
description: Für die Validierung von KI-Trainingsdaten ist eine manuelle Prüfung
durch qualifizierte Fachexperten zwingend erforderlich. Diese Experten müssen
die inhaltliche Gültigkeit, Relevanz und Korrektheit der Datensätze auf Basis
domänenspezifischen Wissens systematisch evaluieren. Das Ergebnis dieser Begutachtung
dient dazu, methodische Fehler oder qualitative Mängel frühzeitig zu identifizieren
und konkrete Maßnahmen zur Datenbereinigung abzuleiten.
kind: measure
regulation_anchor: EU AI Act Art. 10 (Datenqualität für Hochrisiko-KI)
related_quaidal_ids:
- QM-16
- QM-30
- QM-43
- QM-45
- QM-59
- QM-70
external_refs:
- framework: ISO/IEC 25012
citation: null
- framework: ISO/IEC 25024
citation: null
source:
framework: BSI QUAIDAL
section: MA-06
title_original_de: MA-06 Experten Auswertung
url: https://github.com/BSI-Bund/QUAIDAL/blob/main/0000_Markdown/0001_Criteria,Measurements,Metrics/0002_Maßnahmen/MA-06_Expert%20Evaluation.md
commit_sha: c39b75369841b359c6bf56d6588e3768c722842f
license_note: § 5 UrhG anwendbar; share:true im Frontmatter; Clean-Room-Ableitung.
plagiarism_score_at_generation: 0.0204
- id: MIT-AI-DATA-MA-07-massenbeteiligung
canonical_name: Massenbeteiligung
description: Das System muss Mechanismen implementieren, um die Qualität von Trainingsdaten
durch dezentrale Validierung durch eine heterogene Gruppe externer Prüfer sicherzustellen.
Es ist zwingend erforderlich, dass die Ergebnisse dieser kollektiven Überprüfung
mit internen Qualitätsstandards abgeglichen werden, um systematische Fehler in
den annotierten Datensätzen zu identifizieren. Die Integrität der KI-Modelle ist
nur gewährleistet, wenn diese skalierbare Prüfprozedur für kritische Datenmengen
routinemäßig angewendet wird.
kind: measure
regulation_anchor: EU AI Act Art. 10 (Datenqualität für Hochrisiko-KI)
related_quaidal_ids:
- MA-06
- QM-03
- QM-16
- QM-43
external_refs: []
source:
framework: BSI QUAIDAL
section: MA-07
title_original_de: MA-07 Massenbeteiligung
url: https://github.com/BSI-Bund/QUAIDAL/blob/main/0000_Markdown/0001_Criteria,Measurements,Metrics/0002_Maßnahmen/MA-07_Crowdsourcing.md
commit_sha: c39b75369841b359c6bf56d6588e3768c722842f
license_note: § 5 UrhG anwendbar; share:true im Frontmatter; Clean-Room-Ableitung.
plagiarism_score_at_generation: 0.0
- id: MIT-AI-DATA-MA-08-verteilungsanalyse
canonical_name: Verteilungsanalyse
description: Es ist sicherzustellen, dass die Verteilung der Trainingsdaten über
alle relevanten Klassen und Merkmalsbereiche systematisch auf statistische Verzerrungen
und Anomalien geprüft wird. Diese Analyse muss nachweisen, dass das Modell auf
einer repräsentativen und ausgewogenen Datenbasis trainiert wurde, um die Generalisierungsfähigkeit
der Vorhersagen zu gewährleisten. Die Ergebnisse der Verteilungsprüfung sind vor
Beginn des Trainings zu dokumentieren und bei signifikanten Abweichungen sind
Korrekturmaßnahmen einzuleiten.
kind: measure
regulation_anchor: EU AI Act Art. 10 (Datenqualität für Hochrisiko-KI)
related_quaidal_ids:
- MA-06
- QM-10
- QM-11
- QM-51
external_refs: []
source:
framework: BSI QUAIDAL
section: MA-08
title_original_de: MA-08 Verteilungsanalyse
url: https://github.com/BSI-Bund/QUAIDAL/blob/main/0000_Markdown/0001_Criteria,Measurements,Metrics/0002_Maßnahmen/MA-08_DistributionAnalysis.md
commit_sha: c39b75369841b359c6bf56d6588e3768c722842f
license_note: § 5 UrhG anwendbar; share:true im Frontmatter; Clean-Room-Ableitung.
plagiarism_score_at_generation: 0.0339
- id: MIT-AI-DATA-MA-09-vergleichgrundgesamtheit
canonical_name: VergleichGrundgesamtheit
description: Das System muss eine repräsentative Referenzstichprobe aus der Zielverteilung
bereitstellen, um die Validität von KI-Trainingsdaten zu verifizieren. Es ist
sicherzustellen, dass diese Referenzdaten als Goldstandard dienen, um Abweichungen
zwischen dem Trainingsset und der tatsächlichen Grundgesamtheit zu quantifizieren.
Die Übereinstimmung ist durch einen automatisierten Abgleich mit den vorab definierten
Verteilungsparametern zu prüfen.
kind: measure
regulation_anchor: EU AI Act Art. 10 (Datenqualität für Hochrisiko-KI)
related_quaidal_ids:
- MA-09
- QM-51
- QM-52
external_refs: []
source:
framework: BSI QUAIDAL
section: MA-09
title_original_de: MA-09 VergleichGrundgesamtheit
url: https://github.com/BSI-Bund/QUAIDAL/blob/main/0000_Markdown/0001_Criteria,Measurements,Metrics/0002_Maßnahmen/MA-09_CompareGroundtruth.md
commit_sha: c39b75369841b359c6bf56d6588e3768c722842f
license_note: § 5 UrhG anwendbar; share:true im Frontmatter; Clean-Room-Ableitung.
plagiarism_score_at_generation: 0.0
- id: MIT-AI-DATA-MA-10-gewichtung-der-daten
canonical_name: Gewichtung der Daten
description: Für KI-Trainingsdatensätze ist eine manuelle Gewichtung der einzelnen
Merkmale zwingend erforderlich, um systematische Verzerrungen zu minimieren. Diese
Maßnahme dient der Sicherstellung einer ausgewogenen Datenrepräsentation und verbessert
die Generalisierungsfähigkeit des Modells auf spezifische Anwendungsfälle. Die
Zuordnung der Gewichtungsfaktoren ist vor dem Training durchzuführen und muss
dokumentiert werden, um die Nachvollziehbarkeit der Datenqualität zu gewährleisten.
kind: measure
regulation_anchor: EU AI Act Art. 10 (Datenqualität für Hochrisiko-KI)
related_quaidal_ids:
- QM-10
- QM-18
- QM-28
- QM-29
- QM-37
- QM-38
- QM-39
external_refs: []
source:
framework: BSI QUAIDAL
section: MA-10
title_original_de: MA-10 Gewichtung der Daten
url: https://github.com/BSI-Bund/QUAIDAL/blob/main/0000_Markdown/0001_Criteria,Measurements,Metrics/0002_Maßnahmen/MA-10_ManualWeights.md
commit_sha: c39b75369841b359c6bf56d6588e3768c722842f
license_note: § 5 UrhG anwendbar; share:true im Frontmatter; Clean-Room-Ableitung.
plagiarism_score_at_generation: 0.0
- id: MIT-AI-DATA-MA-11-stichprobengroesse
canonical_name: Stichprobengröße
description: Die Menge der für das Training verwendeten Daten ist so zu dimensionieren,
dass statistisch signifikante Ergebnisse bei definiertem Konfidenzniveau und akzeptabler
Fehlervarianz gewährleistet sind. Die Datengröße muss iterativ angepasst werden,
wobei sowohl die Gesamtgröße der zugrundeliegenden Population als auch die spezifische
Art der Datenerweiterung systematisch zu berücksichtigen sind. Eine Validierung
der Datenqualität ist zwingend erforderlich, um Verzerrungen durch unterschiedliche
Skalierungsmethoden auszuschließen.
kind: measure
regulation_anchor: EU AI Act Art. 10 (Datenqualität für Hochrisiko-KI)
related_quaidal_ids:
- QM-08
- QM-09
- QM-39
- QM-41
external_refs: []
source:
framework: BSI QUAIDAL
section: MA-11
title_original_de: MA-11 Stichprobengröße
url: https://github.com/BSI-Bund/QUAIDAL/blob/main/0000_Markdown/0001_Criteria,Measurements,Metrics/0002_Maßnahmen/MA-11_Trainingsdataset%20Size.md
commit_sha: c39b75369841b359c6bf56d6588e3768c722842f
license_note: § 5 UrhG anwendbar; share:true im Frontmatter; Clean-Room-Ableitung.
plagiarism_score_at_generation: 0.0
- id: MIT-AI-DATA-MA-12-abdeckung-relevanter-merkmale
canonical_name: Abdeckung relevanter Merkmale
description: Das Trainingsdatenset muss vollständig alle für die spezifische Problemstellung
essenziellen Eingangsvariablen enthalten, um eine lückenlose Merkmalsabdeckung
zu gewährleisten. Es ist sicherzustellen, dass keine kritischen Einflussgrößen
fehlen, da sonst das Modell keine verlässlichen Korrelationen erlernen kann. Die
Vollständigkeit des Merkmalsraums ist vor Beginn des Trainingsprozesses durch
eine formale Prüfung zu verifizieren.
kind: measure
regulation_anchor: EU AI Act Art. 10 (Datenqualität für Hochrisiko-KI)
related_quaidal_ids:
- MA-06
- MA-14
- QM-10
- QM-11
- QM-13
- QM-25
- QM-26
- QM-27
- QM-28
- QM-29
external_refs: []
source:
framework: BSI QUAIDAL
section: MA-12
title_original_de: MA-12 Abdeckung relevanter Merkmale
url: https://github.com/BSI-Bund/QUAIDAL/blob/main/0000_Markdown/0001_Criteria,Measurements,Metrics/0002_Maßnahmen/MA-12_RelevantFeatureCoverage.md
commit_sha: c39b75369841b359c6bf56d6588e3768c722842f
license_note: § 5 UrhG anwendbar; share:true im Frontmatter; Clean-Room-Ableitung.
plagiarism_score_at_generation: 0.0
- id: MIT-AI-DATA-MA-13-vollstaendige-information-in-datensaetze
canonical_name: Vollständige Information in Datensätzen
description: Für die Validierung von KI-Trainingsdaten ist sicherzustellen, dass
alle für die Analyse erforderlichen Attribute vollständig vorliegen und keine
unbeabsichtigten Lücken existieren. Bei festgestellten Datenfehlern ist zwingend
die Ursache zu ermitteln, um das passende Imputationsverfahren basierend auf dem
spezifischen Fehlerschema auszuwählen. Eine unzureichende Datenbasis darf nicht
zur Modellierung genutzt werden, solange die Integrität der relevanten Information
nicht durch geeignete Maßnahmen wiederhergestellt wurde.
kind: measure
regulation_anchor: EU AI Act Art. 10 (Datenqualität für Hochrisiko-KI)
related_quaidal_ids:
- QM-12
- QM-40
- QM-53
external_refs: []
source:
framework: BSI QUAIDAL
section: MA-13
title_original_de: MA-13 Vollständige Information in Datensätzen
url: https://github.com/BSI-Bund/QUAIDAL/blob/main/0000_Markdown/0001_Criteria,Measurements,Metrics/0002_Maßnahmen/MA-13_CompleteInformation.md
commit_sha: c39b75369841b359c6bf56d6588e3768c722842f
license_note: § 5 UrhG anwendbar; share:true im Frontmatter; Clean-Room-Ableitung.
plagiarism_score_at_generation: 0.0
- id: MIT-AI-DATA-MA-14-eda-explorative-daten-analyse
canonical_name: EDA-Explorative Daten Analyse
description: Vor Beginn des Modelltrainings ist eine explorative Datenanalyse durchzuführen,
um Datenverteilungen, Korrelationen sowie Ausreißer und strukturelle Anomalien
ohne vorab definierte Hypothesen zu identifizieren. Die gewonnenen Erkenntnisse
sind systematisch zu dokumentieren, um die Qualität der Trainingsdaten zu validieren
und fundierte Entscheidungen über notwendige Bereinigungs- oder Erweiterungsschritte
abzuleiten. Auf Basis dieser Analyse ist der Datensatz so anzupassen, dass er
die für die Zielfunktion erforderliche Repräsentativität und Integrität gewährleistet.
kind: measure
regulation_anchor: EU AI Act Art. 10 (Datenqualität für Hochrisiko-KI)
related_quaidal_ids:
- QM-10
- QM-12
- QM-24
- QM-25
- QM-26
- QM-27
- QM-28
- QM-29
- QM-36
- QM-42
- QM-54
- QM-57
- QM-61
external_refs: []
source:
framework: BSI QUAIDAL
section: MA-14
title_original_de: MA-14 EDA-Explorative Daten Analyse
url: https://github.com/BSI-Bund/QUAIDAL/blob/main/0000_Markdown/0001_Criteria,Measurements,Metrics/0002_Maßnahmen/MA-14_EDA-ExplorativeDataAnalysis.md
commit_sha: c39b75369841b359c6bf56d6588e3768c722842f
license_note: § 5 UrhG anwendbar; share:true im Frontmatter; Clean-Room-Ableitung.
plagiarism_score_at_generation: 0.0
- id: MIT-AI-DATA-MA-15-empirische-evidenz
canonical_name: Empirische Evidenz
description: Es ist sicherzustellen, dass die Wirksamkeit von Schutzmaßnahmen gegen
KI-gestützte Angriffe durch den systematischen Vergleich mit historischen Einsatzszenarien
empirisch validiert wird. Dabei sind Leistungsdaten aus vergleichbaren Anwendungsfällen
heranzuziehen, um die Angemessenheit der eingesetzten Trainingsdatensätze und
Methoden für den spezifischen Kontext nachzuweisen. Die Analyse muss belegen,
dass die gewählten Maßnahmen die identifizierten Risiken in der Praxis effektiv
reduzieren und die Datenqualität den aktuellen Bedrohungsmodellen entspricht.
kind: measure
regulation_anchor: EU AI Act Art. 10 (Datenqualität für Hochrisiko-KI)
related_quaidal_ids:
- QM-16
- QM-30
- QM-61
external_refs: []
source:
framework: BSI QUAIDAL
section: MA-15
title_original_de: MA-15 Empirische Evidenz
url: https://github.com/BSI-Bund/QUAIDAL/blob/main/0000_Markdown/0001_Criteria,Measurements,Metrics/0002_Maßnahmen/MA-15_EmpiricEvidence.md
commit_sha: c39b75369841b359c6bf56d6588e3768c722842f
license_note: § 5 UrhG anwendbar; share:true im Frontmatter; Clean-Room-Ableitung.
plagiarism_score_at_generation: 0.0
- id: MIT-AI-DATA-MA-16-daten-imputation
canonical_name: Daten Imputation
description: Für KI-Trainingsdatensätze ist eine systematische Analyse der Ursachen
für fehlende Werte zwingend erforderlich, bevor eine Rekonstruktion erfolgt. Das
gewählte Verfahren zur Datenergänzung muss sich strikt an den identifizierten
Entstehungsgründen orientieren, um die statistische Integrität des Modells zu
wahren. Eine unkritische Imputation ohne Ursachenanalyse ist unzulässig, da sie
das Lernverhalten des Algorithmus verfälschen kann.
kind: measure
regulation_anchor: EU AI Act Art. 10 (Datenqualität für Hochrisiko-KI)
related_quaidal_ids:
- MA-13
- QM-10
- QM-22
- QM-44
- QM-53
external_refs: []
source:
framework: BSI QUAIDAL
section: MA-16
title_original_de: MA-16 Daten Imputation
url: https://github.com/BSI-Bund/QUAIDAL/blob/main/0000_Markdown/0001_Criteria,Measurements,Metrics/0002_Maßnahmen/MA-16_DataImputation.md
commit_sha: c39b75369841b359c6bf56d6588e3768c722842f
license_note: § 5 UrhG anwendbar; share:true im Frontmatter; Clean-Room-Ableitung.
plagiarism_score_at_generation: 0.0
- id: MIT-AI-DATA-MA-17-metadatenverwaltung
canonical_name: Metadatenverwaltung
description: Für den KI-Trainingsprozess ist eine vollständige Dokumentation der
Datenherkunft, der Qualitätsmetriken sowie der rechtlichen Klassifizierung jeder
einzelnen Trainingsinstanz sicherzustellen. Diese strukturellen Begleitinformationen
müssen maschinenlesbar vorliegen, um eine automatisierte Validierung der Datenintegrität
und eine nachvollziehbare Auditierung des Datensatzes zu ermöglichen. Die Erfassung
dieser Attribute ist zwingend erforderlich, um die Eignung der Daten für den spezifischen
Trainingszweck zu gewährleisten und regulatorische Vorgaben einzuhalten.
kind: measure
regulation_anchor: EU AI Act Art. 10 (Datenqualität für Hochrisiko-KI)
related_quaidal_ids:
- QM-59
external_refs: []
source:
framework: BSI QUAIDAL
section: MA-17
title_original_de: MA-17 Metadatenverwaltung
url: https://github.com/BSI-Bund/QUAIDAL/blob/main/0000_Markdown/0001_Criteria,Measurements,Metrics/0002_Maßnahmen/MA-17_MetadataManagement.md
commit_sha: c39b75369841b359c6bf56d6588e3768c722842f
license_note: § 5 UrhG anwendbar; share:true im Frontmatter; Clean-Room-Ableitung.
plagiarism_score_at_generation: 0.0
- id: MIT-AI-DATA-MA-18-provenienztracking
canonical_name: ProvenienzTracking
description: Die Herkunft und der Verarbeitungsweg von KI-Trainingsdaten sind lückenlos
zu dokumentieren, um deren Integrität und Nachvollziehbarkeit sicherzustellen.
Für jeden Datensatz ist eine eindeutige Identifikation des Ursprungs sowie aller
Transformationsschritte im Lebenszyklus zu führen. Diese Metadaten müssen so strukturiert
sein, dass eine Rückverfolgung zur ursprünglichen Quelle jederzeit möglich ist,
ohne dass Datenverluste oder Manipulationen unentdeckt bleiben.
kind: measure
regulation_anchor: EU AI Act Art. 10 (Datenqualität für Hochrisiko-KI)
related_quaidal_ids:
- QM-59
- QM-60
- QM-61
- QM-65
- QM-67
- QM-70
external_refs: []
source:
framework: BSI QUAIDAL
section: MA-18
title_original_de: MA-18 ProvenienzTracking
url: https://github.com/BSI-Bund/QUAIDAL/blob/main/0000_Markdown/0001_Criteria,Measurements,Metrics/0002_Maßnahmen/MA-18_ProvenienzTracking.md
commit_sha: c39b75369841b359c6bf56d6588e3768c722842f
license_note: § 5 UrhG anwendbar; share:true im Frontmatter; Clean-Room-Ableitung.
plagiarism_score_at_generation: 0.0
- id: MIT-AI-DATA-MA-19-audit-trails
canonical_name: Audit Trails
description: Für die Nachvollziehbarkeit von KI-Trainingsprozessen ist ein lückenloses
Protokollierungssystem zu implementieren, das alle Datenmanipulationen und Modellupdates
zeitgestempelt erfasst. Jeder Zugriff auf Trainingsdatensätze sowie jede Änderung
der Modellparameter muss mit eindeutigen Benutzeridentitäten verknüpft werden.
Die gespeicherten Logs müssen so strukturiert sein, dass sie eine vollständige
Rekonstruktion des Datenflusses und eine Rückführung auf frühere Datenqualitätszustände
ermöglichen.
kind: measure
regulation_anchor: EU AI Act Art. 10 (Datenqualität für Hochrisiko-KI)
related_quaidal_ids:
- MA-22
external_refs: []
source:
framework: BSI QUAIDAL
section: MA-19
title_original_de: MA-19 Audit Trails
url: https://github.com/BSI-Bund/QUAIDAL/blob/main/0000_Markdown/0001_Criteria,Measurements,Metrics/0002_Maßnahmen/MA-19_AuditTrails.md
commit_sha: c39b75369841b359c6bf56d6588e3768c722842f
license_note: § 5 UrhG anwendbar; share:true im Frontmatter; Clean-Room-Ableitung.
plagiarism_score_at_generation: 0.0
- id: MIT-AI-DATA-MA-20-prozess-dokumentation
canonical_name: Prozess Dokumentation
description: Für die Sicherstellung der Datenqualität im KI-Trainingsprozess ist
eine vollständige Dokumentation aller Phasen der Datenerstellung und -aufbereitung
zwingend erforderlich. Diese Spezifikation muss verbindlich festlegen, welche
Aktivitäten auszuführen sind, wer hierfür verantwortlich zeichnet, welche Ressourcen
notwendig sind und welche qualitativen Ergebnisse zu erzielen sind. Insbesondere
ist die Nachverfolgbarkeit der Datenherkunft innerhalb des Dokumentationsprozesses
lückenlos zu gewährleisten, um die Integrität der Trainingsdaten zu validieren.
kind: measure
regulation_anchor: EU AI Act Art. 10 (Datenqualität für Hochrisiko-KI)
related_quaidal_ids:
- QM-15
- QM-31
- QM-62
- QM-65
external_refs:
- framework: ISO/IEC 42001
citation: null
source:
framework: BSI QUAIDAL
section: MA-20
title_original_de: MA-20 Prozess Dokumentation
url: https://github.com/BSI-Bund/QUAIDAL/blob/main/0000_Markdown/0001_Criteria,Measurements,Metrics/0002_Maßnahmen/MA-20_ProcessDocumentation.md
commit_sha: c39b75369841b359c6bf56d6588e3768c722842f
license_note: § 5 UrhG anwendbar; share:true im Frontmatter; Clean-Room-Ableitung.
plagiarism_score_at_generation: 0.0
- id: MIT-AI-DATA-MA-21-compliance
canonical_name: Compliance
description: Der Einsatz von KI-Modellen erfordert eine zwingende Prüfung der Trainingsdatensätze
auf rechtliche Konformität und ethische Integrität, bevor diese zur Modellgenerierung
verwendet werden. Es ist sicherzustellen, dass alle verarbeiteten Informationen
die Vorgaben der DSGVO sowie branchenspezifische Regularien vollständig erfüllen
und keine unrechtmäßig beschafften oder personenbezogenen Daten ohne explizite
Einwilligung enthalten. Die Validierung dieser Datenqualität muss vor jedem Trainingslauf
durch einen automatisierten oder manuellen Compliance-Check nachgewiesen werden.
kind: measure
regulation_anchor: EU AI Act Art. 10 (Datenqualität für Hochrisiko-KI)
related_quaidal_ids:
- QM-12
- QM-15
external_refs:
- framework: EU GDPR
citation: null
- framework: AI Act
citation: null
source:
framework: BSI QUAIDAL
section: MA-21
title_original_de: MA-21 Compliance
url: https://github.com/BSI-Bund/QUAIDAL/blob/main/0000_Markdown/0001_Criteria,Measurements,Metrics/0002_Maßnahmen/MA-21_Compliance.md
commit_sha: c39b75369841b359c6bf56d6588e3768c722842f
license_note: § 5 UrhG anwendbar; share:true im Frontmatter; Clean-Room-Ableitung.
plagiarism_score_at_generation: 0.0
- id: MIT-AI-DATA-MA-22-vertrauenswuerdigkeit
canonical_name: Vertrauenswürdigkeit
description: Die Integrität und Zuverlässigkeit der für das KI-Training verwendeten
Datensätze ist im jeweiligen Anwendungskontext nachweislich zu verifizieren. Es
ist sicherzustellen, dass potenzielle Manipulationen oder unbeabsichtigte Korruptionen
des Datenflusses durch technische Prüfmechanismen ausgeschlossen werden. Bei der
Anwendung von Korrekturverfahren zur Datenbereinigung muss die ursprüngliche Glaubwürdigkeit
der Informationen gewahrt bleiben und darf nicht durch die Maßnahme beeinträchtigt
werden.
kind: measure
regulation_anchor: EU AI Act Art. 10 (Datenqualität für Hochrisiko-KI)
related_quaidal_ids:
- QM-15
- QM-43
- QM-65
external_refs: []
source:
framework: BSI QUAIDAL
section: MA-22
title_original_de: MA-22 Vertrauenswürdigkeit
url: https://github.com/BSI-Bund/QUAIDAL/blob/main/0000_Markdown/0001_Criteria,Measurements,Metrics/0002_Maßnahmen/MA-22_Credibility.md
commit_sha: c39b75369841b359c6bf56d6588e3768c722842f
license_note: § 5 UrhG anwendbar; share:true im Frontmatter; Clean-Room-Ableitung.
plagiarism_score_at_generation: 0.0
- id: MIT-AI-DATA-MA-23-merkmalsskalierung
canonical_name: Merkmalsskalierung
description: Für KI-Trainingsdatensätze ist eine Normalisierung der Merkmalswerte
auf einen einheitlichen Wertebereich zwingend erforderlich, um Dominanzeffekte
durch unterschiedliche Größenordnungen zu vermeiden. Diese Maßnahme stellt sicher,
dass Algorithmen, die auf Distanzberechnungen oder Gradientenverfahren basieren,
nicht durch skalenbedingte Verzerrungen beeinträchtigt werden. Die Wirksamkeit
der Skalierung ist vor dem Training systematisch zu prüfen, um die Vorhersagegenauigkeit
des Modells zu garantieren.
kind: measure
regulation_anchor: EU AI Act Art. 10 (Datenqualität für Hochrisiko-KI)
related_quaidal_ids:
- QM-10
- QM-56
external_refs: []
source:
framework: BSI QUAIDAL
section: MA-23
title_original_de: MA-23 Merkmalsskalierung
url: https://github.com/BSI-Bund/QUAIDAL/blob/main/0000_Markdown/0001_Criteria,Measurements,Metrics/0002_Maßnahmen/MA-23_FeatureScaling.md
commit_sha: c39b75369841b359c6bf56d6588e3768c722842f
license_note: § 5 UrhG anwendbar; share:true im Frontmatter; Clean-Room-Ableitung.
plagiarism_score_at_generation: 0.0
- id: MIT-AI-DATA-MA-24-merkmalserstellung
canonical_name: Merkmalserstellung
description: Es ist sicherzustellen, dass bei der Erstellung neuer Eingangsmerkmale
für KI-Modelle ausschließlich validierte Transformationsverfahren angewendet werden,
um die Datenqualität zu gewährleisten. Die Generierung neuer Features muss auf
nachvollziehbaren Algorithmen basieren, die eine signifikante Verbesserung der
Modellleistung gegenüber den Rohdaten nachweisen. Jede angewandte Methode zur
Datenanreicherung oder -bereinigung ist vor dem Training auf ihre Eignung zur
Mustererkennung und Vorhersagegenauigkeit zu prüfen.
kind: measure
regulation_anchor: EU AI Act Art. 10 (Datenqualität für Hochrisiko-KI)
related_quaidal_ids:
- QM-11
- QM-25
- QM-26
- QM-27
- QM-28
- QM-51
- QM-71
external_refs: []
source:
framework: BSI QUAIDAL
section: MA-24
title_original_de: MA-24 Merkmalserstellung
url: https://github.com/BSI-Bund/QUAIDAL/blob/main/0000_Markdown/0001_Criteria,Measurements,Metrics/0002_Maßnahmen/MA-24_FeatureCreation.md
commit_sha: c39b75369841b359c6bf56d6588e3768c722842f
license_note: § 5 UrhG anwendbar; share:true im Frontmatter; Clean-Room-Ableitung.
plagiarism_score_at_generation: 0.0
- id: MIT-AI-DATA-MA-25-differential-privacy
canonical_name: Differential Privacy
description: Das System muss bei der Verarbeitung von KI-Trainingsdaten differenzielle
Privatsphäre implementieren, indem statistisch signifikante, zufällige Störgrößen
zu den Ergebnissen hinzugefügt werden. Es ist sicherzustellen, dass die An- oder
Abwesenheit einzelner Datensätze im Trainingsset das Ausgabeergebnis nur marginal
beeinflusst. Es ist zu prüfen, dass durch diese Maßnahme keine Rückschlüsse auf spezifische
Personen aus den generierten Analysen gezogen werden können, während die allgemeine
Datenqualität für das Modelltraining erhalten bleibt.
kind: measure
regulation_anchor: EU AI Act Art. 10 (Datenqualität für Hochrisiko-KI)
related_quaidal_ids:
- QM-58
external_refs: []
source:
framework: BSI QUAIDAL
section: MA-25
title_original_de: MA-25 Differential Privacy
url: https://github.com/BSI-Bund/QUAIDAL/blob/main/0000_Markdown/0001_Criteria,Measurements,Metrics/0002_Maßnahmen/MA-25_Differential%20Privacy.md
commit_sha: c39b75369841b359c6bf56d6588e3768c722842f
license_note: § 5 UrhG anwendbar; share:true im Frontmatter; Clean-Room-Ableitung.
plagiarism_score_at_generation: 0.0625
- id: MIT-AI-DATA-MA-26-federated-learning
canonical_name: Federated Learning
description: For AI systems based on distributed data sources, a federated learning
approach is mandatory so that the raw data remains decentralized. The local models
must transmit only aggregated parameters to a central instance, while the original
training data never leaves the local environment. It must be verified that this
architecture does not centralize or transmit any sensitive information during the
learning process.
kind: measure
regulation_anchor: EU AI Act Art. 10 (Datenqualität für Hochrisiko-KI)
related_quaidal_ids:
- QM-63
external_refs: []
source:
framework: BSI QUAIDAL
section: MA-26
title_original_de: MA-26 Federated Learning
url: https://github.com/BSI-Bund/QUAIDAL/blob/main/0000_Markdown/0001_Criteria,Measurements,Metrics/0002_Maßnahmen/MA-26_Federated%20Learning%20Approach.md
commit_sha: c39b75369841b359c6bf56d6588e3768c722842f
license_note: § 5 UrhG anwendbar; share:true im Frontmatter; Clean-Room-Ableitung.
plagiarism_score_at_generation: 0.0
- id: MIT-AI-DATA-MA-27-statistische-grundlagenthemen
canonical_name: Statistische Grundlagenthemen
description: To ensure data quality across the AI lifecycle, basic statistical methods
must be systematically implemented and continuously validated. It must be ensured
that all relevant metrics for distribution analysis and data integrity are consistently
integrated into the computation pipelines. These fundamental analyses must serve
as overarching test criteria for model quality, independent of specific building
blocks.
kind: measure
regulation_anchor: EU AI Act Art. 10 (Datenqualität für Hochrisiko-KI)
related_quaidal_ids:
- QM-01
- QM-02
- QM-03
- QM-04
- QM-06
- QM-07
- QM-09
- QM-23
- QM-51
external_refs: []
source:
framework: BSI QUAIDAL
section: MA-27
title_original_de: MA-27 Statistische Grundlagenthemen
url: https://github.com/BSI-Bund/QUAIDAL/blob/main/0000_Markdown/0001_Criteria,Measurements,Metrics/0002_Maßnahmen/MA-27_StatisticalBasis.md
commit_sha: c39b75369841b359c6bf56d6588e3768c722842f
license_note: § 5 UrhG anwendbar; share:true im Frontmatter; Clean-Room-Ableitung.
plagiarism_score_at_generation: 0.0213
- id: MIT-AI-DATA-MA-28-diversitaetsindizes
canonical_name: Diversitätsindizes
description: The system must implement quantitative metrics that capture the heterogeneity
of AI training data in order to measure the distribution of different categories.
It must be ensured that these metrics reflect both the number of classes present
and how evenly they are distributed. Data quality is validated by computing diversity
indices that quantify statistical uncertainty or collision probabilities.
kind: measure
regulation_anchor: EU AI Act Art. 10 (Datenqualität für Hochrisiko-KI)
related_quaidal_ids:
- QM-68
external_refs: []
source:
framework: BSI QUAIDAL
section: MA-28
title_original_de: MA-28 Diversitätsindizes
url: https://github.com/BSI-Bund/QUAIDAL/blob/main/0000_Markdown/0001_Criteria,Measurements,Metrics/0002_Maßnahmen/MA-28_Diversity-Indices.md
commit_sha: c39b75369841b359c6bf56d6588e3768c722842f
license_note: § 5 UrhG anwendbar; share:true im Frontmatter; Clean-Room-Ableitung.
plagiarism_score_at_generation: 0.0
- id: MIT-AI-DATA-MA-29-data-splitting
canonical_name: Data-Splitting
description: Splitting AI training data into disjoint subsets is mandatory to ensure
an unbiased validation of model quality. At least three separate partitions must
be defined for training, hyperparameter optimization, and the final performance
evaluation. A random or stratified split must be ensured in order to rule out data
leakage between the phases and to verifiably test the generalization ability of
the system.
kind: measure
regulation_anchor: EU AI Act Art. 10 (Datenqualität für Hochrisiko-KI)
related_quaidal_ids:
- QM-69
external_refs: []
source:
framework: BSI QUAIDAL
section: MA-29
title_original_de: MA-29 Data-Splitting
url: https://github.com/BSI-Bund/QUAIDAL/blob/main/0000_Markdown/0001_Criteria,Measurements,Metrics/0002_Maßnahmen/MA-29_Data%20Splitting.md
commit_sha: c39b75369841b359c6bf56d6588e3768c722842f
license_note: § 5 UrhG anwendbar; share:true im Frontmatter; Clean-Room-Ableitung.
plagiarism_score_at_generation: 0.0
- id: MIT-AI-DATA-MA-30-fairness
canonical_name: Fairness
description: The system must ensure that AI training data does not exhibit systematic
bias with respect to sensitive demographic attributes, in order to avoid discriminatory
predictions. Where subgroups are insufficiently represented, preventive preprocessing
methods or algorithmic transformation methods for bias correction must be applied.
The effectiveness of these measures must be validated before model deployment using
quantitative tests against equal-treatment principles.
kind: measure
regulation_anchor: EU AI Act Art. 10 (Datenqualität für Hochrisiko-KI)
related_quaidal_ids:
- QM-57
external_refs: []
source:
framework: BSI QUAIDAL
section: MA-30
title_original_de: MA-30 Fairness
url: https://github.com/BSI-Bund/QUAIDAL/blob/main/0000_Markdown/0001_Criteria,Measurements,Metrics/0002_Maßnahmen/MA-30_Fairness.md
commit_sha: c39b75369841b359c6bf56d6588e3768c722842f
license_note: § 5 UrhG anwendbar; share:true im Frontmatter; Clean-Room-Ableitung.
plagiarism_score_at_generation: 0.0
@@ -165,21 +165,29 @@ def classify_source_regulation(source_regulation: str) -> str:
"""
     Classifies a source_regulation as law, guideline, or framework.
-    Uses exact matching against the map. For unknown sources the type
-    is guessed from keywords; the fallback is 'framework'
-    (the most conservative result).
+    Delegates to DB-backed RegulationRegistry (with 5min cache).
+    Falls back to SOURCE_REGULATION_CLASSIFICATION dict + heuristic
+    if DB is unavailable.
     """
     if not source_regulation:
         return SOURCE_TYPE_FRAMEWORK
-    # Exact match
+    # Try DB-backed registry first
+    try:
+        from services.regulation_registry import classify_source_regulation as _db_classify
+        result = _db_classify(source_regulation)
+        if result:
+            return result
+    except Exception:
+        pass
+    # Fallback: local dict
     if source_regulation in SOURCE_REGULATION_CLASSIFICATION:
         return SOURCE_REGULATION_CLASSIFICATION[source_regulation]
     # Heuristic for unknown sources
     lower = source_regulation.lower()
     # Detect laws
     law_indicators = [
         "verordnung", "richtlinie", "gesetz", "directive", "regulation",
         "(eu)", "(eg)", "act", "ley", "loi", "törvény", "código",
@@ -187,19 +195,16 @@ def classify_source_regulation(source_regulation: str) -> str:
     if any(ind in lower for ind in law_indicators):
         return SOURCE_TYPE_LAW
     # Detect guidelines
     guideline_indicators = [
         "edpb", "leitlinie", "guideline", "wp2", "bsi", "empfehlung",
     ]
     if any(ind in lower for ind in guideline_indicators):
         return SOURCE_TYPE_GUIDELINE
     # Detect frameworks
     framework_indicators = [
         "enisa", "nist", "owasp", "oecd", "cisa", "framework", "iso",
     ]
     if any(ind in lower for ind in framework_indicators):
         return SOURCE_TYPE_FRAMEWORK
     # Conservative: unknown = framework (least binding)
     return SOURCE_TYPE_FRAMEWORK
@@ -0,0 +1,83 @@
# License Rules of the Control Pipeline
> **As of:** 2026-05-21. Mapping finalized after DB inspection and IACE audit.
>
> The pipeline classifies every regulation (and therefore every chunk extracted
> from it and every atomic_control) into one of **three license rules**. The rule
> determines whether the full text may be retained and which attribution is
> mandatory in the output renderer.
## The three rules
| Rule | Meaning | Store full text? | Attribution required? | Examples |
|------|---------|------------------|-----------------------|----------|
| **1** | Verbatim: sovereign law / public domain | ✓ | no (recommended for audits) | EU law (EUR-Lex), German federal law, autonomous statutory law (DGUV UVV), TRBS, TRGS, ASR, US Federal Code (OSHA), NIST SP, EU guidance documents |
| **2** | Verbatim with attribution: free licenses | ✓ | **yes** | OWASP (CC-BY-SA-4.0), OECD AI Principles (OECD_PUBLIC), ENISA documents (CC-BY-4.0), Apache-2.0 works |
| **3** | Cite only: proprietary standards | ✗ | not applicable (no full text) | DIN, EN, ISO, ANSI, UL, IEC, IEEE, DGUV Regeln/Informationen/Grundsätze, Bitkom guides, BSI building blocks (copyrighted) |
**Important clarification:** Rule 3 means "cite only the identifier/section", **not** "rephrase". The original label "reformulate" was misleading. For Rule 3 sources the pipeline must not store the full text; it retains only the source reference (regulation_id + article/paragraph), and the output renderer displays that reference in the frontend/PDF.
## Mapping `license_type` → `license_rule`
| license_type | license_rule | Explanation |
|---|---|---|
| `EU_LAW`, `EU_PUBLIC` | 1 | EU regulations, directives, OJ publications, EU guidance documents |
| `DE_LAW`, `DE_PUBLIC` | 1 | German federal laws, TRBS, TRGS, ASR, DGUV UVV (autonomous statutory law) |
| `AT_LAW`, `CH_LAW`, `FR_LAW`, `IT_LAW`, `ES_LAW`, `NL_LAW`, `HU_LAW` | 1 | Law of other European countries (EU member states plus Switzerland) |
| `US_GOV_PUBLIC`, `NIST_PUBLIC_DOMAIN`, `OSHA_PUBLIC` | 1 | US Federal Code (public domain under 17 U.S.C. §105) |
| `CC-BY-4.0`, `CC-BY-SA-4.0`, `CC-BY-3.0`, `CC-BY-SA-3.0` | 2 | Creative Commons with attribution requirement |
| `Apache-2.0`, `MIT` | 2 | Permissive OSS licenses, notice requirement |
| `OECD_PUBLIC`, `ENISA_CC_BY_4.0` | 2 | Publications by public bodies with an attribution requirement |
| `DIN_COPYRIGHT`, `ISO_COPYRIGHT`, `ANSI_COPYRIGHT`, `UL_COPYRIGHT`, `IEC_COPYRIGHT` | 3 | Standards organizations: identifier citation only |
| `DGUV_COPYRIGHT` | 3 | DGUV Regeln/Informationen/Grundsätze (not UVV) |
| `BITKOM_COPYRIGHT`, `BSI_COPYRIGHT`, `VDMA_COPYRIGHT` | 3 | Association and agency publications under their own copyright |
| `OWN_WORK` | 3 | BreakPilot's own texts (templates, own patterns): no external license risk, but no public-domain status either |
**Special case DGUV:** The class splits by publication type:
- DGUV **Vorschriften / UVV** → `DE_LAW` → Rule 1
- DGUV **Regeln, Informationen, Grundsätze** → `DGUV_COPYRIGHT` → Rule 3
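The mapping above is a pure lookup, so the rule can be derived without any heuristics. A minimal sketch, assuming a plain in-memory dict; `LICENSE_RULE_BY_TYPE`, `_rule` and `derive_license_rule` are hypothetical names for illustration and not taken from `services/regulation_registry.py`. Only the `license_type` values come from the table.

```python
from typing import Dict


def _rule(rule: int, *license_types: str) -> Dict[str, int]:
    """Helper (hypothetical): map several license types to one rule."""
    return {license_type: rule for license_type in license_types}


# Hypothetical lookup table mirroring the mapping table above.
LICENSE_RULE_BY_TYPE: Dict[str, int] = {
    **_rule(1, "EU_LAW", "EU_PUBLIC", "DE_LAW", "DE_PUBLIC",
            "AT_LAW", "CH_LAW", "FR_LAW", "IT_LAW", "ES_LAW", "NL_LAW", "HU_LAW",
            "US_GOV_PUBLIC", "NIST_PUBLIC_DOMAIN", "OSHA_PUBLIC"),
    **_rule(2, "CC-BY-4.0", "CC-BY-SA-4.0", "CC-BY-3.0", "CC-BY-SA-3.0",
            "Apache-2.0", "MIT", "OECD_PUBLIC", "ENISA_CC_BY_4.0"),
    **_rule(3, "DIN_COPYRIGHT", "ISO_COPYRIGHT", "ANSI_COPYRIGHT", "UL_COPYRIGHT",
            "IEC_COPYRIGHT", "DGUV_COPYRIGHT", "BITKOM_COPYRIGHT", "BSI_COPYRIGHT",
            "VDMA_COPYRIGHT", "OWN_WORK"),
}


def derive_license_rule(license_type: str) -> int:
    """Return rule 1, 2 or 3; unknown types fail loudly instead of defaulting."""
    try:
        return LICENSE_RULE_BY_TYPE[license_type]
    except KeyError:
        raise ValueError(
            f"unknown license_type {license_type!r}; add it to the mapping first"
        ) from None
```

Raising on unknown types (rather than falling back to a default rule) is a design choice of this sketch; it keeps an unclassified source from silently being treated as public domain.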
## Effect per pipeline stage
| Stage | Behavior under Rule 1 | Rule 2 | Rule 3 |
|---|---|---|---|
| Stage 6 ControlCompose (`pipeline_adapter.py:147`) | stores `chunk_text` | stores `chunk_text` | stores `chunk_text = None` |
| Atomic control creation | full text as source | full text + attribution note | only regulation_id + article |
| Output renderer (frontend/PDF) | optional source note | **mandatory attribution in footer + inline** | render identifier only |
| Tech file appendix | name the source | source + license URL | identifier list |
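The storage row of the table can be sketched as a small function. This is an illustration under assumptions, not the code at `pipeline_adapter.py:147`: `StoredChunk` and `compose_stored_chunk` are hypothetical names, and the only behavior taken from the table is that Rule 3 stores `chunk_text = None` while Rules 1 and 2 keep the text.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class StoredChunk:
    """Hypothetical record of what the pipeline persists for one chunk."""
    regulation_id: str
    article: str
    chunk_text: Optional[str]
    attribution: Optional[str] = None


def compose_stored_chunk(rule: int, regulation_id: str, article: str,
                         chunk_text: str,
                         attribution: Optional[str] = None) -> StoredChunk:
    if rule == 1:
        # Verbatim text may be kept; attribution is optional.
        return StoredChunk(regulation_id, article, chunk_text, attribution)
    if rule == 2:
        # Verbatim text may be kept, but only together with an attribution.
        if not attribution:
            raise ValueError("Rule 2 requires an attribution text")
        return StoredChunk(regulation_id, article, chunk_text, attribution)
    if rule == 3:
        # Only the reference survives; the full text is dropped.
        return StoredChunk(regulation_id, article, None)
    raise ValueError(f"license_rule must be 1, 2 or 3, got {rule!r}")
```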
## Sources without classification
`regulation_registry` currently holds **232 classified regulations** (as of 2026-05-21). The following still need to be added (Task #20 covers the DGUV ingest):
| Source | Rule | Rationale |
|---|---|---|
| TRBS family (24 PDFs in the RAG) | 1 | Technische Regeln Betriebssicherheit (technical rules for operational safety), BAuA Bundesarbeitsblatt |
| TRGS family (all full-text chunks) | 1 | Technische Regeln Gefahrstoffe (technical rules for hazardous substances), BAuA |
| ASR family (17 PDFs) | 1 | Arbeitsstättenregeln (technical rules for workplaces), BAuA |
| OSHA 29 CFR 1910 Subpart O + Technical Manual | 1 | US federal public domain (17 U.S.C. §105) |
| DGUV Vorschrift 1 + UVV family (once ingested) | 1 | Autonomous statutory law of the BG |
| DGUV Regel 100-500 + Information 209-072/074/073 | 3 | DGUV copyright, identifier only |
| DIN identifier table (without full text) | 3 | DIN/Beuth copyright |
| ANSI B11.0 + RIA R15.06 + UL 508A identifiers | 3 | ANSI/UL copyright |
| ISO 12100/13849/13857 identifiers | 3 | ISO copyright |
## Audit obligations
Before every ingest of new sources:
1. Check the license (publikationen.dguv.de, EUR-Lex, etc.)
2. Choose the license_type from the table above; if it is missing, add it here
3. license_rule is derived deterministically from it
4. The attribution text is a mandatory field for Rule 2
Before every output:
- If an atomic_control comes from a Rule 3 source: verify that ONLY the identifier is shown, never the full text
- If it comes from a Rule 2 source: the attribution must be present in the PDF footer and in the frontend tooltip
- If it comes from a Rule 1 source: naming the source is recommended for auditability
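The pre-output checks could be automated along these lines. This is a rough sketch with a deliberately naive substring comparison; `output_violations` and its parameters are hypothetical, and a real check would likely need fuzzy matching to catch partially quoted text.

```python
from typing import List, Optional


def output_violations(rule: int, rendered: str, full_text: str,
                      attribution: Optional[str] = None) -> List[str]:
    """Return human-readable violations for one rendered control (empty = ok)."""
    problems: List[str] = []
    if rule == 3 and full_text and full_text in rendered:
        # Rule 3: only the identifier may appear, never the source text.
        problems.append("rule 3: full text must not be rendered")
    if rule == 2 and (not attribution or attribution not in rendered):
        # Rule 2: the attribution has to be visible in the output.
        problems.append("rule 2: attribution missing from output")
    # Rule 1: naming the source is only recommended, so nothing is enforced.
    return problems
```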
## References
- Schema: `migrations/002_regulation_registry.sql`
- Code: `services/regulation_registry.py`, `services/pipeline_adapter.py`
- Seed-Script: `scripts/f1_migrate_regulation_registry.py`
- Tests: `tests/test_regulation_registry.py` (assert: rule IN (1,2,3))
@@ -0,0 +1,101 @@
# Incremental BatchDedup for late-added documents
Introduced on 2026-05-18. Pattern for all future single-document ingestions.
## Problem
The default BatchDedup runner ran against ALL `pass0b` atomics without a filter
(WHERE decomposition_method = 'pass0b' AND release_state NOT IN ('deprecated','duplicate')).
For us that is ~172k controls. At a pace of ~5k/h this means 25-40h of runtime. Every
added document triggered the same full run, even if the new document
produces only 1-2k atomics.
Additional risk: Phase 1 writes master_controls only at the end. A
container crash mid-run (e.g. caused by a Qdrant timeout) discards 100%
of the in-memory progress.
## Solution: the `since` parameter
`POST /v1/canonical/generate/batch-dedup` now accepts:
```json
{
"dry_run": false,
"since": "2026-05-18T02:53:00+00:00"
}
```
Effect:
- Phase 1 (intra-group dedup) loads only controls with `created_at >= since`
- Phase 2 (cross-group dedup) also filters on `created_at >= since`
- The Phase 2 checkpoint is deleted before the run starts (otherwise a stale
`last_control_id` would skip newly created atomics whose control_id sorts
alphabetically before it)
Phase 2 still searches the **full** Qdrant index `atomic_controls_dedup`,
so it finds matches against old master controls and links them correctly.
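The scoping described above amounts to one optional predicate on the existing filter. A minimal sketch, assuming the atomics live in a table called `canonical_controls` and that queries use `:name` style bind parameters; both are assumptions for illustration, as is the function name `build_atomics_query`. The actual queries in `services/batch_dedup_runner.py` may look different.

```python
from datetime import datetime
from typing import Any, Dict, Optional, Tuple

# Filter quoted in the Problem section; the table name is an assumption.
BASE_QUERY = (
    "SELECT control_id FROM canonical_controls "
    "WHERE decomposition_method = 'pass0b' "
    "AND release_state NOT IN ('deprecated', 'duplicate')"
)


def build_atomics_query(since: Optional[datetime] = None) -> Tuple[str, Dict[str, Any]]:
    """Return (sql, params); without `since` the query equals the full run."""
    sql = BASE_QUERY
    params: Dict[str, Any] = {}
    if since is not None:
        if since.tzinfo is None:
            # created_at is assumed to be timestamptz, so naive datetimes are rejected.
            raise ValueError("since must be timezone-aware")
        sql += " AND created_at >= :since"
        params["since"] = since
    return sql, params
```

Passing `since` as a bind parameter instead of formatting it into the SQL string keeps the request value out of the query text.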
## When to use
| Scenario | Recommendation |
|---|---|
| Single new document ingested + Pass 0a + Pass 0b completed | set `since` to a time before Pass 0b |
| Several small updates since the last full dedup | set `since` to a time after the last full dedup |
| Initial setup or major pipeline update | NO `since`: full run |
| Suspected drift / quality regression | NO `since`: full run |
## Workflow after single-document ingestion
```bash
# 1. Pass 0a on new controls (extract obligations)
curl -X POST .../v1/canonical/generate/run-pass0a -d '{...}'
# 2. Pass 0b decomposition submit (create atomics)
curl -X POST .../v1/canonical/generate/submit-pass0b -d '{...}'
# 3. Once the Anthropic batch has finished: process-batch
curl -X POST .../v1/canonical/generate/process-batch -d '{
"batch_id": "msgbatch_...",
"pass_type": "0b"
}'
# 4. Deduplicate incrementally (NEW, instead of a 25h full run)
curl -X POST .../v1/canonical/generate/batch-dedup -d '{
"dry_run": false,
"since": "<ISO datetime shortly before Pass 0b start>"
}'
```
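For step 4, the `since` value has to be a timezone-aware ISO datetime slightly before the Pass 0b start. A small sketch of how the request body could be built; the helper name `batch_dedup_payload` and the five-minute safety margin are assumptions, only the two JSON fields come from the request shown earlier.

```python
import json
from datetime import datetime, timedelta, timezone


def batch_dedup_payload(pass0b_start: datetime,
                        safety_margin: timedelta = timedelta(minutes=5),
                        dry_run: bool = False) -> str:
    """Build the JSON body for the batch-dedup endpoint (hypothetical helper)."""
    if pass0b_start.tzinfo is None:
        raise ValueError("pass0b_start must be timezone-aware")
    # Move the cutoff slightly before Pass 0b so no new atomic is missed.
    since = (pass0b_start - safety_margin).astimezone(timezone.utc)
    return json.dumps({
        "dry_run": dry_run,
        "since": since.isoformat(timespec="seconds"),
    })


start = datetime(2026, 5, 18, 2, 58, tzinfo=timezone.utc)
print(batch_dedup_payload(start))
# {"dry_run": false, "since": "2026-05-18T02:53:00+00:00"}
```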
## Pace observation (CRA run 2026-05-18)
- Total new atomics: 19,423
- Phase 1 multi-groups: 568 (the remaining 18,101 are singletons → promoted directly to master)
- Phase 2 cross-group: ~3-4h expected
- Comparison: a full run would have taken 25-40h, so the scoped run is 6-13x faster.
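The figures can be cross-checked with a few lines of arithmetic. The sketch below takes the expected 3-4h of Phase 2 as the scoped runtime (Phase 1 is treated as negligible) and the 25-40h range as the full runtime; the function names are illustrative only.

```python
from typing import Tuple


def full_run_hours(total_controls: int, pace_per_hour: int) -> float:
    """Nominal runtime of a full run at a constant pace."""
    return total_controls / pace_per_hour


def speedup_bounds(full_hours: Tuple[float, float],
                   scoped_hours: Tuple[float, float]) -> Tuple[float, float]:
    """Worst-case and best-case speedup from two (low, high) runtime ranges."""
    full_lo, full_hi = full_hours
    scoped_lo, scoped_hi = scoped_hours
    return full_lo / scoped_hi, full_hi / scoped_lo


print(round(full_run_hours(172_000, 5_000), 1))  # 34.4
low, high = speedup_bounds((25, 40), (3, 4))
print(round(low, 2), round(high, 2))  # 6.25 13.33
```

The nominal 34.4h sits inside the quoted 25-40h range, and the bounds match the stated 6-13x.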
## Implementation details (for maintenance)
Changed files:
- `services/batch_dedup_runner.py`: `run()` + `_load_merge_groups()` +
`_run_cross_group_pass()` SQL queries
- `api/control_generator_routes.py`: `BatchDedupRequest.since` field +
handler passes it through
Backward compatible: without `since`, the behavior is equivalent to the old one.
## Known limits
1. **The Phase 2 checkpoint is deleted during a scoped run.** If a full run
has to be squeezed in during a `since` run (which should not happen),
that full run has to start from scratch.
2. **Phase 1 commit granularity is untouched.** A crash in the middle of
Phase 1 without `since` still loses the same amount of work. However, a
scoped Phase 1 is so short (minutes) that this hardly matters in practice.
3. **Singleton atomics become masters directly without a cross-check.** If a
new singleton atomic is semantically identical to an old master, only
Phase 2 catches it (via Qdrant). This works as long as Phase 2 is not
skipped (dry_run=false is mandatory).
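Limit 1 follows from how a checkpoint interacts with ordering. A toy illustration, assuming Phase 2 processes control IDs in sorted order and resumes with `control_id > last_control_id`; that resume rule and the sample IDs are assumptions made for this example, not taken from the runner.

```python
from typing import List, Optional


def resume_from_checkpoint(control_ids: List[str],
                           last_control_id: Optional[str]) -> List[str]:
    """Return the IDs a resumed pass would still process (assumed semantics)."""
    ordered = sorted(control_ids)
    if last_control_id is None:
        return ordered
    return [cid for cid in ordered if cid > last_control_id]


# Hypothetical IDs of atomics created by a new document.
new_atomics = ["AUTH-0001-A01", "CRA-0007-A02", "NET-0100-A01"]

# A stale checkpoint from an earlier run hides everything sorting before it.
print(resume_from_checkpoint(new_atomics, "DATA-9999-A99"))  # ['NET-0100-A01']

# With the checkpoint deleted, all new atomics are processed.
print(resume_from_checkpoint(new_atomics, None))
# ['AUTH-0001-A01', 'CRA-0007-A02', 'NET-0100-A01']
```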
## Memory entry
See `~/.claude/projects/-Users-benjaminadmin-Projekte-breakpilot-core/memory/feedback_incremental_dedup.md`
@@ -0,0 +1,72 @@
-- Migration 002: Regulation Registry (Block F1)
-- Schema: compliance
-- Run: ssh macmini "docker exec -i bp-core-postgres psql -U breakpilot -d breakpilot_db" < control-pipeline/migrations/002_regulation_registry.sql
SET search_path TO compliance, public;
-- ========================================
-- regulation_registry
-- ========================================
-- Central registry for all regulations, laws, guidelines, and frameworks
-- referenced by the control pipeline. Replaces hardcoded Python dicts
-- (REGULATION_LICENSE_MAP, SOURCE_REGULATION_CLASSIFICATION).
CREATE TABLE IF NOT EXISTS regulation_registry (
id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
-- regulation_id: machine key (e.g. "eu_2016_679", "nist_sp_800_53")
regulation_id VARCHAR(100) UNIQUE NOT NULL,
-- Display names
regulation_name_de TEXT,
regulation_name_en TEXT,
regulation_short VARCHAR(50),
-- License classification (3-rule system)
license_rule INTEGER NOT NULL DEFAULT 1
CHECK (license_rule IN (1, 2, 3)),
license_type VARCHAR(50), -- EU_LAW, DE_LAW, CC-BY-SA-4.0, etc.
attribution TEXT, -- Required for Rule 2 (CC-BY)
-- Source classification
source_type VARCHAR(20) NOT NULL DEFAULT 'law'
CHECK (source_type IN ('law', 'guideline', 'standard', 'framework', 'restricted')),
-- Metadata
jurisdiction VARCHAR(10), -- DE, EU, AT, CH, US, FR, ES, NL, IT, HU, INT
category VARCHAR(50),
celex VARCHAR(30), -- EU CELEX number if applicable
url TEXT,
-- Lifecycle
status VARCHAR(20) NOT NULL DEFAULT 'active'
CHECK (status IN ('active', 'needs_review', 'deprecated')),
created_at TIMESTAMPTZ DEFAULT NOW(),
updated_at TIMESTAMPTZ DEFAULT NOW()
);
-- Indexes
CREATE INDEX IF NOT EXISTS idx_reg_registry_status
ON regulation_registry(status);
CREATE INDEX IF NOT EXISTS idx_reg_registry_jurisdiction
ON regulation_registry(jurisdiction);
CREATE INDEX IF NOT EXISTS idx_reg_registry_source_type
ON regulation_registry(source_type);
CREATE INDEX IF NOT EXISTS idx_reg_registry_license_rule
ON regulation_registry(license_rule);
-- Updated-at trigger
CREATE OR REPLACE FUNCTION update_regulation_registry_updated_at()
RETURNS TRIGGER AS $$
BEGIN
NEW.updated_at = NOW();
RETURN NEW;
END;
$$ LANGUAGE plpgsql;
DROP TRIGGER IF EXISTS trg_regulation_registry_updated_at ON regulation_registry;
CREATE TRIGGER trg_regulation_registry_updated_at
BEFORE UPDATE ON regulation_registry
FOR EACH ROW
EXECUTE FUNCTION update_regulation_registry_updated_at();
@@ -0,0 +1,58 @@
-- Migration 003: Action & Object Ontology (Block F2+F3)
-- Schema: compliance
-- Run: ssh macmini "docker exec -i bp-core-postgres psql -U breakpilot -d breakpilot_db" < control-pipeline/migrations/003_action_object_ontology.sql
SET search_path TO compliance, public;
-- ========================================
-- action_types — 34 canonical action verbs
-- ========================================
CREATE TABLE IF NOT EXISTS action_types (
id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
canonical_name VARCHAR(50) UNIQUE NOT NULL,
phase VARCHAR(30) NOT NULL,
description_de TEXT,
description_en TEXT,
created_at TIMESTAMPTZ DEFAULT NOW(),
updated_at TIMESTAMPTZ DEFAULT NOW()
);
CREATE INDEX IF NOT EXISTS idx_action_types_phase ON action_types(phase);
-- ========================================
-- action_synonyms — German aliases + negative patterns
-- ========================================
CREATE TABLE IF NOT EXISTS action_synonyms (
id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
canonical_action VARCHAR(50) NOT NULL REFERENCES action_types(canonical_name),
synonym VARCHAR(100) NOT NULL,
language VARCHAR(5) NOT NULL DEFAULT 'de',
source VARCHAR(20) NOT NULL DEFAULT 'manual'
CHECK (source IN ('manual', 'llm', 'migration')),
pattern_type VARCHAR(20) NOT NULL DEFAULT 'alias'
CHECK (pattern_type IN ('alias', 'negative_pattern')),
created_at TIMESTAMPTZ DEFAULT NOW(),
UNIQUE(synonym, language, pattern_type)
);
CREATE INDEX IF NOT EXISTS idx_action_synonyms_canonical ON action_synonyms(canonical_action);
CREATE INDEX IF NOT EXISTS idx_action_synonyms_pattern_type ON action_synonyms(pattern_type);
-- ========================================
-- object_synonyms — normalized object tokens
-- ========================================
CREATE TABLE IF NOT EXISTS object_synonyms (
id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
canonical_token VARCHAR(100) NOT NULL,
synonym VARCHAR(200) NOT NULL,
language VARCHAR(5) NOT NULL DEFAULT 'de',
source VARCHAR(20) NOT NULL DEFAULT 'manual'
CHECK (source IN ('manual', 'llm', 'migration')),
created_at TIMESTAMPTZ DEFAULT NOW(),
UNIQUE(synonym, language)
);
CREATE INDEX IF NOT EXISTS idx_object_synonyms_canonical ON object_synonyms(canonical_token);
@@ -0,0 +1,18 @@
-- Migration 004: Object Groups (G-pre1)
-- Schema: compliance
-- Run: ssh macmini "docker exec -i bp-core-postgres psql -U breakpilot -d breakpilot_db" < control-pipeline/migrations/004_object_groups.sql
SET search_path TO compliance, public;
CREATE TABLE IF NOT EXISTS object_groups (
id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
group_id INTEGER NOT NULL,
canonical_name VARCHAR(200) NOT NULL,
member_count INTEGER DEFAULT 0,
members JSONB DEFAULT '[]',
top_controls_count INTEGER DEFAULT 0,
created_at TIMESTAMPTZ DEFAULT NOW()
);
CREATE INDEX IF NOT EXISTS idx_object_groups_group_id ON object_groups(group_id);
CREATE INDEX IF NOT EXISTS idx_object_groups_canonical ON object_groups(canonical_name);
@@ -0,0 +1,30 @@
-- Migration 005: Master Controls (G-pre2)
-- Schema: compliance
-- Run: ssh macmini "docker exec -i bp-core-postgres psql -U breakpilot -d breakpilot_db" < control-pipeline/migrations/005_master_controls.sql
SET search_path TO compliance, public;
CREATE TABLE IF NOT EXISTS master_controls (
id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
master_control_id VARCHAR(50) UNIQUE NOT NULL,
object_group_id INTEGER NOT NULL,
canonical_name VARCHAR(200) NOT NULL,
phases_covered JSONB NOT NULL DEFAULT '[]',
phase_control_count JSONB NOT NULL DEFAULT '{}',
total_controls INTEGER DEFAULT 0,
created_at TIMESTAMPTZ DEFAULT NOW()
);
CREATE INDEX IF NOT EXISTS idx_master_controls_group ON master_controls(object_group_id);
CREATE TABLE IF NOT EXISTS master_control_members (
id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
master_control_uuid UUID NOT NULL REFERENCES master_controls(id) ON DELETE CASCADE,
control_uuid UUID NOT NULL,
phase VARCHAR(50) NOT NULL,
action VARCHAR(50) NOT NULL,
created_at TIMESTAMPTZ DEFAULT NOW()
);
CREATE INDEX IF NOT EXISTS idx_mc_members_master ON master_control_members(master_control_uuid);
CREATE INDEX IF NOT EXISTS idx_mc_members_control ON master_control_members(control_uuid);
@@ -0,0 +1,58 @@
-- Migration 006: Decision Traces (G1)
-- Schema: compliance
-- Run: ssh macmini "docker exec -i bp-core-postgres psql -U breakpilot -d breakpilot_db" < control-pipeline/migrations/006_decision_traces.sql
SET search_path TO compliance, public;
CREATE TABLE IF NOT EXISTS decision_traces (
id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
control_uuid UUID NOT NULL,
regulation_id VARCHAR(100),
obligation_id VARCHAR(100),
-- Decision
status VARCHAR(30) NOT NULL DEFAULT 'not_assessed'
CHECK (status IN ('not_assessed', 'compliant', 'partially_compliant',
'not_compliant', 'not_applicable', 'under_remediation')),
decision_reason TEXT,
decided_by VARCHAR(100),
decided_at TIMESTAMPTZ,
-- Fix/Remediation
fix_strategy TEXT,
fix_owner VARCHAR(100),
fix_target_date DATE,
fix_completed_date DATE,
-- Evidence
evidence_ids JSONB DEFAULT '[]',
confidence NUMERIC(3,2) DEFAULT 0.0,
-- Multi-tenant
tenant_id UUID,
project_id UUID,
metadata JSONB DEFAULT '{}',
created_at TIMESTAMPTZ DEFAULT NOW(),
updated_at TIMESTAMPTZ DEFAULT NOW()
);
CREATE INDEX IF NOT EXISTS idx_dt_control ON decision_traces(control_uuid);
CREATE INDEX IF NOT EXISTS idx_dt_status ON decision_traces(status);
CREATE INDEX IF NOT EXISTS idx_dt_tenant ON decision_traces(tenant_id);
CREATE INDEX IF NOT EXISTS idx_dt_decided_at ON decision_traces(decided_at);
-- Updated-at trigger
CREATE OR REPLACE FUNCTION update_decision_traces_updated_at()
RETURNS TRIGGER AS $$
BEGIN
NEW.updated_at = NOW();
RETURN NEW;
END;
$$ LANGUAGE plpgsql;
DROP TRIGGER IF EXISTS trg_decision_traces_updated_at ON decision_traces;
CREATE TRIGGER trg_decision_traces_updated_at
BEFORE UPDATE ON decision_traces
FOR EACH ROW
EXECUTE FUNCTION update_decision_traces_updated_at();
@@ -0,0 +1,38 @@
-- Migration 007: Compliance Commit Ledger (G2)
-- Schema: compliance
-- Run: ssh macmini "docker exec -i bp-core-postgres psql -U breakpilot -d breakpilot_db" < control-pipeline/migrations/007_compliance_commits.sql
SET search_path TO compliance, public;
CREATE TABLE IF NOT EXISTS compliance_commits (
id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
tenant_id UUID NOT NULL,
project_id UUID,
-- Git Info
commit_hash VARCHAR(64) NOT NULL,
commit_message TEXT,
commit_author VARCHAR(200),
commit_date TIMESTAMPTZ,
branch VARCHAR(200),
repo_url TEXT,
-- Affected Controls
affected_control_ids JSONB NOT NULL DEFAULT '[]',
affected_files JSONB DEFAULT '[]',
-- Analysis
risk_level VARCHAR(20) DEFAULT 'low'
CHECK (risk_level IN ('low', 'medium', 'high', 'critical')),
analysis_summary TEXT,
analysis_metadata JSONB DEFAULT '{}',
created_at TIMESTAMPTZ DEFAULT NOW()
);
CREATE INDEX IF NOT EXISTS idx_cc_tenant ON compliance_commits(tenant_id);
CREATE INDEX IF NOT EXISTS idx_cc_hash ON compliance_commits(commit_hash);
CREATE INDEX IF NOT EXISTS idx_cc_date ON compliance_commits(commit_date);
CREATE INDEX IF NOT EXISTS idx_cc_risk ON compliance_commits(risk_level);
-- GIN index for JSONB array containment queries (@>)
CREATE INDEX IF NOT EXISTS idx_cc_control_ids ON compliance_commits USING GIN (affected_control_ids);
@@ -0,0 +1,37 @@
-- Migration 008: Decision Events / Full Decision Memory (G3)
-- Schema: compliance
-- Run: ssh macmini "docker exec -i bp-core-postgres psql -U breakpilot -d breakpilot_db" < control-pipeline/migrations/008_decision_events.sql
SET search_path TO compliance, public;
CREATE TABLE IF NOT EXISTS decision_events (
id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
decision_trace_id UUID REFERENCES decision_traces(id) ON DELETE SET NULL,
control_uuid UUID NOT NULL,
tenant_id UUID,
-- Event type
event_type VARCHAR(30) NOT NULL
CHECK (event_type IN (
'assessment', 'decision', 'fix_planned', 'fix_started',
'fix_completed', 'verification', 'failure', 'exception', 'escalation'
)),
-- State before/after
input_state JSONB DEFAULT '{}',
output_state JSONB DEFAULT '{}',
-- Details
summary TEXT,
actor VARCHAR(200),
evidence_ids JSONB DEFAULT '[]',
metadata JSONB DEFAULT '{}',
created_at TIMESTAMPTZ DEFAULT NOW()
);
CREATE INDEX IF NOT EXISTS idx_de_control ON decision_events(control_uuid);
CREATE INDEX IF NOT EXISTS idx_de_trace ON decision_events(decision_trace_id);
CREATE INDEX IF NOT EXISTS idx_de_tenant ON decision_events(tenant_id);
CREATE INDEX IF NOT EXISTS idx_de_type ON decision_events(event_type);
CREATE INDEX IF NOT EXISTS idx_de_created ON decision_events(created_at);
@@ -0,0 +1,38 @@
-- Migration 009: Deployment Checks / Pre-Deployment Enforcement (G4)
-- Schema: compliance
-- Run: ssh macmini "docker exec -i bp-core-postgres psql -U breakpilot -d breakpilot_db" < control-pipeline/migrations/009_deployment_checks.sql
SET search_path TO compliance, public;
CREATE TABLE IF NOT EXISTS deployment_checks (
id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
tenant_id UUID NOT NULL,
-- Deploy Info
commit_hash VARCHAR(64) NOT NULL,
branch VARCHAR(200),
environment VARCHAR(50) DEFAULT 'production',
-- Result
verdict VARCHAR(20) NOT NULL DEFAULT 'pending'
CHECK (verdict IN ('pending', 'approved', 'blocked', 'override')),
-- Impact
affected_control_ids JSONB DEFAULT '[]',
blocking_controls JSONB DEFAULT '[]',
warning_controls JSONB DEFAULT '[]',
risk_score NUMERIC(5,2) DEFAULT 0.0,
-- Override
override_by VARCHAR(200),
override_reason TEXT,
summary TEXT,
metadata JSONB DEFAULT '{}',
created_at TIMESTAMPTZ DEFAULT NOW()
);
CREATE INDEX IF NOT EXISTS idx_dc_tenant ON deployment_checks(tenant_id);
CREATE INDEX IF NOT EXISTS idx_dc_hash ON deployment_checks(commit_hash);
CREATE INDEX IF NOT EXISTS idx_dc_verdict ON deployment_checks(verdict);
CREATE INDEX IF NOT EXISTS idx_dc_created ON deployment_checks(created_at);
@@ -0,0 +1,162 @@
-- Migration 010: Expanded Object Ontology
-- Expands from 31 to ~180 canonical object tokens with clear semantic boundaries.
-- Each token has a description to prevent ambiguous classification.
--
-- IMPORTANT: This migration ADDS new tokens. Existing synonyms are preserved.
SET search_path TO compliance, public;
-- Add description column to object_synonyms if not exists
DO $$ BEGIN
ALTER TABLE object_synonyms ADD COLUMN IF NOT EXISTS description TEXT;
EXCEPTION WHEN duplicate_column THEN NULL;
END $$;
-- New table: canonical object definitions with clear boundaries
CREATE TABLE IF NOT EXISTS object_ontology (
canonical_token VARCHAR(100) PRIMARY KEY,
category VARCHAR(50) NOT NULL, -- security, data_protection, governance, regulatory, technical
description_de TEXT NOT NULL, -- German description for LLM prompts
description_en TEXT NOT NULL, -- English description
NOT_confused_with TEXT, -- Explicit disambiguation
examples TEXT, -- Example controls that belong here
created_at TIMESTAMPTZ DEFAULT NOW()
);
-- ═══════════════════════════════════════════════════════════════
-- SECURITY & TECHNICAL
-- ═══════════════════════════════════════════════════════════════
-- Authentication & Identity
INSERT INTO object_ontology VALUES
('multi_factor_auth', 'security', 'Multi-Faktor-Authentifizierung (2FA/MFA)', 'Multi-factor authentication', 'NOT password_policy (Passwortregeln) oder session_management (Sitzungen)', 'MFA implementieren, 2FA-Pflicht, Authentifizierungsfaktoren'),
('password_policy', 'security', 'Passwortrichtlinien und -komplexität', 'Password policies and complexity', 'NOT credentials (allg. Zugangsdaten) oder multi_factor_auth (MFA)', 'Passwortlänge, Komplexität, Rotation, Passwort-Historie'),
('credentials', 'security', 'Zugangsdaten-Verwaltung (Tokens, API-Keys, Secrets)', 'Credential management', 'NOT password_policy (Passwortregeln) oder key_management (kryptografisch)', 'API-Key-Rotation, Token-Verwaltung, Secret Storage'),
('session_management', 'security', 'Sitzungsverwaltung (Session Timeout, Token-Lifecycle)', 'Session management', 'NOT multi_factor_auth (Login) oder access_control (Berechtigungen)', 'Session Timeout, Token-Invalidierung, Concurrent Sessions'),
('privileged_access', 'security', 'Verwaltung privilegierter Zugriffe (Admin, Root)', 'Privileged access management', 'NOT access_control (allg. Zugriffskontrolle)', 'Admin-Konten, Root-Zugriff, PAM, Just-in-Time-Access'),
('access_control', 'security', 'Allgemeine Zugriffskontrolle (RBAC, Berechtigungen)', 'Access control (RBAC, permissions)', 'NOT privileged_access (Admin) oder authentication (Login)', 'Rollenbasierte Zugriffskontrolle, Berechtigungsvergabe, Least Privilege')
ON CONFLICT (canonical_token) DO UPDATE SET description_de = EXCLUDED.description_de, description_en = EXCLUDED.description_en, NOT_confused_with = EXCLUDED.NOT_confused_with;
-- Encryption & Cryptography
INSERT INTO object_ontology VALUES
('encryption', 'security', 'Verschlüsselung at-rest (Datenverschlüsselung)', 'Encryption at rest', 'NOT transport_encryption (in-transit) oder key_management (Schlüssel)', 'AES-256, Festplattenverschlüsselung, DB-Verschlüsselung'),
('transport_encryption', 'security', 'Transportverschlüsselung (TLS, HTTPS)', 'Transport encryption (TLS)', 'NOT encryption (at-rest)', 'TLS 1.3, HTTPS, mTLS, Zertifikats-Pinning'),
('key_management', 'security', 'Kryptografische Schlüsselverwaltung', 'Cryptographic key management', 'NOT credentials (API-Keys) oder certificate_management (Zertifikate)', 'Key Rotation, HSM, Key Escrow, Schlüsselerzeugung'),
('certificate_management', 'security', 'Zertifikatsverwaltung (PKI, X.509)', 'Certificate management (PKI)', 'NOT key_management (Schlüssel) oder encryption (Verschlüsselung)', 'X.509-Zertifikate, PKI, Zertifikatsrückruf, CA-Verwaltung')
ON CONFLICT (canonical_token) DO UPDATE SET description_de = EXCLUDED.description_de, description_en = EXCLUDED.description_en, NOT_confused_with = EXCLUDED.NOT_confused_with;
-- Network Security
INSERT INTO object_ontology VALUES
('network_security', 'security', 'Allgemeine Netzwerksicherheit', 'General network security', 'NOT network_segmentation (Segmentierung) oder firewall (Regeln)', 'Netzwerk-Hardening, Port-Management, DNS-Sicherheit'),
('network_segmentation', 'security', 'Netzwerksegmentierung (VLANs, Zonen)', 'Network segmentation', 'NOT network_security (allg.) oder firewall (Regeln)', 'VLANs, DMZ, Micro-Segmentation, Zero Trust Network'),
('firewall', 'security', 'Firewall-Regeln und -Verwaltung', 'Firewall rules and management', 'NOT network_security (allg.)', 'WAF, Firewall-Regeln, Ingress/Egress, Whitelist'),
('vpn', 'security', 'VPN-Konfiguration und -Verwaltung', 'VPN configuration', NULL, 'IPSec, WireGuard, Site-to-Site VPN'),
('remote_access', 'security', 'Fernzugriff und Remote-Arbeit', 'Remote access', 'NOT vpn (Technologie)', 'Remote Desktop, Bastion Hosts, Jump Server')
ON CONFLICT (canonical_token) DO UPDATE SET description_de = EXCLUDED.description_de, description_en = EXCLUDED.description_en, NOT_confused_with = EXCLUDED.NOT_confused_with;
-- Monitoring & Logging (CRITICAL: clear boundaries!)
INSERT INTO object_ontology VALUES
('monitoring', 'security', 'Kontinuierliche Echtzeit-Überwachung von Systemen/Metriken', 'Continuous real-time monitoring of systems', 'NOT audit_logging (Protokollierung), NOT training (Schulung), NOT procedure (Verfahren), NOT risk_management (Bewertung)', 'System-Health-Monitoring, Verfügbarkeitsüberwachung, Performance-Monitoring, Anomalie-Erkennung in Echtzeit'),
('audit_logging', 'security', 'Protokollierung und Audit-Trail (Nachvollziehbarkeit)', 'Audit logging and trail', 'NOT monitoring (Echtzeit-Überwachung), NOT compliance_audit (Prüfungen)', 'Log-Aufzeichnung, Audit Trail, Zeitstempel, Nachvollziehbarkeit, Protokollierung von Zugriffen'),
('siem', 'security', 'Security Information and Event Management', 'SIEM', 'NOT monitoring (allg.) oder audit_logging (Protokollierung)', 'SIEM-Korrelation, Security Events, Log-Aggregation'),
('alerting', 'security', 'Benachrichtigungen und Meldepflichten bei Sicherheitsereignissen', 'Security alerting and notification obligations', 'NOT monitoring (Überwachung) oder incident (Vorfallsbehandlung)', 'Sicherheitsmeldungen, Breach Notification, Benachrichtigungspflichten'),
('compliance_audit', 'governance', 'Compliance-Prüfungen und externe Audits', 'Compliance audits and external reviews', 'NOT audit_logging (technische Protokollierung), NOT monitoring (Überwachung)', 'Externe Prüfung, Jahresabschlussprüfung, Zertifizierungsaudit, Lieferanten-Audit')
ON CONFLICT (canonical_token) DO UPDATE SET description_de = EXCLUDED.description_de, description_en = EXCLUDED.description_en, NOT_confused_with = EXCLUDED.NOT_confused_with;
-- Vulnerability & Patch Management
INSERT INTO object_ontology VALUES
('vulnerability', 'security', 'Schwachstellenmanagement und -scanning', 'Vulnerability management', 'NOT patch_management (Updates)', 'Vulnerability Scanning, CVE-Tracking, Penetration Testing'),
('patch_management', 'security', 'Software-Updates und Patch-Verwaltung', 'Patch management', 'NOT vulnerability (Scanning)', 'Patch-Zyklus, Update-Policy, Hotfix-Prozess')
ON CONFLICT (canonical_token) DO UPDATE SET description_de = EXCLUDED.description_de, description_en = EXCLUDED.description_en, NOT_confused_with = EXCLUDED.NOT_confused_with;
-- Backup & Recovery
INSERT INTO object_ontology VALUES
('backup', 'security', 'Datensicherung und Backup-Strategien', 'Backup strategies', 'NOT disaster_recovery (Wiederherstellung)', 'Backup-Rotation, Offsite-Backup, Backup-Verschlüsselung'),
('disaster_recovery', 'security', 'Notfallwiederherstellung und Business Continuity', 'Disaster recovery', 'NOT backup (Datensicherung) oder incident (Vorfälle)', 'DR-Plan, RTO/RPO, Failover, Business Continuity')
ON CONFLICT (canonical_token) DO UPDATE SET description_de = EXCLUDED.description_de, description_en = EXCLUDED.description_en, NOT_confused_with = EXCLUDED.NOT_confused_with;
-- ═══════════════════════════════════════════════════════════════
-- DATA PROTECTION (CRITICAL: clear boundaries!)
-- ═══════════════════════════════════════════════════════════════
INSERT INTO object_ontology VALUES
('personal_data', 'data_protection', 'Verarbeitung personenbezogener Daten (DSGVO-Grundsätze)', 'Personal data processing principles', 'NOT sensitive_data (besondere Kategorien), NOT data_subject_rights (Betroffenenrechte), NOT consent (Einwilligung)', 'Datenminimierung, Zweckbindung, Speicherbegrenzung, Rechtmäßigkeit der Verarbeitung'),
('sensitive_data', 'data_protection', 'Besondere Kategorien personenbezogener Daten (Art. 9 DSGVO)', 'Special categories of personal data', 'NOT personal_data (allg.), NOT health_data (Gesundheit)', 'Biometrische Daten, ethnische Herkunft, politische Meinungen, Gewerkschaftszugehörigkeit'),
('health_data', 'data_protection', 'Gesundheitsdaten und Medizindaten', 'Health and medical data', 'NOT sensitive_data (allg. besondere Kategorien)', 'Patientendaten, Medizinprodukte-Daten, klinische Daten'),
('consent', 'data_protection', 'Einwilligungsmanagement', 'Consent management', 'NOT data_subject_rights (andere Betroffenenrechte)', 'Einwilligung einholen, Widerruf, Opt-In, Consent-Banner'),
('data_subject_rights', 'data_protection', 'Betroffenenrechte (Auskunft, Löschung, Portabilität)', 'Data subject rights (access, erasure, portability)', 'NOT consent (Einwilligung), NOT personal_data (Verarbeitung)', 'Auskunftsrecht, Recht auf Löschung, Datenportabilität, Widerspruchsrecht'),
('data_retention', 'data_protection', 'Aufbewahrungsfristen und Löschkonzept', 'Data retention and deletion', 'NOT backup (technische Sicherung)', 'Löschfristen, Aufbewahrungspflichten, Löschkonzept, Archivierung'),
('data_transfer', 'data_protection', 'Internationale Datenübermittlung (Drittländer, SCC)', 'International data transfer', 'NOT data_processing (Verarbeitung)', 'Drittlandtransfer, Standardvertragsklauseln, Angemessenheitsbeschluss, BCR'),
('data_breach_notification', 'data_protection', 'Meldung von Datenschutzverletzungen (Art. 33/34 DSGVO)', 'Data breach notification', 'NOT incident (allg. Sicherheitsvorfälle), NOT alerting (techn. Alerts)', 'Breach-Meldung an Aufsichtsbehörde, Benachrichtigung Betroffener, 72-Stunden-Frist'),
('dpia', 'data_protection', 'Datenschutz-Folgenabschätzung (Art. 35 DSGVO)', 'Data protection impact assessment', NULL, 'DSFA, Schwellwertanalyse, Risikobewertung für Betroffene'),
('data_processing_agreement', 'data_protection', 'Auftragsverarbeitung (Art. 28 DSGVO)', 'Data processing agreements', NULL, 'AVV, Auftragsverarbeiter, Sub-Auftragsverarbeiter, TOMs'),
('privacy_by_design', 'data_protection', 'Datenschutz durch Technikgestaltung (Art. 25 DSGVO)', 'Privacy by design and default', NULL, 'Privacy by Default, Datenminimierung in der Architektur'),
('data_processing_register', 'data_protection', 'Verzeichnis von Verarbeitungstätigkeiten (Art. 30 DSGVO)', 'Records of processing activities', NULL, 'VVT, Verarbeitungsverzeichnis')
ON CONFLICT (canonical_token) DO UPDATE SET description_de = EXCLUDED.description_de, description_en = EXCLUDED.description_en, NOT_confused_with = EXCLUDED.NOT_confused_with;
-- ═══════════════════════════════════════════════════════════════
-- GOVERNANCE & ORGANIZATION
-- ═══════════════════════════════════════════════════════════════
INSERT INTO object_ontology VALUES
('policy', 'governance', 'Richtlinien und Leitlinien ERSTELLEN/DEFINIEREN', 'Creating/defining policies', 'NOT procedure (Verfahrensablauf), NOT compliance_audit (Prüfung)', 'Sicherheitsrichtlinie erstellen, Policy-Framework definieren, Leitlinie verabschieden'),
('procedure', 'governance', 'Verfahren und Prozessabläufe DEFINIEREN/DOKUMENTIEREN', 'Defining/documenting procedures', 'NOT incident (Vorfallsbehandlung), NOT process (laufender Betrieb)', 'Verfahrensanweisung, Ablaufbeschreibung, Standardprozess definieren'),
('process', 'governance', 'Laufende betriebliche Prozesse AUSFÜHREN', 'Executing operational processes', 'NOT procedure (Definition), NOT monitoring (Überwachung)', 'Betriebsprozess, Geschäftsprozess, Workflow-Ausführung'),
('training', 'governance', 'Schulung und Weiterbildung DURCHFÜHREN', 'Training and education', 'NOT awareness (Sensibilisierung), NOT monitoring (Überwachung!)', 'Mitarbeiterschulung, Zertifizierungskurs, Pflichtunterweisung'),
('awareness', 'governance', 'Sicherheitsbewusstsein und Sensibilisierung', 'Security awareness', 'NOT training (formale Schulung)', 'Phishing-Simulation, Awareness-Kampagne, Sicherheitskultur'),
('incident', 'governance', 'Sicherheitsvorfälle BEHANDELN (Incident Response)', 'Incident response and handling', 'NOT alerting (Benachrichtigung), NOT data_breach_notification (DSGVO-Meldung)', 'Incident Response Plan, Vorfallsanalyse, Containment, Recovery, Lessons Learned'),
('risk_management', 'governance', 'Risikomanagement und -bewertung', 'Risk management and assessment', 'NOT vulnerability (techn. Schwachstellen), NOT monitoring (Überwachung)', 'Risikobewertung, Risikobehandlung, Risikoakzeptanz, Risikomatrix'),
('third_party_management', 'governance', 'Lieferanten- und Drittanbieter-Management', 'Third-party and vendor management', 'NOT data_processing_agreement (AVV)', 'Lieferantenbewertung, Vendor Risk Assessment, Supply Chain Security'),
('change_management', 'governance', 'Änderungsmanagement', 'Change management', 'NOT patch_management (Updates)', 'Change Request, Change Advisory Board, Rollback-Verfahren'),
('documentation', 'governance', 'Allgemeine Dokumentationspflichten', 'General documentation requirements', 'NOT audit_logging (technische Logs), NOT data_processing_register (VVT)', 'Betriebshandbuch, Systemdokumentation, Verfahrensdokumentation'),
('records_management', 'governance', 'Akten- und Unterlagenverwaltung', 'Records management', 'NOT data_retention (Löschfristen)', 'Archivierung, Aktenführung, Aufbewahrungspflichten nach HGB/AO'),
('compliance_reporting', 'governance', 'Compliance-Berichterstattung', 'Compliance reporting', 'NOT alerting (techn. Alerts), NOT supervisory_authority (Behördenkommunikation)', 'Compliance-Bericht, Management-Reporting, KPI-Tracking'),
('asset_management', 'governance', 'IT-Asset-Verwaltung und Inventar', 'IT asset management', NULL, 'Asset-Inventar, CMDB, Hardware-Lifecycle, Software-Inventar'),
('physical_security', 'security', 'Physische Sicherheit und Zutrittskontrolle', 'Physical security and access', NULL, 'Zutrittskontrolle, Videoüberwachung (physisch), Serverraum-Sicherheit'),
('human_resources_security', 'governance', 'Personalsicherheit', 'HR security', 'NOT training (Schulung)', 'Background-Checks, Geheimhaltungsvereinbarungen, Onboarding/Offboarding')
ON CONFLICT (canonical_token) DO UPDATE SET description_de = EXCLUDED.description_de, description_en = EXCLUDED.description_en, NOT_confused_with = EXCLUDED.NOT_confused_with;
-- ═══════════════════════════════════════════════════════════════
-- REGULATORY SPECIFIC
-- ═══════════════════════════════════════════════════════════════
INSERT INTO object_ontology VALUES
('supervisory_authority', 'regulatory', 'Kommunikation mit Aufsichtsbehörden', 'Supervisory authority communication', 'NOT compliance_reporting (interne Berichte)', 'Meldung an BaFin, Abstimmung mit DPA, behördliche Anfragen'),
('certification', 'regulatory', 'Zertifizierung und Konformitätsbewertung', 'Certification and conformity assessment', 'NOT compliance_audit (Prüfung), NOT personal_data (Datenschutz)', 'CE-Kennzeichnung, ISO-Zertifizierung, Konformitätserklärung'),
('product_safety', 'regulatory', 'Produktsicherheit und Marktüberwachung', 'Product safety and market surveillance', 'NOT certification (Zertifizierung)', 'Rückrufmanagement, Sicherheitsbewertung, RAPEX-Meldung'),
('ai_system', 'regulatory', 'KI-System-Regulierung (AI Act)', 'AI system regulation', NULL, 'KI-Risikobewertung, Hochrisiko-KI, Transparenzpflichten, FRIA'),
('financial_reporting', 'regulatory', 'Finanzberichterstattung und Rechnungslegung', 'Financial reporting and accounting', NULL, 'Jahresabschluss, HGB-Pflichten, IFRS, Buchführung'),
('aml', 'regulatory', 'Geldwäscheprävention und KYC', 'Anti-money laundering and KYC', NULL, 'KYC, Verdachtsmeldung, PEP-Prüfung, Transaktionsmonitoring'),
('whistleblowing', 'regulatory', 'Hinweisgeberschutz und Meldekanäle', 'Whistleblower protection', NULL, 'Hinweisgebersystem, Meldekanal, Hinweisgeberschutzgesetz'),
('consumer_protection', 'regulatory', 'Verbraucherschutz und AGB', 'Consumer protection', NULL, 'AGB-Prüfung, Widerrufsrecht, Informationspflichten, Preistransparenz'),
('ecommerce', 'regulatory', 'E-Commerce-Pflichten (Impressum, Fernabsatz)', 'E-commerce obligations', NULL, 'Impressumspflicht, Fernabsatzrecht, Online-Handel-Pflichten'),
('telecommunications', 'regulatory', 'Telekommunikationsregulierung', 'Telecommunications regulation', NULL, 'TKG-Pflichten, Vorratsdatenspeicherung, Notruf'),
('medical_device', 'regulatory', 'Medizinprodukte-Regulierung (MDR)', 'Medical device regulation', NULL, 'UDI, klinische Bewertung, Post-Market Surveillance'),
('payment_services', 'regulatory', 'Zahlungsdienste-Regulierung (PSD2)', 'Payment services regulation', NULL, 'Starke Kundenauthentifizierung, PSD2-Compliance, Open Banking'),
('critical_infrastructure', 'regulatory', 'KRITIS und NIS2-Pflichten', 'Critical infrastructure (NIS2)', NULL, 'KRITIS-Meldepflichten, NIS2-Maßnahmen, Mindeststandards'),
('supply_chain_due_diligence', 'regulatory', 'Lieferkettensorgfaltspflicht (LkSG)', 'Supply chain due diligence', 'NOT third_party_management (allg. Lieferanten)', 'Menschenrechts-Due-Diligence, Umwelt-Sorgfaltspflicht, LkSG-Bericht'),
('sustainability_reporting', 'regulatory', 'Nachhaltigkeitsberichterstattung (CSRD)', 'Sustainability reporting', NULL, 'ESG-Reporting, CSRD, Nachhaltigkeitsbericht'),
('cookie_consent', 'regulatory', 'Cookie-Consent und Tracking (TDDDG/ePrivacy)', 'Cookie consent and tracking', 'NOT consent (allg. Einwilligung)', 'Cookie-Banner, Tracking-Einwilligung, TDDDG §25'),
('video_surveillance', 'regulatory', 'Videoüberwachung (datenschutzrechtlich)', 'Video surveillance (data protection)', 'NOT physical_security (physische Sicherheit), NOT monitoring (IT-Monitoring)', 'Kamera-Überwachung, Speicherfristen, Kennzeichnungspflicht')
ON CONFLICT (canonical_token) DO UPDATE SET description_de = EXCLUDED.description_de, description_en = EXCLUDED.description_en, NOT_confused_with = EXCLUDED.NOT_confused_with;
-- ═══════════════════════════════════════════════════════════════
-- APPLICATION SECURITY
-- ═══════════════════════════════════════════════════════════════
INSERT INTO object_ontology VALUES
('secure_development', 'technical', 'Sichere Softwareentwicklung (SDLC)', 'Secure software development lifecycle', NULL, 'Secure Coding, Code Review, SAST/DAST, DevSecOps'),
('api_security', 'technical', 'API-Sicherheit', 'API security', NULL, 'API-Authentifizierung, Rate Limiting, Input Validation'),
('input_validation', 'technical', 'Eingabevalidierung und Output Encoding', 'Input validation and output encoding', NULL, 'XSS-Prävention, SQL-Injection-Schutz, Parametervalidierung'),
('container_security', 'technical', 'Container- und Cloud-Sicherheit', 'Container and cloud security', NULL, 'Docker-Hardening, Kubernetes-Security, Image-Scanning'),
('logging_configuration', 'technical', 'Log-Konfiguration und -Format', 'Log configuration and format', 'NOT audit_logging (Nachvollziehbarkeit), NOT monitoring (Überwachung)', 'Log-Format, Log-Rotation, Log-Shipping, Structured Logging'),
('data_classification', 'governance', 'Datenklassifizierung und -kennzeichnung', 'Data classification and labeling', 'NOT sensitive_data (besondere Kategorien)', 'Vertraulichkeitsstufen, Datenklassifizierung, Labeling')
ON CONFLICT (canonical_token) DO UPDATE SET description_de = EXCLUDED.description_de, description_en = EXCLUDED.description_en, NOT_confused_with = EXCLUDED.NOT_confused_with;
-- Count results
DO $$
DECLARE
    cnt INTEGER;
BEGIN
    SELECT count(*) INTO cnt FROM object_ontology;
    RAISE NOTICE 'object_ontology: % canonical tokens defined', cnt;
END $$;
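The NOT_confused_with column names the neighbouring tokens each entry must be kept apart from, so every token mentioned there should itself be a defined canonical_token. A minimal sketch of such a check follows, assuming the seed rows are available as a Python mapping. The `ONTOLOGY` excerpt, the `TOKEN_REF` pattern and the `dangling_references` helper are hypothetical and not part of the migration; `risk_assessment` is deliberately left undefined in the sample to show a finding.

```python
import re

# Hypothetical excerpt of the seed data: canonical_token -> NOT_confused_with.
ONTOLOGY = {
    "monitoring": "NOT audit_logging (Protokollierung), NOT risk_assessment (Bewertung)",
    "audit_logging": "NOT monitoring (Echtzeit-Überwachung), NOT compliance_audit (Prüfungen)",
    "compliance_audit": "NOT audit_logging (technische Protokollierung)",
    "risk_management": "NOT vulnerability (techn. Schwachstellen), NOT monitoring (Überwachung)",
    "vulnerability": "NOT patch_management (Updates)",
    "patch_management": "NOT vulnerability (Scanning)",
    "vpn": None,
}

# Assumption: a referenced token is a snake_case word followed by "(".
TOKEN_REF = re.compile(r"\b([a-z][a-z0-9_]*)\s*\(")


def dangling_references(ontology):
    """Return {token: [referenced tokens that are not defined]}."""
    missing = {}
    for token, not_confused in ontology.items():
        if not not_confused:
            continue  # NULL in the table: nothing to check
        refs = TOKEN_REF.findall(not_confused)
        unknown = sorted({ref for ref in refs if ref not in ontology})
        if unknown:
            missing[token] = unknown
    return missing


findings = dangling_references(ONTOLOGY)
```

Run against the full table, a non-empty result would point at entries whose disambiguation hint names a token that the ontology never defines.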
@@ -0,0 +1,58 @@
-- Migration 011: Derived Controls Library (Clean-Room MCs from external sources)
-- Schema: compliance
--
-- Holds master controls, atomic controls, mitigations, and metrics that were
-- derived clean-room from external regulatory sources (BSI QUAIDAL today,
-- Grundschutz++/CRA/NIST AI RMF next). Kept separate from the gpre2
-- master_controls table because:
--   1) The shape is different (no object_group/phase concepts).
--   2) Source-layer separation: derivations from external IP must be cleanly
--      separable from internally generated artifacts.
--   3) Each row carries the licence and provenance for due diligence.
--
-- Run: ssh macmini "docker exec -i bp-core-postgres psql -U breakpilot -d breakpilot_db" \
-- < control-pipeline/migrations/011_derived_controls.sql
SET search_path TO compliance, public;
CREATE TABLE IF NOT EXISTS derived_controls (
    id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    derived_id VARCHAR(200) UNIQUE NOT NULL, -- e.g. MC-AI-DATA-QKB-01-repraesentativitaet
    kind VARCHAR(30) NOT NULL, -- criterion | building_block | measure | metric
    canonical_name VARCHAR(300) NOT NULL,
    description TEXT NOT NULL, -- our own wording, never the original
    regulation_anchor TEXT, -- e.g. "EU AI Act Art. 10"
    related_quaidal_ids JSONB NOT NULL DEFAULT '[]', -- ["QB-03", "QB-04", ...]
    external_refs JSONB NOT NULL DEFAULT '[]', -- [{framework, citation}, ...]
    source_framework VARCHAR(80) NOT NULL, -- "BSI QUAIDAL"
    source_section VARCHAR(80) NOT NULL, -- "QKB-01"
    source_url TEXT,
    source_commit_sha VARCHAR(80),
    source_title_original TEXT, -- original title (label, not protected)
    source_license_note TEXT,
    plagiarism_score_at_generation NUMERIC(5,4), -- 0..1; gate was 0.20
    generated_by_model VARCHAR(80),
    yaml_path TEXT, -- pointer back to source YAML
    created_at TIMESTAMPTZ DEFAULT NOW(),
    updated_at TIMESTAMPTZ DEFAULT NOW()
);
CREATE INDEX IF NOT EXISTS idx_derived_controls_kind ON derived_controls(kind);
CREATE INDEX IF NOT EXISTS idx_derived_controls_source_framework ON derived_controls(source_framework);
CREATE INDEX IF NOT EXISTS idx_derived_controls_source_section ON derived_controls(source_section);
CREATE INDEX IF NOT EXISTS idx_derived_controls_related_quaidal_gin
ON derived_controls USING GIN(related_quaidal_ids);
-- Trigger to keep updated_at fresh
CREATE OR REPLACE FUNCTION trg_derived_controls_set_updated_at()
RETURNS TRIGGER AS $$
BEGIN
    NEW.updated_at = NOW();
    RETURN NEW;
END;
$$ LANGUAGE plpgsql;
DROP TRIGGER IF EXISTS derived_controls_updated_at ON derived_controls;
CREATE TRIGGER derived_controls_updated_at
    BEFORE UPDATE ON derived_controls
    FOR EACH ROW EXECUTE FUNCTION trg_derived_controls_set_updated_at();
@@ -0,0 +1,170 @@
#!/usr/bin/env python3
"""Upsert derived QUAIDAL controls from YAML into compliance.derived_controls.
Reads:
control-pipeline/data/quaidal/master_controls.yaml
control-pipeline/data/quaidal/atomic_controls.yaml
control-pipeline/data/quaidal/mitigations.yaml
control-pipeline/data/quaidal/metrics.yaml
Writes: compliance.derived_controls (idempotent UPSERT by derived_id)
Usage:
# Mac Mini direct:
python3 control-pipeline/scripts/apply_quaidal_to_db.py
# Via SSH (locally, against macmini DB):
DB_HOST=macmini python3 control-pipeline/scripts/apply_quaidal_to_db.py
"""
from __future__ import annotations
import argparse
import json
import os
import sys
from pathlib import Path
try:
    import psycopg
    import yaml
except ImportError as e:
    # The extras spec is quoted so shells such as zsh do not glob the brackets.
    print(f"ERROR: missing dependency {e.name}. Install with: pip install 'psycopg[binary]' pyyaml", file=sys.stderr)
    sys.exit(2)
REPO_ROOT = Path(__file__).resolve().parents[2]
DATA_DIR = REPO_ROOT / "control-pipeline" / "data" / "quaidal"
KIND_FILES = {
    "criterion": "master_controls.yaml",
    "building_block": "atomic_controls.yaml",
    "measure": "mitigations.yaml",
    "metric": "metrics.yaml",
}
UPSERT_SQL = """
INSERT INTO compliance.derived_controls (
derived_id, kind, canonical_name, description, regulation_anchor,
related_quaidal_ids, external_refs,
source_framework, source_section, source_url, source_commit_sha,
source_title_original, source_license_note,
plagiarism_score_at_generation, generated_by_model, yaml_path
) VALUES (
%(derived_id)s, %(kind)s, %(canonical_name)s, %(description)s, %(regulation_anchor)s,
%(related_quaidal_ids)s::jsonb, %(external_refs)s::jsonb,
%(source_framework)s, %(source_section)s, %(source_url)s, %(source_commit_sha)s,
%(source_title_original)s, %(source_license_note)s,
%(plagiarism_score)s, %(generated_by_model)s, %(yaml_path)s
)
ON CONFLICT (derived_id) DO UPDATE SET
kind = EXCLUDED.kind,
canonical_name = EXCLUDED.canonical_name,
description = EXCLUDED.description,
regulation_anchor = EXCLUDED.regulation_anchor,
related_quaidal_ids = EXCLUDED.related_quaidal_ids,
external_refs = EXCLUDED.external_refs,
source_framework = EXCLUDED.source_framework,
source_section = EXCLUDED.source_section,
source_url = EXCLUDED.source_url,
source_commit_sha = EXCLUDED.source_commit_sha,
source_title_original = EXCLUDED.source_title_original,
source_license_note = EXCLUDED.source_license_note,
plagiarism_score_at_generation = EXCLUDED.plagiarism_score_at_generation,
generated_by_model = EXCLUDED.generated_by_model,
yaml_path = EXCLUDED.yaml_path
"""
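def load_yaml_records(yaml_path: Path) -> tuple[list[dict], str | None, str | None]:
    """Return (controls, commit_sha, generated_by_model) from one YAML file."""
    if not yaml_path.exists():
        return [], None, None
    # An empty YAML file parses to None; treat it like a file without controls.
    data = yaml.safe_load(yaml_path.read_text(encoding="utf-8")) or {}
    return data.get("controls", []), data.get("commit_sha"), data.get("generated_by_model")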
def to_row(ctrl: dict, yaml_path: Path, default_model: str | None, default_commit: str | None) -> dict:
    source = ctrl.get("source") or {}
    return {
        "derived_id": ctrl["id"],
        "kind": ctrl["kind"],
        "canonical_name": ctrl["canonical_name"],
        "description": ctrl["description"],
        "regulation_anchor": ctrl.get("regulation_anchor"),
        "related_quaidal_ids": json.dumps(ctrl.get("related_quaidal_ids", []), ensure_ascii=False),
        "external_refs": json.dumps(ctrl.get("external_refs", []), ensure_ascii=False),
        "source_framework": source.get("framework", "BSI QUAIDAL"),
        "source_section": source.get("section", ""),
        "source_url": source.get("url"),
        "source_commit_sha": source.get("commit_sha") or default_commit,
        "source_title_original": source.get("title_original_de"),
        "source_license_note": source.get("license_note"),
        "plagiarism_score": ctrl.get("plagiarism_score_at_generation"),
        "generated_by_model": default_model,
        "yaml_path": str(yaml_path.relative_to(REPO_ROOT)),
    }
def build_dsn(args: argparse.Namespace) -> str:
    if args.dsn:
        return args.dsn
    return (
        f"host={args.db_host} port={args.db_port} "
        f"dbname={args.db_name} user={args.db_user} password={args.db_password}"
    )
def main() -> int:
    ap = argparse.ArgumentParser(description=__doc__)
    ap.add_argument("--dsn", help="Full DSN; overrides individual flags")
    ap.add_argument("--db-host", default=os.environ.get("DB_HOST", "localhost"))
    ap.add_argument("--db-port", default=os.environ.get("DB_PORT", "5432"))
    ap.add_argument("--db-name", default=os.environ.get("DB_NAME", "breakpilot_db"))
    ap.add_argument("--db-user", default=os.environ.get("DB_USER", "breakpilot"))
    ap.add_argument("--db-password", default=os.environ.get("DB_PASSWORD", "breakpilot"))
    ap.add_argument("--dry-run", action="store_true")
    args = ap.parse_args()
    total = 0
    rows: list[dict] = []
    for kind, fname in KIND_FILES.items():
        path = DATA_DIR / fname
        records, commit, model = load_yaml_records(path)
        for rec in records:
            rows.append(to_row(rec, path, model, commit))
        if records:
            print(f" {fname}: {len(records)} entries", file=sys.stderr)
        total += len(records)
    if not rows:
        print("ERROR: no YAML records found; run derive_quaidal_mcs.py first", file=sys.stderr)
        return 2
    print(f"Total rows: {total}", file=sys.stderr)
    if args.dry_run:
        print("Dry run, sample row:", file=sys.stderr)
        print(json.dumps({k: (v[:200] if isinstance(v, str) else v) for k, v in rows[0].items()}, indent=2, ensure_ascii=False))
        return 0
    dsn = build_dsn(args)
    print(f"Connecting to {args.db_host}:{args.db_port}/{args.db_name}", file=sys.stderr)
    inserted = updated = 0
    with psycopg.connect(dsn) as conn:
        with conn.cursor() as cur:
            for row in rows:
                # Check existence first so inserts and updates can be counted separately.
                cur.execute(
                    "SELECT 1 FROM compliance.derived_controls WHERE derived_id = %s",
                    (row["derived_id"],),
                )
                existed = cur.fetchone() is not None
                cur.execute(UPSERT_SQL, row)
                if existed:
                    updated += 1
                else:
                    inserted += 1
        conn.commit()
    print(f"Inserted: {inserted}, Updated: {updated}", file=sys.stderr)
    return 0
if __name__ == "__main__":
    sys.exit(main())
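The per-row SELECT before each upsert in this script serves only to count inserts versus updates. The following is a minimal sketch of the same bookkeeping without one round trip per row, assuming the existing derived_id values have been fetched once up front. The helper name `split_inserts_updates` and the sample data are hypothetical and not part of the script.

```python
def split_inserts_updates(rows, existing_ids):
    """Count how many rows an upsert would insert and how many it would update.

    rows: iterable of dicts with a "derived_id" key (the shape to_row returns).
    existing_ids: derived_id values already present in the table.
    """
    seen = set(existing_ids)
    inserted = updated = 0
    for row in rows:
        if row["derived_id"] in seen:
            updated += 1
        else:
            inserted += 1
            # A repeated ID later in the same batch then counts as an update.
            seen.add(row["derived_id"])
    return inserted, updated


# Hypothetical batch: "b" already exists, "a" appears twice.
sample_rows = [{"derived_id": "a"}, {"derived_id": "b"}, {"derived_id": "a"}]
counts = split_inserts_updates(sample_rows, existing_ids={"b"})
```

The set of existing IDs could be loaded with a single `SELECT derived_id FROM compliance.derived_controls` before the loop. Whether that is worthwhile depends on how large the table is relative to the batch.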
@@ -0,0 +1,148 @@
#!/usr/bin/env python3
"""Inherit source_citation from parent to atom controls.
Background
==========
citation_backfill.py fills source_citation on the *source-bearing* controls
(those with source_original_text, ~2-7 %) by re-linking them to the
re-ingested, article_label-bearing chunks. The remaining ~93 % are "atom"
controls (decompositions) that carry a parent_control_uuid but no citation of
their own. They cite the SAME norm as their parent, so the citation can be
inherited; no re-matching is needed.
Self-written controls (license_rule = 3) are skipped (no external source).
Runs in idempotent iterations (atom -> master -> grandmaster) and prints
per-stage counts before any write. Safe to rerun: it only fills rows whose
source_citation lacks an 'article'.
Usage::
    python3 scripts/atom_citation_inheritance.py --db-url "$DATABASE_URL" --dry-run
    python3 scripts/atom_citation_inheritance.py --db-url "$DATABASE_URL" --apply
"""
from __future__ import annotations
import argparse
import os
import sys
DB_URL = os.getenv("DATABASE_URL", "postgresql://breakpilot:breakpilot@localhost:5432/breakpilot_db")
def _art(alias: str) -> str:
    """SQL for source_citation->>'article' that works whether the column is jsonb
    (macmini) or text-containing-JSON (prod schema anomaly from the DB swap).
    pg_input_is_valid (PG16+) guards rows with invalid JSON so the cast never errors."""
    col = f"{alias}.source_citation"
    return (
        f"(CASE WHEN {col} IS NOT NULL AND pg_input_is_valid({col}::text, 'jsonb') "
        f"THEN ({col}::text)::jsonb->>'article' ELSE NULL END)"
    )
# A row "needs" a citation when it has no article yet.
_NEEDS = f"({_art('cc')} IS NULL OR {_art('cc')} = '')"
# A parent can supply one when it carries a real article.
_PARENT_HAS = f"({_art('p')} IS NOT NULL AND {_art('p')} <> '')"
SQL_REPORT = f"""
SET search_path TO compliance, public;
SELECT
    CASE WHEN cc.parent_control_uuid IS NULL THEN 'no_parent'
         WHEN ({_art('p2')} IS NOT NULL AND {_art('p2')} <> '') THEN 'parent_has_article'
         ELSE 'parent_no_article' END AS bucket,
    COUNT(*) AS n
FROM canonical_controls cc
LEFT JOIN canonical_controls p2 ON cc.parent_control_uuid = p2.id
WHERE {_NEEDS}
  AND cc.license_rule IS DISTINCT FROM 3
GROUP BY 1 ORDER BY 2 DESC;
"""
SQL_INHERIT = f"""
SET search_path TO compliance, public;
UPDATE canonical_controls cc
SET source_citation = p.source_citation, updated_at = NOW()
FROM canonical_controls p
WHERE cc.parent_control_uuid = p.id
AND {_NEEDS}
AND {_PARENT_HAS}
AND cc.license_rule IS DISTINCT FROM 3;
"""
def parse_args() -> argparse.Namespace:
    p = argparse.ArgumentParser(description=__doc__)
    p.add_argument("--db-url", default=DB_URL,
                   help="Postgres URL (default: $DATABASE_URL)")
    p.add_argument("--max-iterations", type=int, default=6,
                   help="Cap on inheritance iterations to avoid loops")
    g = p.add_mutually_exclusive_group(required=True)
    g.add_argument("--dry-run", action="store_true")
    g.add_argument("--apply", action="store_true")
    return p.parse_args()
def print_bucket(rows, label: str) -> None:
    print(f"\n## {label}")
    total = 0
    for bucket, n in rows:
        print(f" {bucket:20} {n:>8}")
        total += n
    print(f" {'TOTAL':20} {total:>8}")
def main() -> int:
    args = parse_args()
    try:
        import psycopg2
    except ImportError:
        print("error: psycopg2 not installed", file=sys.stderr)
        return 2
    conn = psycopg2.connect(args.db_url)
    conn.autocommit = False
    cur = conn.cursor()
    print("=" * 60)
    print(" Atom citation inheritance: source_citation via parent")
    print(f" Mode: {'DRY-RUN' if args.dry_run else 'APPLY'}")
    print("=" * 60)
    cur.execute(SQL_REPORT)
    print_bucket(cur.fetchall(), "Controls without article (need citation)")
    if args.dry_run:
        cur.execute(
            "SET search_path TO compliance, public; "
            f"SELECT COUNT(*) FROM canonical_controls cc "
            f"JOIN canonical_controls p ON cc.parent_control_uuid = p.id "
            f"WHERE {_NEEDS} AND {_PARENT_HAS} AND cc.license_rule IS DISTINCT FROM 3;"
        )
        print(f"\n## First inherit-pass would fill: {cur.fetchone()[0]} rows")
        print("\nNo writes performed. Use --apply to execute.")
        conn.rollback()
        return 0
    total = 0
    # Each pass fills one more level of the parent chain; stop once a pass changes nothing.
    for i in range(1, args.max_iterations + 1):
        cur.execute(SQL_INHERIT)
        updated = cur.rowcount
        total += updated
        print(f"\n iteration {i}: {updated} rows inherited")
        if updated == 0:
            break
    conn.commit()
    print(f"\n✓ Total atoms inherited: {total}")
    cur.execute(SQL_REPORT)
    print_bucket(cur.fetchall(), "Remaining without article")
    return 0
if __name__ == "__main__":
    raise SystemExit(main())
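The inheritance runs in repeated passes because a single UPDATE only sees the state from before the statement started: a grandchild cannot inherit until its parent was filled by an earlier pass. The pure-Python sketch below illustrates that fixpoint behaviour under simplified assumptions. Citations are plain strings here rather than JSON, and the names `inherit_pass` and `inherit_until_stable` and the three-level sample chain are hypothetical.

```python
def inherit_pass(citations, parent_of):
    """One pass: fill each empty citation from its parent's citation.

    Like a single SQL UPDATE, every row reads the state from before the pass.
    """
    snapshot = dict(citations)
    filled = 0
    for child, parent in parent_of.items():
        if not snapshot.get(child) and snapshot.get(parent):
            citations[child] = snapshot[parent]
            filled += 1
    return filled


def inherit_until_stable(citations, parent_of, max_iterations=6):
    """Repeat passes until one fills nothing or the iteration cap is reached."""
    per_pass = []
    for _ in range(max_iterations):
        filled = inherit_pass(citations, parent_of)
        per_pass.append(filled)
        if filled == 0:
            break
    return per_pass


# Hypothetical three-level chain: atom -> master -> grandmaster.
parent_of = {"atom": "master", "master": "grandmaster"}
citations = {"grandmaster": "Art. 32", "master": "", "atom": ""}
passes = inherit_until_stable(citations, parent_of)
```

In this sample the first pass fills only `master`, the second fills `atom`, and the third finds nothing left to do, which is why the number of useful passes tracks the depth of the parent chain.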
@@ -0,0 +1,256 @@
#!/usr/bin/env python3
"""Audit script for license classification gaps in the control pipeline.
Reports:
1. **regulation_registry coverage**: how many regulations are classified, by
   rule and license_type.
2. **atomic_controls without license_rule**: how many controls reference a
   regulation_id that has no entry (or no license_rule) in the registry.
3. **Qdrant payload consistency**: for each indexed collection, how many
   chunks carry both ``license`` and ``license_rule`` payload fields.
The goal is to surface every record where the engine could in principle
extract or emit content but the license rule is unknown; those records are
the highest-risk material in a license audit.
Usage::
    python3 scripts/audit_license_classification.py --db-host 100.80.114.48
Add ``--check-qdrant`` to also probe ``http://<host>:6333`` collections.
"""
from __future__ import annotations
import argparse
import json
import sys
from collections import Counter
from pathlib import Path
from typing import Optional
from urllib import request as urllib_request
sys.path.insert(0, str(Path(__file__).resolve().parent.parent))
DEFAULT_HOST = "100.80.114.48"
DEFAULT_PORT = 5432
DEFAULT_USER = "breakpilot"
DEFAULT_DB = "breakpilot_db"
def parse_args() -> argparse.Namespace:
    p = argparse.ArgumentParser(description=__doc__)
    p.add_argument("--db-host", default=DEFAULT_HOST)
    p.add_argument("--db-port", type=int, default=DEFAULT_PORT)
    p.add_argument("--db-user", default=DEFAULT_USER)
    p.add_argument("--db-name", default=DEFAULT_DB)
    p.add_argument("--db-password", default="")
    p.add_argument("--check-qdrant", action="store_true")
    p.add_argument("--qdrant-host", default="100.80.114.48")
    p.add_argument("--qdrant-port", type=int, default=6333)
    p.add_argument("--json", action="store_true", help="Emit JSON result on stdout")
    return p.parse_args()
def audit_registry(conn) -> dict:
"""Coverage of regulation_registry."""
cur = conn.cursor()
cur.execute(
"SET search_path TO compliance, public; "
"SELECT license_rule, license_type, COUNT(*) "
"FROM regulation_registry GROUP BY license_rule, license_type "
"ORDER BY license_rule, license_type;"
)
by_rule_and_type: list[tuple] = []
by_rule: Counter = Counter()
for rule, ltype, count in cur.fetchall():
by_rule_and_type.append((rule, ltype or "(empty)", count))
by_rule[rule] += count
cur.execute(
"SELECT COUNT(*) FROM regulation_registry "
"WHERE license_type IS NULL OR license_type = '';"
)
missing_type = cur.fetchone()[0]
cur.execute("SELECT COUNT(*) FROM regulation_registry;")
total = cur.fetchone()[0]
return {
"total": total,
"by_rule": dict(by_rule),
"by_rule_and_type": by_rule_and_type,
"missing_license_type": missing_type,
}
def audit_atomic_controls(conn) -> dict:
"""Controls whose source regulation has no license rule.
Important: the schema differs between core (bp-core) and customer
deployments. We probe a handful of likely table names, skip any table
that lacks a license_rule column, and skip the check if no table is found.
"""
cur = conn.cursor()
# Detect controls table
cur.execute(
"SELECT table_name FROM information_schema.tables "
"WHERE table_schema='compliance' AND table_name IN "
"('atomic_controls','atomic_controls_dedup','canonical_controls');"
)
tables = [r[0] for r in cur.fetchall()]
if not tables:
return {"skipped": True, "reason": "no controls table found"}
result: dict = {"tables": {}}
for tbl in tables:
cur.execute(
f"SELECT column_name FROM information_schema.columns "
f"WHERE table_schema='compliance' AND table_name='{tbl}';"
)
cols = {r[0] for r in cur.fetchall()}
if "license_rule" not in cols:
result["tables"][tbl] = {"skipped": True, "reason": "no license_rule column"}
continue
cur.execute(f"SELECT COUNT(*) FROM compliance.{tbl};")
total = cur.fetchone()[0]
cur.execute(
f"SELECT license_rule, COUNT(*) FROM compliance.{tbl} "
f"GROUP BY license_rule ORDER BY license_rule;"
)
by_rule = {str(r[0]): r[1] for r in cur.fetchall()}
cur.execute(
f"SELECT COUNT(*) FROM compliance.{tbl} WHERE license_rule IS NULL;"
)
missing = cur.fetchone()[0]
result["tables"][tbl] = {
"total": total,
"by_rule": by_rule,
"missing_license_rule": missing,
}
return result
def audit_qdrant(host: str, port: int) -> dict:
"""Probe Qdrant collections for license + license_rule payload coverage.
Samples up to 500 points per collection (the first scroll page) and counts
how many carry both fields, only one of them, or neither.
"""
out: dict = {"collections": {}}
base = f"http://{host}:{port}"
try:
with urllib_request.urlopen(f"{base}/collections", timeout=10) as r:
colls = json.loads(r.read()).get("result", {}).get("collections", [])
except Exception as e:
return {"error": str(e)}
for c in colls:
name = c["name"]
if "compliance" not in name and "atomic_controls" not in name:
continue
payload = {"limit": 500, "with_payload": True, "with_vector": False}
req = urllib_request.Request(
f"{base}/collections/{name}/points/scroll",
data=json.dumps(payload).encode(),
headers={"Content-Type": "application/json"},
)
try:
with urllib_request.urlopen(req, timeout=15) as r:
points = json.loads(r.read()).get("result", {}).get("points", [])
except Exception as e:
out["collections"][name] = {"error": str(e)}
continue
sampled = len(points)
both_set = 0
only_license = 0
only_rule = 0
neither = 0
for p in points:
pl = p.get("payload", {}) or {}
has_lic = bool(pl.get("license"))
has_rule = pl.get("license_rule") is not None
if has_lic and has_rule:
both_set += 1
elif has_lic:
only_license += 1
elif has_rule:
only_rule += 1
else:
neither += 1
out["collections"][name] = {
"sampled": sampled,
"both_set": both_set,
"only_license_field": only_license,
"only_license_rule_field": only_rule,
"neither_set": neither,
"neither_pct": round(neither / sampled * 100, 1) if sampled else 0,
}
return out
def main() -> int:
args = parse_args()
try:
import psycopg2
except ImportError:
print("error: psycopg2 not installed (pip install psycopg2-binary)", file=sys.stderr)
return 2
conn = psycopg2.connect(
host=args.db_host,
port=args.db_port,
user=args.db_user,
dbname=args.db_name,
password=args.db_password or None,
)
try:
registry = audit_registry(conn)
controls = audit_atomic_controls(conn)
finally:
conn.close()
qdrant: Optional[dict] = None
if args.check_qdrant:
qdrant = audit_qdrant(args.qdrant_host, args.qdrant_port)
result = {"registry": registry, "atomic_controls": controls, "qdrant": qdrant}
if args.json:
print(json.dumps(result, indent=2, default=str))
return 0
print("=" * 60)
print(" Audit — License Classification")
print("=" * 60)
print()
print(f"## regulation_registry ({registry['total']} rows)")
print(f" By rule: {registry['by_rule']}")
print(f" Missing license_type: {registry['missing_license_type']}")
print()
print("## atomic_controls")
for tbl, info in controls.get("tables", {}).items():
if info.get("skipped"):
print(f" {tbl}: SKIPPED ({info['reason']})")
continue
print(f" {tbl}: {info['total']} rows")
print(f" by_rule={info['by_rule']}")
print(f" missing_license_rule={info['missing_license_rule']}")
print()
if qdrant:
print("## qdrant")
for name, info in qdrant.get("collections", {}).items():
if "error" in info:
print(f" {name}: ERROR {info['error']}")
continue
print(
f" {name:30} sampled={info['sampled']:4} "
f"both={info['both_set']:4} "
f"neither={info['neither_set']:4} ({info['neither_pct']}%)"
)
return 0
if __name__ == "__main__":
raise SystemExit(main())
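The four-way bucket logic in audit_qdrant can be exercised without a running Qdrant instance. The following is a minimal sketch, assuming the same payload field names (``license``, ``license_rule``); the helper names ``classify_payload`` and ``summarize`` are hypothetical and not part of the script.

```python
from collections import Counter
from typing import Optional


def classify_payload(payload: Optional[dict]) -> str:
    """Return the coverage bucket for one point payload (hypothetical helper)."""
    pl = payload or {}
    # An empty string counts as "no license"; a license_rule of 0 still counts as set.
    has_lic = bool(pl.get("license"))
    has_rule = pl.get("license_rule") is not None
    if has_lic and has_rule:
        return "both_set"
    if has_lic:
        return "only_license_field"
    if has_rule:
        return "only_license_rule_field"
    return "neither_set"


def summarize(points: list) -> dict:
    """Aggregate bucket counts for a sample of points shaped like a scroll result."""
    counts = Counter(classify_payload(p.get("payload")) for p in points)
    sampled = len(points)
    neither = counts["neither_set"]
    return {
        "sampled": sampled,
        "both_set": counts["both_set"],
        "only_license_field": counts["only_license_field"],
        "only_license_rule_field": counts["only_license_rule_field"],
        "neither_set": neither,
        "neither_pct": round(neither / sampled * 100, 1) if sampled else 0,
    }
```

Keeping the classification in a pure function makes the asymmetry visible: ``license`` is tested for truthiness, while ``license_rule`` is only tested against ``None``.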
@@ -0,0 +1,184 @@
#!/usr/bin/env python3
"""Backfill license_rule on canonical_controls by inheriting from parent.
Background
==========
Audit (audit_license_classification.py) showed that 279,384 of 314,811 rows
in compliance.canonical_controls have NULL license_rule. Drilling in:
- 261,980 of those (94%) have a parent_control_uuid whose parent already
carries a non-NULL license_rule. The pass0b decomposition pipeline did
not propagate the rule to its child controls; this is a clear inheritance
bug, fixable without any classification decisions.
- 16,617 have a parent that itself has no license_rule (transitive case).
Inheriting iteratively converges to either rule-set or root-orphan.
- 787 have no parent at all (decomposition roots). These need cluster-based
manual classification (see Strategy Notes at the bottom of this file).
This script runs the inheritance fix in three idempotent stages and
prints per-stage counts before any write happens.
Usage::
# Always dry-run first:
python3 scripts/backfill_license_rule.py --db-host 100.80.114.48 \\
--db-password breakpilot123 --dry-run
# If counts look right:
python3 scripts/backfill_license_rule.py --db-host 100.80.114.48 \\
--db-password breakpilot123 --apply
The script is safe to rerun: it only touches rows where license_rule
IS NULL.
"""
from __future__ import annotations
import argparse
import sys
def parse_args() -> argparse.Namespace:
p = argparse.ArgumentParser(description=__doc__)
p.add_argument("--db-host", default="100.80.114.48")
p.add_argument("--db-port", type=int, default=5432)
p.add_argument("--db-user", default="breakpilot")
p.add_argument("--db-name", default="breakpilot_db")
p.add_argument("--db-password", required=True)
g = p.add_mutually_exclusive_group(required=True)
g.add_argument("--dry-run", action="store_true")
g.add_argument("--apply", action="store_true")
p.add_argument("--max-iterations", type=int, default=5,
help="Cap on inheritance iterations to avoid loops")
return p.parse_args()
# Stage 1: direct parent has license_rule — single UPDATE.
# Stage 2: iterative — parent did not have it, but a grandparent does.
# We loop until no more rows can be filled or max-iterations.
# Stage 3: residual rows with no resolvable parent. Report them clustered
# by category/pattern_id so the user can classify by family.
SQL_REPORT_NULLS = """
SET search_path TO compliance, public;
SELECT
CASE WHEN cc.parent_control_uuid IS NULL THEN 'no_parent'
WHEN p.license_rule IS NULL THEN 'parent_null'
ELSE 'parent_set' END AS bucket,
COUNT(*) AS n
FROM canonical_controls cc
LEFT JOIN canonical_controls p ON cc.parent_control_uuid = p.id
WHERE cc.license_rule IS NULL
GROUP BY 1 ORDER BY 2 DESC;
"""
SQL_INHERIT_FROM_PARENT = """
SET search_path TO compliance, public;
UPDATE canonical_controls cc
SET license_rule = p.license_rule, updated_at = NOW()
FROM canonical_controls p
WHERE cc.parent_control_uuid = p.id
AND cc.license_rule IS NULL
AND p.license_rule IS NOT NULL;
"""
SQL_REPORT_ORPHAN_CLUSTERS = """
SET search_path TO compliance, public;
SELECT
COALESCE(category, '(null)') AS category,
COALESCE(pattern_id, '(null)') AS pattern_id,
COALESCE(generation_strategy, '(null)') AS gen,
COUNT(*) AS n
FROM canonical_controls
WHERE license_rule IS NULL AND parent_control_uuid IS NULL
GROUP BY 1, 2, 3 ORDER BY n DESC LIMIT 25;
"""
def print_bucket(rows, label: str) -> None:
print(f"\n## {label}")
total = 0
for bucket, n in rows:
print(f" {bucket:12} {n:>8}")
total += n
print(f" {'TOTAL':12} {total:>8}")
def main() -> int:
args = parse_args()
try:
import psycopg2
except ImportError:
print("error: psycopg2 not installed", file=sys.stderr)
return 2
conn = psycopg2.connect(
host=args.db_host, port=args.db_port, user=args.db_user,
dbname=args.db_name, password=args.db_password,
)
conn.autocommit = False
cur = conn.cursor()
print("=" * 60)
print(" Backfill — license_rule via parent inheritance")
print(f" Mode: {'DRY-RUN' if args.dry_run else 'APPLY'}")
print("=" * 60)
# Initial bucket report
cur.execute(SQL_REPORT_NULLS)
rows = cur.fetchall()
print_bucket(rows, "Initial NULL distribution")
if args.dry_run:
# Print what the FIRST inherit pass would resolve (without writing)
cur.execute(
"SET search_path TO compliance, public; "
"SELECT p.license_rule, COUNT(*) "
"FROM canonical_controls cc "
"JOIN canonical_controls p ON cc.parent_control_uuid = p.id "
"WHERE cc.license_rule IS NULL AND p.license_rule IS NOT NULL "
"GROUP BY 1 ORDER BY 1;"
)
print("\n## First inherit-pass would fill:")
for rule, n in cur.fetchall():
print(f" rule={rule} {n:>8} rows")
# Show orphan clusters that would remain
cur.execute(SQL_REPORT_ORPHAN_CLUSTERS)
print("\n## Orphan clusters (no parent + no rule, top 25):")
for cat, pid, gen, n in cur.fetchall():
print(f" cat={cat[:20]:20} pat={pid[:20]:20} gen={gen[:20]:20} n={n}")
print("\nNo writes performed. Use --apply to execute.")
conn.rollback()
return 0
# Apply mode — iterative inheritance
total_updated = 0
for i in range(1, args.max_iterations + 1):
cur.execute(SQL_INHERIT_FROM_PARENT)
updated = cur.rowcount
total_updated += updated
print(f"\n iteration {i}: {updated} rows updated")
if updated == 0:
break
conn.commit()
print(f"\n✓ Total rows backfilled: {total_updated}")
# Final bucket report
cur.execute(SQL_REPORT_NULLS)
print_bucket(cur.fetchall(), "Remaining NULL distribution")
cur.execute(SQL_REPORT_ORPHAN_CLUSTERS)
rows = cur.fetchall()
if rows:
print("\n## Orphan clusters still need classification:")
for cat, pid, gen, n in rows:
print(f" cat={cat[:20]:20} pat={pid[:20]:20} gen={gen[:20]:20} n={n}")
return 0
if __name__ == "__main__":
raise SystemExit(main())
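The iterative inheritance can be reasoned about with a small in-memory model. This is a sketch under stated assumptions: controls are plain dict entries, and each pass reads parent values from before the pass started, which mirrors how a single SQL UPDATE sees the table state at statement start. The function name ``inherit_rules`` is hypothetical.

```python
from typing import Optional


def inherit_rules(
    rules: dict,
    parent: dict,
    max_iterations: int = 5,
) -> tuple:
    """Propagate license rules from parents to children, one generation per pass.

    rules:  control id -> license_rule (or None)
    parent: control id -> parent control id (or None for roots)
    Returns the filled mapping and the number of rows filled in each pass.
    """
    rules = dict(rules)  # work on a copy, leave the input untouched
    per_pass = []
    for _ in range(max_iterations):
        # Collect all updates first so a pass never sees its own writes.
        updates = {
            cid: rules[pid]
            for cid, pid in parent.items()
            if rules.get(cid) is None
            and pid is not None
            and rules.get(pid) is not None
        }
        per_pass.append(len(updates))
        if not updates:
            break
        rules.update(updates)
    return rules, per_pass


if __name__ == "__main__":
    rules = {"root": 1, "a": None, "b": None, "orphan": None, "c": None}
    parent = {"root": None, "a": "root", "b": "a", "orphan": None, "c": "orphan"}
    filled, per_pass = inherit_rules(rules, parent)
    print(filled, per_pass)
```

A chain of depth n needs n passes plus one empty pass to detect convergence, so the iteration cap bounds how deep a hierarchy can be resolved in one run. Controls under a root without a rule stay NULL, which corresponds to the orphan clusters the script reports.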
@@ -0,0 +1,203 @@
#!/usr/bin/env python3
"""Backfill ``license_rule`` payload field into Qdrant atomic_controls_dedup
and related compliance collections, sourced from canonical_controls in Postgres.
The audit (audit_license_classification.py) surfaced that Qdrant collections
holding canonical-control vectors (notably ``atomic_controls_dedup``) carry no
license_rule payload at all, even though the underlying Postgres table is now
fully classified. This script joins the two via ``control_uuid`` and patches the
Qdrant payload in batches.
Usage::
python3 scripts/backfill_qdrant_license_payload.py \\
--pg-host 100.80.114.48 --pg-password breakpilot123 \\
--qdrant http://100.80.114.48:6333 \\
--collection atomic_controls_dedup \\
--dry-run
# apply
python3 scripts/backfill_qdrant_license_payload.py ... --apply
Notes
-----
- ``control_uuid`` lives in the payload of atomic_controls_dedup. For other
collections that key the canonical control by a different field, override with
``--uuid-field``.
- Qdrant ``set_payload`` is keyed by point id, not payload field. We resolve
UUID → point id by a paginated scroll-and-filter pass, then issue grouped
set_payload requests per license_rule (one group per rule value, each sent
in chunks of ``--batch-size`` point ids).
"""
from __future__ import annotations
import argparse
import json
import sys
import time
from typing import Iterator
from urllib import request as urllib_request
def parse_args() -> argparse.Namespace:
p = argparse.ArgumentParser(description=__doc__)
p.add_argument("--pg-host", default="100.80.114.48")
p.add_argument("--pg-port", type=int, default=5432)
p.add_argument("--pg-user", default="breakpilot")
p.add_argument("--pg-name", default="breakpilot_db")
p.add_argument("--pg-password", required=True)
p.add_argument("--qdrant", default="http://100.80.114.48:6333")
p.add_argument("--qdrant-api-key", default="",
help="API key for managed Qdrant (Production)")
p.add_argument("--collection", default="atomic_controls_dedup")
p.add_argument("--uuid-field", default="control_uuid",
help="Payload field used for lookup (control_uuid or regulation_id)")
p.add_argument("--lookup", choices=["canonical_controls", "regulation_registry"],
default="canonical_controls",
help="Postgres table to resolve the lookup against")
p.add_argument("--batch-size", type=int, default=500)
g = p.add_mutually_exclusive_group(required=True)
g.add_argument("--dry-run", action="store_true")
g.add_argument("--apply", action="store_true")
return p.parse_args()
def fetch_rule_by_uuid(args) -> dict[str, int]:
"""Pull lookup-key → license_rule mapping from Postgres.
Source table is chosen by ``--lookup``:
- canonical_controls: id (UUID) → license_rule, for atomic_controls_dedup
- regulation_registry: regulation_id → license_rule, for document chunks
"""
import psycopg2
conn = psycopg2.connect(
host=args.pg_host, port=args.pg_port, user=args.pg_user,
dbname=args.pg_name, password=args.pg_password,
)
cur = conn.cursor()
cur.execute("SET search_path TO compliance, public;")
if args.lookup == "regulation_registry":
cur.execute(
"SELECT regulation_id, license_rule FROM regulation_registry "
"WHERE license_rule IS NOT NULL"
)
else:
cur.execute(
"SELECT id::text, license_rule FROM canonical_controls "
"WHERE license_rule IS NOT NULL"
)
mapping = {row[0]: int(row[1]) for row in cur.fetchall()}
conn.close()
return mapping
def _headers(api_key: str = "") -> dict:
h = {"Content-Type": "application/json"}
if api_key:
h["api-key"] = api_key
return h
def scroll_collection(qdrant: str, collection: str, uuid_field: str, api_key: str = "") -> Iterator[dict]:
"""Yield one dict per point with the keys ``id``, ``uuid`` and ``has_rule``."""
next_offset = None
while True:
body = {"limit": 1000, "with_payload": True, "with_vector": False}
if next_offset is not None:
body["offset"] = next_offset
req = urllib_request.Request(
f"{qdrant}/collections/{collection}/points/scroll",
data=json.dumps(body).encode(),
headers=_headers(api_key),
)
with urllib_request.urlopen(req, timeout=60) as r:
payload = json.loads(r.read())
result = payload.get("result", {})
for pt in result.get("points", []):
pl = pt.get("payload", {}) or {}
yield {
"id": pt["id"],
"uuid": pl.get(uuid_field),
"has_rule": "license_rule" in pl,
}
next_offset = result.get("next_page_offset")
if next_offset is None:
break
def set_payload_batch(qdrant: str, collection: str, point_ids: list, rule: int, api_key: str = "") -> int:
"""POST set_payload for a batch of point IDs with a single license_rule."""
body = {
"payload": {"license_rule": rule},
"points": point_ids,
}
req = urllib_request.Request(
f"{qdrant}/collections/{collection}/points/payload?wait=true",
data=json.dumps(body).encode(),
headers=_headers(api_key),
method="POST",
)
with urllib_request.urlopen(req, timeout=120) as r:
resp = json.loads(r.read())
if resp.get("status") != "ok":
raise RuntimeError(f"set_payload failed: {resp}")
return len(point_ids)
def main() -> int:
args = parse_args()
print(f"Loading {args.lookup} → license_rule mapping…")
rule_by_uuid = fetch_rule_by_uuid(args)
print(f" Postgres returned {len(rule_by_uuid)} classified rows")
print(f"Scrolling Qdrant collection {args.collection!r}")
by_rule: dict[int, list] = {1: [], 2: [], 3: []}
points_total = 0
points_with_uuid = 0
points_already_set = 0
points_no_match = 0
for pt in scroll_collection(args.qdrant, args.collection, args.uuid_field, args.qdrant_api_key):
points_total += 1
uuid = pt["uuid"]
if not uuid:
continue
points_with_uuid += 1
if pt["has_rule"]:
points_already_set += 1
continue
rule = rule_by_uuid.get(uuid)
if rule is None:
points_no_match += 1
continue
if rule not in by_rule:
continue
by_rule[rule].append(pt["id"])
print(f" total points scanned: {points_total}")
print(f" with {args.uuid_field}: {points_with_uuid}")
print(f" already had license_rule: {points_already_set}")
print(f" uuid not found in Postgres: {points_no_match}")
print(f" to set per rule: rule1={len(by_rule[1])} rule2={len(by_rule[2])} rule3={len(by_rule[3])}")
if args.dry_run:
print("\nDRY-RUN: no writes performed. Use --apply to execute.")
return 0
total_written = 0
for rule, ids in by_rule.items():
if not ids:
continue
print(f"\nWriting license_rule={rule} to {len(ids)} points (batch {args.batch_size})…")
for i in range(0, len(ids), args.batch_size):
chunk = ids[i:i + args.batch_size]
n = set_payload_batch(args.qdrant, args.collection, chunk, rule, args.qdrant_api_key)
total_written += n
print(f" batch {i // args.batch_size + 1}: {n} points (cumulative {total_written})")
time.sleep(0.05)
print(f"\nWrote license_rule on {total_written} Qdrant points in {args.collection}")
return 0
if __name__ == "__main__":
raise SystemExit(main())
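The grouping and chunking step between the scroll pass and the set_payload calls can be sketched as a generator that performs no I/O. The point dicts are assumed to have the shape yielded by scroll_collection (``id``, ``uuid``, ``has_rule``); the name ``plan_batches`` and its parameters are hypothetical.

```python
from typing import Iterable, Iterator


def plan_batches(
    points: Iterable,
    rule_by_uuid: dict,
    batch_size: int = 500,
    allowed_rules: tuple = (1, 2, 3),
) -> Iterator:
    """Yield (rule, point_ids) pairs, each holding at most batch_size ids."""
    by_rule = {rule: [] for rule in allowed_rules}
    for pt in points:
        # Skip points without a lookup key and points that already carry a rule.
        if not pt["uuid"] or pt["has_rule"]:
            continue
        rule = rule_by_uuid.get(pt["uuid"])
        # Skip keys unknown to Postgres and rule values outside the allowed set.
        if rule is None or rule not in by_rule:
            continue
        by_rule[rule].append(pt["id"])
    for rule, ids in by_rule.items():
        for start in range(0, len(ids), batch_size):
            yield rule, ids[start:start + batch_size]


if __name__ == "__main__":
    pts = [
        {"id": 1, "uuid": "a", "has_rule": False},
        {"id": 2, "uuid": "b", "has_rule": False},
        {"id": 3, "uuid": "c", "has_rule": True},
    ]
    for rule, ids in plan_batches(pts, {"a": 1, "b": 2, "c": 1}, batch_size=2):
        print(rule, ids)
```

Separating the plan from the HTTP calls means a dry run and an apply run can share the same code path, and the number of requests is known before the first write.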
@@ -0,0 +1,498 @@
#!/usr/bin/env python3
"""D6 Citation Backfill — update ~291k controls with section metadata from Qdrant chunks.
Archives old source_citation in generation_metadata.old_citation.
Updates source_citation.article, .paragraph, .page from matched Qdrant chunks.
3-tier matching:
Tier 1: sha256(source_original_text) → exact chunk text match
Tier 2: Parse [section] prefix from source_original_text
Tier 3: Best text overlap within same regulation_id
Usage:
python3 control-pipeline/scripts/d6_citation_backfill.py --dry-run --limit 100
python3 control-pipeline/scripts/d6_citation_backfill.py --batch-size 1000
"""
import argparse
import hashlib
import json
import logging
import os
import re
import time
from dataclasses import dataclass
from typing import Optional
import httpx
import psycopg2
import psycopg2.extras
logging.basicConfig(
level=logging.INFO,
format="%(asctime)s [%(levelname)s] %(message)s",
)
logger = logging.getLogger("d6-backfill")
DB_URL = os.getenv("DATABASE_URL", "postgresql://breakpilot:breakpilot@localhost:5432/breakpilot_db")
QDRANT_URL = os.getenv("QDRANT_URL", "http://localhost:6333")
COLLECTIONS = [
"bp_compliance_ce",
"bp_compliance_gesetze",
"bp_compliance_datenschutz",
"bp_dsfa_corpus",
"bp_legal_templates",
]
# Parse [§ 312k Title] or [AC-1 POLICY] prefix from chunk text
_SECTION_PREFIX_RE = re.compile(r'^\[([^\]]+)\]\s*')
@dataclass
class ChunkMeta:
section: str
section_title: str
paragraph: str
page: Optional[int]
regulation_id: str
@dataclass
class Stats:
total: int = 0
already_correct: int = 0
matched_hash: int = 0
matched_prefix: int = 0
matched_overlap: int = 0
unmatched: int = 0
updated: int = 0
errors: int = 0
# -------------------------------------------------------------------
# Phase 1: Build Qdrant index
# -------------------------------------------------------------------
def build_qdrant_index(qdrant_url: str) -> tuple[dict, dict]:
"""Build hash index and regulation index from all Qdrant collections.
Returns:
hash_index: {sha256(chunk_text) → ChunkMeta}
reg_index: {regulation_id → [(first 500 chars of chunk_text, ChunkMeta), ...]}
"""
hash_index: dict[str, ChunkMeta] = {}
reg_index: dict[str, list[tuple[str, ChunkMeta]]] = {}
total_chunks = 0
for coll in COLLECTIONS:
offset = None
coll_count = 0
with httpx.Client(timeout=60.0) as c:
while True:
body = {
"limit": 250,
"with_payload": [
"chunk_text", "section", "section_title",
"paragraph", "page", "regulation_id",
],
"with_vector": False,
}
if offset is not None:
body["offset"] = offset
resp = c.post(
f"{qdrant_url}/collections/{coll}/points/scroll",
json=body,
)
resp.raise_for_status()
data = resp.json()["result"]
for pt in data["points"]:
p = pt.get("payload", {})
chunk_text = p.get("chunk_text", "")
if not chunk_text or len(chunk_text.strip()) < 30:
continue
meta = ChunkMeta(
section=p.get("section", "") or "",
section_title=p.get("section_title", "") or "",
paragraph=p.get("paragraph", "") or "",
page=p.get("page"),
regulation_id=p.get("regulation_id", "") or "",
)
# Hash index
h = hashlib.sha256(chunk_text.encode()).hexdigest()
if meta.section: # only index chunks WITH section data
hash_index[h] = meta
# Regulation index (for text overlap matching)
if meta.regulation_id and meta.section:
reg_index.setdefault(meta.regulation_id, []).append(
(chunk_text[:500], meta)
)
coll_count += 1
offset = data.get("next_page_offset")
if offset is None:
break
total_chunks += coll_count
logger.info(" [%s] %d chunks indexed", coll, coll_count)
logger.info("Qdrant index: %d total chunks, %d with section (hash), %d regulations",
total_chunks, len(hash_index), len(reg_index))
return hash_index, reg_index
# -------------------------------------------------------------------
# Phase 2: Load controls
# -------------------------------------------------------------------
def load_controls(db_url: str, limit: int = 0) -> list[dict]:
"""Load all controls needing citation update."""
conn = psycopg2.connect(db_url)
conn.set_session(autocommit=False)
cur = conn.cursor(cursor_factory=psycopg2.extras.RealDictCursor)
cur.execute("SET search_path TO compliance, core, public")
query = """
SELECT id, control_id, source_citation, source_original_text,
generation_metadata, license_rule
FROM canonical_controls
WHERE license_rule IN (1, 2)
AND source_citation IS NOT NULL
ORDER BY control_id
"""
if limit > 0:
query += f" LIMIT {limit}"
cur.execute(query)
rows = cur.fetchall()
conn.close()
controls = []
for row in rows:
ctrl = dict(row)
ctrl["id"] = str(ctrl["id"])
for jf in ("source_citation", "generation_metadata"):
val = ctrl.get(jf)
if isinstance(val, str):
try:
ctrl[jf] = json.loads(val)
except (json.JSONDecodeError, TypeError):
ctrl[jf] = {}
elif val is None:
ctrl[jf] = {}
controls.append(ctrl)
return controls
# -------------------------------------------------------------------
# Phase 3: Matching
# -------------------------------------------------------------------
def match_control(
ctrl: dict,
hash_index: dict[str, ChunkMeta],
reg_index: dict[str, list[tuple[str, ChunkMeta]]],
) -> tuple[Optional[ChunkMeta], str]:
"""Match a control to a Qdrant chunk. Returns (meta, method) or (None, '')."""
source_text = ctrl.get("source_original_text", "") or ""
# Tier 1: Hash match
if source_text:
h = hashlib.sha256(source_text.encode()).hexdigest()
meta = hash_index.get(h)
if meta and meta.section:
return meta, "hash"
# Tier 2: Parse [section] prefix from source_original_text
if source_text:
m = _SECTION_PREFIX_RE.match(source_text)
if m:
prefix = m.group(1).strip()
parsed = _parse_section_from_prefix(prefix)
if parsed:
return parsed, "prefix"
# Tier 3: Text overlap within same regulation
gen_meta = ctrl.get("generation_metadata") or {}
reg_id = gen_meta.get("source_regulation", "")
if reg_id and source_text and reg_id in reg_index:
best = _find_best_overlap(source_text, reg_index[reg_id])
if best:
return best, "overlap"
return None, ""
def _parse_section_from_prefix(prefix: str) -> Optional[ChunkMeta]:
"""Parse a section prefix like '§ 312k Kuendigungsbutton' or 'AC-1 POLICY'."""
if not prefix:
return None
# § pattern
m = re.match(r'(§\s*\d+[a-z]*)\s*(.*)', prefix)
if m:
return ChunkMeta(
section=m.group(1).strip(),
section_title=m.group(2).strip(),
paragraph="", page=None, regulation_id="",
)
# Art./Artikel pattern
m = re.match(r'(Art(?:ikel|\.)\s*\d+)\s*(.*)', prefix, re.IGNORECASE)
if m:
return ChunkMeta(
section=m.group(1).strip(),
section_title=m.group(2).strip(),
paragraph="", page=None, regulation_id="",
)
# NIST control pattern (AC-1, AU-2, etc.)
m = re.match(r'([A-Z]{2,4}-\d+(?:\(\d+\))?)\s*(.*)', prefix)
if m:
return ChunkMeta(
section=m.group(1).strip(),
section_title=m.group(2).strip(),
paragraph="", page=None, regulation_id="",
)
# Numbered section (3.1 Title)
m = re.match(r'(\d+(?:\.\d+)+)\s*(.*)', prefix)
if m:
return ChunkMeta(
section=m.group(1).strip(),
section_title=m.group(2).strip(),
paragraph="", page=None, regulation_id="",
)
# ALL-CAPS heading (fallback — use as section_title)
if prefix == prefix.upper() and len(prefix) > 3:
return ChunkMeta(
section="", section_title=prefix,
paragraph="", page=None, regulation_id="",
)
return None
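The prefix tier can be tried in isolation with the sketch below. It assumes chunk texts start with a bracketed header such as ``[§ 312k Kuendigungsbutton]`` and returns a plain (section, title) tuple instead of a ChunkMeta. The ALL-CAPS fallback is left out, and the name ``split_section`` is hypothetical.

```python
import re
from typing import Optional

# Bracketed header at the start of a chunk, e.g. "[AC-1 POLICY] ..."
_PREFIX_RE = re.compile(r'^\[([^\]]+)\]\s*')

# Tried in order; the first pattern that matches wins.
_PATTERNS = [
    re.compile(r'(§\s*\d+[a-z]*)\s*(.*)'),                        # § 312k Title
    re.compile(r'(Art(?:ikel|\.)\s*\d+)\s*(.*)', re.IGNORECASE),  # Art. 37 / Artikel 37
    re.compile(r'([A-Z]{2,4}-\d+(?:\(\d+\))?)\s*(.*)'),           # NIST style: AC-1, AU-2(1)
    re.compile(r'(\d+(?:\.\d+)+)\s*(.*)'),                        # numbered: 3.1 Title
]


def split_section(chunk_text: str) -> Optional[tuple]:
    """Return (section, section_title) parsed from a bracketed prefix, or None."""
    m = _PREFIX_RE.match(chunk_text)
    if not m:
        return None
    prefix = m.group(1).strip()
    for pattern in _PATTERNS:
        hit = pattern.match(prefix)
        if hit:
            return hit.group(1).strip(), hit.group(2).strip()
    return None


if __name__ == "__main__":
    print(split_section("[§ 312k Kuendigungsbutton] Der Unternehmer hat ..."))
    print(split_section("[AC-1 POLICY] The organization develops ..."))
```

Pattern order matters here: the paragraph and article patterns run before the generic numbered-section pattern, so a header like ``§ 3`` is never mistaken for a bare number.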
def _find_best_overlap(source_text: str, chunks: list[tuple[str, ChunkMeta]]) -> Optional[ChunkMeta]:
"""Find chunk with best text overlap (simple word-set Jaccard)."""
source_words = set(source_text.lower().split())
if len(source_words) < 5:
return None
best_score = 0.0
best_meta = None
for chunk_text, meta in chunks:
chunk_words = set(chunk_text.lower().split())
if not chunk_words:
continue
intersection = len(source_words & chunk_words)
union = len(source_words | chunk_words)
jaccard = intersection / union if union > 0 else 0
if jaccard > best_score and jaccard > 0.3: # 30% threshold
best_score = jaccard
best_meta = meta
return best_meta
# -------------------------------------------------------------------
# Phase 4: Update controls
# -------------------------------------------------------------------
def update_controls(
db_url: str,
controls: list[dict],
hash_index: dict[str, ChunkMeta],
reg_index: dict[str, list[tuple[str, ChunkMeta]]],
dry_run: bool = True,
batch_size: int = 1000,
) -> Stats:
"""Match and update all controls."""
stats = Stats(total=len(controls))
conn = psycopg2.connect(db_url)
conn.set_session(autocommit=False)
cur = conn.cursor()
cur.execute("SET search_path TO compliance, core, public")
updates = []
for i, ctrl in enumerate(controls):
if i > 0 and i % 5000 == 0:
logger.info("Progress: %d/%d (hash=%d prefix=%d overlap=%d unmatched=%d)",
i, stats.total, stats.matched_hash, stats.matched_prefix,
stats.matched_overlap, stats.unmatched)
citation = ctrl.get("source_citation") or {}
old_article = citation.get("article", "")
gen_meta = ctrl.get("generation_metadata") or {}
# Match
meta, method = match_control(ctrl, hash_index, reg_index)
if not meta or not meta.section:
# No match — check if existing article is already good
if old_article:
stats.already_correct += 1
else:
stats.unmatched += 1
continue
# Check if update is needed
if old_article == meta.section:
stats.already_correct += 1
continue
# Track method
if method == "hash":
stats.matched_hash += 1
elif method == "prefix":
stats.matched_prefix += 1
elif method == "overlap":
stats.matched_overlap += 1
# Archive old citation
if old_article or citation.get("paragraph"):
gen_meta["old_citation"] = {
"article": old_article,
"paragraph": citation.get("paragraph", ""),
"page": citation.get("page"),
"archived_at": time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime()),
}
# Update citation
citation["article"] = meta.section
if meta.paragraph:
citation["paragraph"] = meta.paragraph
if meta.page is not None:
citation["page"] = meta.page
# Update generation_metadata
gen_meta["source_article"] = meta.section
if meta.paragraph:
gen_meta["source_paragraph"] = meta.paragraph
if meta.page is not None:
gen_meta["source_page"] = meta.page
gen_meta["backfill_method"] = method
gen_meta["backfill_at"] = time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime())
updates.append((
json.dumps(citation, ensure_ascii=False),
json.dumps(gen_meta, ensure_ascii=False, default=str),
ctrl["id"],
))
# Batch commit
if len(updates) >= batch_size and not dry_run:
_execute_batch(cur, updates)
conn.commit()
stats.updated += len(updates)
logger.info("Committed batch: %d updates (total %d)", len(updates), stats.updated)
updates = []
# Final batch
if updates and not dry_run:
_execute_batch(cur, updates)
conn.commit()
stats.updated += len(updates)
logger.info("Committed final batch: %d updates (total %d)", len(updates), stats.updated)
elif updates and dry_run:
stats.updated = len(updates) # would-be updates
conn.close()
return stats
def _execute_batch(cur, updates: list[tuple]):
"""Execute batch UPDATE statements."""
for citation_json, meta_json, ctrl_id in updates:
cur.execute(
"""UPDATE canonical_controls
SET source_citation = %s::jsonb,
generation_metadata = %s::jsonb,
updated_at = NOW()
WHERE id = %s::uuid""",
(citation_json, meta_json, ctrl_id),
)
# -------------------------------------------------------------------
# Main
# -------------------------------------------------------------------
def main():
parser = argparse.ArgumentParser(description="D6 Citation Backfill")
parser.add_argument("--dry-run", action="store_true", help="Don't write to DB")
parser.add_argument("--limit", type=int, default=0, help="Limit controls (0=all)")
parser.add_argument("--batch-size", type=int, default=1000)
parser.add_argument("--db-url", default=DB_URL)
parser.add_argument("--qdrant-url", default=QDRANT_URL)
args = parser.parse_args()
logger.info("=" * 60)
logger.info("D6 Citation Backfill")
logger.info(" DB: %s", args.db_url.split("@")[-1])
logger.info(" Qdrant: %s", args.qdrant_url)
logger.info(" Dry run: %s", args.dry_run)
logger.info(" Limit: %s", args.limit or "ALL")
logger.info("=" * 60)
# Phase 1: Build Qdrant index
logger.info("\nPhase 1: Building Qdrant index...")
t0 = time.time()
hash_index, reg_index = build_qdrant_index(args.qdrant_url)
logger.info("Index built in %.1fs", time.time() - t0)
# Phase 2: Load controls
logger.info("\nPhase 2: Loading controls...")
controls = load_controls(args.db_url, args.limit)
logger.info("Loaded %d controls", len(controls))
if not controls:
logger.info("No controls to process")
return
# Phase 3+4: Match and update
logger.info("\nPhase 3+4: Matching and updating...")
t0 = time.time()
stats = update_controls(
args.db_url, controls, hash_index, reg_index,
dry_run=args.dry_run, batch_size=args.batch_size,
)
elapsed = time.time() - t0
# Summary
logger.info("\n" + "=" * 60)
logger.info("RESULTS")
logger.info("=" * 60)
logger.info(" Total controls: %d", stats.total)
logger.info(" Already correct: %d (%.1f%%)", stats.already_correct,
stats.already_correct / max(stats.total, 1) * 100)
logger.info(" Matched (hash): %d (%.1f%%)", stats.matched_hash,
stats.matched_hash / max(stats.total, 1) * 100)
logger.info(" Matched (prefix): %d (%.1f%%)", stats.matched_prefix,
stats.matched_prefix / max(stats.total, 1) * 100)
logger.info(" Matched (overlap): %d (%.1f%%)", stats.matched_overlap,
stats.matched_overlap / max(stats.total, 1) * 100)
logger.info(" Unmatched: %d (%.1f%%)", stats.unmatched,
stats.unmatched / max(stats.total, 1) * 100)
logger.info(" Updated: %d", stats.updated)
logger.info(" Errors: %d", stats.errors)
logger.info(" Time: %.1fs (%.0f controls/sec)", elapsed,
stats.total / max(elapsed, 1))
if args.dry_run:
logger.info("\nDRY RUN — no changes written. Run without --dry-run to apply.")
if __name__ == "__main__":
main()
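The overlap tier relies on a word-set Jaccard score with a fixed threshold. A minimal standalone sketch follows, assuming the same 0.3 threshold and 5-word minimum as above; labels stand in for ChunkMeta objects, and the names ``jaccard`` and ``best_match`` are hypothetical.

```python
from typing import Optional


def jaccard(a: str, b: str) -> float:
    """Word-set Jaccard similarity: |A ∩ B| / |A ∪ B| on lowercased tokens."""
    words_a, words_b = set(a.lower().split()), set(b.lower().split())
    union = words_a | words_b
    return len(words_a & words_b) / len(union) if union else 0.0


def best_match(
    source: str,
    candidates: list,
    threshold: float = 0.3,
    min_words: int = 5,
) -> Optional[str]:
    """Return the label of the best-scoring candidate above threshold, else None.

    candidates is a list of (text, label) pairs.
    """
    if len(set(source.lower().split())) < min_words:
        return None  # too little text for a meaningful overlap score
    best_label, best_score = None, threshold
    for text, label in candidates:
        score = jaccard(source, text)
        if score > best_score:  # strictly better than the threshold and any earlier hit
            best_label, best_score = label, score
    return best_label


if __name__ == "__main__":
    source = "the controller shall maintain a record of processing activities"
    candidates = [
        ("the controller shall maintain a record", "Art. 30"),
        ("cookies require prior consent of the user", "§ 25"),
    ]
    print(best_match(source, candidates))
```

Because the score is computed on word sets, repeated words and word order have no influence. Truncating chunk texts shrinks the chunk's word set, which can shift the score in either direction depending on which words are cut.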
@@ -0,0 +1,310 @@
#!/usr/bin/env python3
"""
Derive doc_check_controls from existing Master Controls.
Filters MCs by document-relevant regulations, then uses Claude Haiku
to generate check_question + pass_criteria + fail_criteria per control.
Usage:
python3 /app/scripts/derive_doc_check_controls.py --dry-run
python3 /app/scripts/derive_doc_check_controls.py
"""
import argparse
import json
import logging
import os
import time
from pathlib import Path
import httpx
from sqlalchemy import create_engine, text
logging.basicConfig(
level=logging.INFO, format="%(asctime)s [%(levelname)s] %(message)s"
)
logger = logging.getLogger("doc-check-derive")
DB_URL = os.getenv(
"DATABASE_URL",
"postgresql://breakpilot:breakpilot123@postgres:5432/breakpilot_db",
)
ANTHROPIC_API_KEY = os.getenv("ANTHROPIC_API_KEY", "")
ANTHROPIC_URL = "https://api.anthropic.com/v1/messages"
# Document types and their regulation sources
DOC_TYPES = {
"dse": {
"name": "Datenschutzinformation",
"sources": ["DSGVO (EU) 2016/679"],
"articles": ["%13%", "%14%"],
"extra_tokens": ["personal_data%", "data_subject_rights%", "consent%",
"data_processing_register%", "data_transfer%"],
},
"cookie": {
"name": "Cookie-Richtlinie",
"sources": ["TDDDG", "ePrivacy-Richtlinie"],
"articles": ["%25%", "%5%"],
"extra_tokens": ["cookie_consent%", "consent%"],
},
"impressum": {
"name": "Impressum",
"sources": ["TMG"],
"articles": ["%5%"],
"extra_tokens": ["ecommerce%"],
},
"widerruf": {
"name": "Widerrufsbelehrung",
"sources": ["BGB"],
"articles": ["%355%", "%312%"],
"extra_tokens": ["consumer_protection%"],
},
"agb": {
"name": "AGB",
"sources": ["BGB"],
"articles": ["%305%", "%307%", "%308%", "%309%"],
"extra_tokens": ["consumer_protection%"],
},
"dsfa": {
"name": "Datenschutz-Folgenabschaetzung",
"sources": ["DSGVO (EU) 2016/679"],
"articles": ["%35%"],
"extra_tokens": ["dpia%"],
},
"avv": {
"name": "Auftragsverarbeitung",
"sources": ["DSGVO (EU) 2016/679"],
"articles": ["%28%"],
"extra_tokens": ["data_processing_agreement%"],
},
"loeschkonzept": {
"name": "Loeschkonzept",
"sources": ["DSGVO (EU) 2016/679"],
"articles": ["%5%", "%17%"],
"extra_tokens": ["data_retention%"],
},
}
SYSTEM_PROMPT = """Du erzeugst binäre Prüfkriterien für Compliance-Dokumente.
Für jeden Control erzeugst du:
1. check_question: Eine JA/NEIN Frage die ein LLM anhand eines Dokuments beantworten kann
2. pass_criteria: Konkrete Textinhalte die vorhanden sein MÜSSEN (3-5 Stück)
3. fail_criteria: Typische Fehler/Mängel (2-3 Stück)
4. severity: HIGH, MEDIUM oder LOW
REGELN:
- check_question muss BINÄR beantwortbar sein (nicht "wie gut")
- pass_criteria müssen KONKRET sein ("Name + Rechtsform + Anschrift", nicht "Angaben")
- fail_criteria müssen TYPISCHE Fehler beschreiben
- Alles auf Deutsch
Antworte als JSON-Array:
[{"id":"...","check_question":"...","pass_criteria":["..."],"fail_criteria":["..."],"severity":"HIGH"}]"""
def get_doc_controls(engine, doc_type: str, config: dict) -> list[dict]:
"""Get controls relevant for a document type."""
controls = []
# Strategy 1: By source + article
for source in config["sources"]:
for article in config["articles"]:
with engine.connect() as c:
rows = c.execute(text("""
SELECT cc.id, cc.control_id, cc.title,
COALESCE(cc.objective, '') as objective,
pc.source_citation->>'article' as article
FROM compliance.canonical_controls cc
JOIN compliance.canonical_controls pc ON pc.id = cc.parent_control_uuid
WHERE pc.source_citation->>'source' = :source
AND pc.source_citation->>'article' LIKE :article
AND cc.release_state NOT IN ('deprecated', 'rejected')
LIMIT 200
"""), {"source": source, "article": article}).fetchall()
for r in rows:
controls.append({
"uuid": str(r[0]), "control_id": r[1],
"title": r[2] or "", "objective": r[3] or "",
"article": r[4] or "", "doc_type": doc_type,
})
# Strategy 2: By MC canonical_name
for token_pattern in config.get("extra_tokens", []):
with engine.connect() as c:
rows = c.execute(text("""
SELECT cc.id, cc.control_id, cc.title,
COALESCE(cc.objective, '') as objective
FROM compliance.master_controls mc
JOIN compliance.master_control_members mcm ON mcm.master_control_uuid = mc.id
JOIN compliance.canonical_controls cc ON cc.id = mcm.control_uuid
WHERE mc.canonical_name LIKE :pattern
AND cc.release_state NOT IN ('deprecated', 'rejected')
LIMIT 100
"""), {"pattern": token_pattern}).fetchall()
for r in rows:
controls.append({
"uuid": str(r[0]), "control_id": r[1],
"title": r[2] or "", "objective": r[3] or "",
"article": "", "doc_type": doc_type,
})
# Deduplicate
seen = set()
unique = []
for c in controls:
if c["control_id"] not in seen:
seen.add(c["control_id"])
unique.append(c)
return unique
def enrich_with_llm(controls: list[dict], doc_type_name: str) -> list[dict]:
"""Add check_question, pass/fail_criteria via Haiku."""
enriched = []
batch_size = 5
for i in range(0, len(controls), batch_size):
batch = controls[i:i + batch_size]
items = [
f'- id="{c["control_id"]}" doc="{doc_type_name}" '
f't="{c["title"]}" o="{c["objective"][:100]}"'
for c in batch
]
prompt = (
f"Dokumenttyp: {doc_type_name}\n"
f"Erzeuge Prüfkriterien:\n" + "\n".join(items)
)
try:
resp = httpx.post(ANTHROPIC_URL, headers={
"x-api-key": ANTHROPIC_API_KEY,
"anthropic-version": "2023-06-01",
"content-type": "application/json",
}, json={
"model": "claude-haiku-4-5-20251001",
"max_tokens": 2000, "temperature": 0.1,
"system": SYSTEM_PROMPT,
"messages": [{"role": "user", "content": prompt}],
}, timeout=45.0)
resp.raise_for_status()
content = resp.json().get("content", [{}])[0].get("text", "")
start = content.find("[")
end = content.rfind("]") + 1
if start >= 0 and end > start:
results = json.loads(content[start:end])
result_map = {r.get("id", ""): r for r in results}
for ctrl in batch:
r = result_map.get(ctrl["control_id"], {})
if r.get("check_question"):
ctrl["check_question"] = r["check_question"]
ctrl["pass_criteria"] = r.get("pass_criteria", [])
ctrl["fail_criteria"] = r.get("fail_criteria", [])
ctrl["severity"] = r.get("severity", "MEDIUM")
enriched.append(ctrl)
except Exception as e:
logger.error("Batch %d failed: %s", i, e)
time.sleep(0.5)
return enriched
def main():
parser = argparse.ArgumentParser()
parser.add_argument("--dry-run", action="store_true")
parser.add_argument("--doc-type", choices=list(DOC_TYPES.keys()),
help="Only one doc type")
args = parser.parse_args()
engine = create_engine(
DB_URL, connect_args={"options": "-c search_path=compliance,public"}
)
# Create table
with engine.begin() as c:
c.execute(text("SET search_path TO compliance, public"))
c.execute(text("""
CREATE TABLE IF NOT EXISTS doc_check_controls (
id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
control_id VARCHAR(500) NOT NULL,
control_uuid UUID,
doc_type VARCHAR(50) NOT NULL,
title VARCHAR(500),
regulation VARCHAR(200),
article VARCHAR(100),
check_question TEXT NOT NULL,
pass_criteria JSONB DEFAULT '[]',
fail_criteria JSONB DEFAULT '[]',
severity VARCHAR(20) DEFAULT 'MEDIUM',
created_at TIMESTAMPTZ DEFAULT NOW()
)
"""))
c.execute(text("""
CREATE INDEX IF NOT EXISTS idx_doc_check_doc_type
ON doc_check_controls(doc_type)
"""))
doc_types = [args.doc_type] if args.doc_type else list(DOC_TYPES.keys())
all_checks = []
for dt in doc_types:
config = DOC_TYPES[dt]
logger.info("\n=== %s (%s) ===", dt, config["name"])
controls = get_doc_controls(engine, dt, config)
logger.info("Found %d relevant controls", len(controls))
if not controls:
continue
enriched = enrich_with_llm(controls, config["name"])
logger.info("Enriched %d with check criteria", len(enriched))
all_checks.extend(enriched)
logger.info("\nTotal: %d doc_check_controls across %d doc types",
len(all_checks), len(doc_types))
if args.dry_run:
for dc in all_checks[:5]:
logger.info(" [%s] %s: %s", dc["doc_type"], dc["control_id"],
dc.get("check_question", "?")[:80])
logger.info("DRY RUN — not writing")
return
# Write to DB
with engine.begin() as c:
c.execute(text("SET search_path TO compliance, public"))
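# Replace only the doc types processed in this run, so that --doc-type
# does not wipe rows belonging to other document types.
c.execute(
text("DELETE FROM doc_check_controls WHERE doc_type = ANY(:doc_types)"),
{"doc_types": doc_types},
)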
for dc in all_checks:
c.execute(text("""
INSERT INTO doc_check_controls
(control_id, control_uuid, doc_type, title,
check_question, pass_criteria, fail_criteria, severity)
VALUES (:cid, CAST(:uuid AS uuid), :doc_type, :title,
:question, CAST(:pass AS jsonb),
CAST(:fail AS jsonb), :severity)
"""), {
"cid": dc["control_id"],
"uuid": dc["uuid"],
"doc_type": dc["doc_type"],
"title": dc["title"],
"question": dc.get("check_question", ""),
"pass": json.dumps(dc.get("pass_criteria", [])),
"fail": json.dumps(dc.get("fail_criteria", [])),
"severity": dc.get("severity", "MEDIUM"),
})
logger.info("Wrote %d doc_check_controls to DB", len(all_checks))
# Save as JSON too
Path("/tmp/doc_check_controls.json").write_text(
json.dumps(all_checks, indent=2, ensure_ascii=False)
)
logger.info("Saved to /tmp/doc_check_controls.json")
if __name__ == "__main__":
main()
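`enrich_with_llm` recovers the JSON array from the model reply by slicing from the first `[` to the last `]`. The sketch below isolates that step so it can be tested without an API call. The helper name `extract_json_array`, the sample reply, and the control id `C-1` are assumptions made for illustration; unlike the script, the helper returns an empty list on malformed JSON instead of relying on the surrounding `except`.

```python
import json


def extract_json_array(content: str) -> list:
    """Return the outermost JSON array embedded in content, or [] if none parses."""
    start = content.find("[")
    end = content.rfind("]") + 1  # rfind returns -1 when absent, so end becomes 0
    if start < 0 or end <= start:
        return []
    try:
        parsed = json.loads(content[start:end])
    except json.JSONDecodeError:
        return []
    return parsed if isinstance(parsed, list) else []


if __name__ == "__main__":
    # Hypothetical model reply with prose wrapped around the array.
    reply = 'Hier sind die Kriterien:\n[{"id": "C-1", "check_question": "Ist X genannt?"}]\nFertig.'
    print(extract_json_array(reply))
```

Slicing between the outermost brackets tolerates prose before and after the array, but it will fail to parse if the reply contains a second, unrelated bracket after the array; in that case the helper returns `[]` and the batch would simply yield no enriched controls.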
@@ -0,0 +1,400 @@
#!/usr/bin/env python3
"""Clean-Room MC derivation from BSI QUAIDAL.
For each QUAIDAL entry in the parsed index, ask a local LLM to produce our own
wording for a Master Control / atomic control / mitigation / metric. Reject any
output whose 4-gram overlap with the BSI source text exceeds PLAGIARISM_LIMIT.
We never store the BSI prose; only our own derived wording plus structural
references (BSI section ID + URL + commit SHA).
Usage:
# Single entry, prints to stdout for review:
python3 control-pipeline/scripts/derive_quaidal_mcs.py --only QKB-01 --dry-run
# Full run, writes YAML:
python3 control-pipeline/scripts/derive_quaidal_mcs.py --ollama-host macmini
Output: control-pipeline/data/quaidal/{master_controls,atomic_controls,mitigations,metrics}.yaml
"""
from __future__ import annotations
import argparse
import json
import re
import sys
import time
from dataclasses import dataclass
from pathlib import Path
try:
import httpx
import yaml
except ImportError as e:
print(f"ERROR: missing dependency {e.name}. Install with: pip install httpx pyyaml", file=sys.stderr)
sys.exit(2)
REPO_ROOT = Path(__file__).resolve().parents[2]
SOURCE_ROOT = REPO_ROOT / "legal-sources" / "bsi-quaidal"
INDEX_FILE = REPO_ROOT / "control-pipeline" / "data" / "quaidal" / "quaidal_index.json"
OUTPUT_DIR = REPO_ROOT / "control-pipeline" / "data" / "quaidal"
PLAGIARISM_LIMIT = 0.20 # max share of 4-grams that may appear in BSI source
N_GRAM = 4
MAX_RETRIES = 3
DEFAULT_OLLAMA_URL = "http://macmini:11434"
OLLAMA_MODEL = "qwen3.5:35b-a3b"
QUAIDAL_REPO_URL = "https://github.com/BSI-Bund/QUAIDAL"
KIND_TO_PROMPT_ROLE = {
"criterion": "Master Control",
"building_block": "atomarer technischer Control",
"measure": "Schutzmaßnahme",
"metric": "messbarer Qualitäts-Indikator",
}
KIND_TO_OUTPUT_FILE = {
"criterion": "master_controls.yaml",
"building_block": "atomic_controls.yaml",
"measure": "mitigations.yaml",
"metric": "metrics.yaml",
}
# ---------------------------------------------------------------------------
# Source-side extraction (kept in memory, never written to disk)
# ---------------------------------------------------------------------------
FRONTMATTER_RE = re.compile(r"^---\s*\n.*?\n---\s*\n", re.DOTALL)
SECTION_RE = re.compile(r"^###?\s+(.+?)\s*$", re.MULTILINE)
def load_source_extract(rel_path: str) -> dict:
"""Load BSI source text for ONE entry. Used only for prompt + plagiarism check."""
path = SOURCE_ROOT / rel_path
text = path.read_text(encoding="utf-8")
# Strip frontmatter; capture shortdesc separately for the prompt.
fm_match = re.match(r"^---\s*\n(.*?)\n---\s*\n", text, re.DOTALL)
shortdesc = ""
if fm_match:
for line in fm_match.group(1).splitlines():
if line.lower().startswith("shortdesc:"):
shortdesc = line.split(":", 1)[1].strip()
break
body = FRONTMATTER_RE.sub("", text, count=1)
# Pull the first 1-2 paragraphs under "Beschreibung" (or whole body if none)
desc_match = re.search(r"###?\s+Beschreibung\s*\n+(.+?)(?:\n###?\s|\Z)", body, re.DOTALL)
description_excerpt = desc_match.group(1).strip() if desc_match else body[:1500].strip()
paragraphs = [p.strip() for p in description_excerpt.split("\n\n") if p.strip()]
description_excerpt = "\n\n".join(paragraphs[:2])
return {
"shortdesc": shortdesc,
"description_excerpt": description_excerpt,
"full_body": body,
}
# ---------------------------------------------------------------------------
# Plagiarism gate
# ---------------------------------------------------------------------------
WORD_RE = re.compile(r"\b[\wäöüÄÖÜß]+\b", re.UNICODE)
def _tokenize(text: str) -> list[str]:
return [w.lower() for w in WORD_RE.findall(text)]
def ngram_overlap(produced: str, source: str, n: int = N_GRAM) -> float:
"""Share of produced n-grams that also appear in source."""
p_tokens = _tokenize(produced)
s_tokens = _tokenize(source)
if len(p_tokens) < n:
return 0.0
s_grams = {tuple(s_tokens[i : i + n]) for i in range(len(s_tokens) - n + 1)}
if not s_grams:
return 0.0
p_grams = [tuple(p_tokens[i : i + n]) for i in range(len(p_tokens) - n + 1)]
hits = sum(1 for g in p_grams if g in s_grams)
return hits / len(p_grams)
# ---------------------------------------------------------------------------
# LLM prompt + call
# ---------------------------------------------------------------------------
PROMPT_TEMPLATE = """Du bist Compliance-Engineer bei BreakPilot. Schreibe eine eigenständige Anforderung im Stil einer technischen Kontroll-Spezifikation.
Quelle: BSI QUAIDAL Sektion {entry_id} ("{title_de}"). Die Quelle steht unter unklarer Lizenz (BSI-Veröffentlichung, § 5 UrhG anwendbar); wir dürfen die Idee aufgreifen, aber NICHT abschreiben.
Aufgabe: Formuliere eine eigenständige Anforderung im Stil eines {role}. Anforderungen:
- Eigene Formulierung in deutscher Sprache. Kein Satz darf aus der Quelle übernommen werden, auch nicht teilweise. Synonyme verwenden, Satzbau ändern, Inhalt strukturell anders aufbauen.
- 2-4 Sätze (max 80 Wörter).
- Sprachstil: nüchtern, technisch, normativ ("muss", "ist sicherzustellen", "ist zu prüfen").
- Bezug auf KI-Trainingsdaten oder KI-Datenqualität, je nach Quelle.
- Nicht die wörtlichen BSI-Beispiele kopieren.
Quellauszug (NUR zur Orientierung, NICHT abschreiben):
---
shortdesc: {shortdesc}
{description_excerpt}
---
Antwort: Liefere AUSSCHLIESSLICH die fertige Beschreibung als reinen Text: kein JSON, keine Überschriften, keine Anführungszeichen, keine Quellenangabe."""
def call_ollama(prompt: str, ollama_url: str, model: str, retries: int = 2) -> str:
last_err = None
for attempt in range(retries + 1):
try:
resp = httpx.post(
f"{ollama_url}/api/chat",
json={
"model": model,
"messages": [{"role": "user", "content": prompt}],
"stream": False,
"options": {"temperature": 0.4},
"think": False,
},
timeout=180.0,
)
resp.raise_for_status()
return resp.json()["message"]["content"].strip()
except (httpx.HTTPError, KeyError, ValueError) as e:
last_err = e
if attempt < retries:
time.sleep(2 ** attempt)
raise RuntimeError(f"Ollama call failed after {retries+1} attempts: {last_err}")
def strip_llm_artifacts(text: str) -> str:
    """Clean leading/trailing markdown and quotes from LLM output."""
    text = text.strip()
    # Strip surrounding code fences
    if text.startswith("```"):
        text = re.sub(r"^```[a-zA-Z]*\n?", "", text)
        text = re.sub(r"\n?```\s*$", "", text)
    # Strip surrounding straight and typographic quotes
    text = text.strip('"“”„')
    # Drop a leading "Beschreibung:" or similar label
    text = re.sub(r"^(Beschreibung|Description|Anforderung|Control):\s*", "", text, flags=re.IGNORECASE)
    return text.strip()
# ---------------------------------------------------------------------------
# Derivation
# ---------------------------------------------------------------------------
@dataclass
class DerivedControl:
derived_id: str
source_id: str
kind: str
canonical_name: str
description: str
plagiarism_score: float
related_quaidal_ids: list[str]
external_refs: list[dict]
source: dict
_ASCII_FOLD = str.maketrans({"ä": "ae", "ö": "oe", "ü": "ue", "Ä": "ae", "Ö": "oe", "Ü": "ue", "ß": "ss"})
def slug(text: str) -> str:
text = text.translate(_ASCII_FOLD).lower()
text = re.sub(r"[^a-z0-9]+", "-", text)
return text.strip("-")
def derived_id_for(entry: dict) -> str:
prefix = {
"criterion": "MC-AI-DATA",
"building_block": "AC-AI-DATA",
"measure": "MIT-AI-DATA",
"metric": "MET-AI-DATA",
}.get(entry["kind"], "X-AI-DATA")
title = entry["title_de"]
title = re.sub(r"^\s*(QKB|QB|MA|QM)-\d+[a-zA-Z]?\s*", "", title)
return f"{prefix}-{entry['id']}-{slug(title)[:40]}".rstrip("-")
def derive_one(entry: dict, source_extract: dict, ollama_url: str, model: str, *, verbose: bool = False) -> DerivedControl:
role = KIND_TO_PROMPT_ROLE.get(entry["kind"], "Control")
prompt = PROMPT_TEMPLATE.format(
entry_id=entry["id"],
title_de=entry["title_de"],
role=role,
shortdesc=source_extract["shortdesc"] or "(keiner)",
description_excerpt=source_extract["description_excerpt"] or "(keine Beschreibung)",
)
source_corpus = "\n\n".join(filter(None, [source_extract["shortdesc"], source_extract["description_excerpt"]]))
best: tuple[str, float] | None = None
for attempt in range(1, MAX_RETRIES + 1):
output = call_ollama(prompt, ollama_url, model)
output = strip_llm_artifacts(output)
score = ngram_overlap(output, source_corpus)
if verbose:
print(f" attempt {attempt}: overlap={score:.2%} len={len(output)}", file=sys.stderr)
if score < PLAGIARISM_LIMIT:
best = (output, score)
break
if best is None or score < best[1]:
best = (output, score)
# Strengthen the next prompt by appending a reject notice
prompt += f"\n\n(Vorheriger Versuch hatte {score:.0%} Wortdeckung mit der Quelle. Verwende völlig andere Begriffe und Satzstruktur.)"
if best is None:
raise RuntimeError(f"Could not derive {entry['id']}: no output")
output, score = best
if score >= PLAGIARISM_LIMIT:
raise RuntimeError(
f"Plagiarism gate failed for {entry['id']}: best overlap {score:.2%} >= limit {PLAGIARISM_LIMIT:.0%}.\n"
f"Output:\n{output}"
)
title_de_clean = re.sub(r"^\s*(QKB|QB|MA|QM)-\d+[a-zA-Z]?\s*", "", entry["title_de"]).strip()
return DerivedControl(
derived_id=derived_id_for(entry),
source_id=entry["id"],
kind=entry["kind"],
canonical_name=title_de_clean or entry["title_de"],
description=output,
plagiarism_score=round(score, 4),
related_quaidal_ids=entry["referenced_ids"],
external_refs=entry["external_refs"],
source={
"framework": "BSI QUAIDAL",
"section": entry["id"],
"title_original_de": entry["title_de"],
"url": f"{QUAIDAL_REPO_URL}/blob/main/{entry['source_path'].replace(' ', '%20')}",
"commit_sha": None, # filled in by main()
"license_note": "§ 5 UrhG anwendbar; share:true im Frontmatter; Clean-Room-Ableitung.",
},
)
# ---------------------------------------------------------------------------
# Output writers
# ---------------------------------------------------------------------------
def control_to_dict(c: DerivedControl) -> dict:
d = {
"id": c.derived_id,
"canonical_name": c.canonical_name,
"description": c.description,
"kind": c.kind,
"regulation_anchor": "EU AI Act Art. 10 (Datenqualität für Hochrisiko-KI)",
"related_quaidal_ids": c.related_quaidal_ids,
"external_refs": c.external_refs,
"source": c.source,
"plagiarism_score_at_generation": c.plagiarism_score,
}
return d
def write_yaml_per_kind(controls: list[DerivedControl], commit_sha: str | None) -> dict[str, Path]:
out: dict[str, list[dict]] = {}
for c in controls:
c.source["commit_sha"] = commit_sha
fname = KIND_TO_OUTPUT_FILE.get(c.kind, "other.yaml")
out.setdefault(fname, []).append(control_to_dict(c))
OUTPUT_DIR.mkdir(parents=True, exist_ok=True)
written: dict[str, Path] = {}
for fname, items in out.items():
path = OUTPUT_DIR / fname
payload = {
"source": "Derived from BSI QUAIDAL (Clean-Room)",
"source_url": QUAIDAL_REPO_URL,
"commit_sha": commit_sha,
"plagiarism_limit_4gram": PLAGIARISM_LIMIT,
"generated_by_model": OLLAMA_MODEL,
"controls": items,
}
path.write_text(yaml.safe_dump(payload, allow_unicode=True, sort_keys=False), encoding="utf-8")
written[fname] = path
return written
# ---------------------------------------------------------------------------
# CLI
# ---------------------------------------------------------------------------
def main() -> int:
ap = argparse.ArgumentParser(description=__doc__)
ap.add_argument("--only", help="Derive only this QUAIDAL ID (e.g. QKB-01)")
ap.add_argument("--kind", help="Derive only entries of this kind (criterion/building_block/measure/metric)")
ap.add_argument("--limit", type=int, help="Process at most N entries")
ap.add_argument("--dry-run", action="store_true", help="Print derived controls instead of writing YAML")
ap.add_argument("--ollama-host", default="macmini", help="Ollama host (default: macmini)")
ap.add_argument("--model", default=OLLAMA_MODEL)
ap.add_argument("--verbose", action="store_true")
args = ap.parse_args()
if not INDEX_FILE.exists():
print(f"ERROR: missing index. Run ingest_bsi_quaidal.py first ({INDEX_FILE})", file=sys.stderr)
return 2
index = json.loads(INDEX_FILE.read_text(encoding="utf-8"))
entries = index["entries"]
if args.only:
entries = [e for e in entries if e["id"].upper() == args.only.upper()]
if args.kind:
entries = [e for e in entries if e["kind"] == args.kind]
if args.limit:
entries = entries[: args.limit]
if not entries:
print("No entries match the filter.", file=sys.stderr)
return 1
ollama_url = args.ollama_host if "://" in args.ollama_host else f"http://{args.ollama_host}:11434"
print(f"Derivation: {len(entries)} entries, model={args.model}, ollama={ollama_url}, limit={PLAGIARISM_LIMIT:.0%}", file=sys.stderr)
derived: list[DerivedControl] = []
failed: list[tuple[str, str]] = []
for i, entry in enumerate(entries, 1):
if args.verbose:
print(f"[{i}/{len(entries)}] {entry['id']} ({entry['kind']}): {entry['title_de']}", file=sys.stderr)
try:
extract = load_source_extract(entry["source_path"])
ctrl = derive_one(entry, extract, ollama_url, args.model, verbose=args.verbose)
derived.append(ctrl)
except Exception as exc: # noqa: BLE001
failed.append((entry["id"], str(exc)))
print(f" FAILED {entry['id']}: {exc}", file=sys.stderr)
print(f"\nDerived: {len(derived)} | Failed: {len(failed)}", file=sys.stderr)
if args.dry_run:
for c in derived:
c.source["commit_sha"] = index.get("commit_sha")
print(yaml.safe_dump(control_to_dict(c), allow_unicode=True, sort_keys=False))
print("---")
return 0 if not failed else 1
written = write_yaml_per_kind(derived, index.get("commit_sha"))
for fname, path in written.items():
print(f"Wrote {path.relative_to(REPO_ROOT)} ({sum(1 for c in derived if KIND_TO_OUTPUT_FILE.get(c.kind, 'other.yaml') == fname)} entries)", file=sys.stderr)
if failed:
print("\nFailures:", file=sys.stderr)
for fid, msg in failed:
print(f" - {fid}: {msg.splitlines()[0]}", file=sys.stderr)
return 1
return 0
if __name__ == "__main__":
sys.exit(main())
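The plagiarism gate in this script is easiest to judge on a small worked case. The sketch below restates `ngram_overlap` in standalone form (with `\w` standing in for the script's explicit umlaut character class, which is equivalent for Python 3 strings) and runs it on three invented German sentences; none of them is BSI text.

```python
import re

WORD_RE = re.compile(r"\b\w+\b", re.UNICODE)


def ngram_overlap(produced: str, source: str, n: int = 4) -> float:
    """Share of the n-grams in `produced` that also occur in `source`."""
    p_tokens = [w.lower() for w in WORD_RE.findall(produced)]
    s_tokens = [w.lower() for w in WORD_RE.findall(source)]
    if len(p_tokens) < n:
        return 0.0
    s_grams = {tuple(s_tokens[i:i + n]) for i in range(len(s_tokens) - n + 1)}
    if not s_grams:
        return 0.0
    p_grams = [tuple(p_tokens[i:i + n]) for i in range(len(p_tokens) - n + 1)]
    return sum(1 for g in p_grams if g in s_grams) / len(p_grams)


if __name__ == "__main__":
    # Invented example sentences (assumption, for illustration only).
    source = "Die Daten müssen vollständig und korrekt sein"
    near_copy = "Die Daten müssen vollständig und fehlerfrei vorliegen"
    paraphrase = "Trainingsdatensätze sind auf Lücken und Fehler zu prüfen"
    print(ngram_overlap(near_copy, source))   # 2 of 4 four-grams are shared
    print(ngram_overlap(paraphrase, source))  # no shared four-gram
```

With the script's limit of 0.20, the near copy (0.5) would be rejected and retried, while the paraphrase (0.0) would pass. Note that any output shorter than four tokens scores 0.0 and therefore always passes the gate.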
@@ -0,0 +1,280 @@
#!/usr/bin/env python3
"""Extract large NIST PDFs locally, then upload as .txt to RAG service.
Workaround for embedding-service container crashing on large PDFs (>5 MB).
Runs pdfplumber + normalization locally, uploads extracted text as .txt.
Usage (on Mac Mini):
python3 control-pipeline/scripts/extract_and_upload_nist.py
"""
import json
import os
import re
import sys
import tempfile
import unicodedata
import httpx
import pdfplumber
RAG_URL = "https://localhost:8097"
QDRANT_URL = "http://localhost:6333"
DOCS = [
{
"object_name": "compliance/bund/compliance/2026/NIST_SP_800_53r5.pdf",
"collection": "bp_compliance_datenschutz",
"filename": "NIST_SP_800_53r5.txt",
"extra_metadata": {
"regulation_id": "nist_sp800_53r5",
"source_id": "nist",
"doc_type": "controls_catalog",
"guideline_name": "NIST SP 800-53 Rev. 5 Security and Privacy Controls",
"license": "public_domain_us_gov",
"attribution": "NIST",
"source": "nist.gov",
},
},
{
"object_name": "compliance/bund/compliance/2026/nist_sp_800_82r3.pdf",
"collection": "bp_compliance_ce",
"filename": "nist_sp_800_82r3.txt",
"extra_metadata": {
"regulation_id": "nist_sp_800_82r3",
"regulation_name_de": "NIST SP 800-82 Rev. 3 — Guide to OT Security",
"regulation_name_en": "NIST SP 800-82 Rev. 3 — Guide to OT Security",
"regulation_short": "NIST SP 800-82",
"category": "ot_security",
"license": "public_domain_us",
"source": "nist.gov",
},
},
{
"object_name": "compliance/bund/compliance/2026/nist_sp_800_160v1r1.pdf",
"collection": "bp_compliance_ce",
"filename": "nist_sp_800_160v1r1.txt",
"extra_metadata": {
"regulation_id": "nist_sp_800_160v1r1",
"regulation_name_de": "NIST SP 800-160 Vol. 1 Rev. 1",
"regulation_name_en": "NIST SP 800-160 Vol. 1 Rev. 1",
"regulation_short": "NIST SP 800-160",
"category": "security_engineering",
"license": "public_domain_us",
"source": "nist.gov",
},
},
{
"object_name": "compliance/bund/compliance/2026/NIST_SP_800_207.pdf",
"collection": "bp_compliance_datenschutz",
"filename": "NIST_SP_800_207.txt",
"extra_metadata": {
"regulation_id": "nist_sp800_207",
"source_id": "nist",
"doc_type": "architecture",
"guideline_name": "NIST SP 800-207 Zero Trust Architecture",
"license": "public_domain_us_gov",
"attribution": "NIST",
"source": "nist.gov",
},
},
]
def normalize_pdf_text(text: str) -> str:
"""Fix broken spacing from multi-column PDF extraction."""
text = unicodedata.normalize('NFKC', text)
text = text.replace('\u00ad', '').replace('\u200b', '')
prev = None
while prev != text:
prev = text
text = re.sub(r'(\d+)\s+\.\s+(\d+)', r'\1.\2', text)
text = re.sub(r'\b([A-Z]{2,4})\s+-\s+(\d+)\b', r'\1-\2', text)
text = re.sub(
r'\b([A-Z]{2})\s*\.\s*([A-Z]{2})\s*-\s*(\d{2})\b', r'\1.\2-\3', text
)
text = re.sub(r'\(\s+(\d+)\s+\)', r'(\1)', text)
text = re.sub(r'[^\S\n]{2,}', ' ', text)
return text
def extract_pdf_locally(pdf_bytes: bytes) -> str:
"""Extract text from PDF using pdfplumber with normalization."""
import io
text_parts = []
with pdfplumber.open(io.BytesIO(pdf_bytes)) as pdf:
print(f" Pages: {len(pdf.pages)}")
for i, page in enumerate(pdf.pages):
text = page.extract_text(x_tolerance=3, y_tolerance=4)
if text:
text_parts.append(text)
if (i + 1) % 50 == 0:
print(f" Extracted {i + 1}/{len(pdf.pages)} pages...")
raw = "\n\n".join(text_parts)
return normalize_pdf_text(raw)
def download_from_minio(object_name: str) -> bytes:
"""Download file from MinIO via RAG service."""
with httpx.Client(timeout=60.0, verify=False) as c:
resp = c.get(f"{RAG_URL}/api/v1/documents/download/{object_name}")
resp.raise_for_status()
url = resp.json()["url"]
with httpx.Client(timeout=300.0, verify=False) as c:
resp = c.get(url)
resp.raise_for_status()
return resp.content
def upload_text(
text: str, filename: str, collection: str, extra_metadata: dict,
) -> dict:
"""Upload extracted text to RAG service as .txt."""
form_data = {
"collection": collection,
"data_type": "compliance",
"bundesland": "bund",
"use_case": "compliance",
"year": "2026",
"chunk_strategy": "recursive",
"chunk_size": "1500",
"chunk_overlap": "100",
"metadata_json": json.dumps(extra_metadata, ensure_ascii=False),
}
text_bytes = text.encode("utf-8")
with httpx.Client(timeout=1800.0, verify=False) as c:
resp = c.post(
f"{RAG_URL}/api/v1/documents/upload",
files={"file": (filename, text_bytes, "text/plain")},
data=form_data,
)
resp.raise_for_status()
return resp.json()
def count_chunks(collection: str, regulation_id: str) -> int:
"""Count chunks for a regulation in Qdrant."""
with httpx.Client(timeout=30.0) as c:
resp = c.post(
f"{QDRANT_URL}/collections/{collection}/points/count",
json={
"filter": {
"must": [{
"key": "regulation_id",
"match": {"value": regulation_id},
}]
},
"exact": True,
},
)
resp.raise_for_status()
return resp.json()["result"]["count"]
def check_section_rate(collection: str, regulation_id: str) -> tuple:
"""Returns (total_chunks, chunks_with_section)."""
total = 0
with_sec = 0
offset = None
with httpx.Client(timeout=60.0) as c:
while True:
body = {
"filter": {
"must": [{
"key": "regulation_id",
"match": {"value": regulation_id},
}]
},
"limit": 100,
"with_payload": ["section"],
}
if offset is not None:
body["offset"] = offset
resp = c.post(
f"{QDRANT_URL}/collections/{collection}/points/scroll",
json=body,
)
resp.raise_for_status()
data = resp.json()["result"]
for pt in data["points"]:
total += 1
s = pt.get("payload", {}).get("section", "")
if s and s.strip():
with_sec += 1
offset = data.get("next_page_offset")
if offset is None:
break
return total, with_sec
def main():
print("=" * 60)
print("NIST PDF Local Extraction + Upload")
print("=" * 60)
results = []
for i, doc in enumerate(DOCS, 1):
reg_id = doc["extra_metadata"]["regulation_id"]
print(f"\n[{i}/{len(DOCS)}] {doc['filename']} -> {doc['collection']}")
# 1. Check current state
existing = count_chunks(doc["collection"], reg_id)
print(f" Existing chunks: {existing}")
# 2. Download PDF from MinIO
print(" Downloading from MinIO...")
pdf_bytes = download_from_minio(doc["object_name"])
print(f" Downloaded {len(pdf_bytes) / 1024 / 1024:.1f} MB")
# 3. Extract text locally with pdfplumber
print(" Extracting text locally...")
text = extract_pdf_locally(pdf_bytes)
print(f" Extracted {len(text):,} chars, {text.count(chr(10)):,} lines")
# 4. Save extracted text temporarily (for debugging)
tmp_path = f"/tmp/nist_{reg_id}.txt"
with open(tmp_path, "w", encoding="utf-8") as f:
f.write(text)
print(f" Saved to {tmp_path}")
# 5. Upload as .txt
print(" Uploading as .txt to RAG service...")
result = upload_text(text, doc["filename"], doc["collection"],
doc["extra_metadata"])
new_chunks = result.get("chunks_count", 0)
new_doc_id = result.get("document_id", "")
print(f" Uploaded: {new_chunks} chunks (doc_id={new_doc_id})")
# 6. Check section rate
if new_chunks > 0:
total, with_sec = check_section_rate(doc["collection"], reg_id)
pct = (with_sec / total * 100) if total > 0 else 0
print(f" Section rate: {with_sec}/{total} ({pct:.0f}%)")
else:
pct = 0
print(" WARNING: 0 chunks created!")
results.append({
"file": doc["filename"],
"old": existing,
"new": new_chunks,
"section_rate": round(pct, 1),
})
# Summary
print("\n" + "=" * 60)
print("RESULTS")
print("=" * 60)
for r in results:
print(f" {r['file']:<40} old={r['old']} new={r['new']} sect={r['section_rate']}%")
total_new = sum(r["new"] for r in results)
print(f"\nTotal new chunks: {total_new}")
if any(r["new"] == 0 for r in results):
print("\nWARNING: Some documents produced 0 chunks!")
sys.exit(1)
if __name__ == "__main__":
main()
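What `normalize_pdf_text` repairs is easier to see on concrete strings. The sketch below restates the function with its indentation restored, on the assumption that all five substitutions sit inside the fixed-point loop (the flattened listing does not show this, and the result is the same either way for the samples used). The sample inputs are invented, not taken from a NIST PDF.

```python
import re
import unicodedata


def normalize_pdf_text(text: str) -> str:
    """Fix broken spacing from multi-column PDF extraction."""
    text = unicodedata.normalize("NFKC", text)               # fold ligatures and width variants
    text = text.replace("\u00ad", "").replace("\u200b", "")  # soft hyphen, zero-width space
    prev = None
    while prev != text:  # repeat until no rule changes the text any more
        prev = text
        text = re.sub(r"(\d+)\s+\.\s+(\d+)", r"\1.\2", text)              # "3 . 1"      -> "3.1"
        text = re.sub(r"\b([A-Z]{2,4})\s+-\s+(\d+)\b", r"\1-\2", text)    # "AC - 2"     -> "AC-2"
        text = re.sub(
            r"\b([A-Z]{2})\s*\.\s*([A-Z]{2})\s*-\s*(\d{2})\b", r"\1.\2-\3", text
        )                                                                 # "PR . AC-01" -> "PR.AC-01"
        text = re.sub(r"\(\s+(\d+)\s+\)", r"(\1)", text)                  # "( 1 )"      -> "(1)"
        text = re.sub(r"[^\S\n]{2,}", " ", text)                          # collapse runs of spaces
    return text


if __name__ == "__main__":
    print(normalize_pdf_text("AC - 2 ( 1 )  Section 3 . 1"))
    print(normalize_pdf_text("PR . AC - 01"))
```

The loop matters for identifiers such as `PR . AC - 01`: the second rule first joins `AC-01`, and only then can the third rule attach the `PR.` prefix.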
@@ -0,0 +1,214 @@
#!/usr/bin/env python3
"""
Extract CE-relevant obligations from TRBS/TRGS/ASR/OSHA chunks in Qdrant.
Searches for MUSS/SOLL patterns in chunk texts and classifies them.
Output: JSON file with structured obligations for the CE session.
Usage:
python3 /app/scripts/extract_ce_obligations.py
python3 /app/scripts/extract_ce_obligations.py --output /tmp/ce_obligations.json
"""
import argparse
import json
import logging
import os
import re
from pathlib import Path
import httpx
logging.basicConfig(
level=logging.INFO, format="%(asctime)s [%(levelname)s] %(message)s"
)
logger = logging.getLogger("ce-obligations")
QDRANT_URL = os.getenv("QDRANT_URL", "http://qdrant:6333")
COLLECTION = "bp_compliance_ce"
OLLAMA_URL = os.getenv("OLLAMA_URL", "http://host.docker.internal:11434")
LLM_MODEL = "qwen3.5:35b-a3b"
# Obligation patterns (DE + EN)
OBLIGATION_PATTERNS = re.compile(
r"(muss|müssen|hat\s+[\w\s]*zu\s|ist\s+[\w\s]*sicherzustellen|"
r"ist\s+verpflichtet|sind\s+verpflichtet|darf\s+nicht|"
r"shall|must|required\s+to|is\s+required|shall\s+not)",
re.IGNORECASE,
)
# CE relevance keywords
CE_KEYWORDS = re.compile(
r"(maschine|schutzeinrichtung|gefährdung|quetsch|scher|stoß|"
r"schneid|fang|einzug|absturz|druck|explosion|brand|"
r"elektrisch|spannung|erdung|schutzleiter|not-halt|"
r"betriebsanleitung|kennzeichnung|prüfung|prüfpflicht|"
r"instandhaltung|wartung|sicherheitsabstand|"
r"schutzmaßnahme|persönliche schutzausrüstung|psa|"
r"machine|guard|hazard|crush|shear|cut|entangle|"
r"lockout|tagout|electrical|grounding|emergency stop|"
r"safety distance|protective device|ppe|inspection)",
re.IGNORECASE,
)
HAZARD_CATEGORIES = {
"quetsch|crush|squeeze": "mechanical_crushing",
"schneid|cut": "mechanical_cutting",
"fang|einzug|entangle|draw": "mechanical_entanglement",
"absturz|fall": "fall_hazard",
"explosion|ex-bereich|atex": "explosion_hazard",
"brand|fire|feuer": "fire_hazard",
"elektrisch|electrical|spannung|voltage": "electrical_hazard",
"lärm|noise|schall": "noise_hazard",
"gefahrstoff|hazardous substance|chemical": "chemical_hazard",
"ergonomie|ergonomic|heben|lift": "ergonomic_hazard",
"temperatur|heat|hitze|kälte|cold": "thermal_hazard",
"strahlung|radiation|laser": "radiation_hazard",
"not-halt|emergency stop|e-stop": "emergency_stop",
"lockout|tagout|loto": "lockout_tagout",
"kennzeichnung|label|marking|sign": "safety_marking",
"prüfung|inspection|test": "inspection_requirement",
"instandhaltung|maintenance|wartung": "maintenance",
"schutzeinrichtung|guard|protective device": "protective_device",
"betriebsanleitung|instruction|manual": "operating_instructions",
"druck|pressure|behälter|vessel|kessel|boiler": "pressure_hazard",
}
# Source-based overrides: TRGS docs about chemicals/storage
# should never be classified as mechanical hazards
_CHEMICAL_SOURCES = re.compile(
r"trgs\s*(5[0-9]{2}|7[0-9]{2}|9[0-9]{2}|4[0-9]{2}|6[0-9]{2})",
re.IGNORECASE,
)
def _classify_hazard(text: str, source: str) -> str:
"""Classify hazard with source-aware overrides."""
# TRGS sources → chemical/pressure/explosion, never mechanical
if _CHEMICAL_SOURCES.search(source):
if re.search(r"explosion|ex-bereich|atex|zündfähig", text, re.IGNORECASE):
return "explosion_hazard"
if re.search(r"druck|pressure|behälter|vessel", text, re.IGNORECASE):
return "pressure_hazard"
if re.search(r"brand|fire|feuer", text, re.IGNORECASE):
return "fire_hazard"
return "chemical_hazard"
# Standard pattern matching (order matters — specific first)
for pattern, category in HAZARD_CATEGORIES.items():
if re.search(pattern, text, re.IGNORECASE):
return category
return "general"
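The first-match behaviour of `_classify_hazard` can be tried in isolation. This is a reduced stand-in, not the script's code: `classify`, `CATEGORIES`, and `CHEMICAL_SOURCES` are hypothetical names and only three categories are reproduced. It illustrates that dictionary order decides which category wins and that a TRGS source bypasses the mechanical categories.

```python
import re

# Reduced, hypothetical subset of HAZARD_CATEGORIES; dict order is match priority.
CATEGORIES = {
    "quetsch|crush|squeeze": "mechanical_crushing",
    "brand|fire|feuer": "fire_hazard",
    "prüfung|inspection|test": "inspection_requirement",
}
# Same TRGS ranges as above (4xx, 5xx, 6xx, 7xx, 9xx), written as one class.
CHEMICAL_SOURCES = re.compile(r"trgs\s*[4-79][0-9]{2}", re.IGNORECASE)


def classify(text: str, source: str) -> str:
    """Return the first matching category, with a TRGS source override."""
    if CHEMICAL_SOURCES.search(source):
        # Chemical-storage sources never map to mechanical categories.
        if re.search(r"brand|fire|feuer", text, re.IGNORECASE):
            return "fire_hazard"
        return "chemical_hazard"
    for pattern, category in CATEGORIES.items():
        if re.search(pattern, text, re.IGNORECASE):
            return category
    return "general"
```

Note that the patterns match substrings: in this sketch `classify("Use the latest revision", "OSHA 1910")` yields `inspection_requirement` because `test` occurs inside `latest`. Short keywords such as `test`, `fall`, or `sign` may therefore over-match.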
def scroll_chunks(source_filter: str = None) -> list[dict]:
    """Scroll through Qdrant to get all relevant chunks."""
    chunks = []
    offset = None
    batch = 100
    scanned = 0
    while True:
        scroll_body = {
            "limit": batch,
            "with_payload": True,
            "with_vector": False,
        }
        if offset is not None:
            scroll_body["offset"] = offset
        resp = httpx.post(
            f"{QDRANT_URL}/collections/{COLLECTION}/points/scroll",
            json=scroll_body,
            timeout=30.0,
        )
        # Fail loudly on HTTP errors instead of treating them as an empty page
        resp.raise_for_status()
        data = resp.json()
        points = data.get("result", {}).get("points", [])
        next_offset = data.get("result", {}).get("next_page_offset")
        scanned += len(points)
        for pt in points:
            payload = pt.get("payload", {})
            source = payload.get("source", payload.get("filename", ""))
            text = payload.get("chunk_text", "")
            # Filter for TRBS/TRGS/ASR/OSHA
            source_lower = source.lower()
            is_relevant = any(k in source_lower for k in
                              ["trbs", "trgs", "asr", "osha"])
            if not is_relevant:
                continue
            # Check for obligation patterns
            if not OBLIGATION_PATTERNS.search(text):
                continue
            # Check CE relevance
            if not CE_KEYWORDS.search(text):
                continue
            # Classify hazard category (source-aware)
            hazard = _classify_hazard(text, source)
            # Determine obligation type
            if re.search(r"muss|müssen|shall|must|required", text, re.IGNORECASE):
                obl_type = "MUSS"
            elif re.search(r"soll|sollte|should", text, re.IGNORECASE):
                obl_type = "SOLL"
            else:
                obl_type = "MUSS"
            chunks.append({
                "source": source,
                "section": payload.get("section", ""),
                "paragraph": payload.get("paragraph", ""),
                "obligation_text": text.strip()[:500],
                "hazard_category": hazard,
                "obligation_type": obl_type,
                "ce_relevance": "high" if hazard != "general" else "medium",
                "filename": payload.get("filename", ""),
            })
        if next_offset is None or not points:
            break
        offset = next_offset
        # Report progress by scanned points; the number of hits rarely lands on a multiple of 500
        if scanned % 500 == 0:
            logger.info("  Scanned %d points, %d obligations found so far", scanned, len(chunks))
    return chunks
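The paging logic in `scroll_chunks` can be separated from the HTTP call, which makes it testable without a running Qdrant. The sketch below assumes the same response shape (`result.points`, `result.next_page_offset`); `iter_points` and the in-memory `PAGES` dict are hypothetical stand-ins.

```python
from typing import Callable, Iterator, Optional


def iter_points(fetch_page: Callable[[Optional[int]], dict]) -> Iterator[dict]:
    """Yield points page by page until the server reports no further offset."""
    offset = None
    while True:
        result = fetch_page(offset).get("result", {})
        points = result.get("points", [])
        yield from points
        offset = result.get("next_page_offset")
        if offset is None or not points:
            break


# Hypothetical in-memory stand-in for the HTTP call, two pages long.
PAGES = {
    None: {"result": {"points": [{"id": 1}, {"id": 2}], "next_page_offset": 3}},
    3: {"result": {"points": [{"id": 3}], "next_page_offset": None}},
}

ids = [p["id"] for p in iter_points(lambda off: PAGES[off])]
```

In the real script the `fetch_page` argument would wrap the `httpx.post` call; the filtering and classification could then consume the iterator.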
def main():
parser = argparse.ArgumentParser()
parser.add_argument("--output", default="/tmp/ce_obligations.json")
args = parser.parse_args()
logger.info("Scanning %s for CE obligations...", COLLECTION)
obligations = scroll_chunks()
logger.info("Found %d CE-relevant obligations", len(obligations))
# Stats
by_source = {}
by_hazard = {}
for o in obligations:
src = o["source"][:30]
by_source[src] = by_source.get(src, 0) + 1
by_hazard[o["hazard_category"]] = by_hazard.get(o["hazard_category"], 0) + 1
logger.info("\nBy source:")
for src, cnt in sorted(by_source.items(), key=lambda x: -x[1])[:20]:
logger.info(" %4d %s", cnt, src)
logger.info("\nBy hazard category:")
for cat, cnt in sorted(by_hazard.items(), key=lambda x: -x[1]):
logger.info(" %4d %s", cnt, cat)
# Save
Path(args.output).write_text(
json.dumps(obligations, indent=2, ensure_ascii=False)
)
logger.info("\nSaved to %s", args.output)
if __name__ == "__main__":
main()
@@ -0,0 +1,247 @@
#!/usr/bin/env python3
"""
F1 Migration: Populate regulation_registry from hardcoded Python dicts.
Sources:
- REGULATION_LICENSE_MAP (control_generator.py) 135 entries keyed by regulation_id
- SOURCE_REGULATION_CLASSIFICATION (source_type_classification.py) 58 entries keyed by name
Usage:
# Dry run (prints SQL, no DB write):
python3 scripts/f1_migrate_regulation_registry.py --dry-run
# Against Mac Mini:
python3 scripts/f1_migrate_regulation_registry.py --db-host macmini
# Against local Docker:
python3 scripts/f1_migrate_regulation_registry.py --db-host localhost
"""
import argparse
import sys
from pathlib import Path
# Add parent so we can import from services/data
sys.path.insert(0, str(Path(__file__).resolve().parent.parent))
from services.control_generator import REGULATION_LICENSE_MAP, _RULE2_PREFIXES, _RULE3_PREFIXES # noqa: E402
from data.source_type_classification import SOURCE_REGULATION_CLASSIFICATION # noqa: E402
# Derive jurisdiction from license_type
_LICENSE_TO_JURISDICTION = {
"EU_LAW": "EU",
"EU_PUBLIC": "EU",
"DE_LAW": "DE",
"DE_PUBLIC": "DE",
"AT_LAW": "AT",
"CH_LAW": "CH",
"FR_LAW": "FR",
"ES_LAW": "ES",
"NL_LAW": "NL",
"IT_LAW": "IT",
"HU_LAW": "HU",
"NIST_PUBLIC_DOMAIN": "US",
"US_GOV_PUBLIC": "US",
"CC-BY-SA-4.0": "INT",
"CC-BY-4.0": "INT",
"OECD_PUBLIC": "INT",
}
def _derive_jurisdiction(license_type: str) -> str:
"""Map license_type to jurisdiction code."""
return _LICENSE_TO_JURISDICTION.get(license_type, "INT")
def build_rows() -> list[dict]:
"""Merge REGULATION_LICENSE_MAP + SOURCE_REGULATION_CLASSIFICATION into rows."""
rows = []
# Track names we've seen (for dedup against SOURCE_REGULATION_CLASSIFICATION)
seen_names: set[str] = set()
# 1) Primary source: REGULATION_LICENSE_MAP (has regulation_id as key)
for reg_id, info in REGULATION_LICENSE_MAP.items():
name = info.get("name", reg_id)
seen_names.add(name)
rows.append({
"regulation_id": reg_id.lower().strip(),
"regulation_name_de": name,
"license_rule": info["rule"],
"license_type": info.get("license", ""),
"attribution": info.get("attribution"),
"source_type": info.get("source_type", "law"),
"jurisdiction": _derive_jurisdiction(info.get("license", "")),
"status": "active",
})
# 2) Secondary: SOURCE_REGULATION_CLASSIFICATION entries not already covered
# These are keyed by name, not by regulation_id. We create synthetic IDs.
for name, source_type in SOURCE_REGULATION_CLASSIFICATION.items():
if name in seen_names:
continue
# Generate a regulation_id from the name
synthetic_id = (
name.lower()
.replace(" ", "_")
.replace("(", "")
.replace(")", "")
.replace("/", "_")
.replace("-", "_")
.replace(".", "")
.replace(",", "")
.replace("ä", "ae")
.replace("ö", "oe")
.replace("ü", "ue")
.replace("á", "a")
.replace("é", "e")
.replace("ó", "o")
.strip("_")
)[:100]
# Guess jurisdiction from name content
jurisdiction = "INT"
name_lower = name.lower()
if any(x in name_lower for x in ["edpb", "edps", "(eu)", "eu ", "wp2"]):
jurisdiction = "EU"
elif any(x in name_lower for x in ["bsi", "bdsg", "bundes", "gwg"]):
jurisdiction = "DE"
elif "nist" in name_lower or "cisa" in name_lower:
jurisdiction = "US"
elif "österreich" in name_lower:
jurisdiction = "AT"
elif "schweiz" in name_lower:
jurisdiction = "CH"
elif "spanien" in name_lower:
jurisdiction = "ES"
elif "frankreich" in name_lower:
jurisdiction = "FR"
elif "ungarn" in name_lower:
jurisdiction = "HU"
# Map source_type_classification's "framework" to our "standard"
# (source_type_classification uses law/guideline/framework)
mapped_source_type = source_type
if source_type == "framework":
mapped_source_type = "standard"
rows.append({
"regulation_id": synthetic_id,
"regulation_name_de": name,
"license_rule": 1, # default: conservative
"license_type": "",
"attribution": None,
"source_type": mapped_source_type,
"jurisdiction": jurisdiction,
"status": "needs_review", # needs manual review since we guessed
})
return rows
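The chain of `.replace()` calls that builds `synthetic_id` can also be written as a single translation table. This is a sketch of an equivalent formulation, assuming the same substitutions; `synthetic_id` as a function and `_SLUG_TABLE` are hypothetical names.

```python
# Same substitutions as the chained .replace() calls, as one translation table.
_SLUG_TABLE = str.maketrans({
    " ": "_", "/": "_", "-": "_",
    "(": None, ")": None, ".": None, ",": None,
    "ä": "ae", "ö": "oe", "ü": "ue",
    "á": "a", "é": "e", "ó": "o",
})


def synthetic_id(name: str, max_len: int = 100) -> str:
    """Lowercase the name, apply the table, trim underscores, cap the length."""
    return name.lower().translate(_SLUG_TABLE).strip("_")[:max_len]


example = synthetic_id("EDPB Leitlinien 01/2020 (Fahrzeuge)")
```

Different names can collapse to the same ID (here `"A-B"` and `"A B"` both become `a_b`). Since the insert uses `ON CONFLICT (regulation_id) DO UPDATE`, such a collision would overwrite the earlier row, so checking the generated IDs for duplicates in the dry run may be worthwhile.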
def generate_sql(rows: list[dict]) -> str:
    """Generate INSERT SQL for all rows."""
    lines = [
        "SET search_path TO compliance, public;",
        "",
        "-- Auto-generated by f1_migrate_regulation_registry.py",
        f"-- {len(rows)} rows total",
        "",
    ]
    for row in rows:
        # Escape the attribution as well; a single quote in it would otherwise break the literal
        attr = f"'{_escape_sql(row['attribution'])}'" if row["attribution"] else "NULL"
        lines.append(
            f"INSERT INTO regulation_registry "
            f"(regulation_id, regulation_name_de, license_rule, license_type, "
            f"attribution, source_type, jurisdiction, status) "
            f"VALUES ("
            f"'{_escape_sql(row['regulation_id'])}', "
            f"'{_escape_sql(row['regulation_name_de'])}', "
            f"{row['license_rule']}, "
            f"'{row['license_type']}', "
            f"{attr}, "
            f"'{row['source_type']}', "
            f"'{row['jurisdiction']}', "
            f"'{row['status']}'"
            f") ON CONFLICT (regulation_id) DO UPDATE SET "
            f"regulation_name_de = EXCLUDED.regulation_name_de, "
            f"license_rule = EXCLUDED.license_rule, "
            f"license_type = EXCLUDED.license_type, "
            f"attribution = EXCLUDED.attribution, "
            f"source_type = EXCLUDED.source_type, "
            f"jurisdiction = EXCLUDED.jurisdiction;"
        )
    return "\n".join(lines)


def _escape_sql(val: str) -> str:
    """Escape single quotes for SQL."""
    return val.replace("'", "''")
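For the dry-run output, quoting and the NULL case can be handled in one place. The helper below is a hypothetical sketch (`sql_literal` is not part of the script) and is only meant for printed SQL; the actual database path above already uses bound parameters, which is the safer route.

```python
from typing import Optional


def sql_literal(value: Optional[str]) -> str:
    """Render a string as a SQL string literal, or NULL for None/empty."""
    if not value:
        return "NULL"
    # Double every single quote, then wrap the value in quotes.
    return "'" + value.replace("'", "''") + "'"


quoted = sql_literal("O'Reilly Media")
```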
def insert_via_sqlalchemy(rows: list[dict], db_host: str) -> int:
"""Insert rows using SQLAlchemy (same pattern as control-pipeline)."""
from sqlalchemy import create_engine, text
url = f"postgresql://breakpilot:breakpilot123@{db_host}:5432/breakpilot_db"
engine = create_engine(url)
inserted = 0
with engine.connect() as conn:
conn.execute(text("SET search_path TO compliance, public"))
for row in rows:
conn.execute(
text("""
INSERT INTO regulation_registry
(regulation_id, regulation_name_de, license_rule, license_type,
attribution, source_type, jurisdiction, status)
VALUES
(:regulation_id, :regulation_name_de, :license_rule, :license_type,
:attribution, :source_type, :jurisdiction, :status)
ON CONFLICT (regulation_id) DO UPDATE SET
regulation_name_de = EXCLUDED.regulation_name_de,
license_rule = EXCLUDED.license_rule,
license_type = EXCLUDED.license_type,
attribution = EXCLUDED.attribution,
source_type = EXCLUDED.source_type,
jurisdiction = EXCLUDED.jurisdiction
"""),
row,
)
inserted += 1
conn.commit()
return inserted
def main():
parser = argparse.ArgumentParser(description="Migrate regulation registry data")
parser.add_argument("--dry-run", action="store_true", help="Print SQL only")
parser.add_argument("--db-host", default="localhost", help="PostgreSQL host")
args = parser.parse_args()
rows = build_rows()
print(f"Built {len(rows)} rows from hardcoded dicts")
# Stats
by_rule = {}
by_status = {}
for r in rows:
by_rule[r["license_rule"]] = by_rule.get(r["license_rule"], 0) + 1
by_status[r["status"]] = by_status.get(r["status"], 0) + 1
print(f" By license_rule: {by_rule}")
print(f" By status: {by_status}")
if args.dry_run:
print("\n--- DRY RUN (SQL output) ---\n")
print(generate_sql(rows))
return
inserted = insert_via_sqlalchemy(rows, args.db_host)
print(f"Inserted/updated {inserted} rows into regulation_registry")
if __name__ == "__main__":
main()
@@ -0,0 +1,206 @@
#!/usr/bin/env python3
"""
F2 Migration: Populate action_types + action_synonyms from hardcoded dicts.
Sources:
- ACTION_TYPES (control_ontology.py) 26 types + ~150 aliases
- _NEGATIVE_PATTERNS (control_ontology.py) 22 patterns
- _ACTION_SYNONYMS (control_dedup.py) 65 synonyms
Usage:
python3 scripts/f2_migrate_actions.py --dry-run
python3 scripts/f2_migrate_actions.py --db-host macmini
"""
import argparse
import sys
from pathlib import Path
sys.path.insert(0, str(Path(__file__).resolve().parent.parent))
from services.control_ontology import ACTION_TYPES, _NEGATIVE_PATTERNS # noqa: E402
from services.control_dedup import _ACTION_SYNONYMS # noqa: E402
# Extra action types found in _ACTION_SYNONYMS but missing from ACTION_TYPES
_EXTRA_ACTION_TYPES = {
"audit": "evidence",
"log": "evidence",
"block": "implementation",
"authorize": "governance",
"authenticate": "implementation",
"update": "operation",
"backup": "operation",
"restore": "operation",
}
def build_action_types() -> list[dict]:
"""Build action_types rows from ACTION_TYPES + extras."""
rows = []
for name, info in ACTION_TYPES.items():
rows.append({
"canonical_name": name,
"phase": info["phase"],
})
for name, phase in _EXTRA_ACTION_TYPES.items():
if name not in ACTION_TYPES:
rows.append({
"canonical_name": name,
"phase": phase,
})
return rows
def build_action_synonyms() -> list[dict]:
"""Build action_synonyms rows from all 3 sources."""
rows = []
seen: set[tuple[str, str, str]] = set() # (synonym, language, pattern_type)
# 1) Aliases from ACTION_TYPES
for action_type, info in ACTION_TYPES.items():
for alias in info.get("aliases", []):
key = (alias.lower(), "de", "alias")
if key not in seen:
seen.add(key)
rows.append({
"canonical_action": action_type,
"synonym": alias.lower(),
"language": "de",
"source": "migration",
"pattern_type": "alias",
})
# 2) Negative patterns
for pattern, action_type in _NEGATIVE_PATTERNS:
key = (pattern.lower(), "de", "negative_pattern")
if key not in seen:
seen.add(key)
rows.append({
"canonical_action": action_type,
"synonym": pattern.lower(),
"language": "de",
"source": "migration",
"pattern_type": "negative_pattern",
})
# 3) _ACTION_SYNONYMS (German → canonical English)
for synonym, canonical in _ACTION_SYNONYMS.items():
# Determine language
lang = "en" if synonym == canonical else "de"
key = (synonym.lower(), lang, "alias")
if key not in seen:
seen.add(key)
# Map canonical to valid action_type
action = _map_dedup_canonical(canonical)
rows.append({
"canonical_action": action,
"synonym": synonym.lower(),
"language": lang,
"source": "migration",
"pattern_type": "alias",
})
return rows
def _map_dedup_canonical(canonical: str) -> str:
"""Map control_dedup canonical names to action_types names."""
# Most map directly, some need adjustment
mapping = {
"test": "test",
"verify": "verify", # in ACTION_TYPES
"validate": "validate", # in ACTION_TYPES
"audit": "audit",
"log": "log",
"block": "block",
"restrict": "restrict_access",
"authorize": "authorize",
"authenticate": "authenticate",
"update": "update",
"backup": "backup",
"restore": "restore",
}
return mapping.get(canonical, canonical)
def insert_via_sqlalchemy(action_types: list[dict], synonyms: list[dict], db_host: str):
    """Insert rows using SQLAlchemy."""
    from sqlalchemy import create_engine, text
    url = "postgresql://breakpilot:breakpilot123@%s:5432/breakpilot_db" % db_host
    engine = create_engine(url)
    with engine.connect() as conn:
        conn.execute(text("SET search_path TO compliance, public"))
        # Insert action_types
        for row in action_types:
            conn.execute(
                text("""
                    INSERT INTO action_types (canonical_name, phase)
                    VALUES (:canonical_name, :phase)
                    ON CONFLICT (canonical_name) DO UPDATE SET
                    phase = EXCLUDED.phase
                """),
                row,
            )
        print("Inserted %d action_types" % len(action_types))
        # Insert action_synonyms
        inserted = 0
        skipped = 0
        for row in synonyms:
            try:
                # One SAVEPOINT per row: without it, a failed INSERT aborts the
                # PostgreSQL transaction and every later statement fails as well.
                with conn.begin_nested():
                    conn.execute(
                        text("""
                            INSERT INTO action_synonyms
                            (canonical_action, synonym, language, source, pattern_type)
                            VALUES
                            (:canonical_action, :synonym, :language, :source, :pattern_type)
                            ON CONFLICT (synonym, language, pattern_type) DO UPDATE SET
                            canonical_action = EXCLUDED.canonical_action,
                            source = EXCLUDED.source
                        """),
                        row,
                    )
                inserted += 1
            except Exception as e:
                print("  Skip %s: %s" % (row["synonym"], e))
                skipped += 1
        conn.commit()
        print("Inserted %d action_synonyms (%d skipped)" % (inserted, skipped))
def main():
parser = argparse.ArgumentParser(description="Migrate action types + synonyms")
parser.add_argument("--dry-run", action="store_true", help="Print stats only")
parser.add_argument("--db-host", default="localhost", help="PostgreSQL host")
args = parser.parse_args()
action_types = build_action_types()
synonyms = build_action_synonyms()
print("Action types: %d" % len(action_types))
print("Action synonyms: %d" % len(synonyms))
by_type = {}
for s in synonyms:
by_type[s["pattern_type"]] = by_type.get(s["pattern_type"], 0) + 1
print(" By pattern_type: %s" % by_type)
by_source = {}
for s in synonyms:
by_source[s["canonical_action"]] = by_source.get(s["canonical_action"], 0) + 1
print(" Top actions: %s" % dict(sorted(by_source.items(), key=lambda x: -x[1])[:10]))
if args.dry_run:
print("\n--- DRY RUN ---")
print("\nAction types:")
for at in action_types:
print(" %s (%s)" % (at["canonical_name"], at["phase"]))
return
insert_via_sqlalchemy(action_types, synonyms, args.db_host)
if __name__ == "__main__":
main()
@@ -0,0 +1,100 @@
#!/usr/bin/env python3
"""
F3 Migration: Populate object_synonyms from hardcoded dict.
Source: _OBJECT_SYNONYMS (control_dedup.py) 75 synonyms
Usage:
python3 scripts/f3_migrate_objects.py --dry-run
python3 scripts/f3_migrate_objects.py --db-host macmini
"""
import argparse
import sys
from pathlib import Path
sys.path.insert(0, str(Path(__file__).resolve().parent.parent))
from services.control_dedup import _OBJECT_SYNONYMS # noqa: E402
def build_rows() -> list[dict]:
"""Build object_synonyms rows."""
rows = []
for synonym, canonical in _OBJECT_SYNONYMS.items():
# Detect language (heuristic: German if contains umlauts or common DE words)
lang = "de"
lower = synonym.lower()
if all(c in "abcdefghijklmnopqrstuvwxyz0123456789 -_" for c in lower):
# Pure ASCII — likely English
lang = "en"
# Override for known German without umlauts
if lower in ("passwort", "kennwort", "zugangsdaten", "fernzugriff",
"sitzung", "firewall", "netzwerk", "vorfall",
"schwachstelle", "richtlinie", "schulung",
"protokoll", "datensicherung", "wiederherstellung"):
lang = "de"
rows.append({
"canonical_token": canonical,
"synonym": lower,
"language": lang,
"source": "migration",
})
return rows
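The language heuristic in `build_rows` is easy to probe with a few inputs. The sketch below is a reduced version under assumptions: `guess_language` is a hypothetical name and `GERMAN_ASCII_WORDS` is a shortened stand-in for the override tuple above.

```python
ASCII_CHARS = set("abcdefghijklmnopqrstuvwxyz0123456789 -_")
# Hypothetical, shortened override list of German words without umlauts.
GERMAN_ASCII_WORDS = {"passwort", "netzwerk", "schulung"}


def guess_language(synonym: str) -> str:
    """Return 'de' for overrides and non-ASCII input, otherwise 'en'."""
    lower = synonym.lower()
    if lower in GERMAN_ASCII_WORDS:
        return "de"
    return "en" if all(c in ASCII_CHARS for c in lower) else "de"


samples = {w: guess_language(w) for w in ["password", "Passwort", "Schlüssel", "zugriffskontrolle"]}
```

The last sample shows the limit of the approach: a German word without umlauts that is missing from the override list is tagged `en`. Rows tagged this way may need a manual pass before the language column is relied on.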
def insert_via_sqlalchemy(rows: list[dict], db_host: str):
"""Insert rows using SQLAlchemy."""
from sqlalchemy import create_engine, text
url = "postgresql://breakpilot:breakpilot123@%s:5432/breakpilot_db" % db_host
engine = create_engine(url)
with engine.connect() as conn:
conn.execute(text("SET search_path TO compliance, public"))
inserted = 0
for row in rows:
conn.execute(
text("""
INSERT INTO object_synonyms
(canonical_token, synonym, language, source)
VALUES
(:canonical_token, :synonym, :language, :source)
ON CONFLICT (synonym, language) DO UPDATE SET
canonical_token = EXCLUDED.canonical_token,
source = EXCLUDED.source
"""),
row,
)
inserted += 1
conn.commit()
print("Inserted %d object_synonyms" % inserted)
def main():
parser = argparse.ArgumentParser(description="Migrate object synonyms")
parser.add_argument("--dry-run", action="store_true", help="Print stats only")
parser.add_argument("--db-host", default="localhost", help="PostgreSQL host")
args = parser.parse_args()
rows = build_rows()
print("Object synonyms: %d" % len(rows))
# Group by canonical
by_canonical = {}
for r in rows:
by_canonical[r["canonical_token"]] = by_canonical.get(r["canonical_token"], 0) + 1
print("Unique canonical tokens: %d" % len(by_canonical))
print("Top tokens: %s" % dict(sorted(by_canonical.items(), key=lambda x: -x[1])[:10]))
if args.dry_run:
return
insert_via_sqlalchemy(rows, args.db_host)
if __name__ == "__main__":
main()
@@ -0,0 +1,267 @@
#!/usr/bin/env python3
"""
F4: LLM-based Synonym Enrichment for Action Types and Object Tokens.
Uses Ollama (qwen3.5:35b-a3b) to generate additional German synonyms
for each canonical action type and object token. Results are stored
with source='llm' in the DB.
Usage:
# Dry run (print, no DB write):
python3 scripts/f4_llm_enrich_synonyms.py --dry-run
# Against Mac Mini:
python3 scripts/f4_llm_enrich_synonyms.py --db-host macmini --ollama-host macmini
# Only actions or only objects:
python3 scripts/f4_llm_enrich_synonyms.py --actions-only
python3 scripts/f4_llm_enrich_synonyms.py --objects-only
"""
import argparse
import json
import logging
import sys
import time
from pathlib import Path
import httpx
from sqlalchemy import create_engine, text
logging.basicConfig(level=logging.INFO, format="%(asctime)s [%(levelname)s] %(message)s")
logger = logging.getLogger("f4-enrich")
OLLAMA_MODEL = "qwen3.5:35b-a3b"
def call_ollama(prompt: str, ollama_url: str) -> str:
"""Call Ollama with think:false for direct answers."""
resp = httpx.post(
f"{ollama_url}/api/chat",
json={
"model": OLLAMA_MODEL,
"messages": [{"role": "user", "content": prompt}],
"stream": False,
"options": {"temperature": 0.3},
"think": False,
},
timeout=60.0,
)
resp.raise_for_status()
return resp.json().get("message", {}).get("content", "")
def enrich_action_types(db_url: str, ollama_url: str, dry_run: bool) -> dict:
"""Generate synonyms for each action type."""
engine = create_engine(db_url, connect_args={"options": "-c search_path=compliance,public"})
with engine.connect() as conn:
# Get existing action types and their current synonyms
types = conn.execute(text("SELECT canonical_name, phase FROM action_types")).fetchall()
existing = {}
for row in conn.execute(text("SELECT canonical_action, synonym FROM action_synonyms")).fetchall():
existing.setdefault(row[0], set()).add(row[1])
stats = {"types_processed": 0, "new_synonyms": 0, "skipped": 0}
all_new: list[dict] = []
for canonical, phase in types:
current_synonyms = existing.get(canonical, set())
prompt = f"""Du bist ein Compliance-Experte. Gib mir 5-8 deutsche Synonyme oder Umschreibungen fuer die Handlung "{canonical}" (Phase: {phase}) im Kontext von IT-Compliance und Datenschutz.
Bestehende Synonyme (NICHT wiederholen): {', '.join(sorted(current_synonyms)[:10])}
Antworte NUR mit einer JSON-Liste von Strings, z.B.: ["synonym1", "synonym2", ...]
Keine Erklaerungen, nur die JSON-Liste."""
try:
response = call_ollama(prompt, ollama_url)
# Parse JSON from response
synonyms = _parse_json_list(response)
new_count = 0
for syn in synonyms:
syn_lower = syn.lower().strip()
if not syn_lower or len(syn_lower) < 3:
continue
if syn_lower in current_synonyms:
stats["skipped"] += 1
continue
all_new.append({
"canonical_action": canonical,
"synonym": syn_lower,
"language": "de",
"source": "llm",
"pattern_type": "alias",
})
current_synonyms.add(syn_lower)
new_count += 1
stats["types_processed"] += 1
stats["new_synonyms"] += new_count
logger.info("%s: +%d new synonyms", canonical, new_count)
except Exception as e:
logger.warning("Error for %s: %s", canonical, e)
time.sleep(1) # Rate limit
# Write to DB
if not dry_run and all_new:
with engine.begin() as conn:
for row in all_new:
conn.execute(
text("""
INSERT INTO action_synonyms (canonical_action, synonym, language, source, pattern_type)
VALUES (:canonical_action, :synonym, :language, :source, :pattern_type)
ON CONFLICT (synonym, language, pattern_type) DO NOTHING
"""),
row,
)
logger.info("Wrote %d new action synonyms to DB", len(all_new))
elif dry_run:
print("\n--- DRY RUN: Action Synonyms ---")
for row in all_new[:20]:
print(" %s -> %s" % (row["canonical_action"], row["synonym"]))
if len(all_new) > 20:
print(" ... and %d more" % (len(all_new) - 20))
return stats
def enrich_object_tokens(db_url: str, ollama_url: str, dry_run: bool) -> dict:
"""Generate synonyms for each object canonical token."""
engine = create_engine(db_url, connect_args={"options": "-c search_path=compliance,public"})
with engine.connect() as conn:
# Get unique canonical tokens
tokens = conn.execute(text(
"SELECT DISTINCT canonical_token FROM object_synonyms ORDER BY canonical_token"
)).fetchall()
existing = {}
for row in conn.execute(text("SELECT canonical_token, synonym FROM object_synonyms")).fetchall():
existing.setdefault(row[0], set()).add(row[1])
stats = {"tokens_processed": 0, "new_synonyms": 0, "skipped": 0}
all_new: list[dict] = []
for (token,) in tokens:
current_synonyms = existing.get(token, set())
prompt = f"""Du bist ein IT-Security-Experte. Gib mir 5-8 deutsche und englische Begriffe/Synonyme fuer das Konzept "{token}" im Kontext von IT-Sicherheit und Compliance.
Bestehende Synonyme (NICHT wiederholen): {', '.join(sorted(current_synonyms)[:8])}
Antworte NUR mit einer JSON-Liste von Strings, z.B.: ["synonym1", "synonym2", ...]
Keine Erklaerungen, nur die JSON-Liste."""
try:
response = call_ollama(prompt, ollama_url)
synonyms = _parse_json_list(response)
new_count = 0
for syn in synonyms:
syn_lower = syn.lower().strip()
if not syn_lower or len(syn_lower) < 2:
continue
if syn_lower in current_synonyms:
stats["skipped"] += 1
continue
# Detect language
lang = "de"
if all(c in "abcdefghijklmnopqrstuvwxyz0123456789 -_" for c in syn_lower):
lang = "en"
all_new.append({
"canonical_token": token,
"synonym": syn_lower,
"language": lang,
"source": "llm",
})
current_synonyms.add(syn_lower)
new_count += 1
stats["tokens_processed"] += 1
stats["new_synonyms"] += new_count
logger.info("%s: +%d new synonyms", token, new_count)
except Exception as e:
logger.warning("Error for %s: %s", token, e)
time.sleep(1)
# Write to DB
if not dry_run and all_new:
with engine.begin() as conn:
for row in all_new:
conn.execute(
text("""
INSERT INTO object_synonyms (canonical_token, synonym, language, source)
VALUES (:canonical_token, :synonym, :language, :source)
ON CONFLICT (synonym, language) DO NOTHING
"""),
row,
)
logger.info("Wrote %d new object synonyms to DB", len(all_new))
elif dry_run:
print("\n--- DRY RUN: Object Synonyms ---")
for row in all_new[:20]:
print(" %s -> %s (%s)" % (row["canonical_token"], row["synonym"], row["language"]))
if len(all_new) > 20:
print(" ... and %d more" % (len(all_new) - 20))
return stats
def _parse_json_list(raw: str) -> list[str]:
    """Extract a JSON list of strings from an LLM response."""
    # The parameter is named `raw` so it does not shadow the sqlalchemy `text` import
    raw = raw.strip()
    # Remove markdown code fences
    if "```" in raw:
        raw = raw.split("```")[1] if raw.count("```") >= 2 else raw
        raw = raw.strip()
        if raw.startswith("json"):
            raw = raw[4:].strip()
    # Find first [ and last ]
    start = raw.find("[")
    end = raw.rfind("]")
    if start >= 0 and end > start:
        try:
            parsed = json.loads(raw[start:end + 1])
        except json.JSONDecodeError:
            return []
        # Keep only strings; the callers call .lower() on every item
        if isinstance(parsed, list):
            return [item for item in parsed if isinstance(item, str)]
    return []
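The bracket-scanning approach of `_parse_json_list` can be checked against typical LLM outputs. The sketch below is a simplified variant under the hypothetical name `parse_json_list`; it skips the fence handling because slicing from the first `[` to the last `]` already drops surrounding fences and prose.

```python
import json


def parse_json_list(raw: str) -> list:
    """Return the JSON array spanning the first '[' to the last ']', or []."""
    start = raw.find("[")
    end = raw.rfind("]")
    if start < 0 or end <= start:
        return []
    try:
        parsed = json.loads(raw[start:end + 1])
    except json.JSONDecodeError:
        return []
    return parsed if isinstance(parsed, list) else []


fence = "`" * 3  # built at runtime to keep this example free of literal fences
fenced = parse_json_list(fence + 'json\n["prüfen", "kontrollieren"]\n' + fence)
chatty = parse_json_list('Hier sind die Synonyme: ["a", "b"] Viel Erfolg!')
two_lists = parse_json_list('["a"] und ["b"]')
```

A response containing two separate arrays yields an empty result, because the slice between the first `[` and the last `]` is not valid JSON. For this script that simply means no synonyms are added for that item.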
def main():
parser = argparse.ArgumentParser(description="LLM Synonym Enrichment")
parser.add_argument("--dry-run", action="store_true")
parser.add_argument("--db-host", default="localhost")
parser.add_argument("--ollama-host", default="localhost")
parser.add_argument("--actions-only", action="store_true")
parser.add_argument("--objects-only", action="store_true")
args = parser.parse_args()
db_url = f"postgresql://breakpilot:breakpilot123@{args.db_host}:5432/breakpilot_db"
ollama_url = f"http://{args.ollama_host}:11434"
if args.dry_run:
print("=== DRY RUN MODE ===\n")
if not args.objects_only:
print("=== Enriching Action Types ===")
action_stats = enrich_action_types(db_url, ollama_url, args.dry_run)
print("Actions: %d processed, %d new synonyms\n" % (
action_stats["types_processed"], action_stats["new_synonyms"]))
if not args.actions_only:
print("=== Enriching Object Tokens ===")
object_stats = enrich_object_tokens(db_url, ollama_url, args.dry_run)
print("Objects: %d processed, %d new synonyms\n" % (
object_stats["tokens_processed"], object_stats["new_synonyms"]))
if __name__ == "__main__":
main()
@@ -0,0 +1,289 @@
#!/usr/bin/env python3
"""
Add L2 sub-topics to broad tokens. Instead of just "incident",
produces "incident:response", "incident:detection", etc.
Only processes tokens with >500 controls AND <90% audit accuracy.
Usage:
python3 /app/scripts/gpre0_add_subtopics.py --dry-run
python3 /app/scripts/gpre0_add_subtopics.py
"""
import argparse
import json
import logging
import os
import time
from collections import defaultdict
from pathlib import Path
import httpx
from sqlalchemy import create_engine, text
logging.basicConfig(
level=logging.INFO, format="%(asctime)s [%(levelname)s] %(message)s"
)
logger = logging.getLogger("gpre0-subtopics")
DB_URL = os.getenv(
"DATABASE_URL",
"postgresql://breakpilot:breakpilot123@postgres:5432/breakpilot_db",
)
ANTHROPIC_API_KEY = os.getenv("ANTHROPIC_API_KEY", "")
ANTHROPIC_MODEL = "claude-haiku-4-5-20251001"
ANTHROPIC_URL = "https://api.anthropic.com/v1/messages"
CHECKPOINT_DIR = Path("/tmp/gpre0_subtopic_checkpoints")
# Tokens that are too broad — need L2 sub-topics
BROAD_TOKENS = {
# Round 1 (already done)
"risk_management", "policy", "audit_logging", "incident",
"access_control", "compliance_audit", "asset_management",
"key_management", "third_party_management", "monitoring",
"financial_reporting", "data_classification", "change_management",
"alerting", "multi_factor_auth", "api_security",
"certificate_management", "human_resources_security",
"training", "data_processing_agreement", "data_processing_register",
"consumer_protection", "input_validation", "vulnerability",
"dpia", "data_breach_notification", "backup",
"supply_chain_due_diligence", "awareness",
"privacy_by_design", "credentials", "logging_configuration",
# Round 2 (remaining large tokens)
"supervisory_authority", "certification", "secure_development",
"product_safety", "personal_data", "data_subject_rights", "consent",
"ai_system", "encryption", "data_retention", "disaster_recovery",
"data_transfer", "aml", "transport_encryption", "network_security",
"physical_security", "medical_device", "patch_management",
"cookie_consent", "video_surveillance", "network_segmentation",
"telecommunications", "privileged_access", "session_management",
"password_policy", "governance", "whistleblowing", "payment_services",
"health_data", "sensitive_data", "ecommerce", "sustainability_reporting",
"critical_infrastructure", "regulatory",
}
SYSTEM_PROMPT = """Du bist ein Compliance-Spezialist. Jeder Control hat bereits ein Hauptthema (L1 Token).
Deine Aufgabe: Bestimme ein SPEZIFISCHES Sub-Thema (L2) innerhalb des Hauptthemas.
Das L2 Sub-Thema soll den KONKRETEN Aspekt beschreiben. Verwende kurze, klare englische Bezeichnungen.
Beispiele:
- L1=incident, Titel="Incident Response Plan erstellen" L2="response_plan"
- L1=incident, Titel="Sicherheitsvorfälle erkennen" L2="detection"
- L1=incident, Titel="Recovery nach Vorfall dokumentieren" L2="recovery"
- L1=incident, Titel="Forensische Analyse durchführen" L2="forensics"
- L1=risk_management, Titel="Risikobewertung durchführen" L2="assessment"
- L1=risk_management, Titel="Risikominderungsmaßnahmen umsetzen" L2="treatment"
- L1=risk_management, Titel="Restrisiko akzeptieren" L2="acceptance"
- L1=access_control, Titel="Rollenbasierte Zugriffskontrolle" L2="rbac"
- L1=access_control, Titel="Zugriffsrechte regelmäßig prüfen" L2="access_review"
- L1=access_control, Titel="Identitätsmanagement implementieren" L2="identity_management"
- L1=monitoring, Titel="Systemverfügbarkeit überwachen" L2="availability"
- L1=monitoring, Titel="Sicherheitsereignisse überwachen" L2="security_events"
- L1=policy, Titel="Datenschutzrichtlinie erstellen" L2="data_protection"
- L1=policy, Titel="Acceptable Use Policy definieren" L2="acceptable_use"
- L1=policy, Titel="Passwortrichtlinie festlegen" L2="password"
- L1=financial_reporting, Titel="Jahresabschluss erstellen" L2="annual_accounts"
- L1=financial_reporting, Titel="Steuererklärung einreichen" L2="tax"
- L1=alerting, Titel="Datenpanne an Behörde melden" L2="breach_notification"
- L1=alerting, Titel="Sicherheitswarnung eskalieren" L2="escalation"
REGELN:
- L2 soll 1-3 Wörter sein, snake_case
- L2 soll SPEZIFISCH sein (nicht das L1 wiederholen)
- Verwende konsistente L2-Bezeichnungen für ähnliche Controls
Antworte NUR als JSON-Array: [{"id":"...","l2":"subtopic"}, ...]"""
def call_claude(controls_batch: list[dict]) -> tuple[list[dict], dict]:
"""Send batch to Claude for L2 sub-topic assignment."""
items = []
for c in controls_batch:
items.append(
f'- id="{c["control_id"]}" '
f'L1="{c["current_object"]}" '
f't="{c["title"]}" '
f'o="{c["objective"][:80]}"'
)
prompt = "Bestimme L2 Sub-Topics:\n" + "\n".join(items)
headers = {
"x-api-key": ANTHROPIC_API_KEY,
"anthropic-version": "2023-06-01",
"content-type": "application/json",
}
payload = {
"model": ANTHROPIC_MODEL,
"max_tokens": 1500,
"temperature": 0.0,
"system": SYSTEM_PROMPT,
"messages": [{"role": "user", "content": prompt}],
}
try:
resp = httpx.post(
ANTHROPIC_URL, headers=headers, json=payload, timeout=45.0
)
resp.raise_for_status()
data = resp.json()
content = data.get("content", [{}])[0].get("text", "")
usage = data.get("usage", {})
start = content.find("[")
end = content.rfind("]") + 1
if start >= 0 and end > start:
return json.loads(content[start:end]), usage
return [], usage
except httpx.TimeoutException:
logger.error("TIMEOUT — skipping")
return [], {}
except httpx.HTTPStatusError as e:
if e.response.status_code == 429:
logger.warning("Rate limited — waiting 60s")
time.sleep(60)
else:
logger.error("API error %d", e.response.status_code)
return [], {}
except Exception as e:
logger.error("Failed: %s", e)
return [], {}
def main():
parser = argparse.ArgumentParser()
parser.add_argument("--batch-size", type=int, default=20)
parser.add_argument("--dry-run", action="store_true")
args = parser.parse_args()
engine = create_engine(
DB_URL, connect_args={"options": "-c search_path=compliance,public"}
)
# Build LIKE patterns for broad tokens
like_clauses = " OR ".join(
f"cc.generation_metadata->>'merge_group_hint' LIKE '%:{tok}:%'"
for tok in BROAD_TOKENS
)
with engine.connect() as c:
rows = c.execute(text(f"""
SELECT cc.id, cc.control_id, cc.title,
COALESCE(cc.objective, '') as objective,
cc.generation_metadata->>'merge_group_hint' as hint
FROM canonical_controls cc
WHERE cc.generation_metadata->>'merge_group_hint' IS NOT NULL
AND cc.release_state NOT IN ('deprecated', 'rejected')
AND ({like_clauses})
""")).fetchall()
controls = []
for uuid, cid, title, objective, hint in rows:
parts = hint.split(":", 2) if hint else []
obj = parts[1] if len(parts) > 1 else ""
if obj in BROAD_TOKENS:
controls.append({
"uuid": str(uuid), "control_id": cid,
"title": title or "", "objective": objective or "",
"current_hint": hint, "current_object": obj,
})
logger.info("Found %d controls in broad tokens to add L2 sub-topics", len(controls))
# Process
total_tagged = 0
total_skipped = 0
total_input_tokens = 0
total_output_tokens = 0
corrections = []
l2_stats: dict[str, dict[str, int]] = defaultdict(lambda: defaultdict(int))
for i in range(0, len(controls), args.batch_size):
batch = controls[i:i + args.batch_size]
results, usage = call_claude(batch)
total_input_tokens += usage.get("input_tokens", 0)
total_output_tokens += usage.get("output_tokens", 0)
if not results:
total_skipped += len(batch)
continue
result_map = {r.get("id", ""): r for r in results}
for ctrl in batch:
r = result_map.get(ctrl["control_id"], {})
l2 = r.get("l2", "")
if not l2:
total_skipped += 1
continue
total_tagged += 1
old_hint = ctrl["current_hint"]
parts = old_hint.split(":", 2)
action = parts[0] if parts else "implement"
l1 = parts[1] if len(parts) > 1 else "unknown"
phase = parts[2] if len(parts) > 2 else "implementation"
# New format: action:L1_L2:phase
new_obj = f"{l1}_{l2}"
new_hint = f"{action}:{new_obj}:{phase}"
corrections.append({
"uuid": ctrl["uuid"],
"old_hint": old_hint,
"new_hint": new_hint,
})
l2_stats[l1][l2] += 1
processed = min(i + args.batch_size, len(controls))
if processed % 5000 < args.batch_size or processed >= len(controls):
logger.info(
"Progress: %d/%d (tagged=%d skip=%d)",
processed, len(controls), total_tagged, total_skipped,
)
time.sleep(0.3)
# Report
cost_in = total_input_tokens / 1_000_000 * 0.80
cost_out = total_output_tokens / 1_000_000 * 4.00
logger.info("\n" + "=" * 60)
logger.info("SUBTOPIC REPORT")
logger.info("=" * 60)
logger.info("Total: %d | Tagged: %d | Skipped: %d", len(controls), total_tagged, total_skipped)
logger.info("Cost: $%.2f (Haiku)", cost_in + cost_out)
# Show L2 distribution per L1
for l1, subs in sorted(l2_stats.items()):
top_subs = sorted(subs.items(), key=lambda x: -x[1])[:10]
logger.info("\n%s (%d unique L2):", l1, len(subs))
for l2, cnt in top_subs:
logger.info(" %4d %s_%s", cnt, l1, l2)
# Save corrections
CHECKPOINT_DIR.mkdir(parents=True, exist_ok=True)
corr_file = CHECKPOINT_DIR / "corrections_subtopics.json"
corr_file.write_text(json.dumps(corrections))
logger.info("\nSaved %d corrections to %s", len(corrections), corr_file)
if args.dry_run:
logger.info("DRY RUN — not updating DB")
return
if corrections:
logger.info("Applying %d corrections...", len(corrections))
with engine.begin() as c:
c.execute(text("SET search_path TO compliance, public"))
for corr in corrections:
c.execute(text("""
UPDATE canonical_controls
SET generation_metadata = jsonb_set(
generation_metadata,
'{merge_group_hint}',
to_jsonb(CAST(:new_hint AS text))
)
WHERE id = CAST(:uuid AS uuid)
"""), {"uuid": corr["uuid"], "new_hint": corr["new_hint"]})
logger.info("Done. %d hints updated.", len(corrections))
if __name__ == "__main__":
main()
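The hint rewrite performed inside the loop above can be isolated into a small helper. The following is a minimal sketch, assuming the `action:L1:phase` layout that the script splits with `split(":", 2)`; the name `add_subtopic` is hypothetical and does not exist in the script.

```python
def add_subtopic(hint: str, l2: str) -> str:
    """Rewrite 'action:L1:phase' to 'action:L1_L2:phase'.

    Hypothetical helper; it mirrors the defaults used in the script when
    a hint has fewer than three colon-separated parts.
    """
    parts = hint.split(":", 2)
    action = parts[0] if parts else "implement"
    l1 = parts[1] if len(parts) > 1 else "unknown"
    phase = parts[2] if len(parts) > 2 else "implementation"
    return f"{action}:{l1}_{l2}:{phase}"


if __name__ == "__main__":
    print(add_subtopic("define:policy:planning", "password"))
```

Because the split is limited to two separators, any extra colons stay inside the phase part and are carried over unchanged.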
@@ -0,0 +1,52 @@
#!/usr/bin/env python3
"""Apply saved corrections from JSON file to DB (crash recovery)."""
import argparse
import json
import logging
import os
from pathlib import Path

from sqlalchemy import create_engine, text

logging.basicConfig(level=logging.INFO, format="%(asctime)s [%(levelname)s] %(message)s")
logger = logging.getLogger("apply-corrections")

DB_URL = os.getenv("DATABASE_URL", "postgresql://breakpilot:breakpilot123@postgres:5432/breakpilot_db")


def main():
    parser = argparse.ArgumentParser()
    parser.add_argument("file", help="Path to corrections JSON file")
    parser.add_argument("--dry-run", action="store_true")
    args = parser.parse_args()

    corrections = json.loads(Path(args.file).read_text())
    logger.info("Loaded %d corrections from %s", len(corrections), args.file)

    if args.dry_run:
        # Preview the first 10 corrections as "old → new" without touching the DB.
        for c in corrections[:10]:
            logger.info("  %s: %s → %s", c["uuid"][:8], c["old_hint"], c["new_hint"])
        logger.info("DRY RUN — not applying")
        return

    engine = create_engine(DB_URL, connect_args={"options": "-c search_path=compliance,public"})
    applied = 0
    # engine.begin() runs all updates in one transaction: either every
    # correction is committed or none is.
    with engine.begin() as c:
        c.execute(text("SET search_path TO compliance, public"))
        for corr in corrections:
            c.execute(text("""
                UPDATE canonical_controls
                SET generation_metadata = jsonb_set(
                    generation_metadata,
                    '{merge_group_hint}',
                    to_jsonb(CAST(:new_hint AS text))
                )
                WHERE id = CAST(:uuid AS uuid)
            """), {"uuid": corr["uuid"], "new_hint": corr["new_hint"]})
            applied += 1
    logger.info("Applied %d corrections.", applied)


if __name__ == "__main__":
    main()
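The recovery script trusts the corrections file as-is. A possible pre-check, sketched here under the assumption that every entry carries the `uuid`, `old_hint` and `new_hint` keys written by the other scripts, separates usable entries from malformed ones before any UPDATE runs. The function name `parse_corrections` is hypothetical.

```python
import json

REQUIRED_KEYS = {"uuid", "old_hint", "new_hint"}


def parse_corrections(raw: str) -> tuple[list[dict], int]:
    """Return (valid_entries, dropped_count) for a corrections JSON string.

    Hypothetical helper: an entry is valid when it is an object that
    contains all of REQUIRED_KEYS.
    """
    data = json.loads(raw)
    if not isinstance(data, list):
        raise ValueError("corrections file must contain a JSON array")
    valid = [c for c in data if isinstance(c, dict) and REQUIRED_KEYS <= c.keys()]
    return valid, len(data) - len(valid)


if __name__ == "__main__":
    sample = '[{"uuid": "u1", "old_hint": "a:b:c", "new_hint": "a:b_x:c"}, {"uuid": "u2"}]'
    entries, dropped = parse_corrections(sample)
    print(len(entries), "valid,", dropped, "dropped")
```

Logging the dropped count before applying makes a truncated or partially written file visible instead of failing halfway through the transaction with a `KeyError`.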
@@ -0,0 +1,153 @@
#!/usr/bin/env python3
"""Fix bad L2 subtopics: stakeholder_*, escalation fragments, *_approval*, *_documentation."""
import json
import logging
import os
import time
from pathlib import Path
import httpx
from sqlalchemy import create_engine, text
logging.basicConfig(level=logging.INFO, format="%(asctime)s [%(levelname)s] %(message)s")
logger = logging.getLogger("fix-subtopics")
DB_URL = os.getenv("DATABASE_URL", "postgresql://breakpilot:breakpilot123@postgres:5432/breakpilot_db")
ANTHROPIC_API_KEY = os.getenv("ANTHROPIC_API_KEY", "")
ANTHROPIC_URL = "https://api.anthropic.com/v1/messages"
SYSTEM_PROMPT = """Du klassifizierst Controls mit einem L1_L2 Token. Das L2 soll den KONKRETEN fachlichen Aspekt beschreiben.
VERBOTENE L2-Wörter (zu generisch):
- stakeholder (zu vage: WER sind die Stakeholder? WAS wird getan?)
- documentation (ist eine Handlung, kein Thema)
- approval (ist eine Handlung)
- communication (zu vage)
Stattdessen SPEZIFISCH:
- "stakeholder_notification" bei Behördenmeldung → "authority_reporting"
- "stakeholder_consultation" bei DSFA → "impact_consultation"
- "stakeholder_engagement" bei Training → "participant_selection"
- "escalation_procedure" → "severity_classification" oder "response_plan"
- "access_documentation" → "access_policy" oder "permission_matrix"
- "approval_process" → "authorization_workflow" oder "sign_off"
L2 = 1-3 Wörter, snake_case, FACHLICH SPEZIFISCH.
Antworte NUR als JSON-Array: [{"id":"...","token":"L1_L2"}, ...]"""
def main():
engine = create_engine(DB_URL, connect_args={"options": "-c search_path=compliance,public"})
with engine.connect() as c:
rows = c.execute(text("""
SELECT cc.id, cc.control_id, cc.title,
COALESCE(cc.objective, '') as objective,
cc.generation_metadata->>'merge_group_hint' as hint
FROM canonical_controls cc
WHERE cc.release_state NOT IN ('deprecated', 'rejected')
AND cc.generation_metadata->>'merge_group_hint' IS NOT NULL
AND (
cc.generation_metadata->>'merge_group_hint' LIKE '%stakeholder%'
OR cc.generation_metadata->>'merge_group_hint' LIKE '%_escalation_%'
OR cc.generation_metadata->>'merge_group_hint' LIKE '%_approval_%'
OR cc.generation_metadata->>'merge_group_hint' LIKE '%response_time%'
OR cc.generation_metadata->>'merge_group_hint' LIKE '%machine_re%'
OR cc.generation_metadata->>'merge_group_hint' LIKE '%management_app%'
)
""")).fetchall()
controls = []
for uuid, cid, title, objective, hint in rows:
parts = hint.split(":", 2) if hint else []
controls.append({
"uuid": str(uuid), "control_id": cid,
"title": title or "", "objective": objective or "",
"current_hint": hint,
"current_object": parts[1] if len(parts) > 1 else "",
})
logger.info("Found %d controls with bad subtopics to fix", len(controls))
headers = {
"x-api-key": ANTHROPIC_API_KEY,
"anthropic-version": "2023-06-01",
"content-type": "application/json",
}
corrections = []
total_fixed = 0
batch_size = 20
for i in range(0, len(controls), batch_size):
batch = controls[i:i + batch_size]
items = [
f'- id="{c["control_id"]}" cur="{c["current_object"]}" t="{c["title"]}" o="{c["objective"][:80]}"'
for c in batch
]
try:
resp = httpx.post(ANTHROPIC_URL, headers=headers, json={
"model": "claude-haiku-4-5-20251001",
"max_tokens": 1500, "temperature": 0.0,
"system": SYSTEM_PROMPT,
"messages": [{"role": "user", "content": "Fix:\n" + "\n".join(items)}],
}, timeout=45.0)
resp.raise_for_status()
content = resp.json().get("content", [{}])[0].get("text", "")
start = content.find("[")
end = content.rfind("]") + 1
results = json.loads(content[start:end]) if start >= 0 and end > start else []
except Exception as e:
logger.error("Batch %d failed: %s", i, e)
continue
result_map = {r.get("id", ""): r for r in results}
for ctrl in batch:
r = result_map.get(ctrl["control_id"], {})
new_token = r.get("token", "")
if not new_token or new_token == ctrl["current_object"]:
continue
if "stakeholder" in new_token or "approval" in new_token:
continue # Still bad
parts = ctrl["current_hint"].split(":", 2)
action = parts[0] if parts else "implement"
phase = parts[2] if len(parts) > 2 else "implementation"
corrections.append({
"uuid": ctrl["uuid"],
"old_hint": ctrl["current_hint"],
"new_hint": f"{action}:{new_token}:{phase}",
})
total_fixed += 1
if (i + batch_size) % 200 < batch_size:
logger.info("Progress: %d/%d (fixed=%d)", min(i + batch_size, len(controls)), len(controls), total_fixed)
time.sleep(0.3)
logger.info("Fixed: %d of %d controls", total_fixed, len(controls))
# Save + apply
Path("/tmp/corrections_bad_subtopics.json").write_text(json.dumps(corrections))
if corrections:
logger.info("Applying %d corrections...", len(corrections))
with engine.begin() as c:
c.execute(text("SET search_path TO compliance, public"))
for corr in corrections:
c.execute(text("""
UPDATE canonical_controls
SET generation_metadata = jsonb_set(
generation_metadata,
'{merge_group_hint}',
to_jsonb(CAST(:new_hint AS text))
)
WHERE id = CAST(:uuid AS uuid)
"""), {"uuid": corr["uuid"], "new_hint": corr["new_hint"]})
logger.info("Done.")
if __name__ == "__main__":
main()
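All scripts in this commit pull the JSON array out of the model response with `find("[")` and `rfind("]")`. The sketch below collects that step in one place and returns an empty list instead of raising on malformed output. The function name `extract_json_array` is an assumption made for illustration.

```python
import json


def extract_json_array(content: str) -> list[dict]:
    """Extract the outermost JSON array of objects from a text response.

    Hypothetical helper: returns [] when no bracket pair is found,
    when the slice is not valid JSON, or when it is not a list.
    """
    start = content.find("[")
    end = content.rfind("]") + 1
    if start < 0 or end <= start:
        return []
    try:
        parsed = json.loads(content[start:end])
    except json.JSONDecodeError:
        return []
    if not isinstance(parsed, list):
        return []
    # Keep only objects so callers can safely use .get() on each item.
    return [item for item in parsed if isinstance(item, dict)]


if __name__ == "__main__":
    print(extract_json_array('Result: [{"id": "C-1", "token": "alerting_sign_off"}] done'))
```

Filtering to dictionaries matters because the callers build `result_map` with `r.get("id", "")`, which would fail on a stray string or number in the array.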
@@ -0,0 +1,284 @@
#!/usr/bin/env python3
"""
Fix generic tokens: Re-classify controls that were assigned to
action-based tokens (documentation, procedure, process, etc.)
instead of topic-based tokens.
Processes all matching controls sequentially in API batches of --batch-size
(default 20). NO retry on timeout.
Usage:
python3 /app/scripts/gpre0_fix_generic_tokens.py --dry-run
python3 /app/scripts/gpre0_fix_generic_tokens.py
"""
import argparse
import json
import logging
import os
import time
from collections import defaultdict
from pathlib import Path
import httpx
from sqlalchemy import create_engine, text
logging.basicConfig(
level=logging.INFO, format="%(asctime)s [%(levelname)s] %(message)s"
)
logger = logging.getLogger("gpre0-fix-generic")
DB_URL = os.getenv(
"DATABASE_URL",
"postgresql://breakpilot:breakpilot123@postgres:5432/breakpilot_db",
)
ANTHROPIC_API_KEY = os.getenv("ANTHROPIC_API_KEY", "")
ANTHROPIC_MODEL = "claude-haiku-4-5-20251001"
ANTHROPIC_URL = "https://api.anthropic.com/v1/messages"
CHECKPOINT_DIR = Path("/tmp/gpre0_fix_checkpoints")
# Tokens that are ACTION-based, not TOPIC-based → must be re-classified
FORBIDDEN_TOKENS = {
"documentation", "procedure", "process",
"compliance_reporting", "records_management",
}
SYSTEM_PROMPT = """Du bist ein Compliance-Klassifizierer. Ordne jeden Control dem THEMA zu, nicht der Handlung.
KRITISCH: Die Tokens "documentation", "procedure", "process", "compliance_reporting",
"records_management" sind VERBOTEN. Klassifiziere nach dem INHALTLICHEN THEMA.
Beispiele:
- "Risikobewertung dokumentieren" → risk_management (NICHT documentation)
- "Incident-Verfahren definieren" → incident (NICHT procedure)
- "Verschlüsselungsprozess implementieren" → encryption (NICHT process)
- "Audit-Ergebnisse berichten" → compliance_audit (NICHT compliance_reporting)
- "Datenschutz-Unterlagen verwalten" → personal_data (NICHT records_management)
- "Löschkonzept dokumentieren" → data_retention (NICHT documentation)
- "Zertifizierungsverfahren definieren" → certification (NICHT procedure)
- "Schulungsprozess durchführen" → training (NICHT process)
ERLAUBTE TOKENS:
SECURITY: multi_factor_auth, password_policy, credentials, session_management,
privileged_access, access_control, encryption, transport_encryption,
key_management, certificate_management, network_security, network_segmentation,
firewall, vpn, remote_access, monitoring, audit_logging, siem, alerting,
compliance_audit, vulnerability, patch_management, backup, disaster_recovery,
physical_security, secure_development, api_security, input_validation,
container_security, logging_configuration
DATA_PROTECTION: personal_data, sensitive_data, health_data, consent,
data_subject_rights, data_retention, data_transfer, data_breach_notification,
dpia, data_processing_agreement, privacy_by_design, data_processing_register,
data_classification, cookie_consent, video_surveillance
GOVERNANCE: policy, training, awareness, incident, risk_management,
third_party_management, change_management, asset_management,
human_resources_security
REGULATORY: supervisory_authority, certification, product_safety, ai_system,
financial_reporting, aml, whistleblowing, consumer_protection, ecommerce,
telecommunications, medical_device, payment_services, critical_infrastructure,
supply_chain_due_diligence, sustainability_reporting
Antworte NUR als JSON-Array: [{"id":"...","token":"...","conf":0.9}, ...]"""
def call_claude(controls_batch: list[dict]) -> tuple[list[dict], dict]:
"""Send batch to Claude. NO retry on timeout."""
items = []
for c in controls_batch:
items.append(
f'- id="{c["control_id"]}" '
f'cur="{c["current_object"]}" '
f't="{c["title"]}" '
f'o="{c["objective"][:100]}"'
)
prompt = "Klassifiziere nach THEMA (nicht Handlung):\n" + "\n".join(items)
headers = {
"x-api-key": ANTHROPIC_API_KEY,
"anthropic-version": "2023-06-01",
"content-type": "application/json",
}
payload = {
"model": ANTHROPIC_MODEL,
"max_tokens": 1500,
"temperature": 0.0,
"system": SYSTEM_PROMPT,
"messages": [{"role": "user", "content": prompt}],
}
try:
resp = httpx.post(
ANTHROPIC_URL, headers=headers, json=payload, timeout=45.0
)
resp.raise_for_status()
data = resp.json()
content = data.get("content", [{}])[0].get("text", "")
usage = data.get("usage", {})
start = content.find("[")
end = content.rfind("]") + 1
if start >= 0 and end > start:
return json.loads(content[start:end]), usage
return [], usage
except httpx.TimeoutException:
logger.error("TIMEOUT — skipping batch")
return [], {}
except httpx.HTTPStatusError as e:
if e.response.status_code == 429:
logger.warning("Rate limited — waiting 60s")
time.sleep(60)
else:
logger.error("API error %d", e.response.status_code)
return [], {}
except Exception as e:
logger.error("Failed: %s", e)
return [], {}
def main():
parser = argparse.ArgumentParser()
parser.add_argument("--batch-size", type=int, default=20)
parser.add_argument("--dry-run", action="store_true")
args = parser.parse_args()
engine = create_engine(
DB_URL, connect_args={"options": "-c search_path=compliance,public"}
)
# Load only controls with forbidden tokens. The LIKE clauses in the query
# below must stay in sync with FORBIDDEN_TOKENS.
with engine.connect() as c:
rows = c.execute(text("""
SELECT cc.id, cc.control_id, cc.title,
COALESCE(cc.objective, '') as objective,
cc.generation_metadata->>'merge_group_hint' as hint
FROM canonical_controls cc
WHERE cc.generation_metadata->>'merge_group_hint' IS NOT NULL
AND cc.release_state NOT IN ('deprecated', 'rejected')
AND (
cc.generation_metadata->>'merge_group_hint' LIKE '%:documentation:%'
OR cc.generation_metadata->>'merge_group_hint' LIKE '%:procedure:%'
OR cc.generation_metadata->>'merge_group_hint' LIKE '%:process:%'
OR cc.generation_metadata->>'merge_group_hint' LIKE '%:compliance_reporting:%'
OR cc.generation_metadata->>'merge_group_hint' LIKE '%:records_management:%'
)
""")).fetchall()
controls = []
for uuid, cid, title, objective, hint in rows:
parts = hint.split(":", 2) if hint else []
controls.append({
"uuid": str(uuid), "control_id": cid,
"title": title or "", "objective": objective or "",
"current_hint": hint,
"current_object": parts[1] if len(parts) > 1 else hint,
})
logger.info("Found %d controls with forbidden tokens to re-classify", len(controls))
# Process
total_fixed = 0
total_kept = 0
total_skipped = 0
total_input_tokens = 0
total_output_tokens = 0
corrections = []
change_stats: dict[str, dict[str, int]] = defaultdict(lambda: defaultdict(int))
for i in range(0, len(controls), args.batch_size):
batch = controls[i:i + args.batch_size]
results, usage = call_claude(batch)
total_input_tokens += usage.get("input_tokens", 0)
total_output_tokens += usage.get("output_tokens", 0)
if not results:
total_skipped += len(batch)
continue
result_map = {r.get("id", ""): r for r in results}
for ctrl in batch:
r = result_map.get(ctrl["control_id"], {})
new_token = r.get("token", "")
if not new_token or new_token in FORBIDDEN_TOKENS:
total_kept += 1
continue
old_obj = ctrl["current_object"]
if new_token != old_obj:
total_fixed += 1
parts = ctrl["current_hint"].split(":", 2)
action = parts[0] if parts else "implement"
phase = parts[2] if len(parts) > 2 else "implementation"
corrections.append({
"uuid": ctrl["uuid"],
"old_hint": ctrl["current_hint"],
"new_hint": f"{action}:{new_token}:{phase}",
})
change_stats[old_obj][new_token] += 1
else:
total_kept += 1
processed = min(i + args.batch_size, len(controls))
if processed % 2000 < args.batch_size or processed >= len(controls):
logger.info(
"Progress: %d/%d (fixed=%d kept=%d skip=%d)",
processed, len(controls), total_fixed, total_kept, total_skipped,
)
time.sleep(0.3)
# Report
# USD per million tokens; claude-haiku-4-5 is listed at $1 input / $5 output.
cost_in = total_input_tokens / 1_000_000 * 1.00
cost_out = total_output_tokens / 1_000_000 * 5.00
logger.info("\n" + "=" * 60)
logger.info("GENERIC TOKEN FIX REPORT")
logger.info("=" * 60)
logger.info("Total: %d controls", len(controls))
logger.info("Fixed: %d", total_fixed)
logger.info("Kept: %d (LLM also chose forbidden → kept as-is)", total_kept)
logger.info("Skipped: %d", total_skipped)
logger.info("Cost: $%.2f (Haiku)", cost_in + cost_out)
logger.info("\nTop changes:")
flat = []
for old, news in change_stats.items():
for new, cnt in news.items():
flat.append((cnt, old, new))
for cnt, old, new in sorted(flat, reverse=True)[:30]:
logger.info(" %4d × %s → %s", cnt, old, new)
# Save corrections
CHECKPOINT_DIR.mkdir(parents=True, exist_ok=True)
corr_file = CHECKPOINT_DIR / "corrections_generic_fix.json"
corr_file.write_text(json.dumps(corrections))
logger.info("Saved %d corrections to %s", len(corrections), corr_file)
if args.dry_run:
logger.info("DRY RUN — not updating DB")
return
if corrections:
logger.info("Applying %d corrections...", len(corrections))
with engine.begin() as c:
c.execute(text("SET search_path TO compliance, public"))
for corr in corrections:
c.execute(text("""
UPDATE canonical_controls
SET generation_metadata = jsonb_set(
generation_metadata,
'{merge_group_hint}',
to_jsonb(CAST(:new_hint AS text))
)
WHERE id = CAST(:uuid AS uuid)
"""), {"uuid": corr["uuid"], "new_hint": corr["new_hint"]})
logger.info("Done. %d hints corrected.", len(corrections))
if __name__ == "__main__":
main()
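The query in this script spells out one LIKE clause per forbidden token by hand. As a sketch of an alternative, the clauses could be generated from the token set with bound parameters, so that the SQL and `FORBIDDEN_TOKENS` cannot drift apart. The helper name `build_token_filter` and its `column` parameter are assumptions; the returned pair is meant to be passed to SQLAlchemy as `conn.execute(text(sql), params)`.

```python
def build_token_filter(tokens, column="cc.generation_metadata->>'merge_group_hint'"):
    """Build '(col LIKE :tok_0 OR col LIKE :tok_1 ...)' plus its parameters.

    Hypothetical helper. Tokens are sorted so the generated SQL is stable
    across runs. Note that '_' inside a token remains a single-character
    LIKE wildcard, as it is in the hand-written query.
    """
    ordered = sorted(tokens)
    if not ordered:
        raise ValueError("at least one token is required")
    clauses = []
    params = {}
    for idx, tok in enumerate(ordered):
        name = f"tok_{idx}"
        clauses.append(f"{column} LIKE :{name}")
        params[name] = f"%:{tok}:%"
    return "(" + " OR ".join(clauses) + ")", params


if __name__ == "__main__":
    sql, params = build_token_filter({"process", "documentation"}, column="hint")
    print(sql)
    print(params)
```

Passing the patterns as parameters also keeps the `%` characters out of the SQL text itself.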
@@ -0,0 +1,37 @@
#!/bin/bash
# Run all 10 batches sequentially. Safe: if one fails, the rest don't run.
# Each batch saves corrections to JSON before applying to DB.
#
# Usage: bash /app/scripts/gpre0_run_all.sh
#        bash /app/scripts/gpre0_run_all.sh 5   # start from batch 5
set -e

START=${1:-1}
TOTAL=10

echo "=== Starting from batch $START of $TOTAL ==="
for i in $(seq "$START" "$TOTAL"); do
    echo ""
    echo "================================================================"
    echo "  BATCH $i/$TOTAL - $(date)"
    echo "================================================================"
    # Capture the exit code via `||`. With a plain command, `set -e` would
    # abort the script before the failure message and resume hint are printed.
    EXIT_CODE=0
    PYTHONPATH=/app python3 /app/scripts/gpre0_validate_hints.py \
        --batch-id "$i" \
        --total-batches "$TOTAL" \
        --batch-size 20 || EXIT_CODE=$?
    if [ "$EXIT_CODE" -ne 0 ]; then
        echo "BATCH $i FAILED with exit code $EXIT_CODE"
        echo "Resume with: bash /app/scripts/gpre0_run_all.sh $i"
        exit "$EXIT_CODE"
    fi
    echo "BATCH $i DONE — $(date)"
done
echo ""
echo "ALL $TOTAL BATCHES COMPLETE!"
@@ -0,0 +1,351 @@
#!/usr/bin/env python3
"""
Phase 2: Validate and correct merge_group_hints using Claude Haiku.
Re-classifies each control's object token against the expanded ontology
(74 canonical tokens). Corrects wrong hints in the DB.
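SAFETY: Split into --total-batches slices (default 10; gpre0_run_all.sh runs
all 10). NEVER retries on timeout (double-billing!).
Writes an index checkpoint after each API call; pass --resume to continue
from it. Corrections are held in memory and saved only at the end of a run,
so a resumed run covers only the controls after the checkpoint.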
Usage:
python3 /app/scripts/gpre0_validate_hints.py --batch-id 1 --dry-run
python3 /app/scripts/gpre0_validate_hints.py --batch-id 1
python3 /app/scripts/gpre0_validate_hints.py --batch-id 2
python3 /app/scripts/gpre0_validate_hints.py --batch-id 3
python3 /app/scripts/gpre0_validate_hints.py --batch-id 4
"""
import argparse
import json
import logging
import os
import time
from collections import defaultdict
from pathlib import Path
import httpx
from sqlalchemy import create_engine, text
logging.basicConfig(
level=logging.INFO, format="%(asctime)s [%(levelname)s] %(message)s"
)
logger = logging.getLogger("gpre0-validate")
DB_URL = os.getenv(
"DATABASE_URL",
"postgresql://breakpilot:breakpilot123@postgres:5432/breakpilot_db",
)
ANTHROPIC_API_KEY = os.getenv("ANTHROPIC_API_KEY", "")
ANTHROPIC_MODEL = "claude-haiku-4-5-20251001"
ANTHROPIC_URL = "https://api.anthropic.com/v1/messages"
CHECKPOINT_DIR = Path("/tmp/gpre0_checkpoints")
SYSTEM_PROMPT = """Du bist ein Compliance-Klassifizierer. Ordne jeden Control GENAU EINEM Token zu.
REGEL: Waehle IMMER den naechstbesten Token aus der Liste. OTHER nur wenn ABSOLUT
kein Token auch nur entfernt passt (<1% der Faelle). Im Zweifel: den breitesten
passenden Token waehlen (z.B. "policy" fuer Governance-Dokumente, "procedure" fuer
Ablauf-Definitionen, "risk_management" fuer Bewertungen).
TOKENS:
SECURITY: multi_factor_auth, password_policy, credentials, session_management,
privileged_access, access_control, encryption, transport_encryption,
key_management, certificate_management, network_security, network_segmentation,
firewall, vpn, remote_access, monitoring (NUR Echtzeit-Systemueberwachung),
audit_logging (Protokollierung/Audit Trail), siem, alerting (Meldepflichten),
compliance_audit (externe Pruefungen), vulnerability, patch_management,
backup, disaster_recovery, physical_security, secure_development,
api_security, input_validation, container_security, logging_configuration
DATA_PROTECTION: personal_data (DSGVO-Verarbeitung), sensitive_data (Art.9),
health_data, consent, data_subject_rights, data_retention, data_transfer,
data_breach_notification, dpia, data_processing_agreement, privacy_by_design,
data_processing_register, data_classification, cookie_consent, video_surveillance
GOVERNANCE: policy (Richtlinie definieren), procedure (Verfahren definieren),
process (Betriebsprozess ausfuehren), training (Schulung), awareness,
incident (Vorfallsbehandlung), risk_management, third_party_management,
change_management, documentation, records_management, compliance_reporting,
asset_management, human_resources_security
REGULATORY: supervisory_authority, certification (Zertifizierung/Konformitaet),
product_safety, ai_system, financial_reporting, aml, whistleblowing,
consumer_protection, ecommerce, telecommunications, medical_device,
payment_services, critical_infrastructure, supply_chain_due_diligence,
sustainability_reporting
ABGRENZUNGEN:
- monitoring = NUR Echtzeit-Systemueberwachung, NICHT Audit/Schulung/Bewertung
- audit_logging = Protokollierung, NICHT externe Pruefung (→ compliance_audit)
- procedure = Verfahren DEFINIEREN, NICHT Vorfaelle behandeln (→ incident)
- personal_data = DSGVO-Verarbeitung, NICHT Zertifizierung (→ certification)
- alerting = Meldepflichten, NICHT Vorfallsbehandlung (→ incident)
Antworte NUR als JSON-Array: [{"id":"...","token":"...","conf":0.9}, ...]
KEIN weiterer Text. Nur das Array."""
def call_claude(controls_batch: list[dict]) -> tuple[list[dict], dict]:
"""Send batch to Claude. NO RETRY on timeout (double-billing risk!)."""
items = []
for c in controls_batch:
items.append(
f'- id="{c["control_id"]}" '
f'cur="{c["current_object"]}" '
f't="{c["title"]}" '
f'o="{c["objective"][:100]}"'
)
prompt = "Klassifiziere:\n" + "\n".join(items)
headers = {
"x-api-key": ANTHROPIC_API_KEY,
"anthropic-version": "2023-06-01",
"content-type": "application/json",
}
payload = {
"model": ANTHROPIC_MODEL,
"max_tokens": 1500,
"temperature": 0.0,
"system": SYSTEM_PROMPT,
"messages": [{"role": "user", "content": prompt}],
}
try:
resp = httpx.post(
ANTHROPIC_URL, headers=headers, json=payload, timeout=45.0
)
resp.raise_for_status()
data = resp.json()
content = data.get("content", [{}])[0].get("text", "")
usage = data.get("usage", {})
start = content.find("[")
end = content.rfind("]") + 1
if start >= 0 and end > start:
return json.loads(content[start:end]), usage
logger.warning("No JSON array in response")
return [], usage
except httpx.TimeoutException:
# CRITICAL: Do NOT retry! Log and skip.
logger.error("TIMEOUT — skipping batch (NOT retrying to avoid double-billing)")
return [], {}
except httpx.HTTPStatusError as e:
if e.response.status_code == 429:
logger.warning("Rate limited — waiting 60s then skipping")
time.sleep(60)
else:
logger.error("API error %d — skipping batch", e.response.status_code)
return [], {}
except Exception as e:
logger.error("Request failed — skipping: %s", e)
return [], {}
def load_checkpoint(batch_id: int) -> int:
"""Load last processed index for this batch."""
cp_file = CHECKPOINT_DIR / f"batch_{batch_id}.json"
if cp_file.exists():
data = json.loads(cp_file.read_text())
return data.get("last_index", 0)
return 0
def save_checkpoint(batch_id: int, last_index: int, stats: dict):
"""Save progress checkpoint."""
CHECKPOINT_DIR.mkdir(parents=True, exist_ok=True)
cp_file = CHECKPOINT_DIR / f"batch_{batch_id}.json"
cp_file.write_text(json.dumps({
"batch_id": batch_id,
"last_index": last_index,
**stats,
}))
def main():
parser = argparse.ArgumentParser()
parser.add_argument("--batch-id", type=int, required=True)
parser.add_argument("--total-batches", type=int, default=10)
parser.add_argument("--batch-size", type=int, default=20)
parser.add_argument("--dry-run", action="store_true")
parser.add_argument("--resume", action="store_true",
help="Resume from checkpoint")
args = parser.parse_args()
engine = create_engine(
DB_URL, connect_args={"options": "-c search_path=compliance,public"}
)
# Load ALL control IDs ordered deterministically, then select this batch's slice
with engine.connect() as c:
all_ids = c.execute(text("""
SELECT cc.id
FROM canonical_controls cc
WHERE cc.generation_metadata->>'merge_group_hint' IS NOT NULL
AND cc.generation_metadata->>'merge_group_hint' != ''
AND cc.release_state NOT IN ('deprecated', 'rejected')
ORDER BY cc.id
""")).fetchall()
total = len(all_ids)
chunk = total // args.total_batches
start_idx = (args.batch_id - 1) * chunk
end_idx = total if args.batch_id == args.total_batches else args.batch_id * chunk
batch_ids = [str(r[0]) for r in all_ids[start_idx:end_idx]]
logger.info("Batch %d/%d: controls %d-%d (%d controls of %d total)",
args.batch_id, args.total_batches, start_idx, end_idx, len(batch_ids), total)
# Load full data for this batch
id_list = ",".join(f"'{uid}'" for uid in batch_ids)
with engine.connect() as c:
rows = c.execute(text(f"""
SELECT cc.id, cc.control_id, cc.title,
COALESCE(cc.objective, '') as objective,
cc.generation_metadata->>'merge_group_hint' as hint
FROM canonical_controls cc
WHERE cc.id IN ({id_list})
ORDER BY cc.id
""")).fetchall()
controls = []
for uuid, cid, title, objective, hint in rows:
parts = hint.split(":", 2) if hint else []
controls.append({
"uuid": str(uuid), "control_id": cid,
"title": title or "", "objective": objective or "",
"current_hint": hint, "current_object": parts[1] if len(parts) > 1 else hint,
})
# Resume from checkpoint?
start_from = 0
if args.resume:
start_from = load_checkpoint(args.batch_id)
if start_from > 0:
logger.info("Resuming from index %d", start_from)
# Process
total_same = 0
total_changed = 0
total_other = 0
total_skipped = 0
total_input_tokens = 0
total_output_tokens = 0
corrections: list[dict] = []
change_stats: dict[str, dict[str, int]] = defaultdict(lambda: defaultdict(int))
for i in range(start_from, len(controls), args.batch_size):
batch = controls[i:i + args.batch_size]
results, usage = call_claude(batch)
total_input_tokens += usage.get("input_tokens", 0)
total_output_tokens += usage.get("output_tokens", 0)
if not results:
total_skipped += len(batch)
save_checkpoint(args.batch_id, i + args.batch_size, {
"same": total_same, "changed": total_changed,
"other": total_other, "skipped": total_skipped,
})
continue
result_map = {r.get("id", ""): r for r in results}
for ctrl in batch:
r = result_map.get(ctrl["control_id"], {})
new_token = r.get("token", "")
if not new_token:
total_skipped += 1
continue
old_obj = ctrl["current_object"]
if new_token == "OTHER":
total_other += 1
elif new_token == old_obj:
total_same += 1
else:
total_changed += 1
parts = ctrl["current_hint"].split(":", 2)
action = parts[0] if parts else "implement"
phase = parts[2] if len(parts) > 2 else "implementation"
corrections.append({
"uuid": ctrl["uuid"],
"old_hint": ctrl["current_hint"],
"new_hint": f"{action}:{new_token}:{phase}",
})
change_stats[old_obj][new_token] += 1
# Checkpoint every batch
save_checkpoint(args.batch_id, i + args.batch_size, {
"same": total_same, "changed": total_changed,
"other": total_other, "skipped": total_skipped,
})
processed = min(i + args.batch_size, len(controls))
if processed % 1000 < args.batch_size or processed >= len(controls):
logger.info(
"Batch %d: %d/%d (same=%d changed=%d other=%d skip=%d)",
args.batch_id, processed, len(controls),
total_same, total_changed, total_other, total_skipped,
)
time.sleep(0.3)
# Report
# USD per million tokens; claude-haiku-4-5 is listed at $1 input / $5 output.
cost_in = total_input_tokens / 1_000_000 * 1.00  # Haiku
cost_out = total_output_tokens / 1_000_000 * 5.00  # Haiku
total_cost = cost_in + cost_out
total_proc = total_same + total_changed + total_other
logger.info("\n" + "=" * 60)
logger.info("BATCH %d REPORT", args.batch_id)
logger.info("=" * 60)
logger.info("Processed: %d | Skipped: %d", total_proc, total_skipped)
logger.info("Same: %d (%.1f%%)", total_same, total_same / max(total_proc, 1) * 100)
logger.info("Changed: %d (%.1f%%)", total_changed, total_changed / max(total_proc, 1) * 100)
logger.info("OTHER: %d (%.1f%%)", total_other, total_other / max(total_proc, 1) * 100)
logger.info("Cost: $%.2f (Haiku)", total_cost)
logger.info("Cost/ctrl: $%.5f", total_cost / max(total_proc, 1))
# Top changes
flat = []
for old, news in change_stats.items():
for new, cnt in news.items():
flat.append((cnt, old, new))
logger.info("\nTop Changes:")
for cnt, old, new in sorted(flat, reverse=True)[:20]:
logger.info(" %4d × %s → %s", cnt, old, new)
# Always save corrections to file (recovery safety)
corr_file = CHECKPOINT_DIR / f"corrections_batch_{args.batch_id}.json"
if corrections:
CHECKPOINT_DIR.mkdir(parents=True, exist_ok=True)
corr_file.write_text(json.dumps(corrections))
logger.info("Saved %d corrections to %s", len(corrections), corr_file)
if args.dry_run:
logger.info("\nDRY RUN — not updating DB")
return
# Apply corrections in single transaction
if corrections:
logger.info("\nApplying %d corrections...", len(corrections))
with engine.begin() as c:
c.execute(text("SET search_path TO compliance, public"))
for corr in corrections:
c.execute(text("""
UPDATE canonical_controls
SET generation_metadata = jsonb_set(
generation_metadata,
'{merge_group_hint}',
to_jsonb(CAST(:new_hint AS text))
)
WHERE id = CAST(:uuid AS uuid)
"""), {"uuid": corr["uuid"], "new_hint": corr["new_hint"]})
logger.info("Done. %d hints corrected.", len(corrections))
else:
logger.info("No corrections needed.")
if __name__ == "__main__":
main()
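The cost arithmetic in the batch report can be checked in isolation. The sketch below assumes the per-million-token rates hard-coded in the script (0.80 USD input, 4.00 USD output); the function names `estimate_cost` and `cost_per_control` and the sample token counts are illustrative and do not appear in the script.

```python
def estimate_cost(input_tokens: int, output_tokens: int,
                  in_rate: float = 0.80, out_rate: float = 4.00) -> float:
    """Return the USD cost for the given token counts.

    Rates are USD per one million tokens, copied from the report code.
    """
    return input_tokens / 1_000_000 * in_rate + output_tokens / 1_000_000 * out_rate


def cost_per_control(total_cost: float, processed: int) -> float:
    """Average cost per processed control; max(..., 1) avoids division by zero."""
    return total_cost / max(processed, 1)


# Hypothetical batch: 2.5M input tokens, 0.5M output tokens, 10,000 controls.
total = estimate_cost(2_500_000, 500_000)
print("Cost:      $%.2f" % total)
print("Cost/ctrl: $%.5f" % cost_per_control(total, 10_000))
```

The `max(processed, 1)` guard mirrors the report code, so a batch in which every control was skipped still prints a finite value.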
@@ -0,0 +1,37 @@
#!/usr/bin/env python3
"""G-pre1: Analyze unique objects and test normalization reduction."""
from collections import Counter

from sqlalchemy import create_engine, text

from services.control_dedup import normalize_object

engine = create_engine(
    "postgresql://breakpilot:breakpilot123@postgres:5432/breakpilot_db",
    connect_args={"options": "-c search_path=compliance,public"},
)

with engine.connect() as c:
    rows = c.execute(text("""
        SELECT DISTINCT
            split_part(generation_metadata->>'merge_group_hint', ':', 2) AS obj
        FROM canonical_controls
        WHERE generation_metadata->>'merge_group_hint' IS NOT NULL
          AND generation_metadata->>'merge_group_hint' != ''
    """)).fetchall()

objects = [r[0] for r in rows if r[0] and r[0].strip()]
print("Unique raw objects: %d" % len(objects))

# Count how many raw objects collapse onto each normalized token
norm_counts: Counter = Counter()
for obj in objects:
    norm_counts[normalize_object(obj)] += 1

print("After normalize_object(): %d unique" % len(norm_counts))
# Guard against an empty result set (would otherwise raise ZeroDivisionError)
reduction = (1 - len(norm_counts) / len(objects)) * 100 if objects else 0.0
print("Reduction: %.1f%%" % reduction)
print()
print("Top 20 normalized objects:")
for token, count in norm_counts.most_common(20):
    print(" %5d %s" % (count, token))
print()
print("Singletons (only 1 raw object): %d" % sum(1 for n in norm_counts.values() if n == 1))
print("Groups with 2+ members: %d" % sum(1 for n in norm_counts.values() if n >= 2))
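The reduction figure printed by this script is the share of raw object strings that disappear once normalization maps several spellings to one token. The sketch below reproduces that calculation on made-up data. `toy_normalize` is a hypothetical stand-in (lowercasing and whitespace collapsing only), because the real `normalize_object` from `services.control_dedup` is not shown in this diff.

```python
from collections import Counter


def toy_normalize(obj: str) -> str:
    """Hypothetical stand-in for normalize_object(): lowercase, collapse whitespace."""
    return " ".join(obj.lower().split())


def reduction_stats(objects: list[str]) -> tuple[Counter, float]:
    """Return (normalized token counts, reduction in percent)."""
    norm_counts: Counter = Counter(toy_normalize(o) for o in objects)
    reduction = (1 - len(norm_counts) / len(objects)) * 100 if objects else 0.0
    return norm_counts, reduction


raw = ["Access Log", "access  log", "ACCESS LOG", "Backup Plan"]
counts, reduction = reduction_stats(raw)
print("Unique raw objects: %d" % len(raw))
print("After normalization: %d unique" % len(counts))
print("Reduction: %.1f%%" % reduction)
```

Four raw strings collapse to two tokens here, so the reduction is 50%; a token with count 1 corresponds to what the script reports as a singleton.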
@@ -0,0 +1,219 @@
#!/usr/bin/env python3
"""
G-pre1: Object Clustering via Mini-Batch K-Means on Embeddings.
Clusters ~144k unique normalized objects into ~15-25k semantic groups
using bge-m3 embeddings and Mini-Batch K-Means.
Usage (inside control-pipeline container):
python3 /app/scripts/gpre1_object_clustering.py --k 20000
python3 /app/scripts/gpre1_object_clustering.py --k 20000 --dry-run
"""
import argparse
import json
import logging
import os
import time
from collections import Counter

import httpx
import numpy as np
from sklearn.cluster import MiniBatchKMeans
from sqlalchemy import create_engine, text

logging.basicConfig(level=logging.INFO, format="%(asctime)s [%(levelname)s] %(message)s")
logger = logging.getLogger("gpre1")

DB_URL = os.getenv("DATABASE_URL", "postgresql://breakpilot:breakpilot123@postgres:5432/breakpilot_db")
EMBEDDING_URL = "http://embedding-service:8087"
BATCH_SIZE = 64  # Embeddings per API call
def extract_objects(engine) -> tuple[list[str], dict[str, int]]:
"""Extract unique normalized objects and their frequencies."""
from services.control_dedup import normalize_object
logger.info("Extracting objects from canonical_controls...")
with engine.connect() as c:
rows = c.execute(text("""
SELECT split_part(generation_metadata->>'merge_group_hint', ':', 2) AS obj,
count(*) AS freq
FROM canonical_controls
WHERE generation_metadata->>'merge_group_hint' IS NOT NULL
AND generation_metadata->>'merge_group_hint' != ''
GROUP BY 1
""")).fetchall()
# Normalize and aggregate
norm_freq: Counter = Counter()
norm_to_raw: dict[str, list[str]] = {}
for raw_obj, freq in rows:
if not raw_obj or not raw_obj.strip():
continue
normed = normalize_object(raw_obj)
norm_freq[normed] += freq
norm_to_raw.setdefault(normed, []).append(raw_obj)
objects = list(norm_freq.keys())
freqs = {obj: norm_freq[obj] for obj in objects}
logger.info("Extracted %d unique normalized objects (from %d raw)", len(objects), len(rows))
return objects, freqs
def generate_embeddings(objects: list[str]) -> np.ndarray:
    """Generate embeddings via embedding-service in batches.

    Uses a pre-allocated numpy array to avoid Python list memory overhead
    (a CPython float object is 24 bytes plus an 8-byte list pointer,
    versus 4 bytes per numpy float32).

    Batches that fail all 3 attempts keep their zero rows and are counted
    in the progress log.
    """
    total = len(objects)
    # Pre-allocate: 144k × 1024 × 4 bytes = ~590 MB (vs ~4.7 GB with Python lists)
    result = np.zeros((total, 1024), dtype=np.float32)
    logger.info("Generating embeddings for %d objects (pre-allocated %.0f MB)...",
                total, result.nbytes / 1024 / 1024)
    failed_batches = []
    for i in range(0, total, BATCH_SIZE):
        batch = objects[i:i + BATCH_SIZE]
        for attempt in range(3):  # Up to 3 attempts per batch
            try:
                with httpx.Client(timeout=httpx.Timeout(60.0, connect=10.0)) as client:
                    resp = client.post(
                        f"{EMBEDDING_URL}/embed",
                        json={"texts": batch},
                    )
                    resp.raise_for_status()
                    embeddings = resp.json().get("embeddings", [])
                    end = min(i + len(embeddings), total)
                    result[i:end] = np.array(embeddings, dtype=np.float32)
                break
            except Exception as e:
                if attempt < 2:
                    logger.warning("Batch %d attempt %d failed: %s; retrying", i, attempt + 1, e)
                    time.sleep(2)
                else:
                    logger.error("Batch %d failed after 3 attempts: %s", i, e)
                    failed_batches.append(i)
        done = min(i + BATCH_SIZE, total)
        # Log about every 5000 objects. The previous test, (i + BATCH_SIZE) % 5000 == 0,
        # only fired every 40000 objects because 64 and 5000 share few factors.
        if done % 5000 < BATCH_SIZE or done >= total:
            logger.info(" Embedded %d/%d (%.1f%%) [%d failed]",
                        done, total, done / total * 100, len(failed_batches))
    return result
def cluster_objects(embeddings: np.ndarray, k: int) -> np.ndarray:
"""Run Mini-Batch K-Means clustering."""
logger.info("Clustering %d objects into %d groups (Mini-Batch K-Means)...", len(embeddings), k)
# Normalize embeddings for cosine-like clustering
norms = np.linalg.norm(embeddings, axis=1, keepdims=True)
norms[norms == 0] = 1
normalized = embeddings / norms
kmeans = MiniBatchKMeans(
n_clusters=k,
batch_size=1000,
max_iter=100,
random_state=42,
verbose=0,
)
labels = kmeans.fit_predict(normalized)
logger.info("Clustering done. Inertia: %.2f", kmeans.inertia_)
return labels
def store_results(engine, objects: list[str], freqs: dict[str, int],
labels: np.ndarray, dry_run: bool):
"""Store clustering results in object_groups table."""
# Build groups
groups: dict[int, list[tuple[str, int]]] = {}
for i, obj in enumerate(objects):
gid = int(labels[i])
groups.setdefault(gid, []).append((obj, freqs.get(obj, 0)))
# Pick canonical name (highest frequency in group)
results = []
for gid, members in groups.items():
members_sorted = sorted(members, key=lambda x: -x[1])
canonical = members_sorted[0][0]
results.append({
"group_id": gid,
"canonical_name": canonical,
"member_count": len(members),
"members": json.dumps([m[0] for m in members_sorted]),
"top_controls_count": members_sorted[0][1],
})
# Stats
sizes = [r["member_count"] for r in results]
logger.info("Groups: %d total", len(results))
logger.info(" Singletons: %d", sum(1 for s in sizes if s == 1))
logger.info(" Groups 2-5: %d", sum(1 for s in sizes if 2 <= s <= 5))
logger.info(" Groups 6-20: %d", sum(1 for s in sizes if 6 <= s <= 20))
logger.info(" Groups 21-100: %d", sum(1 for s in sizes if 21 <= s <= 100))
logger.info(" Groups >100: %d", sum(1 for s in sizes if s > 100))
logger.info(" Max group size: %d", max(sizes))
logger.info(" Avg group size: %.1f", sum(sizes) / len(sizes))
# Top 10 largest groups
top10 = sorted(results, key=lambda x: -x["member_count"])[:10]
logger.info("\nTop 10 largest groups:")
for g in top10:
members_list = json.loads(g["members"])
logger.info(" [%d] %s (%d members): %s",
g["group_id"], g["canonical_name"], g["member_count"],
", ".join(members_list[:5]))
if dry_run:
logger.info("DRY RUN — not writing to DB")
return
# Write to DB
with engine.begin() as conn:
conn.execute(text("SET search_path TO compliance, public"))
conn.execute(text("DELETE FROM object_groups")) # Clear old results
for r in results:
conn.execute(text("""
INSERT INTO object_groups (group_id, canonical_name, member_count, members, top_controls_count)
VALUES (:group_id, :canonical_name, :member_count, CAST(:members AS jsonb), :top_controls_count)
"""), r)
logger.info("Wrote %d groups to object_groups table", len(results))
def main():
parser = argparse.ArgumentParser()
parser.add_argument("--k", type=int, default=20000, help="Number of clusters")
parser.add_argument("--dry-run", action="store_true")
args = parser.parse_args()
engine = create_engine(DB_URL, connect_args={"options": "-c search_path=compliance,public"})
# Step 1: Extract
objects, freqs = extract_objects(engine)
# Step 2: Embed
embeddings = generate_embeddings(objects)
logger.info("Embedding matrix: %s (%.1f MB)", embeddings.shape,
embeddings.nbytes / 1024 / 1024)
# Adjust k if we have fewer objects
k = min(args.k, len(objects) // 2)
logger.info("Using k=%d (requested %d, objects=%d)", k, args.k, len(objects))
# Step 3: Cluster
labels = cluster_objects(embeddings, k)
# Step 4: Store
store_results(engine, objects, freqs, labels, args.dry_run)
if __name__ == "__main__":
main()
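`cluster_objects` L2-normalizes the embeddings before running K-Means and describes this as cosine-like clustering. The reasoning is that for unit vectors the squared Euclidean distance equals 2 - 2·cos(a, b), so ranking neighbors by Euclidean distance gives the same order as ranking by cosine similarity. The check below verifies that identity with the standard library only, on made-up 3-dimensional vectors (the real embeddings are 1024-dimensional). The helper names are illustrative.

```python
import math


def l2_normalize(v: list[float]) -> list[float]:
    """Scale v to unit length; a zero vector is returned unchanged (norm treated as 1)."""
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity of two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def squared_distance(a: list[float], b: list[float]) -> float:
    """Squared Euclidean distance."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


a = [3.0, 4.0, 0.0]
b = [1.0, 2.0, 2.0]
lhs = squared_distance(l2_normalize(a), l2_normalize(b))
rhs = 2 - 2 * cosine(a, b)
print(math.isclose(lhs, rhs))
```

One caveat: K-Means centroids are means of unit vectors and are not themselves re-normalized, so the script approximates spherical K-Means rather than implementing it exactly.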
@@ -0,0 +1,203 @@
#!/usr/bin/env python3
"""
G-pre1 INCREMENTAL: Append new objects to object_groups via embedding similarity.
Non-destructive alternative to gpre1_object_clustering.py (which DELETEs and
rebuilds all groups via K-Means). This script:
- Finds objects referenced in atomic controls that are NOT yet in
object_groups.members
- Embeds each unmatched object via bge-m3 (local embedding-service)
- Nearest-neighbor search against existing object_groups.canonical_name
- Cosine >= --threshold (default 0.85) → APPEND to existing group's members
- Cosine < --threshold → CREATE new object_group with next free group_id
Existing groups stay; only members get appended and new groups get added.
Usage (inside control-pipeline container):
python3 /app/scripts/gpre1_object_groups_incremental.py --since 2026-05-18T02:53:00+00:00 --dry-run
python3 /app/scripts/gpre1_object_groups_incremental.py --since 2026-05-18T02:53:00+00:00
python3 /app/scripts/gpre1_object_groups_incremental.py --since 2026-05-18T02:53:00+00:00 --threshold 0.82
"""
import argparse
import json
import logging
import os
from datetime import datetime
import httpx
import numpy as np
from sqlalchemy import create_engine, text
logging.basicConfig(level=logging.INFO, format="%(asctime)s [%(levelname)s] %(message)s")
logger = logging.getLogger("gpre1_inc")
DB_URL = os.getenv("DATABASE_URL", "postgresql://breakpilot:breakpilot123@postgres:5432/breakpilot_db")
EMBEDDING_URL = os.getenv("EMBEDDING_URL", "http://embedding-service:8087")
BATCH_SIZE = 64
def embed_batch(texts: list[str]) -> np.ndarray:
"""Embed a list of strings via bge-m3 embedding-service."""
with httpx.Client(timeout=120.0) as c:
resp = c.post(f"{EMBEDDING_URL}/embed", json={"texts": texts, "normalize": True})
resp.raise_for_status()
return np.array(resp.json()["embeddings"], dtype=np.float32)
def embed_many(texts: list[str], label: str = "") -> np.ndarray:
"""Embed many strings in batches."""
n = len(texts)
out = np.zeros((n, 1024), dtype=np.float32)
for i in range(0, n, BATCH_SIZE):
batch = texts[i:i + BATCH_SIZE]
out[i:i + len(batch)] = embed_batch(batch)
if (i // BATCH_SIZE) % 20 == 0:
logger.info(" %s: %d/%d embedded", label, i + len(batch), n)
return out
def main():
parser = argparse.ArgumentParser()
parser.add_argument("--since", required=True, help="ISO datetime — consider atomics from this date onwards")
parser.add_argument("--threshold", type=float, default=0.85,
help="Cosine threshold for appending to existing group (default 0.85)")
parser.add_argument("--dry-run", action="store_true")
args = parser.parse_args()
since_dt = datetime.fromisoformat(args.since.replace("Z", "+00:00"))
logger.info("Incremental object_groups update since %s, threshold=%.2f, dry_run=%s",
since_dt.isoformat(), args.threshold, args.dry_run)
engine = create_engine(DB_URL, connect_args={"options": "-c search_path=compliance,public"})
# 1. Load existing object_groups (id, canonical_name, members)
with engine.connect() as c:
rows = c.execute(text("""
SELECT group_id, canonical_name, members FROM object_groups
""")).fetchall()
existing_groups = [(r[0], r[1], json.loads(r[2]) if isinstance(r[2], str) else r[2]) for r in rows]
logger.info("Loaded %d existing object_groups", len(existing_groups))
existing_members: set[str] = set()
for _, _, members in existing_groups:
for m in members:
existing_members.add(m)
logger.info("Existing union of members: %d distinct strings", len(existing_members))
# 2. Find unmatched objects from atomics since `since`
from services.control_dedup import normalize_object
with engine.connect() as c:
rows = c.execute(text("""
SELECT DISTINCT split_part(generation_metadata->>'merge_group_hint', ':', 2) AS obj
FROM canonical_controls
WHERE decomposition_method = 'pass0b'
AND created_at >= :since
AND generation_metadata->>'merge_group_hint' IS NOT NULL
AND generation_metadata->>'merge_group_hint' != ''
AND release_state NOT IN ('deprecated', 'rejected', 'duplicate')
"""), {"since": since_dt}).fetchall()
new_objects_raw = [r[0] for r in rows if r[0]]
logger.info("Distinct objects in new atomics: %d", len(new_objects_raw))
# Normalize each + dedupe; track originals → normalized
normed_to_originals: dict[str, set[str]] = {}
for obj in new_objects_raw:
normed = normalize_object(obj)
if not normed:
continue
if normed in existing_members or obj in existing_members:
continue # already in some group
normed_to_originals.setdefault(normed, set()).update([normed, obj])
unmatched_normed = list(normed_to_originals.keys())
logger.info("Unmatched normalized objects: %d", len(unmatched_normed))
if not unmatched_normed:
logger.info("Nothing to do — all objects already mapped.")
return
# 3. Embed existing canonical_names + unmatched objects
logger.info("Embedding %d existing canonical_names...", len(existing_groups))
existing_emb = embed_many([g[1] for g in existing_groups], label="existing")
logger.info("Embedding %d unmatched objects...", len(unmatched_normed))
unmatched_emb = embed_many(unmatched_normed, label="unmatched")
# 4. Nearest-neighbor: for each unmatched, find best existing match
# cosine = dot product (both already L2-normalized)
logger.info("Computing nearest-neighbor matches...")
sims = unmatched_emb @ existing_emb.T # (N_unmatched, N_existing)
best_idx = sims.argmax(axis=1)
best_score = sims.max(axis=1)
appends: dict[int, list[str]] = {} # group_id → list of new members
new_groups: list[tuple[str, list[str]]] = [] # (canonical_name, members)
for i, normed in enumerate(unmatched_normed):
originals = sorted(normed_to_originals[normed])
if best_score[i] >= args.threshold:
gid = existing_groups[int(best_idx[i])][0]
appends.setdefault(gid, []).extend(originals)
else:
# Create a new group with this object as canonical
new_groups.append((normed, originals))
# Stats
distinct_groups_to_extend = len(appends)
total_appends = sum(len(v) for v in appends.values())
logger.info("Plan: extend %d existing groups (+%d members), create %d new groups",
distinct_groups_to_extend, total_appends, len(new_groups))
if args.dry_run:
logger.info("DRY RUN — no writes")
# Sample
if appends:
sample = list(appends.items())[:5]
for gid, members in sample:
gname = next((g[1] for g in existing_groups if g[0] == gid), "?")
logger.info(" Extend group_id=%d (%s) with: %s", gid, gname, members[:3])
if new_groups:
for name, members in new_groups[:5]:
logger.info(" NEW group: %s — members=%s", name, members[:3])
return
# 5. Write — pure INSERT/UPDATE
with engine.begin() as c:
c.execute(text("SET search_path TO compliance, public"))
# UPDATE existing groups (append to members JSONB)
for gid, new_members in appends.items():
c.execute(text("""
UPDATE object_groups
SET members = (
SELECT jsonb_agg(DISTINCT m)
FROM jsonb_array_elements_text(members || CAST(:new_members AS jsonb)) AS x(m)
),
member_count = (
SELECT count(DISTINCT m)
FROM jsonb_array_elements_text(members || CAST(:new_members AS jsonb)) AS x(m)
)
WHERE group_id = :gid
"""), {"gid": gid, "new_members": json.dumps(new_members)})
# INSERT new groups with next free group_id
next_gid_row = c.execute(text("SELECT COALESCE(MAX(group_id), 0) + 1 FROM object_groups")).fetchone()
next_gid = next_gid_row[0] if next_gid_row else 1
for name, members in new_groups:
c.execute(text("""
INSERT INTO object_groups (group_id, canonical_name, member_count, members, top_controls_count)
VALUES (:gid, :name, :count, CAST(:members AS jsonb), 0)
"""), {
"gid": next_gid,
"name": name[:200],
"count": len(members),
"members": json.dumps(members),
})
next_gid += 1
logger.info("DONE — extended %d existing groups (+%d members), created %d new groups",
distinct_groups_to_extend, total_appends, len(new_groups))
if __name__ == "__main__":
main()
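The append-or-create decision in step 4 can be illustrated without numpy. The sketch below assumes every vector is already L2-normalized (as requested from the embedding service with `normalize: True`), so the dot product is the cosine similarity. The function name `assign_objects`, the 2-dimensional vectors, and the object names are made up for illustration.

```python
def assign_objects(unmatched: dict[str, list[float]],
                   groups: dict[int, list[float]],
                   threshold: float = 0.85) -> tuple[dict[int, list[str]], list[str]]:
    """Split unmatched objects into appends to existing groups and new groups.

    unmatched: object name -> unit vector
    groups:    group_id    -> unit vector of the group's canonical_name
    """
    appends: dict[int, list[str]] = {}
    new_groups: list[str] = []
    for name, vec in unmatched.items():
        best_gid, best_score = None, -1.0
        for gid, gvec in groups.items():
            score = sum(x * y for x, y in zip(vec, gvec))  # cosine for unit vectors
            if score > best_score:
                best_gid, best_score = gid, score
        if best_gid is not None and best_score >= threshold:
            appends.setdefault(best_gid, []).append(name)
        else:
            new_groups.append(name)
    return appends, new_groups


groups = {1: [1.0, 0.0], 2: [0.0, 1.0]}
unmatched = {"audit log": [0.96, 0.28], "key rotation": [0.6, 0.8]}

appends, new_groups = assign_objects(unmatched, groups)
print(appends)     # "audit log" scores 0.96 against group 1
print(new_groups)  # "key rotation" peaks at 0.80, below the 0.85 threshold
```

Lowering the threshold moves borderline objects from the new-group list into existing groups, which is the trade-off the `--threshold 0.82` usage line in the docstring hints at.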
@@ -0,0 +1,254 @@
#!/usr/bin/env python3
"""
G-pre1 Refinement: Re-cluster large object groups (master controls with
total_controls > 200) into 4-20 sub-clusters, with k scaled to group size.
Replaces the large master controls with smaller, more specific ones.
"""
import json
import logging
import os
import httpx
import numpy as np
from sklearn.cluster import MiniBatchKMeans
from sqlalchemy import create_engine, text
logging.basicConfig(level=logging.INFO, format="%(asctime)s [%(levelname)s] %(message)s")
logger = logging.getLogger("gpre1-refine")
DB_URL = os.getenv("DATABASE_URL", "postgresql://breakpilot:breakpilot123@postgres:5432/breakpilot_db")
EMBEDDING_URL = "http://embedding-service:8087"
def main():
engine = create_engine(DB_URL, connect_args={"options": "-c search_path=compliance,public"})
# Step 1: Find large master controls and their object_group_ids
with engine.connect() as c:
large_mcs = c.execute(text("""
SELECT mc.master_control_id, mc.object_group_id, mc.canonical_name, mc.total_controls,
og.members, og.member_count
FROM master_controls mc
JOIN object_groups og ON og.group_id = mc.object_group_id
WHERE mc.total_controls > 200
ORDER BY mc.total_controls DESC
""")).fetchall()
logger.info("Found %d large master controls to refine", len(large_mcs))
# Step 2: For each large group, re-cluster the object members
with engine.connect() as c:
max_gid = c.execute(text("SELECT COALESCE(MAX(group_id), 0) FROM object_groups")).scalar()
next_gid = max_gid + 1
groups_to_delete = []
new_groups = []
total_sub = 0
for mc_id, og_id, canonical, total, members_json, member_count in large_mcs:
members = json.loads(members_json) if isinstance(members_json, str) else members_json
if len(members) < 20:
logger.info(" Skip %s (%d members) — too few to split", canonical, len(members))
continue
# Determine k based on group size
k = max(4, min(len(members) // 15, 20)) # 4-20 sub-clusters
# Embed members
embeddings = _embed_texts(members)
if embeddings is None:
logger.error(" Failed to embed %s", canonical)
continue
# Normalize + cluster
norms = np.linalg.norm(embeddings, axis=1, keepdims=True)
norms[norms == 0] = 1
normalized = embeddings / norms
kmeans = MiniBatchKMeans(n_clusters=k, batch_size=min(100, len(members)),
max_iter=50, random_state=42)
labels = kmeans.fit_predict(normalized)
# Build sub-groups
subs: dict[int, list[str]] = {}
for i, member in enumerate(members):
subs.setdefault(int(labels[i]), []).append(member)
for sub_members in subs.values():
new_groups.append({
"group_id": next_gid,
"canonical_name": sub_members[0],
"member_count": len(sub_members),
"members": json.dumps(sub_members),
"top_controls_count": 0,
})
next_gid += 1
total_sub += 1
groups_to_delete.append(og_id)
logger.info(" %s (%s, %d members) → %d sub-groups (k=%d)",
mc_id, canonical, len(members), len(subs), k)
logger.info("Refinement: %d groups → %d sub-groups", len(groups_to_delete), total_sub)
# Step 3: Update DB — replace old object_groups, delete old master_controls
with engine.begin() as c:
c.execute(text("SET search_path TO compliance, public"))
# Delete old master controls and their members for affected groups
for og_id in groups_to_delete:
c.execute(text("""
DELETE FROM master_control_members
WHERE master_control_uuid IN (
SELECT id FROM master_controls WHERE object_group_id = :gid
)
"""), {"gid": og_id})
c.execute(text("DELETE FROM master_controls WHERE object_group_id = :gid"), {"gid": og_id})
c.execute(text("DELETE FROM object_groups WHERE group_id = :gid"), {"gid": og_id})
# Insert new sub-groups
for g in new_groups:
c.execute(text("""
INSERT INTO object_groups (group_id, canonical_name, member_count, members, top_controls_count)
VALUES (:group_id, :canonical_name, :member_count, CAST(:members AS jsonb), :top_controls_count)
"""), g)
logger.info("DB updated: %d old groups deleted, %d new groups inserted", len(groups_to_delete), len(new_groups))
# Step 4: Re-run master control generation for affected groups
logger.info("Re-generating master controls for new sub-groups...")
_regenerate_master_controls(engine, [g["group_id"] for g in new_groups])
# Final stats
with engine.connect() as c:
mc_count = c.execute(text("SELECT count(*) FROM master_controls")).scalar()
og_count = c.execute(text("SELECT count(*) FROM object_groups")).scalar()
large = c.execute(text("SELECT count(*) FROM master_controls WHERE total_controls > 200")).scalar()
logger.info("Final: %d master controls, %d object groups, %d still >200", mc_count, og_count, large)
def _regenerate_master_controls(engine, group_ids: list[int]):
"""Re-create master controls for specific object_group_ids."""
from collections import defaultdict
from services.control_dedup import normalize_object
# Build reverse index for new groups only
object_to_group = {}
with engine.connect() as c:
for gid in group_ids:
row = c.execute(text(
"SELECT group_id, canonical_name, members FROM object_groups WHERE group_id = :gid"
), {"gid": gid}).fetchone()
if row:
members = json.loads(row[2]) if isinstance(row[2], str) else row[2]
for m in members:
object_to_group[m] = (row[0], row[1])
# Load controls for these objects
with engine.connect() as c:
rows = c.execute(text("""
SELECT id, control_id, generation_metadata->>'merge_group_hint' AS hint
FROM canonical_controls
WHERE generation_metadata->>'merge_group_hint' IS NOT NULL
AND release_state NOT IN ('deprecated', 'rejected')
""")).fetchall()
group_phases: dict[int, dict[str, list]] = defaultdict(lambda: defaultdict(list))
group_names: dict[int, str] = {}
for uuid, control_id, hint in rows:
parts = hint.split(":", 2)
if len(parts) < 2:
continue
action, obj = parts[0], parts[1]
phase = parts[2] if len(parts) > 2 else "implementation"
normed = normalize_object(obj)
if normed in object_to_group:
gid, canonical = object_to_group[normed]
elif obj in object_to_group:
gid, canonical = object_to_group[obj]
else:
continue
group_phases[gid][phase].append((str(uuid), control_id, action))
group_names[gid] = canonical
# Create master controls
mc_count = 0
mem_count = 0
with engine.begin() as c:
c.execute(text("SET search_path TO compliance, public"))
for gid, phases in group_phases.items():
if len(phases) < 2:
continue
mc_id = "MC-%d" % gid
canonical = group_names.get(gid, "unknown")
sorted_phases = sorted(phases.keys())
phase_counts = {p: len(ctrls) for p, ctrls in phases.items()}
total = sum(phase_counts.values())
c.execute(text("""
INSERT INTO master_controls
(master_control_id, object_group_id, canonical_name,
phases_covered, phase_control_count, total_controls)
VALUES (:mcid, :gid, :name,
CAST(:phases AS jsonb), CAST(:pcounts AS jsonb), :total)
"""), {
"mcid": mc_id, "gid": gid, "name": canonical,
"phases": json.dumps(sorted_phases),
"pcounts": json.dumps(phase_counts),
"total": total,
})
mc_uuid = c.execute(text(
"SELECT id FROM master_controls WHERE master_control_id = :mcid"
), {"mcid": mc_id}).scalar()
for phase, controls in phases.items():
for ctrl_uuid, ctrl_id, action in controls:
c.execute(text("""
INSERT INTO master_control_members
(master_control_uuid, control_uuid, phase, action)
VALUES (CAST(:mc AS uuid), CAST(:ctrl AS uuid), :phase, :action)
"""), {"mc": str(mc_uuid), "ctrl": ctrl_uuid, "phase": phase, "action": action})
mem_count += 1
mc_count += 1
logger.info("Created %d new master controls with %d members", mc_count, mem_count)
def _embed_texts(texts: list[str]) -> np.ndarray | None:
    """Embed texts in batches of 64 with up to 3 attempts per batch.

    Returns None if any batch fails all attempts, so the caller skips the
    group instead of clustering silently zero-filled rows.
    """
    import time

    result = np.zeros((len(texts), 1024), dtype=np.float32)
    batch_size = 64
    for i in range(0, len(texts), batch_size):
        batch = texts[i:i + batch_size]
        for attempt in range(3):
            try:
                with httpx.Client(timeout=httpx.Timeout(60.0, connect=10.0)) as client:
                    resp = client.post(f"{EMBEDDING_URL}/embed", json={"texts": batch})
                    resp.raise_for_status()
                    embs = resp.json().get("embeddings", [])
                    end = min(i + len(embs), len(texts))
                    result[i:end] = np.array(embs, dtype=np.float32)
                break
            except Exception as e:
                if attempt == 2:
                    logger.error("Embed batch %d failed: %s", i, e)
                    return None
                time.sleep(2)
    return result
if __name__ == "__main__":
main()
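`_regenerate_master_controls` relies on the `merge_group_hint` layout `action:object[:phase]`, defaults a missing phase to `implementation`, and only creates a master control when an object is covered by at least two distinct phases. The sketch below isolates that parsing and filtering. It groups by the raw object string for brevity, whereas the script first maps each object to its object group. The function names and sample hints are hypothetical.

```python
from collections import defaultdict
from typing import Optional


def parse_hint(hint: str) -> Optional[tuple[str, str, str]]:
    """Split 'action:object[:phase]'; return None when no object part exists."""
    parts = hint.split(":", 2)
    if len(parts) < 2:
        return None
    action, obj = parts[0], parts[1]
    phase = parts[2] if len(parts) > 2 else "implementation"
    return action, obj, phase


def group_by_phase(hints: list[str], min_phases: int = 2) -> dict[str, dict[str, list[str]]]:
    """Return object -> phase -> actions, keeping objects with enough distinct phases."""
    by_obj: dict[str, dict[str, list[str]]] = defaultdict(lambda: defaultdict(list))
    for hint in hints:
        parsed = parse_hint(hint)
        if parsed is None:
            continue  # malformed hint without an object part
        action, obj, phase = parsed
        by_obj[obj][phase].append(action)
    return {obj: dict(phases) for obj, phases in by_obj.items() if len(phases) >= min_phases}


hints = [
    "define:access_policy:definition",
    "review:access_policy:review",
    "implement:backup",   # no phase given, falls back to "implementation"
    "malformed",          # no object part, skipped
]
result = group_by_phase(hints)
print(result)
```

Only `access_policy` survives: `backup` has a single phase and therefore would not become a master control.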
@@ -0,0 +1,164 @@
#!/usr/bin/env python3
"""
G-pre1 Step 2: Sub-cluster large object groups (>50 members) into k=4 sub-groups.
Reads existing object_groups, re-embeds members of large groups,
applies K-Means with k=4 per group, and writes sub-groups back.
Usage (inside container or with PYTHONPATH):
python3 /app/scripts/gpre1_subcluster.py
python3 /app/scripts/gpre1_subcluster.py --min-size 100 # only groups >100
python3 /app/scripts/gpre1_subcluster.py --sub-k 6 # 6 sub-clusters
"""
import argparse
import json
import logging
import os
import httpx
import numpy as np
from sklearn.cluster import MiniBatchKMeans
from sqlalchemy import create_engine, text
logging.basicConfig(level=logging.INFO, format="%(asctime)s [%(levelname)s] %(message)s")
logger = logging.getLogger("gpre1-sub")
DB_URL = os.getenv("DATABASE_URL", "postgresql://breakpilot:breakpilot123@postgres:5432/breakpilot_db")
EMBEDDING_URL = "http://embedding-service:8087"
def main():
parser = argparse.ArgumentParser()
parser.add_argument("--min-size", type=int, default=50, help="Min group size to sub-cluster")
parser.add_argument("--sub-k", type=int, default=4, help="Sub-clusters per group")
parser.add_argument("--dry-run", action="store_true")
args = parser.parse_args()
engine = create_engine(DB_URL, connect_args={"options": "-c search_path=compliance,public"})
# Load large groups
with engine.connect() as c:
groups = c.execute(text(
"SELECT group_id, canonical_name, member_count, members "
"FROM object_groups WHERE member_count > :min ORDER BY member_count DESC"
), {"min": args.min_size}).fetchall()
logger.info("Found %d groups with >%d members to sub-cluster", len(groups), args.min_size)
# Find next available group_id
with engine.connect() as c:
max_gid = c.execute(text("SELECT COALESCE(MAX(group_id), 0) FROM object_groups")).scalar()
next_gid = max_gid + 1
total_sub_groups = 0
all_new_rows = []
groups_to_delete = []
for group_id, canonical_name, member_count, members_json in groups:
members = json.loads(members_json) if isinstance(members_json, str) else members_json
if len(members) < args.sub_k * 2:
logger.info(" Skip group %d (%s, %d members) — too small for k=%d",
group_id, canonical_name, len(members), args.sub_k)
continue
# Embed members
embeddings = _embed_batch(members)
if embeddings is None:
logger.error(" Failed to embed group %d (%s)", group_id, canonical_name)
continue
# Normalize for cosine
norms = np.linalg.norm(embeddings, axis=1, keepdims=True)
norms[norms == 0] = 1
normalized = embeddings / norms
# Sub-cluster
k = min(args.sub_k, len(members) // 2)
kmeans = MiniBatchKMeans(n_clusters=k, batch_size=min(100, len(members)),
max_iter=50, random_state=42)
labels = kmeans.fit_predict(normalized)
# Build sub-groups
sub_groups: dict[int, list[str]] = {}
for i, member in enumerate(members):
sub_groups.setdefault(int(labels[i]), []).append(member)
# Create new rows
for sub_id, sub_members in sub_groups.items():
sub_canonical = sub_members[0] # Most frequent would be better but we don't have freq here
all_new_rows.append({
"group_id": next_gid,
"canonical_name": sub_canonical,
"member_count": len(sub_members),
"members": json.dumps(sub_members),
"top_controls_count": 0,
"parent_group_id": group_id,
})
next_gid += 1
groups_to_delete.append(group_id)
total_sub_groups += len(sub_groups)
if len(groups_to_delete) % 50 == 0:
logger.info(" Processed %d/%d groups, %d sub-groups created",
len(groups_to_delete), len(groups), total_sub_groups)
logger.info("Sub-clustering complete: %d groups → %d sub-groups",
len(groups_to_delete), total_sub_groups)
# Stats
sub_sizes = [r["member_count"] for r in all_new_rows]
if sub_sizes:
logger.info(" Sub-group sizes: avg=%.1f, max=%d, min=%d",
sum(sub_sizes) / len(sub_sizes), max(sub_sizes), min(sub_sizes))
if args.dry_run:
logger.info("DRY RUN — not writing to DB")
for r in all_new_rows[:10]:
logger.info(" [%d] %s (%d members)", r["group_id"], r["canonical_name"], r["member_count"])
return
# Write to DB: delete old large groups, insert sub-groups
with engine.begin() as c:
c.execute(text("SET search_path TO compliance, public"))
# Delete old large groups
for gid in groups_to_delete:
c.execute(text("DELETE FROM object_groups WHERE group_id = :gid"), {"gid": gid})
# Insert sub-groups
for r in all_new_rows:
c.execute(text("""
INSERT INTO object_groups (group_id, canonical_name, member_count, members, top_controls_count)
VALUES (:group_id, :canonical_name, :member_count, CAST(:members AS jsonb), :top_controls_count)
"""), r)
logger.info("Wrote %d sub-groups to DB (replaced %d large groups)", len(all_new_rows), len(groups_to_delete))
# Final stats
with engine.connect() as c:
total = c.execute(text("SELECT count(*) FROM object_groups")).scalar()
logger.info("Total groups in DB: %d", total)
def _embed_batch(texts: list[str]) -> np.ndarray | None:
"""Embed a list of texts, return numpy array."""
try:
all_emb = np.zeros((len(texts), 1024), dtype=np.float32)
batch_size = 64
for i in range(0, len(texts), batch_size):
batch = texts[i:i + batch_size]
with httpx.Client(timeout=httpx.Timeout(60.0, connect=10.0)) as client:
resp = client.post(f"{EMBEDDING_URL}/embed", json={"texts": batch})
resp.raise_for_status()
embs = resp.json().get("embeddings", [])
end = min(i + len(embs), len(texts))
all_emb[i:end] = np.array(embs, dtype=np.float32)
return all_emb
except Exception as e:
logger.error("Embedding failed: %s", e)
return None
if __name__ == "__main__":
main()
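The per-group decision in this script comes down to two rules: a group with fewer than `2 * sub_k` members is skipped, and otherwise `k = min(sub_k, len(members) // 2)`. The sketch below restates them as a small pure function. `plan_subclusters` is an illustrative name, not part of the script.

```python
from typing import Optional


def plan_subclusters(member_count: int, sub_k: int = 4) -> Optional[int]:
    """Return the k to use for a group, or None if the group is skipped."""
    if member_count < sub_k * 2:
        return None  # too small to split into sub_k clusters
    return min(sub_k, member_count // 2)


for n in (7, 8, 51, 300):
    print(n, plan_subclusters(n))
```

Once the skip rule has passed, `member_count // 2` is at least `sub_k`, so the `min` always returns `sub_k`. The cap only matters if the skip rule is later relaxed.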
@@ -0,0 +1,214 @@
#!/usr/bin/env python3
"""
G-pre2 v2: Build Master Controls directly from canonical tokens.
No K-Means needed: Phase 2 already normalized merge_group_hints
to 74 canonical tokens. Each token = one object group.
Groups controls by (canonical_token, phase) and creates MCs
for tokens with >=2 distinct phases.
Usage:
python3 /app/scripts/gpre2_direct_mc.py --dry-run
python3 /app/scripts/gpre2_direct_mc.py --min-phases 2
"""
import argparse
import json
import logging
import os
from collections import defaultdict
from sqlalchemy import create_engine, text
logging.basicConfig(
level=logging.INFO, format="%(asctime)s [%(levelname)s] %(message)s"
)
logger = logging.getLogger("gpre2-direct")
DB_URL = os.getenv(
"DATABASE_URL",
"postgresql://breakpilot:breakpilot123@postgres:5432/breakpilot_db",
)
PHASE_ORDER = {
"scope": 0, "definition": 1, "governance": 1,
"design": 2, "implementation": 3, "configuration": 3,
"operation": 4, "training": 4, "monitoring": 5,
"testing": 6, "review": 7, "assessment": 8, "remediation": 8,
"validation": 9, "reporting": 10, "evidence": 11,
}
def main():
parser = argparse.ArgumentParser()
parser.add_argument("--min-phases", type=int, default=2)
parser.add_argument("--dry-run", action="store_true")
args = parser.parse_args()
engine = create_engine(
DB_URL, connect_args={"options": "-c search_path=compliance,public"}
)
# Step 1: Load all controls with merge_group_hint
logger.info("Loading controls...")
with engine.connect() as c:
rows = c.execute(text("""
SELECT id, control_id,
generation_metadata->>'merge_group_hint' AS hint
FROM canonical_controls
WHERE generation_metadata->>'merge_group_hint' IS NOT NULL
AND generation_metadata->>'merge_group_hint' != ''
AND release_state NOT IN ('deprecated', 'rejected')
""")).fetchall()
logger.info("Loaded %d controls", len(rows))
# Step 2: Group by (object_token, phase)
token_phases: dict[str, dict[str, list]] = defaultdict(
lambda: defaultdict(list)
)
for uuid, control_id, hint in rows:
parts = hint.split(":", 2)
if len(parts) < 2:
continue
action = parts[0]
obj = parts[1]
phase = parts[2] if len(parts) > 2 else "implementation"
token_phases[obj][phase].append((str(uuid), control_id, action))
logger.info("Found %d unique object tokens", len(token_phases))
# Step 3: Create Master Controls
master_controls = []
master_members = []
for token, phases in token_phases.items():
if len(phases) < args.min_phases:
continue
sorted_phases = sorted(
phases.keys(), key=lambda p: PHASE_ORDER.get(p, 99)
)
phase_counts = {p: len(ctrls) for p, ctrls in phases.items()}
total = sum(phase_counts.values())
master_controls.append({
"canonical_name": token,
"phases_covered": json.dumps(sorted_phases),
"phase_control_count": json.dumps(phase_counts),
"total_controls": total,
})
for phase, controls in phases.items():
for ctrl_uuid, ctrl_id, action in controls:
master_members.append({
"canonical_name": token,
"control_uuid": ctrl_uuid,
"phase": phase,
"action": action,
})
logger.info(
"Created %d Master Controls with %d members (min %d phases)",
len(master_controls), len(master_members), args.min_phases,
)
# Stats
if master_controls:
counts = [mc["total_controls"] for mc in master_controls]
phases_per = [
len(json.loads(mc["phases_covered"])) for mc in master_controls
]
logger.info(" Avg controls/MC: %.1f", sum(counts) / len(counts))
logger.info(" Max controls/MC: %d", max(counts))
logger.info(" Avg phases/MC: %.1f", sum(phases_per) / len(phases_per))
logger.info(" Max phases/MC: %d", max(phases_per))
# Size distribution
logger.info("\n Size distribution:")
logger.info(" ≤10: %d", sum(1 for c in counts if c <= 10))
logger.info(" 11-50: %d", sum(1 for c in counts if 11 <= c <= 50))
logger.info(" 51-200: %d", sum(1 for c in counts if 51 <= c <= 200))
logger.info(" 201-500: %d", sum(1 for c in counts if 201 <= c <= 500))
logger.info(" 501-2K: %d", sum(1 for c in counts if 501 <= c <= 2000))
logger.info(" >2K: %d", sum(1 for c in counts if c > 2000))
# Top 15
top = sorted(master_controls, key=lambda x: -x["total_controls"])[:15]
logger.info("\n Top 15 Master Controls:")
for mc in top:
logger.info(
" %6d %s (%d phases)",
mc["total_controls"],
mc["canonical_name"],
len(json.loads(mc["phases_covered"])),
)
if args.dry_run:
logger.info("\nDRY RUN — not writing to DB")
return
# Step 4: Write to DB
with engine.begin() as c:
c.execute(text("SET search_path TO compliance, public"))
c.execute(text("DELETE FROM master_control_members"))
c.execute(text("DELETE FROM master_controls"))
# Get next object_group_id
max_gid = c.execute(
text("SELECT COALESCE(MAX(group_id), 0) FROM object_groups")
).scalar()
next_gid = max_gid + 1
mc_uuids = {}
for mc in master_controls:
gid = next_gid
next_gid += 1
mc_id = f"MC-{gid}"
c.execute(text("""
INSERT INTO master_controls
(master_control_id, object_group_id, canonical_name,
phases_covered, phase_control_count, total_controls)
VALUES (:mcid, :gid, :name,
CAST(:phases AS jsonb),
CAST(:pcounts AS jsonb), :total)
"""), {
"mcid": mc_id, "gid": gid,
"name": mc["canonical_name"],
"phases": mc["phases_covered"],
"pcounts": mc["phase_control_count"],
"total": mc["total_controls"],
})
mc_uuid = c.execute(text(
"SELECT id FROM master_controls WHERE master_control_id = :mcid"
), {"mcid": mc_id}).scalar()
mc_uuids[mc["canonical_name"]] = str(mc_uuid)
# Insert members
mem_count = 0
for mem in master_members:
mc_uuid = mc_uuids.get(mem["canonical_name"])
if not mc_uuid:
continue
c.execute(text("""
INSERT INTO master_control_members
(master_control_uuid, control_uuid, phase, action)
VALUES (CAST(:mc AS uuid), CAST(:ctrl AS uuid),
:phase, :action)
"""), {
"mc": mc_uuid,
"ctrl": mem["control_uuid"],
"phase": mem["phase"],
"action": mem["action"],
})
mem_count += 1
logger.info("Wrote %d MCs + %d members to DB", len(master_controls), mem_count)
if __name__ == "__main__":
main()
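# A minimal sketch of the grouping rule described in the docstring above:
# hints are assumed to have the form "action:object[:phase]", a missing phase
# falls back to "implementation", and only tokens with at least min_phases
# distinct phases are kept. The helper names and the sample hints are
# hypothetical and exist only for illustration.

```python
from collections import defaultdict

DEFAULT_PHASE = "implementation"  # same fallback the script uses


def parse_hint(hint: str):
    """Split 'action:object[:phase]'; return None for malformed hints."""
    parts = hint.split(":", 2)
    if len(parts) < 2:
        return None
    phase = parts[2] if len(parts) > 2 else DEFAULT_PHASE
    return parts[0], parts[1], phase


def group_by_token(hints, min_phases=2):
    """Map token -> {phase: [action, ...]} for tokens with enough phases."""
    grouped = defaultdict(lambda: defaultdict(list))
    for hint in hints:
        parsed = parse_hint(hint)
        if parsed is None:
            continue  # hints without an object part are skipped
        action, token, phase = parsed
        grouped[token][phase].append(action)
    return {
        token: dict(phases)
        for token, phases in grouped.items()
        if len(phases) >= min_phases
    }


# Hypothetical sample hints, invented for illustration only.
sample = [
    "define:access_policy:definition",
    "review:access_policy:review",
    "encrypt:backup",   # no phase given, so "implementation" is assumed
    "malformed",        # no object part, skipped
]
result = group_by_token(sample)
```

# Here "backup" is dropped because it covers a single phase; lowering
# min_phases to 1 would keep it.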
@@ -0,0 +1,213 @@
#!/usr/bin/env python3
"""
G-pre2: Build Master Controls from Object Groups + Lifecycle Phases.
Groups atomic controls by (object_group_id, phase) and creates
Master Controls for groups with >=2 distinct phases.
Usage:
python3 /app/scripts/gpre2_master_controls.py
python3 /app/scripts/gpre2_master_controls.py --min-phases 3
python3 /app/scripts/gpre2_master_controls.py --dry-run
"""
import argparse
import json
import logging
import os
from collections import defaultdict
from sqlalchemy import create_engine, text
logging.basicConfig(level=logging.INFO, format="%(asctime)s [%(levelname)s] %(message)s")
logger = logging.getLogger("gpre2")
DB_URL = os.getenv("DATABASE_URL", "postgresql://breakpilot:breakpilot123@postgres:5432/breakpilot_db")
# Canonical phase ordering for lifecycle chains
PHASE_ORDER = {
"scope": 0,
"definition": 1, "governance": 1,
"design": 2,
"implementation": 3, "configuration": 3,
"operation": 4, "training": 4,
"monitoring": 5,
"testing": 6,
"review": 7,
"assessment": 8, "remediation": 8,
"validation": 9,
"reporting": 10,
"evidence": 11,
}
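# A short sketch of how this ordering is applied further down: phases are
# sorted by their PHASE_ORDER rank, and any phase missing from the table gets
# rank 99 and therefore sorts last. sort_phases and the "custom_phase" value
# are hypothetical names used only for this illustration.

```python
PHASE_ORDER = {
    "scope": 0,
    "definition": 1, "governance": 1,
    "design": 2,
    "implementation": 3, "configuration": 3,
    "operation": 4, "training": 4,
    "monitoring": 5,
    "testing": 6,
    "review": 7,
    "assessment": 8, "remediation": 8,
    "validation": 9,
    "reporting": 10,
    "evidence": 11,
}


def sort_phases(phases):
    """Order phases by lifecycle rank; unknown phases sort last (rank 99)."""
    return sorted(phases, key=lambda p: PHASE_ORDER.get(p, 99))


ordered = sort_phases(
    ["review", "custom_phase", "scope", "governance", "definition"]
)
```

# Phases that share a rank (for example "governance" and "definition") keep
# their input order, because Python's sort is stable.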
def main():
parser = argparse.ArgumentParser()
parser.add_argument("--min-phases", type=int, default=2, help="Min distinct phases for Master Control")
parser.add_argument("--dry-run", action="store_true")
args = parser.parse_args()
engine = create_engine(DB_URL, connect_args={"options": "-c search_path=compliance,public"})
# Step 1: Build reverse index (object_token → group_id)
logger.info("Building object → group_id reverse index...")
object_to_group = {}
with engine.connect() as c:
groups = c.execute(text("SELECT group_id, canonical_name, members FROM object_groups")).fetchall()
for gid, canonical, members_json in groups:
members = json.loads(members_json) if isinstance(members_json, str) else members_json
for member in members:
object_to_group[member] = (gid, canonical)
logger.info("Reverse index: %d objects → %d groups", len(object_to_group), len(groups))
# Step 2: Load all controls with merge_group_hint
logger.info("Loading controls with merge_group_hint...")
with engine.connect() as c:
rows = c.execute(text("""
SELECT id, control_id,
generation_metadata->>'merge_group_hint' AS hint,
title
FROM canonical_controls
WHERE generation_metadata->>'merge_group_hint' IS NOT NULL
AND generation_metadata->>'merge_group_hint' != ''
AND release_state NOT IN ('deprecated', 'rejected')
""")).fetchall()
logger.info("Loaded %d controls with merge_group_hint", len(rows))
# Step 3: Parse and group by (group_id, phase)
# Structure: group_id → {phase → [(control_uuid, control_id, action, title)]}
group_phases: dict[int, dict[str, list]] = defaultdict(lambda: defaultdict(list))
group_names: dict[int, str] = {}
unmatched = 0
# Import once, before the loop, instead of on every iteration
from services.control_dedup import normalize_object
for uuid, control_id, hint, title in rows:
parts = hint.split(":", 2)
if len(parts) < 2:
continue
action = parts[0]
obj = parts[1]
phase = parts[2] if len(parts) > 2 else "implementation"
# Normalize object and find group
normed = normalize_object(obj)
if normed in object_to_group:
gid, canonical = object_to_group[normed]
elif obj in object_to_group:
gid, canonical = object_to_group[obj]
else:
unmatched += 1
continue
group_phases[gid][phase].append((str(uuid), control_id, action, title))
group_names[gid] = canonical
logger.info("Grouped into %d object groups (%d controls unmatched to any group)",
len(group_phases), unmatched)
# Step 4: Create Master Controls (groups with >= min_phases distinct phases)
master_controls = []
master_members = []
mc_counter = 0
for gid, phases in group_phases.items():
if len(phases) < args.min_phases:
continue
mc_counter += 1
mc_id = "MC-%d" % gid
canonical = group_names.get(gid, "unknown")
# Sort phases by lifecycle order
sorted_phases = sorted(phases.keys(), key=lambda p: PHASE_ORDER.get(p, 99))
phase_counts = {p: len(ctrls) for p, ctrls in phases.items()}
total = sum(phase_counts.values())
master_controls.append({
"master_control_id": mc_id,
"object_group_id": gid,
"canonical_name": canonical,
"phases_covered": json.dumps(sorted_phases),
"phase_control_count": json.dumps(phase_counts),
"total_controls": total,
})
for phase, controls in phases.items():
for ctrl_uuid, ctrl_id, action, title in controls:
master_members.append({
"mc_id": mc_id,
"control_uuid": ctrl_uuid,
"phase": phase,
"action": action,
})
logger.info("Created %d Master Controls with %d members (min %d phases)",
len(master_controls), len(master_members), args.min_phases)
# Stats
if master_controls:
# Per-MC totals (named separately so the per-phase dict above is not shadowed)
totals = [mc["total_controls"] for mc in master_controls]
phases_per_mc = [len(json.loads(mc["phases_covered"])) for mc in master_controls]
logger.info(" Avg controls per MC: %.1f", sum(totals) / len(totals))
logger.info(" Avg phases per MC: %.1f", sum(phases_per_mc) / len(phases_per_mc))
logger.info(" Max controls in MC: %d", max(totals))
logger.info(" Max phases in MC: %d", max(phases_per_mc))
# Top 10
top10 = sorted(master_controls, key=lambda x: -x["total_controls"])[:10]
logger.info("\nTop 10 Master Controls:")
for mc in top10:
logger.info(" %s: %s (%d controls, phases: %s)",
mc["master_control_id"], mc["canonical_name"],
mc["total_controls"], mc["phases_covered"])
if args.dry_run:
logger.info("DRY RUN — not writing to DB")
return
# Step 5: Write to DB
with engine.begin() as c:
c.execute(text("SET search_path TO compliance, public"))
c.execute(text("DELETE FROM master_control_members"))
c.execute(text("DELETE FROM master_controls"))
for mc in master_controls:
c.execute(text("""
INSERT INTO master_controls
(master_control_id, object_group_id, canonical_name,
phases_covered, phase_control_count, total_controls)
VALUES (:master_control_id, :object_group_id, :canonical_name,
CAST(:phases_covered AS jsonb), CAST(:phase_control_count AS jsonb),
:total_controls)
"""), mc)
# Get MC UUIDs for member inserts
mc_uuids = {}
for row in c.execute(text("SELECT id, master_control_id FROM master_controls")).fetchall():
mc_uuids[row[1]] = str(row[0])
for mem in master_members:
mc_uuid = mc_uuids.get(mem["mc_id"])
if not mc_uuid:
continue
c.execute(text("""
INSERT INTO master_control_members
(master_control_uuid, control_uuid, phase, action)
VALUES (CAST(:mc_uuid AS uuid), CAST(:control_uuid AS uuid), :phase, :action)
"""), {
"mc_uuid": mc_uuid,
"control_uuid": mem["control_uuid"],
"phase": mem["phase"],
"action": mem["action"],
})
logger.info("Wrote %d Master Controls + %d members to DB",
len(master_controls), len(master_members))
if __name__ == "__main__":
main()
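# A minimal sketch of the reverse-index lookup used in steps 1 and 3 above:
# every member token of an object group points back to (group_id,
# canonical_name), and a control's object is resolved by trying the normalized
# token first and the raw token second. The normalize function below is a
# hypothetical stand-in; the real services.control_dedup.normalize_object is
# not shown in this diff and may behave differently.

```python
def normalize(token: str) -> str:
    """Hypothetical stand-in for services.control_dedup.normalize_object."""
    return token.strip().lower().replace(" ", "_")


def build_reverse_index(groups):
    """groups: iterable of (group_id, canonical_name, members)."""
    index = {}
    for gid, canonical, members in groups:
        for member in members:
            # If a member is listed in two groups, the later group wins.
            index[member] = (gid, canonical)
    return index


def resolve_group(obj, index):
    """Try the normalized token, then the raw token; None if unmatched."""
    normed = normalize(obj)
    if normed in index:
        return index[normed]
    return index.get(obj)


# Hypothetical object groups, invented for illustration only.
groups = [
    (1, "access_policy", ["access_policy", "access_rules"]),
    (2, "backup", ["backup", "Backup Copy"]),
]
index = build_reverse_index(groups)
```

# Controls whose object resolves to None correspond to the "unmatched" counter
# in the script and never become members of a Master Control.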
@@ -0,0 +1,267 @@
#!/usr/bin/env python3
"""
G-pre2 INCREMENTAL: Add new atomic controls to Master Controls without rebuild.
Unlike gpre2_master_controls.py which DELETEs and rebuilds the entire
master_controls table, this script is non-destructive:
- Existing master_controls stay untouched (same UUIDs, same MC-IDs)
- For each object_group that gained new atomic controls:
* If MC exists: append new members + update total_controls/phase_counts
* If MC missing AND group now has >= min_phases: create new MC + all members
Usage:
python3 /app/scripts/gpre2_master_controls_incremental.py --since 2026-05-18T02:53:00+00:00
python3 /app/scripts/gpre2_master_controls_incremental.py --since 2026-05-18T02:53:00+00:00 --dry-run
python3 /app/scripts/gpre2_master_controls_incremental.py --since 2026-05-18T02:53:00+00:00 --min-phases 2
"""
import argparse
import json
import logging
import os
from collections import defaultdict
from datetime import datetime
from sqlalchemy import create_engine, text
logging.basicConfig(level=logging.INFO, format="%(asctime)s [%(levelname)s] %(message)s")
logger = logging.getLogger("gpre2_incremental")
DB_URL = os.getenv("DATABASE_URL", "postgresql://breakpilot:breakpilot123@postgres:5432/breakpilot_db")
def main():
parser = argparse.ArgumentParser()
parser.add_argument("--since", required=True, help="ISO datetime — only consider atomics created at/after this")
parser.add_argument("--min-phases", type=int, default=2, help="Min distinct phases to form a new MC (default 2)")
parser.add_argument("--dry-run", action="store_true")
args = parser.parse_args()
since_dt = datetime.fromisoformat(args.since.replace("Z", "+00:00"))
logger.info("Incremental run since %s, min_phases=%d, dry_run=%s",
since_dt.isoformat(), args.min_phases, args.dry_run)
engine = create_engine(DB_URL, connect_args={"options": "-c search_path=compliance,public"})
# Step 1: object → group_id reverse index
object_to_group = {}
with engine.connect() as c:
groups = c.execute(text("SELECT group_id, canonical_name, members FROM object_groups")).fetchall()
for gid, canonical, members_json in groups:
members = json.loads(members_json) if isinstance(members_json, str) else members_json
for member in members:
object_to_group[member] = (gid, canonical)
logger.info("Reverse index: %d objects → %d groups", len(object_to_group), len(groups))
# Step 2: Load ALL atomics with merge_group_hint (we need full picture)
with engine.connect() as c:
all_rows = c.execute(text("""
SELECT id, control_id,
generation_metadata->>'merge_group_hint' AS hint,
title,
created_at
FROM canonical_controls
WHERE generation_metadata->>'merge_group_hint' IS NOT NULL
AND generation_metadata->>'merge_group_hint' != ''
AND release_state NOT IN ('deprecated', 'rejected', 'duplicate')
""")).fetchall()
logger.info("Loaded %d atomic controls total", len(all_rows))
# Step 3: Build group_phases (gid → phase → [(uuid, control_id, action, title, is_new)])
from services.control_dedup import normalize_object
group_phases: dict[int, dict[str, list]] = defaultdict(lambda: defaultdict(list))
group_names: dict[int, str] = {}
new_atomic_count = 0
new_groups_touched: set[int] = set()
unmatched = 0
for uuid, control_id, hint, title, created_at in all_rows:
parts = hint.split(":", 2)
if len(parts) < 2:
continue
action = parts[0]
obj = parts[1]
phase = parts[2] if len(parts) > 2 else "implementation"
normed = normalize_object(obj)
if normed in object_to_group:
gid, canonical = object_to_group[normed]
elif obj in object_to_group:
gid, canonical = object_to_group[obj]
else:
unmatched += 1
continue
is_new = created_at >= since_dt
group_phases[gid][phase].append((str(uuid), control_id, action, title, is_new))
group_names[gid] = canonical
if is_new:
new_atomic_count += 1
new_groups_touched.add(gid)
logger.info("Total: %d new atomics across %d object_groups (%d unmatched)",
new_atomic_count, len(new_groups_touched), unmatched)
if not new_groups_touched:
logger.info("Nothing to do — no new atomics matched to any object_group.")
return
# Step 4: For each touched object_group, decide action
stats = {
"groups_examined": len(new_groups_touched),
"mcs_existing_updated": 0,
"mcs_new_created": 0,
"members_inserted": 0,
"members_skipped_existing": 0,
"groups_skipped_below_min_phases": 0,
"groups_skipped_no_member_change": 0,
}
# Load existing master_controls index: master_control_id → (uuid, total_controls)
with engine.connect() as c:
mc_index = {row[1]: (str(row[0]), row[2]) for row in c.execute(text(
"SELECT id, master_control_id, total_controls FROM master_controls"
)).fetchall()}
logger.info("Existing master_controls: %d", len(mc_index))
# Load existing members for touched MCs (avoid duplicate inserts)
touched_mc_ids = ["MC-%d" % gid for gid in new_groups_touched]
existing_members: dict[str, set[str]] = defaultdict(set)
with engine.connect() as c:
for mc_id_str in touched_mc_ids:
mc_uuid_info = mc_index.get(mc_id_str)
if not mc_uuid_info:
continue
mc_uuid = mc_uuid_info[0]
for row in c.execute(text(
"SELECT control_uuid FROM master_control_members WHERE master_control_uuid = CAST(:u AS uuid)"
), {"u": mc_uuid}).fetchall():
existing_members[mc_id_str].add(str(row[0]))
# Build INSERT/UPDATE plans
inserts_new_mcs = []
inserts_members = []
updates_mcs = []
PHASE_ORDER = {
"scope": 0, "definition": 1, "governance": 1, "design": 2,
"implementation": 3, "configuration": 3, "operation": 4, "training": 4,
"monitoring": 5, "testing": 6, "review": 7, "assessment": 8,
"remediation": 8, "validation": 9, "reporting": 10, "evidence": 11,
}
for gid in new_groups_touched:
mc_id_str = "MC-%d" % gid
phases = group_phases[gid]
canonical = group_names[gid]
all_phases = sorted(phases.keys(), key=lambda p: PHASE_ORDER.get(p, 99))
phase_counts = {p: len(ctrls) for p, ctrls in phases.items()}
total = sum(phase_counts.values())
existing_mc = mc_index.get(mc_id_str)
if existing_mc:
# MC exists: append every atomic of this group that is not already a member
# (is_new is not checked here, so older atomics missing from the MC are added too)
mc_uuid = existing_mc[0]
existing_set = existing_members[mc_id_str]
added_for_this_mc = 0
for phase, controls in phases.items():
for ctrl_uuid, ctrl_id, action, title, is_new in controls:
if ctrl_uuid in existing_set:
stats["members_skipped_existing"] += 1
continue
inserts_members.append({
"mc_uuid": mc_uuid, "control_uuid": ctrl_uuid,
"phase": phase, "action": action,
})
stats["members_inserted"] += 1
added_for_this_mc += 1
if added_for_this_mc > 0:
updates_mcs.append({
"mc_uuid": mc_uuid,
"phases_covered": json.dumps(all_phases),
"phase_control_count": json.dumps(phase_counts),
"total_controls": total,
})
stats["mcs_existing_updated"] += 1
else:
stats["groups_skipped_no_member_change"] += 1
else:
# MC missing — create only if group now meets min_phases threshold
if len(phases) < args.min_phases:
stats["groups_skipped_below_min_phases"] += 1
continue
inserts_new_mcs.append({
"master_control_id": mc_id_str,
"object_group_id": gid,
"canonical_name": canonical,
"phases_covered": json.dumps(all_phases),
"phase_control_count": json.dumps(phase_counts),
"total_controls": total,
"_members": [
{"control_uuid": c[0], "phase": p, "action": c[2]}
for p, ctrls in phases.items() for c in ctrls
],
})
stats["mcs_new_created"] += 1
logger.info("Plan summary: %s", stats)
if args.dry_run:
logger.info("DRY RUN — no writes")
# Show first few examples
if inserts_new_mcs:
logger.info("Sample NEW MCs (up to 5):")
for mc in inserts_new_mcs[:5]:
logger.info(" %s: %s — total=%d, phases=%s",
mc["master_control_id"], mc["canonical_name"],
mc["total_controls"], mc["phases_covered"])
if updates_mcs:
logger.info("Updates to existing MCs: %d", len(updates_mcs))
return
# Step 5: WRITE — strictly INSERT/UPDATE, no DELETE
with engine.begin() as c:
c.execute(text("SET search_path TO compliance, public"))
# 5a: Insert new MCs + their members
for mc in inserts_new_mcs:
new_uuid_row = c.execute(text("""
INSERT INTO master_controls
(master_control_id, object_group_id, canonical_name,
phases_covered, phase_control_count, total_controls)
VALUES (:master_control_id, :object_group_id, :canonical_name,
CAST(:phases_covered AS jsonb), CAST(:phase_control_count AS jsonb),
:total_controls)
RETURNING id
"""), {k: v for k, v in mc.items() if k != "_members"}).fetchone()
new_mc_uuid = str(new_uuid_row[0])
for mem in mc["_members"]:
c.execute(text("""
INSERT INTO master_control_members
(master_control_uuid, control_uuid, phase, action)
VALUES (CAST(:mc_uuid AS uuid), CAST(:control_uuid AS uuid), :phase, :action)
"""), {"mc_uuid": new_mc_uuid, **mem})
# 5b: Append new members to existing MCs
for mem in inserts_members:
c.execute(text("""
INSERT INTO master_control_members
(master_control_uuid, control_uuid, phase, action)
VALUES (CAST(:mc_uuid AS uuid), CAST(:control_uuid AS uuid), :phase, :action)
"""), mem)
# 5c: Update phase counts / totals on touched existing MCs
for upd in updates_mcs:
c.execute(text("""
UPDATE master_controls
SET phases_covered = CAST(:phases_covered AS jsonb),
phase_control_count = CAST(:phase_control_count AS jsonb),
total_controls = :total_controls
WHERE id = CAST(:mc_uuid AS uuid)
"""), upd)
logger.info("DONE — wrote %d new MCs, updated %d existing MCs, %d members inserted",
stats["mcs_new_created"], stats["mcs_existing_updated"], stats["members_inserted"])
if __name__ == "__main__":
main()
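# A sketch of two details of the incremental run. First, comparing created_at
# with the --since value raises TypeError when one datetime carries a timezone
# and the other does not; whether created_at is timezone-aware depends on the
# column type, which this diff does not show, so treating a missing offset as
# UTC is an assumption. Second, the member plan is a plain "not yet in the
# set" filter. parse_since and plan_member_inserts are hypothetical helpers.

```python
from datetime import datetime, timezone


def parse_since(value: str) -> datetime:
    """Parse an ISO timestamp; a missing offset is assumed to mean UTC."""
    dt = datetime.fromisoformat(value.replace("Z", "+00:00"))
    if dt.tzinfo is None:
        dt = dt.replace(tzinfo=timezone.utc)
    return dt


def plan_member_inserts(group_controls, existing_members):
    """Return the control UUIDs of a group that are not yet MC members."""
    return [u for u in group_controls if u not in existing_members]


since = parse_since("2026-05-18T02:53:00")  # no offset given
created_at = datetime(2026, 5, 19, 8, 0, tzinfo=timezone.utc)
is_new = created_at >= since  # safe: both sides are timezone-aware

to_insert = plan_member_inserts(["a", "b", "c"], {"b"})
```

# Because the filter only looks at membership, re-running with the same
# --since value plans no duplicate member rows.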
@@ -0,0 +1,298 @@
#!/usr/bin/env python3
"""
G-pre3: Split large Master Controls by regulation source.
For each MC with more than --threshold controls (default 200):
1. Load member controls with the parent's source_citation->>'source'
2. Group by regulation source
3. Sources with >= --min-source controls get their own sub-MC
4. Smaller sources are merged into a "mixed" bucket
5. UNKNOWN (no source_citation) controls go into a "general" bucket
6. Any bucket larger than --max-mc is sub-clustered by embedding
7. Delete the original large MC and create the new sub-MCs
Usage:
python3 /app/scripts/gpre3_regulation_split.py --dry-run
python3 /app/scripts/gpre3_regulation_split.py --min-source 15 --max-mc 100
"""
import argparse
import json
import logging
import os
import re
from collections import defaultdict
from sqlalchemy import create_engine, text
from services.embedding_utils import subcluster_controls
logging.basicConfig(
level=logging.INFO, format="%(asctime)s [%(levelname)s] %(message)s"
)
logger = logging.getLogger("gpre3")
DB_URL = os.getenv(
"DATABASE_URL",
"postgresql://breakpilot:breakpilot123@postgres:5432/breakpilot_db",
)
# ── Source key normalization ────────────────────────────────────────
# fmt: off
_SOURCE_SHORT: dict[str, str] = {
"DSGVO (EU) 2016/679": "dsgvo", "NIS2-Richtlinie (EU) 2022/2555": "nis2",
"KI-Verordnung (EU) 2024/1689": "ai_act", "Cyber Resilience Act (CRA)": "cra",
"Digital Services Act (DSA)": "dsa", "Digital Markets Act (DMA)": "dma",
"Digital Operational Resilience Act": "dora", "Data Governance Act (DGA)": "dga",
"Data Act": "data_act", "Maschinenverordnung (EU) 2023/1230": "machinery_reg",
"Medizinprodukteverordnung (EU) 2017/745 (MDR)": "mdr",
"European Health Data Space": "ehds", "European Accessibility Act": "eaa",
"EU Cybersecurity Act": "eu_csa", "EU Blue Guide 2022": "eu_blue_guide",
"EU-US Data Privacy Framework": "eu_us_dpf", "Markets in Crypto-Assets (MiCA)": "mica",
"Standardvertragsklauseln (SCC)": "scc", "ePrivacy-Richtlinie": "eprivacy",
"Batterieverordnung (EU) 2023/1542": "battery_reg",
"Bundesdatenschutzgesetz (BDSG)": "bdsg",
"BSI-Gesetz (BSIG 2025, NIS2-Umsetzung)": "bsig",
"BSI-Kritisverordnung (BSI-KritisV)": "bsi_kritisv",
"Geldwaeschegesetz (GwG)": "gwg", "Hinweisgeberschutzgesetz (HinSchG)": "hinschg",
"Lieferkettensorgfaltspflichtengesetz (LkSG)": "lksg",
"KRITIS-Dachgesetz (KRITISDachG)": "kritisdachg",
"NIST SP 800-53 Rev. 5": "nist_800_53", "NIST Cybersecurity Framework 2.0": "nist_csf",
"NIST Privacy Framework 1.0": "nist_privacy",
"NIST SP 800-207 (Zero Trust)": "nist_zero_trust",
"NIST SP 800-218 (SSDF)": "nist_ssdf", "NIST SP 800-63-3": "nist_800_63",
"NIST AI Risk Management Framework": "nist_ai_rmf",
"NISTIR 8259A IoT Security": "nist_iot",
"OWASP Top 10 (2021)": "owasp_top10", "OWASP API Security Top 10 (2023)": "owasp_api",
"OWASP ASVS 4.0": "owasp_asvs", "OWASP SAMM 2.0": "owasp_samm",
"OWASP MASVS 2.0": "owasp_masvs", "OWASP Mobile Top 10": "owasp_mobile",
"ENISA": "enisa", "TDDDG": "tdddg", "TKG": "tkg", "TMG": "tmg",
"BGB": "bgb", "UWG": "uwg", "UrhG": "urhg",
"BAIT (BaFin 2024)": "bait", "VAIT (BaFin 2022)": "vait",
"AML-Verordnung": "aml_reg", "Zahlungsdiensterichtlinie 2": "psd2",
"Telekommunikationsgesetz Oesterreich": "at_tkg",
"Österreichisches Datenschutzgesetz (DSG)": "at_dsg",
"Allgemeines Gleichbehandlungsgesetz (AGG)": "agg",
"Aktiengesetz (AktG)": "aktg", "Handelsgesetzbuch (HGB)": "hgb",
"GmbH-Gesetz (GmbHG)": "gmbhg", "Insolvenzordnung (InsO)": "inso",
"Gewerbeordnung (GewO)": "gewo", "Abgabenordnung (AO)": "ao",
}
# fmt: on
def source_to_key(source: str) -> str:
    """Normalize regulation source name to a short slug key."""
    if source in _SOURCE_SHORT:
        return _SOURCE_SHORT[source]
    s = source.lower()
    s = re.sub(r"\(.*?\)", "", s)
    s = re.sub(r"[^a-z0-9äöüß]+", "_", s)
    s = re.sub(r"_+", "_", s).strip("_")
    return s[:40] if s else "unknown"
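# A usage sketch for source_to_key, repeated here with a two-entry excerpt of
# the lookup table so the block runs on its own. Apart from the DSGVO entry,
# the example inputs are hypothetical. Note that the table lookup is an exact,
# case-sensitive match, so a differently cased name takes the slug path and
# yields a different key.

```python
import re

# Two entries copied from _SOURCE_SHORT; the full table is in the script.
SOURCE_SHORT = {"DSGVO (EU) 2016/679": "dsgvo", "Data Act": "data_act"}


def source_to_key(source: str) -> str:
    """Table lookup first, then a slug built from the lowercased name."""
    if source in SOURCE_SHORT:
        return SOURCE_SHORT[source]
    s = source.lower()
    s = re.sub(r"\(.*?\)", "", s)           # drop parenthesised parts
    s = re.sub(r"[^a-z0-9äöüß]+", "_", s)   # other character runs -> "_"
    s = re.sub(r"_+", "_", s).strip("_")
    return s[:40] if s else "unknown"


examples = {
    name: source_to_key(name)
    for name in [
        "DSGVO (EU) 2016/679",   # exact table hit
        "dsgvo (eu) 2016/679",   # same name, different case: slug path
        "ISO/IEC 27001:2022",    # hypothetical unlisted source
        "Some Law (Draft)",      # parenthetical is dropped
        "(2024)",                # nothing left after cleaning
    ]
}
```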
# ── Main ───────────────────────────────────────────────────────────
def main():
parser = argparse.ArgumentParser()
parser.add_argument("--min-source", type=int, default=15,
help="Min controls per source for own sub-MC")
parser.add_argument("--max-mc", type=int, default=100,
help="Max controls per sub-MC before sub-clustering")
parser.add_argument("--threshold", type=int, default=200,
help="Only split MCs with more than N controls")
parser.add_argument("--dry-run", action="store_true")
args = parser.parse_args()
engine = create_engine(
DB_URL, connect_args={"options": "-c search_path=compliance,public"}
)
# Step 1: Find large master controls
with engine.connect() as c:
large_mcs = c.execute(text("""
SELECT mc.id, mc.master_control_id, mc.object_group_id,
mc.canonical_name, mc.total_controls
FROM master_controls mc
WHERE mc.total_controls > :threshold
ORDER BY mc.total_controls DESC
"""), {"threshold": args.threshold}).fetchall()
logger.info("Found %d MCs with >%d controls", len(large_mcs), args.threshold)
if not large_mcs:
return
# Step 2: Build split plans
all_splits = []
for mc_uuid, mc_id, og_id, canonical, total in large_mcs:
plan = _build_split_plan(engine, mc_uuid, mc_id, og_id, canonical, total, args)
all_splits.append(plan)
total_new = sum(len(sp["sub_groups"]) for sp in all_splits)
total_covered = sum(
sum(len(sg["controls"]) for sg in sp["sub_groups"]) for sp in all_splits
)
logger.info("SUMMARY: %d large MCs → %d sub-MCs (%d controls)", len(all_splits), total_new, total_covered)
if args.dry_run:
logger.info("DRY RUN — not writing to DB")
return
_write_splits(engine, all_splits)
def _build_split_plan(engine, mc_uuid, mc_id, og_id, canonical, total, args) -> dict:
"""Build a regulation-source split plan for one large MC."""
logger.info("\n━━━ %s: %s (%d controls) ━━━", mc_id, canonical, total)
with engine.connect() as c:
members = c.execute(text("""
SELECT mcm.control_uuid, mcm.phase, mcm.action,
cc.control_id, cc.title,
COALESCE(pc.source_citation->>'source', 'UNKNOWN') AS src
FROM master_control_members mcm
JOIN canonical_controls cc ON cc.id = mcm.control_uuid
LEFT JOIN canonical_controls pc ON pc.id = cc.parent_control_uuid
WHERE mcm.master_control_uuid = CAST(:mc_uuid AS uuid)
"""), {"mc_uuid": str(mc_uuid)}).fetchall()
by_source: dict[str, list[dict]] = defaultdict(list)
for ctrl_uuid, phase, action, cid, title, src in members:
by_source[src].append({
"control_uuid": str(ctrl_uuid), "phase": phase,
"action": action, "control_id": cid, "title": title,
})
sorted_sources = sorted(by_source.items(), key=lambda x: -len(x[1]))
for src, ctrls in sorted_sources[:8]:
logger.info(" %4d %s", len(ctrls), src)
if len(sorted_sources) > 8:
logger.info(" ... +%d more sources", len(sorted_sources) - 8)
plan = {"mc_uuid": str(mc_uuid), "mc_id": mc_id, "og_id": og_id,
"canonical": canonical, "total": total, "sub_groups": []}
own_mc_sources = []
mixed_controls = []
for src, ctrls in sorted_sources:
if src == "UNKNOWN":
continue
if len(ctrls) >= args.min_source:
own_mc_sources.append((src, ctrls))
else:
mixed_controls.extend(ctrls)
unknown_controls = by_source.get("UNKNOWN", [])
# (a) Named regulation sub-MCs
for src, ctrls in own_mc_sources:
key = source_to_key(src)
name = f"{canonical}_{key}"
_add_subgroups(plan, name, src, ctrls, args.max_mc)
# (b) Mixed small-source bucket
if mixed_controls:
_add_subgroups(plan, f"{canonical}_mixed", "mixed", mixed_controls, args.max_mc)
# (c) UNKNOWN bucket
if unknown_controls:
_add_subgroups(plan, f"{canonical}_general", "general", unknown_controls, args.max_mc)
logger.info("%d sub-groups:", len(plan["sub_groups"]))
for sg in sorted(plan["sub_groups"], key=lambda x: -len(x["controls"])):
logger.info(" %4d %s", len(sg["controls"]), sg["name"])
return plan
def _add_subgroups(plan: dict, name: str, source: str,
                   controls: list[dict], max_mc: int):
    """Add controls as one or more sub-groups to the plan."""
    if len(controls) <= max_mc:
        plan["sub_groups"].append({"name": name, "source": source, "controls": controls})
    else:
        clusters = subcluster_controls(controls, max_mc)
        for i, cluster in enumerate(clusters):
            sub_name = f"{name}_{i+1}" if len(clusters) > 1 else name
            plan["sub_groups"].append({"name": sub_name, "source": source, "controls": cluster})
def _write_splits(engine, splits: list[dict]):
"""Apply split plan: delete old MCs, create new object_groups + MCs."""
with engine.begin() as c:
c.execute(text("SET search_path TO compliance, public"))
max_gid = c.execute(
text("SELECT COALESCE(MAX(group_id), 0) FROM object_groups")
).scalar()
next_gid = max_gid + 1
total_mc = 0
total_mem = 0
for sp in splits:
c.execute(text(
"DELETE FROM master_control_members "
"WHERE master_control_uuid = CAST(:u AS uuid)"
), {"u": sp["mc_uuid"]})
c.execute(text(
"DELETE FROM master_controls WHERE id = CAST(:u AS uuid)"
), {"u": sp["mc_uuid"]})
logger.info("Deleted %s (%s)", sp["mc_id"], sp["canonical"])
for sg in sp["sub_groups"]:
if not sg["controls"]:
continue
gid = next_gid
next_gid += 1
# sorted() keeps the stored member list deterministic across runs
members_list = sorted({ctrl["control_id"] for ctrl in sg["controls"]})
c.execute(text("""
INSERT INTO object_groups
(group_id, canonical_name, member_count, members, top_controls_count)
VALUES (:gid, :name, :cnt, CAST(:members AS jsonb), 0)
"""), {"gid": gid, "name": sg["name"], "cnt": len(members_list),
"members": json.dumps(members_list)})
by_phase: dict[str, list[dict]] = defaultdict(list)
for ctrl in sg["controls"]:
by_phase[ctrl["phase"]].append(ctrl)
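# NOTE: phases are sorted alphabetically here, not by the lifecycle
# PHASE_ORDER used in the gpre2_* scripts
sorted_phases = sorted(by_phase.keys())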
phase_counts = {p: len(v) for p, v in by_phase.items()}
mc_id = f"MC-{gid}"
c.execute(text("""
INSERT INTO master_controls
(master_control_id, object_group_id, canonical_name,
phases_covered, phase_control_count, total_controls)
VALUES (:mcid, :gid, :name,
CAST(:phases AS jsonb), CAST(:pcounts AS jsonb), :total)
"""), {"mcid": mc_id, "gid": gid, "name": sg["name"],
"phases": json.dumps(sorted_phases),
"pcounts": json.dumps(phase_counts),
"total": sum(phase_counts.values())})
mc_uuid = c.execute(text(
"SELECT id FROM master_controls WHERE master_control_id = :mcid"
), {"mcid": mc_id}).scalar()
for ctrl in sg["controls"]:
c.execute(text("""
INSERT INTO master_control_members
(master_control_uuid, control_uuid, phase, action)
VALUES (CAST(:mc AS uuid), CAST(:ctrl AS uuid), :phase, :action)
"""), {"mc": str(mc_uuid), "ctrl": ctrl["control_uuid"],
"phase": ctrl["phase"], "action": ctrl["action"]})
total_mem += 1
total_mc += 1
logger.info("Created %d new MCs with %d members", total_mc, total_mem)
with engine.connect() as c:
stats = c.execute(text("""
SELECT count(*), count(CASE WHEN total_controls > 200 THEN 1 END),
AVG(total_controls)::int
FROM compliance.master_controls
""")).fetchone()
logger.info("Final: %d MCs, %d still >200, avg %d controls/MC", stats[0], stats[1], stats[2])
if __name__ == "__main__":
main()
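# A minimal sketch of the bucketing rule in _build_split_plan: sources with at
# least min_source controls keep their own bucket, smaller sources are pooled
# into "mixed", and controls without a source ("UNKNOWN") go to "general".
# The embedding-based sub-clustering of oversized buckets is left out, since
# services.embedding_utils is not part of this diff. bucket_by_source and the
# sample data are hypothetical.

```python
from collections import defaultdict


def bucket_by_source(controls, min_source):
    """controls: list of (control_id, source). Returns {bucket: [ids]}."""
    by_source = defaultdict(list)
    for control_id, source in controls:
        by_source[source].append(control_id)

    buckets = defaultdict(list)
    for source, ids in by_source.items():
        if source == "UNKNOWN":
            buckets["general"].extend(ids)
        elif len(ids) >= min_source:
            buckets[source].extend(ids)
        else:
            buckets["mixed"].extend(ids)
    return dict(buckets)


# Hypothetical members of one large Master Control.
controls = [
    ("C1", "DSGVO"), ("C2", "DSGVO"),
    ("C3", "NIS2"), ("C4", "UNKNOWN"), ("C5", "TKG"),
]
buckets = bucket_by_source(controls, min_source=2)
```

# Every control lands in exactly one bucket, so the bucket sizes add up to the
# size of the original Master Control.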
@@ -0,0 +1,310 @@
#!/usr/bin/env python3
"""
Phase 0: Quality Audit for Master Control Assignments.
Uses Claude Sonnet to validate whether controls are correctly assigned
to their Master Controls. Samples controls from large and small MCs.
Usage:
python3 /app/scripts/gpre_quality_audit.py
python3 /app/scripts/gpre_quality_audit.py --large-sample 50 --small-sample 10
python3 /app/scripts/gpre_quality_audit.py --mc MC-8292 # single MC
"""
import argparse
import json
import logging
import os
import random
import time
from collections import defaultdict
import httpx
from sqlalchemy import create_engine, text
logging.basicConfig(
level=logging.INFO, format="%(asctime)s [%(levelname)s] %(message)s"
)
logger = logging.getLogger("quality-audit")
DB_URL = os.getenv(
"DATABASE_URL",
"postgresql://breakpilot:breakpilot123@postgres:5432/breakpilot_db",
)
ANTHROPIC_API_KEY = os.getenv("ANTHROPIC_API_KEY", "")
ANTHROPIC_MODEL = os.getenv("AUDIT_MODEL", "claude-sonnet-4-20250514")
ANTHROPIC_URL = "https://api.anthropic.com/v1/messages"
SYSTEM_PROMPT = """Du bist ein Compliance-Experte der prüft ob Controls korrekt zu Master Controls zugeordnet sind.
Für jeden Control beantworte:
1. MATCH: Gehört dieser Control thematisch zum Master Control Topic?
2. CONFIDENCE: Wie sicher bist du? (0.0-1.0)
3. REASON: Kurze Begründung (max 1 Satz)
4. SUGGESTED_TOPIC: Falls MATCH=false, welches Topic wäre korrekt?
Wichtige Unterscheidungen:
- "monitoring" = kontinuierliche Überwachung, Alerting, Log-Analyse
- "training" = Schulung, Awareness, Lernmaterialien
- "personal_data" = personenbezogene Daten, DSGVO-Betroffenenrechte
- "procedure" = Verfahren, Prozesse (aber NICHT wenn es spezifisch um Incidents geht)
- "incident" = Sicherheitsvorfälle, Breach Notification, Recovery
- "policy" = Richtlinien, Regelwerke, Governance-Dokumente
- "encryption" = Verschlüsselung, Kryptografie, Key Management
- "audit_logging" = Protokollierung, Audit Trail, Nachvollziehbarkeit
Antworte NUR als JSON-Array, ein Objekt pro Control."""
def call_claude(controls_batch: list[dict], mc_topic: str) -> tuple[list[dict], dict]:
    """Send a batch of controls to Claude for validation.

    Returns (results, usage); results is empty if the request fails or the
    response contains no JSON array.
    """
    items = []
    for c in controls_batch:
        items.append(
            f"- Control '{c['control_id']}': "
            f"Titel=\"{c['title']}\", "
            f"Objective=\"{c['objective'][:150]}...\", "
            f"Phase={c['phase']}, Action={c['action']}"
        )
    prompt = (
        f"Master Control Topic: \"{mc_topic}\"\n\n"
        f"Prüfe diese {len(controls_batch)} Controls:\n\n"
        + "\n".join(items)
        + "\n\nAntwort als JSON-Array mit Feldern: "
        "control_id, match (bool), confidence (float), reason (str), "
        "suggested_topic (str, nur wenn match=false)."
    )
    headers = {
        "x-api-key": ANTHROPIC_API_KEY,
        "anthropic-version": "2023-06-01",
        "content-type": "application/json",
    }
    payload = {
        "model": ANTHROPIC_MODEL,
        "max_tokens": 2048,
        "temperature": 0.1,
        "system": SYSTEM_PROMPT,
        "messages": [{"role": "user", "content": prompt}],
    }
    # Up to three attempts: back off on HTTP 429, give up on other HTTP errors
    for attempt in range(3):
        try:
            resp = httpx.post(
                ANTHROPIC_URL,
                headers=headers,
                json=payload,
                timeout=60.0,
            )
            resp.raise_for_status()
            data = resp.json()
            content = data.get("content", [{}])[0].get("text", "")
            usage = data.get("usage", {})
            # Parse JSON from response
            start = content.find("[")
            end = content.rfind("]") + 1
            if start >= 0 and end > start:
                results = json.loads(content[start:end])
                return results, usage
            logger.warning("No JSON array in response: %s", content[:200])
            return [], usage
        except httpx.HTTPStatusError as e:
            if e.response.status_code == 429:
                wait = 30 * (attempt + 1)
                logger.warning("Rate limited, waiting %ds...", wait)
                time.sleep(wait)
            else:
                logger.error("API error: %s", e)
                return [], {}
        except Exception as e:
            logger.error("Request failed (attempt %d): %s", attempt + 1, e)
            if attempt < 2:
                time.sleep(5)
    return [], {}
def main():
parser = argparse.ArgumentParser()
parser.add_argument("--large-sample", type=int, default=50,
help="Controls to sample per large MC")
parser.add_argument("--small-sample", type=int, default=10,
help="Controls to sample per small MC")
parser.add_argument("--small-mc-count", type=int, default=50,
help="Number of small MCs to audit")
parser.add_argument("--mc", type=str, default=None,
help="Audit a single MC by ID (e.g., MC-8292)")
parser.add_argument("--batch-size", type=int, default=10,
help="Controls per API call")
args = parser.parse_args()
engine = create_engine(
DB_URL, connect_args={"options": "-c search_path=compliance,public"}
)
# Load MCs to audit
with engine.connect() as c:
if args.mc:
mcs = c.execute(text("""
SELECT id, master_control_id, canonical_name, total_controls
FROM master_controls WHERE master_control_id = :mc
"""), {"mc": args.mc}).fetchall()
else:
# Large MCs (>200) + random small MCs
large = c.execute(text("""
SELECT id, master_control_id, canonical_name, total_controls
FROM master_controls WHERE total_controls > 200
ORDER BY total_controls DESC
""")).fetchall()
small = c.execute(text("""
SELECT id, master_control_id, canonical_name, total_controls
FROM master_controls WHERE total_controls BETWEEN 10 AND 200
ORDER BY RANDOM() LIMIT :cnt
"""), {"cnt": args.small_mc_count}).fetchall()
mcs = list(large) + list(small)
logger.info("Auditing %d Master Controls", len(mcs))
# Results tracking
total_checked = 0
total_match = 0
total_mismatch = 0
total_input_tokens = 0
total_output_tokens = 0
mc_results: dict[str, dict] = {}
all_mismatches: list[dict] = []
for mc_uuid, mc_id, canonical, total in mcs:
is_large = total > 200
sample_size = args.large_sample if is_large else args.small_sample
# Sample controls
with engine.connect() as c:
controls = c.execute(text("""
SELECT mcm.control_uuid, mcm.phase, mcm.action,
cc.control_id, cc.title,
COALESCE(cc.objective, '') as objective
FROM master_control_members mcm
JOIN canonical_controls cc ON cc.id = mcm.control_uuid
WHERE mcm.master_control_uuid = CAST(:mc AS uuid)
ORDER BY RANDOM()
LIMIT :n
"""), {"mc": str(mc_uuid), "n": sample_size}).fetchall()
if not controls:
continue
control_dicts = [
{"control_uuid": str(r[0]), "phase": r[1], "action": r[2],
"control_id": r[3], "title": r[4] or "", "objective": r[5] or ""}
for r in controls
]
logger.info("\n%s: %s (%d total, sampling %d)",
mc_id, canonical, total, len(control_dicts))
mc_match = 0
mc_mismatch = 0
# Process in batches
for i in range(0, len(control_dicts), args.batch_size):
batch = control_dicts[i:i + args.batch_size]
results, usage = call_claude(batch, canonical)
total_input_tokens += usage.get("input_tokens", 0)
total_output_tokens += usage.get("output_tokens", 0)
for r in results:
if r.get("match", True):
mc_match += 1
total_match += 1
else:
mc_mismatch += 1
total_mismatch += 1
mismatch = {
"mc_id": mc_id,
"mc_topic": canonical,
"control_id": r.get("control_id", "?"),
"confidence": r.get("confidence", 0),
"reason": r.get("reason", ""),
"suggested_topic": r.get("suggested_topic", ""),
}
all_mismatches.append(mismatch)
total_checked += len(results)
# Rate limit
time.sleep(1)
accuracy = mc_match / (mc_match + mc_mismatch) if (mc_match + mc_mismatch) > 0 else 1.0
mc_results[mc_id] = {
"canonical": canonical, "total": total,
"checked": mc_match + mc_mismatch,
"match": mc_match, "mismatch": mc_mismatch,
"accuracy": accuracy,
}
logger.info("%d/%d correct (%.1f%%)",
mc_match, mc_match + mc_mismatch, accuracy * 100)
# Final report
_print_report(mc_results, all_mismatches, total_checked, total_match,
total_mismatch, total_input_tokens, total_output_tokens)
def _print_report(mc_results, mismatches, checked, match, mismatch,
input_tok, output_tok):
"""Print the quality audit report."""
logger.info("\n" + "=" * 70)
logger.info("QUALITY AUDIT REPORT")
logger.info("=" * 70)
logger.info("Total controls checked: %d", checked)
logger.info("Correct assignments: %d (%.1f%%)",
match, match / max(checked, 1) * 100)
logger.info("Wrong assignments: %d (%.1f%%)",
mismatch, mismatch / max(checked, 1) * 100)
# Cost estimate
cost_input = input_tok / 1_000_000 * 3.0 # Sonnet input: $3/MTok
cost_output = output_tok / 1_000_000 * 15.0 # Sonnet output: $15/MTok
logger.info("\nAPI Usage: %d input + %d output tokens",
input_tok, output_tok)
logger.info("Estimated cost: $%.2f", cost_input + cost_output)
# Per-MC breakdown (worst first)
logger.info("\n--- Per-MC Accuracy (worst first) ---")
sorted_mcs = sorted(mc_results.values(), key=lambda x: x["accuracy"])
for mc in sorted_mcs:
flag = "" if mc["accuracy"] < 0.9 else "⚠️" if mc["accuracy"] < 0.95 else ""
logger.info(" %s %s (%s): %d/%d = %.1f%% [total: %d]",
flag, mc["canonical"][:30].ljust(30),
"large" if mc["total"] > 200 else "small",
mc["match"], mc["checked"],
mc["accuracy"] * 100, mc["total"])
# Top mismatches
if mismatches:
logger.info("\n--- Mismatches (all %d) ---", len(mismatches))
for m in sorted(mismatches, key=lambda x: -x.get("confidence", 0)):
logger.info(" %s in %s (%s) → should be '%s': %s",
m["control_id"], m["mc_id"], m["mc_topic"],
m["suggested_topic"], m["reason"])
# Size-class breakdown
large_mcs = [m for m in mc_results.values() if m["total"] > 200]
small_mcs = [m for m in mc_results.values() if m["total"] <= 200]
if large_mcs:
lg_acc = sum(m["match"] for m in large_mcs) / max(sum(m["checked"] for m in large_mcs), 1)
logger.info("\nLarge MCs (>200): %.1f%% accuracy (%d MCs)",
lg_acc * 100, len(large_mcs))
if small_mcs:
sm_acc = sum(m["match"] for m in small_mcs) / max(sum(m["checked"] for m in small_mcs), 1)
logger.info("Small MCs (≤200): %.1f%% accuracy (%d MCs)",
sm_acc * 100, len(small_mcs))
if __name__ == "__main__":
main()
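A minimal sketch of the JSON extraction step used in `call_claude`, pulled out into a standalone helper. The name `extract_json_array` is hypothetical and not part of the script. It assumes the model may wrap the array in surrounding prose, and it returns an empty list instead of raising when the slice does not parse.

```python
import json


def extract_json_array(content: str) -> list[dict]:
    """Return the outermost JSON array of objects found in `content`, or []."""
    start = content.find("[")
    end = content.rfind("]") + 1
    if start < 0 or end <= start:
        return []
    try:
        parsed = json.loads(content[start:end])
    except json.JSONDecodeError:
        return []
    if not isinstance(parsed, list):
        return []
    # Keep only object entries so callers can safely use .get() on each item.
    return [item for item in parsed if isinstance(item, dict)]


reply = 'Ergebnis: [{"control_id": "C-1", "match": false, "confidence": 0.8}] Ende'
for item in extract_json_array(reply):
    print(item["control_id"], item.get("match", True))
```

Returning `[]` on a parse failure keeps a malformed reply from being retried as if it were a network error; whether that is the desired behaviour for the audit is a design choice, not something the script prescribes.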
@@ -0,0 +1,242 @@
#!/usr/bin/env python3
"""Parse BSI QUAIDAL Markdown catalog into a structural index.
Clean-Room principle: this script does NOT persist any QUAIDAL prose to disk.
It only extracts non-protectable structural facts (IDs, type, file paths,
cross-references to other QUAIDAL entries, references to external norms).
The derivation step (derive_quaidal_mcs.py) reads the index plus the original
.md files from the gitignored clone and asks the LLM to produce our own
wordings, never copying the BSI prose into our own controls/database.
Input: legal-sources/bsi-quaidal/0000_Markdown/**/*.md (gitignored clone)
Output: control-pipeline/data/quaidal/quaidal_index.json (structural only)
Usage:
python3 control-pipeline/scripts/ingest_bsi_quaidal.py
python3 control-pipeline/scripts/ingest_bsi_quaidal.py --check # validate only
"""
from __future__ import annotations
import argparse
import json
import re
import subprocess
import sys
from dataclasses import asdict, dataclass, field
from pathlib import Path
try:
import yaml
except ImportError:
print("ERROR: PyYAML missing. Install with: pip install pyyaml", file=sys.stderr)
sys.exit(2)
REPO_ROOT = Path(__file__).resolve().parents[2]
SOURCE_ROOT = REPO_ROOT / "legal-sources" / "bsi-quaidal"
MARKDOWN_ROOT = SOURCE_ROOT / "0000_Markdown"
OUTPUT_DIR = REPO_ROOT / "control-pipeline" / "data" / "quaidal"
OUTPUT_FILE = OUTPUT_DIR / "quaidal_index.json"
# Map folder name -> our internal kind. Sub-folders inside the Methoden tree
# (e.g. "QM-10_Dimension Reduction") are treated as method variants of their
# parent QM.
KIND_BY_PARENT_DIR = {
"0000_Qualitätskriterien": "criterion", # QKB → Master Control candidates
"0001_Qualitätsbausteine": "building_block", # QB → atomic controls
"0002_Maßnahmen": "measure", # M → mitigations
"0003_Qualitätsmetriken_methoden": "metric", # QM → runtime check / metric
"0002_Referenz-Matrizen": "matrix", # cross-walk matrix
"9998_CustomTemplates": "template",
}
FRONTMATTER_RE = re.compile(r"^---\s*\n(.*?)\n---\s*\n", re.DOTALL)
ID_RE = re.compile(r"\b((?:QKB|QB|MA|QM)-\d+[a-zA-Z]?)", re.IGNORECASE)
@dataclass
class IndexEntry:
id: str # Canonical ID: QKB-01, QB-03, M-12, QM-07
kind: str # criterion / building_block / measure / metric / matrix / template
title_de: str
title_en: str
source_path: str # relative to SOURCE_ROOT
referenced_ids: list[str] = field(default_factory=list) # other QUAIDAL IDs linked in this file
external_refs: list[dict] = field(default_factory=list) # {framework, citation}
tags: list[str] = field(default_factory=list)
share: bool | None = None
def parse_frontmatter(text: str) -> dict:
m = FRONTMATTER_RE.match(text)
if not m:
return {}
try:
return yaml.safe_load(m.group(1)) or {}
except yaml.YAMLError:
return {}
def canonical_id(raw_id: str | list | None, filename: str) -> str | None:
    """Normalise the ID: QUAIDAL files sometimes list multiple IDs or use odd casing."""
    candidates: list[str] = []
    if isinstance(raw_id, list):
        candidates.extend(str(x) for x in raw_id)
    elif isinstance(raw_id, str):
        candidates.append(raw_id)
    # Fallback: derive from filename
    candidates.append(filename)
    for c in candidates:
        m = ID_RE.search(c)
        if m:
            return m.group(1).upper().replace(" ", "-")
    return None
def determine_kind(path: Path) -> str:
for parent in path.parents:
if parent.name in KIND_BY_PARENT_DIR:
return KIND_BY_PARENT_DIR[parent.name]
return "unknown"
def collect_referenced_ids(body: str, own_id: str) -> list[str]:
found = {m.group(1).upper() for m in ID_RE.finditer(body)}
found.discard(own_id)
return sorted(found)
REF_FRAMEWORKS = [
("AI Act", ["AI-Act", "AI Act", "Verordnung (EU) 2024/1689", "KI-VO"]),
("EU GDPR", ["DSGVO", "Verordnung (EU) 2016/679", "GDPR"]),
("ISO/IEC 25012", ["ISO/IEC 25012", "ISO 25012"]),
("ISO/IEC 25024", ["ISO/IEC 25024", "ISO 25024"]),
("ISO/IEC 23894", ["ISO/IEC 23894", "ISO 23894"]),
("ISO/IEC 42001", ["ISO/IEC 42001", "ISO 42001"]),
("NIST AI RMF", ["NIST AI RMF", "AI Risk Management Framework"]),
("BSI Grundschutz", ["IT-Grundschutz", "Grundschutz"]),
("BSI AIC4", ["AIC4", "AI Cloud Service Compliance Criteria"]),
]
def detect_external_refs(body: str) -> list[dict]:
    refs: list[dict] = []
    seen: set[tuple[str, str]] = set()
    # Scan every line for a known framework name and record one
    # (framework, citation) pair per distinct hit. The BSI "Kurzbeschr."
    # column is deliberately NOT stored, to avoid copying their prose.
    for line in body.splitlines():
        for framework, patterns in REF_FRAMEWORKS:
            for pat in patterns:
                if pat.lower() in line.lower():
                    # Try to grab an article/section nearby (e.g. "Artikel 10")
                    art = re.search(r"(Artikel|Art\.?|Section|§)\s*([0-9]+[a-z]?)", line, re.IGNORECASE)
                    citation = f"{art.group(1)} {art.group(2)}" if art else None
                    key = (framework, citation or "")
                    if key in seen:
                        continue
                    seen.add(key)
                    refs.append({"framework": framework, "citation": citation})
                    break
    return refs
def parse_file(path: Path) -> IndexEntry | None:
text = path.read_text(encoding="utf-8")
fm = parse_frontmatter(text)
fm_match = FRONTMATTER_RE.match(text)
body = text[fm_match.end():] if fm_match else text
own_id = canonical_id(fm.get("ID"), path.stem)
if not own_id:
return None
title_de = str(fm.get("TitleGer") or fm.get("Title") or path.stem).strip()
title_en = str(fm.get("Title") or "").strip()
tags_raw = fm.get("tags") or []
if isinstance(tags_raw, str):
tags_raw = [tags_raw]
tags = [str(t).strip() for t in tags_raw if t]
share_val = fm.get("share")
share = bool(share_val) if share_val is not None else None
return IndexEntry(
id=own_id,
kind=determine_kind(path),
title_de=title_de,
title_en=title_en,
source_path=str(path.relative_to(SOURCE_ROOT)),
referenced_ids=collect_referenced_ids(body, own_id),
external_refs=detect_external_refs(body),
tags=tags,
share=share,
)
def get_commit_sha() -> str | None:
try:
out = subprocess.run(
["git", "-C", str(SOURCE_ROOT), "rev-parse", "HEAD"],
capture_output=True,
text=True,
check=True,
)
return out.stdout.strip()
except (subprocess.CalledProcessError, FileNotFoundError):
return None
def main() -> int:
ap = argparse.ArgumentParser(description=__doc__)
ap.add_argument("--check", action="store_true", help="Parse + validate, do not write output")
args = ap.parse_args()
if not MARKDOWN_ROOT.exists():
print(f"ERROR: clone not found at {SOURCE_ROOT}", file=sys.stderr)
print("Run: git clone --depth=1 https://github.com/BSI-Bund/QUAIDAL.git legal-sources/bsi-quaidal", file=sys.stderr)
return 2
entries: list[IndexEntry] = []
skipped: list[Path] = []
for path in sorted(MARKDOWN_ROOT.rglob("*.md")):
entry = parse_file(path)
if entry is None:
skipped.append(path)
continue
entries.append(entry)
by_kind: dict[str, int] = {}
for e in entries:
by_kind[e.kind] = by_kind.get(e.kind, 0) + 1
print(f"Parsed {len(entries)} entries (skipped {len(skipped)} without ID):")
for kind, count in sorted(by_kind.items()):
print(f" {kind:18s} {count}")
if args.check:
return 0
OUTPUT_DIR.mkdir(parents=True, exist_ok=True)
payload = {
"source": "BSI QUAIDAL",
"source_url": "https://github.com/BSI-Bund/QUAIDAL",
"commit_sha": get_commit_sha(),
"license_note": (
"BSI-Veroeffentlichung. Repo enthaelt keine SPDX-Lizenzdatei. "
"Frontmatter share:true. Veroeffentlichung durch Bundesbehoerde, "
"§ 5 UrhG (amtliche Werke) anwendbar. BSI hat 05/2026 die Annahme "
"CC-BY-SA-4.0 in unserer Anfrage nicht widersprochen, aber auch "
"nicht aktiv bestaetigt. Wir derivieren Clean-Room (eigene "
"Formulierungen, nur Referenz auf BSI QUAIDAL Sektion)."
),
"entries": [asdict(e) for e in entries],
}
OUTPUT_FILE.write_text(json.dumps(payload, ensure_ascii=False, indent=2), encoding="utf-8")
print(f"\nWrote index: {OUTPUT_FILE.relative_to(REPO_ROOT)}")
print(f"Commit SHA: {payload['commit_sha']}")
return 0
if __name__ == "__main__":
sys.exit(main())
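To make the cross-reference extraction concrete, here is a small self-contained sketch that repeats `ID_RE` and `collect_referenced_ids` from the script and runs them on an invented sample line. The sample text is an assumption for illustration only and is not taken from the QUAIDAL catalog.

```python
import re

# Same pattern as in the script: prefix, dash, digits, optional letter suffix.
ID_RE = re.compile(r"\b((?:QKB|QB|MA|QM)-\d+[a-zA-Z]?)", re.IGNORECASE)


def collect_referenced_ids(body: str, own_id: str) -> list[str]:
    """Return all distinct IDs mentioned in `body`, upper-cased, minus `own_id`."""
    found = {m.group(1).upper() for m in ID_RE.finditer(body)}
    found.discard(own_id)
    return sorted(found)


# Invented sample: mixed casing, a duplicate, and a self-reference.
sample = "Siehe [[QB-03]] und qm-10a; verwandt mit QKB-01 und QB-03."
print(collect_referenced_ids(sample, "QKB-01"))  # ['QB-03', 'QM-10A']
```

One consequence of the pattern as written: a bare `M-12`, the form used in the `IndexEntry` comment, does not match, because the alternation lists `MA` rather than `M`. Whether the catalog actually uses `MA-` or `M-` prefixes is not visible from this script.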
@@ -0,0 +1,240 @@
#!/usr/bin/env python3
"""Ingest missing German laws from gesetze-im-internet.de.
Downloads full HTML, strips to text, uploads with legal chunking strategy.
Handles ISO-8859-1 charset typical for gesetze-im-internet.de.
Usage (on Mac Mini):
python3 control-pipeline/scripts/ingest_de_laws.py --dry-run
python3 control-pipeline/scripts/ingest_de_laws.py
"""
import argparse
import json
import logging
import time
from typing import Optional
import httpx
logging.basicConfig(level=logging.INFO, format="%(asctime)s [%(levelname)s] %(message)s")
logger = logging.getLogger("ingest-laws")
RAG_URL = "https://localhost:8097"
QDRANT_URL = "http://localhost:6333"
COLLECTION = "bp_compliance_gesetze"
# ---- Laws to ingest ----
# Format: one dict per law with url, regulation_id, name (display name), short
# URL pattern: https://www.gesetze-im-internet.de/{slug}/BJNR*.html (full text)
LAWS = [
{
"url": "https://www.gesetze-im-internet.de/arbzg/BJNR117100994.html",
"regulation_id": "de_arbzg",
"name": "Arbeitszeitgesetz (ArbZG)",
"short": "ArbZG",
},
{
"url": "https://www.gesetze-im-internet.de/muschg_2018/BJNR122810017.html",
"regulation_id": "de_muschg",
"name": "Mutterschutzgesetz (MuSchG)",
"short": "MuSchG",
},
{
"url": "https://www.gesetze-im-internet.de/nachwg/BJNR094610995.html",
"regulation_id": "de_nachwg",
"name": "Nachweisgesetz (NachwG)",
"short": "NachwG",
},
{
"url": "https://www.gesetze-im-internet.de/milog/BJNR134810014.html",
"regulation_id": "de_milog",
"name": "Mindestlohngesetz (MiLoG)",
"short": "MiLoG",
},
{
"url": "https://www.gesetze-im-internet.de/gmbhg/BJNR004770892.html",
"regulation_id": "de_gmbhg",
"name": "GmbH-Gesetz (GmbHG)",
"short": "GmbHG",
},
{
"url": "https://www.gesetze-im-internet.de/aktg/BJNR010890965.html",
"regulation_id": "de_aktg",
"name": "Aktiengesetz (AktG)",
"short": "AktG",
},
{
"url": "https://www.gesetze-im-internet.de/inso/BJNR286600994.html",
"regulation_id": "de_inso",
"name": "Insolvenzordnung (InsO)",
"short": "InsO",
},
# BEG IV is an amending act, so there is no standalone text on gesetze-im-internet.de
{
"url": "https://www.gesetze-im-internet.de/verpflg/BJNR009690974.html",
"regulation_id": "de_verpflichtungsgesetz",
"name": "Verpflichtungsgesetz",
"short": "VerpflG",
},
{
"url": "https://www.gesetze-im-internet.de/burlg/BJNR000020963.html",
"regulation_id": "de_burlg",
"name": "Bundesurlaubsgesetz (BUrlG)",
"short": "BUrlG",
},
{
"url": "https://www.gesetze-im-internet.de/entgfg/BJNR118010994.html",
"regulation_id": "de_entgfg",
"name": "Entgeltfortzahlungsgesetz (EntgFG)",
"short": "EntgFG",
},
]
def download_law(url: str) -> Optional[str]:
"""Download law HTML from gesetze-im-internet.de, handle charset."""
with httpx.Client(timeout=30.0, follow_redirects=True) as c:
resp = c.get(url)
if resp.status_code != 200:
logger.error(" HTTP %d for %s", resp.status_code, url)
return None
# gesetze-im-internet.de uses ISO-8859-1
content_type = resp.headers.get("content-type", "")
if "charset" in content_type:
# Use declared charset
html = resp.text
else:
# Try UTF-8 first, fall back to ISO-8859-1
try:
html = resp.content.decode("utf-8")
if "\ufffd" in html:
raise UnicodeDecodeError("utf-8", b"", 0, 1, "replacement chars")
except (UnicodeDecodeError, ValueError):
html = resp.content.decode("iso-8859-1")
return html
def upload_html(
html: str,
filename: str,
regulation_id: str,
name: str,
short: str,
dry_run: bool = False,
) -> Optional[dict]:
"""Upload HTML to RAG service with legal chunking."""
if dry_run:
logger.info(" DRY RUN — would upload %d chars", len(html))
return {"chunks_count": 0, "document_id": "dry-run"}
meta = {
"regulation_id": regulation_id,
"regulation_name_de": name,
"regulation_short": short,
"source": "gesetze-im-internet.de",
"license": "public_domain_de_law",
"jurisdiction": "DE",
"source_type": "law",
}
form_data = {
"collection": COLLECTION,
"data_type": "compliance",
"bundesland": "bund",
"use_case": "compliance",
"year": "2026",
"chunk_strategy": "legal",
"chunk_size": "1500",
"chunk_overlap": "100",
"metadata_json": json.dumps(meta, ensure_ascii=False),
}
with httpx.Client(timeout=600.0, verify=False) as c:
resp = c.post(
f"{RAG_URL}/api/v1/documents/upload",
files={"file": (filename, html.encode("utf-8"), "text/html")},
data=form_data,
)
resp.raise_for_status()
return resp.json()
def count_existing(regulation_id: str) -> int:
"""Check if regulation already exists in Qdrant."""
with httpx.Client(timeout=30.0) as c:
resp = c.post(
f"{QDRANT_URL}/collections/{COLLECTION}/points/count",
json={
"filter": {"must": [
{"key": "regulation_id", "match": {"value": regulation_id}}
]},
"exact": True,
},
)
resp.raise_for_status()
return resp.json()["result"]["count"]
def main():
parser = argparse.ArgumentParser(description="Ingest DE laws from gesetze-im-internet.de")
parser.add_argument("--dry-run", action="store_true")
args = parser.parse_args()
logger.info("=" * 60)
logger.info("Ingest German Laws")
logger.info(" Laws: %d", len(LAWS))
logger.info(" Collection: %s", COLLECTION)
logger.info(" Dry run: %s", args.dry_run)
logger.info("=" * 60)
results = []
for i, law in enumerate(LAWS, 1):
logger.info("\n[%d/%d] %s (%s)", i, len(LAWS), law["name"], law["regulation_id"])
# Check if already exists
existing = count_existing(law["regulation_id"])
if existing > 0:
logger.info(" Already exists: %d chunks — SKIPPING", existing)
results.append({"law": law["short"], "status": "exists", "chunks": existing})
continue
# Download
logger.info(" Downloading: %s", law["url"])
html = download_law(law["url"])
if not html:
results.append({"law": law["short"], "status": "download_failed", "chunks": 0})
continue
logger.info(" Downloaded: %d chars", len(html))
# Upload
filename = f"{law['regulation_id']}.html"
try:
result = upload_html(
html, filename, law["regulation_id"],
law["name"], law["short"], args.dry_run,
)
chunks = result.get("chunks_count", 0) if result else 0
logger.info(" Uploaded: %d chunks", chunks)
results.append({"law": law["short"], "status": "ok", "chunks": chunks})
except Exception as e:
logger.error(" Upload FAILED: %s", e)
results.append({"law": law["short"], "status": "error", "chunks": 0})
if i < len(LAWS):
time.sleep(1)
# Summary
logger.info("\n" + "=" * 60)
logger.info("RESULTS")
logger.info("=" * 60)
for r in results:
logger.info(" %-10s %s chunks=%d", r["law"], r["status"].upper(), r["chunks"])
total_new = sum(r["chunks"] for r in results if r["status"] == "ok")
logger.info("\nTotal new chunks: %d", total_new)
if __name__ == "__main__":
main()
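The charset handling in `download_law` boils down to "try UTF-8, fall back to ISO-8859-1". A minimal sketch of that fallback on raw bytes follows; the helper name `decode_law_html` is hypothetical, and the sample strings are invented rather than fetched from gesetze-im-internet.de.

```python
def decode_law_html(raw: bytes) -> str:
    """Decode as UTF-8 when the bytes are valid UTF-8, otherwise as ISO-8859-1."""
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        # ISO-8859-1 maps every byte to a code point, so this cannot fail.
        return raw.decode("iso-8859-1")


legacy = "§ 3 Verlängerung der Arbeitszeit".encode("iso-8859-1")
modern = "§ 3 Verlängerung der Arbeitszeit".encode("utf-8")
print(decode_law_html(legacy))
print(decode_law_html(modern))
```

A strict `decode("utf-8")` raises on invalid input rather than inserting U+FFFD, so the exception alone is enough to trigger the fallback in this sketch.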
@@ -0,0 +1,414 @@
#!/usr/bin/env python3
"""Ingest CRA-relevant ENISA documents into the RAG (collection `bp_compliance_ce`).
Source files live under `legal-sources/enisa/` in this repo. The script extracts
PDF text with pdfplumber (HTML for the SRP FAQ), normalizes it, and uploads via
the RAG service with `chunk_strategy='legal'` so that section metadata is
attached to every chunk.
Each document carries a `requirement_strength` field so downstream consumers
can distinguish normative material from guidance and consultation drafts:
- mandatory: binding (none in this batch; the CRA itself is the law)
- guidance: official ENISA / EUCC guidance, citable
- consultation_draft: public-consultation drafts (use with caveat)
- evidentiary: background evidence (used here for the ENISA Threat Landscape report)
Usage (run on Mac Mini after copying the legal-sources/enisa/ folder, or via SSH
with the repo mounted):
python3 control-pipeline/scripts/ingest_enisa_cra.py --dry-run
python3 control-pipeline/scripts/ingest_enisa_cra.py
"""
import argparse
import json
import re
import sys
import time
import unicodedata
from html.parser import HTMLParser
from pathlib import Path
import httpx
import pdfplumber
RAG_URL = "https://localhost:8097"
QDRANT_URL = "http://localhost:6333"
UPLOAD_TIMEOUT = 1800.0
COLLECTION = "bp_compliance_ce"
REPO_ROOT = Path(__file__).resolve().parents[2]
SOURCE_DIR = REPO_ROOT / "legal-sources" / "enisa"
DOCS = [
{
"regulation_id": "enisa_cra_requirements_standards_mapping",
"filename": "enisa_cra_requirements_standards_mapping.pdf",
"upload_filename": "enisa_cra_requirements_standards_mapping.txt",
"extra_metadata": {
"regulation_id": "enisa_cra_requirements_standards_mapping",
"regulation_short": "ENISA CRA Standards Mapping",
"guideline_name": "Cyber Resilience Act Requirements Standards Mapping",
"doc_type": "standards_mapping",
"requirement_strength": "guidance",
"publication_year": "2024",
"license": "reuse_with_attribution",
"source": "enisa.europa.eu",
"attribution": "ENISA, CC BY 4.0",
},
},
{
"regulation_id": "enisa_cra_implementation_via_eucc",
"filename": "enisa_cra_implementation_via_eucc.pdf",
"upload_filename": "enisa_cra_implementation_via_eucc.txt",
"extra_metadata": {
"regulation_id": "enisa_cra_implementation_via_eucc",
"regulation_short": "ENISA CRA via EUCC",
"guideline_name": "CRA Implementation via EUCC and its Applicable Technical Elements",
"doc_type": "certification_guidance",
"requirement_strength": "guidance",
"license": "reuse_with_attribution",
"source": "enisa.europa.eu",
"attribution": "ENISA, CC BY 4.0",
},
},
{
"regulation_id": "enisa_cra_implementation_via_eucc_annex",
"filename": "enisa_cra_implementation_via_eucc_annex.pdf",
"upload_filename": "enisa_cra_implementation_via_eucc_annex.txt",
"extra_metadata": {
"regulation_id": "enisa_cra_implementation_via_eucc_annex",
"regulation_short": "ENISA CRA via EUCC (Annex)",
"guideline_name": "Annex — CRA Implementation via EUCC",
"doc_type": "certification_guidance_annex",
"requirement_strength": "guidance",
"license": "reuse_with_attribution",
"source": "enisa.europa.eu",
"attribution": "ENISA, CC BY 4.0",
},
},
{
"regulation_id": "enisa_eucc_vulnerability_management_disclosure",
"filename": "enisa_eucc_vulnerability_management_disclosure.pdf",
"upload_filename": "enisa_eucc_vulnerability_management_disclosure.txt",
"extra_metadata": {
"regulation_id": "enisa_eucc_vulnerability_management_disclosure",
"regulation_short": "EUCC Vuln Management & Disclosure",
"guideline_name": "EUCC Guidelines — Vulnerability Management and Disclosure v1.1",
"doc_type": "vulnerability_guidance",
"requirement_strength": "guidance",
"license": "reuse_with_attribution",
"source": "enisa.europa.eu",
"attribution": "ENISA, CC BY 4.0",
},
},
{
"regulation_id": "enisa_eccg_opinion_vulnerability_management",
"filename": "enisa_eccg_opinion_vulnerability_management.pdf",
"upload_filename": "enisa_eccg_opinion_vulnerability_management.txt",
"extra_metadata": {
"regulation_id": "enisa_eccg_opinion_vulnerability_management",
"regulation_short": "ECCG Opinion Vuln Management",
"guideline_name": "Final ECCG Opinion — Guidance on Vulnerability Management",
"doc_type": "eccg_opinion",
"requirement_strength": "guidance",
"license": "reuse_with_attribution",
"source": "enisa.europa.eu",
"attribution": "ENISA, CC BY 4.0",
},
},
{
"regulation_id": "enisa_nis2_technical_implementation_guidance",
"filename": "enisa_nis2_technical_implementation_guidance.pdf",
"upload_filename": "enisa_nis2_technical_implementation_guidance.txt",
"extra_metadata": {
"regulation_id": "enisa_nis2_technical_implementation_guidance",
"regulation_short": "ENISA NIS2 TIG v1.0",
"guideline_name": "ENISA Technical Implementation Guidance on Cybersecurity Risk Management Measures v1.0",
"doc_type": "technical_guidance",
"requirement_strength": "guidance",
"publication_year": "2025",
"license": "reuse_with_attribution",
"source": "enisa.europa.eu",
"attribution": "ENISA, CC BY 4.0",
},
},
{
"regulation_id": "enisa_nis2_security_measures_consultation",
"filename": "enisa_nis2_security_measures_implementation_guidance_consultation.pdf",
"upload_filename": "enisa_nis2_security_measures_consultation.txt",
"extra_metadata": {
"regulation_id": "enisa_nis2_security_measures_consultation",
"regulation_short": "ENISA NIS2 Security Measures (Draft)",
"guideline_name": "Implementation Guidance on Security Measures — Public Consultation Draft",
"doc_type": "consultation_draft",
"requirement_strength": "consultation_draft",
"license": "reuse_with_attribution",
"source": "enisa.europa.eu",
"attribution": "ENISA, CC BY 4.0",
},
},
{
"regulation_id": "enisa_cra_single_reporting_platform_faq",
"filename": "enisa_cra_single_reporting_platform_faq.html",
"upload_filename": "enisa_cra_single_reporting_platform_faq.txt",
"extra_metadata": {
"regulation_id": "enisa_cra_single_reporting_platform_faq",
"regulation_short": "ENISA SRP FAQ",
"guideline_name": "CRA Single Reporting Platform (SRP) FAQ",
"doc_type": "faq",
"requirement_strength": "guidance",
"license": "reuse_with_attribution",
"source": "enisa.europa.eu",
"attribution": "ENISA, CC BY 4.0",
},
},
{
"regulation_id": "enisa_eucc_evaluation_methodology_product_series",
"filename": "enisa_eucc_evaluation_methodology_product_series.pdf",
"upload_filename": "enisa_eucc_evaluation_methodology_product_series.txt",
"extra_metadata": {
"regulation_id": "enisa_eucc_evaluation_methodology_product_series",
"regulation_short": "EUCC Eval Methodology Product Series",
"guideline_name": "EUCC Guidelines — Evaluation Methodology for Product Series v1.0",
"doc_type": "evaluation_methodology",
"requirement_strength": "guidance",
"publication_year": "2025",
"license": "reuse_with_attribution",
"source": "enisa.europa.eu",
"attribution": "ENISA, CC BY 4.0",
},
},
{
"regulation_id": "enisa_threat_landscape_2025",
"filename": "enisa_threat_landscape_2025.pdf",
"upload_filename": "enisa_threat_landscape_2025.txt",
"extra_metadata": {
"regulation_id": "enisa_threat_landscape_2025",
"regulation_short": "ENISA Threat Landscape 2025",
"guideline_name": "ENISA Threat Landscape 2025 v1.2",
"doc_type": "threat_landscape",
"requirement_strength": "evidentiary",
"publication_year": "2025",
"license": "reuse_with_attribution",
"source": "enisa.europa.eu",
"attribution": "ENISA, CC BY 4.0",
},
},
{
"regulation_id": "enisa_cvd_policies_eu_2022",
"filename": "enisa_cvd_policies_eu_2022.pdf",
"upload_filename": "enisa_cvd_policies_eu_2022.txt",
"extra_metadata": {
"regulation_id": "enisa_cvd_policies_eu_2022",
"regulation_short": "ENISA CVD Policies EU 2022",
"guideline_name": "Coordinated Vulnerability Disclosure Policies in the EU (2022)",
"doc_type": "policy_study",
"requirement_strength": "guidance",
"publication_year": "2022",
"license": "reuse_with_attribution",
"source": "enisa.europa.eu",
"attribution": "ENISA, CC BY 4.0",
},
},
]
def normalize_text(text: str) -> str:
text = unicodedata.normalize("NFKC", text)
text = text.replace("\u00ad", "").replace("", "")  # "\u00ad" is the soft hyphen
prev = None
while prev != text:
prev = text
text = re.sub(r"(\d+)\s+\.\s+(\d+)", r"\1.\2", text)
text = re.sub(r"\b([A-Z]{2,4})\s+-\s+(\d+)\b", r"\1-\2", text)
text = re.sub(r"\(\s+(\d+)\s+\)", r"(\1)", text)
text = re.sub(r"[^\S\n]{2,}", " ", text)
return text
class _HTMLToText(HTMLParser):
SKIP = {"script", "style", "nav", "header", "footer", "noscript"}
BLOCK = {"p", "div", "li", "br", "h1", "h2", "h3", "h4", "h5", "h6", "tr", "section"}
def __init__(self) -> None:
super().__init__()
self._buf: list[str] = []
self._skip_depth = 0
def handle_starttag(self, tag, attrs):
if tag in self.SKIP:
self._skip_depth += 1
if tag in self.BLOCK:
self._buf.append("\n")
def handle_endtag(self, tag):
if tag in self.SKIP and self._skip_depth > 0:
self._skip_depth -= 1
if tag in self.BLOCK:
self._buf.append("\n")
def handle_data(self, data):
if self._skip_depth == 0:
self._buf.append(data)
def text(self) -> str:
raw = "".join(self._buf)
raw = re.sub(r"\n{3,}", "\n\n", raw)
return raw.strip()
def extract_pdf(path: Path) -> str:
print(f" Extracting PDF: {path.name}")
parts: list[str] = []
with pdfplumber.open(path) as pdf:
for i, page in enumerate(pdf.pages):
t = page.extract_text(x_tolerance=3, y_tolerance=4)
if t:
parts.append(t)
if (i + 1) % 50 == 0:
print(f" {i + 1}/{len(pdf.pages)} pages...")
return normalize_text("\n\n".join(parts))
def extract_html(path: Path) -> str:
print(f" Extracting HTML: {path.name}")
html = path.read_text(encoding="utf-8", errors="replace")
parser = _HTMLToText()
parser.feed(html)
return normalize_text(parser.text())
def get_text(doc) -> str:
path = SOURCE_DIR / doc["filename"]
if not path.exists():
raise FileNotFoundError(path)
if path.suffix.lower() == ".pdf":
text = extract_pdf(path)
elif path.suffix.lower() in {".html", ".htm"}:
text = extract_html(path)
else:
raise ValueError(f"Unsupported file type: {path.suffix}")
print(f" Extracted {len(text):,} chars")
return text
def upload_text_legal(text: str, filename: str, extra_metadata: dict) -> dict:
form_data = {
"collection": COLLECTION,
"data_type": "compliance",
"bundesland": "bund",
"use_case": "compliance",
"year": "2026",
"chunk_strategy": "legal",
"chunk_size": "1500",
"chunk_overlap": "100",
"metadata_json": json.dumps(extra_metadata, ensure_ascii=False),
}
with httpx.Client(timeout=UPLOAD_TIMEOUT, verify=False) as c:
resp = c.post(
f"{RAG_URL}/api/v1/documents/upload",
files={"file": (filename, text.encode("utf-8"), "text/plain")},
data=form_data,
)
resp.raise_for_status()
return resp.json()
def count_chunks(regulation_id: str) -> int:
with httpx.Client(timeout=30) as c:
resp = c.post(
f"{QDRANT_URL}/collections/{COLLECTION}/points/count",
json={
"filter": {
"must": [
{"key": "regulation_id", "match": {"value": regulation_id}}
]
},
"exact": True,
},
)
resp.raise_for_status()
return resp.json()["result"]["count"]
def main() -> int:
parser = argparse.ArgumentParser()
parser.add_argument("--dry-run", action="store_true",
help="Extract text and report sizes, but do not upload.")
parser.add_argument("--only", action="append", default=[],
help="Limit run to one or more regulation_ids.")
args = parser.parse_args()
if not SOURCE_DIR.exists():
print(f"ERROR: source dir not found: {SOURCE_DIR}")
return 2
docs = DOCS
if args.only:
wanted = set(args.only)
docs = [d for d in DOCS if d["regulation_id"] in wanted]
missing = wanted - {d["regulation_id"] for d in docs}
if missing:
print(f"ERROR: unknown regulation_id(s): {sorted(missing)}")
return 2
print("=" * 70)
print(f"ENISA CRA ingestion → collection={COLLECTION}")
print(f"Source dir: {SOURCE_DIR}")
print(f"Documents: {len(docs)} Dry run: {args.dry_run}")
print("=" * 70)
results = []
for i, doc in enumerate(docs, 1):
reg_id = doc["regulation_id"]
print(f"\n[{i}/{len(docs)}] {reg_id}")
existing = count_chunks(reg_id) if not args.dry_run else "?"
print(f" Existing chunks in Qdrant: {existing}")
try:
text = get_text(doc)
except Exception as e:
print(f" ERROR extracting text: {e}")
results.append({"id": reg_id, "chars": 0, "new": 0,
"strength": doc["extra_metadata"]["requirement_strength"]})
continue
if args.dry_run:
results.append({"id": reg_id, "chars": len(text), "new": "?",
"strength": doc["extra_metadata"]["requirement_strength"]})
continue
if existing and existing > 0:
print(f" SKIP — {existing} chunks already present. "
f"Use Qdrant delete-by-filter before re-ingesting.")
results.append({"id": reg_id, "chars": len(text), "new": 0,
"strength": doc["extra_metadata"]["requirement_strength"]})
continue
print(" Uploading with chunk_strategy='legal'...")
result = upload_text_legal(
text, doc["upload_filename"], doc["extra_metadata"]
)
new_chunks = result.get("chunks_count", 0)
new_doc_id = result.get("document_id", "")
print(f" -> {new_chunks} chunks (doc_id={new_doc_id})")
results.append({"id": reg_id, "chars": len(text), "new": new_chunks,
"strength": doc["extra_metadata"]["requirement_strength"]})
if i < len(docs):
time.sleep(2)
print("\n" + "=" * 70)
print("SUMMARY")
print("=" * 70)
for r in results:
print(f" {r['id']:<55} chars={r['chars']:<9} new={r['new']:<5} "
f"strength={r['strength']}")
total_new = sum(r["new"] for r in results if isinstance(r["new"], int))
print(f"\nTotal new chunks: {total_new}")
return 0
if __name__ == "__main__":
sys.exit(main())
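The HTML extraction path above relies on a skip-depth counter so that text inside ignored elements never reaches the buffer. The following is a minimal, self-contained sketch of that technique. The `SKIP` and `BLOCK` tag sets and the names `HTMLToTextSketch` and `html_to_text` are assumptions for illustration; the script's real tag sets are defined outside the lines shown here.

```python
import re
from html.parser import HTMLParser


class HTMLToTextSketch(HTMLParser):
    # Assumed tag sets; the original SKIP/BLOCK definitions are not shown.
    SKIP = {"script", "style", "head"}
    BLOCK = {"p", "div", "br", "li", "h1", "h2", "h3", "tr"}

    def __init__(self) -> None:
        super().__init__()
        self._buf: list[str] = []
        self._skip_depth = 0  # > 0 while inside any SKIP element

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1
        if tag in self.BLOCK:
            self._buf.append("\n")

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth > 0:
            self._skip_depth -= 1
        if tag in self.BLOCK:
            self._buf.append("\n")

    def handle_data(self, data):
        # Text is kept only when no SKIP element is currently open.
        if self._skip_depth == 0:
            self._buf.append(data)

    def text(self) -> str:
        raw = "".join(self._buf)
        # Collapse runs of three or more newlines into one blank line.
        return re.sub(r"\n{3,}", "\n\n", raw).strip()


def html_to_text(html: str) -> str:
    parser = HTMLToTextSketch()
    parser.feed(html)
    parser.close()
    return parser.text()
```

Using a counter rather than a boolean keeps nested ignored elements (for example a `style` inside `head`) from re-enabling output when the inner element closes.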
@@ -0,0 +1,201 @@
#!/usr/bin/env python3
"""Ingest missing EU regulations from EUR-Lex (HTML).
Downloads German HTML from EUR-Lex via CELEX number, uploads with legal chunking.
Usage (on Mac Mini):
python3 control-pipeline/scripts/ingest_eu_regulations.py --dry-run
python3 control-pipeline/scripts/ingest_eu_regulations.py
"""
import argparse
import json
import logging
import time
import httpx
logging.basicConfig(level=logging.INFO, format="%(asctime)s [%(levelname)s] %(message)s")
logger = logging.getLogger("ingest-eu")
RAG_URL = "https://localhost:8097"
QDRANT_URL = "http://localhost:6333"
COLLECTION = "bp_compliance_ce"
EURLEX_URL = "https://eur-lex.europa.eu/legal-content/DE/TXT/HTML/?uri=CELEX:{celex}"
# ---- EU Regulations to ingest ----
REGULATIONS = [
{
"celex": "32022L2464",
"regulation_id": "csrd_2022",
"name": "Corporate Sustainability Reporting Directive (CSRD)",
"short": "CSRD",
"category": "sustainability",
},
{
"celex": "32024L1760",
"regulation_id": "csddd_2024",
"name": "Corporate Sustainability Due Diligence Directive (CSDDD)",
"short": "CSDDD",
"category": "sustainability",
},
{
"celex": "32020R0852",
"regulation_id": "eu_taxonomy_2020",
"name": "EU-Taxonomie-Verordnung",
"short": "EU Taxonomy",
"category": "sustainability",
},
{
"celex": "32024R1183",
"regulation_id": "eidas_2_0_2024",
"name": "eIDAS 2.0 Verordnung (EU Digital Identity)",
"short": "eIDAS 2.0",
"category": "digital_identity",
},
{
"celex": "32023L0970",
"regulation_id": "pay_transparency_2023",
"name": "Entgelttransparenz-Richtlinie",
"short": "Pay Transparency",
"category": "employment",
},
{
"celex": "32022R2065",
"regulation_id": "dsa_2022_updated",
"name": "Digital Services Act (DSA) — aktualisiert",
"short": "DSA",
"category": "digital_services",
"skip_if_exists": "dsa_2022", # already exists under different ID
},
]
def download_eurlex(celex: str) -> str:
"""Download EU regulation HTML from EUR-Lex."""
url = EURLEX_URL.format(celex=celex)
with httpx.Client(timeout=30.0, follow_redirects=True) as c:
resp = c.get(url)
resp.raise_for_status()
return resp.text
def upload_html(html: str, filename: str, reg: dict, dry_run: bool = False):
"""Upload HTML to RAG service."""
if dry_run:
logger.info(" DRY RUN — would upload %d chars", len(html))
return {"chunks_count": 0}
meta = {
"regulation_id": reg["regulation_id"],
"regulation_name_de": reg["name"],
"regulation_short": reg["short"],
"celex": reg["celex"],
"category": reg["category"],
"source": "EUR-Lex",
"license": "EU_law",
"jurisdiction": "EU",
"source_type": "law",
}
form_data = {
"collection": COLLECTION,
"data_type": "compliance",
"bundesland": "bund",
"use_case": "compliance",
"year": "2026",
"chunk_strategy": "legal",
"chunk_size": "1500",
"chunk_overlap": "100",
"metadata_json": json.dumps(meta, ensure_ascii=False),
}
with httpx.Client(timeout=600.0, verify=False) as c:
resp = c.post(
f"{RAG_URL}/api/v1/documents/upload",
files={"file": (filename, html.encode("utf-8"), "text/html")},
data=form_data,
)
resp.raise_for_status()
return resp.json()
def count_existing(regulation_id: str) -> int:
with httpx.Client(timeout=60.0) as c:
resp = c.post(
f"{QDRANT_URL}/collections/{COLLECTION}/points/count",
json={"filter": {"must": [
{"key": "regulation_id", "match": {"value": regulation_id}}
]}, "exact": True},
)
resp.raise_for_status()
return resp.json()["result"]["count"]
def main():
parser = argparse.ArgumentParser()
parser.add_argument("--dry-run", action="store_true")
args = parser.parse_args()
logger.info("=" * 60)
logger.info("Ingest EU Regulations from EUR-Lex")
logger.info(" Regulations: %d", len(REGULATIONS))
logger.info(" Dry run: %s", args.dry_run)
logger.info("=" * 60)
results = []
for i, reg in enumerate(REGULATIONS, 1):
logger.info("\n[%d/%d] %s (CELEX: %s)", i, len(REGULATIONS), reg["name"], reg["celex"])
# Skip if variant already exists
skip_id = reg.get("skip_if_exists")
if skip_id:
existing = count_existing(skip_id)
if existing > 0:
logger.info(" Already exists as '%s' (%d chunks) — SKIPPING", skip_id, existing)
results.append({"reg": reg["short"], "status": "exists", "chunks": existing})
continue
# Check if this exact ID exists
existing = count_existing(reg["regulation_id"])
if existing > 0:
logger.info(" Already exists: %d chunks — SKIPPING", existing)
results.append({"reg": reg["short"], "status": "exists", "chunks": existing})
continue
# Download from EUR-Lex
logger.info(" Downloading from EUR-Lex...")
try:
html = download_eurlex(reg["celex"])
logger.info(" Downloaded: %d chars", len(html))
except Exception as e:
logger.error(" Download FAILED: %s", e)
results.append({"reg": reg["short"], "status": "download_failed", "chunks": 0})
continue
# Upload
filename = f"{reg['regulation_id']}.html"
try:
result = upload_html(html, filename, reg, args.dry_run)
chunks = result.get("chunks_count", 0)
logger.info(" Uploaded: %d chunks", chunks)
results.append({"reg": reg["short"], "status": "ok", "chunks": chunks})
except Exception as e:
logger.error(" Upload FAILED: %s", e)
results.append({"reg": reg["short"], "status": "error", "chunks": 0})
if i < len(REGULATIONS):
time.sleep(2)
# Summary
logger.info("\n" + "=" * 60)
logger.info("RESULTS")
logger.info("=" * 60)
for r in results:
logger.info(" %-20s %s chunks=%d", r["reg"], r["status"].upper(), r["chunks"])
total_new = sum(r["chunks"] for r in results if r["status"] == "ok")
logger.info("\nTotal new chunks: %d", total_new)
if __name__ == "__main__":
main()
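The skip decision in `main()` above (first the `skip_if_exists` variant, then the regulation's own ID) can be separated from the network calls so that it is testable offline. This is a sketch under assumptions: `count_body`, `existing_id`, and the `count_fn` callback are hypothetical names, not part of the script; only the request body shape is taken from `count_existing()`.

```python
from typing import Callable, Optional


def count_body(regulation_id: str) -> dict:
    """Request body for an exact count filtered by regulation_id,
    mirroring the JSON that count_existing() posts."""
    return {
        "filter": {
            "must": [
                {"key": "regulation_id", "match": {"value": regulation_id}}
            ]
        },
        "exact": True,
    }


def existing_id(reg: dict, count_fn: Callable[[str], int]) -> Optional[str]:
    """Return the ID under which the regulation is already ingested,
    or None if it still has to be downloaded and uploaded.

    count_fn is a hypothetical callback (regulation_id -> chunk count),
    so the decision can be exercised without a running Qdrant.
    """
    for rid in (reg.get("skip_if_exists"), reg["regulation_id"]):
        if rid and count_fn(rid) > 0:
            return rid
    return None
```

In the script itself, `count_fn` would simply be `count_existing`; in a test it can be a dictionary lookup.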
@@ -0,0 +1,303 @@
#!/usr/bin/env python3
"""
E2E Quality Report: Verify controls have correct source citations.
Loads N random controls from PostgreSQL, cross-references with Qdrant chunks,
and reports mismatches between source_citation and actual chunk metadata.
Usage:
# Against Mac Mini
python3 scripts/quality_report.py --db-host macmini --qdrant-url http://macmini:6333
# Smaller sample
python3 scripts/quality_report.py --db-host macmini --sample 100
"""
import argparse
import json
import logging
import sys
import httpx
from sqlalchemy import create_engine, text
from sqlalchemy.orm import sessionmaker
logging.basicConfig(level=logging.INFO, format="%(asctime)s %(levelname)s %(message)s")
logger = logging.getLogger("quality-report")
COLLECTIONS = [
"bp_compliance_ce", "bp_compliance_gesetze", "bp_compliance_datenschutz",
"bp_dsfa_corpus", "bp_legal_templates",
]
def load_controls(db_url: str, sample_size: int) -> list[dict]:
"""Load random controls with source_citation from PostgreSQL."""
engine = create_engine(db_url)
Session = sessionmaker(bind=engine)
with Session() as db:
rows = db.execute(text("""
SELECT id::text, control_id, title,
source_citation::text, source_original_text,
generation_metadata::text, release_state
FROM compliance.canonical_controls
WHERE source_citation IS NOT NULL
AND source_original_text IS NOT NULL
AND release_state = 'draft'
ORDER BY RANDOM()
LIMIT :n
"""), {"n": sample_size}).fetchall()
controls = []
for row in rows:
citation = json.loads(row[3]) if row[3] else {}
metadata = json.loads(row[5]) if row[5] else {}
controls.append({
"id": row[0],
"control_id": row[1],
"title": row[2],
"citation": citation,
"source_text": row[4],
"metadata": metadata,
"release_state": row[6],
})
return controls
def build_qdrant_index(qdrant_url: str) -> dict:
"""Build regulation_id → list[chunk] index from Qdrant.
Controls were generated from OLD chunks (512 chars). Qdrant now has
NEW chunks (1500 chars). Hash matching won't work — use regulation +
section matching instead.
"""
logger.info("Building Qdrant chunk index by regulation_id...")
index = {} # regulation_id → [{"section": ..., "text_snippet": ..., ...}]
client = httpx.Client(timeout=60.0)
for coll in COLLECTIONS:
offset = None
for _ in range(600):
body = {"limit": 250, "with_payload": True, "with_vector": False}
if offset:
body["offset"] = offset
r = client.post(f"{qdrant_url}/collections/{coll}/points/scroll", json=body)
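            if r.status_code != 200:
                # A failed scroll leaves the index incomplete; say so.
                logger.warning("Scroll of %s failed with HTTP %d", coll, r.status_code)
                break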
data = r.json()["result"]
for pt in data["points"]:
reg_id = pt["payload"].get("regulation_id", "")
if not reg_id:
continue
chunk = {
"section": pt["payload"].get("section", ""),
"section_title": pt["payload"].get("section_title", ""),
"paragraph": pt["payload"].get("paragraph", ""),
"text_snippet": pt["payload"].get("chunk_text", "")[:200],
"filename": pt["payload"].get("filename", ""),
"collection": coll,
}
index.setdefault(reg_id, []).append(chunk)
offset = data.get("next_page_offset")
if not offset:
break
client.close()
total = sum(len(v) for v in index.values())
logger.info("Qdrant index: %d regulations, %d chunks", len(index), total)
return index
def check_control(ctrl: dict, qdrant_index: dict) -> dict:
"""Check a single control's source_citation against Qdrant chunks.
Strategy: Find chunks by regulation_id from generation_metadata,
then check if any chunk has a matching section/article.
"""
result = {
"control_id": ctrl["control_id"],
"title": (ctrl["title"] or "")[:60],
"citation_source": ctrl["citation"].get("source", ""),
"citation_article": ctrl["citation"].get("article", ""),
"citation_paragraph": ctrl["citation"].get("paragraph", ""),
"citation_page": ctrl["citation"].get("page"),
"issues": [],
}
# Get regulation_id from generation_metadata
reg_code = ctrl["metadata"].get("source_regulation", "")
citation_article = ctrl["citation"].get("article", "")
# Check 1: Does the control have a regulation reference?
if not reg_code:
result["issues"].append("NO_REGULATION_CODE")
return result
# Check 2: Does this regulation exist in Qdrant?
chunks = qdrant_index.get(reg_code, [])
if not chunks:
result["issues"].append(f"REGULATION_NOT_IN_QDRANT: {reg_code}")
result["reg_found"] = False
return result
result["reg_found"] = True
result["reg_chunks"] = len(chunks)
# Check 3: Does the control have an article citation?
if not citation_article:
result["issues"].append("NO_ARTICLE_IN_CITATION")
# Still check if chunks have section metadata at all
has_section = any(c["section"] for c in chunks)
if has_section:
result["issues"].append("CHUNKS_HAVE_SECTIONS_BUT_CONTROL_MISSING")
return result
# Check 4: Is the cited article found in any chunk's section?
norm_article = citation_article.strip().lower()
matching_chunks = [
c for c in chunks
if c["section"] and (
norm_article == c["section"].strip().lower()
or norm_article in c["section"].strip().lower()
or c["section"].strip().lower() in norm_article
)
]
if matching_chunks:
result["article_match"] = True
result["matched_section"] = matching_chunks[0]["section"]
else:
# Check if ANY chunk has sections (the article might just not match)
sections_in_regulation = sorted(set(c["section"] for c in chunks if c["section"]))
if sections_in_regulation:
result["issues"].append(
f"ARTICLE_NOT_FOUND_IN_CHUNKS: '{citation_article}' not in {sections_in_regulation[:5]}"
)
else:
result["issues"].append("NO_SECTIONS_IN_REGULATION_CHUNKS")
# Check 5: Does source_original_text contain the cited article?
source_text = ctrl["source_text"] or ""
    if citation_article and source_text:
        # The case-insensitive test also covers a "[<article>" prefix,
        # so no separate bracket check is needed.
        if citation_article.lower() not in source_text.lower():
            result["issues"].append("ARTICLE_NOT_IN_SOURCE_TEXT")
if not result["issues"]:
result["issues"] = ["OK"]
return result
def generate_report(results: list[dict]):
    """Print the quality report."""

    def has_issue(r: dict, code: str) -> bool:
        # Codes with a detail suffix ("CODE: ...") are matched by prefix.
        return any(i == code or i.startswith(code + ":") for i in r["issues"])

    def pct(n: int) -> int:
        return n * 100 // max(total, 1)

    # The issue codes below are the ones check_control() actually emits.
    total = len(results)
    ok = sum(1 for r in results if r["issues"] == ["OK"])
    reg_found = sum(1 for r in results if r.get("reg_found", False))
    not_found = [r for r in results if has_issue(r, "REGULATION_NOT_IN_QDRANT")]
    no_article = sum(1 for r in results if has_issue(r, "NO_ARTICLE_IN_CITATION"))
    no_section = sum(1 for r in results
                     if has_issue(r, "NO_SECTIONS_IN_REGULATION_CHUNKS"))
    mismatches = [r for r in results if has_issue(r, "ARTICLE_NOT_FOUND_IN_CHUNKS")]
    not_in_text = sum(1 for r in results if has_issue(r, "ARTICLE_NOT_IN_SOURCE_TEXT"))

    print("\n" + "=" * 100)
    print("QUALITY REPORT: CONTROL SOURCE CITATION VERIFICATION")
    print("=" * 100)
    print(f"\nSample: {total} controls")
    print(f"\n{'Metric':<45} {'Count':>8} {'Share':>8}")
    print("-" * 65)
    rows = [
        ("OK (no issues)", ok),
        ("Regulation found in Qdrant", reg_found),
        ("Regulation NOT in Qdrant", len(not_found)),
        ("No article in source_citation", no_article),
        ("No sections in Qdrant chunks", no_section),
        ("Article not found in chunk sections", len(mismatches)),
        ("Article not in source text", not_in_text),
    ]
    for label, n in rows:
        print(f"{label:<45} {n:>8} {pct(n):>7}%")

    # Show sample mismatches
    if mismatches:
        print("\n=== ARTICLE MISMATCHES (first 10) ===\n")
        for r in mismatches[:10]:
            print(f"  {r['control_id']:20s} {r['title'][:40]:40s}")
            for i in r["issues"]:
                if i.startswith("ARTICLE_NOT_FOUND_IN_CHUNKS"):
                    print(f"      {i}")

    # Show sample controls whose regulation is missing from Qdrant
    if not_found:
        print("\n=== REGULATION NOT IN QDRANT (first 10) ===\n")
        for r in not_found[:10]:
            src = r.get("citation_source") or "?"
            art = r.get("citation_article") or "?"
            print(f"  {r['control_id']:20s} {src[:25]:25s} {art}")

    # Distribution by source
    print("\n=== BY SOURCE ===\n")
    source_stats = {}
    for r in results:
        src = (r.get("citation_source") or "?")[:30]
        s = source_stats.setdefault(
            src, {"total": 0, "ok": 0, "no_reg": 0, "no_section": 0})
        s["total"] += 1
        if r["issues"] == ["OK"]:
            s["ok"] += 1
        if has_issue(r, "REGULATION_NOT_IN_QDRANT"):
            s["no_reg"] += 1
        if has_issue(r, "NO_SECTIONS_IN_REGULATION_CHUNKS"):
            s["no_section"] += 1
    print(f"  {'Source':<32} {'Total':>6} {'OK':>6} {'OK%':>6} {'NoReg':>8} {'NoSect':>8}")
    print(f"  {'-'*72}")
    for src in sorted(source_stats.keys(), key=lambda k: -source_stats[k]["total"]):
        s = source_stats[src]
        src_pct = s["ok"] * 100 // max(s["total"], 1)
        print(f"  {src:<32} {s['total']:>6} {s['ok']:>6} {src_pct:>5}% "
              f"{s['no_reg']:>8} {s['no_section']:>8}")

    print(f"\n{'='*100}")
    verdict = "PASS" if pct(ok) >= 50 else "NEEDS IMPROVEMENT"
    print(f"RESULT: {verdict} ({ok}/{total} controls, {pct(ok)}% fully correct)")
    print(f"{'='*100}")
def main():
parser = argparse.ArgumentParser(description="Control Source Citation Quality Report")
parser.add_argument("--db-host", default="macmini")
parser.add_argument("--db-port", type=int, default=5432)
parser.add_argument("--db-name", default="breakpilot_db")
parser.add_argument("--db-user", default="breakpilot")
parser.add_argument("--db-pass", default="breakpilot123")
parser.add_argument("--qdrant-url", default="http://macmini:6333")
parser.add_argument("--sample", type=int, default=500)
args = parser.parse_args()
db_url = f"postgresql://{args.db_user}:{args.db_pass}@{args.db_host}:{args.db_port}/{args.db_name}"
# Load controls
logger.info("Loading %d random controls from DB...", args.sample)
controls = load_controls(db_url, args.sample)
logger.info("Loaded %d controls with source_citation", len(controls))
if not controls:
print("ERROR: No controls found with source_citation")
sys.exit(1)
# Build Qdrant index
qdrant_index = build_qdrant_index(args.qdrant_url)
# Check each control
logger.info("Checking %d controls against Qdrant...", len(controls))
results = []
for ctrl in controls:
result = check_control(ctrl, qdrant_index)
results.append(result)
# Report
generate_report(results)
if __name__ == "__main__":
main()
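Check 4 in `check_control()` accepts a match when either string contains the other, which means a citation of "Art. 3" also matches a chunk section "Art. 37". One possible tightening is to compare a parsed (kind, number) key instead of raw substrings. The sketch below is an assumption about how such a comparison could look; `article_key`, `same_article`, and the label formats the regex accepts are illustrative and not taken from the pipeline.

```python
import re
from typing import Optional

# Assumed label formats: "Art. 37", "Artikel 37", "Article 37", "§ 38".
_ARTICLE_RE = re.compile(r"(art\.?|artikel|article|§)\s*(\d+[a-z]?)", re.IGNORECASE)


def article_key(label: str) -> Optional[tuple[str, str]]:
    """Reduce a citation or section label to (kind, number), or None."""
    m = _ARTICLE_RE.search(label or "")
    if not m:
        return None
    kind = "§" if m.group(1) == "§" else "art"
    return (kind, m.group(2).lower())


def same_article(citation: str, section: str) -> bool:
    """True only if both labels name the same article or paragraph number."""
    a, b = article_key(citation), article_key(section)
    return a is not None and a == b
```

Labels that the regex cannot parse compare as unequal here, so a caller that wants to keep the permissive behaviour for unparsed labels would need to fall back to the existing substring test.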
@@ -0,0 +1,486 @@
#!/usr/bin/env python3
"""
D5 Re-Ingestion: Re-chunk all ~297 legal sources with structural metadata.
Usage:
# Dry-run: build manifest, no changes
python3 scripts/reingest_d5.py --dry-run
# Re-ingest one collection (test)
python3 scripts/reingest_d5.py --collection bp_compliance_gesetze
# Re-ingest all collections (resume-capable)
python3 scripts/reingest_d5.py --resume
# Custom URLs
python3 scripts/reingest_d5.py --rag-url https://macmini:8097 --qdrant-url http://macmini:6333
"""
import argparse
import json
import logging
import random
import sys
import time
from datetime import datetime, timezone
import httpx
from reingest_d5_config import (
CHUNK_OVERLAP,
CHUNK_SIZE,
CHUNK_STRATEGY,
DEFAULT_QDRANT_URL,
DEFAULT_RAG_URL,
MANIFEST_FILE,
TARGET_COLLECTIONS,
content_type_from_filename,
doc_key,
extract_doc_metadata,
load_progress,
save_progress,
)
logging.basicConfig(level=logging.INFO, format="%(asctime)s %(levelname)s %(message)s")
logger = logging.getLogger("d5-reingest")
UPLOAD_TIMEOUT = httpx.Timeout(timeout=3600.0, connect=30.0)
SCROLL_TIMEOUT = httpx.Timeout(timeout=60.0, connect=10.0)
# ---------------------------------------------------------------------------
# Phase 0: Preflight
# ---------------------------------------------------------------------------
def preflight_checks(rag_url: str, qdrant_url: str) -> dict:
"""Verify services are reachable and record baseline chunk counts."""
logger.info("Phase 0: Preflight checks...")
with httpx.Client(timeout=10.0, verify=False) as c:
r = c.get(f"{rag_url}/health")
r.raise_for_status()
logger.info(" RAG service: OK")
with httpx.Client(timeout=10.0) as c:
r = c.get(f"{qdrant_url}/collections")
r.raise_for_status()
logger.info(" Qdrant: OK")
before_counts = {}
with httpx.Client(timeout=10.0) as c:
for coll in TARGET_COLLECTIONS:
try:
r = c.post(f"{qdrant_url}/collections/{coll}/points/count",
json={"exact": True})
r.raise_for_status()
count = r.json()["result"]["count"]
except Exception:
count = 0
before_counts[coll] = count
logger.info(" %s: %d chunks", coll, count)
return before_counts
# ---------------------------------------------------------------------------
# Phase 1: Build manifest
# ---------------------------------------------------------------------------
def build_manifest(qdrant_url: str, collections: list[str]) -> list[dict]:
"""Scroll Qdrant and build a deduplicated document manifest."""
logger.info("Phase 1: Building document manifest...")
documents: dict[str, dict] = {} # keyed by doc_key(object_name, collection)
with httpx.Client(timeout=SCROLL_TIMEOUT) as client:
for coll in collections:
logger.info(" Scrolling %s...", coll)
offset = None
points_seen = 0
while True:
body: dict = {
"limit": 250,
"with_payload": True,
"with_vector": False,
}
if offset:
body["offset"] = offset
resp = client.post(
f"{qdrant_url}/collections/{coll}/points/scroll",
json=body,
)
resp.raise_for_status()
data = resp.json()["result"]
points = data["points"]
for pt in points:
payload = pt.get("payload", {})
obj_name = payload.get("object_name", "")
if not obj_name:
continue
key = doc_key(obj_name, coll)
if key not in documents:
meta = extract_doc_metadata(payload)
documents[key] = {
"object_name": obj_name,
"collection": coll,
"filename": payload.get("filename", obj_name.split("/")[-1]),
"form": meta["form"],
"extra_metadata": meta["extra"],
"old_chunk_count": 0,
}
documents[key]["old_chunk_count"] += 1
points_seen += len(points)
offset = data.get("next_page_offset")
if not offset:
break
logger.info(" %d points → %d unique docs",
points_seen,
sum(1 for d in documents.values() if d["collection"] == coll))
manifest = list(documents.values())
logger.info(" Total: %d unique documents across %d collections",
len(manifest), len(collections))
return manifest
# ---------------------------------------------------------------------------
# Phase 2: Per-document re-ingestion
# ---------------------------------------------------------------------------
def download_file(rag_url: str, object_name: str) -> bytes:
"""Download file bytes via MinIO presigned URL."""
with httpx.Client(timeout=60.0, verify=False) as c:
resp = c.get(f"{rag_url}/api/v1/documents/download/{object_name}")
resp.raise_for_status()
presigned_url = resp.json()["url"]
with httpx.Client(timeout=120.0, verify=False) as c:
resp = c.get(presigned_url)
resp.raise_for_status()
return resp.content
def delete_old_chunks(qdrant_url: str, collection: str, object_name: str) -> int:
    """Delete all chunks for a document from Qdrant.

    Always returns 0: the count is not read from the delete response.
    """
with httpx.Client(timeout=30.0) as c:
resp = c.post(
f"{qdrant_url}/collections/{collection}/points/delete",
json={
"filter": {
"must": [{
"key": "object_name",
"match": {"value": object_name},
}]
}
},
)
resp.raise_for_status()
return 0 # Qdrant delete doesn't return count
def _delete_old_chunks_safe(
qdrant_url: str, collection: str, object_name: str, keep_doc_id: str,
) -> None:
"""Delete old chunks for a document, keeping chunks with keep_doc_id."""
with httpx.Client(timeout=30.0) as c:
resp = c.post(
f"{qdrant_url}/collections/{collection}/points/delete",
json={
"filter": {
"must": [{
"key": "object_name",
"match": {"value": object_name},
}],
"must_not": [{
"key": "document_id",
"match": {"value": keep_doc_id},
}],
}
},
)
resp.raise_for_status()
def reupload_document(
rag_url: str,
file_bytes: bytes,
filename: str,
collection: str,
form_fields: dict,
extra_metadata: dict,
) -> dict:
"""Upload document to RAG service with new chunking parameters."""
ct = content_type_from_filename(filename)
form_data = {
"collection": collection,
"data_type": form_fields.get("data_type", "compliance"),
"bundesland": form_fields.get("bundesland", "bund"),
"use_case": form_fields.get("use_case", "compliance"),
"year": form_fields.get("year", "2026"),
"chunk_strategy": CHUNK_STRATEGY,
"chunk_size": str(CHUNK_SIZE),
"chunk_overlap": str(CHUNK_OVERLAP),
"metadata_json": json.dumps(extra_metadata, ensure_ascii=False),
}
with httpx.Client(timeout=UPLOAD_TIMEOUT, verify=False) as c:
resp = c.post(
f"{rag_url}/api/v1/documents/upload",
files={"file": (filename, file_bytes, ct)},
data=form_data,
)
resp.raise_for_status()
return resp.json()
def process_document(
doc: dict,
rag_url: str,
qdrant_url: str,
progress: dict,
max_retries: int = 2,
) -> bool:
    """Process a single document: download → upload → check chunk count → delete old.

    Safe order: new chunks are created FIRST; old chunks are deleted only
    after the upload reports at least one new chunk (upload-before-delete
    pattern). No separate Qdrant query is made before deleting.
    """
key = doc_key(doc["object_name"], doc["collection"])
# Skip if already done
if progress.get("documents", {}).get(key, {}).get("status") == "done":
return True
for attempt in range(max_retries + 1):
try:
# 1. Download
file_bytes = download_file(rag_url, doc["object_name"])
if not file_bytes:
logger.warning(" Empty file: %s — skipping", doc["object_name"])
progress.setdefault("documents", {})[key] = {
"status": "skipped", "reason": "empty_file"}
return False
# 2. Upload FIRST (creates new chunks alongside old ones)
result = reupload_document(
rag_url, file_bytes, doc["filename"],
doc["collection"], doc["form"], doc["extra_metadata"],
)
new_chunks = result.get("chunks_count", 0)
new_doc_id = result.get("document_id", "")
if new_chunks == 0:
logger.error(" Upload produced 0 chunks — keeping old data: %s",
doc["object_name"])
progress.setdefault("documents", {})[key] = {
"status": "error", "error": "0 new chunks"}
return False
# 3. Delete OLD chunks only (exclude the new document_id)
_delete_old_chunks_safe(
qdrant_url, doc["collection"],
doc["object_name"], new_doc_id,
)
# 4. Record success
progress.setdefault("documents", {})[key] = {
"status": "done",
"old_chunks": doc["old_chunk_count"],
"new_chunks": new_chunks,
"new_document_id": result.get("document_id", ""),
"completed_at": datetime.now(timezone.utc).isoformat(),
}
return True
except httpx.HTTPStatusError as e:
if e.response.status_code == 404:
logger.warning(" File not in MinIO (404): %s — skipping", doc["object_name"])
progress.setdefault("documents", {})[key] = {
"status": "skipped", "reason": "not_in_minio"}
return False
if attempt < max_retries:
wait = 5 * (attempt + 1)
logger.warning(" HTTP %d on attempt %d, retrying in %ds...",
e.response.status_code, attempt + 1, wait)
time.sleep(wait)
else:
logger.error(" FAILED after %d retries: %s", max_retries, e)
progress.setdefault("documents", {})[key] = {
"status": "error", "error": str(e), "retries": max_retries}
return False
except Exception as e:
if attempt < max_retries:
wait = 10 * (attempt + 1)
logger.warning(" Error on attempt %d: %s — retrying in %ds",
attempt + 1, e, wait)
time.sleep(wait)
else:
logger.error(" FAILED after %d retries: %s", max_retries, e)
progress.setdefault("documents", {})[key] = {
"status": "error", "error": str(e), "retries": max_retries}
return False
return False
# ---------------------------------------------------------------------------
# Phase 3: Verification
# ---------------------------------------------------------------------------
def verify_results(
qdrant_url: str,
before_counts: dict,
collections: list[str],
manifest: list[dict],
):
"""Compare before/after counts and spot-check metadata."""
logger.info("Phase 3: Verification...")
print("\n" + "=" * 65)
print("D5 RE-INGESTION VERIFICATION REPORT")
print("=" * 65)
after_counts = {}
with httpx.Client(timeout=10.0) as c:
for coll in collections:
try:
r = c.post(f"{qdrant_url}/collections/{coll}/points/count",
json={"exact": True})
r.raise_for_status()
after_counts[coll] = r.json()["result"]["count"]
except Exception:
after_counts[coll] = -1
print(f"\n{'Collection':<35} {'Before':>8} {'After':>8} {'Delta':>8}")
print("-" * 65)
for coll in collections:
before = before_counts.get(coll, 0)
after = after_counts.get(coll, -1)
delta = after - before if after >= 0 else "?"
print(f"{coll:<35} {before:>8} {after:>8} {str(delta):>8}")
# Spot-check: pick 3 random docs and verify metadata
print("\nSpot-check (3 random docs):")
sample = random.sample(manifest, min(3, len(manifest)))
with httpx.Client(timeout=30.0) as c:
for doc in sample:
resp = c.post(
f"{qdrant_url}/collections/{doc['collection']}/points/scroll",
json={
"limit": 3,
"with_payload": True,
"with_vector": False,
"filter": {
"must": [{
"key": "object_name",
"match": {"value": doc["object_name"]},
}]
},
},
)
if resp.status_code != 200:
print(f" {doc['object_name']}: QUERY FAILED")
continue
points = resp.json()["result"]["points"]
if not points:
print(f" {doc['object_name']}: NO CHUNKS FOUND")
continue
has_section = sum(1 for p in points if p["payload"].get("section"))
has_para = sum(1 for p in points if p["payload"].get("paragraph"))
print(f" {doc['filename'][:40]:<42} "
f"chunks={len(points):>3} "
f"with_section={has_section}/{len(points)} "
f"with_para={has_para}/{len(points)}")
print()
# ---------------------------------------------------------------------------
# Main
# ---------------------------------------------------------------------------
def main():
parser = argparse.ArgumentParser(description="D5 Re-Ingestion Script")
parser.add_argument("--rag-url", default=DEFAULT_RAG_URL)
parser.add_argument("--qdrant-url", default=DEFAULT_QDRANT_URL)
parser.add_argument("--dry-run", action="store_true",
help="Build manifest only, no changes")
parser.add_argument("--collection", default=None,
help="Process only this collection")
parser.add_argument("--resume", action="store_true",
help="Resume from progress file")
args = parser.parse_args()
collections = [args.collection] if args.collection else TARGET_COLLECTIONS
# Phase 0
before_counts = preflight_checks(args.rag_url, args.qdrant_url)
# Phase 1
manifest = build_manifest(args.qdrant_url, collections)
# Save manifest for inspection
with open(MANIFEST_FILE, "w", encoding="utf-8") as f:
json.dump(manifest, f, indent=2, ensure_ascii=False)
logger.info("Manifest saved to %s", MANIFEST_FILE)
if args.dry_run:
print(f"\nDRY RUN: {len(manifest)} documents found. See {MANIFEST_FILE}")
for doc in manifest[:10]:
reg = doc["extra_metadata"].get("regulation_code", "?")
print(f" {reg:<30} {doc['collection']:<35} chunks={doc['old_chunk_count']}")
if len(manifest) > 10:
print(f" ... and {len(manifest) - 10} more")
sys.exit(0)
# Phase 2
progress = load_progress() if args.resume else {"documents": {}}
progress["started_at"] = datetime.now(timezone.utc).isoformat()
progress["before_counts"] = before_counts
done = 0
skipped = 0
failed = 0
for i, doc in enumerate(manifest, 1):
key = doc_key(doc["object_name"], doc["collection"])
reg = doc["extra_metadata"].get("regulation_code", "?")
if progress.get("documents", {}).get(key, {}).get("status") == "done":
done += 1
continue
logger.info("[%d/%d] %s (%s) — %d old chunks",
i, len(manifest), reg, doc["collection"], doc["old_chunk_count"])
ok = process_document(doc, args.rag_url, args.qdrant_url, progress)
if ok:
done += 1
new_chunks = progress["documents"][key].get("new_chunks", "?")
logger.info(" OK: %d old → %s new chunks", doc["old_chunk_count"], new_chunks)
elif progress["documents"][key].get("status") == "skipped":
skipped += 1
else:
failed += 1
save_progress(progress)
time.sleep(2)
logger.info("Phase 2 complete: %d done, %d skipped, %d failed", done, skipped, failed)
# Phase 3
verify_results(args.qdrant_url, before_counts, collections, manifest)
print(f"Summary: {done} done, {skipped} skipped, {failed} failed")
if failed:
print(f"Re-run with --resume to retry {failed} failed documents")
sys.exit(1)
if __name__ == "__main__":
main()
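The upload-before-delete step depends on the `must_not` clause to protect the freshly written chunks. If the upload response carried no `document_id`, that clause would match nothing, and, assuming the new chunks share the same `object_name`, the delete would remove them as well. A small guard around the filter construction makes this failure explicit. The function name `keep_new_delete_body` is hypothetical; the filter shape is the one `_delete_old_chunks_safe()` posts.

```python
def keep_new_delete_body(object_name: str, keep_doc_id: str) -> dict:
    """Build the Qdrant delete body that removes a document's old chunks
    while keeping the chunks written under keep_doc_id.

    Raises ValueError for an empty keep_doc_id, because the must_not
    clause could then not exclude the new chunks.
    """
    if not keep_doc_id:
        raise ValueError("keep_doc_id is empty; refusing to build delete filter")
    return {
        "filter": {
            "must": [
                {"key": "object_name", "match": {"value": object_name}}
            ],
            "must_not": [
                {"key": "document_id", "match": {"value": keep_doc_id}}
            ],
        }
    }
```

A caller could catch the `ValueError`, record the document as failed in the progress file, and leave the old chunks in place.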
@@ -0,0 +1,92 @@
"""D5 Re-Ingestion: Constants, helpers, progress tracking."""
import json
import logging
import os
logger = logging.getLogger("d5-reingest")
# ---------------------------------------------------------------------------
# Defaults (overridable via CLI args)
# ---------------------------------------------------------------------------
DEFAULT_RAG_URL = "https://macmini:8097"
DEFAULT_QDRANT_URL = "http://macmini:6333"
TARGET_COLLECTIONS = [
"bp_compliance_ce",
"bp_compliance_gesetze",
"bp_compliance_datenschutz",
"bp_dsfa_corpus",
"bp_legal_templates",
"bp_compliance_schulrecht",
]
# New chunking parameters (D1-D4 validated)
CHUNK_STRATEGY = "recursive"
CHUNK_SIZE = 1500
CHUNK_OVERLAP = 100
PROGRESS_FILE = "d5_reingest_progress.json"
MANIFEST_FILE = "d5_manifest.json"
# Per-chunk fields (NOT carried as extra metadata during re-upload)
PER_CHUNK_FIELDS = frozenset({
"chunk_text", "chunk_index", "document_id", "object_name",
"filename", "data_type", "bundesland", "use_case", "year",
"section", "section_title", "paragraph", "paragraph_num", "page",
})
# Upload form fields that come from the payload (not metadata_json)
FORM_FIELDS = frozenset({"data_type", "bundesland", "use_case", "year"})
# ---------------------------------------------------------------------------
# Progress tracking
# ---------------------------------------------------------------------------
def load_progress(path: str = PROGRESS_FILE) -> dict:
if os.path.exists(path):
with open(path, encoding="utf-8") as f:
return json.load(f)
return {"documents": {}}
def save_progress(data: dict, path: str = PROGRESS_FILE):
with open(path, "w", encoding="utf-8") as f:
json.dump(data, f, indent=2, ensure_ascii=False, default=str)
# ---------------------------------------------------------------------------
# Metadata extraction
# ---------------------------------------------------------------------------
def extract_doc_metadata(payload: dict) -> dict:
    """Split Qdrant payload into form fields + extra metadata.
    Returns: {"form": {data_type, bundesland, ...}, "extra": {regulation_code, ...}}
    """
    form = {}
    extra = {}
    for k, v in payload.items():
        # Check FORM_FIELDS first: every form field is also listed in
        # PER_CHUNK_FIELDS, so testing that set first would leave "form" empty.
        if k in FORM_FIELDS:
            form[k] = v
        elif k not in PER_CHUNK_FIELDS:
            extra[k] = v
    return {"form": form, "extra": extra}
def doc_key(object_name: str, collection: str) -> str:
"""Unique key for a document in the progress file."""
return f"{object_name}|{collection}"
def content_type_from_filename(filename: str) -> str:
"""Infer MIME type from file extension."""
ext = os.path.splitext(filename)[1].lower()
return {
".pdf": "application/pdf",
".html": "text/html",
".htm": "text/html",
".md": "text/markdown",
".txt": "text/plain",
}.get(ext, "application/octet-stream")
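The metadata split feeds the re-upload: form fields are sent as individual form values, while the remaining document-level metadata travels as one JSON string in `metadata_json`. A minimal sketch of that hand-off follows. It assumes form fields are meant to be tested before per-chunk fields, uses reduced copies of the two field sets, and `split_payload` is a hypothetical stand-in for the helper above.

```python
import json

# Reduced copies of the two field sets, for illustration only.
PER_CHUNK_FIELDS = frozenset({"chunk_text", "chunk_index", "document_id", "section"})
FORM_FIELDS = frozenset({"data_type", "bundesland", "use_case", "year"})


def split_payload(payload: dict) -> dict:
    """Hypothetical helper: separate upload form fields from document-level metadata."""
    form, extra = {}, {}
    for key, value in payload.items():
        if key in FORM_FIELDS:
            form[key] = value
        elif key not in PER_CHUNK_FIELDS:
            extra[key] = value
    return {"form": form, "extra": extra}


payload = {
    "chunk_text": "sample text",
    "chunk_index": 7,
    "data_type": "compliance",
    "year": "2026",
    "regulation_id": "nist_csf_2_0",
    "license": "public_domain_us",
}
parts = split_payload(payload)
# Form fields stay top-level; everything else is serialized into metadata_json.
form_data = {
    **parts["form"],
    "metadata_json": json.dumps(parts["extra"], ensure_ascii=False),
}
print(form_data)
```

Per-chunk values such as `chunk_text` are dropped on purpose, since the service regenerates them when it re-chunks the document.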
@@ -0,0 +1,485 @@
#!/usr/bin/env python3
"""Safe re-ingestion of NIST/BSI/ENISA PDFs from MinIO.
Uses upload-before-delete pattern: new chunks are created FIRST,
old chunks are only deleted after successful verification.
Usage:
python3 control-pipeline/scripts/reingest_nist.py [--dry-run]
python3 control-pipeline/scripts/reingest_nist.py --only-missing
"""
import argparse
import json
import logging
import sys
import time
import httpx
sys.path.insert(0, "control-pipeline/scripts")
from reingest_d5_config import ( # noqa: E402
CHUNK_OVERLAP,
CHUNK_SIZE,
CHUNK_STRATEGY,
DEFAULT_QDRANT_URL,
DEFAULT_RAG_URL,
content_type_from_filename,
)
logging.basicConfig(
level=logging.INFO,
format="%(asctime)s [%(levelname)s] %(message)s",
)
logger = logging.getLogger("reingest-nist")
UPLOAD_TIMEOUT = 1800.0 # 30 min for large PDFs
# -------------------------------------------------------------------
# Documents to re-ingest
# -------------------------------------------------------------------
# 4 documents with 0 chunks (deleted by D5, upload failed)
MISSING_DOCS = [
{
"object_name": "compliance/bund/compliance/2026/NIST_SP_800_53r5.pdf",
"collection": "bp_compliance_datenschutz",
"filename": "NIST_SP_800_53r5.pdf",
"extra_metadata": {
"regulation_id": "nist_sp800_53r5",
"source_id": "nist",
"doc_type": "controls_catalog",
"guideline_name": "NIST SP 800-53 Rev. 5 Security and Privacy Controls",
"license": "public_domain_us_gov",
"attribution": "NIST",
"source": "nist.gov",
},
},
{
"object_name": "compliance/bund/compliance/2026/nist_sp_800_82r3.pdf",
"collection": "bp_compliance_ce",
"filename": "nist_sp_800_82r3.pdf",
"extra_metadata": {
"regulation_id": "nist_sp_800_82r3",
"regulation_name_de": "NIST SP 800-82 Rev. 3 — Guide to OT Security",
"regulation_name_en": "NIST SP 800-82 Rev. 3 — Guide to OT Security",
"regulation_short": "NIST SP 800-82",
"category": "ot_security",
"license": "public_domain_us",
"source": "nist.gov",
},
},
{
"object_name": "compliance/bund/compliance/2026/nist_sp_800_160v1r1.pdf",
"collection": "bp_compliance_ce",
"filename": "nist_sp_800_160v1r1.pdf",
"extra_metadata": {
"regulation_id": "nist_sp_800_160v1r1",
"regulation_name_de": "NIST SP 800-160 Vol. 1 Rev. 1",
"regulation_name_en": "NIST SP 800-160 Vol. 1 Rev. 1",
"regulation_short": "NIST SP 800-160",
"category": "security_engineering",
"license": "public_domain_us",
"source": "nist.gov",
},
},
{
"object_name": "compliance/bund/compliance/2026/NIST_SP_800_207.pdf",
"collection": "bp_compliance_datenschutz",
"filename": "NIST_SP_800_207.pdf",
"extra_metadata": {
"regulation_id": "nist_sp800_207",
"source_id": "nist",
"doc_type": "architecture",
"guideline_name": "NIST SP 800-207 Zero Trust Architecture",
"license": "public_domain_us_gov",
"attribution": "NIST",
"source": "nist.gov",
},
},
]
# Additional NIST/BSI/ENISA docs with <10% section rate (re-ingest for quality)
LOW_QUALITY_DOCS = [
{
"object_name": "compliance/bund/compliance/2026/nist_csf_2_0.pdf",
"collection": "bp_compliance_datenschutz",
"filename": "nist_csf_2_0.pdf",
"extra_metadata": {
"regulation_id": "nist_csf_2_0",
"license": "public_domain_us",
"source": "nist.gov",
},
},
{
"object_name": "compliance/bund/compliance/2026/nistir_8259a.pdf",
"collection": "bp_compliance_datenschutz",
"filename": "nistir_8259a.pdf",
"extra_metadata": {
"regulation_id": "nistir_8259a",
"license": "public_domain_us",
"source": "nist.gov",
},
},
{
"object_name": "compliance/bund/compliance/2026/nist_ai_rmf.pdf",
"collection": "bp_compliance_datenschutz",
"filename": "nist_ai_rmf.pdf",
"extra_metadata": {
"regulation_id": "nist_ai_rmf",
"license": "public_domain_us",
"source": "nist.gov",
},
},
{
"object_name": "compliance/bund/compliance/2026/nist_sp_800_30r1.pdf",
"collection": "bp_compliance_ce",
"filename": "nist_sp_800_30r1.pdf",
"extra_metadata": {
"regulation_id": "nist_sp_800_30r1",
"license": "public_domain_us",
"source": "nist.gov",
},
},
{
"object_name": "compliance/bund/compliance/2026/enisa_supply_chain_good_practices.pdf",
"collection": "bp_compliance_ce",
"filename": "enisa_supply_chain_good_practices.pdf",
"extra_metadata": {
"regulation_id": "enisa_supply_chain_good_practices",
"license": "reuse_with_attribution",
"source": "enisa.europa.eu",
},
},
{
"object_name": "compliance/bund/compliance/2026/enisa_ics_scada.pdf",
"collection": "bp_compliance_ce",
"filename": "enisa_ics_scada.pdf",
"extra_metadata": {
"regulation_id": "enisa_ics_scada_dependencies",
"license": "reuse_with_attribution",
"source": "enisa.europa.eu",
},
},
{
"object_name": "compliance/bund/compliance/2026/enisa_supply_chain_security.pdf",
"collection": "bp_compliance_ce",
"filename": "enisa_supply_chain_security.pdf",
"extra_metadata": {
"regulation_id": "enisa_threat_landscape_supply_chain",
"license": "reuse_with_attribution",
"source": "enisa.europa.eu",
},
},
{
"object_name": "compliance/bund/compliance/2026/cisa_secure_by_design.pdf",
"collection": "bp_compliance_ce",
"filename": "cisa_secure_by_design.pdf",
"extra_metadata": {
"regulation_id": "cisa_secure_by_design",
"license": "public_domain_us",
"source": "cisa.gov",
},
},
{
"object_name": "compliance/bund/compliance/2026/cvss_v4_0.pdf",
"collection": "bp_compliance_ce",
"filename": "cvss_v4_0.pdf",
"extra_metadata": {
"regulation_id": "cvss_v4_0",
"license": "public_domain_us",
"source": "first.org",
},
},
]
# -------------------------------------------------------------------
# Qdrant helpers
# -------------------------------------------------------------------
def count_chunks(qdrant_url: str, collection: str, object_name: str) -> int:
"""Count existing chunks for a document in Qdrant."""
with httpx.Client(timeout=30.0) as c:
resp = c.post(
f"{qdrant_url}/collections/{collection}/points/count",
json={
"filter": {
"must": [{
"key": "object_name",
"match": {"value": object_name},
}]
},
"exact": True,
},
)
resp.raise_for_status()
return resp.json()["result"]["count"]
def get_old_document_ids(
qdrant_url: str, collection: str, object_name: str,
) -> set:
"""Get all document_ids for existing chunks of this document."""
doc_ids = set()
offset = None
with httpx.Client(timeout=60.0) as c:
while True:
body = {
"filter": {
"must": [{
"key": "object_name",
"match": {"value": object_name},
}]
},
"limit": 100,
"with_payload": ["document_id"],
}
if offset is not None:
body["offset"] = offset
resp = c.post(
f"{qdrant_url}/collections/{collection}/points/scroll",
json=body,
)
resp.raise_for_status()
data = resp.json()["result"]
for pt in data["points"]:
did = pt.get("payload", {}).get("document_id")
if did:
doc_ids.add(did)
offset = data.get("next_page_offset")
if offset is None:
break
return doc_ids
def delete_by_document_ids(
qdrant_url: str, collection: str, doc_ids: set,
) -> None:
"""Delete chunks matching specific document_ids."""
for did in doc_ids:
with httpx.Client(timeout=30.0) as c:
c.post(
f"{qdrant_url}/collections/{collection}/points/delete",
json={
"filter": {
"must": [{
"key": "document_id",
"match": {"value": did},
}]
}
},
).raise_for_status()
def check_section_rate(
qdrant_url: str, collection: str, object_name: str,
) -> tuple:
"""Check section rate for a document's chunks. Returns (total, with_section)."""
total = 0
with_section = 0
offset = None
with httpx.Client(timeout=60.0) as c:
while True:
body = {
"filter": {
"must": [{
"key": "object_name",
"match": {"value": object_name},
}]
},
"limit": 100,
"with_payload": ["section"],
}
if offset is not None:
body["offset"] = offset
resp = c.post(
f"{qdrant_url}/collections/{collection}/points/scroll",
json=body,
)
resp.raise_for_status()
data = resp.json()["result"]
for pt in data["points"]:
total += 1
sec = pt.get("payload", {}).get("section", "")
if sec and sec.strip():
with_section += 1
offset = data.get("next_page_offset")
if offset is None:
break
return total, with_section
# -------------------------------------------------------------------
# Upload
# -------------------------------------------------------------------
def download_from_minio(rag_url: str, object_name: str) -> bytes:
"""Download file from MinIO via RAG service presigned URL."""
with httpx.Client(timeout=60.0, verify=False) as c:
resp = c.get(f"{rag_url}/api/v1/documents/download/{object_name}")
resp.raise_for_status()
presigned_url = resp.json()["url"]
with httpx.Client(timeout=300.0, verify=False) as c:
resp = c.get(presigned_url)
resp.raise_for_status()
return resp.content
def upload_document(
rag_url: str,
file_bytes: bytes,
filename: str,
collection: str,
extra_metadata: dict,
) -> dict:
"""Upload document to RAG service."""
ct = content_type_from_filename(filename)
form_data = {
"collection": collection,
"data_type": "compliance",
"bundesland": "bund",
"use_case": "compliance",
"year": "2026",
"chunk_strategy": CHUNK_STRATEGY,
"chunk_size": str(CHUNK_SIZE),
"chunk_overlap": str(CHUNK_OVERLAP),
"metadata_json": json.dumps(extra_metadata, ensure_ascii=False),
}
with httpx.Client(timeout=UPLOAD_TIMEOUT, verify=False) as c:
resp = c.post(
f"{rag_url}/api/v1/documents/upload",
files={"file": (filename, file_bytes, ct)},
data=form_data,
)
resp.raise_for_status()
return resp.json()
# -------------------------------------------------------------------
# Main processing
# -------------------------------------------------------------------
def process_document(
doc: dict,
rag_url: str,
qdrant_url: str,
dry_run: bool = False,
) -> dict:
"""Safe re-ingest: upload first, then delete old. Returns result dict."""
obj = doc["object_name"]
coll = doc["collection"]
fname = doc["filename"]
# 1. Check existing state
old_count = count_chunks(qdrant_url, coll, obj)
old_doc_ids = get_old_document_ids(qdrant_url, coll, obj) if old_count > 0 else set()
logger.info(" [%s] existing: %d chunks, %d document_ids",
fname, old_count, len(old_doc_ids))
if dry_run:
logger.info(" [%s] DRY RUN — would download + upload + delete old", fname)
return {"status": "dry_run", "old_chunks": old_count}
# 2. Download from MinIO
logger.info(" [%s] downloading from MinIO...", fname)
file_bytes = download_from_minio(rag_url, obj)
size_mb = len(file_bytes) / (1024 * 1024)
logger.info(" [%s] downloaded %.1f MB", fname, size_mb)
# 3. Upload FIRST (creates new chunks)
logger.info(" [%s] uploading to RAG service...", fname)
result = upload_document(rag_url, file_bytes, fname, coll, doc["extra_metadata"])
new_chunks = result.get("chunks_count", 0)
new_doc_id = result.get("document_id", "")
logger.info(" [%s] uploaded: %d new chunks (doc_id=%s)", fname, new_chunks, new_doc_id)
# 4. Verify new chunks exist
if new_chunks == 0:
logger.error(" [%s] UPLOAD PRODUCED 0 CHUNKS — keeping old data!", fname)
return {"status": "error", "error": "0 new chunks", "old_chunks": old_count}
# 5. Delete old chunks (only if there were any)
if old_doc_ids:
logger.info(" [%s] deleting %d old document_ids...", fname, len(old_doc_ids))
delete_by_document_ids(qdrant_url, coll, old_doc_ids)
logger.info(" [%s] old chunks deleted", fname)
# 6. Check section rate
total, with_sec = check_section_rate(qdrant_url, coll, obj)
pct = (with_sec / total * 100) if total > 0 else 0
logger.info(" [%s] section rate: %d/%d (%.0f%%)", fname, with_sec, total, pct)
return {
"status": "ok",
"old_chunks": old_count,
"new_chunks": new_chunks,
"new_document_id": new_doc_id,
"section_rate": round(pct, 1),
}
def main():
parser = argparse.ArgumentParser(description="Safe NIST/BSI/ENISA re-ingestion")
parser.add_argument("--dry-run", action="store_true", help="Show what would happen")
parser.add_argument("--only-missing", action="store_true",
help="Only re-ingest the 4 missing docs (skip low-quality)")
parser.add_argument("--rag-url", default=DEFAULT_RAG_URL)
parser.add_argument("--qdrant-url", default=DEFAULT_QDRANT_URL)
args = parser.parse_args()
docs = list(MISSING_DOCS)
if not args.only_missing:
docs.extend(LOW_QUALITY_DOCS)
logger.info("=" * 60)
logger.info("NIST/BSI/ENISA Safe Re-Ingestion")
logger.info(" Documents: %d (%d missing + %d low-quality)",
len(docs), len(MISSING_DOCS),
0 if args.only_missing else len(LOW_QUALITY_DOCS))
logger.info(" RAG: %s", args.rag_url)
logger.info(" Qdrant: %s", args.qdrant_url)
logger.info(" Dry run: %s", args.dry_run)
logger.info("=" * 60)
results = {}
ok = 0
errors = 0
for i, doc in enumerate(docs, 1):
logger.info("[%d/%d] %s → %s", i, len(docs), doc["filename"], doc["collection"])
try:
r = process_document(doc, args.rag_url, args.qdrant_url, args.dry_run)
results[doc["filename"]] = r
if r["status"] == "ok":
ok += 1
elif r["status"] == "error":
errors += 1
except Exception as e:
logger.error(" FAILED: %s", e)
results[doc["filename"]] = {"status": "error", "error": str(e)}
errors += 1
if i < len(docs):
time.sleep(2)
# Summary
logger.info("")
logger.info("=" * 60)
logger.info("RESULTS")
logger.info("=" * 60)
for fname, r in results.items():
status = r["status"].upper()
old = r.get("old_chunks", "?")
new = r.get("new_chunks", "?")
sec = r.get("section_rate", "?")
logger.info(" %-40s %s old=%s new=%s sect=%.0f%%",
fname, status, old, new, sec if isinstance(sec, float) else 0)
logger.info("")
logger.info("OK: %d, Errors: %d, Total: %d", ok, errors, len(docs))
if errors > 0:
sys.exit(1)
if __name__ == "__main__":
main()
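The upload-before-delete pattern used by this script depends on one detail: the old document_ids are captured before the upload, so the later delete can never touch the freshly created chunks. The sketch below illustrates that ordering against an in-memory dictionary instead of Qdrant. `store`, `fake_upload` and `safe_reingest` are hypothetical names, and the real script additionally downloads the file from MinIO and checks the section rate.

```python
import itertools

store = {}  # point id -> payload (stand-in for a Qdrant collection)
_point_ids = itertools.count(1)
_doc_ids = itertools.count(1)


def fake_upload(object_name: str, chunks: list) -> dict:
    """Hypothetical uploader: stores chunks under a fresh document_id."""
    doc_id = f"doc-{next(_doc_ids)}"
    for text in chunks:
        store[next(_point_ids)] = {
            "object_name": object_name,
            "document_id": doc_id,
            "text": text,
        }
    return {"document_id": doc_id, "chunks_count": len(chunks)}


def safe_reingest(object_name: str, new_chunks: list) -> str:
    # 1. Remember which document_ids exist BEFORE the upload.
    old_doc_ids = {
        p["document_id"] for p in store.values() if p["object_name"] == object_name
    }
    # 2. Upload first, so the document is never absent from the store.
    result = fake_upload(object_name, new_chunks)
    # 3. Keep the old data if the upload produced nothing.
    if result["chunks_count"] == 0:
        return "error"
    # 4. Delete only the ids captured in step 1; the new id is not among them.
    for pid in [pid for pid, p in store.items() if p["document_id"] in old_doc_ids]:
        del store[pid]
    return "ok"


fake_upload("a.pdf", ["old-1", "old-2"])
status_empty = safe_reingest("a.pdf", [])
texts_after_empty = sorted(p["text"] for p in store.values())
status_ok = safe_reingest("a.pdf", ["new-1", "new-2", "new-3"])
texts_after_ok = sorted(p["text"] for p in store.values())
print(status_empty, texts_after_empty)
print(status_ok, texts_after_ok)
```

A failed or empty upload leaves the old chunks in place, which is the failure mode the delete-first approach could not recover from.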
@@ -0,0 +1,213 @@
#!/usr/bin/env python3
"""
Replace EU regulation PDFs with clean HTML from EUR-Lex.
Downloads HTML versions of EU regulations (using CELEX numbers),
deletes old PDF chunks from Qdrant, uploads HTML via RAG service.
Usage:
python3 scripts/replace_eu_pdfs_with_html.py --dry-run
python3 scripts/replace_eu_pdfs_with_html.py
python3 scripts/replace_eu_pdfs_with_html.py --celex 32016R0679 # single doc
"""
import argparse
import json
import logging
import time
import httpx
logging.basicConfig(level=logging.INFO, format="%(asctime)s %(levelname)s %(message)s")
logger = logging.getLogger("eurlex-replace")
DEFAULT_RAG_URL = "https://macmini:8097"
DEFAULT_QDRANT_URL = "http://macmini:6333"
EURLEX_HTML_URL = "https://eur-lex.europa.eu/legal-content/DE/TXT/HTML/?uri=CELEX:{celex}"
# EU regulations with CELEX numbers and their current collection + metadata
EU_REGULATIONS = [
{"celex": "32024R1689", "reg_id": "ai_act_2024", "name": "AI Act", "coll": "bp_compliance_ce"},
{"celex": "32024R2847", "reg_id": "cra_2024", "name": "Cyber Resilience Act", "coll": "bp_compliance_ce"},
{"celex": "32022L2555", "reg_id": "nis2_2022", "name": "NIS2-Richtlinie", "coll": "bp_compliance_ce"},
{"celex": "32016R0679", "reg_id": "dsgvo_2016", "name": "DSGVO", "coll": "bp_compliance_ce"},
{"celex": "32024R1624", "reg_id": "amlr_2024", "name": "Anti-Geldwaesche-VO", "coll": "bp_compliance_ce"},
{"celex": "32017R0745", "reg_id": "eu_mdr_2017", "name": "Medical Device Regulation", "coll": "bp_compliance_ce"},
{"celex": "32022R2065", "reg_id": "dsa_2022", "name": "Digital Services Act", "coll": "bp_compliance_ce"},
{"celex": "32022R1925", "reg_id": "dma_2022", "name": "Digital Markets Act", "coll": "bp_compliance_ce"},
{"celex": "32022R2554", "reg_id": "dora_2022", "name": "DORA", "coll": "bp_compliance_ce"},
{"celex": "32022R0868", "reg_id": "dga_2022", "name": "Data Governance Act", "coll": "bp_compliance_ce"},
{"celex": "32023R2854", "reg_id": "dataact_2023", "name": "Data Act", "coll": "bp_compliance_ce"},
{"celex": "32023R0988", "reg_id": "gpsr_2023", "name": "General Product Safety Regulation", "coll": "bp_compliance_ce"},
{"celex": "32023R1230", "reg_id": "machinery_2023", "name": "Maschinenverordnung", "coll": "bp_compliance_ce"},
{"celex": "32023R1803", "reg_id": "ifrs_2023", "name": "IFRS Regulation", "coll": "bp_compliance_ce"},
{"celex": "32023D1795", "reg_id": "dpf_2023", "name": "Data Privacy Framework", "coll": "bp_compliance_ce"},
{"celex": "32019L2161", "reg_id": "omnibus_2019", "name": "Omnibus-Richtlinie", "coll": "bp_compliance_ce"},
{"celex": "32019L0790", "reg_id": "dsm_2019", "name": "DSM-Richtlinie", "coll": "bp_compliance_ce"},
{"celex": "32019L0770", "reg_id": "digital_content_2019", "name": "Digital Content Directive", "coll": "bp_compliance_ce"},
{"celex": "32002L0058", "reg_id": "eprivacy_2002", "name": "ePrivacy-Richtlinie", "coll": "bp_compliance_ce"},
{"celex": "32000L0031", "reg_id": "ecommerce_2000", "name": "E-Commerce-Richtlinie", "coll": "bp_compliance_ce"},
]
def download_eurlex_html(celex: str) -> bytes:
"""Download HTML from EUR-Lex for a given CELEX number."""
url = EURLEX_HTML_URL.format(celex=celex)
with httpx.Client(timeout=60.0, follow_redirects=True) as c:
r = c.get(url)
r.raise_for_status()
return r.content
def delete_old_chunks(qdrant_url: str, collection: str, reg_id: str):
    """Delete all chunks whose regulation_id equals reg_id (exact match)."""
    with httpx.Client(timeout=30.0) as c:
        r = c.post(f"{qdrant_url}/collections/{collection}/points/delete", json={
            "filter": {"must": [{"key": "regulation_id", "match": {"value": reg_id}}]}
        })
        if r.status_code != 200:
            # Make the failure visible: uploading afterwards would duplicate chunks.
            logger.warning("  Delete of old chunks failed: HTTP %d", r.status_code)
def find_old_chunks_by_filename(qdrant_url: str, collection: str, filename_pattern: str) -> int:
    """Count existing chunks whose regulation_id equals filename_pattern.
    Despite the name, the filter matches the regulation_id payload field exactly,
    not a filename. Returns 0 if the count request fails.
    """
    with httpx.Client(timeout=30.0) as c:
        r = c.post(f"{qdrant_url}/collections/{collection}/points/count", json={
            "exact": True,
            "filter": {"must": [{"key": "regulation_id", "match": {"value": filename_pattern}}]}
        })
        if r.status_code == 200:
            return r.json()["result"]["count"]
    return 0
def upload_html(rag_url: str, html_bytes: bytes, reg: dict) -> dict:
"""Upload HTML to RAG service."""
filename = f"{reg['reg_id']}.html"
metadata = json.dumps({
"regulation_id": reg["reg_id"],
"regulation_name_de": reg["name"],
"celex": reg["celex"],
"source": "EUR-Lex",
"license": "EU_law",
"source_type": "law",
"category": "eu_regulation",
}, ensure_ascii=False)
with httpx.Client(timeout=3600.0, verify=False) as c:
r = c.post(f"{rag_url}/api/v1/documents/upload",
files={"file": (filename, html_bytes, "text/html")},
data={
"collection": reg["coll"],
"data_type": "compliance",
"bundesland": "eu",
"use_case": "regulation",
"year": reg["celex"][1:5],
"chunk_strategy": "recursive",
"chunk_size": "1500",
"chunk_overlap": "100",
"metadata_json": metadata,
},
)
r.raise_for_status()
return r.json()
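def check_section_rate(qdrant_url: str, collection: str, reg_id: str) -> tuple:
    """Check section rate for a regulation. Returns (total, with_section)."""
    total = 0
    with_section = 0
    offset = None
    with httpx.Client(timeout=30.0) as c:
        while True:
            # One scroll request returns at most `limit` points, so follow
            # next_page_offset to cover regulations with more than 100 chunks.
            body = {
                "limit": 100, "with_payload": True, "with_vector": False,
                "filter": {"must": [{"key": "regulation_id", "match": {"value": reg_id}}]},
            }
            if offset is not None:
                body["offset"] = offset
            r = c.post(f"{qdrant_url}/collections/{collection}/points/scroll", json=body)
            if r.status_code != 200:
                break
            data = r.json()["result"]
            total += len(data["points"])
            with_section += sum(1 for p in data["points"] if p["payload"].get("section"))
            offset = data.get("next_page_offset")
            if offset is None:
                break
    return total, with_section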
def main():
parser = argparse.ArgumentParser(description="Replace EU PDFs with EUR-Lex HTML")
parser.add_argument("--rag-url", default=DEFAULT_RAG_URL)
parser.add_argument("--qdrant-url", default=DEFAULT_QDRANT_URL)
parser.add_argument("--dry-run", action="store_true")
parser.add_argument("--celex", default=None, help="Process only this CELEX number")
args = parser.parse_args()
regs = EU_REGULATIONS
if args.celex:
regs = [r for r in regs if r["celex"] == args.celex]
if not regs:
print(f"CELEX {args.celex} not found in list")
return
results = []
for reg in regs:
logger.info("[%s] %s (%s)", reg["celex"], reg["name"], reg["reg_id"])
# Download HTML
try:
html_bytes = download_eurlex_html(reg["celex"])
logger.info(" Downloaded: %d bytes", len(html_bytes))
except Exception as e:
logger.error(" Download FAILED: %s", e)
results.append({"reg": reg, "status": "download_failed", "error": str(e)})
continue
if args.dry_run:
results.append({"reg": reg, "status": "dry_run", "html_size": len(html_bytes)})
continue
# Delete old chunks
old_count = find_old_chunks_by_filename(args.qdrant_url, reg["coll"], reg["reg_id"])
delete_old_chunks(args.qdrant_url, reg["coll"], reg["reg_id"])
logger.info(" Deleted %d old chunks", old_count)
# Upload HTML
try:
result = upload_html(args.rag_url, html_bytes, reg)
new_chunks = result.get("chunks_count", 0)
logger.info(" Uploaded: %d new chunks", new_chunks)
except Exception as e:
logger.error(" Upload FAILED: %s", e)
results.append({"reg": reg, "status": "upload_failed", "error": str(e)})
time.sleep(2)
continue
# Check quality
time.sleep(2)
total, with_sec = check_section_rate(args.qdrant_url, reg["coll"], reg["reg_id"])
pct = with_sec * 100 // max(total, 1)
logger.info(" Section rate: %d/%d = %d%%", with_sec, total, pct)
results.append({
"reg": reg, "status": "ok",
"old_chunks": old_count, "new_chunks": new_chunks,
"section_rate": pct,
})
time.sleep(2)
# Report
print("\n" + "=" * 90)
print("EUR-LEX REPLACEMENT REPORT")
print("=" * 90)
print(f"{'CELEX':<15} {'Name':<30} {'Status':<10} {'Old':>5} {'New':>5} {'Sect%':>6}")
print("-" * 90)
for r in results:
reg = r["reg"]
status = r["status"]
old = r.get("old_chunks", "")
new = r.get("new_chunks", r.get("html_size", ""))
sect = f"{r.get('section_rate', '')}%" if "section_rate" in r else ""
print(f"{reg['celex']:<15} {reg['name'][:30]:<30} {status:<10} {str(old):>5} {str(new):>5} {sect:>6}")
if __name__ == "__main__":
main()
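Qdrant's scroll endpoint returns results one page at a time, so a section-rate check that should cover every chunk has to follow `next_page_offset` until it is None. The sketch below shows that loop against a fake in-memory backend. `scroll_all`, `fake_fetch` and `POINTS` are hypothetical names, and the response shape mirrors only the two fields the scripts read (`points` and `next_page_offset`).

```python
def scroll_all(fetch_page, limit: int = 100):
    """Yield every point by following next_page_offset until it is None."""
    offset = None
    while True:
        page = fetch_page(offset, limit)
        yield from page["points"]
        offset = page.get("next_page_offset")
        if offset is None:
            break


# Fake backend with 250 points; every third point carries a section.
POINTS = [
    {"id": i, "payload": {"section": "Art. 1" if i % 3 == 0 else ""}}
    for i in range(250)
]


def fake_fetch(offset, limit):
    """Return one page of POINTS plus the offset of the next page, if any."""
    start = offset or 0
    end = start + limit
    return {
        "points": POINTS[start:end],
        "next_page_offset": end if end < len(POINTS) else None,
    }


total = 0
with_section = 0
for pt in scroll_all(fake_fetch):
    total += 1
    if pt["payload"].get("section", "").strip():
        with_section += 1

# What a single request without pagination would have seen.
first_page_only = len(fake_fetch(None, 100)["points"])
print(total, with_section, first_page_only)
```

Reading only the first page would cap the total at the page size and skew the reported rate for any regulation with more chunks than that.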
@@ -0,0 +1,437 @@
#!/usr/bin/env python3
"""Re-upload NIST/BSI/ENISA docs with chunk_strategy='legal' for section metadata.
The docs were already uploaded with 'recursive' strategy (no section detection).
This script re-uploads with 'legal' strategy, then deletes old recursive chunks.
Usage (on Mac Mini):
python3 control-pipeline/scripts/reupload_legal_strategy.py
python3 control-pipeline/scripts/reupload_legal_strategy.py --dry-run
"""
import argparse
import io
import json
import re
import sys
import time
import unicodedata
import httpx
import pdfplumber
RAG_URL = "https://localhost:8097"
QDRANT_URL = "http://localhost:6333"
UPLOAD_TIMEOUT = 1800.0
# ---- Documents to process ----
DOCS = [
# 4 NIST docs already extracted at /tmp/nist_*.txt
{
"regulation_id": "nist_sp800_53r5",
"collection": "bp_compliance_datenschutz",
"upload_filename": "NIST_SP_800_53r5.txt",
"local_txt": "/tmp/nist_nist_sp800_53r5.txt",
"minio_pdf": None, # already extracted
"extra_metadata": {
"regulation_id": "nist_sp800_53r5",
"source_id": "nist",
"doc_type": "controls_catalog",
"guideline_name": "NIST SP 800-53 Rev. 5",
"license": "public_domain_us_gov",
"source": "nist.gov",
},
},
{
"regulation_id": "nist_sp_800_82r3",
"collection": "bp_compliance_ce",
"upload_filename": "nist_sp_800_82r3.txt",
"local_txt": "/tmp/nist_nist_sp_800_82r3.txt",
"minio_pdf": None,
"extra_metadata": {
"regulation_id": "nist_sp_800_82r3",
"regulation_short": "NIST SP 800-82",
"license": "public_domain_us",
"source": "nist.gov",
},
},
{
"regulation_id": "nist_sp_800_160v1r1",
"collection": "bp_compliance_ce",
"upload_filename": "nist_sp_800_160v1r1.txt",
"local_txt": "/tmp/nist_160.txt",
"minio_pdf": None,
"extra_metadata": {
"regulation_id": "nist_sp_800_160v1r1",
"regulation_short": "NIST SP 800-160",
"license": "public_domain_us",
"source": "nist.gov",
},
},
{
"regulation_id": "nist_sp800_207",
"collection": "bp_compliance_datenschutz",
"upload_filename": "NIST_SP_800_207.txt",
"local_txt": None, # needs extraction
"minio_pdf": "compliance/bund/compliance/2026/NIST_SP_800_207.pdf",
"extra_metadata": {
"regulation_id": "nist_sp800_207",
"source_id": "nist",
"doc_type": "architecture",
"guideline_name": "NIST SP 800-207 Zero Trust Architecture",
"license": "public_domain_us_gov",
"source": "nist.gov",
},
},
# Additional low-quality docs (need extraction from MinIO)
{
"regulation_id": "nist_csf_2_0",
"collection": "bp_compliance_datenschutz",
"upload_filename": "nist_csf_2_0.txt",
"local_txt": None,
"minio_pdf": "compliance/bund/compliance/2026/nist_csf_2_0.pdf",
"extra_metadata": {
"regulation_id": "nist_csf_2_0",
"license": "public_domain_us",
"source": "nist.gov",
},
},
{
"regulation_id": "nistir_8259a",
"collection": "bp_compliance_datenschutz",
"upload_filename": "nistir_8259a.txt",
"local_txt": None,
"minio_pdf": "compliance/bund/compliance/2026/nistir_8259a.pdf",
"extra_metadata": {
"regulation_id": "nistir_8259a",
"license": "public_domain_us",
"source": "nist.gov",
},
},
{
"regulation_id": "nist_ai_rmf",
"collection": "bp_compliance_datenschutz",
"upload_filename": "nist_ai_rmf.txt",
"local_txt": None,
"minio_pdf": "compliance/bund/compliance/2026/nist_ai_rmf.pdf",
"extra_metadata": {
"regulation_id": "nist_ai_rmf",
"license": "public_domain_us",
"source": "nist.gov",
},
},
{
"regulation_id": "nist_sp_800_30r1",
"collection": "bp_compliance_ce",
"upload_filename": "nist_sp_800_30r1.txt",
"local_txt": None,
"minio_pdf": "compliance/bund/compliance/2026/nist_sp_800_30r1.pdf",
"extra_metadata": {
"regulation_id": "nist_sp_800_30r1",
"license": "public_domain_us",
"source": "nist.gov",
},
},
{
"regulation_id": "cisa_secure_by_design",
"collection": "bp_compliance_ce",
"upload_filename": "cisa_secure_by_design.txt",
"local_txt": None,
"minio_pdf": "compliance/bund/compliance/2026/cisa_secure_by_design.pdf",
"extra_metadata": {
"regulation_id": "cisa_secure_by_design",
"license": "public_domain_us",
"source": "cisa.gov",
},
},
{
"regulation_id": "cvss_v4_0",
"collection": "bp_compliance_ce",
"upload_filename": "cvss_v4_0.txt",
"local_txt": None,
"minio_pdf": "compliance/bund/compliance/2026/cvss_v4_0.pdf",
"extra_metadata": {
"regulation_id": "cvss_v4_0",
"license": "public_domain_us",
"source": "first.org",
},
},
{
"regulation_id": "enisa_ics_scada_dependencies",
"collection": "bp_compliance_ce",
"upload_filename": "enisa_ics_scada.txt",
"local_txt": None,
"minio_pdf": "compliance/bund/compliance/2026/enisa_ics_scada.pdf",
"extra_metadata": {
"regulation_id": "enisa_ics_scada_dependencies",
"license": "reuse_with_attribution",
"source": "enisa.europa.eu",
},
},
{
"regulation_id": "enisa_threat_landscape_supply_chain",
"collection": "bp_compliance_ce",
"upload_filename": "enisa_supply_chain_security.txt",
"local_txt": None,
"minio_pdf": "compliance/bund/compliance/2026/enisa_supply_chain_security.pdf",
"extra_metadata": {
"regulation_id": "enisa_threat_landscape_supply_chain",
"license": "reuse_with_attribution",
"source": "enisa.europa.eu",
},
},
{
"regulation_id": "enisa_supply_chain_good_practices",
"collection": "bp_compliance_ce",
"upload_filename": "enisa_supply_chain_good_practices.txt",
"local_txt": None,
"minio_pdf": "compliance/bund/compliance/2026/enisa_supply_chain_good_practices.pdf",
"extra_metadata": {
"regulation_id": "enisa_supply_chain_good_practices",
"license": "reuse_with_attribution",
"source": "enisa.europa.eu",
},
},
]
def normalize_pdf_text(text):
text = unicodedata.normalize('NFKC', text)
text = text.replace('\u00ad', '').replace('\u200b', '')
prev = None
while prev != text:
prev = text
text = re.sub(r'(\d+)\s+\.\s+(\d+)', r'\1.\2', text)
text = re.sub(r'\b([A-Z]{2,4})\s+-\s+(\d+)\b', r'\1-\2', text)
text = re.sub(
r'\b([A-Z]{2})\s*\.\s*([A-Z]{2})\s*-\s*(\d{2})\b', r'\1.\2-\3', text
)
text = re.sub(r'\(\s+(\d+)\s+\)', r'(\1)', text)
text = re.sub(r'[^\S\n]{2,}', ' ', text)
return text
def get_text(doc):
"""Get document text: from local file or extract from MinIO PDF."""
if doc["local_txt"]:
print(f" Reading local: {doc['local_txt']}")
with open(doc["local_txt"], encoding="utf-8") as f:
return f.read()
print(f" Downloading from MinIO: {doc['minio_pdf']}")
with httpx.Client(timeout=60, verify=False) as c:
resp = c.get(f"{RAG_URL}/api/v1/documents/download/{doc['minio_pdf']}")
resp.raise_for_status()
url = resp.json()["url"]
with httpx.Client(timeout=300, verify=False) as c:
resp = c.get(url)
resp.raise_for_status()  # do not feed an HTTP error body to pdfplumber
pdf_bytes = resp.content
print(f" Downloaded {len(pdf_bytes) / 1024 / 1024:.1f} MB")
print(" Extracting with pdfplumber...")
parts = []
with pdfplumber.open(io.BytesIO(pdf_bytes)) as pdf:
for i, page in enumerate(pdf.pages):
t = page.extract_text(x_tolerance=3, y_tolerance=4)
if t:
parts.append(t)
if (i + 1) % 50 == 0:
print(f" {i + 1}/{len(pdf.pages)} pages...")
text = "\n\n".join(parts)
text = normalize_pdf_text(text)
print(f" Extracted {len(text):,} chars")
return text
def get_old_doc_ids(collection, regulation_id):
"""Get all document_ids for existing chunks."""
doc_ids = set()
offset = None
with httpx.Client(timeout=60) as c:
while True:
body = {
"filter": {"must": [
{"key": "regulation_id", "match": {"value": regulation_id}}
]},
"limit": 100,
"with_payload": ["document_id"],
}
if offset is not None:
body["offset"] = offset
resp = c.post(
f"{QDRANT_URL}/collections/{collection}/points/scroll",
json=body,
)
resp.raise_for_status()
data = resp.json()["result"]
for pt in data["points"]:
did = pt.get("payload", {}).get("document_id")
if did:
doc_ids.add(did)
offset = data.get("next_page_offset")
if offset is None:
break
return doc_ids
def upload_text_legal(text, filename, collection, extra_metadata):
"""Upload with chunk_strategy='legal'."""
form_data = {
"collection": collection,
"data_type": "compliance",
"bundesland": "bund",
"use_case": "compliance",
"year": "2026",
"chunk_strategy": "legal",
"chunk_size": "1500",
"chunk_overlap": "100",
"metadata_json": json.dumps(extra_metadata, ensure_ascii=False),
}
with httpx.Client(timeout=UPLOAD_TIMEOUT, verify=False) as c:
resp = c.post(
f"{RAG_URL}/api/v1/documents/upload",
files={"file": (filename, text.encode("utf-8"), "text/plain")},
data=form_data,
)
resp.raise_for_status()
return resp.json()
def delete_by_doc_ids(collection, doc_ids):
    """Delete chunks matching specific document_ids."""
    with httpx.Client(timeout=30) as c:
        for did in doc_ids:
            c.post(
                f"{QDRANT_URL}/collections/{collection}/points/delete",
                json={"filter": {"must": [
                    {"key": "document_id", "match": {"value": did}}
                ]}},
            ).raise_for_status()


def count_chunks(collection, regulation_id):
    """Return the exact number of chunks stored for a regulation_id."""
    with httpx.Client(timeout=30) as c:
        resp = c.post(
            f"{QDRANT_URL}/collections/{collection}/points/count",
            json={"filter": {"must": [
                {"key": "regulation_id", "match": {"value": regulation_id}}
            ]}, "exact": True},
        )
        resp.raise_for_status()
        return resp.json()["result"]["count"]
def check_section_rate(collection, regulation_id):
    """Return (total, with_section): chunk count and how many carry a non-empty section."""
    total = 0
    with_sec = 0
    offset = None
    with httpx.Client(timeout=60) as c:
        while True:
            body = {
                "filter": {"must": [
                    {"key": "regulation_id", "match": {"value": regulation_id}}
                ]},
                "limit": 100,
                "with_payload": ["section"],
            }
            if offset is not None:
                body["offset"] = offset
            resp = c.post(
                f"{QDRANT_URL}/collections/{collection}/points/scroll",
                json=body,
            )
            resp.raise_for_status()
            data = resp.json()["result"]
            for pt in data["points"]:
                total += 1
                s = pt.get("payload", {}).get("section", "")
                if s and s.strip():
                    with_sec += 1
            offset = data.get("next_page_offset")
            if offset is None:
                break
    return total, with_sec
def main():
parser = argparse.ArgumentParser()
parser.add_argument("--dry-run", action="store_true")
args = parser.parse_args()
print("=" * 60)
print("Re-upload with chunk_strategy='legal'")
print(f"Documents: {len(DOCS)}, Dry run: {args.dry_run}")
print("=" * 60)
results = []
for i, doc in enumerate(DOCS, 1):
reg_id = doc["regulation_id"]
coll = doc["collection"]
print(f"\n[{i}/{len(DOCS)}] {doc['upload_filename']} → {coll}")
# 1. Check existing
old_count = count_chunks(coll, reg_id)
old_doc_ids = get_old_doc_ids(coll, reg_id) if old_count > 0 else set()
print(f" Old: {old_count} chunks, {len(old_doc_ids)} doc_ids")
if args.dry_run:
print(" DRY RUN — skipping")
results.append({"file": doc["upload_filename"], "old": old_count,
"new": "?", "sect": "?"})
continue
# 2. Get text
try:
text = get_text(doc)
except Exception as e:
print(f" ERROR extracting text: {e}")
results.append({"file": doc["upload_filename"], "old": old_count,
"new": 0, "sect": 0})
continue
# 3. Upload with legal strategy
print(" Uploading with strategy='legal'...")
result = upload_text_legal(
text, doc["upload_filename"], coll, doc["extra_metadata"])
new_chunks = result.get("chunks_count", 0)
new_doc_id = result.get("document_id", "")
print(f" New: {new_chunks} chunks (doc_id={new_doc_id})")
if new_chunks == 0:
print(" ERROR: 0 chunks — keeping old!")
results.append({"file": doc["upload_filename"], "old": old_count,
"new": 0, "sect": 0})
continue
# 4. Delete old chunks (safe: new ones already exist)
if old_doc_ids:
# Exclude the new document_id just in case
old_doc_ids.discard(new_doc_id)
if old_doc_ids:
print(f" Deleting {len(old_doc_ids)} old doc_ids...")
delete_by_doc_ids(coll, old_doc_ids)
# 5. Check section rate
total, with_sec = check_section_rate(coll, reg_id)
pct = (with_sec / total * 100) if total > 0 else 0
print(f" Section rate: {with_sec}/{total} ({pct:.0f}%)")
results.append({"file": doc["upload_filename"], "old": old_count,
"new": new_chunks, "sect": round(pct, 1)})
if i < len(DOCS):
time.sleep(2)
# Summary
print("\n" + "=" * 60)
print("RESULTS")
print("=" * 60)
for r in results:
print(f" {r['file']:<45} old={r['old']:<6} new={r['new']:<6} sect={r['sect']}%")
total_new = sum(r["new"] for r in results if isinstance(r["new"], int))
print(f"\nTotal new chunks: {total_new}")
if __name__ == "__main__":
main()
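The scroll loop above appears twice in this script (`get_old_doc_ids` and `check_section_rate`). A minimal sketch of how it could be factored into one generator, under stated assumptions: `scroll_points`, `section_rate`, and the injected `post_scroll` callable are hypothetical names that do not exist in the script, and `fake_post_scroll` is an in-memory stand-in for the Qdrant HTTP call with made-up points.

```python
from typing import Any, Callable, Iterable, Iterator


def scroll_points(
    post_scroll: Callable[[dict], dict],
    regulation_id: str,
    payload_fields: list[str],
    limit: int = 100,
) -> Iterator[dict]:
    """Yield every point for regulation_id, following next_page_offset."""
    offset: Any = None
    while True:
        body: dict = {
            "filter": {"must": [
                {"key": "regulation_id", "match": {"value": regulation_id}}
            ]},
            "limit": limit,
            "with_payload": payload_fields,
        }
        if offset is not None:
            body["offset"] = offset
        # Hypothetical transport: returns the "result" object of a scroll response.
        result = post_scroll(body)
        yield from result["points"]
        offset = result.get("next_page_offset")
        if offset is None:
            break


def section_rate(points: Iterable[dict]) -> tuple[int, int]:
    """Return (total, with_section) over an iterable of points."""
    total = with_sec = 0
    for pt in points:
        total += 1
        section = (pt.get("payload") or {}).get("section") or ""
        if section.strip():
            with_sec += 1
    return total, with_sec


def fake_post_scroll(body: dict) -> dict:
    """In-memory stand-in for the scroll endpoint: two pages of made-up points."""
    pages = {
        None: {"points": [{"payload": {"section": "§ 1"}},
                          {"payload": {"section": " "}}],
               "next_page_offset": "p2"},
        "p2": {"points": [{"payload": {}},
                          {"payload": {"section": "§ 2"}}],
               "next_page_offset": None},
    }
    return pages[body.get("offset")]


total, with_sec = section_rate(scroll_points(fake_post_scroll, "bdsg", ["section"]))
print(f"Section rate: {with_sec}/{total}")
```

Injecting the transport keeps the pagination logic testable without a running Qdrant instance; in the script itself the callable would wrap the existing `c.post(...)` call.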
@@ -0,0 +1,268 @@
#!/usr/bin/env python3
"""
D4 Integration Test: Upload BGB excerpt → verify Qdrant payloads.
Usage:
# Dry-run (local chunking only, no services needed)
python3 scripts/test_d4_integration.py --dry-run
# Against Mac Mini
python3 scripts/test_d4_integration.py \
--rag-url https://macmini:8097 \
--qdrant-url http://macmini:6333
# Against production
python3 scripts/test_d4_integration.py \
--rag-url https://rag-prod:8097 \
--qdrant-url http://qdrant-prod:6333
"""
import argparse
import json
import os
import sys
import time
import httpx
FIXTURE_PATH = os.path.join(
os.path.dirname(__file__), "..", "..", "embedding-service",
"tests", "fixtures", "bgb_312_excerpt.txt",
)
COLLECTION = "bp_compliance_gesetze"
REG_CODE = "BGB_D4_TEST"
# Expected sections in the BGB excerpt
EXPECTED_SECTIONS = {"§ 312", "§ 312a", "§ 312g", "§ 312k"}
def load_fixture() -> str:
with open(FIXTURE_PATH, encoding="utf-8") as f:
return f.read()
def upload_document(rag_url: str, text: str) -> dict:
"""Upload BGB excerpt to RAG service."""
metadata = json.dumps({
"regulation_code": REG_CODE,
"regulation_name_de": "BGB (D4 Test)",
"source_type": "law",
})
with httpx.Client(timeout=60.0, verify=False) as client:
resp = client.post(
f"{rag_url}/api/v1/documents/upload",
files={"file": ("bgb_312_test.txt", text.encode(), "text/plain")},
data={
"collection": COLLECTION,
"data_type": "law",
"bundesland": "bund",
"use_case": "compliance",
"year": "2026",
"chunk_strategy": "recursive",
"chunk_size": "1500",
"chunk_overlap": "100",
"metadata_json": metadata,
},
)
resp.raise_for_status()
return resp.json()
def scroll_chunks(qdrant_url: str, document_id: str) -> list[dict]:
    """Scroll Qdrant for chunks matching this document_id."""
    all_points = []
    offset = None
    with httpx.Client(timeout=30.0) as client:
        while True:
            body: dict = {
                "limit": 100,
                "with_payload": True,
                "with_vector": False,
                "filter": {
                    "must": [{
                        "key": "document_id",
                        "match": {"value": document_id},
                    }]
                },
            }
            # Compare against None so a falsy offset value cannot end the scroll early.
            if offset is not None:
                body["offset"] = offset
            resp = client.post(
                f"{qdrant_url}/collections/{COLLECTION}/points/scroll",
                json=body,
            )
            resp.raise_for_status()
            data = resp.json()["result"]
            all_points.extend(data["points"])
            offset = data.get("next_page_offset")
            if offset is None:
                break
    return all_points
def delete_test_data(qdrant_url: str, document_id: str):
"""Clean up test chunks from Qdrant."""
with httpx.Client(timeout=30.0) as client:
resp = client.post(
f"{qdrant_url}/collections/{COLLECTION}/points/delete",
json={
"filter": {
"must": [{
"key": "document_id",
"match": {"value": document_id},
}]
}
},
)
resp.raise_for_status()
def verify_chunks(points: list[dict]) -> dict:
"""Analyze chunks and return a verification report."""
report = {
"total_chunks": len(points),
"sections_found": set(),
"chunks_with_section": 0,
"chunks_with_paragraph": 0,
"chunks_with_page": 0,
"section_details": [],
"issues": [],
}
for pt in points:
payload = pt.get("payload", {})
section = payload.get("section", "")
section_title = payload.get("section_title", "")
paragraph = payload.get("paragraph", "")
paragraph_num = payload.get("paragraph_num")
page = payload.get("page")
chunk_idx = payload.get("chunk_index", "?")
if section:
report["sections_found"].add(section)
report["chunks_with_section"] += 1
if paragraph:
report["chunks_with_paragraph"] += 1
if page is not None:
report["chunks_with_page"] += 1
report["section_details"].append({
"chunk_index": chunk_idx,
"section": section,
"section_title": section_title[:40],
"paragraph": paragraph,
"paragraph_num": paragraph_num,
"page": page,
"text_preview": payload.get("chunk_text", "")[:60],
})
# Checks
missing = EXPECTED_SECTIONS - report["sections_found"]
if missing:
report["issues"].append(f"Missing sections: {missing}")
if "§ 312k" not in report["sections_found"]:
report["issues"].append("CRITICAL: § 312k not found!")
section_ratio = report["chunks_with_section"] / max(report["total_chunks"], 1)
if section_ratio < 0.9:
report["issues"].append(
f"Only {section_ratio:.0%} chunks have section metadata (expected >= 90%)"
)
return report
def print_report(report: dict):
"""Print verification report."""
print("\n" + "=" * 60)
print("D4 VALIDATION REPORT")
print("=" * 60)
print(f"Total chunks: {report['total_chunks']}")
print(f"With section: {report['chunks_with_section']}")
print(f"With paragraph: {report['chunks_with_paragraph']}")
print(f"With page: {report['chunks_with_page']}")
print(f"Sections found: {sorted(report['sections_found'])}")
print("\nChunk details:")
for d in sorted(report["section_details"], key=lambda x: x["chunk_index"]):
print(
f" [{d['chunk_index']:2}] "
f"section={d['section']!r:12s} "
f"title={d['section_title']!r:30s} "
f"para={d['paragraph']!r:8s}"
)
if report["issues"]:
print(f"\nISSUES ({len(report['issues'])}):")
for issue in report["issues"]:
print(f" - {issue}")
print("\nRESULT: FAIL")
else:
print("\nRESULT: PASS — all sections detected, metadata quality OK")
def main():
parser = argparse.ArgumentParser(description="D4 Integration Test")
parser.add_argument("--rag-url", default="https://macmini:8097")
parser.add_argument("--qdrant-url", default="http://macmini:6333")
parser.add_argument("--dry-run", action="store_true",
help="Only test local chunking, no upload")
parser.add_argument("--keep", action="store_true",
help="Don't delete test data after verification")
args = parser.parse_args()
text = load_fixture()
print(f"Loaded BGB excerpt: {len(text)} chars")
if args.dry_run:
# Import chunking directly
sys.path.insert(0, os.path.join(os.path.dirname(__file__), "..", "..", "embedding-service"))
from main import chunk_text_legal_structured
chunks = chunk_text_legal_structured(text, 1500, 100)
# Build fake points for verification
points = [{"payload": {
"chunk_index": c["index"],
"chunk_text": c["text"],
"section": c["section"],
"section_title": c["section_title"],
"paragraph": c["paragraph"],
"paragraph_num": c["paragraph_num"],
"page": c["page"],
}} for c in chunks]
report = verify_chunks(points)
print_report(report)
sys.exit(1 if report["issues"] else 0)
# Full integration test
print(f"Uploading to {args.rag_url} → collection={COLLECTION}...")
result = upload_document(args.rag_url, text)
doc_id = result["document_id"]
print(f" document_id: {doc_id}")
print(f" chunks_count: {result['chunks_count']}")
print(f" vectors_indexed: {result['vectors_indexed']}")
print("Waiting 2s for indexing...")
time.sleep(2)
print(f"Scrolling Qdrant at {args.qdrant_url}...")
points = scroll_chunks(args.qdrant_url, doc_id)
print(f" Found {len(points)} points")
report = verify_chunks(points)
print_report(report)
if not args.keep:
print(f"\nCleaning up test data (document_id={doc_id})...")
delete_test_data(args.qdrant_url, doc_id)
print(" Deleted.")
sys.exit(1 if report["issues"] else 0)
if __name__ == "__main__":
main()
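The two checks in `verify_chunks` (expected sections present, at least 90% of chunks carrying a section) can be exercised without any service. The sketch below is a simplified stand-in, not the test's code: `coverage_issues` is a hypothetical helper and the section list is made up.

```python
def coverage_issues(
    sections: list[str],
    expected: set[str],
    min_ratio: float = 0.9,
) -> list[str]:
    """Return human-readable issues for a list of per-chunk section values."""
    issues = []
    found = {s for s in sections if s}
    missing = expected - found
    if missing:
        issues.append(f"Missing sections: {sorted(missing)}")
    # max(..., 1) avoids a division by zero when no chunks were produced.
    ratio = sum(1 for s in sections if s) / max(len(sections), 1)
    if ratio < min_ratio:
        issues.append(f"Only {ratio:.0%} chunks have section metadata")
    return issues


# Made-up chunk sections: one chunk lost its section, § 312k never appears.
sample = ["§ 312", "§ 312a", "", "§ 312g"]
issues = coverage_issues(sample, {"§ 312", "§ 312a", "§ 312g", "§ 312k"})
for issue in issues:
    print(issue)
```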
@@ -17,9 +17,6 @@ import httpx
from .control_generator import (
GeneratedControl,
REGULATION_LICENSE_MAP,
_RULE2_PREFIXES,
_RULE3_PREFIXES,
_classify_regulation,
)
+209 -96
@@ -22,6 +22,7 @@ import json
import logging
import time
from collections import defaultdict
from datetime import datetime
from sqlalchemy import text
@@ -108,34 +109,56 @@ class BatchDedupRunner:
self._progress_phase = ""
self._progress_count = 0
self._progress_total = 0
self._since = None # set by run() when scoped run requested
async def run(
self,
dry_run: bool = False,
hint_filter: str = None,
since: datetime = None,
) -> dict:
"""Run the full batch dedup pipeline.
Args:
dry_run: If True, compute stats but don't modify DB/Qdrant.
hint_filter: If set, only process groups matching this hint prefix.
since: If set, only process controls with created_at >= since.
Useful for incremental dedup after single-document ingestion.
Returns:
Stats dict with counts.
"""
start = time.monotonic()
logger.info("BatchDedup starting (dry_run=%s, hint_filter=%s)",
dry_run, hint_filter)
logger.info("BatchDedup starting (dry_run=%s, hint_filter=%s, since=%s)",
dry_run, hint_filter, since)
# Scoped runs reset checkpoint to avoid skipping new controls whose
# control_id sorts before the stale last_id of a previous full run.
self._since = since
if since and not dry_run:
self.db.execute(text(
"DELETE FROM canonical_generation_jobs WHERE status = 'dedup_phase2_checkpoint'"
))
self.db.commit()
if not dry_run:
await ensure_qdrant_collection(collection=self.collection)
# Phase 1: Intra-group dedup (same merge_group_hint)
# Optimization: skip singleton groups (they're automatically masters)
self._progress_phase = "phase1"
groups = self._load_merge_groups(hint_filter)
groups = self._load_merge_groups(hint_filter, since)
self._progress_total = self.stats["total_controls"]
for hint, controls in groups:
multi_groups = [(h, c) for h, c in groups if len(c) > 1]
singleton_count = len(groups) - len(multi_groups)
self.stats["singleton_groups_skipped"] = singleton_count
logger.info(
"BatchDedup Phase 1: %d multi-control groups to process, %d singletons skipped",
len(multi_groups), singleton_count,
)
for hint, controls in multi_groups:
try:
await self._process_hint_group(hint, controls, dry_run)
self.stats["phase1_groups_processed"] += 1
@@ -148,8 +171,8 @@ class BatchDedupRunner:
pass
logger.info(
"BatchDedup Phase 1 done: %d masters, %d linked, %d review",
self.stats["masters"], self.stats["linked"], self.stats["review"],
"BatchDedup Phase 1 done: %d masters, %d linked, %d review (skipped %d singletons)",
self.stats["masters"], self.stats["linked"], self.stats["review"], singleton_count,
)
# Phase 2: Cross-group dedup via embeddings
@@ -162,7 +185,7 @@ class BatchDedupRunner:
logger.info("BatchDedup completed in %.1fs: %s", elapsed, self.stats)
return self.stats
def _load_merge_groups(self, hint_filter: str = None) -> list:
def _load_merge_groups(self, hint_filter: str = None, since: datetime = None) -> list:
"""Load all Pass 0b controls grouped by merge_group_hint, largest first."""
conditions = [
"decomposition_method = 'pass0b'",
@@ -175,6 +198,10 @@ class BatchDedupRunner:
conditions.append("generation_metadata->>'merge_group_hint' LIKE :hf")
params["hf"] = f"{hint_filter}%"
if since:
conditions.append("created_at >= :since")
params["since"] = since
where = " AND ".join(conditions)
rows = self.db.execute(text(f"""
SELECT id::text, control_id, title, objective,
@@ -321,114 +348,200 @@ class BatchDedupRunner:
async def _run_cross_group_pass(self):
"""Phase 2: Find cross-group duplicates among surviving masters.
After Phase 1, ~52k masters remain. Many have similar semantics
despite different merge_group_hints (e.g. different German spellings).
This pass embeds all masters and finds near-duplicates via Qdrant.
Paginated DB queries + individual error handling per control.
Never loads all rows into memory at once.
"""
logger.info("BatchDedup Phase 2: Cross-group pass starting...")
rows = self.db.execute(text("""
SELECT id::text, control_id, title,
generation_metadata->>'merge_group_hint' as merge_group_hint
FROM canonical_controls
# Count total — respect scoped run if since is set
since_clause = " AND created_at >= :since" if self._since else ""
params = {"since": self._since} if self._since else {}
total_row = self.db.execute(text(f"""
SELECT COUNT(*) FROM canonical_controls
WHERE decomposition_method = 'pass0b'
AND release_state != 'duplicate'
AND release_state != 'deprecated'
ORDER BY control_id
""")).fetchall()
AND release_state != 'deprecated'{since_clause}
"""), params).fetchone()
total = total_row[0] if total_row else 0
self._progress_total = len(rows)
self._progress_total = total
self._progress_count = 0
logger.info("BatchDedup Cross-group: %d masters to check", len(rows))
cross_linked = 0
cross_review = 0
# Process in parallel batches for embedding + Qdrant search
PARALLEL_BATCH = 10
# Checkpoint: resume from last processed control_id
DB_PAGE = 100
# Checkpoint: resume from last processed control_id (survives container restart)
checkpoint_row = self.db.execute(text("""
SELECT config FROM canonical_generation_jobs
WHERE status = 'dedup_phase2_checkpoint'
LIMIT 1
""")).fetchone()
last_control_id = checkpoint_row[0] if checkpoint_row else ""
async def _embed_and_search(r):
"""Embed one control and search Qdrant — safe for asyncio.gather."""
hint = r[3] or ""
parts = hint.split(":", 2)
action = parts[0] if len(parts) > 0 else ""
obj = parts[1] if len(parts) > 1 else ""
canonical = canonicalize_text(action, obj, r[2])
embedding = await get_embedding(canonical)
if not embedding:
return None
results = await qdrant_search_cross_regulation(
embedding, top_k=5, collection=self.collection,
)
return (r, results)
if last_control_id:
skip_params = {"last_id": last_control_id}
if self._since:
skip_params["since"] = self._since
skip_row = self.db.execute(text(f"""
SELECT COUNT(*) FROM canonical_controls
WHERE decomposition_method = 'pass0b'
AND release_state != 'duplicate'
AND release_state != 'deprecated'
AND control_id <= :last_id{since_clause}
"""), skip_params).fetchone()
skipped = skip_row[0] if skip_row else 0
self._progress_count = skipped
logger.info("BatchDedup Cross-group: RESUMING from %s (skipping %d already processed)",
last_control_id, skipped)
else:
self.db.execute(text("""
INSERT INTO canonical_generation_jobs (id, status, config)
VALUES (gen_random_uuid(), 'dedup_phase2_checkpoint', '')
"""))
self.db.commit()
for batch_start in range(0, len(rows), PARALLEL_BATCH):
batch = rows[batch_start:batch_start + PARALLEL_BATCH]
tasks = [_embed_and_search(r) for r in batch]
results_batch = await asyncio.gather(*tasks, return_exceptions=True)
logger.info("BatchDedup Cross-group: %d masters to check (starting from %s)",
total, last_control_id or "beginning")
for res in results_batch:
if res is None or isinstance(res, Exception):
if isinstance(res, Exception):
logger.error("BatchDedup embed/search error: %s", res)
while True:
page_params = {"last_id": last_control_id, "page_size": DB_PAGE}
if self._since:
page_params["since"] = self._since
rows = self.db.execute(text(f"""
SELECT id::text, control_id, title,
generation_metadata->>'merge_group_hint' as merge_group_hint
FROM canonical_controls
WHERE decomposition_method = 'pass0b'
AND release_state != 'duplicate'
AND release_state != 'deprecated'
AND control_id > :last_id{since_clause}
ORDER BY control_id
LIMIT :page_size
"""), page_params).fetchall()
if not rows:
break
last_control_id = rows[-1][1]
# Process each control individually (no asyncio.gather — more stable)
for r in rows:
try:
hint = r[3] or ""
parts = hint.split(":", 2)
action = parts[0] if len(parts) > 0 else ""
obj = parts[1] if len(parts) > 1 else ""
canonical = canonicalize_text(action, obj, r[2])
# Timeout per embedding call
try:
embedding = await asyncio.wait_for(
get_embedding(canonical), timeout=30.0
)
except asyncio.TimeoutError:
self.stats["errors"] += 1
continue
r, results = res
ctrl_uuid = r[0]
hint = r[3] or ""
if not results:
continue
for match in results:
match_score = match.get("score", 0.0)
match_payload = match.get("payload", {})
match_uuid = match_payload.get("control_uuid", "")
if match_uuid == ctrl_uuid:
continue
if match_score > LINK_THRESHOLD:
try:
self.db.execute(text("""
UPDATE canonical_controls
SET release_state = 'duplicate', merged_into_uuid = CAST(:master AS uuid)
WHERE id = CAST(:dup AS uuid)
AND release_state != 'duplicate'
"""), {"master": match_uuid, "dup": ctrl_uuid})
if not embedding:
continue
self.db.execute(text("""
INSERT INTO control_parent_links
(control_uuid, parent_control_uuid, link_type, confidence)
VALUES (CAST(:cu AS uuid), CAST(:pu AS uuid), 'cross_regulation', :conf)
ON CONFLICT (control_uuid, parent_control_uuid) DO NOTHING
"""), {"cu": match_uuid, "pu": ctrl_uuid, "conf": match_score})
transferred = self._transfer_parent_links(match_uuid, ctrl_uuid)
self.stats["parent_links_transferred"] += transferred
self.db.commit()
cross_linked += 1
except Exception as e:
logger.error("BatchDedup cross-group link error %s→%s: %s",
ctrl_uuid, match_uuid, e)
self.db.rollback()
self.stats["errors"] += 1
break
elif match_score > REVIEW_THRESHOLD:
self._write_review(
{"control_id": r[1], "title": r[2], "objective": "",
"merge_group_hint": hint, "pattern_id": None},
match_payload, match_score,
try:
results = await asyncio.wait_for(
qdrant_search_cross_regulation(
embedding, top_k=5, collection=self.collection,
), timeout=30.0
)
cross_review += 1
break
except asyncio.TimeoutError:
self.stats["errors"] += 1
continue
processed = min(batch_start + PARALLEL_BATCH, len(rows))
self._progress_count = processed
if processed % 500 < PARALLEL_BATCH:
logger.info("BatchDedup Cross-group: %d/%d checked, %d linked, %d review",
processed, len(rows), cross_linked, cross_review)
ctrl_uuid = r[0]
for match in (results or []):
match_score = match.get("score", 0.0)
match_payload = match.get("payload", {})
match_uuid = match_payload.get("control_uuid", "")
if match_uuid == ctrl_uuid:
continue
if match_score > LINK_THRESHOLD:
try:
self.db.execute(text("""
UPDATE canonical_controls
SET release_state = 'duplicate', merged_into_uuid = CAST(:master AS uuid)
WHERE id = CAST(:dup AS uuid)
AND release_state != 'duplicate'
"""), {"master": match_uuid, "dup": ctrl_uuid})
self.db.execute(text("""
INSERT INTO control_parent_links
(control_uuid, parent_control_uuid, link_type, confidence)
VALUES (CAST(:cu AS uuid), CAST(:pu AS uuid), 'cross_regulation', :conf)
ON CONFLICT (control_uuid, parent_control_uuid) DO NOTHING
"""), {"cu": match_uuid, "pu": ctrl_uuid, "conf": match_score})
transferred = self._transfer_parent_links(match_uuid, ctrl_uuid)
self.stats["parent_links_transferred"] += transferred
self.db.commit()
cross_linked += 1
except Exception as e:
logger.error("BatchDedup cross-group link error %s→%s: %s",
ctrl_uuid, match_uuid, e)
try:
self.db.rollback()
except Exception:
pass
self.stats["errors"] += 1
break
elif match_score > REVIEW_THRESHOLD:
self._write_review(
{"control_id": r[1], "title": r[2], "objective": "",
"merge_group_hint": hint, "pattern_id": None},
match_payload, match_score,
)
cross_review += 1
break
except Exception as e:
logger.error("BatchDedup cross-group control %s error: %s", r[1], e)
self.stats["errors"] += 1
try:
self.db.rollback()
except Exception:
pass
self._progress_count += 1
# Save checkpoint + log progress every page
try:
self.db.execute(text("""
UPDATE canonical_generation_jobs
SET config = :cid
WHERE status = 'dedup_phase2_checkpoint'
"""), {"cid": last_control_id})
self.db.commit()
except Exception:
try:
self.db.rollback()
except Exception:
pass
processed = self._progress_count
if processed % 500 < DB_PAGE:
logger.info("BatchDedup Cross-group: %d/%d checked, %d linked, %d review (checkpoint: %s)",
processed, total, cross_linked, cross_review, last_control_id)
# Clear checkpoint on completion
try:
self.db.execute(text("""
DELETE FROM canonical_generation_jobs
WHERE status = 'dedup_phase2_checkpoint'
"""))
self.db.commit()
except Exception:
pass
self.stats["cross_group_linked"] = cross_linked
self.stats["cross_group_review"] = cross_review
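The new Phase 2 loop combines keyset pagination (`control_id > :last_id ... LIMIT :page_size`) with a checkpoint written after every page, so a restarted run resumes instead of starting over. A minimal sketch of that pattern using an in-memory SQLite database follows. The `controls` and `checkpoint` tables, `process_in_pages`, and the handler are hypothetical stand-ins; the real code keeps the checkpoint in `canonical_generation_jobs.config` and uses SQLAlchemy.

```python
import sqlite3
from typing import Callable


def process_in_pages(
    conn: sqlite3.Connection,
    handle: Callable[[str], None],
    page_size: int = 2,
) -> int:
    """Keyset-paginate controls by control_id, checkpointing after each page."""
    row = conn.execute("SELECT last_id FROM checkpoint LIMIT 1").fetchone()
    last_id = row[0] if row else ""
    if row is None:
        conn.execute("INSERT INTO checkpoint (last_id) VALUES ('')")
        conn.commit()
    errors = 0
    while True:
        rows = conn.execute(
            "SELECT control_id FROM controls "
            "WHERE control_id > ? ORDER BY control_id LIMIT ?",
            (last_id, page_size),
        ).fetchall()
        if not rows:
            break
        for (control_id,) in rows:
            try:
                handle(control_id)
            except Exception:
                errors += 1  # one failing control does not abort the page
        last_id = rows[-1][0]
        conn.execute("UPDATE checkpoint SET last_id = ?", (last_id,))
        conn.commit()
    # A completed run clears the checkpoint so the next run starts fresh.
    conn.execute("DELETE FROM checkpoint")
    conn.commit()
    return errors


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE controls (control_id TEXT PRIMARY KEY)")
conn.execute("CREATE TABLE checkpoint (last_id TEXT)")
conn.executemany(
    "INSERT INTO controls VALUES (?)", [(f"C-{i:03d}",) for i in range(1, 6)]
)
# Simulate a previous run that stopped after C-002.
conn.execute("INSERT INTO checkpoint VALUES ('C-002')")
conn.commit()

seen: list[str] = []
errors = process_in_pages(conn, seen.append)
print(seen, errors)
```

Because the page query filters on `control_id > last_id`, a stale checkpoint would hide any newer control whose id sorts before it, which is the case the scoped `since` run guards against by deleting the checkpoint first.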
+90 -111
@@ -7,7 +7,6 @@ Citation Backfill Service — enrich existing controls with article/paragraph pr
Tier 3 Ollama LLM: ask local LLM to identify article/paragraph from text
"""
import hashlib
import json
import logging
import os
@@ -28,12 +27,13 @@ OLLAMA_URL = os.getenv("OLLAMA_URL", "http://host.docker.internal:11434")
OLLAMA_MODEL = os.getenv("CONTROL_GEN_OLLAMA_MODEL", "qwen3.5:35b-a3b")
LLM_TIMEOUT = float(os.getenv("CONTROL_GEN_LLM_TIMEOUT", "180"))
ALL_COLLECTIONS = [
"bp_compliance_ce",
# Tier-1 semantic re-link: min cosine for a source_original_text → chunk match.
EMBED_THRESHOLD = float(os.getenv("CITATION_EMBED_THRESHOLD", "0.80"))
# Collections that carry re-ingested, article_label-bearing chunks.
RELINK_COLLECTIONS = [
"bp_compliance_gesetze",
"bp_compliance_datenschutz",
"bp_dsfa_corpus",
"bp_legal_templates",
"bp_compliance_ce",
]
BACKFILL_SYSTEM_PROMPT = (
@@ -51,13 +51,14 @@ _SOURCE_ARTICLE_RE = re.compile(
class MatchResult:
article: str
paragraph: str
method: str # "hash", "regex", "llm"
method: str # "embed", "regex", "llm"
source: str = "" # regulation short/name (embed tier sets the cleaned source)
@dataclass
class BackfillResult:
total_controls: int = 0
matched_hash: int = 0
matched_embed: int = 0
matched_regex: int = 0
matched_llm: int = 0
unmatched: int = 0
@@ -71,7 +72,6 @@ class CitationBackfill:
def __init__(self, db: Session, rag_client: ComplianceRAGClient):
self.db = db
self.rag = rag_client
self._rag_index: dict[str, RAGSearchResult] = {}
async def run(self, dry_run: bool = True, limit: int = 0) -> BackfillResult:
"""Main entry: iterate controls missing article/paragraph, match to RAG, update."""
@@ -85,20 +85,10 @@ class CitationBackfill:
if not controls:
return result
# Collect hashes we need to find — only build index for controls with source text
needed_hashes: set[str] = set()
for ctrl in controls:
src = ctrl.get("source_original_text")
if src:
needed_hashes.add(hashlib.sha256(src.encode()).hexdigest())
if needed_hashes:
# Build targeted RAG index — only scroll collections that our controls reference
logger.info("Building targeted RAG hash index for %d source texts...", len(needed_hashes))
await self._build_rag_index_targeted(controls)
logger.info("RAG index built: %d chunks indexed, %d hashes needed", len(self._rag_index), len(needed_hashes))
else:
logger.info("No source_original_text found — skipping RAG index build")
# Tier-1 = per-control semantic search against the re-ingested, labeled chunks.
# (The old sha256(chunk.text) hash index died with re-chunking and is gone.)
with_source = sum(1 for c in controls if c.get("source_original_text"))
logger.info("Embedding-relink candidates (with source_original_text): %d", with_source)
# Process each control
for i, ctrl in enumerate(controls):
@@ -108,8 +98,8 @@ class CitationBackfill:
try:
match = await self._match_control(ctrl)
if match:
if match.method == "hash":
result.matched_hash += 1
if match.method == "embed":
result.matched_embed += 1
elif match.method == "regex":
result.matched_regex += 1
elif match.method == "llm":
@@ -139,8 +129,8 @@ class CitationBackfill:
result.errors.append(f"Commit failed: {e}")
logger.info(
"Backfill complete: %d total, hash=%d regex=%d llm=%d unmatched=%d updated=%d",
result.total_controls, result.matched_hash, result.matched_regex,
"Backfill complete: %d total, embed=%d regex=%d llm=%d unmatched=%d updated=%d",
result.total_controls, result.matched_embed, result.matched_regex,
result.matched_llm, result.unmatched, result.updated,
)
return result
@@ -178,93 +168,13 @@ class CitationBackfill:
controls.append(ctrl)
return controls
async def _build_rag_index_targeted(self, controls: list[dict]):
"""Build RAG index by scrolling only collections relevant to our controls.
Uses regulation codes from generation_metadata to identify which collections
to search, falling back to all collections only if needed.
"""
# Determine which collections are relevant based on regulation codes
regulation_to_collection = self._map_regulations_to_collections(controls)
collections_to_search = set(regulation_to_collection.values()) or set(ALL_COLLECTIONS)
logger.info("Targeted index: searching %d collections: %s",
len(collections_to_search), ", ".join(collections_to_search))
for collection in collections_to_search:
offset = None
page = 0
seen_offsets: set[str] = set()
while True:
chunks, next_offset = await self.rag.scroll(
collection=collection, offset=offset, limit=200,
)
if not chunks:
break
for chunk in chunks:
if chunk.text and len(chunk.text.strip()) >= 50:
h = hashlib.sha256(chunk.text.encode()).hexdigest()
self._rag_index[h] = chunk
page += 1
if page % 50 == 0:
logger.info("Indexing %s: page %d (%d chunks so far)",
collection, page, len(self._rag_index))
if not next_offset:
break
if next_offset in seen_offsets:
logger.warning("Scroll loop in %s at page %d — stopping", collection, page)
break
seen_offsets.add(next_offset)
offset = next_offset
logger.info("Indexed collection %s: %d pages", collection, page)
def _map_regulations_to_collections(self, controls: list[dict]) -> dict[str, str]:
"""Map regulation codes from controls to likely Qdrant collections."""
# Heuristic: regulation code prefix → collection
collection_map = {
"eu_": "bp_compliance_gesetze",
"dsgvo": "bp_compliance_datenschutz",
"bdsg": "bp_compliance_gesetze",
"ttdsg": "bp_compliance_gesetze",
"nist_": "bp_compliance_ce",
"owasp": "bp_compliance_ce",
"bsi_": "bp_compliance_ce",
"enisa": "bp_compliance_ce",
"at_": "bp_compliance_recht",
"fr_": "bp_compliance_recht",
"es_": "bp_compliance_recht",
}
result: dict[str, str] = {}
for ctrl in controls:
meta = ctrl.get("generation_metadata") or {}
reg = meta.get("source_regulation", "")
if not reg:
continue
for prefix, coll in collection_map.items():
if reg.startswith(prefix):
result[reg] = coll
break
else:
# Unknown regulation — search all
for coll in ALL_COLLECTIONS:
result[f"_all_{coll}"] = coll
return result
async def _match_control(self, ctrl: dict) -> Optional[MatchResult]:
"""3-tier matching: hash → regex → LLM."""
# Tier 1: Hash match against RAG index
source_text = ctrl.get("source_original_text")
if source_text:
h = hashlib.sha256(source_text.encode()).hexdigest()
chunk = self._rag_index.get(h)
if chunk and (chunk.article or chunk.paragraph):
return MatchResult(
article=chunk.article or "",
paragraph=chunk.paragraph or "",
method="hash",
)
# Tier 1: Semantic search against the re-ingested, labeled chunks
embed = await self._embedding_match(ctrl)
if embed:
return embed
# Tier 2: Regex parse concatenated source
citation = ctrl.get("source_citation") or {}
@@ -278,11 +188,60 @@ class CitationBackfill:
)
# Tier 3: Ollama LLM
if source_text:
if ctrl.get("source_original_text"):
return await self._llm_match(ctrl)
return None
    async def _embedding_match(self, ctrl: dict) -> Optional[MatchResult]:
        """Tier 1: semantic-search source_original_text against the labeled chunks.

        Takes the best-scoring top hit across the candidate collections and turns
        its article_label into a precise citation. Returns None if that hit scores
        below EMBED_THRESHOLD or carries no real article.
        """
        source_text = ctrl.get("source_original_text")
        if not source_text:
            return None
        query = source_text.strip()[:512]
        best: Optional[RAGSearchResult] = None
        for collection in self._collections_for(ctrl):
            try:
                hits = await self.rag.search(query, collection=collection, top_k=3)
            except Exception as e:
                logger.debug("embed search failed (%s): %s", collection, e)
                hits = []
            if hits and (best is None or hits[0].score > best.score):
                best = hits[0]
        if not best or best.score < EMBED_THRESHOLD:
            return None
        article = _article_part(best)
        if not article:
            return None
        return MatchResult(
            article=article,
            paragraph=best.paragraph or "",
            method="embed",
            source=best.regulation_short or best.regulation_name or "",
        )
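The selection rule in `_embedding_match` (keep the highest-scoring first hit across collections, then apply the threshold) can be sketched synchronously. This is an illustration under assumptions, not the service code: `Hit`, `best_hit`, and `fake_search` are hypothetical, the scores are made up, and each result list is assumed to be sorted by descending score, as the use of `hits[0]` in the diff implies.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Hit:
    score: float
    article_label: str


def best_hit(
    search: Callable[[str, str], list[Hit]],
    query: str,
    collections: list[str],
    threshold: float = 0.80,
) -> Optional[Hit]:
    """Return the best first hit across collections, or None below threshold."""
    best: Optional[Hit] = None
    for collection in collections:
        try:
            hits = search(query, collection)
        except Exception:
            hits = []  # a failing collection is skipped, not fatal
        if hits and (best is None or hits[0].score > best.score):
            best = hits[0]
    if best is None or best.score < threshold:
        return None
    return best


RESULTS = {
    "a": [Hit(0.75, "Art. 5 DSGVO")],
    "b": [Hit(0.91, "BDSG § 38"), Hit(0.60, "BDSG § 1")],
}


def fake_search(query: str, collection: str) -> list[Hit]:
    """Stand-in for the RAG client: canned results, one broken collection."""
    if collection == "broken":
        raise RuntimeError("unavailable")
    return RESULTS.get(collection, [])


match = best_hit(fake_search, "some source text", ["a", "broken", "b"])
strict = best_hit(fake_search, "some source text", ["a", "broken", "b"], threshold=0.95)
print(match, strict)
```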
    def _collections_for(self, ctrl: dict) -> list[str]:
        """Likely collection(s) for a control's regulation; falls back to all three."""
        meta = ctrl.get("generation_metadata") or {}
        reg = (meta.get("source_regulation") or "").lower()
        prefix_map = {
            "eu_": "bp_compliance_gesetze", "bdsg": "bp_compliance_gesetze",
            "de_": "bp_compliance_gesetze", "at_": "bp_compliance_gesetze",
            "ch_": "bp_compliance_gesetze", "dsgvo": "bp_compliance_gesetze",
            "trgs": "bp_compliance_ce", "trbs": "bp_compliance_ce", "asr": "bp_compliance_ce",
            "nist": "bp_compliance_ce", "owasp": "bp_compliance_ce", "enisa": "bp_compliance_ce",
            "edpb": "bp_compliance_datenschutz", "dsk": "bp_compliance_datenschutz",
            "bfdi": "bp_compliance_datenschutz",
        }
        for prefix, coll in prefix_map.items():
            if reg.startswith(prefix):
                return [coll]
        return list(RELINK_COLLECTIONS)
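A standalone sketch of the prefix routing, to show what callers can expect for known and unknown regulation codes. `collections_for` and the shortened `PREFIX_MAP` are illustrative stand-ins; the regulation codes in the usage lines are made up.

```python
from typing import Optional

RELINK_COLLECTIONS = [
    "bp_compliance_gesetze",
    "bp_compliance_datenschutz",
    "bp_compliance_ce",
]

# Shortened copy of the prefix map, for illustration only.
PREFIX_MAP = {
    "eu_": "bp_compliance_gesetze",
    "bdsg": "bp_compliance_gesetze",
    "nist": "bp_compliance_ce",
    "edpb": "bp_compliance_datenschutz",
}


def collections_for(source_regulation: Optional[str]) -> list[str]:
    """Route a regulation code to one collection, or to all of them if unknown."""
    reg = (source_regulation or "").lower()
    for prefix, coll in PREFIX_MAP.items():
        if reg.startswith(prefix):
            return [coll]
    # Return a copy so callers cannot mutate the module-level list.
    return list(RELINK_COLLECTIONS)


print(collections_for("BDSG"))
print(collections_for("iso_27001"))
```

Unknown or missing codes cost one search per collection, so the fallback trades extra queries for coverage.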
async def _llm_match(self, ctrl: dict) -> Optional[MatchResult]:
"""Use Ollama to identify article/paragraph from source text."""
citation = ctrl.get("source_citation") or {}
@@ -331,6 +290,9 @@ Bei deutschen Gesetzen mit § verwende: "§ XX" statt "Art. XX"."""
if parsed:
citation["source"] = parsed["name"]
# Embed tier carries the cleaned regulation name → prefer it as source.
if match.source:
citation["source"] = match.source
# Add separate article/paragraph fields
citation["article"] = match.article
citation["paragraph"] = match.paragraph
@@ -359,6 +321,23 @@ Bei deutschen Gesetzen mit § verwende: "§ XX" statt "Art. XX"."""
)
def _article_part(chunk: RAGSearchResult) -> str:
"""Precise article from a chunk: article_label minus the regulation name.
'BDSG § 38' -> '§ 38'; 'Art. 39 DSGVO' -> 'Art. 39'; 'NIST SP 800-53r5 SA-12' -> 'SA-12'.
Falls back to the bare article field. Returns '' if only a doc-level name is present.
"""
label = (chunk.article_label or "").strip()
reg = (chunk.regulation_short or "").strip()
if label:
part = label
if reg and reg in label:
part = label.replace(reg, "").strip(" ,;-")
if part and part != reg:
return part
return (chunk.article or "").strip()
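The label-stripping rule in `_article_part` can be tried in isolation. The following is a minimal sketch, assuming a stand-in `Chunk` dataclass (hypothetical, carrying only the three fields the function reads) in place of `RAGSearchResult`; the body follows the logic shown in the hunk above.

```python
from dataclasses import dataclass


@dataclass
class Chunk:
    """Hypothetical stand-in for RAGSearchResult (only the fields used here)."""
    article_label: str = ""
    regulation_short: str = ""
    article: str = ""


def article_part(chunk: Chunk) -> str:
    """Return article_label minus the regulation name, else the bare article."""
    label = (chunk.article_label or "").strip()
    reg = (chunk.regulation_short or "").strip()
    if label:
        part = label
        if reg and reg in label:
            # Drop the regulation name, then trim leftover separators.
            part = label.replace(reg, "").strip(" ,;-")
        if part and part != reg:
            return part
    # Nothing usable in the label: fall back to the plain article field.
    return (chunk.article or "").strip()
```

A label that consists only of the regulation name yields an empty string here, which is what lets the embed tier reject doc-level matches.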
def _parse_concatenated_source(source: str) -> Optional[dict]:
"""Parse 'DSGVO Art. 35' -> {name: 'DSGVO', article: 'Art. 35'}.
+25 -6
@@ -126,22 +126,29 @@ _ACTION_SYNONYMS: dict[str, str] = {
def normalize_action(action: str) -> str:
"""Normalize an action verb to a canonical English form."""
"""Normalize an action verb to a canonical English form.
Delegates to DB-backed OntologyRegistry with dict fallback.
"""
try:
from .ontology_registry import get_ontology_registry
return get_ontology_registry().normalize_action(action)
except Exception:
pass
# Fallback: original logic
if not action:
return ""
action = action.strip().lower()
# Strip German infinitive/conjugation suffixes for lookup
action_base = re.sub(r"(en|t|st|e|te|tet|end)$", "", action)
# Try exact match first, then base form
if action in _ACTION_SYNONYMS:
return _ACTION_SYNONYMS[action]
if action_base in _ACTION_SYNONYMS:
return _ACTION_SYNONYMS[action_base]
# Fuzzy: check if action starts with any known verb
for verb, canonical in _ACTION_SYNONYMS.items():
if action.startswith(verb) or verb.startswith(action):
return canonical
return action # fallback: return as-is
return action
# ── Object Normalization ─────────────────────────────────────────────
@@ -237,7 +244,19 @@ _OBJECT_KEYS_SORTED = sorted(_OBJECT_SYNONYMS.keys(), key=len, reverse=True)
def normalize_object(obj: str) -> str:
"""Normalize a compliance object to a canonical token."""
"""Normalize a compliance object to a canonical token.
Delegates to DB-backed OntologyRegistry with dict fallback.
"""
# Try DB-backed registry first
try:
from .ontology_registry import get_ontology_registry
result = get_ontology_registry().normalize_object(obj)
if result != obj.strip().lower():
return result
except Exception:
pass
if not obj:
return ""
obj_lower = obj.strip().lower()
+26 -26
@@ -25,8 +25,7 @@ import re
import uuid
from collections import defaultdict
from dataclasses import dataclass, field, asdict
from datetime import datetime, timezone
from typing import Dict, List, Optional, Set
from typing import Dict, List, Optional
import httpx
from pydantic import BaseModel
@@ -34,7 +33,8 @@ from sqlalchemy import text
from sqlalchemy.orm import Session
from .rag_client import ComplianceRAGClient, RAGSearchResult, get_rag_client
from .similarity_detector import check_similarity, SimilarityReport
from .regulation_registry import get_registry as _get_regulation_registry
from .similarity_detector import check_similarity
logger = logging.getLogger(__name__)
@@ -246,28 +246,21 @@ def _classify_regulation(regulation_code: str) -> dict:
Returns dict with keys: license, rule, name, source_type.
source_type is one of: law, guideline, standard, restricted.
Delegates to DB-backed RegulationRegistry (with 5min cache).
Falls back to REGULATION_LICENSE_MAP if DB is unavailable.
"""
code = regulation_code.lower().strip()
registry = _get_regulation_registry()
result = registry.classify_regulation(regulation_code)
# Exact match first
if code in REGULATION_LICENSE_MAP:
return REGULATION_LICENSE_MAP[code]
# If registry returned the unknown fallback AND we have a local match,
# prefer the local dict (graceful degradation during migration)
if result.get("license") == "UNKNOWN":
code = regulation_code.lower().strip()
if code in REGULATION_LICENSE_MAP:
return REGULATION_LICENSE_MAP[code]
# Prefix match for Rule 2 (ENISA = standard)
for prefix in _RULE2_PREFIXES:
if code.startswith(prefix):
return {"license": "CC-BY-4.0", "rule": 2, "source_type": "standard",
"name": "ENISA", "attribution": "ENISA, CC BY 4.0"}
# Prefix match for Rule 3 (BSI/ISO/ETSI = restricted)
for prefix in _RULE3_PREFIXES:
if code.startswith(prefix):
return {"license": f"{prefix.rstrip('_').upper()}_RESTRICTED", "rule": 3,
"source_type": "restricted", "name": "INTERNAL_ONLY"}
# Unknown → treat as restricted (safe default)
logger.warning("Unknown regulation_code %r — defaulting to Rule 3 (restricted)", code)
return {"license": "UNKNOWN", "rule": 3, "source_type": "restricted", "name": "INTERNAL_ONLY"}
return result
# ---------------------------------------------------------------------------
@@ -1019,11 +1012,12 @@ class ControlGeneratorPipeline:
regulation_name=reg_name,
regulation_short=reg_short,
category=payload.get("category", "") or payload.get("data_type", ""),
article=payload.get("article", "") or payload.get("section_title", "") or payload.get("section", ""),
article=payload.get("section", "") or payload.get("article", "") or payload.get("section_title", ""),
paragraph=payload.get("paragraph", ""),
source_url=payload.get("source_url", "") or payload.get("source", "") or payload.get("url", ""),
score=0.0,
collection=collection,
page=payload.get("page"),
)
all_results.append(chunk)
collection_new += 1
@@ -1127,6 +1121,7 @@ Quelle: {chunk.regulation_name} ({chunk.regulation_code}), {chunk.article}"""
"source": canonical_source,
"article": effective_article,
"paragraph": effective_paragraph,
"page": chunk.page,
"license": license_info.get("license", ""),
"source_type": license_info.get("source_type", "law"),
"url": chunk.source_url or "",
@@ -1141,6 +1136,7 @@ Quelle: {chunk.regulation_name} ({chunk.regulation_code}), {chunk.article}"""
"source_regulation": chunk.regulation_code,
"source_article": effective_article,
"source_paragraph": effective_paragraph,
"source_page": chunk.page,
}
return control
@@ -1194,6 +1190,7 @@ Quelle: {chunk.regulation_name}, {chunk.article}"""
"source": canonical_source,
"article": effective_article,
"paragraph": effective_paragraph,
"page": chunk.page,
"license": license_info.get("license", ""),
"license_notice": attribution,
"source_type": license_info.get("source_type", "standard"),
@@ -1209,6 +1206,7 @@ Quelle: {chunk.regulation_name}, {chunk.article}"""
"source_regulation": chunk.regulation_code,
"source_article": effective_article,
"source_paragraph": effective_paragraph,
"source_page": chunk.page,
}
return control
@@ -1368,6 +1366,7 @@ Gib ein JSON-Array zurueck mit GENAU {len(chunks)} Elementen. Fuer Chunks ohne A
"source": canonical_source,
"article": effective_article,
"paragraph": effective_paragraph,
"page": chunk.page,
"license": lic.get("license", ""),
"license_notice": lic.get("attribution", ""),
"source_type": lic.get("source_type", "law"),
@@ -1384,6 +1383,7 @@ Gib ein JSON-Array zurueck mit GENAU {len(chunks)} Elementen. Fuer Chunks ohne A
"source_regulation": chunk.regulation_code,
"source_article": effective_article,
"source_paragraph": effective_paragraph,
"source_page": chunk.page,
"batch_size": len(chunks),
"document_grouped": same_doc,
}
@@ -1479,14 +1479,14 @@ Gib ein JSON-Array zurueck mit GENAU {len(chunks)} Elementen. Fuer Aspekte ohne
) -> list[Optional[GeneratedControl]]:
"""Process a batch of (chunk, license_info) through stages 3-5."""
# Split by license rule: Rule 1+2 → structure, Rule 3 → reform
structure_items = [(c, l) for c, l in batch_items if l["rule"] in (1, 2)]
reform_items = [(c, l) for c, l in batch_items if l["rule"] == 3]
structure_items = [(c, lic) for c, lic in batch_items if lic["rule"] in (1, 2)]
reform_items = [(c, lic) for c, lic in batch_items if lic["rule"] == 3]
all_controls: dict[int, Optional[GeneratedControl]] = {}
if structure_items:
s_chunks = [c for c, _ in structure_items]
s_lics = [l for _, l in structure_items]
s_lics = [lic for _, lic in structure_items]
try:
s_controls = await self._structure_batch(s_chunks, s_lics)
except Exception as e:
+22 -10
@@ -223,31 +223,43 @@ _FRAMEWORK_PATTERNS: list[str] = [
def classify_action(text: str) -> str:
"""Classify an obligation action text into a canonical action_type."""
text_lower = text.lower().strip()
"""Classify an obligation action text into a canonical action_type.
# Check negative patterns first
Delegates to DB-backed OntologyRegistry (with 5min cache).
Falls back to hardcoded dicts if DB is unavailable.
"""
try:
from .ontology_registry import get_ontology_registry
return get_ontology_registry().classify_action(text)
except Exception:
pass
# Fallback: original logic
text_lower = text.lower().strip()
for pattern, action_type in _NEGATIVE_PATTERNS:
if pattern in text_lower:
return action_type
# Direct alias match
if text_lower in _ALIAS_TO_ACTION:
return _ALIAS_TO_ACTION[text_lower]
# Substring match (longest first)
best_match = ""
best_action = "implement" # default fallback
best_action = "implement"
for alias, action_type in sorted(_ALIAS_TO_ACTION.items(), key=lambda x: -len(x[0])):
if alias in text_lower and len(alias) > len(best_match):
best_match = alias
best_action = action_type
return best_action
def get_phase(action_type: str) -> str:
"""Get the control_phase for an action_type."""
"""Get the control_phase for an action_type.
Delegates to DB-backed OntologyRegistry with dict fallback.
"""
try:
from .ontology_registry import get_ontology_registry
return get_ontology_registry().get_phase(action_type)
except Exception:
pass
info = ACTION_TYPES.get(action_type, {})
return info.get("phase", "implementation")
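The matching order in `classify_action` (negative patterns, then exact alias, then longest substring alias, then the `implement` default) can be sketched as follows. The two tables are hypothetical miniatures, not the real `_NEGATIVE_PATTERNS` or `_ALIAS_TO_ACTION` contents, and the action name `prevent` is assumed for illustration only.

```python
# Hypothetical miniature tables; the real ones live in the ontology module / DB.
NEGATIVE_PATTERNS = [("nicht speichern", "prevent")]
ALIAS_TO_ACTION = {
    "test": "test",
    "penetration test": "test",
    "dokumentieren": "document",
}


def classify_action(text: str) -> str:
    """Classify an obligation text into an action type (sketch)."""
    text_lower = text.lower().strip()
    # 1. Negative patterns win over everything else.
    for pattern, action_type in NEGATIVE_PATTERNS:
        if pattern in text_lower:
            return action_type
    # 2. Exact alias match.
    if text_lower in ALIAS_TO_ACTION:
        return ALIAS_TO_ACTION[text_lower]
    # 3. Longest alias contained in the text; default is "implement".
    best_match, best_action = "", "implement"
    for alias, action_type in ALIAS_TO_ACTION.items():
        if alias in text_lower and len(alias) > len(best_match):
            best_match, best_action = alias, action_type
    return best_action
```

The sketch omits the sort by alias length, because the `len(alias) > len(best_match)` comparison already selects the longest matching alias regardless of iteration order.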
+123 -9
@@ -24,7 +24,6 @@ import json
import logging
import os
import re
import uuid
from dataclasses import dataclass, field
from typing import Optional
@@ -56,7 +55,7 @@ ANTHROPIC_API_URL = "https://api.anthropic.com/v1"
# Patterns are defined in normative_patterns.py and imported here
# with local aliases for backward compatibility.
from .normative_patterns import (
from .normative_patterns import ( # noqa: E402
PFLICHT_RE as _PFLICHT_RE,
EMPFEHLUNG_RE as _EMPFEHLUNG_RE,
KANN_RE as _KANN_RE,
@@ -461,12 +460,50 @@ WICHTIGE REGELN:
7. MERGE-KEY: Erzeuge im JSON-Output ein zusaetzliches Feld "merge_key" mit
dem Format: "action_type:normalized_object:control_phase"
WICHTIG: Waehle normalized_object NUR aus dieser Liste kanonischer Tokens:
SECURITY: multi_factor_auth, password_policy, credentials, session_management,
privileged_access, access_control, encryption, transport_encryption,
key_management, certificate_management, network_security, network_segmentation,
firewall, vpn, remote_access, monitoring, audit_logging, siem, alerting,
compliance_audit, vulnerability, patch_management, backup, disaster_recovery,
physical_security, secure_development, api_security, input_validation,
container_security, logging_configuration
DATA_PROTECTION: personal_data, sensitive_data, health_data, consent,
data_subject_rights, data_retention, data_transfer, data_breach_notification,
dpia, data_processing_agreement, privacy_by_design, data_processing_register,
data_classification, cookie_consent, video_surveillance
GOVERNANCE: policy, procedure, process, training, awareness, incident,
risk_management, third_party_management, change_management, documentation,
records_management, compliance_reporting, asset_management,
human_resources_security
REGULATORY: supervisory_authority, certification, product_safety, ai_system,
financial_reporting, aml, whistleblowing, consumer_protection, ecommerce,
telecommunications, medical_device, payment_services, critical_infrastructure,
supply_chain_due_diligence, sustainability_reporting
Wenn KEIN Token passt: "OTHER:kurzbeschreibung" (z.B. "OTHER:battery_recycling")
ABGRENZUNGEN (haeufige Fehler vermeiden!):
- monitoring = NUR kontinuierliche Echtzeit-Ueberwachung von Systemen
- audit_logging = Protokollierung, Audit Trail, Nachvollziehbarkeit
- compliance_audit = externe Pruefungen, Zertifizierungsaudits
- training = Schulungen DURCHFUEHREN (nicht "ueberwachen")
- procedure = Verfahren DEFINIEREN (nicht Incident-Behandlung)
- incident = Sicherheitsvorfaelle BEHANDELN
- alerting = Meldepflichten und Benachrichtigungen
- personal_data = DSGVO-Verarbeitungsgrundsaetze (nicht Zertifizierung!)
- certification = Zertifizierung/Konformitaet (nicht Datenschutz)
Beispiele:
- "implement:api_rate_limiting:implementation"
- "define:access_control_policy:definition"
- "monitor:third_party_vulnerabilities:monitoring"
- "test:authentication_mechanism:testing"
- "implement:multi_factor_auth:implementation"
- "define:access_control:definition"
- "monitor:network_security:monitoring"
- "test:vulnerability:testing"
- "report:supervisory_authority:reporting"
- "implement:audit_logging:implementation" (NICHT monitoring!)
- "define:incident:definition" (Incident-Verfahren, NICHT procedure!)
- "train:training:operation" (Schulung, NICHT monitoring!)
8. APPLICABILITY + SCANNER: Bestimme fuer jedes Control:
- applicability: Unter welchen Bedingungen gilt dieses Control?
@@ -2473,6 +2510,81 @@ def _ensure_list(val) -> list:
return []
# Canonical object tokens from object_ontology (loaded once)
_CANONICAL_OBJECTS: set[str] | None = None
def _load_canonical_objects() -> set[str]:
"""Load canonical tokens from DB, fallback to hardcoded set."""
global _CANONICAL_OBJECTS
if _CANONICAL_OBJECTS is not None:
return _CANONICAL_OBJECTS
try:
from db.session import get_engine
from sqlalchemy import text
engine = get_engine()
with engine.connect() as c:
rows = c.execute(text(
"SELECT canonical_token FROM compliance.object_ontology"
)).fetchall()
_CANONICAL_OBJECTS = {r[0] for r in rows}
except Exception:
_CANONICAL_OBJECTS = set()
if not _CANONICAL_OBJECTS:
_CANONICAL_OBJECTS = {
"multi_factor_auth", "password_policy", "credentials",
"session_management", "privileged_access", "access_control",
"encryption", "transport_encryption", "key_management",
"certificate_management", "network_security",
"network_segmentation", "firewall", "vpn", "remote_access",
"monitoring", "audit_logging", "siem", "alerting",
"compliance_audit", "vulnerability", "patch_management",
"backup", "disaster_recovery", "personal_data",
"sensitive_data", "consent", "data_subject_rights",
"data_retention", "data_transfer", "data_breach_notification",
"dpia", "data_processing_agreement", "privacy_by_design",
"policy", "procedure", "process", "training", "awareness",
"incident", "risk_management", "third_party_management",
"change_management", "documentation", "supervisory_authority",
"certification", "product_safety", "ai_system", "aml",
"critical_infrastructure", "medical_device",
}
return _CANONICAL_OBJECTS
def _validate_merge_key(merge_key: str) -> str:
"""Validate merge_key object against canonical ontology.
Returns the merge_key (possibly corrected). Logs warnings for
unknown objects so they can be tracked.
"""
parts = merge_key.split(":", 2)
if len(parts) < 2:
return merge_key
action, obj = parts[0], parts[1]
phase = parts[2] if len(parts) > 2 else "implementation"
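# Accept OTHER prefix (LLM signaling unknown object). After split(":", 2) the
# object part of "action:OTHER:desc:phase" is the bare token "OTHER".
if obj == "OTHER" or obj.startswith("OTHER:"):
return merge_key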
# Check against canonical ontology
canonical = _load_canonical_objects()
if obj in canonical:
return merge_key
# Try normalize_object() as fallback
from services.control_dedup import normalize_object
normed = normalize_object(obj)
if normed in canonical:
return f"{action}:{normed}:{phase}"
# Unknown object — log and keep as-is (will be clustered by embedding)
logger.debug("merge_key unknown object: %s (normed: %s)", obj, normed)
return merge_key
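To make the validation path concrete, here is a minimal, self-contained sketch. `CANONICAL` and `SYNONYMS` are hypothetical miniatures standing in for the ontology table and for `normalize_object()`. The sketch treats a bare `OTHER` object token as the escape hatch, since splitting `action:OTHER:desc:phase` on `:` separates `OTHER` from its description.

```python
# Hypothetical, tiny stand-in for the canonical token set.
CANONICAL = {"access_control", "audit_logging", "multi_factor_auth"}

# Hypothetical synonym table standing in for normalize_object().
SYNONYMS = {"mfa": "multi_factor_auth", "audit trail": "audit_logging"}


def normalize_object(obj: str) -> str:
    key = obj.strip().lower()
    return SYNONYMS.get(key, key)


def validate_merge_key(merge_key: str) -> str:
    """Return merge_key, with the object swapped for its canonical token if known."""
    parts = merge_key.split(":", 2)
    if len(parts) < 2:
        return merge_key
    action, obj = parts[0], parts[1]
    phase = parts[2] if len(parts) > 2 else "implementation"
    # "action:OTHER:desc:phase" splits into obj == "OTHER"; keep such keys as-is.
    if obj == "OTHER" or obj.startswith("OTHER:"):
        return merge_key
    if obj in CANONICAL:
        return merge_key
    normed = normalize_object(obj)
    if normed in CANONICAL:
        return f"{action}:{normed}:{phase}"
    # Unknown object: leave untouched so a later stage can cluster it.
    return merge_key
```

Note that a key without a phase gains the default `implementation` phase only when its object is rewritten; keys that pass through unchanged keep their original shape.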
# ---------------------------------------------------------------------------
# Decomposition Pass
# ---------------------------------------------------------------------------
@@ -3026,10 +3138,10 @@ class DecompositionPass:
evidence_type=parsed.get("evidence_type", ""),
provides_context=_ensure_list(parsed.get("provides_context", [])),
)
# Store merge_key from LLM output in metadata
# Store merge_key from LLM output in metadata — with validation
llm_merge_key = parsed.get("merge_key", "")
if llm_merge_key:
atomic.merge_group_hint = llm_merge_key
atomic.merge_group_hint = _validate_merge_key(llm_merge_key)
atomic.parent_control_uuid = obl["parent_uuid"]
atomic.obligation_candidate_id = obl["candidate_id"]
@@ -3472,7 +3584,7 @@ class DecompositionPass:
"category": atomic.category,
"parent_uuid": parent_uuid,
"gen_meta": json.dumps({
"decomposition_source": candidate_id,
"decomposition_source_id": candidate_id,
"decomposition_method": "pass0b",
"engine_version": "v2",
"action_object_class": getattr(atomic, "domain", ""),
@@ -4104,6 +4216,8 @@ def _format_citation(citation) -> str:
parts.append(c["article"])
if c.get("paragraph"):
parts.append(c["paragraph"])
if c.get("page") is not None:
parts.append(f"S. {c['page']}")
return " ".join(parts) if parts else citation
except (json.JSONDecodeError, TypeError):
return citation
@@ -0,0 +1,84 @@
"""Shared embedding + sub-clustering utilities for the control pipeline."""
import logging
import os
from collections import defaultdict
import httpx
import numpy as np
from sklearn.cluster import MiniBatchKMeans
logger = logging.getLogger(__name__)
EMBEDDING_URL = os.getenv(
"EMBEDDING_SERVICE_URL", "http://embedding-service:8087"
)
def embed_texts(texts: list[str]) -> np.ndarray | None:
"""Embed texts via the embedding-service in batches of 64."""
try:
result = np.zeros((len(texts), 1024), dtype=np.float32)
batch_size = 64
for i in range(0, len(texts), batch_size):
batch = texts[i : i + batch_size]
for attempt in range(3):
try:
with httpx.Client(
timeout=httpx.Timeout(60.0, connect=10.0)
) as client:
resp = client.post(
f"{EMBEDDING_URL}/embed", json={"texts": batch}
)
resp.raise_for_status()
embs = resp.json().get("embeddings", [])
end = min(i + len(embs), len(texts))
result[i:end] = np.array(embs, dtype=np.float32)
break
except Exception as e:
if attempt == 2:
logger.error("Embed batch %d failed: %s", i, e)
import time
time.sleep(2)
return result
except Exception as e:
logger.error("Embedding failed: %s", e)
return None
def subcluster_controls(
controls: list[dict], target_size: int = 50
) -> list[list[dict]]:
"""Sub-cluster controls by embedding similarity.
Returns a list of clusters. Falls back to naive chunking
if embedding fails.
"""
if len(controls) <= target_size:
return [controls]
texts = [c.get("title", "") or c.get("control_id", "") for c in controls]
embeddings = embed_texts(texts)
if embeddings is None:
return [
controls[i : i + target_size]
for i in range(0, len(controls), target_size)
]
norms = np.linalg.norm(embeddings, axis=1, keepdims=True)
norms[norms == 0] = 1
normalized = embeddings / norms
k = max(2, min(len(controls) // target_size, 30))
kmeans = MiniBatchKMeans(
n_clusters=k,
batch_size=min(100, len(controls)),
max_iter=50,
random_state=42,
)
labels = kmeans.fit_predict(normalized)
clusters: dict[int, list[dict]] = defaultdict(list)
for i, ctrl in enumerate(controls):
clusters[int(labels[i])].append(ctrl)
return list(clusters.values())
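Two pieces of `subcluster_controls` are plain arithmetic and easy to check without the embedding service: the cluster count and the naive fallback. The helper names below are hypothetical; the formulas are copied from the file above.

```python
def cluster_count(n_controls: int, target_size: int = 50) -> int:
    """Number of k-means clusters: n // target_size, clamped to [2, 30]."""
    return max(2, min(n_controls // target_size, 30))


def naive_chunks(items: list, target_size: int = 50) -> list[list]:
    """Fallback when embedding fails: consecutive slices of target_size."""
    return [items[i : i + target_size] for i in range(0, len(items), target_size)]


if __name__ == "__main__":
    for n in (51, 500, 5000):
        print(n, "controls ->", cluster_count(n), "clusters")
    print([len(c) for c in naive_chunks(list(range(120)))])
```

Because of the cap at 30 clusters, and because k-means does not balance cluster sizes, individual clusters can be considerably larger than `target_size`; only the fallback path guarantees the size limit.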
@@ -0,0 +1,217 @@
"""
DB-backed Action & Object Ontology Registry with in-memory cache.
Replaces hardcoded ACTION_TYPES, _NEGATIVE_PATTERNS, _ACTION_SYNONYMS,
and _OBJECT_SYNONYMS with PostgreSQL tables.
Cache TTL: 5 minutes. Thread-safe via simple timestamp check.
Falls back to hardcoded dicts if DB is unavailable.
"""
import logging
import re
import time
from typing import Optional
from sqlalchemy import text
from sqlalchemy.exc import SQLAlchemyError
from db.session import SessionLocal
logger = logging.getLogger(__name__)
_CACHE_TTL_SECONDS = 300 # 5 minutes
class OntologyRegistry:
"""In-memory cache of action_types, action_synonyms, and object_synonyms."""
def __init__(self):
# Action types: canonical_name → phase
self._action_phases: dict[str, str] = {}
# Alias → canonical action (for classify_action)
self._alias_to_action: dict[str, str] = {}
# Negative patterns: [(pattern, action_type)] ordered longest first
self._negative_patterns: list[tuple[str, str]] = []
# Action synonyms for dedup: synonym → canonical (for normalize_action)
self._action_synonyms: dict[str, str] = {}
# Object synonyms: synonym → canonical_token (for normalize_object)
self._object_synonyms: dict[str, str] = {}
# Sorted object keys (longest first) for substring matching
self._object_keys_sorted: list[str] = []
self._loaded_at: float = 0.0
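def _is_stale(self) -> bool:
# _loaded_at == 0.0 means "never loaded". time.monotonic() has an undefined
# reference point, so the subtraction alone cannot be relied on for the first load.
return self._loaded_at == 0.0 or (time.monotonic() - self._loaded_at) > _CACHE_TTL_SECONDS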
def _load(self) -> bool:
"""Load all ontology data from DB into memory."""
try:
db = SessionLocal()
try:
return self._load_from_db(db)
finally:
db.close()
except SQLAlchemyError:
logger.warning(
"Failed to load ontology from DB — using stale cache",
exc_info=True,
)
return False
def _load_from_db(self, db) -> bool:
"""Load from DB session."""
# 1. Action types
rows = db.execute(text(
"SELECT canonical_name, phase FROM action_types"
)).fetchall()
action_phases = {r[0]: r[1] for r in rows}
# 2. Action synonyms (aliases + negative patterns)
rows = db.execute(text(
"SELECT canonical_action, synonym, pattern_type FROM action_synonyms"
)).fetchall()
alias_to_action: dict[str, str] = {}
negative_patterns: list[tuple[str, str]] = []
action_synonyms: dict[str, str] = {}
for canonical, synonym, ptype in rows:
if ptype == "negative_pattern":
negative_patterns.append((synonym, canonical))
else:
alias_to_action[synonym] = canonical
action_synonyms[synonym] = canonical
# Sort negative patterns: longest first (for priority matching)
negative_patterns.sort(key=lambda x: -len(x[0]))
# 3. Object synonyms
rows = db.execute(text(
"SELECT canonical_token, synonym FROM object_synonyms"
)).fetchall()
object_synonyms = {r[1]: r[0] for r in rows}
object_keys_sorted = sorted(object_synonyms.keys(), key=len, reverse=True)
# Commit to cache
self._action_phases = action_phases
self._alias_to_action = alias_to_action
self._negative_patterns = negative_patterns
self._action_synonyms = action_synonyms
self._object_synonyms = object_synonyms
self._object_keys_sorted = object_keys_sorted
self._loaded_at = time.monotonic()
logger.info(
"Ontology loaded: %d action_types, %d aliases, %d neg_patterns, %d object_synonyms",
len(action_phases), len(alias_to_action),
len(negative_patterns), len(object_synonyms),
)
return True
@property
def is_loaded(self) -> bool:
"""True if the cache has any data."""
return len(self._action_phases) > 0
def _ensure_loaded(self) -> None:
if self._is_stale():
self._load()
if not self.is_loaded:
raise RuntimeError("OntologyRegistry has no data")
# ── Action Classification (replaces control_ontology.classify_action) ──
def classify_action(self, text_input: str) -> str:
"""Classify text into a canonical action_type."""
self._ensure_loaded()
text_lower = text_input.lower().strip()
# Check negative patterns first
for pattern, action_type in self._negative_patterns:
if pattern in text_lower:
return action_type
# Direct alias match
if text_lower in self._alias_to_action:
return self._alias_to_action[text_lower]
# Substring match (longest first)
best_match = ""
best_action = "implement"
for alias, action_type in sorted(
self._alias_to_action.items(), key=lambda x: -len(x[0])
):
if alias in text_lower and len(alias) > len(best_match):
best_match = alias
best_action = action_type
return best_action
def get_phase(self, action_type: str) -> str:
"""Get the control_phase for an action_type."""
self._ensure_loaded()
return self._action_phases.get(action_type, "implementation")
# ── Action Normalization (replaces control_dedup.normalize_action) ──
def normalize_action(self, action: str) -> str:
"""Normalize an action verb to a canonical English form."""
self._ensure_loaded()
if not action:
return ""
action = action.strip().lower()
action_base = re.sub(r"(en|t|st|e|te|tet|end)$", "", action)
if action in self._action_synonyms:
return self._action_synonyms[action]
if action_base in self._action_synonyms:
return self._action_synonyms[action_base]
for verb, canonical in self._action_synonyms.items():
if action.startswith(verb) or verb.startswith(action):
return canonical
return action
# ── Object Normalization (replaces control_dedup.normalize_object) ──
def normalize_object(self, obj: str) -> str:
"""Normalize an object to a canonical token."""
self._ensure_loaded()
if not obj:
return ""
obj_lower = obj.strip().lower()
# Exact match
if obj_lower in self._object_synonyms:
return self._object_synonyms[obj_lower]
# Substring match (longest phrase first)
for phrase in self._object_keys_sorted:
if phrase in obj_lower:
return self._object_synonyms[phrase]
return obj_lower
def get_action_types(self) -> dict[str, str]:
"""Return all action_type → phase mappings."""
self._ensure_loaded()
return dict(self._action_phases)
def get_object_synonyms(self) -> dict[str, str]:
"""Return all object synonym → canonical mappings."""
self._ensure_loaded()
return dict(self._object_synonyms)
# Module-level singleton
_registry: Optional[OntologyRegistry] = None
def get_ontology_registry() -> OntologyRegistry:
"""Get or create the singleton OntologyRegistry instance."""
global _registry
if _registry is None:
_registry = OntologyRegistry()
return _registry
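The reload-on-stale pattern used by the registry can be sketched independently of the database. `TTLCache` below is a hypothetical helper, not part of this change; it takes an injectable clock so the expiry logic can be tested, and it uses `None` rather than `0.0` as the "never loaded" marker because `time.monotonic()` only guarantees meaningful differences between two readings.

```python
import time
from typing import Callable, Optional


class TTLCache:
    """Reload-on-stale cache keyed on a monotonic timestamp (hypothetical helper)."""

    def __init__(
        self,
        loader: Callable[[], dict],
        ttl_seconds: float = 300.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._loader = loader
        self._ttl = ttl_seconds
        self._clock = clock
        self._data: dict = {}
        self._loaded_at: Optional[float] = None  # None = never loaded

    def _is_stale(self) -> bool:
        if self._loaded_at is None:
            return True
        return (self._clock() - self._loaded_at) > self._ttl

    def get(self) -> dict:
        if self._is_stale():
            try:
                self._data = self._loader()
                self._loaded_at = self._clock()
            except Exception:
                # Reload failed: keep serving the previous snapshot, if any.
                pass
        if not self._data:
            raise RuntimeError("cache has no data")
        return self._data
```

As in the registry, a failed reload does not advance the timestamp, so the next call retries immediately; a production variant might add a short back-off to avoid hitting an unavailable database on every lookup.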
+4
@@ -33,7 +33,9 @@ class RAGSearchResult:
paragraph: str
source_url: str
score: float
article_label: str = ""
collection: str = ""
page: Optional[int] = None
class ComplianceRAGClient:
@@ -89,6 +91,7 @@ class ComplianceRAGClient:
regulation_short=r.get("regulation_short", ""),
category=r.get("category", ""),
article=r.get("article", ""),
article_label=r.get("article_label", ""),
paragraph=r.get("paragraph", ""),
source_url=r.get("source_url", ""),
score=r.get("score", 0.0),
@@ -170,6 +173,7 @@ class ComplianceRAGClient:
regulation_short=r.get("regulation_short", ""),
category=r.get("category", ""),
article=r.get("article", ""),
article_label=r.get("article_label", ""),
paragraph=r.get("paragraph", ""),
source_url=r.get("source_url", ""),
score=0.0,
@@ -0,0 +1,220 @@
"""
DB-backed Regulation Registry with in-memory cache.
Replaces hardcoded REGULATION_LICENSE_MAP and SOURCE_REGULATION_CLASSIFICATION
with a single PostgreSQL table (compliance.regulation_registry).
Cache TTL: 5 minutes. Thread-safe via simple timestamp check.
Falls back to hardcoded dicts if DB is unavailable (graceful degradation).
"""
import logging
import time
from typing import Optional
from sqlalchemy import text
from sqlalchemy.exc import SQLAlchemyError
from db.session import SessionLocal
logger = logging.getLogger(__name__)
_CACHE_TTL_SECONDS = 300 # 5 minutes
# Prefix-based fallback rules (unchanged from original logic)
_RULE2_PREFIXES = ("enisa_",)
_RULE3_PREFIXES = ("bsi_", "iso_", "etsi_")
# Fallback for unknown regulations
_UNKNOWN_REGULATION = {
"license": "UNKNOWN",
"rule": 3,
"source_type": "restricted",
"name": "INTERNAL_ONLY",
"attribution": None,
}
class RegulationRegistry:
"""In-memory cache of the regulation_registry table.
Provides two lookup modes:
1. by_code(regulation_id) replaces REGULATION_LICENSE_MAP[code]
2. source_type_by_name(name) replaces SOURCE_REGULATION_CLASSIFICATION[name]
"""
def __init__(self):
self._by_code: dict[str, dict] = {}
self._by_name: dict[str, str] = {}
self._loaded_at: float = 0.0
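def _is_stale(self) -> bool:
# _loaded_at == 0.0 means "never loaded". time.monotonic() has an undefined
# reference point, so the subtraction alone cannot be relied on for the first load.
return self._loaded_at == 0.0 or (time.monotonic() - self._loaded_at) > _CACHE_TTL_SECONDS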
def _load(self) -> bool:
"""Load all rows from regulation_registry into memory."""
try:
db = SessionLocal()
try:
rows = db.execute(
text("""
SELECT regulation_id, regulation_name_de, license_rule,
license_type, attribution, source_type, jurisdiction,
status
FROM regulation_registry
WHERE status != 'deprecated'
""")
).fetchall()
finally:
db.close()
by_code: dict[str, dict] = {}
by_name: dict[str, str] = {}
for row in rows:
entry = {
"license": row[3] or "", # license_type
"rule": row[2], # license_rule
"source_type": row[5] or "law", # source_type
"name": row[1] or row[0], # regulation_name_de or regulation_id
"attribution": row[4], # attribution
"jurisdiction": row[6], # jurisdiction
}
by_code[row[0].lower()] = entry
# Also index by name for source_type lookups
if row[1]:
by_name[row[1]] = row[5] or "law"
self._by_code = by_code
self._by_name = by_name
self._loaded_at = time.monotonic()
logger.info(
"Regulation registry loaded: %d entries by code, %d by name",
len(by_code), len(by_name),
)
return True
except SQLAlchemyError:
logger.warning(
"Failed to load regulation_registry from DB — using stale cache",
exc_info=True,
)
return False
def _ensure_loaded(self) -> None:
"""Reload cache if stale."""
if self._is_stale():
self._load()
def classify_regulation(self, regulation_code: str) -> dict:
"""Look up license info for a regulation_code.
Returns dict with keys: license, rule, name, source_type, attribution.
Equivalent to the old _classify_regulation() function.
"""
self._ensure_loaded()
code = regulation_code.lower().strip()
# Exact match from DB
if code in self._by_code:
return self._by_code[code]
# Prefix match for Rule 2 (ENISA = standard)
for prefix in _RULE2_PREFIXES:
if code.startswith(prefix):
return {
"license": "CC-BY-4.0",
"rule": 2,
"source_type": "standard",
"name": "ENISA",
"attribution": "ENISA, CC BY 4.0",
}
# Prefix match for Rule 3 (BSI/ISO/ETSI = restricted)
for prefix in _RULE3_PREFIXES:
if code.startswith(prefix):
return {
"license": f"{prefix.rstrip('_').upper()}_RESTRICTED",
"rule": 3,
"source_type": "restricted",
"name": "INTERNAL_ONLY",
"attribution": None,
}
# Unknown → restricted (safe default)
logger.warning(
"Unknown regulation_code %r — defaulting to Rule 3 (restricted)", code
)
return dict(_UNKNOWN_REGULATION)
def source_type_by_name(self, source_regulation: str) -> str:
"""Look up source_type by regulation display name.
Equivalent to old classify_source_regulation().
Falls back to heuristic for unknown names.
"""
self._ensure_loaded()
if not source_regulation:
return "framework"
# Exact match from DB
if source_regulation in self._by_name:
return self._by_name[source_regulation]
# Heuristic fallback for unknown sources
lower = source_regulation.lower()
law_indicators = [
"verordnung", "richtlinie", "gesetz", "directive", "regulation",
"(eu)", "(eg)", "act", "ley", "loi", "törvény", "código",
]
if any(ind in lower for ind in law_indicators):
return "law"
guideline_indicators = [
"edpb", "leitlinie", "guideline", "wp2", "bsi", "empfehlung",
]
if any(ind in lower for ind in guideline_indicators):
return "guideline"
framework_indicators = [
"enisa", "nist", "owasp", "oecd", "cisa", "framework", "iso",
]
if any(ind in lower for ind in framework_indicators):
return "framework"
return "framework"
def get_all(self) -> dict[str, dict]:
"""Return all cached entries (by regulation_code)."""
self._ensure_loaded()
return dict(self._by_code)
def is_open_source(self, regulation_code: str) -> bool:
"""Check if regulation is Rule 1 or 2 (safe to reference)."""
info = self.classify_regulation(regulation_code)
return info["rule"] in (1, 2)
# Module-level singleton
_registry: Optional[RegulationRegistry] = None
def get_registry() -> RegulationRegistry:
"""Get or create the singleton RegulationRegistry instance."""
global _registry
if _registry is None:
_registry = RegulationRegistry()
return _registry
def classify_regulation(regulation_code: str) -> dict:
"""Convenience: look up license info for a regulation_code."""
return get_registry().classify_regulation(regulation_code)
def classify_source_regulation(source_regulation: str) -> str:
"""Convenience: look up source_type by regulation display name."""
return get_registry().source_type_by_name(source_regulation)
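The lookup order in `classify_regulation` (exact code, then Rule 2 prefix, then Rule 3 prefix, then the restricted default) can be exercised without a database. In this sketch `BY_CODE` is a hypothetical one-entry stand-in for the cached table, and its code and license values are invented for illustration; the prefix tuples and returned dict shapes follow the file above.

```python
RULE2_PREFIXES = ("enisa_",)
RULE3_PREFIXES = ("bsi_", "iso_", "etsi_")

# Hypothetical exact-match entry standing in for the DB-backed cache.
BY_CODE = {
    "eu_2016_679": {"license": "EU_LAW", "rule": 1, "source_type": "law", "name": "DSGVO"},
}


def classify(regulation_code: str) -> dict:
    """Resolve license info: exact match, prefix rules, then restricted default."""
    code = regulation_code.lower().strip()
    if code in BY_CODE:
        return BY_CODE[code]
    for prefix in RULE2_PREFIXES:
        if code.startswith(prefix):
            return {"license": "CC-BY-4.0", "rule": 2,
                    "source_type": "standard", "name": "ENISA"}
    for prefix in RULE3_PREFIXES:
        if code.startswith(prefix):
            return {"license": f"{prefix.rstrip('_').upper()}_RESTRICTED", "rule": 3,
                    "source_type": "restricted", "name": "INTERNAL_ONLY"}
    # Unknown codes are treated as restricted, the safe default.
    return {"license": "UNKNOWN", "rule": 3,
            "source_type": "restricted", "name": "INTERNAL_ONLY"}


def is_open_source(regulation_code: str) -> bool:
    """Rule 1 and Rule 2 sources are safe to reference."""
    return classify(regulation_code)["rule"] in (1, 2)
```

Defaulting unknown codes to Rule 3 means a typo in a regulation code degrades to "restricted" rather than accidentally exposing licensed text.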
@@ -0,0 +1,318 @@
# Adversarial Test Suite: 30 tricky cases that challenge the controls/agent
version: "1.0"
purpose: "Testen ob Controls und Agent bei grenzwertigen Formulierungen korrekt entscheiden"
tests:
# A. Wrong legal basis (plausible-sounding): 8 cases
- id: ADV-LIT-001
category: wrong_legal_basis
input: "Wir verarbeiten Ihre Daten fuer Webanalyse auf Grundlage unseres berechtigten Interesses (Art. 6 Abs. 1 lit. f DSGVO)."
context: "DSE-Abschnitt ueber Google Analytics"
expected:
finding: true
finding_type: wrong_legal_basis
correct_basis: "Art. 6 Abs. 1 lit. a (Einwilligung)"
reason: "Analytics erfordert Einwilligung, nicht berechtigtes Interesse (EuGH C-673/17 Planet49)"
difficulty: medium
- id: ADV-LIT-002
category: wrong_legal_basis
input: "Der Versand unseres Newsletters erfolgt auf Grundlage des Vertrages (Art. 6 Abs. 1 lit. b DSGVO)."
context: "DSE-Abschnitt ueber Marketing"
expected:
finding: true
finding_type: wrong_legal_basis
correct_basis: "Art. 6 Abs. 1 lit. a (Einwilligung)"
reason: "Newsletter ist kein Vertragsbestandteil, erfordert separate Einwilligung"
difficulty: medium
- id: ADV-LIT-003
category: wrong_legal_basis
input: "Die Ueberwachung der Arbeitsleistung unserer Mitarbeiter erfolgt auf Grundlage unseres berechtigten Interesses."
context: "Interne Datenschutzrichtlinie"
expected:
finding: true
finding_type: wrong_legal_basis
correct_basis: "Betriebsvereinbarung + Art. 88 DSGVO i.V.m. § 26 BDSG"
reason: "Mitarbeiterueberwachung erfordert Betriebsvereinbarung (BAG Keylogger-Urteil)"
difficulty: hard
- id: ADV-LIT-004
category: wrong_legal_basis
input: "Biometrische Zutrittskontrolle auf Basis von Art. 6 Abs. 1 lit. f DSGVO."
context: "Sicherheitskonzept"
expected:
finding: true
finding_type: wrong_legal_basis
correct_basis: "Art. 9 Abs. 2 DSGVO (ausdrueckliche Einwilligung oder Arbeitsrecht)"
reason: "Biometrische Daten = besondere Kategorie nach Art. 9, lit. f reicht nicht"
difficulty: hard
- id: ADV-LIT-005
category: wrong_legal_basis
input: "Wir erstellen automatisierte Kreditentscheidungen auf Grundlage berechtigter Interessen."
context: "DSE einer Bank"
expected:
finding: true
finding_type: wrong_legal_basis
correct_basis: "Art. 22 DSGVO (ausdrueckliche Einwilligung oder gesetzliche Erlaubnis)"
reason: "Automatisierte Einzelentscheidungen erfordern Art. 22 Schutz (EuGH SCHUFA C-634/21)"
difficulty: hard
- id: ADV-LIT-006
category: wrong_legal_basis
input: "Social Login ueber Google wird als Vertragsdurchfuehrung (lit. b) verarbeitet."
context: "DSE mit Social Login"
expected:
finding: true
finding_type: wrong_legal_basis
correct_basis: "Art. 6 Abs. 1 lit. a (Einwilligung)"
reason: "Social Login ist keine Vertragspflicht, Nutzer kann sich auch ohne Google anmelden"
difficulty: medium
- id: ADV-LIT-007
category: wrong_legal_basis
input: "Personalisierte Werbung basiert auf unserem berechtigten Interesse an Direktmarketing."
context: "DSE eines marktbeherrschenden Unternehmens"
expected:
finding: true
finding_type: wrong_legal_basis
correct_basis: "Art. 6 Abs. 1 lit. a (Einwilligung)"
reason: "Marktbeherrschende Unternehmen koennen sich nicht auf lit. f fuer Werbung berufen (EuGH Meta C-252/21)"
difficulty: hard
- id: ADV-LIT-008
category: wrong_legal_basis
input: "Die Einbindung von Facebook Pixel erfolgt zur Vertragserfuellung (Art. 6 Abs. 1 lit. b DSGVO)."
context: "DSE eines Online-Shops"
expected:
finding: true
finding_type: wrong_legal_basis
correct_basis: "Art. 6 Abs. 1 lit. a (Einwilligung)"
reason: "Facebook Pixel dient Tracking/Marketing, nicht der Vertragserfuellung"
difficulty: easy
# B. Dark patterns (subtle): 6 cases
- id: ADV-DP-001
category: dark_pattern
input:
accept_button: {text: "Alle akzeptieren", size: "16px", color: "#ffffff", background: "#0066cc", prominent: true}
reject_button: {text: "Ablehnen", size: "10px", color: "#cccccc", background: "transparent", prominent: false}
expected:
finding: true
finding_type: dark_pattern_visual_bias
reason: "Ablehnen-Button ist kleiner, weniger sichtbar (OLG Koeln 6 U 58/21)"
difficulty: easy
- id: ADV-DP-002
category: dark_pattern
input:
accept_button: {text: "Alle akzeptieren", clicks_to_complete: 1}
reject_option: {text: "Einstellungen verwalten", clicks_to_complete: 3, label: "Einstellungen"}
expected:
finding: true
finding_type: dark_pattern_friction_asymmetry
reason: "Ablehnen erfordert 3 Klicks, Akzeptieren nur 1 (CNIL Cookie-Banner)"
difficulty: medium
- id: ADV-DP-003
category: dark_pattern
input:
type: "cookie_wall"
description: "Inhalt erst nach Cookie-Zustimmung sichtbar"
expected:
finding: true
finding_type: dark_pattern_cookie_wall
reason: "Cookie-Wall = keine freiwillige Einwilligung (EDPB Guidelines 05/2020)"
difficulty: medium
- id: ADV-DP-004
category: dark_pattern
input:
type: "prechecked_boxes"
description: "Checkboxen fuer Marketing und Analytics sind vorausgefuellt"
expected:
finding: true
finding_type: dark_pattern_prechecked
reason: "Vorausgefuellte Checkboxen sind keine wirksame Einwilligung (BGH Planet49)"
difficulty: easy
- id: ADV-DP-005
category: dark_pattern
input:
type: "confirm_shaming"
accept_text: "Ja, ich moechte sicher surfen"
reject_text: "Nein, ich verzichte auf Sicherheit"
expected:
finding: true
finding_type: dark_pattern_confirm_shaming
reason: "Manipulative Formulierung beeinflusst Entscheidung"
difficulty: medium
- id: ADV-DP-006
category: dark_pattern
input:
type: "hidden_reject"
description: "Ablehnen-Link ist 3px gross, Farbe #f0f0f0 auf weissem Hintergrund"
expected:
finding: true
finding_type: dark_pattern_hidden_option
reason: "Ablehnen-Option praktisch unsichtbar (OLG Koeln)"
difficulty: easy
# C. Almost-complete documents: 6 cases
- id: ADV-DOC-001
category: incomplete_document
input: "Impressum: Max Mustermann GmbH, Musterstr. 1, 10115 Berlin, info@example.com, HRB 12345"
expected:
finding: true
finding_type: missing_field
missing: "USt-ID"
reason: "§ 5 Abs. 1 Nr. 6 DDG: USt-IdNr. oder Wirtschafts-ID Pflicht"
difficulty: easy
- id: ADV-DOC-002
category: incomplete_document
input: "Datenschutzerklaerung mit Zwecken, Rechtsgrundlagen, Empfaengern, Betroffenenrechten — aber ohne Speicherdauer"
expected:
finding: true
finding_type: missing_field
missing: "Speicherdauer"
reason: "Art. 13 Abs. 2 lit. a DSGVO: Dauer der Speicherung oder Kriterien"
difficulty: medium
- id: ADV-DOC-003
category: incomplete_document
input: "DSE ohne Kontaktdaten des Datenschutzbeauftragten"
expected:
finding: true
finding_type: missing_field
missing: "DSB-Kontakt"
reason: "Art. 13 Abs. 1 lit. b DSGVO: Kontaktdaten des DSB"
difficulty: easy
- id: ADV-DOC-004
category: incomplete_document
input: "Widerrufsbelehrung mit 14-Tage-Frist, Muster-Formular, aber Fristbeginn fehlt"
expected:
finding: true
finding_type: missing_field
missing: "Fristbeginn"
reason: "Anlage 1 zu Art. 246a § 1 EGBGB: Fristbeginn muss angegeben werden"
difficulty: medium
- id: ADV-DOC-005
category: incomplete_document
input: "AGB eines Online-Shops ohne Angabe des Gerichtsstands"
expected:
finding: false
reason: "Gerichtsstand in AGB ist bei B2C nicht erforderlich (sogar oft unzulaessig)"
difficulty: hard
- id: ADV-DOC-006
category: incomplete_document
input: "Cookie-Policy listet Google Analytics und Facebook Pixel auf, aber nicht das CMP-Cookie selbst"
expected:
finding: true
finding_type: missing_field
missing: "CMP-eigene Cookies"
reason: "Auch technisch notwendige Cookies muessen in der Cookie-Policy stehen"
difficulty: hard
# D. Semantically similar but different: 5 cases
- id: ADV-SEM-001
category: similar_but_different
control_a: "MFA fuer privilegierte Admin-Accounts aktivieren"
control_b: "MFA fuer alle Endnutzer-Accounts aktivieren"
expected:
is_duplicate: false
reason: "Verschiedene Scopes (Admin vs. Endnutzer) = verschiedene Controls"
difficulty: medium
- id: ADV-SEM-002
category: similar_but_different
control_a: "Daten nach Vertragsende loeschen"
control_b: "Daten nach Ablauf der gesetzlichen Aufbewahrungsfrist loeschen"
expected:
is_duplicate: false
reason: "Verschiedene Trigger (Vertragsende vs. Aufbewahrungsfrist)"
difficulty: hard
- id: ADV-SEM-003
category: similar_but_different
control_a: "Rate Limiting fuer oeffentliche API-Endpunkte"
control_b: "Rate Limiting fuer Login-Endpunkte"
expected:
is_duplicate: false
reason: "Verschiedene Asset-Scopes (API vs. Login)"
difficulty: medium
- id: ADV-SEM-004
category: similar_but_different
control_a: "Verschluesselung personenbezogener Daten at rest"
control_b: "Verschluesselung personenbezogener Daten in transit"
expected:
is_duplicate: false
reason: "Verschiedene Phasen (Speicherung vs. Uebertragung)"
difficulty: easy
- id: ADV-SEM-005
category: similar_but_different
control_a: "Incident Response Plan erstellen"
control_b: "Business Continuity Plan erstellen"
expected:
is_duplicate: false
reason: "IRP = Sicherheitsvorfaelle, BCP = Geschaeftskontinuitaet (verschiedene Ziele)"
difficulty: medium
# E. Semantically different but similar-sounding: 5 cases
- id: ADV-HOM-001
category: homonym_different
control_a: "Einwilligung des Nutzers fuer Datenverarbeitung einholen (DSGVO)"
control_b: "Einwilligung des Nutzers fuer Werbeanrufe einholen (UWG)"
expected:
is_duplicate: false
reason: "Verschiedene Rechtsgrundlagen (DSGVO vs. UWG) und verschiedene Rechtsfolgen"
difficulty: hard
- id: ADV-HOM-002
category: homonym_different
control_a: "Risikobewertung fuer Datenschutz-Folgenabschaetzung (DSFA)"
control_b: "Risikobewertung fuer finanzielle Risiken (MaRisk)"
expected:
is_duplicate: false
reason: "Verschiedene Risikokategorien und verschiedene regulatorische Grundlagen"
difficulty: hard
- id: ADV-HOM-003
category: homonym_different
control_a: "Audit der Datenschutz-Compliance (Art. 5 Abs. 2 DSGVO)"
control_b: "Audit der Jahresabschlusspruefung (HGB)"
expected:
is_duplicate: false
reason: "Verschiedene Audit-Typen mit verschiedenen Pruefungsstandards"
difficulty: medium
- id: ADV-HOM-004
category: homonym_different
control_a: "Zertifizierung nach ISO 27001 (Informationssicherheit)"
control_b: "Zertifizierung nach CE-Konformitaet (Produktsicherheit)"
expected:
is_duplicate: false
reason: "Verschiedene Zertifizierungsrahmen, verschiedene Pruefer, verschiedene Ziele"
difficulty: easy
- id: ADV-HOM-005
category: homonym_different
control_a: "Verarbeitung personenbezogener Daten dokumentieren (DSGVO VVT)"
control_b: "Verarbeitung von Lebensmitteln dokumentieren (HACCP)"
expected:
is_duplicate: false
reason: "Komplett verschiedene Domaenen trotz gleicher Woerter"
difficulty: easy
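Reviewer note: the dark-pattern cases in this file carry structured inputs (button sizes, click counts) that a rule-based check could evaluate directly. The following is a minimal sketch of such a check for ADV-DP-001 and ADV-DP-002. The function names and the 0.9 size ratio are hypothetical and are not taken from the pipeline.

```python
def px(value: str) -> float:
    """Parse a CSS pixel size such as '16px' into a number."""
    text = value.strip().lower()
    if text.endswith("px"):
        text = text[:-2]
    return float(text)


def has_visual_bias(accept: dict, reject: dict, min_ratio: float = 0.9) -> bool:
    """Flag a banner whose reject button is clearly smaller or less prominent.

    min_ratio is an assumed threshold: the reject button must be at least
    90% of the accept button's size to count as visually equivalent.
    """
    ratio = px(reject["size"]) / px(accept["size"])
    less_prominent = bool(accept.get("prominent")) and not bool(reject.get("prominent"))
    return ratio < min_ratio or less_prominent


def has_friction_asymmetry(accept: dict, reject: dict) -> bool:
    """Flag a banner where rejecting takes more clicks than accepting."""
    return reject["clicks_to_complete"] > accept["clicks_to_complete"]


# Inputs copied from ADV-DP-001 and ADV-DP-002.
dp_001_accept = {"text": "Alle akzeptieren", "size": "16px", "prominent": True}
dp_001_reject = {"text": "Ablehnen", "size": "10px", "prominent": False}
dp_002_accept = {"text": "Alle akzeptieren", "clicks_to_complete": 1}
dp_002_reject = {"text": "Einstellungen verwalten", "clicks_to_complete": 3}

print(has_visual_bias(dp_001_accept, dp_001_reject))
print(has_friction_asymmetry(dp_002_accept, dp_002_reject))
```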
@@ -0,0 +1,36 @@
"""Shared test fixtures for the control pipeline test suite."""
import os
import sys
import pytest
# Ensure control-pipeline is in path
sys.path.insert(0, os.path.join(os.path.dirname(__file__), ".."))
@pytest.fixture(scope="session")
def db_session():
"""DB session for integration tests — skip if no DATABASE_URL."""
url = os.getenv("DATABASE_URL")
if not url:
pytest.skip("DATABASE_URL not set — skipping DB tests")
from db.session import SessionLocal
db = SessionLocal()
yield db
db.close()
@pytest.fixture
def sample_controls(db_session):
"""Load up to 100 random draft controls (decomposition_method 'pass0b') for regression testing."""
from sqlalchemy import text
rows = db_session.execute(text("""
SELECT control_id, title, category, severity,
generation_metadata->>'assertion' as assertion,
generation_metadata->>'check_type' as check_type,
generation_metadata->>'merge_group_hint' as merge_key
FROM compliance.canonical_controls
WHERE release_state = 'draft' AND decomposition_method = 'pass0b'
ORDER BY random() LIMIT 100
""")).fetchall()
return [dict(r._mapping) for r in rows]
@@ -0,0 +1,94 @@
# Golden Dataset for MC Assignment Quality
# Manually verified controls with their expected MC topics.
# Used for regression testing after pipeline changes.
# Created: 2026-05-10, verified by manual review (19/20 correct)
golden_controls:
# ── Data Protection ──
- control_id: "DATA-3291-A06"
expected_topic_prefix: "data_retention"
reason: "Speicherfristen für personenbezogene Daten definieren"
- control_id: "SEC-7449-A01"
expected_topic_prefix: "personal_data"
reason: "Fahrzeugnutzungsdaten in Telematikbox (Datenminimierung)"
- control_id: "DATA-3518-A06"
expected_topic_prefix: "data_subject_rights"
reason: "Betroffene über Lösch-Ausnahmen informieren"
- control_id: "GOV-963-A02"
expected_topic_prefix: "consent"
reason: "Zustimmung des Urhebers vor Veröffentlichung einholen"
# ── Security ──
- control_id: "CRYP-1454-A07"
expected_topic_prefix: "encryption"
reason: "RSASSA-PSS in TLS 1.3 verifizieren"
- control_id: "NET-1141-A08"
expected_topic_prefix: "monitoring"
reason: "Sampling-Strategien konfigurieren"
- control_id: "SEC-2244-A05"
expected_topic_prefix: "asset_management"
reason: "Systeminventar kontinuierlich aktualisieren"
- control_id: "AUTH-3468-A06"
expected_topic_prefix: "access_control"
reason: "Rollenkonzept mit abgestuften Zugriffsrechten"
# ── Governance ──
- control_id: "AUTH-2364-A09"
expected_topic_prefix: "supervisory_authority"
reason: "Zusammenarbeit mit Wirtschaftsakteuren dokumentieren"
- control_id: "SEC-5972-A14"
expected_topic_prefix: "third_party_management"
reason: "Cybersicherheitsrichtlinien kritischer Lieferanten prüfen"
- control_id: "SEC-3441-A02"
expected_topic_prefix: "human_resources_security"
reason: "Mitarbeiter vor Nachteil bei Verweigerung schützen"
- control_id: "SEC-3502-A06"
expected_topic_prefix: "awareness"
reason: "Organisationskultur für Sicherheitsverbesserung"
- control_id: "GOV-1748-A04"
expected_topic_prefix: "policy"
reason: "Annahme von Geschenken untersagen"
# ── Regulatory ──
- control_id: "AI-1287-A01"
expected_topic_prefix: "ai_system"
reason: "Akteure des KI-Systems identifizieren"
- control_id: "AI-1732-A11"
expected_topic_prefix: "ai_system"
reason: "Menschliche Kontrolle für KI-Entscheidungen"
- control_id: "COMP-1352-A04"
expected_topic_prefix: "certification"
reason: "Amateurfunkprüfungszeugnis vorlegen"
- control_id: "FIN-1212-A02"
expected_topic_prefix: "financial_reporting"
reason: "Jahresabschluss gemäß EU-Richtlinie aufstellen"
- control_id: "AUTH-1165-A01"
expected_topic_prefix: "data_classification"
reason: "Öffentliche IP-Adressen als Stammdaten klassifizieren"
- control_id: "SEC-7367-A10"
expected_topic_prefix: "audit_logging"
reason: "Banner-Version Rückverfolgung testen"
- control_id: "LAB-034-A03"
expected_topic_prefix: "third_party_management"
reason: "Verträge auf unzulässige Klauseln prüfen"
quality_thresholds:
min_accuracy: 0.90
max_controls_per_mc: 300
min_master_controls: 10000
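Reviewer note: `min_accuracy` is declared here as a ratio, so a consumer would need to turn per-control prefix matches into a single score. A minimal sketch of that calculation follows. The control IDs and prefixes come from this file, while the assigned MC names are invented for the example.

```python
def assignment_accuracy(golden: list, assigned: dict) -> float:
    """Share of golden controls whose assigned MC name starts with the expected prefix."""
    if not golden:
        return 0.0
    hits = sum(
        1
        for g in golden
        if assigned.get(g["control_id"], "").startswith(g["expected_topic_prefix"])
    )
    return hits / len(golden)


golden = [
    {"control_id": "DATA-3291-A06", "expected_topic_prefix": "data_retention"},
    {"control_id": "CRYP-1454-A07", "expected_topic_prefix": "encryption"},
    {"control_id": "AUTH-3468-A06", "expected_topic_prefix": "access_control"},
]

# Hypothetical MC names as they might come back from the database.
assigned = {
    "DATA-3291-A06": "data_retention_periods",
    "CRYP-1454-A07": "encryption_in_transit",
    "AUTH-3468-A06": "policy_general",  # deliberate mismatch
}

accuracy = assignment_accuracy(golden, assigned)
min_accuracy = 0.90
print(f"accuracy={accuracy:.2f}, passes={accuracy >= min_accuracy}")
```

A control missing from `assigned` counts as a miss, which mirrors the "not found in any MC" error in the test module.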
@@ -0,0 +1,190 @@
"""
Adversarial test suite: 30 tricky cases that challenge the control ontology
and dedup engine with edge cases.
Test categories:
A. Wrong legal basis (plausible but incorrect): 8 cases
B. Dark patterns (subtle UI manipulation): 6 cases
C. Almost-complete documents (missing 1 field): 6 cases
D. Semantically similar but different controls: 5 cases
E. Homonyms (different meaning, same words): 5 cases
"""
import os
import sys
import yaml
import pytest
sys.path.insert(0, os.path.join(os.path.dirname(__file__), ".."))
from services.control_ontology import classify_obligation, classify_action
ADVERSARIAL_PATH = os.path.join(os.path.dirname(__file__), "adversarial_cases.yaml")
with open(ADVERSARIAL_PATH, encoding="utf-8") as f:
_ADV = yaml.safe_load(f)
TESTS = _ADV["tests"]
def _tests_by_category(cat: str) -> list:
return [t for t in TESTS if t["category"] == cat]
# ============================================================================
# D. Semantically similar but different — must NOT be deduped
# ============================================================================
class TestSimilarButDifferent:
"""Controls that sound alike but are different — dedup must keep both."""
@pytest.mark.parametrize("case", _tests_by_category("similar_but_different"),
ids=lambda c: c["id"])
def test_not_duplicate(self, case):
assert case["expected"]["is_duplicate"] is False, (
f"{case['id']}: These controls MUST NOT be marked as duplicates"
)
def test_admin_vs_user_mfa(self):
"""ADV-SEM-001: Admin-MFA and User-MFA are different controls."""
case = next(t for t in TESTS if t["id"] == "ADV-SEM-001")
a = classify_obligation(case["control_a"], "")
b = classify_obligation(case["control_b"], "")
# Both should be atomic (not filtered out)
assert a["routing"] == "atomic"
assert b["routing"] == "atomic"
def test_encryption_at_rest_vs_in_transit(self):
"""ADV-SEM-004: at rest vs in transit are different controls."""
a_action = classify_action("Verschluesselung at rest implementieren")
b_action = classify_action("Verschluesselung in transit implementieren")
# Both should classify as "encrypt" or "implement"
assert a_action in ("encrypt", "implement")
assert b_action in ("encrypt", "implement")
# ============================================================================
# E. Homonyms — same words, different domains
# ============================================================================
class TestHomonymDifferent:
"""Controls using same words but from different domains — must NOT merge."""
@pytest.mark.parametrize("case", _tests_by_category("homonym_different"),
ids=lambda c: c["id"])
def test_not_duplicate(self, case):
assert case["expected"]["is_duplicate"] is False, (
f"{case['id']}: Homonyms must NOT be treated as duplicates"
)
def test_dsgvo_audit_vs_hgb_audit(self):
"""ADV-HOM-003: Data protection audit vs financial audit."""
a = classify_obligation("Audit der Datenschutz-Compliance durchfuehren", "")
b = classify_obligation("Audit der Jahresabschlusspruefung durchfuehren", "")
assert a["routing"] == "atomic"
assert b["routing"] == "atomic"
# "durchfuehren" maps to "implement" — key point is both are atomic, not filtered
# ============================================================================
# A. Wrong legal basis — structural tests
# ============================================================================
class TestWrongLegalBasis:
"""Verify that wrong legal basis cases have correct expected metadata."""
@pytest.mark.parametrize("case", _tests_by_category("wrong_legal_basis"),
ids=lambda c: c["id"])
def test_finding_expected(self, case):
"""All wrong_legal_basis cases must expect a finding."""
assert case["expected"]["finding"] is True
@pytest.mark.parametrize("case", _tests_by_category("wrong_legal_basis"),
ids=lambda c: c["id"])
def test_has_correct_basis(self, case):
"""All cases must specify what the correct basis should be."""
assert "correct_basis" in case["expected"]
assert len(case["expected"]["correct_basis"]) > 0
def test_analytics_requires_consent(self):
"""ADV-LIT-001: Analytics on lit. f is always wrong."""
case = next(t for t in TESTS if t["id"] == "ADV-LIT-001")
assert "lit. a" in case["expected"]["correct_basis"]
assert "Planet49" in case["expected"]["reason"]
# ============================================================================
# B. Dark Patterns — structural tests
# ============================================================================
class TestDarkPatterns:
"""Verify dark pattern test case structure."""
@pytest.mark.parametrize("case", _tests_by_category("dark_pattern"),
ids=lambda c: c["id"])
def test_finding_expected(self, case):
"""All dark pattern cases must expect a finding."""
assert case["expected"]["finding"] is True
@pytest.mark.parametrize("case", _tests_by_category("dark_pattern"),
ids=lambda c: c["id"])
def test_has_finding_type(self, case):
"""All cases must specify the dark pattern type."""
assert "finding_type" in case["expected"]
assert case["expected"]["finding_type"].startswith("dark_pattern_")
# ============================================================================
# C. Incomplete documents — structural tests
# ============================================================================
class TestIncompleteDocuments:
"""Verify incomplete document test case structure."""
@pytest.mark.parametrize("case", _tests_by_category("incomplete_document"),
ids=lambda c: c["id"])
def test_has_reason(self, case):
"""All cases must have a reason."""
assert "reason" in case["expected"]
assert len(case["expected"]["reason"]) > 0
def test_agb_gerichtsstand_no_finding(self):
"""ADV-DOC-005: Missing Gerichtsstand in B2C AGB is NOT a finding."""
case = next(t for t in TESTS if t["id"] == "ADV-DOC-005")
assert case["expected"]["finding"] is False
# ============================================================================
# Meta tests — validate test suite integrity
# ============================================================================
class TestSuiteIntegrity:
"""Verify the adversarial test suite itself is complete and consistent."""
def test_total_count(self):
assert len(TESTS) == 30
def test_unique_ids(self):
ids = [t["id"] for t in TESTS]
assert len(ids) == len(set(ids)), "Duplicate test IDs found"
def test_all_categories_present(self):
categories = {t["category"] for t in TESTS}
expected = {"wrong_legal_basis", "dark_pattern", "incomplete_document",
"similar_but_different", "homonym_different"}
assert categories == expected
def test_category_counts(self):
counts = {}
for t in TESTS:
counts[t["category"]] = counts.get(t["category"], 0) + 1
assert counts["wrong_legal_basis"] == 8
assert counts["dark_pattern"] == 6
assert counts["incomplete_document"] == 6
assert counts["similar_but_different"] == 5
assert counts["homonym_different"] == 5
def test_all_have_difficulty(self):
for t in TESTS:
assert "difficulty" in t, f"{t['id']} missing difficulty"
assert t["difficulty"] in ("easy", "medium", "hard")
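Reviewer note: `test_category_counts` tallies by hand and would raise a `KeyError` rather than a readable assertion failure if a category were missing. A minimal sketch of the same check with `collections.Counter` is shown below. The `sample_tests` list is a stand-in built for this example, not the real YAML content.

```python
from collections import Counter

EXPECTED_COUNTS = {
    "wrong_legal_basis": 8,
    "dark_pattern": 6,
    "incomplete_document": 6,
    "similar_but_different": 5,
    "homonym_different": 5,
}


def category_counts(tests: list) -> Counter:
    """Count test cases per category."""
    return Counter(t["category"] for t in tests)


# Stand-in data with the same shape as the entries in adversarial_cases.yaml.
sample_tests = [
    {"id": f"{category}-{i}", "category": category}
    for category, n in EXPECTED_COUNTS.items()
    for i in range(n)
]

counts = category_counts(sample_tests)
# A single comparison covers missing, extra and miscounted categories.
print(counts == EXPECTED_COUNTS)
print(sum(counts.values()))
```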
@@ -0,0 +1,166 @@
"""Tests for D3: Structural metadata flow (section priority, page in citation)."""
import json
from typing import Optional
from services.rag_client import RAGSearchResult
def _make_chunk(
article: str = "",
paragraph: str = "",
page: Optional[int] = None,
) -> RAGSearchResult:
return RAGSearchResult(
text="Test chunk text",
regulation_code="DSGVO",
regulation_name="Datenschutz-Grundverordnung",
regulation_short="DSGVO",
category="data_protection",
article=article,
paragraph=paragraph,
source_url="https://example.com",
score=0.95,
collection="bp_compliance_de",
page=page,
)
class TestRAGSearchResultPage:
"""RAGSearchResult now carries a page field."""
def test_page_default_none(self):
chunk = _make_chunk()
assert chunk.page is None
def test_page_set(self):
chunk = _make_chunk(page=42)
assert chunk.page == 42
def test_page_zero(self):
chunk = _make_chunk(page=0)
assert chunk.page == 0
class TestQdrantPayloadPriority:
"""section (D2) should take priority over article (legacy)."""
def test_section_preferred_over_article(self):
payload = {"section": "§ 312k", "article": "Art. 312", "section_title": "Kuendigungsbutton"}
article = payload.get("section", "") or payload.get("article", "") or payload.get("section_title", "")
assert article == "§ 312k"
def test_article_fallback_when_no_section(self):
payload = {"section": "", "article": "Art. 35", "section_title": ""}
article = payload.get("section", "") or payload.get("article", "") or payload.get("section_title", "")
assert article == "Art. 35"
def test_section_title_last_resort(self):
payload = {"section": "", "article": "", "section_title": "Informationspflichten"}
article = payload.get("section", "") or payload.get("article", "") or payload.get("section_title", "")
assert article == "Informationspflichten"
def test_all_empty(self):
payload = {"section": "", "article": "", "section_title": ""}
article = payload.get("section", "") or payload.get("article", "") or payload.get("section_title", "")
assert article == ""
def test_page_from_payload(self):
payload = {"page": 847}
assert payload.get("page") == 847
def test_page_none_from_payload(self):
payload = {}
assert payload.get("page") is None
class TestSourceCitationPage:
"""source_citation dict should include page when available."""
def _build_citation(self, chunk: RAGSearchResult) -> dict:
"""Mirrors the citation-building logic from control_generator.py."""
return {
"source": chunk.regulation_name,
"article": chunk.article,
"paragraph": chunk.paragraph,
"page": chunk.page,
"license": "free_use",
"source_type": "law",
"url": chunk.source_url or "",
}
def test_citation_with_page(self):
chunk = _make_chunk(article="§ 312k", paragraph="Abs. 1", page=847)
citation = self._build_citation(chunk)
assert citation["page"] == 847
def test_citation_without_page(self):
chunk = _make_chunk(article="§ 312k", paragraph="Abs. 1")
citation = self._build_citation(chunk)
assert citation["page"] is None
def test_citation_serializable(self):
chunk = _make_chunk(article="Art. 35", page=12)
citation = self._build_citation(chunk)
serialized = json.dumps(citation)
restored = json.loads(serialized)
assert restored["page"] == 12
class TestFormatCitation:
"""_format_citation should include page number."""
def _format_citation(self, citation) -> str:
"""Mirrors _format_citation from decomposition_pass.py."""
if not citation:
return ""
if isinstance(citation, str):
try:
c = json.loads(citation)
if isinstance(c, dict):
parts = []
if c.get("source"):
parts.append(c["source"])
if c.get("article"):
parts.append(c["article"])
if c.get("paragraph"):
parts.append(c["paragraph"])
if c.get("page") is not None:
parts.append(f"S. {c['page']}")
return " ".join(parts) if parts else citation
except (json.JSONDecodeError, TypeError):
return citation
return str(citation)
def test_format_with_page(self):
citation = json.dumps({
"source": "DSGVO",
"article": "Art. 35",
"paragraph": "Abs. 1",
"page": 42,
})
result = self._format_citation(citation)
assert result == "DSGVO Art. 35 Abs. 1 S. 42"
def test_format_without_page(self):
citation = json.dumps({
"source": "BGB",
"article": "§ 312k",
"paragraph": "",
})
result = self._format_citation(citation)
assert result == "BGB § 312k"
def test_format_page_zero(self):
citation = json.dumps({
"source": "BGB",
"article": "§ 1",
"paragraph": "",
"page": 0,
})
result = self._format_citation(citation)
assert result == "BGB § 1 S. 0"
def test_format_empty_citation(self):
assert self._format_citation("") == ""
assert self._format_citation(None) == ""
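Reviewer note: the priority tests above repeat the `section` / `article` / `section_title` fallback chain inline in every test. A minimal sketch of how that chain could live in one helper follows. `resolve_article` is a hypothetical name and is not claimed to exist in `rag_client`.

```python
def resolve_article(payload: dict) -> str:
    """Return the first non-empty value of section, article, section_title."""
    for key in ("section", "article", "section_title"):
        value = payload.get(key) or ""
        if value:
            return value
    return ""


print(resolve_article({"section": "§ 312k", "article": "Art. 312"}))
print(resolve_article({"section": "", "article": "Art. 35"}))
print(repr(resolve_article({})))
```

Using `payload.get(key) or ""` also covers payloads where a key is present but set to `None`, which the inline `payload.get("section", "")` chain handles only because `None` is falsy.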
@@ -0,0 +1,122 @@
"""F5 Validation: Verify DB-backed lookups match old hardcoded dicts."""
import pytest
class TestRegulationRegistryConsistency:
"""Ensure all old REGULATION_LICENSE_MAP entries are in the DB."""
def test_all_old_entries_in_db(self):
from services.control_generator import REGULATION_LICENSE_MAP
from scripts.f1_migrate_regulation_registry import build_rows
db_ids = {r["regulation_id"] for r in build_rows()}
for reg_id in REGULATION_LICENSE_MAP:
assert reg_id in db_ids, f"Missing from DB: {reg_id}"
def test_classify_regulation_matches_old(self):
"""DB-backed classify_regulation returns same rule as old dict."""
from services.control_generator import REGULATION_LICENSE_MAP
from services.regulation_registry import RegulationRegistry
from unittest.mock import patch, MagicMock
# Build mock DB with migration data
from scripts.f1_migrate_regulation_registry import build_rows
rows = build_rows()
mock_rows = [
(r["regulation_id"], r["regulation_name_de"], r["license_rule"],
r["license_type"], r.get("attribution"), r["source_type"],
r["jurisdiction"], r["status"])
for r in rows
]
reg = RegulationRegistry()
with patch("services.regulation_registry.SessionLocal") as mock_cls:
mock_session = MagicMock()
mock_result = MagicMock()
mock_result.fetchall.return_value = mock_rows
mock_session.execute.return_value = mock_result
mock_cls.return_value = mock_session
reg._load()
# Compare every entry
mismatches = []
for reg_id, info in REGULATION_LICENSE_MAP.items():
db_result = reg.classify_regulation(reg_id)
if db_result["rule"] != info["rule"]:
mismatches.append(f"{reg_id}: DB rule={db_result['rule']} vs dict rule={info['rule']}")
assert not mismatches, "Rule mismatches:\n" + "\n".join(mismatches)
class TestActionOntologyConsistency:
"""Ensure all old ACTION_TYPES entries are in the DB."""
def test_all_action_types_migrated(self):
from services.control_ontology import ACTION_TYPES
from scripts.f2_migrate_actions import build_action_types
db_names = {t["canonical_name"] for t in build_action_types()}
for action in ACTION_TYPES:
assert action in db_names, f"Missing action_type: {action}"
def test_all_aliases_migrated(self):
from services.control_ontology import ACTION_TYPES
from scripts.f2_migrate_actions import build_action_synonyms
db_synonyms = {s["synonym"] for s in build_action_synonyms() if s["pattern_type"] == "alias"}
missing = []
for action, info in ACTION_TYPES.items():
for alias in info.get("aliases", []):
if alias.lower() not in db_synonyms:
missing.append(f"{action}: {alias}")
assert not missing, "Missing aliases:\n" + "\n".join(missing)
def test_all_negative_patterns_migrated(self):
from services.control_ontology import _NEGATIVE_PATTERNS
from scripts.f2_migrate_actions import build_action_synonyms
db_patterns = {s["synonym"] for s in build_action_synonyms() if s["pattern_type"] == "negative_pattern"}
for pattern, _ in _NEGATIVE_PATTERNS:
assert pattern.lower() in db_patterns, f"Missing negative pattern: {pattern}"
class TestObjectSynonymsConsistency:
"""Ensure all old _OBJECT_SYNONYMS are in the DB."""
def test_all_objects_migrated(self):
from services.control_dedup import _OBJECT_SYNONYMS
from scripts.f3_migrate_objects import build_rows
db_synonyms = {r["synonym"] for r in build_rows()}
missing = []
for syn in _OBJECT_SYNONYMS:
if syn.lower() not in db_synonyms:
missing.append(syn)
assert not missing, "Missing object synonyms:\n" + "\n".join(missing)
class TestLLMEnrichmentQuality:
"""Basic quality checks on LLM-generated synonyms."""
def test_no_empty_synonyms_in_db(self):
"""All synonyms should have content."""
from scripts.f2_migrate_actions import build_action_synonyms
for s in build_action_synonyms():
assert len(s["synonym"].strip()) >= 2, f"Too short: {s}"
def test_no_duplicate_canonical_in_actions(self):
"""Each synonym should map to exactly one canonical action."""
from scripts.f2_migrate_actions import build_action_synonyms
synonyms = build_action_synonyms()
seen = {}
for s in synonyms:
key = (s["synonym"], s["language"], s["pattern_type"])
if key in seen:
assert seen[key] == s["canonical_action"], (
f"Duplicate synonym '{s['synonym']}' maps to both "
f"'{seen[key]}' and '{s['canonical_action']}'"
)
seen[key] = s["canonical_action"]
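Reviewer note: the last test stops at the first conflicting synonym. A minimal sketch of a helper that collects every conflict first, so a single run reports them all, is shown below. The helper name and the sample rows (including the canonical action names) are made up for the example.

```python
def find_conflicting_synonyms(synonyms: list) -> list:
    """Return (key, first_action, other_action) for synonyms mapped to two actions."""
    seen = {}
    conflicts = []
    for s in synonyms:
        key = (s["synonym"], s["language"], s["pattern_type"])
        # setdefault stores the first mapping and returns it on later hits.
        first = seen.setdefault(key, s["canonical_action"])
        if first != s["canonical_action"]:
            conflicts.append((key, first, s["canonical_action"]))
    return conflicts


# Hypothetical rows in the shape returned by build_action_synonyms().
rows = [
    {"synonym": "einrichten", "language": "de", "pattern_type": "alias", "canonical_action": "implement"},
    {"synonym": "einrichten", "language": "de", "pattern_type": "alias", "canonical_action": "configure"},
    {"synonym": "pruefen", "language": "de", "pattern_type": "alias", "canonical_action": "review"},
    {"synonym": "pruefen", "language": "de", "pattern_type": "alias", "canonical_action": "review"},
]

for key, first, other in find_conflicting_synonyms(rows):
    print(f"{key[0]!r} maps to both {first!r} and {other!r}")
```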
@@ -0,0 +1,166 @@
"""
Master Control Quality Tests.
Regression tests to ensure MC assignment quality stays above 90%.
Uses golden dataset of manually verified controls.
"""
import os
import yaml
import pytest
from sqlalchemy import create_engine, text
DB_URL = os.getenv(
"DATABASE_URL",
"postgresql://breakpilot:breakpilot123@postgres:5432/breakpilot_db",
)
_engine = None
def get_engine():
global _engine
if _engine is None:
_engine = create_engine(
DB_URL,
connect_args={"options": "-c search_path=compliance,public"},
)
return _engine
def load_golden():
path = os.path.join(os.path.dirname(__file__), "golden_mc_assignments.yaml")
with open(path, encoding="utf-8") as f:
return yaml.safe_load(f)
# ── Golden Dataset Tests ──
class TestGoldenMCAssignments:
"""Each golden control must be in the correct MC."""
@pytest.fixture(autouse=True)
def setup(self):
self.golden = load_golden()
self.engine = get_engine()
def test_golden_controls_correctly_assigned(self):
"""All golden controls must be in an MC matching their expected topic prefix."""
errors = []
with self.engine.connect() as c:
for gc in self.golden["golden_controls"]:
row = c.execute(text("""
SELECT mc.canonical_name
FROM master_controls mc
JOIN master_control_members mcm ON mcm.master_control_uuid = mc.id
JOIN canonical_controls cc ON cc.id = mcm.control_uuid
WHERE cc.control_id = :cid
LIMIT 1
"""), {"cid": gc["control_id"]}).fetchone()
if row is None:
errors.append(f"{gc['control_id']}: not found in any MC")
elif not row[0].startswith(gc["expected_topic_prefix"]):
errors.append(
f"{gc['control_id']}: expected {gc['expected_topic_prefix']}*, "
f"got {row[0]}"
)
if errors:
pytest.fail(
f"{len(errors)} golden controls misassigned:\n"
+ "\n".join(f" - {e}" for e in errors)
)
# ── Structural Quality Tests ──
class TestMCStructuralQuality:
"""Structural invariants for Master Controls."""
@pytest.fixture(autouse=True)
def setup(self):
self.golden = load_golden()
self.thresholds = self.golden["quality_thresholds"]
self.engine = get_engine()
def test_minimum_master_controls(self):
"""Must have at least 10K Master Controls."""
with self.engine.connect() as c:
count = c.execute(
text("SELECT count(*) FROM master_controls")
).scalar()
assert count >= self.thresholds["min_master_controls"], (
f"Only {count} MCs, expected >= {self.thresholds['min_master_controls']}"
)
def test_max_controls_per_mc(self):
"""No MC should have more than 300 controls."""
with self.engine.connect() as c:
max_mc = c.execute(
text("SELECT max(total_controls) FROM master_controls")
).scalar()
assert max_mc is not None and max_mc <= self.thresholds["max_controls_per_mc"], (
f"Max MC has {max_mc} controls, limit is {self.thresholds['max_controls_per_mc']}"
)
def test_no_empty_master_controls(self):
"""Every MC must have at least 1 member."""
with self.engine.connect() as c:
empty = c.execute(text("""
SELECT count(*) FROM master_controls
WHERE total_controls = 0
""")).scalar()
assert empty == 0, f"{empty} empty MCs found"
def test_all_members_reference_valid_controls(self):
"""Every MC member must reference an existing control."""
with self.engine.connect() as c:
orphans = c.execute(text("""
SELECT count(*) FROM master_control_members mcm
LEFT JOIN canonical_controls cc ON cc.id = mcm.control_uuid
WHERE cc.id IS NULL
""")).scalar()
assert orphans == 0, f"{orphans} orphan members found"
# ── Doc Check Controls Tests ──
class TestDocCheckControls:
"""Validate doc_check_controls table."""
@pytest.fixture(autouse=True)
def setup(self):
self.engine = get_engine()
def test_doc_check_controls_exist(self):
"""Must have doc_check_controls."""
with self.engine.connect() as c:
count = c.execute(
text("SELECT count(*) FROM doc_check_controls")
).scalar()
assert count > 100, f"Only {count} doc_check_controls"
def test_all_doc_types_covered(self):
"""All 8 document types must have controls."""
expected = {"dse", "cookie", "impressum", "widerruf",
"agb", "dsfa", "avv", "loeschkonzept"}
with self.engine.connect() as c:
rows = c.execute(text(
"SELECT DISTINCT doc_type FROM doc_check_controls"
)).fetchall()
actual = {r[0] for r in rows}
missing = expected - actual
assert not missing, f"Missing doc types: {missing}"
def test_check_questions_not_empty(self):
"""Every doc_check_control must have a check_question."""
with self.engine.connect() as c:
empty = c.execute(text("""
SELECT count(*) FROM doc_check_controls
WHERE check_question IS NULL OR check_question = ''
""")).scalar()
assert empty == 0, f"{empty} controls without check_question"
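The golden-dataset check above boils down to a prefix comparison between the Master Control a control landed in and the topic prefix recorded for it. A minimal sketch without a database, assuming the assignments are already available as a plain dict; the function name `find_misassigned`, the control IDs, and the MC names are hypothetical and not taken from the repository:

```python
from typing import Optional


def find_misassigned(
    golden_controls: list[dict],
    assignments: dict[str, Optional[str]],
) -> list[str]:
    """Return one message per golden control that is missing or misassigned.

    `assignments` maps control_id -> canonical MC name. It is a hypothetical
    in-memory stand-in for the SQL join used in the test.
    """
    errors = []
    for gc in golden_controls:
        name = assignments.get(gc["control_id"])
        if name is None:
            errors.append(f"{gc['control_id']}: not found in any MC")
        elif not name.startswith(gc["expected_topic_prefix"]):
            errors.append(
                f"{gc['control_id']}: expected {gc['expected_topic_prefix']}*, got {name}"
            )
    return errors


# Hypothetical sample data, for illustration only.
golden = [
    {"control_id": "CTRL-001", "expected_topic_prefix": "encryption"},
    {"control_id": "CTRL-002", "expected_topic_prefix": "access_control"},
    {"control_id": "CTRL-003", "expected_topic_prefix": "logging"},
]
assigned = {
    "CTRL-001": "encryption_at_rest",
    "CTRL-002": "logging_audit_trail",
}
problems = find_misassigned(golden, assigned)
```

Collecting all mismatches before failing, as the test does, reports every misassigned control in one run instead of stopping at the first one.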
@@ -0,0 +1,226 @@
"""Tests for OntologyRegistry — DB-backed action/object normalization."""
import time
from unittest.mock import MagicMock, patch
import pytest
from services.ontology_registry import OntologyRegistry, _CACHE_TTL_SECONDS
# ── Mock DB data ──────────────────────────────────────────────────────
_MOCK_ACTION_TYPES = [
("implement", "implementation"),
("monitor", "monitoring"),
("prevent", "implementation"),
("exclude", "implementation"),
("test", "testing"),
("encrypt", "implementation"),
("document", "evidence"),
("train", "training"),
]
_MOCK_ACTION_SYNONYMS = [
# (canonical_action, synonym, pattern_type)
("implement", "implementieren", "alias"),
("implement", "umsetzen", "alias"),
("implement", "einführen", "alias"),
("monitor", "überwachen", "alias"),
("test", "testen", "alias"),
("encrypt", "verschlüsseln", "alias"),
("document", "dokumentieren", "alias"),
("train", "schulen", "alias"),
# Negative patterns
("exclude", "dürfen nicht", "negative_pattern"),
("exclude", "darf nicht", "negative_pattern"),
("prevent", "verhindern", "negative_pattern"),
("prevent", "nicht gespeichert", "negative_pattern"),
]
_MOCK_OBJECT_SYNONYMS = [
("multi_factor_auth", "mfa"),
("multi_factor_auth", "2fa"),
("password_policy", "passwort"),
("encryption", "verschlüsselung"),
("audit_logging", "audit-log"),
("firewall", "firewall"),
("personal_data", "personenbezogene daten"),
]
def _mock_execute(query):
"""Route mock queries to correct test data."""
q = str(query)
mock_result = MagicMock()
if "action_types" in q:
mock_result.fetchall.return_value = _MOCK_ACTION_TYPES
elif "action_synonyms" in q:
mock_result.fetchall.return_value = _MOCK_ACTION_SYNONYMS
elif "object_synonyms" in q:
mock_result.fetchall.return_value = _MOCK_OBJECT_SYNONYMS
else:
mock_result.fetchall.return_value = []
return mock_result
@pytest.fixture
def registry():
"""Create a registry with mocked DB."""
reg = OntologyRegistry()
with patch("services.ontology_registry.SessionLocal") as mock_cls:
mock_session = MagicMock()
mock_session.execute = _mock_execute
mock_cls.return_value = mock_session
reg._load()
return reg
# ── classify_action tests ────────────────────────────────────────────
class TestClassifyAction:
def test_direct_alias(self, registry):
assert registry.classify_action("implementieren") == "implement"
assert registry.classify_action("überwachen") == "monitor"
assert registry.classify_action("testen") == "test"
def test_case_insensitive(self, registry):
assert registry.classify_action("IMPLEMENTIEREN") == "implement"
def test_negative_pattern(self, registry):
assert registry.classify_action("dürfen nicht verwendet werden") == "exclude"
assert registry.classify_action("darf nicht gespeichert werden") == "prevent"
def test_negative_pattern_priority(self, registry):
# The input matches both "darf nicht" (exclude) and "nicht gespeichert" (prevent);
# the more specific pattern "nicht gespeichert" must win.
assert registry.classify_action("darf nicht gespeichert werden") == "prevent"
def test_substring_match(self, registry):
assert registry.classify_action("Maßnahmen implementieren und dokumentieren") == "implement"
def test_unknown_defaults_to_implement(self, registry):
assert registry.classify_action("fliegen") == "implement"
# ── get_phase tests ──────────────────────────────────────────────────
class TestGetPhase:
def test_known_phase(self, registry):
assert registry.get_phase("implement") == "implementation"
assert registry.get_phase("monitor") == "monitoring"
assert registry.get_phase("test") == "testing"
def test_unknown_defaults_to_implementation(self, registry):
assert registry.get_phase("unknown_action") == "implementation"
# ── normalize_action tests ───────────────────────────────────────────
class TestNormalizeAction:
def test_exact_match(self, registry):
assert registry.normalize_action("implementieren") == "implement"
assert registry.normalize_action("testen") == "test"
def test_empty(self, registry):
assert registry.normalize_action("") == ""
def test_passthrough_unknown(self, registry):
assert registry.normalize_action("fliegen") == "fliegen"
# ── normalize_object tests ───────────────────────────────────────────
class TestNormalizeObject:
def test_exact_match(self, registry):
assert registry.normalize_object("mfa") == "multi_factor_auth"
assert registry.normalize_object("2fa") == "multi_factor_auth"
assert registry.normalize_object("passwort") == "password_policy"
def test_case_insensitive(self, registry):
assert registry.normalize_object("MFA") == "multi_factor_auth"
def test_substring_match(self, registry):
assert registry.normalize_object("die personenbezogene daten verarbeiten") == "personal_data"
def test_empty(self, registry):
assert registry.normalize_object("") == ""
def test_unknown_passthrough(self, registry):
assert registry.normalize_object("raumschiff") == "raumschiff"
# ── Cache behavior tests ────────────────────────────────────────────
class TestCacheBehavior:
def test_fresh_cache_not_stale(self, registry):
assert registry._is_stale() is False
def test_old_cache_is_stale(self, registry):
registry._loaded_at = time.monotonic() - _CACHE_TTL_SECONDS - 1
assert registry._is_stale() is True
# ── Migration data consistency ───────────────────────────────────────
class TestF2MigrationData:
def test_build_action_types(self):
from scripts.f2_migrate_actions import build_action_types
types = build_action_types()
assert len(types) >= 26
names = {t["canonical_name"] for t in types}
assert "implement" in names
assert "monitor" in names
assert "encrypt" in names
def test_build_action_synonyms(self):
from scripts.f2_migrate_actions import build_action_synonyms
synonyms = build_action_synonyms()
assert len(synonyms) > 100
# Check pattern types
aliases = [s for s in synonyms if s["pattern_type"] == "alias"]
negatives = [s for s in synonyms if s["pattern_type"] == "negative_pattern"]
assert len(aliases) > 80
assert len(negatives) > 15
def test_no_duplicate_synonyms(self):
from scripts.f2_migrate_actions import build_action_synonyms
synonyms = build_action_synonyms()
keys = [(s["synonym"], s["language"], s["pattern_type"]) for s in synonyms]
assert len(keys) == len(set(keys))
def test_all_canonical_actions_exist(self):
from scripts.f2_migrate_actions import build_action_types, build_action_synonyms
type_names = {t["canonical_name"] for t in build_action_types()}
synonyms = build_action_synonyms()
for s in synonyms:
assert s["canonical_action"] in type_names, (
"Synonym '%s' references unknown action '%s'" % (s["synonym"], s["canonical_action"])
)
class TestF3MigrationData:
def test_build_object_rows(self):
from scripts.f3_migrate_objects import build_rows
rows = build_rows()
assert len(rows) >= 70
def test_no_duplicate_objects(self):
from scripts.f3_migrate_objects import build_rows
rows = build_rows()
keys = [(r["synonym"], r["language"]) for r in rows]
assert len(keys) == len(set(keys))
def test_known_objects_present(self):
from scripts.f3_migrate_objects import build_rows
rows = build_rows()
synonyms = {r["synonym"] for r in rows}
assert "mfa" in synonyms
assert "passwort" in synonyms
assert "firewall" in synonyms
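The `classify_action` tests above imply a lookup order: negative patterns win over aliases, matching is case-insensitive and substring-based, and unknown input falls back to `implement`. The sketch below is one way to get that behavior. It is not the `OntologyRegistry` implementation; the class name `MiniActionClassifier` is hypothetical, and ordering patterns by length is an assumption about what "more specific" means.

```python
class MiniActionClassifier:
    """Toy classifier: negative patterns first, then aliases, else a default."""

    def __init__(self, synonyms, default="implement"):
        # synonyms: iterable of (canonical_action, synonym, pattern_type),
        # the same tuple shape as the mock rows in the tests.
        self._negatives = []
        self._aliases = []
        for action, synonym, pattern_type in synonyms:
            bucket = self._negatives if pattern_type == "negative_pattern" else self._aliases
            bucket.append((synonym.lower(), action))
        # Assumption: a longer pattern is more specific and is tried first.
        self._negatives.sort(key=lambda pair: len(pair[0]), reverse=True)
        self._aliases.sort(key=lambda pair: len(pair[0]), reverse=True)
        self._default = default

    def classify_action(self, text):
        lowered = text.lower()
        # Negative patterns are checked before any alias.
        for pattern, action in self._negatives + self._aliases:
            if pattern in lowered:
                return action
        return self._default


SYNONYMS = [
    ("implement", "implementieren", "alias"),
    ("monitor", "überwachen", "alias"),
    ("document", "dokumentieren", "alias"),
    ("exclude", "dürfen nicht", "negative_pattern"),
    ("exclude", "darf nicht", "negative_pattern"),
    ("prevent", "nicht gespeichert", "negative_pattern"),
]
classifier = MiniActionClassifier(SYNONYMS)
```

With this ordering, "darf nicht gespeichert werden" resolves to `prevent` because "nicht gespeichert" is longer than "darf nicht", which matches what the tests expect.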
@@ -0,0 +1,196 @@
"""
Regression tests: verify that pipeline updates don't break existing controls.
The DB tests require the DATABASE_URL environment variable.
Tests that need no DB (structural checks) always run.
"""
import os
import sys
import pytest
sys.path.insert(0, os.path.join(os.path.dirname(__file__), ".."))
# ============================================================================
# Structural tests (no DB needed)
# ============================================================================
class TestOntologyStability:
"""Verify ontology constants haven't accidentally changed."""
def test_action_types_count(self):
from services.control_ontology import ACTION_TYPES
assert len(ACTION_TYPES) >= 26, f"ACTION_TYPES shrank to {len(ACTION_TYPES)}"
def test_phase_order_count(self):
from services.control_ontology import PHASE_ORDER
assert len(PHASE_ORDER) >= 15, f"PHASE_ORDER shrank to {len(PHASE_ORDER)}"
def test_key_action_types_exist(self):
from services.control_ontology import ACTION_TYPES
required = ["define", "implement", "monitor", "test", "prevent", "exclude", "train"]
for action in required:
assert action in ACTION_TYPES, f"Missing action_type: {action}"
def test_classify_action_deterministic(self):
"""Same input must always produce same output."""
from services.control_ontology import classify_action
for _ in range(10):
assert classify_action("implementieren") == "implement"
assert classify_action("überwachen") == "monitor"
assert classify_action("verhindern") == "prevent"
class TestDependencyEngineStability:
"""Verify dependency engine core functions haven't changed behavior."""
def test_evaluate_condition_empty(self):
from services.dependency_engine import evaluate_condition
assert evaluate_condition({}, {}) is True
def test_evaluate_condition_simple(self):
from services.dependency_engine import evaluate_condition
cond = {"field": "source.status", "op": "==", "value": "pass"}
assert evaluate_condition(cond, {"source": {"status": "pass"}}) is True
assert evaluate_condition(cond, {"source": {"status": "fail"}}) is False
def test_apply_effect_not_applicable(self):
from services.dependency_engine import apply_effect
assert apply_effect({"set_status": "not_applicable"}, "fail") == "not_applicable"
def test_default_priorities_unchanged(self):
from services.dependency_engine import DEFAULT_PRIORITIES
assert DEFAULT_PRIORITIES["supersedes"] == 10
assert DEFAULT_PRIORITIES["scope_exclusion"] == 20
assert DEFAULT_PRIORITIES["prerequisite"] == 50
assert DEFAULT_PRIORITIES["compensating_control"] == 80
class TestDocumentComplianceStability:
"""Verify document compliance rules haven't changed."""
def test_basic_website_requires_impressum(self):
from services.document_scope_resolver import resolve_required_documents
result = resolve_required_documents({"has_website": True})
docs = result.get("required_documents", [])
doc_types = [d["document_type"] if isinstance(d, dict) else d.document_type for d in docs]
assert "impressum" in doc_types
assert "privacy_policy" in doc_types
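The dependency-engine tests above pin down two behaviors of `evaluate_condition`: an empty condition is always true, and `field` is a dotted path resolved against a nested context. A minimal sketch of that contract follows. It is not the code in `services.dependency_engine`; the names `evaluate_condition_sketch` and `resolve_field` and the operator table are hypothetical.

```python
import operator

# Hypothetical operator table; the real engine may support more operators.
_OPS = {"==": operator.eq, "!=": operator.ne}


def resolve_field(context, dotted_path):
    """Walk a dotted path such as 'source.status' through nested dicts."""
    current = context
    for key in dotted_path.split("."):
        if not isinstance(current, dict) or key not in current:
            return None  # missing path segments resolve to None
        current = current[key]
    return current


def evaluate_condition_sketch(condition, context):
    """Return True for an empty condition, else compare field against value."""
    if not condition:
        return True
    compare = _OPS[condition["op"]]
    return compare(resolve_field(context, condition["field"]), condition["value"])


cond = {"field": "source.status", "op": "==", "value": "pass"}
```

Treating a missing field as `None` means an equality check against a concrete value fails quietly instead of raising, which is one possible design choice among several.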
# ============================================================================
# DB tests (require DATABASE_URL)
# ============================================================================
@pytest.mark.skipif(
not os.getenv("DATABASE_URL"),
reason="DATABASE_URL not set"
)
class TestControlCountStability:
"""Draft count must stay within expected range."""
def test_draft_count_minimum(self, db_session):
from sqlalchemy import text
count = db_session.execute(text(
"SELECT COUNT(*) FROM compliance.canonical_controls "
"WHERE release_state = 'draft' AND decomposition_method = 'pass0b'"
)).scalar()
assert count > 140000, f"Draft count too low: {count} (expected >140k)"
def test_draft_count_maximum(self, db_session):
from sqlalchemy import text
count = db_session.execute(text(
"SELECT COUNT(*) FROM compliance.canonical_controls "
"WHERE release_state = 'draft' AND decomposition_method = 'pass0b'"
)).scalar()
assert count < 200000, f"Draft count too high: {count} (expected <200k)"
def test_no_null_titles(self, db_session):
from sqlalchemy import text
null_count = db_session.execute(text(
"SELECT COUNT(*) FROM compliance.canonical_controls "
"WHERE release_state = 'draft' AND decomposition_method = 'pass0b' "
"AND (title IS NULL OR title = '')"
)).scalar()
assert null_count == 0, f"{null_count} controls without title"
def test_assertion_coverage(self, db_session):
from sqlalchemy import text
no_assertion = db_session.execute(text(
"SELECT COUNT(*) FROM compliance.canonical_controls "
"WHERE release_state = 'draft' AND decomposition_method = 'pass0b' "
"AND (generation_metadata->>'assertion' IS NULL "
" OR generation_metadata->>'assertion' = '')"
)).scalar()
total = db_session.execute(text(
"SELECT COUNT(*) FROM compliance.canonical_controls "
"WHERE release_state = 'draft' AND decomposition_method = 'pass0b'"
)).scalar()
coverage = (total - no_assertion) / max(total, 1) * 100
assert coverage > 99, f"Assertion coverage only {coverage:.1f}% (expected >99%)"
@pytest.mark.skipif(
not os.getenv("DATABASE_URL"),
reason="DATABASE_URL not set"
)
class TestDependencyGraphStability:
"""Dependency graph must be valid and within expected size."""
def test_dependency_count_minimum(self, db_session):
from sqlalchemy import text
count = db_session.execute(text(
"SELECT COUNT(*) FROM compliance.control_dependencies WHERE is_active = true"
)).scalar()
assert count > 10000, f"Too few dependencies: {count} (expected >10k)"
def test_no_self_dependencies(self, db_session):
from sqlalchemy import text
self_deps = db_session.execute(text(
"SELECT COUNT(*) FROM compliance.control_dependencies "
"WHERE source_control_id = target_control_id AND is_active = true"
)).scalar()
assert self_deps == 0, f"{self_deps} self-referencing dependencies"
def test_no_orphan_dependencies(self, db_session):
from sqlalchemy import text
orphans = db_session.execute(text("""
SELECT COUNT(*) FROM compliance.control_dependencies d
WHERE d.is_active = true
AND NOT EXISTS (
SELECT 1 FROM compliance.canonical_controls c
WHERE c.id = d.source_control_id AND c.release_state = 'draft'
)
""")).scalar()
# Some orphans OK (pointing to deprecated/duplicate controls)
assert orphans < 1000, f"Too many orphan dependencies: {orphans}"
@pytest.mark.skipif(
not os.getenv("DATABASE_URL"),
reason="DATABASE_URL not set"
)
class TestQualityMetrics:
"""Quality metrics must stay within target ranges."""
def test_duplicate_rate(self, db_session):
from sqlalchemy import text
total = db_session.execute(text(
"SELECT COUNT(DISTINCT generation_metadata->>'merge_group_hint') "
"FROM compliance.canonical_controls "
"WHERE release_state = 'draft' AND decomposition_method = 'pass0b' "
"AND generation_metadata->>'merge_group_hint' IS NOT NULL"
)).scalar()
dups = db_session.execute(text("""
SELECT COUNT(*) FROM (
SELECT generation_metadata->>'merge_group_hint', COUNT(*)
FROM compliance.canonical_controls
WHERE release_state = 'draft' AND decomposition_method = 'pass0b'
AND generation_metadata->>'merge_group_hint' IS NOT NULL
GROUP BY generation_metadata->>'merge_group_hint'
HAVING COUNT(*) > 1
) sub
""")).scalar()
rate = dups / max(total, 1) * 100
assert rate < 5, f"Duplicate merge_key rate {rate:.1f}% exceeds 5% threshold"
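The duplicate-rate metric above divides the number of `merge_group_hint` values that occur more than once by the number of distinct non-null hints. The same arithmetic in plain Python, as a sketch; the function name `duplicate_hint_rate` and the sample values are hypothetical:

```python
from collections import Counter


def duplicate_hint_rate(hints):
    """Percentage of distinct hints that are shared by more than one control."""
    counts = Counter(h for h in hints if h is not None)  # NULL hints are ignored
    total = len(counts)                                   # COUNT(DISTINCT hint)
    dups = sum(1 for n in counts.values() if n > 1)       # groups HAVING COUNT(*) > 1
    return dups / max(total, 1) * 100                     # max() guards against total == 0


# Hypothetical sample: four distinct hints, one of them used twice.
sample_rate = duplicate_hint_rate(["a", "a", "b", "c", "d", None])
```

Note that the rate counts duplicated groups, not duplicated rows: a hint shared by ten controls contributes the same as one shared by two.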
@@ -0,0 +1,285 @@
"""Tests for RegulationRegistry — DB-backed lookup with cache and fallback."""
import time
from unittest.mock import patch, MagicMock
import pytest
from services.regulation_registry import (
RegulationRegistry,
_CACHE_TTL_SECONDS,
)
# ── Test data: simulates DB rows ──────────────────────────────────────────
_MOCK_DB_ROWS = [
# (regulation_id, regulation_name_de, license_rule, license_type,
# attribution, source_type, jurisdiction, status)
("eu_2016_679", "DSGVO (EU) 2016/679", 1, "EU_LAW",
None, "law", "EU", "active"),
("nist_sp_800_53", "NIST SP 800-53 Rev. 5", 1, "NIST_PUBLIC_DOMAIN",
None, "standard", "US", "active"),
("owasp_asvs", "OWASP ASVS 4.0", 2, "CC-BY-SA-4.0",
"OWASP Foundation, CC BY-SA 4.0", "standard", "INT", "active"),
("bdsg", "Bundesdatenschutzgesetz (BDSG)", 1, "DE_LAW",
None, "law", "DE", "active"),
("at_dsg", "Österreichisches Datenschutzgesetz (DSG)", 1, "AT_LAW",
None, "law", "AT", "active"),
]
def _mock_db_execute(query):
"""Mock that returns our test rows."""
mock_result = MagicMock()
mock_result.fetchall.return_value = _MOCK_DB_ROWS
return mock_result
@pytest.fixture
def registry():
"""Create a registry with mocked DB."""
reg = RegulationRegistry()
with patch("services.regulation_registry.SessionLocal") as mock_session_cls:
mock_session = MagicMock()
mock_session.execute = _mock_db_execute
mock_session_cls.return_value = mock_session
reg._load()
return reg
# ── classify_regulation tests ─────────────────────────────────────────────
class TestClassifyRegulation:
def test_exact_match_eu_law(self, registry):
result = registry.classify_regulation("eu_2016_679")
assert result["rule"] == 1
assert result["license"] == "EU_LAW"
assert result["source_type"] == "law"
assert result["name"] == "DSGVO (EU) 2016/679"
def test_exact_match_case_insensitive(self, registry):
result = registry.classify_regulation("EU_2016_679")
assert result["rule"] == 1
assert result["name"] == "DSGVO (EU) 2016/679"
def test_exact_match_with_whitespace(self, registry):
result = registry.classify_regulation(" eu_2016_679 ")
assert result["rule"] == 1
def test_nist_standard(self, registry):
result = registry.classify_regulation("nist_sp_800_53")
assert result["rule"] == 1
assert result["source_type"] == "standard"
def test_owasp_rule2(self, registry):
result = registry.classify_regulation("owasp_asvs")
assert result["rule"] == 2
assert result["attribution"] == "OWASP Foundation, CC BY-SA 4.0"
def test_german_law(self, registry):
result = registry.classify_regulation("bdsg")
assert result["rule"] == 1
assert result["source_type"] == "law"
assert result["jurisdiction"] == "DE"
def test_austrian_law(self, registry):
result = registry.classify_regulation("at_dsg")
assert result["rule"] == 1
assert result["jurisdiction"] == "AT"
def test_prefix_enisa_rule2(self, registry):
result = registry.classify_regulation("enisa_supply_chain_2024")
assert result["rule"] == 2
assert result["source_type"] == "standard"
assert "ENISA" in result["attribution"]
def test_prefix_bsi_rule3(self, registry):
result = registry.classify_regulation("bsi_tr_03161")
assert result["rule"] == 3
assert result["source_type"] == "restricted"
assert result["name"] == "INTERNAL_ONLY"
def test_prefix_iso_rule3(self, registry):
result = registry.classify_regulation("iso_27001")
assert result["rule"] == 3
assert result["source_type"] == "restricted"
def test_prefix_etsi_rule3(self, registry):
result = registry.classify_regulation("etsi_en_303_645")
assert result["rule"] == 3
def test_unknown_defaults_to_restricted(self, registry):
result = registry.classify_regulation("some_unknown_regulation")
assert result["rule"] == 3
assert result["source_type"] == "restricted"
assert result["license"] == "UNKNOWN"
# ── source_type_by_name tests ────────────────────────────────────────────
class TestSourceTypeByName:
def test_exact_match_law(self, registry):
result = registry.source_type_by_name("DSGVO (EU) 2016/679")
assert result == "law"
def test_exact_match_standard(self, registry):
result = registry.source_type_by_name("NIST SP 800-53 Rev. 5")
assert result == "standard"
def test_empty_returns_framework(self, registry):
assert registry.source_type_by_name("") == "framework"
assert registry.source_type_by_name(None) == "framework"
def test_heuristic_law(self, registry):
assert registry.source_type_by_name("Verordnung XYZ") == "law"
assert registry.source_type_by_name("Some EU Directive") == "law"
def test_heuristic_guideline(self, registry):
assert registry.source_type_by_name("EDPB Leitlinie 99/2025") == "guideline"
assert registry.source_type_by_name("BSI Standard 200-1") == "guideline"
def test_heuristic_framework(self, registry):
# Uses "ENISA Cloud Report" because "ENISA Cloud Guidelines" would match the
# "guideline" heuristic before the "enisa" one is reached.
assert registry.source_type_by_name("ENISA Cloud Report") == "framework"
assert registry.source_type_by_name("OWASP Testing Guide") == "framework"
def test_unknown_returns_framework(self, registry):
assert registry.source_type_by_name("Completely Unknown Document") == "framework"
# ── is_open_source tests ────────────────────────────────────────────────
class TestIsOpenSource:
def test_rule1_is_open(self, registry):
assert registry.is_open_source("eu_2016_679") is True
def test_rule2_is_open(self, registry):
assert registry.is_open_source("owasp_asvs") is True
def test_rule3_is_not_open(self, registry):
assert registry.is_open_source("bsi_tr_03161") is False
def test_unknown_is_not_open(self, registry):
assert registry.is_open_source("unknown_thing") is False
# ── Cache behavior tests ────────────────────────────────────────────────
class TestCacheBehavior:
def test_fresh_cache_not_stale(self, registry):
assert registry._is_stale() is False
def test_old_cache_is_stale(self, registry):
registry._loaded_at = time.monotonic() - _CACHE_TTL_SECONDS - 1
assert registry._is_stale() is True
def test_ensure_loaded_reloads_when_stale(self):
reg = RegulationRegistry()
reg._loaded_at = time.monotonic() - _CACHE_TTL_SECONDS - 100 # force stale
load_called = False
original_load = reg._load
def tracking_load():
nonlocal load_called
load_called = True
reg._load = tracking_load
reg._ensure_loaded()
assert load_called, "_load should have been called when cache is stale"
def test_ensure_loaded_skips_when_fresh(self, registry):
with patch.object(registry, "_load") as mock_load:
registry._ensure_loaded()
mock_load.assert_not_called()
# ── Graceful degradation tests ──────────────────────────────────────────
class TestGracefulDegradation:
def test_db_failure_uses_stale_cache(self):
"""If DB fails, stale cache entries are still usable."""
reg = RegulationRegistry()
# First load succeeds
with patch("services.regulation_registry.SessionLocal") as mock_cls:
mock_session = MagicMock()
mock_session.execute = _mock_db_execute
mock_cls.return_value = mock_session
reg._load()
# Force stale
reg._loaded_at = time.monotonic() - _CACHE_TTL_SECONDS - 1
# Second load fails — DB error
from sqlalchemy.exc import OperationalError
with patch("services.regulation_registry.SessionLocal") as mock_cls:
mock_cls.side_effect = OperationalError("connection refused", None, None)
reg._ensure_loaded()
# Should still have cached data
result = reg.classify_regulation("eu_2016_679")
assert result["rule"] == 1
def test_empty_registry_returns_unknown(self):
"""Unloaded registry returns safe defaults."""
reg = RegulationRegistry()
reg._loaded_at = time.monotonic() # pretend fresh but empty
result = reg.classify_regulation("eu_2016_679")
assert result["rule"] == 3 # safe default
assert result["license"] == "UNKNOWN"
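The cache and degradation tests above describe a TTL cache that reloads when stale but keeps serving old entries if the reload fails. A minimal sketch of that pattern follows. It is not the `RegulationRegistry` code; the class `StaleTolerantCache`, its `loader` parameter, and the 60-second TTL are hypothetical.

```python
import time


class StaleTolerantCache:
    """TTL cache that falls back to the last good data when a reload fails."""

    def __init__(self, loader, ttl_seconds=300.0):
        self._loader = loader        # callable returning a dict
        self._ttl = ttl_seconds
        self._data = {}
        self._loaded_at = None       # None means "never loaded"

    def _is_stale(self):
        if self._loaded_at is None:
            return True
        return time.monotonic() - self._loaded_at > self._ttl

    def get(self, key, default=None):
        if self._is_stale():
            try:
                self._data = self._loader()
                self._loaded_at = time.monotonic()
            except Exception:
                # Reload failed: keep serving whatever was loaded before.
                pass
        return self._data.get(key, default)


calls = {"n": 0}


def flaky_loader():
    """Hypothetical loader: succeeds once, then simulates a DB outage."""
    calls["n"] += 1
    if calls["n"] > 1:
        raise ConnectionError("connection refused")
    return {"eu_2016_679": 1}


cache = StaleTolerantCache(flaky_loader, ttl_seconds=60.0)
first = cache.get("eu_2016_679")
cache._loaded_at = time.monotonic() - 61.0  # force staleness, as the tests do
second = cache.get("eu_2016_679")
```

Using `time.monotonic()` rather than wall-clock time keeps the staleness check immune to system clock adjustments, which is presumably why the tests manipulate `_loaded_at` with it.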
# ── Migration data consistency tests ────────────────────────────────────
class TestMigrationDataConsistency:
"""Verify that the migration script produces valid data."""
def test_build_rows_produces_data(self):
from scripts.f1_migrate_regulation_registry import build_rows
rows = build_rows()
assert len(rows) > 100 # at least 100 entries
def test_all_rows_have_required_fields(self):
from scripts.f1_migrate_regulation_registry import build_rows
rows = build_rows()
for row in rows:
assert row["regulation_id"], f"Missing regulation_id: {row}"
assert row["regulation_name_de"], f"Missing name: {row}"
assert row["license_rule"] in (1, 2, 3), f"Bad rule: {row}"
assert row["source_type"] in (
"law", "guideline", "standard", "framework", "restricted"
), f"Bad source_type: {row}"
assert row["jurisdiction"], f"Missing jurisdiction: {row}"
assert row["status"] in ("active", "needs_review", "deprecated")
def test_no_duplicate_regulation_ids(self):
from scripts.f1_migrate_regulation_registry import build_rows
rows = build_rows()
ids = [r["regulation_id"] for r in rows]
assert len(ids) == len(set(ids)), f"Duplicates: {[x for x in ids if ids.count(x) > 1]}"
def test_known_regulations_present(self):
from scripts.f1_migrate_regulation_registry import build_rows
rows = build_rows()
ids = {r["regulation_id"] for r in rows}
assert "eu_2016_679" in ids # DSGVO
assert "bdsg" in ids # BDSG
assert "nist_sp_800_53" in ids # NIST
assert "owasp_asvs" in ids # OWASP
def test_owasp_has_attribution(self):
from scripts.f1_migrate_regulation_registry import build_rows
rows = build_rows()
owasp = [r for r in rows if r["regulation_id"] == "owasp_asvs"][0]
assert owasp["attribution"] is not None
assert "OWASP" in owasp["attribution"]
assert owasp["license_rule"] == 2
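Taken together, the `classify_regulation` tests suggest a three-step lookup: exact match on the normalized ID, then a prefix rule, then a restrictive default. The sketch below illustrates that order only. The tables `EXACT` and `PREFIX_RULES`, the helper names, and the attribution strings are hypothetical, and the real registry loads its exact matches from the database.

```python
# Hypothetical lookup tables, shaped after the fields the tests assert on.
EXACT = {
    "eu_2016_679": {"rule": 1, "license": "EU_LAW", "source_type": "law"},
    "owasp_asvs": {"rule": 2, "license": "CC-BY-SA-4.0", "source_type": "standard"},
}
PREFIX_RULES = [
    ("enisa_", {"rule": 2, "source_type": "standard", "attribution": "ENISA (example)"}),
    ("bsi_", {"rule": 3, "source_type": "restricted", "name": "INTERNAL_ONLY"}),
    ("iso_", {"rule": 3, "source_type": "restricted", "name": "INTERNAL_ONLY"}),
]
DEFAULT = {"rule": 3, "license": "UNKNOWN", "source_type": "restricted"}


def classify_regulation_sketch(regulation_id):
    """Exact match first, then prefix rules, then the restrictive default."""
    key = regulation_id.strip().lower()  # tolerate case and whitespace differences
    if key in EXACT:
        return EXACT[key]
    for prefix, info in PREFIX_RULES:
        if key.startswith(prefix):
            return info
    return dict(DEFAULT)  # copy so callers cannot mutate the shared default


def is_open_source_sketch(regulation_id):
    """Rules 1 and 2 count as open; rule 3 and unknown sources do not."""
    return classify_regulation_sketch(regulation_id)["rule"] in (1, 2)
```

Defaulting unknown IDs to rule 3 is the conservative choice: a source that nobody has classified is treated as restricted until someone registers it.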
@@ -162,8 +162,6 @@ services:
     profiles: ["disabled"]
   gitea-runner:
     profiles: ["disabled"]
-  night-scheduler:
-    profiles: ["disabled"]
   admin-core:
     profiles: ["disabled"]
   pitch-deck:
@@ -414,10 +414,10 @@ services:
         condition: service_healthy
     healthcheck:
       test: ["CMD", "curl", "-f", "http://127.0.0.1:8098/health"]
-      interval: 30s
-      timeout: 10s
-      retries: 3
-      start_period: 10s
+      interval: 60s
+      timeout: 30s
+      retries: 10
+      start_period: 30s
     restart: unless-stopped
     networks:
       - breakpilot-network
@@ -434,7 +434,7 @@ services:
       EMBEDDING_BACKEND: ${EMBEDDING_BACKEND:-local}
       LOCAL_EMBEDDING_MODEL: ${LOCAL_EMBEDDING_MODEL:-BAAI/bge-m3}
       LOCAL_RERANKER_MODEL: ${LOCAL_RERANKER_MODEL:-cross-encoder/ms-marco-MiniLM-L-6-v2}
-      PDF_EXTRACTION_BACKEND: ${PDF_EXTRACTION_BACKEND:-pymupdf}
+      PDF_EXTRACTION_BACKEND: ${PDF_EXTRACTION_BACKEND:-auto}
       OPENAI_API_KEY: ${OPENAI_API_KEY:-}
       COHERE_API_KEY: ${COHERE_API_KEY:-}
       LOG_LEVEL: ${LOG_LEVEL:-INFO}
@@ -490,9 +490,8 @@ services:
     volumes:
       - gitea_data:/var/lib/gitea
       - gitea_config:/etc/gitea
-      - /etc/timezone:/etc/timezone:ro
-      - /etc/localtime:/etc/localtime:ro
     environment:
+      TZ: "Europe/Berlin"
       USER_UID: "1000"
       USER_GID: "1000"
       GITEA__database__DB_TYPE: postgres
@@ -583,33 +582,6 @@ services:
     networks:
       - breakpilot-network
-  # =========================================================
-  # NIGHT SCHEDULER
-  # =========================================================
-  night-scheduler:
-    build:
-      context: ./night-scheduler
-      dockerfile: Dockerfile
-    container_name: bp-core-night-scheduler
-    ports:
-      - "8096:8096"
-    volumes:
-      - /var/run/docker.sock:/var/run/docker.sock
-      - ./night-scheduler/config:/config
-    environment:
-      COMPOSE_PROJECT_NAME: breakpilot-core
-      CONTAINER_PATTERN: "bp-*"
-      EXCLUDED_CONTAINERS: "bp-core-night-scheduler,bp-core-nginx,bp-core-postgres,bp-core-valkey"
-    healthcheck:
-      test: ["CMD", "curl", "-f", "http://127.0.0.1:8096/health"]
-      interval: 30s
-      timeout: 10s
-      start_period: 10s
-      retries: 3
-    restart: unless-stopped
-    networks:
-      - breakpilot-network
   # =========================================================
   # ADMIN CORE
   # =========================================================
@@ -910,3 +882,20 @@ services:
     restart: unless-stopped
     networks:
       - breakpilot-network
+  # =========================================================
+  # MARKETING WEBSITE - BreakPilot product website
+  # =========================================================
+  marketing-website:
+    build:
+      context: ./marketing-website
+      dockerfile: Dockerfile
+    container_name: bp-core-marketing-website
+    platform: linux/arm64
+    ports:
+      - "3014:3000"
+    environment:
+      NODE_ENV: production
+    restart: unless-stopped
+    networks:
+      - breakpilot-network

Some files were not shown because too many files have changed in this diff