96 Commits

Author SHA1 Message Date
Benjamin Admin dbd44ecc20 feat(licenses): postgres + qdrant license_rule backfill scripts
Two idempotent scripts that complete Task #22 (300k atomic_controls
reclassification) across both Postgres DBs and all Qdrant collections
on Mac Mini + Production.

backfill_license_rule.py
- iterative parent_control_uuid inheritance with cycle cap
- dry-run + apply modes, per-iteration row counts
- residual-orphan cluster report for manual review
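The iterative inheritance described above can be sketched in memory as follows. This is a minimal illustration, not the code of backfill_license_rule.py: the function name, the dict-based inputs and the default cap of 20 passes are assumptions made for the example.

```python
from typing import Dict, Optional


def inherit_license_rules(
    rules: Dict[str, Optional[int]],
    parents: Dict[str, Optional[str]],
    max_iterations: int = 20,  # hypothetical cycle cap
) -> Dict[str, Optional[int]]:
    """Copy license_rule from parent to child until a pass changes nothing."""
    resolved = dict(rules)
    for iteration in range(1, max_iterations + 1):
        updated = 0
        # Work on a snapshot so values set in this pass only become visible
        # in the next pass, like one UPDATE statement per iteration.
        snapshot = dict(resolved)
        for control_id, rule in snapshot.items():
            if rule is not None:
                continue
            parent_id = parents.get(control_id)
            parent_rule = snapshot.get(parent_id) if parent_id else None
            if parent_rule is not None:
                resolved[control_id] = parent_rule
                updated += 1
        print(f"iteration {iteration}: {updated} rows updated")
        if updated == 0:
            break
    return resolved
```

Controls that still have no rule after the loop (for example two controls that only point at each other) are the residual orphans that would go into the manual-review report.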

backfill_qdrant_license_payload.py
- joins canonical_controls.id (or regulation_id) → license_rule
- scrolls + grouped set_payload per rule (3 batches per collection)
- supports both lookup tables (canonical_controls / regulation_registry)
- supports managed Qdrant via --qdrant-api-key (Production)
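A rough sketch of the grouping step, assuming each scrolled point carries one lookup key (a canonical_controls.id or a regulation_id) and that the key-to-rule mapping has already been loaded from Postgres. The function and parameter names are hypothetical and not taken from backfill_qdrant_license_payload.py.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def group_points_by_rule(
    points: Iterable[Tuple[int, str]],
    rule_by_key: Dict[str, int],
) -> Dict[int, List[int]]:
    """Bucket point ids by license_rule so each rule needs one payload update."""
    groups: Dict[int, List[int]] = defaultdict(list)
    for point_id, lookup_key in points:
        rule = rule_by_key.get(lookup_key)
        if rule is not None:  # points without a known rule are left untouched
            groups[rule].append(point_id)
    return dict(groups)
```

With three rules this yields at most three id lists per collection. Each list could then be passed to a single payload update, for example qdrant-client's `set_payload(collection_name=..., payload={"license_rule": rule}, points=ids)`.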

Backfill results:
- Mac Mini canonical_controls: 0 NULL (was 279,384) across 314,811 rows
- Mac Mini Qdrant atomic_controls_dedup: 44,987 points patched
- Mac Mini bp_compliance_gesetze: 37,634 points patched
- Mac Mini bp_compliance_datenschutz: 11,338 points patched
- Production canonical_controls: 0 NULL (was 259,914) across 294,027 rows
- Production Qdrant bp_compliance_gesetze: 55,836 patched
- Production Qdrant bp_compliance_datenschutz: 18,980 patched
- Production Qdrant bp_compliance_ce: 23,239 patched

Schema migration 002_regulation_registry.sql + 252 registry rows were
replicated to Production (was missing — only existed on Mac Mini).
20 BSI/DE-Gesetz entries added to registry to close Qdrant lookup gap.

100% deterministic classification achieved on both DBs via:
- parent_control_uuid inheritance (94% coverage)
- control_parent_links.source_regulation → regulation_registry
- source_citation->>'source' → regulation_registry
- canonical_processed_chunks ground truth (chunk-validated)
- ungrouped LLM-aggregate ancestors → own works (Rule 3)

[migration-approved]
2026-05-21 18:46:57 +02:00
Benjamin Admin 93687a32fe docs(licenses): freeze 3-rule license mapping + audit script
Defines the authoritative mapping from license_type to license_rule
in docs/LICENSE_RULES.md, and adds scripts/audit_license_classification.py
to surface classification gaps in registry/canonical_controls/Qdrant.

Key finding from first audit run against bp-core-postgres + Qdrant:

- regulation_registry: 232 rows, 224 rule=1, 8 rule=2, 0 rule=3;
  36 rows without license_type (need backfill)
- canonical_controls: 314,811 rows, 279,384 (89%) have NULL
  license_rule (target of Task #22 reclassification)
- Qdrant atomic_controls_dedup: 100% of sampled points lack both
  license and license_rule payload fields
- Qdrant bp_compliance_gesetze: 80.6% lack both fields
- Qdrant bp_compliance_ce + bp_compliance: nearly clean

Rule definitions clarified (was loosely remembered as
"law / cite / rewrite"):
- Rule 1 = verbatim, sovereign law (EU/DE/AT/CH/US, TRBS/TRGS/ASR,
  OSHA, NIST, EU guidelines, DGUV UVV)
- Rule 2 = verbatim with attribution (CC-BY, Apache, OWASP,
  OECD AI Principles, ENISA)
- Rule 3 = identifier citation only, no full text (DIN/EN/ISO,
  ANSI/UL/IEC, DGUV Regeln/Informationen/Grundsaetze, BSI,
  proprietary standards). Pipeline drops chunk_text when rule=3
  in pipeline_adapter.py:147.

The 4th category I had proposed ("R1-A") turned out to be already
implemented as rule=2; the mapping doc reflects the actual code
behaviour rather than the original 3-name verbal model.

No schema change. No data migration in this commit — reclassification
of the 279k controls is staged as Task #22 and will be cluster-based
by source/regulation_id.
2026-05-21 11:29:38 +02:00
Benjamin Admin 7d721a6787 feat(control-pipeline): BSI QUAIDAL Clean-Room ingestion (AI Act Art. 10)
CI / go-lint (push) Has been skipped
CI / python-lint (push) Has been skipped
CI / nodejs-lint (push) Has been skipped
CI / test-go-consent (push) Successful in 40s
CI / test-python-voice (push) Successful in 36s
CI / test-bqas (push) Successful in 33s
Clean-Room derivation of 195 controls from BSI QUAIDAL (10 criteria + 15
building blocks + 30 measures + 140 metrics) for EU AI Act Art. 10
training-data quality compliance.

- ingest_bsi_quaidal.py parses YAML frontmatter into a structural index
  (no protected prose stored on disk).
- derive_quaidal_mcs.py rewrites each entry via local LLM (qwen3.5:35b-a3b)
  with a hard 4-gram plagiarism gate < 20%; achieved mean overlap 0.5%.
- Migration 011 adds compliance.derived_controls table with full source
  provenance (framework, section, url, commit SHA, license note).
- apply_quaidal_to_db.py UPSERTs YAML into DB.
- Source repo (legal-sources/bsi-quaidal/) gitignored.
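One way such a 4-gram gate could work is shown below. It is a sketch under assumptions: word-level 4-grams, lowercase tokenisation, and overlap measured as the share of the rewritten text's 4-grams that also occur in the source. How derive_quaidal_mcs.py actually tokenises and scores is not stated in the commit.

```python
import re
from typing import Set, Tuple


def word_ngrams(text: str, n: int = 4) -> Set[Tuple[str, ...]]:
    """Return the set of word n-grams of a text (lowercased)."""
    words = re.findall(r"\w+", text.lower())
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def ngram_overlap(candidate: str, source: str, n: int = 4) -> float:
    """Fraction of the candidate's n-grams that also appear in the source."""
    cand = word_ngrams(candidate, n)
    if not cand:
        return 0.0
    return len(cand & word_ngrams(source, n)) / len(cand)


def passes_gate(candidate: str, source: str, threshold: float = 0.20) -> bool:
    """Hard gate: reject a rewrite whose overlap reaches the threshold."""
    return ngram_overlap(candidate, source) < threshold
```

A rewrite that fails the gate would be regenerated or dropped rather than stored.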

Same pattern as IACE module DIN-reference handling: name the norm and
section, never quote.

Backed by BSI license clarification 2026-05: § 5 UrhG applies,
share:true in the frontmatter; Clean-Room derivation is the safe path.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-19 13:02:49 +02:00
Benjamin Admin 911697bab4 feat(marketing): Saving-Section + Landingpages + Pipeline Lessons-Learned [split-required]
CI / go-lint (push) Has been skipped
CI / python-lint (push) Has been skipped
CI / nodejs-lint (push) Has been skipped
CI / test-go-consent (push) Successful in 35s
CI / test-python-voice (push) Successful in 33s
CI / test-bqas (push) Successful in 35s
Marketing website
- New SavingsSection on the homepage: "Compliance uncovers six-figure
  savings". Positions the cookie-audit cost-optimization story as a pitch
  for sales to DAX corporations (BMW-case style: 90 vendors -> 25 after
  consolidation, EUR 500k-3M / year).
- /savings-scan: free 5-minute saving-scan form (URL + e-mail).
  Form submit is a placeholder and is meant to be wired to the compliance backend.
- /savings-methodik: 4-step explanation of the cookie-tier inference plus
  honest caveats (list prices != contract prices, media spend not
  included) and data sources.
- Content-de + Content-en in content.ts both extended with a savings block,
  and section numbering adjusted (03=Savings, 04=Deterministic).
- LOC split: savings content (DE+EN, ~100 LOC) moved to content.savings.ts
  so that content.ts stays under the 500-LOC hard cap.

Control pipeline
- LESSONS-LEARNED-mc-check-types.md for the parallel CRA-MC generation.
  Explains the TEXT/PROCESS/REVIEW classification that was retrofitted in
  the compliance repo. Prevents CRA-MCs from getting the same defect.
  Mapping heuristic for verification_method -> check_type, plus a
  backfill workflow for ~62 ambiguous entries.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-18 18:38:30 +02:00
Benjamin Admin 9783657da3 feat(control-pipeline): incremental dedup + ENISA CRA ingestion
CI / go-lint (push) Has been skipped
CI / python-lint (push) Has been skipped
CI / nodejs-lint (push) Has been skipped
CI / test-go-consent (push) Successful in 43s
CI / test-python-voice (push) Successful in 33s
CI / test-bqas (push) Successful in 37s
BatchDedup since parameter (services/batch_dedup_runner.py + api):
- New 'since: datetime' param scopes the Phase 1 + Phase 2 SQL to created_at >= since.
- The Phase 2 checkpoint is deleted on a scoped run (prevents skipping new atomics
  whose control_id sorts alphabetically below the stale last_id).
- 6-13x faster for documents added later (19k instead of 172k atomics).
- Docs: control-pipeline/docs/incremental-dedup.md.

New scripts:
- gpre1_object_groups_incremental.py: appends new objects to object_groups via
  bge-m3 nearest-neighbor (threshold default 0.85; 0.78 recommended for broader
  synonym matching). Pure INSERT/UPDATE, no DELETE.
- gpre2_master_controls_incremental.py: non-destructive master-controls update.
  Existing MCs stay untouched (UUIDs + master_control_id are preserved); only new members
  are appended, plus new MCs for object groups that now reach min-phases.
- ingest_enisa_cra.py: ingestion of the 8 CRA-relevant ENISA documents
  (Standards Mapping, EUCC-Implementation, NIS2 TIG, SRP FAQ, EUCC Eval Methodology,
  CVD Policies, Threat Landscape 2025). chunk_strategy=legal,
  requirement_strength=guidance|consultation_draft|evidentiary.

Source data: legal-sources/enisa/enisa_cra_single_reporting_platform_faq.html
(PDFs are filtered out by .gitignore).

Result of this pipeline iteration:
- 1,296 new CRA controls + 19,652 atomic children
- +362 new master controls, 10,017 existing ones extended
- Total: 13,950 MCs, 620 CRA-MCs (previously 566), 1,304 CRA atomics (previously 841)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-18 18:21:46 +02:00
Benjamin Admin 519cc274bb docs: session handover — MC Quality + Gap Engine + RAG Ingestion (5 days)
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-11 21:47:22 +02:00
Benjamin Admin 937eca6b77 test(pipeline): Phase 6 — Golden Dataset + MC Quality Tests
CI / go-lint (push) Has been skipped
CI / python-lint (push) Has been skipped
CI / nodejs-lint (push) Has been skipped
CI / test-go-consent (push) Successful in 35s
CI / test-python-voice (push) Successful in 35s
CI / test-bqas (push) Successful in 34s
- 20 manually verified golden controls with expected MC topics
- Structural quality tests: min 10K MCs, max 300/MC, no orphans
- Doc-check controls tests: 8 doc types covered, no empty questions
- Quality thresholds: 90% accuracy, enforced by regression tests

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-10 21:03:49 +02:00
Benjamin Admin 0c1561d6cc feat(pipeline): derive 1,874 doc_check_controls from Master Controls
CI / go-lint (push) Has been skipped
CI / python-lint (push) Has been skipped
CI / nodejs-lint (push) Has been skipped
CI / test-go-consent (push) Successful in 45s
CI / test-python-voice (push) Successful in 44s
CI / test-bqas (push) Successful in 40s
8 document types: DSE (571), Cookie (381), Löschkonzept (309),
Widerrufsbelehrung (153), DSFA (147), AVV (125), AGB (113), Impressum (75).

Each control has binary check_question + pass_criteria + fail_criteria.
Derived via Claude Haiku from existing MCs filtered by regulation source.

Table: compliance.doc_check_controls (local + production synced)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-10 20:56:23 +02:00
Benjamin Admin 8510af46eb feat(pipeline): MC Quality Overhaul — 74.5% → 92.8% accuracy, 5.3K → 13.6K MCs
Phase 0: Quality Audit script (Claude Sonnet, 1750 samples)
Phase 1: Object ontology expanded 31 → 74 tokens with descriptions + boundaries
Phase 2: 174K controls re-classified via Haiku (10 batches, $50)
  - Generic tokens removed (documentation, procedure, process)
  - L2 sub-topics added (108K + 64K controls)
  - Bad subtopics fixed (stakeholder_*, escalation fragments)
Phase 3: Re-clustering K=18704 (37K objects → 16.7K groups)
Phase 4: Direct MC generation from canonical tokens (gpre2_direct_mc.py)
Phase 5: Regulation-source split (gpre3, dry-run tested)

New features:
- Tenant-isolated document upload API (rag-service)
- BAuA crawler (Playwright, 131 PDFs downloaded)
- OSHA Technical Manual crawler (23 chapters)
- CE obligation extractor (6141 obligations from Qdrant)

RAG ingestion:
- 126 BAuA PDFs (TRBS/TRGS/ASR): 27,664 chunks
- OSHA Technical Manual: 7,241 chunks
- OSHA 1910 Subpart O (full): 745 chunks
- EuGH C-588/21 P: 216 chunks
- EU 2018/1725: 842 chunks

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-10 15:08:15 +02:00
Benjamin Admin f022b489e2 docs: comprehensive session handover — Blocks F+G complete, next: MC quality refinement
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-06 21:06:01 +02:00
Benjamin Admin 0092c4fe47 feat(pipeline): G-pre1 refinement script for large object groups
Splits master controls >200 members by re-clustering their object groups
with k=4-20 per group. First round: 38 groups → 325 sub-groups → 253 new MCs.
25 generic MCs remain (monitoring, procedure, etc.) — need regulation-source split.

Session summary: Block F complete, Control Generation (1,599+), Pass 0a/0b,
Production Sync, G-pre1/2/3 Object Clustering + Master Controls + API,
G1-G4 Compliance Execution Layer (Decision Trace, Commit Ledger, Decision Memory,
Pre-Deployment Enforcement).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-06 20:41:49 +02:00
Benjamin Admin d5bcd0bd5b feat(pipeline): G4 Pre-Deployment Enforcement — CI/CD compliance gate
New table: deployment_checks (verdict, blocking/warning controls, risk score)
New API:
  POST /v1/deployment-checks (SDK asks: "can I deploy?")
  GET /v1/deployment-checks/{id} (check result)
  POST /v1/deployment-checks/{id}/override (manual override with justification)
  GET /v1/deployment-checks/stats (approval/block rate)

Check logic: queries G1 decision_traces + G3 open failures per affected control.
Verdict: approved (0 blocking) or blocked (with fix recommendations).
454 tests pass, 0 regressions.
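The verdict rule can be illustrated with a small function. The `ControlState` type, the `"non_compliant"` status value and the result keys are assumptions for the example; the commit only states that G1 decision traces and G3 open failures are consulted per affected control and that zero blocking controls means approval.

```python
from dataclasses import dataclass
from typing import Dict, List, Union


@dataclass
class ControlState:
    control_id: str
    decision_status: str  # hypothetical status taken from a G1 decision trace
    open_failures: int    # hypothetical count of open G3 failure events


def deployment_verdict(states: List[ControlState]) -> Dict[str, Union[str, List[str]]]:
    """Approve only when no affected control is blocking."""
    blocking = [
        s.control_id
        for s in states
        if s.decision_status == "non_compliant" or s.open_failures > 0
    ]
    return {
        "verdict": "approved" if not blocking else "blocked",
        "blocking_controls": blocking,
    }
```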

Block G complete: G1-G4 all implemented.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-06 20:24:45 +02:00
Benjamin Admin c398e74d5e feat(pipeline): G3 Full Decision Memory — compliance lifecycle event stream
New table: decision_events (assessment→decision→fix→verification→failure cycle)
New API:
  POST /v1/decision-events (record lifecycle event)
  GET /v1/decision-events (list with filters)
  GET /v1/decision-events/timeline/{control_id} (full chronological timeline)
  GET /v1/decision-events/stats (failure rate, cycle times)

Each event captures input_state, output_state, actor, evidence.
454 tests pass, 0 regressions.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-06 20:16:25 +02:00
Benjamin Admin e82f99b8cb feat(pipeline): G2 Compliance Commit Ledger — code↔control audit trail
New table: compliance_commits (commit hash, affected controls, risk level)
New API:
  POST /v1/compliance-commits (SDK registers commit + impact)
  GET /v1/compliance-commits (list with filters)
  GET /v1/compliance-commits/by-control/{id} (all commits for a control)
  GET /v1/compliance-commits/stats (dashboard)
  GET /v1/compliance-commits/{id} (detail)

GIN index on affected_control_ids for fast @> containment queries.
454 tests pass, 0 regressions.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-06 19:17:45 +02:00
Benjamin Admin 66a70ab31c feat(pipeline): G1 Decision Trace — compliance decision tracking
New table: decision_traces (status, reason, evidence, fix plan per control)
New API:
  POST/GET/PUT /v1/decision-traces (CRUD for decisions)
  GET /v1/decision-traces/stats (compliance dashboard)
  GET /v1/controls/{id}/full-trace (Regulation→Obligation→Control→Decision→Evidence)

454 tests pass, 0 regressions.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-06 18:26:21 +02:00
Benjamin Admin ad24835940 feat(pipeline): G-pre1/2/3 — Object Clustering + Master Controls + API
G-pre1: 144k objects clustered into 7,466 groups via Mini-Batch K-Means
  on bge-m3 embeddings. Two-stage: k=5000 base + sub-cluster groups >50.
G-pre2: 5,114 Master Controls from lifecycle phase chains
  (define→implement→test→monitor), linking 172,504 atomic controls.
G-pre3: REST API for Master Controls
  GET /v1/master-controls (list, search, filter)
  GET /v1/master-controls/stats
  GET /v1/master-controls/{mc_id} (detail with phase-controls)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-06 15:11:38 +02:00
Benjamin Admin 0bad74a3bd docs: session handover — Block F complete, pipeline done, G-pre1 analysis
Session 03-05.05.2026:
- Block F1-F5 complete (DB migration of hardcoded dicts)
- Control Generation: 1,599 controls + 11,522 obligations + 1,147 atomics
- Production sync: 2,625 controls + 11,522 obligations synced
- G-pre1 analysis: 183k objects → 144k after normalize (needs hierarchical clustering)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-05 18:02:10 +02:00
Benjamin Admin 22257a7ed8 feat(pipeline): F5 validation tests — verify DB matches hardcoded dicts
8 tests confirm all REGULATION_LICENSE_MAP, ACTION_TYPES, _NEGATIVE_PATTERNS,
_ACTION_SYNONYMS, and _OBJECT_SYNONYMS entries are correctly migrated to DB.
Dicts kept as fallback for DB-unavailability resilience.

Block F complete: F1-F5 all done.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-05 16:06:59 +02:00
Benjamin Admin a20de0b52b feat(pipeline): F4 LLM synonym enrichment script
Uses Ollama (qwen3.5:35b-a3b, think:false) to generate additional
German synonyms for action types and object tokens. Results stored
with source='llm' in action_synonyms/object_synonyms tables.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-05 15:45:43 +02:00
Benjamin Admin 64f45be63a feat(pipeline): add Pass 0a endpoint to core control-pipeline
Registers /generate/run-pass0a and /generate/pass0a-status/{job_id}
on the core control-pipeline (port 8098). Previously Pass 0a was only
available on the compliance backend which connects to Production DB,
causing a split-brain when controls are generated locally.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-05 07:21:58 +02:00
Benjamin Admin e869cabc81 docs: session handover — F1-F3 done, control generation running
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-04 07:21:24 +02:00
Benjamin Admin 652e3a65a3 feat(pipeline): F2+F3 action/object ontology — DB-backed normalization
CI / go-lint (push) Has been skipped
CI / python-lint (push) Has been skipped
CI / nodejs-lint (push) Has been skipped
CI / test-go-consent (push) Successful in 36s
CI / test-python-voice (push) Successful in 33s
CI / test-bqas (push) Successful in 31s
Migrates ACTION_TYPES (26+8 types), _NEGATIVE_PATTERNS (22), _ACTION_SYNONYMS
(65), and _OBJECT_SYNONYMS (75) from hardcoded dicts to DB tables.

- SQL migration: 003_action_object_ontology.sql (3 tables)
- Migration scripts: f2_migrate_actions.py (34 types, 145 synonyms), f3_migrate_objects.py (75 objects)
- OntologyRegistry cache: 5min TTL, raises RuntimeError if empty (safe fallback to dicts)
- control_ontology.classify_action/get_phase delegate to DB with dict fallback
- control_dedup.normalize_action/normalize_object delegate to DB with dict fallback
- 25 new tests, 446 total pass, 0 regressions
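The cache-plus-fallback pattern named above might look like this in outline. The class name `TTLRegistry`, the loader callback and the sample fallback dict are hypothetical; only the 5-minute TTL, the RuntimeError on an empty table and the fallback to the hardcoded dicts come from the commit message.

```python
import time
from typing import Callable, Dict, Optional

FALLBACK_SYNONYMS: Dict[str, str] = {"pruefen": "verify"}  # stand-in for the hardcoded dict


class TTLRegistry:
    """Reload a lookup table from a loader at most once per TTL window."""

    def __init__(
        self,
        loader: Callable[[], Dict[str, str]],
        ttl_seconds: float = 300.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._loader = loader
        self._ttl = ttl_seconds
        self._clock = clock
        self._data: Dict[str, str] = {}
        self._loaded_at: Optional[float] = None

    def get_all(self) -> Dict[str, str]:
        now = self._clock()
        if self._loaded_at is None or now - self._loaded_at >= self._ttl:
            data = self._loader()
            if not data:
                # An empty table is treated as an error so callers fall back.
                raise RuntimeError("registry is empty")
            self._data = data
            self._loaded_at = now
        return self._data


def normalize_action(word: str, registry: TTLRegistry) -> str:
    """Look up a synonym in the DB-backed table, else in the fallback dict."""
    try:
        table = registry.get_all()
    except Exception:
        table = FALLBACK_SYNONYMS
    return table.get(word, word)
```

Injecting the clock keeps the expiry logic testable without sleeping.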

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-03 23:47:53 +02:00
Benjamin Admin 9437e029d0 feat(pipeline): F1 regulation registry — DB-backed license/source-type lookup
Migrates REGULATION_LICENSE_MAP (135 entries) and SOURCE_REGULATION_CLASSIFICATION
(58 entries) from hardcoded Python dicts to compliance.regulation_registry table.

- SQL migration: 002_regulation_registry.sql (table + indexes + trigger)
- Migration script: f1_migrate_regulation_registry.py (162 rows, --dry-run)
- RegulationRegistry cache: 5min TTL, prefix fallback, graceful degradation
- control_generator._classify_regulation() delegates to DB with dict fallback
- source_type_classification.classify_source_regulation() delegates to DB
- 34 new tests (lookup, cache, degradation, migration data consistency)
- 421 total tests pass, 0 regressions

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-03 23:14:06 +02:00
Benjamin Admin 4fd2bfefcd docs: session handover updated for Block F start
Next: F1 Regulation Registry (DB + API + Frontend + Auto-Create)
Frontend at /sdk/regulation-registry in breakpilot-compliance admin

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-03 22:51:23 +02:00
Benjamin Admin fac9280716 feat(pipeline): Block D5+-E complete session — 20k+ new chunks
Session 02-03.05.2026 accomplishments:
- D5+: NIST/ENISA PDF quality fix (0%→45% section rate)
- D5+: 4 lost NIST PDFs restored (11k chunks)
- D5+: Text normalization + section detection for NIST/BSI
- D6: Citation backfill (3,651 controls updated, old archived)
- E2: 8 DE laws ingested (ArbZG, MuSchG, GmbHG, AktG, InsO...)
- E3: 5 EU regulations (CSRD, CSDDD, Taxonomy, eIDAS, Pay Trans.)
- E4: Standards (GoBD, BAIT, VAIT)
- E6: 3 CH + 4 AT laws (OR, DSV, ArG, ArbVG, AngG, AZG, NISG)
- E7: 9 court judgments as full text (Schrems II 154 chunks,
  Meta 101, BVerfG 161, DSK OH 119, Planet49 42, SCHUFA 41,
  Schadenersatz 29, BAG 48, Google Fonts 14)
- Infra: Qdrant snapshot mechanism, upload-before-delete safety

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-03 22:31:57 +02:00
Benjamin Admin 118be3540d feat(pipeline): D6 citation backfill + E2/E3 law ingestion scripts
- d6_citation_backfill.py: 3-tier matching (hash/prefix/overlap),
  archives old citations, updated 3,651 controls (93.6% coverage)
- ingest_de_laws.py: 8 German laws ingested (ArbZG, MuSchG, NachwG,
  MiLoG, GmbHG, AktG, InsO, BUrlG — 1,629 chunks)
- ingest_eu_regulations.py: EUR-Lex ingestion (needs manual HTML due
  to AWS WAF). CSRD, CSDDD, EU Taxonomy, eIDAS 2.0, Pay Transparency
  manually ingested (1,057 chunks)
- Updated session handover with current state

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-03 13:19:27 +02:00
Benjamin Admin 2f4a3f2ea2 fix(embedding): add NIST control IDs to _SECTION_NUMBER_RE
_SECTION_NUMBER_RE only had patterns for §/Art/Section/Kapitel/Annex
but missed NIST-style identifiers (AC-1, GV.OC-01, 3.1, A01:2021).
This caused 0% section rate for all NIST/BSI/ENISA documents even
though sections were correctly detected — the section NUMBER wasn't
extracted from the header.
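For illustration, a pattern covering the four identifier styles mentioned (AC-1, GV.OC-01, 3.1, A01:2021) could look like the sketch below. It is not the project's actual _SECTION_NUMBER_RE; the alternatives and the helper name are assumptions.

```python
import re
from typing import Optional

# Hypothetical pattern, written only to cover the examples in the commit message.
NIST_STYLE_ID_RE = re.compile(
    r"""^\s*(
        [A-Z]{2}\.[A-Z]{2}-\d{2}          # NIST CSF 2.0 style, e.g. GV.OC-01
      | [A-Z]{2,3}-\d{1,2}(?:\(\d+\))?    # control family style, e.g. AC-1 or AC-2(1)
      | A\d{2}:\d{4}                      # OWASP Top 10 style, e.g. A01:2021
      | \d+(?:\.\d+)+                     # numbered section, e.g. 3.1
    )(?=[\s:]|$)""",
    re.VERBOSE,
)


def extract_section_number(header: str) -> Optional[str]:
    """Return the leading identifier of a section header, or None."""
    match = NIST_STYLE_ID_RE.match(header)
    return match.group(1) if match else None
```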

Also adds:
- reupload_legal_strategy.py: re-upload with legal chunking
- extract_and_upload_nist.py: local PDF extraction workaround
- qdrant-snapshot.sh: backup mechanism for Qdrant collections

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-03 07:42:06 +02:00
Benjamin Admin 0b0eed27b0 feat(embedding): NIST PDF text normalization + safe re-ingest script
Fix broken multi-column PDF extraction for NIST/BSI/ENISA documents:
- _normalize_pdf_text(): fixes broken section numbers (1 . 1 → 1.1),
  control IDs (AC - 1 → AC-1), ligatures, soft hyphens
- pdfplumber tolerances increased (x=3,y=4) for better column handling
- 3 new regex patterns: NIST CSF 2.0, NIST enhancements, OWASP Top 10
- reingest_nist.py: safe upload-before-delete for 4 lost NIST PDFs
- reingest_d5.py: safety fix — upload first, verify, then delete old
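The normalization fixes listed above can be approximated with a few substitutions. This is a sketch, not the real _normalize_pdf_text(): the ligature table and both regexes are assumptions chosen to reproduce the two examples (1 . 1 → 1.1 and AC - 1 → AC-1).

```python
import re

# Common Latin ligatures that PDF extraction tends to emit as single code points.
_LIGATURES = {
    "\ufb00": "ff",
    "\ufb01": "fi",
    "\ufb02": "fl",
    "\ufb03": "ffi",
    "\ufb04": "ffl",
}


def normalize_pdf_text(text: str) -> str:
    """Repair spacing and glyph damage from multi-column PDF extraction."""
    for ligature, plain in _LIGATURES.items():
        text = text.replace(ligature, plain)
    text = text.replace("\u00ad", "")  # drop soft hyphens
    # "1 . 1" -> "1.1" (also handles "1 . 2 . 3")
    text = re.sub(r"(?<=\d) \. (?=\d)", ".", text)
    # "AC - 1" -> "AC-1"
    text = re.sub(r"\b([A-Z]{2,3}) - (\d{1,2})\b", r"\1-\2", text)
    return text
```

The digit-dot-digit rule is deliberately narrow. A broader rule would risk merging ordinary sentence punctuation.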

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-03 06:42:46 +02:00
Benjamin Admin 97a7f6f264 docs: comprehensive session handover with full roadmap (Blocks A-G)
Complete instructions for next session including:
- Current quality metrics per document type
- Prioritized action items (NIST fix, citation backfill, missing laws)
- Full Block E-G roadmap with details
- All critical files, DB state, test commands
- Known issues (3 lost NIST PDFs, frontend 500s, D5 script safety)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-02 22:30:50 +02:00
Benjamin Admin ff21bc258a docs: session handover — D2-D5 complete, quality report, NIST plan
Major session achievements:
- Structural metadata end-to-end (D2-D4)
- 430 docs re-ingested with new chunking
- HTML stripping + charset detection (0% → 97.6%)
- 20 EU regulations from EUR-Lex HTML (DSGVO: 0% → 92%)
- Quality report script (500 controls: 13% fully correct)
- Frontend requirements.map fix

Open: NIST/ENISA text normalization, citation backfill,
D5 script safety (upload-before-delete), BEG IV ingestion.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-02 22:22:55 +02:00
Benjamin Admin 3009f3d13a feat(embedding): add NIST/ENISA/standard section numbering to chunker
Extends _LEGAL_SECTION_RE to detect:
- Numbered sections: 1.1 Title, 2.3.1 Subtitle
- Control family IDs: AC-1, AU-2, PO.1, PW.1.1
- Table/Figure/Appendix references
Also adds EUR-Lex HTML replacement script.

58 embedding-service tests passing.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-02 19:24:10 +02:00
Benjamin Admin 5a6e588641 docs: update session handover — D2-D5 complete, EU PDF issue documented
Session achieved: structural metadata end-to-end (D2-D4), overlap bug
fix, HTML stripping with charset detection, 430/436 docs re-ingested.

Remaining: ~40 EU Official Journal PDFs need HTML from EUR-Lex (broken
multi-column PDF extraction), 3 missing EDPB PDFs, 1 corrupt PDF.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-02 17:34:34 +02:00
Benjamin Admin 75dda9ac92 feat(embedding): add pdfplumber backend for multi-column PDF extraction
EU Official Journal PDFs (AI Act, CRA, NIS2, DSGVO, etc.) use
multi-column layouts that pypdf breaks into fragmented words
("Ar tik el" instead of "Artikel"). pdfplumber handles these correctly.

Backend priority: unstructured > pdfplumber > pypdf (auto mode).
Also increases D5 re-ingestion timeout to 3600s for large PDFs.

58 embedding-service tests passing. pdfplumber: MIT license.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-02 15:42:25 +02:00
Benjamin Admin ddad58f607 fix(rag): strip HTML tags before chunking + D5 re-ingestion scripts
HTML files from gesetze-im-internet.de were decoded as raw UTF-8, keeping
<div>/<p> tags intact. The legal chunker regex requires § at line start,
which never matched inside HTML tags → 0% section metadata for HTML docs.

Fix: detect HTML content and strip tags before sending to embedding
service. Block elements become newlines, entities are decoded.
§ signs now appear at line starts → section detection works.
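A minimal version of this fix, using only the standard library, is sketched below. The tag list and function names are assumptions; the rag-service implementation may differ.

```python
import html
import re

# Block-level tags become line breaks; every other tag is simply dropped.
_BLOCK_TAG_RE = re.compile(
    r"</?(?:p|div|br|li|tr|h[1-6]|table|ul|ol)\b[^>]*>", re.IGNORECASE
)
_ANY_TAG_RE = re.compile(r"<[^>]+>")


def looks_like_html(text: str) -> bool:
    """Cheap content sniffing for HTML documents."""
    return bool(re.search(r"<(?:html|body|div|p)\b", text, re.IGNORECASE))


def strip_html(text: str) -> str:
    """Turn HTML into plain text with one logical block per line."""
    text = _BLOCK_TAG_RE.sub("\n", text)
    text = _ANY_TAG_RE.sub("", text)
    text = html.unescape(text)  # decodes entities such as &sect; and &amp;
    lines = [line.strip() for line in text.splitlines()]
    return "\n".join(line for line in lines if line)
```

After stripping, a paragraph that began with `<p>&sect; 312` starts its line with the § sign, which is what a line-anchored section regex needs.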

Also adds D5 re-ingestion scripts (reingest_d5.py + config) for
batch re-processing of all documents in Qdrant collections.

27 rag-service tests passing.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-02 08:18:25 +02:00
Benjamin Admin 93099b2770 feat(pipeline): structural metadata end-to-end (Blocks D2-D4)
D2: RAG service stores section/section_title/paragraph/paragraph_num/page
from embedding service chunks_with_metadata into Qdrant payloads.

D3: Control generator prefers section > article > section_title from
Qdrant, adds page to source_citation and generation_metadata.

D4: Validated with real BGB §§ 312-312k text. Found and fixed critical
bug where Phase 3 overlap destroyed the [§ ...] section prefix, causing
only the first chunk per document to have metadata. All subsequent
chunks lost section info.

Also fixes pre-existing lint issues (unused imports, ambiguous variable
names, duplicate dict key, bare except).

456 tests passing (58 embedding + 387 pipeline + 11 rag-service).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-01 20:34:00 +02:00
Benjamin Admin da21339e76 docs: add session handover instructions for next session
Covers: completed blocks A-D1, remaining D2-G, critical files,
DB state, memory files, test commands.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-01 15:33:05 +02:00
Benjamin Admin d9c16fb914 feat(pipeline): add adversarial tests (30 cases) + regression harness
Block C implementation:
- adversarial_cases.yaml: 30 tricky cases in 5 categories
  (wrong legal basis, dark patterns, incomplete docs, similar-but-different, homonyms)
- test_adversarial.py: 63 tests validating adversarial cases
- test_regression.py: ontology stability, dependency engine, quality metrics
- conftest.py: shared fixtures (DB session, sample controls)

Total: 371 tests passing (221 existing + 150 new).
Real-world benchmarks (C1) need manual ground truth creation.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-01 13:02:29 +02:00
Benjamin Admin 6f58fdbaa5 docs: add test strategy instruction for dedicated session (Block C)
3 test levels: Real-World Benchmarks (10 DE websites), Adversarial Tests
(30 tricky cases), Regression Harness (CI/CD quality gate).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-01 12:28:58 +02:00
Benjamin Admin b8ff4e9290 feat(pipeline): add review-verify endpoint — LLM decides DUPLIKAT/VERSCHIEDEN
Sends 67k review candidates to Haiku Batch API in pairs.
Each pair gets a DUPLIKAT/VERSCHIEDEN decision with reasoning.
Results stored in control_dedup_reviews.review_status.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-01 09:36:30 +02:00
Benjamin Admin 7c5592b50e feat(pipeline): add checkpoint to dedup Phase 2 — survives container restart
Stores last_control_id in canonical_generation_jobs after each page.
On restart, resumes from checkpoint instead of starting over.
Checkpoint is deleted on completion.
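The resume behaviour can be sketched with a plain dict standing in for the canonical_generation_jobs row. The function signature and the key name used in the store are illustrative assumptions.

```python
from typing import Callable, Dict, List


def run_with_checkpoint(
    control_ids: List[str],
    store: Dict[str, str],
    process: Callable[[str], None],
    page_size: int = 100,
) -> int:
    """Process ids in sorted pages and persist the last id after each page."""
    last_id = store.get("last_control_id")
    pending = [c for c in sorted(control_ids) if last_id is None or c > last_id]
    processed = 0
    for start in range(0, len(pending), page_size):
        page = pending[start:start + page_size]
        for control_id in page:
            process(control_id)
            processed += 1
        store["last_control_id"] = page[-1]  # checkpoint after a full page
    store.pop("last_control_id", None)  # job completed: remove the checkpoint
    return processed
```

Because the checkpoint is written per page, a restart repeats at most one page of work.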

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-29 09:12:23 +02:00
Benjamin Admin b151951448 fix(pipeline): make dedup Phase 2 resilient — paginated, timeout, per-control error handling
- Paginated DB queries (100 rows/page) instead of loading all 166k rows
- Individual timeout (30s) per embedding + qdrant call
- Per-control try/except — one failure doesn't kill the job
- Sequential processing (no asyncio.gather) for stability
- Progress logging every 500 controls
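A condensed sketch of the loop shape described in this list follows. `fetch_page` and `process` are hypothetical callbacks, and the 30 s per-call timeout is left out to keep the example short.

```python
from typing import Callable, Iterator, List, Optional, Tuple


def iter_pages(
    fetch_page: Callable[[Optional[str], int], List[str]],
    page_size: int = 100,
) -> Iterator[List[str]]:
    """Keyset pagination: ask for the next page after the last seen id."""
    last_id: Optional[str] = None
    while True:
        page = fetch_page(last_id, page_size)
        if not page:
            return
        yield page
        last_id = page[-1]


def run_phase2(
    fetch_page: Callable[[Optional[str], int], List[str]],
    process: Callable[[str], None],
    log_every: int = 500,
) -> Tuple[int, int]:
    """Process controls one by one; a single failure never aborts the job."""
    done = failed = 0
    for page in iter_pages(fetch_page):
        for control_id in page:
            try:
                process(control_id)
            except Exception as exc:  # isolate the failure to this control
                failed += 1
                print(f"skipping {control_id}: {exc}")
            done += 1
            if done % log_every == 0:
                print(f"progress: {done} controls, {failed} failed")
    return done, failed
```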

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-28 15:31:28 +02:00
Benjamin Admin 9dc16674e2 perf(pipeline): skip singleton groups in dedup Phase 1
153k of 160k merge groups have only 1 control — no intra-group
dedup possible. Skip them in Phase 1, they become masters automatically.
Phase 2 (cross-group) still checks them via Qdrant embeddings.

Reduces Phase 1 from ~96h to ~2h.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-28 00:31:22 +02:00
Benjamin Admin e6e2688b56 fix(pipeline): add idempotency guard to submit-pass0b endpoint
Prevents duplicate batch submissions that caused ~$170 in extra costs.
Refuses new submit if a batch was submitted in the last 10 minutes.
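The guard reduces to a timestamp comparison. In this sketch the function name and the way the last submission time is obtained are assumptions; only the 10-minute window comes from the commit message.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

SUBMIT_COOLDOWN = timedelta(minutes=10)


def may_submit(last_submitted_at: Optional[datetime], now: Optional[datetime] = None) -> bool:
    """Allow a new batch only if none was submitted within the cooldown window."""
    now = now or datetime.now(timezone.utc)
    return last_submitted_at is None or now - last_submitted_at >= SUBMIT_COOLDOWN
```

An endpoint would call this before submitting and reject the request when it returns False.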

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-27 18:59:03 +02:00
Benjamin Admin 8e37441782 perf(pipeline): switch back to v4 prompt — backfill costs nearly the same
v3+backfill=$31.60/10k vs v4=$33/10k — not worth the extra complexity.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-27 00:44:23 +02:00
Benjamin Admin 6a0e7c947f perf(pipeline): switch to v3 prompt for generation, v4 fields via Haiku backfill
Remove applicability/scanner_hint/evidence_type/provides_context from
Pass 0b prompt to reduce output tokens (~40% less). These 6 fields are
added via cheap Haiku backfill afterwards (~$1.50 per 10k controls).

Saves ~$200 over the remaining 160k obligations.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-27 00:14:47 +02:00
Benjamin Admin 5ef039a6bc feat(pipeline): Pass 0b prompt v4 + Haiku backfill endpoint
Prompt v4 adds 6 new fields to Pass 0b output:
- applicability: condition rules (same format as dependency engine)
- check_type: expanded to 10 granular types
- scanner_hint: search_terms + negative_indicators for MCP
- manual_review_required_if: escalation conditions
- evidence_type: code/process/hybrid
- provides_context: context variables this control creates

New endpoint POST /generate/backfill-extended:
- Backfills existing 9k controls via Haiku Batch API (~$1.50)
- Adds all 6 new fields to generation_metadata
- Supports dry_run mode

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-26 23:14:59 +02:00
Benjamin Admin 96b8f25747 fix(pipeline): use action_type-derived phase order in ontology generator
LLM merge_key phases (e.g. "submission") don't always match PHASE_ORDER
keys. Derive phase order from action_type via get_phase_order() instead.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-26 20:32:58 +02:00
Benjamin Admin 42ab5ead26 feat(pipeline): implement Control Dependency Engine (Block 9)
Core engine (dependency_engine.py):
- 5 dependency types: prerequisite, supersedes, compensating_control,
  conditional_requirement, scope_exclusion
- Generic condition evaluator (JSONB rules with AND/OR/NOT/field ops)
- Priority-based conflict resolution
- Cycle detection (DFS) + topological sort
- Full evaluation with MCP-compatible dependency_resolution trace
- 39 tests all passing (incl. GHV scenario from user requirements)
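DFS-based cycle detection and topological sorting can share one traversal, as in the sketch below. The graph representation (control id mapped to the ids it depends on) is an assumption and not taken from dependency_engine.py. The recursive form is fine for small graphs; a very deep chain would need an iterative variant.

```python
from typing import Dict, List, Optional


def topological_order(graph: Dict[str, List[str]]) -> Optional[List[str]]:
    """Return a prerequisites-first order, or None if the graph has a cycle."""
    WHITE, GRAY, BLACK = 0, 1, 2  # unvisited, on the current DFS path, finished
    nodes = set(graph) | {dep for deps in graph.values() for dep in deps}
    color = {node: WHITE for node in nodes}
    order: List[str] = []

    def visit(node: str) -> bool:
        color[node] = GRAY
        for dep in graph.get(node, []):
            if color[dep] == GRAY:  # back edge: dep is on the current path
                return False
            if color[dep] == WHITE and not visit(dep):
                return False
        color[node] = BLACK
        order.append(node)  # appended only after all prerequisites
        return True

    for node in sorted(nodes):  # sorted for a deterministic result
        if color[node] == WHITE and not visit(node):
            return None
    return order
```

A validation step can treat a `None` result as a detected cycle, while an evaluator can walk the returned order so that each prerequisite is checked before the controls that depend on it.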

Automatic generator (dependency_generator.py):
- Ontology-based: same normalized_object + phase sequence -> prerequisite
- Pattern-based: define->implement, implement->monitor, etc.
- Domain packs: YAML rules for GDPR, AI Act, CRA, Security, Labor Contracts
- 14 tests all passing

API routes (dependency_routes.py):
- CRUD for dependencies
- POST /evaluate with dependency resolution
- POST /generate (auto-generation with dry_run)
- POST /validate (cycle detection)
- GET /graph (nodes + edges for visualization)

Prompt enhancement (decomposition_pass.py):
- Added dependency_hints + lifecycle_phase_order to Pass 0b prompt
- Stored in generation_metadata for post-processing

DB migration: control_dependencies + control_evaluation_results tables

126 tests total, all passing.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-26 20:28:10 +02:00
Benjamin Admin 5aaa62dca7 fix(pipeline): improve quality metrics heuristics
- Fix truncated title detection: only flag near-200-char titles or mid-word cutoffs
- Fix evidence leak detection: check title start patterns, not keyword substring
  ("nachweisen" verb is valid action, "Nachweis vorliegen" is evidence)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-26 09:53:52 +02:00
Benjamin Admin d583971afd feat(pipeline): add quality metrics endpoint for Pass 0b controls
GET /generate/quality-metrics — reports:
- controls_per_obligation ratio
- duplicate merge_key rate
- evidence leak rate
- truncated title rate
- MCP field coverage
- merge_key coverage

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-26 09:51:27 +02:00