Commit Graph

296 Commits

Author SHA1 Message Date
Benjamin Admin 97e39579d5 feat(cookie+routing): Storage-Typ-Filter + legal_notice capture-only
#3 Storage-Filter: cookie-check exponiert per-Cookie-Speichertyp
(storage_inventory.per_cookie); CookieResultView bekommt Filter-Chips
(Cookie/Local Storage/Framework …) + eine Speicher-Spalte, Anbieter ohne
passenden Treffer werden ausgeblendet, KPI zeigt gefilterte Zahl.

A-Routing: legal_notice ist jetzt ein kanonischer Doc-Type. Eigene
Discovery-Regel (legal-disclaimer/rechtlicher-hinweis) VOR impressum →
die Disclaimer-Seite wird nicht mehr als Impressum substituiert (Ursache,
dass die Cross-Doc-Reconciliation nie zündete). capture-only: als
doc_entry für B persistiert, aber nicht einzeln gescort (keine 0%-Noise,
da ohne eigene Checkliste). Im Scan-Form als Option auswählbar.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-06-11 20:45:18 +02:00
Benjamin Admin 0f6cdc93fd fix(snapshot): Cookie-Dedup + schneller Impressum-Tab + Tabellen-Zahl
- Cookies werden je Vendor nach Name dedupliziert (Consent-Phasen-Dubletten;
  BMW 2196 → ~772) — in cookie-check + get_snapshot, behebt aufgeblähte
  Kachel-/Finding-Zahlen.
- Impressum-Snapshot-Check überspringt den ~40s-LLM-Schritt (context skip_llm)
  → Tab lädt sofort statt leer zu bleiben.
- Vendor-Tabelle zeigt nur die Cookie-Zahl (kein 'Cookies'-Wort je Zeile).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-06-11 19:54:15 +02:00
Benjamin Admin ba7d98be36 feat(reconcile): B — Cross-Doc-Reconciliation (Pflicht in anderem Doc erfüllt)
Ein 'X fehlt'/'zu prüfen'-Finding wird unterdrückt, wenn die Pflicht in einem
ANDEREN Snapshot-Dokument erfüllt ist (z.B. § 36 VSBG / OS-Link stehen bei BMW
in AGB/'Rechtlicher Hinweis', nicht im Impressum → war False Positive).
Konservative Allowlist (impressum: verbraucher_streitbeilegung, odr_link) gegen
False-Reconciliation. Verdrahtet in _run_doc_agent (alle Doc-Checks). Frontend:
'In anderem Dokument abgedeckt'-Sektion. Greift voll nach Scan + Legal-Capture.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-06-11 15:42:16 +02:00
Benjamin Admin 9dfdaae8e4 feat(cookie): präfix-bewusster Library-Match (Runtime-Suffixe)
load_big_library matchte nur EXAKT → nur ~27% der BMW-Cookies trafen die
Open-Cookie-DB, weil Per-Instanz-Suffixe abweichen (_ga_GTM-XYZ, AMCVS_###@
AdobeOrg, _pk_id.5.7d8). Jetzt: Library einmal laden, Namen entwildcarden,
über _candidate_keys (exact + Präfix an Trennzeichen, Mindestlänge 3 gegen
Über-Match) matchen. Reuse der bewährten _strip_wildcards-Logik.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-06-11 15:24:45 +02:00
Benjamin Admin 4fb476e4be fix(agb): Gewährleistung erkennt 'bei Mängeln' / '§ N Mängel'-Heading
BMW-AGB nutzt '§ 9 Mängel' + 'Rechte und Ansprüche bei Mängeln' statt
'Gewährleistung' — Pattern ergänzt (False Negative auf Realdaten behoben).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-06-11 14:31:41 +02:00
Benjamin Admin 7258744107 refactor+feat: Snapshot-Router-Split + generischer ChecklistAgent + AGB-Modul
- Item 2: Snapshot-Doc-Checks (cookie/impressum/dse/agb) in snapshot_check_routes.py
  (agent_compliance_check_routes.py 464→365 Z.); gleiche Pfade, in main.py registriert.
- ChecklistAgent-Basis: DSE-Logik generalisiert (L1/L2, kurze Titel, _severity_
  override-Hook). DSEAgent + AGBAgent sind jetzt Thin-Subclasses → künftige
  Doc-Agenten (widerruf/avv/…) trivial.
- Item 4: AGBAgent (§§ 305 ff. BGB, AGB_CHECKLIST) + agb-check + AGB-Tab via
  AgentModuleTab. Kein Library-Firehose.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-06-11 14:23:29 +02:00
Benjamin Admin 3c6deac1c5 fix(dse+linter): Drittland-Applicability, kein na-Detail, kurze Titel, Linter-Wortgrenzen
- Linter: FORBIDDEN_OUTPUT_TERMS per Wortgrenze → 'Schutzgarantien'/'geeignete
  Garantien' (Art. 46) passieren, 'garantiert'-Claims bleiben geblockt.
- DSE: L2-Detail wird übersprungen statt 'na', wenn die L1-Pflichtangabe fehlt
  (kein irreführendes 'nicht anwendbar' für z.B. Transfermechanismus).
- DSE: Drittland → HIGH bei dokumentiertem Drittlandtransfer (scan_context via
  AgentInput.context) — BMW (Konzern, US-Provider) ist kein weiches MEDIUM.
- DSE: Titel/Maßnahme kurz (treibt den Recommendation-Titel); ausführliche
  Begründung als evidence — behebt 120-Zeichen-abgeschnittene Überschriften.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-06-11 13:43:24 +02:00
Benjamin Admin 76be96556d feat(dse): kuratierter DSEAgent + Snapshot-Tab (Art. 13/14, kein Firehose)
DSEAgent wrappt die existierende ART13_CHECKLIST (33 kuratierte Pflichtangaben
L1 + Detailchecks L2) → strukturierter AgentOutput, NICHT der 90k-Library-
Firehose (eCall/Gesundheit/Telekom-Lärm). GET /snapshots/{id}/dse-check spiegelt
impressum-check; doc_input_from_snapshot generalisiert. Frontend: generischer
AgentModuleTab (lazy → AgentResultTab) für Impressum + DSE; DSE-Tab in der
Snapshot-Seite. Plus HRB-Pattern \d→\d+ (volle Registernummer als Beleg).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-06-11 12:46:46 +02:00
Benjamin Admin be93859645 fix(impressum): Pflichtangaben-Beleg = exakter Treffer statt Textpassage
_match_value gibt genau den gematchten Bereich zurück (nur die E-Mail unter
Email, nur die USt-IdNr, nur die Telefonnummer) — nicht mehr ein Fenster/den
umgebenden Satz. Behebt die Wiederholung desselben Anfangssatzes bei Texten
ohne Zeilenumbrüche (BMW = ein Block).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-06-11 12:25:27 +02:00
Benjamin Admin 877d540ce1 fix(cookie+impressum): Drittland-FP, Impressum-Beleg, neuer Opt-Out-Finding
- Drittland: unbekannte Herkunft ('N/A') + Self-Hosting feuern nicht mehr —
  First-Party-Session-Cookies (PHPSESSID/JSESSIONID) waren False Positives.
- Impressum _line_of: enges Fenster um den Treffer bei Texten ohne Umbrüche
  (BMW = ein Block) → jede Pflichtangabe zeigt IHREN Beleg statt denselben Satz.
- Neuer Finding-Typ missing_opt_out: einwilligungspflichtiger Anbieter mit
  Cookies ohne Opt-Out-/Widerspruchs-Link (Art. 7 Abs. 3 + Art. 21).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-06-11 11:59:48 +02:00
Benjamin Admin 39cb6afc23 feat(cookie): Findings bearbeitbar — gruppiert nach Typ + Matrix + Hinweise-Split
CookieFindings: Umschalter [Nach Fehlertyp] (je Typ: Maßnahme + betroffene
Cookies + Ticket-Text) ↔ [Matrix] (Cookie×Typ, ✗ Handlung / ⚠ Hinweis).
Trennung FINDINGS (zu beheben) vs HINWEISE (neutral, gegen DSE zu prüfen).
Backend: kind-Klassifikation (third_country/eu_alternative=hinweis); Drittland-
Remediation neutral formuliert (pro Verarbeiter prüfen, keine 'in DSE benennen'-
Befehle, da DSE-Abdeckung wie BMWs 'in der Regel SCC' oft unzureichend).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-06-11 11:02:34 +02:00
Benjamin Admin b0115cb10b feat(cookie): 2. Sicht Banner-Kategorie + Fehl-Einsortierung
CookieResultView bekommt einen Umschalter [Rechtliche Rolle] ↔
[Banner-Kategorie] (Notwendig/Funktional/Statistik/Marketing). In beiden
Sichten zeigt jede Cookie-Zeile '→ sollte: Marketing', wenn die tatsächliche
Kategorie laut Library von der deklarierten abweicht (rot bei Tracker als
notwendig, § 25 TDDDG). Neue KPI 'Falsch einsortiert'. Backend liefert dazu
cookie_categories (name→actual_category) aus big_lib im cookie-check-Output;
Seite lädt cookie-check einmal und reicht es an beide Komponenten.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-06-11 10:33:33 +02:00
Benjamin Admin 7fa9968ce1 feat(cookie): missing_retention — Vendor ohne Speicherdauer/Löschfrist
Vendor-Ebenen-Finding: greift, wenn ein Vendor eine Verarbeitung deklariert
(Kategorie/Zweck), aber KEINE Cookies gelistet sind UND keine persistence
angegeben ist (z.B. Nayoki GmbH — 'necessary' Auftragsverarbeiter ohne
Löschfrist). Die Pro-Cookie-Schleife sah solche Vendors nie (0 Cookies →
0 Findings). Remediation = Ticket-Text 'bitte Löschfrist festlegen'.
Art. 5 Abs. 1 lit. e + Art. 13 Abs. 2 lit. a → Control AUTH-2051-A03.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-06-11 10:02:59 +02:00
Benjamin Admin 289988d23e feat(cookie): ① Storage Inventory + storage_transparency-Finding
Trennt echte Cookies von anderem Endgeraete-Speicher (Local/Session Storage,
IndexedDB, Salesforce-Framework-Artefakte) — § 25 TDDDG ist technologieneutral.
- cookie_storage_inventory: detect_storage_type (Name-Muster ComponentDefStorage/
  __MUTEX/LSKey + Laufzeit-Text) + build_storage_inventory + storage_transparency-
  Summenbefund ('X als Cookie gelistet -> Y echte + Z andere').
- Endpoint cookie-check liefert storage_inventory; Frontend zeigt den Breakdown.

Tests: 4 + Frontend-Vitest gruen. Differenzierungsmerkmal: '740 -> 132 + 608'.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-06-11 09:05:29 +02:00
Benjamin Admin 901de1ca97 feat(cookie): A — Findings auditfest an Controls verdrahten
Jeder Cookie-Befund traegt jetzt ein strukturiertes control-Feld
(control_id aus doc_check_controls + regulation + article) statt nur
hardcodeter Strings: vague_duration->AUTH-2051-A03 (Art.5(1)e+13),
tracker_as_necessary->DATA-2851-A05 (§25 TDDDG), third_country->
DATA-1624-A04 (Art.44). Kette Regulation->Article->Control->Finding.
Frontend zeigt die Rechtsgrundlage je Befund. (Controls tragen
regulation/article noch NULL -> hier mitgeliefert bis gepflegt.)

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-06-11 08:44:19 +02:00
Benjamin Admin 4c45f11e43 feat(cookie): Finding 'vague_duration' — unkonkrete Speicherdauer
Flaggt Laufzeit-Angaben ohne konkrete Dauer/Kriterium ('dauerhaft', 'bis zur
Loeschung', 'bis Nutzer deaktiviert', 'unbegrenzt' …) — Art. 5(1)(e) + Art. 13
DSGVO. Library-unabhaengig, gilt fuer ALLE Cookies (Coverage auf BMWs 780).
'13 Monate'/'Session'/'bis Widerruf, max. X' bleiben ok.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-06-11 08:33:06 +02:00
Benjamin Admin d18ef79f18 feat(cookie): Pro-Cookie-Library-Abgleich (2287er OCD + 35er rich) + Panel
- analyze_cookies gleicht Cookies gegen BEIDE Libraries ab: compliance.cookie_library
  (2287, OCD/CC0 — Kategorie/Retention) + 35er rich-DB (technical_necessity/reid/
  schrems/eu_alternative). 5 Befund-Typen: tracker_as_necessary, missing_purpose,
  excessive_lifetime (Art.5), third_country (Art.44), eu_alternative (kommerziell).
- Endpoint GET /snapshots/{id}/cookie-check (load_big_library batch + analyze).
- Frontend CookieLibraryPanel im Snapshot-Detail.
- Fix CookieResultView: Zweck nicht mehr auf 60 Zeichen gekuerzt; Rolle 'unknown'
  als Strich statt 'Unbekannt'.

Tests: 7 backend + frontend vitest gruen.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-06-11 08:18:25 +02:00
Benjamin Admin 3f23a64d5f feat(agent): Impressum-Tab auf Haupt-Engine + Profil/§36-Fixes
Ergebnis-Tab rendert jetzt result.results (Haupt-Doc-Check) statt des
abweichenden v3-Agenten — BMW korrekt statt False Positives:
- DocResultView: ein Dokument als Pflichtangaben-Tabelle (Label + gefundener
  Text + 3-Tier-Status), KEINE MC-IDs. ComplianceResultTabs speist Tabs aus
  result.results; ChecklistView-Bausteine exportiert + wiederverwendet.
- profile_extractor: Firmenname/Rechtsform = fruehester Treffer + ausge-
  schriebene Formen (Aktiengesellschaft) -> BMW AG statt "juris GmbH".
- 36 VSBG (MC-010): reines b2c -> POSSIBLY_APPLICABLE (Pruef-Hinweis) statt
  MEDIUM-FAIL; hart nur bei ecommerce. possibly_hint pro MC.
- McCoverage traegt label + found (Snippet); mc_possibly-Aggregat.
- AgentFindingCard/Methodik: interne check_id/mc_id nicht mehr angezeigt.

Tests: test_four_status (16) + Frontend-Vitest gruen; CI-Suite 206, v3/GT
unveraendert. Nur eigene Dateien (geteilter Tree).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-06-10 23:44:01 +02:00
Benjamin Admin 97575cc9c0 feat(agent): 4-Status-Modell (NOT_APPLICABLE/INSUFFICIENT_EVIDENCE/POSSIBLY_APPLICABLE) für Impressum
Kanonisches Compliance-Datenmodell, Impressum-Agent als Referenz:
- CheckStatus-Enum + Finding.status GETRENNT von severity (Verdikt ≠ Risiko)
- Unbestimmte Rechtsform (weder Text noch Wizard) → INSUFFICIENT_EVIDENCE (INFO)
  statt hartem HIGH-FAIL; legal_form_dependent-Gate + detect_legal_form_present
- §18-MStV-Graubereich (Corporate-Blog via has_editorial_content) →
  POSSIBLY_APPLICABLE (LOW Prüf-Hinweis); 3-stufig via scope_disposition
- Recommendations nur aus echten FAILs; mc_insufficient/mc_possibly-Aggregate
- Frontend: Verdikt-Pill + Coverage-Vokabular
- 19 neue Tests (test_four_status.py, AgentFindingCard); CI-Suite 204 grün,
  v3 25 / GT 13 unverändert

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-06-10 22:38:11 +02:00
Benjamin Admin b7a7e70731 feat(agent): Impressum Rechtsform-Gates + USt-optional (Phase 3)
Die 8 Audit-Klassifizierungs-Felder (scan_context) treiben jetzt den
business_scope der Agenten (vorher gespeichert, aber nicht genutzt).
Rechtsform-Gates als opt-out (excludes_scope): Verein -> kein
Handelsregister-Finding, e.K. -> kein Vertretungsberechtigte-Finding;
unbekannte Rechtsform bleibt anwendbar. USt-IdNr optional -> fehlt =
kein Finding. Rechts-Zuordnung vom Domain-Experten bestaetigt.

- _classification.py: scan_context_to_scope (8 Felder -> scope-Tokens)
- mcs.py: MC.excludes_scope + MC.optional; IMP-MC-004/006 Gate-Tokens;
  IMP-MC-005 optional; scope_matches respektiert excludes_scope
- agent.py: optional -> kein Finding bei Abwesenheit
- _agent_outputs.py: scope = scan_context vereinigt LLM-Profil-Fallback
- Tests gruen: v3 25, Groundtruth 13, CI-Pfad 14 (+ SSE-Loop-Fix)

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-06-10 20:37:56 +02:00
Benjamin Admin bc78ddd3e5 fix(impressum): Findings aus 12 §5-TMG-Pattern-MCs statt verunreinigtem DB-Set
CI / detect-changes (push) Successful in 8s
CI / branch-name (push) Has been skipped
CI / guardrail-integrity (push) Has been skipped
CI / secret-scan (push) Has been skipped
CI / dep-audit (push) Has been skipped
CI / sbom-scan (push) Has been skipped
CI / build-sha-integrity (push) Failing after 5s
CI / validate-canonical-controls (push) Successful in 11s
CI / loc-budget (push) Successful in 14s
CI / go-lint (push) Has been skipped
CI / python-lint (push) Has been skipped
CI / nodejs-build (push) Has been skipped
CI / nodejs-lint (push) Has been skipped
CI / test-go (push) Has been skipped
CI / iace-gt-coverage (push) Has been skipped
CI / test-python-backend (push) Successful in 30s
CI / test-python-document-crawler (push) Has been skipped
CI / test-python-dsms-gateway (push) Has been skipped
Der Agent lieferte "alles gruen": _load_controls gab auf macmini nur 3 von 75
doc_type='impressum'-MCs zurueck (Sidecar mc_classification.db hat nur 4/75 als
text-matchbar klassifiziert). Tiefere Ursache: die 75 doc_type='impressum'-MCs
sind fehl-klassifiziert (60/75 canonical_scope='other'; Prefixes TRD/SEC/GOV =
Geschaeftsbriefe/Marktplatz/Bestellung, NICHT §5 TMG Website-Impressum).

Fix: Der Impressum-Agent erzeugt Findings jetzt aus seinen 12 autoritativen
§5-TMG/DDG-Pattern-MCs (mcs.py) statt aus dem verunreinigten DB-Set —
deterministisch, scope-aware, field_id = semantisches Feld. Semantic-Validator-
Demote + Massnahmen + Rollup bleiben. Die 5-Impressum-GT-Tests laufen jetzt
echt durch: 0 Falsch-Positive.

DB-Master-Controls fuer Impressum deaktiviert bis zum MC-Re-Filtering (separate
Aufgabe: die doc_type-Klassifizierung der Vorgaenger-Session muss bereinigt
werden).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-09 13:15:34 +02:00
Benjamin Admin 08c08fcba2 feat(crawl): Vollstaendigkeit — Shadow-DOM/versteckte Links + Interaktions-Fixpunkt + Wayback-CDX-Orphans
CI / test-python-backend (push) Successful in 30s
CI / detect-changes (push) Successful in 9s
CI / branch-name (push) Has been skipped
CI / guardrail-integrity (push) Has been skipped
CI / secret-scan (push) Has been skipped
CI / dep-audit (push) Has been skipped
CI / sbom-scan (push) Has been skipped
CI / build-sha-integrity (push) Failing after 4s
CI / validate-canonical-controls (push) Successful in 12s
CI / loc-budget (push) Successful in 15s
CI / go-lint (push) Has been skipped
CI / python-lint (push) Has been skipped
CI / nodejs-lint (push) Has been skipped
CI / nodejs-build (push) Has been skipped
CI / test-go (push) Has been skipped
CI / iace-gt-coverage (push) Has been skipped
CI / test-python-document-crawler (push) Has been skipped
CI / test-python-dsms-gateway (push) Has been skipped
Damit die Specialist-Agents auf vollstaendigem Website-Content arbeiten:

A — _find_dsi_links pierct jetzt Shadow-DOM (Web-Components wie Usercentrics/
    Mercedes) rekursiv; versteckte (display:none) Links werden erfasst + als
    Coverage-Metadatum geflaggt.
B — _expand_to_fixpoint klappt Akkordeons/Tabs/Hover-Menues in einer Schleife
    auf, bis das DOM stabil ist (statt 1 Pass); erweiterte Selektoren;
    Coverage-Telemetrie (Runden, expandierte Elemente, DOM-Wachstum, Shadow-/
    versteckte Links) → Response + Backend-Log.
C — legacy_url_cdx.cdx_enumerate listet via Wayback-CDX-API ALLE je
    archivierten URLs der Domain → findet Orphan-/Legacy-Seiten, die nie im
    Slug-Raster standen (z.B. nicht mehr verlinktes /datenschutz, per Direkt-
    URL noch erreichbar). Fliesst durch das bestehende Legacy-URL-Inventar.

Tests: test_legacy_url_cdx.py (6) + consent-tester/tests/test_dsi_discovery.py
(Pure-Helper + Real-Browser-Integration). Alle gruen, LOC-Gate gruen.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-09 12:33:34 +02:00
Benjamin Admin 389e6de0c7 fix(agents): Impressum+Cookie delegieren MC-Laden ans Main Tool — Scope-Filter + Maßnahmen
CI / detect-changes (push) Successful in 8s
CI / branch-name (push) Has been skipped
CI / guardrail-integrity (push) Has been skipped
CI / secret-scan (push) Has been skipped
CI / sbom-scan (push) Has been skipped
CI / build-sha-integrity (push) Failing after 4s
CI / iace-gt-coverage (push) Has been skipped
CI / test-python-backend (push) Successful in 30s
CI / test-python-document-crawler (push) Has been skipped
CI / test-python-dsms-gateway (push) Has been skipped
CI / dep-audit (push) Has been skipped
CI / validate-canonical-controls (push) Successful in 11s
CI / loc-budget (push) Successful in 14s
CI / go-lint (push) Has been skipped
CI / python-lint (push) Has been skipped
CI / nodejs-lint (push) Has been skipped
CI / nodejs-build (push) Has been skipped
CI / test-go (push) Has been skipped
Regression: Der v3-Agent-Pfad baute eine parallele MC-Pipeline
(_load_impressum_mcs / _load_cookie_mcs, Roh-SELECT) und lief damit an
allen Schutzmechanismen der Engine vorbei → GOV/Branchen-MCs als HIGH bei
OEM/Zulieferer, fremde MCs (Bestellbestätigung), und action=check_question
(Fragen statt Maßnahmen im Frontend).

- Agent delegiert MC-Laden an rag_document_checker._load_controls
  (P72-Scope, check_type='text', fits_doc_type/scope_requires).
- Subtraktives Sektor-Gate (SECTOR_PREFIXES) + Themen-Gate am Agent-Rand.
- action = konkrete Maßnahme (Imperativ) statt check_question.
- rag_document_checker: from __future__ import annotations (3.9-Import).
- mcs: Name-Pattern erkennt "Aktiengesellschaft" (OEM-Impressums).
- Tote GT-/Semantic-/Routes-Tests wiederbelebt (v3-Mismatch +
  agent.cascade-Patch-Target). Alle 72 Specialist-Tests grün.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-09 11:30:16 +02:00
Benjamin Admin bd4882e143 feat(agents): Sprint 1.12 Phase 2 — Cookie-Policy v3 + ImpressumAgent v3 finetune
CI / detect-changes (push) Successful in 8s
CI / branch-name (push) Has been skipped
CI / guardrail-integrity (push) Has been skipped
CI / secret-scan (push) Has been skipped
CI / dep-audit (push) Has been skipped
CI / build-sha-integrity (push) Failing after 4s
CI / validate-canonical-controls (push) Successful in 11s
CI / loc-budget (push) Successful in 14s
CI / go-lint (push) Has been skipped
CI / python-lint (push) Has been skipped
CI / nodejs-lint (push) Has been skipped
CI / nodejs-build (push) Has been skipped
CI / test-go (push) Has been skipped
CI / iace-gt-coverage (push) Has been skipped
CI / test-python-backend (push) Successful in 30s
CI / test-python-document-crawler (push) Has been skipped
CI / test-python-dsms-gateway (push) Has been skipped
CI / sbom-scan (push) Has been skipped
ImpressumAgent v3 (Refactor):
  - v3_engine: laedt direkt alle 75 doc_check_controls['impressum'] ohne
    Sidecar-Filter (Sidecar war zu streng, lieferte nur 3 von 75 MCs).
  - Layer 0 Boost prueft pass+fail_criteria gegen meine 12 Patterns mit
    erweiterten Initial-Seeds (User-Vorgabe 2026-06-09:
    manuelle Initial-Seeds OK, Auto-Learning erweitert zur Laufzeit).
  - ETO-Smoke: 75 DB-MCs · 7 Pattern-Boosts · 24 Boost-Overrides
    (versus 3 DB-MCs vorher).

CookiePolicyAgent v3 (Refactor):
  - cookie_policy/v3_engine.py + cookie_policy/regex_boost.py
  - Laedt direkt alle 381 Cookie-MCs aus doc_check_controls
  - Layer 0 mit 12 eigenen Patterns als Initial-Seed
  - KB-Layer (CMP-Vendor-Cross-Check) bleibt erhalten
  - agent_version='3.0'

Tests: 27/27 gruen (12 v3-impressum, 6 cookie-policy, 9 cross-placement).
Alte v2-cookie-tests umgeschrieben auf v3-Pipeline-Mock.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-06-09 09:23:12 +02:00
Benjamin Admin d3ac33d53a feat(impressum): v3 — Layer-Architektur auf doc_check_controls (75 DB-MCs)
CI / build-sha-integrity (push) Failing after 4s
CI / validate-canonical-controls (push) Successful in 12s
CI / loc-budget (push) Successful in 15s
CI / go-lint (push) Has been skipped
CI / python-lint (push) Has been skipped
CI / detect-changes (push) Successful in 7s
CI / branch-name (push) Has been skipped
CI / guardrail-integrity (push) Has been skipped
CI / secret-scan (push) Has been skipped
CI / dep-audit (push) Has been skipped
CI / sbom-scan (push) Has been skipped
CI / nodejs-lint (push) Has been skipped
CI / nodejs-build (push) Has been skipped
CI / test-go (push) Has been skipped
CI / iace-gt-coverage (push) Has been skipped
CI / test-python-backend (push) Successful in 31s
CI / test-python-document-crawler (push) Has been skipped
CI / test-python-dsms-gateway (push) Has been skipped
Sprint 1.12 Phase 1 (User-Vorgabe 2026-06-09):

Statt eigener 12 hartgepatchter Patterns nutzt der Impressum-Agent jetzt
die 75 echten Master-Controls aus compliance.doc_check_controls. Pipeline:

  Layer 0  — Regex-Boost (meine 12 Patterns aus mcs.py / regex_boost.py)
             → wenn Pattern hits, MC wird zu PASS überschrieben
  Layer 1  — Keyword-Match aus pass_criteria der 75 DB-MCs
             (rag_document_checker.check_document_with_controls)
  Layer 2  — BGE-M3 Embedding-Match (in rag_document_checker integriert)
  Layer 3  — Semantic-Validator (LLM) für übriggebliebene HIGH/MEDIUM
             + Auto-Learning-Pattern-Library

Output-Layer bleibt unverändert: Disclaimer-Linter + Rollup-Dedup +
Methodik-First-UI.

Neue Dateien:
  - impressum/v3_engine.py       — Pipeline-Orchestrator
  - impressum/regex_boost.py     — meine 12 Patterns + Boost-Mapping

Refactored:
  - impressum/agent.py           — komplett umgeschrieben, agent_version=3.0
                                    255 LOC (unter 500-Cap)

Tests: test_impressum_v3.py mit 10 neuen Tests, alle gruen. Mockt
run_v3_pipeline für offline-Lauf. Bestaetigt:
  - Layer-0 erkennt Tesla-typische Felder
  - Boost matched DB-MC nur bei ≥2 Keyword-Treffern in pass_criteria
  - 12 Pattern-Boost-Slots + N DB-MCs in coverage
  - Notes enthalten Telemetrie (v3-pipeline, Boost-Overrides)

Telemetrie wird in AgentOutput.notes ausgegeben, damit Frontend
sehen kann: 75 DB-MCs geprueft · 5 Pattern-Boosts · 3 Boost-Overrides.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-06-09 08:58:53 +02:00
Benjamin Admin 18e4f98201 fix(agents): klarere Naming + korrektes LLM-Default-Modell
CI / detect-changes (push) Successful in 6s
CI / branch-name (push) Has been skipped
CI / guardrail-integrity (push) Has been skipped
CI / secret-scan (push) Has been skipped
CI / dep-audit (push) Has been skipped
CI / nodejs-build (push) Successful in 2m20s
CI / test-go (push) Has been skipped
CI / sbom-scan (push) Has been skipped
CI / build-sha-integrity (push) Failing after 4s
CI / validate-canonical-controls (push) Successful in 11s
CI / loc-budget (push) Successful in 14s
CI / go-lint (push) Has been skipped
CI / python-lint (push) Has been skipped
CI / nodejs-lint (push) Has been skipped
CI / iace-gt-coverage (push) Has been skipped
CI / test-python-backend (push) Successful in 30s
CI / test-python-document-crawler (push) Has been skipped
CI / test-python-dsms-gateway (push) Has been skipped
User-Korrektur 2026-06-09:

(1) Begriff 'MC' steht im Projekt fuer Master-Control aus
canonical_controls (314k Eintraege, ~1.800 fuer dieses Tool). Mein
neuer Agent-Code hatte 'MC' als Abkuerzung fuer 'Machine-Check'
verwendet — Naming-Konflikt. Frontend-Methodik-Box jetzt:
  - 'Pattern-Check' statt 'Machine-Check'
  - Explizit: 'Diese Pattern-IDs (IMP-MC-001) sind interne Test-IDs,
    NICHT die Master-Control-IDs aus der canonical_controls-DB'
  - Roadmap-Hinweis: formale Verknuepfung Pattern→Master-Control folgt

Backend-Variablen mc_id bleiben technisch unveraendert (Refactor
waere gross), aber UI darf sie nicht als 'Master-Control' bezeichnen.

(2) LLM-Modell-Default war 'qwen2.5:7b' — Projekt nutzt aber das
groessere 'qwen3.5:35b-a3b' auf macmini (ENV SELF_HOSTED_LLM_MODEL).
_escalation.py default jetzt: SELF_HOSTED_LLM_MODEL als Fallback,
und Methodik-Erklaerung nennt das richtige Modell.

(3) Methodik-Erklaerung erweitert um Sprint-1.10 Semantic-Validator
und Sprint-1.11 Auto-Learning-Pattern-Library + Cross-Placement.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-06-09 08:29:00 +02:00
Benjamin Admin 154e8c293b feat(agents): Cross-Placement-Agent (deplatzierter Content)
CI / detect-changes (push) Successful in 6s
CI / branch-name (push) Has been skipped
CI / guardrail-integrity (push) Has been skipped
CI / secret-scan (push) Has been skipped
CI / dep-audit (push) Has been skipped
CI / sbom-scan (push) Has been skipped
CI / build-sha-integrity (push) Failing after 4s
CI / validate-canonical-controls (push) Successful in 11s
CI / loc-budget (push) Successful in 14s
CI / go-lint (push) Has been skipped
CI / python-lint (push) Has been skipped
CI / nodejs-lint (push) Has been skipped
CI / nodejs-build (push) Has been skipped
CI / test-go (push) Has been skipped
CI / iace-gt-coverage (push) Has been skipped
CI / test-python-backend (push) Successful in 29s
CI / test-python-document-crawler (push) Has been skipped
CI / test-python-dsms-gateway (push) Has been skipped
Sprint 1.9 (User-Vorgabe 2026-06-09):

Erkennt im Impressum Inhalts-Sektionen die thematisch besser in
einen Footer-Reiter 'Legal' gehoeren:
  - Urheberrecht / Copyright          -> LOW  (Footer 'Legal')
  - Bilder & Lizenzen                  -> LOW  (Seite 'Bildquellen')
  - Haftungsausschluss / Disclaimer    -> LOW  (Seite 'Disclaimer')
  - Nutzungsbedingungen                -> LOW  (Seite 'AGB')
  - Aenderungsvorbehalt                -> LOW
  - ElektroG / WEEE-Reg                -> MEDIUM (Produktinfo)
  - VerpackG / LUCID                   -> MEDIUM
  - BattG                              -> MEDIUM

Each Finding empfiehlt konkret den 'Legal'-Footer-Reiter
einzufuehren als Best Practice ('Impressum bleibt schlank
und enthaelt ausschliesslich die Pflichtangaben nach § 5
TMG/DDG').

Tests gegen die 5 GT-Impressums:
  - Safetykon: 3 Findings (Urheberrecht, Bilder/Lizenzen,
    Haftungsausschluss)
  - Hectronic: 3 Findings (WEEE-MEDIUM, Copyright, Haftung)
  - ETO/BMW/Elli: 0 Findings (sauber)
  - 9/9 Tests gruen.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-06-09 08:19:57 +02:00
Benjamin Admin ca8c388f37 feat(agents): Semantic-Validator + Auto-Learning-Pattern-Library
CI / detect-changes (push) Successful in 5s
CI / branch-name (push) Has been skipped
CI / secret-scan (push) Has been skipped
CI / guardrail-integrity (push) Has been skipped
CI / dep-audit (push) Has been skipped
CI / sbom-scan (push) Has been skipped
CI / build-sha-integrity (push) Failing after 4s
CI / validate-canonical-controls (push) Successful in 11s
CI / loc-budget (push) Successful in 14s
CI / go-lint (push) Has been skipped
CI / nodejs-lint (push) Has been skipped
CI / python-lint (push) Has been skipped
CI / test-go (push) Has been skipped
CI / nodejs-build (push) Has been skipped
CI / test-python-dsms-gateway (push) Has been skipped
CI / iace-gt-coverage (push) Has been skipped
CI / test-python-backend (push) Successful in 29s
CI / test-python-document-crawler (push) Has been skipped
Sprint 1.10 — Semantic-Validator (User-Vorgabe 2026-06-09):
  - Statt unendlich Regex-Pattern fuer jede Schreibweise zu pflegen
    (Tel/Telefon/Telefonnr/Phone/Fon/Funkanschluss/…), nutzen wir
    bei MC-MISS einen LLM-Call: 'Ist die Pflichtangabe semantisch
    doch da, nur unter abweichendem Label?'
  - Bei LLM-Treffer: HIGH/MEDIUM-Finding wird zu LOW demoted,
    Empfehlung wird zu 'Best-Practice Umbenennung: Management ->
    Geschaeftsfuehrer' (mit STANDARD_LABELS-Mapping).
  - 1 LLM-Call pro Slot statt N: cost-effizient.

Sprint 1.11 — Auto-Learning-Pattern-Library:
  - Jedes Label das SVL findet wird in JSON persistiert:
    /tmp/breakpilot/agent_learned_patterns.json
  - Beim naechsten Run prueft der Agent zuerst gelernte Patterns
    BEVOR er das HIGH-Finding emittiert -> kein LLM-Call mehr.
  - Asymptotisch 0 LLM-Calls fuer haeufige Edge-Cases.
  - Halluzinations-Schutz: prune_low_confidence() loescht Patterns
    mit <0.5 Avg-Confidence nach 100 Beobachtungen.
  - Idempotent: gleicher (field_id, label, agent) -> Counter +1.

Tests: 40/40 gruen (10 Pattern-Library + 7 SVL + 13 GT + 11 v2).

STANDARD_LABELS-Map deckt Impressum + Cookie-Policy. Spaeter
erweiterbar fuer DSE, AGB, Widerrufs-Agenten.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-06-09 08:16:21 +02:00
Benjamin Admin 882e4f9798 test(impressum): GT-Fixtures + Fix 'Telefonnummer' Pattern
CI / go-lint (push) Has been skipped
CI / python-lint (push) Has been skipped
CI / detect-changes (push) Successful in 8s
CI / branch-name (push) Has been skipped
CI / guardrail-integrity (push) Has been skipped
CI / secret-scan (push) Has been skipped
CI / dep-audit (push) Has been skipped
CI / sbom-scan (push) Has been skipped
CI / build-sha-integrity (push) Failing after 4s
CI / validate-canonical-controls (push) Successful in 11s
CI / loc-budget (push) Successful in 13s
CI / test-go (push) Has been skipped
CI / iace-gt-coverage (push) Has been skipped
CI / test-python-backend (push) Successful in 30s
CI / nodejs-build (push) Has been skipped
CI / nodejs-lint (push) Has been skipped
CI / test-python-document-crawler (push) Has been skipped
CI / test-python-dsms-gateway (push) Has been skipped
Ground-Truth-Fixtures fuer 5 echte Impressums (ETO, Safetykon, BMW,
Elli, Hectronic). Pro Impressum:
  - text (User-eingegeben)
  - expected_clean (Felder die da sind → keine Findings)
  - business_scope
  - placement_concerns (Texte die deplatziert sind — fuer kommenden
    Cross-Placement-Agent)

13 GT-Tests + 11 Specialist-Tests = 24/24 gruen.

Bug-Fix: Elli schreibt 'Telefonnummer:' (kein 'Telefon:'),
mein Pattern matched nur Tel/Telefon. Erweitert:
'Tel(?:efon(?:nummer)?)?|Phone|Fon'

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-06-09 08:07:11 +02:00
Benjamin Admin 593baace7c fix(agents): HTML-Entity-Decode vor Agent + Pattern duldet '('
CI / detect-changes (push) Successful in 6s
CI / branch-name (push) Has been skipped
CI / guardrail-integrity (push) Has been skipped
CI / secret-scan (push) Has been skipped
CI / dep-audit (push) Has been skipped
CI / sbom-scan (push) Has been skipped
CI / build-sha-integrity (push) Failing after 4s
CI / validate-canonical-controls (push) Successful in 11s
CI / loc-budget (push) Successful in 14s
CI / go-lint (push) Has been skipped
CI / test-go (push) Has been skipped
CI / iace-gt-coverage (push) Has been skipped
CI / test-python-backend (push) Successful in 28s
CI / test-python-document-crawler (push) Has been skipped
CI / test-python-dsms-gateway (push) Has been skipped
CI / python-lint (push) Has been skipped
CI / nodejs-build (push) Has been skipped
CI / nodejs-lint (push) Has been skipped
Bug bei BMW: dsi-discovery liefert HTML-Entities (&nbsp;) als
Literal-Strings ohne Decode. Beispiel im BMW-Impressum:
  'wird gesetzlich durch den Vorstand&nbsp;(Milan Nedeljkovic, …)'
Mein Pattern erwartet ':' / '.' / Whitespace nach Vorstand →
matched nicht das '&' → false-positive HIGH-Finding.

Fix 1 (Hauptfix): Test-Harness ruft html.unescape() vor agent.evaluate()
auf, so dass jeder Agent sauberen Text bekommt — entkoppelt von
dsi-discovery-Eigenarten.

Fix 2 (Belt-and-suspenders): Pattern duldet jetzt auch '(' direkt
nach Vorstand/Geschaeftsfuehrer (falls Decode mal fehlschlaegt).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-06-08 18:45:37 +02:00
Benjamin Admin 702e7a6333 fix(impressum): Pattern fasst Geschäftsführung/Vorstand/Inhaber
CI / build-sha-integrity (push) Failing after 4s
CI / validate-canonical-controls (push) Successful in 11s
CI / loc-budget (push) Successful in 13s
CI / go-lint (push) Has been skipped
CI / python-lint (push) Has been skipped
CI / nodejs-lint (push) Has been skipped
CI / nodejs-build (push) Successful in 2m21s
CI / test-go (push) Has been skipped
CI / iace-gt-coverage (push) Has been skipped
CI / test-python-backend (push) Successful in 29s
CI / detect-changes (push) Successful in 8s
CI / branch-name (push) Has been skipped
CI / guardrail-integrity (push) Has been skipped
CI / secret-scan (push) Has been skipped
CI / dep-audit (push) Has been skipped
CI / sbom-scan (push) Has been skipped
CI / test-python-document-crawler (push) Has been skipped
CI / test-python-dsms-gateway (push) Has been skipped
Safetykon-Bug: 'Geschäftsführung:' (Sammelbegriff für GF einer GmbH)
matched das alte Pattern 'Geschäftsführer' nicht — False-Positive
IMPRESSUM-AGENT-VERTRETUNGSBERECHTIGTE_LABEL_KORREKT.
Pattern erweitert: Geschäftsführer|Geschäftsführung|Geschäftsführerin
+ Vorstand|Vorstandsvorsitzender + Inhaber|persönlich haftend.
Test test_safetykon_geschaeftsfuehrung_passes ergänzt (11/11 grün).

frontend: SlotCard zeigt jetzt Badge bei 0/0/0-Slots
('Dokument konnte nicht geladen werden') statt silent-fail, +
bei 0 Findings ein 'alle MCs OK'-Badge.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-06-08 18:24:01 +02:00
Benjamin Admin 860469d4b1 fix(agents): Default-Vault-Pfad nach /tmp damit Container-User schreiben kann
CI / detect-changes (push) Successful in 7s
CI / branch-name (push) Has been skipped
CI / sbom-scan (push) Has been skipped
CI / build-sha-integrity (push) Failing after 4s
CI / guardrail-integrity (push) Has been skipped
CI / secret-scan (push) Has been skipped
CI / dep-audit (push) Has been skipped
CI / loc-budget (push) Successful in 13s
CI / validate-canonical-controls (push) Successful in 11s
CI / test-python-backend (push) Successful in 30s
CI / test-python-document-crawler (push) Has been skipped
CI / test-python-dsms-gateway (push) Has been skipped
CI / go-lint (push) Has been skipped
CI / python-lint (push) Has been skipped
CI / test-go (push) Has been skipped
CI / nodejs-lint (push) Has been skipped
CI / nodejs-build (push) Has been skipped
CI / iace-gt-coverage (push) Has been skipped
/app/artifacts gehört root und appuser darf nicht mkdir machen — Endpoint
crashte mit PermissionError. Default jetzt /tmp/breakpilot/agent_runs.
EVIDENCE_VAULT_ROOT-Env-Var bleibt für persistente Volumes nutzbar.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-06-08 18:15:11 +02:00
Benjamin Admin f4357a2e9b feat(agents): Specialist-Agents Phase 2 Foundation + Cookie-Policy-Agent
Sprint 1 — Foundation (User-Vorgabe 2026-06-08):

Foundation:
- _base.py: BaseSpecialistAgent ABC + Pydantic Contract
  (AgentInput/AgentOutput/Finding/Recommendation/McCoverage/EscalationLog).
- _base.lint_output(): Disclaimer-Linter verbietet "rechtssicher" /
  "garantiert" / "gesetzeskonform" — scrubbed inline + Log in notes.
- _registry.py: AgentRegistry mit MC-Owner-Mapping (verhindert
  Doppel-Ownership).
- _escalation.py: cascade(local → ovh). qwen2.5:7b default,
  OVH 120b als Stage-2 (deaktiviert wenn OVH_URL leer).
- _rollup.py: deterministisches Dedup ähnlicher actions zu
  Recommendations mit related_finding_ids[].
- _evidence_vault.py: Pro-Run File-Vault für Playwright-Videos,
  Screenshots, CSV. SHA256 + manifest.json. DSR-tauglich (delete_run).

Agenten:
- ImpressumAgent v2 (impressum/agent.py + mcs.py) — konsolidiert
  v1-Pattern-Match + v2-LLM-MVP unter dem neuen Contract. 12 MCs.
- CookiePolicyAgent v1 (cookie_policy/agent.py + mcs.py) — 12 MCs
  zu Cookie-Richtlinie-Vollständigkeit + KB-Layer für
  CMP-Vendor-Cross-Check.

Tests: 25/25 grün (10 Impressum + 9 Vault + 6 Cookie-Policy).

Roadmap: SSE-Test-Endpoint + Frontend-Tab → DSE/AGB-Agents →
Cookie-Banner-Themen-Agent → Cross-Doc-Konsistenz-Agent.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-06-08 17:40:05 +02:00
Benjamin Admin d6b8bf87c2 fix: 4 Bugs gemeinsam — B22 PDF + B17 Walk-Fallback + company_name + Plausibility-Fallback
CI / detect-changes (push) Successful in 9s
CI / branch-name (push) Has been skipped
CI / guardrail-integrity (push) Has been skipped
CI / secret-scan (push) Has been skipped
CI / dep-audit (push) Has been skipped
CI / test-python-backend (push) Successful in 29s
CI / test-python-document-crawler (push) Has been skipped
CI / test-python-dsms-gateway (push) Has been skipped
CI / sbom-scan (push) Has been skipped
CI / build-sha-integrity (push) Failing after 4s
CI / validate-canonical-controls (push) Successful in 10s
CI / loc-budget (push) Successful in 13s
CI / go-lint (push) Has been skipped
CI / python-lint (push) Has been skipped
CI / nodejs-lint (push) Has been skipped
CI / nodejs-build (push) Has been skipped
CI / test-go (push) Has been skipped
CI / iace-gt-coverage (push) Has been skipped
(1) B22 Cross-Domain (fix #59):
  Elli-Test fand AGB auf logpay.de NICHT obwohl URL in doc_entries
  korrekt. Vermutete Ursache: Discovery-Phase A drops/überschreibt
  Original-URL bei PDF-Fetch-Fail (word_count=0).
  Fix: _collect_audit_urls() iteriert über state.doc_entries +
  rejected_url + req.documents — Cross-Domain-Hosting ist
  unabhängig vom Text-Inhalt. Plus Trace-Logging für künftige
  Diagnose. Dedup per (doc_type, host_sld).

(2) B17 Audit-Walk-Fail-Fallback (fix #60):
  BMW v5 hatte audit_walk=None ohne Mail-Hinweis. Vermutlich
  180s-Timeout bei OneTrust-CMP-Banner-Tour.
  Fix: Timeout 180s → 300s. Plus: Bei Fail wird ein Hinweis-
  Stub mit error-Grund in state["audit_walk"] + HTML-Block
  geschrieben — Reviewer sieht den Fail statt silent-skip.

(3) company_name + origin_domain im Backend (fix #61):
  Frontend sendet seit ec03317 die zwei Felder — Backend ignorierte
  sie.
  Fix: ComplianceCheckRequest-Schema um company_name +
  origin_domain erweitert. phase_e_email priorisiert User-Input
  vor URL-Heuristik für site_name. Bei origin_domain ohne
  ableitbare doc_entries-domain wird der User-Input als domain
  übernommen.

(4) Plausibility-LLM Fallback-Modell (fix #62):
  qwen3:30b-a3b liefert auf großen DSEs (BMW 122 FAIL) gehäuft
  leere format='json'-Responses — Circuit-Breaker griff aber
  Phase blieb nutzlos.
  Fix: Default-Modell auf qwen2.5:7b umgestellt (4× kleiner,
  zuverlässiger bei format=json, ausreichendes Reasoning für
  PASS/MODIFY/DROP-Klassifikation). Plus Strategy-C eingeführt
  — Fallback-Modell (llama3.2:3b) wenn primary leer bleibt.
  BATCH_SIZE 4 → 3. ENV-Switches PLAUSIBILITY_LLM_MODEL +
  PLAUSIBILITY_FALLBACK_MODEL für Tuning.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-06-08 16:39:33 +02:00
Benjamin Admin b4ce3528e5 feat(impressum-agent): Tesla-Pattern + KBA-Hint + News-Doc-Type
CI / build-sha-integrity (push) Failing after 4s
CI / validate-canonical-controls (push) Successful in 11s
CI / loc-budget (push) Successful in 14s
CI / go-lint (push) Has been skipped
CI / python-lint (push) Has been skipped
CI / nodejs-lint (push) Has been skipped
CI / nodejs-build (push) Successful in 2m20s
CI / test-go (push) Has been skipped
CI / iace-gt-coverage (push) Has been skipped
CI / test-python-backend (push) Successful in 30s
CI / test-python-document-crawler (push) Has been skipped
CI / test-python-dsms-gateway (push) Has been skipped
CI / detect-changes (push) Successful in 6s
CI / branch-name (push) Has been skipped
CI / guardrail-integrity (push) Has been skipped
CI / secret-scan (push) Has been skipped
CI / dep-audit (push) Has been skipped
CI / sbom-scan (push) Has been skipped
User-Feedback Tesla-Impressum: 10 FAIL bei 46 Worten — viele False-
Positives. Nach Tuning: 5 juristisch saubere Findings.

Impressum-Agent Patterns:
  - name_anbieter zusätzlich label-frei matchen (Firma+Rechtsform+
    Anschrift, Tesla schreibt ohne "Anbieter:" Label).
  - vertretungsberechtigte akzeptiert jetzt "Management" / "Director"
    als alternative (US-Konzern-Habit), aber emittiert separates
    Sub-Finding "Label sollte Geschäftsführer für § 5 TMG sein".
  - aufsichtsbehoerde-Pattern um KBA / Bundesnetzagentur erweitert.
  - NEU: verantwortlicher_redaktion (§ 18 MStV bei Blog/News).
  - NEU: verbraucher_streitbeilegung (§ 36 VSBG bei B2C).
  - Auto-Detection von Automotive-Branche: explizite Begriffe ODER
    bekannte Hersteller-Namen (Tesla/BMW/Mercedes/Audi/VW/Porsche…).
    Triggert KBA-Hint im aufsichtsbehoerde-Finding-Action.

Frontend (_document_types.ts):
  - Extrahiert aus ComplianceCheckTab.tsx (vorher inline).
  - NEU: doc_type "news" für Blog/Newsroom-URL → § 18 MStV-Pflicht-
    angaben prüfen. User-Hinweis: tesla.com/de_de/blog ist
    relevanter Audit-Input neben DSE/Impressum.

Smoke gegen Tesla-Impressum (46 Worte):
  Vorher 10 Findings (5 davon FP).
  Jetzt 5 Findings — alle juristisch korrekt:
    [MED] Management statt Geschäftsführer
    [LOW] KBA als Aufsichtsbehörde fehlt
    [MED] § 18 MStV-Verantwortlicher fehlt (Tesla Blog!)
    [MED] § 36 VSBG-Hinweis fehlt
    [MED] ODR-Plattform-Link fehlt

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-06-08 12:07:08 +02:00
Benjamin Admin d208a2bde2 feat: Mail-Restrukturierung + B22 Cross-Domain-Doc-Detector
CI / validate-canonical-controls (push) Successful in 11s
CI / loc-budget (push) Successful in 13s
CI / go-lint (push) Has been skipped
CI / test-go (push) Has been skipped
CI / iace-gt-coverage (push) Has been skipped
CI / detect-changes (push) Successful in 7s
CI / branch-name (push) Has been skipped
CI / guardrail-integrity (push) Has been skipped
CI / secret-scan (push) Has been skipped
CI / dep-audit (push) Has been skipped
CI / sbom-scan (push) Has been skipped
CI / build-sha-integrity (push) Failing after 4s
CI / python-lint (push) Has been skipped
CI / nodejs-build (push) Has been skipped
CI / nodejs-lint (push) Has been skipped
CI / test-python-backend (push) Successful in 30s
CI / test-python-document-crawler (push) Has been skipped
CI / test-python-dsms-gateway (push) Has been skipped
User-Feedback BMW v5: "740 Cookies verschwunden auf 31, Übersicht
verloren". Drei Anpassungen:

Mail-Restrukturierung (_executive_summary.py + _compose.py):
  - render_executive_summary(): Top-of-mail TL;DR mit
    Compliance-Score (gross + farbig), Top-3-Findings nach
    Severity, Cookie-Statistik (deklariert/Browser/Drittland),
    Severity-Verteilungs-Chips.
  - collapsible(): wrapt jeden Block in <details>/<summary>.
    Mailpit + alle modernen Mail-Clients rendern das nativ.
  - _compose.py: alle 18+ B-Blöcke + per_doc + per_theme +
    legacy_html in Akkordeons. NUR Critical-Findings + Sofort-
    massnahmen sind immer offen — Reviewer sieht ~15 Zeilen
    Übersicht und klappt selektiv auf.
  - Cookie-Inventar (742) hat jetzt eigene Sektion ganz oben
    (Akkordeon "🍪 Cookie-Inventar"), Vendor-Karten parallel.

B22 Cross-Domain-Legal-Doc-Detector (cross_domain_doc_check.py):
  Real-Beispiel User-Feedback: Elli's AGB liegt auf docs.logpay.de
  statt elli.eco. Detektor erkennt SLD-Mismatch:
  - HIGH bei agb / widerruf (vertragsrelevant)
  - MEDIUM bei dse / nutzungsbedingungen
  - INFO bei cookie / impressum (Best-Practice)
  Norm: DSGVO Art. 28 (AVV-Pflicht für Hosting) + Art. 13 Abs. 1
  lit. e (Empfänger) + § 312i BGB (Cool-URLs).
  9/9 Tests grün inkl. Elli/LogPay Pattern.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-06-08 11:35:55 +02:00
Benjamin Admin 5c5d676f01 feat: Plan B + A + C — DSE-Versions-MCs + Legacy-URL + Multi-Version
CI / detect-changes (push) Successful in 7s
CI / branch-name (push) Has been skipped
CI / guardrail-integrity (push) Has been skipped
CI / secret-scan (push) Has been skipped
CI / dep-audit (push) Has been skipped
CI / sbom-scan (push) Has been skipped
CI / loc-budget (push) Failing after 11s
CI / python-lint (push) Has been skipped
CI / test-python-backend (push) Successful in 28s
CI / test-python-document-crawler (push) Has been skipped
CI / test-python-dsms-gateway (push) Has been skipped
CI / build-sha-integrity (push) Failing after 4s
CI / validate-canonical-controls (push) Successful in 10s
CI / go-lint (push) Has been skipped
CI / nodejs-build (push) Has been skipped
CI / nodejs-lint (push) Has been skipped
CI / test-go (push) Has been skipped
CI / iace-gt-coverage (push) Has been skipped
Drei verwandte Mechanismen für DSE-Beweisbarkeit + URL-Hygiene.

Plan B + PDF — Versions-Beweisbarkeit-MCs (dse_checks.py):
  - mc-dse_version_date (HIGH) — sichtbares Stand/Versionsdatum
    Pflicht. 12 Regex-Pattern: "Stand: April 2024", ISO-Datum,
    "Letzte Aktualisierung", "Version 3.2", englische
    Varianten ("Last updated", "Effective date as of …").
    Norm: Art. 7 Abs. 1 DSGVO (Nachweisbarkeit Einwilligung).
  - mc-dse_version_proof (MED) — PDF-Download oder
    versionierte Archiv-URL. Reine HTML-DSE ohne Snapshot ist
    juristisch fragil. 8 Pattern: .pdf, Download-Hinweis,
    web.archive.org, /dse-vNNN.html.
    Norm: DSK-Orientierungshilfe 2024.

Plan A — Legacy-URL-Discovery (legacy_url_discovery.py + B20):
  Vier komplementäre Quellen:
    A.1 /sitemap.xml + Sub-Sitemaps parsen, auf compliance-
        relevante Slugs filtern
    A.2 archive.org/wayback/available pro Slug — wenn Wayback
        zeigt ≥18 Monate alten Snapshot UND Seite heute noch
        200 liefert UND nicht im Footer → Legacy-Verdacht
    A.3 Slug-Permutations: 6 doc_types × 6 Slug-Varianten ×
        5 Lang-Prefixe × 4 Brand-Parameter
    A.4 Banner-Modal-Links (über consent-tester Stufe 4 Tour)
  Mail-Block "🗂️ Legacy-URL-Inventar" mit Tabelle: URL · HTTP ·
  Wayback-Alter · Footer · Empfehlung (301/Offline/Behalten).
  Engine entscheidet NICHT was Legacy ist — präsentiert das
  Inventar, Kunde wählt.

  Real-World-Smoke Elli:
    /en/cookies → HTTP 200, Wayback 69 Mo alt, nicht im Footer
                  → "Legacy-Verdacht, 301 setzen"
    /en/impressum → HTTP 302, redirected → "behalten"

Plan C — Multi-Version-DSE-Analyse (multi_version_dse.py):
  Wenn ≥2 DSE-URLs reachable: pro Variante DSB-Name + Datum +
  Wortzahl + SHA-256 extrahieren, Inkonsistenzen flaggen
  (date_divergent, dsb_divergent, no_date_count).
  Mail-Block "📑 Mehrere DSE-Versionen erkannt" mit
  Vergleichstabelle + rotem Hinweis "Nur eine Version kann
  gültig sein". Beispiel Elli: /de/datenschutz (Mollstr-DSB,
  2022) vs /de/datenschutzerklaerung?brand=elli (Proliance,
  ohne Datum).

API-Response erweitert um legacy_url_inventory +
html_blocks.legacy_urls + multi_version_dse_html im V2-Layout.

ENV-Override: LEGACY_URL_DISABLED=1.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-06-08 10:04:14 +02:00
Benjamin Admin 663a1c3e38 feat(document-library): zentrale Doc-Übersicht + Workflow-Auto-Select (Phase 3)
CI / detect-changes (push) Successful in 9s
CI / branch-name (push) Has been skipped
CI / guardrail-integrity (push) Has been skipped
CI / secret-scan (push) Has been skipped
CI / dep-audit (push) Has been skipped
CI / sbom-scan (push) Has been skipped
CI / build-sha-integrity (push) Failing after 4s
CI / validate-canonical-controls (push) Successful in 11s
CI / loc-budget (push) Failing after 12s
CI / go-lint (push) Has been skipped
CI / python-lint (push) Has been skipped
CI / nodejs-lint (push) Has been skipped
CI / nodejs-build (push) Successful in 2m16s
CI / test-go (push) Has been skipped
CI / iace-gt-coverage (push) Has been skipped
CI / test-python-backend (push) Successful in 30s
CI / test-python-document-crawler (push) Has been skipped
CI / test-python-dsms-gateway (push) Has been skipped
Neue Compliance-Admin-Seite /sdk/document-library: zeigt alle compliance_
legal_documents mit aktueller Version, gruppiert nach Empfehlungs-Klassi-
fikation, filterbar nach Status + Volltextsuche.

Backend (Service + Routes):
- LegalDocumentService.list_documents_with_versions() — JOIN über docs +
  latest/published version in einem Roundtrip statt N+1
- GET /api/v1/compliance/legal-documents/documents-with-versions
  liefert {documents:[{...doc, latest_version, published_version}]}

Admin-Frontend:
- app/sdk/document-library/page.tsx (350 LOC)
  - Lädt Docs + Recommend parallel
  - Mapped jedes Doc per .type → Recommend-Item (klassifiziert in
    required/recommended/optional/uncategorized)
  - 4 Sektionen mit Klassifikations-Chip + Anzahl-Badge
  - Tabelle pro Sektion: Titel · Type · Status · Version · Geändert · Override
  - Status-Filter (alle / draft / review_internal / review_client /
    approved / published / archived / rejected)
  - Klick auf Zeile → /sdk/workflow?doc=<uuid>
  - Empty state mit Link zum Generator (Bulk-Modus)
- workflow/page.tsx: auto-select bei ?doc=<uuid> URL-Param
- lib/sdk/types/sdk-steps.ts: 'document-library' bei seq=2500 im Paket
  'dokumentation' registriert (sichtbar in der SDK-Sidebar)

Workflow-Hookup vervollständigt: Library → click → Workflow öffnet
direkt das gewünschte Dokument im SplitViewEditor, keine manuelle
Selektion über DocumentSelectorBar mehr nötig.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-06-08 09:32:25 +02:00
Benjamin Admin e34f7cb507 feat(legal-docs): 5-stage lifecycle (draft → review_internal → review_client → approved → published)
CI / detect-changes (push) Successful in 7s
CI / branch-name (push) Has been skipped
CI / guardrail-integrity (push) Has been skipped
CI / secret-scan (push) Has been skipped
CI / dep-audit (push) Has been skipped
CI / sbom-scan (push) Has been skipped
CI / build-sha-integrity (push) Failing after 4s
CI / validate-canonical-controls (push) Successful in 11s
CI / loc-budget (push) Failing after 14s
CI / go-lint (push) Has been skipped
CI / python-lint (push) Has been skipped
CI / nodejs-lint (push) Has been skipped
CI / nodejs-build (push) Has been skipped
CI / test-go (push) Has been skipped
CI / iace-gt-coverage (push) Has been skipped
CI / test-python-backend (push) Successful in 30s
CI / test-python-document-crawler (push) Has been skipped
CI / test-python-dsms-gateway (push) Has been skipped
Phase 1 of the workspace-cutover initiative: compliance becomes the
single source of truth for documents. Step one is making the existing
compliance_legal_documents workflow rich enough to express the DSB→
Mandant approval pattern that the workspace's 5-stage UI needed.

Migration 148:
- Adds CHECK constraint on status (was free-form VARCHAR20)
- Allows: draft, review, review_internal, review_client, approved,
  published, archived, rejected (legacy "review" kept for backward
  compat — 0 existing rows so no backfill needed)
- Adds CHECK on approvals.action with extended values:
  submitted_internal, submitted_client, approved_internal,
  approved_client, rejected_internal, rejected_client
- Adds 6 new columns for the richer audit trail: submitted_by/at,
  approved_internal_by/at, approved_client_by/at

Service:
- New methods submit_internal_review, approve_internal, approve_client
- submit_review / approve kept as backwards-compat aliases that map to
  the new methods
- reject() now reads current status to log specific rejected_internal
  or rejected_client action
- _version_to_response includes all new audit fields

Routes:
- POST /versions/{id}/submit-internal-review
- POST /versions/{id}/approve-internal  (DSB sagt OK → Mandant ist dran)
- POST /versions/{id}/approve-client    (Mandant sagt OK → approved)
- Existing submit-review / approve endpoints stay but map through aliases

Schema:
- VersionResponse extended with optional submitted_by/at,
  approved_internal_by/at, approved_client_by/at fields

This unlocks Phase 2 (Generate-All in compliance generator), Phase 3
(Document-Library tab in admin), Phase 4 (workspace cutover — drop its
own document storage and route everything through this lifecycle).

[migration-approved]

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-06-08 08:31:08 +02:00
Benjamin Admin 327e6a8984 fix(b19): UNK-Noise drastisch reduzieren
BMW4 zeigte 1037 UNK-Findings — die Mail wurde damit unleserlich.
Drei pragmatische Anpassungen:

1. UNK severity: LOW → INFO. Mail-Renderer zeigt jetzt nur
   HIGH/MEDIUM/LOW; INFO bleibt im API-Payload + CSV.
2. UNK wird NICHT emittiert wenn Vendor=First-Party-Owner
   (z.B. "BMW AG" auf bmw.de). Heuristik _is_first_party_owner
   vergleicht Vendor-Name gegen Domain-SLD.
3. auto_learning threshold ≥3 Sites → ≥1 Site. Second-time-Audit
   einer Site hat ihre eigenen Cookies bereits gelernt → kein
   UNK mehr. Single-site Auto-Learning ist absichtlich
   konservativ (Annotation, kein Truth).

Effekt: erwartete Reduktion bei BMW von 1037 UNK → ~50-100
(nur unbekannte 3rd-party-Vendoren). Mail wird lesbar, MAE-
Findings (Salesforce-as-essential) bleiben prominent sichtbar.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-06-08 08:20:39 +02:00
Benjamin Admin c908fcd5eb feat(b19): Cookie-Coherence — 3-Layer-Lookup + Vendor-Karten + CSV
Adressiert das BMW-Beispiel (740 Cookies, Salesforce als "essential"
mit 1-Jahres-Lifetime, Pseudo-Zwecke wie "Siehe dazugehörige
Datenverarbeitung"). User-Konzept "Regulation als Code".

Step 1 — cookie_library_lookup.py (3 Layer):
  1. Override = cookie_knowledge_db.py + extended (74) für
     Schrems-II / EUGH / EU-Alternative — BreakPilot-juristische-IP.
  2. Truth-Base = compliance.cookie_library (2287 aus Open Cookie
     Database, CC0). actual_category als Wahrheit.
  3. Auto-Learning = cookie_behavior_audits — Cross-Site-Konsens
     wenn ≥3 Sites denselben Cookie melden.

  Match: exact > prefix (mit Separator-Check) > wildcard. Kurze
  Library-Namen ("c", "ID") brauchen exact-match — verhindert
  False-Positive auf "completely_unknown". Trailing-Underscore
  in OCD ("guest_uuid_essential_") wird als implicit-wildcard
  interpretiert.

Step 2 — cookie_coherence_check.py (B19, 6 Finding-Typen):
  - MARKETING_AS_ESSENTIAL (HIGH): KB sagt actual=marketing, Site
    deklariert essential/erforderlich → Einwilligung wird umgangen
  - LIFETIME_TOO_LONG_FOR_ESSENTIAL (MED): essential + >90d
  - PSEUDO_PURPOSE (LOW): "Siehe dazugehörige Datenverarbeitung"
    / <4 Wörter (suppressed wenn Vendor-Purpose substantial ist)
  - MISSING_COUNTRY (LOW): vendor_country leer trotz KB-Hit
  - UNKNOWN_VENDOR (LOW): nicht in KB → Auto-Learning-Kandidat
  - DUPLICATE_VENDOR (MED): selber Vendor in N Kategorien =
    Stack-Aufspaltung um Marketing unter "essential" zu schmuggeln

  Jedes Finding mit recommended_action ("Cookie X aus 'erforderlich'
  raus und in 'Marketing' setzen").

Step 3 — cookie_observation_logger.py:
  Loggt nach jedem Audit alle (cookie, site, declared_purpose) in
  compliance.cookie_behavior_audits → Basis für Cross-Site-Konsens
  in Layer 3.

Step 4 — cookie_csv_exporter.py:
  cookies-full-{check_id}.csv mit 21 Spalten (Name, Vendor decl/KB,
  Cat decl/KB, Lifetime decl/KB, Country, Opt-Out, 8x FIND_* flags,
  recommended_action). UTF-8 BOM für Excel.
  ZIP-Attachment: erweitert audit_walk_zip_builder um extra_files=
  parameter; phase_e ruft mit cookies-full-...csv auf.

Step 5 — mail_render_v2/_vendor_cards.py:
  Statt 740 Cookie-Rows: Aggregation pro Vendor mit Cookie-Count +
  Issue-Count + 1-2 Beispiel-Cookies + Issue-Type-Tags. Top 30
  Vendoren in der Mail, Rest nur in CSV. Sortiert nach Issue-Score.

Step 6 — render_info_box_rechtsrahmen():
  Generic Header-Info-Box mit Art. 13 DSGVO + § 25 TDDDG + Art. 5
  + § 5 UWG + § 30/130 OWiG. Immer angezeigt, kein explicit-
  finding-mapping (User-mündigkeit).

Orchestrator + _compose: run_b19 + render_vendor_cards +
  render_info_box_rechtsrahmen ins V2-Layout.

Tests: 28/28 grün (15 lookup + 13 coherence).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-06-07 23:48:04 +02:00
Benjamin Admin 0b29d1fada fix(cookie-inventory): fuzzy prefix-match + BMW-GT-File
BMW-Mail zeigte 738 deklariert / 31 Browser / **0 OK** — alle
Browser-Cookies landeten als UNDOC, alle deklarierten als ORPH.
Ursache: exact-string-match scheitert bei Suffix-Cookies.

_norm_for_match() + _matches() Helper:
  - Strippt Wildcards (`*`, `.*`, `<id>`, `{var}`) + Lower-Case
  - Erhält führende Underscores (`__cf_bm`, `_ga` sind meaningful)
  - Prefix-Match in BEIDE Richtungen, min 3 Chars (kein "_"-Garbage)

build_cookie_inventory():
  - Für jeden Browser-Cookie: längster Prefix-Match in declared wählen
  - browser-to-decl Index + decl-match-Index für O(N×M) → O(N+M)
  - matched browser-keys werden aus all_keys entfernt → kein
    Double-Count (vorher: ORPH + UNDOC parallel)

Realistischer BMW-Match-Test:
  declared=[_ga, _gid, __cf_bm, AMP_TOKEN, _fbp, intercom-session,
            _pk_id.*, OptanonConsent]
  browser= [_ga_K8YL3M9T, _gid_xyz, __cf_bm_actual_hash,
            AMP_TOKEN_runtime, _fbp_123, intercom-session-2026,
            _pk_id.5.7d8, OptanonConsent]
  → 8 OK (vorher 0)

BMW-GT-File (zeroclaw/docs/ground-truth/bmw_de_2026-06-07.json):
  - OneTrust CMP + 14 erwartete Vendoren
  - Cookie-Count-Ranges (browser 80-250, deklariert 300-800)
  - 7 expected findings inkl. neuem COOKIE-INVENTORY-MATCH-001 als
    Benchmark gegen den Fuzzy-Match-Bug

Tests: 14/14 grün (4 _norm_for_match + 5 _matches + 5
build_cookie_inventory inkl. realistic_bmw_pattern).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-06-07 21:29:21 +02:00
Benjamin Admin b16130369a feat(b17): Stufe 4 banner-tour + Stufe 5 annotierte Screenshots + V2-default
Stufe 4 — Cookie-Banner-Tour vor dem Accept-Klick:
  - audit_walk_banner_tour.tour_cookie_banner(): öffnet Settings
    (16 Phrase-Varianten), scrollt vertikal, aktiviert jedes
    [role=tab], expandet jedes [aria-expanded=false] / details /
    summary + 14 CMP-spezifische Selektoren. Max 35 Klicks,
    Best-Effort.
  - audit_walk_recorder ruft tour_cookie_banner() VOR
    _try_accept_banner auf — Reviewer sieht den vollen Consent-
    Katalog im Video (Vendor-Liste, Kategorien, Zwecke).
  - Recorder unter 500 LOC (412+155 split).

Stufe 5 — Annotierte Screenshots pro Finding:
  - finding_annotator.annotate_url(): WebKit headless, JS-Inject
    eines rot-banner-Labels oben + roter Outline um das Element
    (Selector oder Text-Match).
  - finding_annotator.annotate_findings(): dispatched 3 Cases —
    B1 Tap-Target (Anchor markiert mit "Tap-Target X×Y px"),
    B16 URL-Slug-Drift (404-Seite mit "/<slug> 404"),
    B13 Widerruf (Footer markiert "Widerruf-Link fehlt").
  - routes_audit_walk.POST /annotate-findings (consent-tester).
  - _b17_wiring ruft annotate-findings nach record_audit_walk und
    speichert annotations in walk.annotations.
  - audit_walk_zip_builder packt PNGs nach findings/<name>.png ins
    ZIP — Reviewer hat Beweis-Bilder im Postfach.

Plausibility Circuit-Breaker:
  - Nach 6 consecutive empty batches (PLAUSIBILITY_EMPTY_BUDGET=6)
    bricht die ganze Phase ab statt 200 Calls zu warten. Fix für
    qwen3-down + große DSE-Sites (BMW: ohne Breaker 21min, mit
    Breaker ~3min).

audit_walk_zip_builder fängt walk.annotations ab und legt sie unter
  findings/<fname>.png im ZIP-Anhang ab.

V2-Default:
  - docker-compose.yml backend-compliance.environment.MAIL_RENDER_V2:
    default 'true'. Ohne diesen Override liefert die Engine
    weiterhin das alte Legacy-Mail-Layout, in dem die B-Wiring-
    Blöcke nicht sichtbar sind.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-06-07 20:44:42 +02:00
Benjamin Admin e8ff75cbfe feat: Backlog 1-5 — soft-hints, chatbot-discovery, API-payload, LLM-Agent
5 Backlog-Items aus dem Multi-Site-Briefing in einem Sprint:

1. B13 B2C-Soft-Hints — Versicherungs/Tarif/Buchungs-Marker
   _B2C_WEAK erweitert um "Reiseversicherung", "Tarifrechner",
   "Online-Antrag", "Flug buchen", "Stromtarif" etc.
   Fängt Allianz-Reise-Chatbot (vorher False-Negative).

2. Chatbot-Policy-Discovery (chatbot_policy_discovery.py)
   Probt 14 Standard-Slugs (privacypolicychatbot, chatbot-datenschutz,
   ai-policy, ki-datenschutz, ...) × 5 Lang-Prefixe auf jeder
   submitted Origin. Successful >300-Wort-Findings werden in
   doc_texts['dse'] gemerged. Audit-Trail über
   doc_entries[dse].chatbot_policy_sources.
   Hebt Westfield-iAdvize-Lücke.

3. API-Response-Payload erweitert
   phase_f_persist.response um extra_findings, audit_walk und
   html_blocks erweitert. B-Wiring-Output (B1, B3-B18) ist nicht
   mehr nur im Mail-HTML versteckt — externe Aufrufer sehen jeden
   Finding. Schema additiv, legacy clients ignorieren neue Felder.

4. Plausibility-LLM Empty-Response-Fix
   Resilienz-Strategie A→B→C→D:
   A) format='json' (strict, default)
   B) format='' (loose, _try_extract_json mit ```json-fence + prose-
      wrap-Unterstützung)
   C) Split-Batch-Recursion (vorhanden)
   D) Give up, leeres dict (callers behandeln als skipped)
   Plus _post_llm() als isolierter LLM-Call-Helper, catched
   Network-Errors.

5. Specialist-Agents Phase 2 LLM (MVP) — Impressum-Agent
   impressum_agent_llm.py: qwen3:30b-a3b mit § 5 TMG System-Prompt,
   business_scope-hints aus profile_dict. Output identisches Schema
   wie pattern-agent für ein Merge ohne API-Bruch.
   _b18_wiring.py orchestriert beide Agents + deduplet nach
   field_id, rendert lila V2-Block mit KB/LLM-Tags pro Finding.
   Pattern-first im Dedup (deterministisch + stable).

Tests: 107/107 grün (7 Test-Suites + chatbot-discovery + b18).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-06-07 18:41:54 +02:00
Benjamin Admin a2cae94526 fix(b9)+test: real-world false-positives + multi-site GT-bench
Real-World-Smoke gegen Westfield Hamburg (englische DSE) deckte
B9-Bug auf: Pattern matched "If mfi Immobilien Marketing GmbH",
"Discover our Se", "Centre Se" usw. als angebliche Entitäten —
englische Connector-Worte + abgeschnittene "Services"-Strings.

B9 Fix:
  - _name_is_blocked() strenger: min 2 Worte, mind. einer ≥4 Chars
    UND capitalized (vor Legal-Form-Suffix). Filtert "Se", "ag",
    "If ...", "Centre Se" zuverlässig.
  - _clean_entity_name() strippt jetzt führende Lowercase-
    Connector-Worte (kontextuelle Verben wie "by", "If",
    "according to").
  - _dedup_substring() collapses
    "mfi Immobilien Marketing GmbH" + "Marketing GmbH" zum längeren.
  - Anwendung sowohl im HRB-Pfad als auch im Fallback-Pfad.

Multi-Site-Bench (2 neue GTs, 2 Engine-Runs):
  - zeroclaw/docs/ground-truth/westfield_hamburg_2026-06-07.json:
    iAdvize-Chatbot bekannt, Unibail-Management-Verantwortlicher.
  - zeroclaw/docs/ground-truth/allianz_reise_chatbot_2026-06-07.json:
    Twilio-Infrastruktur (US-Transfer), lit. f + 2-Mo-Retention.
  - zeroclaw/docs/audits/2026-06-07-multi-site-walk-results.md:
    Sprint-Briefing mit Detektor × Site Matrix, Audit-Walk-DSMS-
    CIDs, identifizierte Real-World-Bugs + Backlog.

Audit-Walk-Endstand (B17 Stufen 1-3):
  - Westfield: 400 KB Video, CID Qm…WJYfYDt…BXgwt
  - Allianz:   1 MB Video,   CID Qm…XFuiC4z…9mSMM
  Beide DSMS-persistiert, Reviewer kann jederzeit verifizieren.

Tests: 21/21 grün (test_impressum/test_elli_gt_coverage).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-06-07 17:51:17 +02:00
Benjamin Admin cb4b352846 feat(b17): Playwright Audit-Walk-Video (Stufe 1, #7)
Nimmt einen kompletten Site-Walk als WebKit-Browser-Session
inkl. Video auf. Reviewer kann nachträglich exakt nachvollziehen,
wie die Engine zum Befund kam.

consent-tester:
  - services/audit_walk_recorder.py: Playwright record_video_dir,
    iPhone-Viewport-free 1280×800. Goto homepage → Banner-Accept
    (Best-Effort: 12 Text-Phrasen + 5 CMP-Fallback-Selektoren) →
    Footer-Links sammeln (compliance-relevant gefiltert) →
    pro Link navigate + Dwell-Time → JSON-Action-Index mit
    UTC-Timestamps + SHA-256 vom Video als Manipulation-Schutz.
  - routes_audit_walk.py: POST /scan-audit-walk; statische
    Serves für /audit-walks/{walk_id}/video.webm + walk.json.
  - main.py: Router registriert.

backend:
  - _b17_wiring.py: Triggert /scan-audit-walk, speichert
    Walk-Metadata in state["audit_walk"]. Render-Block mit
    HTML-Tabelle aller Actions (HH:MM:SS + Aktion + Detail) +
    Links zu Video und walk.json.
  - _orchestrator.py: run_b17 nach run_b16, async-aufgerufen.
  - mail_render_v2/_compose.py: audit_walk_html im V2-Layout.
  - test_b17_audit_walk.py: 8 Tests (Render-Pfade + Wiring).

Stufe-2 (Akkordeon-Expansion) und Stufe-3 (DSMS-CID-Anchor)
folgen separat.

Real-World-Smoke gegen Elli:
  - 581 KB Video, SHA-256 verifizierbar
  - 3 Footer-Links besucht (Impressum, Datenschutzerkl., Nutzungs-)
  - 6 Actions im JSON-Index

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-06-07 17:20:13 +02:00
Benjamin Admin 529c032641 fix(b9+b14): Real-World-Smoke-Befunde aus Elli-Audit (2026-06-07)
Smoke gegen www.elli.eco hat 3 Bugs offengelegt, die in den
synthetischen Tests nicht greifbar waren — Real-Texte haben
Abkürzungen, HTML-Stripping-Artefakte, andere Formulierungen.

B9 Multi-Entity-Impressum — vorher: 13 "Entities" statt 2.

  - Block-Boundary jetzt HRB-Anker-basiert (jeder HRB-Eintrag
    markiert eine Entity). Robuster als Legal-Form-Anker, der bei
    "Programmierung der Webseite Acme GmbH" über-matchte.
  - _NAME_BLOCKLIST gegen 11 typische False-Positives
    (programmierung, webseite, umsatzsteueridentifik, ...).
  - _LEADING_NOISE_RE strippt Email-TLD-Artefakte ("eco "),
    deutsche Artikel ("Die "), URL-Fragmente.
  - _USTID_PAT fängt jetzt auch die Vollform
    ("Umsatzsteueridentifikationsnummer der … ist DE…") über eine
    zweite Pattern-Alternative mit [\s\S]{0,80}? Bridge.
  - Dedup gleicher Entity-Namen — Mehrfacherwähnung in einem Doc
    zählt als EINE Entity.
  - Fallback auf alten Legal-Form-Anker wenn keine HRBs vorhanden
    (z.B. e.V. ohne HR-Pflicht).

B14 Retention-Conflict — Anchor-Liste erweitert:

  - "protokolldat" / "protokollierung der zugriffe" /
    "zugriffsdat" / "zugriffsprotokoll" als zusätzliche
    Logfile-Anchors (Elli's reale DSE-Wortwahl statt "Logfile").

B15 AI-Legal-Basis — kein Code-Fix. Elli's aktuelle DSE enthält
keine LLM-Provider-Erwähnung mehr; der GT-Anker (2026-06-06) ist
seither veraltet. 0 Findings ist korrekt für den aktuellen Stand.

Tests: 3 neue Real-World-Regression-Tests in
test_impressum_multi_entity_check.py::TestRealWorldElliPattern.
Combined: 75/75 grün.

Real-World-Smoke gegen Elli (HTTP→Text via crude strip):
  B9:  Entities 13→2 ✓, IMPRESSUM-MULTI-UST_ID → VW ✓
  B13: 1 Finding (b2c_strong) ✓
  B14: 0 (Elli hat aktuell nur EINEN Retention-Wert für Logs)
  B15: 0 (LLM nicht erwähnt, korrekt)
  B16: 3 Findings (impressum/dse/cookie Standard-Slug-Brüche) ✓

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-06-07 08:50:46 +02:00
Benjamin Admin 4cad0a29ad fix(company-profile): deserialize JSONB columns in row_to_response
CI / detect-changes (push) Successful in 9s
CI / branch-name (push) Has been skipped
CI / guardrail-integrity (push) Has been skipped
CI / secret-scan (push) Has been skipped
CI / sbom-scan (push) Has been skipped
CI / dep-audit (push) Has been skipped
CI / build-sha-integrity (push) Failing after 3s
CI / validate-canonical-controls (push) Successful in 13s
CI / loc-budget (push) Failing after 15s
CI / go-lint (push) Has been skipped
CI / python-lint (push) Has been skipped
CI / nodejs-lint (push) Has been skipped
CI / nodejs-build (push) Has been skipped
CI / test-go (push) Has been skipped
CI / iace-gt-coverage (push) Has been skipped
CI / test-python-backend (push) Successful in 30s
CI / test-python-document-crawler (push) Has been skipped
CI / test-python-dsms-gateway (push) Has been skipped
Raw text() queries return JSONB columns as JSON-encoded Python strings,
not as Python list/dict objects. The existing isinstance check then fails
and silently falls back to defaults — so list-valued fields like
target_markets, offerings, processing_systems, ai_systems were always
returned as their defaults regardless of stored content.

Add a JSON-decode pass over _JSONB_FIELDS before the type check.

Verified: PATCH of target_markets=["DE","EU"] now round-trips through
GET correctly. Previously the DB had the right data but GET returned
["DE"] (the default).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-06-07 08:26:14 +02:00
Benjamin Admin 5958b575b1 fix(company-profile): replace :param::jsonb with CAST(:param AS JSONB)
CI / detect-changes (push) Successful in 9s
CI / branch-name (push) Has been skipped
CI / guardrail-integrity (push) Has been skipped
CI / secret-scan (push) Has been skipped
CI / dep-audit (push) Has been skipped
CI / sbom-scan (push) Has been skipped
CI / build-sha-integrity (push) Failing after 4s
CI / validate-canonical-controls (push) Successful in 10s
CI / loc-budget (push) Failing after 14s
CI / go-lint (push) Has been skipped
CI / python-lint (push) Has been skipped
CI / nodejs-lint (push) Has been skipped
CI / nodejs-build (push) Has been skipped
CI / test-go (push) Has been skipped
CI / iace-gt-coverage (push) Has been skipped
CI / test-python-backend (push) Successful in 28s
CI / test-python-dsms-gateway (push) Has been skipped
CI / test-python-document-crawler (push) Has been skipped
SQLAlchemy's text() parser treats `:name::jsonb` ambiguously when the
trailing `::jsonb` follows immediately — psycopg2 receives the literal
`:name::jsonb` string and raises a SyntaxError because `:` isn't a
psycopg2 placeholder syntax.

The fix uses ANSI CAST(:name AS JSONB) which is semantically identical
in PostgreSQL but lets SQLAlchemy unambiguously substitute the
parameter.

Effects: PATCH and POST/upsert on /api/v1/company-profile now actually
update the row. Before this fix both endpoints returned 500 (or 200
with stale data) and never persisted edits.

Files touched:
  - _company_profile_sql.py (build_upsert_params / execute_update /
    execute_insert): 12 JSONB columns
  - company_profile_service.py: PATCH dynamic JSONB column,
    audit log insert

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-06-07 00:42:16 +02:00
Benjamin Admin 65e8bb9d42 feat(b16): Footer-Label-vs-URL-Slug-Drift-Check (GT URL-STRUCTURE-001)
Erkennt: gängige Footer-Labels / Bookmark- + SEO-Erwartungs-Slugs
(z.B. "Cookie-Richtlinie", "AGB", "Datenschutzerklärung") liefern
404, während das Doc tatsächlich unter einem abweichenden Slug
ausgeliefert wird.

GT-Anker (Elli URL-STRUCTURE-001):
  Footer-Label "Cookie-Richtlinie" → /cookie-richtlinie 404
  Real: /de/cookies
  → externe Bookmarks und Google-Treffer brechen.

Heuristik:
  - Aus auto-discovered URLs Origin + Sprach-Prefix extrahieren
    (z.B. /de, /de-de)
  - Pro doc_type 2-4 kanonische Standard-Slugs probieren (parallel
    via ThreadPoolExecutor, 2s Timeout, HEAD → GET fallback bei 405)
  - Wenn alternative Slug 404/410 → LOW Finding pro doc_type
  - Probe-Cap auf 18 Requests gesamt (Network-Noise-Schutz)
  - Abschaltbar via URL_SLUG_PROBE_DISABLED=1

Severity: LOW (Best-Practice, kein juristisches Hardfail).

Tests: 13/13 grün (Strip-Helper 4 + Origin-Helper 3 + Check-Pfade 6
inkl. mocked _head_status).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-06-07 00:23:25 +02:00