Commit Graph

1182 Commits

Author SHA1 Message Date
Benjamin Admin 873997c13b feat(vvt): V3 — LLM vendor extraction fallback for unknown CMPs
When the cookie text has no captured CMP payload (long-tail sites that
don't use ePaaS/OneTrust/Cookiebot/etc.) we now fall back to a Qwen → OVH
LLM cascade to extract a structured vendor list from the policy text.

New module backend/compliance/services/vendor_llm_extractor.py:
- extract_vendors_via_llm(cookie_text): runs Qwen first (local Ollama),
  then OVH if Qwen returns nothing usable.
- System prompt instructs the model to return STRICT JSON only:
  {vendors: [{name, country, purpose, category, opt_out_url,
   privacy_policy_url, persistence, cookies: [...]}]}
- Lenient JSON parser tolerates code-fences, prose wrappers, dict vs list.
- _normalize() caps array sizes (80 vendors, 30 cookies each), validates
  URLs (must be http(s)), trims fields to reasonable lengths.

Route integration (agent_compliance_check_routes.py):
- After named-CMP extract: if cmp_vendors is empty AND the cookie text
  has ≥500 words (otherwise it's likely navigation chrome), invoke the
  LLM extractor. Progress message 'Vendor-Liste per LLM extrahieren...'.
- Vendors then run through the same validate_vendor_urls + score_vendors
  pipeline → VVT table rendered identically regardless of source.

docker-compose.yml: backend-compliance gains OLLAMA_URL, CMP_LLM_MODEL,
OVH_LLM_URL/KEY/MODEL env vars (same names as consent-tester so the
configuration is unified).

This closes the 'every site eventually gets a VVT table' goal:
- Known CMP → V1/V2 structured extraction (fast, exact)
- Unknown CMP → V3 LLM extraction (slow, best-effort)
- No text at all → no vendors, but other compliance checks still run.
2026-05-17 09:55:42 +02:00
Benjamin Admin 9c0cc0f59f feat(vvt): V2 — vendor extractors for Cookiebot/Usercentrics/Didomi/TrustArc
Backend vendor_extractor.py gets 4 new per-CMP dispatchers, mirroring the
JSON schemas observed in each platform:

- Cookiebot: 'Categories[*].Cookies[*]' with Vendor/Host, expiry, purpose
- Usercentrics: 'services[*]' with cookieMaxAgeSeconds, processingCompanyCountry
- Didomi: 'app.vendors[*]' with country + policyUrl
- TrustArc: 'vendors[*]' + per-category 'Cookies' with provider

All 6 named CMPs (ePaaS, OneTrust, Cookiebot, Usercentrics, Didomi,
TrustArc) plus the generic-shape fallback are now mapped — every site
hitting Phase B of the cascade gets a structured vendor list, scored
opt-out links, and a VVT-Tabelle in the email.
2026-05-17 09:52:10 +02:00
Benjamin Admin ea4dbb223f feat(vvt): per-vendor extraction + opt-out check + VVT table in email (V1)
When a known CMP (ePaaS, OneTrust) renders the cookie policy, we now
extract structured vendor records, probe their opt-out + privacy URLs,
score each vendor (0-100), and append a 'VVT-Vorschlag' table to the
compliance email — one row per vendor, sortable by compliance score.

consent-tester:
- DSIDiscoveryResult.cmp_payloads: surfaces raw CMP JSON to callers
- DSIDiscoveryResponse: new cmp_payloads field
- discover_dsi_documents sets cmp_payloads from cmp_capture
- cmp_library/{epaas,onetrust}.py: new extract_vendors(d) returning
  list[VendorRecord]

backend:
- _fetch_text() now returns (text, cmp_payloads) tuple
- doc_entries store cmp_payloads per doc (mostly cookie)
- _autodiscover_missing forwards homepage payloads to the cookie entry
- New module vendor_extractor.py: dispatches ePaaS/OneTrust/generic
  schemas; dedupes vendors across multiple payloads
- cookie_link_validator.py extended with validate_vendor_urls(vendors)
  and score_vendors(vendors) — 0-100 score per vendor based on name,
  purpose, country, opt-out reachable, privacy URL reachable, cookies
  with names + expiry
- agent_doc_check_extras.build_vvt_table_html: renders the table
- Route appends VVT HTML after the provider list, before the
  document-by-document report
- Response JSON gains cmp_vendors for future frontend rendering

Example for BMW: ~30 ePaaS providers → table with Name | Kategorie |
Sitz | Cookies | Opt-Out (✓/✗) | Privacy (✓/✗) | Score. Sorted by
score ascending so the worst-compliant vendors are at the top.
2026-05-17 09:50:11 +02:00
Benjamin Admin c9c0fb5965 feat(cookie-check): enhanced patterns + active opt-out link validator
cookie_checks.py:
- cookie_names_listed: now also matches CMP placeholder notation
  (BMW: 'Adfpc###', 'CT###') and 'Diese Datenverarbeitung verwendet die
  folgenden Cookies oder ähnliche Technologien' as list-shape signal.
  Cryptic vendor names like 'audience', 'adformfrpid' are accepted via
  the surrounding markup, not by hard-coding each one.
- cookie_providers_named: new pattern 'Gesetzt von: <Firma>' (BMW/ePaaS
  per-cookie vendor naming) + recognition of full legal-form names
  (Adform A/S, BMW AG, Adobe Systems Software Ireland Limited).
- cookie_duration_values: now matches 'Ablauf: 1 Jahr' / 'Speicherdauer:
  30 Tage' (BMW format) in addition to the legacy '<n> <unit>'.

New L1 + L2 checks for controller in cookie-policy:
- cookie_controller (L1): the cookie policy must name Verantwortlich(er)
- cookie_controller_address (L2): PLZ + Ort or address keywords
- cookie_controller_contact_or_link (L2): email/phone OR link back to
  Datenschutzerklärung (the practical equivalent — BMW does this)

New L2 checks (parented under opt_out):
- cookie_optout_links: detects per-provider opt-out URLs in the text
- cookie_privacy_policy_links: per-provider privacy-policy URLs

New service: cookie_link_validator.py
- extract_links(text): pulls all https?://… URLs that follow 'Opt-Out
  Link:' / 'Link zur Privacy Policy:' (deduped)
- validate_links(links): probes every URL concurrently (HEAD first, GET
  fallback for 405/403). 10 parallel, 8s per request, 60s batch cap.
  Returns reachable=True/False + status + final_url.
- build_check_items(): renders 2 CheckItems (opt-out + privacy-policy),
  each pass if ALL links 2xx/3xx, fail with up-to-5 broken-link examples.

Hook in _check_single: doc_type=='cookie' triggers the validator after
regex+MC checks. Recomputes correctness with the new L2 items.

This addresses two concrete BMW observations:
1. BMW's per-cookie structure (Name + Zweck + Ablauf, Gesetzt von: …,
   Opt-Out Link: …) now recognised → 'Konkrete Cookie-Namen aufgelistet'
   and 'Konkrete Speicherdauern' should pass.
2. Defective opt-out URLs surface as compliance findings rather than
   silently passing — Art. 7(3) DSGVO requires a working withdrawal
   path per provider.
2026-05-17 09:38:32 +02:00
Benjamin Admin 4a5924b8c4 feat(iace): CRA / DIN EN 40000-1-2 cyber-resilience spur
[guardrail-change]

Phase 18 adds an EU Cyber Resilience Act compliance track to IACE:
the engine now fires patterns that surface the manufacturer-side CRA
obligations whenever a project's components carry digital elements.

Patterns (HP1910-HP1918, hazard_patterns_cra.go):
  HP1910  Missing SBOM
  HP1911  Unsigned firmware/software updates
  HP1912  Factory-default credentials still active
  HP1913  No coordinated vulnerability disclosure (CVD) policy
  HP1914  No documented security patch SLA
  HP1915  Missing user-facing hardening guide
  HP1916  No incident-notification process to ENISA / CSIRT
  HP1917  No security assessment prior to placing on market
  HP1918  AI component without cybersecurity risk assessment

Each pattern carries ClarificationQuestionsDE so the operator gets
auditor-grade questions to take back to the Anlagenbauer instead of
the engine inventing prose. PatternMatch carries DefaultAvoidability
(P=1 for all CRA patterns), feeding the PLr graph from Phase 17.

Measures (M540-M548, measures_library_cra.go):
  M540  SBOM (SPDX or CycloneDX) with each machine release
  M541  Signed updates with rollback protection
  M542  Forced default-password change at first boot
  M543  Published CVD policy (security.txt / PSIRT)
  M544  Documented patch SLA with CVSS-tier response times
  M545  User-facing hardening guide in the machine docs
  M546  ENISA incident-notification process (24h/72h/14d)
  M547  Authenticated update channel + integrity check
  M548  Pre-market security assessment / pen-test

The library is urheberrechtlich neutral: identifiers only
(Verordnung (EU) 2024/2847, DIN EN 40000-1-2 Entwurf, IEC 62443,
ETSI EN 303 645, ISO/IEC 5962, ISO/IEC 29147). No normative text
is reproduced — DIN/Beuth proprietary content is referenced by
section number only.

Category-compatibility:
  cyber_resilience pattern category accepts measures with
  HazardCategory cyber_resilience, cyber_network, or
  software_control. Updated in both the runtime helper
  (iace_handler_init_helpers.go) and its test-mirror
  (pattern_coverage_test.go) — both must move in lockstep.

Frontend (clarifications page):
  When at least one clarification references "2024/2847" or
  "40000-1-2" in its norm_references, a blue info-banner is
  rendered at the top of the page:
    "Cyber Resilience Act (CRA) — Hinweis zur Geltung
     Diese Klärungsliste enthält Fragen zur Verordnung (EU)
     2024/2847 (CRA). Die CRA gilt für Produkte mit digitalen
     Elementen, die ab dem 11.12.2027 auf dem EU-Markt bereit-
     gestellt werden. ..."
  Reminds the user that the CRA pflichten are forward-looking
  while still allowing the manufacturer to bake them in now.

LOC exceptions:
  Added three pre-existing files to .claude/rules/loc-exceptions.txt
  (manufacturer_safety_features.go, iace_handler_clarifications.go,
  routes.go). All three grew across Phases 16-17 and are tagged as
  Phase 5+ refactor backlog. [guardrail-change] marker required.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-17 02:15:51 +02:00
Benjamin Admin 2afa5a179b feat(iace): Risikograph EN ISO 13849-1 PLr + Methoden-Kopf im Bericht
Phase 17 of the risk-assessment polish. Two pieces:

A) PLr per EN ISO 13849-1 Anhang A (Risikograph)
   - HazardPattern.DefaultAvoidability (1 = P1, 2 = P2). Optional;
     defaults to P1 if unset (conservative — operator can raise after
     review).
   - ComputePLr(s,f,p) implements the canonical 8-leaf binary tree
     (S1F1P1 -> a, ..., S2F2P2 -> e). Pinned by 8 table-driven tests.
   - SeverityToS / ExposureToF map the existing 1-5 fields to the
     binary S/F at the documented threshold (3).
   - At project initialise, every hazard's Description is appended
     with "Risikograph EN ISO 13849-1 (Anhang A): S2 · F1 · P1 -> PLr c"
     so the audit value is visible without leaving the hazard view.
   - PatternMatch carries DefaultAvoidability so the init handler can
     pick it up without a second pattern lookup.

B) Methoden-Kopf am Bericht
   - GET /clarifications.html now opens with a standardised methodology
     block: ISO 12100 Anhang B (hazard ID) + ISO 13849-1 Anhang A
     (PLr graph) + ISO 12100 6.2/6.3/6.4 (reduction hierarchy). Same
     wording on every export, ready for the Anlagenbauer-Uebergabe.
   - Only norm identifiers — no norm text reproduced.

C) ISO12100Section in Hazard Description
   - When a pattern is labeled with ISO12100Section, the hazard
     description gets a "Klassifikation: EN ISO 12100 Anhang B,
     Abschnitt 6.3.5.4" suffix. Provenance for the auditor.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-17 02:03:10 +02:00
Benjamin Admin 71d31c914b feat(iace): ISO 12100 Anhang B mapping — split noise/vibration + section identifier
Phase 16 of the Klaerungen / risk-assessment polish. Sources from
EN ISO 12100 Anhang B Tabelle B.1 are now first-class:

A) HazardPattern.ISO12100Section identifier (string), persisted only as
   the section number (e.g. "6.3.5.5") — not the norm text. Keeps the
   library urheberrechtlich neutral (DIN/Beuth license). 57 patterns
   labeled today; rest will follow on touch.

B) Category split per ISO 12100 Nr. 4 vs Nr. 5:
   - 16 patterns reclassified noise_vibration -> noise_hazard
   - 7  patterns reclassified noise_vibration -> vibration_hazard
   - 1  pattern (HP228 UV-/Laermexposition) kept multi-cat
   acceptableMeasureCategories now accepts both new aliases plus the
   legacy noise_vibration. Coverage test recognises both as valid.

C) 5 new ISO-12100-Annex-B gap patterns (HP1900-HP1904):
   - HP1900 Vakuum-Verletzung (6.3.5.5)
   - HP1901 Federenergie / elastische Elemente (6.2.10)
   - HP1902 Rutschen/Stolpern auf rauer Oberflaeche (6.3.5.6)
   - HP1903 Hochdruckinjektion (6.3.5.4) — includes clarifying
            "no hand-locating of leaks" question
   - HP1904 Ersticken durch Brustkorbquetschung (6.3.5.2)

The library now mirrors the ISO 12100 Annex B structure for the gaps
the Bremse benchmark surfaced.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-17 01:59:16 +02:00
Benjamin Admin b090662524 fix(compliance-check): respect auto-discovery 'not found' verdict; DSB not canonical
Two related bugs in the BMW test result:

1. AGB rendered as 'MANGELHAFT 0/13' even though BMW has no public AGB:
   - Auto-discovery correctly returned 'not found' for AGB (no link on
     bmw.de matches AGB keywords).
   - But auto_fill_from_dsi then found the substring 'AGB' in a section
     of the DSI and pseudo-filled the AGB entry with a 264-word DSI
     fragment.
   - cross_search_documents would have done the same.
   - Both now skip entries where discovery_attempted=True AND
     auto_discovered=False — the 'not found' verdict stands.

2. DSB-Kontakt rendered as a separate 100% OK document with 7566 words
   = the entire DSI text:
   - GDPR practice: the DSB is named *inside* the DSI as an email or
     contact block (Art. 13(1)(b)), not as a stand-alone page.
   - cross_search_documents had been assigning the full DSI to the DSB
     row because it matched 'datenschutzbeauftragte' keywords.
   - DSB removed from _ALL_DOC_TYPES — no longer canonical, no longer
     padded as missing, no longer auto-discovered. The frontend row
     remains so a tenant with a separate DSB page can still submit one.

After this fix BMW should render:
- DSE: OK
- Impressum: LUECKENHAFT (unchanged — regex gaps to fix separately)
- Cookie-Richtlinie: OK
- Social Media: NICHT GEFUNDEN (bmw.de does not link to it)
- AGB: NICHT GEFUNDEN (correct — BMW has no public AGB)
- Nutzungsbedingungen: NICHT GEFUNDEN
- Widerruf: NICHT GEFUNDEN
2026-05-17 01:53:09 +02:00
Benjamin Admin c4be077c5d feat(iace): Klaerungen Phase 3 — DB-Tabelle + Multi-User + PDF-Export
[migration-approved]

Three pieces complete the Klaerungen lifecycle:

1. Migration 028: iace_clarifications + iace_clarification_comments +
   iace_clarification_history. Deterministic clarification_key
   (UNIQUE per project) so engine re-inits don't lose answers.
   History table logs every status/answer transition. The previous
   JSONB-in-metadata storage is kept as read-only fallback for
   pre-migration projects until a one-shot upcopy script runs.

2. Multi-User-Workflow:
   - assigned_to field on every clarification (free-text user kuerzel
     for now; an FK to users can be added in a follow-up).
   - Comment thread per clarification (POST .../comment, GET
     .../detail returns the thread).
   - Status-history log written by UpsertClarification when the
     status or answer actually changes.
   - Frontend Modal: Zugewiesen-an + Bearbeiter fields, comment
     thread with inline post, collapsible history section.

3. PDF-Export via print-friendly HTML:
   - GET /clarifications.html returns a standalone A4-styled
     document with status badges, norm references, affected hazards
     and a signature row at the bottom. The Bediener opens the link
     and uses Strg-P / Cmd-P to save as PDF. No server-side PDF
     dependency added.
   - Frontend "PDF / Druck" button next to CSV export.

Backend:
- internal/iace/store_clarifications.go: UpsertClarification,
  ListClarificationsForProject, GetClarificationByKey,
  AddClarificationComment, ListClarificationComments,
  ListClarificationHistory.
- internal/api/handlers/iace_handler_clarifications.go:
  - AnswerClarification now writes the SQL row, falls back to legacy
    JSONB read on list.
  - PostClarificationComment, ListClarificationDetail,
    ExportClarificationsHTML added.

Migration must be applied manually on Mac Mini and prod via
psql -f /migrations/028_iace_clarifications.sql — pattern as in
scripts/apply_*_migration.sh.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-17 01:39:17 +02:00
Benjamin Admin b2b4d77877 fix(auto-discovery): compute missing against canonical 8 types, not submitted
Frontend filters out empty doc rows -> req.documents only contains the
N submitted entries (3 in BMW case). The old auto-discovery loop
computed 'missing' as 'entries in doc_entries with empty text', which
was always empty for those N entries -> discovery never fired.

Fix:
- missing = _ALL_DOC_TYPES - {canonical doc_types in doc_entries}
- For each missing type, APPEND a new entry to doc_entries with
  discovery_attempted=True. If a discovered doc matched, fill text/url
  and set auto_discovered=True.
- Check loop: skip entries with no URL and no text (let padding label
  them). Entries with URL but no text keep the 'Kein Text' error so the
  user sees fetch failures explicitly.
2026-05-17 01:28:51 +02:00
Benjamin Admin f19a75d83d feat(iace): Klaerungen Phase 2 — Sidebar-Counter + CSV-Export + Hazard-Banner
Three pieces complete the Klaerungen UX:

1. Sidebar-Counter: layout.tsx polls /clarifications and shows a
   colored open-count badge on the "Klaerungen" nav item. Refreshes
   whenever the user changes route.

2. CSV-Export: new backend endpoint
   GET /sdk/v1/iace/projects/:id/clarifications.csv produces a UTF-8-
   BOM-prefixed semicolon-separated CSV (Excel-friendly) with ID,
   Quelle, Kategorie, Frage, Status, Antwort, Begruendung, Bearbeiter,
   answered_at, anzahl Gefaehrdungen, Gefaehrdungs-Namen, Norm-Refs.
   Frontend Klaerungen-Seite bekommt einen "CSV-Export"-Button.

3. Hazard-Banner statt Fragentext im Benchmark-Detail: the previous
   bulleted clarification list was duplicated across 48 hazards for a
   single FANUC question. Phase 2 replaces it with a compact status
   badge — "N offene Klaerung(en) — Klaerungen-Seite oeffnen" (orange)
   or "Alle N Klaerungen beantwortet" (green) with a direct link.

Backend cleanup: iace_handler_init.go no longer appends the "Mit
Anlagenbauer zu klaeren" block to Hazard.Description. The description
stays focused on the scenario; clarifications live in the dedicated
endpoint and answers persist across re-inits via project.metadata.
The aggregated "Referenzierte Normen" line on the hazard is kept.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-17 01:25:36 +02:00
Benjamin Admin 525038359a feat(compliance-check): auto-discover missing doc types from homepage
When the user leaves some doc-type rows empty, the tool now actively
searches the website for them — only marks 'not found' as last resort.

Flow:
1. User submits N URLs (e.g. just DSI)
2. For each canonical doc_type with no submitted URL/text, the route
   identifies the most-common base (scheme://netloc) from submitted URLs
3. Calls consent-tester /dsi-discovery on the homepage with
   max_documents=15 (180s timeout)
4. Classifies every discovered doc into a canonical doc_type via
   title/URL keyword rules (_DISCOVERY_RULES — covers cookie/widerruf/
   social_media/agb/nutzungsbedingungen/dsb/impressum/dse)
5. Fills matching empty entries with the discovered text, marks
   auto_discovered=True and discovery_attempted=True

Padding now differentiates:
- 'Auf der Website nicht gefunden' — discovery was attempted, no doc
  matched. Amber badge, friendly hint to add URL manually.
- 'Nicht eingereicht — Quelle nicht angegeben' — user gave NO URLs at
  all, nothing to crawl from. Grey badge.

Email + frontend:
- Status labels: NICHT GEFUNDEN (amber) vs NICHT EINGEREICHT (grey)
- 'Gepruefte Quellen' table tags auto-discovered URLs with a small blue
  'auto-entdeckt' badge so GF sees what tool found vs user submitted.

Implementation only runs when ≥1 URL was submitted (no base to crawl
from otherwise). Adds 30-90s for unsubmitted types but avoids the
'just say nicht gefunden' anti-pattern.
2026-05-17 01:14:05 +02:00
Benjamin Admin 79efa54898 feat(iace): Klaerungen MVP — Phase 1
New page "Klaerungen" between Massnahmen and Verifikation.

Backend:
- internal/iace/clarifications.go: Clarification struct + ClarificationAnswer +
  BuildProjectClarifications() — aggregates pattern-level + manufacturer-
  level questions from collectAllPatterns + GetManufacturerSafetyFeatures.
  Deterministic IDs ("pattern:HP1640:0", "manuf:fanuc:dual-check-safety-dcs:1")
  so persisted answers survive every re-init.
- internal/api/handlers/iace_handler_clarifications.go:
  - GET /projects/:id/clarifications returns aggregated list with affected
    hazard names + persisted answer state, sorted (open first).
  - POST /projects/:id/clarifications/:cid/answer writes status/answer/
    reasoning/answered_by/answered_at to project.metadata.clarification_-
    answers — no DB schema change.

Frontend:
- admin-compliance/app/sdk/iace/layout.tsx: new "Klaerungen" nav item.
- app/sdk/iace/[projectId]/clarifications/page.tsx: table grouped by
  source (FANUC / Pattern HP1640 / …), Filter Offen/Beantwortet/Alle,
  search field, Antwort-Modal with status/answer/Begruendung/Bearbeiter.

A clarification answered once applies to ALL referenced hazards — the
operator no longer has to answer the same FANUC DCS question on 48
mechanical hazards individually.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-17 01:05:53 +02:00
Benjamin Admin bc21480a2a fix(compliance-check): always render 8 doc types + 4 BMW GT-gap fixes
Always-show-8 (user-requested):
- agent_compliance_check_routes.py: _pad_results_with_missing pads the
  results list to always include all 8 canonical doc_types in canonical
  order. Missing types get a placeholder DocCheckResult with error=
  'Nicht eingereicht' + scenario='missing'.
- agent_doc_check_report.py: NICHT EINGEREICHT status label (neutral),
  friendly grey body block instead of red error.
- ChecklistView.tsx: 'Nicht eingereicht' chip (neutral grey, not red
  'Fehler'); SCENARIO_LABELS adds missing entry + header chip counter.

Impressum-Regression fix (#18):
- _fetch_text(url, doc_type): cookie/dse/social_media -> max_documents=1
  (CMP capture authoritative, sub-pages dilute). Other types -> =3
  (Impressum needs Versicherungsvermittler, Aufsicht, Berufsrecht sub-
  pages). 15s networkidle bail keeps timing safe.

ODR/Verbraucherstreitbeilegung filter (#19):
- _apply_profile_filter: when profile.needs_odr=True (B2C), override the
  check's default B2B-oriented hint with action-oriented B2C guidance
  pointing at Art. 14 EU-VO 524/2013 + §36 VSBG. Previously the check
  contradicted itself: 'profile says B2C' + hint 'only relevant for B2C
  online vendors'.

Registergericht regex (#20):
- impressum_checks.py: accept colon/dot/dash between keyword and city
  (BMW writes 'registergericht: münchen hrb 42243'). Add 'sitz und
  registergericht: X' as separate pattern.

Industry detection (#21):
- business_profiler.py: 'automotive' keywords broadened (antriebs,
  motor, leasing, werkstatt, probefahrt, plus brand names BMW/Mercedes/
  Audi/VW/Porsche/Opel). 'it_services' keywords narrowed — software/
  cloud/hosting are mentioned in every privacy policy and were biasing
  the result toward IT for any tech-aware company.
2026-05-17 01:03:58 +02:00
Benjamin Admin 74f66c4c34 fix(admin/iace/benchmark): show Klaerungsfragen + Normen on Engine column
The Go init handler appends two annotated blocks to Hazard.Description
("Mit Anlagenbauer zu klaeren: ..." and "Referenzierte Normen: ...")
without changing the DB schema. The benchmark detail view only rendered
hazard.scenario || hazard.description, so the appended blocks were
silently hidden because scenario is always populated.

Split the description into three structured pieces:
1. extractScenario() — pure scenario text, stripped of trailing blocks
2. extractClarifications() — bullet list of "Mit Anlagenbauer zu klaeren"
3. extractEngineNorms() — pipe-separated norm references

Each piece is rendered as its own DetailRow. The FANUC DCS clarification
that already lives in the DB (48/115 hazards on the Bremse project) is
now visible in the Engine column.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-17 00:42:41 +02:00
Benjamin Admin 5f2da1de88 feat(consent-tester): Phase E — self-improving CMP library
cmp_discovery_log.py:
- sqlite log at /data/cmp_discoveries.db: every LLM-discovered CMP
  pattern recorded with domain, strategy, value, sample text
- Auto-promote (user-chosen 'voll automatisch' mode): when LLM returns
  strategy=url AND extracted text >= 800 words, write a new module
  /data/auto_cmp/auto_<slug>.py with derived regex matcher + reconstruct
- record_discovery() called from dsi_discovery._try_llm_cascade on success

cmp_library/_registry.py:
- Loads both hand-written modules from services/cmp_library/ AND
  auto-promoted modules from /data/auto_cmp/ (CMP_AUTO_DIR env)
- Auto modules use importlib.util.spec_from_file_location, no package
  install needed; restart consent-tester to pick up new ones

dsi_discovery.py:
- _try_llm_cascade now calls record_discovery() on every successful
  LLM analysis (cached AND fresh)

main.py:
- GET /cmp-discoveries — admin endpoint listing all logged discoveries
- DELETE /cmp-discoveries/{id} — rollback (unlinks auto_*.py)

This closes the self-improving loop: first encounter with a new CMP fires
the LLM (cost) → discovery is auto-promoted → all future runs against the
same vendor pattern hit Phase B (Named CMP) at <50ms with no LLM call.
2026-05-16 23:09:23 +02:00
Benjamin Admin 2400aa6a9e feat(consent-tester): Phase C+D — LLM cascade fallback (Qwen → OVH)
New module consent-tester/services/cmp_llm_fallback.py:
- LLMCookieExtractor: single-endpoint adapter (Ollama OR OpenAI-compat)
- LLMCascade: tries Qwen (local Mac Mini Ollama) first; falls through to
  OVH (managed 120B) when Qwen returns no usable strategy
- LLMCascade.from_env(): reads OLLAMA_URL/CMP_LLM_MODEL + OVH_LLM_URL/
  OVH_LLM_KEY/OVH_LLM_MODEL from environment
- LLM returns JSON {strategy: url|selector|text, value: ...}
- Valkey-backed cache per netloc (cmp:hint:<netloc>, 7-day TTL) — next run
  against the same domain skips the LLM entirely

dsi_discovery.py:
- Wired network_log collector (URL/status/content-type/size of every JSON
  response on the page) — passed to LLM prompt as observation
- After Named CMP (Phase B) + Heuristic (Phase A) both fail AND DOM
  < 300 words: invoke LLMCascade.analyze(...)
- _apply_llm_hint executes the LLM's strategy: refetch URL via Playwright
  request context, query DOM selector, or use text directly
- Cache HIT path: apply cached hint, only fall back to LLM if cache is stale

docker-compose.yml:
- consent-tester gets env vars + cmp-data volume (for Phase E)
- All LLM endpoints configurable via env, sensible defaults

consent-tester/requirements.txt:
- redis>=5.0 (asyncio client, Valkey-compatible)
- httpx>=0.27
2026-05-16 23:06:05 +02:00
Benjamin Admin e9002175ac feat(iace): manufacturer safety feature library (Stufe A — 50+ entries)
Adds a curated database of safety-relevant features for the major
manufacturers across mechanical/plant engineering, written entirely in
own words with norm anchors. No verbatim manufacturer texts — therefore
no copyright issue:

- Markennennung (§ 23 MarkenG nominative use) is permitted.
- Fakten ueber Produkt-Sicherheitsfunktionen are not protected by § 2
  UrhG (only Werke, not facts).
- NormReferences contain only the identifiers (e.g. "EN ISO 13849-1
  PLd Kat.3"), never the norm text itself.

Coverage (52 entries across 12 categories):
  Industrieroboter (10): FANUC DCS, KUKA SafeOperation, ABB SafeMove,
    Yaskawa FSU, Staeubli CS9, Kawasaki Cubic-S, Mitsubishi MELFA,
    Universal Robots PolyScope, Doosan PRS, Comau SafeNet
  CNC/WZM (8): DMG MORI, Mazak, TRUMPF, Okuma, Hermle, Heidenhain
    SPLC, GROB, Heller
  Pneumatik (4): Festo, SMC, AVENTICS, Parker
  Hydraulik (3): Bosch Rexroth, HAWE, HYDAC
  Safety-PLC / Sicherheitstechnik (8): PILZ, SICK, Schmersal, Euchner,
    Leuze, Phoenix Contact, Banner, Wieland
  Standard-PLC (5): Siemens, Beckhoff, Rockwell, Schneider, B&R
  Pressen (3): Schuler, Bruderer, AIDA
  Spritzguss (3): Arburg, KraussMaffei, ENGEL
  Verpackung (2): Krones, Bosch Packaging/Syntegon
  Laser/Schweissen (3): Bystronic, Amada, Fronius
  Foerdertechnik (2): Interroll, SEW EURODRIVE

Engine integration:
- LookupManufacturerFeaturesInText() scans the project narrative for
  any of the manufacturer aliases (case-insensitive, umlaut-tolerant).
- Init-Handler appends matched feature clarifications to the relevant
  hazard's "Mit Anlagenbauer zu klaeren:" block — for the right
  HazardCategory only (e.g. FANUC DCS only on mechanical_hazard).
- For a Bremse project narrative mentioning "Fanuc Robodrill", the
  engine now adds clarification questions like "Ist DCS am Roboter
  konfiguriert?" to relevant mechanical hazards automatically.

Tests: 7 new pin tests — manufacturer count, norm prefixes, FANUC/KUKA
detection in narrative, umlaut robustness (Staeubli vs Staubli).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-16 23:04:56 +02:00
Benjamin Admin 7e426c31f1 feat(consent-tester): Phase B — named CMP library + plugin architecture
cmp_extractor.py refactored to thin coordinator (123 LOC, was 223).
Discovers all CMP modules via cmp_library/_registry.py:load_all() at
import time. Restart consent-tester to pick up new modules.

New cmp_library/ folder:
- _registry.py: auto-discovers all modules with MATCHER + reconstruct()
- epaas.py:     BMW Group ePaaS (extracted from cmp_extractor)
- onetrust.py:  cdn.cookielaw.org Groups/Cookies schema
- cookiebot.py: consent.cookiebot.com Categories schema
- usercentrics.py: api.usercentrics.eu services schema
- didomi.py:    sdk.privacy-center.org notice + vendors + purposes
- trustarc.py:  consent.trustarc.com categories + vendors

Each module:
- MATCHER: re.Pattern matching the CMP JSON endpoint URL
- reconstruct(d: dict) -> str: builds German Markdown cookie-policy text

Phase E (self-improving) will write auto_*.py files into the same folder;
_registry already picks those up via pkgutil.iter_modules.
2026-05-16 22:59:48 +02:00
Benjamin Admin 4f19310130 fix(iace): HP1654 Greifer durchschlaegt Zaun — DCS-Bezug
GT 1.8 fordert konkret den 'sicher begrenzten Bewegungsbereich (Dual
Check Safety)'. HP1654 hatte nur M061 'Feste trennende Schutzeinrich-
tung' als Mitigation. Ergaenzt um M494 (Safe Limited Position/Space mit
DCS-Erlaeuterung), M501 (Schutzzaun-Lastbemessung) und M502 (Greifer-
Fail-Safe). Klaerungsfragen verweisen explizit auf DCS bei FANUC,
SafeMove bei ABB, SafeOperation bei KUKA und die EN ISO 13849-1 PLd/
Kat.3-Validierung.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-16 22:56:40 +02:00
Benjamin Admin 8283483909 feat(consent-tester): Phase A — generic JSON cookie-policy heuristic
New module cmp_heuristic.py with:
- looks_like_cookie_policy(data): shape-based classifier (top-level keys
  cookies/categories/providers/vendors/purposes/cookieList/etc. + at
  least 2 name+description objects, or IAB TCF v2 vendors[]+purposes[])
- reconstruct_generic(data): walks JSON, extracts name + description
  fields + standalone prologue/dataController/persistence fields,
  emits flat German Markdown text (max 5000 words, dedup)

cmp_extractor.py wired so that AFTER named CMP matchers (epaas,
onetrust) fail, every JSON response on the page is tested for the
heuristic. If matched, payload is captured as '_heuristic' kind and
reconstructed via the generic walker.

This is Phase A of the 4-stage cascade (B-D follow). Unknown CMPs that
return JSON now work without hand-coding each one.

Pre-filter: skips response paths /api/config, /beacon, /track,
/analytics, /fonts/, /log/, /heartbeat/, /.well-known/ to avoid
spamming the heuristic on every Playwright load.
2026-05-16 22:56:20 +02:00
Benjamin Admin 9814b56f2f fix(cookie-extract): max_documents=1 + faster networkidle bail (Phase 0 fix)
Root cause of the recurring 603-word BMW result:
- DSI discovery for cookie-policy URL was hitting 4x networkidle timeouts
  (60s each = ~240s total).
- Backend httpx timeout (180s after the previous fix) gave up before the
  consent-tester finished, falling through to the raw HTTP fetch which
  returned BMWs SSR navigation chrome (603 words) as the 'cookie policy'.

Two orthogonal fixes:
1. _fetch_text now passes max_documents=1 for user-specified URLs. We only
   want self-extraction of THAT page; link-following is unnecessary noise.
2. networkidle wait_until window dropped 60s -> 15s. SPAs like BMW/Daimler
   never reach networkidle anyway; the 60s wait was pure latency. Falls
   through to domcontentloaded+5s render-wait, same as before.
2026-05-16 22:53:23 +02:00
Benjamin Admin 69729ef6ac feat(iace): norm references in mitigations + aggregated norm panel per hazard
Library measures carry NormReferences (EN/IEC/ISO/DIN/TRBS/TRGS Ziff./Kap./
Pos.) but they were dropped on persist: CreateMitigationRequest only
wrote Name + Description. The Fachmann benchmark file lists Normen for
34 of 60 hazards — the engine had this data already but lost it on the
way to the UI.

Fix without DB schema change:
- Mitigation.Description gets a "Normen: EN 60204-1 Ziff. 6.2 | EN 61140"
  line appended when the measure has NormReferences. Pipe separator keeps
  the inline panel short and grep-friendly.
- After all mitigations land, the aggregated dedup'd norm list for the
  hazard is appended to Hazard.Description as a single "Referenzierte
  Normen: ..." line so the UI can show one panel per hazard without
  scanning every mitigation.

Audit of library coverage (per-pattern) showed GT-Bremse Normen are
generally present and richer:
- HP1640 covers GT 2.2 (EN 60204-1 Ziff. 6.2, Ziff. 8.2.3, EN 61140 +)
- HP1641 covers GT 2.4 (EN 60204-1 Ziff. 8.2.6 +)
- HP1605 covers GT 1.7 (ISO 10218-1 Ziff. 5.6.2, 5.8.3 — Ziff. 5.7.3 fehlt)
- HP1671 covers GT 1.30 (EN 12417 — Pos. detail fehlt)

Followup: 2 fine-grained sub-paragraph references (5.7.3, Pos. 1.1.4)
can be added later as measure-text updates.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-16 22:51:50 +02:00
Benjamin Admin 35d6422247 fix(iace): HP1632 Bersten-Pattern eindeutige Zone fuer Dedup
ZoneDE 'Pneumatikkomponenten der Anlage' kollidiert nach normalizeZoneKey
mit HP1630 'Pneumatikschlaeuche der Automation' im 3-signifikante-Wort-
Vergleich. Neue Zone 'Berstgefaehrdete Druckwandungen Pneumatik (Leitungs-
wand, Dichtung, Verschraubung)' hat semantisch eigenstaendige Schluessel-
woerter — Dedup mergt nicht mehr in HP1630.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-16 22:34:51 +02:00
Benjamin Admin 5ea68ebea4 feat(iace): clarification questions + HP1632 Bersten + HP1637 KSS-Aerosol fix
Drei nachhaltige Verbesserungen, getrieben durch die Bremse-Benchmark-
Faelle GT 1.4, GT 1.30 und GT 7.4. Die Engine erfindet weiterhin
keine Fachmann-Kommentare — Kommentare bleiben aus, weil sie ein
Verstaendnis der konkreten Anlage erfordern, das die Engine nicht
hat. Statt dessen liefert die Engine norm-basierte Klaerungsfragen
und ein praeziseres Pattern-Vokabular.

A) HazardPattern.ClarificationQuestionsDE — neues optionales Feld:
   - Pattern hinterlegt prueffaehige Fragen, die der Bediener mit dem
     Anlagenbauer abklaert. Beispiele:
     - HP1640: "Liegt ein Pruefprotokoll nach EN 60204-1 vor?"
     - HP1666: "Ist die WZM als CE-konformes Subsystem integriert?"
     - HP1604: "Ist DCS am Roboter konfiguriert und validiert?"
   - Init-Handler haengt die Fragen an Hazard.Description an mit dem
     Marker "Mit Anlagenbauer zu klaeren:". Kein DB-Schema-Aenderungs-
     bedarf.
   - 11 Patterns mit Klaerungsfragen versehen (HP1602, HP1604, HP1611,
     HP1612, HP1620, HP1622, HP1637, HP1640, HP1641, HP1666, HP1685).

B) HP1632 "Bersten druckbeaufschlagter Pneumatik-Komponente" — neues
   Pattern, semantisch DISTINKT zu HP1630 "Abspringen":
   - Bersten = Material-/Druckversagen der Komponente, Mediumaustritt
   - Abspringen = Verbindung loest sich, Peitscheneffekt
   Bremse-Benchmark GT 1.4 sprach von Bersten, HP1630 nur von
   Abspringen — ein 66%-Frontend-Match war eine Sackgasse. Mit
   HP1632 feuert die Engine ein eigenes Hazard, das auf GT 1.4
   einen sauberen Volltreffer liefert.

C) HP1637 "Einatmen von KSS-Aerosolen" — Massnahmen vervollstaendigt:
   Vorher nur M141 (Sicherheitszeichen), neu zusaetzlich M405 (KSS-
   Aerosolabsaugung), M418 (AGW-Ueberwachung), M526 (WZM-Tueren
   geschlossen waehrend Bearbeitung), M408 (Hautschutzplan).
   Klaerungsfrage: "Wurde die Aerosolkonzentration nach Bearbeitungs-
   ende messtechnisch ermittelt und mit dem AGW verglichen?"

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-16 22:23:56 +02:00
Benjamin Admin 41023f6343 fix(iace): HP1671 Druckluft-Verletzung — 4 zusaetzliche GT-1.30 Massnahmen
HP1671 "Druckluft-Verletzung in Bearbeitungszelle" matched zwar das
GT-1.30 Szenario "Einstich, Augenverletzung in Bearbeitungszelle" exakt
nach Name und Scenario, hatte aber nur eine einzige Massnahme M061
"Feste trennende Schutzeinrichtung". Die drei spezifischen Massnahmen
des Fachmanns (Reinigungsduese in Zelle integriert / Druckluft bei
Tueroeffnung aus / Einhausung-Lastbemessung) blieben unsichtbar, weil
mein neuer GT-Bremse-Pattern HP1712 zwar diese Massnahmen kennt, aber
durch RequiredEnergyTags=["pneumatic"] in diesem Projekt nicht feuert.

Fix: HP1671 SuggestedMeasureIDs ["M061"] -> ["M504", "M505", "M501",
"M061", "M141"]. EN 12417 Kap. 5.2 / Pos. 1.1.4 ist jetzt durch
M504/M505 abgedeckt. HP1712 bleibt als Backup-Pattern fuer Projekte
mit explizitem pneumatic-Tag bestehen.

Followup: HP1671 und HP1712 sind semantisch redundant — Konsolidierung
ist Teil der naechsten Pattern-Hygiene-Iteration.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-16 22:08:05 +02:00
Benjamin Admin 6689b37f95 fix(agent): bump _fetch_text timeout 60s->180s
The dsi-discovery in consent-tester does self-extraction + follows up to
3 sub-links + waits for CMP JSON payloads. On big SPAs (BMW, Daimler)
this routinely exceeds 60s. When it timed out, the HTTP fallback returned
the SSR shell as text — for the BMW cookie page that's 603 words of site
navigation, which then registered as 'Cookie-Richtlinie nicht im
eingereichten Text' (33%). With 180s the consent-tester finishes cleanly
and we get the CMP-captured 1824 words of real policy.
2026-05-16 22:00:42 +02:00
Benjamin Admin 80d62a0c5f fix(iace): rename 58 duplicate HP-IDs in extended.go/extended2.go
Background: hazard_patterns_extended.go (HP045-074) and _extended2.go
(HP074-102) shared their entire ID range with the semantically-different
patterns in hazard_patterns_cobot.go, hazard_patterns_press.go,
hazard_patterns_operational.go and hazard_patterns_extended_dguv.go.
The collision had lived unnoticed because TestGetBuiltinHazardPatterns_-
UniqueIDs only checks the 44 builtin patterns (HP001-HP044).

Examples of the collision:
- HP059 = "Kollision Mensch-Roboter" (cobot.go) vs "Kupplung — mechanisch" (extended.go)
- HP060 = "Quetschen durch Werkzeug am Cobot" (cobot.go) vs "Diagnosemodul — Software" (extended.go)
- HP073 = "Wartung ohne LOTO" (operational.go) vs "Hydraulikventil — hydraulisch" (extended.go)

At runtime collectAllPatterns() returned both patterns under the same ID
which made downstream lookups (e.g. hazardPatternMeasures map keyed by
pattern_id) non-deterministic — last-loaded wins, dropping the other
pattern's mitigation set silently.

Rename strategy (no deletes — both patterns are real and earn their
SuggestedMeasureIDs after the category-filter work):
  extended.go  HP045..HP073 -> HP1800..HP1828 (29 IDs)
  extended2.go HP074..HP102 -> HP1830..HP1858 (29 IDs)

cobot/press/operational/extended_dguv keep their original IDs because:
- compliance_triggers.go references HP059/HP060 with the cobot meaning
- pattern_engine_test.go references HP073 with the LOTO/maintenance meaning
- phase3_4_test.go references HP073 the same way

New regression test:
- TestAllPatterns_UniqueIDs runs over collectAllPatterns() and fails if
  ANY pattern in the runtime set duplicates an ID. The old
  TestGetBuiltinHazardPatterns_UniqueIDs stays for the builtin subset.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-16 22:00:06 +02:00
Benjamin Admin 6a3e96d54c fix(iace): set-based measure-category filter + 235 pattern-author fixes
Two-part nachhaltiger fix replacing the previous "fill to 5 mitigations
no matter what" behavior that the GT-Bremse benchmark proved
unfaithful (e.g. HP1625 "scharfe Kanten" returning M005 "Rotations-
bewegung vermeiden" via category fallback; HP1651 "Wiederanlauf
Roboter" returning M054 "Sichere thermische Auslegung" via
mismatched pattern reference).

PART A — Set-based category filter (handlers package):
- acceptableMeasureCategories: replaces 1:1 patternCatToMeasureCat
  with a curated set per pattern category, so e.g.
  safety_function_failure now accepts software_control measures
  (watchdogs, plausibility checks) and emc_hazard accepts both
  electrical and software_control measures
- isCategoryCompatible: gate every measure id against the accepted
  set before creating a mitigation; mismatches log MEASURE-SKIP
- The old category fallback is REMOVED. A hazard whose pattern has
  no category-compatible measure is now created with zero mitigations
  and logged as COVERAGE-GAP — the operator must consult an expert.
  No more silent invention of generic defaults.

PART B — 235 pattern author-error fixes across 26 files:
- HP040-HP044 (AI): M101/M102/M103 (Auffangwanne/Absauganlage) ->
  M133 Anomalieerkennung + M214 Plausibilitaet + M213 Sensor-Redundanz
  + M044 Zweikanalige Steuerung + others
- HP011-HP015, HP104-HP109, HP1085-HP1095, HP1281-HP1334 (electrical):
  M001-M005/M054/M061 placeholders -> M481/M482 Isolation +
  M511-M522 PE/Schutzleiter/RCD/Hauptschalter
- HP110-HP1331 (material_environmental): M101-M103 -> M384-M395
  Brandschutz/Laserschutz + M533/M408 SDB/PSA
- HP800-HP858, HP1178-HP1264 (software/sensor/hmi):
  M101/M104 -> M105/M106/M107/M214 SPS/Watchdog/Plausibilitaet
- HP026, HP611-HP1690 (ergonomic): M001/M082 -> M353-M360 +
  M530-M532 Hebehilfe/ergonomische Hoehe
- HP201-HP1697 (mechanical): M054/M051 -> M002/M008/M061/M141 +
  M487/M488 Tueroeffnung-Stillsetzung/Wiederanlauf
- Plus EMF/Strahlung/Brand/Lärm/Vibration/Kommunikation/Cyber

Coverage shift (Pattern-Author-Fehler bei aktiviertem Set-Filter):
   start:         237 patterns with zero category-compatible measures
   after Stufe 1A:   5 (AI)
   after Stufe 1B:  20 (mechanical Bestand)
   after Stufe 1C:  35 (electrical Bestand)
   after Stufe 1D:  29 (material_environmental)
   after Stufe 1E:  29 (software/sensor/hmi)
   after Stufe 1F:  20 (ergonomic)
   after Stufe 1G:  80 (thermal/comm/radiation/fire/safety)
   final:           0  (28 extended.go/extended2.go duplicates fixed)

New regression tests:
- TestEveryPattern_HasCategoryCompatibleMeasure: every pattern in
  collectAllPatterns() must reference at least one category-compatible
  measure; gaps must be explicitly listed in AllowlistKnownGaps
  (currently empty). Fails CI for any new pattern that drifts.
- TestAcceptableMeasureCategories: pins the set-mapping for the
  7 most-bug-prone pattern categories.
- TestIsCategoryCompatible_EmptyMeasureCat: protects legacy entries.

A separate task #11 tracks 58 HP-ID duplicates between
extended.go/extended2.go and cobot.go/press.go/operational.go —
patterns are semantically different and TestGetBuiltinHazardPatterns_-
UniqueIDs misses them because it only checks HP001-HP044.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-16 21:11:02 +02:00
Benjamin Admin 938f9a6c51 fix(cmp): tolerate variable URL segments in ePaaS policy pattern
BMW ePaaS URLs use 3 segments between /policypage/ and .epaas.json:
  /epaas/prod/policypage/<tenant>/<config-hash>/<locale>.epaas.json
The old pattern only matched 2 segments. Switch to a tolerant pattern
that matches any path before .epaas.json (anchored at .epaas.json end).
2026-05-16 20:58:48 +02:00
Benjamin Admin 17a93bc694 fix(consent-tester): prefer CMP-JSON over thin DOM extraction
Previous threshold (DOM < 300 words) missed the BMW case where Playwright
extracted 346 words of pure site navigation. The CMP JSON had 1673 words
of real policy content but was discarded.

New heuristic: prefer CMP when ANY of:
  - DOM < 300 words (existing)
  - CMP text >= 1000 words (authoritative at scale)
  - CMP text >1.5x longer than DOM
2026-05-16 20:56:11 +02:00
Benjamin Admin 1792c6f896 fix(consent-tester): capture CMP JSON to extract dynamically-loaded cookie policies
BMW (and other big enterprise sites) do NOT render cookie policies as
static HTML. Their widget loads structured data from a JSON endpoint
(BMW: ePaaS at /epaas/prod/policypage/.../<locale>.epaas.json) and
renders it client-side after consent. Our DOM extraction therefore only
captured site navigation (603 words of header/footer chrome), not the
actual policy.

New module consent-tester/services/cmp_extractor.py:
- CMPCapture: response listener that catches policy JSON during navigation
- Reconstructors for ePaaS (BMW) + OneTrust placeholder
- Returns Cookie-Richtlinie text built from policyPageMetadata +
  categories + providers (BMW: 1673 words reconstructed vs. 603 noise)

dsi_discovery.py:
- Attach CMPCapture before page.goto
- After self-extraction: if rendered DOM < 300 words AND CMP captured a
  payload, prefer the CMP-reconstructed text. This bypasses the empty
  '.cookie-policy' div problem entirely.
2026-05-16 20:50:15 +02:00
Benjamin Admin e61e9d9e2a feat(agent): progress_pct + 6 BMW-Run Verbesserungen
Backend (agent_compliance_check_routes.py):
- progress_pct (0-100%) im Job-State, ueber alle Phasen verteilt
  (Laden 0-30, Profil 35-40, Pruefen 40-80, Banner 80-92, Report 95-100)
- Status-Texte vereinheitlicht ("Texte laden X/N", "Pruefen X/N")
- Firmenname fuer Email-Subject jetzt aus URL abgeleitet
  (bmw.de -> "BMW", mercedes-benz.de -> "Mercedes-Benz") statt
  unzuverlaessigem extracted_profile.companyName (matchte oft juris.de)
- E-Mail-Report enthaelt jetzt Banner+TCF-Vendor-Liste (build_provider_list_html)

Backend (agent_doc_check_extras.py — neu):
- build_scanned_urls_html: gepruefte URLs als Tabelle oben im Report
  (transparent fuer GF, welche Quellen wirklich gezogen wurden)
- Cross-Domain-Hinweis bei >1 netloc (BMW: bmw.de / bmwgroup.com /
  bmwgroup.jobs — Auffindbarkeit nach Art. 12 DSGVO)
- build_provider_list_html: Banner-Box + TCF-Vendor-Tabelle mit Spalten
  Name | Kategorie | Zweck | Drittland | Rechtsgrundlage

Backend (business_profiler.py):
- §34d-GewO Versicherungsvermittler-Hinweise zaehlen nicht mehr als
  "finance"-Industrie (BMW wurde dadurch falsch als B2B/finance erkannt)
- Neue Industry "automotive" (Fahrzeug/KFZ/Konfigurator/Modellpalette)
- B2B-Keywords: generische Begriffe wie "unternehmen", "beratung",
  "consulting" entfernt (matchten in jedem Konzerntext)
- B2C-Fallback: bei Verbraucher-Signalen ("widerruf", "kunde",
  redaktioneller Inhalt) tendiert auf b2c statt b2b

Frontend (ComplianceCheckTab.tsx):
- Progress-Balken mit Width-% und XX%-Anzeige rechts
- liest data.progress_pct aus Polling-Response

Consent-Tester (dsi_discovery.py):
- Cookie-Policy-Extraktion kritisch fixt: wait_for_function bis
  body.innerText > 500 chars (BMW SPA-Rendering brauchte mehr Zeit)
- _extract_text_robust: 3-Strategien-Extraktion (Selektoren -> Body-
  Cleanup -> P/LI/TD-Tags)
- _extract_text_from_iframes: liest OneTrust/Sourcepoint/Usercentrics
  Iframe-Inhalte (manche Cookie-Policies leben dort)

Adressiert alle Findings aus dem BMW-Ground-Truth-Vergleich.
2026-05-16 17:53:14 +02:00
Benjamin Admin 4d1e0a7f8e feat(iace): GT-Bremse coverage — 59 expert measures + 7 hazard patterns
Systematic gap analysis of the Bremse ground-truth file (60 entries,
100 unique expert measures) revealed only ~5% library coverage. This
commit closes the documented gaps with concrete, norm-anchored
mitigations.

Library additions (M481-M539, 59 entries):
- M481-M482  Low-voltage isolation (>= 2,0 / 2x1,0 / 1,0 MOhm +
             IP2X/IPXXB per EN 60204-1 Ziff. 6.2/8.2.3) — primary
             trigger of this work
- M483-M485  Pneumatic safety (component pressure rating, hose
             retention, depressurization per EN ISO 4414)
- M486-M490  Robot-cell access (tool-secured fence, dual-channel
             door monitor, intentional restart, anti-trap inside
             opening, HMI sight line per ISO 10218-2)
- M491-M493  Teach mode (key/password mode selector, safe reduced
             speed <= 250 mm/s, hold-to-run with 3-stage enabler
             per ISO 10218-1)
- M494-M500  Geometry constants (Safe Limited Position, reach-over
             250 mm @ 2250 mm fence, conveyor opening >= 850 mm,
             25 mm finger gap, band speed <= 100 mm/s per
             EN ISO 13857 / EN 619)
- M501-M507  Enclosure load rating, gripper fail-safe, centring
             gripper stop on door, MWF nozzle integration, floor
             load capacity per DIN 1055-3
- M508-M517  Electrical cabling + PE protection (environment-rated,
             drag chain, strain relief, 10 mm² Cu PE, dual PE,
             monitoring, continuity check, class-II equipment,
             SELV/PELV per EN 60204-1)
- M518-M522  RCD, cable cross-section, overcurrent in each active
             conductor, IP22 water ingress, lockable main switch
- M523-M539  Teach-locked door, WZM door interlock, dual-channel
             door switch, machining-doors-closed for aerosol
             retention, post-NOTHALT release, >25 kg lifting aid
             (DGUV 208-016), 95-120 cm control height, ergonomic
             conveyor height, SDS/PSA reference, BA instructions
             for depressurization/clamp release/max weight/pinch
             warning/slip warning/dead-state cleaning

New hazard patterns (HP1710-HP1717):
floor overload, gripper failure throw, compressed-air injury in
machining cell, manual handling load + awkward posture, MWF skin
contact, live-cabinet cleaning short, pneumatic stored-energy.

Existing patterns rewired to the new measures: HP1600, HP1602-1606,
HP1610-1612, HP1620-1622, HP1630/1631/1633, HP1640/1641, HP1660/1661,
HP1675, HP1685, HP1688, HP1689, HP1698-1704.

Tooling:
- scripts/gt_measure_gap_analysis.py: 4-signal fuzzy matcher
  (Jaccard, token recall, substring containment, norm-reference
  overlap). Outputs markdown + JSON.
- gt_coverage_test.go: 23 expert-validated (GT-Nr, pattern, measure)
  triples + a norm-reference presence test for every new expert
  measure (no generic 'do X safely' entries allowed).
- .gitea/workflows/ci.yaml: new iace-gt-coverage job enforces
  MIN_COVERAGE_PCT (70%) on Strong+Weak GT coverage; never lower
  without explicit decision.

Coverage shift: 5% Strong -> 30% Strong, 0% -> 72% Strong+Weak.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-16 13:08:52 +02:00
Benjamin Admin bf9d8a5ed3 fix(iace): resolve M-ID collisions for electrical/pressure patterns
6 supplementary measures (M410-M420) were silently overwritten by
metalworking duplicates in measureByID lookups, so robot-cell electrical
patterns resolved to chip-extraction/cleaning fallbacks instead of
equipotential bonding, creepage, EMC, or hose-burst protection. Rename
supplementary IDs to M475-M480 and rewire 13 affected pattern references
in robot_cell + robot_cell_ext.

HP1640 (direct contact with live parts, GT 2.2): priority 98->99, drop
RequiredEnergyTags gate so it fires in robot cells without an electrical
tag, expand mitigations to 5 concrete TRBS 2131 / IEC 60204-1 / EN 61140
measures (basic protection, double insulation, earthing, insulation
monitoring, equipotential bonding) — was previously losing to HP1688
even though HP1688 describes a different scenario.

HP1688 (touch voltage from potential differences): priority 98->96 so it
no longer outranks HP1640 for the direct-contact case; mitigations
expanded from M410-only to 4 concrete electrical measures.

Add regression tests pinning HP1640 contact-protection resolution and
M475 = Potentialausgleich. Existing TestGetProtectiveMeasureLibrary_-
UniqueIDs now actually enforces uniqueness (previously masked by
last-wins map override).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-16 10:12:55 +02:00
Benjamin Admin d45e08e25f fix: reduce Playwright timeout 180s→60s, increase poll limit 15→25min 2026-05-16 00:47:28 +02:00
Benjamin Admin 3dbf3aa34a feat: HTTP fallback for text extraction when Playwright times out
BMW Impressum/Cookie pages timeout in Playwright (>180s) because the
SPA has many sub-links to follow. But the HTML source already contains
the text (SSR). New fallback: direct HTTP GET + HTML tag stripping.

Order: 1. Consent-tester (Playwright, 180s) → 2. HTTP GET (30s)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-15 23:16:10 +02:00
Benjamin Admin 77308b783f debug: log CreateMitigation errors 2026-05-15 21:52:04 +02:00
Sharang Parnerkar 3784988d00 chore: bump next 15.1.0 → 15.5.16 (CVE-2026-44578)
CI / secret-scan (push) Has been skipped
CI / dep-audit (push) Has been skipped
CI / sbom-scan (push) Has been skipped
CI / validate-canonical-controls (push) Successful in 1m37s
CI / detect-changes (push) Successful in 1m6s
CI / branch-name (push) Has been skipped
CI / guardrail-integrity (push) Has been skipped
CI / go-lint (push) Has been skipped
CI / python-lint (push) Has been skipped
CI / nodejs-lint (push) Has been skipped
CI / loc-budget (push) Successful in 21s
CI / nodejs-build (push) Successful in 4m16s
CI / test-go (push) Has been skipped
CI / test-python-backend (push) Has been skipped
CI / test-python-document-crawler (push) Has been skipped
CI / test-python-dsms-gateway (push) Has been skipped
Patches unauthenticated SSRF in WebSocket upgrade handler.
Applies to admin-compliance, developer-portal.

Compliance-SDK admin-dashboard skipped — has a pre-existing TS
type mismatch that blocks the build regardless of Next version.
Needs separate migration work.

GHSA-c4j6-fc7j-m34r.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-15 18:48:36 +02:00
Benjamin Admin 9797234ff6 fix(iace): add abbreviations + action words to genericSafetyTerms
KSS, EMV, ESD, DCS, PLR, SIL, HMI, SPS, RCD, LOTO, PSA are
abbreviations that should NOT trigger the relevance filter.
bersten, platzen, abspringen, spritzen, einatmen, ausrutschen,
herabfallen, durchschlaegen, wegschleudern are action words that
appear in many patterns and don't indicate a specific machine.

Fixes: HP1633-HP1675 (KSS patterns) were filtered out because
"kss" was not in the narrative but also not in genericSafetyTerms.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-15 16:05:20 +02:00
Benjamin Admin 7080eb5f45 fix(iace): boost robot cell priorities 96-99, remove debug code
Robot cell patterns now fire BEFORE generic patterns (Priority 96-99
vs generic 85-95). This ensures pattern-specific SuggestedMeasureIDs
(M420 for KSS, M410 for Potentialausgleich) reach the hazard.

Removed debug fmt.Println statements.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-15 16:01:52 +02:00
Benjamin Admin c93cf2719a debug: trace M420 in Priority-1 loop 2026-05-15 14:56:05 +02:00
Benjamin Admin 7a27dbc01b debug: check M420 in measureByID 2026-05-15 14:53:49 +02:00
Benjamin Admin de35dfce18 debug: add pattern-measure count to init step details 2026-05-15 14:51:26 +02:00
Benjamin Admin 69240faf24 fix(iace): accumulate SuggestedMeasureIDs across dedup'd patterns
When multiple patterns match the same category+zone, the first creates
the hazard and later patterns add their SuggestedMeasureIDs to the
existing hazard. This ensures KSS-specific measures (M420) reach the
hazard even if a generic pattern created it first.

seenCatZone changed from map[string]bool to map[string]uuid.UUID
to track which hazard ID was created for each dedupKey.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-15 14:45:37 +02:00
Benjamin Admin f34305c0a1 fix: increase dsi-discovery timeout 90s→300s, reduce max_documents 10→5 2026-05-15 14:21:13 +02:00
Benjamin Admin 2b5376ed54 fix(iace): pattern-specific measures take priority over category fallback
Each hazard now gets measures from its SOURCE PATTERN first
(SuggestedMeasureIDs), then category fallback for remaining slots.

Previously all mechanical hazards got the same generic top-5 measures
(Gefahrstelle eliminieren, Sicherheitsabstaende, Scharfe Kanten...).
Now a KSS-Schlauch hazard gets M420 (Druckfeste Auslegung) first.

SuggestedMeasureIDs added to PatternMatch struct and passed through
from pattern definition to hazard creation to measure assignment.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-15 14:17:32 +02:00
Benjamin Admin 958c03ab40 fix(iace): add human reference to all 33 robot cell patterns
Every ScenarioDE now describes how a PERSON is affected, not just
what happens to the machine. Every HarmDE describes the INJURY,
not just the technical effect.

Examples:
- "Peitscheneffekt des Schlauchs" → "Person wird von abspringendem
  Schlauch getroffen. KSS-Spritzer verletzen Haut und Augen."
- "Kurzschluss, Brand" → "Person wird durch Brand oder toxische
  Rauchgase verletzt. Verbrennungen, Rauchvergiftung."

Rule: Risikobeurteilung bewertet Gefahr fuer PERSONEN.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-15 13:43:54 +02:00
Benjamin Admin fca67c1f43 fix: accordion close bug + merge multi-page DSIs (BMW fix)
1. _expand_all_interactive(): Only click aria-expanded="false" buttons.
   Before: clicked ALL accordion buttons including open ones → BMW's
   pre-expanded accordions got CLOSED, reducing text from 1151 to 361w.

2. _fetch_text() + /extract-text: merge ALL documents found on a page
   (max_documents=10 instead of 1). BMW splits DSI across 5 sub-pages
   that the discovery finds as separate documents — now merged.

3. Tab panels: unhide hidden tabpanels instead of clicking tabs
   (clicking tabs can hide the currently visible panel).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-15 13:32:04 +02:00
Benjamin Admin 70af018da5 docs(gt): BMW cross-domain finding — 3 domains, no AGB, Social Media on jobs portal 2026-05-15 13:21:27 +02:00