breakpilot-compliance

Author	SHA1	Message	Date
Benjamin Admin	b090662524	fix(compliance-check): respect auto-discovery 'not found' verdict; DSB not canonical Two related bugs in the BMW test result: 1. AGB rendered as 'MANGELHAFT 0/13' even though BMW has no public AGB: - Auto-discovery correctly returned 'not found' for AGB (no link on bmw.de matches AGB keywords). - But auto_fill_from_dsi then found the substring 'AGB' in a section of the DSI and pseudo-filled the AGB entry with a 264-word DSI fragment. - cross_search_documents would have done the same. - Both now skip entries where discovery_attempted=True AND auto_discovered=False — the 'not found' verdict stands. 2. DSB-Kontakt rendered as a separate 100% OK document with 7566 words = the entire DSI text: - GDPR practice: the DSB is named inside the DSI as an email or contact block (Art. 13(1)(b)), not as a stand-alone page. - cross_search_documents had been assigning the full DSI to the DSB row because it matched 'datenschutzbeauftragte' keywords. - DSB removed from _ALL_DOC_TYPES — no longer canonical, no longer padded as missing, no longer auto-discovered. The frontend row remains so a tenant with a separate DSB page can still submit one. After this fix BMW should render: - DSE: OK - Impressum: LUECKENHAFT (unchanged — regex gaps to fix separately) - Cookie-Richtlinie: OK - Social Media: NICHT GEFUNDEN (bmw.de does not link to it) - AGB: NICHT GEFUNDEN (correct — BMW has no public AGB) - Nutzungsbedingungen: NICHT GEFUNDEN - Widerruf: NICHT GEFUNDEN	2026-05-17 01:53:09 +02:00
Benjamin Admin	bd2d6976d6	fix(cross-doc): also check entries with wrong text, not just empty ones Cross-search now validates if existing text matches the expected doc_type using keyword scoring. If text is present but doesn't match (e.g. Nutzungsbedingungen in Widerruf row), searches other texts and creates a finding explaining the mismatch. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-05-15 00:19:40 +02:00
Benjamin Admin	4e9043f26d	feat(cross-doc): search all texts for all doc_types + misplacement finding Cross-Document Intelligence: When a doc_type row is empty, searches ALL other loaded documents for that content. If found (e.g. Widerruf in AGB), extracts the section, runs the check, AND creates a finding: "Widerrufsbelehrung in falschem Dokument gefunden — schwer auffindbar" Keywords for: widerruf, cookie, social_media, impressum, agb, dsb. Integrated as Step 1c in compliance check pipeline. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-05-14 23:19:39 +02:00
Benjamin Admin	c702260ec1	fix: 5 regex bugs + text extraction scroll + GT update Root cause: Spiegel DSI text was truncated (lazy-loading) — the rights/DSB/complaints sections at the bottom were never extracted. Fixes: 1. Text extraction: scroll to bottom before innerText (dsi_discovery.py) 2. V.i.S.d.P.: add "verantwortlicher i.s.v." + "§18 Abs. N MStV" pattern 3. USt-IdNr: add "umsatzsteuer-id" + "DE 212 442 423" (with spaces) 4. Profiler: remove generic "anwalt"/"praxis" (false positive on Spiegel "Redaktionsanwalt"), keep only "rechtsanwalt", "kanzlei" etc. 5. Section splitter: auto_fill_from_dsi() fills empty Cookie/Social-Media rows from sections found in the DSI text Ground Truth 06-spiegel.md fully rewritten with verified data from live website — 3 L1 False Negatives identified (DSB, Beschwerderecht, Betroffenenrechte all present on website but not in extracted text). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-05-13 01:20:55 +02:00
Benjamin Admin	74f00bbb0f	feat(compliance-check): split shared URLs into sections per doc_type When the same URL is used for multiple document types (e.g. /datenschutz for DSI + Cookie + DSB), the section splitter now: - Detects duplicate URLs and fetches text only once - Splits text at classified headings (Cookie, Google Analytics, etc.) - Assigns matching sections to each doc_type - DSI always keeps the full text Extracted to section_splitter.py (170 LOC) to keep routes under 500. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-05-12 12:49:57 +02:00
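
The shared-URL behaviour described in commit 74f00bbb0f can be illustrated with a minimal sketch. This is not the code in section_splitter.py: the entry layout (dicts with `url`, `doc_type`, `text`), the `"dsi"` and `"cookie"` labels, the heading heuristic, and the names `HEADING_KEYWORDS`, `classify_heading`, `split_sections`, and `fill_shared_urls` are all assumptions made for illustration.

```python
from collections import Counter, defaultdict

# Hypothetical heading keywords per doc_type; the real classifier is not shown in the log.
HEADING_KEYWORDS = {
    "cookie": ("cookie", "google analytics"),
}


def classify_heading(line):
    """Return the doc_type a heading line belongs to, or None if unclassified."""
    lowered = line.lower()
    for doc_type, keywords in HEADING_KEYWORDS.items():
        if any(keyword in lowered for keyword in keywords):
            return doc_type
    return None


def split_sections(text):
    """Split text at heading lines and collect the lines under classified headings."""
    sections = defaultdict(list)
    current = None
    for line in text.splitlines():
        stripped = line.strip()
        # Assumption: a heading is a short, non-empty line without a trailing period.
        is_heading = 0 < len(stripped) <= 80 and not stripped.endswith(".")
        if is_heading:
            current = classify_heading(stripped)
        if current is not None:
            sections[current].append(line)
    return {doc_type: "\n".join(lines).strip() for doc_type, lines in sections.items()}


def fill_shared_urls(entries, fetch):
    """Fetch each distinct URL once, then give each doc_type its matching section."""
    cache = {}
    for entry in entries:
        if entry["url"] not in cache:
            cache[entry["url"]] = fetch(entry["url"])
    url_counts = Counter(entry["url"] for entry in entries)
    for entry in entries:
        full_text = cache[entry["url"]]
        if entry["doc_type"] == "dsi" or url_counts[entry["url"]] == 1:
            # The DSI (and any entry with a URL of its own) keeps the full text.
            entry["text"] = full_text
        else:
            entry["text"] = split_sections(full_text).get(entry["doc_type"], "")
    return entries
```

Passing `fetch` in as a parameter keeps the sketch independent of whatever crawler the service actually uses and makes the fetch-once behaviour easy to observe in a test.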

5 Commits
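
The guard introduced in commit b090662524 (skip entries where discovery_attempted=True AND auto_discovered=False) can be sketched as follows. Only the two flag names come from the commit message; the dict-based entry layout and the helper names `discovery_said_not_found` and `autofill_candidates` are hypothetical.

```python
def discovery_said_not_found(entry):
    """True when auto-discovery ran for this entry and found nothing."""
    return bool(entry.get("discovery_attempted")) and not entry.get("auto_discovered")


def autofill_candidates(entries):
    """Return empty entries that a fallback (DSI auto-fill, cross-search) may still fill.

    Entries with a 'not found' verdict are left alone so the verdict stands.
    """
    return [
        entry
        for entry in entries
        if not entry.get("text") and not discovery_said_not_found(entry)
    ]
```

Under these assumptions, an AGB row that discovery checked and did not find is excluded, while a row for which discovery never ran remains eligible for filling from other documents.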