Two related bugs in the BMW test result:
1. AGB rendered as 'MANGELHAFT 0/13' even though BMW has no public AGB:
- Auto-discovery correctly returned 'not found' for AGB (no link on
bmw.de matches AGB keywords).
- But auto_fill_from_dsi then found the substring 'AGB' in a section
of the DSI and pseudo-filled the AGB entry with a 264-word DSI
fragment.
- cross_search_documents would have done the same.
- Both now skip entries where discovery_attempted=True AND
auto_discovered=False — the 'not found' verdict stands.
2. DSB-Kontakt rendered as a separate 100% OK document with 7566 words
= the entire DSI text:
- GDPR practice: the DSB is named *inside* the DSI as an email or
contact block (Art. 13(1)(b)), not as a stand-alone page.
- cross_search_documents had been assigning the full DSI to the DSB
row because it matched 'datenschutzbeauftragte' keywords.
- DSB removed from _ALL_DOC_TYPES — no longer canonical, no longer
padded as missing, no longer auto-discovered. The frontend row
remains so a tenant with a separate DSB page can still submit one.
After this fix BMW should render:
- DSE: OK
- Impressum: LUECKENHAFT (unchanged — regex gaps to fix separately)
- Cookie-Richtlinie: OK
- Social Media: NICHT GEFUNDEN (bmw.de does not link to it)
- AGB: NICHT GEFUNDEN (correct — BMW has no public AGB)
- Nutzungsbedingungen: NICHT GEFUNDEN
- Widerruf: NICHT GEFUNDEN
Cross-search now validates if existing text matches the expected
doc_type using keyword scoring. If text is present but doesn't match
(e.g. Nutzungsbedingungen in Widerruf row), searches other texts
and creates a finding explaining the mismatch.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Cross-Document Intelligence: When a doc_type row is empty, searches
ALL other loaded documents for that content. If found (e.g. Widerruf
in AGB), extracts the section, runs the check, AND creates a finding:
"Widerrufsbelehrung in falschem Dokument gefunden — schwer auffindbar"
Keywords for: widerruf, cookie, social_media, impressum, agb, dsb.
Integrated as Step 1c in compliance check pipeline.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Root cause: Spiegel DSI text was truncated (lazy-loading) — the
rights/DSB/complaints sections at the bottom were never extracted.
Fixes:
1. Text extraction: scroll to bottom before innerText (dsi_discovery.py)
2. V.i.S.d.P.: add "verantwortlicher i.s.v." + "§18 Abs. N MStV" pattern
3. USt-IdNr: add "umsatzsteuer-id" + "DE 212 442 423" (with spaces)
4. Profiler: remove generic "anwalt"/"praxis" (false positive on Spiegel
"Redaktionsanwalt"), keep only "rechtsanwalt", "kanzlei" etc.
5. Section splitter: auto_fill_from_dsi() fills empty Cookie/Social-Media
rows from sections found in the DSI text
Ground Truth 06-spiegel.md fully rewritten with verified data from
live website — 3 L1 False Negatives identified (DSB, Beschwerderecht,
Betroffenenrechte all present on website but not in extracted text).
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
When the same URL is used for multiple document types (e.g. /datenschutz
for DSI + Cookie + DSB), the section splitter now:
- Detects duplicate URLs and fetches text only once
- Splits text at classified headings (Cookie, Google Analytics, etc.)
- Assigns matching sections to each doc_type
- DSI always keeps the full text
Extracted to section_splitter.py (170 LOC) to keep routes under 500.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>