ROOT CAUSE: main.py line 338 truncated full_text at 50,000 chars.
Spiegel DSI has 107,720 chars (13,705 words) — only 47% was extracted.
DSB, Art. 77, Betroffenenrechte were all in the truncated portion.
Fixes:
1. Raise text limit from 50k to 200k chars in API response + discovery
2. click_button(): add iframe fallback for Sourcepoint/Quantcast
3. dsi_helpers: iterate ALL page.frames for consent buttons
4. Profiler: only check impressum (not full text) for regulated professions,
and "rechtsanwalt" must be in first 500 chars (company description)
5. GT: save full Spiegel DSI text (13,705 words) as reference
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Root cause: Spiegel DSI text was truncated (lazy-loading) — the
rights/DSB/complaints sections at the bottom were never extracted.
Fixes:
1. Text extraction: scroll to bottom before innerText (dsi_discovery.py)
2. V.i.S.d.P.: add "verantwortlicher i.s.v." + "§18 Abs. N MStV" pattern
3. USt-IdNr: add "umsatzsteuer-id" + "DE 212 442 423" (with spaces)
4. Profiler: remove generic "anwalt"/"praxis" (false positive on Spiegel
"Redaktionsanwalt"), keep only "rechtsanwalt", "kanzlei" etc.
5. Section splitter: auto_fill_from_dsi() fills empty Cookie/Social-Media
rows from sections found in the DSI text
Ground Truth 06-spiegel.md fully rewritten with verified data from
live website — 3 L1 False Negatives identified (DSB, Beschwerderecht,
Betroffenenrechte all present on website but not in extracted text).
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
3 bestehende Ansätze (IACE deterministisch, Doc-Check LLM, Gap-Analyse regelbasiert)
und was der Compiler von jedem übernimmt.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Remove generic B2G keywords (behörde, amt, öffentlich) that match in
every DSI due to "Aufsichtsbehörde", "Amtsgericht", "veröffentlichen"
- Remove "server" from it_services (too generic, appears in every DSI)
- Add consulting, manufacturing, media industries
- Add B2B fallback for GmbH/AG without B2C signals
- Add 10 ground truth files for unified compliance check
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Tested BMW, Stadt Koeln, BfDI, Sparkasse, Caritas, TUEV Sued,
Spiegel, ETO Gruppe, EUIPO. Key findings:
- Stadt Koeln + ETO Gruppe best (95% correctness)
- BMW, Sparkasse, Spiegel genuinely deficient (verified)
- EUIPO uses EU Regulation 2018/1725, not GDPR — needs separate checklist
- ~0-2 false positives per website after LLM verification
7 regex fixes emerged from batch testing (soft hyphens, word
insertions, numbered headings, German section names, etc.)
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Detailed plan for upgrading the 22 existing Playwright-based banner
checks to the same quality level as the document checks:
- 6 L1 + 30 L2 hierarchical checks
- Expert hints with EuGH/CNIL/DSK/EDPB references
- 3-phase evidence (before consent, after reject, after accept)
- Dark pattern detection (button size, color, click asymmetry)
- Estimated 3-4h implementation
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Fixes based on manual verification of all 30 failed checks:
1. Cookie table: recognize "folgende cookies" + column headers as text
2. Cookie names: add JSESSIONID, cookieinfo, et_id, BT_* patterns
3. Essential justified: match "sitzung zuordnen", "betrieb der website"
4. Social bookmarks: recognize as 2-click alternative
5. DSFA plural: "kanaelen" now matches alongside "kanal"
6. Section splitter: skip-headings no longer lose subsequent text
(Risikoabwaegung section was cut from DSFA, losing risk scores)
7. Cookie legal basis: accept Art. 6(1)(f) in cookie context
Reduces false positives from 7 to ~1-2 for IHK Konstanz test case.
Ground truth table: zeroclaw/docs/ground-truth-ihk-konstanz.md
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Only Cookie and Widerruf sections are checked as separate documents.
Social Media, DSFA, Betroffenenrechte, Dienste von Drittanbietern are
part of the parent DSI and no longer generate false findings.
Added PLAN-rag-document-check.md for Phase 2:
- RAG-based checks with document-type-specific Controls
- DSFA checklist (Art. 35 + Landes-Listen)
- AVV checklist (Art. 28)
- Reference detection (sub-doc → parent doc)
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Real-world case from EU authority (EUIPO) with 7 findings:
- Grammatically broken consent text (bad DE translation)
- Coupling prohibition violation (login = consent, Art. 7(4) DSGVO)
- No reject button, no granularity, no active opt-in
- Broken link layout (DSE/ToS links appear after submit button)
- Includes correction suggestion and planned agent check implementations
- Pattern: WSO2 Identity Server default templates (systemic issue)
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
mandatory_content_checker.py keywords break with alternative formulations.
Solution: LLM-based check per mandatory field (9 calls, parallelizable).
For other session to implement alongside Dict→Control migration.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
6 files with hardcoded legal knowledge identified. Review deadline 2026-07-01.
legal_basis_validator.py marked with warning log on every use.
Instruction file for other session to execute migration.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Migration 086: compliance_agent_scans table (findings, services, corrections)
- agent_history_routes.py: POST /scans (save), GET /scans (list), GET /scans/{id}
- Scan results survive page reloads and can be reviewed later
- Phase 10 (Playwright website scanner) added to product roadmap
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Automated comparison: services mentioned in privacy policy vs. actually
embedded on website. Three categories: undocumented (Art. 13 violation),
outdated (cleanup), correctly documented (check third country only).
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Multi-page crawl: scan 5-10 strategic pages (start, footer links) for
chatbot widgets, AI text mentions, and tracking services. Feed results
into relevance filter to reduce false positives.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Addresses false-positive controls like C_TRANSPARENCY being recommended
when no AI usage is evident. Plan for separate implementation session.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>