Commit Graph

6 Commits

Author SHA1 Message Date
Benjamin Admin 33bf2b7c5a feat(service-detector): detect 118 services in legal texts (was 20)
Build + Deploy / build-admin-compliance (push) Successful in 2m5s
Build + Deploy / build-backend-compliance (push) Successful in 3m26s
Build + Deploy / build-ai-sdk (push) Successful in 56s
Build + Deploy / build-developer-portal (push) Successful in 1m29s
Build + Deploy / build-tts (push) Failing after 1m48s
Build + Deploy / build-document-crawler (push) Successful in 44s
Build + Deploy / build-dsms-gateway (push) Successful in 28s
Build + Deploy / build-dsms-node (push) Successful in 17s
CI / branch-name (push) Has been skipped
Build + Deploy / trigger-orca (push) Has been skipped
CI / guardrail-integrity (push) Has been skipped
CI / loc-budget (push) Failing after 17s
CI / secret-scan (push) Has been skipped
CI / go-lint (push) Has been skipped
CI / python-lint (push) Has been skipped
CI / nodejs-lint (push) Has been skipped
CI / nodejs-build (push) Successful in 2m45s
CI / dep-audit (push) Has been skipped
CI / sbom-scan (push) Has been skipped
CI / test-go (push) Failing after 52s
CI / test-python-backend (push) Successful in 36s
CI / test-python-document-crawler (push) Successful in 25s
CI / test-python-dsms-gateway (push) Successful in 21s
CI / validate-canonical-controls (push) Successful in 14s
New service_detector.py uses service_registry (88 entries) plus 30+
extra text patterns to detect services mentioned in DSI/legal texts.

Results on Spiegel: 31/32 services detected (97%, was 5/32 = 16%).
Includes metadata: name, category, country, EU adequacy status.

- Profiler now uses detect_services_in_text() instead of 20-entry list
- Profile extractor adds detected_services with full metadata
- Auto-generates scope hint for non-EU services (Drittlandtransfer)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-13 16:00:15 +02:00
Benjamin Admin 5e317d2f0f fix: text extraction 50k char limit was root cause of all Spiegel FNs
Build + Deploy / build-admin-compliance (push) Successful in 18s
Build + Deploy / build-backend-compliance (push) Successful in 12s
Build + Deploy / build-ai-sdk (push) Successful in 10s
Build + Deploy / build-developer-portal (push) Successful in 10s
Build + Deploy / build-tts (push) Successful in 10s
Build + Deploy / build-document-crawler (push) Successful in 9s
Build + Deploy / build-dsms-gateway (push) Successful in 10s
Build + Deploy / build-dsms-node (push) Successful in 15s
CI / branch-name (push) Has been skipped
CI / guardrail-integrity (push) Has been skipped
CI / loc-budget (push) Failing after 17s
CI / secret-scan (push) Has been skipped
CI / go-lint (push) Has been skipped
CI / python-lint (push) Has been skipped
CI / nodejs-lint (push) Has been skipped
CI / nodejs-build (push) Successful in 2m46s
CI / dep-audit (push) Has been skipped
CI / sbom-scan (push) Has been skipped
CI / test-go (push) Failing after 41s
CI / test-python-backend (push) Successful in 37s
CI / test-python-document-crawler (push) Successful in 27s
CI / test-python-dsms-gateway (push) Successful in 22s
CI / validate-canonical-controls (push) Successful in 13s
Build + Deploy / trigger-orca (push) Successful in 2m13s
ROOT CAUSE: main.py line 338 truncated full_text at 50,000 chars.
Spiegel DSI has 107,720 chars (13,705 words) — only 47% was extracted.
DSB, Art. 77, Betroffenenrechte were all in the truncated portion.

Fixes:
1. Raise text limit from 50k to 200k chars in API response + discovery
2. click_button(): add iframe fallback for Sourcepoint/Quantcast
3. dsi_helpers: iterate ALL page.frames for consent buttons
4. Profiler: only check impressum (not full text) for regulated professions,
   and "rechtsanwalt" must be in first 500 chars (company description)
5. GT: save full Spiegel DSI text (13,705 words) as reference

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-13 15:22:38 +02:00
Benjamin Admin c702260ec1 fix: 5 regex bugs + text extraction scroll + GT update
Build + Deploy / build-admin-compliance (push) Successful in 13s
Build + Deploy / build-backend-compliance (push) Successful in 23s
Build + Deploy / build-ai-sdk (push) Successful in 13s
Build + Deploy / build-developer-portal (push) Successful in 14s
Build + Deploy / build-tts (push) Successful in 15s
Build + Deploy / build-document-crawler (push) Successful in 13s
Build + Deploy / build-dsms-gateway (push) Successful in 15s
Build + Deploy / build-dsms-node (push) Successful in 14s
CI / branch-name (push) Has been skipped
CI / guardrail-integrity (push) Has been skipped
CI / loc-budget (push) Failing after 15s
CI / secret-scan (push) Has been skipped
CI / go-lint (push) Has been skipped
CI / python-lint (push) Has been skipped
CI / nodejs-lint (push) Has been skipped
CI / nodejs-build (push) Successful in 2m26s
CI / dep-audit (push) Has been skipped
CI / sbom-scan (push) Has been skipped
CI / test-go (push) Successful in 39s
CI / test-python-backend (push) Successful in 39s
CI / test-python-document-crawler (push) Successful in 25s
CI / test-python-dsms-gateway (push) Successful in 22s
CI / validate-canonical-controls (push) Successful in 15s
Build + Deploy / trigger-orca (push) Successful in 2m28s
Root cause: Spiegel DSI text was truncated (lazy-loading) — the
rights/DSB/complaints sections at the bottom were never extracted.

Fixes:
1. Text extraction: scroll to bottom before innerText (dsi_discovery.py)
2. V.i.S.d.P.: add "verantwortlicher i.s.v." + "§18 Abs. N MStV" pattern
3. USt-IdNr: add "umsatzsteuer-id" + "DE 212 442 423" (with spaces)
4. Profiler: remove generic "anwalt"/"praxis" (false positive on Spiegel
   "Redaktionsanwalt"), keep only "rechtsanwalt", "kanzlei" etc.
5. Section splitter: auto_fill_from_dsi() fills empty Cookie/Social-Media
   rows from sections found in the DSI text

Ground Truth 06-spiegel.md fully rewritten with verified data from
live website — 3 L1 False Negatives identified (DSB, Beschwerderecht,
Betroffenenrechte all present on website but not in extracted text).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-13 01:20:55 +02:00
Benjamin Admin be9cfdc2d4 feat(compliance-check): skip Widerruf for B2B, limit MCs, fix industry
Build + Deploy / build-admin-compliance (push) Successful in 2m1s
Build + Deploy / build-backend-compliance (push) Successful in 4m20s
Build + Deploy / build-ai-sdk (push) Successful in 53s
Build + Deploy / build-developer-portal (push) Successful in 2m6s
Build + Deploy / build-tts (push) Successful in 2m48s
Build + Deploy / build-document-crawler (push) Successful in 52s
Build + Deploy / build-dsms-gateway (push) Successful in 11s
Build + Deploy / build-dsms-node (push) Successful in 13s
CI / branch-name (push) Has been skipped
CI / guardrail-integrity (push) Has been skipped
CI / loc-budget (push) Failing after 15s
CI / secret-scan (push) Has been skipped
CI / go-lint (push) Has been skipped
CI / python-lint (push) Has been skipped
CI / nodejs-lint (push) Has been skipped
CI / nodejs-build (push) Successful in 2m45s
CI / dep-audit (push) Has been skipped
CI / sbom-scan (push) Has been skipped
CI / test-go (push) Successful in 45s
CI / test-python-backend (push) Successful in 41s
CI / test-python-document-crawler (push) Successful in 26s
CI / test-python-dsms-gateway (push) Successful in 21s
CI / validate-canonical-controls (push) Successful in 15s
Build + Deploy / trigger-orca (push) Successful in 3m17s
- Skip Widerrufsbelehrung check entirely for B2B/B2G businesses
- Limit MC checks to top 20 per doc_type (by severity) to reduce noise
  (e.g. 75 impressum MCs → 20, avoiding 55 irrelevant FAILs)
- Add consulting/manufacturing industry keywords (arbeitssicherheit,
  brandschutz, werkzeugbau, etc.)
- Lower industry detection threshold from 2 to 1 keyword hit

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-12 17:03:57 +02:00
Benjamin Admin 407a9503e4 fix(profiler): fix B2G false positive + add consulting/manufacturing
Build + Deploy / build-admin-compliance (push) Successful in 2m27s
Build + Deploy / build-backend-compliance (push) Successful in 3m40s
Build + Deploy / build-ai-sdk (push) Successful in 1m0s
Build + Deploy / build-developer-portal (push) Successful in 1m16s
Build + Deploy / build-tts (push) Successful in 1m54s
Build + Deploy / build-document-crawler (push) Successful in 1m2s
Build + Deploy / build-dsms-gateway (push) Successful in 31s
Build + Deploy / build-dsms-node (push) Successful in 20s
CI / branch-name (push) Has been skipped
CI / guardrail-integrity (push) Has been skipped
CI / loc-budget (push) Failing after 17s
CI / secret-scan (push) Has been skipped
CI / go-lint (push) Has been skipped
CI / python-lint (push) Has been skipped
CI / nodejs-lint (push) Has been skipped
CI / nodejs-build (push) Successful in 2m44s
CI / dep-audit (push) Has been skipped
CI / sbom-scan (push) Has been skipped
CI / test-go (push) Successful in 49s
CI / test-python-backend (push) Successful in 36s
CI / test-python-document-crawler (push) Successful in 25s
CI / test-python-dsms-gateway (push) Successful in 21s
CI / validate-canonical-controls (push) Successful in 14s
Build + Deploy / trigger-orca (push) Successful in 3m23s
- Remove generic B2G keywords (behörde, amt, öffentlich) that match in
  every DSI due to "Aufsichtsbehörde", "Amtsgericht", "veröffentlichen"
- Remove "server" from it_services (too generic, appears in every DSI)
- Add consulting, manufacturing, media industries
- Add B2B fallback for GmbH/AG without B2C signals
- Add 10 ground truth files for unified compliance check

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-12 12:20:44 +02:00
Benjamin Admin 0d0e705117 feat: Unified Compliance-Check — 8 document types in one form
New 3-tab structure: Website-Scan, Compliance-Check, Banner-Check.

Compliance-Check Tab (replaces Dokumenten-Pruefung + Impressum-Check):
- 8 document rows: DSI, Impressum, Social Media, Cookie, AGB,
  Nutzungsbedingungen, Widerruf, DSB-Kontakt
- Each row: URL input + "Text laden" + file upload + manual text
- "Text laden" extracts via consent-tester, shows in editable textarea
- User verifies/corrects text before checking
- Empty fields = "not present" → own finding

Business Profiler (business_profiler.py):
- Detects B2B/B2C/B2G from all documents together
- Recognizes regulated professions, online shops, editorial content
- Context-aware: INFO checks become PASS/FAIL based on profile

Backend: /compliance-check + /extract-text endpoints
Frontend: ComplianceCheckTab.tsx + DocumentRow.tsx
API proxies: compliance-check/route.ts + extract-text/route.ts

Also: Impressum regex fixes (Telefon, AG, Geschaeftsfuehrung)
and INFO severity for context-dependent checks.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-11 20:56:10 +02:00