feat(audit): Text-Paste-Mode pro Row — Crawler optional umgehen
CI / detect-changes (push) Successful in 12s
CI / branch-name (push) Has been skipped
CI / nodejs-build (push) Successful in 3m27s
CI / iace-gt-coverage (push) Has been skipped
CI / guardrail-integrity (push) Has been skipped
CI / secret-scan (push) Has been skipped
CI / dep-audit (push) Has been skipped
CI / sbom-scan (push) Has been skipped
CI / validate-canonical-controls (push) Successful in 17s
CI / loc-budget (push) Failing after 20s
CI / go-lint (push) Has been skipped
CI / python-lint (push) Has been skipped
CI / nodejs-lint (push) Has been skipped
CI / test-go (push) Has been skipped
CI / test-python-backend (push) Successful in 47s
CI / test-python-document-crawler (push) Has been skipped
CI / test-python-dsms-gateway (push) Has been skipped

Hintergrund: VW liefert ueber URL-Crawler nur 6 Vendors statt der 100+
die in der echten Cookie-Tabelle stehen. Wenn der User die Tabelle aber
direkt von der Site kopieren kann (was bei den meisten OEM-Sites moeglich
ist), umgehen wir den Crawler komplett und parsen den Text deterministisch.

Backend:
* doc_type_classifier.py — 7 Pattern-Gruppen (§5 TMG, Art.13 DSGVO,
  AGB-Klauseln, Widerrufs-Frist, Cookie-Tabellen-Header, etc). Wenn der
  User Text ins falsche Doc-Type-Feld kopiert (Impressum->DSE),
  detect_mismatch liefert detected + action ('reclassify' bei sehr hoher
  Konfidenz, 'warn' bei medium).
* cookies_table_parser.py — Tab/Pipe/Komma/Semicolon-Separator-Auto-
  Detection, Spalten-Mapping per Header-Keyword. Aggregiert Cookie-
  Eintraege zu Vendor-Records (mit _guess_vendor-Fallback). Voll
  deterministisch, kein LLM.
* doc_input_warnings.py — Mail-Block ueber dem Audit, der Mismatches +
  Auto-Reclassifies dem User transparent macht.
* Pipeline: text gewinnt ueber url (war schon im Schema vermerkt), neue
  Felder declared_doc_type / input_source / reclassify_hint in doc_entries.
  Pasted-Tabellen-Vendors haben Vorrang vor Library-Fallback + LLM-Cascade
  (sind 100% genau).

Frontend (DocCheckTab):
* Pro Row Mode-Toggle 'URL' / 'Text einfuegen' (lila wenn aktiv).
* Textarea (h-32, monospace) im text-mode mit kontext-spezifischem
  Placeholder (Cookie-Hinweis ggue. anderen Doc-Types) und Live-
  Zeichen-/Wort-Counter.
* Submit-Button accepted entries mit URL ODER text.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
This commit is contained in:
Benjamin Admin
2026-05-21 18:58:32 +02:00
parent 7335f64f4f
commit e411c4f0d3
5 changed files with 670 additions and 49 deletions
@@ -323,12 +323,25 @@ async def _run_compliance_check(check_id: str, req: ComplianceCheckRequest):
url_text_cache: dict[str, str] = {}
n_docs = max(1, len(req.documents))
# User-pasted-Tabellen-Vendors (kein LLM noetig) — werden weiter
# unten in cmp_vendors gemerged.
pasted_table_vendors: list[dict] = []
for i, doc in enumerate(req.documents):
pct = int(1 + (i / n_docs) * 29)
_update(check_id, f"Texte laden {i+1}/{n_docs}: {doc.doc_type}...", pct)
text = doc.text
text = (doc.text or "").strip()
input_source = "url"
cmp_payloads: list[dict] = []
if not text and doc.url:
if text:
input_source = "text"
if doc.url:
input_source = "text+url" # User hat beide gefuellt
logger.info(
"doc_type=%s: User hat URL UND Text geliefert — "
"Text gewinnt, URL wird als Quellen-Referenz behalten",
doc.doc_type,
)
elif doc.url:
url_key = doc.url.strip().rstrip("/").lower()
if url_key in url_text_cache:
text = url_text_cache[url_key]
@@ -336,16 +349,62 @@ async def _run_compliance_check(check_id: str, req: ComplianceCheckRequest):
text, cmp_payloads = await _fetch_text(doc.url, doc_type=doc.doc_type)
if text:
url_text_cache[url_key] = text
# Auto-Reclassify-Check: wenn der user Text in das falsche
# Doc-Type-Feld kopiert hat (z.B. Impressum-Text in DSE),
# erkennen und ggf. umtaggen.
actual_doc_type = doc.doc_type
reclassify_hint: dict | None = None
if input_source.startswith("text") and len(text) >= 500:
try:
from compliance.services.doc_type_classifier import (
detect_mismatch,
)
reclassify_hint = detect_mismatch(doc.doc_type, text)
if reclassify_hint and reclassify_hint["action"] == "reclassify":
actual_doc_type = reclassify_hint["detected"]
logger.info(
"doc_type AUTO-RECLASSIFY: deklariert=%s "
"erkannt=%s (score %d vs %d) — uebernehme erkannten Typ",
doc.doc_type, actual_doc_type,
reclassify_hint["detected_score"],
reclassify_hint["declared_score"],
)
except Exception as e:
logger.warning("doc_type_classifier failed: %s", e)
# Cookie-Tabelle: wenn User Tabelle reinkopiert hat, deterministisch
# parsen (kein LLM noetig) und Vendors gleich ableiten.
if input_source.startswith("text") and actual_doc_type == "cookie":
try:
from compliance.services.cookies_table_parser import (
parse_cookie_table,
)
tab_vendors = parse_cookie_table(text)
if tab_vendors:
pasted_table_vendors.extend(tab_vendors)
logger.info(
"Cookie-Tabelle erkannt im pasted Text — "
"%d Vendors / %d Cookies deterministisch geparst",
len(tab_vendors),
sum(len(v.get("cookies", [])) for v in tab_vendors),
)
except Exception as e:
logger.warning("cookies_table_parser failed: %s", e)
if text:
doc_texts[doc.doc_type] = text
doc_texts[actual_doc_type] = text
doc_entries.append({
"doc_type": doc.doc_type,
"url": doc.url,
"text": text,
"word_count": len(text.split()) if text else 0,
"auto_discovered": False,
"doc_type": actual_doc_type,
"declared_doc_type": doc.doc_type,
"url": doc.url,
"text": text,
"word_count": len(text.split()) if text else 0,
"auto_discovered": False,
"discovery_attempted": False,
"cmp_payloads": cmp_payloads,
"cmp_payloads": cmp_payloads,
"input_source": input_source,
"reclassify_hint": reclassify_hint,
})
# Step 1a-bis: AUTO-DISCOVERY. For each canonical doc_type the user
@@ -767,6 +826,25 @@ async def _run_compliance_check(check_id: str, req: ComplianceCheckRequest):
logger.info("P57: added %d new vendors from Phase G (total: %d)",
added, len(cmp_vendors))
# User-pasted Cookie-Tabelle (deterministisch, kein LLM):
# die hat IMMER Vorrang weil 100% genau.
if pasted_table_vendors:
existing = {(v.get("name") or "").strip().lower()
for v in cmp_vendors}
added_p = 0
for v in pasted_table_vendors:
nm = (v.get("name") or "").strip()
if not nm or nm.lower() in existing:
continue
cmp_vendors.append(v)
existing.add(nm.lower())
added_p += 1
if added_p:
logger.info(
"Pasted-Tabellen-Merge: +%d Vendors (total: %d)",
added_p, len(cmp_vendors),
)
# Cookie-Library-Fallback (P52 Lite): wenn weiterhin wenige
# Vendors aber viele after_accept-Cookies, aus Library auflösen.
# VW-Lehre: 6 LLM-Grob-Vendors reichen NICHT — die Library
@@ -1250,6 +1328,19 @@ async def _run_compliance_check(check_id: str, req: ComplianceCheckRequest):
except Exception as e:
logger.warning("P82 GF-1-pager skipped: %s", e)
# Doc-Input-Warnings — wenn User Text ins falsche Feld gepastet hat
input_warn_html = ""
try:
from compliance.services.doc_input_warnings import (
collect_warnings, build_warnings_block_html,
)
warns = collect_warnings(doc_entries)
if warns:
input_warn_html = build_warnings_block_html(warns)
logger.info("doc-input-warnings: %d Mismatches gefunden", len(warns))
except Exception as e:
logger.warning("doc-input-warnings skipped: %s", e)
# P86: Branchen-Benchmark (nur wenn scan_context.industry gesetzt)
bench_html = ""
try:
@@ -1293,7 +1384,7 @@ async def _run_compliance_check(check_id: str, req: ComplianceCheckRequest):
logger.warning("P84 diff-mode skipped: %s", e)
full_html = (
gf_one_pager_html + bench_html + diff_html
gf_one_pager_html + input_warn_html + bench_html + diff_html
+ critical_html + scope_disclaimer_html + exec_summary_html
+ cookie_arch_html + summary_html + scanned_html + profile_html
+ scorecard_html + redundancy_html