feat(audit): per-row text-paste mode, optionally bypassing the crawler
CI / detect-changes (push) Successful in 12s
CI / branch-name (push) Has been skipped
CI / nodejs-build (push) Successful in 3m27s
CI / iace-gt-coverage (push) Has been skipped
CI / guardrail-integrity (push) Has been skipped
CI / secret-scan (push) Has been skipped
CI / dep-audit (push) Has been skipped
CI / sbom-scan (push) Has been skipped
CI / validate-canonical-controls (push) Successful in 17s
CI / loc-budget (push) Failing after 20s
CI / go-lint (push) Has been skipped
CI / python-lint (push) Has been skipped
CI / nodejs-lint (push) Has been skipped
CI / test-go (push) Has been skipped
CI / test-python-backend (push) Successful in 47s
CI / test-python-document-crawler (push) Has been skipped
CI / test-python-dsms-gateway (push) Has been skipped
Background: via the URL crawler, VW yields only 6 vendors instead of the
100+ listed in the actual cookie table. If the user can copy the table
directly from the site (which is possible on most OEM sites), we bypass
the crawler entirely and parse the text deterministically.
Backend:
* doc_type_classifier.py: 7 pattern groups (§5 TMG, Art. 13 GDPR,
T&C clauses, withdrawal period, cookie-table headers, etc.). If the
user pastes text into the wrong doc-type field (e.g. Impressum text in
the DSE, i.e. privacy policy, field), detect_mismatch returns
detected + action ('reclassify' at very high confidence, 'warn' at
medium confidence).
* cookies_table_parser.py: auto-detection of tab/pipe/comma/semicolon
separators, column mapping by header keyword. Aggregates cookie
entries into vendor records (with a _guess_vendor fallback). Fully
deterministic, no LLM.
* doc_input_warnings.py: mail block placed above the audit that makes
mismatches and auto-reclassifications transparent to the user.
* Pipeline: text wins over url (already noted in the schema); new
fields declared_doc_type / input_source / reclassify_hint in doc_entries.
Vendors from pasted tables take precedence over the library fallback
and the LLM cascade (they are 100% accurate).
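The separator detection, header-keyword column mapping and vendor
aggregation described for cookies_table_parser.py could look roughly like
the sketch below. The module itself is not part of this diff, so only the
parse_cookie_table name and the "name" / "cookies" keys of the returned
records (both visible at the call site) come from the commit. The keyword
lists, the helper names and the "Unknown" placeholder that stands in for
the _guess_vendor fallback are assumptions made for illustration.

```python
import re
from collections import defaultdict

# Hypothetical keyword map: the real header keywords are not shown in the commit.
# Order matters: more specific fields are checked before the generic "name".
HEADER_KEYWORDS = {
    "vendor": ("anbieter", "vendor", "provider"),
    "purpose": ("zweck", "purpose"),
    "duration": ("dauer", "laufzeit", "duration", "expiry"),
    "name": ("name", "cookie"),
}
SEPARATORS = ("\t", "|", ";", ",")


def detect_separator(header_line):
    """Return the candidate separator that occurs most often in the header."""
    best = max(SEPARATORS, key=header_line.count)
    return best if header_line.count(best) > 0 else None


def split_row(line, sep):
    """Split one row and drop the outer pipes of Markdown-style tables."""
    return [cell.strip() for cell in line.strip().strip(sep).split(sep)]


def map_columns(header_cells):
    """Map logical fields to column indexes via header keywords."""
    mapping = {}
    for idx, cell in enumerate(header_cells):
        low = cell.lower()
        for field, keywords in HEADER_KEYWORDS.items():
            if field not in mapping and any(k in low for k in keywords):
                mapping[field] = idx
                break
    return mapping


def cell_for(cells, cols, field):
    """Return the cell for a field, or '' if the column is missing or short."""
    idx = cols.get(field)
    return cells[idx] if idx is not None and idx < len(cells) else ""


def parse_cookie_table(text):
    """Parse a pasted cookie table into vendor records, without any LLM."""
    lines = [ln for ln in text.splitlines() if ln.strip()]
    if len(lines) < 2:
        return []
    sep = detect_separator(lines[0])
    if sep is None:
        return []
    cols = map_columns(split_row(lines[0], sep))
    if "name" not in cols:
        return []
    by_vendor = defaultdict(list)
    for line in lines[1:]:
        cells = split_row(line, sep)
        # Skip Markdown divider rows such as |---|---|.
        if all(re.fullmatch(r"[-:\s]*", c) for c in cells):
            continue
        name = cell_for(cells, cols, "name")
        if not name:
            continue
        # Assumption: "Unknown" stands in for the real _guess_vendor fallback.
        vendor = cell_for(cells, cols, "vendor") or "Unknown"
        by_vendor[vendor].append({
            "name": name,
            "purpose": cell_for(cells, cols, "purpose"),
            "duration": cell_for(cells, cols, "duration"),
        })
    return [{"name": v, "cookies": c} for v, c in by_vendor.items()]
```

Because the result is a list of dicts with "name" and "cookies", it fits
the merge step in the diff, which deduplicates on the lower-cased vendor
name and counts cookies via v.get("cookies", []).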
Frontend (DocCheckTab):
* Per-row mode toggle 'URL' / 'Text einfuegen' (purple when active).
* Textarea (h-32, monospace) in text mode with a context-specific
placeholder (cookie hint vs. other doc types) and a live
character/word counter.
* Submit button accepts entries with a URL OR text.
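The "URL or text" rule of the submit button and the backend's "text wins
over url" precedence can be summarised in a few lines. This is a sketch,
not code from the commit: resolve_input_source is a hypothetical helper,
and returning None for a row with neither field filled is an assumption
(the backend in the diff simply defaults input_source to "url").

```python
from typing import Optional


def resolve_input_source(text: Optional[str], url: Optional[str]) -> Optional[str]:
    """Classify a row the way the diff labels input_source.

    Pasted text always wins; a URL supplied alongside it is only kept
    as a source reference. Returns None for a row with neither field
    (assumption: such a row would not be submitted at all).
    """
    has_text = bool((text or "").strip())
    has_url = bool((url or "").strip())
    if has_text:
        return "text+url" if has_url else "text"
    return "url" if has_url else None
```

The three non-empty labels are the same strings the backend writes into
doc_entries, which is why later steps can test for pasted input with
input_source.startswith("text").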
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@@ -323,12 +323,25 @@ async def _run_compliance_check(check_id: str, req: ComplianceCheckRequest):
     url_text_cache: dict[str, str] = {}
 
     n_docs = max(1, len(req.documents))
+    # Vendors from user-pasted tables (no LLM needed); merged into
+    # cmp_vendors further below.
+    pasted_table_vendors: list[dict] = []
     for i, doc in enumerate(req.documents):
         pct = int(1 + (i / n_docs) * 29)
         _update(check_id, f"Texte laden {i+1}/{n_docs}: {doc.doc_type}...", pct)
-        text = doc.text
+        text = (doc.text or "").strip()
+        input_source = "url"
         cmp_payloads: list[dict] = []
-        if not text and doc.url:
+        if text:
+            input_source = "text"
+            if doc.url:
+                input_source = "text+url"  # user filled in both
+                logger.info(
+                    "doc_type=%s: User hat URL UND Text geliefert — "
+                    "Text gewinnt, URL wird als Quellen-Referenz behalten",
+                    doc.doc_type,
+                )
+        elif doc.url:
             url_key = doc.url.strip().rstrip("/").lower()
             if url_key in url_text_cache:
                 text = url_text_cache[url_key]
@@ -336,16 +349,62 @@ async def _run_compliance_check(check_id: str, req: ComplianceCheckRequest):
                 text, cmp_payloads = await _fetch_text(doc.url, doc_type=doc.doc_type)
                 if text:
                     url_text_cache[url_key] = text
+
+        # Auto-reclassify check: if the user pasted text into the wrong
+        # doc-type field (e.g. Impressum text in DSE), detect it and
+        # retag the entry if warranted.
+        actual_doc_type = doc.doc_type
+        reclassify_hint: dict | None = None
+        if input_source.startswith("text") and len(text) >= 500:
+            try:
+                from compliance.services.doc_type_classifier import (
+                    detect_mismatch,
+                )
+                reclassify_hint = detect_mismatch(doc.doc_type, text)
+                if reclassify_hint and reclassify_hint["action"] == "reclassify":
+                    actual_doc_type = reclassify_hint["detected"]
+                    logger.info(
+                        "doc_type AUTO-RECLASSIFY: deklariert=%s "
+                        "erkannt=%s (score %d vs %d) — uebernehme erkannten Typ",
+                        doc.doc_type, actual_doc_type,
+                        reclassify_hint["detected_score"],
+                        reclassify_hint["declared_score"],
+                    )
+            except Exception as e:
+                logger.warning("doc_type_classifier failed: %s", e)
+
+        # Cookie table: if the user pasted a table, parse it
+        # deterministically (no LLM needed) and derive the vendors right away.
+        if input_source.startswith("text") and actual_doc_type == "cookie":
+            try:
+                from compliance.services.cookies_table_parser import (
+                    parse_cookie_table,
+                )
+                tab_vendors = parse_cookie_table(text)
+                if tab_vendors:
+                    pasted_table_vendors.extend(tab_vendors)
+                    logger.info(
+                        "Cookie-Tabelle erkannt im pasted Text — "
+                        "%d Vendors / %d Cookies deterministisch geparst",
+                        len(tab_vendors),
+                        sum(len(v.get("cookies", [])) for v in tab_vendors),
+                    )
+            except Exception as e:
+                logger.warning("cookies_table_parser failed: %s", e)
+
         if text:
-            doc_texts[doc.doc_type] = text
+            doc_texts[actual_doc_type] = text
         doc_entries.append({
-            "doc_type": doc.doc_type,
-            "url": doc.url,
-            "text": text,
-            "word_count": len(text.split()) if text else 0,
-            "auto_discovered": False,
+            "doc_type": actual_doc_type,
+            "declared_doc_type": doc.doc_type,
+            "url": doc.url,
+            "text": text,
+            "word_count": len(text.split()) if text else 0,
+            "auto_discovered": False,
             "discovery_attempted": False,
-            "cmp_payloads": cmp_payloads,
+            "cmp_payloads": cmp_payloads,
+            "input_source": input_source,
+            "reclassify_hint": reclassify_hint,
         })
 
     # Step 1a-bis: AUTO-DISCOVERY. For each canonical doc_type the user
@@ -767,6 +826,25 @@ async def _run_compliance_check(check_id: str, req: ComplianceCheckRequest):
         logger.info("P57: added %d new vendors from Phase G (total: %d)",
                     added, len(cmp_vendors))
 
+    # User-pasted cookie table (deterministic, no LLM):
+    # it ALWAYS takes precedence because it is 100% accurate.
+    if pasted_table_vendors:
+        existing = {(v.get("name") or "").strip().lower()
+                    for v in cmp_vendors}
+        added_p = 0
+        for v in pasted_table_vendors:
+            nm = (v.get("name") or "").strip()
+            if not nm or nm.lower() in existing:
+                continue
+            cmp_vendors.append(v)
+            existing.add(nm.lower())
+            added_p += 1
+        if added_p:
+            logger.info(
+                "Pasted-Tabellen-Merge: +%d Vendors (total: %d)",
+                added_p, len(cmp_vendors),
+            )
+
     # Cookie library fallback (P52 Lite): if there are still few
     # vendors but many after_accept cookies, resolve them from the library.
     # VW lesson: 6 coarse LLM vendors are NOT enough; the library
@@ -1250,6 +1328,19 @@ async def _run_compliance_check(check_id: str, req: ComplianceCheckRequest):
     except Exception as e:
         logger.warning("P82 GF-1-pager skipped: %s", e)
 
+    # Doc-input warnings: shown when the user pasted text into the wrong field
+    input_warn_html = ""
+    try:
+        from compliance.services.doc_input_warnings import (
+            collect_warnings, build_warnings_block_html,
+        )
+        warns = collect_warnings(doc_entries)
+        if warns:
+            input_warn_html = build_warnings_block_html(warns)
+            logger.info("doc-input-warnings: %d Mismatches gefunden", len(warns))
+    except Exception as e:
+        logger.warning("doc-input-warnings skipped: %s", e)
+
     # P86: industry benchmark (only if scan_context.industry is set)
     bench_html = ""
     try:
@@ -1293,7 +1384,7 @@ async def _run_compliance_check(check_id: str, req: ComplianceCheckRequest):
         logger.warning("P84 diff-mode skipped: %s", e)
 
     full_html = (
-        gf_one_pager_html + bench_html + diff_html
+        gf_one_pager_html + input_warn_html + bench_html + diff_html
         + critical_html + scope_disclaimer_html + exec_summary_html
         + cookie_arch_html + summary_html + scanned_html + profile_html
         + scorecard_html + redundancy_html
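The diff calls detect_mismatch but does not include doc_type_classifier.py.
A minimal sketch of a pattern-score classifier that returns a hint of the
shape the pipeline reads ("action", "detected", "detected_score",
"declared_score") might look as follows. The pattern lists, the
"impressum" and "dse" doc-type keys and both margins are assumptions made
for illustration; the commit only states that there are seven pattern
groups and that 'reclassify' requires very high confidence.

```python
import re
from typing import Optional

# Hypothetical pattern groups: the commit mentions seven, but does not list them.
PATTERNS = {
    "impressum": [r"§\s*5\s*TMG", r"Handelsregister", r"Vertretungsberechtigt"],
    "dse": [r"Art\.?\s*13\s*DSGVO", r"Verantwortlicher", r"Betroffenenrechte"],
    "cookie": [r"Cookie-?Name", r"Laufzeit"],
}


def score_text(text: str) -> dict:
    """Count pattern hits per doc type (case-insensitive)."""
    return {
        doc_type: sum(len(re.findall(p, text, re.IGNORECASE)) for p in pats)
        for doc_type, pats in PATTERNS.items()
    }


def detect_mismatch(declared: str, text: str,
                    reclassify_margin: int = 3,
                    warn_margin: int = 1) -> Optional[dict]:
    """Return a hint dict if another doc type scores higher than the declared one.

    The margins are placeholders for the 'very high' and 'medium'
    confidence levels named in the commit message.
    """
    scores = score_text(text)
    detected = max(scores, key=scores.get)
    detected_score = scores[detected]
    declared_score = scores.get(declared, 0)
    if detected == declared or detected_score == 0:
        return None
    margin = detected_score - declared_score
    if margin >= reclassify_margin:
        action = "reclassify"
    elif margin >= warn_margin:
        action = "warn"
    else:
        return None
    return {
        "detected": detected,
        "action": action,
        "detected_score": detected_score,
        "declared_score": declared_score,
    }
```

In the pipeline above, the classifier only runs for pasted input of at
least 500 characters, and only a 'reclassify' hint changes
actual_doc_type; every hint is still stored in doc_entries so that the
warnings block can report it.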