fix: Section classifier strips leading numbers + recognizes German headings
Build + Deploy / build-developer-portal (push) Successful in 1m21s
Build + Deploy / build-tts (push) Successful in 1m31s
Build + Deploy / build-document-crawler (push) Successful in 37s
Build + Deploy / build-dsms-gateway (push) Successful in 26s
Build + Deploy / build-dsms-node (push) Successful in 11s
Build + Deploy / build-admin-compliance (push) Successful in 2m21s
Build + Deploy / build-backend-compliance (push) Successful in 3m47s
Build + Deploy / build-ai-sdk (push) Successful in 55s
CI / test-python-dsms-gateway (push) Successful in 29s
CI / validate-canonical-controls (push) Successful in 17s
CI / branch-name (push) Has been skipped
CI / guardrail-integrity (push) Has been skipped
CI / loc-budget (push) Failing after 21s
CI / secret-scan (push) Has been skipped
CI / go-lint (push) Has been skipped
CI / python-lint (push) Has been skipped
CI / nodejs-lint (push) Has been skipped
CI / nodejs-build (push) Successful in 3m21s
CI / dep-audit (push) Has been skipped
CI / sbom-scan (push) Has been skipped
CI / test-go (push) Failing after 57s
CI / test-python-backend (push) Successful in 42s
CI / test-python-document-crawler (push) Successful in 29s
Build + Deploy / trigger-orca (push) Successful in 3m3s
Build + Deploy / build-developer-portal (push) Successful in 1m21s
Build + Deploy / build-tts (push) Successful in 1m31s
Build + Deploy / build-document-crawler (push) Successful in 37s
Build + Deploy / build-dsms-gateway (push) Successful in 26s
Build + Deploy / build-dsms-node (push) Successful in 11s
Build + Deploy / build-admin-compliance (push) Successful in 2m21s
Build + Deploy / build-backend-compliance (push) Successful in 3m47s
Build + Deploy / build-ai-sdk (push) Successful in 55s
CI / test-python-dsms-gateway (push) Successful in 29s
CI / validate-canonical-controls (push) Successful in 17s
CI / branch-name (push) Has been skipped
CI / guardrail-integrity (push) Has been skipped
CI / loc-budget (push) Failing after 21s
CI / secret-scan (push) Has been skipped
CI / go-lint (push) Has been skipped
CI / python-lint (push) Has been skipped
CI / nodejs-lint (push) Has been skipped
CI / nodejs-build (push) Successful in 3m21s
CI / dep-audit (push) Has been skipped
CI / sbom-scan (push) Has been skipped
CI / test-go (push) Failing after 57s
CI / test-python-backend (push) Successful in 42s
CI / test-python-document-crawler (push) Successful in 29s
Build + Deploy / trigger-orca (push) Successful in 3m3s
- "5. Soziale Medien" now stripped to "soziale medien" before classification - Added "soziale medien/netzwerke" as social_media heading pattern - Fixes etogruppe.com where Social Media section wasn't detected Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
This commit is contained in:
@@ -327,7 +327,7 @@ SECTION_TYPE_MAP = [
|
||||
(r"^(?:agb|allgemeine geschäftsbedingungen|nutzungsbedingungen)$", "agb"),
|
||||
# DSFA MUST be checked BEFORE social_media (both can contain "Social Media")
|
||||
(r"datenschutzfolge|dsfa|risikoanalyse", "dsfa"),
|
||||
(r"^social\s*media$", "social_media"), # Standalone heading "Social Media" = DSE
|
||||
(r"^social\s*media$|^soziale\s+(?:medien|netzwerke)$", "social_media"),
|
||||
(r"datenschutzerkl(?:ae|ä)rung.*social|datenschutz\s+f(?:ue|ü)r\s+social", "social_media"),
|
||||
]
|
||||
|
||||
@@ -415,6 +415,8 @@ def _classify_section(heading: str) -> str | None:
|
||||
"""Classify a section heading into a document type."""
|
||||
import re as _re
|
||||
heading_lower = heading.lower().strip()
|
||||
# Strip leading numbers/bullets: "5. Soziale Medien" → "soziale medien"
|
||||
heading_lower = _re.sub(r"^[\d\.\)\-]+\s*", "", heading_lower).strip()
|
||||
# Skip known sub-sections
|
||||
if heading_lower in SKIP_HEADINGS:
|
||||
return None
|
||||
|
||||
Reference in New Issue
Block a user