feat(rag): regulation_short casing normalization at the ingest boundary

The re-ingest partly derives regulation_short from file names via title() casing
('dsgvo'->'Dsgvo', 'osha otm'->'Osha Otm'), which produces wrongly cased acronyms
in the payload AND in the article_label ('Art. 37 Dsgvo').
NEW: normalize_regulation_short() in legal_metadata, token-based with a curated
acronym set: only listed acronyms are uppercased, and legitimate mixed case
(GeschGehG, MuSchG, GoBD, MiCA, eIDAS, EuGH) stays untouched.
Applied at the ingest boundary in documents.py (covers the payload field and display_name).
13 new tests pass. Existing data needs a separate one-time Qdrant patch.
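
For illustration only, the token-based approach described above could look roughly like the sketch below. The acronym set, the helper name normalize_regulation_short_sketch, and the whitespace tokenization are assumptions made for this example. The actual normalize_regulation_short() in legal_metadata is not part of this view and may differ.

```python
import re

# Hypothetical curated set for this sketch; the real list lives in
# legal_metadata and is not shown in this commit view.
KNOWN_ACRONYMS = {"DSGVO", "OSHA", "OTM"}


def normalize_regulation_short_sketch(value: str) -> str:
    """Uppercase only tokens that match a curated acronym; keep the rest as-is."""
    # Split on whitespace but keep the separators so spacing is preserved.
    parts = re.split(r"(\s+)", value)
    return "".join(
        part.upper() if part.upper() in KNOWN_ACRONYMS else part
        for part in parts
    )


# The problem: str.title() lowercases everything after the first letter.
broken = "dsgvo".title()  # 'Dsgvo'

# The fix: listed acronyms are restored, unlisted mixed case is not touched.
fixed = normalize_regulation_short_sketch(broken)
untouched = normalize_regulation_short_sketch("GeschGehG")
```

Because only exact matches against the curated set are rewritten, a name such as eIDAS or MiCA passes through unchanged unless someone deliberately adds it to the set.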

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Benjamin Admin
2026-06-21 18:31:45 +02:00
parent de542633e2
commit 0c5f1fd7a4
3 changed files with 71 additions and 1 deletions
+13 -1
@@ -10,7 +10,12 @@ from embedding_client import embedding_client
from html_utils import decode_html_bytes, looks_like_html, strip_html
from minio_client_wrapper import minio_wrapper
from qdrant_client_wrapper import qdrant_wrapper
-from legal_metadata import build_legal_fields, compute_chunk_hash, deterministic_point_id
+from legal_metadata import (
+    build_legal_fields,
+    compute_chunk_hash,
+    deterministic_point_id,
+    normalize_regulation_short,
+)
logger = logging.getLogger("rag-service.api.documents")
@@ -155,6 +160,13 @@ async def upload_document(
except json.JSONDecodeError:
logger.warning("Invalid metadata_json, ignoring")
+# Casing normalization at the ingest boundary: fix title-cased acronyms
+# ('Dsgvo'->'DSGVO') so that the payload field AND article_label are clean.
+if extra_metadata.get("regulation_short"):
+    extra_metadata["regulation_short"] = normalize_regulation_short(
+        extra_metadata["regulation_short"]
+    )
# --- Build payloads (rag_reingest_spec.md §2/§3: citable legal metadata) ---
reg_code = (
extra_metadata.get("regulation_code")