fix(embedding): short legal docs keep their section prefix (chunk_text_legal)
chunk_text_legal had an early return for text no longer than chunk_size that skipped the [§ X] prefix. As a result, chunk_text_legal_structured could not extract section/article, so article="" and: (a) article_label fell back to "BDSG" (no §), (b) the deterministic point ID collided (every article="" yields the same ID), so roughly half of the short §§ overwrote one another.

Fix: the early return now carries the detected section header as a prefix.

Verified on the BDSG § ingest: 44 -> 86 distinct §§, and § 38 is now labeled cleanly as "BDSG § 38". Affects FUTURE ingests only (no re-chunk).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
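The collision described in (b) can be illustrated with a minimal sketch. The ID scheme below (a UUIDv5 over regulation, article, and chunk index), the namespace value, and the `point_id` helper are assumptions for illustration only; the real pipeline may derive its point IDs differently.

```python
import uuid

# Hypothetical namespace and ID scheme; not taken from the actual codebase.
NAMESPACE = uuid.UUID("12345678-1234-5678-1234-567812345678")


def point_id(regulation: str, article: str, chunk_index: int) -> str:
    """Derive a deterministic ID from metadata only (assumed scheme)."""
    return str(uuid.uuid5(NAMESPACE, f"{regulation}|{article}|{chunk_index}"))


sections = ("§ 37", "§ 38", "§ 39")

# Before the fix: short sections lose their prefix, so article == "" for all.
ids_before = {point_id("BDSG", "", 0) for _ in sections}

# After the fix: each short section keeps its own article value.
ids_after = {point_id("BDSG", article, 0) for article in sections}

print(len(ids_before), len(ids_after))  # prints "1 3"
```

With an empty article, every short section maps to the same ID, so each upsert replaces the previous one. That matches the overwriting reported in the commit message.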
@@ -483,7 +483,13 @@ def chunk_text_legal(text: str, chunk_size: int, overlap: int) -> List[str]:
     Works for both German (DSGVO, BGB, AI Act DE) and English (NIST, SLSA, CRA EN) texts.
     """
     if not text or len(text) <= chunk_size:
-        return [text.strip()] if text and text.strip() else []
+        body = (text or "").strip()
+        if not body:
+            return []
+        # Give short documents (a single §/article) a section prefix as well, so that
+        # chunk_text_legal_structured can extract section/article (otherwise article="").
+        hdr = _extract_section_header(body.split("\n", 1)[0])
+        return [f"[{hdr[:120]}] {body}"] if hdr else [body]
 
     # --- Phase 1: Split into sections by legal headers ---
     lines = text.split('\n')
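The new early-return path can be tried in isolation with the sketch below. `_extract_section_header` is not shown in this diff, so `extract_section_header` and its regex are hypothetical stand-ins that recognise only a few header forms. `chunk_short_legal` is likewise an illustrative name that covers only the short-document branch.

```python
import re
from typing import List, Optional

# Hypothetical stand-in for _extract_section_header; the real helper may
# recognise more header formats and return a different string.
_HEADER_RE = re.compile(r"\s*(?:§|Art\.?|Artikel|Article|Section)\s*\d+", re.IGNORECASE)


def extract_section_header(line: str) -> Optional[str]:
    """Return the stripped line if it starts like a legal header, else None."""
    return line.strip() if _HEADER_RE.match(line) else None


def chunk_short_legal(text: Optional[str]) -> List[str]:
    """Short-document branch only: one chunk, prefixed with its header if found."""
    body = (text or "").strip()
    if not body:
        return []
    hdr = extract_section_header(body.split("\n", 1)[0])
    # Cap the prefix at 120 characters, as in the diff above.
    return [f"[{hdr[:120]}] {body}"] if hdr else [body]


chunks = chunk_short_legal("§ 38 Datenschutzbeauftragte\n(1) Ergänzend zu Artikel 37 ...")
print(chunks[0].split("\n", 1)[0])
```

Text without a recognisable header comes back unprefixed, and empty or whitespace-only input yields an empty list, as in the patched code.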