feat(pipeline): Block D5+-E complete session — 20k+ new chunks

Session 02-03.05.2026 accomplishments:
- D5+: NIST/ENISA PDF quality fix (0%→45% section rate)
- D5+: 4 lost NIST PDFs restored (11k chunks)
- D5+: Text normalization + section detection for NIST/BSI
- D6: Citation backfill (3,651 controls updated, old archived)
- E2: 8 DE laws ingested (ArbZG, MuSchG, GmbHG, AktG, InsO...)
- E3: 5 EU regulations (CSRD, CSDDD, Taxonomy, eIDAS, Pay Trans.)
- E4: Standards (GoBD, BAIT, VAIT)
- E6: 3 CH + 4 AT laws (OR, DSV, ArG, ArbVG, AngG, AZG, NISG)
- E7: 9 court judgments as full text (Schrems II 154 chunks, Meta 101,
  BVerfG 161, DSK OH 119, Planet49 42, SCHUFA 41, Schadenersatz 29,
  BAG 48, Google Fonts 14)
- Infra: Qdrant snapshot mechanism, upload-before-delete safety

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@@ -381,7 +381,13 @@ def main():
             continue
 
         # 2. Get text
-        text = get_text(doc)
+        try:
+            text = get_text(doc)
+        except Exception as e:
+            print(f" ERROR extracting text: {e}")
+            results.append({"file": doc["upload_filename"], "old": old_count,
+                            "new": 0, "sect": 0})
+            continue
 
         # 3. Upload with legal strategy
         print(" Uploading with strategy='legal'...")
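
The hunk above guards text extraction so that one unreadable document no longer aborts the whole run: the failure is logged, a zeroed result row is recorded, and the loop moves on. A minimal standalone sketch of that pattern follows. The `extract_all` wrapper, the `fake_get_text` stub, and the `old_count` key on each document are hypothetical and exist only to make the example runnable; the actual script's surrounding loop is not shown in this commit.

```python
from typing import Callable


def extract_all(
    docs: list[dict], get_text: Callable[[dict], str]
) -> tuple[dict[str, str], list[dict]]:
    """Extract text per document; on failure, record a zeroed row and skip."""
    texts: dict[str, str] = {}
    results: list[dict] = []
    for doc in docs:
        # Hypothetical: the real script computes old_count elsewhere.
        old_count = doc.get("old_count", 0)
        try:
            text = get_text(doc)
        except Exception as e:
            print(f" ERROR extracting text: {e}")
            results.append({"file": doc["upload_filename"], "old": old_count,
                            "new": 0, "sect": 0})
            continue
        texts[doc["upload_filename"]] = text
    return texts, results


def fake_get_text(doc: dict) -> str:
    """Stand-in extractor that fails for documents flagged as broken."""
    if doc.get("broken"):
        raise ValueError("unreadable PDF")
    return "text of " + doc["upload_filename"]


docs = [
    {"upload_filename": "a.pdf", "old_count": 10},
    {"upload_filename": "b.pdf", "old_count": 5, "broken": True},
]
texts, results = extract_all(docs, fake_get_text)
```

Recording the failed file with "new": 0 keeps it visible in the final summary instead of silently dropping it, which presumably matters here because the old chunks for that file should then be left in place.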