feat(pipeline): Block D5+-E complete session — 20k+ new chunks
Session 02-03.05.2026 accomplishments:
- D5+: NIST/ENISA PDF quality fix (0%→45% section rate)
- D5+: 4 lost NIST PDFs restored (11k chunks)
- D5+: Text normalization + section detection for NIST/BSI
- D6: Citation backfill (3,651 controls updated, old archived)
- E2: 8 DE laws ingested (ArbZG, MuSchG, GmbHG, AktG, InsO...)
- E3: 5 EU regulations (CSRD, CSDDD, Taxonomy, eIDAS, Pay Trans.)
- E4: Standards (GoBD, BAIT, VAIT)
- E6: 3 CH + 4 AT laws (OR, DSV, ArG, ArbVG, AngG, AZG, NISG)
- E7: 9 court judgments as full text (Schrems II 154 chunks, Meta 101, BVerfG 161, DSK OH 119, Planet49 42, SCHUFA 41, Schadenersatz 29, BAG 48, Google Fonts 14)
- Infra: Qdrant snapshot mechanism, upload-before-delete safety

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@@ -381,7 +381,13 @@ def main():
            continue

        # 2. Get text
        try:
            text = get_text(doc)
        except Exception as e:
            print(f" ERROR extracting text: {e}")
            results.append({"file": doc["upload_filename"], "old": old_count,
                            "new": 0, "sect": 0})
            continue

        # 3. Upload with legal strategy
        print(" Uploading with strategy='legal'...")
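
The hunk above appears to wrap text extraction so that one unreadable document is logged, recorded with zero counts, and skipped instead of aborting the whole run. A minimal, self-contained sketch of that skip-and-record pattern follows. Everything outside the `try`/`except` body is an assumption made for illustration: `rechunk_all`, `fake_get_text`, the `old_count` key, and the paragraph count standing in for the real upload step are hypothetical and not taken from the repository.

```python
from typing import Callable


def rechunk_all(docs: list[dict], get_text: Callable[[dict], str]) -> list[dict]:
    """Process every document; record a zero-count row for any that fail extraction."""
    results = []
    for doc in docs:
        old_count = doc.get("old_count", 0)  # hypothetical source of the old chunk count
        try:
            text = get_text(doc)
        except Exception as e:
            # Same shape as the diff: log, record a zero row, move on to the next document.
            print(f" ERROR extracting text: {e}")
            results.append({"file": doc["upload_filename"], "old": old_count,
                            "new": 0, "sect": 0})
            continue

        # Stand-in for the real upload step (assumption): count non-empty paragraphs.
        new_count = len([p for p in text.split("\n\n") if p.strip()])
        results.append({"file": doc["upload_filename"], "old": old_count,
                        "new": new_count, "sect": 0})
    return results


def fake_get_text(doc: dict) -> str:
    """Hypothetical extractor that fails for files ending in '.bad'."""
    if doc["upload_filename"].endswith(".bad"):
        raise ValueError("unreadable PDF")
    return "para one\n\npara two"


docs = [
    {"upload_filename": "a.pdf", "old_count": 3},
    {"upload_filename": "b.bad", "old_count": 5},
    {"upload_filename": "c.pdf", "old_count": 1},
]
results = rechunk_all(docs, fake_get_text)
```

Recording a row with `"new": 0` for the failed file keeps it visible in the final summary, so a failed extraction is not mistaken for a document that was never attempted.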