fix(embedding): add NIST control IDs to _SECTION_NUMBER_RE

_SECTION_NUMBER_RE only had patterns for §/Art/Section/Kapitel/Annex
and missed standards-style identifiers: NIST controls (AC-1), NIST CSF
IDs (GV.OC-01), numbered sections (3.1) and OWASP IDs (A01:2021).
This caused a 0% section rate for all NIST/BSI/ENISA documents: the
sections themselves were detected correctly, but the section number
was never extracted from the header.

Also adds:
- reupload_legal_strategy.py: re-upload with legal chunking
- extract_and_upload_nist.py: local PDF extraction workaround
- qdrant-snapshot.sh: backup mechanism for Qdrant collections

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Benjamin Admin
2026-05-03 07:42:06 +02:00
parent 0b0eed27b0
commit 2f4a3f2ea2
5 changed files with 843 additions and 0 deletions
@@ -389,6 +389,11 @@ _SECTION_NUMBER_RE = re.compile(
     r'|Kapitel\s+(\d+)'                # Kapitel 2
     r'|Anhang\s+([IVXLC\d]+)'          # Anhang III
     r'|Annex\s+([IVXLC\d]+)'           # Annex XII
+    # NIST/ENISA/standard identifiers
+    r'|([A-Z]{2}\.[A-Z]{2}-\d{2})'     # GV.OC-01 (NIST CSF 2.0)
+    r'|([A-Z]{2,4}-\d+(?:\(\d+\))?)'   # AC-1, AC-1(1) (NIST controls)
+    r'|(\d+\.\d+(?:\.\d+)*)'           # 3.1, 2.3.1 (numbered sections)
+    r'|(A\d{2}(?::\d{4})?)'            # A01:2021 (OWASP)
     r')',
     re.IGNORECASE
 )
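
To see what the added alternatives extract, here is a minimal standalone sketch. It copies only the alternatives visible in the hunk above. The §/Art/Section patterns sit above the hunk and are omitted, and the opening `(?:` group is an assumption because the start of the real expression is not shown. `_SKETCH_RE` and `extract_section_number` are hypothetical names used for illustration; they are not taken from the repository.

```python
import re

# Sketch only: alternatives copied from the hunk above. The opening
# non-capturing group is assumed; the real regex starts outside the diff.
_SKETCH_RE = re.compile(
    r'(?:'
    r'Kapitel\s+(\d+)'                 # Kapitel 2
    r'|Anhang\s+([IVXLC\d]+)'          # Anhang III
    r'|Annex\s+([IVXLC\d]+)'           # Annex XII
    r'|([A-Z]{2}\.[A-Z]{2}-\d{2})'     # GV.OC-01 (NIST CSF 2.0)
    r'|([A-Z]{2,4}-\d+(?:\(\d+\))?)'   # AC-1, AC-1(1) (NIST controls)
    r'|(\d+\.\d+(?:\.\d+)*)'           # 3.1, 2.3.1 (numbered sections)
    r'|(A\d{2}(?::\d{4})?)'            # A01:2021 (OWASP)
    r')',
    re.IGNORECASE,
)


def extract_section_number(header):
    """Return the first section identifier found in a header, or None."""
    match = _SKETCH_RE.search(header)
    if match is None:
        return None
    # Exactly one alternative matched, so exactly one group is not None.
    return next(group for group in match.groups() if group is not None)


if __name__ == "__main__":
    for header in (
        "AC-1 Policy and Procedures",
        "AC-2(1) Automated System Account Management",
        "GV.OC-01 Organizational mission",
        "3.1 Access Control",
        "A01:2021 Broken Access Control",
        "Introduction",
    ):
        print(repr(header), "->", extract_section_number(header))
```

Because the pattern is compiled with re.IGNORECASE, lowercase input such as "ac-1" also matches the control alternative, which may or may not be desirable for free-text headers.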