fix(embedding): add NIST control IDs to _SECTION_NUMBER_RE

_SECTION_NUMBER_RE only had patterns for §/Art/Section/Kapitel/Annex but missed NIST-style identifiers (AC-1, GV.OC-01, 3.1, A01:2021). This caused a 0% section rate for all NIST/BSI/ENISA documents even though sections were correctly detected: the section number was not extracted from the header.

Also adds:
- reupload_legal_strategy.py: re-upload with legal chunking
- extract_and_upload_nist.py: local PDF extraction workaround
- qdrant-snapshot.sh: backup mechanism for Qdrant collections

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

@@ -389,6 +389,11 @@ _SECTION_NUMBER_RE = re.compile(
     r'|Kapitel\s+(\d+)' # Kapitel 2
     r'|Anhang\s+([IVXLC\d]+)' # Anhang III
     r'|Annex\s+([IVXLC\d]+)' # Annex XII
+    # NIST/ENISA/standard identifiers
+    r'|([A-Z]{2}\.[A-Z]{2}-\d{2})' # GV.OC-01 (NIST CSF 2.0)
+    r'|([A-Z]{2,4}-\d+(?:\(\d+\))?)' # AC-1, AC-1(1) (NIST controls)
+    r'|(\d+\.\d+(?:\.\d+)*)' # 3.1, 2.3.1 (numbered sections)
+    r'|(A\d{2}(?::\d{4})?)' # A01:2021 (OWASP)
     r')',
     re.IGNORECASE
 )
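
A minimal sketch of how the four added alternatives behave, assuming they are compiled on their own. NEW_ID_RE and extract_section_number are hypothetical names used only for illustration. The real _SECTION_NUMBER_RE also has a prefix and earlier alternatives (§, Art, Section, Kapitel, ...) that this hunk does not show, so its behavior on a full header may differ.

```python
import re
from typing import Optional

# Hypothetical stand-in: ONLY the four alternatives added in this commit.
# The prefix and earlier alternatives of the real _SECTION_NUMBER_RE
# are not visible in the hunk and are deliberately left out here.
NEW_ID_RE = re.compile(
    r'(?:'
    r'([A-Z]{2}\.[A-Z]{2}-\d{2})'        # GV.OC-01 (NIST CSF 2.0)
    r'|([A-Z]{2,4}-\d+(?:\(\d+\))?)'     # AC-1, AC-1(1) (NIST controls)
    r'|(\d+\.\d+(?:\.\d+)*)'             # 3.1, 2.3.1 (numbered sections)
    r'|(A\d{2}(?::\d{4})?)'              # A01:2021 (OWASP)
    r')',
    re.IGNORECASE,
)


def extract_section_number(header: str) -> Optional[str]:
    """Return the first identifier found in a header line, or None."""
    match = NEW_ID_RE.search(header)
    if match is None:
        return None
    # Exactly one alternative matched, so exactly one group is not None.
    return next(group for group in match.groups() if group is not None)


if __name__ == "__main__":
    for header in (
        "GV.OC-01 Organizational Context",
        "AC-1(1) Policy and Procedures",
        "2.3.1 Risk Assessment",
        "A01:2021 Broken Access Control",
        "Introduction",
    ):
        print(repr(header), "->", extract_section_number(header))
```

Because re.IGNORECASE is set, the letter classes also match lowercase text. In this isolated sketch, extract_section_number("pre-2020 guidance") returns "pre-2020". Whether the full pattern guards against such matches depends on the part of the regex outside this hunk.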