feat(embedding): add NIST/ENISA/standard section numbering to chunker

Extends _LEGAL_SECTION_RE to detect:
- Numbered sections: 1.1 Title, 2.3.1 Subtitle
- Control family IDs: AC-1, AU-2, PO.1, PW.1.1
- Table/Figure/Appendix references
Also adds EUR-Lex HTML replacement script.

58 embedding-service tests passing.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Benjamin Admin
2026-05-02 19:24:10 +02:00
parent 5a6e588641
commit 3009f3d13a
2 changed files with 220 additions and 1 deletions
+7 -1
@@ -281,7 +281,7 @@ ENGLISH_ABBREVIATIONS = {
# Combined abbreviations for both languages
ALL_ABBREVIATIONS = GERMAN_ABBREVIATIONS | ENGLISH_ABBREVIATIONS
-# Regex pattern for legal section headers (§, Art., Article, Section, etc.)
+# Regex pattern for legal/standard section headers
_LEGAL_SECTION_RE = re.compile(
r'^(?:'
r'§\s*\d+' # § 25, § 5a
@@ -296,6 +296,12 @@ _LEGAL_SECTION_RE = re.compile(
r'|Part\s+[IVXLC\d]+' # Part III
r'|Recital\s+\d+' # Recital 42
r'|Erwaegungsgrund\s+\d+' # Erwaegungsgrund 26
+# NIST/ENISA/standard numbering
+r'|\d+\.\d+(?:\.\d+)*\s+[A-ZÄÖÜ]' # 1.1 Title, 2.3.1 Subtitle
+r'|[A-Z]{2,4}[-\.]\d+(?:\.\d+)*\b' # AC-1, AU-2, PO.1, PW.1.1
+r'|Table\s+\d+' # Table 1 (not Table A-1: a digit must follow)
+r'|Figure\s+\d+' # Figure 1
+r'|Appendix\s+[A-Z\d]' # Appendix A, Appendix 1
r')',
re.IGNORECASE | re.MULTILINE
)
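
The behaviour of the added alternatives can be checked in isolation. The sketch below is not the chunker's actual pattern: the hunk elides part of `_LEGAL_SECTION_RE`, so it rebuilds only the five new alternatives under a hypothetical name, `_STANDARD_SECTION_RE`, with the same flags. The helper `is_section_header` is also hypothetical, and matching a single line from its start is an assumption, since the diff does not show how the chunker applies the regex.

```python
import re

# Hypothetical reduced pattern: only the alternatives added in this commit.
# The real _LEGAL_SECTION_RE also contains the legal-header alternatives
# (§, Part, Recital, ...) that are partly elided in the diff.
_STANDARD_SECTION_RE = re.compile(
    r'^(?:'
    r'\d+\.\d+(?:\.\d+)*\s+[A-ZÄÖÜ]'      # 1.1 Title, 2.3.1 Subtitle
    r'|[A-Z]{2,4}[-\.]\d+(?:\.\d+)*\b'    # AC-1, AU-2, PO.1, PW.1.1
    r'|Table\s+\d+'                       # Table 1
    r'|Figure\s+\d+'                      # Figure 1
    r'|Appendix\s+[A-Z\d]'                # Appendix A, Appendix 1
    r')',
    re.IGNORECASE | re.MULTILINE,
)


def is_section_header(line: str) -> bool:
    """Return True if the line starts with one of the new header forms."""
    return _STANDARD_SECTION_RE.match(line) is not None


samples = [
    "1.1 Title",
    "2.3.1 Subtitle",
    "AC-1 Policy and Procedures",
    "PW.1.1 Design software",
    "Table 1",
    "Table A-1",          # no digit directly after "Table ", so no match
    "Appendix A",
    "1. Introduction",    # single-level numbering is not covered
    "1.1 million users",  # matches: IGNORECASE lets [A-ZÄÖÜ] accept "m"
]

for line in samples:
    print(f"{line!r}: {is_section_header(line)}")
```

Two things stand out when running this. Because `re.IGNORECASE` applies to the whole pattern, the uppercase character classes also accept lowercase letters, so a body line such as "1.1 million users" is treated as a header. Single-level headings such as "1. Introduction" are not matched, because the numbered alternative requires at least two numeric components.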