feat(embedding): add NIST/ENISA/standard section numbering to chunker
Extends _LEGAL_SECTION_RE to detect:
- Numbered sections: 1.1 Title, 2.3.1 Subtitle
- Control family IDs: AC-1, AU-2, PO.1, PW.1.1
- Table/Figure/Appendix references

Also adds EUR-Lex HTML replacement script.
58 embedding-service tests passing.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@@ -281,7 +281,7 @@ ENGLISH_ABBREVIATIONS = {
 # Combined abbreviations for both languages
 ALL_ABBREVIATIONS = GERMAN_ABBREVIATIONS | ENGLISH_ABBREVIATIONS
 
-# Regex pattern for legal section headers (§, Art., Article, Section, etc.)
+# Regex pattern for legal/standard section headers
 _LEGAL_SECTION_RE = re.compile(
     r'^(?:'
     r'§\s*\d+' # § 25, § 5a
@@ -296,6 +296,12 @@ _LEGAL_SECTION_RE = re.compile(
     r'|Part\s+[IVXLC\d]+' # Part III
     r'|Recital\s+\d+' # Recital 42
     r'|Erwaegungsgrund\s+\d+' # Erwaegungsgrund 26
+    # NIST/ENISA/standard numbering
+    r'|\d+\.\d+(?:\.\d+)*\s+[A-ZÄÖÜ]' # 1.1 Title, 2.3.1 Subtitle
+    r'|[A-Z]{2,4}[-\.]\d+(?:\.\d+)*\b' # AC-1, AU-2, PO.1, PW.1.1
+    r'|Table\s+\d+' # Table 1, Table A-1
+    r'|Figure\s+\d+' # Figure 1
+    r'|Appendix\s+[A-Z\d]' # Appendix A, Appendix 1
     r')',
     re.IGNORECASE | re.MULTILINE
 )
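To see what the added alternatives accept, here is a minimal sketch that compiles only the five patterns introduced in this commit, with the same flags. The names `_NEW_ALTERNATIVES_RE` and `is_section_header` are hypothetical and not part of the chunker. The earlier alternatives (§, Part, Recital, and the ones outside the hunk) are left out, so results for the full `_LEGAL_SECTION_RE` may differ.

```python
import re

# Hypothetical reduced pattern: only the alternatives added in this commit.
# The full _LEGAL_SECTION_RE contains further alternatives not shown here.
_NEW_ALTERNATIVES_RE = re.compile(
    r'^(?:'
    r'\d+\.\d+(?:\.\d+)*\s+[A-ZÄÖÜ]'      # 1.1 Title, 2.3.1 Subtitle
    r'|[A-Z]{2,4}[-\.]\d+(?:\.\d+)*\b'    # AC-1, AU-2, PO.1, PW.1.1
    r'|Table\s+\d+'                       # Table 1
    r'|Figure\s+\d+'                      # Figure 1
    r'|Appendix\s+[A-Z\d]'                # Appendix A, Appendix 1
    r')',
    re.IGNORECASE | re.MULTILINE,
)


def is_section_header(line: str) -> bool:
    """Return True if the line starts with one of the new header forms."""
    return _NEW_ALTERNATIVES_RE.match(line) is not None


samples = [
    "1.1 Purpose and Scope",
    "2.3.1 Risk Assessment",
    "AC-1 Policy and Procedures",
    "PW.1.1 Design software",
    "Table 1 Control baselines",
    "Appendix A Glossary",
    "1.1 lowercase title",
    "Table A-1 Acronyms",
    "The control applies to all systems.",
]
for s in samples:
    print(is_section_header(s), s)
```

Two behaviours are worth checking against the intent of the change. Because `re.IGNORECASE` also applies to character classes, `[A-ZÄÖÜ]` and `[A-Z]{2,4}` accept lowercase letters, so a line such as "1.1 lowercase title" counts as a header. In this reduced pattern, `Table\s+\d+` requires a digit directly after the whitespace, so "Table A-1" is not matched even though the inline comment lists it.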