Files
breakpilot-core/embedding-service
Benjamin Admin 322e2d9cb3 feat(embedding): implement legal-aware chunking pipeline
Replace plain recursive chunker with legal-aware chunking that:
- Detects legal section headers (§, Art., Section, Chapter, Annex)
- Adds section context prefix to every chunk
- Splits on paragraph boundaries then sentence boundaries
- Protects DE + EN abbreviations (80+ patterns) from false splits
- Supports language detection for locale-specific processing
- Force-splits overlong sentences at word boundaries

The old plain_recursive API option is removed — all non-semantic
strategies now route through chunk_text_legal().

Includes 40 tests covering header detection, abbreviation protection,
sentence splitting, and legal chunking behavior.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-22 09:18:23 +01:00
..