breakpilot-core

Author	SHA1	Message	Date
Benjamin Admin	3009f3d13a	feat(embedding): add NIST/ENISA/standard section numbering to chunker. Extends _LEGAL_SECTION_RE to detect numbered sections (1.1 Title, 2.3.1 Subtitle), control family IDs (AC-1, AU-2, PO.1, PW.1.1), and Table/Figure/Appendix references. Also adds EUR-Lex HTML replacement script. 58 embedding-service tests passing. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-05-02 19:24:10 +02:00
Benjamin Admin	75dda9ac92	feat(embedding): add pdfplumber backend for multi-column PDF extraction. EU Official Journal PDFs (AI Act, CRA, NIS2, DSGVO, etc.) use multi-column layouts that pypdf breaks into fragmented words ("Ar tik el" instead of "Artikel"); pdfplumber handles these correctly. Backend priority in auto mode: unstructured > pdfplumber > pypdf. Also increases D5 re-ingestion timeout to 3600s for large PDFs. 58 embedding-service tests passing. pdfplumber: MIT license. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-05-02 15:42:25 +02:00
Benjamin Admin	93099b2770	feat(pipeline): structural metadata end-to-end (Blocks D2-D4). D2: RAG service stores section/section_title/paragraph/paragraph_num/page from embedding service chunks_with_metadata into Qdrant payloads. D3: Control generator prefers section > article > section_title from Qdrant, adds page to source_citation and generation_metadata. D4: Validated with real BGB §§ 312-312k text. Found and fixed a critical bug where Phase 3 overlap destroyed the [§ ...] section prefix, so only the first chunk per document kept its metadata and all subsequent chunks lost section info. Also fixes pre-existing lint issues (unused imports, ambiguous variable names, duplicate dict key, bare except). 456 tests passing (58 embedding + 387 pipeline + 11 rag-service). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-05-01 20:34:00 +02:00
Benjamin Admin	6ab10415d8	feat(embedding): add structural metadata to legal chunking (Block D1). chunk_text_legal_structured() returns metadata per chunk: section ("§ 312k", "Art. 5"); section_title ("Kündigungsbutton"); paragraph ("Abs. 1", "Nr. 3"); paragraph_num (1, 3); page (prepared for PDF integration); index (sequential position). The /chunk endpoint now returns chunks_with_metadata alongside plain chunks. Backward compatible: existing consumers use the chunks field unchanged. New regexes: _PARAGRAPH_RE (Abs/Nr/Satz/lit), _SECTION_NUMBER_RE. New functions: _parse_section_metadata(), _extract_paragraph_ref(). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-05-01 15:25:23 +02:00
Benjamin Admin	322e2d9cb3	feat(embedding): implement legal-aware chunking pipeline. Replaces the plain recursive chunker with legal-aware chunking that: detects legal section headers (§, Art., Section, Chapter, Annex); adds a section context prefix to every chunk; splits on paragraph boundaries, then sentence boundaries; protects DE + EN abbreviations (80+ patterns) from false splits; supports language detection for locale-specific processing; force-splits overlong sentences at word boundaries. The old plain_recursive API option is removed; all non-semantic strategies now route through chunk_text_legal(). Includes 40 tests covering header detection, abbreviation protection, sentence splitting, and legal chunking behavior. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-22 09:18:23 +01:00
Benjamin Boenisch	ad111d5e69	Initial commit: breakpilot-core - Shared Infrastructure. Docker Compose with 24+ services: PostgreSQL (PostGIS), Valkey, MinIO, Qdrant; Vault (PKI/TLS), Nginx (Reverse Proxy); Backend Core API, Consent Service, Billing Service; RAG Service, Embedding Service; Gitea, Woodpecker CI/CD; Night Scheduler, Health Aggregator; Jitsi (Web/XMPP/JVB/Jicofo), Mailpit. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-02-11 23:47:13 +01:00
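
Commits 322e2d9cb3, 6ab10415d8, and 3009f3d13a describe detecting section headers (§, Art., numbered sections such as 2.3.1, control family IDs such as AC-1 or PW.1.1) and attaching section and section_title metadata to chunks. The following is a minimal sketch of that idea, not the repository's code: SECTION_HEADER_RE and parse_header() are hypothetical names, and the pattern is an assumption that covers only the example formats quoted in the commit messages. The actual _LEGAL_SECTION_RE is not shown in this log and may differ.

```python
import re

# Hypothetical pattern for illustration only; NOT the repository's _LEGAL_SECTION_RE.
SECTION_HEADER_RE = re.compile(
    r"""^(?P<section>
        §\s*\d+[a-z]?                   # German statute sections, e.g. "§ 312k"
      | Art\.\s*\d+[a-z]?               # Articles, e.g. "Art. 5"
      | [A-Z]{2}(?:-\d+|(?:\.\d+)+)     # Control family IDs, e.g. "AC-1", "PW.1.1"
      | \d+(?:\.\d+)+                   # Numbered sections, e.g. "1.1", "2.3.1"
    )
    (?:\s+(?P<title>\S.*))?$            # Optional title after the identifier
    """,
    re.VERBOSE,
)


def parse_header(line):
    """Return section metadata for a header line, or None for ordinary text."""
    match = SECTION_HEADER_RE.match(line.strip())
    if match is None:
        return None
    return {
        "section": match.group("section"),
        "section_title": match.group("title"),
    }


if __name__ == "__main__":
    for sample in ["§ 312k Kündigungsbutton", "AC-1 Policy and Procedures", "Plain text"]:
        print(sample, "->", parse_header(sample))
```

A pattern this loose will also accept lines such as "3.14 is pi" as headers, so a production chunker would likely need additional context checks before trusting a match.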

6 Commits
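
Commit 75dda9ac92 states a backend priority of unstructured > pdfplumber > pypdf in auto mode. One way such a fallback could be sketched is shown below. The names BACKEND_PRIORITY and select_pdf_backend() and the "auto" argument value are assumptions made for illustration; the embedding service's real selection logic is not shown in this log.

```python
from importlib.util import find_spec

# Priority order as stated in commit 75dda9ac92 (highest first).
BACKEND_PRIORITY = ("unstructured", "pdfplumber", "pypdf")


def _installed(module_name):
    """Report whether a top-level module can be imported, without importing it."""
    return find_spec(module_name) is not None


def select_pdf_backend(requested="auto", is_available=_installed):
    """Pick a PDF extraction backend (hypothetical helper).

    With requested="auto", return the first available backend in priority
    order. An explicit backend name is honored only if it is available.
    """
    if requested != "auto":
        if requested not in BACKEND_PRIORITY:
            raise ValueError(f"unknown PDF backend: {requested!r}")
        if not is_available(requested):
            raise RuntimeError(f"PDF backend not installed: {requested!r}")
        return requested
    for name in BACKEND_PRIORITY:
        if is_available(name):
            return name
    raise RuntimeError("no PDF backend installed")


if __name__ == "__main__":
    # Simulate an environment where only pdfplumber and pypdf are installed.
    print(select_pdf_backend("auto", lambda name: name != "unstructured"))
```

Passing the availability check as a parameter keeps the selection logic testable without installing or uninstalling any of the three libraries.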