Commit Graph

7 Commits

Author SHA1 Message Date
Benjamin Admin 0b0eed27b0 feat(embedding): NIST PDF text normalization + safe re-ingest script
Fix broken multi-column PDF extraction for NIST/BSI/ENISA documents:
- _normalize_pdf_text(): fixes broken section numbers (1 . 1 → 1.1),
  control IDs (AC - 1 → AC-1), ligatures, soft hyphens
- pdfplumber tolerances increased (x=3,y=4) for better column handling
- 3 new regex patterns: NIST CSF 2.0, NIST enhancements, OWASP Top 10
- reingest_nist.py: safe upload-before-delete for 4 lost NIST PDFs
- reingest_d5.py: safety fix — upload first, verify, then delete old

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-03 06:42:46 +02:00
Benjamin Admin 3009f3d13a feat(embedding): add NIST/ENISA/standard section numbering to chunker
Extends _LEGAL_SECTION_RE to detect:
- Numbered sections: 1.1 Title, 2.3.1 Subtitle
- Control family IDs: AC-1, AU-2, PO.1, PW.1.1
- Table/Figure/Appendix references
Also adds EUR-Lex HTML replacement script.

58 embedding-service tests passing.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-02 19:24:10 +02:00
Benjamin Admin 75dda9ac92 feat(embedding): add pdfplumber backend for multi-column PDF extraction
EU Official Journal PDFs (AI Act, CRA, NIS2, DSGVO, etc.) use
multi-column layouts that pypdf breaks into fragmented words
("Ar tik el" instead of "Artikel"). pdfplumber handles these correctly.

Backend priority: unstructured > pdfplumber > pypdf (auto mode).
Also increases D5 re-ingestion timeout to 3600s for large PDFs.

58 embedding-service tests passing. pdfplumber: MIT license.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-02 15:42:25 +02:00
Benjamin Admin 93099b2770 feat(pipeline): structural metadata end-to-end (Blocks D2-D4)
D2: RAG service stores section/section_title/paragraph/paragraph_num/page
from embedding service chunks_with_metadata into Qdrant payloads.

D3: Control generator prefers section > article > section_title from
Qdrant, adds page to source_citation and generation_metadata.

D4: Validated with real BGB §§ 312-312k text. Found and fixed critical
bug where Phase 3 overlap destroyed the [§ ...] section prefix, causing
only the first chunk per document to have metadata. All subsequent
chunks lost section info.

Also fixes pre-existing lint issues (unused imports, ambiguous variable
names, duplicate dict key, bare except).

456 tests passing (58 embedding + 387 pipeline + 11 rag-service).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-01 20:34:00 +02:00
Benjamin Admin 6ab10415d8 feat(embedding): add structural metadata to legal chunking (Block D1)
chunk_text_legal_structured() returns metadata per chunk:
- section: "§ 312k", "Art. 5"
- section_title: "Kündigungsbutton"
- paragraph: "Abs. 1", "Nr. 3"
- paragraph_num: 1, 3
- page: (prepared for PDF integration)
- index: sequential position

/chunk endpoint now returns chunks_with_metadata alongside plain chunks.
Backward compatible — existing consumers use chunks field unchanged.

New regex: _PARAGRAPH_RE (Abs/Nr/Satz/lit), _SECTION_NUMBER_RE
New functions: _parse_section_metadata(), _extract_paragraph_ref()

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-01 15:25:23 +02:00
Benjamin Admin 322e2d9cb3 feat(embedding): implement legal-aware chunking pipeline
Replace plain recursive chunker with legal-aware chunking that:
- Detects legal section headers (§, Art., Section, Chapter, Annex)
- Adds section context prefix to every chunk
- Splits on paragraph boundaries then sentence boundaries
- Protects DE + EN abbreviations (80+ patterns) from false splits
- Supports language detection for locale-specific processing
- Force-splits overlong sentences at word boundaries

The old plain_recursive API option is removed — all non-semantic
strategies now route through chunk_text_legal().

Includes 40 tests covering header detection, abbreviation protection,
sentence splitting, and legal chunking behavior.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-22 09:18:23 +01:00
Benjamin Boenisch ad111d5e69 Initial commit: breakpilot-core - Shared Infrastructure
Docker Compose with 24+ services:
- PostgreSQL (PostGIS), Valkey, MinIO, Qdrant
- Vault (PKI/TLS), Nginx (Reverse Proxy)
- Backend Core API, Consent Service, Billing Service
- RAG Service, Embedding Service
- Gitea, Woodpecker CI/CD
- Night Scheduler, Health Aggregator
- Jitsi (Web/XMPP/JVB/Jicofo), Mailpit

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-11 23:47:13 +01:00