breakpilot-core

Author	SHA1	Message	Date
Benjamin Admin	fac9280716	feat(pipeline): Block D5+-E complete session — 20k+ new chunks Session 02-03.05.2026 accomplishments: - D5+: NIST/ENISA PDF quality fix (0%→45% section rate) - D5+: 4 lost NIST PDFs restored (11k chunks) - D5+: Text normalization + section detection for NIST/BSI - D6: Citation backfill (3,651 controls updated, old archived) - E2: 8 DE laws ingested (ArbZG, MuSchG, GmbHG, AktG, InsO...) - E3: 5 EU regulations (CSRD, CSDDD, Taxonomy, eIDAS, Pay Trans.) - E4: Standards (GoBD, BAIT, VAIT) - E6: 3 CH + 4 AT laws (OR, DSV, ArG, ArbVG, AngG, AZG, NISG) - E7: 9 court judgments as full text (Schrems II 154 chunks, Meta 101, BVerfG 161, DSK OH 119, Planet49 42, SCHUFA 41, Schadenersatz 29, BAG 48, Google Fonts 14) - Infra: Qdrant snapshot mechanism, upload-before-delete safety Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-05-03 22:31:57 +02:00
Benjamin Admin	118be3540d	feat(pipeline): D6 citation backfill + E2/E3 law ingestion scripts - d6_citation_backfill.py: 3-tier matching (hash/prefix/overlap), archives old citations, updated 3.651 controls (93.6% coverage) - ingest_de_laws.py: 8 German laws ingested (ArbZG, MuSchG, NachwG, MiLoG, GmbHG, AktG, InsO, BUrlG — 1.629 chunks) - ingest_eu_regulations.py: EUR-Lex ingestion (needs manual HTML due to AWS WAF). CSRD, CSDDD, EU Taxonomy, eIDAS 2.0, Pay Transparency manually ingested (1.057 chunks) - Updated session handover with current state Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-05-03 13:19:27 +02:00
Benjamin Admin	2f4a3f2ea2	fix(embedding): add NIST control IDs to _SECTION_NUMBER_RE _SECTION_NUMBER_RE only had patterns for §/Art/Section/Kapitel/Annex but missed NIST-style identifiers (AC-1, GV.OC-01, 3.1, A01:2021). This caused 0% section rate for all NIST/BSI/ENISA documents even though sections were correctly detected — the section NUMBER wasn't extracted from the header. Also adds: - reupload_legal_strategy.py: re-upload with legal chunking - extract_and_upload_nist.py: local PDF extraction workaround - qdrant-snapshot.sh: backup mechanism for Qdrant collections Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-05-03 07:42:06 +02:00
Benjamin Admin	0b0eed27b0	feat(embedding): NIST PDF text normalization + safe re-ingest script Fix broken multi-column PDF extraction for NIST/BSI/ENISA documents: - _normalize_pdf_text(): fixes broken section numbers (1 . 1 → 1.1), control IDs (AC - 1 → AC-1), ligatures, soft hyphens - pdfplumber tolerances increased (x=3,y=4) for better column handling - 3 new regex patterns: NIST CSF 2.0, NIST enhancements, OWASP Top 10 - reingest_nist.py: safe upload-before-delete for 4 lost NIST PDFs - reingest_d5.py: safety fix — upload first, verify, then delete old Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-05-03 06:42:46 +02:00
Benjamin Admin	ff21bc258a	docs: session handover — D2-D5 complete, quality report, NIST plan Major session achievements: - Structural metadata end-to-end (D2-D4) - 430 docs re-ingested with new chunking - HTML stripping + charset detection (0% → 97.6%) - 20 EU regulations from EUR-Lex HTML (DSGVO: 0% → 92%) - Quality report script (500 controls: 13% fully correct) - Frontend requirements.map fix Open: NIST/ENISA text normalization, citation backfill, D5 script safety (upload-before-delete), BEG IV ingestion. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-05-02 22:22:55 +02:00
Benjamin Admin	3009f3d13a	feat(embedding): add NIST/ENISA/standard section numbering to chunker Extends _LEGAL_SECTION_RE to detect: - Numbered sections: 1.1 Title, 2.3.1 Subtitle - Control family IDs: AC-1, AU-2, PO.1, PW.1.1 - Table/Figure/Appendix references Also adds EUR-Lex HTML replacement script. 58 embedding-service tests passing. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-05-02 19:24:10 +02:00
Benjamin Admin	75dda9ac92	feat(embedding): add pdfplumber backend for multi-column PDF extraction EU Official Journal PDFs (AI Act, CRA, NIS2, DSGVO, etc.) use multi-column layouts that pypdf breaks into fragmented words ("Ar tik el" instead of "Artikel"). pdfplumber handles these correctly. Backend priority: unstructured > pdfplumber > pypdf (auto mode). Also increases D5 re-ingestion timeout to 3600s for large PDFs. 58 embedding-service tests passing. pdfplumber: MIT license. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-05-02 15:42:25 +02:00
Benjamin Admin	ddad58f607	fix(rag): strip HTML tags before chunking + D5 re-ingestion scripts HTML files from gesetze-im-internet.de were decoded as raw UTF-8, keeping <div>/<p> tags intact. The legal chunker regex requires § at line start, which never matched inside HTML tags → 0% section metadata for HTML docs. Fix: detect HTML content and strip tags before sending to embedding service. Block elements become newlines, entities are decoded. § signs now appear at line starts → section detection works. Also adds D5 re-ingestion scripts (reingest_d5.py + config) for batch re-processing of all documents in Qdrant collections. 27 rag-service tests passing. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-05-02 08:18:25 +02:00
Benjamin Admin	93099b2770	feat(pipeline): structural metadata end-to-end (Blocks D2-D4) D2: RAG service stores section/section_title/paragraph/paragraph_num/page from embedding service chunks_with_metadata into Qdrant payloads. D3: Control generator prefers section > article > section_title from Qdrant, adds page to source_citation and generation_metadata. D4: Validated with real BGB §§ 312-312k text. Found and fixed critical bug where Phase 3 overlap destroyed the [§ ...] section prefix, causing only the first chunk per document to have metadata. All subsequent chunks lost section info. Also fixes pre-existing lint issues (unused imports, ambiguous variable names, duplicate dict key, bare except). 456 tests passing (58 embedding + 387 pipeline + 11 rag-service). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-05-01 20:34:00 +02:00
Benjamin Admin	6a0e7c947f	perf(pipeline): switch to v3 prompt for generation, v4 fields via Haiku backfill Remove applicability/scanner_hint/evidence_type/provides_context from Pass 0b prompt to reduce output tokens (~40% less). These 6 fields are added via cheap Haiku backfill afterwards (~$1.50 per 10k controls). Saves ~$200 over the remaining 160k obligations. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-27 00:14:47 +02:00
Benjamin Admin	4716023abc	feat(control-pipeline): JSONB migration for generation_metadata - Add migration script (scripts/migrate_jsonb.py) that converts 89,443 Python dict repr rows to valid JSON via ast.literal_eval - Column altered from TEXT to native JSONB - Index created on generation_metadata->>'merge_group_hint' - Remove unnecessary ::jsonb casts in pipeline_adapter.py Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-22 07:49:11 +02:00
Benjamin Admin	b14be8583d	feat: Betriebsrats-Compliance — BAG-Ingestion Script + BV-Template 1. BAG-Urteile Ingestion Script (21 kuratierte Urteile zu §87 BetrVG) - Microsoft 365, SAP ERP, E-Mail, Standardsoftware, Video, SaaS/Cloud - 14 erfolgreich ingestiert (4.726 Chunks in bp_compliance_datenschutz) 2. Betriebsvereinbarung Template (6. Document Template) - SQL-Migration mit 13 Sektionen (A-M), ~30 Placeholders - Conditional Blocks fuer KI-Systeme, Video, HR - Python-Generator mit automatischer TOM-Befuellung Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-12 10:49:01 +02:00
Benjamin Admin	441d5740bd	feat: Applicability Engine + API-Filter + DB-Sync + Cleanup CI / go-lint (push) Has been skipped Details CI / python-lint (push) Has been skipped Details CI / nodejs-lint (push) Has been skipped Details CI / test-go-consent (push) Successful in 35s Details CI / test-python-voice (push) Successful in 33s Details CI / test-bqas (push) Successful in 37s Details CI / Deploy (push) Failing after 2s Details - Applicability Engine (deterministisch, kein LLM): filtert Controls nach Branche, Unternehmensgroesse, Scope-Signalen - API-Filter auf GET /controls, /controls-count, /controls-meta - POST /controls/applicable Endpoint fuer Company-Profile-Matching - 35 Unit-Tests fuer Engine - Port-8098-Konflikt mit Nginx gefixt (nur expose, kein Host-Port) - CLAUDE.md: control-pipeline Dokumentation ergaenzt - 6 internationale Gesetze geloescht (ES/FR/HU/NL/SE/CZ — nur DACH) - DB-Backup-Import-Script (import_backup.py) Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-09 21:58:17 +02:00

13 Commits