Next: F1 Regulation Registry (DB + API + Frontend + Auto-Create)
Frontend at /sdk/regulation-registry in breakpilot-compliance admin
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Add case-sensitive _SINGLE_NUM_ALLCAPS_RE for "1. INTRODUCTION" style
headers (ENISA, BSI docs). Cannot use _LEGAL_SECTION_RE for this because
it uses re.IGNORECASE which would false-positive on "1. Erstens" etc.
Also re-downloaded 2 corrupt PDFs from nist.gov (nistir_8259a, nist_ai_rmf)
— originals in MinIO were 263-byte XML error responses, not PDFs.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
_SECTION_NUMBER_RE only had patterns for §/Art/Section/Kapitel/Annex
but missed NIST-style identifiers (AC-1, GV.OC-01, 3.1, A01:2021).
This caused 0% section rate for all NIST/BSI/ENISA documents even
though sections were correctly detected — the section NUMBER wasn't
extracted from the header.
Also adds:
- reupload_legal_strategy.py: re-upload with legal chunking
- extract_and_upload_nist.py: local PDF extraction workaround
- qdrant-snapshot.sh: backup mechanism for Qdrant collections
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Complete instructions for next session including:
- Current quality metrics per document type
- Prioritized action items (NIST fix, citation backfill, missing laws)
- Full Block E-G roadmap with details
- All critical files, DB state, test commands
- Known issues (3 lost NIST PDFs, frontend 500s, D5 script safety)
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Session achieved: structural metadata end-to-end (D2-D4), overlap bug
fix, HTML stripping with charset detection, 430/436 docs re-ingested.
Remaining: ~40 EU Official Journal PDFs need HTML from EUR-Lex (broken
multi-column PDF extraction), 3 missing EDPB PDFs, 1 corrupt PDF.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The default was 'pymupdf' which doesn't exist as a backend, causing
fallthrough to pypdf every time. With 'auto', the priority is:
unstructured > pdfplumber > pypdf.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
EU Official Journal PDFs (AI Act, CRA, NIS2, DSGVO, etc.) use
multi-column layouts that pypdf breaks into fragmented words
("Ar tik el" instead of "Artikel"). pdfplumber handles these correctly.
Backend priority: unstructured > pdfplumber > pypdf (auto mode).
Also increases D5 re-ingestion timeout to 3600s for large PDFs.
58 embedding-service tests passing. pdfplumber: MIT license.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Two bugs fixed:
1. Opening block tags (<h3>, <div>) now also create newlines, not just
closing tags. Fixes: gesetze-im-internet.de puts § inside <h3> which
followed inline <a> text — § ended up mid-line, not at line start.
2. HTML charset detection from meta tag (charset=iso-8859-1). Files from
gesetze-im-internet.de use ISO-8859-1, not UTF-8. The § byte (0xA7)
was destroyed by UTF-8 decode. Now: try UTF-8 → check meta charset →
fallback ISO-8859-1.
32 rag-service tests passing.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
HTML files from gesetze-im-internet.de were decoded as raw UTF-8, keeping
<div>/<p> tags intact. The legal chunker regex requires § at line start,
which never matched inside HTML tags → 0% section metadata for HTML docs.
Fix: detect HTML content and strip tags before sending to embedding
service. Block elements become newlines, entities are decoded.
§ signs now appear at line starts → section detection works.
Also adds D5 re-ingestion scripts (reingest_d5.py + config) for
batch re-processing of all documents in Qdrant collections.
27 rag-service tests passing.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
D2: RAG service stores section/section_title/paragraph/paragraph_num/page
from embedding service chunks_with_metadata into Qdrant payloads.
D3: Control generator prefers section > article > section_title from
Qdrant, adds page to source_citation and generation_metadata.
D4: Validated with real BGB §§ 312-312k text. Found and fixed critical
bug where Phase 3 overlap destroyed the [§ ...] section prefix, causing
only the first chunk per document to have metadata. All subsequent
chunks lost section info.
Also fixes pre-existing lint issues (unused imports, ambiguous variable
names, duplicate dict key, bare except).
456 tests passing (58 embedding + 387 pipeline + 11 rag-service).
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Sends 67k review candidates to Haiku Batch API in pairs.
Each pair gets a DUPLIKAT/VERSCHIEDEN decision with reasoning.
Results stored in control_dedup_reviews.review_status.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Dedup is done (162k controls). Re-enable healthcheck with generous
timeouts (10 retries × 30s) and restart: unless-stopped.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Scan endpoint needs up to 3-5 min (multi-page crawl + LLM calls).
Without explicit timeout, nginx defaults to 60s → 504 Gateway Timeout.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Stores last_control_id in canonical_generation_jobs after each page.
On restart, resumes from checkpoint instead of starting over.
Checkpoint is deleted on completion.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Port 3007 (admin-compliance) had no limit (nginx default 1M) causing
413 on SDK state saves. Port 8093 (SDK) had 10M, now 50M.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Paginated DB queries (100 rows/page) instead of loading all 166k rows
- Individual timeout (30s) per embedding + qdrant call
- Per-control try/except — one failure doesn't kill the job
- Sequential processing (no asyncio.gather) for stability
- Progress logging every 500 controls
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The dedup job blocks the event loop for extended periods, causing
health checks to fail repeatedly. Even 10 retries × 30s wasn't enough.
Disabled healthcheck and restart policy until dedup is complete.
TEMPORARY — re-enable after dedup is finished.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Dedup Phase 2 blocks the event loop for extended periods, causing
health checks to fail. Docker then restarts the container and kills
the job. Increased retries from 3 to 10, timeout from 10s to 30s.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
153k of 160k merge groups have only 1 control — no intra-group
dedup possible. Skip them in Phase 1, they become masters automatically.
Phase 2 (cross-group) still checks them via Qdrant embeddings.
Reduces Phase 1 from ~96h to ~2h.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Prevents duplicate batch submissions that caused ~$170 in extra costs.
Refuses new submit if a batch was submitted in the last 10 minutes.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Remove applicability/scanner_hint/evidence_type/provides_context from
Pass 0b prompt to reduce output tokens (~40% less). These 6 fields are
added via cheap Haiku backfill afterwards (~$1.50 per 10k controls).
Saves ~$200 over the remaining 160k obligations.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Prompt v4 adds 6 new fields to Pass 0b output:
- applicability: condition rules (same format as dependency engine)
- check_type: expanded to 10 granular types
- scanner_hint: search_terms + negative_indicators for MCP
- manual_review_required_if: escalation conditions
- evidence_type: code/process/hybrid
- provides_context: context variables this control creates
New endpoint POST /generate/backfill-extended:
- Backfills existing 9k controls via Haiku Batch API (~$1.50)
- Adds all 6 new fields to generation_metadata
- Supports dry_run mode
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
LLM merge_key phases (e.g. "submission") don't always match PHASE_ORDER
keys. Derive phase order from action_type via get_phase_order() instead.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Fix truncated title detection: only flag near-200-char titles or mid-word cutoffs
- Fix evidence leak detection: check title start patterns, not keyword substring
("nachweisen" verb is valid action, "Nachweis vorliegen" is evidence)
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- GET /generate/batch-api-status/{batch_id} — check Anthropic batch status
- POST /generate/process-batch — process completed batch results (background)
- GET /generate/process-batch-status/{job_id} — poll processing progress
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Add assertion, pass_criteria, fail_criteria, check_type to AtomicControlCandidate dataclass
- Parse MCP fields from LLM output in _process_pass0b_control
- Store MCP fields in generation_metadata JSON for later use by MCP scanner
- Fields default to empty when not present (backward-compatible with old prompts)
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Obligations classified before API call:
- evidence → skipped (saves API cost)
- composite → skipped (not atomic)
- framework_container → skipped (decompose separately)
- atomic → sent to LLM
Filter stats returned in submit response.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Fixes from v2 evaluation (7.9/10 avg, 28 controls):
1. COMPOUND BAN: "durchführen UND Maßnahmen ergreifen" → pick primary action only
2. EVIDENCE-OF-ACTION: "Tests dokumentieren" → evidence field, not own control
3. PFLICHT=PROZESS: "Behörden informieren" + "Verfahren etablieren" = 1 control
4. MERGE-KEY BUG: merge_key from LLM output now stored in generation_metadata
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>