Complete instructions for next session including:
- Current quality metrics per document type
- Prioritized action items (NIST fix, citation backfill, missing laws)
- Full Block E-G roadmap with details
- All critical files, DB state, test commands
- Known issues (3 lost NIST PDFs, frontend 500s, D5 script safety)
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Session achieved: structural metadata end-to-end (D2-D4), overlap bug
fix, HTML stripping with charset detection, 430/436 docs re-ingested.
Remaining: ~40 EU Official Journal PDFs need HTML from EUR-Lex (broken
multi-column PDF extraction), 3 missing EDPB PDFs, 1 corrupt PDF.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The default was 'pymupdf' which doesn't exist as a backend, causing
fallthrough to pypdf every time. With 'auto', the priority is:
unstructured > pdfplumber > pypdf.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
EU Official Journal PDFs (AI Act, CRA, NIS2, DSGVO, etc.) use
multi-column layouts that pypdf breaks into fragmented words
("Ar tik el" instead of "Artikel"). pdfplumber handles these correctly.
Backend priority: unstructured > pdfplumber > pypdf (auto mode).
Also increases D5 re-ingestion timeout to 3600s for large PDFs.
58 embedding-service tests passing. pdfplumber: MIT license.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Two bugs fixed:
1. Opening block tags (<h3>, <div>) now also create newlines, not just
closing tags. Fixes: gesetze-im-internet.de puts § inside <h3> which
followed inline <a> text — § ended up mid-line, not at line start.
2. HTML charset detection from meta tag (charset=iso-8859-1). Files from
gesetze-im-internet.de use ISO-8859-1, not UTF-8. The § byte (0xA7)
was destroyed by UTF-8 decode. Now: try UTF-8 → check meta charset →
fallback ISO-8859-1.
32 rag-service tests passing.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
HTML files from gesetze-im-internet.de were decoded as raw UTF-8, keeping
<div>/<p> tags intact. The legal chunker regex requires § at line start,
which never matched inside HTML tags → 0% section metadata for HTML docs.
Fix: detect HTML content and strip tags before sending to embedding
service. Block elements become newlines, entities are decoded.
§ signs now appear at line starts → section detection works.
Also adds D5 re-ingestion scripts (reingest_d5.py + config) for
batch re-processing of all documents in Qdrant collections.
27 rag-service tests passing.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
D2: RAG service stores section/section_title/paragraph/paragraph_num/page
from embedding service chunks_with_metadata into Qdrant payloads.
D3: Control generator prefers section > article > section_title from
Qdrant, adds page to source_citation and generation_metadata.
D4: Validated with real BGB §§ 312-312k text. Found and fixed critical
bug where Phase 3 overlap destroyed the [§ ...] section prefix, causing
only the first chunk per document to have metadata. All subsequent
chunks lost section info.
Also fixes pre-existing lint issues (unused imports, ambiguous variable
names, duplicate dict key, bare except).
456 tests passing (58 embedding + 387 pipeline + 11 rag-service).
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Sends 67k review candidates to Haiku Batch API in pairs.
Each pair gets a DUPLIKAT/VERSCHIEDEN decision with reasoning.
Results stored in control_dedup_reviews.review_status.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Dedup is done (162k controls). Re-enable healthcheck with generous
timeouts (10 retries × 30s) and restart: unless-stopped.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Scan endpoint needs up to 3-5 min (multi-page crawl + LLM calls).
Without explicit timeout, nginx defaults to 60s → 504 Gateway Timeout.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Stores last_control_id in canonical_generation_jobs after each page.
On restart, resumes from checkpoint instead of starting over.
Checkpoint is deleted on completion.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Port 3007 (admin-compliance) had no limit (nginx default 1M) causing
413 on SDK state saves. Port 8093 (SDK) had 10M, now 50M.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Paginated DB queries (100 rows/page) instead of loading all 166k rows
- Individual timeout (30s) per embedding + qdrant call
- Per-control try/except — one failure doesn't kill the job
- Sequential processing (no asyncio.gather) for stability
- Progress logging every 500 controls
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The dedup job blocks the event loop for extended periods, causing
health checks to fail repeatedly. Even 10 retries × 30s wasn't enough.
Disabled healthcheck and restart policy until dedup is complete.
TEMPORARY — re-enable after dedup is finished.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Dedup Phase 2 blocks the event loop for extended periods, causing
health checks to fail. Docker then restarts the container and kills
the job. Increased retries from 3 to 10, timeout from 10s to 30s.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
153k of 160k merge groups have only 1 control — no intra-group
dedup possible. Skip them in Phase 1, they become masters automatically.
Phase 2 (cross-group) still checks them via Qdrant embeddings.
Reduces Phase 1 from ~96h to ~2h.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Prevents duplicate batch submissions that caused ~$170 in extra costs.
Refuses new submit if a batch was submitted in the last 10 minutes.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Remove applicability/scanner_hint/evidence_type/provides_context from
Pass 0b prompt to reduce output tokens (~40% less). These 6 fields are
added via cheap Haiku backfill afterwards (~$1.50 per 10k controls).
Saves ~$200 over the remaining 160k obligations.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Prompt v4 adds 6 new fields to Pass 0b output:
- applicability: condition rules (same format as dependency engine)
- check_type: expanded to 10 granular types
- scanner_hint: search_terms + negative_indicators for MCP
- manual_review_required_if: escalation conditions
- evidence_type: code/process/hybrid
- provides_context: context variables this control creates
New endpoint POST /generate/backfill-extended:
- Backfills existing 9k controls via Haiku Batch API (~$1.50)
- Adds all 6 new fields to generation_metadata
- Supports dry_run mode
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
LLM merge_key phases (e.g. "submission") don't always match PHASE_ORDER
keys. Derive phase order from action_type via get_phase_order() instead.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Fix truncated title detection: only flag near-200-char titles or mid-word cutoffs
- Fix evidence leak detection: check title start patterns, not keyword substring
("nachweisen" verb is valid action, "Nachweis vorliegen" is evidence)
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- GET /generate/batch-api-status/{batch_id} — check Anthropic batch status
- POST /generate/process-batch — process completed batch results (background)
- GET /generate/process-batch-status/{job_id} — poll processing progress
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Add assertion, pass_criteria, fail_criteria, check_type to AtomicControlCandidate dataclass
- Parse MCP fields from LLM output in _process_pass0b_control
- Store MCP fields in generation_metadata JSON for later use by MCP scanner
- Fields default to empty when not present (backward-compatible with old prompts)
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Obligations classified before API call:
- evidence → skipped (saves API cost)
- composite → skipped (not atomic)
- framework_container → skipped (decompose separately)
- atomic → sent to LLM
Filter stats returned in submit response.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Fixes from v2 evaluation (7.9/10 avg, 28 controls):
1. COMPOUND BAN: "durchführen UND Maßnahmen ergreifen" → pick primary action only
2. EVIDENCE-OF-ACTION: "Tests dokumentieren" → evidence field, not own control
3. PFLICHT=PROZESS: "Behörden informieren" + "Verfahren etablieren" = 1 control
4. MERGE-KEY BUG: merge_key from LLM output now stored in generation_metadata
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Key changes to system prompt:
- Evidence/documentation belongs in evidence field, NOT as separate control
- SBOM = 1 control (not "maintain" + "document" separately)
- Security lifecycle phases (identify/assess/remediate/monitor) = separate controls
- Same object + same action + same actor = 1 control (merge, not split)
- Titles must contain the ACTION, not just the subject
WRONG: "Vertraulichkeit Mitarbeiter"
RIGHT: "Mitarbeiter zur Vertraulichkeit verpflichten"
Titles serve as MCP search queries against customer documents/code.
Bad titles = bad search results = unusable product.
All 52,566 old pass0b controls deprecated (not deleted) for full regeneration.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Prevents UniqueViolation from blocking entire batch. Each result
is committed individually, errors are rolled back without affecting
subsequent results.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Previous version searched against atomic_controls_dedup collection which
only contains Pass 0b atomic controls. Now creates a temporary collection
with ALL draft controls as reference, then checks targets against it.
Two phases:
1. Index ~53k reference drafts into temp Qdrant collection (batch 32)
2. Search each of 14k target controls, Embedding + LLM for borderline
3. Cleanup temp collection when done
Status updates every 50 controls (fixed counter bug).
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>