New table: deployment_checks (verdict, blocking/warning controls, risk score)
New API:
POST /v1/deployment-checks (SDK asks: "can I deploy?")
GET /v1/deployment-checks/{id} (check result)
POST /v1/deployment-checks/{id}/override (manual override with justification)
GET /v1/deployment-checks/stats (approval/block rate)
Check logic: queries G1 decision_traces + G3 open failures per affected control.
Verdict: approved (0 blocking) or blocked (with fix recommendations).
454 tests pass, 0 regressions.
Block G complete: G1-G4 all implemented.
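The verdict rule described above can be sketched in Python. This is a minimal illustration, not the service's code: the `Finding` and `Verdict` shapes and the `evaluate_deployment` name are hypothetical, and the real check derives its findings from decision_traces and open failures.

```python
from dataclasses import dataclass, field


@dataclass
class Finding:
    """One problem found for an affected control (hypothetical shape)."""
    control_id: str
    severity: str          # "blocking" or "warning"
    recommendation: str


@dataclass
class Verdict:
    status: str                                   # "approved" or "blocked"
    blocking: list = field(default_factory=list)  # control ids that block
    warnings: list = field(default_factory=list)  # control ids that only warn
    fixes: list = field(default_factory=list)     # recommendations for blockers


def evaluate_deployment(findings):
    """Approve only when no finding is blocking; otherwise block with fixes."""
    blocking = [f for f in findings if f.severity == "blocking"]
    warnings = [f.control_id for f in findings if f.severity == "warning"]
    if not blocking:
        return Verdict("approved", warnings=warnings)
    return Verdict(
        "blocked",
        blocking=[f.control_id for f in blocking],
        warnings=warnings,
        fixes=[f.recommendation for f in blocking],
    )
```

Warnings never change the verdict in this sketch; they are only reported alongside it.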
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
New table: decision_events (assessment→decision→fix→verification→failure cycle)
New API:
POST /v1/decision-events (record lifecycle event)
GET /v1/decision-events (list with filters)
GET /v1/decision-events/timeline/{control_id} (full chronological timeline)
GET /v1/decision-events/stats (failure rate, cycle times)
Each event captures input_state, output_state, actor, evidence.
454 tests pass, 0 regressions.
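As an illustration of the event record and the per-control timeline, here is a small Python sketch. The field types, the `timeline` and `fix_cycle_time` helpers, and the definition of cycle time (first decision to next verification) are assumptions made for the example, not taken from the schema.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass
class DecisionEvent:
    control_id: str
    event_type: str        # assessment, decision, fix, verification, failure
    occurred_at: datetime
    input_state: str
    output_state: str
    actor: str
    evidence: str


def timeline(events, control_id):
    """Return all events of one control in chronological order."""
    own = (e for e in events if e.control_id == control_id)
    return sorted(own, key=lambda e: e.occurred_at)


def fix_cycle_time(events, control_id):
    """Time from the first decision to the next verification, or None."""
    ordered = timeline(events, control_id)
    start = next((e for e in ordered if e.event_type == "decision"), None)
    if start is None:
        return None
    end = next(
        (e for e in ordered
         if e.event_type == "verification" and e.occurred_at >= start.occurred_at),
        None,
    )
    return end.occurred_at - start.occurred_at if end else None
```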
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
New table: compliance_commits (commit hash, affected controls, risk level)
New API:
POST /v1/compliance-commits (SDK registers commit + impact)
GET /v1/compliance-commits (list with filters)
GET /v1/compliance-commits/by-control/{id} (all commits for a control)
GET /v1/compliance-commits/stats (dashboard)
GET /v1/compliance-commits/{id} (detail)
GIN index on affected_control_ids for fast @> containment queries.
454 tests pass, 0 regressions.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
New table: decision_traces (status, reason, evidence, fix plan per control)
New API:
POST/GET/PUT /v1/decision-traces (CRUD for decisions)
GET /v1/decision-traces/stats (compliance dashboard)
GET /v1/controls/{id}/full-trace (Regulation→Obligation→Control→Decision→Evidence)
454 tests pass, 0 regressions.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
G-pre1: 144k objects clustered into 7,466 groups via Mini-Batch K-Means
on bge-m3 embeddings. Two-stage: k=5000 base + sub-cluster groups >50.
G-pre2: 5,114 Master Controls from lifecycle phase chains
(define→implement→test→monitor), linking 172,504 atomic controls.
G-pre3: REST API for Master Controls
GET /v1/master-controls (list, search, filter)
GET /v1/master-controls/stats
GET /v1/master-controls/{mc_id} (detail with phase-controls)
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
8 tests confirm all REGULATION_LICENSE_MAP, ACTION_TYPES, _NEGATIVE_PATTERNS,
_ACTION_SYNONYMS, and _OBJECT_SYNONYMS entries are correctly migrated to DB.
Dicts kept as fallback for DB-unavailability resilience.
Block F complete: F1-F5 all done.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Uses Ollama (qwen3.5:35b-a3b, think:false) to generate additional
German synonyms for action types and object tokens. Results stored
with source='llm' in action_synonyms/object_synonyms tables.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Root cause: init scripts ran repeatedly (on container restart) and tried
vault secrets enable / vault auth enable for already-existing paths.
Vault logged ERRORs and burned 40-84% CPU in the loop.
Fix:
- Marker file /vault/data/.init-complete skips re-initialization
- vault secrets list / vault auth list checks before enable calls
- No more "path already in use" errors on subsequent runs
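The two guards in this fix (a marker file plus a list-before-enable check) follow a common idempotent-init pattern. The Python sketch below only illustrates that pattern; the actual init scripts call the vault CLI, and `run_init_once` and `enable_if_missing` are hypothetical names.

```python
from pathlib import Path


def run_init_once(marker: Path, init_steps) -> bool:
    """Run init_steps unless the marker file exists; return True if they ran."""
    if marker.exists():
        return False                      # already initialized, skip everything
    for step in init_steps:
        step()
    marker.parent.mkdir(parents=True, exist_ok=True)
    marker.touch()                        # written only after all steps succeeded
    return True


def enable_if_missing(path: str, existing: set, enable) -> bool:
    """Call enable(path) only when path is not already mounted.

    `existing` stands in for the output of a list command such as
    `vault secrets list`; `enable` stands in for the enable call.
    """
    if path in existing:
        return False
    enable(path)
    existing.add(path)
    return True
```

Writing the marker last means a crashed first run is retried on the next restart, and the list check keeps that retry from hitting "path already in use".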
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Registers /generate/run-pass0a and /generate/pass0a-status/{job_id}
on the core control-pipeline (port 8098). Previously Pass 0a was only
available on the compliance backend which connects to Production DB,
causing a split-brain when controls are generated locally.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Migrates ACTION_TYPES (26+8 types), _NEGATIVE_PATTERNS (22), _ACTION_SYNONYMS
(65), and _OBJECT_SYNONYMS (75) from hardcoded dicts to DB tables.
- SQL migration: 003_action_object_ontology.sql (3 tables)
- Migration scripts: f2_migrate_actions.py (34 types, 145 synonyms), f3_migrate_objects.py (75 objects)
- OntologyRegistry cache: 5min TTL; raises RuntimeError if the DB tables are empty, so callers fall back safely to the dicts
- control_ontology.classify_action/get_phase delegate to DB with dict fallback
- control_dedup.normalize_action/normalize_object delegate to DB with dict fallback
- 25 new tests, 446 total pass, 0 regressions
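The cache-with-fallback behaviour listed above might look roughly like the following Python sketch. Only the 5-minute TTL, the RuntimeError on empty data and the dict fallback come from the commit; the constructor signature, the `loader` and `clock` parameters and the sample fallback entry are assumptions for illustration.

```python
import time

# Hypothetical excerpt of the hardcoded dict kept as fallback.
FALLBACK_ACTION_SYNONYMS = {"implementieren": "implement"}


class OntologyRegistry:
    """Caches rows from a loader for ttl_seconds (sketch, not the real class)."""

    def __init__(self, loader, ttl_seconds=300.0, clock=time.monotonic):
        self._loader = loader      # callable returning {synonym: canonical}
        self._ttl = ttl_seconds
        self._clock = clock
        self._data = None
        self._loaded_at = 0.0

    def get(self):
        now = self._clock()
        if self._data is None or now - self._loaded_at >= self._ttl:
            data = self._loader()
            if not data:
                # Empty tables are treated as an error so callers fall back.
                raise RuntimeError("ontology tables returned no rows")
            self._data, self._loaded_at = data, now
        return self._data


def normalize_action(word, registry):
    """Look up the canonical action in the DB cache, else in the dict."""
    key = word.lower()
    try:
        table = registry.get()
    except Exception:              # DB unavailable or empty: use the dict
        table = FALLBACK_ACTION_SYNONYMS
    return table.get(key, key)
```

Injecting the clock keeps the TTL testable without sleeping.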
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Next: F1 Regulation Registry (DB + API + Frontend + Auto-Create)
Frontend at /sdk/regulation-registry in breakpilot-compliance admin
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Add case-sensitive _SINGLE_NUM_ALLCAPS_RE for "1. INTRODUCTION" style
headers (ENISA, BSI docs). Cannot use _LEGAL_SECTION_RE for this because
it uses re.IGNORECASE which would false-positive on "1. Erstens" etc.
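To make the case-sensitivity point concrete, here is a hedged Python sketch. The pattern is a plausible stand-in, not the actual `_SINGLE_NUM_ALLCAPS_RE`, and `match_allcaps_header` is a hypothetical helper.

```python
import re

# Stand-in pattern (assumption): one number, a dot, then an ALL-CAPS title.
# Deliberately compiled WITHOUT re.IGNORECASE.
SINGLE_NUM_ALLCAPS_PATTERN = r"^(\d{1,2})\.\s+([A-Z][A-Z0-9 &/,\-]{2,})$"
SINGLE_NUM_ALLCAPS_RE = re.compile(SINGLE_NUM_ALLCAPS_PATTERN)


def match_allcaps_header(line):
    """Return (number, title) for '1. INTRODUCTION' style lines, else None."""
    m = SINGLE_NUM_ALLCAPS_RE.match(line.strip())
    return (m.group(1), m.group(2)) if m else None
```

Compiling the same pattern with `re.IGNORECASE` would also accept "1. Erstens", which is the false positive the commit avoids.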
Also re-downloaded 2 corrupt PDFs from nist.gov (nistir_8259a, nist_ai_rmf)
— originals in MinIO were 263-byte XML error responses, not PDFs.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
_SECTION_NUMBER_RE only had patterns for §/Art/Section/Kapitel/Annex
but missed NIST-style identifiers (AC-1, GV.OC-01, 3.1, A01:2021).
This caused 0% section rate for all NIST/BSI/ENISA documents even
though sections were correctly detected — the section NUMBER wasn't
extracted from the header.
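A sketch of how the four identifier styles named above could be extracted from a header line. The pattern and the `extract_section_number` helper are illustrative assumptions; the real `_SECTION_NUMBER_RE` may be structured differently.

```python
import re

# Illustrative pattern for the identifier styles listed in the commit.
NIST_STYLE_ID_RE = re.compile(
    r"""^(?:
        [A-Z]{2}\.[A-Z]{2}-\d{2}      # e.g. GV.OC-01
      | [A-Z]{2}-\d+(?:\(\d+\))?      # e.g. AC-1, AC-2(3)
      | A\d{2}:\d{4}                  # e.g. A01:2021
      | \d+(?:\.\d+)+                 # dotted numbering, e.g. 3.1 or 3.1.2
    )(?=[\s:]|$)""",
    re.VERBOSE,
)


def extract_section_number(header):
    """Return the leading section identifier of a header line, or None."""
    m = NIST_STYLE_ID_RE.match(header.strip())
    return m.group(0) if m else None
```

The trailing lookahead requires whitespace, a colon or end of line after the identifier, so a prefix of a longer token is not mistaken for a section number.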
Also adds:
- reupload_legal_strategy.py: re-upload with legal chunking
- extract_and_upload_nist.py: local PDF extraction workaround
- qdrant-snapshot.sh: backup mechanism for Qdrant collections
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Complete instructions for next session including:
- Current quality metrics per document type
- Prioritized action items (NIST fix, citation backfill, missing laws)
- Full Block E-G roadmap with details
- All critical files, DB state, test commands
- Known issues (3 lost NIST PDFs, frontend 500s, D5 script safety)
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Session achieved: structural metadata end-to-end (D2-D4), overlap bug
fix, HTML stripping with charset detection, 430/436 docs re-ingested.
Remaining: ~40 EU Official Journal PDFs need HTML from EUR-Lex (broken
multi-column PDF extraction), 3 missing EDPB PDFs, 1 corrupt PDF.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The default was 'pymupdf' which doesn't exist as a backend, causing
fallthrough to pypdf every time. With 'auto', the priority is:
unstructured > pdfplumber > pypdf.
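The 'auto' priority can be sketched as below. The function name, the injectable `is_available` check and the choice to reject unknown names with an error (instead of silently falling through, as the old default did) are assumptions of this example, not the service's actual behaviour.

```python
from importlib.util import find_spec

# Priority order taken from the commit message.
BACKEND_PRIORITY = ("unstructured", "pdfplumber", "pypdf")


def _installed(name):
    """True if a top-level module of that name can be imported."""
    return find_spec(name) is not None


def resolve_pdf_backend(requested="auto", is_available=_installed):
    """Pick the PDF backend to use for the requested setting."""
    if requested == "auto":
        for name in BACKEND_PRIORITY:
            if is_available(name):
                return name
        raise RuntimeError("no PDF backend is installed")
    if requested not in BACKEND_PRIORITY:
        # A name like 'pymupdf' is reported instead of silently ignored.
        raise ValueError(f"unknown PDF backend: {requested!r}")
    if not is_available(requested):
        raise RuntimeError(f"PDF backend {requested!r} is not installed")
    return requested
```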
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
EU Official Journal PDFs (AI Act, CRA, NIS2, DSGVO, etc.) use
multi-column layouts that pypdf breaks into fragmented words
("Ar tik el" instead of "Artikel"). pdfplumber handles these correctly.
Backend priority: unstructured > pdfplumber > pypdf (auto mode).
Also increases D5 re-ingestion timeout to 3600s for large PDFs.
58 embedding-service tests passing. pdfplumber: MIT license.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Two bugs fixed:
1. Opening block tags (<h3>, <div>) now also create newlines, not just
closing tags. gesetze-im-internet.de puts § inside an <h3> that follows
inline <a> text, so § ended up mid-line instead of at line start.
2. HTML charset detection from meta tag (charset=iso-8859-1). Files from
gesetze-im-internet.de use ISO-8859-1, not UTF-8. The § byte (0xA7)
was destroyed by UTF-8 decode. Now: try UTF-8 → check meta charset →
fallback ISO-8859-1.
32 rag-service tests passing.
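The decode order from bug 2 (UTF-8, then meta charset, then ISO-8859-1) can be sketched in Python as follows. `decode_html`, the regex and the 4096-byte search window are assumptions for illustration.

```python
import re

# Finds charset=... in a <meta> tag; searched on raw bytes before decoding.
_META_CHARSET_RE = re.compile(
    rb"""charset\s*=\s*["']?([A-Za-z0-9_\-]+)""", re.IGNORECASE
)


def decode_html(raw: bytes) -> str:
    """Decode HTML bytes: UTF-8 first, then declared charset, then ISO-8859-1."""
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        pass
    m = _META_CHARSET_RE.search(raw[:4096])   # meta tags sit near the top
    if m:
        try:
            return raw.decode(m.group(1).decode("ascii"))
        except (LookupError, UnicodeDecodeError):
            pass                               # unknown or wrong declaration
    return raw.decode("iso-8859-1")            # cannot fail: every byte maps
```

A lone 0xA7 byte is not valid UTF-8, so the strict first attempt fails on ISO-8859-1 files and the later steps recover the § sign.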
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
HTML files from gesetze-im-internet.de were decoded as raw UTF-8, keeping
<div>/<p> tags intact. The legal chunker regex requires § at line start,
which never matched inside HTML tags → 0% section metadata for HTML docs.
Fix: detect HTML content and strip tags before sending to embedding
service. Block elements become newlines, entities are decoded.
§ signs now appear at line starts → section detection works.
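A minimal sketch of the stripping step using the standard-library `html.parser`. The tag set and the `strip_html` name are assumptions; in this sketch both opening and closing block tags emit a newline, which matches the follow-up fix for gesetze-im-internet.de described elsewhere in this log.

```python
from html.parser import HTMLParser

# Assumed set of block-level tags that should start a new line.
BLOCK_TAGS = {
    "p", "div", "br", "li", "ul", "ol", "tr", "table", "section",
    "article", "h1", "h2", "h3", "h4", "h5", "h6",
}


class _TagStripper(HTMLParser):
    def __init__(self):
        # convert_charrefs=True decodes entities such as &sect; into text.
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag in BLOCK_TAGS:
            self.parts.append("\n")

    def handle_endtag(self, tag):
        if tag in BLOCK_TAGS:
            self.parts.append("\n")

    def handle_data(self, data):
        self.parts.append(data)


def strip_html(html_text: str) -> str:
    """Remove tags, turn block elements into newlines, drop empty lines."""
    parser = _TagStripper()
    parser.feed(html_text)
    parser.close()
    lines = (line.strip() for line in "".join(parser.parts).splitlines())
    return "\n".join(line for line in lines if line)
```

With this, a § that follows inline link text inside an `<h3>` lands at the start of its own line, which is what the legal chunker regex needs.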
Also adds D5 re-ingestion scripts (reingest_d5.py + config) for
batch re-processing of all documents in Qdrant collections.
27 rag-service tests passing.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- lib/translate.ts: LiteLLM DE<>EN translation utility
- Migration 006: description_de/description_en on both dataroom tables
- Admin + investor upload APIs: accept description+lang, auto-translate the other language on save
- PATCH /api/admin/dataroom/documents/[id]: description path in addition to display_name path
- PATCH /api/dataroom/uploads/[id]: investor can edit their own upload descriptions
- PATCH /api/admin/dataroom/investors/[id]/uploads: admin can edit investor upload descriptions
- All GET queries updated to return description fields
- Admin dataroom: drop zone replaces upload button, multi-file, inline description editor per doc and per investor upload
- Investor dataroom: drop zone, multi-file, description+lang textarea before upload, inline description editing on existing uploads
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
D2: RAG service stores section/section_title/paragraph/paragraph_num/page
from embedding service chunks_with_metadata into Qdrant payloads.
D3: Control generator prefers section > article > section_title from
Qdrant, adds page to source_citation and generation_metadata.
D4: Validated with real BGB §§ 312-312k text. Found and fixed critical
bug where Phase 3 overlap destroyed the [§ ...] section prefix, causing
only the first chunk per document to have metadata. All subsequent
chunks lost section info.
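One way to apply overlap without destroying the section prefix is to split the prefix off, prepend the overlap to the body only, and re-attach the prefix. The sketch below assumes chunks are plain strings that start with a `[§ ...]` prefix; the helper names and the character-based overlap are hypothetical and may differ from the actual Phase 3 code.

```python
import re

# Matches a leading "[§ ...]" section prefix (assumed chunk format).
_PREFIX_RE = re.compile(r"^(\[§[^\]]*\]\s*)")


def split_prefix(chunk):
    """Split a chunk into (section prefix, body); prefix may be empty."""
    m = _PREFIX_RE.match(chunk)
    return (m.group(1), chunk[m.end():]) if m else ("", chunk)


def add_overlap(chunks, overlap_chars=20):
    """Prepend the tail of the previous body while keeping the prefix first."""
    out = []
    prev_body = ""
    for chunk in chunks:
        prefix, body = split_prefix(chunk)
        # Guard overlap_chars > 0: a slice of [-0:] would copy everything.
        tail = prev_body[-overlap_chars:] if prev_body and overlap_chars > 0 else ""
        out.append(prefix + (tail + " " if tail else "") + body)
        prev_body = body
    return out
```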
Also fixes pre-existing lint issues (unused imports, ambiguous variable
names, duplicate dict key, bare except).
456 tests passing (58 embedding + 387 pipeline + 11 rag-service).
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Dataroom routes were reading x-investor-id from request headers which
the middleware sets as response headers — these don't reach route handlers
when the admin fallback path runs (NextResponse.next() without header).
Switch to getSessionFromCookie() consistent with all other investor routes.
Auth page DSGVO footer switched from absolute bottom-0 to normal flow
so the expanded Art. 13 notice doesn't overlap the login card.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Docker volume inherits directory ownership from the image on first mount.
Without this, the volume mounts as root and the nextjs (uid 1001) process
gets EACCES when trying to write dataroom uploads.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
72 hours → 30 days; expand scope to include personal contact data,
add Art. 15–21 rights and LfDI BW as supervisory authority. Both DE + EN.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- runDataCleanup() replaces maskOverdueInvestors(): now also anonymizes
never-activated invites after 90 days, deletes sessions + magic links
older than 30 days, NULLs IPs in audit logs older than 30 days, and
redacts email from audit log details JSONB for masked investors
- New /api/admin/cleanup POST endpoint for scheduled invocation
- New .gitea/workflows/pitch-cleanup.yml: daily cron at 02:00 UTC calls
the cleanup endpoint so anonymization is genuinely automatic, not lazy
- Switch masking window from first_activity_at to last_login_at (30 days
of inactivity; resets on each login)
- Both auth pages: DSGVO footer now covers all Art. 13 requirements —
data categories, retention cutoffs, Art. 15–21 rights, contact address,
LfDI Baden-Württemberg as supervisory authority
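The retention rules in this commit reduce to a few date comparisons. The Python sketch below illustrates them only; the real runDataCleanup() lives in the TypeScript app, the function names here are hypothetical, and treating the cutoff as inclusive (`>=`) is an assumption.

```python
from datetime import datetime, timedelta

INACTIVITY_WINDOW = timedelta(days=30)     # since last_login_at
UNUSED_INVITE_WINDOW = timedelta(days=90)  # invite never activated
TRANSIENT_RETENTION = timedelta(days=30)   # sessions, magic links, audit IPs


def should_anonymize(invited_at, last_login_at, now):
    """Decide whether an investor record is due for anonymization."""
    if last_login_at is None:
        # Never logged in: anonymize the invite after 90 days.
        return now - invited_at >= UNUSED_INVITE_WINDOW
    # Window resets on each login because it is based on last_login_at.
    return now - last_login_at >= INACTIVITY_WINDOW


def is_expired_transient(created_at, now):
    """True for sessions, magic links or audit IPs past the 30-day cutoff."""
    return now - created_at >= TRANSIENT_RETENTION
```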
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Sends 67k review candidates to Haiku Batch API in pairs.
Each pair gets a DUPLIKAT/VERSCHIEDEN (duplicate/different) decision with reasoning.
Results stored in control_dedup_reviews.review_status.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Dedup is done (162k controls). Re-enable healthcheck with generous
timeouts (10 retries × 30s) and restart: unless-stopped.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- New POST /api/admin/investors/[id]/generate-link endpoint: creates a
magic link without sending email, returns the URL for the admin to
copy and share manually (for when email is filtered)
- Adds 'Copy Link' button (emerald) to investor list and detail pages;
link is copied to clipboard on click
- New lib/masking.ts: maskOverdueInvestors() runs an UPDATE that anonymizes
email/name/company and revokes sessions 72h after the first investor login
- first_activity_at recorded on first verify (COALESCE, set once only)
- migration 004 adds first_activity_at + data_masked_at columns with
partial index; also wired into /api/admin/migrate for one-shot apply
- Admin UI shows 'anonymized' badge, expiry countdown, and masked state;
Copy Link + Resend are disabled for anonymized investors
- verify route returns 410 if data_masked_at is set (belt-and-suspenders
alongside the revoked status check)
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- tsconfig.json: add mcp-server to exclude list so the standalone MCP
package's imports don't break the Next.js type-check build
- FinanzplanSlide.tsx: resolve merge conflict, keep MonthlyGrid refactor
from upstream (discards superseded inline table from stash)
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
The previously committed version was missing useIsLight hook, all sub-components
(PillarRow, ColHeader, CentralHub, BridgeConnectors, FeatureCard, DetailModal,
StarField, ticker components) and their data/types. Only the main component
shell was present, causing a CI build failure on type-check.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Scan endpoint can take 3-5 min (multi-page crawl + LLM calls).
Without explicit timeout, nginx defaults to 60s → 504 Gateway Timeout.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Stores last_control_id in canonical_generation_jobs after each page.
On restart, resumes from checkpoint instead of starting over.
Checkpoint is deleted on completion.
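The checkpoint-and-resume flow can be sketched as follows. A plain dict stands in for the last_control_id column of canonical_generation_jobs, and `run_job`, `process` and the page size are hypothetical; ordering by control id is an assumption.

```python
def run_job(job_id, control_ids, process, checkpoints, page_size=2):
    """Process controls page by page, saving a checkpoint after each page.

    `checkpoints` maps job_id -> last processed control id and stands in
    for the database column (assumption for this sketch).
    """
    ordered = sorted(control_ids)
    last = checkpoints.get(job_id)
    if last is not None:
        # Resume: skip everything up to and including the checkpoint.
        ordered = [cid for cid in ordered if cid > last]
    for start in range(0, len(ordered), page_size):
        page = ordered[start:start + page_size]
        for cid in page:
            process(cid)
        checkpoints[job_id] = page[-1]     # store progress after each page
    checkpoints.pop(job_id, None)          # delete checkpoint on completion
```

Because progress is saved per page, items of a page that was interrupted midway are processed again after a restart, so `process` should be idempotent.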
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Port 3007 (admin-compliance) had no limit (nginx default 1M) causing
413 on SDK state saves. Port 8093 (SDK) had 10M, now 50M.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>