The fm_scenarios array in each pitch version snapshot already stores the
fp_scenario IDs directly (same pattern 1 Mio used). Wandeldarlehen snapshots
were missing Bear/Bull entries — updated in DB to add them.
- /api/data: include fp_scenarios in version response (was omitted)
- PitchDeck: derive fpBaseScenarioId from data.fp_scenarios
- useFpKPIs: accept fpBaseScenarioId instead of isWandeldarlehen boolean
- AssumptionsSlide: find Bear/Base/Bull by name from fpScenarios prop
- FinanzplanSlide: initialize from fpBaseScenarioId, use version scenarios for selector
- FinancialsSlide / ExecutiveSummarySlide: pass fpBaseScenarioId to hook
- types: add FpScenarioRef + fp_scenarios field to PitchData
No UUID hardcoded in any component. Adding a new pitch version only
requires setting the correct fp_scenario IDs in its fm_scenarios snapshot.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
AssumptionsSlide sends ?scenarioId=<uuid> for Bear/Base/Bull cards but
the route was silently dropping it for non-admin requests, making all
three cards return the same default Base Case data. Since fp_ financial
projections are already investor-facing, any valid scenarioId is allowed.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Migrates ACTION_TYPES (26+8 types), _NEGATIVE_PATTERNS (22), _ACTION_SYNONYMS
(65), and _OBJECT_SYNONYMS (75) from hardcoded dicts to DB tables.
- SQL migration: 003_action_object_ontology.sql (3 tables)
- Migration scripts: f2_migrate_actions.py (34 types, 145 synonyms), f3_migrate_objects.py (75 objects)
- OntologyRegistry cache: 5min TTL, raises RuntimeError if empty (safe fallback to dicts)
- control_ontology.classify_action/get_phase delegate to DB with dict fallback
- control_dedup.normalize_action/normalize_object delegate to DB with dict fallback
- 25 new tests, 446 total pass, 0 regressions
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Next: F1 Regulation Registry (DB + API + Frontend + Auto-Create)
Frontend at /sdk/regulation-registry in breakpilot-compliance admin
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Add case-sensitive _SINGLE_NUM_ALLCAPS_RE for "1. INTRODUCTION" style
headers (ENISA, BSI docs). Cannot use _LEGAL_SECTION_RE for this because
it uses re.IGNORECASE which would false-positive on "1. Erstens" etc.
Also re-downloaded 2 corrupt PDFs from nist.gov (nistir_8259a, nist_ai_rmf)
— originals in MinIO were 263-byte XML error responses, not PDFs.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
_SECTION_NUMBER_RE only had patterns for §/Art/Section/Kapitel/Annex
but missed NIST-style identifiers (AC-1, GV.OC-01, 3.1, A01:2021).
This caused 0% section rate for all NIST/BSI/ENISA documents even
though sections were correctly detected — the section NUMBER wasn't
extracted from the header.
Also adds:
- reupload_legal_strategy.py: re-upload with legal chunking
- extract_and_upload_nist.py: local PDF extraction workaround
- qdrant-snapshot.sh: backup mechanism for Qdrant collections
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Complete instructions for next session including:
- Current quality metrics per document type
- Prioritized action items (NIST fix, citation backfill, missing laws)
- Full Block E-G roadmap with details
- All critical files, DB state, test commands
- Known issues (3 lost NIST PDFs, frontend 500s, D5 script safety)
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Session achieved: structural metadata end-to-end (D2-D4), overlap bug
fix, HTML stripping with charset detection, 430/436 docs re-ingested.
Remaining: ~40 EU Official Journal PDFs need HTML from EUR-Lex (broken
multi-column PDF extraction), 3 missing EDPB PDFs, 1 corrupt PDF.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The default was 'pymupdf' which doesn't exist as a backend, causing
fallthrough to pypdf every time. With 'auto', the priority is:
unstructured > pdfplumber > pypdf.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
EU Official Journal PDFs (AI Act, CRA, NIS2, DSGVO, etc.) use
multi-column layouts that pypdf breaks into fragmented words
("Ar tik el" instead of "Artikel"). pdfplumber handles these correctly.
Backend priority: unstructured > pdfplumber > pypdf (auto mode).
Also increases D5 re-ingestion timeout to 3600s for large PDFs.
58 embedding-service tests passing. pdfplumber: MIT license.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Two bugs fixed:
1. Opening block tags (<h3>, <div>) now also create newlines, not just
closing tags. Fixes: gesetze-im-internet.de puts § inside <h3> which
followed inline <a> text — § ended up mid-line, not at line start.
2. HTML charset detection from meta tag (charset=iso-8859-1). Files from
gesetze-im-internet.de use ISO-8859-1, not UTF-8. The § byte (0xA7)
was destroyed by UTF-8 decode. Now: try UTF-8 → check meta charset →
fallback ISO-8859-1.
32 rag-service tests passing.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
HTML files from gesetze-im-internet.de were decoded as raw UTF-8, keeping
<div>/<p> tags intact. The legal chunker regex requires § at line start,
which never matched inside HTML tags → 0% section metadata for HTML docs.
Fix: detect HTML content and strip tags before sending to embedding
service. Block elements become newlines, entities are decoded.
§ signs now appear at line starts → section detection works.
Also adds D5 re-ingestion scripts (reingest_d5.py + config) for
batch re-processing of all documents in Qdrant collections.
27 rag-service tests passing.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- lib/translate.ts: LiteLLM DE<>EN translation utility
- Migration 006: description_de/description_en on both dataroom tables
- Admin + investor upload APIs: accept description+lang, auto-translate the other language on save
- PATCH /api/admin/dataroom/documents/[id]: description path in addition to display_name path
- PATCH /api/dataroom/uploads/[id]: investor can edit their own upload descriptions
- PATCH /api/admin/dataroom/investors/[id]/uploads: admin can edit investor upload descriptions
- All GET queries updated to return description fields
- Admin dataroom: drop zone replaces upload button, multi-file, inline description editor per doc and per investor upload
- Investor dataroom: drop zone, multi-file, description+lang textarea before upload, inline description editing on existing uploads
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
D2: RAG service stores section/section_title/paragraph/paragraph_num/page
from embedding service chunks_with_metadata into Qdrant payloads.
D3: Control generator prefers section > article > section_title from
Qdrant, adds page to source_citation and generation_metadata.
D4: Validated with real BGB §§ 312-312k text. Found and fixed critical
bug where Phase 3 overlap destroyed the [§ ...] section prefix, causing
only the first chunk per document to have metadata. All subsequent
chunks lost section info.
Also fixes pre-existing lint issues (unused imports, ambiguous variable
names, duplicate dict key, bare except).
456 tests passing (58 embedding + 387 pipeline + 11 rag-service).
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Dataroom routes were reading x-investor-id from request headers which
the middleware sets as response headers — these don't reach route handlers
when the admin fallback path runs (NextResponse.next() without header).
Switch to getSessionFromCookie() consistent with all other investor routes.
Auth page DSGVO footer switched from absolute bottom-0 to normal flow
so the expanded Art. 13 notice doesn't overlap the login card.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Docker volume inherits directory ownership from the image on first mount.
Without this, the volume mounts as root and the nextjs (uid 1001) process
gets EACCES when trying to write dataroom uploads.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
72 Stunden → 30 Tage, expand scope to include personal contact data,
add Art. 15–21 rights, LfDI BW supervisory authority. Both DE + EN.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- runDataCleanup() replaces maskOverdueInvestors(): now also anonymizes
never-activated invites after 90 days, deletes sessions + magic links
older than 30 days, NULLs IPs in audit logs older than 30 days, and
redacts email from audit log details JSONB for masked investors
- New /api/admin/cleanup POST endpoint for scheduled invocation
- New .gitea/workflows/pitch-cleanup.yml: daily cron at 02:00 UTC calls
the cleanup endpoint so anonymization is genuinely automatic, not lazy
- Switch masking window from first_activity_at to last_login_at (30 days
of inactivity; resets on each login)
- Both auth pages: DSGVO footer now covers all Art. 13 requirements —
data categories, retention cutoffs, Art. 15–21 rights, contact address,
LfDI Baden-Württemberg as supervisory authority
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Sends 67k review candidates to Haiku Batch API in pairs.
Each pair gets a DUPLIKAT/VERSCHIEDEN decision with reasoning.
Results stored in control_dedup_reviews.review_status.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Dedup is done (162k controls). Re-enable healthcheck with generous
timeouts (10 retries × 30s) and restart: unless-stopped.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- New POST /api/admin/investors/[id]/generate-link endpoint: creates a
magic link without sending email, returns the URL for the admin to
copy and share manually (for when email is filtered)
- Adds 'Copy Link' button (emerald) to investor list and detail pages;
link is copied to clipboard on click
- New lib/masking.ts: maskOverdueInvestors() UPDATE that anonymizes
email/name/company → revokes sessions 72h after first investor login
- first_activity_at recorded on first verify (COALESCE, set once only)
- migration 004 adds first_activity_at + data_masked_at columns with
partial index; also wired into /api/admin/migrate for one-shot apply
- Admin UI shows 'anonymized' badge, expiry countdown, and masked state;
Copy Link + Resend are disabled for anonymized investors
- verify route returns 410 if data_masked_at is set (belt-and-suspenders
alongside the revoked status check)
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- tsconfig.json: add mcp-server to exclude list so the standalone MCP
package's imports don't break the Next.js type-check build
- FinanzplanSlide.tsx: resolve merge conflict, keep MonthlyGrid refactor
from upstream (discards superseded inline table from stash)
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
The previously committed version was missing useIsLight hook, all sub-components
(PillarRow, ColHeader, CentralHub, BridgeConnectors, FeatureCard, DetailModal,
StarField, ticker components) and their data/types. Only the main component
shell was present, causing a CI build failure on type-check.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Scan endpoint needs up to 3-5 min (multi-page crawl + LLM calls).
Without explicit timeout, nginx defaults to 60s → 504 Gateway Timeout.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Stores last_control_id in canonical_generation_jobs after each page.
On restart, resumes from checkpoint instead of starting over.
Checkpoint is deleted on completion.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Port 3007 (admin-compliance) had no limit (nginx default 1M) causing
413 on SDK state saves. Port 8093 (SDK) had 10M, now 50M.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Paginated DB queries (100 rows/page) instead of loading all 166k rows
- Individual timeout (30s) per embedding + qdrant call
- Per-control try/except — one failure doesn't kill the job
- Sequential processing (no asyncio.gather) for stability
- Progress logging every 500 controls
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The dedup job blocks the event loop for extended periods, causing
health checks to fail repeatedly. Even 10 retries × 30s wasn't enough.
Disabled healthcheck and restart policy until dedup is complete.
TEMPORARY — re-enable after dedup is finished.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Dedup Phase 2 blocks the event loop for extended periods, causing
health checks to fail. Docker then restarts the container and kills
the job. Increased retries from 3 to 10, timeout from 10s to 30s.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
153k of 160k merge groups have only 1 control — no intra-group
dedup possible. Skip them in Phase 1, they become masters automatically.
Phase 2 (cross-group) still checks them via Qdrant embeddings.
Reduces Phase 1 from ~96h to ~2h.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Prevents duplicate batch submissions that caused ~$170 in extra costs.
Refuses new submit if a batch was submitted in the last 10 minutes.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Remove applicability/scanner_hint/evidence_type/provides_context from
Pass 0b prompt to reduce output tokens (~40% less). These 6 fields are
added via cheap Haiku backfill afterwards (~$1.50 per 10k controls).
Saves ~$200 over the remaining 160k obligations.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>