Two bugs fixed:
1. Opening block tags (<h3>, <div>) now also create newlines, not just
closing tags. gesetze-im-internet.de puts the § inside an <h3> that
follows inline <a> text, so the § ended up mid-line instead of at line
start.
2. HTML charset is now detected from the meta tag (charset=iso-8859-1).
Files from gesetze-im-internet.de use ISO-8859-1, not UTF-8, so the §
byte (0xA7) was destroyed by the UTF-8 decode. New order: try UTF-8 →
check meta charset → fall back to ISO-8859-1.
32 rag-service tests passing.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
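The decode order from the charset fix above (UTF-8, then meta charset, then ISO-8859-1) could look roughly like the sketch below. The helper name decode_html_bytes, the regex, and the 4096-byte search window are assumptions for illustration, not the actual rag-service code.

```python
import re

# Matches charset declarations such as <meta charset="iso-8859-1"> or
# <meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1">.
META_CHARSET_RE = re.compile(
    rb"charset\s*=\s*[\"']?\s*([A-Za-z0-9_.:-]+)", re.IGNORECASE
)


def decode_html_bytes(raw: bytes) -> str:
    """Hypothetical helper: decode HTML bytes using the fallback chain."""
    # Step 1: strict UTF-8. A lone 0xA7 byte is invalid UTF-8 and raises here.
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        pass

    # Step 2: charset declared in a meta tag near the top of the file
    # (the 4096-byte window is an assumption).
    match = META_CHARSET_RE.search(raw[:4096])
    if match:
        try:
            return raw.decode(match.group(1).decode("ascii"))
        except (LookupError, UnicodeDecodeError):
            pass  # unknown or wrong charset label, fall through

    # Step 3: ISO-8859-1 maps every byte to a code point, so it cannot fail.
    return raw.decode("iso-8859-1")
```

Trying strict UTF-8 first keeps already-correct files untouched. The final ISO-8859-1 step never raises, so the function always returns text.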
HTML files from gesetze-im-internet.de were decoded as raw UTF-8, keeping
<div>/<p> tags intact. The legal chunker regex requires § at line start,
which never matched inside HTML tags → 0% section metadata for HTML docs.
Fix: detect HTML content and strip tags before sending to embedding
service. Block elements become newlines, entities are decoded.
§ signs now appear at line starts → section detection works.
Also adds D5 re-ingestion scripts (reingest_d5.py + config) for
batch re-processing of all documents in Qdrant collections.
27 rag-service tests passing.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
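The tag stripping described above (block elements become newlines, entities are decoded) can be sketched with the standard library's html.parser. The class and function names and the BLOCK_TAGS set are assumptions. The sketch already treats opening block tags like closing ones, and it does not filter script or style content.

```python
from html.parser import HTMLParser

# Assumed set of block-level tags; the real list may differ.
BLOCK_TAGS = {"div", "p", "h1", "h2", "h3", "h4", "h5", "h6", "li", "tr", "br"}


class _TextExtractor(HTMLParser):
    def __init__(self) -> None:
        # convert_charrefs=True decodes entities such as &sect; and &amp;
        super().__init__(convert_charrefs=True)
        self.parts: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag in BLOCK_TAGS:
            self.parts.append("\n")  # opening block tag starts a new line

    def handle_endtag(self, tag):
        if tag in BLOCK_TAGS:
            self.parts.append("\n")  # closing block tag ends the line

    def handle_data(self, data):
        self.parts.append(data)


def html_to_text(html: str) -> str:
    """Hypothetical helper: strip tags, one line per block element."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()  # flush any buffered text
    lines = (line.strip() for line in "".join(parser.parts).splitlines())
    return "\n".join(line for line in lines if line)
```

With this shape, a § inside an <h3> that follows inline <a> text lands at the start of its own line, which is what a line-anchored section regex needs.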
- lib/translate.ts: LiteLLM DE<>EN translation utility
- Migration 006: description_de/description_en on both dataroom tables
- Admin + investor upload APIs: accept description+lang, auto-translate the other language on save
- PATCH /api/admin/dataroom/documents/[id]: description path in addition to display_name path
- PATCH /api/dataroom/uploads/[id]: investor can edit their own upload descriptions
- PATCH /api/admin/dataroom/investors/[id]/uploads: admin can edit investor upload descriptions
- All GET queries updated to return description fields
- Admin dataroom: drop zone replaces upload button, multi-file, inline description editor per doc and per investor upload
- Investor dataroom: drop zone, multi-file, description+lang textarea before upload, inline description editing on existing uploads
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
D2: RAG service stores section/section_title/paragraph/paragraph_num/page
from embedding service chunks_with_metadata into Qdrant payloads.
D3: Control generator prefers section > article > section_title from
Qdrant, adds page to source_citation and generation_metadata.
D4: Validated with real BGB §§ 312-312k text. Found and fixed critical
bug where Phase 3 overlap destroyed the [§ ...] section prefix, causing
only the first chunk per document to have metadata. All subsequent
chunks lost section info.
Also fixes pre-existing lint issues (unused imports, ambiguous variable
names, duplicate dict key, bare except).
456 tests passing (58 embedding + 387 pipeline + 11 rag-service).
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
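One way to keep the [§ ...] prefix intact while adding overlap (the D4 bug above) is to split the prefix off, prepend the overlap to the body only, and reattach the prefix. This is a sketch under assumptions: the function name, the prefix regex, and the character-based overlap are hypothetical and may not match the embedding service's Phase 3.

```python
import re

# Assumed prefix shape: "[§ 312a] " at the very start of a chunk.
SECTION_PREFIX_RE = re.compile(r"^\[§[^\]]*\]\s*")


def add_overlap(chunks: list[str], overlap_chars: int = 50) -> list[str]:
    """Prepend the tail of the previous chunk body without moving the prefix."""
    result = []
    prev_body = ""
    for chunk in chunks:
        match = SECTION_PREFIX_RE.match(chunk)
        prefix = match.group(0) if match else ""
        body = chunk[len(prefix):]
        # Guard against overlap_chars == 0, where [-0:] would copy everything.
        tail = prev_body[-overlap_chars:] if prev_body and overlap_chars > 0 else ""
        overlap = tail + " " if tail else ""
        # The prefix stays first, so every chunk keeps its section metadata.
        result.append(prefix + overlap + body)
        prev_body = body
    return result
```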
Dataroom routes were reading x-investor-id from the request headers, but
the middleware sets it on the response headers, which don't reach route
handlers when the admin fallback path runs (NextResponse.next() without
header).
Switch to getSessionFromCookie() consistent with all other investor routes.
Auth page DSGVO footer switched from absolute bottom-0 to normal flow
so the expanded Art. 13 notice doesn't overlap the login card.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Docker volume inherits directory ownership from the image on first mount.
Without this, the volume mounts as root and the nextjs (uid 1001) process
gets EACCES when trying to write dataroom uploads.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
72 hours → 30 days, expand scope to include personal contact data,
add Art. 15–21 rights, LfDI BW supervisory authority. Both DE + EN.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- runDataCleanup() replaces maskOverdueInvestors(): now also anonymizes
never-activated invites after 90 days, deletes sessions + magic links
older than 30 days, NULLs IPs in audit logs older than 30 days, and
redacts email from audit log details JSONB for masked investors
- New /api/admin/cleanup POST endpoint for scheduled invocation
- New .gitea/workflows/pitch-cleanup.yml: daily cron at 02:00 UTC calls
the cleanup endpoint so anonymization is genuinely automatic, not lazy
- Switch masking window from first_activity_at to last_login_at (30 days
of inactivity; resets on each login)
- Both auth pages: DSGVO footer now covers all Art. 13 requirements —
data categories, retention cutoffs, Art. 15–21 rights, contact address,
LfDI Baden-Württemberg as supervisory authority
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Sends 67k review candidates to Haiku Batch API in pairs.
Each pair gets a DUPLIKAT/VERSCHIEDEN decision with reasoning.
Results stored in control_dedup_reviews.review_status.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Dedup is done (162k controls). Re-enable healthcheck with generous
timeouts (10 retries × 30s) and restart: unless-stopped.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- New POST /api/admin/investors/[id]/generate-link endpoint: creates a
magic link without sending email, returns the URL for the admin to
copy and share manually (for when email is filtered)
- Adds 'Copy Link' button (emerald) to investor list and detail pages;
link is copied to clipboard on click
- New lib/masking.ts: maskOverdueInvestors() runs an UPDATE that
anonymizes email/name/company and revokes sessions 72h after the first
investor login
- first_activity_at recorded on first verify (COALESCE, set once only)
- migration 004 adds first_activity_at + data_masked_at columns with
partial index; also wired into /api/admin/migrate for one-shot apply
- Admin UI shows 'anonymized' badge, expiry countdown, and masked state;
Copy Link + Resend are disabled for anonymized investors
- verify route returns 410 if data_masked_at is set (belt-and-suspenders
alongside the revoked status check)
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
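The set-once behaviour of first_activity_at (COALESCE, recorded on first verify) can be illustrated with SQLite from Python. The table name investors, the id column, and the helper name are assumptions. The real code is TypeScript against a different database, so this only demonstrates the SQL pattern.

```python
import sqlite3


def record_first_activity(conn: sqlite3.Connection, investor_id: int, now_iso: str) -> None:
    """Hypothetical helper: set first_activity_at only if it is still NULL."""
    conn.execute(
        # COALESCE keeps the existing value and only falls back to the new
        # timestamp when the column is NULL, so later verifies are no-ops.
        "UPDATE investors "
        "SET first_activity_at = COALESCE(first_activity_at, ?) "
        "WHERE id = ?",
        (now_iso, investor_id),
    )
    conn.commit()
```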
- tsconfig.json: add mcp-server to exclude list so the standalone MCP
package's imports don't break the Next.js type-check build
- FinanzplanSlide.tsx: resolve merge conflict, keep MonthlyGrid refactor
from upstream (discards superseded inline table from stash)
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
The previously committed version was missing the useIsLight hook, all sub-components
(PillarRow, ColHeader, CentralHub, BridgeConnectors, FeatureCard, DetailModal,
StarField, ticker components) and their data/types. Only the main component
shell was present, causing a CI build failure on type-check.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Scan endpoint can take 3-5 min (multi-page crawl + LLM calls).
Without explicit timeout, nginx defaults to 60s → 504 Gateway Timeout.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Stores last_control_id in canonical_generation_jobs after each page.
On restart, resumes from checkpoint instead of starting over.
Checkpoint is deleted on completion.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
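The checkpoint/resume flow above can be sketched as follows. A plain dict stands in for the canonical_generation_jobs table, and run_job, process_page, and the sorted-ID assumption are hypothetical names and choices for illustration.

```python
def run_job(job_id, all_ids, checkpoints, process_page, page_size=100):
    """Process IDs page by page, resuming after the stored last_control_id.

    all_ids must be sorted ascending; checkpoints is a dict standing in
    for the job table (assumption).
    """
    last_id = checkpoints.get(job_id)  # None on a fresh start
    remaining = [i for i in all_ids if last_id is None or i > last_id]
    for start in range(0, len(remaining), page_size):
        page = remaining[start:start + page_size]
        process_page(page)
        # Store last_control_id after each page so a restart skips it.
        checkpoints[job_id] = page[-1]
    # Checkpoint is deleted on completion.
    checkpoints.pop(job_id, None)
```

Because the checkpoint is written only after a page finishes, a crash mid-page causes that page to be reprocessed, so process_page should be idempotent.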
Port 3007 (admin-compliance) had no explicit body size limit, so the
nginx default of 1M applied and SDK state saves failed with 413.
Port 8093 (SDK) had 10M, now 50M.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Paginated DB queries (100 rows/page) instead of loading all 166k rows
- Individual timeout (30s) per embedding + qdrant call
- Per-control try/except — one failure doesn't kill the job
- Sequential processing (no asyncio.gather) for stability
- Progress logging every 500 controls
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
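The five points above combine into a loop like the one sketched below. The callables fetch_page, embed, and search are hypothetical stand-ins for the DB query, the embedding call, and the Qdrant call, and the return value is an assumption added for testing.

```python
import asyncio
import logging

logger = logging.getLogger("dedup")


async def process_all(fetch_page, embed, search,
                      page_size=100, timeout_s=30.0, log_every=500):
    """Sequentially process all controls; returns (processed, failed)."""
    processed = failed = 0
    offset = 0
    while True:
        # Paginated query instead of loading every row at once.
        page = await fetch_page(offset, page_size)
        if not page:
            break
        for control in page:  # sequential on purpose, no asyncio.gather
            try:
                # Individual timeout per embedding call and per search call.
                vector = await asyncio.wait_for(embed(control), timeout=timeout_s)
                await asyncio.wait_for(search(vector), timeout=timeout_s)
            except Exception:  # includes timeouts; one failure doesn't kill the job
                failed += 1
                logger.exception("control %r failed", control)
            processed += 1
            if processed % log_every == 0:
                logger.info("processed %d controls (%d failed)", processed, failed)
        offset += len(page)
    return processed, failed
```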
The dedup job blocks the event loop for extended periods, causing
health checks to fail repeatedly. Even 10 retries × 30s wasn't enough.
Disabled healthcheck and restart policy until dedup is complete.
TEMPORARY — re-enable after dedup is finished.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Dedup Phase 2 blocks the event loop for extended periods, causing
health checks to fail. Docker then restarts the container and kills
the job. Increased retries from 3 to 10, timeout from 10s to 30s.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
153k of 160k merge groups contain only one control, so no intra-group
dedup is possible. Phase 1 now skips them; they become masters
automatically.
Phase 2 (cross-group) still checks them via Qdrant embeddings.
Reduces Phase 1 from ~96h to ~2h.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
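The Phase 1 shortcut above amounts to partitioning merge groups by size. A minimal sketch, assuming controls arrive as (control_id, merge_key) pairs; the function name and data shapes are hypothetical.

```python
from collections import defaultdict


def split_merge_groups(controls):
    """Split controls into automatic masters and groups needing comparison.

    controls: iterable of (control_id, merge_key) pairs (assumed shape).
    """
    groups = defaultdict(list)
    for control_id, merge_key in controls:
        groups[merge_key].append(control_id)

    # Single-control groups have nothing to compare against inside the
    # group, so the lone control becomes the master directly.
    masters = [ids[0] for ids in groups.values() if len(ids) == 1]
    # Only multi-control groups go through intra-group dedup.
    to_compare = {key: ids for key, ids in groups.items() if len(ids) > 1}
    return masters, to_compare
```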
Prevents duplicate batch submissions that caused ~$170 in extra costs.
Refuses new submit if a batch was submitted in the last 10 minutes.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
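The 10-minute submit guard above reduces to a timestamp comparison. A minimal sketch, where can_submit and the way the last submission time is obtained are assumptions:

```python
from datetime import datetime, timedelta, timezone

COOLDOWN = timedelta(minutes=10)


def can_submit(last_submitted_at, now=None):
    """Return False if a batch was submitted within the last 10 minutes.

    last_submitted_at: timezone-aware datetime of the most recent batch
    submission, or None if there has been none (assumed input).
    """
    if now is None:
        now = datetime.now(timezone.utc)
    return last_submitted_at is None or now - last_submitted_at >= COOLDOWN
```

The check is only as strong as the stored timestamp. If two requests can read it concurrently, the lookup and the submit need to happen under the same lock or transaction.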
Remove applicability/scanner_hint/evidence_type/provides_context from
Pass 0b prompt to reduce output tokens (~40% less). These 6 fields are
added via cheap Haiku backfill afterwards (~$1.50 per 10k controls).
Saves ~$200 over the remaining 160k obligations.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Prompt v4 adds 6 new fields to Pass 0b output:
- applicability: condition rules (same format as dependency engine)
- check_type: expanded to 10 granular types
- scanner_hint: search_terms + negative_indicators for MCP
- manual_review_required_if: escalation conditions
- evidence_type: code/process/hybrid
- provides_context: context variables this control creates
New endpoint POST /generate/backfill-extended:
- Backfills existing 9k controls via Haiku Batch API (~$1.50)
- Adds all 6 new fields to generation_metadata
- Supports dry_run mode
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
LLM merge_key phases (e.g. "submission") don't always match PHASE_ORDER
keys. Derive phase order from action_type via get_phase_order() instead.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Fix truncated title detection: only flag near-200-char titles or mid-word cutoffs
- Fix evidence leak detection: check title start patterns, not keyword substring
("nachweisen" verb is valid action, "Nachweis vorliegen" is evidence)
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- GET /generate/batch-api-status/{batch_id} — check Anthropic batch status
- POST /generate/process-batch — process completed batch results (background)
- GET /generate/process-batch-status/{job_id} — poll processing progress
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Add assertion, pass_criteria, fail_criteria, check_type to AtomicControlCandidate dataclass
- Parse MCP fields from LLM output in _process_pass0b_control
- Store MCP fields in generation_metadata JSON for later use by MCP scanner
- Fields default to empty when not present (backward-compatible with old prompts)
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
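The backward-compatible defaults above can be sketched with a reduced stand-in for AtomicControlCandidate. The field types (str and list of str), the title field, and the helper candidate_from_llm are assumptions; only the four MCP field names come from the commit.

```python
from dataclasses import dataclass, field


@dataclass
class AtomicControlCandidate:
    """Reduced stand-in; the real dataclass has more fields."""
    title: str = ""
    # MCP fields default to empty so output from old prompts still parses.
    assertion: str = ""
    pass_criteria: list[str] = field(default_factory=list)
    fail_criteria: list[str] = field(default_factory=list)
    check_type: str = ""


def candidate_from_llm(parsed: dict) -> AtomicControlCandidate:
    """Hypothetical parser: missing or null MCP fields fall back to empty."""
    return AtomicControlCandidate(
        title=parsed.get("title") or "",
        assertion=parsed.get("assertion") or "",
        pass_criteria=list(parsed.get("pass_criteria") or []),
        fail_criteria=list(parsed.get("fail_criteria") or []),
        check_type=parsed.get("check_type") or "",
    )
```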
Obligations classified before API call:
- evidence → skipped (saves API cost)
- composite → skipped (not atomic)
- framework_container → skipped (decompose separately)
- atomic → sent to LLM
Filter stats returned in submit response.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
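The pre-filter above can be sketched as a partition plus a counter. It assumes each obligation is a dict that already carries its classification under a "kind" key; that shape and the function name are hypothetical.

```python
from collections import Counter


def prefilter(obligations):
    """Return (obligations to send to the LLM, per-class filter stats)."""
    # Count every class, including the skipped ones, for the submit response.
    stats = Counter(o["kind"] for o in obligations)
    # Only atomic obligations are sent; evidence, composite and
    # framework_container entries are skipped before any API cost is incurred.
    to_send = [o for o in obligations if o["kind"] == "atomic"]
    return to_send, dict(stats)
```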
Fixes from v2 evaluation (7.9/10 avg, 28 controls):
1. COMPOUND BAN: "durchführen UND Maßnahmen ergreifen" → pick primary action only
2. EVIDENCE-OF-ACTION: "Tests dokumentieren" → evidence field, not own control
3. PFLICHT=PROZESS: "Behörden informieren" + "Verfahren etablieren" = 1 control
4. MERGE-KEY BUG: merge_key from LLM output now stored in generation_metadata
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Key changes to system prompt:
- Evidence/documentation belongs in evidence field, NOT as separate control
- SBOM = 1 control (not "maintain" + "document" separately)
- Security lifecycle phases (identify/assess/remediate/monitor) = separate controls
- Same object + same action + same actor = 1 control (merge, not split)
- Titles must contain the ACTION, not just the subject
WRONG: "Vertraulichkeit Mitarbeiter"
RIGHT: "Mitarbeiter zur Vertraulichkeit verpflichten"
Titles serve as MCP search queries against customer documents/code.
Bad titles = bad search results = unusable product.
All 52,566 old pass0b controls deprecated (not deleted) for full regeneration.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Prevents UniqueViolation from blocking entire batch. Each result
is committed individually, errors are rolled back without affecting
subsequent results.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
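The per-result commit pattern above is sketched below with sqlite3 so that it runs without a server. sqlite3.IntegrityError stands in for the UniqueViolation named in the commit, and the controls table and function name are assumptions.

```python
import sqlite3


def store_results(conn: sqlite3.Connection, results):
    """Insert each (control_id, title) in its own transaction.

    Returns (stored, skipped). A duplicate key rolls back only that one
    result instead of aborting the whole batch.
    """
    stored = skipped = 0
    for control_id, title in results:
        try:
            conn.execute(
                "INSERT INTO controls (control_id, title) VALUES (?, ?)",
                (control_id, title),
            )
            conn.commit()  # commit this result individually
            stored += 1
        except sqlite3.IntegrityError:
            conn.rollback()  # discard only the failed insert
            skipped += 1
    return stored, skipped
```

Committing per row trades throughput for isolation, which is acceptable when a single bad row would otherwise discard an entire batch of paid API results.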