Commit Graph

617 Commits

Author SHA1 Message Date
Sharang Parnerkar 30a9165497 feat(pitch): showcase mode — per-investor toggle hides financial/investor slides for customer demos
Build pitch-deck / build-push-deploy (push) Successful in 1m35s
CI / go-lint (push) Has been skipped
CI / python-lint (push) Has been skipped
CI / nodejs-lint (push) Has been skipped
CI / test-go-consent (push) Successful in 39s
CI / test-python-voice (push) Successful in 32s
CI / test-bqas (push) Successful in 30s
Adds is_showcase boolean to pitch_investors; when set, filters out financials,
the ask, cap table, assumptions, finanzplan, risks, and intro-presenter slides.
Slide navigation is fully dynamic — progress bar and counts update accordingly.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-04 22:41:15 +02:00
Sharang Parnerkar f2184be02f fix: tab row counts use investor's scenario, not always Base Case
Build pitch-deck / build-push-deploy (push) Successful in 1m34s
CI / go-lint (push) Has been skipped
CI / python-lint (push) Has been skipped
CI / nodejs-lint (push) Has been skipped
CI / test-go-consent (push) Successful in 36s
CI / test-python-voice (push) Successful in 35s
CI / test-bqas (push) Successful in 32s
/api/finanzplan now accepts ?scenarioId and uses it for the per-sheet
row counts (the numbers in brackets on the tab bar). FinanzplanSlide
passes fpBaseScenarioId when fetching the sheet list, so Wandeldarlehen
investors see e.g. Personalkosten (9) instead of (35).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-04 15:21:40 +02:00
Sharang Parnerkar 06014d57b3 fix: derive fp_scenario IDs from version snapshot, eliminate hardcoded UUIDs
Build pitch-deck / build-push-deploy (push) Successful in 1m30s
CI / go-lint (push) Has been skipped
CI / python-lint (push) Has been skipped
CI / nodejs-lint (push) Has been skipped
CI / test-go-consent (push) Successful in 33s
CI / test-python-voice (push) Successful in 32s
CI / test-bqas (push) Successful in 31s
The fm_scenarios array in each pitch version snapshot already stores the
fp_scenario IDs directly (same pattern 1 Mio used). Wandeldarlehen snapshots
were missing Bear/Bull entries — updated in DB to add them.

- /api/data: include fp_scenarios in version response (was omitted)
- PitchDeck: derive fpBaseScenarioId from data.fp_scenarios
- useFpKPIs: accept fpBaseScenarioId instead of isWandeldarlehen boolean
- AssumptionsSlide: find Bear/Base/Bull by name from fpScenarios prop
- FinanzplanSlide: initialize from fpBaseScenarioId, use version scenarios for selector
- FinancialsSlide / ExecutiveSummarySlide: pass fpBaseScenarioId to hook
- types: add FpScenarioRef + fp_scenarios field to PitchData

No UUID hardcoded in any component. Adding a new pitch version only
requires setting the correct fp_scenario IDs in its fm_scenarios snapshot.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-04 15:00:06 +02:00
Sharang Parnerkar 6c022d1a79 fix: allow investors to query fp_ scenarios by scenarioId
Build pitch-deck / build-push-deploy (push) Successful in 1m55s
CI / go-lint (push) Has been skipped
CI / python-lint (push) Has been skipped
CI / nodejs-lint (push) Has been skipped
CI / test-go-consent (push) Successful in 40s
CI / test-python-voice (push) Successful in 37s
CI / test-bqas (push) Successful in 34s
AssumptionsSlide sends ?scenarioId=<uuid> for Bear/Base/Bull cards but
the route was silently dropping it for non-admin requests, making all
three cards return the same default Base Case data. Since fp_ financial
projections are already investor-facing, any valid scenarioId is allowed.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-04 14:27:14 +02:00
Benjamin Admin 652e3a65a3 feat(pipeline): F2+F3 action/object ontology — DB-backed normalization
CI / go-lint (push) Has been skipped
CI / python-lint (push) Has been skipped
CI / nodejs-lint (push) Has been skipped
CI / test-go-consent (push) Successful in 36s
CI / test-python-voice (push) Successful in 33s
CI / test-bqas (push) Successful in 31s
Migrates ACTION_TYPES (26+8 types), _NEGATIVE_PATTERNS (22), _ACTION_SYNONYMS
(65), and _OBJECT_SYNONYMS (75) from hardcoded dicts to DB tables.

- SQL migration: 003_action_object_ontology.sql (3 tables)
- Migration scripts: f2_migrate_actions.py (34 types, 145 synonyms), f3_migrate_objects.py (75 objects)
- OntologyRegistry cache: 5min TTL, raises RuntimeError if empty (safe fallback to dicts)
- control_ontology.classify_action/get_phase delegate to DB with dict fallback
- control_dedup.normalize_action/normalize_object delegate to DB with dict fallback
- 25 new tests, 446 total pass, 0 regressions

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-03 23:47:53 +02:00
Benjamin Admin aab8eeb335 Merge branch 'main' of ssh://gitea.meghsakha.com:22222/Benjamin_Boenisch/breakpilot-core
CI / go-lint (push) Has been skipped
CI / python-lint (push) Has been skipped
CI / nodejs-lint (push) Has been skipped
CI / test-go-consent (push) Successful in 47s
CI / test-python-voice (push) Successful in 38s
CI / test-bqas (push) Successful in 33s
2026-05-03 23:14:34 +02:00
Benjamin Admin 9437e029d0 feat(pipeline): F1 regulation registry — DB-backed license/source-type lookup
Migrates REGULATION_LICENSE_MAP (135 entries) and SOURCE_REGULATION_CLASSIFICATION
(58 entries) from hardcoded Python dicts to compliance.regulation_registry table.

- SQL migration: 002_regulation_registry.sql (table + indexes + trigger)
- Migration script: f1_migrate_regulation_registry.py (162 rows, --dry-run)
- RegulationRegistry cache: 5min TTL, prefix fallback, graceful degradation
- control_generator._classify_regulation() delegates to DB with dict fallback
- source_type_classification.classify_source_regulation() delegates to DB
- 34 new tests (lookup, cache, degradation, migration data consistency)
- 421 total tests pass, 0 regressions

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-03 23:14:06 +02:00
Benjamin Admin 4fd2bfefcd docs: session handover updated for Block F start
Next: F1 Regulation Registry (DB + API + Frontend + Auto-Create)
Frontend at /sdk/regulation-registry in breakpilot-compliance admin

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-03 22:51:23 +02:00
Benjamin Admin fac9280716 feat(pipeline): Block D5+-E complete session — 20k+ new chunks
Session 02-03.05.2026 accomplishments:
- D5+: NIST/ENISA PDF quality fix (0%→45% section rate)
- D5+: 4 lost NIST PDFs restored (11k chunks)
- D5+: Text normalization + section detection for NIST/BSI
- D6: Citation backfill (3,651 controls updated, old archived)
- E2: 8 DE laws ingested (ArbZG, MuSchG, GmbHG, AktG, InsO...)
- E3: 5 EU regulations (CSRD, CSDDD, Taxonomy, eIDAS, Pay Trans.)
- E4: Standards (GoBD, BAIT, VAIT)
- E6: 3 CH + 4 AT laws (OR, DSV, ArG, ArbVG, AngG, AZG, NISG)
- E7: 9 court judgments as full text (Schrems II 154 chunks,
  Meta 101, BVerfG 161, DSK OH 119, Planet49 42, SCHUFA 41,
  Schadenersatz 29, BAG 48, Google Fonts 14)
- Infra: Qdrant snapshot mechanism, upload-before-delete safety

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-03 22:31:57 +02:00
Benjamin Admin 118be3540d feat(pipeline): D6 citation backfill + E2/E3 law ingestion scripts
- d6_citation_backfill.py: 3-tier matching (hash/prefix/overlap),
  archives old citations, updated 3.651 controls (93.6% coverage)
- ingest_de_laws.py: 8 German laws ingested (ArbZG, MuSchG, NachwG,
  MiLoG, GmbHG, AktG, InsO, BUrlG — 1.629 chunks)
- ingest_eu_regulations.py: EUR-Lex ingestion (needs manual HTML due
  to AWS WAF). CSRD, CSDDD, EU Taxonomy, eIDAS 2.0, Pay Transparency
  manually ingested (1.057 chunks)
- Updated session handover with current state

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-03 13:19:27 +02:00
Benjamin Admin a9671a572b fix(embedding): single-number ALL-CAPS section detection for ENISA/BSI
Add case-sensitive _SINGLE_NUM_ALLCAPS_RE for "1. INTRODUCTION" style
headers (ENISA, BSI docs). Cannot use _LEGAL_SECTION_RE for this because
it uses re.IGNORECASE which would false-positive on "1. Erstens" etc.

Also re-downloaded 2 corrupt PDFs from nist.gov (nistir_8259a, nist_ai_rmf)
— originals in MinIO were 263-byte XML error responses, not PDFs.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-03 08:56:02 +02:00
Benjamin Admin 2f4a3f2ea2 fix(embedding): add NIST control IDs to _SECTION_NUMBER_RE
_SECTION_NUMBER_RE only had patterns for §/Art/Section/Kapitel/Annex
but missed NIST-style identifiers (AC-1, GV.OC-01, 3.1, A01:2021).
This caused 0% section rate for all NIST/BSI/ENISA documents even
though sections were correctly detected — the section NUMBER wasn't
extracted from the header.

Also adds:
- reupload_legal_strategy.py: re-upload with legal chunking
- extract_and_upload_nist.py: local PDF extraction workaround
- qdrant-snapshot.sh: backup mechanism for Qdrant collections

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-03 07:42:06 +02:00
Benjamin Admin 0b0eed27b0 feat(embedding): NIST PDF text normalization + safe re-ingest script
Fix broken multi-column PDF extraction for NIST/BSI/ENISA documents:
- _normalize_pdf_text(): fixes broken section numbers (1 . 1 → 1.1),
  control IDs (AC - 1 → AC-1), ligatures, soft hyphens
- pdfplumber tolerances increased (x=3,y=4) for better column handling
- 3 new regex patterns: NIST CSF 2.0, NIST enhancements, OWASP Top 10
- reingest_nist.py: safe upload-before-delete for 4 lost NIST PDFs
- reingest_d5.py: safety fix — upload first, verify, then delete old

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-03 06:42:46 +02:00
Benjamin Admin 97a7f6f264 docs: comprehensive session handover with full roadmap (Blocks A-G)
Complete instructions for next session including:
- Current quality metrics per document type
- Prioritized action items (NIST fix, citation backfill, missing laws)
- Full Block E-G roadmap with details
- All critical files, DB state, test commands
- Known issues (3 lost NIST PDFs, frontend 500s, D5 script safety)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-02 22:30:50 +02:00
Benjamin Admin ff21bc258a docs: session handover — D2-D5 complete, quality report, NIST plan
Major session achievements:
- Structural metadata end-to-end (D2-D4)
- 430 docs re-ingested with new chunking
- HTML stripping + charset detection (0% → 97.6%)
- 20 EU regulations from EUR-Lex HTML (DSGVO: 0% → 92%)
- Quality report script (500 controls: 13% fully correct)
- Frontend requirements.map fix

Open: NIST/ENISA text normalization, citation backfill,
D5 script safety (upload-before-delete), BEG IV ingestion.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-02 22:22:55 +02:00
Benjamin Admin 3009f3d13a feat(embedding): add NIST/ENISA/standard section numbering to chunker
Extends _LEGAL_SECTION_RE to detect:
- Numbered sections: 1.1 Title, 2.3.1 Subtitle
- Control family IDs: AC-1, AU-2, PO.1, PW.1.1
- Table/Figure/Appendix references
Also adds EUR-Lex HTML replacement script.

58 embedding-service tests passing.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-02 19:24:10 +02:00
Benjamin Admin 5a6e588641 docs: update session handover — D2-D5 complete, EU PDF issue documented
Session achieved: structural metadata end-to-end (D2-D4), overlap bug
fix, HTML stripping with charset detection, 430/436 docs re-ingested.

Remaining: ~40 EU Official Journal PDFs need HTML from EUR-Lex (broken
multi-column PDF extraction), 3 missing EDPB PDFs, 1 corrupt PDF.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-02 17:34:34 +02:00
Benjamin Admin 41183ff93d fix(docker): set PDF_EXTRACTION_BACKEND to auto (was pymupdf)
The default was 'pymupdf' which doesn't exist as a backend, causing
fallthrough to pypdf every time. With 'auto', the priority is:
unstructured > pdfplumber > pypdf.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-02 17:30:33 +02:00
Benjamin Admin 75dda9ac92 feat(embedding): add pdfplumber backend for multi-column PDF extraction
EU Official Journal PDFs (AI Act, CRA, NIS2, DSGVO, etc.) use
multi-column layouts that pypdf breaks into fragmented words
("Ar tik el" instead of "Artikel"). pdfplumber handles these correctly.

Backend priority: unstructured > pdfplumber > pypdf (auto mode).
Also increases D5 re-ingestion timeout to 3600s for large PDFs.

58 embedding-service tests passing. pdfplumber: MIT license.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-02 15:42:25 +02:00
Benjamin Admin a459636bc4 fix(rag): HTML charset detection + opening block tag newlines
Two bugs fixed:
1. Opening block tags (<h3>, <div>) now also create newlines, not just
   closing tags. Fixes: gesetze-im-internet.de puts § inside <h3> which
   followed inline <a> text — § ended up mid-line, not at line start.

2. HTML charset detection from meta tag (charset=iso-8859-1). Files from
   gesetze-im-internet.de use ISO-8859-1, not UTF-8. The § byte (0xA7)
   was destroyed by UTF-8 decode. Now: try UTF-8 → check meta charset →
   fallback ISO-8859-1.

32 rag-service tests passing.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-02 08:35:47 +02:00
Benjamin Admin ddad58f607 fix(rag): strip HTML tags before chunking + D5 re-ingestion scripts
HTML files from gesetze-im-internet.de were decoded as raw UTF-8, keeping
<div>/<p> tags intact. The legal chunker regex requires § at line start,
which never matched inside HTML tags → 0% section metadata for HTML docs.

Fix: detect HTML content and strip tags before sending to embedding
service. Block elements become newlines, entities are decoded.
§ signs now appear at line starts → section detection works.

Also adds D5 re-ingestion scripts (reingest_d5.py + config) for
batch re-processing of all documents in Qdrant collections.

27 rag-service tests passing.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-02 08:18:25 +02:00
Sharang Parnerkar f130c45ca8 feat(dataroom): bilingual descriptions, drag-drop multi-file upload, edit existing upload descriptions
Build pitch-deck / build-push-deploy (push) Successful in 1m47s
CI / go-lint (push) Has been skipped
CI / python-lint (push) Has been skipped
CI / nodejs-lint (push) Has been skipped
CI / test-go-consent (push) Successful in 39s
CI / test-python-voice (push) Successful in 32s
CI / test-bqas (push) Successful in 32s
- lib/translate.ts: LiteLLM DE<>EN translation utility
- Migration 006: description_de/description_en on both dataroom tables
- Admin + investor upload APIs: accept description+lang, auto-translate the other language on save
- PATCH /api/admin/dataroom/documents/[id]: description path in addition to display_name path
- PATCH /api/dataroom/uploads/[id]: investor can edit their own upload descriptions
- PATCH /api/admin/dataroom/investors/[id]/uploads: admin can edit investor upload descriptions
- All GET queries updated to return description fields
- Admin dataroom: drop zone replaces upload button, multi-file, inline description editor per doc and per investor upload
- Investor dataroom: drop zone, multi-file, description+lang textarea before upload, inline description editing on existing uploads

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-01 21:00:36 +02:00
Benjamin Admin 93099b2770 feat(pipeline): structural metadata end-to-end (Blocks D2-D4)
D2: RAG service stores section/section_title/paragraph/paragraph_num/page
from embedding service chunks_with_metadata into Qdrant payloads.

D3: Control generator prefers section > article > section_title from
Qdrant, adds page to source_citation and generation_metadata.

D4: Validated with real BGB §§ 312-312k text. Found and fixed critical
bug where Phase 3 overlap destroyed the [§ ...] section prefix, causing
only the first chunk per document to have metadata. All subsequent
chunks lost section info.

Also fixes pre-existing lint issues (unused imports, ambiguous variable
names, duplicate dict key, bare except).

456 tests passing (58 embedding + 387 pipeline + 11 rag-service).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-01 20:34:00 +02:00
Sharang Parnerkar 370143b643 fix(dataroom): use getSessionFromCookie() instead of middleware headers; fix auth page overflow
Build pitch-deck / build-push-deploy (push) Successful in 1m33s
CI / go-lint (push) Has been skipped
CI / python-lint (push) Has been skipped
CI / nodejs-lint (push) Has been skipped
CI / test-go-consent (push) Successful in 37s
CI / test-python-voice (push) Successful in 31s
CI / test-bqas (push) Successful in 27s
Dataroom routes were reading x-investor-id from request headers which
the middleware sets as response headers — these don't reach route handlers
when the admin fallback path runs (NextResponse.next() without header).
Switch to getSessionFromCookie() consistent with all other investor routes.

Auth page DSGVO footer switched from absolute bottom-0 to normal flow
so the expanded Art. 13 notice doesn't overlap the login card.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-01 16:03:21 +02:00
Sharang Parnerkar 07039cc408 fix(pitch-deck): pre-create /data/dataroom owned by nextjs in Dockerfile
Build pitch-deck / build-push-deploy (push) Successful in 1m18s
CI / go-lint (push) Has been skipped
CI / python-lint (push) Has been skipped
CI / nodejs-lint (push) Has been skipped
CI / test-go-consent (push) Successful in 29s
CI / test-python-voice (push) Successful in 30s
CI / test-bqas (push) Successful in 29s
Docker volume inherits directory ownership from the image on first mount.
Without this, the volume mounts as root and the nextjs (uid 1001) process
gets EACCES when trying to write dataroom uploads.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-01 15:51:50 +02:00
Sharang Parnerkar af83e41494 feat(pitch-deck): add Data Room link for investors in top-right corner
Build pitch-deck / build-push-deploy (push) Successful in 1m19s
CI / go-lint (push) Has been skipped
CI / python-lint (push) Has been skipped
CI / nodejs-lint (push) Has been skipped
CI / test-go-consent (push) Successful in 32s
CI / test-python-voice (push) Successful in 31s
CI / test-bqas (push) Successful in 29s
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-01 15:47:14 +02:00
Sharang Parnerkar 9888b1b5d7 feat(pitch-deck): data room — file sharing and investor uploads
Build pitch-deck / build-push-deploy (push) Successful in 1m21s
CI / go-lint (push) Has been skipped
CI / python-lint (push) Has been skipped
CI / nodejs-lint (push) Has been skipped
CI / test-go-consent (push) Successful in 31s
CI / test-python-voice (push) Successful in 33s
CI / test-bqas (push) Successful in 32s
- lib/dataroom-storage.ts: local volume storage (DATAROOM_PATH env var,
  default /data/dataroom) replacing NextCloud WebDAV
- Admin API: upload documents, rename, delete, manage per-investor releases
- Investor API: list released documents, stream download with audit log,
  upload own documents (max DATAROOM_MAX_UPLOAD_MB, default 50MB)
- /pitch-admin/dataroom: document list + release toggles + investor uploads tab
- /dataroom: investor-facing document library + upload section
- All reads and writes logged to pitch_audit_logs
- Migration 005: dataroom_documents, dataroom_releases, dataroom_investor_uploads
- AdminShell: Data Room nav link (FolderOpen icon)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-01 15:38:21 +02:00
Benjamin Admin da21339e76 docs: add session handover instructions for next session
Covers: completed blocks A-D1, remaining D2-G, critical files,
DB state, memory files, test commands.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-01 15:33:05 +02:00
Benjamin Admin 6ab10415d8 feat(embedding): add structural metadata to legal chunking (Block D1)
chunk_text_legal_structured() returns metadata per chunk:
- section: "§ 312k", "Art. 5"
- section_title: "Kündigungsbutton"
- paragraph: "Abs. 1", "Nr. 3"
- paragraph_num: 1, 3
- page: (prepared for PDF integration)
- index: sequential position

/chunk endpoint now returns chunks_with_metadata alongside plain chunks.
Backward compatible — existing consumers use chunks field unchanged.

New regex: _PARAGRAPH_RE (Abs/Nr/Satz/lit), _SECTION_NUMBER_RE
New functions: _parse_section_metadata(), _extract_paragraph_ref()

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-01 15:25:23 +02:00
Sharang Parnerkar 1bf1411c66 fix(pitch-deck): update email privacy notice to match GDPR changes
Build pitch-deck / build-push-deploy (push) Successful in 1m19s
CI / go-lint (push) Has been skipped
CI / python-lint (push) Has been skipped
CI / nodejs-lint (push) Has been skipped
CI / test-go-consent (push) Successful in 29s
CI / test-python-voice (push) Successful in 29s
CI / test-bqas (push) Successful in 29s
72 Stunden → 30 Tage, expand scope to include personal contact data,
add Art. 15–21 rights, LfDI BW supervisory authority. Both DE + EN.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-01 15:20:46 +02:00
Sharang Parnerkar 5946aa47d5 fix(pitch-deck): GDPR compliance — automated cleanup, full Art. 13 notice
Build pitch-deck / build-push-deploy (push) Successful in 1m37s
CI / go-lint (push) Has been skipped
CI / python-lint (push) Has been skipped
CI / nodejs-lint (push) Has been skipped
CI / test-go-consent (push) Successful in 38s
CI / test-python-voice (push) Successful in 32s
CI / test-bqas (push) Successful in 30s
- runDataCleanup() replaces maskOverdueInvestors(): now also anonymizes
  never-activated invites after 90 days, deletes sessions + magic links
  older than 30 days, NULLs IPs in audit logs older than 30 days, and
  redacts email from audit log details JSONB for masked investors
- New /api/admin/cleanup POST endpoint for scheduled invocation
- New .gitea/workflows/pitch-cleanup.yml: daily cron at 02:00 UTC calls
  the cleanup endpoint so anonymization is genuinely automatic, not lazy
- Switch masking window from first_activity_at to last_login_at (30 days
  of inactivity; resets on each login)
- Both auth pages: DSGVO footer now covers all Art. 13 requirements —
  data categories, retention cutoffs, Art. 15–21 rights, contact address,
  LfDI Baden-Württemberg as supervisory authority

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-01 15:11:51 +02:00
Benjamin Admin d9c16fb914 feat(pipeline): add adversarial tests (30 cases) + regression harness
Block C implementation:
- adversarial_cases.yaml: 30 tricky cases in 5 categories
  (wrong legal basis, dark patterns, incomplete docs, similar-but-different, homonyms)
- test_adversarial.py: 63 tests validating adversarial cases
- test_regression.py: ontology stability, dependency engine, quality metrics
- conftest.py: shared fixtures (DB session, sample controls)

Total: 371 tests passing (221 existing + 150 new).
Real-world benchmarks (C1) need manual ground truth creation.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-01 13:02:29 +02:00
Benjamin Admin 6f58fdbaa5 docs: add test strategy instruction for dedicated session (Block C)
3 test levels: Real-World Benchmarks (10 DE websites), Adversarial Tests
(30 tricky cases), Regression Harness (CI/CD quality gate).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-01 12:28:58 +02:00
Benjamin Admin b8ff4e9290 feat(pipeline): add review-verify endpoint — LLM decides DUPLIKAT/VERSCHIEDEN
Sends 67k review candidates to Haiku Batch API in pairs.
Each pair gets a DUPLIKAT/VERSCHIEDEN decision with reasoning.
Results stored in control_dedup_reviews.review_status.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-01 09:36:30 +02:00
Benjamin Admin f2104768a0 fix(docker): re-enable healthcheck after dedup completion
Dedup is done (162k controls). Re-enable healthcheck with generous
timeouts (10 retries × 30s) and restart: unless-stopped.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-01 08:39:57 +02:00
Sharang Parnerkar 2f861cd6d7 feat(pitch-admin): backfill first_activity_at for existing investors
Build pitch-deck / build-push-deploy (push) Successful in 1m22s
CI / go-lint (push) Has been skipped
CI / python-lint (push) Has been skipped
CI / nodejs-lint (push) Has been skipped
CI / test-go-consent (push) Successful in 30s
CI / test-python-voice (push) Successful in 31s
CI / test-bqas (push) Successful in 31s
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-30 15:08:26 +02:00
Sharang Parnerkar 23b233bda3 feat(pitch-admin): generate magic link + 72h investor data masking
Build pitch-deck / build-push-deploy (push) Successful in 1m30s
CI / go-lint (push) Has been skipped
CI / python-lint (push) Has been skipped
CI / nodejs-lint (push) Has been skipped
CI / test-go-consent (push) Successful in 29s
CI / test-python-voice (push) Successful in 29s
CI / test-bqas (push) Successful in 30s
- New POST /api/admin/investors/[id]/generate-link endpoint: creates a
  magic link without sending email, returns the URL for the admin to
  copy and share manually (for when email is filtered)
- Adds 'Copy Link' button (emerald) to investor list and detail pages;
  link is copied to clipboard on click
- New lib/masking.ts: maskOverdueInvestors() UPDATE that anonymizes
  email/name/company → revokes sessions 72h after first investor login
- first_activity_at recorded on first verify (COALESCE, set once only)
- migration 004 adds first_activity_at + data_masked_at columns with
  partial index; also wired into /api/admin/migrate for one-shot apply
- Admin UI shows 'anonymized' badge, expiry countdown, and masked state;
  Copy Link + Resend are disabled for anonymized investors
- verify route returns 410 if data_masked_at is set (belt-and-suspenders
  alongside the revoked status check)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-30 14:55:29 +02:00
Sharang Parnerkar adfff6cfe4 fix(pitch-deck): exclude mcp-server from Next.js tsconfig + resolve FinanzplanSlide conflict
Build pitch-deck / build-push-deploy (push) Successful in 1m13s
CI / go-lint (push) Has been skipped
CI / python-lint (push) Has been skipped
CI / nodejs-lint (push) Has been skipped
CI / test-go-consent (push) Successful in 27s
CI / test-python-voice (push) Successful in 27s
CI / test-bqas (push) Successful in 31s
- tsconfig.json: add mcp-server to exclude list so the standalone MCP
  package's imports don't break the Next.js type-check build
- FinanzplanSlide.tsx: resolve merge conflict, keep MonthlyGrid refactor
  from upstream (discards superseded inline table from stash)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-30 14:11:40 +02:00
Sharang Parnerkar 269464943e fix(pitch-deck): restore complete USPSlide with all helper functions
Build pitch-deck / build-push-deploy (push) Failing after 40s
CI / go-lint (push) Has been skipped
CI / python-lint (push) Has been skipped
CI / nodejs-lint (push) Has been skipped
CI / test-go-consent (push) Successful in 41s
CI / test-python-voice (push) Successful in 29s
CI / test-bqas (push) Successful in 26s
The previously committed version was missing useIsLight hook, all sub-components
(PillarRow, ColHeader, CentralHub, BridgeConnectors, FeatureCard, DetailModal,
StarField, ticker components) and their data/types. Only the main component
shell was present, causing a CI build failure on type-check.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-30 14:05:42 +02:00
Benjamin Admin e8df15c0f8 fix: add proxy_read_timeout 300s to admin-compliance location block
Scan endpoint needs up to 3-5 min (multi-page crawl + LLM calls).
Without explicit timeout, nginx defaults to 60s → 504 Gateway Timeout.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-29 11:23:02 +02:00
Benjamin Admin 7c5592b50e feat(pipeline): add checkpoint to dedup Phase 2 — survives container restart
Stores last_control_id in canonical_generation_jobs after each page.
On restart, resumes from checkpoint instead of starting over.
Checkpoint is deleted on completion.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-29 09:12:23 +02:00
Benjamin Admin e8f018f2c6 fix: increase client_max_body_size to 50M for ports 3007 + 8093
CI / go-lint (push) Has been skipped
CI / python-lint (push) Has been skipped
CI / nodejs-lint (push) Has been skipped
CI / test-go-consent (push) Successful in 56s
CI / test-python-voice (push) Successful in 38s
CI / test-bqas (push) Successful in 31s
Port 3007 (admin-compliance) had no limit (nginx default 1M) causing
413 on SDK state saves. Port 8093 (SDK) had 10M, now 50M.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-29 08:54:06 +02:00
Benjamin Admin b151951448 fix(pipeline): make dedup Phase 2 resilient — paginated, timeout, per-control error handling
- Paginated DB queries (100 rows/page) instead of loading all 166k rows
- Individual timeout (30s) per embedding + qdrant call
- Per-control try/except — one failure doesn't kill the job
- Sequential processing (no asyncio.gather) for stability
- Progress logging every 500 controls

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-28 15:31:28 +02:00
Benjamin Admin 2e2e81b3e1 fix(docker): disable healthcheck + auto-restart for control-pipeline during dedup
The dedup job blocks the event loop for extended periods, causing
health checks to fail repeatedly. Even 10 retries × 30s wasn't enough.
Disabled healthcheck and restart policy until dedup is complete.

TEMPORARY — re-enable after dedup is finished.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-28 14:39:19 +02:00
Benjamin Admin b873c0e4ae fix(docker): increase control-pipeline healthcheck tolerance for long-running jobs
Dedup Phase 2 blocks the event loop for extended periods, causing
health checks to fail. Docker then restarts the container and kills
the job. Increased retries from 3 to 10, timeout from 10s to 30s.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-28 12:35:39 +02:00
Benjamin Admin 9dc16674e2 perf(pipeline): skip singleton groups in dedup Phase 1
153k of 160k merge groups have only 1 control — no intra-group
dedup possible. Skip them in Phase 1, they become masters automatically.
Phase 2 (cross-group) still checks them via Qdrant embeddings.

Reduces Phase 1 from ~96h to ~2h.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-28 00:31:22 +02:00
Benjamin Admin e6e2688b56 fix(pipeline): add idempotency guard to submit-pass0b endpoint
Prevents duplicate batch submissions that caused ~$170 in extra costs.
Refuses new submit if a batch was submitted in the last 10 minutes.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-27 18:59:03 +02:00
Benjamin Admin 28aa74b4b0 Merge remote-tracking branch 'gitea/main'
Build pitch-deck / build-push-deploy (push) Failing after 1m13s
CI / go-lint (push) Has been skipped
CI / python-lint (push) Has been skipped
CI / nodejs-lint (push) Has been skipped
CI / test-go-consent (push) Successful in 49s
CI / test-python-voice (push) Successful in 38s
CI / test-bqas (push) Successful in 31s
# Conflicts:
#	pitch-deck/components/slides/MilestonesSlide.tsx
#	pitch-deck/lib/finanzplan/engine.ts
2026-04-27 13:14:54 +02:00
Benjamin Admin 8e37441782 perf(pipeline): switch back to v4 prompt — backfill costs nearly the same
v3+backfill=$31.60/10k vs v4=$33/10k — not worth the extra complexity.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-27 00:44:23 +02:00
Benjamin Admin 6a0e7c947f perf(pipeline): switch to v3 prompt for generation, v4 fields via Haiku backfill
Remove applicability/scanner_hint/evidence_type/provides_context from
Pass 0b prompt to reduce output tokens (~40% less). These 6 fields are
added via cheap Haiku backfill afterwards (~$1.50 per 10k controls).

Saves ~$200 over the remaining 160k obligations.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-27 00:14:47 +02:00