698 Commits

Author SHA1 Message Date
Benjamin Admin 0bb9726ddd Merge branch 'main' of ssh://gitea.meghsakha.com:22222/Benjamin_Boenisch/breakpilot-core
CI / go-lint (push) Has been skipped
CI / python-lint (push) Has been skipped
CI / nodejs-lint (push) Has been skipped
CI / test-go-consent (push) Successful in 48s
CI / test-python-voice (push) Successful in 43s
CI / test-bqas (push) Successful in 36s
2026-05-10 15:09:51 +02:00
Benjamin Admin 8510af46eb feat(pipeline): MC Quality Overhaul — 74.5% → 92.8% accuracy, 5.3K → 13.6K MCs
Phase 0: Quality Audit script (Claude Sonnet, 1750 samples)
Phase 1: Object ontology expanded 31 → 74 tokens with descriptions + boundaries
Phase 2: 174K controls re-classified via Haiku (10 batches, $50)
  - Generic tokens removed (documentation, procedure, process)
  - L2 sub-topics added (108K + 64K controls)
  - Bad subtopics fixed (stakeholder_*, escalation fragments)
Phase 3: Re-clustering K=18704 (37K objects → 16.7K groups)
Phase 4: Direct MC generation from canonical tokens (gpre2_direct_mc.py)
Phase 5: Regulation-source split (gpre3, dry-run tested)

New features:
- Tenant-isolated document upload API (rag-service)
- BAuA crawler (Playwright, 131 PDFs downloaded)
- OSHA Technical Manual crawler (23 chapters)
- CE obligation extractor (6141 obligations from Qdrant)

RAG ingestion:
- 126 BAuA PDFs (TRBS/TRGS/ASR): 27,664 chunks
- OSHA Technical Manual: 7,241 chunks
- OSHA 1910 Subpart O (full): 745 chunks
- EuGH C-588/21 P: 216 chunks
- EU 2018/1725: 842 chunks

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-10 15:08:15 +02:00
Benjamin Admin 81db904b3e feat(legal-sources): add OSHA machinery safety standards + international norms mapping
OSHA 29 CFR 1910 Subpart O (1910.211-1910.219) — complete machine
guarding requirements. US federal law, public domain.

International norms mapping table: China GB/T, Korea KS, India BIS
equivalents to ISO/EN standards. Unfortunately all countries protect
ISO copyright even for identical national adoptions (IDT).

Only OSHA provides truly free machinery safety content.
EU Excel harmonised standards list included for reference.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-09 10:50:43 +02:00
Sharang Parnerkar 572052285c fix: require button click to consume magic link token
Build pitch-deck / build-push-deploy (push) Successful in 1m54s
CI / go-lint (push) Has been skipped
CI / python-lint (push) Has been skipped
CI / nodejs-lint (push) Has been skipped
CI / test-go-consent (push) Successful in 47s
CI / test-python-voice (push) Successful in 37s
CI / test-bqas (push) Successful in 37s
Email security gateways follow GET redirects automatically and were
consuming the token before the investor clicked through. The verify page
now shows an 'Access Pitch Deck' button; the token is only consumed on
explicit click, which scanners cannot trigger.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-07 23:30:27 +02:00
Sharang Parnerkar 1ef22e6f95 fix: use PITCH_BASE_URL for short link redirects instead of request.url
Build pitch-deck / build-push-deploy (push) Successful in 1m39s
CI / go-lint (push) Has been skipped
CI / python-lint (push) Has been skipped
CI / nodejs-lint (push) Has been skipped
CI / test-go-consent (push) Successful in 32s
CI / test-python-voice (push) Successful in 32s
CI / test-bqas (push) Successful in 29s
Behind Orca's reverse proxy, request.url resolves to http://127.0.0.1:3000
which causes redirects to go to the internal address instead of the public
domain. Use PITCH_BASE_URL (already set in service.toml) as the base.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-07 10:55:53 +02:00
Sharang Parnerkar d291af0e33 fix: whitelist /p/* in middleware so short links work without a session
Build pitch-deck / build-push-deploy (push) Successful in 1m38s
CI / go-lint (push) Has been skipped
CI / python-lint (push) Has been skipped
CI / nodejs-lint (push) Has been skipped
CI / test-go-consent (push) Successful in 33s
CI / test-python-voice (push) Successful in 33s
CI / test-bqas (push) Successful in 30s
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-07 10:42:09 +02:00
Sharang Parnerkar 76aad8b1d1 feat(pitch-deck): branded short links for magic URLs (pitch.breakpilot.ai/p/ab3xk2)
Build pitch-deck / build-push-deploy (push) Successful in 1m31s
CI / go-lint (push) Has been skipped
CI / python-lint (push) Has been skipped
CI / nodejs-lint (push) Has been skipped
CI / test-go-consent (push) Successful in 32s
CI / test-python-voice (push) Successful in 34s
CI / test-bqas (push) Successful in 30s
- New pitch_short_links table stores 6-char alphanumeric codes mapped to magic link tokens
- GET /p/[code] redirects to /auth/verify?token=... (302, validates expiry)
- All magic link generation points (invite, generate-link, resend) now create a short code
- Emails (invite + resend) use the short URL — less token-like, cleaner for spam filters
- Copy-link UI shows short URL prominently with full URL as fallback
- Migration 008 added to /api/admin/migrate

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-07 10:34:24 +02:00
Sharang Parnerkar 54f0919b73 feat(pitch-deck): translate financial plan row labels when lang=en
Build pitch-deck / build-push-deploy (push) Successful in 2m0s
CI / go-lint (push) Has been skipped
CI / python-lint (push) Has been skipped
CI / nodejs-lint (push) Has been skipped
CI / test-go-consent (push) Successful in 47s
CI / test-python-voice (push) Successful in 35s
CI / test-bqas (push) Successful in 32s
- Add ROW_LABEL_MAP (DE→EN) covering GuV, Liquidität, Kunden, Betriebliche Aufwendungen rows
- Add FORMULA_TOOLTIPS_EN with English tooltip text for all formula-driven rows
- Add MONTH_LABELS_EN (Mrz→Mar, Mai→May, Okt→Oct)
- LabelWithTooltip now accepts `de` flag, translates display text and tooltip accordingly
- Month column headers switch between DE/EN month abbreviations
- Falls back to original German label for any row not in the map

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-07 09:47:45 +02:00
Sharang Parnerkar ec7eee8e3d feat(pitch-deck): change preferred_lang for existing investors from detail page
Build pitch-deck / build-push-deploy (push) Successful in 1m27s
CI / go-lint (push) Has been skipped
CI / python-lint (push) Has been skipped
CI / nodejs-lint (push) Has been skipped
CI / test-go-consent (push) Successful in 33s
CI / test-python-voice (push) Successful in 33s
CI / test-bqas (push) Successful in 34s
- GET /api/admin/investors/:id now returns preferred_lang
- PATCH /api/admin/investors/:id accepts preferred_lang (de/en), validates value
- Investor detail page: DE/EN toggle in the Pitch Version card, instant save on click

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-06 23:31:59 +02:00
Sharang Parnerkar b0d273d3ab feat(pitch-deck): add pitch version selection to investor invite form
Build pitch-deck / build-push-deploy (push) Successful in 1m33s
CI / go-lint (push) Has been skipped
CI / python-lint (push) Has been skipped
CI / nodejs-lint (push) Has been skipped
CI / test-go-consent (push) Successful in 33s
CI / test-python-voice (push) Successful in 36s
CI / test-bqas (push) Successful in 32s
- Version dropdown on the invite form shows all committed versions
- Selected version is assigned to the investor at creation time (no separate step needed)
- API validates version is committed before upserting
- Leaving the dropdown empty keeps any existing assignment (COALESCE behavior)
- version_id included in audit log

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-06 23:27:23 +02:00
Sharang Parnerkar 17b9006b88 feat(pitch-deck): English email templates, investor language preference, link-only invite mode
Build pitch-deck / build-push-deploy (push) Successful in 1m55s
CI / go-lint (push) Has been skipped
CI / python-lint (push) Has been skipped
CI / nodejs-lint (push) Has been skipped
CI / test-go-consent (push) Successful in 36s
CI / test-python-voice (push) Successful in 35s
CI / test-bqas (push) Successful in 35s
- Add English email template variants (greeting, message, closing, subject, CTA copy)
- Add `preferred_lang` column to `pitch_investors` — stored per investor, deck opens in that language by default
- Invite form: DE/EN language toggle that switches email defaults and pitch language setting
- Invite form: "Send email" toggle — when off, creates investor + returns magic link without sending email (for cold outreach attachment)
- `app/page.tsx`: initializes pitch language from investor's `preferred_lang` before first render (no flash)
- Migration 007 added to `/api/admin/migrate` route for production rollout

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-06 23:18:40 +02:00
Benjamin Admin e013702a02 Merge branch 'main' of ssh://gitea.meghsakha.com:22222/Benjamin_Boenisch/breakpilot-core
CI / go-lint (push) Has been skipped
CI / python-lint (push) Has been skipped
CI / nodejs-lint (push) Has been skipped
CI / test-go-consent (push) Successful in 47s
CI / test-python-voice (push) Successful in 38s
CI / test-bqas (push) Successful in 37s
2026-05-06 21:06:19 +02:00
Benjamin Admin f022b489e2 docs: comprehensive session handover — Blocks F+G complete, next: MC quality refinement
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-06 21:06:01 +02:00
Benjamin Admin 0092c4fe47 feat(pipeline): G-pre1 refinement script for large object groups
Splits master controls >200 members by re-clustering their object groups
with k=4-20 per group. First round: 38 groups → 325 sub-groups → 253 new MCs.
25 generic MCs remain (monitoring, procedure, etc.) — need regulation-source split.

Session summary: Block F complete, Control Generation (1,599+), Pass 0a/0b,
Production Sync, G-pre1/2/3 Object Clustering + Master Controls + API,
G1-G4 Compliance Execution Layer (Decision Trace, Commit Ledger, Decision Memory,
Pre-Deployment Enforcement).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-06 20:41:49 +02:00
Benjamin Admin d5bcd0bd5b feat(pipeline): G4 Pre-Deployment Enforcement — CI/CD compliance gate
New table: deployment_checks (verdict, blocking/warning controls, risk score)
New API:
  POST /v1/deployment-checks (SDK asks: "can I deploy?")
  GET /v1/deployment-checks/{id} (check result)
  POST /v1/deployment-checks/{id}/override (manual override with justification)
  GET /v1/deployment-checks/stats (approval/block rate)

Check logic: queries G1 decision_traces + G3 open failures per affected control.
Verdict: approved (0 blocking) or blocked (with fix recommendations).
454 tests pass, 0 regressions.

Block G complete: G1-G4 all implemented.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-06 20:24:45 +02:00
Benjamin Admin c398e74d5e feat(pipeline): G3 Full Decision Memory — compliance lifecycle event stream
New table: decision_events (assessment→decision→fix→verification→failure cycle)
New API:
  POST /v1/decision-events (record lifecycle event)
  GET /v1/decision-events (list with filters)
  GET /v1/decision-events/timeline/{control_id} (full chronological timeline)
  GET /v1/decision-events/stats (failure rate, cycle times)

Each event captures input_state, output_state, actor, evidence.
454 tests pass, 0 regressions.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-06 20:16:25 +02:00
Benjamin Admin e82f99b8cb feat(pipeline): G2 Compliance Commit Ledger — code↔control audit trail
New table: compliance_commits (commit hash, affected controls, risk level)
New API:
  POST /v1/compliance-commits (SDK registers commit + impact)
  GET /v1/compliance-commits (list with filters)
  GET /v1/compliance-commits/by-control/{id} (all commits for a control)
  GET /v1/compliance-commits/stats (dashboard)
  GET /v1/compliance-commits/{id} (detail)

GIN index on affected_control_ids for fast @> containment queries.
454 tests pass, 0 regressions.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-06 19:17:45 +02:00
Benjamin Admin 66a70ab31c feat(pipeline): G1 Decision Trace — compliance decision tracking
New table: decision_traces (status, reason, evidence, fix plan per control)
New API:
  POST/GET/PUT /v1/decision-traces (CRUD for decisions)
  GET /v1/decision-traces/stats (compliance dashboard)
  GET /v1/controls/{id}/full-trace (Regulation→Obligation→Control→Decision→Evidence)

454 tests pass, 0 regressions.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-06 18:26:21 +02:00
Benjamin Admin ad24835940 feat(pipeline): G-pre1/2/3 — Object Clustering + Master Controls + API
G-pre1: 144k objects clustered into 7,466 groups via Mini-Batch K-Means
  on bge-m3 embeddings. Two-stage: k=5000 base + sub-cluster groups >50.
G-pre2: 5,114 Master Controls from lifecycle phase chains
  (define→implement→test→monitor), linking 172,504 atomic controls.
G-pre3: REST API for Master Controls
  GET /v1/master-controls (list, search, filter)
  GET /v1/master-controls/stats
  GET /v1/master-controls/{mc_id} (detail with phase-controls)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-06 15:11:38 +02:00
Benjamin Admin e683701a44 fix(gitea): remove /etc/timezone mount (macOS incompatible), use TZ env var
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-05 19:37:43 +02:00
Benjamin Admin 0bad74a3bd docs: session handover — Block F complete, pipeline done, G-pre1 analysis
Session 03-05.05.2026:
- Block F1-F5 complete (DB migration of hardcoded dicts)
- Control Generation: 1,599 controls + 11,522 obligations + 1,147 atomics
- Production sync: 2,625 controls + 11,522 obligations synced
- G-pre1 analysis: 183k objects → 144k after normalize (needs hierarchical clustering)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-05 18:02:10 +02:00
Benjamin Admin 22257a7ed8 feat(pipeline): F5 validation tests — verify DB matches hardcoded dicts
8 tests confirm all REGULATION_LICENSE_MAP, ACTION_TYPES, _NEGATIVE_PATTERNS,
_ACTION_SYNONYMS, and _OBJECT_SYNONYMS entries are correctly migrated to DB.
Dicts kept as fallback for DB-unavailability resilience.

Block F complete: F1-F5 all done.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-05 16:06:59 +02:00
Benjamin Admin a20de0b52b feat(pipeline): F4 LLM synonym enrichment script
Uses Ollama (qwen3.5:35b-a3b, think:false) to generate additional
German synonyms for action types and object tokens. Results stored
with source='llm' in action_synonyms/object_synonyms tables.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-05 15:45:43 +02:00
Benjamin Admin 775d8b52f3 fix(vault): prevent CPU-burning init loop with marker file + idempotent checks
Root cause: init scripts ran repeatedly (on container restart) and tried
vault secrets enable / vault auth enable for already-existing paths.
Vault logged ERRORs and burned 40-84% CPU in the loop.

Fix:
- Marker file /vault/data/.init-complete skips re-initialization
- vault secrets list / vault auth list checks before enable calls
- No more "path already in use" errors on subsequent runs

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-05 11:46:16 +02:00
Sharang Parnerkar f0a84e79ab fix(preview): return fp_scenarios key so version-specific scenario is resolved in admin preview
Build pitch-deck / build-push-deploy (push) Successful in 1m39s
CI / go-lint (push) Has been skipped
CI / python-lint (push) Has been skipped
CI / nodejs-lint (push) Has been skipped
CI / test-go-consent (push) Successful in 40s
CI / test-python-voice (push) Successful in 32s
CI / test-bqas (push) Successful in 33s
The preview-data API was returning `fm_scenarios` but PitchDeck reads
`data.fp_scenarios`, so fpBaseScenarioId was always null and the
Finanzplan slide fell back to the global default scenario (Base Case 200k)
instead of the version's assigned scenario (e.g. 1 Mio. Euro Base).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-05 11:39:53 +02:00
Benjamin Admin 64f45be63a feat(pipeline): add Pass 0a endpoint to core control-pipeline
Registers /generate/run-pass0a and /generate/pass0a-status/{job_id}
on the core control-pipeline (port 8098). Previously Pass 0a was only
available on the compliance backend which connects to Production DB,
causing a split-brain when controls are generated locally.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-05 07:21:58 +02:00
Sharang Parnerkar 404963db77 feat(showcase): restore intro-presenter and executive-summary slides in showcase mode
Build pitch-deck / build-push-deploy (push) Successful in 1m22s
CI / go-lint (push) Has been skipped
CI / python-lint (push) Has been skipped
CI / nodejs-lint (push) Has been skipped
CI / test-go-consent (push) Successful in 31s
CI / test-python-voice (push) Successful in 33s
CI / test-bqas (push) Successful in 30s
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-04 23:14:18 +02:00
Sharang Parnerkar 0acbf25956 fix(showcase): hide Data Room link for showcase sessions
Build pitch-deck / build-push-deploy (push) Successful in 1m23s
CI / go-lint (push) Has been skipped
CI / python-lint (push) Has been skipped
CI / nodejs-lint (push) Has been skipped
CI / test-go-consent (push) Successful in 29s
CI / test-python-voice (push) Successful in 33s
CI / test-bqas (push) Successful in 30s
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-04 23:12:57 +02:00
Sharang Parnerkar 2bd9b015eb fix(showcase): block financial data from AI Q&A, fix FAB overflow, fix presenter slide mapping
Build pitch-deck / build-push-deploy (push) Successful in 1m47s
CI / go-lint (push) Has been skipped
CI / python-lint (push) Has been skipped
CI / nodejs-lint (push) Has been skipped
CI / test-go-consent (push) Successful in 41s
CI / test-python-voice (push) Successful in 32s
CI / test-bqas (push) Successful in 32s
AI Q&A: fetch is_showcase from DB; showcase sessions receive no financial/funding
context and have an explicit LLM guard refusing to discuss investment details.
FAQ context and financial slide IDs stripped from system prompt.

FAB: flex layout so Fullscreen button is always visible regardless of panel height.

Presenter: pass activeSlideOrder to usePresenterMode so buildSlideAudioPlan maps
slideIdx → slideId from the filtered list, not the full SLIDE_ORDER. Progress
calculation also filters to active scripts only.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-04 23:00:55 +02:00
Sharang Parnerkar be126a7a39 fix(pitch): showcase sidebar shows only filtered slides + AI presenter via FAB
Build pitch-deck / build-push-deploy (push) Successful in 1m22s
CI / go-lint (push) Has been skipped
CI / python-lint (push) Has been skipped
CI / nodejs-lint (push) Has been skipped
CI / test-go-consent (push) Successful in 31s
CI / test-python-voice (push) Successful in 30s
CI / test-bqas (push) Successful in 31s
NavigationFAB and SlideOverview now accept slideNames prop and render only the
active slide list (filtered for showcase mode). Adds AI presenter start button
to the FAB footer so it's accessible even when intro-presenter slide is hidden.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-04 22:50:33 +02:00
Sharang Parnerkar 30a9165497 feat(pitch): showcase mode — per-investor toggle hides financial/investor slides for customer demos
Build pitch-deck / build-push-deploy (push) Successful in 1m35s
CI / go-lint (push) Has been skipped
CI / python-lint (push) Has been skipped
CI / nodejs-lint (push) Has been skipped
CI / test-go-consent (push) Successful in 39s
CI / test-python-voice (push) Successful in 32s
CI / test-bqas (push) Successful in 30s
Adds is_showcase boolean to pitch_investors; when set, filters out financials,
the ask, cap table, assumptions, finanzplan, risks, and intro-presenter slides.
Slide navigation is fully dynamic — progress bar and counts update accordingly.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-04 22:41:15 +02:00
Sharang Parnerkar f2184be02f fix: tab row counts use investor's scenario, not always Base Case
Build pitch-deck / build-push-deploy (push) Successful in 1m34s
CI / go-lint (push) Has been skipped
CI / python-lint (push) Has been skipped
CI / nodejs-lint (push) Has been skipped
CI / test-go-consent (push) Successful in 36s
CI / test-python-voice (push) Successful in 35s
CI / test-bqas (push) Successful in 32s
/api/finanzplan now accepts ?scenarioId and uses it for the per-sheet
row counts (the numbers in brackets on the tab bar). FinanzplanSlide
passes fpBaseScenarioId when fetching the sheet list, so Wandeldarlehen
investors see e.g. Personalkosten (9) instead of (35).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-04 15:21:40 +02:00
Sharang Parnerkar 06014d57b3 fix: derive fp_scenario IDs from version snapshot, eliminate hardcoded UUIDs
Build pitch-deck / build-push-deploy (push) Successful in 1m30s
CI / go-lint (push) Has been skipped
CI / python-lint (push) Has been skipped
CI / nodejs-lint (push) Has been skipped
CI / test-go-consent (push) Successful in 33s
CI / test-python-voice (push) Successful in 32s
CI / test-bqas (push) Successful in 31s
The fm_scenarios array in each pitch version snapshot already stores the
fp_scenario IDs directly (same pattern 1 Mio used). Wandeldarlehen snapshots
were missing Bear/Bull entries — updated in DB to add them.

- /api/data: include fp_scenarios in version response (was omitted)
- PitchDeck: derive fpBaseScenarioId from data.fp_scenarios
- useFpKPIs: accept fpBaseScenarioId instead of isWandeldarlehen boolean
- AssumptionsSlide: find Bear/Base/Bull by name from fpScenarios prop
- FinanzplanSlide: initialize from fpBaseScenarioId, use version scenarios for selector
- FinancialsSlide / ExecutiveSummarySlide: pass fpBaseScenarioId to hook
- types: add FpScenarioRef + fp_scenarios field to PitchData

No UUID hardcoded in any component. Adding a new pitch version only
requires setting the correct fp_scenario IDs in its fm_scenarios snapshot.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-04 15:00:06 +02:00
Sharang Parnerkar 6c022d1a79 fix: allow investors to query fp_ scenarios by scenarioId
Build pitch-deck / build-push-deploy (push) Successful in 1m55s
CI / go-lint (push) Has been skipped
CI / python-lint (push) Has been skipped
CI / nodejs-lint (push) Has been skipped
CI / test-go-consent (push) Successful in 40s
CI / test-python-voice (push) Successful in 37s
CI / test-bqas (push) Successful in 34s
AssumptionsSlide sends ?scenarioId=<uuid> for Bear/Base/Bull cards but
the route was silently dropping it for non-admin requests, making all
three cards return the same default Base Case data. Since fp_ financial
projections are already investor-facing, any valid scenarioId is allowed.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-04 14:27:14 +02:00
Benjamin Admin e869cabc81 docs: session handover — F1-F3 done, control generation running
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-04 07:21:24 +02:00
Benjamin Admin 652e3a65a3 feat(pipeline): F2+F3 action/object ontology — DB-backed normalization
CI / go-lint (push) Has been skipped
CI / python-lint (push) Has been skipped
CI / nodejs-lint (push) Has been skipped
CI / test-go-consent (push) Successful in 36s
CI / test-python-voice (push) Successful in 33s
CI / test-bqas (push) Successful in 31s
Migrates ACTION_TYPES (26+8 types), _NEGATIVE_PATTERNS (22), _ACTION_SYNONYMS
(65), and _OBJECT_SYNONYMS (75) from hardcoded dicts to DB tables.

- SQL migration: 003_action_object_ontology.sql (3 tables)
- Migration scripts: f2_migrate_actions.py (34 types, 145 synonyms), f3_migrate_objects.py (75 objects)
- OntologyRegistry cache: 5min TTL, raises RuntimeError if empty (safe fallback to dicts)
- control_ontology.classify_action/get_phase delegate to DB with dict fallback
- control_dedup.normalize_action/normalize_object delegate to DB with dict fallback
- 25 new tests, 446 total pass, 0 regressions

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-03 23:47:53 +02:00
Benjamin Admin aab8eeb335 Merge branch 'main' of ssh://gitea.meghsakha.com:22222/Benjamin_Boenisch/breakpilot-core
CI / go-lint (push) Has been skipped
CI / python-lint (push) Has been skipped
CI / nodejs-lint (push) Has been skipped
CI / test-go-consent (push) Successful in 47s
CI / test-python-voice (push) Successful in 38s
CI / test-bqas (push) Successful in 33s
2026-05-03 23:14:34 +02:00
Benjamin Admin 9437e029d0 feat(pipeline): F1 regulation registry — DB-backed license/source-type lookup
Migrates REGULATION_LICENSE_MAP (135 entries) and SOURCE_REGULATION_CLASSIFICATION
(58 entries) from hardcoded Python dicts to compliance.regulation_registry table.

- SQL migration: 002_regulation_registry.sql (table + indexes + trigger)
- Migration script: f1_migrate_regulation_registry.py (162 rows, --dry-run)
- RegulationRegistry cache: 5min TTL, prefix fallback, graceful degradation
- control_generator._classify_regulation() delegates to DB with dict fallback
- source_type_classification.classify_source_regulation() delegates to DB
- 34 new tests (lookup, cache, degradation, migration data consistency)
- 421 total tests pass, 0 regressions

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-03 23:14:06 +02:00
Benjamin Admin 4fd2bfefcd docs: session handover updated for Block F start
Next: F1 Regulation Registry (DB + API + Frontend + Auto-Create)
Frontend at /sdk/regulation-registry in breakpilot-compliance admin

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-03 22:51:23 +02:00
Benjamin Admin fac9280716 feat(pipeline): Block D5+-E complete session — 20k+ new chunks
Session 02-03.05.2026 accomplishments:
- D5+: NIST/ENISA PDF quality fix (0%→45% section rate)
- D5+: 4 lost NIST PDFs restored (11k chunks)
- D5+: Text normalization + section detection for NIST/BSI
- D6: Citation backfill (3,651 controls updated, old archived)
- E2: 8 DE laws ingested (ArbZG, MuSchG, GmbHG, AktG, InsO...)
- E3: 5 EU regulations (CSRD, CSDDD, Taxonomy, eIDAS, Pay Trans.)
- E4: Standards (GoBD, BAIT, VAIT)
- E6: 3 CH + 4 AT laws (OR, DSV, ArG, ArbVG, AngG, AZG, NISG)
- E7: 9 court judgments as full text (Schrems II 154 chunks,
  Meta 101, BVerfG 161, DSK OH 119, Planet49 42, SCHUFA 41,
  Schadenersatz 29, BAG 48, Google Fonts 14)
- Infra: Qdrant snapshot mechanism, upload-before-delete safety

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-03 22:31:57 +02:00
Benjamin Admin 118be3540d feat(pipeline): D6 citation backfill + E2/E3 law ingestion scripts
- d6_citation_backfill.py: 3-tier matching (hash/prefix/overlap),
  archives old citations, updated 3.651 controls (93.6% coverage)
- ingest_de_laws.py: 8 German laws ingested (ArbZG, MuSchG, NachwG,
  MiLoG, GmbHG, AktG, InsO, BUrlG — 1.629 chunks)
- ingest_eu_regulations.py: EUR-Lex ingestion (needs manual HTML due
  to AWS WAF). CSRD, CSDDD, EU Taxonomy, eIDAS 2.0, Pay Transparency
  manually ingested (1.057 chunks)
- Updated session handover with current state

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-03 13:19:27 +02:00
Benjamin Admin a9671a572b fix(embedding): single-number ALL-CAPS section detection for ENISA/BSI
Add case-sensitive _SINGLE_NUM_ALLCAPS_RE for "1. INTRODUCTION" style
headers (ENISA, BSI docs). Cannot use _LEGAL_SECTION_RE for this because
it uses re.IGNORECASE which would false-positive on "1. Erstens" etc.

Also re-downloaded 2 corrupt PDFs from nist.gov (nistir_8259a, nist_ai_rmf)
— originals in MinIO were 263-byte XML error responses, not PDFs.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-03 08:56:02 +02:00
Benjamin Admin 2f4a3f2ea2 fix(embedding): add NIST control IDs to _SECTION_NUMBER_RE
_SECTION_NUMBER_RE only had patterns for §/Art/Section/Kapitel/Annex
but missed NIST-style identifiers (AC-1, GV.OC-01, 3.1, A01:2021).
This caused 0% section rate for all NIST/BSI/ENISA documents even
though sections were correctly detected — the section NUMBER wasn't
extracted from the header.

Also adds:
- reupload_legal_strategy.py: re-upload with legal chunking
- extract_and_upload_nist.py: local PDF extraction workaround
- qdrant-snapshot.sh: backup mechanism for Qdrant collections

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-03 07:42:06 +02:00
Benjamin Admin 0b0eed27b0 feat(embedding): NIST PDF text normalization + safe re-ingest script
Fix broken multi-column PDF extraction for NIST/BSI/ENISA documents:
- _normalize_pdf_text(): fixes broken section numbers (1 . 1 → 1.1),
  control IDs (AC - 1 → AC-1), ligatures, soft hyphens
- pdfplumber tolerances increased (x=3,y=4) for better column handling
- 3 new regex patterns: NIST CSF 2.0, NIST enhancements, OWASP Top 10
- reingest_nist.py: safe upload-before-delete for 4 lost NIST PDFs
- reingest_d5.py: safety fix — upload first, verify, then delete old

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-03 06:42:46 +02:00
Benjamin Admin 97a7f6f264 docs: comprehensive session handover with full roadmap (Blocks A-G)
Complete instructions for next session including:
- Current quality metrics per document type
- Prioritized action items (NIST fix, citation backfill, missing laws)
- Full Block E-G roadmap with details
- All critical files, DB state, test commands
- Known issues (3 lost NIST PDFs, frontend 500s, D5 script safety)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-02 22:30:50 +02:00
Benjamin Admin ff21bc258a docs: session handover — D2-D5 complete, quality report, NIST plan
Major session achievements:
- Structural metadata end-to-end (D2-D4)
- 430 docs re-ingested with new chunking
- HTML stripping + charset detection (0% → 97.6%)
- 20 EU regulations from EUR-Lex HTML (DSGVO: 0% → 92%)
- Quality report script (500 controls: 13% fully correct)
- Frontend requirements.map fix

Open: NIST/ENISA text normalization, citation backfill,
D5 script safety (upload-before-delete), BEG IV ingestion.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-02 22:22:55 +02:00
Benjamin Admin 3009f3d13a feat(embedding): add NIST/ENISA/standard section numbering to chunker
Extends _LEGAL_SECTION_RE to detect:
- Numbered sections: 1.1 Title, 2.3.1 Subtitle
- Control family IDs: AC-1, AU-2, PO.1, PW.1.1
- Table/Figure/Appendix references
Also adds EUR-Lex HTML replacement script.

58 embedding-service tests passing.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-02 19:24:10 +02:00
Benjamin Admin 5a6e588641 docs: update session handover — D2-D5 complete, EU PDF issue documented
Session achieved: structural metadata end-to-end (D2-D4), overlap bug
fix, HTML stripping with charset detection, 430/436 docs re-ingested.

Remaining: ~40 EU Official Journal PDFs need HTML from EUR-Lex (broken
multi-column PDF extraction), 3 missing EDPB PDFs, 1 corrupt PDF.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-02 17:34:34 +02:00
Benjamin Admin 41183ff93d fix(docker): set PDF_EXTRACTION_BACKEND to auto (was pymupdf)
The default was 'pymupdf' which doesn't exist as a backend, causing
fallthrough to pypdf every time. With 'auto', the priority is:
unstructured > pdfplumber > pypdf.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-02 17:30:33 +02:00
Benjamin Admin 75dda9ac92 feat(embedding): add pdfplumber backend for multi-column PDF extraction
EU Official Journal PDFs (AI Act, CRA, NIS2, DSGVO, etc.) use
multi-column layouts that pypdf breaks into fragmented words
("Ar tik el" instead of "Artikel"). pdfplumber handles these correctly.

Backend priority: unstructured > pdfplumber > pypdf (auto mode).
Also increases D5 re-ingestion timeout to 3600s for large PDFs.

58 embedding-service tests passing. pdfplumber: MIT license.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-02 15:42:25 +02:00