chore: bootstrap repo scaffolding (M0.1)

Adds the §1.2 scaffolding required by IMPLEMENTATION_PLAN.md M0.1: README, CONTRIBUTING, CODEOWNERS, CHANGELOG, PR + issue templates, CI workflow, release workflow, LICENSE, commitlint, cliff config, .editorconfig, .gitignore, .env.example.

Refs: M0.1
2026-05-18 21:07:15 +02:00
parent 8537fd69dd
commit d816ba2b22
20 changed files with 4761 additions and 1 deletions
@@ -0,0 +1,774 @@
+# Infrastructure Specification
+**Status:** Locked Topology
+**Authors:** Sharang, Benjamin
+**Date:** 2026-05-11 (topology lock: 2026-05-18)
+**Companion docs:** PLATFORM_ARCHITECTURE.md, IMPLEMENTATION_PLAN.md, COST_PLAN.md
+**Cloud provider:** SysEleven Cloud Services (DUS2, OpenStack)
+
+---
+
+## 1. VM Inventory
+
+**Four billable VMs total.** Three in production (one per plane after collapsing Identity+Infra), one in stage. Dev runs entirely on developer laptops via docker-compose.
+
+```
+┌──────────────┬─────────────────┬────────────────────────┬───────────┬─────────────────┐
+│ Name         │ Env             │ SysEleven flavor       │ Public IP │ Planes owned    │
+├──────────────┼─────────────────┼────────────────────────┼───────────┼─────────────────┤
+│ vm-edge      │ prod            │ m2.small  (2v / 8 GB)  │ YES (1)   │ Identity + Infra│
+│ vm-control   │ prod            │ m2.medium (4v / 16 GB) │ No        │ Control         │
+│ vm-data      │ prod            │ m2.medium (4v / 16 GB) │ No        │ Data            │
+│ stage        │ stage           │ m2.small  (2v / 8 GB)  │ YES (1)   │ App plane only  │
+│ (dev)        │ dev             │ local docker-compose   │ n/a       │ all (in-memory) │
+└──────────────┴─────────────────┴────────────────────────┴───────────┴─────────────────┘
+```
+
+**Total compute:** 48 GiB-RAM, 12 vCPU. **Monthly compute net: €192 (36M upfront) / €295 (12M) / €435 (On-Demand).** See COST_PLAN.md for the full three-mode table.
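
The totals can be re-derived from the inventory table. The snippet below is only a sanity check: the rows are copied from the table, and the tuple layout is an illustrative choice, not part of any tooling in this repo.

```python
# VM inventory copied from the table above: (name, vCPU, RAM in GB).
# The dev row is excluded because it runs on laptops and is not billable.
VMS = [
    ("vm-edge", 2, 8),
    ("vm-control", 4, 16),
    ("vm-data", 4, 16),
    ("stage", 2, 8),
]


def totals(vms):
    """Return (total vCPU, total RAM in GB) for rows of (name, vcpu, ram_gb)."""
    total_vcpu = sum(vcpu for _, vcpu, _ in vms)
    total_ram_gb = sum(ram for _, _, ram in vms)
    return total_vcpu, total_ram_gb


total_vcpu, total_ram_gb = totals(VMS)
print(f"{len(VMS)} billable VMs: {total_vcpu} vCPU, {total_ram_gb} GB RAM")
```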
+
+### Why this topology and not the previous 7-VM layout
+
+The earlier draft proposed one VM per service group (vm-gateway, vm-identity, vm-secrets, vm-ops, vm-control, vm-certifai, vm-compliance). That gave maximum failure isolation but cost 132 GiB-RAM stage+prod. At 5 customers the isolation is unused — every VM ran at <10% utilisation. The locked topology buys back failure isolation incrementally as load grows (see §13 Growth Trajectory).
+
+Critical isolations preserved even at 4 VMs:
+- **vm-edge isolates identity from app workloads.** Keycloak JVM has its own page cache; ERPNext background jobs cannot starve token issuance.
+- **vm-data isolates data-plane databases from control-plane workloads.** All data-plane DBs share one host with the product services, but they're walled off from the portal + ERPNext + Stalwart competing on vm-control.
+- **stage runs the app plane only.** It calls prod Keycloak + prod Tenant Registry under `tenant.kind = stage` rather than mirroring those services.
+
+---
+
+## 2. Service-to-VM Mapping
+
+```
+vm-edge   (prod, m2.small 8 GB, public IP)
+  ├── orca-proxy           (Orca-managed; wildcard TLS terminator)
+  ├── powerdns-auth        (Orca-managed; authoritative DNS for yourplatform.com)
+  ├── keycloak-26          (Orca-managed; JVM, ~1.5 GB heap)
+  ├── postgres-keycloak    (Orca-managed; dedicated PG instance for Keycloak only)
+  ├── infisical            (Orca-managed)
+  ├── postgres-infisical   (Orca-managed; dedicated PG instance for Infisical only)
+  ├── redis-infisical      (Orca-managed; ephemeral)
+  └── gitea                (Orca-managed; SQLite backend to avoid a third PG)
+
+vm-control   (prod, m2.medium 16 GB)
+  ├── customer-portal      (Orca-managed; Next.js)
+  ├── tenant-registry      (Orca-managed; Go)
+  ├── orca-controller      (Orca core process; NOT a managed container)
+  ├── erpnext              (Orca-managed; Frappe bench)
+  ├── frappe-hd            (same bench as ERPNext)
+  ├── mariadb              (Orca-managed; for ERPNext)
+  ├── redis-erpnext        (Orca-managed)
+  └── stalwart-mail        (Orca-managed; SMTP/IMAP/JMAP on mail.yourplatform.com)
+
+vm-data   (prod, m2.medium 16 GB)
+  ├── certifai-dashboard   (Orca-managed)
+  ├── mongodb              (Orca-managed)
+  ├── litellm              (Orca-managed)
+  ├── backend-compliance   (Orca-managed)
+  ├── ai-compliance-sdk    (Orca-managed)
+  ├── admin-compliance     (Orca-managed)
+  ├── postgres-app         (Orca-managed; schemas: tenant_registry, compliance)
+  ├── qdrant               (Orca-managed)
+  └── minio                (Orca-managed)
+
+stage   (stage, m2.small 8 GB, public IP)
+  ├── orca-proxy           (light; only routes to stage app)
+  ├── customer-portal      (NEW VERSION under test)
+  ├── tenant-registry      (NEW VERSION under test, talks to ephemeral PG below)
+  ├── certifai-dashboard   (NEW VERSION under test)
+  ├── backend-compliance   (NEW VERSION under test)
+  ├── ai-compliance-sdk    (NEW VERSION under test)
+  ├── admin-compliance     (NEW VERSION under test)
+  ├── litellm              (light; same image as prod)
+  ├── postgres-app-stage   (ephemeral; lives entirely on stage VM)
+  ├── mongodb-stage        (ephemeral)
+  └── qdrant-stage         (ephemeral, tiny corpus)
+
+  Calls OUT to prod:
+    → auth.yourplatform.com   (Keycloak token issuance, under stage client_id)
+    → mail.yourplatform.com   (Stalwart SMTP, recipient filter forces +stage@ only)
+    → Polar SANDBOX webhook URL (NEVER prod Polar)
+    → no calls to prod Postgres-app, MariaDB, MongoDB
+```
+
+### Stage isolation rules (enforced at the platform, not in product code)
+
+| Risk | Enforcement mechanism | Owner |
+|---|---|---|
+| Stage writes to prod database | Infisical scope: stage app only gets `/stage/*` secrets. Prod DB credentials never reach stage. | Infra plane |
+| Stage emails real customers | Stalwart accept-rule: drop if recipient does not match `*+stage@*`. | Control plane (Stalwart config) |
+| Stage triggers real Polar charges | Stage env points `POLAR_API_URL` to sandbox. Prod Polar webhook secret never on stage. | Control plane |
+| Stage Keycloak JWT used in prod | `stage_client_id` issued only by Keycloak; prod services reject JWTs with this aud. | Identity plane |
+| Stage load DOSes prod Keycloak | Keycloak rate-limit per client_id; stage limited to 60 req/s. | Identity plane |
+
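The recipient rule in the table can be illustrated with a glob match. This is a Python sketch of the matching logic only; the pattern `*+stage@*` is quoted from the table, while the function name is hypothetical and Stalwart's real rule syntax is not shown in this document.

```python
from fnmatch import fnmatchcase

# Pattern quoted from the isolation table; everything else is illustrative.
STAGE_RECIPIENT_PATTERN = "*+stage@*"


def stage_may_send(recipient: str) -> bool:
    """Return True only if the recipient carries the +stage tag."""
    return fnmatchcase(recipient, STAGE_RECIPIENT_PATTERN)


for address in ("qa+stage@example.com", "alice@customer.example"):
    verdict = "accept" if stage_may_send(address) else "drop"
    print(f"{address}: {verdict}")
```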
+---
+
+## 3. Network Topology
+
+```
+                          INTERNET
+                              │
+                       (yourplatform.com — authoritative on vm-edge PowerDNS;
+                        stage.yourplatform.com — authoritative same zone)
+                              │
+                ┌─────────────┴─────────────┐
+                │                           │
+        ┌───────▼────────┐         ┌────────▼─────────┐
+        │    vm-edge     │         │     stage        │
+        │   (public IP)  │         │   (public IP)    │
+        │                │         │                  │
+        │ orca-proxy ────┤         │ orca-proxy       │
+        │ powerdns       │         │ portal-new       │
+        │ keycloak       │◄────────┤ tenant-registry-new
+        │ pg-keycloak    │  stage  │ certifai-new     │
+        │ infisical      │  calls  │ compliance-new   │
+        │ pg-infisical   │  prod   │ pg-stage         │
+        │ redis-infis    │  KC +   │ mongo-stage      │
+        │ gitea          │  Stalwart│ qdrant-stage    │
+        └───────┬────────┘         └──────────────────┘
+                │  PRIVATE NETWORK  10.0.0.0/16
+       ┌────────┴─────────┐
+       │                  │
+┌──────▼───────┐  ┌───────▼──────┐
+│  vm-control  │  │   vm-data    │
+│              │  │              │
+│ portal       │  │ certifai     │
+│ tenant-reg   │  │ mongodb      │
+│ orca-ctrl    │  │ litellm      │
+│ erpnext      │  │ backend-comp │
+│ frappe-hd    │  │ ai-sdk       │
+│ mariadb      │  │ admin-comp   │
+│ redis-erp    │  │ pg-app       │
+│ stalwart     │  │ qdrant       │
+└──────────────┘  │ minio        │
+                  └──────────────┘
+
+Orca-Proxy routing (vm-edge, by Host header):
+  auth.yourplatform.com    → 127.0.0.1:8443  (Keycloak, local on vm-edge)
+  erp.yourplatform.com     → vm-control:8000 (ERPNext) [allowlist: our IPs only]
+  git.yourplatform.com     → vm-edge:3000    (Gitea, local) [allowlist: our IPs only]
+  mail.yourplatform.com    → vm-control:587  (Stalwart submission) [allowlist: VM internal only]
+  ns1.yourplatform.com     → 127.0.0.1:53    (PowerDNS, local)
+  *.yourplatform.com       → vm-control:3000 (customer portal)
+
+Orca-Proxy routing (stage, by Host header):
+  *.stage.yourplatform.com → 127.0.0.1:3000  (stage portal — all subdomains route here)
+```
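The vm-edge route table can be read as "exact hostnames first, wildcard last". The sketch below models that lookup in Python. The routes are copied from the table above; the exact-before-wildcard precedence and the `resolve` helper are assumptions for illustration, since the document does not describe how Orca-Proxy orders its matches.

```python
from fnmatch import fnmatchcase

# (Host pattern, backend) pairs copied from the vm-edge routing table.
EDGE_ROUTES = [
    ("auth.yourplatform.com", "127.0.0.1:8443"),
    ("erp.yourplatform.com", "vm-control:8000"),
    ("git.yourplatform.com", "vm-edge:3000"),
    ("mail.yourplatform.com", "vm-control:587"),
    ("ns1.yourplatform.com", "127.0.0.1:53"),
    ("*.yourplatform.com", "vm-control:3000"),
]


def resolve(host, routes=EDGE_ROUTES):
    """Return the backend for a Host header, or None if nothing matches.

    Assumed precedence: exact hostnames win over wildcard patterns.
    """
    host = host.lower()
    for pattern, backend in routes:
        if "*" not in pattern and host == pattern:
            return backend
    for pattern, backend in routes:
        if "*" in pattern and fnmatchcase(host, pattern):
            return backend
    return None


print(resolve("auth.yourplatform.com"))
print(resolve("acme.yourplatform.com"))
```

Under this assumed precedence, a tenant subdomain such as `acme.yourplatform.com` falls through to the customer portal, while the named hosts keep their dedicated backends.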
+
+---
+
+## 4. Storage and Volume Requirements
+
+Block volumes (Ceph, 3x replicated, €0.10/GiB/mo) are mounted on each VM.
+
+```
+┌──────────────┬───────────────────────────────────────────┬─────────┬─────────────────────┐
+│ VM           │ Data stores                               │ +Block  │ Growth profile      │
+├──────────────┼───────────────────────────────────────────┼─────────┼─────────────────────┤
+│ vm-edge      │ pg-keycloak + pg-infisical + Gitea repos  │ +50 GB  │ Slow                │
+│ vm-control   │ MariaDB (ERPNext) + Stalwart mail spool   │ +250 GB │ Medium              │
+│ vm-data      │ MongoDB + pg-app + Qdrant + MinIO         │ +500 GB │ Fast (scales w/ N)  │
+│ stage        │ pg-stage + mongo-stage + qdrant-stage     │ +50 GB  │ Resets per release  │
+└──────────────┴───────────────────────────────────────────┴─────────┴─────────────────────┘
+
+Each VM's root disk: 50 GB ephemeral, included in flavor price.
+
+Object storage (S3, €0.02/GiB/mo single-region or €0.0496/GiB/mo geo-redundant):
+  ┌─────────────────────────────────┬─────────┬──────────────────────────┐
+  │ Bucket                          │ Size    │ Purpose                  │
+  ├─────────────────────────────────┼─────────┼──────────────────────────┤
+  │ s3://backups (geo-redundant)    │ ~500 GB │ Database dumps           │
+  │ s3://seed-data                  │ ~30 GB  │ Demo tenant fixtures     │
+  │ s3://exports                    │ ~50 GB  │ GDPR/offboarding ZIPs    │
+  │ s3://audit-archive              │ ~20 GB  │ Old audit log overflow   │
+  └─────────────────────────────────┴─────────┴──────────────────────────┘
+```
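A rough monthly storage estimate follows from the two tables above. The sketch assumes the listed sizes are fully provisioned, treats GB and GiB as interchangeable, and assumes every bucket not marked geo-redundant is single-region; none of these assumptions is stated in the document, and COST_PLAN.md remains the authoritative source.

```python
BLOCK_PRICE_PER_GB = 0.10   # EUR per GiB per month, from the text above
S3_SINGLE_PER_GB = 0.02     # EUR per GiB per month, single-region
S3_GEO_PER_GB = 0.0496      # EUR per GiB per month, geo-redundant

# Extra block volume per VM, in GB, from the volume table.
BLOCK_GB = {"vm-edge": 50, "vm-control": 250, "vm-data": 500, "stage": 50}

# Bucket name -> (approximate size in GB, geo-redundant?).
# Only s3://backups is marked geo-redundant; the rest is an assumption.
BUCKETS = {
    "backups": (500, True),
    "seed-data": (30, False),
    "exports": (50, False),
    "audit-archive": (20, False),
}

block_cost = sum(BLOCK_GB.values()) * BLOCK_PRICE_PER_GB
object_cost = sum(
    size * (S3_GEO_PER_GB if geo else S3_SINGLE_PER_GB)
    for size, geo in BUCKETS.values()
)

print(f"block volumes:  EUR {block_cost:.2f}/month")
print(f"object storage: EUR {object_cost:.2f}/month")
```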
+
+---
+
+## 5. Backup Requirements
+
+All backups ship to **SysEleven Object Storage** (S3-compatible, geo-redundant DUS2 ↔ HAM1 for production-critical data). Backup jobs run as Orca one-shot containers on cron. Infisical holds the S3 credentials.
+
+```
+┌───────────────────────┬──────────────────┬────────────┬────────────┬──────────────────────┐
+│ Data store            │ Method           │ Frequency  │ Retention  │ Owner (who restores) │
+├───────────────────────┼──────────────────┼────────────┼────────────┼──────────────────────┤
+│ pg-keycloak (vm-edge) │ pg_dump → S3-geo │ Every 6h   │ 14 days    │ Infra Plane          │
+│ pg-infisical (vm-edge)│ pg_dump → S3-geo │ Daily      │ 30 days    │ Infra Plane          │
+│ Gitea (vm-edge)       │ gitea dump → S3  │ Daily      │ 30 days    │ Infra Plane          │
+│ Keycloak realm export │ KC export → S3   │ Daily      │ 14 days    │ Identity Plane (owns)│
+│ Infisical store       │ encrypted → S3   │ Daily      │ 30 days    │ Infra Plane          │
+│ MariaDB (vm-control)  │ mysqldump → S3   │ Every 6h   │ 30 days    │ Control Plane        │
+│ Stalwart queue/store  │ tar → S3         │ Daily      │ 7 days     │ Control Plane        │
+│ pg-app (vm-data)      │ pg_dump → S3-geo │ Every 6h   │ 30 days    │ Data Plane (owns RPO)│
+│ MongoDB (vm-data)     │ mongodump → S3   │ Daily      │ 30 days    │ Data Plane           │
+│ MinIO (vm-data)       │ mc mirror → S3   │ Daily      │ 90 days    │ Data Plane           │
+│ Qdrant (vm-data)      │ API snap → S3    │ Daily      │ 14 days    │ Data Plane (rebuild) │
+│ stage *               │ no backup        │ —          │ —          │ — (ephemeral)        │
+│ Orca config (IaC)     │ Gitea (VCS)      │ On commit  │ Forever    │ Infra Plane          │
+└───────────────────────┴──────────────────┴────────────┴────────────┴──────────────────────┘
+```
+
+### RPO by data criticality
+
+```
+CRITICAL (RPO ≤ 6h)
+  pg-keycloak       — org memberships, IdP config
+  pg-app            — tenant registry, compliance records
+  MariaDB/ERPNext   — sales orders, invoices, contracts
+
+IMPORTANT (RPO ≤ 24h)
+  MongoDB           — chat history, user preferences
+  MinIO             — compliance evidence documents
+  pg-infisical      — encrypted secrets
+  Stalwart store    — inbound webhooks, bounce records
+
+RECOVERABLE (RPO ≤ 48h, rebuildable)
+  Qdrant            — vector index (rebuildable from MinIO source documents)
+  Gitea             — code (mirrored on dev machines)
+  Keycloak export   — org structure (pg-keycloak is primary)
+
+NOT BACKED UP
+  stage (any data)  — by design; restored from seed bundles on each deploy
+  redis-*           — caches; restart cold
+```
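One way to keep the backup table and the RPO tiers consistent is to check that no store is backed up less often than its RPO allows. The sketch below does that with values copied from §5; the helper name is hypothetical, and it ignores dump duration, which in practice adds to the worst-case loss window.

```python
# store -> (backup interval in hours, RPO in hours), copied from §5.
STORES = {
    "pg-keycloak": (6, 6),
    "pg-app": (6, 6),
    "MariaDB": (6, 6),
    "MongoDB": (24, 24),
    "MinIO": (24, 24),
    "pg-infisical": (24, 24),
    "Stalwart": (24, 24),
    "Qdrant": (24, 48),
    "Gitea": (24, 48),
}


def rpo_violations(stores):
    """Return the stores whose backup interval exceeds their RPO."""
    return sorted(
        name
        for name, (interval_h, rpo_h) in stores.items()
        if interval_h > rpo_h
    )


violations = rpo_violations(STORES)
print("RPO violations:", violations or "none")
```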
+
+---
+
+## 6. Constraint Framework
+
+### Constraint types
+
+```
+AVAILABILITY   — required uptime percentage over rolling 30 days
+RTO            — Recovery Time Objective: max time to restore service after failure
+RPO            — Recovery Point Objective: max acceptable data loss window
+IaC            — service must be declared in Orca config, no manual container runs in prod
+SECRET_HYGIENE — all secrets via Infisical machine identity; no env files, no hardcoded values
+NETWORK        — whether service is internet-exposed or internal-only
+DATA_RESIDENCY — all data must remain in EU (SysEleven DUS2 + HAM1)
+AUDIT_TRAIL    — all mutating actions logged (who, what, when, from where)
+IMMUTABILITY   — config changes go through Gitea → Orca pipeline, not manual SSH
+STAGE_ISOLATION— stage tenant cannot mutate any prod data; read-only against prod KC + TR
+```
+
+### Plane ownership of constraints
+
+Even though planes now share VMs, the **ownership model is unchanged** — the plane that owns a constraint owns it regardless of which VM hosts the service. The Infra Plane (now collapsed onto vm-edge alongside the Identity plane) still mechanically enforces backup, IaC, secrets, and network constraints.
+
+```
+╔══════════════════════════════════════════════════════════════════════════════════════════╗
+║  IDENTITY PLANE  (on vm-edge)                                                            ║
+║                                                                                          ║
+║  Owns / defines:                                                                         ║
+║    AVAILABILITY  — must be ≥ 99.5% (root dep for everything)                            ║
+║    RTO           — ≤ 15 min                                                              ║
+║    AUDIT_TRAIL   — realm-level audit (logins, token issuance, IdP events)               ║
+║    DATA_RESIDENCY— Keycloak realm data must stay EU                                     ║
+║    STAGE_ISOLATION— rate-limits stage_client_id; rejects stage JWTs in prod audiences   ║
+║                                                                                          ║
+║  Co-tenant note: shares vm-edge with Infra Plane services. JVM heap pinned to 1.5 GB    ║
+║  in Orca manifest so it cannot starve PowerDNS / Infisical.                              ║
+╚══════════════════════════════════════════════════════════════════════════════════════════╝
+
+╔══════════════════════════════════════════════════════════════════════════════════════════╗
+║  CONTROL PLANE  (on vm-control)                                                          ║
+║                                                                                          ║
+║  Owns / defines:                                                                         ║
+║    RPO (tenant)   — tenant registry & compliance schemas RPO ≤ 6h                       ║
+║    RPO (ERPNext)  — sales orders, invoices RPO ≤ 6h                                     ║
+║    AUDIT_TRAIL    — all portal actions (invites, IdP changes, impersonations)            ║
+║    AVAILABILITY   — portal ≥ 99.5%; ERPNext ≥ 99% (internal)                            ║
+║    RTO (portal)   — ≤ 10 min                                                             ║
+║    RTO (ERPNext)  — ≤ 60 min                                                             ║
+║                                                                                          ║
+║  Co-tenant note: ERPNext + Portal + Stalwart on one VM. Orca resource limits enforced:  ║
+║    portal:   1 GB memory cap                                                             ║
+║    erpnext:  6 GB memory cap                                                             ║
+║    mariadb:  3 GB memory cap                                                             ║
+║    stalwart: 1 GB memory cap                                                             ║
+║    tenant-registry: 500 MB                                                                ║
+╚══════════════════════════════════════════════════════════════════════════════════════════╝
+
+╔══════════════════════════════════════════════════════════════════════════════════════════╗
+║  DATA PLANE  (on vm-data)                                                                ║
+║                                                                                          ║
+║  Owns / defines:                                                                         ║
+║    DATA_RESIDENCY — all customer data (MongoDB, pg-app, MinIO) must stay EU              ║
+║    RPO (product)  — compliance records ≤ 6h; chat history ≤ 24h                         ║
+║    DATA_ISOLATION — every query scoped by org_id/tenant_id                               ║
+║    AUDIT_TRAIL    — product-level actions                                                ║
+║    AVAILABILITY   — CERTifAI ≥ 99.5%; compliance ≥ 99.5%                                ║
+║                                                                                          ║
+║  Co-tenant note: this VM is the SCALE driver. When vm-data hits 80% RAM, bump flavor    ║
+║  (m2.medium → m2.large → m2.xlarge). See §13 Growth Trajectory.                          ║
+╚══════════════════════════════════════════════════════════════════════════════════════════╝
+
+╔══════════════════════════════════════════════════════════════════════════════════════════╗
+║  INFRA PLANE  (on vm-edge, alongside Identity)                                           ║
+║                                                                                          ║
+║  Owns / enforces ALL of:                                                                 ║
+║    BACKUP        — executes all backup jobs (pg_dump, mongodump, mc mirror)             ║
+║    IaC           — ALL services declared in Orca config; no manual prod changes         ║
+║    IMMUTABILITY  — config changes: Gitea commit → Gitea Actions → Orca API only         ║
+║    SECRET_HYGIENE— Infisical (on vm-edge); provisions machine identities                ║
+║    NETWORK       — Orca-Proxy rules; VM firewall; no direct VM public exposure          ║
+║    DATA_RESIDENCY— VM region = SysEleven DUS2; backups geo-redundant DUS2↔HAM1          ║
+║    AVAILABILITY  — Orca restart policies, health checks                                  ║
+║    COLD_START    — enforces startup ordering (see §10 Scenario F)                       ║
+║    STAGE_ISOLATION— Infisical secret-path scoping for stage_app identity                ║
+╚══════════════════════════════════════════════════════════════════════════════════════════╝
+```
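The vm-control memory caps in the Control Plane box can be checked against the VM size. This is a back-of-the-envelope sketch: caps are copied from the box, decimal units (1 GB = 1000 MB) are an assumption, and the services without a listed cap (redis-erpnext, frappe-hd, orca-controller, the OS itself) have to fit into the remaining headroom.

```python
VM_CONTROL_RAM_MB = 16_000  # m2.medium, 16 GB (decimal units assumed)

# Per-service memory caps from the Control Plane co-tenant note, in MB.
CAPS_MB = {
    "portal": 1000,
    "erpnext": 6000,
    "mariadb": 3000,
    "stalwart": 1000,
    "tenant-registry": 500,
}

capped_mb = sum(CAPS_MB.values())
headroom_mb = VM_CONTROL_RAM_MB - capped_mb

print(f"capped: {capped_mb} MB, headroom for uncapped services: {headroom_mb} MB")
```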
+
+---
+
+## 7. SLA Table
+
+```
+┌───────────────────────┬──────────────┬─────────┬─────────┬────────────────────────────────┐
+│ Service               │ Availability │ RTO     │ RPO     │ Host VM                        │
+├───────────────────────┼──────────────┼─────────┼─────────┼────────────────────────────────┤
+│ Orca-Proxy            │ 99.9%        │ 5 min   │ N/A     │ vm-edge                        │
+│ PowerDNS              │ 99.9%        │ 5 min   │ N/A     │ vm-edge                        │
+│ Keycloak              │ 99.5%        │ 15 min  │ 6h      │ vm-edge (root auth dep)        │
+│ Infisical             │ 99.5%        │ 30 min  │ 24h     │ vm-edge (running svcs survive) │
+│ Gitea                 │ 99%          │ 2h      │ 24h     │ vm-edge (dev machines mirror)  │
+│ Customer Portal       │ 99.5%        │ 10 min  │ N/A     │ vm-control                     │
+│ Tenant Registry       │ 99.5%        │ 10 min  │ 6h      │ vm-control                     │
+│ ERPNext               │ 99%          │ 60 min  │ 6h      │ vm-control (internal only)     │
+│ Frappe HD             │ 99%          │ 60 min  │ 24h     │ vm-control                     │
+│ MariaDB               │ 99.5%        │ 20 min  │ 6h      │ vm-control                     │
+│ Stalwart Mail         │ 99%          │ 60 min  │ 24h     │ vm-control                     │
+│ CERTifAI              │ 99.5%        │ 10 min  │ 24h     │ vm-data                        │
+│ MongoDB               │ 99.5%        │ 20 min  │ 24h     │ vm-data                        │
+│ LiteLLM               │ 99%          │ 5 min   │ N/A     │ vm-data                        │
+│ backend-compliance    │ 99.5%        │ 10 min  │ 6h      │ vm-data                        │
+│ ai-compliance-sdk     │ 99.5%        │ 10 min  │ 6h      │ vm-data                        │
+│ pg-app                │ 99.9%        │ 20 min  │ 6h      │ vm-data (SPOF — RISK-1)        │
+│ MinIO                 │ 99.5%        │ 30 min  │ 24h     │ vm-data                        │
+│ Qdrant                │ 99%          │ 2h      │ 24h     │ vm-data (rebuildable)          │
+│ stage (any service)   │ 95%          │ best ef.│ N/A     │ stage (ephemeral; no SLA)      │
+└───────────────────────┴──────────────┴─────────┴─────────┴────────────────────────────────┘
+```
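The availability column translates into a downtime budget over the rolling 30-day window defined in §6. The helper below is a plain arithmetic sketch (its name is illustrative); it helps when comparing a service's RTO against how much downtime its SLA actually allows.

```python
WINDOW_MINUTES = 30 * 24 * 60  # rolling 30 days, per the AVAILABILITY constraint


def downtime_budget_minutes(availability_pct):
    """Minutes of allowed downtime per 30-day window for a given availability."""
    return (100.0 - availability_pct) / 100.0 * WINDOW_MINUTES


for pct in (99.9, 99.5, 99.0, 95.0):
    print(f"{pct}% -> {downtime_budget_minutes(pct):.1f} min / 30 days")
```

For example, a 99.5% service has roughly 216 minutes of budget per window, so a single incident that uses the full 15-minute Keycloak RTO consumes about 7% of it.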
+
+---
+
+## 8. IaC Constraint (Orca)
+
+Every production service is declared in Orca config. No exceptions.
+
+### Rules
+
+```
+1. ALL containers run via Orca manifests committed to Gitea
+   → /orca/manifests/{vm-name}/{service-name}.toml
+   → Changes go through: Gitea PR → Gitea Actions lint → Orca API apply
+
+2. NO manual docker run / docker-compose up on any production VM
+   → SSH to prod VMs allowed for debugging only; no state changes
+
+3. Secrets are NEVER in Orca manifests
+   → Manifests reference Infisical paths, not values
+   → Bootstrap exception: Keycloak DB URI in Orca env (Keycloak runs ON vm-edge alongside
+     Infisical, so chicken-and-egg is solved by Orca env file, not Infisical lookup)
+
+4. Restart policy: always (Orca restarts crashed containers with exponential backoff)
+   → Health check per service (HTTP /health or TCP probe)
+
+5. Resource limits MANDATORY in every manifest
+   → On a 3-VM prod, co-tenant noise is the single biggest risk; limits are non-negotiable
+   → See §6 Plane ownership "Co-tenant note" boxes for the per-service caps
+
+6. Orca controller state itself is recoverable
+   → Manifest files in Gitea = desired state
+   → Loss of Orca controller = re-apply manifests from Gitea, services continue running
+
+7. Stage app gets its own Infisical scope
+   → /stage/* path; no prod-DB credentials reach this scope
+   → Enforced at Infisical machine-identity level, not in app code
+```
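Rules 3 to 5 lend themselves to a lint step. The sketch below shows the shape such a check could take. It is hypothetical throughout: the Orca manifest schema is not shown in this document, so the field names (`memory_limit_mb`, `restart`, `health_check`, `secrets`) and the convention that secret references are Infisical paths starting with `/` are invented for illustration.

```python
def lint_manifest(manifest: dict) -> list:
    """Return a list of rule violations for one (hypothetical) manifest dict."""
    errors = []
    # Rule 5: resource limits are mandatory.
    if not manifest.get("memory_limit_mb"):
        errors.append("rule 5: missing memory limit")
    # Rule 4: restart policy 'always' plus a health check.
    if manifest.get("restart") != "always":
        errors.append("rule 4: restart policy must be 'always'")
    if not manifest.get("health_check"):
        errors.append("rule 4: missing health check")
    # Rule 3: secrets are referenced by Infisical path, never inlined.
    for name, ref in manifest.get("secrets", {}).items():
        if not str(ref).startswith("/"):
            errors.append(f"rule 3: secret {name} is not an Infisical path")
    return errors


good = {
    "memory_limit_mb": 1000,
    "restart": "always",
    "health_check": "/health",
    "secrets": {"DB_URI": "/prod/portal/db_uri"},
}
bad = {"restart": "on-failure", "secrets": {"DB_URI": "postgres://user:pw@host/db"}}

print(lint_manifest(good))
print(len(lint_manifest(bad)), "violations in the bad manifest")
```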
+
+### Gitea Actions pipeline for infra changes
+
+```
+infra change committed to Gitea
+  │
+  ├── lint: validate Orca manifest schema
+  ├── diff: show what changes will be applied (orca plan)
+  ├── (manual approval gate for vm-edge changes — touches auth root)
+  └── apply: POST to Orca Controller API → rolling update
+```
+
+---
+
+## 9. Dependency Graph
+
+Arrows = "requires to function." Edges marked `(soft)` are soft dependencies (the dependent degrades, doesn't fail).
+**Intra-VM dependencies elided** for clarity (e.g. Keycloak ↔ pg-keycloak are on the same host and start together).
+
+```
+         EXTERNAL
+         AI APIs
+         (OpenAI / Anthropic)
+              │
+              │ (soft)
+              ▼
+┌──────────────────────────────────────────────────────────────────┐
+│  vm-edge   (Identity + Infra)                                    │
+│  ┌────────────────────────────────────────────────────────────┐  │
+│  │  pg-keycloak ──► keycloak                                  │  │
+│  │  pg-infisical ─► infisical  ◄── (all VMs pull on startup)  │  │
+│  │  redis-infis ──► infisical                                  │  │
+│  │  (sqlite) ─────► gitea                                      │  │
+│  │  powerdns-auth  (no deps)                                   │  │
+│  │  orca-proxy     (route table only; backends are remote)     │  │
+│  └────────────────────────────────────────────────────────────┘  │
+│                            │ Keycloak JWKS  │ Infisical /secrets │
+│                            │                │                    │
+└────────────────────────────┼────────────────┼────────────────────┘
+                             ▼                ▼
+┌──────────────────────────────────────────────────────────────────┐
+│  vm-control   (Control)                                          │
+│  ┌────────────────────────────────────────────────────────────┐  │
+│  │  mariadb + redis-erp ──► erpnext + frappe-hd               │  │
+│  │  (intra) ─────────────► stalwart                            │  │
+│  │  ──────────────────────► customer-portal                    │  │
+│  │  ──────────────────────► tenant-registry ──► pg-app (vm-data)│  │
+│  └────────────────────────────────────────────────────────────┘  │
+│                            │ tenant-registry API                  │
+└────────────────────────────┼─────────────────────────────────────┘
+                             ▼
+┌──────────────────────────────────────────────────────────────────┐
+│  vm-data   (Data)                                                │
+│  ┌────────────────────────────────────────────────────────────┐  │
+│  │  mongodb ───► certifai ◄── (vm-edge JWKS, vm-edge secrets) │  │
+│  │  litellm ───► certifai, ai-compliance-sdk                  │  │
+│  │  pg-app ────► tenant-registry-on-vm-control, backend-compl,│  │
+│  │              ai-compliance-sdk                              │  │
+│  │  qdrant ────► ai-compliance-sdk                            │  │
+│  │  minio  ────► backend-compliance                           │  │
+│  │  backend-compliance ──► admin-compliance                   │  │
+│  └────────────────────────────────────────────────────────────┘  │
+└──────────────────────────────────────────────────────────────────┘
+
+┌──────────────────────────────────────────────────────────────────┐
+│  stage   (App plane only)                                        │
+│  Calls vm-edge:8443 (KC) + vm-control:587 (Stalwart submission)  │
+│  Calls Polar SANDBOX (never prod Polar webhook URL)              │
+│  Its own ephemeral DBs; cannot read prod data                    │
+└──────────────────────────────────────────────────────────────────┘
+```
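A dependency graph like the one above implies a valid startup order, which a topological sort can produce. The sketch below uses Python's standard `graphlib` module. Edges are transcribed from the diagram, plus the portal to tenant-registry edge from the critical-path view in this section; it illustrates the ordering problem only and is not how Orca computes its sequence, which this document does not describe.

```python
from graphlib import TopologicalSorter

# service -> set of services it requires (hard dependencies only).
REQUIRES = {
    "keycloak": {"pg-keycloak"},
    "infisical": {"pg-infisical", "redis-infisical"},
    "erpnext": {"mariadb", "redis-erpnext"},
    "tenant-registry": {"pg-app"},
    "customer-portal": {"tenant-registry"},
    "certifai": {"mongodb", "litellm"},
    "ai-compliance-sdk": {"litellm", "pg-app", "qdrant"},
    "backend-compliance": {"pg-app", "minio"},
    "admin-compliance": {"backend-compliance"},
}

# static_order() yields every node after all of its dependencies.
order = list(TopologicalSorter(REQUIRES).static_order())

for position, service in enumerate(order, start=1):
    print(f"{position:2d}. {service}")
```

Several orders are valid for the same graph; the only guarantee is that each service appears after everything it requires. A cycle in `REQUIRES` would raise `graphlib.CycleError`, which is a cheap way to detect a true startup deadlock.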
+
+### Simplified critical path (customer login → product use)
+
+```
+  DNS (vm-edge PowerDNS)
+   │
+   ▼
+  orca-proxy (vm-edge)
+   │
+   ├──► keycloak (vm-edge) ──► pg-keycloak (intra-VM)
+   │
+   └──► customer-portal (vm-control)
+          ├──► tenant-registry (vm-control) ──► pg-app (vm-data)
+          ├──► certifai (vm-data) ──► mongodb (intra-VM)
+          └──► backend-compliance (vm-data) ──► pg-app (intra-VM)
+                                          ──► ai-sdk ──► qdrant + minio
+                                          ──► litellm ──► [external AI APIs]
+```
+
+---
+
+## 10. Failure Scenarios and Deadlock Analysis
+
+### Scenario A — vm-edge fails (HIGHEST SEVERITY)
+
+```
+Impact:    TOTAL outage. Nothing reachable from internet.
+           No DNS. No TLS. No auth. No new logins. Running JWTs expire within 15 min,
+           then ALL services start returning 401.
+           Backstage and customer portal both fully blocked.
+           Stage also blocked (depends on prod Keycloak).
+Cascade:   T+0:    DNS fails  → orca-proxy unreachable
+           T+5m:   existing JWTs still valid; portal cached → partial reads work
+           T+15m:  JWTs expire → full outage
+Deadlock:  None — services downstream don't deadlock, they just fail closed
+Recovery:  1. Spin up vm-edge-spare (cold standby, same Orca config) — ~3 min provision
+           2. Restore pg-keycloak + pg-infisical from latest backup — ~5 min
+           3. Swap registrar NS records to spare IP (TTL 60s) — ~2 min propagation
+           4. Restart all services on vm-edge-spare via Orca apply — ~3 min
+           Total RTO target: 15 min
+Mitigation: COLD STANDBY vm-edge-spare. Same Orca config committed in Gitea.
+           Provision cost when idle: €0 (only billed when running).
+           Test recovery quarterly.
+Severity:  CRITICAL — single host owns 3 root dependencies (DNS, auth, secrets)
+Cost of fix at Tier C: split vm-edge into vm-edge + vm-identity + vm-secrets
+           (back toward original 7-VM design) — €100/mo extra
+```
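The Scenario A recovery steps can be summed against the RTO target. The sketch assumes the four steps run strictly one after another with no overlap and that each estimate holds; the durations are copied from the recovery list above.

```python
# (step, estimated minutes), copied from the Scenario A recovery list.
RECOVERY_STEPS_MIN = [
    ("provision vm-edge-spare", 3),
    ("restore pg-keycloak + pg-infisical", 5),
    ("NS record swap propagation", 2),
    ("Orca apply on vm-edge-spare", 3),
]
RTO_TARGET_MIN = 15

total_min = sum(minutes for _, minutes in RECOVERY_STEPS_MIN)
slack_min = RTO_TARGET_MIN - total_min

print(f"sequential recovery: {total_min} min, slack vs target: {slack_min} min")
```

With only two minutes of slack under these assumptions, any single step that overruns its estimate pushes recovery past the point where existing JWTs expire.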
+
+### Scenario B — vm-control fails (NEW — consequence of plane consolidation)
+
+```
+Impact:    customer-portal: DOWN → /[slug]/* all return 503
+           tenant-registry: DOWN → Keycloak protocol-mapper for products claim breaks
+                                    → users can log in but see "No active products"
+           ERPNext + Frappe HD: DOWN → we cannot create sales orders or read tickets
+           Stalwart: DOWN → no outbound emails (trial nudges, exports, ticket replies)
+           MariaDB: DOWN → ERPNext queries fail; backups paused
+           Products (CERTifAI, compliance): UNAFFECTED (on vm-data, JWTs still validate)
+           Existing logged-in users: can use products directly via product subdomain
+                                     IF they bookmark it; portal home is 503.
+Cascade:   T+0:    portal 503; new tenant onboarding blocked (registry down)
+           T+15m:  existing JWTs missing refreshed products claim
+           T+1h:   trial emails not sent → trial nudge cadence breaks
+Deadlock:  None
+Recovery:  Restart vm-control containers via Orca. If MariaDB corrupt: restore mysqldump.
+           RTO target: 10 min (portal) / 60 min (ERPNext)
+Mitigation: Multiple services co-hosted = single failure hits many SLAs.
+           Resource limits in Orca prevent ERPNext OOM from killing portal.
+           Quarterly drill: deliberately stop portal, measure recovery.
+Severity:  HIGH — three services down at once, but products keep serving customers
+Cost of fix at Tier B/C: split vm-control → vm-portal + vm-ops (ERPNext)
+           — €64/mo extra at m2.small
+```
+
+### Scenario C — vm-data fails
+
+```
+Impact:    tenant-registry queries: FAIL (pg-app down) → portal returns 503 for tenant lookup
+           customer-portal: DEGRADED (login works, dashboard fails)
+           CERTifAI: COMPLETELY DOWN
+           backend-compliance + ai-sdk + admin: COMPLETELY DOWN
+           ERPNext + Stalwart: UNAFFECTED
+Cascade:   T+0: products down; portal degraded
+           T+15m: support tickets pile up
+           Note: prod is partial — users see error pages but ERPNext + auth still work
+Recovery:  Restart vm-data containers. If pg-app corrupt: restore from pg_dump (RPO 6h).
+           RTO target: 20 min
+Mitigation: This is the SCALE-event VM. RISK-1 below makes this the worst SPOF:
+           one pg-app instance owns tenant_registry + compliance schemas.
+           HIGH PRIORITY fix: split pg-app into separate clusters at Tier B/C transition.
+Severity:  HIGH — products down, business operations (ERPNext) still work so we can
+           contact customers
+```
+
+### Scenario D — LiteLLM fails
+
+```
+Impact:    CERTifAI: AI features fail (summarization, chat completion).
+           CERTifAI dashboard, sessions: UNAFFECTED.
+           compliance AI generation: FAILS (DSFA/TOM/VVT generation blocked).
+           Compliance CRUD: UNAFFECTED.
+Cascade:   Soft degradation only. Products show "AI features temporarily unavailable" banner.
+Deadlock:  None.
+Recovery:  Restart LiteLLM on vm-data (stateless, ~30s).
+Severity:  MEDIUM — graceful degradation by design
+```
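+
+The graceful degradation in Scenario D can be sketched as a thin wrapper around the LiteLLM call. This is a minimal Python illustration, not the products' actual code: `generate_with_fallback`, `AiResult`, and the choice of exceptions treated as "LiteLLM down" are assumptions; only the banner text comes from the scenario above.
+
```python
from dataclasses import dataclass
from typing import Callable, Optional

# Banner text taken from Scenario D; everything else here is hypothetical.
BANNER = "AI features temporarily unavailable"


@dataclass
class AiResult:
    text: Optional[str]    # generated text, or None when degraded
    banner: Optional[str]  # banner to show in the UI, or None when healthy


def generate_with_fallback(call_llm: Callable[[str], str], prompt: str) -> AiResult:
    """Return generated text, or a banner-only result if the LLM gateway is unreachable."""
    try:
        return AiResult(text=call_llm(prompt), banner=None)
    except (ConnectionError, TimeoutError):
        # Assumption: gateway outages surface as connection or timeout errors.
        return AiResult(text=None, banner=BANNER)


def litellm_down(prompt: str) -> str:
    """Stand-in for a LiteLLM client call during an outage."""
    raise ConnectionError("litellm unreachable")


degraded = generate_with_fallback(litellm_down, "Summarise this DSFA")
healthy = generate_with_fallback(lambda p: "summary", "Summarise this DSFA")
```
+
+Non-AI paths (dashboard, sessions, compliance CRUD) never pass through a wrapper like this, which is why they stay unaffected.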
+
+### Scenario E — Stage VM compromised or buggy
+
+```
+Impact:    On stage itself: stage portal serves bad data; stage testers see errors.
+           On prod: NONE if isolation rules in §2 are intact.
+           Worst case if isolation breaks:
+             - Stage code tries to call prod pg-app → fails (no creds in /stage/* Infisical)
+             - Stage emits real email → blocked by Stalwart recipient filter
+             - Stage triggers Polar charge → goes to sandbox, no real money
+Cascade:   None to prod by design.
+Recovery:  Roll back stage to previous image via Orca. RTO target: 5 min.
+Mitigation: The 5 enforcement rules in §2 are the load-bearing controls. Verify quarterly
+           via deliberate red-team: try to write to prod pg-app from stage and confirm the
+           attempt is rejected (authentication failure).
+Severity:  LOW (in prod) / HIGH (on stage, but stage SLA is 95%)
+```
+
+### Scenario F — Full Cold Start (Power Loss, All VMs Restart Simultaneously)
+
+```
+All three prod VMs boot at once. Services must start in dependency order; otherwise they
+crash-loop until their dependencies are ready.
+
+DEADLOCK RISK: vm-control services (portal, tenant-registry) may start before vm-data
+               services (pg-app, certifai, compliance) are ready. They then crash-loop for
+               ~2-5 min with backoff retries.
+               Same for ERPNext on vm-control trying to reach Keycloak on vm-edge.
+
+RESOLUTION: Orca enforces cross-VM startup ordering via health-check dependencies.
+            Bootstrap exception: Keycloak DB URI in Orca env on vm-edge (not from
+            Infisical — chicken-and-egg solved).
+
+Required cold start sequence:
+
+  Phase 0 — Data roots on vm-data (parallel):
+    pg-app, mongodb, qdrant, minio
+  Phase 0 — Data roots on vm-control (parallel):
+    mariadb, redis-erpnext
+  Phase 0 — Data roots on vm-edge (parallel):
+    pg-keycloak, pg-infisical, redis-infisical
+
+  Phase 1 — Secrets + DNS on vm-edge:
+    infisical  (needs: pg-infisical, redis-infisical)
+    powerdns-auth  (no deps)
+
+  Phase 2 — Identity on vm-edge:
+    keycloak  (needs: pg-keycloak [Phase 0], infisical [Phase 1])
+    gitea     (needs: infisical [Phase 1]; sqlite is embedded, so no separate DB service)
+
+  Phase 3 — Control on vm-control + Data services on vm-data (parallel):
+    tenant-registry  (needs: keycloak [Phase 2], pg-app [Phase 0, remote])
+    erpnext + frappe-hd  (needs: mariadb, redis-erpnext [Phase 0], keycloak [Phase 2])
+    stalwart  (needs: infisical [Phase 1])
+    litellm  (needs: infisical)
+    certifai  (needs: keycloak, mongodb, litellm)
+    backend-compliance  (needs: keycloak, pg-app)
+    ai-compliance-sdk   (needs: pg-app, qdrant, litellm)
+    admin-compliance    (needs: backend + sdk)
+
+  Phase 4 — Customer-facing on vm-control:
+    customer-portal  (needs: keycloak, tenant-registry)
+
+  Phase 5 — Gateway on vm-edge (last):
+    orca-proxy  (waits for all backends healthy before opening listener)
+
+Estimated cold-start time: 6-10 minutes (faster than the 7-VM layout: fewer cross-VM network round trips)
+```
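+
+The phase list above is effectively a topological ordering of the dependency graph. As a hedged illustration (this is not how Orca is configured; the `DEPS` mapping is transcribed by hand from the sequence above and `startup_waves` is a hypothetical helper), Python's standard `graphlib` can derive the earliest wave in which each service may start:
+
```python
from graphlib import TopologicalSorter

# Service -> services it needs, transcribed from the cold-start sequence.
# gitea, minio and orca-proxy are left out: the first two have no listed
# service dependencies here, and orca-proxy simply waits for everything.
DEPS = {
    "infisical": {"pg-infisical", "redis-infisical"},
    "powerdns-auth": set(),
    "keycloak": {"pg-keycloak", "infisical"},
    "tenant-registry": {"keycloak", "pg-app"},
    "erpnext": {"mariadb", "redis-erpnext", "keycloak"},
    "stalwart": {"infisical"},
    "litellm": {"infisical"},
    "certifai": {"keycloak", "mongodb", "litellm"},
    "backend-compliance": {"keycloak", "pg-app"},
    "ai-compliance-sdk": {"pg-app", "qdrant", "litellm"},
    "admin-compliance": {"backend-compliance", "ai-compliance-sdk"},
    "customer-portal": {"keycloak", "tenant-registry"},
}


def startup_waves(deps):
    """Group services into waves; a wave only needs services from earlier waves."""
    sorter = TopologicalSorter(deps)
    sorter.prepare()  # raises graphlib.CycleError if the graph has a cycle
    waves = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())
        waves.append(ready)
        sorter.done(*ready)
    return waves


waves = startup_waves(DEPS)
for index, wave in enumerate(waves):
    print(f"wave {index}: {', '.join(wave)}")
```
+
+The computed waves are the earliest possible start points, so some services (stalwart, litellm) land earlier than in the phases above, which group services more conservatively per VM. A `CycleError` from `prepare()` would indicate a true deadlock rather than a crash-loop that resolves on its own.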
+
+### Scenario G — Tenant Registry fails
+
+```
+Impact:    Portal cannot resolve tenant from subdomain → /[slug]/* all 503
+           Keycloak protocol mapper cannot get products claim → JWT missing field
+                → users can log in but see "No active products"
+           Products (CERTifAI, compliance) themselves: UNAFFECTED if already authenticated
+Cascade:   New logins degraded.
+           Existing sessions continue.
+Deadlock:  None.
+Recovery:  Restart tenant-registry on vm-control. pg-app on vm-data must be healthy.
+           RTO target: ≤ 60s
+Mitigation: Portal caches slug → tenant mapping with 60s TTL.
+           Outages shorter than the TTL are invisible to customers whose slug is already cached.
+Severity:  MEDIUM
+```
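+
+The 60s slug cache in the mitigation can be illustrated with a small TTL cache. This is a sketch under assumptions: `SlugCache`, its `lookup` callback, and the injectable `clock` are hypothetical names, and the real portal may cache differently.
+
```python
import time


class SlugCache:
    """Cache slug -> tenant lookups for a fixed TTL (60s in Scenario G)."""

    def __init__(self, lookup, ttl_seconds=60.0, clock=time.monotonic):
        self._lookup = lookup      # callable(slug) -> tenant id; hits tenant-registry
        self._ttl = ttl_seconds
        self._clock = clock        # injectable so the behaviour can be tested
        self._entries = {}         # slug -> (tenant_id, expires_at)

    def resolve(self, slug):
        now = self._clock()
        entry = self._entries.get(slug)
        if entry is not None and entry[1] > now:
            return entry[0]        # fresh hit: tenant-registry is not contacted
        # Miss or expired: this call fails if tenant-registry is down.
        tenant_id = self._lookup(slug)
        self._entries[slug] = (tenant_id, now + self._ttl)
        return tenant_id
```
+
+With this shape, a registry outage is masked only for slugs that were resolved within the last 60 seconds; a first-time or expired slug still surfaces the failure.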
+
+---
+
+## 11. Cross-Dependency Summary Table
+
+```
+              Needs → │PG-KC│PG-Inf│PG-App│Mongo│Maria│Redis│Minio│Qdrant│ KC  │Infis│Lit. │T.Reg│
+─────────────────────┼─────┼──────┼──────┼─────┼─────┼─────┼─────┼──────┼─────┼─────┼─────┼─────┤
+keycloak             │  ●  │      │      │     │     │     │     │      │     │ ◐*  │     │     │
+infisical            │     │  ●   │      │     │     │  ●  │     │      │     │     │     │     │
+gitea                │     │      │      │     │     │     │     │      │     │  ●  │     │     │
+tenant-registry      │     │      │  ●   │     │     │     │     │      │  ●  │  ●  │     │     │
+customer-portal      │     │      │      │     │     │     │     │      │  ●  │  ●  │     │  ●  │
+erpnext              │     │      │      │     │  ●  │  ●  │     │      │  ●  │  ●  │     │     │
+frappe-hd            │     │      │      │     │  ●  │  ●  │     │      │     │  ●  │     │     │
+stalwart             │     │      │      │     │     │     │     │      │     │  ●  │     │     │
+certifai             │     │      │      │  ●  │     │     │     │      │  ●  │  ●  │  ◐  │     │
+litellm              │     │      │      │     │     │     │     │      │     │  ●  │     │     │
+backend-compl.       │     │      │  ●   │     │     │     │     │      │  ●  │  ●  │     │     │
+ai-compl-sdk         │     │      │  ●   │     │     │     │     │  ●   │     │  ●  │  ◐  │     │
+admin-compl.         │     │      │      │     │     │     │     │      │     │     │     │     │
+orca-proxy           │     │      │      │     │     │     │     │      │     │     │     │     │
+stage-app            │     │      │      │     │     │     │     │      │  ●  │  ◑  │     │  ◑  │
+
+● = hard dependency (cannot start without)
+◐ = soft dependency (starts, features degrade)
+◑ = stage-only read-mostly dependency (writes blocked by Infisical scope)
+◐*= bootstrap exception (Keycloak DB URI in Orca env on vm-edge, not Infisical)
+Empty rows: admin-compl. needs only backend-compl. + ai-compl-sdk, and orca-proxy waits for
+            all backends; neither dependency is a column in this table.
+```
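+
+The table can also be read as a graph for estimating blast radius. The following Python sketch transcribes only the hard (●) dependencies by hand and computes which services could not start if one component were down; `HARD_DEPS` and `cannot_start_without` are hypothetical names, and the single `redis` key mirrors the table's one Redis column even though the cold-start sequence lists separate Redis instances.
+
```python
# Hard (●) dependencies only, transcribed from the table above.
# Soft (◐), stage-only (◑) and bootstrap-exception (◐*) entries are omitted.
HARD_DEPS = {
    "keycloak": {"pg-keycloak"},
    "infisical": {"pg-infisical", "redis"},
    "gitea": {"infisical"},
    "tenant-registry": {"pg-app", "keycloak", "infisical"},
    "customer-portal": {"keycloak", "infisical", "tenant-registry"},
    "erpnext": {"mariadb", "redis", "keycloak", "infisical"},
    "frappe-hd": {"mariadb", "redis", "infisical"},
    "stalwart": {"infisical"},
    "certifai": {"mongodb", "keycloak", "infisical"},
    "litellm": {"infisical"},
    "backend-compl.": {"pg-app", "keycloak", "infisical"},
    "ai-compl-sdk": {"pg-app", "qdrant", "infisical"},
    "admin-compl.": set(),
    "orca-proxy": set(),
    "stage-app": {"keycloak"},
}


def cannot_start_without(failed, hard_deps=HARD_DEPS):
    """Return every service that transitively hard-depends on `failed`."""
    blocked = set()
    changed = True
    while changed:
        changed = False
        for service, needs in hard_deps.items():
            if service in blocked:
                continue
            if failed in needs or needs & blocked:
                blocked.add(service)
                changed = True
    return blocked


print(sorted(cannot_start_without("pg-app")))
```
+
+For `pg-app` this yields tenant-registry, backend-compl., ai-compl-sdk and, transitively, customer-portal, which matches the impact described in Scenario C and RISK-1.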
+
+---
+
+## 12. Open Infrastructure Risks (Priority Order)
+
+```
+RISK-1  pg-app (vm-data) is a single instance serving tenant_registry + compliance schemas.
+        One crash blocks portal AND compliance product simultaneously.
+        → Mitigation: split into pg-registry + pg-compliance at Tier B (200 customers).
+                       Move pg-registry to its own DBaaS PostgreSQL cluster (€213/mo).
+        Priority: HIGH — fix before 100 customers; flagged also in COST_PLAN.md
+
+RISK-2  vm-edge is a single VM owning 3 root dependencies (DNS, auth, secrets).
+        Failure = total external outage. Highest blast radius in the system.
+        → Mitigation:
+          Phase A: cold-standby vm-edge-spare (idle cost €0; tested quarterly)
+          Phase B (Tier C, 500 cust): split vm-edge into vm-edge + vm-identity + vm-secrets
+        Priority: HIGH
+
+RISK-3  vm-control hosts 5 service groups (portal, tenant-registry, ERPNext, Frappe HD,
+        Stalwart). Co-tenant noise risk; one OOM kills the others.
+        → Mitigation:
+          Phase A: hard Orca resource limits per service (see §6 co-tenant notes)
+          Phase B (Tier B): split vm-control → vm-portal + vm-ops at €64/mo extra
+        Priority: MEDIUM
+
+RISK-4  Keycloak is a single instance with no clustering.
+        Any Keycloak outage = total auth failure once existing JWTs expire (within one JWT TTL).
+        → Mitigation: short-term: tested runbook + 15min RTO target
+                       long-term: Keycloak active-passive cluster (Phase 2, on split vm-identity)
+        Priority: MEDIUM
+
+RISK-5  Stage isolation depends on 5 enforcement controls (see §2 table).
+        If any one breaks, stage code can affect prod customers.
+        → Mitigation: quarterly red-team verification of each control.
+                       Especially: Infisical secret-path scoping and Stalwart recipient filter.
+        Priority: MEDIUM — easy to forget once it's working
+
+RISK-6  Infisical downtime during multi-VM restart causes delayed cold start.
+        → Mitigation: Orca startup ordering + bootstrap secrets for Keycloak only
+        Priority: LOW — documented runbook; cold start is rare
+
+RISK-7  ERPNext → Tenant Registry webhook has no guaranteed delivery.
+        Failed activation = tenant not active after contract signed.
+        → Mitigation: Frappe retry + idempotent /activate endpoint + manual Backstage trigger
+        Priority: LOW
+
+RISK-8  LiteLLM calls external AI APIs (OpenAI / Anthropic); an upstream outage disables
+        AI features in both products.
+        → Mitigation: LiteLLM fallback routing; products degrade gracefully.
+        Priority: LOW — external dependency, by design
+```
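+
+The RISK-7 mitigation combines sender-side retries with an idempotent `/activate` endpoint. A minimal Python sketch of why the two belong together follows; `ActivationLog`, `deliver_with_retry`, and the tenant and contract identifiers are hypothetical stand-ins, not the real tenant-registry or Frappe APIs.
+
```python
class ActivationLog:
    """In-memory stand-in for tenant-registry activation state (hypothetical)."""

    def __init__(self):
        self._active = {}       # tenant_id -> contract_id
        self.side_effects = 0   # number of activations actually performed

    def activate(self, tenant_id, contract_id):
        current = self._active.get(tenant_id)
        if current == contract_id:
            return "already_active"   # duplicate delivery: do nothing
        if current is not None:
            raise ValueError(f"{tenant_id} is already active under {current}")
        self._active[tenant_id] = contract_id
        self.side_effects += 1
        return "activated"


def deliver_with_retry(send, attempts=3):
    """Call send() until it succeeds; re-raise the last error when attempts run out."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    last_error = None
    for _ in range(attempts):
        try:
            return send()
        except ConnectionError as exc:
            last_error = exc
    raise last_error


registry = ActivationLog()
calls = {"count": 0}


def flaky_send():
    """First delivery is applied but its response is lost; the retry succeeds."""
    calls["count"] += 1
    result = registry.activate("tenant-42", "contract-7")
    if calls["count"] == 1:
        raise ConnectionError("response lost after the registry applied the change")
    return result


outcome = deliver_with_retry(flaky_send)
```
+
+Because the second delivery is recognised as a duplicate, the retry is safe: the tenant is activated exactly once even though the webhook was sent twice. The manual Backstage trigger remains the fallback when every retry fails.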
+
+---
+
+## 13. Growth Trajectory — when to add VMs
+
+The locked 4-VM topology is sized for roughly 5–200 customers. Past that, expect to add VMs back in this order:
+
+```
+Tier A (5–200 cust):    4 VMs as locked         €192/mo compute (36M upfront)
+                        ↓
+Tier B (200–500):       Bump vm-data m2.med → m2.large       +€64/mo
+                        Add cold-standby vm-edge-spare        +€0 (idle, paid only on swap)
+                        ↓
+Tier C (500–1000):      Split vm-data: vm-data + vm-data-db   +€64/mo
+                        (postgres-app moves to its own VM, or DBaaS cluster +€213/mo)
+                        Split vm-control: vm-control + vm-ops +€64/mo
+                        (ERPNext + MariaDB + Stalwart move to vm-ops)
+                        ↓
+Tier D (1000–2000):     Split vm-edge: vm-edge + vm-identity + vm-secrets   +€96/mo
+                        HA Keycloak active-passive on 2× vm-identity        +€32/mo
+                        Octavia Load Balancer Double Instance               +€58/mo
+                        vm-data m2.large → m2.xlarge or 2×                  +€128–256/mo
+                        ↓
+                        Final topology ≈ 8 prod VMs + DBaaS
+```
+
+Each step is justified by a measurable signal (>80% RAM, >70% CPU, sustained queue depth, or a specific outage scenario). Never split preemptively.
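+
+The "measurable signal" rule can be written down as a simple check. This Python sketch uses the thresholds from the sentence above; the `VmSignals` fields, the function name, and the idea that the metrics are already averaged over a sustained window are assumptions.
+
```python
from dataclasses import dataclass
from typing import List, Optional

RAM_THRESHOLD = 0.80  # ">80% RAM"
CPU_THRESHOLD = 0.70  # ">70% CPU"


@dataclass
class VmSignals:
    ram_utilisation: float            # 0.0-1.0, assumed to be a sustained average
    cpu_utilisation: float            # 0.0-1.0, assumed to be a sustained average
    sustained_queue_depth: bool = False
    outage_scenario: Optional[str] = None  # e.g. "Scenario C", if one occurred


def reasons_to_split(signals: VmSignals) -> List[str]:
    """Return the signals that justify a split or resize; empty means do nothing."""
    reasons = []
    if signals.ram_utilisation > RAM_THRESHOLD:
        reasons.append(f"RAM at {signals.ram_utilisation:.0%}")
    if signals.cpu_utilisation > CPU_THRESHOLD:
        reasons.append(f"CPU at {signals.cpu_utilisation:.0%}")
    if signals.sustained_queue_depth:
        reasons.append("sustained queue depth")
    if signals.outage_scenario is not None:
        reasons.append(f"outage: {signals.outage_scenario}")
    return reasons


idle = reasons_to_split(VmSignals(ram_utilisation=0.09, cpu_utilisation=0.05))
loaded = reasons_to_split(VmSignals(0.86, 0.75, True, "Scenario C"))
```
+
+An empty list is the expected result at Tier A; any tier step should be able to cite at least one non-empty reason.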
+
+---
+
+## 14. Cost summary (see COST_PLAN.md for full breakdown)
+
+| Mode                    | Compute €/mo | Storage €/mo | Network €/mo | Total net €/mo | Total incl. 19% VAT €/mo |
+|---|---:|---:|---:|---:|---:|
+| On-Demand               | 434.50       | 112          | 2.92         | 549.42    | 653.81   |
+| 12-month commit         | 295.20       | 112          | 2.92         | 410.12    | 488.04   |
+| 36-month no upfront     | 216.00       | 112          | 2.92         | 330.92    | 393.79   |
+| 36-month upfront        | 192.00       | 112          | 2.92         | 306.92    | 365.23   |
+
+The 36M-upfront compute portion is paid as a one-time €6,912 net (36 × €192); the €192/mo shown above is that payment spread over the term, not an additional monthly charge.
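+
+The totals in the table can be re-derived from the per-category figures. The following Python check is only an arithmetic sketch (the variable and function names are illustrative); it uses `Decimal` so the cent rounding matches the table.
+
```python
from decimal import Decimal, ROUND_HALF_UP

STORAGE = Decimal("112")    # € per month, same in every mode
NETWORK = Decimal("2.92")   # € per month, same in every mode
VAT_RATE = Decimal("0.19")

COMPUTE = {
    "On-Demand": Decimal("434.50"),
    "12-month commit": Decimal("295.20"),
    "36-month no upfront": Decimal("216.00"),
    "36-month upfront": Decimal("192.00"),
}


def monthly_totals(compute):
    """Return (net, gross) monthly totals in euros, gross rounded to cents."""
    net = compute + STORAGE + NETWORK
    gross = (net * (1 + VAT_RATE)).quantize(Decimal("0.01"), rounding=ROUND_HALF_UP)
    return net, gross


for mode, compute in COMPUTE.items():
    net, gross = monthly_totals(compute)
    print(f"{mode:<22} net {net:>8}  incl. VAT {gross:>8}")

# One-time payment for the 36-month upfront compute portion.
upfront_payment = COMPUTE["36-month upfront"] * 36
```
+
+All four rows reproduce the table, and 36 × €192 gives the €6,912 one-time compute payment.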
+
+---
+
+*End of document. Review quarterly or after any significant infrastructure change. Topology last locked 2026-05-18.*