Infrastructure Specification

Status: Locked Topology
Authors: Sharang, Benjamin
Date: 2026-05-11 (topology lock: 2026-05-18)
Companion docs: PLATFORM_ARCHITECTURE.md, IMPLEMENTATION_PLAN.md, COST_PLAN.md
Cloud provider: SysEleven Cloud Services (DUS2, OpenStack)


1. VM Inventory

Four billable VMs total. Three in production (one per plane after collapsing Identity+Infra), one in stage. Dev runs entirely on developer laptops via docker-compose.

┌──────────────┬─────────────────┬────────────────────────┬───────────┬─────────────────┐
│ Name         │ Env             │ SysEleven flavor       │ Public IP │ Planes owned    │
├──────────────┼─────────────────┼────────────────────────┼───────────┼─────────────────┤
│ vm-edge      │ prod            │ m2.small  (2v / 8 GB)  │ YES (1)   │ Identity + Infra│
│ vm-control   │ prod            │ m2.medium (4v / 16 GB) │ No        │ Control         │
│ vm-data      │ prod            │ m2.medium (4v / 16 GB) │ No        │ Data            │
│ stage        │ stage           │ m2.small  (2v / 8 GB)  │ YES (1)   │ App plane only  │
│ (dev)        │ dev             │ local docker-compose   │ n/a       │ all (in-memory) │
└──────────────┴─────────────────┴────────────────────────┴───────────┴─────────────────┘

Total compute: 48 GiB-RAM, 12 vCPU. Monthly compute net: €192 (36M upfront) / €295 (12M) / €435 (On-Demand). See COST_PLAN.md for the full three-mode table.
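
The totals above can be re-derived from the inventory table. The following is a minimal sketch, assuming the flavor sizes shown in the table; the dictionary and function names are illustrative and not part of any Orca or SysEleven tooling.

```python
# Flavor sizes copied from the VM inventory table.
# The table labels RAM in "GB"; the prose total calls the same figure GiB.
FLAVORS = {
    "m2.small": {"vcpu": 2, "ram_gb": 8},
    "m2.medium": {"vcpu": 4, "ram_gb": 16},
}

# One entry per billable VM; dev is excluded because it runs on laptops.
VMS = {
    "vm-edge": "m2.small",
    "vm-control": "m2.medium",
    "vm-data": "m2.medium",
    "stage": "m2.small",
}


def totals(vms, flavors):
    """Return (total_vcpu, total_ram_gb) across all billable VMs."""
    vcpu = sum(flavors[flavor]["vcpu"] for flavor in vms.values())
    ram = sum(flavors[flavor]["ram_gb"] for flavor in vms.values())
    return vcpu, ram


total_vcpu, total_ram = totals(VMS, FLAVORS)
print(f"{len(VMS)} VMs, {total_vcpu} vCPU, {total_ram} GB RAM")
```

Re-running this after a flavor bump (for example vm-data moving to a larger flavor) shows the new totals before COST_PLAN.md is updated.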

Why this topology and not the previous 7-VM layout

The earlier draft proposed one VM per service group (vm-gateway, vm-identity, vm-secrets, vm-ops, vm-control, vm-certifai, vm-compliance). That gave maximum failure isolation but cost 132 GiB of RAM across stage and prod. At 5 customers that isolation goes unused: every VM ran at <10% utilisation. The locked topology buys back failure isolation incrementally as load grows (see §13 Growth Trajectory).

Critical isolations preserved even at 4 VMs:

  • vm-edge isolates identity from app workloads. Keycloak's JVM shares neither RAM nor page cache with ERPNext, so ERPNext background jobs cannot starve token issuance.
  • vm-data isolates databases from stateless services. All data-plane DBs share one host, but they're walled off from the portal + ERPNext + Stalwart competing on vm-control.
  • stage runs the app plane only. It calls prod Keycloak + prod Tenant Registry under tenant.kind = stage rather than mirroring those services.

2. Service-to-VM Mapping

vm-edge   (prod, m2.small 8 GB, public IP)
  ├── orca-proxy           (Orca-managed; wildcard TLS terminator)
  ├── powerdns-auth        (Orca-managed; authoritative DNS for breakpilot.com)
  ├── keycloak-26          (Orca-managed; JVM, ~1.5 GB heap)
  ├── postgres-keycloak    (Orca-managed; dedicated PG instance for Keycloak only)
  ├── infisical            (Orca-managed)
  ├── postgres-infisical   (Orca-managed; dedicated PG instance for Infisical only)
  ├── redis-infisical      (Orca-managed; ephemeral)
  └── gitea                (Orca-managed; SQLite backend to avoid a third PG)

vm-control   (prod, m2.medium 16 GB)
  ├── customer-portal      (Orca-managed; Next.js)
  ├── tenant-registry      (Orca-managed; Go)
  ├── orca-controller      (Orca core process; NOT a managed container)
  ├── erpnext              (Orca-managed; Frappe bench)
  ├── frappe-hd            (same bench as ERPNext)
  ├── mariadb              (Orca-managed; for ERPNext)
  ├── redis-erpnext        (Orca-managed)
  └── stalwart-mail        (Orca-managed; SMTP/IMAP/JMAP on mail.breakpilot.com)

vm-data   (prod, m2.medium 16 GB)
  ├── certifai-dashboard   (Orca-managed)
  ├── mongodb              (Orca-managed)
  ├── litellm              (Orca-managed)
  ├── backend-compliance   (Orca-managed)
  ├── ai-compliance-sdk    (Orca-managed)
  ├── admin-compliance     (Orca-managed)
  ├── postgres-app         (Orca-managed; schemas: tenant_registry, compliance)
  ├── qdrant               (Orca-managed)
  └── minio                (Orca-managed)

stage   (stage, m2.small 8 GB, public IP)
  ├── orca-proxy           (light; only routes to stage app)
  ├── customer-portal      (NEW VERSION under test)
  ├── tenant-registry      (NEW VERSION under test, talks to ephemeral PG below)
  ├── certifai-dashboard   (NEW VERSION under test)
  ├── backend-compliance   (NEW VERSION under test)
  ├── ai-compliance-sdk    (NEW VERSION under test)
  ├── admin-compliance     (NEW VERSION under test)
  ├── litellm              (light; same image as prod)
  ├── postgres-app-stage   (ephemeral; lives entirely on stage VM)
  ├── mongodb-stage        (ephemeral)
  └── qdrant-stage         (ephemeral, tiny corpus)

  Calls OUT to prod:
    → auth.breakpilot.com   (Keycloak token issuance, under stage client_id)
    → mail.breakpilot.com   (Stalwart SMTP, recipient filter forces +stage@ only)
    → Polar SANDBOX webhook URL (NEVER prod Polar)
    → no calls to prod Postgres-app, MariaDB, MongoDB

Stage isolation rules (enforced at the platform, not in product code)

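┌───────────────────────────────────┬────────────────────────────────────────────────────────┬─────────────────────────────────┐
│ Risk                              │ Enforcement mechanism                                  │ Owner                           │
├───────────────────────────────────┼────────────────────────────────────────────────────────┼─────────────────────────────────┤
│ Stage writes to prod database     │ Infisical scope: stage app only gets /stage/* secrets. │ Infra plane                     │
│                                   │ Prod DB credentials never reach stage.                 │                                 │
├───────────────────────────────────┼────────────────────────────────────────────────────────┼─────────────────────────────────┤
│ Stage emails real customers       │ Stalwart accept-rule: drop if recipient does not match │ Control plane (Stalwart config) │
│                                   │ *+stage@*.                                             │                                 │
├───────────────────────────────────┼────────────────────────────────────────────────────────┼─────────────────────────────────┤
│ Stage triggers real Polar charges │ Stage env points POLAR_API_URL to sandbox. Prod Polar  │ Control plane                   │
│                                   │ webhook secret never on stage.                         │                                 │
├───────────────────────────────────┼────────────────────────────────────────────────────────┼─────────────────────────────────┤
│ Stage Keycloak JWT used in prod   │ stage_client_id issued only by Keycloak; prod services │ Identity plane                  │
│                                   │ reject JWTs with this aud.                             │                                 │
├───────────────────────────────────┼────────────────────────────────────────────────────────┼─────────────────────────────────┤
│ Stage load DOSes prod Keycloak    │ Keycloak rate-limit per client_id; stage limited to    │ Identity plane                  │
│                                   │ 60 req/s.                                              │                                 │
└───────────────────────────────────┴────────────────────────────────────────────────────────┴─────────────────────────────────┘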
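
Two of the rules above are simple predicates, and it can help to see their intended behaviour spelled out. The sketch below is illustrative only: it is not Stalwart rule syntax or Keycloak configuration, the function names are hypothetical, and the audience value `stage_client_id` is taken literally from the table. It also assumes the JWT signature has already been verified elsewhere.

```python
from fnmatch import fnmatchcase

# Recipient pattern quoted in the table: anything else is dropped.
STAGE_RECIPIENT_PATTERN = "*+stage@*"

# Audience value named in the table (assumed to be the literal aud claim).
STAGE_AUDIENCE = "stage_client_id"


def stage_may_send_to(recipient: str) -> bool:
    """True if a stage-originated mail to this recipient would be accepted."""
    return fnmatchcase(recipient, STAGE_RECIPIENT_PATTERN)


def prod_accepts_audience(aud) -> bool:
    """True if a prod service should accept a token with this aud claim.

    The aud claim of a JWT may be a single string or a list of strings,
    so both shapes are normalised to a list before checking.
    """
    audiences = [aud] if isinstance(aud, str) else list(aud)
    return STAGE_AUDIENCE not in audiences


print(stage_may_send_to("qa+stage@example.com"))   # True
print(stage_may_send_to("customer@example.com"))   # False
print(prod_accepts_audience(["portal", "stage_client_id"]))  # False
```

Predicates like these are also a convenient basis for the quarterly isolation check: each rule gets one positive and one negative case.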

3. Network Topology

                          INTERNET
                              │
                       (breakpilot.com — authoritative on vm-edge PowerDNS;
                        stage.breakpilot.com — authoritative same zone)
                              │
                ┌─────────────┴─────────────┐
                │                           │
        ┌───────▼────────┐         ┌────────▼─────────┐
        │    vm-edge     │         │     stage        │
        │   (public IP)  │         │   (public IP)    │
        │                │         │                  │
        │ orca-proxy ────┤         │ orca-proxy       │
        │ powerdns       │         │ portal-new       │
        │ keycloak       │◄────────┤ tenant-registry-new
        │ pg-keycloak    │  stage  │ certifai-new     │
        │ infisical      │  calls  │ compliance-new   │
        │ pg-infisical   │  prod   │ pg-stage         │
        │ redis-infis    │  KC +   │ mongo-stage      │
        │ gitea          │  Stalwart│ qdrant-stage    │
        └───────┬────────┘         └──────────────────┘
                │  PRIVATE NETWORK  10.0.0.0/16
       ┌────────┴─────────┐
       │                  │
┌──────▼───────┐  ┌───────▼──────┐
│  vm-control  │  │   vm-data    │
│              │  │              │
│ portal       │  │ certifai     │
│ tenant-reg   │  │ mongodb      │
│ orca-ctrl    │  │ litellm      │
│ erpnext      │  │ backend-comp │
│ frappe-hd    │  │ ai-sdk       │
│ mariadb      │  │ admin-comp   │
│ redis-erp    │  │ pg-app       │
│ stalwart     │  │ qdrant       │
└──────────────┘  │ minio        │
                  └──────────────┘

Orca-Proxy routing (vm-edge, by Host header):
  auth.breakpilot.com    → 127.0.0.1:8443  (Keycloak, local on vm-edge)
  erp.breakpilot.com     → vm-control:8000 (ERPNext) [allowlist: our IPs only]
  git.breakpilot.com     → vm-edge:3000    (Gitea, local) [allowlist: our IPs only]
  mail.breakpilot.com    → vm-control:587  (Stalwart submission) [allowlist: VM internal only]
  ns1.breakpilot.com     → 127.0.0.1:53    (PowerDNS, local)
  *.breakpilot.com       → vm-control:3000 (customer portal)

Orca-Proxy routing (stage, by Host header):
  *.stage.breakpilot.com → 127.0.0.1:3000  (stage portal — all subdomains route here)
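
The routing tables above amount to first-match lookup with a wildcard fallback. The sketch below shows that lookup under the assumption that exact hostnames are evaluated before the wildcard; it is not Orca-Proxy's actual matching code, and `resolve_backend` is a hypothetical helper.

```python
from fnmatch import fnmatchcase

# (host pattern, backend) pairs copied from the vm-edge table.
# Order matters: exact hosts first, the wildcard last as the fallback.
EDGE_ROUTES = [
    ("auth.breakpilot.com", "127.0.0.1:8443"),
    ("erp.breakpilot.com", "vm-control:8000"),
    ("git.breakpilot.com", "vm-edge:3000"),
    ("mail.breakpilot.com", "vm-control:587"),
    ("ns1.breakpilot.com", "127.0.0.1:53"),
    ("*.breakpilot.com", "vm-control:3000"),
]


def resolve_backend(host, routes):
    """Return the backend for a Host header value, or None if nothing matches."""
    # Host names are case-insensitive and may carry a ":port" suffix.
    name = host.split(":", 1)[0].lower()
    for pattern, backend in routes:
        if fnmatchcase(name, pattern):
            return backend
    return None


print(resolve_backend("auth.breakpilot.com", EDGE_ROUTES))   # 127.0.0.1:8443
print(resolve_backend("acme.breakpilot.com", EDGE_ROUTES))   # vm-control:3000
print(resolve_backend("breakpilot.com", EDGE_ROUTES))        # None
```

Note that the bare apex `breakpilot.com` does not match `*.breakpilot.com` in this sketch, so an apex route would need its own entry if one is wanted.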

4. Storage and Volume Requirements

Block volumes (Ceph 3x replicated, €0.10/GiB/mo) mounted to each VM.

┌──────────────┬───────────────────────────────────────────┬─────────┬─────────────────────┐
│ VM           │ Data stores                               │ +Block  │ Growth profile      │
├──────────────┼───────────────────────────────────────────┼─────────┼─────────────────────┤
│ vm-edge      │ pg-keycloak + pg-infisical + Gitea repos  │ +50 GB  │ Slow                │
│ vm-control   │ MariaDB (ERPNext) + Stalwart mail spool   │ +250 GB │ Medium              │
│ vm-data      │ MongoDB + pg-app + Qdrant + MinIO         │ +500 GB │ Fast (scales w/ N)  │
│ stage        │ pg-stage + mongo-stage + qdrant-stage     │ +50 GB  │ Resets per release  │
└──────────────┴───────────────────────────────────────────┴─────────┴─────────────────────┘

Each VM's root disk: 50 GB ephemeral, included in flavor price.

Object storage (S3, €0.02/GiB/mo single-region or €0.0496/GiB/mo geo-redundant):
  ┌─────────────────────────────────┬─────────┬──────────────────────────┐
  │ Bucket                          │ Size    │ Purpose                  │
  ├─────────────────────────────────┼─────────┼──────────────────────────┤
  │ s3://backups (geo-redundant)    │ ~500 GB │ Database dumps           │
  │ s3://seed-data                  │ ~30 GB  │ Demo tenant fixtures     │
  │ s3://exports                    │ ~50 GB  │ GDPR/offboarding ZIPs    │
  │ s3://audit-archive              │ ~20 GB  │ Old audit log overflow   │
  └─────────────────────────────────┴─────────┴──────────────────────────┘
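
The storage figures above translate into a rough monthly cost. This is a back-of-the-envelope sketch, assuming the table's GB figures are billed as GiB, that only the backups bucket is geo-redundant (as marked in the table), and that the approximate bucket sizes are taken at face value. COST_PLAN.md remains the authoritative source.

```python
from decimal import Decimal

BLOCK_PRICE = Decimal("0.10")      # EUR per GiB per month (Ceph block volumes)
S3_SINGLE_PRICE = Decimal("0.02")  # EUR per GiB per month, single-region
S3_GEO_PRICE = Decimal("0.0496")   # EUR per GiB per month, geo-redundant

# Additional block volume per VM, from the block-volume table.
BLOCK_VOLUMES_GB = {"vm-edge": 50, "vm-control": 250, "vm-data": 500, "stage": 50}

# bucket -> (approximate size, geo_redundant)
BUCKETS_GB = {
    "backups": (500, True),
    "seed-data": (30, False),
    "exports": (50, False),
    "audit-archive": (20, False),
}


def block_cost():
    """Monthly cost of all attached block volumes."""
    return BLOCK_PRICE * sum(BLOCK_VOLUMES_GB.values())


def object_cost():
    """Monthly cost of all buckets, priced by redundancy mode."""
    return sum(
        (S3_GEO_PRICE if geo else S3_SINGLE_PRICE) * size
        for size, geo in BUCKETS_GB.values()
    )


print(f"block:  EUR {block_cost():.2f}")
print(f"object: EUR {object_cost():.2f}")
print(f"total:  EUR {block_cost() + object_cost():.2f}")
```

`Decimal` is used rather than floats so that the per-GiB prices multiply out without rounding noise.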

5. Backup Requirements

All backups ship to SysEleven Object Storage (S3-compatible, geo-redundant DUS2 ↔ HAM1 for production-critical data). Backup jobs run as Orca one-shot containers on cron. Infisical holds the S3 credentials.

┌───────────────────────┬──────────────────┬────────────┬────────────┬──────────────────────┐
│ Data store            │ Method           │ Frequency  │ Retention  │ Owner (who restores) │
├───────────────────────┼──────────────────┼────────────┼────────────┼──────────────────────┤
│ pg-keycloak (vm-edge) │ pg_dump → S3-geo │ Every 6h   │ 14 days    │ Infra Plane          │
│ pg-infisical (vm-edge)│ pg_dump → S3-geo │ Daily      │ 30 days    │ Infra Plane          │
│ Gitea (vm-edge)       │ gitea dump → S3  │ Daily      │ 30 days    │ Infra Plane          │
│ Keycloak realm export │ KC export → S3   │ Daily      │ 14 days    │ Identity Plane (owns)│
│ Infisical store       │ encrypted → S3   │ Daily      │ 30 days    │ Infra Plane          │
│ MariaDB (vm-control)  │ mysqldump → S3   │ Every 6h   │ 30 days    │ Control Plane        │
│ Stalwart queue/store  │ tar → S3         │ Daily      │ 7 days     │ Control Plane        │
│ pg-app (vm-data)      │ pg_dump → S3-geo │ Every 6h   │ 30 days    │ Data Plane (owns RPO)│
│ MongoDB (vm-data)     │ mongodump → S3   │ Daily      │ 30 days    │ Data Plane           │
│ MinIO (vm-data)       │ mc mirror → S3   │ Daily      │ 90 days    │ Data Plane           │
│ Qdrant (vm-data)      │ API snap → S3    │ Daily      │ 14 days    │ Data Plane (rebuild) │
│ stage *               │ no backup        │ —          │ —          │ — (ephemeral)        │
│ Orca config (IaC)     │ Gitea (VCS)      │ On commit  │ Forever    │ Infra Plane          │
└───────────────────────┴──────────────────┴────────────┴────────────┴──────────────────────┘

RPO by data criticality

CRITICAL (RPO ≤ 6h)
  pg-keycloak       — org memberships, IdP config
  pg-app            — tenant registry, compliance records
  MariaDB/ERPNext   — sales orders, invoices, contracts

IMPORTANT (RPO ≤ 24h)
  MongoDB           — chat history, user preferences
  MinIO             — compliance evidence documents
  pg-infisical      — encrypted secrets
  Stalwart store    — inbound webhooks, bounce records

RECOVERABLE (RPO ≤ 48h, rebuildable)
  Qdrant            — vector index (rebuildable from MinIO source documents)
  Gitea             — code (mirrored on dev machines)
  Keycloak export   — org structure (pg-keycloak is primary)

NOT BACKED UP
  stage (any data)  — by design; restored from seed bundles on each deploy
  redis-*           — caches; restart cold
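
A quick consistency check between the backup table and the RPO tiers: a store's backup interval must not exceed its RPO. The sketch below encodes both tables by hand (store keys are shortened, illustrative names) and ignores dump duration and upload time, which add to the real worst-case loss window.

```python
# Hours between backups, from the Frequency column of the backup table.
BACKUP_INTERVAL_H = {
    "pg-keycloak": 6, "pg-app": 6, "mariadb": 6,
    "mongodb": 24, "minio": 24, "pg-infisical": 24, "stalwart": 24,
    "qdrant": 24, "gitea": 24, "keycloak-export": 24,
}

# RPO in hours, from the CRITICAL / IMPORTANT / RECOVERABLE tiers.
RPO_H = {
    "pg-keycloak": 6, "pg-app": 6, "mariadb": 6,
    "mongodb": 24, "minio": 24, "pg-infisical": 24, "stalwart": 24,
    "qdrant": 48, "gitea": 48, "keycloak-export": 48,
}


def rpo_violations(intervals, rpos):
    """Return stores whose backup interval is longer than their RPO."""
    return sorted(name for name, rpo in rpos.items() if intervals[name] > rpo)


def retained_dumps(interval_h, retention_days):
    """Number of dumps kept in S3 at steady state for one store."""
    return retention_days * 24 // interval_h


print(rpo_violations(BACKUP_INTERVAL_H, RPO_H))  # [] means every tier is covered
print(retained_dumps(6, 14))                     # pg-keycloak: 56 dumps
```

Running the same check in the infra pipeline would catch a frequency change that silently breaks a tier.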

6. Constraint Framework

Constraint types

AVAILABILITY   — required uptime percentage over rolling 30 days
RTO            — Recovery Time Objective: max time to restore service after failure
RPO            — Recovery Point Objective: max acceptable data loss window
IaC            — service must be declared in Orca config, no manual container runs in prod
SECRET_HYGIENE — all secrets via Infisical machine identity; no env files, no hardcoded values
NETWORK        — whether service is internet-exposed or internal-only
DATA_RESIDENCY — all data must remain in EU (SysEleven DUS2 + HAM1)
AUDIT_TRAIL    — all mutating actions logged (who, what, when, from where)
IMMUTABILITY   — config changes go through Gitea → Orca pipeline, not manual SSH
STAGE_ISOLATION— stage tenant cannot mutate any prod data; read-only against prod KC + TR

Plane ownership of constraints

Even though planes now share VMs, the ownership model is unchanged — the plane that owns a constraint owns it regardless of which VM hosts the service. The Infra Plane (now collapsed onto vm-edge alongside the Identity plane) still mechanically enforces backup, IaC, secrets, and network constraints.

╔══════════════════════════════════════════════════════════════════════════════════════════╗
║  IDENTITY PLANE  (on vm-edge)                                                            ║
║                                                                                          ║
║  Owns / defines:                                                                         ║
║    AVAILABILITY  — must be ≥ 99.5% (root dep for everything)                            ║
║    RTO           — ≤ 15 min                                                              ║
║    AUDIT_TRAIL   — realm-level audit (logins, token issuance, IdP events)               ║
║    DATA_RESIDENCY— Keycloak realm data must stay EU                                     ║
║    STAGE_ISOLATION— rate-limits stage_client_id; rejects stage JWTs in prod audiences   ║
║                                                                                          ║
║  Co-tenant note: shares vm-edge with Infra Plane services. JVM heap pinned to 1.5 GB    ║
║  in Orca manifest so it cannot starve PowerDNS / Infisical.                              ║
╚══════════════════════════════════════════════════════════════════════════════════════════╝

╔══════════════════════════════════════════════════════════════════════════════════════════╗
║  CONTROL PLANE  (on vm-control)                                                          ║
║                                                                                          ║
║  Owns / defines:                                                                         ║
║    RPO (tenant)   — tenant registry & compliance schemas RPO ≤ 6h                       ║
║    RPO (ERPNext)  — sales orders, invoices RPO ≤ 6h                                     ║
║    AUDIT_TRAIL    — all portal actions (invites, IdP changes, impersonations)            ║
║    AVAILABILITY   — portal ≥ 99.5%; ERPNext ≥ 99% (internal)                            ║
║    RTO (portal)   — ≤ 10 min                                                             ║
║    RTO (ERPNext)  — ≤ 60 min                                                             ║
║                                                                                          ║
║  Co-tenant note: ERPNext + Portal + Stalwart on one VM. Orca resource limits enforced:  ║
║    portal:   1 GB memory cap                                                             ║
║    erpnext:  6 GB memory cap                                                             ║
║    mariadb:  3 GB memory cap                                                             ║
║    stalwart: 1 GB memory cap                                                             ║
║    tenant-registry: 500 MB                                                                ║
╚══════════════════════════════════════════════════════════════════════════════════════════╝
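
The memory caps in the Control Plane box can be summed against the VM size to see how much is left for uncapped processes. A minimal sketch, assuming the caps listed in the box and the 16 GB m2.medium flavor; services without a listed cap (orca-controller, redis-erpnext) and the OS share whatever remains.

```python
VM_CONTROL_RAM_GB = 16  # m2.medium, from the VM inventory

# Per-service memory caps from the Control Plane co-tenant note.
MEMORY_CAPS_GB = {
    "portal": 1,
    "erpnext": 6,
    "mariadb": 3,
    "stalwart": 1,
    "tenant-registry": 0.5,
}


def headroom_gb(vm_ram_gb, caps):
    """RAM left over if every capped service sits at its limit."""
    return vm_ram_gb - sum(caps.values())


committed = sum(MEMORY_CAPS_GB.values())
print(f"committed: {committed} GB, headroom: {headroom_gb(VM_CONTROL_RAM_GB, MEMORY_CAPS_GB)} GB")
```

A negative headroom would mean the caps are overcommitted and no longer guarantee that one service cannot starve another.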

╔══════════════════════════════════════════════════════════════════════════════════════════╗
║  DATA PLANE  (on vm-data)                                                                ║
║                                                                                          ║
║  Owns / defines:                                                                         ║
║    DATA_RESIDENCY — all customer data (MongoDB, pg-app, MinIO) must stay EU              ║
║    RPO (product)  — compliance records ≤ 6h; chat history ≤ 24h                         ║
║    DATA_ISOLATION — every query scoped by org_id/tenant_id                               ║
║    AUDIT_TRAIL    — product-level actions                                                ║
║    AVAILABILITY   — CERTifAI ≥ 99.5%; compliance ≥ 99.5%                                ║
║                                                                                          ║
║  Co-tenant note: this VM is the SCALE driver. When vm-data hits 80% RAM, bump flavor    ║
║  (m2.medium → m2.large → m2.xlarge). See §13 Growth Trajectory.                          ║
╚══════════════════════════════════════════════════════════════════════════════════════════╝

╔══════════════════════════════════════════════════════════════════════════════════════════╗
║  INFRA PLANE  (on vm-edge, alongside Identity)                                           ║
║                                                                                          ║
║  Owns / enforces ALL of:                                                                 ║
║    BACKUP        — executes all backup jobs (pg_dump, mongodump, mc mirror)             ║
║    IaC           — ALL services declared in Orca config; no manual prod changes         ║
║    IMMUTABILITY  — config changes: Gitea commit → Gitea Actions → Orca API only         ║
║    SECRET_HYGIENE— Infisical (on vm-edge); provisions machine identities                ║
║    NETWORK       — Orca-Proxy rules; VM firewall; no direct VM public exposure          ║
║    DATA_RESIDENCY— VM region = SysEleven DUS2; backups geo-redundant DUS2↔HAM1          ║
║    AVAILABILITY  — Orca restart policies, health checks                                  ║
║    COLD_START    — enforces startup ordering (see §10 Scenario F)                       ║
║    STAGE_ISOLATION— Infisical secret-path scoping for stage_app identity                ║
╚══════════════════════════════════════════════════════════════════════════════════════════╝

7. SLA Table

┌───────────────────────┬──────────────┬─────────┬─────────┬────────────────────────────────┐
│ Service               │ Availability │ RTO     │ RPO     │ Host VM                        │
├───────────────────────┼──────────────┼─────────┼─────────┼────────────────────────────────┤
│ Orca-Proxy            │ 99.9%        │ 5 min   │ N/A     │ vm-edge                        │
│ PowerDNS              │ 99.9%        │ 5 min   │ N/A     │ vm-edge                        │
│ Keycloak              │ 99.5%        │ 15 min  │ 6h      │ vm-edge (root auth dep)        │
│ Infisical             │ 99.5%        │ 30 min  │ 24h     │ vm-edge (running svcs survive) │
│ Gitea                 │ 99%          │ 2h      │ 24h     │ vm-edge (dev machines mirror)  │
│ Customer Portal       │ 99.5%        │ 10 min  │ N/A     │ vm-control                     │
│ Tenant Registry       │ 99.5%        │ 10 min  │ 6h      │ vm-control                     │
│ ERPNext               │ 99%          │ 60 min  │ 6h      │ vm-control (internal only)     │
│ Frappe HD             │ 99%          │ 60 min  │ 24h     │ vm-control                     │
│ MariaDB               │ 99.5%        │ 20 min  │ 6h      │ vm-control                     │
│ Stalwart Mail         │ 99%          │ 60 min  │ 24h     │ vm-control                     │
│ CERTifAI              │ 99.5%        │ 10 min  │ 24h     │ vm-data                        │
│ MongoDB               │ 99.5%        │ 20 min  │ 24h     │ vm-data                        │
│ LiteLLM               │ 99%          │ 5 min   │ N/A     │ vm-data                        │
│ backend-compliance    │ 99.5%        │ 10 min  │ 6h      │ vm-data                        │
│ ai-compliance-sdk     │ 99.5%        │ 10 min  │ 6h      │ vm-data                        │
│ pg-app                │ 99.9%        │ 20 min  │ 6h      │ vm-data (SPOF — RISK-1)        │
│ MinIO                 │ 99.5%        │ 30 min  │ 24h     │ vm-data                        │
│ Qdrant                │ 99%          │ 2h      │ 24h     │ vm-data (rebuildable)          │
│ stage (any service)   │ 95%          │ best ef.│ N/A     │ stage (ephemeral; no SLA)      │
└───────────────────────┴──────────────┴─────────┴─────────┴────────────────────────────────┘
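
Availability percentages are easier to reason about as a downtime budget. The sketch below converts each availability level in the table into minutes of allowed downtime over the rolling 30-day window defined in §6; the function name is illustrative.

```python
from decimal import Decimal

WINDOW_MINUTES = 30 * 24 * 60  # rolling 30 days = 43,200 minutes


def downtime_budget_minutes(availability_pct: str) -> Decimal:
    """Allowed downtime per 30-day window for an availability such as '99.5'."""
    unavailable = (Decimal(100) - Decimal(availability_pct)) / Decimal(100)
    return unavailable * WINDOW_MINUTES


for pct in ("99.9", "99.5", "99", "95"):
    print(f"{pct}% -> {downtime_budget_minutes(pct):.1f} min")
```

For example, the 99.5% tier allows 216 minutes per window, so a single 15-minute Keycloak recovery consumes roughly 7% of the monthly budget.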

8. IaC Constraint (Orca)

Every production service is declared in Orca config. No exceptions.

Rules

1. ALL containers run via Orca manifests committed to Gitea
   → /orca/manifests/{vm-name}/{service-name}.toml
   → Changes go through: Gitea PR → Gitea Actions lint → Orca API apply

2. NO manual docker run / docker-compose up on any production VM
   → SSH to prod VMs allowed for debugging only; no state changes

3. Secrets are NEVER in Orca manifests
   → Manifests reference Infisical paths, not values
   → Bootstrap exception: Keycloak DB URI in Orca env (Keycloak runs ON vm-edge alongside
     Infisical, so chicken-and-egg is solved by Orca env file, not Infisical lookup)

4. Restart policy: always (Orca restarts crashed containers with exponential backoff)
   → Health check per service (HTTP /health or TCP probe)

5. Resource limits MANDATORY in every manifest
   → On a 3-VM prod, co-tenant noise is the single biggest risk; limits are non-negotiable
   → See §6 Plane ownership "Co-tenant note" boxes for the per-service caps

6. Orca controller state itself is recoverable
   → Manifest files in Gitea = desired state
   → Loss of Orca controller = re-apply manifests from Gitea, services continue running

7. Stage app gets its own Infisical scope
   → /stage/* path; no prod-DB credentials reach this scope
   → Enforced at Infisical machine-identity level, not in app code
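
Rule 4 mentions exponential backoff without giving parameters. The sketch below shows the general shape of such a schedule; the base delay, multiplier, and cap are assumptions chosen for illustration and are not Orca's documented values.

```python
def backoff_delays(attempts, base_s=1.0, factor=2.0, cap_s=60.0):
    """Delay before each restart attempt: base * factor**n, capped at cap_s.

    All three parameters are illustrative assumptions.
    """
    return [min(cap_s, base_s * factor ** n) for n in range(attempts)]


delays = backoff_delays(8)
print(delays)
print(f"total wait after 8 restarts: {sum(delays):.0f} s")
```

With these assumed values, eight failed restarts take about three minutes, which is in the same range as the 2-5 minute crash-loop window described for a cold start.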

Gitea Actions pipeline for infra changes

infra change committed to Gitea
  │
  ├── lint: validate Orca manifest schema
  ├── diff: show what changes will be applied (orca plan)
  ├── (manual approval gate for vm-edge changes — touches auth root)
  └── apply: POST to Orca Controller API → rolling update
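
Two pieces of the pipeline can be checked from the manifest path alone: that it follows the `/orca/manifests/{vm-name}/{service-name}.toml` convention from rule 1, and whether the change needs the manual approval gate. The sketch below is a hypothetical lint helper, not part of Orca or Gitea Actions; treating `stage` as a valid VM directory is an assumption.

```python
from pathlib import PurePosixPath

# VM names from the inventory; including "stage" here is an assumption.
KNOWN_VMS = {"vm-edge", "vm-control", "vm-data", "stage"}


def parse_manifest_path(path):
    """Split a manifest path into (vm_name, service_name) or raise ValueError."""
    p = PurePosixPath(path)
    parts = p.parts  # e.g. ('/', 'orca', 'manifests', 'vm-edge', 'keycloak-26.toml')
    if len(parts) != 5 or parts[:3] != ("/", "orca", "manifests"):
        raise ValueError(f"not under /orca/manifests/<vm>/: {path}")
    if p.suffix != ".toml":
        raise ValueError(f"manifest must be a .toml file: {path}")
    vm_name, service_name = parts[3], p.stem
    if vm_name not in KNOWN_VMS:
        raise ValueError(f"unknown VM directory: {vm_name}")
    return vm_name, service_name


def needs_manual_approval(vm_name):
    """vm-edge changes touch the auth root, so they pass the approval gate."""
    return vm_name == "vm-edge"


vm, service = parse_manifest_path("/orca/manifests/vm-edge/keycloak-26.toml")
print(vm, service, needs_manual_approval(vm))
```

A lint step built this way fails fast on misplaced files before the schema validation and `orca plan` diff run.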

9. Dependency Graph

Arrows = "requires to function." Dashed = soft (degrades, doesn't fail). Intra-VM dependencies elided for clarity (e.g. Keycloak ↔ pg-keycloak are on the same host and start together).

         EXTERNAL
         AI APIs
         (OpenAI / Anthropic)
              │
              │ (soft)
              ▼
┌──────────────────────────────────────────────────────────────────┐
│  vm-edge   (Identity + Infra)                                    │
│  ┌────────────────────────────────────────────────────────────┐  │
│  │  pg-keycloak ──► keycloak                                  │  │
│  │  pg-infisical ─► infisical  ◄── (all VMs pull on startup)  │  │
│  │  redis-infis ──► infisical                                  │  │
│  │  (sqlite) ─────► gitea                                      │  │
│  │  powerdns-auth  (no deps)                                   │  │
│  │  orca-proxy     (route table only; backends are remote)     │  │
│  └────────────────────────────────────────────────────────────┘  │
│                            │ Keycloak JWKS  │ Infisical /secrets │
│                            │                │                    │
└────────────────────────────┼────────────────┼────────────────────┘
                             ▼                ▼
┌──────────────────────────────────────────────────────────────────┐
│  vm-control   (Control)                                          │
│  ┌────────────────────────────────────────────────────────────┐  │
│  │  mariadb + redis-erp ──► erpnext + frappe-hd               │  │
│  │  (intra) ─────────────► stalwart                            │  │
│  │  ──────────────────────► customer-portal                    │  │
│  │  ──────────────────────► tenant-registry ──► pg-app (vm-data)│  │
│  └────────────────────────────────────────────────────────────┘  │
│                            │ tenant-registry API                  │
└────────────────────────────┼─────────────────────────────────────┘
                             ▼
┌──────────────────────────────────────────────────────────────────┐
│  vm-data   (Data)                                                │
│  ┌────────────────────────────────────────────────────────────┐  │
│  │  mongodb ───► certifai ◄── (vm-edge JWKS, vm-edge secrets) │  │
│  │  litellm ───► certifai, ai-compliance-sdk                  │  │
│  │  pg-app ────► tenant-registry-on-vm-control, backend-compl,│  │
│  │              ai-compliance-sdk                              │  │
│  │  qdrant ────► ai-compliance-sdk                            │  │
│  │  minio  ────► backend-compliance                           │  │
│  │  backend-compliance ──► admin-compliance                   │  │
│  └────────────────────────────────────────────────────────────┘  │
└──────────────────────────────────────────────────────────────────┘

┌──────────────────────────────────────────────────────────────────┐
│  stage   (App plane only)                                        │
│  Calls vm-edge:8443 (KC) + vm-control:587 (Stalwart submission)  │
│  Calls Polar SANDBOX (never prod Polar webhook URL)              │
│  Its own ephemeral DBs; cannot read prod data                    │
└──────────────────────────────────────────────────────────────────┘

Simplified critical path (customer login → product use)

  DNS (vm-edge PowerDNS)
   │
   ▼
  orca-proxy (vm-edge)
   │
   ├──► keycloak (vm-edge) ──► pg-keycloak (intra-VM)
   │
   └──► customer-portal (vm-control)
          ├──► tenant-registry (vm-control) ──► pg-app (vm-data)
          ├──► certifai (vm-data) ──► mongodb (intra-VM)
          └──► backend-compliance (vm-data) ──► pg-app (intra-VM)
                                          ──► ai-sdk ──► qdrant + minio
                                          ──► litellm ──► [external AI APIs]
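
The dependency graph can be queried for blast radius: given a failed service, which others transitively require it? The sketch below hand-encodes a subset of the edges shown in this section and treats every edge as a hard dependency, so it overstates the impact where the document describes graceful degradation (for example LiteLLM in Scenario D).

```python
from collections import deque

# service -> services it requires, transcribed from the graph and critical path.
NEEDS = {
    "keycloak": {"pg-keycloak"},
    "infisical": {"pg-infisical", "redis-infisical"},
    "erpnext": {"mariadb", "redis-erpnext"},
    "tenant-registry": {"pg-app"},
    "customer-portal": {"tenant-registry"},
    "certifai": {"mongodb", "litellm"},
    "backend-compliance": {"pg-app", "minio"},
    "ai-compliance-sdk": {"pg-app", "qdrant", "litellm"},
    "admin-compliance": {"backend-compliance"},
}


def impacted_by(failed, needs):
    """Return every service that directly or transitively requires `failed`."""
    # Invert the map: dependency -> set of services that need it.
    dependents = {}
    for service, deps in needs.items():
        for dep in deps:
            dependents.setdefault(dep, set()).add(service)

    # Breadth-first walk outward from the failed service.
    seen, queue = set(), deque([failed])
    while queue:
        for service in dependents.get(queue.popleft(), ()):
            if service not in seen:
                seen.add(service)
                queue.append(service)
    return sorted(seen)


print(impacted_by("pg-app", NEEDS))
print(impacted_by("mariadb", NEEDS))
```

The pg-app result lists five dependents across two VMs, which is consistent with the document flagging it as the worst single point of failure.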

10. Failure Scenarios and Deadlock Analysis

Scenario A — vm-edge fails (HIGHEST SEVERITY)

Impact:    TOTAL outage. Nothing reachable from internet.
           No DNS. No TLS. No auth. No new logins. Running JWTs expire within 15 min,
           then ALL services start returning 401.
           Backstage and customer portal both fully blocked.
           Stage also blocked (depends on prod Keycloak).
Cascade:   T+0:    DNS fails  → orca-proxy unreachable
           T+5m:   existing JWTs still valid; portal cached → partial reads work
           T+15m:  JWTs expire → full outage
Deadlock:  None — services downstream don't deadlock, they just fail closed
Recovery:  1. Spin up vm-edge-spare (cold standby, same Orca config) — ~3 min provision
           2. Restore pg-keycloak + pg-infisical from latest backup — ~5 min
           3. Swap registrar NS records to spare IP (TTL 60s) — ~2 min propagation
           4. Restart all services on vm-edge-spare via Orca apply — ~3 min
           Total RTO target: 15 min
Mitigation: COLD STANDBY vm-edge-spare. Same Orca config committed in Gitea.
           Provision cost when idle: €0 (only billed when running).
           Test recovery quarterly.
Severity:  CRITICAL — single host owns 3 root dependencies (DNS, auth, secrets)
Cost of fix at Tier C: split vm-edge into vm-edge + vm-identity + vm-secrets
           (back toward original 7-VM design) — €100/mo extra
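
The Scenario A recovery budget can be checked by summing the step estimates. A minimal sketch, assuming the four steps run strictly one after another with no overlap and that the step estimates above hold:

```python
# (step, estimated minutes) from the Scenario A recovery list.
RECOVERY_STEPS_MIN = [
    ("provision vm-edge-spare", 3),
    ("restore pg-keycloak + pg-infisical", 5),
    ("NS swap propagation", 2),
    ("Orca apply on spare", 3),
]
RTO_TARGET_MIN = 15

total_min = sum(minutes for _, minutes in RECOVERY_STEPS_MIN)
slack_min = RTO_TARGET_MIN - total_min

print(f"sequential recovery: {total_min} min, slack vs target: {slack_min} min")
```

The estimates leave about two minutes of slack, so any step that overruns (a slow restore in particular) pushes recovery past the 15-minute target. That is a reason to time each step individually during the quarterly drill.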

Scenario B — vm-control fails (NEW — consequence of plane consolidation)

Impact:    customer-portal: DOWN → /[slug]/* all return 503
           tenant-registry: DOWN → Keycloak protocol-mapper for products claim breaks
                                    → users can log in but see "No active products"
           ERPNext + Frappe HD: DOWN → we cannot create sales orders or read tickets
           Stalwart: DOWN → no outbound emails (trial nudges, exports, ticket replies)
           MariaDB: DOWN → ERPNext queries fail; backups paused
           Products (CERTifAI, compliance): UNAFFECTED (on vm-data, JWTs still validate)
           Existing logged-in users: can use products directly via product subdomain
                                     IF they bookmark it; portal home is 503.
Cascade:   T+0:    portal 503; new tenant onboarding blocked (registry down)
           T+15m:  existing JWTs missing refreshed products claim
           T+1h:   trial emails not sent → trial nudge cadence breaks
Deadlock:  None
Recovery:  Restart vm-control containers via Orca. If MariaDB corrupt: restore mysqldump.
           RTO target: 10 min (portal) / 60 min (ERPNext)
Mitigation: Multiple services co-hosted = single failure hits many SLAs.
           Resource limits in Orca prevent ERPNext OOM from killing portal.
           Quarterly drill: deliberately stop portal, measure recovery.
Severity:  HIGH — three services down at once, but products keep serving customers
Cost of fix at Tier B/C: split vm-control → vm-portal + vm-ops (ERPNext)
           — €64/mo extra at m2.small

Scenario C — vm-data fails

Impact:    tenant-registry queries: FAIL (pg-app down) → portal returns 503 for tenant lookup
           customer-portal: DEGRADED (login works, dashboard fails)
           CERTifAI: COMPLETELY DOWN
           backend-compliance + ai-sdk + admin: COMPLETELY DOWN
           ERPNext + Stalwart: UNAFFECTED
Cascade:   T+0: products down; portal degraded
           T+15m: support tickets pile up
           Note: prod is partial — users see error pages but ERPNext + auth still work
Recovery:  Restart vm-data containers. If pg-app corrupt: restore from pg_dump (RPO 6h).
           RTO target: 20 min
Mitigation: This is the SCALE-event VM. RISK-1 below makes this the worst SPOF:
           one pg-app instance owns tenant_registry + compliance schemas.
           HIGH PRIORITY fix: split pg-app into separate clusters at Tier B/C transition.
Severity:  HIGH — products down, business operations (ERPNext) still work so we can
           contact customers

Scenario D — LiteLLM fails

Impact:    CERTifAI: AI features fail (summarization, chat completion).
           CERTifAI dashboard, sessions: UNAFFECTED.
           compliance AI generation: FAILS (DSFA/TOM/VVT generation blocked).
           Compliance CRUD: UNAFFECTED.
Cascade:   Soft degradation only. Products show "AI features temporarily unavailable" banner.
Deadlock:  None.
Recovery:  Restart LiteLLM on vm-data (stateless, ~30s).
Severity:  MEDIUM — graceful degradation by design

Scenario E — Stage VM compromised or buggy

Impact:    On stage itself: stage portal serves bad data; stage testers see errors.
           On prod: NONE if isolation rules in §2 are intact.
           Worst case if isolation breaks:
             - Stage code tries to call prod pg-app → fails (no creds in /stage/* Infisical)
             - Stage emits real email → blocked by Stalwart recipient filter
             - Stage triggers Polar charge → goes to sandbox, no real money
Cascade:   None to prod by design.
Recovery:  Roll back stage to previous image via Orca. RTO target: 5 min.
Mitigation: The 5 enforcement rules in §2 are the load-bearing controls. Verify quarterly
           via deliberate red-team: try to write to prod pg-app from stage and confirm the attempt is rejected.
Severity:  LOW (in prod) / HIGH (on stage, but stage SLA is 95%)

Scenario F — Full Cold Start (Power Loss, All VMs Restart Simultaneously)

The three prod VMs boot at once. Services must start in dependency order; otherwise they
crash-loop until their dependencies are ready.

DEADLOCK RISK: vm-control services (portal, tenant-registry) start before vm-data
               services (pg-app, certifai, compliance). They'll crash-loop ~2-5min
               with backoff retries.
               Same for ERPNext on vm-control trying to reach Keycloak on vm-edge.

RESOLUTION: Orca enforces cross-VM startup ordering via health-check dependencies.
            Bootstrap exception: Keycloak DB URI in Orca env on vm-edge (not from
            Infisical — chicken-and-egg solved).

Required cold start sequence:

  Phase 0 — Data roots on vm-data (parallel):
    pg-app, mongodb, qdrant, minio
  Phase 0 — Data roots on vm-control (parallel):
    mariadb, redis-erpnext
  Phase 0 — Data roots on vm-edge (parallel):
    pg-keycloak, pg-infisical, redis-infisical

  Phase 1 — Secrets + DNS on vm-edge:
    infisical  (needs: pg-infisical, redis-infisical)
    powerdns-auth  (no deps)

  Phase 2 — Identity on vm-edge:
    keycloak  (needs: pg-keycloak [Phase 0], infisical [Phase 1])
    gitea     (needs: sqlite; ready from Phase 0)

  Phase 3 — Control on vm-control + Data services on vm-data (parallel):
    tenant-registry  (needs: keycloak [Phase 2], pg-app [Phase 0, remote])
    erpnext + frappe-hd  (needs: mariadb, redis-erpnext [Phase 0], keycloak [Phase 2])
    stalwart  (needs: infisical [Phase 1])
    litellm  (needs: infisical)
    certifai  (needs: keycloak, mongodb, litellm)
    backend-compliance  (needs: keycloak, pg-app)
    ai-compliance-sdk   (needs: pg-app, qdrant, litellm)
    admin-compliance    (needs: backend + sdk)

  Phase 4 — Customer-facing on vm-control:
    customer-portal  (needs: keycloak, tenant-registry)

  Phase 5 — Gateway on vm-edge (last):
    orca-proxy  (waits for all backends healthy before opening listener)

Estimated cold-start time: 6-10 minutes (faster than the 7-VM layout because fewer dependencies cross a VM boundary)
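
The ordering above is a topological sort of the dependency graph. The sketch below transcribes the "needs" lists from the cold start sequence into a dict and groups services into start waves with Python's standard `graphlib`. It is an illustration only, not how Orca implements ordering; `DEPS` and `start_waves` are hypothetical names, and gitea is omitted because its only listed dependency is local sqlite.

```python
from graphlib import TopologicalSorter

# service -> services that must be healthy first (from the sequence above).
# Data roots appear only as dependencies; graphlib adds them automatically.
DEPS = {
    "powerdns-auth": [],
    "infisical": ["pg-infisical", "redis-infisical"],
    "keycloak": ["pg-keycloak", "infisical"],
    "tenant-registry": ["keycloak", "pg-app"],
    "erpnext": ["mariadb", "redis-erpnext", "keycloak"],
    "frappe-hd": ["mariadb", "redis-erpnext", "keycloak"],
    "stalwart": ["infisical"],
    "litellm": ["infisical"],
    "certifai": ["keycloak", "mongodb", "litellm"],
    "backend-compliance": ["keycloak", "pg-app"],
    "ai-compliance-sdk": ["pg-app", "qdrant", "litellm"],
    "admin-compliance": ["backend-compliance", "ai-compliance-sdk"],
    "customer-portal": ["keycloak", "tenant-registry"],
}
# orca-proxy opens its listener only after every backend is healthy.
DEPS["orca-proxy"] = sorted(DEPS)


def start_waves(deps):
    """Group services into waves; each wave needs only earlier waves."""
    sorter = TopologicalSorter(deps)
    sorter.prepare()  # raises graphlib.CycleError on a circular dependency
    waves = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())
        waves.append(ready)
        sorter.done(*ready)
    return waves


if __name__ == "__main__":
    for number, wave in enumerate(start_waves(DEPS)):
        print(f"wave {number}: {', '.join(wave)}")
```

The waves are the earliest point at which each service could start, so stalwart and litellm land one wave earlier than in the phase list above. Starting them later, as the phases do, is equally valid; the constraint is only that no service starts before its dependencies.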

Scenario G — Tenant Registry fails

Impact:    Portal cannot resolve tenant from subdomain → /[slug]/* all 503
           Keycloak protocol mapper cannot get products claim → JWT missing field
                → users can log in but see "No active products"
           Products (CERTifAI, compliance) themselves: UNAFFECTED if already authenticated
Cascade:   New logins degraded.
           Existing sessions continue.
Deadlock:  None.
Recovery:  Restart tenant-registry on vm-control. pg-app on vm-data must be healthy.
           RTO target: ≤ 60s
Mitigation: Portal caches slug → tenant mapping with 60s TTL.
           Short outage invisible to customers.
Severity:  MEDIUM
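
The mitigation in Scenario G relies on a TTL cache in front of the tenant-registry lookup. A minimal sketch follows, assuming a per-process in-memory cache; `SlugCache` and the `lookup` callable are hypothetical names, not the portal's real code.

```python
import time
from typing import Callable, Dict, Tuple


class SlugCache:
    """Cache slug -> tenant mappings for a fixed TTL (60s in this document)."""

    def __init__(
        self,
        lookup: Callable[[str], str],
        ttl_seconds: float = 60.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._lookup = lookup  # hypothetical call into tenant-registry
        self._ttl = ttl_seconds
        self._clock = clock
        self._entries: Dict[str, Tuple[str, float]] = {}

    def resolve(self, slug: str) -> str:
        now = self._clock()
        hit = self._entries.get(slug)
        if hit is not None and now - hit[1] < self._ttl:
            return hit[0]  # fresh entry: tenant-registry is not contacted
        tenant_id = self._lookup(slug)  # raises if tenant-registry is down
        self._entries[slug] = (tenant_id, now)
        return tenant_id
```

With this shape, a registry outage is invisible only for slugs that were resolved within the last 60 seconds; a slug with no fresh entry still fails until tenant-registry is back.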

11. Cross-Dependency Summary Table

              Needs → │PG-KC│PG-Inf│PG-App│Mongo│Maria│Redis│Minio│Qdrant│ KC  │Infis│Lit. │T.Reg│
─────────────────────┼─────┼──────┼──────┼─────┼─────┼─────┼─────┼──────┼─────┼─────┼─────┼─────┤
keycloak             │  ●  │      │      │     │     │     │     │      │     │ ◐*  │     │     │
infisical            │     │  ●   │      │     │     │  ●  │     │      │     │     │     │     │
gitea                │     │      │      │     │     │     │     │      │     │  ●  │     │     │
tenant-registry      │     │      │  ●   │     │     │     │     │      │  ●  │  ●  │     │     │
customer-portal      │     │      │      │     │     │     │     │      │  ●  │  ●  │     │  ●  │
erpnext              │     │      │      │     │  ●  │  ●  │     │      │  ●  │  ●  │     │     │
frappe-hd            │     │      │      │     │  ●  │  ●  │     │      │     │  ●  │     │     │
stalwart             │     │      │      │     │     │     │     │      │     │  ●  │     │     │
certifai             │     │      │      │  ●  │     │     │     │      │  ●  │  ●  │  ◐  │     │
litellm              │     │      │      │     │     │     │     │      │     │  ●  │     │     │
backend-compl.       │     │      │  ●   │     │     │     │     │      │  ●  │  ●  │     │     │
ai-compl-sdk         │     │      │  ●   │     │     │     │     │  ●   │     │  ●  │  ◐  │     │
admin-compl.         │     │      │      │     │     │     │     │      │     │     │     │     │
orca-proxy           │     │      │      │     │     │     │     │      │     │     │     │     │
stage-app            │     │      │      │     │     │     │     │      │  ●  │  ◑  │     │  ◑  │

● = hard dependency (cannot start without)
◐ = soft dependency (starts, features degrade)
◑ = stage-only read-mostly dependency (writes blocked by Infisical scope)
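◐*= bootstrap exception (Keycloak DB URI in Orca env on vm-edge, not Infisical)
Rows with no marks depend only on services that have no column here: admin-compl. needs
backend-compl. + ai-compl-sdk, and orca-proxy waits for all backends (see Scenario F).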
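
The hard-dependency marks (●) in the table can be turned into a small graph to ask "what is blocked if X is down?". The sketch below is illustrative: `HARD_DEPS` transcribes only the ● cells, uses the table's column labels for data stores and row names for services, and ignores soft (◐) and stage-only (◑) marks. Note that the table's single Redis column covers more than one Redis instance, so results for "Redis" are pessimistic.

```python
from collections import deque

# service -> hard dependencies (● cells only).
HARD_DEPS = {
    "keycloak": {"PG-KC"},
    "infisical": {"PG-Inf", "Redis"},
    "gitea": {"infisical"},
    "tenant-registry": {"PG-App", "keycloak", "infisical"},
    "customer-portal": {"keycloak", "infisical", "tenant-registry"},
    "erpnext": {"Maria", "Redis", "keycloak", "infisical"},
    "frappe-hd": {"Maria", "Redis", "infisical"},
    "stalwart": {"infisical"},
    "certifai": {"Mongo", "keycloak", "infisical"},
    "litellm": {"infisical"},
    "backend-compl.": {"PG-App", "keycloak", "infisical"},
    "ai-compl-sdk": {"PG-App", "Qdrant", "infisical"},
}


def blocked_by(failed, deps=HARD_DEPS):
    """Return every service that cannot start while `failed` is down."""
    blocked = set()
    queue = deque([failed])
    while queue:
        down = queue.popleft()
        for service, needs in deps.items():
            if down in needs and service not in blocked:
                blocked.add(service)
                queue.append(service)  # its dependents are blocked too
    return blocked


if __name__ == "__main__":
    print(sorted(blocked_by("PG-App")))
```

For `PG-App` this returns tenant-registry, customer-portal (via tenant-registry), backend-compl. and ai-compl-sdk, which is the portal-plus-compliance blast radius that RISK-1 describes.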

12. Open Infrastructure Risks (Priority Order)

RISK-1  pg-app (vm-data) is a single instance serving tenant_registry + compliance schemas.
        One crash blocks portal AND compliance product simultaneously.
        → Mitigation: split into pg-registry + pg-compliance at Tier B (200 customers).
                       Move pg-registry to its own DBaaS PostgreSQL cluster (€213/mo).
        Priority: HIGH — fix before 100 customers; flagged also in COST_PLAN.md

RISK-2  vm-edge is a single VM owning 3 root dependencies (DNS, auth, secrets).
        Failure = total external outage. Highest blast radius in the system.
        → Mitigation:
          Phase A: cold-standby vm-edge-spare (idle cost €0; tested quarterly)
          Phase B (Tier C, 500 cust): split vm-edge into vm-edge + vm-identity + vm-secrets
        Priority: HIGH

RISK-3  vm-control hosts 5 service groups (portal, tenant-registry, ERPNext, Frappe HD,
        Stalwart). Co-tenant noise risk; one OOM kills the others.
        → Mitigation:
          Phase A: hard Orca resource limits per service (see §6 co-tenant notes)
          Phase B (Tier B): split vm-control → vm-portal + vm-ops at €64/mo extra
        Priority: MEDIUM

RISK-4  Keycloak is a single instance with no clustering.
        Any Keycloak outage = no new logins immediately; total auth failure once issued JWTs expire.
        → Mitigation: short-term: tested runbook + 15min RTO target
                       long-term: Keycloak active-passive cluster (Phase 2, on split vm-identity)
        Priority: MEDIUM

RISK-5  Stage isolation depends on 5 enforcement controls (see §2 table).
        If any one breaks, stage code can affect prod customers.
        → Mitigation: quarterly red-team verification of each control.
                       Especially: Infisical secret-path scoping and Stalwart recipient filter.
        Priority: MEDIUM — easy to forget once it's working

RISK-6  Infisical downtime during multi-VM restart causes delayed cold start.
        → Mitigation: Orca startup ordering + bootstrap secrets for Keycloak only
        Priority: LOW — documented runbook; cold start is rare

RISK-7  ERPNext → Tenant Registry webhook has no guaranteed delivery.
        Failed activation = tenant not active after contract signed.
        → Mitigation: Frappe retry + idempotent /activate endpoint + manual Backstage trigger
        Priority: LOW

RISK-8  LiteLLM calls external AI APIs (OpenAI / Anthropic).
        A provider outage disables AI features in CERTifAI and compliance.
        → Mitigation: LiteLLM fallback routing; products degrade gracefully.
        Priority: LOW — external dependency, by design
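
The RISK-7 mitigation combines retries with an idempotent `/activate` endpoint, so a webhook delivered twice does no harm. The sketch below shows the idea with an in-memory store. `TenantRegistry`, `activate`, and `deliver_with_retry` are hypothetical names for illustration and do not reflect the real Frappe or tenant-registry APIs.

```python
class TenantRegistry:
    """In-memory stand-in for the service behind the /activate endpoint."""

    def __init__(self):
        self._active = {}  # tenant_id -> contract_id
        self.provision_calls = 0  # counts the non-repeatable side effect

    def activate(self, tenant_id, contract_id):
        # Idempotent: a repeated delivery for an active tenant is a no-op.
        if tenant_id in self._active:
            return "already-active"
        self.provision_calls += 1
        self._active[tenant_id] = contract_id
        return "activated"


def deliver_with_retry(send, attempts=3):
    """Call `send` until it succeeds, up to `attempts` times (attempts >= 1).

    A real sender would wait between attempts; the delay is omitted here.
    """
    last_error = None
    for _ in range(attempts):
        try:
            return send()
        except ConnectionError as exc:
            last_error = exc
    raise last_error
```

Idempotency matters because the sender cannot tell "request never arrived" from "response was lost". In the second case the retry reaches a tenant that is already active, and the endpoint has to accept it without provisioning twice.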

13. Growth Trajectory — when to add VMs

The locked 4-VM topology is right for 5-200 customers. Past that, expect to add VMs back in this order:

Tier A (5-200 cust):    4 VMs as locked         €192/mo compute (36M upfront)
                        ↓
Tier B (200-500):       Bump vm-data m2.med → m2.large       +€64/mo
                        Add cold-standby vm-edge-spare        +€0 (idle, paid only on swap)
                        ↓
Tier C (500-1000):      Split vm-data: vm-data + vm-data-db   +€64/mo
                        (postgres-app moves to its own VM, or DBaaS cluster +€213/mo)
                        Split vm-control: vm-control + vm-ops +€64/mo
                        (ERPNext + MariaDB + Stalwart move to vm-ops)
                        ↓
Tier D (1000-2000):     Split vm-edge: vm-edge + vm-identity + vm-secrets   +€96/mo
                        HA Keycloak active-passive on 2× vm-identity        +€32/mo
                        Octavia Load Balancer Double Instance               +€58/mo
                        vm-data m2.large → m2.xlarge or 2×                  +€128-256/mo
                        ↓
                        Final topology ≈ 8 prod VMs + DBaaS

Each step is justified by a measurable signal (>80% RAM, >70% CPU, sustained queue depth, or a specific outage scenario). Never split preemptively.
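
The RAM and CPU thresholds above can be checked mechanically before anyone proposes a split. This is a minimal sketch under stated assumptions: it treats a signal as present only when every sample in the window exceeds the limit (the text says "sustained" only for queue depth), and `VmSample` and `split_signals` are hypothetical names. Queue depth and outage scenarios are left out because the document gives no numeric threshold for them.

```python
from dataclasses import dataclass
from typing import List, Sequence

RAM_LIMIT_PCT = 80.0  # ">80% RAM" from the text
CPU_LIMIT_PCT = 70.0  # ">70% CPU" from the text


@dataclass(frozen=True)
class VmSample:
    ram_pct: float
    cpu_pct: float


def split_signals(samples: Sequence[VmSample]) -> List[str]:
    """Return the thresholds that every sample in the window exceeds."""
    if not samples:
        return []
    signals = []
    if all(s.ram_pct > RAM_LIMIT_PCT for s in samples):
        signals.append("ram")
    if all(s.cpu_pct > CPU_LIMIT_PCT for s in samples):
        signals.append("cpu")
    return signals
```

An empty result means no measurable signal, which under the "never split preemptively" rule means no new VM.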


14. Cost summary (see COST_PLAN.md for full breakdown)

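┌─────────────────────┬──────────────┬──────────────┬──────────────┬───────────┬───────────┐
│ Mode                │ Compute €/mo │ Storage €/mo │ Network €/mo │ Total net │ + 19% VAT │
├─────────────────────┼──────────────┼──────────────┼──────────────┼───────────┼───────────┤
│ On-Demand           │ 434.50       │ 112          │ 2.92         │ 549.42    │ 653.81    │
│ 12-month commit     │ 295.20       │ 112          │ 2.92         │ 410.12    │ 488.04    │
│ 36-month no upfront │ 216.00       │ 112          │ 2.92         │ 330.92    │ 393.79    │
│ 36-month upfront    │ 192.00       │ 112          │ 2.92         │ 306.92    │ 365.23    │
└─────────────────────┴──────────────┴──────────────┴──────────────┴───────────┴───────────┘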

Plus €6,912 net one-time payment if signing 36M-upfront for the compute portion.
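
The figures in the cost summary can be reproduced with a few lines of decimal arithmetic, which is handy when a price changes and the totals need rechecking. This sketch assumes half-up rounding to the cent; `monthly_totals` is a hypothetical helper and the inputs are copied from the summary above.

```python
from decimal import Decimal, ROUND_HALF_UP

STORAGE = Decimal("112")
NETWORK = Decimal("2.92")
VAT_RATE = Decimal("0.19")

COMPUTE = {
    "On-Demand": Decimal("434.50"),
    "12-month commit": Decimal("295.20"),
    "36-month no upfront": Decimal("216.00"),
    "36-month upfront": Decimal("192.00"),
}


def monthly_totals(compute):
    """Return (net, gross) monthly totals in EUR for one pricing mode."""
    net = compute + STORAGE + NETWORK
    gross = (net * (1 + VAT_RATE)).quantize(Decimal("0.01"), rounding=ROUND_HALF_UP)
    return net, gross


if __name__ == "__main__":
    for mode, compute in COMPUTE.items():
        net, gross = monthly_totals(compute)
        print(f"{mode}: net {net}, gross {gross}")
    # The one-time payment equals 36 months of the upfront compute rate.
    print(COMPUTE["36-month upfront"] * 36)
```

The last line shows where the €6,912 figure comes from: 36 × €192.00 of compute.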


End of document. Review quarterly or after any significant infrastructure change. Topology last locked 2026-05-18.