# Infrastructure Specification

**Status:** Locked Topology
**Authors:** Sharang, Benjamin
**Date:** 2026-05-11 (topology lock: 2026-05-18)
**Companion docs:** PLATFORM_ARCHITECTURE.md, IMPLEMENTATION_PLAN.md, COST_PLAN.md
**Cloud provider:** SysEleven Cloud Services (DUS2, OpenStack)

---

## 1. VM Inventory

**Four billable VMs total.** Three in production (one per plane after collapsing Identity+Infra), one in stage. Dev runs entirely on developer laptops via docker-compose.

```
┌────────────┬───────┬────────────────────────┬───────────┬──────────────────┐
│ Name       │ Env   │ SysEleven flavor       │ Public IP │ Planes owned     │
├────────────┼───────┼────────────────────────┼───────────┼──────────────────┤
│ vm-edge    │ prod  │ m2.small (2v / 8 GB)   │ YES (1)   │ Identity + Infra │
│ vm-control │ prod  │ m2.medium (4v / 16 GB) │ No        │ Control          │
│ vm-data    │ prod  │ m2.medium (4v / 16 GB) │ No        │ Data             │
│ stage      │ stage │ m2.small (2v / 8 GB)   │ YES (1)   │ App plane only   │
│ (dev)      │ dev   │ local docker-compose   │ n/a       │ all (in-memory)  │
└────────────┴───────┴────────────────────────┴───────────┴──────────────────┘
```

**Total compute:** 48 GiB-RAM, 12 vCPU. **Monthly compute net: €192 (36M upfront) / €295 (12M) / €435 (On-Demand).** See COST_PLAN.md for the full three-mode table.

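The compute totals can be re-derived from the inventory table. The sketch below is illustrative only: the `Vm` dataclass and `totals` helper are hypothetical names, and the figures are copied from the table (dev is left out because it is not billable).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Vm:
    name: str
    env: str
    vcpu: int
    ram_gb: int


# Rows copied from the inventory table; the local dev environment is not billable.
INVENTORY = [
    Vm("vm-edge", "prod", 2, 8),
    Vm("vm-control", "prod", 4, 16),
    Vm("vm-data", "prod", 4, 16),
    Vm("stage", "stage", 2, 8),
]


def totals(vms):
    """Return (total vCPU, total RAM in GB) for a list of VMs."""
    return sum(v.vcpu for v in vms), sum(v.ram_gb for v in vms)


total_vcpu, total_ram = totals(INVENTORY)
print(f"{len(INVENTORY)} billable VMs: {total_vcpu} vCPU, {total_ram} GB RAM")
```

Re-running this after any flavor change is a cheap way to keep the "Total compute" line in step with the table.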
### Why this topology and not the previous 7-VM layout

The earlier draft proposed one VM per service group (vm-gateway, vm-identity, vm-secrets, vm-ops, vm-control, vm-certifai, vm-compliance). That gave maximum failure isolation but cost 132 GiB of RAM across stage and prod. At 5 customers that isolation goes unused: every VM ran at <10% utilisation. The locked topology buys back failure isolation incrementally as load grows (see §13 Growth Trajectory).

Critical isolations preserved even at 4 VMs:
- **vm-edge isolates identity from app workloads.** Keycloak's JVM runs on its own host, with its own memory and page cache, so ERPNext background jobs cannot starve token issuance.
- **vm-data isolates data-plane databases from control-plane workloads.** All data-plane DBs share one host, but they are walled off from the portal, ERPNext, and Stalwart competing on vm-control.
- **stage runs the app plane only.** It calls prod Keycloak + prod Tenant Registry under `tenant.kind = stage` rather than mirroring those services.

---

## 2. Service-to-VM Mapping

```
vm-edge (prod, m2.small 8 GB, public IP)
├── orca-proxy          (Orca-managed; wildcard TLS terminator)
├── powerdns-auth       (Orca-managed; authoritative DNS for breakpilot.com)
├── keycloak-26         (Orca-managed; JVM, ~1.5 GB heap)
├── postgres-keycloak   (Orca-managed; dedicated PG instance for Keycloak only)
├── infisical           (Orca-managed)
├── postgres-infisical  (Orca-managed; dedicated PG instance for Infisical only)
├── redis-infisical     (Orca-managed; ephemeral)
└── gitea               (Orca-managed; SQLite backend to avoid a third PG)

vm-control (prod, m2.medium 16 GB)
├── customer-portal     (Orca-managed; Next.js)
├── tenant-registry     (Orca-managed; Go)
├── orca-controller     (Orca core process; NOT a managed container)
├── erpnext             (Orca-managed; Frappe bench)
├── frappe-hd           (same bench as ERPNext)
├── mariadb             (Orca-managed; for ERPNext)
├── redis-erpnext       (Orca-managed)
└── stalwart-mail       (Orca-managed; SMTP/IMAP/JMAP on mail.breakpilot.com)

vm-data (prod, m2.medium 16 GB)
├── certifai-dashboard  (Orca-managed)
├── mongodb             (Orca-managed)
├── litellm             (Orca-managed)
├── backend-compliance  (Orca-managed)
├── ai-compliance-sdk   (Orca-managed)
├── admin-compliance    (Orca-managed)
├── postgres-app        (Orca-managed; schemas: tenant_registry, compliance)
├── qdrant              (Orca-managed)
└── minio               (Orca-managed)

stage (stage, m2.small 8 GB, public IP)
├── orca-proxy          (light; only routes to stage app)
├── customer-portal     (NEW VERSION under test)
├── tenant-registry     (NEW VERSION under test, talks to ephemeral PG below)
├── certifai-dashboard  (NEW VERSION under test)
├── backend-compliance  (NEW VERSION under test)
├── ai-compliance-sdk   (NEW VERSION under test)
├── admin-compliance    (NEW VERSION under test)
├── litellm             (light; same image as prod)
├── postgres-app-stage  (ephemeral; lives entirely on stage VM)
├── mongodb-stage       (ephemeral)
└── qdrant-stage        (ephemeral, tiny corpus)

    Calls OUT to prod:
      → auth.breakpilot.com        (Keycloak token issuance, under stage client_id)
      → mail.breakpilot.com        (Stalwart SMTP, recipient filter forces +stage@ only)
      → Polar SANDBOX webhook URL  (NEVER prod Polar)
      → no calls to prod Postgres-app, MariaDB, MongoDB
```

### Stage isolation rules (enforced at the platform, not in product code)

| Risk | Enforcement mechanism | Owner |
|---|---|---|
| Stage writes to prod database | Infisical scope: stage app only gets `/stage/*` secrets. Prod DB credentials never reach stage. | Infra plane |
| Stage emails real customers | Stalwart accept-rule: drop if recipient does not match `*+stage@*`. | Control plane (Stalwart config) |
| Stage triggers real Polar charges | Stage env points `POLAR_API_URL` to sandbox. Prod Polar webhook secret never on stage. | Control plane |
| Stage Keycloak JWT used in prod | `stage_client_id` issued only by Keycloak; prod services reject JWTs with this aud. | Identity plane |
| Stage load DOSes prod Keycloak | Keycloak rate-limit per client_id; stage limited to 60 req/s. | Identity plane |

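The recipient rule in the table can be illustrated with a glob match. This is a minimal sketch of the matching logic only, not Stalwart configuration syntax; `stage_may_send` and `filter_recipients` are hypothetical helper names, and lower-casing the address before matching is an assumption.

```python
from fnmatch import fnmatchcase

# Pattern taken from the accept-rule in the table above.
STAGE_RECIPIENT_PATTERN = "*+stage@*"


def stage_may_send(recipient: str) -> bool:
    """True if a stage-originated mail to this recipient would be accepted."""
    return fnmatchcase(recipient.lower(), STAGE_RECIPIENT_PATTERN)


def filter_recipients(recipients):
    """Split recipients into (accepted, dropped) according to the stage rule."""
    accepted = [r for r in recipients if stage_may_send(r)]
    dropped = [r for r in recipients if not stage_may_send(r)]
    return accepted, dropped


accepted, dropped = filter_recipients(
    ["qa+stage@example.com", "customer@example.com"]
)
print("accepted:", accepted)
print("dropped:", dropped)
```

A check like this could back the quarterly red-team drill: feed it a sample of real customer addresses and confirm none are accepted.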
---

## 3. Network Topology

```
                          INTERNET
                              │
    (breakpilot.com — authoritative on vm-edge PowerDNS;
     stage.breakpilot.com — authoritative same zone)
                              │
                ┌─────────────┴─────────────┐
                │                           │
        ┌───────▼────────┐         ┌────────▼─────────┐
        │ vm-edge        │         │ stage            │
        │ (public IP)    │         │ (public IP)      │
        │                │         │                  │
        │ orca-proxy ────┤         │ orca-proxy       │
        │ powerdns       │         │ portal-new       │
        │ keycloak       │◄────────┤ tenant-registry-new
        │ pg-keycloak    │ stage   │ certifai-new     │
        │ infisical      │ calls   │ compliance-new   │
        │ pg-infisical   │ prod    │ pg-stage         │
        │ redis-infis    │ KC +    │ mongo-stage      │
        │ gitea          │ Stalwart│ qdrant-stage     │
        └───────┬────────┘         └──────────────────┘
                │  PRIVATE NETWORK 10.0.0.0/16
       ┌────────┴─────────┐
       │                  │
┌──────▼───────┐  ┌───────▼──────┐
│ vm-control   │  │ vm-data      │
│              │  │              │
│ portal       │  │ certifai     │
│ tenant-reg   │  │ mongodb      │
│ orca-ctrl    │  │ litellm      │
│ erpnext      │  │ backend-comp │
│ frappe-hd    │  │ ai-sdk       │
│ mariadb      │  │ admin-comp   │
│ redis-erp    │  │ pg-app       │
│ stalwart     │  │ qdrant       │
└──────────────┘  │ minio        │
                  └──────────────┘

Orca-Proxy routing (vm-edge, by Host header):
  auth.breakpilot.com  → 127.0.0.1:8443   (Keycloak, local on vm-edge)
  erp.breakpilot.com   → vm-control:8000  (ERPNext)             [allowlist: our IPs only]
  git.breakpilot.com   → vm-edge:3000     (Gitea, local)        [allowlist: our IPs only]
  mail.breakpilot.com  → vm-control:587   (Stalwart submission) [allowlist: VM internal only]
  ns1.breakpilot.com   → 127.0.0.1:53     (PowerDNS, local)
  *.breakpilot.com     → vm-control:3000  (customer portal)

Orca-Proxy routing (stage, by Host header):
  *.stage.breakpilot.com → 127.0.0.1:3000 (stage portal — all subdomains route here)
```

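The vm-edge routing table mixes exact hosts with a wildcard catch-all. The sketch below shows one way such a lookup could resolve a Host header. It assumes exact entries win over wildcard entries, which the spec does not state about Orca-Proxy; `EDGE_ROUTES` and `resolve` are hypothetical names.

```python
from fnmatch import fnmatchcase

# (host pattern, backend) pairs copied from the vm-edge routing table.
EDGE_ROUTES = [
    ("auth.breakpilot.com", "127.0.0.1:8443"),
    ("erp.breakpilot.com", "vm-control:8000"),
    ("git.breakpilot.com", "vm-edge:3000"),
    ("mail.breakpilot.com", "vm-control:587"),
    ("ns1.breakpilot.com", "127.0.0.1:53"),
    ("*.breakpilot.com", "vm-control:3000"),
]


def resolve(host_header, routes=EDGE_ROUTES):
    """Return the backend for a Host header, or None if nothing matches.

    Assumption: exact host entries are checked before wildcard entries.
    """
    host = host_header.lower().split(":")[0]  # drop an optional :port suffix
    for pattern, backend in routes:
        if "*" not in pattern and host == pattern:
            return backend
    for pattern, backend in routes:
        if "*" in pattern and fnmatchcase(host, pattern):
            return backend
    return None


for h in ("auth.breakpilot.com", "acme.breakpilot.com", "example.org"):
    print(h, "->", resolve(h))
```

Checking exact hosts first keeps `auth.` and `erp.` from being swallowed by the customer-portal wildcard regardless of the order in which routes are declared.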
---

## 4. Storage and Volume Requirements

Block volumes (Ceph 3x replicated, €0.10/GiB/mo) mounted to each VM.

```
┌────────────┬──────────────────────────────────────────┬─────────┬────────────────────┐
│ VM         │ Data stores                              │ +Block  │ Growth profile     │
├────────────┼──────────────────────────────────────────┼─────────┼────────────────────┤
│ vm-edge    │ pg-keycloak + pg-infisical + Gitea repos │ +50 GB  │ Slow               │
│ vm-control │ MariaDB (ERPNext) + Stalwart mail spool  │ +250 GB │ Medium             │
│ vm-data    │ MongoDB + pg-app + Qdrant + MinIO        │ +500 GB │ Fast (scales w/ N) │
│ stage      │ pg-stage + mongo-stage + qdrant-stage    │ +50 GB  │ Resets per release │
└────────────┴──────────────────────────────────────────┴─────────┴────────────────────┘

Each VM's root disk: 50 GB ephemeral, included in flavor price.

Object storage (S3, €0.02/GiB/mo single-region or €0.0496/GiB/mo geo-redundant):
┌──────────────────────────────┬─────────┬────────────────────────┐
│ Bucket                       │ Size    │ Purpose                │
├──────────────────────────────┼─────────┼────────────────────────┤
│ s3://backups (geo-redundant) │ ~500 GB │ Database dumps         │
│ s3://seed-data               │ ~30 GB  │ Demo tenant fixtures   │
│ s3://exports                 │ ~50 GB  │ GDPR/offboarding ZIPs  │
│ s3://audit-archive           │ ~20 GB  │ Old audit log overflow │
└──────────────────────────────┴─────────┴────────────────────────┘
```

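The storage figures above translate into a monthly cost with simple arithmetic. The sketch below assumes the table sizes can be read as GiB (the tables say GB while prices are per GiB) and that the approximate bucket sizes are exact; COST_PLAN.md remains the authoritative source. All names are hypothetical.

```python
# Unit prices from the text, in EUR per GiB per month.
BLOCK_EUR_PER_GIB = 0.10
S3_SINGLE_EUR_PER_GIB = 0.02
S3_GEO_EUR_PER_GIB = 0.0496

# Sizes from the tables, treated as GiB (assumption).
BLOCK_VOLUMES_GIB = {"vm-edge": 50, "vm-control": 250, "vm-data": 500, "stage": 50}
# bucket name -> (size in GiB, geo-redundant?)
BUCKETS = {
    "backups": (500, True),
    "seed-data": (30, False),
    "exports": (50, False),
    "audit-archive": (20, False),
}


def block_cost(volumes):
    """Monthly cost of all block volumes."""
    return sum(volumes.values()) * BLOCK_EUR_PER_GIB


def object_cost(buckets):
    """Monthly cost of all buckets, priced by redundancy class."""
    total = 0.0
    for size_gib, geo in buckets.values():
        price = S3_GEO_EUR_PER_GIB if geo else S3_SINGLE_EUR_PER_GIB
        total += size_gib * price
    return total


print(f"block storage:  EUR {block_cost(BLOCK_VOLUMES_GIB):.2f}/mo")
print(f"object storage: EUR {object_cost(BUCKETS):.2f}/mo")
```

Because vm-data's volume is the one flagged as fast-growing, it is the input most worth revisiting when these numbers are refreshed.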
---

## 5. Backup Requirements

All backups ship to **SysEleven Object Storage** (S3-compatible, geo-redundant DUS2 ↔ HAM1 for production-critical data). Backup jobs run as Orca one-shot containers on cron. Infisical holds the S3 credentials.

```
┌────────────────────────┬──────────────────┬───────────┬───────────┬───────────────────────┐
│ Data store             │ Method           │ Frequency │ Retention │ Owner (who restores)  │
├────────────────────────┼──────────────────┼───────────┼───────────┼───────────────────────┤
│ pg-keycloak (vm-edge)  │ pg_dump → S3-geo │ Every 6h  │ 14 days   │ Infra Plane           │
│ pg-infisical (vm-edge) │ pg_dump → S3-geo │ Daily     │ 30 days   │ Infra Plane           │
│ Gitea (vm-edge)        │ gitea dump → S3  │ Daily     │ 30 days   │ Infra Plane           │
│ Keycloak realm export  │ KC export → S3   │ Daily     │ 14 days   │ Identity Plane (owns) │
│ Infisical store        │ encrypted → S3   │ Daily     │ 30 days   │ Infra Plane           │
│ MariaDB (vm-control)   │ mysqldump → S3   │ Every 6h  │ 30 days   │ Control Plane         │
│ Stalwart queue/store   │ tar → S3         │ Daily     │ 7 days    │ Control Plane         │
│ pg-app (vm-data)       │ pg_dump → S3-geo │ Every 6h  │ 30 days   │ Data Plane (owns RPO) │
│ MongoDB (vm-data)      │ mongodump → S3   │ Daily     │ 30 days   │ Data Plane            │
│ MinIO (vm-data)        │ mc mirror → S3   │ Daily     │ 90 days   │ Data Plane            │
│ Qdrant (vm-data)       │ API snap → S3    │ Daily     │ 14 days   │ Data Plane (rebuild)  │
│ stage *                │ no backup        │ —         │ —         │ — (ephemeral)         │
│ Orca config (IaC)      │ Gitea (VCS)      │ On commit │ Forever   │ Infra Plane           │
└────────────────────────┴──────────────────┴───────────┴───────────┴───────────────────────┘
```

### RPO by data criticality

```
CRITICAL (RPO ≤ 6h)
  pg-keycloak       — org memberships, IdP config
  pg-app            — tenant registry, compliance records
  MariaDB/ERPNext   — sales orders, invoices, contracts

IMPORTANT (RPO ≤ 24h)
  MongoDB           — chat history, user preferences
  MinIO             — compliance evidence documents
  pg-infisical      — encrypted secrets
  Stalwart store    — inbound webhooks, bounce records

RECOVERABLE (RPO ≤ 48h, rebuildable)
  Qdrant            — vector index (rebuildable from MinIO source documents)
  Gitea             — code (mirrored on dev machines)
  Keycloak export   — org structure (pg-keycloak is primary)

NOT BACKED UP
  stage (any data)  — by design; restored from seed bundles on each deploy
  redis-*           — caches; restart cold
```

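A backup schedule only meets an RPO tier if its interval is no longer than the tier's window. The sketch below cross-checks the backup table against the RPO tiers under the simplifying assumption that worst-case data loss equals the backup interval (job duration and upload time are ignored). The dictionaries and `rpo_violations` are hypothetical names with values copied from the two listings.

```python
# Backup interval in hours, from the backup table ("Every 6h" = 6, "Daily" = 24).
BACKUP_INTERVAL_H = {
    "pg-keycloak": 6, "pg-app": 6, "MariaDB": 6,
    "MongoDB": 24, "MinIO": 24, "pg-infisical": 24, "Stalwart store": 24,
    "Qdrant": 24, "Gitea": 24, "Keycloak export": 24,
}

# RPO window in hours, from the criticality tiers.
RPO_H = {
    "pg-keycloak": 6, "pg-app": 6, "MariaDB": 6,
    "MongoDB": 24, "MinIO": 24, "pg-infisical": 24, "Stalwart store": 24,
    "Qdrant": 48, "Gitea": 48, "Keycloak export": 48,
}


def rpo_violations(intervals, rpos):
    """Return the stores whose backup interval exceeds their RPO window."""
    return sorted(
        name for name, rpo in rpos.items()
        if intervals.get(name, float("inf")) > rpo
    )


violations = rpo_violations(BACKUP_INTERVAL_H, RPO_H)
print("RPO violations:", violations or "none")
```

A store missing from the interval map is treated as never backed up, so adding a new data store without a backup job shows up as a violation.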
---

## 6. Constraint Framework

### Constraint types

```
AVAILABILITY    — required uptime percentage over rolling 30 days
RTO             — Recovery Time Objective: max time to restore service after failure
RPO             — Recovery Point Objective: max acceptable data loss window
IaC             — service must be declared in Orca config, no manual container runs in prod
SECRET_HYGIENE  — all secrets via Infisical machine identity; no env files, no hardcoded values
NETWORK         — whether service is internet-exposed or internal-only
DATA_RESIDENCY  — all data must remain in EU (SysEleven DUS2 + HAM1)
AUDIT_TRAIL     — all mutating actions logged (who, what, when, from where)
IMMUTABILITY    — config changes go through Gitea → Orca pipeline, not manual SSH
STAGE_ISOLATION — stage tenant cannot mutate any prod data; reads-only against prod KC + TR
```

### Plane ownership of constraints

Even though planes now share VMs, the **ownership model is unchanged**: the plane that owns a constraint owns it regardless of which VM hosts the service. The Infra Plane (now collapsed onto vm-edge alongside the Identity plane) still mechanically enforces backup, IaC, secrets, and network constraints.

```
╔══════════════════════════════════════════════════════════════════════════════════════════╗
║  IDENTITY PLANE (on vm-edge)                                                             ║
║                                                                                          ║
║  Owns / defines:                                                                         ║
║    AVAILABILITY    — must be ≥ 99.5% (root dep for everything)                           ║
║    RTO             — ≤ 15 min                                                            ║
║    AUDIT_TRAIL     — realm-level audit (logins, token issuance, IdP events)              ║
║    DATA_RESIDENCY  — Keycloak realm data must stay EU                                    ║
║    STAGE_ISOLATION — rate-limits stage_client_id; rejects stage JWTs in prod audiences   ║
║                                                                                          ║
║  Co-tenant note: shares vm-edge with Infra Plane services. JVM heap pinned to 1.5 GB     ║
║  in Orca manifest so it cannot starve PowerDNS / Infisical.                              ║
╚══════════════════════════════════════════════════════════════════════════════════════════╝

╔══════════════════════════════════════════════════════════════════════════════════════════╗
║  CONTROL PLANE (on vm-control)                                                           ║
║                                                                                          ║
║  Owns / defines:                                                                         ║
║    RPO (tenant)    — tenant registry & compliance schemas RPO ≤ 6h                       ║
║    RPO (ERPNext)   — sales orders, invoices RPO ≤ 6h                                     ║
║    AUDIT_TRAIL     — all portal actions (invites, IdP changes, impersonations)           ║
║    AVAILABILITY    — portal ≥ 99.5%; ERPNext ≥ 99% (internal)                            ║
║    RTO (portal)    — ≤ 10 min                                                            ║
║    RTO (ERPNext)   — ≤ 60 min                                                            ║
║                                                                                          ║
║  Co-tenant note: ERPNext + Portal + Stalwart on one VM. Orca resource limits enforced:   ║
║    portal:          1 GB memory cap                                                      ║
║    erpnext:         6 GB memory cap                                                      ║
║    mariadb:         3 GB memory cap                                                      ║
║    stalwart:        1 GB memory cap                                                      ║
║    tenant-registry: 500 MB                                                               ║
╚══════════════════════════════════════════════════════════════════════════════════════════╝

╔══════════════════════════════════════════════════════════════════════════════════════════╗
║  DATA PLANE (on vm-data)                                                                 ║
║                                                                                          ║
║  Owns / defines:                                                                         ║
║    DATA_RESIDENCY  — all customer data (MongoDB, pg-app, MinIO) must stay EU             ║
║    RPO (product)   — compliance records ≤ 6h; chat history ≤ 24h                         ║
║    DATA_ISOLATION  — every query scoped by org_id/tenant_id                              ║
║    AUDIT_TRAIL     — product-level actions                                               ║
║    AVAILABILITY    — CERTifAI ≥ 99.5%; compliance ≥ 99.5%                                ║
║                                                                                          ║
║  Co-tenant note: this VM is the SCALE driver. When vm-data hits 80% RAM, bump flavor     ║
║  (m2.medium → m2.large → m2.xlarge). See §13 Growth Trajectory.                          ║
╚══════════════════════════════════════════════════════════════════════════════════════════╝

╔══════════════════════════════════════════════════════════════════════════════════════════╗
║  INFRA PLANE (on vm-edge, alongside Identity)                                            ║
║                                                                                          ║
║  Owns / enforces ALL of:                                                                 ║
║    BACKUP          — executes all backup jobs (pg_dump, mongodump, mc mirror)            ║
║    IaC             — ALL services declared in Orca config; no manual prod changes        ║
║    IMMUTABILITY    — config changes: Gitea commit → Gitea Actions → Orca API only        ║
║    SECRET_HYGIENE  — Infisical (on vm-edge); provisions machine identities               ║
║    NETWORK         — Orca-Proxy rules; VM firewall; no direct VM public exposure         ║
║    DATA_RESIDENCY  — VM region = SysEleven DUS2; backups geo-redundant DUS2↔HAM1         ║
║    AVAILABILITY    — Orca restart policies, health checks                                ║
║    COLD_START      — enforces startup ordering (see §10 Scenario F)                      ║
║    STAGE_ISOLATION — Infisical secret-path scoping for stage_app identity                ║
╚══════════════════════════════════════════════════════════════════════════════════════════╝
```

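The vm-control memory caps only protect co-tenants if their sum stays below the VM's RAM. The sketch below does that check. It assumes 1 GB = 1024 MB and covers only the caps listed in the Control Plane box; redis-erpnext and the Orca controller have no cap listed there, so the real headroom is smaller. Names are hypothetical.

```python
VM_CONTROL_RAM_MB = 16 * 1024  # m2.medium, 16 GB

# Caps from the Control Plane co-tenant note, converted to MB.
CAPS_MB = {
    "portal": 1 * 1024,
    "erpnext": 6 * 1024,
    "mariadb": 3 * 1024,
    "stalwart": 1 * 1024,
    "tenant-registry": 500,
}


def headroom_mb(caps, vm_ram_mb):
    """RAM left for the OS and uncapped services once every cap is fully used."""
    return vm_ram_mb - sum(caps.values())


committed = sum(CAPS_MB.values())
left = headroom_mb(CAPS_MB, VM_CONTROL_RAM_MB)
print(f"committed: {committed} MB, headroom: {left} MB")
if left < 0:
    raise SystemExit("caps exceed VM RAM; limits cannot all be honoured")
```

Running a check like this in the manifest lint step would catch a cap increase that silently overcommits the host.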
---

## 7. SLA Table

```
┌─────────────────────┬──────────────┬──────────┬───────┬────────────────────────────────┐
│ Service             │ Availability │ RTO      │ RPO   │ Host VM                        │
├─────────────────────┼──────────────┼──────────┼───────┼────────────────────────────────┤
│ Orca-Proxy          │ 99.9%        │ 5 min    │ N/A   │ vm-edge                        │
│ PowerDNS            │ 99.9%        │ 5 min    │ N/A   │ vm-edge                        │
│ Keycloak            │ 99.5%        │ 15 min   │ 6h    │ vm-edge (root auth dep)        │
│ Infisical           │ 99.5%        │ 30 min   │ 24h   │ vm-edge (running svcs survive) │
│ Gitea               │ 99%          │ 2h       │ 24h   │ vm-edge (dev machines mirror)  │
│ Customer Portal     │ 99.5%        │ 10 min   │ N/A   │ vm-control                     │
│ Tenant Registry     │ 99.5%        │ 10 min   │ 6h    │ vm-control                     │
│ ERPNext             │ 99%          │ 60 min   │ 6h    │ vm-control (internal only)     │
│ Frappe HD           │ 99%          │ 60 min   │ 24h   │ vm-control                     │
│ MariaDB             │ 99.5%        │ 20 min   │ 6h    │ vm-control                     │
│ Stalwart Mail       │ 99%          │ 60 min   │ 24h   │ vm-control                     │
│ CERTifAI            │ 99.5%        │ 10 min   │ 24h   │ vm-data                        │
│ MongoDB             │ 99.5%        │ 20 min   │ 24h   │ vm-data                        │
│ LiteLLM             │ 99%          │ 5 min    │ N/A   │ vm-data                        │
│ backend-compliance  │ 99.5%        │ 10 min   │ 6h    │ vm-data                        │
│ ai-compliance-sdk   │ 99.5%        │ 10 min   │ 6h    │ vm-data                        │
│ pg-app              │ 99.9%        │ 20 min   │ 6h    │ vm-data (SPOF — RISK-1)        │
│ MinIO               │ 99.5%        │ 30 min   │ 24h   │ vm-data                        │
│ Qdrant              │ 99%          │ 2h       │ 24h   │ vm-data (rebuildable)          │
│ stage (any service) │ 95%          │ best ef. │ N/A   │ stage (ephemeral; no SLA)      │
└─────────────────────┴──────────────┴──────────┴───────┴────────────────────────────────┘
```

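Availability targets are easier to reason about as a downtime budget. Since §6 defines availability over a rolling 30 days, the allowed downtime is (100 − availability%) of 43,200 minutes. The sketch below applies that formula to the tiers used in the table; `downtime_budget_min` is a hypothetical helper name.

```python
WINDOW_MIN = 30 * 24 * 60  # rolling 30-day window, in minutes


def downtime_budget_min(availability_pct: float) -> float:
    """Minutes of downtime allowed per 30-day window at the given availability."""
    return WINDOW_MIN * (100.0 - availability_pct) / 100.0


for pct in (99.9, 99.5, 99.0, 95.0):
    print(f"{pct}% -> {downtime_budget_min(pct):.1f} min per 30 days")
```

Comparing a service's budget with its RTO shows how many full-length incidents per window the target can absorb before it is breached.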
---

## 8. IaC Constraint (Orca)

Every production service is declared in Orca config. No exceptions.

### Rules

```
1. ALL containers run via Orca manifests committed to Gitea
   → /orca/manifests/{vm-name}/{service-name}.toml
   → Changes go through: Gitea PR → Gitea Actions lint → Orca API apply

2. NO manual docker run / docker-compose up on any production VM
   → SSH to prod VMs allowed for debugging only; no state changes

3. Secrets are NEVER in Orca manifests
   → Manifests reference Infisical paths, not values
   → Bootstrap exception: Keycloak DB URI in Orca env (Keycloak runs ON vm-edge alongside
     Infisical, so chicken-and-egg is solved by Orca env file, not Infisical lookup)

4. Restart policy: always (Orca restarts crashed containers with exponential backoff)
   → Health check per service (HTTP /health or TCP probe)

5. Resource limits MANDATORY in every manifest
   → On a 3-VM prod, co-tenant noise is the single biggest risk; limits are non-negotiable
   → See §6 Plane ownership "Co-tenant note" boxes for the per-service caps

6. Orca controller state itself is recoverable
   → Manifest files in Gitea = desired state
   → Loss of Orca controller = re-apply manifests from Gitea, services continue running

7. Stage app gets its own Infisical scope
   → /stage/* path; no prod-DB credentials reach this scope
   → Enforced at Infisical machine-identity level, not in app code
```

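Rules 3 to 5 lend themselves to an automated lint step. The sketch below checks an already-parsed manifest for a memory limit, a health check, the restart policy, and inline secrets. The Orca manifest schema is not shown in this document, so every key name here (`resources.memory_limit`, `health_check`, `restart`, `env`) and the `infisical:` reference prefix are assumptions made purely for illustration.

```python
import re

# Heuristic for env keys that probably carry a secret (assumption).
SECRET_KEY_HINT = re.compile(r"(password|secret|token|api_key)", re.IGNORECASE)


def lint_manifest(manifest: dict) -> list:
    """Return a list of rule violations for one parsed manifest (hypothetical schema)."""
    problems = []
    if "memory_limit" not in manifest.get("resources", {}):
        problems.append("missing resources.memory_limit")
    if "health_check" not in manifest:
        problems.append("missing health_check")
    if manifest.get("restart") != "always":
        problems.append("restart policy must be 'always'")
    for key, value in manifest.get("env", {}).items():
        # A secret-looking key must point at an Infisical path, not hold a value.
        if SECRET_KEY_HINT.search(key) and not str(value).startswith("infisical:"):
            problems.append(f"env.{key} looks like an inline secret")
    return problems


good = {
    "resources": {"memory_limit": "1GB"},
    "health_check": {"http": "/health"},
    "restart": "always",
    "env": {"DB_PASSWORD": "infisical:/prod/portal/db_password"},
}
bad = {"restart": "on-failure", "env": {"API_TOKEN": "abc123"}}

print(lint_manifest(good))
print(lint_manifest(bad))
```

The documented Keycloak bootstrap exception would need an explicit allowlist entry in a real linter rather than a weaker rule for everyone.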
### Gitea Actions pipeline for infra changes

```
infra change committed to Gitea
        │
        ├── lint: validate Orca manifest schema
        ├── diff: show what changes will be applied (orca plan)
        ├── (manual approval gate for vm-edge changes — touches auth root)
        └── apply: POST to Orca Controller API → rolling update
```

---

## 9. Dependency Graph

Arrows = "requires to function." Dashed = soft (degrades, doesn't fail).
**Intra-VM dependencies elided** for clarity (e.g. Keycloak ↔ pg-keycloak are on the same host and start together).

```
                              EXTERNAL
                               AI APIs
                        (OpenAI / Anthropic)
                                  │
                                  │ (soft)
                                  ▼
┌────────────────────────────────────────────────────────────────────┐
│  vm-edge (Identity + Infra)                                        │
│  ┌──────────────────────────────────────────────────────────────┐  │
│  │  pg-keycloak ──► keycloak                                    │  │
│  │  pg-infisical ─► infisical ◄── (all VMs pull on startup)     │  │
│  │  redis-infis ──► infisical                                   │  │
│  │  (sqlite) ─────► gitea                                       │  │
│  │  powerdns-auth   (no deps)                                   │  │
│  │  orca-proxy      (route table only; backends are remote)     │  │
│  └──────────────────────────────────────────────────────────────┘  │
│                            │ Keycloak JWKS  │ Infisical /secrets   │
│                            │                │                      │
└────────────────────────────┼────────────────┼──────────────────────┘
                             ▼                ▼
┌────────────────────────────────────────────────────────────────────┐
│  vm-control (Control)                                              │
│  ┌──────────────────────────────────────────────────────────────┐  │
│  │  mariadb + redis-erp ──► erpnext + frappe-hd                 │  │
│  │  (intra) ──────────────► stalwart                            │  │
│  │  ──────────────────────► customer-portal                     │  │
│  │  ──────────────────────► tenant-registry ──► pg-app (vm-data)│  │
│  └──────────────────────────────────────────────────────────────┘  │
│                            │ tenant-registry API                   │
└────────────────────────────┼───────────────────────────────────────┘
                             ▼
┌────────────────────────────────────────────────────────────────────┐
│  vm-data (Data)                                                    │
│  ┌──────────────────────────────────────────────────────────────┐  │
│  │  mongodb ───► certifai ◄── (vm-edge JWKS, vm-edge secrets)   │  │
│  │  litellm ───► certifai, ai-compliance-sdk                    │  │
│  │  pg-app ────► tenant-registry-on-vm-control, backend-compl,  │  │
│  │               ai-compliance-sdk                              │  │
│  │  qdrant ────► ai-compliance-sdk                              │  │
│  │  minio ─────► backend-compliance                             │  │
│  │  backend-compliance ──► admin-compliance                     │  │
│  └──────────────────────────────────────────────────────────────┘  │
└────────────────────────────────────────────────────────────────────┘

┌────────────────────────────────────────────────────────────────────┐
│  stage (App plane only)                                            │
│  Calls vm-edge:8443 (KC) + vm-control:587 (Stalwart submission)    │
│  Calls Polar SANDBOX (never prod Polar webhook URL)                │
│  Its own ephemeral DBs; cannot read prod data                      │
└────────────────────────────────────────────────────────────────────┘
```

### Simplified critical path (customer login → product use)

```
DNS (vm-edge PowerDNS)
    │
    ▼
orca-proxy (vm-edge)
    │
    ├──► keycloak (vm-edge) ──► pg-keycloak (intra-VM)
    │
    └──► customer-portal (vm-control)
              ├──► tenant-registry (vm-control) ──► pg-app (vm-data)
              ├──► certifai (vm-data) ──► mongodb (intra-VM)
              └──► backend-compliance (vm-data) ──► pg-app (intra-VM)
                                                ──► ai-sdk ──► qdrant + minio
                                                ──► litellm ──► [external AI APIs]
```

---

## 10. Failure Scenarios and Deadlock Analysis

### Scenario A — vm-edge fails (HIGHEST SEVERITY)

```
Impact:     TOTAL outage. Nothing reachable from internet.
            No DNS. No TLS. No auth. No new logins. Running JWTs expire within 15 min,
            then ALL services start returning 401.
            Backstage and customer portal both fully blocked.
            Stage also blocked (depends on prod Keycloak).
Cascade:    T+0:   DNS fails → orca-proxy unreachable
            T+5m:  existing JWTs still valid; portal cached → partial reads work
            T+15m: JWTs expire → full outage
Deadlock:   None — services downstream don't deadlock, they just fail closed
Recovery:   1. Spin up vm-edge-spare (cold standby, same Orca config) — ~3 min provision
            2. Restore pg-keycloak + pg-infisical from latest backup — ~5 min
            3. Swap registrar NS records to spare IP (TTL 60s) — ~2 min propagation
            4. Restart all services on vm-edge-spare via Orca apply — ~3 min
            Total RTO target: 15 min
Mitigation: COLD STANDBY vm-edge-spare. Same Orca config committed in Gitea.
            Provision cost when idle: €0 (only billed when running).
            Test recovery quarterly.
Severity:   CRITICAL — single host owns 3 root dependencies (DNS, auth, secrets)
            Cost of fix at Tier C: split vm-edge into vm-edge + vm-identity + vm-secrets
            (back toward original 7-VM design) — €100/mo extra
```

### Scenario B — vm-control fails (NEW — consequence of plane consolidation)

```
Impact:     customer-portal: DOWN → /[slug]/* all return 503
            tenant-registry: DOWN → Keycloak protocol-mapper for products claim breaks
                             → users can log in but see "No active products"
            ERPNext + Frappe HD: DOWN → we cannot create sales orders or read tickets
            Stalwart: DOWN → no outbound emails (trial nudges, exports, ticket replies)
            MariaDB: DOWN → ERPNext queries fail; backups paused
            Products (CERTifAI, compliance): UNAFFECTED (on vm-data, JWTs still validate)
            Existing logged-in users: can use products directly via product subdomain
                                      IF they bookmark it; portal home is 503.
Cascade:    T+0:   portal 503; new tenant onboarding blocked (registry down)
            T+15m: existing JWTs missing refreshed products claim
            T+1h:  trial emails not sent → trial nudge cadence breaks
Deadlock:   None
Recovery:   Restart vm-control containers via Orca. If MariaDB corrupt: restore mysqldump.
            RTO target: 10 min (portal) / 60 min (ERPNext)
Mitigation: Multiple services co-hosted = single failure hits many SLAs.
            Resource limits in Orca prevent ERPNext OOM from killing portal.
            Quarterly drill: deliberately stop portal, measure recovery.
Severity:   HIGH — three services down at once, but products keep serving customers
            Cost of fix at Tier B/C: split vm-control → vm-portal + vm-ops (ERPNext)
            — €64/mo extra at m2.small
```

### Scenario C — vm-data fails

```
Impact:     tenant-registry queries: FAIL (pg-app down) → portal returns 503 for tenant lookup
            customer-portal: DEGRADED (login works, dashboard fails)
            CERTifAI: COMPLETELY DOWN
            backend-compliance + ai-sdk + admin: COMPLETELY DOWN
            ERPNext + Stalwart: UNAFFECTED
Cascade:    T+0:   products down; portal degraded
            T+15m: support tickets pile up
            Note: prod is partial — users see error pages but ERPNext + auth still work
Recovery:   Restart vm-data containers. If pg-app corrupt: restore from pg_dump (RPO 6h).
            RTO target: 20 min
Mitigation: This is the SCALE-event VM. RISK-1 below makes this the worst SPOF:
            one pg-app instance owns tenant_registry + compliance schemas.
            HIGH PRIORITY fix: split pg-app into separate clusters at Tier B/C transition.
Severity:   HIGH — products down, business operations (ERPNext) still work so we can
            contact customers
```

### Scenario D — LiteLLM fails

```
Impact:     CERTifAI: AI features fail (summarization, chat completion).
            CERTifAI dashboard, sessions: UNAFFECTED.
            compliance AI generation: FAILS (DSFA/TOM/VVT generation blocked).
            Compliance CRUD: UNAFFECTED.
Cascade:    Soft degradation only. Products show "AI features temporarily unavailable" banner.
Deadlock:   None.
Recovery:   Restart LiteLLM on vm-data (stateless, ~30s).
Severity:   MEDIUM — graceful degradation by design
```

### Scenario E — Stage VM compromised or buggy

```
Impact:     On stage itself: stage portal serves bad data; stage testers see errors.
            On prod: NONE if isolation rules in §2 are intact.
            Worst case if isolation breaks:
              - Stage code tries to call prod pg-app → fails (no creds in /stage/* Infisical)
              - Stage emits real email → blocked by Stalwart recipient filter
              - Stage triggers Polar charge → goes to sandbox, no real money
Cascade:    None to prod by design.
Recovery:   Roll back stage to previous image via Orca. RTO target: 5 min.
Mitigation: The 5 enforcement rules in §2 are the load-bearing controls. Verify quarterly
            via deliberate red-team: try to write to prod pg-app from stage and confirm
            the attempt is rejected.
Severity:   LOW (in prod) / HIGH (on stage, but stage SLA is 95%)
```

### Scenario F — Full Cold Start (Power Loss, All VMs Restart Simultaneously)

```
Three VMs boot at once. Services must start in dependency order or services
crash-loop until their deps are ready.

DEADLOCK RISK: vm-control services (portal, tenant-registry) start before vm-data
               services (pg-app, certifai, compliance). They'll crash-loop ~2-5min
               with backoff retries.
               Same for ERPNext on vm-control trying to reach Keycloak on vm-edge.

RESOLUTION:    Orca enforces cross-VM startup ordering via health-check dependencies.
               Bootstrap exception: Keycloak DB URI in Orca env on vm-edge (not from
               Infisical — chicken-and-egg solved).

Required cold start sequence:

  Phase 0 — Data roots on vm-data (parallel):
    pg-app, mongodb, qdrant, minio
  Phase 0 — Data roots on vm-control (parallel):
    mariadb, redis-erpnext
  Phase 0 — Data roots on vm-edge (parallel):
    pg-keycloak, pg-infisical, redis-infisical

  Phase 1 — Secrets + DNS on vm-edge:
    infisical             (needs: pg-infisical, redis-infisical)
    powerdns-auth         (no deps)

  Phase 2 — Identity on vm-edge:
    keycloak              (needs: pg-keycloak [Phase 0], infisical [Phase 1])
    gitea                 (needs: sqlite; ready from Phase 0)

  Phase 3 — Control on vm-control + Data services on vm-data (parallel):
    tenant-registry       (needs: keycloak [Phase 2], pg-app [Phase 0, remote])
    erpnext + frappe-hd   (needs: mariadb, redis-erpnext [Phase 0], keycloak [Phase 2])
    stalwart              (needs: infisical [Phase 1])
    litellm               (needs: infisical)
    certifai              (needs: keycloak, mongodb, litellm)
    backend-compliance    (needs: keycloak, pg-app)
    ai-compliance-sdk     (needs: pg-app, qdrant, litellm)
    admin-compliance      (needs: backend + sdk)

  Phase 4 — Customer-facing on vm-control:
    customer-portal       (needs: keycloak, tenant-registry)

  Phase 5 — Gateway on vm-edge (last):
    orca-proxy            (waits for all backends healthy before opening listener)

Estimated cold-start time: 6-10 minutes (faster than 7-VM since less network roundtrip)
```

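The cold-start phases above are a topological ordering of the "needs" lists. The sketch below derives start waves from those lists with Python's standard `graphlib` module (Python 3.9+). It is not how Orca computes ordering, which this document does not describe. The `orca-proxy` entry lists a representative subset of backends as an assumption, and frappe-hd is folded into `erpnext` because both share one bench.

```python
from graphlib import TopologicalSorter

# service -> services it needs, copied from the Phase 0-5 listing.
NEEDS = {
    "pg-app": [], "mongodb": [], "qdrant": [], "minio": [],
    "mariadb": [], "redis-erpnext": [],
    "pg-keycloak": [], "pg-infisical": [], "redis-infisical": [],
    "infisical": ["pg-infisical", "redis-infisical"],
    "powerdns-auth": [],
    "keycloak": ["pg-keycloak", "infisical"],
    "gitea": [],
    "tenant-registry": ["keycloak", "pg-app"],
    "erpnext": ["mariadb", "redis-erpnext", "keycloak"],
    "stalwart": ["infisical"],
    "litellm": ["infisical"],
    "certifai": ["keycloak", "mongodb", "litellm"],
    "backend-compliance": ["keycloak", "pg-app"],
    "ai-compliance-sdk": ["pg-app", "qdrant", "litellm"],
    "admin-compliance": ["backend-compliance", "ai-compliance-sdk"],
    "customer-portal": ["keycloak", "tenant-registry"],
    # Assumption: the proxy waits on these representative backends.
    "orca-proxy": ["customer-portal", "keycloak", "erpnext", "gitea", "stalwart"],
}


def start_waves(needs):
    """Group services into waves; each wave only needs services from earlier waves."""
    sorter = TopologicalSorter(needs)
    sorter.prepare()  # raises graphlib.CycleError if the graph has a deadlock
    waves = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())
        waves.append(ready)
        sorter.done(*ready)
    return waves


for i, wave in enumerate(start_waves(NEEDS)):
    print(f"wave {i}: {', '.join(wave)}")
```

The computed waves are the earliest each service could start, so a few services (for example stalwart and litellm) land one wave earlier than the phase they are grouped under above. A `CycleError` from `prepare()` would be the signal that a new dependency has introduced a genuine cold-start deadlock.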
### Scenario G — Tenant Registry fails

```
Impact:     Portal cannot resolve tenant from subdomain → /[slug]/* all 503
            Keycloak protocol mapper cannot get products claim → JWT missing field
            → users can log in but see "No active products"
            Products (CERTifAI, compliance) themselves: UNAFFECTED if already authenticated
Cascade:    New logins degraded.
            Existing sessions continue.
Deadlock:   None.
Recovery:   Restart tenant-registry on vm-control. pg-app on vm-data must be healthy.
            RTO target: ≤ 60s
Mitigation: Portal caches slug → tenant mapping with 60s TTL.
            Short outage invisible to customers.
Severity:   MEDIUM
```

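The mitigation above relies on a slug → tenant cache with a 60 s TTL. The sketch below shows that caching pattern in isolation. It is not the portal's implementation (the portal is a Next.js app); `SlugCache` and its `lookup` callback are hypothetical, and the clock is injectable so the expiry behaviour can be tested without sleeping.

```python
import time


class SlugCache:
    """Cache slug -> tenant lookups for a fixed TTL (60 s in the spec)."""

    def __init__(self, lookup, ttl_s=60.0, clock=time.monotonic):
        self._lookup = lookup      # callable hitting the tenant registry
        self._ttl_s = ttl_s
        self._clock = clock
        self._entries = {}         # slug -> (tenant, stored_at)

    def get(self, slug):
        now = self._clock()
        entry = self._entries.get(slug)
        if entry is not None and now - entry[1] < self._ttl_s:
            return entry[0]        # fresh hit: the registry is not contacted
        tenant = self._lookup(slug)  # may raise if the registry is down
        self._entries[slug] = (tenant, now)
        return tenant


calls = []


def fake_registry(slug):
    calls.append(slug)
    return {"slug": slug, "tenant_id": f"t-{slug}"}


fake_now = [0.0]
cache = SlugCache(fake_registry, ttl_s=60.0, clock=lambda: fake_now[0])
cache.get("acme")
cache.get("acme")        # served from cache
fake_now[0] = 61.0
cache.get("acme")        # TTL expired, registry is asked again
print("registry calls:", len(calls))
```

With this design a registry outage shorter than the remaining TTL is invisible for slugs already cached, while slugs not yet cached fail immediately.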
---
|
||
|
||
## 11. Cross-Dependency Summary Table
|
||
|
||
```
Needs →              │PG-KC│PG-Inf│PG-App│Mongo│Maria│Redis│Minio│Qdrant│ KC  │Infis│Lit. │T.Reg│
─────────────────────┼─────┼──────┼──────┼─────┼─────┼─────┼─────┼──────┼─────┼─────┼─────┼─────┤
keycloak             │  ●  │      │      │     │     │     │     │      │     │  ◐* │     │     │
infisical            │     │  ●   │      │     │     │  ●  │     │      │     │     │     │     │
gitea                │     │      │      │     │     │     │     │      │     │  ●  │     │     │
tenant-registry      │     │      │  ●   │     │     │     │     │      │  ●  │  ●  │     │     │
customer-portal      │     │      │      │     │     │     │     │      │  ●  │  ●  │     │  ●  │
erpnext              │     │      │      │     │  ●  │  ●  │     │      │  ●  │  ●  │     │     │
frappe-hd            │     │      │      │     │  ●  │  ●  │     │      │     │  ●  │     │     │
stalwart             │     │      │      │     │     │     │     │      │     │  ●  │     │     │
certifai             │     │      │      │  ●  │     │     │     │      │  ●  │  ●  │  ◐  │     │
litellm              │     │      │      │     │     │     │     │      │     │  ●  │     │     │
backend-compl.       │     │      │  ●   │     │     │     │     │      │  ●  │  ●  │     │     │
ai-compl-sdk         │     │      │  ●   │     │     │     │     │  ●   │     │  ●  │  ◐  │     │
admin-compl.         │     │      │      │     │     │     │     │      │     │     │     │     │
orca-proxy           │     │      │      │     │     │     │     │      │     │     │     │     │
stage-app            │     │      │      │     │     │     │     │      │  ●  │  ◑  │     │  ◑  │

●  = hard dependency (cannot start without)
◐  = soft dependency (starts, features degrade)
◑  = stage-only read-mostly dependency (writes blocked by Infisical scope)
◐* = bootstrap exception (Keycloak DB URI in Orca env on vm-edge, not Infisical)
```
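The hard-dependency (●) cells of the matrix above can be used to estimate the blast radius of a single failure. A minimal sketch, assuming the lowercase identifiers below as stand-ins for the column headers (`pg-kc` for PG-KC, `pg-infra` for PG-Inf, and so on) and ignoring the soft and stage-only markers; `HARD` and `blast_radius` are illustrative names.

```python
# Hard (●) dependencies transcribed from the matrix; soft/stage-only cells are omitted.
HARD = {
    "keycloak": {"pg-kc"},
    "infisical": {"pg-infra", "redis"},
    "gitea": {"infisical"},
    "tenant-registry": {"pg-app", "keycloak", "infisical"},
    "customer-portal": {"keycloak", "infisical", "tenant-registry"},
    "erpnext": {"mariadb", "redis", "keycloak", "infisical"},
    "frappe-hd": {"mariadb", "redis", "infisical"},
    "stalwart": {"infisical"},
    "certifai": {"mongodb", "keycloak", "infisical"},
    "litellm": {"infisical"},
    "backend-compliance": {"pg-app", "keycloak", "infisical"},
    "ai-compliance-sdk": {"pg-app", "qdrant", "infisical"},
    "stage-app": {"keycloak"},
}


def blast_radius(failed, hard=HARD):
    """Services that cannot start while `failed` is down, following hard edges transitively."""
    affected = set()
    frontier = [failed]
    while frontier:
        dep = frontier.pop()
        for service, needs in hard.items():
            if dep in needs and service not in affected:
                affected.add(service)
                frontier.append(service)
    return affected
```

For example, `blast_radius("pg-app")` includes `customer-portal` even though the portal has no direct pg-app cell, because it hard-depends on `tenant-registry`.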

---

## 12. Open Infrastructure Risks (Priority Order)

```
RISK-1  pg-app (vm-data) is a single instance serving tenant_registry + compliance schemas.
        One crash blocks portal AND compliance product simultaneously.
        → Mitigation: split into pg-registry + pg-compliance at Tier B (200 customers).
          Move pg-registry to its own DBaaS PostgreSQL cluster (€213/mo).
        Priority: HIGH (fix before 100 customers; also flagged in COST_PLAN.md)

RISK-2  vm-edge is a single VM owning 3 root dependencies (DNS, auth, secrets).
        Failure = total external outage. Highest blast radius in the system.
        → Mitigation:
          Phase A: cold-standby vm-edge-spare (idle cost €0; tested quarterly)
          Phase B (Tier C, 500 cust): split vm-edge into vm-edge + vm-identity + vm-secrets
        Priority: HIGH

RISK-3  vm-control hosts 5 service groups (portal, tenant-registry, ERPNext, Frappe HD,
        Stalwart). Co-tenant noise risk; one OOM kills the others.
        → Mitigation:
          Phase A: hard Orca resource limits per service (see §6 co-tenant notes)
          Phase B (Tier B): split vm-control → vm-portal + vm-ops at €64/mo extra
        Priority: MEDIUM

RISK-4  Keycloak is a single instance with no clustering.
        Any Keycloak outage = total auth failure once existing JWTs expire (within one JWT TTL).
        → Mitigation: short-term: tested runbook + 15min RTO target
          long-term: Keycloak active-passive cluster (Phase 2, on split vm-identity)
        Priority: MEDIUM

RISK-5  Stage isolation depends on 5 enforcement controls (see §2 table).
        If any one breaks, stage code can affect prod customers.
        → Mitigation: quarterly red-team verification of each control.
          Especially: Infisical secret-path scoping and Stalwart recipient filter.
        Priority: MEDIUM (easy to forget once it's working)

RISK-6  Infisical downtime during multi-VM restart causes delayed cold start.
        → Mitigation: Orca startup ordering + bootstrap secrets for Keycloak only
        Priority: LOW (documented runbook; cold start is rare)

RISK-7  ERPNext → Tenant Registry webhook has no guaranteed delivery.
        Failed activation = tenant not active after contract signed.
        → Mitigation: Frappe retry + idempotent /activate endpoint + manual Backstage trigger
        Priority: LOW

RISK-8  LiteLLM calls external AI APIs (OpenAI / Anthropic); a provider outage is outside our control.
        → Mitigation: LiteLLM fallback routing; products degrade gracefully.
        Priority: LOW (external dependency, by design)
```
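RISK-7 leans on an idempotent `/activate` endpoint so that webhook retries are safe. A minimal sketch of the idempotency rule only, assuming activation is keyed by a tenant ID and a contract ID; the class, method names, and return values are hypothetical and do not describe the real Tenant Registry API.

```python
class ActivationStore:
    """In-memory stand-in for the registry's activation state (hypothetical)."""

    def __init__(self):
        self._active = {}  # tenant_id -> contract_id that activated it

    def activate(self, tenant_id, contract_id):
        """Activate a tenant; replaying the same webhook is a harmless no-op."""
        existing = self._active.get(tenant_id)
        if existing == contract_id:
            return "already-active"      # retry of a delivered webhook
        if existing is not None:
            # A different contract for an active tenant needs a human decision.
            raise ValueError("tenant already activated by another contract")
        self._active[tenant_id] = contract_id
        return "activated"
```

Because a replay returns the same end state, the sender can retry until it sees a success response without risking a double activation; the manual trigger can call the same path.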

---

## 13. Growth Trajectory — when to add VMs

The locked 4-VM topology is right for roughly 5–200 customers. Past that, expect to add VMs back in this order:

```
Tier A (5–200 cust):   4 VMs as locked                                      €192/mo compute (36M upfront)
     ↓
Tier B (200–500):      Bump vm-data m2.medium → m2.large                    +€64/mo
                       Add cold-standby vm-edge-spare                       +€0 (idle, paid only on swap)
     ↓
Tier C (500–1000):     Split vm-data: vm-data + vm-data-db                  +€64/mo
                         (postgres-app moves to its own VM, or DBaaS cluster +€213/mo)
                       Split vm-control: vm-control + vm-ops                +€64/mo
                         (ERPNext + MariaDB + Stalwart move to vm-ops)
     ↓
Tier D (1000–2000):    Split vm-edge: vm-edge + vm-identity + vm-secrets    +€96/mo
                       HA Keycloak active-passive on 2× vm-identity         +€32/mo
                       Octavia Load Balancer Double Instance                +€58/mo
                       vm-data m2.large → m2.xlarge or 2×                   +€128–256/mo
     ↓
Final topology ≈ 8 prod VMs + DBaaS
```
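The ladder above implies a cumulative monthly figure per tier. A minimal sketch of that arithmetic, assuming all deltas are on the same pricing basis as the €192 Tier A figure, and taking the cheaper option wherever the ladder gives a choice or a range (own VM instead of DBaaS, €128 instead of €256). `TIER_DELTAS` and `cumulative_monthly` are illustrative names.

```python
BASE_TIER_A = 192  # EUR/month net, Tier A compute (36M upfront)

# Monthly deltas in EUR, copied from the ladder; cheaper option where a choice exists.
TIER_DELTAS = {
    "B": [64, 0],            # vm-data bump, cold-standby vm-edge-spare
    "C": [64, 64],           # split vm-data (own VM, not DBaaS), split vm-control
    "D": [96, 32, 58, 128],  # split vm-edge, HA Keycloak, Octavia LB, vm-data low end
}
TIER_ORDER = ["A", "B", "C", "D"]


def cumulative_monthly(tier):
    """Tier A base plus every delta up to and including `tier`."""
    idx = TIER_ORDER.index(tier)  # raises ValueError for unknown tiers
    return BASE_TIER_A + sum(sum(TIER_DELTAS[t]) for t in TIER_ORDER[1:idx + 1])
```

Under these assumptions the four tiers come to 192, 256, 384, and 698 EUR per month; choosing the DBaaS cluster or the larger vm-data option raises the Tier C and Tier D figures accordingly.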

Each step is justified by a measurable signal (>80% RAM, >70% CPU, sustained queue depth, or a specific outage scenario). Never split preemptively.

---

## 14. Cost summary (see COST_PLAN.md for full breakdown)

| Mode | Compute €/mo | Storage €/mo | Network €/mo | Total net | + 19% VAT |
|---|---:|---:|---:|---:|---:|
| On-Demand | 434.50 | 112 | 2.92 | 549.42 | 653.81 |
| 12-month commit | 295.20 | 112 | 2.92 | 410.12 | 488.04 |
| 36-month no upfront | 216.00 | 112 | 2.92 | 330.92 | 393.79 |
| 36-month upfront | 192.00 | 112 | 2.92 | 306.92 | 365.23 |

Plus €6,912 net one-time payment if signing 36M-upfront for the compute portion.
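The table rows can be re-derived from the three cost components, and the €6,912 figure matches 36 × €192. A minimal sketch of that check, using `Decimal` to avoid float rounding; the rounding mode (half-up to the cent) is an assumption, and the names are illustrative.

```python
from decimal import Decimal, ROUND_HALF_UP

# Net monthly compute per commitment mode, copied from the table (EUR).
COMPUTE = {
    "On-Demand": Decimal("434.50"),
    "12-month commit": Decimal("295.20"),
    "36-month no upfront": Decimal("216.00"),
    "36-month upfront": Decimal("192.00"),
}
STORAGE = Decimal("112")
NETWORK = Decimal("2.92")
VAT_RATE = Decimal("0.19")
CENT = Decimal("0.01")


def monthly_totals(mode):
    """Return (total net, total incl. 19% VAT) for one commitment mode."""
    net = COMPUTE[mode] + STORAGE + NETWORK
    gross = (net * (1 + VAT_RATE)).quantize(CENT, rounding=ROUND_HALF_UP)
    return net, gross


# One-time compute payment for the 36-month upfront mode: 36 months x 192 EUR.
UPFRONT_36M = COMPUTE["36-month upfront"] * 36
```

Storage and network are identical in every mode, so the commitment choice only moves the compute column.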

---

*End of document. Review quarterly or after any significant infrastructure change. Topology last locked 2026-05-18.*