Adds the §1.2 scaffolding required by IMPLEMENTATION_PLAN.md M0.1: README, CONTRIBUTING, CODEOWNERS, CHANGELOG, PR + issue templates, CI workflow, release workflow, LICENSE, commitlint, cliff config, .editorconfig, .gitignore, .env.example. Refs: M0.1
@@ -0,0 +1,774 @@
# Infrastructure Specification

**Status:** Locked Topology
**Authors:** Sharang, Benjamin
**Date:** 2026-05-11 (topology lock: 2026-05-18)
**Companion docs:** PLATFORM_ARCHITECTURE.md, IMPLEMENTATION_PLAN.md, COST_PLAN.md
**Cloud provider:** SysEleven Cloud Services (DUS2, OpenStack)

---

## 1. VM Inventory

**Four billable VMs total.** Three in production (one per plane after collapsing Identity+Infra), one in stage. Dev runs entirely on developer laptops via docker-compose.

```
┌──────────────┬─────────────────┬────────────────────────┬───────────┬─────────────────┐
│ Name         │ Env             │ SysEleven flavor       │ Public IP │ Planes owned    │
├──────────────┼─────────────────┼────────────────────────┼───────────┼─────────────────┤
│ vm-edge      │ prod            │ m2.small (2v / 8 GB)   │ YES (1)   │ Identity + Infra│
│ vm-control   │ prod            │ m2.medium (4v / 16 GB) │ No        │ Control         │
│ vm-data      │ prod            │ m2.medium (4v / 16 GB) │ No        │ Data            │
│ stage        │ stage           │ m2.small (2v / 8 GB)   │ YES (1)   │ App plane only  │
│ (dev)        │ dev             │ local docker-compose   │ n/a       │ all (in-memory) │
└──────────────┴─────────────────┴────────────────────────┴───────────┴─────────────────┘
```

**Total compute:** 48 GiB-RAM, 12 vCPU. **Monthly compute net: €192 (36M upfront) / €295 (12M) / €435 (On-Demand).** See COST_PLAN.md for the full three-mode table.
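
The compute totals can be re-derived from the inventory table. This is a minimal sketch: the flavor sizes are copied from the table, while the `Vm` type and its field names are illustrative only.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Vm:
    name: str
    vcpu: int
    ram_gb: int


# Billable VMs from the inventory table (dev is local and not billed).
BILLABLE = [
    Vm("vm-edge", 2, 8),
    Vm("vm-control", 4, 16),
    Vm("vm-data", 4, 16),
    Vm("stage", 2, 8),
]

total_vcpu = sum(vm.vcpu for vm in BILLABLE)
total_ram_gb = sum(vm.ram_gb for vm in BILLABLE)
print(f"{total_vcpu} vCPU, {total_ram_gb} GB RAM")
```

Re-running this after any flavor change keeps the headline figure and the table in sync.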

### Why this topology and not the previous 7-VM layout

The earlier draft proposed one VM per service group (vm-gateway, vm-identity, vm-secrets, vm-ops, vm-control, vm-certifai, vm-compliance). That gave maximum failure isolation but cost 132 GiB-RAM across stage and prod. At 5 customers that isolation is unused: every VM ran at under 10% utilisation. The locked topology buys back failure isolation incrementally as load grows (see §13 Growth Trajectory).

Critical isolations preserved even at 4 VMs:

- **vm-edge isolates identity from app workloads.** Keycloak runs on a host with its own memory and page cache, so ERPNext background jobs cannot starve token issuance.
- **vm-data isolates data-plane databases from control-plane workloads.** All data-plane DBs share one host with the product services, but they are walled off from the portal, ERPNext, and Stalwart competing on vm-control.
- **stage runs the app plane only.** It calls prod Keycloak + prod Tenant Registry under `tenant.kind = stage` rather than mirroring those services.

---

## 2. Service-to-VM Mapping

```
|
||||
vm-edge (prod, m2.small 8 GB, public IP)
|
||||
├── orca-proxy (Orca-managed; wildcard TLS terminator)
|
||||
├── powerdns-auth (Orca-managed; authoritative DNS for yourplatform.com)
|
||||
├── keycloak-26 (Orca-managed; JVM, ~1.5 GB heap)
|
||||
├── postgres-keycloak (Orca-managed; dedicated PG instance for Keycloak only)
|
||||
├── infisical (Orca-managed)
|
||||
├── postgres-infisical (Orca-managed; dedicated PG instance for Infisical only)
|
||||
├── redis-infisical (Orca-managed; ephemeral)
|
||||
└── gitea (Orca-managed; SQLite backend to avoid a third PG)
|
||||
|
||||
vm-control (prod, m2.medium 16 GB)
|
||||
├── customer-portal (Orca-managed; Next.js)
|
||||
├── tenant-registry (Orca-managed; Go)
|
||||
├── orca-controller (Orca core process; NOT a managed container)
|
||||
├── erpnext (Orca-managed; Frappe bench)
|
||||
├── frappe-hd (same bench as ERPNext)
|
||||
├── mariadb (Orca-managed; for ERPNext)
|
||||
├── redis-erpnext (Orca-managed)
|
||||
└── stalwart-mail (Orca-managed; SMTP/IMAP/JMAP on mail.yourplatform.com)
|
||||
|
||||
vm-data (prod, m2.medium 16 GB)
|
||||
├── certifai-dashboard (Orca-managed)
|
||||
├── mongodb (Orca-managed)
|
||||
├── litellm (Orca-managed)
|
||||
├── backend-compliance (Orca-managed)
|
||||
├── ai-compliance-sdk (Orca-managed)
|
||||
├── admin-compliance (Orca-managed)
|
||||
├── postgres-app (Orca-managed; schemas: tenant_registry, compliance)
|
||||
├── qdrant (Orca-managed)
|
||||
└── minio (Orca-managed)
|
||||
|
||||
stage (stage, m2.small 8 GB, public IP)
|
||||
├── orca-proxy (light; only routes to stage app)
|
||||
├── customer-portal (NEW VERSION under test)
|
||||
├── tenant-registry (NEW VERSION under test, talks to ephemeral PG below)
|
||||
├── certifai-dashboard (NEW VERSION under test)
|
||||
├── backend-compliance (NEW VERSION under test)
|
||||
├── ai-compliance-sdk (NEW VERSION under test)
|
||||
├── admin-compliance (NEW VERSION under test)
|
||||
├── litellm (light; same image as prod)
|
||||
├── postgres-app-stage (ephemeral; lives entirely on stage VM)
|
||||
├── mongodb-stage (ephemeral)
|
||||
└── qdrant-stage (ephemeral, tiny corpus)
|
||||
|
||||
Calls OUT to prod:
|
||||
→ auth.yourplatform.com (Keycloak token issuance, under stage client_id)
|
||||
→ mail.yourplatform.com (Stalwart SMTP, recipient filter forces +stage@ only)
|
||||
→ Polar SANDBOX webhook URL (NEVER prod Polar)
|
||||
→ no calls to prod Postgres-app, MariaDB, MongoDB
|
||||
```

### Stage isolation rules (enforced at the platform, not in product code)

| Risk | Enforcement mechanism | Owner |
|---|---|---|
| Stage writes to prod database | Infisical scope: the stage app only gets `/stage/*` secrets. Prod DB credentials never reach stage. | Infra plane |
| Stage emails real customers | Stalwart accept-rule: drop if recipient does not match `*+stage@*`. | Control plane (Stalwart config) |
| Stage triggers real Polar charges | Stage env points `POLAR_API_URL` to sandbox. Prod Polar webhook secret is never on stage. | Control plane |
| Stage Keycloak JWT used in prod | `stage_client_id` is issued only by Keycloak; prod services reject JWTs with this `aud`. | Identity plane |
| Stage load DoSes prod Keycloak | Keycloak rate limit per `client_id`; stage limited to 60 req/s. | Identity plane |
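
Two of these rules reduce to simple pattern checks. The sketch below illustrates the intended matching semantics only; it is not Stalwart or Infisical configuration syntax, and the function names are hypothetical.

```python
from fnmatch import fnmatchcase

STAGE_RECIPIENT_PATTERN = "*+stage@*"  # pattern quoted in the table above
STAGE_SECRET_PREFIX = "/stage/"        # Infisical scope granted to the stage app


def recipient_allowed(address: str) -> bool:
    """Return True only for recipients that match the stage pattern."""
    return fnmatchcase(address.lower(), STAGE_RECIPIENT_PATTERN)


def secret_path_allowed(path: str) -> bool:
    """Return True only for secret paths under the stage scope."""
    return path.startswith(STAGE_SECRET_PREFIX)


print(recipient_allowed("qa+stage@example.com"))   # stage tester address
print(recipient_allowed("customer@example.com"))   # real customer address
```

Both checks default to deny: anything that does not match the stage pattern is dropped or refused, which is the behaviour the table describes.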
|
||||
|
||||
---
|
||||
|
||||
## 3. Network Topology
|
||||
|
||||
```
                          INTERNET
                              │
    (yourplatform.com: authoritative on vm-edge PowerDNS;
     stage.yourplatform.com: authoritative same zone)
                              │
                ┌─────────────┴─────────────┐
                │                           │
        ┌───────▼────────┐         ┌────────▼────────────┐
        │ vm-edge        │         │ stage               │
        │ (public IP)    │         │ (public IP)         │
        │                │         │                     │
        │ orca-proxy ────┤         │ orca-proxy          │
        │ powerdns       │         │ portal-new          │
        │ keycloak       │◄────────┤ tenant-registry-new │
        │ pg-keycloak    │  stage  │ certifai-new        │
        │ infisical      │  calls  │ compliance-new      │
        │ pg-infisical   │  prod   │ pg-stage            │
        │ redis-infis    │  KC +   │ mongo-stage         │
        │ gitea          │ Stalwart│ qdrant-stage        │
        └───────┬────────┘         └─────────────────────┘
                │  PRIVATE NETWORK 10.0.0.0/16
       ┌────────┴─────────┐
       │                  │
┌──────▼───────┐  ┌───────▼──────┐
│ vm-control   │  │ vm-data      │
│              │  │              │
│ portal       │  │ certifai     │
│ tenant-reg   │  │ mongodb      │
│ orca-ctrl    │  │ litellm      │
│ erpnext      │  │ backend-comp │
│ frappe-hd    │  │ ai-sdk       │
│ mariadb      │  │ admin-comp   │
│ redis-erp    │  │ pg-app       │
│ stalwart     │  │ qdrant       │
└──────────────┘  │ minio        │
                  └──────────────┘

Orca-Proxy routing (vm-edge, by Host header):
  auth.yourplatform.com  → 127.0.0.1:8443   (Keycloak, local on vm-edge)
  erp.yourplatform.com   → vm-control:8000  (ERPNext)              [allowlist: our IPs only]
  git.yourplatform.com   → vm-edge:3000     (Gitea, local)         [allowlist: our IPs only]
  mail.yourplatform.com  → vm-control:587   (Stalwart submission)  [allowlist: VM internal only]
  ns1.yourplatform.com   → 127.0.0.1:53     (PowerDNS, local)
  *.yourplatform.com     → vm-control:3000  (customer portal)

Orca-Proxy routing (stage, by Host header):
  *.stage.yourplatform.com → 127.0.0.1:3000 (stage portal; all subdomains route here)
```
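
The Host-header routing for the HTTP entries can be sketched as a lookup in which exact hostnames win over the wildcard. That precedence is an assumption, since the route list does not state it; `ROUTES` and `resolve` are illustrative names, not Orca-Proxy configuration.

```python
from fnmatch import fnmatchcase

# (host pattern, backend) pairs copied from the vm-edge route list.
ROUTES = [
    ("auth.yourplatform.com", "127.0.0.1:8443"),
    ("erp.yourplatform.com", "vm-control:8000"),
    ("git.yourplatform.com", "vm-edge:3000"),
    ("*.yourplatform.com", "vm-control:3000"),
]


def resolve(host: str):
    """Return the backend for a Host header, or None if nothing matches."""
    host = host.lower().split(":")[0]  # drop an optional port
    for pattern, backend in ROUTES:
        if "*" not in pattern and pattern == host:
            return backend  # exact hostnames take precedence
    for pattern, backend in ROUTES:
        if "*" in pattern and fnmatchcase(host, pattern):
            return backend  # wildcard fallback (customer portal)
    return None


print(resolve("auth.yourplatform.com"))
print(resolve("acme.yourplatform.com"))
```

Note that `*` in `fnmatchcase` also matches dots, so a nested name such as `a.b.yourplatform.com` would reach the portal in this sketch.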
|
||||
|
||||
---
|
||||
|
||||
## 4. Storage and Volume Requirements
|
||||
|
||||
Block volumes (Ceph 3x replicated, €0.10/GiB/mo) mounted to each VM.
|
||||
|
||||
```
|
||||
┌──────────────┬───────────────────────────────────────────┬─────────┬─────────────────────┐
|
||||
│ VM │ Data stores │ +Block │ Growth profile │
|
||||
├──────────────┼───────────────────────────────────────────┼─────────┼─────────────────────┤
|
||||
│ vm-edge │ pg-keycloak + pg-infisical + Gitea repos │ +50 GB │ Slow │
|
||||
│ vm-control │ MariaDB (ERPNext) + Stalwart mail spool │ +250 GB │ Medium │
|
||||
│ vm-data │ MongoDB + pg-app + Qdrant + MinIO │ +500 GB │ Fast (scales w/ N) │
|
||||
│ stage │ pg-stage + mongo-stage + qdrant-stage │ +50 GB │ Resets per release │
|
||||
└──────────────┴───────────────────────────────────────────┴─────────┴─────────────────────┘
|
||||
|
||||
Each VM's root disk: 50 GB ephemeral, included in flavor price.
|
||||
|
||||
Object storage (S3, €0.02/GiB/mo single-region or €0.0496/GiB/mo geo-redundant):
|
||||
┌─────────────────────────────────┬─────────┬──────────────────────────┐
|
||||
│ Bucket │ Size │ Purpose │
|
||||
├─────────────────────────────────┼─────────┼──────────────────────────┤
|
||||
│ s3://backups (geo-redundant) │ ~500 GB │ Database dumps │
|
||||
│ s3://seed-data │ ~30 GB │ Demo tenant fixtures │
|
||||
│ s3://exports │ ~50 GB │ GDPR/offboarding ZIPs │
|
||||
│ s3://audit-archive │ ~20 GB │ Old audit log overflow │
|
||||
└─────────────────────────────────┴─────────┴──────────────────────────┘
|
||||
```
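
The storage figures above imply a monthly cost that is easy to recompute. This is a rough sketch under two assumptions: the table's GB values are billed as GiB, and only the bucket marked geo-redundant pays the geo price. Bucket sizes are the approximate values from the table.

```python
BLOCK_PRICE = 0.10     # EUR per GiB per month (Ceph block volumes)
S3_PRICE = 0.02        # EUR per GiB per month, single-region
S3_GEO_PRICE = 0.0496  # EUR per GiB per month, geo-redundant

block_gb = {"vm-edge": 50, "vm-control": 250, "vm-data": 500, "stage": 50}
buckets = {  # name: (approximate size in GB, geo-redundant?)
    "backups": (500, True),
    "seed-data": (30, False),
    "exports": (50, False),
    "audit-archive": (20, False),
}

block_cost = sum(block_gb.values()) * BLOCK_PRICE
object_cost = sum(
    size * (S3_GEO_PRICE if geo else S3_PRICE) for size, geo in buckets.values()
)
print(f"block: EUR {block_cost:.2f}/mo, object: EUR {object_cost:.2f}/mo")
```

With these inputs the block volumes come to about €85 per month and object storage to about €26.80 per month; COST_PLAN.md remains the authoritative source.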
|
||||
|
||||
---
|
||||
|
||||
## 5. Backup Requirements
|
||||
|
||||
All backups ship to **SysEleven Object Storage** (S3-compatible, geo-redundant DUS2 ↔ HAM1 for production-critical data). Backup jobs run as Orca one-shot containers on cron. Infisical holds the S3 credentials.
|
||||
|
||||
```
|
||||
┌───────────────────────┬──────────────────┬────────────┬────────────┬──────────────────────┐
|
||||
│ Data store │ Method │ Frequency │ Retention │ Owner (who restores) │
|
||||
├───────────────────────┼──────────────────┼────────────┼────────────┼──────────────────────┤
|
||||
│ pg-keycloak (vm-edge) │ pg_dump → S3-geo │ Every 6h │ 14 days │ Infra Plane │
|
||||
│ pg-infisical (vm-edge)│ pg_dump → S3-geo │ Daily │ 30 days │ Infra Plane │
|
||||
│ Gitea (vm-edge) │ gitea dump → S3 │ Daily │ 30 days │ Infra Plane │
|
||||
│ Keycloak realm export │ KC export → S3 │ Daily │ 14 days │ Identity Plane (owns)│
|
||||
│ Infisical store │ encrypted → S3 │ Daily │ 30 days │ Infra Plane │
|
||||
│ MariaDB (vm-control) │ mysqldump → S3 │ Every 6h │ 30 days │ Control Plane │
|
||||
│ Stalwart queue/store │ tar → S3 │ Daily │ 7 days │ Control Plane │
|
||||
│ pg-app (vm-data) │ pg_dump → S3-geo │ Every 6h │ 30 days │ Data Plane (owns RPO)│
|
||||
│ MongoDB (vm-data) │ mongodump → S3 │ Daily │ 30 days │ Data Plane │
|
||||
│ MinIO (vm-data) │ mc mirror → S3 │ Daily │ 90 days │ Data Plane │
|
||||
│ Qdrant (vm-data) │ API snap → S3 │ Daily │ 14 days │ Data Plane (rebuild) │
|
||||
│ stage * │ no backup │ — │ — │ — (ephemeral) │
|
||||
│ Orca config (IaC) │ Gitea (VCS) │ On commit │ Forever │ Infra Plane │
|
||||
└───────────────────────┴──────────────────┴────────────┴────────────┴──────────────────────┘
|
||||
```
|
||||
|
||||
### RPO by data criticality
|
||||
|
||||
```
|
||||
CRITICAL (RPO ≤ 6h)
|
||||
pg-keycloak — org memberships, IdP config
|
||||
pg-app — tenant registry, compliance records
|
||||
MariaDB/ERPNext — sales orders, invoices, contracts
|
||||
|
||||
IMPORTANT (RPO ≤ 24h)
|
||||
MongoDB — chat history, user preferences
|
||||
MinIO — compliance evidence documents
|
||||
pg-infisical — encrypted secrets
|
||||
Stalwart store — inbound webhooks, bounce records
|
||||
|
||||
RECOVERABLE (RPO ≤ 48h, rebuildable)
|
||||
Qdrant — vector index (rebuildable from MinIO source documents)
|
||||
Gitea — code (mirrored on dev machines)
|
||||
Keycloak export — org structure (pg-keycloak is primary)
|
||||
|
||||
NOT BACKED UP
|
||||
stage (any data) — by design; restored from seed bundles on each deploy
|
||||
redis-* — caches; restart cold
|
||||
```
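
A backup schedule can only meet an RPO tier if the backup interval does not exceed it. The check below pairs each store's frequency from the backup table with its RPO tier; the tuple layout and function name are illustrative.

```python
# (data store, backup interval in hours, RPO ceiling in hours)
SCHEDULE = [
    ("pg-keycloak", 6, 6),
    ("pg-app", 6, 6),
    ("MariaDB", 6, 6),
    ("MongoDB", 24, 24),
    ("MinIO", 24, 24),
    ("pg-infisical", 24, 24),
    ("Stalwart store", 24, 24),
    ("Qdrant", 24, 48),
    ("Gitea", 24, 48),
    ("Keycloak export", 24, 48),
]


def rpo_violations(schedule):
    """Return stores whose backup interval exceeds their RPO ceiling."""
    return [name for name, interval_h, rpo_h in schedule if interval_h > rpo_h]


violations = rpo_violations(SCHEDULE)
print(violations or "all backup intervals fit their RPO tier")
```

Stores where the interval equals the RPO have no slack: worst-case loss is the interval plus the time the dump takes to complete and upload.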
|
||||
|
||||
---
|
||||
|
||||
## 6. Constraint Framework
|
||||
|
||||
### Constraint types
|
||||
|
||||
```
AVAILABILITY    - required uptime percentage over rolling 30 days
RTO             - Recovery Time Objective: max time to restore service after failure
RPO             - Recovery Point Objective: max acceptable data loss window
IaC             - service must be declared in Orca config, no manual container runs in prod
SECRET_HYGIENE  - all secrets via Infisical machine identity; no env files, no hardcoded values
NETWORK         - whether service is internet-exposed or internal-only
DATA_RESIDENCY  - all data must remain in EU (SysEleven DUS2 + HAM1)
AUDIT_TRAIL     - all mutating actions logged (who, what, when, from where)
IMMUTABILITY    - config changes go through Gitea → Orca pipeline, not manual SSH
STAGE_ISOLATION - stage tenant cannot mutate any prod data; read-only against prod KC + TR
```
|
||||
|
||||
### Plane ownership of constraints
|
||||
|
||||
Even though planes now share VMs, the **ownership model is unchanged** — the plane that owns a constraint owns it regardless of which VM hosts the service. The Infra Plane (now collapsed onto vm-edge alongside the Identity plane) still mechanically enforces backup, IaC, secrets, and network constraints.
|
||||
|
||||
```
|
||||
╔══════════════════════════════════════════════════════════════════════════════════════════╗
|
||||
║ IDENTITY PLANE (on vm-edge) ║
|
||||
║ ║
|
||||
║ Owns / defines: ║
|
||||
║ AVAILABILITY — must be ≥ 99.5% (root dep for everything) ║
|
||||
║ RTO — ≤ 15 min ║
|
||||
║ AUDIT_TRAIL — realm-level audit (logins, token issuance, IdP events) ║
|
||||
║ DATA_RESIDENCY— Keycloak realm data must stay EU ║
|
||||
║ STAGE_ISOLATION— rate-limits stage_client_id; rejects stage JWTs in prod audiences ║
|
||||
║ ║
|
||||
║ Co-tenant note: shares vm-edge with Infra Plane services. JVM heap pinned to 1.5 GB ║
|
||||
║ in Orca manifest so it cannot starve PowerDNS / Infisical. ║
|
||||
╚══════════════════════════════════════════════════════════════════════════════════════════╝
|
||||
|
||||
╔══════════════════════════════════════════════════════════════════════════════════════════╗
|
||||
║ CONTROL PLANE (on vm-control) ║
|
||||
║ ║
|
||||
║ Owns / defines: ║
|
||||
║ RPO (tenant) — tenant registry & compliance schemas RPO ≤ 6h ║
|
||||
║ RPO (ERPNext) — sales orders, invoices RPO ≤ 6h ║
|
||||
║ AUDIT_TRAIL — all portal actions (invites, IdP changes, impersonations) ║
|
||||
║ AVAILABILITY — portal ≥ 99.5%; ERPNext ≥ 99% (internal) ║
|
||||
║ RTO (portal) — ≤ 10 min ║
|
||||
║ RTO (ERPNext) — ≤ 60 min ║
|
||||
║ ║
|
||||
║ Co-tenant note: ERPNext + Portal + Stalwart on one VM. Orca resource limits enforced: ║
|
||||
║ portal: 1 GB memory cap ║
|
||||
║ erpnext: 6 GB memory cap ║
|
||||
║ mariadb: 3 GB memory cap ║
|
||||
║ stalwart: 1 GB memory cap ║
|
||||
║ tenant-registry: 500 MB ║
|
||||
╚══════════════════════════════════════════════════════════════════════════════════════════╝
|
||||
|
||||
╔══════════════════════════════════════════════════════════════════════════════════════════╗
|
||||
║ DATA PLANE (on vm-data) ║
|
||||
║ ║
|
||||
║ Owns / defines: ║
|
||||
║ DATA_RESIDENCY — all customer data (MongoDB, pg-app, MinIO) must stay EU ║
|
||||
║ RPO (product) — compliance records ≤ 6h; chat history ≤ 24h ║
|
||||
║ DATA_ISOLATION — every query scoped by org_id/tenant_id ║
|
||||
║ AUDIT_TRAIL — product-level actions ║
|
||||
║ AVAILABILITY — CERTifAI ≥ 99.5%; compliance ≥ 99.5% ║
|
||||
║ ║
|
||||
║ Co-tenant note: this VM is the SCALE driver. When vm-data hits 80% RAM, bump flavor ║
|
||||
║ (m2.medium → m2.large → m2.xlarge). See §13 Growth Trajectory. ║
|
||||
╚══════════════════════════════════════════════════════════════════════════════════════════╝
|
||||
|
||||
╔══════════════════════════════════════════════════════════════════════════════════════════╗
|
||||
║ INFRA PLANE (on vm-edge, alongside Identity) ║
|
||||
║ ║
|
||||
║ Owns / enforces ALL of: ║
|
||||
║ BACKUP — executes all backup jobs (pg_dump, mongodump, mc mirror) ║
|
||||
║ IaC — ALL services declared in Orca config; no manual prod changes ║
|
||||
║ IMMUTABILITY — config changes: Gitea commit → Gitea Actions → Orca API only ║
|
||||
║ SECRET_HYGIENE— Infisical (on vm-edge); provisions machine identities ║
|
||||
║ NETWORK — Orca-Proxy rules; VM firewall; no direct VM public exposure ║
|
||||
║ DATA_RESIDENCY— VM region = SysEleven DUS2; backups geo-redundant DUS2↔HAM1 ║
|
||||
║ AVAILABILITY — Orca restart policies, health checks ║
|
||||
║ COLD_START — enforces startup ordering (see §10 Scenario F) ║
|
||||
║ STAGE_ISOLATION— Infisical secret-path scoping for stage_app identity ║
|
||||
╚══════════════════════════════════════════════════════════════════════════════════════════╝
|
||||
```
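
The vm-control memory caps can be checked against the VM's flavor. A small sketch using the caps from the Control Plane box; services without a listed cap (redis-erpnext, the Orca controller, the OS) have to fit in the remaining headroom.

```python
VM_CONTROL_RAM_GB = 16  # m2.medium, from the VM inventory

CAPS_GB = {
    "portal": 1.0,
    "erpnext": 6.0,
    "mariadb": 3.0,
    "stalwart": 1.0,
    "tenant-registry": 0.5,  # 500 MB, written as 0.5 GB
}

capped = sum(CAPS_GB.values())
headroom = VM_CONTROL_RAM_GB - capped
print(f"capped: {capped} GB, uncapped headroom: {headroom} GB")
```

A negative headroom would mean the caps are oversubscribed and no longer prevent one co-tenant from starving another.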
|
||||
|
||||
---
|
||||
|
||||
## 7. SLA Table
|
||||
|
||||
```
|
||||
┌───────────────────────┬──────────────┬─────────┬─────────┬────────────────────────────────┐
|
||||
│ Service │ Availability │ RTO │ RPO │ Host VM │
|
||||
├───────────────────────┼──────────────┼─────────┼─────────┼────────────────────────────────┤
|
||||
│ Orca-Proxy │ 99.9% │ 5 min │ N/A │ vm-edge │
|
||||
│ PowerDNS │ 99.9% │ 5 min │ N/A │ vm-edge │
|
||||
│ Keycloak │ 99.5% │ 15 min │ 6h │ vm-edge (root auth dep) │
|
||||
│ Infisical │ 99.5% │ 30 min │ 24h │ vm-edge (running svcs survive) │
|
||||
│ Gitea │ 99% │ 2h │ 24h │ vm-edge (dev machines mirror) │
|
||||
│ Customer Portal │ 99.5% │ 10 min │ N/A │ vm-control │
|
||||
│ Tenant Registry │ 99.5% │ 10 min │ 6h │ vm-control │
|
||||
│ ERPNext │ 99% │ 60 min │ 6h │ vm-control (internal only) │
|
||||
│ Frappe HD │ 99% │ 60 min │ 24h │ vm-control │
|
||||
│ MariaDB │ 99.5% │ 20 min │ 6h │ vm-control │
|
||||
│ Stalwart Mail │ 99% │ 60 min │ 24h │ vm-control │
|
||||
│ CERTifAI │ 99.5% │ 10 min │ 24h │ vm-data │
|
||||
│ MongoDB │ 99.5% │ 20 min │ 24h │ vm-data │
|
||||
│ LiteLLM │ 99% │ 5 min │ N/A │ vm-data │
|
||||
│ backend-compliance │ 99.5% │ 10 min │ 6h │ vm-data │
|
||||
│ ai-compliance-sdk │ 99.5% │ 10 min │ 6h │ vm-data │
|
||||
│ pg-app │ 99.9% │ 20 min │ 6h │ vm-data (SPOF — RISK-1) │
|
||||
│ MinIO │ 99.5% │ 30 min │ 24h │ vm-data │
|
||||
│ Qdrant │ 99% │ 2h │ 24h │ vm-data (rebuildable) │
|
||||
│ stage (any service) │ 95% │ best ef.│ N/A │ stage (ephemeral; no SLA) │
|
||||
└───────────────────────┴──────────────┴─────────┴─────────┴────────────────────────────────┘
|
||||
```
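
Availability targets translate into a downtime budget over the rolling 30-day window defined in §6. A minimal sketch of the arithmetic; the function name is illustrative.

```python
def downtime_budget_minutes(availability_pct: float, window_days: int = 30) -> float:
    """Minutes of allowed downtime for an availability target over the window."""
    window_minutes = window_days * 24 * 60
    return window_minutes * (100.0 - availability_pct) / 100.0


for pct in (99.9, 99.5, 99.0, 95.0):
    print(f"{pct}% -> {downtime_budget_minutes(pct):.1f} min per 30 days")
```

For example, 99.9% allows roughly 43 minutes per 30 days, so a single vm-edge incident that uses the full 15-minute RTO consumes about a third of the Orca-Proxy and PowerDNS budget.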
|
||||
|
||||
---
|
||||
|
||||
## 8. IaC Constraint (Orca)
|
||||
|
||||
Every production service is declared in Orca config. No exceptions.
|
||||
|
||||
### Rules
|
||||
|
||||
```
|
||||
1. ALL containers run via Orca manifests committed to Gitea
|
||||
→ /orca/manifests/{vm-name}/{service-name}.toml
|
||||
→ Changes go through: Gitea PR → Gitea Actions lint → Orca API apply
|
||||
|
||||
2. NO manual docker run / docker-compose up on any production VM
|
||||
→ SSH to prod VMs allowed for debugging only; no state changes
|
||||
|
||||
3. Secrets are NEVER in Orca manifests
|
||||
→ Manifests reference Infisical paths, not values
|
||||
→ Bootstrap exception: Keycloak DB URI in Orca env (Keycloak runs ON vm-edge alongside
|
||||
Infisical, so chicken-and-egg is solved by Orca env file, not Infisical lookup)
|
||||
|
||||
4. Restart policy: always (Orca restarts crashed containers with exponential backoff)
|
||||
→ Health check per service (HTTP /health or TCP probe)
|
||||
|
||||
5. Resource limits MANDATORY in every manifest
|
||||
→ On a 3-VM prod, co-tenant noise is the single biggest risk; limits are non-negotiable
|
||||
→ See §6 Plane ownership "Co-tenant note" boxes for the per-service caps
|
||||
|
||||
6. Orca controller state itself is recoverable
|
||||
→ Manifest files in Gitea = desired state
|
||||
→ Loss of Orca controller = re-apply manifests from Gitea, services continue running
|
||||
|
||||
7. Stage app gets its own Infisical scope
|
||||
→ /stage/* path; no prod-DB credentials reach this scope
|
||||
→ Enforced at Infisical machine-identity level, not in app code
|
||||
```
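
Rules 3 to 5 lend themselves to an automated lint step. The sketch below checks an already-parsed manifest; the Orca manifest schema is not shown in this document, so every key name (`memory_limit`, `health_check`, `restart_policy`, `env`) and the `infisical:` reference prefix are hypothetical.

```python
REQUIRED_KEYS = ("memory_limit", "health_check", "restart_policy")  # hypothetical keys
SECRET_HINTS = ("password", "secret", "token")


def lint_manifest(manifest: dict) -> list[str]:
    """Return rule violations for one parsed service manifest."""
    errors = [f"missing {key}" for key in REQUIRED_KEYS if key not in manifest]
    if manifest.get("restart_policy", "always") != "always":
        errors.append("restart_policy must be 'always'")
    for name, value in manifest.get("env", {}).items():
        looks_secret = any(hint in name.lower() for hint in SECRET_HINTS)
        if looks_secret and not str(value).startswith("infisical:"):
            errors.append(f"env {name} must reference an Infisical path")
    return errors


example = {
    "memory_limit": "1GB",
    "health_check": "/health",
    "restart_policy": "always",
    "env": {"DB_PASSWORD": "infisical:/prod/portal/db"},
}
print(lint_manifest(example))
```

A real lint job would parse each TOML manifest first and would need an explicit allowance for the Keycloak bootstrap exception in rule 3.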
|
||||
|
||||
### Gitea Actions pipeline for infra changes
|
||||
|
||||
```
|
||||
infra change committed to Gitea
|
||||
│
|
||||
├── lint: validate Orca manifest schema
|
||||
├── diff: show what changes will be applied (orca plan)
|
||||
├── (manual approval gate for vm-edge changes — touches auth root)
|
||||
└── apply: POST to Orca Controller API → rolling update
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## 9. Dependency Graph
|
||||
|
||||
Arrows = "requires to function." Dashed = soft (degrades, doesn't fail).
|
||||
**Intra-VM dependencies elided** for clarity (e.g. Keycloak ↔ pg-keycloak are on the same host and start together).
|
||||
|
||||
```
|
||||
EXTERNAL
|
||||
AI APIs
|
||||
(OpenAI / Anthropic)
|
||||
│
|
||||
│ (soft)
|
||||
▼
|
||||
┌──────────────────────────────────────────────────────────────────┐
|
||||
│ vm-edge (Identity + Infra) │
|
||||
│ ┌────────────────────────────────────────────────────────────┐ │
|
||||
│ │ pg-keycloak ──► keycloak │ │
|
||||
│ │ pg-infisical ─► infisical ◄── (all VMs pull on startup) │ │
|
||||
│ │ redis-infis ──► infisical │ │
|
||||
│ │ (sqlite) ─────► gitea │ │
|
||||
│ │ powerdns-auth (no deps) │ │
|
||||
│ │ orca-proxy (route table only; backends are remote) │ │
|
||||
│ └────────────────────────────────────────────────────────────┘ │
|
||||
│ │ Keycloak JWKS │ Infisical /secrets │
|
||||
│ │ │ │
|
||||
└────────────────────────────┼────────────────┼────────────────────┘
|
||||
▼ ▼
|
||||
┌──────────────────────────────────────────────────────────────────┐
|
||||
│ vm-control (Control) │
|
||||
│ ┌────────────────────────────────────────────────────────────┐ │
|
||||
│ │ mariadb + redis-erp ──► erpnext + frappe-hd │ │
|
||||
│ │ (intra) ─────────────► stalwart │ │
|
||||
│ │ ──────────────────────► customer-portal │ │
|
||||
│ │ ──────────────────────► tenant-registry ──► pg-app (vm-data)│ │
|
||||
│ └────────────────────────────────────────────────────────────┘ │
|
||||
│ │ tenant-registry API │
|
||||
└────────────────────────────┼─────────────────────────────────────┘
|
||||
▼
|
||||
┌──────────────────────────────────────────────────────────────────┐
|
||||
│ vm-data (Data) │
|
||||
│ ┌────────────────────────────────────────────────────────────┐ │
|
||||
│ │ mongodb ───► certifai ◄── (vm-edge JWKS, vm-edge secrets) │ │
|
||||
│ │ litellm ───► certifai, ai-compliance-sdk │ │
|
||||
│ │ pg-app ────► tenant-registry-on-vm-control, backend-compl,│ │
|
||||
│ │ ai-compliance-sdk │ │
|
||||
│ │ qdrant ────► ai-compliance-sdk │ │
|
||||
│ │ minio ────► backend-compliance │ │
|
||||
│ │ backend-compliance ──► admin-compliance │ │
|
||||
│ └────────────────────────────────────────────────────────────┘ │
|
||||
└──────────────────────────────────────────────────────────────────┘
|
||||
|
||||
┌──────────────────────────────────────────────────────────────────┐
|
||||
│ stage (App plane only) │
|
||||
│ Calls vm-edge:8443 (KC) + vm-control:587 (Stalwart submission) │
|
||||
│ Calls Polar SANDBOX (never prod Polar webhook URL) │
|
||||
│ Its own ephemeral DBs; cannot read prod data │
|
||||
└──────────────────────────────────────────────────────────────────┘
|
||||
```
|
||||
|
||||
### Simplified critical path (customer login → product use)
|
||||
|
||||
```
|
||||
DNS (vm-edge PowerDNS)
|
||||
│
|
||||
▼
|
||||
orca-proxy (vm-edge)
|
||||
│
|
||||
├──► keycloak (vm-edge) ──► pg-keycloak (intra-VM)
|
||||
│
|
||||
└──► customer-portal (vm-control)
|
||||
├──► tenant-registry (vm-control) ──► pg-app (vm-data)
|
||||
├──► certifai (vm-data) ──► mongodb (intra-VM)
|
||||
└──► backend-compliance (vm-data) ──► pg-app (intra-VM)
|
||||
──► ai-sdk ──► qdrant + minio
|
||||
──► litellm ──► [external AI APIs]
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## 10. Failure Scenarios and Deadlock Analysis
|
||||
|
||||
### Scenario A — vm-edge fails (HIGHEST SEVERITY)
|
||||
|
||||
```
|
||||
Impact: TOTAL outage. Nothing reachable from internet.
|
||||
No DNS. No TLS. No auth. No new logins. Running JWTs expire within 15 min,
|
||||
then ALL services start returning 401.
|
||||
Backstage and customer portal both fully blocked.
|
||||
Stage also blocked (depends on prod Keycloak).
|
||||
Cascade: T+0: DNS fails → orca-proxy unreachable
|
||||
T+5m: existing JWTs still valid; portal cached → partial reads work
|
||||
T+15m: JWTs expire → full outage
|
||||
Deadlock: None — services downstream don't deadlock, they just fail closed
|
||||
Recovery: 1. Spin up vm-edge-spare (cold standby, same Orca config) — ~3 min provision
|
||||
2. Restore pg-keycloak + pg-infisical from latest backup — ~5 min
|
||||
3. Swap registrar NS records to spare IP (TTL 60s) — ~2 min propagation
|
||||
4. Restart all services on vm-edge-spare via Orca apply — ~3 min
|
||||
Total RTO target: 15 min
|
||||
Mitigation: COLD STANDBY vm-edge-spare. Same Orca config committed in Gitea.
|
||||
Provision cost when idle: €0 (only billed when running).
|
||||
Test recovery quarterly.
|
||||
Severity: CRITICAL — single host owns 3 root dependencies (DNS, auth, secrets)
|
||||
Cost of fix at Tier C: split vm-edge into vm-edge + vm-identity + vm-secrets
|
||||
(back toward original 7-VM design) — €100/mo extra
|
||||
```
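
The recovery steps in Scenario A can be summed against the RTO target. The sketch assumes the steps run strictly one after another, which the scenario does not state explicitly; the step labels are shortened from the recovery list.

```python
RTO_TARGET_MIN = 15

RECOVERY_STEPS_MIN = {
    "provision vm-edge-spare": 3,
    "restore pg-keycloak + pg-infisical": 5,
    "NS record propagation": 2,
    "Orca apply on spare": 3,
}

total = sum(RECOVERY_STEPS_MIN.values())
slack = RTO_TARGET_MIN - total
print(f"sequential recovery: {total} min, slack vs RTO: {slack} min")
```

The slack is small, which is one reason the quarterly recovery test matters: any step that overruns pushes recovery past the point where existing JWTs expire.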
|
||||
|
||||
### Scenario B — vm-control fails (NEW — consequence of plane consolidation)
|
||||
|
||||
```
|
||||
Impact: customer-portal: DOWN → /[slug]/* all return 503
|
||||
tenant-registry: DOWN → Keycloak protocol-mapper for products claim breaks
|
||||
→ users can log in but see "No active products"
|
||||
ERPNext + Frappe HD: DOWN → we cannot create sales orders or read tickets
|
||||
Stalwart: DOWN → no outbound emails (trial nudges, exports, ticket replies)
|
||||
MariaDB: DOWN → ERPNext queries fail; backups paused
|
||||
Products (CERTifAI, compliance): UNAFFECTED (on vm-data, JWTs still validate)
|
||||
Existing logged-in users: can use products directly via product subdomain
|
||||
IF they bookmark it; portal home is 503.
|
||||
Cascade: T+0: portal 503; new tenant onboarding blocked (registry down)
|
||||
T+15m: existing JWTs missing refreshed products claim
|
||||
T+1h: trial emails not sent → trial nudge cadence breaks
|
||||
Deadlock: None
|
||||
Recovery: Restart vm-control containers via Orca. If MariaDB corrupt: restore mysqldump.
|
||||
RTO target: 10 min (portal) / 60 min (ERPNext)
|
||||
Mitigation: Multiple services co-hosted = single failure hits many SLAs.
|
||||
Resource limits in Orca prevent ERPNext OOM from killing portal.
|
||||
Quarterly drill: deliberately stop portal, measure recovery.
|
||||
Severity: HIGH. Every vm-control service goes down at once, but products keep serving customers
|
||||
Cost of fix at Tier B/C: split vm-control → vm-portal + vm-ops (ERPNext)
|
||||
— €64/mo extra at m2.small
|
||||
```
|
||||
|
||||
### Scenario C — vm-data fails
|
||||
|
||||
```
|
||||
Impact: tenant-registry queries: FAIL (pg-app down) → portal returns 503 for tenant lookup
|
||||
customer-portal: DEGRADED (login works, dashboard fails)
|
||||
CERTifAI: COMPLETELY DOWN
|
||||
backend-compliance + ai-sdk + admin: COMPLETELY DOWN
|
||||
ERPNext + Stalwart: UNAFFECTED
|
||||
Cascade: T+0: products down; portal degraded
|
||||
T+15m: support tickets pile up
|
||||
Note: prod is partial — users see error pages but ERPNext + auth still work
|
||||
Recovery: Restart vm-data containers. If pg-app corrupt: restore from pg_dump (RPO 6h).
|
||||
RTO target: 20 min
|
||||
Mitigation: This is the SCALE-event VM. RISK-1 below makes this the worst SPOF:
|
||||
one pg-app instance owns tenant_registry + compliance schemas.
|
||||
HIGH PRIORITY fix: split pg-app into separate clusters at Tier B/C transition.
|
||||
Severity: HIGH — products down, business operations (ERPNext) still work so we can
|
||||
contact customers
|
||||
```
|
||||
|
||||
### Scenario D — LiteLLM fails
|
||||
|
||||
```
|
||||
Impact: CERTifAI: AI features fail (summarization, chat completion).
|
||||
CERTifAI dashboard, sessions: UNAFFECTED.
|
||||
compliance AI generation: FAILS (DSFA/TOM/VVT generation blocked).
|
||||
Compliance CRUD: UNAFFECTED.
|
||||
Cascade: Soft degradation only. Products show "AI features temporarily unavailable" banner.
|
||||
Deadlock: None.
|
||||
Recovery: Restart LiteLLM on vm-data (stateless, ~30s).
|
||||
Severity: MEDIUM — graceful degradation by design
|
||||
```
|
||||
|
||||
### Scenario E — Stage VM compromised or buggy
|
||||
|
||||
```
|
||||
Impact: On stage itself: stage portal serves bad data; stage testers see errors.
|
||||
On prod: NONE if isolation rules in §2 are intact.
|
||||
Worst case if isolation breaks:
|
||||
- Stage code tries to call prod pg-app → fails (no creds in /stage/* Infisical)
|
||||
- Stage emits real email → blocked by Stalwart recipient filter
|
||||
- Stage triggers Polar charge → goes to sandbox, no real money
|
||||
Cascade: None to prod by design.
|
||||
Recovery: Roll back stage to previous image via Orca. RTO target: 5 min.
|
||||
Mitigation: The 5 enforcement rules in §2 are the load-bearing controls. Verify quarterly
|
||||
via deliberate red-team: try to write to prod pg-app from stage and confirm the attempt is rejected.
|
||||
Severity: LOW (in prod) / HIGH (on stage, but stage SLA is 95%)
|
||||
```
|
||||
|
||||
### Scenario F — Full Cold Start (Power Loss, All VMs Restart Simultaneously)
|
||||
|
||||
```
|
||||
The three prod VMs boot at once. Services must start in dependency order, or they
crash-loop until their deps are ready.
|
||||
|
||||
DEADLOCK RISK: vm-control services (portal, tenant-registry) start before vm-data
|
||||
services (pg-app, certifai, compliance). They'll crash-loop ~2-5min
|
||||
with backoff retries.
|
||||
Same for ERPNext on vm-control trying to reach Keycloak on vm-edge.
|
||||
|
||||
RESOLUTION: Orca enforces cross-VM startup ordering via health-check dependencies.
|
||||
Bootstrap exception: Keycloak DB URI in Orca env on vm-edge (not from
|
||||
Infisical — chicken-and-egg solved).
|
||||
|
||||
Required cold start sequence:
|
||||
|
||||
Phase 0 — Data roots on vm-data (parallel):
|
||||
pg-app, mongodb, qdrant, minio
|
||||
Phase 0 — Data roots on vm-control (parallel):
|
||||
mariadb, redis-erpnext
|
||||
Phase 0 — Data roots on vm-edge (parallel):
|
||||
pg-keycloak, pg-infisical, redis-infisical
|
||||
|
||||
Phase 1 — Secrets + DNS on vm-edge:
|
||||
infisical (needs: pg-infisical, redis-infisical)
|
||||
powerdns-auth (no deps)
|
||||
|
||||
Phase 2 — Identity on vm-edge:
|
||||
keycloak (needs: pg-keycloak [Phase 0], infisical [Phase 1])
|
||||
gitea (needs: sqlite; ready from Phase 0)
|
||||
|
||||
Phase 3 — Control on vm-control + Data services on vm-data (parallel):
|
||||
tenant-registry (needs: keycloak [Phase 2], pg-app [Phase 0, remote])
|
||||
erpnext + frappe-hd (needs: mariadb, redis-erpnext [Phase 0], keycloak [Phase 2])
|
||||
stalwart (needs: infisical [Phase 1])
|
||||
litellm (needs: infisical)
|
||||
certifai (needs: keycloak, mongodb, litellm)
|
||||
backend-compliance (needs: keycloak, pg-app)
|
||||
ai-compliance-sdk (needs: pg-app, qdrant, litellm)
|
||||
admin-compliance (needs: backend + sdk)
|
||||
|
||||
Phase 4 — Customer-facing on vm-control:
|
||||
customer-portal (needs: keycloak, tenant-registry)
|
||||
|
||||
Phase 5 — Gateway on vm-edge (last):
|
||||
orca-proxy (waits for all backends healthy before opening listener)
|
||||
|
||||
Estimated cold-start time: 6-10 minutes (faster than the 7-VM layout, since there are fewer network round trips)
|
||||
```
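
The cold-start ordering is a topological sort of the "needs" relations. The sketch below uses Python's standard `graphlib` to group services into waves; the dependency map is transcribed from the sequence above, and orca-proxy is left out because its backend list is not enumerated there.

```python
from graphlib import TopologicalSorter

# service -> services it needs, transcribed from the "needs:" notes above.
NEEDS = {
    "infisical": {"pg-infisical", "redis-infisical"},
    "keycloak": {"pg-keycloak", "infisical"},
    "tenant-registry": {"keycloak", "pg-app"},
    "erpnext": {"mariadb", "redis-erpnext", "keycloak"},
    "stalwart": {"infisical"},
    "litellm": {"infisical"},
    "certifai": {"keycloak", "mongodb", "litellm"},
    "backend-compliance": {"keycloak", "pg-app"},
    "ai-compliance-sdk": {"pg-app", "qdrant", "litellm"},
    "admin-compliance": {"backend-compliance", "ai-compliance-sdk"},
    "customer-portal": {"keycloak", "tenant-registry"},
}


def startup_waves(needs):
    """Group services into waves; each wave needs only earlier waves."""
    sorter = TopologicalSorter(needs)
    sorter.prepare()  # raises graphlib.CycleError if the graph has a cycle
    waves = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())
        waves.append(ready)
        sorter.done(*ready)
    return waves


for number, wave in enumerate(startup_waves(NEEDS)):
    print(number, ", ".join(wave))
```

The waves are the earliest point at which each service could start, so they can differ from the phase numbers above (stalwart and litellm become ready one step earlier here). A `CycleError` from `prepare()` would indicate a genuine startup deadlock in the declared dependencies.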

### Scenario G — Tenant Registry fails

```
Impact:      Portal cannot resolve tenant from subdomain → /[slug]/* all 503
             Keycloak protocol mapper cannot get products claim → JWT missing field
             → users can log in but see "No active products"
             Products (CERTifAI, compliance) themselves: UNAFFECTED if already authenticated
Cascade:     New logins degraded.
             Existing sessions continue.
Deadlock:    None.
Recovery:    Restart tenant-registry on vm-control. pg-app on vm-data must be healthy.
RTO target:  ≤ 60s
Mitigation:  Portal caches slug → tenant mapping with 60s TTL.
             Short outage invisible to customers.
Severity:    MEDIUM
```
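The mitigation relies on a short-lived cache in front of the registry. Below is a minimal sketch of such a cache, assuming a strict 60-second TTL and an injectable clock for testing; `SlugCache` and the `lookup` callable are hypothetical names, not the portal's actual code. Fresh entries are served without contacting the registry, so an outage shorter than the remaining TTL goes unnoticed for slugs that are already cached.

```python
import time
from typing import Callable, Dict, Tuple


class SlugCache:
    """Caches slug -> tenant lookups for a fixed TTL (hypothetical helper)."""

    def __init__(self, lookup: Callable[[str], str], ttl_seconds: float = 60.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._lookup = lookup          # call into tenant-registry
        self._ttl = ttl_seconds
        self._clock = clock
        self._entries: Dict[str, Tuple[str, float]] = {}

    def resolve(self, slug: str) -> str:
        now = self._clock()
        hit = self._entries.get(slug)
        if hit is not None and now - hit[1] < self._ttl:
            return hit[0]              # fresh entry: registry is not contacted
        # Expired or unknown slug: a registry failure propagates to the caller,
        # which corresponds to the 503 described in the impact section.
        tenant = self._lookup(slug)
        self._entries[slug] = (tenant, now)
        return tenant
```

Slugs that were never cached, or whose entry has expired, still fail while the registry is down; the cache only hides outages that fit inside the TTL window.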

---

## 11. Cross-Dependency Summary Table

```
Needs →              │PG-KC│PG-Inf│PG-App│Mongo│Maria│Redis│Minio│Qdrant│ KC  │Infis│Lit. │T.Reg│
─────────────────────┼─────┼──────┼──────┼─────┼─────┼─────┼─────┼──────┼─────┼─────┼─────┼─────┤
keycloak             │  ●  │      │      │     │     │     │     │      │     │ ◐*  │     │     │
infisical            │     │  ●   │      │     │     │  ●  │     │      │     │     │     │     │
gitea                │     │      │      │     │     │     │     │      │     │  ●  │     │     │
tenant-registry      │     │      │  ●   │     │     │     │     │      │  ●  │  ●  │     │     │
customer-portal      │     │      │      │     │     │     │     │      │  ●  │  ●  │     │  ●  │
erpnext              │     │      │      │     │  ●  │  ●  │     │      │  ●  │  ●  │     │     │
frappe-hd            │     │      │      │     │  ●  │  ●  │     │      │     │  ●  │     │     │
stalwart             │     │      │      │     │     │     │     │      │     │  ●  │     │     │
certifai             │     │      │      │  ●  │     │     │     │      │  ●  │  ●  │  ◐  │     │
litellm              │     │      │      │     │     │     │     │      │     │  ●  │     │     │
backend-compl.       │     │      │  ●   │     │     │     │     │      │  ●  │  ●  │     │     │
ai-compl-sdk         │     │      │  ●   │     │     │     │     │  ●   │     │  ●  │  ◐  │     │
admin-compl.         │     │      │      │     │     │     │     │      │     │     │     │     │
orca-proxy           │     │      │      │     │     │     │     │      │     │     │     │     │
stage-app            │     │      │      │     │     │     │     │      │  ●  │  ◑  │     │  ◑  │

● = hard dependency (cannot start without)
◐ = soft dependency (starts, features degrade)
◑ = stage-only read-mostly dependency (writes blocked by Infisical scope)
◐*= bootstrap exception (Keycloak DB URI in Orca env on vm-edge, not Infisical)
```
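The matrix can also be read as a graph to estimate blast radius. The sketch below is an illustration under stated assumptions: only the hard (●) cells are transcribed, soft (◐, ◐*) and stage-only (◑) cells are ignored, and the column labels are expanded to lower-case service names of my choosing (`redis` stands for the single Redis column). `HARD` and `blocked_by` are hypothetical names.

```python
# service -> hard dependencies (the ● cells of the table above).
HARD = {
    "keycloak": {"pg-keycloak"},
    "infisical": {"pg-infisical", "redis"},
    "gitea": {"infisical"},
    "tenant-registry": {"pg-app", "keycloak", "infisical"},
    "customer-portal": {"keycloak", "infisical", "tenant-registry"},
    "erpnext": {"mariadb", "redis", "keycloak", "infisical"},
    "frappe-hd": {"mariadb", "redis", "infisical"},
    "stalwart": {"infisical"},
    "certifai": {"mongodb", "keycloak", "infisical"},
    "litellm": {"infisical"},
    "backend-compl.": {"pg-app", "keycloak", "infisical"},
    "ai-compl-sdk": {"pg-app", "qdrant", "infisical"},
    "stage-app": {"keycloak"},
}


def blocked_by(failed, hard=HARD):
    """Return every service that cannot cold-start while `failed` is down."""
    blocked = set()
    frontier = [failed]
    while frontier:
        down = frontier.pop()
        for service, needs in hard.items():
            if down in needs and service not in blocked:
                blocked.add(service)
                frontier.append(service)  # follow the chain transitively
    return blocked


print(sorted(blocked_by("pg-app")))
```

For `pg-app` this yields the three direct dependents plus `customer-portal`, which is blocked indirectly through `tenant-registry`.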

---

## 12. Open Infrastructure Risks (Priority Order)

```
RISK-1  pg-app (vm-data) is a single instance serving tenant_registry + compliance schemas.
        One crash blocks portal AND compliance product simultaneously.
        → Mitigation: split into pg-registry + pg-compliance at Tier B (200 customers).
                      Move pg-registry to its own DBaaS PostgreSQL cluster (€213/mo).
        Priority: HIGH — fix before 100 customers; flagged also in COST_PLAN.md

RISK-2  vm-edge is a single VM owning 3 root dependencies (DNS, auth, secrets).
        Failure = total external outage. Highest blast radius in the system.
        → Mitigation:
            Phase A: cold-standby vm-edge-spare (idle cost €0; tested quarterly)
            Phase B (Tier C, 500 cust): split vm-edge into vm-edge + vm-identity + vm-secrets
        Priority: HIGH

RISK-3  vm-control hosts 5 service groups (portal, tenant-registry, ERPNext, Frappe HD,
        Stalwart). Co-tenant noise risk; one OOM kills the others.
        → Mitigation:
            Phase A: hard Orca resource limits per service (see §6 co-tenant notes)
            Phase B (Tier B): split vm-control → vm-portal + vm-ops at €64/mo extra
        Priority: MEDIUM

RISK-4  Keycloak is a single instance with no clustering.
        Any Keycloak outage = total auth failure within JWT TTL.
        → Mitigation: short-term: tested runbook + 15min RTO target
                      long-term: Keycloak active-passive cluster (Phase 2, on split vm-identity)
        Priority: MEDIUM

RISK-5  Stage isolation depends on 5 enforcement controls (see §2 table).
        If any one breaks, stage code can affect prod customers.
        → Mitigation: quarterly red-team verification of each control.
                      Especially: Infisical secret-path scoping and Stalwart recipient filter.
        Priority: MEDIUM — easy to forget once it's working

RISK-6  Infisical downtime during multi-VM restart causes delayed cold start.
        → Mitigation: Orca startup ordering + bootstrap secrets for Keycloak only
        Priority: LOW — documented runbook; cold start is rare

RISK-7  ERPNext → Tenant Registry webhook has no guaranteed delivery.
        Failed activation = tenant not active after contract signed.
        → Mitigation: Frappe retry + idempotent /activate endpoint + manual Backstage trigger
        Priority: LOW

RISK-8  LiteLLM calls external AI APIs (OpenAI / Anthropic).
        → Mitigation: LiteLLM fallback routing; products degrade gracefully.
        Priority: LOW — external dependency, by design
```
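The RISK-7 mitigation combines retries with an idempotent endpoint, which matters because a retry may repeat a request that already succeeded. The following is a generic sketch of that pattern, not Frappe's retry mechanism or the real `/activate` handler; `TenantRegistry`, `deliver_with_retry`, and the backoff values are all assumptions for illustration.

```python
import time


class TenantRegistry:
    """In-memory stand-in for the registry (hypothetical)."""

    def __init__(self):
        self.active = set()
        self.activation_events = 0

    def activate(self, tenant_id):
        """Idempotent: repeated calls for the same tenant change nothing."""
        if tenant_id in self.active:
            return False
        self.active.add(tenant_id)
        self.activation_events += 1
        return True


def deliver_with_retry(send, tenant_id, attempts=5, base_delay=0.5, sleep=time.sleep):
    """Call send(tenant_id) until it succeeds, with exponential backoff."""
    for attempt in range(attempts):
        try:
            send(tenant_id)
            return True
        except ConnectionError:
            if attempt < attempts - 1:
                sleep(base_delay * 2 ** attempt)
    return False  # caller falls back to the manual trigger
```

Because `activate` is a no-op for an already active tenant, a webhook whose response was lost can be re-sent safely. A `False` return from `deliver_with_retry` is the point at which the manual Backstage trigger would apply.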

---

## 13. Growth Trajectory — when to add VMs

The locked 4-VM topology is right for roughly 5 to 200 customers. Past that, expect to add VMs back in this order:

```
Tier A (5–200 cust):   4 VMs as locked                                      €192/mo compute (36M upfront)
        ↓
Tier B (200–500):      Bump vm-data m2.med → m2.large                       +€64/mo
                       Add cold-standby vm-edge-spare                       +€0 (idle, paid only on swap)
        ↓
Tier C (500–1000):     Split vm-data: vm-data + vm-data-db                  +€64/mo
                         (postgres-app moves to its own VM, or DBaaS cluster +€213/mo)
                       Split vm-control: vm-control + vm-ops                +€64/mo
                         (ERPNext + MariaDB + Stalwart move to vm-ops)
        ↓
Tier D (1000–2000):    Split vm-edge: vm-edge + vm-identity + vm-secrets    +€96/mo
                       HA Keycloak active-passive on 2× vm-identity         +€32/mo
                       Octavia Load Balancer Double Instance                +€58/mo
                       vm-data m2.large → m2.xlarge or 2×                   +€128–256/mo
        ↓
Final topology ≈ 8 prod VMs + DBaaS
```

Each step is justified by a measurable signal (>80% RAM, >70% CPU, sustained queue depth, or a specific outage scenario). Never split preemptively.
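As a rough illustration of "measurable signal", the check below encodes the two numeric thresholds named above (RAM above 80%, CPU above 70%) and treats queue depth and outage scenarios as flags supplied by whoever reviews the metrics. `VmSignals` and `split_reasons` are hypothetical names, and the source does not define what counts as "sustained", so that judgement is left to the caller.

```python
from dataclasses import dataclass
from typing import List, Optional

RAM_LIMIT_PCT = 80.0  # from the text: >80% RAM
CPU_LIMIT_PCT = 70.0  # from the text: >70% CPU


@dataclass
class VmSignals:
    ram_pct: float
    cpu_pct: float
    queue_depth_sustained: bool = False      # assumption: decided by the reviewer
    outage_scenario: Optional[str] = None    # e.g. a scenario label from the runbook


def split_reasons(signals: VmSignals) -> List[str]:
    """Return the signals that justify a split; an empty list means do not split."""
    reasons = []
    if signals.ram_pct > RAM_LIMIT_PCT:
        reasons.append(f"RAM {signals.ram_pct:.0f}% > {RAM_LIMIT_PCT:.0f}%")
    if signals.cpu_pct > CPU_LIMIT_PCT:
        reasons.append(f"CPU {signals.cpu_pct:.0f}% > {CPU_LIMIT_PCT:.0f}%")
    if signals.queue_depth_sustained:
        reasons.append("sustained queue depth")
    if signals.outage_scenario is not None:
        reasons.append(f"outage scenario: {signals.outage_scenario}")
    return reasons
```

An empty result corresponds to the "never split preemptively" rule: without at least one recorded reason, the topology stays as it is.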

---

## 14. Cost summary (see COST_PLAN.md for full breakdown)

| Mode | Compute €/mo | Storage €/mo | Network €/mo | Total net €/mo | Total incl. 19% VAT €/mo |
|---|---:|---:|---:|---:|---:|
| On-Demand | 434.50 | 112 | 2.92 | 549.42 | 653.81 |
| 12-month commit | 295.20 | 112 | 2.92 | 410.12 | 488.04 |
| 36-month no upfront | 216.00 | 112 | 2.92 | 330.92 | 393.79 |
| 36-month upfront | 192.00 | 112 | 2.92 | 306.92 | 365.23 |

Plus €6,912 net one-time payment if signing 36M-upfront for the compute portion.
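The totals in the table can be reproduced from the three component columns. This is a small arithmetic check using only the figures shown above, with half-up rounding to cents as an assumption about how the VAT column was rounded; `COMPUTE` and `totals` are names chosen for this sketch.

```python
from decimal import Decimal, ROUND_HALF_UP

STORAGE = Decimal("112")
NETWORK = Decimal("2.92")
VAT = Decimal("0.19")

COMPUTE = {
    "On-Demand": Decimal("434.50"),
    "12-month commit": Decimal("295.20"),
    "36-month no upfront": Decimal("216.00"),
    "36-month upfront": Decimal("192.00"),
}


def totals(compute):
    """Return (net, gross) monthly totals in EUR for one pricing mode."""
    net = compute + STORAGE + NETWORK
    gross = (net * (1 + VAT)).quantize(Decimal("0.01"), rounding=ROUND_HALF_UP)
    return net, gross


for mode, compute in COMPUTE.items():
    net, gross = totals(compute)
    print(f"{mode}: net {net}, incl. VAT {gross}")

# The one-time figure equals 36 months of the upfront compute rate.
print(36 * COMPUTE["36-month upfront"])
```

All four rows match the table, and 36 × €192 equals the €6,912 one-time figure quoted for the 36M-upfront compute portion.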

---

*End of document. Review quarterly or after any significant infrastructure change. Topology last locked 2026-05-18.*