Bootstraps §1.2 scaffolding (README, CONTRIBUTING, CODEOWNERS, CHANGELOG, PR + issue templates, LICENSE, CI workflow, release workflow, commitlint, cliff, .editorconfig, .gitignore, .env.example) and ships a proprietary all-rights-reserved LICENSE naming both founders. Refs: M0.1
This commit was merged in pull request #1.
@@ -0,0 +1,18 @@
|
|||||||
|
root = true
|
||||||
|
|
||||||
|
[*]
|
||||||
|
charset = utf-8
|
||||||
|
end_of_line = lf
|
||||||
|
indent_style = space
|
||||||
|
indent_size = 2
|
||||||
|
insert_final_newline = true
|
||||||
|
trim_trailing_whitespace = true
|
||||||
|
|
||||||
|
[*.go]
|
||||||
|
indent_style = tab
|
||||||
|
|
||||||
|
[Makefile]
|
||||||
|
indent_style = tab
|
||||||
|
|
||||||
|
[*.md]
|
||||||
|
trim_trailing_whitespace = false
|
||||||
@@ -0,0 +1,50 @@
|
|||||||
|
---
|
||||||
|
name: Bug report
|
||||||
|
about: Something works incorrectly or breaks
|
||||||
|
labels: bug
|
||||||
|
---
|
||||||
|
|
||||||
|
## What happened
|
||||||
|
|
||||||
|
<!-- One sentence. The observable symptom, not the root cause. -->
|
||||||
|
|
||||||
|
## What I expected
|
||||||
|
|
||||||
|
<!-- One sentence. -->
|
||||||
|
|
||||||
|
## Steps to reproduce
|
||||||
|
|
||||||
|
1.
|
||||||
|
2.
|
||||||
|
3.
|
||||||
|
|
||||||
|
## Environment
|
||||||
|
|
||||||
|
- **Env:** dev / stage / prod
|
||||||
|
- **Tenant slug:** <!-- e.g. acme, demo, leave blank if platform-wide -->
|
||||||
|
- **Product:** <!-- portal / certifai / compliance / tenant-registry / orca-proxy / ... -->
|
||||||
|
- **Release tag / commit SHA:**
|
||||||
|
- **Browser (if portal):**
|
||||||
|
|
||||||
|
## Evidence
|
||||||
|
|
||||||
|
<!-- Trace ID from SigNoz, log excerpts, screenshots, request/response bodies.
|
||||||
|
STRIP PII before pasting. -->
|
||||||
|
|
||||||
|
```
<paste here>
```
|
||||||
|
|
||||||
|
**SigNoz trace:** <!-- link -->
|
||||||
|
|
||||||
|
## Blast radius
|
||||||
|
|
||||||
|
- [ ] Affects a single tenant
|
||||||
|
- [ ] Affects multiple tenants
|
||||||
|
- [ ] Affects all tenants on this env
|
||||||
|
- [ ] Data loss or corruption risk
|
||||||
|
- [ ] Security / authz implication
|
||||||
|
|
||||||
|
## Suspected cause (optional)
|
||||||
|
|
||||||
|
<!-- Leave blank if you don't know. Speculation here is welcome but not required. -->
|
||||||
@@ -0,0 +1,41 @@
|
|||||||
|
---
|
||||||
|
name: Feature / change request
|
||||||
|
about: Propose a new capability or behavior change
|
||||||
|
labels: enhancement
|
||||||
|
---
|
||||||
|
|
||||||
|
## Problem
|
||||||
|
|
||||||
|
<!-- What is the customer / operator / developer trying to do today, and why is it painful?
|
||||||
|
Lead with the WHY. -->
|
||||||
|
|
||||||
|
## Proposed solution
|
||||||
|
|
||||||
|
<!-- One paragraph. The shape of the change, not the implementation detail. -->
|
||||||
|
|
||||||
|
## Acceptance criteria
|
||||||
|
|
||||||
|
<!-- A reviewer should be able to read these and say "shipped" or "not shipped". -->
|
||||||
|
|
||||||
|
- [ ]
|
||||||
|
- [ ]
|
||||||
|
- [ ]
|
||||||
|
|
||||||
|
## Alternatives considered
|
||||||
|
|
||||||
|
<!-- 1–2 sentences each. "Do nothing" is always one alternative — say why it's worse. -->
|
||||||
|
|
||||||
|
## Linked milestone
|
||||||
|
|
||||||
|
<!-- Optional. If this maps to an existing milestone in IMPLEMENTATION_PLAN.md, link it.
|
||||||
|
If it doesn't, that's a signal the plan needs an update. -->
|
||||||
|
|
||||||
|
M0.1 — or **new milestone needed**
|
||||||
|
|
||||||
|
## Out of scope
|
||||||
|
|
||||||
|
<!-- Things this issue explicitly does NOT cover, so reviewers don't expand the scope. -->
|
||||||
|
|
||||||
|
## Open questions
|
||||||
|
|
||||||
|
<!-- Things to resolve before implementation can start. -->
|
||||||
@@ -0,0 +1,66 @@
|
|||||||
|
<!--
|
||||||
|
PR title MUST be a Conventional Commit, e.g.:
|
||||||
|
feat(api): add POST /v1/tenants/:id/cancel
|
||||||
|
fix(auth): reject JWT when org_id missing
|
||||||
|
|
||||||
|
Mark draft if not ready for review.
|
||||||
|
-->
|
||||||
|
|
||||||
|
## What
|
||||||
|
|
||||||
|
<!-- 1–3 bullets. What does this PR change? -->
|
||||||
|
-
|
||||||
|
|
||||||
|
## Why
|
||||||
|
|
||||||
|
<!-- Link the architecture section, milestone ID, or issue this addresses. -->
|
||||||
|
|
||||||
|
Linked milestone: **M0.1**
|
||||||
|
|
||||||
|
<!-- Optional: closes #123, refs #456 -->
|
||||||
|
|
||||||
|
## How
|
||||||
|
|
||||||
|
<!-- Notes for the reviewer: the interesting design choices, the tricky bits, what NOT to focus on. Skip if obvious from the diff. -->
|
||||||
|
|
||||||
|
## Test plan
|
||||||
|
|
||||||
|
- [ ] Unit tests added/updated
|
||||||
|
- [ ] Integration tests added/updated (real DB via testcontainers)
|
||||||
|
- [ ] Playwright e2e added/updated (only if user-facing flow changed)
|
||||||
|
- [ ] Manual smoke on stage after deploy
|
||||||
|
- [ ] Regression test added (only if this PR fixes a bug — must fail before the fix)
|
||||||
|
|
||||||
|
<!-- If a row is genuinely n/a, leave it unchecked and explain below. -->
|
||||||
|
|
||||||
|
## Risk
|
||||||
|
|
||||||
|
**Blast radius:** <!-- single tenant / all tenants / single product / portal-wide / data-plane / infra -->
|
||||||
|
|
||||||
|
**What could break:**
|
||||||
|
-
|
||||||
|
|
||||||
|
**Rollback plan:**
|
||||||
|
<!-- e.g. `orca rollout undo {service} --env=prod`, or "revert the PR and redeploy" -->
|
||||||
|
|
||||||
|
## Checklist
|
||||||
|
|
||||||
|
- [ ] Docs updated (or n/a — explain)
|
||||||
|
- [ ] Audit events emitted for state changes (or n/a)
|
||||||
|
- [ ] Secrets via Infisical, never in repo
|
||||||
|
- [ ] Migration is forward-only + idempotent (or no migration)
|
||||||
|
- [ ] Tenant scoping enforced on every DB query (or no DB access)
|
||||||
|
- [ ] OpenAPI spec updated (or no API change)
|
||||||
|
- [ ] `featureFlags.evaluate()` used for any toggleable behavior (or n/a)
|
||||||
|
- [ ] CHANGELOG entry under "Unreleased" (or n/a)
|
||||||
|
|
||||||
|
## Screenshots / recordings
|
||||||
|
|
||||||
|
<!-- For UI changes. Drop a screenshot or a Loom link. -->
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
<!--
|
||||||
|
Reviewer reminder: in this order — risk → tests → security → correctness → style.
|
||||||
|
Squash-merge after approval. PR title becomes the commit message.
|
||||||
|
-->
|
||||||
@@ -0,0 +1,31 @@
|
|||||||
|
# CI skeleton (TypeScript shape; no app code yet).
|
||||||
|
# Currently runs commitlint + gitleaks + trivy fs scan. Add lint/test/build jobs
|
||||||
|
# when this repo grows real package code.
|
||||||
|
name: ci
|
||||||
|
|
||||||
|
on:
|
||||||
|
pull_request:
|
||||||
|
branches: [main]
|
||||||
|
push:
|
||||||
|
branches: [main]
|
||||||
|
|
||||||
|
jobs:
|
||||||
|
shared:
|
||||||
|
runs-on: docker
|
||||||
|
steps:
|
||||||
|
- uses: actions/checkout@v4
|
||||||
|
with: { fetch-depth: 0 }
|
||||||
|
|
||||||
|
- name: commitlint (PR only)
|
||||||
|
if: github.event_name == 'pull_request'
|
||||||
|
uses: wagoid/commitlint-github-action@v6
|
||||||
|
|
||||||
|
- name: gitleaks
|
||||||
|
uses: gitleaks/gitleaks-action@v2
|
||||||
|
|
||||||
|
- name: trivy fs scan
|
||||||
|
uses: aquasecurity/trivy-action@master
|
||||||
|
with:
|
||||||
|
scan-type: fs
|
||||||
|
severity: HIGH,CRITICAL
|
||||||
|
exit-code: 1
|
||||||
@@ -0,0 +1,85 @@
|
|||||||
|
# release.yaml — production release on git tag vX.Y.Z.
|
||||||
|
# Promotes the image already on stage to prod, gated by manual sign-off.
|
||||||
|
name: release
|
||||||
|
|
||||||
|
on:
|
||||||
|
push:
|
||||||
|
tags: ['v*.*.*']
|
||||||
|
|
||||||
|
jobs:
|
||||||
|
promote:
|
||||||
|
runs-on: docker
|
||||||
|
environment:
|
||||||
|
name: production # Gitea Environments — requires sign-off per branch protection
|
||||||
|
url: https://yourplatform.com
|
||||||
|
steps:
|
||||||
|
- uses: actions/checkout@v4
|
||||||
|
with: { fetch-depth: 0 }
|
||||||
|
|
||||||
|
- name: extract version
|
||||||
|
id: v
|
||||||
|
run: echo "version=${GITHUB_REF#refs/tags/v}" >> "$GITHUB_OUTPUT"
|
||||||
|
|
||||||
|
- name: verify stage soak (>= 24h on this image)
|
||||||
|
run: |
|
||||||
|
IMG=registry.yourplatform.com/${{ github.event.repository.name }}:env-stage
|
||||||
|
SOAK_SECONDS=$(orca image-age --env=stage --image $IMG)
|
||||||
|
if [ "$SOAK_SECONDS" -lt 86400 ]; then
|
||||||
|
echo "Stage soak only $SOAK_SECONDS s, < 24h. Aborting."
|
||||||
|
exit 1
|
||||||
|
fi
|
||||||
|
env:
|
||||||
|
ORCA_TOKEN: ${{ secrets.ORCA_STAGE_TOKEN }}
|
||||||
|
|
||||||
|
- name: log in to registry
|
||||||
|
uses: docker/login-action@v3
|
||||||
|
with:
|
||||||
|
registry: registry.yourplatform.com
|
||||||
|
username: ${{ secrets.REGISTRY_USER }}
|
||||||
|
password: ${{ secrets.REGISTRY_PASS }}
|
||||||
|
|
||||||
|
- name: re-tag image as semver + env-prod
run: |
|
||||||
|
IMG=registry.yourplatform.com/${{ github.event.repository.name }}
|
||||||
|
docker pull $IMG:env-stage
|
||||||
|
docker tag $IMG:env-stage $IMG:v${{ steps.v.outputs.version }}
|
||||||
|
docker tag $IMG:env-stage $IMG:env-prod
|
||||||
|
docker push $IMG:v${{ steps.v.outputs.version }}
|
||||||
|
docker push $IMG:env-prod
|
||||||
|
|
||||||
|
- name: deploy to prod
|
||||||
|
run: orca apply --env=prod --image-tag=v${{ steps.v.outputs.version }}
|
||||||
|
env:
|
||||||
|
ORCA_TOKEN: ${{ secrets.ORCA_PROD_TOKEN }}
|
||||||
|
|
||||||
|
- name: post-deploy smoke
|
||||||
|
run: orca exec --env=prod smoke-runner
|
||||||
|
|
||||||
|
- name: generate release notes from conventional commits
|
||||||
|
uses: orhun/git-cliff-action@v3
|
||||||
|
with:
|
||||||
|
config: cliff.toml
|
||||||
|
args: --latest --strip header
|
||||||
|
env:
|
||||||
|
OUTPUT: RELEASE_NOTES.md
|
||||||
|
|
||||||
|
- name: create Gitea release
|
||||||
|
run: |
|
||||||
|
curl -X POST -H "Authorization: token ${{ secrets.GITEA_TOKEN }}" \
|
||||||
|
-H "Content-Type: application/json" \
|
||||||
|
-d "$(jq -Rs '{tag_name:"v${{ steps.v.outputs.version }}", name:"v${{ steps.v.outputs.version }}", body:.}' < RELEASE_NOTES.md)" \
|
||||||
|
https://gitea.meghsakha.com/api/v1/repos/${{ github.repository }}/releases
|
||||||
|
|
||||||
|
rollback-on-failure:
|
||||||
|
needs: promote
|
||||||
|
if: failure()
|
||||||
|
runs-on: docker
|
||||||
|
steps:
|
||||||
|
- name: orca rollback prod
|
||||||
|
run: orca rollout undo ${{ github.event.repository.name }} --env=prod
|
||||||
|
env:
|
||||||
|
ORCA_TOKEN: ${{ secrets.ORCA_PROD_TOKEN }}
|
||||||
|
- name: page on-call
|
||||||
|
run: |
|
||||||
|
curl -X POST -H "Content-Type: application/json" \
|
||||||
|
-d '{"text":"Release of ${{ github.event.repository.name }} ${{ github.ref }} FAILED. Rolled back. See Gitea Actions run."}' \
|
||||||
|
${{ secrets.ONCALL_WEBHOOK }}
|
||||||
@@ -0,0 +1,37 @@
|
|||||||
|
# OS
|
||||||
|
.DS_Store
|
||||||
|
Thumbs.db
|
||||||
|
|
||||||
|
# Editors
|
||||||
|
.vscode/
|
||||||
|
.idea/
|
||||||
|
*.swp
|
||||||
|
*~
|
||||||
|
|
||||||
|
# Local secrets
|
||||||
|
.env
|
||||||
|
.env.local
|
||||||
|
.env.*.local
|
||||||
|
|
||||||
|
# Build outputs
|
||||||
|
dist/
|
||||||
|
build/
|
||||||
|
out/
|
||||||
|
target/
|
||||||
|
coverage/
|
||||||
|
*.log
|
||||||
|
*.tmp
|
||||||
|
|
||||||
|
# Node
|
||||||
|
node_modules/
|
||||||
|
.pnpm-store/
|
||||||
|
.next/
|
||||||
|
.turbo/
|
||||||
|
|
||||||
|
# Go
|
||||||
|
*.test
|
||||||
|
*.out
|
||||||
|
vendor/
|
||||||
|
|
||||||
|
# Rust
|
||||||
|
**/target/
|
||||||
@@ -0,0 +1,25 @@
|
|||||||
|
# Changelog
|
||||||
|
|
||||||
|
All notable changes to this repo. Format: [Keep a Changelog](https://keepachangelog.com/en/1.1.0/).
|
||||||
|
Generated section is appended on release tag via `git-cliff` (see `.gitea/workflows/release.yaml`).
|
||||||
|
|
||||||
|
## [Unreleased]
|
||||||
|
|
||||||
|
### Added
|
||||||
|
-
|
||||||
|
|
||||||
|
### Changed
|
||||||
|
-
|
||||||
|
|
||||||
|
### Fixed
|
||||||
|
-
|
||||||
|
|
||||||
|
### Removed
|
||||||
|
-
|
||||||
|
|
||||||
|
### Security
|
||||||
|
-
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
<!-- Released versions appear below this line, newest first. Don't edit by hand once the release workflow has run. -->
|
||||||
@@ -0,0 +1,35 @@
|
|||||||
|
# CODEOWNERS — auto-requests reviewers based on touched paths.
|
||||||
|
# Format: <path-glob> <@user-or-team> [<@user-or-team> ...]
|
||||||
|
# More specific patterns override less specific ones.
|
||||||
|
# See: https://docs.gitea.com/usage/code-owners
|
||||||
|
#
|
||||||
|
# This is the BASELINE — copy into the repo and tighten paths per service.
|
||||||
|
|
||||||
|
# Default — every PR gets at least Sharang
|
||||||
|
* @sharang
|
||||||
|
|
||||||
|
# Architecture / specs / runbooks (touchy — both founders look)
|
||||||
|
/docs/ @sharang @benjamin_boenisch
|
||||||
|
*.md @sharang @benjamin_boenisch
|
||||||
|
|
||||||
|
# Security-sensitive paths
|
||||||
|
/internal/auth/ @sharang
|
||||||
|
/internal/keycloak/ @sharang
|
||||||
|
/internal/api-keys/ @sharang
|
||||||
|
/middleware/auth/ @sharang
|
||||||
|
|
||||||
|
# Schema and data migrations — irreversible, both founders look
|
||||||
|
/migrations/ @sharang @benjamin_boenisch
|
||||||
|
**/schema/ @sharang @benjamin_boenisch
|
||||||
|
|
||||||
|
# Infra-as-code
|
||||||
|
/orca/ @sharang
|
||||||
|
/.gitea/workflows/ @sharang
|
||||||
|
/Dockerfile @sharang
|
||||||
|
|
||||||
|
# Manifests (catalog metadata visible to customers)
|
||||||
|
/product.manifest.yaml @sharang @benjamin_boenisch
|
||||||
|
|
||||||
|
# Frontend-only changes
|
||||||
|
/src/app/ @sharang
|
||||||
|
/src/components/ @sharang
|
||||||
@@ -0,0 +1,89 @@
|
|||||||
|
# Contributing
|
||||||
|
|
||||||
|
Conventions are platform-wide. The full ruleset lives in [`platform/docs/IMPLEMENTATION_PLAN.md §1`](https://gitea.meghsakha.com/platform/docs/src/branch/main/IMPLEMENTATION_PLAN.md). This is the short version.
|
||||||
|
|
||||||
|
## Branching
|
||||||
|
|
||||||
|
- Trunk-based. `main` is always deployable.
|
||||||
|
- Branch from `main`. Name: `feat/<slug>`, `fix/<slug>`, `chore/<slug>`, `docs/<slug>`, `refactor/<slug>`.
|
||||||
|
- Branches live at most 5 days. Longer-lived branches accumulate merge conflicts and stop being trusted.
|
||||||
|
- Never push directly to `main` (branch protection blocks it).
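The naming rule above can be checked mechanically. The following is a minimal sketch, not part of the repo's tooling: the prefixes come from the list above, while the allowed slug characters (lowercase letters, digits, hyphens) and the helper name `is_valid_branch` are assumptions made for illustration.

```python
import re

# Prefixes are taken from the Branching list; the slug character set
# (lowercase letters, digits, hyphens) is an assumption.
BRANCH_PATTERN = re.compile(r"(feat|fix|chore|docs|refactor)/[a-z0-9][a-z0-9-]*")


def is_valid_branch(name: str) -> bool:
    """Return True if the branch name has the form <prefix>/<slug>."""
    return BRANCH_PATTERN.fullmatch(name) is not None


if __name__ == "__main__":
    for candidate in ("feat/tenant-cancel", "feature/tenant-cancel", "main"):
        print(candidate, is_valid_branch(candidate))
```

A check like this could run in a local hook; it is not a substitute for the branch protection on `main`.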
|
||||||
|
|
||||||
|
## Commits
|
||||||
|
|
||||||
|
[Conventional Commits](https://www.conventionalcommits.org/) — enforced by `commitlint` in CI.
|
||||||
|
|
||||||
|
```
<type>(<scope>)?: <subject>

[optional body]

[optional footer: BREAKING CHANGE: ..., Refs: M5.2]
```
|
||||||
|
|
||||||
|
Types: `feat`, `fix`, `chore`, `docs`, `refactor`, `test`, `perf`, `build`, `ci`.
|
||||||
|
Breaking change: append `!` (e.g. `feat!: drop /v0 endpoints`) and add `BREAKING CHANGE:` footer.
|
||||||
|
|
||||||
|
Examples:
|
||||||
|
```
feat(api): add POST /v1/tenants/:id/cancel
fix(auth): reject JWT when org_id missing
docs: link runbook from README
refactor!: rename column tenant.kind → tenant.type
```
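For a quick local check of a subject line before pushing, the format can be approximated with a regular expression. This is a rough sketch only: the authoritative rules are whatever `commitlint` is configured with in CI, the scope character set is an assumption, and `parse_subject` is a hypothetical helper name.

```python
import re

# Types are the ones listed in this section; the scope character set is an assumption.
TYPES = ("feat", "fix", "chore", "docs", "refactor", "test", "perf", "build", "ci")

SUBJECT_PATTERN = re.compile(
    r"^(?P<type>" + "|".join(TYPES) + r")"
    r"(\((?P<scope>[a-z0-9-]+)\))?"  # optional (scope)
    r"(?P<breaking>!)?"              # optional breaking-change marker
    r": (?P<subject>\S.*)$"
)


def parse_subject(line: str):
    """Parse a commit subject line; return a dict of its parts, or None if malformed."""
    match = SUBJECT_PATTERN.match(line)
    if match is None:
        return None
    return {
        "type": match.group("type"),
        "scope": match.group("scope"),
        "breaking": match.group("breaking") is not None,
        "subject": match.group("subject"),
    }


if __name__ == "__main__":
    print(parse_subject("feat(api): add POST /v1/tenants/:id/cancel"))
    print(parse_subject("update stuff"))
```

The sketch only looks at the subject line; the `BREAKING CHANGE:` footer is not validated here.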
|
||||||
|
|
||||||
|
## Pull requests
|
||||||
|
|
||||||
|
1. Open a PR against `main` using the template (`.gitea/pull_request_template.md` is auto-loaded).
|
||||||
|
2. Fill **every** section — the template is a checklist, not decoration.
|
||||||
|
3. Link the milestone in the body: `Linked milestone: M5.2`.
|
||||||
|
4. Wait for green CI + 1 approving review. **Do not self-merge.**
|
||||||
|
5. Squash-merge. The PR title becomes the commit message — keep it as a Conventional Commit.
|
||||||
|
|
||||||
|
## Tests
|
||||||
|
|
||||||
|
| Change type | Required tests |
|---|---|
| New API endpoint | unit + integration (testcontainers, real DB) |
| New user-facing flow | Playwright e2e against stage |
| Bug fix | regression test FIRST (must fail before fix) |
| IaC / Orca manifest | `orca validate` + dry-run plan in PR comment |
| Pure refactor | existing suite must stay green |
|
||||||
|
|
||||||
|
**"Manually tested" is not acceptable** except for IaC, and even there the dry-run plan must be in the PR.
|
||||||
|
|
||||||
|
## Secrets
|
||||||
|
|
||||||
|
- Never commit secrets. `gitleaks` runs in CI and blocks merge.
|
||||||
|
- Local dev: `.env.local` (gitignored); template at `.env.example`.
|
||||||
|
- Stage / prod: Infisical machine identity at `/{env}/{service}/`.
|
||||||
|
|
||||||
|
## Code style
|
||||||
|
|
||||||
|
| Stack | Tools |
|---|---|
| Go | `go fmt`, `go vet`, `golangci-lint run` (all required clean) |
| Rust | `cargo fmt --all`, `cargo clippy -- -D warnings` (both required) |
| TypeScript | `pnpm lint`, `pnpm typecheck` (both required) |
| Python | `ruff check`, `ruff format`, `mypy` (all required) |
|
||||||
|
|
||||||
|
CI runs these. Pre-commit hooks recommended (`.githooks/pre-commit` in this repo).
|
||||||
|
|
||||||
|
## Audit + observability
|
||||||
|
|
||||||
|
Any state-changing endpoint MUST emit an audit event to Tenant Registry `/audit` in the Retraced-shape schema. See [`PRODUCT_INTEGRATION_SPEC.md §8.4`](https://gitea.meghsakha.com/platform/docs/src/branch/main/PRODUCT_INTEGRATION_SPEC.md).
|
||||||
|
|
||||||
|
Any service ships OTel SDK from day one (`OTEL_EXPORTER_OTLP_ENDPOINT` injected by Orca). No `fmt.Println` / `console.log` in committed code.
|
||||||
|
|
||||||
|
## Reviewer hat
|
||||||
|
|
||||||
|
When reviewing, check in this order:
|
||||||
|
1. **Risk** — what could break in prod? Is the rollback clear?
|
||||||
|
2. **Tests** — do they actually exercise the change?
|
||||||
|
3. **Security** — secrets, authz, input validation, tenant scoping.
|
||||||
|
4. **Correctness** — does it do what the PR says it does?
|
||||||
|
5. **Style** — last; CI already caught the mechanical stuff.
|
||||||
|
|
||||||
|
## Questions
|
||||||
|
|
||||||
|
`#engineering` channel · `oncall@yourplatform.com` · or open a PR with a `[WIP]` prefix and ask in the description.
|
||||||
@@ -0,0 +1,258 @@
|
|||||||
|
# Cost Plan — SysEleven Infrastructure
|
||||||
|
|
||||||
|
Companion to `INFRASTRUCTURE.md` and `IMPLEMENTATION_PLAN.md`. Pricing source: `SysEleven-Cloud-Services-Preisinformationen_01_26_v2.pdf` (effective 2026-01-20). All prices net EUR, exclusive of 19% VAT. Region: DUS2 + HAM1.
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## 1. TL;DR
|
||||||
|
|
||||||
|
**Locked topology (2026-05-18):** 4 billable VMs — 1 stage + 3 prod — totalling **48 GiB-RAM**. See `INFRASTRUCTURE.md §1`.
|
||||||
|
|
||||||
|
All four pricing modes, side by side, at the locked sizing:
|
||||||
|
|
||||||
|
| Mode | Compute €/mo | Storage €/mo | Network €/mo | **Total net €/mo** | + 19% VAT | **Annual gross €** |
|---|---:|---:|---:|---:|---:|---:|
| **On-Demand** | 434.50 | 112 | 2.92 | **549.42** | 653.81 | **7,846** |
| **12-month commit** | 295.20 | 112 | 2.92 | **410.12** | 488.04 | **5,856** |
| **36-month no upfront** | 216.00 | 112 | 2.92 | **330.92** | 393.79 | **4,725** |
| **36-month upfront** | 192.00 | 112 | 2.92 | **306.92** | 365.23 | **4,383** |
|
||||||
|
|
||||||
|
**36M upfront one-time payment**: €6,912 net at signing (compute only; storage + network still billed monthly).
|
||||||
|
|
||||||
|
**Recommended cash plan for Year 1:**
|
||||||
|
1. Months 1–3: burn On-Demand (~€549/mo) while flavors get proven against real workload
|
||||||
|
2. Month 4 onward: sign 36M-upfront commit at proven size (~€307/mo)
|
||||||
|
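3. Year-1 total infra with compute amortized monthly: **€4,410 net / €5,248 gross**. In cash terms, the one-time €6,912 net compute prepayment falls due in Month 4 and replaces the €192/mo compute charge from then on.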
|
||||||
|
|
||||||
|
Growth tiers extend that same baseline (next 4 sections drill in).
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## 2. What to use / what to skip
|
||||||
|
|
||||||
|
### Use from day one
|
||||||
|
| Service | Why | Cost |
|---|---|---|
| **OpenStack IaaS (m2 GP)** | Bread and butter. General-purpose 1:4 vCPU:RAM fits everything. | per VM, see §3 |
| **Block Storage (Ceph)** | 3x replicated, persistent. €0.10/GiB/mo. | per GiB |
| **Object Storage (S3)** | Backups, audit logs, demo seed bundles, export ZIPs. €0.02/GiB/mo. | per GiB |
| **Floating IP** | Public IPs for vm-edge (1) and stage (1). | €2.92/IP/mo |
| **VPN as a Service** | Inclusive. Use for ops access from our laptops. | €0 |
| **Self-Service Support** | Free. Adequate while we're shaking out the platform. | €0 |
|
||||||
|
|
||||||
|
### Defer until clearly needed
|
||||||
|
| Service | When to add | Cost |
|---|---|---|
| **DNS Zones (DNSaaS)** | Never. We self-host PowerDNS on vm-edge per [[self-hosted-oss-first]] | €10/zone (skipped) |
| **Load Balancer (Octavia)** | When we add a second vm-edge for HA (Tier D). Until then orca-proxy + Floating IP is enough. | €14.60–57.67/mo |
| **Business Support** | When MRR > €5k. Below that, Self-Service docs cover us. | €185/mo |
| **Priority Support** | Only if we sign an Enterprise contract that requires <1h response. | €545/mo |
| **DDoS Guard PLUS** | After first attack OR before launching anything customer-promoted. | €875/mo |
| **DBaaS PostgreSQL Cluster** | When tenant_registry Postgres becomes the bottleneck (200+ customers, see RISK-1 in INFRASTRUCTURE.md). | €213–426/mo per cluster (m2.small–medium, 36M upfront) |
| **MetaKube Core (managed K8s)** | We use Orca (our own product). MetaKube would compete with Orca, not complement it. Skip unless Orca is replaced. | €0 by design |
| **Managed VM (Business/Priority)** | Defeats Orca. We are the ones who manage VMs. | skipped (saves €1k+/mo) |
| **Operational Support Platform** | €759–€1,479/mo. Massive overkill until late stage. | skipped |
|
||||||
|
|
||||||
|
### GPU instances (separate concern)
|
||||||
|
LiteLLM today is a passthrough. If we ever self-host an inference model:
|
||||||
|
- **L40S (24 GB GPU RAM)**: €1,309/mo On-Demand, €1,086 (12M), €877 (24M)
|
||||||
|
- **H100 NVL (94 GB)**: €5,755/mo On-Demand, €4,637 (12M), €3,743 (24M)
|
||||||
|
|
||||||
|
For now: route LLM calls through LiteLLM → external provider. Add GPU only if a customer pays for dedicated inference.
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## 3. Per-VM sizing — Locked topology (Tier A, 5 customers)
|
||||||
|
|
||||||
|
Flavor mapping from `INFRASTRUCTURE.md §1` to SysEleven `m2` General Purpose (1 vCPU : 4 GiB RAM, 50 GiB ephemeral root included).
|
||||||
|
|
||||||
|
### Compute — all four pricing modes side by side
|
||||||
|
|
||||||
|
| VM | Env | Flavor | vCPU | RAM | On-Demand €/mo | 12M €/mo | 36M no-upfront €/mo | 36M upfront €/mo |
|---|---|---|---:|---:|---:|---:|---:|---:|
| stage | stage | m2.small | 2 | 8 GiB | 72.42 | 49.20 | 36.00 | 32.00 |
| vm-edge | prod | m2.small | 2 | 8 GiB | 72.42 | 49.20 | 36.00 | 32.00 |
| vm-control | prod | m2.medium | 4 | 16 GiB | 144.83 | 98.40 | 72.00 | 64.00 |
| vm-data | prod | m2.medium | 4 | 16 GiB | 144.83 | 98.40 | 72.00 | 64.00 |
| **TOTAL** | | | **12** | **48 GiB** | **434.50** | **295.20** | **216.00** | **192.00** |
|
||||||
|
|
||||||
|
**36M upfront one-time cost:** 192 × 36 = **€6,912 net** at signing (compute only; everything else billed monthly).
|
||||||
|
|
||||||
|
**Reference per-GiB-RAM rates** (the linear model behind all numbers above):
|
||||||
|
| Mode | €/GiB-RAM/mo |
|---|---:|
| On-Demand | 9.05 |
| 12M commit | 6.15 |
| 36M no-upfront | 4.50 |
| 36M upfront | 4.00 |
|
||||||
|
|
||||||
|
Any future sizing change can be sanity-checked as `RAM × rate`.
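As a minimal sketch of that sanity check, assuming the per-GiB rates from the reference table above (the dictionary keys and the function name `monthly_compute_eur` are illustrative, not taken from any SysEleven tooling):

```python
# Per-GiB-RAM monthly rates in EUR, copied from the reference table above.
RATE_EUR_PER_GIB = {
    "on_demand": 9.05,
    "12m_commit": 6.15,
    "36m_no_upfront": 4.50,
    "36m_upfront": 4.00,
}


def monthly_compute_eur(ram_gib: float, mode: str) -> float:
    """Estimate monthly compute cost as RAM (GiB) x rate for the given pricing mode."""
    return round(ram_gib * RATE_EUR_PER_GIB[mode], 2)


if __name__ == "__main__":
    # Locked topology: 48 GiB of RAM across the four VMs.
    for mode in RATE_EUR_PER_GIB:
        print(mode, monthly_compute_eur(48, mode))
```

Because the published On-Demand rate is rounded to two decimals, the estimate for 48 GiB lands about €0.10 below the €434.50 in the compute table; the committed modes match exactly.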
|
||||||
|
|
||||||
|
### Storage — Tier A steady state
|
||||||
|
|
||||||
|
| Item | GiB | €/GiB/mo | €/mo |
|---|---:|---:|---:|
| stage block (ephemeral PG + Mongo + Qdrant in-VM) | +50 | 0.10 | 5.00 |
| vm-edge block (pg-keycloak + pg-infisical + Gitea repos) | +50 | 0.10 | 5.00 |
| vm-control block (MariaDB + Stalwart spool) | +250 | 0.10 | 25.00 |
| vm-data block (MongoDB + pg-app + Qdrant + MinIO) | +500 | 0.10 | 50.00 |
| Object storage: geo-redundant backups (DUS2↔HAM1) | ~500 | 0.0496 | 25.00 *(€12.50 first 6mo via launch discount)* |
| Object storage: seed bundles + exports + audit archive | ~100 | 0.02 | 2.00 |
| **Storage subtotal (steady state)** | | | **112.00** |
| **Storage subtotal (first 6 months)** | | | **99.50** |
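The storage subtotals can be re-derived from the GiB and rate columns. The sketch below is an illustration under stated assumptions: the item list is copied from the table, the launch discount is modeled as 50% off the geo-redundant line only, and `storage_subtotal_eur` is a hypothetical helper.

```python
# (label, GiB, EUR per GiB per month, is_geo_redundant), copied from the table above.
STORAGE_ITEMS = [
    ("stage block", 50, 0.10, False),
    ("vm-edge block", 50, 0.10, False),
    ("vm-control block", 250, 0.10, False),
    ("vm-data block", 500, 0.10, False),
    ("object storage, geo-redundant backups", 500, 0.0496, True),
    ("object storage, seed bundles + exports + audit archive", 100, 0.02, False),
]


def storage_subtotal_eur(items, geo_discount: float = 0.0) -> float:
    """Sum GiB x rate per item; the discount applies to geo-redundant items only."""
    total = 0.0
    for _label, gib, rate, is_geo in items:
        cost = gib * rate
        if is_geo:
            cost *= 1.0 - geo_discount
        total += cost
    return round(total, 2)


if __name__ == "__main__":
    print("steady state:", storage_subtotal_eur(STORAGE_ITEMS))
    print("first 6 months:", storage_subtotal_eur(STORAGE_ITEMS, geo_discount=0.5))
```

The results sit slightly below the table's subtotals because the table rounds the backup line up to €25.00 before summing.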
|
||||||
|
|
||||||
|
### Network
|
||||||
|
|
||||||
|
| Item | €/mo |
|---|---:|
| 1 Floating IP (vm-edge, the only public host in prod) | 2.92 |
| 1 Floating IP (stage, public for tester access) | 2.92 |
| PowerDNS (self-hosted on vm-edge) | 0 |
| Octavia Load Balancer (deferred to Tier D HA phase) | 0 |
| **Network subtotal** | **5.84** |
|
||||||
|
|
||||||
|
> The TL;DR table in §1 assumes a single Floating IP (€2.92/mo). The tables in this section use **€5.84**, which gives stage its own public IP (recommended). The difference is €2.92/mo.
|
||||||
|
|
||||||
|
### Combined Tier A — four-mode summary
|
||||||
|
|
||||||
|
| Mode | Compute | Storage | Network | **Total net €/mo** | + 19% VAT | **Annual gross €** |
|---|---:|---:|---:|---:|---:|---:|
| On-Demand | 434.50 | 112 | 5.84 | **552.34** | 657.28 | **7,887** |
| 12M commit | 295.20 | 112 | 5.84 | **413.04** | 491.52 | **5,898** |
| 36M no-upfront | 216.00 | 112 | 5.84 | **333.84** | 397.27 | **4,767** |
| 36M upfront | 192.00 | 112 | 5.84 | **309.84** | 368.71 | **4,425** |
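The summary rows follow one formula: net = compute + storage + network, gross = net × 1.19, annual gross = gross × 12. A small sketch that reproduces the table from its own inputs (constant and function names are illustrative):

```python
VAT_RATE = 0.19
STORAGE_EUR = 112.00  # steady-state storage subtotal
NETWORK_EUR = 5.84    # two Floating IPs

# Monthly compute per pricing mode, from the compute table.
COMPUTE_EUR = {
    "On-Demand": 434.50,
    "12M commit": 295.20,
    "36M no-upfront": 216.00,
    "36M upfront": 192.00,
}


def summarize(compute_eur: float) -> tuple[float, float, int]:
    """Return (monthly net, monthly gross, annual gross rounded to whole EUR)."""
    net = compute_eur + STORAGE_EUR + NETWORK_EUR
    gross = net * (1 + VAT_RATE)
    return round(net, 2), round(gross, 2), round(gross * 12)


if __name__ == "__main__":
    for mode, compute in COMPUTE_EUR.items():
        net, gross, annual = summarize(compute)
        print(f"{mode:15s} net {net:8.2f}  gross {gross:8.2f}  annual gross {annual:,}")
```

Swapping `NETWORK_EUR` to 2.92 gives the single-IP variant of the same rows.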
|
||||||
|
|
||||||
|
### Recommended cash plan — Year 1 (use this line in the pitch)
|
||||||
|
|
||||||
|
| Months | Mode | €/mo (net) | Subtotal € |
|---|---|---:|---:|
| 1–3 (rightsizing window) | On-Demand | 552.34 | 1,657 |
| 4–12 (proven baseline) | 36M upfront, compute amortized at €192/mo | 309.84 | 2,789 |
| **Year-1 infra net (compute amortized)** | | | **4,446** |
| + 19% VAT | | | **5,291** |
| Months 4–12 cash: storage + network only (compute is prepaid) | billed monthly | 117.84 | 1,061 |
| One-time 36M upfront in Month 4 (compute, net) | | | 6,912 |
| **Year-1 cash out (net)**: 1,657 + 1,061 + 6,912 | | | **9,630** |
| **Year-1 cash out (gross, + 19% VAT)** | | | **11,459** |
|
||||||
|
|
||||||
|
### 3-year cumulative (full 36M commitment term)
|
||||||
|
|
||||||
|
| Item | € |
|---|---:|
| Months 1–3 On-Demand (compute+storage+net) | 1,657 |
| Compute 36M upfront (paid Month 4) | 6,912 |
| Storage + network, 36 months × ~118 €/mo | 4,248 |
| **3-year infra net** | **12,817** |
| + 19% VAT | **15,252** |
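The cumulative figure is the sum of three parts: the On-Demand window, the compute prepayment, and the monthly storage + network bill over the commitment term. A sketch of that arithmetic, using the table's rounded ~€118/mo for storage + network (parameter names are illustrative):

```python
VAT_RATE = 0.19


def three_year_infra(
    on_demand_monthly: float = 552.34,        # Tier A On-Demand total, net
    on_demand_months: int = 3,                # rightsizing window
    upfront_compute_monthly: float = 192.00,  # 36M upfront effective rate
    term_months: int = 36,                    # commitment term
    storage_network_monthly: float = 118.00,  # rounded as in the table
) -> tuple[int, int]:
    """Return (net, gross) cumulative infra cost in whole EUR."""
    on_demand = on_demand_monthly * on_demand_months
    prepaid_compute = upfront_compute_monthly * term_months
    recurring = storage_network_monthly * term_months
    net = on_demand + prepaid_compute + recurring
    return round(net), round(net * (1 + VAT_RATE))


if __name__ == "__main__":
    net, gross = three_year_infra()
    print(f"3-year infra net: {net:,} EUR, gross: {gross:,} EUR")
```

Note that the period covered is the 3-month On-Demand window plus the full 36-month term.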
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## 4. Growth tiers — what scales when
|
||||||
|
|
||||||
|
### Tier A — Pilot (5 customers, first 6 months)
|
||||||
|
- **Locked topology**: 4 VMs (stage + vm-edge + vm-control + vm-data). See INFRASTRUCTURE.md §1.
|
||||||
|
- **Year 1 cash plan**: 3 months On-Demand → 36M upfront. ~€310/mo effective for compute + storage + network, with the compute share prepaid as a one-time €6,912.
|
||||||
|
- **Add**: Self-Service support (free). Skip LB, DNSaaS, DDoS, DBaaS, MetaKube, Managed Services.
|
||||||
|
|
||||||
|
### Tier B — Early growth (50–200 customers, Year 1)
|
||||||
|
- **Vertical scale only.** Bump vm-data m2.medium → m2.large (+€64/mo for 36M upfront).
|
||||||
|
- **Add cold-standby vm-edge-spare** (€0 idle, only billed during a swap event).
|
||||||
|
- **Add Business Support** (€185/mo) once MRR > €5k.
|
||||||
|
- **Add LB Single Instance** (€14.60/mo) when we want zero-downtime portal deploys.
|
||||||
|
- **Add DDoS Guard PLUS** (€875/mo) before any marketing push.
|
||||||
|
- Estimated total: **~€1,100–1,400/mo + VAT**.
|
||||||
|
|
||||||
|
### Tier C — Scale (500–1000 customers, Year 1–2)
|
||||||
|
- **Split vm-data** into vm-data + vm-data-db (move pg-app to its own VM; resolves RISK-1).
|
||||||
|
- Alternative: move pg-registry to DBaaS m2.small cluster (3 inst, 36M upfront): **€213/mo**
|
||||||
|
- **Split vm-control** into vm-control + vm-ops (ERPNext + MariaDB + Stalwart go to vm-ops): **+€64/mo**
|
||||||
|
- **HA edge**: second vm-edge, switch Floating IP → Load Balancer Double Instance (**€58/mo**).
|
||||||
|
- **Object storage growth:** audit logs, exports, demo backups → estimated 2 TB = **€40/mo**.
|
||||||
|
- Estimated total: **~€2,000–2,500/mo + VAT**.
|
||||||
|
|
||||||
|
### Tier D — Full scale (2000 customers, Year 2–3)
|
||||||
|
- **3-node clusters** on hot paths: vm-control × 2, vm-data × 2.
|
||||||
|
- **Split vm-edge** into vm-edge + vm-identity + vm-secrets (back toward original 7-VM design).
|
||||||
|
- **DBaaS m2.medium cluster** (4V/16GB, 36M upfront): **€426/mo** for tenant_registry.
|
||||||
|
- **Keycloak HA cluster**: 2 vm-identity (m2.medium) + Postgres replica.
|
||||||
|
- **Priority Support** (€545/mo) becomes worth it.
|
||||||
|
- **Object storage:** ~5 TB = **€100/mo**.
|
||||||
|
- **DDoS Guard PREMIUM** (€2,200/mo) if traffic warrants — likely stays on PLUS.
|
||||||
|
- Estimated total: **€4,500–6,000/mo + VAT**.
|
||||||
|
|
||||||
|
### Compute scaling cheat sheet (vs locked topology)
|
||||||
|
|
||||||
|
| Tier | Customers | Topology delta from Tier A | Compute €/mo (36M upfront) |
|---|---:|---|---:|
| **A** | 5 | locked baseline: stage + 3 prod VMs (48 GiB) | **192** |
| **B** | 200 | + vm-data bumped m2.med → m2.large (+16 GiB) | **256** |
| **C** | 1000 | + split vm-data (+16 GiB), split vm-control (+16 GiB) | **384** |
| **D** | 2000 | + split vm-edge (1 → 3 VMs), HA clusters (~+90 GiB) | **~640** |
|
||||||
|
|
||||||
|
The **€4/GiB-RAM/mo rate** (GP, 36M upfront) is the linear model — everything else (storage, network, support, DBaaS, DDoS) scales sub-linearly with customer count. Compute is never the bottleneck on the bill.
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## 5. Cost per customer
|
||||||
|
|
||||||
|
| Tier | Customers | Monthly infra net (€) | Per customer/month (€) |
|---|---:|---:|---:|
| A | 5 | 310 | **62.00** |
| B | 200 | 1,200 | **6.00** |
| C | 1000 | 2,300 | **2.30** |
| D | 2000 | 5,000 | **2.50** |
|
||||||
|
|
||||||
|
At Tier A the per-customer cost is irrelevant because fixed costs dominate. At Tier B, infrastructure costs about €6 of a €99/customer/month Professional plan (assumed price), a gross margin of **~94%** on infrastructure alone, and the margin widens at Tiers C and D. Add LLM passthrough (LiteLLM) + Polar.sh fees (~5%) + on-call time, and we are still well above the 80% gross margin floor SaaS investors look for.
|
||||||
|
|
||||||
|
**Break-even: ~4 paying customers at €99/mo covers Tier A infra (€310/mo net).**
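The break-even and margin figures above reduce to two small formulas. A sketch, assuming the €99/customer/month plan price used in this section (function names are illustrative):

```python
import math


def break_even_customers(monthly_infra_eur: float, price_per_customer_eur: float) -> int:
    """Smallest whole number of paying customers whose revenue covers monthly infra."""
    return math.ceil(monthly_infra_eur / price_per_customer_eur)


def infra_margin(monthly_infra_eur: float, customers: int, price_per_customer_eur: float) -> float:
    """Gross margin on infrastructure alone, as a fraction of plan revenue."""
    per_customer_cost = monthly_infra_eur / customers
    return 1 - per_customer_cost / price_per_customer_eur


if __name__ == "__main__":
    print("Tier A break-even customers:", break_even_customers(310, 99))
    print(f"Tier B infra-only margin: {infra_margin(1200, 200, 99):.1%}")
```

The margin here excludes LLM passthrough, payment fees, and on-call time, exactly as the infrastructure-only figure in the text does.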
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## 6. SysEleven services we explicitly skip and why
|
||||||
|
|
||||||
|
| Service | Why skip |
|---|---|
| DNSaaS (€10/zone) | We self-host PowerDNS on vm-edge. €0 marginal cost since vm-edge exists anyway. |
| MetaKube Core | Orca already orchestrates our containers. MetaKube would mean abandoning Orca, which we own. |
| MetaKube Accelerator | Same reason: it competes with Orca. |
| MetaKube Operator add-ons (ExternalDNS, Cert-Manager, Tideways, Velero etc. at €78–171/mo each) | We pick and roll our own per [[self-hosted-oss-first]]. |
| Managed VM (Business €128–142/mo per VM, Priority €164–182) | Defeats Orca. We are the operators. Saves €1k+/mo at 7 VMs. |
| Operational Support Platform (€759–1,479/mo) | Massively over-specified for our scale. Buy individual Engineering Support days (€1,264/day) on demand if a real incident requires it. |
| DDoS Guard PREMIUM (€2,200) / ENTERPRISE (€4,800) | PLUS at €875/mo is enough for ≤500-customer scale. Upgrade if we see actual 1+ Tbps attacks. |
| Block Storage for Databases (€0.09 vs €0.10) | The €0.01/GiB difference saves ~€5/mo at our scale. Use it only on DBaaS cluster volumes (where SysEleven enforces it anyway). |
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## 7. Negotiation levers
|
||||||
|
|
||||||
|
SysEleven publishes list prices but is open to commercial negotiation, especially as a German Mittelstand provider courting startups. Things worth asking for:
|
||||||
|
|
||||||
|
1. **Startup credits.** Hetzner, OVH, and most EU clouds run startup-credit programs. Ask SysEleven for the equivalent before signing the 36M commit. Even €5–10k of credits = 6–12 months of Tier A infra free.
|
||||||
|
2. **EXIST / HTGF discount.** If we close the €1.5M raise (`project_breakpilot_fundraising`), SysEleven sometimes offers "Gründerförderung" pricing for HTGF-backed companies.
|
||||||
|
3. **Single-region discount.** We don't need DUS2 + HAM1 geo-redundancy at Tier A. Ask if single-region (DUS2 only) is cheaper.
|
||||||
|
4. **Object storage commitment.** 6-month 50%-off on geo-redundant storage applies anyway, but bulk commitments on regular S3 may unlock further pricing.
|
||||||
|
5. **Bundled support.** If we commit to 36M IaaS + Business Support, ask for support fee waiver in year 1.
|
||||||
|
6. **Move-in incentive.** Negotiate a setup/migration credit covering first 3 months of On-Demand burn.
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## 8. Open questions / things to validate
|
||||||
|
|
||||||
|
- **Port 25 outbound from vm-ops.** Confirm with SysEleven that outbound SMTP is allowed by default; if not, the fallback is to relay through Postal/Postmark for transactional mail only.
|
||||||
|
- **Region choice.** DUS2 vs HAM1 — DUS2 is the only region for L40S GPUs, HAM1 has A30. If we never self-host inference, region is purely a latency choice (DUS2 closer to most EU customers).
|
||||||
|
- **Geo-redundant Ceph backups.** Currently planning local block + S3 backup. Could also use SysEleven's geo-redundant S3 (DUS2 ↔ HAM1) for true DR. Cost: €0.05/GiB/mo vs €0.02 single-region. At 500GB backup that's €15/mo extra — buy it.
|
||||||
|
- **Egress traffic.** Fair Use policy — they reserve the right to bill if we exceed normal patterns. CERTifAI LLM passthrough could be heavy. Ask for clarification on what triggers metered billing.
|
||||||
|
- **VPN-as-a-Service inclusive.** Confirmed in the pricing doc. Use it for ops access — replaces our need to build IP-allowlists into Orca-Proxy for `erp.` and `git.`.
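
The €15/mo figure in the geo-redundant backup item above is the price difference times the backup size. A small sketch of that calculation, working in whole cents to avoid float noise and treating the quoted 500GB as 500 GiB (an assumption; the function name is illustrative):

```python
def geo_redundancy_extra_cost(size_gib: int,
                              geo_cents_per_gib: int = 5,
                              single_cents_per_gib: int = 2) -> float:
    """Extra EUR per month for geo-redundant S3 over single-region S3."""
    extra_cents = size_gib * (geo_cents_per_gib - single_cents_per_gib)
    return extra_cents / 100
```

The same function shows how the premium grows with backup size, which is useful when revisiting the decision at higher tiers.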
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## 9. Recommendation summary
|
||||||
|
|
||||||
|
1. **Sign On-Demand for first 90 days.** Burn ~€1,365/mo while we find the right flavor for each VM.
|
||||||
|
2. **At Day 90, commit 36M upfront on proven baselines.** Cuts monthly to ~€700.
|
||||||
|
3. **Keep all 7 VMs separate.** The €100/mo difference vs. consolidation is not worth losing failure isolation.
|
||||||
|
4. **Skip every Managed Service.** We have Orca.
|
||||||
|
5. **Add Business Support at €5k MRR, DDoS PLUS before any public marketing push.**
|
||||||
|
6. **Negotiate startup credits before signing.** Could be worth months of free infra.
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
*End of document. Pricing snapshot 2026-01-20; re-check before signing commitments.*
|
||||||
@@ -0,0 +1,714 @@
|
|||||||
|
# Implementation Plan — Breakpilot Platform
|
||||||
|
|
||||||
|
Companion to `PLATFORM_ARCHITECTURE.md`, `INFRASTRUCTURE.md`, and `PRODUCT_INTEGRATION_SPEC.md`.
|
||||||
|
|
||||||
|
This is the build plan for an AI coding agent (Claude Code, executing PRs against the listed repos). Each milestone is sized to fit in 1–3 PRs, ships independently, and leaves the system in a working state.
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## 0. How to read this document
|
||||||
|
|
||||||
|
- Milestones are named `M{phase}.{n}` and grouped by phase.
|
||||||
|
- Each milestone has: **Goal**, **Depends on**, **Repos/files**, **Deliverables**, **Acceptance**, **Tests**, **Gate**, **Effort** (S = ≤1 day, M = 2–4 days, L = ≥1 week).
|
||||||
|
- "Gate" is who/what approves the PR for merge. Standard is 1 human reviewer + green CI; some milestones add a manual sign-off.
|
||||||
|
- Phases are ordered; milestones within a phase can be parallelised where dependencies allow.
|
||||||
|
- The dependency graph at §11 is the source of truth — when in doubt, read it.
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## 1. Cross-cutting conventions (apply to every PR in every repo)
|
||||||
|
|
||||||
|
### 1.1 Repo strategy
|
||||||
|
Polyrepo under a new Gitea org `gitea.meghsakha.com/platform/`. One repo per deployable unit. Existing product repos stay where they are.
|
||||||
|
|
||||||
|
**Repos to create:**
|
||||||
|
|
||||||
|
| Repo | Purpose | Created in |
|---|---|---|
| `platform/orca-platform` | IaC for VMs, Orca manifests, DNS, TLS, backups | M1.1 |
| `platform/tenant-registry` | Go service: tenant glue, audit, API keys | M4.1 |
| `platform/portal` | Next.js 15: customer area + backstage | M5.1 |
| `platform/docs` | Architecture, integration spec, this plan, runbooks | M0.1 |
| `platform/seed-data` | Demo tenant fixtures per product | M13.1 |
| `platform/design-tokens` | CSS variables / fonts (consumed by product web comps) | M5.1 |
|
||||||
|
|
||||||
|
**Existing repos that get changes (no new repos):**
|
||||||
|
- `benjamin_boenisch/certifai` — M6.1 / M6.2 / M6.3
|
||||||
|
- `benjamin_boenisch/breakpilot-compliance` — M7.1 / M7.2
|
||||||
|
|
||||||
|
### 1.2 Per-repo scaffolding (must exist before any feature work)
|
||||||
|
Every new repo lands in M0.1 with:
|
||||||
|
|
||||||
|
```
/README.md                    what this repo is, how to run, links to architecture
/CONTRIBUTING.md              branch model, commit format, how to open a PR
/CODEOWNERS                   at least one mandatory reviewer (us)
/.gitea/
  /pull_request_template.md
  /issue_template/
    bug.md
    feature.md
/.gitea/workflows/
  ci.yaml                     fmt → lint → test → build (per-language details in M0.2)
  release.yaml                on tag: build image, push to registry
/CHANGELOG.md                 generated from conventional commits
/LICENSE                      MIT for portal/docs; Apache-2.0 for libraries
```
|
||||||
|
|
||||||
|
### 1.3 Branch + commit conventions
|
||||||
|
- **Trunk-based.** `main` is always deployable. Feature branches: `feat/<short-slug>`, `fix/<short-slug>`, `chore/<short-slug>`. Max lifetime 5 days.
|
||||||
|
- **Conventional Commits** (`feat:`, `fix:`, `chore:`, `docs:`, `refactor:`, `test:`; breaking changes are marked with `!` after the type, e.g. `feat!:`, or with a `BREAKING CHANGE:` footer). Enforced by `commitlint` in CI.
|
||||||
|
- **Squash-merge** to main. PR title becomes the commit message.
|
||||||
|
- **Direct push to main is blocked** by Gitea branch protection.
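
Because the PR title becomes the squash commit message, the title itself has to follow the commit format. The sketch below approximates that check in Python; it is not how `commitlint` is implemented, the optional `(scope)` and the `!` breaking marker follow the Conventional Commits spec rather than anything stated here, and `is_valid_title` is a hypothetical helper name:

```python
import re

# Commit types listed in the conventions above.
ALLOWED_TYPES = ("feat", "fix", "chore", "docs", "refactor", "test")

TITLE_RE = re.compile(
    r"^(?P<type>" + "|".join(ALLOWED_TYPES) + r")"
    r"(?:\((?P<scope>[a-z0-9-]+)\))?"   # optional scope, e.g. (portal)
    r"(?P<breaking>!)?"                  # optional breaking-change marker
    r": (?P<subject>\S.*)$"              # colon, one space, non-empty subject
)


def is_valid_title(title: str) -> bool:
    """True if the PR title looks like a Conventional Commits header."""
    return TITLE_RE.match(title) is not None
```

A check like this can run locally before opening a PR; CI remains the authority.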
|
||||||
|
|
||||||
|
### 1.4 PR template (in every repo)
|
||||||
|
```markdown
## What
<1-3 bullets>

## Why
<link to architecture section, milestone ID, or issue>

## How
<implementation notes for the reviewer>

## Test plan
- [ ] unit
- [ ] integration (if API surface changed)
- [ ] e2e (if user-facing flow changed)
- [ ] manual smoke on stage

## Risk
<what could break, blast radius, rollback plan>

## Linked milestone
M{phase}.{n}
```
|
||||||
|
|
||||||
|
### 1.5 CI checks (required before merge, configured in M0.2)
|
||||||
|
Per-language defaults:

| Stack | Required checks |
|---|---|
| Go | `gofmt -l .` (no output), `go vet`, `golangci-lint`, `go test ./...`, `go build` |
| Rust | `cargo fmt --check`, `cargo clippy -- -D warnings`, `cargo test -j8` |
| TypeScript | `pnpm lint`, `pnpm typecheck`, `pnpm test`, `pnpm build` |
| Python | `ruff check`, `ruff format --check`, `mypy`, `pytest` |
| All | `commitlint`, image build, container scan (`trivy`), SBOM upload |
|
||||||
|
|
||||||
|
### 1.6 Approval gates
|
||||||
|
- **Standard gate** (most milestones): 1 human reviewer approves + all CI checks green. Enforced by Gitea branch protection on `main`.
|
||||||
|
- **CODEOWNERS** auto-requests the right reviewer based on path.
|
||||||
|
- **Production-promotion gate** (release tags only): manual sign-off by `@sharang` on the release issue + stage soak ≥ 24h.
|
||||||
|
- **Security gate** (M2.x, M4.x, M14.x): security checklist in PR body completed.
|
||||||
|
|
||||||
|
### 1.7 Versioning + release strategy
|
||||||
|
- **Semver** per repo. Container images carry **three tags**: `:sha-<short>`, `:v1.4.2`, `:env-stage` / `:env-prod`.
|
||||||
|
- **Stage** auto-deploys on every merge to `main` (Gitea Actions → Orca apply against `stage` cluster).
|
||||||
|
- **Production** deploys only when a release tag `vX.Y.Z` is created. Tag creation requires the production-promotion gate.
|
||||||
|
- **Rollback**: `orca rollout undo <service>` flips back to previous image tag. RTO target ≤ 5 min for any single service.
|
||||||
|
- **Database migrations** are forward-only and run as an init container before the service starts. Migrations that delete columns require two releases (1: stop writing, 2: drop).
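
The tagging scheme above can be expressed as a small function. This is a sketch under assumptions: a 7-character short SHA, and a version tag that is only present on release builds (merges to `main` would carry just the SHA and env tags). `image_tags` is an illustrative name, not an existing Orca or CI helper:

```python
from typing import List, Optional

VALID_ENVS = ("stage", "prod")


def image_tags(commit_sha: str, env: str, version: Optional[str] = None) -> List[str]:
    """Build the tag list for one image: SHA tag, optional semver tag, env tag."""
    if env not in VALID_ENVS:
        raise ValueError(f"unknown env: {env!r}")
    tags = [f"sha-{commit_sha[:7]}"]  # 7-char short SHA is an assumption
    if version is not None:
        tags.append(f"v{version}")
    tags.append(f"env-{env}")
    return tags
```

Keeping the immutable `sha-` tag alongside the moving `env-` tag is what makes a rollback a matter of re-pointing to a previous image.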
|
||||||
|
|
||||||
|
### 1.8 Environments
|
||||||
|
Three Orca clusters, all on the same hardware until volume justifies separation:
|
||||||
|
|
||||||
|
| Env | Cluster name | Purpose | Data | Auto-deploy? |
|---|---|---|---|---|
| dev | local | Developer machine, docker-compose | fixtures | n/a |
| stage | `orca-stage` | Pre-prod validation | seeded demo + synthetic customers | yes (on merge to main) |
| prod | `orca-prod` | Live customer traffic | real | tag + gate |
|
||||||
|
|
||||||
|
Domain pattern:
|
||||||
|
- dev: `*.localhost` (mkcert)
|
||||||
|
- stage: `*.stage.yourplatform.com`
|
||||||
|
- prod: `*.yourplatform.com`
|
||||||
|
|
||||||
|
### 1.9 Observability + audit
|
||||||
|
- **SigNoz** (already running at `signoz.meghsakha.com`) for traces, logs, metrics. Every service ships OTel SDK from day one.
|
||||||
|
- **Audit events** in the Retraced-shape schema (PRODUCT_INTEGRATION_SPEC.md §8.4) emitted to Tenant Registry `/audit` from every service. Required for every state-changing endpoint.
|
||||||
|
- **Structured logs** (JSON) only. No `fmt.Println` / `console.log` in committed code; CI rejects.
|
||||||
|
|
||||||
|
### 1.10 Secrets
|
||||||
|
- Infisical machine identity per service, path `/{env}/{service}/`.
|
||||||
|
- The only secret allowed in an Orca env file is the Keycloak DB URI (bootstrap exception — see INFRASTRUCTURE.md).
|
||||||
|
- CI scans for committed secrets via `gitleaks`. Failures block merge.
|
||||||
|
|
||||||
|
### 1.11 Testing policy (mandatory; see also `feedback_testing_everything`)
|
||||||
|
- **Unit**: every non-trivial function.
|
||||||
|
- **Integration**: every API endpoint, against real Postgres/MongoDB via `testcontainers`. No mock databases.
|
||||||
|
- **E2E**: every user-facing flow has at least one Playwright spec running against stage post-deploy.
|
||||||
|
- **Regression**: when a bug is fixed, a failing test is added FIRST, then the fix.
|
||||||
|
- **No PR ships without tests.** "Manual tested" is not acceptable except for IaC.
|
||||||
|
|
||||||
|
### 1.12 A/B testing (designed for, adopted later)
|
||||||
|
Every place where a future flag would gate behaviour MUST flow through a single `featureFlags.evaluate(tenantId, flagKey)` function. Initial implementation returns hard-coded values from `manifest.yaml`. Swap to Unleash/OpenFeature in M19.1 with zero call-site changes.
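
A minimal sketch of that single entry point, written in Python for brevity (the portal and services would carry their own TypeScript or Go equivalent). The flag keys and the in-memory dictionary standing in for values parsed from `manifest.yaml` are hypothetical:

```python
from typing import Mapping

# Stand-in for flag values loaded from manifest.yaml; keys are hypothetical.
MANIFEST_FLAGS: Mapping[str, bool] = {
    "new-dashboard": False,
    "audit-export": True,
}


def evaluate(tenant_id: str, flag_key: str, default: bool = False) -> bool:
    """Return the flag value for a tenant.

    tenant_id is accepted but unused in this first implementation, so that
    call sites already pass it when a per-tenant flag backend replaces the
    hard-coded values.
    """
    return MANIFEST_FLAGS.get(flag_key, default)
```

Since every caller already goes through `evaluate(tenant_id, flag_key)`, swapping the body for a flag-service client later leaves call sites untouched.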
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## 2. Phase 0 — Foundations (M0.x – M3.x)
|
||||||
|
|
||||||
|
**Goal:** Repos exist, CI works, infra is provisioned and observable, identity + secrets are usable. No customer-visible features yet.
|
||||||
|
|
||||||
|
### M0.1 — Bootstrap repos and docs
|
||||||
|
- **Depends on:** nothing
|
||||||
|
- **Repos:** `platform/docs`, `platform/orca-platform`, `platform/portal`, `platform/tenant-registry`, `platform/design-tokens`, `platform/seed-data`
|
||||||
|
- **Deliverables:** create the Gitea `platform` org; for each repo add the §1.2 scaffolding; `platform/docs` ingests the existing `PLATFORM_ARCHITECTURE.md`, `INFRASTRUCTURE.md`, `PRODUCT_INTEGRATION_SPEC.md`, this plan.
|
||||||
|
- **Acceptance:** every repo has a working `README.md`, `CONTRIBUTING.md`, `CODEOWNERS`, PR template.
|
||||||
|
- **Tests:** n/a
|
||||||
|
- **Gate:** standard
|
||||||
|
- **Effort:** S
|
||||||
|
|
||||||
|
### M0.2 — CI templates + branch protection
|
||||||
|
- **Depends on:** M0.1
|
||||||
|
- **Repos:** all of the above
|
||||||
|
- **Deliverables:** `.gitea/workflows/ci.yaml` per repo (matching §1.5 by stack), Gitea branch protection on `main` (require PR, 1 review, status checks green, no direct push), `commitlint`, `gitleaks`, `trivy` configured.
|
||||||
|
- **Acceptance:** a deliberately-broken PR is rejected by every check; a clean PR is mergeable.
|
||||||
|
- **Tests:** smoke PR per repo demonstrating green CI.
|
||||||
|
- **Gate:** standard
|
||||||
|
- **Effort:** S
|
||||||
|
|
||||||
|
### M0.3 — Self-hosted DNS + wildcard TLS
|
||||||
|
- **Depends on:** M1.2 (vm-edge must exist before PowerDNS lands)
|
||||||
|
- **Repos:** `platform/orca-platform`
|
||||||
|
- **Deliverables:**
  - **PowerDNS Authoritative** on `vm-edge` (Orca-managed). PostgreSQL backend on same VM (small; ~100 records).
  - At the registrar (Benjamin's account): set `ns1.yourplatform.com` and `ns2.yourplatform.com` glue records pointing at vm-edge public IP; delegate the domain to those NS.
  - Zone file committed in `orca-platform/dns/yourplatform.com.zone`; Orca syncs into PowerDNS on apply.
  - Records: apex `yourplatform.com`, wildcards `*.yourplatform.com` + `*.stage.yourplatform.com`, plus `auth.`, `erp.`, `mcp.`, `cdn.`, `mail.`, `ns1.`, `ns2.`, SPF/DKIM/DMARC TXT records (for M3.2).
  - Wildcard TLS via Let's Encrypt **DNS-01 against PowerDNS** (Lego's `--dns=pdns` provider); ACME credentials in Infisical at `/prod/orca-proxy/PDNS_API_KEY`.
  - Orca-Proxy reloads the cert via watch on the secret file; renewal cron runs at 02:00 daily.
|
||||||
|
- **Acceptance:** `dig @1.1.1.1 anything.yourplatform.com` returns an answer; `curl https://anything.yourplatform.com` returns 404 from Orca-Proxy (no TLS error).
|
||||||
|
- **Tests:** ACME renewal dry-run; PowerDNS zone-diff check in CI; reach via stage and prod subdomains; cert expiry page wired to SigNoz alert.
|
||||||
|
- **Gate:** standard + manual DNS-delegation check by both founders (irreversible from registrar side without 24–48h propagation)
|
||||||
|
- **Effort:** M (was S — registrar delegation + PowerDNS adds setup time vs. Cloudflare)
|
||||||
|
|
||||||
|
### M1.1 — `orca-platform` repo (IaC)
|
||||||
|
- **Depends on:** M0.1, M0.2
|
||||||
|
- **Repos:** `platform/orca-platform`
|
||||||
|
- **Deliverables:** directory layout per `INFRASTRUCTURE.md`; one Orca manifest per VM × service; per-env overlays (`overlays/dev`, `overlays/stage`, `overlays/prod`); a `Makefile` with `make plan` / `make apply` per env.
|
||||||
|
- **Acceptance:** `make plan ENV=stage` produces a no-op diff once applied.
|
||||||
|
- **Tests:** `orca validate` runs in CI; PRs that break a manifest fail.
|
||||||
|
- **Gate:** standard
|
||||||
|
- **Effort:** M
|
||||||
|
|
||||||
|
### M1.2 — Provision VMs (locked topology)
|
||||||
|
- **Depends on:** M1.1 (Orca manifest layout)
|
||||||
|
- **Repos:** `platform/orca-platform`
|
||||||
|
- **Deliverables:** the **4 VMs** from `INFRASTRUCTURE.md §1` provisioned on SysEleven (DUS2):
  - **stage** (m2.small, public IP): runs app-plane code only, calls prod KC + Stalwart
  - **vm-edge** (m2.small, public IP): Identity + Infra planes (orca-proxy, PowerDNS, Keycloak, pg-keycloak, Infisical, pg-infisical, Gitea)
  - **vm-control** (m2.medium): Control plane (portal, tenant-registry, ERPNext, Frappe HD, MariaDB, Stalwart)
  - **vm-data** (m2.medium): Data plane (CERTifAI, MongoDB, LiteLLM, compliance ×3, pg-app, Qdrant, MinIO)
  - Private network 10.0.0.0/16 between all four. Public ingress only via vm-edge (and stage's own IP for tester access).
  - SSH disabled; only `orca exec` for shell access.
|
||||||
|
- **Acceptance:** every VM reachable from Orca control plane; private-network connectivity verified; resource limits per service set in manifest per `INFRASTRUCTURE.md §6` co-tenant notes.
|
||||||
|
- **Tests:** cold-start sequence from `INFRASTRUCTURE.md §10 Scenario F` runs successfully on stage VMs.
|
||||||
|
- **Gate:** standard + manual sign-off (touches infra spend and 36M commitment decision)
|
||||||
|
- **Effort:** M
|
||||||
|
- **Cost impact:** see COST_PLAN.md §3. Initial run: ~€552/mo On-Demand, dropping to ~€310/mo after 36M-upfront commit in Month 4.
|
||||||
|
|
||||||
|
### M1.3 — Backups, monitoring, on-call
|
||||||
|
- **Depends on:** M1.2
|
||||||
|
- **Repos:** `platform/orca-platform`
|
||||||
|
- **Deliverables:** backup cron per VM per `INFRASTRUCTURE.md §3` (Postgres pg_dump, MinIO bucket replication); SigNoz OTel collector running on every VM; alert routing to `oncall@yourplatform.com`; restore runbook in `platform/docs/runbooks/restore.md`.
|
||||||
|
- **Acceptance:** restore drill on stage succeeds (script in `platform/orca-platform/scripts/restore-drill.sh`); SigNoz shows traces from a synthetic request.
|
||||||
|
- **Tests:** disaster-recovery exercise per failure scenario in `INFRASTRUCTURE.md §10` — at least Scenarios A, B, F validated on stage.
|
||||||
|
- **Gate:** standard + manual sign-off
|
||||||
|
- **Effort:** L
|
||||||
|
|
||||||
|
### M2.1 — Keycloak deployment
|
||||||
|
- **Depends on:** M1.2, M1.3
|
||||||
|
- **Repos:** `platform/orca-platform`
|
||||||
|
- **Deliverables:** Keycloak 26 on `vm-edge` with its Postgres backing store (`pg-keycloak`) on the same VM, per the M1.2 topology; exposed at `auth.yourplatform.com` and `auth.stage.yourplatform.com`. Realm import file in `orca-platform/keycloak/realm-export.json` (committed, source-of-truth).
|
||||||
|
- **Acceptance:** master admin login works; realm `breakpilot-prod` exists in both envs.
|
||||||
|
- **Tests:** automated realm-state diff in CI (`kcadm` against checked-in export).
|
||||||
|
- **Gate:** standard + security checklist
|
||||||
|
- **Effort:** M
|
||||||
|
|
||||||
|
### M2.2 — Realm configuration: roles + protocol mappers + Organizations
|
||||||
|
- **Depends on:** M2.1
|
||||||
|
- **Repos:** `platform/orca-platform` (realm config)
|
||||||
|
- **Deliverables:** Organizations feature enabled; realm roles `BREAKPILOT_ADMIN`, `SUPPORT_ENGINEER`, `SALES_REP`; org roles `IT_ADMIN`, `CXO`, `FINANCE`, `LEGAL`, `USER`; protocol mapper that calls Tenant Registry at token issuance for `products`, `plan`, `tenant_status` claims; SALES_REP guardrail policy (token only issuable with `org_id = demo`).
|
||||||
|
- **Acceptance:** a test user gets the expected JWT claims; a SALES_REP user cannot get a JWT for a non-demo org (verified by integration test).
|
||||||
|
- **Tests:** Keycloak integration suite in `platform/tenant-registry/test/keycloak_test.go`.
|
||||||
|
- **Gate:** standard + security checklist
|
||||||
|
- **Effort:** M
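
The SALES_REP guardrail from M2.2 reduces to one predicate. The real enforcement would live in Keycloak policy configuration; the Python below only illustrates the rule, and it assumes the guardrail applies to any user holding `SALES_REP`, whatever other roles they have. `may_issue_token` is a hypothetical name:

```python
from typing import Set

DEMO_ORG_ID = "demo"


def may_issue_token(realm_roles: Set[str], org_id: str) -> bool:
    """Allow token issuance unless a SALES_REP asks for a non-demo org."""
    if "SALES_REP" in realm_roles and org_id != DEMO_ORG_ID:
        return False
    return True
```

The M2.2 acceptance test corresponds to the second case: a `SALES_REP` requesting a token for any org other than `demo` must be refused.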
|
||||||
|
|
||||||
|
### M3.1 — Infisical
|
||||||
|
- **Depends on:** M1.2
|
||||||
|
- **Repos:** `platform/orca-platform`
|
||||||
|
- **Deliverables:** Infisical on `vm-edge` (per the M1.2 topology), machine identity per service, secret paths laid out per `PRODUCT_INTEGRATION_SPEC.md §9.4`.
|
||||||
|
- **Acceptance:** a stub service can read its secrets at startup; rotating a secret in Infisical UI is picked up on next pod start.
|
||||||
|
- **Tests:** smoke test container reads secrets.
|
||||||
|
- **Gate:** standard + security checklist
|
||||||
|
- **Effort:** S
|
||||||
|
|
||||||
|
### M3.2 — Stalwart transactional email
|
||||||
|
- **Depends on:** M0.3 (needs DNS records under our control), M3.1
|
||||||
|
- **Repos:** `platform/orca-platform`
|
||||||
|
- **Deliverables:**
  - **Stalwart** on `vm-control` (Orca-managed); reachable at `mail.yourplatform.com`.
  - DNS records added to the zone in M0.3: `mail` A record, MX → mail, SPF (`v=spf1 mx -all`), DKIM (Stalwart-generated public key), DMARC (`p=quarantine; rua=mailto:dmarc@yourplatform.com`), reverse DNS (PTR) configured at the cloud provider for the vm-control public IP; coordinate with vm-edge since outbound mail must egress from a host with a clean PTR.
  - SMTP submission service account per platform sender: `noreply@`, `oncall@`, `support@`, `billing@`, `dmarc@`.
  - Outbound queue and bounce handler; failed deliveries surface as audit events.
  - Webhook receiver at `/inbound/postmaster` for bounce/complaint feedback loops (Gmail FBL, MS SNDS).
  - **IP warming plan**: write a `platform/docs/runbooks/email-warming.md` documenting the 4–8 week ramp from low daily volumes; first 2 weeks of trial nudges (M12.2) explicitly throttled.
|
||||||
|
- **Acceptance:** test email from `noreply@yourplatform.com` to `parnerkarsharang@gmail.com` lands in inbox (not spam) on day 1; SPF/DKIM/DMARC all "pass" in Gmail's "show original" view; mail-tester.com score ≥ 9/10.
|
||||||
|
- **Tests:** automated daily mail-tester check (failure pages on-call); bounce-handling integration test.
|
||||||
|
- **Gate:** standard + security checklist + manual deliverability sign-off (DKIM keys are load-bearing)
|
||||||
|
- **Effort:** L (deliverability tuning is the long tail)
|
||||||
|
|
||||||
|
**Phase 0 exit criteria:**
|
||||||
|
- Stage cluster boots cold from cron-driven nightly stop/start using only `INFRASTRUCTURE.md §5` ordering.
|
||||||
|
- A synthetic HTTPS request to `https://hello.stage.yourplatform.com` reaches a stub container.
|
||||||
|
- Restore drill on stage Postgres succeeds end-to-end.
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## 3. Phase 1 — Control plane core (M4.x – M5.x)
|
||||||
|
|
||||||
|
**Goal:** Tenant Registry stores tenants; the portal authenticates a user and resolves their tenant. No products surfaced yet.
|
||||||
|
|
||||||
|
### M4.1 — Tenant Registry: schema + migrations
|
||||||
|
- **Depends on:** M1.2, M2.2
|
||||||
|
- **Repos:** `platform/tenant-registry`
|
||||||
|
- **Deliverables:** Go service scaffold; `golang-migrate` migrations for `tenants`, `tenant_projects`, `tenant_products`, `tenant_idp_config`, `api_keys`, `audit_log` per `PLATFORM_ARCHITECTURE.md §5c`; the `tenant.status` enum + `tenant.kind` column from the lifecycle spec.
|
||||||
|
- **Acceptance:** `make migrate-up` on a fresh Postgres produces the documented schema.
|
||||||
|
- **Tests:** migration up/down round-trip via `testcontainers-go`.
|
||||||
|
- **Gate:** standard
|
||||||
|
- **Effort:** M
|
||||||
|
|
||||||
|
### M4.2 — Tenant Registry: REST API
|
||||||
|
- **Depends on:** M4.1
|
||||||
|
- **Repos:** `platform/tenant-registry`
|
||||||
|
- **Deliverables:** OpenAPI 3.1 spec at `/openapi.yaml`; endpoints `POST /tenants`, `GET /tenants/:id`, `POST /tenants/:id/activate`, `POST /tenants/:id/cancel`, `GET /catalog`, `POST /catalog/request`, `POST /catalog/trial-request`, `POST /api-keys`, `POST /internal/api-keys/verify`, `POST /audit`, `GET /audit`.
|
||||||
|
- **Acceptance:** every endpoint passes the OpenAPI contract test; returns documented errors for invalid input.
|
||||||
|
- **Tests:** integration tests against real Postgres for every endpoint.
|
||||||
|
- **Gate:** standard
|
||||||
|
- **Effort:** L
|
||||||
|
|
||||||
|
### M4.3 — Tenant Registry: Keycloak adapter
|
||||||
|
- **Depends on:** M4.2, M2.2
|
||||||
|
- **Repos:** `platform/tenant-registry`
|
||||||
|
- **Deliverables:** package `internal/keycloak` that creates orgs, invites IT_ADMIN users, sets realm roles, and serves the protocol-mapper claims endpoint (the URL Keycloak hits during token issuance from M2.2).
|
||||||
|
- **Acceptance:** creating a tenant via `POST /tenants` provisions a Keycloak org and one IT_ADMIN user; user receives invite email.
|
||||||
|
- **Tests:** integration test against the stage Keycloak.
|
||||||
|
- **Gate:** standard
|
||||||
|
- **Effort:** M
|
||||||
|
|
||||||
|
### M5.1 — Portal scaffold: subdomain routing + OIDC login
|
||||||
|
- **Depends on:** M2.2, M4.3, M0.3
|
||||||
|
- **Repos:** `platform/portal`, `platform/design-tokens`
|
||||||
|
- **Deliverables:** Next.js 15 app on `vm-control`; middleware reads `Host` → extracts slug → calls Tenant Registry `GET /tenants?slug=` → injects tenant context; Keycloak OIDC login; logout; `design-tokens` package consumed by portal.
|
||||||
|
- **Acceptance:** visiting `https://acme.stage.yourplatform.com` redirects to Keycloak; after login, user lands on `/acme/dashboard` (empty page) with valid session.
|
||||||
|
- **Tests:** Playwright e2e: login + logout for an existing test tenant.
|
||||||
|
- **Gate:** standard
|
||||||
|
- **Effort:** M
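
The `Host` → slug step in the M5.1 middleware is easy to get subtly wrong (ports, the stage suffix, reserved subdomains). Here is a language-neutral sketch of the extraction logic in Python; the real middleware would be TypeScript inside the Next.js app. The reserved-label set is an assumption derived from the DNS records listed in M0.3, and `tenant_slug_from_host` is an illustrative name:

```python
from typing import Optional

# Longest suffix first so stage hosts are not matched by the prod domain.
BASE_DOMAINS = ("stage.yourplatform.com", "yourplatform.com")

# Assumed reserved labels (platform subdomains from M0.3), never tenant slugs.
RESERVED = {"auth", "erp", "mcp", "cdn", "mail", "ns1", "ns2", "stage"}


def tenant_slug_from_host(host: str) -> Optional[str]:
    """Extract the tenant slug from a Host header, or None if there is none."""
    hostname = host.split(":", 1)[0].lower().rstrip(".")  # drop port, normalise
    for base in BASE_DOMAINS:
        suffix = "." + base
        if hostname.endswith(suffix):
            label = hostname[: -len(suffix)]
            # Exactly one label, and not a platform-owned subdomain.
            if label and "." not in label and label not in RESERVED:
                return label
            return None
    return None
```

A `None` result would map to a 404 or a redirect to the marketing site before any Tenant Registry lookup is made.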
|
||||||
|
|
||||||
|
### M5.2 — Portal: dashboard + backstage shells
|
||||||
|
- **Depends on:** M5.1
|
||||||
|
- **Repos:** `platform/portal`
|
||||||
|
- **Deliverables:** customer dashboard route `/[slug]/dashboard` (renders product tiles from JWT `products` claim — empty initially), backstage routes per `PLATFORM_ARCHITECTURE.md §5a` skeleton, RBAC enforcement (§5a "Operating principles" — hide what user can't access), session refresh.
|
||||||
|
- **Acceptance:** user with `org_roles=[USER]` cannot see settings or billing links; backstage routes return 403 for non-`BREAKPILOT_ADMIN` users.
|
||||||
|
- **Tests:** Playwright spec per role × route matrix.
|
||||||
|
- **Gate:** standard
|
||||||
|
- **Effort:** M
|
||||||
|
|
||||||
|
### M5.3 — Playwright e2e harness
|
||||||
|
- **Depends on:** M5.2
|
||||||
|
- **Repos:** `platform/portal`
|
||||||
|
- **Deliverables:** Playwright config that runs against `stage.yourplatform.com` post-deploy; CI job `e2e-stage` triggered after stage deploy; failure pages on-call.
|
||||||
|
- **Acceptance:** breaking change to login is caught in CI within 10 min of merge.
|
||||||
|
- **Tests:** the suite itself.
|
||||||
|
- **Gate:** standard
|
||||||
|
- **Effort:** S
|
||||||
|
|
||||||
|
**Phase 1 exit criteria:**
|
||||||
|
- A tenant created via `POST /tenants` results in a working login flow at `<slug>.stage.yourplatform.com`.
|
||||||
|
- All Phase 1 routes have a passing Playwright spec running on every stage deploy.
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## 4. Phase 2 — Existing product uplift (M6.x – M7.x, parallel)
|
||||||
|
|
||||||
|
**Goal:** CERTifAI and breakpilot-compliance both honour the JWT contract and surface a real product tile in the portal.
|
||||||
|
|
||||||
|
### M6.1 — CERTifAI: org_id scoping at DB layer
|
||||||
|
- **Depends on:** M2.2
|
||||||
|
- **Repos:** `benjamin_boenisch/certifai`
|
||||||
|
- **Deliverables:** MongoDB middleware that requires `org_id` on every query; backfill script for existing collections; per-tenant collection-level role checks (`IT_ADMIN` → Admin, etc.).
|
||||||
|
- **Acceptance:** integration test attempting a cross-tenant read returns `403`; existing single-tenant flows still work for tenant `default`.
|
||||||
|
- **Tests:** unit + integration; regression tests for every existing controller.
|
||||||
|
- **Gate:** standard + security checklist
|
||||||
|
- **Effort:** L (4–6 weeks per prior gap analysis)
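
The core of the M6.1 middleware is a wrapper that refuses to build a query without an `org_id` and refuses one that names a different tenant. The sketch below shows that idea on plain filter dictionaries in Python; it is not CERTifAI's actual code, it only inspects the top-level `org_id` key, and the exception and function names are hypothetical:

```python
from typing import Any, Dict, Optional


class MissingOrgId(ValueError):
    """Raised when a query is attempted without a tenant context."""


class CrossTenantQuery(PermissionError):
    """Raised when a filter targets a different tenant than the caller's."""


def scoped_filter(org_id: str, query: Optional[Dict[str, Any]] = None) -> Dict[str, Any]:
    """Return a copy of the filter with the caller's org_id enforced."""
    if not org_id:
        raise MissingOrgId("org_id is required on every query")
    scoped = dict(query or {})  # copy so the caller's dict is not mutated
    if "org_id" in scoped and scoped["org_id"] != org_id:
        raise CrossTenantQuery("filter targets another tenant")
    scoped["org_id"] = org_id
    return scoped
```

An HTTP layer would translate `CrossTenantQuery` into the `403` named in the acceptance criterion. Nested operators such as `$or` would need the same check applied recursively, which this sketch does not attempt.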
|
||||||
|
|
||||||
|
### M6.2 — CERTifAI: JWT validation + role mapping
|
||||||
|
- **Depends on:** M6.1
|
||||||
|
- **Repos:** `benjamin_boenisch/certifai`
|
||||||
|
- **Deliverables:** Keycloak JWKS validation middleware; role mapping per `PLATFORM_ARCHITECTURE.md §6`; tenant_status middleware (returns 402 on writes when `frozen`, 410 when `archived`, allows demo with no metering).
|
||||||
|
- **Acceptance:** all four `tenant.status` states behave per spec; tested against a stage Keycloak.
|
||||||
|
- **Tests:** integration tests per status value.
|
||||||
|
- **Gate:** standard + security checklist
|
||||||
|
- **Effort:** M
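
The tenant_status middleware in M6.2 can be reduced to a small decision function. This sketch assumes that `archived` returns 410 for every method while `frozen` blocks writes only; the text leaves the read behaviour of `archived` open, so treat that as an assumption to confirm against the lifecycle spec. `status_gate` is an illustrative name:

```python
from typing import Optional

WRITE_METHODS = {"POST", "PUT", "PATCH", "DELETE"}


def status_gate(tenant_status: str, method: str) -> Optional[int]:
    """Return the HTTP status to short-circuit with, or None to let the request through."""
    if tenant_status == "archived":
        return 410  # assumed to apply to reads and writes alike
    if tenant_status == "frozen" and method.upper() in WRITE_METHODS:
        return 402  # payment required: reads stay available
    return None  # demo and all other statuses pass; metering is handled elsewhere
```

Keeping the decision separate from the framework middleware makes the per-status integration tests cheap to write for both CERTifAI and the compliance product.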
|
||||||
|
|
||||||
|
### M6.3 — CERTifAI: manifest + integration assets
|
||||||
|
- **Depends on:** M6.2
|
||||||
|
- **Repos:** `benjamin_boenisch/certifai`
|
||||||
|
- **Deliverables:** `product.manifest.yaml` per `PRODUCT_INTEGRATION_SPEC.md §10` published to `cdn.yourplatform.com`; OpenAPI 3.1 spec; `/v1/health`, `/v1/usage`, `/v1/tenants/:id/export`, `DELETE /v1/tenants/:id/data`, `POST /v1/tenants/demo/reset`; web component `certifai-dashboard` per §5.A.
|
||||||
|
- **Acceptance:** CERTifAI appears in the portal catalog; subscribed tenants can open it from the dashboard.
|
||||||
|
- **Tests:** contract test that manifest validates against schema; web component renders inside portal shadow-DOM host.
|
||||||
|
- **Gate:** standard
|
||||||
|
- **Effort:** L
|
||||||
|
|
||||||
|
### M7.1 — Compliance: JWT validation upgrade
|
||||||
|
- **Depends on:** M2.2
|
||||||
|
- **Repos:** `benjamin_boenisch/breakpilot-compliance`
|
||||||
|
- **Deliverables:** Next.js proxy validates JWT against Keycloak JWKS (replacing today's `X-Tenant-ID` trust); tenant_status middleware as in M6.2.
|
||||||
|
- **Acceptance:** spoofing `X-Tenant-ID` without a JWT returns 401; valid JWT for tenant A cannot read tenant B data.
|
||||||
|
- **Tests:** integration tests for both auth and status states.
|
||||||
|
- **Gate:** standard + security checklist
|
||||||
|
- **Effort:** L (3–5 weeks per prior gap analysis)
|
||||||
|
|
||||||
|
### M7.2 — Compliance: manifest + integration assets
|
||||||
|
- **Depends on:** M7.1
|
||||||
|
- **Repos:** `benjamin_boenisch/breakpilot-compliance`
|
||||||
|
- **Deliverables:** same endpoint set as M6.3; web component (existing React → `@r2wc/react-to-web-component` per §5.A); manifest with `supports_projects: true` (already implemented).
|
||||||
|
- **Acceptance:** compliance appears in portal catalog; opens from dashboard; project switching works inside the product.
|
||||||
|
- **Tests:** as M6.3.
|
||||||
|
- **Gate:** standard
|
||||||
|
- **Effort:** M
|
||||||
|
|
||||||
|
**Phase 2 exit criteria:**
|
||||||
|
- A real tenant on stage can subscribe to both products and use them through the portal.
|
||||||
|
- Cross-product audit at `/[slug]/audit` shows events from both products in the Retraced schema.
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## 5. Phase 3 — Business operations (M8.x – M9.x)
|
||||||
|
|
||||||
|
**Goal:** ERPNext and Frappe HD run, Sales Order → tenant activate works, tickets escalate to Gitea.
|
||||||
|
|
||||||
|
### M8.1 — ERPNext deployment
|
||||||
|
- **Depends on:** M1.2, M2.1
|
||||||
|
- **Repos:** `platform/orca-platform`
|
||||||
|
- **Deliverables:** Frappe + ERPNext on `vm-control` (separate Postgres database from tenant_registry — see `INFRASTRUCTURE.md` RISK-1); reached at `erp.yourplatform.com`; Keycloak OIDC; IP-restricted at Orca-Proxy.
|
||||||
|
- **Acceptance:** login works for us (both founders); a Customer record can be created manually.
|
||||||
|
- **Tests:** smoke test for OIDC; backup of Frappe filestore validated.
|
||||||
|
- **Gate:** standard + manual sign-off (touches `vm-control` resources)
|
||||||
|
- **Effort:** M
|
||||||
|
|
||||||
|
### M8.2 — ERPNext customization
|
||||||
|
- **Depends on:** M8.1
|
||||||
|
- **Repos:** `platform/orca-platform/erpnext-app/`
|
||||||
|
- **Deliverables:** custom Frappe app with: `tenant_id` field on `Customer`; `sales_owner` field on `Lead`; server scripts for the Sales Order → Tenant Registry webhook; `Cancel` workflow that calls Tenant Registry `/cancel`.
|
||||||
|
- **Acceptance:** submitting a Sales Order in ERPNext triggers a tenant activation in stage Tenant Registry.
|
||||||
|
- **Tests:** server-script unit tests (Frappe test harness); integration test exercises the full webhook.
|
||||||
|
- **Gate:** standard
|
||||||
|
- **Effort:** M
|
||||||
|
|
||||||
|
### M8.3 — Self-serve billing (Polar.sh)
|
||||||
|
- **Depends on:** M8.1, M5.2
|
||||||
|
- **Repos:** `platform/portal`, `platform/tenant-registry`
|
||||||
|
- **Deliverables:**
|
||||||
|
- Polar.sh organization + products configured for Starter / Professional / per-seat tiers.
|
||||||
|
- Polar Checkout embedded in portal `/[slug]/billing/upgrade`.
|
||||||
|
- Webhook listener at `tenant-registry /polar/webhook` (HMAC-verified) handles `subscription.created`, `subscription.updated`, `subscription.canceled`, `order.paid` → flips `tenant.status`, mirrors the customer + invoice into ERPNext via REST.
|
||||||
|
- Polar acts as Merchant of Record: it collects and remits EU VAT (the One Stop Shop scheme, formerly MOSS), so no per-country tax registration is needed on our side.
|
||||||
|
- Portal billing page reads invoices from ERPNext (single source of truth for accounting) but links out to Polar's customer portal for payment-method management.
|
||||||
|
- **Acceptance:** signing up self-serve creates a tenant, a Polar subscription, an ERPNext Customer + Invoice, and a usable login; VAT line item appears correctly on the EU customer's invoice.
|
||||||
|
- **Tests:** integration test against Polar sandbox; webhook replay test; tax calculation correct for at least DE, FR, NL, US.
|
||||||
|
- **Gate:** standard + security checklist
|
||||||
|
- **Effort:** L
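The HMAC check on the webhook listener can be sketched as follows. This is a minimal illustration, assuming a hex-encoded HMAC-SHA256 over the raw request body; Polar's actual signature header, encoding, and timestamp rules are not specified in this plan and must be taken from Polar's documentation. The `STATUS_BY_EVENT` mapping and all function names are hypothetical.

```python
import hashlib
import hmac
from typing import Optional

# Hypothetical mapping from event type to tenant.status.
# The real mapping is a tenant-registry design decision, not defined in this plan.
STATUS_BY_EVENT = {
    "subscription.created": "active",
    "order.paid": "active",
    "subscription.canceled": "frozen",
}


def sign(secret: bytes, body: bytes) -> str:
    """Hex HMAC-SHA256 of the raw request body (assumed scheme)."""
    return hmac.new(secret, body, hashlib.sha256).hexdigest()


def verify(secret: bytes, body: bytes, signature: str) -> bool:
    """Compare in constant time so response timing does not leak the digest."""
    return hmac.compare_digest(sign(secret, body).encode(), signature.encode())


def next_status(event_type: str) -> Optional[str]:
    """Tenant status implied by an event, or None when no status change applies."""
    return STATUS_BY_EVENT.get(event_type)
```

Verification should run on the raw bytes before any JSON parsing, since re-serialising the payload can change the digest.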
|
||||||
|
|
||||||
|
> **Why Polar.sh over Stripe / Lemon Squeezy:** OSS-aligned, Merchant of Record (handles EU VAT automatically under the One Stop Shop scheme, formerly MOSS), developer-first, 4% + Stripe fees vs. Lemon's 5%. Going direct with Stripe would leave EU VAT registration, filing, and remittance to us, which is not viable for a 2-person team. See [[self-hosted-oss-first]].
|
||||||
|
|
||||||
|
### M9.1 — Frappe Helpdesk
|
||||||
|
- **Depends on:** M8.1
|
||||||
|
- **Repos:** `platform/orca-platform`
|
||||||
|
- **Deliverables:** Frappe HD on the same Frappe bench; customer portal embedded at `/[slug]/support/`.
|
||||||
|
- **Acceptance:** a customer user can submit a ticket; we receive it.
|
||||||
|
- **Tests:** Playwright spec for ticket submission.
|
||||||
|
- **Gate:** standard
|
||||||
|
- **Effort:** S
|
||||||
|
|
||||||
|
### M9.2 — HD → Gitea escalation
|
||||||
|
- **Depends on:** M9.1
|
||||||
|
- **Repos:** `platform/orca-platform/erpnext-app/`
|
||||||
|
- **Deliverables:** server script that on a `Ticket: Escalate to Engineering` action creates a Gitea issue in the matching repo via Gitea REST API; reverse webhook from Gitea on issue close marks ticket resolved.
|
||||||
|
- **Acceptance:** the round-trip works for a test ticket on stage.
|
||||||
|
- **Tests:** integration test against stage Gitea.
|
||||||
|
- **Gate:** standard
|
||||||
|
- **Effort:** S
|
||||||
|
|
||||||
|
**Phase 3 exit criteria:**
|
||||||
|
- ERPNext is the source of truth for billing/CRM/HR.
|
||||||
|
- The full Lead → Quote → Sales Order → Tenant chain works on stage.
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## 6. Phase 4 — Customer UX & lifecycle (M10.x – M14.x)
|
||||||
|
|
||||||
|
**Goal:** Every customer-facing flow from `PLATFORM_ARCHITECTURE.md` works end-to-end on stage.
|
||||||
|
|
||||||
|
### M10.1 — Customer area: full surfaces
|
||||||
|
- **Depends on:** M5.2, M6.3, M7.2
|
||||||
|
- **Repos:** `platform/portal`
|
||||||
|
- **Deliverables:** real implementations of `/[slug]/dashboard`, `/[slug]/products/*`, `/[slug]/projects`, `/[slug]/settings/{identity,users,api-keys,integrations}`, `/[slug]/billing`, `/[slug]/audit`, `/[slug]/support`.
|
||||||
|
- **Acceptance:** every route is implemented, RBAC-gated, with empty/loading/error states.
|
||||||
|
- **Tests:** one Playwright spec per route × primary role.
|
||||||
|
- **Gate:** standard
|
||||||
|
- **Effort:** L
|
||||||
|
|
||||||
|
### M10.2 — Cross-product audit view
|
||||||
|
- **Depends on:** M10.1, M4.2
|
||||||
|
- **Repos:** `platform/portal`
|
||||||
|
- **Deliverables:** audit page filters by product/actor/action/time; CSV + PDF export; events rendered from the Retraced-shape schema.
|
||||||
|
- **Acceptance:** a DPO-style query ("show me everything user X did across all products last month") returns in <2s for a tenant with 100k events.
|
||||||
|
- **Tests:** load test with synthetic events.
|
||||||
|
- **Gate:** standard + security checklist
|
||||||
|
- **Effort:** M
|
||||||
|
|
||||||
|
### M11.1 — Catalog flow (P13)
|
||||||
|
- **Depends on:** M4.2, M10.1
|
||||||
|
- **Repos:** `platform/portal`, `platform/tenant-registry`
|
||||||
|
- **Deliverables:** `/[slug]/catalog` UI per `PLATFORM_ARCHITECTURE.md` P13; "Request" button creates ERPNext CRM Lead.
|
||||||
|
- **Acceptance:** customer requests a non-subscribed product; sales sees a Lead in ERPNext with the right `sales_owner`.
|
||||||
|
- **Tests:** Playwright e2e covering the full P13 sequence.
|
||||||
|
- **Gate:** standard
|
||||||
|
- **Effort:** M
|
||||||
|
|
||||||
|
### M12.1 — Self-serve trial (P15)
|
||||||
|
- **Depends on:** M8.3, M11.1
|
||||||
|
- **Repos:** `platform/portal`, `platform/tenant-registry`
|
||||||
|
- **Deliverables:** public `/start` form; trial tenant provisioning (status=trial, trial_ends_at); banner; trial_quota enforcement (read by products from JWT).
|
||||||
|
- **Acceptance:** prospect signs up; trial tenant with 14-day timer exists; quota enforced.
|
||||||
|
- **Tests:** Playwright e2e signs up → uses → hits quota.
|
||||||
|
- **Gate:** standard
|
||||||
|
- **Effort:** M
|
||||||
|
|
||||||
|
### M12.2 — Trial lifecycle cron + emails
|
||||||
|
- **Depends on:** M12.1, M3.2 (Stalwart must be deliverability-clean)
|
||||||
|
- **Repos:** `platform/tenant-registry`
|
||||||
|
- **Deliverables:** scheduler in tenant-registry that runs day-7/12/14 emails; status transitions trial → active (on payment) or trial → frozen → archived; SMTP via Stalwart at `mail.yourplatform.com:587`; sender `noreply@yourplatform.com`; HTML + plaintext templates committed under `tenant-registry/templates/email/`; List-Unsubscribe headers per RFC 8058.
|
||||||
|
- **Acceptance:** in a time-warped stage test (script that advances `trial_ends_at`), all transitions fire in order and all three emails land in a Gmail inbox.
|
||||||
|
- **Tests:** integration test with time injection; deliverability spot-check at each release.
|
||||||
|
- **Gate:** standard
|
||||||
|
- **Effort:** M
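The scheduler logic above lends itself to pure functions that take the current time as an argument, which is what makes the time-injection test possible. The sketch below is written in Python for brevity (tenant-registry itself is Go). It assumes the 14-day trial from M12.1 and a 30-day frozen period before archiving; that 30-day value is an assumption borrowed from the reactivation window in M14.1, and all names are hypothetical.

```python
from datetime import datetime, timedelta

TRIAL_DAYS = 14            # trial length from M12.1
EMAIL_DAYS = (7, 12, 14)   # day-7/12/14 emails from this milestone
ARCHIVE_AFTER_DAYS = 30    # assumption: mirrors the 30-day window in M14.1


def due_emails(trial_ends_at: datetime, now: datetime, already_sent: set) -> list:
    """Email days that are due at `now` and have not been sent yet, in order."""
    started = trial_ends_at - timedelta(days=TRIAL_DAYS)
    elapsed = (now - started).days
    return [d for d in EMAIL_DAYS if d <= elapsed and d not in already_sent]


def next_status(status: str, trial_ends_at: datetime, now: datetime, paid: bool) -> str:
    """Pure transition function; `now` is injected so tests can warp time."""
    if status == "trial":
        if paid:
            return "active"
        if now >= trial_ends_at:
            return "frozen"
    elif status == "frozen":
        if now >= trial_ends_at + timedelta(days=ARCHIVE_AFTER_DAYS):
            return "archived"
    return status
```

Tracking `already_sent` per tenant keeps the job idempotent: a scheduler run that is retried or delayed sends each email at most once.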
|
||||||
|
|
||||||
|
### M13.1 — Demo tenant seeding
|
||||||
|
- **Depends on:** M6.3, M7.2
|
||||||
|
- **Repos:** `platform/seed-data`
|
||||||
|
- **Deliverables:** per-product fixture archives (`certifai/seed-v1.tar.gz`, `compliance/seed-v1.tar.gz`); publishing pipeline to `cdn.yourplatform.com`; `catalog.demo.seed_data_url` populated in product manifests.
|
||||||
|
- **Acceptance:** calling `POST /v1/tenants/demo/reset` on either product restores fixtures.
|
||||||
|
- **Tests:** integration test asserts fixture state after reset.
|
||||||
|
- **Gate:** standard
|
||||||
|
- **Effort:** M
|
||||||
|
|
||||||
|
### M13.2 — Sales demo flow (P14)
|
||||||
|
- **Depends on:** M2.2, M13.1
|
||||||
|
- **Repos:** `platform/portal`, `platform/tenant-registry`
|
||||||
|
- **Deliverables:** demo tenant created in stage and prod with `kind=demo, status=demo`; SALES_REP role usable; backstage routes restricted to `/backstage/leads` and `/backstage/demo`; demo tenant audit events tagged `{"demo": true}` and hidden from real-tenant audit views.
|
||||||
|
- **Acceptance:** sales rep logs in at `demo.yourplatform.com`, walks both products live, [Request Trial] modal creates a CRM Lead with `sales_owner = the rep`.
|
||||||
|
- **Tests:** Playwright e2e for the sales walk-through.
|
||||||
|
- **Gate:** standard + security checklist (SALES_REP guardrail enforcement is the load-bearing piece)
|
||||||
|
- **Effort:** M
|
||||||
|
|
||||||
|
### M13.3 — Nightly demo reset
|
||||||
|
- **Depends on:** M13.2
|
||||||
|
- **Repos:** `platform/tenant-registry`
|
||||||
|
- **Deliverables:** cron at 03:00 Europe/Berlin calls each product's reset endpoint; failures page on-call.
|
||||||
|
- **Acceptance:** after a deliberately-corrupted demo state, the next 03:00 reset restores fixtures.
|
||||||
|
- **Tests:** test runs the reset manually + verifies fixture state.
|
||||||
|
- **Gate:** standard
|
||||||
|
- **Effort:** S
|
||||||
|
|
||||||
|
### M14.1 — Cancel + frozen state (P16 part 1)
|
||||||
|
- **Depends on:** M10.1, M6.2, M7.1
|
||||||
|
- **Repos:** `platform/portal`, `platform/tenant-registry`
|
||||||
|
- **Deliverables:** cancel modal with reason + typed-confirm; status active → frozen transition; Polar subscription set to `cancel_at_period_end` (billing runs through Polar.sh per M8.3); ERPNext Opportunity → Lost; reactivation path within 30 days.
|
||||||
|
- **Acceptance:** test customer cancels; portal switches to read-only; reactivate restores `active` status without data loss.
|
||||||
|
- **Tests:** Playwright e2e covering cancel + reactivate.
|
||||||
|
- **Gate:** standard + security checklist
|
||||||
|
- **Effort:** M
|
||||||
|
|
||||||
|
### M14.2 — Offboarding cron + final export (P16 part 2)
|
||||||
|
- **Depends on:** M14.1, M6.3, M7.2
|
||||||
|
- **Repos:** `platform/tenant-registry`
|
||||||
|
- **Deliverables:** day-30 cron builds final export ZIP per product, emails signed URL (7-day TTL), calls `DELETE /v1/tenants/:id/data` on every subscribed product, archives Keycloak org, marks `tenant.status = archived`.
|
||||||
|
- **Acceptance:** time-warped test runs the full P16 sequence end-to-end on stage; export ZIP contains data from both products; second post-archive request to either product returns 410.
|
||||||
|
- **Tests:** integration test with time injection; GDPR-compliance regression suite added.
|
||||||
|
- **Gate:** standard + security checklist + manual sign-off (irreversible operation)
|
||||||
|
- **Effort:** L
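The 7-day signed export URL can be illustrated with a small HMAC scheme. This is a sketch only: in practice the object store's own presigned URLs would likely be used instead, and the query parameter names (`expires`, `sig`), the signing input, and the example host are all assumptions.

```python
import hashlib
import hmac
from urllib.parse import parse_qs, urlencode, urlparse

TTL_SECONDS = 7 * 24 * 3600  # 7-day TTL from the deliverable


def _digest(secret: bytes, path: str, expires: int) -> str:
    """HMAC-SHA256 over the URL path and the expiry timestamp."""
    return hmac.new(secret, f"{path}:{expires}".encode(), hashlib.sha256).hexdigest()


def sign_url(base_url: str, secret: bytes, now: float) -> str:
    """Append an expiry and a signature to a URL that has no query string."""
    expires = int(now) + TTL_SECONDS
    sig = _digest(secret, urlparse(base_url).path, expires)
    return f"{base_url}?{urlencode({'expires': expires, 'sig': sig})}"


def is_valid(url: str, secret: bytes, now: float) -> bool:
    """Accept only unexpired URLs whose signature matches path and expiry."""
    parsed = urlparse(url)
    query = parse_qs(parsed.query)
    try:
        expires = int(query["expires"][0])
        sig = query["sig"][0]
    except (KeyError, ValueError):
        return False
    expected = _digest(secret, parsed.path, expires)
    return now < expires and hmac.compare_digest(expected.encode(), sig.encode())
```

Because the expiry is part of the signed input, a recipient cannot extend the link by editing the `expires` parameter.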
|
||||||
|
|
||||||
|
**Phase 4 exit criteria:**
|
||||||
|
- Every flow P1–P16 from `PLATFORM_ARCHITECTURE.md` has a passing Playwright spec.
|
||||||
|
- Stage runs a full lifecycle: sign-up trial → convert → use → cancel → offboard, in an automated nightly job.
|
||||||
|
- We can hand a prospect a real demo using `demo.yourplatform.com`.
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## 7. Phase 5 — Headless products (M15.x – M17.x)
|
||||||
|
|
||||||
|
**Goal:** Make the platform host products with no UI of their own.
|
||||||
|
|
||||||
|
### M15.1 — API key infrastructure
|
||||||
|
- **Depends on:** M4.2, M10.1
|
||||||
|
- **Repos:** `platform/tenant-registry`, `platform/portal`
|
||||||
|
- **Deliverables:** API key CRUD per `PRODUCT_INTEGRATION_SPEC.md §6.2`; portal UI at `/[slug]/settings/api-keys`; `POST /internal/api-keys/verify` for products.
|
||||||
|
- **Acceptance:** create key in portal; product call with key succeeds; revoke kills access within 60s.
|
||||||
|
- **Tests:** integration tests for verify endpoint; Playwright for portal UI.
|
||||||
|
- **Gate:** standard + security checklist (rotation + scope enforcement)
|
||||||
|
- **Effort:** M
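One way to meet the "revoke kills access within 60s" criterion is to let products cache verification results for at most 60 seconds. The sketch below assumes that approach; the `KeyVerifier` class is hypothetical, and `verify_remote` stands in for a call to `POST /internal/api-keys/verify`.

```python
import time
from typing import Callable, Dict, Tuple


class KeyVerifier:
    """Caches verify results for at most `ttl` seconds (hypothetical helper)."""

    def __init__(
        self,
        verify_remote: Callable[[str], bool],
        ttl: float = 60.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._verify_remote = verify_remote  # stand-in for the verify endpoint
        self._ttl = ttl
        self._clock = clock                  # injectable so tests can control time
        self._cache: Dict[str, Tuple[bool, float]] = {}

    def is_valid(self, key: str) -> bool:
        now = self._clock()
        hit = self._cache.get(key)
        if hit is not None and now - hit[1] < self._ttl:
            return hit[0]                    # fresh cache entry, skip the remote call
        valid = self._verify_remote(key)
        self._cache[key] = (valid, now)
        return valid
```

With a 60-second TTL the worst-case revocation delay equals the TTL, so a shorter value buys a safety margin at the cost of more verify calls. A real implementation would likely key the cache on a hash of the API key rather than the key itself.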
|
||||||
|
|
||||||
|
### M15.2 — Webhook delivery
|
||||||
|
- **Depends on:** M15.1
|
||||||
|
- **Repos:** `platform/tenant-registry`, `platform/portal`
|
||||||
|
- **Deliverables:** webhook config + delivery service per `PLATFORM_ARCHITECTURE.md` H4; portal page `/[slug]/integrations`; signed payloads; 3-attempt retry with backoff; dead-letter visible at `/webhooks/deliveries`.
|
||||||
|
- **Acceptance:** test webhook to https://requestbin.com works; failed deliveries appear in dead letter.
|
||||||
|
- **Tests:** integration tests with a local sink.
|
||||||
|
- **Gate:** standard
|
||||||
|
- **Effort:** M
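The 3-attempt retry with backoff and dead-lettering can be sketched as below. The exponential schedule (1s, 2s) and the function signature are assumptions, since the plan only says "with backoff"; payload signing is out of scope for this sketch.

```python
import time
from typing import Callable, List

MAX_ATTEMPTS = 3  # from the deliverable: 3-attempt retry


def deliver(
    send: Callable[[], bool],
    dead_letter: List[str],
    delivery_id: str,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Try `send` up to MAX_ATTEMPTS times, doubling the delay between attempts."""
    for attempt in range(MAX_ATTEMPTS):
        try:
            if send():
                return True
        except Exception:
            pass  # a transport error counts as a failed attempt
        if attempt < MAX_ATTEMPTS - 1:
            sleep(base_delay * (2 ** attempt))
    dead_letter.append(delivery_id)  # surfaced later at /webhooks/deliveries
    return False
```

Injecting `sleep` keeps the integration test with a local sink fast, because the test can record the delays instead of waiting for them.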
|
||||||
|
|
||||||
|
### M16.1 — First headless product reference implementation
|
||||||
|
- **Depends on:** M15.2
|
||||||
|
- **Repos:** TBD (proof-of-concept can live in `platform/docs/examples/headless-template/`)
|
||||||
|
- **Deliverables:** a minimal headless product (e.g., echo-bot) that implements the full §5.C contract: manifest, API, audit emit, usage emit, demo reset, GDPR endpoints.
|
||||||
|
- **Acceptance:** echo-bot is bookable from catalog, works end-to-end, passes the same lifecycle test as Phase 4.
|
||||||
|
- **Tests:** the lifecycle e2e from M14.2 extended to include echo-bot.
|
||||||
|
- **Gate:** standard
|
||||||
|
- **Effort:** M
|
||||||
|
|
||||||
|
### M17.1 — MCP servers (Enterprise)
|
||||||
|
- **Depends on:** M6.3, M7.2
|
||||||
|
- **Repos:** `benjamin_boenisch/certifai`, `benjamin_boenisch/breakpilot-compliance`
|
||||||
|
- **Deliverables:** MCP endpoints per `PRODUCT_INTEGRATION_SPEC.md §10` `mcp:` block; gated on `plan == enterprise`; routed via `mcp.yourplatform.com`.
|
||||||
|
- **Acceptance:** Claude Code can connect to `mcp.yourplatform.com/certifai` with a service token and call `list_ai_agents`.
|
||||||
|
- **Tests:** MCP contract test using `mcp-cli`.
|
||||||
|
- **Gate:** standard + security checklist
|
||||||
|
- **Effort:** L
|
||||||
|
|
||||||
|
**Phase 5 exit criteria:**
|
||||||
|
- A third party (or we ourselves) can add a new headless product by following `PRODUCT_INTEGRATION_SPEC.md` and the reference template, with no portal code changes required.
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## 8. Phase 6 — Enterprise + scale (M18.x – M19.x)
|
||||||
|
|
||||||
|
These ship only when a paying customer requires them.
|
||||||
|
|
||||||
|
### M18.1 — Custom domains
|
||||||
|
- **Depends on:** M0.3, M10.1
|
||||||
|
- **Repos:** `platform/orca-platform`, `platform/portal`
|
||||||
|
- **Deliverables:** ACME on-demand TLS in Orca-Proxy; portal UI for customer to add domain; CNAME verification.
|
||||||
|
- **Acceptance:** `compliance.acme.com` resolves and renders the Acme portal.
|
||||||
|
- **Tests:** integration test with a synthetic domain.
|
||||||
|
- **Gate:** standard
|
||||||
|
- **Effort:** M
|
||||||
|
|
||||||
|
### M18.2 — Physical data isolation
|
||||||
|
- **Depends on:** M4.1, M6.1, M7.1
|
||||||
|
- **Repos:** all data-plane products + `tenant-registry`
|
||||||
|
- **Deliverables:** option per tenant for a dedicated Postgres / Mongo schema or database; provisioning automation; migration path from logical → physical.
|
||||||
|
- **Acceptance:** an enterprise tenant runs on a dedicated schema; cross-tenant queries are physically impossible.
|
||||||
|
- **Tests:** isolation enforcement test.
|
||||||
|
- **Gate:** standard + security review + manual sign-off
|
||||||
|
- **Effort:** L
|
||||||
|
|
||||||
|
### M19.1 — A/B testing infra
|
||||||
|
- **Depends on:** anywhere `featureFlags.evaluate()` is called
|
||||||
|
- **Repos:** new `platform/feature-flags` (Unleash on `vm-control` or hosted) + portal SDK shim
|
||||||
|
- **Deliverables:** swap the hard-coded `evaluate()` from §1.12 to call Unleash; eval results land in audit events for reproducibility.
|
||||||
|
- **Acceptance:** flipping a flag in Unleash changes behaviour for the targeted tenant set within 30s; no behaviour change for other tenants.
|
||||||
|
- **Tests:** integration test asserts flag-driven branches.
|
||||||
|
- **Gate:** standard
|
||||||
|
- **Effort:** M
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## 9. Cross-cutting work (every phase, ongoing)
|
||||||
|
|
||||||
|
These are not milestones — they are commitments enforced by CI and process.
|
||||||
|
|
||||||
|
- **Regression suite expansion.** Every bug fix lands with a regression test FIRST. Tracked by the `tests-added` label on PRs; fix-without-test PRs are rejected by the reviewer.
|
||||||
|
- **Security review per phase.** End of each phase: dependency audit (`cargo audit`, `npm audit`, `pip-audit`), SAST scan (`semgrep`), threat model update in `platform/docs/security/`.
|
||||||
|
- **Disaster-recovery drills.** Once per phase on stage: pick one scenario from `INFRASTRUCTURE.md §10`, run it, document time-to-recover in the runbook.
|
||||||
|
- **Doc currency.** PR template requires the author to tick "docs updated" or "n/a" — CI fails on a missing tick.
|
||||||
|
- **OSS swap-in readiness.** When adding metering / audit / SCIM / flag eval code, use the schema/interface noted in `PRODUCT_INTEGRATION_SPEC.md §15` so swap-in stays cheap.
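The doc-currency check from the list above could be implemented as a small script that CI runs against the PR body. This is a sketch under assumptions: the checkbox labels `docs updated` and `n/a` are taken from the bullet text, but the exact wording in the §1.4 template may differ, and the function name is hypothetical.

```python
import re

# Matches a ticked Markdown checkbox followed by one of the two accepted labels.
TICKED = re.compile(
    r"^\s*[-*]\s*\[[xX]\]\s*(docs updated|n/a)\b",
    re.IGNORECASE | re.MULTILINE,
)


def docs_box_ticked(pr_body: str) -> bool:
    """True when the PR body contains a ticked 'docs updated' or 'n/a' box."""
    return TICKED.search(pr_body) is not None
```

CI would fail the job when `docs_box_ticked` returns `False`, which covers both an unticked box and a template section that was deleted from the PR body.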
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## 10. First-PR checklist for Claude Code
|
||||||
|
|
||||||
|
When starting work, the first sequence of PRs should be:
|
||||||
|
|
||||||
|
1. **PR-1** (M0.1): Create `platform/docs` with copied architecture docs + this plan. Land in 1 day.
|
||||||
|
2. **PR-2 to PR-7** (M0.1 continued): Bootstrap each of the other five repos with §1.2 scaffolding. Land in parallel.
|
||||||
|
3. **PR-8** (M0.2): CI templates + branch protection per repo.
|
||||||
|
4. **PR-9** (M1.1): `orca-platform` directory layout + first stub manifest.
|
||||||
|
5. **PR-10** (M1.2): VM provisioning (vm-edge, vm-identity, vm-secrets, vm-control first — DNS and Keycloak depend on these).
|
||||||
|
6. **PR-11** (M0.3): PowerDNS on vm-edge + zone file + registrar NS delegation + wildcard TLS via Let's Encrypt DNS-01.
|
||||||
|
|
||||||
|
After PR-11, the dependency graph fans out and parallel work begins.
|
||||||
|
|
||||||
|
For each PR, Claude Code MUST:
|
||||||
|
- Open the PR with the §1.4 template filled in.
|
||||||
|
- Link the milestone ID in PR body (`Linked milestone: M0.1`).
|
||||||
|
- Wait for human approval (no self-merge — branch protection enforces).
|
||||||
|
- After merge: verify the stage deploy succeeds before starting the next dependent PR.
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## 11. Dependency graph
|
||||||
|
|
||||||
|
```
|
||||||
|
┌── M6.1 ── M6.2 ── M6.3 ──┐
|
||||||
|
│ │
|
||||||
|
┌── M2.1 ── M2.2 ────────┤ ├── M10.1 ── M10.2
|
||||||
|
│ │ │ │
|
||||||
|
M0.1 ── M0.2 ── M1.1 ──┼── M1.2 ── M0.3 ── M1.3 │ │ ├── M11.1 ── M12.1 ── M12.2
|
||||||
|
│ │ │ │ │ │
|
||||||
|
│ └── M3.1 ── M3.2 │ ├── M13.2 ── M13.3 │
|
||||||
|
│ │ │ │ │
|
||||||
|
└─────────────────────── M4.1 ── M4.2 ── M4.3 ── M5.1 ── M5.2 ── M5.3 M13.1 │
|
||||||
|
│
|
||||||
|
M8.1 ── M8.2 ── M8.3 ── M9.1 ── M9.2 ──────────────────────┤
|
||||||
|
│
|
||||||
|
M15.1 ── M15.2 ── M16.1 ── M17.1 │
|
||||||
|
│
|
||||||
|
M14.1 ── M14.2
|
||||||
|
|
||||||
|
Phase-6 (M18, M19) depends on Phase-4 completion + a paying customer.
|
||||||
|
M12.2 depends on M3.2 (Stalwart deliverability must be clean before trial emails go out).
|
||||||
|
```
|
||||||
|
|
||||||
|
**Critical path** (longest chain to first paying customer):
|
||||||
|
`M0.1 → M0.2 → M1.1 → M1.2 → M0.3 → M1.3 → M2.1 → M2.2 → M4.1 → M4.2 → M4.3 → M5.1 → M5.2 → M6.2 → M6.3 → M10.1 → M11.1 → M12.1`
|
||||||
|
|
||||||
|
That's 18 milestones. With one full-time agent and standard human review pacing, plan for **9–13 weeks** to first paying customer flow on stage (added 1 week for the PowerDNS / DNS-delegation cycle vs. the prior Cloudflare path); **+2–4 weeks** for prod hardening and the Phase-4 lifecycle completion.
|
||||||
|
|
||||||
|
> **Note on M3.2 critical path:** Stalwart IP warming (4–8 weeks) runs in *background parallel* — start it immediately after M3.1 so warming finishes before M12.2 needs it. It is NOT on the critical path for first paying customer (that customer can be onboarded by hand), but it IS on the critical path for self-serve trial volume.
|
||||||
|
|
||||||
|
**Parallelism opportunities:**
|
||||||
|
- M6.x and M7.x can run fully in parallel (different repos, different stacks).
|
||||||
|
- M8.x is independent of all data-plane work once M2.2 is done.
|
||||||
|
- M15.x can begin as soon as M10.1 lands.
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## 12. Open questions to resolve before starting
|
||||||
|
|
||||||
|
**Resolved:**
|
||||||
|
- ~~Email provider~~ → **Stalwart**, self-hosted on vm-control. Plan in M3.2; 4–8 week IP warming acknowledged.
|
||||||
|
- ~~Stripe vs Lemon Squeezy~~ → **Polar.sh**. Plan in M8.3.
|
||||||
|
- ~~Cloudflare account ownership~~ → not used; DNS is self-hosted via PowerDNS on vm-edge (M0.3). Registrar account (Benjamin's) still needs documented 2FA recovery — see new DR item below.
|
||||||
|
|
||||||
|
**Still open:**
|
||||||
|
- **CDN host** for `cdn.yourplatform.com`: self-hosted MinIO + Caddy on vm-edge is the OSS-aligned default; alternative is BunnyCDN (cheap, EU). Decide before M6.3 (manifest bundles + hero images).
|
||||||
|
- **Cloud provider for port 25 outbound.** Stalwart needs unblocked outbound port 25 to send mail. Hetzner blocks it by default and requires an unblock request with proof of intent and an abuse contact; OVH and Scaleway unblock on request faster. Confirm with Benjamin which provider vm-control runs on. M3.2 is blocked if port 25 cannot be unblocked; the fallback is sending via a different provider's IP with reverse DNS.
|
||||||
|
- **Test data privacy.** The demo tenant must contain ONLY synthetic data — confirm seed pipeline strips real PII even if our test orgs accidentally seed from prod.
|
||||||
|
- **Registrar + DNS bus-factor.** Document who owns the registrar account, who has 2FA recovery codes, and the procedure to update NS records without that person available. Goes in `platform/docs/runbooks/dr.md` before M0.3 ships.
|
||||||
|
- **Internal CA.** `step-ca` is listed under vm-edge in `INFRASTRUCTURE.md` as "optional"; decide whether inter-service mTLS is in scope for Phase 0 or deferred until Phase 6 (Enterprise + scale).
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
*End of document. Open items in §12 should be triaged before M0.1 starts; the bus-factor and port-25 items are the only hard blockers.*
|
||||||
@@ -0,0 +1,774 @@
|
|||||||
|
# Infrastructure Specification
|
||||||
|
**Status:** Locked Topology
|
||||||
|
**Authors:** Sharang, Benjamin
|
||||||
|
**Date:** 2026-05-11 (topology lock: 2026-05-18)
|
||||||
|
**Companion docs:** PLATFORM_ARCHITECTURE.md, IMPLEMENTATION_PLAN.md, COST_PLAN.md
|
||||||
|
**Cloud provider:** SysEleven Cloud Services (DUS2, OpenStack)
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## 1. VM Inventory
|
||||||
|
|
||||||
|
**Four billable VMs total.** Three in production (one per plane after collapsing Identity+Infra), one in stage. Dev runs entirely on developer laptops via docker-compose.
|
||||||
|
|
||||||
|
```
|
||||||
|
┌──────────────┬─────────────────┬────────────────────────┬───────────┬─────────────────┐
|
||||||
|
│ Name │ Env │ SysEleven flavor │ Public IP │ Planes owned │
|
||||||
|
├──────────────┼─────────────────┼────────────────────────┼───────────┼─────────────────┤
|
||||||
|
│ vm-edge │ prod │ m2.small (2v / 8 GB) │ YES (1) │ Identity + Infra│
|
||||||
|
│ vm-control │ prod │ m2.medium (4v / 16 GB) │ No │ Control │
|
||||||
|
│ vm-data │ prod │ m2.medium (4v / 16 GB) │ No │ Data │
|
||||||
|
│ stage │ stage │ m2.small (2v / 8 GB) │ YES (1) │ App plane only │
|
||||||
|
│ (dev) │ dev │ local docker-compose │ n/a │ all (in-memory) │
|
||||||
|
└──────────────┴─────────────────┴────────────────────────┴───────────┴─────────────────┘
|
||||||
|
```
|
||||||
|
|
||||||
|
**Total compute:** 48 GiB-RAM, 12 vCPU. **Monthly compute net: €192 (36M upfront) / €295 (12M) / €435 (On-Demand).** See COST_PLAN.md for the full three-mode table.
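The compute totals can be re-derived from the inventory table. The snippet below simply sums the four billable rows; the `VMS` list is transcribed from the table and the variable names are illustrative.

```python
# (name, vCPU, RAM in GiB) for the four billable VMs in the inventory table.
VMS = [
    ("vm-edge", 2, 8),
    ("vm-control", 4, 16),
    ("vm-data", 4, 16),
    ("stage", 2, 8),
]

total_vcpu = sum(vcpu for _, vcpu, _ in VMS)
total_ram_gib = sum(ram for _, _, ram in VMS)

print(f"{len(VMS)} VMs, {total_vcpu} vCPU, {total_ram_gib} GiB RAM")
```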
|
||||||
|
|
||||||
|
### Why this topology and not the previous 7-VM layout
|
||||||
|
|
||||||
|
The earlier draft proposed one VM per service group (vm-gateway, vm-identity, vm-secrets, vm-ops, vm-control, vm-certifai, vm-compliance). That gave maximum failure isolation but cost 132 GiB-RAM stage+prod. At 5 customers the isolation is unused — every VM ran at <10% utilisation. The locked topology buys back failure isolation incrementally as load grows (see §13 Growth Trajectory).
|
||||||
|
|
||||||
|
Critical isolations preserved even at 4 VMs:
|
||||||
|
- **vm-edge isolates identity from app workloads.** Keycloak JVM has its own page cache; ERPNext background jobs cannot starve token issuance.
|
||||||
|
- **vm-data isolates databases from stateless services.** All data-plane DBs share one host, but they're walled off from the portal + ERPNext + Stalwart competing on vm-control.
|
||||||
|
- **stage runs the app plane only.** It calls prod Keycloak + prod Tenant Registry under `tenant.kind = stage` rather than mirroring those services.
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## 2. Service-to-VM Mapping
|
||||||
|
|
||||||
|
```
|
||||||
|
vm-edge (prod, m2.small 8 GB, public IP)
|
||||||
|
├── orca-proxy (Orca-managed; wildcard TLS terminator)
|
||||||
|
├── powerdns-auth (Orca-managed; authoritative DNS for yourplatform.com)
|
||||||
|
├── keycloak-26 (Orca-managed; JVM, ~1.5 GB heap)
|
||||||
|
├── postgres-keycloak (Orca-managed; dedicated PG instance for Keycloak only)
|
||||||
|
├── infisical (Orca-managed)
|
||||||
|
├── postgres-infisical (Orca-managed; dedicated PG instance for Infisical only)
|
||||||
|
├── redis-infisical (Orca-managed; ephemeral)
|
||||||
|
└── gitea (Orca-managed; SQLite backend to avoid a third PG)
|
||||||
|
|
||||||
|
vm-control (prod, m2.medium 16 GB)
|
||||||
|
├── customer-portal (Orca-managed; Next.js)
|
||||||
|
├── tenant-registry (Orca-managed; Go)
|
||||||
|
├── orca-controller (Orca core process; NOT a managed container)
|
||||||
|
├── erpnext (Orca-managed; Frappe bench)
|
||||||
|
├── frappe-hd (same bench as ERPNext)
|
||||||
|
├── mariadb (Orca-managed; for ERPNext)
|
||||||
|
├── redis-erpnext (Orca-managed)
|
||||||
|
└── stalwart-mail (Orca-managed; SMTP/IMAP/JMAP on mail.yourplatform.com)
|
||||||
|
|
||||||
|
vm-data (prod, m2.medium 16 GB)
|
||||||
|
├── certifai-dashboard (Orca-managed)
|
||||||
|
├── mongodb (Orca-managed)
|
||||||
|
├── litellm (Orca-managed)
|
||||||
|
├── backend-compliance (Orca-managed)
|
||||||
|
├── ai-compliance-sdk (Orca-managed)
|
||||||
|
├── admin-compliance (Orca-managed)
|
||||||
|
├── postgres-app (Orca-managed; schemas: tenant_registry, compliance)
|
||||||
|
├── qdrant (Orca-managed)
|
||||||
|
└── minio (Orca-managed)
|
||||||
|
|
||||||
|
stage (stage, m2.small 8 GB, public IP)
|
||||||
|
├── orca-proxy (light; only routes to stage app)
|
||||||
|
├── customer-portal (NEW VERSION under test)
|
||||||
|
├── tenant-registry (NEW VERSION under test, talks to ephemeral PG below)
|
||||||
|
├── certifai-dashboard (NEW VERSION under test)
|
||||||
|
├── backend-compliance (NEW VERSION under test)
|
||||||
|
├── ai-compliance-sdk (NEW VERSION under test)
|
||||||
|
├── admin-compliance (NEW VERSION under test)
|
||||||
|
├── litellm (light; same image as prod)
|
||||||
|
├── postgres-app-stage (ephemeral; lives entirely on stage VM)
|
||||||
|
├── mongodb-stage (ephemeral)
|
||||||
|
└── qdrant-stage (ephemeral, tiny corpus)
|
||||||
|
|
||||||
|
Calls OUT to prod:
|
||||||
|
→ auth.yourplatform.com (Keycloak token issuance, under stage client_id)
|
||||||
|
→ mail.yourplatform.com (Stalwart SMTP, recipient filter forces +stage@ only)
|
||||||
|
→ Polar SANDBOX webhook URL (NEVER prod Polar)
|
||||||
|
→ no calls to prod Postgres-app, MariaDB, MongoDB
|
||||||
|
```
|
||||||
|
|
||||||
|
### Stage isolation rules (enforced at the platform, not in product code)
|
||||||
|
|
||||||
|
| Risk | Enforcement mechanism | Owner |
|
||||||
|
|---|---|---|
|
||||||
|
| Stage writes to prod database | Infisical scope: stage app only gets `/stage/*` secrets. Prod DB credentials never reach stage. | Infra plane |
|
||||||
|
| Stage emails real customers | Stalwart accept-rule: drop if recipient does not match `*+stage@*`. | Control plane (Stalwart config) |
|
||||||
|
| Stage triggers real Polar charges | Stage env points `POLAR_API_URL` to sandbox. Prod Polar webhook secret never on stage. | Control plane |
|
||||||
|
| Stage Keycloak JWT used in prod | Stage tokens are issued only under a dedicated `stage_client_id`; prod services reject any JWT whose `aud` is that client. | Identity plane |
|
||||||
|
| Stage load DoSes prod Keycloak | Keycloak rate limit per `client_id`; stage is limited to 60 req/s. | Identity plane |
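
Two of the rules in the table reduce to simple predicates, sketched here for illustration. The recipient pattern is taken from the table; the literal audience value `stage_client_id` is a placeholder, the function names are hypothetical, and JWT signature validation is assumed to happen elsewhere. The real enforcement lives in Stalwart and Keycloak configuration, not in application code.

```python
from fnmatch import fnmatchcase
from typing import Iterable, Union

STAGE_RECIPIENT_PATTERN = "*+stage@*"  # from the Stalwart accept-rule row
STAGE_AUDIENCE = "stage_client_id"     # placeholder for the stage client's audience


def stage_may_send(recipient: str) -> bool:
    """Stage mail is delivered only to plus-addressed +stage recipients."""
    return fnmatchcase(recipient.lower(), STAGE_RECIPIENT_PATTERN)


def prod_accepts(aud: Union[str, Iterable[str]]) -> bool:
    """Prod rejects tokens issued for stage; `aud` may be a string or a list."""
    audiences = [aud] if isinstance(aud, str) else list(aud)
    return STAGE_AUDIENCE not in audiences
```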
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## 3. Network Topology
|
||||||
|
|
||||||
|
```
|
||||||
|
INTERNET
|
||||||
|
│
|
||||||
|
(yourplatform.com — authoritative on vm-edge PowerDNS;
|
||||||
|
stage.yourplatform.com — authoritative same zone)
|
||||||
|
│
|
||||||
|
┌─────────────┴─────────────┐
|
||||||
|
│ │
|
||||||
|
┌───────▼────────┐ ┌────────▼─────────┐
|
||||||
|
│ vm-edge │ │ stage │
|
||||||
|
│ (public IP) │ │ (public IP) │
|
||||||
|
│ │ │ │
|
||||||
|
│ orca-proxy ────┤ │ orca-proxy │
|
||||||
|
│ powerdns │ │ portal-new │
|
||||||
|
│ keycloak │◄────────┤ tenant-registry-new
|
||||||
|
│ pg-keycloak │ stage │ certifai-new │
|
||||||
|
│ infisical │ calls │ compliance-new │
|
||||||
|
│ pg-infisical │ prod │ pg-stage │
|
||||||
|
│ redis-infis │ KC + │ mongo-stage │
|
||||||
|
│ gitea │ Stalwart│ qdrant-stage │
|
||||||
|
└───────┬────────┘ └──────────────────┘
|
||||||
|
│ PRIVATE NETWORK 10.0.0.0/16
|
||||||
|
┌────────┴─────────┐
|
||||||
|
│ │
|
||||||
|
┌──────▼───────┐ ┌───────▼──────┐
|
||||||
|
│ vm-control │ │ vm-data │
|
||||||
|
│ │ │ │
|
||||||
|
│ portal │ │ certifai │
|
||||||
|
│ tenant-reg │ │ mongodb │
|
||||||
|
│ orca-ctrl │ │ litellm │
|
||||||
|
│ erpnext │ │ backend-comp │
|
||||||
|
│ frappe-hd │ │ ai-sdk │
|
||||||
|
│ mariadb │ │ admin-comp │
|
||||||
|
│ redis-erp │ │ pg-app │
|
||||||
|
│ stalwart │ │ qdrant │
|
||||||
|
└──────────────┘ │ minio │
|
||||||
|
└──────────────┘
|
||||||
|
|
||||||
|
Orca-Proxy routing (vm-edge, by Host header):
|
||||||
|
auth.yourplatform.com → 127.0.0.1:8443 (Keycloak, local on vm-edge)
|
||||||
|
erp.yourplatform.com → vm-control:8000 (ERPNext) [allowlist: our IPs only]
|
||||||
|
git.yourplatform.com → vm-edge:3000 (Gitea, local) [allowlist: our IPs only]
|
||||||
|
mail.yourplatform.com → vm-control:587 (Stalwart submission) [allowlist: VM internal only]
|
||||||
|
ns1.yourplatform.com → 127.0.0.1:53 (PowerDNS, local)
|
||||||
|
*.yourplatform.com → vm-control:3000 (customer portal)
|
||||||
|
|
||||||
|
Orca-Proxy routing (stage, by Host header):
|
||||||
|
*.stage.yourplatform.com → 127.0.0.1:3000 (stage portal — all subdomains route here)
|
||||||
|
```
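
The Host-based lookup for the HTTP entries in the routing table above can be sketched as an ordered match, most specific first, with the wildcard as the fallback. This is an illustration only: Orca-Proxy's actual matching rules and the IP allowlists are not modelled, and `upstream_for` is a hypothetical name.

```python
from fnmatch import fnmatchcase
from typing import List, Optional, Tuple

# Ordered most-specific first; upstreams copied from the vm-edge routing table.
ROUTES: List[Tuple[str, str]] = [
    ("auth.yourplatform.com", "127.0.0.1:8443"),
    ("erp.yourplatform.com", "vm-control:8000"),
    ("git.yourplatform.com", "vm-edge:3000"),
    ("*.yourplatform.com", "vm-control:3000"),
]


def upstream_for(host_header: str) -> Optional[str]:
    """Return the upstream for a Host header, or None when nothing matches."""
    host = host_header.split(":", 1)[0].lower()  # drop an optional port
    for pattern, upstream in ROUTES:
        if fnmatchcase(host, pattern):
            return upstream
    return None
```

Order matters here: if the wildcard were listed first, `auth.yourplatform.com` would be routed to the customer portal. Note that `fnmatch` lets `*` span dots, so this sketch also matches multi-level subdomains.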
|
||||||
|
|
||||||
|
---

## 4. Storage and Volume Requirements

Block volumes (Ceph 3x replicated, €0.10/GiB/mo) are mounted to each VM.

```
┌──────────────┬───────────────────────────────────────────┬─────────┬─────────────────────┐
│ VM │ Data stores │ +Block │ Growth profile │
├──────────────┼───────────────────────────────────────────┼─────────┼─────────────────────┤
│ vm-edge │ pg-keycloak + pg-infisical + Gitea repos │ +50 GB │ Slow │
│ vm-control │ MariaDB (ERPNext) + Stalwart mail spool │ +250 GB │ Medium │
│ vm-data │ MongoDB + pg-app + Qdrant + MinIO │ +500 GB │ Fast (scales w/ N) │
│ stage │ pg-stage + mongo-stage + qdrant-stage │ +50 GB │ Resets per release │
└──────────────┴───────────────────────────────────────────┴─────────┴─────────────────────┘

Each VM's root disk: 50 GB ephemeral, included in flavor price.

Object storage (S3, €0.02/GiB/mo single-region or €0.0496/GiB/mo geo-redundant):
┌─────────────────────────────────┬─────────┬──────────────────────────┐
│ Bucket │ Size │ Purpose │
├─────────────────────────────────┼─────────┼──────────────────────────┤
│ s3://backups (geo-redundant) │ ~500 GB │ Database dumps │
│ s3://seed-data │ ~30 GB │ Demo tenant fixtures │
│ s3://exports │ ~50 GB │ GDPR/offboarding ZIPs │
│ s3://audit-archive │ ~20 GB │ Old audit log overflow │
└─────────────────────────────────┴─────────┴──────────────────────────┘
```
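
As a rough cross-check of the figures above, the following sketch totals the monthly storage bill from the two tables. It assumes the table's GB sizes are billed at the quoted per-GiB rates and that only the `backups` bucket uses the geo-redundant rate; the dictionary and function names are illustrative and not part of any existing tooling.

```python
# Rates quoted in this section (EUR per GiB per month).
BLOCK_RATE = 0.10       # Ceph block volume, 3x replicated
S3_SINGLE_RATE = 0.02   # object storage, single-region
S3_GEO_RATE = 0.0496    # object storage, geo-redundant

# Extra block volume per VM, in GB (assumed to be billed as GiB).
block_volumes_gb = {"vm-edge": 50, "vm-control": 250, "vm-data": 500, "stage": 50}

# Bucket name -> (approximate size in GB, geo-redundant?)
buckets = {
    "backups": (500, True),
    "seed-data": (30, False),
    "exports": (50, False),
    "audit-archive": (20, False),
}


def monthly_storage_cost() -> dict:
    """Sum block and object storage cost per month, rounded to cents."""
    block = sum(block_volumes_gb.values()) * BLOCK_RATE
    obj = sum(
        size * (S3_GEO_RATE if geo else S3_SINGLE_RATE)
        for size, geo in buckets.values()
    )
    return {
        "block": round(block, 2),
        "object": round(obj, 2),
        "total": round(block + obj, 2),
    }


print(monthly_storage_cost())
```

Under those assumptions the total comes to about €111.80 per month (€85.00 block plus €26.80 object), which is in line with the €112 storage figure in §14.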

---

## 5. Backup Requirements

All backups ship to **SysEleven Object Storage** (S3-compatible, geo-redundant DUS2 ↔ HAM1 for production-critical data). Backup jobs run as Orca one-shot containers on cron. Infisical holds the S3 credentials.

```
┌───────────────────────┬──────────────────┬────────────┬────────────┬──────────────────────┐
│ Data store │ Method │ Frequency │ Retention │ Owner (who restores) │
├───────────────────────┼──────────────────┼────────────┼────────────┼──────────────────────┤
│ pg-keycloak (vm-edge) │ pg_dump → S3-geo │ Every 6h │ 14 days │ Infra Plane │
│ pg-infisical (vm-edge)│ pg_dump → S3-geo │ Daily │ 30 days │ Infra Plane │
│ Gitea (vm-edge) │ gitea dump → S3 │ Daily │ 30 days │ Infra Plane │
│ Keycloak realm export │ KC export → S3 │ Daily │ 14 days │ Identity Plane (owns)│
│ Infisical store │ encrypted → S3 │ Daily │ 30 days │ Infra Plane │
│ MariaDB (vm-control) │ mysqldump → S3 │ Every 6h │ 30 days │ Control Plane │
│ Stalwart queue/store │ tar → S3 │ Daily │ 7 days │ Control Plane │
│ pg-app (vm-data) │ pg_dump → S3-geo │ Every 6h │ 30 days │ Data Plane (owns RPO)│
│ MongoDB (vm-data) │ mongodump → S3 │ Daily │ 30 days │ Data Plane │
│ MinIO (vm-data) │ mc mirror → S3 │ Daily │ 90 days │ Data Plane │
│ Qdrant (vm-data) │ API snap → S3 │ Daily │ 14 days │ Data Plane (rebuild) │
│ stage * │ no backup │ — │ — │ — (ephemeral) │
│ Orca config (IaC) │ Gitea (VCS) │ On commit │ Forever │ Infra Plane │
└───────────────────────┴──────────────────┴────────────┴────────────┴──────────────────────┘
```

### RPO by data criticality

```
CRITICAL (RPO ≤ 6h)
  pg-keycloak — org memberships, IdP config
  pg-app — tenant registry, compliance records
  MariaDB/ERPNext — sales orders, invoices, contracts

IMPORTANT (RPO ≤ 24h)
  MongoDB — chat history, user preferences
  MinIO — compliance evidence documents
  pg-infisical — encrypted secrets
  Stalwart store — inbound webhooks, bounce records

RECOVERABLE (RPO ≤ 48h, rebuildable)
  Qdrant — vector index (rebuildable from MinIO source documents)
  Gitea — code (mirrored on dev machines)
  Keycloak export — org structure (pg-keycloak is primary)

NOT BACKED UP
  stage (any data) — by design; restored from seed bundles on each deploy
  redis-* — caches; restart cold
```
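
The RPO tiers only hold if each store is backed up at least as often as its tier allows. The sketch below encodes the backup frequencies from the §5 table and the tiers listed above, then reports any store whose backup interval exceeds its RPO. The structure and names are illustrative assumptions, not an existing check in the repository.

```python
from datetime import timedelta

# Backup interval per data store, taken from the §5 backup table.
backup_interval = {
    "pg-keycloak": timedelta(hours=6),
    "pg-app": timedelta(hours=6),
    "MariaDB": timedelta(hours=6),
    "MongoDB": timedelta(days=1),
    "MinIO": timedelta(days=1),
    "pg-infisical": timedelta(days=1),
    "Stalwart": timedelta(days=1),
    "Qdrant": timedelta(days=1),
    "Gitea": timedelta(days=1),
    "Keycloak export": timedelta(days=1),
}

# Tier name -> (maximum acceptable data-loss window, stores in that tier).
rpo_tiers = {
    "CRITICAL": (timedelta(hours=6), ["pg-keycloak", "pg-app", "MariaDB"]),
    "IMPORTANT": (timedelta(hours=24), ["MongoDB", "MinIO", "pg-infisical", "Stalwart"]),
    "RECOVERABLE": (timedelta(hours=48), ["Qdrant", "Gitea", "Keycloak export"]),
}


def rpo_violations() -> list:
    """Return the stores whose backup interval is longer than their tier's RPO."""
    return [
        store
        for _tier, (rpo, stores) in rpo_tiers.items()
        for store in stores
        if backup_interval[store] > rpo
    ]


print(rpo_violations())
```

With the values as documented, no store violates its tier. Note that several stores sit exactly at the limit (for example a 6h interval against a 6h RPO), so the time a dump takes to complete and upload is not covered by this simple comparison.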

---

## 6. Constraint Framework

### Constraint types

```
AVAILABILITY    — required uptime percentage over rolling 30 days
RTO             — Recovery Time Objective: max time to restore service after failure
RPO             — Recovery Point Objective: max acceptable data loss window
IaC             — service must be declared in Orca config, no manual container runs in prod
SECRET_HYGIENE  — all secrets via Infisical machine identity; no env files, no hardcoded values
NETWORK         — whether service is internet-exposed or internal-only
DATA_RESIDENCY  — all data must remain in EU (SysEleven DUS2 + HAM1)
AUDIT_TRAIL     — all mutating actions logged (who, what, when, from where)
IMMUTABILITY    — config changes go through Gitea → Orca pipeline, not manual SSH
STAGE_ISOLATION — stage tenant cannot mutate any prod data; read-only against prod KC + TR
```

### Plane ownership of constraints

Even though planes now share VMs, the **ownership model is unchanged** — the plane that owns a constraint owns it regardless of which VM hosts the service. The Infra Plane (now collapsed onto vm-edge alongside the Identity plane) still mechanically enforces backup, IaC, secrets, and network constraints.

```
╔══════════════════════════════════════════════════════════════════════════════════════════╗
║ IDENTITY PLANE (on vm-edge) ║
║ ║
║ Owns / defines: ║
║ AVAILABILITY — must be ≥ 99.5% (root dep for everything) ║
║ RTO — ≤ 15 min ║
║ AUDIT_TRAIL — realm-level audit (logins, token issuance, IdP events) ║
║ DATA_RESIDENCY— Keycloak realm data must stay EU ║
║ STAGE_ISOLATION— rate-limits stage_client_id; rejects stage JWTs in prod audiences ║
║ ║
║ Co-tenant note: shares vm-edge with Infra Plane services. JVM heap pinned to 1.5 GB ║
║ in Orca manifest so it cannot starve PowerDNS / Infisical. ║
╚══════════════════════════════════════════════════════════════════════════════════════════╝

╔══════════════════════════════════════════════════════════════════════════════════════════╗
║ CONTROL PLANE (on vm-control) ║
║ ║
║ Owns / defines: ║
║ RPO (tenant) — tenant registry & compliance schemas RPO ≤ 6h ║
║ RPO (ERPNext) — sales orders, invoices RPO ≤ 6h ║
║ AUDIT_TRAIL — all portal actions (invites, IdP changes, impersonations) ║
║ AVAILABILITY — portal ≥ 99.5%; ERPNext ≥ 99% (internal) ║
║ RTO (portal) — ≤ 10 min ║
║ RTO (ERPNext) — ≤ 60 min ║
║ ║
║ Co-tenant note: ERPNext + Portal + Stalwart on one VM. Orca resource limits enforced: ║
║ portal: 1 GB memory cap ║
║ erpnext: 6 GB memory cap ║
║ mariadb: 3 GB memory cap ║
║ stalwart: 1 GB memory cap ║
║ tenant-registry: 500 MB ║
╚══════════════════════════════════════════════════════════════════════════════════════════╝

╔══════════════════════════════════════════════════════════════════════════════════════════╗
║ DATA PLANE (on vm-data) ║
║ ║
║ Owns / defines: ║
║ DATA_RESIDENCY — all customer data (MongoDB, pg-app, MinIO) must stay EU ║
║ RPO (product) — compliance records ≤ 6h; chat history ≤ 24h ║
║ DATA_ISOLATION — every query scoped by org_id/tenant_id ║
║ AUDIT_TRAIL — product-level actions ║
║ AVAILABILITY — CERTifAI ≥ 99.5%; compliance ≥ 99.5% ║
║ ║
║ Co-tenant note: this VM is the SCALE driver. When vm-data hits 80% RAM, bump flavor ║
║ (m2.medium → m2.large → m2.xlarge). See §13 Growth Trajectory. ║
╚══════════════════════════════════════════════════════════════════════════════════════════╝

╔══════════════════════════════════════════════════════════════════════════════════════════╗
║ INFRA PLANE (on vm-edge, alongside Identity) ║
║ ║
║ Owns / enforces ALL of: ║
║ BACKUP — executes all backup jobs (pg_dump, mongodump, mc mirror) ║
║ IaC — ALL services declared in Orca config; no manual prod changes ║
║ IMMUTABILITY — config changes: Gitea commit → Gitea Actions → Orca API only ║
║ SECRET_HYGIENE— Infisical (on vm-edge); provisions machine identities ║
║ NETWORK — Orca-Proxy rules; VM firewall; no direct VM public exposure ║
║ DATA_RESIDENCY— VM region = SysEleven DUS2; backups geo-redundant DUS2↔HAM1 ║
║ AVAILABILITY — Orca restart policies, health checks ║
║ COLD_START — enforces startup ordering (see §10 Scenario F) ║
║ STAGE_ISOLATION— Infisical secret-path scoping for stage_app identity ║
╚══════════════════════════════════════════════════════════════════════════════════════════╝
```

---

## 7. SLA Table

```
┌───────────────────────┬──────────────┬─────────┬─────────┬────────────────────────────────┐
│ Service │ Availability │ RTO │ RPO │ Host VM │
├───────────────────────┼──────────────┼─────────┼─────────┼────────────────────────────────┤
│ Orca-Proxy │ 99.9% │ 5 min │ N/A │ vm-edge │
│ PowerDNS │ 99.9% │ 5 min │ N/A │ vm-edge │
│ Keycloak │ 99.5% │ 15 min │ 6h │ vm-edge (root auth dep) │
│ Infisical │ 99.5% │ 30 min │ 24h │ vm-edge (running svcs survive) │
│ Gitea │ 99% │ 2h │ 24h │ vm-edge (dev machines mirror) │
│ Customer Portal │ 99.5% │ 10 min │ N/A │ vm-control │
│ Tenant Registry │ 99.5% │ 10 min │ 6h │ vm-control │
│ ERPNext │ 99% │ 60 min │ 6h │ vm-control (internal only) │
│ Frappe HD │ 99% │ 60 min │ 24h │ vm-control │
│ MariaDB │ 99.5% │ 20 min │ 6h │ vm-control │
│ Stalwart Mail │ 99% │ 60 min │ 24h │ vm-control │
│ CERTifAI │ 99.5% │ 10 min │ 24h │ vm-data │
│ MongoDB │ 99.5% │ 20 min │ 24h │ vm-data │
│ LiteLLM │ 99% │ 5 min │ N/A │ vm-data │
│ backend-compliance │ 99.5% │ 10 min │ 6h │ vm-data │
│ ai-compliance-sdk │ 99.5% │ 10 min │ 6h │ vm-data │
│ pg-app │ 99.9% │ 20 min │ 6h │ vm-data (SPOF — RISK-1) │
│ MinIO │ 99.5% │ 30 min │ 24h │ vm-data │
│ Qdrant │ 99% │ 2h │ 24h │ vm-data (rebuildable) │
│ stage (any service) │ 95% │ best ef.│ N/A │ stage (ephemeral; no SLA) │
└───────────────────────┴──────────────┴─────────┴─────────┴────────────────────────────────┘
```
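
To make the availability column easier to reason about, the percentages can be converted into a downtime budget. The sketch below assumes the rolling 30-day window from the AVAILABILITY definition in §6; the function name is illustrative.

```python
# Rolling window used for availability, per the AVAILABILITY constraint in §6.
WINDOW_MINUTES = 30 * 24 * 60  # 43,200 minutes


def allowed_downtime_minutes(availability_pct: float) -> float:
    """Minutes of downtime permitted in one 30-day window at the given availability."""
    return round(WINDOW_MINUTES * (1 - availability_pct / 100), 1)


for pct in (99.9, 99.5, 99.0, 95.0):
    print(f"{pct}% -> {allowed_downtime_minutes(pct)} min per 30 days")
```

This gives 43.2 minutes at 99.9%, 216 minutes at 99.5%, 432 minutes at 99%, and 2,160 minutes at 95%. Read against the RTO column, a service at 99.5% with a 10-minute RTO can absorb roughly 21 full-length incidents per window before breaching its target.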

---

## 8. IaC Constraint (Orca)

Every production service is declared in Orca config. No exceptions.

### Rules

```
1. ALL containers run via Orca manifests committed to Gitea
   → /orca/manifests/{vm-name}/{service-name}.toml
   → Changes go through: Gitea PR → Gitea Actions lint → Orca API apply

2. NO manual docker run / docker-compose up on any production VM
   → SSH to prod VMs allowed for debugging only; no state changes

3. Secrets are NEVER in Orca manifests
   → Manifests reference Infisical paths, not values
   → Bootstrap exception: Keycloak DB URI in Orca env (Keycloak runs ON vm-edge alongside
     Infisical, so chicken-and-egg is solved by Orca env file, not Infisical lookup)

4. Restart policy: always (Orca restarts crashed containers with exponential backoff)
   → Health check per service (HTTP /health or TCP probe)

5. Resource limits MANDATORY in every manifest
   → On a 3-VM prod, co-tenant noise is the single biggest risk; limits are non-negotiable
   → See §6 Plane ownership "Co-tenant note" boxes for the per-service caps

6. Orca controller state itself is recoverable
   → Manifest files in Gitea = desired state
   → Loss of Orca controller = re-apply manifests from Gitea, services continue running

7. Stage app gets its own Infisical scope
   → /stage/* path; no prod-DB credentials reach this scope
   → Enforced at Infisical machine-identity level, not in app code
```
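
Rules 3 to 5 lend themselves to an automated lint step. The sketch below shows one way such a check could look. The Orca manifest schema is not shown in this document, so the keys (`restart`, `healthcheck`, `resources.memory`, `env`) and the `infisical:` reference prefix are hypothetical placeholders, as is the function itself.

```python
import re

# HYPOTHETICAL parsed manifest: the key names and the "infisical:" prefix are
# assumptions for illustration, not the real Orca schema.
manifest = {
    "service": "customer-portal",
    "restart": "always",
    "healthcheck": {"http": "/health"},
    "resources": {"memory": "1GB"},
    "env": {"DATABASE_URL": "infisical:/prod/portal/DATABASE_URL"},
}

# Env var names that usually carry credentials.
SECRET_NAME = re.compile(r"(PASSWORD|SECRET|TOKEN|KEY|URL|URI)$", re.IGNORECASE)


def lint_manifest(m: dict) -> list:
    """Return a list of rule violations; an empty list means the manifest passes."""
    errors = []
    if m.get("restart") != "always":
        errors.append("rule 4: restart policy must be 'always'")
    if not m.get("healthcheck"):
        errors.append("rule 4: missing health check")
    if not m.get("resources", {}).get("memory"):
        errors.append("rule 5: missing memory limit")
    for name, value in m.get("env", {}).items():
        if SECRET_NAME.search(name) and not str(value).startswith("infisical:"):
            errors.append(f"rule 3: {name} must reference an Infisical path")
    return errors


print(lint_manifest(manifest))
```

A real lint step would also need an explicit allowlist for the Keycloak bootstrap exception in rule 3, which this sketch would otherwise flag.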

### Gitea Actions pipeline for infra changes

```
infra change committed to Gitea
│
├── lint: validate Orca manifest schema
├── diff: show what changes will be applied (orca plan)
├── (manual approval gate for vm-edge changes — touches auth root)
└── apply: POST to Orca Controller API → rolling update
```
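
The approval gate in the pipeline can be decided from the changed file paths alone, given the manifest layout from rule 1 (`/orca/manifests/{vm-name}/{service-name}.toml`). The following is a minimal sketch of that decision; the function name is an assumption, and how the pipeline obtains the list of changed paths is left out.

```python
from pathlib import PurePosixPath


def needs_manual_approval(changed_paths: list) -> bool:
    """True if any changed manifest belongs to vm-edge, which hosts the auth root."""
    for path in changed_paths:
        parts = PurePosixPath(path).parts
        # Expected layout: /orca/manifests/{vm-name}/{service-name}.toml
        if "manifests" in parts:
            vm_index = parts.index("manifests") + 1
            if vm_index < len(parts) and parts[vm_index] == "vm-edge":
                return True
    return False


print(needs_manual_approval(["/orca/manifests/vm-edge/keycloak.toml"]))
print(needs_manual_approval(["/orca/manifests/vm-data/litellm.toml"]))
```

The first call reports that approval is needed and the second that it is not. Matching on the directory component rather than a substring avoids false positives from a service that merely has `vm-edge` in its file name.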

---

## 9. Dependency Graph

Arrows = "requires to function." Dashed = soft (degrades, doesn't fail).
**Intra-VM dependencies elided** for clarity (e.g. Keycloak ↔ pg-keycloak are on the same host and start together).

```
EXTERNAL
AI APIs
(OpenAI / Anthropic)
│
│ (soft)
▼
┌──────────────────────────────────────────────────────────────────┐
│ vm-edge (Identity + Infra) │
│ ┌────────────────────────────────────────────────────────────┐ │
│ │ pg-keycloak ──► keycloak │ │
│ │ pg-infisical ─► infisical ◄── (all VMs pull on startup) │ │
│ │ redis-infis ──► infisical │ │
│ │ (sqlite) ─────► gitea │ │
│ │ powerdns-auth (no deps) │ │
│ │ orca-proxy (route table only; backends are remote) │ │
│ └────────────────────────────────────────────────────────────┘ │
│ │ Keycloak JWKS │ Infisical /secrets │
│ │ │ │
└────────────────────────────┼────────────────┼────────────────────┘
▼ ▼
┌──────────────────────────────────────────────────────────────────┐
│ vm-control (Control) │
│ ┌────────────────────────────────────────────────────────────┐ │
│ │ mariadb + redis-erp ──► erpnext + frappe-hd │ │
│ │ (intra) ─────────────► stalwart │ │
│ │ ──────────────────────► customer-portal │ │
│ │ ──────────────────────► tenant-registry ──► pg-app (vm-data)│ │
│ └────────────────────────────────────────────────────────────┘ │
│ │ tenant-registry API │
└────────────────────────────┼─────────────────────────────────────┘
▼
┌──────────────────────────────────────────────────────────────────┐
│ vm-data (Data) │
│ ┌────────────────────────────────────────────────────────────┐ │
│ │ mongodb ───► certifai ◄── (vm-edge JWKS, vm-edge secrets) │ │
│ │ litellm ───► certifai, ai-compliance-sdk │ │
│ │ pg-app ────► tenant-registry-on-vm-control, backend-compl,│ │
│ │ ai-compliance-sdk │ │
│ │ qdrant ────► ai-compliance-sdk │ │
│ │ minio ────► backend-compliance │ │
│ │ backend-compliance ──► admin-compliance │ │
│ └────────────────────────────────────────────────────────────┘ │
└──────────────────────────────────────────────────────────────────┘

┌──────────────────────────────────────────────────────────────────┐
│ stage (App plane only) │
│ Calls vm-edge:8443 (KC) + vm-control:587 (Stalwart submission) │
│ Calls Polar SANDBOX (never prod Polar webhook URL) │
│ Its own ephemeral DBs; cannot read prod data │
└──────────────────────────────────────────────────────────────────┘
```

### Simplified critical path (customer login → product use)

```
DNS (vm-edge PowerDNS)
  │
  ▼
orca-proxy (vm-edge)
  │
  ├──► keycloak (vm-edge) ──► pg-keycloak (intra-VM)
  │
  └──► customer-portal (vm-control)
         ├──► tenant-registry (vm-control) ──► pg-app (vm-data)
         ├──► certifai (vm-data) ──► mongodb (intra-VM)
         └──► backend-compliance (vm-data) ──► pg-app (intra-VM)
                                           ──► ai-sdk ──► qdrant + minio
                                           ──► litellm ──► [external AI APIs]
```

---

## 10. Failure Scenarios and Deadlock Analysis

### Scenario A — vm-edge fails (HIGHEST SEVERITY)

```
Impact:     TOTAL outage. Nothing reachable from internet.
            No DNS. No TLS. No auth. No new logins. Running JWTs expire within 15 min,
            then ALL services start returning 401.
            Backstage and customer portal both fully blocked.
            Stage also blocked (depends on prod Keycloak).
Cascade:    T+0: DNS fails → orca-proxy unreachable
            T+5m: existing JWTs still valid; portal cached → partial reads work
            T+15m: JWTs expire → full outage
Deadlock:   None — services downstream don't deadlock, they just fail closed
Recovery:   1. Spin up vm-edge-spare (cold standby, same Orca config) — ~3 min provision
            2. Restore pg-keycloak + pg-infisical from latest backup — ~5 min
            3. Swap registrar NS records to spare IP (TTL 60s) — ~2 min propagation
            4. Restart all services on vm-edge-spare via Orca apply — ~3 min
            Total RTO target: 15 min
Mitigation: COLD STANDBY vm-edge-spare. Same Orca config committed in Gitea.
            Provision cost when idle: €0 (only billed when running).
            Test recovery quarterly.
Severity:   CRITICAL — single host owns 3 root dependencies (DNS, auth, secrets)
            Cost of fix at Tier C: split vm-edge into vm-edge + vm-identity + vm-secrets
            (back toward original 7-VM design) — €100/mo extra
```
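
The four recovery steps in Scenario A carry their own time estimates, so the 15-minute RTO target can be checked with simple arithmetic. The sketch below sums the step estimates as documented and assumes the steps run strictly one after another; the names are illustrative.

```python
# Step estimates in minutes, as listed under "Recovery" in Scenario A.
recovery_steps_min = {
    "provision vm-edge-spare": 3,
    "restore pg-keycloak + pg-infisical": 5,
    "NS record propagation": 2,
    "Orca apply on vm-edge-spare": 3,
}
RTO_TARGET_MIN = 15


def recovery_slack_minutes() -> int:
    """Minutes left in the RTO budget if every step takes its estimated time."""
    return RTO_TARGET_MIN - sum(recovery_steps_min.values())


print(sum(recovery_steps_min.values()), recovery_slack_minutes())
```

The steps add up to 13 minutes, leaving 2 minutes of slack for detection and decision time, which is a thin margin and a good reason for the quarterly recovery test.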

### Scenario B — vm-control fails (NEW — consequence of plane consolidation)

```
Impact:     customer-portal: DOWN → /[slug]/* all return 503
            tenant-registry: DOWN → Keycloak protocol-mapper for products claim breaks
              → users can log in but see "No active products"
            ERPNext + Frappe HD: DOWN → we cannot create sales orders or read tickets
            Stalwart: DOWN → no outbound emails (trial nudges, exports, ticket replies)
            MariaDB: DOWN → ERPNext queries fail; backups paused
            Products (CERTifAI, compliance): UNAFFECTED (on vm-data, JWTs still validate)
            Existing logged-in users: can use products directly via product subdomain
              IF they bookmarked it; portal home is 503.
Cascade:    T+0: portal 503; new tenant onboarding blocked (registry down)
            T+15m: existing JWTs missing refreshed products claim
            T+1h: trial emails not sent → trial nudge cadence breaks
Deadlock:   None
Recovery:   Restart vm-control containers via Orca. If MariaDB corrupt: restore mysqldump.
            RTO target: 10 min (portal) / 60 min (ERPNext)
Mitigation: Multiple services co-hosted = single failure hits many SLAs.
            Resource limits in Orca prevent ERPNext OOM from killing portal.
            Quarterly drill: deliberately stop portal, measure recovery.
Severity:   HIGH — five service groups down at once (see RISK-3), but products keep serving customers
            Cost of fix at Tier B/C: split vm-control → vm-portal + vm-ops (ERPNext)
            — €64/mo extra at m2.small
```

### Scenario C — vm-data fails

```
Impact:     tenant-registry queries: FAIL (pg-app down) → portal returns 503 for tenant lookup
            customer-portal: DEGRADED (login works, dashboard fails)
            CERTifAI: COMPLETELY DOWN
            backend-compliance + ai-sdk + admin: COMPLETELY DOWN
            ERPNext + Stalwart: UNAFFECTED
Cascade:    T+0: products down; portal degraded
            T+15m: support tickets pile up
            Note: prod is partial — users see error pages but ERPNext + auth still work
Recovery:   Restart vm-data containers. If pg-app corrupt: restore from pg_dump (RPO 6h).
            RTO target: 20 min
Mitigation: This is the SCALE-event VM. RISK-1 below makes this the worst SPOF:
            one pg-app instance owns tenant_registry + compliance schemas.
            HIGH PRIORITY fix: split pg-app into separate clusters at Tier B/C transition.
Severity:   HIGH — products down, business operations (ERPNext) still work so we can
            contact customers
```

### Scenario D — LiteLLM fails

```
Impact:     CERTifAI: AI features fail (summarization, chat completion).
            CERTifAI dashboard, sessions: UNAFFECTED.
            compliance AI generation: FAILS (DSFA/TOM/VVT generation blocked).
            Compliance CRUD: UNAFFECTED.
Cascade:    Soft degradation only. Products show "AI features temporarily unavailable" banner.
Deadlock:   None.
Recovery:   Restart LiteLLM on vm-data (stateless, ~30s).
Severity:   MEDIUM — graceful degradation by design
```

### Scenario E — Stage VM compromised or buggy

```
Impact:     On stage itself: stage portal serves bad data; stage testers see errors.
            On prod: NONE if isolation rules in §2 are intact.
            Worst case if isolation breaks:
            - Stage code tries to call prod pg-app → fails (no creds in /stage/* Infisical)
            - Stage emits real email → blocked by Stalwart recipient filter
            - Stage triggers Polar charge → goes to sandbox, no real money
Cascade:    None to prod by design.
Recovery:   Roll back stage to previous image via Orca. RTO target: 5 min.
Mitigation: The 5 enforcement rules in §2 are the load-bearing controls. Verify quarterly
            via deliberate red-team: try to write to prod pg-app from stage and confirm 401.
Severity:   LOW (in prod) / HIGH (on stage, but stage SLA is 95%)
```

### Scenario F — Full Cold Start (Power Loss, All VMs Restart Simultaneously)

```
Three VMs boot at once. Services must start in dependency order or they
crash-loop until their deps are ready.

DEADLOCK RISK: vm-control services (portal, tenant-registry) start before vm-data
               services (pg-app, certifai, compliance). They'll crash-loop ~2-5min
               with backoff retries.
               Same for ERPNext on vm-control trying to reach Keycloak on vm-edge.

RESOLUTION:    Orca enforces cross-VM startup ordering via health-check dependencies.
               Bootstrap exception: Keycloak DB URI in Orca env on vm-edge (not from
               Infisical — chicken-and-egg solved).

Required cold start sequence:

Phase 0 — Data roots on vm-data (parallel):
  pg-app, mongodb, qdrant, minio
Phase 0 — Data roots on vm-control (parallel):
  mariadb, redis-erpnext
Phase 0 — Data roots on vm-edge (parallel):
  pg-keycloak, pg-infisical, redis-infisical

Phase 1 — Secrets + DNS on vm-edge:
  infisical (needs: pg-infisical, redis-infisical)
  powerdns-auth (no deps)

Phase 2 — Identity on vm-edge:
  keycloak (needs: pg-keycloak [Phase 0], infisical [Phase 1])
  gitea (needs: sqlite; ready from Phase 0)

Phase 3 — Control on vm-control + Data services on vm-data (parallel):
  tenant-registry (needs: keycloak [Phase 2], pg-app [Phase 0, remote])
  erpnext + frappe-hd (needs: mariadb, redis-erpnext [Phase 0], keycloak [Phase 2])
  stalwart (needs: infisical [Phase 1])
  litellm (needs: infisical)
  certifai (needs: keycloak, mongodb, litellm)
  backend-compliance (needs: keycloak, pg-app)
  ai-compliance-sdk (needs: pg-app, qdrant, litellm)
  admin-compliance (needs: backend + sdk)

Phase 4 — Customer-facing on vm-control:
  customer-portal (needs: keycloak, tenant-registry)

Phase 5 — Gateway on vm-edge (last):
  orca-proxy (waits for all backends healthy before opening listener)

Estimated cold-start time: 6-10 minutes (faster than 7-VM since there are fewer network round trips)
```
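
The phase list above is essentially a topological ordering of the "needs" relationships. As an illustration, the sketch below feeds a subset of those relationships into Python's standard `graphlib.TopologicalSorter` and groups services into waves that can start in parallel. This is not how Orca computes its ordering (that mechanism is not described here); the graph is transcribed by hand from the phase list, `orca-proxy`, `gitea`, and `powerdns-auth` are left out for brevity, and `erpnext` stands in for `erpnext + frappe-hd`.

```python
from graphlib import TopologicalSorter

# service -> services it needs before it can start (from the phase list above).
needs = {
    "infisical": {"pg-infisical", "redis-infisical"},
    "keycloak": {"pg-keycloak", "infisical"},
    "tenant-registry": {"keycloak", "pg-app"},
    "erpnext": {"mariadb", "redis-erpnext", "keycloak"},
    "stalwart": {"infisical"},
    "litellm": {"infisical"},
    "certifai": {"keycloak", "mongodb", "litellm"},
    "backend-compliance": {"keycloak", "pg-app"},
    "ai-compliance-sdk": {"pg-app", "qdrant", "litellm"},
    "admin-compliance": {"backend-compliance", "ai-compliance-sdk"},
    "customer-portal": {"keycloak", "tenant-registry"},
}


def startup_waves(graph: dict) -> list:
    """Group services into waves; everything in a wave can start in parallel."""
    sorter = TopologicalSorter(graph)
    sorter.prepare()  # raises graphlib.CycleError if the graph has a cycle
    waves = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())
        waves.append(ready)
        sorter.done(*ready)
    return waves


for number, wave in enumerate(startup_waves(needs)):
    print(number, wave)
```

The computed waves are the earliest possible start points, so they differ slightly from the documented phases: `stalwart` and `litellm` become eligible as soon as Infisical is up, whereas the phase list starts them later alongside the other control and data services. A useful side effect of `prepare()` is that it fails on a dependency cycle, which is the deadlock case this scenario is concerned with.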

### Scenario G — Tenant Registry fails

```
Impact:     Portal cannot resolve tenant from subdomain → /[slug]/* all 503
            Keycloak protocol mapper cannot get products claim → JWT missing field
              → users can log in but see "No active products"
            Products (CERTifAI, compliance) themselves: UNAFFECTED if already authenticated
Cascade:    New logins degraded.
            Existing sessions continue.
Deadlock:   None.
Recovery:   Restart tenant-registry on vm-control. pg-app on vm-data must be healthy.
            RTO target: ≤ 60s
Mitigation: Portal caches slug → tenant mapping with 60s TTL.
            Short outage invisible to customers.
Severity:   MEDIUM
```
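
The mitigation in Scenario G relies on a short-lived cache in front of the tenant registry. The sketch below shows the general shape of such a cache with a 60-second TTL. The class, its method names, and the `lookup` callback are assumptions for illustration; the portal's actual implementation is not shown in this document.

```python
import time


class SlugCache:
    """Caches slug -> tenant lookups for a fixed TTL (60 seconds by default)."""

    def __init__(self, ttl_seconds: float = 60.0, clock=time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock          # injectable so tests can control time
        self._entries = {}           # slug -> (tenant_id, expires_at)

    def get(self, slug: str, lookup):
        """Return the cached tenant for slug, calling lookup(slug) on a miss."""
        now = self._clock()
        entry = self._entries.get(slug)
        if entry is not None and entry[1] > now:
            return entry[0]
        tenant_id = lookup(slug)     # may raise if the registry is down
        self._entries[slug] = (tenant_id, now + self._ttl)
        return tenant_id
```

With a strict TTL like this, slugs seen in the last 60 seconds keep resolving during a registry restart, while slugs that were never cached fail immediately. Serving stale entries while the registry is unreachable would extend the masking window, at the cost of possibly showing outdated tenant data.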

---

## 11. Cross-Dependency Summary Table

```
Needs → │PG-KC│PG-Inf│PG-App│Mongo│Maria│Redis│Minio│Qdrant│ KC │Infis│Lit. │T.Reg│
─────────────────────┼─────┼──────┼──────┼─────┼─────┼─────┼─────┼──────┼─────┼─────┼─────┼─────┤
keycloak │ ● │ │ │ │ │ │ │ │ │ ◐* │ │ │
infisical │ │ ● │ │ │ │ ● │ │ │ │ │ │ │
gitea │ │ │ │ │ │ │ │ │ │ ● │ │ │
tenant-registry │ │ │ ● │ │ │ │ │ │ ● │ ● │ │ │
customer-portal │ │ │ │ │ │ │ │ │ ● │ ● │ │ ● │
erpnext │ │ │ │ │ ● │ ● │ │ │ ● │ ● │ │ │
frappe-hd │ │ │ │ │ ● │ ● │ │ │ │ ● │ │ │
stalwart │ │ │ │ │ │ │ │ │ │ ● │ │ │
certifai │ │ │ │ ● │ │ │ │ │ ● │ ● │ ◐ │ │
litellm │ │ │ │ │ │ │ │ │ │ ● │ │ │
backend-compl. │ │ │ ● │ │ │ │ │ │ ● │ ● │ │ │
ai-compl-sdk │ │ │ ● │ │ │ │ │ ● │ │ ● │ ◐ │ │
admin-compl. │ │ │ │ │ │ │ │ │ │ │ │ │
orca-proxy │ │ │ │ │ │ │ │ │ │ │ │ │
stage-app │ │ │ │ │ │ │ │ │ ● │ ◑ │ │ ◑ │

● = hard dependency (cannot start without)
◐ = soft dependency (starts, features degrade)
◑ = stage-only read-mostly dependency (writes blocked by Infisical scope)
◐*= bootstrap exception (Keycloak DB URI in Orca env on vm-edge, not Infisical)
```
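
The matrix can also be read in reverse to answer "what cannot start if X is down?". The sketch below transcribes only the hard (●) cells into a dictionary and walks it transitively. Soft (◐) and stage-only (◑) cells are deliberately ignored, and the function name is illustrative.

```python
# service -> hard dependencies (the ● cells of the table above).
hard_deps = {
    "keycloak": {"pg-keycloak"},
    "infisical": {"pg-infisical", "redis"},
    "gitea": {"infisical"},
    "tenant-registry": {"pg-app", "keycloak", "infisical"},
    "customer-portal": {"keycloak", "infisical", "tenant-registry"},
    "erpnext": {"mariadb", "redis", "keycloak", "infisical"},
    "frappe-hd": {"mariadb", "redis", "infisical"},
    "stalwart": {"infisical"},
    "certifai": {"mongodb", "keycloak", "infisical"},
    "litellm": {"infisical"},
    "backend-compliance": {"pg-app", "keycloak", "infisical"},
    "ai-compliance-sdk": {"pg-app", "qdrant", "infisical"},
    "stage-app": {"keycloak"},
}


def hard_impact(failed: str) -> set:
    """Services that cannot start, directly or transitively, while `failed` is down."""
    impacted = set()
    frontier = {failed}
    while frontier:
        frontier = {
            svc
            for svc, deps in hard_deps.items()
            if deps & frontier and svc not in impacted
        }
        impacted |= frontier
    return impacted


print(sorted(hard_impact("pg-app")))
```

For `pg-app` this yields `tenant-registry`, `backend-compliance`, `ai-compliance-sdk`, and, through the registry, `customer-portal`. Because ● is defined as "cannot start without", the result describes restart behaviour; an already-running service may only degrade, as Scenario C notes for the portal.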

---

## 12. Open Infrastructure Risks (Priority Order)

```
RISK-1  pg-app (vm-data) is a single instance serving tenant_registry + compliance schemas.
        One crash blocks portal AND compliance product simultaneously.
        → Mitigation: split into pg-registry + pg-compliance at Tier B (200 customers).
          Move pg-registry to its own DBaaS PostgreSQL cluster (€213/mo).
        Priority: HIGH — fix before 100 customers; flagged also in COST_PLAN.md

RISK-2  vm-edge is a single VM owning 3 root dependencies (DNS, auth, secrets).
        Failure = total external outage. Highest blast radius in the system.
        → Mitigation:
          Phase A: cold-standby vm-edge-spare (idle cost €0; tested quarterly)
          Phase B (Tier C, 500 cust): split vm-edge into vm-edge + vm-identity + vm-secrets
        Priority: HIGH

RISK-3  vm-control hosts 5 service groups (portal, tenant-registry, ERPNext, Frappe HD,
        Stalwart). Co-tenant noise risk; one OOM kills the others.
        → Mitigation:
          Phase A: hard Orca resource limits per service (see §6 co-tenant notes)
          Phase B (Tier B): split vm-control → vm-portal + vm-ops at €64/mo extra
        Priority: MEDIUM

RISK-4  Keycloak is a single instance with no clustering.
        Any Keycloak outage = total auth failure within JWT TTL.
        → Mitigation: short-term: tested runbook + 15min RTO target
          long-term: Keycloak active-passive cluster (Phase 2, on split vm-identity)
        Priority: MEDIUM

RISK-5  Stage isolation depends on 5 enforcement controls (see §2 table).
        If any one breaks, stage code can affect prod customers.
        → Mitigation: quarterly red-team verification of each control.
          Especially: Infisical secret-path scoping and Stalwart recipient filter.
        Priority: MEDIUM — easy to forget once it's working

RISK-6  Infisical downtime during multi-VM restart causes delayed cold start.
        → Mitigation: Orca startup ordering + bootstrap secrets for Keycloak only
        Priority: LOW — documented runbook; cold start is rare

RISK-7  ERPNext → Tenant Registry webhook has no guaranteed delivery.
        Failed activation = tenant not active after contract signed.
        → Mitigation: Frappe retry + idempotent /activate endpoint + manual Backstage trigger
        Priority: LOW

RISK-8  LiteLLM calls external AI APIs (OpenAI / Anthropic).
        → Mitigation: LiteLLM fallback routing; products degrade gracefully.
        Priority: LOW — external dependency, by design
```

---

## 13. Growth Trajectory — when to add VMs

The locked 4-VM topology is right for roughly 5 to 200 customers. Past that, expect to add VMs back in this order:

```
Tier A (5–200 cust):   4 VMs as locked €192/mo compute (36M upfront)
        ↓
Tier B (200–500):      Bump vm-data m2.med → m2.large +€64/mo
                       Add cold-standby vm-edge-spare +€0 (idle, paid only on swap)
        ↓
Tier C (500–1000):     Split vm-data: vm-data + vm-data-db +€64/mo
                       (postgres-app moves to its own VM, or DBaaS cluster +€213/mo)
                       Split vm-control: vm-control + vm-ops +€64/mo
                       (ERPNext + MariaDB + Stalwart move to vm-ops)
        ↓
Tier D (1000–2000):    Split vm-edge: vm-edge + vm-identity + vm-secrets +€96/mo
                       HA Keycloak active-passive on 2× vm-identity +€32/mo
                       Octavia Load Balancer Double Instance +€58/mo
                       vm-data m2.large → m2.xlarge or 2× +€128–256/mo
        ↓
Final topology ≈ 8 prod VMs + DBaaS
```

Each step is justified by a measurable signal (>80% RAM, >70% CPU, sustained queue depth, or a specific outage scenario). Never split preemptively.
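
The "measurable signal" rule can be expressed as a small decision function. The sketch below covers only the RAM and CPU thresholds named above and the vm-data flavor ladder from the §6 co-tenant note (`m2.medium → m2.large → m2.xlarge`). It assumes the inputs are sustained averages rather than momentary spikes; queue depth and outage-driven splits are not modelled, and the function name is hypothetical.

```python
# Flavor ladder for vm-data, from the Data Plane co-tenant note in §6.
FLAVOR_LADDER = ["m2.medium", "m2.large", "m2.xlarge"]

RAM_THRESHOLD_PCT = 80
CPU_THRESHOLD_PCT = 70


def next_flavor(current: str, ram_pct: float, cpu_pct: float) -> str:
    """Return the next flavor if a threshold is exceeded, otherwise the current one.

    ram_pct and cpu_pct are assumed to be sustained utilisation averages.
    """
    if ram_pct <= RAM_THRESHOLD_PCT and cpu_pct <= CPU_THRESHOLD_PCT:
        return current  # no signal: never scale preemptively
    index = FLAVOR_LADDER.index(current)
    return FLAVOR_LADDER[min(index + 1, len(FLAVOR_LADDER) - 1)]


print(next_flavor("m2.medium", ram_pct=85, cpu_pct=40))
print(next_flavor("m2.medium", ram_pct=60, cpu_pct=50))
```

Once the ladder is exhausted the function keeps returning `m2.xlarge`, which is the point at which the tier table calls for splitting the VM instead of resizing it.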

---

## 14. Cost summary (see COST_PLAN.md for full breakdown)

| Mode | Compute €/mo | Storage €/mo | Network €/mo | Total net | + 19% VAT |
|---|---:|---:|---:|---:|---:|
| On-Demand | 434.50 | 112 | 2.92 | 549.42 | 653.81 |
| 12-month commit | 295.20 | 112 | 2.92 | 410.12 | 488.04 |
| 36-month no upfront | 216.00 | 112 | 2.92 | 330.92 | 393.79 |
| 36-month upfront | 192.00 | 112 | 2.92 | 306.92 | 365.23 |

Plus €6,912 net one-time payment (36 × €192) if signing 36M-upfront for the compute portion.
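
The table's totals can be reproduced from its own columns. The sketch below recomputes the net and gross totals with `decimal.Decimal` to avoid floating-point rounding surprises, and checks the upfront payment as 36 months of the €192 compute rate. Variable and function names are illustrative.

```python
from decimal import Decimal, ROUND_HALF_UP

STORAGE = Decimal("112")
NETWORK = Decimal("2.92")
VAT_RATE = Decimal("0.19")

# Monthly compute cost per pricing mode, from the table above.
compute = {
    "On-Demand": Decimal("434.50"),
    "12-month commit": Decimal("295.20"),
    "36-month no upfront": Decimal("216.00"),
    "36-month upfront": Decimal("192.00"),
}


def totals(compute_eur: Decimal) -> tuple:
    """Return (net, gross) monthly totals in EUR, gross rounded to cents."""
    net = compute_eur + STORAGE + NETWORK
    gross = (net * (1 + VAT_RATE)).quantize(Decimal("0.01"), rounding=ROUND_HALF_UP)
    return net, gross


for mode, eur in compute.items():
    net, gross = totals(eur)
    print(f"{mode}: net {net}, gross {gross}")

print("36M upfront payment:", compute["36-month upfront"] * 36)
```

All four rows match the table, and 36 × €192.00 gives the €6,912 one-time figure.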

---

*End of document. Review quarterly or after any significant infrastructure change. Topology last locked 2026-05-18.*
@@ -0,0 +1,24 @@
Copyright (c) 2026 Sharang Parnerkar and Benjamin Boenisch
All rights reserved.

This source code, documentation, and all accompanying materials in this
repository (the "Software") are the proprietary and confidential property
of Sharang Parnerkar and Benjamin Boenisch (the "Copyright Holders").

No part of the Software may be copied, reproduced, modified, merged,
published, distributed, sublicensed, sold, or used in any form or by any
means -- electronic, mechanical, or otherwise -- without the prior written
permission of the Copyright Holders.

Access to this repository does not constitute a license. Cloning, forking,
or otherwise obtaining a copy of the Software does not grant any right to
use, run, or redistribute it.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE, AND NON-INFRINGEMENT.

Unauthorized use, reproduction, or disclosure is strictly prohibited and
may result in civil and/or criminal penalties under applicable law.

For licensing inquiries, contact: sharang@meghsakha.com
File diff suppressed because it is too large
File diff suppressed because it is too large
@@ -1,3 +1,59 @@
# docs

Platform-wide architecture, integration spec, runbooks.

> Part of the **Breakpilot Platform**. For the big picture see [`platform/docs`](https://gitea.meghsakha.com/platform/docs):
> [Architecture](https://gitea.meghsakha.com/platform/docs/src/branch/main/PLATFORM_ARCHITECTURE.md) ·
> [Infrastructure](https://gitea.meghsakha.com/platform/docs/src/branch/main/INFRASTRUCTURE.md) ·
> [Product Integration Spec](https://gitea.meghsakha.com/platform/docs/src/branch/main/PRODUCT_INTEGRATION_SPEC.md) ·
> [Implementation Plan](https://gitea.meghsakha.com/platform/docs/src/branch/main/IMPLEMENTATION_PLAN.md)

## What this is

Platform-wide architecture, integration spec, runbooks. Scaffolded under milestone M0.1. See [`platform/docs`](https://gitea.meghsakha.com/platform/docs) for the full architecture context.

**Plane:** Control
**Owner:** @sharang
**Status:** pre-alpha
**Linked milestone:** [M0.1](https://gitea.meghsakha.com/platform/docs/src/branch/main/IMPLEMENTATION_PLAN.md)

## Run locally

```bash
# prerequisites: see CONTRIBUTING.md for tooling once code lands
make dev # starts dependencies + this service on http://localhost:3000
make test # unit + integration
make e2e # only if this repo ships user-facing flows
```

Local secrets come from `.env.local` (gitignored). Template at `.env.example`.

## Endpoints / surface

{{For services: list the top-level routes or commands.
For libraries: list the public API entry points.
For IaC: list the make targets.}}

## Deployment

| Env | URL | How |
|---|---|---|
| dev | `http://localhost:3000` | `make dev` |
| stage | `https://docs.stage.yourplatform.com` | auto on merge to `main` |
| prod | `https://docs.yourplatform.com` | manual: tag `vX.Y.Z` + sign-off |

Rollback: `orca rollout undo docs --env={{env}}`.

## Observability

- Traces, logs, metrics: [SigNoz](https://signoz.meghsakha.com) — service name `docs`
- Audit events: Tenant Registry `/audit` (Retraced-shape schema)
- On-call: `oncall@yourplatform.com` · runbook at `platform/docs/runbooks/docs.md`

## Contributing

See [`CONTRIBUTING.md`](./CONTRIBUTING.md). TL;DR: branch from main, open a PR, 1 review + green CI, squash-merge.

## License

Proprietary — all rights reserved. Copyright (c) 2026 Sharang Parnerkar and Benjamin Boenisch. See [`LICENSE`](./LICENSE).
@@ -0,0 +1,39 @@
# git-cliff config — generates release notes from Conventional Commits.
# Preset: keepachangelog.

[changelog]
header = """
# Changelog

All notable changes to this repo. Format: [Keep a Changelog](https://keepachangelog.com/).
"""
body = """
{% if version %}\
## [{{ version | trim_start_matches(pat="v") }}] - {{ timestamp | date(format="%Y-%m-%d") }}
{% else %}\
## [Unreleased]
{% endif %}\
{% for group, commits in commits | group_by(attribute="group") %}
### {{ group | upper_first }}
{% for commit in commits %}
- {{ commit.message | upper_first }}\
{% endfor %}
{% endfor %}
"""
trim = true

[git]
conventional_commits = true
filter_unconventional = true
commit_parsers = [
  { message = "^feat", group = "Added" },
  { message = "^fix", group = "Fixed" },
  { message = "^perf", group = "Changed" },
  { message = "^refactor", group = "Changed" },
  { message = "^docs", group = "Docs" },
  { message = "^chore", skip = true },
  { message = "^ci", skip = true },
  { message = "^test", skip = true },
]
filter_commits = true
tag_pattern = "v[0-9]*"
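
To see which commits end up in which changelog section, the parser table above can be mimicked in a few lines. This is a simplified sketch, not git-cliff's implementation: it assumes the first matching rule wins and that, with `filter_commits = true`, commits matching no rule are dropped. Under that assumption `build` and `revert` commits never reach the changelog, because no parser lists them. The function names are hypothetical.

```javascript
// Mirrors the commit_parsers table from the config above.
const commitParsers = [
  { pattern: /^feat/, group: 'Added' },
  { pattern: /^fix/, group: 'Fixed' },
  { pattern: /^perf/, group: 'Changed' },
  { pattern: /^refactor/, group: 'Changed' },
  { pattern: /^docs/, group: 'Docs' },
  { pattern: /^chore/, skip: true },
  { pattern: /^ci/, skip: true },
  { pattern: /^test/, skip: true },
];

// Returns the changelog group for a commit message, or null if it is
// skipped explicitly or matches no parser (assumed filter_commits behavior).
function classify(message) {
  const rule = commitParsers.find((parser) => parser.pattern.test(message));
  if (!rule || rule.skip) return null;
  return rule.group;
}

// Groups messages by changelog section, preserving input order.
function groupCommits(messages) {
  const groups = {};
  for (const message of messages) {
    const group = classify(message);
    if (group === null) continue;
    if (!groups[group]) groups[group] = [];
    groups[group].push(message);
  }
  return groups;
}

console.log(groupCommits([
  'feat: Add tenant switcher',
  'fix: Handle empty slug',
  'perf: Cache registry lookups',
  'chore: Bump dependencies',
  'build: Update Dockerfile',
]));
```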
@@ -0,0 +1,32 @@
// commitlint.config.cjs — Conventional Commits enforcement for every repo.
// Used by .gitea/workflows/ci-*.yaml `wagoid/commitlint-github-action`.

module.exports = {
  extends: ['@commitlint/config-conventional'],
  rules: {
    'type-enum': [2, 'always', [
      'feat',     // new feature
      'fix',      // bug fix
      'docs',     // documentation
      'chore',    // tooling, deps, no production code change
      'refactor', // refactor with no behavior change
      'test',     // tests only
      'perf',     // performance
      'build',    // build system, Dockerfile
      'ci',       // CI config
      'revert',   // revert a prior commit
    ]],
    'subject-case': [2, 'always', 'sentence-case'],
    'subject-max-length': [2, 'always', 72],
    'body-max-line-length': [1, 'always', 100],
    'footer-leading-blank': [2, 'always'],
    'references-empty': [1, 'never'], // warn if no Refs: M1.2 footer
  },
  parserPreset: {
    parserOpts: {
      // Capture milestone references: "Refs: M5.2" or "Closes: M5.2"
      referenceActions: ['close', 'closes', 'closed', 'fix', 'fixes', 'fixed', 'refs', 'ref'],
      issuePrefixes: ['M', '#'],
    },
  },
};
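
To make the rules above concrete, here is a rough stand-alone approximation of three of them (`type-enum`, `subject-case`, `subject-max-length`) plus the `references-empty` warning. It is a sketch for illustration only and does not reproduce commitlint's parser or rule engine: the regular expressions, the "starts with an uppercase letter" test for sentence case, and the `checkCommit` helper are all assumptions.

```javascript
// Simplified approximation of the config above; not commitlint itself.
const ALLOWED_TYPES = ['feat', 'fix', 'docs', 'chore', 'refactor', 'test', 'perf', 'build', 'ci', 'revert'];
const HEADER = /^(\w+)(?:\(([^)]+)\))?!?: (.+)$/; // type(scope)!: subject
const REFERENCE = /^(?:close[sd]?|fix(?:es|ed)?|refs?):? (?:M|#)[\w.-]*\d+/im;
const SUBJECT_MAX_LENGTH = 72;

function checkCommit(message) {
  const [header, ...rest] = message.split('\n');
  const problems = []; // level 2 rules (errors)
  const warnings = []; // level 1 rules (warnings)

  const match = HEADER.exec(header);
  if (!match) return { problems: ['header-format'], warnings };

  const [, type, , subject] = match;
  if (!ALLOWED_TYPES.includes(type)) problems.push('type-enum');
  if (!/^[A-Z]/.test(subject)) problems.push('subject-case');
  if (subject.length > SUBJECT_MAX_LENGTH) problems.push('subject-max-length');

  // Look for a reference footer such as "Refs: M1.2" after the header.
  if (!REFERENCE.test(rest.join('\n'))) warnings.push('references-empty');

  return { problems, warnings };
}

console.log(checkCommit('feat(portal): Add tenant switcher\n\nRefs: M1.2'));
console.log(checkCommit('Feature: add thing'));
```

A message such as `feat(portal): Add tenant switcher` followed by a blank line and `Refs: M1.2` passes every check in this sketch; dropping the footer only produces a warning, which matches the level 1 severity configured for `references-empty`.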