Product Integration Specification

Status: Design Draft
Authors: Sharang, Benjamin
Date: 2026-05-11
Companion docs: PLATFORM_ARCHITECTURE.md, INFRASTRUCTURE.md
Contract version: 1.0


1. Purpose

This document defines the contract that every product must implement to be sold on the platform as a B2B building block. The contract is enforced by the Tenant Registry, Customer Portal, and Orca deployment pipeline. A product that does not implement the contract cannot be activated for a tenant.

The contract is designed to be first-party today, third-party-ready tomorrow — the technical surface is identical for our own products and any future external developers, with stricter verification gates for the latter.


2. Core Principles

1. ONE TENANT, ONE TRUTH
   Every request is scoped by org_id from the JWT. Cross-tenant data leakage is the
   single largest commercial risk; the spec treats it as a contract violation.

2. PLATFORM OWNS IDENTITY, BILLING, ROUTING
   Products NEVER implement their own login, never store passwords, never invoice
   customers directly. These are platform concerns.

3. PRODUCTS OWN THEIR DOMAIN AND DATA
   Products own their database, their data model, their backup, their RTO/RPO.
   No cross-product database sharing. Composition is via APIs, not via DB joins.

4. STATELESS APPLICATIONS, STATEFUL DATA STORES
   Application containers are replaceable in seconds. State lives in databases
   that have explicit backup contracts.

5. CONTRACT EVOLVES, PRODUCTS DECLARE COMPATIBILITY
   Products declare which contract version they implement. The platform supports
   N and N-1; deprecation is announced before removal.

3. Required Surfaces

A product is composed of five surfaces. Three are mandatory, one is tier-gated, one is mandatory documentation.

┌──────────────────────┬──────────────────────────────────────────┬───────────────┐
│ Surface              │ What                                     │ Requirement   │
├──────────────────────┼──────────────────────────────────────────┼───────────────┤
│ Backend API          │ REST + OpenAPI 3.0 spec                  │ REQUIRED      │
│ Frontend             │ Web component (custom element)           │ REQUIRED      │
│ MCP Server           │ MCP server exposing tenant-scoped tools  │ REQUIRED for  │
│                      │                                          │ Enterprise;   │
│                      │                                          │ optional for  │
│                      │                                          │ Starter/Pro   │
│ Documentation        │ README, API docs, integration guide,     │ REQUIRED      │
│                      │ runbook, data model, GDPR retention      │               │
│ Observability        │ /health, /metrics, structured logs,      │ REQUIRED      │
│                      │ audit event emission                     │               │
└──────────────────────┴──────────────────────────────────────────┴───────────────┘

4. Backend API Contract

4.1 Mandatory Endpoints

Every product backend must implement these endpoints. The Tenant Registry health-checks them on deploy; any missing endpoint blocks the registration.

GET    /health
       Returns 200 if healthy, 503 if unhealthy.
       Body: {"status": "ok"|"degraded"|"down", "checks": {"db": "ok", "deps": "ok"}}
       Authentication: NONE (Orca probe)

GET    /version
       Returns product version and contract version it implements.
       Body: {"product": "certifai", "version": "1.4.2", "contract": "1.0",
              "build": "<git sha>", "deployed_at": "2026-05-10T..."}
       Authentication: NONE

GET    /v1/usage
       Query: ?tenant_id=<uuid>&from=<iso>&to=<iso>&project_id=<uuid>
       Returns billing-relevant usage metrics for a tenant (and optional project).
       Body: {"tenant_id": "...", "project_id": "...", "period": {...},
              "metrics": {"seats_active": 12, "api_calls": 14203, ...}}
       Authentication: SERVICE TOKEN (called by billing job)
       Note: products with high-cardinality usage (LLM tokens, etc.) SHOULD also
       stream per-event metering to /internal/usage/events on Tenant Registry.
       Event shape is Lago-compatible (transaction_id, code, external_subscription_id,
       properties) so we can swap to a Lago instance later without changing producers.

POST   /v1/tenants/{id}/provision
       Body: {"plan": "...", "config": {...}, "contract_version": "1.0"}
       Initializes tenant-specific resources (schemas, default data, queues).
       Must be idempotent: a second call with the same parameters is a no-op.
       Authentication: SERVICE TOKEN (called by Tenant Registry)

POST   /v1/tenants/{id}/suspend
       Soft-suspend: data retained, all customer access blocked.
       Authentication: SERVICE TOKEN

POST   /v1/tenants/{id}/reactivate
       Reverse of suspend.
       Authentication: SERVICE TOKEN

POST   /v1/tenants/{id}/terminate
       Hard terminate: schedules data for erasure per retention policy.
       Body: {"reason": "...", "scheduled_erasure_at": "..."}
       Authentication: SERVICE TOKEN

POST   /v1/tenants/{id}/export
       GDPR Article 20 (data portability) export for a tenant.
       Returns: signed URL to a ZIP containing all tenant data in JSON + binary blobs.
       Authentication: USER JWT (IT_ADMIN or LEGAL role)

DELETE /v1/tenants/{id}/data
       GDPR Article 17 (right to erasure) full deletion.
       Body: {"confirm": "<tenant_slug>"}  ← safety check
       Authentication: SERVICE TOKEN + USER JWT (IT_ADMIN signing off)
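
The idempotency requirement on /provision can be sketched as follows. This is a minimal in-memory illustration, not a reference implementation: `TenantStore`, its `provision` method, and the returned status strings are hypothetical names, and the "conflict" branch is an assumption because the contract does not say what a second call with different parameters should do.

```python
from dataclasses import dataclass, field


@dataclass
class TenantStore:
    """Hypothetical in-memory stand-in for a product's tenant table."""
    tenants: dict = field(default_factory=dict)

    def provision(self, tenant_id: str, plan: str, config: dict) -> str:
        requested = {"plan": plan, "config": config}
        existing = self.tenants.get(tenant_id)
        if existing == requested:
            # Same parameters as the earlier call: idempotent no-op.
            return "unchanged"
        if existing is not None:
            # Assumption: refuse rather than silently overwrite a tenant
            # that was provisioned with different parameters.
            return "conflict"
        self.tenants[tenant_id] = requested
        return "created"
```

A real handler would create schemas, default data, and queues in the "created" branch, ideally inside one transaction so that a retry after a partial failure converges to the same state.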

4.2 Authentication Modes

USER JWT          Bearer token issued by Keycloak for a user session.
                  Contains: sub, org_id, org_roles, products, plan.
                  Validated via Keycloak JWKS endpoint.
                  Used for: all customer-facing endpoints.

SERVICE TOKEN     Short-lived JWT issued by Keycloak via OAuth 2.0
                  client_credentials flow.
                  Each service has a Keycloak client (id: certifai-svc,
                  compliance-svc, etc.) with declared scopes.
                  Used for: platform-to-product calls (provisioning),
                  product-to-product calls (inter-product API).
                  TTL: 15 minutes max.

ORCA PROBE        No authentication. Local network only.
                  Used for: /health, /version (Orca polls these).
                  Must not leak tenant data.

4.3 Tenant Scoping Rules

1. EVERY non-probe endpoint extracts tenant context from JWT or path.
   USER JWT → tenant = jwt.org_id
   SERVICE TOKEN → tenant = path parameter, validated against service scopes

2. EVERY query to the product database includes WHERE tenant_id = $1.
   No exceptions. Code review enforces this; tests verify it.

3. EVERY response includes only data for the requested tenant.
   The product asserts this invariant in middleware (defense in depth).

4. EVERY log line and audit event includes tenant_id as a structured field.
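
Rules 1 and 2 above can be sketched in a few lines. The example assumes an SQL store (sqlite3 stands in for the product database) and a hypothetical `allowed_tenants` claim on the service token; the contract only says the path parameter is "validated against service scopes" and does not define how scopes encode tenants.

```python
import sqlite3
from typing import Optional


def resolve_tenant(auth_mode: str, claims: dict, path_tenant_id: Optional[str]) -> str:
    """Return the tenant for a request, following rule 1."""
    if auth_mode == "user_jwt":
        # USER JWT: the tenant always comes from the token, never from the path.
        return claims["org_id"]
    if auth_mode == "service_token":
        # SERVICE TOKEN: the tenant comes from the path and must be covered by
        # the token. "allowed_tenants" is a hypothetical claim for illustration.
        allowed = claims.get("allowed_tenants", [])
        if path_tenant_id is None or path_tenant_id not in allowed:
            raise PermissionError("service token not scoped for this tenant")
        return path_tenant_id
    raise PermissionError("unauthenticated request")


def list_agents(db: sqlite3.Connection, tenant_id: str) -> list:
    # Rule 2: every query carries the tenant filter as a bound parameter.
    rows = db.execute(
        "SELECT name FROM agents WHERE tenant_id = ? ORDER BY name", (tenant_id,)
    )
    return [name for (name,) in rows]
```

Note that a user JWT for tenant A combined with a path that names tenant B still resolves to A, so a crafted URL cannot widen the scope.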

4.4 OpenAPI Spec

Every product publishes openapi.yaml at /openapi.yaml. The Tenant Registry pulls this on product registration and validates that the mandatory endpoints from §4.1 are present with correct signatures.

Product OpenAPI must:
  - Be valid OpenAPI 3.0 (3.1 not yet — tooling gap)
  - Include all mandatory endpoints from §4.1
  - Document all custom endpoints with examples
  - Declare authentication mode for each endpoint
  - Declare scopes consumed (for SERVICE TOKEN endpoints)
  - Include error response schemas (4xx, 5xx)
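
The presence check that the Tenant Registry runs on registration might look roughly like the sketch below. It operates on an already-parsed OpenAPI document (a plain dict) and only checks that each mandatory method and path exists; `missing_endpoints` is a hypothetical helper name, and matching on the literal `{id}` placeholder is a simplifying assumption.

```python
# (method, path) pairs taken from the mandatory endpoint list in §4.1.
MANDATORY_ENDPOINTS = [
    ("get", "/health"),
    ("get", "/version"),
    ("get", "/v1/usage"),
    ("post", "/v1/tenants/{id}/provision"),
    ("post", "/v1/tenants/{id}/suspend"),
    ("post", "/v1/tenants/{id}/reactivate"),
    ("post", "/v1/tenants/{id}/terminate"),
    ("post", "/v1/tenants/{id}/export"),
    ("delete", "/v1/tenants/{id}/data"),
]


def missing_endpoints(openapi_doc: dict) -> list:
    """Return the mandatory (method, path) pairs absent from an OpenAPI document."""
    paths = openapi_doc.get("paths", {})
    return [
        (method, path)
        for method, path in MANDATORY_ENDPOINTS
        if method not in paths.get(path, {})
    ]
```

An empty result means the product passes the presence check; validating "correct signatures" (request and response schemas) would need a further comparison that is not shown here.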

5. Frontend Contract

A product declares its frontend type in the manifest. The portal renders accordingly. Three types are supported:

interactive   Full UI shipped as a web component custom element.
              Customer OPERATES the product through this UI.
              Examples: CERTifAI, breakpilot-compliance, classic SaaS products.

widget        Only a small dashboard tile component; no full product page.
              Customer SEES product output in a tile; deeper management
              happens on a portal-rendered config page.
              Examples: monitoring, status reporting.

headless      No frontend code at all. The portal renders a generic
              management UI from a portal_config block in the manifest.
              Customer CONFIGURES (API keys, webhooks) and their own
              systems consume the product via API/MCP.
              Examples: notetaker bot, document classifier, webhook router.

The portal branches its rendering on manifest.frontend.type. Backend, MCP, observability, and lifecycle contracts are identical across all three types — only the customer-facing surface changes.

5.A Interactive (Web Component)

The frontend is a custom element registered with a product-specific tag name. The Customer Portal loads the bundle and renders the element with attributes passed in by the portal.

5.A.1 Why web components

Our products span Rust/Dioxus, Next.js, Go, Python. Web components are the only framework-agnostic surface that lets all of these ship a frontend without forcing a stack rewrite:

  • CERTifAI compiles Dioxus → WASM → wraps in a custom element
  • breakpilot-compliance wraps React components via @r2wc/react-to-web-component
  • Any future Vue/Svelte/Solid product also works

5.A.2 The Tag Contract

Each product declares ONE primary tag in its manifest. The portal renders it like this:

<certifai-dashboard
  tenant="acme"
  tenant-id="uuid-acme"
  jwt="<short-lived portal-issued JWT>"
  locale="en"
  theme="light"
  api-base="https://certifai-api.internal/v1"
  audit-callback-url="/api/audit"
/>

Attributes the portal passes (the product MUST handle these):

tenant              tenant slug (acme)
tenant-id           tenant UUID (uuid-acme)
jwt                 short-lived JWT (≤ 5 min), product validates against Keycloak JWKS
locale              en / de / fr / es / pt
theme               light / dark
api-base            backend URL the product should call
audit-callback-url  URL to POST audit events to (portal-relative)

5.A.3 Events the Product Must Emit

The component emits events upward via CustomEvent. The portal listens for these and integrates them:

breakpilot:navigate
  Detail: {path: "/sub/route", title: "Page Title"}
  Portal updates browser URL + breadcrumb without reloading.

breakpilot:error
  Detail: {code: "...", message: "...", recoverable: true|false}
  Portal shows toast / blocking error.

breakpilot:audit
  Detail: {action: "...", target: "...", metadata: {...}}
  Portal forwards to central audit log via audit-callback-url.

breakpilot:loading
  Detail: {state: "start"|"end", description: "Generating DSFA..."}
  Portal shows progress indicator.

breakpilot:request-upgrade
  Detail: {feature: "...", required_plan: "enterprise"}
  Portal opens upgrade-quote flow.

5.A.4 Design System Compatibility

The platform publishes @breakpilot/design-tokens (CSS variables, fonts, spacing). Products are encouraged but not required to consume it. The portal injects design tokens into the shadow DOM root so consuming them is a single CSS line:

:host { color: var(--bp-text); background: var(--bp-surface); }

Products that ship custom styling must respect the theme attribute and the prefers-color-scheme media query.

5.A.5 Bundle Loading

Product publishes a bundle at:
  https://cdn.breakpilot.com/products/{name}/{version}/element.js

Portal loads it lazily via dynamic import when the user navigates to /[tenant]/products/{name}.
Portal caches the bundle URL per product version (declared in tenant_products.config).
Bundle size budget: ≤ 500KB gzipped for first load.

5.B Widget

A widget product declares ONE custom element that renders only as a dashboard tile. It receives the same attributes as interactive products but emits no breakpilot:navigate events — clicking the tile takes the user to a portal-rendered config page (same surface as headless products in §5.C).

<status-monitor-widget
  tenant="acme"
  tenant-id="uuid-acme"
  jwt="<short-lived JWT>"
  locale="en"
  theme="light"
  api-base="https://monitor-api.internal/v1"
/>

Constraints:

Bundle size budget       ≤ 50KB gzipped (widgets load eagerly on dashboard)
Dimensions               declared in manifest (e.g., 200×120 or 400×240)
Refresh                  widget polls own API; portal does not push updates
Allowed events           breakpilot:error, breakpilot:audit, breakpilot:request-upgrade
                         (NOT breakpilot:navigate — click-through is portal-controlled)

API keys, webhooks, and full management UI for widget products use the same portal-rendered config page as headless products (§5.C).

5.C Headless

The product ships NO frontend code. The portal renders a generic management UI from a portal_config block in the manifest. This page is served at /[tenant]/products/{name} and contains the same elements regardless of which product it is — populated entirely from manifest data.

5.C.1 What the Portal Renders

┌──────────────────────────────────────────────────────────┐
│  Notetaker                                  [Status: OK] │
│  ────────────────────────────────────────────────────────│
│                                                           │
│  USAGE (last 30 days)                                    │
│  ┌──────────────────────────────────────────────────┐   │
│  │  142 sessions processed   ▁▃▆█▆▄▂▃▅▆▄▂▁▂▃▄▅▆▇█  │   │
│  └──────────────────────────────────────────────────┘   │
│                                                           │
│  API KEYS                                  [+ Generate]  │
│  ────────────────────────────────────────────────────    │
│  • prod-key       k_xxx...4f12  scopes: r,w   2026-01-04│
│  • staging-key    k_xxx...9a83  scopes: r     2026-04-22│
│                                                           │
│  WEBHOOKS                                  [+ Add]       │
│  ────────────────────────────────────────────────────    │
│  • https://acme.example.com/notetaker/cb                 │
│    events: session.completed, session.failed             │
│    last 24h: 142 delivered, 0 failed   [Test]           │
│                                                           │
│  CODE SAMPLES                                            │
│  ────────────────────────────────────────────────────    │
│  [curl] [JS] [Python]                                    │
│  curl -X POST https://notetaker-api.breakpilot.com/v1  │
│    -H "Authorization: ApiKey k_xxx"                      │
│    -H "X-Tenant: acme"                                   │
│    -d '{...}'                                            │
│                                                           │
│  DOCS  ►  developers.breakpilot.com/products/notetaker │
└──────────────────────────────────────────────────────────┘

5.C.2 Manifest Requirement

A headless product must include a portal_config block declaring:

  • sections: which UI sections to render (subset of: api_keys, webhooks, usage, docs, code_samples, custom_actions)
  • webhook_events: the catalog of events the product can emit
  • api_key_scopes: the catalog of scopes that can be granted on a key
  • code_samples: at least one language with a working request example
  • status_endpoint: optional URL for the portal to poll for the status badge

See §10.1 for the full schema.
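
As an illustration of the bullets above, a registry-side check of a portal_config block could be sketched like this. The function name and error strings are hypothetical, and the assumption that `webhook_events` and `api_key_scopes` are lists is ours; the authoritative schema is the one referenced in §10.1.

```python
ALLOWED_SECTIONS = {
    "api_keys", "webhooks", "usage", "docs", "code_samples", "custom_actions",
}


def portal_config_errors(portal_config: dict) -> list:
    """Return human-readable problems found in a headless portal_config block."""
    errors = []
    unknown = sorted(set(portal_config.get("sections", [])) - ALLOWED_SECTIONS)
    if unknown:
        errors.append("unknown sections: " + ", ".join(unknown))
    for key in ("webhook_events", "api_key_scopes"):
        # Assumption: both catalogs are expressed as lists.
        if not isinstance(portal_config.get(key), list):
            errors.append(key + " must be a list")
    if not portal_config.get("code_samples"):
        errors.append("code_samples needs at least one language")
    # status_endpoint is optional, so its absence is not an error.
    return errors
```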

5.C.3 API Keys

API keys are a portal concern, not a product concern. Tenant Registry generates and stores key hashes; the product validates incoming keys against POST /internal/api-keys/verify on Tenant Registry. This means:

  • Key rotation is portal-controlled
  • Scope enforcement is consistent across all headless products
  • Revocation is instant (registry updates a single row)
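
The "store hashes, verify centrally" model can be sketched as below. This is an assumption-laden illustration of what the verify endpoint might do internally: the record layout (`hash`, `tenant_id`, `scopes`, `revoked`) and the use of SHA-256 are hypothetical, and only the `k_` prefix is borrowed from the mock-up in §5.C.1.

```python
import hashlib
import hmac
import secrets


def generate_key():
    """Return (plaintext_key, stored_hash); only the hash is persisted."""
    plaintext = "k_" + secrets.token_urlsafe(32)
    return plaintext, hashlib.sha256(plaintext.encode()).hexdigest()


def verify_key(presented: str, record: dict, required_scope: str):
    """Return the bound tenant_id if the key is valid for the scope, else None."""
    digest = hashlib.sha256(presented.encode()).hexdigest()
    # Constant-time comparison avoids leaking how many characters matched.
    if not hmac.compare_digest(digest, record["hash"]):
        return None
    if record["revoked"] or required_scope not in record["scopes"]:
        return None
    return record["tenant_id"]
```

Because the product never sees stored hashes, flipping `revoked` on the single registry row is enough to cut off a key for every headless product at once.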

5.C.4 Webhooks

The portal owns webhook configuration UI and delivery logging. Products POST event payloads to a portal endpoint (/internal/webhooks/dispatch); the portal handles signing, delivery, retry, dead-letter, and the customer-visible delivery log.

This keeps webhook UX consistent across all headless products and means a product cannot accidentally leak events from one tenant to another's webhook URL.
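
The spec states that the portal signs deliveries but does not fix a scheme. One common approach, shown here purely as an assumption, is an HMAC-SHA256 over a timestamp plus the canonical JSON body, with the receiver rejecting stale timestamps. The function names, the message layout, and the 300-second tolerance are all hypothetical.

```python
import hashlib
import hmac
import json


def sign_payload(secret: bytes, timestamp: int, payload: dict) -> str:
    """Sign "<timestamp>.<canonical JSON>" with HMAC-SHA256 (assumed scheme)."""
    body = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    message = f"{timestamp}.{body}".encode()
    return hmac.new(secret, message, hashlib.sha256).hexdigest()


def verify_signature(secret, timestamp, payload, signature, now, tolerance=300):
    """Receiver-side check: fresh timestamp and matching signature."""
    if abs(now - timestamp) > tolerance:
        return False  # reject stale deliveries to limit replay
    expected = sign_payload(secret, timestamp, payload)
    return hmac.compare_digest(expected, signature)
```

Serializing with sorted keys makes the signature independent of dictionary ordering, which matters when the portal and the customer's receiver use different JSON libraries.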

5.C.5 No Impersonation

Backstage shows no "Impersonate" button for headless products — there is no UI to enter. Debugging is via API call logs, audit events, webhook delivery history, and admin actions declared in the manifest (e.g., "Flush Queue", "Rotate Keys", "Reset State").


6. MCP Server (Required for Enterprise)

6.1 What it is

An MCP (Model Context Protocol) server exposes the product's capabilities as tools that customer-side AI agents can call. The customer's IT Admin configures the MCP endpoint in their AI agent platform (Claude Desktop, Cursor, internal agents, etc.).

6.2 Required Behavior

1. ONE MCP server per product
   Endpoint: https://mcp.{product}.breakpilot.com  (or unified mcp.breakpilot.com/{product})

2. Authentication via SCOPED API KEY
   Customer IT Admin generates API key in /[tenant]/settings/api-keys.
   Key carries tenant_id binding and scopes (read/write per product domain).
   No user JWT for MCP — agents authenticate as the org, not as a user.

3. Tools are tenant-scoped
   Every tool call uses the API key's tenant_id binding.
   Cross-tenant calls are impossible by construction.

4. Tool catalog declared in manifest
   Each tool: name, description, parameters (JSONSchema), required_scopes.

5. Audit every tool call
   Emit breakpilot:audit-equivalent server-side: actor=api_key_id,
   action=tool_name, metadata=parameters.
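
Points 2 through 5 above can be combined into one gate in front of every tool call. The sketch below is illustrative only: `authorize_tool_call` and the shape of the `api_key` dict are hypothetical, and the two tools are taken from the CERTifAI manifest example in §10.

```python
TOOLS = {
    # Tool names and scopes as in the CERTifAI manifest example.
    "list_ai_agents": {"required_scope": "read:agents"},
    "get_llm_usage": {"required_scope": "read:usage"},
}


def authorize_tool_call(api_key: dict, tool_name: str, parameters: dict) -> dict:
    """Check the key's scopes and return the audit record for the call."""
    tool = TOOLS.get(tool_name)
    if tool is None:
        raise KeyError("unknown tool: " + tool_name)
    if tool["required_scope"] not in api_key["scopes"]:
        raise PermissionError("missing scope: " + tool["required_scope"])
    return {
        # The tenant comes from the key binding, never from tool parameters.
        "tenant_id": api_key["tenant_id"],
        "actor": {"id": "api_key:" + api_key["id"], "type": "api_key"},
        "action": tool_name,
        "fields": parameters,
    }
```

Taking `tenant_id` only from the key binding is what makes cross-tenant calls impossible by construction: a `tenant_id` smuggled into the tool parameters ends up in the audit metadata but never selects data.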

6.3 Example Tool Catalogs

CERTifAI MCP tools:
  list_ai_agents              → returns agents configured for this tenant
  get_llm_usage               → returns LiteLLM usage for date range
  run_news_search             → SearXNG search
  list_chat_sessions          → user's chat history

Compliance MCP tools:
  create_dsfa                 → starts a DSFA workflow
  check_tom_status            → returns TOM compliance status
  list_dsr_requests           → returns open Data Subject Requests
  approve_dsfa                → marks DSFA as approved
  list_ai_act_assessments     → returns AI Act assessments

6.4 Activation

Enterprise customers automatically get MCP enabled. Starter/Pro customers see "Available on Enterprise" in the API Keys page. Tenant Registry checks tenant.plan before issuing an MCP API key.


7. Documentation Contract

A product ships five required documents. They are published at developers.breakpilot.com/products/{name}/.

1. README                  What does it do? Value prop in 200 words.
                           Who is the typical user? What are the workflows?

2. API Reference           Auto-generated from openapi.yaml.
                           Hosted via Redoc or Stoplight Elements.

3. Integration Guide       For customer IT teams. How to:
                           - Enable the product on their tenant
                           - Configure SSO and roles
                           - Wire into their workflows
                           - Use the MCP server (if applicable)
                           - Generate and manage API keys

4. Operational Runbook     For us. How to:
                           - Deploy a new version
                           - Roll back
                           - Debug a stuck tenant
                           - Reset tenant state
                           - Investigate slow queries

5. Data Model + GDPR       What data is stored, in which table/collection,
                           personal data category (Art. 9 special category?),
                           retention period, GDPR lawful basis.
                           Used by customer DPOs for their own record of
                           processing activities (GDPR Art. 30).

8. Observability Contract

8.1 Health Check

GET /health    returns:
  200 {"status": "ok", "checks": {...}}                         — all good
  200 {"status": "degraded", "checks": {...}, "reason": "..."}  — degraded but serving
  503 {"status": "down", "checks": {...}, "reason": "..."}      — restart me

Orca polls every 30s. Three consecutive 503 responses trigger an automatic restart.
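
The status-to-HTTP mapping and the restart rule can be sketched as follows. How individual checks roll up into the overall status is not specified, so the "any check down means down" rule here is an assumption, as are the function names.

```python
def health_response(checks: dict):
    """Map per-dependency checks to (http_status, body)."""
    states = set(checks.values())
    if "down" in states:
        status, code = "down", 503
    elif "degraded" in states:
        status, code = "degraded", 200
    else:
        status, code = "ok", 200
    body = {"status": status, "checks": checks}
    if status != "ok":
        failing = sorted(name for name, state in checks.items() if state != "ok")
        body["reason"] = "failing checks: " + ", ".join(failing)
    return code, body


def should_restart(recent_codes: list, threshold: int = 3) -> bool:
    """True once the last `threshold` probe results are all 503."""
    tail = recent_codes[-threshold:]
    return len(tail) == threshold and all(code == 503 for code in tail)
```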

8.2 Metrics

GET /metrics   returns Prometheus exposition format.

Required metrics:
  bp_http_requests_total{method, route, status, tenant_id}
  bp_http_request_duration_seconds{method, route, tenant_id}
  bp_active_tenants_gauge
  bp_db_query_duration_seconds{operation}
  bp_external_api_calls_total{provider, status}  (LLM calls, etc.)

8.3 Structured Logging

All logs are JSON. All log lines include:
  ts             ISO-8601 timestamp
  level          debug|info|warn|error
  service        product name (certifai)
  tenant_id      tenant UUID (or "system" for non-tenant ops)
  user_sub       user UUID if applicable
  request_id     trace ID
  msg            human-readable message
  ...            additional structured fields

No PII in logs (use the PII redaction middleware from breakpilot-core).
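
A formatter producing this shape with the Python standard library might look like the sketch below. `JsonFormatter` is a hypothetical class, not the breakpilot-core implementation, and it does not perform PII redaction; tenant and request context are assumed to arrive through the logging `extra` mechanism.

```python
import json
import logging
from datetime import datetime, timezone

# Python level names mapped onto the four levels the contract allows.
LEVELS = {"WARNING": "warn", "CRITICAL": "error"}


class JsonFormatter(logging.Formatter):
    def __init__(self, service: str):
        super().__init__()
        self.service = service

    def format(self, record: logging.LogRecord) -> str:
        entry = {
            "ts": datetime.fromtimestamp(record.created, tz=timezone.utc).isoformat(),
            "level": LEVELS.get(record.levelname, record.levelname.lower()),
            "service": self.service,
            # Non-tenant operations fall back to "system".
            "tenant_id": getattr(record, "tenant_id", "system"),
            "request_id": getattr(record, "request_id", None),
            "msg": record.getMessage(),
        }
        user_sub = getattr(record, "user_sub", None)
        if user_sub is not None:
            entry["user_sub"] = user_sub
        return json.dumps(entry)
```

A call such as `logger.warning("quota at %d%%", 90, extra={"tenant_id": tenant_id, "request_id": request_id})` then yields one JSON object per line once the formatter is attached to a handler.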

8.4 Audit Events

Audit events go to the central audit log in Tenant Registry. Products emit them via POST to the audit-callback-url passed by the portal (frontend) or directly to Tenant Registry API (backend).

Event format (Retraced-shape — transformable 1:1 if we swap to BoxyHQ Retraced later):
  {
    "tenant_id":  "uuid-acme",        # → Retraced "group.id"
    "project_id": "uuid-prod" | null, # optional sub-tenancy scope
    "product":    "certifai",         # which product emitted
    "actor": {
      "id":   "user-uuid" | "svc:certifai" | "api_key:keyid",
      "type": "user" | "service" | "api_key",
      "name": "alice@acme.com"
    },
    "action":      "dsfa.approve",   # dotted: <domain>.<verb>
    "crud":        "u",              # c|r|u|d
    "target": {
      "id":   "<entity-id>",
      "type": "dsfa" | "llm_config" | ...,
      "name": "<human label>"
    },
    "source_ip":   "1.2.3.4",
    "description": "Alice approved DSFA #42 for Customer Data Processing",
    "fields":      {...},            # additional structured metadata
    "created_at":  "2026-05-11T14:23:01Z"
  }

Mandatory event categories per product:
  config changes          everything in product settings
  data exports            anyone exporting tenant data
  data deletions          erasures and bulk deletes
  permission changes      role grants/revocations within product
  approvals               business-significant approvals (DSFA, etc.)
  cross-product calls     service-token calls into other products (auto-emitted
                          by both caller and callee, with on_behalf_of in fields)

The portal /audit page renders these events filtered by tenant + product + actor + action + time range. The schema is intentionally Retraced-compatible so the storage layer can be swapped without changing producers.
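
A small builder can keep producers honest about the event shape. The sketch below is one possible helper, with `build_audit_event` and its parameter names being hypothetical; it only enforces the two constraints the format states explicitly (the c/r/u/d code and the dotted action) and passes everything else through.

```python
from datetime import datetime, timezone

CRUD_CODES = {"c", "r", "u", "d"}


def build_audit_event(tenant_id, product, actor, action, crud, target,
                      description="", fields=None, project_id=None,
                      source_ip=None, now=None):
    """Assemble an audit event dict in the shape shown above."""
    if crud not in CRUD_CODES:
        raise ValueError("crud must be one of c, r, u, d")
    domain, dot, verb = action.partition(".")
    if not (domain and dot and verb):
        raise ValueError("action must be dotted: <domain>.<verb>")
    # `now` is expected to be a timezone-aware datetime when supplied.
    moment = (now or datetime.now(timezone.utc)).astimezone(timezone.utc)
    return {
        "tenant_id": tenant_id,
        "project_id": project_id,
        "product": product,
        "actor": actor,
        "action": action,
        "crud": crud,
        "target": target,
        "source_ip": source_ip,
        "description": description,
        "fields": fields or {},
        "created_at": moment.strftime("%Y-%m-%dT%H:%M:%SZ"),
    }
```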


9. Plane-by-Plane Integration Requirements

9.1 Identity Plane

[REGISTRATION]
  - Register an OIDC client in Keycloak (id: {product}-client)
    Confidential, client_credentials grant for service tokens,
    authorization_code grant if product has its own UI flows.
  - Declare role mappings in product manifest:
      role_mappings:
        IT_ADMIN: Admin
        LEGAL: Auditor
        FINANCE: ReadOnly
        USER: Member
  - Declare an entitlement key (e.g. "certifai") that goes into JWT products claim.

[RUNTIME]
  - Validate JWT via Keycloak JWKS endpoint (cache JWKS for 5 min).
  - Reject if products claim does not include this product's entitlement key.
  - Reject if iss is not the platform Keycloak.
  - Reject if exp is in the past or nbf is in the future.

[NEVER]
  - Never validate JWT against a static secret. JWKS only.
  - Never issue tokens. Never accept passwords. Never store credentials.
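
The [RUNTIME] claim checks can be sketched as below. Signature verification against the Keycloak JWKS is assumed to have been done already by a JWT library and is deliberately not shown; `check_claims` is a hypothetical helper and the issuer URL in the usage note is a placeholder.

```python
import time


def check_claims(claims: dict, issuer: str, entitlement_key: str, now=None) -> None:
    """Raise PermissionError unless the runtime claim checks pass.

    Covers claims only; the token signature must already be verified.
    """
    now = time.time() if now is None else now
    if claims.get("iss") != issuer:
        raise PermissionError("unexpected issuer")
    if "exp" not in claims or claims["exp"] <= now:
        raise PermissionError("token expired")
    if claims.get("nbf", 0) > now:
        raise PermissionError("token not yet valid")
    if entitlement_key not in claims.get("products", []):
        raise PermissionError("tenant is not entitled to this product")
```

For example, `check_claims(claims, "https://keycloak.example/realms/platform", "certifai")` returns silently for a valid token and raises otherwise, so it can sit directly in request middleware.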

9.2 Control Plane

[REGISTRATION]
  - On first deploy, product POSTs to Tenant Registry:
      POST /catalog/products
        body: manifest (see §10)
    Tenant Registry verifies the manifest, pulls openapi.yaml, validates
    mandatory endpoints, registers the product.

  - Product appears in Backstage product picker when creating sales orders.

[LIFECYCLE]
  - On tenant.activate: Tenant Registry calls product /v1/tenants/{id}/provision
  - On tenant.suspend:  calls /suspend
  - On tenant.churn:    calls /terminate
  - On contract.renew:  no call (idempotent: just stays active)

[USAGE METERING]
  - Tenant Registry runs a daily job hitting product /v1/usage for billing.
  - Product is responsible for accurate metering and idempotent reporting.

[BACKSTAGE ACTIONS]
  - Product declares custom admin actions in manifest:
      admin_actions:
        - name: "Rebuild RAG Index"
          endpoint: POST /v1/tenants/{id}/admin/rebuild-rag
          confirm: required
          plane: data
  - Backstage renders these as buttons on /backstage/tenants/{id}/products/{name}
  - Calls are SERVICE TOKEN authenticated and audit-logged.

[AUDIT EVENTS]
  - Product POSTs all audit events to Tenant Registry /audit endpoint.
  - Tenant Registry stores them in audit_log table for cross-product unified view.

9.3 Data Plane

[DATA OWNERSHIP]
  - Product owns its database. No other service queries it directly.
  - Cross-product composition is via the inter-product service-token API (§11),
    never via shared DB connections.

[ISOLATION]
  - Every table/collection has a tenant_id (or org_id) column.
  - Every query filters by it.
  - Database user permissions cannot bypass it.

[PROJECT SCOPING — OPTIONAL]
  - Products MAY support sub-tenancy via projects (mirrors GCP Project /
    AWS Account pattern). Allows customers to separate dev / staging / prod
    or per-team data within a single tenant.
  - Declared in manifest:
      data:
        supports_projects: true
  - Implementation:
      - All tenant-scoped tables/collections add project_id column.
      - Compound unique constraints become (tenant_id, project_id, key).
      - All endpoints accept optional ?project_id=<uuid>; absence means
        the tenant's default implicit project.
      - JWT may carry an active project_id claim; products SHOULD respect it
        if present.
  - Reference implementation: breakpilot-compliance already uses this pattern
    (sdk_states UNIQUE on (tenant_id, project_id) since March 2026).
  - Products that do NOT support projects must still gracefully ignore
    project_id parameters (return tenant-wide data).

[TENANT LIFECYCLE CONTRACT]
  Products MUST honor the tenant.status passed in the JWT (`tenant_status`
  custom claim) and behave per the table below. See PLATFORM_ARCHITECTURE.md
  P15 + P16 for the full state machine.

  ┌──────────┬───────────────────────────────────────────────────────────┐
  │ status   │ Product behavior                                          │
  ├──────────┼───────────────────────────────────────────────────────────┤
  │ demo     │ Accept all calls. Apply NO billing meter. Honor           │
  │          │ /v1/tenants/demo/reset (idempotent). Seed from            │
  │          │ catalog.demo.seed_data_url. Audit emitted but tagged      │
  │          │ {"demo": true} so portal can hide from real audit.        │
  ├──────────┼───────────────────────────────────────────────────────────┤
  │ trial    │ Accept all calls up to catalog.trial_quota; over quota    │
  │          │ return 429 with header X-Trial-Limit-Reset. Show "Trial"  │
  │          │ context in any product UI banner area provided by host.   │
  ├──────────┼───────────────────────────────────────────────────────────┤
  │ active   │ Normal operation.                                         │
  ├──────────┼───────────────────────────────────────────────────────────┤
  │ frozen   │ Per data.frozen_behavior in manifest (typically reads     │
  │          │ allowed, writes return 402, background jobs paused).      │
  │          │ /export MUST work; webhook deliveries MUST stop.          │
  ├──────────┼───────────────────────────────────────────────────────────┤
  │ archived │ All API calls return 410 Gone. Data already deleted by    │
  │          │ the offboarding step; this state is for audit only.       │
  └──────────┴───────────────────────────────────────────────────────────┘

  Products MUST implement:
    - POST /v1/tenants/{id}/export
        Returns one ZIP per tenant containing every format declared in
        data.offboarding_export_formats. Synchronous OK if <60s; async
        with signed URL otherwise.
    - DELETE /v1/tenants/{id}/data
        Removes all tenant data within 30 days. Audit log retained
        separately (see §8.4). Idempotent.
    - POST /v1/tenants/demo/reset
        Restores seed data. Callable only with the portal service token.
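
The status table above can be read as a small decision function that runs before any handler. This sketch encodes the "typical" frozen behavior (reads allowed, writes 402); a real product would follow its own data.frozen_behavior declaration. `decide` is a hypothetical name, and returning 0 to mean "proceed" is just a convention for the example.

```python
def decide(status: str, method: str, path: str, over_trial_quota: bool = False) -> int:
    """Return the HTTP status to answer with, or 0 to let the request proceed."""
    is_write = method.upper() not in ("GET", "HEAD", "OPTIONS")
    if status == "archived":
        return 410  # all API calls return 410 Gone
    if status == "frozen":
        if path.endswith("/export"):
            return 0  # export must keep working while frozen
        return 402 if is_write else 0
    if status == "trial" and over_trial_quota:
        return 429  # caller should also set X-Trial-Limit-Reset
    if status in ("demo", "trial", "active"):
        return 0
    raise ValueError("unknown tenant status: " + status)
```

Checking the export path before the write check matters because an export request that uses a non-GET method would otherwise be rejected with 402 while the tenant is frozen.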

[BACKUP CONTRACT]
  - Product declares in manifest:
      backup:
        data_stores: [postgres, qdrant, minio]
        rpo: 6h
        rto: 30min
        retention_days: 30
  - Infra Plane executes backups per declaration (pg_dump, etc.).
  - Product publishes restore procedure in operational runbook.

[GDPR ENDPOINTS]
  - /v1/tenants/{id}/export returns ALL data for the tenant (JSON + blobs in ZIP).
  - DELETE /v1/tenants/{id}/data deletes everything within 30 days of call.
  - Both endpoints emit audit events.

[DATA RESIDENCY]
  - All data stays in EU (database, object storage, cache).
  - Product declares any external data flows (e.g., LLM calls to OpenAI EU endpoint)
    in the data model documentation.

9.4 Infra Plane

[IaC]
  - Orca manifest at: /orca/manifests/{vm}/{product}.toml
  - Manifest declares: image, resource limits, health check, secret refs,
    network rules, replicas, restart policy.
  - Changes go through Gitea PR → Gitea Actions → Orca apply.

[SECRETS]
  - All secrets via Infisical machine identity.
  - Secret path namespacing: /prod/{product}/{KEY}
  - Manifest references paths, never values:
      secrets:
        DB_URL: /prod/certifai/MONGODB_URI
        LLM_KEY: /prod/certifai/LITELLM_MASTER_KEY
  - Bootstrap secrets (DB URIs for Keycloak only) are the lone exception.

[NETWORKING]
  - Product services bind only to the private network.
  - Public-facing routes pass through Orca-Proxy.
  - Inter-product calls use internal DNS names (e.g., certifai.internal:8080).

[BUILD + DEPLOY]
  - Dockerfile in product repo root.
  - Gitea Actions pipeline:
      fmt → lint → test → build → push → orca apply → e2e
  - Image tagged with git SHA + semver.

[COLD START]
  - Product declares startup dependencies in manifest:
      depends_on: [keycloak, postgres-app, infisical]
  - Orca enforces ordering on full restart (see INFRASTRUCTURE.md §10 Scenario F).

10. Product Manifest

The manifest is the canonical declaration of a product, used by Tenant Registry, Orca, and Backstage. It is a single file, committed to the product repo and applied via the deployment pipeline.

# product.manifest.yaml
schema_version: "1.0"

product:
  id: certifai
  name: "CERTifAI"
  description: "Self-hosted GDPR-compliant AI infrastructure dashboard"
  vendor: breakpilot          # first-party; future third parties will use their own slug
  contract_version: "1.0"
  product_version: "1.4.2"
  repo: git.breakpilot.com/sharang/certifai

catalog:
  # Renders in /[tenant]/catalog and /backstage/products
  category: "AI Infrastructure"     # AI Infrastructure | Compliance | Productivity | Security | Data
  tagline: "GDPR-compliant LLMs without leaving the EU"
  hero_image: https://cdn.breakpilot.com/products/certifai/hero.png
  screenshots:
    - https://cdn.breakpilot.com/products/certifai/dashboard.png
    - https://cdn.breakpilot.com/products/certifai/agents.png
  pricing_summary: "From €X/seat/month — included on Professional and Enterprise plans"
  available_on_plans: [trial, professional, enterprise]   # 'trial' opt-in for self-serve
  trial_days: 14
  trial_quota:                       # caps applied while tenant.status == trial
    llm_tokens_per_day: 100000
    api_calls_per_day:  10000
  works_well_with: [compliance]     # cross-product affinity; surfaced in catalog
  depends_on_products: []           # hard dependencies (rare; for compositions)

  demo:
    supported:    true               # MUST be true unless explicitly waived
    seed_data_url: https://cdn.breakpilot.com/products/certifai/demo/seed-v3.tar.gz
    reset_endpoint: /v1/tenants/demo/reset    # called nightly by portal cron
    persona_hints:                   # for sales rep talk track
      - "GDPR officer at a 200-person SaaS"
      - "CTO replacing OpenAI calls with EU-hosted LLMs"

identity:
  oidc_client_id: certifai-client
  entitlement_key: certifai
  role_mappings:
    IT_ADMIN: Admin
    CXO: Member
    FINANCE: Viewer
    LEGAL: Viewer
    USER: Member
  required_scopes:
    - read:agents
    - write:agents
    - read:usage

frontend:
  type: interactive          # interactive | widget | headless
  tag: certifai-dashboard
  bundle_url: https://cdn.breakpilot.com/products/certifai/{version}/element.js
  bundle_size_kb: 380
  routes:
    - path: /
      label: "Dashboard"
    - path: /agents
      label: "AI Agents"
      required_role: Member
    - path: /providers
      label: "Providers"
      required_role: Admin

backend:
  openapi_url: /openapi.yaml
  base_url: https://certifai-api.internal/v1
  health_url: /health
  service_token_audience: certifai-svc

mcp:
  enabled: true
  required_plan: enterprise
  endpoint: https://mcp.breakpilot.com/certifai
  tools:
    - name: list_ai_agents
      description: "Returns AI agents configured for the tenant"
      required_scope: read:agents
    - name: get_llm_usage
      description: "Returns LLM usage metrics"
      required_scope: read:usage
    # ... more tools

data:
  data_stores:
    - type: mongodb
      vm: vm-certifai
    - type: external_api
      provider: litellm
      pii_class: low
  tenant_scoping:
    field: org_id
    enforcement: middleware
  supports_projects: false        # see §9.3 PROJECT SCOPING
  retention_default_days: 365
  gdpr_export: /v1/tenants/{id}/export
  gdpr_erasure: /v1/tenants/{id}/data
  offboarding_export_formats: [json, csv]    # produced by P16 final-export step
  frozen_behavior:
    reads:  allow            # customer can still pull data / download exports
    writes: deny_402         # POST/PUT/DELETE return 402 Payment Required
    background_jobs: pause   # scheduled work suspended, queue preserved

backup:
  rpo: 24h
  rto: 30min
  retention_days: 30

infra:
  image: registry.breakpilot.com/certifai-dashboard
  vm: vm-certifai
  replicas: 1
  resource_limits:
    cpu: "2000m"
    memory: "4Gi"
  health_check:
    path: /health
    interval: 30s
    timeout: 5s
    threshold: 3
  secrets:
    - MONGODB_URI: /prod/certifai/MONGODB_URI
    - KEYCLOAK_CLIENT_SECRET: /prod/certifai/KEYCLOAK_CLIENT_SECRET
    - LITELLM_MASTER_KEY: /prod/certifai/LITELLM_MASTER_KEY
  depends_on:
    - keycloak
    - mongodb
    - infisical

admin_actions:
  - name: "Reset LiteLLM API Key"
    description: "Rotates the per-tenant LiteLLM key"
    endpoint: POST /v1/tenants/{id}/admin/rotate-litellm-key
    confirm: required
    audit_required: true

observability:
  metrics: /metrics
  logs:
    format: json
    pii_redaction: true
  audit_endpoint: tenant-registry.internal/audit
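The `frozen_behavior` block in the manifest above (reads allowed, writes answered with 402) can be pictured as a small middleware decision. This is a sketch under assumptions: the `frozen_status` helper is hypothetical, and treating every non-read method (including PATCH) as a write is an interpretation; the manifest comment names only POST/PUT/DELETE.

```python
from typing import Optional

# Methods treated as reads; everything else counts as a write (assumption).
READ_METHODS = {"GET", "HEAD", "OPTIONS"}


def frozen_status(method: str, tenant_frozen: bool) -> Optional[int]:
    """Return 402 when a frozen tenant attempts a write, else None (proceed)."""
    if tenant_frozen and method.upper() not in READ_METHODS:
        return 402  # Payment Required, per frozen_behavior.writes: deny_402
    return None


blocked = frozen_status("POST", tenant_frozen=True)
allowed = frozen_status("GET", tenant_frozen=True)
```

Keeping the decision in one pure function makes it straightforward to cover in the tenant-scoping middleware tests.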

10.1 Manifest Variants by Frontend Type

The example above shows an interactive product. Headless and widget products differ only in the frontend block.

Widget variant

frontend:
  type: widget
  tag: status-monitor-widget
  bundle_url: https://cdn.breakpilot.com/products/status/{version}/widget.js
  bundle_size_kb: 38
  dimensions:
    width: 400
    height: 240
  poll_interval_s: 60
  portal_config:
    # same shape as the headless variant below; used for the click-through management page
    sections: [api_keys, webhooks, usage, docs]
    api_key_scopes: [...]
    webhook_events: [...]

Headless variant (no frontend bundle)

frontend:
  type: headless
  # NO tag, NO bundle_url — the portal renders 100% of the customer UI
  portal_config:
    sections:
      - api_keys
      - webhooks
      - usage
      - code_samples
      - docs
    status_endpoint: /v1/status     # optional; portal polls for status badge
    api_key_scopes:
      - id: read
        description: "Read sessions and results"
      - id: write
        description: "Create new sessions"
      - id: admin
        description: "Manage settings (rare; consider before granting)"
    webhook_events:
      - name: session.completed
        description: "Fires when a notetaker session is fully processed"
        payload_schema_url: /schemas/session.completed.json
      - name: session.failed
        description: "Fires when a session cannot be processed"
        payload_schema_url: /schemas/session.failed.json
    code_samples:
      - language: curl
        title: "Create a session"
        snippet: |
          curl -X POST https://notetaker-api.breakpilot.com/v1/sessions \
            -H "Authorization: ApiKey k_xxx" \
            -H "X-Tenant: acme" \
            -H "Content-Type: application/json" \
            -d '{"audio_url": "...", "language": "en"}'
      - language: python
        title: "Create a session"
        snippet: |
          import requests
          requests.post(
              "https://notetaker-api.breakpilot.com/v1/sessions",
              headers={"Authorization": "ApiKey k_xxx", "X-Tenant": "acme"},
              json={"audio_url": "...", "language": "en"},
          )

The Tenant Registry validates the frontend block against the type:

  • interactive requires tag and bundle_url; portal_config is optional
  • widget requires tag, bundle_url, dimensions, AND portal_config
  • headless MUST NOT declare tag or bundle_url; portal_config is required
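The three rules above can be expressed as a small table-driven check. The sketch below assumes the frontend block has already been parsed into a dict; `validate_frontend` and its error strings are hypothetical and not the Tenant Registry's actual validator.

```python
# Required and forbidden keys per frontend type, taken from the rules above.
FRONTEND_RULES = {
    "interactive": {"required": ["tag", "bundle_url"], "forbidden": []},
    "widget": {
        "required": ["tag", "bundle_url", "dimensions", "portal_config"],
        "forbidden": [],
    },
    "headless": {"required": ["portal_config"], "forbidden": ["tag", "bundle_url"]},
}


def validate_frontend(frontend: dict) -> list:
    """Return a list of human-readable violations; an empty list means valid."""
    ftype = frontend.get("type")
    if ftype not in FRONTEND_RULES:
        return [f"unknown frontend.type: {ftype!r}"]
    rules = FRONTEND_RULES[ftype]
    errors = [f"{ftype} requires {key}" for key in rules["required"] if key not in frontend]
    errors += [f"{ftype} must not declare {key}" for key in rules["forbidden"] if key in frontend]
    return errors


ok = validate_frontend({"type": "headless", "portal_config": {"sections": ["docs"]}})
bad = validate_frontend({"type": "headless", "tag": "x", "portal_config": {}})
```

Returning all violations at once, rather than failing on the first, gives product teams a complete list to fix in one pass.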

11. Service Token Model (Inter-Product Communication)

Products can call each other directly. Auth is via short-lived service tokens issued by Keycloak's client_credentials flow.

11.1 Flow

1. Compliance product needs to list AI agents for an AI Act assessment.

2. Compliance backend requests a service token:
   POST https://auth.breakpilot.com/realms/breakpilot-prod/protocol/openid-connect/token
   Body: grant_type=client_credentials
         client_id=compliance-svc
         client_secret=<from Infisical>
         scope=read:certifai-agents
   Response: JWT (15 min TTL)

3. Compliance calls CERTifAI:
   GET https://certifai-api.internal/v1/tenants/{tenant_id}/agents
   Authorization: Bearer <service token>
   X-On-Behalf-Of-User: <user_sub>   ← original user, for audit
   X-Service-Reason: ai-act-assessment

4. CERTifAI validates token:
   - Issued by platform Keycloak: ok
   - Audience includes "certifai-svc": ok
   - Scopes include "read:certifai-agents": ok
   - tenant_id in path matches caller's intent: ok (no cross-tenant)

5. CERTifAI returns data.

6. Both sides emit audit events:
   {actor: "svc:compliance", action: "certifai.list_agents",
    on_behalf_of: "user_sub", tenant_id: "...", reason: "ai-act-assessment"}
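Step 4 of the flow above can be sketched as a check over the token's claims. This assumes the JWT signature has already been verified against the Keycloak JWKS and the payload decoded into a dict; the `check_service_token` helper is hypothetical, the issuer value is inferred from the token URL in step 2, and the `scope` claim is assumed to be a space-delimited string. The tenant check in the path is left out because the spec does not define how it is evaluated.

```python
import time

# Assumed issuer: the realm URL from the token endpoint in step 2.
EXPECTED_ISSUER = "https://auth.breakpilot.com/realms/breakpilot-prod"


def check_service_token(claims: dict, audience: str, required_scope: str, now=None) -> list:
    """Return the names of failed checks; an empty list means the token is acceptable."""
    now = time.time() if now is None else now
    problems = []
    if claims.get("iss") != EXPECTED_ISSUER:
        problems.append("issuer")
    aud = claims.get("aud", [])
    aud = [aud] if isinstance(aud, str) else aud  # aud may be a string or a list
    if audience not in aud:
        problems.append("audience")
    if required_scope not in claims.get("scope", "").split():
        problems.append("scope")
    if claims.get("exp", 0) <= now:
        problems.append("expired")
    return problems


claims = {
    "iss": EXPECTED_ISSUER,
    "aud": ["certifai-svc"],
    "scope": "read:certifai-agents",
    "exp": 1_000_900,  # 15 minutes after the illustrative "now" below
}
result = check_service_token(claims, "certifai-svc", "read:certifai-agents", now=1_000_000)
```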

11.2 Scope Catalog

Each service declares the scopes it offers (which other services can request) and the scopes it consumes (which it needs from other services).

certifai offers:
  read:certifai-agents
  read:certifai-usage
  write:certifai-settings (rare; consider before granting)

compliance offers:
  read:compliance-status
  read:compliance-dsfa
  write:compliance-events  (for cross-product event emission)

billing-service consumes:
  read:certifai-usage
  read:compliance-status

compliance consumes:
  read:certifai-agents     (for AI Act assessments)

Scopes are granted in Keycloak per service client. Grants are reviewed quarterly.
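A quarterly review could start from a consistency check like the one below: every consumed scope should be offered by some service. This is only a sketch; the `unoffered_scopes` helper and the in-memory dicts are hypothetical, with the scope names copied from the catalog above.

```python
OFFERS = {
    "certifai": {"read:certifai-agents", "read:certifai-usage", "write:certifai-settings"},
    "compliance": {"read:compliance-status", "read:compliance-dsfa", "write:compliance-events"},
}
CONSUMES = {
    "billing-service": {"read:certifai-usage", "read:compliance-status"},
    "compliance": {"read:certifai-agents"},
}


def unoffered_scopes(offers: dict, consumes: dict) -> dict:
    """Map each consumer to the scopes it requests that no service offers."""
    offered = set().union(*offers.values())
    return {
        service: needed - offered
        for service, needed in consumes.items()
        if needed - offered
    }


dangling = unoffered_scopes(OFFERS, CONSUMES)
```

An empty result means the catalog is internally consistent; it says nothing about whether each grant is still justified, which remains a manual review question.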

11.3 Third-Party Readiness

When we open the platform to third parties:

- Same OIDC client_credentials flow
- Manifests are SIGNED by third-party developer keys (signature verified by Tenant Registry)
- Third-party scopes are read-only by default; write scopes require manual approval
- Network isolation: third-party services run in a separate Orca subnet
- Resource limits enforced (CPU, memory, network egress)
- Per-tenant install requires explicit IT Admin consent (OAuth consent screen)

The contract surface today is the same — we just add verification gates.


12. Versioning and Contract Evolution

12.1 Versions in play

contract_version  This document. Updated when the platform changes what products
                  must implement. Currently 1.0. Bumped on breaking changes.

product_version   The product's own version (semver). Tracked by Tenant Registry.
                  Independent of contract version.

api_version       The version in URL paths (/v1/, /v2/). Within a contract version,
                  a product may have multiple API versions live.

12.2 Platform supports N and N-1

The platform always supports the current contract version and the previous one. Deprecation is announced in this doc before any breaking change.
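The N and N-1 rule can be sketched as a comparison of major versions. The helper below is hypothetical and assumes contract versions are `major.minor` strings whose major number increases by one per breaking change, as in the 1.0 → 2.0 example in this document.

```python
def is_contract_supported(product_contract: str, current_contract: str) -> bool:
    """True if the product targets the current contract major version or the previous one."""
    product_major = int(product_contract.split(".")[0])
    current_major = int(current_contract.split(".")[0])
    return current_major - 1 <= product_major <= current_major


supported = is_contract_supported("1.0", "2.0")
retired = is_contract_supported("1.0", "3.0")
```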

12.3 Breaking Change Process

1. Announce in this doc (one section per breaking change with motivation).
2. Update contract_version, e.g. 1.0 → 2.0.
3. New products are required to ship 2.0 from day one.
4. Existing products get 12 months to migrate.
5. After 12 months, 1.0 is retired; tenants on 1.0 products are migrated or churned.

13. Onboarding Checklist for a New Product

A product is "ready to ship to a customer" when all boxes are ticked.

☐ Backend API
   ☐ openapi.yaml committed and validated
   ☐ Mandatory endpoints implemented (§4.1)
   ☐ JWT validation via Keycloak JWKS
   ☐ Service token validation
   ☐ Tenant scoping enforced in middleware + tested
   ☐ /v1/tenants/{id}/provision idempotency test passes
   ☐ /v1/tenants/{id}/export produces valid GDPR-compliant ZIP
   ☐ DELETE /v1/tenants/{id}/data is irreversible and audited

☐ Frontend (manifest declares one of: interactive | widget | headless)

   For frontend.type = interactive:
   ☐ Custom element registered with declared tag
   ☐ Bundle published to CDN (≤ 500KB gzipped)
   ☐ Handles all required attributes (§5.A.2)
   ☐ Emits all event types (§5.A.3)
   ☐ Light + dark theme support (§5.A.4)
   ☐ At least one locale beyond English

   For frontend.type = widget:
   ☐ Widget custom element registered with declared tag
   ☐ Bundle published to CDN (≤ 50KB gzipped)
   ☐ Tile dimensions declared in manifest
   ☐ Allowed events only (no breakpilot:navigate)
   ☐ portal_config block complete (for click-through page)

   For frontend.type = headless:
   ☐ NO tag and NO bundle_url declared
   ☐ portal_config.sections declared
   ☐ portal_config.api_key_scopes catalog complete
   ☐ portal_config.webhook_events catalog with payload schemas
   ☐ portal_config.code_samples in at least one language
   ☐ Webhook payloads include HMAC signature for verification
   ☐ Status endpoint returns valid format (if declared)
   ☐ POST /internal/api-keys/verify integration tested with Tenant Registry
   ☐ POST /internal/webhooks/dispatch integration tested with portal

☐ MCP (tier-gated; applies if the manifest sets mcp.enabled)
   ☐ MCP server deployed
   ☐ Tool catalog declared in manifest
   ☐ API key authentication implemented
   ☐ All tools tenant-scoped and audited

☐ Documentation
   ☐ README published at developers.breakpilot.com/products/{name}
   ☐ API reference auto-generated and live
   ☐ Integration guide for customer IT
   ☐ Operational runbook for us
   ☐ Data model + GDPR retention table

☐ Observability
   ☐ /health implemented and returns valid format
   ☐ /metrics in Prometheus format
   ☐ JSON structured logging
   ☐ Audit events emitted for all listed categories
   ☐ No PII in logs (PII redaction tested)

☐ Identity integration
   ☐ Keycloak OIDC client registered
   ☐ Role mappings declared and tested
   ☐ Entitlement key included in tenant JWTs (verified end-to-end)

☐ Control integration
   ☐ product.manifest.yaml committed
   ☐ Registered with Tenant Registry catalog
   ☐ Lifecycle endpoints tested via Backstage "Create Test Tenant"
   ☐ Usage endpoint returns valid format
   ☐ Backstage admin actions render correctly

☐ Data integration
   ☐ All tables/collections carry the tenant scoping field (org_id, per data.tenant_scoping)
   ☐ Cross-tenant query test (negative test) passes
   ☐ Backup contract declared and Infra Plane is executing it
   ☐ GDPR export tested with real data
   ☐ Data residency confirmed (no exfiltration outside EU)

☐ Infra integration
   ☐ Orca manifest committed and applies cleanly
   ☐ Dockerfile builds reproducibly
   ☐ All secrets in Infisical (zero hardcoded)
   ☐ Gitea Actions pipeline green
   ☐ Resource limits set and tested under load
   ☐ Cold start dependency order declared
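For the headless checklist item on HMAC-signed webhook payloads, one common approach is an HMAC-SHA256 hex digest over the raw request body. The sketch below is an assumption, not the platform's defined scheme: the spec does not name the hash algorithm, the encoding, or the header that carries the signature, and both helper names are hypothetical.

```python
import hashlib
import hmac


def sign_payload(secret: bytes, body: bytes) -> str:
    """Compute a hex HMAC-SHA256 signature over the raw webhook body."""
    return hmac.new(secret, body, hashlib.sha256).hexdigest()


def verify_signature(secret: bytes, body: bytes, signature: str) -> bool:
    """Compare in constant time to avoid leaking signature prefixes."""
    return hmac.compare_digest(sign_payload(secret, body), signature)


secret = b"per-tenant-webhook-secret"  # illustrative value only
body = b'{"event": "session.completed"}'
signature = sign_payload(secret, body)
valid = verify_signature(secret, body, signature)
```

Receivers should verify against the exact bytes received, before any JSON parsing or re-serialization, since re-encoding can change the byte sequence and invalidate the signature.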

14. Gap Analysis — Existing Products

CERTifAI vs. Contract 1.0

✓ OIDC via Keycloak — already implemented
✓ Role data model (Admin/Member/Viewer) — exists
✗ Mandatory endpoints — NONE of §4.1 implemented yet
✗ Frontend as web component — currently a full Dioxus fullstack app
✗ MCP server — not implemented
✗ Tenant scoping in queries — only chat is user-scoped, no org_id scoping
✗ Service token validation — not implemented
✗ GDPR export/erasure — not implemented
✗ /health, /metrics, structured audit emission — not implemented
✓ Orca + Infisical compatible — already deployed this way

Effort estimate: 4-6 weeks of focused work

breakpilot-compliance vs. Contract 1.0

✓ Multi-tenant via X-Tenant-ID — exists (needs JWT validation upgrade)
✓ Modular Next.js frontend — close to web-component-wrappable
✗ Mandatory endpoints — partially implemented (usage endpoint missing)
✗ JWT validation at proxy — currently raw header trust
✗ Frontend as web component — needs wrapping with @r2wc/react-to-web-component
✗ MCP server — not implemented
✓ Backup contract — declared informally, needs to be in manifest
✗ GDPR export/erasure — partial (DSR module exists, doesn't cover whole tenant)
✓ Observability — partial (structured logs, no /metrics)

Effort estimate: 3-5 weeks of focused work

15. Open Items

- Design tokens package (@breakpilot/design-tokens) — needs to exist before web components ship
- CDN for product bundles — pick provider (Hetzner Object Storage + Cloudflare?)
- MCP gateway — single mcp.breakpilot.com vs. per-product subdomains
- Third-party manifest signing — defer until first real third-party conversation
- Inter-product event bus — explicitly deferred; service tokens cover the use cases for now
- Contract testing — automate manifest + openapi validation in Gitea Actions
- Customer-facing catalog UI — defined at /[tenant]/catalog (see PLATFORM_ARCHITECTURE.md
  §5a operating principles); Backstage product picker reuses same catalog metadata.

OSS swap-in points (designed-for, not adopted yet):
- Audit log storage: BoxyHQ Retraced — our event schema is Retraced-shape (§8.4),
  swap when audit query patterns outgrow PostgreSQL or when a customer asks for
  exportable SOC2-grade audit retention.
- Usage metering: Lago — our /v1/usage endpoint plus optional per-event stream
  (§4.1) is Lago-compatible. Swap when LiteLLM token billing requires real-time
  metering or per-customer pricing tiers we cannot model in Stripe.
- Customer IdP federation (SCIM): BoxyHQ Jackson or Keycloak's SCIM module.
  Adopt when first enterprise customer asks for automated user provisioning.
- Feature flags / per-tenant feature gating: OpenFeature (vendor-neutral).
  Adopt when product features need finer-than-plan-tier gating per tenant.

End of document. Contract version 1.0. Next review: after first product (CERTifAI or compliance) achieves full compliance with §13 checklist.