
Crosscutting Concepts

Multi-tenancy is not a feature — it is an invariant. Every layer of the platform enforces tenant isolation.

| Layer | Mechanism |
| --- | --- |
| API | JWT extraction, `tenant_id` on every service call |
| Database | PostgreSQL Row-Level Security (`SET LOCAL app.tenant_id`) |
| Events | Per-tenant Redpanda topics (`tenant.{id}.{domain}`) |
| Storage | Ceph path prefix (`{tenant_id}/{domain}/{product}/`) |
| Serving | Tenant-scoped queries, per-tenant resource limits |
| CRDs | Kubernetes CRDs include `tenant_id` in spec |

No cross-tenant data access is possible. Tenant isolation is enforced at the service layer, not the database layer alone.
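As an illustration, a service-layer helper might build the per-transaction statements that activate RLS. This is a sketch, not the platform's actual code: the `app.tenant_id` setting comes from the table above, but the helper name and the tenant-id validation pattern are assumptions.

```python
import re

# Assumed tenant-id format; adjust to the platform's real convention.
TENANT_ID_PATTERN = re.compile(r"[a-z0-9-]{1,64}")

def tenant_scoped_statements(tenant_id: str, query: str) -> list:
    """Statements a service runs in one transaction so that RLS
    policies can read current_setting('app.tenant_id')."""
    if not TENANT_ID_PATTERN.fullmatch(tenant_id):
        # SET LOCAL takes no bind parameters, so reject anything that
        # could escape the quoted literal (or use set_config() instead).
        raise ValueError("invalid tenant_id: %r" % tenant_id)
    return [
        "BEGIN",
        "SET LOCAL app.tenant_id = '%s'" % tenant_id,
        query,
        "COMMIT",
    ]
```

`SET LOCAL` scopes the setting to the transaction, so a pooled connection cannot leak one tenant's context into the next request.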

Every column in the platform has a classification level. Classification propagates through the data pipeline using the high-water mark rule: when data from multiple sources is combined, the output inherits the highest classification of any input.

| Level | Access | Masking |
| --- | --- | --- |
| Public | All consumers | None |
| Internal | Authenticated users | None |
| Confidential (PII) | Role-based | SHA-256 hash, last-4, or REDACTED |
| Restricted | Named principals only | Full column masking |
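The high-water mark rule is straightforward once the levels are ordered. A minimal sketch; the numeric values are arbitrary, only the ordering Public < Internal < Confidential < Restricted comes from the table above:

```python
from enum import IntEnum

class Level(IntEnum):
    # Ordering mirrors the classification table; values are arbitrary.
    PUBLIC = 0
    INTERNAL = 1
    CONFIDENTIAL = 2
    RESTRICTED = 3

def propagate(*input_levels):
    """High-water mark: when inputs are combined, the output
    inherits the highest classification of any input."""
    return max(input_levels)
```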

When a consumer’s clearance is below the column’s classification, the platform applies automatic masking:

  • PII Name — SHA-256 hash (first 16 hex characters)
  • PII Identifier — Last 4 characters visible, rest masked
  • PII Contact — [REDACTED]
  • Business Confidential — Column omitted entirely

Masking is applied at query time in the serving layer, not at rest.
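A serving-layer masking function might look like the sketch below. The category names and the `*` fill for masked characters are illustrative assumptions; the masking rules themselves come from the list above.

```python
import hashlib

def mask_value(value, category):
    """Apply query-time masking per PII category. Returning None
    signals the caller to omit the column from the result."""
    if category == "pii_name":
        # SHA-256 hash, first 16 hex characters
        return hashlib.sha256(value.encode("utf-8")).hexdigest()[:16]
    if category == "pii_identifier":
        # Last 4 characters visible, rest masked
        return "*" * max(len(value) - 4, 0) + value[-4:]
    if category == "pii_contact":
        return "[REDACTED]"
    if category == "business_confidential":
        return None  # column omitted entirely
    return value  # unclassified or clearance sufficient: pass through
```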

  • Protocol: OIDC via Authentik
  • Portal: NextAuth v5 with BFF token relay (ADR-039)
  • API: JWT Bearer token validation
  • CLI: Device authorization flow

The platform applies different failure strategies based on the criticality of the operation.

Security operations never degrade. If the auth service is unavailable, requests are rejected — not permitted with reduced security.

  • JWT validation failure → 401 Unauthorized
  • Classification check failure → most restrictive level applied
  • Masking pipeline failure → column omitted
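These fail-closed rules can be captured in small guards. The function names here are illustrative; only the behaviors (reject, apply the most restrictive level, omit the column) come from the list above.

```python
MOST_RESTRICTIVE = "restricted"

def classification_for(column, lookup):
    """Fail closed: if the classification lookup errors out, treat
    the column as most restrictive rather than guess a weaker level."""
    try:
        return lookup(column)
    except Exception:
        return MOST_RESTRICTIVE

def masked_or_omitted(column, mask):
    """Fail closed: if the masking pipeline errors, omit the column.
    Returns (include_column, masked_value)."""
    try:
        return True, mask(column)
    except Exception:
        return False, None
```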

Quality operations degrade gracefully. If a quality check cannot run, the platform serves the last-known-good data rather than returning nothing.

  • Quality check timeout → warning logged, data served
  • Serving store unavailable → fallback to next tier
  • Analytics query timeout → partial results with degradation notice
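The last-known-good fallback might look like this sketch (class and method names are assumptions, not the platform's API):

```python
class LastKnownGood:
    """Serve the freshest data that passed checks; on failure,
    degrade to the previous good result rather than return nothing."""

    def __init__(self, fetch):
        self._fetch = fetch        # callable that runs the checked read
        self._cached = None
        self._have_cache = False

    def get(self):
        try:
            result = self._fetch()
        except Exception:
            if not self._have_cache:
                raise              # nothing good to fall back to
            return self._cached    # degrade: serve last-known-good
        self._cached, self._have_cache = result, True
        return result
```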

All external dependencies have circuit breakers:

  • Serving stores (StarRocks, Redis, TimescaleDB)
  • Notification service
  • Intelligence service (Claude API)
  • KServe inference endpoints

States: Closed (normal) → Open (failing, fast-fail) → Half-Open (probe).
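The three states map to a small state machine. A minimal sketch of the pattern; the threshold and timeout values are assumptions, and the injected clock exists only to make the breaker testable:

```python
import time

class CircuitBreaker:
    """Closed -> Open after `threshold` consecutive failures;
    Open -> Half-Open after `reset_after` seconds, letting one
    probe call through to decide whether to close again."""

    def __init__(self, threshold=3, reset_after=30.0, clock=time.monotonic):
        self.threshold = threshold
        self.reset_after = reset_after
        self.clock = clock
        self.failures = 0
        self.opened_at = None      # None means the circuit is closed

    @property
    def state(self):
        if self.opened_at is None:
            return "closed"
        if self.clock() - self.opened_at >= self.reset_after:
            return "half-open"
        return "open"

    def call(self, fn, *args, **kwargs):
        state = self.state
        if state == "open":
            raise RuntimeError("circuit open: fast-fail")
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.failures += 1
            if state == "half-open" or self.failures >= self.threshold:
                self.opened_at = self.clock()   # (re-)open the circuit
            raise
        self.failures = 0
        self.opened_at = None                   # success: close
        return result
```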

| Signal | Tool | Purpose |
| --- | --- | --- |
| Metrics | Prometheus + Grafana | Infrastructure and application metrics |
| Logs | Loki + Alloy | Structured logs from all services |
| Traces | Tempo | Distributed request tracing |
| Execution | Dagster UI | Pipeline monitoring and debugging |
| Alerts | Alertmanager | On-call notification routing |

Every execution produces structured events that are queryable in the control-plane API.

All infrastructure changes follow the GitOps pattern:

  1. Commit changes to git
  2. ArgoCD detects the change
  3. ArgoCD syncs the cluster to match git state
  4. Drift is detected and auto-healed

Manual kubectl patches are never applied in production. Drift between git and the cluster is treated as a P0 issue.

Each crosscutting concern has a detailed specification in the platform design documents:

| Concern | Specification | Key sections |
| --- | --- | --- |
| Multi-tenancy | API Authentication | JWT extraction, RLS enforcement, tenant scoping |
| Classification | Governance Model | Classification taxonomy, propagation rules, column masking |
| Serving isolation | Serving Layer | 5-point enforcement, store-level namespacing |
| Event isolation | Orchestration | Per-tenant topics, event contracts |
| Quality gates | Quality & Governance | Blocking vs warning severity, SLA tracking |
| Resilience | API Middleware | Circuit breakers, rate limiting, request tracing |