# Quality & Governance
Akili implements Federated Computational Governance — domain teams own their data product quality and classification, while the platform enforces global policies computationally. Governance has four pillars: lineage, classification, quality enforcement, and SLA management.
## Quality Enforcement

```mermaid
flowchart TD
    MAT[Materialization Complete] --> QC[Run Quality Checks]
    subgraph Tiers["Three Expressiveness Tiers"]
        QC --> T1[Tier 1: Declarative YAML]
        QC --> T2[Tier 2: Custom SQL]
        QC --> T3[Tier 3: Custom Python]
    end
    T1 --> SCORE[Calculate Quality Score]
    T2 --> SCORE
    T3 --> SCORE
    SCORE --> GATE{Pass / Fail Gate}
    GATE -->|All blocking checks pass| PROMOTE[Promote to Serving Stores]
    GATE -->|Any blocking check fails| BLOCK[Block Downstream Propagation]
    PROMOTE --> DOWNSTREAM[Publish data.available]
    PROMOTE --> CATALOG[Update Quality Score in Catalog]
    BLOCK --> ALERT[Emit quality.failed Event]
    BLOCK --> STALE[Consumers See Stale-but-Correct Data]
```
### Three Expressiveness Tiers

Developers declare quality rules in `quality.yaml`. The platform translates them into platform quality check functions.
| Tier | Developer Writes | Platform Translates To |
|---|---|---|
| Declarative YAML | `type: not_null`, `column: X` | Quality check with SQL null-count query |
| Custom SQL | `type: custom_sql`, `sql: "..."` | Quality check that executes the SQL and checks the assertion |
| Custom Python | `type: custom_python`, `module: X` | Quality check that imports and calls the function |
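A Tier 3 check is the most expressive option. As a rough sketch of what such a function might look like (the signature, `CheckResult` shape, and function name are illustrative assumptions, not the platform's actual contract):

```python
# Hypothetical Tier 3 check: the platform would import this function via the
# `module` path declared in quality.yaml and call it after materialization.
from dataclasses import dataclass


@dataclass
class CheckResult:
    passed: bool
    message: str


def check_revenue_distribution(table_rows: list[dict]) -> CheckResult:
    """Fail if more than 1% of rows report zero revenue."""
    if not table_rows:
        return CheckResult(False, "empty table")
    zero = sum(1 for r in table_rows if r.get("total_revenue") == 0)
    ratio = zero / len(table_rows)
    if ratio > 0.01:
        return CheckResult(False, f"{ratio:.1%} of rows have zero revenue")
    return CheckResult(True, "revenue distribution within tolerance")
```

Checks like this can encode domain logic that neither declarative YAML nor a single SQL assertion can express.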
Example `quality.yaml`:

```yaml
transform_checks:
  - name: completeness_outlet_id
    type: not_null
    column: outlet_id
    severity: blocking
  - name: revenue_positive
    type: expression
    sql: "SELECT COUNT(*) FROM {table} WHERE total_revenue < 0"
    threshold: 0
    severity: blocking
  - name: row_count_reasonable
    type: volume_anomaly
    threshold: 0.3
    severity: warning
```

### Blocking vs. Warning

- `severity: blocking` — The platform blocks downstream materialization. No data is promoted to serving stores.
- `severity: warning` — The platform logs a warning; downstream processing continues.
This is the structural mechanism behind the platform’s “no bad data served” guarantee. Quality gates are not opt-in — they are prerequisites for data promotion.
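The gate logic itself is simple: promotion requires every blocking check to pass, while warnings never block. A minimal sketch (function and field names are assumptions for illustration):

```python
# Promotion gate sketch: a product is promoted only when every check with
# severity "blocking" passed; warning-severity failures are logged only.
def evaluate_gate(results: list[dict]) -> str:
    blocking_failures = [
        r for r in results
        if r["severity"] == "blocking" and not r["passed"]
    ]
    if blocking_failures:
        return "block"    # emit quality.failed; consumers keep stale data
    return "promote"      # publish data.available; update catalog score


results = [
    {"name": "completeness_outlet_id", "severity": "blocking", "passed": True},
    {"name": "row_count_reasonable", "severity": "warning", "passed": False},
]
print(evaluate_gate(results))  # → promote
```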
### Quality Score

Each product’s quality score is a rolling average of check results over the last 30 days:

```
quality_score = (passing_checks / total_checks) * 100
```

Scores are synced to the data catalog and displayed in both the catalog and the portal.
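The rolling window means the score reflects only recent check runs. A small sketch of that computation (the function name and result shape are assumptions):

```python
# Quality score per the formula above: passing checks over total checks,
# restricted to the trailing 30-day window, as a percentage.
from datetime import datetime, timedelta


def quality_score(check_results, now, window_days=30):
    """check_results: iterable of (timestamp, passed) pairs."""
    cutoff = now - timedelta(days=window_days)
    recent = [passed for ts, passed in check_results if ts >= cutoff]
    if not recent:
        return None  # no checks in the window
    return 100 * sum(recent) / len(recent)


now = datetime(2025, 1, 31)
# 20 checks over the last 20 days, failing at d=0 and d=10 -> 18/20 pass
results = [(now - timedelta(days=d), d % 10 != 0) for d in range(20)]
print(quality_score(results, now))  # → 90.0
```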
## Classification System

Every data product declares a sensitivity level, ordered by increasing restriction:
```
public -> internal -> confidential -> restricted
```

### High-Water Mark Propagation

The output classification of a data product must be greater than or equal to the highest classification of any input. This is enforced at deploy time.
```
raw.orders (internal) + raw.payroll (confidential) = output MUST be >= confidential
```

This prevents two governance violations:
- Classification laundering — Creating a “public” product that reads from “confidential” inputs
- Clearance bypass — Building a product using inputs the developer lacks clearance to access
Classification propagation is transitive — if product C depends on B which depends on A (confidential), then C’s classification must be >= confidential even though C only directly references B.
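The deploy-time check is a comparison over the ordered level list. A sketch (the level order comes from this page; the function itself is an assumption):

```python
# Deploy-time high-water-mark validation sketch.
LEVELS = ["public", "internal", "confidential", "restricted"]


def validate_output_classification(output: str, inputs: list[str]) -> str:
    """Raise if the declared output is below the input high-water mark."""
    high_water = max(inputs, key=LEVELS.index)
    if LEVELS.index(output) < LEVELS.index(high_water):
        raise ValueError(
            f"output '{output}' is below input high-water mark '{high_water}'"
        )
    return high_water


# raw.orders (internal) + raw.payroll (confidential) -> confidential is fine
validate_output_classification("confidential", ["internal", "confidential"])
```

Declaring the same product as `public` would raise at deploy time, which is exactly the classification-laundering case above.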
### Column-Level Classification (ADR-032)

Beyond product-level classification, individual columns can declare their own sensitivity using a hierarchical taxonomy:
| Classification | Full Access | Masked Access | Denied |
|---|---|---|---|
| `pii.name` | Original value | SHA-256 hash (truncated) | Column omitted |
| `pii.identifier` | Original value | Last 4 chars, rest `*` | Column omitted |
| `pii.contact` | Original value | `[REDACTED]` | Column omitted |
| `business.confidential` | Original value | N/A (full or nothing) | Column omitted |
| `business.internal` | Always visible | Always visible | Always visible |
| `public` | Always visible | Always visible | Always visible |
The serving layer applies dynamic masking at query time based on consumer clearance. This eliminates the need for derivative products just to strip sensitive columns.
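The masking rules in the table above can be sketched per value (function and argument names are assumptions, not the platform's API):

```python
# Query-time masking sketch following the taxonomy table.
import hashlib


def mask_value(value: str, classification: str, access: str):
    if classification in ("business.internal", "public") or access == "full":
        return value
    if access == "denied" or classification == "business.confidential":
        return None  # column omitted; business.confidential is full-or-nothing
    # access == "masked"
    if classification == "pii.name":
        return hashlib.sha256(value.encode()).hexdigest()[:12]  # truncated hash
    if classification == "pii.identifier":
        return "*" * max(len(value) - 4, 0) + value[-4:]  # last 4 chars kept
    if classification == "pii.contact":
        return "[REDACTED]"
    return None


print(mask_value("ID-12345678", "pii.identifier", "masked"))  # → *******5678
```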
## Lineage Tracking

Metadata flows into the data catalog from four pathways:
| Pathway | When | What |
|---|---|---|
| Manifest Registration | Build-time | Product identity, schemas, classification, access teams |
| Asset Graph | Deploy-time | Dependency edges, external system connections, serving store edges |
| Execution Events | Run-time | Freshness, row count, duration, quality scores, partition status |
| Deployment Lineage | Version tracking | Which manifest version produced which data snapshot |
The sync is one-directional: The execution engine is the source of truth for operational data; the data catalog provides discovery, search, and visualization.
### Impact Analysis

The lineage graph enables two key queries:
- “What breaks if X fails?” — Show all downstream dependents
- “Where does Y come from?” — Show full upstream lineage to source systems
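Both queries are reachability traversals over the lineage graph. A sketch with an illustrative graph (the product names are made up):

```python
# "What breaks if X fails?" is a downstream reachability traversal.
def downstream(product: str, edges: dict[str, list[str]]) -> set[str]:
    """Return all transitive dependents of a product."""
    seen, stack = set(), [product]
    while stack:
        for dep in edges.get(stack.pop(), []):
            if dep not in seen:
                seen.add(dep)
                stack.append(dep)
    return seen


edges = {
    "raw.orders": ["analytics.revenue"],
    "analytics.revenue": ["reports.daily", "ml.forecast"],
}
print(downstream("raw.orders", edges))
# → {'analytics.revenue', 'reports.daily', 'ml.forecast'}
```

The upstream query ("Where does Y come from?") is the same traversal over reversed edges.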
## SLA Management

Each data product has implicit SLA expectations based on its schedule:
| Signal | Threshold | Alert |
|---|---|---|
| Freshness | 2x schedule interval | “Product X is stale” |
| Quality score | < 95% over 24h | “Quality degradation on X” |
| Execution duration | > 3x historical p95 | “Slow execution on X” |
| Availability | Any serving failure | “Serving endpoint down for X” |
A platform sensor monitors these every 5 minutes, emitting alerts via event bus messages to notification channels.
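The freshness signal, for instance, fires when more than twice the schedule interval has elapsed since the last successful run. A minimal sketch (the function name is an assumption):

```python
# Freshness check per the SLA table: stale means the elapsed time since the
# last run exceeds 2x the schedule interval.
from datetime import datetime, timedelta


def is_stale(last_run: datetime, interval: timedelta, now: datetime) -> bool:
    return now - last_run > 2 * interval


now = datetime(2025, 1, 1, 12, 0)
assert is_stale(now - timedelta(hours=5), timedelta(hours=2), now)      # stale
assert not is_stale(now - timedelta(hours=3), timedelta(hours=2), now)  # fresh
```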
## Emergent Business Ontology

As products are deployed, the platform automatically builds a business ontology — a structured vocabulary of the organization’s data concepts. Concepts emerge from product metadata rather than being imposed top-down.
### Concept Registry
Section titled “Concept Registry”| Source | Extraction Rule |
|---|---|
| Identity columns | Each unique identity column name becomes a concept |
| Domain names | Each domain becomes a concept |
| Product tags | Tags in product.yaml are registered as concept associations |
| Semantic intents | Intent metadata enriches column concepts with operational semantics |
### Concept Maturity Lifecycle

| State | Meaning |
|---|---|
| `draft` | Auto-extracted, not yet reviewed |
| `proposed` | Submitted for domain approval |
| `accepted` | Approved by domain owner |
| `canonical` | Organization-wide standard term (Published Language) |
| `deprecated` | Retained for historical reference |
## Regulatory Compliance

### Right-to-Erasure (GDPR/CCPA)

The platform supports structured deletion that propagates through the lineage graph:
- Request — Deletion request submitted via API
- Impact — Platform traverses lineage graph to identify affected products
- Plan — Deletion plan: which products, which records, which method
- Execute — Position delete files are written in the data lake (non-destructive, auditable)
- Verify — Post-deletion verification confirms no residual data
- Audit — Permanent audit record (never deleted)
Deletion propagation stops at aggregation boundaries where individual contributions cannot be identified (e.g., SUM, COUNT).
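The impact traversal with that stopping rule can be sketched as follows (the `aggregates` map marking SUM/COUNT products and the function name are assumptions for illustration):

```python
# Deletion-impact traversal that halts at aggregation boundaries, where
# individual contributions are no longer identifiable.
def deletion_targets(product, edges, aggregates):
    """Return the products whose records a deletion plan must cover."""
    targets, stack, seen = [], [product], {product}
    while stack:
        node = stack.pop()
        targets.append(node)
        for dep in edges.get(node, []):
            if dep not in seen and not aggregates.get(dep, False):
                seen.add(dep)  # propagate only past non-aggregating products
                stack.append(dep)
    return targets


edges = {"raw.customers": ["analytics.ltv", "reports.revenue_sum"]}
aggregates = {"reports.revenue_sum": True}  # SUM boundary: stop here
print(deletion_targets("raw.customers", edges, aggregates))
# → ['raw.customers', 'analytics.ltv']
```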
### Retention Policies

Products declare retention in `product.yaml`:

```yaml
retention:
  period: "365d"
  basis: created_at
  review_date: "2026-06-01"
```

The platform evaluates retention daily and emits `retention.expired` events, but does not auto-delete. Product owners must explicitly trigger deletion or extend the period.
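The daily evaluation reduces to an age comparison against the declared period. A sketch (function name and the pre-parsed `period_days` argument are assumptions):

```python
# Daily retention evaluation sketch: signals expiry, never deletes.
from datetime import datetime, timedelta


def retention_expired(created_at: datetime, period_days: int,
                      today: datetime) -> bool:
    """True -> the platform should emit a retention.expired event."""
    return today - created_at > timedelta(days=period_days)


assert retention_expired(datetime(2024, 1, 1), 365, datetime(2025, 6, 1))
assert not retention_expired(datetime(2025, 1, 1), 365, datetime(2025, 6, 1))
```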
## Related

- System Architecture — Platform overview
- Orchestration — How quality checks integrate with the execution engine
- Serving Layer — Where classification enforcement happens at query time