compute.yaml
Declares how the product executes. Most fields have sensible defaults; many products need nothing in this file beyond the entrypoint.
Example

```yaml
apiVersion: akili/v1
kind: Compute

runtime: sql
mode: transform
engine: auto

schedule:
  type: cron
  expression: "0 6 * * *"
  timezone: Africa/Nairobi

resources:
  cpu: "500m"
  memory: "1Gi"
  timeout: 30m

entrypoint: logic/transform.sql

retry:
  max_attempts: 3
  backoff: exponential
  initial_delay: 30s

requirements:  # Python only
  - pandas>=2.0
  - scikit-learn>=1.3
```

Field Reference
| Field | Type | Required | Default | Validation | Description |
|---|---|---|---|---|---|
| runtime | enum | No | Inferred from entrypoint extension | sql, python | sql uses DuckDB/Spark SQL; python uses a Python process. |
| mode | enum | No | transform | transform, train, inference | Execution mode. See ML as Data Products below. |
| engine | enum | No | auto | auto, duckdb, spark | auto selects DuckDB (<10GB) or Spark (>=10GB). |
| schedule.type | enum | No | event | cron, event, manual | Trigger type |
| schedule.expression | string | Cron only | — | Standard 5-field cron | Cron schedule expression |
| schedule.timezone | string | No | UTC | IANA timezone | Schedule timezone |
| resources.cpu | string | No | 500m | K8s CPU format | CPU request |
| resources.memory | string | No | 1Gi | K8s memory format | Memory request |
| resources.timeout | duration | No | 30m | Duration string | Max execution wall time |
| entrypoint | string | No | logic/transform.sql or logic/transform.py | Relative path, must exist | Path to logic file |
| retry.max_attempts | int | No | 3 | — | Total attempts, including the first |
| retry.backoff | enum | No | exponential | fixed, exponential | fixed = constant delay; exponential = doubles each attempt |
| retry.initial_delay | duration | No | 30s | Duration string | Delay before first retry |
| requirements | string[] | No | none | Python package specifiers | Additional Python packages. Pre-installed: pandas, numpy, scikit-learn, pyarrow. |
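The retry fields compose into a simple delay schedule. A minimal sketch of how the delays might be computed (the function name and seconds-based units are illustrative assumptions, not the platform's API):

```python
def retry_delays(max_attempts: int, backoff: str, initial_delay_s: float) -> list[float]:
    """Seconds to wait before each retry; max_attempts counts the first run too."""
    n_retries = max_attempts - 1
    if backoff == "fixed":
        return [initial_delay_s] * n_retries
    # exponential: the delay doubles on each successive attempt
    return [initial_delay_s * 2 ** i for i in range(n_retries)]

# Defaults from the table: 3 attempts, exponential backoff, 30s initial delay
print(retry_delays(3, "exponential", 30))  # [30, 60]
```

With the defaults, a product that keeps failing runs three times in total, waiting 30s and then 60s between attempts.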
Schedule Types

| schedule.type | Dagster Implementation | When It Runs |
|---|---|---|
| cron | ScheduleDefinition with cron expression | At scheduled time |
| event | @sensor watching upstream data.available events on Redpanda | When all required inputs are materialized |
| manual | No automation | Only via the akili run CLI or POST /api/v1/products/:id/run |

The default is event: the product runs when its inputs are ready. For source-aligned products with no internal inputs, cron or manual is typical.
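For example, a derived product that should run whenever its inputs land needs no schedule block at all, while an on-demand product would declare only the trigger type (a sketch; field names follow the table above):

```yaml
# Run only on demand, via `akili run` or the run API
schedule:
  type: manual
```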
ML as Data Products

The mode field enables three data product behaviors through the same manifest structure:

| Mode | Entrypoint | Inputs | Output |
|---|---|---|---|
| transform | transform.sql or transform.py | DataFrames | DataFrame to Iceberg table |
| train | train.py | DataFrames | Model artifact to S3 (/tenants/{id}/models/{name}/{version}/) |
| inference | predict.py | DataFrames + model artifact | DataFrame to Iceberg table |
Training entrypoint:

```python
def train(inputs: dict[str, pd.DataFrame], context: dict) -> ModelArtifact:
    """Train a model and return it for storage."""
```

Inference entrypoint:

```python
def predict(inputs: dict[str, pd.DataFrame], model: Any, context: dict) -> pd.DataFrame:
    """Load model, score inputs, return predictions."""
```

Logic Files
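To make the two signatures concrete, here is a toy pair under stated assumptions: the input id `daily_sales` is hypothetical, and a plain dict stands in for the platform's ModelArtifact wrapper:

```python
import pandas as pd
from typing import Any

def train(inputs: dict[str, pd.DataFrame], context: dict) -> Any:
    """Toy 'model': remember the mean revenue seen in training data."""
    df = inputs["daily_sales"]  # hypothetical input id
    return {"mean_revenue": df["total_revenue"].mean()}

def predict(inputs: dict[str, pd.DataFrame], model: Any, context: dict) -> pd.DataFrame:
    """Flag outlets whose revenue exceeds the trained mean."""
    df = inputs["daily_sales"]
    out = df[["outlet_id"]].copy()
    out["above_average"] = df["total_revenue"] > model["mean_revenue"]
    return out
```

The point is the contract, not the model: train returns an artifact the platform stores, and predict receives that artifact back alongside fresh inputs.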
SQL Transforms

SQL transforms run against DuckDB (default) or Spark SQL (large data). Input data products are available as table references matching their id from inputs.yaml, with hyphens replaced by underscores.

```sql
-- logic/transform.sql
-- Inputs available as tables: cleaned_outlets, cleaned_transactions

SELECT
  t.outlet_id,
  t.transaction_date AS sale_date,
  SUM(t.amount) AS total_revenue,
  COUNT(*) AS transaction_count,
  AVG(t.amount) AS avg_basket_size,
  o.territory_code,
  NOW() AS updated_at
FROM cleaned_transactions t
JOIN cleaned_outlets o ON t.outlet_id = o.outlet_id
WHERE t.transaction_date = '{{ partition_key }}'
GROUP BY t.outlet_id, t.transaction_date, o.territory_code
```

{{ partition_key }} is injected by the platform at execution time. For non-partitioned products, the full dataset is available without filtering.
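The id-to-table mapping and the {{ partition_key }} substitution are both simple string rewrites; a minimal sketch of each (not the platform's actual templating engine):

```python
def table_name(input_id: str) -> str:
    """inputs.yaml id -> SQL table reference (hyphens become underscores)."""
    return input_id.replace("-", "_")

def render_partition(sql: str, partition_key: str) -> str:
    """Inject the partition key into a templated SQL string."""
    return sql.replace("{{ partition_key }}", partition_key)

print(table_name("cleaned-transactions"))  # cleaned_transactions
print(render_partition("WHERE d = '{{ partition_key }}'", "2024-06-01"))
```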
Python Transforms

```python
import pandas as pd

def transform(inputs: dict[str, pd.DataFrame]) -> pd.DataFrame:
    """
    Args:
        inputs: Dict mapping input id to DataFrame. Keys match inputs.yaml
            ids (hyphens replaced with underscores).

    Returns:
        DataFrame matching the schema declared in output.yaml.
    """
    transactions = inputs["cleaned_transactions"]
    outlets = inputs["cleaned_outlets"]

    merged = transactions.merge(outlets, on="outlet_id")

    result = (
        merged
        .groupby(["outlet_id", "transaction_date", "territory_code"])
        .agg(
            total_revenue=("amount", "sum"),
            transaction_count=("amount", "count"),
            avg_basket_size=("amount", "mean"),
        )
        .reset_index()
        .rename(columns={"transaction_date": "sale_date"})
    )

    result["updated_at"] = pd.Timestamp.now(tz="UTC")
    return result
```

Template Functions
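Because a transform is a pure function of its inputs, it can be exercised locally with small hand-built DataFrames before deploying. A sketch (the inlined transform here is a trimmed stand-in for a real logic/transform.py):

```python
import pandas as pd

def transform(inputs: dict[str, pd.DataFrame]) -> pd.DataFrame:
    # Trimmed stand-in for the real logic/transform.py
    merged = inputs["cleaned_transactions"].merge(inputs["cleaned_outlets"], on="outlet_id")
    return merged.groupby("outlet_id", as_index=False).agg(total_revenue=("amount", "sum"))

transactions = pd.DataFrame({"outlet_id": [1, 1, 2], "amount": [10.0, 5.0, 7.0]})
outlets = pd.DataFrame({"outlet_id": [1, 2], "territory_code": ["KE-01", "KE-02"]})
result = transform({"cleaned_transactions": transactions, "cleaned_outlets": outlets})
print(result)
```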
Downstream products reference upstream semantic intents via template functions:

```sql
-- Reference semantic intent tier values in SQL
SELECT
  customer_id,
  store_id,
  SUM(net_revenue) AS total_spend
FROM store_performance
WHERE customer_segment IN ({{ intent('customer_classification', 'high_value') }})
GROUP BY 1, 2
```

| Function | Description | Resolves To |
|---|---|---|
| `{{ intent('<name>', '<tier>') }}` | Current tier values as comma-separated quoted list | 'Engaged' or 'Engaged', 'Premium' |
| `{{ intent_column('<name>') }}` | Current column name for the intent | customer_segment |
Template resolution happens during codegen (akili codegen), not at raw SQL execution time. Unresolvable templates fail the build with a clear error.
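Resolution can be pictured as a lookup against an intent registry at codegen time. A toy sketch (the registry shape shown here is an assumption; the real resolution lives inside akili codegen):

```python
# Hypothetical registry; the real one is maintained by the platform
INTENTS = {
    "customer_classification": {
        "column": "customer_segment",
        "tiers": {"high_value": ["Engaged", "Premium"]},
    }
}

def intent(name: str, tier: str) -> str:
    """Render a tier's current values as a comma-separated quoted list."""
    values = INTENTS[name]["tiers"][tier]
    return ", ".join(f"'{v}'" for v in values)

def intent_column(name: str) -> str:
    """Render the intent's current column name."""
    return INTENTS[name]["column"]

print(intent("customer_classification", "high_value"))  # 'Engaged', 'Premium'
print(intent_column("customer_classification"))         # customer_segment
```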
Logic File Rules

- SQL files must be valid SQL syntax (parse-checked at `akili validate`).
- Python files must expose a `transform()`, `train()`, or `predict()` function matching the `mode` in compute.yaml.
- Python files can import from `requirements` listed in compute.yaml plus pre-installed packages.
- Logic files must not perform I/O directly. All I/O is handled by the platform.
- Logic files receive data and return data. Side effects are forbidden.
- Internal decomposition (CTEs, helper functions, modular SQL) is private. No external product can reference internal structures. A product’s public interface is its output schema only.
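The rule that a Python logic file must expose the function matching its mode can be checked statically without executing the file. A sketch of such a validator (a plausible check, not akili validate itself):

```python
import ast

# mode in compute.yaml -> required top-level function name
MODE_FUNCTIONS = {"transform": "transform", "train": "train", "inference": "predict"}

def has_entrypoint(source: str, mode: str) -> bool:
    """True if the logic file defines the top-level function required by mode."""
    tree = ast.parse(source)
    defined = {node.name for node in tree.body if isinstance(node, ast.FunctionDef)}
    return MODE_FUNCTIONS[mode] in defined

print(has_entrypoint("def train(inputs, context): ...", "train"))  # True
print(has_entrypoint("def transform(inputs): ...", "inference"))   # False
```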