# compute.yaml

Declares how the product executes. Most fields have sensible defaults; many products won’t need this file at all beyond the entrypoint.

```yaml
apiVersion: akili/v1
kind: Compute
runtime: sql
mode: transform
engine: auto
schedule:
  type: cron
  expression: "0 6 * * *"
  timezone: Africa/Nairobi
resources:
  cpu: "500m"
  memory: "1Gi"
  timeout: 30m
entrypoint: logic/transform.sql
retry:
  max_attempts: 3
  backoff: exponential
  initial_delay: 30s
requirements:  # Python only
  - pandas>=2.0
  - scikit-learn>=1.3
```
| Field | Type | Required | Default | Validation | Description |
|---|---|---|---|---|---|
| `runtime` | enum | No | Inferred from entrypoint extension | `sql`, `python` | `sql` runs via DuckDB/Spark SQL; `python` runs a Python process. |
| `mode` | enum | No | `transform` | `transform`, `train`, `inference` | Execution mode. See ML as Data Products below. |
| `engine` | enum | No | `auto` | `auto`, `duckdb`, `spark` | `auto` selects DuckDB (<10 GB) or Spark (>=10 GB). |
| `schedule.type` | enum | No | `event` | `cron`, `event`, `manual` | Trigger type. |
| `schedule.expression` | string | Cron only | none | Standard 5-field cron | Cron schedule expression. |
| `schedule.timezone` | string | No | `UTC` | IANA timezone | Schedule timezone. |
| `resources.cpu` | string | No | `500m` | K8s CPU format | CPU request. |
| `resources.memory` | string | No | `1Gi` | K8s memory format | Memory request. |
| `resources.timeout` | duration | No | `30m` | Duration string | Max execution wall time. |
| `entrypoint` | string | No | `logic/transform.sql` or `logic/transform.py` | Relative path, must exist | Path to logic file. |
| `retry.max_attempts` | int | No | `3` | | Total attempts, including the first. |
| `retry.backoff` | enum | No | `exponential` | `fixed`, `exponential` | `fixed` keeps a constant delay; `exponential` doubles it each attempt. |
| `retry.initial_delay` | duration | No | `30s` | Duration string | Delay before the first retry. |
| `requirements` | string[] | No | none | Python package specifiers | Additional Python packages. Pre-installed: pandas, numpy, scikit-learn, pyarrow. |
| `schedule.type` | Dagster Implementation | When It Runs |
|---|---|---|
| `cron` | `ScheduleDefinition` with the cron expression | At the scheduled time |
| `event` | `@sensor` watching upstream `data.available` events on Redpanda | When all required inputs are materialized |
| `manual` | No automation | Only via the `akili run` CLI or `POST /api/v1/products/:id/run` |

The default is `event`: the product runs when its inputs are ready. For source-aligned products with no internal inputs, `cron` or `manual` is typical.
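For instance, a cron-scheduled source-aligned product can override just the schedule block and lean on the defaults for everything else. A minimal sketch (the expression, timezone, and entrypoint values are illustrative):

```yaml
apiVersion: akili/v1
kind: Compute
entrypoint: logic/transform.py
schedule:
  type: cron
  expression: "0 5 * * 1-5"   # weekdays at 05:00 (illustrative)
  timezone: Africa/Nairobi
```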

The `mode` field enables three data product behaviors through the same manifest structure:

| Mode | Entrypoint | Inputs | Output |
|---|---|---|---|
| `transform` | `transform.sql` or `transform.py` | DataFrames | DataFrame to Iceberg table |
| `train` | `train.py` | DataFrames | Model artifact to S3 (`/tenants/{id}/models/{name}/{version}/`) |
| `inference` | `predict.py` | DataFrames + model artifact | DataFrame to Iceberg table |

Training entrypoint:

```python
def train(inputs: dict[str, pd.DataFrame], context: dict) -> ModelArtifact:
    """Train a model and return it for storage."""
```

Inference entrypoint:

```python
def predict(inputs: dict[str, pd.DataFrame], model: Any, context: dict) -> pd.DataFrame:
    """Load model, score inputs, return predictions."""
```
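A minimal matched pair under these signatures might look like the following sketch. The scikit-learn model, the column names, and the `cleaned_transactions` input id are illustrative assumptions, and the fitted model object stands in for the platform's `ModelArtifact` return type:

```python
from typing import Any

import pandas as pd
from sklearn.linear_model import LinearRegression


def train(inputs: dict[str, pd.DataFrame], context: dict) -> Any:
    """Fit a regression on a hypothetical transactions input; return the model."""
    df = inputs["cleaned_transactions"]
    model = LinearRegression()
    model.fit(df[["transaction_count"]], df["total_revenue"])
    return model  # the platform would persist this as the model artifact


def predict(inputs: dict[str, pd.DataFrame], model: Any, context: dict) -> pd.DataFrame:
    """Score the same input with the trained model and return predictions."""
    df = inputs["cleaned_transactions"]
    out = df[["outlet_id"]].copy()
    out["predicted_revenue"] = model.predict(df[["transaction_count"]])
    return out
```

Because both functions are pure (data in, data out), the pair can be exercised locally with toy DataFrames before the platform ever runs them.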

SQL transforms run against DuckDB (the default) or Spark SQL (large data). Input data products are available as table references matching their `id` from inputs.yaml, with hyphens replaced by underscores.

```sql
-- logic/transform.sql
-- Inputs available as tables: cleaned_outlets, cleaned_transactions
SELECT
    t.outlet_id,
    t.transaction_date AS sale_date,
    SUM(t.amount) AS total_revenue,
    COUNT(*) AS transaction_count,
    AVG(t.amount) AS avg_basket_size,
    o.territory_code,
    NOW() AS updated_at
FROM cleaned_transactions t
JOIN cleaned_outlets o ON t.outlet_id = o.outlet_id
WHERE t.transaction_date = '{{ partition_key }}'
GROUP BY t.outlet_id, t.transaction_date, o.territory_code
```

`{{ partition_key }}` is injected by the platform at execution time. For non-partitioned products, the full dataset is available without filtering.

logic/transform.py:

```python
import pandas as pd


def transform(inputs: dict[str, pd.DataFrame]) -> pd.DataFrame:
    """
    Args:
        inputs: Dict mapping input id to DataFrame.
            Keys match inputs.yaml ids (hyphens replaced with underscores).

    Returns:
        DataFrame matching the schema declared in output.yaml.
    """
    transactions = inputs["cleaned_transactions"]
    outlets = inputs["cleaned_outlets"]
    merged = transactions.merge(outlets, on="outlet_id")
    result = (
        merged
        .groupby(["outlet_id", "transaction_date", "territory_code"])
        .agg(
            total_revenue=("amount", "sum"),
            transaction_count=("amount", "count"),
            avg_basket_size=("amount", "mean"),
        )
        .reset_index()
        .rename(columns={"transaction_date": "sale_date"})
    )
    result["updated_at"] = pd.Timestamp.now(tz="UTC")
    return result
```

Downstream products reference upstream semantic intents via template functions:

```sql
-- Reference semantic intent tier values in SQL
SELECT customer_id, store_id,
       SUM(net_revenue) AS total_spend
FROM store_performance
WHERE customer_segment IN ({{ intent('customer_classification', 'high_value') }})
GROUP BY 1, 2
```

| Function | Description | Resolves To |
|---|---|---|
| `{{ intent('<name>', '<tier>') }}` | Current tier values as a comma-separated quoted list | `'Engaged'` or `'Engaged', 'Premium'` |
| `{{ intent_column('<name>') }}` | Current column name for the intent | `customer_segment` |

Template resolution happens during codegen (`akili codegen`), not at raw SQL execution time. Unresolvable templates fail the build with a clear error.
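Conceptually, resolution is a substitution pass over the SQL text that fails loudly on unknown names. The sketch below illustrates that behavior only; the registry shape, function names, and regexes are assumptions, not the actual codegen implementation:

```python
import re

# Hypothetical registry snapshot: intent name -> (column, tier -> current values).
INTENTS = {
    "customer_classification": (
        "customer_segment",
        {"high_value": ["Engaged", "Premium"]},
    ),
}


def resolve_intents(sql: str) -> str:
    """Expand {{ intent(...) }} and {{ intent_column(...) }} templates."""
    def expand_intent(m: re.Match) -> str:
        name, tier = m.group(1), m.group(2)
        _column, tiers = INTENTS[name]  # KeyError -> unresolvable template
        return ", ".join(f"'{v}'" for v in tiers[tier])

    sql = re.sub(r"\{\{\s*intent\('([\w-]+)',\s*'([\w-]+)'\)\s*\}\}",
                 expand_intent, sql)
    sql = re.sub(r"\{\{\s*intent_column\('([\w-]+)'\)\s*\}\}",
                 lambda m: INTENTS[m.group(1)][0], sql)
    return sql
```

An unknown intent name raises during the pass, which mirrors the documented build-time failure for unresolvable templates.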

  1. SQL files must be valid SQL syntax (parse-checked by `akili validate`).
  2. Python files must expose a `transform()`, `train()`, or `predict()` function matching the `mode` in compute.yaml.
  3. Python files can import from the `requirements` listed in compute.yaml plus the pre-installed packages.
  4. Logic files must not perform I/O directly. All I/O is handled by the platform.
  5. Logic files receive data and return data. Side effects are forbidden.
  6. Internal decomposition (CTEs, helper functions, modular SQL) is private. No external product can reference internal structures. A product’s public interface is its output schema only.