
# Dead Letter Queue Management

When a pipeline stage fails after exhausting all retries, the failed event is sent to the dead letter queue (DLQ). The DLQ captures the original event, error context, and metadata needed for investigation and replay. Events in the DLQ are never automatically re-processed — they require explicit operator action.

```mermaid
%%{init: {'flowchart': {'curve': 'basis'}}}%%
flowchart TB
    EVENT[Event arrives] --> PROCESS[Process event]
    PROCESS -- success --> DONE[Complete]
    PROCESS -- failure --> RETRY{Retries remaining?}
    RETRY -- yes --> BACKOFF[Exponential backoff]
    BACKOFF --> PROCESS
    RETRY -- no --> DLQ[Dead Letter Queue]
    DLQ --> ALERT[Alert: new DLQ entry]
    DLQ --> INSPECT[Operator inspects]
    INSPECT --> REPLAY[Replay]
    INSPECT --> ACK[Acknowledge]
    INSPECT --> PURGE[Purge]
```
| Failure Type | DLQ Topic Suffix | Example |
| --- | --- | --- |
| Unparseable event (bad JSON) | `.parse_failure` | Corrupt message on event bus |
| Missing required field | `.parse_failure` | Event without `product_id` |
| Event for unknown product | `.correlation` | Stale reference after product deletion |
| Quality gate failure after max retries | `.quality_gate` | Persistent data quality issue |
| Pipeline runtime error after retries | `.execution` | Transform SQL error, resource exhaustion |
| Event publish failure (outbox) | `.event_publish` | Redpanda unavailable |

Events for unknown topics are logged and dropped, not queued. This prevents unbounded DLQ growth from misconfigured subscriptions.

Retries happen before DLQ routing. Only retry-exhausted events enter the DLQ.

```
delay(n) = min(initial_delay * multiplier^n * jitter, max_delay)

n=0: ~30s
n=1: ~60s
n=2: ~120s
n=3: ~240s
n=4: 300s (capped)
```

Jitter (0-20% random variance) prevents thundering herd when multiple products retry simultaneously.
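The schedule above can be sketched as a small function. This is an illustrative implementation of the formula, not the platform's actual code; the defaults (30s initial delay, multiplier 2, 300s cap, 20% jitter) match the schedule shown.

```python
import random


def backoff_delay(attempt: int,
                  initial_delay: float = 30.0,
                  multiplier: float = 2.0,
                  max_delay: float = 300.0,
                  jitter_pct: float = 0.20) -> float:
    """delay(n) = min(initial_delay * multiplier^n * jitter, max_delay)."""
    # Jitter adds 0-20% upward variance to avoid thundering herd.
    jitter = 1.0 + random.uniform(0.0, jitter_pct)
    return min(initial_delay * (multiplier ** attempt) * jitter, max_delay)
```

For `attempt=0` this yields 30-36s, and by `attempt=4` the raw delay (480s+) is always capped at 300s.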

| Category | Failure Types | Behavior |
| --- | --- | --- |
| Retryable | `runtime_error`, `quality_gate_failure`, `transient_io_error`, `iceberg_write_failure` | Retry with backoff, then DLQ |
| Non-retryable | `schema_mismatch`, `permission_denied`, `missing_input_location`, `authentication_error` | Sent directly to DLQ (no retry) |

Non-retryable failures skip the retry loop entirely — retrying a permission error will never succeed.
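The routing decision reduces to a small classifier. A minimal sketch, assuming the failure-type names from the table above (`route_failure` is a hypothetical helper, not part of the platform's API):

```python
RETRYABLE = {"runtime_error", "quality_gate_failure",
             "transient_io_error", "iceberg_write_failure"}
NON_RETRYABLE = {"schema_mismatch", "permission_denied",
                 "missing_input_location", "authentication_error"}


def route_failure(failure_type: str, retries_left: int) -> str:
    """Decide the next action for a failed event: 'retry' or 'dlq'."""
    if failure_type in NON_RETRYABLE:
        return "dlq"          # retrying e.g. a permission error never succeeds
    if failure_type in RETRYABLE and retries_left > 0:
        return "retry"        # exponential backoff, then re-process
    return "dlq"              # retries exhausted, or unknown failure type
```

Routing unknown failure types to the DLQ (rather than retrying) keeps misclassified errors visible to operators.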

```sh
# List all DLQ entries across products
akili dlq list

# Filter by product
akili dlq list --product daily-orders

# As JSON for scripting
akili dlq list --json
```

Output columns:

| Column | Description |
| --- | --- |
| `id` | DLQ entry identifier |
| `product` | Affected product name |
| `stage` | Where the failure occurred (`parse`, `correlation`, `execution`, `quality_gate`) |
| `error` | Error message summary |
| `created_at` | When the entry was created |
| `retry_count` | Number of retries attempted before DLQ |
```sh
# Get full details including the original event payload
akili dlq get dlq-entry-abc123

# As JSON
akili dlq get dlq-entry-abc123 --json
```

The detail view includes:

  • Original event payload (the data that failed)
  • Full error message and stack trace
  • Retry history (timestamp and error for each attempt)
  • Correlation ID for tracing related events
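Taken together, the list columns and detail fields describe the shape of a DLQ entry. A minimal sketch of that record as a dataclass (field names follow this page; the platform's actual schema may differ):

```python
import datetime
from dataclasses import dataclass, field


@dataclass
class RetryAttempt:
    timestamp: datetime.datetime
    error: str                      # error observed on this attempt


@dataclass
class DlqEntry:
    id: str                         # DLQ entry identifier
    product: str                    # affected product name
    stage: str                      # parse | correlation | execution | quality_gate
    error: str                      # error message summary
    created_at: datetime.datetime
    retry_count: int
    payload: dict                   # original event that failed
    stack_trace: str
    correlation_id: str             # for tracing related events
    retry_history: list[RetryAttempt] = field(default_factory=list)
```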
```sh
# Replay a single DLQ entry
akili dlq replay dlq-entry-abc123

# Replay using the current product version (not the version that failed)
akili dlq replay dlq-entry-abc123 --use-current-version
```

When replayed:

  • The event is re-injected at the stage where it originally failed
  • A new event_id is assigned to avoid deduplication rejection
  • The original correlation_id is preserved for end-to-end tracing
  • The DLQ entry is marked as replayed
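The second and third points can be sketched as a small transform: copy the failed event with a fresh `event_id` while keeping `correlation_id` intact. Illustrative only (the `Event` shape and `build_replay_event` helper are assumptions, not the platform's API):

```python
import uuid
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Event:
    event_id: str
    correlation_id: str
    stage: str        # stage where the event is re-injected
    payload: dict


def build_replay_event(original: Event) -> Event:
    # Fresh event_id avoids deduplication rejection;
    # correlation_id is preserved for end-to-end tracing.
    return replace(original, event_id=str(uuid.uuid4()))
```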
```sh
# Replay multiple entries
akili dlq replay-batch dlq-entry-1 dlq-entry-2 dlq-entry-3

# Batch replay with current version
akili dlq replay-batch dlq-entry-1 dlq-entry-2 --use-current-version
```

After investigation, purge entries that do not need replay:

```sh
akili dlq purge dlq-entry-abc123
```

Purge all entries older than a specified number of days:

```sh
# Purge entries older than 30 days (requires --confirm)
akili dlq purge-all --older-than-days 30 --confirm
```

If a product fails five consecutive execution windows (the default threshold), the circuit breaker opens to prevent further resource waste:

```
CLOSED (normal)
  --> On failure: increment counter
  --> If counter >= 5: transition to OPEN

OPEN (broken)
  --> Skip all executions for this product
  --> Alert: circuit_breaker opened
  --> After 300s: transition to HALF_OPEN

HALF_OPEN (testing)
  --> Allow 1 execution through
  --> Success: transition to CLOSED
  --> Failure: transition to OPEN
```

When the circuit breaker is open, new events for the product are sent directly to the DLQ without attempting execution.
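The state machine above can be sketched as a per-product breaker object. A minimal illustration using the documented defaults (threshold 5, 300s cooldown); the class and method names are assumptions, not the platform's implementation:

```python
import time
from typing import Optional


class CircuitBreaker:
    """CLOSED -> OPEN after `threshold` consecutive failures;
    OPEN -> HALF_OPEN after `cooldown` seconds; one probe decides."""

    def __init__(self, threshold: int = 5, cooldown: float = 300.0):
        self.threshold = threshold
        self.cooldown = cooldown
        self.failures = 0
        self.state = "CLOSED"
        self.opened_at = 0.0

    def allow_execution(self, now: Optional[float] = None) -> bool:
        now = time.monotonic() if now is None else now
        if self.state == "OPEN" and now - self.opened_at >= self.cooldown:
            self.state = "HALF_OPEN"  # let exactly one probe execution through
        return self.state != "OPEN"

    def record_success(self) -> None:
        self.failures = 0
        self.state = "CLOSED"

    def record_failure(self, now: Optional[float] = None) -> None:
        self.failures += 1
        if self.state == "HALF_OPEN" or self.failures >= self.threshold:
            self.state = "OPEN"
            self.opened_at = time.monotonic() if now is None else now
            self.failures = 0
```

A failed probe in HALF_OPEN reopens the breaker immediately, restarting the cooldown.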

A typical DLQ investigation follows these steps:

  1. Alert received — DLQ monitor detects a new entry and sends notification
  2. Inspect the entry — `akili dlq get <id>` to see the error and payload
  3. Identify root cause:
    • Schema mismatch? Update the input schema or source
    • Transform error? Fix the SQL/Python and redeploy
    • Transient failure? Replay the event directly
    • Quality gate? Adjust thresholds or fix upstream data
  4. Fix and replay — Fix the root cause, then akili dlq replay <id>
  5. Verify — Check that the replayed event completes successfully
  6. Purge remaining — Clean up acknowledged entries
```sh
# Check overall platform status (includes DLQ health)
akili status

# Check product-specific execution health
akili run list daily-orders

# View governance SLA (DLQ entries may indicate SLA risk)
akili governance sla daily-orders
```