# Dead Letter Queue Management
When a pipeline stage fails after exhausting all retries, the failed event is sent to the dead letter queue (DLQ). The DLQ captures the original event, error context, and metadata needed for investigation and replay. Events in the DLQ are never automatically re-processed — they require explicit operator action.
## How Events Reach the DLQ

```mermaid
%%{init: {'flowchart': {'curve': 'basis'}}}%%
flowchart TB
    EVENT[Event arrives] --> PROCESS[Process event]
    PROCESS -- success --> DONE[Complete]
    PROCESS -- failure --> RETRY{Retries remaining?}
    RETRY -- yes --> BACKOFF[Exponential backoff]
    BACKOFF --> PROCESS
    RETRY -- no --> DLQ[Dead Letter Queue]
    DLQ --> ALERT[Alert: new DLQ entry]
    DLQ --> INSPECT[Operator inspects]
    INSPECT --> REPLAY[Replay]
    INSPECT --> ACK[Acknowledge]
    INSPECT --> PURGE[Purge]
```
## What Goes to the DLQ

| Failure Type | DLQ Topic Suffix | Example |
|---|---|---|
| Unparseable event (bad JSON) | `.parse_failure` | Corrupt message on event bus |
| Missing required field | `.parse_failure` | Event without `product_id` |
| Event for unknown product | `.correlation` | Stale reference after product deletion |
| Quality gate failure after max retries | `.quality_gate` | Persistent data quality issue |
| Pipeline runtime error after retries | `.execution` | Transform SQL error, resource exhaustion |
| Event publish failure (outbox) | `.event_publish` | Redpanda unavailable |
## What Does NOT Go to the DLQ

Events for unknown topics are logged and dropped, not queued. This prevents unbounded DLQ growth from misconfigured subscriptions.
## Retry Policy

Retries happen before DLQ routing. Only retry-exhausted events enter the DLQ.
### Backoff Calculation

```
delay(n) = min(initial_delay * multiplier^n * jitter, max_delay)
```

- n=0: ~30s
- n=1: ~60s
- n=2: ~120s
- n=3: ~240s
- n=4: 300s (capped)

Jitter (0-20% random variance) prevents a thundering herd when multiple products retry simultaneously.
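The formula can be sketched in Python. The parameter values (`initial_delay=30`, `multiplier=2`, `max_delay=300`) are inferred from the example delays above, not taken from the platform's configuration:

```python
import random

def backoff_delay(n: int,
                  initial_delay: float = 30.0,
                  multiplier: float = 2.0,
                  max_delay: float = 300.0) -> float:
    """Retry delay in seconds for attempt n.

    Jitter adds 0-20% random variance to spread out simultaneous
    retries; the result is capped at max_delay.
    """
    jitter = 1.0 + random.uniform(0.0, 0.2)
    return min(initial_delay * (multiplier ** n) * jitter, max_delay)
```

With these defaults, attempt 3 lands between 240s and 288s depending on jitter, and attempt 4 always hits the 300s cap.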
### Retryable vs Non-Retryable Failures

| Category | Failure Types | Behavior |
|---|---|---|
| Retryable | `runtime_error`, `quality_gate_failure`, `transient_io_error`, `iceberg_write_failure` | Retry with backoff, then DLQ |
| Non-retryable | `schema_mismatch`, `permission_denied`, `missing_input_location`, `authentication_error` | Sent directly to DLQ (no retry) |
Non-retryable failures skip the retry loop entirely — retrying a permission error will never succeed.
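The routing decision can be sketched as follows. `should_retry` is a hypothetical helper, and the maximum of 5 attempts is an assumption for illustration; the actual limit is configuration-dependent:

```python
# Failure types from the table above.
RETRYABLE = {"runtime_error", "quality_gate_failure",
             "transient_io_error", "iceberg_write_failure"}
NON_RETRYABLE = {"schema_mismatch", "permission_denied",
                 "missing_input_location", "authentication_error"}

def should_retry(failure_type: str, retries_used: int,
                 max_retries: int = 5) -> bool:
    """Non-retryable failures go straight to the DLQ; retryable
    ones retry with backoff until max_retries is exhausted."""
    if failure_type in NON_RETRYABLE:
        return False
    return failure_type in RETRYABLE and retries_used < max_retries
```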
## Inspecting the DLQ

### Listing DLQ Entries

```shell
# List all DLQ entries across products
akili dlq list

# Filter by product
akili dlq list --product daily-orders

# As JSON for scripting
akili dlq list --json
```

Output columns:
| Column | Description |
|---|---|
| `id` | DLQ entry identifier |
| `product` | Affected product name |
| `stage` | Where the failure occurred (`parse`, `correlation`, `execution`, `quality_gate`) |
| `error` | Error message summary |
| `created_at` | When the entry was created |
| `retry_count` | Number of retries attempted before DLQ |
### Viewing Entry Details

```shell
# Get full details including the original event payload
akili dlq get dlq-entry-abc123

# As JSON
akili dlq get dlq-entry-abc123 --json
```

The detail view includes:
- Original event payload (the data that failed)
- Full error message and stack trace
- Retry history (timestamp and error for each attempt)
- Correlation ID for tracing related events
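For illustration, an entry with those fields can be modeled as a data structure. Field names follow the list-view columns above; the actual entry schema may differ:

```python
from dataclasses import dataclass, field
from typing import Any

@dataclass
class RetryAttempt:
    timestamp: str   # when the attempt ran
    error: str       # error raised by that attempt

@dataclass
class DlqEntry:
    id: str
    product: str
    stage: str                      # parse | correlation | execution | quality_gate
    error: str                      # summary shown in the list view
    created_at: str
    retry_count: int
    payload: dict[str, Any]         # original event that failed
    correlation_id: str             # for end-to-end tracing
    retry_history: list[RetryAttempt] = field(default_factory=list)
```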
## Replaying Events

### Single Event Replay

```shell
# Replay a single DLQ entry
akili dlq replay dlq-entry-abc123

# Replay using the current product version (not the version that failed)
akili dlq replay dlq-entry-abc123 --use-current-version
```

When replayed:

- The event is re-injected at the stage where it originally failed
- A new `event_id` is assigned to avoid deduplication rejection
- The original `correlation_id` is preserved for end-to-end tracing
- The DLQ entry is marked as replayed
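The replay semantics can be sketched in Python. `build_replay_event` is a hypothetical helper for illustration, not part of the `akili` CLI:

```python
import uuid

def build_replay_event(entry: dict) -> dict:
    """Construct the event re-injected from a DLQ entry."""
    return {
        "event_id": str(uuid.uuid4()),              # fresh id so dedup does not reject it
        "correlation_id": entry["correlation_id"],  # preserved for end-to-end tracing
        "payload": entry["payload"],                # original data that failed
        "target_stage": entry["stage"],             # re-enter where it originally failed
    }
```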
### Batch Replay

```shell
# Replay multiple entries
akili dlq replay-batch dlq-entry-1 dlq-entry-2 dlq-entry-3

# Batch replay with current version
akili dlq replay-batch dlq-entry-1 dlq-entry-2 --use-current-version
```

## Purging Entries
Section titled “Purging Entries”Single Entry Purge
Section titled “Single Entry Purge”After investigation, purge entries that do not need replay:
akili dlq purge dlq-entry-abc123Bulk Purge
Section titled “Bulk Purge”Purge all entries older than a specified number of days:
# Purge entries older than 30 days (requires --confirm)akili dlq purge-all --older-than-days 30 --confirmCircuit Breaker
If a product fails N consecutive execution windows, the circuit breaker opens to prevent further resource waste:

```
CLOSED (normal)      --> on failure: increment counter
                         if counter >= 5: transition to OPEN
OPEN (broken)        --> skip all executions for this product
                         alert: circuit_breaker opened
                         after 300s: transition to HALF_OPEN
HALF_OPEN (testing)  --> allow 1 execution through
                         success: transition to CLOSED
                         failure: transition to OPEN
```

When the circuit breaker is open, new events for the product are sent directly to the DLQ without attempting execution.
## DLQ Investigation Workflow

A typical DLQ investigation follows these steps:

1. **Alert received** — DLQ monitor detects a new entry and sends notification
2. **Inspect the entry** — `akili dlq get <id>` to see the error and payload
3. **Identify root cause:**
   - Schema mismatch? Update the input schema or source
   - Transform error? Fix the SQL/Python and redeploy
   - Transient failure? Replay the event directly
   - Quality gate? Adjust thresholds or fix upstream data
4. **Fix and replay** — fix the root cause, then `akili dlq replay <id>`
5. **Verify** — check that the replayed event completes successfully
6. **Purge remaining** — clean up acknowledged entries
## Monitoring DLQ Health

```shell
# Check overall platform status (includes DLQ health)
akili status

# Check product-specific execution health
akili run list daily-orders

# View governance SLA (DLQ entries may indicate SLA risk)
akili governance sla daily-orders
```

## Related
- Troubleshooting — common error patterns
- Data Lifecycle — failure handling at each stage
- `akili dlq` — full CLI reference