
# Dead Letter Queue Management

When a pipeline stage fails after exhausting all retries, the failed event is sent to the dead letter queue (DLQ). The DLQ captures the original event, error context, and metadata needed for investigation and replay. Events in the DLQ are never automatically re-processed — they require explicit operator action.

```mermaid
%%{init: {'flowchart': {'curve': 'basis'}}}%%
flowchart TB
    EVENT[Event arrives] --> PROCESS[Process event]
    PROCESS -- success --> DONE[Complete]
    PROCESS -- failure --> RETRY{Retries remaining?}
    RETRY -- yes --> BACKOFF[Exponential backoff]
    BACKOFF --> PROCESS
    RETRY -- no --> DLQ[Dead Letter Queue]
    DLQ --> ALERT[Alert: new DLQ entry]
    DLQ --> INSPECT[Operator inspects]
    INSPECT --> REPLAY[Replay]
    INSPECT --> ACK[Acknowledge]
    INSPECT --> PURGE[Purge]
```
| Failure Type | DLQ Topic Suffix | Example |
| --- | --- | --- |
| Unparseable event (bad JSON) | `.parse_failure` | Corrupt message on event bus |
| Missing required field | `.parse_failure` | Event without `product_id` |
| Event for unknown product | `.correlation` | Stale reference after product deletion |
| Quality gate failure after max retries | `.quality_gate` | Persistent data quality issue |
| Pipeline runtime error after retries | `.execution` | Transform SQL error, resource exhaustion |
| Event publish failure (outbox) | `.event_publish` | Redpanda unavailable |

Events for unknown topics are logged and dropped, not queued. This prevents unbounded DLQ growth from misconfigured subscriptions.

Retries happen before DLQ routing. Only retry-exhausted events enter the DLQ.

```
delay(n) = min(initial_delay * multiplier^n * jitter, max_delay)

n=0: ~30s
n=1: ~60s
n=2: ~120s
n=3: ~240s
n=4: 300s (capped)
```

Jitter (0-20% random variance) prevents thundering herd when multiple products retry simultaneously.
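The schedule above can be sketched as a small function. This is an illustrative implementation of the formula, not the platform's actual code; the defaults (30s initial delay, multiplier 2, 300s cap, 20% jitter) match the schedule shown.

```python
import random


def backoff_delay(attempt: int,
                  initial_delay: float = 30.0,
                  multiplier: float = 2.0,
                  max_delay: float = 300.0,
                  jitter_pct: float = 0.20) -> float:
    """delay(n) = min(initial_delay * multiplier^n * jitter, max_delay)."""
    # Jitter adds 0-20% upward variance to avoid thundering herd.
    jitter = 1.0 + random.uniform(0.0, jitter_pct)
    return min(initial_delay * (multiplier ** attempt) * jitter, max_delay)
```

For `attempt=0` this yields 30-36s, and by `attempt=4` the raw delay (480s+) is always capped at 300s.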

| Category | Failure Types | Behavior |
| --- | --- | --- |
| Retryable | `runtime_error`, `quality_gate_failure`, `transient_io_error`, `iceberg_write_failure` | Retry with backoff, then DLQ |
| Non-retryable | `schema_mismatch`, `permission_denied`, `missing_input_location`, `authentication_error` | Sent directly to DLQ (no retry) |

Non-retryable failures skip the retry loop entirely — retrying a permission error will never succeed.
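The routing decision reduces to a small classifier. A minimal sketch, assuming the failure-type names from the table above (`route_failure` is a hypothetical helper, not part of the platform's API):

```python
RETRYABLE = {"runtime_error", "quality_gate_failure",
             "transient_io_error", "iceberg_write_failure"}
NON_RETRYABLE = {"schema_mismatch", "permission_denied",
                 "missing_input_location", "authentication_error"}


def route_failure(failure_type: str, retries_left: int) -> str:
    """Decide the next action for a failed event: 'retry' or 'dlq'."""
    if failure_type in NON_RETRYABLE:
        return "dlq"          # retrying e.g. a permission error never succeeds
    if failure_type in RETRYABLE and retries_left > 0:
        return "retry"        # exponential backoff, then re-process
    return "dlq"              # retries exhausted, or unknown failure type
```

Routing unknown failure types to the DLQ (rather than retrying) keeps misclassified errors visible to operators.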

```sh
# List all DLQ entries across products
akili dlq list

# Filter by product
akili dlq list --product daily-orders

# As JSON for scripting
akili dlq list --json
```

Output columns:

| Column | Description |
| --- | --- |
| `id` | DLQ entry identifier |
| `product` | Affected product name |
| `stage` | Where the failure occurred (`parse`, `correlation`, `execution`, `quality_gate`) |
| `error` | Error message summary |
| `created_at` | When the entry was created |
| `retry_count` | Number of retries attempted before DLQ |
```sh
# Get full details including the original event payload
akili dlq get dlq-entry-abc123

# As JSON
akili dlq get dlq-entry-abc123 --json
```

The detail view includes:

  • Original event payload (the data that failed)
  • Full error message and stack trace
  • Retry history (timestamp and error for each attempt)
  • Correlation ID for tracing related events
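Taken together, the list columns and detail fields describe the shape of a DLQ entry. A minimal sketch of that record as a dataclass (field names follow this page; the platform's actual schema may differ):

```python
import datetime
from dataclasses import dataclass, field


@dataclass
class RetryAttempt:
    timestamp: datetime.datetime
    error: str                      # error observed on this attempt


@dataclass
class DlqEntry:
    id: str                         # DLQ entry identifier
    product: str                    # affected product name
    stage: str                      # parse | correlation | execution | quality_gate
    error: str                      # error message summary
    created_at: datetime.datetime
    retry_count: int
    payload: dict                   # original event that failed
    stack_trace: str
    correlation_id: str             # for tracing related events
    retry_history: list[RetryAttempt] = field(default_factory=list)
```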
```sh
# Replay a single DLQ entry
akili dlq replay dlq-entry-abc123

# Replay using the current product version (not the version that failed)
akili dlq replay dlq-entry-abc123 --use-current-version
```

When replayed:

  • The event is re-injected at the stage where it originally failed
  • A new event_id is assigned to avoid deduplication rejection
  • The original correlation_id is preserved for end-to-end tracing
  • The DLQ entry is marked as replayed
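The second and third points can be sketched as a small transform: copy the failed event with a fresh `event_id` while keeping `correlation_id` intact. Illustrative only (the `Event` shape and `build_replay_event` helper are assumptions, not the platform's API):

```python
import uuid
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Event:
    event_id: str
    correlation_id: str
    stage: str        # stage where the event is re-injected
    payload: dict


def build_replay_event(original: Event) -> Event:
    # Fresh event_id avoids deduplication rejection;
    # correlation_id is preserved for end-to-end tracing.
    return replace(original, event_id=str(uuid.uuid4()))
```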
```sh
# Replay multiple entries
akili dlq replay-batch dlq-entry-1 dlq-entry-2 dlq-entry-3

# Batch replay with current version
akili dlq replay-batch dlq-entry-1 dlq-entry-2 --use-current-version
```

After investigation, purge entries that do not need replay:

```sh
akili dlq purge dlq-entry-abc123
```

Purge all entries older than a specified number of days:

```sh
# Purge entries older than 30 days (requires --confirm)
akili dlq purge-all --older-than-days 30 --confirm
```

If a product fails five consecutive execution windows (the default threshold), the circuit breaker opens to prevent further resource waste:

```
CLOSED (normal)
  --> On failure: increment counter
  --> If counter >= 5: transition to OPEN

OPEN (broken)
  --> Skip all executions for this product
  --> Alert: circuit_breaker opened
  --> After 300s: transition to HALF_OPEN

HALF_OPEN (testing)
  --> Allow 1 execution through
  --> Success: transition to CLOSED
  --> Failure: transition to OPEN
```

When the circuit breaker is open, new events for the product are sent directly to the DLQ without attempting execution.
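The state machine above can be sketched as a per-product breaker object. A minimal illustration using the documented defaults (threshold 5, 300s cooldown); the class and method names are assumptions, not the platform's implementation:

```python
import time
from typing import Optional


class CircuitBreaker:
    """CLOSED -> OPEN after `threshold` consecutive failures;
    OPEN -> HALF_OPEN after `cooldown` seconds; one probe decides."""

    def __init__(self, threshold: int = 5, cooldown: float = 300.0):
        self.threshold = threshold
        self.cooldown = cooldown
        self.failures = 0
        self.state = "CLOSED"
        self.opened_at = 0.0

    def allow_execution(self, now: Optional[float] = None) -> bool:
        now = time.monotonic() if now is None else now
        if self.state == "OPEN" and now - self.opened_at >= self.cooldown:
            self.state = "HALF_OPEN"  # let exactly one probe execution through
        return self.state != "OPEN"

    def record_success(self) -> None:
        self.failures = 0
        self.state = "CLOSED"

    def record_failure(self, now: Optional[float] = None) -> None:
        self.failures += 1
        if self.state == "HALF_OPEN" or self.failures >= self.threshold:
            self.state = "OPEN"
            self.opened_at = time.monotonic() if now is None else now
            self.failures = 0
```

A failed probe in HALF_OPEN reopens the breaker immediately, restarting the cooldown.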

A typical DLQ investigation follows these steps:

  1. Alert received — DLQ monitor detects a new entry and sends notification
  2. Inspect the entry — `akili dlq get <id>` to see the error and payload
  3. Identify root cause:
    • Schema mismatch? Update the input schema or source
    • Transform error? Fix the SQL/Python and redeploy
    • Transient failure? Replay the event directly
    • Quality gate? Adjust thresholds or fix upstream data
  4. Fix and replay — Fix the root cause, then akili dlq replay <id>
  5. Verify — Check that the replayed event completes successfully
  6. Purge remaining — Clean up acknowledged entries
```sh
# Check overall platform status (includes DLQ health)
akili status

# Check product-specific execution health
akili run list daily-orders

# View governance SLA (DLQ entries may indicate SLA risk)
akili governance sla daily-orders
```