Background Jobs with Claude Code

Job queues, retry logic, dead letter queues, and monitoring. The infrastructure that keeps async work reliable.

Background jobs are where a lot of silent failures live. A job fails, retries, fails again, and disappears — no error in the UI, no alert, just missing data days later. Claude Code helps build job infrastructure that makes failures visible.

The job definition prompt

Define a background job for [task]. Requirements:
- Job data: [what data the job needs]
- Idempotency key: [field that makes this job unique]
- Timeout: [how long before we consider it stuck]
- Max retries: [number]
- Retry backoff: exponential, starting at [delay]
- On max retries: move to dead letter queue

The job should be idempotent — running it twice should be safe.

Idempotency is the most important property. Background jobs get retried, and if a retry repeats a side effect (sending two emails, charging a card twice), that's a serious bug.
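A minimal sketch of what the idempotency key buys you. The `IdempotentRunner` class and the key format are hypothetical, and the in-memory set stands in for a durable store such as a database table with a unique constraint on the key.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class IdempotentRunner:
    """Runs a side effect at most once per idempotency key (illustrative only)."""

    # Assumption: a real system would persist completed keys, not hold them in memory.
    completed: set = field(default_factory=set)

    def run(self, idempotency_key: str, action: Callable[[], None]) -> bool:
        # A retry of an already-completed job becomes a no-op.
        if idempotency_key in self.completed:
            return False
        action()
        # Record the key only after the action succeeds, so a failed
        # attempt stays eligible for retry.
        self.completed.add(idempotency_key)
        return True


sent = []
runner = IdempotentRunner()
runner.run("welcome-email:user-42", lambda: sent.append("user-42"))
runner.run("welcome-email:user-42", lambda: sent.append("user-42"))  # retry is skipped
```

This sketch still has a gap: a crash after the action but before the key is recorded would repeat the side effect. Closing it usually means recording the key in the same transaction as the write, or passing the key to a downstream API that deduplicates on it.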

Retry strategy

Implement retry logic for this job:
- Retryable errors: network timeouts, temporary service unavailability (5xx)
- Non-retryable errors: validation errors, permanent failures (most 4xx; 408 and 429 are transient and belong in the retryable list)
- Backoff: exponential with jitter to prevent thundering herd
- Max delay: [cap the backoff at this value]

Log each retry: job ID, attempt number, error, next retry time.

The distinction between retryable and non-retryable errors is crucial. Retrying a 400 Bad Request is pointless: the same payload fails the same way every time, and each attempt wastes queue capacity. Claude handles this distinction correctly when you specify it.
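Here is roughly what that retry policy looks like in code. This is a sketch, not the output of any particular queue library: `TransientError`, `is_retryable_status`, `backoff_delay`, and `run_with_retries` are hypothetical names, and treating 408 and 429 as retryable is an assumption layered on top of the 5xx rule.

```python
import logging
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")
log = logging.getLogger("jobs")


class TransientError(Exception):
    """Hypothetical marker for failures worth retrying (timeouts, 5xx)."""


def is_retryable_status(status: int) -> bool:
    # 5xx is treated as transient. 408 and 429 are included as an assumption:
    # they are the 4xx codes that usually signal a temporary condition.
    return status >= 500 or status in (408, 429)


def backoff_delay(attempt: int, base: float, cap: float) -> float:
    # "Full jitter": pick a random delay between 0 and the capped
    # exponential ceiling so retries from many jobs spread out.
    ceiling = min(cap, base * 2 ** (attempt - 1))
    return random.uniform(0, ceiling)


def run_with_retries(
    job_id: str,
    fn: Callable[[], T],
    max_retries: int,
    base: float,
    cap: float,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    attempt = 1
    while True:
        try:
            return fn()
        except TransientError as exc:
            if attempt > max_retries:
                raise  # retries exhausted; the caller decides what happens next
            delay = backoff_delay(attempt, base, cap)
            log.warning(
                "job=%s attempt=%d error=%s next_retry_in=%.2fs",
                job_id, attempt, exc, delay,
            )
            sleep(delay)
            attempt += 1
```

Any exception other than `TransientError` propagates on the first attempt, which is the non-retryable path. The `sleep` parameter is injectable so tests can run without waiting.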

Dead letter queue

Implement a dead letter queue for failed jobs:
- Move jobs here after max retries exhausted
- Store: original job data, all error messages from each attempt, timestamps
- Alert: send notification when job lands in DLQ
- Replay: admin endpoint to requeue a DLQ job after fixing the underlying issue

Never automatically discard DLQ jobs — keep indefinitely until explicitly resolved.

The DLQ is your audit trail. Every failed job should end up here with full context, not silently disappear.
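To make "full context" concrete, here is a small in-memory sketch of the record a DLQ might keep. The `DeadLetter` and `DeadLetterQueue` names, the `notify` and `enqueue` callbacks, and the field layout are all assumptions for illustration; a real implementation would sit on durable storage behind an admin endpoint.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Any, Callable, Dict, List, Optional


@dataclass
class DeadLetter:
    job_id: str
    job_data: Dict[str, Any]          # original payload, unchanged
    errors: List[str]                 # one message per failed attempt
    attempt_times: List[datetime]     # when each attempt ran
    dead_at: datetime = field(default_factory=lambda: datetime.now(timezone.utc))
    replayed_at: Optional[datetime] = None


class DeadLetterQueue:
    def __init__(
        self,
        notify: Callable[[DeadLetter], None],
        enqueue: Callable[[str, Dict[str, Any]], None],
    ) -> None:
        self._entries: Dict[str, DeadLetter] = {}
        self._notify = notify      # hypothetical alert hook
        self._enqueue = enqueue    # hypothetical hook back into the main queue

    def add(self, entry: DeadLetter) -> None:
        self._entries[entry.job_id] = entry
        self._notify(entry)        # alert as soon as a job lands here

    def replay(self, job_id: str) -> None:
        entry = self._entries[job_id]
        self._enqueue(entry.job_id, entry.job_data)
        # Keep the record after replay; only resolve() removes it.
        entry.replayed_at = datetime.now(timezone.utc)

    def resolve(self, job_id: str) -> None:
        # Explicit, human-initiated removal. Nothing expires on its own.
        del self._entries[job_id]

    def entries(self) -> List[DeadLetter]:
        return list(self._entries.values())
```

Replay deliberately leaves the entry in place, so the history survives even if the requeued job fails again.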

Job monitoring

Add monitoring for the job queue:
- Metrics: queue depth, processing rate, failure rate, avg processing time
- Alerts: queue depth above [N] (stuck or overloaded), failure rate above [X]%
- Dashboard data: jobs processed today, failed today, in DLQ

Track job processing time by job type to identify slow jobs.
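The metrics and alert thresholds above could be sketched like this. `JobMetrics` and `check_alerts` are hypothetical names, and the counters live in memory here; in practice they would feed whatever metrics backend you already run.

```python
from collections import defaultdict
from dataclasses import dataclass, field
from typing import DefaultDict, List


@dataclass
class JobMetrics:
    processed: int = 0
    failed: int = 0
    # Processing times in seconds, grouped by job type.
    durations: DefaultDict[str, List[float]] = field(
        default_factory=lambda: defaultdict(list)
    )

    def record(self, job_type: str, seconds: float, ok: bool) -> None:
        self.processed += 1
        if not ok:
            self.failed += 1
        self.durations[job_type].append(seconds)

    def failure_rate(self) -> float:
        return self.failed / self.processed if self.processed else 0.0

    def avg_duration(self, job_type: str) -> float:
        samples = self.durations.get(job_type, [])
        return sum(samples) / len(samples) if samples else 0.0


def check_alerts(
    metrics: JobMetrics, queue_depth: int, max_depth: int, max_failure_pct: float
) -> List[str]:
    alerts = []
    if queue_depth > max_depth:
        alerts.append(f"queue depth {queue_depth} exceeds {max_depth}")
    if metrics.failure_rate() * 100 > max_failure_pct:
        alerts.append(f"failure rate {metrics.failure_rate():.0%} exceeds {max_failure_pct}%")
    return alerts
```

Grouping durations by job type is what lets you spot the one slow job that an overall average would hide.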

Scheduled jobs (cron)

Implement scheduled job [name] to run [schedule].

Requirements:
- Only one instance should run at a time (distributed lock)
- Log start, completion, duration, any errors
- If the previous run is still executing when the next is scheduled: skip and alert
- Idempotent: safe to run more than once if the lock fails or expires mid-run

Lock implementation: [Redis SET with NX and an expiry / database advisory lock]

The distributed lock is essential for scheduled jobs on multiple instances. Without it, every instance runs the job at the same time, which is usually harmless for read-only jobs and catastrophic for jobs that write.
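A sketch of the Redis variant, assuming a redis-py style client passed in as `client`. It uses `SET` with the `NX` and `EX` options, which sets the key and its expiry in one atomic step. The `run_exclusive` function and the `lock:cron:` key prefix are hypothetical.

```python
import uuid
from typing import Any, Callable

# Compare-and-delete, so a run only releases a lock it still owns.
RELEASE_SCRIPT = """
if redis.call("get", KEYS[1]) == ARGV[1] then
    return redis.call("del", KEYS[1])
end
return 0
"""


def run_exclusive(
    client: Any, name: str, ttl_seconds: int, job: Callable[[], None]
) -> bool:
    """Run job only if no other instance holds the lock. Returns False if skipped."""
    key = f"lock:cron:{name}"
    token = str(uuid.uuid4())  # identifies this run as the lock owner
    # SET key token NX EX ttl: succeeds only if the key does not exist yet.
    if not client.set(key, token, nx=True, ex=ttl_seconds):
        return False  # previous run still holds the lock: skip, and alert
    try:
        job()
        return True
    finally:
        # Release only if the stored token is still ours; if the TTL expired
        # and another instance took the lock, leave theirs alone.
        client.eval(RELEASE_SCRIPT, 1, key, token)
```

The TTL has to be longer than the slowest realistic run. If the lock expires mid-run, a second instance can start, which is why the job itself still needs to be idempotent.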

Testing background jobs

Write tests for this job. Test:
- Successful execution: job completes, expected side effects occur
- Transient failure: job retries, eventually succeeds
- Permanent failure: job hits max retries, moves to DLQ
- Idempotency: running the same job twice produces the same result, with no duplicate side effects
- Timeout: job that exceeds timeout is marked as failed

Use an in-memory queue implementation for testing — don't require Redis in tests.
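One possible shape for that in-memory queue, with two of the tests from the list. `InMemoryQueue` is a hypothetical stand-in written for this example, not a class from any queue library, and it retries immediately instead of waiting out a backoff.

```python
from collections import deque
from typing import Any, Callable, Deque, Dict, List, Tuple


class InMemoryQueue:
    """Test double: retries failed jobs, then parks them in a dead list."""

    def __init__(self, max_retries: int) -> None:
        self.max_retries = max_retries
        self.pending: Deque[Tuple[Dict[str, Any], int]] = deque()
        self.dead: List[Dict[str, Any]] = []

    def enqueue(self, data: Dict[str, Any]) -> None:
        self.pending.append((data, 0))  # 0 failures so far

    def drain(self, handler: Callable[[Dict[str, Any]], None]) -> None:
        while self.pending:
            data, failures = self.pending.popleft()
            try:
                handler(data)
            except Exception:
                if failures >= self.max_retries:
                    self.dead.append(data)  # retries exhausted
                else:
                    self.pending.append((data, failures + 1))


def test_transient_failure_eventually_succeeds() -> None:
    calls = []

    def flaky(data: Dict[str, Any]) -> None:
        calls.append(data)
        if len(calls) < 3:
            raise ConnectionError("temporary")

    queue = InMemoryQueue(max_retries=3)
    queue.enqueue({"id": 1})
    queue.drain(flaky)
    assert len(calls) == 3
    assert queue.dead == []


def test_permanent_failure_moves_to_dlq() -> None:
    def always_fails(data: Dict[str, Any]) -> None:
        raise RuntimeError("broken")

    queue = InMemoryQueue(max_retries=2)
    queue.enqueue({"id": 2})
    queue.drain(always_fails)
    assert queue.dead == [{"id": 2}]
```

With `max_retries=2` the handler runs three times in total: the first attempt plus two retries. The test functions follow the pytest naming convention but are plain functions, so they also run when called directly.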

Background job patterns — retry logic, DLQ, distributed locks, testing — are in the Agent Prompt Playbook. $29.