Background jobs are where a lot of silent failures live. A job fails, retries, fails again, and disappears — no error in the UI, no alert, just missing data days later. Claude Code helps build job infrastructure that makes failures visible.
The job definition prompt
Define a background job for [task]. Requirements:
- Job data: [what data the job needs]
- Idempotency key: [field that makes this job unique]
- Timeout: [how long before we consider it stuck]
- Max retries: [number]
- Retry backoff: exponential, starting at [delay]
- On max retries: move to dead letter queue
The job should be idempotent — running it twice should be safe.
Idempotency is the most important property. Background jobs get retried. If a retry causes a duplicate action (sending two emails, charging a customer twice), that's a serious bug.
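A minimal sketch of what "safe to run twice" means in practice. All names here are hypothetical; the in-memory set stands in for a persistent store — a real implementation would use a database table with a unique constraint on the key and record it in the same transaction as the job's side effect.

```python
# In-memory stand-in for a persistent store of completed idempotency keys.
completed_keys: set[str] = set()

def run_job(idempotency_key: str, action) -> bool:
    """Run `action` at most once per idempotency key.

    Returns True if the action ran, False if it was skipped as a duplicate.
    """
    if idempotency_key in completed_keys:
        return False  # retry of an already-completed job: safe no-op
    action()
    completed_keys.add(idempotency_key)
    return True
```

The second call with the same key is a no-op, which is exactly what makes a retry harmless.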
Retry strategy
Implement retry logic for this job:
- Retryable errors: network timeouts, temporary service unavailability (5xx)
- Non-retryable errors: validation errors, permanent failures (4xx)
- Backoff: exponential with jitter to prevent thundering herd
- Max delay: [cap the backoff at this value]
Log each retry: job ID, attempt number, error, next retry time.
The distinction between retryable and non-retryable errors is crucial. Retrying a 400 Bad Request is pointless and wastes queue capacity. Claude handles this distinction correctly when you specify it.
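The classification and backoff rules above fit in a few lines. This is a sketch with hypothetical names: "full jitter" means the actual delay is drawn uniformly between zero and the exponential ceiling, which spreads retries out and avoids the thundering herd.

```python
import random

def is_retryable(status: int) -> bool:
    """5xx responses are transient; 4xx responses won't succeed on retry."""
    return 500 <= status < 600

def backoff_delay(attempt: int, base: float = 1.0, cap: float = 60.0) -> float:
    """Exponential backoff with full jitter, capped at `cap` seconds."""
    return random.uniform(0.0, min(cap, base * 2 ** attempt))
```

With `base=1.0`, attempt 3 waits somewhere between 0 and 8 seconds; by attempt 6 and beyond, the cap holds the ceiling at 60 seconds.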
Dead letter queue
Implement a dead letter queue for failed jobs:
- Move jobs here after max retries are exhausted
- Store: original job data, all error messages from each attempt, timestamps
- Alert: send notification when job lands in DLQ
- Replay: admin endpoint to requeue a DLQ job after fixing the underlying issue
Never automatically discard DLQ jobs — keep them until they are explicitly resolved.
The DLQ is your audit trail. Every failed job should end up here with full context, not silently disappear.
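One way the DLQ requirements might look in code — a sketch with hypothetical names, using an in-memory list where a real system would use a durable table or queue:

```python
import time
from dataclasses import dataclass, field

@dataclass
class DeadLetter:
    """Everything needed to diagnose and replay a failed job."""
    job_id: str
    payload: dict
    errors: list[str]                       # one error message per attempt
    failed_at: float = field(default_factory=time.time)

dead_letter_queue: list[DeadLetter] = []

def move_to_dlq(job_id: str, payload: dict, errors: list[str], alert=print) -> DeadLetter:
    entry = DeadLetter(job_id, payload, list(errors))
    dead_letter_queue.append(entry)         # never discarded automatically
    alert(f"job {job_id} entered DLQ after {len(errors)} attempts")
    return entry

def replay(job_id: str, requeue) -> bool:
    """Admin replay: requeue a DLQ entry once the underlying issue is fixed."""
    for i, entry in enumerate(dead_letter_queue):
        if entry.job_id == job_id:
            dead_letter_queue.pop(i)
            requeue(entry.payload)
            return True
    return False
```

The `alert` and `requeue` callbacks are placeholders for whatever notification channel and enqueue function the real system uses.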
Job monitoring
Add monitoring for the job queue:
- Metrics: queue depth, processing rate, failure rate, avg processing time
- Alerts: queue depth above [N] (stuck or overloaded), failure rate above [X]%
- Dashboard data: jobs processed today, failed today, in DLQ
Track job processing time by job type to identify slow jobs.
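The metrics above can be collected with a small aggregator. This is an in-process sketch with hypothetical names; a real system would export these counters to Prometheus or StatsD rather than hold them in memory:

```python
from collections import defaultdict

class JobMetrics:
    """Per-job-type processing times plus overall success/failure counts."""
    def __init__(self):
        self.durations = defaultdict(list)  # job_type -> processing times (s)
        self.successes = 0
        self.failures = 0

    def record(self, job_type: str, seconds: float, ok: bool) -> None:
        self.durations[job_type].append(seconds)
        if ok:
            self.successes += 1
        else:
            self.failures += 1

    def failure_rate(self) -> float:
        total = self.successes + self.failures
        return self.failures / total if total else 0.0

    def avg_time(self, job_type: str) -> float:
        times = self.durations[job_type]
        return sum(times) / len(times) if times else 0.0
```

Keeping durations keyed by job type is what lets you spot the one slow job hiding behind a healthy overall average.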
Scheduled jobs (cron)
Implement scheduled job [name] to run [schedule].
Requirements:
- Only one instance should run at a time (distributed lock)
- Log start, completion, duration, any errors
- If previous run is still executing when next is scheduled: skip and alert
- Idempotent: safe to run multiple times if deduplication fails
Lock implementation: [Redis SETNX with expiry / database advisory lock]
The distributed lock is essential for scheduled jobs on multiple instances. Without it, every instance runs the job simultaneously — usually fine for reads, catastrophic for writes.
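The Redis SETNX pattern mentioned above, sketched against an in-memory stand-in so the logic is visible without a Redis server. All names are hypothetical; with a real client this would be a single `SET key value NX EX ttl` call, and a production lock would also release via an ownership token so one instance can't delete another's lock.

```python
import time

class FakeRedis:
    """In-memory stand-in for Redis SET NX with expiry (enough for the sketch)."""
    def __init__(self):
        self._store = {}                    # key -> (value, expires_at)

    def set_nx(self, key: str, value: str, ttl: float) -> bool:
        entry = self._store.get(key)
        if entry is not None and entry[1] > time.time():
            return False                    # lock held and not yet expired
        self._store[key] = (value, time.time() + ttl)
        return True

def run_exclusively(redis, job_name: str, fn, ttl: float = 300.0):
    """Run fn only if this instance wins the lock; otherwise skip (and alert)."""
    if not redis.set_nx(f"lock:{job_name}", "owner", ttl):
        return None                         # previous run still holds the lock
    return fn()
```

The expiry on the lock matters as much as the lock itself: without a TTL, a crashed instance would hold the lock forever and the scheduled job would never run again.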
Testing background jobs
Write tests for this job. Test:
- Successful execution: job completes, expected side effects occur
- Transient failure: job retries, eventually succeeds
- Permanent failure: job hits max retries, moves to DLQ
- Idempotency: running the same job twice produces the same result
- Timeout: job that exceeds timeout is marked as failed
Use an in-memory queue implementation for testing — don't require Redis in tests.
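A sketch of what that in-memory test double might look like, with two of the test cases from the list above. The queue's interface here is hypothetical — match it to whatever your real worker exposes:

```python
class InMemoryQueue:
    """Test double for the job queue: same retry/DLQ behavior, no Redis."""
    def __init__(self, max_attempts: int = 3):
        self.max_attempts = max_attempts
        self.dlq = []                       # (job, last_error) pairs

    def process(self, job, handler):
        last_error = ""
        for attempt in range(1, self.max_attempts + 1):
            try:
                return handler(job)         # success: expected side effects ran
            except Exception as exc:
                last_error = str(exc)
        self.dlq.append((job, last_error))  # retries exhausted: dead letter
        return None

def test_transient_failure_eventually_succeeds():
    q = InMemoryQueue(max_attempts=3)
    calls = {"n": 0}
    def handler(job):
        calls["n"] += 1
        if calls["n"] < 3:
            raise RuntimeError("timeout")   # transient: succeeds on attempt 3
        return "done"
    assert q.process({"id": 1}, handler) == "done"
    assert q.dlq == []

def test_permanent_failure_moves_to_dlq():
    q = InMemoryQueue(max_attempts=3)
    def handler(job):
        raise RuntimeError("validation error")
    assert q.process({"id": 2}, handler) is None
    assert q.dlq == [({"id": 2}, "validation error")]
```

Because the double runs retries synchronously, the whole retry-then-DLQ path is exercised in milliseconds with no broker running.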
Background job patterns — retry logic, DLQ, distributed locks, testing — are in the Agent Prompt Playbook. $29.