Data pipelines are a place where Claude Code can save significant time — and a place where subtle bugs have significant consequences. A pipeline that silently drops records or processes duplicates can corrupt data in ways that are hard to detect and hard to fix.
Here's how to get reliable pipeline code from Claude.
The idempotency requirement
Always specify this upfront:
This pipeline must be idempotent: running it twice produces
the same result as running it once.
Specifically:
- Don't insert if record already exists
- Track what's been processed using [checkpoint table/field]
- If the pipeline fails mid-run and restarts, it should continue
from where it stopped, not restart from the beginning
Without idempotency, a pipeline that crashes and restarts will duplicate data. Claude doesn't add idempotency by default — you have to ask for it.
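The simplest idempotency mechanism is a unique key plus an insert that ignores duplicates. A minimal sketch, assuming each record carries a stable unique `id` and using a hypothetical `events` table in SQLite:

```python
import sqlite3

def load_records(conn: sqlite3.Connection, records: list[dict]) -> None:
    """Idempotent load: running it twice produces the same result as once.

    Assumes each record has a stable unique 'id' (hypothetical key). The
    PRIMARY KEY constraint plus INSERT OR IGNORE makes re-inserts no-ops.
    """
    conn.execute(
        "CREATE TABLE IF NOT EXISTS events (id TEXT PRIMARY KEY, payload TEXT)"
    )
    conn.executemany(
        "INSERT OR IGNORE INTO events (id, payload) VALUES (:id, :payload)",
        records,
    )
    conn.commit()

conn = sqlite3.connect(":memory:")
batch = [{"id": "a1", "payload": "x"}, {"id": "a2", "payload": "y"}]
load_records(conn, batch)
load_records(conn, batch)  # second run inserts nothing
count = conn.execute("SELECT COUNT(*) FROM events").fetchone()[0]
print(count)  # → 2
```

In Postgres the equivalent is `ON CONFLICT DO NOTHING`; the principle is the same: the database, not application logic, enforces "don't insert if the record already exists."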
Checkpoint pattern
Implement checkpointing for this batch job:
- Track last processed ID/timestamp in a checkpoint table
- On startup, read checkpoint and continue from there
- Update checkpoint after every [batch size] records
- On failure, roll back to last checkpoint
Schema for checkpoint table: [describe or let Claude propose]
This gives you restartable pipelines without re-processing from the beginning every time. For large datasets, this is the difference between a 2-minute restart and a 2-hour restart.
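The checkpoint pattern above can be sketched in a few lines. This is one possible shape, assuming a monotonically increasing record ID and a hypothetical job name `daily_load`; the checkpoint advances once per batch, so a restart resumes from the last committed batch:

```python
import sqlite3

JOB = "daily_load"  # hypothetical job name

def read_checkpoint(conn: sqlite3.Connection) -> int:
    conn.execute(
        "CREATE TABLE IF NOT EXISTS checkpoint (job TEXT PRIMARY KEY, last_id INTEGER)"
    )
    row = conn.execute("SELECT last_id FROM checkpoint WHERE job = ?", (JOB,)).fetchone()
    return row[0] if row else 0

def save_checkpoint(conn: sqlite3.Connection, last_id: int) -> None:
    conn.execute(
        "INSERT OR REPLACE INTO checkpoint (job, last_id) VALUES (?, ?)", (JOB, last_id)
    )
    conn.commit()

def run(conn: sqlite3.Connection, rows, batch_size: int = 100) -> int:
    """Process rows past the checkpoint; return how many were processed."""
    start = read_checkpoint(conn)
    done, batch = 0, []
    for rid, value in rows:
        if rid <= start:
            continue  # already handled in a previous run
        batch.append((rid, value))
        if len(batch) == batch_size:
            # ... process the batch here, then advance the checkpoint ...
            save_checkpoint(conn, batch[-1][0])
            done += len(batch)
            batch = []
    if batch:
        save_checkpoint(conn, batch[-1][0])
        done += len(batch)
    return done

conn = sqlite3.connect(":memory:")
rows = [(i, f"rec{i}") for i in range(1, 6)]
first = run(conn, rows, batch_size=2)   # processes all 5
second = run(conn, rows, batch_size=2)  # restart: resumes past checkpoint, processes 0
print(first, second)  # → 5 0
```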
Error handling in pipelines
This pipeline should handle errors without stopping. Requirements:
- Log failed records to an error table with: record ID, error message, timestamp
- Continue processing remaining records
- At the end, report: processed count, success count, failure count
- Don't throw — accumulate errors and report at the end
A failure rate above [X]% should be flagged as a pipeline error.
The "don't throw" constraint changes the error model from "fail fast" to "fail gracefully." For pipelines processing thousands of records, one bad record shouldn't kill the whole run.
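The accumulate-and-report model looks like this in outline. A minimal sketch with a hypothetical `parse_amount` transform; the only exception the pipeline itself raises is the aggregate failure-rate check:

```python
def run_pipeline(records, transform, max_failure_rate=0.1):
    """Process every record; collect per-record errors instead of raising."""
    results, errors = [], []
    for rec in records:
        try:
            results.append(transform(rec))
        except Exception as exc:
            errors.append({"record_id": rec.get("id"), "error": str(exc)})
    summary = {
        "processed": len(records),
        "succeeded": len(results),
        "failed": len(errors),
    }
    # Aggregate check: only an excessive failure rate fails the whole run
    if records and len(errors) / len(records) > max_failure_rate:
        raise RuntimeError(f"failure rate too high: {summary}")
    return results, errors, summary

def parse_amount(rec):  # hypothetical transform
    return {"id": rec["id"], "amount": float(rec["amount"])}

records = [
    {"id": 1, "amount": "10.5"},
    {"id": 2, "amount": "oops"},  # bad record: logged, doesn't kill the run
    {"id": 3, "amount": "3"},
]
results, errors, summary = run_pipeline(records, parse_amount, max_failure_rate=0.5)
print(summary)  # → {'processed': 3, 'succeeded': 2, 'failed': 1}
```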
Testing data transformations
Write tests for this transformation function. Test cases:
- Valid input with all fields present
- Valid input with optional fields missing
- Input with null values where nulls are allowed
- Input with null values where nulls are not allowed
- Boundary values: empty strings, zero, max values
- Input that should be rejected
For each test: input data, expected output or expected error.
Data transformation tests need to be specific about the data. The prompt above generates tests that double as documentation: you can see what the function is supposed to do just by reading them.
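Here is the shape of tests that prompt tends to produce, written against a hypothetical `normalize` transform (required `email`, optional `name`). Each case pairs concrete input data with an expected output or expected error:

```python
def normalize(rec: dict) -> dict:
    """Hypothetical transform: 'email' is required, 'name' is optional."""
    if rec.get("email") is None:
        raise ValueError("email is required")
    return {
        "email": rec["email"].strip().lower(),
        "name": (rec.get("name") or "unknown").strip(),
    }

# Valid input with all fields present
assert normalize({"email": " A@B.COM ", "name": "Ann"}) == {"email": "a@b.com", "name": "Ann"}
# Valid input with optional field missing
assert normalize({"email": "a@b.com"})["name"] == "unknown"
# Null where nulls are allowed (optional name)
assert normalize({"email": "a@b.com", "name": None})["name"] == "unknown"
# Null where nulls are not allowed (required email)
try:
    normalize({"email": None})
    assert False, "expected ValueError"
except ValueError:
    pass
# Boundary value: empty string for an optional field
assert normalize({"email": "a@b.com", "name": ""})["name"] == "unknown"
```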
Performance for large datasets
This pipeline will process [X million] records.
Optimize for throughput:
- Process in batches of [N] (explain the tradeoffs)
- Use bulk insert instead of single-record inserts
- Index the fields used in WHERE clauses
- Avoid N+1 queries — fetch related data in bulk
Profile the critical path and tell me where time will go.
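The batching and bulk-insert points can be sketched together. A minimal SQLite version with a hypothetical `txns` table; one `executemany` and one commit per batch rather than a round-trip per row, plus an index on the column downstream queries filter on:

```python
import itertools
import sqlite3

def batched(iterable, n):
    """Yield lists of up to n items (stdlib itertools.batched exists in 3.12+)."""
    it = iter(iterable)
    while chunk := list(itertools.islice(it, n)):
        yield chunk

def bulk_load(conn: sqlite3.Connection, rows, batch_size: int = 1000) -> int:
    conn.execute("CREATE TABLE IF NOT EXISTS txns (id INTEGER PRIMARY KEY, status TEXT)")
    # Index the field used in downstream WHERE clauses
    conn.execute("CREATE INDEX IF NOT EXISTS idx_txns_status ON txns (status)")
    total = 0
    for chunk in batched(rows, batch_size):
        # One bulk insert + one commit per batch, not per record
        conn.executemany("INSERT INTO txns (id, status) VALUES (?, ?)", chunk)
        conn.commit()
        total += len(chunk)
    return total

conn = sqlite3.connect(":memory:")
loaded = bulk_load(conn, ((i, "ok") for i in range(10_000)), batch_size=1000)
print(loaded)  # → 10000
```

The batch size is the tradeoff the prompt asks Claude to explain: larger batches mean fewer commits and higher throughput, but more re-work after a failure and more memory held per batch.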
Monitoring pipeline runs
Add monitoring to this pipeline. Track:
- Start time, end time, total duration
- Records processed, succeeded, failed
- Records per second throughput
- Any anomalies: unusually high failure rate, throughput slower than expected
Log a summary on completion. Write to a pipeline_runs table for history.
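Putting the monitoring requirements together, a minimal sketch: timing, counts, and throughput written to a `pipeline_runs` history table (the schema here is an assumption, not a standard):

```python
import sqlite3
import time

def run_with_monitoring(conn: sqlite3.Connection, records, process) -> dict:
    """Run `process` over records, then log one summary row to pipeline_runs."""
    conn.execute(
        """CREATE TABLE IF NOT EXISTS pipeline_runs (
               started_at REAL, finished_at REAL, duration REAL,
               processed INTEGER, succeeded INTEGER, failed INTEGER,
               records_per_sec REAL)"""
    )
    started = time.time()
    succeeded = failed = 0
    for rec in records:
        try:
            process(rec)
            succeeded += 1
        except Exception:
            failed += 1
    finished = time.time()
    duration = max(finished - started, 1e-9)  # guard divide-by-zero on tiny runs
    processed = succeeded + failed
    conn.execute(
        "INSERT INTO pipeline_runs VALUES (?, ?, ?, ?, ?, ?, ?)",
        (started, finished, duration, processed, succeeded, failed, processed / duration),
    )
    conn.commit()
    return {"processed": processed, "succeeded": succeeded, "failed": failed}

conn = sqlite3.connect(":memory:")
# Hypothetical process step that fails on every tenth record
summary = run_with_monitoring(conn, range(100), lambda r: 1 / (r % 10))
print(summary)  # → {'processed': 100, 'succeeded': 90, 'failed': 10}
```

Anomaly flagging (the "unusually high failure rate" requirement) is then a query over `pipeline_runs` comparing the latest row against historical averages.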
Pipeline patterns — idempotency, checkpointing, error accumulation, bulk operations — are in the Agent Prompt Playbook. $29.