Make an ETL Pipeline Idempotent
A real data problem you debug end to end in a live cloud workspace, then show on your portfolio. No tutorial, no toy app - a broken system that behaves like production.
The scenario
The daily order ingestion pipeline (ETL - Extract, Transform, Load) is doubling row counts every time it runs. Operations noticed the orders table had 200 rows after two runs of the same 100-row CSV file. Downstream dashboards now show inflated revenue figures.
The broken code you start with
cursor.execute("""
INSERT INTO orders (order_id, customer_id, total_amount)
VALUES (%s, %s, %s)
""", (row['order_id'], row['customer_id'], row['total_amount']))
# no ON CONFLICT - the next run inserts everything againWhat this teaches you
What you did: Swapped INSERT for INSERT...ON CONFLICT DO UPDATE on the order_id (or another unique key). Now running the loader twice with the same CSV produces 100 rows both times - duplicates are upserted to their latest state instead of inserted again.
Why it matters: ETL pipelines MUST be replayable - network glitches, retries, re-runs on schedule all need to be safe. Duplicates compound into cascading errors downstream (inflated revenue, wrong aggregations).
In the real world: Production pipelines design around idempotency from day one - exactly-once delivery is baked in via checksums, watermarks, deduplication tables, or distributed consensus. The rerun cost is so high that safety-first is non-negotiable.
What you'll practice
- Spotting why a re-run duplicates records
- Writing an idempotent upsert (ON CONFLICT)
- Making a load safe to retry
Why this impresses a hiring manager
- This is a real python problem teams hit in production - not a synthetic puzzle.
- It shows you can diagnose and fix a Data issue in a live system end to end.
- It lands on your portfolio as a scenario a hiring manager can open and click through.
Switched from INSERT to UPSERT (INSERT...ON CONFLICT) so re-running the same CSV produces the same row count, not duplicates
Keep going
Build this project free
You're in a real cloud workspace in 30 seconds. Fix it, and it lands on your portfolio.
Start this project →