Data EngineerJuniorpythonpostgresqletl

Make an ETL Pipeline Idempotent

A real data problem you debug end to end in a live cloud workspace, then show on your portfolio. No tutorial, no toy app - a broken system that behaves like production.

Level
Junior
Time
~20 min
Cost
Free

The scenario

The daily order ingestion pipeline (ETL - Extract, Transform, Load) is doubling row counts every time it runs. Operations noticed the orders table had 200 rows after two runs of the same 100-row CSV file. Downstream dashboards now show inflated revenue figures.

The broken code you start with

load_orders.py (re-runs duplicate every row)
cursor.execute("""
    INSERT INTO orders (order_id, customer_id, total_amount)
    VALUES (%s, %s, %s)
""", (row['order_id'], row['customer_id'], row['total_amount']))
# no ON CONFLICT - the next run inserts everything again

What this teaches you

What you did: Swapped INSERT for INSERT...ON CONFLICT DO UPDATE on the order_id (or another unique key). Now running the loader twice with the same CSV produces 100 rows both times - duplicates are upserted to their latest state instead of inserted again.

Why it matters: ETL pipelines MUST be replayable - network glitches, retries, re-runs on schedule all need to be safe. Duplicates compound into cascading errors downstream (inflated revenue, wrong aggregations).

In the real world: Production pipelines design around idempotency from day one - exactly-once delivery is baked in via checksums, watermarks, deduplication tables, or distributed consensus. The rerun cost is so high that safety-first is non-negotiable.

What you'll practice

Why this impresses a hiring manager

On your portfolio, this becomes

Switched from INSERT to UPSERT (INSERT...ON CONFLICT) so re-running the same CSV produces the same row count, not duplicates

Keep going

Rank Rows Per Group With SQL Window FunctionsData projectData roadmapStep by step to hiredData interview questionsSTAR answersAll Data projectsProjects hub

Build this project free

You're in a real cloud workspace in 30 seconds. Fix it, and it lands on your portfolio.

Start this project →