# ETL Pipeline Optimization: A Data Lead's Technical Primer > **Purpose:** Deep technical education on the architectural options considered for the lot+internet ETL optimization task. > **Audience:** Interim Data Lead / CTO preparing for a final interview. > **Problem (precisely stated):** The PostgreSQL car-listings ETL takes **16 hours for the first-time (backfill) load of a large brand** — VW or BMW. The SLA target is **< 4 hours for that same first-time load**. The daily *incremental* run (200K–5M rows/country) is already fast; it is **not** the problem. Getting this distinction right changes which solutions are relevant. > **Revision note (v2):** This version corrects a framing error in v1, which treated the task as general/incremental optimization and over-weighted streaming options that do not speed up a one-time backfill. See §2 for the corrected diagnosis, and §3.12–§3.14 for options v1 omitted. --- ## Table of Contents 1. [Mental Models: What You Must Understand First](#1-mental-models) 2. [The Current Pipeline — What Is Actually Broken and Why](#2-current-pipeline-analysis) R. [Redesign Before Re-Engine — The Missing Axis](#r-redesign-before-re-engine) 3. [The Solution Landscape — Deep Technical Explanations](#3-the-solution-landscape) - [3.1 PostgreSQL Surgical Rewrite — the SLA winner](#31-postgresql-surgical-rewrite--the-sla-winner) - [3.2 PostgreSQL + ClickHouse (OLTP/OLAP Split)](#32-postgresql--clickhouse) - [3.3 Lakehouse: S3 + dbt + DuckDB / Spark](#33-lakehouse-s3--dbt--duckdb--spark) - [3.4 Managed Cloud DWH: BigQuery / Snowflake](#34-managed-cloud-dwh-bigquery--snowflake) - [3.5 CDC-First Event Backbone: Debezium + Kafka / Redpanda](#35-cdc-first-event-backbone) - [3.6 Streaming SQL: Materialize / RisingWave](#36-streaming-sql-materialize--risingwave) - [3.7 Medallion Architecture on Apache Iceberg](#37-medallion-architecture-on-apache-iceberg) - [3.8 Apache Doris / StarRocks](#38-apache-doris--starrocks) - [3.9 Full Cloud Native Stack](#39-full-cloud-native-stack) - [3.10 TimescaleDB for Price History](#310-timescaledb-for-price-history) - [3.11 Two-Horizon Strategy (Quick Win + Platform Build)](#311-two-horizon-strategy) - [3.12 Citus — Distributed / Parallel PostgreSQL](#312-citus--distributed--parallel-postgresql) - [3.13 In-Postgres Columnar Extensions](#313-in-postgres-columnar-extensions) - [3.14 Vertical Scaling](#314-vertical-scaling) 4. [AI & Real-Time Readiness Matrix](#4-ai--real-time-readiness-matrix) 5. [Decision Framework: How to Choose](#5-decision-framework) 6. [Comparative Summary Table](#6-comparative-summary-table) 7. [Glossary](#7-glossary) --- ## 1. Mental Models Before evaluating any solution, you need these foundational concepts locked. Everything else builds on them. --- ### 1.1 OLTP vs OLAP — The Most Important Distinction **OLTP (Online Transaction Processing)** Designed to handle many small, fast read/write operations. Banks processing transactions, e-commerce handling orders, scrapers writing listings. Optimized for: - INSERT / UPDATE / DELETE on individual rows - Concurrent users doing small operations - Data integrity (ACID: Atomicity, Consistency, Isolation, Durability) - Row-oriented storage (explained below) **OLAP (Online Analytical Processing)** Designed for complex queries over large datasets. Analytics dashboards, ML feature extraction, business reporting. Optimized for: - Aggregations (SUM, AVG, COUNT, GROUP BY) over millions/billions of rows - Columnar storage (explained below) - Read-heavy, write-infrequent workloads - Denormalized, wide tables **The critical insight for this pipeline:** The `ads_` table is an 80-column wide analytical mart. It is stored in PostgreSQL — an OLTP engine. You are running an OLAP workload on an OLTP system. This is the architectural root cause of the 16-hour runtime. --- ### 1.2 Row Storage vs Column Storage **Row-oriented storage (PostgreSQL, MySQL)** All columns of a row are stored together on disk: ``` Row 1: [lid=1, make="VW", model="Golf", price=15000, fuel="diesel", mileage=90000, ...] Row 2: [lid=2, make="BMW", model="3er", price=28000, fuel="petrol", mileage=45000, ...] Row 3: [lid=3, make="VW", model="Polo", price=9000, fuel="diesel", mileage=120000, ...] ``` To compute `AVG(price) WHERE make='VW'`, PostgreSQL must: 1. Read every row from disk (all 80 columns) 2. Filter make='VW' 3. Extract price 4. Compute AVG For 80M rows with 80 columns, you read ~80x more data than you actually need. **Column-oriented storage (ClickHouse, BigQuery, Doris)** Each column is stored separately: ``` lid column: [1, 2, 3, ...] make column: ["VW", "BMW", "VW", ...] price column: [15000, 28000, 9000, ...] fuel column: ["diesel", "petrol", "diesel", ...] ``` For `AVG(price) WHERE make='VW'`: 1. Read ONLY the `make` column (small) 2. Read ONLY the `price` column (small) 3. Filter and compute **Compression bonus:** A column of the same data type with similar values compresses extremely well. `fuel` column has ~6 distinct values across 80M rows — it compresses to almost nothing. In ClickHouse, a 5TB table often stores in 500GB. **Rule of thumb:** OLTP operations (upsert 1 row) → row store. Analytical queries (aggregate millions of rows) → column store. --- ### 1.3 MVCC — Why PostgreSQL Upserts Are Expensive **MVCC (Multi-Version Concurrency Control)** is PostgreSQL's mechanism for allowing concurrent reads and writes without locks. When you `UPDATE` a row in PostgreSQL: 1. PostgreSQL does NOT modify the row in place 2. It creates a **new version** of the row (new tuple) with the updated values 3. It marks the old version as "dead" (no longer visible to new transactions) 4. The old version sits on disk until `VACUUM` cleans it up ``` Before UPDATE: [lid=1, price=15000, xmin=100, xmax=null] ← visible After UPDATE: [lid=1, price=15000, xmin=100, xmax=200] ← dead (marked deleted at transaction 200) [lid=1, price=16000, xmin=200, xmax=null] ← new visible version ``` **Why this destroys performance for the pipeline:** The `ON CONFLICT DO UPDATE` in `pipeline.sql` updates ~5M rows per run. Each updated row: - Creates a new row version (doubles disk I/O for that page) - Leaves a dead tuple behind - Adds work for VACUUM to clean up After many daily runs, the table accumulates enormous **bloat** (dead tuples). Queries slow down because they must scan bloated pages. The disk fills up faster than the data growth alone would suggest. **The 80-column UPDATE makes this worse:** PostgreSQL's MVCC stores the entire row for each version. An 80-column row version is large. Compare to a system designed for upserts (Doris, ClickHouse) which either avoids this entirely or handles it through background compaction. > **Scope note (v2):** MVCC update bloat applies to the *daily incremental* run, which updates millions of existing rows. It is **not** a cause of the 16-hour *first-time* load — on the first run the mart is empty, every row is an `INSERT`, and the `DO UPDATE` branch never executes. See §2.6. --- ### 1.4 Batch vs Streaming vs Micro-batch **Batch Processing** Process data in large chunks at scheduled intervals. "Run the ETL once a day at midnight." ``` Midnight: collect all changes from the day → process 200K-5M rows → write to mart ``` - Simplest to implement and reason about - High latency (data can be 24h stale) - Efficient for large sequential I/O - The current pipeline is batch **Streaming Processing** Process data event by event as it arrives. Millisecond to second latency. ``` 10:00:01 - Scraper finds new listing → event emitted → processed immediately 10:00:01.050 - mart updated ``` - Complex to implement correctly (ordering, late data, exactly-once semantics) - Low latency (near real-time) - Enables real-time price alerts, fraud detection, live deduplication - Kafka + Flink / Materialize / RisingWave **Micro-batch** A middle ground. Buffer events for a short window (seconds to minutes) and process as small batches. ``` Every 30 seconds: process all events received in the last 30 seconds ``` - Simpler than pure streaming (batch semantics on small chunks) - Latency: seconds to minutes - Spark Structured Streaming works this way **Why this matters for the interview:** When you talk about moving to streaming, you're not just talking about speed. You're describing a fundamental shift in how the system is architected — from a schedule-driven batch job to an event-driven reactive system. --- ### 1.5 ETL vs ELT **ETL (Extract → Transform → Load)** Transform data BEFORE loading it to the destination. ``` Scraper data → [JOIN, normalize, convert currency] → load into ads_ ``` The current `pipeline.sql` is ETL. The Jinja SQL extracts from source tables, transforms (currency conversion, usage state computation, VIN cleaning), and loads into the mart. **ELT (Extract → Load → Transform)** Load raw data first, then transform inside the destination using SQL. ``` Scraper data → load raw to Bronze → transform to Silver/Gold using SQL (dbt) ``` ELT is enabled by powerful analytical engines (BigQuery, Snowflake, DuckDB, Spark) where running SQL transformations on raw data is fast and cheap. dbt (covered later) is the tool that makes ELT manageable. **Modern default:** ELT, because you preserve the raw data and can re-transform if logic changes. ETL permanently loses the pre-transformation state. --- ### 1.6 The N+1 Problem in ETL Pipelines In SQL, the "N+1 problem" is when you execute 1 query to get a list of N items, then N additional queries for each item. Extremely inefficient. In the current pipeline, the Jinja template creates this exact pattern at the ETL level: ```python # What the Jinja template effectively does: for make_slug in makes: # ~50 makes for portal in portals: # ~5 portals create_temp_table(make_slug, portal) # Query 1: 4-way JOIN recover_prices(make_slug, portal) # Query 2: window functions on 3B rows upsert_to_mart(make_slug, portal) # Query 3: 80-column ON CONFLICT ``` 50 makes × 5 portals = 250 iterations, each with 3 expensive queries = 750 sequential operations. No parallelism. This is the primary driver of the 16-hour runtime. --- ### 1.7 Incremental Processing and the Watermark Pattern The pipeline uses `:last_modified` as a **watermark** — a timestamp that marks "I have processed everything up to this point." ```sql WHERE m.modified > :'last_modified' ``` Only rows in `mapping.lot_*` that changed since the last run are processed. This is correct incremental logic — but the implementation only filters on the mapping table, not on the full JOIN. This means: - Mapping changed for 50K VW listings → we still JOIN against the full `full_auto_mobile` (80M rows) - The filter is applied AFTER the JOIN, not before The fix **for the daily run**: resolve the changed `lid`s from `mapping.lot_*` first and let that small set drive the joins. Note this does **not** help the first-time backfill, where the filter admits the whole brand — see §1.8 and §2.4. --- ### 1.8 Backfill vs. Incremental — Two Different Problems in One Pipeline A single ETL pipeline almost always runs in two regimes, and they have *different* bottlenecks: | | First-time load (backfill) | Daily incremental | |---|---|---| | Trigger | New brand/country, or rebuild after a logic change | Scheduled daily run | | Volume | The entire history of a brand | Only what changed (here: 200K–5M rows/country) | | Write pattern | Pure `INSERT` into an empty table | `UPDATE`-heavy upsert into a populated table | | Dominant cost | Raw scan/join/insert volume; index build; single-threading | Conflict checks; MVCC dead-tuple bloat; vacuum pressure | | Right fixes | Parallelism, bulk-load technique, chunked transactions | Filter push-down, upsert efficiency, vacuum tuning | The lot+internet SLA (16 h → < 4 h) is explicitly stated **for the first-time load**. So the backfill column is the one that matters for the deadline. A solution that only makes the *incremental* path faster — anything built on change-data-capture, for instance — can be excellent engineering and still **miss the assignment**, because CDC has nothing to capture during a one-time historical load. For every option in §3, ask first: *which regime does this actually speed up?* --- ## 2. Current Pipeline Analysis — Corrected Diagnosis Now let's diagnose the actual `pipeline.sql` file with surgical precision. **This Part is substantially rewritten from v1**, which mis-framed the problem. --- ### 2.0 The framing everything depends on: backfill vs. incremental `pipeline.sql` runs in two completely different regimes, and the 16-hour number belongs to only one of them. ```sql SELECT EXISTS(SELECT 1 FROM {{schema}}.ads_{{make_slug}} LIMIT 1) not_first_run\gset SELECT COALESCE(max(modified), '2018-01-01') last_modified FROM {{schema}}.ads_{{make_slug}}\gset ``` - **First run (backfill):** the mart `ads_` is empty, so `max(modified)` is `NULL` and `last_modified` falls back to **`2018-01-01`**. The filter `m.modified > :'last_modified'` then admits the *entire history* of that brand. Every row is an `INSERT` (no conflicts — the table is empty). **This is the 16-hour run.** - **Daily run (incremental):** `last_modified` is the real high-water mark, so only the 200K–5M rows changed since yesterday are admitted, and the `\if :not_first_run` block also runs. This is almost certainly already fast — 16 h for 5M rows would be ≈85 rows/second, absurd for PostgreSQL. **Consequence:** the SLA is about making a *one-time bulk load* faster. Change-data-capture, streaming SQL, and incremental view maintenance do **not** speed up a one-time backfill — they optimize the regime that is already fast. Keep this test in mind for every option in §3. --- ### 2.1 Root cause #1 — the whole job runs single-threaded The hardware is **38 vCPU**. The pipeline is one `psql` process running one script top to bottom; the Jinja `{% for portal %}` loop is *inside* one script per `make_slug`, and brands run one after another. So a 16-hour VW load uses **roughly one of 38 cores** — ~37 sit idle. This workload is *embarrassingly parallel*: - Across brands (≈50 `make_slug`) — for the daily run. - Across portals (≈5) within one brand. - Within one portal, by `lid` range — the join and insert for `lid` 0–1M are independent of `lid` 1M–2M. So even a *single brand's* first-time load can be split across many cores. This one fact — not architecture — is the biggest lever (see §3.1). --- ### 2.2 Root cause #2 — the entire run is one 16-hour transaction ```sql BEGIN; ... 16 hours of work ... COMMIT; ``` A transaction open for 16 hours is independently harmful, regardless of what it does: - **It freezes autovacuum across the whole database.** PostgreSQL cannot vacuum away any dead tuple newer than the oldest running transaction's snapshot — the `xmin` horizon. A 16-hour transaction means 16 hours during which *no table anywhere* in the database can be properly cleaned; bloat accumulates system-wide. - **Catastrophic failure mode.** A failure at hour 15 rolls back everything — no partial progress, no resume point. - **WAL and lock pressure.** All the work's WAL must be retained until commit; locks on the temp objects and target table are held the whole time. The fix is not only speed — it is to **chunk the work into many short transactions** (per portal, or per `lid` range), each committing independently, so progress is durable and autovacuum is never starved. --- ### 2.3 Root cause #3 — per-row user-defined function calls The SELECTs call user-defined functions **once per row**, over millions of rows: | Function | Where | Calls/row | |---|---|---| | `convert_currency_by_counties(...)` | raw temp table, the `INSERT…SELECT`, the stat-refresh CTE | up to 3× | | `raw.clean_vin(ads.vin)` | raw temp table | 1× | | `raw.clean_car_plate(ads.car_plate)` | raw temp table | 1× | | `{{schema}}.compute_usage_state(...)` | raw temp table | 1× | | `pg_temp.jsonb_to_array(...)` | raw temp table (seals + equipment) | 2× | If `convert_currency_by_counties` or `compute_usage_state` is PL/pgSQL performing an internal lookup query, that is **a query per row** — tens of millions of nested queries in a backfill. This cost is invisible in the SQL text and is a prime suspect for "where did the 16 hours go." **Profile it first** (`EXPLAIN (ANALYZE, BUFFERS)`, `track_functions = all`, `pg_stat_user_functions`). Remedy: mark pure functions `IMMUTABLE PARALLEL SAFE` so the planner can inline/cache them; rewrite PL/pgSQL lookups as set-based joins (e.g. join a currency-rate table once instead of calling a function per row). *(v1 of this document did not mention these functions at all — an omission.)* --- ### 2.4 Root cause #4 — the 4-way join, done in full on the backfill ```sql FROM full_auto_{{portal}} ads JOIN cust_auto_{{portal}} cust ON cust.acc_id = ads.acc_id JOIN stat_auto_{{portal}} stat ON stat.lid = ads.lid JOIN mapping.lot_{{make_slug}} m ON m.lid = ads.lid WHERE m.model IS NOT NULL AND m.article_type IN ('car','bus') AND m.modified > :'last_modified' AND m.country = ... ``` Only `mapping.lot_*` is filtered by `modified`. On the **daily** run that filter is highly selective, and the right structure is to resolve the changed `lid`s from `mapping.lot_*` first and let that small set *drive* the joins (a pushed-down semi-join) instead of joining full tables and filtering afterwards. On the **first-time** run, however, `m.modified > '2018-01-01'` selects *the whole brand*. No filter makes the join small — you genuinely need every VW row joined to its cust/stat/mapping. The backfill join is large by nature. The lever here is therefore **parallelism and join strategy**, not filter push-down: ensure hash joins with adequate `work_mem` (so they don't spill onto the SATA SSDs) and split the join by `lid` range across workers. --- ### 2.5 Root cause #5 — the 3-billion-row price-recovery scan ```sql CREATE TEMP TABLE prices_{{make_slug}}_{{portal}} AS SELECT lid, price_last_new, price_first_new FROM ( SELECT row_number() OVER price_w_lv num, lid, analytic.first(price) OVER price_w_lv price_last_new, analytic.first(price) OVER price_w_fv price_first_new FROM prices_auto_{{portal}} WHERE lid = ANY (SELECT lid FROM raw_ads_{{make_slug}}_{{portal}} WHERE ...) WINDOW price_w_lv AS (PARTITION BY lid ORDER BY dt DESC ...), price_w_fv AS (PARTITION BY lid ORDER BY dt ASC ...) ) dt_ WHERE num = 1; ``` This recovers `price_first` / `price_last` from the 3-billion-row `prices_auto_*` history when they are missing in `stat`. It runs **once per (make, portal)**, is never cached between iterations, and computes *both* window aggregates for *every* row only to discard all but `num = 1` — wasteful. Two fixes, in increasing order of ambition: 1. **Pre-aggregate once.** Build `price_summary(lid, price_first, price_last, price_min, price_max)` in a single pass per portal per day, then the pipeline just *looks it up*. TimescaleDB Continuous Aggregates (§3.10) automate the incremental maintenance of exactly this table. 2. **Eliminate the need.** The whole subquery is a *fallback for missing data*. The better question: why are `price_first`/`price_last` `NULL` in `stat` at all? If `stat` were maintained correctly upstream (or via a small trigger as prices arrive), the 3-billion-row recovery scan would not need to exist. **Fix the source, don't optimize the workaround.** --- ### 2.6 What is NOT a root cause of the 16 hours: MVCC update bloat v1 of this document claimed the 80-column `ON CONFLICT DO UPDATE` causes MVCC dead-tuple bloat that drives the 16-hour runtime. **For the first-time load this is wrong.** On the first run the target table is empty, so every row takes the `INSERT` path; the `DO UPDATE` branch is never executed — zero conflicts, zero dead tuples. MVCC update bloat is real, but it is a problem of *sustained daily operation*, where the daily run updates millions of existing rows and leaves dead tuples behind. It degrades the daily pipeline and analytics queries over weeks; it does not explain the backfill number. The two regimes have different bottlenecks and must be reasoned about separately. What *does* make the first-time bulk `INSERT` slow: - **Index maintenance.** Every inserted row updates every index on `ads_` (at minimum the unique index on `(lid, make)` that `ON CONFLICT` requires). Bulk-load best practice: drop non-essential indexes, load, rebuild them in parallel afterwards — or load into a fresh partition and `ATTACH` it. - **WAL volume.** Every insert is logged; a multi-million-row load generates a large WAL. Loading into partitions and using `COPY` reduces overhead. - **Single-process insert** — see §2.1. --- ### 2.7 The configuration and hardware angle v1 ignored A 16-hour job is governed as much by `postgresql.conf` as by SQL: | Setting | Why it matters for a backfill | |---|---| | `max_parallel_workers`, `max_parallel_workers_per_gather` | Whether `CREATE TABLE AS SELECT` and big joins can use multiple cores at all. | | `work_mem` | Too low → hash joins and sorts spill to disk. On **SATA** SSD (not NVMe) that spill is slow. | | `maintenance_work_mem` | Speed of `CREATE INDEX` after a bulk load. | | `shared_buffers`, `effective_cache_size` | How much of the working set stays in RAM. | | `temp_buffers` | Temp-table performance — the pipeline leans heavily on temp tables. | | `wal_compression`, `max_wal_size`, checkpoint settings | WAL/checkpoint pressure during a write-heavy load. | Two hardware facts the test deliberately includes: - **64 GB RAM is small** relative to the data. One country alone is ~5 TB; 64 GB means heavy reliance on disk. - **SATA SSD, not NVMe.** The box is I/O-bound. And 10 × 3.84 TB in RAID-10 ≈ **19 TB usable**, while 15 countries × 5 TB = 75 TB — the data does **not** fit one server, so the quoted hardware is **per-country (or per-cluster)**. That confirms parallelism happens *within* one 38-vCPU box, and makes **vertical scaling** (more RAM, NVMe) a legitimate, cheap option in its own right (see §3.14). --- ### 2.8 Minor code smells worth a mention - `UPDATE raw.ads_{{make_slug}}` in the `\if :not_first_run` block hardcodes schema `raw.`, while the rest of the script uses `{{schema}}.ads_{{make_slug}}`. Either `raw` is always the schema (then the parameter is misleading) or this is a latent bug. - `NOT lid IN (SELECT lid FROM …)` in the same block: `NOT IN` against a subquery returns *no rows* if the subquery yields a single `NULL`. `NOT EXISTS` is both safer and usually faster. - The `raw_stat` block uses `UNION` (which sorts/deduplicates) where `UNION ALL` is almost certainly correct and cheaper, since portals are disjoint by `lid`. --- ## R. Redesign Before Re-Engine **The axis the test brief did not suggest — and the one with the most leverage.** ### R.0 Why this Part exists The test brief offers four directions: rewrite on PostgreSQL; split OLTP/OLAP; move to a lakehouse; adopt a cloud DWH. Every one is a statement about **which storage engine**. Part 3 of this document answers that question fourteen ways — but all fourteen accept the brief's hidden premise: that the problem is *where the data lives*. It is not. The 16-hour runtime is produced by **how the pipeline is designed**. The highest-leverage fixes change the *design*, and they are *engine-agnostic* — they would help on PostgreSQL, ClickHouse, or anything else, because they remove work rather than relocating it. The correct sequence is **redesign first, choose an engine last — if at all.** ### R.1 Two different verbs: *optimize* vs. *redesign* - **Optimize** — make the existing shape run faster. Parallelize the loop, raise `work_mem`, add an index. The shape is unchanged; you remove friction from it. *All of Part 3 is optimization* — including the engine swaps, because moving an unchanged shape onto a faster engine is still optimization. - **Redesign** — change the shape so the expensive work no longer exists. Don't speed up the 4-way join — arrange things so it does not run. Don't speed up the 3-billion-row scan — maintain its result so nothing is ever scanned. Optimization has a hard ceiling: the shape itself. Redesign moves the ceiling. The 16-hour job is a *symptom*; the four choices below are its *generators*. Optimizing treats symptoms; redesigning removes generators. ### R.2 The four design flaws that generate the 16 hours #### Flaw 1 — One pipeline doing two incompatible jobs **Evidence.** The script branches on `not_first_run`, `last_modified` defaults to `2018-01-01` on the first run, and a `\if :not_first_run` block bolts on extra logic. One Jinja template serves both the *first-time historical seed* and the *daily delta*. **Why it's a flaw.** These are different problems: the seed is a pure `INSERT` of five years of history that should be resumable and non-blocking; the sync is an `UPDATE`-heavy delta that needs conflict logic and a minutes-level SLA. Forcing both through one template means the seed inherits upsert machinery it never uses, and neither path is designed *for its own characteristics*. **Redesign.** Two programs. A `seed` job — bulk, parallel, `COPY`-based, partition-`ATTACH`, insert-only, **resumable**, run as a background task. A `sync` job — small delta, genuinely SLA'd, daily. Once `seed` is a resumable background job, the "<4h SLA on the backfill" largely dissolves: a one-time historical load does not need a 4-hour SLA — it needs to not lose 16 hours of progress on failure and not block the daily pipeline. #### Flaw 2 — "One Big Table": the wrong data model **Evidence.** `INSERT INTO ads_{{make_slug}}` lists ~80 columns spanning listing, customer, stats, price history, and mapping. A *second* code block (`\if :not_first_run` → `UPDATE raw.ads_*`) exists solely to refresh stat columns when stats changed but mapping did not. **Why it's a flaw.** The 80 columns have wildly different change rates — dealer/customer data changes monthly, mapping occasionally, price/stats constantly. Folding them into one physical row means every stat change rewrites a whole 80-column row and every index on it. Both code paths — the main upsert *and* the separate stat-refresh dance — are re-denormalization logic the OBT model forces on you. **Redesign.** Dimensional modeling (Kimball). A narrow **fact** table (listing core + foreign keys), **dimension** tables for customer/dealer (a slowly-changing dimension) and mapping, and a separate **price/stats fact**. The wide 80-column shape becomes a **view** over the model, or a serving-layer projection. A stat change now touches one narrow fact row. The 80-column upsert disappears as a category — not optimized, *eliminated*. Neither the brief nor Part 3 mentions dimensional modeling; it is the single biggest omission. #### Flaw 3 — Recompute instead of precompute **Evidence.** Every run, per row, the SELECTs call `clean_vin`, `clean_car_plate`, `compute_usage_state`, `convert_currency_by_counties` (×3 across the script), `jsonb_to_array` (×2), plus per-row `jsonb_set` / `jsonb_build_object`. The `prices_*` subquery scans `prices_auto_*` (3B rows) with window functions to recover `price_first`/`price_last`. **Why it's a flaw.** These are deterministic functions of source data. A first-time load recomputes *five years* of VIN-cleaning, currency conversion, and usage-state derivation in one window — work that should have been done incrementally, once, as each row was born. **Redesign — shift-left.** Move deterministic transforms to *write time*: - `clean_vin(vin)`, `clean_car_plate`, converted price, `compute_usage_state` → **stored generated columns** (or trigger-maintained columns) on the source tables, computed once at scrape/map time. - `prices_auto_*` is an **event log** — don't scan it, *fold* it. Maintain a `price_state(lid, first_price, first_dt, last_price, last_dt, price_min, price_max)` projection, updated by a trigger (or a streaming consumer) as each price event arrives. First/last price becomes an O(1) lookup, forever. (TimescaleDB Continuous Aggregates, §3.10, are one packaged way; a plain trigger is another.) After shift-left, the mart build is a near-pure join+project — little is left to compute, so the backfill is fast *by construction*. #### Flaw 4 — Analytical compute, row-by-row, inside an OLTP engine **Evidence.** The transformation is expressed as SQL executed by PostgreSQL — row-at-a-time execution, a single process, MVCC overhead, intermediate `TEMP TABLE` round-trips. **Why it's a flaw.** Joining and transforming millions of wide rows is an *analytical* (scan-heavy, vectorizable) workload; PostgreSQL's executor is built for transactional row access. This is a tool-fit problem — **but the fix is not necessarily a new database.** **Redesign — separate compute from storage.** Keep all data in PostgreSQL (storage, system of record, consumer interface). Run the *transform step* in a **vectorized, multi-core, columnar engine that is a library, not a server**: - **DuckDB** — in-process analytical SQL. It `ATTACH`es a PostgreSQL database directly, runs the join+transform across all cores with vectorized execution and no MVCC, and bulk-writes the result back. The existing logic is SQL, so the port is SQL→SQL — minimal rewrite. - **Polars** — Rust dataframe library, multi-threaded, larger-than-RAM streaming engine; a better fit if you'd rather express transforms as dataframes. This delivers the OLAP speed-up with **zero new infrastructure, zero data migration, zero new operational surface**. It is distinct from "tune Postgres SQL" (Flaw 4 stays) and from "adopt ClickHouse" (a server to run, secure, sync, and migrate consumers onto). It is the most under-rated option in this whole analysis. > **DuckDB-as-compute is *not* §3.13.** §3.13 (`pg_duckdb`) embeds DuckDB *inside* Postgres to accelerate *queries*. The pattern here uses DuckDB as an *external ETL compute engine* for the *transform step*. Related tools, different role. ### R.3 Two domain-specific sources of pure waste Beyond the four flaws, the car-listings domain reveals work the pipeline does that has no value: - **Reprocessing terminal-state entities.** A delisted/sold listing will never change again — yet a backfill reprocesses it like any other. **Freeze terminal entities**: once a listing reaches a terminal state, write it to an immutable partition and never touch it again. Only *active* listings (~80M) are live. - **Reprocessing cold data.** If consumers query mostly the last ~12 months, backfilling 5 years of price history into the hot mart is largely wasted I/O. **Tier the data**: hot (recent, in the mart) vs. cold (archived, cheap storage, queried rarely). A smaller hot set is a smaller, faster backfill. Both are "do less," not "do it faster" — the purest form of redesign. ### R.4 The redesign pattern catalogue Engine-agnostic patterns, in rough order of leverage-per-effort: 1. **Split seed from sync** — two programs, two SLAs. 2. **Shift-left** — precompute deterministic transforms at write time (generated / trigger columns). 3. **Event-fold the log** — maintain `price_state` instead of scanning `prices_auto_*`. 4. **Compute ≠ storage** — vectorized engine (DuckDB / Polars) for the transform; Postgres stays storage. 5. **Set-based, not looped** — eliminate the per-`(make, portal)` loop; one job the planner parallelizes. 6. **Dimensional model** — fact + dimensions + a compatibility view, replacing the 80-column OBT. 7. **Append-only fact + view-of-latest** — insert-only beats upsert; "current" is a view. 8. **Freeze terminal entities / tier cold data** — stop reprocessing data that will not change or is not read. 9. **Resumability & checkpointing** — make the seed restartable; a 16h job that loses all progress on failure is an operational defect independent of speed. Pattern 5 deserves a note: the per-brand loop exists only because the *source schema* is sharded into per-brand `mapping.lot_` tables and per-portal `full_auto_` tables. The loop is an artefact of that sharding, not a requirement. Consolidating `mapping.lot_*` into one native-partitioned table lets the ETL be a single set-based job that PostgreSQL parallelizes internally — strictly better than scripting 250 iterations. Fixing the *source schema* is itself a redesign lever. ### R.5 The Redesign-First plan | Stage | Window | Cost | Move | Outcome | |---|---|---|---|---| | 0 | Week 1 | €0 | Instrument & measure — find which flaw dominates | Evidence, not guesses | | 1 | 2–4 wk | €0 | Split seed/sync; transform in DuckDB; load via partition-`ATTACH` | **Backfill < 4h** — structurally, not by tuning | | 2 | 4–8 wk | €0 | Shift-left transforms; event-fold price log; freeze terminal listings | Backfill becomes trivial; recurring cost collapses | | 3 | 2–3 mo | €0 (in Postgres) | Dimensional remodel behind a compatibility view | 80-column upsert eliminated; clean ML substrate | | 4 | later, **only if measured** | €0–800/mo | Columnar serving layer — default in-Postgres columnar, escalate to ClickHouse/Doris **only on evidence** | Fast analytics — *if* a real need is proven | Stages 1–3 are €0, stay inside PostgreSQL, and break no consumer. They fix the actual problem. Stage 4 — the only stage that is "pick an engine" — is **contingent and last**. ### R.6 Why engine choice should be the last, smallest decision If you remodel dimensionally (Flaw 2) and shift computation left (Flaw 3) *first*, then by the time you reach "which engine": - the data model is clean, so projecting it into any columnar store is trivial; - the transform is cheap, so the compute engine barely matters; - you have **measurements** of real consumer load, so the choice is evidence-based, not aspirational; - the decision is small, late, and low-risk — a bolt-on, not a migration. Choosing the engine *first* — the brief's framing, and Part 3's — inverts this: you commit to a destination before you know the route, and you migrate a *broken model* into a new system, preserving the flaws and adding a second platform to operate. **Redesign makes engine choice easy or unnecessary; re-engining first makes redesign harder.** ### R.7 The same redesign also wins the AI argument A dimensional model, an event-folded price history, and shift-left derived columns are not just faster — they are the *correct substrate for ML*. Conformed facts and dimensions are what feature engineering consumes; an event log with a maintained projection is exactly how you build point-in-time-correct training data; precomputed derived columns are reusable features. The redesign that fixes the 16-hour backfill is the *same* redesign that makes the platform AI-ready. "Bolt on ClickHouse" does neither. --- ## 3. The Solution Landscape --- ### 3.1 PostgreSQL Surgical Rewrite — the SLA winner **What it is:** Keep everything in PostgreSQL. Fix the pipeline's *execution model* and the database configuration. This is the option that actually wins the stated SLA, and for the backfill problem it is very likely *sufficient on its own*. The changes, in order of impact on the **first-time load**: #### (a) Parallelize — the single biggest lever The job uses ~1 of 38 cores (§2.1). Parallelize on three axes: - **Daily run:** across `make_slug` (≈50) and portal (≈5) — a 250-cell grid, trivially parallel. - **First-time load of one brand:** across portals, and within a portal across **`lid` ranges** (`lid` 0–1M, 1M–2M, …). Each range is an independent join + insert. Drive it from an orchestrator (Airflow, Prefect, Dagster) or a Python `ProcessPoolExecutor`, capped at ~24–30 workers so the box keeps I/O headroom. Realistic gain on a 38-vCPU host: **8–12×**. ```python from concurrent.futures import ProcessPoolExecutor units = [(make, portal, lo, hi) for make in makes for portal in portals for lo, hi in lid_ranges(make, portal)] with ProcessPoolExecutor(max_workers=28) as ex: ex.map(run_chunk, units) # each unit is its own short transaction ``` #### (b) Chunk the 16-hour transaction Each parallel unit must be **its own short transaction** that commits independently (§2.2). Progress becomes durable (a failure resumes from the last committed chunk), autovacuum is never starved by a long-lived `xmin` horizon, and locks/WAL are released continuously. Never wrap a multi-hour load in one `BEGIN/COMMIT`. #### (c) Bulk-load discipline for the backfill On the first-time load every row is an `INSERT`. Treat it as a bulk load: - Load into a **fresh partition** (or standalone table) with **no secondary indexes**, then `CREATE INDEX` in parallel afterwards, then `ATTACH PARTITION`. Index-after-load is dramatically faster than maintain-during-load. - Use **`COPY`** (or `INSERT … SELECT` into an `UNLOGGED` staging table) rather than row-by-row inserts. - Raise `maintenance_work_mem` for the post-load index build. #### (d) Eliminate the 3B-row price scan Pre-aggregate `prices_auto_*` once per day into `price_summary(lid, price_first, price_last, price_min, price_max)` and have the pipeline *look it up* instead of recomputing window functions per (make, portal). TimescaleDB Continuous Aggregates (§3.10) automate this; a plain nightly materialized table works too. Better still, fix the upstream `NULL`s so the recovery is rarely needed (§2.5). #### (e) Audit and fix the per-row UDFs Profile `convert_currency_by_counties`, `clean_vin`, `clean_car_plate`, `compute_usage_state` (§2.3). Mark pure ones `IMMUTABLE PARALLEL SAFE`; rewrite any PL/pgSQL that runs a lookup query per call as a **set-based join** done once. This can be a large, invisible win. #### (f) Partition the big tables Range-partition `prices_auto_*` by month and `ads_` by country (or month). Partition pruning shrinks scans; per-partition parallel load and index builds become possible; old partitions can be compressed or detached. ```sql CREATE TABLE prices_auto_mobile (lid BIGINT, price DECIMAL, dt DATE, ...) PARTITION BY RANGE (dt); CREATE TABLE prices_auto_mobile_2024_01 PARTITION OF prices_auto_mobile FOR VALUES FROM ('2024-01-01') TO ('2024-02-01'); ``` #### (g) Tune `postgresql.conf` Set `work_mem` high enough that brand-sized hash joins stay in RAM (spilling to SATA SSD is the slow path), raise `max_parallel_workers(_per_gather)`, size `shared_buffers` / `effective_cache_size` / `temp_buffers` for the box, and tune checkpoint/WAL settings for a write-heavy load (§2.7). #### (h) For the daily run only: fix the upsert path Resolve changed `lid`s from `mapping.lot_*` first and let them drive the joins; replace `ON CONFLICT DO UPDATE` with a `COPY`-to-staging then a targeted `UPDATE … FROM` so only genuinely changed rows are written (less MVCC bloat); ensure autovacuum keeps up. ```sql CREATE UNLOGGED TABLE staging_ads_vw (LIKE ads_volkswagen); COPY staging_ads_vw FROM '/tmp/new_ads_vw.csv'; UPDATE ads_volkswagen t SET price = s.price, model = s.model -- ...all columns FROM staging_ads_vw s WHERE t.lid = s.lid AND (t.ads_modified <> s.ads_modified OR t.mapping_modified <> s.mapping_modified); INSERT INTO ads_volkswagen SELECT * FROM staging_ads_vw s WHERE NOT EXISTS (SELECT 1 FROM ads_volkswagen WHERE lid = s.lid); ``` **Expected result:** parallelism (8–12×) alone takes 16 h → ~1.5–2 h; the price pre-aggregation and UDF fixes add more margin. **The < 4 h SLA is met comfortably, at $0 additional cost.** **Summary:** - Cost: $0 (existing hardware) + ~1 engineer for 2–4 weeks - Timeline: 2–4 weeks - Risk: Low — same system, same schema, same consumers; fully reversible (keep the old script behind a flag) - Ceiling: Wins the SLA and absorbs ~30% growth with partitioning. Does **not** by itself fix the OLTP-on-OLAP mismatch for analytics consumers — that is Horizon 2. --- ### 3.2 PostgreSQL + ClickHouse #### What is ClickHouse? ClickHouse is an open-source columnar database created by Yandex (2016) for web analytics at billion-event scale. It is now the most widely deployed open-source OLAP database. **Core design choices:** **1. Column-oriented storage (MergeTree)** Data is stored in "parts" — sorted, compressed columnar files on disk. When new data is inserted, it lands in a new part. Background merge threads continuously merge smaller parts into larger, sorted ones. ``` Table: ads_volkswagen Part 001: lid[1..1000], make["VW","VW",...], price[15000,16000,...] — compressed Part 002: lid[1001..2000], make["VW","VW",...], price[14000,17000,...] After merge: Part 001-002: lid[1..2000], make["VW","VW",...], price[14000,15000,...] — better sorted, better compressed ``` **2. Vectorized query execution** Instead of processing one row at a time (PostgreSQL), ClickHouse processes 8192 rows at once (a "vector") using CPU SIMD instructions. This gives 10-100x CPU efficiency for aggregations. **3. The ReplacingMergeTree engine — the upsert solution** ```sql CREATE TABLE ads_volkswagen ( lid UInt64, make String, model String, price Float64, modified DateTime, -- ... all 80 columns ) ENGINE = ReplacingMergeTree(modified) ORDER BY (make, lid); ``` When you `INSERT` a row with the same `ORDER BY` key (make, lid) as an existing row: - ClickHouse does NOT immediately deduplicate - Both versions coexist as separate rows - Background merge eventually deduplicates, keeping the row with the highest `modified` value - For immediate consistency: add `FINAL` to queries: `SELECT * FROM ads_volkswagen FINAL` **Why this matters:** No `ON CONFLICT DO UPDATE`. No MVCC dead tuples. You just INSERT — deduplication happens asynchronously in the background. **ClickHouse query performance:** For `SELECT make, AVG(price), COUNT(*) FROM ads_volkswagen GROUP BY make`: - PostgreSQL (row store, 80M rows): scans all 80 columns × 80M rows ≈ minutes - ClickHouse (column store, 80M rows): reads only `make` column + `price` column ≈ seconds #### How the OLTP/OLAP split works in practice ``` ┌─────────────────────────────┐ │ Scrapers (write) │ └──────────────┬──────────────┘ │ INSERT/UPDATE ┌──────────────▼──────────────┐ │ PostgreSQL (OLTP source) │ │ - full_auto_* │ │ - cust_auto_* │ │ - stat_auto_* │ │ - mapping.lot_* │ └──────────────┬──────────────┘ │ Daily ETL (fixed pipeline) ┌──────────────▼──────────────┐ │ ClickHouse (OLAP) │ │ - ads_volkswagen │ │ - ads_bmw │ │ - ads_mercedes │ └──────────┬──────────┬────────┘ │ │ ┌───────────────▼─┐ ┌───▼────────────────┐ │ Analytics Dash │ │ ML Feature Ext. │ └─────────────────┘ └────────────────────-┘ ``` Postgres consumers (applications that query `ads_` with standard SQL) can still query Postgres directly. ClickHouse is added for the heavy analytics and ML workloads that are currently making Postgres slow. #### ClickHouse + PostgreSQL sync options **Option A: ClickHouse PostgreSQL Table Engine (simple)** ```sql -- In ClickHouse, create a table that reads from Postgres: CREATE TABLE pg_ads_vw ENGINE = PostgreSQL( 'postgres-host:5432', 'db', 'ads_volkswagen', 'user', 'password' ); -- Then materialize locally: INSERT INTO ads_volkswagen_local SELECT * FROM pg_ads_vw WHERE modified > yesterday(); ``` Works for daily batch sync. Not suitable for real-time. **Option B: Debezium CDC → Kafka → ClickHouse (real-time)** Covered in detail in Section 3.5. **Summary:** - Cost: $0 (self-hosted on existing hardware) to $400–800/month (ClickHouse Cloud) - Timeline: 4–8 weeks - Risk: Medium (new system, two systems to maintain) - Ceiling: Solves analytics reads permanently; write-side still needs Option 3.1's fixes --- ### 3.3 Lakehouse: S3 + dbt + DuckDB / Spark #### What is a Data Lakehouse? The term was coined by Databricks in 2020. It combines: - **Data Lake:** cheap object storage (S3, GCS, Azure Blob), schema-on-read, raw data at any format - **Data Warehouse:** ACID transactions, schema enforcement, fast queries, governed access Before lakehouses, companies had to choose: - Store raw data in a lake (cheap, unstructured, hard to query) — OR — - Move data to a warehouse (expensive, structured, queryable) The lakehouse adds a **table format layer** (Apache Iceberg, Delta Lake, Hudi) on top of object storage that provides: - ACID transactions (safe concurrent reads and writes) - Schema enforcement and evolution - Row-level deletes and updates (S3 is otherwise append-only) - Time travel (query historical snapshots) #### What is dbt (data build tool)? dbt is a transformation framework. It does ONE thing: run `SELECT` statements and materialize the results as tables or views. It handles all the DDL (CREATE TABLE, INSERT, MERGE) automatically. **Why dbt matters:** Before dbt, transformation logic lived in complex SQL scripts, Python scripts, or stored procedures — hard to test, version, or document. dbt treats transformations as code: version-controlled, testable, self-documenting. **dbt model example (replacing the pipeline.sql for VW):** ```sql -- models/silver/ads_volkswagen_de.sql {{ config( materialized='incremental', unique_key='lid', on_schema_change='append_new_columns' ) }} SELECT ads.lid, ads.portal, ads.price, m.make, m.model, m.fuel, {{ convert_currency('de', 'm.currency', 'ads.price', 'stat.last_visible') }} AS price_converted, stat.first_visible, stat.last_visible, GREATEST(ads.modified, m.modified, stat.modified) AS modified FROM {{ source('raw', 'full_auto_mobile') }} ads JOIN {{ source('mapping', 'lot_volkswagen') }} m ON m.lid = ads.lid JOIN {{ source('raw', 'stat_auto_mobile') }} stat ON stat.lid = ads.lid JOIN {{ source('raw', 'cust_auto_mobile') }} cust ON cust.acc_id = ads.acc_id {% if is_incremental() %} -- On incremental runs, only process rows modified since last run WHERE GREATEST(ads.modified, m.modified, stat.modified) > ( SELECT MAX(modified) FROM {{ this }} ) {% endif %} ``` **Key dbt concepts:** - `{{ source(...) }}` — references a raw source table - `{{ this }}` — refers to the model's own table (for incremental logic) - `is_incremental()` — true on incremental runs, false on first run (full load) - `materialized='incremental'` — tells dbt to MERGE new rows into the existing table - `unique_key='lid'` — dbt uses this for the MERGE ON condition dbt then generates and runs the SQL, handles the CREATE TABLE/MERGE/INSERT logic, runs tests, and generates documentation automatically. **dbt transformations replace the Jinja SQL** in the current pipeline — cleaner, testable, version-controlled. #### What is DuckDB? DuckDB is an in-process analytical SQL database — think "SQLite, but for analytics." It runs embedded inside a Python or R process. No server to deploy. No configuration. **DuckDB's key capability for this use case:** ```python import duckdb conn = duckdb.connect() # DuckDB can query Parquet files on S3 directly — no data loading result = conn.execute(""" SELECT make, AVG(price) as avg_price, COUNT(*) as count FROM read_parquet('s3://lot-internet-data/gold/ads_volkswagen/dt=2024-01-*/**.parquet') WHERE country = 'DE' GROUP BY make ORDER BY count DESC """).df() ``` DuckDB can: - Read Parquet, CSV, JSON from local disk or S3 - Run complex SQL (window functions, CTEs, JOINs) efficiently - Handle 100s of GB on a single machine (with 64GB RAM, you can process 500GB+ comfortably) - Write results back to Parquet For the lot+internet pipeline, DuckDB running on the existing 38-vCPU machine could replace Spark for processing individual countries — much simpler to operate. **DuckDB vs Spark:** | Aspect | DuckDB | Apache Spark | |---|---|---| | Architecture | Single process, embedded | Distributed cluster | | Data scale | Up to ~500GB RAM efficiently | Petabytes (horizontal scale) | | Setup complexity | `pip install duckdb` | Cluster management, YARN/K8s | | Cost | Free | Cluster cost ($500–$3,000/month EMR) | | Use case | One country at a time | Multiple countries simultaneously | | Learning curve | SQL knowledge sufficient | Spark/Scala/PySpark expertise | For 5TB/country with 64GB RAM: DuckDB with parallel country processing (one DuckDB per country, 15 parallel) is likely sufficient and far simpler. #### What is Apache Spark? Spark is a distributed data processing framework. Unlike DuckDB (one machine), Spark splits work across a cluster of machines. **Core concepts:** - **Driver:** The coordinator — parses your code, creates execution plan, monitors workers - **Executor:** Worker nodes — do actual computation on partitions of data - **RDD (Resilient Distributed Dataset):** Spark's core abstraction — an immutable distributed collection of objects - **DataFrame:** A higher-level abstraction over RDD, with SQL-like API - **Partition:** A chunk of data assigned to one executor. More partitions = more parallelism ```python # PySpark example: the pipeline rewritten from pyspark.sql import SparkSession from pyspark.sql.functions import * spark = SparkSession.builder.appName("etl").getOrCreate() # Read from S3 Bronze layer full_auto = spark.read.parquet("s3://lot-internet/bronze/full_auto_mobile/") mapping = spark.read.parquet("s3://lot-internet/bronze/mapping_lot/") # Filter incremental last_modified = get_last_modified_watermark() changed = mapping.filter(col("modified") > last_modified) # Join (distributed across cluster) result = changed \ .join(full_auto, "lid") \ .join(stat_auto, "lid") \ .join(cust_auto, "acc_id") \ .withColumn("price_converted", udf_convert_currency("country", "currency", "price", "last_visible")) # Write to Iceberg Gold table result.writeTo("iceberg.gold.ads_volkswagen") \ .option("merge-schema", "true") \ .createOrReplace() ``` **For this use case:** Spark adds significant complexity. Unless you're processing all 15 countries simultaneously on a cluster, DuckDB is simpler and sufficient. **Summary:** - Cost: S3 ~$25/TB/month ($375/month for 15TB) + compute (DuckDB: $0, Spark: $500–2,000/month) - Timeline: 3–5 months (full rewrite) - Risk: High (full pipeline rewrite) - Ceiling: Excellent long-term; best for AI/ML (time travel, schema evolution) --- ### 3.4 Managed Cloud DWH: BigQuery / Snowflake #### What is BigQuery? Google BigQuery is a fully managed, serverless data warehouse. You write SQL, Google runs it — no clusters, no configuration, no capacity planning. **How BigQuery works internally (Dremel architecture):** - Data stored in Google's Capacitor columnar format in their Colossus distributed filesystem - Query execution distributed across thousands of Google servers (Dremel) - Separation of compute and storage — query scale is completely elastic **BigQuery pricing model:** - Storage: $20/TB/month (active data), $10/TB/month (long-term, untouched 90+ days) - Queries: $5/TB scanned — **you pay for data read, not compute time** **BigQuery MERGE (solves the ON CONFLICT problem):** ```sql MERGE `project.dataset.ads_volkswagen` T USING (SELECT * FROM staging_ads_vw) S ON T.lid = S.lid AND T.make = S.make WHEN MATCHED AND ( T.ads_modified != S.ads_modified OR T.mapping_modified != S.mapping_modified ) THEN UPDATE SET T.price = S.price, T.model = S.model, -- ... all columns WHEN NOT MATCHED THEN INSERT (lid, make, model, price, ...) VALUES (S.lid, S.make, S.model, S.price, ...); ``` BigQuery runs this MERGE across thousands of workers. 5M rows: 2–5 minutes. No MVCC. No bloat. No 16 hours. **BigQuery cost calculation for this pipeline:** - Incremental run: 200K–5M rows ≈ 1–2GB scanned per make ≈ $0.005–$0.01 per run - Daily run across 50 makes × 15 countries = 750 runs ≈ $3.75–$7.50/day ≈ $112–$225/month in query costs - Storage: 5TB × 15 countries = 75TB × $20 = $1,500/month - **Total estimate: ~$1,600–$1,700/month** However: full table scans for analytics are expensive. `SELECT AVG(price) FROM ads_volkswagen` scans the whole table → $0.375 per scan. High-frequency dashboards can spike costs unexpectedly. #### What is Snowflake? Snowflake is a cloud data warehouse with a unique "multi-cluster shared data" architecture: - **Storage layer:** Your data in S3/GCS/Azure Blob (Snowflake manages it, but it's in your cloud) - **Compute layer:** "Virtual Warehouses" — isolated compute clusters you size and pause - **Cloud Services layer:** Metadata, query optimization, access control **Key difference from BigQuery:** - Snowflake: pay per second of compute (warehouse running time). You choose warehouse size (XS to 4XL). - BigQuery: pay per TB scanned. No warehouse to manage. For predictable daily batch pipelines: Snowflake's model is more predictable in cost. For ad-hoc analytics with unpredictable query patterns: BigQuery's model can surprise you. **Snowflake Virtual Warehouse:** ```sql -- Create a warehouse (compute cluster) for the ETL CREATE WAREHOUSE etl_warehouse WAREHOUSE_SIZE = 'MEDIUM' -- 4 nodes AUTO_SUSPEND = 60 -- pause after 60 seconds idle AUTO_RESUME = TRUE; -- Run the pipeline using this warehouse USE WAREHOUSE etl_warehouse; MERGE INTO ads_volkswagen USING staging ON ... -- Warehouse runs during MERGE, auto-suspends after ``` A Medium warehouse costs ~$0.10/credit. A 5M-row MERGE takes ~5 minutes on a Medium = ~0.5 credits = $0.05. Very efficient for batch. **GDPR Note:** For 15 European countries with customer data (emails, phones in the SQL), data residency matters. Both BigQuery EU and Snowflake EU regions are GDPR-compliant, but you must explicitly configure them. This is a legitimate concern to raise in an interview. **Summary:** - Cost: BigQuery ~$1,600–1,700/month; Snowflake ~$1,000–2,000/month depending on usage - Timeline: 3–6 months - Risk: High (vendor lock-in, full migration, consumer rewrite) - Ceiling: Infinite scale, fully managed, excellent for AI/ML integrations --- ### 3.5 CDC-First Event Backbone > **Relevance check (read first):** CDC speeds up the *daily incremental* path and unlocks real-time consumers — it does **nothing** for the first-time backfill (there is no "change" to capture during a one-time historical load). It does not, on its own, hit the stated SLA. Treat this as a *Horizon 2 / future-optionality* option, not as the answer to "16 h → 4 h". Honest framing: real-time was not requested by the test — it is a forward-looking bet. #### What is Change Data Capture (CDC)? CDC captures changes to a database and streams them as events. Instead of asking "what changed since midnight?" (polling), CDC tells you "this row just changed" (push). Think of it as the difference between: - **Polling:** calling someone every hour to ask "did anything change?" — inefficient - **CDC:** they call YOU whenever something changes — efficient #### How PostgreSQL WAL Works PostgreSQL maintains a **Write-Ahead Log (WAL)** — a sequential append-only file recording every change to the database before it's applied to the data files. Primary purpose: crash recovery and replication. WAL records look conceptually like: ``` LSN 0/1A2B3C4D: BEGIN transaction 500 LSN 0/1A2B3C6E: INSERT INTO full_auto_mobile: lid=99001, price=15000, modified=NOW() LSN 0/1A2B3D00: UPDATE mapping.lot_volkswagen: lid=99001, model='Golf', modified=NOW() LSN 0/1A2B3E10: COMMIT transaction 500 ``` **Logical replication** (available in PostgreSQL 10+) decodes the WAL into human-readable change events. Debezium uses this. #### What is Debezium? Debezium is an open-source CDC platform that reads PostgreSQL's logical replication stream and publishes change events to Kafka. **Setup:** ```sql -- Enable logical replication in PostgreSQL ALTER SYSTEM SET wal_level = logical; CREATE PUBLICATION debezium_pub FOR TABLE full_auto_mobile, mapping.lot_volkswagen; ``` **Debezium event format (JSON):** ```json { "op": "u", "before": { "lid": 99001, "price": 15000, "modified": "2024-01-15T10:00:00Z" }, "after": { "lid": 99001, "price": 16000, "modified": "2024-01-16T09:30:00Z" }, "source": { "table": "full_auto_mobile", "lsn": 28041045, "ts_ms": 1705398600000 } } ``` This event is published to a Kafka topic: `postgres.public.full_auto_mobile`. #### What is Apache Kafka? Kafka is a distributed event streaming platform. Think of it as a highly durable, ordered, distributed message queue. **Core concepts visualized:** ``` Producer Kafka Cluster Consumers ──────── ───────────── ───────── Debezium ──writes──► Topic: full_auto_mobile.changes ──reads──► ETL Pipeline (Postgres CDC) ──reads──► ClickHouse Sync Partition 0: [msg1, msg2, ...]──reads──► Fraud Detection Partition 1: [msg5, msg6, ...]──reads──► Price Alerts Partition 2: [msg3, msg4, ...]──reads──► ML Feature Store ``` **Key Kafka concepts:** | Concept | Explanation | |---|---| | **Topic** | A named category of events. Like a database table for events. | | **Partition** | A topic is split into N ordered, append-only partitions for parallelism. | | **Offset** | The position of a message within a partition. Like a row number, but immutable. | | **Producer** | Writes events to a topic (Debezium, scrapers). | | **Consumer** | Reads events from a topic (ETL, ClickHouse, ML pipeline). | | **Consumer Group** | A set of consumers sharing work on a topic. Each partition assigned to one consumer. | | **Retention** | Kafka retains events for a configured period (e.g., 7 days). Consumers can replay. | **Why Kafka is architecturally significant:** Without Kafka, every downstream system (ETL, analytics, fraud detection) must query Postgres separately. With Kafka: - One event stream, many independent consumers - Each consumer has its own offset — if ClickHouse goes down, it resumes from where it stopped - Scrapers and consumers are decoupled — scrapers don't know or care who consumes their data - Adding a new consumer (e.g., a new ML pipeline) requires zero changes to the source **This is the publish-subscribe pattern at infrastructure scale.** #### What is Redpanda? Redpanda is a Kafka-compatible drop-in replacement written in C++ (Kafka is JVM-based, Scala/Java). | Aspect | Apache Kafka | Redpanda | |---|---|---| | Language | Scala/Java (JVM) | C++ | | Latency | ~5–20ms | ~0.5–2ms | | ZooKeeper | Required (pre-Kafka 3.x) | Not needed | | Operations | Complex | Simple (single binary) | | Compatibility | Native | 100% Kafka API compatible | | Cost | Free | Free (open source) | For a data team that doesn't want to operate a full Kafka cluster, Redpanda is significantly simpler. #### The CDC Architecture Flow ``` PostgreSQL WAL │ ▼ (logical replication slot) Debezium Connector │ ▼ (JSON events) Kafka/Redpanda Topics: - full_auto_mobile.changes - mapping.lot_volkswagen.changes - stat_auto_mobile.changes │ ├──► ETL Consumer (Python/Flink): reads events, applies transforms, writes to mart ├──► ClickHouse Consumer: streams into ReplacingMergeTree ├──► Fraud Detection Service: scores new listings in real time ├──► Price Alert Service: triggers alerts when price drops └──► ML Feature Pipeline: updates feature store in near real time ``` **How this solves the 4-way JOIN problem:** Instead of joining full_auto_* + cust_auto_* + stat_auto_* + mapping.lot_* on every run, the CDC pipeline maintains a **join result** that is updated incrementally. When a mapping row changes (event on Kafka), only that lid needs re-joining — not the full 80M rows. **Summary:** - Cost: Redpanda self-hosted $0; Confluent Cloud $400–$1,200/month - Timeline: 4–8 weeks - Risk: Medium (new infrastructure, but well-understood technology) - Ceiling: Enables everything — real-time, AI, multiple consumers from one stream --- ### 3.6 Streaming SQL: Materialize / RisingWave > **Relevance check (read first):** A streaming materialized view still has to *hydrate its initial state* from the full history — that initial hydration **is** the backfill problem, not an escape from it. Streaming SQL shines for the *ongoing* regime. Like CDC, this is Horizon 2 / real-time optionality, not the SLA fix. #### The Core Concept: Incremental View Maintenance (IVM) Traditional materialized view: ```sql CREATE MATERIALIZED VIEW ads_vw_summary AS SELECT make, model, AVG(price) FROM ads_volkswagen GROUP BY make, model; -- To update it: REFRESH MATERIALIZED VIEW ads_vw_summary -- Cost: re-scans the entire ads_volkswagen table from scratch ``` **IVM:** When a row in `ads_volkswagen` changes, only the affected groups in `ads_vw_summary` are recomputed. Like a spreadsheet formula — change one cell, only dependent cells update. ``` ads_volkswagen: lid=1, price changes from 15000 to 16000 ↓ IVM detects this affects: make='VW', model='Golf' group ↓ Recomputes AVG(price) for that group only ↓ Total work: O(1), not O(80M rows) ``` **Materialize and RisingWave implement IVM over streaming data.** They maintain SQL query results continuously as data arrives from Kafka (via CDC). #### How Materialize Works ```sql -- 1. Connect to Kafka (fed by Debezium) CREATE SOURCE full_auto_changes FROM KAFKA BROKER 'kafka:9092' TOPIC 'postgres.public.full_auto_mobile' FORMAT AVRO USING CONFLUENT SCHEMA REGISTRY 'http://schema-registry:8081'; CREATE SOURCE mapping_changes FROM KAFKA BROKER 'kafka:9092' TOPIC 'postgres.public.mapping_lot_volkswagen' FORMAT AVRO ...; -- 2. Define a continuously maintained view (no REFRESH needed - ever) CREATE MATERIALIZED VIEW ads_volkswagen AS SELECT ads.lid, ads.portal, m.make, m.model, m.fuel, ads.price, cust.name AS cust_name, stat.first_visible, stat.last_visible, NOW() AS modified FROM full_auto_changes ads JOIN mapping_changes m ON m.lid = ads.lid JOIN cust_changes cust ON cust.acc_id = ads.acc_id JOIN stat_changes stat ON stat.lid = ads.lid WHERE m.model IS NOT NULL; -- 3. Query it like a normal table - results are always up to date SELECT * FROM ads_volkswagen WHERE make = 'Volkswagen' AND country = 'DE'; ``` **What happens under the hood:** When a new listing arrives in Kafka: 1. Materialize receives the event 2. Looks up the mapping, cust, stat for that `lid` from its in-memory state 3. Computes the joined row 4. Updates only that row in the materialized view 5. Total time: milliseconds The 4-way JOIN never "runs" as a batch — it's maintained as a **dataflow graph** that processes each event incrementally. #### RisingWave vs Materialize Both are streaming SQL databases with Postgres-compatible wire protocols. | Aspect | Materialize | RisingWave | |---|---|---| | License | BSL (not fully open source) | Apache 2.0 (fully open source) | | State storage | In-memory + EBS | In-memory + S3 (cheaper for large state) | | SQL compatibility | Postgres (partial) | Postgres (broader) | | Maturity | More mature | Rapidly catching up | | Self-hosting | Complex | Simpler | **The state management challenge:** For streaming JOINs, Materialize/RisingWave must keep the current state of each joined table in memory to process new events. For 80M listings across 50 makes × 15 countries: the state can be very large. RisingWave's S3 spill handles state that exceeds RAM — crucial for your scale. **Summary:** - Cost: Materialize Cloud from $0.13/credit-hour (~$500–2,000/month at scale); RisingWave self-hosted $0 - Timeline: 4–8 weeks (requires Kafka first) - Risk: Medium-high (relatively new technology, state management complexity) - Ceiling: Eliminates the batch problem architecturally; real-time comes for free --- ### 3.7 Medallion Architecture on Apache Iceberg #### The Medallion Architecture Coined by Databricks, now an industry standard for organizing data lakes. ``` ┌─────────────────────────────────────────────────────────────────────┐ │ BRONZE LAYER │ │ Raw, unmodified source data. Append-only. Never overwritten. │ │ Schema: exactly as scrapers produce it │ │ Format: Parquet or JSON on S3/MinIO │ │ Retention: Forever (cheap storage) │ │ Example: s3://lot-internet/bronze/full_auto_mobile/dt=2024-01-15/ │ └──────────────────────────┬──────────────────────────────────────────┘ │ dbt/Spark/DuckDB transformation ┌──────────────────────────▼──────────────────────────────────────────┐ │ SILVER LAYER │ │ Cleaned, normalized, validated. Incrementally updated. │ │ Schema: enforced, typed, documented │ │ Content: VIN cleaned, currency converted, make/model normalized │ │ Format: Iceberg tables on S3 │ │ Example: s3://lot-internet/silver/ads_volkswagen_normalized/ │ └──────────────────────────┬──────────────────────────────────────────┘ │ dbt/Spark aggregation ┌──────────────────────────▼──────────────────────────────────────────┐ │ GOLD LAYER │ │ Business-ready, denormalized, consumer-specific. │ │ Multiple Gold tables for different consumers. │ │ Example: s3://lot-internet/gold/ads_volkswagen_analytics/ │ │ s3://lot-internet/gold/ads_volkswagen_ml_features/ │ │ s3://lot-internet/gold/ads_volkswagen_export/ │ └─────────────────────────────────────────────────────────────────────┘ ``` **Why Bronze is non-negotiable:** If you transform data before storing it (ETL), you lose the raw state permanently. When the mapping logic has a bug (and it will), you can reprocess Silver/Gold from Bronze without re-scraping. This is your source of truth. **Multiple Gold tables:** Different consumers have different needs: - Analytics dashboard: needs aggregated metrics - ML training: needs raw feature values with time travel - Customer export: needs filtered, formatted data Instead of forcing all consumers to use one wide mart table, each gets a Gold table optimized for their access pattern. Bronze and Silver are processed once. #### Apache Iceberg Deep Dive Iceberg is a table format — a specification for how data files and metadata are organized on object storage to support ACID transactions and SQL operations. **The Iceberg metadata tree:** ``` s3://lot-internet/gold/ads_volkswagen/ ├── metadata/ │ ├── v1.metadata.json ← Table metadata (schema, partition spec, snapshots) │ ├── v2.metadata.json ← After first commit │ └── v3.metadata.json ← After second commit (current) ├── metadata/snap-001.avro ← Snapshot 1: list of data files ├── metadata/snap-002.avro ← Snapshot 2: list of data files └── data/ ├── dt=2024-01/ │ ├── part-00001.parquet ← Data files (immutable once written) │ └── part-00002.parquet └── dt=2024-02/ └── part-00003.parquet ``` When you UPDATE a row in Iceberg: 1. Write a new Parquet file with the updated rows 2. Write a delete file marking the old rows as deleted 3. Commit a new snapshot that references the new data + delete files 4. The old snapshot (and data files) remain — this enables time travel **MERGE INTO with Iceberg (replaces ON CONFLICT DO UPDATE):** ```sql MERGE INTO iceberg.gold.ads_volkswagen AS target USING staging AS source ON target.lid = source.lid AND target.make = source.make WHEN MATCHED AND ( target.ads_modified != source.ads_modified OR target.mapping_modified != source.mapping_modified ) THEN UPDATE SET target.price = source.price, target.model = source.model, -- ...all columns WHEN NOT MATCHED THEN INSERT *; ``` This runs efficiently because Iceberg: - Has partition pruning (only reads relevant partitions) - Uses file-level statistics (min/max per column per file) to skip entire files - No MVCC overhead — Parquet files are immutable **Time Travel for ML:** ```python # In dbt or Spark: # Train a model on data as of 6 months ago: spark.read \ .option("snapshot-id", "snap-2023-07-01") \ .format("iceberg") \ .load("s3://lot-internet/gold/ads_volkswagen") \ .createOrReplaceTempView("ads_vw_july") ``` This is impossible in the current PostgreSQL pipeline without a separate historical snapshot system. **Iceberg Partitioning for the price history problem:** ```sql CREATE TABLE prices_auto_mobile ( lid BIGINT, price DECIMAL(18, 2), dt DATE ) USING ICEBERG PARTITIONED BY (months(dt), bucket(256, lid)); ``` - `months(dt)`: Partition by calendar month → queries filtering by recent dates scan only recent partitions - `bucket(256, lid)`: Sub-partition by hashing lid → queries for specific `lid`s scan only 1/256 of files The window function query (currently scanning 3B rows) now scans only a few partitions. **Summary:** - Cost: MinIO self-hosted ($0) + DuckDB ($0) = effectively free; or S3 ~$375/month for 15TB - Timeline: 3–5 months - Risk: Medium (new patterns, but each component is well-understood) - Ceiling: Best long-term architecture for AI/ML; industry standard --- ### 3.8 Apache Doris / StarRocks #### What is Apache Doris? Apache Doris is an MPP (Massively Parallel Processing) real-time analytical database. Originally created by Baidu (2017). **StarRocks** is a fork started in 2020 by core Doris contributors, focusing on performance and operational simplicity. **Why it's uniquely relevant here:** Doris/StarRocks is specifically designed for the workload in `pipeline.sql`: high-frequency upserts of wide rows + fast analytical queries on the same data. It's the intersection of ClickHouse (analytical speed) and a proper upsert system. #### The Frontend/Backend Architecture ``` ┌──────────────────────────────┐ │ Frontend (FE) Nodes │ │ - SQL parsing │ │ - Query planning │ │ - Metadata management │ │ - HA: Leader + Followers │ └──────────────┬───────────────┘ │ Query execution plan ┌──────────────▼───────────────┐ │ Backend (BE) Nodes │ │ - Data storage │ │ - Query execution │ │ - Horizontal scaling │ └──────────────────────────────┘ ``` FE nodes are stateless (metadata in BDB-JE / Paxos). BE nodes store data and execute compute. Add BE nodes to scale storage and compute independently. **MySQL wire protocol compatibility:** Doris/StarRocks speaks MySQL protocol. Any tool that works with MySQL works with Doris. Your Postgres clients need only a connection string change (in many cases). #### The Three Data Models **1. Duplicate Key Model** — Append-only, allows duplicates. For events/logs. **2. Aggregate Key Model** — Pre-aggregates on ingest. For pre-computed metrics. **3. Unique Key Model** — The relevant one for this pipeline: ```sql CREATE TABLE ads_volkswagen ( lid BIGINT, make VARCHAR(64), model VARCHAR(128), price DECIMAL(18, 2), fuel VARCHAR(32), mileage INT, modified DATETIME, -- ... all 80 columns ) ENGINE = OLAP UNIQUE KEY(lid, make) DISTRIBUTED BY HASH(lid) BUCKETS 32 PROPERTIES ( "replication_num" = "3", "enable_unique_key_merge_on_write" = "true" -- immediate deduplication ); ``` **How Unique Key Model handles upserts:** - `enable_unique_key_merge_on_write = true`: deduplication happens at write time (synchronous). New row with same key immediately replaces old. No eventual consistency. - `enable_unique_key_merge_on_write = false`: deduplication at read time (FINAL semantics, like ClickHouse). Better write throughput. **Stream Load — direct HTTP bulk ingest:** ```bash curl -X PUT \ -H "label: etl_vw_2024_01_16" \ -H "column_separator: ," \ -T /tmp/ads_volkswagen_new.csv \ "http://doris-fe:8030/api/database/ads_volkswagen/_stream_load" ``` At millions of rows/second. No ON CONFLICT overhead. No MVCC. **Routine Load — continuous Kafka consumption:** ```sql CREATE ROUTINE LOAD load_ads_vw ON ads_volkswagen PROPERTIES ( "max_batch_interval" = "10", "max_batch_rows" = "200000" ) FROM KAFKA ( "kafka_broker_list" = "kafka:9092", "kafka_topic" = "ads_volkswagen_changes", "kafka_partitions" = "0,1,2,3", "kafka_offsets" = "OFFSET_BEGINNING" ); ``` Doris continuously consumes from Kafka and upserts rows. Combined with CDC (Section 3.5), this gives real-time upserts without any batch pipeline. #### Doris vs ClickHouse for This Use Case | Aspect | ClickHouse | Doris / StarRocks | |---|---|---| | **Upsert model** | ReplacingMergeTree (async, eventual) | Unique Key (sync option available) | | **Join performance** | Good (hash join) | Better (MPP joins, runtime filter) | | **MySQL compat.** | No (custom protocol) | Yes (MySQL wire) | | **Kafka ingest** | Kafka table engine | Built-in Routine Load | | **Materialized views** | Synchronous | Synchronous + async | | **Schema changes** | Hard (column adds require full rewrite) | ALTER TABLE (online) | | **Self-hosted ease** | Medium | Medium | | **Community** | Larger | Growing fast | **The key Doris advantage here:** built-in Routine Load from Kafka + Unique Key model = the entire ETL pipeline (the 250-iteration batch) can be replaced by a continuously running ingestion job that processes events as they arrive. **Summary:** - Cost: $0 (self-hosted on existing 38-vCPU hardware) - Timeline: 4–8 weeks - Risk: Medium (less Western adoption than ClickHouse, but production-proven at Baidu/Meituan scale) - Ceiling: High — handles real-time upserts + analytics queries on the same data --- ### 3.9 Full Cloud Native Stack #### The Component Stack ``` Scrapers │ ▼ Confluent Cloud (Managed Kafka) │ ├─► Kafka Connect (managed connectors) │ - PostgreSQL CDC source │ - BigQuery/Snowflake sink │ ├─► ksqlDB (stream processing SQL) │ - Filter, transform, enrich events │ └─► BigQuery / Snowflake (DWH) │ ├─► dbt Cloud (transformations) │ ├─► Looker / Metabase (BI) │ └─► Vertex AI / SageMaker (ML) │ ├─► Feature Store (online + offline) ├─► Model Training └─► Model Serving (REST endpoints) ``` #### Confluent — Managed Kafka Confluent is the company founded by Kafka's creators (LinkedIn alumni). Confluent Cloud is fully managed Kafka with: **Schema Registry:** Enforces data contracts between producers and consumers. If a scraper changes a field type, the Schema Registry rejects the incompatible schema — protecting downstream consumers from silent data corruption. ```json // Schema for full_auto_mobile events (Avro): { "type": "record", "name": "FullAutoMobile", "fields": [ {"name": "lid", "type": "long"}, {"name": "price", "type": ["null", "double"], "default": null}, {"name": "modified", "type": "string"} ] } ``` **ksqlDB:** Write SQL to continuously process Kafka streams: ```sql -- Filter and transform events in real time: CREATE STREAM ads_vw_germany AS SELECT lid, price, model, fuel FROM full_auto_mobile_changes WHERE make = 'Volkswagen' AND country = 'DE' EMIT CHANGES; ``` **Kafka Connect:** Pre-built connectors for 200+ systems. A PostgreSQL CDC → Kafka connector and a Kafka → BigQuery connector are point-and-click in Confluent Cloud. #### What is a Feature Store? A Feature Store is a central repository for ML features — computed attributes used to train and serve ML models. **Why it matters for this pipeline:** Your `ads_` table contains raw features (price, mileage, model, etc.). ML models need: 1. **Historical features** for training: "What was the average price of a 2019 Golf in Germany in Q3 2023?" 2. **Online features** for serving: "What is the current price percentile of this listing?" (< 100ms) Without a Feature Store: - Training uses historical batch data - Serving recomputes features on demand - **Training-serving skew**: the feature computed at training time differs from the feature at serving time → model performance degrades silently in production With a Feature Store (Feast, Tecton, Vertex AI Feature Store): - Features are computed once and stored - Same feature values used for both training and serving - Feature history is versioned — models can train on "features as of date X" **Feast (open-source Feature Store) example:** ```python from feast import FeatureStore, FeatureView, Entity, Field from feast.types import Float64, String # Define a feature view from your ads table: ads_features = FeatureView( name="ads_volkswagen_features", entities=[Entity(name="lid", join_keys=["lid"])], schema=[ Field(name="price", dtype=Float64), Field(name="price_pct_vs_model_avg", dtype=Float64), Field(name="days_on_market", dtype=Float64), ], source=BigQuerySource(table="project.gold.ads_volkswagen_features"), ttl=timedelta(days=1), ) # Online feature retrieval (< 10ms, from Redis): features = store.get_online_features( features=["ads_volkswagen_features:price", "ads_volkswagen_features:days_on_market"], entity_rows=[{"lid": 99001}] ).to_dict() ``` #### Vertex AI / SageMaker Overview Managed platforms for the full ML lifecycle: **Training:** Run model training jobs on managed compute. Integrates with BigQuery/S3 for data access. **Experiment tracking:** Log parameters, metrics, artifacts per training run. Compare experiments. **Model Registry:** Version and store trained models. Promote from "staging" to "production." **Model Serving:** Deploy models as REST endpoints. Autoscaling. A/B testing. **For car listings AI use cases:** - **Price recommendation:** Given a listing's specs, what price should the dealer set? (Regression model) - **Listing quality score:** Is this listing complete, accurate, likely to convert? (Classification model) - **Fraud detection:** Does this VIN appear in multiple listings? Is the price anomalously low? (Anomaly detection) - **Similar listings:** What other cars are similar to this one? (Embedding + vector search) **Summary:** - Cost: $3,000–$8,000/month at full scale - Timeline: 3–6 months - Risk: High (vendor lock-in, data residency, all consumers rewritten) - Ceiling: Maximum — infinite scale, fully managed, best AI/ML integration --- ### 3.10 TimescaleDB for Price History #### What is TimescaleDB? TimescaleDB is a PostgreSQL **extension** — not a separate database. You install it on your existing Postgres and it adds time-series superpowers. ```sql -- Install extension (one line): CREATE EXTENSION IF NOT EXISTS timescaledb; ``` That's it. All your existing Postgres applications continue to work unchanged. #### Hypertables — Automatic Time Partitioning ```sql -- Convert prices_auto_mobile to a hypertable partitioned by month: SELECT create_hypertable('prices_auto_mobile', 'dt', chunk_time_interval => INTERVAL '1 month'); ``` After this call, `prices_auto_mobile` looks and behaves exactly like before to all queries. But internally: - TimescaleDB creates a "chunk" for each month - Each chunk is a separate physical table - A query `WHERE dt > '2024-01-01'` automatically skips all earlier chunks (partition pruning) - Adding data for a new month creates a new chunk automatically For 3B rows across several years: instead of one massive table, you have ~72 monthly chunks (6 years × 12 months). The window function query that scans "recent prices" now scans only 1–3 chunks instead of all 3B rows. #### Columnar Compression TimescaleDB's **columnar compression** compresses chunks that are older than a threshold: ```sql ALTER TABLE prices_auto_mobile SET ( timescaledb.compress, timescaledb.compress_segmentby = 'lid', -- group rows by lid for compression timescaledb.compress_orderby = 'dt DESC' -- sort within group ); -- Auto-compress chunks older than 3 months: SELECT add_compression_policy('prices_auto_mobile', INTERVAL '3 months'); ``` **Compression results for time-series data:** A 3B-row price history table that takes 500GB uncompressed typically compresses to 50–100GB with TimescaleDB. This dramatically improves query speed (less I/O) and reduces disk usage. #### Continuous Aggregates — Pre-computed Window Functions This is the most powerful feature for the pipeline: ```sql -- Pre-compute first/last price per listing: CREATE MATERIALIZED VIEW price_first_last WITH (timescaledb.continuous) AS SELECT lid, first(price, dt) AS price_first, -- price at earliest dt last(price, dt) AS price_last, -- price at most recent dt MIN(price) AS price_min, MAX(price) AS price_max FROM prices_auto_mobile GROUP BY lid; -- Auto-refresh when new price data arrives: SELECT add_continuous_aggregate_policy('price_first_last', start_offset => INTERVAL '3 months', end_offset => INTERVAL '1 hour', schedule_interval => INTERVAL '1 hour' ); ``` Now the pipeline's price recovery query changes from: ```sql -- Before: scans up to 3B rows every run, 250 times SELECT row_number() OVER price_w_lv, lid, analytic.first(price) OVER price_w_lv price_last_new, ... FROM prices_auto_ WHERE lid = ANY(...) WINDOW price_w_lv AS (PARTITION BY lid ORDER BY dt DESC...) ``` to: ```sql -- After: sub-millisecond index lookup SELECT lid, price_first, price_last, price_min, price_max FROM price_first_last WHERE lid = ANY(...) ``` This eliminates one of the two major bottlenecks in the current pipeline with minimal risk (it's still PostgreSQL, same schema, same tools). **Summary:** - Cost: $0 (extension on existing Postgres) - Timeline: 1–2 weeks - Risk: Very low (it's a Postgres extension, fully backward compatible) - Ceiling: Solves the price history bottleneck permanently; doesn't solve the 4-way JOIN or ON CONFLICT bottlenecks --- ### 3.11 Two-Horizon Strategy #### Why One Solution Isn't Enough The interview task asks for a solution that hits the 4-hour SLA. But if you only answer that question, you're proposing a local optimization for today's problem while ignoring tomorrow's architectural needs. A CTO or Data Lead answers: **"Here's how we hit 4 hours this month, and here's the platform we build over the next 6–9 months so we don't have this conversation again in 18 months."** #### Horizon 1: Hit the SLA (Weeks 0–6) **Goal:** Bring runtime from 16 hours to under 4 hours using existing infrastructure. | Action | Expected Speedup | Risk | |---|---|---| | Parallelize make/portal loop (8–10 parallel) | 4–8x | Low | | Add TimescaleDB + Continuous Aggregates on prices_auto_* | 2–5x (price recovery step) | Very Low | | Replace ON CONFLICT with COPY+staging swap | 1.5–2x | Low | | Push-down filter to get changed lids before JOIN | 2–3x | Low | Combined: 16 hours → 1.5–3 hours. Exceeds the SLA with margin. **What this is NOT:** A permanent solution. It's buying time to build the right platform. #### Horizon 2: Build the AI-Ready Platform (Months 2–9) Pick one target architecture based on budget, team, and AI/RT timeline: **Path B2a — Kafka + Doris (Best for: low budget, self-hosted, RT requirements in 6–12 months)** - Month 2–3: Deploy Kafka/Redpanda, set up Debezium CDC - Month 4–5: Deploy Doris, migrate mart tables, set up Routine Load - Month 6: Decommission the batch pipeline (Kafka feeds Doris directly) - Month 7+: Feature store for ML, price alert service on Kafka **Path B2b — Medallion on Iceberg + ClickHouse (Best for: ML focus, historical analysis, moderate budget)** - Month 2–3: Set up MinIO, migrate Bronze layer (raw scraper output to Parquet) - Month 4–5: Build Silver dbt models, ClickHouse for analytics serving - Month 6–7: Gold layer, Iceberg time travel for ML training data - Month 8+: Feature store, batch ML pipelines **Path B2c — Managed Cloud (Best for: small team, GDPR-compliant EU regions, willing to pay)** - Month 2–3: Confluent Cloud setup, Postgres CDC to Kafka - Month 4–5: BigQuery/Snowflake as target, Kafka→DWH connectors - Month 6–7: dbt Cloud for transformations, existing Postgres consumers via compatibility views - Month 8+: Vertex AI / SageMaker feature store and model serving #### Why This Strategy Wins in an Interview When the interviewer asks "what's your solution?", answering with only "fix the SQL" shows you can optimize. Answering with "Horizon 1 fixes the immediate problem while we build Horizon 2" shows you think in systems, not patches. It demonstrates: - Business awareness (SLA must be hit now, not in 6 months) - Technical depth (you understand why the current approach fails at scale) - Strategic thinking (you're designing for where the company needs to be, not where it is) --- ### 3.12 Citus — Distributed / Parallel PostgreSQL **What it is:** Citus is an open-source PostgreSQL **extension** (maintained by Microsoft) that turns one or more Postgres servers into a distributed, parallel database. You keep the PostgreSQL SQL dialect, drivers, and tooling. **How it works:** You designate a **distribution column** (e.g. `lid` or `country`). Citus shards a table into many smaller Postgres tables ("shards") and spreads them across worker nodes — or, in **single-node Citus**, across many cores of one machine. A query is planned by a **coordinator** and executed **in parallel** across all shards. ```sql SELECT create_distributed_table('ads_volkswagen', 'lid'); SELECT create_distributed_table('prices_auto_mobile', 'lid', colocate_with => 'ads_volkswagen'); ``` **Why it is relevant here:** It attacks §2.1 (single-threading) *without* leaving PostgreSQL. The 4-way join and the bulk insert become parallel across shards automatically — no orchestrator-level chunking code to write. Co-locating `prices_auto_*` and `ads_` on the same `lid` shards makes the price join shard-local (no cross-node shuffle). **Trade-offs:** another extension to operate; queries that don't align with the shard key can be slow; cross-shard transactions are constrained. For a team that wants Postgres-native parallelism, single-node Citus is a genuine middle path between "rewrite the orchestration" (§3.1) and "adopt a new engine" (§3.2 / §3.8). - Cost: $0 (open-source extension); Timeline: 3–6 weeks; Risk: Low–Medium; Ceiling: parallel Postgres — still row-store for analytics. --- ### 3.13 In-Postgres Columnar Extensions **What they are:** Extensions that embed a **columnar / vectorized analytical engine inside PostgreSQL**, so you get OLAP performance without operating a second database. - **pg_duckdb** — embeds DuckDB inside Postgres; runs analytical queries (and queries over Parquet/Iceberg on object storage) with DuckDB's vectorized engine, from a normal Postgres connection. - **pg_mooncake** — adds **columnar tables** (backed by Parquet, Iceberg-compatible) to Postgres, kept in sync with regular tables; analytical queries hit the column store. - **pg_analytics / ParadeDB** — a column-store + analytics extension in the same spirit. **Why relevant:** This is the **lowest-friction OLAP option**. The §1.1 OLTP-on-OLAP mismatch — an 80-column analytical mart trapped in row-store Postgres — is the real long-term problem. These extensions let analytics consumers get column-store speed *without* a separate ClickHouse/Doris cluster to deploy, secure, monitor, and sync. For a small team, eliminating the "operate two databases" tax is significant. **Trade-offs:** younger and less battle-tested than ClickHouse/Doris; smaller community; performance ceiling below a dedicated columnar engine. Best seen as a pragmatic **first step** of Horizon 2 — and a fallback if standing up a separate columnar store is judged too heavy. - Cost: $0 (open-source extensions); Timeline: 2–4 weeks; Risk: Medium (maturity); Ceiling: good columnar reads in-Postgres; not a match for a dedicated OLAP cluster at extreme scale. --- ### 3.14 Vertical Scaling **What it is:** Before (or alongside) any re-architecture, simply **give the existing PostgreSQL better hardware**. The test spec is pointed: **64 GB RAM** against multi-TB data, and **SATA SSD** rather than NVMe. Two consequences (§2.7): the working set cannot stay in RAM, and the box is I/O-bound — and the backfill's hash joins and sorts spill onto the *slower* class of SSD. **The intervention:** - More RAM (256–512 GB) → joins, sorts, and the working set stay in memory; far less spill. - **NVMe** instead of SATA SSD → multiple-times-higher throughput and IOPS exactly where this workload hurts. - More vCPU only if parallelism (§3.1 / §3.12) is already in place to use it. **Why it belongs in the comparison:** the assignment asks for something "viable for the next 2–3 years (data grows ~30%)." **30% over three years is modest.** A hardware refresh, combined with the §3.1 rewrite, plausibly carries the workload for the full horizon at a **one-off** cost — no new systems, no migration, no new operational surface. A credible CTO answer names this explicitly rather than reaching straight for a distributed platform. It is rarely the *whole* answer (it does not fix the OLTP/OLAP mismatch), but omitting it signals reaching for complexity before economics. - Cost: one-off hardware (cloud: a larger instance; on-prem: a RAM/NVMe upgrade — order of a few thousand euros); Timeline: days–weeks; Risk: Very Low; Ceiling: buys years, doesn't change the architecture. --- ## 4. AI & Real-Time Readiness Matrix This is the critical lens a CTO applies that a pure data engineer often misses. | Solution | Real-Time Capable | ML Training Data | Feature Store Ready | Fraud Detection | Price Alerts | Vector Search | |---|:---:|:---:|:---:|:---:|:---:|:---:| | 3.1 Postgres Rewrite | No | No (no time travel) | No | No | No | With pgvector | | 3.2 + ClickHouse | Partial (minutes lag) | Limited | No | Partial | Partial | No | | 3.3 Lakehouse Iceberg | Near-RT (minutes) | **Yes (time travel)** | Partial | Partial | Partial | With external | | 3.4 BigQuery/Snowflake | Near-RT (minutes) | **Yes** | With Vertex/Sagemaker | Yes | Partial | With extensions | | 3.5 Kafka CDC | **Yes (seconds)** | Partial (via S3) | Partial | **Yes** | **Yes** | No | | 3.6 Materialize/RisingWave | **Yes (ms)** | Limited | No | **Yes** | **Yes** | No | | 3.7 Iceberg Medallion | Near-RT (minutes) | **Yes (time travel)** | **Yes** | Partial | Partial | With external | | 3.8 Doris/StarRocks | **Yes (seconds)** | Limited | No | **Yes** | **Yes** | Limited | | 3.9 Full Cloud Native | **Yes (seconds)** | **Yes** | **Yes (managed)** | **Yes** | **Yes** | **Yes** | | 3.10 TimescaleDB | No (still batch) | Partial | No | No | No | No | | 3.11 Two-Horizon | **Yes (H2)** | **Yes (H2)** | **Yes (H2)** | **Yes (H2)** | **Yes (H2)** | **Yes (H2)** | | 3.12 Citus | No | No | No | No | No | With pgvector | | 3.13 In-Postgres Columnar | No | Limited | No | Partial | No | With pgvector | | 3.14 Vertical Scaling | No | No | No | No | No | No | --- ## 5. Decision Framework Use this when presenting to the company. Frame it as questions, not prescriptions. ### Question 1: What is the team's current expertise? - Postgres-heavy team → Start with 3.1 + 3.10, then 3.2 or 3.8 - Python/cloud-native team → Jump to 3.7 + 3.3 - No dedicated data engineering → 3.4 (managed cloud handles operations) ### Question 2: What is the budget ceiling? - $0 additional: Options 3.1, 3.10, 3.8 (Doris), 3.6 (RisingWave self-hosted) - $500–1,500/month: Options 3.2 (ClickHouse Cloud), 3.7 (S3 storage) - $2,000–5,000/month: Options 3.4 (BigQuery/Snowflake), 3.9 (Cloud Native) ### Question 3: When does real-time land? - Not in the roadmap: 3.1 + 3.10 (optimize batch) - Within 12 months: 3.5 (Kafka CDC) + 3.8 (Doris) now - Already needed: 3.6 (Materialize) + 3.5 (Kafka) now ### Question 4: What are the AI/ML priorities? - Price prediction (near-term): needs feature store, Iceberg time travel - Fraud detection (near-term): needs real-time stream processing (Kafka + Flink/Materialize) - Recommendations (medium-term): needs vector search (pgvector, or dedicated Qdrant/Pinecone) - Full ML platform (long-term): Vertex AI / SageMaker + Feature Store ### Question 5: What is the migration risk tolerance? - Low risk: 3.1, 3.10, 3.2 (additive, no consumer disruption) - Medium risk: 3.5, 3.7, 3.8 (new systems alongside existing) - High risk acceptable: 3.3, 3.4, 3.6, 3.9 (full platform replacement) ### Question 6: What is the team size for this project? - 1 engineer: 3.1, 3.10 (Postgres optimization only) - 2 engineers: 3.2, 3.5, 3.8, 3.7 - 3+ engineers: 3.3, 3.4, 3.6, 3.9 --- ## 6. Comparative Summary Table "Hits backfill SLA?" below means the stated target: **first-time (backfill) load of a large brand under 4 hours.** | # | Solution | Hits backfill SLA? | Add. Cost/mo | Team | Timeline | AI/RT Ready | Risk | |---|---|:---:|---:|:---:|---|:---:|:---:| | 3.1 | Postgres Surgical Rewrite | **Yes** | $0 | 1 | 2–4 wk | No | Low | | 3.2 | Postgres + ClickHouse | Via 3.1; CH = reads | $0–800 | 2 | 4–8 wk | Partial | Med | | 3.3 | Lakehouse (Iceberg+DuckDB) | Yes | $0–375 | 2–3 | 3–5 mo | Yes | Med-High | | 3.4 | BigQuery / Snowflake | Yes | $1,600–2,000 | 3+ | 3–6 mo | Yes | High | | 3.5 | CDC + Kafka/Redpanda | **No** (incremental only) | $0–1,200 | 2 | 4–8 wk | Yes (enabler) | Med | | 3.6 | Materialize/RisingWave | **No** (state hydration = backfill) | $0–2,000 | 2 | 4–8 wk | Yes | Med-High | | 3.7 | Medallion + Iceberg | Yes | $0–500 | 2–3 | 3–5 mo | Yes | Med | | 3.8 | Doris / StarRocks | Yes | $0 | 2 | 4–8 wk | Partial | Med | | 3.9 | Full Cloud Native | Yes | $3,000–8,000 | 3+ | 3–6 mo | Yes (best) | High | | 3.10 | TimescaleDB | Partial (price step only) | $0 | 1 | 1–2 wk | No | Very Low | | 3.11 | Two-Horizon Strategy | **Yes** (H1 = 3.1) | $0 → grows | 1→3 | 6 wk + 6 mo | Yes (H2) | Low→Med | | 3.12 | Citus (parallel Postgres) | **Yes** | $0 | 1–2 | 3–6 wk | No | Low-Med | | 3.13 | In-Postgres Columnar | Via 3.1; columnar reads | $0 | 1–2 | 2–4 wk | Partial | Med | | 3.14 | Vertical Scaling | Contributes (with 3.1) | one-off HW | 1 | days–wk | No | Very Low | **The honest read:** options 3.1, 3.12, 3.14 (and 3.10 for the price step) are what actually move the stated SLA, at $0–one-off cost. Everything else is a *Horizon 2* answer to growth, analytics performance, real-time, or AI — valuable, but a different question from "16 h → 4 h." --- ## 7. Glossary | Term | Plain English | |---|---| | **Airflow** | A workflow orchestration tool. You define pipelines as Python code; Airflow schedules, runs, retries, and monitors them. | | **Batch processing** | Processing data in large chunks at scheduled intervals (e.g., once per day). | | **BE (Backend, in Doris)** | The worker node in a Doris cluster that stores data and executes queries. | | **Bronze / Silver / Gold** | Three-layer data organization pattern (Medallion). Raw → cleaned → business-ready. | | **CDC (Change Data Capture)** | Capturing row-level changes in a database and streaming them as events. | | **Chunk (TimescaleDB)** | A physical time-partitioned sub-table managed automatically by TimescaleDB. | | **Columnar storage** | Storing each column separately on disk instead of each row. Excellent for analytics. | | **Consumer Group (Kafka)** | A set of Kafka consumers that divide work on a topic among themselves. | | **Continuous Aggregate** | A materialized view that automatically stays up to date as new data arrives (TimescaleDB). | | **dbt** | Data build tool. Write SELECT queries; dbt handles the DDL and runs them on a schedule. | | **Debezium** | Open-source CDC tool that reads PostgreSQL WAL and publishes changes to Kafka. | | **DuckDB** | Embedded analytical database (like SQLite but for analytics). No server needed. | | **ELT** | Extract, Load, Transform. Load raw data first, then transform it in the target system. | | **ETL** | Extract, Transform, Load. Transform data before loading to the destination. | | **FE (Frontend, in Doris)** | The coordinator node in a Doris cluster that handles SQL parsing and query planning. | | **Feature Store** | Central repository for ML features — ensures training and serving use the same values. | | **Hypertable** | A TimescaleDB table that is automatically partitioned by time into chunks. | | **Iceberg** | Open table format providing ACID transactions, schema evolution, and time travel on object storage. | | **IVM (Incremental View Maintenance)** | Updating only the affected parts of a materialized view when underlying data changes. | | **Kafka** | Distributed event streaming platform. High-durability, ordered message queue. | | **Kafka Offset** | The position of a message within a Kafka partition. Consumers track their offset. | | **Kafka Partition** | A topic is split into N ordered, append-only partitions for parallelism. | | **Kafka Topic** | A named category of events in Kafka. Like a database table, but for events. | | **Lakehouse** | Architecture combining cheap data lake storage with data warehouse ACID + query capabilities. | | **LSN (Log Sequence Number)** | PostgreSQL's unique identifier for a position in the WAL. | | **Medallion Architecture** | Bronze/Silver/Gold data organization pattern. | | **MergeTree** | ClickHouse's core storage engine. Data stored in sorted columnar parts that are periodically merged. | | **MPP (Massively Parallel Processing)** | Distributing a query across many CPUs/nodes simultaneously. Used by Doris, BigQuery, Snowflake. | | **MVCC** | PostgreSQL's concurrency control: instead of modifying rows in place, creates new versions. Causes bloat. | | **ON CONFLICT DO UPDATE** | PostgreSQL's upsert syntax. Slow at scale due to MVCC overhead. | | **OLAP** | Online Analytical Processing. Optimized for complex aggregations over large datasets. | | **OLTP** | Online Transaction Processing. Optimized for fast single-row reads and writes. | | **Partition (PostgreSQL)** | A physical sub-table of a large table. Queries that filter on the partition key skip irrelevant partitions. | | **Partition Pruning** | Query optimizer skipping irrelevant partitions. Dramatically speeds up range queries. | | **Redpanda** | Kafka-compatible event streaming platform written in C++. Lower latency, simpler operations. | | **ReplacingMergeTree** | ClickHouse engine that deduplicates rows with the same sort key during background merges. | | **RisingWave** | Open-source streaming SQL database with Postgres compatibility and S3 state spill. | | **Routine Load (Doris)** | Built-in Kafka consumer in Doris for continuous data ingestion. | | **Schema Registry** | Service that enforces data contracts (schemas) between Kafka producers and consumers. | | **SIMD** | Single Instruction, Multiple Data. CPU feature used by ClickHouse/DuckDB to process many values in parallel. | | **Spark** | Distributed data processing framework. Splits work across a cluster of machines. | | **Stream Load (Doris)** | HTTP-based bulk data ingestion API for Doris. Millions of rows/second. | | **Streaming** | Processing data event-by-event as it arrives. Millisecond to second latency. | | **Time Travel** | Querying historical snapshots of a table as it was at a specific point in time (Iceberg, Snowflake, BigQuery). | | **TimescaleDB** | PostgreSQL extension adding time-series features: hypertables, compression, continuous aggregates. | | **Unique Key Model (Doris)** | Doris data model where inserts with existing keys replace old rows. The upsert model. | | **Vectorized Execution** | Processing 8192 rows at once using CPU SIMD instructions (ClickHouse, DuckDB). | | **Virtual Warehouse (Snowflake)** | An isolated compute cluster in Snowflake. You pay per second while it's running. | | **WAL (Write-Ahead Log)** | PostgreSQL's sequential change log. Primary source for CDC. | | **Watermark** | A timestamp tracking "I have processed all events up to this point." Used in incremental ETL and streaming. | | **Window Function** | SQL function that computes a value per row using a "window" of surrounding rows (e.g., PARTITION BY lid ORDER BY dt). | ### Glossary — v2 additions | Term | Plain English | |---|---| | **Autovacuum** | PostgreSQL's background process that reclaims space from dead tuples. Starved by long-running transactions. | | **Backfill / first-time load** | The initial load of a table's entire history (vs. a daily incremental). The regime the 16 h figure belongs to. | | **Bloat / dead tuple** | Old row versions left by MVCC after UPDATE/DELETE, until vacuum reclaims them. | | **Citus** | PostgreSQL extension that shards tables and runs queries in parallel across cores/nodes. | | **COPY** | PostgreSQL's bulk-load command — far faster than row-by-row INSERT. | | **`maintenance_work_mem`** | Memory budget for maintenance ops; governs `CREATE INDEX` speed after a bulk load. | | **NVMe vs SATA SSD** | NVMe SSDs deliver several times the throughput/IOPS of SATA SSDs — relevant when a workload is I/O-bound. | | **`ATTACH PARTITION` / partition swap** | Adding a pre-built table as a partition in one near-instant metadata operation — used to "publish" a backfill. | | **pg_duckdb / pg_mooncake** | Extensions embedding a columnar/vectorized engine inside PostgreSQL for OLAP without a second database. | | **UDF (User-Defined Function)** | A custom SQL / PL-pgSQL function. Called per row, it can silently dominate runtime. | | **Vertical scaling** | Making one machine bigger (RAM, NVMe, vCPU) rather than adding machines. | | **`work_mem`** | Per-operation memory for sorts/hash joins. Too low → spill to disk. | | **xmin horizon** | The oldest transaction snapshot still in use; vacuum cannot clean tuples newer than it. A 16 h transaction freezes it. | ### Glossary — Part R (redesign) additions | Term | Plain English | |---|---| | **Dimensional modeling (Kimball)** | Organising a warehouse as narrow **fact** tables (events/measurements) joined to **dimension** tables (descriptive context). | | **Fact / dimension table** | Fact = the thing that happens often (a listing, a price change); dimension = slowly-changing context (a dealer, a model mapping). | | **OBT ("One Big Table")** | Denormalising everything into one wide table. Simple to query, painful to maintain when columns change at different rates. | | **Shift-left** | Moving computation earlier in the pipeline — ideally to write time — so it is done once, not re-done on every read/rebuild. | | **Generated column** | A column whose value PostgreSQL computes from other columns automatically — a built-in way to precompute a transform. | | **Event sourcing / event-fold** | Treating a log of events as the source of truth and maintaining a "current state" projection by folding new events into it. | | **Slowly Changing Dimension (SCD)** | A dimension whose attributes change rarely; SCD Type 2 keeps history by versioning rows. | | **Polars** | A fast, multi-threaded, columnar DataFrame library (Rust core) with a larger-than-RAM streaming engine. | | **Compute ≠ storage** | The principle that the engine doing the transformation need not be the engine storing the data (e.g. DuckDB transforming data that lives in Postgres). | | **Optimize vs. redesign** | Optimize = make the existing shape faster; redesign = change the shape so the expensive work no longer exists. | --- *Document v3 (redesign axis added) — 2026-05-21 | v2 2026-05-21 | v1 2026-05-20 | Context: lot+internet final interview preparation | Source materials: test_task_etl_optimization.pdf + pipeline.sql*