From Batch Pipeline
to Data Platform

Three nested architectures for the car-listings data product — and a recommendation

ETL Pipeline Optimization · Technical Presentation
Lot Internet — Final Interview · 2026-05-22

ETL → Data Platform

1 / 16

Agenda

The next 60 minutes

Act	What we cover	Outcome
1 · Reframe	What the problem actually is — beyond "a slow batch job"	The real question
2 · Diagnose	Why it is slow — and why it stays slow	Mechanical vs structural
3 · Prescribe	Three architectures, nested — for three honest situations	Case 1 / 2 / 3
4 · Decide	Comparison, a calculated cost breakdown, and a recommendation	What to do now
5 · De-risk	Migration, rollback, metrics, what Phase 0 confirms	How we ship it safely

Please interrupt at any point — the most useful version of this is a conversation.

ETL → Data Platform

2 / 16

Act 1 · Reframe

The brief asks "which engine?" — I think that is the wrong question

Three facts change the answer:

The 16-hour job is a one-time backfill — not the daily run. The script's watermark falls back to 2018-01-01 on an empty mart, so a first run ingests all history. The daily incremental is already fast.
A slow backfill freezes the product. The make/model mapping taxonomy is the product's quality — and improving it means re-deriving the mart. At 16h × 50 brands × 15 countries, that is prohibitive.
The architecture must serve where the business is going — AI: price intelligence, ML deduplication, fraud, semantic search. None of that runs on a once-a-day reporting mart.

The real question: not "which engine" — but how we turn a brittle batch reporting pipeline into a data-product platform. This still answers the brief in full — SLA, cost, migration — the reframe is the value added on top.

ETL → Data Platform

3 / 16

Act 1 · Reframe

16 hours = a one-time backfill, not the daily run

	First-time load (backfill)	Daily incremental
Trigger	New brand / country; rebuild after a logic change	Scheduled daily run
Volume	The brand's entire history	200K–5M rows / country
Write pattern	Pure INSERT into an empty mart	UPDATE-heavy upsert
The 16 hours	◀ this one	Already fast

Proof by arithmetic: 16h for 5M rows ≈ 85 rows/second — impossibly slow for a daily run. 16h only makes sense as the full historical load.

What the slow backfill blocks: onboarding a country (≈50 backfills) · shipping a taxonomy improvement · disaster recovery. It is a tax on the product, not just a batch window.

ETL → Data Platform

4 / 16

Act 2 · Diagnose

The slowness is designed in

Mechanical — why 16h (fix by engineering)

Single-threaded — uses 1 of 38 cores
One 16-hour transaction — freezes autovacuum DB-wide
Per-row UDF calls — tens of millions of them
3-billion-row price-history scan, re-run per (make, portal)
Full 4-way join on every run

Structural — why it regenerates (fix by redesign)

One pipeline doing two incompatible jobs
"One Big Table" — 80 columns, any change rewrites the row
Recompute, not precompute
Analytical work, row-by-row, in an OLTP engine
Listing-centric model — the business is vehicle-centric

Tuning fixes the left column — that is how a Postgres rewrite hits 4 hours. Only redesign fixes the right.

ETL → Data Platform

5 / 16

Act 2 · Diagnose — the core insight

Your data model describes adverts. Your business is about cars.

listing
mobile.de

listing
autoscout24

listing
relisted later

↓

ONE physical vehicle

ads_<make> is keyed one row per listing.

The same physical car appears as many listings — across portals, relisted over time.
The brief says the mart "powers deduplication" — collapsing those back to one car. Yet today dedup is a downstream concern with no home in the schema.
Deduplication, price history, and every AI use case are vehicle-level — trapped in an advert-level table.

This is the difference between making the pipeline faster and making the company's data valuable.

ETL → Data Platform

6 / 16

Act 2 · Diagnose — the core insight

Model the domain as entities

dealerdimension · SCD2

◀ placed by───

listingone advert, one portal

N → 1───▶

VEHICLEthe resolved physical car

classified ▶───

mappingmake/model taxonomy

↓ its price history

price_eventprice stream + first / last / min / max

PAYOFF ①

Dedup is first-class

Assigning a vehicle_id to listings is entity resolution — it finally has a home.

PAYOFF ②

Analytics get narrow tables

A price change touches one row — not an 80-column rewrite.

PAYOFF ③

AI gets a clean substrate

Price models, recommendations, fraud — all read the entity model.

ads_<make> becomes a view over these entities — nothing downstream breaks.

Governance: dealer as a dimension is the natural control point for what PII reaches which consumer (GDPR).

ETL → Data Platform

7 / 16

Act 3 · Prescribe

Three architectures — nested, not three separate bets

CASE 3 — AI-Native Data Product

CASE 2 — Platformize

CASE 1 — Stabilize

Redesign in PostgreSQL · entity model · €0 infra — you do this regardless

+ columnar OLAP serving · + CDC event backbone

+ lakehouse (Iceberg) · + streaming · + feature & vector stores

Case 1 is the foundation — done in every scenario.
Case 2 = Case 1 + a serving layer and an event backbone.
Case 3 = Case 2 + a lakehouse and ML infrastructure.
Because they nest, committing to Case 1 today forecloses nothing.

B vs C is a strategy-and-funding decision — made later, on evidence.

ETL → Data Platform

8 / 16

Act 3 · Prescribe

Case 1 — Stabilize all-PostgreSQL · €0 infra

When this is you: a small team, no budget for new infrastructure, the SLA must be fixed and costs flat. For many companies in your position — this is the whole answer.

Scrapers

→

PostgreSQLlanding + entity model + shift-left columns + price_state

→

SEED / SYNC jobsDuckDB transform · all-core · COPY + partition-ATTACH

→

ads_<make> VIEW→ consumers, unchanged

Technical approach

Seed job: COPY into index-free partitions → build indexes in parallel → ATTACH PARTITION to publish atomically; resumable, chunked per (make × portal × lid-range).
Sync job: resolve changed lids from mapping first → COPY to staging → targeted UPDATE … FROM (only changed rows — no MVCC churn).
Transform in DuckDB: vectorized, multi-core; attaches PostgreSQL / reads staged Parquet — moves the join+transform off single-threaded SQL at €0 (a library, not a server).
Shift-left: deterministic transforms become GENERATED … STORED columns; price_state maintained by a trigger → the 3-billion-row scan disappears.
Entity model behind a compatibility view; vehicle_id bootstrapped from VIN + deterministic rules.

Answers the brief: SLA <4h structurally · incremental = thin delta merge · 4-way join eliminated · consumers unchanged (compatibility view).

Infra cost

€0

Team

1–2 engineers

Timeline

2–3 mo · SLA in 3–4 wk

AI ceiling

Low

ETL → Data Platform

9 / 16

Act 3 · Prescribe

Case 2 — Platformize Case 1 + serving layer + CDC

When this is you: analytics consumers are genuinely strained, you want to ship product improvements faster, moderate budget, AI is a real 12-month roadmap item.

PostgreSQLOLTP · system of record · entity model

→

Debezium CDC → Kafkaevent backbone

→

ClickHouse / Doriscolumnar OLAP serving

→

analytics · BI · exports

Technical approach

CDC: Debezium reads each PostgreSQL WAL via logical replication (a PUBLICATION + replication slot) → publishes row-level change events to Kafka / Redpanda topics.
Serving: ClickHouse ingests via its Kafka table engine; the mart is a ReplacingMergeTree keyed (make, lid), version = modified — columnar, vectorized, 10–100× on wide scans.
Sizing: the mart compresses to a few hundred GB columnar — one central ClickHouse deployment serves all 15 countries.
Iteration velocity: a taxonomy change emits CDC events → only the affected vehicles re-derive — no full re-backfill.
Migration: PostgreSQL stays system-of-record; analytics/export consumers cut over one at a time, parity-validated.

Answers the brief: the "split OLTP / OLAP" direction — done after the redesign, so a clean model is projected, not a broken one migrated.

Infra cost

€0–800 / mo

Team

2–4 engineers

Timeline

+2–4 months

AI ceiling

Medium–High

ETL → Data Platform

10 / 16

Act 3 · Prescribe

Case 3 — AI-Native Data Product Case 2 + lakehouse + ML platform

When this is you: AI products are the strategy — price intelligence, ML dedup, fraud, semantic search, an LLM query layer. Well-funded; 5+ engineers including ML / MLOps.

Kafka

→

BronzeIceberg · raw · append-only

→

Silverentity model · time-travel

→

Serving · Feature store · Vector store

→

AI products

Technical approach

Lakehouse: Apache Iceberg tables on object storage — Bronze (raw, append-only) → Silver (entity model) → Gold (serving projections). Reprocessable raw makes a model/taxonomy change cheap.
Time-travel: Iceberg snapshots give ACID + query AS OF a past state → point-in-time-correct training data, essential for trustworthy price models.
Streaming: Flink (or Spark Structured Streaming) maintains Silver continuously from Kafka — the entity model is never "rebuilt," it is always current.
Feature store (Feast): offline store = lakehouse, online store = Redis; point-in-time joins eliminate train/serve skew.
Vector store (Qdrant / pgvector): vehicle embeddings → semantic "find similar" and an LLM query layer over the catalogue.

A strategic, funded commitment — not a technical default. Decide the AI strategy first; then build the platform.

Infra cost

€1–5k / mo

Team

5+ incl. ML

Timeline

+6–12 months

AI ceiling

Full

ETL → Data Platform

11 / 16

Act 4 · Decide

The three, side by side

Dimension	Case 1 — Stabilize	Case 2 — Platformize	Case 3 — AI-Native
Hits the 4h SLA	Yes — structurally	Yes (via Case 1)	Yes (continuous re-derive)
Recurring cost / month	€0	€0–800	€1,000–5,000
Team	1–2 engineers	2–4 engineers	5+ incl. ML / MLOps
Timeline	~2–3 months	+2–4 months	+6–12 months
Iteration velocity	Cheap re-derive	Incremental via CDC	Continuous, near-free
AI ceiling	Low (AI-ready in principle)	Medium–high (on-ramp)	Full (products delivered)
Operational complexity	Low — pure Postgres	Medium — +2 systems	High — platform + ML
Risk	Low	Medium	Higher (staged to mitigate)

Cost, team, AI ceiling and complexity rise together. The one row that does not change: all three hit the SLA — because all three contain Case 1.

ETL → Data Platform

12 / 16

Act 4 · Decide

What each case actually costs — and how the number is built

Cost component	Case 1 — Stabilize	Case 2 — Platformize	Case 3 — AI-Native
New infrastructure	None — reuses the existing hardware	1 ClickHouse + 1 Kafka node (central, serves all 15 countries)	+ object storage, stream compute, feature & vector stores
Software licences	€0 — open-source throughout	€0 — open-source	€0 — open-source
Recurring / month	€0	€0 self-hosted → ~€800 cloud-hosted	€1,000–5,000 (storage + stream + training compute)
One-off engineering	1 eng × 2–3 mo ≈ €16–24k	+ 2–4 eng × 2–4 mo ≈ €40–90k	+ 5+ eng × 6–12 mo — a strategic programme
3-year TCO (indicative)	≈ €20k	≈ €60–145k	Strategic-budget scale (people-dominated)

WHY €0 INFRA — CASE 1

Runs on the existing 38-vCPU servers; PostgreSQL + DuckDB are open-source. The only cost is one-off engineering — there is no recurring spend.

WHY ONE NODE — CASE 2

The mart compresses to a few hundred GB columnar — so a single ClickHouse deployment serves all 15 countries. Cost scales with reads, not with country count.

Assumes ≈ €8k loaded cost per engineer-month. Recurring range spans self-hosted on existing capacity vs. cloud-managed. All software open-source; all data stays in the EU.

The cost ladder mirrors the value ladder — and Case 1, the unconditional move, is an order of magnitude cheaper than the rest, with €0 recurring cost.

ETL → Data Platform

13 / 16

Act 4 · Decide

Do Case 1 now. Decide the rest on evidence.

Now — unconditional

Build Case 1.

Hits the SLA in weeks
Solves iteration velocity
€0 infrastructure cost
Operable across all 15 countries
It is the first phase of Cases 2 & 3 — never wasted

Later — on evidence

Case 2 — when measurement shows analytics is genuinely slow, or taxonomy iteration is throttled.
Case 3 — when AI products are a funded, staffed strategic commitment.

What I would not do — the brief's implied move — is pick a target engine first. That commits to a destination before the route is known, and migrates an unfixed model.

Build the foundation now; let measured analytics load and a funded AI strategy decide how far you grow it.

ETL → Data Platform

14 / 16

Act 5 · De-risk

Phased — and every phase reversible

Phase	Delivers	Gate / note
0 · Measure	Profile the pipeline; confirm open questions (below)	Evidence, not guesses
1 · Split + DuckDB	Case 1 — the SLA	Exit gate: VW backfill <4h, output bit-identical
2 · Shift-left	Case 1 — precompute, fold the price log	Parity-validated
3 · Entity remodel	Case 1 — model behind compatibility views	Dual-run until parity holds
4 · Serving + CDC	Case 2 (if chosen)	Additive — consumers migrate one by one
5 · Lakehouse + ML	Case 3 (if chosen)	Staged, each stage valuable

Rollback & metrics

Old pipeline behind a feature flag · the rewrite targets the identical ads_<make> schema · dual-run until parity. Watch: backfill runtime · iteration velocity (cost to re-derive after a taxonomy change) · row/checksum parity · analytics p95.

What Phase 0 confirms

How often a real backfill happens · why price_first/last are NULL (fixing it upstream removes the 3B-row scan) · whether analytics is measurably slow (the Case 2 trigger) · the AI strategy (the Case 3 trigger).

ETL → Data Platform

15 / 16

Build the foundation now —
decide the rest deliberately

The 16-hour job is a symptom. The real problem is a pipeline shape and a data model that freeze your product. Fixing it is a redesign — not a new database.

Case 1 now — hits the SLA in weeks, €0 infrastructure, the foundation for everything after.
Then let evidence decide how far to grow it.

Thank you — happy to go deeper anywhere.

ETL → Data Platform

16 / 16

From Batch Pipelineto Data Platform

Build the foundation now —decide the rest deliberately

From Batch Pipeline
to Data Platform

Build the foundation now —
decide the rest deliberately