From Batch Pipeline
to Data Platform

Three nested architectures for the car-listings data product — and a recommendation
ETL Pipeline Optimization  ·  Technical Presentation
Lot Internet — Final Interview  ·  2026-05-22
ETL → Data Platform
1 / 16
Agenda
The next 60 minutes
ActWhat we coverOutcome
1 · ReframeWhat the problem actually is — beyond "a slow batch job"The real question
2 · DiagnoseWhy it is slow — and why it stays slowMechanical vs structural
3 · PrescribeThree architectures, nested — for three honest situationsCase 1 / 2 / 3
4 · DecideComparison, a calculated cost breakdown, and a recommendationWhat to do now
5 · De-riskMigration, rollback, metrics, what Phase 0 confirmsHow we ship it safely
Please interrupt at any point — the most useful version of this is a conversation.
ETL → Data Platform
2 / 16
Act 1 · Reframe
The brief asks "which engine?" — I think that is the wrong question

Three facts change the answer:

The real question: not "which engine" — but how we turn a brittle batch reporting pipeline into a data-product platform. This still answers the brief in full — SLA, cost, migration — the reframe is the value added on top.
ETL → Data Platform
3 / 16
Act 1 · Reframe
16 hours = a one-time backfill, not the daily run
 First-time load (backfill)Daily incremental
TriggerNew brand / country; rebuild after a logic changeScheduled daily run
VolumeThe brand's entire history200K–5M rows / country
Write patternPure INSERT into an empty martUPDATE-heavy upsert
The 16 hours◀ this oneAlready fast
Proof by arithmetic: 16h for 5M rows ≈ 85 rows/second — impossibly slow for a daily run. 16h only makes sense as the full historical load.
What the slow backfill blocks: onboarding a country (≈50 backfills) · shipping a taxonomy improvement · disaster recovery. It is a tax on the product, not just a batch window.
ETL → Data Platform
4 / 16
Act 2 · Diagnose
The slowness is designed in
Mechanical — why 16h  (fix by engineering)
  • Single-threaded — uses 1 of 38 cores
  • One 16-hour transaction — freezes autovacuum DB-wide
  • Per-row UDF calls — tens of millions of them
  • 3-billion-row price-history scan, re-run per (make, portal)
  • Full 4-way join on every run
Structural — why it regenerates  (fix by redesign)
  • One pipeline doing two incompatible jobs
  • "One Big Table" — 80 columns, any change rewrites the row
  • Recompute, not precompute
  • Analytical work, row-by-row, in an OLTP engine
  • Listing-centric model — the business is vehicle-centric
Tuning fixes the left column — that is how a Postgres rewrite hits 4 hours. Only redesign fixes the right.
ETL → Data Platform
5 / 16
Act 2 · Diagnose — the core insight
Your data model describes adverts. Your business is about cars.
listing
mobile.de
listing
autoscout24
listing
relisted later
ONE physical vehicle

ads_<make> is keyed one row per listing.

  • The same physical car appears as many listings — across portals, relisted over time.
  • The brief says the mart "powers deduplication" — collapsing those back to one car. Yet today dedup is a downstream concern with no home in the schema.
  • Deduplication, price history, and every AI use case are vehicle-level — trapped in an advert-level table.
This is the difference between making the pipeline faster and making the company's data valuable.
ETL → Data Platform
6 / 16
Act 2 · Diagnose — the core insight
Model the domain as entities
dealerdimension · SCD2
◀ placed by───
listingone advert, one portal
N → 1───▶
VEHICLEthe resolved physical car
classified ▶───
mappingmake/model taxonomy
↓  its price history
price_eventprice stream + first / last / min / max
PAYOFF ①
Dedup is first-class
Assigning a vehicle_id to listings is entity resolution — it finally has a home.
PAYOFF ②
Analytics get narrow tables
A price change touches one row — not an 80-column rewrite.
PAYOFF ③
AI gets a clean substrate
Price models, recommendations, fraud — all read the entity model.

ads_<make> becomes a view over these entities — nothing downstream breaks.

Governance: dealer as a dimension is the natural control point for what PII reaches which consumer (GDPR).
ETL → Data Platform
7 / 16
Act 3 · Prescribe
Three architectures — nested, not three separate bets
CASE 3 — AI-Native Data Product
CASE 2 — Platformize
CASE 1 — Stabilize
Redesign in PostgreSQL · entity model · €0 infra — you do this regardless
+ columnar OLAP serving  ·  + CDC event backbone
+ lakehouse (Iceberg)  ·  + streaming  ·  + feature & vector stores
  • Case 1 is the foundation — done in every scenario.
  • Case 2 = Case 1 + a serving layer and an event backbone.
  • Case 3 = Case 2 + a lakehouse and ML infrastructure.
  • Because they nest, committing to Case 1 today forecloses nothing.
B vs C is a strategy-and-funding decision — made later, on evidence.
ETL → Data Platform
8 / 16
Act 3 · Prescribe
Case 1 — Stabilize all-PostgreSQL · €0 infra
When this is you: a small team, no budget for new infrastructure, the SLA must be fixed and costs flat. For many companies in your position — this is the whole answer.
Scrapers
PostgreSQLlanding + entity model + shift-left columns + price_state
SEED / SYNC jobsDuckDB transform · all-core · COPY + partition-ATTACH
ads_<make> VIEW→ consumers, unchanged
Technical approach
Answers the brief: SLA <4h structurally · incremental = thin delta merge · 4-way join eliminated · consumers unchanged (compatibility view).
Infra cost
€0
Team
1–2 engineers
Timeline
2–3 mo · SLA in 3–4 wk
AI ceiling
Low
ETL → Data Platform
9 / 16
Act 3 · Prescribe
Case 2 — Platformize Case 1 + serving layer + CDC
When this is you: analytics consumers are genuinely strained, you want to ship product improvements faster, moderate budget, AI is a real 12-month roadmap item.
PostgreSQLOLTP · system of record · entity model
ClickHouse / Doriscolumnar OLAP serving
analytics · BI · exports
Technical approach
Answers the brief: the "split OLTP / OLAP" direction — done after the redesign, so a clean model is projected, not a broken one migrated.
Infra cost
€0–800 / mo
Team
2–4 engineers
Timeline
+2–4 months
AI ceiling
Medium–High
ETL → Data Platform
10 / 16
Act 3 · Prescribe
Case 3 — AI-Native Data Product Case 2 + lakehouse + ML platform
When this is you: AI products are the strategy — price intelligence, ML dedup, fraud, semantic search, an LLM query layer. Well-funded; 5+ engineers including ML / MLOps.
BronzeIceberg · raw · append-only
Silverentity model · time-travel
Serving · Feature store · Vector store
AI products
Technical approach
A strategic, funded commitment — not a technical default. Decide the AI strategy first; then build the platform.
Infra cost
€1–5k / mo
Team
5+ incl. ML
Timeline
+6–12 months
AI ceiling
Full
ETL → Data Platform
11 / 16
Act 4 · Decide
The three, side by side
DimensionCase 1 — StabilizeCase 2 — PlatformizeCase 3 — AI-Native
Hits the 4h SLAYes — structurallyYes (via Case 1)Yes (continuous re-derive)
Recurring cost / month€0€0–800€1,000–5,000
Team1–2 engineers2–4 engineers5+ incl. ML / MLOps
Timeline~2–3 months+2–4 months+6–12 months
Iteration velocityCheap re-deriveIncremental via CDCContinuous, near-free
AI ceilingLow (AI-ready in principle)Medium–high (on-ramp)Full (products delivered)
Operational complexityLow — pure PostgresMedium — +2 systemsHigh — platform + ML
RiskLowMediumHigher (staged to mitigate)
Cost, team, AI ceiling and complexity rise together. The one row that does not change: all three hit the SLA — because all three contain Case 1.
ETL → Data Platform
12 / 16
Act 4 · Decide
What each case actually costs — and how the number is built
Cost componentCase 1 — StabilizeCase 2 — PlatformizeCase 3 — AI-Native
New infrastructureNone — reuses the existing hardware1 ClickHouse + 1 Kafka node (central, serves all 15 countries)+ object storage, stream compute, feature & vector stores
Software licences€0 — open-source throughout€0 — open-source€0 — open-source
Recurring / month€0€0 self-hosted → ~€800 cloud-hosted€1,000–5,000 (storage + stream + training compute)
One-off engineering1 eng × 2–3 mo ≈ €16–24k+ 2–4 eng × 2–4 mo ≈ €40–90k+ 5+ eng × 6–12 mo — a strategic programme
3-year TCO (indicative)≈ €20k≈ €60–145kStrategic-budget scale (people-dominated)
WHY €0 INFRA — CASE 1
Runs on the existing 38-vCPU servers; PostgreSQL + DuckDB are open-source. The only cost is one-off engineering — there is no recurring spend.
WHY ONE NODE — CASE 2
The mart compresses to a few hundred GB columnar — so a single ClickHouse deployment serves all 15 countries. Cost scales with reads, not with country count.

Assumes ≈ €8k loaded cost per engineer-month. Recurring range spans self-hosted on existing capacity vs. cloud-managed. All software open-source; all data stays in the EU.

The cost ladder mirrors the value ladder — and Case 1, the unconditional move, is an order of magnitude cheaper than the rest, with €0 recurring cost.
ETL → Data Platform
13 / 16
Act 4 · Decide
Do Case 1 now. Decide the rest on evidence.
Now — unconditional

Build Case 1.

  • Hits the SLA in weeks
  • Solves iteration velocity
  • €0 infrastructure cost
  • Operable across all 15 countries
  • It is the first phase of Cases 2 & 3 — never wasted
Later — on evidence
  • Case 2 — when measurement shows analytics is genuinely slow, or taxonomy iteration is throttled.
  • Case 3 — when AI products are a funded, staffed strategic commitment.
What I would not do — the brief's implied move — is pick a target engine first. That commits to a destination before the route is known, and migrates an unfixed model.
Build the foundation now; let measured analytics load and a funded AI strategy decide how far you grow it.
ETL → Data Platform
14 / 16
Act 5 · De-risk
Phased — and every phase reversible
PhaseDeliversGate / note
0 · MeasureProfile the pipeline; confirm open questions (below)Evidence, not guesses
1 · Split + DuckDBCase 1 — the SLAExit gate: VW backfill <4h, output bit-identical
2 · Shift-leftCase 1 — precompute, fold the price logParity-validated
3 · Entity remodelCase 1 — model behind compatibility viewsDual-run until parity holds
4 · Serving + CDCCase 2 (if chosen)Additive — consumers migrate one by one
5 · Lakehouse + MLCase 3 (if chosen)Staged, each stage valuable
Rollback & metrics

Old pipeline behind a feature flag · the rewrite targets the identical ads_<make> schema · dual-run until parity. Watch: backfill runtime · iteration velocity (cost to re-derive after a taxonomy change) · row/checksum parity · analytics p95.

What Phase 0 confirms

How often a real backfill happens · why price_first/last are NULL (fixing it upstream removes the 3B-row scan) · whether analytics is measurably slow (the Case 2 trigger) · the AI strategy (the Case 3 trigger).

ETL → Data Platform
15 / 16

Build the foundation now —
decide the rest deliberately

The 16-hour job is a symptom. The real problem is a pipeline shape and a data model that freeze your product. Fixing it is a redesign — not a new database.
Case 1 now — hits the SLA in weeks, €0 infrastructure, the foundation for everything after.
Then let evidence decide how far to grow it.

Thank you — happy to go deeper anywhere.
ETL → Data Platform
16 / 16