| Act | What we cover | Outcome |
|---|---|---|
| 1 · Reframe | What the problem actually is — beyond "a slow batch job" | The real question |
| 2 · Diagnose | Why it is slow — and why it stays slow | Mechanical vs structural |
| 3 · Prescribe | Three architectures, nested — for three honest situations | Case 1 / 2 / 3 |
| 4 · Decide | Comparison, a calculated cost breakdown, and a recommendation | What to do now |
| 5 · De-risk | Migration, rollback, metrics, what Phase 0 confirms | How we ship it safely |
Three facts change the answer:
| First-time load (backfill) | Daily incremental | |
|---|---|---|
| Trigger | New brand / country; rebuild after a logic change | Scheduled daily run |
| Volume | The brand's entire history | 200K–5M rows / country |
| Write pattern | Pure INSERT into an empty mart | UPDATE-heavy upsert |
| The 16 hours | ◀ this one | Already fast |
ads_<make> is keyed one row per listing.
ads_<make> becomes a view over these entities — nothing downstream breaks.
| Dimension | Case 1 — Stabilize | Case 2 — Platformize | Case 3 — AI-Native |
|---|---|---|---|
| Hits the 4h SLA | Yes — structurally | Yes (via Case 1) | Yes (continuous re-derive) |
| Recurring cost / month | €0 | €0–800 | €1,000–5,000 |
| Team | 1–2 engineers | 2–4 engineers | 5+ incl. ML / MLOps |
| Timeline | ~2–3 months | +2–4 months | +6–12 months |
| Iteration velocity | Cheap re-derive | Incremental via CDC | Continuous, near-free |
| AI ceiling | Low (AI-ready in principle) | Medium–high (on-ramp) | Full (products delivered) |
| Operational complexity | Low — pure Postgres | Medium — +2 systems | High — platform + ML |
| Risk | Low | Medium | Higher (staged to mitigate) |
| Cost component | Case 1 — Stabilize | Case 2 — Platformize | Case 3 — AI-Native |
|---|---|---|---|
| New infrastructure | None — reuses the existing hardware | 1 ClickHouse + 1 Kafka node (central, serves all 15 countries) | + object storage, stream compute, feature & vector stores |
| Software licences | €0 — open-source throughout | €0 — open-source | €0 — open-source |
| Recurring / month | €0 | €0 self-hosted → ~€800 cloud-hosted | €1,000–5,000 (storage + stream + training compute) |
| One-off engineering | 1 eng × 2–3 mo ≈ €16–24k | + 2–4 eng × 2–4 mo ≈ €40–90k | + 5+ eng × 6–12 mo — a strategic programme |
| 3-year TCO (indicative) | ≈ €20k | ≈ €60–145k | Strategic-budget scale (people-dominated) |
Assumes ≈ €8k loaded cost per engineer-month. Recurring range spans self-hosted on existing capacity vs. cloud-managed. All software open-source; all data stays in the EU.
Build Case 1.
| Phase | Delivers | Gate / note |
|---|---|---|
| 0 · Measure | Profile the pipeline; confirm open questions (below) | Evidence, not guesses |
| 1 · Split + DuckDB | Case 1 — the SLA | Exit gate: VW backfill <4h, output bit-identical |
| 2 · Shift-left | Case 1 — precompute, fold the price log | Parity-validated |
| 3 · Entity remodel | Case 1 — model behind compatibility views | Dual-run until parity holds |
| 4 · Serving + CDC | Case 2 (if chosen) | Additive — consumers migrate one by one |
| 5 · Lakehouse + ML | Case 3 (if chosen) | Staged, each stage valuable |
Old pipeline behind a feature flag · the rewrite targets the identical ads_<make> schema · dual-run until parity. Watch: backfill runtime · iteration velocity (cost to re-derive after a taxonomy change) · row/checksum parity · analytics p95.
How often a real backfill happens · why price_first/last are NULL (fixing it upstream removes the 3B-row scan) · whether analytics is measurably slow (the Case 2 trigger) · the AI strategy (the Case 3 trigger).