What we’re solving
Mirror a B2B multi-tenant MySQL source — 6 tables, with epoch_data at 150M rows growing +2M/day toward 1B+ — into an analytical store that stays queryable in under 5 minutes. Six constraints drove every downstream decision.
Why binlog CDC is mandatory
Polling on updated_at is impossible — the column doesn’t exist, and polling would miss deletes regardless. Binlog change-data-capture is the only approach that captures inserts, updates, and deletes uniformly and resumes cleanly from its offset after a crash.
The stack
| Layer | Choice | Why |
|---|---|---|
| Change capture | Debezium MySQL | Reads the binlog; emits insert / update / delete uniformly; resumes from offset |
| Transport / buffer | Apache Kafka (or Redpanda) | Absorbs burst syncs; decouples source from sink; replay for rebuilds |
| Analytical DB | ClickHouse | Columnar, ~10× compression, native Kafka engine, ReplacingMergeTree for CDC merges |
| Aggregations | ClickHouse materialized views | Incremental on insert — no separate stream processor (no Flink) needed yet |
| Raw archive | S3 / Cloudflare R2 | Cheap long-term backup and a future AI-training source |
| Observability | Prometheus + Grafana | Lag-as-SLO is the headline metric |
CDC semantics on a columnar store
ClickHouse is append-oriented, so CDC correctness needs care on three fronts: late updates, hard deletes, and read-side consistency. ReplacingMergeTree collapses row versions on a sort key, deletes propagate as tombstones, and materialized views stay correct because they aggregate incrementally on insert.
Cost & what I deferred
Open-source and low-ops: ~$200/mo to start, with managed services only where ops cost would exceed rent. A schema registry (Apicurio) and a dedicated stream processor (Flink) are deliberately deferred — added when schema evolution or aggregation complexity actually demands them, not on day one.