Real-Time Pipeline for Wearable Health Data

Mirror a multi-tenant MySQL source into a queryable analytical store in under 5 minutes — built lean for a startup (~$200/mo) but designed to scale from 150M to 1B+ rows. CDC architecture, data modeling, failure handling, and the trade-offs.

What we’re solving

Mirror a B2B multi-tenant MySQL source — 6 tables, with epoch_data at 150M rows growing +2M/day toward 1B+ — into an analytical store that stays queryable in under 5 minutes. Six constraints drove every downstream decision.

Why binlog CDC is mandatory

Polling on updated_at is impossible — the column doesn’t exist, and polling would miss deletes regardless. Binlog change-data-capture is the only approach that captures inserts, updates, and deletes uniformly and resumes cleanly from its offset after a crash.

The stack

Layer	Choice	Why
Change capture	Debezium MySQL	Reads the binlog; emits insert / update / delete uniformly; resumes from offset
Transport / buffer	Apache Kafka (or Redpanda)	Absorbs burst syncs; decouples source from sink; replay for rebuilds
Analytical DB	ClickHouse	Columnar, ~10× compression, native Kafka engine, `ReplacingMergeTree` for CDC merges
Aggregations	ClickHouse materialized views	Incremental on insert — no separate stream processor (no Flink) needed yet
Raw archive	S3 / Cloudflare R2	Cheap long-term backup and a future AI-training source
Observability	Prometheus + Grafana	Lag-as-SLO is the headline metric

CDC semantics on a columnar store

ClickHouse is append-oriented, so CDC correctness needs care on three fronts: late updates, hard deletes, and read-side consistency. ReplacingMergeTree collapses row versions on a sort key, deletes propagate as tombstones, and materialized views stay correct because they aggregate incrementally on insert.

Cost & what I deferred

Open-source and low-ops: ~$200/mo to start, with managed services only where ops cost would exceed rent. A schema registry (Apicurio) and a dedicated stream processor (Flink) are deliberately deferred — added when schema evolution or aggregation complexity actually demands them, not on day one.