← All case studies

Case study

Real-Time Pipeline for Wearable Health Data

Mirror a multi-tenant MySQL source into a queryable analytical store in under 5 minutes — built lean for a startup (~$200/mo) but designed to scale from 150M to 1B+ rows. CDC architecture, data modeling, failure handling, and the trade-offs.

Lead Data Engineer · case study· Architecture design
DebeziumApache KafkaClickHouseS3 / R2PrometheusGrafana
<5 min
source-to-query freshness
1B+
rows at target scale
~$200/mo
starting cost
~10×
compression on sensor data

What we’re solving

Mirror a B2B multi-tenant MySQL source — 6 tables, with epoch_data at 150M rows growing +2M/day toward 1B+ — into an analytical store that stays queryable in under 5 minutes. Six constraints drove every downstream decision.

Why binlog CDC is mandatory

Polling on updated_at is impossible — the column doesn’t exist, and polling would miss deletes regardless. Binlog change-data-capture is the only approach that captures inserts, updates, and deletes uniformly and resumes cleanly from its offset after a crash.

The stack

Layer Choice Why
Change capture Debezium MySQL Reads the binlog; emits insert / update / delete uniformly; resumes from offset
Transport / buffer Apache Kafka (or Redpanda) Absorbs burst syncs; decouples source from sink; replay for rebuilds
Analytical DB ClickHouse Columnar, ~10× compression, native Kafka engine, ReplacingMergeTree for CDC merges
Aggregations ClickHouse materialized views Incremental on insert — no separate stream processor (no Flink) needed yet
Raw archive S3 / Cloudflare R2 Cheap long-term backup and a future AI-training source
Observability Prometheus + Grafana Lag-as-SLO is the headline metric

CDC semantics on a columnar store

ClickHouse is append-oriented, so CDC correctness needs care on three fronts: late updates, hard deletes, and read-side consistency. ReplacingMergeTree collapses row versions on a sort key, deletes propagate as tombstones, and materialized views stay correct because they aggregate incrementally on insert.

Cost & what I deferred

Open-source and low-ops: ~$200/mo to start, with managed services only where ops cost would exceed rent. A schema registry (Apicurio) and a dedicated stream processor (Flink) are deliberately deferred — added when schema evolution or aggregation complexity actually demands them, not on day one.

Full deliverables

Have a data platform problem worth solving?

I help teams design real-time pipelines, cloud-native architecture, and AI that earns its keep.