Thryve · Lead Data Engineer Case Study

Real-Time Pipeline for Wearable Health Data

From MySQL to a queryable analytical store in <5 minutes — built lean for a startup, designed to scale to 1B+ rows.

Scope: CDC architecture · Data modeling · Failure handling · Cost & tradeoffs
01 · Problem Framing 2 / 16

What we're actually solving

Mirror MySQL → analytical DB in near real-time. The constraints drive every choice.

Source MySQL — 6 tables, B2B multi-tenant
TableVolumeProfile
epoch_data150M, +2M/day → 1B+High freq · bursty
end_user500KLow freq · PII
app230Tenant config
epoch_data_type500Reference
data_source50Reference
3 reference tables<100 eachStatic
Constraints that shape the design
  • Burst syncs. One user offline a week → ~15K rows in seconds. Pipeline must absorb spikes without lag bleed.
  • 5-min freshness. Rules out hourly batch. Rules in CDC + streaming.
  • Updates & deletes propagate. Can't poll on updated_at (column doesn't exist; also misses deletes). Binlog CDC is mandatory.
  • Scale 150M → 1B+. Columnar OLAP, not OLTP. Compression matters a lot here.
  • Multi-tenant (230 apps). Tenant pruning needs to be in the storage layout.
  • Likely regulated. Health data (GDPR — sample user is in Berlin). Deletion must be real, not soft.
02 · Decision Summary 3 / 16

The stack in one slide

Open-source, low-ops, ~$200/mo to start. Every paid piece earns its keep.

LayerChoiceWhy (one sentence)
Change capture Debezium MySQL connector De-facto OSS standard; reads binlog; emits inserts/updates/deletes uniformly; resumes from offset on crash.
Transport / buffer Apache Kafka (or Redpanda) Absorbs burst syncs; decouples source from sink; gives us replay if ClickHouse needs a rebuild.
Analytical DB ClickHouse Columnar, 10× compression on sensor data, native Kafka engine, ReplacingMergeTree handles CDC merges, $0 license.
Aggregations ClickHouse Materialized Views Incremental on insert — no separate stream processor (no Flink) needed at this stage.
Raw archive S3 or Cloudflare R2 Kafka tiered storage / sink connector → cheap long-term backup + AI training source later.
Observability Prometheus + Grafana Free, well-supported by Debezium/Kafka/ClickHouse exporters. Lag-as-SLO is the headline metric.
Schema registry Apicurio (OSS) — deferred Add when schema evolution becomes painful; not blocking on day one.
The "why not Snowflake / BigQuery" question: both work, both blow the budget at startup volume for the streaming-ingest write pattern. We can always promote ClickHouse → a lakehouse later; the reverse is painful.
03 · Architecture 4 / 16

End-to-end data flow

Six moving parts. Each component has one job and a clear failure mode.

SOURCE MySQL App backend (source of truth) binlog ROW format · 7d retention INGEST Debezium MySQL CDC connector snapshot + binlog tail TRANSPORT Kafka topic per source table · prefix cdc.thryve epoch_data 24 part. end_user 3 part. app 1 part. data_source 1 part. epoch_data_type 1 part. reference_* 1 part. key = primary key · 7d retention STORAGE & SERVE ClickHouse columnar OLAP · ReplacingMergeTree Kafka engine tables (consumers) ↓ on each batch Raw mirror tables app · end_user · epoch_data · dims Materialized views daily_resting_hr · weekly_activity user_data_quality · cohort_metrics binlog CDC events CONSUMERS Data & AI teams 90-day HR trends Cohort HRV studies Recovery score model training Grafana · Superset · Jupyter · dbt Product features Real-time HR anomaly alerts Daily recovery score Weekly activity digests via ClickHouse HTTP/native API Operations Data quality dashboards Delayed-sync % · user gaps Pipeline freshness SLO Grafana on ClickHouse + Prom S3 / R2 archive Raw Kafka events (Parquet) Failsafe replay source Future: AI training corpus Kafka Connect S3 Sink, hourly Cross-cutting Prometheus + Grafana Debezium lag · Kafka consumer lag CH ingestion rate · end-to-end freshness Alertmanager / PagerDuty Lag > 5 min → page DLQ rate > 0 → page · CH disk > 80% → warn Schema evolution Debezium captures DDL events Manual ALTER on CH for breaking changes
04 · Why ClickHouse 5 / 16

The analytical DB choice carries most of the design

It's the one decision that's hardest to reverse, so it gets the deepest justification.

What ClickHouse gives us

  • Columnar + 10× compression on sensor data (mostly repeating LONG/DOUBLE columns). 1B rows ≈ 30–80 GB on disk.
  • Native Kafka engine. No separate ETL process — Kafka topic → MV → MergeTree, all in CH.
  • ReplacingMergeTree handles CDC merge semantics with a single version column. No external orchestration needed for updates/deletes.
  • Materialized Views are incremental. Aggregates update on insert, not on schedule.
  • TTL. Tier raw → aggregates → archive without writing batch jobs.
  • Open source & self-hostable. $0 license, runs on a single VM until it doesn't.

What it costs us (honest)

  • Eventual-consistency reads. ReplacingMergeTree merges in background — point-in-time queries need FINAL or a wrapper view.
  • JOINs are weaker than Snowflake/BigQuery. We mitigate by denormalizing fact tables.
  • Schema changes are more manual than warehouse equivalents.
  • Ops if self-hosting at scale — Zookeeper/Keeper, replication. Managed (Altinity, CH Cloud) for ~$300/mo when we outgrow one node.
Alternatives considered
OptionVerdict
Snowflake Best DX. But streaming ingest at our write rate gets expensive fast (Snowpipe Streaming + warehouse credits). Not justified for a startup until query workload demands it.
BigQuery Same shape as Snowflake. Storage is cheap; streaming inserts and on-demand query costs add up. Strong if already on GCP.
TimescaleDB / Postgres Lovely for time-series <500M rows. Compression and query speed fall behind ClickHouse at 1B+. JOINs are easier — relevant if many small dim joins.
Apache Druid / Pinot Excellent real-time OLAP, but heavier to operate (coordinators, brokers, historicals) and weaker ad-hoc SQL ergonomics.
Iceberg + Trino lakehouse Strong for AI training corpus. Doesn't solve sub-5-min freshness alone (needs streaming on top). Good future tier, not the day-one tool.
Reversibility: ClickHouse → lakehouse later is a one-time backfill. The CDC + Kafka layer doesn't change. We're not locked in.
05 · Why CDC (not polling) 6 / 16

Debezium + Kafka is the only path that satisfies all four constraints

Polling can't handle deletes; trigger-based capture pollutes the source; CDC binlog wins.

Why binlog CDC, not polling

  • No updated_at column in the schema → polling has nothing reliable to filter on.
  • Polling misses deletes. If a user exercises GDPR right-to-erasure, polling never sees the missing row.
  • Polling misses intermediate states. If a row is updated twice between polls, downstream sees only the second state — analytics could miss a corrective edit.
  • Polling is hostile to a 150M-row table — full-scan queries pin MySQL CPU.
  • Binlog already exists for replication; reading it is cheap.

Why Kafka in the middle

  • Decouples source rate from sink rate. ClickHouse can stop for maintenance without blocking the source.
  • Buffers bursts (15K-row syncs amplified across users → still flat for Kafka).
  • Replay for free. Rebuild ClickHouse table? Rewind the consumer.
  • Fan-out for free. Same events feed CH + S3 archive + future Flink jobs.
Configuration that matters
# Debezium MySQL connector — the settings that matter
connector.class: io.debezium.connector.mysql.MySqlConnector
database.include.list: thryve
table.include.list: thryve.app, thryve.end_user, thryve.epoch_data, ...

# Snapshot once, then tail binlog (no table locks, uses InnoDB MVCC)
snapshot.mode: initial
snapshot.locking.mode: none

# Performance for the bursty epoch_data table
max.batch.size: 8192
max.queue.size: 32768
poll.interval.ms: 500

# Topic per table, keyed by PK → in-row ordering preserved
topic.prefix: cdc.thryve
message.key.columns: thryve.epoch_data:id

# Flatten Debezium envelope; mark deletes via __deleted=true
transforms: unwrap
transforms.unwrap.type: io.debezium.transforms.ExtractNewRecordState
transforms.unwrap.delete.handling.mode: rewrite
One subtle but important choice: Kafka topic key = source primary key. Same row's events always land on the same partition → ordering preserved within a row. We don't care about cross-row ordering for analytics.
06 · Per-Table Strategy 7 / 16

Stream vs. batch — yes, it differs per table

Same infrastructure, different consumption pattern. Snapshots are a failsafe, not a primary path.

Table Primary path Failsafe CH engine Reasoning
epoch_data CDC stream Daily incremental snapshot to S3 (Parquet) ReplacingMergeTree(_version) High volume, 5-min freshness is the whole point. Snapshot covers the "Kafka retention expired during outage" scenario.
end_user CDC stream Hourly full snapshot ReplacingMergeTree(_version) PII — needs deletes to propagate fast (GDPR). Small enough that hourly full snapshot is fine as backup.
app CDC stream Hourly full snapshot ReplacingMergeTree(_version) Tenant config — small, slow-changing, but critical for joins. Snapshot makes recovery trivial.
epoch_data_type
data_source
CDC stream Hourly full snapshot ReplacingMergeTree(_version) Reference data. Could be loaded once and forgotten, but streaming keeps it self-healing.
Reference tables
(generation_*, trustworthiness)
Snapshot only Manual reload Dictionary (in-memory) Static, <100 rows. Loaded as ClickHouse dictionaries for fast lookup — no need for streaming pipeline at all.
Why uniform-ish? The temptation is to "optimize" by snapshotting small tables in batch only. But: same Debezium connector covers all tables for $0 extra effort. Operating one pipeline pattern is cheaper than operating two. We use snapshots as belt-and-suspenders, not as a separate workflow.
07 · Data Modeling 8 / 16

Wide fact table + thin dimensions + materialized views

The ORDER BY key is the most important decision in the schema.

-- Fact mirror (denormalized app_id for tenant pruning)
CREATE TABLE epoch_data (
  id              UInt64,
  end_user_id     UInt32,
  app_id          UInt16,    -- denormalized
  data_type_id    UInt32,
  data_source_id  UInt16,
  start_ts        DateTime64(3, 'UTC'),
  end_ts          Nullable(DateTime64(3, 'UTC')),
  created_at      DateTime64(3, 'UTC'),
  double_value    Nullable(Float64),
  long_value      Nullable(Int64),
  boolean_value   Nullable(UInt8),
  string_value    Nullable(String),
  trustworthiness_id UInt8,
  _version        UInt64,       -- binlog pos
  _deleted        UInt8
)
ENGINE = ReplacingMergeTree(_version)
PARTITION BY toYYYYMM(start_ts)
ORDER BY (app_id, end_user_id, data_type_id, start_ts, id)
TTL toDate(start_ts) + INTERVAL 2 YEAR DELETE;
Why this ORDER BY? Every analytical query in Appendix D is per-user over a time range. (app_id, end_user_id, data_type_id, start_ts) means typical queries read 1 contiguous block instead of scanning. Including id guarantees uniqueness for dedup.
-- Incremental MV: daily resting HR per user
CREATE MATERIALIZED VIEW mv_daily_resting_hr
ENGINE = AggregatingMergeTree
PARTITION BY toYYYYMM(date)
ORDER BY (app_id, end_user_id, date)
AS SELECT
  app_id, end_user_id,
  toDate(start_ts) AS date,
  avgState(long_value) AS avg_hr,
  countState()         AS samples
FROM epoch_data
WHERE data_type_id = 3001   -- HeartRateResting
  AND _deleted = 0
GROUP BY app_id, end_user_id, date;

-- 90-day trend query — sub-second on billions of rows
SELECT date, avgMerge(avg_hr) AS resting_hr
FROM mv_daily_resting_hr
WHERE end_user_id = 42817 AND date >= today()-90
GROUP BY date ORDER BY date;

The value-column problem

5 nullable value cols, only one populated per row. We keep them as-is — analytical queries always filter by data_type_id, which fixes which column to read. Splitting into per-type tables would add JOINs without a real benefit.

08 · Updates & Deletes 9 / 16

CDC semantics on an append-oriented store

Three problems: late updates, hard deletes, and read-side correctness.

The mechanism

  • ReplacingMergeTree(_version) keeps only the latest version per ORDER BY key during background merges.
  • _version = MySQL binlog position (monotonic, unique per change). Comes for free from Debezium's source.ts_ms or GTID.
  • Inserts: written as new rows; old row gets superseded on next merge.
  • Deletes: Debezium's delete event → row with _deleted=1, also superseding any prior version.

The read-side trap

  • Merges are asynchronous. A naive SELECT can see both the old and new versions until the next merge.
  • Options, in order of preference:
    1. Wrap in a view with argMax dedup — predictable cost, recommended for most queries.
    2. SELECT ... FINAL — convenient, slower (forces merge at read time).
    3. OPTIMIZE TABLE ... FINAL periodically — heavy, schedule off-peak.
  • For aggregates: use AggregatingMergeTree + *State functions. The aggregate is itself idempotent under replay.
-- Recommended pattern: a view that hides the dedup logic
CREATE VIEW epoch_data_current AS
SELECT
  id, end_user_id, app_id, data_type_id, data_source_id,
  start_ts, end_ts, created_at,
  argMax(double_value, _version)   AS double_value,
  argMax(long_value, _version)     AS long_value,
  argMax(boolean_value, _version)  AS boolean_value,
  argMax(_deleted, _version)       AS is_deleted
FROM epoch_data
GROUP BY id, end_user_id, app_id, data_type_id,
         data_source_id, start_ts, end_ts, created_at
HAVING is_deleted = 0;
GDPR hard-delete: ReplacingMergeTree only logically deletes. To physically remove on user erasure request, run ALTER TABLE epoch_data DELETE WHERE end_user_id = ? — this rewrites affected parts. Schedule monthly batch (or on-demand) for outstanding erasure requests. Same for end_user table.

Late-arriving data

Bursts of week-old measurements are not a problem because we partition by start_ts. The new rows just land in older partitions. Materialized views over those partitions will incrementally update on the new insertions automatically.

09 · Burst Syncs 10 / 16

Walking through a 15K-row burst

Where the latency goes, what saturates first, and what gives us headroom.

The path of a single burst

  1. MySQL. Backend inserts ~15K rows over ~2 seconds (batched INSERTs). Binlog grows ~3 MB.
  2. Debezium. Tails binlog. Reads & serializes events at >100K/s for narrow rows. Latency added: <1 sec.
  3. Kafka. Producer batches; partitioned by id (round-robin across 24 partitions). Throughput: ~150K msg/s on a single broker. Latency: <1 sec.
  4. ClickHouse Kafka engine. Consumes in batches (kafka_max_block_size = 65K). Inserts as 1–2 parts. Latency: 1–3 sec for batch close.
  5. Materialized views fire on insert. Latency: ms.

End-to-end: ~5 seconds. Well inside the 5-min SLA, with 60× headroom.

What about 100 simultaneous bursts?

= 1.5M events. Still ~10 seconds end-to-end on a single-broker / single-CH-node setup. We saturate Debezium's binlog read first (~50–100K events/sec on commodity hardware). At sustained 2M+ msg/sec we'd add Debezium replicas or split connectors by table.

Knobs that matter
KnobDefaultTune to
Debezium max.batch.size20488192 — bigger batches, lower CPU
Kafka topic partitions (epoch_data)124 — parallel consume
Kafka linger.ms050 — better batching
CH kafka_max_block_size1M65K — predictable latency
CH kafka_flush_interval_ms7.5s2000 — tighter freshness
CH MergeTree parts_to_throw_insert3000monitor — bursts can create many small parts
The "too many parts" risk. ClickHouse merges parts in the background. If we ingest faster than it merges, inserts start to slow down or fail. Mitigation: cap CH consumer flush interval at 2s (forces decent batch sizes), and monitor system.parts. If we're consistently above ~300 parts per partition, switch to a buffered insert pattern or scale CH.
10 · Failure Handling 11 / 16

The failure modes we'll actually see

Detection comes from lag, not health checks. Recovery is mostly "wait for the buffer to drain."

Failure Detection Recovery Likelihood
ClickHouse goes down Kafka consumer lag > threshold Kafka holds messages (7-day retention). Restart CH → consumer resumes from offset. Medium
Debezium connector crash Kafka Connect REST + producer rate = 0 Auto-restart via systemd. Resumes from last committed binlog offset. No data loss. Medium
MySQL binlog purged before Debezium reads Debezium "binlog not found" error Worst case. Use Debezium incremental snapshot to backfill affected table without downtime. Low (if retention ≥ 7d)
Schema change in MySQL Debezium DDL event; CH insert fails on unknown column Additive: ALTER TABLE … ADD COLUMN. Automate nullable adds; review breaking changes. Medium
Bad/malformed value CH parse error → dead-letter topic Quarantine in epoch_data_dlq. Alert. Manual triage + reprocess. Medium
Duplicate events (Debezium retry) Identical _version ReplacingMergeTree dedupes on next merge. No action needed. High (at-least-once)
Kafka broker down Debezium producer errors Debezium pauses; binlog buffers naturally. Restart Kafka, resume. Low
Too many parts in CH (sustained burst) system.parts > threshold Reduce flush interval, increase batch; scale CH (add shard). Medium
The one metric that catches all of these: end-to-end freshness — max(now() - max(created_at)) per table. Page when > 5 min sustained. It's the SLA. Everything else is diagnostics.
11 · Cost 12 / 16

Startup-friendly: ~$200/mo to start, predictable scale-up

Every paid line item earns its keep. Managed services only where ops cost exceeds rent.

Phase 1 · 0–500M rows

Months 0–12. Single-node everything.

ClickHouse (Hetzner CCX23, 4 vCPU / 16GB / 240GB NVMe)~$60
Kafka or Redpanda (single broker, same spec)~$60
Debezium (small VM, 2 vCPU / 4GB)~$20
S3 / R2 archive (~200 GB compressed)~$5
Prometheus + Grafana (co-hosted)$0
Backups (snapshots to R2, retention 30d)~$20
Total~$165 / mo

Phase 2 · 0.5B–5B rows

Months 12–36. Cluster CH, scale Kafka.

ClickHouse 3-node cluster (CCX33 each)~$330
Kafka 3-broker (CCX23 each) + Keeper~$240
Debezium 2-node HA~$50
S3 / R2 archive (~1.5 TB)~$25
Schema registry (Apicurio)~$15
Monitoring + backups~$60
Total~$720 / mo

For comparison: managed warehouse

Same workload, Snowflake-ish pricing.

Snowpipe Streaming ingest (~2M/day → 60M/mo events)~$400
Warehouse compute (XS, ~6h/day for ETL + queries)~$500
Storage (~500 GB)~$15
Confluent Cloud (Kafka, small cluster)~$500
Debezium / Fivetran CDC (priced per row)~$1500+
Total~$2,900+ / mo

Same correctness, ~15× the bill. Easy call at this stage.

When to revisit: If the data team grows past ~10 analysts running concurrent ad-hoc workloads, or if AI training needs Spark-on-lakehouse, the cost calculus flips. We'd add a lakehouse tier on the side, not migrate off ClickHouse.
12 · Tradeoffs 13 / 16

What I gave up to get what I gave you

Every architectural choice is a trade. Here are the ones I'm most aware of.

I choseOverWhat I gave upWhat I got
ClickHouse Snowflake / BigQuery Best-in-class JOINs · zero-ops scaling · richest ecosystem 15× cost reduction at this scale · faster ingest · self-host control
Debezium + Kafka Fivetran / Airbyte / DMS UI-driven config · vendor support $0 license · full event control · replay · reusable for other sinks
CH Kafka engine Flink / Spark Streaming Complex stream joins · windowed aggregations across topics One less system to operate · MVs cover 90% of needs
Wide denormalized fact Strict star schema Storage savings · dimensional purity No JOIN at query time · pruning by app_id in fact table directly
ReplacingMergeTree CollapsingMergeTree · external dedup Read-side simplicity (need FINAL or view) Simplest mapping from Debezium events · idempotent under replay
Self-hosted on VMs Managed (CH Cloud, Confluent Cloud) ~10 hours/week of ops at scale ~70% cost reduction · no vendor lock-in
5-min SLA Sub-second SLA True real-time (Flink + push to product) Meets stated requirement · sub-second is a feature we add when needed
No stream processor day-one Flink for anomaly detection Real-time alerting in CH itself is harder One less moving part · revisit when product needs it
13 · Deferred 14 / 16

What I'm intentionally not building yet

Deferring is a design choice. These are the things I'd queue for the next quarter, in priority order.

  • Schema registry (Apicurio). Today: schemas are inline in Kafka messages. Becomes painful when 2+ consumers depend on stable shape. Add in 6 months.
  • dbt project on top of ClickHouse. Today: MVs encode the transformations. dbt becomes essential when the data team starts proposing models — adds versioning, tests, docs.
  • Data quality framework (Great Expectations or Soda). Today: handful of CH queries on a Grafana dashboard. Promote when we have >10 metrics worth tracking.
  • Flink for real-time anomaly detection. Today: CH can do windowed queries fast enough for <5-min alerts. Flink wins at sub-second.
  • Iceberg lakehouse for AI training. Today: archive Kafka → S3 Parquet hourly. Promote to Iceberg + Trino when ML needs reproducible point-in-time snapshots.
  • Multi-region HA. Today: single region. Add when we have customers requiring data residency or when uptime SLA needs >99.5%.
  • Fine-grained RBAC. Today: 2 roles (read, write). Add when data team has >5 people with different access scopes.
  • PII tokenization layer. Today: PII columns flow as-is to ClickHouse (encrypted at rest). For HIPAA/stricter compliance, tokenize at Debezium SMT layer.
  • Self-serve schema evolution. Today: data engineer reviews each MySQL DDL. Acceptable since the backend team is small. Automate when schema changes > weekly.
  • Cost telemetry. Today: monthly invoice review. Add query cost attribution per team when we have >3 consumers competing for warehouse time.
The principle: Build for the constraint you have, not the one you might have. Every deferred item is easier to add later than it would be to remove once it's in.
14 · Questions 15 / 16

What I'd ask you if I had more time

Real architectural answers depend on real numbers. These would refine the design substantially.

Operational reality

  • What's actual peak QPS on MySQL? Avg vs P99?
  • How is MySQL hosted — RDS, Aurora, self-managed? Affects binlog config & permissions.
  • What's the current binlog retention? Anything less than 7 days needs to change before we ship.
  • Do we have on-call engineers comfortable operating Kafka? If not, Redpanda or managed.

Compliance & trust

  • GDPR right-to-erasure SLA — hours or days?
  • Any HIPAA scope (US customers)? Changes encryption + audit posture.
  • Are access_token and email considered PII that must be tokenized at ingest?
  • Data residency — EU-only? Affects region/provider choice.

Consumer expectations

  • Who reads from the analytical DB? Notebooks, dashboards, in-product features?
  • What query latency SLA does the data team want? Sub-second matters; few seconds is comfortable for CH.
  • Are there existing tools we should integrate with (Looker, Metabase, dbt)?
  • Is real-time anomaly alerting a near-term product feature or a "nice to have"?

Scale & budget

  • What's the growth trajectory — 1B in 12 months, 24 months, or 60?
  • Budget ceiling for data infra this fiscal year?
  • How fast does the data team need new data sources added (other CRMs, billing, etc.)?
  • What does the AI team need — raw event stream, daily snapshots, or feature store?
Thryve · Lead Data Engineer Case Study

Thank you.

Happy to go deeper on any layer — data modeling, Kafka topology, ClickHouse internals, or the things I deferred.

Companion document: Technical deep-dive with full schemas, runbook, capacity math, and migration plan