Thryve · Lead Data Engineer Case Study

Real-Time Pipeline for Wearable Health Data

From MySQL to a queryable analytical store in <5 minutes — built lean for a startup, designed to scale to 1B+ rows.

Scope: CDC architecture · Data modeling · Failure handling · Cost & tradeoffs

01 · Problem Framing 2 / 16

What we're actually solving

Mirror MySQL → analytical DB in near real-time. The constraints drive every choice.

Source MySQL — 6 tables, B2B multi-tenant

Table	Volume	Profile
`epoch_data`	150M, +2M/day → 1B+	High freq · bursty
`end_user`	500K	Low freq · PII
`app`	230	Tenant config
`epoch_data_type`	500	Reference
`data_source`	50	Reference
3 reference tables	<100 each	Static

Constraints that shape the design

Burst syncs. One user offline a week → ~15K rows in seconds. Pipeline must absorb spikes without lag bleed.
5-min freshness. Rules out hourly batch. Rules in CDC + streaming.
Updates & deletes propagate. Can't poll on updated_at (column doesn't exist; also misses deletes). Binlog CDC is mandatory.
Scale 150M → 1B+. Columnar OLAP, not OLTP. Compression matters a lot here.
Multi-tenant (230 apps). Tenant pruning needs to be in the storage layout.
Likely regulated. Health data (GDPR — sample user is in Berlin). Deletion must be real, not soft.

02 · Decision Summary 3 / 16

The stack in one slide

Open-source, low-ops, ~$200/mo to start. Every paid piece earns its keep.

Layer	Choice	Why (one sentence)
Change capture	Debezium MySQL connector	De-facto OSS standard; reads binlog; emits inserts/updates/deletes uniformly; resumes from offset on crash.
Transport / buffer	Apache Kafka (or Redpanda)	Absorbs burst syncs; decouples source from sink; gives us replay if ClickHouse needs a rebuild.
Analytical DB	ClickHouse	Columnar, 10× compression on sensor data, native Kafka engine, `ReplacingMergeTree` handles CDC merges, $0 license.
Aggregations	ClickHouse Materialized Views	Incremental on insert — no separate stream processor (no Flink) needed at this stage.
Raw archive	S3 or Cloudflare R2	Kafka tiered storage / sink connector → cheap long-term backup + AI training source later.
Observability	Prometheus + Grafana	Free, well-supported by Debezium/Kafka/ClickHouse exporters. Lag-as-SLO is the headline metric.
Schema registry	Apicurio (OSS) — deferred	Add when schema evolution becomes painful; not blocking on day one.

The "why not Snowflake / BigQuery" question: both work, both blow the budget at startup volume for the streaming-ingest write pattern. We can always promote ClickHouse → a lakehouse later; the reverse is painful.

03 · Architecture 4 / 16

End-to-end data flow

Six moving parts. Each component has one job and a clear failure mode.

04 · Why ClickHouse 5 / 16

The analytical DB choice carries most of the design

It's the one decision that's hardest to reverse, so it gets the deepest justification.

What ClickHouse gives us

Columnar + 10× compression on sensor data (mostly repeating LONG/DOUBLE columns). 1B rows ≈ 30–80 GB on disk.
Native Kafka engine. No separate ETL process — Kafka topic → MV → MergeTree, all in CH.
ReplacingMergeTree handles CDC merge semantics with a single version column. No external orchestration needed for updates/deletes.
Materialized Views are incremental. Aggregates update on insert, not on schedule.
TTL. Tier raw → aggregates → archive without writing batch jobs.
Open source & self-hostable. $0 license, runs on a single VM until it doesn't.

What it costs us (honest)

Eventual-consistency reads. ReplacingMergeTree merges in background — point-in-time queries need FINAL or a wrapper view.
JOINs are weaker than Snowflake/BigQuery. We mitigate by denormalizing fact tables.
Schema changes are more manual than warehouse equivalents.
Ops if self-hosting at scale — Zookeeper/Keeper, replication. Managed (Altinity, CH Cloud) for ~$300/mo when we outgrow one node.

Alternatives considered

Option	Verdict
Snowflake	Best DX. But streaming ingest at our write rate gets expensive fast (Snowpipe Streaming + warehouse credits). Not justified for a startup until query workload demands it.
BigQuery	Same shape as Snowflake. Storage is cheap; streaming inserts and on-demand query costs add up. Strong if already on GCP.
TimescaleDB / Postgres	Lovely for time-series <500M rows. Compression and query speed fall behind ClickHouse at 1B+. JOINs are easier — relevant if many small dim joins.
Apache Druid / Pinot	Excellent real-time OLAP, but heavier to operate (coordinators, brokers, historicals) and weaker ad-hoc SQL ergonomics.
Iceberg + Trino lakehouse	Strong for AI training corpus. Doesn't solve sub-5-min freshness alone (needs streaming on top). Good future tier, not the day-one tool.

Reversibility: ClickHouse → lakehouse later is a one-time backfill. The CDC + Kafka layer doesn't change. We're not locked in.

05 · Why CDC (not polling) 6 / 16

Debezium + Kafka is the only path that satisfies all four constraints

Polling can't handle deletes; trigger-based capture pollutes the source; CDC binlog wins.

Why binlog CDC, not polling

No updated_at column in the schema → polling has nothing reliable to filter on.
Polling misses deletes. If a user exercises GDPR right-to-erasure, polling never sees the missing row.
Polling misses intermediate states. If a row is updated twice between polls, downstream sees only the second state — analytics could miss a corrective edit.
Polling is hostile to a 150M-row table — full-scan queries pin MySQL CPU.
Binlog already exists for replication; reading it is cheap.

Why Kafka in the middle

Decouples source rate from sink rate. ClickHouse can stop for maintenance without blocking the source.
Buffers bursts (15K-row syncs amplified across users → still flat for Kafka).
Replay for free. Rebuild ClickHouse table? Rewind the consumer.
Fan-out for free. Same events feed CH + S3 archive + future Flink jobs.

Configuration that matters

# Debezium MySQL connector — the settings that matter
connector.class: io.debezium.connector.mysql.MySqlConnector
database.include.list: thryve
table.include.list: thryve.app, thryve.end_user, thryve.epoch_data, ...

# Snapshot once, then tail binlog (no table locks, uses InnoDB MVCC)
snapshot.mode: initial
snapshot.locking.mode: none

# Performance for the bursty epoch_data table
max.batch.size: 8192
max.queue.size: 32768
poll.interval.ms: 500

# Topic per table, keyed by PK → in-row ordering preserved
topic.prefix: cdc.thryve
message.key.columns: thryve.epoch_data:id

# Flatten Debezium envelope; mark deletes via __deleted=true
transforms: unwrap
transforms.unwrap.type: io.debezium.transforms.ExtractNewRecordState
transforms.unwrap.delete.handling.mode: rewrite

One subtle but important choice: Kafka topic key = source primary key. Same row's events always land on the same partition → ordering preserved within a row. We don't care about cross-row ordering for analytics.

06 · Per-Table Strategy 7 / 16

Stream vs. batch — yes, it differs per table

Same infrastructure, different consumption pattern. Snapshots are a failsafe, not a primary path.

Table	Primary path	Failsafe	CH engine	Reasoning
`epoch_data`	CDC stream	Daily incremental snapshot to S3 (Parquet)	`ReplacingMergeTree(_version)`	High volume, 5-min freshness is the whole point. Snapshot covers the "Kafka retention expired during outage" scenario.
`end_user`	CDC stream	Hourly full snapshot	`ReplacingMergeTree(_version)`	PII — needs deletes to propagate fast (GDPR). Small enough that hourly full snapshot is fine as backup.
`app`	CDC stream	Hourly full snapshot	`ReplacingMergeTree(_version)`	Tenant config — small, slow-changing, but critical for joins. Snapshot makes recovery trivial.
`epoch_data_type` `data_source`	CDC stream	Hourly full snapshot	`ReplacingMergeTree(_version)`	Reference data. Could be loaded once and forgotten, but streaming keeps it self-healing.
Reference tables (`generation_*`, `trustworthiness`)	Snapshot only	Manual reload	`Dictionary` (in-memory)	Static, <100 rows. Loaded as ClickHouse dictionaries for fast lookup — no need for streaming pipeline at all.

Why uniform-ish? The temptation is to "optimize" by snapshotting small tables in batch only. But: same Debezium connector covers all tables for $0 extra effort. Operating one pipeline pattern is cheaper than operating two. We use snapshots as belt-and-suspenders, not as a separate workflow.

07 · Data Modeling 8 / 16

Wide fact table + thin dimensions + materialized views

The `ORDER BY` key is the most important decision in the schema.

-- Fact mirror (denormalized app_id for tenant pruning)
CREATE TABLE epoch_data (
  id              UInt64,
  end_user_id     UInt32,
  app_id          UInt16,    -- denormalized
  data_type_id    UInt32,
  data_source_id  UInt16,
  start_ts        DateTime64(3, 'UTC'),
  end_ts          Nullable(DateTime64(3, 'UTC')),
  created_at      DateTime64(3, 'UTC'),
  double_value    Nullable(Float64),
  long_value      Nullable(Int64),
  boolean_value   Nullable(UInt8),
  string_value    Nullable(String),
  trustworthiness_id UInt8,
  _version        UInt64,       -- binlog pos
  _deleted        UInt8
)
ENGINE = ReplacingMergeTree(_version)
PARTITION BY toYYYYMM(start_ts)
ORDER BY (app_id, end_user_id, data_type_id, start_ts, id)
TTL toDate(start_ts) + INTERVAL 2 YEAR DELETE;

Why this ORDER BY? Every analytical query in Appendix D is per-user over a time range. (app_id, end_user_id, data_type_id, start_ts) means typical queries read 1 contiguous block instead of scanning. Including id guarantees uniqueness for dedup.

-- Incremental MV: daily resting HR per user
CREATE MATERIALIZED VIEW mv_daily_resting_hr
ENGINE = AggregatingMergeTree
PARTITION BY toYYYYMM(date)
ORDER BY (app_id, end_user_id, date)
AS SELECT
  app_id, end_user_id,
  toDate(start_ts) AS date,
  avgState(long_value) AS avg_hr,
  countState()         AS samples
FROM epoch_data
WHERE data_type_id = 3001   -- HeartRateResting
  AND _deleted = 0
GROUP BY app_id, end_user_id, date;

-- 90-day trend query — sub-second on billions of rows
SELECT date, avgMerge(avg_hr) AS resting_hr
FROM mv_daily_resting_hr
WHERE end_user_id = 42817 AND date >= today()-90
GROUP BY date ORDER BY date;

The value-column problem

5 nullable value cols, only one populated per row. We keep them as-is — analytical queries always filter by data_type_id, which fixes which column to read. Splitting into per-type tables would add JOINs without a real benefit.

08 · Updates & Deletes 9 / 16

CDC semantics on an append-oriented store

Three problems: late updates, hard deletes, and read-side correctness.

The mechanism

ReplacingMergeTree(_version) keeps only the latest version per ORDER BY key during background merges.
_version = MySQL binlog position (monotonic, unique per change). Comes for free from Debezium's source.ts_ms or GTID.
Inserts: written as new rows; old row gets superseded on next merge.
Deletes: Debezium's delete event → row with _deleted=1, also superseding any prior version.

The read-side trap

Merges are asynchronous. A naive SELECT can see both the old and new versions until the next merge.
Options, in order of preference:
1. Wrap in a view with argMax dedup — predictable cost, recommended for most queries.
2. SELECT ... FINAL — convenient, slower (forces merge at read time).
3. OPTIMIZE TABLE ... FINAL periodically — heavy, schedule off-peak.
For aggregates: use AggregatingMergeTree + *State functions. The aggregate is itself idempotent under replay.

-- Recommended pattern: a view that hides the dedup logic
CREATE VIEW epoch_data_current AS
SELECT
  id, end_user_id, app_id, data_type_id, data_source_id,
  start_ts, end_ts, created_at,
  argMax(double_value, _version)   AS double_value,
  argMax(long_value, _version)     AS long_value,
  argMax(boolean_value, _version)  AS boolean_value,
  argMax(_deleted, _version)       AS is_deleted
FROM epoch_data
GROUP BY id, end_user_id, app_id, data_type_id,
         data_source_id, start_ts, end_ts, created_at
HAVING is_deleted = 0;

GDPR hard-delete: ReplacingMergeTree only logically deletes. To physically remove on user erasure request, run ALTER TABLE epoch_data DELETE WHERE end_user_id = ? — this rewrites affected parts. Schedule monthly batch (or on-demand) for outstanding erasure requests. Same for end_user table.

Late-arriving data

Bursts of week-old measurements are not a problem because we partition by start_ts. The new rows just land in older partitions. Materialized views over those partitions will incrementally update on the new insertions automatically.

09 · Burst Syncs 10 / 16

Walking through a 15K-row burst

Where the latency goes, what saturates first, and what gives us headroom.

The path of a single burst

MySQL. Backend inserts ~15K rows over ~2 seconds (batched INSERTs). Binlog grows ~3 MB.
Debezium. Tails binlog. Reads & serializes events at >100K/s for narrow rows. Latency added: <1 sec.
Kafka. Producer batches; partitioned by id (round-robin across 24 partitions). Throughput: ~150K msg/s on a single broker. Latency: <1 sec.
ClickHouse Kafka engine. Consumes in batches (kafka_max_block_size = 65K). Inserts as 1–2 parts. Latency: 1–3 sec for batch close.
Materialized views fire on insert. Latency: ms.

End-to-end: ~5 seconds. Well inside the 5-min SLA, with 60× headroom.

What about 100 simultaneous bursts?

= 1.5M events. Still ~10 seconds end-to-end on a single-broker / single-CH-node setup. We saturate Debezium's binlog read first (~50–100K events/sec on commodity hardware). At sustained 2M+ msg/sec we'd add Debezium replicas or split connectors by table.

Knobs that matter

Knob	Default	Tune to
Debezium `max.batch.size`	2048	8192 — bigger batches, lower CPU
Kafka topic partitions (epoch_data)	1	24 — parallel consume
Kafka `linger.ms`	0	50 — better batching
CH `kafka_max_block_size`	1M	65K — predictable latency
CH `kafka_flush_interval_ms`	7.5s	2000 — tighter freshness
CH MergeTree `parts_to_throw_insert`	3000	monitor — bursts can create many small parts

The "too many parts" risk. ClickHouse merges parts in the background. If we ingest faster than it merges, inserts start to slow down or fail. Mitigation: cap CH consumer flush interval at 2s (forces decent batch sizes), and monitor system.parts. If we're consistently above ~300 parts per partition, switch to a buffered insert pattern or scale CH.

10 · Failure Handling 11 / 16

The failure modes we'll actually see

Detection comes from lag, not health checks. Recovery is mostly "wait for the buffer to drain."

Failure	Detection	Recovery	Likelihood
ClickHouse goes down	Kafka consumer lag > threshold	Kafka holds messages (7-day retention). Restart CH → consumer resumes from offset.	Medium
Debezium connector crash	Kafka Connect REST + producer rate = 0	Auto-restart via systemd. Resumes from last committed binlog offset. No data loss.	Medium
MySQL binlog purged before Debezium reads	Debezium "binlog not found" error	Worst case. Use Debezium incremental snapshot to backfill affected table without downtime.	Low (if retention ≥ 7d)
Schema change in MySQL	Debezium DDL event; CH insert fails on unknown column	Additive: `ALTER TABLE … ADD COLUMN`. Automate nullable adds; review breaking changes.	Medium
Bad/malformed value	CH parse error → dead-letter topic	Quarantine in `epoch_data_dlq`. Alert. Manual triage + reprocess.	Medium
Duplicate events (Debezium retry)	Identical `_version`	ReplacingMergeTree dedupes on next merge. No action needed.	High (at-least-once)
Kafka broker down	Debezium producer errors	Debezium pauses; binlog buffers naturally. Restart Kafka, resume.	Low
Too many parts in CH (sustained burst)	`system.parts` > threshold	Reduce flush interval, increase batch; scale CH (add shard).	Medium

The one metric that catches all of these: end-to-end freshness — max(now() - max(created_at)) per table. Page when > 5 min sustained. It's the SLA. Everything else is diagnostics.

11 · Cost 12 / 16

Startup-friendly: ~$200/mo to start, predictable scale-up

Every paid line item earns its keep. Managed services only where ops cost exceeds rent.

Phase 1 · 0–500M rows

Months 0–12. Single-node everything.

ClickHouse (Hetzner CCX23, 4 vCPU / 16GB / 240GB NVMe)	~$60
Kafka or Redpanda (single broker, same spec)	~$60
Debezium (small VM, 2 vCPU / 4GB)	~$20
S3 / R2 archive (~200 GB compressed)	~$5
Prometheus + Grafana (co-hosted)	$0
Backups (snapshots to R2, retention 30d)	~$20
Total	~$165 / mo

Phase 2 · 0.5B–5B rows

Months 12–36. Cluster CH, scale Kafka.

ClickHouse 3-node cluster (CCX33 each)	~$330
Kafka 3-broker (CCX23 each) + Keeper	~$240
Debezium 2-node HA	~$50
S3 / R2 archive (~1.5 TB)	~$25
Schema registry (Apicurio)	~$15
Monitoring + backups	~$60
Total	~$720 / mo

For comparison: managed warehouse

Same workload, Snowflake-ish pricing.

Snowpipe Streaming ingest (~2M/day → 60M/mo events)	~$400
Warehouse compute (XS, ~6h/day for ETL + queries)	~$500
Storage (~500 GB)	~$15
Confluent Cloud (Kafka, small cluster)	~$500
Debezium / Fivetran CDC (priced per row)	~$1500+
Total	~$2,900+ / mo

Same correctness, ~15× the bill. Easy call at this stage.

When to revisit: If the data team grows past ~10 analysts running concurrent ad-hoc workloads, or if AI training needs Spark-on-lakehouse, the cost calculus flips. We'd add a lakehouse tier on the side, not migrate off ClickHouse.

12 · Tradeoffs 13 / 16

What I gave up to get what I gave you

Every architectural choice is a trade. Here are the ones I'm most aware of.

I chose	Over	What I gave up	What I got
ClickHouse	Snowflake / BigQuery	Best-in-class JOINs · zero-ops scaling · richest ecosystem	15× cost reduction at this scale · faster ingest · self-host control
Debezium + Kafka	Fivetran / Airbyte / DMS	UI-driven config · vendor support	$0 license · full event control · replay · reusable for other sinks
CH Kafka engine	Flink / Spark Streaming	Complex stream joins · windowed aggregations across topics	One less system to operate · MVs cover 90% of needs
Wide denormalized fact	Strict star schema	Storage savings · dimensional purity	No JOIN at query time · pruning by `app_id` in fact table directly
ReplacingMergeTree	CollapsingMergeTree · external dedup	Read-side simplicity (need `FINAL` or view)	Simplest mapping from Debezium events · idempotent under replay
Self-hosted on VMs	Managed (CH Cloud, Confluent Cloud)	~10 hours/week of ops at scale	~70% cost reduction · no vendor lock-in
5-min SLA	Sub-second SLA	True real-time (Flink + push to product)	Meets stated requirement · sub-second is a feature we add when needed
No stream processor day-one	Flink for anomaly detection	Real-time alerting in CH itself is harder	One less moving part · revisit when product needs it

13 · Deferred 14 / 16

What I'm intentionally not building yet

Deferring is a design choice. These are the things I'd queue for the next quarter, in priority order.

Schema registry (Apicurio). Today: schemas are inline in Kafka messages. Becomes painful when 2+ consumers depend on stable shape. Add in 6 months.
dbt project on top of ClickHouse. Today: MVs encode the transformations. dbt becomes essential when the data team starts proposing models — adds versioning, tests, docs.
Data quality framework (Great Expectations or Soda). Today: handful of CH queries on a Grafana dashboard. Promote when we have >10 metrics worth tracking.
Flink for real-time anomaly detection. Today: CH can do windowed queries fast enough for <5-min alerts. Flink wins at sub-second.
Iceberg lakehouse for AI training. Today: archive Kafka → S3 Parquet hourly. Promote to Iceberg + Trino when ML needs reproducible point-in-time snapshots.

Multi-region HA. Today: single region. Add when we have customers requiring data residency or when uptime SLA needs >99.5%.
Fine-grained RBAC. Today: 2 roles (read, write). Add when data team has >5 people with different access scopes.
PII tokenization layer. Today: PII columns flow as-is to ClickHouse (encrypted at rest). For HIPAA/stricter compliance, tokenize at Debezium SMT layer.
Self-serve schema evolution. Today: data engineer reviews each MySQL DDL. Acceptable since the backend team is small. Automate when schema changes > weekly.
Cost telemetry. Today: monthly invoice review. Add query cost attribution per team when we have >3 consumers competing for warehouse time.

The principle: Build for the constraint you have, not the one you might have. Every deferred item is easier to add later than it would be to remove once it's in.

14 · Questions 15 / 16

What I'd ask you if I had more time

Real architectural answers depend on real numbers. These would refine the design substantially.

Operational reality

What's actual peak QPS on MySQL? Avg vs P99?
How is MySQL hosted — RDS, Aurora, self-managed? Affects binlog config & permissions.
What's the current binlog retention? Anything less than 7 days needs to change before we ship.
Do we have on-call engineers comfortable operating Kafka? If not, Redpanda or managed.

Compliance & trust

GDPR right-to-erasure SLA — hours or days?
Any HIPAA scope (US customers)? Changes encryption + audit posture.
Are access_token and email considered PII that must be tokenized at ingest?
Data residency — EU-only? Affects region/provider choice.

Consumer expectations

Who reads from the analytical DB? Notebooks, dashboards, in-product features?
What query latency SLA does the data team want? Sub-second matters; few seconds is comfortable for CH.
Are there existing tools we should integrate with (Looker, Metabase, dbt)?
Is real-time anomaly alerting a near-term product feature or a "nice to have"?

Scale & budget

What's the growth trajectory — 1B in 12 months, 24 months, or 60?
Budget ceiling for data infra this fiscal year?
How fast does the data team need new data sources added (other CRMs, billing, etc.)?
What does the AI team need — raw event stream, daily snapshots, or feature store?

Thryve · Lead Data Engineer Case Study

Thank you.

Happy to go deeper on any layer — data modeling, Kafka topology, ClickHouse internals, or the things I deferred.

Companion document: Technical deep-dive with full schemas, runbook, capacity math, and migration plan

Real-Time Pipeline for Wearable Health Data

What we're actually solving

Mirror MySQL → analytical DB in near real-time. The constraints drive every choice.

The stack in one slide

Open-source, low-ops, ~$200/mo to start. Every paid piece earns its keep.

End-to-end data flow

Six moving parts. Each component has one job and a clear failure mode.

The analytical DB choice carries most of the design

It's the one decision that's hardest to reverse, so it gets the deepest justification.

What ClickHouse gives us

What it costs us (honest)

Debezium + Kafka is the only path that satisfies all four constraints

Polling can't handle deletes; trigger-based capture pollutes the source; CDC binlog wins.

Why binlog CDC, not polling

Why Kafka in the middle

Stream vs. batch — yes, it differs per table

Same infrastructure, different consumption pattern. Snapshots are a failsafe, not a primary path.

Wide fact table + thin dimensions + materialized views

The ORDER BY key is the most important decision in the schema.

The value-column problem

CDC semantics on an append-oriented store

Three problems: late updates, hard deletes, and read-side correctness.

The mechanism

The read-side trap

Late-arriving data

Walking through a 15K-row burst

Where the latency goes, what saturates first, and what gives us headroom.

The path of a single burst

What about 100 simultaneous bursts?

The failure modes we'll actually see

Detection comes from lag, not health checks. Recovery is mostly "wait for the buffer to drain."

Startup-friendly: ~$200/mo to start, predictable scale-up

Every paid line item earns its keep. Managed services only where ops cost exceeds rent.

Phase 1 · 0–500M rows

Phase 2 · 0.5B–5B rows

For comparison: managed warehouse

What I gave up to get what I gave you

Every architectural choice is a trade. Here are the ones I'm most aware of.

What I'm intentionally not building yet

Deferring is a design choice. These are the things I'd queue for the next quarter, in priority order.

What I'd ask you if I had more time

Real architectural answers depend on real numbers. These would refine the design substantially.

Operational reality

Compliance & trust

Consumer expectations

Scale & budget

Thank you.

The `ORDER BY` key is the most important decision in the schema.