Technical Design Document

Real-Time Pipeline for Wearable Health Data

A streaming CDC pipeline from MySQL to ClickHouse — designed for <5-minute freshness, billion-row scale, and a startup-sized infrastructure budget.

Author: Lead Data Engineer candidate  ·  For: Thryve case study
Companion: Solution Presentation (15 slides, A4 landscape)
Status: Draft v1 · Open for review
Contents

Table of Contents

1Executive Summary3
2Problem Analysis4
3Architecture Overview6
4Component Deep Dive8
4.1 Debezium (Change Data Capture)8
4.2 Kafka (Transport & Buffer)10
4.3 ClickHouse (Analytical DB)11
4.4 Object Storage Archive13
4.5 Observability14
5Per-Table Strategy15
6Data Modeling16
7Handling Updates & Deletes19
8Burst Sync Handling21
9Schema Evolution22
10Failure Modes & Operational Runbook23
11Capacity Planning25
12Cost Analysis26
13Migration & Rollout Plan27
14Future Considerations28
15Open Questions for Stakeholders29
Section 1

Executive Summary

Thryve collects sensor data from consumer wearable devices and stores it in a MySQL database. As the data and AI teams grow, they need an analytical store that is updated in near real-time, supports the burst-write pattern characteristic of wearable syncs, and stays affordable while the company scales from 150M to 1B+ rows.

This document specifies a streaming Change Data Capture (CDC) pipeline built around three open-source components — Debezium, Apache Kafka, and ClickHouse — that delivers end-to-end freshness in 5–30 seconds (versus the 5-minute SLA), handles updates and deletes correctly via ReplacingMergeTree semantics, and costs roughly $165/month in Phase 1 growing to ~$720/month at 5B rows. The same workload on a managed warehouse stack (Snowflake + Confluent + Fivetran) would cost approximately 15× more at this stage.

Headline decisions

LayerChoiceRationale (one line)
Change captureDebezium MySQL connectorBinlog-based CDC — only path that propagates updates and deletes correctly without a source schema change.
TransportApache Kafka (Redpanda viable)Absorbs burst syncs, decouples source from sink, gives replay for free.
Analytical DBClickHouseColumnar, >10× compression on sensor data, native Kafka engine, $0 license.
AggregationClickHouse Materialized ViewsIncremental on insert — removes the need for a separate stream processor today.
ArchiveS3 / Cloudflare R2Cheap long-term backup of raw events; future AI training corpus.
ObservabilityPrometheus + GrafanaFree, with first-class exporters for every component in the stack.

What we explicitly defer

Several common pieces are not in the day-one design — not because they're wrong, but because they don't earn their complexity yet. Each is easier to add later than it would be to remove if added prematurely:

What I'd want to know before shipping

The architecture is robust to the constraints in the case study, but several stakeholder decisions could refine it materially — chiefly the GDPR erasure SLA, the actual peak write QPS on MySQL, and whether the data team needs sub-second or few-second query latency. These are catalogued in §15.

Section 2

Problem Analysis

2.1 Source-of-truth shape

The MySQL backend stores six logical entities, with one fact table (epoch_data) holding ~99.99% of the row count. The dimensions are tiny and slow-changing, dominated by a 500K-row end_user table. This shape — one heavy fact, several thin dimensions — is the textbook profile for a columnar analytical store, and it gets favorable treatment in every layer of the design.

TableRowsProfileImplications
epoch_data150M → 1B+High-frequency, burst writes (~15K rows per user sync), 2M/day baselineDrives all storage, ingest, and partitioning decisions.
end_user500KLow-frequency writes, PII-bearingNeeds GDPR-compliant deletion. Joins to fact via end_user_id.
app230Tenant configuration (B2B multi-tenant)Denormalized into fact for tenant pruning at query time.
epoch_data_type500Reference; rarely changesLoaded as ClickHouse dictionary for fast in-query lookup.
data_source50Reference; rarely changesSame as above.
generation_*, trustworthiness<100 eachStaticDictionary-backed.

2.2 The constraints that drive the design

Burst syncs

A wearable that's been offline for a week may sync ~15,000 epoch_data rows in a single request. These rows have old start_timestamp values (the measurement times) and a new created_at value (the sync time). The pipeline must:

Freshness within 5 minutes

This requirement rules out hourly micro-batch ETL but is comfortably satisfied by a streaming pipeline. In practice, the architecture below achieves end-to-end freshness of 5–30 seconds at normal load and 1–3 minutes under burst conditions. The 5-minute SLA gives operational headroom for transient ClickHouse maintenance or single-broker Kafka restarts.

Updates and deletes must propagate

This single requirement eliminates two otherwise-attractive options:

Binlog-based CDC is the only mechanism that satisfies all three correctness requirements (inserts, updates, deletes) at the required latency.

Scale: 150M → 1B+ rows

At 1B rows with the sensor-data profile (mostly small integers and floats, with high inter-row correlation), ClickHouse typically achieves >10× compression — putting on-disk size at 30–80 GB. This fits on a single node well past the 1B threshold, deferring sharding complexity to a later phase. Snowflake/BigQuery would handle this easily too; the question is cost, not feasibility (see §12).

Multi-tenancy (230 apps)

Most analytical queries are scoped to a single app (a Thryve customer). Tenant-scoping must be physically efficient in the storage layout — accomplished by including app_id as the first key in the fact table's ORDER BY, allowing ClickHouse to skip any data not belonging to the queried tenant.

Implicit constraint: regulated data

The case data shows a German user (Maria Schmidt, Europe/Berlin). Even without explicit mention, wearable health data in a European context is governed by GDPR. This affects the design in three concrete places: (a) deletes must propagate end-to-end and be physically removable on demand (§7); (b) PII columns like email and access_token warrant encryption at rest and access logging; (c) data residency may pin component deployment to EU regions, affecting cloud-provider choice. HIPAA scope (for any US customers) would tighten these further; this is one of the questions in §15.

2.3 Workload patterns the analytical DB must serve

Appendix D of the case enumerates representative queries. They cluster into three patterns:

Pattern A — Per-user time-series scans

"What is the 90-day trend of resting heart rate for user X?" These queries read a contiguous slice of epoch_data filtered by end_user_id, data_type_id, and a date range. The fact-table ORDER BY (app_id, end_user_id, data_type_id, start_timestamp) reduces these to localized, mostly-sequential reads — the cheapest pattern on ClickHouse.

Pattern B — Cohort aggregations

"How does average HRV compare across users with condition X?" These touch many users but typically a single data_type_id over a recent time window. Solved cleanly by materialized aggregates partitioned by month: mv_daily_hrv_per_user rolls epoch-level data into one row per user per day, after which cohort summaries are a few thousand rows of aggregation rather than millions of raw events.

Pattern C — Data-quality and operational queries

"Which users have HR gaps > 24h in the last week?" / "What % of rows arrive > 48h after measurement?" These are global scans over recent partitions. They are not user-scoped, so they don't benefit from the user-level ordering — but they only need recent partitions (small disk footprint) and the columnar layout means scanning only the columns they care about (start_timestamp, created_at, data_type_id) is cheap.

Pattern A and B together account for the vast majority of expected queries. Pattern C is operational and tolerates higher latency. The architecture optimizes for A and B, and accepts that C will use more compute per query.

2.4 What the analytical DB will not be doing

It is useful to be explicit about workloads the design does not need to support, to avoid carrying their costs:

The design principle behind these tradeoffs: ClickHouse is being chosen for what it does best — fast scans and aggregations over time-ordered columnar data — and the schema is shaped to keep queries in that sweet spot. The day a query workload emerges that fights ClickHouse's grain, the right answer is to add a complementary tier (a key-value cache, a lakehouse for ad-hoc JOINs), not to rebuild the foundation.
Section 3

Architecture Overview

The pipeline is six components arranged in a single linear flow, with a side branch to object-storage archive and a cross-cutting observability plane. Each component has a single job, a clear contract with its neighbors, and a well-understood failure mode.

MySQL source of truth binlog (ROW) Debezium MySQL connector snapshot + binlog Kafka topic per table keyed by PK · 7d retention ClickHouse Kafka engine ReplacingMergeTree + MVs binlog events consume Materialized views daily_resting_hr · weekly_activity user_data_quality · cohort_metrics S3 / R2 archive Parquet · hourly replay source + AI corpus Data & AI teams 90-day trends · cohort studies Recovery score training Grafana · Superset · Jupyter · dbt Product features HR anomaly alerts Daily recovery score · digests ClickHouse HTTP / native client Operations Pipeline freshness SLO · DLQ rate Data quality dashboards Grafana on ClickHouse + Prom Cross-cutting · Observability & alerting Prometheus + Grafana Debezium lag · Kafka consumer lag CH ingestion · freshness SLO Alertmanager / PagerDuty Lag > 5 min → page DLQ > 0 → page · disk > 80% → warn Schema evolution DDL events captured Manual ALTER on CH for breaks

Why this shape

The architecture is consciously linear. There is exactly one production path from each MySQL change to its representation in ClickHouse, and the path passes through systems that are individually simple, individually resilient, and well-understood by the operations community. The cost of this conservatism is that we forgo certain "exotic" optimizations — e.g., direct MySQL-to-ClickHouse replication (skipping Kafka) is possible via tools like Sinkhole or ClickHouse's MaterializedMySQL engine, but it sacrifices the replay-from-buffer capability that has saved us in nearly every incident on similar pipelines.

What flows where

MySQL → Debezium
Binary log (binlog) in ROW format. Debezium reads it via the MySQL replication protocol, presenting itself as a replica.
Debezium → Kafka
One topic per source table, prefixed cdc.thryve.. Each message keyed by the source primary key to preserve per-row ordering within a partition.
Kafka → ClickHouse
Pulled by ClickHouse's Kafka engine table. A materialized view inside ClickHouse moves rows from the Kafka engine into the persistent MergeTree.
Kafka → S3/R2
Kafka Connect S3 Sink writes the raw event stream as hourly Parquet files. Pure failsafe + future AI corpus, not in the critical path.
Observability
Every component exposes Prometheus metrics. End-to-end freshness is computed as max(now() - max(created_at)) per table and is the SLA dashboard's headline metric.

3.1 Pipeline phases over time

The architecture is intentionally staged so that the same blueprint scales from single-node Phase 1 to a 3-shard Phase 2 cluster without rework. The phasing is driven by row count and concurrent-user count, not by calendar time.

PhaseTriggerWhat changesWhat stays the same
Phase 1
0 – 500M rows
Day-one shipping volume Single-node ClickHouse · single-broker Kafka (Redpanda viable) · Debezium standalone Schemas · topic layout · MV definitions · monitoring queries
Phase 2
0.5B – 5B rows
CH disk > 60%, query latency drifting; or write-rate > 10K events/sec sustained 3-node ClickHouse with sharding by app_id · 3-broker Kafka · 2-node Debezium HA · add schema registry Logical schemas · query interface · per-table strategy
Phase 3
5B+ rows
AI workloads need raw event log; or consumer concurrency exceeds CH capacity Add Iceberg-on-S3 tier fed by Kafka Connect · introduce Trino for ad-hoc cross-domain JOINs · evaluate dbt Cloud The CDC layer (Debezium + Kafka) feeds both tiers — no source-side rework
The intent: No phase change requires rebuilding earlier work. Each phase adds capacity without invalidating the schemas, query patterns, or operational runbooks established in the prior phase.

3.2 What we explicitly do not include in the day-one architecture

Section 4

Component Deep Dive

4.1 Debezium · Change Data Capture

Debezium is the de-facto open-source CDC framework. The MySQL connector presents itself to MySQL as a replica (using a dedicated replication user), reads the binary log (binlog) in ROW format, and emits a structured event per change. The events include both before and after states for updates, and emit explicit delete events.

Why Debezium specifically

Critical configuration

# Debezium MySQL connector — production settings
name: thryve-mysql-cdc
connector.class: io.debezium.connector.mysql.MySqlConnector
tasks.max: 1             # single-threaded by design

# Source connection
database.hostname: mysql-replica.internal
database.port: 3306
database.user: debezium_cdc
database.password: ${file:/run/secrets/dbz-pw:value}
database.server.id: 184054   # unique among MySQL replicas
database.include.list: thryve
table.include.list: thryve.app, thryve.end_user, thryve.epoch_data,
                     thryve.epoch_data_type, thryve.data_source,
                     thryve.generation_type, thryve.generation_method,
                     thryve.trustworthiness

# Snapshot mode — initial snapshot, then tail binlog forever
snapshot.mode: initial
snapshot.locking.mode: none     # MVCC, no table locks
snapshot.fetch.size: 10000

# Throughput tuning for the bursty epoch_data table
max.batch.size: 8192
max.queue.size: 32768
poll.interval.ms: 500

# Topic naming & partitioning
topic.prefix: cdc.thryve
message.key.columns: thryve.epoch_data:id;thryve.end_user:id

# Flatten the Debezium envelope and surface delete tombstones.
# Without unwrap, every event is wrapped in {before, after, source, op}.
transforms: unwrap
transforms.unwrap.type: io.debezium.transforms.ExtractNewRecordState
transforms.unwrap.drop.tombstones: false
transforms.unwrap.delete.handling.mode: rewrite
transforms.unwrap.add.fields: op,source.ts_ms,source.pos,source.file

# Schema history (where DDL events are persisted)
schema.history.internal.kafka.bootstrap.servers: kafka:9092
schema.history.internal.kafka.topic: cdc.thryve.schema-history

MySQL prerequisites

Debezium requires the source MySQL instance to satisfy a few conditions. These should be verified before the connector is deployed:

What the events look like

After the unwrap SMT is applied, an event for an epoch_data insert looks roughly like this:

{
  "id": 991200,
  "end_user_id": 42817,
  "data_type_id": 3000,
  "data_source_id": 5,
  "start_timestamp": 1741003200000,
  "end_timestamp": 1741003260000,
  "long_value": 72,
  "created_at": 1741254720000,
  "timezone_offset": 60,
  "__op": "c",                         // c = create, u = update, d = delete
  "__source_ts_ms": 1741254720123,
  "__source_pos": 8742019,            // binlog position — our _version
  "__source_file": "binlog.000142",
  "__deleted": "false"
}

The crucial fields for downstream processing are __source_pos (used as the _version in ClickHouse's ReplacingMergeTree — monotonically increasing within a binlog file) and __deleted (set to "true" by the delete.handling.mode = rewrite SMT for tombstone events).

Failure modes specific to Debezium

FailureDetectionRecovery
Connector crashKafka Connect REST /connectors/{name}/status = FAILEDsystemd auto-restart; connector resumes from last committed binlog offset (no data loss).
MySQL binlog purged before Debezium reads itBinlogNotFoundException in connector logsManual: trigger incremental snapshot for affected tables via Kafka topic signal.
Schema change to non-nullable columnConnector pauses with schema errorManual: apply matching ALTER on ClickHouse, then resume.
MySQL replica failoverConnection reset; auto-reconnectWith GTID mode, automatic. Without GTID, requires manual offset reset.
Source DB password rotationAuth errors in logsUpdate connector secret; restart task.

4.2 Apache Kafka · Transport and Buffer

Kafka sits between Debezium and ClickHouse as a durable, replayable buffer. It is the component that absorbs burst syncs without backpressuring MySQL, that lets ClickHouse be restarted for maintenance without data loss, and that gives us replay if we ever need to rebuild a ClickHouse table from history.

Why a broker, not direct push

Debezium can write directly to many sinks via Debezium Server (Kinesis, Pulsar, HTTP). For a startup, the temptation to skip Kafka and go straight to ClickHouse is real — one less system to operate. The reasons we keep Kafka in the path anyway:

Redpanda as a drop-in

Redpanda is a Kafka-compatible broker written in C++ — single binary, no Zookeeper, lower memory footprint, often better tail latency. For Phase 1 (single-broker), Redpanda is meaningfully easier to operate and produces nearly identical guarantees. I'd recommend it as the starting point and migrate to Apache Kafka only if specific Kafka-ecosystem tools (KSQL, certain connectors) become needed.

Topic layout

TopicPartitionsRetentionReason
cdc.thryve.epoch_data247 daysHighest-volume table. 24 partitions allow parallel ClickHouse consume up to 24 consumer threads.
cdc.thryve.end_user314 daysLow volume but PII-relevant; longer retention to permit re-verification of GDPR deletes.
cdc.thryve.app130 daysTenant config; rarely changes, but want long replay window.
cdc.thryve.data_source
cdc.thryve.epoch_data_type
cdc.thryve.generation_*
cdc.thryve.trustworthiness
1 each30 daysReference data; one partition suffices, long retention is cheap given low volume.

Partitioning is by the source primary key. For epoch_data, this means events for the same row always land on the same partition, preserving in-row ordering. We do not partition by end_user_id because that would create skew (some users sync orders-of-magnitude more data than others), and because cross-row ordering is not required for correctness — ReplacingMergeTree resolves the final state from _version.

Producer / consumer settings

SettingValueReason
Producer acksallWait for all in-sync replicas — durability over latency.
Producer compression.typezstdBest compression ratio for repeated JSON keys.
Producer linger.ms50Allow micro-batching at the producer for better throughput.
Topic min.insync.replicas2Tolerate 1 broker failure without losing durability.
Consumer isolation.levelread_committedOnly read committed records — avoids reading transactional aborts.

4.3 ClickHouse · Analytical Database

ClickHouse is the most consequential choice in the architecture. It owns query latency, storage cost, and most of the operational complexity. Several alternatives could work — see §4.3.4 for the comparison — but ClickHouse aligns better than any of them with the specific shape of this workload: time-ordered, append-heavy, columnar-friendly, with a strong preference for predictable cost over elastic scale.

4.3.1 Why ClickHouse for this workload

4.3.2 What ClickHouse gives up

4.3.3 ClickHouse-specific configuration

A handful of settings are particularly important for our workload:

SettingValueReason
max_partitions_per_insert_block100Allow burst inserts to span multiple monthly partitions when a user syncs week-old data.
Kafka engine kafka_max_block_size65536Cap per-batch insert size to avoid "too many parts" errors during sustained load.
Kafka engine kafka_flush_interval_ms2000Tighter than default 7500ms — keeps freshness under 5s at normal load.
Merge thread counttune to ~50% of available coresBackground merges must keep up with insert rate to prevent part accumulation.
Codecs on time columnsCodec(Delta, ZSTD(3))Delta encoding shrinks monotonically-increasing timestamps to a few bits per row.
Codecs on ID columnsCodec(T64, ZSTD(3))T64 codec is highly effective for unsigned integers with narrow ranges.

4.3.4 Alternatives considered and why I didn't choose them

CandidateVerdict
Snowflake The best developer experience in the analytical-DB space. Its separation of storage from compute and excellent JOIN support are genuinely superior. The case against it for this workload is cost: Snowpipe Streaming inserts at our event rate, combined with the warehouse credits consumed by even modest concurrent queries, easily reaches $2–3K/month at 1B rows. That's defensible at a Series B startup but not a Seed/Series A one. Reconsider when the data team has >5 concurrent users.
BigQuery Similar shape to Snowflake — excellent for ad-hoc analytics, expensive for streaming ingest. Strongest if the team is already on GCP and would benefit from BigQuery's tight integration with the rest of the Google data stack. Same cost objection as Snowflake.
TimescaleDB / Postgres An attractive option up to roughly 500M rows of time-series data. Compression (via TimescaleDB's compressed chunks) is good but not at ClickHouse levels. The selling point is JOIN flexibility and the comfort of Postgres ergonomics. The breaking concern is query performance on the heaviest analytical patterns at 1B+ rows: per-user scans hold up, but cohort aggregations slow down measurably. Best choice if the team strongly prefers Postgres familiarity and is confident growth will be slow.
Apache Druid Purpose-built for real-time OLAP. Excellent ingestion throughput and sub-second queries at scale. Operationally significantly heavier — coordinator, brokers, historicals, middle managers, deep storage — and its SQL surface, while improved, is still less expressive than ClickHouse's. The case for Druid is strongest when sub-second query latency at high concurrency is the dominant requirement. Our 5-min freshness + few-second query SLA doesn't push us into that territory.
Apache Pinot Similar value proposition to Druid; some teams find it easier to operate. Same caveat applies — operational complexity exceeds what a startup data team typically wants to absorb day-one.
Iceberg + Trino / DuckDB The "modern lakehouse" answer. Strong for AI training workloads (Spark / Polars / DuckDB can all read Iceberg directly). Doesn't solve the 5-min freshness requirement on its own — needs a streaming layer feeding it. A natural Phase 3 addition, not a Phase 1 base.
MySQL read replica Trivial to set up but doesn't solve the analytical query problem. A scan over 1B rows of epoch_data is hours on row-oriented storage. Mentioned only to be ruled out explicitly — sometimes proposed by stakeholders unfamiliar with the OLTP / OLAP distinction.

4.3.5 The Kafka → ClickHouse pattern

ClickHouse consumes Kafka via a specific three-table pattern:

-- 1. The Kafka engine table acts as a consumer.
--    Reading from it consumes from Kafka, so it's not directly queried.
CREATE TABLE kafka_epoch_data (
  id UInt64,
  end_user_id UInt32,
  data_type_id UInt32,
  start_timestamp Int64,
  long_value Nullable(Int64),
  double_value Nullable(Float64),
  -- ... other value columns
  __source_pos UInt64,
  __deleted String
) ENGINE = Kafka
SETTINGS
  kafka_broker_list = 'kafka:9092',
  kafka_topic_list = 'cdc.thryve.epoch_data',
  kafka_group_name = 'ch_epoch_data_consumer',
  kafka_format = 'JSONEachRow',
  kafka_num_consumers = 8,
  kafka_max_block_size = 65536,
  kafka_flush_interval_ms = 2000;

-- 2. The persistent storage table (the one queries actually hit).
CREATE TABLE epoch_data (
  id UInt64,
  end_user_id UInt32,
  app_id UInt16,
  -- ... full schema (see §6)
  _version UInt64,
  _deleted UInt8
) ENGINE = ReplacingMergeTree(_version)
PARTITION BY toYYYYMM(start_ts)
ORDER BY (app_id, end_user_id, data_type_id, start_ts, id);

-- 3. The materialized view that moves rows from (1) to (2).
--    Each time the Kafka engine table receives a batch, this view's SELECT
--    runs against the batch and inserts the result into epoch_data.
CREATE MATERIALIZED VIEW mv_ingest_epoch_data TO epoch_data AS
SELECT
  id, end_user_id,
  dictGetUInt16('dim_end_user', 'app_id', end_user_id) AS app_id,
  data_type_id, data_source_id,
  toDateTime64(start_timestamp / 1000, 3, 'UTC') AS start_ts,
  toDateTime64(end_timestamp / 1000, 3, 'UTC') AS end_ts,
  long_value, double_value, boolean_value, string_value,
  __source_pos AS _version,
  if(__deleted = 'true', 1, 0) AS _deleted
FROM kafka_epoch_data;

The dictGetUInt16 call denormalizes app_id from the end_user dimension at write time, so queries never need to JOIN to retrieve it. This is the single most important denormalization in the schema — it converts an expensive JOIN into a free column lookup, at the cost of recomputing app_id on every insert (cheap, since the dictionary is in memory).

4.4 S3 / R2 · Raw Event Archive

The archive tier is not on the critical path — no production query reads from it — but it earns its place for three reasons:

  1. Disaster recovery. If both ClickHouse and Kafka are lost (worst-case multi-region outage, or a destructive operational error), we can rebuild the ClickHouse state from the archive without touching MySQL.
  2. Replay window extension. Kafka's retention is 7–30 days; the archive extends this to "forever, effectively." Useful for replaying months of history when introducing a new materialized view that needs to be backfilled.
  3. AI training corpus. Future ML models need stable, reproducible snapshots of event data — not the live mutating analytical store. A Parquet archive partitioned by date is the standard format for this.

Cloudflare R2 vs AWS S3

S3 is the obvious default. R2 is meaningfully cheaper for this use case because egress is free — and the archive will be read from many directions (ClickHouse for backfills, training jobs, ad-hoc analysts). At ~$0.015/GB/month storage and zero egress, R2 is roughly 1/3 the cost of S3 for a workload that reads as well as writes. The compatibility with the S3 API is essentially complete; the Kafka Connect S3 Sink works against R2 with no code change.

The argument against R2 is data-residency concerns (R2's regions are more limited than S3's) and the absence of certain advanced features (lifecycle policies, Glacier-style tiering). For Phase 1, neither applies meaningfully.

Sink configuration

name: s3-sink-epoch-data
connector.class: io.confluent.connect.s3.S3SinkConnector
topics: cdc.thryve.epoch_data, cdc.thryve.end_user, ...
s3.bucket.name: thryve-cdc-archive
s3.region: auto                          # R2 uses 'auto'
store.url: https://<account>.r2.cloudflarestorage.com

# Hourly partitioning, by event time not arrival time
flush.size: 100000
rotate.interval.ms: 3600000             # 1 hour
timestamp.extractor: RecordField
timestamp.field: source.ts_ms
path.format: "'topic='${topic}/'y='YYYY/'m='MM/'d='dd/'h='HH"
format.class: io.confluent.connect.s3.format.parquet.ParquetFormat
parquet.codec: snappy

Storage layout

Files land at paths like:

s3://thryve-cdc-archive/topic=cdc.thryve.epoch_data/
  year=2026/month=05/day=23/hour=14/
    cdc.thryve.epoch_data+12+0000000000412198192.parquet

The filename includes the topic, partition, and starting Kafka offset, making it trivial to reason about completeness. Parquet with Snappy compression typically gives ~5–8× compression on this event schema.

Lifecycle and cost

At 2M events/day with ~150 bytes per event compressed, the archive grows at ~10 GB/month. At Phase 2 burst rates (~10M events/day), that becomes ~50 GB/month. Even after 5 years, the archive is well under 5 TB — under $80/month at R2 list pricing. Lifecycle policies can move very-old data to colder tiers, but the economics rarely justify the operational complexity at our scale.

4.5 Observability · Prometheus, Grafana, Alertmanager

The observability stack is intentionally boring. Prometheus scrapes metrics from each component's built-in exporter; Grafana visualizes; Alertmanager routes pages. None of this is novel, all of it is well-supported by every component in the pipeline.

The headline metric: end-to-end freshness

The single most important metric is the end-to-end lag between when a row is written to MySQL and when it becomes queryable in ClickHouse. Implementation:

-- Run every 30 seconds via a Grafana → ClickHouse query
SELECT
  'epoch_data' AS table,
  now() - max(created_at) AS freshness_seconds
FROM epoch_data
WHERE created_at >= now() - INTERVAL 10 MINUTE;

This metric collapses every possible pipeline failure into one number. If MySQL is healthy, Debezium is healthy, Kafka is healthy, and ClickHouse is healthy, freshness will be 5–30 seconds. Anything else and freshness will trend upward, hit the 5-minute threshold, and page.

Layered diagnostic metrics

Freshness tells us that something is wrong; the layered metrics tell us what:

LayerMetricWhat it means when red
MySQLmysql_global_status_innodb_log_lsn_currentWrite rate. Rising fast = sustained heavy load on source.
Debeziumdebezium_metrics_milliseconds_behind_sourceHow far behind binlog Debezium is. Above ~1s = read bottleneck.
Debeziumdebezium_metrics_total_number_of_events_seenEvent throughput. Flat-lining = connector stuck.
Kafkakafka_consumer_lag (per consumer group)How many messages the consumer hasn't read yet. Rising = consumer can't keep up.
Kafkakafka_topic_partitions_under_replicatedReplication health. >0 = broker issue.
ClickHousesystem.parts count per partitionBackground merge health. >300 sustained = merges falling behind.
ClickHousesystem.replication_queue size (when replicated)Replica sync health.
ClickHouseInsertedRows per minuteInsert throughput. Flat-lining = Kafka consumer paused.

Alerting policy

Alerts that are not pages: anything that doesn't require action within 30 minutes. The 3 a.m. page should only fire for things that genuinely need a human in the next half-hour.

Section 5

Per-Table Strategy

Stream-versus-batch is the wrong question to ask per table; the right question is "which consumption pattern, knowing that the same Debezium connector reads all tables anyway." The infrastructure is shared, so the per-table decision is really about (a) how the data lands on the ClickHouse side, and (b) what failsafe we wrap around the streaming path.

Table Primary path Failsafe CH engine Reasoning
epoch_data CDC stream Daily incremental snapshot → Parquet on S3/R2 ReplacingMergeTree
(_version)
High volume; 5-min freshness is the whole point. Daily snapshot is the rebuild source if Kafka retention expires during a sustained outage.
end_user CDC stream Hourly full snapshot to R2 ReplacingMergeTree
(_version)
PII-bearing. Deletes must propagate quickly (GDPR). Small enough that an hourly full snapshot is cheap insurance.
app CDC stream Hourly full snapshot ReplacingMergeTree
(_version)
Tenant config — small, slow-changing, but joins from every query. Snapshot makes recovery trivial.
epoch_data_type
data_source
CDC stream Hourly full snapshot ReplacingMergeTree + ClickHouse Dictionary (in-memory) Reference data, used for in-query lookups via dictGetX. Dictionary auto-refreshed from the underlying MergeTree.
generation_type
generation_method
trustworthiness
Snapshot only Dictionary Static, <100 rows. Loaded as a ClickHouse dictionary, refreshed daily by a one-line cron'd query. No streaming overhead.

Why mostly uniform

The temptation, on reading the per-table sizes, is to optimize: stream the heavy table, batch the small ones. We resist that because:

Why ClickHouse Dictionaries for the truly static tables

The generation_* and trustworthiness tables are small enough (~100 rows total) that loading them as in-memory dictionaries is dramatically faster than even an indexed JOIN. Dictionary lookups are O(1) hash lookups, run in the same memory space as the query, and don't touch disk. The CDC pipeline still streams these tables (for $0 incremental cost), but queries use the dictionary form via dictGet('trustworthiness', 'name', trustworthiness_id).

Section 6

Data Modeling

The schema decisions are dominated by one question: what does the ORDER BY on epoch_data look like? Everything else in the model follows from getting that right.

6.1 The fact table

CREATE TABLE epoch_data (
  id                  UInt64,
  end_user_id         UInt32,
  app_id              UInt16,                       -- denormalized at ingest time
  data_type_id        UInt32,
  data_source_id      UInt16,
  third_party_ds_id   UInt16 DEFAULT 0,
  start_ts            DateTime64(3, 'UTC') CODEC(Delta, ZSTD(3)),
  end_ts              Nullable(DateTime64(3, 'UTC')) CODEC(Delta, ZSTD(3)),
  created_at          DateTime64(3, 'UTC'),    -- sync time

  double_value        Nullable(Float64)         CODEC(ZSTD(3)),
  long_value          Nullable(Int64)           CODEC(T64, ZSTD(3)),
  boolean_value       Nullable(UInt8),
  date_value          Nullable(DateTime64(3, 'UTC')),
  string_value        Nullable(String),

  generation_type_id  UInt8 DEFAULT 0,
  generation_method_id UInt8 DEFAULT 0,
  trustworthiness_id  UInt8 DEFAULT 0,
  medical_grade       UInt8 DEFAULT 0,
  chronological_exactness Int16,
  timezone_offset     Int16,
  measurement_id      UInt64,

  _version            UInt64,                       -- binlog position
  _deleted            UInt8 DEFAULT 0,
  _ingested_at        DateTime DEFAULT now()      -- for ops queries
)
ENGINE = ReplacingMergeTree(_version)
PARTITION BY toYYYYMM(start_ts)
ORDER BY (app_id, end_user_id, data_type_id, start_ts, id)
SETTINGS index_granularity = 8192
TTL toDate(start_ts) + INTERVAL 2 YEAR DELETE;

6.2 Why this ORDER BY

The ORDER BY tuple (app_id, end_user_id, data_type_id, start_ts, id) determines the physical layout of every part. ClickHouse stores rows in this order within each partition. The implications:

The case for not putting start_ts earlier in the key: a pure time-ordered key would optimize for "what happened across all users in the last hour" queries, at the cost of "what's this user's history" queries. The case's analytical question set leans heavily on the latter, so user-locality wins.

6.3 Partition strategy

Partitioning by toYYYYMM(start_ts) gives one partition per calendar month per shard. At 1B rows over 3–4 years that's ~30M rows per partition — a comfortable size for ClickHouse merge operations. The benefits:

6.4 The value-column question

The source has five nullable value columns (double_value, long_value, boolean_value, date_value, string_value), only one of which is populated per row based on the row's data_type_id. There are two reasonable approaches:

  1. Keep them as-is in one wide table. Analytical queries always know which column to read because they filter by data_type_id. Null columns compress essentially for free in a columnar store. Single table = simpler schema, fewer joins.
  2. Split into per-type tables (one for HR, one for Steps, etc.). Each table has just the relevant value column. Cleaner per-table, but queries that want multiple data types need UNIONs, and the schema has 15+ tables to operate.

We choose option 1 because (a) the null columns cost essentially nothing in storage, (b) ClickHouse's column-pruning means a query that only reads long_value never touches the other columns on disk anyway, and (c) the per-table-split approach makes cross-type queries (recovery score = f(HR, HRV, sleep)) painful.

6.5 Dimension tables

CREATE TABLE dim_end_user (
  id UInt32,
  app_id UInt16,
  pseudonym String,
  first_name String,
  last_name String,
  email String,
  date_of_birth Nullable(Date),
  timezone LowCardinality(String),
  created_at DateTime64(3, 'UTC'),
  _version UInt64,
  _deleted UInt8 DEFAULT 0
) ENGINE = ReplacingMergeTree(_version)
ORDER BY (app_id, id);

-- And the dictionary that the fact-table ingest MV uses to look up app_id
CREATE DICTIONARY dim_end_user_dict (
  id UInt32,
  app_id UInt16
)
PRIMARY KEY id
SOURCE(CLICKHOUSE(TABLE 'dim_end_user'))
LAYOUT(HASHED())
LIFETIME(MIN 300 MAX 600);                    -- refresh every 5–10 minutes

LowCardinality(String) is used for the timezone column because there are at most a few hundred distinct values across all users; LowCardinality stores them as small integer codes with a dictionary, saving substantial space and speeding up filters and grouping.

6.6 Materialized views (the pre-aggregation layer)

Materialized views in ClickHouse are insert-triggered — when rows are inserted into the underlying table, the view's SELECT runs against just the new batch and inserts the result into the view's storage. This is the mechanism we use to keep daily/weekly aggregations always-current with minimal overhead.

-- Daily resting HR per user (powers the "90-day trend" query)
CREATE MATERIALIZED VIEW mv_daily_resting_hr
ENGINE = AggregatingMergeTree
PARTITION BY toYYYYMM(date)
ORDER BY (app_id, end_user_id, date)
AS SELECT
  app_id,
  end_user_id,
  toDate(start_ts) AS date,
  avgState(long_value) AS avg_hr,
  minState(long_value) AS min_hr,
  maxState(long_value) AS max_hr,
  countState()         AS samples
FROM epoch_data
WHERE data_type_id = 3001          -- HeartRateResting
  AND _deleted = 0
GROUP BY app_id, end_user_id, date;

-- The 90-day-trend query, now sub-second on billions of rows
SELECT
  date,
  avgMerge(avg_hr) AS resting_hr,
  minMerge(min_hr) AS min_hr,
  maxMerge(max_hr) AS max_hr
FROM mv_daily_resting_hr
WHERE end_user_id = 42817
  AND date >= today() - 90
GROUP BY date
ORDER BY date;

The *State / *Merge pattern

The materialized view stores intermediate aggregation state (e.g., avgState stores both running sum and count), not the final value. This is necessary because the view receives data in batches, and a true running average can't be incrementally combined without preserving the underlying counts. At query time, avgMerge combines those partial states into the final answer.

This is also what makes burst-sync correctness work: when a user syncs week-old data, the rows arrive at the view, which computes partial aggregates for the old dates and merges them with the existing partial aggregates already stored. The query result is always consistent with the underlying fact table, regardless of arrival order.

Suggested materialized views to ship day-one

ViewPowers
mv_daily_user_metricsDaily summary per user: total steps, distance, calories, resting HR, sleep duration. The dashboard backbone.
mv_user_hrv_30d30-day rolling RMSSD per user. Powers HRV cohort comparisons and anomaly baselines.
mv_user_sleep_stages_dailyDaily total time in each sleep stage. Powers the "sleep quality vs activity" correlation queries.
mv_user_data_quality_dailyPer user per day: row count, time coverage, gap count. Powers operational data-quality dashboards.
mv_data_arrival_latencyHistogram of created_at - start_timestamp buckets. Powers the "% of rows arriving >48h late" operational query.

6.7 Time columns: epoch ms vs DateTime64

The source uses BIGINT milliseconds since epoch for all timestamps. We convert to DateTime64(3, 'UTC') at the ingest materialized view layer because:

The original epoch value is recoverable via toUnixTimestamp64Milli(start_ts) if any downstream consumer needs it in that form.

6.8 Query patterns to validate the model

Three queries from Appendix D, executed against the proposed model, with comments on access patterns:

-- Q1. "90-day trend of resting heart rate for user 42817"
--     Uses MV. Reads ~90 rows. Sub-second.
SELECT date, avgMerge(avg_hr) AS resting_hr
FROM mv_daily_resting_hr
WHERE end_user_id = 42817 AND date >= today() - 90
GROUP BY date ORDER BY date;

-- Q2. "Users with HR gaps > 24h in the last 7 days"
--     Uses mv_user_data_quality_daily. Reads ~3.5M rows (500K users × 7 days),
--     but each row is small. Few seconds.
SELECT end_user_id, max_gap_minutes
FROM mv_user_data_quality_daily
WHERE data_type_id = 3000
  AND date >= today() - 7
  AND max_gap_minutes > 1440;

-- Q3. "Cohort HRV: condition-X users vs general population"
--     The "cohort" filter is external (likely a SQL IN list of user IDs).
--     Reads from mv_user_hrv_30d. Sub-second per cohort.
SELECT
  if(end_user_id IN cohort_users, 'condition_x', 'general') AS cohort,
  avgMerge(avg_rmssd) AS avg_hrv,
  count() AS users
FROM mv_user_hrv_30d
WHERE date >= today() - 30
GROUP BY cohort;
Section 7

Handling Updates & Deletes

This section addresses the case study's explicit prompt: "How do you handle updates/deletes in an append-oriented system?" The short answer: ReplacingMergeTree with a monotonic version column from the binlog position, combined with a read-side wrapper view that materializes the latest version on demand. The long answer follows.

7.1 The mechanism: ReplacingMergeTree

ReplacingMergeTree(version_column) is a MergeTree variant that, during its background merge cycle, dedupes rows sharing the same ORDER BY key and keeps only the row with the highest version_column value. This is exactly the semantics we need: every UPDATE in MySQL becomes a new row in ClickHouse with a higher _version; the merge eventually collapses the old row away.

The choice of version column matters: it must be monotonically increasing for every change to a given row. Debezium provides several candidates; the binlog position (source.pos) is the natural fit because it's strictly increasing within a binlog file and Debezium guarantees in-order delivery per source row. For changes that cross binlog file rotations, we use a combined value: (file_number * 2^32) + pos (the binlog file numbers themselves are monotonic).

7.2 The mechanism: deletes via tombstones

When MySQL deletes a row, Debezium emits a delete event. With the delete.handling.mode = rewrite SMT, that event becomes a regular insert into the topic but with __deleted = "true". The ingest materialized view sets _deleted = 1 on the resulting ClickHouse row, which has the same ORDER BY key as the original.

Because the delete event always has a higher _version than the row it tombstones, ReplacingMergeTree will keep the _deleted = 1 version after merge. Queries then need to filter WHERE _deleted = 0 to exclude logically-deleted rows.

7.3 The read-side problem

The catch with ReplacingMergeTree is that merges are asynchronous. After an UPDATE, both the old and new versions of the row exist in the table until the next background merge runs (which can be seconds or minutes later, depending on data size and merge schedule). A naive SELECT will see both rows.

There are three ways to handle this, in increasing order of cost and decreasing order of convenience:

Option A: A wrapper view that does the dedup at query time

CREATE VIEW epoch_data_current AS
SELECT
  id, end_user_id, app_id, data_type_id, data_source_id,
  start_ts, end_ts, created_at,
  argMax(double_value, _version)   AS double_value,
  argMax(long_value, _version)     AS long_value,
  argMax(boolean_value, _version)  AS boolean_value,
  argMax(string_value, _version)   AS string_value,
  argMax(generation_type_id, _version) AS generation_type_id,
  argMax(trustworthiness_id, _version) AS trustworthiness_id,
  argMax(_deleted, _version)       AS is_deleted
FROM epoch_data
GROUP BY id, end_user_id, app_id, data_type_id, data_source_id,
         start_ts, end_ts, created_at
HAVING is_deleted = 0;

This is the recommended default. It hides the dedup logic from data-team queries entirely. The cost is the GROUP BY and argMax overhead, which is bounded — typically <2× the cost of a naive SELECT against the same range.

Option B: SELECT ... FINAL

The FINAL modifier tells ClickHouse to merge-on-the-fly at query time. Functionally equivalent to Option A in terms of result, but the implementation forces a sort+merge of every matching part, which can be considerably more expensive for queries that touch many parts.

I'd reserve FINAL for one-off queries where the convenience matters more than the cost; for production query paths, the wrapper view is more predictable.

Option C: OPTIMIZE TABLE ... FINAL

Manually trigger a full merge. The most expensive option (rewrites all affected parts) but produces the cleanest physical state — useful as a scheduled operation off-peak, especially after large backfills or batch deletes.

7.4 GDPR right-to-erasure: physical deletion

ReplacingMergeTree logically deletes rows but doesn't physically remove them from disk until partitions are dropped (via TTL or manual operation). For GDPR compliance, we need a way to physically delete a user's data on demand.

The implementation

-- Triggered when a user's erasure request is approved.
-- ALTER ... DELETE in ClickHouse is a lightweight mutation —
-- it rewrites affected parts in background.
ALTER TABLE epoch_data
DELETE WHERE end_user_id = 42817;

ALTER TABLE dim_end_user
DELETE WHERE id = 42817;

-- And from each materialized view's storage:
ALTER TABLE mv_daily_resting_hr
DELETE WHERE end_user_id = 42817;
-- (Repeat for each MV)

This pattern is wrapped in a single stored procedure (a SQL function or Python script) invoked from an internal "user erasure" admin tool. The procedure must also:

SLA

The mutation completes within minutes for a typical user (most of whose data is concentrated in recent partitions). Backfilling old archive files takes longer (~hours, since it requires scanning S3/R2 hourly Parquet files). I'd commit to 24-hour SLA for full physical erasure, with the SQL-layer deletion happening within 1 hour. Stakeholders may want a stricter SLA — see §15.

7.5 Why ReplacingMergeTree over CollapsingMergeTree

ClickHouse offers a sibling engine, CollapsingMergeTree, that handles updates by storing +1 ("new state") and -1 ("old state") rows that cancel on merge. It's arguably more theoretically clean but requires the writer to produce paired rows for every update. Debezium doesn't natively emit -1 events; doing this would require an SMT or a downstream processor.

ReplacingMergeTree is simpler: Debezium emits each row state independently, and ClickHouse handles the resolution. We lose the ability to do "delta math" (computing the change between successive versions of a row) but that's not a common analytical pattern here.

7.6 The duplicate-event corner case

Debezium has at-least-once delivery semantics. If the connector crashes after emitting an event but before committing the offset, that event will be re-emitted on restart. Our pipeline is naturally idempotent under this:

The pre-merge window allows brief over-counting in MV aggregates, but the steady-state is correct.

7.7 Late-arriving data

The burst-sync pattern produces a different kind of "update" — not a modification of an existing row but the arrival of new rows for old time periods. This is handled trivially by partitioning by start_ts (the measurement time, not the sync time): late rows land in historical partitions automatically, and the materialized views over those partitions incrementally update.

The only operational consideration is part count: a burst sync of a week of data lands in multiple historical partitions. The CH settings allow inserts to span up to 100 partitions per block (max_partitions_per_insert_block) to accommodate this.

Section 8

Burst Sync Handling

The constraint says: "A user offline for days syncs a week of data at once." With ~15,000 epoch_data rows in one burst, and potentially hundreds of users syncing simultaneously after, say, a regional connectivity event, the system needs to absorb spikes of 1M+ events in seconds without inducing pipeline lag for other tenants.

8.1 Walking through a single 15K-row burst

  1. MySQL. The backend writes the batch in one or many INSERT statements over ~2 seconds. The binlog grows by ~3 MB (assuming ~200 bytes per row in binlog format).
  2. Debezium. Reads the binlog at >100K events/sec for narrow rows. Adds <1 second of latency. Emits 15K Kafka messages.
  3. Kafka. Producer batches via linger.ms = 50; the 15K messages spread across 24 partitions of the epoch_data topic. Total Kafka latency: <1 second.
  4. ClickHouse Kafka engine. Consumes in batches up to kafka_max_block_size = 65,536. The 15K-row burst fits in one batch, inserted as 1–2 new parts. Insert latency: ~1–3 seconds (bounded by kafka_flush_interval_ms = 2000).
  5. Materialized views. Fire on each insert. Computing per-user daily aggregates over 15K rows is <100 ms.

End-to-end: ~5 seconds. Well inside the 5-min SLA, with 60× headroom.

8.2 The 100-simultaneous-bursts scenario

If a connectivity event causes 100 users to sync at once, the total is 1.5M events. The pipeline behaves as follows:

8.3 Sustained burst (the failure mode to actually worry about)

The genuinely concerning scenario is not a one-time burst but sustained high-rate writes — e.g., a viral marketing event that 50× our user count overnight. The bottlenecks we'd encounter in order:

  1. MySQL write capacity. Not our problem to solve at the pipeline level, but it's the first thing to break.
  2. Debezium throughput. Single-threaded per source. At ~50–100K events/sec sustained, we'd need to split connectors by table (e.g., a dedicated connector for epoch_data, another for everything else).
  3. ClickHouse merge backlog. If inserts outpace merges, part count grows. Recovery: throttle insert rate temporarily, scale CH vertically, or add a shard.

8.4 Key knobs

SettingDefaultOur valueWhy
Debezium max.batch.size20488192Bigger batches, less per-event CPU.
Kafka topic partitions (epoch_data)124Allows ClickHouse parallel consume up to 24 threads.
Kafka linger.ms050Better batching at the producer.
CH kafka_max_block_size1M65,536Predictable batch size avoids huge parts during bursts.
CH kafka_flush_interval_ms75002000Tighter latency at normal load.
CH parts_to_throw_insert3000(unchanged, monitor)Safety net; if hit, inserts fail loud.
CH max_partitions_per_insert_block100100Bursts of week-old data span multiple historical partitions.
Section 9

Schema Evolution

Schema changes are an unavoidable reality. The pipeline must handle them gracefully, even though we don't have a separate schema registry on day one.

9.1 The default contract

Debezium captures every DDL event in MySQL and emits it to a dedicated schema-history Kafka topic. The downstream behavior depends on the kind of change:

ChangeDefault behaviorAction required
ADD COLUMN (nullable) Debezium emits events with the new field; ClickHouse Kafka engine ignores unknown fields by default None on day one. To make the new column queryable, ALTER it onto the ClickHouse table and update the ingest MV.
ADD COLUMN (NOT NULL, with default) Same as above. The default value flows through Debezium events. Same as above.
DROP COLUMN Debezium stops including the field; existing CH column stays as-is Optional cleanup: ALTER DROP COLUMN on ClickHouse to reclaim storage.
RENAME COLUMN Treated as DROP + ADD by Debezium Manual: ALTER RENAME on ClickHouse, then redeploy ingest MV with new field name.
CHANGE TYPE (widening, e.g. INT → BIGINT) Debezium handles transparently if compatible; otherwise emits schema change event ALTER MODIFY COLUMN on CH if the type widens (cheap); if narrowing, requires migration plan.
ADD TABLE Debezium picks it up if it matches table.include.list pattern; otherwise ignored Add table to include list, restart connector, optionally trigger incremental snapshot.
DROP TABLE Existing data in CH stays; new events stop flowing Decide whether to retain history in CH or drop the mirror table.

9.2 Operational workflow

For day one (small team, controlled source changes), schema evolution is handled manually:

  1. Backend team announces the planned MySQL schema change in advance.
  2. Data engineer reviews — assesses whether it's additive (safe) or breaking (needs care).
  3. Backend deploys the schema change.
  4. Data engineer applies the corresponding ClickHouse ALTER and updates ingest MVs.
  5. Validate end-to-end freshness remains within SLA after deploy.

This works because the team is small enough that announcements happen in shared channels and changes are infrequent (weekly at most). Once the team grows or change frequency increases, this becomes a bottleneck — at which point we add:

9.3 The failure mode to watch

The most dangerous case is a MySQL schema change that looks safe but breaks downstream — e.g., changing a column from BIGINT to INT when some existing values exceed the new range. Debezium will emit events as the new type; ClickHouse insert will fail; the consumer will fall behind.

Mitigation: a DLQ (dead-letter queue) topic for failed inserts, with an alert on non-zero message count. The DLQ lets the pipeline keep moving for the other 99.99% of events while the offending change is investigated.

Section 10

Failure Modes & Operational Runbook

This section catalogues the failure modes we expect, in rough order of likelihood, with detection and recovery procedures for each. Most failures share a common detection signal — end-to-end freshness rising — and differ in what they require for recovery.

10.1 Most-likely failures

F1 · Duplicate events from Debezium retry

Likelihood: High (happens occasionally during normal operation).
Cause: Debezium has at-least-once delivery. If it crashes after emitting an event but before committing the Kafka offset, the event is re-emitted on restart.
Detection: No detection needed — system handles automatically.
Recovery: ReplacingMergeTree dedupes on next merge. No action required.

F2 · ClickHouse goes down (planned or unplanned)

Likelihood: Medium.
Cause: Restart, hardware failure, disk fill, OOM.
Detection: Kafka consumer lag rises; freshness alert fires after ~5 min.
Recovery: Kafka holds events (7-day retention on epoch_data, 30 days on dims). Restart ClickHouse; the Kafka engine resumes consuming from the saved offset. Bounded blast radius: 7 days of writes before the binlog itself is consumed.

F3 · Debezium connector crash

Likelihood: Medium.
Cause: OOM, MySQL connection loss, schema parsing error.
Detection: Producer rate from Debezium drops to zero; freshness alert fires.
Recovery: Auto-restart via systemd or Kubernetes. Connector resumes from last committed binlog offset (no data loss). If restart loop, check logs for schema errors (often the cause).

F4 · Schema change in MySQL not anticipated

Likelihood: Medium (we control the source, but accidents happen).
Cause: Backend deploys a schema change without coordinating; ClickHouse Kafka engine fails to parse the new event shape.
Detection: DLQ topic accumulates messages; freshness alert fires.
Recovery: Apply matching ALTER on ClickHouse, then re-process the DLQ. For non-trivial changes, this is a 15–60 minute incident.

F5 · Bad/malformed value

Likelihood: Medium.
Cause: Source data with values outside the expected range (e.g., NULL in a NOT NULL column, malformed UTF-8 in a string).
Detection: CH parse error → row routed to DLQ; alert on DLQ rate.
Recovery: Quarantine the affected row in epoch_data_dlq; investigate the source; manually correct or re-insert.

F6 · "Too many parts" in ClickHouse during sustained burst

Likelihood: Medium under heavy load, low at normal load.
Cause: Insert rate exceeds merge capacity; system.parts count per partition climbs past ~300; eventually insert fails with TOO_MANY_PARTS.
Detection: Grafana panel on system.parts; alert at 250 per partition.
Recovery: Reduce kafka_flush_interval_ms (already at 2000), increase kafka_max_block_size (already at 65K), throttle non-critical consumers, or scale ClickHouse (add a shard). If unrecoverable in the short term, pause Kafka consumer to give merges time to catch up.

10.2 Lower-likelihood failures

F7 · Kafka broker down (single broker setup)

Likelihood: Low if managed; medium if self-hosted single-broker.
Cause: Hardware failure, OOM, disk fill.
Detection: Debezium producer errors; freshness alert fires.
Recovery: Debezium pauses gracefully (binlog buffers naturally in MySQL — 7-day retention). Restart Kafka broker; everything resumes.
Prevention: Move to 3-broker cluster in Phase 2.

F8 · MySQL binlog purged before Debezium reads it

Likelihood: Low (assuming 7-day binlog retention).
Cause: Debezium offline for >7 days, or binlog retention misconfigured.
Detection: BinlogNotFoundException in connector logs.
Recovery: Worst-case scenario. Trigger Debezium incremental snapshot for the affected tables — sends a signal via a Kafka topic that tells Debezium to scan the source table and emit synthetic insert events for all rows, while still processing live binlog changes in parallel. Takes hours for epoch_data, minutes for everything else.
Prevention: Monitor Debezium lag actively; alert at 12h sustained lag.

F9 · MySQL replica failover

Likelihood: Low (depends on infrastructure).
Cause: MySQL replica we're reading from is promoted or replaced.
Detection: Connection reset in Debezium logs.
Recovery: With GTID mode on MySQL, Debezium reconnects to the new replica and resumes from the GTID set. Without GTID, it requires manual reset of the binlog offset to a known-good position on the new replica.
Prevention: Use GTID mode (already in MySQL prerequisites, §4.1).

F10 · S3/R2 archive sink falling behind

Likelihood: Low.
Cause: Kafka Connect S3 Sink fails or slows.
Detection: Separate consumer lag metric for the archive sink.
Recovery: Restart sink; it resumes from offset. The archive is not on the critical path, so this is a "warn, fix in business hours" not a page.

10.3 Recovery procedures: the rebuild path

The worst-case rebuild — needing to recreate the ClickHouse state from scratch — proceeds in this order:

  1. Stop the ingest MV. Prevents partially-rebuilt data from being picked up by aggregates.
  2. Drop the affected table.
  3. Recreate the table from the canonical DDL.
  4. Reset Kafka consumer group offset to earliest. The Kafka engine table will start re-consuming from the beginning of the topic's retention window (7–30 days, depending on table).
  5. If older history is needed, load from S3/R2 archive first (via INSERT INTO ... SELECT FROM s3(...)), then enable Kafka consume.
  6. Restart the ingest MV.
  7. Restart materialized views (or recreate them via POPULATE on a small data sample, then incrementally fill from the live stream).
  8. Verify counts against MySQL: a periodic check query that compares row counts and recent-data totals between the source and ClickHouse.

For a typical table this rebuild is a 30-min to 2-hour operation. For epoch_data at 1B+ rows, several hours to a day, dominated by S3 read + parsing time. This is the operation we run drills against — practising it once a quarter ensures we know how long it really takes.

Section 11

Capacity Planning

Capacity decisions follow from a few simple numbers. The estimates here are conservative — real-world ClickHouse performance often exceeds them — but they're useful for sizing decisions.

11.1 Storage math

MetricPhase 1 (500M rows)Phase 2 (5B rows)
Uncompressed size (avg ~250 bytes/row in ClickHouse representation)~125 GB~1.25 TB
Compressed (10–15× with LZ4 + Delta codecs)~10–15 GB~80–125 GB
+ Materialized views (~5% of fact)~1 GB~10 GB
+ Replicated tables (×2 for HA in Phase 2)~180 GB
+ ZooKeeper/Keeper state~100 MB~1 GB
+ Headroom for merges (1.5×)~25 GB~270 GB
Total disk per node~30 GB~300 GB per shard

11.2 Throughput math

MetricBaselinePeak burst (100 simultaneous syncs)
Events/sec to MySQL~25 (2M/day average)~50,000 for 30 seconds
Debezium throughput required~100/sec~50,000/sec (sustainable)
Kafka write rate~25 msg/sec/partition~2,000 msg/sec/partition
ClickHouse insert rate~1 batch/2 sec~1 large batch / 2 sec
End-to-end freshness5–10 sec20–60 sec

11.3 Concurrent query capacity

ClickHouse single-node on the Phase 1 spec (4 vCPU, 16 GB RAM) comfortably handles ~10 concurrent analytical queries, with the heaviest individual query latencies in the 100ms–2s range against MVs. Ad-hoc scans of the raw fact table can hit several seconds.

Phase 2 (3-node cluster, 8 vCPU each) handles ~30 concurrent users at similar latency. Past that, we'd either scale horizontally or add a read-only replica tier for dashboard queries.

11.4 What we're not bottlenecked by

11.5 Headroom thinking

Every component is sized for ~3× current peak. This gives ~12–18 months of organic growth before any component needs scaling, which is the natural cadence for revisiting infra decisions. If growth exceeds 3×, the alerts on end-to-end freshness will signal degradation well before any hard limit is hit.

Section 12

Cost Analysis

All prices in USD; rounded to nearest dollar; based on Hetzner / Cloudflare R2 / Backblaze list pricing as of mid-2026. Substituting US-based equivalents (e.g., DigitalOcean for VMs, AWS S3) increases the totals by ~30–50% but doesn't change the relative comparisons.

12.1 Phase 1 (0–500M rows, 0–12 months)

ItemSpecMonthly
ClickHouse VM (Hetzner CCX23)4 vCPU, 16GB RAM, 240GB NVMe$60
Kafka / Redpanda VM4 vCPU, 16GB RAM, 160GB SSD$60
Debezium VM2 vCPU, 4GB RAM, 40GB SSD$20
S3 / R2 archive~200 GB Parquet, ~10 GB/mo growth$5
Backups (snapshots to R2, 30-day retention)~50 GB$20
Prometheus + Grafana (co-hosted)$0
Bandwidth (egress mostly internal)$0
Total~$165 / mo

12.2 Phase 2 (0.5B–5B rows, 12–36 months)

ItemSpecMonthly
ClickHouse cluster (3 × CCX33)8 vCPU, 32GB RAM, 480GB NVMe each$330
Kafka cluster (3 brokers)4 vCPU, 16GB RAM each$180
ClickHouse Keeper (2 nodes)2 vCPU, 4GB RAM each$40
Debezium HA (2 nodes)2 vCPU, 4GB RAM each$40
S3 / R2 archive~1.5 TB Parquet$25
Schema registry (Apicurio)2 vCPU, 4GB RAM$20
Backups (longer retention, more data)~200 GB$40
Monitoring (still self-hosted)2 vCPU, 8GB RAM$40
Total~$715 / mo

12.3 Comparison: managed warehouse stack

For Phase 1 volume (60M events/month, ~500M rows), a representative managed stack:

ItemNotesMonthly
Snowflake compute (XS, ~6h/day)For ETL + queries~$500
Snowpipe Streaming ingest60M events/mo at ~$5/M~$300
Snowflake storage~500 GB$15
Confluent Cloud KafkaSmall cluster~$400
Fivetran CDC (priced per MAR)Or Airbyte Cloud equivalent~$1,500+
Total~$2,700+ / mo

Same workload, ~16× the bill. The savings on the self-hosted stack are real but they come at the cost of ~5–10 hours/week of operational toil. The right tradeoff at Phase 1; less obvious at Phase 2; usually wrong at Phase 3 (when team time becomes the binding constraint).

12.4 The deciding question for when to switch

Track these two metrics over time, and consider the managed-services tier when both cross a threshold:

  1. Engineer-hours/week spent on pipeline ops. When >15/week sustained, the salary cost exceeds the rent saved.
  2. Data team query concurrency. When >15 concurrent analytical users, the managed warehouses' separation-of-storage-from-compute architecture starts to pay off.
Section 13

Migration & Rollout Plan

Bringing this pipeline online from a greenfield state. Assumes MySQL is running, the data team is ready to consume from a new analytical store, and we have ~4 engineering-weeks of capacity to ship Phase 1.

Week 1 — Foundations

Week 2 — CDC plumbing

Week 3 — The fact table

Week 4 — Materialized views & consumer enablement

Post-launch: weekly cadence

Risk register for the rollout

RiskMitigation
Initial snapshot of epoch_data impacts MySQL performanceRun snapshot from a replica, not the primary. Use snapshot.fetch.size to chunk reads. Schedule for off-peak.
Hidden schema constraints not captured in Appendix APre-flight: dump and review actual MySQL schema. Account for any orphan FKs or non-standard collations.
Data team expects features not in scopePublish the scope-and-deferral list (§14, §15) explicitly. Get sign-off before shipping.
Single VM failure during Phase 1Accept 7-day Kafka buffer; snapshot disk daily; restore from backup if needed. Phase 2 adds redundancy.
Section 14

Future Considerations

14.1 Linking clinical reports and insurance documents

The case study mentions: "Outline how to link clinical report data (e.g., PDF lab reports, insurance claim documents) to a user's wearable data so both can be queried together." This is a Phase 2/3 concern but worth sketching now.

The shape of the problem

Clinical reports are large, unstructured (PDFs), and arrive at a different cadence than wearable data. Wearable data is small-and-frequent; clinical data is large-and-rare. They need to be linkable by user ID and queryable together — e.g., "for users in our diabetes cohort, show their A1c history alongside their resting HR trend."

Recommended approach

  1. Object storage for the raw artifacts. PDFs and claim documents live in S3/R2 under a path like s3://thryve-clinical/end_user_id={id}/year={Y}/month={M}/{document_id}.pdf.
  2. A clinical metadata table in ClickHouse. One row per document, with extracted fields (document type, lab measurements, date of test, source provider), plus a pointer to the S3 path. Built either by manual data entry or by an OCR + NLP pipeline (e.g., Tesseract + a fine-tuned LLM extractor).
  3. Joined queries via standard SQL. The clinical metadata table joins to dim_end_user on end_user_id; queries can compute correlations between extracted clinical fields and wearable aggregates.
  4. For unstructured retrieval (e.g., "find users with X mentioned in their last lab report"), index the extracted text in a vector database (Qdrant or pgvector). Use vector search for retrieval, ClickHouse for analytics.

What changes in the pipeline

Very little. The CDC pipeline doesn't change. The clinical data ingestion is a separate pipeline (likely a periodic batch from the clinical-records provider) that lands metadata in a new ClickHouse table. The join surface is shared (end_user_id), so the data team's mental model stays consistent.

14.2 AI/ML training corpus

The AI team eventually needs reproducible point-in-time snapshots of the event data for training. Three options:

14.3 Real-time anomaly detection

Several Appendix D queries ("detect HR spikes within 5 minutes of arrival") imply real-time alerting. ClickHouse can serve this today with a polling pattern:

-- Run every 30 seconds; check for HR spikes in the last 10 minutes
SELECT end_user_id, max(long_value) AS peak_hr, ...
FROM epoch_data
WHERE data_type_id = 3000
  AND created_at >= now() - INTERVAL 10 MINUTE
  AND long_value > (
    SELECT avgMerge(avg_hr) + 2 * stddevSampMerge(stddev_hr)
    FROM mv_user_hr_baseline_30d
    WHERE end_user_id = epoch_data.end_user_id
  )
GROUP BY end_user_id;

For sub-second alerting (e.g., as a feature of the consumer app), introduce Flink between Kafka and ClickHouse: Flink does the windowed comparison in memory and pushes alerts to a notification topic. This is the natural Phase 2/3 addition for real-time product features.

Section 15

Open Questions for Stakeholders

The architecture above is robust to the constraints in the case study, but real-world refinement depends on details we'd want to confirm with the team. These are the questions I'd ask in the kickoff conversation.

Operational reality

  1. What's the actual peak write QPS on MySQL? Avg vs P99 vs P99.9?
  2. How is MySQL hosted — AWS RDS, Aurora, self-managed? RDS limits some Debezium configuration options (notably binlog_row_image); Aurora has its own CDC offering worth considering.
  3. What's the current binlog retention? Anything less than 7 days needs to change before we ship — otherwise our recovery window is dangerously narrow.
  4. Do we have on-call engineers comfortable operating Kafka? If the answer is "no, and we don't want to be," that flips us toward Redpanda Cloud or AWS MSK Serverless even at the cost premium.
  5. Is there an existing observability stack we should integrate with (Datadog, Honeycomb, New Relic)? Adopting it saves us standing up Prometheus.

Compliance & data governance

  1. What's the SLA for GDPR right-to-erasure? Hours or days? Affects whether the §7.4 procedure needs to be more aggressive.
  2. Any HIPAA scope (US customers)? Tightens encryption, audit logging, and BAA requirements.
  3. Are access_token and email considered PII that must be tokenized before they leave MySQL? Affects whether we add an SMT at Debezium or accept them in ClickHouse with column-level encryption.
  4. What's the data-residency requirement — EU-only? US? Affects choice of cloud provider and region.
  5. Are there audit-logging requirements (who queried what, when)? ClickHouse has query logs but a formal audit pipeline may need additional plumbing.

Consumer expectations

  1. Who are the day-one consumers of the analytical store? Internal data analysts? Product features? AI training jobs?
  2. What query latency SLA does the data team want? <1 second matters for dashboards; "a few seconds" is comfortable for ad-hoc analysis.
  3. What existing tools should we integrate with — Looker, Metabase, Superset, dbt, Mode?
  4. Is real-time anomaly alerting a near-term product feature or a "nice to have"? Affects whether we need Flink in Phase 1 or can defer.
  5. Do consumers need a defined data contract / SLA per table? Or is "best effort within 5 minutes" enough?

Scale & budget

  1. What's the user-growth trajectory — 1B rows in 12 months, 24 months, or 60 months? Affects when we'd transition to Phase 2.
  2. What's the data-infra budget ceiling for this fiscal year? Constrains how aggressively we can adopt managed services.
  3. How fast does the data team expect new data sources to be added (CRM, billing, support tickets, etc.)? Each adds CDC complexity.
  4. What does the AI team need from the data — raw event stream, daily snapshots, or feature-store-style derived data?
  5. Are there partner integrations or external data sharing that need to be supported? Affects whether we need a separate read-replica or data-sharing layer.

Team & process

  1. How does the backend team announce schema changes today? The CDC pipeline depends on this — schema changes that aren't coordinated will cause incidents.
  2. Who owns the analytical SQL layer long-term — a data engineer, the analysts themselves, or shared? Affects whether we invest in dbt early.
  3. What's the team's appetite for self-hosting vs. managed? My recommendation is biased toward self-hosting for cost, but team preference matters more than my preference.
These questions don't block shipping the design above — the design is correct under reasonable defaults for each. They would refine implementation details and inform the Phase 2 timeline. The answer to any one of them changes 5–10% of the architecture; the answer to several in combination could shift the cost calculus materially.