A streaming CDC pipeline from MySQL to ClickHouse — designed for <5-minute freshness, billion-row scale, and a startup-sized infrastructure budget.
Thryve collects sensor data from consumer wearable devices and stores it in a MySQL database. As the data and AI teams grow, they need an analytical store that is updated in near real-time, supports the burst-write pattern characteristic of wearable syncs, and stays affordable while the company scales from 150M to 1B+ rows.
This document specifies a streaming Change Data Capture (CDC) pipeline built around three open-source components — Debezium, Apache Kafka, and ClickHouse — that delivers end-to-end freshness in 5–30 seconds (versus the 5-minute SLA), handles updates and deletes correctly via ReplacingMergeTree semantics, and costs roughly $165/month in Phase 1 growing to ~$720/month at 5B rows. The same workload on a managed warehouse stack (Snowflake + Confluent + Fivetran) would cost approximately 15× more at this stage.
| Layer | Choice | Rationale (one line) |
|---|---|---|
| Change capture | Debezium MySQL connector | Binlog-based CDC — only path that propagates updates and deletes correctly without a source schema change. |
| Transport | Apache Kafka (Redpanda viable) | Absorbs burst syncs, decouples source from sink, gives replay for free. |
| Analytical DB | ClickHouse | Columnar, >10× compression on sensor data, native Kafka engine, $0 license. |
| Aggregation | ClickHouse Materialized Views | Incremental on insert — removes the need for a separate stream processor today. |
| Archive | S3 / Cloudflare R2 | Cheap long-term backup of raw events; future AI training corpus. |
| Observability | Prometheus + Grafana | Free, with first-class exporters for every component in the stack. |
Several common pieces are not in the day-one design — not because they're wrong, but because they don't earn their complexity yet. Each is easier to add later than it would be to remove if added prematurely:
The architecture is robust to the constraints in the case study, but several stakeholder decisions could refine it materially — chiefly the GDPR erasure SLA, the actual peak write QPS on MySQL, and whether the data team needs sub-second or few-second query latency. These are catalogued in §15.
The MySQL backend stores six logical entities, with one fact table (epoch_data) holding ~99.99% of the row count. The dimensions are tiny and slow-changing, dominated by a 500K-row end_user table. This shape — one heavy fact, several thin dimensions — is the textbook profile for a columnar analytical store, and it gets favorable treatment in every layer of the design.
| Table | Rows | Profile | Implications |
|---|---|---|---|
epoch_data | 150M → 1B+ | High-frequency, burst writes (~15K rows per user sync), 2M/day baseline | Drives all storage, ingest, and partitioning decisions. |
end_user | 500K | Low-frequency writes, PII-bearing | Needs GDPR-compliant deletion. Joins to fact via end_user_id. |
app | 230 | Tenant configuration (B2B multi-tenant) | Denormalized into fact for tenant pruning at query time. |
epoch_data_type | 500 | Reference; rarely changes | Loaded as ClickHouse dictionary for fast in-query lookup. |
data_source | 50 | Reference; rarely changes | Same as above. |
generation_*, trustworthiness | <100 each | Static | Dictionary-backed. |
A wearable that's been offline for a week may sync ~15,000 epoch_data rows in a single request. These rows have old start_timestamp values (the measurement times) and a new created_at value (the sync time). The pipeline must:
start_timestamp, not created_at);This requirement rules out hourly micro-batch ETL but is comfortably satisfied by a streaming pipeline. In practice, the architecture below achieves end-to-end freshness of 5–30 seconds at normal load and 1–3 minutes under burst conditions. The 5-minute SLA gives operational headroom for transient ClickHouse maintenance or single-broker Kafka restarts.
This single requirement eliminates two otherwise-attractive options:
created_at > last_seen): does not capture updates to existing rows and never sees deletes. There is no updated_at column in the source schema, so this would be doubly broken.SELECT * dumps): can capture state but cannot meet 5-minute freshness on a 150M-row fact table without saturating MySQL.Binlog-based CDC is the only mechanism that satisfies all three correctness requirements (inserts, updates, deletes) at the required latency.
At 1B rows with the sensor-data profile (mostly small integers and floats, with high inter-row correlation), ClickHouse typically achieves >10× compression — putting on-disk size at 30–80 GB. This fits on a single node well past the 1B threshold, deferring sharding complexity to a later phase. Snowflake/BigQuery would handle this easily too; the question is cost, not feasibility (see §12).
Most analytical queries are scoped to a single app (a Thryve customer). Tenant-scoping must be physically efficient in the storage layout — accomplished by including app_id as the first key in the fact table's ORDER BY, allowing ClickHouse to skip any data not belonging to the queried tenant.
The case data shows a German user (Maria Schmidt, Europe/Berlin). Even without explicit mention, wearable health data in a European context is governed by GDPR. This affects the design in three concrete places: (a) deletes must propagate end-to-end and be physically removable on demand (§7); (b) PII columns like email and access_token warrant encryption at rest and access logging; (c) data residency may pin component deployment to EU regions, affecting cloud-provider choice. HIPAA scope (for any US customers) would tighten these further; this is one of the questions in §15.
Appendix D of the case enumerates representative queries. They cluster into three patterns:
"What is the 90-day trend of resting heart rate for user X?" These queries read a contiguous slice of epoch_data filtered by end_user_id, data_type_id, and a date range. The fact-table ORDER BY (app_id, end_user_id, data_type_id, start_timestamp) reduces these to localized, mostly-sequential reads — the cheapest pattern on ClickHouse.
"How does average HRV compare across users with condition X?" These touch many users but typically a single data_type_id over a recent time window. Solved cleanly by materialized aggregates partitioned by month: mv_daily_hrv_per_user rolls epoch-level data into one row per user per day, after which cohort summaries are a few thousand rows of aggregation rather than millions of raw events.
"Which users have HR gaps > 24h in the last week?" / "What % of rows arrive > 48h after measurement?" These are global scans over recent partitions. They are not user-scoped, so they don't benefit from the user-level ordering — but they only need recent partitions (small disk footprint) and the columnar layout means scanning only the columns they care about (start_timestamp, created_at, data_type_id) is cheap.
Pattern A and B together account for the vast majority of expected queries. Pattern C is operational and tolerates higher latency. The architecture optimizes for A and B, and accepts that C will use more compute per query.
It is useful to be explicit about workloads the design does not need to support, to avoid carrying their costs:
app_id into the fact and using dictionaries for the small reference tables, avoiding JOINs at query time for the common cases.The pipeline is six components arranged in a single linear flow, with a side branch to object-storage archive and a cross-cutting observability plane. Each component has a single job, a clear contract with its neighbors, and a well-understood failure mode.
The architecture is consciously linear. There is exactly one production path from each MySQL change to its representation in ClickHouse, and the path passes through systems that are individually simple, individually resilient, and well-understood by the operations community. The cost of this conservatism is that we forgo certain "exotic" optimizations — e.g., direct MySQL-to-ClickHouse replication (skipping Kafka) is possible via tools like Sinkhole or ClickHouse's MaterializedMySQL engine, but it sacrifices the replay-from-buffer capability that has saved us in nearly every incident on similar pipelines.
cdc.thryve.. Each message keyed by the source primary key to preserve per-row ordering within a partition.max(now() - max(created_at)) per table and is the SLA dashboard's headline metric.The architecture is intentionally staged so that the same blueprint scales from single-node Phase 1 to a 3-shard Phase 2 cluster without rework. The phasing is driven by row count and concurrent-user count, not by calendar time.
| Phase | Trigger | What changes | What stays the same |
|---|---|---|---|
| Phase 1 0 – 500M rows |
Day-one shipping volume | Single-node ClickHouse · single-broker Kafka (Redpanda viable) · Debezium standalone | Schemas · topic layout · MV definitions · monitoring queries |
| Phase 2 0.5B – 5B rows |
CH disk > 60%, query latency drifting; or write-rate > 10K events/sec sustained | 3-node ClickHouse with sharding by app_id · 3-broker Kafka · 2-node Debezium HA · add schema registry |
Logical schemas · query interface · per-table strategy |
| Phase 3 5B+ rows |
AI workloads need raw event log; or consumer concurrency exceeds CH capacity | Add Iceberg-on-S3 tier fed by Kafka Connect · introduce Trino for ad-hoc cross-domain JOINs · evaluate dbt Cloud | The CDC layer (Debezium + Kafka) feeds both tiers — no source-side rework |
Debezium is the de-facto open-source CDC framework. The MySQL connector presents itself to MySQL as a replica (using a dedicated replication user), reads the binary log (binlog) in ROW format, and emits a structured event per change. The events include both before and after states for updates, and emit explicit delete events.
# Debezium MySQL connector — production settings name: thryve-mysql-cdc connector.class: io.debezium.connector.mysql.MySqlConnector tasks.max: 1 # single-threaded by design # Source connection database.hostname: mysql-replica.internal database.port: 3306 database.user: debezium_cdc database.password: ${file:/run/secrets/dbz-pw:value} database.server.id: 184054 # unique among MySQL replicas database.include.list: thryve table.include.list: thryve.app, thryve.end_user, thryve.epoch_data, thryve.epoch_data_type, thryve.data_source, thryve.generation_type, thryve.generation_method, thryve.trustworthiness # Snapshot mode — initial snapshot, then tail binlog forever snapshot.mode: initial snapshot.locking.mode: none # MVCC, no table locks snapshot.fetch.size: 10000 # Throughput tuning for the bursty epoch_data table max.batch.size: 8192 max.queue.size: 32768 poll.interval.ms: 500 # Topic naming & partitioning topic.prefix: cdc.thryve message.key.columns: thryve.epoch_data:id;thryve.end_user:id # Flatten the Debezium envelope and surface delete tombstones. # Without unwrap, every event is wrapped in {before, after, source, op}. transforms: unwrap transforms.unwrap.type: io.debezium.transforms.ExtractNewRecordState transforms.unwrap.drop.tombstones: false transforms.unwrap.delete.handling.mode: rewrite transforms.unwrap.add.fields: op,source.ts_ms,source.pos,source.file # Schema history (where DDL events are persisted) schema.history.internal.kafka.bootstrap.servers: kafka:9092 schema.history.internal.kafka.topic: cdc.thryve.schema-history
Debezium requires the source MySQL instance to satisfy a few conditions. These should be verified before the connector is deployed:
log_bin = ON and binlog_format = ROW (statement-based binlog cannot reconstruct individual row changes reliably).binlog_row_image = FULL — emit complete before/after row images, not just changed columns. Required for ReplacingMergeTree to dedupe correctly.gtid_mode = ON recommended — simplifies failover and resumption.binlog_expire_logs_seconds = 604800). This is the recovery window for a Debezium outage.REPLICATION CLIENT, REPLICATION SLAVE, and SELECT on the source schemas. (No write permissions.)After the unwrap SMT is applied, an event for an epoch_data insert looks roughly like this:
{
"id": 991200,
"end_user_id": 42817,
"data_type_id": 3000,
"data_source_id": 5,
"start_timestamp": 1741003200000,
"end_timestamp": 1741003260000,
"long_value": 72,
"created_at": 1741254720000,
"timezone_offset": 60,
"__op": "c", // c = create, u = update, d = delete
"__source_ts_ms": 1741254720123,
"__source_pos": 8742019, // binlog position — our _version
"__source_file": "binlog.000142",
"__deleted": "false"
}
The crucial fields for downstream processing are __source_pos (used as the _version in ClickHouse's ReplacingMergeTree — monotonically increasing within a binlog file) and __deleted (set to "true" by the delete.handling.mode = rewrite SMT for tombstone events).
| Failure | Detection | Recovery |
|---|---|---|
| Connector crash | Kafka Connect REST /connectors/{name}/status = FAILED | systemd auto-restart; connector resumes from last committed binlog offset (no data loss). |
| MySQL binlog purged before Debezium reads it | BinlogNotFoundException in connector logs | Manual: trigger incremental snapshot for affected tables via Kafka topic signal. |
| Schema change to non-nullable column | Connector pauses with schema error | Manual: apply matching ALTER on ClickHouse, then resume. |
| MySQL replica failover | Connection reset; auto-reconnect | With GTID mode, automatic. Without GTID, requires manual offset reset. |
| Source DB password rotation | Auth errors in logs | Update connector secret; restart task. |
Kafka sits between Debezium and ClickHouse as a durable, replayable buffer. It is the component that absorbs burst syncs without backpressuring MySQL, that lets ClickHouse be restarted for maintenance without data loss, and that gives us replay if we ever need to rebuild a ClickHouse table from history.
Debezium can write directly to many sinks via Debezium Server (Kinesis, Pulsar, HTTP). For a startup, the temptation to skip Kafka and go straight to ClickHouse is real — one less system to operate. The reasons we keep Kafka in the path anyway:
Redpanda is a Kafka-compatible broker written in C++ — single binary, no Zookeeper, lower memory footprint, often better tail latency. For Phase 1 (single-broker), Redpanda is meaningfully easier to operate and produces nearly identical guarantees. I'd recommend it as the starting point and migrate to Apache Kafka only if specific Kafka-ecosystem tools (KSQL, certain connectors) become needed.
| Topic | Partitions | Retention | Reason |
|---|---|---|---|
cdc.thryve.epoch_data | 24 | 7 days | Highest-volume table. 24 partitions allow parallel ClickHouse consume up to 24 consumer threads. |
cdc.thryve.end_user | 3 | 14 days | Low volume but PII-relevant; longer retention to permit re-verification of GDPR deletes. |
cdc.thryve.app | 1 | 30 days | Tenant config; rarely changes, but want long replay window. |
cdc.thryve.data_sourcecdc.thryve.epoch_data_typecdc.thryve.generation_*cdc.thryve.trustworthiness | 1 each | 30 days | Reference data; one partition suffices, long retention is cheap given low volume. |
Partitioning is by the source primary key. For epoch_data, this means events for the same row always land on the same partition, preserving in-row ordering. We do not partition by end_user_id because that would create skew (some users sync orders-of-magnitude more data than others), and because cross-row ordering is not required for correctness — ReplacingMergeTree resolves the final state from _version.
| Setting | Value | Reason |
|---|---|---|
Producer acks | all | Wait for all in-sync replicas — durability over latency. |
Producer compression.type | zstd | Best compression ratio for repeated JSON keys. |
Producer linger.ms | 50 | Allow micro-batching at the producer for better throughput. |
Topic min.insync.replicas | 2 | Tolerate 1 broker failure without losing durability. |
Consumer isolation.level | read_committed | Only read committed records — avoids reading transactional aborts. |
ClickHouse is the most consequential choice in the architecture. It owns query latency, storage cost, and most of the operational complexity. Several alternatives could work — see §4.3.4 for the comparison — but ClickHouse aligns better than any of them with the specific shape of this workload: time-ordered, append-heavy, columnar-friendly, with a strong preference for predictable cost over elastic scale.
epoch_data rows have high column-wise redundancy (same end_user_id appearing for hundreds of consecutive rows; same data_type_id for thousands; integer-typed values with narrow ranges). ClickHouse's column codecs (LZ4 for hot data, ZSTD for cold) plus its delta-of-delta encoding for sorted columns typically achieve 10–20× compression on this kind of data. At 1B rows the uncompressed footprint would be ~250 GB; compressed, we're looking at 15–30 GB.Kafka engine table acts as a consumer; a materialized view on top of it moves rows into the persistent MergeTree table. We don't operate a separate ETL process — the consume + insert loop runs inside ClickHouse.ReplacingMergeTree dedupes by primary key keeping the highest-versioned row; AggregatingMergeTree maintains running aggregates incrementally. Both are designed for the append-heavy + late-update pattern we have.SELECT ... FROM epoch_data GROUP BY ... recomputes only for the rows in each new insert — not the whole table. This is what makes "compute the daily aggregate on every insert" performant rather than catastrophic.epoch_data partitions older than 2 years" or "move them to S3 after 90 days" with a one-line TTL clause, no separate batch job needed.ReplacingMergeTree merges in the background; until the merge runs, both old and new versions of a row coexist in the table. Reads have to either tolerate this or pay the cost of FINAL / argMax dedup (see §7).A handful of settings are particularly important for our workload:
| Setting | Value | Reason |
|---|---|---|
max_partitions_per_insert_block | 100 | Allow burst inserts to span multiple monthly partitions when a user syncs week-old data. |
Kafka engine kafka_max_block_size | 65536 | Cap per-batch insert size to avoid "too many parts" errors during sustained load. |
Kafka engine kafka_flush_interval_ms | 2000 | Tighter than default 7500ms — keeps freshness under 5s at normal load. |
| Merge thread count | tune to ~50% of available cores | Background merges must keep up with insert rate to prevent part accumulation. |
| Codecs on time columns | Codec(Delta, ZSTD(3)) | Delta encoding shrinks monotonically-increasing timestamps to a few bits per row. |
| Codecs on ID columns | Codec(T64, ZSTD(3)) | T64 codec is highly effective for unsigned integers with narrow ranges. |
| Candidate | Verdict |
|---|---|
| Snowflake | The best developer experience in the analytical-DB space. Its separation of storage from compute and excellent JOIN support are genuinely superior. The case against it for this workload is cost: Snowpipe Streaming inserts at our event rate, combined with the warehouse credits consumed by even modest concurrent queries, easily reaches $2–3K/month at 1B rows. That's defensible at a Series B startup but not a Seed/Series A one. Reconsider when the data team has >5 concurrent users. |
| BigQuery | Similar shape to Snowflake — excellent for ad-hoc analytics, expensive for streaming ingest. Strongest if the team is already on GCP and would benefit from BigQuery's tight integration with the rest of the Google data stack. Same cost objection as Snowflake. |
| TimescaleDB / Postgres | An attractive option up to roughly 500M rows of time-series data. Compression (via TimescaleDB's compressed chunks) is good but not at ClickHouse levels. The selling point is JOIN flexibility and the comfort of Postgres ergonomics. The breaking concern is query performance on the heaviest analytical patterns at 1B+ rows: per-user scans hold up, but cohort aggregations slow down measurably. Best choice if the team strongly prefers Postgres familiarity and is confident growth will be slow. |
| Apache Druid | Purpose-built for real-time OLAP. Excellent ingestion throughput and sub-second queries at scale. Operationally significantly heavier — coordinator, brokers, historicals, middle managers, deep storage — and its SQL surface, while improved, is still less expressive than ClickHouse's. The case for Druid is strongest when sub-second query latency at high concurrency is the dominant requirement. Our 5-min freshness + few-second query SLA doesn't push us into that territory. |
| Apache Pinot | Similar value proposition to Druid; some teams find it easier to operate. Same caveat applies — operational complexity exceeds what a startup data team typically wants to absorb day-one. |
| Iceberg + Trino / DuckDB | The "modern lakehouse" answer. Strong for AI training workloads (Spark / Polars / DuckDB can all read Iceberg directly). Doesn't solve the 5-min freshness requirement on its own — needs a streaming layer feeding it. A natural Phase 3 addition, not a Phase 1 base. |
| MySQL read replica | Trivial to set up but doesn't solve the analytical query problem. A scan over 1B rows of epoch_data is hours on row-oriented storage. Mentioned only to be ruled out explicitly — sometimes proposed by stakeholders unfamiliar with the OLTP / OLAP distinction. |
ClickHouse consumes Kafka via a specific three-table pattern:
-- 1. The Kafka engine table acts as a consumer. -- Reading from it consumes from Kafka, so it's not directly queried. CREATE TABLE kafka_epoch_data ( id UInt64, end_user_id UInt32, data_type_id UInt32, start_timestamp Int64, long_value Nullable(Int64), double_value Nullable(Float64), -- ... other value columns __source_pos UInt64, __deleted String ) ENGINE = Kafka SETTINGS kafka_broker_list = 'kafka:9092', kafka_topic_list = 'cdc.thryve.epoch_data', kafka_group_name = 'ch_epoch_data_consumer', kafka_format = 'JSONEachRow', kafka_num_consumers = 8, kafka_max_block_size = 65536, kafka_flush_interval_ms = 2000; -- 2. The persistent storage table (the one queries actually hit). CREATE TABLE epoch_data ( id UInt64, end_user_id UInt32, app_id UInt16, -- ... full schema (see §6) _version UInt64, _deleted UInt8 ) ENGINE = ReplacingMergeTree(_version) PARTITION BY toYYYYMM(start_ts) ORDER BY (app_id, end_user_id, data_type_id, start_ts, id); -- 3. The materialized view that moves rows from (1) to (2). -- Each time the Kafka engine table receives a batch, this view's SELECT -- runs against the batch and inserts the result into epoch_data. CREATE MATERIALIZED VIEW mv_ingest_epoch_data TO epoch_data AS SELECT id, end_user_id, dictGetUInt16('dim_end_user', 'app_id', end_user_id) AS app_id, data_type_id, data_source_id, toDateTime64(start_timestamp / 1000, 3, 'UTC') AS start_ts, toDateTime64(end_timestamp / 1000, 3, 'UTC') AS end_ts, long_value, double_value, boolean_value, string_value, __source_pos AS _version, if(__deleted = 'true', 1, 0) AS _deleted FROM kafka_epoch_data;
The dictGetUInt16 call denormalizes app_id from the end_user dimension at write time, so queries never need to JOIN to retrieve it. This is the single most important denormalization in the schema — it converts an expensive JOIN into a free column lookup, at the cost of recomputing app_id on every insert (cheap, since the dictionary is in memory).
The archive tier is not on the critical path — no production query reads from it — but it earns its place for three reasons:
S3 is the obvious default. R2 is meaningfully cheaper for this use case because egress is free — and the archive will be read from many directions (ClickHouse for backfills, training jobs, ad-hoc analysts). At ~$0.015/GB/month storage and zero egress, R2 is roughly 1/3 the cost of S3 for a workload that reads as well as writes. The compatibility with the S3 API is essentially complete; the Kafka Connect S3 Sink works against R2 with no code change.
The argument against R2 is data-residency concerns (R2's regions are more limited than S3's) and the absence of certain advanced features (lifecycle policies, Glacier-style tiering). For Phase 1, neither applies meaningfully.
name: s3-sink-epoch-data connector.class: io.confluent.connect.s3.S3SinkConnector topics: cdc.thryve.epoch_data, cdc.thryve.end_user, ... s3.bucket.name: thryve-cdc-archive s3.region: auto # R2 uses 'auto' store.url: https://<account>.r2.cloudflarestorage.com # Hourly partitioning, by event time not arrival time flush.size: 100000 rotate.interval.ms: 3600000 # 1 hour timestamp.extractor: RecordField timestamp.field: source.ts_ms path.format: "'topic='${topic}/'y='YYYY/'m='MM/'d='dd/'h='HH" format.class: io.confluent.connect.s3.format.parquet.ParquetFormat parquet.codec: snappy
Files land at paths like:
s3://thryve-cdc-archive/topic=cdc.thryve.epoch_data/
year=2026/month=05/day=23/hour=14/
cdc.thryve.epoch_data+12+0000000000412198192.parquet
The filename includes the topic, partition, and starting Kafka offset, making it trivial to reason about completeness. Parquet with Snappy compression typically gives ~5–8× compression on this event schema.
At 2M events/day with ~150 bytes per event compressed, the archive grows at ~10 GB/month. At Phase 2 burst rates (~10M events/day), that becomes ~50 GB/month. Even after 5 years, the archive is well under 5 TB — under $80/month at R2 list pricing. Lifecycle policies can move very-old data to colder tiers, but the economics rarely justify the operational complexity at our scale.
The observability stack is intentionally boring. Prometheus scrapes metrics from each component's built-in exporter; Grafana visualizes; Alertmanager routes pages. None of this is novel, all of it is well-supported by every component in the pipeline.
The single most important metric is the end-to-end lag between when a row is written to MySQL and when it becomes queryable in ClickHouse. Implementation:
-- Run every 30 seconds via a Grafana → ClickHouse query SELECT 'epoch_data' AS table, now() - max(created_at) AS freshness_seconds FROM epoch_data WHERE created_at >= now() - INTERVAL 10 MINUTE;
This metric collapses every possible pipeline failure into one number. If MySQL is healthy, Debezium is healthy, Kafka is healthy, and ClickHouse is healthy, freshness will be 5–30 seconds. Anything else and freshness will trend upward, hit the 5-minute threshold, and page.
Freshness tells us that something is wrong; the layered metrics tell us what:
| Layer | Metric | What it means when red |
|---|---|---|
| MySQL | mysql_global_status_innodb_log_lsn_current | Write rate. Rising fast = sustained heavy load on source. |
| Debezium | debezium_metrics_milliseconds_behind_source | How far behind binlog Debezium is. Above ~1s = read bottleneck. |
| Debezium | debezium_metrics_total_number_of_events_seen | Event throughput. Flat-lining = connector stuck. |
| Kafka | kafka_consumer_lag (per consumer group) | How many messages the consumer hasn't read yet. Rising = consumer can't keep up. |
| Kafka | kafka_topic_partitions_under_replicated | Replication health. >0 = broker issue. |
| ClickHouse | system.parts count per partition | Background merge health. >300 sustained = merges falling behind. |
| ClickHouse | system.replication_queue size (when replicated) | Replica sync health. |
| ClickHouse | InsertedRows per minute | Insert throughput. Flat-lining = Kafka consumer paused. |
Alerts that are not pages: anything that doesn't require action within 30 minutes. The 3 a.m. page should only fire for things that genuinely need a human in the next half-hour.
Stream-versus-batch is the wrong question to ask per table; the right question is "which consumption pattern, knowing that the same Debezium connector reads all tables anyway." The infrastructure is shared, so the per-table decision is really about (a) how the data lands on the ClickHouse side, and (b) what failsafe we wrap around the streaming path.
| Table | Primary path | Failsafe | CH engine | Reasoning |
|---|---|---|---|---|
epoch_data |
CDC stream | Daily incremental snapshot → Parquet on S3/R2 | ReplacingMergeTree |
High volume; 5-min freshness is the whole point. Daily snapshot is the rebuild source if Kafka retention expires during a sustained outage. |
end_user |
CDC stream | Hourly full snapshot to R2 | ReplacingMergeTree |
PII-bearing. Deletes must propagate quickly (GDPR). Small enough that an hourly full snapshot is cheap insurance. |
app |
CDC stream | Hourly full snapshot | ReplacingMergeTree |
Tenant config — small, slow-changing, but joins from every query. Snapshot makes recovery trivial. |
epoch_data_typedata_source |
CDC stream | Hourly full snapshot | ReplacingMergeTree + ClickHouse Dictionary (in-memory) |
Reference data, used for in-query lookups via dictGetX. Dictionary auto-refreshed from the underlying MergeTree. |
generation_typegeneration_methodtrustworthiness |
Snapshot only | — | Dictionary |
Static, <100 rows. Loaded as a ClickHouse dictionary, refreshed daily by a one-line cron'd query. No streaming overhead. |
The temptation, on reading the per-table sizes, is to optimize: stream the heavy table, batch the small ones. We resist that because:
end_user would mean an hour of staleness on a user-deletion request — unacceptable for GDPR.The generation_* and trustworthiness tables are small enough (~100 rows total) that loading them as in-memory dictionaries is dramatically faster than even an indexed JOIN. Dictionary lookups are O(1) hash lookups, run in the same memory space as the query, and don't touch disk. The CDC pipeline still streams these tables (for $0 incremental cost), but queries use the dictionary form via dictGet('trustworthiness', 'name', trustworthiness_id).
The schema decisions are dominated by one question: what does the ORDER BY on epoch_data look like? Everything else in the model follows from getting that right.
CREATE TABLE epoch_data ( id UInt64, end_user_id UInt32, app_id UInt16, -- denormalized at ingest time data_type_id UInt32, data_source_id UInt16, third_party_ds_id UInt16 DEFAULT 0, start_ts DateTime64(3, 'UTC') CODEC(Delta, ZSTD(3)), end_ts Nullable(DateTime64(3, 'UTC')) CODEC(Delta, ZSTD(3)), created_at DateTime64(3, 'UTC'), -- sync time double_value Nullable(Float64) CODEC(ZSTD(3)), long_value Nullable(Int64) CODEC(T64, ZSTD(3)), boolean_value Nullable(UInt8), date_value Nullable(DateTime64(3, 'UTC')), string_value Nullable(String), generation_type_id UInt8 DEFAULT 0, generation_method_id UInt8 DEFAULT 0, trustworthiness_id UInt8 DEFAULT 0, medical_grade UInt8 DEFAULT 0, chronological_exactness Int16, timezone_offset Int16, measurement_id UInt64, _version UInt64, -- binlog position _deleted UInt8 DEFAULT 0, _ingested_at DateTime DEFAULT now() -- for ops queries ) ENGINE = ReplacingMergeTree(_version) PARTITION BY toYYYYMM(start_ts) ORDER BY (app_id, end_user_id, data_type_id, start_ts, id) SETTINGS index_granularity = 8192 TTL toDate(start_ts) + INTERVAL 2 YEAR DELETE;
The ORDER BY tuple (app_id, end_user_id, data_type_id, start_ts, id) determines the physical layout of every part. ClickHouse stores rows in this order within each partition. The implications:
app_id can skip every part that doesn't contain that tenant's data. With 230 tenants, that's a 200× reduction in data scanned for most tenant-scoped queries.(app_id=12, end_user_id=42817) are contiguous on disk within each partition. The 90-day-trend query reads a tiny block, not the whole table.HeartRate rows are contiguous, then all Steps rows, etc. Filtering by data_type_id further reduces the scan.id at the end guarantees uniqueness for the dedup engine. Without it, two truly distinct rows with identical (app, user, type, ts) would be incorrectly collapsed.The case for not putting start_ts earlier in the key: a pure time-ordered key would optimize for "what happened across all users in the last hour" queries, at the cost of "what's this user's history" queries. The case's analytical question set leans heavily on the latter, so user-locality wins.
Partitioning by toYYYYMM(start_ts) gives one partition per calendar month per shard. At 1B rows over 3–4 years that's ~30M rows per partition — a comfortable size for ClickHouse merge operations. The benefits:
The source has five nullable value columns (double_value, long_value, boolean_value, date_value, string_value), only one of which is populated per row based on the row's data_type_id. There are two reasonable approaches:
data_type_id. Null columns compress essentially for free in a columnar store. Single table = simpler schema, fewer joins.We choose option 1 because (a) the null columns cost essentially nothing in storage, (b) ClickHouse's column-pruning means a query that only reads long_value never touches the other columns on disk anyway, and (c) the per-table-split approach makes cross-type queries (recovery score = f(HR, HRV, sleep)) painful.
CREATE TABLE dim_end_user ( id UInt32, app_id UInt16, pseudonym String, first_name String, last_name String, email String, date_of_birth Nullable(Date), timezone LowCardinality(String), created_at DateTime64(3, 'UTC'), _version UInt64, _deleted UInt8 DEFAULT 0 ) ENGINE = ReplacingMergeTree(_version) ORDER BY (app_id, id); -- And the dictionary that the fact-table ingest MV uses to look up app_id CREATE DICTIONARY dim_end_user_dict ( id UInt32, app_id UInt16 ) PRIMARY KEY id SOURCE(CLICKHOUSE(TABLE 'dim_end_user')) LAYOUT(HASHED()) LIFETIME(MIN 300 MAX 600); -- refresh every 5–10 minutes
LowCardinality(String) is used for the timezone column because there are at most a few hundred distinct values across all users; LowCardinality stores them as small integer codes with a dictionary, saving substantial space and speeding up filters and grouping.
Materialized views in ClickHouse are insert-triggered — when rows are inserted into the underlying table, the view's SELECT runs against just the new batch and inserts the result into the view's storage. This is the mechanism we use to keep daily/weekly aggregations always-current with minimal overhead.
-- Daily resting HR per user (powers the "90-day trend" query) CREATE MATERIALIZED VIEW mv_daily_resting_hr ENGINE = AggregatingMergeTree PARTITION BY toYYYYMM(date) ORDER BY (app_id, end_user_id, date) AS SELECT app_id, end_user_id, toDate(start_ts) AS date, avgState(long_value) AS avg_hr, minState(long_value) AS min_hr, maxState(long_value) AS max_hr, countState() AS samples FROM epoch_data WHERE data_type_id = 3001 -- HeartRateResting AND _deleted = 0 GROUP BY app_id, end_user_id, date; -- The 90-day-trend query, now sub-second on billions of rows SELECT date, avgMerge(avg_hr) AS resting_hr, minMerge(min_hr) AS min_hr, maxMerge(max_hr) AS max_hr FROM mv_daily_resting_hr WHERE end_user_id = 42817 AND date >= today() - 90 GROUP BY date ORDER BY date;
The materialized view stores intermediate aggregation state (e.g., avgState stores both running sum and count), not the final value. This is necessary because the view receives data in batches, and a true running average can't be incrementally combined without preserving the underlying counts. At query time, avgMerge combines those partial states into the final answer.
This is also what makes burst-sync correctness work: when a user syncs week-old data, the rows arrive at the view, which computes partial aggregates for the old dates and merges them with the existing partial aggregates already stored. The query result is always consistent with the underlying fact table, regardless of arrival order.
| View | Powers |
|---|---|
mv_daily_user_metrics | Daily summary per user: total steps, distance, calories, resting HR, sleep duration. The dashboard backbone. |
mv_user_hrv_30d | 30-day rolling RMSSD per user. Powers HRV cohort comparisons and anomaly baselines. |
mv_user_sleep_stages_daily | Daily total time in each sleep stage. Powers the "sleep quality vs activity" correlation queries. |
mv_user_data_quality_daily | Per user per day: row count, time coverage, gap count. Powers operational data-quality dashboards. |
mv_data_arrival_latency | Histogram of created_at - start_timestamp buckets. Powers the "% of rows arriving >48h late" operational query. |
The source uses BIGINT milliseconds since epoch for all timestamps. We convert to DateTime64(3, 'UTC') at the ingest materialized view layer because:
Delta codec, which compresses sorted timestamps to a few bits per row.start_ts >= today() - 90 vs. start_ts >= toUnixTimestamp64Milli(today() - 90)).toDateInTimezone(start_ts, dim_end_user.timezone)).The original epoch value is recoverable via toUnixTimestamp64Milli(start_ts) if any downstream consumer needs it in that form.
Three queries from Appendix D, executed against the proposed model, with comments on access patterns:
-- Q1. "90-day trend of resting heart rate for user 42817" -- Uses MV. Reads ~90 rows. Sub-second. SELECT date, avgMerge(avg_hr) AS resting_hr FROM mv_daily_resting_hr WHERE end_user_id = 42817 AND date >= today() - 90 GROUP BY date ORDER BY date; -- Q2. "Users with HR gaps > 24h in the last 7 days" -- Uses mv_user_data_quality_daily. Reads ~3.5M rows (500K users × 7 days), -- but each row is small. Few seconds. SELECT end_user_id, max_gap_minutes FROM mv_user_data_quality_daily WHERE data_type_id = 3000 AND date >= today() - 7 AND max_gap_minutes > 1440; -- Q3. "Cohort HRV: condition-X users vs general population" -- The "cohort" filter is external (likely a SQL IN list of user IDs). -- Reads from mv_user_hrv_30d. Sub-second per cohort. SELECT if(end_user_id IN cohort_users, 'condition_x', 'general') AS cohort, avgMerge(avg_rmssd) AS avg_hrv, count() AS users FROM mv_user_hrv_30d WHERE date >= today() - 30 GROUP BY cohort;
This section addresses the case study's explicit prompt: "How do you handle updates/deletes in an append-oriented system?" The short answer: ReplacingMergeTree with a monotonic version column from the binlog position, combined with a read-side wrapper view that materializes the latest version on demand. The long answer follows.
ReplacingMergeTree(version_column) is a MergeTree variant that, during its background merge cycle, dedupes rows sharing the same ORDER BY key and keeps only the row with the highest version_column value. This is exactly the semantics we need: every UPDATE in MySQL becomes a new row in ClickHouse with a higher _version; the merge eventually collapses the old row away.
The choice of version column matters: it must be monotonically increasing for every change to a given row. Debezium provides several candidates; the binlog position (source.pos) is the natural fit because it's strictly increasing within a binlog file and Debezium guarantees in-order delivery per source row. For changes that cross binlog file rotations, we use a combined value: (file_number * 2^32) + pos (the binlog file numbers themselves are monotonic).
When MySQL deletes a row, Debezium emits a delete event. With the delete.handling.mode = rewrite SMT, that event becomes a regular insert into the topic but with __deleted = "true". The ingest materialized view sets _deleted = 1 on the resulting ClickHouse row, which has the same ORDER BY key as the original.
Because the delete event always has a higher _version than the row it tombstones, ReplacingMergeTree will keep the _deleted = 1 version after merge. Queries then need to filter WHERE _deleted = 0 to exclude logically-deleted rows.
The catch with ReplacingMergeTree is that merges are asynchronous. After an UPDATE, both the old and new versions of the row exist in the table until the next background merge runs (which can be seconds or minutes later, depending on data size and merge schedule). A naive SELECT will see both rows.
There are three ways to handle this, in increasing order of cost and decreasing order of convenience:
CREATE VIEW epoch_data_current AS SELECT id, end_user_id, app_id, data_type_id, data_source_id, start_ts, end_ts, created_at, argMax(double_value, _version) AS double_value, argMax(long_value, _version) AS long_value, argMax(boolean_value, _version) AS boolean_value, argMax(string_value, _version) AS string_value, argMax(generation_type_id, _version) AS generation_type_id, argMax(trustworthiness_id, _version) AS trustworthiness_id, argMax(_deleted, _version) AS is_deleted FROM epoch_data GROUP BY id, end_user_id, app_id, data_type_id, data_source_id, start_ts, end_ts, created_at HAVING is_deleted = 0;
This is the recommended default. It hides the dedup logic from data-team queries entirely. The cost is the GROUP BY and argMax overhead, which is bounded — typically <2× the cost of a naive SELECT against the same range.
The FINAL modifier tells ClickHouse to merge-on-the-fly at query time. Functionally equivalent to Option A in terms of result, but the implementation forces a sort+merge of every matching part, which can be considerably more expensive for queries that touch many parts.
I'd reserve FINAL for one-off queries where the convenience matters more than the cost; for production query paths, the wrapper view is more predictable.
Manually trigger a full merge. The most expensive option (rewrites all affected parts) but produces the cleanest physical state — useful as a scheduled operation off-peak, especially after large backfills or batch deletes.
ReplacingMergeTree logically deletes rows but doesn't physically remove them from disk until partitions are dropped (via TTL or manual operation). For GDPR compliance, we need a way to physically delete a user's data on demand.
-- Triggered when a user's erasure request is approved. -- ALTER ... DELETE in ClickHouse is a lightweight mutation — -- it rewrites affected parts in background. ALTER TABLE epoch_data DELETE WHERE end_user_id = 42817; ALTER TABLE dim_end_user DELETE WHERE id = 42817; -- And from each materialized view's storage: ALTER TABLE mv_daily_resting_hr DELETE WHERE end_user_id = 42817; -- (Repeat for each MV)
This pattern is wrapped in a single stored procedure (a SQL function or Python script) invoked from an internal "user erasure" admin tool. The procedure must also:
The mutation completes within minutes for a typical user (most of whose data is concentrated in recent partitions). Backfilling old archive files takes longer (~hours, since it requires scanning S3/R2 hourly Parquet files). I'd commit to 24-hour SLA for full physical erasure, with the SQL-layer deletion happening within 1 hour. Stakeholders may want a stricter SLA — see §15.
ClickHouse offers a sibling engine, CollapsingMergeTree, that handles updates by storing +1 ("new state") and -1 ("old state") rows that cancel on merge. It's arguably more theoretically clean but requires the writer to produce paired rows for every update. Debezium doesn't natively emit -1 events; doing this would require an SMT or a downstream processor.
ReplacingMergeTree is simpler: Debezium emits each row state independently, and ClickHouse handles the resolution. We lose the ability to do "delta math" (computing the change between successive versions of a row) but that's not a common analytical pattern here.
Debezium has at-least-once delivery semantics. If the connector crashes after emitting an event but before committing the offset, that event will be re-emitted on restart. Our pipeline is naturally idempotent under this:
_version as the original.(ORDER BY tuple, _version) combination.The pre-merge window allows brief over-counting in MV aggregates, but the steady-state is correct.
The burst-sync pattern produces a different kind of "update" — not a modification of an existing row but the arrival of new rows for old time periods. This is handled trivially by partitioning by start_ts (the measurement time, not the sync time): late rows land in historical partitions automatically, and the materialized views over those partitions incrementally update.
The only operational consideration is part count: a burst sync of a week of data lands in multiple historical partitions. The CH settings allow inserts to span up to 100 partitions per block (max_partitions_per_insert_block) to accommodate this.
The constraint says: "A user offline for days syncs a week of data at once." With ~15,000 epoch_data rows in one burst, and potentially hundreds of users syncing simultaneously after, say, a regional connectivity event, the system needs to absorb spikes of 1M+ events in seconds without inducing pipeline lag for other tenants.
linger.ms = 50; the 15K messages spread across 24 partitions of the epoch_data topic. Total Kafka latency: <1 second.kafka_max_block_size = 65,536. The 15K-row burst fits in one batch, inserted as 1–2 new parts. Insert latency: ~1–3 seconds (bounded by kafka_flush_interval_ms = 2000).End-to-end: ~5 seconds. Well inside the 5-min SLA, with 60× headroom.
If a connectivity event causes 100 users to sync at once, the total is 1.5M events. The pipeline behaves as follows:
The genuinely concerning scenario is not a one-time burst but sustained high-rate writes — e.g., a viral marketing event that 50× our user count overnight. The bottlenecks we'd encounter in order:
epoch_data, another for everything else).| Setting | Default | Our value | Why |
|---|---|---|---|
Debezium max.batch.size | 2048 | 8192 | Bigger batches, less per-event CPU. |
Kafka topic partitions (epoch_data) | 1 | 24 | Allows ClickHouse parallel consume up to 24 threads. |
Kafka linger.ms | 0 | 50 | Better batching at the producer. |
CH kafka_max_block_size | 1M | 65,536 | Predictable batch size avoids huge parts during bursts. |
CH kafka_flush_interval_ms | 7500 | 2000 | Tighter latency at normal load. |
CH parts_to_throw_insert | 3000 | (unchanged, monitor) | Safety net; if hit, inserts fail loud. |
CH max_partitions_per_insert_block | 100 | 100 | Bursts of week-old data span multiple historical partitions. |
Schema changes are an unavoidable reality. The pipeline must handle them gracefully, even though we don't have a separate schema registry on day one.
Debezium captures every DDL event in MySQL and emits it to a dedicated schema-history Kafka topic. The downstream behavior depends on the kind of change:
| Change | Default behavior | Action required |
|---|---|---|
| ADD COLUMN (nullable) | Debezium emits events with the new field; ClickHouse Kafka engine ignores unknown fields by default | None on day one. To make the new column queryable, ALTER it onto the ClickHouse table and update the ingest MV. |
| ADD COLUMN (NOT NULL, with default) | Same as above. The default value flows through Debezium events. | Same as above. |
| DROP COLUMN | Debezium stops including the field; existing CH column stays as-is | Optional cleanup: ALTER DROP COLUMN on ClickHouse to reclaim storage. |
| RENAME COLUMN | Treated as DROP + ADD by Debezium | Manual: ALTER RENAME on ClickHouse, then redeploy ingest MV with new field name. |
| CHANGE TYPE (widening, e.g. INT → BIGINT) | Debezium handles transparently if compatible; otherwise emits schema change event | ALTER MODIFY COLUMN on CH if the type widens (cheap); if narrowing, requires migration plan. |
| ADD TABLE | Debezium picks it up if it matches table.include.list pattern; otherwise ignored |
Add table to include list, restart connector, optionally trigger incremental snapshot. |
| DROP TABLE | Existing data in CH stays; new events stop flowing | Decide whether to retain history in CH or drop the mirror table. |
For day one (small team, controlled source changes), schema evolution is handled manually:
This works because the team is small enough that announcements happen in shared channels and changes are infrequent (weekly at most). Once the team grows or change frequency increases, this becomes a bottleneck — at which point we add:
The most dangerous case is a MySQL schema change that looks safe but breaks downstream — e.g., changing a column from BIGINT to INT when some existing values exceed the new range. Debezium will emit events as the new type; ClickHouse insert will fail; the consumer will fall behind.
Mitigation: a DLQ (dead-letter queue) topic for failed inserts, with an alert on non-zero message count. The DLQ lets the pipeline keep moving for the other 99.99% of events while the offending change is investigated.
This section catalogues the failure modes we expect, in rough order of likelihood, with detection and recovery procedures for each. Most failures share a common detection signal — end-to-end freshness rising — and differ in what they require for recovery.
Likelihood: High (happens occasionally during normal operation).
Cause: Debezium has at-least-once delivery. If it crashes after emitting an event but before committing the Kafka offset, the event is re-emitted on restart.
Detection: No detection needed — system handles automatically.
Recovery: ReplacingMergeTree dedupes on next merge. No action required.
Likelihood: Medium.
Cause: Restart, hardware failure, disk fill, OOM.
Detection: Kafka consumer lag rises; freshness alert fires after ~5 min.
Recovery: Kafka holds events (7-day retention on epoch_data, 30 days on dims). Restart ClickHouse; the Kafka engine resumes consuming from the saved offset. Bounded blast radius: 7 days of writes before the binlog itself is consumed.
Likelihood: Medium.
Cause: OOM, MySQL connection loss, schema parsing error.
Detection: Producer rate from Debezium drops to zero; freshness alert fires.
Recovery: Auto-restart via systemd or Kubernetes. Connector resumes from last committed binlog offset (no data loss). If restart loop, check logs for schema errors (often the cause).
Likelihood: Medium (we control the source, but accidents happen).
Cause: Backend deploys a schema change without coordinating; ClickHouse Kafka engine fails to parse the new event shape.
Detection: DLQ topic accumulates messages; freshness alert fires.
Recovery: Apply matching ALTER on ClickHouse, then re-process the DLQ. For non-trivial changes, this is a 15–60 minute incident.
Likelihood: Medium.
Cause: Source data with values outside the expected range (e.g., NULL in a NOT NULL column, malformed UTF-8 in a string).
Detection: CH parse error → row routed to DLQ; alert on DLQ rate.
Recovery: Quarantine the affected row in epoch_data_dlq; investigate the source; manually correct or re-insert.
Likelihood: Medium under heavy load, low at normal load.
Cause: Insert rate exceeds merge capacity; system.parts count per partition climbs past ~300; eventually insert fails with TOO_MANY_PARTS.
Detection: Grafana panel on system.parts; alert at 250 per partition.
Recovery: Reduce kafka_flush_interval_ms (already at 2000), increase kafka_max_block_size (already at 65K), throttle non-critical consumers, or scale ClickHouse (add a shard). If unrecoverable in the short term, pause Kafka consumer to give merges time to catch up.
Likelihood: Low if managed; medium if self-hosted single-broker.
Cause: Hardware failure, OOM, disk fill.
Detection: Debezium producer errors; freshness alert fires.
Recovery: Debezium pauses gracefully (binlog buffers naturally in MySQL — 7-day retention). Restart Kafka broker; everything resumes.
Prevention: Move to 3-broker cluster in Phase 2.
Likelihood: Low (assuming 7-day binlog retention).
Cause: Debezium offline for >7 days, or binlog retention misconfigured.
Detection: BinlogNotFoundException in connector logs.
Recovery: Worst-case scenario. Trigger Debezium incremental snapshot for the affected tables — sends a signal via a Kafka topic that tells Debezium to scan the source table and emit synthetic insert events for all rows, while still processing live binlog changes in parallel. Takes hours for epoch_data, minutes for everything else.
Prevention: Monitor Debezium lag actively; alert at 12h sustained lag.
Likelihood: Low (depends on infrastructure).
Cause: MySQL replica we're reading from is promoted or replaced.
Detection: Connection reset in Debezium logs.
Recovery: With GTID mode on MySQL, Debezium reconnects to the new replica and resumes from the GTID set. Without GTID, it requires manual reset of the binlog offset to a known-good position on the new replica.
Prevention: Use GTID mode (already in MySQL prerequisites, §4.1).
Likelihood: Low.
Cause: Kafka Connect S3 Sink fails or slows.
Detection: Separate consumer lag metric for the archive sink.
Recovery: Restart sink; it resumes from offset. The archive is not on the critical path, so this is a "warn, fix in business hours" not a page.
The worst-case rebuild — needing to recreate the ClickHouse state from scratch — proceeds in this order:
earliest. The Kafka engine table will start re-consuming from the beginning of the topic's retention window (7–30 days, depending on table).INSERT INTO ... SELECT FROM s3(...)), then enable Kafka consume.POPULATE on a small data sample, then incrementally fill from the live stream).For a typical table this rebuild is a 30-min to 2-hour operation. For epoch_data at 1B+ rows, several hours to a day, dominated by S3 read + parsing time. This is the operation we run drills against — practising it once a quarter ensures we know how long it really takes.
Capacity decisions follow from a few simple numbers. The estimates here are conservative — real-world ClickHouse performance often exceeds them — but they're useful for sizing decisions.
| Metric | Phase 1 (500M rows) | Phase 2 (5B rows) |
|---|---|---|
| Uncompressed size (avg ~250 bytes/row in ClickHouse representation) | ~125 GB | ~1.25 TB |
| Compressed (10–15× with LZ4 + Delta codecs) | ~10–15 GB | ~80–125 GB |
| + Materialized views (~5% of fact) | ~1 GB | ~10 GB |
| + Replicated tables (×2 for HA in Phase 2) | — | ~180 GB |
| + ZooKeeper/Keeper state | ~100 MB | ~1 GB |
| + Headroom for merges (1.5×) | ~25 GB | ~270 GB |
| Total disk per node | ~30 GB | ~300 GB per shard |
| Metric | Baseline | Peak burst (100 simultaneous syncs) |
|---|---|---|
| Events/sec to MySQL | ~25 (2M/day average) | ~50,000 for 30 seconds |
| Debezium throughput required | ~100/sec | ~50,000/sec (sustainable) |
| Kafka write rate | ~25 msg/sec/partition | ~2,000 msg/sec/partition |
| ClickHouse insert rate | ~1 batch/2 sec | ~1 large batch / 2 sec |
| End-to-end freshness | 5–10 sec | 20–60 sec |
ClickHouse single-node on the Phase 1 spec (4 vCPU, 16 GB RAM) comfortably handles ~10 concurrent analytical queries, with the heaviest individual query latencies in the 100ms–2s range against MVs. Ad-hoc scans of the raw fact table can hit several seconds.
Phase 2 (3-node cluster, 8 vCPU each) handles ~30 concurrent users at similar latency. Past that, we'd either scale horizontally or add a read-only replica tier for dashboard queries.
Every component is sized for ~3× current peak. This gives ~12–18 months of organic growth before any component needs scaling, which is the natural cadence for revisiting infra decisions. If growth exceeds 3×, the alerts on end-to-end freshness will signal degradation well before any hard limit is hit.
All prices in USD; rounded to nearest dollar; based on Hetzner / Cloudflare R2 / Backblaze list pricing as of mid-2026. Substituting US-based equivalents (e.g., DigitalOcean for VMs, AWS S3) increases the totals by ~30–50% but doesn't change the relative comparisons.
| Item | Spec | Monthly |
|---|---|---|
| ClickHouse VM (Hetzner CCX23) | 4 vCPU, 16GB RAM, 240GB NVMe | $60 |
| Kafka / Redpanda VM | 4 vCPU, 16GB RAM, 160GB SSD | $60 |
| Debezium VM | 2 vCPU, 4GB RAM, 40GB SSD | $20 |
| S3 / R2 archive | ~200 GB Parquet, ~10 GB/mo growth | $5 |
| Backups (snapshots to R2, 30-day retention) | ~50 GB | $20 |
| Prometheus + Grafana (co-hosted) | — | $0 |
| Bandwidth (egress mostly internal) | — | $0 |
| Total | ~$165 / mo |
| Item | Spec | Monthly |
|---|---|---|
| ClickHouse cluster (3 × CCX33) | 8 vCPU, 32GB RAM, 480GB NVMe each | $330 |
| Kafka cluster (3 brokers) | 4 vCPU, 16GB RAM each | $180 |
| ClickHouse Keeper (2 nodes) | 2 vCPU, 4GB RAM each | $40 |
| Debezium HA (2 nodes) | 2 vCPU, 4GB RAM each | $40 |
| S3 / R2 archive | ~1.5 TB Parquet | $25 |
| Schema registry (Apicurio) | 2 vCPU, 4GB RAM | $20 |
| Backups (longer retention, more data) | ~200 GB | $40 |
| Monitoring (still self-hosted) | 2 vCPU, 8GB RAM | $40 |
| Total | ~$715 / mo |
For Phase 1 volume (60M events/month, ~500M rows), a representative managed stack:
| Item | Notes | Monthly |
|---|---|---|
| Snowflake compute (XS, ~6h/day) | For ETL + queries | ~$500 |
| Snowpipe Streaming ingest | 60M events/mo at ~$5/M | ~$300 |
| Snowflake storage | ~500 GB | $15 |
| Confluent Cloud Kafka | Small cluster | ~$400 |
| Fivetran CDC (priced per MAR) | Or Airbyte Cloud equivalent | ~$1,500+ |
| Total | ~$2,700+ / mo |
Same workload, ~16× the bill. The savings on the self-hosted stack are real but they come at the cost of ~5–10 hours/week of operational toil. The right tradeoff at Phase 1; less obvious at Phase 2; usually wrong at Phase 3 (when team time becomes the binding constraint).
Track these two metrics over time, and consider the managed-services tier when both cross a threshold:
Bringing this pipeline online from a greenfield state. Assumes MySQL is running, the data team is ready to consume from a new analytical store, and we have ~4 engineering-weeks of capacity to ship Phase 1.
app, data_source, epoch_data_type, etc.). These are small and let us validate the end-to-end flow without taking on burst-sync complexity.epoch_data and end_user to the Debezium connector. Trigger initial snapshot; this will take several hours for the 150M-row fact table.| Risk | Mitigation |
|---|---|
Initial snapshot of epoch_data impacts MySQL performance | Run snapshot from a replica, not the primary. Use snapshot.fetch.size to chunk reads. Schedule for off-peak. |
| Hidden schema constraints not captured in Appendix A | Pre-flight: dump and review actual MySQL schema. Account for any orphan FKs or non-standard collations. |
| Data team expects features not in scope | Publish the scope-and-deferral list (§14, §15) explicitly. Get sign-off before shipping. |
| Single VM failure during Phase 1 | Accept 7-day Kafka buffer; snapshot disk daily; restore from backup if needed. Phase 2 adds redundancy. |
The case study mentions: "Outline how to link clinical report data (e.g., PDF lab reports, insurance claim documents) to a user's wearable data so both can be queried together." This is a Phase 2/3 concern but worth sketching now.
Clinical reports are large, unstructured (PDFs), and arrive at a different cadence than wearable data. Wearable data is small-and-frequent; clinical data is large-and-rare. They need to be linkable by user ID and queryable together — e.g., "for users in our diabetes cohort, show their A1c history alongside their resting HR trend."
s3://thryve-clinical/end_user_id={id}/year={Y}/month={M}/{document_id}.pdf.dim_end_user on end_user_id; queries can compute correlations between extracted clinical fields and wearable aggregates.Very little. The CDC pipeline doesn't change. The clinical data ingestion is a separate pipeline (likely a periodic batch from the clinical-records provider) that lands metadata in a new ClickHouse table. The join surface is shared (end_user_id), so the data team's mental model stays consistent.
The AI team eventually needs reproducible point-in-time snapshots of the event data for training. Three options:
Several Appendix D queries ("detect HR spikes within 5 minutes of arrival") imply real-time alerting. ClickHouse can serve this today with a polling pattern:
-- Run every 30 seconds; check for HR spikes in the last 10 minutes SELECT end_user_id, max(long_value) AS peak_hr, ... FROM epoch_data WHERE data_type_id = 3000 AND created_at >= now() - INTERVAL 10 MINUTE AND long_value > ( SELECT avgMerge(avg_hr) + 2 * stddevSampMerge(stddev_hr) FROM mv_user_hr_baseline_30d WHERE end_user_id = epoch_data.end_user_id ) GROUP BY end_user_id;
For sub-second alerting (e.g., as a feature of the consumer app), introduce Flink between Kafka and ClickHouse: Flink does the windowed comparison in memory and pushes alerts to a notification topic. This is the natural Phase 2/3 addition for real-time product features.
The architecture above is robust to the constraints in the case study, but real-world refinement depends on details we'd want to confirm with the team. These are the questions I'd ask in the kickoff conversation.
binlog_row_image); Aurora has its own CDC offering worth considering.access_token and email considered PII that must be tokenized before they leave MySQL? Affects whether we add an SMT at Debezium or accept them in ClickHouse with column-level encryption.