Section 1

Executive Summary

Thryve collects sensor data from consumer wearable devices and stores it in a MySQL database. As the data and AI teams grow, they need an analytical store that is updated in near real-time, supports the burst-write pattern characteristic of wearable syncs, and stays affordable while the company scales from 150M to 1B+ rows.

This document specifies a streaming Change Data Capture (CDC) pipeline built around three open-source components — Debezium, Apache Kafka, and ClickHouse — that delivers end-to-end freshness in 5–30 seconds (versus the 5-minute SLA), handles updates and deletes correctly via ReplacingMergeTree semantics, and costs roughly $165/month in Phase 1 growing to ~$720/month at 5B rows. The same workload on a managed warehouse stack (Snowflake + Confluent + Fivetran) would cost approximately 15× more at this stage.

Headline decisions

Layer	Choice	Rationale (one line)
Change capture	Debezium MySQL connector	Binlog-based CDC — only path that propagates updates and deletes correctly without a source schema change.
Transport	Apache Kafka (Redpanda viable)	Absorbs burst syncs, decouples source from sink, gives replay for free.
Analytical DB	ClickHouse	Columnar, >10× compression on sensor data, native Kafka engine, $0 license.
Aggregation	ClickHouse Materialized Views	Incremental on insert — removes the need for a separate stream processor today.
Archive	S3 / Cloudflare R2	Cheap long-term backup of raw events; future AI training corpus.
Observability	Prometheus + Grafana	Free, with first-class exporters for every component in the stack.

What we explicitly defer

Several common pieces are not in the day-one design — not because they're wrong, but because they don't earn their complexity yet. Each is easier to add later than it would be to remove if added prematurely:

Stream processor (Flink / Spark Streaming). Add when we need sub-second anomaly detection — ClickHouse handles 5-min windowed alerts comfortably.
Lakehouse layer (Iceberg + Trino). Add when AI training requires reproducible point-in-time snapshots over the raw event log.
Schema registry. Add when more than one consumer depends on schema stability.
Managed services (Confluent Cloud, ClickHouse Cloud). Add when ops time exceeds the rent saved by self-hosting.

What I'd want to know before shipping

The architecture is robust to the constraints in the case study, but several stakeholder decisions could refine it materially — chiefly the GDPR erasure SLA, the actual peak write QPS on MySQL, and whether the data team needs sub-second or few-second query latency. These are catalogued in §15.

Section 2

Problem Analysis

2.1 Source-of-truth shape

The MySQL backend stores six logical entities, with one fact table (epoch_data) holding ~99.99% of the row count. The dimensions are tiny and slow-changing, dominated by a 500K-row end_user table. This shape — one heavy fact, several thin dimensions — is the textbook profile for a columnar analytical store, and it gets favorable treatment in every layer of the design.

Table	Rows	Profile	Implications
`epoch_data`	150M → 1B+	High-frequency, burst writes (~15K rows per user sync), 2M/day baseline	Drives all storage, ingest, and partitioning decisions.
`end_user`	500K	Low-frequency writes, PII-bearing	Needs GDPR-compliant deletion. Joins to fact via `end_user_id`.
`app`	230	Tenant configuration (B2B multi-tenant)	Denormalized into fact for tenant pruning at query time.
`epoch_data_type`	500	Reference; rarely changes	Loaded as ClickHouse dictionary for fast in-query lookup.
`data_source`	50	Reference; rarely changes	Same as above.
`generation_*`, `trustworthiness`	<100 each	Static	Dictionary-backed.

2.2 The constraints that drive the design

Burst syncs

A wearable that's been offline for a week may sync ~15,000 epoch_data rows in a single request. These rows have old start_timestamp values (the measurement times) and a new created_at value (the sync time). The pipeline must:

absorb the spike without inducing lag on other tenants (Kafka buffering);
route the rows to historical partitions on the analytical side (partition by start_timestamp, not created_at);
trigger pre-aggregation updates for those old partitions correctly (incremental materialized views, not snapshot-based aggregates).

Freshness within 5 minutes

This requirement rules out hourly micro-batch ETL but is comfortably satisfied by a streaming pipeline. In practice, the architecture below achieves end-to-end freshness of 5–30 seconds at normal load and 1–3 minutes under burst conditions. The 5-minute SLA gives operational headroom for transient ClickHouse maintenance or single-broker Kafka restarts.

Updates and deletes must propagate

This single requirement eliminates two otherwise-attractive options:

Incremental polling (querying MySQL for rows where created_at > last_seen): does not capture updates to existing rows and never sees deletes. There is no updated_at column in the source schema, so this would be doubly broken.
Full snapshots (periodic SELECT * dumps): can capture state but cannot meet 5-minute freshness on a 150M-row fact table without saturating MySQL.

Binlog-based CDC is the only mechanism that satisfies all three correctness requirements (inserts, updates, deletes) at the required latency.

Scale: 150M → 1B+ rows

At 1B rows with the sensor-data profile (mostly small integers and floats, with high inter-row correlation), ClickHouse typically achieves >10× compression — putting on-disk size at 30–80 GB. This fits on a single node well past the 1B threshold, deferring sharding complexity to a later phase. Snowflake/BigQuery would handle this easily too; the question is cost, not feasibility (see §12).

Multi-tenancy (230 apps)

Most analytical queries are scoped to a single app (a Thryve customer). Tenant-scoping must be physically efficient in the storage layout — accomplished by including app_id as the first key in the fact table's ORDER BY, allowing ClickHouse to skip any data not belonging to the queried tenant.

Implicit constraint: regulated data

The case data shows a German user (Maria Schmidt, Europe/Berlin). Even without explicit mention, wearable health data in a European context is governed by GDPR. This affects the design in three concrete places: (a) deletes must propagate end-to-end and be physically removable on demand (§7); (b) PII columns like email and access_token warrant encryption at rest and access logging; (c) data residency may pin component deployment to EU regions, affecting cloud-provider choice. HIPAA scope (for any US customers) would tighten these further; this is one of the questions in §15.

2.3 Workload patterns the analytical DB must serve

Appendix D of the case enumerates representative queries. They cluster into three patterns:

Pattern A — Per-user time-series scans

"What is the 90-day trend of resting heart rate for user X?" These queries read a contiguous slice of epoch_data filtered by end_user_id, data_type_id, and a date range. The fact-table ORDER BY (app_id, end_user_id, data_type_id, start_timestamp) reduces these to localized, mostly-sequential reads — the cheapest pattern on ClickHouse.

Pattern B — Cohort aggregations

"How does average HRV compare across users with condition X?" These touch many users but typically a single data_type_id over a recent time window. Solved cleanly by materialized aggregates partitioned by month: mv_daily_hrv_per_user rolls epoch-level data into one row per user per day, after which cohort summaries are a few thousand rows of aggregation rather than millions of raw events.

Pattern C — Data-quality and operational queries

"Which users have HR gaps > 24h in the last week?" / "What % of rows arrive > 48h after measurement?" These are global scans over recent partitions. They are not user-scoped, so they don't benefit from the user-level ordering — but they only need recent partitions (small disk footprint) and the columnar layout means scanning only the columns they care about (start_timestamp, created_at, data_type_id) is cheap.

Pattern A and B together account for the vast majority of expected queries. Pattern C is operational and tolerates higher latency. The architecture optimizes for A and B, and accepts that C will use more compute per query.

2.4 What the analytical DB will not be doing

It is useful to be explicit about workloads the design does not need to support, to avoid carrying their costs:

Transactional consistency. The analytical store is eventually-consistent with MySQL. Reads may see a row that was deleted seconds ago in MySQL, until the CDC event propagates and the next merge runs. For analytical purposes this is acceptable; for product features that need strict consistency, those reads stay on MySQL.
Sub-millisecond point lookups. ClickHouse is excellent at scans and aggregations, less so at OLTP-style single-row lookups. Product features requiring this stay on MySQL or a key-value cache layer.
Heavy multi-table JOINs. Star-schema JOINs across many large dimensions are an area where Snowflake / BigQuery shine more than ClickHouse. We mitigate by denormalizing app_id into the fact and using dictionaries for the small reference tables, avoiding JOINs at query time for the common cases.

The design principle behind these tradeoffs: ClickHouse is being chosen for what it does best — fast scans and aggregations over time-ordered columnar data — and the schema is shaped to keep queries in that sweet spot. The day a query workload emerges that fights ClickHouse's grain, the right answer is to add a complementary tier (a key-value cache, a lakehouse for ad-hoc JOINs), not to rebuild the foundation.

Section 3

Architecture Overview

The pipeline is six components arranged in a single linear flow, with a side branch to object-storage archive and a cross-cutting observability plane. Each component has a single job, a clear contract with its neighbors, and a well-understood failure mode.

Why this shape

The architecture is consciously linear. There is exactly one production path from each MySQL change to its representation in ClickHouse, and the path passes through systems that are individually simple, individually resilient, and well-understood by the operations community. The cost of this conservatism is that we forgo certain "exotic" optimizations — e.g., direct MySQL-to-ClickHouse replication (skipping Kafka) is possible via tools like Sinkhole or ClickHouse's MaterializedMySQL engine, but it sacrifices the replay-from-buffer capability that has saved us in nearly every incident on similar pipelines.

What flows where

MySQL → Debezium: Binary log (binlog) in ROW format. Debezium reads it via the MySQL replication protocol, presenting itself as a replica.
Debezium → Kafka: One topic per source table, prefixed cdc.thryve.. Each message keyed by the source primary key to preserve per-row ordering within a partition.
Kafka → ClickHouse: Pulled by ClickHouse's Kafka engine table. A materialized view inside ClickHouse moves rows from the Kafka engine into the persistent MergeTree.
Kafka → S3/R2: Kafka Connect S3 Sink writes the raw event stream as hourly Parquet files. Pure failsafe + future AI corpus, not in the critical path.
Observability: Every component exposes Prometheus metrics. End-to-end freshness is computed as max(now() - max(created_at)) per table and is the SLA dashboard's headline metric.

3.1 Pipeline phases over time

The architecture is intentionally staged so that the same blueprint scales from single-node Phase 1 to a 3-shard Phase 2 cluster without rework. The phasing is driven by row count and concurrent-user count, not by calendar time.

Phase	Trigger	What changes	What stays the same
Phase 1 0 – 500M rows	Day-one shipping volume	Single-node ClickHouse · single-broker Kafka (Redpanda viable) · Debezium standalone	Schemas · topic layout · MV definitions · monitoring queries
Phase 2 0.5B – 5B rows	CH disk > 60%, query latency drifting; or write-rate > 10K events/sec sustained	3-node ClickHouse with sharding by `app_id` · 3-broker Kafka · 2-node Debezium HA · add schema registry	Logical schemas · query interface · per-table strategy
Phase 3 5B+ rows	AI workloads need raw event log; or consumer concurrency exceeds CH capacity	Add Iceberg-on-S3 tier fed by Kafka Connect · introduce Trino for ad-hoc cross-domain JOINs · evaluate dbt Cloud	The CDC layer (Debezium + Kafka) feeds both tiers — no source-side rework

The intent: No phase change requires rebuilding earlier work. Each phase adds capacity without invalidating the schemas, query patterns, or operational runbooks established in the prior phase.

3.2 What we explicitly do not include in the day-one architecture

No separate stream processor. Apache Flink or Spark Streaming would let us do complex windowed joins across topics, sub-second anomaly detection, and stateful transformations. None of the case's requirements need this today. ClickHouse materialized views cover the aggregation cases; cross-topic logic can be added later when the product needs it.
No managed control plane. Confluent Cloud, AWS MSK, ClickHouse Cloud — all are excellent and remove ~10 hours/week of operational toil at scale. They also cost 2–4× more. Worth revisiting when the team's operational capacity is the binding constraint.
No data lake tier. S3/R2 archive is a flat dump of raw events, not a queryable Iceberg/Delta Lake. Adding the table format is a clear upgrade path for Phase 3, but introducing it early would add governance, schema-registry, and partitioning complexity for no current benefit.
No reverse ETL. Pushing aggregated data back into MySQL or operational systems is sometimes useful but creates a feedback loop that complicates CDC. Out of scope.

Section 4

Component Deep Dive

4.1 Debezium · Change Data Capture

Debezium is the de-facto open-source CDC framework. The MySQL connector presents itself to MySQL as a replica (using a dedicated replication user), reads the binary log (binlog) in ROW format, and emits a structured event per change. The events include both before and after states for updates, and emit explicit delete events.

Why Debezium specifically

Mature MySQL support. Production-tested at companies running databases far larger than ours. Handles MySQL quirks (schema changes, GTID gaps, long-running transactions) more reliably than rolling our own binlog reader.
Resumable from offset. Persists its last-read binlog position to Kafka. On restart, it resumes from exactly where it left off — no duplicate snapshot, no gap.
Incremental snapshots. For backfilling individual tables without taking the connector offline. Critical for the "binlog purged before we read it" recovery scenario.
Single-Message Transforms (SMTs). Allow us to flatten the Debezium envelope, route by table, mask PII, etc. — without writing a separate stream processor.

Critical configuration

# Debezium MySQL connector — production settings
name: thryve-mysql-cdc
connector.class: io.debezium.connector.mysql.MySqlConnector
tasks.max: 1             # single-threaded by design

# Source connection
database.hostname: mysql-replica.internal
database.port: 3306
database.user: debezium_cdc
database.password: ${file:/run/secrets/dbz-pw:value}
database.server.id: 184054   # unique among MySQL replicas
database.include.list: thryve
table.include.list: thryve.app, thryve.end_user, thryve.epoch_data,
                     thryve.epoch_data_type, thryve.data_source,
                     thryve.generation_type, thryve.generation_method,
                     thryve.trustworthiness

# Snapshot mode — initial snapshot, then tail binlog forever
snapshot.mode: initial
snapshot.locking.mode: none     # MVCC, no table locks
snapshot.fetch.size: 10000

# Throughput tuning for the bursty epoch_data table
max.batch.size: 8192
max.queue.size: 32768
poll.interval.ms: 500

# Topic naming & partitioning
topic.prefix: cdc.thryve
message.key.columns: thryve.epoch_data:id;thryve.end_user:id

# Flatten the Debezium envelope and surface delete tombstones.
# Without unwrap, every event is wrapped in {before, after, source, op}.
transforms: unwrap
transforms.unwrap.type: io.debezium.transforms.ExtractNewRecordState
transforms.unwrap.drop.tombstones: false
transforms.unwrap.delete.handling.mode: rewrite
transforms.unwrap.add.fields: op,source.ts_ms,source.pos,source.file

# Schema history (where DDL events are persisted)
schema.history.internal.kafka.bootstrap.servers: kafka:9092
schema.history.internal.kafka.topic: cdc.thryve.schema-history

MySQL prerequisites

Debezium requires the source MySQL instance to satisfy a few conditions. These should be verified before the connector is deployed:

log_bin = ON and binlog_format = ROW (statement-based binlog cannot reconstruct individual row changes reliably).
binlog_row_image = FULL — emit complete before/after row images, not just changed columns. Required for ReplacingMergeTree to dedupe correctly.
gtid_mode = ON recommended — simplifies failover and resumption.
Binlog retention ≥ 7 days (binlog_expire_logs_seconds = 604800). This is the recovery window for a Debezium outage.
A dedicated MySQL user with REPLICATION CLIENT, REPLICATION SLAVE, and SELECT on the source schemas. (No write permissions.)
Connector reads from a replica, not the primary, to avoid loading the OLTP server with binlog read traffic.

What the events look like

After the unwrap SMT is applied, an event for an epoch_data insert looks roughly like this:

{
  "id": 991200,
  "end_user_id": 42817,
  "data_type_id": 3000,
  "data_source_id": 5,
  "start_timestamp": 1741003200000,
  "end_timestamp": 1741003260000,
  "long_value": 72,
  "created_at": 1741254720000,
  "timezone_offset": 60,
  "__op": "c",                         // c = create, u = update, d = delete
  "__source_ts_ms": 1741254720123,
  "__source_pos": 8742019,            // binlog position — our _version
  "__source_file": "binlog.000142",
  "__deleted": "false"
}

The crucial fields for downstream processing are __source_pos (used as the _version in ClickHouse's ReplacingMergeTree — monotonically increasing within a binlog file) and __deleted (set to "true" by the delete.handling.mode = rewrite SMT for tombstone events).

Failure modes specific to Debezium

Failure	Detection	Recovery
Connector crash	Kafka Connect REST `/connectors/{name}/status` = FAILED	systemd auto-restart; connector resumes from last committed binlog offset (no data loss).
MySQL binlog purged before Debezium reads it	`BinlogNotFoundException` in connector logs	Manual: trigger incremental snapshot for affected tables via Kafka topic signal.
Schema change to non-nullable column	Connector pauses with schema error	Manual: apply matching ALTER on ClickHouse, then resume.
MySQL replica failover	Connection reset; auto-reconnect	With GTID mode, automatic. Without GTID, requires manual offset reset.
Source DB password rotation	Auth errors in logs	Update connector secret; restart task.

4.2 Apache Kafka · Transport and Buffer

Kafka sits between Debezium and ClickHouse as a durable, replayable buffer. It is the component that absorbs burst syncs without backpressuring MySQL, that lets ClickHouse be restarted for maintenance without data loss, and that gives us replay if we ever need to rebuild a ClickHouse table from history.

Why a broker, not direct push

Debezium can write directly to many sinks via Debezium Server (Kinesis, Pulsar, HTTP). For a startup, the temptation to skip Kafka and go straight to ClickHouse is real — one less system to operate. The reasons we keep Kafka in the path anyway:

Replay capability. If we discover a bug in our materialized view definition, or want to migrate to a new ClickHouse schema, we just reset the consumer offset and let the data re-flow. Without Kafka, the only replay source is the MySQL binlog itself — which has 7-day retention, not 6 months.
Decoupling. When ClickHouse needs an offline maintenance window, Kafka keeps buffering writes. Debezium continues reading the MySQL binlog (which is cleared more aggressively). The system is robust to multi-hour ClickHouse outages without data loss.
Fan-out. The same event stream feeds the ClickHouse pipeline, the S3 archive sink, and any future consumers (e.g., a Flink anomaly detector) without re-reading MySQL.

Redpanda as a drop-in

Redpanda is a Kafka-compatible broker written in C++ — single binary, no Zookeeper, lower memory footprint, often better tail latency. For Phase 1 (single-broker), Redpanda is meaningfully easier to operate and produces nearly identical guarantees. I'd recommend it as the starting point and migrate to Apache Kafka only if specific Kafka-ecosystem tools (KSQL, certain connectors) become needed.

Topic layout

Topic	Partitions	Retention	Reason
`cdc.thryve.epoch_data`	24	7 days	Highest-volume table. 24 partitions allow parallel ClickHouse consume up to 24 consumer threads.
`cdc.thryve.end_user`	3	14 days	Low volume but PII-relevant; longer retention to permit re-verification of GDPR deletes.
`cdc.thryve.app`	1	30 days	Tenant config; rarely changes, but want long replay window.
`cdc.thryve.data_source` `cdc.thryve.epoch_data_type` `cdc.thryve.generation_*` `cdc.thryve.trustworthiness`	1 each	30 days	Reference data; one partition suffices, long retention is cheap given low volume.

Partitioning is by the source primary key. For epoch_data, this means events for the same row always land on the same partition, preserving in-row ordering. We do not partition by end_user_id because that would create skew (some users sync orders-of-magnitude more data than others), and because cross-row ordering is not required for correctness — ReplacingMergeTree resolves the final state from _version.

Producer / consumer settings

Setting	Value	Reason
Producer `acks`	`all`	Wait for all in-sync replicas — durability over latency.
Producer `compression.type`	`zstd`	Best compression ratio for repeated JSON keys.
Producer `linger.ms`	`50`	Allow micro-batching at the producer for better throughput.
Topic `min.insync.replicas`	`2`	Tolerate 1 broker failure without losing durability.
Consumer `isolation.level`	`read_committed`	Only read committed records — avoids reading transactional aborts.

4.3 ClickHouse · Analytical Database

ClickHouse is the most consequential choice in the architecture. It owns query latency, storage cost, and most of the operational complexity. Several alternatives could work — see §4.3.4 for the comparison — but ClickHouse aligns better than any of them with the specific shape of this workload: time-ordered, append-heavy, columnar-friendly, with a strong preference for predictable cost over elastic scale.

4.3.1 Why ClickHouse for this workload

Columnar storage + compression. The epoch_data rows have high column-wise redundancy (same end_user_id appearing for hundreds of consecutive rows; same data_type_id for thousands; integer-typed values with narrow ranges). ClickHouse's column codecs (LZ4 for hot data, ZSTD for cold) plus its delta-of-delta encoding for sorted columns typically achieve 10–20× compression on this kind of data. At 1B rows the uncompressed footprint would be ~250 GB; compressed, we're looking at 15–30 GB.
Native Kafka integration. The Kafka engine table acts as a consumer; a materialized view on top of it moves rows into the persistent MergeTree table. We don't operate a separate ETL process — the consume + insert loop runs inside ClickHouse.
MergeTree family. The data engine itself supports CDC semantics: ReplacingMergeTree dedupes by primary key keeping the highest-versioned row; AggregatingMergeTree maintains running aggregates incrementally. Both are designed for the append-heavy + late-update pattern we have.
Materialized views are incremental. A view defined as SELECT ... FROM epoch_data GROUP BY ... recomputes only for the rows in each new insert — not the whole table. This is what makes "compute the daily aggregate on every insert" performant rather than catastrophic.
TTL. Declarative data lifecycle. We can say "delete epoch_data partitions older than 2 years" or "move them to S3 after 90 days" with a one-line TTL clause, no separate batch job needed.
Cost. ClickHouse is open-source and runs comfortably on a single commodity VM through ~500M rows of this data. The equivalent workload on Snowflake or BigQuery would cost an order of magnitude more.

4.3.2 What ClickHouse gives up

Eventual-consistency reads. ReplacingMergeTree merges in the background; until the merge runs, both old and new versions of a row coexist in the table. Reads have to either tolerate this or pay the cost of FINAL / argMax dedup (see §7).
Heavy multi-table JOINs. ClickHouse's JOIN implementation is improving rapidly but is still less mature than Snowflake's or BigQuery's. For analytical patterns dominated by large-table-to-large-table JOINs, those warehouses can be a better fit. Our workload is dominated by per-user time-range scans on a single denormalized fact, which plays to ClickHouse's strengths.
Schema evolution. ALTERs are mostly supported but require care — renaming a column in a partitioned table can be expensive. We work around this by treating the ClickHouse schema as a derived artifact: schema changes happen at the source first, then are applied to ClickHouse with awareness of cost.
Operational complexity at scale. A 3-node sharded ClickHouse cluster with replication requires careful operation (ZooKeeper or ClickHouse Keeper, distributed DDL, replica reconciliation). Manageable, but a non-trivial step up from a single-node deployment. The Phase 2 transition is the moment to evaluate ClickHouse Cloud or Altinity.

4.3.3 ClickHouse-specific configuration

A handful of settings are particularly important for our workload:

Setting	Value	Reason
`max_partitions_per_insert_block`	100	Allow burst inserts to span multiple monthly partitions when a user syncs week-old data.
Kafka engine `kafka_max_block_size`	65536	Cap per-batch insert size to avoid "too many parts" errors during sustained load.
Kafka engine `kafka_flush_interval_ms`	2000	Tighter than default 7500ms — keeps freshness under 5s at normal load.
Merge thread count	tune to ~50% of available cores	Background merges must keep up with insert rate to prevent part accumulation.
Codecs on time columns	`Codec(Delta, ZSTD(3))`	Delta encoding shrinks monotonically-increasing timestamps to a few bits per row.
Codecs on ID columns	`Codec(T64, ZSTD(3))`	T64 codec is highly effective for unsigned integers with narrow ranges.

4.3.4 Alternatives considered and why I didn't choose them

Candidate	Verdict
Snowflake	The best developer experience in the analytical-DB space. Its separation of storage from compute and excellent JOIN support are genuinely superior. The case against it for this workload is cost: Snowpipe Streaming inserts at our event rate, combined with the warehouse credits consumed by even modest concurrent queries, easily reaches $2–3K/month at 1B rows. That's defensible at a Series B startup but not a Seed/Series A one. Reconsider when the data team has >5 concurrent users.
BigQuery	Similar shape to Snowflake — excellent for ad-hoc analytics, expensive for streaming ingest. Strongest if the team is already on GCP and would benefit from BigQuery's tight integration with the rest of the Google data stack. Same cost objection as Snowflake.
TimescaleDB / Postgres	An attractive option up to roughly 500M rows of time-series data. Compression (via TimescaleDB's compressed chunks) is good but not at ClickHouse levels. The selling point is JOIN flexibility and the comfort of Postgres ergonomics. The breaking concern is query performance on the heaviest analytical patterns at 1B+ rows: per-user scans hold up, but cohort aggregations slow down measurably. Best choice if the team strongly prefers Postgres familiarity and is confident growth will be slow.
Apache Druid	Purpose-built for real-time OLAP. Excellent ingestion throughput and sub-second queries at scale. Operationally significantly heavier — coordinator, brokers, historicals, middle managers, deep storage — and its SQL surface, while improved, is still less expressive than ClickHouse's. The case for Druid is strongest when sub-second query latency at high concurrency is the dominant requirement. Our 5-min freshness + few-second query SLA doesn't push us into that territory.
Apache Pinot	Similar value proposition to Druid; some teams find it easier to operate. Same caveat applies — operational complexity exceeds what a startup data team typically wants to absorb day-one.
Iceberg + Trino / DuckDB	The "modern lakehouse" answer. Strong for AI training workloads (Spark / Polars / DuckDB can all read Iceberg directly). Doesn't solve the 5-min freshness requirement on its own — needs a streaming layer feeding it. A natural Phase 3 addition, not a Phase 1 base.
MySQL read replica	Trivial to set up but doesn't solve the analytical query problem. A scan over 1B rows of `epoch_data` is hours on row-oriented storage. Mentioned only to be ruled out explicitly — sometimes proposed by stakeholders unfamiliar with the OLTP / OLAP distinction.

4.3.5 The Kafka → ClickHouse pattern

ClickHouse consumes Kafka via a specific three-table pattern:

-- 1. The Kafka engine table acts as a consumer.
--    Reading from it consumes from Kafka, so it's not directly queried.
CREATE TABLE kafka_epoch_data (
  id UInt64,
  end_user_id UInt32,
  data_type_id UInt32,
  start_timestamp Int64,
  long_value Nullable(Int64),
  double_value Nullable(Float64),
  -- ... other value columns
  __source_pos UInt64,
  __deleted String
) ENGINE = Kafka
SETTINGS
  kafka_broker_list = 'kafka:9092',
  kafka_topic_list = 'cdc.thryve.epoch_data',
  kafka_group_name = 'ch_epoch_data_consumer',
  kafka_format = 'JSONEachRow',
  kafka_num_consumers = 8,
  kafka_max_block_size = 65536,
  kafka_flush_interval_ms = 2000;

-- 2. The persistent storage table (the one queries actually hit).
CREATE TABLE epoch_data (
  id UInt64,
  end_user_id UInt32,
  app_id UInt16,
  -- ... full schema (see §6)
  _version UInt64,
  _deleted UInt8
) ENGINE = ReplacingMergeTree(_version)
PARTITION BY toYYYYMM(start_ts)
ORDER BY (app_id, end_user_id, data_type_id, start_ts, id);

-- 3. The materialized view that moves rows from (1) to (2).
--    Each time the Kafka engine table receives a batch, this view's SELECT
--    runs against the batch and inserts the result into epoch_data.
CREATE MATERIALIZED VIEW mv_ingest_epoch_data TO epoch_data AS
SELECT
  id, end_user_id,
  dictGetUInt16('dim_end_user', 'app_id', end_user_id) AS app_id,
  data_type_id, data_source_id,
  toDateTime64(start_timestamp / 1000, 3, 'UTC') AS start_ts,
  toDateTime64(end_timestamp / 1000, 3, 'UTC') AS end_ts,
  long_value, double_value, boolean_value, string_value,
  __source_pos AS _version,
  if(__deleted = 'true', 1, 0) AS _deleted
FROM kafka_epoch_data;

The dictGetUInt16 call denormalizes app_id from the end_user dimension at write time, so queries never need to JOIN to retrieve it. This is the single most important denormalization in the schema — it converts an expensive JOIN into a free column lookup, at the cost of recomputing app_id on every insert (cheap, since the dictionary is in memory).

4.4 S3 / R2 · Raw Event Archive

The archive tier is not on the critical path — no production query reads from it — but it earns its place for three reasons:

Disaster recovery. If both ClickHouse and Kafka are lost (worst-case multi-region outage, or a destructive operational error), we can rebuild the ClickHouse state from the archive without touching MySQL.
Replay window extension. Kafka's retention is 7–30 days; the archive extends this to "forever, effectively." Useful for replaying months of history when introducing a new materialized view that needs to be backfilled.
AI training corpus. Future ML models need stable, reproducible snapshots of event data — not the live mutating analytical store. A Parquet archive partitioned by date is the standard format for this.

Cloudflare R2 vs AWS S3

S3 is the obvious default. R2 is meaningfully cheaper for this use case because egress is free — and the archive will be read from many directions (ClickHouse for backfills, training jobs, ad-hoc analysts). At ~$0.015/GB/month storage and zero egress, R2 is roughly 1/3 the cost of S3 for a workload that reads as well as writes. The compatibility with the S3 API is essentially complete; the Kafka Connect S3 Sink works against R2 with no code change.

The argument against R2 is data-residency concerns (R2's regions are more limited than S3's) and the absence of certain advanced features (lifecycle policies, Glacier-style tiering). For Phase 1, neither applies meaningfully.

Sink configuration

name: s3-sink-epoch-data
connector.class: io.confluent.connect.s3.S3SinkConnector
topics: cdc.thryve.epoch_data, cdc.thryve.end_user, ...
s3.bucket.name: thryve-cdc-archive
s3.region: auto                          # R2 uses 'auto'
store.url: https://<account>.r2.cloudflarestorage.com

# Hourly partitioning, by event time not arrival time
flush.size: 100000
rotate.interval.ms: 3600000             # 1 hour
timestamp.extractor: RecordField
timestamp.field: source.ts_ms
path.format: "'topic='${topic}/'y='YYYY/'m='MM/'d='dd/'h='HH"
format.class: io.confluent.connect.s3.format.parquet.ParquetFormat
parquet.codec: snappy

Storage layout

Files land at paths like:

s3://thryve-cdc-archive/topic=cdc.thryve.epoch_data/
  year=2026/month=05/day=23/hour=14/
    cdc.thryve.epoch_data+12+0000000000412198192.parquet

The filename includes the topic, partition, and starting Kafka offset, making it trivial to reason about completeness. Parquet with Snappy compression typically gives ~5–8× compression on this event schema.

Lifecycle and cost

At 2M events/day with ~150 bytes per event compressed, the archive grows at ~10 GB/month. At Phase 2 burst rates (~10M events/day), that becomes ~50 GB/month. Even after 5 years, the archive is well under 5 TB — under $80/month at R2 list pricing. Lifecycle policies can move very-old data to colder tiers, but the economics rarely justify the operational complexity at our scale.

4.5 Observability · Prometheus, Grafana, Alertmanager

The observability stack is intentionally boring. Prometheus scrapes metrics from each component's built-in exporter; Grafana visualizes; Alertmanager routes pages. None of this is novel, all of it is well-supported by every component in the pipeline.

The headline metric: end-to-end freshness

The single most important metric is the end-to-end lag between when a row is written to MySQL and when it becomes queryable in ClickHouse. Implementation:

-- Run every 30 seconds via a Grafana → ClickHouse query
SELECT
  'epoch_data' AS table,
  now() - max(created_at) AS freshness_seconds
FROM epoch_data
WHERE created_at >= now() - INTERVAL 10 MINUTE;

This metric collapses every possible pipeline failure into one number. If MySQL is healthy, Debezium is healthy, Kafka is healthy, and ClickHouse is healthy, freshness will be 5–30 seconds. Anything else and freshness will trend upward, hit the 5-minute threshold, and page.

Layered diagnostic metrics

Freshness tells us that something is wrong; the layered metrics tell us what:

Layer	Metric	What it means when red
MySQL	`mysql_global_status_innodb_log_lsn_current`	Write rate. Rising fast = sustained heavy load on source.
Debezium	`debezium_metrics_milliseconds_behind_source`	How far behind binlog Debezium is. Above ~1s = read bottleneck.
Debezium	`debezium_metrics_total_number_of_events_seen`	Event throughput. Flat-lining = connector stuck.
Kafka	`kafka_consumer_lag` (per consumer group)	How many messages the consumer hasn't read yet. Rising = consumer can't keep up.
Kafka	`kafka_topic_partitions_under_replicated`	Replication health. >0 = broker issue.
ClickHouse	`system.parts` count per partition	Background merge health. >300 sustained = merges falling behind.
ClickHouse	`system.replication_queue` size (when replicated)	Replica sync health.
ClickHouse	`InsertedRows` per minute	Insert throughput. Flat-lining = Kafka consumer paused.

Alerting policy

Page when end-to-end freshness > 5 min for >3 minutes (sustained, not transient).
Page when DLQ topic message count > 0 (a single malformed row is OK, but it shouldn't accumulate).
Warn when ClickHouse disk usage > 80% (gives ~weeks of runway, not minutes).
Warn when Kafka consumer lag > 100K messages for >5 min (early indicator of CH slowdown).
Warn when Debezium milliseconds_behind_source > 30s sustained.

Alerts that are not pages: anything that doesn't require action within 30 minutes. The 3 a.m. page should only fire for things that genuinely need a human in the next half-hour.

Section 5

Per-Table Strategy

Stream-versus-batch is the wrong question to ask per table; the right question is "which consumption pattern, knowing that the same Debezium connector reads all tables anyway." The infrastructure is shared, so the per-table decision is really about (a) how the data lands on the ClickHouse side, and (b) what failsafe we wrap around the streaming path.

Table	Primary path	Failsafe	CH engine	Reasoning
`epoch_data`	CDC stream	Daily incremental snapshot → Parquet on S3/R2	`ReplacingMergeTree (_version)`	High volume; 5-min freshness is the whole point. Daily snapshot is the rebuild source if Kafka retention expires during a sustained outage.
`end_user`	CDC stream	Hourly full snapshot to R2	`ReplacingMergeTree (_version)`	PII-bearing. Deletes must propagate quickly (GDPR). Small enough that an hourly full snapshot is cheap insurance.
`app`	CDC stream	Hourly full snapshot	`ReplacingMergeTree (_version)`	Tenant config — small, slow-changing, but joins from every query. Snapshot makes recovery trivial.
`epoch_data_type` `data_source`	CDC stream	Hourly full snapshot	`ReplacingMergeTree` + ClickHouse `Dictionary` (in-memory)	Reference data, used for in-query lookups via `dictGetX`. Dictionary auto-refreshed from the underlying MergeTree.
`generation_type` `generation_method` `trustworthiness`	Snapshot only	—	`Dictionary`	Static, <100 rows. Loaded as a ClickHouse dictionary, refreshed daily by a one-line cron'd query. No streaming overhead.

Why mostly uniform

The temptation, on reading the per-table sizes, is to optimize: stream the heavy table, batch the small ones. We resist that because:

The same Debezium connector covers all eight tables. Including or excluding small tables doesn't materially change its load.
Streaming the small tables gives us free correctness on updates and deletes. An hourly batch on end_user would mean an hour of staleness on a user-deletion request — unacceptable for GDPR.
Operating one pipeline pattern is cheaper than operating two. Every variation we add is a runbook entry that has to be remembered six months from now at 2 a.m.

Why ClickHouse Dictionaries for the truly static tables

The generation_* and trustworthiness tables are small enough (~100 rows total) that loading them as in-memory dictionaries is dramatically faster than even an indexed JOIN. Dictionary lookups are O(1) hash lookups, run in the same memory space as the query, and don't touch disk. The CDC pipeline still streams these tables (for $0 incremental cost), but queries use the dictionary form via dictGet('trustworthiness', 'name', trustworthiness_id).

Section 6

Data Modeling

The schema decisions are dominated by one question: what does the ORDER BY on epoch_data look like? Everything else in the model follows from getting that right.

6.1 The fact table

CREATE TABLE epoch_data (
  id                  UInt64,
  end_user_id         UInt32,
  app_id              UInt16,                       -- denormalized at ingest time
  data_type_id        UInt32,
  data_source_id      UInt16,
  third_party_ds_id   UInt16 DEFAULT 0,
  start_ts            DateTime64(3, 'UTC') CODEC(Delta, ZSTD(3)),
  end_ts              Nullable(DateTime64(3, 'UTC')) CODEC(Delta, ZSTD(3)),
  created_at          DateTime64(3, 'UTC'),    -- sync time

  double_value        Nullable(Float64)         CODEC(ZSTD(3)),
  long_value          Nullable(Int64)           CODEC(T64, ZSTD(3)),
  boolean_value       Nullable(UInt8),
  date_value          Nullable(DateTime64(3, 'UTC')),
  string_value        Nullable(String),

  generation_type_id  UInt8 DEFAULT 0,
  generation_method_id UInt8 DEFAULT 0,
  trustworthiness_id  UInt8 DEFAULT 0,
  medical_grade       UInt8 DEFAULT 0,
  chronological_exactness Int16,
  timezone_offset     Int16,
  measurement_id      UInt64,

  _version            UInt64,                       -- binlog position
  _deleted            UInt8 DEFAULT 0,
  _ingested_at        DateTime DEFAULT now()      -- for ops queries
)
ENGINE = ReplacingMergeTree(_version)
PARTITION BY toYYYYMM(start_ts)
ORDER BY (app_id, end_user_id, data_type_id, start_ts, id)
SETTINGS index_granularity = 8192
TTL toDate(start_ts) + INTERVAL 2 YEAR DELETE;

6.2 Why this ORDER BY

The ORDER BY tuple (app_id, end_user_id, data_type_id, start_ts, id) determines the physical layout of every part. ClickHouse stores rows in this order within each partition. The implications:

Tenant pruning at the storage level. A query filtering by app_id can skip every part that doesn't contain that tenant's data. With 230 tenants, that's a 200× reduction in data scanned for most tenant-scoped queries.
User scans are sequential. All rows for (app_id=12, end_user_id=42817) are contiguous on disk within each partition. The 90-day-trend query reads a tiny block, not the whole table.
Per-data-type sub-scans. Within a user, all HeartRate rows are contiguous, then all Steps rows, etc. Filtering by data_type_id further reduces the scan.
Time range within data type is sequential. Combined with the monthly partition, a 90-day query touches at most 4 partitions and reads a fully sequential block within each.
Including id at the end guarantees uniqueness for the dedup engine. Without it, two truly distinct rows with identical (app, user, type, ts) would be incorrectly collapsed.

The case for not putting start_ts earlier in the key: a pure time-ordered key would optimize for "what happened across all users in the last hour" queries, at the cost of "what's this user's history" queries. The case's analytical question set leans heavily on the latter, so user-locality wins.

6.3 Partition strategy

Partitioning by toYYYYMM(start_ts) gives one partition per calendar month per shard. At 1B rows over 3–4 years that's ~30M rows per partition — a comfortable size for ClickHouse merge operations. The benefits:

Time-range pruning. Queries with a date filter scan only the partitions they need.
TTL is per-partition. The 2-year retention rule drops whole partitions at a time, which is cheap.
Backfills are partition-scoped. When a user syncs week-old data, only the relevant historical partitions are touched.
Maintenance windows align with months. If we ever need to re-optimize partitions, we can do it one month at a time, off-peak.

6.4 The value-column question

The source has five nullable value columns (double_value, long_value, boolean_value, date_value, string_value), only one of which is populated per row based on the row's data_type_id. There are two reasonable approaches:

Keep them as-is in one wide table. Analytical queries always know which column to read because they filter by data_type_id. Null columns compress essentially for free in a columnar store. Single table = simpler schema, fewer joins.
Split into per-type tables (one for HR, one for Steps, etc.). Each table has just the relevant value column. Cleaner per-table, but queries that want multiple data types need UNIONs, and the schema has 15+ tables to operate.

We choose option 1 because (a) the null columns cost essentially nothing in storage, (b) ClickHouse's column-pruning means a query that only reads long_value never touches the other columns on disk anyway, and (c) the per-table-split approach makes cross-type queries (recovery score = f(HR, HRV, sleep)) painful.

6.5 Dimension tables

CREATE TABLE dim_end_user (
  id UInt32,
  app_id UInt16,
  pseudonym String,
  first_name String,
  last_name String,
  email String,
  date_of_birth Nullable(Date),
  timezone LowCardinality(String),
  created_at DateTime64(3, 'UTC'),
  _version UInt64,
  _deleted UInt8 DEFAULT 0
) ENGINE = ReplacingMergeTree(_version)
ORDER BY (app_id, id);

-- And the dictionary that the fact-table ingest MV uses to look up app_id
CREATE DICTIONARY dim_end_user_dict (
  id UInt32,
  app_id UInt16
)
PRIMARY KEY id
SOURCE(CLICKHOUSE(TABLE 'dim_end_user'))
LAYOUT(HASHED())
LIFETIME(MIN 300 MAX 600);                    -- refresh every 5–10 minutes

LowCardinality(String) is used for the timezone column because there are at most a few hundred distinct values across all users; LowCardinality stores them as small integer codes with a dictionary, saving substantial space and speeding up filters and grouping.

6.6 Materialized views (the pre-aggregation layer)

Materialized views in ClickHouse are insert-triggered — when rows are inserted into the underlying table, the view's SELECT runs against just the new batch and inserts the result into the view's storage. This is the mechanism we use to keep daily/weekly aggregations always-current with minimal overhead.

-- Daily resting HR per user (powers the "90-day trend" query)
CREATE MATERIALIZED VIEW mv_daily_resting_hr
ENGINE = AggregatingMergeTree
PARTITION BY toYYYYMM(date)
ORDER BY (app_id, end_user_id, date)
AS SELECT
  app_id,
  end_user_id,
  toDate(start_ts) AS date,
  avgState(long_value) AS avg_hr,
  minState(long_value) AS min_hr,
  maxState(long_value) AS max_hr,
  countState()         AS samples
FROM epoch_data
WHERE data_type_id = 3001          -- HeartRateResting
  AND _deleted = 0
GROUP BY app_id, end_user_id, date;

-- The 90-day-trend query, now sub-second on billions of rows
SELECT
  date,
  avgMerge(avg_hr) AS resting_hr,
  minMerge(min_hr) AS min_hr,
  maxMerge(max_hr) AS max_hr
FROM mv_daily_resting_hr
WHERE end_user_id = 42817
  AND date >= today() - 90
GROUP BY date
ORDER BY date;

The State / Merge pattern

The materialized view stores intermediate aggregation state (e.g., avgState stores both running sum and count), not the final value. This is necessary because the view receives data in batches, and a true running average can't be incrementally combined without preserving the underlying counts. At query time, avgMerge combines those partial states into the final answer.

This is also what makes burst-sync correctness work: when a user syncs week-old data, the rows arrive at the view, which computes partial aggregates for the old dates and merges them with the existing partial aggregates already stored. The query result is always consistent with the underlying fact table, regardless of arrival order.

Suggested materialized views to ship day-one

View	Powers
`mv_daily_user_metrics`	Daily summary per user: total steps, distance, calories, resting HR, sleep duration. The dashboard backbone.
`mv_user_hrv_30d`	30-day rolling RMSSD per user. Powers HRV cohort comparisons and anomaly baselines.
`mv_user_sleep_stages_daily`	Daily total time in each sleep stage. Powers the "sleep quality vs activity" correlation queries.
`mv_user_data_quality_daily`	Per user per day: row count, time coverage, gap count. Powers operational data-quality dashboards.
`mv_data_arrival_latency`	Histogram of `created_at - start_timestamp` buckets. Powers the "% of rows arriving >48h late" operational query.

6.7 Time columns: epoch ms vs DateTime64

The source uses BIGINT milliseconds since epoch for all timestamps. We convert to DateTime64(3, 'UTC') at the ingest materialized view layer because:

Native datetime columns get the Delta codec, which compresses sorted timestamps to a few bits per row.
Query syntax is much more natural (start_ts >= today() - 90 vs. start_ts >= toUnixTimestamp64Milli(today() - 90)).
Time-zone-aware functions work directly (toDateInTimezone(start_ts, dim_end_user.timezone)).

The original epoch value is recoverable via toUnixTimestamp64Milli(start_ts) if any downstream consumer needs it in that form.

6.8 Query patterns to validate the model

Three queries from Appendix D, executed against the proposed model, with comments on access patterns:

-- Q1. "90-day trend of resting heart rate for user 42817"
--     Uses MV. Reads ~90 rows. Sub-second.
SELECT date, avgMerge(avg_hr) AS resting_hr
FROM mv_daily_resting_hr
WHERE end_user_id = 42817 AND date >= today() - 90
GROUP BY date ORDER BY date;

-- Q2. "Users with HR gaps > 24h in the last 7 days"
--     Uses mv_user_data_quality_daily. Reads ~3.5M rows (500K users × 7 days),
--     but each row is small. Few seconds.
SELECT end_user_id, max_gap_minutes
FROM mv_user_data_quality_daily
WHERE data_type_id = 3000
  AND date >= today() - 7
  AND max_gap_minutes > 1440;

-- Q3. "Cohort HRV: condition-X users vs general population"
--     The "cohort" filter is external (likely a SQL IN list of user IDs).
--     Reads from mv_user_hrv_30d. Sub-second per cohort.
SELECT
  if(end_user_id IN cohort_users, 'condition_x', 'general') AS cohort,
  avgMerge(avg_rmssd) AS avg_hrv,
  count() AS users
FROM mv_user_hrv_30d
WHERE date >= today() - 30
GROUP BY cohort;

Section 7

Handling Updates & Deletes

This section addresses the case study's explicit prompt: "How do you handle updates/deletes in an append-oriented system?" The short answer: ReplacingMergeTree with a monotonic version column from the binlog position, combined with a read-side wrapper view that materializes the latest version on demand. The long answer follows.

7.1 The mechanism: ReplacingMergeTree

ReplacingMergeTree(version_column) is a MergeTree variant that, during its background merge cycle, dedupes rows sharing the same ORDER BY key and keeps only the row with the highest version_column value. This is exactly the semantics we need: every UPDATE in MySQL becomes a new row in ClickHouse with a higher _version; the merge eventually collapses the old row away.

The choice of version column matters: it must be monotonically increasing for every change to a given row. Debezium provides several candidates; the binlog position (source.pos) is the natural fit because it's strictly increasing within a binlog file and Debezium guarantees in-order delivery per source row. For changes that cross binlog file rotations, we use a combined value: (file_number * 2^32) + pos (the binlog file numbers themselves are monotonic).

7.2 The mechanism: deletes via tombstones

When MySQL deletes a row, Debezium emits a delete event. With the delete.handling.mode = rewrite SMT, that event becomes a regular insert into the topic but with __deleted = "true". The ingest materialized view sets _deleted = 1 on the resulting ClickHouse row, which has the same ORDER BY key as the original.

Because the delete event always has a higher _version than the row it tombstones, ReplacingMergeTree will keep the _deleted = 1 version after merge. Queries then need to filter WHERE _deleted = 0 to exclude logically-deleted rows.

7.3 The read-side problem

The catch with ReplacingMergeTree is that merges are asynchronous. After an UPDATE, both the old and new versions of the row exist in the table until the next background merge runs (which can be seconds or minutes later, depending on data size and merge schedule). A naive SELECT will see both rows.

There are three ways to handle this, in increasing order of cost and decreasing order of convenience:

Option A: A wrapper view that does the dedup at query time

CREATE VIEW epoch_data_current AS
SELECT
  id, end_user_id, app_id, data_type_id, data_source_id,
  start_ts, end_ts, created_at,
  argMax(double_value, _version)   AS double_value,
  argMax(long_value, _version)     AS long_value,
  argMax(boolean_value, _version)  AS boolean_value,
  argMax(string_value, _version)   AS string_value,
  argMax(generation_type_id, _version) AS generation_type_id,
  argMax(trustworthiness_id, _version) AS trustworthiness_id,
  argMax(_deleted, _version)       AS is_deleted
FROM epoch_data
GROUP BY id, end_user_id, app_id, data_type_id, data_source_id,
         start_ts, end_ts, created_at
HAVING is_deleted = 0;

This is the recommended default. It hides the dedup logic from data-team queries entirely. The cost is the GROUP BY and argMax overhead, which is bounded — typically <2× the cost of a naive SELECT against the same range.

Option B: SELECT ... FINAL

The FINAL modifier tells ClickHouse to merge-on-the-fly at query time. Functionally equivalent to Option A in terms of result, but the implementation forces a sort+merge of every matching part, which can be considerably more expensive for queries that touch many parts.

I'd reserve FINAL for one-off queries where the convenience matters more than the cost; for production query paths, the wrapper view is more predictable.

Option C: OPTIMIZE TABLE ... FINAL

Manually trigger a full merge. The most expensive option (rewrites all affected parts) but produces the cleanest physical state — useful as a scheduled operation off-peak, especially after large backfills or batch deletes.

7.4 GDPR right-to-erasure: physical deletion

ReplacingMergeTree logically deletes rows but doesn't physically remove them from disk until partitions are dropped (via TTL or manual operation). For GDPR compliance, we need a way to physically delete a user's data on demand.

The implementation

-- Triggered when a user's erasure request is approved.
-- ALTER ... DELETE in ClickHouse is a lightweight mutation —
-- it rewrites affected parts in background.
ALTER TABLE epoch_data
DELETE WHERE end_user_id = 42817;

ALTER TABLE dim_end_user
DELETE WHERE id = 42817;

-- And from each materialized view's storage:
ALTER TABLE mv_daily_resting_hr
DELETE WHERE end_user_id = 42817;
-- (Repeat for each MV)

This pattern is wrapped in a single stored procedure (a SQL function or Python script) invoked from an internal "user erasure" admin tool. The procedure must also:

Delete the user's S3/R2 archive files (Parquet does not support per-row deletion, so we rewrite affected files with the user's rows omitted).
Confirm the delete in MySQL has propagated through Kafka — if the CDC pipeline is lagging, deletes could re-create the row after our deletion. The procedure waits for end-to-end freshness < 1 min before deleting.
Log the erasure to an immutable audit table for compliance evidence.

SLA

The mutation completes within minutes for a typical user (most of whose data is concentrated in recent partitions). Backfilling old archive files takes longer (~hours, since it requires scanning S3/R2 hourly Parquet files). I'd commit to 24-hour SLA for full physical erasure, with the SQL-layer deletion happening within 1 hour. Stakeholders may want a stricter SLA — see §15.

7.5 Why ReplacingMergeTree over CollapsingMergeTree

ClickHouse offers a sibling engine, CollapsingMergeTree, that handles updates by storing +1 ("new state") and -1 ("old state") rows that cancel on merge. It's arguably more theoretically clean but requires the writer to produce paired rows for every update. Debezium doesn't natively emit -1 events; doing this would require an SMT or a downstream processor.

ReplacingMergeTree is simpler: Debezium emits each row state independently, and ClickHouse handles the resolution. We lose the ability to do "delta math" (computing the change between successive versions of a row) but that's not a common analytical pattern here.

7.6 The duplicate-event corner case

Debezium has at-least-once delivery semantics. If the connector crashes after emitting an event but before committing the offset, that event will be re-emitted on restart. Our pipeline is naturally idempotent under this:

The duplicate event has the same _version as the original.
ReplacingMergeTree keeps only one row per (ORDER BY tuple, _version) combination.
The materialized views, being defined over the deduplicated state, are correct after merge.

The pre-merge window allows brief over-counting in MV aggregates, but the steady-state is correct.

7.7 Late-arriving data

The burst-sync pattern produces a different kind of "update" — not a modification of an existing row but the arrival of new rows for old time periods. This is handled trivially by partitioning by start_ts (the measurement time, not the sync time): late rows land in historical partitions automatically, and the materialized views over those partitions incrementally update.

The only operational consideration is part count: a burst sync of a week of data lands in multiple historical partitions. The CH settings allow inserts to span up to 100 partitions per block (max_partitions_per_insert_block) to accommodate this.

Section 8

Burst Sync Handling

The constraint says: "A user offline for days syncs a week of data at once." With ~15,000 epoch_data rows in one burst, and potentially hundreds of users syncing simultaneously after, say, a regional connectivity event, the system needs to absorb spikes of 1M+ events in seconds without inducing pipeline lag for other tenants.

8.1 Walking through a single 15K-row burst

MySQL. The backend writes the batch in one or many INSERT statements over ~2 seconds. The binlog grows by ~3 MB (assuming ~200 bytes per row in binlog format).
Debezium. Reads the binlog at >100K events/sec for narrow rows. Adds <1 second of latency. Emits 15K Kafka messages.
Kafka. Producer batches via linger.ms = 50; the 15K messages spread across 24 partitions of the epoch_data topic. Total Kafka latency: <1 second.
ClickHouse Kafka engine. Consumes in batches up to kafka_max_block_size = 65,536. The 15K-row burst fits in one batch, inserted as 1–2 new parts. Insert latency: ~1–3 seconds (bounded by kafka_flush_interval_ms = 2000).
Materialized views. Fire on each insert. Computing per-user daily aggregates over 15K rows is <100 ms.

End-to-end: ~5 seconds. Well inside the 5-min SLA, with 60× headroom.

8.2 The 100-simultaneous-bursts scenario

If a connectivity event causes 100 users to sync at once, the total is 1.5M events. The pipeline behaves as follows:

MySQL. Heaviest impact is here — 1.5M inserts in a short window. If the backend uses connection pooling and prepared statements this is sustainable on a moderate instance; if not, MySQL becomes the bottleneck and we'd see write latency degrade for other tenants too.
Debezium. 1.5M events / 100K events/sec = 15 seconds of catchup. Within tolerance.
Kafka. Trivially handles 1.5M small messages. Disk usage temporarily spikes.
ClickHouse. 1.5M events in ~24 batches = 24 inserts. ClickHouse merges these in background; if part count stays under ~300 per partition, no degradation. We'd likely see a brief surge to ~50–80 parts then settle.
End-to-end: ~20–30 seconds at the worst point.

8.3 Sustained burst (the failure mode to actually worry about)

The genuinely concerning scenario is not a one-time burst but sustained high-rate writes — e.g., a viral marketing event that 50× our user count overnight. The bottlenecks we'd encounter in order:

MySQL write capacity. Not our problem to solve at the pipeline level, but it's the first thing to break.
Debezium throughput. Single-threaded per source. At ~50–100K events/sec sustained, we'd need to split connectors by table (e.g., a dedicated connector for epoch_data, another for everything else).
ClickHouse merge backlog. If inserts outpace merges, part count grows. Recovery: throttle insert rate temporarily, scale CH vertically, or add a shard.

8.4 Key knobs

Setting	Default	Our value	Why
Debezium `max.batch.size`	2048	8192	Bigger batches, less per-event CPU.
Kafka topic partitions (`epoch_data`)	1	24	Allows ClickHouse parallel consume up to 24 threads.
Kafka `linger.ms`	0	50	Better batching at the producer.
CH `kafka_max_block_size`	1M	65,536	Predictable batch size avoids huge parts during bursts.
CH `kafka_flush_interval_ms`	7500	2000	Tighter latency at normal load.
CH `parts_to_throw_insert`	3000	(unchanged, monitor)	Safety net; if hit, inserts fail loud.
CH `max_partitions_per_insert_block`	100	100	Bursts of week-old data span multiple historical partitions.

Section 9

Schema Evolution

Schema changes are an unavoidable reality. The pipeline must handle them gracefully, even though we don't have a separate schema registry on day one.

9.1 The default contract

Debezium captures every DDL event in MySQL and emits it to a dedicated schema-history Kafka topic. The downstream behavior depends on the kind of change:

Change	Default behavior	Action required
ADD COLUMN (nullable)	Debezium emits events with the new field; ClickHouse Kafka engine ignores unknown fields by default	None on day one. To make the new column queryable, ALTER it onto the ClickHouse table and update the ingest MV.
ADD COLUMN (NOT NULL, with default)	Same as above. The default value flows through Debezium events.	Same as above.
DROP COLUMN	Debezium stops including the field; existing CH column stays as-is	Optional cleanup: ALTER DROP COLUMN on ClickHouse to reclaim storage.
RENAME COLUMN	Treated as DROP + ADD by Debezium	Manual: ALTER RENAME on ClickHouse, then redeploy ingest MV with new field name.
CHANGE TYPE (widening, e.g. INT → BIGINT)	Debezium handles transparently if compatible; otherwise emits schema change event	ALTER MODIFY COLUMN on CH if the type widens (cheap); if narrowing, requires migration plan.
ADD TABLE	Debezium picks it up if it matches `table.include.list` pattern; otherwise ignored	Add table to include list, restart connector, optionally trigger incremental snapshot.
DROP TABLE	Existing data in CH stays; new events stop flowing	Decide whether to retain history in CH or drop the mirror table.

9.2 Operational workflow

For day one (small team, controlled source changes), schema evolution is handled manually:

Backend team announces the planned MySQL schema change in advance.
Data engineer reviews — assesses whether it's additive (safe) or breaking (needs care).
Backend deploys the schema change.
Data engineer applies the corresponding ClickHouse ALTER and updates ingest MVs.
Validate end-to-end freshness remains within SLA after deploy.

This works because the team is small enough that announcements happen in shared channels and changes are infrequent (weekly at most). Once the team grows or change frequency increases, this becomes a bottleneck — at which point we add:

Schema registry (Apicurio or Confluent) — formal contract enforcement at the Kafka layer.
dbt project on top of ClickHouse — version-controlled MV definitions, with PR review for changes.
Automated CH ALTERs for safe changes — a controller process that reads schema-history Kafka topic and applies non-breaking ALTERs automatically.

9.3 The failure mode to watch

The most dangerous case is a MySQL schema change that looks safe but breaks downstream — e.g., changing a column from BIGINT to INT when some existing values exceed the new range. Debezium will emit events as the new type; ClickHouse insert will fail; the consumer will fall behind.

Mitigation: a DLQ (dead-letter queue) topic for failed inserts, with an alert on non-zero message count. The DLQ lets the pipeline keep moving for the other 99.99% of events while the offending change is investigated.

Section 10

Failure Modes & Operational Runbook

This section catalogues the failure modes we expect, in rough order of likelihood, with detection and recovery procedures for each. Most failures share a common detection signal — end-to-end freshness rising — and differ in what they require for recovery.

10.1 Most-likely failures

F1 · Duplicate events from Debezium retry

Likelihood: High (happens occasionally during normal operation).
Cause: Debezium has at-least-once delivery. If it crashes after emitting an event but before committing the Kafka offset, the event is re-emitted on restart.
Detection: No detection needed — system handles automatically.
Recovery: ReplacingMergeTree dedupes on next merge. No action required.

F2 · ClickHouse goes down (planned or unplanned)

Likelihood: Medium.
Cause: Restart, hardware failure, disk fill, OOM.
Detection: Kafka consumer lag rises; freshness alert fires after ~5 min.
Recovery: Kafka holds events (7-day retention on epoch_data, 30 days on dims). Restart ClickHouse; the Kafka engine resumes consuming from the saved offset. Bounded blast radius: 7 days of writes before the binlog itself is consumed.

F3 · Debezium connector crash

Likelihood: Medium.
Cause: OOM, MySQL connection loss, schema parsing error.
Detection: Producer rate from Debezium drops to zero; freshness alert fires.
Recovery: Auto-restart via systemd or Kubernetes. Connector resumes from last committed binlog offset (no data loss). If restart loop, check logs for schema errors (often the cause).

F4 · Schema change in MySQL not anticipated

Likelihood: Medium (we control the source, but accidents happen).
Cause: Backend deploys a schema change without coordinating; ClickHouse Kafka engine fails to parse the new event shape.
Detection: DLQ topic accumulates messages; freshness alert fires.
Recovery: Apply matching ALTER on ClickHouse, then re-process the DLQ. For non-trivial changes, this is a 15–60 minute incident.

F5 · Bad/malformed value

Likelihood: Medium.
Cause: Source data with values outside the expected range (e.g., NULL in a NOT NULL column, malformed UTF-8 in a string).
Detection: CH parse error → row routed to DLQ; alert on DLQ rate.
Recovery: Quarantine the affected row in epoch_data_dlq; investigate the source; manually correct or re-insert.

F6 · "Too many parts" in ClickHouse during sustained burst

Likelihood: Medium under heavy load, low at normal load.
Cause: Insert rate exceeds merge capacity; system.parts count per partition climbs past ~300; eventually insert fails with TOO_MANY_PARTS.
Detection: Grafana panel on system.parts; alert at 250 per partition.
Recovery: Reduce kafka_flush_interval_ms (already at 2000), increase kafka_max_block_size (already at 65K), throttle non-critical consumers, or scale ClickHouse (add a shard). If unrecoverable in the short term, pause Kafka consumer to give merges time to catch up.

10.2 Lower-likelihood failures

F7 · Kafka broker down (single broker setup)

Likelihood: Low if managed; medium if self-hosted single-broker.
Cause: Hardware failure, OOM, disk fill.
Detection: Debezium producer errors; freshness alert fires.
Recovery: Debezium pauses gracefully (binlog buffers naturally in MySQL — 7-day retention). Restart Kafka broker; everything resumes.
Prevention: Move to 3-broker cluster in Phase 2.

F8 · MySQL binlog purged before Debezium reads it

Likelihood: Low (assuming 7-day binlog retention).
Cause: Debezium offline for >7 days, or binlog retention misconfigured.
Detection: BinlogNotFoundException in connector logs.
Recovery: Worst-case scenario. Trigger Debezium incremental snapshot for the affected tables — sends a signal via a Kafka topic that tells Debezium to scan the source table and emit synthetic insert events for all rows, while still processing live binlog changes in parallel. Takes hours for epoch_data, minutes for everything else.
Prevention: Monitor Debezium lag actively; alert at 12h sustained lag.

F9 · MySQL replica failover

Likelihood: Low (depends on infrastructure).
Cause: MySQL replica we're reading from is promoted or replaced.
Detection: Connection reset in Debezium logs.
Recovery: With GTID mode on MySQL, Debezium reconnects to the new replica and resumes from the GTID set. Without GTID, it requires manual reset of the binlog offset to a known-good position on the new replica.
Prevention: Use GTID mode (already in MySQL prerequisites, §4.1).

F10 · S3/R2 archive sink falling behind

Likelihood: Low.
Cause: Kafka Connect S3 Sink fails or slows.
Detection: Separate consumer lag metric for the archive sink.
Recovery: Restart sink; it resumes from offset. The archive is not on the critical path, so this is a "warn, fix in business hours" not a page.

10.3 Recovery procedures: the rebuild path

The worst-case rebuild — needing to recreate the ClickHouse state from scratch — proceeds in this order:

Stop the ingest MV. Prevents partially-rebuilt data from being picked up by aggregates.
Drop the affected table.
Recreate the table from the canonical DDL.
Reset Kafka consumer group offset to earliest. The Kafka engine table will start re-consuming from the beginning of the topic's retention window (7–30 days, depending on table).
If older history is needed, load from S3/R2 archive first (via INSERT INTO ... SELECT FROM s3(...)), then enable Kafka consume.
Restart the ingest MV.
Restart materialized views (or recreate them via POPULATE on a small data sample, then incrementally fill from the live stream).
Verify counts against MySQL: a periodic check query that compares row counts and recent-data totals between the source and ClickHouse.

For a typical table this rebuild is a 30-min to 2-hour operation. For epoch_data at 1B+ rows, several hours to a day, dominated by S3 read + parsing time. This is the operation we run drills against — practising it once a quarter ensures we know how long it really takes.

Section 11

Capacity Planning

Capacity decisions follow from a few simple numbers. The estimates here are conservative — real-world ClickHouse performance often exceeds them — but they're useful for sizing decisions.

11.1 Storage math

Metric	Phase 1 (500M rows)	Phase 2 (5B rows)
Uncompressed size (avg ~250 bytes/row in ClickHouse representation)	~125 GB	~1.25 TB
Compressed (10–15× with LZ4 + Delta codecs)	~10–15 GB	~80–125 GB
+ Materialized views (~5% of fact)	~1 GB	~10 GB
+ Replicated tables (×2 for HA in Phase 2)	—	~180 GB
+ ZooKeeper/Keeper state	~100 MB	~1 GB
+ Headroom for merges (1.5×)	~25 GB	~270 GB
Total disk per node	~30 GB	~300 GB per shard

11.2 Throughput math

Metric	Baseline	Peak burst (100 simultaneous syncs)
Events/sec to MySQL	~25 (2M/day average)	~50,000 for 30 seconds
Debezium throughput required	~100/sec	~50,000/sec (sustainable)
Kafka write rate	~25 msg/sec/partition	~2,000 msg/sec/partition
ClickHouse insert rate	~1 batch/2 sec	~1 large batch / 2 sec
End-to-end freshness	5–10 sec	20–60 sec

11.3 Concurrent query capacity

ClickHouse single-node on the Phase 1 spec (4 vCPU, 16 GB RAM) comfortably handles ~10 concurrent analytical queries, with the heaviest individual query latencies in the 100ms–2s range against MVs. Ad-hoc scans of the raw fact table can hit several seconds.

Phase 2 (3-node cluster, 8 vCPU each) handles ~30 concurrent users at similar latency. Past that, we'd either scale horizontally or add a read-only replica tier for dashboard queries.

11.4 What we're not bottlenecked by

Network. Even at peak burst, the total network throughput is well under 100 Mbps.
Disk IOPS. ClickHouse's sequential-read pattern is well-served by even modest SSDs.
Kafka storage. 7-day retention on epoch_data at peak rates is <10 GB.
Debezium memory. A few hundred MB even at peak.

11.5 Headroom thinking

Every component is sized for ~3× current peak. This gives ~12–18 months of organic growth before any component needs scaling, which is the natural cadence for revisiting infra decisions. If growth exceeds 3×, the alerts on end-to-end freshness will signal degradation well before any hard limit is hit.

Section 12

Cost Analysis

All prices in USD; rounded to nearest dollar; based on Hetzner / Cloudflare R2 / Backblaze list pricing as of mid-2026. Substituting US-based equivalents (e.g., DigitalOcean for VMs, AWS S3) increases the totals by ~30–50% but doesn't change the relative comparisons.

12.1 Phase 1 (0–500M rows, 0–12 months)

Item	Spec	Monthly
ClickHouse VM (Hetzner CCX23)	4 vCPU, 16GB RAM, 240GB NVMe	$60
Kafka / Redpanda VM	4 vCPU, 16GB RAM, 160GB SSD	$60
Debezium VM	2 vCPU, 4GB RAM, 40GB SSD	$20
S3 / R2 archive	~200 GB Parquet, ~10 GB/mo growth	$5
Backups (snapshots to R2, 30-day retention)	~50 GB	$20
Prometheus + Grafana (co-hosted)	—	$0
Bandwidth (egress mostly internal)	—	$0
Total		~$165 / mo

12.2 Phase 2 (0.5B–5B rows, 12–36 months)

Item	Spec	Monthly
ClickHouse cluster (3 × CCX33)	8 vCPU, 32GB RAM, 480GB NVMe each	$330
Kafka cluster (3 brokers)	4 vCPU, 16GB RAM each	$180
ClickHouse Keeper (2 nodes)	2 vCPU, 4GB RAM each	$40
Debezium HA (2 nodes)	2 vCPU, 4GB RAM each	$40
S3 / R2 archive	~1.5 TB Parquet	$25
Schema registry (Apicurio)	2 vCPU, 4GB RAM	$20
Backups (longer retention, more data)	~200 GB	$40
Monitoring (still self-hosted)	2 vCPU, 8GB RAM	$40
Total		~$715 / mo

12.3 Comparison: managed warehouse stack

For Phase 1 volume (60M events/month, ~500M rows), a representative managed stack:

Item	Notes	Monthly
Snowflake compute (XS, ~6h/day)	For ETL + queries	~$500
Snowpipe Streaming ingest	60M events/mo at ~$5/M	~$300
Snowflake storage	~500 GB	$15
Confluent Cloud Kafka	Small cluster	~$400
Fivetran CDC (priced per MAR)	Or Airbyte Cloud equivalent	~$1,500+
Total		~$2,700+ / mo

Same workload, ~16× the bill. The savings on the self-hosted stack are real but they come at the cost of ~5–10 hours/week of operational toil. The right tradeoff at Phase 1; less obvious at Phase 2; usually wrong at Phase 3 (when team time becomes the binding constraint).

12.4 The deciding question for when to switch

Track these two metrics over time, and consider the managed-services tier when both cross a threshold:

Engineer-hours/week spent on pipeline ops. When >15/week sustained, the salary cost exceeds the rent saved.
Data team query concurrency. When >15 concurrent analytical users, the managed warehouses' separation-of-storage-from-compute architecture starts to pay off.

Section 13

Migration & Rollout Plan

Bringing this pipeline online from a greenfield state. Assumes MySQL is running, the data team is ready to consume from a new analytical store, and we have ~4 engineering-weeks of capacity to ship Phase 1.

Week 1 — Foundations

Provision Phase 1 infrastructure (3 VMs: ClickHouse, Kafka, Debezium).
Verify MySQL prerequisites (§4.1): binlog ROW format, GTID, retention, replication user.
Stand up Prometheus + Grafana with basic dashboards.
Stand up the S3/R2 archive bucket and the Kafka Connect S3 Sink (so we have replay from day one).

Week 2 — CDC plumbing

Deploy Debezium MySQL connector targeting the reference and dimension tables first (app, data_source, epoch_data_type, etc.). These are small and let us validate the end-to-end flow without taking on burst-sync complexity.
Create the ClickHouse dimension tables, Kafka engine tables, and ingest MVs.
Validate that updates and deletes propagate correctly using a test row in MySQL.
Build the freshness dashboard and SLO alert.

Week 3 — The fact table

Add epoch_data and end_user to the Debezium connector. Trigger initial snapshot; this will take several hours for the 150M-row fact table.
While snapshot runs, build the fact-table schema, Kafka engine table, and ingest MV.
Snapshot completes; verify row counts match MySQL.
Switch to binlog tail mode; verify freshness SLA holds.

Week 4 — Materialized views & consumer enablement

Define and populate the day-one materialized views (§6.6).
Validate query latencies against the case study's Appendix D queries — aim for <1s on the MV-backed queries, <10s on ad-hoc scans.
Document query patterns and access for the data team. Provision read-only ClickHouse users.
Run a quarterly DR drill: simulate ClickHouse loss, verify rebuild procedure works.
Ship.

Post-launch: weekly cadence

Weekly review of freshness SLO compliance and DLQ message rate.
Weekly review of Kafka and ClickHouse disk utilization.
Monthly cost review.
Quarterly DR drill.
Annual revisit of architecture choices, especially the Phase 1 → Phase 2 transition triggers.

Risk register for the rollout

Risk	Mitigation
Initial snapshot of `epoch_data` impacts MySQL performance	Run snapshot from a replica, not the primary. Use `snapshot.fetch.size` to chunk reads. Schedule for off-peak.
Hidden schema constraints not captured in Appendix A	Pre-flight: dump and review actual MySQL schema. Account for any orphan FKs or non-standard collations.
Data team expects features not in scope	Publish the scope-and-deferral list (§14, §15) explicitly. Get sign-off before shipping.
Single VM failure during Phase 1	Accept 7-day Kafka buffer; snapshot disk daily; restore from backup if needed. Phase 2 adds redundancy.

Section 14

Future Considerations

14.1 Linking clinical reports and insurance documents

The case study mentions: "Outline how to link clinical report data (e.g., PDF lab reports, insurance claim documents) to a user's wearable data so both can be queried together." This is a Phase 2/3 concern but worth sketching now.

The shape of the problem

Clinical reports are large, unstructured (PDFs), and arrive at a different cadence than wearable data. Wearable data is small-and-frequent; clinical data is large-and-rare. They need to be linkable by user ID and queryable together — e.g., "for users in our diabetes cohort, show their A1c history alongside their resting HR trend."

Recommended approach

Object storage for the raw artifacts. PDFs and claim documents live in S3/R2 under a path like s3://thryve-clinical/end_user_id={id}/year={Y}/month={M}/{document_id}.pdf.
A clinical metadata table in ClickHouse. One row per document, with extracted fields (document type, lab measurements, date of test, source provider), plus a pointer to the S3 path. Built either by manual data entry or by an OCR + NLP pipeline (e.g., Tesseract + a fine-tuned LLM extractor).
Joined queries via standard SQL. The clinical metadata table joins to dim_end_user on end_user_id; queries can compute correlations between extracted clinical fields and wearable aggregates.
For unstructured retrieval (e.g., "find users with X mentioned in their last lab report"), index the extracted text in a vector database (Qdrant or pgvector). Use vector search for retrieval, ClickHouse for analytics.

What changes in the pipeline

Very little. The CDC pipeline doesn't change. The clinical data ingestion is a separate pipeline (likely a periodic batch from the clinical-records provider) that lands metadata in a new ClickHouse table. The join surface is shared (end_user_id), so the data team's mental model stays consistent.

14.2 AI/ML training corpus

The AI team eventually needs reproducible point-in-time snapshots of the event data for training. Three options:

Query ClickHouse with a fixed timestamp filter. Works for small snapshots, but provides no guarantee against mutation (the data could change after training).
Read from the S3/R2 archive directly. Parquet partitions are immutable once written. The archive becomes the canonical training corpus. Works today.
Promote the archive to an Iceberg table. Adds explicit schema, time-travel queries, and ACID semantics. Worth doing when more than one ML model depends on the corpus.

14.3 Real-time anomaly detection

Several Appendix D queries ("detect HR spikes within 5 minutes of arrival") imply real-time alerting. ClickHouse can serve this today with a polling pattern:

-- Run every 30 seconds; check for HR spikes in the last 10 minutes
SELECT end_user_id, max(long_value) AS peak_hr, ...
FROM epoch_data
WHERE data_type_id = 3000
  AND created_at >= now() - INTERVAL 10 MINUTE
  AND long_value > (
    SELECT avgMerge(avg_hr) + 2 * stddevSampMerge(stddev_hr)
    FROM mv_user_hr_baseline_30d
    WHERE end_user_id = epoch_data.end_user_id
  )
GROUP BY end_user_id;

For sub-second alerting (e.g., as a feature of the consumer app), introduce Flink between Kafka and ClickHouse: Flink does the windowed comparison in memory and pushes alerts to a notification topic. This is the natural Phase 2/3 addition for real-time product features.

Section 15

Open Questions for Stakeholders

The architecture above is robust to the constraints in the case study, but real-world refinement depends on details we'd want to confirm with the team. These are the questions I'd ask in the kickoff conversation.

Operational reality

What's the actual peak write QPS on MySQL? Avg vs P99 vs P99.9?
How is MySQL hosted — AWS RDS, Aurora, self-managed? RDS limits some Debezium configuration options (notably binlog_row_image); Aurora has its own CDC offering worth considering.
What's the current binlog retention? Anything less than 7 days needs to change before we ship — otherwise our recovery window is dangerously narrow.
Do we have on-call engineers comfortable operating Kafka? If the answer is "no, and we don't want to be," that flips us toward Redpanda Cloud or AWS MSK Serverless even at the cost premium.
Is there an existing observability stack we should integrate with (Datadog, Honeycomb, New Relic)? Adopting it saves us standing up Prometheus.

Compliance & data governance

What's the SLA for GDPR right-to-erasure? Hours or days? Affects whether the §7.4 procedure needs to be more aggressive.
Any HIPAA scope (US customers)? Tightens encryption, audit logging, and BAA requirements.
Are access_token and email considered PII that must be tokenized before they leave MySQL? Affects whether we add an SMT at Debezium or accept them in ClickHouse with column-level encryption.
What's the data-residency requirement — EU-only? US? Affects choice of cloud provider and region.
Are there audit-logging requirements (who queried what, when)? ClickHouse has query logs but a formal audit pipeline may need additional plumbing.

Consumer expectations

Who are the day-one consumers of the analytical store? Internal data analysts? Product features? AI training jobs?
What query latency SLA does the data team want? <1 second matters for dashboards; "a few seconds" is comfortable for ad-hoc analysis.
What existing tools should we integrate with — Looker, Metabase, Superset, dbt, Mode?
Is real-time anomaly alerting a near-term product feature or a "nice to have"? Affects whether we need Flink in Phase 1 or can defer.
Do consumers need a defined data contract / SLA per table? Or is "best effort within 5 minutes" enough?

Scale & budget

What's the user-growth trajectory — 1B rows in 12 months, 24 months, or 60 months? Affects when we'd transition to Phase 2.
What's the data-infra budget ceiling for this fiscal year? Constrains how aggressively we can adopt managed services.
How fast does the data team expect new data sources to be added (CRM, billing, support tickets, etc.)? Each adds CDC complexity.
What does the AI team need from the data — raw event stream, daily snapshots, or feature-store-style derived data?
Are there partner integrations or external data sharing that need to be supported? Affects whether we need a separate read-replica or data-sharing layer.

Team & process

How does the backend team announce schema changes today? The CDC pipeline depends on this — schema changes that aren't coordinated will cause incidents.
Who owns the analytical SQL layer long-term — a data engineer, the analysts themselves, or shared? Affects whether we invest in dbt early.
What's the team's appetite for self-hosting vs. managed? My recommendation is biased toward self-hosting for cost, but team preference matters more than my preference.

These questions don't block shipping the design above — the design is correct under reasonable defaults for each. They would refine implementation details and inform the Phase 2 timeline. The answer to any one of them changes 5–10% of the architecture; the answer to several in combination could shift the cost calculus materially.

Real-Time Pipeline for Wearable Health Data

Table of Contents

Executive Summary

Headline decisions

What we explicitly defer

What I'd want to know before shipping

Problem Analysis

2.1 Source-of-truth shape

2.2 The constraints that drive the design

Burst syncs

Freshness within 5 minutes

Updates and deletes must propagate

Scale: 150M → 1B+ rows

Multi-tenancy (230 apps)

Implicit constraint: regulated data

2.3 Workload patterns the analytical DB must serve

Pattern A — Per-user time-series scans

Pattern B — Cohort aggregations

Pattern C — Data-quality and operational queries

2.4 What the analytical DB will not be doing

Architecture Overview

Why this shape

What flows where

3.1 Pipeline phases over time

3.2 What we explicitly do not include in the day-one architecture

Component Deep Dive

4.1 Debezium · Change Data Capture

Why Debezium specifically

Critical configuration

MySQL prerequisites

What the events look like

Failure modes specific to Debezium

4.2 Apache Kafka · Transport and Buffer

Why a broker, not direct push

Redpanda as a drop-in

Topic layout

Producer / consumer settings

4.3 ClickHouse · Analytical Database

4.3.1 Why ClickHouse for this workload

4.3.2 What ClickHouse gives up

4.3.3 ClickHouse-specific configuration

4.3.4 Alternatives considered and why I didn't choose them

4.3.5 The Kafka → ClickHouse pattern

4.4 S3 / R2 · Raw Event Archive

Cloudflare R2 vs AWS S3

Sink configuration

Storage layout

Lifecycle and cost

4.5 Observability · Prometheus, Grafana, Alertmanager

The headline metric: end-to-end freshness

Layered diagnostic metrics

Alerting policy

Per-Table Strategy

Why mostly uniform

Why ClickHouse Dictionaries for the truly static tables

Data Modeling

6.1 The fact table

6.2 Why this ORDER BY

6.3 Partition strategy

6.4 The value-column question

6.5 Dimension tables

6.6 Materialized views (the pre-aggregation layer)

The *State / *Merge pattern

Suggested materialized views to ship day-one

6.7 Time columns: epoch ms vs DateTime64

6.8 Query patterns to validate the model

Handling Updates & Deletes

7.1 The mechanism: ReplacingMergeTree

7.2 The mechanism: deletes via tombstones

7.3 The read-side problem

Option A: A wrapper view that does the dedup at query time

Option B: SELECT ... FINAL

Option C: OPTIMIZE TABLE ... FINAL

7.4 GDPR right-to-erasure: physical deletion

The implementation

SLA

7.5 Why ReplacingMergeTree over CollapsingMergeTree

7.6 The duplicate-event corner case

7.7 Late-arriving data

Burst Sync Handling

The State / Merge pattern