From MySQL to a queryable analytical store in <5 minutes — built lean for a startup, designed to scale to 1B+ rows.
| Table | Volume | Profile |
|---|---|---|
epoch_data | 150M, +2M/day → 1B+ | High freq · bursty |
end_user | 500K | Low freq · PII |
app | 230 | Tenant config |
epoch_data_type | 500 | Reference |
data_source | 50 | Reference |
| 3 reference tables | <100 each | Static |
updated_at (column doesn't exist; also misses deletes). Binlog CDC is mandatory.| Layer | Choice | Why (one sentence) |
|---|---|---|
| Change capture | Debezium MySQL connector | De-facto OSS standard; reads binlog; emits inserts/updates/deletes uniformly; resumes from offset on crash. |
| Transport / buffer | Apache Kafka (or Redpanda) | Absorbs burst syncs; decouples source from sink; gives us replay if ClickHouse needs a rebuild. |
| Analytical DB | ClickHouse | Columnar, 10× compression on sensor data, native Kafka engine, ReplacingMergeTree handles CDC merges, $0 license. |
| Aggregations | ClickHouse Materialized Views | Incremental on insert — no separate stream processor (no Flink) needed at this stage. |
| Raw archive | S3 or Cloudflare R2 | Kafka tiered storage / sink connector → cheap long-term backup + AI training source later. |
| Observability | Prometheus + Grafana | Free, well-supported by Debezium/Kafka/ClickHouse exporters. Lag-as-SLO is the headline metric. |
| Schema registry | Apicurio (OSS) — deferred | Add when schema evolution becomes painful; not blocking on day one. |
FINAL or a wrapper view.| Option | Verdict |
|---|---|
| Snowflake | Best DX. But streaming ingest at our write rate gets expensive fast (Snowpipe Streaming + warehouse credits). Not justified for a startup until query workload demands it. |
| BigQuery | Same shape as Snowflake. Storage is cheap; streaming inserts and on-demand query costs add up. Strong if already on GCP. |
| TimescaleDB / Postgres | Lovely for time-series <500M rows. Compression and query speed fall behind ClickHouse at 1B+. JOINs are easier — relevant if many small dim joins. |
| Apache Druid / Pinot | Excellent real-time OLAP, but heavier to operate (coordinators, brokers, historicals) and weaker ad-hoc SQL ergonomics. |
| Iceberg + Trino lakehouse | Strong for AI training corpus. Doesn't solve sub-5-min freshness alone (needs streaming on top). Good future tier, not the day-one tool. |
updated_at column in the schema → polling has nothing reliable to filter on.# Debezium MySQL connector — the settings that matter connector.class: io.debezium.connector.mysql.MySqlConnector database.include.list: thryve table.include.list: thryve.app, thryve.end_user, thryve.epoch_data, ... # Snapshot once, then tail binlog (no table locks, uses InnoDB MVCC) snapshot.mode: initial snapshot.locking.mode: none # Performance for the bursty epoch_data table max.batch.size: 8192 max.queue.size: 32768 poll.interval.ms: 500 # Topic per table, keyed by PK → in-row ordering preserved topic.prefix: cdc.thryve message.key.columns: thryve.epoch_data:id # Flatten Debezium envelope; mark deletes via __deleted=true transforms: unwrap transforms.unwrap.type: io.debezium.transforms.ExtractNewRecordState transforms.unwrap.delete.handling.mode: rewrite
| Table | Primary path | Failsafe | CH engine | Reasoning |
|---|---|---|---|---|
epoch_data |
CDC stream | Daily incremental snapshot to S3 (Parquet) | ReplacingMergeTree(_version) |
High volume, 5-min freshness is the whole point. Snapshot covers the "Kafka retention expired during outage" scenario. |
end_user |
CDC stream | Hourly full snapshot | ReplacingMergeTree(_version) |
PII — needs deletes to propagate fast (GDPR). Small enough that hourly full snapshot is fine as backup. |
app |
CDC stream | Hourly full snapshot | ReplacingMergeTree(_version) |
Tenant config — small, slow-changing, but critical for joins. Snapshot makes recovery trivial. |
epoch_data_typedata_source |
CDC stream | Hourly full snapshot | ReplacingMergeTree(_version) |
Reference data. Could be loaded once and forgotten, but streaming keeps it self-healing. |
| Reference tables ( generation_*, trustworthiness) |
Snapshot only | Manual reload | Dictionary (in-memory) |
Static, <100 rows. Loaded as ClickHouse dictionaries for fast lookup — no need for streaming pipeline at all. |
ORDER BY key is the most important decision in the schema.-- Fact mirror (denormalized app_id for tenant pruning) CREATE TABLE epoch_data ( id UInt64, end_user_id UInt32, app_id UInt16, -- denormalized data_type_id UInt32, data_source_id UInt16, start_ts DateTime64(3, 'UTC'), end_ts Nullable(DateTime64(3, 'UTC')), created_at DateTime64(3, 'UTC'), double_value Nullable(Float64), long_value Nullable(Int64), boolean_value Nullable(UInt8), string_value Nullable(String), trustworthiness_id UInt8, _version UInt64, -- binlog pos _deleted UInt8 ) ENGINE = ReplacingMergeTree(_version) PARTITION BY toYYYYMM(start_ts) ORDER BY (app_id, end_user_id, data_type_id, start_ts, id) TTL toDate(start_ts) + INTERVAL 2 YEAR DELETE;
(app_id, end_user_id, data_type_id, start_ts) means typical queries read 1 contiguous block instead of scanning. Including id guarantees uniqueness for dedup.
-- Incremental MV: daily resting HR per user CREATE MATERIALIZED VIEW mv_daily_resting_hr ENGINE = AggregatingMergeTree PARTITION BY toYYYYMM(date) ORDER BY (app_id, end_user_id, date) AS SELECT app_id, end_user_id, toDate(start_ts) AS date, avgState(long_value) AS avg_hr, countState() AS samples FROM epoch_data WHERE data_type_id = 3001 -- HeartRateResting AND _deleted = 0 GROUP BY app_id, end_user_id, date; -- 90-day trend query — sub-second on billions of rows SELECT date, avgMerge(avg_hr) AS resting_hr FROM mv_daily_resting_hr WHERE end_user_id = 42817 AND date >= today()-90 GROUP BY date ORDER BY date;
5 nullable value cols, only one populated per row. We keep them as-is — analytical queries always filter by data_type_id, which fixes which column to read. Splitting into per-type tables would add JOINs without a real benefit.
ReplacingMergeTree(_version) keeps only the latest version per ORDER BY key during background merges._version = MySQL binlog position (monotonic, unique per change). Comes for free from Debezium's source.ts_ms or GTID._deleted=1, also superseding any prior version.SELECT can see both the old and new versions until the next merge.argMax dedup — predictable cost, recommended for most queries.SELECT ... FINAL — convenient, slower (forces merge at read time).OPTIMIZE TABLE ... FINAL periodically — heavy, schedule off-peak.AggregatingMergeTree + *State functions. The aggregate is itself idempotent under replay.-- Recommended pattern: a view that hides the dedup logic CREATE VIEW epoch_data_current AS SELECT id, end_user_id, app_id, data_type_id, data_source_id, start_ts, end_ts, created_at, argMax(double_value, _version) AS double_value, argMax(long_value, _version) AS long_value, argMax(boolean_value, _version) AS boolean_value, argMax(_deleted, _version) AS is_deleted FROM epoch_data GROUP BY id, end_user_id, app_id, data_type_id, data_source_id, start_ts, end_ts, created_at HAVING is_deleted = 0;
ALTER TABLE epoch_data DELETE WHERE end_user_id = ? — this rewrites affected parts. Schedule monthly batch (or on-demand) for outstanding erasure requests. Same for end_user table.
Bursts of week-old measurements are not a problem because we partition by start_ts. The new rows just land in older partitions. Materialized views over those partitions will incrementally update on the new insertions automatically.
id (round-robin across 24 partitions). Throughput: ~150K msg/s on a single broker. Latency: <1 sec.kafka_max_block_size = 65K). Inserts as 1–2 parts. Latency: 1–3 sec for batch close.End-to-end: ~5 seconds. Well inside the 5-min SLA, with 60× headroom.
= 1.5M events. Still ~10 seconds end-to-end on a single-broker / single-CH-node setup. We saturate Debezium's binlog read first (~50–100K events/sec on commodity hardware). At sustained 2M+ msg/sec we'd add Debezium replicas or split connectors by table.
| Knob | Default | Tune to |
|---|---|---|
Debezium max.batch.size | 2048 | 8192 — bigger batches, lower CPU |
| Kafka topic partitions (epoch_data) | 1 | 24 — parallel consume |
Kafka linger.ms | 0 | 50 — better batching |
CH kafka_max_block_size | 1M | 65K — predictable latency |
CH kafka_flush_interval_ms | 7.5s | 2000 — tighter freshness |
CH MergeTree parts_to_throw_insert | 3000 | monitor — bursts can create many small parts |
system.parts. If we're consistently above ~300 parts per partition, switch to a buffered insert pattern or scale CH.
| Failure | Detection | Recovery | Likelihood |
|---|---|---|---|
| ClickHouse goes down | Kafka consumer lag > threshold | Kafka holds messages (7-day retention). Restart CH → consumer resumes from offset. | Medium |
| Debezium connector crash | Kafka Connect REST + producer rate = 0 | Auto-restart via systemd. Resumes from last committed binlog offset. No data loss. | Medium |
| MySQL binlog purged before Debezium reads | Debezium "binlog not found" error | Worst case. Use Debezium incremental snapshot to backfill affected table without downtime. | Low (if retention ≥ 7d) |
| Schema change in MySQL | Debezium DDL event; CH insert fails on unknown column | Additive: ALTER TABLE … ADD COLUMN. Automate nullable adds; review breaking changes. |
Medium |
| Bad/malformed value | CH parse error → dead-letter topic | Quarantine in epoch_data_dlq. Alert. Manual triage + reprocess. |
Medium |
| Duplicate events (Debezium retry) | Identical _version |
ReplacingMergeTree dedupes on next merge. No action needed. | High (at-least-once) |
| Kafka broker down | Debezium producer errors | Debezium pauses; binlog buffers naturally. Restart Kafka, resume. | Low |
| Too many parts in CH (sustained burst) | system.parts > threshold |
Reduce flush interval, increase batch; scale CH (add shard). | Medium |
max(now() - max(created_at)) per table. Page when > 5 min sustained. It's the SLA. Everything else is diagnostics.
Months 0–12. Single-node everything.
| ClickHouse (Hetzner CCX23, 4 vCPU / 16GB / 240GB NVMe) | ~$60 |
| Kafka or Redpanda (single broker, same spec) | ~$60 |
| Debezium (small VM, 2 vCPU / 4GB) | ~$20 |
| S3 / R2 archive (~200 GB compressed) | ~$5 |
| Prometheus + Grafana (co-hosted) | $0 |
| Backups (snapshots to R2, retention 30d) | ~$20 |
| Total | ~$165 / mo |
Months 12–36. Cluster CH, scale Kafka.
| ClickHouse 3-node cluster (CCX33 each) | ~$330 |
| Kafka 3-broker (CCX23 each) + Keeper | ~$240 |
| Debezium 2-node HA | ~$50 |
| S3 / R2 archive (~1.5 TB) | ~$25 |
| Schema registry (Apicurio) | ~$15 |
| Monitoring + backups | ~$60 |
| Total | ~$720 / mo |
Same workload, Snowflake-ish pricing.
| Snowpipe Streaming ingest (~2M/day → 60M/mo events) | ~$400 |
| Warehouse compute (XS, ~6h/day for ETL + queries) | ~$500 |
| Storage (~500 GB) | ~$15 |
| Confluent Cloud (Kafka, small cluster) | ~$500 |
| Debezium / Fivetran CDC (priced per row) | ~$1500+ |
| Total | ~$2,900+ / mo |
Same correctness, ~15× the bill. Easy call at this stage.
| I chose | Over | What I gave up | What I got |
|---|---|---|---|
| ClickHouse | Snowflake / BigQuery | Best-in-class JOINs · zero-ops scaling · richest ecosystem | 15× cost reduction at this scale · faster ingest · self-host control |
| Debezium + Kafka | Fivetran / Airbyte / DMS | UI-driven config · vendor support | $0 license · full event control · replay · reusable for other sinks |
| CH Kafka engine | Flink / Spark Streaming | Complex stream joins · windowed aggregations across topics | One less system to operate · MVs cover 90% of needs |
| Wide denormalized fact | Strict star schema | Storage savings · dimensional purity | No JOIN at query time · pruning by app_id in fact table directly |
| ReplacingMergeTree | CollapsingMergeTree · external dedup | Read-side simplicity (need FINAL or view) |
Simplest mapping from Debezium events · idempotent under replay |
| Self-hosted on VMs | Managed (CH Cloud, Confluent Cloud) | ~10 hours/week of ops at scale | ~70% cost reduction · no vendor lock-in |
| 5-min SLA | Sub-second SLA | True real-time (Flink + push to product) | Meets stated requirement · sub-second is a feature we add when needed |
| No stream processor day-one | Flink for anomaly detection | Real-time alerting in CH itself is harder | One less moving part · revisit when product needs it |
access_token and email considered PII that must be tokenized at ingest?Happy to go deeper on any layer — data modeling, Kafka topology, ClickHouse internals, or the things I deferred.