# Change Data Capture (CDC)
Change Data Capture (CDC) replicates individual inserts, updates, and deletes from a source system into your warehouse in real time -- instead of re-extracting entire tables on every run.
Skippr reads native change logs (PostgreSQL WAL, MySQL binlog, MongoDB change streams, DynamoDB Streams, Kafka Debezium envelopes) and applies each mutation to the destination with exactly-once semantics.
## How it works
```text
Source Database
  │
  │  log reader ── reads native change log (WAL / binlog / stream)
  ▼
Skippr WAL (local)
  │
  │  segment commit ── Arrow IPC + CDC metadata (mutation kind, order token)
  ▼
Committed Segment
  │
  │  idempotent apply ── MERGE with order-token guard + tombstone check
  ▼
Destination Warehouse
  ├── _skippr_order_token column (stale-write rejection)
  └── _skippr_tombstones_{table} (anti-resurrect protection)
```

Each change carries a mutation kind (insert, update, or delete) and a lexicographically sortable order token derived from the source's native log position (e.g. PostgreSQL LSN, MySQL binlog file + position, MongoDB resume token).
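Why lexicographic sortability matters can be illustrated with PostgreSQL LSNs, which are written as two hex halves (`hi/lo`). The exact token encoding below is an assumption for illustration — Skippr's internal format is not documented on this page — but zero-padded fixed-width hex is the standard trick that makes string order match log order:

```python
# Hypothetical order-token encoding from a PostgreSQL LSN ("hi/lo" hex).
# Fixed-width, zero-padded hex means plain string comparison agrees with
# numeric WAL position order.

def lsn_to_token(lsn: str) -> str:
    hi, lo = lsn.split("/")
    # 8 hex digits per half -> fixed width, so string order == numeric order
    return f"{int(hi, 16):08X}{int(lo, 16):08X}"

a = lsn_to_token("0/16B3748")     # earlier WAL position
b = lsn_to_token("16/B374D848")   # later WAL position
assert a < b  # string comparison matches replication order
```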
At the destination, Skippr uses these to guarantee correctness:
- Upsert-if-newer -- a row is only written if its order token is greater than the existing token. Stale or replayed writes are silently discarded.
- Tombstone anti-resurrect -- deletes are recorded in a per-table tombstone table. A later insert for a deleted key is blocked unless its order token proves it occurred after the delete.
## Enabling CDC
CDC requires two pieces in your `skippr.yaml`:

- Set `cdc_enabled: true` on the source
- Add a `cdc:` pipeline block with your business key columns
Skippr automatically infers and enforces the strongest CDC guarantee your source/sink pair supports -- you never need to specify a guarantee level.
```yaml
project: my_cdc_pipeline

source:
  kind: postgres
  host: localhost
  port: 5432
  user: replicator
  password: ${POSTGRES_PASSWORD}
  database: mydb
  cdc_enabled: true

warehouse:
  kind: snowflake
  database: ANALYTICS
  schema: RAW
  warehouse: COMPUTE_WH

cdc:
  business_key_columns:
    - id
```

See CDC Configuration for the full reference.
## Supported sources
| Source | CDC Mechanism | Details |
|---|---|---|
| PostgreSQL | WAL logical replication (pgoutput) | CDC Sources -- PostgreSQL |
| MySQL | Binlog replication | CDC Sources -- MySQL |
| MongoDB | Change streams | CDC Sources -- MongoDB |
| DynamoDB | DynamoDB Streams | CDC Sources -- DynamoDB |
| Kafka | Debezium envelope parsing | CDC Sources -- Kafka |
## Supported destinations
All warehouse destinations support CDC with exactly-once MERGE semantics:
| Destination | MERGE Strategy | Details |
|---|---|---|
| Snowflake | MERGE DML | CDC Destinations -- Snowflake |
| BigQuery | MERGE DML | CDC Destinations -- BigQuery |
| PostgreSQL | Staging table + INSERT ... ON CONFLICT | CDC Destinations -- PostgreSQL |
| Redshift | Staging table + MERGE | CDC Destinations -- Redshift |
| ClickHouse | ReplacingMergeTree | CDC Destinations -- ClickHouse |
| Databricks | Unity Catalog MERGE | CDC Destinations -- Databricks |
| Synapse | MERGE via Tiberius | CDC Destinations -- Synapse |
| MotherDuck | DuckDB MERGE | CDC Destinations -- MotherDuck |
## Further reading
- CDC Sources -- prerequisites, configuration, and resume behavior for each source
- CDC Destinations -- how changes are applied at each warehouse
- CDC Configuration -- full YAML reference for the `cdc:` pipeline block
- Blog: Change Data Capture with Exactly-Once Guarantees -- deep dive into Skippr's CDC architecture
