What Is the Impact of Running Log-Based CDC on My Database Server?
Fivetran and HVR 6 both support log-based Change Data Capture (CDC) from transaction processing databases.
Question
What is the impact of running CDC on the database server in terms of CPU, memory, and IO resources (including disk and network IO)?
Environment
This article applies to most transaction processing databases, such as Oracle, SQL Server, Db2, MySQL, PostgreSQL, SAP HANA, etc.
Answer
Concepts and features are not identical across transaction processing databases, so there are subtle differences that can result in slightly higher or lower impact. Also, for many database technologies, we provide multiple methods to perform log-based CDC, and the choice of method affects the impact.
Log-based CDC pulls table changes from the transaction log, during and after transaction completion. The transaction log is the foundation for database recovery. Replication technologies usually require the actual data changes to be written to the transaction log, rather than the statements that produced them. Log-based CDC runs independently of the transaction that makes the change. Therefore, on a well-configured system with sufficient resources, the transactions themselves may not experience any noticeable impact. However, there is impact on database processing and the database server, even if it is not noticeable.
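The idea above can be sketched in a few lines. This is a toy illustration only (the log entries, `LOG`, and `read_committed_changes` are invented for this sketch, not any database's real API): a CDC reader scans the transaction log on its own, separate from the transactions that wrote it, and emits only changes that belong to committed transactions.

```python
# Toy model of a transaction log: each entry has a log sequence number (LSN),
# a transaction id, and an operation. Rolled-back work never leaves the reader.
LOG = [
    {"lsn": 1, "tx": 101, "op": "insert", "row": {"id": 1, "qty": 5}},
    {"lsn": 2, "tx": 102, "op": "update", "row": {"id": 7, "qty": 9}},
    {"lsn": 3, "tx": 101, "op": "commit"},
    {"lsn": 4, "tx": 102, "op": "rollback"},
]

def read_committed_changes(log, from_lsn=0):
    """Return data changes belonging to committed transactions only."""
    committed = {e["tx"] for e in log if e["op"] == "commit"}
    return [e for e in log
            if e["lsn"] > from_lsn
            and e["op"] in ("insert", "update", "delete")
            and e["tx"] in committed]

changes = read_committed_changes(LOG)
# Only transaction 101's insert survives; transaction 102 rolled back.
```

Because the reader tracks its own position (`from_lsn`), it can trail the log at its own pace without holding locks inside the transactions themselves.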
Why does CDC impact my database server?
To allow for the highest possible transaction rates, database technologies usually minimize the amount of change data logged without compromising recoverability. Therefore, updates only record the column values that were changed. Also, identifiers to existing rows are often internal identifiers rather than application-level identifiers (primary or unique key column values). Log-based CDC, which is commonly used to push changes into a different technology, requires application-level identifiers. The following contribute to CDC's impact on the database server:
- Supplemental logging: The database is instructed to write additional information to the log, requiring more IO to the log and more storage space to store backups. The degree of impact from supplemental logging depends on the following:
- The relative percentage of the database activity you want to capture
- The mix of transactions (inserts, updates and deletes)
- The number of columns per table
- Database technology differences
If you already replicate your database, then you may already be absorbing this overhead. Supplemental logging can significantly increase transaction log generation (20% or more), though it rarely slows down the database server or database processing.
NOTE: Since SAP HANA does not allow the generation of supplemental logging, the introduction of log-based CDC has no impact on HANA's transaction log generation.
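A back-of-the-envelope sketch makes the supplemental-logging overhead concrete. The record layout below is assumed for illustration (header size, column encoding, and the `logged_bytes` helper are all hypothetical, not any vendor's actual log format): an UPDATE that changes one small column must additionally log the primary-key column values when supplemental logging is enabled.

```python
# Hypothetical log-record sizing: a fixed header plus (name, width) pairs
# for each column value written to the log. Not a real vendor format.
def logged_bytes(changed_cols, key_cols=None, header=24):
    """Approximate bytes written to the log for one UPDATE record."""
    size = header + sum(len(name) + width for name, width in changed_cols)
    if key_cols:  # supplemental logging adds the key column values
        size += sum(len(name) + width for name, width in key_cols)
    return size

changed = [("qty", 4)]                   # only one 4-byte column changed
key = [("order_id", 8), ("line_no", 4)]  # application-level identifiers

minimal = logged_bytes(changed)                    # 31 bytes
supplemental = logged_bytes(changed, key_cols=key) # 58 bytes
overhead = (supplemental - minimal) / minimal      # ~87% more for this record
```

The effect is largest for narrow updates like this one; for wide inserts, where most column values are logged anyway, the relative overhead shrinks, which is why the transaction mix matters.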
- Log parsing: Several database technologies provide built-in routines or APIs to retrieve the relevant log changes. For example, SQL Server can populate CDC tables that you can select from, Db2 provides the db2ReadLog API, and PostgreSQL has decoders to retrieve data from replication slots. Some CDC providers, including Fivetran, provide binary log readers. A binary log reader can parse a transaction log file or log fragments directly. This approach bypasses database processing altogether, and it can often be run on a different server, as long as there is access to the transaction logs. However, to achieve the lowest possible latency, the binary log reader may also run on the database server directly. Binary log readers almost always take up fewer resources (CPU, memory, IO) than database routines or APIs, because the in-database processing introduces additional overhead.
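To show why a binary log reader is cheap, here is a minimal sketch of the idea using an invented fixed-size record format (not any real database's log layout): the reader decodes bytes directly, with no database session or in-database processing involved, so it can run wherever the log bytes are readable.

```python
# Invented log format for illustration: little-endian records of
# lsn (u32), transaction id (u32), op code (u8).
import struct

RECORD = struct.Struct("<IIB")
OPS = {1: "insert", 2: "update", 3: "delete"}

def parse_log(buf):
    """Yield (lsn, tx, op) tuples from a buffer of fixed-size records."""
    for off in range(0, len(buf), RECORD.size):
        lsn, tx, op = RECORD.unpack_from(buf, off)
        yield lsn, tx, OPS[op]

# Build two fake records and parse them back, as a log reader would.
raw = RECORD.pack(1, 101, 1) + RECORD.pack(2, 101, 2)
records = list(parse_log(raw))
```

Real transaction log formats are far more complex (variable-length records, block headers, checksums), but the resource profile is the same: sequential reads plus lightweight decoding, rather than queries executed through the database engine.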
- More processing and storage: This is not applicable to all use cases because it relates to the presentation of the change data. For example, SQL Server either publishes changes in a distribution database (in the case of replication) or through CDC tables (when CDC is used). In either case, additional database processing and storage are required to make changes visible.
- Change retrieval: When pulling data remotely, a large data transfer may benefit from optimizations in the database connectivity driver. Generally, passing less data and using optimized network connectivity results in the lowest possible impact. Fivetran's agent-based technology, including HVA connectors and HVR 6, always passes compressed data on the wire. With 5-10x compression on just the change data, the data volume crossing the network is drastically reduced relative to uncompressed CDC data transfer or to shipping raw transaction logs to a binary log parser running elsewhere.
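Change records are highly repetitive (same table names, same column names, similar values), which is why they compress so well on the wire. The snippet below demonstrates the effect with generic zlib compression on sample data; the 5-10x figure above is Fivetran's, and the ratio here depends entirely on the sample rows, not on HVR's actual wire protocol.

```python
# Compress a batch of repetitive change records and measure the ratio.
import json
import zlib

rows = [{"op": "update", "table": "orders", "id": i, "status": "SHIPPED"}
        for i in range(1000)]
raw = json.dumps(rows).encode()
compressed = zlib.compress(raw, 6)
ratio = len(raw) / len(compressed)  # well above 1 for repetitive data
```

The same reasoning explains why compressing only the extracted change data beats shipping whole log files: the change stream is both smaller to begin with and more uniform, so it compresses better.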
How high is CDC's impact?
Many factors influence the impact of log-based CDC:
- What percentage of database activity must be captured?
- What are the table definitions involved, including the data types they use and the row width?
- What is the transaction mix?
- What approach is used to capture the changes?
It is rare to see more than 10% overhead on database server CPU due to the introduction of log-based CDC. On a well-configured system, you should be able to drive database processing up to 100% CPU utilization. Introducing CDC on your database server should have very limited (if any) noticeable impact, especially if your database server CPU utilization is no higher than 70-80%.
The less activity there is in your database, the higher the relative overhead of CDC. On a mostly idle system, the baseline load is low, so the overhead of checking for new change data becomes proportionally noticeable, though it shouldn't negatively impact an idle system.