Compare
After you run the Refresh job and the data replication process starts, you might want to verify that the data in source and target locations is in sync. For this, Fivetran HVR provides the Compare functionality, which allows you to compare data in two or more locations in a channel.
HVR supports comparing both database and file locations. For more information about comparing file locations, see the Direct File Compare section. You can compare a single source location with multiple target locations. Additionally, you can compare an entire database (all tables) or only specific tables. The Compare functionality compares both the table structures and the data they contain.
HVR also supports comparing data in heterogeneous DBMSes where source and target data types may not match and have different character encoding.
Data Type Handling and Transformations
When comparing data across different DBMSes or formats, HVR automatically handles differences in data types and character encoding between source and target location using coercion and transformation techniques.
HVR standardizes data types internally during comparison by coercing values to fit the internal representation. You can configure how HVR handles coercion errors in the channel definition using the CoerceErrorPolicy and CoerceErrorType parameters in the TableProperties action. By default, a coercion error results in a fatal error.
HVR also supports transformations using actions that control compare behavior. For example, you can use the CaptureExpression parameter in the ColumnProperties action to convert data to lowercase before comparison. You can apply transformations per table, location, channel, or installation.
When comparing between different DBMS types, ambiguity can occur due to data type coercions. Different systems interpret data types and values differently, which can lead to inconsistencies during comparison. HVR resolves this by applying the rules of the target location (also known as the "write" location), not the source (also known as "read") location. This ensures that comparison results match the behavior of the system receiving the data.
For example, Oracle treats an empty string and NULL
as equivalent, while Ingres treats them as distinct. During replication, HVR may map an empty string in the Ingres (ing) location to a NULL
in a varchar
column in the Oracle (ora) location. During comparison, whether these values are considered identical or different depends on the comparison direction:
- When comparing from ing to ora, HVR uses Oracle’s rules and reports the tables as identical because Oracle treats empty strings and
NULL
as the same. - When comparing from ora to ing, HVR uses Ingres’s rules and reports the tables as different because Ingres treats empty strings and
NULL
as distinct values.
You can run Compare using the following methods:
- via UI – see Comparing Data
- via CLI – see hvrcompare
- via REST API – see /api/latest/hubs/{hub}/channels/{channel}/compare in Activate, Refresh, and Compare Interface.
Compare Types
If your source table has been pre-populated with data, there are two types of Compare you can choose from:
Bulk Compare: HVR calculates the checksum for each table in the channel and compares these checksum to report whether the replicated tables are identical.
Row-by-Row Compare: HVR extracts the data from a source (read) location, compresses it, and transfers the data to a target (write) location(s) to perform a row level Compare. Each individual row is compared to produce a 'diff' result. For each detected difference, an SQL statement is written: an
INSERT
,UPDATE
, orDELETE
. This compare type is also referred to as row-wise compare in this documentation.
Bulk and online row-wise compare are not supported for packed SAP tables; only offline row-wise compare is available.
You can set a Compare type in the CLI using -g option of hvrcompare or in the UI enable Bulk Compare by selecting the Table Checksums Only option.
Online Compare
You can choose to perform online compare, which is a live Compare between locations with rapidly changing data. While performing a compare, if the online compare option is defined, HVR processes the changes that occur during the compare and does not miscount them as differences. You can define the online compare in CLI using -o option of hvrcompare or in the UI by selecting the Online Compare option.
Direct File Compare
HVR also allows you to perform the Compare on the file locations using the Direct File Compare method, which is performed against a file location. This Compare method is a faster alternative for file compare via Hive External Tables and also helps to avoid compare mismatches caused by data type coercion through Hive deserializer.
During direct file Compare, HVR reads and parses (deserialize) files directly from the file location instead of using the HIVE external tables (even if it is configured for that location). In direct file compare, the files of each table are sliced and distributed to prereader subtasks. Each prereader subtasks reads, sorts and parses (deserialize) the files to generate compressed(encrypted) intermediate files. These intermediate files are then compared with the database on the other side.
The number of prereader subtasks used during direct file Compare can be configured using the compare option File Prereaders per Table (CLI option -w).
The location to store the intermediate files generated during Compare can be configured using the location property Intermediate_Directory.
To perform a direct file compare:
against a source file location, action Capture with parameter Pattern should be defined.
against a target file location, action Integrate with parameter ComparePattern should be defined.
Limitations
Following are the direct file compare limitations:
Direct file Compare does not support Avro, Parquet or JSON file formats.
Direct file Compare is not supported if action Restrict with parameter RefreshCondition is defined on a file location involved in the compare.
Direct file Compare is not supported when the channel is a 'blob' file channel. A blob file channel has no table information and simply treats each file as a sequence of bytes without understanding their file format.
Direct file Compare for XML files requires each XML file to contain a single table.
Slicing
Sometimes, the amount of data that the Compare job needs to process is too big. In this case, you can choose to divide the table into a few batches and process them in parallel. In HVR, this is achieved via the Slicing functionality. By configuring Slicing, you can divide your database table into a few pieces that will be processed in parallel saving you a lot of time.
HVR suggests a few types of slicing, each fitting best for a specific business case. For more information about slicing types and when it is best to use them, refer to the Slicing article.