How to Remove Duplicate Data in My Databricks Destination?
Question
Why is there duplicate data in my Databricks destination?
Environment
Destination: Databricks
Answer
A common cause of duplicate data in your Databricks destination is the presence of row or column filters applied to tables created by Fivetran. For details on filters, see Databricks' documentation.
When filters are applied to a table, Fivetran cannot access certain rows or columns. As a result, instead of updating existing records, we insert new ones, creating duplicates.
To check and resolve filter-related issues, perform the following steps:
To verify whether any filters are applied to a table, run the following SQL query:
DESCRIBE EXTENDED <table_name>;
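If your workspace uses Unity Catalog, you may also be able to list filters directly from the information schema. A sketch, assuming the standard Unity Catalog `information_schema` views are available in your metastore (`<table_name>` is a placeholder):

```sql
-- List any row filters recorded for the table
SELECT * FROM system.information_schema.row_filters
WHERE table_name = '<table_name>';

-- List any column masks recorded for the table
SELECT * FROM system.information_schema.column_masks
WHERE table_name = '<table_name>';
```

Empty results from both queries suggest no filters are applied to the table.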
If filters exist, do one of the following to resolve the issue:
Remove the row filter by running the following SQL command:
ALTER TABLE <table_name> DROP ROW FILTER;
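Column filters are implemented as column masks, which are removed per column rather than per table. A hedged sketch, assuming a masked column named `<column_name>` (a placeholder):

```sql
-- Remove the mask from a specific column of the table
ALTER TABLE <table_name> ALTER COLUMN <column_name> DROP MASK;
```

Repeat the command for each masked column on the table.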
Drop the table and perform a full resync.
If no filters are applied and you are still seeing duplicates, please reach out to support.