Staging for Databricks
Fivetran HVR can store/stage data in a temporary directory (staging file) before loading it into the target Databricks location. To ensure optimal performance, HVR uses staging files when performing Burst Integrate and Bulk Refresh into Databricks.
HVR supports the following platforms for staging files:
- AWS S3
- Azure Data Lake Storage (ADLS)
- Google Cloud Storage (GCS)
Since v6.2.5/0
The following outlines how HVR implements staging for Databricks:
- HVR first writes data into the staging directory on the configured platform. Since 6.2.5/1, the staging files can be stored in CSV or Parquet format.
- HVR then uses the Databricks SQL command
COPY INTO
to ingest data into the target tables from the staging directory.
Configuring Staging
HVR and the staging platform must be configured to stage data before loading it into Databricks.
If the managed secrets feature is enabled, the option USE TOKEN INSTEAD is displayed in the fields designated for entering secrets.
Staging on AWS S3
Since v6.1.0/13
Setting up staging on AWS S3 requires configuring both the AWS S3 environment and HVR.
Prerequisites
An AWS S3 location (bucket) to store staging files. For more information about creating and configuring an S3 bucket, refer to the AWS documentation.
An AWS user with AmazonS3FullAccess permission policy to access the S3 bucket. Alternatively, an AWS user with minimal set of permission can also be used:
Click here for minimal set of permissions.
- s3:AbortMultipartUpload
- s3:GetBucketLocation
- s3:ListBucket
- s3:ListBucketMultipartUploads
- s3:GetObject
- s3:PutObject
- s3:DeleteObject
Sample JSON with a user role permission policy for S3 location
{ "Statement": [ { "Sid": "<identifier>", "Effect": "Allow", "Principal": { "AWS": "arn:aws:iam::<account_id>:<user>/<username>" }, "Action": [ "s3:GetObject", "s3:GetObjectVersion", "s3:PutObject", "s3:DeleteObject", "s3:DeleteObjectVersion", "s3:AbortMultipartUpload" ], "Resource": "arn:aws:s3:::<bucket_name>/*" }, { "Sid": "<identifier>", "Effect": "Allow", "Principal": { "AWS": "arn:aws:iam::<account_id>:<user>/<username>" }, "Action": [ "s3:ListBucket", "s3:GetBucketLocation", "s3:ListBucketMultipartUploads" ], "Resource": "arn:aws:s3:::<bucket_name>" } ] }
For minimal permission, HVR also supports the AWS temporary security credentials in IAM. There are two ways to request for the AWS Security Token Service (STS) temporary credentials:
Method 1: Using a combination of AWS STS Role ARN, AWS Access Key Id, and AWS Secret Access Key
Sample JSON
{ "Version": "2012-10-17", "Statement": [ { "Sid": "", "Effect": "Allow", "Principal": { "AWS": "arn:aws:iam::<account_id>:<user>/<username>" }, "Action": "sts:AssumeRole" } ] }
Method 2: Using a combination of AWS STS Role ARN and AWS IAM Role (a role that has access to an EC2 machine)
Sample JSON
{ "Version": "2012-10-17", "Statement": [ { "Sid": "", "Effect": "Allow", "Principal": { "AWS": [ "arn:aws:iam::<account_id>:<user>/<username>", "arn:aws:iam::<account_id>:<role>/<username>" ] }, "Action": "sts:AssumeRole" } ] }
For more information on the AWS S3 permissions policy, refer to the AWS S3 documentation.
For more information, refer to the following AWS documentation:
HVR Location Configuration
To connect HVR to AWS S3, define the following location properties when creating a Databricks location.
For an existing Databricks location, you can define these properties by editing the location's source and target properties.
The Location Property equivalent to the HVR UI field is shown in parentheses below.
In the STORAGE field (File_Scheme), select Amazon S3 (HTTPS).
This specifies the protocol for connecting HVR to AWS S3.In the BUCKET field (S3_Bucket), enter the name of the AWS S3 bucket.
In the Credentials or Instance Profile option, select the method to authenticate HVR’s connection to AWS S3. For more information about obtaining your AWS credentials or Instance Profile Role, refer to the AWS Documentation.
If you select Credentials, define the following location properties:
- In the KEY ID field (AWS_Access_Key_Id), enter the access key ID of the IAM user.
- In the SECRET KEY field (AWS_Secret_Access_Key), enter the secret access key of the IAM user.
If you select Instance Profile, define the following location property:
- In the IAM ROLE field (AWS_IAM_Role), enter the AWS IAM role name.
In the STAGING DIRECTORY field (Staging_Directory), specify the path to the directory where HVR will create temporary staging files within the S3 bucket (e.g., /my_staging_dir).
Since v6.2.5/1
In the STAGING FILE FORMAT drop-down menu (Staging_File_Format), select the format for storing the staging files:- CSV
- Parquet
(Optional) Select the Create Delta Tables in an External Location option to create external tables.
If this option is selected, define the following:In the EXTERNAL LOCATION field (Databricks_Location), enter the path to the external tables in Databricks.
This can be a mount path (e.g., /mnt/...) optionally prefixed with dbfs:/, or an s3:// URL. If the path is defined without a prefix, it is assumed to be a mount path beginning with dbfs:/.In the EXTERNAL LOCATION S3 URL field (Databricks_Location_S3S), enter the s3s:// URL for the external tables in Databricks.
This field is required only if the EXTERNAL LOCATION is set to a mount path (e.g., /mnt/... or dbfs:/).s3s is same as accessing AWS S3 using HTTPS.
By default, HVR connects to us-east-1 once for determining your S3 bucket region. If a firewall restriction or a service such as Amazon Private Link is preventing the determination of your S3 bucket region, you can change this region (us-east-1) to the region where your S3 bucket is located by defining the following action:
Group Table Action Parameters Databricks * Environment Name=HVR_S3_BOOTSTRAP_REGION
Value=s3_bucket_region
Staging on Azure Data Lake Storage
Setting up staging on Azure Data Lake Storage (ADLS) requires configuring both the Azure environment and HVR.
Due to technical limitations, staging on ADLS is not supported in the HVR releases since 6.1.5/3 to 6.1.5/9.
Prerequisites
An ADLS location (container) to store staging files. For more information about creating and configuring an ADLS container, refer to the ADLS documentation.
An ADLS user (storage account) to access the container.
HVR Location Configuration
To connect HVR to ADLS, define the following location properties when creating a Databricks location.
For an existing Databricks location, you can define these properties by editing the location's source and target properties.
The Location Property equivalent to the HVR UI field is shown in parentheses below.
In the STORAGE field (File_Scheme), select Azure Data Lake Storage Gen2 (HTTPS).
This specifies the protocol for connecting HVR to ADLS.In the ACCOUNT field (ABFS_Account), enter the name of the ADLS account.
In the CONTAINER field (ABFS_Container), enter the name of the container available within the ADLS account.
In the AUTHENTICATION METHOD field (ABFS_Authentication_Method), select Client Credentials.
This specifies the authentication method for connecting HVR to ADLS.In the OAUTH2 ENDPOINT field (Azure_OAuth2_Endpoint), enter the URL used to obtain the bearer token with credential token.
In the CLIENT ID field (Azure_OAuth2_Client_Id), enter the Client ID (or application ID) used for obtaining the Microsoft Entra ID (formerly Azure Active Directory) access token.
In the CLIENT SECRET KEY field (Azure_OAuth2_Client_Secret), enter the secret key associated with the CLIENT ID.
In the STAGING DIRECTORY field (Staging_Directory), specify the path to the directory where HVR will create temporary staging files within the Azure container (e.g., /my_staging_dir).
Since v6.2.5/1
In the STAGING FILE FORMAT drop-down menu (Staging_File_Format), select the format for storing the staging files:- CSV
- Parquet
(Optional) Select the Create Delta Tables in an External Location option to create external tables.
If this option is selected, define the following:
In the EXTERNAL LOCATION field (Databricks_Location), enter the path to the external tables in Databricks.
This can be a mount path (e.g., /mnt/...) in Databricks, optionally prefixed with dbfs:/, or an abfss:// URL. If the path is defined without a prefix, it is assumed to be a mount path beginning with dbfs:/.In the EXTERNAL LOCATION ABFSS URL field (Databricks_Location_ABFSS), enter the abfss:// URL for external tables in Databricks.
This field is required only if EXTERNAL LOCATION is set to a mount path (e.g., /mnt/... or dbfs:/...).
Staging on Google Cloud Storage
Since v6.2.5/0
Setting up staging on Google Cloud Storage (GCS) requires configuring both the GCS environment and HVR.
Prerequisites
A GCS location (bucket) to store staging files. For more information about creating and configuring a GCS bucket, refer to the GCS documentation.
A Google Cloud user (storage account) to access the GCS bucket.
Configure the storage integrations to allow Databricks to read and write data into the GCS bucket. For more information, refer to the Databricks documentation.
HVR Location Configuration
To connect HVR to GCS, define the following location properties when creating a Databricks location.
For an existing Databricks location, you can define these properties by editing the location's source and target properties.
The Location Property equivalent to the UI field is shown in parentheses below.
In the STORAGE field (File_Scheme), select Google Cloud Storage (HTTPS).
This specifies the protocol for connecting HVR to GCS.In the BUCKET field (GS_Bucket), enter the name of the GCS bucket.
In the AUTHENTICATION METHOD field (GCloud_Authentication_Method), select the authentication method for connecting HVR to GCS.
Environment Variable GOOGLE_APPLICATION_CREDENTIALS:
Authenticate using the credentials fetched from the GOOGLE_APPLICATION_CREDENTIALS environment variable.OAuth 2.0 Credentials File:
- In the CREDENTIALS FILE field (GCloud_OAuth_File), specify the directory path to the service account key file.
Hash-Based Message Authentication Code (HMAC):
- In the ACCESS KEY field (GS_HMAC_Access_Key_Id), enter the HMAC access key ID of the service account.
- In the SECRET KEY field (GS_HMAC_Secret_Access_Key), enter the HMAC secret key of the service account.
In the STAGING DIRECTORY field (Staging_Directory), specify the path to the directory where HVR will create temporary staging files within the GCS bucket (e.g., /my_staging_dir).
Since v6.2.5/1
In the STAGING FILE FORMAT drop-down menu (Staging_File_Format), select the format for storing the staging files:- CSV
- Parquet
(Optional) Select the Create Delta Tables in an External Location option to create external tables.
If this option is selected, define the following:In the EXTERNAL LOCATION field (Databricks_Location), enter the path to the external tables in Databricks.
This can be a mount path (e.g., /mnt/...) in Databricks, optionally prefixed with dbfs:/, or a gs:// URL. If the path is defined without a prefix, it is assumed to be a mount path beginning with dbfs:/.In the EXTERNAL LOCATION GS URL field (Databricks_Location_GSS), enter the gs:// URL for external tables in Databricks.
This field is required only if the EXTERNAL LOCATION is set to a mount path (e.g., /mnt/... or dbfs:/...).