Databricks Delta

Create a Databricks Delta connection to connect to the Databricks SQL endpoint or Databricks analytics cluster to read or write data.

Feature snapshot

Operation    Support
Read         Yes
Write        Yes

Before you begin

Before you configure the connection properties, you'll need to get information from your Databricks account.
The following video shows you how to get information from your Databricks account:
https://infa.media/3pNMrGj

Connection properties

The following table describes the Databricks Delta connection properties:
Property
Description
Connection Name
Name of the connection.
Each connection name must be unique within the organization. Connection names can contain alphanumeric characters, spaces, and the following special characters: _ . + -,
Maximum length is 255 characters.
Databricks Host
The host name of the endpoint that the Databricks account belongs to.
You can get the host name from the JDBC URL, which uses the following syntax:
jdbc:spark://<Databricks Host>:443/default;transportMode=http;ssl=1;httpPath=sql/protocolv1/o/<Org Id>/<Cluster ID>;AuthMech=3;UID=token;PWD=<personal-access-token>
Note: You can get the JDBC URL from the Advanced Options of JDBC or ODBC in the Databricks Delta analytics cluster or all-purpose cluster.
In the JDBC URL shown for the Databricks Host, Org Id, and Cluster ID properties, the value of PWD is always <personal-access-token>.
Cluster ID
The ID of the Databricks analytics cluster.
You can get the cluster ID from the JDBC URL.
Use the following syntax:
jdbc:spark://<Databricks Host>:443/default;transportMode=http;ssl=1;httpPath=sql/protocolv1/o/<Org Id>/<Cluster ID>;AuthMech=3;UID=token;PWD=<personal-access-token>
Organization Id
The unique organization ID for the workspace in Databricks.
Use the following syntax:
jdbc:spark://<Databricks Host>:443/default;transportMode=http;ssl=1;httpPath=sql/protocolv1/o/<Org Id>/<Cluster ID>;AuthMech=3;UID=token;PWD=<personal-access-token>
Databricks Token
Personal access token to access Databricks.
Ensure that you have permissions to attach to the cluster identified in the Cluster ID property.
SQL Endpoint JDBC URL
Databricks SQL endpoint JDBC connection URL.
Use the following syntax:
jdbc:spark://<Databricks Host>:443/default;transportMode=http;ssl=1;AuthMech=3;httpPath=/sql/1.0/endpoints/<SQL endpoint cluster ID>;
This field is required to connect to the Databricks SQL endpoint. Ensure that you set the required environment variables in the Secure Agent.
Note: The Databricks Host, Organization ID, and Cluster ID properties are not considered if you configure the SQL Endpoint JDBC URL property.
For more information on Databricks Delta SQL endpoint, contact Informatica Global Customer Support.
Database
The database in Databricks Delta that you want to connect to.
By default, all databases available in the workspace are listed.
JDBC Driver Class Name
The name of the JDBC driver class.
Specify the driver class name as com.simba.spark.jdbc.Driver.
Cluster Environment
The cloud provider where the Databricks cluster is deployed.
Choose from the following options:
  • AWS
  • Azure
Default is AWS.
The connection attributes depend on the cluster environment you select. For more information, see the AWS cluster properties and Azure cluster properties sections.
Min Workers
The minimum number of worker nodes to be used for the Spark job.
Minimum value is 1.
Max Workers
The maximum number of worker nodes to be used for the Spark job.
If you don't want the cluster to autoscale, set Max Workers to the same value as Min Workers, or leave Max Workers unset.
DB Runtime Version
The Databricks runtime version.
Select 7.3 LTS from the list.
Worker Node Type
The worker node instance type that is used to run the Spark job.
For example, the worker node type for AWS can be i3.2xlarge. The worker node type for Azure can be Standard_DS3_v2.
Driver Node Type
The driver node instance type that is used to collect data from the Spark workers.
For example, the driver node type for AWS can be i3.2xlarge. The driver node type for Azure can be Standard_DS3_v2.
If you don't specify the driver node type, Databricks uses the value you specify in the worker node type field.
Instance Pool ID
The instance pool ID used for the Spark cluster. If you specify the Instance Pool ID, the following connection properties are ignored:
  • Driver Node Type
  • EBS Volume Count
  • EBS Volume Type
  • EBS Volume Size
  • Enable Elastic Disk
  • Worker Node Type
  • Zone ID
Enable Elastic Disk
Enables the cluster to get additional disk space.
Enable this option if the Spark workers are running low on disk space.
Spark Configuration
The Spark configuration to use in the Databricks cluster.
The configuration must be in the following format:
"key1"="value1";"key2"="value2";....
For example:
"spark.executor.userClassPathFirst"="False"
Spark Environment Variables
The environment variables to export before launching the Spark driver and workers.
The variables must be in the following format:
"key1"="value1";"key2"="value2";....
For example:
"MY_ENVIRONMENT_VARIABLE"="true"

Read properties

The following table describes the advanced source properties that you can configure in the Python code to read from Databricks Delta:
Property
Description
Database Name
Overrides the database name provided in the connection and the database name provided during metadata import.
Table Name
Overrides the table name used in the metadata import with the table name that you specify.
Pre SQL
The pre-SQL command to run on the Databricks Delta source table before the agent reads the data.
For example, if you want to update records in the database before you read the records from the table, specify a pre-SQL statement.
The query must include a fully qualified table name. You can specify multiple pre-SQL commands, each separated with a semicolon.
Post SQL
The post-SQL command to run on the Databricks Delta table after the agent completes the read operation.
For example, if you want to delete some records after the latest records are loaded, specify a post-SQL statement.
The query must include a fully qualified table name. You can specify multiple post-SQL commands, each separated with a semicolon.
Staging Location
Relative directory path to store the staging files.
  • If the Databricks cluster is deployed on AWS, use the path relative to the Amazon S3 staging bucket.
  • If the Databricks cluster is deployed on Azure, use the path relative to the Azure Data Lake Store Gen2 staging filesystem name.
Note: If the staging location is not specified when Unity Catalog is used, the mapping fails.
Job Timeout
The maximum time in seconds allowed for the Spark job to complete processing. If the job is not completed within the specified time, the Databricks cluster terminates the job and the mapping fails.
If the job timeout is not specified, the mapping shows success or failure based on the job completion.
Job Status Poll Interval
Poll interval in seconds at which the Secure Agent checks the status of the job completion.
Default is 30 seconds.
DB REST API Timeout
The maximum time in seconds for which the Secure Agent retries the REST API calls to Databricks when there is a network connection error or when the REST endpoint returns a 5xx HTTP error code.
Default is 10 minutes.
DB REST API Retry Interval
The time interval in seconds at which the Secure Agent retries the REST API call when there is a network connection error or when the REST endpoint returns a 5xx HTTP error code.
This value does not apply to the job status REST API. Use the Job Status Poll Interval value for the job status REST API.
Default is 30 seconds.
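
If you assemble these properties in Python before you run the mapping, a plain dictionary keyed by the property names above is one convenient way to keep them together. The following sketch uses fictitious database and table names and only illustrates the property values, including semicolon-separated pre-SQL commands with fully qualified table names; the INFACore call that consumes the properties is not shown here.

# Illustrative read-side advanced properties; the keys mirror the table above.
read_options = {
    "Database Name": "sales_db",                    # overrides the database from the connection and metadata import
    "Table Name": "orders_current",                 # overrides the imported table name
    # Separate multiple pre-SQL commands with semicolons and use fully qualified table names.
    "Pre SQL": (
        "UPDATE sales_db.orders_current SET status = 'READY' WHERE status IS NULL;"
        "DELETE FROM sales_db.orders_staging WHERE load_date < '2024-01-01'"
    ),
    "Post SQL": "DELETE FROM sales_db.orders_current WHERE status = 'OBSOLETE'",
    "Staging Location": "infacore/staging/orders",  # relative to the S3 bucket or ADLS Gen2 filesystem
    "Job Timeout": 3600,                            # seconds; the cluster terminates the job if it runs longer
    "Job Status Poll Interval": 30,                 # seconds; default is 30
    "DB REST API Timeout": 600,                     # seconds; default is 10 minutes
    "DB REST API Retry Interval": 30,               # seconds; default is 30
}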

Write properties

The following table describes the advanced target properties that you can configure in the Python code to write to Databricks Delta:
Advanced Property
Description
Target Database Name
Overrides the database name provided in the connection and the database selected in the metadata browser for existing targets.
Target Table Name
Overrides the table name at runtime for existing targets.
Pre SQL
The pre-SQL command to run before the agent writes to Databricks Delta.
For example, if you want to assign a sequence object to a primary key field of the target table before you write data to the table, specify a pre-SQL statement.
You can specify multiple pre-SQL commands, each separated with a semicolon.
Post SQL
The post-SQL command to run after the agent completes the write operation.
For example, if you want to alter the table created by using the create target option and assign constraints to the table, specify a post-SQL statement.
You can specify multiple post-SQL commands, each separated with a semicolon.
Write Disposition
Overwrites or adds data to the existing data in a table. You can select from the following options:
  • Append. Appends data to the existing data in the table even if the table is empty.
  • Truncate. Overwrites the existing data in the table.
Staging Location
Relative directory path to store the staging files.
  • If the Databricks cluster is deployed on AWS, use the path relative to the Amazon S3 staging bucket.
  • If the Databricks cluster is deployed on Azure, use the path relative to the Azure Data Lake Store Gen2 staging filesystem name.
Note: This property is mandatory if you want to create a new target at runtime when the source query is set for Unity Catalog.
Job Timeout
The maximum time in seconds allowed for the Spark job to complete processing. If the job is not completed within the specified time, the Databricks cluster terminates the job and the mapping fails.
If the job timeout is not specified, the mapping shows success or failure based on the job completion.
Job Status Poll Interval
Poll interval in seconds at which the Secure Agent checks the status of the job completion.
Default is 30 seconds.
DB REST API Timeout
The maximum time in seconds for which the Secure Agent retries the REST API calls to Databricks when there is a network connection error or when the REST endpoint returns a 5xx HTTP error code.
Default is 10 minutes.
DB REST API Retry Interval
The time interval in seconds at which the Secure Agent retries the REST API call when there is a network connection error or when the REST endpoint returns a 5xx HTTP error code.
This value does not apply to the job status REST API. Use the Job Status Poll Interval value for the job status REST API.
Default is 30 seconds.
Update Mode
Defines how rows are updated in the target tables. Select from the following options:
  • Update As Update: Rows matching the selected update columns are updated in the target.
  • Update Else Insert: Rows matching the selected update columns are updated in the target. Rows that don't match are appended to the target.
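
The write-side properties can be collected the same way. The sketch below uses fictitious names and shows the Write Disposition and Update Mode values listed above together with semicolon-separated post-SQL commands; as with the read example, the keys simply mirror the property names in the table, and the INFACore call that applies them is not shown.

# Illustrative write-side advanced properties; the keys mirror the table above.
write_options = {
    "Target Database Name": "sales_db",             # overrides the database selected in the metadata browser
    "Target Table Name": "orders_history",          # overrides the table name at runtime
    # Separate multiple post-SQL commands with semicolons.
    "Post SQL": (
        "OPTIMIZE sales_db.orders_history;"
        "ANALYZE TABLE sales_db.orders_history COMPUTE STATISTICS"
    ),
    "Write Disposition": "Append",                  # or "Truncate" to overwrite the existing data
    "Update Mode": "Update Else Insert",            # or "Update As Update"
    "Staging Location": "infacore/staging/orders",  # required to create a new target at runtime with Unity Catalog
    "Job Timeout": 3600,                            # seconds
    "Job Status Poll Interval": 30,                 # seconds; default is 30
}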