Connectors and Connections > Data Ingestion and Replication connection properties > Databricks connection properties
  

Databricks connection properties

Create a Databricks connection to securely read data from or write data to Databricks.

Staging prerequisites

Before you create a connection, you must perform certain prerequisite tasks to configure the staging environment to connect to SQL warehouse or Databricks cluster.

SQL warehouse

Configure either the AWS or Azure staging environment for the SQL warehouse based on the deployed environment. You also need to configure the Spark parameters for the SQL warehouse to use Azure and AWS staging.

Configure AWS staging

Configure IAM AssumeRole authentication to use AWS staging for the SQL warehouse.

Configure Spark parameters for AWS staging

On the Databricks SQL Admin console, navigate to SQL Warehouse Settings > Data Security, and then configure the Spark parameters for AWS under Data access configuration.
Add the following Spark configuration parameters and restart the SQL warehouse:
For example, the S3 staging bucket warehouse value is s3.ap-south-1.amazonaws.com.
Ensure that the configured access key and secret key have access to the S3 buckets where you store the data for Databricks tables.

Configure Azure staging

Before you use Microsoft Azure Data Lake Storage Gen2 to stage files, perform the following tasks:

Configure Spark parameters for Azure staging

On the Databricks SQL Admin console, navigate to SQL Warehouse Settings > Data Security, and then configure the Spark parameters for Azure under Data access configuration.
Add the following Spark configuration parameters and restart the SQL warehouse:
Ensure that the configured client ID and client secret have access to the file systems where you store the data for Databricks tables.

Databricks cluster

Configure the Spark parameters for the Databricks cluster to use Azure and AWS staging based on where the cluster is deployed.
You also need to enable the Secure Agent properties for runtime and design-time processing on the Databricks cluster.

Spark configuration

Before you connect to the Databricks cluster, you must configure the Spark parameters on AWS and Azure.

Configuration on AWS

Add the following Spark configuration parameters for the Databricks cluster and restart the cluster:
Ensure that the access and secret key configured has access to the buckets where you store the data for Databricks tables.

Configuration on Azure

Add the following Spark configuration parameters for the Databricks cluster and restart the cluster:
Ensure that the client ID and client secret configured has access to the file systems where you store the data for Databricks tables.

Connect to Databricks

Let's configure the Databricks connection properties to connect to Databricks.

Before you begin

Before you get started, you'll need to get information from your Databricks account.
The following video shows you how to get the information you need:
https://onlinehelp.informatica.com/IICS/videos/prod/idlf/data-loader-connect-databricks-delta-video.mp4
You also need to configure the AWS or Azure staging environment to use the SQL endpoint or the Databricks cluster in the connection.
To learn about the staging prerequisites for the Azure or AWS environment, check out SQL warehouse or Databricks cluster.

Connection details

The following table describes the basic connection properties:
Property
Description
Connection Name
Name of the connection.
Each connection name must be unique within the organization. Connection names can contain alphanumeric characters, spaces, and the following special characters: _ . + -,
Maximum length is 255 characters.
Description
Description of the connection. Maximum length is 4000 characters.
Type
Databricks
Runtime Environment
The name of the runtime environment where you want to run tasks.
You cannot run an application ingestion and replication, database ingestion and replication, or streaming ingestion and replication task on a Hosted Agent or serverless runtime environment.
SQL Warehouse JDBC URL
Databricks SQL Warehouse JDBC connection URL.
Required to connect to a Databricks SQL warehouse. Doesn't apply to Databricks clusters.
Note: Databricks SQL Serverless is the recommended Databricks cluster type.
To get the SQL Warehouse JDBC URL, go to the Databricks console and select the JDBC driver version from the JDBC URL menu.
Application ingestion and replication and database ingestion and replication tasks can use JDBC URL version 2.6.25 or later or 2.6.22 or earlier. The URLs must begin with the prefix jdbc:databricks://, as follows:
jdbc:databricks://<Databricks Host>:443/default;transportMode=http;ssl=1;AuthMech=3;httpPath=/sql/1.0/endpoints/<SQL endpoint cluster ID>;
Ensure that you set the required environment variables in the Secure Agent. Also specify the correct JDBC Driver Class Name under advanced connection settings.
Note: Specify the database name in the Database Name connection property. If you specify the database name in the JDBC URL, it is not considered. The Databricks Host, Organization ID, and Cluster ID properties are not considered if you configure the SQL warehouse JDBC URL property.
Databricks Token
Personal access token to access Databricks.
Required for SQL warehouse and Databricks cluster.
Catalog Name
If you use Unity Catalog, the name of an existing catalog in the metastore.
Optional for SQL warehouse. Doesn't apply to Databricks cluster.
You can also specify the catalog name in the end of the SQL warehouse JDBC URL.
Note: The catalog name cannot contain special characters.
For more information about Unity Catalog, see the Databricks documentation.

Advanced settings

The following table describes the advanced connection properties:
Property
Description
Database
The name of the schema in the unity catalog in Databricks.
Optional for SQL warehouse and Databricks cluster.
If you do not specify a value, all databases available in the workspace are listed. The value you specify overrides the schema specified in the SQL Warehouse JDBC URL connection property.
JDBC Driver Class Name
The name of the JDBC driver class.
Optional for SQL warehouse and Databricks cluster.
For JDBC URL versions 2.6.22 or earlier, specify the driver class name as com.simba.spark.jdbc.Driver.
For JDBC URL versions 2.6.25 or later, specify the driver class name as com.databricks.client.jdbc.Driver.
Staging Environment
The cloud provider where the Databricks cluster is deployed.
Required for SQL warehouse and Databricks cluster.
Select one of the following options:
  • - AWS
  • - Azure
  • - Personal Staging Location
Default is Personal Staging Location.
You can select the Personal Staging Location as the staging environment instead of Azure or AWS staging environments to stage data locally for mappings and tasks.
If you select Personal Staging Location for a connection that Data Ingestion and Replication uses, the Parquet data files for application ingestion and replication jobs or database ingestion and replication jobs can be staged to a local personal storage location, which has a data retention period of 7 days. You must also specify a Database Host value. If you use Unity Catalog, note that a personal storage location is automatically provisioned.
Personal staging location doesn't apply to Databricks cluster.
You cannot use personal staging location with Databricks unmanaged tables.
Note: You cannot switch between clusters once you establish a connection.
Databricks Host
The host name of the endpoint the Databricks account belongs to.
Required for Databricks cluster. Doesn't apply to SQL warehouse.
You can get the Databicks Host from the JDBC URL. The URL is available in the Advanced Options of JDBC or ODBC in the Databricks all-purpose cluster.
The following example shows the Databicks Host in JDBC URL:
jdbc:spark://<Databricks Host>:443/ default;transportMode=http; ssl=1;httpPath=sql/ protocolv1/o/<Org Id>/<Cluster ID>; AuthMech=3; UID=token; PWD=<personal-access-token>
The value of PWD in Databricks Host, Organization Id, and Cluster ID is always <personal-access-token>.
Cluster ID
The ID of the cluster.
Required for Databricks cluster. Doesn't apply to SQL warehouse.
You can get the cluster ID from the JDBC URL. The URL is available in the Advanced Options of JDBC or ODBC in the Databricks all-purpose cluster
The following example shows the Cluster ID in JDBC URL:
jdbc:spark://<Databricks Host>:443/ default;transportMode=http; ssl=1;httpPath=sql/ protocolv1/o/<Org Id>/<Cluster ID>; AuthMech=3;UID=token; PWD=<personal-access-token>
Organization ID
The unique organization ID for the workspace in Databricks.
Required for Databricks cluster. Doesn't apply to SQL warehouse.
You can get the Organization ID from the JDBC URL. The URL is available in the Advanced Options of JDBC or ODBC in the Databricks all-purpose cluster
The following example shows the Organization ID in JDBC URL:
jdbc:spark://<Databricks Host>:443/ default;transportMode=http; ssl=1;httpPath=sql/ protocolv1/o/<Organization ID>/ <Cluster ID>;AuthMech=3;UID=token; PWD=<personal-access-token>
Min Workers
The minimum number of worker nodes to be used for the Spark job. Minimum value is 1.
Required for Databricks cluster. Doesn't apply to SQL warehouse.
Max Workers
The maximum number of worker nodes to be used for the Spark job. If you don't want to autoscale, set Max Workers = Min Workers or don't set Max Workers.
Optional for Databricks cluster. Doesn't apply to SQL warehouse.
DB Runtime Version
The version of Databricks cluster to spawn when you connect to Databricks cluster to process mappings.
Required for Databricks cluster. Doesn't apply to SQL warehouse.
Select the Databricks runtime version 9.1 LTS or 13.3 LTS.
Worker Node Type
The worker node instance type that is used to run the Spark job.
Required for Databricks cluster. Doesn't apply to SQL warehouse.
For example, the worker node type for AWS can be i3.2xlarge. The worker node type for Azure can be Standard_DS3_v2.
Driver Node Type
The driver node instance type that is used to collect data from the Spark workers.
Optional for Databricks cluster. Doesn't apply to SQL warehouse.
For example, the driver node type for AWS can be i3.2xlarge. The driver node type for Azure can be Standard_DS3_v2.
If you don't specify the driver node type, Databricks uses the value you specify in the worker node type field.
Instance Pool ID
The instance pool ID used for the Spark cluster.
Optional for Databricks cluster. Doesn't apply to SQL warehouse.
If you specify the Instance Pool ID to run mappings, the following connection properties are ignored:
  • - Driver Node Type
  • - EBS Volume Count
  • - EBS Volume Type
  • - EBS Volume Size
  • - Enable Elastic Disk
  • - Worker Node Type
  • - Zone ID
Elastic Disk
Enables the cluster to get additional disk space.
Optional for Databricks cluster. Doesn't apply to SQL warehouse.
Enable this option if the Spark workers are running low on disk space.
Spark Configuration
Doesn't apply to a data loader task or to Data Ingestion and Replication tasks.
Spark Environment Variables
Doesn't apply to a data loader task or to Data Ingestion and Replication tasks.

AWS staging environment

The following table describes the properties for the AWS staging environment:
Property
Description
S3 Access Key
The key to access the Amazon S3 bucket.
S3 Secret Key
The secret key to access the Amazon S3 bucket.
S3 Data Bucket
The existing bucket to store the Databricks data.
S3 Staging Bucket
The existing bucket to store the staging files.
S3 Authentication Mode
The authentication mode to access Amazon S3.
Select one of the following authentication modes:
  • - Permanent IAM credentials. Uses the S3 access key and S3 secret key to connect to Databricks.
  • - IAM Assume Role. Uses the AssumeRole for IAM authentication to connect to Databricks. Doesn't apply to Databricks cluster.
IAM Role ARN
The Amazon Resource Number (ARN) of the IAM role assumed by the user to use the dynamically generated temporary security credentials.
Set the value of this property if you want to use the temporary security credentials to access the Amazon S3 staging bucket.
For more information about how to get the ARN of the IAM role, see the AWS documentation.
Use EC2 Role to Assume Role
Optional. Select the check box to enable the EC2 role to assume another IAM role specified in the IAM Role ARN option.
The EC2 role must have a policy attached with a permission to assume an IAM role from the same or different AWS account.
S3 Region Name
The AWS cluster region in which the bucket you want to access resides.
Select a cluster region if you choose to provide a custom JDBC URL that does not contain a cluster region name in the JDBC URL connection property.
S3 Service Regional Endpoint
The S3 regional endpoint when the S3 data bucket and the S3 staging bucket need to be accessed through a region-specific S3 regional endpoint.
Doesn't apply to Databricks cluster.
Default is s3.amazonaws.com.
Zone ID
The zone ID for the Databricks job cluster.
Optional for Databricks cluster. Doesn't apply to SQL warehouse.
Applies only if you want to create a Databricks job cluster in a particular zone at runtime.
For example, us-west-2a.
Note: The zone must be in the same region where your Databricks account resides.
EBS Volume Type
The type of EBS volumes launched with the cluster.
Optional for Databricks cluster. Doesn't apply to SQL warehouse.
EBS Volume Count
The number of EBS volumes launched for each instance. You can choose up to 10 volumes.
Optional for Databricks cluster. Doesn't apply to SQL warehouse.
Note: In a Databricks connection, specify at least one EBS volume for node types with no instance store. Otherwise, cluster creation fails.
EBS Volume Size
The size of a single EBS volume in GiB launched for an instance.
Optional for Databricks cluster. Doesn't apply to SQL warehouse.

Azure staging environment

The following table describes the properties for the Azure staging environment:
Property
Description
ADLS Storage Account Name
The name of the Microsoft Azure Data Lake Storage account.
ADLS Client ID
The ID of your application to complete the OAuth Authentication in the Active Directory.
ADLS Client Secret
The client secret key to complete the OAuth Authentication in the Active Directory.
ADLS Tenant ID
The ID of the Microsoft Azure Data Lake Storage directory that you use to write data.
ADLS Endpoint
The OAuth 2.0 token endpoint from where authentication based on the client ID and client secret is completed.
ADLS Filesystem Name
The name of an existing file system to store the Databricks data.
ADLS Staging Filesystem Name
The name of an existing file system to store the staging data.

JDBC URL parameters

You can utilize the additional JDBC URL parameters field in the Databricks connection to customize and set any additional parameters required to connect to Databricks.
You can configure the following properties as additional JDBC URL parameters in the Databricks connection:

Rules and guidelines for personal staging location

When you select the personal staging location as a staging environment, the data is first staged in a java temporary location and then copied to a personal staging location of the unity catalog. Both the staged files will be deleted after the task runs successfully.
However, to stage the data in a different directory, configure the DTM property -Djava.io.tmpdir=/my/dir/path in the JVM options in the system configuration settings of the Administrator service.
To enable data staging in a different directory, you should have read and write permission and enough disk space to stage the data in the directory.
When you specify a personal staging location in the Databricks connection properties for staging, consider the following rules and guidelines: