Connectors and Connections > Data Ingestion and Replication connection properties > Databricks connection properties

Databricks connection properties

Create a Databricks connection to securely read data from or write data to Databricks.

Staging prerequisites

Before you create a connection, you must perform certain prerequisite tasks to configure the staging environment to connect to SQL warehouse, all-purpose cluster, or job cluster.

SQL warehouse

Configure AWS, Azure, or Databricks Volume as the staging environment for the SQL warehouse based on your deployment environment. Additionally, configure the Spark parameters for the SQL warehouse to use the Azure or AWS staging environment.

You can use a SQL warehouse on the Windows and Linux operating systems.

For more information on the types of SQL warehouses that you can connect to, see the Databricks SQL warehouses Knowledge Base article.

Configure AWS staging

Configure IAM AssumeRole authentication to use AWS staging for the SQL warehouse.

Create a minimal Amazon IAM policy

To stage the data in Amazon S3, use the following minimum required permissions: :

•PutObject
•GetObject
•DeleteObject
•ListBucket

You can use the following sample Amazon IAM policy:

{
"Version": "2012-10-17",
"Statement": [
{
"Effect": "Allow",
"Action": [
"s3:PutObject",
"s3:GetObject",
"s3:DeleteObject",
"s3:ListBucket",

],
"Resource": [
"arn:aws:s3:::<bucket_name>/*",
"arn:aws:s3:::<bucket_name>"
]
}
]
}

Note: The Test Connection does not validate the IAM policy assigned to users. You can specify the Amazon S3 bucket name in the source advanced properties.

Configure Spark parameters for AWS staging

On the Databricks SQL Admin console, navigate to SQL Warehouse Settings > Data Security, and then configure the Spark parameters for AWS under Data access configuration.

Add the following Spark configuration parameters and restart the SQL warehouse:

• spark.hadoop.fs.s3a.access.key <S3 Access Key value>
•spark.hadoop.fs.s3a.secret.key <S3 Secret Key value>
•spark.hadoop.fs.s3a.endpoint <S3 Staging Bucket endpoint value>

For example, the S3 staging bucket warehouse value is s3.ap-south-1.amazonaws.com.

Ensure that the configured access key and secret key have access to the S3 buckets where you store the data for Databricks tables.

Configure Azure staging

Before you use Microsoft Azure Data Lake Storage Gen2 to stage files, perform the following tasks:

• Create a storage account to use with Microsoft Azure Data Lake Storage Gen2 and enable Hierarchical namespace in the Azure portal.

You can use role-based access control to authorize the users to access the resources in the storage account. Assign the Contributor role or Reader role to the users. The contributor role grants you full access to manage all resources in the storage account, but does not allow you to assign roles. The reader role allows you to view all resources in the storage account, but does not allow you to make any changes.

Note: To add or remove role assignments, you must have write and delete permissions, such as an Owner role.

•Register an application in Azure Active Directory to authenticate users to access the Microsoft Azure Data Lake Storage Gen2 account.

You can use role-based access control to authorize the application. Assign the Storage Blob Data Contributor or Storage Blob Data Reader role to the application. The Storage Blob Data Contributor role lets you read, write, and delete Azure Storage containers and blobs in the storage account. The Storage Blob Data Reader role lets you only read and list Azure Storage containers and blobs in the storage account.

• Create an Azure Active Directory web application for service-to-service authentication with Microsoft Azure Data Lake Storage Gen2.

Note: Ensure that you have superuser privileges to access the folders or files created in the application using the connector.

•To read complex files, set the JVM options for type DTM to increase the -Xms and -Xmx values in the system configuration details of the Secure Agent to avoid java heap space error. The recommended -Xms and -Xmx values are 512 MB and 1024 MB respectively.

Configure Spark parameters for Azure staging

On the Databricks SQL Admin console, navigate to SQL Warehouse Settings > Data Security, and then configure the Spark parameters for Azure under Data access configuration.

Add the following Spark configuration parameters and restart the SQL warehouse:

•spark.hadoop.fs.azure.account.oauth2.client.id.<storage-account-name>.dfs.core.windows.net <value>
•spark.hadoop.fs.azure.account.auth.type.<storage-account-name>.dfs.core.windows.net OAuth
•spark.hadoop.fs.azure.account.oauth2.client.secret.<storage-account-name>.dfs.core.windows.net <Value>
•spark.hadoop.fs.azure.account.oauth.provider.type.<storage-account-name>.dfs.core.windows.net org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider
•spark.hadoop.fs.azure.account.oauth2.client.endpoint.<storage-account-name>.dfs.core.windows.net https://login.microsoftonline.com/<Tenant ID>/oauth2/token

Ensure that the configured client ID and client secret have access to the file systems where you store the data for Databricks tables.

Stage data in Databricks Volumes

You can stage data in a Volume in Databricks. Volumes are Unity Catalog objects used to manage and secure non-tabular datasets such as files and directories.

You can use a Volume on a Linux or Windows machine and with the Databricks JDBC driver.

To use a Volume, complete the following prerequisites:

•Ensure that your Databricks workspace is enabled for Unity Catalog.
•Get the path where you want to stage the data in a Volume. If you do not specify a volume path in the Databricks connection properties, the Secure Agent creates a managed volume in Databricks, which is automatically deleted once the job is completed. Ensure you have the create and delete permissions for the managed volume if you do not want to specify a path.

All-purpose cluster

Enable the Secure Agent properties for design-time processing on the all-purpose cluster.

You can use all-purpose cluster only on the Linux operating system.

Configure Secure Agent properties

To connect to all-purpose cluster, enable the Secure Agent properties for design time.

1In Administrator, select the Secure Agent listed on the Runtime Environments tab.
2Click Edit.
3In the System Configuration Details section, select Data Integration Server as the Service and Tomcat JRE as the Type.
4Edit the JRE_OPTS field and set the value to -DUseDatabricksSql=false.

This image displays the JRE_OPTS property for the Tomcat JRE type.

Job cluster

Configure the Spark parameters for job cluster to use Azure and AWS staging based on where the cluster is deployed.

You also need to enable the Secure Agent properties for runtime processing on the job cluster.

You can use job cluster only on the Linux operating system.

Spark configuration

Before you connect to the job cluster, you must configure the Spark parameters on AWS and Azure.

Configuration on AWS

Add the following Spark configuration parameters for the job cluster and restart the cluster:

•spark.hadoop.fs.s3a.access.key <value>
•spark.hadoop.fs.s3a.secret.key <value>
•spark.hadoop.fs.s3a.endpoint <value>

Ensure that the access and secret key configured has access to the buckets where you store the data for Databricks tables.

Configuration on Azure

Add the following Spark configuration parameters for the job cluster and restart the cluster:

•fs.azure.account.oauth2.client.id.<storage-account-name>.dfs.core.windows.net <value>
•fs.azure.account.auth.type.<storage-account-name>.dfs.core.windows.net <value>
•fs.azure.account.oauth2.client.secret.<storage-account-name>.dfs.core.windows.net <Value>
•fs.azure.account.oauth.provider.type.<storage-account-name>.dfs.core.windows.net org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider
•fs.azure.account.oauth2.client.endpoint.<storage-account-name>.dfs.core.windows.net https://login.microsoftonline.com/<Tenant ID>/oauth2/token

Ensure that the client ID and client secret configured has access to the file systems where you store the data for Databricks tables.

Configure Secure Agent properties

To connect to job cluster, enable the Secure Agent properties for runtime.

Note: This topic does not pertain to Data Ingestion and Replication.

1In Administrator, select the Secure Agent listed on the Runtime Environments tab.
2Click Edit.
3In the System Configuration Details section, select Data Integration Server as the Service and DTM as the Type.
4Edit the JVMOption field and set the value to -DUseDatabricksSql=false.

This image shows the JVMOption property for mappings.

Connect to Databricks

Let's configure the Databricks connection properties to connect to Databricks.

Before you begin

You can use a Databricks connection to read from and write to Databricks tables.

You can configure the following compute resources to connect to Databricks:

•SQL warehouse (Recommended)

The Secure Agent connects to the SQL warehouse at design time and runtime.

•All-purpose cluster and job cluster

The Secure Agent connects to the all-purpose cluster to import the metadata at design time and to the job cluster to run the mappings.

Note: If you're using an all-purpose or job cluster, Informatica recommends transitioning to the SQL warehouse. The all-purpose and job clusters will no longer receive new feature updates or enhancements, although they will still receive critical security updates to maintain their stability and safety. By switching to the SQL warehouse, you will benefit from the latest features and enhancements.

Before you get started, you'll need to configure the AWS or Azure staging environment to use Databricks connection.

To learn about prerequisites for the Azure or AWS environment, check out Staging prerequisites.

Connection details

The following table describes the basic connection properties:

Property	Description
Connection Name	Name of the connection. Each connection name must be unique within the organization. Connection names can contain alphanumeric characters, spaces, and the following special characters: _ . + -, Maximum length is 255 characters.
Description	Description of the connection. Maximum length is 4000 characters.
Type	Databricks
Runtime Environment	The name of the runtime environment where you want to run tasks. For application ingestion and replication initial load tasks and database ingestion and replication tasks that have a supported source type, you can use the serverless runtime environment. You cannot run an application ingestion and replication, a database ingestion and replication, or a streaming ingestion and replication task on a Hosted Agent or elastic runtime environment.
SQL Warehouse JDBC URL	Databricks SQL Warehouse JDBC connection URL. This property is required only for Databricks SQL warehouse. Doesn't apply to all-purpose cluster and job cluster. To get the SQL Warehouse JDBC URL, go to the Databricks console and select the JDBC driver version from the JDBC URL menu. Application ingestion and replication and database ingestion and replication tasks can use JDBC URL version 2.6.25 or later or 2.6.22 or earlier. The URLs must begin with the prefix jdbc:databricks://, as follows: jdbc:databricks://<Databricks Host>:443/default;transportMode=http;ssl=1;AuthMech=3;httpPath=/sql/1.0/endpoints/<SQL endpoint cluster ID>; Ensure that you set the required environment variables in the Secure Agent. Also specify the correct JDBC Driver Class Name under advanced connection settings. Note: Specify the database name in the Database Name connection property. If you specify the database name in the JDBC URL, it is not considered.

Authentication type

You can configure personal access token and OAuth machine-to-machine authentication types to access Databricks.

Select the required authentication method and then configure the authentication-specific parameters.

Personal access token authentication requires the personal access token and OAuth machine-to-machine authentication requires the client ID and client secret of your Databricks account.

For more information on how to get the personal access token, client ID, and client secret, see the Databricks documentation.

Personal access token authentication

Personal access token authentication requires the personal access token of your Databricks account.

The following table describes the connection properties for personal access token authentication:

Property	Description
Databricks Token	Personal access token to access Databricks. This property is required for SQL warehouse, all-purpose cluster, and job cluster.
Catalog Name	The name of an existing catalog in the metastore when you use Unity Catalog. This property is optional for SQL warehouse. Doesn't apply to all-purpose cluster and job cluster. The catalog name cannot contain special characters. For more information about Unity Catalog, see the Databricks documentation.

OAuth machine-to-machine authentication

OAuth machine-to-machine authentication requires the client ID and client secret of your Databricks account.

OAuth machine-to-machine authentication doesn't apply to all-purpose cluster, job cluster, and mappings in advanced mode. You can use OAuth machine-to-machine authentication only with JDBC driver versions 2.6.25 or later.

The following table describes the connection properties for OAuth machine-to-machine authentication:

Property	Description
Client ID	The client ID of the service principal.
Client Secret	The client secret associated with the Client ID of the service principal.
Catalog Name	The name of an existing catalog in the metastore when you use Unity Catalog. This property is optional for SQL warehouse. Doesn't apply to all-purpose cluster and job cluster. The catalog name cannot contain special characters. For more information about Unity Catalog, see the Databricks documentation.

Advanced settings

The following table describes the advanced connection properties:

Property	Description
Database	The name of the schema in Databricks. The name can contain only alphanumeric characters and hyphen (-). This property is optional for SQL warehouse, all-purpose cluster, and job cluster. If you do not specify a value, all databases available in the workspace are listed. The value you specify overrides the schema specified in the SQL Warehouse JDBC URL connection property.
JDBC Driver Class Name	The name of the JDBC driver class. This property is optional for SQL warehouse, all-purpose cluster, and job cluster. Default is com.databricks.client.jdbc.Driver
Staging Environment	The staging environment where your data is temporarily stored before processing. This property is required for SQL warehouse, all-purpose cluster, and job cluster. Select one of the following options as the staging environment: - AWS. If Databricks is hosted on the AWS platform. - Azure. If Databricks is hosted on the Azure platform. - Personal Staging Location. To stage data in a local personal storage location. If you select Personal Staging Location for a connection that Data Ingestion and Replication uses, the Parquet data files for application ingestion and replication jobs or database ingestion and replication jobs can be staged to a local personal storage location, with a data retention period of 7 days. You must also specify a Database Host value. If you use Unity Catalog, a personal storage location is automatically provisioned. However, you cannot use a personal staging location with Databricks unmanaged tables. - Volume. To stage data in a volume in Databricks. Volumes are Unity Catalog objects used to manage and secure non-tabular datasets such as files and directories. Note: When you configure a Databricks connection to connect to Databricks endpoints hosted on Google Cloud Platform, you must use a volume to stage data. If you use Data Ingestion and Replication, you can't use unmanaged tables for volume staging. Both volume and personal staging location do not apply to all-purpose clusters and job clusters. Default is Volume. Note: You cannot switch between clusters after you establish a connection. Important: Effective in the October 2024 release, personal staging location is deprecated. Deprecated functionality is supported, but Informatica intends to drop support in a future release. Informatica requests that you use a volume to stage the data.
Volume Path	The absolute path in Volume where you want to stage the data temporarily. Specify the path in the following format: /Volumes/<catalog_identifier>/<schema_identifier>/<volume_identifier>/ If you do not specify a volume path, the Secure Agent creates a managed volume in Databricks.
Databricks Host	The host name of the endpoint the Databricks account belongs to. This property is required only for all-purpose cluster and job cluster. Doesn't apply to SQL warehouse. You can get the Databricks Host from the JDBC URL. The URL is available in the Advanced Options of JDBC or ODBC in the Databricks all-purpose cluster. The following example shows the Databricks Host in JDBC URL: jdbc:databricks://<Databricks Host>:443/ default;transportMode=http; ssl=1;httpPath=sql/ protocolv1/o/<Org Id>/<Cluster ID>; AuthMech=3; UID=token; PWD=<personal-access-token> The value of PWD in Databricks Host, Organization Id, and Cluster ID is always <personal-access-token>.
Cluster ID	The ID of the cluster. This property is required only for all-purpose cluster and job cluster. Doesn't apply to SQL warehouse. You can get the cluster ID from the JDBC URL. The URL is available in the Advanced Options of JDBC or ODBC in the Databricks all-purpose cluster The following example shows the Cluster ID in JDBC URL: jdbc:databricks://<Databricks Host>:443/ default;transportMode=http; ssl=1;httpPath=sql/ protocolv1/o/<Org Id>/<Cluster ID>; AuthMech=3;UID=token; PWD=<personal-access-token>
Organization ID	The unique organization ID for the workspace in Databricks. This property is required only for all-purpose cluster and job cluster. Doesn't apply to SQL warehouse. You can get the Organization ID from the JDBC URL. The URL is available in the Advanced Options of JDBC or ODBC in the Databricks all-purpose cluster The following example shows the Organization ID in JDBC URL: jdbc:databricks://<Databricks Host>:443/ default;transportMode=http; ssl=1;httpPath=sql/ protocolv1/o/<Organization ID>/ <Cluster ID>;AuthMech=3;UID=token; PWD=<personal-access-token>
Min Workers	The minimum number of worker nodes to be used for the Spark job. Minimum value is 1. This property is required only for job cluster. Doesn't apply to SQL warehouse and all-purpose cluster.
Max Workers	The maximum number of worker nodes to be used for the Spark job. If you don't want to autoscale, set Max Workers = Min Workers or don't set Max Workers. This property is optional for job cluster. Doesn't apply to SQL warehouse and all-purpose cluster.
DB Runtime Version	The version of job cluster to spawn when you connect to job cluster to process mappings. This property is required only for job cluster. Doesn't apply to SQL warehouse and all-purpose cluster. Select the Databricks runtime version 9.1 LTS, 13.3 LTS, or 15.4 LTS. To use version 15.4 LTS, ensure that you set the spark.databricks.driver.dbfsLibraryInstallationAllowed parameter to true in the Spark Configuration connection property.
Worker Node Type	The worker node instance type that is used to run the Spark job. This property is required only for all-purpose cluster and job cluster. Doesn't apply to SQL warehouse. For example, the worker node type for AWS can be i3.2xlarge. The worker node type for Azure can be Standard_DS3_v2.
Driver Node Type	The driver node instance type that is used to collect data from the Spark workers. This property is optional for job cluster. Doesn't apply to SQL warehouse and all-purpose cluster. For example, the driver node type for AWS can be i3.2xlarge. The driver node type for Azure can be Standard_DS3_v2. If you don't specify the driver node type, Databricks uses the value you specify in the worker node type field.
Instance Pool ID	The instance pool ID used for the Spark cluster. This property is optional for job cluster. Doesn't apply to SQL warehouse and all-purpose cluster. If you specify the Instance Pool ID to run mappings, the following connection properties are ignored: - Driver Node Type - EBS Volume Count - EBS Volume Type - EBS Volume Size - Enable Elastic Disk - Worker Node Type - Zone ID
Elastic Disk	Enables the cluster to get additional disk space. This property is optional for job cluster. Doesn't apply to SQL warehouse and all-purpose cluster. Enable this option if the Spark workers are running low on disk space.
Spark Configuration	Doesn't apply to Data Ingestion and Replication tasks.
Spark Environment Variables	Doesn't apply to Data Ingestion and Replication tasks.

AWS staging environment

The following table describes the properties for the AWS staging environment:

Property	Description
S3 Authentication Mode	The authentication mode to connect to Amazon S3. Select one of the following authentication modes: - Permanent IAM credentials. Uses the S3 access key and S3 secret key to connect to Amazon S3. - IAM Assume Role. Uses the AssumeRole for IAM authentication to connect to Amazon S3. This authentication mode applies only to SQL warehouse.
S3 Access Key	The key to access the Amazon S3 bucket.
S3 Secret Key	The secret key to access the Amazon S3 bucket.
S3 Data Bucket	The existing S3 bucket to store the Databricks data.
S3 Staging Bucket	The existing bucket to store the staging files.
S3 VPC Endpoint Type	The type of Amazon Virtual Private Cloud endpoint for Amazon S3. You can use a VPC endpoint to enable private communication with Amazon S3. Select one of the following options: - None. Select if you do not want to use a VPC endpoint. - Interface Endpoint. Select to establish private communication with Amazon S3 through an interface endpoint which uses a private IP address from the IP address range of your subnet. It serves as an entry point for traffic destined to an AWS service.
Endpoint DNS Name for S3	The DNS name for the Amazon S3 interface endpoint. Replace the asterisk symbol with the bucket keyword in the DNS name. Enter the DNS name in the following format: bucket.<DNS name of the interface endpoint> For example, bucket.vpce-s3.us-west-2.vpce.amazonaws.com
IAM Role ARN	The Amazon Resource Number (ARN) of the IAM role assumed by the user to use the dynamically generated temporary security credentials. Set the value of this property if you want to use the temporary security credentials to access the Amazon S3 staging bucket. For more information about how to get the ARN of the IAM role, see the AWS documentation.
Use EC2 Role to Assume Role	Optional. Select the check box to enable the EC2 role to assume another IAM role specified in the IAM Role ARN option. The EC2 role must have a policy attached with a permission to assume an IAM role from the same or different AWS account.
STS VPC Endpoint Type	The type of Amazon Virtual Private Cloud endpoint for AWS Security Token Service. You can use a VPC endpoint to enable private communication with Amazon Security Token Service. Select one of the following options: - None. Select if you do not want to use a VPC endpoint. - Interface Endpoint. Select to establish private communication with Amazon Security Token Service through an interface endpoint which uses a private IP address from the IP address range of your subnet.
Endpoint DNS Name for AWS STS	The DNS name for the AWS STS interface endpoint. For example, vpce-01f22cc14558c241f-s8039x4c.sts.us-west-2.vpce.amazonaws.com
S3 Service Regional Endpoint	The S3 regional endpoint when the S3 data bucket and the S3 staging bucket need to be accessed through a region-specific S3 regional endpoint. This property is optional for SQL warehouse. Doesn't apply to all-purpose cluster and job cluster. Default is s3.amazonaws.com.
S3 Region Name	The AWS cluster region in which the bucket you want to access resides. Select a cluster region if you choose to provide a custom JDBC URL that does not contain a cluster region name in the JDBC URL connection property.
Zone ID	The zone ID for the Databricks job cluster. This property is optional for job cluster. Doesn't apply to SQL warehouse and all-purpose cluster. Specify the Zone ID only if you want to create a Databricks job cluster in a particular zone at runtime. For example, us-west-2a. Note: The zone must be in the same region where your Databricks account resides.
EBS Volume Type	The type of EBS volumes launched with the cluster. This property is optional for job cluster. Doesn't apply to SQL warehouse and all-purpose cluster.
EBS Volume Count	The number of EBS volumes launched for each instance. You can choose up to 10 volumes. This property is optional for job cluster. Doesn't apply to SQL warehouse and all-purpose cluster. Note: In a Databricks connection, specify at least one EBS volume for node types with no instance store. Otherwise, cluster creation fails.
EBS Volume Size	The size of a single EBS volume in GiB launched for an instance. This property is optional for job cluster. Doesn't apply to SQL warehouse and all-purpose cluster.

Azure staging environment

The following table describes the properties for the Azure staging environment:

Property	Description
ADLS Storage Account Name	The name of the Microsoft Azure Data Lake Storage account.
ADLS Client ID	The ID of your application to complete the OAuth Authentication in the Active Directory.
ADLS Client Secret	The client secret key to complete the OAuth Authentication in the Active Directory.
ADLS Tenant ID	The ID of the Microsoft Azure Data Lake Storage directory that you use to write data.
ADLS Endpoint	The OAuth 2.0 token endpoint from where authentication based on the client ID and client secret is completed.
ADLS Filesystem Name	The name of an existing file system to store the Databricks data.
ADLS Staging Filesystem Name	The name of an existing file system to store the staging data.

JDBC URL parameters

You can utilize the additional JDBC URL parameters field in the Databricks connection to customize and set any additional parameters required to connect to Databricks.

You can configure the following properties as additional JDBC URL parameters in the Databricks connection:

• To connect to Databricks using the proxy server, enter the following parameters:

jdbc:databricks://<Databricks Host>:443/default;transportMode=http;ssl=1;AuthMech=3;httpPath=/sql/1.0/warehouses/219fe3013963cdce;UseProxy=<Proxy=true>;ProxyHost=<proxy host IPaddress>;ProxyPort=<proxy server port number>;ProxyAuth=<Auth_true>;

Note: Data Ingestion and Replication does not support use of a proxy server to connect to Databricks.

•To connect to SSL-enabled Databricks, specify the value in the JDBC URL in the following format:

jdbc:databricks://<Databricks Host>:443/default;transportMode=http;ssl=1;AuthMech=3;httpPath=/sql/1.0/endpoints/<SQL endpoint cluster ID>;

Rules and guidelines for personal staging location

When you select the personal staging location as a staging environment, the data is first staged in a Java temporary location and then copied to a personal staging location of the unity catalog. Both the staged files will be deleted after the task runs successfully.

However, to stage the data in a different directory, configure the DTM property -Djava.io.tmpdir=/my/dir/path in the JVM options in the system configuration settings of the Administrator service.

To enable data staging in a different directory, you should have read and write permissions and enough disk space to stage the data in the directory.

When you specify a personal staging location in the Databricks connection properties for staging, consider the following rules and guidelines:

•You can only specify unity enabled catalog in the SQL warehouse JDBC URL.
•You cannot use personal staging location as the staging environment with OAuth machine-to-machine authentication.
•Mappings run without SQL ELT optimization.
•The data is staged in the folder stage://tmp/<user_name> where the <user_name> is picked from the Databricks token provided in the connection and requires read and write access to the personal staging location in the root location of AWS and Azure.

Important: Effective in the October 2024 release, personal staging location entered deprecation. While you can use the functionality in the current release, Informatica intends to drop support for the functionality in a future release. Informatica recommends that you use a Volume to stage the data.