Connections > Databricks connection > Connect to Databricks

Connect to Databricks

Let's configure the Databricks connection properties to connect to Databricks.

Before you begin

You can use a Databricks connection to read from and write to Databricks tables.

You can configure the following compute resources to connect to Databricks:

•SQL warehouse (Recommended)

The Secure Agent connects to the SQL warehouse at design time and runtime.

•All-purpose cluster and job cluster

The Secure Agent connects to the all-purpose cluster to import the metadata at design time and to the job cluster to run the mappings.

Note: If you're using an all-purpose or job cluster, Informatica recommends transitioning to the SQL warehouse. The all-purpose and job clusters will no longer receive new feature updates or enhancements, although they will still receive critical security updates to maintain their stability and safety. By switching to the SQL warehouse, you will benefit from the latest features and enhancements.

Before you get started, you'll need to configure the AWS or Azure staging environment to use Databricks connection.

To learn about the prerequisites for the Azure or AWS environment, check out Staging prerequisites.

Connection details

The following table describes the connection properties:

Property	Description
Connection Name	Name of the connection. Each connection name must be unique within the organization. Connection names can contain alphanumeric characters, spaces, and the following special characters: _ . + -, Maximum length is 255 characters.
Runtime Environment	Informatica Cloud Hosted Agent, the runtime environment where you want to run tasks.
SQL Warehouse JDBC URL	Databricks SQL Warehouse JDBC connection URL. This property is required only for Databricks SQL warehouse. Doesn't apply to all-purpose cluster and job cluster. This field is required to connect to the Databricks SQL warehouse.

Authentication type

You can configure personal access token and OAuth machine-to-machine authentication types to access Databricks.

Select the required authentication method and then configure the authentication-specific parameters.

Personal access token authentication requires the personal access token and OAuth machine-to-machine authentication requires the client ID and client secret of your Databricks account.

For more information on how to get the personal access token, client ID, and client secret, see the Databricks documentation.

Personal access token authentication

Personal access token authentication requires the personal access token of your Databricks account.

The following table describes the connection properties for personal access token authentication:

Property	Description
Databricks Token	Personal access token to access Databricks. This property is required for SQL warehouse, all-purpose cluster, and job cluster.
Catalog Name	The name of an existing catalog in the metastore when you use Unity Catalog. This property is optional for SQL warehouse. Doesn't apply to all-purpose cluster and job cluster. The catalog name cannot contain special characters. For more information about Unity Catalog, see the Databricks documentation.

OAuth machine-to-machine authentication

OAuth machine-to-machine authentication requires the client ID and client secret of your Databricks account.

OAuth machine-to-machine authentication doesn't apply to all-purpose cluster and job cluster. You can use OAuth machine-to-machine authentication only with JDBC driver versions 2.6.25 or later.

The following table describes the connection properties for OAuth machine-to-machine authentication:

Property	Description
Client ID	The client ID of the service principal.
Client Secret	The client secret associated with the Client ID of the service principal.
Catalog Name	The name of an existing catalog in the metastore when you use Unity Catalog. This property is optional for SQL warehouse. Doesn't apply to all-purpose cluster and job cluster. The catalog name cannot contain special characters. For more information about Unity Catalog, see the Databricks documentation.

Advanced settings

The following table describes the advanced connection properties:

Property	Description
Database	The name of the schema in Databricks. The name can contain only alphanumeric characters and hyphen (-). This property is optional for SQL warehouse, all-purpose cluster, and job cluster. By default, all databases available in the workspace are listed.
JDBC Driver Class Name	The name of the JDBC driver class. This property is optional for SQL warehouse, all-purpose cluster, and job cluster. Specify the driver class name as com.simba.spark.jdbc.Driver for the data loader task.
Staging Environment	The staging environment where your data is temporarily stored before processing. This property is required for SQL warehouse, all-purpose cluster, and job cluster. Select one of the following options as the staging environment: - AWS. If Databricks is hosted on the AWS platform. - Azure. If Databricks is hosted on the Azure platform. - Personal Staging Location. To stage data in a local personal storage location. - Volume. To stage data in a volume in Databricks. Volumes are Unity Catalog objects used to manage and secure non-tabular datasets such as files and directories. Note: When you configure a Databricks connection to connect to Databricks endpoints hosted on Google Cloud Platform, you must use a volume to stage data. Both volume and personal staging location do not apply to all-purpose clusters and job clusters. Default is Volume. Note: You cannot switch between clusters after you establish a connection. Important: Effective in the October 2024 release, personal staging location is deprecated. Deprecated functionality is supported, but Informatica intends to drop support in a future release. Informatica requests that you use a volume to stage the data.
Volume Path	The absolute path in Volume where you want to stage the data temporarily. Specify the path in the following format: /Volumes/<catalog_identifier>/<schema_identifier>/<volume_identifier>/ If you do not specify a volume path, the Secure Agent creates a managed volume in Databricks.
Databricks Host	Doesn't apply to a data loader task.
Cluster ID	Doesn't apply to a data loader task.
Organization ID	Doesn't apply to a data loader task.
Min Workers	Doesn't apply to a data loader task.
Max Workers	Doesn't apply to a data loader task.
DB Runtime Version	Doesn't apply to a data loader task.
Worker Node Type	Doesn't apply to a data loader task.
Driver Node Type	Doesn't apply to a data loader task.
Instance Pool ID	Doesn't apply to a data loader task.
Elastic Disk	Doesn't apply to a data loader task.
Spark Configuration	Doesn't apply to a data loader task.
Spark Environment Variables	Doesn't apply to a data loader task.

AWS staging environment

The following table describes the properties for the AWS staging environment:

Property	Description
S3 Authentication Mode	The authentication mode to connect to Amazon S3. Select one of the following authentication modes: - Permanent IAM credentials. Uses the S3 access key and S3 secret key to connect to Amazon S3. - IAM Assume Role. Uses the AssumeRole for IAM authentication to connect to Amazon S3. This authentication mode applies only to SQL warehouse.
S3 Access Key	The key to access the Amazon S3 bucket.
S3 Secret Key	The secret key to access the Amazon S3 bucket.
S3 Data Bucket	The existing S3 bucket to store the Databricks data.
S3 Staging Bucket	The existing bucket to store the staging files.
S3 VPC Endpoint Type	The type of Amazon Virtual Private Cloud endpoint for Amazon S3. You can use a VPC endpoint to enable private communication with Amazon S3. Select one of the following options: - None. Select if you do not want to use a VPC endpoint. - Interface Endpoint. Select to establish private communication with Amazon S3 through an interface endpoint which uses a private IP address from the IP address range of your subnet. It serves as an entry point for traffic destined to an AWS service.
Endpoint DNS Name for S3	The DNS name for the Amazon S3 interface endpoint. Replace the asterisk symbol with the bucket keyword in the DNS name. Enter the DNS name in the following format: bucket.<DNS name of the interface endpoint> For example, bucket.vpce-s3.us-west-2.vpce.amazonaws.com
IAM Role ARN	The Amazon Resource Number (ARN) of the IAM role assumed by the user to use the dynamically generated temporary security credentials. Set the value of this property if you want to use the temporary security credentials to access the Amazon S3 staging bucket. For more information about how to get the ARN of the IAM role, see the AWS documentation.
Use EC2 Role to Assume Role	Optional. Select the check box to enable the EC2 role to assume another IAM role specified in the IAM Role ARN option. The EC2 role must have a policy attached with a permission to assume an IAM role from the same or different AWS account.
STS VPC Endpoint Type	The type of Amazon Virtual Private Cloud endpoint for AWS Security Token Service. You can use a VPC endpoint to enable private communication with Amazon Security Token Service. Select one of the following options: - None. Select if you do not want to use a VPC endpoint. - Interface Endpoint. Select to establish private communication with Amazon Security Token Service through an interface endpoint which uses a private IP address from the IP address range of your subnet.
Endpoint DNS Name for AWS STS	The DNS name for the AWS STS interface endpoint. For example, vpce-01f22cc14558c241f-s8039x4c.sts.us-west-2.vpce.amazonaws.com
S3 Service Regional Endpoint	The S3 regional endpoint when the S3 data bucket and the S3 staging bucket need to be accessed through a region-specific S3 regional endpoint. This property is optional for SQL warehouse. Doesn't apply to all-purpose cluster and job cluster. Default is s3.amazonaws.com.
S3 Region Name	The AWS cluster region in which the bucket you want to access resides. Select a cluster region if you choose to provide a custom JDBC URL that does not contain a cluster region name in the JDBC URL connection property.
Zone ID	The zone ID for the Databricks job cluster. This property is optional for job cluster. Doesn't apply to SQL warehouse and all-purpose cluster. Specify the Zone ID only if you want to create a Databricks job cluster in a particular zone at runtime. For example, us-west-2a. Note: The zone must be in the same region where your Databricks account resides.
EBS Volume Type	The type of EBS volumes launched with the cluster. This property is optional for job cluster. Doesn't apply to SQL warehouse and all-purpose cluster.
EBS Volume Count	The number of EBS volumes launched for each instance. You can choose up to 10 volumes. This property is optional for job cluster. Doesn't apply to SQL warehouse and all-purpose cluster. Note: In a Databricks connection, specify at least one EBS volume for node types with no instance store. Otherwise, cluster creation fails.
EBS Volume Size	The size of a single EBS volume in GiB launched for an instance. This property is optional for job cluster. Doesn't apply to SQL warehouse and all-purpose cluster.

Azure staging environment

The following table describes the properties for the Azure staging environment:

Property	Description
ADLS Storage Account Name	The name of the Microsoft Azure Data Lake Storage account.
ADLS Client ID	The ID of your application to complete the OAuth Authentication in the Active Directory.
ADLS Client Secret	The client secret key to complete the OAuth Authentication in the Active Directory.
ADLS Tenant ID	The ID of the Microsoft Azure Data Lake Storage directory that you use to write data.
ADLS Endpoint	The OAuth 2.0 token endpoint from where authentication based on the client ID and client secret is completed.
ADLS Filesystem Name	The name of an existing file system to store the Databricks data.
ADLS Staging Filesystem Name	The name of an existing file system to store the staging data.