Data Domain Discovery on the Databricks Cluster
Use the Databricks cluster to perform data discovery on the Spark engine. The Databricks cluster is an environment that runs Spark jobs. You can run a profile to perform data discovery on Azure sources using the Databricks cluster.
Perform the following steps to connect to the Azure sources in the Databricks cluster:
Prerequisite
Add the following advanced Spark configuration parameters for the Databricks cluster and restart the cluster:
- •fs.azure.account.auth.type OAuth
- •fs.azure.account.oauth.provider.type org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider
- •fs.azure.account.oauth2.client.id <your-service-client-id>
- •fs.azure.account.oauth2.client.secret <your-service-client-secret-key>
- •fs.azure.account.oauth2.client.endpoint https://login.microsoftonline.com/<directory-ID-of-Azure-AD>/oauth2/token
- •spark.hadoop.fs.azure.account.key.<<ACCOUNT_NAME>>.dfs.core.windows.net <<VALUE>>
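In the Spark config field of the Databricks cluster settings, the parameters above are entered as space-separated key-value pairs, one per line. The client ID, secret, directory ID, and account name below are placeholders for your own values:

```
fs.azure.account.auth.type OAuth
fs.azure.account.oauth.provider.type org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider
fs.azure.account.oauth2.client.id <your-service-client-id>
fs.azure.account.oauth2.client.secret <your-service-client-secret-key>
fs.azure.account.oauth2.client.endpoint https://login.microsoftonline.com/<directory-ID-of-Azure-AD>/oauth2/token
spark.hadoop.fs.azure.account.key.<<ACCOUNT_NAME>>.dfs.core.windows.net <<VALUE>>
```

Restart the cluster after you save the configuration so that the parameters take effect.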
Download and Copy the JAR files for the Profiling Warehouse
- 1. Get the Oracle DataDirect JDBC driver JAR files for the profiling warehouse. You can copy the files from the following location: <INFA_HOME>/services/shared/jars/thirdparty/com.informatica.datadirect-dworacle-6.0.0_F.jar.
- 2. Place the Oracle DataDirect JDBC driver JAR files in the following locations:
- - <INFA_HOME>/connectors/thirdparty/informatica.jdbc_v2/spark
- - <INFA_HOME>/connectors/thirdparty/informatica.jdbc_v2/common
- - <INFA_HOME>/services/shared/hadoop/<DataBricksversion>/runtimeLib
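The two steps above can be sketched as a shell snippet. The `INFA_HOME` default and the Databricks version directory name are assumptions; substitute the values for your installation. For demonstration the snippet creates a scratch directory and a placeholder JAR, because in a real installation the driver JAR already ships under `<INFA_HOME>/services/shared/jars/thirdparty`:

```shell
# Copy the Oracle DataDirect JDBC driver JAR into the locations required
# for the profiling warehouse.
INFA_HOME="${INFA_HOME:-$(mktemp -d)}"   # demo: scratch dir; use your real <INFA_HOME>
DBX_VERSION="databricks_x.y"             # hypothetical version directory name

JAR_NAME="com.informatica.datadirect-dworacle-6.0.0_F.jar"
SRC_DIR="$INFA_HOME/services/shared/jars/thirdparty"
mkdir -p "$SRC_DIR"
touch "$SRC_DIR/$JAR_NAME"               # demo only: the real JAR ships with the install

# Place the driver JAR in each required location.
for dest in \
  "$INFA_HOME/connectors/thirdparty/informatica.jdbc_v2/spark" \
  "$INFA_HOME/connectors/thirdparty/informatica.jdbc_v2/common" \
  "$INFA_HOME/services/shared/hadoop/$DBX_VERSION/runtimeLib"
do
  mkdir -p "$dest"
  cp "$SRC_DIR/$JAR_NAME" "$dest/"
done
```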
Download and Copy the JAR files for the JDBC Delta Objects
- 1. Get the JDBC JAR files for the JDBC delta objects. You can download the files from the database vendor website.
- 2. Update the genericJDBC.zip with the JDBC delta JAR files in the following location: <INFA_HOME>/services/CatalogService/ScannerBinaries.
- 3. Recycle the Catalog Service.
Configure Custom Properties on the Data Integration Service
- 1. Launch Informatica Administrator, and then select the Data Integration Service in the Domain Navigator.
- 2. Click the Custom Properties option on the Properties tab.
- 3. Set the following custom property to perform automatic installation of the Informatica libraries into the Databricks cluster:
ExecutionContextOptions.databricks.enable.infa.libs.autoinstall: true
- 4. Recycle the Data Integration Service.
Supported Sources for Data Domain Discovery on the Databricks Cluster
- •JDBC Delta
- •Azure Data Lake Store Gen2