Data Domain Discovery on the Databricks Cluster
Use the Databricks cluster to perform data discovery on the Spark engine. The Databricks cluster is an environment that runs Spark jobs. You can run a profile to perform data discovery on Azure sources using the Databricks cluster.
Perform the following steps to connect to the Azure sources in the Databricks cluster:
Prerequisite
Add the following advanced Spark configuration parameters for the Databricks cluster and restart the cluster:
- •fs.azure.account.auth.type OAuth
- •fs.azure.account.oauth.provider.type org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider
- •fs.azure.account.oauth2.client.id <your-service-client-id>
- •fs.azure.account.oauth2.client.secret <your-service-client-secret-key>
- •fs.azure.account.oauth2.client.endpoint https://login.microsoftonline.com/<directory-ID-of-Azure-AD>/oauth2/token
- •spark.hadoop.fs.azure.account.key.<<ACCOUNT_NAME>>.dfs.core.windows.net <<VALUE>>
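In the Spark config field of the Databricks cluster settings, the parameters above are entered as space-separated key-value pairs, one per line. The client ID, secret, directory ID, and account name below are placeholders for your own values:

```
fs.azure.account.auth.type OAuth
fs.azure.account.oauth.provider.type org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider
fs.azure.account.oauth2.client.id <your-service-client-id>
fs.azure.account.oauth2.client.secret <your-service-client-secret-key>
fs.azure.account.oauth2.client.endpoint https://login.microsoftonline.com/<directory-ID-of-Azure-AD>/oauth2/token
spark.hadoop.fs.azure.account.key.<<ACCOUNT_NAME>>.dfs.core.windows.net <<VALUE>>
```

Restart the cluster after you save the configuration so that the parameters take effect.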
Download and Copy the JAR files for the Profiling Warehouse
- 1. Get the Oracle DataDirect JDBC driver JAR files for the profiling warehouse. You can copy the files from the following location: <INFA_HOME>/services/shared/jars/thirdparty/com.informatica.datadirect-dworacle-6.0.0_F.jar.
- 2. Place the Oracle DataDirect JDBC driver JAR files in the following locations:
- - <INFA_HOME>/connectors/thirdparty/informatica.jdbc_v2/spark
- - <INFA_HOME>/connectors/thirdparty/informatica.jdbc_v2/common
- - <INFA_HOME>/services/shared/hadoop/<DataBricksversion>/runtimeLib
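The two steps above can be sketched as a shell snippet. The `INFA_HOME` default and the Databricks version directory name are assumptions; substitute the values for your installation. For demonstration the snippet creates a scratch directory and a placeholder JAR, because in a real installation the driver JAR already ships under `<INFA_HOME>/services/shared/jars/thirdparty`:

```shell
# Copy the Oracle DataDirect JDBC driver JAR into the locations required
# for the profiling warehouse.
INFA_HOME="${INFA_HOME:-$(mktemp -d)}"   # demo: scratch dir; use your real <INFA_HOME>
DBX_VERSION="databricks_x.y"             # hypothetical version directory name

JAR_NAME="com.informatica.datadirect-dworacle-6.0.0_F.jar"
SRC_DIR="$INFA_HOME/services/shared/jars/thirdparty"
mkdir -p "$SRC_DIR"
touch "$SRC_DIR/$JAR_NAME"               # demo only: the real JAR ships with the install

# Place the driver JAR in each required location.
for dest in \
  "$INFA_HOME/connectors/thirdparty/informatica.jdbc_v2/spark" \
  "$INFA_HOME/connectors/thirdparty/informatica.jdbc_v2/common" \
  "$INFA_HOME/services/shared/hadoop/$DBX_VERSION/runtimeLib"
do
  mkdir -p "$dest"
  cp "$SRC_DIR/$JAR_NAME" "$dest/"
done
```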
Download and Copy the JAR files for the JDBC Delta Objects
- 1. Get the JDBC JAR files for the JDBC delta objects. You can download the files from the database vendor website.
- 2. Update the genericJDBC.zip with the JDBC delta JAR files in the following location: <INFA_HOME>/services/CatalogService/ScannerBinaries.
- 3. Recycle the Catalog Service.
Configure Custom Properties on the Data Integration Service
- 1. Launch Informatica Administrator, and then select the Data Integration Service in the Domain Navigator.
- 2. Click the Custom Properties option on the Properties tab.
- 3. Set the following custom property to perform automatic installation of the Informatica libraries into the Databricks cluster:
ExecutionContextOptions.databricks.enable.infa.libs.autoinstall: true
- 4. Recycle the Data Integration Service.
Supported Sources for Data Domain Discovery on the Databricks Cluster
- •JDBC Delta
- •Azure Data Lake Store Gen2