Data Domain Discovery on the Spark Engine

When you run a profile to perform data domain discovery on the Spark engine, reference tables are staged on the Hadoop cluster. To make sure that the reference tables for all data domains are staged on the cluster, perform the following steps:

Prerequisite:

You must have permission to impersonate the HDFS user when you perform data domain discovery.
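
A minimal sketch of how you might confirm this on the cluster, assuming the Data Integration Service runs as a Hadoop user named infa (a hypothetical name; substitute your actual service user):

    # Check the Hadoop proxy-user (impersonation) settings for the
    # assumed service user "infa". Empty output means impersonation
    # is not configured for that user in core-site.xml.
    hdfs getconf -confKey hadoop.proxyuser.infa.hosts
    hdfs getconf -confKey hadoop.proxyuser.infa.groups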

Download the JDBC .jar Files

  1. Obtain the JDBC .jar files for the reference database that you use. You can download the files from the database vendor website.
  2. Copy the downloaded files to the following location: <INFA_HOME>/externaljdbcjars
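
For example, if the reference database is Oracle, the copy step might look like the following sketch. The driver file name and download path are assumptions, and INFA_HOME is assumed to point to the Informatica installation directory:

    # Copy the downloaded JDBC driver into the directory that the
    # Data Integration Service scans for external JDBC .jar files.
    # ojdbc8.jar is an example name for the Oracle JDBC thin driver.
    cp /tmp/downloads/ojdbc8.jar "$INFA_HOME/externaljdbcjars/"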

Configure Custom Properties on the Data Integration Service

  1. Launch Informatica Administrator, and then select the Data Integration Service in the Domain Navigator.
  2. Click the Custom Properties option on the Properties tab.
  3. Set the following custom properties to stage reference tables for the data domains. For sample values, see the first sketch after this procedure.
     Property Name: AdvancedProfilingServiceOptions.ProfilingSparkReferenceDataHDFSDir
     Property Value: /tmp/cms
     Property Name: ExecutionContextOptions.SparkRefTableHadoopConnectorArgs
     Property Value: --connect <JDBC thin driver connection URL>
  4. Make sure that the /tmp/cms directory exists on the cluster. The reference data is staged in the /tmp/cms directory by default. If the directory is not present, create the /tmp/cms directory or a custom directory where you want to stage the data. For sample commands, see the second sketch after this procedure.
  5. Recycle the Data Integration Service.
  6. Open the Catalog Administrator and make sure that you run the first profile with all the data domains selected so that the reference data for every domain is staged.
Note: If you do not select all the data domains in the first profile run and then select additional data domains in a later profile run, the profile run may fail.
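
For illustration, the first sketch below shows the two custom properties with concrete values for an Oracle reference database. The host, port, and service name in the thin driver connection URL are hypothetical; use the URL format that your database vendor documents:

    AdvancedProfilingServiceOptions.ProfilingSparkReferenceDataHDFSDir=/tmp/cms
    ExecutionContextOptions.SparkRefTableHadoopConnectorArgs=--connect jdbc:oracle:thin:@//dbhost.example.com:1521/ORCLPDB1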
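
The staging directory can be created from any node that has an HDFS client. The second sketch below uses the default /tmp/cms path from the procedure; the permissions shown are an assumption, so tighten them to match your security policy:

    # Create the default staging directory on HDFS if it is missing.
    hdfs dfs -mkdir -p /tmp/cms
    # Make the directory writable for the profiling jobs (assumed
    # permissions; adjust for your environment).
    hdfs dfs -chmod 777 /tmp/cms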