Reference Data Requirements
If you have a Data Quality product license, you can push a mapping that contains data quality transformations to a Hadoop cluster. Data quality transformations can use reference data to verify that data values are accurate and correctly formatted.
When you apply a pushdown operation to a mapping that contains data quality transformations, the operation can copy the reference data that the mapping uses. The pushdown operation copies reference table data, content set data, and identity population data to the Hadoop cluster. After the mapping runs, the cluster deletes the reference data that the pushdown operation copied with the mapping.
Note: The pushdown operation does not copy address validation reference data. If you push a mapping that performs address validation, you must install the address validation reference data files on each DataNode that runs the mapping. The cluster does not delete the address validation reference data files after the address validation mapping runs.
Address validation mappings validate and enhance the accuracy of postal address records. You can buy address reference data files from Informatica on a subscription basis. You can download the current address reference data files from Informatica at any time during the subscription period.
Reference Data for Address Validation
When you run an address validation mapping in a Hadoop environment, the address reference data files must reside on each DataNode on which the mapping runs. The Informatica Big Data Management installation includes a shell script that you can use to install the files on the DataNodes.
Use the shell script to install the address reference data files on the DataNodes in a single operation. The script reads a file that contains the names or IP addresses of the nodes. The script copies the address reference data files to each node that the file identifies.
The script name is copyRefDataToComputeNodes.sh.
Find the script in the following directory in the Informatica Big Data Management installation:
<Informatica installation directory>/tools/dq/av
The following table describes the options that the script uses:
Option | Description |
---|---|
-n | The file that contains the list of names or IP addresses of the DataNodes in the Hadoop cluster. Enter each node name or IP address on a separate line in the file. By default, the script reads the file from the $BASEDIR/HadoopDataNodes directory, where $BASEDIR is the location of the shell script. |
-p | Determines whether the script prompts you to confirm that you want to install the address reference data files. By default, the script prompts you to confirm that you want to copy the files from the source directory to the target directories on the DataNodes. If you run the shell script on a schedule, you can disable the prompt. The default option value is Y. To disable the prompt, set the value to N. |
-s | The source directory for the address reference data files that the script copies to the nodes. By default, the script reads the files from the /reference_data directory on the local machine. Note: Address reference data files use the file name extension .MD. The source directory must contain the address reference data files and no other files. |
-t | The directory on each node to which the script copies the address reference data files. By default, the script copies the files to the /reference_data directory on each node. |
-u | The user name of the user who runs the script. The user must have passwordless secure shell access to the nodes. |
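As an illustration of how the options combine, the following sketch writes a node list file and assembles a command line from the options in the table. The host names, the IP address, the user name infa_user, and the temporary file location are hypothetical placeholders. The sketch also assumes that -n accepts the path to the node list file. It prints the command instead of running it, so that you can review the values first.

```shell
#!/bin/sh
# Hypothetical node list: one DataNode host name or IP address per line.
NODE_FILE="${TMPDIR:-/tmp}/HadoopDataNodes"
printf '%s\n' 'datanode01.example.com' 'datanode02.example.com' '192.0.2.15' > "$NODE_FILE"

# Assemble the command from the documented options.
# /reference_data is the documented default; every other value is a placeholder.
CMD="./copyRefDataToComputeNodes.sh -n $NODE_FILE -s /reference_data -t /reference_data -u infa_user -p Y"

# Print the command for review instead of executing it.
echo "$CMD"
```

Any option that you leave out of the command falls back to the default value that the table describes.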
Installing the Address Reference Data Files
To install address reference data files on the DataNodes in a Hadoop cluster, run the copyRefDataToComputeNodes.sh shell script. Alternatively, define a job in a job scheduler application to run the shell script at time intervals that you specify.
Before you run the script or define the job, review the option values that you specify for the script. You can accept the default values or update the values.
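Part of this review can be automated. The following sketch is one possible pre-flight check and is not part of the Informatica installation. It defines two hypothetical helper functions: check_source_dir verifies that a source directory holds .MD files and no other visible entries, and check_ssh verifies that a user can reach a node over secure shell without a password prompt. The function names and the five-second timeout are assumptions made for illustration.

```shell
#!/bin/sh
# Return 0 if the directory exists, holds at least one .MD file,
# and holds no other visible entries; return 1 otherwise.
check_source_dir() {
    dir="$1"
    [ -d "$dir" ] || return 1
    found=0
    for f in "$dir"/*; do
        [ -e "$f" ] || continue      # empty directory: the glob matched nothing
        case "$f" in
            *.MD) found=1 ;;
            *)    echo "Unexpected file: $f" >&2; return 1 ;;
        esac
    done
    [ "$found" -eq 1 ]
}

# Return 0 if user $1 can open an SSH session to node $2 without a password.
# BatchMode=yes makes ssh fail instead of prompting for a password.
check_ssh() {
    ssh -o BatchMode=yes -o ConnectTimeout=5 "$1@$2" true
}
```

For example, `check_source_dir /reference_data && check_ssh infa_user datanode01.example.com` succeeds only when both conditions hold, where infa_user and the host name are placeholders.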
Installing the Address Reference Data Files at the Command Prompt
To install the files at the command prompt, perform the following steps:
1. At the command prompt, open the following directory:
<Informatica installation directory>/tools/dq/av
2. Run copyRefDataToComputeNodes.sh.
Optionally, enter one or more values for the script options. If you do not enter a value for an option, the script runs with the default value for the option.
By default, the script prompts you to confirm the installation of the files. To install the files, enter Y.
Installing the Address Reference Data Files with a Scheduled Job
You can define a job to run the shell script at time intervals that you specify. Add the job to a job scheduler application. If you define a job to install the files, you must disable the prompt to confirm installation.
To disable the prompt, set the following option on the shell script:
-p n
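A scheduled job could call the script through a small wrapper. The following sketch assumes cron as the job scheduler. The INFA_HOME variable, its /opt/informatica default, the log file location, and the schedule are all hypothetical, so substitute the values for your site.

```shell
#!/bin/sh
# INFA_HOME stands in for <Informatica installation directory> (assumed value).
INFA_HOME="${INFA_HOME:-/opt/informatica}"
SCRIPT_DIR="$INFA_HOME/tools/dq/av"
LOG_FILE="${LOG_FILE:-/tmp/copyRefData.log}"

run_copy() {
    # Change to the script directory, as in the command prompt procedure.
    cd "$SCRIPT_DIR" || return 1
    # -p n disables the confirmation prompt, which a scheduled run cannot answer.
    # Append the script output to the log file for later review.
    ./copyRefDataToComputeNodes.sh -p n >> "$LOG_FILE" 2>&1
}

# Example crontab entry (assumed schedule: 02:00 on the first day of each month):
# 0 2 1 * * /path/to/wrapper.sh
```

The block only defines run_copy. A wrapper file would call run_copy as its last line, and the crontab entry would point to that file.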