Field Match Output Options
Configure the Match Output options to define the output format for field match analysis.
You configure the options in the Match Output Type area and the Properties area.
Match Output Types
The Match Output view includes options that specify the output data format. You can configure the transformation to write records in clusters, in matched pairs, or in best-match pairs.
Select one of the following match output types:
- Best Match
- Writes each record in the master data set with the record that represents the best match in the second data set. The match operation selects the record in the second data set that has the highest match score for the master record. If two or more records in the second data set share the highest score, the match operation selects the first of those records. Best Match writes each pair of records to a single row.
- You can select Best Match when you configure the transformation for dual-source analysis.
- Clusters
- Writes clusters that contain sets of records that link to each other with match scores that meet the match threshold. Each record must match at least one other record in the cluster with a score that meets the threshold.
- You can select Clusters when you configure the transformation for single-source analysis or dual-source analysis.
- Matched Pairs
- Writes all pairs of records that match each other with a score that meets the match threshold. The transformation writes each pair to a single row and adds the match score for each pair to each row. If a record matches more than one other record, the transformation writes a row for each record pair.
- You can select Matched Pairs when you configure the transformation for single-source analysis or dual-source analysis.
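The following Python sketch illustrates how the three output types differ when they are applied to the same kind of input. It is not the transformation's implementation. It assumes that the match scores are already computed and stored in a dictionary, and the function names, record IDs, and score values are hypothetical. The cluster logic is a plain connected-components grouping, chosen to reflect the rule that each record must match at least one other record in its cluster.

```python
from collections import defaultdict


def matched_pairs(scores, threshold):
    """Return one row per record pair whose score meets the threshold."""
    return [(a, b, s) for (a, b), s in scores.items() if s >= threshold]


def clusters(scores, threshold):
    """Group records that link to each other, directly or through other records.

    Simplification: records with no match at or above the threshold are omitted.
    """
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path compression
            x = parent[x]
        return x

    for (a, b), s in scores.items():
        if s >= threshold:
            parent[find(a)] = find(b)  # link the two groups

    groups = defaultdict(list)
    for rec in list(parent):
        groups[find(rec)].append(rec)
    return sorted(sorted(g) for g in groups.values())


def best_match(scores, master_ids, second_ids):
    """Return one row per master record with its highest-scoring partner."""
    rows = []
    for m in master_ids:
        best_id, best_score = None, -1.0
        for s_id in second_ids:  # iterate in the order of the second data set
            score = scores.get((m, s_id), 0.0)
            if score > best_score:  # strict '>' keeps the first record on ties
                best_id, best_score = s_id, score
        rows.append((m, best_id, best_score))
    return rows


# Hypothetical single-source scores: (record, record) -> match score.
single_source_scores = {
    ("r1", "r2"): 0.92,
    ("r2", "r3"): 0.88,
    ("r1", "r3"): 0.60,
    ("r4", "r5"): 0.95,
    ("r3", "r4"): 0.40,
}

# Hypothetical dual-source scores: (master record, second record) -> match score.
dual_source_scores = {
    ("m1", "s1"): 0.9,
    ("m1", "s2"): 0.9,
    ("m2", "s1"): 0.3,
    ("m2", "s2"): 0.7,
}

print(matched_pairs(single_source_scores, 0.85))
print(clusters(single_source_scores, 0.85))
print(best_match(dual_source_scores, ["m1", "m2"], ["s1", "s2"]))
```

In this sample data, r1 and r3 score below the threshold against each other but still land in the same cluster, because each of them matches r2. In the best-match case, s1 and s2 tie for m1, and the sketch keeps s1 because it appears first.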
Match Output Properties
The Match Output view includes properties that specify the cache memory behavior, the match score threshold, and the match scores that appear in the transformation output.
You can also use the match output properties to specify how the transformation adds match score values to the output records.
After you select a match output type, configure the following properties:
- Cache Directory
- Specifies the directory to which the Data Integration Service writes temporary data during field match analysis. The Data Integration Service writes temporary files to the directory when the volume of data that the match analysis generates is greater than the available system memory. The Data Integration Service deletes the temporary files after the mapping runs.
- You can enter a directory path in the property, or you can use a parameter to identify the directory. Specify a local path on the Data Integration Service machine. The Data Integration Service must be able to write to the directory. The default value is the CacheDir system parameter.
- Cache Size
- Determines the amount of system memory that the Data Integration Service assigns to field match analysis. The default value is 400,000 bytes.
- Before it sorts the data, the Data Integration Service allocates the amount of memory that you specify. If the match analysis generates a greater amount of data, the Data Integration Service writes the excess data to the cache directory. If the match analysis requires more memory than the system memory and the file storage can provide, the mapping fails.
Note: If you enter a value of 65536 or higher, the transformation reads the value in bytes. If you enter a lower value, the transformation reads the value in megabytes.
- Threshold
- Sets the minimum match score that identifies two records as potential duplicates of each other.
- You can assign a parameter to the threshold value. Set a decimal value in the range 0 through 1.
- Scoring Method
- Determines the match score values that appear in the transformation output. Select a scoring method for cluster outputs.
- The following table describes the scoring method options:
Scoring Method Option | Description |
---|---|
Both | Adds the link score and the driver score to each record in the cluster. |
Link Score | Adds the link score to each record in the cluster. Default option. |
Driver Score | Adds the driver score to each record in the cluster. |
None | Does not add a match score to any record in the cluster. |
Note: If you add the driver score to the records, you increase the mapping run time. The mapping waits until all clusters are complete before it adds the driver score values to the records.
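As a rough illustration of the Cache Size note and the Threshold range described above, the following Python sketch shows how an entered value might be interpreted. The function names are hypothetical, and the sketch assumes that one megabyte equals 1,048,576 bytes, which the documentation does not state.

```python
BYTES_PER_MB = 1024 * 1024   # assumption: binary megabytes
BYTE_UNIT_CUTOFF = 65536     # values at or above this are read as bytes


def cache_size_in_bytes(value: int) -> int:
    """Interpret a Cache Size entry using the rule in the Cache Size note."""
    if value < 0:
        raise ValueError("cache size cannot be negative")
    if value >= BYTE_UNIT_CUTOFF:
        return value              # already expressed in bytes
    return value * BYTES_PER_MB   # lower values are read as megabytes


def validate_threshold(value: float) -> float:
    """Accept a threshold only if it is a decimal value from 0 through 1."""
    if not 0.0 <= value <= 1.0:
        raise ValueError("threshold must be in the range 0 through 1")
    return value


print(cache_size_in_bytes(400000))  # the default value, read as bytes
print(cache_size_in_bytes(512))     # a low value, read as megabytes
print(validate_threshold(0.85))
```

Under this reading, a small entry such as 512 requests far more memory than the default of 400,000 bytes, so it is worth checking which unit applies before you change the value.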