Identity Match Output Options
The Match Output view includes options that specify the output data format. You can configure the transformation to write records in clusters or in matched pairs. You can also configure the transformation to include or exclude different categories of identity when you perform identity analysis against a persistent index data store.
You configure the options in the Match output type area and the Properties area.
Match Output Types
The match output type determines how the transformation groups matching records in the output: in clusters or in pairs.
Select one of the following match output types:
- Best Match
- Writes each record in the master data set with the record that represents the best match in the second data set. The match operation selects the record in the second data set that has the highest match score for the master record. If two or more records in the second data set share the highest score, the match operation selects the first of those records. Best Match writes each pair of records to a single row.
- You can select Best Match when you configure the transformation for dual-source analysis.
- Clusters - Best Match
- Writes clusters that represent the best match between one record and another record in the same data set or between two data sets. The match score between the two records must meet the match threshold. Best match clusters can contain more than two records if a record represents the best match with more than one other record.
- You can select Clusters - Best Match in any type of identity analysis.
Note: The index data storage method that you select can affect the contents of the cluster output in Clusters - Best Match mode. A transformation that connects to index tables can create different clusters than a transformation that stores index data for the same records in temporary files. The index data storage method does not affect the match scores that the transformation generates for pairs of records.
- Clusters - Match All
- Writes clusters of records that match with a score that meets the match threshold. Each record must match at least one other record in the cluster.
- You can select Clusters - Match All in any type of identity analysis.
- Matched Pairs
- Writes all pairs of records that match each other with a score that meets the match threshold. The transformation writes each pair to a single row and adds the match score for each pair to each row. If a record matches more than one other record, the transformation writes a row for each record pair.
- You can select Matched Pairs in any type of identity analysis.
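The differences between the output types can be illustrated with a minimal Python sketch. It is not the transformation's implementation. The function names and the (record, record, score) tuple layout are hypothetical, and the sketch assumes that a score "meets" the threshold when it is greater than or equal to it. The Best Match function applies no threshold because the description of that output type does not mention one. Clusters - Match All is modeled as connected components, and records with no match are kept as single-record clusters.

```python
from collections import defaultdict


def matched_pairs(scores, threshold):
    """Matched Pairs: one row per pair whose score meets the threshold."""
    return [(a, b, s) for (a, b, s) in scores if s >= threshold]


def match_all_clusters(records, scores, threshold):
    """Clusters - Match All: records linked through any qualifying pair share a cluster."""
    parent = {r: r for r in records}

    def find(r):
        # Follow parent links to the cluster root, shortening the path on the way.
        while parent[r] != r:
            parent[r] = parent[parent[r]]
            r = parent[r]
        return r

    for a, b, s in scores:
        if s >= threshold:
            parent[find(a)] = find(b)

    clusters = defaultdict(list)
    for r in records:
        clusters[find(r)].append(r)
    return sorted(sorted(c) for c in clusters.values())


def best_match(master, scores):
    """Best Match: highest-scoring second-set record per master record.

    A strictly-greater comparison keeps the first record seen when scores tie.
    """
    best = {}
    for m, other, s in scores:
        if m not in best or s > best[m][1]:
            best[m] = (other, s)
    return [(m, *best[m]) for m in master if m in best]


# Hypothetical sample data for illustration only.
records = ["A", "B", "C", "D"]
scores = [("A", "B", 0.92), ("B", "C", 0.85), ("C", "D", 0.40)]
pairs = matched_pairs(scores, 0.8)
clusters = match_all_clusters(records, scores, 0.8)

dual_scores = [("M1", "S1", 0.9), ("M1", "S2", 0.9), ("M2", "S2", 0.7)]
best = best_match(["M1", "M2"], dual_scores)
```

In this sample, record A does not match record C directly, but both land in the same Match All cluster because each matches record B.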
Match Output Properties
The Match Output view contains properties that specify the cache memory behavior and the match score threshold. You can also use the properties to determine how the transformation selects data store records for analysis and writes data store records as output.
After you select a match output type, configure the following properties:
- Cache Directory
- Specifies the directory to which the Data Integration Service writes temporary data during identity match analysis. The Data Integration Service writes temporary files to the directory when the volume of data that the match analysis generates is greater than the available system memory. The Data Integration Service deletes the temporary files after the mapping runs.
- You can enter a directory path in the property, or you can use a system parameter to identify the directory. Specify a local path on the Data Integration Service machine. The Data Integration Service must be able to write to the directory. The default value is the CacheDir system parameter.
- Cache Size
- Determines the amount of system memory that the Data Integration Service assigns to identity match analysis. The default value is 400,000 bytes.
- If the match analysis generates a greater amount of data, the Data Integration Service writes the excess data to the cache directory. If the match analysis requires more space than the system memory and the file storage together can provide, the mapping fails.
Note: If you enter a value of 65536 or higher, the transformation interprets the value as a number of bytes. If you enter a lower value, the transformation interprets the value as a number of megabytes.
- Match
- Identifies the records to analyze when the transformation reads index data from database tables. Use the options on the Match Type view to identify the index tables.
- By default, the transformation analyzes all the records in the data source and the index database tables. Configure the Match property to specify a subset of the records for duplicate analysis.
- Output
- Filters the records that the transformation writes as output when you configure the transformation to read index database tables. Use the options on the Match Type view to identify the index tables.
- By default, the Match transformation writes all records from the data source and the index database tables as output. Configure the Output property when you do not need to review all the records in the input data.
- Threshold
- Sets the minimum match score that identifies two records as potential duplicates of each other.
- You can assign a parameter to the threshold value. Set a decimal value in the range 0 through 1.
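The unit rule for Cache Size and the range rule for Threshold can be sketched as follows. This is an illustrative Python sketch, not product code. The function names are hypothetical, and the conversion factor of 1024 * 1024 bytes per megabyte is an assumption, because the text does not define the size of a megabyte.

```python
# Assumption: one megabyte is 1024 * 1024 bytes. The documentation does not say.
BYTES_PER_MEGABYTE = 1024 * 1024


def cache_size_in_bytes(value):
    """Interpret a Cache Size entry: 65536 or higher is bytes, lower is megabytes."""
    if value >= 65536:
        return value
    return value * BYTES_PER_MEGABYTE


def validate_threshold(value):
    """Accept a threshold only if it is a decimal value in the range 0 through 1."""
    if not 0.0 <= value <= 1.0:
        raise ValueError(f"threshold must be in the range 0 through 1, got {value}")
    return value


default_cache = cache_size_in_bytes(400000)  # the default value, read as bytes
small_entry = cache_size_in_bytes(512)       # below 65536, so read as megabytes
threshold = validate_threshold(0.85)
```

Under this rule, an entry of 512 requests far more memory than the default entry of 400000, so check the unit before you lower or raise the value.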
Match Property Configuration
Use the Match property on the Match Output view to specify how the transformation selects input data for analysis. Configure the property when you configure the Match transformation to read a persistent store of index data. The Match property refines the options that you set on the Match Type view.
You can configure the Match property to perform the following types of analysis:
- Compare the data source records with the index data records
- To look for duplicate records between the data source and the index data tables, select Exclusive.
- When you select the Exclusive option, the Match transformation compares the data source records with the index data store. The transformation does not analyze the records within the data source or within the data store.
- Select Exclusive when you know that the index data store does not contain duplicate records, and you know that the data source does not contain duplicate records.
- Compare the data source records with the index data records, and compare the data source records with each other
- To look for duplicates in the data source and to look for duplicates between the data source and the index tables, select Partial.
- The transformation compares the data source records with the index data store. The transformation also compares the records within the data source with each other.
- Select Partial when you know that the index data store does not contain duplicate records, but you have not performed any duplicate analysis on the data source.
- Compare all records in the data source and the index tables as a single data set
- To look for duplicates between the data source and the index tables, and to look for duplicates within the data source and within the index tables, select Full. The default option is Full.
- The transformation analyzes the data source and the data store as a single data set and compares all record data within the data set.
- Select Full when you cannot verify that either data set is free of duplicate records.
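The three Match options differ only in which record pairs are candidates for comparison. The following Python sketch enumerates those pairs for small lists of hypothetical record IDs. The function name and the list-based representation are assumptions made for illustration, and the sketch says nothing about how the transformation selects or scores candidates internally.

```python
from itertools import combinations, product


def comparison_pairs(source, index, mode):
    """Return the record pairs that each Match option would compare."""
    cross = list(product(source, index))            # data source vs. index data
    within_source = list(combinations(source, 2))   # data source vs. itself
    within_index = list(combinations(index, 2))     # index data vs. itself

    if mode == "Exclusive":
        return cross
    if mode == "Partial":
        return cross + within_source
    if mode == "Full":
        return cross + within_source + within_index
    raise ValueError(f"unknown Match option: {mode}")


# Hypothetical record IDs for illustration only.
source = ["s1", "s2"]
index = ["i1", "i2"]

exclusive = comparison_pairs(source, index, "Exclusive")
partial = comparison_pairs(source, index, "Partial")
full = comparison_pairs(source, index, "Full")
```

With two records on each side, Exclusive yields four pairs, Partial adds the single pair within the data source, and Full adds the pair within the index data, which equals every pair in the combined data set.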
Output Property Configuration
Use the Output property on the Match Output view to filter the records that the transformation writes as output. Configure the property when you specify index data tables and you select a clustered output format. Filter the records to limit the output to clusters that contain one or more records from the data source.
You can filter the output data in the following ways:
- Write every cluster that includes a record from the data source or the index tables
- Select All Rows. The transformation writes every cluster that contains at least one record from either the data source or the index data store. The default is All Rows.
- Because a cluster can contain a single record, the output contains all records.
- Write every cluster that includes a record from the data source
- Select New and Associated Rows. The transformation writes every cluster that contains at least one record from the data source.
- Because a cluster can contain a single record, the output contains all records in the data source. The clusters can also include records from the index tables.
- Write every cluster from the data source
- Select New Rows Only. The transformation writes the clusters that contain records from the data source. The output does not contain any record from the index tables.
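The three Output options can be sketched as filters over a list of clusters. In this hypothetical Python sketch, each cluster is a list of (record ID, origin) tuples, where origin is either "source" or "index". That layout and the function name are assumptions. The sketch also assumes that New Rows Only keeps the clusters that contain data source records and removes the index table records from them, which is one reading of the description above.

```python
def filter_clusters(clusters, output):
    """Apply an Output option to clusters of (record_id, origin) tuples."""
    if output == "All Rows":
        return [list(c) for c in clusters]

    # Both remaining options keep only clusters with at least one source record.
    with_source = [c for c in clusters if any(origin == "source" for _, origin in c)]

    if output == "New and Associated Rows":
        return [list(c) for c in with_source]
    if output == "New Rows Only":
        # Assumption: index table records are dropped from the retained clusters.
        return [[rec for rec in c if rec[1] == "source"] for c in with_source]
    raise ValueError(f"unknown Output option: {output}")


# Hypothetical clusters for illustration only.
clusters = [
    [("s1", "source"), ("i1", "index")],
    [("i2", "index")],
    [("s2", "source")],
]

all_rows = filter_clusters(clusters, "All Rows")
new_and_associated = filter_clusters(clusters, "New and Associated Rows")
new_only = filter_clusters(clusters, "New Rows Only")
```

The cluster that holds only i2 appears under All Rows and is dropped by the other two options, while i1 survives under New and Associated Rows because it shares a cluster with s1.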