Duplicate analysis operations

You define duplicate analysis operations in a deduplicate asset in Data Quality and in the Deduplicate transformation that reads the asset in Data Integration.

At a high level, you complete the following steps:

Deduplicate asset steps

Select the type of identity information that the transformation will analyze.

Configure the search criteria that the transformation will apply to the input data.

Deduplicate transformation steps

Add the deduplicate asset to the transformation.

Select the fields that contain the relevant identity data.

Select a field on which the transformation can sort the input records at run time.

The deduplicate asset provides a list of identity types that you must choose from. Each identity type is optimized for a different type of information. When you configure the Deduplicate transformation, you map the identity fields that the asset specifies to the input fields on the transformation.

Additionally, you configure the Deduplicate transformation to sort the input records into groups based on the values in a field that you select. In duplicate analysis, a group is a set of records that contain identical values in a given field. At run time, the Deduplicate transformation analyzes records exclusively within each group and combines the results from each group into a single output data set. When you create groups on an appropriate field, you reduce the overall number of comparisons that the Deduplicate transformation must perform without any meaningful loss of accuracy in the duplicate analysis.

The GroupKey field in the Deduplicate transformation identifies the field on which the transformation sorts the records. For more information on groups in duplicate analysis, see the Deduplicate transformation chapter in the Transformations module of the Data Integration documentation.

Note: To analyze all of your input records in a single group, create the group from a field that contains the same value for all records.
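The effect of grouping on the number of comparisons can be illustrated with a minimal sketch. It is not the Deduplicate transformation itself: the record layout, the field names, and the helper functions group_records and count_comparisons are hypothetical and exist only for this illustration. The sketch assumes that every record is compared once with every other record in its group.

```python
from collections import defaultdict
from math import comb


def group_records(records, group_key):
    """Collect records that share an identical value in the group key field."""
    groups = defaultdict(list)
    for record in records:
        groups[record[group_key]].append(record)
    return dict(groups)


def count_comparisons(groups):
    """Count the record pairs that are compared when analysis stays inside each group."""
    return sum(comb(len(members), 2) for members in groups.values())


# Hypothetical input records; "zip" plays the role of the group key field.
records = [
    {"id": 1, "name": "Ann Lee", "zip": "10001"},
    {"id": 2, "name": "Anne Lee", "zip": "10001"},
    {"id": 3, "name": "Bob Ray", "zip": "10001"},
    {"id": 4, "name": "Bob Rae", "zip": "94105"},
    {"id": 5, "name": "Cy Poe", "zip": "94105"},
    {"id": 6, "name": "Di Fox", "zip": "60601"},
]

# Groups of 3, 2, and 1 records: 3 + 1 + 0 pairs to compare.
by_zip = group_records(records, "zip")
grouped_pairs = count_comparisons(by_zip)

# A field with the same value in every record places all records in one group.
single_group = group_records([{**r, "all": "x"} for r in records], "all")
ungrouped_pairs = count_comparisons(single_group)

print(grouped_pairs, ungrouped_pairs)  # 4 15
```

In this sample, records 3 and 4 fall into different groups and are never compared with each other, which is why the choice of group field matters.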

Rules and guidelines for duplicate analysis operations

When you configure the duplicate analysis operations, consider the following rules and guidelines:

•The Deduplicate transformation compares field data from every input record with data from every other record in its group. The transformation begins with the first record in the group, or the record with the lowest sequence ID value.

The transformation creates discrete sets of similar records called clusters. If a record matches any other record in the group, the transformation adds the records to the same cluster. If a record matches another record that is already part of a cluster, the transformation adds the record to that cluster. If a record does not match any other record in its group, the record becomes the sole member of its own cluster.

When the group analysis is complete, the Deduplicate transformation collates the records from each group into a single output data set.

•To determine the similarity between the identities in a set of records, the Deduplicate transformation creates an index from the values on the input fields that are relevant to the identity. The index contains raw values from the input fields and a range of alternative versions of the raw values.

You select the identity type as the Objective in the deduplicate asset. You specify the input field that contains the principal data for the identity as the Index Key in the asset.

•For each pair of records in a group, the Deduplicate transformation calculates a numerical score that represents the degree of similarity between the two records. A pair of identical records returns a score of 1.00.

The transformation uses a threshold value to identify records that match each other. The default value is 0.9, or 90 percent. If two records match each other with a score that meets or exceeds the threshold, the transformation identifies the records as duplicates.

You can optionally update additional asset properties that modify the speed, performance, and depth of the duplicate analysis.

•The final output from the Deduplicate transformation depends on the output properties that you specify in the deduplicate asset. By default, the asset instructs the transformation to retain every pair of matching records in the clusters that it creates. Alternatively, the asset can instruct the transformation to retain only the records that are the closest matches with each other. You can set the output type as a property on the asset.
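The scoring, threshold, and clustering rules above can be sketched for a single group as follows. This is an illustration under stated assumptions, not the product's implementation: the similarity function uses Python's difflib.SequenceMatcher as a stand-in for the identity-based scoring, and the function names, record layout, and sample names are hypothetical. Only the 0.9 default threshold and the clustering behavior come from the rules above.

```python
from difflib import SequenceMatcher
from itertools import combinations

THRESHOLD = 0.9  # default threshold value described in the guidelines


def similarity(a, b):
    """Stand-in score between 0 and 1; NOT the product's identity matching logic."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def cluster_group(records, field, threshold=THRESHOLD):
    """Cluster the records of one group; matches chain together through union-find."""
    parent = {r["id"]: r["id"] for r in records}

    def find(x):
        # Follow parent links to the cluster representative, compressing the path.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    matches = []
    for a, b in combinations(records, 2):
        score = similarity(a[field], b[field])
        if score >= threshold:  # a score that meets or exceeds the threshold is a match
            matches.append((a["id"], b["id"], round(score, 2)))
            parent[find(b["id"])] = find(a["id"])

    clusters = {}
    for r in records:
        clusters.setdefault(find(r["id"]), []).append(r["id"])
    return sorted(clusters.values()), matches


# Hypothetical records that all belong to the same group.
group = [
    {"id": 1, "name": "Jonathan Smith"},
    {"id": 2, "name": "Jonathan Smyth"},
    {"id": 3, "name": "Jonathon Smyth"},
    {"id": 4, "name": "Maria Garcia"},
]

clusters, matches = cluster_group(group, "name")
print(clusters)  # [[1, 2, 3], [4]]
print(matches)   # [(1, 2, 0.93), (2, 3, 0.93)]
```

Records 1 and 3 score below the threshold when compared directly, but both match record 2, so all three records join the same cluster. Record 4 matches nothing and becomes the sole member of its cluster.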