Deduplication and consolidation operations

When you run a mapping, the Deduplicate transformation generates a temporary index from the input records that it reads. The transformation analyzes the index to find pairs of similar records.
The transformation calculates a series of percentage scores that represent the degrees of similarity between the pairs of records that it finds. If two records match each other with a score that exceeds a given threshold, the transformation considers the records to be duplicates.
The deduplicate asset that you add to the transformation specifies the comparison criteria for the deduplication operation, including the threshold score that duplicate records must satisfy.
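The threshold test described above can be illustrated with a minimal sketch. It does not reproduce the transformation's own matching algorithm, which the deduplicate asset defines. The `similarity` function (built on Python's `difflib`), the `find_duplicate_pairs` helper, the sample records, and the threshold value of 80 are all assumptions made for illustration.

```python
from difflib import SequenceMatcher
from itertools import combinations


def similarity(a: str, b: str) -> float:
    """Stand-in similarity score, expressed as a percentage (0 to 100).

    Assumption: difflib is used only for illustration. The real comparison
    criteria come from the deduplicate asset.
    """
    return 100.0 * SequenceMatcher(None, a.lower(), b.lower()).ratio()


def find_duplicate_pairs(records, threshold):
    """Return (id_a, id_b, score) for every pair whose score exceeds the threshold."""
    pairs = []
    for (id_a, name_a), (id_b, name_b) in combinations(records, 2):
        score = similarity(name_a, name_b)
        if score > threshold:
            pairs.append((id_a, id_b, round(score, 1)))
    return pairs


# Hypothetical input records: (record ID, name)
records = [(1, "John Smith"), (2, "Jon Smith"), (3, "Mary Jones")]

# Hypothetical threshold score of 80 percent
duplicate_pairs = find_duplicate_pairs(records, threshold=80.0)
for id_a, id_b, score in duplicate_pairs:
    print(f"Records {id_a} and {id_b} are duplicates (score {score})")
```

In this sketch, only the pair of records 1 and 2 exceeds the threshold, so only that pair is reported as duplicates.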
Consolidation is an optional process that the deduplicate asset can specify for the transformation. During consolidation, the transformation evaluates the sets of matching records that the deduplication process identifies. The transformation selects or constructs a preferred version of the records in each matching set.
A Data Quality user configures the deduplication and consolidation processes in the deduplicate asset. For more information about the criteria that the asset defines, contact the Data Quality user.
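As a rough illustration of the difference between selecting and constructing a preferred record, the following sketch applies two hypothetical consolidation rules to records that already carry a matching-set ID. The field names, the `match_set_id` column, the "most complete record" selection rule, and the "first non-empty value" construction rule are assumptions. The actual rules are whatever the deduplicate asset specifies.

```python
from collections import defaultdict

# Hypothetical data fields that take part in consolidation
DATA_FIELDS = ("name", "email", "phone", "city")


def completeness(record):
    """Count the non-empty data fields in a record."""
    return sum(1 for field in DATA_FIELDS if record[field])


def group_by_match_set(records):
    """Collect records into their matching sets, keyed by match_set_id."""
    match_sets = defaultdict(list)
    for record in records:
        match_sets[record["match_set_id"]].append(record)
    return match_sets


def select_survivor(members):
    """Select: keep the most complete existing record (assumed rule)."""
    return max(members, key=completeness)


def construct_survivor(members):
    """Construct: take the first non-empty value per field (assumed rule)."""
    merged = {field: "" for field in DATA_FIELDS}
    for record in members:
        for field in DATA_FIELDS:
            if not merged[field] and record[field]:
                merged[field] = record[field]
    return merged


records = [
    {"id": 1, "match_set_id": 1, "name": "John Smith", "email": "",
     "phone": "555-0100", "city": ""},
    {"id": 2, "match_set_id": 1, "name": "Jon Smith",
     "email": "jsmith@example.com", "phone": "", "city": "Austin"},
    {"id": 3, "match_set_id": 2, "name": "Mary Jones", "email": "",
     "phone": "", "city": "Denver"},
]

match_sets = group_by_match_set(records)
selected = {set_id: select_survivor(m) for set_id, m in match_sets.items()}
constructed = {set_id: construct_survivor(m) for set_id, m in match_sets.items()}
```

Under these assumed rules, selection keeps record 2 for the first matching set because it has the most populated fields, while construction produces a merged record that combines the name and phone number from record 1 with the email address and city from record 2.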

Rules and guidelines for deduplication and consolidation

When you add a Deduplicate transformation to a mapping, consider the following rules and guidelines:
Mapping fields for identity analysis
The deduplicate asset that you add to the transformation specifies a type of identity, such as a person name or an organization name. The asset identifies the identity type as the objective of the deduplication operation. The type of identity defines the types of information that the transformation expects to find in the index.
You must map the appropriate input fields on the transformation to the target fields that the transformation indicates. You can optionally map additional input fields to other fields on the transformation.
Groups and sequence ID values
In duplicate analysis, a group is a set of records that contain identical values in a given field. At run time, the Deduplicate transformation compares records only within each group and then combines the results from all groups into a single data set. Use the GroupKey field on the Field Mapping tab to define your groups. When you create groups on an appropriate field, you reduce the overall number of comparisons that the transformation must perform without any meaningful loss of accuracy in duplicate analysis.
The GroupKey field is mandatory. If you prefer not to sort your input data into groups, add a column to your data set that has the same value on every row and map the column to the GroupKey field.
Sequence ID values determine the order in which the transformation reads the input records. If your input records do not contain a field that can provide data to the SequenceId field, the transformation reads the records in the order in which they appear in the input data set. A SequenceId field is mandatory if you run a mapping in advanced mode.
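The effect of grouping on the number of comparisons can be sketched as follows. This is an illustration only, not the transformation's internal logic. The `candidate_pairs` helper, the sample records, and the use of a `postcode` column as the group key are assumptions. The last step mimics the constant-value column that you can map to GroupKey when you prefer not to group the data.

```python
from collections import defaultdict
from itertools import combinations


def candidate_pairs(records, group_key):
    """Yield only the record pairs that share the same group key value."""
    groups = defaultdict(list)
    for record in records:
        groups[record[group_key]].append(record)
    for members in groups.values():
        yield from combinations(members, 2)


# Hypothetical input records
records = [
    {"id": 1, "name": "John Smith", "postcode": "10001"},
    {"id": 2, "name": "Jon Smith", "postcode": "10001"},
    {"id": 3, "name": "Mary Jones", "postcode": "94105"},
    {"id": 4, "name": "Marie Jones", "postcode": "94105"},
]

# Without grouping, every record is compared with every other record.
all_pairs = list(combinations(records, 2))

# With grouping on postcode, comparisons stay inside each group.
grouped_pairs = list(candidate_pairs(records, "postcode"))

# A column with the same value on every row puts all records in one group,
# which is equivalent to not grouping at all.
for record in records:
    record["constant_key"] = "ALL"
single_group_pairs = list(candidate_pairs(records, "constant_key"))

print(len(all_pairs), len(grouped_pairs), len(single_group_pairs))
```

With four records, the ungrouped data requires six comparisons, while grouping on the assumed `postcode` column requires two. The saving grows quickly with larger data sets, because the number of pairs in a group grows with the square of the group size.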
Clusters and scores
When two or more records match each other, the transformation assigns them to the same matching set and adds an ID value to each record that identifies them as members of the set.
A set of matching records within a group is also known as a cluster, and the ID value that associates matching records together is the cluster ID.
The transformation represents the relationships between matching records with link score and driver score values in the output data set. The link score is the score between two records that identifies them as members of the same cluster. The driver score is the score between the first record added to a cluster and another record in the cluster.
Note that the transformation generates a single score for each pair of matching records that it finds. The link and driver scores describe different relationships between records. They do not represent different calculations.
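The relationship between cluster IDs, link scores, and driver scores can be illustrated with a simplified sketch. The clustering rule used here (a record joins the cluster of the earlier record that it matches best) is an assumption, as are the precomputed pair scores, the threshold of 85, and the output field names. The sketch is intended only to show that both scores come from the same table of pair scores and differ in which pair they refer to.

```python
def pair_score(scores, a, b):
    """Look up the single score for a pair of records (0.0 if not compared)."""
    return scores.get((min(a, b), max(a, b)), 0.0)


def build_clusters(record_ids, scores, threshold):
    """Assign records to clusters in sequence order (simplified, assumed rule)."""
    cluster_of = {}   # record ID -> cluster ID
    driver_of = {}    # cluster ID -> first record added to the cluster
    rows = []
    next_cluster_id = 1

    for record_id in record_ids:
        # Find the best-scoring earlier record that exceeds the threshold.
        best_match, best_score = None, threshold
        for other_id in cluster_of:
            score = pair_score(scores, record_id, other_id)
            if score > best_score:
                best_match, best_score = other_id, score

        if best_match is None:
            # No match: the record starts a new cluster and becomes its driver.
            cluster_id = next_cluster_id
            next_cluster_id += 1
            driver_of[cluster_id] = record_id
            link_score = driver_score = None
        else:
            cluster_id = cluster_of[best_match]
            # Link score: the score with the record that ties it to the cluster.
            link_score = best_score
            # Driver score: the score with the first record in the cluster.
            driver_score = pair_score(scores, record_id, driver_of[cluster_id])

        cluster_of[record_id] = cluster_id
        rows.append({"id": record_id, "cluster_id": cluster_id,
                     "link_score": link_score, "driver_score": driver_score})
    return rows


# Hypothetical precomputed scores: one score per pair of records
scores = {(1, 2): 95.0, (2, 3): 90.0, (1, 3): 82.0}

rows = build_clusters([1, 2, 3, 4], scores, threshold=85.0)
for row in rows:
    print(row)
```

In this example, record 3 joins the cluster through its match with record 2, so its link score is the score for the pair of records 2 and 3. Its driver score is the score for the pair of records 1 and 3, because record 1 was the first record added to the cluster. Both values are lookups of the one score that exists for each pair.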
Metadata fields
On the Output Fields tab, the transformation adds fields that display the score values for pairs of matching records. The fields also identify the cluster to which each record belongs. If the deduplicate asset specifies a consolidation process, the metadata fields specify a preferred record for each cluster. The transformation identifies the preferred record as the survivor record.
Use the fields to understand the mapping results.
For more information about the metadata fields, see Metadata fields on the Deduplicate transformation and Link scores and driver scores.