Deduplication and consolidation operations

Mapping fields for identity analysis

The deduplicate asset that you add to the transformation specifies a type of identity, such as a person name or an organization name. The asset identifies the identity type as the objective of the deduplication operation. The type of identity defines the types of information that the transformation expects to find in the index.

You must map the appropriate input fields on the transformation to the target fields that the transformation indicates. You can optionally map additional input fields to other fields on the transformation.

Groups and sequence ID values

In duplicate analysis, a group is a set of records that contain identical values in a given field. At run time, the Deduplicate transformation analyzes the index data for records exclusively within each group and subsequently combines the results from each group into a single data set. Use the GroupKey field on the Field Mapping tab to define your groups. When you create groups on an appropriate field, you reduce the overall number of comparisons that the transformation must perform without any meaningful loss of accuracy in duplicate analysis.

The GroupKey field is mandatory. If you prefer not to sort your input data into groups, add a column to your data set that has the same value on every row and map the column to the GroupKey field.
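
The effect of grouping on the number of comparisons can be illustrated with a minimal sketch. It is not the transformation's implementation. The record layout, the candidate_pairs helper, and the constant "all" column are hypothetical names chosen for illustration. The sketch assumes that only records with the same group key value are compared with each other.

```python
from collections import defaultdict
from itertools import combinations


def candidate_pairs(records, group_key):
    """Return the pairs of record IDs that share a value in group_key."""
    groups = defaultdict(list)
    for record in records:
        groups[record[group_key]].append(record["id"])
    pairs = []
    for members in groups.values():
        # Records are paired only with other members of the same group.
        pairs.extend(combinations(members, 2))
    return pairs


# Hypothetical input data. The "all" column holds the same value on every
# row, which places every record in a single group.
records = [
    {"id": 1, "name": "Ann Lee", "postcode": "10001", "all": "x"},
    {"id": 2, "name": "Anne Lee", "postcode": "10001", "all": "x"},
    {"id": 3, "name": "Bob Ray", "postcode": "94105", "all": "x"},
    {"id": 4, "name": "Rob Ray", "postcode": "94105", "all": "x"},
]

grouped = candidate_pairs(records, "postcode")  # 2 pairs to compare
ungrouped = candidate_pairs(records, "all")     # 6 pairs to compare
```

With four records, grouping on the postcode column leaves two pairs to compare, while the constant column yields all six possible pairs. The gap widens quickly as the data set grows, because the number of pairs in a single group of n records is n(n-1)/2.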

Sequence ID values determine the order in which the transformation reads the input records. If your input records do not contain a field that can provide data to the SequenceId field, the transformation reads the records in the order in which they appear in the input data set. A SequenceId field is mandatory if you run a mapping in advanced mode.

Clusters and scores

When two or more records match each other, the transformation assigns them to the same matching set and adds an ID value to each record that identifies them as members of the set.

A set of matching records within a group is also known as a cluster, and the ID value that associates matching records together is the cluster ID.

The transformation represents the relationships between matching records with link score and driver score values in the output data set. The link score is the score between two records that identifies them as members of the same cluster. The driver score is the score between the first record added to a cluster and another record in the cluster.

Note that the transformation generates a single score for each pair of matching records that it finds. The link score and the driver score describe different relationships between records. They do not represent different calculations.
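
The following sketch shows how one score per pair can yield both a link score and a driver score. It is a simplified illustration, not the algorithm that the transformation uses. The threshold, the score table, the cluster function, and the output field names are all hypothetical. The sketch assumes that a record joins a cluster when its score against any member meets the threshold, and it leaves both scores empty for the first record in a cluster.

```python
THRESHOLD = 0.8  # hypothetical match threshold

# Hypothetical scores: exactly one score for each pair of records.
SCORES = {
    frozenset({"A", "B"}): 0.95,
    frozenset({"B", "C"}): 0.85,
    frozenset({"A", "C"}): 0.70,
}


def score(x, y):
    """Look up the single score for a pair; unknown pairs score 0.0."""
    return SCORES.get(frozenset({x, y}), 0.0)


def cluster(record_ids):
    """Assign records to clusters in the order in which they are read."""
    clusters = []  # each cluster lists its members, first record added first
    rows = []
    for rid in record_ids:
        best = None  # (cluster ID, score) of the strongest qualifying match
        for cid, members in enumerate(clusters):
            for member in members:
                s = score(rid, member)
                if s >= THRESHOLD and (best is None or s > best[1]):
                    best = (cid, s)
        if best is None:
            # No match: the record starts a new cluster.
            clusters.append([rid])
            rows.append({"id": rid, "cluster_id": len(clusters) - 1,
                         "link_score": None, "driver_score": None})
        else:
            cid, link = best
            first = clusters[cid][0]  # first record added to the cluster
            clusters[cid].append(rid)
            rows.append({"id": rid, "cluster_id": cid,
                         "link_score": link,
                         "driver_score": score(rid, first)})
    return rows


rows = cluster(["A", "B", "C", "D"])
```

In this sketch, record C joins the cluster through its score of 0.85 with record B, which becomes its link score. Its driver score is 0.70, the score between C and record A, the first record added to the cluster. Both values come from the same score table, so they differ only in which pair of records they describe.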

Metadata fields

On the Output Fields tab, the transformation adds fields that display the score values for pairs of matching records. The fields also identify the cluster to which each record belongs. If the deduplicate asset specifies a consolidation process, the metadata fields specify a preferred record for each cluster. The transformation identifies the preferred record as the survivor record.

Use the fields to understand the mapping results.

For more information about the metadata fields, see "Metadata fields on the Deduplicate transformation" and "Link scores and driver scores."
