Metadata fields on the Deduplicate transformation

The Deduplicate transformation includes a set of predefined fields that contain metadata for the deduplication and consolidation processes. The transformation creates the fields by default and populates the fields when the mapping runs.

Metadata fields on the Field Mapping tab

The Target Fields list on the Field Mapping tab includes the following metadata fields:

GroupKey

Contains the data values that the transformation uses to sort input records into groups for duplicate analysis.

SequenceId

Contains a unique identifier for each record that enters the transformation.

The transformation uses the sequence ID values to identify records in the Out_DriverId and Out_LinkId data. If you do not map the SequenceId field, the transformation uses the values in the Out_RowId field as unique identifiers for the records.

Metadata fields on the Output Fields tab

The Output Fields tab includes the following metadata fields:

Out_ClusterId

Contains the identifier of the cluster to which each record belongs.

Note: In the deduplication process, a cluster is a set of records whose data values match each other to a degree that exceeds the duplicate threshold. Records in the same set are likely to identify the same entity. A set may contain a single record, as every unique record is a perfect match with itself.

Out_ClusterSize

Contains the number of records in the set to which the current record belongs. When a set contains a single, unique record, the cluster size is 1.

Out_DriverId

Contains the identifier of the driver record in each matching record set. The driver record is the record in the set with the lowest value on the SequenceId input field. If the transformation does not use the SequenceId field, the driver record is the record in the matching set with the lowest Out_RowId value.

Out_DriverScore

Contains the score that represents the degree of similarity between the current record and the driver record in the matching record set.

Out_IsSurvivor

Contains an identifier for the preferred record that a consolidation process specifies.

Out_LinkId

Contains the identifier of the record that matched with the current record and linked it to the matching record set.

Out_LinkScore

Contains the similarity score between two records that results in the addition of the current record to a matching record set. The Out_LinkId field identifies the record with which the current record shares the link score.

Out_RowId

Contains a unique identifier for each record in the mapping source data set.

The transformation uses the Out_RowId values to identify records if you do not map a field of unique identifiers to the SequenceId field.
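
The relationship between the cluster, cluster size, and driver fields can be illustrated outside the transformation. The following Python sketch is not the transformation's own logic. It assumes that the output rows have been exported as dictionaries, and the function name summarize_clusters and the sample values are hypothetical. It applies the rules described above: the cluster size is the number of records that share a cluster identifier, and the driver record is the record with the lowest identifier in that cluster.

```python
from collections import defaultdict


def summarize_clusters(rows, id_field="SequenceId"):
    """Return {cluster_id: (cluster_size, driver_id)} for exported output rows.

    id_field names the column that identifies records. Pass "Out_RowId"
    when no field is mapped to SequenceId.
    """
    ids_by_cluster = defaultdict(list)
    for row in rows:
        ids_by_cluster[row["Out_ClusterId"]].append(row[id_field])
    # The driver is the record with the lowest identifier in the cluster.
    return {
        cluster_id: (len(ids), min(ids))
        for cluster_id, ids in ids_by_cluster.items()
    }


# Hypothetical sample rows, not real transformation output.
rows = [
    {"SequenceId": 101, "Out_ClusterId": 1},
    {"SequenceId": 102, "Out_ClusterId": 1},
    {"SequenceId": 103, "Out_ClusterId": 2},
]
summary = summarize_clusters(rows)
```

In this sample, cluster 1 holds two records with record 101 as the driver, and cluster 2 holds a single unique record that acts as its own driver.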

Selecting metadata fields

The metadata fields can provide important information about the relationship between duplicate records. For example, the metadata includes the Out_LinkScore field, which represents the degree of similarity between two records as a numerical value. If you select the Out_LinkScore field, select the Out_LinkId field also. The Out_LinkId field identifies the other record in the pair of records that the Out_LinkScore value describes.

The Out_DriverScore value provides a benchmark for all records in a matching record set. The Out_DriverScore value is the score between the current record and the driver record, which is the record in the set with the lowest sequence ID or row ID value. The record with the lowest ID is also the first record that the deduplication process added to the set.
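
To see why the link and driver fields are useful together, consider the following Python sketch. It is an illustration only, under the assumption that the selected output fields have been exported as dictionaries. The function name describe_links, the sample identifiers, and the score values are hypothetical, and the score scale shown is an assumption. The sketch pairs each link score with the record it refers to and flags records that joined the set through a record other than the driver.

```python
def describe_links(rows):
    """Pair each record's link score with its linked record and its driver."""
    described = []
    for row in rows:
        described.append({
            "record": row["Out_RowId"],
            "linked_to": row["Out_LinkId"],
            "link_score": row["Out_LinkScore"],
            "driver": row["Out_DriverId"],
            "driver_score": row["Out_DriverScore"],
            # True when the record was linked by a record other than the driver.
            "joined_via_other_record": row["Out_LinkId"] != row["Out_DriverId"],
        })
    return described


# Hypothetical non-driver rows from one matching record set (driver is record 1).
rows = [
    {"Out_RowId": 2, "Out_LinkId": 1, "Out_LinkScore": 0.95,
     "Out_DriverId": 1, "Out_DriverScore": 0.95},
    {"Out_RowId": 3, "Out_LinkId": 2, "Out_LinkScore": 0.91,
     "Out_DriverId": 1, "Out_DriverScore": 0.84},
]
result = describe_links(rows)
```

In this sample, record 3 joined the set through record 2, so its link score describes the pair of records 3 and 2, while its driver score describes its similarity to record 1. Without the Out_LinkId value, the link score could not be attributed to a specific pair.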