The Deduplicate transformation includes a set of predefined fields that contain metadata for the deduplication and consolidation processes. The transformation creates the fields by default and populates the fields when the mapping runs.
Metadata fields on the Field Mapping tab
The Target Fields list on the Field Mapping tab includes the following metadata fields:
GroupKey
Contains the data values that the transformation uses to sort input records into groups for duplicate analysis.
SequenceId
Contains a unique identifier for each record that enters the transformation.
The transformation uses the sequence ID values to identify records in the Out_DriverId and Out_LinkId data. If you do not map the SequenceId field, the transformation uses the values in the Out_RowId field as unique identifiers for the records.
Metadata fields on the Output Fields tab
The Output Fields tab includes the following metadata fields:
Out_ClusterId
Contains the identifier of the cluster to which each record belongs.
Note: In the deduplication process, a cluster is a set of records whose data values match each other to a degree that exceeds the duplicate threshold. Records in the same set are likely to represent the same entity. A set may contain a single record, as every unique record is a perfect match with itself.
Out_ClusterSize
Contains the number of records in the set to which the current record belongs. When a set contains a single, unique record, the cluster size is 1.
Out_DriverId
Contains the identifier of the driver record in each matching record set. The driver record is the record in the set with the lowest value on the SequenceId input field. If the transformation does not use the SequenceId field, the driver record is the record in the matching set with the lowest Out_RowId value.
Out_DriverScore
Contains the score that represents the degree of similarity between the current record and the driver record in the matching record set.
Out_IsSurvivor
Contains an identifier for the preferred record that a consolidation process specifies.
Out_LinkId
Contains the identifier of the record that matched with the current record and linked it to the matching record set.
Out_LinkScore
Contains the score between two records that results in the addition of a record to a matching record set. The Out_LinkId field identifies the record with which the current record shares the link score.
Out_RowId
Contains a unique identifier for each record in the mapping source data set.
The transformation uses the Out_RowId values to identify records if you do not map a field of unique identifiers to the SequenceId field.
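The relationship between the cluster, cluster size, and driver fields can be illustrated with a minimal Python sketch. It is not the transformation's implementation. The sample rows and the summarize_clusters function are hypothetical, and the identifier in each row stands in for the SequenceId value, or for the Out_RowId value when SequenceId is not mapped.

```python
from collections import defaultdict

# Hypothetical rows: (record identifier, Out_ClusterId). All values are invented.
rows = [(101, 1), (102, 1), (103, 2), (104, 1), (105, 3), (106, 3)]

def summarize_clusters(rows):
    """Return {cluster_id: (cluster_size, driver_id)} for the given rows."""
    members = defaultdict(list)
    for identifier, cluster_id in rows:
        members[cluster_id].append(identifier)
    # Per the field descriptions, the driver is the record with the lowest
    # identifier in each matching record set.
    return {cid: (len(ids), min(ids)) for cid, ids in members.items()}

summary = summarize_clusters(rows)
for cid in sorted(summary):
    size, driver = summary[cid]
    print(f"cluster {cid}: Out_ClusterSize={size}, Out_DriverId={driver}")
```

In this sample, cluster 2 holds a single unique record, so its size is 1 and the record is its own driver.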
Selecting metadata fields
The metadata fields can provide important information about the relationship between duplicate records. For example, the metadata includes the Out_LinkScore field, which represents the degree of similarity between two records as a numerical value. If you select the Out_LinkScore field, also select the Out_LinkId field, which identifies the other record in the pair of records that the Out_LinkScore value describes.
The Out_DriverScore value provides a benchmark for all records in a matching record set. The Out_DriverScore value is the score between the current record and the driver record, which is the record in the set with the lowest sequence ID or row ID value and which the Out_DriverId field identifies. The record with the lowest ID is also the first record that the deduplication process added to the set.
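The following Python sketch shows why a link score is hard to interpret without the corresponding link ID. It is an illustration only. The records, the output rows, the 0 to 1 score scale, and the describe_link function are all assumptions made for the example, not values or behavior taken from the transformation.

```python
# Hypothetical source records, keyed by record identifier.
records = {7: "Jon Smith", 9: "John Smith", 12: "John Smyth"}

# Hypothetical output rows for the non-driver records in one matching record
# set. Scores are invented and assume a 0 to 1 scale.
output = [
    {"id": 9, "Out_DriverId": 7, "Out_DriverScore": 0.93,
     "Out_LinkId": 7, "Out_LinkScore": 0.93},
    {"id": 12, "Out_DriverId": 7, "Out_DriverScore": 0.86,
     "Out_LinkId": 9, "Out_LinkScore": 0.95},
]

def describe_link(row, records):
    """Resolve Out_LinkId to a record so that the link score has context."""
    current = records[row["id"]]
    linked = records[row["Out_LinkId"]]
    return (f"{current} linked via {linked} "
            f"(link score {row['Out_LinkScore']:.2f}, "
            f"driver score {row['Out_DriverScore']:.2f})")

for row in output:
    print(describe_link(row, records))
```

In the sample data, record 12 joins the set through record 9 rather than through the driver, so its link score and driver score differ. Comparing the two scores for each record shows how closely the record resembles the driver as opposed to the record that linked it to the set.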