Introduction to deduplicate assets

A deduplicate asset measures the level of similarity between records in a data set and optionally consolidates similar records into a single, preferred record.

Deduplicate assets focus on identity information within the records that they analyze. In a data quality context, an identity is a set of data values in a record that identify a person, an organization, or another uniquely identifiable entity.

In the deduplication process, a Deduplicate transformation generates a set of numerical scores that represent the degrees of similarity between the input records. When two or more records match each other with a score that meets the score threshold, the transformation adds the records to a set of potential duplicates. In the deduplicate asset, you can review and update the score threshold that the transformation uses.
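The scoring and grouping steps described above can be illustrated with a minimal Python sketch. It is not the algorithm that the Deduplicate transformation uses. The function names, the string-ratio similarity measure from the standard library, and the 0.85 threshold are all assumptions chosen for illustration.

```python
from difflib import SequenceMatcher
from itertools import combinations


def similarity(a: str, b: str) -> float:
    """Return a score between 0.0 and 1.0 (illustrative measure only)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def find_duplicate_sets(records, threshold=0.85):
    """Group records whose pairwise score meets the threshold.

    The threshold value is a hypothetical default, not a product setting.
    """
    # Union-find structure: each record starts in its own group.
    parent = list(range(len(records)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path compression
            i = parent[i]
        return i

    # Score every pair and merge the groups of pairs that match.
    for i, j in combinations(range(len(records)), 2):
        if similarity(records[i], records[j]) >= threshold:
            parent[find(j)] = find(i)

    # Collect records by group and keep only groups with two or more members.
    groups = {}
    for i, record in enumerate(records):
        groups.setdefault(find(i), []).append(record)
    return [group for group in groups.values() if len(group) > 1]


if __name__ == "__main__":
    names = ["John Smith", "Jon Smith", "Mary Jones"]
    print(find_duplicate_sets(names))
```

In this sketch, "John Smith" and "Jon Smith" fall into one set of potential duplicates, while "Mary Jones" matches nothing and is left out. Raising the threshold makes the grouping stricter, and lowering it admits weaker matches.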

In the consolidation process, the mapping further evaluates the sets of matching records that the duplicate analysis identified. The mapping builds a preferred or optimal version of the records in each set according to the criteria that you specify in the deduplicate asset.
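As a rough illustration of consolidation, the following Python sketch builds one preferred record from a set of matching records. The rule that it applies (keep the longest non-empty value for each field) and the function and field names are hypothetical. In practice, the criteria are whatever you specify in the deduplicate asset.

```python
def consolidate(duplicate_set, fields):
    """Build a single preferred record from a set of matching records.

    Hypothetical rule: for each field, keep the longest non-empty value.
    On a tie, the value from the earliest record in the set is kept.
    """
    preferred = {}
    for field in fields:
        # Treat missing or None values as empty strings.
        candidates = [record.get(field) or "" for record in duplicate_set]
        preferred[field] = max(candidates, key=len)
    return preferred


if __name__ == "__main__":
    matches = [
        {"name": "J. Smith", "email": "john@example.com", "phone": ""},
        {"name": "John Smith", "email": "", "phone": "555-0100"},
    ]
    print(consolidate(matches, ["name", "email", "phone"]))
```

Here the preferred record takes the fuller name and the phone number from the second record and the email address from the first, so the output contains no empty fields.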

You create and test a deduplicate asset in Data Quality or Data Integration, and then add the asset to a Deduplicate transformation in a mapping in Data Integration. In Data Quality, you create a deduplicate asset from the New dialog box. In Data Integration, you create a deduplicate asset either from the New Asset dialog box or from a mapping.

When the mapping runs, Data Integration performs the deduplication and consolidation operations that the asset defines.

Types of deduplication project

You might use a deduplicate asset in the following types of data project:

• Customer relationship management. For example, a store designs a mail campaign and must check the customer database for duplicate customer records.
• Regulatory compliance initiatives. For example, a business operates under government or industry regulations that require all data systems to be free of duplicate records.
• Financial risk management. For example, a bank might want to search for relationships between account holders.
• Any project that must identify or eliminate records that store duplicate identity information.