Introduction to deduplicate assets
A deduplicate asset defines a set of duplicate analysis operations. Duplicate analysis measures the levels of record duplication in a data set and optionally consolidates each set of duplicate records into a single, preferred record.
You create a deduplicate asset in Data Quality. You add the asset to a Deduplicate transformation in a mapping in Data Integration. When the mapping runs, Data Integration performs the deduplication and consolidation operations that the asset defines.
Deduplicate assets focus on identity information within the records that they analyze. In a data quality context, an identity is a set of data values in a record that identify a person, an organization, or another uniquely identifiable entity.
In the deduplication process, a Deduplicate transformation generates a set of numerical scores that represent the degrees of similarity between the input records. When two or more records match each other with a score that meets the threshold, the transformation adds the records to a set of potential duplicates. You can review and update the score threshold in the deduplicate asset.
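The scoring and grouping idea can be illustrated with a minimal Python sketch. It is not the algorithm that the Deduplicate transformation uses. The function names, the sample records, the 0.9 threshold, and the use of the standard library difflib.SequenceMatcher as the similarity measure are all assumptions made for illustration.

```python
from difflib import SequenceMatcher
from itertools import combinations


def similarity(a: str, b: str) -> float:
    """Return a score from 0.0 to 1.0 for two identity strings (illustrative measure)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def find_duplicate_sets(records: dict, threshold: float) -> list:
    """Group record IDs whose pairwise score meets the threshold."""
    # Union-find structure: each record starts in its own group.
    parent = {rid: rid for rid in records}

    def find(x):
        # Follow parent links to the group root, shortening the path as we go.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    # Score every pair of records and merge the groups of matching pairs.
    for a, b in combinations(records, 2):
        if similarity(records[a], records[b]) >= threshold:
            parent[find(a)] = find(b)

    # Collect the groups and keep only those with more than one record.
    groups = {}
    for rid in records:
        groups.setdefault(find(rid), set()).add(rid)
    return [group for group in groups.values() if len(group) > 1]


# Hypothetical sample data: record ID mapped to an identity string.
records = {
    1: "John Smith, 12 Oak Street",
    2: "Jon Smith, 12 Oak Street",
    3: "Maria Garcia, 4 Elm Road",
}

duplicate_sets = find_duplicate_sets(records, threshold=0.9)
print(duplicate_sets)
```

In this sketch, records 1 and 2 differ by a single character and fall into the same set of potential duplicates, while record 3 remains unmatched. Raising the threshold makes matching stricter, and lowering it admits more loosely similar records.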
In the consolidation process, the mapping further evaluates the sets of matching records that the duplicate analysis identified. The mapping builds a preferred or optimal version of the records in each set according to the criteria that you specify in the deduplicate asset.
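The consolidation step can be sketched in the same way. The rule below, which keeps the longest non-empty value for each field and breaks ties in favor of the most recently updated record, is a hypothetical criterion chosen for illustration. The field names and the consolidate function are also assumptions, not part of the product.

```python
def consolidate(duplicate_set: list) -> dict:
    """Build one preferred record from a set of duplicate records.

    Hypothetical rule: for each field, keep the longest non-empty value.
    Ties go to the most recently updated record.
    """
    # Sort newest first so that max() returns the newest record's value on a tie.
    ordered = sorted(duplicate_set, key=lambda r: r["updated"], reverse=True)
    fields = [name for name in ordered[0] if name != "updated"]

    preferred = {}
    for name in fields:
        # Treat missing or empty values as empty strings when comparing lengths.
        preferred[name] = max((r.get(name) or "" for r in ordered), key=len)
    return preferred


# Hypothetical set of duplicates; dates are ISO strings, so they sort correctly as text.
duplicates = [
    {"name": "Jon Smith", "email": "", "phone": "555-0100", "updated": "2023-01-15"},
    {"name": "John Smith", "email": "john.smith@example.com", "phone": "", "updated": "2024-03-02"},
]

preferred = consolidate(duplicates)
print(preferred)
```

The preferred record here combines the name and email address from the newer record with the phone number from the older one, which shows how a consolidated record can draw values from more than one record in the set.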
Types of deduplication projects
You might use a deduplicate asset in the following types of data projects:
• Customer relationship management. For example, a store designs a mail campaign and must check the customer database for duplicate customer records.
• Regulatory compliance initiatives. For example, a business operates under government or industry regulations that require all data systems to be free of duplicate records.
• Financial risk management. For example, a bank might want to search for relationships between account holders.
• Any project that must identify or eliminate records that store duplicate identity information.