Deduplicate asset structure

A Deduplicate asset contains options on a Definition tab, a Deduplication tab, and a Consolidation tab.
Use the Definition tab options to enter a name for the asset, optionally enter a description for the asset, and select the folder in which to store the asset. Use the Deduplication tab options to configure the type of identity analysis that a mapping will perform. Optionally, use the Consolidation tab options to configure the strategy that the mapping will apply to the discrete sets of duplicate records that arise from the identity analysis.

Deduplication tab options

Use the options on the Deduplication tab to configure the type of duplicate analysis that a mapping will perform. The Deduplication tab includes the Configuration, Advanced Options, and Test Data panels.

Configuration panel

Use the Configuration panel to configure the type of duplicate analysis that a mapping will perform.
The following image shows the Configuration panel on the Deduplication tab:
The image depicts the Configuration panel options.
The Configuration panel includes the following options:
  1. Objective. Identifies the type of information that the asset will prioritize when it compares the input records to each other. For more information about the objectives that you can configure, see Deduplication objectives.
  2. Index Key. Identifies the type of information that the asset will use to create an index of the input records.
  3. Data Locale. Identifies the country or region for which the Deduplicate transformation loads identity population reference data.
  4. Optional Fields. Enables the deduplicate analysis to consider additional fields at run time. This field is disabled by default.
  5. Filter Exact Duplicates. Determines whether the transformation passes identical records directly to the consolidation stage without analysis. This field is disabled by default.
  6. Performance. Specifies the speed and granularity of duplicate analysis. Fine-grained analysis takes longer to perform. To enable advanced options, select Custom.
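
As a rough illustration of what the Filter Exact Duplicates option implies, the following Python sketch separates records that are identical in every field from the records that still need comparison. The function name `split_exact_duplicates` and the dictionary record layout are hypothetical and chosen for illustration only. The sketch does not reflect how the Deduplicate transformation is implemented.

```python
from collections import defaultdict


def split_exact_duplicates(records):
    """Group records that are identical in every field (illustrative only).

    Returns a tuple of:
      - groups that contain two or more identical records
      - one representative per distinct record, for further analysis
    """
    groups = defaultdict(list)
    for record in records:
        # A tuple of sorted items is hashable and ignores key order.
        key = tuple(sorted(record.items()))
        groups[key].append(record)

    exact_groups = [group for group in groups.values() if len(group) > 1]
    representatives = [group[0] for group in groups.values()]
    return exact_groups, representatives


records = [
    {"name": "Ann Lee", "city": "Austin"},
    {"name": "Ann Lee", "city": "Austin"},
    {"name": "Anne Lee", "city": "Austin"},
]
exact_groups, to_analyze = split_exact_duplicates(records)
print(len(exact_groups), len(to_analyze))
```

In this sketch, only one copy of each distinct record goes forward to comparison, which is one plausible reason that filtering exact duplicates can reduce the analysis workload.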

Advanced Options panel

Use the Advanced Options panel to review or update the performance criteria that the Deduplicate transformation defines for duplicate analysis at run time.
The following image shows the Advanced Options panel on the Deduplication tab:
The Advanced Options panel on the Deduplication tab. The panel lists various options to customize the performance. The Field Thresholds option isn't enabled.
  1. Level of Accuracy. Determines the degree of similarity that must exist between two identities before the deduplication process considers them to be good matches with each other.
    Select one of the available accuracy levels.
  2. Level of Confidence. Determines the extent of the differences between data values that the deduplication process will tolerate. The level of confidence represents the level of error tolerance that the deduplication process supports.
    Select one of the available confidence levels.
  3. Key Level. Determines the complexity of the index keys that the deduplication process creates.
    Select one of the available key levels.
  4. Deduplication Outputs. Determines the composition of the clusters of matching records that the transformation creates at run time.
    Select one of the available deduplication outputs.
  5. Deduplication Threshold. Specifies the minimum score between two records that identifies the records as plausible duplicates of each other. You can enter a threshold value from 0 through 100.
  6. Field Thresholds. Enables custom deduplication thresholds for input fields.
    You can use the field-level threshold options to change the emphasis on individual fields within a deduplication process and deliver results that are fine-tuned to your current deduplication project.
    To set the thresholds, click Manage Field Thresholds. In the Manage Field Thresholds dialog box, you can configure a unique threshold value for each input field. You can enter a threshold value from 0 through 100.
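
To make the relationship between the Deduplication Threshold and the Field Thresholds options more concrete, the following Python sketch applies a record-level threshold and optional per-field thresholds to a pair of records. Everything in it is an assumption for illustration: the scoring uses `difflib.SequenceMatcher` from the Python standard library as a stand-in for the identity matching that the product performs, the record score is a plain average of the field scores, and the names `field_score` and `is_plausible_duplicate` are hypothetical. The product documentation does not describe its scoring in these terms.

```python
from difflib import SequenceMatcher


def field_score(a, b):
    """Return a 0-100 similarity score for two strings (stand-in metric)."""
    return round(SequenceMatcher(None, a.lower(), b.lower()).ratio() * 100)


def is_plausible_duplicate(rec_a, rec_b, record_threshold, field_thresholds=None):
    """Apply a record-level threshold and optional per-field thresholds.

    All thresholds are values from 0 through 100.
    """
    field_thresholds = field_thresholds or {}
    scores = {name: field_score(rec_a[name], rec_b[name]) for name in rec_a}

    # Assumption: a field with a custom threshold must meet that threshold.
    for name, minimum in field_thresholds.items():
        if scores[name] < minimum:
            return False

    # Assumption: the record score is the plain average of the field scores.
    overall = sum(scores.values()) / len(scores)
    return overall >= record_threshold


a = {"name": "Jon Smith", "city": "Boston"}
b = {"name": "John Smith", "city": "Boston"}

print(is_plausible_duplicate(a, b, record_threshold=85))
print(is_plausible_duplicate(a, b, record_threshold=85, field_thresholds={"name": 100}))
```

The second call shows how a strict threshold on a single field can reject a pair that passes the record-level threshold, which is the kind of change in emphasis that field-level thresholds allow.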

Test Data panel

Use the Test Data panel to see the fields for which the asset expects data, based on the objective that you select.
The following image shows the Test Data panel:
The image depicts the Test Data options.
Each field appears as a column name. To test the similarity between data records, enter two or more rows of data. At a minimum, populate each mandatory field and at least one of the required fields that the asset specifies for the current objective.
The panel also includes a Runtime Environment option and options to search, sort, and filter the test data. Use the Runtime Environment option to specify a Secure Agent.

Consolidation tab

Use the options on the Consolidation tab to configure the type of consolidation that a mapping will perform.
The following image shows the Consolidation tab options:
Two overlapping views of the Consolidation tab. One view shows the options for row-based consolidation. The other view shows the options for field-based consolidation.
The Consolidation tab includes the following options:
  1. Consolidation mode. Identifies the type of consolidation that the Deduplicate transformation will perform when the mapping runs. The type that you select determines how the transformation selects the preferred record in each set of duplicate records.
    Choose the row-based option to select a preferred record based on the quantity of data in the identity fields. Choose the field-based option to build a preferred record from the data values across one or more records. You can also choose not to consolidate the duplicate record sets.
  2. Row strategy. Determines how the transformation will select the preferred record when you choose the row-based consolidation mode.
    Choose Most Data to specify the record with the greatest number of characters as the preferred record. Choose Most Filled to specify the row with the highest number of populated fields. Choose Modal Exact to select the record with the highest number of fields that contain the most common values in their respective columns.
  3. Field name column. Lists the fields in the input records that the Deduplicate transformation will read. The field name column is visible when you select the field-based consolidation mode. You can specify a consolidation strategy for each field when you select field-based consolidation.
  4. Strategy. Determines how the transformation selects the value in each field for the preferred record when you choose the field-based consolidation mode.
    Select one of the available strategies.
  5. Type. Indicates whether the asset created the field during the deduplication operation or whether you added the field to the asset in the Consolidation pane.
  6. Data Type. Identifies the data type of the field. The default data type on all fields is String. You can modify the data type in field-based consolidation to suit your data requirements.
    Select one of the available data types for a field.
    Note: If you modify the data type of a field, do not change the mode to No Consolidation or Row-based consolidation without first saving the asset. The asset discards any update that you make to a data type in Field-based consolidation mode if you change to another mode.
  7. Add field button. Adds one or more inputs to the consolidation operation when you select the field-based mode.
    Add fields in the following cases:
    The field-based strategies will apply to every input field that you map to the Deduplicate asset in the transformation. If you do not specify a strategy for a field in field-based mode, the transformation applies the default strategy.
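
The three row strategies described above (Most Data, Most Filled, and Modal Exact) can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the product's implementation: each record is a dictionary of string values, an empty string counts as an unpopulated field, an empty string can be the most common value in a column, and ties go to the first matching record. The function names are hypothetical.

```python
from collections import Counter


def most_data(cluster):
    """Pick the record with the greatest total number of characters."""
    return max(cluster, key=lambda rec: sum(len(v) for v in rec.values()))


def most_filled(cluster):
    """Pick the record with the highest number of populated fields."""
    return max(cluster, key=lambda rec: sum(1 for v in rec.values() if v.strip()))


def modal_exact(cluster):
    """Pick the record with the most fields that hold their column's most common value."""
    fields = list(cluster[0].keys())
    # Most common value per column; empty strings are counted like any other value.
    modes = {
        f: Counter(rec[f] for rec in cluster).most_common(1)[0][0] for f in fields
    }
    return max(cluster, key=lambda rec: sum(1 for f in fields if rec[f] == modes[f]))


cluster = [
    {"name": "Robert Alexander Jones", "phone": "", "city": ""},
    {"name": "Bob Jones", "phone": "555-0100", "city": "Reno"},
    {"name": "Rob Jones", "phone": "555-0100", "city": ""},
    {"name": "Rob Jones", "phone": "555-0199", "city": ""},
]

print(most_data(cluster)["name"])
print(most_filled(cluster)["name"])
print(modal_exact(cluster)["name"])
```

With this sample cluster, each strategy selects a different preferred record, which shows why the choice of row strategy matters when the duplicate records differ in length, completeness, and agreement with each other.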