Correct Duplicates
To correct duplicate records, examine all the records in a cluster. Add the most accurate data from each record in the cluster to the preferred record. Remove from the cluster any record that is not a duplicate.
Perform the following actions to correct duplicates:
- Edit the cluster.
- The Data Editing panel displays the records in a cluster and selects a record as the preferred record. Examine the field values in each record in the cluster. Verify that the records in the cluster represent different versions of the same record. If you find values that contain more accurate information than the preferred record values, replace the preferred record values with the values from the other records.
- Create clusters.
- A cluster might contain a record that is not a duplicate of any other record in the cluster. In this case, create a cluster and move the record to it. The new cluster contains a single record, which keeps the unique record in the database table.
- Find records in other clusters.
- If you expect that the data contains duplicate records across more than one cluster, search for records in the other clusters. You can open multiple clusters at the same time in the Analyst tool. If duplicate records occur between clusters, move records from one cluster to another.
- Set review status.
- When you complete the preferred record in a cluster, set the cluster review status to REVIEWED. The Analyst tool marks the preferred record for storage in the database table and marks the remaining records as redundant. A Mapping task can delete the redundant records in a later stage of the workflow. If you want to change the preferred record again, you can clear the cluster status. When you clear the status, the review status becomes NULL.
- Comment.
- You can view and add comments for a specific cluster. When you add comments, they appear in the audit trail for the cluster.
- Complete the task.
- After you review all the clusters in your task, you can indicate that you completed work on the task. The workflow defines the options that you can select to complete the task.
Editing a Cluster
Examine the records in the cluster and select any data value that you want to add to a field in the preferred record. You can select fields from multiple records to add to the preferred record.
The clusters in the task appear in a series of tabs across the Data Editing panel. The first cluster is open by default.
Use the following steps to edit a cluster:
1. Open a correct duplicates task.
The clusters in the task appear in the Exceptions workspace.
2. Click a tab to open the cluster.
The duplicate records appear beneath the preferred record.
3. Compare the preferred record with the other records in the cluster.
Identify the most accurate values in each column in the cluster.
4. Click Edit.
5. Click a value from a duplicate record that you want to move to the preferred record.
Repeat this step for each value that you want to move to the preferred record. When you complete work in a cluster, confirm that you reviewed the cluster.
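The effect of the editing steps can be pictured with a short sketch. This is not an Analyst tool API. The function `apply_selected_values`, the record layout, and the sample values are all hypothetical and only illustrate how values picked from several duplicate records end up in the preferred record.

```python
from typing import Dict, List, Tuple

Record = Dict[str, str]


def apply_selected_values(
    preferred: Record,
    duplicates: List[Record],
    selections: List[Tuple[int, str]],
) -> Record:
    """Return a copy of the preferred record with the selected values applied.

    Each selection is (index of the duplicate record, column name), which
    mirrors clicking one value in one duplicate record.
    """
    merged = dict(preferred)  # leave the original preferred record untouched
    for dup_index, column in selections:
        merged[column] = duplicates[dup_index][column]
    return merged


# Hypothetical sample data for one cluster.
preferred = {"name": "J. Smith", "city": "Boston", "phone": ""}
duplicates = [
    {"name": "John Smith", "city": "Boston", "phone": ""},
    {"name": "J Smith", "city": "Bostn", "phone": "555-0100"},
]

# Take the name from the first duplicate and the phone from the second.
merged = apply_selected_values(preferred, duplicates, [(0, "name"), (1, "phone")])
print(merged)
```

In the sketch, values come from more than one duplicate record, and columns that are not selected keep the preferred record's value.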
Creating a Cluster
Create a cluster when the current cluster contains a unique record. Move the unique record to the new cluster.
1. Open a cluster in the Data Editing panel.
2. Click Cluster Actions > Create Cluster.
An empty cluster appears in the Data Editing panel below the current cluster.
3. Select the record to add to the cluster.
4. Click Move Records.
The record becomes the preferred record of the cluster you created.
5. Move any other record that matches the preferred record to the cluster you created.
If the cluster you created contains one record, the preferred record is a unique record.
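As a rough illustration of these steps, the following sketch models clusters as plain Python lists. The helper `move_record` and the sample records are hypothetical and are not part of the Analyst tool. The sketch assumes, as described above, that the first record moved into the empty cluster becomes its preferred record.

```python
from typing import Dict, List

Record = Dict[str, str]


def move_record(source: List[Record], target: List[Record], index: int) -> None:
    """Move one record from the source cluster to the target cluster."""
    target.append(source.pop(index))


# Hypothetical cluster in which the third record is not a duplicate.
current_cluster: List[Record] = [
    {"id": "1", "name": "John Smith"},
    {"id": "2", "name": "Jon Smith"},
    {"id": "3", "name": "Maria Lopez"},
]

# "Create Cluster" gives an empty cluster; "Move Records" fills it.
new_cluster: List[Record] = []
move_record(current_cluster, new_cluster, 2)

# Assumption: the first record moved in is the preferred record.
preferred = new_cluster[0]

# A cluster with a single record holds a unique record.
is_unique = len(new_cluster) == 1
print(preferred, is_unique)
```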
Finding Duplicate Records in Multiple Clusters
You can find records in other clusters that might match records in the current cluster. Enter a data value to search for and select the record column that contains the data value.
1. In the Data Editing panel, click Cluster Actions > Find Clusters.
The Find dialog box appears.
2. Enter the data value you want to find. You can enter the full data value as it appears in the record column, or you can use an asterisk (*) or percent sign (%) wildcard character in a text pattern.
3. Select the column that contains the data value to search for.
4. Click Find.
The search operation returns all records that contain the value you searched for.
5. Select any record in the search results that matches a record in the open cluster. You can use the CTRL key to select multiple records.
The Data Editing panel displays the clusters that contain the records you select. Select a record and click Move Records to move it from one cluster to the other.
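The search can be approximated outside the tool with a small sketch. It assumes that both the asterisk (*) and the percent sign (%) match any sequence of characters and that the pattern must match the whole column value. The tool's exact matching rules, including case sensitivity, might differ. The functions `wildcard_to_regex` and `find_records` and the sample clusters are hypothetical.

```python
import re
from typing import Dict, List, Tuple

Record = Dict[str, str]


def wildcard_to_regex(pattern: str) -> "re.Pattern[str]":
    """Translate a text pattern with * or % wildcards into a regular expression.

    Assumption: both wildcards stand for any sequence of characters.
    Every other character is matched literally.
    """
    parts = [".*" if ch in "*%" else re.escape(ch) for ch in pattern]
    return re.compile("".join(parts))


def find_records(
    clusters: Dict[str, List[Record]], column: str, pattern: str
) -> List[Tuple[str, Record]]:
    """Return (cluster id, record) pairs whose column value matches the pattern."""
    regex = wildcard_to_regex(pattern)
    hits = []
    for cluster_id, records in clusters.items():
        for record in records:
            if regex.fullmatch(record.get(column, "")):
                hits.append((cluster_id, record))
    return hits


# Hypothetical clusters keyed by a made-up cluster id.
clusters = {
    "C1": [{"name": "John Smith"}, {"name": "Jon Smith"}],
    "C2": [{"name": "John Smyth"}, {"name": "Maria Lopez"}],
}

hits = find_records(clusters, "name", "Jo*Sm%")
for cluster_id, record in hits:
    print(cluster_id, record["name"])
```

Returning the cluster id with each hit mirrors the way the Data Editing panel opens the clusters that contain the selected records, so that records can then be moved between them.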
Setting the Status of a Cluster
When you complete work on a cluster, confirm that you reviewed the cluster. When you confirm the review, you set the cluster review status to REVIEWED.
1. Open a cluster in the Data Editing panel.
2. Click Edit.
3. Click Cluster Actions > Confirm Cluster Review.
4. To clear the REVIEWED status, click Cluster Actions > Clear Cluster Status.
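The status change can be modeled with a minimal sketch. The `Cluster` class and its method names are hypothetical and do not reflect how the Analyst tool stores data. Python's `None` stands in for a NULL review status, and the sketch assumes that the non-preferred records count as redundant only while the status is REVIEWED.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional

Record = Dict[str, str]


@dataclass
class Cluster:
    preferred: Record
    others: List[Record] = field(default_factory=list)
    review_status: Optional[str] = None  # None stands in for NULL

    def confirm_review(self) -> None:
        """Model of Cluster Actions > Confirm Cluster Review."""
        self.review_status = "REVIEWED"

    def clear_status(self) -> None:
        """Model of Cluster Actions > Clear Cluster Status."""
        self.review_status = None

    def redundant_records(self) -> List[Record]:
        """Records that a later workflow stage could delete."""
        if self.review_status == "REVIEWED":
            return list(self.others)
        return []


cluster = Cluster(
    preferred={"id": "1", "name": "John Smith"},
    others=[{"id": "2", "name": "Jon Smith"}],
)

cluster.confirm_review()
print(cluster.review_status, cluster.redundant_records())

cluster.clear_status()
print(cluster.review_status, cluster.redundant_records())
```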
Adding Comments to a Cluster
You can add comments to a cluster to save notes or information about the data. Comments appear in the data audit trail.
1. Open a cluster in the Data Editing panel.
2. Click Edit.
3. Click Cluster Actions > Comment.
4. Enter text in the Comment dialog box.