A duplicate analysis mapping can take time to run because of the number of data comparisons that the Deduplicate transformation must perform. The number of comparisons depends on the number of data values in the fields that you select.
The following table shows the number of calculations that a mapping performs on a single field:
| Number of data values | Number of comparisons |
| --- | --- |
| 10,000 | 50 million |
| 100,000 | 5,000 million |
| 1 million | 500,000 million |
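These figures follow from treating duplicate analysis as pairwise comparison: a field with n values yields n(n − 1)/2 unique pairs, which the table rounds to n²/2. The following Python sketch shows the arithmetic; the function name is illustrative and is not part of the product.

```python
def pairwise_comparisons(n: int) -> int:
    """Number of unique record pairs among n values: n * (n - 1) // 2."""
    return n * (n - 1) // 2

# The table rounds these values to n^2 / 2:
for n in (10_000, 100_000, 1_000_000):
    print(f"{n:>9,} values -> {pairwise_comparisons(n):,} comparisons")
```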
To reduce the time that the mapping takes to run, you configure the Deduplicate transformation to assign the input records to groups.
A group is a set of records that contain identical values on a field that you specify. When you perform duplicate analysis on grouped data, the Deduplicate transformation analyzes the record data exclusively within each group and combines the results from each group into a single output data set. The field on which you group the data is the GroupKey field. When you choose an appropriate group key, you reduce the overall number of comparisons that the Deduplicate transformation must perform without any meaningful loss of accuracy in the mapping analysis. Select the GroupKey field in the Deduplicate transformation.
The following table shows the number of calculations that a mapping performs on a single field that you sort into ten groups:
| Number of data values | Number of groups | Group size | Total number of comparisons (all groups) |
| --- | --- | --- | --- |
| 10,000 | 10 | 1,000 | 5 million |
| 100,000 | 10 | 10,000 | 500 million |
| 1 million | 10 | 100,000 | 50,000 million |
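The saving comes from restricting comparisons to records within the same group: ten groups of size n/10 together require roughly one tenth of the ungrouped comparisons. The following Python sketch reproduces the table's arithmetic under the assumption of evenly sized groups; the function names are illustrative, not product functionality.

```python
def pairwise_comparisons(n: int) -> int:
    """Number of unique record pairs among n values."""
    return n * (n - 1) // 2

def grouped_comparisons(n: int, groups: int) -> int:
    """Total comparisons when n values are split evenly into the given number of groups."""
    size = n // groups
    return groups * pairwise_comparisons(size)

# Compare ungrouped and grouped workloads for the table's data sizes:
for n in (10_000, 100_000, 1_000_000):
    print(f"{n:,}: ungrouped {pairwise_comparisons(n):,}, "
          f"10 groups {grouped_comparisons(n, 10):,}")
```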
Consider the following rules and guidelines when you organize data into groups:
• The GroupKey field must contain values that repeat across records, such as a city name or a state name in an address data set.
• Do not select a group key that contains information that is relevant to the duplicate analysis. For example, do not select the index key field as the GroupKey field. The goal in group creation is to organize the data according to values whose duplicate nature is not relevant to the objectives of the analysis.
• When you select a group key, consider whether the transformation can create groups that are a valid size relative to your input data. If the groups are too small, the match analysis might not find all of the duplicate records in the data set. If the groups are too large, the match analysis might return false duplicates.
• If your data does not contain a suitable field for group keys, create a data column that the transformation can use to sort the records into the group sizes that you require. For example, for a data set that contains 1 million records, you might decide to create a column that repeats a series of values from 1 through 50. The records in each group will be distributed evenly in the data set and will allow the duplicate analysis to proceed on grouped data.
• If you do not want to sort the records into groups, specify a GroupKey field that contains the same value in every record. If a suitable field does not exist, create the field. For example, create a column of data in which every value is Group1 and select the column as the GroupKey field. When the mapping runs, the Deduplicate transformation sorts the records by the GroupKey field values and therefore assigns every record to the same group.
• Grouping does not reorder the position of the records in the mapping data set.
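The guideline about creating a synthetic group-key column can be sketched as follows. This is an illustrative Python sketch, not product functionality; the record layout and the field name GroupKey stand in for your own data.

```python
from itertools import cycle

def add_synthetic_group_key(records, group_count=50):
    """Attach a GroupKey value that cycles through 1..group_count,
    spreading the records evenly across the groups."""
    keys = cycle(range(1, group_count + 1))
    return [{**record, "GroupKey": next(keys)} for record in records]

# For 1,000 records and 50 key values, each group receives 1,000 / 50 = 20 records.
records = [{"id": i} for i in range(1_000)]
keyed = add_synthetic_group_key(records)
```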
Example: Selecting a group key column
Let's say that a bank wants to search for duplicate bank account holders. The bank's customer data set includes columns for customer names and addresses, and the bank chooses Contact as the objective in the deduplicate asset. The bank decides to sort the input records into groups and to perform duplicate analysis on each group. The bank must select a column in the Deduplicate transformation on which to create the groups.
The following table shows a fragment of the data set:
| Customer ID | Lastname | Firstname | Address1 | City | State | Zip | Country |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 90999990 | Armstrong | Al | 6121 SUNSET BLVD. | LOS ANGELES | CA | 90028 | USA |
| 90999907 | Baldwin | Lynn | 1600 EL CAMINO REAL, SUITE 1500 | MENLO PARK | CA | 94025 | USA |
| 90999917 | Baldwyn | Linn | 1600 EL CAMINO REAL, #1500 | MENLO PK | CA | 94025 | USA |
| 90999859 | Belleperche | Carmen | 9255 SUNSET BLVD. | LOS ANGELES | CA | 90069 | USA |
| 90999876 | Clark | Wick | 777 S. FIGUEROA | LOS ANGELES | CA | 90071 | USA |
| 90999859 | Bachtin | Guy | 30 S. WACKER | CHICAGO | IL | 60606 | USA |
| 90999868 | Dicintio | David | 181 WEST MADISON ST | CHICAGO | IL | 60602 | USA |
| 90999869 | Ash | Pascal | 335 WEST 16TH STREET | NEW YORK | NY | 10011 | USA |
| 90999996 | Bachtin | David | 1633 BROADWAY | NEW YORK | NY | 10022 | USA |
| 90999994 | Carpenter | Brad | 30 BROAD ST | NEW YORK | NY | 42304 | USA |
| 90999820 | Dedmond | David | ONE FINANCIAL SQUARE | NEW YORK | NY | 10008 | USA |
| 90999902 | Backwell | Chris | 901 SE OAK, WILLAMETTE PLZ | PORTLAND | OR | 97214 | USA |
| 90999897 | Askerup | Nancy | 400 MARKET STREET | HOUSTON | TX | 77027 | USA |
| 90999904 | Choy | Shelley | 1177 WEST LOOP SOUTH | HOUSTON | TX | 77027 | USA |
| 90999886 | Cote | Lian | 530 E. SWEDESFORD RD. | HOUSTON | TX | 77027 | USA |
| 90999999 | Croteau | Paul | 3829-55 GASKINS ROAD | HOUSTON | TX | 77027 | USA |
In this scenario, you might identify the State column as the most suitable column on which to sort the records. You select the State column as the GroupKey field in the transformation.
When you select the State column as the GroupKey field, the deduplication operation creates five groups, one for each state in the data. The likelihood that the bank has customers with the same contact information in different states is very low. Additionally, the data includes a Customer ID column that can add to the confidence of the deduplication process.
The Customer ID column is a poor candidate for group creation, as it is a primary key field. If you select the column as the GroupKey field, the deduplication operation creates a group for every unique ID and thus for every record.
The Country column is also a poor candidate for group creation, as the column contains the same value in every row. If you select the Country column as the GroupKey field, the deduplication operation adds all of the records to the same group. Your bank might have two or more genuine customers with the same name living across the country, and you do not want to deduplicate their entries.
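To illustrate why State is a good group key for this fragment, the following Python sketch (an illustration outside the product) groups a reduced version of the sample rows by state and counts the comparisons that remain after grouping.

```python
from collections import defaultdict

# A reduced fragment of the bank's data: (Lastname, City, State) per record.
rows = [
    ("Armstrong", "LOS ANGELES", "CA"), ("Baldwin", "MENLO PARK", "CA"),
    ("Baldwyn", "MENLO PK", "CA"), ("Belleperche", "LOS ANGELES", "CA"),
    ("Clark", "LOS ANGELES", "CA"), ("Bachtin", "CHICAGO", "IL"),
    ("Dicintio", "CHICAGO", "IL"), ("Ash", "NEW YORK", "NY"),
    ("Bachtin", "NEW YORK", "NY"), ("Carpenter", "NEW YORK", "NY"),
    ("Dedmond", "NEW YORK", "NY"), ("Backwell", "PORTLAND", "OR"),
    ("Askerup", "HOUSTON", "TX"), ("Choy", "HOUSTON", "TX"),
    ("Cote", "HOUSTON", "TX"), ("Croteau", "HOUSTON", "TX"),
]

# Group on State (the GroupKey field in this scenario).
groups = defaultdict(list)
for row in rows:
    groups[row[2]].append(row)

# Count pairwise comparisons with and without grouping.
ungrouped = len(rows) * (len(rows) - 1) // 2
grouped = sum(len(g) * (len(g) - 1) // 2 for g in groups.values())
print(len(groups), ungrouped, grouped)  # prints: 5 120 23
```

Grouping the 16 sample records by state cuts the comparisons from 120 to 23 while keeping likely duplicates, such as the two Baldwin/Baldwyn records in California, inside the same group.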