Groups in duplicate analysis

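A duplicate analysis mapping can take a long time to run because of the number of data comparisons that the Deduplicate transformation must perform. The number of comparisons depends on the number of data values on the fields that you select, and it grows much faster than the data itself. In the following table, a tenfold increase in data values produces a hundredfold increase in comparisons: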

Number of data values	Number of comparisons
10,000	50 million
100,000	5,000 million
1 million	500,000 million
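
The figures above are consistent with comparing each value once against every other value, which gives n(n-1)/2 comparisons for n values. The following Python sketch reproduces the table under that assumption. The function name is illustrative, and the formula is inferred from the figures in the table, not taken from the Deduplicate transformation itself.

```python
def pairwise_comparisons(n: int) -> int:
    """Return the number of unordered pairs among n values: n * (n - 1) / 2."""
    return n * (n - 1) // 2


# Print the comparison count for each data volume in the table.
for n in (10_000, 100_000, 1_000_000):
    print(f"{n:>9,} values -> {pairwise_comparisons(n):,} comparisons")
```

The exact counts, such as 49,995,000 for 10,000 values, round to the figures in the table.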

A group is a set of records that contain identical values on a field that you specify. The field on which you group the data is the GroupKey field, which you select in the Deduplicate transformation. When you perform duplicate analysis on grouped data, the Deduplicate transformation compares records only with other records in the same group and combines the results from all of the groups into a single output data set. When you choose an appropriate group key, you reduce the overall number of comparisons that the Deduplicate transformation must perform without any meaningful loss of accuracy in the mapping analysis. The following table shows the effect of dividing the data into 10 groups of equal size:

Number of data values	Number of groups	Group size	Total number of comparisons (all groups)
10,000	10	1,000	5 million
100,000	10	10,000	500 million
1 million	10	100,000	50,000 million
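
The grouped totals above can be checked with a short Python sketch. It assumes, as the table does, that the data divides into groups of equal size and that every pair of values within a group is compared once. The function names are illustrative and are not part of the Deduplicate transformation.

```python
def pairwise_comparisons(n: int) -> int:
    """Return the number of unordered pairs among n values."""
    return n * (n - 1) // 2


def grouped_comparisons(total: int, groups: int) -> int:
    """Return the comparisons needed when total values split into equal groups.

    Values are compared only with other values in the same group.
    """
    group_size = total // groups
    return groups * pairwise_comparisons(group_size)


# Compare the ungrouped and grouped counts for each data volume in the table.
for total in (10_000, 100_000, 1_000_000):
    ungrouped = pairwise_comparisons(total)
    grouped = grouped_comparisons(total, groups=10)
    print(f"{total:>9,} values: {ungrouped:,} ungrouped, {grouped:,} in 10 groups")
```

For 10,000 values in 10 groups, the sketch gives 4,995,000 comparisons, roughly one tenth of the ungrouped count.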

Example: Selecting a group key column

Suppose that a bank wants to search for duplicate bank account holders. The bank's customer data set includes columns for customer names and addresses, and the bank chooses Contact as the objective in the deduplicate asset. The bank decides to sort the input records into groups and to perform duplicate analysis on each group. The bank must select a column in the Deduplicate transformation on which to create the groups. The following table shows a sample of the customer data:

Customer ID	Lastname	Firstname	Address1	City	State	Zip	Country
90999990	Armstrong	Al	6121 SUNSET BLVD.	LOS ANGELES	CA	90028	USA
90999907	Baldwin	Lynn	1600 EL CAMINO REAL, SUITE 1500	MENLO PARK	CA	94025	USA
90999917	Baldwyn	Linn	1600 EL CAMINO REAL, #1500	MENLO PK	CA	94025	USA
90999859	Belleperche	Carmen	9255 SUNSET BLVD.	LOS ANGELES	CA	90069	USA
90999876	Clark	Wick	777 S. FIGUEROA	LOS ANGELES	CA	90071	USA
90999859	Bachtin	Guy	30 S. WACKER	CHICAGO	IL	60606	USA
90999868	Dicintio	David	181 WEST MADISON ST	CHICAGO	IL	60602	USA
90999869	Ash	Pascal	335 WEST 16TH STREET	NEW YORK	NY	10011	USA
90999996	Bachtin	David	1633 BROADWAY	NEW YORK	NY	10022	USA
90999994	Carpenter	Brad	30 BROAD ST	NEW YORK	NY	42304	USA
90999820	Dedmond	David	ONE FINANCIAL SQUARE	NEW YORK	NY	10008	USA
90999902	Backwell	Chris	901 SE OAK, WILLAMETTE PLZ	PORTLAND	OR	97214	USA
90999897	Askerup	Nancy	400 MARKET STREET	HOUSTON	TX	77027	USA
90999904	Choy	Shelley	1177 WEST LOOP SOUTH	HOUSTON	TX	77027	USA
90999886	Cote	Lian	530 E. SWEDESFORD RD.	HOUSTON	TX	77027	USA
90999999	Croteau	Paul	3829-55 GASKINS ROAD	HOUSTON	TX	77027	USA

When you select the State column as the GroupKey field, the deduplication operation creates five groups, one for each state in the data. The State column is a good candidate because the likelihood that the bank has customers with the same contact information in different states is very low. Additionally, the data includes a Customer ID column that can add to the confidence of the deduplication process.

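The Customer ID column is a poor candidate for group creation, as it is a primary key field. If you select the column as the GroupKey field, the deduplication operation creates a group for every unique ID and thus for every record. A group that contains a single record gives the operation nothing to compare, so the analysis cannot find any duplicates.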

The Country column is also a poor candidate for group creation, as the column contains the same value in every row. If you select the Country column as the GroupKey field, the deduplication operation adds all of the records to the same group, and the grouping does not reduce the number of comparisons. The bank might also have two or more genuine customers with the same name who live in different parts of the country, and you do not want the operation to identify their records as duplicates.
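
One way to check candidate columns before you choose a group key is to count the groups that each column would create. The following Python sketch does this for the Customer ID, State, and Country values copied from the sample table. The `profile` function and the pair-count formula are illustrative assumptions, not part of the Deduplicate transformation.

```python
from collections import Counter

# (Customer ID, State, Country) values copied from the sample table.
ROWS = [
    ("90999990", "CA", "USA"), ("90999907", "CA", "USA"),
    ("90999917", "CA", "USA"), ("90999859", "CA", "USA"),
    ("90999876", "CA", "USA"), ("90999859", "IL", "USA"),
    ("90999868", "IL", "USA"), ("90999869", "NY", "USA"),
    ("90999996", "NY", "USA"), ("90999994", "NY", "USA"),
    ("90999820", "NY", "USA"), ("90999902", "OR", "USA"),
    ("90999897", "TX", "USA"), ("90999904", "TX", "USA"),
    ("90999886", "TX", "USA"), ("90999999", "TX", "USA"),
]
COLUMNS = {"Customer ID": 0, "State": 1, "Country": 2}


def profile(rows, index):
    """Return (group count, largest group size, within-group comparisons)."""
    sizes = Counter(row[index] for row in rows)
    comparisons = sum(n * (n - 1) // 2 for n in sizes.values())
    return len(sizes), max(sizes.values()), comparisons


# Report how each candidate column would divide the sample records.
for name, index in COLUMNS.items():
    groups, largest, comparisons = profile(ROWS, index)
    print(f"{name:<12} groups={groups:>2} largest={largest:>2} comparisons={comparisons}")
```

On the sample data, State yields 5 groups and 23 comparisons, while Country yields a single group of 16 records and all 120 possible comparisons. Customer ID yields 15 groups rather than 16, because the sample table as printed lists the ID 90999859 on two rows. With one group per record, it would yield no comparisons at all.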