Match Analysis

You can define different types of duplicate analysis in the Match transformation. The duplicate analysis operations that you define depend on the number of data sources in the mapping and the type of information that the sources contain.

Consider the following factors when you configure a Match transformation:

•You can select a single column from a data set or you can select multiple columns.
•You can analyze columns from a single data source or you can analyze two data sources.
•You can configure the Match transformation to analyze the raw data in the input port fields, or you can configure the transformation to analyze the identity information in the data.
•You can configure the Match transformation to write different types of output. The type of output that you select determines the number of records that the transformation writes and the order of the records.
•To increase performance, sort the input records into groups before you perform the match analysis.

Column Analysis

When you configure a Match transformation, you select one or more columns for analysis.

The Match transformation analyzes columns in pairs. When you select a single column for analysis, the transformation creates a temporary copy of the column and compares the source column with the temporary column. When you select two columns for analysis, the transformation compares the values across the two columns that you select. The transformation compares each value in one column with all of the values in the other column. The transformation returns a match score for each pair of values that it analyzes.

You select the columns to analyze when you configure a strategy in the Match transformation. The strategy specifies the columns to analyze and the algorithm to apply to the columns. The algorithm calculates the level of similarity between each pair of values. Different algorithms use different criteria to measure similarity. You can define multiple strategies in a transformation, and you can assign different columns to each strategy.

Column Analysis Example

You want to compare the values in a column of surname data. You create a mapping that includes a data source and a Match transformation. You connect the Surname port to the Match transformation. The transformation creates a temporary copy of the data on the Surname port when the mapping runs.

The following image shows a fragment of the surname data:

The spreadsheet contains two columns of surname data. Column A represents the data on a transformation input port. Column B represents the temporary copy of the data that the transformation generates for match analysis.

The mapping generates a set of match scores that indicate that the following values might be duplicates:

•Baker, Barker
•Barker, Parker
•Smith, Smith

When you review the data, you decide that Baker, Barker, and Parker are not duplicate values. You decide that Smith and Smith are duplicate values.
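
The single-column comparison in this example can be approximated outside the product. The following Python sketch is an illustration only: it uses the standard library class difflib.SequenceMatcher as a stand-in similarity measure, which is an assumption and not one of the Match transformation algorithms. The function names, the sample values, and the 0.8 threshold are hypothetical.

```python
from difflib import SequenceMatcher
from itertools import combinations


def similarity(a: str, b: str) -> float:
    """Stand-in score between 0 and 1. Not a Match transformation algorithm."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def candidate_pairs(values, threshold):
    """Compare each value with every other value once and keep high scores."""
    pairs = []
    for (_, a), (_, b) in combinations(enumerate(values), 2):
        score = similarity(a, b)
        if score >= threshold:
            pairs.append((a, b, round(score, 2)))
    return pairs


# Hypothetical fragment of surname data on a single input port.
surnames = ["Baker", "Barker", "Parker", "Smith", "Smith", "Jones"]
pairs = candidate_pairs(surnames, threshold=0.8)

for a, b, score in pairs:
    print(f"{a}, {b}: {score}")
```

With this stand-in measure, the sketch flags the same three pairs as the example. A score above the threshold identifies candidates only. You still review the pairs to decide which values are true duplicates.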

Single-Source Analysis and Dual-Source Analysis

You can configure the Match transformation to analyze data from one or two data sources. You select the ports from each data source when you define a strategy in the transformation.

When you configure the transformation to perform single-source analysis, you select one or more ports from a single data set. When you configure the transformation to perform dual-source analysis, you select one or more ports from each data set. You select the ports in pairs. For each pair of ports that you select, the transformation compares each value in one port with every value in the other port. If you perform single-source analysis on data from a single column, the transformation creates a temporary copy of the port that you select.
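
As a rough illustration of dual-source analysis, the following Python sketch compares each value on one port with every value on the other port. The similarity function is again difflib.SequenceMatcher from the standard library, used as an assumed stand-in for a match algorithm. The source names and the data values are hypothetical.

```python
from difflib import SequenceMatcher
from itertools import product


def similarity(a: str, b: str) -> float:
    """Stand-in score between 0 and 1. Not a Match transformation algorithm."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def dual_source_scores(port_a, port_b):
    """Score each value in port_a against every value in port_b."""
    return [(a, b, similarity(a, b)) for a, b in product(port_a, port_b)]


# Hypothetical name ports from two data sources.
crm_names = ["John Smith", "Mary Jones"]
billing_names = ["Jon Smith", "M. Jones", "Ann Lee"]

scores = dual_source_scores(crm_names, billing_names)
print(len(scores))  # 2 values x 3 values = 6 comparisons
```

The number of comparisons is the product of the number of values on each port, so the cost grows quickly as either source grows.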

Note: When you perform identity match analysis, you can compare a data source to a persistent index of identity data that you created in an earlier mapping. Use the Match Type options to specify identity analysis with a persistent index.

Field Match Analysis and Identity Match Analysis

You can configure a Match transformation to perform field match analysis or identity match analysis.

In field match analysis, the Match transformation analyzes the source data that enters the transformation. You can perform field match analysis on any type of data. In identity match analysis, the Match transformation generates an index of alternative data values from the input data and analyzes the index data. Configure the Match transformation for identity match analysis when the input ports contain identity data. An identity is a group of data values that identifies a person or an organization.

A data set can represent a single identity in different ways. For example, the following data values represent the name John Smith:

•John Smith
•Smith, John
•jsmith@email.com
•SMITHJMR

The Match transformation reads the identity data in a record and calculates the possible alternative versions of the identity. The transformation creates an index that includes the current versions and alternative versions of the identities. The Match transformation analyzes the index values and not the values in the input records.

Identity Population Files

Identity match operations read reference data files called populations. The population files define the potential variations in the identity data. The files do not install with Informatica applications. You buy and download the population data files from Informatica.

Install the files to a location that the Content Management Service can access. Use Informatica Administrator to set the location on the Content Management Service.

Groups in Match Analysis

A match analysis mapping can take a long time to run because of the number of data comparisons that the transformation must perform. The number of comparisons relates to the number of data values on the ports that you select.

The following table shows the approximate number of comparisons that a mapping performs for different numbers of data values on a single port:

Number of data values	Number of comparisons
10,000	50 million
100,000	5,000 million
1 million	500,000 million
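
The figures in the table are consistent with comparing each unordered pair of values once, which gives n(n - 1)/2 comparisons for n values. The following Python sketch reproduces the figures under that assumption. The function name is hypothetical.

```python
def comparison_count(n: int) -> int:
    """Number of unordered pairs among n values: n * (n - 1) / 2."""
    return n * (n - 1) // 2


for n in (10_000, 100_000, 1_000_000):
    print(f"{n:>9,} values -> {comparison_count(n):,} comparisons")
```

Because the count grows with the square of the number of values, a tenfold increase in data volume produces roughly a hundredfold increase in comparisons.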

To reduce the time that the mapping takes to run, assign the input data records to groups. A group is a set of records that contain identical values on a port that you specify. When you perform match analysis on grouped data, the Match transformation analyzes the records within each group. The transformation does not compare the records in one group with the records in another group. The groups reduce the overall number of comparisons that the transformation must perform. The groups do not reduce the accuracy of the analysis as long as duplicate records share the same group key value.

Consider the following rules and guidelines when you organize data into groups:

•The port on which you group the data is the group key port. A group key port must contain a range of duplicate values, such as a city name or a state name in an address data set. If the mapping data does not contain a usable group key port, use the Key Generator transformation to create the port from the current mapping data. Connect the group key output port from the Key Generator transformation to the Match transformation. You can also use the Key Generator transformation to add sequence identifiers to the mapping data.

•Field match operations must specify a group key port. If you configure the Match transformation for identity analysis, do not select a group key port. The identity analysis generates group keys for the identity index data.
•Do not specify a group key port that you plan to use in the match analysis.
•When you create groups, you must verify that the groups are a valid size. If the groups are too small, the match analysis might not find all the duplicate data in the data set. If the groups are too large, the match analysis might return false duplicates. Select group keys that create an average group size of 10,000 records.
•Groups do not reorder the position of the records in the mapping data set.
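
The effect of a group key on the number of comparisons can be sketched in Python as follows. The sketch is an illustration under simple assumptions and does not represent how the Match transformation or the Key Generator transformation is implemented. The record layout, the city group key, and the function names are hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def group_by_key(records, key):
    """Collect records that share the same value on the group key field."""
    groups = defaultdict(list)
    for record in records:
        groups[record[key]].append(record)
    return groups


def grouped_pairs(records, key):
    """Yield pairs to compare. Records in different groups are never paired."""
    for members in group_by_key(records, key).values():
        yield from combinations(members, 2)


# Hypothetical records. The city field acts as the group key.
records = [
    {"id": 1, "surname": "Smith", "city": "Boston"},
    {"id": 2, "surname": "Smyth", "city": "Boston"},
    {"id": 3, "surname": "Smith", "city": "Denver"},
    {"id": 4, "surname": "Jones", "city": "Denver"},
    {"id": 5, "surname": "Jonas", "city": "Denver"},
]

pairs = [(a["id"], b["id"]) for a, b in grouped_pairs(records, "city")]
print(pairs)       # pairs compared within each group
print(len(pairs))  # 4 comparisons instead of 10 without groups
```

In this data, records 1 and 3 both contain the surname Smith but fall in different groups, so the sketch never compares them. This is the trade-off described in the guidelines above: a group key that splits true duplicates across groups causes the analysis to miss them.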

Match Pairs and Clusters

The Match transformation can read and write different numbers of input rows and output rows, and it can change the sequence of the output rows. You determine the output format for the results of the match analysis.

The transformation can write rows in the following formats:

Matched pairs: The transformation writes a row for every pair of records that match with a score that meets the match threshold. The transformation writes each pair of records to a single row. Because a record might match more than one other record, a record might appear on more than one output row.
Best match: The transformation writes a row for each record in a data set and adds the most similar record from another data set to the same row.
Clusters: The transformation assigns the output records to clusters based on the levels of similarity between the records. A cluster is a set of records in which each record matches at least one other record in the cluster with a score that meets the match threshold. Therefore, a cluster can contain pairs of records that do not match each other. A cluster can also contain a single record if the record does not match any other record. The transformation writes each record to a single row.
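
One way to read the cluster definition is that clusters are the connected groups that form when you link every matched pair. The following Python sketch builds clusters from a list of matched pairs with a small union-find structure. This interpretation is an assumption for illustration and is not a description of the internal implementation of the transformation. The function name and the sample data are hypothetical.

```python
def build_clusters(record_ids, matched_pairs):
    """Group records so that each record links to its cluster through matches."""
    parent = {record_id: record_id for record_id in record_ids}

    def find(x):
        # Follow parent links to the cluster representative, shortening the path.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in matched_pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a  # merge the two clusters

    clusters = {}
    for record_id in record_ids:
        clusters.setdefault(find(record_id), []).append(record_id)
    return list(clusters.values())


# Hypothetical records and the pairs that met the match threshold.
record_ids = ["Baker", "Barker", "Parker", "Jones"]
matched_pairs = [("Baker", "Barker"), ("Barker", "Parker")]

clusters = build_clusters(record_ids, matched_pairs)
print(clusters)  # [['Baker', 'Barker', 'Parker'], ['Jones']]
```

Baker and Parker land in the same cluster although they are not a matched pair, because each of them matches Barker. Jones matches no other record and forms a cluster of one.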

Configure the output options on the Match Output view of the transformation.