Match Analysis
You can define different types of duplicate analysis in the Match transformation. The duplicate analysis operations that you define depend on the number of data sources in the mapping and the type of information that the sources contain.
Consider the following factors when you configure a Match transformation:
- You can select a single column from a data set or you can select multiple columns.
- You can analyze columns from a single data source or you can analyze two data sources.
- You can configure the Match transformation to analyze the raw data in the input port fields, or you can configure the transformation to analyze the identity information in the data.
- You can configure the Match transformation to write different types of output. The type of output that you select determines the number of records that the transformation writes and the order of the records.
- To increase performance, sort the input records into groups before you perform the match analysis.
Column Analysis
When you configure a Match transformation, you select one or more columns for analysis.
The Match transformation analyzes columns in pairs. When you select a single column for analysis, the transformation creates a temporary copy of the column and compares the source column with the temporary column. When you select two columns for analysis, the transformation compares the values across the two columns that you select. The transformation compares each value in one column with all of the values in the other column. The transformation returns a match score for each pair of values that it analyzes.
You select the columns to analyze when you configure a strategy in the Match transformation. The strategy specifies the columns to analyze and the algorithm to apply to the columns. The algorithm calculates the level of similarity between each pair of values. Different algorithms use different criteria to measure similarity. You can define multiple strategies in a transformation, and you can assign different columns to each strategy.
Column Analysis Example
You want to compare the values in a column of surname data. You create a mapping that includes a data source and a Match transformation. You connect the Surname port to the Match transformation. The transformation creates a temporary copy of the data on the Surname port when the mapping runs.
The following image shows a fragment of the surname data:
The mapping generates a set of match scores that indicate that the following values might be duplicates:
- Baker, Barker
- Barker, Parker
- Smith, Smith
When you review the data, you decide that Baker, Barker, and Parker are not duplicate values. You decide that Smith and Smith are duplicate values.
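The single-column comparison in this example can be sketched in a few lines of Python. This is an illustration only: the edit-distance score, the `score_column` helper, and the 0.8 threshold are assumptions chosen for the sketch, not the algorithms or defaults that the Match transformation uses.

```python
from itertools import combinations


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance computed with a rolling row of the DP table."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def similarity(a: str, b: str) -> float:
    """Score in [0, 1]; 1.0 means the two strings are identical."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - edit_distance(a, b) / longest


def score_column(values, threshold):
    """Compare each value in the column with every other value once."""
    pairs = []
    for (_, a), (_, b) in combinations(enumerate(values), 2):
        score = similarity(a, b)
        if score >= threshold:
            pairs.append((a, b, round(score, 3)))
    return pairs


# Hypothetical fragment of the Surname port data.
surnames = ["Baker", "Barker", "Parker", "Smith", "Smith"]
candidate_pairs = score_column(surnames, threshold=0.8)
```

With this sample data and threshold, `candidate_pairs` contains the same three pairs listed above. The scores only flag candidates. A reviewer still decides which pairs are true duplicates.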
Single-Source Analysis and Dual-Source Analysis
You can configure the Match transformation to analyze data from one or two data sources. You select the ports from each data source when you define a strategy in the transformation.
When you configure the transformation to perform single-source analysis, you select one or more ports from a single data set. When you configure the transformation to perform dual-source analysis, you select one or more ports from each data set. You select the ports in pairs. For each pair of ports that you select, the transformation compares each value in one port with every value in the other port. If you perform single-source analysis on data from a single column, the transformation creates a temporary copy of the port that you select.
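As a rough sketch of dual-source analysis, the following Python compares each value in one port with every value in the other port. The port names, the sample values, and the `exact_score` function are hypothetical stand-ins. A real strategy would apply one of the transformation's own algorithms.

```python
from itertools import product


def compare_sources(left, right, score):
    """Score each value in one port against every value in the other port."""
    return [(a, b, score(a, b)) for a, b in product(left, right)]


def exact_score(a: str, b: str) -> float:
    # Stand-in scoring function: 1.0 for a case-insensitive match, else 0.0.
    return 1.0 if a.strip().lower() == b.strip().lower() else 0.0


# Hypothetical ports from two data sources.
crm_surnames = ["Smith", "Baker", "Jones"]
billing_surnames = ["SMITH", "Parker"]

results = compare_sources(crm_surnames, billing_surnames, exact_score)
matches = [(a, b) for a, b, s in results if s == 1.0]
```

Three values on one port and two on the other produce six comparisons. The count grows with the product of the two port sizes.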
Note: When you perform identity match analysis, you can compare a data source to a persistent index of identity data that you created in an earlier mapping. Use the Match Type options to specify identity analysis with a persistent index.
Field Match Analysis and Identity Match Analysis
You can configure a Match transformation to perform field match analysis or identity match analysis.
In field match analysis, the Match transformation analyzes the source data that enters the transformation. You can perform field match analysis on any type of data. In identity match analysis, the Match transformation generates an index of alternative data values from the input data and analyzes the index data. Configure the Match transformation for identity match analysis when the input ports contain identity data. An identity is a group of data values that identifies a person or an organization.
A data set can represent a single identity in different ways. For example, the following data values represent the name John Smith:
- John Smith
- Smith, John
- jsmith@email.com
- SMITHJMR
The Match transformation reads the identity data in a record and calculates the possible alternative versions of the identity. The transformation creates an index that includes the current versions and alternative versions of the identities. The Match transformation analyzes the index values and not the values in the input records.
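The idea of matching on an index of alternative values, and not on the raw input, can be illustrated with a toy Python sketch. Everything here is an assumption made for illustration: the `alternative_keys` rules are a hand-written stand-in for a population file, and the index layout does not reflect how Informatica stores identity index data.

```python
from collections import defaultdict


def alternative_keys(first: str, last: str) -> set:
    """Toy stand-in for a population: derive a few variants of a name."""
    first, last = first.strip().lower(), last.strip().lower()
    return {
        f"{first} {last}",    # john smith
        f"{last}, {first}",   # smith, john
        f"{first[0]}{last}",  # jsmith
    }


def build_index(records):
    """Map every alternative key to the IDs of the records that produce it."""
    index = defaultdict(set)
    for record_id, (first, last) in records.items():
        for key in alternative_keys(first, last):
            index[key].add(record_id)
    return index


# Hypothetical input records: ID -> (first name, last name).
records = {1: ("John", "Smith"), 2: ("JOHN", "SMITH "), 3: ("Jane", "Smith")}
index = build_index(records)

# Keys shared by more than one record point to candidate matches.
shared = {key: ids for key, ids in index.items() if len(ids) > 1}
```

Records 1 and 2 differ in case and whitespace in the raw input but share every key in the index. Record 3 shares only the looser `jsmith` key, so it is a weaker candidate.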
Identity Population Files
Identity match operations read reference data files called populations. The population files define the potential variations in the identity data. The files do not install with Informatica applications. You buy and download the population data files from Informatica.
Install the files to a location that the Content Management Service can access. Use Informatica Administrator to set the location on the Content Management Service.
Groups in Match Analysis
A match analysis mapping can take a long time to run because of the number of data comparisons that the transformation must perform. The number of comparisons relates to the number of data values on the ports that you select.
The following table shows the approximate number of comparisons that a mapping performs for different numbers of data values on a single port:
Number of data values | Number of comparisons |
---|---|
10,000 | 50 million |
100,000 | 5,000 million |
1 million | 500,000 million |
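The figures in the table are consistent with comparing every value with every other value once, which gives n × (n − 1) / 2 comparisons for n values. A short Python check, assuming that is how the figures were derived:

```python
def pairwise_comparisons(n: int) -> int:
    """Number of unordered pairs among n values: n * (n - 1) / 2."""
    return n * (n - 1) // 2


counts = {n: pairwise_comparisons(n) for n in (10_000, 100_000, 1_000_000)}
```

For 10,000 values the exact count is 49,995,000, which the table rounds to 50 million. Each tenfold increase in the number of values multiplies the number of comparisons by roughly one hundred.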
To reduce the time that the mapping takes to run, assign the input data records to groups. A group is a set of records that contain identical values on a port that you specify. When you perform match analysis on grouped data, the Match transformation analyzes the records within each group. The transformation does not compare the records in one group with the records in another group. The groups reduce the overall number of comparisons that the transformation must perform without any loss of accuracy in the mapping analysis.
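The effect of grouping on the number of comparisons can be sketched as follows. The record layout, the `postcode` group key, and the helper names are hypothetical. The sketch only shows that pairs are generated within groups and never across them.

```python
from collections import defaultdict
from itertools import combinations


def group_records(records, key):
    """Assign records to groups that share an identical value on `key`."""
    groups = defaultdict(list)
    for record in records:
        groups[record[key]].append(record)
    return groups


def candidate_pairs(records, key):
    """Yield pairs of records within each group, never across groups."""
    for members in group_records(records, key).values():
        yield from combinations(members, 2)


# Hypothetical input records.
records = [
    {"id": 1, "surname": "Smith", "postcode": "10001"},
    {"id": 2, "surname": "Smyth", "postcode": "10001"},
    {"id": 3, "surname": "Baker", "postcode": "94105"},
    {"id": 4, "surname": "Barker", "postcode": "94105"},
    {"id": 5, "surname": "Smith", "postcode": "60601"},
]

grouped = [(a["id"], b["id"]) for a, b in candidate_pairs(records, "postcode")]
ungrouped_count = len(records) * (len(records) - 1) // 2
```

Grouping on `postcode` reduces ten possible comparisons to two. Records 1 and 5 fall into different groups and are never compared, so the choice of group key determines which pairs the analysis can find.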
Consider the following rules and guidelines when you organize data into groups:
Match Pairs and Clusters
The Match transformation can read and write different numbers of input rows and output rows, and it can change the sequence of the output rows. You determine the output format for the results of the match analysis.
The transformation can write rows in the following formats:
- Matched pairs. The transformation writes a row for every pair of records that match with a score that meets the match threshold. The transformation writes each pair of records to a single row. Because a record might match more than one other record, a record might appear on more than one output row.
- Best match. The transformation writes a row for each record in a data set and adds the most similar record from another data set to the same row.
- Clusters. The transformation assigns the output records to clusters based on the levels of similarity between the records. A cluster is a set of records in which each record matches at least one other record with a score that meets the match threshold. The transformation writes each record to a single row. Because each record needs to match only one other record in the cluster, a cluster can contain pairs of records that do not match each other. A cluster can contain a single record if the record does not match any other record.
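One way to picture the cluster definition is to treat each matched pair as a link and collect the records that are connected through any chain of links. The Python sketch below uses a union-find structure to do this. It is an assumption about how such clusters could be built from matched pairs, not a description of the transformation's internal logic, and the record IDs are hypothetical.

```python
def build_clusters(record_ids, matched_pairs):
    """Group records that are linked, directly or indirectly, by a match."""
    parent = {rid: rid for rid in record_ids}

    def find(rid):
        # Follow parent links to the cluster root, shortening the path as we go.
        while parent[rid] != rid:
            parent[rid] = parent[parent[rid]]
            rid = parent[rid]
        return rid

    for a, b in matched_pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a  # merge the two clusters

    clusters = {}
    for rid in record_ids:
        clusters.setdefault(find(rid), []).append(rid)
    return sorted(clusters.values())


# A matches B and B matches C, but A and C do not meet the threshold.
record_ids = ["A", "B", "C", "D"]
matched_pairs = [("A", "B"), ("B", "C")]
clusters = build_clusters(record_ids, matched_pairs)
```

Here A and C share a cluster although they do not match each other directly, and D forms a cluster of one because it matches no other record.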
Note: The Clusters option in field analysis corresponds to the Clusters - Match All option in identity analysis. The Clusters - Best Match option in identity analysis combines cluster calculations and matched pair calculations.
Configure the output options on the Match Output view of the transformation.