Match Score Calculations
The match score is a numerical value that represents the degree of similarity between two column values. An algorithm calculates a match score as a decimal value in the range 0 through 1. An algorithm assigns a score of 1 when two column values are identical.
When you select multiple column pairs for analysis, the transformation calculates an average score based on the scores in the selected columns. By default, the transformation assigns equal weight to scores from each pair of columns. The transformation does not infer the relative importance of the column data in the data set.
You can edit the weight values that the transformation uses to calculate the match score. Edit the weight values when you want to assign higher or lower priority to columns in the data set.
You can also set the scores that the transformation applies when it finds a null value in a column. By default, the transformation treats null values as data errors and assigns a low match score to any pair of values that contains a null.
Note: The algorithm you select determines the match score between two values. The algorithm generates a single score for the two values. The match scores do not depend on the type of match output or the type of scoring method you select.
Weighted Scores
When you select multiple columns for match analysis, the transformation calculates an average score for each record based on the scores in the columns. The average score includes any weight values that you apply to the comparison algorithms for each column.
By default, all algorithms use a weight value of 0.5. You can increase this value if the selected columns are more likely to contain duplicated information. You can decrease the weight value if duplicate values in the selected columns are less likely to indicate genuine duplicate information between records. The Match transformation uses the average score as the single match score for each pair of records.
Null Match Scores
A match algorithm applies a predefined match score to a pair of values when one or both values are null. You can edit the match score that a field match algorithm applies to null values.
Null Match Scores and Field Match Algorithms
When you configure a field match algorithm, verify the match score values that the algorithm applies to null data. A field match algorithm applies a default score of 0.5 when it compares two values and one or both values are null. A score of 0.5 indicates a low level of similarity between data values.
Consider the following rules and guidelines when you verify the null match scores:
- •When the algorithm analyzes columns that contain primary keys or other critical data, do not edit the default scores. In this case, a null value represents a data error, and the default scores are appropriate for the data.
- •When the algorithm analyzes columns that can optionally contain data, update the null match score values to the same value as the match threshold. You cancel the effect of the null values on the match analysis when you set the null match scores to the match threshold value.
Null Match Scores and Identity Match Algorithms
An identity match algorithm applies a match score of 0 when it compares two values and one or both values are null. Identity match analysis assigns a record with a null match score to a unique record cluster and records a cluster size value of 1. You cannot edit the score that an identity match algorithm applies to null data.
Cluster Output Options
Select a cluster output option when you want to organize similar or identical records in the output data.
When you select a cluster output option, the transformation adds a cluster ID value to each output record. You can sort the records by the cluster ID values. The transformation output includes a row for every record. If a record does not match another record with a score that meets the match threshold, the transformation assigns a unique cluster ID to the record. Use the Match Output view to select or update the cluster output options.
You can select the following cluster output options:
- Clusters
- Select the option to assign cluster ID values to the output records.
- Clusters - Best Match
- Select the option to add the record pair with the highest match score to a cluster. Because a record might represent the best match with more than one other record, more than one record pair can share a cluster ID value.
- Clusters - Match All
- The Clusters - Match All option works in the same way as the Clusters option.
- The transformation uses Clusters - Match All and Clusters - Best Match as option names in identity match analysis.
Note: If a Data Integration Service runs multiple Match transformations concurrently, the Data Integration Service generates unique cluster ID values for the output from each transformation. Therefore, the cluster ID values for the records that each transformation generates can be non-consecutive.
The Clusters Option and the Clusters - Match All Option
Select the Clusters option in field match analysis. Select the Clusters - Match All option in identity match analysis.
The Match transformation uses the following rules to create the clusters:
- •When two records have a match score that meets the match threshold, the Match transformation adds the records to a cluster.
- •When a record in the data set matches any record in the cluster, the transformation adds the record to the cluster.
- •If a record in one cluster matches a record in another cluster, the process merges the clusters.
- •The transformation performs a continual sweep of the match results until all the records belong to a cluster.
- •If a record does not match any other record in the data set, the transformation assigns a unique cluster ID value to the record.
The Clusters - Best Match Option
Select the Clusters - Best Match option in identity match analysis.
The transformation uses the following rules to create the clusters:
- •The transformation identifies the record that has the highest match score with the current record. If the match score meets the threshold, the transformation adds the pair of records to a cluster.
- •If one of the matching records is in a cluster, the transformation adds the other record to the current cluster.
- •The transformation performs a continual sweep of the match score results until all the records belong to a cluster.
- •A cluster can contain a single record if the record does not match any other record in the data.
Note: You can use the Match property on the Match Output view to specify how the transformation compares a single data source to a persistent data store. The Match property determines whether the transformation looks for duplicates within the source data or the persistent data store.
Driver Scores and Link Scores in Cluster Analysis
When you select a cluster output option in the Match transformation, you can add link score and driver score data to the output.
The link score is the score between two records that identifies the records as members of the same cluster. The links between records determine the composition of the cluster. Any record can link to any other record in the same cluster.
The driver score is the score between the record with the highest sequence ID value in a cluster and another record in the same cluster. Driver scores provide a means to assess all records in a cluster against a single record. When you add driver scores to the match output, the mapping runs more slowly, as the Match transformation cannot calculate the driver scores until all the clusters are complete.
Note: Match analysis generates a single set of scores for each strategy that you define. The driver score and the link score indicate the match scores for different pairs of records in each cluster. The driver scores and link scores can depend on the order in which the records enter the transformation. The driver score might be lower than the match threshold.
Cluster Analysis Example
You configure a field match strategy to analyze a column of surname data. You set a match threshold of 0.825 in the strategy. You select a clustered output format, and you run the Data Viewer on the transformation.
The following table shows the data that the Data Viewer displays:
Surname | Sequence ID | Cluster ID | Cluster Size | Driver ID | Driver Score | Link ID | Link Score |
---|
SMITH | 1 | 1 | 2 | 1 - 6 | 1 | 1 - 1 | 1 |
SMYTH | 2 | 2 | 2 | 1 - 3 | 0.83333 | 1 - 2 | 1 |
SMYTHE | 3 | 2 | 2 | 1 - 3 | 1 | 1 - 2 | 0.83333 |
SMITT | 4 | 3 | 1 | 1 - 4 | 1 | 1 - 4 | 1 |
SMITS | 5 | 4 | 1 | 1 - 5 | 1 | 1 - 5 | 1 |
SMITH | 6 | 1 | 2 | 1 - 6 | 1 | 1 - 1 | 1 |
The Data Viewer contains the following information about the surname data:
- •SMITT and SMITS do not match any record with a score that meets the match threshold. The Match transformation determines that the records are unique in the data set.
SMITT and SMITS have a cluster size of 1. To find unique records in cluster output, search for clusters that contain a single record.
- •SMITH and SMITH have a link score of 1. The Match transformation determines that the records are identical. The transformation adds the records to a single cluster.
- •SMYTH and SMYTHE have a link score of 0.83333. The score exceeds the match threshold. Therefore, the transformation adds the records to a single cluster.