Developer Transformation Guide > Match Transformation > Match Score Calculations
  

Match Score Calculations

The match score is a numerical value that represents the degree of similarity between two column values. An algorithm calculates a match score as a decimal value in the range 0 through 1. An algorithm assigns a score of 1 when two column values are identical.
When you select multiple column pairs for analysis, the transformation calculates an average score based on the scores in the selected columns. By default, the transformation assigns equal weight to scores from each pair of columns. The transformation does not infer the relative importance of the column data in the data set.
You can edit the weight values that the transformation uses to calculate the match score. Edit the weight values when you want to assign higher or lower priority to columns in the data set.
You can also set the scores that the transformation applies when it finds a null value in a column. By default, the transformation treats null values as data errors and assigns a low match score to any pair of values that contains a null.
Note: The algorithm you select determines the match score between two values. The algorithm generates a single score for the two values. The match scores do not depend on the type of match output or the type of scoring method you select.

Weighted Scores

When you select multiple columns for match analysis, the transformation calculates an average score for each record based on the scores in the columns. The average score includes any weight values that you apply to the comparison algorithms for each column.
By default, all algorithms use a weight value of 0.5. You can increase this value if the selected columns are more likely to contain duplicated information. You can decrease the weight value if duplicate values in the selected columns are less likely to indicate genuine duplicate information between records. The Match transformation uses the average score as the single match score for each pair of records.

Null Match Scores

A match algorithm applies a predefined match score to a pair of values when one or both values are null. You can edit the match score that a field match algorithm applies to null values.

Null Match Scores and Field Match Algorithms

When you configure a field match algorithm, verify the match score values that the algorithm applies to null data. A field match algorithm applies a default score of 0.5 when it compares two values and one or both values are null. A score of 0.5 indicates a low level of similarity between data values.
Consider the following rules and guidelines when you verify the null match scores:

Null Match Scores and Identity Match Algorithms

An identity match algorithm applies a match score of 0 when it compares two values and one or both values are null. Identity match analysis assigns a record with a null match score to a unique record cluster and records a cluster size value of 1. You cannot edit the score that an identity match algorithm applies to null data.

Cluster Output Options

Select a cluster output option when you want to organize similar or identical records in the output data.
When you select a cluster output option, the transformation adds a cluster ID value to each output record. You can sort the records by the cluster ID values. The transformation output includes a row for every record. If a record does not match another record with a score that meets the match threshold, the transformation assigns a unique cluster ID to the record. Use the Match Output view to select or update the cluster output options.
You can select the following cluster output options:
Clusters
Select the option to assign cluster ID values to the output records.
Clusters - Best Match
Select the option to add the record pair with the highest match score to a cluster. Because a record might represent the best match with more than one other record, more than one record pair can share a cluster ID value.
Clusters - Match All
The Clusters - Match All option works in the same way as the Clusters option.
The transformation uses Clusters - Match All and Clusters - Best Match as option names in identity match analysis.
Note: If a Data Integration Service runs multiple Match transformations concurrently, the Data Integration Service generates unique cluster ID values for the output from each transformation. Therefore, the cluster ID values for the records that each transformation generates can be non-consecutive.

The Clusters Option and the Clusters - Match All Option

Select the Clusters option in field match analysis. Select the Clusters - Match All option in identity match analysis.
The Match transformation uses the following rules to create the clusters:

The Clusters - Best Match Option

Select the Clusters - Best Match option in identity match analysis.
The transformation uses the following rules to create the clusters:
Note: You can use the Match property on the Match Output view to specify how the transformation compares a single data source to a persistent data store. The Match property determines whether the transformation looks for duplicates within the source data or the persistent data store.

Driver Scores and Link Scores in Cluster Analysis

When you select a cluster output option in the Match transformation, you can add link score and driver score data to the output.
The link score is the score between two records that identifies the records as members of the same cluster. The links between records determine the composition of the cluster. Any record can link to any other record in the same cluster.
The driver score is the score between the record with the highest sequence ID value in a cluster and another record in the same cluster. Driver scores provide a means to assess all records in a cluster against a single record. When you add driver scores to the match output, the mapping runs more slowly, as the Match transformation cannot calculate the driver scores until all the clusters are complete.
Note: Match analysis generates a single set of scores for each strategy that you define. The driver score and the link score indicate the match scores for different pairs of records in each cluster. The driver scores and link scores can depend on the order in which the records enter the transformation. The driver score might be lower than the match threshold.

Cluster Analysis Example

You configure a field match strategy to analyze a column of surname data. You set a match threshold of 0.825 in the strategy. You select a clustered output format, and you run the Data Viewer on the transformation.
The following table shows the data that the Data Viewer displays:
Surname
Sequence ID
Cluster ID
Cluster Size
Driver ID
Driver Score
Link ID
Link Score
SMITH
1
1
2
1 - 6
1
1 - 1
1
SMYTH
2
2
2
1 - 3
0.83333
1 - 2
1
SMYTHE
3
2
2
1 - 3
1
1 - 2
0.83333
SMITT
4
3
1
1 - 4
1
1 - 4
1
SMITS
5
4
1
1 - 5
1
1 - 5
1
SMITH
6
1
2
1 - 6
1
1 - 1
1
The Data Viewer contains the following information about the surname data: