Field Matching Strategies
The Comparison transformation includes predefined field matching strategies that compare pairs of input data fields.
Bigram
Use the Bigram algorithm to compare long text strings, such as postal addresses entered in a single field.
The Bigram algorithm calculates a match score for two data strings based on the occurrence of consecutive characters in both strings. The algorithm looks for pairs of consecutive characters that are common to both strings. It then divides the number of matching pairs, counted in both strings, by the total number of character pairs in both strings.
Bigram Example
Consider the following strings:
- larder
- lerder
These strings yield the following Bigram groups:
l a, a r, r d, d e, e r
l e, e r, r d, d e, e r
Note that the second occurrence of the string "e r" within the string "lerder" is not matched, as there is no corresponding second occurrence of "e r" in the string "larder".
To calculate the Bigram match score, the transformation divides the number of matching pairs (6, because three pairs match and each is counted once in each string) by the total number of pairs in both strings (10). In this example, the strings are 60% similar and the match score is 0.60.
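The calculation above can be sketched in Python. This is a minimal illustration of the described arithmetic, not the transformation's actual implementation; the function names `bigrams` and `bigram_score` are hypothetical, and returning 0.0 for strings too short to form a pair is an assumption.

```python
from collections import Counter

def bigrams(text):
    """Return the consecutive character pairs in text, in order."""
    return [text[i:i + 2] for i in range(len(text) - 1)]

def bigram_score(a, b):
    """Matching pairs (counted in both strings) divided by all pairs in both strings."""
    pairs_a = Counter(bigrams(a))
    pairs_b = Counter(bigrams(b))
    total = sum(pairs_a.values()) + sum(pairs_b.values())
    if total == 0:
        # Assumption: strings too short to form any pair score 0.0.
        return 0.0
    # Multiset intersection: a repeated pair such as "er" matches only as
    # many times as it occurs in both strings.
    shared = sum((pairs_a & pairs_b).values())
    return 2 * shared / total

print(bigram_score("larder", "lerder"))  # 0.6
```

The multiset intersection is what keeps the second "e r" in "lerder" from being counted as a match.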
Hamming Distance
Use the Hamming Distance algorithm when the position of the data characters is a critical factor, for example in numeric or code fields such as telephone numbers, ZIP Codes, or product codes.
The Hamming Distance algorithm calculates a match score for two data strings by computing the number of positions in which characters differ between the data strings. For strings of different length, each additional character in the longest string is counted as a difference between the strings.
Hamming Distance Example
Consider the following strings:
The highlighted characters indicate the positions that the Hamming algorithm identifies as different.
To calculate the Hamming match score, the transformation divides the number of matching characters (5) by the length of the longest string (8). In this example, the strings are 62.5% similar and the match score is 0.625.
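As a rough sketch of this scoring rule in Python: the function name `hamming_score` and the sample values are hypothetical and are not the strings from the example above. Treating two empty strings as a perfect match is also an assumption.

```python
def hamming_score(a, b):
    """Share of positions, out of the longer string's length, where characters match."""
    longest = max(len(a), len(b))
    if longest == 0:
        # Assumption: two empty strings are treated as identical.
        return 1.0
    # zip stops at the end of the shorter string, so every extra character
    # in the longer string counts as a difference.
    matches = sum(1 for x, y in zip(a, b) if x == y)
    return matches / longest

print(hamming_score("90210", "90215"))    # 4 of 5 positions match
print(hamming_score("12345", "1234567"))  # 5 matches over a length of 7
```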
Edit Distance
Use the Edit Distance algorithm to compare words or short text strings, such as names.
The Edit Distance algorithm calculates the minimum “cost” of transforming one string to another by inserting, deleting, or replacing characters.
Edit Distance Example
Consider the following strings:
The highlighted characters indicate the operations required to transform one string into the other.
The Edit Distance algorithm divides the number of unchanged characters (8) by the length of the longest string (11). In this example, the strings are 72.7% similar and the match score is 0.727.
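The sketch below shows one way to compute such a score in Python. It assumes that every insertion, deletion, and replacement costs 1 (the classic Levenshtein distance) and that the number of unchanged characters equals the length of the longest string minus that distance. Both are assumptions, since the cost model is not stated here. The function names and sample strings are hypothetical.

```python
def edit_distance(a, b):
    """Minimum number of single-character inserts, deletes, or replacements."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # delete ca
                current[j - 1] + 1,      # insert cb
                previous[j - 1] + cost,  # replace ca with cb, or keep it
            ))
        previous = current
    return previous[-1]

def edit_distance_score(a, b):
    """Assumed scoring: (longest length - edit distance) / longest length."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return (longest - edit_distance(a, b)) / longest

print(edit_distance("kitten", "sitting"))  # 3
print(edit_distance_score("kitten", "sitting"))
```

Under these assumptions, a longest string of 11 characters with three edits leaves 8 unchanged characters, which agrees with the 0.727 score in the example.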
Jaro Distance
Use the Jaro Distance algorithm to compare two strings when the similarity of the initial characters in the strings is a priority.
The Jaro Distance match score reflects the degree of similarity between the first four characters of both strings and the number of identified character transpositions. The transformation weights the importance of the match between the first four characters by using the value that you enter in the Penalty property.
Jaro Distance Properties
When you configure a Jaro Distance algorithm, you can configure the following properties:
- Penalty: Determines the match score penalty if the first four characters in two compared strings are not identical. The transformation subtracts the full penalty value for a first-character mismatch. The transformation subtracts fractions of the penalty based on the position of the other mismatched characters. The default penalty value is 0.20.
- Case Sensitive: Determines whether the Jaro Distance algorithm considers character case when it compares characters.
Jaro Distance Example
Consider the following strings:
If you use the default Penalty value of 0.20 to analyze these strings, the Jaro Distance algorithm returns a match score of 0.513. This match score indicates that the strings are 51.3% similar.
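For orientation only, the following Python sketch implements the textbook Jaro similarity, which scores strings by matched characters and transpositions. It does not model the Penalty property or the weighting of the first four characters described above, so it will not reproduce the 0.513 score. The function name, the `case_sensitive` parameter, and the sample strings are hypothetical.

```python
def jaro_similarity(a, b, case_sensitive=True):
    """Textbook Jaro similarity; no first-four-character penalty is applied."""
    if not case_sensitive:
        a, b = a.lower(), b.lower()
    if a == b:
        return 1.0
    if not a or not b:
        return 0.0
    # Characters match only if they sit within this distance of each other.
    window = max(max(len(a), len(b)) // 2 - 1, 0)
    a_matched = [False] * len(a)
    b_matched = [False] * len(b)
    matches = 0
    for i, ch in enumerate(a):
        lo = max(0, i - window)
        hi = min(len(b), i + window + 1)
        for j in range(lo, hi):
            if not b_matched[j] and b[j] == ch:
                a_matched[i] = b_matched[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Matched characters that appear in a different order are transpositions;
    # each swapped pair counts once.
    a_seq = [c for c, m in zip(a, a_matched) if m]
    b_seq = [c for c, m in zip(b, b_matched) if m]
    transpositions = sum(x != y for x, y in zip(a_seq, b_seq)) // 2
    return (matches / len(a) + matches / len(b)
            + (matches - transpositions) / matches) / 3

print(round(jaro_similarity("MARTHA", "MARHTA"), 4))  # 0.9444
```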
Reverse Hamming Distance
Use the Reverse Hamming Distance algorithm to calculate the percentage of character positions that differ between two strings, reading from right to left.
The Reverse Hamming Distance algorithm calculates a match score for two data strings by computing the number of positions in which characters differ between the data strings, comparing positions from right to left. For strings of different length, the algorithm counts each additional character in the longest string as a difference between the strings.
Reverse Hamming Distance Example
Consider the following strings, which use right-to-left alignment to mimic the Reverse Hamming algorithm:
- 1-999-9999
- 011-01-999-9991
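When the strings are aligned on the right, the Reverse Hamming Distance algorithm identifies the following positions as different: the final character (9 in the first string, 1 in the second) and the five additional leading characters (011-0) in the longer string.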
To calculate the Reverse Hamming match score, the transformation divides the number of matching characters (9) by the length of the longest string (15). In this example, the match score is 0.6, indicating that the strings are 60% similar.
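The same arithmetic can be sketched in Python by comparing the strings from their last characters backward. The function name `reverse_hamming_score` is hypothetical, and treating two empty strings as a perfect match is an assumption.

```python
def reverse_hamming_score(a, b):
    """Hamming-style score with the strings aligned on their right-hand ends."""
    longest = max(len(a), len(b))
    if longest == 0:
        # Assumption: two empty strings are treated as identical.
        return 1.0
    # Pair characters from the right; unpaired leading characters in the
    # longer string count as differences.
    matches = sum(1 for x, y in zip(reversed(a), reversed(b)) if x == y)
    return matches / longest

print(reverse_hamming_score("1-999-9999", "011-01-999-9991"))  # 0.6
```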