
Field Match Strategies

The Strategies view lists the strategies that you define for the input data.
The strategies determine how the transformation measures the similarities and differences between the data source records.

Field Match Algorithms

The Match transformation includes algorithms that compare data values across two columns. Each algorithm calculates the degree of difference between data values in a different way.
Select an algorithm that can measure the types of data difference that you expect to find in the columns that you select.

Bigram

Use the Bigram algorithm to compare long text strings, such as postal addresses entered in a single field.
The Bigram algorithm calculates a match score for two data strings based on the occurrence of consecutive characters in both strings. The algorithm looks for pairs of consecutive characters that are common to both strings. It divides the number of pairs that match in both strings by the total number of character pairs in both strings.

Bigram Example

Consider the following strings:
larder
lerder
These strings yield the following Bigram groups:
l a, a r, r d, d e, e r
l e, e r, r d, d e, e r
Note that the second occurrence of the string "e r" within the string "lerder" is not matched, as there is no corresponding second occurrence of "e r" in the string "larder".
To calculate the Bigram match score, the transformation divides the number of matching pairs counted across both strings (6, that is, the three matched pairs "r d", "d e", and "e r" in each string) by the total number of pairs in both strings (10). In this example, the strings are 60% similar and the match score is 0.60.
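The calculation above can be sketched in Python. This is a minimal illustration and not the transformation's actual implementation: the function names `bigrams` and `bigram_score` are hypothetical, the comparison is assumed to be case-sensitive, and returning 0.0 for strings that are too short to form a pair is an assumption.

```python
from collections import Counter


def bigrams(text):
    """Count the consecutive character pairs in text, keeping repeats."""
    return Counter(text[i:i + 2] for i in range(len(text) - 1))


def bigram_score(a, b):
    """Matched pairs across both strings divided by all pairs in both strings."""
    pairs_a, pairs_b = bigrams(a), bigrams(b)
    total = sum(pairs_a.values()) + sum(pairs_b.values())
    if total == 0:
        return 0.0  # assumption: no pairs to compare means no similarity
    # Counter intersection keeps the minimum count of each pair, so a
    # repeated pair such as "er" is matched only as often as it occurs
    # in both strings.
    shared = sum((pairs_a & pairs_b).values())
    # Each shared pair is a match in both strings, so it counts twice.
    return 2 * shared / total


print(bigram_score("larder", "lerder"))  # 0.6
```

The multiset intersection reproduces the behavior described in the note: the second "e r" in "lerder" has no partner in "larder" and is not counted.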

Hamming Distance

Use the Hamming Distance algorithm when the position of the data characters is a critical factor, for example in numeric or code fields such as telephone numbers, ZIP Codes, or product codes.
The Hamming Distance algorithm calculates a match score for two data strings by computing the number of positions in which characters differ between the data strings. For strings of different length, each additional character in the longer string is counted as a difference between the strings.

Hamming Distance Example

Consider the following strings:
The highlighted characters indicate the positions that the Hamming algorithm identifies as different.
To calculate the Hamming match score, the transformation divides the number of matching characters (5) by the length of the longest string (8). In this example, the strings are 62.5% similar and the match score is 0.625.
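The position-by-position comparison can be sketched as follows. This is a minimal illustration under stated assumptions: the function name `hamming_score` is hypothetical, the strings "Morlow" and "Marlowes" are illustrative values chosen to give the same 5/8 ratio and are not necessarily the strings from the example, and treating two empty strings as identical is an assumption.

```python
def hamming_score(a, b):
    """Matching positions divided by the length of the longer string."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0  # assumption: two empty strings are treated as identical
    # zip stops at the end of the shorter string, so every extra character
    # in the longer string counts as a difference.
    matches = sum(1 for x, y in zip(a, b) if x == y)
    return matches / longest


print(hamming_score("Morlow", "Marlowes"))  # 0.625
```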

Edit Distance

Use the Edit Distance algorithm to compare words or short text strings, such as names.
The Edit Distance algorithm calculates the minimum “cost” of transforming one string to another by inserting, deleting, or replacing characters.

Edit Distance Example

Consider the following strings:
The highlighted characters indicate the operations required to transform one string into the other.
The Edit Distance algorithm divides the number of unchanged characters (8) by the length of the longest string (11). In this example, the strings are 72.7% similar and the match score is 0.727.
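One way to arrive at a score of this form is to compute a Levenshtein distance with a cost of 1 for each insertion, deletion, or replacement, and then treat the length of the longer string minus that distance as the number of unchanged characters. The sketch below makes those assumptions explicit. The function names are hypothetical, the unit costs are an assumption, and the strings "Levenston" and "Levenshtein" are illustrative values chosen to give the same 8/11 ratio.

```python
def levenshtein(a, b):
    """Minimum number of single-character inserts, deletes, or replacements."""
    prev = list(range(len(b) + 1))  # distance from "" to each prefix of b
    for i, ca in enumerate(a, 1):
        curr = [i]  # distance from a[:i] to ""
        for j, cb in enumerate(b, 1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # replace or keep
        prev = curr
    return prev[-1]


def edit_distance_score(a, b):
    """(longer length - edit distance) / longer length, an assumed formula."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0  # assumption: two empty strings are treated as identical
    return (longest - levenshtein(a, b)) / longest


print(round(edit_distance_score("Levenston", "Levenshtein"), 3))  # 0.727
```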

Jaro Distance

Use the Jaro Distance algorithm to compare two strings when the similarity of the initial characters in the strings is a priority.
The Jaro Distance match score reflects the degree of similarity between the first four characters of both strings and the number of identified character transpositions. The transformation weights the importance of the match between the first four characters by using the value that you enter in the Penalty property.

Jaro Distance Properties

When you configure a Jaro Distance algorithm, you can configure the following properties:
Penalty
Determines the match score penalty if the first four characters in two compared strings are not identical. The transformation subtracts the full penalty value for a first-character mismatch. The transformation subtracts fractions of the penalty based on the position of the other mismatched characters. The default penalty value is 0.20.
Case Sensitive
Determines whether the Jaro Distance algorithm considers character case when it compares characters.

Jaro Distance Example

Consider the following strings:
If you use the default Penalty value of 0.20 to analyze these strings, the Jaro Distance algorithm returns a match score of 0.513. This match score indicates that the strings are 51.3% similar.

Reverse Hamming Distance

Use the Reverse Hamming Distance algorithm to calculate the percentage of character positions that differ between two strings, reading from right to left.
The Reverse Hamming Distance algorithm calculates a match score for two data strings by computing the number of positions in which characters differ between the data strings, comparing the characters from right to left. For strings of different length, the algorithm counts each additional character in the longer string as a difference between the strings.

Reverse Hamming Distance Example

Consider the following strings, which use right-to-left alignment to mimic the Reverse Hamming algorithm:
The highlighted characters indicate the positions that the Reverse Hamming Distance algorithm identifies as different.
To calculate the Reverse Hamming match score, the transformation divides the number of matching characters (9) by the length of the longest string (15). In this example, the match score is 0.6, indicating that the strings are 60% similar.
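The right-to-left comparison can be sketched by walking both strings from their last characters. This is a minimal illustration and not the transformation's implementation: the function name `reverse_hamming_score` is hypothetical, and the two telephone-style strings are illustrative values chosen to give the same 9/15 ratio.

```python
def reverse_hamming_score(a, b):
    """Matching positions, aligned from the right, over the longer length."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0  # assumption: two empty strings are treated as identical
    # Reversing both strings aligns them on their final characters; the
    # unmatched leading characters of the longer string count as differences.
    matches = sum(1 for x, y in zip(reversed(a), reversed(b)) if x == y)
    return matches / longest


print(reverse_hamming_score("1-999-9999", "011-01-999-9991"))  # 0.6
```

A left-to-right Hamming comparison of the same pair would align the leading "1" with "0" and score far lower, which is why right alignment suits values that carry optional prefixes.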

Field Match Strategy Properties

Open the Strategy wizard on the Strategies view and configure the properties for each field match strategy.
When you configure a field match strategy, you can configure the following properties:
Name
Identifies the strategy by name.
Weight
Determines the relative priority assigned to the match score when the overall score for the record is calculated. Default is 0.5.
Single Field Null
Defines the match score that the algorithm applies to a pair of data values when one value is null. Default is 0.5.
Both Fields Null
Defines the match score that the algorithm applies to a pair of data values when both values are null. Default is 0.5.
Note: A match algorithm does not calculate a match score when one or both of the matched column values are null. The algorithm applies the scores defined in the null match properties. You cannot clear the null match properties.
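The interaction between the null properties and the weights can be sketched as follows. The null handling follows the note above: the algorithm is skipped when a value is null and the configured score is applied instead. The way the per-strategy scores combine is not specified here, so the weighted average in `record_score` is purely an assumption for illustration. All function and parameter names are hypothetical, and only Python `None` is treated as null.

```python
def field_score(a, b, compare, single_null=0.5, both_null=0.5):
    """Apply the null scores, or call the match algorithm when both values exist."""
    if a is None and b is None:
        return both_null
    if a is None or b is None:
        return single_null
    return compare(a, b)


def record_score(scored_fields):
    """Combine (score, weight) pairs. A weighted average is an ASSUMPTION."""
    total_weight = sum(weight for _, weight in scored_fields)
    if total_weight == 0:
        return 0.0  # assumption: no weighted strategies means no score
    return sum(score * weight for score, weight in scored_fields) / total_weight


def exact(a, b):
    """Stand-in comparator for the example; any match algorithm could be used."""
    return 1.0 if a == b else 0.0


name = field_score("Ann", "Ann", exact)   # both values present: 1.0
city = field_score(None, "Boston", exact)  # one value null: 0.5
print(record_score([(name, 0.5), (city, 0.5)]))  # 0.75
```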