Reference Data Use in the Labeler Transformation

When you add a reference data object to a Labeler transformation strategy, the transformation searches the strategy's input data for values that appear in the reference data object. The transformation replaces any value it finds with a valid value from the reference data object, or with a value that you specify.

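The following table describes the types of reference data that you can use in a Labeler transformation:

Reference Data Type	Description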
Character sets	Identifies different types of characters, such as letters, numbers, and punctuation symbols. Use in character labeling operations.
Probabilistic models	Adds fuzzy match capabilities to token labeling operations. The transformation can use a probabilistic model to infer the type of information in a string. To enable the fuzzy match capabilities, you compile the probabilistic model in the Developer tool. Use in token labeling operations.
Reference tables	Finds strings that match the entries in a database table. Use in token labeling and character labeling operations.
Regular expressions	Identifies strings that match conditions that you define. You can use a regular expression to find a string within a larger string. Use in token labeling operations.
Token sets	Identifies strings based on the types of information they contain. Informatica installs with token sets that contain different types of token definitions, such as word, telephone number, post code, and product code definitions. Use in token labeling operations.

Character Sets

Character ranges specify a sequential range of character codes. For example, the character range "[A-C]" matches the uppercase characters "A," "B," and "C." This character range does not match the lowercase characters "a," "b," or "c."
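The case-sensitive behavior of a character range can be illustrated outside the Developer tool with a short Python sketch. The function name `in_range` is hypothetical and is not part of the Labeler transformation. It only shows that a range such as "[A-C]" is a comparison on character codes.

```python
def in_range(ch: str, start: str = "A", end: str = "C") -> bool:
    """Return True when ch is a single character whose code lies in start..end."""
    return len(ch) == 1 and ord(start) <= ord(ch) <= ord(end)

# Uppercase A, B, and C fall inside the range; lowercase letters and D do not.
matches = [c for c in "AaBbCcD" if in_range(c)]
print(matches)
```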

Use character sets to identify a specific character or range of characters as part of token parsing or labeling operations. For example, you can label all numerals in a column that contains telephone numbers. After labeling the numbers, you can identify patterns with a Parser transformation and write problematic patterns to separate outputs.
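As a rough illustration of the telephone number example, the following Python sketch replaces each character with a label so that the overall pattern of a value becomes visible. The label characters "9" and "X" and the function name `label_characters` are assumptions chosen for this sketch. They are not taken from the Labeler transformation configuration.

```python
def label_characters(value: str) -> str:
    """Replace ASCII digits with '9' and ASCII letters with 'X'; keep other characters."""
    out = []
    for ch in value:
        if "0" <= ch <= "9":
            out.append("9")          # numeral
        elif "A" <= ch <= "Z" or "a" <= ch <= "z":
            out.append("X")          # letter
        else:
            out.append(ch)           # punctuation and spaces pass through unchanged
    return "".join(out)

phones = ["555-0100", "555 0100 ext", "(212) 555-0199"]
patterns = {p: label_characters(p) for p in phones}
for original, pattern in patterns.items():
    print(original, "->", pattern)
```

A downstream step could then compare each pattern against the formats that you expect and route the values that do not conform to a separate output.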

Probabilistic Models

A probabilistic model contains reference data values and label values. The reference data values represent the data on an input port that you connect to the transformation. The label values describe the types of information that the reference data values contain. You assign a label to each reference data value in the model.

To link the reference data values to the labels in a probabilistic model, you compile the model. The compilation process generates a series of logical associations between the data values and the labels. When you run a mapping that reads the model, the Data Integration Service applies the model logic to the transformation input data. The Data Integration Service returns the label that most accurately describes the input data values.
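The relationship between compiling a model and applying it can be pictured with a deliberately simplified Python sketch. This is a plain frequency lookup, not the logic that the Data Integration Service applies, and it does not reproduce the fuzzy matching behavior. The sample rows, the function names `compile_model` and `best_label`, and the "UNKNOWN" default are all hypothetical.

```python
from collections import Counter, defaultdict

# Hypothetical model rows: (reference data value, label assigned to it).
ROWS = [
    ("John", "FIRSTNAME"),
    ("Mary", "FIRSTNAME"),
    ("Smith", "SURNAME"),
    ("Smith", "SURNAME"),
    ("Smith", "FIRSTNAME"),
]

def compile_model(rows):
    """Count how often each label is assigned to each (lowercased) value."""
    counts = defaultdict(Counter)
    for value, label in rows:
        counts[value.lower()][label] += 1
    return counts

def best_label(model, value, default="UNKNOWN"):
    """Return the most frequent label for value, or default for unseen values."""
    counter = model.get(value.lower())
    return counter.most_common(1)[0][0] if counter else default

model = compile_model(ROWS)
print(best_label(model, "smith"))
print(best_label(model, "Zed"))
```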

You create a probabilistic model in the Developer tool. The Model repository stores the probabilistic model object. The Developer tool writes the data values, the labels, and the compilation data to a file in the Informatica directory structure.

Reference Tables

A reference table is a database table that contains at least two columns. One column contains the standard or required version of a data value, and other columns contain alternative versions of the value. When you add a reference table to a transformation, the transformation searches the input port data for values that also appear in the table. You can create tables with any data that is useful to the data project you work on.
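The column layout described above can be sketched in Python as a lookup from every version of a value to its standard version. The table contents, the function names, and the case-insensitive comparison are assumptions made for this illustration. An actual reference table is stored in a database and managed through the Informatica tools.

```python
# Hypothetical reference table rows: the first column holds the standard value,
# and the remaining columns hold alternative versions of that value.
REFERENCE_ROWS = [
    ("Street", "St", "St.", "Str"),
    ("Avenue", "Ave", "Ave.", "Av"),
]

def build_lookup(rows):
    """Map every version of a value (lowercased) to its standard version."""
    lookup = {}
    for standard, *alternatives in rows:
        for version in (standard, *alternatives):
            lookup[version.lower()] = standard
    return lookup

def standardize(tokens, lookup):
    """Replace tokens found in the table with the standard value; keep the rest."""
    return [lookup.get(token.lower(), token) for token in tokens]

lookup = build_lookup(REFERENCE_ROWS)
result = standardize("12 Main St.".split(), lookup)
print(result)
```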

Regular Expressions

In the context of labeling operations, a regular expression is an expression that you can use to identify a specific string in input data. You can use regular expressions in Labeler transformations that use token labeling mode.
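To show how a regular expression can pick out a string within a larger string, the following Python sketch uses the standard `re` module. The ZIP code pattern, the "ZIP" label, and the function name `label_with_regex` are hypothetical examples. The regular expression syntax that the Labeler transformation accepts may differ from Python's.

```python
import re

# Hypothetical pattern: five digits, optionally followed by a hyphen and four digits.
ZIP_PATTERN = re.compile(r"\b\d{5}(?:-\d{4})?\b")

def label_with_regex(text: str, label: str = "ZIP") -> str:
    """Replace every substring that matches the pattern with the label."""
    return ZIP_PATTERN.sub(label, text)

source = "Springfield IL 62704-1234"
found = ZIP_PATTERN.search(source)
labeled = label_with_regex(source)
print(found.group() if found else None)
print(labeled)
```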

Token Sets

Use token sets to identify specific tokens as part of token labeling operations. For example, you can use a token set to label all email addresses that use an "AccountName@DomainName" format. After labeling the tokens, you can use the Parser transformation to write email addresses to output ports that you specify.
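The email example can be approximated in Python by splitting a string into tokens and labeling each one. The simplified email pattern, the "EMAIL" and "OTHER" labels, and the function name `label_tokens` are assumptions for this sketch. They do not reflect the token definitions that Informatica installs.

```python
import re

# Simplified, hypothetical definition of the "AccountName@DomainName" format.
EMAIL = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")

def label_tokens(text: str):
    """Split text on whitespace and pair each token with a label."""
    return [(tok, "EMAIL" if EMAIL.match(tok) else "OTHER") for tok in text.split()]

labeled = label_tokens("Contact jane.doe@example.com today")
emails = [tok for tok, label in labeled if label == "EMAIL"]
print(labeled)
print(emails)
```

Collecting the tokens by label, as the `emails` list does here, loosely mirrors the step of writing the labeled values to a dedicated output.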