Reference Data Use in the Labeler Transformation

Informatica Developer installs with different types of reference data objects that you can use with the Labeler transformation. You can also create reference data objects.
When you add a reference data object to a Labeler transformation strategy, the transformation searches the input data on the strategy for values in the reference data object. The transformation replaces any value it finds with a valid value from the reference data object, or with a value that you specify.
The following table describes the types of reference data you can use:

Reference Data Type | Description
Character sets | Identifies different types of characters, such as letters, numbers, and punctuation symbols. Use in character labeling operations.
Probabilistic models | Adds fuzzy match capabilities to token label operations. The transformation can use a probabilistic model to infer the type of information in a string. To enable the fuzzy match capabilities, you compile the probabilistic model in the Developer tool. Use in token labeling operations.
Reference tables | Finds strings that match the entries in a database table. Use in token labeling and character labeling operations.
Regular expressions | Identifies strings that match conditions that you define. You can use a regular expression to find a string within a larger string. Use in token labeling operations.
Token sets | Identifies strings based on the types of information they contain. Use in token labeling operations. Informatica installs with token sets that contain different types of token definitions, such as word, telephone number, post code, and product code definitions.

Character Sets

A character set contains expressions that identify specific characters and character ranges. You can use character sets in Labeler transformations and in Parser transformations that use token parsing mode.
Character ranges specify a sequential range of character codes. For example, the character range "[A-C]" matches the uppercase characters "A," "B," and "C." This character range does not match the lowercase characters "a," "b," or "c."
Use character sets to identify a specific character or range of characters as part of token parsing or labeling operations. For example, you can label all numerals in a column that contains telephone numbers. After labeling the numbers, you can identify patterns with a Parser transformation and write problematic patterns to separate outputs.
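The character labeling idea above can be illustrated outside the product with a minimal Python sketch. It is not Informatica code: the `CHARACTER_SETS` list, the label characters "9", "A", and "a", and the `label_characters` function are hypothetical names chosen for illustration. Each set pairs a label with a character range, and every input character is replaced by the label of the first set that matches it.

```python
import re

# Hypothetical character sets (assumption, not Informatica's definitions):
# each entry pairs a label character with a character range.
CHARACTER_SETS = [
    ("9", re.compile(r"[0-9]")),  # numerals
    ("A", re.compile(r"[A-Z]")),  # uppercase only; does not match "a"
    ("a", re.compile(r"[a-z]")),  # lowercase only
]


def label_characters(value: str) -> str:
    """Replace each character with the label of the first set that matches it."""
    labeled = []
    for ch in value:
        for label, pattern in CHARACTER_SETS:
            if pattern.fullmatch(ch):
                labeled.append(label)
                break
        else:
            # Characters outside every set pass through unchanged.
            labeled.append(ch)
    return "".join(labeled)


print(label_characters("555-0100 x2"))  # 999-9999 a9
```

Telephone numbers that share a format produce the same labeled pattern, so an unusual pattern such as "999-9999 a9" stands out and can be routed to a separate output for review.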

Probabilistic Models

A probabilistic model identifies tokens by the types of information that they contain and by the positions that they occupy in an input string.
A probabilistic model contains reference data values and label values. The reference data values represent the data on an input port that you connect to the transformation. The label values describe the types of information that the reference data values contain. You assign a label to each reference data value in the model.
To link the reference data values to the labels in a probabilistic model, you compile the model. The compilation process generates a series of logical associations between the data values and the labels. When you run a mapping that reads the model, the Data Integration Service applies the model logic to the transformation input data. The Data Integration Service returns the label that most accurately describes the input data values.
You create a probabilistic model in the Developer tool. The Model repository stores the probabilistic model object. The Developer tool writes the data values, the labels, and the compilation data to a file in the Informatica directory structure.

Reference Tables

A reference table is a database table that contains at least two columns. One column contains the standard or required version of a data value, and other columns contain alternative versions of the value. When you add a reference table to a transformation, the transformation searches the input port data for values that also appear in the table. You can create tables with any data that is useful to the data project you work on.
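As a rough illustration of the lookup described above, the following Python sketch models a reference table as rows whose first column holds the standard value and whose other columns hold alternative versions. The table contents, the case-insensitive comparison, and the names `build_lookup` and `label_with_reference_table` are assumptions made for this example and do not describe how the transformation is implemented.

```python
# Hypothetical reference table (assumption): column 1 is the standard value,
# the remaining columns are alternative versions of that value.
REFERENCE_TABLE = [
    ("Street", "St", "St.", "Str"),
    ("Avenue", "Ave", "Ave.", "Av"),
]


def build_lookup(rows):
    """Map every version of a value (lowercased) to its standard value."""
    lookup = {}
    for standard, *alternatives in rows:
        for version in (standard, *alternatives):
            lookup[version.lower()] = standard
    return lookup


LOOKUP = build_lookup(REFERENCE_TABLE)


def label_with_reference_table(value, custom_label=None):
    """Replace tokens found in the table with the standard value or a custom label."""
    result = []
    for token in value.split():
        standard = LOOKUP.get(token.lower())
        if standard is None:
            result.append(token)  # token does not appear in the table
        elif custom_label is not None:
            result.append(custom_label)  # a value that you specify
        else:
            result.append(standard)  # valid value from the table
    return " ".join(result)


print(label_with_reference_table("12 Main St."))                 # 12 Main Street
print(label_with_reference_table("12 Main St.", "STREET_TYPE"))  # 12 Main STREET_TYPE
```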

Regular Expressions

In the context of labeling operations, a regular expression is an expression that you can use to identify a specific string in input data. You can use regular expressions in Labeler transformations that use token labeling mode.
Labeler transformations use regular expressions to match an input pattern and create a single label. Regular expressions that have multiple outputs do not generate multiple labels.
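The single-label behavior can be sketched in Python as follows. This is an illustration under stated assumptions, not the product's logic: the ZIP+4 pattern, the label name "ZIP_PLUS_4", and the function `label_tokens_with_regex` are hypothetical. The pattern has two capture groups, which stand in for a regular expression with multiple outputs, yet a matching token still receives exactly one label.

```python
import re

# Hypothetical pattern (assumption) with two capture groups.
ZIP_PLUS_4 = re.compile(r"(\d{5})-(\d{4})")


def label_tokens_with_regex(value, pattern=ZIP_PLUS_4, label="ZIP_PLUS_4"):
    """Return one label per matching token, however many groups the pattern has."""
    labeled = []
    for token in value.split():
        if pattern.fullmatch(token):
            labeled.append(label)  # one label for the whole match
        else:
            labeled.append(token)  # non-matching tokens are left as they are
    return labeled


print(label_tokens_with_regex("Boston 02110-1234"))  # ['Boston', 'ZIP_PLUS_4']
```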

Token Sets

A token set contains expressions that identify specific tokens. You can use token sets in Labeler transformations that use token labeling mode.
Use token sets to identify specific tokens as part of token labeling operations. For example, you can use a token set to label all email addresses that use an "AccountName@DomainName" format. After labeling the tokens, you can use the Parser transformation to write email addresses to output ports that you specify.
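To make the email example concrete, here is a minimal Python sketch of token labeling with a token set. The token definitions, the label names, and the `label_tokens_with_token_set` function are hypothetical and deliberately simplified. They are not the system-defined token sets that ship with the Developer tool, and the email expression is far looser than a full address validator.

```python
import re

# Hypothetical token set (assumption): ordered pairs of label and expression.
TOKEN_SET = [
    ("EMAIL", re.compile(r"[^@\s]+@[^@\s]+\.[^@\s]+")),  # AccountName@DomainName
    ("NUMBER", re.compile(r"\d+")),
    ("WORD", re.compile(r"[A-Za-z]+")),
]


def label_tokens_with_token_set(value):
    """Label each whitespace-delimited token with the first definition that matches."""
    labels = []
    for token in value.split():
        for name, pattern in TOKEN_SET:
            if pattern.fullmatch(token):
                labels.append(name)
                break
        else:
            labels.append("OTHER")  # no definition in the set matched
    return labels


print(label_tokens_with_token_set("contact jane.doe@example.com 42"))
# ['WORD', 'EMAIL', 'NUMBER']
```

A downstream step could then use the position of each "EMAIL" label to write the corresponding token to its own output, which mirrors the role the Parser transformation plays after labeling.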
The Developer tool includes system-defined token sets that you can use to identify a wide range of patterns. Examples of system-defined token sets include: