Reference Data Use in the Parser Transformation

Informatica Developer installs with multiple reference data objects that you can use with the Parser transformation. You can also create reference data objects in the Developer tool.

Reference Data Type	Description
Pattern sets	Identifies data values based on the relative position of each value in the string.
Probabilistic models	Adds fuzzy match capabilities to token parsing operations. The transformation can use a probabilistic model to infer the type of information in a string. To enable the fuzzy match capabilities, you compile the probabilistic model in the Developer tool.
Reference tables	Finds strings that match the entries in a database table.
Regular expressions	Identifies strings that match conditions that you define. You can use a regular expression to find a string within a larger string.
Token sets	Identifies strings based on the types of information they contain. Informatica installs with token sets that contain different types of token definitions, such as word, telephone number, post code, and product code definitions.

Pattern Sets

A pattern set contains expressions that identify data patterns in the output of a token labeling operation. You can use pattern sets to analyze the Tokenized Data output port and write matching strings to one or more output ports. Use pattern sets in Parser transformations that use pattern parsing mode.

For example, you can configure a Parser transformation to use pattern sets that identify names and initials. This transformation uses the pattern sets to analyze the output of a Labeler transformation in token labeling mode. You can configure the Parser transformation to write names and initials in the output to separate ports.
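The names-and-initials example can be illustrated outside the product with a short Python sketch. It is not Informatica code: the label names WORD and INIT, the port names Name and Initial, and the function parse_by_pattern are hypothetical, chosen only to show how a label sequence might route tokens to output ports.

```python
def parse_by_pattern(tokens, labels, pattern, ports):
    """Route tokens to ports when the label sequence equals the pattern.

    tokens  -- the strings found in one input value
    labels  -- the label assigned to each token (hypothetical label names)
    pattern -- the label sequence that the pattern expects
    ports   -- the output port name for each position in the pattern
    Returns a dict of port name to parsed string, or None if no match.
    """
    if labels != pattern:
        return None
    routed = {}
    for token, port in zip(tokens, ports):
        routed.setdefault(port, []).append(token)
    # Join tokens that share a port into a single output string.
    return {port: " ".join(values) for port, values in routed.items()}


result = parse_by_pattern(
    tokens=["John", "Q", "Smith"],
    labels=["WORD", "INIT", "WORD"],
    pattern=["WORD", "INIT", "WORD"],
    ports=["Name", "Initial", "Name"],
)
print(result)
```

In this sketch, a value whose labels do not match the pattern returns None, which stands in for data that the pattern does not parse.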

Probabilistic Models

A probabilistic model contains reference data values and label values. The reference data values represent the data on an input port that you connect to the transformation. The label values describe the types of information that the reference data values contain. You assign a label to each reference data value in the model.

To link the reference data values to the labels in a probabilistic model, you compile the model. The compilation process generates a series of logical associations between the data values and the labels. When you run a mapping that reads the model, the Data Integration Service applies the model logic to the transformation input data. The Data Integration Service returns the label that most accurately describes the input data values.

You create a probabilistic model in the Developer tool. The Model repository stores the probabilistic model object. The Developer tool writes the data values, the labels, and the compilation data to a file in the Informatica directory structure.

Note: If you add a probabilistic model to a token parsing operation and you then edit the label configuration in the probabilistic model, you invalidate the operation. When you update the label configuration in a probabilistic model, recreate any parsing operation that uses the model.

Reference Tables

A reference table is a database table that contains at least two columns. One column contains the standard or required version of a data value, and other columns contain alternative versions of the value. When you add a reference table to a transformation, the transformation searches the input port data for values that also appear in the table. You can create tables with any data that is useful to the data project you work on.
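The table layout described above can be sketched in Python as a conceptual illustration. The rows, the function names, and the case-insensitive matching are assumptions made for this example; they do not describe how the transformation implements the search.

```python
# Hypothetical reference table: the first column holds the standard value,
# and the remaining columns hold alternative versions of that value.
REFERENCE_ROWS = [
    ("Street", "St", "Str"),
    ("Avenue", "Ave", "Av"),
]


def build_lookup(rows):
    """Map every version of a value (lowercased) to its standard version."""
    lookup = {}
    for standard, *alternatives in rows:
        for value in (standard, *alternatives):
            lookup[value.lower()] = standard
    return lookup


def find_matches(text, lookup):
    """Return (input token, standard value) pairs for tokens in the table."""
    matches = []
    for token in text.split():
        key = token.strip(".,").lower()  # ignore trailing punctuation and case
        if key in lookup:
            matches.append((token, lookup[key]))
    return matches


lookup = build_lookup(REFERENCE_ROWS)
matches = find_matches("12 Main St. and 5th Ave", lookup)
print(matches)
```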

Regular Expressions

In the context of parsing operations, a regular expression is an expression that you can use to identify one or more strings in input data. The Parser transformation writes identified strings to one or more output ports. You can use regular expressions in Parser transformations that use token parsing mode.

Parser transformations use regular expressions to match patterns in input data and parse all matching strings to one or more outputs. For example, you can use a regular expression to identify all email addresses in input data and parse each email address component to a different output.
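The email example can be sketched with the Python re module. The pattern below is a simplified assumption, not an expression shipped with the product and not a complete email validator. The two capture groups stand in for two output ports.

```python
import re

# Simplified, hypothetical email pattern with two capture groups:
# group 1 is the account name, group 2 is the domain name.
EMAIL_PATTERN = re.compile(
    r"([A-Za-z0-9._%+-]+)@([A-Za-z0-9.-]+\.[A-Za-z]{2,})"
)


def parse_emails(text):
    """Return an (account, domain) pair for each email-like string in text."""
    return [(m.group(1), m.group(2)) for m in EMAIL_PATTERN.finditer(text)]


rows = parse_emails("Contact jsmith@example.com or sales@example.org today.")
for account, domain in rows:
    print(account, domain)
```

Each capture group corresponds to one component of the matched string, which is how a single expression can feed more than one output.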

Token Sets

Use token sets to identify specific tokens as part of parsing operations. For example, you can use a token set to parse all email addresses that use an "AccountName@DomainName" format.