Content Sets

A content set is a Model repository object that stores data or metadata for other reference data objects. A content set can include character sets, pattern sets, token sets, regular expressions, probabilistic models, and classifier models. Use a content set to define and organize reference data objects that relate to a single project, information type, or business purpose.
The Developer tool includes system-defined character sets and token sets that do not appear in the Model repository. To view and use the system-defined objects, configure a strategy in the Labeler transformation, Parser transformation, or Standardizer transformation.

Character Sets

A character set contains expressions that identify specific characters and character ranges. You can use character sets in Labeler transformations that use character labeling mode.
Character ranges specify a sequential range of character codes. For example, the character range "[A-C]" matches the uppercase characters "A," "B," and "C." This character range does not match the lowercase characters "a," "b," or "c."
Use character sets to identify a specific character or range of characters as part of labeling operations. For example, you can label all numerals in a column that contains telephone numbers. After labeling the numbers, you can identify patterns with a Parser transformation and write problematic patterns to separate output ports.
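The idea of character labeling can be sketched outside the Developer tool. The following Python example is only an illustration, not the Labeler transformation's implementation: the `CHARACTER_SETS` dictionary, the label symbols, and the function names are assumptions made for this sketch. It shows that a range such as "A" to "Z" is case-sensitive and how the numerals in a telephone number can be labeled one character at a time.

```python
# Hypothetical character sets: each label maps to a list of (start, end) ranges.
# The label symbols are invented for this sketch.
CHARACTER_SETS = {
    "9": [("0", "9")],  # numerals
    "A": [("A", "Z")],  # uppercase letters only
    "a": [("a", "z")],  # lowercase letters only
}


def label_character(ch, character_sets=CHARACTER_SETS, default="_"):
    """Return the label of the first character set whose range contains ch."""
    for label, ranges in character_sets.items():
        for start, end in ranges:
            # Ranges compare character codes, so "A"-"C" excludes "a"-"c".
            if start <= ch <= end:
                return label
    return default


def label_string(value):
    """Label every character in value and return the labels as one string."""
    return "".join(label_character(ch) for ch in value)


print(label_string("555-0142 x7"))  # prints 999_9999_a9
```

A downstream step could then compare label strings such as `999_9999` against expected patterns and route values that do not conform to a separate output.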

Character Set Properties

Configure properties that determine character labeling operations for a character set.
The following table describes the properties for a user-defined character set:
| Property | Description |
| --- | --- |
| Label | Defines the label that a Labeler transformation applies to data that matches the character set. |
| Standard Mode | Enables a simple editing view that includes fields for the start range and end range. |
| Start Range | Specifies the first character in a character range. |
| End Range | Specifies the last character in a character range. For a range with a single character, leave this field blank. |
| Advanced Mode | Enables an advanced editing view where you can manually enter character ranges using range characters and delimiter characters. |
| Range Character | Temporarily changes the symbol that signifies a character range. The range character reverts to the default character when you close the character set. |
| Delimiter Character | Temporarily changes the symbol that separates character ranges. The delimiter character reverts to the default character when you close the character set. |

Classifier Models

A classifier model analyzes input strings and determines the types of information that the strings are most likely to contain. You use a classifier model in a Classifier transformation. Use a classifier model when the input strings contain significant amounts of data.
A classifier model contains data rows and label values. The set of data rows represents the source data that you might select in the Classifier transformation. The label values describe the types of information that the data rows might contain. You assign a label to each data row in the model.
To link the data rows to the labels in a classifier model, you compile the model. The compilation process generates a series of logical associations between the data rows and the labels. When you run a mapping that reads the model, the Data Integration Service applies the model logic to the transformation input data. The Data Integration Service returns the label that most accurately describes the information in each data row.
You create a classifier model in the Developer tool. The Model repository stores the classifier model object. The Developer tool writes the data rows, the labels, and the compilation data to a file in the Informatica directory structure.
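The relationship between data rows, labels, and compilation can be illustrated with a deliberately simple Python sketch. This is not the algorithm that the Developer tool or the Data Integration Service uses, which is not described here. The word-count scoring, the function names, and the sample rows are all assumptions chosen to show the general flow: compile labeled rows into associations, then return the best-scoring label for new input.

```python
from collections import Counter, defaultdict


def compile_model(rows):
    """Build per-label word counts from (text, label) rows."""
    counts = defaultdict(Counter)
    for text, label in rows:
        counts[label].update(text.lower().split())
    return counts


def classify(model, text):
    """Return the label whose rows share the most words with text."""
    words = text.lower().split()
    scores = {label: sum(c[w] for w in words) for label, c in model.items()}
    return max(scores, key=scores.get)


# Invented sample rows, each assigned one label.
rows = [
    ("the invoice total is overdue please pay", "billing"),
    ("refund the payment to my card", "billing"),
    ("the app crashes when I open settings", "technical"),
    ("cannot log in after the update", "technical"),
]

model = compile_model(rows)
print(classify(model, "my payment is overdue"))     # prints billing
print(classify(model, "app crashes after update"))  # prints technical
```

The sketch needs several words per input to score reliably, which mirrors the guidance to use a classifier model when the input strings contain significant amounts of data.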

Pattern Sets

A pattern set contains expressions that identify data patterns in the output of a token labeling operation. You can use pattern sets to analyze the Tokenized Data output port and write matching strings to one or more output ports. Use pattern sets in Parser transformations that use pattern parsing mode.
For example, you can configure a Parser transformation to use pattern sets that identify names and initials. This transformation uses the pattern sets to analyze the output of a Labeler transformation in token labeling mode. You can configure the Parser transformation to write names and initials in the output to separate ports.
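The names-and-initials example can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the Parser transformation's logic: the label names (`WORD`, `INIT`, `OTHER`), the port names, the `overflow` output, and the pattern representation are hypothetical. Each pattern is a sequence of token labels, and a value whose label sequence matches a pattern has its tokens written to the ports assigned to each position.

```python
import re

# Hypothetical pattern set: a sequence of token labels paired with the
# output port that receives the token at each position.
PATTERNS = [
    (["WORD", "INIT", "WORD"], ["name", "initial", "name"]),
    (["INIT", "WORD"], ["initial", "name"]),
]


def label_token(token):
    """Stand-in for token labeling: classify one token by its shape."""
    if re.fullmatch(r"[A-Za-z]\.?", token):
        return "INIT"  # a single letter, optionally followed by a period
    if token.isalpha():
        return "WORD"
    return "OTHER"


def parse_by_pattern(value, patterns=PATTERNS):
    """Write tokens to ports when the label sequence matches a pattern."""
    tokens = value.split()
    labels = [label_token(t) for t in tokens]
    for pattern, ports in patterns:
        if labels == pattern:
            outputs = {}
            for token, port in zip(tokens, ports):
                outputs.setdefault(port, []).append(token)
            return {port: " ".join(parts) for port, parts in outputs.items()}
    # No pattern matched: keep the whole value on a separate output.
    return {"overflow": value}


print(parse_by_pattern("John Q. Smith"))
print(parse_by_pattern("42 Main St"))
```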

Pattern Set Properties

Configure properties that determine the patterns in a pattern set.
The following table describes the property for a user-defined pattern set:
| Property | Description |
| --- | --- |
| Pattern | Defines the patterns that the pattern parser searches for. You can enter multiple patterns for one pattern set. You can enter patterns constructed from a combination of wildcards, characters, and strings. |

Probabilistic Models

A probabilistic model analyzes input data values and determines the types of information that the values are most likely to contain. You can use a probabilistic model in a Labeler transformation or a Parser transformation.
A probabilistic model contains data values and label values. The set of data values represents the source data that you might select in the transformation. The label values describe the types of information that the data values contain. You assign a label to each data value in the model.
To link the data values to the labels in a probabilistic model, you compile the model. The compilation process generates a series of logical associations between the data values and the labels. When you run a mapping that reads the model, the Data Integration Service applies the model logic to the transformation input data. The Data Integration Service returns the label that most accurately describes the input data values.
You create a probabilistic model in the Developer tool. The Model repository stores the probabilistic model object. The Developer tool writes the data values, the labels, and the compilation data to a file in the Informatica directory structure.

Regular Expressions

In the context of content sets, a regular expression is an expression that you can use in parsing and labeling operations. Use regular expressions to identify one or more strings in input data. You can use regular expressions in Parser transformations that use token parsing mode. You can also use regular expressions in Labeler transformations that use token labeling mode.
Parser transformations use regular expressions to match patterns in input data and parse all matching strings to one or more outputs. For example, you can use a regular expression to identify all email addresses in input data and parse each email address component to a different output.
Labeler transformations use regular expressions to match an input pattern and create a single label. Regular expressions that have multiple outputs do not generate multiple labels.
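The difference between the two uses can be sketched with Python's `re` module. The email pattern below is a simplified assumption for illustration and does not cover every valid address, and the function names are hypothetical. Regular expression syntax in the Developer tool may also differ from Python's. The parsing function returns one component per capture group, analogous to multiple outputs, while the labeling function returns a single label no matter how many groups the expression defines.

```python
import re

# Simplified email pattern with two capture groups: account and domain.
EMAIL = re.compile(r"([A-Za-z0-9._%+-]+)@([A-Za-z0-9.-]+\.[A-Za-z]{2,})")


def parse_emails(text):
    """Parsing: return (account, domain) for every match in text."""
    return [m.groups() for m in EMAIL.finditer(text)]


def label_value(text, label="EMAIL"):
    """Labeling: return one label if the pattern matches, else None."""
    return label if EMAIL.search(text) else None


text = "Contact ana@example.com or bob.lee@mail.example.org today"
print(parse_emails(text))
# prints [('ana', 'example.com'), ('bob.lee', 'mail.example.org')]
print(label_value(text))  # prints EMAIL
```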

Regular Expression Properties

Configure properties that determine how a regular expression identifies and writes output strings.
The following table describes the properties for a user-defined regular expression:
| Property | Description |
| --- | --- |
| Number of Outputs | Defines the number of output ports that the regular expression writes. |
| Regular Expression | Defines a pattern that the Parser transformation uses to match strings. |
| Test Expression | Contains data that you enter to test the regular expression. As you type data in this field, the field highlights strings that match the regular expression. |
| Next Expression | Moves to the next string that matches the regular expression and changes the font of that string to bold. |
| Previous Expression | Moves to the previous string that matches the regular expression and changes the font of that string to bold. |

Token Sets

A token set contains expressions that identify specific tokens. You can use token sets in Labeler transformations that use token labeling mode. You can also use token sets in Parser transformations that use token parsing mode.
Use token sets to identify specific tokens as part of labeling and parsing operations. For example, you can use a token set to label all email addresses that use an "AccountName@DomainName" format. After labeling the tokens, you can use the Parser transformation to write email addresses to output ports that you specify.
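A rough Python sketch of token labeling follows. It assumes that tokens are separated by whitespace and that each token receives the label of the first token set it matches. The token set names, the expressions, and the `OTHER` fallback label are hypothetical and are not the system-defined token sets in the Developer tool.

```python
import re

# Hypothetical token sets in regular expression mode: (label, expression).
TOKEN_SETS = [
    ("EMAIL", re.compile(r"[^@\s]+@[^@\s]+")),  # AccountName@DomainName
    ("NUMBER", re.compile(r"\d+")),
    ("WORD", re.compile(r"[A-Za-z]+")),
]


def label_tokens(value):
    """Return (token, label) pairs for each whitespace-separated token."""
    pairs = []
    for token in value.split():
        for label, expression in TOKEN_SETS:
            if expression.fullmatch(token):
                pairs.append((token, label))
                break
        else:
            # No token set matched the whole token.
            pairs.append((token, "OTHER"))
    return pairs


pairs = label_tokens("Reach ana@example.com before 5 pm!")
print([label for _, label in pairs])
# prints ['WORD', 'EMAIL', 'WORD', 'NUMBER', 'OTHER']

# A later parsing step can pick out the tokens that carry a given label.
emails = [token for token, label in pairs if label == "EMAIL"]
print(emails)  # prints ['ana@example.com']
```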

Token Set Properties

Configure properties that determine the labeling operations for a token set.
The following table describes the properties for a user-defined token set:

| Property | Token Set Mode | Description |
| --- | --- | --- |
| Name | N/A | Defines the name of the token set. |
| Description | N/A | Describes the token set. |
| Token Set Options | N/A | Defines whether the token set uses regular expression mode or character mode. |
| Label | Regular Expression | Defines the label that a Labeler transformation applies to data that matches the token set. |
| Regular Expression | Regular Expression | Defines a pattern that the Labeler transformation uses to match strings. |
| Test Expression | Regular Expression | Contains data that you enter to test the regular expression. As you type data in this field, the field highlights strings that match the regular expression. |
| Next Expression | Regular Expression | Moves to the next string that matches the regular expression and changes the font of that string to bold. |
| Previous Expression | Regular Expression | Moves to the previous string that matches the regular expression and changes the font of that string to bold. |
| Label | Character | Defines the label that a Labeler transformation applies to data that matches the token set. |
| Standard Mode | Character | Enables a simple editing view that includes fields for the start range and end range. |
| Start Range | Character | Specifies the first character in a character range. |
| End Range | Character | Specifies the last character in a character range. For a range with a single character, leave this field blank. |
| Advanced Mode | Character | Enables an advanced editing view where you can manually enter character ranges using range characters and delimiter characters. |
| Range Character | Character | Temporarily changes the symbol that signifies a character range. The range character reverts to the default character when you close the token set. |
| Delimiter Character | Character | Temporarily changes the symbol that separates character ranges. The delimiter character reverts to the default character when you close the token set. |

Rules and Guidelines for Probabilistic Models and Classifier Models

Each probabilistic model and classifier model in the Model repository identifies a file in the Informatica directory structure. The files contain the data values and the labels that you add to the model in the Developer tool. The files also contain the compilation logic that defines the associations between the data values and the labels.
Consider the following rules and guidelines when you work with probabilistic models or classifier models:

Creating a Content Set

Create a content set to manage reference data objects that relate to a single project, information type, or business purpose.
    1. In the Object Explorer view, select a project or folder to store the content set.
    2. Click File > New > Content Set.
    3. Enter a name for the content set.
    4. Optionally, select Browse to change the Model repository location for the content set.
    5. Click Finish.

Creating a Reference Data Object in a Content Set

You can create a character set, pattern set, token set, regular expression, probabilistic model, or classifier model in a content set.
    1. Open a content set in the editor and select the Content view.
    2. Select a reference data object type.
    3. Click Add.
    4. Enter a name for the reference data object.
    Optionally, enter a description of the object.
    5. Configure the reference data object properties.
    6. Click Finish.

Generating Reference Data from a Midstream Profile

A midstream profile is a profile that you run on a transformation in a mapping. You can run a midstream profile to generate data for a reference data object.
For example, run a profile on a transformation that you connect to the input ports on a Labeler transformation or Parser transformation. Add the profile data to a probabilistic model, and apply the probabilistic model to the Labeler transformation or Parser transformation.
    1. Open the mapping that contains the transformation to profile.
    2. Select the transformation.
    For example, select a transformation that you connect to the input ports of a Labeler transformation or Parser transformation.
    3. Click Profile Now.
    4. Select the Results tab in the profile, and review the profile results.
    5. Under Column Profiling, select a column.
    6. Under Details, select the option to show the profile values.
    The editor displays the data values in the column that you selected. You can select all the values in the column or a subset of the values.
    7. Export the column data to a file.
    8. Save the file that contains the column data. You can save the file on an Informatica services machine or on the Developer tool machine.
You can use the file as a data source for a reference data object. For example, create a probabilistic model from the file.
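Before you use the exported file as a data source, it can help to review the distinct values that it contains. The Python sketch below assumes that the file holds one column value per line as plain text, which may not match the format of your export. The function name and the sample values are invented for this illustration.

```python
from collections import Counter


def summarize_values(lines):
    """Count non-empty values, ignoring surrounding whitespace."""
    return Counter(line.strip() for line in lines if line.strip())


# Invented sample lines standing in for the exported column data.
sample = ["London\n", "Paris\n", "\n", "London\n", " Berlin \n"]

counts = summarize_values(sample)
for value, occurrences in counts.most_common():
    print(value, occurrences)
```

To run the same summary on a real export, pass an open file object instead of the sample list, for example `summarize_values(open(path, encoding="utf-8"))`, where `path` is the location where you saved the file.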