Content Sets

A content set is a Model repository object that stores data or metadata for other reference data objects. A content set can include character sets, pattern sets, token sets, regular expressions, probabilistic models, and classifier models. Use a content set to define and organize reference data objects that relate to a single project, information type, or business purpose.

The Developer tool includes system-defined character sets and token sets that do not appear in the Model repository. To view and use the system-defined objects, configure a strategy in the Labeler transformation, Parser transformation, or Standardizer transformation.

Character Sets

A character set contains expressions that identify specific characters and character ranges. You can use character sets in Labeler transformations that use character labeling mode.

Character ranges specify a sequential range of character codes. For example, the character range "[A-C]" matches the uppercase characters "A," "B," and "C." This character range does not match the lowercase characters "a," "b," or "c."

Use character sets to identify a specific character or range of characters as part of labeling operations. For example, you can label all numerals in a column that contains telephone numbers. After labeling the numbers, you can identify patterns with a Parser transformation and write problematic patterns to separate output ports.
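
The range behavior described above can be sketched in a few lines of Python. This is an illustration only, not how the Labeler transformation is implemented: the function names, the label values "9" and "A", and the default label "_" are all assumptions chosen for the example.

```python
def in_range(ch: str, start: str, end: str) -> bool:
    """Return True if ch falls in the inclusive character-code range start..end."""
    return ord(start) <= ord(ch) <= ord(end)


# Hypothetical character sets: label -> list of (start, end) ranges.
# The labels are illustrative, not system-defined values.
CHARACTER_SETS = {
    "9": [("0", "9")],  # numerals
    "A": [("A", "Z")],  # uppercase letters only
}


def label_characters(value: str, default: str = "_") -> str:
    """Replace each character with the label of the first character set it matches."""
    labels = []
    for ch in value:
        for label, ranges in CHARACTER_SETS.items():
            if any(in_range(ch, start, end) for start, end in ranges):
                labels.append(label)
                break
        else:
            # No character set matched this character.
            labels.append(default)
    return "".join(labels)


print(in_range("B", "A", "C"))       # True
print(in_range("b", "A", "C"))       # False: lowercase codes fall outside A-C
print(label_characters("555-0100"))  # 999_9999
```

A labeled value such as "999_9999" is the kind of pattern that a later parsing step could compare against expected telephone number formats.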

Character Set Properties

Configure properties that determine character labeling operations for a character set.

The following table describes the properties for a user-defined character set:

Property	Description
Label	Defines the label that a Labeler transformation applies to data that matches the character set.
Standard Mode	Enables a simple editing view that includes fields for the start range and end range.
Start Range	Specifies the first character in a character range.
End Range	Specifies the last character in a character range. For a range with a single character, leave this field blank.
Advanced Mode	Enables an advanced editing view where you can manually enter character ranges using range characters and delimiter characters.
Range Character	Temporarily changes the symbol that signifies a character range. The range character reverts to the default character when you close the character set.
Delimiter Character	Temporarily changes the symbol that separates character ranges. The delimiter character reverts to the default character when you close the character set.

Classifier Models

A classifier model analyzes input strings and determines the types of information that they contain. You use a classifier model in a Classifier transformation.

Use a classifier model when input strings contain significant amounts of data. For example, you can use a classifier model to identify the subject matter in a set of documents. You export the text from each document, and you store each document as a separate field in a single data column. The Classifier transformation reads the data and classifies the subject matter in each field according to the labels defined in the classifier model.

The classifier model contains the following columns:

Data column: A column that contains the words and phrases that are likely to exist in the input data. The transformation compares the input data with the data in this column.
Label column: A column that contains descriptive labels that can define the information in the data. The transformation returns a label from this column as output.

The classifier model also contains compilation data that the Classifier transformation uses to calculate the correct information type for the input data.

You create a classifier model in the Developer tool. The Model repository stores the metadata for the classifier model object. The column data and compilation data are stored in a file in the Informatica directory structure.

Pattern Sets

A pattern set contains expressions that identify data patterns in the output of a token labeling operation. You can use pattern sets to analyze the Tokenized Data output port and write matching strings to one or more output ports. Use pattern sets in Parser transformations that use pattern parsing mode.

For example, you can configure a Parser transformation to use pattern sets that identify names and initials. This transformation uses the pattern sets to analyze the output of a Labeler transformation in token labeling mode. You can configure the Parser transformation to write names and initials in the output to separate ports.
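
The names-and-initials example can be pictured with a small Python sketch. It assumes the labeled data arrives as (token, label) pairs and that a pattern is a space-separated sequence of labels. The labels NAME and INIT, the PATTERN_SET list, and the output keys are hypothetical and do not reflect the actual port layout or pattern syntax of the Parser transformation.

```python
# Hypothetical pattern set: each pattern is a sequence of token labels.
PATTERN_SET = ["NAME INIT NAME", "NAME NAME"]


def parse_labeled(labeled):
    """Route tokens to separate outputs when their label sequence matches a pattern."""
    label_sequence = " ".join(label for _, label in labeled)
    if label_sequence not in PATTERN_SET:
        # No pattern matched: keep the tokens together so they can be reviewed.
        return {"names": [], "initials": [], "overflow": [t for t, _ in labeled]}
    return {
        "names": [t for t, label in labeled if label == "NAME"],
        "initials": [t for t, label in labeled if label == "INIT"],
        "overflow": [],
    }


row = [("John", "NAME"), ("Q", "INIT"), ("Smith", "NAME")]
print(parse_labeled(row))
```

Matching on the whole label sequence, rather than on individual tokens, is what distinguishes this from token-level parsing: a row is routed only if its overall shape is one that the pattern set anticipates.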

Pattern Set Properties

Configure properties that determine the patterns in a pattern set.

The following table describes the property for a user-defined pattern set:

Property	Description
Pattern	Defines the patterns that the pattern parser searches for. You can enter multiple patterns for one pattern set. You can enter patterns constructed from a combination of wildcards, characters, and strings.

Probabilistic Models

A probabilistic model identifies the types of information that data values represent. You use probabilistic models with the Labeler and Parser transformations.

A probabilistic model contains a series of data rows and a related series of label values. The label values describe the types of information that the data values in the data rows contain. You assign a label to each value in each row. When you compile a probabilistic model, the Developer tool creates associations between the data values and the labels that you specify.

The probabilistic model also contains compilation data that the transformations can use to calculate the correct information type for the input data. You update the model logic when you compile the model in the Developer tool.

You create a probabilistic model in the Developer tool. The Model repository stores the metadata for the probabilistic model object. The Developer tool writes the data rows, labels, and compilation data to a file.

Regular Expressions

In the context of content sets, a regular expression is an expression that you can use in parsing and labeling operations. Use regular expressions to identify one or more strings in input data. You can use regular expressions in Parser transformations that use token parsing mode. You can also use regular expressions in Labeler transformations that use token labeling mode.

Parser transformations use regular expressions to match patterns in input data and parse all matching strings to one or more outputs. For example, you can use a regular expression to identify all email addresses in input data and parse each email address component to a different output.

Labeler transformations use regular expressions to match an input pattern and create a single label. Regular expressions that have multiple outputs do not generate multiple labels.
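
The difference between the two uses can be illustrated with Python's standard re module. This is a sketch under stated assumptions: the email pattern is deliberately simplified, and the function names and the "EMAIL" label are invented for the example. Regular expression syntax in the Developer tool may differ from Python's.

```python
import re

# Simplified email pattern with two capture groups: account name and domain name.
EMAIL = re.compile(r"([A-Za-z0-9._%+-]+)@([A-Za-z0-9.-]+\.[A-Za-z]{2,})")


def parse_emails(text):
    """Parser-style use: each capture group becomes a separate output value."""
    return [match.groups() for match in EMAIL.finditer(text)]


def label_value(value, label="EMAIL"):
    """Labeler-style use: a match yields one label, however many groups exist."""
    return label if EMAIL.fullmatch(value) else None


text = "Contact jo@example.com or sam.lee@mail.example.org today"
print(parse_emails(text))
# [('jo', 'example.com'), ('sam.lee', 'mail.example.org')]
print(label_value("jo@example.com"))  # EMAIL
```

Here the two capture groups correspond to a Number of Outputs value of 2 in parsing, while labeling ignores the groups and reports only whether the value matched.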

Regular Expression Properties

Configure properties that determine how a regular expression identifies and writes output strings.

The following table describes the properties for a user-defined regular expression:

Property	Description
Number of Outputs	Defines the number of output ports that the regular expression writes.
Regular Expression	Defines a pattern that the Parser transformation uses to match strings.
Test Expression	Contains data that you enter to test the regular expression. As you type data in this field, the field highlights strings that match the regular expression.
Next Expression	Moves to the next string that matches the regular expression and changes the font of that string to bold.
Previous Expression	Moves to the previous string that matches the regular expression and changes the font of that string to bold.

Token Sets

A token set contains expressions that identify specific tokens. You can use token sets in Labeler transformations that use token labeling mode. You can also use token sets in Parser transformations that use token parsing mode.

Use token sets to identify specific tokens as part of labeling and parsing operations. For example, you can use a token set to label all email addresses that use an "AccountName@DomainName" format. After labeling the tokens, you can use the Parser transformation to write email addresses to output ports that you specify.
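
As a rough illustration of token labeling, the following Python sketch splits a value on whitespace and assigns each token the label of the first token set that matches it. The TOKEN_SETS list, the labels EMAIL, NUMBER, and WORD, and the whitespace tokenization are assumptions made for the example. They are not the system-defined token sets or the tokenization rules of the Labeler transformation.

```python
import re

# Hypothetical token sets in regular expression mode: (label, pattern) pairs.
TOKEN_SETS = [
    ("EMAIL", re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")),
    ("NUMBER", re.compile(r"\d+")),
]


def label_tokens(value, default="WORD"):
    """Return (token, label) pairs, using the first token set that matches each token."""
    labeled = []
    for token in value.split():
        for label, pattern in TOKEN_SETS:
            if pattern.fullmatch(token):
                labeled.append((token, label))
                break
        else:
            # No token set matched, so fall back to the default label.
            labeled.append((token, default))
    return labeled


print(label_tokens("Mail jo@example.com ext 42"))
# [('Mail', 'WORD'), ('jo@example.com', 'EMAIL'), ('ext', 'WORD'), ('42', 'NUMBER')]
```

A downstream step could then select only the tokens labeled EMAIL and write them to a dedicated output.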

Token Set Properties

Configure properties that determine the labeling operations for a token set.

The following table describes the properties for a user-defined token set:

Property	Token Set Mode	Description
Name	N/A	Defines the name of the token set.
Description	N/A	Describes the token set.
Token Set Options	N/A	Defines whether the token set uses regular expression mode or character mode.
Label	Regular Expression	Defines the label that a Labeler transformation applies to data that matches the token set.
Regular Expression	Regular Expression	Defines a pattern that the Labeler transformation uses to match strings.
Test Expression	Regular Expression	Contains data that you enter to test the regular expression. As you type data in this field, the field highlights strings that match the regular expression.
Next Expression	Regular Expression	Moves to the next string that matches the regular expression and changes the font of that string to bold.
Previous Expression	Regular Expression	Moves to the previous string that matches the regular expression and changes the font of that string to bold.
Label	Character	Defines the label that a Labeler transformation applies to data that matches the character set.
Standard Mode	Character	Enables a simple editing view that includes fields for the start range and end range.
Start Range	Character	Specifies the first character in a character range.
End Range	Character	Specifies the last character in a character range. For single-character ranges, leave this field blank.
Advanced Mode	Character	Enables an advanced editing view where you can manually enter character ranges using range characters and delimiter characters.
Range Character	Character	Temporarily changes the symbol that signifies a character range. The range character reverts to the default character when you close the character set.
Delimiter Character	Character	Temporarily changes the symbol that separates character ranges. The delimiter character reverts to the default character when you close the character set.

Creating a Content Set

Create a content set to manage reference data objects that refer to a single project, information type, or business purpose.

1. In the Object Explorer view, select a project or folder to store the content set.

2. Click File > New > Content Set.

3. Enter a name for the content set.

4. Optionally, select Browse to change the Model repository location for the content set.

5. Click Finish.

Creating a Reference Data Object in a Content Set

You can create a character set, pattern set, token set, regular expression, probabilistic model, or classifier model in a content set.

1. Open a content set in the editor and select the Content view.

2. Select a reference data object type.

3. Click Add.

4. Enter a name for the reference data object.

Optionally, enter a description of the object.

5. Configure the reference data object properties.

6. Click Finish.

Generating Reference Data from a Midstream Profile

A midstream profile is a profile that you run on a transformation in a mapping. You can run a midstream profile to generate data for a reference data object.

For example, run a profile on a transformation that you connect to the input ports on a Labeler transformation or Parser transformation. Add the profile data to a probabilistic model, and apply the probabilistic model to the Labeler transformation or Parser transformation.

1. Open the mapping that contains the transformation to profile.

2. Select the transformation.

For example, select a transformation that you connect to the input ports of a Labeler transformation or Parser transformation.

3. Click Profile Now.

4. Select the Results tab in the profile, and review the profile results.

5. Under Column Profiling, select a column.

6. Under Details, select the option to show the profile values.

The editor displays the data values in the column that you selected. You can select all the values in the column or a subset of the values.

7. Export the column data to a file.

- To export all column values, click the option to Export Value Frequencies to File.
- To export a subset of column values, right-click the values and select Send to > Export Results to File.

8. Save the file that contains the column data. You can save the file on an Informatica services machine or on the Developer tool machine.

You can use the file as a data source for a reference data object. For example, create a probabilistic model from the file.