Data Domains and Data Domain Groups

A data domain is a predefined or user-defined Model repository object that enables you to discover the functional meaning of column data or column names in a data source. Examples of data domains include Social Security number, account status, IP address, and UPC code. You can add one or more data domains to a data domain group. A data domain group can contain data domains but not other data domain groups. Examples of data domain groups include PCI, PHI, and PII.

Data domain discovery is the process of discovering the functional meaning of data in the data sources based on the semantics of the data. You can use data domains to identify and understand the meaning of critical source data or undiscovered source data so that you can take measures, such as data masking, to work with it effectively. For example, you might have legacy data systems that contain Social Security numbers in a Comments field. You need to find this information so that you can take appropriate measures before you move it to new data systems.

Data domain groups help you categorize data domains into specific groups. For example, you can group the data domains first_name, last_name, and account_number into the Personal Health Information (PHI) data domain group. When you create or edit a data domain group, you can add one or more data domains to the group. You can delete a data domain from the data domain group. A data domain can be a part of multiple data domain groups. For example, the Social Security number can belong to both Payment Card Industry (PCI) and PII data domain groups.

Data Domains and Data Domain Groups in Catalog Administrator

In Catalog Administrator, you can view data domains and data domain groups, create and edit rule-based data domains, edit smart data domains, and create and edit data domain groups.

In Catalog Administrator, after you enable data domain discovery for a resource and run the resource, the profiling scanner uses the data domains to infer matching column data or column name patterns from the metadata extracted by the resources. In Enterprise Data Catalog, you can view a data domain as an asset. You can also view the data domains for tabular, column, and field assets. These data domains are inferred for the asset from the profile results or from similar columns. You can curate an inferred data domain based on your requirements.

In Catalog Administrator, you can view all the available data domains and data domain groups in the Library workspace. You can view or edit data domains, and view, edit, or delete data domain groups. You can use filters to view specific data domains or data domain groups.

To create or view data domains and data domain groups, you need appropriate permissions and privileges. Configure the Domain Management: Admin - View Domain and Domaingroup and Domain Management: Admin - Edit Domain and Domaingroup privileges in Informatica Administrator for a user to view, create, edit, or delete data domains or data domain groups in Catalog Administrator. For more information about privileges and permissions, see the Informatica Administrator Reference for Enterprise Data Catalog Guide.

In Enterprise Data Catalog, you can use the following types of data domains:

Predefined data domains: Predefined data domains are system-defined data domains. These data domains are imported automatically when you install Enterprise Data Catalog.
Rule-based data domains: Rule-based data domains are user-defined data domains. You can use one or more rules to create a rule-based data domain in Catalog Administrator.
Smart data domains: Smart data domains are user-defined data domains. You can create a smart data domain from an existing pattern of a column or field in the catalog. This type of data domain does not contain any rules. You can view smart data domains in Catalog Administrator. To convert a smart data domain to a rule-based data domain, add one or more rules, conformance values, and proximity data domains.

Predefined Data Domains

Predefined data domains and predefined data domain groups are system-defined data domains and data domain groups that are available by default in Enterprise Data Catalog. The data domains are imported automatically when you install Enterprise Data Catalog. Predefined data domains use default conformance values. In Catalog Administrator, you can edit the predefined data domains to modify the default conformance values and add proximity data domains.

Predefined data domains use a data rule, a column name rule, or both data and column name rules. Examples of predefined data domains include AccountNumber, India_NationalID, SSN_General, and ZipCode. After you install Enterprise Data Catalog, you can choose to import the predefined data domains either in the Developer tool or by running the infacmd.sh oie importobjects command.

Rule-based Data Domains

You can use one or more rules to define a rule-based data domain. In Catalog Administrator, you can create and edit rule-based data domains. When you enable data domain discovery on resources, Enterprise Data Catalog uses the data domains to discover matching column data or column name patterns from the metadata extracted by the resources. In Enterprise Data Catalog, you can view data domain assets and data domains in tabular, column, and field assets.

When you create or edit a data domain, you can add rules, choose conflict resolution options, configure conformance values, and add proximity data domains and data domain groups. You can curate the data domains in Enterprise Data Catalog.

Rules

You can use the rules that have a single output with Integer data type. If you do not use any rule in a data domain, Enterprise Data Catalog assigns the data domain to similar columns based on the data domain you assign to the column.

Choose one or both of the following rules to define a data pattern for a data domain:

•Data rule. Uses source data that match the metadata. The rule discovers columns with data that match a specific logic defined in the rule.
•Column name rule. Uses column-name patterns that match the metadata. The rule discovers columns that match the column name logic defined in the rule.

When you create or edit a data domain, choose one of the following rules as a data rule or column name rule:

Rule	Description
Reference table	Choose the reference table in the Model repository. When you choose a reference table, the rule uses the column data in the reference table to discover data domains.
Regular expression	Specify a regular expression to match string data. A regular expression is a specialized formula for matching text strings that follow a pattern. The syntax used for regular expression must conform to the grammar rules and tokens described here: https://docs.informatica.com/data-catalog/shared-content-for-data-catalog/10-5/transformation-language-reference/functions/reg_extract.html.
Existing rules	Choose existing rules in the Model repository. Before you choose a rule, verify that the rule is available to you with the appropriate permissions. You can choose mapplet-based rules that were created using Informatica Analyst or Informatica Developer. For more information about using mapplet-based rules, see: https://docs.informatica.com/data-engineering/shared-content-for-data-engineering/10-5/developer-mapping-guide/mapplets/mapplet-validation/mapplet-as-a-rule-validation.html.
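The following Python sketch illustrates the idea behind a regular expression rule: a data rule tests column values, a column name rule tests column names, and each returns a single integer result. It is only an illustration. The patterns, the function names data_rule and column_name_rule, and the use of Python's re module are assumptions made for this example; the regular expression syntax that Enterprise Data Catalog accepts is the one described at the reg_extract link above and can differ from Python's.

```python
import re

# Hypothetical pattern: US Social Security numbers written as NNN-NN-NNNN.
SSN_VALUE = re.compile(r"\d{3}-\d{2}-\d{4}")

# Hypothetical pattern: column names such as SSN or Social_Security_Number.
SSN_COLUMN_NAME = re.compile(
    r"ssn|social_?security(_?(no|num|number))?", re.IGNORECASE
)


def data_rule(value: str) -> int:
    """Return 1 when the whole value matches the data pattern, otherwise 0."""
    return 1 if SSN_VALUE.fullmatch(value) else 0


def column_name_rule(name: str) -> int:
    """Return 1 when the whole column name matches the name pattern, otherwise 0."""
    return 1 if SSN_COLUMN_NAME.fullmatch(name) else 0
```

For example, data_rule("123-45-6789") returns 1 and column_name_rule("Comments") returns 0.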

Conflict Resolution

You can use the data rule or the column name rule to configure the auto-acceptance criteria. Select the Conflict Resolution option to auto-accept data domains on the basis of the data rule or the column name rule. You can choose one of the following options:

•If rules conflict, data rule overrides column name rule. Allows Enterprise Data Catalog to auto-accept data domains based on the data rule. If the data rule is not specified, the column name rule result is used to auto-accept the data domains.
•If rules conflict, column name rule overrides data rule. Allows Enterprise Data Catalog to auto-accept data domains based on the column name rule. If the column name rule is not specified, the data rule result is used to auto-accept the data domains.
•Match data and column name rule. Enterprise Data Catalog auto-accepts data domains when data and column name rule results match. If the data rule is not specified, the data domains are not auto-accepted.
•Match data or column name rule. Enterprise Data Catalog auto-accepts data domains when data or column name rule results match. If the data rule or column name rule is not specified, the data domains are not auto-accepted.

If you do not select the Conflict Resolution option, Enterprise Data Catalog applies the data rule result to auto-accept data domains. Data domains are not auto accepted if you configure Data Domain Match Criteria as Row in the Resource section of the Catalog Administrator.

The data domains are auto accepted in the following scenarios:

Data rule overrides column name rule.

Column Name Rule Match	Data Rule Match	Data Domain State
Yes	No	No
Yes	Yes, reaches the configured minimum conformance value	Inferred
Yes	Yes	Auto accepted
No	No	No
No	Yes, reaches the configured minimum conformance value	Inferred
No	Yes	Auto accepted
Domain discovery on source metadata is not performed.	No	No
Domain discovery on source metadata is not performed.	Yes, reaches the configured minimum conformance value	Inferred
Domain discovery on source metadata is not performed.	Yes	Auto accepted
Yes	Domain discovery on source metadata is not performed.	Auto accepted
No	Domain discovery on source metadata is not performed.	No
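The rows of the "Data rule overrides column name rule" table above can be condensed into a small decision function. The following Python sketch is one reading of that table, not the product's implementation. The function name, the string codes "no", "min", and "yes", and the use of None for "discovery not performed" are all assumptions introduced for this example.

```python
from typing import Optional

NOT_ASSIGNED = "No"
INFERRED = "Inferred"
AUTO_ACCEPTED = "Auto accepted"


def data_rule_overrides(column_name_match: Optional[bool],
                        data_match: Optional[str]) -> str:
    """Return the data domain state for one column.

    column_name_match: True or False, or None when column name
        discovery was not performed.
    data_match: "no", "min" (reaches the minimum conformance value),
        "yes", or None when data discovery was not performed.
    """
    if data_match is not None:
        # The data rule result decides the state whenever it is available.
        return {"no": NOT_ASSIGNED,
                "min": INFERRED,
                "yes": AUTO_ACCEPTED}[data_match]
    # No data rule result: fall back to the column name rule result.
    # The table has no row for "neither performed"; returning "No"
    # in that case is an assumption.
    return AUTO_ACCEPTED if column_name_match else NOT_ASSIGNED
```

Under this reading, the column name rule result matters only when no data rule result is available.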

Column name rule overrides data rule.

Column Name Rule Match	Data Rule Match	Data Domain State
Yes	No	Auto accepted
Yes	Yes, reaches the configured minimum conformance value	Auto accepted
Yes	Yes	Auto accepted
No	No	No
No	Yes, exceeds the configured auto accept threshold value	No
No	Yes	No
No	Domain discovery on source data is not performed.	No
Yes	Domain discovery on source data is not performed.	Auto accepted
Domain discovery on source data is not performed.	Yes	Auto accepted
Domain discovery on source data is not performed.	Yes, exceeds the configured auto accept threshold value.	Inferred
Domain discovery on source data is not performed.	No	No

Match data and column name rule.

Column Name Rule Match	Data Rule Match	Data Domain State
Yes	Yes, exceeds the configured auto accept threshold value	Auto accepted
Yes	Yes, reaches the configured minimum conformance value	Inferred
Yes	No	No
No	No	No
No	Yes, reaches the configured minimum conformance value	No
No	Yes	No
No	Domain discovery on source metadata is not performed.	No
Yes	Domain discovery on source metadata is not performed.	Auto accepted
Domain discovery on source metadata is not performed.	Yes	Auto accepted
Domain discovery on source metadata is not performed.	No	No
Domain discovery on source metadata is not performed.	Yes, reaches the configured minimum conformance value	Inferred

Match data or column name rule.

Column Name Rule Match	Data Rule Match	Data Domain State
Yes	No	Auto accepted
Yes	Yes, reaches the configured minimum conformance value	Auto accepted
Yes	Yes	Auto accepted
No	Yes, reaches the configured minimum conformance value	Inferred
No	Yes	Auto accepted
No	Domain discovery on source metadata is not performed.	No
Yes	Domain discovery on source metadata is not performed.	Auto accepted
Domain discovery on source metadata is not performed.	Yes	Auto accepted
Domain discovery on source metadata is not performed.	No	No
Domain discovery on source metadata is not performed.	Yes, reaches the configured minimum conformance value	Inferred

Proximity Data Domains

Enterprise Data Catalog uses proximity data domains to narrow down the inferred results to identify close-to-identical columns or fields for a data domain. Proximity data domains are one of the factors that help determine the type of data in a column. When one or more data domains are inferred with equal conformance or probability, proximity data domains can be used to identify the close-to-identical data domain. When you create or edit a data domain, add one or more data domains as proximity data domains.

Enterprise Data Catalog displays the results as a match score for the data domain. The match score is the ratio of the number of proximal data domains discovered in the data source to the number of proximal data domains configured for an inferred data domain.
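As a minimal sketch of the ratio described above, the following Python function computes a match score from two sets of data domain names. The function name, the set-based representation, and the sample domain names in the usage note are assumptions for illustration; how Enterprise Data Catalog scales or rounds the displayed score is not covered here.

```python
def match_score(configured: set[str], discovered: set[str]) -> float:
    """Ratio of configured proximity data domains found in the data source.

    configured: proximity data domains configured for the inferred data domain.
    discovered: data domains discovered in the data source.
    """
    if not configured:
        raise ValueError("at least one proximity data domain must be configured")
    # Count only the discovered domains that are configured as proximal.
    found = configured & discovered
    return len(found) / len(configured)
```

For example, if FirstName, LastName, DateOfBirth, and ZipCode are configured as proximity data domains and only FirstName and ZipCode are discovered in the data source, the function returns 0.5.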

When you enable data discovery, add data domains, and run the resource, the profiling scanner scans the data source for the data domain and the proximity data domains in the resource and displays a match score in Enterprise Data Catalog.

Example: You run a profile on a CSV file that has the following columns:

Conformance Criteria

When you choose a data rule for a data domain, you can configure the conformance criteria for the data domain. Enterprise Data Catalog identifies the data domains based on the conformance criteria.

You can configure the following conformance criteria for a data domain:

•Minimum conformance. Enter a minimum conformance percentage for the data domain. By default, this option displays 40.
•Auto accept if more than. Enter a percentage. Enterprise Data Catalog accepts the data domain automatically if the data domain match exceeds the configured percentage. By default, this option displays 80.00 percent.
•Row Count. Enter the number of minimum rows for data conformance. By default, this option displays 1.

The default values are called predefined conformance values. You can configure a custom value to override the predefined values. When you choose multiple data domains or data domain groups, Enterprise Data Catalog computes the conformance value based on the predefined values or custom values.
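The following Python sketch shows one way the three conformance criteria could combine for a data rule, using the default values listed above. It rests on assumptions that the guide does not state: conformance is taken as the percentage of rows that match the rule, the Row Count value is treated as a minimum number of matching rows, and the minimum conformance threshold is treated as inclusive. The function and parameter names are hypothetical.

```python
def classify(matching_rows: int,
             total_rows: int,
             min_conformance: float = 40.0,
             auto_accept_above: float = 80.0,
             row_count: int = 1) -> str:
    """Return "Auto accepted", "Inferred", or "No" for one column."""
    # Assumption: too few matching rows means no data domain match.
    if total_rows <= 0 or matching_rows < row_count:
        return "No"
    conformance = 100.0 * matching_rows / total_rows
    # Auto-accept only when the match exceeds the configured percentage.
    if conformance > auto_accept_above:
        return "Auto accepted"
    # Assumption: reaching the minimum conformance value is enough to infer.
    if conformance >= min_conformance:
        return "Inferred"
    return "No"
```

With the defaults, a column in which 90 of 100 rows match is auto-accepted, a column with 80 matching rows is only inferred because 80 does not exceed the auto-accept percentage, and a column with 39 matching rows is not assigned the data domain.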

When you choose a column name rule for a data domain, you can configure the Minimum conformance value or the Auto accept if more than value as 0 or 100. The column name rule does not use the Row Count value to compute the data domain match.

When you create or edit a resource, you can choose the data domains and use their default or configured conformance values, or choose custom conformance values as the conformance criteria. You can also choose to exclude null values during data domain computation.