Data Domains and Data Domain Groups
A data domain is a predefined or user-defined Model repository object that enables you to discover the functional meaning of column data or column names in a data source. Examples of data domains include Social Security number, account status, IP address, and UPC code. You can add one or more data domains to a data domain group. A data domain group can contain data domains but not other data domain groups. Examples of data domain groups include PCI, PHI, and PII.
Data domain discovery is the process of discovering the functional meaning of data in the data sources based on the semantics of the data. You can use data domains to identify and understand the meaning of critical or undiscovered source data so that you can take measures, such as data masking, to work with it effectively. For example, you might have legacy data systems that contain Social Security numbers in a Comments field. You need to find this information so that you can take appropriate measures before you move it to new data systems.
Data domain groups help you categorize data domains into specific groups. For example, you can group the data domains first_name, last_name, and account_number into the Personal Health Information (PHI) data domain group. When you create or edit a data domain group, you can add one or more data domains to the group. You can also remove a data domain from a data domain group. A data domain can be a part of multiple data domain groups. For example, the Social Security number data domain can belong to both the Payment Card Industry (PCI) and PII data domain groups.
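The membership rules above (a group holds data domains only, and one data domain can sit in several groups) can be pictured with a minimal Python sketch. The `DataDomainGroup` class, the `groups_for` helper, and the sample memberships are hypothetical and exist only for illustration; they are not part of any Informatica API.

```python
from dataclasses import dataclass, field


@dataclass
class DataDomainGroup:
    """Hypothetical in-memory model of a data domain group."""
    name: str
    # Members are data domain names only; a group never contains another group.
    domains: set[str] = field(default_factory=set)

    def add(self, domain: str) -> None:
        self.domains.add(domain)

    def remove(self, domain: str) -> None:
        # Removes the membership only; the data domain itself still exists.
        self.domains.discard(domain)


def groups_for(domain: str, groups: list[DataDomainGroup]) -> list[str]:
    """Return the sorted names of all groups that contain the data domain."""
    return sorted(g.name for g in groups if domain in g.domains)


# Illustrative memberships: the SSN data domain belongs to two groups.
pci = DataDomainGroup("PCI", {"SSN", "account_number"})
pii = DataDomainGroup("PII", {"SSN", "first_name", "last_name"})
```

Calling `groups_for("SSN", [pci, pii])` lists both groups, which mirrors the statement that a data domain can be a part of multiple data domain groups.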
Data Domains and Data Domain Groups in Catalog Administrator
In Catalog Administrator, you can view data domains and data domain groups, create and edit rule-based data domains, edit example-based data domains, and create and edit data domain groups.
In Catalog Administrator, after you enable data domain discovery for a resource and run the resource, the profiling scanner uses the data domains to infer matching column data or column name patterns from the metadata extracted by the resources. In Enterprise Data Catalog, you can view a data domain as an asset. You can also view the data domains for tabular, column, and field assets. These data domains are inferred for the asset from the profile results or from similar columns. You can curate an inferred data domain based on your requirements.
In Catalog Administrator, you can view all the available data domains and data domain groups in the Library workspace. You can view or edit data domains, and view, edit, or delete data domain groups. You can use filters to view specific data domains or data domain groups.
To create or view data domains and data domain groups, you need appropriate permissions and privileges. Configure the Domain Management: Admin - View Domain and Domaingroup and Domain Management: Admin - Edit Domain and Domaingroup privileges in Informatica Administrator for a user to view, create, edit, or delete data domains or data domain groups in Catalog Administrator. For more information about privileges and permissions, see the Informatica Administrator Reference for Enterprise Data Catalog Guide.
In Enterprise Data Catalog, you can use the following types of data domains:
- Predefined data domains. Predefined data domains are system-defined data domains. These data domains are imported automatically when you install Enterprise Data Catalog.
- Rule-based data domains. Rule-based data domains are user-defined data domains. You can use one or more rules to create a rule-based data domain in Catalog Administrator. You can also create data domains and data domain groups in Informatica Analyst or Informatica Developer and use them in Enterprise Data Catalog. You can create mapplet rules in the Developer tool and use these rules to create data domains in Catalog Administrator. To use these data domains, Enterprise Data Catalog runs the DataDomain scanner, which is auto-scheduled to run every 45 minutes. The scanner synchronizes the data domains between the Model repository and the catalog.
- Example-based data domains. Example-based data domains are user-defined data domains, also known as smart data domains. You can create an example-based data domain instantly from the existing pattern of a column or field in Enterprise Data Catalog. An example-based data domain does not contain any rule. You can view an example-based data domain in Catalog Administrator. To convert an example-based data domain to a rule-based data domain, add one or more rules, conformance values, and proximity data domains.
Predefined Data Domains
Predefined data domains and predefined data domain groups are system-defined data domains and data domain groups that are available by default in Enterprise Data Catalog. The data domains are imported automatically when you install Enterprise Data Catalog. Predefined data domains use default conformance values. In Catalog Administrator, you can edit the predefined data domains to modify the default conformance values and add proximity data domains.
Predefined data domains use a data rule, a column name rule, or both. Examples of predefined data domains include AccountNumber, India_NationalID, SSN_General, and ZipCode. After you install Enterprise Data Catalog, you can choose to import the predefined data domains either in the Developer tool or by running the infacmd.sh oie importobjects command.
Rule-based Data Domains
You can use one or more rules to define a rule-based data domain. In Catalog Administrator, you can create and edit rule-based data domains. When you enable data domain discovery on resources, Enterprise Data Catalog uses the data domains to discover matching column data or column name patterns from the metadata extracted by the resources. In Enterprise Data Catalog, you can view data domain assets and data domains in tabular, column, and field assets.
When you create or edit a data domain, you can add rules, configure conformance values, and add proximity data domains and data domain groups. You can curate the data domains in Enterprise Data Catalog.
Rules
You can use rules that have a single output of the Integer data type. If you do not use any rule in a data domain, Enterprise Data Catalog assigns the data domain to similar columns based on the columns to which you assign the data domain.
Choose one or both of the following rules to define a data pattern for a data domain:
- Data rule. Uses source data that match the metadata. The rule discovers columns with data that match a specific logic defined in the rule.
- Column name rule. Uses column-name patterns that match the metadata. The rule discovers columns that match the column name logic defined in the rule.
When you create or edit a data domain, choose one of the following rules as a data rule or column name rule:
| Rule | Description |
|---|---|
| Reference table | Choose the reference table in the Model repository. When you choose a reference table, the rule uses the column data in the reference table to discover data domains. |
| Regular expression | Choose a regular expression. A regular expression is a specialized formula for matching text strings that follow a pattern. When you choose a regular expression, the rule uses the expression to discover the data domains. |
| Existing rules | Choose existing rules in the Model repository. Before you choose a rule, verify that the rules are available to you with appropriate permissions. You can choose the rules that you created in Informatica Analyst or Informatica Developer. |
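To make the difference between the rule types more concrete, the following minimal Python sketch applies a regular expression and a reference set as data rules, and a second regular expression as a column name rule. The patterns, the function names, the sample reference values, and the convention that 1 means a match are all assumptions for illustration. They are not the rule logic shipped with the predefined data domains, and real rules are defined in the Model repository rather than in Python.

```python
import re

# Hypothetical patterns, not the ones shipped with Enterprise Data Catalog.
SSN_DATA_PATTERN = re.compile(r"\d{3}-\d{2}-\d{4}")
SSN_NAME_PATTERN = re.compile(r"(ssn|social.?sec)", re.IGNORECASE)

# Hypothetical stand-in for a reference table column in the Model repository.
ACCOUNT_STATUS_REFERENCE = {"ACTIVE", "CLOSED", "SUSPENDED"}


def data_rule_regex(value: str) -> int:
    """Data rule: return 1 if the whole value looks like NNN-NN-NNNN, else 0."""
    return 1 if SSN_DATA_PATTERN.fullmatch(value.strip()) else 0


def data_rule_reference(value: str) -> int:
    """Data rule: return 1 if the value appears in the reference set, else 0."""
    return 1 if value.strip().upper() in ACCOUNT_STATUS_REFERENCE else 0


def column_name_rule(column_name: str) -> int:
    """Column name rule: return 1 if the name suggests an SSN column, else 0."""
    return 1 if SSN_NAME_PATTERN.search(column_name) else 0
```

Each function returns a single integer, in line with the requirement that a rule has a single output of the Integer data type. The data rules inspect individual column values, while the column name rule inspects only the column name.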
Proximity Data Domains
Enterprise Data Catalog uses proximity data domains to narrow down the inferred results and identify close-to-identical columns or fields for a data domain. Enterprise Data Catalog displays the results as a match score for the data domain. The match score is the ratio of the number of proximity data domains discovered in the data source to the number of proximity data domains configured for an inferred data domain.
To use proximity data domains in the data domain discovery process, perform the following tasks:
1. When you create or edit a data domain, add one or more data domains as proximity data domains.
2. When you create or edit a resource and enable data domain discovery, add the proximity data domains to the data domain.
When you enable data domain discovery and run the resource, the profiling scanner scans the data source for the data domain and the proximity data domains in the resource and displays a match score in Enterprise Data Catalog.
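The match score definition given above is a simple ratio, which a short Python sketch can make explicit. The `match_score` function and the sample data domain names are hypothetical, and the sketch assumes that only configured proximity data domains count toward the numerator. How Enterprise Data Catalog rounds or presents the value is not covered here.

```python
def match_score(configured: set[str], discovered: set[str]) -> float:
    """Ratio of configured proximity data domains that were found in the source.

    configured: proximity data domains configured for the inferred data domain.
    discovered: data domains that discovery found in the data source.
    """
    if not configured:
        raise ValueError("at least one proximity data domain must be configured")
    # Only discovered domains that are also configured count toward the score.
    found = configured & discovered
    return len(found) / len(configured)


# Illustrative values: two of the four configured proximity domains are found.
configured = {"first_name", "last_name", "date_of_birth", "ZipCode"}
discovered = {"first_name", "ZipCode", "email"}
score = match_score(configured, discovered)
```

In this example `score` is 0.5, because two of the four configured proximity data domains appear in the data source. The `email` data domain is discovered but not configured, so it does not affect the score.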
Conformance Criteria
When you choose a data rule for a data domain, you can configure the conformance criteria for the data domain. Enterprise Data Catalog identifies the data domains based on the conformance criteria.
You can configure the following conformance criteria for a data domain:
- Minimum conformance. Enter a minimum conformance percentage for the data domain. By default, this option displays 40.
- Auto accept if more than. Enter a percentage. Enterprise Data Catalog accepts the data domain automatically if the data domain match exceeds the configured percentage. By default, this option displays 80.00 percent.
- Row Count. Enter the minimum number of rows for data conformance. By default, this option displays 1.
The default values are called predefined conformance values. You can configure a custom value to override the predefined values. When you choose multiple data domains or data domain groups, Enterprise Data Catalog computes the conformance value based on the predefined values or custom values.
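As a rough illustration of how the three criteria could interact for a data rule, consider the minimal Python sketch below. It assumes that conformance is the percentage of rows that satisfy the data rule, that Row Count is the minimum number of conforming rows, and that the outcomes can be labeled "accepted", "inferred", and "not inferred". These assumptions, along with the `ConformanceCriteria` and `evaluate_column` names, are hypothetical; the source text does not specify the exact computation.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass(frozen=True)
class ConformanceCriteria:
    """Defaults mirror the predefined conformance values listed above."""
    minimum_conformance: float = 40.0  # percent
    auto_accept_above: float = 80.0    # percent
    row_count: int = 1                 # assumed: minimum conforming rows


def evaluate_column(
    values: Sequence[str],
    data_rule: Callable[[str], int],
    criteria: ConformanceCriteria = ConformanceCriteria(),
) -> str:
    """Classify a column as 'accepted', 'inferred', or 'not inferred'."""
    if not values:
        return "not inferred"
    # Assumed convention: the rule returns 1 for a conforming value.
    matching_rows = sum(1 for value in values if data_rule(value) == 1)
    percent = 100.0 * matching_rows / len(values)
    if matching_rows < criteria.row_count:
        return "not inferred"
    if percent < criteria.minimum_conformance:
        return "not inferred"
    if percent > criteria.auto_accept_above:
        return "accepted"
    return "inferred"
```

Passing a custom `ConformanceCriteria`, for example `ConformanceCriteria(minimum_conformance=60.0)`, corresponds to overriding the predefined values with custom values.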
When you choose a column name rule for a data domain, you can configure the Minimum conformance value or the Auto accept if more than value as 0 or 100. A column name rule does not use the Row Count value to compute the data domain match.
When you create or edit a resource, you can choose the data domains and use their default or configured conformance values, or choose custom conformance values, as conformance criteria. You can also choose to exclude null values during data domain computation.