
Data Domains and Data Domain Groups

A data domain is a predefined or user-defined Model repository object that enables you to discover the functional meaning of column data or column names in a data source. Examples of data domains include Social Security number, account status, IP address, and UPC code. You can add one or more data domains to a data domain group. A data domain group can contain data domains but not other data domain groups. Examples of data domain groups include PCI, PHI, and PII.
Data domain discovery is the process of discovering the functional meaning of data in the data sources based on the semantics of the data. You can use data domains to identify and understand the meaning of critical source data or undiscovered source data so that you can take measures, such as data masking, to handle it appropriately. For example, you might have legacy data systems that contain Social Security numbers in a Comments field. You need to find this information so that you can take appropriate measures before you move it to new data systems.
Data domain groups help you categorize data domains into specific groups. For example, you can group the data domains first_name, last_name, and account_number into the Personal Health Information (PHI) data domain group. When you create or edit a data domain group, you can add one or more data domains to the group. You can also delete a data domain from a data domain group. A data domain can be a part of multiple data domain groups. For example, the Social Security number data domain can belong to both the Payment Card Industry (PCI) and PII data domain groups.

Data Domains and Data Domain Groups in Catalog Administrator

In Catalog Administrator, you can view data domains and data domain groups, create and edit rule-based data domains, edit smart data domains, and create and edit data domain groups.
In Catalog Administrator, after you enable data domain discovery for a resource and run the resource, the profiling scanner uses the data domains to infer matching column data or column name patterns from the metadata extracted by the resources. In Enterprise Data Catalog, you can view a data domain as an asset. You can also view the data domains for tabular, column, and field assets. These data domains are inferred for the asset from the profile results or from similar columns. You can curate an inferred data domain based on your requirements.
In Catalog Administrator, you can view all the available data domains and data domain groups in the Library workspace. You can view or edit data domains, and view, edit, or delete data domain groups. You can use filters to view specific data domains or data domain groups.
To create or view data domains and data domain groups, you need appropriate permissions and privileges. Configure the Domain Management: Admin - View Domain and Domaingroup and Domain Management: Admin - Edit Domain and Domaingroup privileges in Informatica Administrator for a user to view, create, edit, or delete data domains or data domain groups in Catalog Administrator. For more information about privileges and permissions, see the Informatica Administrator Reference for Enterprise Data Catalog Guide.
In Enterprise Data Catalog, you can use the following types of data domains:
Predefined data domains
Predefined data domains are system-defined data domains. These data domains are imported automatically when you install Enterprise Data Catalog.
Rule-based data domains
Rule-based data domains are user-defined data domains. You can use one or more rules to create a rule-based data domain in Catalog Administrator.
You can also create data domains and data domain groups in Informatica Analyst or Informatica Developer and use them in Enterprise Data Catalog. You can create mapplet rules in the Developer tool and use these rules to create data domains in Catalog Administrator. To make these data domains available, Enterprise Data Catalog runs the DataDomain scanner, which is automatically scheduled to run every 45 minutes. The scanner synchronizes the data domains between the Model repository and the catalog.
Smart data domains
Smart data domains are user-defined data domains. You can create a smart data domain from an existing pattern of a column or field in the catalog. This data domain does not contain any rule. You can view smart data domains in Catalog Administrator. To convert a smart data domain to a rule-based data domain, add one or more rules, conformance values, and proximity data domains.

Predefined Data Domains

Predefined data domains and predefined data domain groups are system-defined data domains and data domain groups that are available by default in Enterprise Data Catalog. The data domains are imported automatically when you install Enterprise Data Catalog. Predefined data domains use default conformance values. In Catalog Administrator, you can edit the predefined data domains to modify the default conformance values and add proximity data domains.
Predefined data domains use a data rule, a column name rule, or both data and column name rules. Examples of predefined data domains include AccountNumber, India_NationalID, SSN_General, and ZipCode. After you install Enterprise Data Catalog, you can choose to import the predefined data domains either in the Developer tool or by running the infacmd.sh oie importobjects command.

Rule-based Data Domains

You can use one or more rules to define a rule-based data domain. In Catalog Administrator, you can create and edit rule-based data domains. When you enable data domain discovery on resources, Enterprise Data Catalog uses the data domains to discover matching column data or column name patterns from the metadata extracted by the resources. In Enterprise Data Catalog, you can view data domain assets and data domains in tabular, column, and field assets.
When you create or edit a data domain, you can add rules, choose conflict resolution options, configure conformance values, add proximity data domains, and select data domain groups. You can curate the data domains in Enterprise Data Catalog.

Rules

You can use rules that have a single output with the Integer data type. If you do not use any rule in a data domain, Enterprise Data Catalog assigns the data domain to similar columns based on the data domain that you assign to a column.
Choose a data rule, a column name rule, or both to define a data pattern for a data domain.
When you create or edit a data domain, choose one of the following rules as a data rule or column name rule:
Rule
Description
Reference table
Choose the reference table in the Model repository. When you choose a reference table, the rule uses the column data in the reference table to discover data domains.
Regular expression
Specify a regular expression to match string data. A regular expression is a specialized formula for matching text strings that follow a pattern. The syntax used for regular expression must conform to the grammar rules and tokens described here: https://docs.informatica.com/data-catalog/shared-content-for-data-catalog/10-5/transformation-language-reference/functions/reg_extract.html.
Existing rules
Choose existing rules in the Model repository. Before you choose a rule, verify that the rule is available to you with appropriate permissions. You can choose mapplet-based rules that were created using Informatica Analyst or Informatica Developer. For more information about using mapplet-based rules, see: https://docs.informatica.com/data-engineering/shared-content-for-data-engineering/10-5/developer-mapping-guide/mapplets/mapplet-validation/mapplet-as-a-rule-validation.html.
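As a rough illustration of how a regular expression rule behaves, the following Python sketch tests column values against a pattern and returns 1 or 0, mirroring the single Integer output that a rule is expected to produce. The pattern, the sample values, and the function name are assumptions made for illustration. Python's re module is not the engine that Enterprise Data Catalog uses, so syntax details might differ from the grammar referenced above.

```python
import re

# Hypothetical pattern: a Social Security number written as NNN-NN-NNNN.
SSN_PATTERN = re.compile(r"\d{3}-\d{2}-\d{4}")

def regex_rule(value):
    """Return 1 if the whole value matches the pattern, otherwise 0."""
    if value is None:
        return 0
    return 1 if SSN_PATTERN.fullmatch(value) else 0

# Hypothetical column data, including a non-matching value and a null.
sample_column = ["123-45-6789", "987-65-4321", "n/a", None]
results = [regex_rule(v) for v in sample_column]
print(results)  # [1, 1, 0, 0]
```

The sketch uses fullmatch so that a value counts as a match only when the entire string follows the pattern.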

Conflict Resolution

Conflict resolution determines whether the data rule or the column name rule drives the auto-acceptance of a data domain. Select the Conflict Resolution option to auto-accept data domains on the basis of the data rule, the column name rule, or both. You can choose one of the following options: Data rule overrides column name rule, Column name rule overrides data rule, Match data and column name rule, or Match data or column name rule.
If you do not select the Conflict Resolution option, Enterprise Data Catalog applies the data rule result to auto-accept data domains. Data domains are not auto accepted if you configure Data Domain Match Criteria as Row in the Resource section of Catalog Administrator.
The data domain state for each option is determined as follows:
Data rule overrides column name rule.
Column Name Rule Match | Data Rule Match | Data Domain State
Yes | No | No
Yes | Yes, reaches the configured minimum conformance value | Inferred
Yes | Yes | Auto accepted
No | No | No
No | Yes, reaches the configured minimum conformance value | Inferred
No | Yes | Auto accepted
Domain discovery on source metadata is not performed. | No | No
Domain discovery on source metadata is not performed. | Yes, reaches the configured minimum conformance value | Inferred
Domain discovery on source metadata is not performed. | Yes | Auto accepted
Yes | Domain discovery on source metadata is not performed. | Auto accepted
No | Domain discovery on source metadata is not performed. | No
Column name rule overrides data rule.
Column Name Rule Match | Data Rule Match | Data Domain State
Yes | No | Auto accepted
Yes | Yes, reaches the configured minimum conformance value | Auto accepted
Yes | Yes | Auto accepted
No | No | No
No | Yes, exceeds the configured auto accept threshold value | No
No | Yes | No
No | Domain discovery on source data is not performed. | No
Yes | Domain discovery on source data is not performed. | Auto accepted
Domain discovery on source data is not performed. | Yes | Auto accepted
Domain discovery on source data is not performed. | Yes, exceeds the configured auto accept threshold value | Inferred
Domain discovery on source data is not performed. | No | No
Match data and column name rule.
Column Name Rule Match | Data Rule Match | Data Domain State
Yes | Yes, exceeds the configured auto accept threshold value | Auto accepted
Yes | Yes, reaches the configured minimum conformance value | Inferred
Yes | No | No
No | No | No
No | Yes, reaches the configured minimum conformance value | No
No | Yes | No
No | Domain discovery on source metadata is not performed. | No
Yes | Domain discovery on source metadata is not performed. | Auto accepted
Domain discovery on source metadata is not performed. | Yes | Auto accepted
Domain discovery on source metadata is not performed. | No | No
Domain discovery on source metadata is not performed. | Yes, reaches the configured minimum conformance value | Inferred
Match data or column name rule.
Column Name Rule Match | Data Rule Match | Data Domain State
Yes | No | Auto accepted
Yes | Yes, reaches the configured minimum conformance value | Auto accepted
Yes | Yes | Auto accepted
No | Yes, reaches the configured minimum conformance value | Inferred
No | Yes | Auto accepted
No | Domain discovery on source metadata is not performed. | No
Yes | Domain discovery on source metadata is not performed. | Auto accepted
Domain discovery on source metadata is not performed. | Yes | Auto accepted
Domain discovery on source metadata is not performed. | No | No
Domain discovery on source metadata is not performed. | Yes, reaches the configured minimum conformance value | Inferred
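To make the decision logic easier to follow, the Match data or column name rule scenarios above can be written as a lookup table. This is only a sketch that transcribes the listed rows. The constant names and the data_domain_state function are hypothetical, and the sketch does not reflect how Enterprise Data Catalog implements conflict resolution internally.

```python
# Cell values, named after the entries in the scenarios above.
YES = "Yes"
MIN = "Yes, reaches the configured minimum conformance value"
NO = "No"
SKIPPED = "Domain discovery is not performed"

# (column name rule match, data rule match) -> data domain state
MATCH_DATA_OR_COLUMN_NAME = {
    (YES, NO): "Auto accepted",
    (YES, MIN): "Auto accepted",
    (YES, YES): "Auto accepted",
    (NO, MIN): "Inferred",
    (NO, YES): "Auto accepted",
    (NO, SKIPPED): "No",
    (YES, SKIPPED): "Auto accepted",
    (SKIPPED, YES): "Auto accepted",
    (SKIPPED, NO): "No",
    (SKIPPED, MIN): "Inferred",
}

def data_domain_state(column_name_match, data_match):
    """Look up the state; return None for combinations that are not listed."""
    return MATCH_DATA_OR_COLUMN_NAME.get((column_name_match, data_match))

print(data_domain_state(YES, NO))   # Auto accepted
print(data_domain_state(NO, MIN))   # Inferred
```

The other three options can be transcribed the same way, one dictionary per option.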

Proximity Data Domains

Enterprise Data Catalog uses proximity data domains to narrow down the inferred results to identify close-to-identical columns or fields for a data domain. Proximity data domains are one of the factors that help determine the type of data in a column. When one or more data domains are inferred with equal conformance or probability, proximity data domains can be used to identify the close-to-identical data domain. When you create or edit a data domain, add one or more data domains as proximity data domains.
Enterprise Data Catalog displays the results as a match score for the data domain. The match score is the ratio of the number of proximity data domains discovered in the data source to the number of proximity data domains configured for an inferred data domain.
When you enable data discovery, add data domains, and run the resource, the profiling scanner scans the data source for the data domain and the proximity data domains in the resource and displays a match score in Enterprise Data Catalog.
Example
You run a profile on a CSV file that has the following columns:
You want to determine whether the ID column refers to EMPID, NODEID, or CUSTID data domains that also contain seven digits.
When you analyze the file, you determine that the ID column has Employee IDs because the table has columns for first names, last names, and department names. The department names usually appear in employee-related tables. The Dname column is conclusive proof that the ID column is EMPID and not NODEID or CUSTID.
In Enterprise Data Catalog, you can add proximity data domains to determine the type of data in the ID column.
To identify the data in the ID column, perform the following tasks in Catalog Administrator:
  1. Create the DeptName and EMPID data domains.
  2. Add FirstName, LastName, and DeptName data domains as proximity data domains to the EMPID data domain.
  3. Edit the resource with the CSV file, and add the EMPID data domain.
  4. Run the resource.
During the resource run, the ID column matches EMPID, NODEID, and CUSTID on conformance scores alone, but the proximity data domains resolve the tie in favor of EMPID. The catalog displays the ID column as EMPID.
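The match score described in this section can be sketched as a simple ratio. In the following Python example, the EMPID proximity configuration comes from the steps above, while the NODEID and CUSTID configurations and the match_score function are hypothetical and exist only to show how the scores separate the candidates. Returning 0.0 when no proximity data domains are configured is also an assumption.

```python
def match_score(configured_proximity, discovered_domains):
    """Ratio of configured proximity data domains found in the data source."""
    configured = set(configured_proximity)
    if not configured:
        return 0.0  # Assumption: no proximity data domains means no score.
    found = configured & set(discovered_domains)
    return len(found) / len(configured)

# Data domains discovered on the other columns of the CSV file in the example.
discovered = {"FirstName", "LastName", "DeptName"}

# EMPID follows the example; the NODEID and CUSTID entries are hypothetical.
proximity = {
    "EMPID": ["FirstName", "LastName", "DeptName"],
    "NODEID": ["HostName", "IPAddress"],
    "CUSTID": ["FirstName", "LastName", "OrderNumber"],
}

scores = {name: match_score(p, discovered) for name, p in proximity.items()}
best = max(scores, key=scores.get)
print(best, scores[best])  # EMPID 1.0
```

With equal conformance scores for all three candidates, the highest match score identifies EMPID as the closest data domain for the ID column.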

Conformance Criteria

When you choose a data rule for a data domain, you can configure the conformance criteria for the data domain. Enterprise Data Catalog identifies the data domains based on the conformance criteria.
You can configure the following conformance criteria for a data domain:
The default values are called predefined conformance values. You can configure a custom value to override the predefined values. When you choose multiple data domains or data domain groups, Enterprise Data Catalog computes the conformance value based on the predefined values or custom values.
When you choose a column name rule for a data domain, you can configure the Minimum conformance value or the Auto accept if more than value as 0 or 100. A column name rule does not use the Row Count value to compute the data domain match.
When you create or edit a resource, you can choose the data domains with their default or configured conformance values, or choose custom conformance values as conformance criteria. You can also choose to exclude null values during data domain computation.
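The way conformance values could translate into a data domain state can be sketched as follows. This is a simplified model, not the product's computation: it assumes that conformance is the percentage of rows that satisfy a data rule, that a value above the Auto accept if more than threshold is auto accepted, and that a value that reaches the minimum conformance value is inferred. The classify_column and is_five_digits functions, the thresholds, and the sample data are hypothetical, and the Row Count criterion is not modeled.

```python
def classify_column(values, rule, min_conformance, auto_accept, exclude_nulls=True):
    """Return (conformance percentage, state) for one column."""
    # Optionally drop null values before computing the percentage.
    rows = [v for v in values if v is not None] if exclude_nulls else list(values)
    if not rows:
        return 0.0, "Not inferred"
    matches = sum(1 for v in rows if v is not None and rule(v))
    conformance = 100.0 * matches / len(rows)
    if conformance > auto_accept:
        state = "Auto accepted"
    elif conformance >= min_conformance:
        state = "Inferred"
    else:
        state = "Not inferred"
    return conformance, state

def is_five_digits(value):
    """Hypothetical data rule: the value is exactly five digits."""
    return len(value) == 5 and value.isdigit()

zip_column = ["94107", "10001", None, "abcde", "60614"]
print(classify_column(zip_column, is_five_digits, 50, 90))
# (75.0, 'Inferred')
print(classify_column(zip_column, is_five_digits, 50, 90, exclude_nulls=False))
# (60.0, 'Inferred')
```

Excluding null values raises the conformance in this sample from 60 to 75 percent, because the null row no longer counts against the data domain.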