Data Discovery

When you enable data discovery for a resource and scan the resource, Enterprise Data Catalog identifies profiling-related metadata, such as null values, distinct values, inferred data types, unique keys, and data domains in the resource.
When you create a resource, choose the Metadata Load Settings > Data Discovery > Enable Data Discovery option to discover profiling-related metadata and unique keys. Based on your requirements, you can configure domain connection settings, basic profile settings, unique key inference settings, and column similarity settings.
Important: Profiling might fail with a licensing error if you run data domain discovery using a Data Integration Service on a grid.

Supported Resource Types for Data Discovery

You can enable data discovery for the following types of resources:
You can also run data discovery on structured file types, unstructured file types, and extended unstructured formats.
Structured file types
Data discovery in Enterprise Data Catalog supports the following structured file types:
Unstructured File Types and Extended Unstructured Formats
When you choose HDFS, Amazon S3, or File System as a resource, you can choose extended unstructured formats or unstructured file types. Extended unstructured formats include mp3, mp4, bmp, and jpg formats. Extended unstructured formats do not fall under structured or unstructured file types.
Data domain discovery in Enterprise Data Catalog supports the following unstructured file types:

Domain Connection Settings

In the Domain Connection Settings section, you can configure the properties for the Data Integration Service. After you configure the properties, the Data Integration Service runs the profile, performs data domain discovery, and infers column similarity for the resource. You can choose a different Data Integration Service to infer column similarity.
Choose one of the following options to configure Data Integration Service properties:

Custom

Configure the following Data Integration Service parameters:
Domain Name
Name of the Informatica domain.
Data Integration Service
Name of the Data Integration Service that runs the profiles.
Username
User name that the Data Integration Service uses to access the Model Repository Service.
Password
Password for the Model repository user.
Security Domain
Name of the security domain to which the domain user belongs.
Host
Host name of the master gateway node.
Port
Port number of the master gateway node.
Operating System Profile
Choose an operating system profile if you do not have a default operating system profile. You need to assign an operating system profile to Enterprise Data Catalog users if the Data Integration Service uses operating system profiles. If you do not assign an operating system profile to a user and the user tries to run a profile in Catalog Administrator, the profile run fails. The Data Integration Service uses the operating system profile user credentials to run data discovery. Data discovery includes column profiles and data domain discovery profiles.
If you have multiple operating system profiles and you do not have a default operating system profile, choose an operating system profile in the Domain Connection Settings section.
In the Informatica Administrator > Security > Operating System Profiles tab, you can create or assign an operating system profile. To configure a default operating system profile for a user, click the Assign or Change the Default Operating System Profile option in the Security > Users > Permissions > Operating System Profiles section.
Note: You can choose an operating system profile only when you choose the Custom option for the Specify the configuration settings for Data Integration Service property. If you choose the Global option for the Specify the configuration settings for Data Integration Service property, then you must have a default operating system profile to run the profile.
For information about configuring the Data Integration Service to use operating system profiles, see Informatica Application Service Guide. For information about creating and assigning operating system profiles, see Informatica Security Guide.

Global

Choose this option to enable a reusable configuration.
A reusable configuration has Data Integration Service settings that you can use for a resource to extract profile metadata. You can configure one or more reusable configurations. Navigate to Manage > Reusable Configuration to view, create, or delete a reusable configuration.

Basic Profile Settings

In the Basic Profile Settings section, you can configure the following options for a resource:

Profile Run Option

Choose one of the following profile run options for the profiling scanner to run the profile job on the resource:
Column Profile
Identifies the number of null values, distinct values, and non-distinct values, and infers the data patterns and data types of the columns in the resource.
Data Domain Discovery
Discovers all the data domains associated with a column based on the column value or column name.
Column Profile and Data Domain Discovery
Identifies the number of null values, distinct values, and non-distinct values, and infers the data patterns, data types, and data domains in the resource.
When you run a scan on a resource multiple times, the last scan results include the results of all the scans. For example, you choose column profile when you scan a resource. Then, before you run the scan again, you choose to perform data domain discovery. The results for the second scan include both the column profile results and the data domain discovery results.
Data domain discovery results display all the inferred data domains from all the runs. For example, if data domain D1 is inferred during the first resource scan and data domain D4 is inferred during the next scan, the scan results for the second time display both D1 and D4.
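The accumulation of inferred data domains across scans can be pictured as a set union. The following is a minimal sketch, not the product's implementation. The function name merge_scan_results and the per-column dictionary layout are hypothetical and chosen only for illustration.

```python
def merge_scan_results(previous, latest):
    """Union the inferred data domains per column across two scans.

    Both arguments map a column name to a set of data domain names.
    This layout is an assumption made for the example.
    """
    merged = {column: set(domains) for column, domains in previous.items()}
    for column, domains in latest.items():
        merged.setdefault(column, set()).update(domains)
    return merged


first_scan = {"CUSTOMER.COL1": {"D1"}}
second_scan = {"CUSTOMER.COL1": {"D4"}}
combined = merge_scan_results(first_scan, second_scan)
print(sorted(combined["CUSTOMER.COL1"]))  # ['D1', 'D4']
```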
When you run a scan on a resource for the second time or for subsequent runs, you can optionally run only data discovery on the resource. To run only data discovery on the resource, disable the Metadata Load Settings > Source Metadata option.

Data Domain Discovery Type

Choose one of the following options for the profiling scanner to infer data domains based on column name, column data, or both:
Run Discovery on Source Data
Runs data domain discovery on source data.
Run Discovery on Source Metadata
Runs data domain discovery on column names.
Run Discovery on both Source Metadata and Data
Runs data domain discovery on source data and source metadata.
Run Discovery on Source Data Where Metadata Matches
Runs data domain discovery on the source metadata to identify the column names that match the data domains. The scanner then runs data domain discovery on the source data of the identified columns.
Note: You can choose only the Run Discovery on Source Data option when you run data domain discovery on unstructured data sources.
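The two-stage behavior of the Run Discovery on Source Data Where Metadata Matches option can be sketched as follows. This is an illustration under stated assumptions, not how the profiling scanner is implemented. The Email domain, its regular expressions, the function name, and the min_ratio threshold are all hypothetical.

```python
import re

# Hypothetical data domain rules: one pattern for column names (metadata)
# and one pattern for column values (data).
DOMAINS = {
    "Email": {
        "name": re.compile(r"e-?mail", re.IGNORECASE),
        "value": re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$"),
    },
}


def discover_where_metadata_matches(columns, domains=DOMAINS, min_ratio=0.8):
    """Return a mapping of column name to matched data domains.

    columns maps a column name to a list of values. min_ratio is an
    assumed conformance threshold used only for this sketch.
    """
    results = {}
    for column_name, values in columns.items():
        for domain, rules in domains.items():
            # Stage 1: metadata check on the column name.
            if not rules["name"].search(column_name):
                continue
            # Stage 2: data check, only for columns that passed stage 1.
            matches = sum(
                1 for v in values
                if v is not None and rules["value"].match(str(v))
            )
            if values and matches / len(values) >= min_ratio:
                results.setdefault(column_name, []).append(domain)
    return results


table = {
    "contact_email": ["a@example.com", "b@example.org"],
    "notes": ["d@example.net", "e@example.com"],
}
print(discover_where_metadata_matches(table))  # {'contact_email': ['Email']}
```

In this sketch, the notes column is never checked against the value pattern because its name does not match the domain, even though its values look like email addresses.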

Priority

Choose one of the following values for the profiling scanner to prioritize the resource run:
The profiling scanner runs the resources with High priority and then runs the resources with Low priority.
For example, you have three resources R1, R2, and R3. The priority value for R1 and R3 is set to high and the priority value for R2 is set to low. When you run the resources, the scanner first runs R1 and R3 followed by R2.
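The ordering in this example can be expressed as a stable sort on priority. This is a minimal sketch. The function name run_order and the tuple layout are hypothetical, and the order among resources that share a priority is an assumption (input order is kept here).

```python
def run_order(resources):
    """Return resource names with High-priority resources first.

    resources is a list of (name, priority) tuples. The sort is stable,
    so resources with the same priority keep their input order.
    """
    rank = {"High": 0, "Low": 1}
    ordered = sorted(resources, key=lambda resource: rank[resource[1]])
    return [name for name, _priority in ordered]


resources = [("R1", "High"), ("R2", "Low"), ("R3", "High")]
print(run_order(resources))  # ['R1', 'R3', 'R2']
```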

Sampling Option

Choose one of the following sampling options to determine the number of rows to run the profile job on:
All rows
Runs the profile on all the rows in the data source.
Auto Random rows
Runs the profile on a random sample of rows. Enterprise Data Catalog computes the number of random rows based on the number of source rows.
Random N rows
Runs the profile on the configured number of random rows.
In the Random Sampling Rows field, enter the number of rows that you want to run the profile on.
First N rows
Runs the profile on the first N rows in the resource.
In the Number of First N Sampling Rows field, enter the number of rows to run the profile on.
Limit N rows
Runs the profile based on the number of rows in the data object.
In the Number of Rows to Limit field, enter the number of rows to run the profile on.
Random Percentage
Runs the profile on a percentage of rows in the data object.
In the Random Percentage field, enter the percentage of rows to run the profile on.

Exclude Views

Choose the Exclude Views option if you do not want the profiling scanner to scan the views in the relational data sources.

Incremental Profiling

Choose the option to run the profile only for the changes made to the data source. If you do not select this option, the profile runs on the entire data source.
Enterprise Data Catalog supports incremental profiling for the following resources:
When you enable incremental profiling for a resource that has one table and you run the profile multiple times on the resource, the profiling scanner validates and runs the profile on the same table every single time.

Data Profile Filter

You can include or exclude tables and views from the profile run. Use semicolons (;) to separate the table names and view names.
For more information about the Data Profile Filter field, see the Source Metadata and Data Profile Filter topic.

Cumulative

By default, Enterprise Data Catalog does not retain the previous scan results and displays only the latest scan results. To retain the profile results from the previous run in the latest scan results, choose the Cumulative option. If you do not choose this option, the column profile and column similarity results from the previous run are deleted and only the latest results appear in Enterprise Data Catalog.
The following use cases explain how the Cumulative option with the Data Profile Filter field and Incremental Profiling option impacts the profiling results:

Source Connection Name

Choose the source connection to run data discovery. You can create the connections in Informatica Administrator.
Note: This parameter is optional for a File System resource.

Run On

Choose one of the following run-time environments to run the profile:
Blaze
Runs the profile in the Hadoop environment on the Blaze engine.
Spark
Runs the profile in the Hadoop environment on the Spark engine.
Native
Runs the profile on the same machine where the Data Integration Service runs.
Databricks
Runs the profile in the Hadoop environment on the Spark engine in the Databricks cluster. The Databricks run-time environment supports JDBC and Azure Data Lake Store resources.
Note: Choose the Blaze or Native run-time environment to run the profile job for all resources except Hive resources. When you choose the Blaze engine or Spark engine, select a Hadoop connection to run the profiles.

Select Data Domain

Choose one of the following data domain options:
All Data Domains
Discovers all the data domains in the resource.
Specific Data Domain Groups
Discovers the data domains in the selected data domain groups.
In the Data Domain Groups field, choose one or more data domain groups.
Specific Data Domains
Discovers the selected data domains.
In the Data Domains field, choose one or more data domains.
In the Library workspace, you can view all the data domains and data domain groups that are available in Enterprise Data Catalog. To create a data domain or data domain group, navigate to the New > Data Domain page or the New > Data Domain Group page. In the Library workspace, you can also delete data domains or data domain groups.

Use Conformance from

Choose one of the following conformance values for the data domain:
Data Domain
Uses the predefined conformance values that you configured for the data domains.
When you create a data domain, you configure the minimum percentage of source rows and minimum number of source rows as the conformance criteria for data domain match. These values are predefined conformance values.
Custom
Uses the conformance value that you enter in the Custom Conformance Value field for the data domains. The custom value overrides the predefined conformance values.

Data Domain Match Criteria

Choose one of the following conformance criteria for data domain match:
Percentage
Ratio of the number of matching rows to the total number of rows.
Rows
Total number of rows.
Enterprise Data Catalog uses the data conformance properties that you configured for the data domains. Navigate to Library > Assets > Data Domains to view the data domains. Open each data domain to view its configured properties.

Exclude Null Values from Data Domain Discovery

Choose the option to exclude the null values from the data source when you run data domain discovery. When you use this option, the data domain inference is more accurate and reliable. For example, you have a table with 100 rows, and 30 rows contain null values. The conformance row count is 40. If you do not choose this option, data domain discovery runs on all 100 rows, which might result in an inaccurate inference. If you choose this option, data domain discovery runs on the 70 rows that do not contain null values, and the results are more accurate.
When you select the minimum percentage of rows with the exclude null values option, the conformance percentage is the ratio of the number of matching rows in a column to the number of rows that do not contain null values. For example, if the total number of rows in a column is T, the number of matching rows is M, and the number of rows with null values is N, then the conformance percentage is (M / (T - N)) × 100.
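The conformance percentage formula can be checked with a short calculation. This sketch reuses the table of 100 rows with 30 null rows from the example above. The matching row count of 35 and the function name conformance_percentage are assumptions made for illustration.

```python
def conformance_percentage(total_rows, matching_rows, null_rows, exclude_nulls):
    """Return the conformance percentage for a column.

    With exclude_nulls set, the denominator is T - N (non-null rows).
    Otherwise the denominator is T (all rows).
    """
    denominator = total_rows - null_rows if exclude_nulls else total_rows
    if denominator <= 0:
        return 0.0
    return 100.0 * matching_rows / denominator


# T = 100, N = 30, and an assumed M = 35 matching rows.
print(conformance_percentage(100, 35, 30, exclude_nulls=False))  # 35.0
print(conformance_percentage(100, 35, 30, exclude_nulls=True))   # 50.0
```

With the same number of matching rows, excluding null values raises the conformance percentage because rows that cannot match are no longer counted in the denominator.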

Similarity Profile Data Preparation and Value Frequency Settings

Configure the column similarity properties to identify similar columns and value frequency in the resource.
Run Similarity Profile
Choose one of the following options:
Save Source Data
Choose one of the following options:
Sampling Option
Choose one of the following sampling options to determine the number of rows that Enterprise Data Catalog can run the profile on:
Domain Connection Settings
Choose one of the following domain connection settings options:

Permissions and Privileges

You can view the value frequency section in Enterprise Data Catalog if you have the following permissions and privileges:
If the data asset has sensitive data, then you can view the sensitive data in the data asset after the administrator assigns the Data Privileges: View Sensitive Data privilege. For more information about privileges and permissions, see the Informatica Administrator Reference for Catalog guide.

Unique Key Inference Settings

A unique key is a column or combination of columns that uniquely identifies a row in a data source. The profiling service identifies columns in the data object to generate unique keys. Enterprise Data Catalog displays unique key inferences for the tabular assets.
The unique key cannot have duplicate values. If a column has duplicate values, that column is not identified as the unique key. The unique key inference is supported on the native run-time environment.
In the Unique Key Inference Settings section, you can configure the following options for a resource to generate unique keys:
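The no-duplicates rule for unique keys can be illustrated with a small check over candidate columns and column combinations. This is a sketch of the general idea, not the profiling service's algorithm. The function names, the max_columns limit, and the sample rows are hypothetical, and null handling is deliberately left out.

```python
from itertools import combinations


def is_unique_key(rows, columns):
    """Return True if no two rows share the same values in columns."""
    seen = set()
    for row in rows:
        key = tuple(row[column] for column in columns)
        if key in seen:
            return False
        seen.add(key)
    return True


def infer_unique_keys(rows, column_names, max_columns=2):
    """Return minimal unique keys made of up to max_columns columns."""
    keys = []
    for size in range(1, max_columns + 1):
        for combo in combinations(column_names, size):
            # Skip combinations that contain an already-found smaller key.
            if any(set(key) <= set(combo) for key in keys):
                continue
            if is_unique_key(rows, combo):
                keys.append(combo)
    return keys


rows = [
    {"id": 1, "first": "Ann", "last": "Lee"},
    {"id": 2, "first": "Ann", "last": "Kim"},
    {"id": 3, "first": "Bob", "last": "Lee"},
]
print(infer_unique_keys(rows, ["id", "first", "last"]))
# [('id',), ('first', 'last')]
```

Here the first and last columns each contain duplicate values, so neither is a unique key on its own, but the combination of the two identifies every row.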
Run Unique Key Inference
The profile scanner scans and infers unique keys from the data source.
Null Threshold % in Unique key Inference
Sets the threshold for null values in the unique key inference. You can enter a value between 0 and 1.
Skip Unique Key Inference When Accepted or Documented Unique Key Exists
Skips the table with the documented or accepted unique keys.
Unique Key Sampling Options
You can choose the following sampling options:
The following resources support the unique key inference:
Relational Resource Type
File System Resource
File Types
CSV Files
When you configure a non-supported resource to infer unique keys, the following error message appears:
Unique Key Inference is not Supported for the Resource Type:Resource Type