Data Discovery

When you enable data discovery for a resource and scan the resource, Enterprise Data Catalog identifies profiling-related metadata, such as null values, distinct values, inferred data types, unique keys, and data domains in the resource.

When you create a resource, choose the Metadata Load Settings > Data Discovery > Enable Data Discovery option to discover profiling-related metadata and unique keys. Based on your requirements, you can configure domain connection settings, basic profile settings, unique key inference settings, and column similarity settings.

Important: Profiling might fail with a licensing error if you run data domain discovery using a Data Integration Service on a grid.

Supported Resource Types for Data Discovery

You can enable data discovery for the following types of resources:

•Cloud

- Amazon Redshift
- Amazon S3
- Microsoft Azure Data Lake Store
- Azure Microsoft SQL Data Warehouse
- Azure Microsoft SQL Server
- Microsoft Azure Blob Storage
- Salesforce
- Google BigQuery
- Snowflake

•Data Engineering

- HDFS
- Hive

•File Management

- File System. Supported protocols include Local File, SFTP, and SMB/CIFS.
- OneDrive
- SharePoint

•Database Management

- IBM DB2
- IBM DB2 for z/OS
- IBM Netezza
- JDBC
- Microsoft SQL Server
- Oracle
- Sybase
- Teradata

•Application

- SAP R/3
- SAP S/4HANA

•NoSQL

- Apache Cassandra

You can also run data discovery on structured file types, unstructured file types, and extended unstructured formats.

Domain Connection Settings

In the Domain Connection Settings section, you can configure the properties for the Data Integration Service. After you configure the properties, the Data Integration Service runs the profile, performs data domain discovery, and infers column similarity for the resource. You can choose a different Data Integration Service to infer column similarity.

Choose one of the following options to configure Data Integration Service properties:

Custom

Configure the following Data Integration Service parameters:

Domain Name: Name of the Informatica domain.
Data Integration Service: Name of the Data Integration Service that runs the profiles.
Username: User name that the Data Integration Service uses to access the Model Repository Service.
Password: Password for the Model repository user.
Security Domain: Name of the security domain to which the domain user belongs.
Host: Host name of the master gateway node.
Port: Port number of the master gateway node.
Operating System Profile

Global

Choose this option to enable a reusable configuration.

A reusable configuration has Data Integration Service settings that you can use for a resource to extract profile metadata. You can configure one or more reusable configurations. Navigate to Manage > Reusable Configuration to view, create, or delete a reusable configuration.

Basic Profile Settings

In the Basic Profile Settings section, you can configure the following options for a resource:

Profile Run Option

Choose one of the following profile run options for the profiling scanner to run the profile job on the resource:

Column Profile: Identifies the number of null values, distinct values, non-distinct values, and infers data patterns and data types of the columns in the resource.
Data Domain Discovery: Discovers all the data domains associated with a column based on the column value or column name.
Column Profile and Data Domain Discovery: Identifies the number of null values, distinct values, non-distinct values, and infers data patterns, data types, and data domains in the resource.

When you scan a resource multiple times, the latest scan results include the results of the previous scans. For example, you choose Column Profile when you scan a resource. Then, before you run the scan again, you choose Data Domain Discovery. The results of the second scan include both the column profile results and the data domain discovery results.

Data domain discovery results display all the inferred data domains from all the runs. For example, if data domain D1 is inferred during the first resource scan and data domain D4 is inferred during the next scan, the scan results for the second time display both D1 and D4.
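The accumulation of data domains across runs can be pictured as a per-column set union. The following Python sketch is only an illustration of that behavior, not the scanner's implementation. The function name, the dictionary layout, and the column name CUSTOMER.CODE are hypothetical.

```python
def merge_domain_results(previous, latest):
    """Return the union of inferred data domains per column across two scans.

    Both arguments map a column name to the set of data domains inferred
    for that column. This layout is an assumption made for illustration.
    """
    merged = {column: set(domains) for column, domains in previous.items()}
    for column, domains in latest.items():
        merged.setdefault(column, set()).update(domains)
    return merged


# Hypothetical column: D1 is inferred in the first scan, D4 in the second.
first_scan = {"CUSTOMER.CODE": {"D1"}}
second_scan = {"CUSTOMER.CODE": {"D4"}}

combined = merge_domain_results(first_scan, second_scan)
print(sorted(combined["CUSTOMER.CODE"]))
```

In this sketch, the combined result for the column contains both D1 and D4, which mirrors the example above.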

When you scan a resource for the second time or for subsequent runs, you can optionally run only data discovery on the resource. To run only data discovery on the resource, disable the Metadata Load Settings > Source Metadata option.

Data Domain Discovery Type

Choose one of the following options for the profiling scanner to infer data domains based on column name, column data, or both:

Run Discovery on Source Data: Runs data domain discovery on source data.
Run Discovery on Source Metadata: Runs data domain discovery on column names.
Run Discovery on both Source Metadata and Data: Runs data domain discovery on source data and source metadata.
Run Discovery on Source Data Where Metadata Matches: Runs data domain discovery on the source metadata to identify the column names that match the data domains. The scanner then runs data domain discovery on the source data of the identified columns.

Note: You can choose only the Run Discovery on Source Data option when you run data domain discovery on unstructured data sources.
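The four discovery types differ mainly in which columns have their data examined. The following Python sketch illustrates that selection under simplifying assumptions. The mode strings, the function name, and the use of a single regular expression as the column name rule are all hypothetical and do not represent how data domain rules are defined in the product.

```python
import re


def columns_for_data_scan(mode, column_names, name_pattern):
    """Return the columns whose source data would be examined.

    mode is one of the hypothetical strings below, each standing in for
    one Data Domain Discovery Type option.
    """
    if mode == "SOURCE_METADATA":
        # Only column names are evaluated, so no column data is read.
        return []
    if mode in ("SOURCE_DATA", "BOTH"):
        # Data is examined for every column.
        return list(column_names)
    if mode == "SOURCE_DATA_WHERE_METADATA_MATCHES":
        # Data is examined only for columns whose names match first.
        return [name for name in column_names
                if re.search(name_pattern, name, re.IGNORECASE)]
    raise ValueError(f"Unknown discovery type: {mode}")


columns = ["EMAIL_ADDR", "ORDER_TOTAL", "contact_email"]
print(columns_for_data_scan("SOURCE_DATA_WHERE_METADATA_MATCHES", columns, r"email"))
```

With the sample columns, only EMAIL_ADDR and contact_email are selected for the data pass in the last mode, which is why that option can reduce the amount of source data that is read.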

Priority

Choose one of the following values for the profiling scanner to prioritize the resource run:

•High
•Low

The profiling scanner runs the resources with High priority and then runs the resources with Low priority.

For example, you have three resources R1, R2, and R3. The priority value for R1 and R3 is set to high and the priority value for R2 is set to low. When you run the resources, the scanner first runs R1 and R3 followed by R2.
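The ordering in this example can be sketched as a sort on a priority rank. The following Python snippet is illustrative only. The rank table and function name are hypothetical, and keeping the original order among resources with the same priority is an assumption, because this guide does not state how ties are ordered.

```python
# Lower rank runs first. The rank values are an illustrative assumption.
PRIORITY_RANK = {"High": 0, "Low": 1}


def run_order(resources):
    """Return resource names ordered by priority, High before Low.

    sorted() is stable, so resources with the same priority keep their
    input order. That tie-breaking rule is an assumption.
    """
    ordered = sorted(resources, key=lambda item: PRIORITY_RANK[item[1]])
    return [name for name, _priority in ordered]


resources = [("R1", "High"), ("R2", "Low"), ("R3", "High")]
print(run_order(resources))
```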

Sampling Option

Choose one of the following sampling options to determine the number of rows to run the profile job on:

All rows: Runs the profile on all the rows in the data source.
Auto Random rows: Runs the profile on a random sample of rows. Enterprise Data Catalog computes the number of random rows based on the number of source rows.
Random N rows
First N rows

Limit N rows: Runs the profile based on the number of rows in the data object.

Random Percentage: Runs the profile on a percentage of rows in the data object.

Exclude Views

Choose the Exclude Views option if you do not want the profiling scanner to scan the views in the relational data sources.

Incremental Profiling

Choose the option to run the profile only for the changes made to the data source. If you do not select this option, the profile runs on the entire data source.

Enterprise Data Catalog supports incremental profiling for the following resources:

•Oracle. Discovers changes made to metadata and data during incremental profiling.
•Microsoft SQL Server. Discovers changes made to metadata and data during incremental profiling.
•File System with Local File protocol. Discovers changes made to metadata and data during incremental profiling.
•HDFS. Discovers changes made to metadata during incremental profiling.
•Amazon S3. Discovers changes made to data during incremental profiling.
•Snowflake. Discovers changes made to data during incremental profiling.

When you enable incremental profiling for a resource that has one table and you run the profile multiple times on the resource, the profiling scanner validates and runs the profile on the same table each time.

Data Profile Filter

You can include or exclude tables and views from the profile run. Use semicolons (;) to separate the table names and view names.

For more information about the Data Profile Filter field, see the Source Metadata and Data Profile Filter topic.

Cumulative

By default, Enterprise Data Catalog does not retain the previous scan results and displays only the latest scan results. To retain the profile results from the previous run in the latest scan results, choose the Cumulative option. If you do not choose this option, the column profile and column similarity results from the previous run are deleted and only the latest results appear in Enterprise Data Catalog.

The following use cases explain how the Cumulative option with the Data Profile Filter field and Incremental Profiling option impacts the profiling results:

•Cumulative option with Data Profile Filter field

- You run a resource after you enter the table names and view names in the Data Profile Filter field and then you choose the Cumulative option.

In this scenario, the scanner retains the previous results, appends the latest results, and displays the consolidated profile results in Enterprise Data Catalog.

- You run the resource after you enter the tables and views in the Data Profile Filter field, but you do not choose the Cumulative option.

In this scenario, the previous profile results excluding the data domain discovery results are purged and the latest profile results appear in Enterprise Data Catalog.

•Cumulative option with Incremental Profiling option

- You run a resource after you choose the Incremental Profiling option.

In this scenario, the scanner retains the previous profile results irrespective of whether you choose the Cumulative option or not. Enterprise Data Catalog displays the consolidated profile results.

•You do not choose the Cumulative option and Incremental Profiling option

- You run a resource without choosing the Cumulative option and Incremental Profiling option.

In this scenario, the previous results excluding the data domain discovery results are purged during the next profile run. Enterprise Data Catalog displays the latest profile results.
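The use cases above reduce to a small decision rule. The following Python sketch summarizes which previous results are purged for each combination of options. It is a reading aid based on the scenarios in this section, not product code, and the function name and result type labels are hypothetical.

```python
def purged_result_types(cumulative, incremental):
    """Return the previous result types that are purged on the next run.

    Per the use cases in this section, data domain discovery results are
    never in the purged set. The string labels are illustrative.
    """
    if cumulative or incremental:
        # Previous results are retained and consolidated with the latest.
        return set()
    return {"column profile", "column similarity"}


for cumulative in (True, False):
    for incremental in (True, False):
        purged = sorted(purged_result_types(cumulative, incremental))
        print(f"Cumulative={cumulative}, Incremental={incremental}: purged={purged}")
```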

Source Connection Name

Choose the source connection to run data discovery. You can create the connections in Informatica Administrator.

Note: This parameter is optional for a File System resource.

Run On

Choose one of the following run-time environments to run the profile:

Blaze: Runs the profile in the Hadoop environment on the Blaze engine.
Spark: Runs the profile in the Hadoop environment on the Spark engine.
Native: Runs the profile on the same machine where the Data Integration Service runs.
Databricks: Runs the profile in the Hadoop environment on the Spark engine in the Databricks cluster. The Databricks run-time environment supports JDBC and Azure Data Lake Store resources.

Note: Choose the Blaze or Native run-time environment to run the profile job for all resources except Hive resources. When you choose the Blaze engine or Spark engine, select a Hadoop connection to run the profiles.

Select Data Domain

Choose one of the following data domain options:

All Data Domains: Discovers all the data domains in the resource.
Specific Data Domain Groups
Specific Data Domains: Discovers the selected data domains.

In the Library workspace, you can view all the data domains and data domain groups that are available in Enterprise Data Catalog. To create a data domain or data domain group, navigate to the New > Data Domain or New > Data Domain Group page. In the Library workspace, you can also delete data domains or data domain groups.

Use Conformance from

Choose one of the following conformance values for the data domain:

Data Domain: Uses the predefined conformance values that you configured for the data domains.
Custom: Uses the conformance value that you enter in the Custom Conformance Value field for the data domains. The custom value overrides the predefined conformance values.

Data Domain Match Criteria

Choose one of the following conformance criteria for data domain match:

Percentage: Ratio of the number of matching rows to the total number of rows.
Rows: Total number of rows.

Enterprise Data Catalog uses the data conformance properties that you configured for the data domains. Navigate to Library > Assets > Data Domains to view the data domains. Open each data domain to view its configured properties.

Exclude Null Values from Data Domain Discovery

Choose this option to exclude null values from the data source when you run data domain discovery. When you use this option, data domain inference is more accurate and reliable. For example, you have a table with 100 rows, of which 30 rows contain null values. The conformance row count is 40. If you do not choose this option, data domain discovery runs on all 100 rows, which might result in an inaccurate inference. If you choose this option, data domain discovery runs on the 70 rows that do not contain null values, and the results are more accurate.

When you select the minimum percentage of rows with the exclude null values option, the conformance percentage is the ratio of the number of matching rows in a column to the number of rows that do not contain null values. For example, if the total number of rows in a column is T, the number of matching rows is M, and the number of rows with null values is N, then the conformance percentage is M/(T-N), expressed as a percentage.
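The effect of excluding null values on the conformance percentage can be worked through with a short calculation. The following Python sketch applies the M/(T-N) formula from this section. The function name and the sample row counts are hypothetical, and returning 0 when no rows remain is an assumption made to avoid division by zero.

```python
def conformance_percentage(total_rows, matching_rows, null_rows, exclude_nulls):
    """Compute the conformance percentage for one column.

    With exclude_nulls, the denominator is T - N (rows without null values).
    Otherwise the denominator is T (all rows).
    """
    denominator = total_rows - null_rows if exclude_nulls else total_rows
    if denominator <= 0:
        # Assumption: treat a column with no usable rows as 0% conformance.
        return 0.0
    return 100.0 * matching_rows / denominator


# Hypothetical column: T = 200 rows, N = 50 null rows, M = 120 matching rows.
print(conformance_percentage(200, 120, 50, exclude_nulls=False))  # 120 / 200
print(conformance_percentage(200, 120, 50, exclude_nulls=True))   # 120 / 150
```

For these sample numbers, the percentage rises from 60 to 80 when null values are excluded, so a column that misses a minimum conformance percentage of, say, 75 with nulls included would meet it with nulls excluded.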

Similarity Profile Data Preparation and Value Frequency Settings

Configure the column similarity properties to identify similar columns and value frequency in the resource.

Run Similarity Profile: Choose one of the following options:
Save Source Data: Choose one of the following options:
Sampling Option
Domain Connection Settings

Permissions and Privileges

You can view the value frequency section in Enterprise Data Catalog if you have the following permissions and privileges:

•In the Administrator tool, the administrator must assign the Data Privileges: View Data privilege to the user.
•In Catalog Administrator, navigate to the Manage > Security > Resources page, and assign Metadata and Data Read or All Permissions permission to the resource.
•In Catalog Administrator, assign the Read permission to the DataDomain resource in the Manage > Security > Resources page.

If the data asset has sensitive data, then you can view the sensitive data in the data asset after the administrator assigns the Data Privileges: View Sensitive Data privilege. For more information about privileges and permissions, see the Informatica Administrator Reference for Catalog guide.

Unique Key Inference Settings

A unique key is a column or a combination of columns that uniquely identifies a row in a data source. The profiling service identifies columns in the data object to generate unique keys. Enterprise Data Catalog displays unique key inferences for the tabular assets.

A unique key cannot have duplicate values. If a column has duplicate values, that column is not identified as a unique key. Unique key inference is supported in the Native run-time environment.
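The duplicate-value rule can be illustrated with a brute-force check over column combinations. The following Python sketch is not the profiling service's inference algorithm. It ignores sampling and the null threshold, and the function names, the max_width limit, and the sample rows are hypothetical.

```python
from itertools import combinations


def is_unique_key(rows, columns):
    """Return True if the combined values of columns never repeat across rows."""
    seen = set()
    for row in rows:
        key = tuple(row[column] for column in columns)
        if key in seen:
            return False
        seen.add(key)
    return True


def candidate_keys(rows, column_names, max_width=2):
    """Return minimal unique column combinations up to max_width columns."""
    found = []
    for width in range(1, max_width + 1):
        for combo in combinations(column_names, width):
            # Skip combinations that only extend a key that was already found.
            if any(set(key) <= set(combo) for key in found):
                continue
            if is_unique_key(rows, combo):
                found.append(combo)
    return found


rows = [
    {"id": 1, "region": "EU", "code": "A"},
    {"id": 2, "region": "EU", "code": "B"},
    {"id": 3, "region": "US", "code": "A"},
]
print(candidate_keys(rows, ["id", "region", "code"]))
```

In the sample rows, region and code each contain duplicate values on their own, so neither qualifies as a single-column key, while id alone and the region and code pair both identify every row uniquely.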

In the Unique Key Inference Settings section, you can configure the following options for a resource to generate unique keys:

Run Unique Key Inference: The profile scanner scans and infers unique keys from the data source.

Null Threshold % in Unique key Inference: Sets the threshold for null values in the unique key inference. You can enter a value between 0 and 1.

Skip Unique Key Inference When Accepted or Documented Unique Key Exists: Skips the table with the documented or accepted unique keys.

Unique Key Sampling Options: You can choose the following sampling options:

The following resources support the unique key inference:

Relational Resource Type

File System Resource

File Types: CSV Files

When you configure a non-supported resource to infer unique keys, the following error message appears:

Unique Key Inference is not Supported for the Resource Type:Resource Type