Data Discovery
When you enable data discovery for a resource and scan the resource, Enterprise Data Catalog identifies profiling-related metadata, such as null values, distinct values, inferred data types, unique keys, and data domains in the resource.
When you create a resource, choose the Metadata Load Settings > Data Discovery > Enable Data Discovery option to discover profiling-related metadata and unique keys. Based on your requirements, you can configure domain connection settings, basic profile settings, unique key inference settings, and column similarity settings.
Important: Profiling might fail with a licensing error if you run data domain discovery using a Data Integration Service on a grid.
Supported Resource Types for Data Discovery
You can enable data discovery for the following types of resources:
- Cloud
- - Amazon Redshift
- - Amazon S3
- - Microsoft Azure Data Lake Store
- - Azure Microsoft SQL Data Warehouse
- - Azure Microsoft SQL Server
- - Microsoft Azure Blob Storage
- - Salesforce
- - Google BigQuery
- - Snowflake
- Data Engineering
- File Management
- - File System. Supported protocols include Local File, SFTP, and SMB/CIFS.
- - OneDrive
- - SharePoint
- Database Management
- - IBM DB2
- - IBM DB2 for z/OS
- - IBM Netezza
- - JDBC
- - Microsoft SQL Server
- - Oracle
- - Sybase
- - Teradata
- Application
- No SQL
You can also run data discovery on structured file types, unstructured file types, and extended unstructured formats.
- Structured file types
Data discovery in Enterprise Data Catalog supports the following structured file types:
- - Avro. Supported extension type is .avro.
This file type is available for HDFS resource and File System resource. For the File System resource, you can choose only the Local File protocol.
- - Delimited and Text
- - JSON
- - Parquet.
This file type is available for HDFS, Amazon S3, Azure Data Lake Store Gen2, and File System resource. For the File System resource, you can choose only the Local File protocol.
- - XML
- Unstructured File Types and Extended Unstructured Formats
When you choose HDFS, Amazon S3, or File System as a resource, you can choose extended unstructured formats or unstructured file types. Extended unstructured formats include mp3, mp4, bmp, and jpg formats. Extended unstructured formats do not fall under structured or unstructured file types.
Data domain discovery in Enterprise Data Catalog supports the following unstructured file types:
- - Apple Files. Supported extension types include .key, .pages, .numbers, .ibooks, and .ipa.
- - Compressed Files. Supported extension types include .gz, .tgz, and .emz.
- - Email. Supported extension types include .eml, .emlx, and .mime.
- - Microsoft Excel
- - Microsoft PowerPoint
- - Microsoft Word
- - Open Office Files. Supported extension types include .odt, .ott, .odm, .ods, .ots, .odp, .odg, .otp, .otg, and .odf.
- - PDF
- - Webpage Files. Supported extension types include .chm, .oth, and .xhtml.
- - CLOB. Supported on IBM DB2 and IBM DB2 for z/OS data sources.
Domain Connection Settings
In the Domain Connection Settings section, you can configure the properties for the Data Integration Service. After you configure the properties, the Data Integration Service runs the profile, performs data domain discovery, and infers column similarity for the resource. You can choose a different Data Integration Service to infer column similarity.
Choose one of the following options to configure Data Integration Service properties:
Custom
Configure the following Data Integration Service parameters:
- Domain Name
- Name of the Informatica domain.
- Data Integration Service
- Name of the Data Integration Service that runs the profiles.
- Username
- User name that the Data Integration Service uses to access the Model Repository Service.
- Password
- Password for the Model repository user.
- Security Domain
- Name of the security domain to which the domain user belongs.
- Host
- Host name of the master gateway node.
- Port
- Port number of the master gateway node.
- Operating System Profile
Choose an operating system profile if you do not have a default operating system profile. You need to assign an operating system profile to Enterprise Data Catalog users if the Data Integration Service uses operating system profiles. If you do not assign an operating system profile to a user and the user tries to run a profile in Catalog Administrator, the profile run fails. The Data Integration Service uses the operating system profile user credentials to run data discovery. Data discovery includes column profiles and data domain discovery profiles.
If you have multiple operating system profiles and you do not have a default operating system profile, choose an operating system profile in the Domain Connection Settings section.
In the Informatica Administrator > Security > Operating System Profiles tab, you can create or assign an operating system profile. To configure a default operating system profile for a user, click the Assign or Change the Default Operating System Profile option in the Security > Users > Permissions > Operating System Profiles section.
Note: You can choose an operating system profile only when you choose the Custom option for the Specify the configuration settings for Data Integration Service property. If you choose the Global option for the Specify the configuration settings for Data Integration Service property, then you must have a default operating system profile to run the profile.
For information about configuring the Data Integration Service to use operating system profiles, see Informatica Application Service Guide. For information about creating and assigning operating system profiles, see Informatica Security Guide.
Global
Choose this option to enable a reusable configuration.
A reusable configuration has Data Integration Service settings that you can use for a resource to extract profile metadata. You can configure one or more reusable configurations. Navigate to Manage > Reusable Configuration to view, create, or delete a reusable configuration.
Basic Profile Settings
In the Basic Profile Settings section, you can configure the following options for a resource:
Profile Run Option
Choose one of the following profile run options for the profiling scanner to run the profile job on the resource:
- Column Profile
- Identifies the number of null values, distinct values, non-distinct values, and infers data patterns and data types of the columns in the resource.
- Data Domain Discovery
- Discovers all the data domains associated with a column based on the column value or column name.
- Column Profile and Data Domain Discovery
- Identifies the number of null values, distinct values, non-distinct values, and infers data patterns, data types, and data domains in the resource.
When you run a scan on a resource multiple times, the results of the last scan include the results of all earlier scans. For example, you choose column profile when you scan a resource. Then, before you run the scan again, you choose to perform data domain discovery. The results for the second scan include both the column profile results and the data domain discovery results.
Data domain discovery results display all the inferred data domains from all the runs. For example, if data domain D1 is inferred during the first resource scan and data domain D4 is inferred during the next scan, the scan results for the second time display both D1 and D4.
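The accumulation of inferred data domains across scans can be pictured as a set union. The following Python sketch is only an illustration of the D1 and D4 example above; the function name and the use of sets are assumptions and do not reflect how Enterprise Data Catalog stores results.

```python
def merge_domain_results(previous: set, latest: set) -> set:
    """Return the data domains displayed after a rescan.

    Hypothetical model: the displayed result is the union of the domains
    inferred in earlier scans and the domains inferred in the latest scan.
    """
    return previous | latest


first_scan = {"D1"}   # inferred during the first resource scan
second_scan = {"D4"}  # inferred during the next scan

displayed = merge_domain_results(first_scan, second_scan)
print(sorted(displayed))  # ['D1', 'D4']
```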
When you run a scan on a resource for the second time or for subsequent runs, you can optionally run only data discovery on the source. To run only data discovery on the resource, disable the Metadata Load Settings > Source Metadata option.
Data Domain Discovery Type
Choose one of the following options for the profiling scanner to infer data domains based on column name, column data, or both:
- Run Discovery on Source Data
- Runs data domain discovery on source data.
- Run Discovery on Source Metadata
- Runs data domain discovery on column names.
- Run Discovery on both Source Metadata and Data
- Runs data domain discovery on source data and source metadata.
- Run Discovery on Source Data Where Metadata Matches
- Runs data domain discovery on the source metadata to identify the column names that match the data domains. The scanner then runs data domain discovery on the source data of the identified columns.
Note: You can choose only the Run Discovery on Source Data option when you run data domain discovery on unstructured data sources.
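The four discovery types differ in whether the scanner looks at column names, column values, or both. The following Python sketch illustrates the difference for a single, hypothetical email data domain. The rule patterns, the 80 percent match ratio, the mode names, and the choice to treat the "both" type as a match on either name or data are all assumptions made for illustration, not the product's actual rules.

```python
import re

# Hypothetical rules for one illustrative data domain.
NAME_RULE = re.compile(r"e-?mail", re.IGNORECASE)          # metadata rule
DATA_RULE = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")      # data rule


def name_matches(column_name):
    return bool(NAME_RULE.search(column_name))


def data_matches(values, min_ratio=0.8):
    # min_ratio is an assumed conformance threshold.
    if not values:
        return False
    hits = sum(1 for v in values if DATA_RULE.match(v))
    return hits / len(values) >= min_ratio


def discover(columns, mode):
    """Return the column names assigned to the domain for a discovery type."""
    matched = []
    for name, values in columns.items():
        if mode == "metadata":
            ok = name_matches(name)
        elif mode == "data":
            ok = data_matches(values)
        elif mode == "both":
            ok = name_matches(name) or data_matches(values)  # assumption
        elif mode == "data_where_metadata_matches":
            # Data is inspected only for columns whose name already matched.
            ok = name_matches(name) and data_matches(values)
        else:
            raise ValueError(f"unknown mode: {mode}")
        if ok:
            matched.append(name)
    return matched


columns = {
    "email": ["a@x.com", "b@y.org"],
    "contact": ["c@z.net", "d@w.io"],
    "email_notes": ["call later", "n/a"],
}

for mode in ("metadata", "data", "both", "data_where_metadata_matches"):
    print(mode, discover(columns, mode))
```

In this sketch, the last mode reads the values of fewer columns than the data-only mode, which is the practical reason to prefer it on wide tables.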
Priority
Choose High or Low as the priority value for the resource run.
The profiling scanner runs the resources with High priority first and then runs the resources with Low priority.
For example, you have three resources R1, R2, and R3. The priority value for R1 and R3 is set to high and the priority value for R2 is set to low. When you run the resources, the scanner first runs R1 and R3 followed by R2.
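The ordering in this example can be reproduced with a stable sort on the priority value. The following Python sketch is illustrative only. The tuple representation is hypothetical, and the order among resources that share a priority is an assumption, because the text does not specify it.

```python
# Hypothetical model: each resource is a (name, priority) pair.
resources = [("R1", "High"), ("R2", "Low"), ("R3", "High")]

PRIORITY_ORDER = {"High": 0, "Low": 1}


def run_order(resources):
    # sorted() is stable, so resources with the same priority keep the
    # order in which they were listed (an assumption of this sketch).
    ranked = sorted(resources, key=lambda r: PRIORITY_ORDER[r[1]])
    return [name for name, _priority in ranked]


order = run_order(resources)
print(order)  # ['R1', 'R3', 'R2']
```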
Sampling Option
Choose one of the following sampling options to determine the number of rows to run the profile job on:
- All rows
- Runs the profile on all the rows in the data source.
- Auto Random rows
- Runs the profile on a random sample of rows. Enterprise Data Catalog computes the number of random rows based on the number of source rows.
- Random N rows
Runs the profile on the configured number of random rows.
In the Random Sampling Rows field, enter the number of rows that you want to run the profile on.
- First N rows
Runs the profile on the first N rows in the resource.
In the Number of First N Sampling Rows field, enter the number of rows to run the profile on.
- Limit N rows
- Runs the profile based on the number of rows in the data object.
In the Number of Rows to Limit field, enter the number of rows to run the profile on.
- Random Percentage
- Runs the profile on a percentage of rows in the data object.
In the Random Percentage field, enter the percentage of rows to run the profile on.
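The effect of the sampling options on the set of profiled rows can be sketched in Python as follows. This is a simplified illustration, not the scanner's sampling algorithm. The function name, its parameters, and the fixed seed are assumptions. Auto Random rows and Limit N rows are omitted because the text does not say how the row count is derived for them.

```python
import random


def sample_rows(rows, option, n=None, percentage=None, seed=0):
    """Return the rows that a profile would read under a sampling option."""
    rng = random.Random(seed)  # fixed seed keeps the sketch reproducible
    rows = list(rows)
    if option == "All rows":
        return rows
    if option == "First N rows":
        return rows[:n]
    if option == "Random N rows":
        return rng.sample(rows, min(n, len(rows)))
    if option == "Random Percentage":
        count = round(len(rows) * percentage / 100)
        return rng.sample(rows, count)
    raise ValueError(f"unsupported sampling option: {option}")


source = list(range(1000))  # stand-in for 1,000 source rows

print(len(sample_rows(source, "All rows")))                          # 1000
print(sample_rows(source, "First N rows", n=5))                      # [0, 1, 2, 3, 4]
print(len(sample_rows(source, "Random N rows", n=50)))               # 50
print(len(sample_rows(source, "Random Percentage", percentage=10)))  # 100
```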
Exclude Views
Choose the Exclude Views option if you do not want the profiling scanner to scan the views in the relational data sources.
Incremental Profiling
Choose the option to run the profile only for the changes made to the data source. If you do not select this option, the profile runs on the entire data source.
Enterprise Data Catalog supports incremental profiling for the following resources:
- Oracle. Discovers changes made to metadata and data during incremental profiling.
- Microsoft SQL Server. Discovers changes made to metadata and data during incremental profiling.
- File System with Local File protocol. Discovers changes made to metadata and data during incremental profiling.
- HDFS. Discovers changes made to metadata during incremental profiling.
- Amazon S3. Discovers changes made to data during incremental profiling.
- Snowflake. Discovers changes made to data during incremental profiling.
When you enable incremental profiling for a resource that has one table and you run the profile multiple times on the resource, the profiling scanner validates and runs the profile on the same table every single time.
Data Profile Filter
You can include or exclude tables and views from the profile run. Use semicolons (;) to separate the table names and view names.
For more information about the Data Profile Filter field, see the Source Metadata and Data Profile Filter topic.
Cumulative
Enterprise Data Catalog does not retain the previous scan results. It displays only the latest scan results. To retain the profile results from the previous run in the latest scan results, choose the Cumulative option. If you do not choose this option, the column profile and column similarity results from the previous run are deleted and only the latest results appear in Enterprise Data Catalog.
The following use cases explain how the Cumulative option with the Data Profile Filter field and Incremental Profiling option impacts the profiling results:
- Cumulative option with Data Profile Filter field
- - You run a resource after you enter the table names and view names in the Data Profile Filter field and then you choose the Cumulative option.
In this scenario, the scanner retains the previous results, appends the latest results, and displays the consolidated profile results in Enterprise Data Catalog.
- - You run the resource after you enter the tables and views in the Data Profile Filter field, but you do not choose the Cumulative option.
In this scenario, the previous profile results excluding the data domain discovery results are purged and the latest profile results appear in Enterprise Data Catalog.
- Cumulative option with Incremental Profiling option
- - You run a resource after you choose the Incremental Profiling option.
In this scenario, the scanner retains the previous profile results irrespective of whether you choose the Cumulative option or not. Enterprise Data Catalog displays the consolidated profile results.
- You do not choose the Cumulative option and Incremental Profiling option
- - You run a resource without choosing the Cumulative option and Incremental Profiling option.
In this scenario, the previous results excluding the data domain discovery results are purged during the next profile run. Enterprise Data Catalog displays the latest profile results.
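The use cases above reduce to a small decision rule: previous profile results are kept when either the Cumulative option or the Incremental Profiling option is chosen, and data domain discovery results are kept in every case. The following Python sketch restates that rule. The function and constant names are hypothetical and serve only to summarize the scenarios.

```python
# Per the use cases above, data domain discovery results are never purged.
DATA_DOMAIN_RESULTS_RETAINED = True


def profile_results_retained(cumulative: bool, incremental: bool) -> bool:
    """Whether the previous profile results survive the next profile run."""
    return cumulative or incremental


for cumulative in (True, False):
    for incremental in (True, False):
        outcome = "retained" if profile_results_retained(cumulative, incremental) else "purged"
        print(f"Cumulative={cumulative!s:5} Incremental={incremental!s:5} -> previous results {outcome}")
```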
Source Connection Name
Choose the source connection to run data discovery. You can create the connections in Informatica Administrator.
Note: This parameter is optional for a File System resource.
Run On
Choose one of the following run-time environments to run the profile:
- Blaze
- Runs the profile in the Hadoop environment on the Blaze engine.
- Spark
- Runs the profile in the Hadoop environment on the Spark engine.
- Native
- Runs the profile on the same machine where the Data Integration Service runs.
- Databricks
- Runs the profile in the Hadoop environment on the Spark engine in the Databricks cluster. The Databricks run-time environment supports JDBC and Azure Data Lake Store resources.
Note: Choose the Blaze or Native run-time environment to run the profile job for all resources except the Hive resource. When you choose the Blaze engine or Spark engine, select a Hadoop connection to run the profiles.
Select Data Domain
Choose one of the following data domain options:
- All Data Domains
- Discovers all the data domains in the resource.
- Specific Data Domain Groups
Discovers the data domains in the selected data domain groups.
In the Data Domain Groups field, choose one or more data domain groups.
- Specific Data Domains
- Discovers the selected data domains.
In the Data Domains field, choose one or more data domains.
In the Library workspace, you can view or delete the data domains and data domain groups that are available in Enterprise Data Catalog. To create a data domain or data domain group, navigate to the New > Data Domain or New > Data Domain Group page.
Use Conformance from
Choose one of the following conformance values for the data domain:
- Data Domain
- Uses the predefined conformance values that you configured for the data domains.
When you create a data domain, you configure the minimum percentage of source rows and minimum number of source rows as the conformance criteria for data domain match. These values are predefined conformance values.
- Custom
- Uses the conformance value that you enter in the Custom Conformance Value field for the data domains. The custom value overrides the predefined conformance values.
Data Domain Match Criteria
Choose one of the following conformance criteria for data domain match:
- Percentage
- Ratio of the number of matching rows to the total number of rows.
- Rows
- Total number of rows.
Enterprise Data Catalog uses the data conformance properties that you configured for the data domains. Navigate to Library > Assets > Data Domains to view the data domains. Open each data domain to view its configured properties.
Exclude Null Values from Data Domain Discovery
Choose the option to exclude the null values from the data source when you run data domain discovery. When you use this option, the data domain inference is more accurate and reliable. For example, you have a table with 100 rows and 30 rows contain null values. The conformance row count is 40. If you do not choose this option, data domain discovery runs on all 100 rows to discover data domains, which might result in an inaccurate inference. If you choose this option, data domain discovery runs on the 70 rows that do not contain null values and the results are more accurate.
When you select the minimum percentage of rows with the exclude null values option, the conformance percentage is the ratio of the number of matching rows in a column to the number of rows that do not contain null values. For example, if the total number of rows in a column is T, the number of matching rows is M, and the number of rows with null values is N, then the conformance percentage is M / (T - N) × 100.
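The conformance percentage calculation can be checked with a short Python sketch. The function name and parameters are hypothetical. The 100-row table with 30 null rows comes from the example above, while the count of 35 matching rows is an assumed value chosen for illustration.

```python
def conformance_percentage(total_rows, matching_rows, null_rows, exclude_nulls):
    """Percentage of rows that match the data domain.

    With exclude_nulls=True the denominator is T - N, as described above.
    Otherwise the denominator is T.
    """
    denominator = total_rows - null_rows if exclude_nulls else total_rows
    if denominator <= 0:
        return 0.0  # no rows to evaluate; treated as no conformance in this sketch
    return 100.0 * matching_rows / denominator


T, N = 100, 30  # from the example: 100 rows, 30 of them null
M = 35          # assumed number of matching rows

print(conformance_percentage(T, M, N, exclude_nulls=False))  # 35.0
print(conformance_percentage(T, M, N, exclude_nulls=True))   # 50.0
```

With the same matching rows, excluding null values raises the percentage from 35 to 50, so a column can meet a minimum percentage criterion that it would otherwise miss.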
Similarity Profile Data Preparation and Value Frequency Settings
Configure the column similarity properties to identify similar columns and value frequency in the resource.
- Run Similarity Profile
- Choose one of the following options:
- - Yes. The profiling scanner scans the data source and prepares data to discover similar columns based on column names, column patterns, and unique values.
- - No
- Save Source Data
- Choose one of the following options:
- - Yes. The profiling scanner prepares data to discover similar columns based on column names, column patterns, and unique values. It also computes value frequencies. The scanner then persists the computed information in PostgreSQL. The computed information persists in PostgreSQL until you choose to delete or purge the resource.
- - No. The profiling scanner prepares data to discover similar columns based on column names, column patterns, and unique values. The scanner then persists the computed information in PostgreSQL. The computed information persists in PostgreSQL until you choose to delete or purge the resource.
- Sampling Option
Choose one of the following sampling options to determine the number of rows that Enterprise Data Catalog can run the profile on:
- - Reuse Basic Profile Settings. Use the sampling option in the Basic Profile Settings section.
- - All rows. Runs the profile on all the rows in the data source.
- - Auto Random rows. Runs the profile on a random sample of rows. Enterprise Data Catalog computes the number of random rows based on the number of source rows.
- - Random N rows. Runs the profile on the configured number of random rows.
In the Random Sampling Rows field, enter the number of rows that you want to run the profile on.
- - First N rows. Runs the profile on the first N number of rows in the resource.
In the Number of First N Sampling Rows field, enter the number of rows to run the profile on.
- Domain Connection Settings
Choose one of the following domain connection settings options:
- - Use Profile Configuration Settings. Enterprise Data Catalog uses the Data Integration Service specified in the Domain Connection Settings section to identify similar columns in the data sources.
- - Specify Domain Connection Settings. To use a different Data Integration Service to identify similar columns in the data sources, enter the domain connection settings for the Data Integration Service.
Permissions and Privileges
You can view the value frequency section in Enterprise Data Catalog if you have the following permissions and privileges:
- In the Administrator tool, the administrator must assign the Data Privileges: View Data privilege to the user.
- In Catalog Administrator, navigate to the Manage > Security > Resources page, and assign Metadata and Data Read or All Permissions permission to the resource.
- In Catalog Administrator, assign the Read permission to the DataDomain resource in the Manage > Security > Resources page.
If the data asset has sensitive data, then you can view the sensitive data in the data asset after the administrator assigns the Data Privileges: View Sensitive Data privilege. For more information about privileges and permissions, see the Informatica Administrator Reference for Catalog guide.
Unique Key Inference Settings
A unique key is a column or combination of columns that uniquely identifies a row in a data source. The profiling service identifies columns in the data object to generate unique keys. Enterprise Data Catalog displays unique key inferences for the tabular assets.
A unique key cannot have duplicate values. If a column has duplicate values, that column is not identified as a unique key. Unique key inference is supported on the native run-time environment.
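The idea of inferring a unique key from a column or a combination of columns can be sketched as a duplicate check over candidate column sets. The following Python example is a naive illustration and not the profiling service's algorithm. The function names, the sample rows, and the limit of two columns per candidate key are assumptions, and null handling is not modeled.

```python
from itertools import combinations


def is_unique_key(rows, columns):
    """True if no two rows share the same values for the given columns."""
    seen = set()
    for row in rows:
        key = tuple(row[c] for c in columns)
        if key in seen:
            return False  # duplicate value, so not a unique key
        seen.add(key)
    return True


def infer_unique_keys(rows, column_names, max_width=2):
    """Return minimal column combinations that uniquely identify every row."""
    keys = []
    for width in range(1, max_width + 1):
        for combo in combinations(column_names, width):
            # Skip combinations that merely extend a key already found.
            if any(set(k) <= set(combo) for k in keys):
                continue
            if is_unique_key(rows, combo):
                keys.append(combo)
    return keys


rows = [
    {"id": 1, "country": "US", "code": "A"},
    {"id": 2, "country": "US", "code": "B"},
    {"id": 3, "country": "DE", "code": "A"},
]

keys = infer_unique_keys(rows, ["id", "country", "code"])
print(keys)  # [('id',), ('country', 'code')]
```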
In the Unique Key Inference Settings section, you can configure the following options for a resource to generate unique keys:
- Run Unique Key Inference
- The profile scanner scans and infers unique keys from the data source.
- Null Threshold % in Unique key Inference
- Sets the threshold for null values in the unique key inference. You can enter a value between 0 and 1.
- Skip Unique Key Inference When Accepted or Documented Unique Key Exists
- Skips the tables that already have documented or accepted unique keys.
- Unique Key Sampling Options
- You can choose the following sampling options:
- - All Rows. Runs the unique key inference on all the rows in the data object.
- - First <number> Rows. Runs the unique key inference on the specified number of first rows in the data object.
The following resources support the unique key inference:
- Relational Resource Type
- - Hive
- - Oracle
- - Microsoft SQL Server
- - Teradata
- - IBM Netezza
- - Amazon Redshift
- File System Resource
- - Amazon S3
- - HDFS
- - ADLS Gen1
- File Types
- - CSV Files
When you configure a non-supported resource to infer unique keys, the following error message appears:
Unique Key Inference is not Supported for the Resource Type:Resource Type