
Configuration Options for Enterprise Discovery

The configuration options for enterprise discovery include data domain discovery options, column profile sampling options, and general profile properties such as name and description.
As part of the configuration, you can run a column profile, data domain discovery, or both.

Data Domain Discovery Settings

The data domain discovery settings determine whether data domain discovery runs on column data, column names, or both. You can choose the data domains to discover, set the conformance criteria for a data domain match, exclude null values, and specify whether data domain discovery processes all rows in the data source or a limited number of rows.
The following table describes the data domain discovery settings that you can configure for enterprise discovery in the Analyst tool:
| Option | Description |
| --- | --- |
| Enable data domain discovery | Performs data domain discovery as part of enterprise discovery. |
| Run data domain discovery on data | Performs data domain discovery on column data. |
| Run data domain discovery on column name | Performs data domain discovery on the name of each column. |
| Minimum conformance percentage | The minimum conformance percentage of rows in the data set required for a data domain match. The conformance percentage is the number of matching rows divided by the total number of rows, expressed as a percentage. Note: The Analyst tool considers null values as nonmatching rows. |
| Minimum conforming rows | The minimum number of rows in the data set required for a data domain match. |
| Exclude null values from data domain discovery | Excludes the null values from the data set for data domain discovery. |
| Exclude columns with approved data domains | Excludes columns with approved data domains from data domain inference during the profile run. |
| All rows | Performs data domain discovery on all source rows. |
| First | The maximum number of rows the profile can run on. The Analyst tool chooses rows starting from the first row in the source. You can choose a maximum of 2,147,483,647 rows. |
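
The conformance settings above can be illustrated with a short Python sketch. This is not Informatica code: the function name `conformance_percentage`, the `matches` predicate, and the sample values are hypothetical. The sketch assumes only what the table states, namely that the percentage is the number of matching rows divided by the total number of rows, that null values count as nonmatching rows, and that excluding nulls removes them from the data set before the calculation.

```python
from typing import Callable, Optional, Sequence


def conformance_percentage(
    values: Sequence[Optional[str]],
    matches: Callable[[str], bool],
    exclude_nulls: bool = False,
) -> float:
    """Return matching rows / total rows as a percentage.

    Nulls (None) never match. With exclude_nulls=True they are
    dropped from the data set before the ratio is computed.
    """
    rows = [v for v in values if v is not None] if exclude_nulls else list(values)
    if not rows:
        return 0.0  # assumption: an empty data set conforms to nothing
    matching = sum(1 for v in rows if v is not None and matches(v))
    return 100.0 * matching / len(rows)


# Hypothetical column data and a deliberately naive "email" rule.
column = ["a@example.com", "b@example.org", None, "n/a"]
looks_like_email = lambda v: "@" in v

with_nulls = conformance_percentage(column, looks_like_email)
without_nulls = conformance_percentage(column, looks_like_email, exclude_nulls=True)
```

With these sample values, two of four rows match when the null is counted as a nonmatching row, and two of three rows match when the null is excluded. A column can therefore pass a minimum conformance percentage with the exclusion option selected and fail the same threshold without it.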

Column Profile Settings

The sampling options determine whether the Analyst tool runs a column profile on all rows of the data sources or on a limited number of rows.
The following table describes the column profile settings that you can configure for an enterprise discovery profile:
| Option | Description |
| --- | --- |
| Enable column profiling | Runs a column profile as part of enterprise discovery. |
| Exclude approved data types and data domains from the data type and data domain inference in the subsequent profile runs | Excludes approved data types and data domains from data type and data domain inference in subsequent profile runs. |
The following table describes the run-time environment options that you can configure for an enterprise discovery profile:
| Option | Description |
| --- | --- |
| Native | The Analyst tool submits the profile jobs to the Profiling Service Module. The Profiling Service Module then breaks down the profile jobs into a set of mappings. The Data Integration Service runs these mappings and writes the profile results to the profiling warehouse. |
| Blaze | The Data Integration Service pushes the profile logic to the Blaze engine on the Hadoop cluster to run profiles. |
| Spark | The Data Integration Service pushes the profile logic to the Spark engine on the Hadoop cluster to run profiles. |
The following table describes the sampling options that you can configure for an enterprise discovery profile:
| Option | Description |
| --- | --- |
| All Rows | Runs a column profile on all rows in the data source. Supported on the Native, Blaze, and Spark run-time environments. |
| First <number> Rows | Runs a profile on sample rows taken from the beginning of the data object. You can choose a maximum of 2,147,483,647 rows. Supported on the Native and Blaze run-time environments. |
| Limit n <number> Rows | Runs a profile based on the number of rows in the data object. When you choose to run a profile in the Hadoop validation environment, the Spark engine collects samples from multiple partitions of the data object and pushes the samples to a single node to compute sample size. The Limit n sampling option supports Oracle, SQL Server, and DB2 databases. You cannot apply the Advanced filter with the Limit n sampling option. You can select a maximum of 2,147,483,647 rows. Supported on the Spark run-time environment. |
| Random percentage | Runs a profile on a percentage of rows in the data object. Supported on the Spark run-time environment. |
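
The support matrix in the sampling table can be sketched as a small lookup in Python. The code is illustrative only and is not part of any Informatica API: the option keys, the `check_sampling` function, and the lower bound of one row are assumptions made for this example. The engine support and the 2,147,483,647 row ceiling come from the table above.

```python
from typing import List, Optional

# Row ceiling quoted in the table; it equals 2**31 - 1.
MAX_SAMPLE_ROWS = 2_147_483_647

# Sampling option -> run-time environments listed as supported.
# The key names are hypothetical labels for the table rows.
SUPPORTED_ENGINES = {
    "all_rows": {"native", "blaze", "spark"},
    "first_n_rows": {"native", "blaze"},
    "limit_n_rows": {"spark"},
    "random_percentage": {"spark"},
}


def check_sampling(option: str, engine: str, rows: Optional[int] = None) -> List[str]:
    """Return a list of problems; an empty list means the combination fits the table."""
    engines = SUPPORTED_ENGINES.get(option)
    if engines is None:
        return [f"unknown sampling option: {option}"]
    problems = []
    if engine not in engines:
        problems.append(f"{option} is not listed as supported on {engine}")
    if option in ("first_n_rows", "limit_n_rows"):
        # Assumption: at least one row must be requested.
        if rows is None or not (1 <= rows <= MAX_SAMPLE_ROWS):
            problems.append(f"row count must be between 1 and {MAX_SAMPLE_ROWS}")
    return problems
```

For example, `check_sampling("first_n_rows", "spark", 1000)` reports one problem because the table lists First <number> Rows for the Native and Blaze environments only, while `check_sampling("all_rows", "spark")` returns an empty list.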