Data Discovery Guide > Part III: Data Discovery with Informatica Developer > Enterprise Discovery in Informatica Developer > Profile Options for Enterprise Discovery
  

Profile Options for Enterprise Discovery

Set up profile options before you run a profile to perform enterprise discovery. Profile options include data domain discovery options, column profile sampling options, and inference options for primary keys and foreign keys.
You can choose to run the enterprise discovery profile after you set up the profile options. You can also choose to create profile tasks after the setup without running the profile.

Data Domain Selection for Enterprise Discovery

Inference options determine whether data domain discovery must run on column data, column name, or both. You can specify whether the profile needs to process all the rows in the data source, and choose a conformance criteria for data domain match.
The following table describes the data domain inference options that you configure for enterprise discovery:
Option
Description
Override the default inference options
Changes the predefined inference options.
Data
The profile runs on column data.
Column name
The profile runs on column titles.
Data and column name
The profile runs on both column data and column titles.
Minimum percentage of rows
The minimum conformance percentage of rows in the data set required for a data domain match. The conformance percentage is the ratio of number of matching rows divided by the total number of rows.
Note: The Developer tool considers null values as nonmatching rows.
Minimum number of rows
The minimum number of rows in the data set required for a data domain match.
Exclude null values from data domain discovery
Excludes the null values from the data set for data domain discovery.
All rows
The profile runs on all rows of the data source.
Sample first
The maximum number of rows the profile can run on. The Developer tool chooses rows starting from the first row in the source.
Exclude approved data types and data domains from the data type and data domain inference in the subsequent profile runs
Excludes the approved data type or data domain from data type and data domain inference from the next profile run.

Column Profile Sampling Options for Enterprise Discovery

The sampling options determine whether the Developer tool runs a column profile on all rows of the data sources or limited number of rows.
The following table describes the column profile sampling options that you configure for enterprise discovery:
Option
Description
All Rows
Runs a profile on all the rows in the data object.
Supported on Native, Blaze, and Spark run-time environment.
Sample First <number> rows
Runs a profile on the sample rows from the beginning of the rows in the data object. You can choose a maximum of 2,147,483,647 rows.
Supported on Native and Blaze run-time environment.
Limit N <number> rows
Runs a profile based on the number of rows in the data object. When you choose to run a profile in the Hadoop validation environment, Spark engine collects samples from multiple partitions of the data object and pushes the samples to a single node to compute sample size. The Limit n sampling option supports Oracle, SQL Server, and DB2 databases. You cannot apply the Advanced filter with the Limit n sampling option.
Supported on Spark run-time environment.
Random Percentage
Runs a profile on a percentage of rows in the data object.
Supported on Spark run-time environment.
Exclude data type inference for columns with an approved data type
Excludes columns with an approved data type from the data type inference of the column profile run.

Run-time Environment Option

Choose native, or Hadoop, run-time environment option. You can choose a Blaze, or Spark option in the Hadoop run-time environment. Informatica Developer sets the run-time environment in the profile definition after you choose the run-time environment. The run-time environment does not affect the profile results.
The following table describes the run-time environment options for an enterprise discovery profile:
Option
Description
Native
The Developer tool submits the profile jobs to the Profiling Service Module. The Profiling Service Module then breaks down the profile jobs into a set of mappings. The Data Integration Service runs these mappings and writes the profile results to the profile warehouse.
Blaze
The Data Integration Service pushes the profile logic to the Blaze engine on the Hadoop cluster to run profiles.
Spark
The Data Integration Service pushes the profile logic to the Spark engine on the Hadoop cluster to run profiles.

Primary Key Inference Options for Enterprise Discovery

You can override the default primary key inference options for enterprise discovery. The options include the maximum number of rows you can run the profile on and minimum conformance percentage.
The following table describes the primary key inference options that you configure for enterprise discovery:
Options
Description
Override the default inference options
Allows you to configure custom settings for primary key inference.
Max Key Columns
Maximum number of columns that can make up a primary key.
Max Rows
Maximum number of rows you can run the profile on.
Minimum Percent
The minimum conformance percentage of the column data required for primary key match.
Maximum Violation Rows
The maximum number of rows with key violations that the profile allows when determining primary keys.

Foreign Key Inference Options for Enterprise Discovery

Set up the foreign key inference options to define the column settings for discovering foreign key relationships between data objects. The foreign key inference results depend on the primary key inference options you set up for enterprise discovery, documented primary keys, and user-defined primary keys.
You can infer the foreign keys in Informatica Developer with one of the following methods:
The following table describes the foreign key inference options that you configure for enterprise discovery:
Options
Description
Override the default inference options
Changes the predefined inference options.
Data types used in comparisons
The data type used in primary key and foreign key comparisons.
Note: This option applies if you run a column profile on the data source before the foreign key inference.
Comparison case-sensitivity
Includes case-sensitivity when comparing column data.
Trim values before comparison
Determines whether the Developer tool includes leading or trailing spaces in column data while processing.
Inferred primary keys used in comparisons
Use top _ ranked keys
The number of top-ranking primary keys used in foreign key inference when the Developer tool runs a foreign key profile across all the data sources. The Developer tool uses the top-ranking method along with documented primary keys and user-defined primary keys to infer the foreign key relationships.
Top ranking of inferred keys is based on the descending conformance percentage rounded to a single decimal precision. For example, the Developer tool considers a conformance percentage of 99.75 as 99.8 and 99.74 as 99.7.
The default value is 1. Set the value to -1 if you want the Developer tool to use all inferred keys in foreign key inference.
Note: If the primary key data sources have approved primary keys, the Developer tool does not use inferred primary keys for foreign key inference.
Max foreign keys between data objects
The maximum number of inferred columns that the Developer tool returns after the profile run that are required for foreign key discovery.
Minimum conformance percentage
The minimum eligibility value in percentage for including columns in the foreign key results.
Regenerate signature
Reloads column signatures if the source data changes.

Auto Curation Parameters for Foreign Key Inference

You can configure the auto curation parameters to infer primary key and foreign key relationships without manual intervention. The auto curation parameters are user-defined custom attributes that you can configure to identify the data relationships based on certain conditions.
When the discovery results include a large number of primary key and foreign key relationships, you might find it difficult to identify the critical data relationships among hundreds of data relationships. You might also find it difficult to curate the relationships based on certain conditions, such as data match or data type. To resolve this issue, you can configure the auto curation parameters and run the enterprise discovery profile.
If the data sources have multiple candidate foreign keys and you want to provide rules to choose a candidate foreign key, you can perform the following actions:
An administrator can edit and save the foreign key configuration file. Configure the auto curation parameters in the foreign key configuration file. The algorithm infers the primary key and foreign key relationships between multiple data objects based on the auto curation parameters.
The foreign key configuration file, ForeignKeyConfig.xml, is available in the following directory:
<Informatica installation directory>\services\DataIntegrationService\modules\ProfilingService
The auto curation parameters are data overlap match, column name match, relationship type match, and data type match.

Data Overlap Match

Data overlap match is the estimated overlap of values between the primary key and foreign keys. You can set the overlap match in the enterprise discovery profile wizard with the Minimum conformance percent option. By default, the Minimum conformance percent option is set to 90.
If the data overlap match does not meet the minimum conformance percentage, the foreign key is not considered for auto curation. When the minimum conformance for the data overlap match is met, the remaining parameters are used to compute the adjusted score.

Name Match

The name match parameter is an optional parameter. It uses the Edit Distance algorithm to determine how closely the names of primary and foreign key columns match and sets the score between 0 and 1. Set the name match weight to 0 if you do not want to use this parameter to determine primary key and foreign key relationship.

Relationship Type Match

The relationship type match determines the type of relationship between the primary key and foreign key columns and assigns a fixed score between 0 and 1. The relationship type match is computed based on the column type of the foreign key column.
The following relationship type matches can be set in the ForeignKeyConfig.xml file:

Data Type Match

The data type match compares the data types of the primary key columns to the foreign key columns and assigns a fixed conformance score which is based on how close the data types of the columns match.
The following table lists the fixed data type match scores for different combinations of primary key and foreign keys:
Numeric Foreign Key
Date Foreign Key
String Foreign Key
Numeric Primary Key
1.0
0.5
0.0
Date Primary Key
0.5
1.0
0.5
String Primary Key
0.0
0.0
1.0
You can change the default data type match scores if required.