Profile Options for Enterprise Discovery
Set up profile options before you run a profile to perform enterprise discovery. Profile options include data domain discovery options, column profile sampling options, and inference options for primary keys and foreign keys.
You can choose to run the enterprise discovery profile after you set up the profile options. You can also choose to create profile tasks after the setup without running the profile.
Data Domain Selection for Enterprise Discovery
Inference options determine whether data domain discovery must run on column data, column name, or both. You can specify whether the profile needs to process all the rows in the data source, and choose a conformance criteria for data domain match.
The following table describes the data domain inference options that you configure for enterprise discovery:
Option | Description |
---|
Override the default inference options | Changes the predefined inference options. |
Data | The profile runs on column data. |
Column name | The profile runs on column titles. |
Data and column name | The profile runs on both column data and column titles. |
Minimum percentage of rows | The minimum conformance percentage of rows in the data set required for a data domain match. The conformance percentage is the ratio of number of matching rows divided by the total number of rows. Note: The Developer tool considers null values as nonmatching rows. |
Minimum number of rows | The minimum number of rows in the data set required for a data domain match. |
Exclude null values from data domain discovery | Excludes the null values from the data set for data domain discovery. |
All rows | The profile runs on all rows of the data source. |
Sample first | The maximum number of rows the profile can run on. The Developer tool chooses rows starting from the first row in the source. |
Exclude approved data types and data domains from the data type and data domain inference in the subsequent profile runs | Excludes the approved data type or data domain from data type and data domain inference from the next profile run. |
Column Profile Sampling Options for Enterprise Discovery
The sampling options determine whether the Developer tool runs a column profile on all rows of the data sources or limited number of rows.
The following table describes the column profile sampling options that you configure for enterprise discovery:
Option | Description |
---|
All Rows | Runs a profile on all the rows in the data object. Supported on Native, Blaze, and Spark run-time environment. |
Sample First <number> rows | Runs a profile on the sample rows from the beginning of the rows in the data object. You can choose a maximum of 2,147,483,647 rows. Supported on Native and Blaze run-time environment. |
Limit N <number> rows | Runs a profile based on the number of rows in the data object. When you choose to run a profile in the Hadoop validation environment, Spark engine collects samples from multiple partitions of the data object and pushes the samples to a single node to compute sample size. The Limit n sampling option supports Oracle, SQL Server, and DB2 databases. You cannot apply the Advanced filter with the Limit n sampling option. Supported on Spark run-time environment. |
Random Percentage | Runs a profile on a percentage of rows in the data object. Supported on Spark run-time environment. |
Exclude data type inference for columns with an approved data type | Excludes columns with an approved data type from the data type inference of the column profile run. |
Run-time Environment Option
Choose native, or Hadoop, run-time environment option. You can choose a Blaze, or Spark option in the Hadoop run-time environment. Informatica Developer sets the run-time environment in the profile definition after you choose the run-time environment. The run-time environment does not affect the profile results.
The following table describes the run-time environment options for an enterprise discovery profile:
Option | Description |
---|
Native | The Developer tool submits the profile jobs to the Profiling Service Module. The Profiling Service Module then breaks down the profile jobs into a set of mappings. The Data Integration Service runs these mappings and writes the profile results to the profile warehouse. |
Blaze | The Data Integration Service pushes the profile logic to the Blaze engine on the Hadoop cluster to run profiles. |
Spark | The Data Integration Service pushes the profile logic to the Spark engine on the Hadoop cluster to run profiles. |
Primary Key Inference Options for Enterprise Discovery
You can override the default primary key inference options for enterprise discovery. The options include the maximum number of rows you can run the profile on and minimum conformance percentage.
The following table describes the primary key inference options that you configure for enterprise discovery:
Options | Description |
---|
Override the default inference options | Allows you to configure custom settings for primary key inference. |
Max Key Columns | Maximum number of columns that can make up a primary key. |
Max Rows | Maximum number of rows you can run the profile on. |
Minimum Percent | The minimum conformance percentage of the column data required for primary key match. |
Maximum Violation Rows | The maximum number of rows with key violations that the profile allows when determining primary keys. |
Foreign Key Inference Options for Enterprise Discovery
Set up the foreign key inference options to define the column settings for discovering foreign key relationships between data objects. The foreign key inference results depend on the primary key inference options you set up for enterprise discovery, documented primary keys, and user-defined primary keys.
You can infer the foreign keys in Informatica Developer with one of the following methods:
- •Use default values.
- •Configure the foriegn key inference options.
- •Use foreign key configuration file to configure the auto curation parameters.
The following table describes the foreign key inference options that you configure for enterprise discovery:
Options | Description |
---|
Override the default inference options | Changes the predefined inference options. |
Data types used in comparisons | The data type used in primary key and foreign key comparisons. Note: This option applies if you run a column profile on the data source before the foreign key inference. |
Comparison case-sensitivity | Includes case-sensitivity when comparing column data. |
Trim values before comparison | Determines whether the Developer tool includes leading or trailing spaces in column data while processing. |
Inferred primary keys used in comparisons Use top _ ranked keys | The number of top-ranking primary keys used in foreign key inference when the Developer tool runs a foreign key profile across all the data sources. The Developer tool uses the top-ranking method along with documented primary keys and user-defined primary keys to infer the foreign key relationships. Top ranking of inferred keys is based on the descending conformance percentage rounded to a single decimal precision. For example, the Developer tool considers a conformance percentage of 99.75 as 99.8 and 99.74 as 99.7. The default value is 1. Set the value to -1 if you want the Developer tool to use all inferred keys in foreign key inference. Note: If the primary key data sources have approved primary keys, the Developer tool does not use inferred primary keys for foreign key inference. |
Max foreign keys between data objects | The maximum number of inferred columns that the Developer tool returns after the profile run that are required for foreign key discovery. |
Minimum conformance percentage | The minimum eligibility value in percentage for including columns in the foreign key results. |
Regenerate signature | Reloads column signatures if the source data changes. |
Auto Curation Parameters for Foreign Key Inference
You can configure the auto curation parameters to infer primary key and foreign key relationships without manual intervention. The auto curation parameters are user-defined custom attributes that you can configure to identify the data relationships based on certain conditions.
When the discovery results include a large number of primary key and foreign key relationships, you might find it difficult to identify the critical data relationships among hundreds of data relationships. You might also find it difficult to curate the relationships based on certain conditions, such as data match or data type. To resolve this issue, you can configure the auto curation parameters and run the enterprise discovery profile.
If the data sources have multiple candidate foreign keys and you want to provide rules to choose a candidate foreign key, you can perform the following actions:
- •Configure the Max Foreign Keys between data objects and Minimum conformance percent options in the enterprise discovery profile wizard.
- •Configure the weights and scores for the auto curation parameters in the ForeignKeyConfig.xml file.
An administrator can edit and save the foreign key configuration file. Configure the auto curation parameters in the foreign key configuration file. The algorithm infers the primary key and foreign key relationships between multiple data objects based on the auto curation parameters.
The foreign key configuration file, ForeignKeyConfig.xml, is available in the following directory:
<Informatica installation directory>\services\DataIntegrationService\modules\ProfilingService
The auto curation parameters are data overlap match, column name match, relationship type match, and data type match.
Data Overlap Match
Data overlap match is the estimated overlap of values between the primary key and foreign keys. You can set the overlap match in the enterprise discovery profile wizard with the Minimum conformance percent option. By default, the Minimum conformance percent option is set to 90.
If the data overlap match does not meet the minimum conformance percentage, the foreign key is not considered for auto curation. When the minimum conformance for the data overlap match is met, the remaining parameters are used to compute the adjusted score.
Name Match
The name match parameter is an optional parameter. It uses the Edit Distance algorithm to determine how closely the names of primary and foreign key columns match and sets the score between 0 and 1. Set the name match weight to 0 if you do not want to use this parameter to determine primary key and foreign key relationship.
Relationship Type Match
The relationship type match determines the type of relationship between the primary key and foreign key columns and assigns a fixed score between 0 and 1. The relationship type match is computed based on the column type of the foreign key column.
The following relationship type matches can be set in the ForeignKeyConfig.xml file:
- •Primary key-foreign key relationship where the foreign key column is a non-key column. The default for this relationship type match is 1. You can find this relationship in many data sources.
- •Primary key-primary key relationship where the foreign key column is a primary key column. The default for this relationship type is 0.25. You rarely find this relationship type as it represents a table that has been partitioned vertically.
- •Primary key-primary key sequence relationship where the foreign key column is a primary key column and the data type of the column is a sequence data type. For example, OrderID column in a Order table has a sequence data type. The default for this relationship type is zero because the sequence keys might cause multiple false positive foreign keys, which the primary key-primary key algorithm tries to avoid. You can set the relationship type match to a higher score if the data source is known to contain a few sequence data types.
Data Type Match
The data type match compares the data types of the primary key columns to the foreign key columns and assigns a fixed conformance score which is based on how close the data types of the columns match.
The following table lists the fixed data type match scores for different combinations of primary key and foreign keys:
| Numeric Foreign Key | Date Foreign Key | String Foreign Key |
Numeric Primary Key | 1.0 | 0.5 | 0.0 |
Date Primary Key | 0.5 | 1.0 | 0.5 |
String Primary Key | 0.0 | 0.0 | 1.0 |
You can change the default data type match scores if required.