Informatica Data Quality
When you run a profile in Informatica Analyst or Informatica Developer, the Data Integration Service stores the profile results in the profiling warehouse. To view these results in Enterprise Data Catalog, you can create an Informatica Data Quality resource in Catalog Administrator and associate the profiling warehouse to the resource. When you run the resource, the Informatica Data Quality scanner extracts and migrates the profile results in the associated profiling warehouse to the catalog. The migrated profile results include column profile results, data domain discovery results, rule profiling results, scorecards and value frequencies. You can migrate the profile results from the current or earlier versions on any Informatica domain. Informatica Data Quality resource supports relational database systems.
In Catalog Administrator, you can create multiple profiling warehouse resources. When you create a profiling warehouse resource, you associate a profiling warehouse to the resource. Enterprise Data Catalog scans the associated profiling warehouse and lists all the schemas with profile results. You can choose one or more schemas for which the Informatica Data Quality scanner migrates the results to the catalog.
When you run the Informatica Data Quality scanner, the catalog compares the timestamp of the results in the catalog with the migrated results. Enterprise Data Catalog displays the latest profile results and curation results based on the timestamp. Enterprise Data Catalog merges the inference results with the existing results in the catalog.
You can view the value frequency for a column or rule in the Developer tool or Analyst tool. The Informatica Data Quality scanner migrates the value frequencies for the columns along with the profile results to the catalog. The profiling warehouse stores a maximum of 16,000 values for a column. The scanner identifies the top 1,000 maximum values for a column and migrates these values to Enterprise Data Catalog.
- Example
- Assume that you have more than 1 million tables spread across 700 schemas and multiple databases in your enterprise. Over the years, you have run profiles on these schemas in the Developer tool or Analyst tool. All the profiling results reside in one or more profiling warehouse databases. Now, you want to implement Enterprise Data Catalog in your enterprise and want to access and view the existing profile results for the schemas in the catalog. Instead of running the profiles on the schemas and databases in Catalog Administrator, which is a time-consuming and resource-intensive effort, you decide to migrate the results to the catalog. Additionally, you want the developers and analysts to continue using the Developer tool and Analyst tool.
- In this case, you can run the core scanners to extract the metadata from the schemas and databases to the catalog. Then, you can create a profiling warehouse resource, choose the schemas and connections for which you want to migrate the profile results, and run the profiling warehouse resource. The scanner migrates the profile results of the selected schemas to the catalog. This action saves time and effort. An added advantage is that the users of the Developer tool and Analyst tool can continue to use the tools and you can migrate the results as and when required.
Extracting Rule Profile Results
You can apply rules and run rule profiles in the Developer tool and the Analyst tool. You can use the following methods to create or apply rules in the column profiles:
- •In the Developer tool, create a mapplet and validate it as a rule. The rule appears as a reusable rule in the Analyst tool. You can apply the rule to column profiles in the Analyst tool and Developer tool.
- •In the Analyst tool, create a rule specification and generate a mapplet. You can apply the rule specification to column profiles in the Analyst tool. In the Developer tool, validate the mapplet as a rule. The rule appears as a reusable rule that you can use in the column profiles.
- •You can use predefined rules in the column profiles. Informatica provides the predefined rules with the Developer tool and Analyst tool.
When you run Informatica Data Quality resource, the resource extracts the rule profile results to the catalog. You cannot apply expression rules to the profile. You cannot associate business terms with rule asset in the Enterprise Data Catalog. To learn how to create and apply rules to the profiles in the Developer tool and Analyst tool, see Informatica Data Discovery Guide.
Extracting Scorecard Results
You can create and edit a scorecard in the Developer tool and Analyst tool. A scorecard has multiple components, such as metrics, metric groups, scores, and thresholds. Scorecards help an enterprise to measure the value of data quality at the metric level. You can create metric groups to group related metrics. After you run profile, you can add columns from the profile results as metric to a scorecard. When you run a scorecard, the Analyst tool and Developer tool generates weighted average value for each metric group. When you run data quality resource, the resource extracts the scorecard results to the catalog.
You cannot associate business terms with scorecard asset in the Enterprise Data Catalog. To learn how to create, edit, and run scorecards in the Developer tool and Analyst tool, see Informatica Data Discovery Guide.
Objects Extracted
The Informatica Data Quality scanner extracts the following profiling metrics from the profiling warehouse to the catalog:
- •Column profile results
- - Null, unique, and non-unique values.
- - Patterns
- - Data types
- - Maximum and minimum values
- - Standard deviation value
- - Average value
- - Sum value
- •Data domain discovery results
- - Inference results
- - Curation results
- •Rule profile results
- - Null, distinct, and non-distinct values
- - Patterns
- - Maximum and minimum values
- - Maximum and minimum length
- - Data types
- •Scorecards
- - Metric groups
- - Metric scores
- - Quality scores
- •Value frequencies
Prerequisites
Perform the following step to complete the prerequisites:
- •Before you migrate the profiling results for a resource, you must run the respective resource in Catalog Administrator to extract the metadata into the catalog.
For example, you want to migrate the profile results of the Oracle schema S1 to the catalog. Before you migrate the Oracle schema profile results, create and run an Oracle resource with S1 and use this resource to configure the profiling warehouse resource.
- •Export data domains from the profiling warehouse domain to the Enterprise Data Catalog domain if the profiling warehouse and Enterprise Data Catalog are on two different domains.
Basic Information
The General tab includes the following basic information about the resource:
Information | Description |
---|
Name | The name of the resource. |
Description | The description of the resource. |
Resource type | The type of the resource. |
Execute On | You can choose to execute on the default catalog server or offline. |
Resource Connection Properties
The General tab includes the following properties:
Property | Description |
---|
Extract | To extract the results from the profiling warehouse, choose one of the following options: - - Rule
- - Scorecard
- - Profile
- - Profile and Value Frequency
|
Domain Name | Name of the Informatica domain. Note: This property applies to the versions 10.2.0 HotFix 2 and later. |
Domain User Name | User name that the Data Integration Service uses to access the Model Repository Service. |
Domain Password | Password for the Model repository user. |
Security Domain | Name of the security domain to which the domain user belongs. |
Domain Host | Informatica domain host name. |
Domain Port | Informatica domain port number. |
Node Name | Name of the node on which the Data Integration service runs. |
Repository Service Name | Name of the Model Repository Service. |
Profiling Warehouse Name | Name of the Profiling Warehouse. |
The Metadata Load Settings tab includes the following properties:
Property | Description |
---|
Enable Source Metadata | Select the option to extract metadata from the data sources. |
Profiled Schema Connections | Select one or more schemas in the Select Profiled Schema Connections dialog box. The profiling warehouse scanner migrates the profiling results of the selected schemas to the catalog. |
Cumulative | Select or clear the option as necessary. |
Auto Accept Percentage | Enter a value from 0 to 100. Enterprise Data Catalog automatically accepts the data domains when the inference percentage exceeds the configured number. |
Incremental | Select or clear the option as necessary. |
Memory | Specify the memory required to run the scanner job. Select one of the following values based on the migrated data set size: Note: For more information about the memory values, see the Tuning Enterprise Data Catalog Performance article on How To-Library Articles tab in the Informatica Doc Portal |
Custom Options | Enter the JVM parameters to configure the scanner container. |
Track Data Source Changes | Select the option to view the metadata source change notification in Enterprise Data Catalog. |
Informatica Data Quality Resource Tasks
When you run the Informatica Data Quality resource, the scanner performs the following tasks:
1. Verifies that the catalog has the source metadata for the resource.
For example, to migrate Oracle profile results, you must first create and run the Oracle resource in Catalog Administrator.
2. Verifies that the matching resource has profile results.
3. Verifies that the matching resource has curation decisions.
4. If curation results do not exist, the scanner migrates the results to the catalog.
5. If curation results exist, the scanner compares the timestamp of the curation decisions in the catalog with the curation decision in the profiling warehouse. The scanner migrates and overwrites the curation results if the curation results in the profiling warehouse have the latest timestamp.
For example, you reject the SSN data domain for a column in Employee table in the catalog on 05/01/19. You accept the SSN data domain for the same column in the Developer tool on 05/02/2019. You run the Informatica Data Quality scanner on 05/03/19. The curation decision for the data domain SSN in the profiling warehouse has the latest timestamp. The scanner migrates and overwrites the curation decision in the catalog. The data domain SSN displays accepted in the catalog.
Note: Enterprise Data Catalog displays SYSTEM in the Assigned By column for a data domain asset when you curate a data domain in the Developer tool or Analyst tool and the decision is migrated to the catalog.
6. Merges the data domain inferred results.
For example, in the Developer tool, you accept the data domains SSN and Age in the Employee table. In Enterprise Data Catalog, you accept the data domains Address and City in the same Employee table. When you run the Informatica Data Quality scanner, the scanner merges the data domain accepted results and displays SSN , Age, Address, and City as inferred data domains for the Employee table in the catalog.
7. Automatically accepts a data domain if the inference percentage exceeds the Auto Accept Percentage value.
8. Identifies the top 1,000 maximum values for a column and migrates the values to the catalog.
9. In Catalog Administrator, when you click Missing Report link tab in the Connection Assignment section, the exported file does not display the details of source and target objects that are not linked.