Informatica Data Quality
When you run a profile in Informatica Analyst or Informatica Developer, the Data Integration Service stores the profile results in the profiling warehouse. To view these results in Enterprise Data Catalog, you can create an Informatica Data Quality resource in Catalog Administrator and associate the profiling warehouse to the resource. When you run the resource, the Informatica Data Quality scanner extracts and migrates the profile results in the associated profiling warehouse to the catalog. The migrated profile results include column profile results, data domain discovery results, rule profiling results, scorecards and value frequencies. You can migrate the profile results from the current or earlier versions on any Informatica domain. Informatica Data Quality resource supports relational database systems.
In Catalog Administrator, you can create multiple profiling warehouse resources. When you create a profiling warehouse resource, you associate a profiling warehouse to the resource. Enterprise Data Catalog scans the associated profiling warehouse and lists all the schemas with profile results. You can choose one or more schemas for which the Informatica Data Quality scanner migrates the results to the catalog.
When you run the Informatica Data Quality scanner, the catalog compares the timestamp of the results in the catalog with the migrated results. Enterprise Data Catalog displays the latest profile results and curation results based on the timestamp. Enterprise Data Catalog merges the inference results with the existing results in the catalog.
You can view the value frequency for a column or rule in the Developer tool or Analyst tool. The Informatica Data Quality scanner migrates the value frequencies for the columns along with the profile results to the catalog. The profiling warehouse stores a maximum of 16,000 values for a column. The scanner identifies the top 1,000 maximum values for a column and migrates these values to Enterprise Data Catalog.
- Example
- Assume that you have more than 1 million tables spread across 700 schemas and multiple databases in your enterprise. Over the years, you have run profiles on these schemas in the Developer tool or Analyst tool. All the profiling results reside in one or more profiling warehouse databases. Now, you want to implement Enterprise Data Catalog in your enterprise and want to access and view the existing profile results for the schemas in the catalog. Instead of running the profiles on the schemas and databases in Catalog Administrator, which is a time-consuming and resource-intensive effort, you decide to migrate the results to the catalog. Additionally, you want the developers and analysts to continue using the Developer tool and Analyst tool.
- In this case, you can run the core scanners to extract the metadata from the schemas and databases to the catalog. Then, you can create a profiling warehouse resource, choose the schemas and connections for which you want to migrate the profile results, and run the profiling warehouse resource. The scanner migrates the profile results of the selected schemas to the catalog. This action saves time and effort. An added advantage is that the users of the Developer tool and Analyst tool can continue to use the tools and you can migrate the results as and when required.
Extracting Rule Profile Results
You can apply rules and run rule profiles in the Developer tool and the Analyst tool. You can use the following methods to create or apply rules in the column profiles:
- •In the Developer tool, create a mapplet and validate it as a rule. The rule appears as a reusable rule in the Analyst tool. You can apply the rule to column profiles in the Analyst tool and Developer tool.
- •In the Analyst tool, create a rule specification and generate a mapplet. You can apply the rule specification to column profiles in the Analyst tool. In the Developer tool, validate the mapplet as a rule. The rule appears as a reusable rule that you can use in the column profiles.
- •You can use predefined rules in the column profiles. Informatica provides the predefined rules with the Developer tool and Analyst tool.
When you run Informatica Data Quality resource, the resource extracts the rule profile results to the catalog. You cannot apply expression rules to the profile. You cannot associate business terms with rule asset in the Enterprise Data Catalog. To learn how to create and apply rules to the profiles in the Developer tool and Analyst tool, see Informatica Data Discovery Guide.
Extracting Scorecard Results
You can create and edit a scorecard in the Developer tool and Analyst tool. A scorecard has multiple components, such as metrics, metric groups, scores, and thresholds. Scorecards help an enterprise to measure the value of data quality at the metric level. You can create metric groups to group related metrics. After you run profile, you can add columns from the profile results as metric to a scorecard. When you run a scorecard, the Analyst tool and Developer tool generates weighted average value for each metric group. When you run data quality resource, the resource extracts the scorecard results to the catalog.
You cannot associate business terms with scorecard asset in the Enterprise Data Catalog. To learn how to create, edit, and run scorecards in the Developer tool and Analyst tool, see Informatica Data Discovery Guide.
Objects Extracted
The Informatica Data Quality scanner extracts the following profiling metrics from the profiling warehouse to the catalog:
- •Column profile results
- - Null, unique, and non-unique values.
- - Patterns
- - Data types
- - Maximum and minimum values
- - Standard deviation value
- - Average value
- - Sum value
- •Data domain discovery results
- - Inference results
- - Curation results
- •Rule profile results
- - Null, distinct, and non-distinct values
- - Patterns
- - Maximum and minimum values
- - Maximum and minimum length
- - Data types
- •Scorecards
- - Metric groups
- - Metric scores
- - Quality scores
- •Value frequencies
Prerequisites
Perform the following step to complete the prerequisites:
- •Before you migrate the profiling results for a resource, you must run the respective resource in Catalog Administrator to extract the metadata into the catalog.
For example, you want to migrate the profile results of the Oracle schema S1 to the catalog. Before you migrate the Oracle schema profile results, create and run an Oracle resource with S1 and use this resource to configure the profiling warehouse resource.
- •Export data domains from the profiling warehouse domain to the Enterprise Data Catalog domain if the profiling warehouse and Enterprise Data Catalog are on two different domains.
Resource Connection Properties
The General tab includes the following properties:
Property | Description |
---|
Name | Enter a name for the profiling warehouse resource. |
Description | Optionally, you can enter a brief description for the resource. |
Resource type | Choose Informatica Data Quality option as the resource type. |
Profiling Warehouse Database Type | Choose Oracle, IBM DB2, or Microsoft SQL Server as the profiling warehouse database type. |
User | Enter the user name to access the profiling warehouse database. |
Password | Enter the password to access the profiling warehouse database. |
Host | Enter the host name or IP address of the Informatica domain, or enter the fully qualified JDBC connection string. For sample JDBC connection strings, see the Troubleshooting topic. |
Port | Enter the Informatica domain port number, or the port number of the database engine service. The default value is 1521. |
Service | Enter the Oracle service name. This option appears when the profiling warehouse database is Oracle. |
Database | Enter the Microsoft SQL Server database name. The scanner uses the value as a constraint to import the Microsoft SQL Server database. You can enter one or more database names separated by commas. When you specify a database name, the scanner imports only the schemas and tables in the database. For example: if you enter AdvDW, the scanner imports tables and schemas from the AdvDW database. This option appears when the profiling warehouse database is Microsoft SQL Server or IBM DB2. |
Schema Name | Enter the schema name. The scanner uses the value as a constraint to specify the non-default schema of the Microsoft SQL Server database. This option appears when the profiling warehouse database is Microsoft SQL Server. |
Extract Results of | Choose rule, scorecard, profile, or profile and value frequency to extract the results from the profiling warehouse. |
Domain Name | Name of the Informatica domain. Note: This property applies to the versions 10.2.0 HotFix 2 and later. |
Domain User Name | User name that the Data Integration Service uses to access the Model Repository Service. |
Domain Password | Password for the Model repository user. |
Security Domain | Name of the security domain to which the domain user belongs. |
Domain Host | Host name of the master gateway node. |
Domain Port | Port number of the master gateway node. |
Repository Service Name | Name of the Model Repository Service. |
Warehouse connection Name | Name of the Profiling Warehouse. |
The Metadata Load Settings tab includes the following properties:
Property | Description |
---|
Enable Source Metadata | Select the option to extract metadata from the data sources. |
Profiled Schema Connections | Select one or more schemas in the Select Profiled Schema Connections dialog box. The profiling warehouse scanner migrates the profiling results of the selected schemas to the catalog. |
Cumulative | Select or clear the option as necessary. |
Auto Accept Percentage | Enter a value from 0 to 100. Enterprise Data Catalog automatically accepts the data domains when the inference percentage exceeds the configured number. |
Incremental | Select or clear the option as necessary. |
Memory | Specify the memory required to run the scanner job. Select one of the following values based on the migrated data set size: Note: For more information about the memory values, see the Tuning Enterprise Data Catalog Performance article on How To-Library Articles tab in the Informatica Doc Portal |
JVM Options | Enter the JVM parameters to configure the scanner container. |
Track Data Source Changes | Select the option to view the metadata source change notification in Enterprise Data Catalog. |