Informatica Similarity Discovery Resource
As a data analyst or data architect, you can scan your enterprise data to find similar columns. You can run the Informatica Similarity Discovery resource to discover similar columns based on column name, column data pattern, and unique values.
In Catalog Administrator, you can create multiple Informatica Similarity Discovery resources. The Informatica Similarity Discovery resource runs on multiple resources to discover similar columns in the catalog. You can specify the resources on which you want to run similarity discovery.
When you create and run the Informatica Similarity Discovery resource, the resource scans the PostgreSQL database, compares the prepared data to infer similar columns, and persists the results into the PostgreSQL database. After you run the Informatica Similarity Discovery resource, you can view similar columns in Enterprise Data Catalog.
Prerequisite
Before you create and run the Informatica Similarity Discovery resource, ensure that you have selected the Run Similarity Profile option for resources on which you want to run the similarity discovery in Catalog Administrator.
Example
The catalog contains similar customer details spread across multiple resources. You might want to identify all the resources that contain similar customer details so that you can integrate the data into a single resource and remove the duplicate data. To achieve this goal, you can enable the Run Similarity Profile option for the identified resources, and then create and run the Informatica Similarity Discovery resource on those resources to identify and view similar columns in Enterprise Data Catalog.
For more examples of column similarity, see the Business Example topic in the Informatica Catalog Administrator Guide.
Resource Connection Properties
The General tab includes the following properties:
Property | Description |
---|---|
Name | Enter a name for the resource. |
Description | Optionally, you can enter a brief description for the resource. |
Resource type | Choose the Informatica Similarity Discovery option as the resource type. |
Resources to be grouped | Select the resources on which you want to run similarity discovery. By default, you can select up to four resources for similarity discovery. To select more resources, configure the following custom property for the Catalog Service in Informatica Administrator: LdmCustomOptions.scanner.simDiscoveryScanner.maxPermittedResourceCount Note: After you run the Informatica Similarity Discovery resource on a set of selected resources, you cannot select the same resources to run similarity discovery using another Informatica Similarity Discovery resource. For example, if you use resources R1, R2, and R3 to run similarity discovery, you cannot use resources R1, R2, and R3 again with another Informatica Similarity Discovery resource. |
The Metadata Load Settings tab includes the following properties:
Property | Description |
---|---|
Enable Similarity Discovery | Select the option to run similarity discovery on the selected resources. |
Features enabled | Select the features on which you want to base similarity discovery, and clear the others. The similarity discovery features are as follows. Name: determines similar columns across data sources based on the column names. The name match uses a fuzzy string match to identify columns that have similar names. Patterns: determines the column data pattern in the data sources and calculates the percentage of pattern overlap across column pairs for all data types except numeric data types. Unique Values: determines similar columns across data sources based on distinct values. The distinct values overlap is calculated as the percentage of distinct values that overlap in two columns. For more information about the similarity discovery features, see the Column Similarity topic. |
Cumulative | Choose this option to retain the column similarity results from the previous scan in the next resource scan results. If you do not choose this option, the previous profile results are purged. |
Memory | The memory required to run the scanner job. Select one of the following values based on the size of the imported data set: Note: For more information about the memory values, see the Tuning Enterprise Data Catalog Performance article on the How-To Library Articles tab in the Informatica Doc Portal. |
Custom Options | Specify the custom options to run the scanner job. |
Track Data Source Changes | Select the option to view the metadata source change notification in Enterprise Data Catalog. |
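The three features listed under Features enabled can be illustrated with a small Python sketch. This is not the product's implementation: the function names, the use of `difflib.SequenceMatcher` as the fuzzy string match, the digit-to-`9` and letter-to-`X` pattern encoding, and the choice of the union of distinct values as the denominator of the overlap percentage are all assumptions made for illustration.

```python
from difflib import SequenceMatcher
import re


def name_similarity(name_a: str, name_b: str) -> float:
    """Fuzzy match score in [0, 1] between two column names (case-insensitive).

    Assumption: SequenceMatcher stands in for the product's fuzzy string match.
    """
    return SequenceMatcher(None, name_a.lower(), name_b.lower()).ratio()


def value_pattern(value: str) -> str:
    """Hypothetical pattern encoding: digits become '9', ASCII letters become 'X'."""
    pattern = re.sub(r"[0-9]", "9", value)
    return re.sub(r"[A-Za-z]", "X", pattern)


def overlap_percentage(items_a, items_b) -> float:
    """Percentage of distinct items shared by both inputs.

    Assumption: the denominator is the union of distinct items.
    """
    set_a, set_b = set(items_a), set(items_b)
    union = set_a | set_b
    if not union:
        return 0.0
    return 100.0 * len(set_a & set_b) / len(union)


def pattern_overlap(values_a, values_b) -> float:
    """Overlap of the distinct data patterns found in two columns."""
    return overlap_percentage(map(value_pattern, values_a),
                              map(value_pattern, values_b))


def unique_values_overlap(values_a, values_b) -> float:
    """Overlap of the distinct values found in two columns."""
    return overlap_percentage(values_a, values_b)
```

For example, `unique_values_overlap(["a", "b", "c"], ["b", "c", "d"])` counts two shared values out of four distinct values, and `pattern_overlap(["AB-123"], ["ZZ-000"])` treats both columns as having the single pattern `XX-999`.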
Resource tasks
When you run the Informatica Similarity Discovery resource, the scanner performs the following tasks:
1. Verifies whether the selected resources are enabled with the Run Similarity Profile option.
2. Scans the prepared data in the PostgreSQL database, compares the prepared data to infer similar columns, and persists the results into the PostgreSQL database.
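The verification in the first task, together with the default limit of four resources described for the Resources to be grouped property, can be pictured as a simple pre-run check. The sketch below is illustrative only: `CatalogResource`, `validate_selection`, and the `run_similarity_profile` field are hypothetical names, not part of any Informatica API.

```python
from dataclasses import dataclass

# Default limit stated for the "Resources to be grouped" property.
DEFAULT_MAX_RESOURCES = 4


@dataclass
class CatalogResource:
    """Hypothetical stand-in for a catalog resource and its profile setting."""
    name: str
    run_similarity_profile: bool


def validate_selection(resources, max_resources=DEFAULT_MAX_RESOURCES):
    """Return a list of problems; an empty list means the selection passes."""
    problems = []
    if len(resources) > max_resources:
        problems.append(
            f"{len(resources)} resources selected; limit is {max_resources}"
        )
    for resource in resources:
        if not resource.run_similarity_profile:
            problems.append(
                f"{resource.name}: Run Similarity Profile is not enabled"
            )
    return problems
```

Passing a larger `max_resources` value mirrors raising the limit through the custom property; how the scanner actually reports a failed verification is not described here.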
The following matrix describes the type of similarity computation performed on the columns of different inferred data types:
Column type | Name | Pattern | Unique values |
---|---|---|---|
Numeric | Yes | No | No |
String | Yes | Yes | Yes |
Date | Yes | No | No |
For more information about the column similarity process, see the Column Similarity Process topic.
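The matrix can also be read as a lookup from inferred data type to the computations that apply. The following Python sketch encodes the matrix rows as given; the names `SIMILARITY_MATRIX` and `applicable_computations` are hypothetical, and the optional `enabled_features` argument is an assumed way to model the Features enabled selection.

```python
# Rows copied from the matrix: which computations run for each inferred type.
SIMILARITY_MATRIX = {
    "Numeric": {"Name": True, "Pattern": False, "Unique values": False},
    "String": {"Name": True, "Pattern": True, "Unique values": True},
    "Date": {"Name": True, "Pattern": False, "Unique values": False},
}


def applicable_computations(column_type, enabled_features=None):
    """List the computations performed for a column of the given inferred type.

    enabled_features optionally restricts the result to the selected features.
    """
    row = SIMILARITY_MATRIX[column_type]
    return [
        feature
        for feature, supported in row.items()
        if supported and (enabled_features is None or feature in enabled_features)
    ]
```

With this encoding, a String column is compared on all three features, while Numeric and Date columns are compared on name only.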