Informatica Similarity Discovery Resource

As a data analyst or data architect, you can scan your enterprise data to find similar columns. Run the Informatica Similarity Discovery resource to discover similar columns based on three factors: column name, column data pattern, and unique values.
In Catalog Administrator, you can create multiple Informatica Similarity Discovery resources. The Informatica Similarity Discovery resource runs on multiple resources to discover similar columns in the catalog. You can specify the resources on which you want to run similarity discovery.
When you create and run the Informatica Similarity Discovery resource, the resource scans the PostgreSQL database, compares the prepared data to infer similar columns, and persists the results into the PostgreSQL database. After you run the Informatica Similarity Discovery resource, you can view similar columns in Enterprise Data Catalog.

Prerequisite

Before you create and run the Informatica Similarity Discovery resource, ensure that in Catalog Administrator you have selected the Run Similarity Profile option for each resource on which you want to run similarity discovery.
Example
The catalog contains similar customer details spread across multiple resources. You might want to identify all the resources that contain similar customer details so that you can integrate the data into a single resource and remove the duplicate data. To achieve this goal, you can enable the Run Similarity Profile option for the identified resources, and then create and run the Informatica Similarity Discovery resource on those resources to identify and view similar columns in Enterprise Data Catalog.
For more examples of column similarity, see the Business Example topic in the Informatica Catalog Administrator Guide.

Resource Connection Properties

The General tab includes the following properties:
Property
    Description
Name
    Enter a name for the resource.
Description
    Optionally, enter a brief description for the resource.
Resource type
    Choose the Informatica Similarity Discovery option as the resource type.
Resources to be grouped
    Select the resources on which you want to run similarity discovery. By default, you can select up to four resources. To select more resources, configure the following custom property for the Catalog Service in Informatica Administrator:
    LdmCustomOptions.scanner.simDiscoveryScanner.maxPermittedResourceCount
    Note: After you run the Informatica Similarity Discovery resource with a set of selected resources, you cannot select the same resources for another Informatica Similarity Discovery resource. For example, if you use resources R1, R2, and R3 to run similarity discovery, you cannot use resources R1, R2, and R3 again with another Informatica Similarity Discovery resource.
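The selection rules for Resources to be grouped can be sketched as a small check. This is only an illustration and not part of the product: the function name check_selection and its parameters are hypothetical, the max_permitted argument stands in for the value of the custom property, and the sketch assumes that the reuse restriction in the note applies to each resource individually.

```python
# Default number of resources that this guide says you can select.
DEFAULT_MAX_RESOURCES = 4


def check_selection(selected, max_permitted=DEFAULT_MAX_RESOURCES,
                    already_grouped=frozenset()):
    """Return a list of problems with a proposed resource selection.

    max_permitted models the value of the custom property
    LdmCustomOptions.scanner.simDiscoveryScanner.maxPermittedResourceCount.
    already_grouped holds resources used by other Similarity Discovery
    resources (assumption: the restriction applies per resource).
    """
    problems = []
    if len(selected) > max_permitted:
        problems.append(
            f"{len(selected)} resources selected; the limit is {max_permitted}"
        )
    reused = sorted(set(selected) & set(already_grouped))
    if reused:
        problems.append(
            "already used by another Similarity Discovery resource: "
            + ", ".join(reused)
        )
    return problems


if __name__ == "__main__":
    # Five resources exceed the default limit of four.
    print(check_selection(["R1", "R2", "R3", "R4", "R5"]))
    # Raising the limit, as the custom property would, clears the problem.
    print(check_selection(["R1", "R2", "R3", "R4", "R5"], max_permitted=6))
```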
The Metadata Load Settings tab includes the following properties:
Property
    Description
Enable Similarity Discovery
    Select the option to run similarity discovery on the selected resources.
Features enabled
    Select the features on which you want to base similarity discovery, and clear the features that you do not need.
    The following are the similarity discovery features:
    - Name. Determines similar columns across data sources based on the column names. The name match uses a fuzzy string match to identify columns that have similar names.
    - Patterns. Determines the column data pattern in the data sources and calculates the percentage of pattern overlap across column pairs for all data types except numeric data types.
    - Unique Values. Determines similar columns across data sources based on distinct values. The overlap is calculated as the percentage of distinct values that two columns have in common.
    For more information about the similarity discovery features, see the Column Similarity topic.
Cumulative
    Choose this option to retain the column similarity results from the previous scan in the next resource scan results. If you do not choose this option, the previous profile results are purged.
Memory
    The memory required to run the scanner job. Select one of the following values based on the size of the imported data set:
    - Low
    - Medium
    - High
    Note: For more information about the memory values, see the Tuning Enterprise Data Catalog Performance article on the How-To Library Articles tab in the Informatica Doc Portal.
Custom Options
    Specify the custom options to run the scanner job.
Track Data Source Changes
    Select the option to view the metadata source change notification in Enterprise Data Catalog.
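To make the Name, Patterns, and Unique Values features more concrete, the following sketch computes a comparable score for each one. It is an illustration under stated assumptions and not the scanner's implementation: difflib stands in for the fuzzy string match, the digit and letter mapping in to_pattern is a hypothetical pattern scheme, and the overlap percentage assumes that the denominator is the combined set of distinct items from both columns.

```python
from difflib import SequenceMatcher


def name_similarity(name_a, name_b):
    """Fuzzy name score in [0, 1]; difflib is a stand-in matcher."""
    return SequenceMatcher(None, name_a.lower(), name_b.lower()).ratio()


def to_pattern(value):
    """Map a value to a shape: digits become 9, letters become X."""
    return "".join(
        "9" if ch.isdigit() else "X" if ch.isalpha() else ch
        for ch in value
    )


def overlap_percentage(items_a, items_b):
    """Shared items as a percentage of all distinct items in both sets.

    Assumption: the denominator is the union of the two sets.
    """
    union = items_a | items_b
    if not union:
        return 0.0
    return 100.0 * len(items_a & items_b) / len(union)


def pattern_overlap(col_a, col_b):
    """Percentage of value patterns that two columns share."""
    return overlap_percentage(
        {to_pattern(v) for v in col_a},
        {to_pattern(v) for v in col_b},
    )


def unique_value_overlap(col_a, col_b):
    """Percentage of distinct values that two columns share."""
    return overlap_percentage(set(col_a), set(col_b))


if __name__ == "__main__":
    customers = ["AB-1001", "CD-2002", "EF-3003"]
    clients = ["CD-2002", "EF-3003", "GH-4004"]
    print(name_similarity("CUST_ID", "customer_id"))
    print(pattern_overlap(customers, clients))       # 100.0: both use XX-9999
    print(unique_value_overlap(customers, clients))  # 50.0: 2 of 4 distinct values
```

In this sample, both columns follow a single pattern, so the pattern overlap is complete even though only half of the distinct values match. That difference is one reason to enable the features separately.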

Resource Tasks

When you run the Informatica Similarity Discovery resource, the scanner performs the following tasks:
  1. Verifies whether the selected resources are enabled with the Run Similarity Profile option.
  2. Scans the prepared data in the PostgreSQL database, compares the prepared data to infer similar columns, and persists the results into the PostgreSQL database.
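The verification in the first task can be pictured as a check that you might run yourself before you start the scan. The following sketch is hypothetical: the function name and the dictionary of flags are illustrative only, and the sketch makes no claim about how the scanner reacts when a resource fails the check.

```python
def resources_missing_profile(selected, profile_enabled):
    """Return the selected resources without Run Similarity Profile enabled.

    profile_enabled maps a resource name to True or False. A resource
    that is absent from the mapping is treated as not enabled.
    """
    return [name for name in selected if not profile_enabled.get(name, False)]


if __name__ == "__main__":
    flags = {"R1": True, "R2": False}
    # R2 is disabled and R3 is unknown, so both are reported.
    print(resources_missing_profile(["R1", "R2", "R3"], flags))
```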
The following matrix describes the type of similarity computation performed on the columns of different inferred data types:
Column type   Name   Pattern   Unique values
Numeric       Yes    No        No
String        Yes    Yes       Yes
Date          Yes    No        No
For more information about the column similarity process, see the Column Similarity Process topic.
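If you need to reason about the matrix in a script, for example to predict which comparisons apply to a column, you can encode the matrix as a lookup table. The following sketch copies the values from the matrix. The names APPLICABLE_FEATURES and features_for are hypothetical and are not part of any Informatica API.

```python
# Similarity computations per inferred data type, copied from the matrix.
APPLICABLE_FEATURES = {
    "numeric": {"name"},
    "string": {"name", "pattern", "unique_values"},
    "date": {"name"},
}


def features_for(inferred_type):
    """Return the sorted similarity computations for an inferred data type."""
    key = inferred_type.lower()
    if key not in APPLICABLE_FEATURES:
        raise ValueError(f"no matrix entry for inferred type: {inferred_type}")
    return sorted(APPLICABLE_FEATURES[key])


if __name__ == "__main__":
    for column_type in ("Numeric", "String", "Date"):
        print(column_type, features_for(column_type))
```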