Informatica Similarity Discovery Resource

When you create and run the Informatica Similarity Discovery resource, the resource scans the PostgreSQL database, compares the prepared data to infer similar columns, and persists the results into the PostgreSQL database. After you run the Informatica Similarity Discovery resource, you can view similar columns in Enterprise Data Catalog.

Prerequisite

Before you create the Informatica Similarity Discovery resource, enable the Run Similarity Profile option for each resource that you want to compare. For example, suppose the catalog contains similar customer details spread across multiple resources, and you want to identify all of those resources so that you can integrate the data into a single resource and remove the duplicates. Enable the Run Similarity Profile option for the identified resources, and then create and run the Informatica Similarity Discovery resource on them to identify and view similar columns in Enterprise Data Catalog.

Resource Connection Properties

Property	Description
Name	Enter a name for the resource.
Description	Optional. Enter a brief description for the resource.
Resource type	Choose the Informatica Similarity Discovery option as the resource type.
Resources to be grouped	Select the resources on which you want to run similarity discovery. By default, you can select four resources for similarity discovery. To add more resources, configure the following custom property for the Catalog Service in Informatica Administrator: LdmCustomOptions.scanner.simDiscoveryScanner.maxPermittedResourceCount. Note: After you run an Informatica Similarity Discovery resource on a set of selected resources, you cannot select the same resources in another Informatica Similarity Discovery resource. For example, if you use resources R1, R2, and R3 to run similarity discovery, you cannot use R1, R2, and R3 again with another Informatica Similarity Discovery resource.

Property	Description
Enable Similarity Discovery	Select the option to run similarity discovery on the selected resources.
Features enabled	Select the features on which you want to base similarity discovery, and clear the features that you do not need. The following are the similarity discovery features:
	- Name. Determines similar columns across data sources based on the column names. The name match uses a fuzzy string match to identify columns that have similar names.
	- Patterns. Determines the column data pattern in the data sources and calculates the percentage of pattern overlap across column pairs for all data types except numeric data types.
	- Unique Values. Determines similar columns across data sources based on distinct values. The distinct values overlap is calculated as the percentage of distinct values that overlap in two columns.
	For more information about the similarity discovery features, see the Column Similarity topic.
Cumulative	Select this option to retain the column similarity results from the previous scan in the results of the next resource scan. If you do not select this option, the previous profile results are purged.
Memory	The memory required to run the scanner job. Select one of the following values based on the size of the imported data set:
	- Low
	- Medium
	- High
	Note: For more information about the memory values, see the Tuning Enterprise Data Catalog Performance article on the How-To Library Articles tab in the Informatica Doc Portal.
Custom Options	Specify the custom options to run the scanner job.
Track Data Source Changes	Select this option to view metadata source change notifications in Enterprise Data Catalog.

Column type	Name	Patterns	Unique Values
Numeric	Yes	No	No
String	Yes	Yes	Yes
Date	Yes	No	No
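
The three similarity features can be illustrated with a small Python sketch. This is not the algorithm that Enterprise Data Catalog uses internally; it only shows the general idea behind each feature. The function names, the sample columns, the use of difflib for the fuzzy name match, the digit and letter pattern encoding, and the choice to measure overlap against the union of the two sets are all assumptions made for illustration.

```python
from difflib import SequenceMatcher
import re


def name_similarity(a: str, b: str) -> float:
    """Fuzzy match score between two column names, from 0.0 to 1.0."""
    # Assumption: a case-insensitive difflib ratio stands in for the fuzzy match.
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def to_pattern(value: str) -> str:
    """Reduce a value to its shape: digits become 9, letters become X."""
    shape = re.sub(r"[0-9]", "9", value)
    return re.sub(r"[A-Za-z]", "X", shape)


def overlap_percent(left: set, right: set) -> float:
    """Percentage of items shared by two sets, relative to their union."""
    # Assumption: the denominator is the union; the product may use another one.
    if not left or not right:
        return 0.0
    return 100.0 * len(left & right) / len(left | right)


# Hypothetical string columns from two different resources.
cust_id = ["AB-12", "CD-34", "EF-56"]
customer_id = ["AB-12", "CD-34", "ZZ-99"]

# Name feature: compare the column names.
print("name:", round(name_similarity("cust_id", "customer_id"), 2))

# Patterns feature: compare the sets of value shapes.
print("patterns:", overlap_percent({to_pattern(v) for v in cust_id},
                                   {to_pattern(v) for v in customer_id}))

# Unique Values feature: compare the sets of distinct values.
print("unique values:", overlap_percent(set(cust_id), set(customer_id)))
```

In this sketch the two columns share one value shape and two of four distinct values, so the pattern overlap is higher than the unique values overlap. Enabling more than one feature lets such signals complement each other.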
