Catalog Administrator Guide > Enterprise Data Catalog Concepts > Column Similarity
  

Column Similarity

As a data analyst or data architect, you can scan your enterprise data to find similar columns. When you run the Data Domain Propagation scanner or when the Data Domain Propagation scanner runs based on a schedule, then the scanner uses the inferred similar columns to propagate smart data domains to other similar columns. This process allows you to search and discover assets of interest faster in the catalog.
In the Catalog Administrator, when you run a resource after you choose the Run Similarity Profile option, and then create and run the SimilarityDiscovery resource, the scanner discovers similar columns based on column name, column data pattern, and unique values factors in the resource.
By default, Enterprise Data Catalog uses all the factors to propagate the smart data domains to other similar columns. You can choose one or more factors to propagate smart data domains to other similar columns. For example, you identify an employee ID column as a potential data domain. You create a smart data domain for the column. To propagate this to other similar columns, you can choose only the name match and pattern match factors.
You can identify the frequency of values after you enable column similarity for a resource and choose the Save Source Data option. Based on your business requirement, you can use the value frequency to analyze data in a resource. You can compute value frequency in the view column, table column, CSV field, XML file field, and JSON file field for relational sources, semi-structured sources, and for sources that you can run the column profile on.

How Column Similarity Works

In an organization, a column name such as customer ID might exist across multiple data sources. To identify the data sources that contain such columns, you can use column similarity in Enterprise Data Catalog. It uses unsupervised clustering which is a machine learning technique to identify the similar columns. Enterprise Data Catalog performs unsupervised clustering across multiple data sources based on several factors, such as distinct value match, pattern match, and name match. It then assigns an overall similarity score as well as the match likelihood for each factor.
To determine column similarity, the unsupervised clustering uses the following factors:
Distinct value match
Determines similar columns across data sources based on distinct values. The distinct values overlap is calculated as a percentage of distinct values that overlap in two columns.
The catalog displays this measurement as Distinct Values in the Similar Columns > Confidence% section.
Pattern match
Determines the column data pattern in the data sources and calculates the percentage of pattern overlap across column pairs for all data types except numeric data types. The pattern match uses profiling to identify the dominant data patterns for each column and field. The pattern match, then, checks for an overlap of these patterns across column pairs. Enterprise Data Catalog does not infer X, XX, and XXX patterns for all data types.
The catalog displays the percentage as Pattern in the Similar Columns > Confidence% section.
For example, pattern match identifies column pairs that might contain the credit card pattern.
Name match
Determines similar columns across data sources based on the column names. The name match uses fuzzy string match to identify the columns that have similar names.
The catalog displays the percentage as Name in the Similar Columns > Confidence% section.
For example, name match can identify all the columns that has the column name, customerID.

Column Similarity Process

The column similarity process includes data preparation, staging of data, discover, and ingestion of data in the catalog.
Column similarity process includes the following jobs:
Prepare Data
After you configure the parameters in the Similarity Profile Data Preparation and Value Frequency Settings section and run the resource, the profiling scanner prepares the data in the resource to infer similar columns and then persists the computed information in PostgreSQL. Similarly, you can prepare and persist data for multiple resources.
Discover Similar Columns
To identify similar columns, create and run the Informatica Similarity Discovery resource. The Informatica Similarity Discovery resource runs on multiple resources to discover similar columns in the catalog. When you run the Informatica Similarity Discovery resource, it scans PostgreSQL, compares the prepared data to infer similar columns, and persists the results into PostgreSQL.
Informatica administrators can either install the PostgreSQL database server bundled with the Enterprise Data Catalog installer or configure an external PostgreSQL database after installation. Informatica administrators can configure an external PostgreSQL database if you run the Informatica Similarity Discovery resource. For more information on how Informatica administrators can configure an external PostgreSQL database, see the KB article https://knowledge.informatica.com/s/article/You-can-t-create-logical-partitions-or-configure-high-availability-and-failover-options-when-you-use-the-PostgreSQL-database-bundled-with-the-Enterprise-Data-Catalog-installer?language=en_US.
The following image shows the column similarity process:
The image shows the column similarity process.
In Catalog Administrator, you can enable data discovery and configure the resource properties to discover column similarity for multiple resources. After you run the resources and Informatica Similarity Discovery resource in Catalog Administrator, you can view similar columns based on column names, column data patterns, and unique values for the data assets in Enterprise Data Catalog.

Choosing Factors to Propagate Smart Data Domains

You can choose one or more factors to propagate the smart data domains to other similar columns.
    1. In Catalog Administrator, create or open a resource.
    2. Run the resource after you choose the Run Similarity Profile option.
    3. Click Library > Resource.
    4. Create and run the Informatica Similarity Discovery resource.
    The catalog shows the assets related to the resource.
    5. Create and save a text file on your local machine. Name the file as necessary.
    6. Edit the file. Enter a smart data domain and the chosen factor abbreviations in the following format:
    <data domain name>=<factor1>,<factor2>
    The factor abbreviations are N for name match, P for pattern match, and U for distinct value match.
    7. Continue to enter other data domains and their chosen factors in the file. Ensure that each data domain and its factors are in a new line. Data domain names are case-insensitive. Do not enter a whitespace in the file.
    The following snippet shows a sample text file:
    SDD_Oracle_DD1=P
    SDD_PatternCheck=P
    DD_NameCheck=N,VF
    DD_Check_Pattern=P,U
    8. Save the file.
    9. In the Catalog Administrator, edit the DataDomainPropagation system resource.
    10. On the Metadata Load Settings tab, choose the text file for the Enabled Similarity Types option.
    11. You can configure the following optional fields as required:
    12. Save and run the system resource.
    The DataDomainPropagation scanner propagates the smart data domains in the text file based on the chosen factors. To propagate the rest of the data domains, the scanner uses name match, pattern match, and distinct value match factors.

Business Example

Alex is a data analyst at a financial institution that has branches and franchises across North America. The institution has recently acquired another financial institution equal in size. There are customers who hold accounts in both the financial institutions. Alex and his team are asked to integrate all the customer details in a single database. Alex also wants to search for the customers based on the regions such as Northeast, South, Midwest, and West.
The following are the challenges that Alex and his team face:
The following table lists the scenarios that Alex and his team need to manage, and how Alex uses the Informatica Similarity Discovery resource and data domains to extract the required information:
Scenario
Resolution
Different database systems used by the financial institution and the acquired institution.
Identify the data sources that need to be scanned to find the required customers that match the eligibility criteria. Add these data sources as resources in Catalog Administrator to extract metadata from these resources.
Alex identifies the databases in the enterprise that include the customer details.
Lack of consistency and context in the column names that makes it difficult to find and analyze source columns with similar data.
In Catalog Administrator, enable Data Discovery and Run Similarity Profile options for the selected resources. The profiling scanner and Informatica Similarity Discovery resource identifies the similar columns in the resources. The similar columns are identified based on column names, column data patterns, and unique values. Enterprise Data Catalog displays the similar columns for the data assets.
Alex uses similar columns to identify the columns that contain similar data across all the data sources, identify data that can be joined, and duplicate data that can be removed.
For example, Alex finds out from an existing bank report that both the organizations store the Social Security Number on all records that have customer information. Based on this information, he infers that if the tables contain a column with SSN information then the customer details might be present in the tables. To verify, Alex searches for an SSN column in the Catalog, Enterprise Data Catalog lists the searched column along with other columns from all the data sources that are similar to the searched column. Based on the search results, Alex and his team identify data that can be joined and duplicate data that can be removed.
Identify the lineage for each data asset, the other assets that are related to a particular asset, and the impact that joining or deleting a specific data asset might cause for the other related data assets.
In Enterprise Data Catalog, view the lineage, impact summary, and relationship view for identified assets.
Alex and his team can view the lineage, impact summary, and related asset details to identify the impact before updating or deleting a specific asset.
Classify customers based on the regions and make searches faster.
In Catalog Administrator, create data domains and data domain groups to find a list of columns with customer details specific to a region.
Alex performs the following steps to define data domains and data domain groups to classify customers based on the regions:
  1. 1. Creates a data domain called customer_details.
  2. 2. Assigns the data domain to one of the columns that contain the SSN in Enterprise Data Catalog.
  3. 3. Creates ZIP_code_<area> data domain.
  4. In Informatica Analyst or Informatica Developer, Alex specifies a rule to replace the <area> parameter in the data domain with the branch locations of the financial institutions. Enterprise Data Catalog uses the rule to match a column pattern with the ZIP code for a specific branch.
    Note: A rule is business logic that defines conditions applied to data when you run a profile. You can add a rule to the profile to cleanse, modify, or validate the data in the profile.
  5. 4. Creates four data domain groups based on the regions called Northeast, South, Midwest, and West, and includes the data domains in the respective data domain group.
  6. For example, the data domain that corresponds to the ZIP_code_LosAngeles ZIP Code is included in the West data domain group.
  7. 5. Alex performs a search in Enterprise Data Catalog for customer_details. Enterprise Data Catalog lists all the columns that include SSN details of the customers and also shows the data domains ( ZIP_code_<area>) and the data domain groups associated with the column.
Alex can also search based on the defined data domain groups to find a list of columns with customer details specific to a region.