Catalog Administrator Guide > Enterprise Data Catalog Concepts > Column Similarity
  

Column Similarity

As a data analyst or data architect, you can scan your enterprise data to find similar columns. When you run the Data Domain Propagation scanner or when the Data Domain Propagation scanner runs based on a schedule, then the scanner uses the inferred similar columns to propagate example-based data domains to other similar columns. This process allows you to search and discover assets of interest faster in the catalog.
In the Catalog Administrator, when you run a resource after you choose the Run Similarity Profile option, and then run the SimilarityDiscovery system resource, the scanner discovers similar columns based on column name, column data pattern, unique values, and value frequency in the resource.
You can identify the frequency of values after you enable column similarity for a resource and choose the Save Source Data option. Based on your business requirement, you can use the value frequency to analyze data in a resource. You can compute value frequency in the view column, table column, CSV field, XML file field, and JSON file field for relational sources, semi-structured sources, and for sources that you can run the column profile on.

How Column Similarity Works

In an organization, a column name such as customer ID might exist across multiple data sources. To identify the data sources that contain such columns, you can use column similarity in Enterprise Data Catalog. It uses unsupervised clustering which is a machine learning technique to identify the similar columns. Enterprise Data Catalog performs unsupervised clustering across multiple data sources based on several factors, such as data overlap, distinct value match, pattern match, and name match. It then assigns an overall similarity score as well as the match likelihood for each factor.
The unsupervised clustering performs the following actions for two columns in the data sources:
Data overlap
Determines the data overlap in the data sources. The data overlap is calculated as a percentage of values that overlap in both the columns. Enterprise Data Catalog displays this percentage as Data in the Similar Columns > Confidence% section.
For example, data overlap can identify duplicate and redundant data sets across data sources.
Distinct value match
Determines similar columns across data sources based on distinct values. The distinct values overlap is calculated as a percentage of distinct values that overlap in two columns. Enterprise Data Catalog displays this measurement as Distinct Values in the Similar Columns > Confidence% section.
Pattern match
Determines the column data pattern in the data sources and calculates the percentage of pattern overlap across column pairs. The pattern match uses profiling to identify the dominant data patterns for each column and field. The pattern match, then, checks for an overlap of these patterns across column pairs. Enterprise Data Catalog displays the percentage as Pattern in the Similar Columns > Confidence% section.
For example, pattern match identifies column pairs that might contain the credit card pattern.
Name match
Determines similar columns across data sources based on the column names. The name match uses fuzzy string match to identify the columns that have similar names. Enterprise Data Catalog displays the percentage as Name in the Similar Columns > Confidence% section.
For example, name match can identify all the columns that has the column name, customerID.

Column Similarity Process

The column similarity process includes data preparation, staging of data, discover, and ingestion of data in the catalog.
Column similarity process includes the following jobs:
    Prepare Data
    After you configure the parameters in the Similarity Profile Data Preparation and Value Frequency Settings section and run the resource, the profiling scanner prepares the data in the resource to infer similar columns and then persists the computed information in Apache HBase. Similarly, you can prepare and persist data for multiple resources.
    Discover Similar Columns
    To identify similar columns, run the Similarity Discovery system resource. The Similarity Discovery system resource is an internal system job that runs on multiple resources to discover similar columns in the catalog. When you run the Similarity Discovery system resource, it scans HBase, compares the prepared data to infer similar columns, and persists the results into HBase.
The following image shows the column similarity process:
The image shows the column similarity process.
In Catalog Administrator, you can enable data discovery and configure the resource properties to discover column similarity for multiple resources. After you run the resources and Similarity Discovery system resource in Catalog Administrator, you can view similar columns based on column names, column data patterns, unique values, and value frequency for the data assets in Enterprise Data Catalog.

Business Example

Alex is a data analyst at a financial institution that has branches and franchises across North America. The institution has recently acquired another financial institution equal in size. There are customers who hold accounts in both the financial institutions. Alex and his team are asked to integrate all the customer details in a single database. Alex also wants to search for the customers based on the regions such as Northeast, South, Midwest, and West.
The following are the challenges that Alex and his team face:
The following table lists the scenarios that Alex and his team need to manage, and how Alex uses Similarity Discovery system resource and data domains to extract the required information:
Scenario
Resolution
Different database systems used by the financial institution and the acquired institution.
Identify the data sources that need to be scanned to find the required customers that match the eligibility criteria. Add these data sources as resources in Catalog Administrator to extract metadata from these resources.
Alex identifies the databases in the enterprise that include the customer details.
Lack of consistency and context in the column names that makes it difficult to find and analyze source columns with similar data.
In Catalog Administrator, enable Data Discovery and Run Similarity Profile options for the selected resources. The profiling scanner and Similarity Discovery system resource identifies similar columns in the resources. The similar columns are identified based on column names, column data patterns, unique values, and value frequency. Enterprise Data Catalog displays the similar columns for the data assets.
Alex uses similar columns to identify the columns that contain similar data across all the data sources, identify data that can be joined, and duplicate data that can be removed.
For example, Alex finds out from an existing bank report that both the organizations store the Social Security Number on all records that have customer information. Based on this information, he infers that if the tables contain a column with SSN information then the customer details might be present in the tables. To verify, Alex searches for an SSN column in the Catalog, Enterprise Data Catalog lists the searched column along with other columns from all the data sources that are similar to the searched column. Based on the search results, Alex and his team identify data that can be joined and duplicate data that can be removed.
Identify the lineage for each data asset, the other assets that are related to a particular asset, and the impact that joining or deleting a specific data asset might cause for the other related data assets.
In Enterprise Data Catalog, view the lineage, impact summary, and relationship view for identified assets.
Alex and his team can view the lineage, impact summary, and related asset details to identify the impact before updating or deleting a specific asset.
Classify customers based on the regions and make searches faster.
In Catalog Administrator, create data domains and data domain groups to find a list of columns with customer details specific to a region.
Alex performs the following steps to define data domains and data domain groups to classify customers based on the regions:
  1. 1. Creates a data domain called customer_details.
  2. 2. Assigns the data domain to one of the columns that contain the SSN in Enterprise Data Catalog.
  3. 3. Creates ZIP_code_<area> data domain.
  4. In Informatica Analyst or Informatica Developer, Alex specifies a rule to replace the <area> parameter in the data domain with the branch locations of the financial institutions. Enterprise Data Catalog uses the rule to match a column pattern with the ZIP code for a specific branch.
    Note: A rule is business logic that defines conditions applied to data when you run a profile. You can add a rule to the profile to cleanse, modify, or validate the data in the profile.
  5. 4. Creates four data domain groups based on the regions called Northeast, South, Midwest, and West, and includes the data domains in the respective data domain group.
  6. For example, the data domain that corresponds to the ZIP_code_LosAngeles ZIP Code is included in the West data domain group.
  7. 5. Alex performs a search in Enterprise Data Catalog for customer_details. Enterprise Data Catalog lists all the columns that include SSN details of the customers and also shows the data domains ( ZIP_code_<area>) and the data domain groups associated with the column.
Alex can also search based on the defined data domain groups to find a list of columns with customer details specific to a region.

Propagate Business Terms

Enterprise Data Catalog identifies potential business terms that can be associated with columns or fields in the catalog. Enterprise Data Catalog periodically runs the DataDomainPropagation system resource to automatically accept or recommend the business terms for similar columns. This process minimizes manual intervention to identify and associate business terms with columns or fields.
Business term propagation process involves the following steps:
    Run a Resource
    You create a resource and configure the parameters in the Metadata Load Settings > Similarity Profile Data Preparation and Value Frequency Settings section, and run the resource.
    Run a Business Glossary Resource or Axon Resource
    You create and run a Business Glossary resource or Axon resource to retrieve and display the business terms in the catalog.
    SimilarityDiscovery System Resource Run
    Enterprise Data Catalog runs the SimilarityDiscovery system resource on a predefined schedule of 45 minutes. The SimilarityDiscovery system resource scanner discovers similar columns based on column name, column data pattern, unique values, and value frequency in the resource.
    DataDomainPropagation System Resource Run
    Enterprise Data Catalog runs the DataDomainPropagation system resource on a predefined schedule of 120 minutes. The DataDomainPropagation system resource scanner performs the following tasks:
The scanner uses one of the following ways to auto-accept or recommend the business term for columns:
In Enterprise Data Catalog, a business term with the confidence score of more than 80 percent is automatically assigned as a business title to a column. The confidence score is calculated by an internal algorithm based on accepted business terms on data domains, column similarity, and name similarity between a column and business term. If there are multiple business terms with a threshold of more than 80 percent, the term with the highest confidence score is automatically assigned as a business title. Enterprise Data Catalog displays a maximum of three recommended business terms if the confidence score is between 60 and 80 percent for a column.