Data Similarity Overview

Data similarity refers to the identification of similar columns based on the source data available in an enterprise.
As a data analyst or a data architect, you can scan your enterprise data to find similar data and then attach data domains to similar data patterns. This process helps you search for and discover assets of interest in the catalog faster.
A data domain is a predefined or user-defined Model Repository object based on the semantics of column data or a column name. Examples include Social Security number, phone number, and credit card number. A data domain helps you find important data and metadata that remain undiscovered in a data source. You can group logical data domains into data domain groups.
Data similarity involves three stages: preparing the data from different sources so that similar columns can be found, running hashing algorithms on the data that Live Data Map profiled, and drawing inferences by comparing the hashed data.
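The three stages described above can be illustrated with a short Python sketch. This is not the algorithm that Live Data Map uses, which this guide does not specify. The sketch assumes a simple approach: normalize the values, hash each value, and compare the two sets of hashes with a Jaccard score. The function names and the sample values are hypothetical.

```python
import hashlib


def value_hashes(values):
    """Prepare each value (trim, lowercase) and hash it to a 64-bit integer."""
    hashes = set()
    for value in values:
        normalized = str(value).strip().lower()
        digest = hashlib.sha1(normalized.encode("utf-8")).digest()
        hashes.add(int.from_bytes(digest[:8], "big"))
    return hashes


def jaccard_similarity(column_a, column_b):
    """Compare two columns by the overlap of their hashed values (0.0 to 1.0)."""
    a, b = value_hashes(column_a), value_hashes(column_b)
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


# Made-up sample columns from two sources with different column names.
cust_ssn = ["123-45-6789", "987-65-4321", "555-12-3456", "111-22-3333"]
acct_tax_id = [" 123-45-6789", "987-65-4321 ", "555-12-3456", "222-33-4444"]

print(jaccard_similarity(cust_ssn, acct_tax_id))  # 0.6
```

The two columns share three of five distinct values after normalization, so the score is 0.6 even though the column names have nothing in common.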

Business Example

Alex is a data analyst at a financial institution that has branches and franchises across North America. The institution has recently acquired another financial institution of equal size. Some customers hold accounts in both financial institutions. Alex and his team are asked to integrate all the customer details in a single database. Alex also wants to search for customers by region, such as Northeast, South, Midwest, and West.
The following table lists the challenges that Alex and his team face and how Alex uses similarity discovery and data domains to extract the required information:
Scenario: Different database systems used by the financial institution and the acquired institution.
Resolution: Identify the data sources that need to be scanned to find the customers that match the eligibility criteria. Add these data sources as resources in Live Data Map to extract metadata from them. Alex identifies the databases in the enterprise that include the customer details.

Scenario: Lack of consistency and context in the column names, which makes it difficult to find and analyze source columns with similar data.
Resolution: Enable profiling with similarity profiling for the selected resources. Live Data Map runs a profile on the data sources and verifies the profile results for data similarity. Alex uses the profile results to identify details about the source data, such as the values, uniqueness, and consistency of the data. These attributes help Alex filter out the unwanted data.
Alex uses similarity discovery to identify columns that contain similar data across all the data sources. From existing bank reports at both institutions, Alex finds that both organizations store the Social Security number (SSN) on all records that contain customer information. If columns across different tables store the SSN, Alex infers that the customer details might be present in the tables that include an SSN column.
When Alex searches for an SSN column in the catalog, Enterprise Information Catalog lists the searched column along with the columns from all the data sources that are similar to it.
After finding columns that contain similar data, Alex and his team can identify data that can be joined and duplicate data that can be removed.

Scenario: Identify the lineage for each data asset, the other assets that are related to a particular asset, and the impact that joining or deleting a specific data asset might have on the related data assets.
Resolution: Alex and his team can view the lineage, impact summary, and relationship view for the identified assets in Enterprise Information Catalog. Viewing the lineage, impact summary, and related asset details helps Alex and his team identify the impact before they update or delete a specific asset.

Scenario: Classify customers based on the regions and make searches faster.
Resolution: Alex defines data domains and data domain groups in Live Data Map Administrator.
To classify customers based on the regions, Alex performs the following steps:
  1. Alex creates a data domain called customer_details in Live Data Map Administrator.
  2. In Enterprise Information Catalog, Alex assigns the data domain to one of the columns that contain the SSN.
  3. Alex defines data domains called ZIP_code_<area> in Live Data Map Administrator, one for each ZIP Code where the financial institutions have branches. Alex replaces <area> with the branch location when defining each data domain. Alex configures each data domain by performing the following steps:
    a. Specifies the proximity rule when creating the data domain. A proximity rule specifies that if a specified data domain is not found in a table, Live Data Map reduces the inference percentage for the new data domain by a specified percentage value. In this case, Alex specifies that if the data domain customer_details is not found in a table, Live Data Map reduces the inference percentage for the data domain ZIP_code_<area> by 100 percent. With this rule, if a table has no SSN column, Live Data Map does not search for the ZIP Code in that table.
    b. Specifies a rule for each data domain ZIP_code_<area> in the Analyst Tool or the Developer Tool. Live Data Map uses the rule to match a column data pattern with the ZIP Code for a specific branch.
    Note: A rule is business logic that defines conditions applied to data when you run a profile. You can add a rule to the profile to cleanse, modify, or validate the data in the profile.
  4. Alex then creates four data domain groups based on the regions, called Northeast, South, Midwest, and West, and includes each data domain in its respective data domain group. For example, the data domain ZIP_code_LosAngeles is included in the West data domain group.
  5. Alex performs a search in Enterprise Information Catalog for customer_details. Enterprise Information Catalog lists all the columns that include the SSN details of the customers and also shows the data domains (ZIP_code_<area>) and the data domain groups associated with each column.
Alex can also search based on the defined data domain groups to find a list of columns with customer details specific to a region.
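The interaction between the data rule and the proximity rule in step 3 can be sketched in Python. This is an illustration only and does not show how Live Data Map computes inference internally. It assumes that the inference percentage is the share of non-empty column values that match a pattern, and that the reduction scales that percentage. At 100 percent the result is zero whether the reduction is read as a scaling factor or as percentage points. The pattern for ZIP_code_LosAngeles and both function names are hypothetical.

```python
import re

# Hypothetical rule for ZIP_code_LosAngeles: assumes five-digit codes starting with 900.
ZIP_LOS_ANGELES = r"900\d{2}"


def conformance_percent(values, pattern):
    """Percentage of non-empty values that fully match the pattern."""
    non_empty = [v for v in values if v]
    if not non_empty:
        return 0.0
    matched = sum(1 for v in non_empty if re.fullmatch(pattern, v))
    return 100.0 * matched / len(non_empty)


def apply_proximity_rule(inference_percent, domains_in_table, required_domain, reduction_percent):
    """Scale the inference percentage down when required_domain is absent from the table.

    Assumption: "reduce by N percent" is read as multiplying by (1 - N/100).
    """
    if required_domain in domains_in_table:
        return inference_percent
    return inference_percent * (1 - reduction_percent / 100)


zip_values = ["90012", "90017", "10001", "90025"]
score = conformance_percent(zip_values, ZIP_LOS_ANGELES)
print(score)  # 75.0

# Table that contains a column tagged customer_details: the score is unchanged.
print(apply_proximity_rule(score, {"customer_details"}, "customer_details", 100))  # 75.0

# Table without customer_details: the 100 percent reduction removes the match.
print(apply_proximity_rule(score, set(), "customer_details", 100))  # 0.0
```

With a 100 percent reduction, a ZIP Code column in a table that has no customer_details column never qualifies for the ZIP_code_<area> data domain, which matches the behavior that Alex configures.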