
Enable Data Discovery

You can choose the Enable Data Discovery option for a resource to discover the content, quality, and structure of the data source. You can run a column profile, perform data domain discovery, and prepare data to infer similar columns across multiple data sources. You can also generate a unique key from a column or a combination of columns.
To perform data discovery on a resource, select the Enable Data Discovery option in the Data Discovery section of the Metadata Load Settings tab. After you enable data discovery for a resource, you can configure the Data Integration Service properties, profile-related settings, column similarity settings, and unique key inference settings. For more information about the options in the Data Discovery section, see the Data Discovery topic.
Note: You cannot run discovery on source metadata for unstructured file types and extended unstructured formats.

Profiling architecture

The profiling architecture consists of tools, services, and databases. The tools component consists of client applications. The services component has application services required to manage the tools, perform the data integration tasks, and manage the metadata of profile objects. The databases component consists of the Model Repository, the Profiling Warehouse, and the Reference Data Warehouse.
[Image: Architecture components for profiling]
The following table describes the architecture components:
Component
Description
Catalog Administrator tool
A tool to configure the profiling metadata settings so that you have information related to data quality for further analysis.
Enterprise Data Catalog tool
A tool to discover, explore, and relate different types of metadata from disparate sources in the enterprise. You can view the profiling results in the Enterprise Data Catalog tool.
Developer tool
A client application that you use to design and implement data integration, data quality, data profiling, and data services solutions. You can use the Developer tool to create rules for profiling.
Analyst tool
A web-based client tool that is available to multiple Informatica products and is used by business users to collaborate on projects within an organization. You can use the Analyst tool to create rules for profiling.
Informatica Cluster Service
An application service that runs and manages all the Hadoop services, Apache Ambari server, and Apache Ambari agents on an embedded Hadoop cluster.
Catalog Service
An application service that runs and manages the connections between service components and the users that have access to the Catalog Administrator and Enterprise Data Catalog tools.
Model Repository Service
An application service that manages the Model repository.
Data Integration Service
An application service that performs data integration tasks for the Analyst tool, the Developer tool, and external clients.
Content Management Service
An application service that manages reference data for data domains that use reference tables. It uses the Data Integration Service to run mappings to transfer data between reference tables and external data sources.
Model repository
A relational database that stores the metadata for projects created in the Analyst tool or Developer tool.
Profiling warehouse
A database that stores profiling information such as profile results.
Reference Data warehouse
A database that stores data values for the reference table objects that you define in the Model repository.
As depicted in the profiling architecture diagram, the following processes occur when you enable profiling for a resource and scan the resource:
  1. Enable data discovery for the resource in the Catalog Administrator tool to discover profiling-related metadata and unique keys.
  2. The Catalog Service gets the profile definition from the Model repository.
  3. The Catalog Service invokes the Profiling module in the Data Integration Service.
  4. The Profiling module processes the profile job and submits the job to the Data Integration Service.
  5. The Data Integration Service generates and writes the profiling results to the profiling warehouse.
  6. Discover, explore, and view profiling results in the Enterprise Data Catalog tool.
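Read as a hand-off between the components described above, the scan sequence can be sketched in a few lines of Python. All class and method names below are illustrative stand-ins, not an Informatica API:

```python
# Conceptual model of the profiling job flow in steps 1-6 above.
# All names are illustrative stand-ins, not an Informatica API.

class ModelRepository:
    def get_profile_definition(self, resource):
        # Step 2: return the stored profile definition for the resource.
        return {"resource": resource, "type": "column_profile"}

class ProfilingWarehouse:
    def __init__(self):
        self.results = []

    def write(self, result):
        # Step 5: persist profiling results.
        self.results.append(result)

class DataIntegrationService:
    def __init__(self, warehouse):
        self.warehouse = warehouse

    def run(self, job):
        # Step 5: generate results and write them to the warehouse.
        self.warehouse.write({"job": job, "rows_profiled": 1000})

class ProfilingModule:
    def __init__(self, dis):
        self.dis = dis

    def process(self, definition):
        # Step 4: turn the definition into a job and submit it.
        self.dis.run({"definition": definition})

class CatalogService:
    def __init__(self, repo, profiling_module):
        self.repo = repo
        self.profiling_module = profiling_module

    def scan(self, resource):
        # Steps 2-3: fetch the definition, then invoke the Profiling module.
        self.profiling_module.process(self.repo.get_profile_definition(resource))

warehouse = ProfilingWarehouse()
catalog = CatalogService(ModelRepository(),
                         ProfilingModule(DataIntegrationService(warehouse)))
catalog.scan("sales_db")   # Step 1: enabling discovery and scanning the resource
print(warehouse.results)   # Step 6: results are available for viewing
```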

Domain Connection Settings

Configure the properties for the Data Integration Service. After you configure the properties, the Data Integration Service runs the profile, performs data domain discovery, and infers column similarity. You can choose a different Data Integration Service to infer column similarity.
The following table describes the properties that you can configure in the Metadata Load Settings > Domain Connection Settings tab:
Property
Description
Specify the configuration settings for Data Integration Service.
Choose one of the following options:
  • - Custom. Enter the domain connection settings for the Data Integration Service.
  • - Global. Choose a reusable configuration.
Domain Name
Name of the Data Integration Service Domain.
Data Integration Service
Name of the Data Integration Service.
Username
Username to log in to the Data Integration Service.
Password
Password to log in to the Data Integration Service.
Security Domain
Name of the security domain.
Host
Host name for the Data Integration Service.
Port
Port number for the Data Integration Service.
Operating System Profile
Choose an operating system profile if you do not have a default operating system profile. The Data Integration Service uses the operating system profile to run the profile.
Note: You can choose an operating system profile only when you choose the Custom option for the Specify the configuration settings for Data Integration Service property. If you choose the Global option for the Specify the configuration settings for Data Integration Service property, you need a default operating system profile to run the profile.
For more information about using the operating system profiles in Enterprise Data Catalog, see the Domain Connection Settings topic.
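To make the property list above concrete, the following minimal sketch collects the domain connection settings into one structure. The field names mirror the table; the class, example values, assumed default, and validation are hypothetical, not an Informatica API:

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical container for the domain connection settings listed above.
# Field names mirror the property table; the class, example values, and
# validation are illustrative, not an Informatica API.

@dataclass
class DomainConnectionSettings:
    domain_name: str               # Name of the Data Integration Service domain
    data_integration_service: str  # Name of the Data Integration Service
    username: str
    password: str
    host: str
    port: int
    security_domain: str = "Native"                  # assumed default
    operating_system_profile: Optional[str] = None   # Custom settings only

    def validate(self):
        if not (0 < self.port < 65536):
            raise ValueError(f"invalid port: {self.port}")

# Example values are placeholders only.
settings = DomainConnectionSettings(
    domain_name="Domain_EDC",
    data_integration_service="DIS_Profiling",
    username="Administrator",
    password="secret",
    host="dis-host.example.com",
    port=6005,
)
settings.validate()
```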

Basic Profile Settings

Configure the profile settings to run a column profile and perform data domain discovery for a resource.
The following table describes the properties that you can configure in the Metadata Load Settings > Basic Profile Settings tab:
Property
Description
Profile Run Option
Choose one of the following profile types:
  • - Column Profile.
  • - Data Domain Discovery.
  • - Column Profile and Data Domain Discovery.
Priority
Choose one of the following priority values:
  • - High
  • - Low
Sampling Option
Choose one of the following sampling options:
  • - All rows.
  • - Auto Random rows.
  • - Random N rows.
  • - First N rows.
  • - Limit N rows.
  • - Random Percentage.
Note: For Hive resources, choose only the All rows or First N rows sampling options. For XML, JSON, Avro, and Parquet resources, choose only the All rows sampling option.
Note: For Cassandra resources, choose the All rows, First N rows, or Limit N rows sampling options.
Exclude Views
Choose the option if you do not want the profiling scanner to scan the views in relational data sources.
Note: For Cassandra resources, enable the Exclude Views option because data discovery is not supported on views.
Incremental Profiling
Choose the option to run the profile only on the changes made to the data source. If you do not select this option, the profile runs on the entire data source.
For information about resources that support incremental profiling, see the Basic Profile Settings topic.
Data Profile Filter
You can include or exclude tables and views from the profile run. Use semicolons (;) to separate the table names and view names.
For more information about the filter field, see Source Metadata and Data Profile Filter.
Cumulative
Choose this option to retain the column profile and column similarity results from the previous scan in the next resource scan results. If you do not choose this option, the previous profile results are purged.
For information about how to use this option with the Data Profile Filter field and the Incremental Profiling option, see the Basic Profile Settings topic.
Source Connection Name
Choose a source connection to run data discovery.
Note: This parameter is optional for a File System resource.
Run On
Choose one of the following run-time environments:
  • - Blaze. Runs the profile in the Hadoop environment on the Blaze engine. Click Select..., and choose a Hadoop connection name in the Select Hadoop Connection Name dialog box.
  • - Spark. Runs the profile in the Hadoop environment on the Spark engine. Click Select..., and choose a Hadoop connection name in the Select Hadoop Connection Name dialog box.
  • - Databricks. Runs the profile in the Databricks environment on the Databricks Spark engine in the Databricks cluster. Click Select..., and choose a Databricks connection name.
  • Note: You can run profiles on JDBC and Azure Data Lake Store resources using the Databricks run-time environment.
  • - Native. Runs the profile on the same machine where the Data Integration Service runs.
You can choose native, Hadoop, or Databricks as the run-time environment for a column profile. You can choose a Blaze or Spark engine in the Hadoop run-time environment. You can choose the Databricks Spark option in the Databricks run-time environment. Enterprise Data Catalog sets the run-time environment in the profile definition after you choose a run-time environment.
Note: Choose the Blaze or Native run-time environment to run the profile job for all resources except Hive resources. When you choose the Blaze or Spark engine, select a Hadoop connection to run the profiles.
You can run the profile in the following run-time environments:
Native environment
When you run a profile in the native run-time environment, the Data Integration Service runs the mappings and writes the profile results to the profiling warehouse. By default, all profiles run in the native run-time environment. You can use native data sources to create and run profiles in the native environment.
Hadoop environment
You can run profiles on the Spark or Blaze engine in the Hadoop environment or on the Databricks Spark engine in the Databricks environment. The Data Integration Service pushes the processing to the cluster. When processing is complete, the Data Integration Service writes the profile results to the profiling warehouse.
Databricks environment
You can choose the Databricks Spark option to run the profiles in the Databricks run-time environment. After you choose the Databricks Spark option, you can select a Databricks connection. The Data Integration Service pushes the profile logic to the Databricks Spark engine on the Databricks cluster to run profiles. When you run a profile in the Databricks environment, the Analyst tool submits the profile jobs to the Profiling Service Module. The Profiling Service Module then breaks down the profile jobs into a set of mappings. The Data Integration Service pushes the mappings to the Databricks Spark engine through the Databricks connection. The Databricks Spark engine processes the mappings, and the Data Integration Service writes the profile results to the profiling warehouse.
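Across all of these environments, the Sampling Option property set earlier in this table controls how many rows the profile reads. The following generic Python sketch shows what the common options amount to; the functions are illustrative and are not Informatica's implementation:

```python
import random

# Generic illustrations of common sampling strategies behind the
# Sampling Option property. Not Informatica's internal implementation.

def first_n(rows, n):
    """First N rows: profile only the first n rows."""
    return rows[:n]

def random_n(rows, n, seed=None):
    """Random N rows: profile a uniform random sample of n rows."""
    return random.Random(seed).sample(rows, min(n, len(rows)))

def random_percentage(rows, pct, seed=None):
    """Random Percentage: keep each row with probability pct/100."""
    rng = random.Random(seed)
    return [row for row in rows if rng.random() < pct / 100]

rows = list(range(1, 101))
print(len(first_n(rows, 10)))                      # 10
print(len(random_n(rows, 10, seed=42)))            # 10
print(len(random_percentage(rows, 25, seed=42)))   # roughly 25
```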
Configure the following properties when you choose Data Domain Discovery or Column Profile and Data Domain Discovery option:
Property
Description
Data Domain Discovery Type
Choose one of the following data domain discovery types:
  • - Run Discovery on Source Data.
  • - Run Discovery on Source Metadata.
  • - Run Discovery on both Source Metadata and Data.
  • - Run Discovery on Source Data Where Metadata Matches.
Select Data Domain
Choose one of the following data domain options:
  • - All Data Domains.
  • - Specific Data Domain Groups. In the Data Domain Groups field, choose one or more data domain groups in the Data Domain Groups dialog box.
  • - Specific Data Domains. In the Data Domains field, choose one or more data domains in the Data Domains dialog box.
Use Conformance from
Choose one of the following values:
  • - Data Domain
  • - Custom
Data Domain Match Criteria
Choose one of the following values:
  • - Percentage
  • - Rows
Exclude Null Values from Data Domain Discovery
Select the option to exclude the null values in the data source when you run data domain discovery.
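To make the match criteria concrete: a column conforms to a data domain either when a minimum percentage of its values match the domain rule or when a minimum number of rows match. The following generic sketch illustrates both criteria and the null-exclusion option; the rule and thresholds are hypothetical examples, not product internals:

```python
import re

# Illustration of the Data Domain Match Criteria options: a column conforms
# when enough of its values match the data domain rule. The rule and
# thresholds below are hypothetical examples, not product internals.

def conforms(values, rule, criteria, threshold, exclude_nulls=True):
    if exclude_nulls:
        # Exclude Null Values from Data Domain Discovery
        values = [v for v in values if v is not None]
    if not values:
        return False
    matches = sum(1 for v in values if rule(v))
    if criteria == "Percentage":
        return matches / len(values) * 100 >= threshold
    if criteria == "Rows":
        return matches >= threshold
    raise ValueError(f"unknown criteria: {criteria}")

# Example rule: a simple SSN-like pattern as the data domain.
ssn_rule = re.compile(r"^\d{3}-\d{2}-\d{4}$").match

column = ["123-45-6789", "987-65-4321", None, "n/a"]
print(conforms(column, ssn_rule, "Percentage", 60))  # True: 2 of 3 non-nulls match
print(conforms(column, ssn_rule, "Rows", 3))         # False: only 2 matching rows
```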

Similarity Profile and Value Frequency Settings

Configure the column similarity properties to identify similar columns and value frequency in the resource.
The following table describes the properties that you can configure in the Metadata Load Settings > Similarity Profile Data Preparation and Value Frequency Settings tab:
Property
Description
Run Similarity Profile
Choose one of the following options:
  • - Yes. The profiling scanner scans the data source and prepares data to perform the following tasks:
    • - Discover similar columns. The algorithm discovers similar columns in the resource based on column names, column patterns, and unique values.
    • - Identify business terms. The algorithm identifies and recommends the business terms for a column based on the accepted data domains and similar columns.
  • - No.
Save Source Data
Choose one of the following options:
  • - Yes. Persists the computed information about similar columns, column patterns, and unique values in the resource in a PostgreSQL database.
  • - No.
Sampling Options
Choose one of the following sampling options:
  • - Reuse Basic Profile Settings.
  • - All Rows.
  • - Auto Random Rows.
  • - Random N Rows.
  • - First N Rows.
Domain Connection Settings
Choose one of the following options:
  • - Use Profile Configuration Settings.
  • - Specify Domain Connection Settings.
For information about domain connection settings properties, see the Domain Connection Settings section.
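One way to picture the similarity computation described above: the unique-value signal can be measured as set overlap between two columns. The sketch below uses the Jaccard index for that single signal; the product combines several signals (names, patterns, unique values), so this scoring is illustrative only:

```python
# Sketch of one column-similarity signal: overlap of unique values between
# two columns, measured with the Jaccard index. Illustrative only; the
# product combines multiple signals (names, patterns, unique values).

def jaccard(a, b):
    """Jaccard similarity of the unique values of two columns."""
    a, b = set(a), set(b)
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)

customers_country = ["US", "DE", "FR", "US", "IN"]
orders_ship_country = ["US", "DE", "FR", "BR"]

# 3 shared values out of 5 distinct values -> 0.60
print(f"value overlap: {jaccard(customers_country, orders_ship_country):.2f}")
```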

Unique Key Inference Settings

You can configure the following settings to generate unique key candidates from selected columns.
Property
Description
Run Unique Key Inference
Choose one of the following options:
  • - Yes. The profiling scanner scans the data source and infers unique keys.
  • - No.
Null Threshold % in Unique Key Inference
The threshold for null values in unique key inference. You can enter a value between 0 and 1.
Skip Unique Key Inference When Accepted or Documented Unique Key Exists
Choose one of the following options:
  • - Yes. The profiling service skips unique key inference for columns with documented or accepted keys.
  • - No.
Unique Key Sampling Option
Choose one of the following options:
  • - All Rows. Chooses all the rows in the data object for unique key inference.
  • - First N Rows. Chooses only the first N rows in the data object for unique key inference.
See Unique Key Inference Settings for the list of supported resources and file types.
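Conceptually, a column (or a combination of columns) is a unique key candidate when its null fraction stays within the threshold and its non-null values are all distinct. A minimal generic sketch, assuming the 0-to-1 null threshold described above; this is illustrative, not the profiling scanner's implementation:

```python
# Generic sketch of unique key inference: a column (or column combination)
# is a candidate if its null fraction is within the threshold and its
# non-null values are all distinct. Illustrative only.

def is_unique_key_candidate(values, null_threshold=0.1):
    if not values:
        return False
    nulls = sum(1 for v in values if v is None)
    if nulls / len(values) > null_threshold:
        return False                                 # too many nulls
    non_null = [v for v in values if v is not None]
    return len(non_null) == len(set(non_null))       # all values distinct

print(is_unique_key_candidate([1, 2, 3, None]))      # False: 25% nulls > 10%
print(is_unique_key_candidate([1, 2, 3, 4]))         # True
print(is_unique_key_candidate([1, 2, 2, 4]))         # False: duplicate value

# For a combination of columns, test the row-wise tuples of values.
combined = list(zip([1, 1, 2], ["a", "b", "a"]))
print(is_unique_key_candidate(combined))             # True: tuples are distinct
```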