
Enable Data Discovery

You can choose the Enable Data Discovery option for a resource to find the content, quality, and structure of the data source. You can run a column profile, perform data domain discovery, and prepare data to infer similar columns in multiple data sources. You can also generate a unique key from a column or a combination of columns.
To perform data discovery on a resource, select the Enable Data Discovery option in the Data Discovery section in the Metadata Load Settings tab. After you enable data discovery for a resource, you can configure the Data Integration Service properties, the profile-related settings, the column similarity settings, and the unique key inference settings. For more information about the options in the Data Discovery section, see the Data Discovery topic.
Choose the following profiling options for unstructured file types and extended unstructured formats:
You cannot run discovery on source metadata for unstructured file types and extended unstructured formats.

Domain Connection Settings

Configure the properties for the Data Integration Service. After you configure the properties, the Data Integration Service runs the profile, performs data domain discovery, and infers column similarity. You can choose a different Data Integration Service to infer column similarity.
The following table describes the properties that you can configure in the Metadata Load Settings > Domain Connection Settings tab:
Property
Description
Specify the configuration settings for Data Integration Service.
Choose one of the following options:
  • - Custom. Enter the domain connection settings for the Data Integration Service.
  • - Global. Choose a reusable configuration.
Domain Name
Name of the Data Integration Service Domain.
Data Integration Service
Name of the Data Integration Service.
Username
Username to log in to the Data Integration Service.
Password
Password to log in to the Data Integration Service.
Security Domain
Name of the security domain.
Host
Host name for the Data Integration Service.
Port
Port number for the Data Integration Service.
Operating System Profile
Choose an operating system profile if you do not have a default operating system profile. The Data Integration Service uses the operating system profile to run the profile.
Note: You can choose an operating system profile only when you choose the Custom option for the Specify the configuration settings for Data Integration Service property. If you choose the Global option for the Specify the configuration settings for Data Integration Service property, you need a default operating system profile to run the profile.
For more information about using the operating system profiles in Enterprise Data Catalog, see the Domain Connection Settings topic.
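The operating system profile rule in the note above can be expressed as a small consistency check. This is only an illustrative sketch: the class, field names, and function are hypothetical and are not part of any Enterprise Data Catalog API. It also assumes that the Data Integration Service is set up to use operating system profiles.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class DomainConnectionSettings:
    """Hypothetical stand-in for fields on the Domain Connection Settings tab."""
    mode: str                             # "Custom" or "Global"
    os_profile: Optional[str] = None      # profile chosen in the tab, if any
    has_default_os_profile: bool = False  # whether a default profile exists


def check_os_profile(settings: DomainConnectionSettings) -> list[str]:
    """Return a list of problems; an empty list means the settings are consistent."""
    problems = []
    if settings.mode not in ("Custom", "Global"):
        problems.append(f"unknown mode: {settings.mode!r}")
    elif settings.mode == "Global":
        # A profile can be chosen only with Custom, and Global needs a default.
        if settings.os_profile is not None:
            problems.append("an operating system profile can be chosen only with Custom")
        if not settings.has_default_os_profile:
            problems.append("Global requires a default operating system profile")
    elif settings.os_profile is None and not settings.has_default_os_profile:
        # Custom without a default profile: one must be chosen explicitly.
        problems.append("choose an operating system profile or set up a default one")
    return problems
```

For example, `check_os_profile(DomainConnectionSettings("Custom", os_profile="osp1"))` returns an empty list, while the Global option without a default operating system profile is reported as a problem.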

Basic Profile Settings

Configure the profile settings to run a column profile and perform data domain discovery for a resource.
The following table describes the properties that you can configure in the Metadata Load Settings > Basic Profile Settings tab:
Property
Description
Profile Run Option
Choose one of the following profile types:
  • - Column Profile.
  • - Data Domain Discovery.
  • - Column Profile and Data Domain Discovery.
Priority
Choose one of the following priority values:
  • - High
  • - Low
Sampling Option
Choose one of the following sampling options:
  • - All rows.
  • - Auto Random rows.
  • - Random N rows.
  • - First N rows.
  • - Limit N rows.
  • - Random Percentage
Note: For Hive resources, choose only the All rows or First N rows sampling option. For XML, JSON, Avro, and Parquet resources, choose only the All rows sampling option.
Note: For a Cassandra resource, choose the All rows, First N rows, or Limit N rows sampling option.
Exclude Views
Choose the option if you do not want the profiling scanner to scan the views in relational data sources.
Note: For a Cassandra resource, enable the Exclude Views option because data discovery is not supported on views.
Incremental Profiling
Choose the option to run the profile only on the changes made to the data source. If you do not select this option, the profile runs on the entire data source.
For information about resources that support incremental profiling, see the Basic Profile Settings topic.
Data Profile Filter
You can include or exclude tables and views from the profile run. Use semicolons (;) to separate the table names and view names.
For more information about the filter field, see Source Metadata and Data Profile Filter.
Cumulative
Choose this option to retain the column profile and column similarity results from the previous scan in the next resource scan results. If you do not choose this option, the previous profile results are purged.
For information about how to use this option with the Data Profile Filter field and the Incremental Profiling option, see the Basic Profile Settings topic.
Source Connection Name
Choose a source connection to run data discovery.
Note: This parameter is optional for a File System resource.
Run On
Choose one of the following run-time environments:
  • - Blaze. Runs the profile in the Hadoop environment on the Blaze engine. Click Select..., and choose a Hadoop connection name in the Select Hadoop Connection Name dialog box.
  • - Spark. Runs the profile in the Hadoop environment on the Spark engine. Click Select..., and choose a Hadoop connection name in the Select Hadoop Connection Name dialog box.
  • - Databricks. Runs the profile in the Hadoop environment on the Spark engine in the Databricks cluster. Click Select..., and choose a Hadoop connection name in the Select Hadoop Connection Name dialog box.
  • - Native. Runs the profile on the same machine where the Data Integration Service runs.
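The sampling restrictions in the Sampling Option notes and the semicolon-separated Data Profile Filter value can be checked before a scan with a small helper. The sketch below is illustrative only: the function names are hypothetical, the resource type labels are plain strings chosen for the example, and it assumes that resource types without a note accept every sampling option and that whitespace around filter names can be trimmed.

```python
ALL_OPTIONS = {
    "All rows", "Auto Random rows", "Random N rows",
    "First N rows", "Limit N rows", "Random Percentage",
}

# Restrictions taken from the notes in the Sampling Option row.
RESTRICTED = {
    "Hive": {"All rows", "First N rows"},
    "XML": {"All rows"},
    "JSON": {"All rows"},
    "Avro": {"All rows"},
    "Parquet": {"All rows"},
    "Cassandra": {"All rows", "First N rows", "Limit N rows"},
}


def allowed_sampling_options(resource_type: str) -> set[str]:
    # Assumption: resource types without a note accept every option.
    return RESTRICTED.get(resource_type, ALL_OPTIONS)


def is_valid_sampling(resource_type: str, option: str) -> bool:
    """Return True if the option is permitted for the resource type."""
    return option in allowed_sampling_options(resource_type)


def split_profile_filter(text: str) -> list[str]:
    """Split a semicolon-separated Data Profile Filter value into names."""
    # Assumption: surrounding whitespace and empty entries are ignored.
    return [name.strip() for name in text.split(";") if name.strip()]
```

With these definitions, `is_valid_sampling("Hive", "Random N rows")` is false, and `split_profile_filter("CUSTOMERS; ORDERS;V_SALES;")` yields the three names without the trailing empty entry.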
Configure the following properties when you choose the Data Domain Discovery option or the Column Profile and Data Domain Discovery option:
Property
Description
Data Domain Discovery Type
Choose one of the following data domain discovery types:
  • - Run Discovery on Source Data.
  • - Run Discovery on Source Metadata.
  • - Run Discovery on both Source Metadata and Data.
  • - Run Discovery on Source Data Where Metadata Matches.
Select Data Domain
Choose one of the following data domain options:
  • - All Data Domains.
  • - Specific Data Domain Groups. In the Data Domain Groups field, choose one or more data domain groups in the Data Domain Groups dialog box.
  • - Specific Data Domains. In the Data Domains field, choose one or more data domains in the Data Domains dialog box.
Use Conformance from
Choose one of the following values:
  • - Data Domain
  • - Custom
Data Domain Match Criteria
Choose one of the following values:
  • - Percentage
  • - Rows
Exclude Null Values from Data Domain Discovery
Select the option to exclude the null values in the data source when you run data domain discovery.
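To make the interaction between Data Domain Match Criteria and Exclude Null Values from Data Domain Discovery concrete, the sketch below shows one plausible reading. It is not the product's implementation. It assumes that Percentage means a minimum percentage of conforming rows, that Rows means a minimum count of conforming rows, and that excluding nulls removes them from the row total. The function name, the threshold parameter, and the email rule are all hypothetical.

```python
import re
from typing import Callable, Iterable, Optional


def meets_match_criteria(
    values: Iterable[Optional[str]],
    conforms: Callable[[str], bool],
    criteria: str,
    threshold: float,
    exclude_nulls: bool = False,
) -> bool:
    """Decide whether a column matches a data domain under the assumed semantics."""
    rows = list(values)
    if exclude_nulls:
        # Nulls no longer count toward the total number of rows.
        rows = [v for v in rows if v is not None]
    matched = sum(1 for v in rows if v is not None and conforms(v))
    if criteria == "Rows":
        return matched >= threshold
    if criteria == "Percentage":
        return bool(rows) and 100.0 * matched / len(rows) >= threshold
    raise ValueError(f"unknown criteria: {criteria!r}")


# Hypothetical data domain rule: a loose email pattern.
EMAIL = re.compile(r"[^@\s]+@[^@\s]+\.[^@\s]+")


def is_email(value: str) -> bool:
    return EMAIL.fullmatch(value) is not None
```

For a column with the values `a@x.com`, `b@y.org`, two nulls, and `n/a`, a 60 percent threshold fails when nulls are counted (2 of 5 rows conform) and passes when nulls are excluded (2 of 3 rows conform).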

Similarity Profile and Value Frequency Settings

Configure the column similarity properties to identify similar columns and value frequency in the resource.
The following table describes the properties that you can configure in the Metadata Load Settings > Similarity Profile Data Preparation and Value Frequency Settings tab:
Property
Description
Run Similarity Profile
Choose one of the following options:
  • - Yes. The profiling scanner scans the data source and prepares data to perform the following tasks:
    • - Discover similar columns. The algorithm discovers similar columns in the resource based on column names, column patterns, and unique values.
    • - Identify business terms. The algorithm identifies and recommends the business terms for a column based on the accepted data domains and similar columns.
  • - No.
Save Source Data
Choose one of the following options:
  • - Yes. Persists the computed information about similar columns, column patterns, and unique values in the resource in the PostgreSQL database.
  • - No.
Sampling Options
Choose one of the following sampling options:
  • - Reuse Basic Profile Settings.
  • - All Rows.
  • - Auto Random Rows.
  • - Random N Rows.
  • - First N Rows.
Domain Connection Settings
Choose one of the following options:
  • - Use Profile Configuration Settings.
  • - Specify Domain Connection Settings.
For information about domain connection settings properties, see the Domain Connection Settings section.
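The Run Similarity Profile description says that similar columns are discovered from column names, column patterns, and unique values. The catalog's actual algorithm is not documented here, so the following is only a toy illustration of comparing two columns on those three signals. It swaps in standard techniques (a `difflib` ratio for names and Jaccard overlap for patterns and values), and every function name is hypothetical.

```python
from difflib import SequenceMatcher


def pattern_of(value: str) -> str:
    """Map digits to 9 and letters to X, keeping other characters as they are."""
    return "".join("9" if c.isdigit() else "X" if c.isalpha() else c for c in value)


def jaccard(a: set, b: set) -> float:
    """Share of elements that the two sets have in common."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def similarity(name_a, values_a, name_b, values_b) -> dict:
    """Score two columns on name, value pattern, and unique value overlap."""
    return {
        "name": SequenceMatcher(None, name_a.lower(), name_b.lower()).ratio(),
        "pattern": jaccard(
            {pattern_of(v) for v in values_a},
            {pattern_of(v) for v in values_b},
        ),
        "values": jaccard(set(values_a), set(values_b)),
    }
```

Two columns named `cust_id` and `CUST_ID` with values such as `A10` and `C33` score 1.0 on name and pattern even when only some of the unique values overlap, which is why the three signals are kept separate in this sketch.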

Unique Key Inference Settings

You can configure the following settings to generate unique key candidates from selected columns.
Property
Description
Run Unique Key Inference
Choose one of the following options:
  • - Yes. The profiling scanner scans the data source and infers unique keys from it.
  • - No
Null Threshold % in Unique Key Inference
The threshold for null values in unique key inference. You can enter a value between 0 and 1.
Skip Unique Key Inference When Accepted or Documented Unique Key Exists
Choose one of the following options:
  • - Yes. The profiling service skips unique key inference for columns with documented or accepted keys.
  • - No
Unique Key Sampling Option
Choose one of the following options:
  • - All Rows. Chooses all the rows in the data object for unique key inference.
  • - First N Rows. Chooses only the first N rows in the data object for unique key inference.
See Unique Key Inference Settings for the list of supported resources and file types.
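As a rough illustration of what unique key inference involves, the sketch below tests single columns and column pairs for uniqueness over the sampled rows. It is not the profiling scanner's algorithm. It assumes that the null threshold is the maximum fraction of null values (between 0 and 1) that a column may have and still be considered, and that First N Rows simply truncates the input. The function and parameter names are hypothetical.

```python
from itertools import combinations
from typing import Optional, Sequence


def infer_unique_keys(
    rows: Sequence[dict],
    columns: Sequence[str],
    null_threshold: float = 0.0,
    max_columns: int = 2,
    first_n: Optional[int] = None,
) -> list[tuple]:
    """Return column combinations whose values are unique across the sample."""
    # All Rows when first_n is None, otherwise only the first N rows.
    sample = list(rows[:first_n]) if first_n is not None else list(rows)
    if not sample:
        return []
    # Drop columns whose share of nulls exceeds the threshold.
    usable = [
        c for c in columns
        if sum(r.get(c) is None for r in sample) / len(sample) <= null_threshold
    ]
    keys = []
    for size in range(1, max_columns + 1):
        for combo in combinations(usable, size):
            # Skip supersets of a key that was already found.
            if any(set(k) <= set(combo) for k in keys):
                continue
            seen = {tuple(r.get(c) for c in combo) for r in sample}
            if len(seen) == len(sample):
                keys.append(combo)
    return keys
```

A smaller sample can report keys that do not hold for the full data object, so candidates found with First N Rows are best treated as suggestions to review.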