Enable Data Discovery
You can choose the Enable Data Discovery option for a resource to discover the content, quality, and structure of the data source. You can run a column profile, perform data domain discovery, and prepare data to infer similar columns across multiple data sources. You can also generate unique keys from a column or a combination of columns.
To perform data discovery on a resource, select the Enable Data Discovery option in the Data Discovery section of the Metadata Load Settings tab. After you enable data discovery for a resource, you can configure the Data Integration Service properties, the profile-related settings, the column similarity settings, and the unique key inference settings. For more information about the options in the Data Discovery section, see the Data Discovery topic.
Choose the following profiling options for unstructured file types and extended unstructured formats:
- Data Domain Discovery or Column Profile and Data Domain Discovery profile run option.
- All Rows sampling option.
- Rows as the data domain match criteria. Row count is the number of occurrences of a data domain in a data source.
You cannot run discovery on source metadata for unstructured file types and extended unstructured formats.
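The restrictions above can be expressed as a small pre-flight check. The following is a minimal sketch, assuming the settings are held in a plain dictionary. The key names (`profile_run_option`, `sampling_option`, `match_criteria`, `discovery_type`) and the function name are hypothetical and are not part of any Enterprise Data Catalog API. The values mirror the option labels used in this topic.

```python
from typing import Dict, List

# Hypothetical setting keys; the values mirror option labels from this topic.
ALLOWED_RUN_OPTIONS = {
    "Data Domain Discovery",
    "Column Profile and Data Domain Discovery",
}


def check_unstructured_settings(settings: Dict[str, str]) -> List[str]:
    """Return a list of problems; an empty list means the settings comply."""
    problems = []
    if settings.get("profile_run_option") not in ALLOWED_RUN_OPTIONS:
        problems.append("Profile run option must include data domain discovery.")
    if settings.get("sampling_option") != "All Rows":
        problems.append("Sampling option must be All Rows.")
    if settings.get("match_criteria") != "Rows":
        problems.append("Data domain match criteria must be Rows.")
    # Only the pure source-metadata type is flagged here. The text does not
    # say how the combined metadata-and-data types behave for these formats.
    if settings.get("discovery_type") == "Run Discovery on Source Metadata":
        problems.append("Discovery on source metadata is not supported.")
    return problems


if __name__ == "__main__":
    candidate = {
        "profile_run_option": "Column Profile",
        "sampling_option": "First N Rows",
        "match_criteria": "Rows",
        "discovery_type": "Run Discovery on Source Data",
    }
    for problem in check_unstructured_settings(candidate):
        print(problem)
```

A check like this is only useful as a reminder before you save the resource. The catalog itself remains the authority on which combinations are accepted.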
Domain Connection Settings
Configure the properties for the Data Integration Service. After you configure the properties, the Data Integration Service runs the profile, performs data domain discovery, and infers column similarity. You can choose a different Data Integration Service to infer column similarity.
The following table describes the properties that you can configure in the Metadata Load Settings > Domain Connection Settings tab:
Property | Description |
---|
Specify the configuration settings for Data Integration Service. | Choose one of the following options: Custom (enter the domain connection settings for the Data Integration Service) or Global (choose a reusable configuration). |
Domain Name | Name of the Data Integration Service Domain. |
Data Integration Service | Name of the Data Integration Service. |
Username | Username to log in to the Data Integration Service. |
Password | Password to log in to the Data Integration Service. |
Security Domain | Name of the security domain. |
Host | Host name for the Data Integration Service. |
Port | Port number for the Data Integration Service. |
Operating System Profile | Choose an operating system profile if you do not have a default operating system profile. The Data Integration Service uses the operating system profile to run the profile. Note: You can choose an operating system profile only when you choose the Custom option for the Specify the configuration settings for Data Integration Service property. If you choose the Global option for the Specify the configuration settings for Data Integration Service property, you need a default operating system profile to run the profile. For more information about using the operating system profiles in Enterprise Data Catalog, see the Domain Connection Settings topic. |
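The operating system profile rule in the table can be sketched as a small decision function. This is an illustration only: the function name and parameters are hypothetical, and the choice to let an explicitly chosen profile take precedence over a default profile under the Custom option is an assumption that the table does not state.

```python
from typing import Optional


def resolve_os_profile(
    config_mode: str,
    chosen_profile: Optional[str],
    default_profile: Optional[str],
) -> str:
    """Pick the operating system profile that would run the profile job."""
    if config_mode == "Global":
        # The table says a profile can be chosen only with the Custom option.
        if chosen_profile is not None:
            raise ValueError(
                "An operating system profile can be chosen only with the Custom option."
            )
        if default_profile is None:
            raise ValueError(
                "The Global option requires a default operating system profile."
            )
        return default_profile
    if config_mode == "Custom":
        # Assumption: an explicit choice wins over the default profile.
        profile = chosen_profile or default_profile
        if profile is None:
            raise ValueError(
                "Choose an operating system profile when no default exists."
            )
        return profile
    raise ValueError(f"Unknown configuration mode: {config_mode!r}")


if __name__ == "__main__":
    print(resolve_os_profile("Custom", "etl_profile", None))
    print(resolve_os_profile("Global", None, "default_profile"))
```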
Basic Profile Settings
Configure the profile settings to run a column profile and perform data domain discovery for a resource.
The following table describes the properties that you can configure in the Metadata Load Settings > Basic Profile Settings tab:
Property | Description |
---|
Profile Run Option | Choose one of the following profile types: Column Profile; Data Domain Discovery; Column Profile and Data Domain Discovery. |
Priority | Choose one of the following priority values: |
Sampling Option | Choose one of the following sampling options: All rows; Auto Random rows; Random N rows; First N rows; Limit N rows; Random Percentage. Note: For Hive resources, choose only the All rows or First N rows sampling option. For XML, JSON, Avro, and Parquet resources, choose only the All rows sampling option. For Cassandra resources, choose the All rows, First N rows, or Limit N rows sampling option. |
Exclude Views | Choose the option if you do not want the profiling scanner to scan the views in relational data sources. Note: For Cassandra resources, enable the Exclude Views option because data discovery is not supported on views. |
Incremental Profiling | Choose the option to run the profile only on the changes made to the data source. If you do not select this option, the profile runs on the entire data source. For information about resources that support incremental profiling, see the Basic Profile Settings topic. |
Data Profile Filter | You can include or exclude tables and views from the profile run. Use semicolons (;) to separate the table names and view names. For more information about the filter field, see Source Metadata and Data Profile Filter. |
Cumulative | Choose this option to retain the column profile and column similarity results from the previous scan in the next resource scan results. If you do not choose this option, the previous profile results are purged. For information about how to use this option with the Data Profile Filter field and the Incremental Profiling option, see the Basic Profile Settings topic. |
Source Connection Name | Choose a source connection to run data discovery. Note: This parameter is optional for a File System resource. |
Run On | Choose one of the following run-time environments: Blaze (runs the profile in the Hadoop environment on the Blaze engine); Spark (runs the profile in the Hadoop environment on the Spark engine); Databricks (runs the profile in the Hadoop environment on the Spark engine in the Databricks cluster); Native (runs the profile on the same machine where the Data Integration Service runs). For Blaze, Spark, and Databricks, click Select..., and choose a Hadoop connection name in the Select Hadoop Connection Name dialog box. |
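The sampling restrictions and the semicolon-separated filter format from the table can be captured in a short lookup. This is a minimal sketch with hypothetical function names. It assumes that resource types not named in the notes accept every sampling option, which the table does not state explicitly. The filter helper only splits names on semicolons and does not cover any additional syntax described in the Source Metadata and Data Profile Filter topic.

```python
from typing import List, Set

ALL_SAMPLING_OPTIONS = {
    "All rows",
    "Auto Random rows",
    "Random N rows",
    "First N rows",
    "Limit N rows",
    "Random Percentage",
}

# Restrictions taken from the notes in the Sampling Option row.
RESTRICTED_SAMPLING = {
    "Hive": {"All rows", "First N rows"},
    "XML": {"All rows"},
    "JSON": {"All rows"},
    "Avro": {"All rows"},
    "Parquet": {"All rows"},
    "Cassandra": {"All rows", "First N rows", "Limit N rows"},
}


def allowed_sampling_options(resource_type: str) -> Set[str]:
    """Return the sampling options permitted for a resource type."""
    # Assumption: unlisted resource types accept every option.
    return set(RESTRICTED_SAMPLING.get(resource_type, ALL_SAMPLING_OPTIONS))


def split_profile_filter(filter_text: str) -> List[str]:
    """Split a Data Profile Filter value into table and view names."""
    return [name.strip() for name in filter_text.split(";") if name.strip()]


if __name__ == "__main__":
    print(sorted(allowed_sampling_options("Hive")))
    print(split_profile_filter("CUSTOMERS; ORDERS;V_SALES"))
```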
Configure the following properties when you choose the Data Domain Discovery or the Column Profile and Data Domain Discovery option:
Property | Description |
---|
Data Domain Discovery Type | Choose one of the following data domain discovery types: Run Discovery on Source Data; Run Discovery on Source Metadata; Run Discovery on both Source Metadata and Data; Run Discovery on Source Data Where Metadata Matches. |
Select Data Domain | Choose one of the following data domain options: All Data Domains; Specific Data Domain Groups (in the Data Domain Groups field, choose one or more data domain groups in the Data Domain Groups dialog box); Specific Data Domains (in the Data Domains field, choose one or more data domains in the Data Domains dialog box). |
Use Conformance from | Choose one of the following values: |
Data Domain Match Criteria | Choose one of the following values: |
Exclude Null Values from Data Domain Discovery | Select the option to exclude the null values in the data source when you run data domain discovery. |
Similarity Profile and Value Frequency Settings
Configure the column similarity properties to identify similar columns and value frequency in the resource.
The following table describes the properties that you can configure in the Metadata Load Settings > Similarity Profile Data Preparation and Value Frequency Settings tab:
Property | Description |
---|
Run Similarity Profile | Choose Yes or No. If you choose Yes, the profiling scanner scans the data source and prepares data to perform the following tasks: (1) Discover similar columns. The algorithm discovers similar columns in the resource based on column names, column patterns, and unique values. (2) Identify business terms. The algorithm identifies and recommends the business terms for a column based on the accepted data domains and similar columns. |
Save Source Data | Choose Yes or No. Yes persists the computed information about similar columns, column patterns, and unique values in the resource in the PostgreSQL database. |
Sampling Options | Choose one of the following sampling options: Reuse Basic Profile Settings; All Rows; Auto Random Rows; Random N Rows; First N Rows. |
Domain Connection Settings | Choose one of the following options: Use Profile Configuration Settings; Specify Domain Connection Settings. For information about the domain connection settings properties, see the Domain Connection Settings section. |
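To give a feel for what comparing columns by unique values can mean, the following sketch scores two columns with the Jaccard index of their distinct values. This is an illustrative stand-in and not the algorithm that the profiling scanner uses. The function name and the sample data are hypothetical, and the sketch ignores column names and column patterns, which the table also lists as inputs.

```python
from typing import Hashable, Iterable


def unique_value_similarity(
    column_a: Iterable[Hashable], column_b: Iterable[Hashable]
) -> float:
    """Jaccard index of the distinct values in two columns (0.0 to 1.0)."""
    set_a, set_b = set(column_a), set(column_b)
    if not set_a and not set_b:
        # Two empty columns carry no evidence of similarity.
        return 0.0
    return len(set_a & set_b) / len(set_a | set_b)


if __name__ == "__main__":
    country_code = ["US", "DE", "FR", "US"]
    ship_country = ["DE", "FR", "IT"]
    print(unique_value_similarity(country_code, ship_country))
```

In this example the two columns share two of four distinct values, so the score is 0.5.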
Unique Key Inference Settings
You can configure the following settings to generate unique key candidates from selected columns:
Property | Description |
---|
Run Unique Key Inference | Choose Yes or No. If you choose Yes, the profiling scanner scans the data source and infers unique keys from it. |
Null Threshold % in Unique Key Inference | The threshold for null values in unique key inference. You can enter a value between 0 and 1. |
Skip Unique Key Inference When Accepted or Documented Unique Key Exists | Choose Yes or No. If you choose Yes, the profiling service skips unique key inference for columns with documented or accepted keys. |
Unique Key Sampling Option | Choose one of the following options: All Rows (uses all the rows in the data object for unique key inference) or First N Rows (uses only the first N rows in the data object for unique key inference). |
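The interaction between the null threshold and the sampling option can be illustrated with a toy check on a single column. This sketch rests on an assumption: it reads the null threshold as the maximum fraction of null values (between 0 and 1) that a column can contain and still be considered a unique key candidate. The table does not define the threshold in that much detail, and the function name and parameters are hypothetical.

```python
from typing import Optional, Sequence


def is_unique_key_candidate(
    values: Sequence[object],
    null_threshold: float,
    first_n: Optional[int] = None,
) -> bool:
    """Check one column against an assumed null-threshold rule."""
    if not 0 <= null_threshold <= 1:
        raise ValueError("The null threshold must be between 0 and 1.")
    # first_n=None models the All Rows option; a number models First N Rows.
    sample = list(values) if first_n is None else list(values)[:first_n]
    if not sample:
        return False
    non_null = [value for value in sample if value is not None]
    null_ratio = (len(sample) - len(non_null)) / len(sample)
    if null_ratio > null_threshold:
        return False
    # Assumption: a candidate key has no repeated non-null values.
    return len(set(non_null)) == len(non_null)


if __name__ == "__main__":
    order_ids = [101, 102, 103, None]
    print(is_unique_key_candidate(order_ids, null_threshold=0.25))
    print(is_unique_key_candidate(order_ids, null_threshold=0.1))
```

Under this reading, the same column passes with a threshold of 0.25 and fails with a threshold of 0.1, because one of its four sampled values is null.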
See Unique Key Inference Settings for the list of supported resources and file types.