Based on your requirements, configure the options to determine the type of data that you want the data profiling task to collect, the scope of the profile run, and the sample rows on which you want to run the data profiling task.
After you enable Data Profiling on the Configuration wizard while creating a catalog source, you can configure the following options on the Data Profiling tab:
Profile filters
You can use filter conditions to reduce the scope of a data profiling task. Profile filters create subsets of the extracted metadata on which you run a data profiling task on a catalog source. For example, you might extract metadata from six schemas but run the profiling task on only three of them.
You can add multiple filter conditions for each catalog source. Any user with the create, edit, or run catalog source permissions can configure the profile filters. You can apply filters on all relational and file system catalog sources that support profiling capabilities.
Profile filters operate on the metadata that the extraction filters produce. For example, your source includes a schema named Cust_Emp with tables such as Customer1, Customer2, Employee1, and Employee2. If you applied a metadata extraction filter that includes only the Customer1, Customer2, and Employee1 tables, a profile filter can select only from those three tables when you run profiles.
The following are types of profile filters:
•Include Metadata. Runs a profile on the metadata extracted from the values that you specify in the filter condition.
•Exclude Metadata. Does not run a profile on the metadata extracted from the values that you specify in the filter condition.
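The include and exclude behavior described above can be sketched as follows. This is an illustrative sketch only: the function name, the wildcard matching, and the fully qualified object names are assumptions for the example, not the product's actual filter syntax.

```python
# Sketch: how include/exclude profile filters narrow the set of extracted
# objects down to the objects that are actually profiled.
import fnmatch

def apply_profile_filters(extracted, include=None, exclude=None):
    """Return the objects to profile from the already-extracted metadata."""
    objects = list(extracted)
    if include:  # Include Metadata: keep only objects matching a condition
        objects = [o for o in objects
                   if any(fnmatch.fnmatch(o, p) for p in include)]
    if exclude:  # Exclude Metadata: drop objects matching a condition
        objects = [o for o in objects
                   if not any(fnmatch.fnmatch(o, p) for p in exclude)]
    return objects

extracted = ["Cust_Emp.Customer1", "Cust_Emp.Customer2", "Cust_Emp.Employee1"]
print(apply_profile_filters(extracted, include=["Cust_Emp.Customer*"]))
# ['Cust_Emp.Customer1', 'Cust_Emp.Customer2']
```

Note that the filters only choose among objects that were already extracted, which mirrors the dependency on metadata extraction filters described above.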
Profile filter examples for relational catalog sources
•The following profile filter runs profiles on the metadata extracted from the Asia_Pacific table located in the Customers schema of an Oracle source system:
•The following profile filter does not run profiles on the metadata extracted from the Finance table located in the EMPLOYEE schema of an Oracle source system:
•The following profile filter runs profiles on the metadata extracted from the external tables object type using an asterisk wildcard for an Amazon Athena source system:
•The following profile filter does not run profiles on the metadata extracted from the external tables object type using an asterisk wildcard for an Amazon Athena source system:
Profile filter examples for file system catalog sources
•The following profile filter runs profiles on the metadata extracted from the JAN2023 folder:
•The following profile filter does not run profiles on the metadata extracted from the CUSTOMER.CSV file:
Runtime environment
Select a runtime environment in which you can run data profiling tasks on a Secure Agent. If you don't select a runtime environment, the data profiling task runs in the runtime environment that your organization administrator selected when they created the connection.
Note: You can run data profiling and data quality tasks on a Windows Secure Agent configured with NTLMv2 proxy authentication.
Mode of run
To determine the type of data that you want the data profiling task to collect, choose Keep Signatures Only or Keep Signatures and Values. If you choose Keep Signatures Only, the data profiling task collects only aggregate information such as data types, average, standard deviation, and patterns; no data values are collected. If you want the data profiling task to collect both signatures and data values, choose Keep Signatures and Values. Data values include minimum value, maximum value, frequent values, and more.
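The distinction between the two modes can be illustrated with a small sketch. The function and field names are assumptions for the example; the point is that signatures are aggregates that expose no row values, while the second mode additionally records actual data values.

```python
# Sketch: "Keep Signatures Only" vs "Keep Signatures and Values".
import statistics
from collections import Counter

def profile_column(values, keep_values=False):
    result = {  # signature-only aggregates; no row values are retained
        "count": len(values),
        "avg": statistics.mean(values),
        "stddev": statistics.pstdev(values),
    }
    if keep_values:  # Keep Signatures and Values also records real values
        result.update({
            "min": min(values),
            "max": max(values),
            "frequent": Counter(values).most_common(1)[0][0],
        })
    return result

print(profile_column([10, 20, 20, 30]))                    # signatures only
print(profile_column([10, 20, 20, 30], keep_values=True))  # adds min/max/frequent
```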
Profile run scope
Determine whether you want to run data profiling only on the changes made to the source system or on the entire source system. Choose one of the following options:
•Full. Run data profiling on the entire metadata that is extracted based on the filters applied for extraction.
•Incremental. Run the profile only on the changed or updated metadata in the source system.
When you enable incremental profiling for a catalog source and run data profiling for the first time, the data profiling capability profiles the entire metadata based on the filters applied for metadata extraction. Subsequently, incremental profiling discovers any change to the metadata when the metadata extraction capability is enabled and runs profiling only on those changes. For example, if you rename a column after the first run and enable incremental profiling in the subsequent run, only the table containing the renamed column is profiled.
Incremental profiling discovers the following changes to the metadata during the data profiling run:
•Addition of a new object
•Renaming of a column
•Objects that were not profiled in the previous run because of a data profiling task failure or a change in the metadata extraction filter
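The change detection above can be sketched as a comparison between the previous and current metadata snapshots. This is a simplified illustration under assumed names; it maps each object to a schema fingerprint and re-profiles only objects that are new, changed, or previously failed.

```python
# Sketch: incremental profiling profiles only new, changed, or previously
# failed objects, instead of the entire extracted metadata.
def objects_to_profile(previous, current, failed_last_run=()):
    """previous/current map object name -> schema fingerprint."""
    new_or_changed = {name for name, fingerprint in current.items()
                      if previous.get(name) != fingerprint}
    # Also retry objects that exist now but were not profiled last run.
    return sorted(new_or_changed | (set(failed_last_run) & set(current)))

previous = {"Customer1": "id,name", "Employee1": "id,dept"}
current  = {"Customer1": "id,full_name",  # renamed column -> re-profiled
            "Employee1": "id,dept",       # unchanged -> skipped
            "Orders": "id,total"}         # new object -> profiled
print(objects_to_profile(previous, current))  # ['Customer1', 'Orders']
```

On the first run there is no previous snapshot, so every extracted object counts as new, which matches the full-profile behavior of the first incremental run described above.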
Connection
Select the SAP Table connection to run data profiling tasks on SAP ERP objects.
Sampling type
Determine the sample rows on which you want to run the data profiling task. The sampling options vary based on the catalog source that you create. Choose one of the following options:
•All Rows. Runs the profile on all rows in the metadata.
•Limit N Rows. Runs the profile on a limited number of rows that you specify.
•Random N Rows. Runs the profile on the selected number of random rows.
•Random N Percentage. Runs the profile on a random sample of rows based on the percentage of data that you specify in the Percentage of Data to Sample field. Google BigQuery tables are organized as data blocks. For example, you can specify 5 percent of the data blocks in a table to run the profile on.
•Custom Query. Provide a custom SQL clause to select the sample rows on which to run the data profiling task.
In the Sampling Query field, enter the custom SQL clause to choose sample rows on which you want to run the data profiling task. Verify that the syntax of the SQL clause matches the syntax of the database that you are connecting to.
Examples:
•If you're using the JDBC catalog source to connect to IBM DB2 and the sampling query is FETCH FIRST 50 ROWS ONLY, then the query runs the profile only on the first 50 rows.
•If you enter employeeName like '%a%' for a Salesforce catalog source, then the query selects rows that contain 'a' in the employeeName column.
Note: For the Salesforce catalog source, you cannot use the LIMIT clause in a sampling query. For example, you cannot use Name like '%a%' LIMIT 10.
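The four sampling options can be summarized with a small sketch. The mode names and function are assumptions for illustration, with a local list standing in for the source rows; the actual sampling is performed by the profiling engine against the source.

```python
# Sketch: the four sampling strategies applied to an in-memory row set.
import random

def sample_rows(rows, mode="all", n=None, percentage=None, seed=0):
    if mode == "all":          # All Rows
        return list(rows)
    if mode == "limit":        # Limit N Rows: the first n rows
        return list(rows)[:n]
    rng = random.Random(seed)  # seeded here only for reproducibility
    if mode == "random_n":     # Random N Rows
        return rng.sample(list(rows), min(n, len(rows)))
    if mode == "random_pct":   # Random N Percentage
        k = round(len(rows) * percentage / 100)
        return rng.sample(list(rows), k)
    raise ValueError(f"unknown mode: {mode}")

rows = list(range(100))
print(len(sample_rows(rows, "limit", n=50)))               # 50
print(len(sample_rows(rows, "random_pct", percentage=5)))  # 5
```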
Elastic runtime environment
Select a runtime environment in which you can run data profiling tasks on an advanced cluster. Select an elastic runtime environment for complex file types, including AVRO and Parquet.
Note: This option is available when you configure data profiling for Amazon S3, Google Cloud Storage, and Microsoft Azure Data Lake Storage Gen2 catalog sources.
To run a profile on an Avro or Parquet file, connect to one of the following types of advanced clusters in your organization:
•Fully-managed cluster. A multi-node serverless infrastructure that intelligently scales based on your workload and offers the lowest total cost of ownership for your organization. For more information, see Fully-managed clusters.
•Local cluster. A single-node cluster that you can start on the Secure Agent machine. You can use a local cluster to quickly onboard projects for advanced use cases. For more information, see Local clusters.
For more information about setting up AWS, Google Cloud, and Microsoft Azure for local and fully-managed clusters, see Advanced clusters in the Administrator documentation.
Staging connection
Applicable only to elastic profile runs, that is, profiles on Parquet and Avro sources in Amazon S3, Microsoft Azure Data Lake Storage Gen2, and Google Cloud Storage source systems.
The staging connection is the connection where profiling results are temporarily stored while the job runs.
Maximum precision of string fields
The maximum precision value for profiles on string data types. Enter a value between 1 and 255. Default is 50.
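The effect of the precision limit can be illustrated as a simple truncation. This is a sketch of the concept only; the variable names and the exact point at which truncation is applied are assumptions.

```python
# Sketch: string values longer than the configured maximum precision are
# truncated before the profile aggregates them.
MAX_PRECISION = 50  # default value

value = "x" * 80                    # an 80-character string value
profiled = value[:MAX_PRECISION]    # only the first 50 characters are kept
print(len(profiled))                # 50
```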
Text qualifier
The character that defines string boundaries. If you select a quote character, the profile ignores delimiters within the quotes. Select a qualifier from the list. Default is Double Quote.
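To see why the text qualifier matters, consider a delimited row whose value contains the delimiter. With a Double Quote qualifier, the comma inside the quotes is read as part of the value, not as a field separator. The sketch below uses Python's standard csv module for illustration.

```python
# Sketch: with a double-quote text qualifier, a delimiter inside quotes is
# part of the value rather than a field boundary.
import csv
import io

line = '1,"Smith, John",Sales\n'
fields = next(csv.reader(io.StringIO(line), quotechar='"'))
print(fields)  # ['1', 'Smith, John', 'Sales'] - the quoted comma is kept
```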
Code page for delimited files
Select a code page that the Secure Agent can use to read and write data. Use this option to ensure that profile results for assets with non-English characters don't include junk characters. Default value is UTF-8.
Choose one of the following options:
•MS Windows Latin 1. Select for ISO 8859-1 Western European characters.
•UTF-8. Select for Unicode and non-Unicode characters.
•Shift-JIS. Select for double-byte characters.
•ISO 8859-15 Latin 9 (Western European).
•ISO 8859-2 Eastern European.
•ISO 8859-3 Southeast European.
•ISO 8859-5 Cyrillic.
•ISO 8859-9 Latin 5 (Turkish).
•IBM EBCDIC International Latin-1.
Note: This option is available when you configure data profiling for the following catalog sources:
•Amazon S3
•Google Cloud Storage
•Microsoft Azure Data Lake Storage Gen2
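The junk characters mentioned above come from decoding bytes with the wrong code page. The following sketch shows the same UTF-8 bytes decoded correctly and then misread as Latin-1:

```python
# Sketch: the same bytes decode differently under different code pages; a
# mismatch between the file's encoding and the selected code page produces
# junk characters in the profile results.
data = "Müller".encode("utf-8")   # file written as UTF-8
print(data.decode("utf-8"))       # Müller - correct code page
print(data.decode("latin-1"))     # MÃ¼ller - wrong code page, junk characters
```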
Escape character for delimited files
You can specify an escape character if you need to override the default escape character. An escape character causes the profile to ignore a delimiter character in an unquoted string when the delimiter is part of the string value.
If you specify an escape character, the data profiling task uses it instead of the default escape character that the Metadata Extraction job detects, and reads the escaped delimiter character as part of the string value. If you don't specify an escape character, the data profiling task uses the default escape character that the Metadata Extraction job detects and reads the delimiter character as part of the string value.
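The effect of an escape character on an unquoted string can be shown with a short sketch, again using Python's standard csv module for illustration:

```python
# Sketch: with backslash as the escape character, an escaped delimiter in an
# unquoted string is read as part of the value instead of splitting the field.
import csv
import io

line = r'1,Smith\, John,Sales' + '\n'
fields = next(csv.reader(io.StringIO(line),
                         escapechar='\\', quoting=csv.QUOTE_NONE))
print(fields)  # ['1', 'Smith, John', 'Sales'] - escaped comma kept in value
```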
Note: This option is available when you configure data profiling for the following catalog sources: