Based on your requirements, configure the options to determine the type of data that you want the data quality task to collect, the scope of the data quality run, and the sample rows on which you want to run the data quality task.
After you enable Data Quality on the Configuration wizard while creating a catalog source, you can configure the following options on the Data Profiling and Quality tab:
Runtime environment
Select a runtime environment in which you can run data quality tasks on a Secure Agent. If you don't select a runtime environment, the data quality task runs in the runtime environment that your organization administrator selected when they created the connection.
Note: You can run data profiling and data quality tasks on a Windows Secure Agent configured with NTLMv2 proxy authentication.
Data Quality Automation
Select to enable data quality automation for assets in the catalog source. When you enable data quality automation and run the catalog source job in Metadata Command Center, a data quality automation job is triggered, and rule occurrences are automatically created and associated with all data elements that are linked to corresponding glossary business assets in Data Governance and Catalog.
Choose one of the following options:
•Apply on data elements linked with business data set. Creates rule occurrences for all data elements that are linked with business data sets in the catalog source.
•Apply on all data elements. Creates rule occurrences for all data elements in the catalog source.
The following table describes different options that influence the data quality automation process:
| isAutomated option on rule templates in Data Governance and Catalog | Data quality option in Metadata Command Center | Data quality automation option in Metadata Command Center | Result |
| Yes | Yes | Yes | Creates rule occurrences for all data elements that are associated with glossary business assets. |
| Yes | Yes | No | Does not create new rule occurrences for data elements or update existing rule occurrences. Does not affect the execution of the existing rule occurrences in Data Governance and Catalog. |
| Yes | No | Not applicable | Does not create any rule occurrences for data elements. Data quality execution stops for existing rule occurrences that are associated with assets of the catalog source. |
| No | Yes | Yes | Does not create rule occurrences for data elements. Does not affect the execution of the existing rule occurrences in Data Governance and Catalog. |
For more information about data quality automation, see the Asset Details in the Data Governance and Catalog help system.
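The decision table above can be sketched as a small function. This is an illustrative sketch only; the parameter names and return values are hypothetical, not product APIs.

```python
# Hypothetical sketch of the data quality automation decision table;
# names and return strings are illustrative, not actual product APIs.
def automation_outcome(is_automated: bool, data_quality: bool,
                       dq_automation: bool) -> str:
    """Return what happens to rule occurrences for a catalog source."""
    if is_automated and data_quality and dq_automation:
        return "create"          # rule occurrences created for linked data elements
    if is_automated and not data_quality:
        return "stop-execution"  # existing rule occurrences stop running
    # Remaining rows: nothing new is created; existing occurrences still run.
    return "no-change"
```

For example, `automation_outcome(True, True, False)` returns `"no-change"`, matching the second row of the table.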
Cache Result
Specify how you want to preview the rule occurrence results in Data Governance and Catalog.
Choose one of the following options:
•Agent Cache. Generates a cache file in the runtime environment so that subsequent data preview runs in Data Governance and Catalog return cached results faster. By default, the results are cached for seven days after you run the data preview task in the runtime environment for the first time. You can also customize the number of days to retain the preview results.
To customize the retention period, update the mps_previewFileRetentionInDays property in the System Configuration Details section of the Metadata Platform Service in Administrator. For more information about Metadata Platform Service properties, see the Secure Agent Services in the Cloud Common Services help system.
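For example, assuming the property accepts an integer number of days, a 14-day retention setting would look like this (the value is illustrative):

```
mps_previewFileRetentionInDays=14
```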
•No Cache. Does not cache the preview results. You can preview the results live in Data Governance and Catalog.
Note: Run the catalog source again whenever you change the Cache Result option from Agent Cache to No Cache.
Connection
Select the SAP Table connection to run data quality tasks on SAP ERP objects.
Run Rule Occurrence Frequency
Specify whether you want to run data quality rules based on the frequency defined for the rule occurrence in Data Governance and Catalog.
Choose one of the following options:
•Yes. Data quality rules run based on the frequency that you configured in Data Governance and Catalog. If you set a data quality schedule for the catalog source, the job doesn't impact the data quality rule occurrence frequency.
•No. Disables the data quality rule occurrence frequency that you configured in Data Governance and Catalog. If you don't set a data quality schedule for the catalog source, the data quality rules don't run.
Note: Ensure that the data quality schedule has not expired. The data quality rules don't run if the data quality schedule that you configured for the catalog source is expired.
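The Yes/No behavior and the schedule-expiry note above can be sketched as a small function. This is an illustrative sketch; the parameter names are hypothetical, not product APIs.

```python
# Hypothetical sketch of the "Run Rule Occurrence Frequency" behavior;
# parameter names are illustrative, not actual product APIs.
def rules_will_run(frequency_enabled: bool, schedule_set: bool,
                   schedule_expired: bool = False) -> bool:
    if frequency_enabled:
        # Rules run on the frequency configured in Data Governance and
        # Catalog; a catalog source schedule doesn't affect that frequency.
        return True
    # With the frequency disabled, rules run only on a valid (non-expired)
    # data quality schedule for the catalog source.
    return schedule_set and not schedule_expired
```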
Sampling type
Determine the sample rows on which you want to run the data quality task. The sampling options vary based on the catalog source that you create.
Choose one of the following options:
•All Rows. Runs data quality on all rows in the metadata.
•Limit N Rows. Runs data quality on a limited number of rows. You can specify the limited number of rows on which you want to run data quality.
•Random N Rows. Runs data quality on the selected number of random rows.
•Random N Percentage. Runs data quality on selected rows based on the percentage of data that you specify in the Percentage of Data to Sample field. Google BigQuery tables are organized as data blocks. For example, you can specify 5 percent of the data blocks from a table to run the profile on.
•Custom Query. Provide a custom SQL clause that selects the sample rows on which to run the data quality task.
In the Sampling Query field, enter the custom SQL clause. Verify that the syntax of the SQL clause matches the syntax of the database that you are connecting to.
Examples:
- If you're using the JDBC catalog source to connect to IBM DB2 and the sampling query is FETCH FIRST 50 ROWS ONLY, the query runs data quality only on the first 50 rows.
- If you enter employeeName like '%a%' for a Salesforce catalog source, the query selects rows that contain 'a' in the employeeName column.
Note: For the Salesforce catalog source, you cannot use the LIMIT clause in a sampling query. For example, you cannot use Name like '%a%' LIMIT 10.
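The sampling modes above can be sketched against an in-memory row list. This is only an illustration of how the modes differ; the actual task samples at the data source, not in Python.

```python
import random

# Illustrative sketch of the sampling modes, applied to a stand-in row list.
rows = list(range(100))  # stand-in for 100 table rows

all_rows = rows                                            # All Rows
limit_n = rows[:50]                                        # Limit N Rows (first N)
random_n = random.sample(rows, 50)                         # Random N Rows
pct = 5
random_pct = random.sample(rows, len(rows) * pct // 100)   # Random N Percentage
```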
Elastic runtime environment
Select a runtime environment in which you can run data quality tasks on an advanced cluster. Select an elastic runtime environment for complex file types, including AVRO and Parquet.
Note: This option is available when you configure data quality for Amazon S3, Google Cloud Storage, and Microsoft Azure Data Lake Storage Gen2 catalog sources.
To run data quality on an Avro or Parquet file, connect to one of the following types of advanced clusters in your organization:
•Fully-managed cluster. A multi-node serverless infrastructure that intelligently scales based on your workload and offers the lowest total cost of ownership for your organization. For more information, see Fully-managed clusters.
•Local cluster. A single-node cluster that you can start on the Secure Agent machine. You can use a local cluster to quickly onboard projects for advanced use cases. For more information, see Local clusters.
For more information about setting up AWS, Google Cloud, and Microsoft Azure for local and fully-managed clusters, see Advanced clusters.
Staging connection
Applies only to elastic data quality runs, that is, to Avro and Parquet sources in Amazon S3, Microsoft Azure Data Lake Storage Gen2, and Google Cloud Storage source systems.
Select the staging connection where data quality results are stored temporarily during the run.
Maximum precision of string fields
The maximum precision value for profiles on columns of the string data type. Enter a value between 1 and 255.
Text qualifier
The character that defines string boundaries. If you select a quote character, the data quality task ignores delimiters within the quotes. Select a qualifier from the list. Default is Double Quote.
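The effect of a text qualifier can be sketched with Python's csv module as a stand-in for the data quality task's parser; the sample data is illustrative.

```python
import csv
import io

# Sketch of text-qualifier handling: with a double-quote qualifier,
# a delimiter inside the quotes is read as part of the field value.
data = 'id,comment\n1,"uses a comma, safely"\n'
rows = list(csv.reader(io.StringIO(data), quotechar='"'))
print(rows[1])  # ['1', 'uses a comma, safely'] — the quoted comma is not a delimiter
```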
Code page for delimited files
Select a code page that the Secure Agent can use to read and write data. Use this option to ensure that rule results for assets with non-English characters don't include junk characters. Default value is UTF-8.
Choose one of the following options:
•MS Windows Latin 1. Select for ISO 8859-1 Western European characters.
•UTF-8. Select for Unicode and non-Unicode characters.
•Shift-JIS. Select for double-byte characters.
•ISO 8859-15 Latin 9 (Western European).
•ISO 8859-2 Eastern European.
•ISO 8859-3 Southeast European.
•ISO 8859-5 Cyrillic.
•ISO 8859-9 Latin 5 (Turkish).
•IBM EBCDIC International Latin-1.
Note: This option is available when you configure data quality for the following catalog sources:
•Amazon S3
•Google Cloud Storage
•Microsoft Azure Data Lake Storage Gen2
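Why the code page matters can be sketched in Python: bytes written in one encoding and read in another produce junk characters. The sample text is illustrative.

```python
# Sketch of code page mismatch: reading bytes with the wrong encoding
# yields junk (replacement) characters in rule results.
text = "Café"
raw = text.encode("cp1252")          # MS Windows Latin 1
assert raw.decode("cp1252") == text  # correct code page: round-trips cleanly
garbled = raw.decode("utf-8", errors="replace")
print(garbled)  # "Caf�" — wrong code page yields a junk character
```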
Escape character for delimited files
Specify an escape character if you need to override the default escape character. An escape character causes the data quality task to treat a delimiter character in an unquoted string as part of the string value rather than as a delimiter.
If you specify an escape character, the data quality task uses it instead of the default escape character that the Metadata Extraction job detects, and reads an escaped delimiter character as part of the string value. If you don't specify an escape character, the data quality task uses the default escape character that the Metadata Extraction job detects.
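Escape-character handling can be sketched with Python's csv module as a stand-in for the data quality task's parser; the backslash escape character and sample data are illustrative.

```python
import csv
import io

# Sketch of escape-character handling: with quoting disabled, a backslash
# escape lets a delimiter appear inside an unquoted field value.
data = 'id,comment\n1,uses a comma\\, safely\n'
rows = list(csv.reader(io.StringIO(data), escapechar='\\',
                       quoting=csv.QUOTE_NONE))
print(rows[1])  # ['1', 'uses a comma, safely'] — the escaped comma stays in the value
```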
Note: This option is available when you configure data quality for the following catalog sources: