Create a data element classification with or without inclusion rules. Rule-based data classifications are used to classify data based on matching patterns or column names. A data classification without rules is used to organize or label data into categories specific to your organization.
To create and manage data classifications, ensure that you define appropriate roles and select the Manage Data Classifications feature for that role when configuring privileges for the Metadata Command Center service in Informatica Intelligent Cloud Services Administrator. For more information about feature privileges that the organization administrator can configure for user roles, see the Introduction and Getting Started help.
Create a rule-based data element classification to automate the classification of data. You can use a Spark SQL-based expression language to create inclusion rules in the expression editor in Metadata Command Center. You can also choose from more than 200 predefined rule-based data element classifications that Metadata Command Center provides by default.
If you create a data element classification in Metadata Command Center without any inclusion rules, then you can manually associate the data classification with data elements in Data Governance and Catalog after the metadata is ingested into the catalog. For more information about manually associating data classifications, see Working With Assets in the Cloud Data Governance and Catalog help.
To create a data element classification, perform the following steps:
1. In Metadata Command Center, click New.
2. In the New dialog box, select Data Classification from the list of asset types in the left pane.
3. Select Data Element Classification, and click Create.
The New Data Element Classification window appears.
4. On the General Information tab, enter a name for the data classification. Optionally, enter a description.
To create a rule-based data element classification, proceed to the next step. To create a data classification without inclusion rules, go to step 7.
5. In the Sensitivity area, configure data classification sensitivity levels to indicate how sensitive the classified data is. You can view the sensitivity level labels associated with data elements in Data Governance and Catalog. You can select from the following sensitivity levels:
- None. Use this option if data is not sensitive. For example, unrestricted and widely accessible data.
- Low. Use this option if data is public. For example, public website content and company contact information.
- Medium. Use this option if data is internal. For example, emails and documents with no confidential data.
- High. Use this option if data is confidential. For example, financial records, biometric data, medical data, intellectual property, and authentication data.
Default value is None.
Note: If some of the sensitivity levels that are mentioned in this help differ from what is displayed on the Metadata Command Center interface, contact your administrator to understand the sensitivity levels defined for your organization. For more information, see the Metadata Command Center help.
6. Click Next to open the Qualifier tab.
7. In the Inclusion Rule section of the Qualifier tab, construct a data classification inclusion rule using expressions in basic or advanced mode.
You can use a combination of Attributes, Operators, Built-in Functions, Lookup Tables, or Constants to define a data classification inclusion rule. In Advanced mode, you can type expressions directly in the classification editor and see autocomplete suggestions as you type. For more information about data classification rules and examples of rules, see the topics that follow.
Note: A data classification expression cannot exceed 5000 characters.
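For example, a minimal inclusion rule that matches columns whose names contain the word EMAIL and whose frequent values look like email addresses might resemble the following sketch. The regular expression and the 0.8f conformance threshold are illustrative assumptions, not recommended values:
UPPER(NAME) LIKE '%EMAIL%' AND (size(filter(FREQUENT_VALUES, v -> v RLIKE '^[^@\s]+@[^@\s]+\.[^@\s]+$')) / size(FREQUENT_VALUES)) >= 0.8f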
8. Click Validate to validate your expression.
If the validation is successful, a success message appears.
9. Click Save.
On the Explore page, you can view all the saved data element classifications sorted by their type.
After you create a data classification, you can perform one of the following actions:
•For rule-based data classifications, enable the data classification capability for the catalog source and add the data classification to the catalog source configuration. During the catalog source run, the inclusion rules are used to classify the metadata into meaningful categories based on matching column names and column content patterns. You can view the data classification results in Data Governance and Catalog.
•For both rule-based data classifications and data classifications without rules, manually associate the published data classifications with data elements in Data Governance and Catalog after the metadata is ingested into the catalog. The data elements are manually classified or labeled based on the associated data classification.
Data element classification inclusion rule
Apply a data element classification to a data element by creating inclusion rules. You can create inclusion rules using the metadata that is extracted from the source and the data facts collected through data profiling. Data element classification is, therefore, independent of the source type. If the data profiling capability is not enabled on the catalog source, you can create and use metadata-based expressions only.
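For example, the following metadata-only rule is a minimal sketch that relies on the column name attribute alone and does not require any profiling statistics:
UPPER(NAME) LIKE '%SALARY%'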
The data classification expressions are created using a Spark SQL-based language. You can construct a data element classification rule using the following components:
•Attributes: Attributes are column values that you obtain from the extracted metadata or from the statistics collected through data profiling. Metadata-based attributes are column name, column comment, parent name, and parent comment. Statistics-based attributes include the number of profiled values in a column, frequent values in a column, the average of values profiled for a column, and other such attributes.
•Operators: Operators are used to compare values of columns. For example, you can use an equality operator to check if the names of the two columns are the same.
Note: When you use standard comparison operators such as >, >=, =, <, or <= to compare values of columns that contain NULL or unknown values, the result is NULL if any of the compared values is NULL or unknown.
•Built-in Functions: Built-in functions are used to calculate values and manipulate data. For example, you can change a name to all uppercase or lowercase with the Upper or Lower functions. Other supported functions include, but are not limited to, Trim, Size, Length, Substring, and Forall.
•Lookup Tables: When there is a finite set of values, use lookup tables that you have imported and published to check whether the values in the column appear in the lookup table. You can look up any attribute in the column of the lookup table. For example, the expression NAME IN LOOKUP_TABLE.REFCOLUMN looks for the name of the column in the REFCOLUMN column of the lookup table. You can either import and publish a predefined lookup table that Metadata Command Center provides by default or import your own lookup table. For a sketch that combines a lookup table with other components, see the example after this list.
•Constants: A constant contains a value that doesn't change during the execution of the expression. Constants are used in data classification rules to store lengthy values to improve the readability of complex classification expressions. To use a constant in a data classification rule, you must define a constant by specifying a name and a value for the constant. The scope of a constant is local, not global. This means that the constant that you create within a data classification rule can be used only within that rule. To define a constant within a classification rule, see Defining a constant for a data element classification rule.
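The following sketch combines several of these components: attributes (NAME, FREQUENT_VALUES), operators (AND, OR, LIKE, IN, RLIKE), built-in functions (UPPER, LOWER, Size, Filter), and a lookup table. The lookup table lkp_zip_cols and its column header_names are hypothetical placeholders for a lookup table that you import and publish yourself, and the 0.8f conformance threshold is an assumption:
(UPPER(NAME) LIKE '%ZIP%' OR LOWER(NAME) IN lkp_zip_cols.header_names) AND (size(filter(FREQUENT_VALUES, v -> v RLIKE '^[0-9]{5}(-[0-9]{4})?$')) / size(FREQUENT_VALUES)) >= 0.8f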
Defining a constant for a data element classification rule
Use constants in data element classification rules to store lengthy values to improve the readability of complex data classification expressions.
Before you use a constant in a rule, you must define the constant. The scope of a constant is local, not global. This means that the constant that you create within a data classification rule can be used only within that rule.
To define a constant for a data element classification rule, perform the following steps:
1. In the Inclusion Rule section of the Qualifier tab, toggle to Advanced mode.
2. From the list at the top, select Constants.
3. Click the Add Constant icon. The Add Constant dialog box appears.
4. Enter a name for the constant.
The name must not exceed 31 characters, must start with a letter, and can contain letters, digits, and underscores.
5. In the Value field, enter a value for the constant. The value cannot exceed 1000 characters.
6. Optionally, enter a description.
7. Click OK to save the constant.
Example: Adding a constant for a data classification rule that validates credit card numbers
Consider the following data classification rule that validates all major credit cards:
(UPPER(NAME) LIKE '%CARD%NUMBER%' OR UPPER(NAME) LIKE '%CC%NUM%' OR LOWER(NAME) IN lkp_ccn_col.header_col_names) AND (size(filter(FREQUENT_VALUES, v -> REGEXP_REPLACE(v,'-|\s','') RLIKE '^3[47][0-9]{13}$|^(5[1-5][0-9]{14}|2(22[1-9][0-9]{12}|2[3-9][0-9]{13}|[3-6][0-9]{14}|7[0-1][0-9]{13}|720[0-9]{12}))$|^4[0-9]{12}(?:[0-9]{3})?$|^6(?:011\d{12}|5\d{14}|4[4-9]\d{13}|22(?:1(?:2[6-9]|[3-9]\d)|[2-8]\d{2}|9(?:[01]\d|2[0-5]))\d{10})$')) / size(FREQUENT_VALUES)) >= 0.8f
To improve the readability of this lengthy rule, let us define a constant called CCN_EXP, and assign the expression for credit card number patterns as the value to the constant in the following manner:
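A minimal sketch of what the constant definition might contain, reusing the pattern exactly as it appears in the rule above (exact quoting depends on how the expression editor substitutes constant values), is:
Name: CCN_EXP
Value: '^3[47][0-9]{13}$|^(5[1-5][0-9]{14}|2(22[1-9][0-9]{12}|2[3-9][0-9]{13}|[3-6][0-9]{14}|7[0-1][0-9]{13}|720[0-9]{12}))$|^4[0-9]{12}(?:[0-9]{3})?$|^6(?:011\d{12}|5\d{14}|4[4-9]\d{13}|22(?:1(?:2[6-9]|[3-9]\d)|[2-8]\d{2}|9(?:[01]\d|2[0-5]))\d{10})$'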
By using the constant CCN_EXP, the data classification rule mentioned above can be rewritten as follows to reduce the length significantly:
(UPPER(NAME) LIKE '%CARD%NUMBER%' OR UPPER(NAME) LIKE '%CC%NUM%' OR LOWER(NAME) IN lkp_ccn_col.header_col_names) AND (size(filter(FREQUENT_VALUES, v -> REGEXP_REPLACE(v,'-|\s','') RLIKE $CCN_EXP)) / size(FREQUENT_VALUES)) >= 0.8f
Example: Classify a column in a table as CUSIP numbers
CUSIP (Committee on Uniform Securities Identification Procedures) numbers identify North American securities and are usually 9 characters long. For example, a CUSIP number can be 3 9 2 6 9 0 Q T 3. Let us construct an expression that classifies a column in a table as CUSIP numbers. The expression checks whether the column name contains the word 'cusip' and whether all the frequently occurring values in the columns of the data set follow the specified pattern. Depending on the matches, the expression classifies the columns in the data set as 'cusip_number'. We can construct the expression in the following way:
LOWER(NAME) LIKE '%cusip%' AND forall(frequent_values, v -> v == NULL OR LOWER(v) RLIKE '[0-9]{3}[0-9a-zA-Z]{3} [0-9a-zA-Z]{2} [0-9]' OR LOWER(v) RLIKE '[0-9]{3}[0-9a-zA-Z]{3}-[0-9a-zA-Z]{2}-[0-9]' OR LOWER(v) RLIKE '[0-9]{3}[0-9a-zA-Z]{3}[0-9a-zA-Z]{2}[0-9]')
We use attributes (NAME, FREQUENT_VALUES), operators (AND, OR), and built-in functions (LOWER, FORALL) to construct the above expression. Let us break down the expression to understand the function that each phrase performs:
•LOWER(NAME) LIKE '%cusip%': This phrase converts the column name to lowercase and checks whether it contains the word 'cusip'.
•FORALL(FREQUENT_VALUES, v -> v == NULL: The FORALL function is used with FREQUENT_VALUES to evaluate the expression for all the frequently occurring values in the columns of the data set. The v == NULL condition checks for NULL values in the frequent values attribute.
•LOWER(v) RLIKE '[0-9]{3}[0-9a-zA-Z]{3} [0-9a-zA-Z]{2} [0-9]' OR LOWER(v) RLIKE '[0-9]{3}[0-9a-zA-Z]{3}-[0-9a-zA-Z]{2}-[0-9]' OR LOWER(v) RLIKE '[0-9]{3}[0-9a-zA-Z]{3}[0-9a-zA-Z]{2}[0-9]': This phrase defines the patterns of the CUSIP number to check against the frequently occurring values in the column.
Example: Frequent values in data classification
The FREQUENT_VALUES attribute is an advanced option that enables you to determine the most frequent column values. The results depend on the sampling type that you select when you configure a catalog source. You select the sampling type on the Data Profiling and Quality tab. Use the attribute in an inclusion rule when you configure Data Classification.
The FREQUENT_VALUES attribute includes all distinct values and displays them in order of frequency in the profile results.
If you use the FREQUENT_VALUES attribute in a data element classification rule, the rule fetches all distinct records from the available values. Based on these records, the frequency percentage is calculated and appears in the profiling results. You can construct rules that use percentage values and conformance percentages.
For example, the rule can have the following structure: FREQUENT_VALUES = [USA, UK, INDIA, CHINA, RUSSIA, CANADA, BRAZIL, CHILE].
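To check conformance against such a list, you typically compare the fraction of frequent values that match against a threshold. The following rule is an illustrative sketch that reuses the country values above; the pattern and the 0.8f threshold are assumptions to adjust for your data:
(size(filter(FREQUENT_VALUES, v -> UPPER(v) RLIKE '^(USA|UK|INDIA|CHINA|RUSSIA|CANADA|BRAZIL|CHILE)$')) / size(FREQUENT_VALUES)) >= 0.8f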
The following table contains sample values that the FREQUENT_VALUES attribute could apply to: