Creating a data element classification

Create a data element classification with or without inclusion rules. Rule-based data classifications are used to classify data based on matching patterns or column names. A data classification without rules is used to organize or label data into categories specific to your organization.
To create and manage data classifications, ensure that you define appropriate roles and select the Manage Data Classifications feature for that role when configuring privileges for the Metadata Command Center service in Informatica Intelligent Cloud Services Administrator. For more information about feature privileges that the organization administrator can configure for user roles, see the Introduction and Getting Started help.
Create a rule-based data element classification to automate the classification of data. You can use a Spark SQL-based expression language to create inclusion rules in the expression editor in Metadata Command Center. You can also choose from more than 200 predefined rule-based data element classifications that Metadata Command Center provides by default.
If you create a data element classification in Metadata Command Center without any inclusion rules, then you can manually associate the data classification with data elements in Data Governance and Catalog after the metadata is ingested into the catalog. For more information about manually associating data classifications, see Working With Assets in the Cloud Data Governance and Catalog help.
To create a data element classification, perform the following steps:
    1. In Metadata Command Center, click New.
    2. In the New dialog box, select Data Classification from the list of asset types in the left pane.
    3. Select Data Element Classification, and click Create.
    The New Data Element Classification window appears.
    The image shows the New Data Element Classifications window with the sensitivity level options.
    4. On the General Information tab, enter a name for the data classification. Optionally, enter a description.
    To create a rule-based data element classification, proceed to the next step. To create a data classification without inclusion rules, go to step 7.
    5. In the Sensitivity area, configure data classification sensitivity levels to specify whether a classification is sensitive or not. You can view the sensitivity level labels associated with data elements in Data Governance and Catalog. You can select the following types of sensitivity levels:
    Default value is None.
    Note: If some of the sensitivity levels that are mentioned in this help differ from what is displayed on the Metadata Command Center interface, contact your administrator to understand the sensitivity levels defined for your organization. For more information, see the Metadata Command Center help.
    6. Click Next to open the Qualifier tab.
    7. In the Inclusion Rule section of the Qualifier tab, construct a data classification inclusion rule using expressions in the basic or advanced mode.
    You can use a combination of Attributes, Operators, Built-in Functions, Lookup Tables or Constants to define a data classification inclusion rule. In the Advanced mode, you can type your expressions directly and see autocomplete suggestions as you type your expression in the classification editor. For more information about data classification rules and examples of data classification rules, see the following topics:
    Note: A data classification expression value cannot exceed 5000 characters.
    8. Click Validate to validate your expression.
    If the validation is successful, a success message appears.
    9. Click Save.
    On the Explore page, you can view all the saved data element classifications sorted by their type.
After you create a data classification, you can perform one of the following actions:

Data element classification inclusion rule

Apply a data element classification to a data element by creating inclusion rules. You can create inclusion rules using the metadata that is extracted from the source and the data facts collected through data profiling. Data element classification is, therefore, independent of the source type. If the data profiling capability is not enabled on the catalog source, then you can create and use metadata-based expressions only.
The data classification expressions are created using a Spark SQL-based language. You can construct a data element classification rule using the following components:

Defining a constant for a data element classification rule

Use constants in data element classification rules to store lengthy values to improve the readability of complex data classification expressions.
Before you use a constant in a rule, you must define the constant. The scope of a constant is local, not global. This means that the constant that you create within a data classification rule can be used only within that rule.
To define a constant for a data element classification rule, perform the following steps:
    1. In the Inclusion Rule section of the Qualifier tab, toggle to the Advanced mode.
    2. From the list on top, select Constants.
    Image of the Advanced mode in the New Data Classification window.
    3. Click the Add Constant icon. The Add Constant dialog box appears.
    Image of the Add Constant dialog box.
    4. Enter a name for the constant.
    The name must not exceed 31 characters. It must start with a letter and can contain letters, digits, and underscores.
    5. In the Value field, enter a value for the constant. The value must not exceed 1000 characters.
    6. Optionally, enter a description.
    7. Click OK to save the constant.
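The name and value limits in the steps above can be checked before you enter a constant in the dialog box. The following Python sketch is illustrative only: the function name check_constant is hypothetical and is not part of Metadata Command Center, and it assumes that "letters" means ASCII letters.

```python
import re

# Limits taken from the steps above.
MAX_NAME_LENGTH = 31
MAX_VALUE_LENGTH = 1000

# Assumption: "letters" means ASCII letters A-Z and a-z.
NAME_PATTERN = re.compile(r"[A-Za-z][A-Za-z0-9_]*")


def check_constant(name, value):
    """Return a list of problems; an empty list means all limits are met."""
    problems = []
    if len(name) > MAX_NAME_LENGTH:
        problems.append("name exceeds 31 characters")
    if not NAME_PATTERN.fullmatch(name):
        problems.append("name must start with a letter and use only letters, digits, and underscores")
    if len(value) > MAX_VALUE_LENGTH:
        problems.append("value exceeds 1000 characters")
    return problems
```

For example, check_constant("CCN_EXP", "^4[0-9]{12}$") returns an empty list, while a name that starts with a digit returns one problem.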
Example: Adding a constant for a data classification rule that validates credit card numbers
Consider the following data classification rule that validates all major credit cards:
(UPPER(NAME) LIKE '%CARD%NUMBER%' OR UPPER(NAME) LIKE '%CC%NUM%' OR LOWER(NAME) IN lkp_ccn_col.header_col_names) AND (size(filter(FREQUENT_VALUES, v -> REGEXP_REPLACE(v,'-|\s','') RLIKE '^3[47][0-9]{13}$|^(5[1-5][0-9]{14}|2(22[1-9][0-9]{12}|2[3-9][0-9]{13}|[3-6][0-9]{14}|7[0-1][0-9]{13}|720[0-9]{12}))$|^4[0-9]{12}(?:[0-9]{3})?$|^6(?:011\d{12}|5\d{14}|4[4-9]\d{13}|22(?:1(?:2[6-9]|[3-9]\d)|[2-8]\d{2}|9(?:[01]\d|2[0-5]))\d{10})$')) / size(FREQUENT_VALUES)) >= 0.8f
To improve the readability of this lengthy rule, let us define a constant called CCN_EXP, and assign the expression for credit card number patterns as the value to the constant in the following manner:
CCN_EXP='^3[47][0-9]{13}$|^(5[1-5][0-9]{14}|2(22[1-9][0-9]{12}|2[3-9][0-9]{13}|[3-6][0-9]{14}|7[0-1][0-9]{13}|720[0-9]{12}))$|^4[0-9]{12}(?:[0-9]{3})?$|^6(?:011\d{12}|5\d{14}|4[4-9]\d{13}|22(?:1(?:2[6-9]|[3-9]\d)|[2-8]\d{2}|9(?:[01]\d|2[0-5]))\d{10})$'
By using the constant CCN_EXP, the data classification rule mentioned above can be rewritten as follows to reduce the length significantly:
(UPPER(NAME) LIKE '%CARD%NUMBER%' OR UPPER(NAME) LIKE '%CC%NUM%' OR LOWER(NAME) IN lkp_ccn_col.header_col_names) AND (size(filter(FREQUENT_VALUES, v -> REGEXP_REPLACE(v,'-|\s','') RLIKE $CCN_EXP)) / size(FREQUENT_VALUES)) >= 0.8f
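To see what the value-based half of this rule computes, the following Python sketch mirrors the size(filter(...)) / size(FREQUENT_VALUES) ratio outside Metadata Command Center. It is an approximation under stated assumptions: Python's re.search stands in for RLIKE, the function name conformance is hypothetical, the sample values are well-known test card numbers chosen for illustration, and an empty list is treated as a ratio of 0.0.

```python
import re

# Value of the CCN_EXP constant from the example, split across lines for readability.
CCN_EXP = (
    r"^3[47][0-9]{13}$"
    r"|^(5[1-5][0-9]{14}|2(22[1-9][0-9]{12}|2[3-9][0-9]{13}|[3-6][0-9]{14}|7[0-1][0-9]{13}|720[0-9]{12}))$"
    r"|^4[0-9]{12}(?:[0-9]{3})?$"
    r"|^6(?:011\d{12}|5\d{14}|4[4-9]\d{13}|22(?:1(?:2[6-9]|[3-9]\d)|[2-8]\d{2}|9(?:[01]\d|2[0-5]))\d{10})$"
)


def conformance(frequent_values):
    """Fraction of values that match CCN_EXP after hyphens and whitespace are removed."""
    if not frequent_values:
        return 0.0  # Assumption: the rule engine may handle an empty list differently.
    # Mirrors REGEXP_REPLACE(v, '-|\s', '')
    cleaned = [re.sub(r"-|\s", "", v) for v in frequent_values]
    # Mirrors size(filter(FREQUENT_VALUES, v -> ... RLIKE $CCN_EXP))
    matches = [v for v in cleaned if re.search(CCN_EXP, v)]
    return len(matches) / len(frequent_values)


# Illustrative sample: four test card numbers and one non-matching value.
sample = [
    "4111 1111 1111 1111",
    "3782-822463-10005",
    "5555555555554444",
    "6011111111111117",
    "not a card",
]
meets_threshold = conformance(sample) >= 0.8
```

With this sample, four of the five values match, so the ratio reaches the 0.8 threshold used in the rule.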

Example: Classify a column in a table as CUSIP numbers

CUSIP (Committee on Uniform Securities Identification Procedures) numbers identify North American securities and are usually 9 characters long. For example, a CUSIP number can be 392690QT3. Let us construct an expression that classifies a column in a table as CUSIP numbers. The expression checks whether the column name contains the word 'cusip' and whether all the frequently occurring values in the column follow one of the specified patterns. If both conditions are met, the expression classifies the column as 'cusip_number'. We can construct the expression in the following way:
LOWER(NAME) LIKE '%cusip%' AND forall(frequent_values,v->v==NULL OR LOWER(v) RLIKE '[0-9]{3}[0-9a-zA-Z]{3} [0-9a-zA-Z]{2} [0-9]' OR LOWER(v) RLIKE '[0-9]{3}[0-9a-zA-Z]{3}-[0-9a-zA-Z]{2}-[0-9]' OR LOWER(v) RLIKE '[0-9]{3}[0-9a-zA-Z]{3}[0-9a-zA-Z]{2}[0-9]')
We use attributes (NAME, FREQUENT_VALUES), operators (AND, OR, LIKE, RLIKE), and built-in functions (LOWER, FORALL) to construct the above expression. Let us simplify the expression to understand the function that each phrase performs:
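One way to follow what each part of the expression does is to mirror it in a general-purpose language. The Python sketch below is illustrative only and is not how Metadata Command Center evaluates rules: the function name is_cusip_column is hypothetical, None stands in for NULL, and re.search is assumed to behave like RLIKE, which matches a pattern anywhere in the value.

```python
import re

# The three value patterns from the expression (each character class closed).
CUSIP_PATTERNS = [
    r"[0-9]{3}[0-9a-zA-Z]{3} [0-9a-zA-Z]{2} [0-9]",  # separated by spaces
    r"[0-9]{3}[0-9a-zA-Z]{3}-[0-9a-zA-Z]{2}-[0-9]",  # separated by hyphens
    r"[0-9]{3}[0-9a-zA-Z]{3}[0-9a-zA-Z]{2}[0-9]",    # no separators
]


def is_cusip_column(name, frequent_values):
    """Return True when the column name and all frequent values look like CUSIP data."""
    # Mirrors LOWER(NAME) LIKE '%cusip%'
    if "cusip" not in name.lower():
        return False
    # Mirrors forall(frequent_values, v -> v == NULL OR LOWER(v) RLIKE ...)
    return all(
        v is None or any(re.search(p, v.lower()) for p in CUSIP_PATTERNS)
        for v in frequent_values
    )
```

For example, is_cusip_column("CUSIP_NO", ["392690QT3", "392690-QT-3", None]) is true, while a single value such as "hello" in the list makes the check fail for the whole column.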

Example: Frequent values in data classification

The FREQUENT_VALUES attribute is an advanced option that enables you to determine the most frequent column values. The results depend on the sampling type that you select when you configure a catalog source. You select the sampling type on the Data Profiling and Quality tab. Use the attribute in the form of an inclusion rule when you configure Data Classification.
The FREQUENT_VALUES attribute includes all distinct values and displays them in order of most common values in the profile results.
If you use the FREQUENT_VALUES attribute in a data element classification rule, the rule fetches all distinct records from the available values. Based on the records, the frequency percentage is calculated and appears in profiling results. You can construct rules with percentage values and conformance percentage.
For example, the rule can have the following structure: FREQUENT_VALUES = [USA, UK, INDIA, CHINA, RUSSIA, CANADA, BRAZIL, CHILE].
The following table contains sample values that the FREQUENT_VALUES attribute could apply to:
VALUES    COUNT    PERCENTAGE
USA       2        20%
UK        2        20%
INDIA     1        10%
CHINA     1        10%
RUSSIA    1        10%
CANADA    1        10%
BRAZIL    1        10%
CHILE     1        10%
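The counts and percentages in the table can be reproduced with a few lines of Python. This is a sketch for illustration, not the profiling implementation: the ten-row column_values list is an assumed sample chosen to match the table, and frequency_table is a hypothetical helper name.

```python
from collections import Counter

# Assumption: a ten-row column sample that reproduces the counts in the table.
column_values = [
    "USA", "UK", "INDIA", "USA", "CHINA",
    "UK", "RUSSIA", "CANADA", "BRAZIL", "CHILE",
]


def frequency_table(values):
    """Return (value, count, percentage) rows, most common values first."""
    counts = Counter(values)
    total = sum(counts.values())
    # most_common() orders by count and keeps first-seen order for ties.
    return [(value, count, 100 * count / total) for value, count in counts.most_common()]


rows = frequency_table(column_values)
```

Each row holds a distinct value, its count, and its share of all rows, so the first row here is ("USA", 2, 20.0).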