Masking Techniques
The masking technique is the type of data masking to apply to the selected column.
You can select one of the following masking techniques for an input column:
- Random
- Produces random, non-repeatable results for the same source data and masking rules. You can mask date, numeric, and string datatypes. Random masking does not require a seed value. The results of random masking are non-deterministic.
- Expression
- Applies an expression to a source column to create or mask data. You can mask all datatypes.
- Key
- Replaces source data with repeatable values. The Data Masking transformation produces deterministic results for the same source data, masking rules, and seed value. You can mask date, numeric, and string datatypes.
- Substitution
- Replaces a column of data with similar but unrelated data from a dictionary. You can mask the string datatype.
- Dependent
- Replaces the values of one source column based on the values of another source column. You can mask the string datatype.
- Tokenization
- Replaces source data with data generated based on customized masking criteria. The Data Masking transformation applies rules specified in a customized algorithm. You can mask the string datatype.
- Special Mask Formats
- Credit card number, email address, IP address, phone number, SSN, SIN, or URL. The Data Masking transformation applies built-in rules to intelligently mask these common types of sensitive data.
- No Masking
- The Data Masking transformation does not change the source data.
Default is No Masking.
Random Masking
Random masking generates random nondeterministic masked data. The Data Masking transformation returns different values when the same source value occurs in different rows. You can define masking rules that affect the format of data that the Data Masking transformation returns. Mask numeric, string, and date values with random masking.
Masking String Values
Configure random masking to generate random output for string columns. To configure limitations for each character in the output string, configure a mask format. Configure filter characters to define which source characters to mask and the characters to mask them with.
You can apply the following masking rules for a string port:
- Range
- Configure the minimum and maximum string length. The Data Masking transformation returns a string of random characters between the minimum and maximum string length.
- Mask Format
- Define the type of character to substitute for each character in the input data. You can limit each character to an alphabetic, numeric, or alphanumeric character type.
- Source String Characters
- Define the characters in the source string that you want to mask. For example, mask the number sign (#) character whenever it occurs in the input data. The Data Masking transformation masks all the input characters when Source String Characters is blank.
- Result String Replacement Characters
- Substitute the characters in the target string with the characters you define in Result String Characters. For example, enter the following characters to configure each mask to contain uppercase alphabetic characters A - Z:
ABCDEFGHIJKLMNOPQRSTUVWXYZ
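The interaction between Source String Characters and Result String Replacement Characters can be sketched in Python. This is only an illustration of the rules described above, not the transformation's actual algorithm; the function name `random_mask_string` and its parameters are hypothetical.

```python
import random
import string


def random_mask_string(source, source_chars="", result_chars=string.ascii_uppercase, rng=None):
    """Replace each maskable character with a random character from result_chars.

    An empty source_chars means that every input character is masked,
    mirroring a blank Source String Characters setting.
    """
    rng = rng or random.Random()
    masked = []
    for ch in source:
        if not source_chars or ch in source_chars:
            masked.append(rng.choice(result_chars))
        else:
            masked.append(ch)  # characters outside source_chars pass through
    return "".join(masked)


# Mask only the number sign; the other characters are left unchanged.
print(random_mask_string("AB#12#", source_chars="#"))
# Blank source characters: every character is replaced with A - Z.
print(random_mask_string("secret"))
```

Because the output is random, running the sketch twice on the same input generally produces different results, which is the non-repeatable behavior that random masking is meant to provide.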
Masking Numeric Values
When you mask numeric data, you can configure a range of output values for a column. The Data Masking transformation returns a value between the minimum and maximum values of the range depending on port precision. To define the range, configure the minimum and maximum values or configure a blurring range based on a variance from the original source value.
You can configure the following masking parameters for numeric data:
- Range
- Define a range of output values. The Data Masking transformation returns numeric data between the minimum and maximum values.
- Blurring Range
- Define a range of output values that are within a fixed variance or a percentage variance of the source data. The Data Masking transformation returns numeric data that is close to the value of the source data. You can configure a range and a blurring range.
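As a rough illustration of how a range and a blurring range constrain the output, consider the following Python sketch. It assumes that when both are configured the result must satisfy both, and that a percentage variance is computed from the absolute source value; both are assumptions made for this example, and `random_mask_number` is a hypothetical name.

```python
import math
import random


def random_mask_number(value, minimum=None, maximum=None,
                       fixed_variance=None, percent_variance=None, rng=None):
    """Return a random number inside the configured range and/or blurring range."""
    rng = rng or random.Random()
    low = -math.inf if minimum is None else minimum
    high = math.inf if maximum is None else maximum

    # Blurring: a window around the source value (fixed or percentage variance).
    if fixed_variance is not None:
        delta = fixed_variance
    elif percent_variance is not None:
        delta = abs(value) * percent_variance / 100.0
    else:
        delta = None
    if delta is not None:
        # Assumption: range and blurring range are intersected.
        low, high = max(low, value - delta), min(high, value + delta)

    if math.isinf(low) or math.isinf(high) or low > high:
        raise ValueError("configure a range, a blurring range, or both")
    return rng.uniform(low, high)


print(random_mask_number(100, fixed_variance=10))                       # within 90..110
print(random_mask_number(100, minimum=0, maximum=105, percent_variance=10))  # within 90..105
```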
Masking Date Values
To mask date values with random masking, either configure a range of output dates or choose a variance. When you configure a variance, choose a part of the date to blur. Choose the year, month, day, hour, minute, or second. The Data Masking transformation returns a date that is within the range you configure.
You can configure the following masking rules when you mask a datetime value:
- Range
- Sets the minimum and maximum values to return for the selected datetime value.
- Blurring
- Masks a date based on a variance that you apply to a unit of the date. The Data Masking transformation returns a date that is within the variance. You can blur the year, month, day, hour, minute, or second. Choose a low and high variance to apply.
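Date blurring can be pictured as shifting one unit of the date by a random amount. The Python sketch below assumes that the low variance is subtracted and the high variance is added, and that month and year shifts clamp the day to the length of the target month; these details and the names `blur_date` and `shift_months` are assumptions for illustration only.

```python
import calendar
import random
from datetime import datetime, timedelta

_DELTA_UNITS = {"day": "days", "hour": "hours", "minute": "minutes", "second": "seconds"}


def shift_months(value, months):
    """Move a datetime by whole months, clamping the day to the target month."""
    index = value.year * 12 + (value.month - 1) + months
    year, month = divmod(index, 12)
    month += 1
    day = min(value.day, calendar.monthrange(year, month)[1])
    return value.replace(year=year, month=month, day=day)


def blur_date(value, unit, low, high, rng=None):
    """Shift value by a random whole number of units between -low and +high."""
    rng = rng or random.Random()
    offset = rng.randint(-low, high)
    if unit == "year":
        return shift_months(value, 12 * offset)
    if unit == "month":
        return shift_months(value, offset)
    return value + timedelta(**{_DELTA_UNITS[unit]: offset})


source = datetime(2024, 2, 29, 12, 0, 0)
print(blur_date(source, "year", low=2, high=2))  # a February date in 2022..2026
print(blur_date(source, "day", low=3, high=3))   # within three days of the source
```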
Expression Masking
Expression masking applies an expression to a port to change the data or create new data. When you configure expression masking, create an expression in the Expression Editor. Select input and output ports, functions, variables, and operators to build expressions.
You can concatenate data from multiple ports to create a value for another port. For example, you need to create a login name. The source has first name and last name columns. Mask the first and last name from lookup files. In the Data Masking transformation, create another port called Login. For the Login port, configure an expression to concatenate the first letter of the first name with the last name:
SUBSTR(FIRSTNM,1,1)||LASTNM
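For readers less familiar with the transformation language, the expression is roughly equivalent to the following Python function. The name `login_name` is hypothetical, and the sketch ignores null handling.

```python
def login_name(first_name, last_name):
    # SUBSTR(FIRSTNM,1,1) takes one character starting at position 1 (1-based),
    # and || concatenates it with the last name.
    return first_name[:1] + last_name


print(login_name("Alice", "Smith"))  # ASmith
```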
Select functions, ports, variables, and operators from the point-and-click interface to minimize errors when you build expressions.
The Expression Editor displays the output ports which are not configured for expression masking. You cannot use the output from an expression as input to another expression. If you manually add the output port name to the expression, you might get unexpected results.
When you create an expression, verify that the expression returns a value that matches the port datatype. The Data Masking transformation returns zero if the data type of the expression port is numeric and the data type of the expression is not the same. The Data Masking transformation returns null values if the data type of the expression port is a string and the data type of the expression is not the same.
Repeatable Expression Masking
Configure repeatable expression masking when a source column occurs in more than one table and you need to mask the column from each table with the same value.
When you configure repeatable expression masking, the Data Masking transformation saves the results of an expression in a storage table. If the column occurs in another source table, the Data Masking transformation returns the masked value from the storage table instead of from the expression.
Dictionary Name
When you configure repeatable expression masking you must enter a dictionary name. The dictionary name is a key that allows multiple Data Masking transformations to generate the same masked values from the same source values. Define the same dictionary name in each Data Masking transformation. The dictionary name can be any text.
Storage Table
The storage table contains the results of repeatable expression masking between sessions. A storage table row contains the source column and a masked value pair. The storage table for expression masking is a separate table from the storage table for substitution masking.
Each time the Data Masking transformation masks a value with a repeatable expression, it searches the storage table by dictionary name, locale, column name and input value. If it finds a row in the storage table, it returns the masked value from the storage table. If the Data Masking transformation does not find a row, it generates a masked value from the expression for the column.
You need to encrypt storage tables for expression masking when you have unencrypted data in the storage and use the same dictionary name as key.
Encrypting Storage Tables for Expression Masking
You can use transformation language encoding functions to encrypt storage tables. You need to encrypt storage tables when you have enabled storage encryption.
1. Create a mapping with the IDM_EXPRESSION_STORAGE storage table as source.
2. Create a Data Masking transformation.
3. Apply the expression masking technique on the masked value ports.
4. Use the following expression on the MASKEDVALUE port:
Enc_Base64(AES_Encrypt(MASKEDVALUE, Key))
5. Link the ports to the target.
Example
An Employees table contains the following columns:
FirstName
LastName
LoginID
In the Data Masking transformation, mask LoginID with an expression that combines FirstName and LastName. Configure the expression mask to be repeatable. Enter a dictionary name as a key for repeatable masking.
The Computer_Users table contains a LoginID, but no FirstName or LastName columns:
Dept
LoginID
Password
To mask the LoginID in Computer_Users with the same LoginID as Employees, configure expression masking for the LoginID column. Enable repeatable masking and enter the same dictionary name that you defined for the LoginID column in the Employees table. The Integration Service retrieves the LoginID values from the storage table.
Create a default expression to use when the Integration Service cannot find a row in the storage table for LoginID. The Computer_Users table does not have the FirstName or LastName columns, so the expression creates a less meaningful LoginID.
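The Employees and Computer_Users example can be sketched with an in-memory dictionary standing in for the storage table. The key fields follow the lookup described earlier (dictionary name, locale, column name, and input value); the function name `repeatable_mask`, the dictionary name `LOGIN_DICT`, the locale, and the sample values are all hypothetical.

```python
storage = {}  # stand-in for the expression masking storage table


def repeatable_mask(dictionary_name, locale, column, value, expression):
    """Return the stored masked value, or evaluate the expression and store it."""
    key = (dictionary_name, locale, column, value)
    if key not in storage:
        storage[key] = expression(value)
    return storage[key]


# Employees: the expression combines the masked first and last names.
employees_login = repeatable_mask(
    "LOGIN_DICT", "en_US", "LoginID", "jdoe", lambda v: "A" + "Miller")

# Computer_Users: same dictionary name, so the stored value is returned
# and the default expression is not evaluated.
users_login = repeatable_mask(
    "LOGIN_DICT", "en_US", "LoginID", "jdoe", lambda v: "user_" + str(len(v)))

# A LoginID that was never masked in Employees falls back to the default expression.
unknown_login = repeatable_mask(
    "LOGIN_DICT", "en_US", "LoginID", "zzz", lambda v: "user_" + str(len(v)))

print(employees_login, users_login, unknown_login)  # AMiller AMiller user_3
```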
Storage Table Scripts
Informatica provides scripts that you can run to create the storage table. The scripts are in the following location:
<PowerCenter installation directory>\client\bin\Extensions\DataMasking
The directory contains a script for Sybase, Microsoft SQL Server, IBM DB2, and Oracle databases. Each script is named Expression_<database type>.
Rules and Guidelines for Expression Masking
Use the following rules and guidelines for expression masking:
- You cannot use the output from an expression as input to another expression. If you manually add the output port name to the expression, you might get unexpected results.
- Use the point-and-click method to build expressions. Select functions, ports, variables, and operators from the point-and-click interface to minimize errors when you build expressions.
- If the Data Masking transformation is configured for repeatable masking, and the storage table does not exist, the Integration Service substitutes the source data with default values.
Key Masking
A column configured for key masking returns deterministic masked data each time the source value and seed value are the same. The Data Masking transformation returns unique values for the column.
When you configure a column for key masking, the Data Masking transformation creates a seed value for the column. You can change the seed value to produce repeatable data between different Data Masking transformations. For example, configure key masking to enforce referential integrity. Use the same seed value to mask a primary key in a table and the foreign key value in another table.
You can define masking rules that affect the format of data that the Data Masking transformation returns. Mask string and numeric values with key masking.
Masking String Values
You can configure key masking to generate repeatable output for strings. Configure a mask format to define limitations for each character in the output string. Configure source string characters that define which source characters to mask. Configure result string replacement characters to limit the masked data to certain characters.
You can configure the following masking rules for key masking strings:
- Seed
- Apply a seed value to generate deterministic masked data for a column. You can enter a number between 1 and 1,000.
- Mask Format
- Define the type of character to substitute for each character in the input data. You can limit each character to an alphabetic, numeric, or alphanumeric character type.
- Source String Characters
- Define the characters in the source string that you want to mask. For example, mask the number sign (#) character whenever it occurs in the input data. The Data Masking transformation masks all the input characters when Source String Characters is blank. The Data Masking transformation does not always return unique data if the number of source string characters is less than the number of result string characters.
- Result String Characters
- Substitute the characters in the target string with the characters you define in Result String Characters. For example, enter the following characters to configure each mask to contain all uppercase alphabetic characters:
ABCDEFGHIJKLMNOPQRSTUVWXYZ
Masking Numeric Values
Configure key masking for numeric source data to generate deterministic output. When you configure a column for numeric key masking, you assign a random seed value to the column. When the Data Masking transformation masks the source data, it applies a masking algorithm that requires the seed.
You can change the seed value for a column to produce repeatable results if the same source value occurs in a different column. For example, you want to maintain a primary-foreign key relationship between two tables. In each Data Masking transformation, enter the same seed value for the primary-key column as the seed value for the foreign-key column. The Data Masking transformation produces deterministic results for the same numeric values. The referential integrity is maintained between the tables.
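The effect of sharing a seed between a primary-key column and a foreign-key column can be illustrated with a keyed hash. This Python sketch is not the masking algorithm that the Data Masking transformation uses, and unlike key masking it does not guarantee unique output; it only shows why the same source value and seed lead to the same masked value. The function name `key_mask_number` and the sample rows are hypothetical.

```python
import hashlib
import hmac


def key_mask_number(value, seed, digits=9):
    """Derive a deterministic masked number from the source value and the seed."""
    digest = hmac.new(str(seed).encode(), str(value).encode(), hashlib.sha256).digest()
    return int.from_bytes(digest[:8], "big") % (10 ** digits)


customers = [{"CUST_ID": 1001}, {"CUST_ID": 1002}]
orders = [{"ORDER_ID": 1, "CUST_ID": 1002}, {"ORDER_ID": 2, "CUST_ID": 1001}]

SEED = 190  # the same seed is used for the primary key and the foreign key
masked_customers = [{"CUST_ID": key_mask_number(c["CUST_ID"], SEED)} for c in customers]
masked_orders = [{**o, "CUST_ID": key_mask_number(o["CUST_ID"], SEED)} for o in orders]

# Every masked foreign key still points at a masked primary key.
masked_ids = {c["CUST_ID"] for c in masked_customers}
print(all(o["CUST_ID"] in masked_ids for o in masked_orders))  # True
```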
Masking Datetime Values
When you configure key masking for datetime values, the Data Masking transformation requires a random number as a seed. You can change the seed to match the seed value for another column to return repeatable datetime values between the columns.
The Data Masking transformation can mask dates between 1753 and 2400 with key masking. If the source year is in a leap year, the Data Masking transformation returns a year that is also a leap year. If the source month contains 31 days, the Data Masking transformation returns a month that has 31 days. If the source month is February, the Data Masking transformation returns February.
The Data Masking transformation always generates valid dates.
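The leap-year and month-length rules can be sketched as follows. The sketch uses a keyed hash to pick a year, month, and day deterministically, and it assumes that masked dates also stay between 1753 and 2400; it is an illustration of the constraints above, not the product's algorithm, and `key_mask_date` is a hypothetical name.

```python
import calendar
import hashlib
import hmac
from datetime import datetime

MIN_YEAR, MAX_YEAR = 1753, 2400


def _pick(options, seed, label, value):
    """Choose one option deterministically from the seed and the source value."""
    message = f"{label}:{value.isoformat()}".encode()
    digest = hmac.new(str(seed).encode(), message, hashlib.sha256).digest()
    return options[int.from_bytes(digest[:8], "big") % len(options)]


def key_mask_date(value, seed):
    # Leap years map to leap years, non-leap years to non-leap years.
    leap = calendar.isleap(value.year)
    years = [y for y in range(MIN_YEAR, MAX_YEAR + 1) if calendar.isleap(y) == leap]
    year = _pick(years, seed, "year", value)

    # Keep the month length; for February only February has that length.
    month_len = calendar.monthrange(value.year, value.month)[1]
    months = [m for m in range(1, 13) if calendar.monthrange(year, m)[1] == month_len]
    month = _pick(months, seed, "month", value)

    # Any day up to the month length is valid in the chosen month.
    day = _pick(list(range(1, month_len + 1)), seed, "day", value)
    return value.replace(year=year, month=month, day=day)


print(key_mask_date(datetime(2024, 2, 29), seed=42))  # a February date in a leap year
print(key_mask_date(datetime(2023, 7, 15), seed=42))  # a date in a 31-day month
```

Because the day is drawn from a month of the same length as the source month, the result is always a valid calendar date.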
Substitution Masking
Substitution masking replaces a column of data with similar but unrelated data. Use substitution masking to replace production data with realistic test data. When you configure substitution masking, define the dictionary that contains the substitute values.
The Data Masking transformation performs a lookup on the dictionary that you configure. The Data Masking transformation replaces source data with data from the dictionary. Dictionary files can contain string data, datetime values, integers, and floating point numbers. Enter datetime values in the following format:
mm/dd/yyyy
You can substitute data with repeatable or non-repeatable values. When you choose repeatable values, the Data Masking transformation produces deterministic results for the same source data and seed value. You must configure a seed value to substitute data with deterministic results. The Integration Service maintains a storage table of source and masked values for repeatable masking.
You can substitute more than one column of data with masked values from the same dictionary row. Configure substitution masking for one input column. Configure dependent data masking for the other columns that receive masked data from the same dictionary row.
Dictionaries
A dictionary is a reference table that contains the substitute data and a serial number for each row in the table. Create a reference table for substitution masking from a flat file or relational table that you import into the Model repository.
The Data Masking transformation generates a number to retrieve a dictionary row by the serial number. The Data Masking transformation generates a hash key for repeatable substitution masking or a random number for non-repeatable masking. You can configure an additional lookup condition if you configure repeatable substitution masking.
You can configure a dictionary to mask more than one port in the Data Masking transformation.
When the Data Masking transformation retrieves substitution data from a dictionary, the transformation does not check if the substitute data value is the same as the original value. For example, the Data Masking transformation might substitute the name John with the same name (John) from a dictionary file.
The following example shows a dictionary table that contains first name and gender:
SNO | GENDER | FIRSTNAME |
---|---|---|
1 | M | Adam |
2 | M | Adeel |
3 | M | Adil |
4 | F | Alice |
5 | F | Alison |
In this dictionary, the first field in the row is the serial number, and the second field is gender. The Integration Service always looks up a dictionary record by serial number. You can add gender as a lookup condition if you configure repeatable masking. The Integration Service retrieves a row from the dictionary using a hash key, and it finds a row with a gender that matches the gender in the source data.
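The lookup can be approximated in Python using the sample dictionary. As a simplification, this sketch filters the rows by the lookup condition first and then selects a row with a hash of the seed and source value for repeatable masking, or with a random number for non-repeatable masking. The function name `substitute` and the hashing scheme are assumptions.

```python
import hashlib
import random

DICTIONARY = [
    {"SNO": 1, "GENDER": "M", "FIRSTNAME": "Adam"},
    {"SNO": 2, "GENDER": "M", "FIRSTNAME": "Adeel"},
    {"SNO": 3, "GENDER": "M", "FIRSTNAME": "Adil"},
    {"SNO": 4, "GENDER": "F", "FIRSTNAME": "Alice"},
    {"SNO": 5, "GENDER": "F", "FIRSTNAME": "Alison"},
]


def substitute(value, seed=None, condition=None, rng=None):
    """Return a substitute first name; a seed makes the choice repeatable."""
    rows = [r for r in DICTIONARY
            if condition is None or all(r[k] == v for k, v in condition.items())]
    if not rows:
        raise ValueError("no dictionary row satisfies the lookup condition")
    if seed is None:
        index = (rng or random.Random()).randrange(len(rows))  # non-repeatable
    else:
        digest = hashlib.sha256(f"{seed}:{value}".encode()).digest()
        index = int.from_bytes(digest[:8], "big") % len(rows)  # repeatable
    return rows[index]["FIRSTNAME"]


# Repeatable masking with gender as the lookup condition.
print(substitute("Mary", seed=7, condition={"GENDER": "F"}))
# Non-repeatable masking over the whole dictionary.
print(substitute("Mary"))
```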
Use the following rules and guidelines when you create a reference table:
- Each record in the table must have a serial number.
- The serial numbers are sequential integers starting at one. The serial numbers cannot have a missing number in the sequence.
- The serial number column can be anywhere in a table row. It can have any label.
If you use a flat file table to create the reference table, use the following rules and guidelines:
- The first row of the flat file table must have column labels to identify the fields in each record. The fields are separated by commas. If the first row does not contain column labels, the Integration Service takes the values of the fields in the first row as column names.
- If you create a flat file table on Windows and copy it to a UNIX machine, verify that the file format is correct for UNIX. For example, Windows and UNIX use different characters for the end of line marker.
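Before you import a flat file as a reference table, it can help to check the serial number rules with a small script. The following Python sketch is a hypothetical pre-check and not part of the product; it assumes a comma-separated file whose first row contains the column labels.

```python
import csv
import io


def has_valid_serial_numbers(text, serial_column):
    """Check that serial numbers are the integers 1..N with none missing."""
    reader = csv.DictReader(io.StringIO(text))
    serials = [int(row[serial_column]) for row in reader]
    return sorted(serials) == list(range(1, len(serials) + 1))


good = "SNO,GENDER,FIRSTNAME\n1,M,Adam\n2,M,Adeel\n3,F,Alice\n"
gap = "SNO,GENDER,FIRSTNAME\n1,M,Adam\n3,F,Alice\n"

print(has_valid_serial_numbers(good, "SNO"))  # True
print(has_valid_serial_numbers(gap, "SNO"))   # False, serial number 2 is missing
```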
Storage Tables
The Data Masking transformation maintains storage tables for repeatable substitution between sessions. A storage table row contains the source column and a masked value pair. Each time the Data Masking transformation masks a value with a repeatable substitute value, it searches the storage table by dictionary name, locale, column name, input value, and seed. If it finds a row, it returns the masked value from the storage table. If the Data Masking transformation does not find a row, it retrieves a row from the dictionary with a hash key.
The dictionary name format in the storage table is different for a flat file dictionary and a relational dictionary. A flat file dictionary name is identified by the file name. The relational dictionary name has the following syntax:
<Connection object>_<dictionary table name>
Informatica provides scripts that you can run to create a relational storage table. The scripts are in the following location:
<PowerCenter Client installation directory>\client\bin\Extensions\DataMasking
The directory contains a script for Sybase, Microsoft SQL Server, IBM DB2, and Oracle databases. Each script is named Substitution_<database type>. You can create a table in a different database if you configure the SQL statements and the primary key constraints.
You need to encrypt storage tables for substitution masking when you have unencrypted data in the storage and use the same seed value and dictionary to encrypt the same columns.
Encrypting Storage Tables for Substitution Masking
You can use transformation language encoding functions to encrypt storage tables. You need to encrypt storage tables when you have enabled storage encryption.
1. Create a mapping with the IDM_SUBSTITUTION_STORAGE storage table as source.
2. Create a Data Masking transformation.
3. Apply the substitution masking technique on the input value and the masked value ports.
4. Use the following expression on the INPUTVALUE port:
Enc_Base64(AES_Encrypt(INPUTVALUE, Key))
5. Use the following expression on the MASKEDVALUE port:
Enc_Base64(AES_Encrypt(MASKEDVALUE, Key))
6. Link the ports to the target.
Substitution Masking Properties
You can configure the following masking rules for substitution masking:
- Repeatable Output. Returns deterministic results between sessions. The Data Masking transformation stores masked values in the storage table.
- Seed Value. Apply a seed value to generate deterministic masked data for a column. Enter a number between 1 and 1,000.
- Unique Output. Force the Data Masking transformation to create unique output values for unique input values. No two input values are masked to the same output value. The dictionary must have enough unique rows to enable unique output.
When you disable unique output, the Data Masking transformation might not mask input values to unique output values. The dictionary might contain fewer rows.
- Unique Port. The port used to identify unique records for substitution masking. For example, you want to mask first names in a table called Customer. If you select the table column that contains the first names as the unique port, the Data Masking transformation replaces duplicate first names with the same masked value. If you select the Customer_ID column as the unique port, the Data Masking transformation replaces each first name with a unique value.
- Dictionary Information. Configure the reference table that contains the substitute data values. Click Select Source to select a reference table.
- - Dictionary Name. Displays the name of the reference table that you select.
- - Output Column. Choose the column to return to the Data Masking transformation.
- Lookup condition. Configure a lookup condition to further qualify what dictionary row to use for substitution masking. The lookup condition is similar to the WHERE clause in an SQL query. When you configure a lookup condition you compare the value of a column in the source with a column in the dictionary.
For example, you want to mask the first name. The source data and the dictionary have a first name column and a gender column. You can add a condition that each female first name is replaced with a female name from the dictionary. The lookup condition compares gender in the source to gender in the dictionary.
- - Input port. Source data column to use in the lookup.
- - Dictionary column. Dictionary column to compare the input port to.
Rules and Guidelines for Substitution Masking
Use the following rules and guidelines for substitution masking:
- If a storage table does not exist for a unique repeatable substitution mask, the session fails.
- If the dictionary contains no rows, the Data Masking transformation returns an error message.
- When the Data Masking transformation finds an input value with the locale, dictionary, and seed in the storage table, it retrieves the masked value, even if the row is no longer in the dictionary.
- If you delete a connection object or modify the dictionary, truncate the storage table. Otherwise, you might get unexpected results.
- If the number of values in the dictionary is less than the number of unique values in the source data, the Data Masking transformation cannot mask the data with unique repeatable values. The Data Masking transformation returns an error message.
Dependent Masking
Dependent masking substitutes multiple columns of source data with data from the same dictionary row.
When the Data Masking transformation performs substitution masking for multiple columns, the masked data might contain unrealistic combinations of fields. You can configure dependent masking to substitute data for multiple input columns from the same dictionary row. The masked data receives valid combinations such as "New York, New York" or "Chicago, Illinois."
When you configure dependent masking, you first configure an input column for substitution masking. Configure other input columns to be dependent on that substitution column. For example, you choose the ZIP code column for substitution masking and choose city and state columns to be dependent on the ZIP code column. Dependent masking ensures that the substituted city and state values are valid for the substituted ZIP code value.
Note: You cannot configure a column for dependent masking without first configuring a column for substitution masking.
Configure the following masking rules when you configure a column for dependent masking:
- Dependent column
- The name of the input column that you configured for substitution masking. The Data Masking transformation retrieves substitute data from a dictionary using the masking rules for that column. The column you configure for substitution masking becomes the key column for retrieving masked data from the dictionary.
- Output column
- The name of the dictionary column that contains the value for the column you are configuring with dependent masking.
Dependent Masking Example
A data masking dictionary might contain address rows with the following values:
SNO | STREET | CITY | STATE | ZIP | COUNTRY |
---|---|---|---|---|---|
1 | 32 Apple Lane | Chicago | IL | 61523 | US |
2 | 776 Ash Street | Dallas | TX | 75240 | US |
3 | 2229 Big Square | Atleeville | TN | 38057 | US |
4 | 6698 Cowboy Street | Houston | TX | 77001 | US |
You need to mask source data with valid combinations of the city, state, and ZIP code from the Address dictionary.
Configure the ZIP port for substitution masking. Enter the following masking rules for the ZIP port:
Rule | Value |
---|---|
Dictionary Name | Address |
Serial Number Column | SNO |
Output Column | ZIP |
Configure the City port for dependent masking. Enter the following masking rules for the City port:
Rule | Value |
---|---|
Dependent Column | ZIP |
Output Column | City |
Configure the State port for dependent masking. Enter the following masking rules for the State port:
Rule | Value |
---|---|
Dependent Column | ZIP |
Output Column | State |
When the Data Masking transformation masks the ZIP code, it returns the correct city and state for the ZIP code from the dictionary row.
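The example can be sketched in Python with the Address dictionary rows from the table above. The ZIP port selects a dictionary row, and the City and State ports read their values from that same row. The hash-based row selection and the function names `pick_row` and `mask_address` are assumptions for illustration.

```python
import hashlib

ADDRESS = [
    {"SNO": 1, "STREET": "32 Apple Lane", "CITY": "Chicago", "STATE": "IL", "ZIP": "61523"},
    {"SNO": 2, "STREET": "776 Ash Street", "CITY": "Dallas", "STATE": "TX", "ZIP": "75240"},
    {"SNO": 3, "STREET": "2229 Big Square", "CITY": "Atleeville", "STATE": "TN", "ZIP": "38057"},
    {"SNO": 4, "STREET": "6698 Cowboy Street", "CITY": "Houston", "STATE": "TX", "ZIP": "77001"},
]


def pick_row(source_zip, seed):
    """Substitution masking on ZIP: choose a dictionary row by serial number."""
    digest = hashlib.sha256(f"{seed}:{source_zip}".encode()).digest()
    serial = int.from_bytes(digest[:8], "big") % len(ADDRESS) + 1
    return next(row for row in ADDRESS if row["SNO"] == serial)


def mask_address(source_zip, seed=1):
    row = pick_row(source_zip, seed)
    # Dependent columns read their output columns from the same dictionary row.
    return {"ZIP": row["ZIP"], "City": row["CITY"], "State": row["STATE"]}


print(mask_address("94105"))
```

Whatever row is selected, the city and state always belong to the returned ZIP code because all three values come from one dictionary row.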
Tokenization Masking
Use the tokenization masking technique to mask source string data based on criteria that you specify in an algorithm. For example, you can create an algorithm that contains a fake email address to replace field entries in the source data.
You can configure the format of the masked data using Tokenization masking. You must assign a tokenizer name to the masking algorithm before you can use it. The tokenizer name references the masking algorithm (JAR) used. Specify the tokenizer name when you apply the tokenization masking technique.
Configuring Tokenization Masking
Perform the following tasks before you use the tokenization masking technique:
1. Browse to the tokenprovider directory in the path: <Informatica_home>\services\shared.
2. Open the following XML file: com.informatica.products.ilm.tx-tokenizerprovider.xml.
3. Add the tokenizer name and the fully qualified name of the class file for each tokenizer you want to use. Implement the tokenizer class within the com.informatica.products.ilm.tx-tokenprovider-<Build-Number>.jar file in the tokenprovider directory. For each tokenizer, enter the information in the XML file as in the following example:
<TokenizerProvider>
<Tokenizer Name="CCTokenizer"
ClassName="com.informatica.tokenprovider.CCTokenizer"/>
</TokenizerProvider>
Where:
- - Tokenizer Name is the user-defined name in quotes.
- - ClassName is the fully qualified name of the tokenizer class. Implement the class within com.informatica.products.ilm.tx-tokenprovider-<Build-Number>.jar.
After configuration, you can use the Tokenization masking technique. Enter the tokenizer name to specify the algorithm to use when you create a mapping.
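To confirm which tokenizer names are registered, you could read the XML file with a short script. The following Python sketch parses the sample entry shown above; it assumes the file contains only the elements in that sample, so a real file might need adjustments. The function name `list_tokenizers` is hypothetical.

```python
import xml.etree.ElementTree as ET

SAMPLE = """<TokenizerProvider>
    <Tokenizer Name="CCTokenizer"
               ClassName="com.informatica.tokenprovider.CCTokenizer"/>
</TokenizerProvider>"""


def list_tokenizers(xml_text):
    """Map each tokenizer name to the class name registered for it."""
    root = ET.fromstring(xml_text)
    return {t.get("Name"): t.get("ClassName") for t in root.iter("Tokenizer")}


print(list_tokenizers(SAMPLE))
```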