Apply pre-built rules

You can apply pre-built rules to evaluate the accuracy and validity of your data.

For example, you can determine the accuracy of the input addresses, fix errors in the addresses, and enhance the addresses where possible with additional information. You can also measure and report on the quality of each address, retrieve country-specific enrichments for addresses, and verify addresses to the certification standards that a country defines.

The pre-built rules evaluate and update the input data and return the results.

Select the required pre-built function from the Apply Functions section in INFACore. Specify the data object variable and the name of the column to which you want to apply the function. For information about how to create the data object variable, see .

Get country ISO values

You can apply this function to your data to standardize country names to their respective country codes.

To perform this operation, select the Get Country ISO Values function, and specify the data object variable. Then, enter the name of the country column for which you want the ISO values.

The function reads the input country name and returns the full ISO country name together with the corresponding ISO two-letter and three-letter country codes. Values that the function does not recognize as country names are returned unchanged.

In the following example, the input flat file includes a column named country that contains the following data: India, XYZ, and 123!.

The following snippet is the input code when you apply the function:

The following snippet shows the output:

Out_Country_Name: string
Out_Country_ISO2Char: string
Out_Country_ISO3Char: string
----
Out_Country_Name: [["INDIA","XYZ","123!"]]
Out_Country_ISO2Char: [["IN","XYZ","123!"]]
Out_Country_ISO3Char: [["IND","XYZ","123!"]]
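The pass-through behavior in the output above, where unrecognized values such as XYZ come back unchanged in all three columns, can be illustrated with a plain-Python lookup. This is a minimal sketch and not the INFACore implementation: the ISO_COUNTRIES table and the get_country_iso helper are hypothetical names, and the table holds only a few entries for illustration.

```python
# Hypothetical reference table: upper-cased country name -> (ISO2, ISO3).
ISO_COUNTRIES = {
    "INDIA": ("IN", "IND"),
    "FRANCE": ("FR", "FRA"),
    "JAPAN": ("JP", "JPN"),
}

def get_country_iso(value):
    """Return (name, iso2, iso3); unrecognized input is passed through unchanged."""
    key = value.strip().upper()
    if key in ISO_COUNTRIES:
        iso2, iso3 = ISO_COUNTRIES[key]
        return key, iso2, iso3
    return value, value, value

rows = [get_country_iso(v) for v in ["India", "XYZ", "123!"]]
```

A real reference table would cover every ISO 3166 country and common spelling variants.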

Convert diacritic English characters

You can apply this function to translate diacritic characters to regular ASCII text characters.

To perform this operation, select the Convert Diacritic English Chars function, and specify the data object variable. Then, enter the column name that contains the diacritic characters.

The function changes the diacritic characters to regular ASCII text. For example, when a column contains the value tëst, the function changes the ë to an e and returns test.

In the following example, the input flat file includes a column named diacritic_chars that contains diacritic English characters.

The following snippet is the input code when you apply the function:

import informatica.infacore as ic
# Read the input flat file as a data object variable.
FF_DV = ic.get_data_source("Flat File").get_connection("DR_FlatFile").get_data_object("input.csv")
# Apply the function to the diacritic_chars column.
dqf = ic.DataQualityFunctions()
dqf.convert_diacritic_english_chars(FF_DV, "diacritic_chars")

The following snippet shows the output, where the diacritic characters are replaced with ASCII text characters:

In_Field: string
Out_Field: string
----
In_Field: [["tëst","ÔN","\uXXXX"]]
Out_Field: [["test","ON","\uXXXX"]]
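To see what this kind of conversion involves, the same effect can be approximated with the Python standard library. The sketch below is an illustration only and is not how INFACore implements the function: strip_diacritics is a hypothetical helper that decomposes each character with Unicode NFKD normalization and then drops the combining marks.

```python
import unicodedata

def strip_diacritics(text):
    """Decompose characters (NFKD) and drop the combining marks."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

converted = [strip_diacritics(v) for v in ["tëst", "ÔN"]]
```

This approach only handles letters that decompose into a base letter plus a mark. Characters such as ø or ß have no such decomposition and would need an explicit mapping.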

Remove control characters

You can apply this function to identify and remove non-printable characters in the source data.

To perform this operation, select the Remove Control Characters function, and specify the data object variable. Then, enter the column name that contains the control characters.

The function removes the control characters. For example, the input flat file includes a column named characters that contains control characters.

The following snippet is the input code when you apply the function:

The following snippet shows the output, where the control characters are removed:

Input_Field: string
Output_Field: string
----
Input_Field: [[" "," ","→"]]
Output_Field: [[""," ","→"]]
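As a rough illustration of what removing control characters means, the following standard-library sketch drops every character in the Unicode control category (Cc). The remove_control_chars helper is hypothetical, and the exact set of characters that the INFACore function treats as control characters is an assumption here.

```python
import unicodedata

def remove_control_chars(text):
    """Drop every character in Unicode category Cc (control)."""
    return "".join(ch for ch in text if unicodedata.category(ch) != "Cc")

# \x00, \x1f, and the tab are control characters; the arrow and the letters are not.
cleaned = remove_control_chars("ab\x00c\x1fd\t→")
```

Note that tabs and line breaks also belong to category Cc, so this sketch removes them, while ordinary spaces and printable symbols such as → are kept.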

Parse name

You can apply this function to parse a person's first name, surname, and gender, and to determine the gender score and gender status for the name.

If you know the gender for a name, the function uses the gender-specific score to determine the status. Acceptable input values are M for male and F for female.

If you do not know the gender, the function uses the higher of the male and female scores to determine the status. The rule also calculates the probable gender based on the first name input and provides a confidence score based on how frequently the name occurs as male or female.

A gender is assigned a score only if the probability of the gender being male or female is 70% or more. Unknown genders always have a confidence score of zero.
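The scoring rule for names with no supplied gender can be sketched as follows. This is one reading of the description above and not the rule's actual logic: probable_gender is a hypothetical helper, the two probabilities are assumed to be already computed, and ties are resolved in favor of male purely for illustration.

```python
def probable_gender(male_prob, female_prob, threshold=0.70):
    """Return (gender, confidence) when no gender was supplied."""
    # Take whichever score is higher (ties go to "M" in this sketch).
    if male_prob >= female_prob:
        gender, score = "M", male_prob
    else:
        gender, score = "F", female_prob
    # Below the 70% threshold the gender is unknown and the confidence is zero.
    if score < threshold:
        return "Unknown", 0.0
    return gender, score

example = probable_gender(0.96, 0.10)
```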

To perform this operation, select the Parse Name function, and specify the data object variable. Then, enter the first name, surname, and gender column names that you want to parse.

For example, the input flat file includes columns for first names, surnames, and gender.

The following snippet is the input code when you apply the function:

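import informatica.infacore as ic
# Read the input flat file as a data object variable.
FF_DV = ic.get_data_source("Flat File").get_connection("DR_FlatFile").get_data_object("input.csv")
# Parse the first name, surname, and gender columns.
dqf = ic.DataQualityFunctions()
result = dqf.parse_name(FF_DV, "FirstName", "Surname", "Gender")
# Convert the result to a pandas DataFrame and preview the first rows.
df_reader = ic.DataFrameReader(result)
p_df = df_reader.to_pandas()
p_df.head()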

The function parses the data and returns the following gender score and gender status:

In_Firstname: [["James","Mary","Ishika"]]
In_Surname: [["Thomson","Patricia","Garg"]]
In_Gender: [["M","F","F"]]
R1_male_first_name_prob: [[0.9611602416284911,0.862213129336417,0.5]]
R1_female_first_name_prob: [[0.09618367642795013,0.9994692271687158,0.9994110718492345]]
R2_male_surname_prob: [[0.9810939357907253,0.06919354838709678,0.9996677740863787]]
R2_female_surname_prob: [[0.9999596024884867,0.00023361381512597722,0.9996677740863787]]
R_male_name_parse_prob: [[0.9999983675038314,0.31748599893598634,0.9996677740863787]]
R_female_name_parse_prob: [[0.999620522413733,0.9929068110415156,0.9999998041624873]]
Status: [["Probably Valid","Uncommon Name","Probably Valid"]]

Standardize the United States company names

You can apply this function to standardize United States company names and provide acronyms for the company names where possible.

To perform this operation, select the Standardize US Companyname function, and specify the data object variable. Then, enter the column name that contains the United States company names that you want to standardize.

For example, the input flat file includes a column named company that contains a few company names in the United States.

The following snippet is the input code when you apply the function:

The following snippet shows the output with the standardized company names:

Input_Company_Name: string
Out_CompanyName_Std: string
Out_CompanyName_Acronym: string
----
Input_Company_Name: [["Informatica","ABC","Amazon"]]
Out_CompanyName_Std: [["Informatica Corp.","American Broadcasting Co.","Amazon"]]
Out_CompanyName_Acronym: [["Informatica Corp.","American Broadcasting Co.","Amazon"]]

Validate names

You can apply this function to flag suspicious or fake names in the source data.

To perform this operation, select the Find Invalid Names function, and specify the data object variable. Then, enter the column name that contains names that you want to validate.

When you run the code, the function validates the names.

For example, the input flat file includes a Name column that contains the names that you want to validate.

The following snippet is the input code when you apply the function:

The following snippet shows the output, where the function flags each name as valid or suspect:

Name: string
Validation_Status: string
----
Name: [["Arcane","Aquata","ARTHUR M LYNCH JR"]]
Validation_Status: [["Valid","Suspect","Valid"]]

Validate email address

You can apply this function to check if the email format in the data source is valid. The function does not check if the email is an active address.

To perform this operation, specify the data object variable, and then enter the column name that contains the email address that you want to validate.

When you run the code to validate the email format, the function returns a status of Valid or Invalid for each value.

For example, the input flat file includes a column named email that contains the email addresses that you want to validate.

The following snippet is the input code when you apply the function:

The following snippet shows the output returned with the validation values for the email address:

Email_Id: string
Validation_Status: string
----
Email_Id: [["foobar@informatica.com","123","foobar"]]
Validation_Status: [["Valid","Invalid","Invalid"]]
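A format-only check of this kind can be sketched with a regular expression. The pattern below is a deliberately simplified assumption and is not the pattern that INFACore uses: it accepts a local part, an @ sign, and a domain with at least one dot. The email_status helper is hypothetical.

```python
import re

# Simplified pattern: local part, "@", then a domain with at least one dot.
EMAIL_PATTERN = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9-]+(\.[A-Za-z0-9-]+)+")

def email_status(value):
    """Return "Valid" if the whole value matches the simplified pattern."""
    return "Valid" if EMAIL_PATTERN.fullmatch(value) else "Invalid"

statuses = [email_status(v) for v in ["foobar@informatica.com", "123", "foobar"]]
```

As with the function itself, a match says nothing about whether the mailbox exists.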

Validate SSN

You can apply this function to parse a United States Social Security Number (SSN) pattern from a large string of text.

The function parses SSNs with or without dashes, and then formats and validates them. The format options are no punctuation, spaces, and dashes.

To perform this operation, specify the data object variable, and then enter the column name that contains the SSNs that you want to validate.

By default, the rule writes Social Security Numbers without any punctuation.

The rule can output the following formats:

• No Punctuation - nnnnnnnnn
• Space - nnn nn nnnn
• Dash - nnn-nn-nnnn

To change the standardization format, open the dq_SSN_Format transformation in the rule and update the expression on the SSN_Format port.
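The parse-and-format step can be sketched in plain Python. This is an illustration under stated assumptions and not the rule's implementation: it assumes the standard 3-2-4 digit grouping of a United States SSN, SSN_PATTERN and format_ssn are hypothetical names, and the sketch does not validate area or group values.

```python
import re

# Nine digits, optionally separated by dashes or spaces,
# and not embedded in a longer run of digits.
SSN_PATTERN = re.compile(r"(?<!\d)(\d{3})[- ]?(\d{2})[- ]?(\d{4})(?!\d)")

def format_ssn(text, style="none"):
    """Find the first SSN-like pattern in text and return it in the requested style."""
    match = SSN_PATTERN.search(text)
    if match is None:
        return None
    separator = {"none": "", "space": " ", "dash": "-"}[style]
    return separator.join(match.groups())

no_punctuation = format_ssn("680-11-2943")
with_dashes = format_ssn("SSN: 532459641 on file", style="dash")
```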

When you run the code to validate the SSN, the function returns a status of Valid or Invalid for each value.

For example, the input flat file includes a column named ssn that contains United States SSNs.

The following snippet is the input code when you apply the function:

import informatica.infacore as ic
# Read the input flat file as a data object variable.
FF_DV = ic.get_data_source("Flat File").get_connection("DR_FlatFile").get_data_object("input.csv")
# Validate the values in the ssn column.
dqf = ic.DataQualityFunctions()
dqf.validate_ssn(FF_DV, "ssn")

The following snippet shows the output returned with the validation values for the SSN:

Input_SSN: string
Out_SSN: string
Out_SSN_Status: string
Out_SSN_Status_Message: string
Out_SSN_Score: double
Out_Remaining_Text: string
----
Input_SSN: [["532459641","680-11-2943","41735abc55555"]]
Out_SSN: [["532459641","680112943","41735abc5"]]
Out_SSN_Status: [["Valid","Valid","Invalid"]]
Out_SSN_Status_Message: [["Valid structure, valid group and area values.","Valid structure, valid group and area values.","Invalid SSN length"]]
Out_SSN_Score: [[0.9,0.9,0.25]]
Out_Remaining_Text: [["","","41735abc55555"]]

Validate the United States zip code

You can apply this function to validate whether the input data is a valid five-digit United States zip code.

To perform this operation, select the Validate USA Zipcode function, and specify the data object variable. Then, enter the column name that contains the United States zip codes that you want to validate.

When you run the code, the function checks whether each zip code is valid.

For example, the input flat file includes a column named zipcode that contains the United States zip codes.

The following snippet is the input code when you apply the function:

The following snippet shows the validation values for the zip codes:

ZIPCode: string
Validation_Status: string
----
ZIPCode: [["99501","000","12340"]]
Validation_Status: [["Valid","Invalid","Invalid"]]
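In the output above, 12340 has five digits but is still reported as invalid, which suggests that the check goes beyond the shape of the value. A minimal sketch of such a two-step check follows. It is not the INFACore implementation: KNOWN_ZIP_CODES is a hypothetical stand-in for full reference data, and zip_status is a hypothetical helper.

```python
import re

# Hypothetical reference data; a real check needs the full list of assigned zip codes.
KNOWN_ZIP_CODES = {"99501", "10001", "94105"}

def zip_status(value):
    """Check the five-digit shape first, then look the value up in the reference data."""
    if not re.fullmatch(r"\d{5}", value):
        return "Invalid"
    return "Valid" if value in KNOWN_ZIP_CODES else "Invalid"

statuses = [zip_status(v) for v in ["99501", "000", "12340"]]
```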

Validate a state in the United States

You can apply this function to verify if the entry is a valid state in the United States.

To perform this operation, select the Validate USA State function, and specify the data object variable. Then, enter the column name that contains the state names that you want to validate.

When you run the code, the function validates the state name.

For example, the input flat file includes a column named state that contains the names of the states in the United States.

The following snippet is the input code when you apply the function:

The following snippet shows the output returned with the validation values for the states:

State: string
Out_State_Status: string
----
State: [["Punjab","California","California1"]]
Out_State_Status: [["Invalid","Valid","Invalid"]]

Validate the United States counties

You can apply this function to validate if the input string is a valid county name in the United States.

To perform this operation, select the Validate US County function, and specify the data object variable. Then, enter the column name that contains the county names that you want to validate.

When you run the code, the function checks whether each county name is valid.

For example, the input flat file includes a column named county that contains the names of the United States counties.

The following snippet is the input code when you apply the function:

The following snippet shows the output returned with the validation values for the counties:

County: string
County_Status: string
----
County: [["Jackson County","Marion","Clay"]]
County_Status: [["Invalid","Valid","Valid"]]

Validate the United States phone number

You can apply this function to validate and enrich telephone numbers in the United States.

To perform this operation, select the Validate USA Phone Number function, and specify the data object variable. Then, enter the column name that contains the phone numbers that you want to validate.

When you run the code, the function checks whether each phone number is a valid United States number and also enriches the phone number.

In the following example, the input flat file includes a column named phnNumber that contains United States phone numbers.

The following snippet is the input code when you apply the function:

The following snippet shows the output returned with the validation values for the phone number:

Input_Phone: string
Out_Phone_w_Extension_Std: string
Out_Phone_Std: string
Out_Phone_Dashes: string
Out_Phone_No_Spaces: string
Out_Phone_Extension: string
Out_Phone_Status: string
Out_Phone_Status_Msg: string
Out_Phone_Quality_Score: double
Out_Remaining_Text: string
----
Input_Phone: [["(555) 555-1234","+15853042461","+911234567890"]]
Out_Phone_w_Extension_Std: [["","(585) 304-2461",""]]
Out_Phone_Std: [["","(585) 304-2461",""]]
Out_Phone_Dashes: [["","585-304-2461",""]]
Out_Phone_No_Spaces: [["","5853042461",""]]
Out_Phone_Extension: [["","",""]]
Out_Phone_Status: [["Invalid","Valid","Invalid"]]
Out_Phone_Status_Msg: [["Area code invalid.","no error","Area code invalid."]]
Out_Phone_Quality_Score: [[40,95,40]]
Out_Remaining_Text: [["(","+1","+91"]]
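The standardized layouts in the output above can be reproduced for a single value with a short sketch. This illustrates the formatting step only and is not the INFACore implementation: standardize_us_phone is a hypothetical helper, its dictionary keys are invented for this example, and it does not check whether the area code is valid, so it would also format a number such as (555) 555-1234 that the function reports as invalid.

```python
import re

def standardize_us_phone(value):
    """Return the number in three layouts, or None if it does not have 10 digits."""
    digits = re.sub(r"\D", "", value)
    if len(digits) == 11 and digits.startswith("1"):
        digits = digits[1:]  # drop the leading country code 1
    if len(digits) != 10:
        return None
    area, exchange, line = digits[:3], digits[3:6], digits[6:]
    return {
        "std": f"({area}) {exchange}-{line}",
        "dashes": f"{area}-{exchange}-{line}",
        "no_spaces": digits,
    }

formats = standardize_us_phone("+15853042461")
```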