Parse unstructured data

You can transform your input data into a user-defined structured format based on an intelligent structure model. Use the Parse Unstructured Data function to analyze unstructured or semi-structured data such as log files, XML files, JSON files, and Word tables.

To parse unstructured data, use the Parse Unstructured Data function. The function uses Intelligent Structure Discovery to determine the underlying structure of the sample data file and creates a model of the structure.

Intelligent Structure Discovery creates the intelligent structure model based on a sample of your input data.

You can create models from the following input types:

•Text files, including delimited files such as CSV files and complex files that contain textual hierarchies
•Machine generated files such as weblogs and clickstreams
•JSON files
•XML files
•ORC files
•Avro files
•Parquet files
•Microsoft Excel files
•Data within PDF form fields
•Data within Microsoft Word tables
•XSD files

You can refine the intelligent structure model and customize the structure of the output data. You can edit the nodes in the model to combine, exclude, flatten, or collapse them.
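To make the flatten operation more concrete, the following minimal sketch shows what flattening a nested record means in general terms. It is plain Python and does not use Intelligent Structure Discovery or any INFACore API. The function name flatten_record and the sample record are hypothetical and exist only for illustration.

```python
def flatten_record(record, parent_key="", sep="."):
    """Flatten nested dictionaries into one level with dotted keys.

    Hypothetical helper for illustration only; not part of INFACore.
    """
    flat = {}
    for key, value in record.items():
        full_key = f"{parent_key}{sep}{key}" if parent_key else key
        if isinstance(value, dict):
            # Recurse into the child node and lift its fields to the top level.
            flat.update(flatten_record(value, full_key, sep))
        else:
            flat[full_key] = value
    return flat


# Hypothetical nested record, similar in shape to a hierarchical JSON node.
nested = {"state": "PA", "plan": {"intl": "no", "vmail": "yes"}}
flat = flatten_record(nested)
print(flat)
```

In the flattened result, the child fields of the plan node become top-level columns, which is the general effect that flattening a node has on the output structure.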

You can process input from the source efficiently and seamlessly based on the intelligent structure model that you select. When you select the function, you associate it with the intelligent structure model. Select a data source based on a flat file to process local input files.

Configure the structure parser function

INFACore parses unstructured or semi-structured data using the Intelligent Structure Discovery (ISD) JAR files, which are bundled with the INFACore installation.

To parse data, select the Parse Unstructured Data function, and specify the following fields:

•New DataFrame Name. Specify a name for the new DataFrame. A DataFrame is a two-dimensional data structure in which data is arranged in rows and columns.
•Schema file path. Specify the file path to the sample schema file.
•Input file path. Specify the file path of the source file that contains the unstructured data.
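Because both path fields must point to existing files, it can help to check them before you apply the function. The following minimal sketch uses only the Python standard library. The helper name check_parser_inputs is hypothetical and is not part of INFACore, and the temporary files stand in for your real input and schema files.

```python
import tempfile
from pathlib import Path


def check_parser_inputs(input_path, schema_path):
    """Return a list of problems with the two file paths (empty if none).

    Hypothetical helper for illustration only; not part of INFACore.
    """
    problems = []
    checks = (("Input file path", input_path), ("Schema file path", schema_path))
    for label, raw_path in checks:
        path = Path(raw_path)
        if not path.is_file():
            problems.append(f"{label} does not point to a file: {path}")
        elif path.stat().st_size == 0:
            problems.append(f"{label} points to an empty file: {path}")
    return problems


# Demonstration with temporary files in place of real inputs.
with tempfile.TemporaryDirectory() as tmp:
    input_file = Path(tmp) / "json_input.json"
    input_file.write_text('{"example": true}', encoding="utf-8")
    missing_schema = Path(tmp) / "sample_schema.txt"  # deliberately not created
    problems = check_parser_inputs(input_file, missing_schema)
    print(problems)
```

An empty list means that both files exist and contain data. The check says nothing about whether the schema file is a valid sample for the input.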

Example

The following image is a snapshot of the unstructured data in JSON format in the json_input.json file that you want to parse:

The input file contains data in unstructured format.

Provide the path to the sample schema file, sample_schema.txt, that you want INFACore to refer to when it parses the unstructured data:

The following sample Python code is displayed when you apply the parser function with the input file and the sample schema file:

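import informatica.infacore as ic

# Create the parser functions object.
pf = ic.ParserFunctions()

# Pass the input file path first, followed by the sample schema file path.
parser_data = pf.parse_unstructured_data(
    "C:\\Users\\John\\Documents\\FF_SOURCES\\json_input.json",
    "C:\\Users\\John\\Documents\\FF_SOURCES\\sample_schema.txt")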

To use pandas functions, invoke the Python SDK to convert the INFACore DataFrame to a pandas DataFrame, and then return the first rows:

df_reader = ic.DataFrameReader(parser_data)
p_df = df_reader.to_pandas()
p_df.head()

For more information, see the INFACore SDK Reference for Python.

When you run the code, the structure parser function returns data in a structured format:

State	Account Length	Area Code	Phone	Int'l Plan	VMail Plan	VMail Message	token	Mins	Calls	Charge	CustServ Calls	Churn
PA	163	806	403-2562	no	yes	300	Day	8.162204	3	7.579174	3	True.
PA	163	806	403-2562	no	yes	300	Eve	3.933035	4	6.508639	3	True.
PA	163	806	403-2562	no	yes	300	Night	4.065759	100	5.111624	3	True.
PA	163	806	403-2562	no	yes	300	Intl	4.92816	6	5.673203	3	True.
SC	15	836	158-8416	yes	no	0	Day	10.018993	4	4.226289	8	False.
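The output contains one row for each token value (Day, Eve, Night, Intl) of an account. As a minimal sketch of what you might do next with the pandas DataFrame, the following code rebuilds a subset of the columns and rows shown above and pivots the Charge values so that each account occupies a single row. It uses only standard pandas calls. The variable names charge_by_token and total_charge are illustrative, and the DataFrame is constructed by hand here rather than returned by INFACore.

```python
import pandas as pd

# Subset of the columns and rows from the output shown above.
rows = [
    ("PA", "403-2562", "Day", 8.162204, 3, 7.579174),
    ("PA", "403-2562", "Eve", 3.933035, 4, 6.508639),
    ("PA", "403-2562", "Night", 4.065759, 100, 5.111624),
    ("PA", "403-2562", "Intl", 4.92816, 6, 5.673203),
    ("SC", "158-8416", "Day", 10.018993, 4, 4.226289),
]
columns = ["State", "Phone", "token", "Mins", "Calls", "Charge"]
p_df = pd.DataFrame(rows, columns=columns)

# One row per account, with one Charge column per token value.
# Token values that are missing for an account appear as NaN.
charge_by_token = p_df.pivot_table(
    index=["State", "Phone"], columns="token", values="Charge"
)
print(charge_by_token)

# Total charge per phone number across all token rows.
total_charge = p_df.groupby("Phone")["Charge"].sum()
print(total_charge)
```

In your own notebook, you would apply the same calls to the p_df DataFrame returned by to_pandas() instead of building the rows by hand.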