Parse unstructured data

You can transform your input data into a user-defined structured format based on an intelligent structure model. Use the Parse Unstructured Data function to analyze unstructured or semi-structured data such as log files, XML or JSON files, and Word tables.
The function uses Intelligent Structure Discovery to determine the underlying structure of a sample of your input data and creates an intelligent structure model from that sample.
You can create models from the following input types:
You can refine the intelligent structure model and customize the structure of the output data. You can edit the nodes in the model to combine, exclude, flatten, or collapse them.
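To make the terms "flatten" and "exclude" concrete, the following minimal sketch applies them to a nested Python dictionary. It is only an illustration of the concepts and does not show how Intelligent Structure Discovery edits a model internally. The helper names flatten and exclude, the sample record, and the underscore separator are all hypothetical.

```python
# Illustrative only: mimics the idea of flattening nested nodes into
# top-level fields and excluding unwanted ones. Not the ISD implementation.

def flatten(record, parent_key="", sep="_"):
    """Return a flat dict whose keys join the nested keys with `sep`."""
    flat = {}
    for key, value in record.items():
        full_key = f"{parent_key}{sep}{key}" if parent_key else key
        if isinstance(value, dict):
            # Recurse into nested nodes and lift their fields to the top level.
            flat.update(flatten(value, full_key, sep))
        else:
            flat[full_key] = value
    return flat


def exclude(record, unwanted):
    """Return a copy of `record` without the keys listed in `unwanted`."""
    return {k: v for k, v in record.items() if k not in unwanted}


# Hypothetical nested input record.
sample = {
    "State": "PA",
    "Usage": {"Day": {"Mins": 8.16, "Calls": 3}},
    "Internal": {"id": 7},
}

flat = flatten(sample)
trimmed = exclude(flat, {"Internal_id"})
```

In this sketch, flat contains one top-level field per leaf value, and trimmed is the same record without the Internal_id field.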
INFACore processes input from the source based on the intelligent structure model that you select. When you select the function, you associate it with the intelligent structure model. To process local input files, select a data source based on a flat file.

Configure the structure parser function

INFACore parses unstructured or semi-structured data using the Intelligent Structure Discovery (ISD) JAR files, which are bundled with the INFACore installation.
To parse data, select the Parse Unstructured Data function, and specify the following fields:
Example
The following image is a snapshot of the unstructured data in JSON format in the json_input.json file that you want to parse:
The input file contains data in unstructured format.
Provide the path to the sample schema file, sample_schema.txt, that you want INFACore to refer to when it parses the unstructured data:
You can view the sample schema file.
The following sample Python code is displayed when you apply the parser function to the input file and the sample schema file:
import informatica.infacore as ic

# Create the parser functions object.
pf = ic.ParserFunctions()
# Pass the input file path first, then the sample schema file path.
parser_data = pf.parse_unstructured_data("C:\\Users\\John\\Documents\\FF_SOURCES\\json_input.json",
                                         "C:\\Users\\John\\Documents\\FF_SOURCES\\sample_schema.txt")
To use pandas functions on the result, invoke the Python SDK to convert the INFACore DataFrame to a pandas DataFrame, and then return the first rows:
df_reader = ic.DataFrameReader(parser_data)
p_df = df_reader.to_pandas()
p_df.head()
For more information, see the INFACore SDK Reference for Python.
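The sample code above passes Windows paths with doubled backslashes. As a minimal sketch, the same two path strings can be built with the standard pathlib module and a raw string, which avoids manual escaping. The variable names base_dir, input_path, and schema_path are hypothetical, and the INFACore call is left as a comment because it requires an INFACore installation.

```python
from pathlib import PureWindowsPath

# Folder from the example above, written as a raw string so that
# backslashes do not need to be doubled.
base_dir = PureWindowsPath(r"C:\Users\John\Documents\FF_SOURCES")

# Join the file names onto the folder and convert back to plain strings.
input_path = str(base_dir / "json_input.json")
schema_path = str(base_dir / "sample_schema.txt")

# With INFACore installed, the call from the sample code would then be:
# parser_data = pf.parse_unstructured_data(input_path, schema_path)
```

PureWindowsPath only manipulates the path text and does not touch the file system, so the sketch produces the same strings on any operating system.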
When you run the code, the structure parser function returns data in a structured format:
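| State | Account Length | Area Code | Phone    | Int'l Plan | VMail Plan | VMail Message | token | Mins      | Calls | Charge   | CustServ Calls | Churn  |
|-------|----------------|-----------|----------|------------|------------|---------------|-------|-----------|-------|----------|----------------|--------|
| PA    | 163            | 806       | 403-2562 | no         | yes        | 300           | Day   | 8.162204  | 3     | 7.579174 | 3              | True.  |
| PA    | 163            | 806       | 403-2562 | no         | yes        | 300           | Eve   | 3.933035  | 4     | 6.508639 | 3              | True.  |
| PA    | 163            | 806       | 403-2562 | no         | yes        | 300           | Night | 4.065759  | 100   | 5.111624 | 3              | True.  |
| PA    | 163            | 806       | 403-2562 | no         | yes        | 300           | Intl  | 4.92816   | 6     | 5.673203 | 3              | True.  |
| SC    | 15             | 836       | 158-8416 | yes        | no         | 0             | Day   | 10.018993 | 4     | 4.226289 | 8              | False. |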
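In the sample output above, each customer appears to span several rows, one per token value (Day, Eve, Night, Intl), and the Churn values carry a trailing period. The following minimal sketch shows one way to post-process such rows in plain Python: it converts Churn to a boolean and collects the per-token values into one record per phone number. The helper names churn_to_bool and pivot_by_phone are hypothetical, the rows are copied from the sample output with only the columns used here, and treating Phone as the customer key is an assumption.

```python
from collections import defaultdict

# Rows copied from the sample output above (subset of columns).
rows = [
    {"Phone": "403-2562", "token": "Day", "Mins": 8.162204, "Charge": 7.579174, "Churn": "True."},
    {"Phone": "403-2562", "token": "Eve", "Mins": 3.933035, "Charge": 6.508639, "Churn": "True."},
    {"Phone": "403-2562", "token": "Night", "Mins": 4.065759, "Charge": 5.111624, "Churn": "True."},
    {"Phone": "403-2562", "token": "Intl", "Mins": 4.92816, "Charge": 5.673203, "Churn": "True."},
    {"Phone": "158-8416", "token": "Day", "Mins": 10.018993, "Charge": 4.226289, "Churn": "False."},
]


def churn_to_bool(value):
    """Convert 'True.' or 'False.' (with trailing period) to a bool."""
    return value.strip().rstrip(".").lower() == "true"


def pivot_by_phone(records):
    """Collect the per-token rows into one dict per phone number."""
    customers = defaultdict(dict)
    for rec in records:
        entry = customers[rec["Phone"]]
        entry["Churn"] = churn_to_bool(rec["Churn"])
        # Prefix each measure with its token, for example "Day Mins".
        entry[f"{rec['token']} Mins"] = rec["Mins"]
        entry[f"{rec['token']} Charge"] = rec["Charge"]
    return dict(customers)


wide = pivot_by_phone(rows)
```

On a pandas DataFrame such as p_df, the DataFrame.pivot_table method offers a comparable reshape.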