You can transform your input data into a user-defined structured format based on an intelligent structure model. Use the Parse Unstructured Data function to analyze data such as log files, XML or JSON files, Word tables, and other unstructured or semi-structured formats.
The Parse Unstructured Data function uses Intelligent Structure Discovery to determine the underlying structure of the sample data file and to create a model of that structure.
Intelligent Structure Discovery creates the intelligent structure model based on a sample of your input data.
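To make the idea of deriving a structure from a sample more concrete, the following is a minimal sketch in plain Python. It is not the Intelligent Structure Discovery algorithm and does not use INFACore; the function name infer_structure and the sample record are hypothetical and exist only for illustration.

```python
import json


def infer_structure(sample):
    """Describe a parsed JSON value by the names of its Python types.

    Objects become dictionaries of field descriptions, arrays are
    described by their first element, and scalars by their type name.
    """
    if isinstance(sample, dict):
        return {key: infer_structure(value) for key, value in sample.items()}
    if isinstance(sample, list):
        return [infer_structure(sample[0])] if sample else []
    return type(sample).__name__


# Hypothetical sample record, not taken from any INFACore data set.
sample = json.loads('{"user": {"id": 7, "name": "Ann"}, "tags": ["a"]}')
model = infer_structure(sample)
print(model)
```

The point of the sketch is only that a single representative sample can be enough to describe field names, nesting, and value types, which is the kind of information a structure model captures.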
You can create models from the following input types:
•Text files, including delimited files such as CSV files and complex files that contain textual hierarchies
•Machine-generated files such as weblogs and clickstreams
•JSON files
•XML files
•ORC files
•Avro files
•Parquet files
•Microsoft Excel files
•Data within PDF form fields
•Data within Microsoft Word tables
•XSD files
You can refine the intelligent structure model and customize the structure of the output data. You can edit the nodes in the model to combine, exclude, flatten, or collapse them.
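As a rough illustration of what excluding and flattening nodes does to the output, the sketch below applies both operations to a nested record in plain Python. It assumes a dictionary-based record and dotted column names; the helper names flatten and exclude and the sample record are hypothetical, and the actual model editor may behave differently.

```python
def flatten(node, prefix=""):
    """Flatten nested dictionaries into one level with dotted keys."""
    flat = {}
    for key, value in node.items():
        path = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            # Recurse into the child node and merge its flattened fields.
            flat.update(flatten(value, path))
        else:
            flat[path] = value
    return flat


def exclude(node, unwanted):
    """Return a copy of the node without the named top-level fields."""
    return {key: value for key, value in node.items() if key not in unwanted}


# Hypothetical log record used only for illustration.
record = {
    "host": "web01",
    "request": {"method": "GET", "path": "/index"},
    "debug": "x",
}
row = flatten(exclude(record, {"debug"}))
print(row)
```

After both steps the nested request node becomes two flat columns and the excluded field no longer appears, which is the shape a tabular output needs.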
The function processes input from the source according to the intelligent structure model that you select. When you select the function, you associate it with the intelligent structure model. To process local input files, select a data source based on a flat file.
Configure the structure parser function
INFACore parses unstructured or semi-structured data using the Intelligent Structure Discovery (ISD) JAR files, which are bundled with the INFACore installation.
To parse data, select the Parse Unstructured Data function, and specify the following fields:
•New DataFrame name. Specify a name for the new DataFrame. A DataFrame is a two-dimensional data structure in which data is arranged in rows and columns.
•Schema file path. Specify the file path to the sample schema file.
•Input file path. Specify the file path to the source file that contains the unstructured data.
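Before applying the function, it can help to confirm that the three field values are usable. The following is a minimal sketch under the assumption that both paths point to local files; the ParserInputs class and its problems method are hypothetical helpers, not part of INFACore.

```python
from dataclasses import dataclass
from pathlib import Path
from typing import List


@dataclass
class ParserInputs:
    """Hypothetical container for the three fields listed above."""

    dataframe_name: str
    schema_file_path: Path
    input_file_path: Path

    def problems(self) -> List[str]:
        """Return human-readable problems; an empty list means ready."""
        issues = []
        if not self.dataframe_name.strip():
            issues.append("New DataFrame name is empty")
        if not Path(self.schema_file_path).is_file():
            issues.append(f"Schema file not found: {self.schema_file_path}")
        if not Path(self.input_file_path).is_file():
            issues.append(f"Input file not found: {self.input_file_path}")
        return issues


# Hypothetical values; replace them with your own file locations.
inputs = ParserInputs(
    dataframe_name="parser_data",
    schema_file_path=Path("sample_schema.txt"),
    input_file_path=Path("json_input.json"),
)
for issue in inputs.problems():
    print(issue)
```

Checking the paths up front separates a simple typo in a file location from a genuine parsing problem.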
Example 1
The following image is a snapshot of the unstructured data in JSON format in the json_input.json file that you want to parse:
Provide the path to the sample schema file, sample_schema.txt, that you want INFACore to refer to when it parses the unstructured data:
The following sample Python code is displayed when you apply the parser function with the input file and sample schema file:
import informatica.infacore as ic

pf = ic.ParserFunctions()
parser_data = pf.parse_unstructured_data(
    "C:\\Users\\John\\Documents\\FF_SOURCES\\json_input.json",
    "C:\\Users\\John\\Documents\\FF_SOURCES\\sample_schema.txt",
)
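The doubled backslashes in the sample paths are Python string escapes, not part of the paths themselves. As a hedged alternative, the sketch below builds the same two strings with a raw string and the standard pathlib module; the variable names are illustrative, and the sketch does not call INFACore.

```python
from pathlib import PureWindowsPath

# A raw string (r"...") keeps single backslashes literal, so no doubling is needed.
base = PureWindowsPath(r"C:\Users\John\Documents\FF_SOURCES")

# Joining with "/" appends a file name using the Windows separator.
input_file = str(base / "json_input.json")
schema_file = str(base / "sample_schema.txt")

print(input_file)
print(schema_file)
```

The resulting strings are identical to the escaped literals in the sample code, so they could be passed to the same call in the same order.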
To apply a pandas function, invoke the Python SDK to convert the INFACore DataFrame to a pandas DataFrame and return the rows: