The Hierarchy Processor transformation can convert one hierarchical input group to multiple output groups. These output groups can contain delimited flat files or relational files.
For example, you have a customer order file with customer and order information. This file is in hierarchical JSON format. You wish to create a relational customers table to update information in the master database. You also want a separate delimited orders file to see which orders have been increasing.
Defining relational output with the Hierarchy Processor transformation
To define a Hierarchy Processor transformation, use the Hierarchy Processor tab to map incoming fields to output fields, configure the output data structure, and optionally generate keys for relational output.
The following image shows the Hierarchy Processor tab with hierarchical input and relational output:
1Output format. Select Relational to convert incoming hierarchical data into one or more relational output groups.
2Input groups, incoming fields. Use these fields to map to the output fields.
3Output group, output fields. Use these fields to create the complex output file.
4Generate keys. Optionally generate keys for the input group to define relationships between output groups.
5Add incoming field to output group. Use to add fields to the output group.
6Output field names. Click on a field name to modify the field name or data type.
7Data Configuration icons. Use to configure the output groups and fields.
8Expression. Click on an expression to view or customize the output field expression.
9Add or delete output field. Use to create or delete output fields.
Tip: You can resize the Incoming Fields or Output Fields panels to better see the information.
Configuring the Hierarchy Processor transformation
You can configure the Hierarchy Processor transformation to produce relational output.
1Select the Relational output format.
2 Configure the transformation output by adding incoming fields to the output or by manually adding fields to the output.
3Optionally, generate keys for the input group to define relationships between output groups.
4Configure the output groups and fields:
aConfigure data sources.
bOptionally, customize the output field expressions.
cOptionally, add filters to the transformation.
Generating unique keys
In a mapping that converts hierarchical data to relational output, you can optionally generate unique keys for the hierarchical input group.
The following image shows the option to generate unique keys in the Hierarchy Processor tab:
When you generate unique keys, you generate a primary key for the input group and a key for every array element within the input group. Each key is a combination of a global unique ID and a value that increases for each additional field. You can map the generated keys to the output groups just like any other incoming field.
When you map key fields from an input parent element to the output group of the child data set, the output data or output group has a primary key and foreign key relationship. This relationship is generated on the output side, based on how you mapped the generated keys.
Adding incoming fields to relational output groups
You can add incoming fields to the output individually or you can add an entire input group. You can add incoming fields to an output group, array, or struct field. You can rename or delete output fields as needed.
To add an output field, hover over the incoming field or input group that you want to add and then click the Add button.
You can add fields in the following ways based on the incoming field's data type and the output type:
Add incoming field
Adds the selected incoming field to the selected output group or field.
Select "Add incoming field" to add incoming fields with primitive data types to the output.
Add input group
Adds all incoming fields in the group to the selected output group or field.
Select "Add input group" to add incoming fields with primitive data types to the output.
Add single occurring children
Adds all single occurring children under the field to the output, including single occurring child fields that are nested under a struct. Does not add single occurring children that are nested under arrays.
You can select the "Add single occurring children" option only when you've select an incoming struct or array field and the output is relational. For more information, see Add single occurring children.
Add all descendants
Adds all children under the field to the output, including all arrays and structs. If the incoming field contains arrays, Data Integration creates a separate output group for each array.
You can select the "Add all descendants" option only when you've select an incoming struct or array field and the output is relational. For more information, see Add all descendants.
Add single occurring children
You can add single occurring children when adding incoming struct and array fields to the output and the output format is relational. The "Add single occurring children" option adds all single occurring children under the field to the output group, including the single occurring child fields that are nested under a struct.
When you select "Add single occurring children", the incoming field must have child objects.
If the field you select contains an array, the array and its children are not added because an array can contain multiple elements.
Example of adding single occurring children
You want to extract customer information from the Customer array, with each output record containing information about one customer.
Select Add on the Customer array, with the Add Single Occurring Children option selected. This action adds all single occurring children under Customer, including the children under the FullAddress struct to the output group.
Note that you cannot add single occurring children at the Customers struct level, because there are no single occurring children under that top-level struct.
The following image shows the incoming and output fields:
Add all descendants
You can add all descendants when adding incoming struct and array fields to the output and the output format is relational. The "Add all descendants" option adds all children under the field to the output group, including all arrays and structs.
You can't map hierarchical data with arrays to the same output group when you select the "Add all descendants" option. If you add an incoming field that contains hierarchical data with arrays to the output group, the Hierarchy Processor transformation creates a separate output group for each array.
Example of adding all descendants
You want to add all incoming fields to relational output groups. The incoming fields contain the arrays Customer and Order.
Click Add next to the parent incoming field and select the Add All Descendants option. This action maps the fields in the Customer array to the first output group, and maps all fields in the Order array to the second output group.
The following image shows the input and output fields:
Configuring relational output groups and fields
You can create and modify output groups and output fields. When the output format is relational, you can create multiple output groups.
After you add incoming fields to an output group, you can click the output field name on the Hierarchy Processor tab to modify the fields. You can also add and define output fields manually.
The following table describes the properties for output fields:
Property
Description
Child Of
The parent field or group that this field belongs to. The name structure describes the group, parent fields, and struct name. For example: OUTGROUP.Grandparent.Parent.<struct name>
Name
The name of the current output group or field.
Type
The data type of the current field. You can choose a primitive data type.
Precision
The total number of significant digits in the field.
Scale
The number of digits to the right of the decimal point.
Array Element Type
The data type of the current array element.
Array Element Precision
The precision for the current array element. Used when creating the target data.
Array Element Scale
The scale for the current array element. Used when creating the target data.
Struct Name
The struct name for the current struct field.
Element Struct Name
The element struct name for the current array of structs field.
Description
Optional description of the field and its usage.
Note: The output field properties that appear depend on the data type.
Configuring data sources
In a Hierarchy Processor transformation, a data source identifies the input group or incoming array that populates the primitive child fields of the output group or field.
When the output is relational, the data source for the output groups is always the input group or an incoming field. For example, you add an array to the output group. If you add single occurring children, the data source for the output group is the input group. If you add all descendants, the data source for each output group is an incoming field array.
Configuring filter conditions
You can define filter conditions to project a subset of the input data in the Hierarchy Processor transformation. You can filter based on incoming fields or output fields.
You can configure a filter condition to read data from primitive fields into an output array or struct field when the data in the array or struct field must correspond to the data in a sibling field in the output group.
Filter configuration example
You want to convert relational data to a JSON file. The incoming data is in a relational table that contains orders information. The orders table contains multiple rows for each order because each order can contain several products.
The incoming data looks like the following data:
OrderNumber,ProductName,ProductType,NumberOfItems,PricePerItem 12345,M&Ms Candies Chocolate Peanut Party Size - 38 Oz,Candy,2,14.49 12345,Stella Parm Shredded Cup - 20 Oz,Dairy,1,10.99 12345,AHA Sparkling Water Blueberry Pomegranate - 8-12 Fl. Oz.,Beverages,1,3.33 23456,Weetabix Biscuit Cereal Whole Grain 2 Count - 14 Oz,Breakfast & Cereal,2,4.99 23456,Producers Milk Lowfat 1% - Half Gallon,Dairy,1,2.79 23456,Egglands Best Eggs Cage Free Large Brown - 12 Count,Eggs,1,4.99
You want to read the product details into an array, where the product details are associated with a particular order number.
The following image shows the structure of the incoming and output fields:
In the Output Fields panel, set the data source for the Output group to Input, and configure the group by field as Input.OrderNumber to remove duplicate records from the output. Set the data source for the ProductDetails array to Input.
To ensure that the details in the ProductDetails array correspond to the order number in the output, configure the following filter condition for the array:
To further refine the records, use an AND condition in the filter. For example, to exclude records in which the product type is "Candy," configure the following filter condition:
:fld.{Input.OrderNumber}= :fld.{Output.OrderNumber} AND :fld.{Input.ProductType} != 'Candy'
The output contains one record for each order, and incoming records with the product type "Candy" are excluded.
You can define expressions in the Hierarchy Processor transformation to create customized relational or hierarchical output, but not flattened output. You also use expressions to define filter conditions.
The Hierarchy Processor transformation can process information from different data sets. Some of the field names might not be unique among the different data sets. As a result, you can't simply reference the field by its name, because the same field name might be used in a different data set, or within the hierarchy of the same data set.
The syntax of the expressions in the Hierarchy Processor transformation differs from that used in the Expression transformation.
To reference a field in a Hierarchy Processor transformation, use the following syntax:
:fld.{input_group_name.field_name}.field_name
The following table describes the syntax in more detail:
Syntax part
Description
.fld.
Denotes the Hierarchy Processor transformation expression syntax.
input_group_name
Name of the input group or dataset.
field_name
Name of the field, including the full path name if it's not a top-level field.
If any field is of the type array, include the array name. If an array is primitive and has no array name, use elem as the array name.
For fields within a struct or an array, the actual field name is specified outside of the closing brace.
.field_name
Include the field_name portion only when referencing a field within a struct or an array. Follow these guidelines:
- For fields within a struct, the field_name portion uses the format: .structName.fieldname
- For fields within an array, the field_name portion uses the format: .fieldName
Configuring expressions
You can configure expressions when the output format is relational or hierarchical. For flattened output, the expression is fixed to the default format once you add the field and configuration is not possible.
To configure an expression in the Hierarchy Processor transformation, follow these steps:
1Perform one of the following actions to create the output fields:
aTo use an incoming field as an output, click the Add link that appears when you hover over the incoming field.
bTo create a new field, click New Field in the Output Fields panel near the output group name or near an array name.
2Click the expression in the Output Fields section.
For newly created fields, the link appears as Configure. The Field Expression dialog box appears.
3Enter the expression in the expression editor. You can add incoming fields, functions, and variables to the expression by clicking the Add link next to the object that you want to use. You can also type in the expression manually.
The expression can contain constants, variables, built-in functions, and user-defined functions. You can nest functions to create a complex expression.
Note: If you modify the references to any of the fields, your expression will fail.
4Click Validate.
5Correct any errors.
6Click OK.
You can close the editor and troubleshoot an invalid expression at another time.
Running a mapping with JSON data
To run a mapping that contains a Hierarchy Processor transformation with JSON-formatted data, you need to use a mapping task.
Reading JSON input
When you read JSON data, the input files can be based on a schema with multiple lines or on a schema with a single line.
The following sample shows a JSON schema on a single line:
By default, the Hierarchy Processor transformation reads each JSON schema as a single line. To read input that spans across multiple lines, you can configure the formatting options in the Source transformation to read multiple-line JSON files.
Writing JSON output
When you write JSON data, you can write each output record to a separate file, or you can write all output records to one file.
By default, each output record is written to a separate file. To write the output records to one JSON-formatted file, set the following Spark session property in the mapping task:
Session Property Name
Session Property Value
spark.sql.shuffle.partitions
1
Hierarchical to relational example
A customer order file contains the current customer contact information and the recent orders for those customers. The order file is in hierarchical JSON format and is generated by your company's cloud application. You want to process the hierarchical data and write the data to target files in relational and delimited formats.
Using the order file data, you want to create a relational customers table to use for an update on the customer information in the master database. Separately, you want to analyze the orders that have been increasing. You can use the order file to create a separate delimited orders file for the analysis.
You want to transform the data from the hierarchical input to relational and delimited output.
To create and run the mapping, perform the following tasks:
1Ensure that you have access to an Amazon S3 V2 Connector for the S3 source and target objects.
2Add a Source transformation that reads hierarchical data from the source JSON file.
3Configure the following properties for the source object:
Property
Value
Connection
Amazon S3 V2
Source Format
JSON
4Add a Hierarchy Processor transformation. The following image shows the data flow:
5Create the OutputCustomers output group to create the relational customers data file.
The following image shows how to add incoming fields, which will create the output group:
6Create the OutputOrders output group to create the delimited orders data file.
7In the Hierarchy Processor tab, map Incoming Fields to Output Fields.
You can add fields individually or use the following options for struct and array fields:
- Add All Descendants. Adds all children under the field, including all arrays and structs. If the incoming field contains arrays, Data Integration creates a separate output group for each array.
- Add Single Occurring Children. Adds all single occurring children under the field, even if the single occurring child is nested under a struct.
8Add a Target transformation to write the customers data output.
9Configure the following properties for the target object:
Property
Value
Connection
Amazon S3 V2
Formatting Option
Relational
10Add a Target transformation to write the orders data output.
11Configure the following properties for the target object:
Property
Value
Connection
Amazon S3 V2
Formatting Option
Delimited
12Link the OutputCustomers output group to the TargetCustomers Target transformation.
13Link the OutputOrders output group to the TargetOrders Target transformation.