Complex File Data Objects

Create a complex file data object with an HDFS connection to write data to HDFS sequence files or binary files.
When you create a complex file data object, read and write operations are created. To use the complex file data object as a target in streaming mappings, configure the write operation properties. You can select the mapping environment and run the mappings on the Spark engine of the Hadoop environment.
When you configure the data operation properties, specify the format in which the complex file data object writes data to the HDFS sequence file. You can also specify the binary format.
You can pass any payload format directly from source to target in streaming mappings. Project columns in binary format to pass a payload from source to target in its original form, or to pass a payload format that is not supported.
Streaming mappings can read, process, and write hierarchical data. You can use array, struct, and map complex data types to process the hierarchical data. You assign complex data types to ports in a mapping to flow hierarchical data. Ports that flow hierarchical data are called complex ports.
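For example, the following Java sketch shows how array, struct, and map types compose into a hierarchical schema on the Spark engine. The sketch uses the open-source Spark API for illustration only; it is not Informatica Developer tool syntax, and the field names are hypothetical.

    import org.apache.spark.sql.types.DataTypes;
    import org.apache.spark.sql.types.StructType;

    public class HierarchicalSchemaSketch {
        public static StructType recordSchema() {
            // A struct that nests another struct, an array, and a map:
            // the three complex data types that complex ports can carry.
            StructType address = new StructType()
                    .add("city", DataTypes.StringType)
                    .add("zip", DataTypes.StringType);
            return new StructType()
                    .add("name", DataTypes.StringType)
                    .add("address", address)                  // struct port
                    .add("phones", DataTypes.createArrayType(
                            DataTypes.StringType))            // array port
                    .add("attributes", DataTypes.createMapType(
                            DataTypes.StringType,
                            DataTypes.StringType));           // map port
        }
    }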
For more information about processing hierarchical data, see the Informatica Big Data Management User Guide.

Complex File Data Object Overview Properties

The Data Integration Service uses overview properties when it reads data from or writes data to a complex file.
Overview properties include general properties that apply to the complex file data object. They also include object properties that apply to the resources in the complex file data object. The Developer tool displays overview properties for complex files in the Overview view.

General Properties

The following table describes the general properties that you configure for complex files:
Name: The name of the complex file data object.
Description: The description of the complex file data object.
Native Name: The name of the HDFS connection.
Path Information: The path on the Hadoop file system.

Compression and Decompression for Complex File Targets

You can write compressed files, specify compression formats, and decompress files. You can use compression formats such as Bzip2 and Lz4, or specify a custom compression format.
You can compress sequence files at a record level or at a block level.
For information about how Hadoop processes compressed and uncompressed files, see the Hadoop documentation.
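To make the record and block options concrete, the following Java sketch writes a sequence file with block-level compression through the standard Hadoop API. The path, codec choice, and record content are placeholders; in a streaming mapping the Data Integration Service performs the equivalent work based on the properties you configure.

    import java.io.IOException;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.NullWritable;
    import org.apache.hadoop.io.SequenceFile;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.io.compress.BZip2Codec;

    public class SequenceFileCompressionSketch {
        public static void main(String[] args) throws IOException {
            Configuration conf = new Configuration();
            // CompressionType.RECORD compresses each record by itself.
            // CompressionType.BLOCK batches records and compresses them
            // together, which usually yields a better compression ratio.
            try (SequenceFile.Writer writer = SequenceFile.createWriter(conf,
                    SequenceFile.Writer.file(new Path("/tmp/example.seq")),
                    SequenceFile.Writer.keyClass(NullWritable.class),
                    SequenceFile.Writer.valueClass(Text.class),
                    SequenceFile.Writer.compression(
                            SequenceFile.CompressionType.BLOCK,
                            new BZip2Codec()))) {
                writer.append(NullWritable.get(), new Text("example record"));
            }
        }
    }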
The following list describes the compression formats:
None: The file is not compressed.
Auto: The Data Integration Service detects the compression format of the file based on the file extension.
DEFLATE: The DEFLATE compression format, which uses a combination of the LZ77 algorithm and Huffman coding.
Gzip: The GNU zip compression format, which uses the DEFLATE algorithm.
Bzip2: The Bzip2 compression format, which uses the Burrows–Wheeler algorithm.
Lzo: The Lzo compression format, which uses the Lempel-Ziv-Oberhumer algorithm. In a streaming mapping, the compression format is LZ4. The LZ4 compression format uses the LZ77 algorithm.
Snappy: An LZ77-type compression format with a fixed, byte-oriented encoding.
Custom: A custom compression format. If you select this option, specify the fully qualified name of a class that implements the CompressionCodec interface in the Custom Compression Codec field.
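A custom codec is typically written by extending an existing Hadoop codec rather than implementing CompressionCodec from scratch. The package and class names in this sketch are hypothetical.

    package com.example.codec;

    import org.apache.hadoop.io.compress.GzipCodec;

    // A minimal custom codec: it reuses Gzip compression but registers
    // its own file extension. GzipCodec already implements the
    // CompressionCodec interface, so this class satisfies the contract.
    public class MyGzipCodec extends GzipCodec {
        @Override
        public String getDefaultExtension() {
            return ".mygz";
        }
    }

With this class on the cluster classpath, you would enter com.example.codec.MyGzipCodec in the Custom Compression Codec field.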

Complex File Data Object Write Operation Properties

The Data Integration Service uses write operation properties when it writes data to a complex file. Select the Input transformation to edit the general, ports, targets, and run-time properties.

General Properties

The Developer tool displays general properties for complex file targets in the Write view.
The following list describes the general properties that you configure for complex file targets:
Name: The name of the complex file. You can edit the name in the Overview view. When you use the complex file as a target in a mapping, you can edit the name in the mapping.
Description: The description of the complex file.

Ports Properties

Port properties for a physical data object include port names and port attributes such as data type and precision.
The following list describes the ports properties that you configure for complex file targets:
Name: The name of the resource.
Type: The native data type of the resource.
Precision: The maximum number of significant digits for numeric data types, or the maximum number of characters for string data types.
Scale: The maximum number of digits that a column can accommodate to the right of the decimal point. Applies to decimal columns. The scale values that you can configure depend on the data type.
Detail: The detail of the data object.
Description: The description of the resource.

Target Properties

The target properties list the targets of the complex file data object.
The following list describes the target properties that you configure for complex file targets in a streaming mapping:
Target: The target that the complex file data object writes to. You can add or remove targets.

Run-time Properties

The run-time properties include the name of the connection that the Data Integration Service uses to write data to the HDFS sequence file or binary file.
You can configure dynamic partitioning or fixed partitioning.

Advanced Properties

The Developer tool displays the advanced properties for complex file targets in the Input transformation in the Write view.
The following list describes the advanced properties that you configure for complex file targets in a streaming mapping:
Operation Type: Indicates the type of data object operation. This is a read-only property.
File Directory: The directory location of the complex file target.
  If the directory is in HDFS, enter the path without the node URI. For example, /user/lib/testdir specifies the location of a directory in HDFS. The path must not contain more than 512 characters.
  If the directory is in the local system, enter the fully qualified path. For example, /user/testdir specifies the location of a directory in the local system.
  Note: The Data Integration Service ignores any subdirectories and their contents.
File Name: The name of the output file. Spark appends a unique identifier to the file name before it writes the file to HDFS.
File Format: The file format. Select one of the following options:
  - Binary. Select Binary to write any file format.
  - Sequence. Select Sequence for target files in the Hadoop sequence file format, which contain key-value pairs.
Output Format: The class name for files of the output format. If you select Output Format in the File Format field, you must specify the fully qualified name of a class that implements the OutputFormat interface. A sketch of such a class follows this list.
Output Key Class: The class name for the output key. By default, the output key class is NullWritable.
Output Value Class: The class name for the output value. By default, the output value class is Text.
Compression Format: Optional. The compression format for binary files. Select one of the following options: None, Auto, DEFLATE, gzip, bzip2, LZO, Snappy, or Custom.
Custom Compression Codec: Required for custom compression. Specify the fully qualified name of a class that implements the CompressionCodec interface.
Sequence File Compression Type: Optional. The compression type for sequence files. Select one of the following options: None, Record, or Block.
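As an illustration of the Output Format property, the following hypothetical class could be entered as com.example.output.MyOutputFormat. The sketch assumes the classic org.apache.hadoop.mapred API; which OutputFormat interface your product version expects is an assumption here, so verify it against the Informatica documentation.

    package com.example.output;

    import org.apache.hadoop.io.NullWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.TextOutputFormat;

    // A minimal custom output format that reuses TextOutputFormat, which
    // already implements the OutputFormat interface. The key and value
    // types match the default Output Key Class and Output Value Class.
    public class MyOutputFormat extends TextOutputFormat<NullWritable, Text> {
        // Override getRecordWriter() here to change how records are
        // serialized to the target file.
    }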

Column Projection Properties

The following list describes the column projection properties that you configure for complex file targets:
Column Name: The name of the column in the source table that contains data. This property is read-only.
Type: The native data type of the resource. This property is read-only.
Enable Column Projection: Indicates that you use a schema to publish the data to the target. By default, the columns are projected as the binary data type. To change the format in which the data is projected, select this option and specify the schema format.
Schema Format: The format in which you stream data to the target. Select one of the following formats: XML, JSON, or Avro.
Schema: Specify the XSD schema for the XML format, a sample JSON file for the JSON format, or a sample Avro file for the Avro format. A hypothetical JSON sample follows this list.
Column Mapping: Click View to see the mapping of the data object columns to the target columns.
Project as Hierarchical Type: Projects columns as a complex data type for hierarchical data. For more information about hierarchical data, see the Informatica Big Data Management User Guide.
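For example, if you select the JSON schema format, you provide a small sample that reflects the structure of the records you stream. The following sample is hypothetical:

    {
      "name": "jsmith",
      "address": { "city": "Boston", "zip": "02101" },
      "phones": [ "555-0100", "555-0101" ]
    }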

Complex File Execution Parameters

When you write to an HDFS complex file, you can configure how the complex file data object writes to the file. Specify these properties in the execution parameters property of the streaming mapping.
Use execution parameters to configure the following properties:
Rollover properties
When you write to an HDFS complex file, the file rollover process closes the current file that is being written to and creates a new file on the basis of file size or time. When you write to the target, you can configure a time-based rollover, a size-based rollover, or both, through optional execution parameters.
The default is size-based rollover. If you configure both rollover schemes for a target file, the event that occurs first triggers a rollover. For example, if you set the rollover time to 1 hour and the rollover size to 1 GB, the target service rolls the file over when the file reaches a size of 1 GB even if the 1-hour period has not elapsed.
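The following Java sketch shows the dual-trigger decision logic only. It is illustrative, not Informatica's internal implementation.

    // Whichever threshold is crossed first triggers the rollover.
    public class RolloverPolicy {
        private final long maxBytes;     // size threshold, for example 1 GB
        private final long maxAgeMillis; // time threshold, for example 1 hour

        public RolloverPolicy(long maxBytes, long maxAgeMillis) {
            this.maxBytes = maxBytes;
            this.maxAgeMillis = maxAgeMillis;
        }

        public boolean shouldRollOver(long bytesWritten, long openedAtMillis) {
            boolean sizeReached = bytesWritten >= maxBytes;
            boolean timeReached =
                    System.currentTimeMillis() - openedAtMillis >= maxAgeMillis;
            return sizeReached || timeReached;
        }
    }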
Pool properties
You can configure the maximum pool size that one Spark executor can use to write to a file. Use the pool.maxTotal execution parameter to specify the pool size. The default pool size is 8.
Retry Interval
You can specify the time interval during which Spark tries to create the target file or write to it if the first attempt fails. Spark makes a maximum of three attempts during the interval that you specify. Use the retryTimeout execution parameter to specify the timeout in milliseconds. The default is 30,000 milliseconds.
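For example, to allow a larger pool and a longer retry window, you could set the execution parameters to values such as the following. The values are illustrative.

    pool.maxTotal=16
    retryTimeout=60000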