Complex File Data Objects
Create a complex file data object with an HDFS connection to write data to HDFS sequence files or binary files.
When you create a complex file data object, read and write operations are created. To use the complex file data object as a target in streaming mappings, configure the complex file data object write operation properties. You can select the mapping environment and run the mappings on the Spark engine of the Hadoop environment.
When you configure the data operation properties, specify the format in which the complex file data object writes data to the HDFS sequence file. You can also specify binary as the format.
You can pass any payload format directly from source to target in streaming mappings. Project columns in binary format to pass a payload from source to target in its original form, or to pass a payload format that is not supported.
Streaming mappings can read, process, and write hierarchical data. You can use array, struct, and map complex data types to process the hierarchical data. You assign complex data types to ports in a mapping to flow hierarchical data. Ports that flow hierarchical data are called complex ports.
For more information about processing hierarchical data, see the Informatica Big Data Management User Guide.
Complex File Data Object Overview Properties
The Data Integration Service uses overview properties when it reads data from or writes data to a complex file.
Overview properties include general properties that apply to the complex file data object. They also include object properties that apply to the resources in the complex file data object. The Developer tool displays overview properties for complex files in the Overview view.
General Properties
The following table describes the general properties that you configure for complex files:
Property | Description |
---|---|
Name | The name of the complex file data object. |
Description | The description of the complex file data object. |
Native Name | Name of the HDFS connection. |
Path Information | The path on the Hadoop file system. |
Compression and Decompression for Complex File Targets
You can write compressed files, specify compression formats, and decompress files. You can use compression formats such as Bzip2 and Lz4, or specify a custom compression format.
You can compress sequence files at a record level or at a block level.
For information about how Hadoop processes compressed and uncompressed files, see the Hadoop documentation.
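For illustration, the following sketch shows how the underlying Hadoop API distinguishes record-level from block-level compression when it writes a sequence file. This is a plain Hadoop example, not Informatica code; the file path, key class, value class, and codec are assumed example values.

```java
// Plain Hadoop illustration of sequence file compression types.
// The path, key class, value class, and codec are example values only.
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.compress.GzipCodec;

public class SequenceFileCompressionExample {
    public static void main(String[] args) throws IOException {
        Configuration conf = new Configuration();
        try (SequenceFile.Writer writer = SequenceFile.createWriter(conf,
                SequenceFile.Writer.file(new Path("/tmp/example.seq")),
                SequenceFile.Writer.keyClass(NullWritable.class),
                SequenceFile.Writer.valueClass(Text.class),
                // CompressionType.BLOCK compresses batches of records together;
                // use CompressionType.RECORD to compress each record separately.
                SequenceFile.Writer.compression(
                        SequenceFile.CompressionType.BLOCK, new GzipCodec()))) {
            writer.append(NullWritable.get(), new Text("example record"));
        }
    }
}
```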
The following table describes the compression formats:
Compression Options | Description |
---|---|
None | The file is not compressed. |
Auto | The Data Integration Service detects the compression format of the file based on the file extension. |
DEFLATE | The DEFLATE compression format that uses a combination of the LZ77 algorithm and Huffman coding. |
Gzip | The GNU zip compression format that uses the DEFLATE algorithm. |
Bzip2 | The Bzip2 compression format that uses the Burrows–Wheeler algorithm. |
Lzo | The Lzo compression format that uses the Lempel-Ziv-Oberhumer algorithm. In a streaming mapping, the compression format is LZ4. The LZ4 compression format uses the LZ77 algorithm. |
Snappy | The LZ77-type compression format with a fixed, byte-oriented encoding. |
Custom | Custom compression format. If you select this option, you must specify the fully qualified class name implementing the CompressionCodec interface in the Custom Compression Codec field. |
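As an illustration of what the Custom option expects, the following sketch shows a hypothetical codec class; its fully qualified name (here com.example.compress.MyCodec, an assumed name) is what you would enter in the Custom Compression Codec field. The class must be available on the cluster classpath and implement the Hadoop CompressionCodec interface.

```java
// Hypothetical custom codec. Extending DefaultCodec reuses its DEFLATE-based
// implementation of the CompressionCodec interface; a fully custom codec would
// instead implement createOutputStream, createCompressor, createInputStream,
// createDecompressor, getDefaultExtension, and the remaining interface methods.
package com.example.compress;

import org.apache.hadoop.io.compress.DefaultCodec;

public class MyCodec extends DefaultCodec {
    @Override
    public String getDefaultExtension() {
        // File name extension that Hadoop appends to compressed output.
        return ".mycodec";
    }
}
```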
Complex File Data Object Write Operation Properties
The Data Integration Service uses write operation properties when it writes data to a complex file. Select the Input transformation to edit the general, ports, targets, and run-time properties.
General Properties
The Developer tool displays general properties for complex file targets in the Write view.
The following table describes the general properties that you configure for complex file targets:
Property | Description |
---|---|
Name | The name of the complex file. You can edit the name in the Overview view. When you use the complex file as a target in a mapping, you can edit the name in the mapping. |
Description | The description of the complex file. |
Ports Properties
Port properties for a physical data object include port names and port attributes such as data type and precision.
The following table describes the ports properties that you configure for complex file targets:
Property | Description |
---|---|
Name | The name of the resource. |
Type | The native data type of the resource. |
Precision | The maximum number of significant digits for numeric data types, or the maximum number of characters for string data types. |
Scale | The scale for each column. Scale is the maximum number of digits that a column can accommodate to the right of the decimal point. Applies to decimal columns. The scale values you configure depend on the data type. |
Detail | The detail of the data object. |
Description | The description of the resource. |
Target Properties
The target properties list the targets of the complex file data object.
The following table describes the target properties that you configure for complex file targets in a streaming mapping:
Property | Description |
---|---|
Target | The target that the complex file data object writes to. You can add or remove targets. |
Run-time Properties
The run-time properties include the name of the connection that the Data Integration Service uses to write data to the HDFS sequence file or binary file.
You can configure dynamic partitioning or fixed partitioning.
Advanced Properties
The Developer tool displays the advanced properties for complex file targets in the Input transformation in the Write view.
The following table describes the advanced properties that you configure for complex file targets in a streaming mapping:
Property | Description |
---|---|
Operation Type | Indicates the type of data object operation. This is a read-only property. |
File Directory | The directory location of the complex file target. If the directory is in HDFS, enter the path without the node URI. For example, /user/lib/testdir specifies the location of a directory in HDFS. The path must not contain more than 512 characters. If the directory is in the local system, enter the fully qualified path. For example, /user/testdir specifies the location of a directory in the local system. Note: The Data Integration Service ignores any subdirectories and their contents. |
File Name | The name of the output file. Spark appends the file name with a unique identifier before it writes the file to HDFS. |
File Format | The file format. Select one of the following file formats: Binary, to write any file format, or Sequence File Format, for target files of a specific format that contain key and value pairs. |
Output Format | The class name for files of the output format. If you select Output Format in the File Format field, you must specify the fully qualified class name implementing the OutputFormat interface. |
Output Key Class | The class name for the output key. By default, the output key class is NullWritable. |
Output Value Class | The class name for the output value. By default, the output value class is Text. |
Compression Format | Optional. The compression format for binary files. Select one of the following options: None, Auto, DEFLATE, Gzip, Bzip2, LZO, Snappy, or Custom. |
Custom Compression Codec | Required for custom compression. Specify the fully qualified class name implementing the CompressionCodec interface. |
Sequence File Compression Type | Optional. The compression format for sequence files. Select one of the following options: None, Record, or Block. |
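The Output Key Class and Output Value Class values are recorded in the header of the sequence file that the mapping writes. As a hedged illustration with an assumed file path, the following plain Hadoop sketch reads a written target file back and prints its key class (NullWritable by default) and value class (Text by default).

```java
// Plain Hadoop illustration: inspect a written sequence file target.
// The path is an example value; the key and value classes reflect the
// defaults described above (NullWritable and Text).
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class InspectSequenceFileTarget {
    public static void main(String[] args) throws IOException {
        Configuration conf = new Configuration();
        try (SequenceFile.Reader reader = new SequenceFile.Reader(conf,
                SequenceFile.Reader.file(new Path("/user/lib/testdir/output.seq")))) {
            System.out.println("Key class:   " + reader.getKeyClassName());
            System.out.println("Value class: " + reader.getValueClassName());

            NullWritable key = NullWritable.get();
            Text value = new Text();
            while (reader.next(key, value)) {   // iterate over key-value pairs
                System.out.println(value);
            }
        }
    }
}
```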
Column Projection Properties
The following table describes the column projection properties that you configure for complex file targets:
Property | Description |
---|---|
Column Name | The name of the column in the source table that contains data. This property is read-only. |
Type | The native data type of the resource. This property is read-only. |
Enable Column Projection | Indicates that you use a schema to publish the data to the target. By default, the columns are projected as binary data type. To change the format in which the data is projected, select this option and specify the schema format. |
Schema Format | The format in which you stream data to the target. Select one of the following formats: XML, JSON, or Avro. |
Schema | Specify the XSD schema for the XML format, the sample JSON for the JSON format, or sample Avro file for the Avro format. |
Column Mapping | Click View to see the mapping of the data object to the target. |
Project as Hierarchical Type | Project columns as complex data type for hierarchical data. For more information on hierarchical data, see the Informatica Big Data Management User Guide. |
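The Schema property takes an XSD, a sample JSON, or a sample Avro file, depending on the schema format you select. As a hedged illustration for the Avro format, the following sketch defines a minimal Avro schema of the kind such a sample file is built from and parses it with the Avro Java library to confirm it is well formed. The record name Event and its fields are assumed example values, not a schema that Informatica provides.

```java
// Minimal, hypothetical Avro schema of the kind associated with the
// Avro schema format. The record and field names are example values.
import org.apache.avro.Schema;

public class AvroSchemaExample {
    private static final String SCHEMA_JSON =
          "{"
        + "  \"type\": \"record\","
        + "  \"name\": \"Event\","
        + "  \"fields\": ["
        + "    {\"name\": \"id\",      \"type\": \"string\"},"
        + "    {\"name\": \"payload\", \"type\": \"string\"}"
        + "  ]"
        + "}";

    public static void main(String[] args) {
        // Parsing fails fast if the schema is malformed.
        Schema schema = new Schema.Parser().parse(SCHEMA_JSON);
        System.out.println(schema.toString(true));
    }
}
```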
Complex File Execution Parameters
When you write to an HDFS complex file, you can configure how the complex file data object writes to the file. Specify these properties in the execution parameters property of the streaming mapping.
Use execution parameters to configure the following properties:
- Rollover properties
- When you write to an HDFS complex file, the file rollover process closes the current file that is being written to and creates a new file based on file size or time. When you write to the target, you can configure a time-based rollover or a size-based rollover. You can use the following optional execution parameters to configure rollover:
- - rolloverTime. You can configure a rollover of the HDFS file when a certain period of time has elapsed. Specify rollover time in hours. For example, you can specify a value of 1.
- - rolloverSize. You can configure a rollover of the HDFS target file when the target file reaches a certain size. Specify the size in GB. For example, you can specify a value of 1.
- The default is size-based rollover. You can implement both rollover schemes for a target file, in which case the event that occurs first triggers a rollover. For example, if you set the rollover time to 1 hour and the rollover size to 1 GB, the target service rolls the file over when the file reaches 1 GB even if the 1-hour period has not elapsed. The sketch after this list illustrates this first-event-wins behavior.
- Pool properties
- You can configure the maximum pool size that one Spark executor can have to write to a file. Use the pool.maxTotal execution parameter to specify the pool size. Default pool size is 8.
- Retry Interval
- You can specify the time interval for which Spark tries to create the target file or write to it if it fails to do so the first time. Spark tries a maximum of three times during the time interval that you specify. Use the retryTimeout execution parameter to specify the timeout in milliseconds. Default is 30,000 milliseconds.
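The following sketch illustrates the first-event-wins rollover behavior described above. It is illustrative logic only, not Informatica or Spark internals; the thresholds mirror the example values rolloverTime=1 (hour) and rolloverSize=1 (GB).

```java
// Illustrative only: how combined time-based and size-based rollover
// conditions interact. Not Informatica or Spark internals.
import java.time.Duration;
import java.time.Instant;

public class RolloverCheck {
    // Example thresholds: rolloverTime=1 (hour) and rolloverSize=1 (GB).
    static final Duration ROLLOVER_TIME = Duration.ofHours(1);
    static final long ROLLOVER_SIZE_BYTES = 1L * 1024 * 1024 * 1024;

    static boolean shouldRollOver(Instant fileOpenedAt, long bytesWritten) {
        boolean timeElapsed = Duration.between(fileOpenedAt, Instant.now())
                .compareTo(ROLLOVER_TIME) >= 0;
        boolean sizeReached = bytesWritten >= ROLLOVER_SIZE_BYTES;
        // Whichever condition is satisfied first triggers the rollover.
        return timeElapsed || sizeReached;
    }
}
```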