Hadoop Files

You can use Hadoop Files V2 Connector to securely read data from and write data to complex files on the local file system or on HDFS (Hadoop Distributed File System). You can read or write structured, semi-structured, and unstructured data.

Feature snapshot

Operation    Support
Read         Yes
Write        Yes

Connection properties

The following table describes the Hadoop Files connection properties:
Connection property
Description
Connection Name
Name of the connection.
Each connection name must be unique within the organization. Connection names can contain alphanumeric characters, spaces, and the following special characters: _ . + -,
Maximum length is 255 characters.
User Name
Required to read data from HDFS. Enter a user name that has access to the single-node HDFS location to read data from or write data to.
NameNode URI
The URI to access HDFS.
Use the following format to specify the name node URI in Cloudera, Amazon EMR, and Hortonworks distributions:
hdfs://<namenode>:<port>/
Where
  • - <namenode> is the host name or IP address of the name node.
  • - <port> is the port on which the name node listens for remote procedure calls (RPC).
If the Hadoop cluster is configured for high availability, you must copy the fs.defaultFS value in the core-site.xml file and append / to specify the name node URI.
For example, the following snippet shows the fs.defaultFS value in a sample core-site.xml file:
<property>
<name>fs.defaultFS</name>
<value>hdfs://nameservice1</value>
<source>core-site.xml</source>
</property>
In the above snippet, the fs.defaultFS value is
hdfs://nameservice1
and the corresponding name node URI is
hdfs://nameservice1/
Note: Specify either the name node URI or the local path. Do not specify the name node URI if you want to read data from or write data to a local file system path.
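The high-availability rule above (copy the fs.defaultFS value and append /) can be sketched in Python. This is a minimal illustration, not part of the connector: the function name name_node_uri_from_core_site and the inline sample XML are assumptions made for the example, and a real core-site.xml would be read from the cluster's configuration directory.

```python
import xml.etree.ElementTree as ET

# Sample content modeled on the snippet above, wrapped in the usual
# <configuration> root element of a core-site.xml file.
SAMPLE_CORE_SITE = """<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://nameservice1</value>
  </property>
</configuration>"""


def name_node_uri_from_core_site(xml_text: str) -> str:
    """Return the fs.defaultFS value with exactly one trailing slash."""
    root = ET.fromstring(xml_text)
    for prop in root.iter("property"):
        if prop.findtext("name", default="").strip() == "fs.defaultFS":
            value = prop.findtext("value", default="").strip()
            if not value:
                raise ValueError("fs.defaultFS has an empty value")
            # Append / as the documentation requires, without doubling it.
            return value.rstrip("/") + "/"
    raise KeyError("fs.defaultFS not found in core-site.xml content")


print(name_node_uri_from_core_site(SAMPLE_CORE_SITE))  # hdfs://nameservice1/
```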
Local Path
A local file system path to read and write data. Read the following conditions to specify the local path:
  • - You must enter NA in local path if you specify the name node URI. If the local path does not contain NA, the name node URI does not work.
  • - If you specify both the name node URI and the local path, the local path takes precedence. The connection uses the local path to run all tasks.
  • - If you leave the local path blank, the agent configures the root directory (/) in the connection. The connection uses the local path to run all tasks.
  • - If the file or directory is in the local system, enter the fully qualified path of the file or directory. For example, /user/testdir specifies the location of a directory in the local system.
Default value for Local Path is NA.
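The interaction between Local Path and NameNode URI described above can be summarized as a small decision function. This is only a sketch of the documented rules. The function name resolve_storage_target is hypothetical, and the exact, case-sensitive comparison with NA is an assumption, since the documentation does not say how the agent compares the value.

```python
def resolve_storage_target(name_node_uri: str, local_path: str = "NA") -> str:
    """Return the location the connection would use, per the documented rules."""
    path = local_path.strip()
    if path == "":
        # A blank local path makes the agent use the local root directory.
        return "/"
    if path != "NA":
        # Any real local path takes precedence over the name node URI.
        return path
    # Local path is NA, so the name node URI is the one that applies.
    uri = name_node_uri.strip()
    if not uri:
        raise ValueError("Specify a name node URI or a local path")
    return uri


print(resolve_storage_target("hdfs://nameservice1/", "NA"))
print(resolve_storage_target("hdfs://nameservice1/", "/user/testdir"))
print(resolve_storage_target("hdfs://nameservice1/", ""))
```

The three calls cover the HDFS case, the case where the local path wins, and the blank-path case that falls back to the root directory.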
Configuration Files Path
The directory that contains the Hadoop configuration files.
Note: Copy the core-site.xml, hdfs-site.xml, and hive-site.xml files from the Hadoop cluster and add them to a folder on the Linux machine.
Keytab File
The file that contains encrypted keys and Kerberos principals to authenticate the machine.
Principal Name
Users assigned to the superuser privilege can perform all the tasks that a user with the administrator privilege can perform.
Impersonation Username
You can enable different users to run jobs in a Hadoop cluster that uses Kerberos authentication or connect to sources and targets that use Kerberos authentication. To enable different users to run jobs or connect to big data sources and targets, you must configure user impersonation.
Note: When you read from or write to remote files, the NameNode URI and Configuration Files Path fields are mandatory. When you read from or write to local files, only the Local Path field is required.

Read properties

The following table describes the advanced source properties that you can configure in the Python code to read from the endpoint:
Advanced Property
Description
File path
Mandatory. Location of the file or directory from which you want to read data. Maximum length is 255 characters. If the path is a directory, all the files in the directory must have the same file format.
If the file or directory is in HDFS, enter the path without the node URI. For example, /user/lib/testdir specifies the location of a directory in HDFS. The path must not contain more than 512 characters.
If the file or directory is in the local system, enter the fully qualified path. For example, /user/testdir specifies the location of a directory in the local system.
File Pattern
Mandatory. Name and format of the file from which you want to read data.
Specify the value in the following format: <filename>.<format>
For example, customer.avro
Allow Wildcard Characters
Indicates whether you want to use wildcard characters for the source directory name or the source file name.
If you select this option, you can use the asterisk (*) wildcard character for the source directory name or the source file name in the File path field.
Allow Recursive Read
Indicates whether you want to use wildcard characters to read complex files of the Parquet, Avro, or JSON formats recursively from the specified folder and its subfolders and files.
You can use the wildcard character as part of the file or directory. For example, you can use a wildcard character to recursively read data from the following folders:
  • - /myfolder*/. Returns all files within any folder or subfolder that has a pattern myfolder in the path name.
  • - /myfolder*/*.csv. Returns all .csv files within any folder or subfolder that has a pattern myfolder in the path name.
  • - /myfolder*/ and file name is abc*. Returns all files that have a pattern abc within any folder or subfolder that has a pattern myfolder in the path name.
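To get a feel for which paths such patterns select, the following sketch uses Python's standard fnmatch module as a stand-in. It is an approximation only: the connector's actual matching engine is not described here, and fnmatch is used because its * also matches across / separators, which resembles the "any folder or subfolder" behavior described above. The sample paths and the helper name match are made up for the illustration.

```python
from fnmatch import fnmatchcase

# Hypothetical file listing used only for this illustration.
paths = [
    "/myfolder1/a.csv",
    "/myfolder2/sub/b.csv",
    "/myfolder2/sub/c.json",
    "/other/d.csv",
]


def match(pattern, candidates):
    """Return the candidates that match the wildcard pattern (case-sensitive)."""
    return [p for p in candidates if fnmatchcase(p, pattern)]


# All .csv files under folders whose name starts with myfolder.
print(match("/myfolder*/*.csv", paths))
# All files under folders whose name starts with myfolder.
print(match("/myfolder*/*", paths))
```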
File Format
Specifies the file format of a complex file source. Select one of the following options:
  • - Binary
  • - Custom Input
  • - Sequence File Format
Default is Binary.
Input Format
The class name for files of the input file format. If you select Custom Input in the File Format field, you must specify the fully qualified class name implementing the InputFormat interface.
To read files that use the Avro format, use the following input format:
com.informatica.avro.AvroToXML
Input Format Parameters
Parameters for the input format class. Enter name-value pairs separated with a semicolon. Enclose the parameter name and value within double quotes.
For example, use the following syntax:
"param1"="value1";"param2"="value2"
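If you assemble this value in Python, a small helper can produce the quoted, semicolon-separated form. The helper below is a sketch, not part of the connector. The name format_input_format_parameters is hypothetical, and rejecting embedded double quotes is an assumption, because the documentation does not describe an escaping rule.

```python
from typing import Mapping


def format_input_format_parameters(params: Mapping[str, str]) -> str:
    """Join name-value pairs as "name"="value" separated by semicolons."""
    parts = []
    for name, value in params.items():
        if '"' in name or '"' in value:
            # No escaping rule is documented, so refuse ambiguous input.
            raise ValueError("Parameter names and values must not contain double quotes")
        parts.append(f'"{name}"="{value}"')
    return ";".join(parts)


print(format_input_format_parameters({"param1": "value1", "param2": "value2"}))
# "param1"="value1";"param2"="value2"
```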
Compression Format
Compression format of the source files. Select one of the following options:
  • - None
  • - Auto
  • - DEFLATE
  • - gzip
  • - bzip2
  • - LZO
  • - Snappy
  • - Custom
Custom Compression Codec
Required if you use custom compression format. Specify the fully qualified class name implementing the CompressionCodec interface.
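The dependency between Compression Format and Custom Compression Codec can be checked before a job is submitted. The following is a minimal sketch under stated assumptions: the function check_compression is hypothetical, the option names are compared case-insensitively as a convenience, and com.example.MyCodec is a placeholder class name rather than a real codec.

```python
from typing import Optional

# Options listed for the Compression Format property.
COMPRESSION_FORMATS = ("None", "Auto", "DEFLATE", "gzip", "bzip2", "LZO", "Snappy", "Custom")


def check_compression(compression_format: str, custom_codec: Optional[str] = None) -> Optional[str]:
    """Validate the pair and return the codec class to pass on, if any."""
    known = {name.lower() for name in COMPRESSION_FORMATS}
    if compression_format.lower() not in known:
        raise ValueError(f"Unknown compression format: {compression_format}")
    if compression_format.lower() == "custom":
        if not custom_codec:
            raise ValueError("Custom compression requires a codec class name")
        return custom_codec
    # Built-in formats do not need a codec class.
    return None


print(check_compression("gzip"))
print(check_compression("Custom", "com.example.MyCodec"))  # placeholder class name
```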

Write properties

The following table describes the advanced target properties that you can configure in the Python code to write to the endpoint:
Advanced Property
Description
File Directory
Optional. The directory location of one or more output files. Maximum length is 255 characters. If you do not specify a directory location, the output files are created at the location specified in the connection.
If the directory is in HDFS, enter the path without the node URI. For example, /user/lib/testdir specifies the location of a directory in HDFS. The path must not contain more than 512 characters.
If the file or directory is in the local system, enter the fully qualified path. For example, /user/testdir specifies the location of a directory in the local system.
File Name
Optional. Renames the output file. The file name is not applicable when you read or write multiple files with Hadoop Files V2.
Overwrite Target
Indicates whether the Secure Agent must first delete the target data before writing data.
If you select the Overwrite Target option, the Secure Agent deletes the target data before writing data. If you do not select this option, the Secure Agent creates a new file in the target and writes the data to the file.
File Format
Specifies the file format of a complex file target. Select one of the following options:
  • - Binary
  • - Custom Input
  • - Sequence File Format
Default is Binary.
Output Format
The class name for files of the output format. If you select Output Format in the File Format field, you must specify the fully qualified class name implementing the OutputFormat interface.
Output Key Class
The class name for the output key. If you select Output Format in the File Format field, you must specify the fully qualified class name for the output key.
You can specify one of the following output key classes:
  • - BytesWritable
  • - Text
  • - LongWritable
  • - IntWritable
Note: Hadoop Files V2 generates the key in ascending order.
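Because the property expects a fully qualified class name, it can help to expand the short names listed above. In Apache Hadoop, these four writable types live in the org.apache.hadoop.io package. The helper qualified_output_key_class is a hypothetical convenience for this illustration, and whether the connector also accepts the short names is not stated here.

```python
# Package that contains the standard Hadoop writable types.
HADOOP_IO_PACKAGE = "org.apache.hadoop.io"

# Output key classes listed for this property.
OUTPUT_KEY_CLASSES = ("BytesWritable", "Text", "LongWritable", "IntWritable")


def qualified_output_key_class(short_name: str) -> str:
    """Expand a listed short class name to its fully qualified form."""
    if short_name not in OUTPUT_KEY_CLASSES:
        raise ValueError(f"Unsupported output key class: {short_name}")
    return f"{HADOOP_IO_PACKAGE}.{short_name}"


print(qualified_output_key_class("Text"))  # org.apache.hadoop.io.Text
```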
Output Value Class
The class name for the output value. If you select Output Format in the File Format field, you must specify the fully qualified class name for the output value.
You can use any custom writable class that Hadoop supports. Determine the output value class based on the type of data that you want to write.
Note: When you use custom output formats, the value part of the data that is streamed to the complex file data object write operation must be in a serialized form.
Compression Format
Compression format of the target files. Select one of the following options:
  • - None
  • - Auto
  • - DEFLATE
  • - gzip
  • - bzip2
  • - LZO
  • - Snappy
  • - Custom
Custom Compression Codec
Required if you use custom compression format. Specify the fully qualified class name implementing the CompressionCodec interface.
Sequence File Compression Type
Optional. The compression format for sequence files. Select one of the following options:
  • - None
  • - Record
  • - Block