
Hadoop Connection Properties

Use the Hadoop connection to configure mappings to run on a Hadoop cluster. A Hadoop connection is a cluster type connection. You can create and manage a Hadoop connection in the Administrator tool or the Developer tool. You can use infacmd to create a Hadoop connection. Hadoop connection properties are case sensitive unless otherwise noted.
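For example, the following infacmd sketch creates a Hadoop connection. The domain, user, and connection names are placeholders, and you should verify the option names against your infacmd version:
infacmd isp CreateConnection -dn MyDomain -un Administrator -pd MyPassword -cn MyHadoopConnection -cid MyHadoopConnection -ct HADOOP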

Hadoop Cluster Properties

The following table describes the general connection properties for the Hadoop connection:
Property
Description
Name
The name of the connection. The name is not case sensitive and must be unique within the domain. You can change this property after you create the connection. The name cannot exceed 128 characters, contain spaces, or contain the following special characters:
~ ` ! $ % ^ & * ( ) - + = { [ } ] | \ : ; " ' < , > . ? /
ID
String that the Data Integration Service uses to identify the connection. The ID is not case sensitive. It must be 255 characters or less and must be unique in the domain. You cannot change this property after you create the connection. Default value is the connection name.
Description
The description of the connection. Enter a string that you can use to identify the connection. The description cannot exceed 4,000 characters.
Cluster Configuration
The name of the cluster configuration associated with the Hadoop environment.

Common Properties

The following table describes the common connection properties that you configure for the Hadoop connection:
Property
Description
Impersonation User Name
User name of the user that the Data Integration Service impersonates to run mappings on a Hadoop cluster.
If the Hadoop cluster uses Kerberos authentication, the principal name for the JDBC connection string and the user name must be the same.
Note: You must use user impersonation for the Hadoop connection if the Hadoop cluster uses Kerberos authentication.
If the Hadoop cluster does not use Kerberos authentication, the user name depends on the behavior of the JDBC driver.
If you do not specify a user name, the Hadoop cluster authenticates jobs based on the operating system profile user name of the machine that runs the Data Integration Service.
Temporary Table Compression Codec
Hadoop compression library for a compression codec class name.
Codec Class Name
Codec class name that enables data compression and improves performance on temporary staging tables.
Hive Staging Database Name
Namespace for tables. Use the name default for tables that do not have a specified database name.
Hadoop Engine Custom Properties
Custom properties that are unique to the Hadoop connection.
You can specify multiple properties.
Use the following format:
<property1>=<value>
To specify multiple properties, use &: as the property separator.
Use custom properties only at the request of Informatica Global Customer Support.
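For example, the following entry sets two custom properties. The property names and values are illustrative placeholders:
property1=value1&:property2=value2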

Reject Directory Properties

The following table describes the connection properties that you configure for the Hadoop reject directory:
Property
Description
Write Reject Files to Hadoop
If you use the Blaze engine to run mappings, select this check box to move reject files to a specified location. When you select this option, the Data Integration Service moves the reject files to the HDFS location that you specify in the Reject File Directory property.
By default, the Data Integration Service stores the reject files based on the RejectDir system parameter.
Reject File Directory
The directory for Hadoop mapping files on HDFS when you run mappings on the Blaze engine.
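For example, enter: /blaze/reject (the path is illustrative; specify a directory that exists on HDFS).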

Hive Pushdown Configuration

The following table describes the connection properties that you configure to push mapping logic to the Hadoop cluster:
Property
Description
Environment SQL
SQL commands to set the Hadoop environment. The Data Integration Service executes the environment SQL at the beginning of each Hive script generated in a Hive execution plan.
The following rules and guidelines apply to the usage of environment SQL:
  - Use the environment SQL to specify Hive queries.
  - Use the environment SQL to set the classpath for Hive user-defined functions and then use environment SQL or PreSQL to specify the Hive user-defined functions. You cannot use PreSQL in the data object properties to specify the classpath. The path must be the fully qualified path to the JAR files used for user-defined functions. Set the parameter hive.aux.jars.path with all the entries in infapdo.aux.jars.path and the path to the JAR files for user-defined functions. See the sample after this list.
  - You can use environment SQL to define Hadoop or Hive parameters that you want to use in the PreSQL commands or in custom queries.
  - If you use multiple values for the environment SQL, ensure that there is no space between the values. The following sample text shows two values that can be used for the Environment SQL property:
    set hive.execution.engine='tez';set hive.exec.dynamic.partition.mode='nonstrict';
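For example, a minimal environment SQL sketch that registers a hypothetical user-defined function. The JAR path /opt/hive/udfs/myudf.jar and the class name com.example.hive.MyUDF are illustrative placeholders; in practice, set hive.aux.jars.path to all the entries in infapdo.aux.jars.path plus the paths to your UDF JAR files:
set hive.aux.jars.path=file:///opt/hive/udfs/myudf.jar;create temporary function my_udf as 'com.example.hive.MyUDF';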
Hive Warehouse Directory
Optional. The absolute HDFS file path of the default database for the warehouse that is local to the cluster.
If you do not configure the Hive warehouse directory, the Hive engine first tries to write to the directory specified in the cluster configuration property hive.metastore.warehouse.dir. If the cluster configuration does not have the property, the Hive engine writes to the default directory /user/hive/warehouse.

Hive Configuration

The following table describes the connection properties that you configure for the Hive engine:
Property
Description
Engine Type
The engine that the Hadoop environment uses to run a mapping on the Hadoop cluster.
The value that you configure is based on the mapreduce.framework.name property in the mapred-site.xml configuration set of the cluster configuration.
You can choose MRv2 or Tez. Select Tez only if it is configured for the Hadoop cluster.
Default is MRv2.
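For reference, the engine type corresponds to the mapreduce.framework.name property, which typically appears in the cluster's mapred-site.xml as follows. The value yarn is the usual setting for MRv2; your cluster configuration might differ:
<property>
  <name>mapreduce.framework.name</name>
  <value>yarn</value>
</property>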

Blaze Configuration

The following table describes the connection properties that you configure for the Blaze engine:
Property
Description
Blaze Staging Directory
The HDFS file path of the directory that the Blaze engine uses to store temporary files. Verify that the directory exists. The YARN user, Blaze engine user, and mapping impersonation user must have write permission on this directory.
For example, enter: /blaze/workdir
Blaze User Name
The operating system profile user name for the Blaze engine.
Minimum Port
The minimum value for the port number range for the Blaze engine.
Default is 12300.
Maximum Port
The maximum value for the port number range for the Blaze engine.
Default is 12600.
YARN Queue Name
The YARN scheduler queue name that the Blaze engine uses. The queue specifies the resources that are available on the cluster.
Blaze Job Monitor Address
The host name and port number for the Blaze Job Monitor.
Use the following format:
<hostname>:<port>
Where:
  - <hostname> is the host name or IP address of the Blaze Job Monitor server.
  - <port> is the port on which the Blaze Job Monitor listens for remote procedure calls (RPC).
For example, enter: myhostname:9080
Blaze Service Custom Properties
Custom properties that are unique to the Blaze engine.
To enter multiple properties, separate each name-value pair with the following text: &:.
Use custom properties only at the request of Informatica Global Customer Support.

Spark Configuration

The following table describes the connection properties that you configure for the Spark engine:
Property
Description
Spark Staging Directory
The HDFS file path of the directory that the Spark engine uses to store temporary files for running jobs. The YARN user, Spark engine user, and mapping impersonation user must have write permission on this directory.
Spark Event Log Directory
Optional. The HDFS file path of the directory that the Spark engine uses to log events. The Data Integration Service accesses the Spark event log directory to retrieve final source and target statistics when a mapping completes. These statistics appear on the Summary Statistics tab and the Detailed Statistics tab of the Monitoring tool.
If you do not configure the Spark event log directory, the statistics might be incomplete in the Monitoring tool.
YARN Queue Name
The YARN scheduler queue name that the Spark engine uses. The queue specifies the resources that are available on the cluster. The name is case sensitive.
Spark Execution Parameters
An optional list of configuration parameters to apply to the Spark engine. You can override the default values of Spark configuration properties, such as spark.executor.memory or spark.driver.cores.
Use the following format:
<property1>=<value>
To enter multiple properties, separate each name-value pair with the following text: &:
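For example, the following entry overrides two Spark properties. The values shown are illustrative:
spark.executor.memory=4G&:spark.driver.cores=2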