Hadoop Connection Properties

Use the Hadoop connection to configure mappings to run on a Hadoop cluster. A Hadoop connection is a cluster type connection. You can create and manage a Hadoop connection in the Administrator tool or the Developer tool. You can use infacmd to create a Hadoop connection. Hadoop connection properties are case sensitive unless otherwise noted.
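For example, the following infacmd command outlines how you might create a Hadoop connection from the command line. This is a minimal sketch: the connection type string HADOOP and the clusterConfigId option name are assumptions that can vary by Informatica version, so verify them in the infacmd Command Reference for your release.
  infacmd.sh isp createConnection -dn MyDomain -un Administrator -pd MyPassword \
    -cn my_hadoop_conn -cid my_hadoop_conn -ct HADOOP \
    -o "clusterConfigId=my_cluster_config"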

Hadoop Cluster Properties

Configure properties in the Hadoop connection to enable communication between the Data Integration Service and the Hadoop cluster.
The following table describes the general connection properties for the Hadoop connection:
Property
Description
Name
The name of the connection. The name is not case sensitive and must be unique within the domain. You can change this property after you create the connection. The name cannot exceed 128 characters, contain spaces, or contain the following special characters:
~ ` ! $ % ^ & * ( ) - + = { [ } ] | \ : ; " ' < , > . ? /
ID
String that the Data Integration Service uses to identify the connection. The ID is not case sensitive. It must be 255 characters or less and must be unique in the domain. You cannot change this property after you create the connection. Default value is the connection name.
Description
The description of the connection. Enter a string that you can use to identify the connection. The description cannot exceed 4,000 characters.
Cluster Configuration
The name of the cluster configuration associated with the Hadoop environment.
Required if you do not configure the Cloud Provisioning Configuration.
Cloud Provisioning Configuration
Name of the cloud provisioning configuration associated with a cloud platform such as Amazon Web Services or Microsoft Azure.
Required if you do not configure the Cluster Configuration.
Cluster Environment Variables*
Environment variables that the Hadoop cluster uses.
For example, the variable ORACLE_HOME represents the directory where the Oracle database client software is installed.
You can configure run-time properties for the Hadoop environment in the Data Integration Service, the Hadoop connection, and in the mapping. You can override a property configured at a high level by setting the value at a lower level. For example, if you configure a property in the Data Integration Service custom properties, you can override it in the Hadoop connection or in the mapping. The Data Integration Service processes property overrides based on the following priorities:
  1. Mapping custom properties set using infacmd ms runMapping with the -cp option
  2. Mapping run-time properties for the Hadoop environment
  3. Hadoop connection advanced properties for run-time engines
  4. Hadoop connection advanced general properties, environment variables, and classpaths
  5. Data Integration Service custom properties
The example after this table shows a mapping-level override set with the -cp option.
Cluster Library Path*
The path for shared libraries on the cluster.
The $DEFAULT_CLUSTER_LIBRARY_PATH variable contains a list of default directories.
Cluster Classpath*
The classpath to access the Hadoop jar files and the required libraries.
The $DEFAULT_CLUSTER_CLASSPATH variable contains a list of paths to the default jar files and libraries.
The Data Integration Service applies the same override priorities that are listed for the Cluster Environment Variables property.
Cluster Executable Path*
The path for executable files on the cluster.
The $DEFAULT_CLUSTER_EXEC_PATH variable contains a list of paths to the default executable files.
* Informatica recommends that you consult third-party documentation, Informatica documentation, or Informatica Global Customer Support before you change these property values. If you change a value without understanding the property, you might experience performance degradation or other unexpected results.
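For example, the following infacmd ms runMapping command illustrates priority 1, a mapping-level custom property that overrides the same property configured in the Hadoop connection or in the Data Integration Service. The domain, service, application, mapping, and property names are placeholders for illustration only.
  infacmd.sh ms runMapping -dn MyDomain -sn MyDIS -un Administrator -pd MyPassword \
    -a MyApplication -m MyMapping \
    -cp "SomeRuntimeProperty=valueForThisRunOnly"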

Common Properties

The following table describes the common connection properties that you configure for the Hadoop connection:
Property
Description
Impersonation User Name
Required if the Hadoop cluster uses Kerberos authentication. Hadoop impersonation user. The user name that the Data Integration Service impersonates to run mappings in the Hadoop environment.
The Data Integration Service determines the user that runs mappings in the following order:
  1. Operating system profile user. The mapping runs with the operating system profile user if the profile user is configured. If there is no operating system profile user, the mapping runs with the Hadoop impersonation user.
  2. Hadoop impersonation user. The mapping runs with the Hadoop impersonation user if the operating system profile user is not configured. If the Hadoop impersonation user is not configured, the Data Integration Service runs mappings with the Data Integration Service user.
  3. Informatica services user. The mapping runs with the operating system user that starts the Informatica daemon if neither the operating system profile user nor the Hadoop impersonation user is configured.
Temporary Table Compression Codec
Hadoop compression library for a compression codec class name.
Note: The Spark engine does not support compression settings for temporary tables. When you run mappings on the Spark engine, the Spark engine stores temporary tables in an uncompressed file format.
Codec Class Name
Codec class name that enables data compression and improves performance on temporary staging tables. See the example after this table.
Hive Staging Database Name
Namespace for Hive staging tables. Use the name default for tables that do not have a specified database name.
If you do not configure a namespace, the Data Integration Service uses the Hive database name in the Hive target connection to create staging tables.
Advanced Properties
List of advanced properties that are unique to the Hadoop environment. The properties are common to the Blaze, Spark, and Hive engines. The advanced properties include a list of default properties.
The Data Integration Service applies the same override priorities that are listed for the Cluster Environment Variables property in the Hadoop cluster properties.
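For example, to compress temporary staging tables with Snappy, you might set the codec class name to the standard Hadoop Snappy codec. This assumes that the Snappy libraries are installed on the cluster nodes:
  Temporary Table Compression Codec: Snappy
  Codec Class Name: org.apache.hadoop.io.compress.SnappyCodec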

Reject Directory Properties

The following table describes the connection properties that you configure for the reject directory in the Hadoop environment.
Property
Description
Write Reject Files to Hadoop
If you use the Blaze engine to run mappings, select the check box to move reject files to a specified location. When the check box is selected, the Data Integration Service moves the reject files to the HDFS location listed in the Reject File Directory property.
By default, the Data Integration Service stores the reject files based on the RejectDir system parameter.
Reject File Directory
The HDFS directory where the Data Integration Service stores reject files when you run mappings.
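For example, a configuration that moves reject files to a shared HDFS location might look like the following. The path is illustrative; the users that run the mapping must have write permission on it:
  Write Reject Files to Hadoop: selected
  Reject File Directory: /user/informatica/reject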

Hive Pushdown Configuration

The following table describes the connection properties that you configure for the Hive engine:
Property
Description
Environment SQL
SQL commands to set the Hadoop environment. The Data Integration Service executes the environment SQL at the beginning of each Hive script generated in a Hive execution plan.
The following rules and guidelines apply to the usage of environment SQL:
  - Use the environment SQL to specify Hive queries.
  - Use the environment SQL to set the classpath for Hive user-defined functions, and then use environment SQL or PreSQL to specify the functions. You cannot use PreSQL in the data object properties to specify the classpath. If you use Hive user-defined functions, you must copy the .jar files to the following directory: <Informatica installation directory>/services/shared/hadoop/<Hadoop distribution name>/extras/hive-auxjars
  - You can use environment SQL to define Hadoop or Hive parameters that you want to use in the PreSQL commands or in custom queries.
  - If you use multiple values for the environment SQL, ensure that there is no space between the values. See the example after this table.
Hive Warehouse Directory
Optional. The absolute HDFS file path of the default database for the warehouse that is local to the cluster.
If you do not configure the Hive warehouse directory, the Hive engine first tries to write to the directory specified in the cluster configuration property hive.metastore.warehouse.dir. If the cluster configuration does not have the property, the Hive engine writes to the default directory /user/hive/warehouse.
Engine Type
The engine that the Hadoop environment uses to run a mapping on the Hadoop cluster. You can choose MRv2 or Tez. You can select Tez if it is configured for Amazon EMR, Azure HDInsight, or Hortonworks HDP.
Advanced Properties
List of advanced properties that are unique to the Hive engine. The advanced properties include a list of default properties.
The Data Integration Service applies the same override priorities that are listed for the Cluster Environment Variables property in the Hadoop cluster properties.
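For example, the following environment SQL registers a hypothetical Hive user-defined function and defines a Hive variable for use in PreSQL commands. The class name and variable name are illustrative; note that the two statements are written without a space between them, as the guidelines above require:
  CREATE TEMPORARY FUNCTION my_udf AS 'com.example.udf.MyUdf';SET hivevar:load_date=2020-01-01;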

Blaze Configuration

The following table describes the connection properties that you configure for the Blaze engine:
Property
Description
Blaze Staging Directory
The HDFS file path of the directory that the Blaze engine uses to store temporary files. Verify that the directory exists. The YARN user, Blaze engine user, and mapping impersonation user must have write permission on this directory.
Default is /blaze/workdir. If you clear this property, the staging files are written to the Hadoop staging directory /tmp/blaze_<user name>.
Blaze User Name
The owner of the Blaze service and Blaze service logs.
When the Hadoop cluster uses Kerberos authentication, the default user is the Data Integration Service SPN user. When the Hadoop cluster does not use Kerberos authentication and the Blaze user is not configured, the default user is the Data Integration Service user.
Minimum Port
The minimum value for the port number range for the Blaze engine. Default is 12300.
Maximum Port
The maximum value for the port number range for the Blaze engine. Default is 12600.
YARN Queue Name
The name of the YARN scheduler queue that the Blaze engine uses. The queue specifies the resources that are available on the cluster.
Blaze Job Monitor Address
The host name and port number for the Blaze Job Monitor.
Use the following format:
<hostname>:<port>
Where
  - <hostname> is the host name or IP address of the Blaze Job Monitor server.
  - <port> is the port on which the Blaze Job Monitor listens for remote procedure calls (RPC).
For example, enter: myhostname:9080
Blaze YARN Node Label
Node label that determines the node on the Hadoop cluster where the Blaze engine runs. If you do not specify a node label, the Blaze engine runs on the nodes in the default partition.
If the Hadoop cluster supports logical operators for node labels, you can specify a list of node labels. To list the node labels, use the operators && (AND), || (OR), and ! (NOT). See the example after this table.
Advanced Properties
List of advanced properties that are unique to the Blaze engine. The advanced properties include a list of default properties.
The Data Integration Service applies the same override priorities that are listed for the Cluster Environment Variables property in the Hadoop cluster properties.
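For example, if the cluster defines the node labels prod and gpu (label names are illustrative), the following expression restricts the Blaze engine to nodes that are labeled prod and not labeled gpu:
  prod && !gpu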

Spark Configuration

The following table describes the connection properties that you configure for the Spark engine:
Property
Description
Spark Staging Directory
The HDFS file path of the directory that the Spark engine uses to store temporary files for running jobs. The YARN user, Data Integration Service user, and mapping impersonation user must have write permission on this directory.
By default, the temporary files are written to the Hadoop staging directory /tmp/spark_<user name>.
Spark Event Log Directory
Optional. The HDFS file path of the directory that the Spark engine uses to log events.
YARN Queue Name
The name of the YARN scheduler queue that the Spark engine uses. The queue specifies the resources that are available on the cluster. The name is case sensitive.
Advanced Properties
List of advanced properties that are unique to the Spark engine. The advanced properties include a list of default properties.
The Data Integration Service applies the same override priorities that are listed for the Cluster Environment Variables property in the Hadoop cluster properties.
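For example, Spark advanced properties typically take the form of standard Spark configuration entries. The keys below are real Spark configuration properties, but the values are illustrative and the exact entry format can depend on your Informatica version:
  spark.executor.memory=6G
  spark.executor.cores=2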