Configuring Hadoop Connection Properties
When you create a Hadoop connection, default values are assigned to cluster environment variables, cluster path properties, and advanced properties. You can add or edit values for these properties. You can also reset to default values.
You can configure the following Hadoop connection properties based on the cluster environment and functionality that you use:
- Cluster Environment Variables
- Cluster Library Path
- Cluster ClassPath
- Cluster Executable Path
- Common Advanced Properties
- Hive Engine Advanced Properties
- Blaze Engine Advanced Properties
- Spark Engine Advanced Properties
To reset a property to its default value, delete the property value. For example, if you delete the value of an edited Cluster Library Path property, the property resets to the default value $DEFAULT_CLUSTER_LIBRARY_PATH.
Cluster Environment Variables
The Cluster Environment Variables property lists the environment variables that the cluster uses. Each environment variable contains a name and a value. You can add or edit environment variables.
To edit the property in the text box, use the following format with &: to separate each name-value pair:
<name1>=<value1>[&:<name2>=<value2>…&:<nameN>=<valueN>]
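For example, a value that sets the three DB2 environment variables described below might look like the following (the paths and instance name are illustrative):
DB2_HOME=/databases/db2V10.5_64BIT&:DB2INSTANCE=db10inst&:DB2CODEPAGE="1208"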
Configure the following environment variables in the Cluster Environment Variables property:
- HADOOP_NODE_JDK_HOME
Represents the directory from which you run the cluster services and the JDK version that the cluster nodes use. Required to run the Java transformation in the Hadoop environment and Sqoop mappings on the Blaze engine. You must use JDK version 1.7 or later. Default is /usr/java/default. The JDK version that the Data Integration Service uses must be compatible with the JRE version on the cluster.
Set to <cluster JDK home>/jdk<version>.
For example, HADOOP_NODE_JDK_HOME=<cluster JDK home>/jdk<version>.
- DB2_HOME
Specifies the DB2 home directory. Required to run mappings with DB2 sources and targets on the Hive engine.
Set to /databases/db2<version>.
For example, DB2_HOME=/databases/db2V10.5_64BIT.
- DB2INSTANCE
Specifies the DB2 database instance name. Required to run mappings with DB2 sources and targets on the Hive engine.
Set to <DB2 instance name>.
For example, DB2INSTANCE=db10inst.
- DB2CODEPAGE
Specifies the code page configured in the DB2 instance. Required to run mappings with DB2 sources and targets on the Hive engine.
Set to <DB2 instance code page>.
For example, DB2CODEPAGE="1208".
- GPHOME_LOADERS
Represents the directory that contains the Greenplum libraries. Required to run Greenplum mappings on the Hive engine.
Set to <Greenplum libraries directory>.
For example, GPHOME_LOADERS=/opt/thirdparty/.
- PYTHONPATH
Represents the directory that contains the Python path libraries. Required to run Greenplum mappings on the Hive engine.
Set to <Python path libraries directory>.
For example, PYTHONPATH=$GPHOME_LOADERS/bin/ext.
- NZ_HOME
Represents the directory that contains the Netezza client libraries. Required to run Netezza mappings on the Hive or Blaze engine.
Set to <Netezza client library directory>.
For example, NZ_HOME=/opt/thirdparty/netezza.
- NZ_ODBC_INI_PATH
Represents the directory that contains the odbc.ini file. Required to run Netezza mappings on the Hive or Blaze engine.
Set to <odbc.ini file path>.
For example, NZ_ODBC_INI_PATH=/opt/ODBCINI.
- ODBCINI
Represents the path and file name of the odbc.ini file.
  - Required to run Netezza mappings on the Hive or Blaze engine.
Set to <odbc.ini file path>/<file name>.
For example, ODBCINI=/opt/ODBCINI/odbc.ini.
  - Required to run mappings with ODBC sources and targets on the Hive engine.
Set to <odbc.ini file path>/<file name>.
For example, ODBCINI=$HADOOP_NODE_INFA_HOME/ODBC7.1/odbc.ini.
- ODBC_HOME
Specifies the ODBC home directory. Required to run mappings with ODBC sources and targets on the Hive engine.
Set to <odbc home directory>.
For example, ODBC_HOME=$HADOOP_NODE_INFA_HOME/ODBC7.1.
- ORACLE_HOME
Specifies the Oracle home directory. Required to run mappings with Oracle sources and targets on the Hive engine.
Set to <Oracle home directory>.
For example, ORACLE_HOME=/databases/oracle12.1.0_64BIT.
- TNS_ADMIN
Specifies the directory to the Oracle client tnsnames.ora configuration files. Required to run mappings with Oracle sources and targets on the Hive engine.
Set to <tnsnames.ora config files directory>.
For example, TNS_ADMIN=/opt/ora_tns.
- HADOOP_CLASSPATH
Represents the directory that contains the TDCH libraries. Required to run Teradata mappings through TDCH on the Hive engine.
Set to <TDCH libraries directory>.
For example, set to:
/opt/cloudera/parcels/CDH-5.13.0-1.cdh5.13.0.p0.29/lib/hive/conf
/opt/cloudera/parcels/CDH-5.13.0-1.cdh5.13.0.p0.29/lib/hive/lib/*
/usr/lib/tdch/1.5/lib/*
Cluster Library Path
The Cluster Library Path property is a list of path variables for shared libraries on the cluster. You can add or edit library path variables.
To edit the property in the text box, use the following format with : to separate each path variable:
<variable1>[:<variable2>…:<variableN>]
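For example, a value that combines the DB2 and Oracle library paths described below might look like the following:
$DB2_HOME/lib64:$ORACLE_HOME/lib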
Configure the following library path variables in the Cluster Library Path property:
- $DB2_HOME/lib64
Represents the path to the DB2 libraries. Required to run mappings with DB2 sources and targets on the Hive engine.
- $GPHOME_LOADERS/lib
Represents the path to the Greenplum libraries. Required to run Greenplum mappings on the Hive engine.
- $GPHOME_LOADERS/ext/python/lib
Represents the path to the Python libraries. Required to run Greenplum mappings on the Hive engine.
- $NZ_HOME/lib64
Represents the path to the Netezza libraries. Required to run Netezza mappings on the Hive or Blaze engine.
- $ORACLE_HOME/lib
Represents the path to the Oracle libraries. Required to run mappings with Oracle sources and targets on the Hive engine.
- /usr/lib/tdch/1.5/lib/*
The path to the TDCH libraries directory. Required to run Teradata mappings through TDCH on the Hive engine.
Cluster ClassPath
The Cluster ClassPath property is a list of classpath variables to access the Hadoop jar files and the required libraries on the cluster. You can add or edit classpath variables.
To edit the property in the text box, use the following format with : to separate each path variable:
<variable1>[:<variable2>…:<variableN>]
Configure the following classpath variable in the Cluster ClassPath property:
- /usr/lib/tdch/1.5/lib/*
The path to the TDCH libraries directory. Required to run Teradata mappings through TDCH on the Hive engine.
Cluster Executable Path
The Cluster Executable Path property is a list of path variables to access executable files on the cluster. You can add or edit executable path variables.
To edit the property in the text box, use the following format with : to separate each path variable:
<variable1>[:<variable2>…:<variableN>]
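For example, a value that combines the DB2 and Oracle executable paths described below might look like the following:
$DB2_HOME/bin:$ORACLE_HOME/bin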
Configure the following executable path variables in the Cluster Executable Path property:
- $DB2_HOME/bin
Represents the directory to the DB2 binaries. Required to run mappings with DB2 sources and targets on the Hive engine.
- $GPHOME_LOADERS/bin
Represents the path to the Greenplum binaries. Required to run Greenplum mappings on the Hive engine.
- $GPHOME_LOADERS/ext/python/bin
Represents the path to the Python binaries. Required to run Greenplum mappings on the Hive engine.
- $ORACLE_HOME/bin
Represents the path to the Oracle binaries. Required to run mappings with Oracle sources and targets on the Hive engine.
Common Advanced Properties
Common advanced properties are a list of advanced or custom properties that are unique to the Hadoop environment. The properties are common to the Blaze, Spark, and Hive engines. Each property contains a name and a value. You can add or edit advanced properties.
To edit the property in the text box, use the following format with &: to separate each name-value pair:
<name1>=<value1>[&:<name2>=<value2>…&:<nameN>=<valueN>]
Configure the following property in the Advanced Properties of the common properties section:
- infapdo.java.opts
List of Java options to customize the Java run-time environment. The property contains default values.
If mappings in a MapR environment contain a Consolidation transformation or a Match transformation, change the following value:
  - -Xmx512M. Specifies the maximum heap size for the Java virtual machine. Default is 512 MB. Increase the value to at least 700 MB.
For example, infapdo.java.opts=-Xmx700M
Hive Engine Advanced Properties
Hive advanced properties are a list of advanced or custom properties that are unique to the Hive engine. Each property contains a name and a value. You can add or edit advanced properties.
To edit the property in the text box, use the following format with &: to separate each name-value pair:
<name1>=<value1>[&:<name2>=<value2>…&:<nameN>=<valueN>]
Blaze Engine Advanced Properties
Blaze advanced properties are a list of advanced or custom properties that are unique to the Blaze engine. Each property contains a name and a value. You can add or edit advanced properties.
To edit the property in the text box, use the following format with &: to separate each name-value pair:
<name1>=<value1>[&:<name2>=<value2>…&:<nameN>=<valueN>]
Configure the following properties in the Advanced Properties of the Blaze configuration section (a combined example follows the list):
- infagrid.cadi.namespace
Namespace for the Data Integration Service to use. Required to set up multiple Blaze instances.
Set to <unique namespace>.
For example, infagrid.cadi.namespace=TestUser1_namespace
- infagrid.blaze.console.jsfport
JSF port for the Blaze engine console. Use a port number that no other cluster processes use. Required to set up multiple Blaze instances.
Set to <unique JSF port value>.
For example, infagrid.blaze.console.jsfport=9090
- infagrid.blaze.console.httpport
HTTP port for the Blaze engine console. Use a port number that no other cluster processes use. Required to set up multiple Blaze instances.
Set to <unique HTTP port value>.
For example, infagrid.blaze.console.httpport=9091
- infagrid.node.local.root.log.dir
Path for the Blaze service logs. Default is /tmp/infa/logs/blaze. Required to set up multiple Blaze instances.
Set to <local Blaze services log directory>.
For example, infagrid.node.local.root.log.dir=<directory path>
- infacal.hadoop.logs.directory
Path in HDFS for the persistent Blaze logs. Default is /var/log/hadoop-yarn/apps/informatica. Required to set up multiple Blaze instances.
Set to <persistent log directory path>.
For example, infacal.hadoop.logs.directory=<directory path>
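For example, to run a second Blaze instance alongside an existing one, you might set the namespace and console ports together in a single value (the namespace and port numbers are illustrative):
infagrid.cadi.namespace=TestUser1_namespace&:infagrid.blaze.console.jsfport=9090&:infagrid.blaze.console.httpport=9091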
Spark Engine Advanced Properties
Spark advanced properties are a list of advanced or custom properties that are unique to the Spark engine. Each property contains a name and a value. You can add or edit advanced properties.
To edit the property in the text box, use the following format with &: to separate each name-value pair:
<name1>=<value1>[&:<name2>=<value2>…&:<nameN>=<valueN>]
Configure the following properties in the Advanced Properties of the Spark configuration section:
- spark.scheduler.maxRegisteredResourcesWaitingTime
The number of milliseconds to wait for resources to register before scheduling a task. Default is 30000. Decrease the value to reduce delays before starting the Spark job execution. Required to improve performance for mappings on the Spark engine.
Set to 15000.
For example, spark.scheduler.maxRegisteredResourcesWaitingTime=15000
- spark.scheduler.minRegisteredResourcesRatio
The minimum ratio of registered resources to acquire before task scheduling begins. Default is 0.8. Decrease the value to reduce any delay before starting the Spark job execution. Required to improve performance for mappings on the Spark engine.
Set to 0.5.
For example, spark.scheduler.minRegisteredResourcesRatio=0.5
- spark.shuffle.encryption.enabled
Enables encrypted communication when authentication is enabled. Required for Spark encryption. A combined example that sets all four encryption properties follows this list.
Set to TRUE.
For example, spark.shuffle.encryption.enabled=TRUE
- spark.authenticate
Enables authentication for the Spark service on Hadoop. Required for Spark encryption.
Set to TRUE.
For example, spark.authenticate=TRUE
- spark.authenticate.enableSaslEncryption
Enables encrypted communication when SASL authentication is enabled. Required if Spark encryption uses SASL authentication.
Set to TRUE.
For example, spark.authenticate.enableSaslEncryption=TRUE
- spark.authenticate.sasl.encryption.aes.enabled
Enables AES support when SASL authentication is enabled. Required if Spark encryption uses SASL authentication.
Set to TRUE.
For example, spark.authenticate.sasl.encryption.aes.enabled=TRUE
- infaspark.pythontx.executorEnv.LD_PRELOAD
The location of the Python shared library in the Python installation folder on the Data Integration Service machine. Required to run a Python transformation on the Spark engine.
For example, set to:
infaspark.pythontx.executorEnv.LD_PRELOAD=
<Informatica installation directory>/services/shared/spark/python/lib/libpython3.6m.so
- infaspark.pythontx.submit.lib.JEP_HOME
The location of the Jep package in the Python installation folder on the Data Integration Service machine. Required to run a Python transformation on the Spark engine.
For example, set to:
infaspark.pythontx.submit.lib.JEP_HOME=
<Informatica installation directory>/services/shared/spark/python/lib/python3.6/site-packages/jep/
- infaspark.executor.extraJavaOptions
List of extra Java options for the Spark executor. Required for streaming mappings to read from or write to a Kafka cluster that uses Kerberos authentication.
For example, set to:
infaspark.executor.extraJavaOptions=
-Djava.security.egd=file:/dev/./urandom
-XX:MaxMetaspaceSize=256M -Djavax.security.auth.useSubjectCredsOnly=true
-Djava.security.krb5.conf=/<path to krb5.conf file>/krb5.conf
-Djava.security.auth.login.config=/<path to jAAS config>/kafka_client_jaas.config
To configure the property for a specific user, you can include the following lines of code:
infaspark.executor.extraJavaOptions =
-Djava.security.egd=file:/dev/./urandom
-XX:MaxMetaspaceSize=256M -XX:+UseG1GC -XX:MaxGCPauseMillis=500
-Djava.security.krb5.conf=/etc/krb5.conf
- infaspark.driver.cluster.mode.extraJavaOptions
List of extra Java options for the Spark driver that runs inside the cluster. Required for streaming mappings to read from or write to a Kafka cluster that uses Kerberos authentication.
For example, set to:
infaspark.driver.cluster.mode.extraJavaOptions=
-Djava.security.egd=file:/dev/./urandom
-XX:MaxMetaspaceSize=256M -Djavax.security.auth.useSubjectCredsOnly=true
-Djava.security.krb5.conf=/<path to krb5.conf file>/krb5.conf
-Djava.security.auth.login.config=<path to jaas config>/kafka_client_jaas.config
To configure the property for a specific user, you can include the following lines of code:
infaspark.driver.cluster.mode.extraJavaOptions =
-Djava.security.egd=file:/dev/./urandom
-XX:MaxMetaspaceSize=256M -XX:+UseG1GC -XX:MaxGCPauseMillis=500
-Djava.security.krb5.conf=/etc/krb5.conf
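For example, to enable Spark encryption with SASL and AES support, you might combine the four encryption properties into a single value:
spark.authenticate=TRUE&:spark.shuffle.encryption.enabled=TRUE&:spark.authenticate.enableSaslEncryption=TRUE&:spark.authenticate.sasl.encryption.aes.enabled=TRUE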