Configuring Hadoop Connection Properties
When you create a Hadoop connection, default values are assigned to cluster environment variables, cluster path properties, and advanced properties. You can add or edit values for these properties. You can also reset to default values.
You can configure the following Hadoop connection properties based on the cluster environment and functionality that you use:
- Cluster Environment Variables
- Cluster Library Path
- Common Advanced Properties
- Blaze Engine Advanced Properties
- Spark Engine Advanced Properties
To reset to default values, delete the property values. For example, if you delete the value of an edited Cluster Library Path property, the value resets to the default $DEFAULT_CLUSTER_LIBRARY_PATH.
Cluster Environment Variables
The Cluster Environment Variables property lists the environment variables that the cluster uses. Each environment variable contains a name and a value. You can add or edit environment variables.
To edit the property in the text box, use the following format with &: to separate each name-value pair:
<name1>=<value1>[&:<name2>=<value2>…&:<nameN>=<valueN>]
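For example, an edited value that sets two environment variables might look like the following line. The second variable, TZ, is included only as an illustration:
HADOOP_NODE_JDK_HOME=/usr/java/default&:TZ=UTC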
Configure the following environment variables in the Cluster Environment Variables property:
- HADOOP_NODE_JDK_HOME
Represents the directory from which you run the cluster services and the JDK version that the cluster nodes use. Required to run the Java transformation in the Hadoop environment and Sqoop mappings on the Blaze engine. Default is /usr/java/default. The JDK version that the Data Integration Service uses must be compatible with the JDK version on the cluster.
Set to <cluster JDK home>/jdk<version>.
For example, HADOOP_NODE_JDK_HOME=<cluster JDK home>/jdk<version>.
Cluster Library Path
The Cluster Library Path property is a list of path variables for shared libraries on the cluster. You can add or edit library path variables.
To edit the property in the text box, use the following format with : to separate each path variable:
<variable1>[:<variable2>…:<variableN>]
Configure the library path variables in the Cluster Library Path property.
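For example, a value with two library paths separated by a colon might look like the following line. Both paths are illustrative; the actual locations depend on your cluster:
/usr/lib/hadoop/lib/native:/opt/informatica/services/shared/bin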
Common Advanced Properties
Common advanced properties are a list of advanced or custom properties that are unique to the Hadoop environment. The properties are common to the Blaze and Spark engines. Each property contains a name and a value. You can add or edit advanced properties.
To edit the property in the text box, use the following format with &: to separate each name-value pair:
<name1>=<value1>[&:<name2>=<value2>…&:<nameN>=<valueN>]
Configure the following property in the Advanced Properties of the common properties section:
- infapdo.java.opts
List of Java options to customize the Java run-time environment. The property contains default values.
If mappings in a MapR environment contain a Consolidation transformation or a Match transformation, change the following value:
- -Xmx512M. Specifies the maximum heap size for the Java Virtual Machine. Default is 512 MB. Increase the value to at least 700 MB.
For example, infapdo.java.opts=-Xmx700M
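If the property value contains other default options, keep them and change only the -Xmx setting. For example, a hypothetical edited value might look like the following line, where the headless option stands in for any existing options:
infapdo.java.opts=-Djava.awt.headless=true -Xmx700M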
Blaze Engine Advanced Properties
Blaze advanced properties are a list of advanced or custom properties that are unique to the Blaze engine. Each property contains a name and a value. You can add or edit advanced properties.
To edit the property in the text box, use the following format with &: to separate each name-value pair:
<name1>=<value1>[&:<name2>=<value2>…&:<nameN>=<valueN>]
Configure the following properties in the Advanced Properties of the Blaze configuration section:
- infagrid.cadi.namespace
Namespace for the Data Integration Service to use. Required to set up multiple Blaze instances.
Set to <unique namespace>.
For example, infagrid.cadi.namespace=TestUser1_namespace
- infagrid.blaze.console.jsfport
JSF port for the Blaze engine console. Use a port number that no other cluster processes use. Required to set up multiple Blaze instances.
Set to <unique JSF port value>.
For example, infagrid.blaze.console.jsfport=9090
- infagrid.blaze.console.httpport
HTTP port for the Blaze engine console. Use a port number that no other cluster processes use. Required to set up multiple Blaze instances.
Set to <unique HTTP port value>.
For example, infagrid.blaze.console.httpport=9091
- infagrid.node.local.root.log.dir
Path for the Blaze service logs. Default is /tmp/infa/logs/blaze. Required to set up multiple Blaze instances.
Verify that all Blaze users have write permission on /tmp.
Set to <local Blaze services log directory>.
For example, infagrid.node.local.root.log.dir=<directory path>
- infacal.hadoop.logs.directory
Path in HDFS for the persistent Blaze logs. Default is /var/log/hadoop-yarn/apps/informatica. Required to set up multiple Blaze instances.
Set to <persistent log directory path>.
For example, infacal.hadoop.logs.directory=<directory path>
- infagrid.node.hadoop.local.root.log.dir
Path in the Hadoop connection for the service log directory.
Set to <service log directory path>.
For example, infagrid.node.hadoop.local.root.log.dir=$HADOOP_NODE_INFA_HOME/blazeLogs
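To set up a second Blaze instance, you can combine these properties into a single edited value with the &: separator. The following line is an illustrative sketch that reuses the example values above; the log directory paths are placeholders for directories that you choose:
infagrid.cadi.namespace=TestUser1_namespace&:infagrid.blaze.console.jsfport=9090&:infagrid.blaze.console.httpport=9091&:infagrid.node.local.root.log.dir=/tmp/infa/logs/blaze_testuser1&:infacal.hadoop.logs.directory=/var/log/hadoop-yarn/apps/informatica_testuser1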
Spark Engine Advanced Properties
Spark advanced properties are a list of advanced or custom properties that are unique to the Spark engine. Each property contains a name and a value. You can add or edit advanced properties.
To edit the property in the text box, use the following format with &: to separate each name-value pair:
<name1>=<value1>[&:<name2>=<value2>…&:<nameN>=<valueN>]
Configure the following properties in the Advanced Properties of the Spark configuration section:
- infasjs.env.spark.context-settings.passthrough.spark.dynamicAllocation.executorIdleTimeout
Maximum time that a Spark Jobserver executor node can be idle before it is removed. Increase the value to assist in debugging data preview jobs that use the Spark engine.
You can specify the time in seconds, minutes, or hours using the suffix s, m, or h, respectively. If you do not specify a time unit, the property uses milliseconds.
If you disable dynamic resource allocation, this property is not used.
Default is 120s.
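For example, to keep idle executors available for 30 minutes while you debug a data preview job (an illustrative value), set infasjs.env.spark.context-settings.passthrough.spark.dynamicAllocation.executorIdleTimeout=30m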
- infasjs.env.spark.jobserver.max-jobs-per-context
Maximum number of Spark jobs that can run simultaneously on a Spark context. If you increase the value of this property, you might need to allocate more resources by increasing spark.executor.cores and spark.executor.memory.
Default is 10.
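For example, to allow up to 20 concurrent jobs on a Spark context (an illustrative value), set infasjs.env.spark.jobserver.max-jobs-per-context=20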
- infasjs.env.spark.jobserver.sparkJobTimeoutInMinutes
Maximum time in minutes that a Spark job can run on a Spark context before the Spark Jobserver cancels the job. Increase the value to assist in debugging data preview jobs that use the Spark engine.
Default is 15.
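For example, to allow jobs to run for up to 30 minutes (an illustrative value), set infasjs.env.spark.jobserver.sparkJobTimeoutInMinutes=30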
- infaspark.class.log.level.map
Logging level for specific classes in the Spark driver or executor. When you configure this property, it overrides the tracing level you set for the mapping.
Set the value of this property to a JSON string in the following format: {"<fully qualified class name>":"<log level>"}
Join multiple class logging level statements with a comma. You can use the following logging levels: FATAL, WARN, INFO, DEBUG, ALL.
For example, set to:
infaspark.class.log.level.map={"org.apache.spark.deploy.yarn.ApplicationMaster":"TRACE","org.apache.spark.deploy.security.HadoopFSDelegationTokenProvider":"DEBUG"}
- infaspark.driver.cluster.mode.extraJavaOptions
List of extra Java options for the Spark driver that runs inside the cluster. Required for streaming mappings to read from or write to a Kafka cluster that uses Kerberos authentication.
For example, set to:
infaspark.driver.cluster.mode.extraJavaOptions=
-Djava.security.egd=file:/dev/./urandom
-XX:MaxMetaspaceSize=256M -Djavax.security.auth.useSubjectCredsOnly=true
-Djava.security.krb5.conf=/<path to keytab file>/krb5.conf
-Djava.security.auth.login.config=<path to jaas config>/kafka_client_jaas.config
To configure the property for a specific user, you can include the following lines of code:
infaspark.driver.cluster.mode.extraJavaOptions =
-Djava.security.egd=file:/dev/./urandom
-XX:MaxMetaspaceSize=256M -XX:+UseG1GC -XX:MaxGCPauseMillis=500
-Djava.security.krb5.conf=/etc/krb5.conf
- infaspark.driver.log.level
Logging level for the Spark driver logs. When you configure this property, it overrides the tracing level you set for the mapping.
Set the value to one of the following levels: FATAL, WARN, INFO, DEBUG, ALL.
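For example, to capture detailed driver logs, set infaspark.driver.log.level=DEBUG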
- infaspark.executor.extraJavaOptions
List of extra Java options for the Spark executor. Required for streaming mappings to read from or write to a Kafka cluster that uses Kerberos authentication.
For example, set to:
infaspark.executor.extraJavaOptions=
-Djava.security.egd=file:/dev/./urandom
-XX:MaxMetaspaceSize=256M -Djavax.security.auth.useSubjectCredsOnly=true
-Djava.security.krb5.conf=/<path to krb5.conf file>/krb5.conf
-Djava.security.auth.login.config=/<path to jAAS config>/kafka_client_jaas.config
To configure the property for a specific user, you can include the following lines of code:
infaspark.executor.extraJavaOptions =
-Djava.security.egd=file:/dev/./urandom
-XX:MaxMetaspaceSize=256M -XX:+UseG1GC -XX:MaxGCPauseMillis=500
-Djava.security.krb5.conf=/etc/krb5.conf
- infaspark.executor.log.level
Logging level for the Spark executor logs. When you configure this property, it overrides the tracing level you set for the mapping.
Set the value to one of the following levels: FATAL, WARN, INFO, DEBUG, ALL.
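For example, to capture detailed executor logs, set infaspark.executor.log.level=DEBUG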
- infaspark.flatfile.writer.nullValue
When the Databricks Spark engine writes to a target, it converts null values to empty strings (""). For example, a written row might appear as 12,AB,"",23p09udj.
The Databricks Spark engine can write the empty strings to string columns, but when it tries to write an empty string to a non-string column, the mapping fails with a type mismatch.
To allow the Databricks Spark engine to convert the empty strings back to null values and write them to the target, configure the property in the Databricks Spark connection.
Set to: TRUE
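For example, infaspark.flatfile.writer.nullValue=TRUE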
- infaspark.json.parser.mode
Specifies how the parser handles corrupt JSON records. You can set the value to one of the following modes:
- DROPMALFORMED. The parser ignores all corrupted records. Default mode.
- PERMISSIVE. The parser accepts non-standard fields as nulls in corrupted records.
- FAILFAST. The parser generates an exception when it encounters a corrupted record, and the Spark application goes down.
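For example, to keep corrupted records as null fields instead of dropping them, set infaspark.json.parser.mode=PERMISSIVE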
- infaspark.json.parser.multiLine
Specifies whether the parser can read a multiline record in a JSON file. You can set the value to true or false. Default is false. Applies only to non-native distributions that use Spark version 2.2.x and above.
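For example, infaspark.json.parser.multiLine=true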
- infaspark.pythontx.exec
Required to run a Python transformation on the Spark engine for Data Engineering Integration. The location of the Python executable binary on the worker nodes in the Hadoop cluster.
For example, set to:
infaspark.pythontx.exec=/usr/bin/python3.4
If you use the installation of Python on the Data Integration Service machine, set the value to the Python executable binary in the Informatica installation directory on the Data Integration Service machine.
For example, set to:
infaspark.pythontx.exec=INFA_HOME/services/shared/spark/python/lib/python3.4
- infaspark.pythontx.executorEnv.LD_PRELOAD
Required to run a Python transformation on the Spark engine for Data Engineering Streaming. The location of the Python shared library in the Python installation folder on the Data Integration Service machine.
For example, set to:
infaspark.pythontx.executorEnv.LD_PRELOAD=
INFA_HOME/services/shared/spark/python/lib/libpython3.6m.so
- infaspark.pythontx.executorEnv.PYTHONHOME
Required to run a Python transformation on the Spark engine for Data Engineering Integration and Data Engineering Streaming. The location of the Python installation directory on the worker nodes in the Hadoop cluster.
For example, set to:
infaspark.pythontx.executorEnv.PYTHONHOME=/usr
If you use the installation of Python on the Data Integration Service machine, use the location of the Python installation directory on the Data Integration Service machine.
For example, set to:
infaspark.pythontx.executorEnv.PYTHONHOME=
INFA_HOME/services/shared/spark/python/
- infaspark.pythontx.submit.lib.JEP_HOME
Required to run a Python transformation on the Spark engine for Data Engineering Streaming. The location of the Jep package in the Python installation folder on the Data Integration Service machine.
For example, set to:
infaspark.pythontx.submit.lib.JEP_HOME=
INFA_HOME/services/shared/spark/python/lib/python3.6/site-packages/jep/
- infaspark.useHiveWarehouseAPI
Enables the Hive Warehouse Connector. Set to TRUE.
For example, infaspark.useHiveWarehouseAPI=true.
- spark.authenticate
Enables authentication for the Spark service on Hadoop. Required for Spark encryption.
Set to TRUE.
For example, spark.authenticate=TRUE
- spark.authenticate.enableSaslEncryption
Enables encrypted communication when SASL authentication is enabled. Required if Spark encryption uses SASL authentication.
Set to TRUE.
For example, spark.authenticate.enableSaslEncryption=TRUE
- spark.datasource.hive.warehouse.load.staging.dir
Directory for the temporary HDFS files used for batch writes to Hive. Required when you enable the Hive Warehouse Connector.
For example, set to /tmp.
- spark.datasource.hive.warehouse.metastoreUri
URI for the Hive metastore. Required when you enable the Hive Warehouse Connector. Use the value for hive.metastore.uris from the hive-site.xml cluster configuration properties.
For example, set the value to thrift://mycluster-1.com:9083.
- spark.driver.cores
Indicates the number of cores that each driver uses to run jobs on the Spark engine.
Set to: spark.driver.cores=1
- spark.driver.extraJavaOptions
List of extra Java options for the Spark driver.
When you write date/time data within a complex data type to a Hive target using a Hortonworks HDP 3.1 cluster, append the following value to the property: -Duser.timezone=UTC
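For example, an edited value that keeps an existing option and appends the time zone setting might look like the following line. The metaspace option is illustrative and stands in for any options already present:
spark.driver.extraJavaOptions=-XX:MaxMetaspaceSize=256M -Duser.timezone=UTC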
- spark.driver.memory
Indicates the amount of driver process memory that the Spark engine uses to run jobs.
Recommended value: Allocate at least 256 MB for every data source.
Set to: spark.driver.memory=3G
- spark.executor.cores
Indicates the number of cores that each executor process uses to run tasklets on the Spark engine.
Set to: spark.executor.cores=1
- spark.executor.extraJavaOptions
List of extra Java options for the Spark executor.
When you write date/time data within a complex data type to a Hive target using a Hortonworks HDP 3.1 cluster, append the following value to the property: -Duser.timezone=UTC
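For example, if the property has no other options, set spark.executor.extraJavaOptions=-Duser.timezone=UTC. Otherwise, append -Duser.timezone=UTC to the existing value.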
- spark.executor.instances
Indicates the number of executor instances that the Spark engine uses to run tasklets.
Set to: spark.executor.instances=1
- spark.executor.memory
Indicates the amount of memory that each executor process uses to run tasklets on the Spark engine.
Set to: spark.executor.memory=3G
- spark.hadoop.hive.llap.daemon.service.hosts
Application name for the LLAP service. Required when you enable the Hive Warehouse Connector. Use the value for hive.llap.daemon.service.hosts from the hive-site.xml cluster configuration properties.
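For example, if the cluster's hive-site.xml sets hive.llap.daemon.service.hosts to @llap0 (an illustrative value), set spark.hadoop.hive.llap.daemon.service.hosts=@llap0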
- spark.hadoop.hive.zookeeper.quorum
ZooKeeper hosts used by Hive LLAP. Required when you enable the Hive Warehouse Connector. Use the value for hive.zookeeper.quorum from the hive-site.xml cluster configuration properties.
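For example, with an illustrative three-node ZooKeeper ensemble, set spark.hadoop.hive.zookeeper.quorum=zk1.example.com:2181,zk2.example.com:2181,zk3.example.com:2181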
- spark.hadoop.validateOutputSpecs
Validates if the HBase table exists. Required for streaming mappings to write to an HBase target in an Amazon EMR cluster. Set the value to false.
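For example, spark.hadoop.validateOutputSpecs=false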
- spark.scheduler.maxRegisteredResourcesWaitingTime
The number of milliseconds to wait for resources to register before scheduling a task. Default is 30000. Decrease the value to reduce delays before starting the Spark job execution. Required to improve performance for mappings on the Spark engine.
Set to 15000.
For example, spark.scheduler.maxRegisteredResourcesWaitingTime=15000
- spark.scheduler.minRegisteredResourcesRatio
The minimum ratio of registered resources to acquire before task scheduling begins. Default is 0.8. Decrease the value to reduce any delay before starting the Spark job execution. Required to improve performance for mappings on the Spark engine.
Set to: 0.5
For example, spark.scheduler.minRegisteredResourcesRatio=0.5
- spark.shuffle.encryption.enabled
Enables encrypted communication when authentication is enabled. Required for Spark encryption.
Set to TRUE.
For example, spark.shuffle.encryption.enabled=TRUE
- spark.sql.hive.hiveserver2.jdbc.url
URL for HiveServer2 Interactive. Required to use the Hive Warehouse Connector. Use the value in Ambari for HiveServer2 JDBC URL.
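For example, an illustrative URL for a cluster that uses ZooKeeper service discovery might look like jdbc:hive2://zk1.example.com:2181/;serviceDiscoveryMode=zooKeeper;zooKeeperNamespace=hiveserver2-interactive. Copy the actual value from Ambari.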
- spark.yarn.access.hadoopFileSystems
Comma-separated list of external file systems that the Spark service can access. By default, the Spark service has access to the file systems listed in fs.defaultFS in the core-site.xml configuration set of the cluster configuration. Set this property to give the Spark service access to additional file systems.
If you run a mapping on a Cloudera CDP Public Cloud cluster and you use HDFS on a Cloudera Data Lake cluster, you must allow access to that file system. Append the value for the property fs.defaultFS found in core-site.xml on the Data Lake cluster. For example: spark.yarn.access.hadoopFileSystems=hdfs://infarndcdppamdl-master1.infarndc.src9-ltfl.cloudera.site:8020