Update Configuration Files on the Domain Environment
You can configure additional functionality through the configuration files on the domain.
When the Informatica installer creates nodes on the domain, it creates directories and files for Big Data Management. When you configure Big Data Management on the domain, you need to edit configuration files to enable functionality.
The following table describes the installation directories and the default paths:
Directory | Description
---|---
Data Integration Service Hadoop distribution directory | The Hadoop distribution directory on the Data Integration Service node. It corresponds to the distribution on the Hadoop environment. You can view the directory path in the Data Integration Service properties of the Administrator tool. Default is <Informatica installation directory>/Informatica/services/shared/hadoop/<Hadoop distribution name>_<version>.
Configuration file directories | Configuration files for Big Data Management are stored in the following directories: <Informatica installation directory>/Informatica/services/shared/hadoop/<Hadoop distribution name>_<version>/infaConf, which contains the hadoopEnv.properties file, and <Informatica installation directory>/Informatica/services/shared/hadoop/<Hadoop distribution name>_<version>/conf, which contains the *-site.xml files.
Update hadoopEnv.properties
Update the hadoopEnv.properties file to configure functionality such as Sqoop connectivity and the Spark run-time engine.
Back up the hadoopEnv.properties file before you configure it. You can find the hadoopEnv.properties file in the following location: <Data Integration Service Hadoop distribution directory>/infaConf
Configure Performance for the Spark Engine
Configure the following performance properties for the Spark engine. A sample of the hadoopEnv.properties entries appears after the list:
- spark.dynamicAllocation.enabled
- Performance configuration to run mappings on the Spark engine. Enables dynamic resource allocation. Required when you enable the external shuffle service.
- Set the value to TRUE.
- spark.shuffle.service.enabled
- Performance configuration to run mappings on the Spark engine. Enables the external shuffle service. Required when you enable dynamic resource allocation.
- Set the value to TRUE.
- spark.scheduler.maxRegisteredResourcesWaitingTime
- Performance configuration to run mappings on the Spark engine. The number of milliseconds to wait for resources to register before scheduling a task. Decrease this value from the default of 30000 to shorten the delay before Spark job execution starts.
- Set the value to 15000.
- spark.scheduler.minRegisteredResourcesRatio
- Performance configuration to run mappings on the Spark engine. The minimum ratio of registered resources to acquire before task scheduling begins. Decrease this value from the default of 0.8 to shorten the delay before Spark job execution starts.
- Set the value to 0.5.
- spark.executor.instances
- Performance configuration to run mappings on the Spark engine. If you enable dynamic resource allocation for the Spark engine, Informatica recommends that you convert this property to a comment. For example, #spark.executor.instances=100
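The entries might look like the following in hadoopEnv.properties. This is a sketch based on the values described above; the exact set of Spark entries and their defaults can vary by Hadoop distribution package:
spark.dynamicAllocation.enabled=TRUE
spark.shuffle.service.enabled=TRUE
spark.scheduler.maxRegisteredResourcesWaitingTime=15000
spark.scheduler.minRegisteredResourcesRatio=0.5
#spark.executor.instances=100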
Configure Sqoop Connectivity
Configure the following property for Sqoop connectivity:
- infapdo.env.entry.hadoop_node_jdk_home
Configure the HADOOP_NODE_JDK_HOME to represent the JDK version that the cluster nodes use. You must use JDK version 1.7 or later.
Configure the property as follows:
infapdo.env.entry.hadoop_node_jdk_home=HADOOP_NODE_JDK_HOME=<cluster JDK home>/jdk<version>
- For example,
infapdo.env.entry.hadoop_node_jdk_home=HADOOP_NODE_JDK_HOME=/usr/java/default
Configure Environment Variables
You can optionally add third-party environment variables and extend the existing PATH environment variable in the hadoopEnv.properties file. The following text shows sample entries to configure environment variables:
infapdo.env.entry.oracle_home=ORACLE_HOME=/databases/oracle
infapdo.env.entry.db2_home=DB2_HOME=/databases/db2
infapdo.env.entry.db2instance=DB2INSTANCE=OCA_DB2INSTANCE
infapdo.env.entry.db2codepage=DB2CODEPAGE="1208"
infapdo.env.entry.odbchome=ODBCHOME=$HADOOP_NODE_INFA_HOME/ODBC7.1
infapdo.env.entry.home=HOME=/opt/thirdparty
infapdo.env.entry.gphome_loaders=GPHOME_LOADERS=/databases/greenplum
infapdo.env.entry.pythonpath=PYTHONPATH=$GPHOME_LOADERS/bin/ext
infapdo.env.entry.nz_home=NZ_HOME=/databases/netezza
infapdo.env.entry.ld_library_path=LD_LIBRARY_PATH=$HADOOP_NODE_INFA_HOME/services/shared/bin:$HADOOP_NODE_INFA_HOME/DataTransformation/bin:$HADOOP_NODE_HADOOP_DIST/lib/native:$HADOOP_NODE_INFA_HOME/ODBC7.1/lib:$HADOOP_NODE_INFA_HOME/jre/lib/amd64:$HADOOP_NODE_INFA_HOME/jre/lib/amd64/server:$HADOOP_NODE_INFA_HOME/java/jre/lib/amd64:$HADOOP_NODE_INFA_HOME/java/jre/lib/amd64/server:/databases/oracle/lib:/databases/db2/lib64:$LD_LIBRARY_PATH
infapdo.env.entry.path=PATH=$HADOOP_NODE_HADOOP_DIST/scripts:$HADOOP_NODE_INFA_HOME/services/shared/bin:$HADOOP_NODE_INFA_HOME/jre/bin:$HADOOP_NODE_INFA_HOME/java/jre/bin:$HADOOP_NODE_INFA_HOME/ODBC7.1/bin:/databases/oracle/bin:/databases/db2/bin:$PATH
#teradata
infapdo.env.entry.twb_root=TWB_ROOT=/databases/teradata/tbuild
infapdo.env.entry.manpath=MANPATH=/databases/teradata/odbc_64:/databases/teradata/odbc_64
infapdo.env.entry.nlspath=NLSPATH=/databases/teradata/odbc_64/msg/%N:/databases/teradata/msg/%N
infapdo.env.entry.pwd=PWD=/databases/teradata/odbc_64/samples/C
Note: If you plan to push down a mapping that includes a Consolidation transformation or a Match transformation, update the Java virtual memory value in the hadoopEnv.properties file.
The hadoopEnv.properties file has the following default Java virtual memory value:
infapdo.java.opts=-Xmx512M
To run a mapping that includes a Consolidation transformation or a Match transformation, increase the Xmx value to at least 700M.
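For example, assuming the entry contains only the -Xmx option as shown above, you might change it as follows:
infapdo.java.opts=-Xmx700M
If your file sets additional Java options in infapdo.java.opts, keep them and change only the -Xmx value.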
Update hive-site.xml
Update the hive-site.xml file to set properties for dynamic partitioning.
Configure Dynamic Partitioning
To use Hive dynamic partitioned tables, configure the following properties. A sample of the hive-site.xml entries appears after the list:
- hive.exec.dynamic.partition
- Enables dynamic partitioned tables. Set this value to TRUE.
- hive.exec.dynamic.partition.mode
- Allows all partitions to be dynamic. Set this value to nonstrict.
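A sketch of the corresponding hive-site.xml entries, using the values described above, might look like the following:
<property>
<name>hive.exec.dynamic.partition</name>
<value>TRUE</value>
</property>
<property>
<name>hive.exec.dynamic.partition.mode</name>
<value>nonstrict</value>
</property>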
Update yarn-site.xml
Update the yarn-site.xml file on the domain to enable access to Hive tables in Amazon S3 buckets.
Configure Access to Hive Tables in Amazon S3 Buckets
Configure the AWS access key to run mappings with sources and targets on Hive tables in Amazon S3 buckets.
Note: To use a Hive table as a target on Amazon S3, grant write permission to the bucket through bucket policies, or add these properties to the configuration file. You must add these properties to the core-site.xml file on the Hadoop environment and to the yarn-site.xml file on the domain environment.
Configure the following properties:
- fs.s3a.access.key
- The access key ID that the Blaze and Spark engines use to connect to the Amazon S3a file system. For example,
<property>
<name>fs.s3a.access.key</name>
<value>[Your Access Key]</value>
</property>
- fs.s3a.secret.key
- The secret access key that the Blaze and Spark engines use to connect to the Amazon S3a file system. For example,
<property>
<name>fs.s3a.secret.key</name>
<value>[Your Secret Key]</value>
</property>
When the Hive buckets use a server-side encryption protocol, configure the following properties to enable access to the encrypted buckets:
- fs.s3.enableServerSideEncryption
- Set to TRUE to enable server-side encryption. For example,
<property>
<name>fs.s3.enableServerSideEncryption</name>
<value>TRUE</value>
</property>
- fs.s3a.server-side-encryption-algorithm
- The encryption algorithm that you use to encrypt Hive buckets. For example,
<property>
<name>fs.s3a.server-side-encryption-algorithm</name>
<value>AES256</value>
</property>