Embedded Hadoop Cluster Deployment
When you install Enterprise Data Catalog on an embedded Hadoop cluster, you can choose to create application services, such as the Model Repository Service, Data Integration Service, and Catalog Service.
The Enterprise Data Catalog installer creates an Informatica Cluster Service application service if you choose the embedded Hadoop distribution. Enterprise Data Catalog uses Apache Ambari to manage and monitor the embedded Hadoop cluster. The embedded Hadoop cluster for Enterprise Data Catalog supports the high availability option.
The following components of the Enterprise Data Catalog embedded Hadoop cluster environment support the high availability option: HDFS, HBase, YARN, and Solr.
Prerequisites for the Embedded Cluster
Before you install Enterprise Data Catalog on an embedded Hadoop cluster, you must verify that the system environment meets the prerequisites required to deploy Enterprise Data Catalog.
Verify that the internal Hadoop distribution meets the following prerequisites:
- Operating system is 64-bit Red Hat Enterprise Linux version 6.5 or later.
Note: For Red Hat Enterprise Linux version 7.0, make sure that you are using the following versions of snappy-devel and sudo:
  - snappy-devel-1.0.5-1.el6.x86_64 on all Apache Ambari hosts.
  - sudo 1.8.16
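For example, you can check the installed versions on each Apache Ambari host:
```
rpm -q snappy-devel sudo    # verify the installed package versions
```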
- Verify that you disable SSL certificate validation if you are using Red Hat Enterprise Linux.
- Verify that the cluster nodes meet the following requirements:
| Node Type | Minimum Requirements |
|---|---|
| Master node | 4 CPUs, 16 GB of unused memory, 60 GB of disk space |
| Slave node | 4 CPUs, 16 GB of unused memory, 60 GB of disk space |
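For example, you can spot-check a node against these minimums with standard Linux commands (output formats differ slightly between RHEL 6 and RHEL 7):
```
nproc        # number of CPUs; expect 4 or more
free -g      # memory in GB; expect 16 GB or more unused
df -h /      # disk space; expect 60 GB or more available
```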
- If the cluster is enabled for SSL, ensure that you import the Ambari Server certificate to the Informatica domain truststore.
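A minimal sketch of the import with keytool, assuming the certificate has been saved to /tmp/ambari-server.crt and the domain truststore is at $INFA_HOME/services/shared/security/infa_truststore.jks; the alias, file path, and password are placeholders:
```
keytool -importcert -alias ambari-server \
  -file /tmp/ambari-server.crt \
  -keystore $INFA_HOME/services/shared/security/infa_truststore.jks \
  -storepass <truststore password>
```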
- Verify that the root directory (/) has a minimum of 10 GB of free disk space.
- If you want to mount Informatica Cluster Service on a separate mount location, verify that the mount location has a minimum of 50 GB of free disk space.
- Verify that the Linux repository includes postgresql version 8.14.18, release 1.el6_4. If it is not installed, install the listed version and release of postgresql.
- Make sure that you merge the user and host keytab files before you enable Kerberos authentication for Informatica Cluster Service.
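One way to merge the two keytab files with MIT Kerberos ktutil (the file names are examples):
```
ktutil <<'EOF'
read_kt /etc/security/keytabs/user.keytab
read_kt /etc/security/keytabs/host.keytab
write_kt /etc/security/keytabs/merged.keytab
EOF
```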
- Verify that you install the following prerequisite packages before you enable Kerberos:
  - krb5-workstation
  - krb5-libs
  - krb5-auth-dialog
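For example, on Red Hat Enterprise Linux:
```
yum install -y krb5-workstation krb5-libs krb5-auth-dialog
```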
- Make sure that the NOEXEC flag is not set for the file system mounted on the /tmp directory.
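You can check the mount options for the /tmp directory as follows:
```
mount | grep ' /tmp '    # the option list must not include noexec
```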
- Ensure that the Linux base repositories are configured.
- Verify that you have write permission on the /home directory.
- On each host machine, verify that you have the following tools and applications available:
  - YUM and RPM (RHEL/CentOS/Oracle Linux)
  - Zypper and php_curl (SLES)
  - apt (Ubuntu)
  - scp, curl, unzip, tar, and wget
  - awk
  - OpenSSL version 1.0.1e-30.el6_6.5.x86_64 or later. Make sure that you do not use versions in the 1.0.2 branch.
Note: Make sure that the $PATH variable points to the /usr/bin directory to use the correct version of Linux OpenSSL.
  - Verify that the secure path in the /etc/sudoers file lists the /usr/bin directory first.
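For example, you can confirm which OpenSSL binary is picked up and review the sudoers secure path (the secure path shown is illustrative):
```
which openssl      # expect /usr/bin/openssl
openssl version    # expect a 1.0.1 build, not 1.0.2

# In /etc/sudoers, /usr/bin must come first, for example:
# Defaults secure_path = /usr/bin:/sbin:/bin:/usr/sbin
```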
  - Python version 2.6.x for Red Hat Enterprise Linux version 6.5.
Note: If you install SUSE Linux Enterprise 11, update all the hosts to Python version 2.6.8-0.15.1.
  - Python version 2.7.x for Red Hat Enterprise Linux version 7.0.
  - If you install on SUSE Linux Enterprise 12, make sure that you install the following RPM Package Manager (RPM) packages on all the cluster nodes:
    - openssl-1.0.1c-2.1.3.x86_64.rpm
    - libopenssl1_0_0-1.0.1c-2.1.3.x86_64.rpm
    - libopenssl1_0_0-32bit-1.0.1c-2.1.3.x86_64.rpm
    - python-devel-2.6.8-0.15.1.x86_64
  - If you have not configured the Linux base repository or if you do not have an Internet connection, install the following packages:
    - Version 8.4 of the following RPMs on the Ambari Server host:
      - postgresql-libs
      - postgresql-server
      - postgresql
    - The following RPMs on all cluster nodes:
      - nc
      - redhat-lsb
      - psmisc
      - python-devel-2.7.5-34.el7.x86_64
  - If you do not have an Internet connection, make sure that you have installed Java Development Kit (JDK) version 1.8. Configure the JAVA_HOME environment variable to point to the JDK installation.
  - If you have an Internet connection and any version of JDK installed, uninstall the JDK.
Note: Enterprise Data Catalog installs JDK version 1.8 and PostgreSQL version 8.4 as part of the Apache Ambari installation. The location of the JDK package is /var/lib/ambari-server/resources/jdk-8u60-linux-x64.tar.gz.
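If you configure JAVA_HOME manually, a minimal sketch follows; the JDK path is a placeholder for your actual install location:
```
export JAVA_HOME=/usr/java/jdk1.8.0_60    # placeholder path
export PATH=$JAVA_HOME/bin:$PATH
java -version                             # expect a 1.8.0 build
```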
- Ensure that you install JDK 1.8 on all cluster nodes.
- Apache Ambari requires certain ports to be open and available during the installation to communicate with the hosts that it deploys and manages. Temporarily disable iptables to meet this requirement.
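For example, to temporarily stop the firewall (re-enable it after the installation completes):
```
service iptables stop       # RHEL/CentOS 6
systemctl stop firewalld    # RHEL/CentOS 7
```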
- Verify that you meet the memory and package requirements for Apache Ambari. For more information, see the Hortonworks documentation.
- Make sure that each machine in the cluster includes the 127.0.0.1 localhost localhost.localdomain entry in the /etc/hosts file.
- Verify that the /etc/hosts file includes the fully qualified host names for all the cluster nodes. Alternatively, make sure that reverse DNS lookup returns the fully qualified host names for all the cluster nodes.
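For example, you can verify both entries on each node (the host name and IP address are examples):
```
grep localhost /etc/hosts         # expect: 127.0.0.1 localhost localhost.localdomain
hostname -f                       # expect the fully qualified host name
getent hosts node1.example.com    # forward lookup by name
getent hosts 192.0.2.11           # reverse lookup must return the fully qualified name
```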
- Before you deploy Enterprise Data Catalog on clusters where Apache Ranger is enabled, make sure that you configure the following permissions for the Informatica domain user:
  - Write permission on the HDFS folder.
  - Permission to submit applications to the YARN queue.
- If the cluster is enabled for SSL, Informatica recommends that you also enable SSL for the Informatica domain, the Informatica Cluster Service, and the Catalog Service.
- If you want to enable Kerberos authentication for Enterprise Data Catalog deployed on a multi-node Informatica domain, make sure that you complete the following prerequisites:
  - Make sure that all the domain nodes include the krb5.conf file in the following directories:
    - $INFA_HOME/services/shared/security/
    - /etc/
  - Make sure that the /etc/hosts file of all cluster nodes and domain nodes includes the Kerberos host entries and a host entry for each of the other nodes.
  - Install krb5-workstation on all domain nodes.
  - Make sure that the keytab file is present in a common location on all domain nodes.
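A sketch of these steps on one domain node, assuming a prepared krb5.conf and a merged keytab; the source paths and the common keytab location are placeholders:
```
cp /path/to/krb5.conf $INFA_HOME/services/shared/security/
cp /path/to/krb5.conf /etc/
yum install -y krb5-workstation
cp /path/to/merged.keytab /etc/security/keytabs/    # use the same location on every domain node
```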
- If you want to enable SSL authentication for Enterprise Data Catalog deployed on a multi-node Informatica domain, make sure that you complete the following prerequisites:
  - Export the Default.keystore certificate of each node to the infa_truststore.jks on all nodes.
  - Make sure that the Default.keystore is unique for each host node.
  - Copy the Default.keystore to a unique location on each node.
  - If the Informatica Cluster Service and the Catalog Service run on different nodes, export the Apache Ambari server certificate to the infa_truststore.jks on all nodes.
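A sketch of the certificate exchange for one node; the keystore locations, alias, and passwords are placeholders:
```
# On node1, export its certificate from Default.keystore:
keytool -exportcert -alias node1 -file node1.crt \
  -keystore /path/to/Default.keystore -storepass <keystore password>

# Copy node1.crt to every node, then import it into each node's truststore:
keytool -importcert -alias node1 -file node1.crt \
  -keystore $INFA_HOME/services/shared/security/infa_truststore.jks \
  -storepass <truststore password>
```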
Preparing the Internal Hadoop Cluster Environment
You need to perform multiple validation checks before you install Enterprise Data Catalog on an internal Hadoop cluster.
Perform the following steps before you install Enterprise Data Catalog on an internal Hadoop cluster environment:
- Configure the /etc/hosts file on each machine so that you have fully qualified domain names. Informatica recommends the following host name format in lowercase: <machine ipaddress> <fully qualified name> <alias>.
Note: To verify the configured host name, run the hostname -f command.
- Set up passwordless Secure Shell (SSH) connections between the following components:
  - From Informatica Cluster Service to the Hadoop Gateway.
  - From the Hadoop Gateway to the Apache Hadoop nodes.
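For example, to set up a passwordless connection from the Hadoop Gateway to a cluster node (the user and host names are examples):
```
ssh-keygen -t rsa -N "" -f ~/.ssh/id_rsa      # generate a key pair if one does not exist
ssh-copy-id user@hadoop-node1.example.com     # append the public key to the node
ssh user@hadoop-node1.example.com hostname    # verify login without a password prompt
```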
- Make sure that the /etc/hosts file on the machine that hosts the Informatica domain includes entries for all the Hadoop hosts.
Informatica Cluster Service
The Informatica Cluster Service is an application service that runs and manages all the Hadoop services, Apache Ambari server, and Apache Ambari agents on an embedded Hadoop cluster. If you choose the embedded cluster deployment mode, you need to create the Informatica Cluster Service before you create the Catalog Service. Then, you can pass the Informatica Cluster Service value to the Catalog Service.
Informatica Cluster Service distributes the Hortonworks binaries and launches the required Hadoop services on the hosts where the embedded cluster runs.
You can deploy Informatica Cluster Service on hosts where Centrify is enabled. Centrify integrates with an existing Active Directory infrastructure to manage user authentication on remote Linux hosts.
Note: Informatica does not integrate with Centrify to manage or generate keytabs.
You can deploy Informatica Cluster Service on hosts that provide access using JSch SSH encryption algorithms.
The following table lists the supported methods and algorithms:
| Method | Algorithms |
|---|---|
| Key exchange | diffie-hellman-group-exchange-sha1, diffie-hellman-group1-sha1, diffie-hellman-group14-sha1, diffie-hellman-group-exchange-sha256, ecdh-sha2-nistp256, ecdh-sha2-nistp384, ecdh-sha2-nistp521 |
| Cipher | blowfish-cbc, 3des-cbc, aes128-cbc, aes192-cbc, aes256-cbc, aes128-ctr, aes192-ctr, aes256-ctr, 3des-ctr, arcfour, arcfour128, arcfour256 |
| MAC | hmac-md5, hmac-sha1, hmac-md5-96, hmac-sha1-96 |
| Host key type | ssh-dss, ssh-rsa, ecdsa-sha2-nistp256, ecdsa-sha2-nistp384, ecdsa-sha2-nistp521 |
Embedded Cluster Node Management
A Hadoop cluster has a set of machines that is configured to run Hadoop applications and services. A typical Hadoop cluster includes a master node and multiple slave or worker nodes. The master node runs the master daemons JobTracker and NameNode. A slave node runs the DataNode and TaskTracker daemons. In small clusters, the master node might also run the slave daemons.
Cluster with High Availability
You can use the high availability option for the HDFS, HBase, YARN, and Solr components of the embedded Hadoop cluster environment. If you set up Informatica Cluster Service on a multi-node, highly available cluster, you need a minimum of three nodes for Enterprise Data Catalog to function successfully. If you have already set up Informatica Cluster Service on a single node, you cannot make the cluster highly available by adding more nodes to the cluster.
If the embedded cluster contains only three nodes, Enterprise Data Catalog distributes all master and slave services across the three nodes. If the embedded cluster contains more than three nodes, Enterprise Data Catalog automatically chooses the three nodes with the highest system configuration as master nodes. The remaining nodes serve as slave nodes. When you add nodes to the embedded cluster, the newly added nodes serve as slave nodes. The nodes that you add to the cluster must meet the minimum configuration requirements for slave nodes.
Cluster without High Availability
You can set up Informatica Cluster Service on a single node that is not highly available. In such cases, the master and worker nodes remain on the same node. You cannot bring up Informatica Cluster Service if you add a single node to an existing single-node cluster or try to set up Informatica Cluster Service with two nodes.
Delete Nodes
You can delete nodes from the embedded cluster, subject to the following restrictions:
- You cannot delete a master node.
- You cannot delete a node if deleting it would reduce the number of live data nodes in the cluster to fewer than three.