
Embedded Hadoop Cluster Deployment

When you install Enterprise Data Catalog on an embedded Hadoop cluster, you can choose to create application services, such as the Model Repository Service, Data Integration Service, and Catalog Service.
The Enterprise Data Catalog installer creates an Informatica Cluster Service application service if you choose the embedded Hadoop distribution. Enterprise Data Catalog uses Apache Ambari to manage and monitor the embedded Hadoop cluster. The embedded Hadoop cluster for Enterprise Data Catalog supports the high availability option.
The following components of the Enterprise Data Catalog embedded Hadoop cluster environment support the high availability option:
- HDFS
- HBase
- YARN
- Solr

Prerequisites for the Embedded Cluster

Before you install Enterprise Data Catalog on an embedded Hadoop cluster, verify that the system environment meets the prerequisites for deploying Enterprise Data Catalog.
Verify that the internal Hadoop distribution meets the following prerequisites:

Preparing the Internal Hadoop Cluster Environment

You need to perform multiple validation checks before you install Enterprise Data Catalog on an internal Hadoop cluster.
Perform the following steps before you install Enterprise Data Catalog on an internal Hadoop cluster environment:

Informatica Cluster Service

The Informatica Cluster Service is an application service that runs and manages all the Hadoop services, Apache Ambari server, and Apache Ambari agents on an embedded Hadoop cluster. If you choose the embedded cluster deployment mode, you need to create the Informatica Cluster Service before you create the Catalog Service. Then, you can pass the Informatica Cluster Service value to the Catalog Service.
Informatica Cluster Service distributes the Hortonworks binaries and launches the required Hadoop services on the hosts where the embedded cluster runs.
You can deploy Informatica Cluster Service on hosts where Centrify is enabled. Centrify integrates with an existing Active Directory infrastructure to manage user authentication on remote Linux hosts.
Note: Informatica does not integrate with Centrify to manage or generate keytabs.
You can deploy Informatica Cluster Service on hosts that allow SSH access using encryption algorithms that the JSch library supports.
The following list shows the supported algorithms for each method:

Key exchange:
- diffie-hellman-group-exchange-sha1
- diffie-hellman-group1-sha1
- diffie-hellman-group14-sha1
- diffie-hellman-group-exchange-sha256
- ecdh-sha2-nistp256
- ecdh-sha2-nistp384
- ecdh-sha2-nistp521

Cipher:
- blowfish-cbc
- 3des-cbc
- aes128-cbc
- aes192-cbc
- aes256-cbc
- aes128-ctr
- aes192-ctr
- aes256-ctr
- 3des-ctr
- arcfour
- arcfour128
- arcfour256

MAC:
- hmac-md5
- hmac-sha1
- hmac-md5-96
- hmac-sha1-96

Host key type:
- ssh-dss
- ssh-rsa
- ecdsa-sha2-nistp256
- ecdsa-sha2-nistp384
- ecdsa-sha2-nistp521
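If the installer cannot connect to a host over SSH, one common cause is an sshd configuration that disables the algorithms listed above. The following sketch shows an OpenSSH sshd_config fragment that explicitly enables a compatible subset. The directive names (KexAlgorithms, Ciphers, MACs, HostKeyAlgorithms) are standard OpenSSH; the specific algorithm selection is an example, not a recommendation, and you should verify it against your security policy before enabling legacy algorithms such as arcfour or 3des-cbc:

```
# Example sshd_config fragment enabling a JSch-compatible subset.
# Review against your security policy before enabling legacy algorithms.
KexAlgorithms diffie-hellman-group-exchange-sha256,ecdh-sha2-nistp256,diffie-hellman-group14-sha1
Ciphers aes256-ctr,aes192-ctr,aes128-ctr
MACs hmac-sha1,hmac-md5
HostKeyAlgorithms ssh-rsa,ecdsa-sha2-nistp256
```

After changing sshd_config, restart the sshd service on the host for the new algorithm lists to take effect.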

Embedded Cluster Node Management

A Hadoop cluster is a set of machines configured to run Hadoop applications and services. A typical Hadoop cluster includes a master node and multiple slave or worker nodes. The master node runs the master daemons, JobTracker and NameNode. A slave node runs the slave daemons, DataNode and TaskTracker. In small clusters, the master node might also run the slave daemons.
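The node roles described above can be illustrated with a minimal sketch. The helper below is hypothetical, not part of any Informatica or Hadoop tooling; it simply classifies a node by which of the daemons named above it runs:

```python
# Hypothetical sketch: classify a cluster node by the Hadoop daemons it runs.
# Daemon names follow the description above; this is not a product utility.

MASTER_DAEMONS = {"NameNode", "JobTracker"}
SLAVE_DAEMONS = {"DataNode", "TaskTracker"}

def classify_node(daemons):
    """Return the role of a node given the set of daemon names it runs."""
    runs_master = bool(MASTER_DAEMONS & set(daemons))
    runs_slave = bool(SLAVE_DAEMONS & set(daemons))
    if runs_master and runs_slave:
        # Small-cluster case: the master node also runs the slave daemons.
        return "master (also running slave daemons)"
    if runs_master:
        return "master"
    if runs_slave:
        return "slave"
    return "unknown"
```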

Cluster with High Availability

You can use the high availability option for the HDFS, HBase, YARN, and Solr components of the embedded Hadoop cluster environment. If you set up Informatica Cluster Service on a multi-node, highly available cluster, you need a minimum of three nodes for Enterprise Data Catalog to function successfully. If you have already set up Informatica Cluster Service on a single node, you cannot make the cluster highly available by adding more nodes to the cluster.
If the embedded cluster contains only three nodes, Enterprise Data Catalog distributes all master and slave services across all three nodes. If the embedded cluster contains more than three nodes, Enterprise Data Catalog automatically chooses the three nodes with the highest system configuration as master nodes. The remaining nodes serve as slave nodes. When you add nodes to the embedded cluster, the newly added nodes serve as slave nodes. The nodes that you add to the cluster must meet the minimum configuration requirements for slave nodes.
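The selection rule above can be sketched as follows. The scoring input is a hypothetical stand-in for whatever system-configuration metrics the installer actually compares, so this illustrates the rule rather than the product's algorithm:

```python
# Hypothetical sketch of the master-node selection rule described above.
# "score" stands in for the system-configuration metric the installer
# compares (for example, CPU and memory); this is an assumption.

def assign_roles(nodes):
    """nodes: dict of node name -> configuration score.

    Returns (masters, slaves). Requires at least three nodes,
    matching the high availability minimum stated above.
    """
    if len(nodes) < 3:
        raise ValueError("high availability needs at least three nodes")
    ranked = sorted(nodes, key=nodes.get, reverse=True)
    masters = ranked[:3]   # top three by configuration become masters
    slaves = ranked[3:]    # remaining nodes serve as slaves
    return masters, slaves
```

With exactly three nodes, all of them become masters and the slave list is empty, which matches the three-node case where all services run on all three nodes.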

Cluster without High Availability

You can set up Informatica Cluster Service on a single node that is not highly available. In this case, the master and worker services run on the same node. You cannot bring up Informatica Cluster Service if you add a single node to an existing single-node cluster or try to set up Informatica Cluster Service with two nodes.

Delete Nodes

You can delete nodes from the embedded cluster provided they meet the following conditions: