Creating a Hadoop Connection as a Source

To use the Hadoop Connector in a Data Synchronization task, you must create a Hadoop connection in Informatica Cloud.
Perform the following steps to create a Hadoop connection in Informatica Cloud:
    1. On the Informatica Cloud home page, click Configure.
    The Configure menu appears.
    2. Select Connections.
    The Connections page appears.
    3. Click New to create a connection.
    The New connection page appears.
    4. Specify values for the connection parameters. The following list describes the connection properties; a sample set of values appears after these steps.
    Username
    The user name of the schema of the Hadoop component.
    Password
    The password of the schema of the Hadoop component.
    JDBC Connection URL
    The JDBC URL to connect to the Hadoop component. See JDBC URL.
    Driver
    The JDBC driver class to connect to the Hadoop component. See JDBC Driver Class.
    Commit Interval
    The batch size, in rows, for loading data into Hive.
    Hadoop Installation Path
    The installation path of the Hadoop component.
    Not applicable to a Kerberos cluster.
    Hive Installation Path
    The installation path of Hive.
    Not applicable to a Kerberos cluster.
    HDFS Installation Path
    The installation path of HDFS.
    Not applicable to a Kerberos cluster.
    HBase Installation Path
    The installation path of HBase.
    Not applicable to a Kerberos cluster.
    Impala Installation Path
    The installation path of Impala.
    Not applicable to a Kerberos cluster.
    Miscellaneous Library Path
    The path of the libraries that communicate with Hadoop.
    Not applicable to a Kerberos cluster.
    Enable Logging
    Enables log messages.
    Hadoop Distribution
    The Hadoop distribution. You can use Kerberos authentication with the Cloudera and HDP distributions.
    Authentication Type
    The authentication type: native or Kerberos.
    Key Tab File
    The file that contains encrypted keys and Kerberos principals to authenticate the machine.
    Hive Site XML
    The directory that contains the core-site.xml, hive-site.xml, and hdfs-site.xml files. The three XML files must be located in the same directory.
    Superuser Principal Name
    Users with the superuser privilege can perform all the tasks that a user with the administrator privilege can perform.
    Impersonation Username
    The name of the user to impersonate. To enable different users to run mappings in a Hadoop cluster that uses Kerberos authentication, or to connect to sources and targets that use Kerberos authentication, you must configure user impersonation.
    Note: Installation paths are the paths where you place the Hadoop jar files. Hadoop Connector loads the libraries from the installation paths before it sends instructions to Hadoop. When you use the Kerberos authentication type, you do not need to specify the Hadoop, Hive, HDFS, HBase, or Impala installation paths, or the miscellaneous library path.
    If you do not use Kerberos authentication and do not specify the installation paths, you can instead set the Hadoop classpath for Amazon EMR, Hortonworks, MapR, and Cloudera. See Setting Hadoop Classpath for non-Kerberos Clusters.
    5. Click Test to evaluate the connection.
    6. Click OK to save the connection.
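For orientation, the following is a minimal sketch of the key property values for a hypothetical non-Kerberos Hive connection. Every value is an assumption, including the host invrlx63iso7 taken from the example in the JDBC URL section, and must be adjusted to your cluster.

    Username:               hive
    Password:               ********
    JDBC Connection URL:    jdbc:hive2://invrlx63iso7:10000/default
    Driver:                 org.apache.hive.jdbc.HiveDriver
    Commit Interval:        10000
    Hive Installation Path: /usr/lib/hive/lib
    Authentication Type:    native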

JDBC URL

The connector connects to different components of Hadoop with JDBC. The URL format and parameters differ from component to component.
Hive uses the following JDBC URL format:
jdbc:<hive/hive2>://<server>:<port>/<schema>
For example, jdbc:hive2://invrlx63iso7:10000/default connects to the default schema of Hive through HiveServer2, the Hive Thrift server, running on the host invrlx63iso7 on port 10000.
Hadoop Connector uses the Hive Thrift server to communicate with Hive. The command to start the Thrift server is hive --service hiveserver2.
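For illustration, the following is a minimal sketch of starting HiveServer2 and confirming that it listens on its default port. The wait time and the port number 10000 are assumptions; adjust them for your cluster.

    hive --service hiveserver2 &
    sleep 30                     # give the Thrift server time to come up
    netstat -an | grep 10000     # expect a LISTEN entry for port 10000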
Cloudera Impala uses a JDBC URL in the following format:
jdbc:hive2://<server>:<port>/;auth=<auth mechanism>
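As an illustration, you can test a URL of this form with Beeline, the command-line client that ships with Hive. The host impalad01 and port 21050 are hypothetical (21050 is Impala's default port for the HiveServer2 protocol), and noSasl is a common mechanism on clusters without authentication.

    beeline -u "jdbc:hive2://impalad01:21050/;auth=noSasl" -e "SHOW DATABASES;"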

JDBC Driver Class

The JDBC driver class varies among Hadoop components. For example, Hive and Impala both use org.apache.hive.jdbc.HiveDriver.
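As a quick check, the following sketch passes the driver class and the Hive URL from the earlier example to Beeline; the host, port, and schema are the same assumptions as before.

    beeline -d org.apache.hive.jdbc.HiveDriver \
            -u "jdbc:hive2://invrlx63iso7:10000/default" \
            -e "SELECT 1;"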

Setting Hadoop Classpath for non-Kerberos Clusters

For non-Kerberos Hadoop clusters, if you do not specify the installation paths in the connection properties, you can still connect by setting the Hadoop classpath for the respective distribution.
This section describes how to set the classpath for Amazon EMR, Hortonworks, Pivotal, and MapR.

Setting Hadoop Classpath for Amazon EMR, Hortonworks, Pivotal, and MapR

Perform the following steps to generate the setHadoopConnectorClasspath.sh file for Amazon EMR, Hortonworks, Pivotal, and MapR.
    1. Start the Secure Agent.
    2. Create the Hadoop connection using the connector.
    3. Test the connection. This generates the setHadoopConnectorClasspath.sh file in the <Secure Agent installation directory>/apps/Data_Integration_Server/<latest DIS version>/ICS/main/tomcat directory.
    4. Stop the Secure Agent.
    5. From the <Secure Agent installation directory>, run the . ./main/tomcat/setHadoopConnectorClasspath.sh command to source the script.
    6. Start the Secure Agent and run the Data Synchronization tasks. A sketch of steps 4 through 6 appears after the following note.
    Note: To generate the setHadoopConnectorClasspath.sh file again, delete the existing file and repeat these steps.
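    The following is a minimal sketch of steps 4 through 6 on Linux. The installation directory /opt/infaagent is hypothetical, and the exact commands to stop and start the Secure Agent vary by platform, so they appear as placeholders.

        cd /opt/infaagent                               # hypothetical Secure Agent installation directory
        # <stop the Secure Agent>                       # step 4
        . ./main/tomcat/setHadoopConnectorClasspath.sh  # step 5: source the generated script
        # <start the Secure Agent and run the task>     # step 6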
    In certain cases, the Hadoop classpath may point to incorrect entries. In that case, correct it as follows:
    a. Enter the hadoop classpath command in a terminal to display the stream of jar files.
    b. Copy the stream of jar files to a text file.
    c. If the following entries are present in the file, delete them:
    d. Copy the remaining content and export it to a variable named HADOOP_CLASSPATH.
    e. In the saas-infaagentapp.sh file, make the following entry (see the sketch after this list):
    f. Restart the Secure Agent.
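    The following is a minimal sketch of steps d and e, assuming a Cloudera-style jar layout; the actual value must come from your own hadoop classpath output with the unwanted entries removed.

        # Step d: export the cleaned classpath (example paths only).
        export HADOOP_CLASSPATH="/etc/hadoop/conf:/usr/lib/hadoop/*:/usr/lib/hadoop/lib/*"

        # Step e: add the same export line to saas-infaagentapp.sh so that the
        # variable is set each time the Secure Agent starts, then restart the
        # agent (step f).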