
Hive

Apache Hive is data warehouse software built on Apache Hadoop that provides data query and analysis. Hive provides a SQL-like interface to query data stored in various databases and file systems that integrate with Hadoop.

Objects Extracted

The Hive resource extracts metadata from assets in a Hive data source, such as schemas, tables, and views.

Permissions to Configure the Resource

Configure read permission on the Hive data source for the user account that you use to access the data source.
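Depending on the authorization mode configured for Hive, read access might be granted with a statement like the following minimal sketch. The database name sales and the account name edc_scanner are hypothetical placeholders:

    -- Hedged sketch: grant read access on a database to the scanner account.
    -- "sales" and "edc_scanner" are placeholder names; adapt to your environment
    -- and authorization mode.
    GRANT SELECT ON DATABASE sales TO USER edc_scanner;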

Prerequisites

If the domain is SSL-enabled and the cluster is Kerberos-enabled, perform the following steps:
  1. Copy the krb5.conf and infa_truststore.jks files from the cluster, and the hive.service.keytab file from /etc/security/keytabs, to the required locations on the Informatica domain.
  2. Specify the krb5.conf file path in the Informatica domain and the Informatica Cluster Service nodes.
  3. Add the Service Principal Name (SPN) and the keytab properties to the Data Integration Service properties.
  4. Run the kinit command on the Informatica domain with the required SPN and keytab file, as shown in the example after this list.
  5. Run the kinit command on the cluster with the required SPN and keytab file.
  6. Restart the Data Integration Service and the Catalog Service.
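For example, a minimal kinit invocation on the Informatica domain host might look like the following. The principal hive/node1.example.com@EXAMPLE.COM is a hypothetical SPN; substitute the SPN and keytab path used in your cluster:

    # Obtain a Kerberos ticket with the Hive service keytab and SPN.
    kinit -kt /etc/security/keytabs/hive.service.keytab hive/node1.example.com@EXAMPLE.COM
    # Verify that the ticket was granted.
    klist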

Choose Hadoop Connection for Existing Hive Resources

If you have a Hive resource that uses the Hive engine as the run-time environment, edit the resource, choose Blaze, Spark, or Databricks as the Run on option, select a Hadoop connection, and run the resource to view the profile results.

Import the Hive Resource

Perform the following steps if you use a Hive resource:
  1. If you want to enable profiling for the Hive resource, create a cluster configuration based on the type of cluster that you use.
  2. Edit the Hive resource to assign the new Hive connection in the Source Connection field.
  3. Select Blaze, Spark, or Databricks as the Run on option.
  4. Select the Hadoop connection name.
  5. If the Hive data source is located in a Kerberos-enabled cluster, perform the following steps:
    a. Use Informatica Administrator to configure the Hadoop Kerberos Service Principal Name and the Hadoop Kerberos Keytab in the properties for the Data Integration Service.
    b. Set the value of Data Integration Service Hadoop Distribution Directory to /data/
    c. Create the cluster configuration if it was not created prior to upgrade. If it was created prior to upgrade, refresh it using Informatica Administrator or the CLI, as shown in the example after this list.
    d. Use Informatica Administrator to append the following line to the JVM command line options for the Data Integration Service: -Djava.security.krb5.conf=/data. The JVM options are located in the Processes section of the Data Integration Service.
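If you refresh the cluster configuration from the CLI, the command might look like the following sketch. The domain, user, and configuration names are placeholders, and the exact infacmd options can vary by Informatica version:

    # Hedged sketch: refresh an existing cluster configuration from the CLI.
    infacmd.sh cluster refreshConfiguration -dn MyDomain -un Administrator -pd '<password>' -cn my_cluster_config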

Basic Information

The General tab includes the following basic information about the resource:

Name
  The name of the resource.
Description
  The description of the resource.
Resource type
  The type of the resource.
Execute On
  You can choose to execute the resource on the default catalog server or offline.

Resource Connection Properties

The following list describes the connection properties:

Hadoop Distribution
  Select one of the following Hadoop distribution types for the Hive resource:
  - Cloudera
  - Hortonworks
  - MapR
  - Amazon EMR
  - Azure HDInsight
  - IBM BigInsights
URL
  The JDBC connection URL used to access the Hive server. See the example after this list.
User
  The Hive user name.
Password
  The password for the Hive user name.
Keytab file
  The path to the keytab file if Hive uses Kerberos for authentication.
User proxy
  The proxy user name to use if Hive uses Kerberos for authentication.
Kerberos Configuration File
  The path to the Kerberos configuration file if you use Kerberos-based authentication for Hive.
Enable Debug for Kerberos
  Select this option to enable debugging options for Kerberos-based authentication.

For information about configuring Azure storage access for the Azure HDInsight distribution type, see the Data Engineering Integration Guide.
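For example, a HiveServer2 JDBC URL for a Kerberos-enabled cluster typically follows this pattern. The host, port, database, and principal shown are placeholders:

    jdbc:hive2://hive-host.example.com:10000/default;principal=hive/_HOST@EXAMPLE.COM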
The following list describes the Additional and Advanced properties for source metadata settings on the Metadata Load Settings tab:

Enable Source Metadata
  Select to extract metadata from the data source.
Schema
  Click Select... to specify the Hive schemas that you want to import. You can use one of the following options in the Select Schema dialog box:
  - Select from List: Select the required schemas from a list of available schemas.
  - Select using regex: Provide an SQL regular expression to select schemas that match the expression.
Source Metadata Filter
  Include or exclude tables and views from the resource run. Use semicolons (;) to separate table names and view names.
  For more information about the filter field, see Source Metadata and Data Profile Filter.
Table
  The name of the Hive table that you want to import. If you leave this property blank, Enterprise Data Catalog imports all the Hive tables.
SerDe jars list
  The path to the Serializer/Deserializer (SerDe) jar file list. You can specify multiple jar files by separating the file paths with semicolons (;).
Worker Threads
  The number of worker threads that process metadata asynchronously. You can leave the value empty if you want Enterprise Data Catalog to calculate it. Enterprise Data Catalog assigns a value between one and six based on the JVM architecture and the number of available CPU cores.
  Use the following points to decide the value:
  - You can provide a value that is greater than or equal to one and less than six to specify the number of worker threads required.
  - If you specify an invalid value, Enterprise Data Catalog shows a warning and uses the value one.
  - If your machine has more memory, you can specify a higher value to process more metadata asynchronously.
  Note: Specifying a higher value might impact the performance of the system.
Case Sensitive
  Specifies whether the resource is configured as case sensitive. Select one of the following values:
  - True: Select this check box to specify that the resource is configured as case sensitive.
  - False: Clear this check box to specify that the resource is configured as case insensitive.
  The default value is False.
Memory
  The memory value required to run a scanner job. Specify one of the following values:
  - Low
  - Medium
  - High
  Note: For more information about the memory values, see the Tuning Enterprise Data Catalog Performance article on the How-To Library Articles tab in the Informatica Doc Portal.
Custom Options
  JVM parameters that you can set to configure the scanner container. Use the following arguments to configure the parameters (see the example after this list):
  - -Dscannerloglevel=<DEBUG/INFO/ERROR>: Changes the log level of the scanner to DEBUG, ERROR, or INFO. The default value is INFO.
  - -Dscanner.container.core=<number of cores>: Increases the cores for the scanner container. The value must be a number.
  - -Dscanner.yarn.app.environment=<key=value>: Key-value pairs to set in the YARN environment. Use a comma to separate the key-value pairs.
  - -Dscanner.pmem.enabled.container.memory.jvm.memory.ratio=<1.0/2.0>: Increases the scanner container memory when pmem is enabled. The default value is 1.
Track Data Source Changes
  View metadata source change notifications in Enterprise Data Catalog.
Auto Assign Connections
  Indicates whether connections must be assigned automatically.
Enable Reference Resources
  Extracts metadata about assets that are not included in this resource but are referred to in it. Examples include source and target tables in PowerCenter mappings, and source tables and files from Tableau reports.
Retain Unresolved Reference Assets
  Retains unresolved reference assets in the catalog after you assign connections. Retaining unresolved reference assets helps you view the complete lineage. Unresolved assets include deleted files, temporary tables, and other assets that are not present in the primary resource.
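For example, to run the scanner with debug logging and additional container cores, the Custom Options field might contain the following combination of the arguments described above. The YARN key-value pair shown is a hypothetical placeholder:

    -Dscannerloglevel=DEBUG -Dscanner.container.core=4 -Dscanner.yarn.app.environment=LANG=en_US.UTF-8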
You can enable data discovery for a Hive resource. For more information, see the Enable Data Discovery topic.
You can enable composite data domain discovery for a Hive resource. For more information, see the Composite Data Domain Discovery topic.

Configure Hive Resource with Apache Knox Gateway

Enterprise Data Catalog supports Knox if you configure Hive for Knox. Verify that Informatica and the Hive hosting service are installed on the same cluster.
Note: You cannot deploy Enterprise Data Catalog on a cluster if you configure all the services on the nodes for Knox. Verify that you configure Knox for the Hive service only, and not for the other services running on the nodes.
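For example, when Hive is accessed through Knox, the JDBC URL in the resource connection properties typically routes over HTTP through the gateway, as in the following sketch. The host, port, and topology path are placeholders; substitute the values for your gateway:

    jdbc:hive2://knox-host.example.com:8443/;ssl=true;transportMode=http;httpPath=gateway/default/hive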