Hive
Apache Hive is data warehouse software built on Apache Hadoop that provides data query and analysis. Apache Hive provides a SQL-like interface to query data stored in various databases and file systems that integrate with Hadoop.
Objects Extracted
The Hive resource extracts metadata from the following assets in a Hive data source:
- Tables
- Views
- Databases
- Schemas
- Connection details for views of different schemas
Permissions to Configure the Resource
Configure read permission on the Hive data source for the user account that you use to access the data source.
Prerequisites
Perform the following steps to complete the prerequisites:
1. Copy the hive.service.keytab file from /etc/security/keytabs to a location on the Informatica domain that is accessible to all the nodes in the cluster.
   Note: If the Informatica domain is deployed on multiple nodes, copy the file to the same location on every node on which you deployed the domain, and verify that the location is accessible to all the nodes in the cluster.
2. Enter values for the following Hive data source connection properties (a minimal connection sketch follows these steps):
   - User. The Hive user name for Kerberos authentication.
   - URL. The Hive JDBC connection string.
   - Keytab file. The path to the keytab file if Hive uses Kerberos authentication.
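The following minimal Java sketch shows how these three properties fit together when Hive uses Kerberos. The host name, realm, principal, and file paths are hypothetical placeholders, and the sketch assumes the hive-jdbc and hadoop-common jars are on the classpath; it is a connectivity check, not part of the resource configuration itself:

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.security.UserGroupInformation;

public class HiveKerberosConnectionCheck {

    public static void main(String[] args) throws Exception {
        // Hypothetical values -- replace with your environment's settings.
        String principal = "hive/node1.example.com@EXAMPLE.COM";           // User
        String keytab = "/opt/informatica/keytabs/hive.service.keytab";    // Keytab file
        String url = "jdbc:hive2://node1.example.com:10000/default;"
                   + "principal=" + principal;                             // URL

        // Tell the Hadoop client libraries to use Kerberos, then log in
        // with the keytab file copied in step 1.
        Configuration conf = new Configuration();
        conf.set("hadoop.security.authentication", "kerberos");
        UserGroupInformation.setConfiguration(conf);
        UserGroupInformation.loginUserFromKeytab(principal, keytab);

        // Open a JDBC session and run a trivial query to verify the setup.
        try (Connection conn = DriverManager.getConnection(url);
             Statement stmt = conn.createStatement();
             ResultSet rs = stmt.executeQuery("SHOW DATABASES")) {
            while (rs.next()) {
                System.out.println(rs.getString(1));
            }
        }
    }
}
```

If the login or the query fails, verify the keytab location, the principal name, and the Kerberos configuration file before you create the resource.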
Resource Connection Properties
The following table describes the connection properties:
| Property | Description |
| --- | --- |
| Hadoop Distribution | The Hadoop distribution type for the Hive resource. Select one of the following: Cloudera, Hortonworks, MapR, Amazon EMR, Azure HDInsight, or IBM BigInsights. |
| URL | The JDBC connection URL used to access the Hive server. See the examples after this table. |
| User | The Hive user name. |
| Password | The password for the Hive user name. |
| Keytab file | The path to the keytab file if Hive uses Kerberos authentication. |
| User proxy | The proxy user name to use if Hive uses Kerberos authentication. |
| Kerberos Configuration File | The path to the Kerberos configuration file if you use Kerberos-based authentication for Hive. |
| Enable Debug for Kerberos | Select this option to enable debugging for Kerberos-based authentication. |
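For example, the URL property typically takes one of the following forms. The host, port, database, realm, and user names are hypothetical placeholders; the principal= part applies only when Hive uses Kerberos, and hive.server2.proxy.user= is the standard HiveServer2 connection-string parameter for proxying, shown here on the assumption that it corresponds to the User proxy property:

```
jdbc:hive2://node1.example.com:10000/default
jdbc:hive2://node1.example.com:10000/default;principal=hive/node1.example.com@EXAMPLE.COM
jdbc:hive2://node1.example.com:10000/default;principal=hive/node1.example.com@EXAMPLE.COM;hive.server2.proxy.user=etl_user
```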
For information about configuring Azure storage access for the Azure HDInsight distribution type, see the Data Engineering Integration Guide.
The following table describes the Additional and Advanced properties for source metadata settings on the Metadata Load Settings tab:
| Property | Description |
| --- | --- |
| Enable Source Metadata | Select to extract metadata from the data source. |
| Schema | Click Select... to specify the Hive schemas that you want to import. In the Select Schema dialog box, use one of the following options to import schemas: Select from List, to choose the required schemas from a list of available schemas, or Select using regex, to provide an SQL regular expression that selects the schemas that match the expression. |
| Source Metadata Filter | Include or exclude tables and views from the resource run. Separate table names and view names with semicolons (;). See the example after this table. For more information about the filter field, see Source Metadata and Data Profile Filter. |
| Table | The name of the Hive table that you want to import. If you leave this property blank, Enterprise Data Catalog imports all Hive tables. |
| SerDe jars list | The path to the Serializer/Deserializer (SerDe) jar file list. To specify multiple jar files, separate the jar file paths with semicolons (;). |
| Worker Threads | The number of worker threads used to process metadata asynchronously. Leave the value empty if you want Enterprise Data Catalog to calculate it: the catalog assigns a value between one and six based on the JVM architecture and the number of available CPU cores. |
| Case Sensitive | Specifies whether the resource is configured as case sensitive. Select True to configure the resource as case sensitive, or False to configure it as case insensitive. The default value is False. |
| Memory | The memory value required to run a scanner job. Note: For more information about the memory values, see the Tuning Enterprise Data Catalog Performance article on the How-To Library Articles tab in the Informatica Documentation Portal. |
| JVM Options | JVM parameters that you can set to configure the scanner container (see the example after this table). Use the following arguments: -Dscannerloglevel=<DEBUG/INFO/ERROR> changes the scanner log level to DEBUG, ERROR, or INFO (default is INFO); -Dscanner.container.core=<number of cores> increases the number of cores for the scanner container and must be a number; -Dscanner.yarn.app.environment=<key=value> sets key-value pairs in the YARN environment, separated with commas; -Dscanner.pmem.enabled.container.memory.jvm.memory.ratio=<1.0/2.0> increases the scanner container memory when pmem is enabled (default is 1). |
| Track Data Source Changes | View metadata source change notifications in Enterprise Data Catalog. |
| Auto Assign Connections | Indicates whether the connections must be assigned automatically. |
| Enable Reference Resources | Extracts metadata about assets that are not included in this resource but are referred to by it. Examples include source and target tables in PowerCenter mappings, and source tables and files from Tableau reports. |
| Retain Unresolved Reference Assets | Retains unresolved reference assets in the catalog after you assign connections, which helps you view the complete lineage. Unresolved assets include deleted files, temporary tables, and other assets that are not present in the primary resource. |
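For illustration, the following values show the semicolon-separated formats and the JVM arguments described in the table. The table names, jar paths, and option values are hypothetical; substitute values from your own environment:

```
Source Metadata Filter:  orders;customers;monthly_sales_view
SerDe jars list:         /opt/serde/hive-json-serde.jar;/opt/serde/custom-serde.jar
JVM Options:             -Dscannerloglevel=DEBUG -Dscanner.container.core=4
```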
You can enable data discovery for a Hive resource. For more information about enabling data discovery, see the Enable Data Discovery topic.
You can enable composite data domain discovery for a Hive resource. For more information about enabling composite data domain discovery, see the Composite Data Domain Discovery topic.
Configure Hive Resource with Apache Knox Gateway
Enterprise Data Catalog supports Knox if you configure Hive for Knox. Verify that you install Informatica and the Hive hosting service on the same cluster.
Note: You cannot deploy Enterprise Data Catalog on a cluster if you configure all the services on the nodes for Knox. Verify that you configure Knox for the Hive service only, and not for the other services running on the nodes.
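When HiveServer2 is reached through a Knox gateway, clients typically connect over HTTP transport through the gateway URL. The following connection string shows the usual Apache Knox pattern; the host name, port (8443), and topology name (default) are placeholders for your environment:

```
jdbc:hive2://knox.example.com:8443/;ssl=true;transportMode=http;httpPath=gateway/default/hive
```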