Hive
Apache Hive is data warehouse software built on Apache Hadoop that provides data query and analysis. Apache Hive provides a SQL-like interface to query data stored in various databases and file systems that integrate with Hadoop.
Objects Extracted
The Hive resource extracts metadata from the following assets in a Hive data source:
- Tables
- Views
- Databases
- Schemas
- Connection details for views of different schemas
Permissions to Configure the Resource
Configure read permission on the Hive data source for the user account that you use to access the data source.
Prerequisites
Perform the following steps to complete the prerequisites:
1. Copy the hive.service.keytab file from /etc/security/keytabs to a location on the Informatica domain that is accessible to all the nodes in the cluster.
   Note: If the Informatica domain is deployed on multiple nodes, copy the file to the same location on every node on which you deployed the domain, and verify that the location is accessible to all the nodes in the cluster.
2. Enter values for the following Hive data source connection properties (a minimal connection sketch follows these steps):
   - User. The Hive user name for Kerberos authentication.
   - URL. The Hive JDBC connection string.
   - Keytab file. The path to the keytab file if Hive uses Kerberos authentication.
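The following minimal Java sketch shows how these three properties fit together when Hive uses Kerberos. The host name, realm, principal, and file paths are hypothetical placeholders, and the sketch assumes the hive-jdbc and hadoop-common jars are on the classpath; it is a connectivity check, not part of the resource configuration itself:

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.security.UserGroupInformation;

public class HiveKerberosConnectionCheck {

    public static void main(String[] args) throws Exception {
        // Hypothetical values -- replace with your environment's settings.
        String principal = "hive/node1.example.com@EXAMPLE.COM";           // User
        String keytab = "/opt/informatica/keytabs/hive.service.keytab";    // Keytab file
        String url = "jdbc:hive2://node1.example.com:10000/default;"
                   + "principal=" + principal;                             // URL

        // Tell the Hadoop client libraries to use Kerberos, then log in
        // with the keytab file copied in step 1.
        Configuration conf = new Configuration();
        conf.set("hadoop.security.authentication", "kerberos");
        UserGroupInformation.setConfiguration(conf);
        UserGroupInformation.loginUserFromKeytab(principal, keytab);

        // Open a JDBC session and run a trivial query to verify the setup.
        try (Connection conn = DriverManager.getConnection(url);
             Statement stmt = conn.createStatement();
             ResultSet rs = stmt.executeQuery("SHOW DATABASES")) {
            while (rs.next()) {
                System.out.println(rs.getString(1));
            }
        }
    }
}
```

If the login or the query fails, verify the keytab location, the principal name, and the Kerberos configuration file before you create the resource.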
Resource Connection Properties
The following table describes the connection properties:
| Property | Description |
| --- | --- |
| Hadoop Distribution | The Hadoop distribution type for the Hive resource. Select one of the following: Cloudera, Hortonworks, MapR, Amazon EMR, Azure HDInsight, or IBM BigInsights. |
| URL | The JDBC connection URL used to access the Hive server. See the examples after this table. |
| User | The Hive user name. |
| Password | The password for the Hive user name. |
| Keytab file | The path to the keytab file if Hive uses Kerberos authentication. |
| User proxy | The proxy user name to use if Hive uses Kerberos authentication. |
| Kerberos Configuration File | The path to the Kerberos configuration file if you use Kerberos-based authentication for Hive. |
| Enable Debug for Kerberos | Select this option to enable debugging for Kerberos-based authentication. |
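For example, the URL property typically takes one of the following forms. The host, port, database, realm, and user names are hypothetical placeholders; the principal= part applies only when Hive uses Kerberos, and hive.server2.proxy.user= is the standard HiveServer2 connection-string parameter for proxying, shown here on the assumption that it corresponds to the User proxy property:

```
jdbc:hive2://node1.example.com:10000/default
jdbc:hive2://node1.example.com:10000/default;principal=hive/node1.example.com@EXAMPLE.COM
jdbc:hive2://node1.example.com:10000/default;principal=hive/node1.example.com@EXAMPLE.COM;hive.server2.proxy.user=etl_user
```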
For information about configuring Azure storage access for the Azure HDInsight distribution type, see the Data Engineering Integration Guide.
The following table describes the Additional and Advanced properties for source metadata settings on the Metadata Load Settings tab:
| Property | Description |
| --- | --- |
| Enable Source Metadata | Select to extract metadata from the data source. |
| Schema | Click Select... to specify the Hive schemas that you want to import. In the Select Schema dialog box, use one of the following options to import schemas: Select from List, to choose the required schemas from a list of available schemas, or Select using regex, to provide an SQL regular expression that selects the schemas that match the expression. |
| Source Metadata Filter | Include or exclude tables and views from the resource run. Separate table names and view names with semicolons (;). See the example after this table. For more information about the filter field, see Source Metadata and Data Profile Filter. |
| Table | The name of the Hive table that you want to import. If you leave this property blank, Enterprise Data Catalog imports all Hive tables. |
| SerDe jars list | The path to the Serializer/Deserializer (SerDe) jar file list. To specify multiple jar files, separate the jar file paths with semicolons (;). |
| Worker Threads | The number of worker threads used to process metadata asynchronously. Leave the value empty if you want Enterprise Data Catalog to calculate it: the catalog assigns a value between one and six based on the JVM architecture and the number of available CPU cores. |
| Case Sensitive | Specifies whether the resource is configured as case sensitive. Select True to configure the resource as case sensitive, or False to configure it as case insensitive. The default value is False. |
| Memory | The memory value required to run a scanner job. Note: For more information about the memory values, see the Tuning Enterprise Data Catalog Performance article on the How-To Library Articles tab in the Informatica Documentation Portal. |
| JVM Options | JVM parameters that you can set to configure the scanner container (see the example after this table). Use the following arguments: -Dscannerloglevel=<DEBUG/INFO/ERROR> changes the scanner log level to DEBUG, ERROR, or INFO (default is INFO); -Dscanner.container.core=<number of cores> increases the number of cores for the scanner container and must be a number; -Dscanner.yarn.app.environment=<key=value> sets key-value pairs in the YARN environment, separated with commas; -Dscanner.pmem.enabled.container.memory.jvm.memory.ratio=<1.0/2.0> increases the scanner container memory when pmem is enabled (default is 1). |
| Track Data Source Changes | View metadata source change notifications in Enterprise Data Catalog. |
| Auto Assign Connections | Indicates whether the connections must be assigned automatically. |
| Enable Reference Resources | Extracts metadata about assets that are not included in this resource but are referred to by it. Examples include source and target tables in PowerCenter mappings, and source tables and files from Tableau reports. |
| Retain Unresolved Reference Assets | Retains unresolved reference assets in the catalog after you assign connections, which helps you view the complete lineage. Unresolved assets include deleted files, temporary tables, and other assets that are not present in the primary resource. |
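For illustration, the following values show the semicolon-separated formats and the JVM arguments described in the table. The table names, jar paths, and option values are hypothetical; substitute values from your own environment:

```
Source Metadata Filter:  orders;customers;monthly_sales_view
SerDe jars list:         /opt/serde/hive-json-serde.jar;/opt/serde/custom-serde.jar
JVM Options:             -Dscannerloglevel=DEBUG -Dscanner.container.core=4
```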
You can enable data discovery for a Hive resource. For more information about enabling data discovery, see the Enable Data Discovery topic.
You can enable composite data domain discovery for a Hive resource. For more information about enabling composite data domain discovery, see the Composite Data Domain Discovery topic.
Configure Hive Resource with Apache Knox Gateway
Enterprise Data Catalog supports Knox if you configure Hive for Knox. Verify that you install Informatica and the Hive hosting service on the same cluster.
Note: You cannot deploy Enterprise Data Catalog on a cluster if you configure all the services on the nodes for Knox. Verify that you configure Knox for the Hive service only, and not for the other services running on the nodes.
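When HiveServer2 is reached through a Knox gateway, clients typically connect over HTTP transport through the gateway URL. The following connection string shows the usual Apache Knox pattern; the host name, port (8443), and topology name (default) are placeholders for your environment:

```
jdbc:hive2://knox.example.com:8443/;ssl=true;transportMode=http;httpPath=gateway/default/hive
```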