Hive
Apache Hive is data warehouse software built on Apache Hadoop that provides data query and analysis. Apache Hive provides a SQL-like interface to query data stored in the various databases and file systems that integrate with Hadoop.
Objects Extracted
The Hive resource extracts metadata from the following assets in a Hive data source:
- Tables
- Views
- Database
- Schema
- Connection details for views of different schemas
Permissions to Configure the Resource
Configure read permission on the Hive data source for the user account that you use to access the data source.
Prerequisites
If the domain is SSL-enabled and the cluster is Kerberos-enabled, perform the following steps:
- 1. Copy the krb5.conf and infa_truststore.jks files from the cluster, and the hive.service.keytab file from /etc/security/keytabs to the following locations on the Informatica domain:
- - <INFA_HOME>/services/shared/security
- - <INFA_HOME>/java/jre/lib/security
- 2. Specify the krb5.conf file path in the Informatica domain and the Informatica Cluster Service nodes.
- 3. Add the Service Principal Name (SPN) and the keytab properties to the Data Integration Service properties.
- 4. Run the kinit command on the Informatica domain with the required SPN and keytab file.
- 5. Run the kinit command on the cluster with the required SPN and keytab file.
- 6. Restart the Data Integration Service and the Catalog Service.
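As a rough sketch, steps 1 and 4 above might look like the following when run from the Informatica domain node. All host names, the realm, the SPN, and the truststore source path are placeholders, and the exact copy mechanism and restart procedure vary by deployment:

```shell
# Placeholder paths and hosts; substitute your own values.
export INFA_HOME=/opt/informatica

# Step 1: copy krb5.conf, infa_truststore.jks, and the Hive keytab from the cluster.
scp cluster-node:/etc/krb5.conf "$INFA_HOME/services/shared/security/"
scp cluster-node:/etc/krb5.conf "$INFA_HOME/java/jre/lib/security/"
scp cluster-node:/path/to/infa_truststore.jks "$INFA_HOME/services/shared/security/"
scp cluster-node:/etc/security/keytabs/hive.service.keytab "$INFA_HOME/services/shared/security/"

# Step 4: obtain a ticket on the domain with the required SPN and keytab
# (hive/<host>@<REALM> here is a placeholder SPN).
kinit -kt "$INFA_HOME/services/shared/security/hive.service.keytab" \
      hive/cluster-node.example.com@EXAMPLE.COM
```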
Choose Hadoop Connection for Existing Hive Resources
If you have a Hive resource that uses the Hive engine as the run-time environment, edit the resource, choose Blaze, Spark, or Databricks as the Run on option, select a Hadoop connection, and run the resource to view the profile results.
Import the Hive Resource
Perform the following steps if you use a Hive resource:
- 1. If you want to enable profiling for the Hive resource, create a cluster configuration based on the type of cluster that you use.
- 2. Edit the Hive resource to assign the new Hive connection in the Source Connection field.
- 3. Select Blaze, Spark, or Databricks as the Run on option.
- 4. Select the Hadoop connection name.
- 5. If the Hive data source is located in a Kerberos-enabled cluster, make sure that you perform the following steps:
- a. Use Informatica Administrator to configure the Hadoop Kerberos Service Principal Name and the Hadoop Kerberos Keytab in the properties for the Data Integration Service.
- b. Set the value of Data Integration Service Hadoop Distribution Directory to /data/
- c. Create the cluster configuration if the cluster configuration was not created prior to upgrade. If the cluster configuration was created prior to upgrade, refresh the cluster configuration using the Informatica Administrator or the CLI.
- d. Use Informatica Administrator to append the following line to the JVM command line options for the Data Integration Service: -Djava.security.krb5.conf=/data. The JVM options are located in the Processes section of the Data Integration Service.
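Taken together, steps a through d amount to the following Data Integration Service settings in Informatica Administrator. The SPN and keytab path are placeholder values; the /data paths come from the steps themselves:

```
Hadoop Kerberos Service Principal Name                  : hive/_HOST@EXAMPLE.COM
Hadoop Kerberos Keytab                                  : /etc/security/keytabs/hive.service.keytab
Data Integration Service Hadoop Distribution Directory  : /data/
JVM Command Line Options (Processes section)            : -Djava.security.krb5.conf=/data
```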
Basic Information
The General tab includes the following basic information about the resource:
Information | Description
---|---
Name | The name of the resource.
Description | The description of the resource.
Resource type | The type of the resource.
Execute On | You can choose to execute on the default catalog server or offline.
Resource Connection Properties
The following table describes the connection properties:
Property | Description
---|---
Hadoop Distribution | The Hadoop distribution type for the Hive resource. Select one of the following types: Cloudera, Hortonworks, MapR, Amazon EMR, Azure HDInsight, or IBM BigInsights.
URL | The JDBC connection URL used to access the Hive server.
User | The Hive user name.
Password | The password for the Hive user name.
Keytab file | The path to the keytab file if Hive uses Kerberos authentication.
User proxy | The proxy user name to use if Hive uses Kerberos authentication.
Kerberos Configuration File | The path to the Kerberos configuration file if you use Kerberos-based authentication for Hive.
Enable Debug for Kerberos | Select this option to enable debugging options for Kerberos-based authentication.
For information about configuring Azure storage access for the Azure HDInsight distribution type, see the Data Engineering Integration Guide.
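The URL property above follows the standard HiveServer2 JDBC format. The sketch below composes example values for a plain cluster and for a Kerberos-enabled cluster; the host, port, database, and realm are invented placeholders, not values from this guide:

```shell
# Placeholder connection details; substitute your cluster's values.
HIVE_HOST="hive-node.example.com"
HIVE_PORT="10000"
HIVE_DB="default"

# HiveServer2 JDBC URL for a cluster without Kerberos:
HIVE_URL="jdbc:hive2://${HIVE_HOST}:${HIVE_PORT}/${HIVE_DB}"
echo "${HIVE_URL}"

# On a Kerberos-enabled cluster, append the Hive service principal:
HIVE_URL_KRB="${HIVE_URL};principal=hive/_HOST@EXAMPLE.COM"
echo "${HIVE_URL_KRB}"
```

On Kerberos-enabled clusters the `principal` parameter must name the Hive service principal, since the JDBC client authenticates to HiveServer2 with that SPN.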
The following table describes the Additional and Advanced properties for source metadata settings on the Metadata Load Settings tab:
Property | Description
---|---
Enable Source Metadata | Select to extract metadata from the data source.
Schema | Click Select... to specify the Hive schemas that you want to import. In the Select Schema dialog box, use Select from List to choose the required schemas from a list of available schemas, or use Select using regex to provide an SQL regular expression that selects the schemas that match the expression.
Source Metadata Filter | You can include or exclude tables and views from the resource run. Use semicolons (;) to separate the table names and view names. For more information about the filter field, see Source Metadata and Data Profile Filter.
Table | Specify the name of the Hive table that you want to import. If you leave this property blank, Enterprise Data Catalog imports all the Hive tables.
SerDe jars list | Specify the path to the Serializer/Deserializer (SerDe) jar file list. You can specify multiple jar files by separating the jar file paths with semicolons (;).
Worker Threads | Specify the number of worker threads that process metadata asynchronously. Leave the value empty if you want Enterprise Data Catalog to calculate the value; Enterprise Data Catalog assigns a value between one and six based on the JVM architecture and the number of available CPU cores.
Case Sensitive | Specifies whether the resource is case sensitive. Select True to configure the resource as case sensitive, or False to configure it as case insensitive. Default is False.
Memory | Specify the memory value required to run a scanner job. Note: For more information about memory values, see the Tuning Enterprise Data Catalog Performance article on the How To-Library Articles tab in the Informatica Doc Portal.
Custom Options | JVM parameters that you can set to configure the scanner container: -Dscannerloglevel=<DEBUG/INFO/ERROR> changes the scanner log level (default is INFO); -Dscanner.container.core=<number of cores> increases the cores for the scanner container; -Dscanner.yarn.app.environment=<key=value> sets key-value pairs in the YARN environment (separate pairs with commas); -Dscanner.pmem.enabled.container.memory.jvm.memory.ratio=<1.0/2.0> increases the scanner container memory when pmem is enabled (default is 1).
Track Data Source Changes | View metadata source change notifications in Enterprise Data Catalog.
Auto Assign Connections | Indicates whether the connections must be assigned automatically.
Enable Reference Resources | Extracts metadata about assets that are not included in this resource but are referred to in it. Examples include source and target tables in PowerCenter mappings, and source tables and files from Tableau reports.
Retain Unresolved Reference Assets | Retains unresolved reference assets in the catalog after you assign connections. Retaining unresolved reference assets helps you view the complete lineage. Unresolved assets include deleted files, temporary tables, and other assets that are not present in the primary resource.
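To make the filter and option syntax above concrete, a hypothetical Metadata Load Settings configuration might look like the following. All schema, table, and jar path values are invented for illustration:

```
Source Metadata Filter : sales_db.orders;sales_db.customers;reporting.daily_summary
SerDe jars list        : /opt/serde/json-serde.jar;/opt/serde/csv-serde.jar
Custom Options         : -Dscannerloglevel=DEBUG -Dscanner.container.core=4
```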
You can enable data discovery for a Hive resource. For more information about enabling data discovery, see the Enable Data Discovery topic.
You can enable composite data domain discovery for a Hive resource. For more information about enabling composite data domain discovery, see the Composite Data Domain Discovery topic.
Configure Hive Resource with Apache Knox Gateway
Enterprise Data Catalog supports Apache Knox if you configure Hive for Knox. Verify that you install Informatica and the Hive hosting service on the same cluster.
Note: You cannot deploy Enterprise Data Catalog on a cluster if you configure all the services on the nodes for Knox. Verify that you configure Knox for the Hive service only, and not for the other services running on the nodes.