Hive Sources
You can include Hive sources in an Informatica mapping that runs in the Hadoop environment.
Consider the following limitations when you configure a Hive source in a mapping that runs in the Hadoop environment:
- •A mapping fails to run when you have Unicode characters in a Hive source definition.
- •The third-party Hive JDBC driver does not return the correct precision and scale values for the Decimal data type. As a result, when you import Hive tables with a Decimal data type into the Developer tool, the Decimal data type precision is set to 38 and the scale is set to 0. Consider the following configuration rules and guidelines based on the version of Hive:
- - Hive 0.11. Accept the default precision and scale for the Decimal data type in the Developer tool.
- - Hive 0.12. Accept the default precision and scale for the Decimal data type in the Developer tool.
- - Hive 0.12 with Cloudera CDH 5.0. You can configure the precision and scale fields for source columns with the Decimal data type in the Developer tool.
- - Hive 0.13 and above. You can configure the precision and scale fields for source columns with the Decimal data type in the Developer tool.
- - Hive 0.14 or above. The precision and scale used for the Decimal data type in the Hive database also appears in the Developer tool.
A mapping that runs on the Spark engine can have partitioned Hive source tables and bucketed sources.
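For reference, the following is a minimal sketch of a partitioned and bucketed Hive table that such a mapping could read, using hypothetical table and column names:
CREATE TABLE sales_by_region
(sale_id INT, amount DECIMAL(10,2))
PARTITIONED BY (region STRING)
CLUSTERED BY (sale_id) INTO 8 BUCKETS
STORED AS ORC;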
PreSQL and PostSQL Commands
You can create SQL commands for Hive sources. Use the SQL commands to run SQL statements such as insert, update, and delete against the Hive source.
PreSQL is an SQL command that runs against the Hive source before the mapping reads from the source. PostSQL is an SQL command that runs against the Hive source after the mapping writes to the target.
You can use PreSQL and PostSQL on the Spark engine. The Data Integration Service does not validate PreSQL or PostSQL commands for a Hive source.
Note: You can manually validate the SQL by running the following query in a Hive command line utility:
CREATE VIEW <table name> (<port list>) AS <SQL>
where:
- •<table name> is a name of your choice
- •<port list> is the comma-delimited list of ports in the source
- •<SQL> is the query to validate
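For example, to validate a query against a hypothetical customer source with ports id and name, you might run the following in the Hive command line utility and drop the view after the validation succeeds:
CREATE VIEW validate_customer_view (id, name) AS SELECT id, name FROM customer;
DROP VIEW validate_customer_view;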
Pre-Mapping SQL Commands
PreSQL is an SQL command that runs against a Hive source before the mapping reads from the source.
For example, you might use a Hive source in a mapping. The data stored in the Hive source changes regularly and you must update the data in the Hive source before the mapping reads from the source to make sure that the mapping reads the latest records. To update the Hive source, you can configure a PreSQL command.
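For example, a PreSQL command similar to the following might refresh the Hive source from a staging table before the mapping reads the source. The table names are hypothetical:
INSERT OVERWRITE TABLE customer_src SELECT * FROM customer_staging;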
Post-Mapping SQL Commands
PostSQL is an SQL command that runs against a Hive source after the mapping writes to the target.
For example, you might use a Hive source in a mapping. After the mapping writes to a target, you might want to delete the stage records stored in the Hive source. You want to run the command only after the mapping writes the data to the target to make sure that the data is not removed prematurely from the Hive source. To delete the records in the Hive source table after the mapping writes to the target, you can configure a PostSQL command.
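For example, a PostSQL command similar to the following might remove the stage records after the mapping writes to the target. The table name is hypothetical:
TRUNCATE TABLE customer_staging;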
Rules and Guidelines for Pre- and Post-Mapping SQL Commands
Consider the following restrictions when you run PreSQL and PostSQL commands against Hive sources:
- •When you create an SQL override on a Hive source, you must enclose keywords or special characters in backtick (`) characters, as shown in the example after this list.
- •When you run a mapping with a Hive source in the Hadoop environment, references to a local path in pre-mapping SQL commands are relative to the Data Integration Service node. When you run a mapping with a Hive source in the native environment, references to local path in pre-mapping SQL commands are relative to the Hive server node.
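For example, an SQL override that references reserved words as column names might enclose them in backtick characters. The table and column names are hypothetical:
SELECT `user`, `timestamp`, city FROM web_events;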
Rules and Guidelines for Hive Sources on the Blaze Engine
You can include Hive sources in an Informatica mapping that runs on the Blaze engine.
Consider the following rules and guidelines when you configure a Hive source in a mapping that runs on the Blaze engine:
- •Hive sources for a Blaze mapping can use the TEXT, Sequence, Avro, RCFile, ORC, and Parquet storage formats.
- •A mapping that runs on the Blaze engine can have bucketed Hive sources and Hive ACID tables.
- •Hive ACID tables must be bucketed. See the example after this list.
- •The Blaze engine supports Hive tables that are enabled for locking.
- •Hive sources can contain quoted identifiers in Hive table names, column names, and schema names.
- •The TEXT storage format in a Hive source for a Blaze mapping supports ASCII characters as column delimiters and the newline character as the row separator. You cannot use hex values of ASCII characters. For example, use a semicolon (;) instead of 3B.
- •You can define an SQL override in the Hive source for a Blaze mapping.
- •The Blaze engine can read from an RCFile as a Hive source. To read from an RCFile table, you must create the table with the SerDe clause.
- •The Blaze engine can read from Hive tables that are compressed. To read from a compressed Hive table, you must set the TBLPROPERTIES clause.
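For example, a bucketed Hive ACID table might be created with a statement similar to the following, using hypothetical table and column names:
CREATE TABLE orders_acid
(order_id INT, status STRING)
CLUSTERED BY (order_id) INTO 4 BUCKETS
STORED AS ORC
TBLPROPERTIES ('transactional'='true');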
RCFile as Hive Tables
The Blaze engine can read and write to RCFile as Hive tables. However, the Blaze engine supports only the ColumnarSerDe SerDe. In Hortonworks, the default SerDe for an RCFile is LazyBinaryColumnarSerDe. To read and write to an RCFile table, you must create the table by specifying the SerDe as org.apache.hadoop.hive.serde2.columnar.ColumnarSerDe.
For example:
CREATE TABLE TEST_RCFILE
(id int, name string)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.columnar.ColumnarSerDe'
STORED AS RCFILE;
You can also set the default RCFile SerDe from Ambari or Cloudera Manager. Set the property hive.default.rcfile.serde to org.apache.hadoop.hive.serde2.columnar.ColumnarSerDe.
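For example, assuming your cluster allows the property to be changed at the session level, you might also set it in the Hive command line utility:
SET hive.default.rcfile.serde=org.apache.hadoop.hive.serde2.columnar.ColumnarSerDe;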
Compressed Hive Tables
The Blaze engine can read and write to Hive tables that are compressed. However, to read from a compressed Hive table or write to a Hive table in compressed format, you must set the TBLPROPERTIES clause as follows:
- • When you create the table, set the table properties:
TBLPROPERTIES ('property_name'='property_value')
- • If the table already exists, alter the table to set the table properties:
ALTER TABLE table_name SET TBLPROPERTIES ('property_name' = 'property_value');
The property name and value are not case sensitive. Depending on the file format, the table property can take different values.
The following table lists the property names and values for different file formats:
| File Format | Table Property Name | Table Property Values |
|---|---|---|
| Avro | avro.compression | BZIP2, deflate, Snappy |
| ORC | orc.compress | Snappy, ZLIB |
| Parquet | parquet.compression | GZIP, Snappy |
| RCFile | rcfile.compression | Snappy, ZLIB |
| Sequence | sequencefile.compression | BZIP2, GZIP, LZ4, Snappy |
| Text | text.compression | BZIP2, GZIP, LZ4, Snappy |
Note: Unlike the Hive engine, the Blaze engine does not write data in the default ZLIB compressed format when it writes to a Hive target stored as ORC format. To write in a compressed format, alter the table to set the TBLPROPERTIES clause to use ZLIB or Snappy compression for the ORC file format.
The following text shows sample commands to create a table and to alter a table:
- •Create table:
create table CBO_3T_JOINS_CUSTOMER_HIVE_SEQ_GZIP
(C_CUSTKEY DECIMAL(38,0), C_NAME STRING, C_ADDRESS STRING,
C_PHONE STRING, C_ACCTBAL DECIMAL(10,2),
C_MKTSEGMENT VARCHAR(10), C_COMMENT VARCHAR(117))
partitioned by (C_NATIONKEY DECIMAL(38,0))
stored as SEQUENCEFILE
TBLPROPERTIES ('sequencefile.compression'='gzip');
- •Alter table:
ALTER TABLE table_name
SET TBLPROPERTIES ('avro.compression'='BZIP2');
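Similarly, to compress an ORC Hive target as described in the note above, you might run a statement such as the following, using a hypothetical table name:
ALTER TABLE orc_target_table
SET TBLPROPERTIES ('orc.compress'='ZLIB');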