Big Data
This section describes new big data features in version 10.2.
Big Data Management Installation
Effective in version 10.2, the Data Integration Service automatically installs the Big Data Management binaries on the cluster.
When you run a mapping, the Data Integration Service checks for the binary files on the cluster. If they do not exist or if they are not synchronized, the Data Integration Service prepares the files for transfer. It transfers the files to the distributed cache through the Informatica Hadoop staging directory on HDFS. By default, the staging directory is /tmp. This process replaces the requirement to install distribution packages on the Hadoop cluster.
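For intuition, the check-and-transfer flow can be pictured with the following minimal Python sketch. It is purely illustrative: the function, the checksum-based comparison, and the file maps are assumptions, not Informatica's implementation.

```python
# Toy model of the binary synchronization flow (illustrative only).
STAGING_DIR = "/tmp"  # default Informatica Hadoop staging directory on HDFS

def synchronize_binaries(local_files, cluster_files, transfer):
    """Compare local binaries with those on the cluster and stage any
    missing or out-of-date files.

    local_files, cluster_files: dicts mapping file name -> checksum.
    transfer: callable that pushes one file to the HDFS staging directory,
    from which it reaches the distributed cache.
    """
    for name, checksum in local_files.items():
        missing = name not in cluster_files
        stale = not missing and cluster_files[name] != checksum
        if missing or stale:
            transfer(name)  # file is staged under STAGING_DIR
```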
For more information, see the Informatica Big Data Management 10.2 Hadoop Integration Guide.
Cluster Configuration
A cluster configuration is an object in the domain that contains configuration information about the Hadoop cluster. The cluster configuration enables the Data Integration Service to push mapping logic to the Hadoop environment.
When you create the cluster configuration, you import cluster configuration properties that are contained in configuration site files. You can import these properties directly from a cluster or from a cluster configuration archive file. You can also create connections to associate with the cluster configuration.
Previously, you ran the Hadoop Configuration Manager utility to configure connections and other information to enable the Informatica domain to communicate with the cluster.
For more information about cluster configuration, see the "Cluster Configuration" chapter in the Informatica Big Data Management 10.2 Administrator Guide.
Processing Hierarchical Data
Effective in version 10.2, you can use complex data types, such as array, struct, and map, in mappings that run on the Spark engine. With complex data types, the Spark engine directly reads, processes, and writes hierarchical data in Avro, JSON, and Parquet complex files.
Develop mappings with complex ports, operators, and functions to perform the following tasks:
- Generate and modify hierarchical data.
- Transform relational data to hierarchical data.
- Transform hierarchical data to relational data.
- Convert data from one complex file format to another.
When you process hierarchical data, you can use hierarchical conversion wizards to simplify the mapping development tasks. Use these wizards in the following scenarios:
- To generate hierarchical data of type struct from one or more ports.
- To generate hierarchical data of a nested struct type from ports in two transformations.
- To extract elements from hierarchical data in a complex port.
- To flatten hierarchical data in a complex port.
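The following PySpark sketch shows the kind of Spark operations these mapping tasks correspond to. It illustrates the underlying Spark behavior only, not Informatica code; the file paths and field names (cust_id, cust_name, items) are assumptions.

```python
# Minimal PySpark sketch of complex-type (array/struct) processing.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("hierarchical-demo").getOrCreate()

# Read hierarchical JSON; Spark infers array and struct columns.
orders = spark.read.json("/data/orders.json")

# Relational -> hierarchical: build a struct from flat columns.
nested = orders.withColumn(
    "customer", F.struct(F.col("cust_id"), F.col("cust_name"))
)

# Hierarchical -> relational: extract a struct element, flatten an array.
flat = (
    nested
    .withColumn("name", F.col("customer.cust_name"))  # extract element
    .withColumn("item", F.explode("items"))           # flatten array rows
)

# Convert from one complex file format to another (JSON -> Parquet).
flat.write.mode("overwrite").parquet("/data/orders_parquet")
```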
For more information, see the "Processing Hierarchical Data on the Spark Engine" chapter in the Informatica Big Data Management 10.2 User Guide.
Stateful Computing on the Spark Engine
Effective in version 10.2, you can use window functions in an Expression transformation to perform stateful calculations on the Spark engine. Window functions operate on a group of rows and calculate a single return value for every input row. You can use window functions to perform the following tasks:
- Retrieve data from previous or subsequent rows.
- Calculate a cumulative sum based on a group of rows.
- Calculate a cumulative average based on a group of rows.
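Each of these tasks corresponds to a standard Spark window function. The sketch below illustrates the underlying Spark behavior with PySpark; the table and column names are assumptions, not Informatica examples.

```python
# Minimal PySpark sketch of the window functions described above.
from pyspark.sql import SparkSession, functions as F
from pyspark.sql.window import Window

spark = SparkSession.builder.appName("stateful-demo").getOrCreate()

sales = spark.createDataFrame(
    [("east", 1, 100.0), ("east", 2, 150.0),
     ("west", 1, 90.0), ("west", 2, 60.0)],
    ["region", "month", "amount"],
)

# Ordered, partitioned window; 'running' frames all rows up to the current one.
w = Window.partitionBy("region").orderBy("month")
running = w.rowsBetween(Window.unboundedPreceding, Window.currentRow)

sales.select(
    "region", "month", "amount",
    F.lag("amount", 1).over(w).alias("prev_amount"),   # previous row
    F.lead("amount", 1).over(w).alias("next_amount"),  # subsequent row
    F.sum("amount").over(running).alias("cum_sum"),    # cumulative sum
    F.avg("amount").over(running).alias("cum_avg"),    # cumulative average
).show()
```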
For more information, see the "Stateful Computing on the Spark Engine" chapter of the Big Data Management 10.2 User Guide.
Data Integration Service Queuing
Effective in version 10.2, if you deploy multiple mapping jobs or workflow mapping tasks at the same time, the Data Integration Service queues the jobs in a persisted queue and runs the jobs when resources are available. You can view the current status of mapping jobs on the Monitor tab of the Administrator tool.
All queues are persisted by default. If the Data Integration Service node shuts down unexpectedly, the queue does not fail over when the Data Integration Service fails over. The queue remains on the Data Integration Service machine, and the Data Integration Service resumes processing the queue when you restart it.
By default, each queue can hold 10,000 jobs at a time. When the queue is full, the Data Integration Service rejects job requests and marks them as failed. When the Data Integration Service starts running jobs in the queue, you can deploy additional jobs.
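The queue semantics can be summarized with a small, purely illustrative Python model. The class and method names are assumptions, and the real service persists the queue to disk rather than keeping it in memory.

```python
# Toy model of the Data Integration Service job queue behavior.
from collections import deque

QUEUE_CAPACITY = 10_000  # default maximum jobs per queue

class JobQueue:
    def __init__(self, capacity=QUEUE_CAPACITY):
        self.capacity = capacity
        self.jobs = deque()  # persisted to disk in the real service

    def deploy(self, job):
        """Queue a job, or reject it when the queue is full."""
        if len(self.jobs) >= self.capacity:
            return "FAILED"  # full queue: request rejected, marked failed
        self.jobs.append(job)
        return "QUEUED"

    def run_next(self):
        """Run the next job when resources are available, freeing a slot."""
        return self.jobs.popleft() if self.jobs else None
```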
For more information, see the "Queuing" chapter in the Informatica Big Data Management 10.2 Administrator Guide.
Blaze Job Monitor
Effective in version 10.2, you can configure the host and port number to start the Blaze Job Monitor application in the Hadoop connection properties. The default value is <hostname>:9080. If you do not configure the host name, the Blaze engine uses the first alphabetical node in the cluster.
For more information, see the "Connections" chapter in the Big Data Management 10.2 User Guide.
Data Integration Service Properties for Hadoop Integration
Effective in version 10.2, the Data Integration Service includes new properties that are required to integrate the domain with the Hadoop environment.
The following table describes the new properties:
| Property | Description |
| --- | --- |
| Hadoop Staging Directory | The HDFS directory where the Data Integration Service pushes Informatica Hadoop binaries and stores temporary files during processing. Default is /tmp. |
| Hadoop Staging User | Required if the Data Integration Service user is empty. The HDFS user that performs operations on the Hadoop staging directory. The user needs write permission on the Hadoop staging directory. Default is the Data Integration Service user. |
| Custom Hadoop OS Path | The local path to the Informatica Hadoop binaries compatible with the Hadoop operating system. Required when the Hadoop cluster and the Data Integration Service are on different supported operating systems. Download and extract the Informatica binaries for the Hadoop cluster on the machine that hosts the Data Integration Service. The Data Integration Service uses the binaries in this directory to integrate the domain with the Hadoop cluster. The Data Integration Service can synchronize binaries for the supported operating systems. Changes take effect after you recycle the Data Integration Service. |
As a result of the changes in cluster integration, the following properties are removed from the Data Integration Service:
- Informatica Home Directory on Hadoop
- Hadoop Distribution Directory
For more information, see the Informatica 10.2 Hadoop Integration Guide.
Sqoop
Effective in version 10.2, if you use Sqoop data objects, you can use the following specialized Sqoop connectors to run mappings on the Spark engine:
- Cloudera Connector Powered by Teradata
- Hortonworks Connector for Teradata
These specialized connectors use native protocols to connect to the Teradata database.
For more information, see the Informatica Big Data Management 10.2 User Guide.
Autoscaling in an Amazon EMR Cluster
Effective in version 10.2, Big Data Management adds support for Spark mappings to take advantage of autoscaling in an Amazon EMR cluster.
Autoscaling enables the EMR cluster administrator to establish threshold-based rules for adding and removing cluster task and core nodes. Big Data Management certifies support for Spark mappings that run on an autoscaling-enabled EMR cluster.
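For context, the sketch below shows one way an EMR administrator might attach a threshold-based scaling rule with boto3. It is an AWS-side illustration under assumed values; the cluster ID, instance group ID, and thresholds are placeholders, and no Informatica configuration is involved.

```python
# Illustrative EMR autoscaling rule: add a node when available YARN memory
# drops below 15 percent. All values are assumptions, not recommendations.
import boto3

emr = boto3.client("emr", region_name="us-east-1")

emr.put_auto_scaling_policy(
    ClusterId="j-XXXXXXXXXXXXX",        # placeholder cluster ID
    InstanceGroupId="ig-XXXXXXXXXXXX",  # placeholder task instance group
    AutoScalingPolicy={
        "Constraints": {"MinCapacity": 2, "MaxCapacity": 10},
        "Rules": [{
            "Name": "ScaleOutOnLowYarnMemory",
            "Action": {
                "SimpleScalingPolicyConfiguration": {
                    "AdjustmentType": "CHANGE_IN_CAPACITY",
                    "ScalingAdjustment": 1,   # add one node
                    "CoolDown": 300,
                }
            },
            "Trigger": {
                "CloudWatchAlarmDefinition": {
                    "ComparisonOperator": "LESS_THAN",
                    "EvaluationPeriods": 1,
                    "MetricName": "YARNMemoryAvailablePercentage",
                    "Namespace": "AWS/ElasticMapReduce",
                    "Period": 300,
                    "Statistic": "AVERAGE",
                    "Threshold": 15.0,
                    "Unit": "PERCENT",
                }
            },
        }],
    },
)
```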
Transformation Support on the Blaze Engine
Effective in version 10.2, the following transformations have additional support on the Blaze engine:
- Update Strategy. Supports targets that are ORC bucketed on all columns.
For more information, see the "Mapping Objects in a Hadoop Environment" chapter in the Informatica Big Data Management 10.2 User Guide.
Hive Functionality for the Blaze Engine
Effective in version 10.2, mappings that run on the Blaze engine can read and write to bucketed and sorted targets.
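As an illustration of the target side, the HiveQL below creates the kind of bucketed, sorted ORC table such a mapping could write to, submitted here through the PyHive client. The host, table, and column names are assumptions, not values from the Informatica documentation.

```python
# Create a bucketed, sorted Hive target table (illustrative DDL).
from pyhive import hive

conn = hive.connect(host="hive-server.example.com", port=10000)
cursor = conn.cursor()
cursor.execute("""
    CREATE TABLE IF NOT EXISTS sales_bucketed (
        order_id INT,
        region   STRING,
        amount   DOUBLE
    )
    CLUSTERED BY (order_id) SORTED BY (order_id ASC) INTO 8 BUCKETS
    STORED AS ORC
""")
```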
For information about how to configure mappings for the Blaze engine, see the "Mappings in a Hadoop Environment" chapter in the Informatica Big Data Management 10.2 User Guide.
Transformation Support on the Spark Engine
Effective in version 10.2, the following transformations are supported with restrictions on the Spark engine:
- Normalizer
- Rank
- Update Strategy
Effective in version 10.2, the following transformations have additional support on the Spark engine:
- Lookup. Supports unconnected lookup from the Filter, Aggregator, Router, Expression, and Update Strategy transformations.
For more information, see the "Mapping Objects in a Hadoop Environment" chapter in the Informatica Big Data Management 10.2 User Guide.
Hive Functionality for the Spark Engine
Effective in version 10.2, the following functionality is supported for mappings that run on the Spark engine:
- Reading and writing to Hive resources in Amazon S3 buckets
- Reading and writing to transactional Hive tables
- Reading and writing to Hive table columns that are secured with fine-grained SQL authorization
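For reference, the HiveQL below creates the kind of transactional (ACID) Hive table such a mapping could read from or write to, again submitted through the PyHive client. Names are assumptions; note that transactional Hive tables must be bucketed and stored as ORC.

```python
# Create a transactional (ACID) Hive table (illustrative DDL).
from pyhive import hive

conn = hive.connect(host="hive-server.example.com", port=10000)
cursor = conn.cursor()
cursor.execute("""
    CREATE TABLE IF NOT EXISTS orders_acid (
        order_id INT,
        status   STRING
    )
    CLUSTERED BY (order_id) INTO 4 BUCKETS
    STORED AS ORC
    TBLPROPERTIES ('transactional' = 'true')
""")
```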
For information about how to configure mappings for the Spark engine, see the "Mappings in a Hadoop Environment" chapter in the Informatica Big Data Management 10.2 User Guide.