Big Data Management
This section describes new Big Data Management features in version 10.2.1.
Blaze Engine Resource Conservation
Effective in version 10.2.1, you can conserve the resources that the Blaze engine infrastructure uses.
Set the infagrid.blaze.service.idle.timeout property to specify the number of minutes that the Blaze engine remains idle before releasing resources. Set the infagrid.orchestrator.svc.sunset.time property to specify the maximum number of hours for the Blaze orchestrator service. You can set these properties with the infacmd isp createConnection command, or set them in the Blaze advanced properties of the Hadoop connection in the Administrator tool or the Developer tool.
For more information about these properties, see the Informatica Big Data Management 10.2.1 Administrator Guide.
Cluster Workflows
Effective in version 10.2.1, you can use new workflow tasks to create a cluster workflow.
A cluster workflow creates a cluster on a cloud platform and runs Mapping tasks and other workflow tasks on the cluster. You can choose to terminate and delete the cluster when the workflow tasks are complete to save cluster resources.
Two new workflow tasks enable you to create and delete a Hadoop cluster as part of a cluster workflow:
- Create Cluster task. The Create Cluster task enables you to create, configure, and start a Hadoop cluster on the following cloud platforms:
  - Amazon Web Services (AWS). You can create an Amazon EMR cluster.
  - Microsoft Azure. You can create an HDInsight cluster.
- Delete Cluster task. The optional Delete Cluster task enables you to delete a cluster after Mapping tasks and any other tasks in the workflow are complete. You might want to do this to save costs.
Previously, you could use Command tasks in a workflow to create clusters on a cloud platform. For more information about cluster workflows and workflow tasks, see the Informatica 10.2.1 Developer Workflow Guide.
Note: In 10.2.1, the Command task method of creating and deleting clusters now supports Cloudera Altus clusters on AWS. For more information, see the article "How to Create Cloudera Altus Clusters with a Cluster Workflow on Big Data Management" on the Informatica Network.
- Mapping task. Mapping task advanced properties include a new ClusterIdentifier property. The ClusterIdentifier property identifies the cluster to use to run the Mapping task.
For more information about cluster workflows, see the Informatica 10.2.1 Developer Workflow Guide.
Cloud Provisioning Configuration
A cloud provisioning configuration is an object that contains information about connecting to a Hadoop cluster.
The cloud provisioning configuration includes information about how to integrate the domain with Hadoop account authentication and storage resources. A cluster workflow uses the information in the cloud provisioning configuration to connect to and create a cluster on a cloud platform such as Amazon Web Services or Microsoft Azure.
For more information about cloud provisioning, see the "Cloud Provisioning Configuration" chapter in the Informatica Big Data Management 10.2.1 Administrator Guide.
High Availability
Effective in version 10.2.1, you can enable high availability for the following services and security systems in the Hadoop environment on Cloudera CDH, Hortonworks HDP, and MapR Hadoop distributions:
- Apache Ranger
- Apache Ranger KMS
- Apache Sentry
- Cloudera Navigator Encrypt
- HBase
- Hive Metastore
- HiveServer2
- Name node
- Resource Manager
Hive Functionality in the Hadoop Environment
This section describes new features for Hive functionality in the Hadoop environment in version 10.2.1.
Hive Table Truncation
Effective in version 10.2.1, you can truncate external partitioned Hive tables on all run-time engines.
You can truncate tables in the following Hive storage formats:
- Avro
- ORC
- Parquet
- RCFile
- Sequence
- Text
You can truncate tables in the following Hive external table formats:
- Hive on HDFS
- Hive on Amazon S3
- Hive on Azure Blob
- Hive on WASB
- Hive on ADLS
For more information on truncating Hive targets, see the "Mapping Targets in the Hadoop Environment" chapter in the Informatica Big Data Management 10.2.1 User Guide.
Pre- and Post-Mapping SQL Commands
Effective in version 10.2.1, you can configure PreSQL and PostSQL commands against Hive sources and targets in mappings that run on the Spark engine.
For more information, see the Informatica Big Data Management 10.2.1 User Guide.
Importing from PowerCenter
This section describes new import from PowerCenter features in version 10.2.1.
Import Session Properties from PowerCenter
Effective in version 10.2.1, you can import session properties, such as SQL-based overrides in relational sources and targets and overrides for the Lookup transformation, from the PowerCenter repository to the Model repository.
For more information about the import from PowerCenter functionality, see the "Import from PowerCenter" chapter in the Informatica 10.2.1 Developer Mapping Guide.
SQL Parameters
Effective in version 10.2.1, you can specify an SQL parameter type to import all SQL-based overrides into the Model repository. The remaining session override properties map to String or a corresponding parameter type.
For more information, see the "Import from PowerCenter" chapter in the Informatica 10.2.1 Developer Mapping Guide.
Import a Command Task from PowerCenter
Effective in version 10.2.1, you can import a Command task from PowerCenter into the Model repository.
For more information, see the "Workflows" chapter in the Informatica 10.2.1 Developer Workflow Guide.
Intelligent Structure Model
Effective in version 10.2.1, you can use the intelligent structure model in Big Data Management.
Spark Engine Support for Data Objects with Intelligent Structure Model
You can incorporate an intelligent structure model in an Amazon S3, Microsoft Azure Blob, or complex file data object. When you add the data object to a mapping that runs on the Spark engine, you can process any input type that the model can parse.
The data object can accept and parse PDF forms, JSON, Microsoft Excel, Microsoft Word tables, CSV, text, or XML input files, based on the file that you used to create the model.
Intelligent structure model in the complex file, Amazon S3, and Microsoft Azure Blob data objects is available for technical preview. Technical preview functionality is supported but is unwarranted and is not production-ready. Informatica recommends that you use these features in non-production environments only.
For more information, see the Informatica Big Data Management 10.2.1 User Guide.
Mass Ingestion
Effective in version 10.2.1, you can perform mass ingestion jobs to ingest or replicate large amounts of data for use or storage in a database or a repository. To perform mass ingestion jobs, you use the Mass Ingestion tool to create a mass ingestion specification. You configure the mass ingestion specification to ingest data from a relational database to a Hive or HDFS target. You can also specify parameters to cleanse the data that you ingest.
A mass ingestion specification replaces the need to manually create and run mappings. You can create one mass ingestion specification that ingests all of the data at once.
For more information on mass ingestion, see the Informatica Big Data Management 10.2.1 Mass Ingestion Guide.
Monitoring
This section describes the new features related to monitoring in Big Data Management in version 10.2.1.
Hadoop Cluster Monitoring
Effective in version 10.2.1, you can configure the amount of information that appears in the application logs that you monitor for a Hadoop cluster.
The amount of information in the application logs depends on the tracing level that you configure for a mapping in the Developer tool. The following table describes the amount of information that appears in the application logs for each tracing level:
Tracing Level | Messages
--- | ---
None | The log displays FATAL messages. FATAL messages include non-recoverable system failures that cause the service to shut down or become unavailable.
Terse | The log displays FATAL and ERROR messages. ERROR messages include connection failures, failures to save or retrieve metadata, and service errors.
Normal | The log displays FATAL, ERROR, and WARNING messages. WARNING messages include recoverable system failures or warnings.
Verbose initialization | The log displays FATAL, ERROR, WARNING, and INFO messages. INFO messages include system and service change messages.
Verbose data | The log displays FATAL, ERROR, WARNING, INFO, and DEBUG messages. DEBUG messages are user request logs.
For more information, see the "Monitoring Mappings in the Hadoop Environment" chapter in the Informatica Big Data Management 10.2.1 User Guide.
Spark Monitoring
Effective in version 10.2.1, the Spark executor listens on a port for Spark events as part of Spark monitoring support, and you no longer need to configure the SparkMonitoringPort custom property.
The Data Integration Service has a range of available ports, and the Spark executor selects a port from that range. If a failure occurs, the port connection remains available, and you do not need to restart the Data Integration Service before you run the mapping.
The custom property for the monitoring port is retained. If you configure the property, the Data Integration Service uses the specified port to listen for Spark events.
Previously, you configured the Spark listening port through the Data Integration Service custom property for the Spark monitoring port. If you did not configure the property, Spark monitoring was disabled by default.
Tez Monitoring
Effective in version 10.2.1, you can view properties related to Tez engine monitoring. You can use the Hive engine to run mappings on MapReduce or Tez. The Tez engine can process jobs on Hortonworks HDP, Azure HDInsight, and Amazon Elastic MapReduce. To run a Spark mapping on Tez, you can use any of the supported clusters for Tez.
In the Administrator tool, you can also review the Hive query properties for Tez when you monitor the Hive engine. In the Hive session log and in Tez, you can view information related to Tez statistics, such as DAG tracking URL, total vertex count, and DAG progress.
You can monitor any Hive query on the Tez engine. When you enable logging for verbose data or verbose initialization, you can view the Tez engine information in the Administrator tool or in the session log. You can also monitor the status of the mapping on the Tez engine on the Monitoring tab in the Administrator tool.
For more information about Tez monitoring, see the Informatica Big Data Management 10.2.1 User Guide and the Informatica Big Data Management 10.2.1 Hadoop Integration Guide.
Processing Hierarchical Data on the Spark Engine
Effective in version 10.2.1, the Spark engine includes the following additional functionality to process hierarchical data:
- Map data type. You can use map data type to generate and process map data in complex files.
- Complex files on Amazon S3. You can use complex data types to read and write hierarchical data in Avro and Parquet files on Amazon S3. You project columns as complex data type in the data object read and write operations.
For more information, see the "Processing Hierarchical Data on the Spark Engine" chapter in the Informatica Big Data Management 10.2.1 User Guide.
Rule Specification Support on the Spark Engine
Effective in version 10.2.1, you can run a mapping that contains a rule specification on the Spark engine in addition to the Blaze and Hive engines.
You can also run a mapping that contains a mapplet that you generate from a rule specification on the Spark engine in addition to the Blaze and Hive engines.
For more information about rule specifications, see the Informatica 10.2.1 Rule Specification Guide.
Security
This section describes the new features related to security in Big Data Management in version 10.2.1.
Cloudera Navigator Encrypt
Effective in version 10.2.1, you can use Cloudera Navigator Encrypt to secure the data and implement transparent encryption of data at rest.
EMR File System Authorization
Effective in version 10.2.1, you can use EMR File System (EMRFS) authorization to access data in Amazon S3 on the Spark engine.
IAM Roles
Effective in version 10.2.1, you can use IAM roles for the EMR File System to read and write data from the cluster to Amazon S3 on Amazon EMR cluster version 5.10.
Kerberos Authentication
Effective in version 10.2.1, you can enable Kerberos authentication for the following clusters:
- Amazon EMR
- Azure HDInsight with WASB as storage
LDAP Authentication
Effective in version 10.2.1, you can configure Lightweight Directory Access Protocol (LDAP) authentication for Amazon EMR cluster version 5.10.
Sqoop
Effective in version 10.2.1, you can use the following new Sqoop features:
- Support for MapR Connector for Teradata
You can use MapR Connector for Teradata to read data from or write data to Teradata on the Spark engine. MapR Connector for Teradata is a Teradata Connector for Hadoop (TDCH) specialized connector for Sqoop. When you run Sqoop mappings on the Spark engine, the Data Integration Service invokes the connector by default.
For more information, see the Informatica Big Data Management 10.2.1 User Guide.
- Spark engine optimization for Sqoop pass-through mappings
When you run a Sqoop pass-through mapping on the Spark engine, the Data Integration Service optimizes mapping performance in the following scenarios:
  - You read data from a Sqoop source and write data to a Hive target that uses the Text format.
  - You read data from a Sqoop source and write data to an HDFS target that uses the Flat, Avro, or Parquet format.
For more information, see the Informatica Big Data Management 10.2.1 User Guide.
- Spark engine support for high availability and security features
Sqoop honors all the high availability and security features, such as Kerberos keytab login and KMS encryption, that the Spark engine supports.
For more information, see the "Data Integration Service" chapter in the Informatica 10.2.1 Application Services Guide and the "infacmd dis Command Reference" chapter in the Informatica 10.2.1 Command Reference Guide.
- Spark engine support for Teradata data objects
If you use a Teradata data object and you run a mapping on the Spark engine and on a Hortonworks or Cloudera cluster, the Data Integration Service runs the mapping through Sqoop.
If you use a Hortonworks cluster, the Data Integration Service invokes Hortonworks Connector for Teradata at run time. If you use a Cloudera cluster, the Data Integration Service invokes Cloudera Connector Powered by Teradata at run time.
For more information, see the Informatica PowerExchange for Teradata Parallel Transporter API 10.2.1 User Guide.
Transformation Support in the Hadoop Environment
This section describes new transformation features in the Hadoop environment in version 10.2.1.
Transformation Support on the Spark Engine
This section describes new transformation features on the Spark engine in version 10.2.1.
Transformation Support
Effective in version 10.2.1, the following transformations are supported on the Spark engine:
- Case Converter
- Classifier
- Comparison
- Key Generator
- Labeler
- Merge
- Parser
- Python
- Standardizer
- Weighted Average
Effective in version 10.2.1, the following transformations are supported with restrictions on the Spark engine:
- Address Validator
- Consolidation
- Decision
- Match
- Sequence Generator
Effective in version 10.2.1, the following transformation has additional support on the Spark engine:
- Java. Supports complex data types such as array, map, and struct to process hierarchical data.
For more information on transformation support, see the "Mapping Transformations in the Hadoop Environment" chapter in the Informatica Big Data Management 10.2.1 User Guide.
For more information about transformation operations, see the Informatica 10.2.1 Developer Transformation Guide.
Python Transformation
Effective in version 10.2.1, you can create a Python transformation in the Developer tool. Use the Python transformation to execute Python code in a mapping that runs on the Spark engine.
You can use a Python transformation to implement a machine learning model on the data that you pass through the transformation. For example, use the Python transformation to write Python code that loads a pre-trained model. You can use the pre-trained model to classify input data or create predictions.
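The following minimal sketch illustrates the kind of row-wise scoring logic that the previous paragraph describes: load a pre-trained model once, then classify each incoming row. The model file name, feature and prediction names, and the scikit-learn-style predict() call are illustrative assumptions for this sketch, not product-defined names; the way a Python transformation exposes input ports, output ports, and resource files is described in the "Python Transformation" chapter referenced below.

```python
# Illustrative sketch only. The file path, column names, and model interface
# are assumptions for this example; they are not Informatica-defined names.
import pickle

# Load a pre-trained model once, before any rows are processed. In a Python
# transformation, the model file would typically be supplied as a resource file;
# a local path stands in for it here.
with open("pretrained_model.pkl", "rb") as model_file:
    model = pickle.load(model_file)

def classify_row(feature_1, feature_2):
    """Return the model's prediction for a single input row."""
    # Assumes a scikit-learn style model that exposes predict().
    return model.predict([[feature_1, feature_2]])[0]

# Row-wise scoring: each incoming row's feature values produce one prediction,
# which would be assigned to an output of the transformation.
rows = [(3.1, 0.7), (5.8, 1.2)]
predictions = [classify_row(f1, f2) for f1, f2 in rows]
print(predictions)
```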
Note: The Python transformation is available for technical preview. Technical preview functionality is supported but is not production-ready. Informatica recommends that you use it in non-production environments only.
For more information, see the "Python Transformation" chapter in the Informatica 10.2.1 Developer Transformation Guide.
Update Strategy Transformation
Effective in version 10.2.1, you can use Hive MERGE statements for mappings that run on the Spark engine to perform update strategy tasks. Using MERGE in queries is usually more efficient and helps increase performance.
Hive MERGE statements are supported for the following Hadoop distributions:
- Amazon EMR 5.10
- Azure HDInsight 3.6
- Hortonworks HDP 2.6
To use Hive MERGE, select the option in the advanced properties of the Update Strategy transformation.
Previously, the Data Integration Service used INSERT, UPDATE, and DELETE statements to perform this task on any run-time engine. The Update Strategy transformation still uses these statements in the following scenarios:
- You do not select the Hive MERGE option.
- The mapping runs on the Hive or Blaze engine.
- The Hadoop distribution does not support Hive MERGE.
For more information about using a MERGE statement in Update Strategy transformations, see the chapter on Update Strategy transformation in the Informatica Big Data Management 10.2.1 User Guide.
Transformation Support on the Blaze Engine
This section describes new transformation features on the Blaze engine in version 10.2.1.
Aggregator Transformation
Effective in version 10.2.1, the data cache for the Aggregator transformation uses variable length to store binary and string data types on the Blaze engine. Variable length reduces the amount of data that the data cache stores when the Aggregator transformation runs.
When data that passes through the Aggregator transformation is stored in the data cache using variable length, the Aggregator transformation is optimized to use sorted input and a Sorter transformation is inserted before the Aggregator transformation in the run-time mapping.
For more information, see the "Mapping Transformations in the Hadoop Environment" chapter in the Informatica Big Data Management 10.2.1 User Guide.
Match Transformation
Effective in version 10.2.1, you can run a mapping that contains a Match transformation that you configure for identity analysis on the Blaze engine.
Configure the Match transformation to write the identity index data to cache files. The mapping fails validation if you configure the Match transformation to write the index data to database tables.
For more information on transformation support, see the "Mapping Transformations in the Hadoop Environment" chapter in the Informatica Big Data Management 10.2.1 User Guide.
Rank Transformation
Effective in version 10.2.1, the data cache for the Rank transformation uses variable length to store binary and string data types on the Blaze engine. Variable length reduces the amount of data that the data cache stores when the Rank transformation runs.
When data that passes through the Rank transformation is stored in the data cache using variable length, the Rank transformation is optimized to use sorted input and a Sorter transformation is inserted before the Rank transformation in the run-time mapping.
For more information, see the "Mapping Transformations in the Hadoop Environment" chapter in the Informatica Big Data Management 10.2.1 User Guide.
For more information about transformation operations, see the Informatica 10.2.1 Developer Transformation Guide.