Big Data
This section describes new big data features in version 10.1.1.
Blaze Engine
Effective in version 10.1.1, the Blaze engine has the following new features:
Hive Sources and Targets on the Blaze Engine
Effective in version 10.1.1, Hive sources and targets have the following additional support on the Blaze engine:
- Hive decimal data type values with precision 38
- Quoted identifiers in Hive table names, column names, and schema names
- Partitioned Hive tables as targets
- Bucketed Hive tables as sources and targets
- SQL overrides for Hive sources
- Table locking for Hive sources and targets
- Create or replace target tables for Hive targets
- Truncate target table for Hive targets and Hive partitioned tables
For more information, see the "Mapping Objects in the Hadoop Environment" chapter in the Informatica Big Data Management® 10.1.1 User Guide.
Transformation Support on the Blaze Engine
Effective in version 10.1.1, transformations have the following additional support on the Blaze engine:
- Lookup transformation. You can use SQL overrides and filter queries with Hive lookup sources.
- Sorter transformation. Global sorts are supported when the Sorter transformation is connected to a flat file target. To maintain global sort order, you must enable the Maintain Row Order property on the flat file target. If the Sorter transformation is midstream in the mapping, rows are sorted locally.
- Update Strategy transformation. The Update Strategy transformation is supported with some restrictions.
For more information, see the "Mapping Objects in the Hadoop Environment" chapter in the Informatica Big Data Management 10.1.1 User Guide.
Blaze Engine Monitoring
Effective in version 10.1.1, more detailed statistics about mapping jobs are available in the Blaze Summary Report. In the Blaze Job Monitor, a green summary report button appears beside the names of successful grid tasks. Click the button to open the Blaze Summary Report.
The Blaze Summary Report contains the following information about a mapping job:
- Time taken by individual segments. A pie chart of segments within the grid task.
- Mapping properties. A table containing basic information about the mapping job.
- Tasklet execution time. A time series graph of all tasklets within the selected segment.
- Selected tasklet information. Source and target row counts and cache information for each individual tasklet.
Note: The Blaze Summary Report is in beta. It contains most of the major features, but is not yet complete.
Blaze Engine Logs
Effective in version 10.1.1, the following error logging enhancements are available on the Blaze engine:
- Execution statistics are available in the LDTM log when the log tracing level is set to verbose initialization or verbose data. The log includes the following mapping execution details:
  - Start time, end time, and state of each task
  - Blaze Job Monitor URL
  - Number of total, succeeded, and failed/cancelled tasklets
  - Number of processed and rejected rows for sources and targets
  - Data errors, if any, for transformations in each executed segment
- The LDTM log includes the following transformation statistics:
  - Number of output rows for sources and targets
  - Number of error rows for sources and targets
- The session log also displays a list of all segments within the grid task with corresponding links to the Blaze Job Monitor. Click a link to see the execution details of that segment.
For more information, see the "Monitoring Mappings in a Hadoop Environment" chapter in the Informatica Big Data Management 10.1.1 User Guide.
Installation and Configuration
This section describes new features related to big data installation and configuration.
Address Reference Data Installation
Effective in version 10.1.1, Informatica Big Data Management installs with a shell script that you can use to install address reference data files. The script installs the reference data files on the compute nodes that you specify.
When you run an address validation mapping in a Hadoop environment, the reference data files must reside on each compute node on which the mapping runs. Use the script to install the reference data files on multiple nodes in a single operation.
The shell script name is copyRefDataToComputeNodes.sh.
Find the script in the following directory in the Informatica Big Data Management installation:
[Informatica installation directory]/tools/dq/av
When you run the script, you can enter the following information:
- The current location of the reference data files.
- The directory to which the script installs the files.
- The location of the file that contains the compute node names.
- The user name of the user who runs the script.
If you do not enter the information, the script uses a series of default values to identify the file locations and the user name.
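The script prompts you for these values when it runs. The following is a minimal sketch of a typical run, assuming that $INFA_HOME points to the Informatica installation directory; the sample path and prompt descriptions are illustrative, not documented parameters or defaults.

    # Minimal sketch: run the address reference data installation script.
    # Assumes $INFA_HOME points to the Informatica installation directory.
    cd "$INFA_HOME/tools/dq/av"
    ./copyRefDataToComputeNodes.sh
    # When prompted, enter:
    #   - the current location of the reference data files, for example /opt/AVdata
    #   - the directory on the compute nodes that receives the files
    #   - the path to the file that lists the compute node names
    #   - the user name of the user who runs the script
    # Press Enter at a prompt to accept the script's default value.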
For more information, see the Informatica Big Data Management 10.1.1 Installation and Configuration Guide.
Hadoop Configuration Manager in Silent Mode
Effective in version 10.1.1, you can use the Hadoop Configuration Manager in silent mode to configure Big Data Management.
For more information about configuring Big Data Management in silent mode, see the Informatica Big Data Management 10.1.1 Installation and Configuration Guide.
Installation in an Ambari Stack
Effective in version 10.1.1, you can use the Ambari configuration manager to install Big Data Management as a service in an Ambari stack.
For more information about installing Big Data Management in an Ambari stack, see the Informatica Big Data Management 10.1.1 Installation and Configuration Guide.
Script to Populate HDFS in HDInsight Clusters
Effective in version 10.1.1, you can use a script to populate the HDFS file system on an Azure HDInsight cluster when you configure the cluster for Big Data Management.
For more information about using the script to populate the HDFS file system, see the Informatica Big Data Management 10.1.1 Installation and Configuration Guide.
Spark Engine
Effective in version 10.1.1, the Spark engine has the following new features:
Binary Data Types
Effective in version 10.1.1, the Spark engine supports binary data type for the following functions:
- DEC_BASE64
- ENC_BASE64
- MD5
- UUID4
- UUID_UNPARSE
- CRC32
- COMPRESS
- DECOMPRESS (ignores precision)
- AES_ENCRYPT
- AES_DECRYPT
Note: The Spark engine does not support the binary data type in join and lookup conditions.
For more information, see the "Function Reference" chapter in the Informatica Big Data Management 10.1.1 User Guide.
Transformation Support on the Spark Engine
Effective in version 10.1.1, transformations have the following additional support on the Spark engine:
- The Java transformation is supported with some restrictions.
- The Lookup transformation can access a Hive lookup source.
For more information, see the "Mapping Objects in the Hadoop Environment" chapter in the Informatica Big Data Management 10.1.1 User Guide.
Run-time Statistics for Spark Engine Job Runs
Effective in version 10.1.1, you can view summary and detailed statistics for mapping jobs run on the Spark engine.
You can view the following Spark summary statistics in the Summary Statistics view:
- Source. The name of the mapping source file.
- Target. The name of the target file.
- Rows. The number of rows read for source and target.
The Detailed Statistics view displays a graph of the row counts for Spark engine job runs.
For more information, see the "Mapping Objects in the Hadoop Environment" chapter in the Informatica Big Data Management 10.1.1 User Guide.
Security
This section describes new big data security features in version 10.1.1.
Fine-Grained SQL Authorization Support for Hive Sources
Effective in version 10.1.1, you can configure a Hive connection to observe fine-grained SQL authorization when a Hive source table uses this level of authorization. Enable the Observe Fine Grained SQL Authorization option in the Hive connection to observe row-level and column-level restrictions that are configured for Hive tables and views.
For more information, see the Authorization section in the "Introduction to Big Data Management Security" chapter of the Informatica Big Data Management 10.1.1 Security Guide.
Spark Engine Security Support
Effective in version 10.1.1, the Spark engine supports the following additional security systems:
- Apache Sentry on Cloudera CDH clusters
- Apache Ranger on Hortonworks HDP clusters
- HDFS Transparent Encryption on Hadoop distributions that the Spark engine supports
- Operating system profiles on Hadoop distributions that the Spark engine supports
For more information, see the "Introduction to Big Data Management Security" chapter in the Informatica Big Data Management 10.1.1 Security Guide.
Sqoop
Effective in version 10.1.1, you can use the following new features when you configure Sqoop:
For more information, see the Informatica Big Data Management 10.1.1 User Guide.