Big Data Management
This section describes the changes to Big Data Management in version 10.2.1.
Azure Storage Access
Effective in version 10.2.1, you must override the properties in the cluster configuration core-site.xml before you run a mapping on the Azure HDInsight cluster.
- WASB. If you use a cluster with WASB as storage, you can get the storage account key associated with the HDInsight cluster from the administrator, or you can decrypt the encrypted storage account key and then override the decrypted value in the cluster configuration core-site.xml.
- ADLS. If you use a cluster with ADLS as storage, you must copy the client credentials from the web application and then override the values in the cluster configuration core-site.xml.
Previously, you copied the files from the Hadoop cluster to the machine that runs the Data Integration Service.
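As an illustration, the core-site.xml overrides might resemble the following sketch. The property names come from the standard Hadoop Azure connectors, and the account name and credential values are placeholders; confirm the exact property names against your cluster documentation:

```xml
<!-- WASB: override the decrypted storage account key (placeholder values). -->
<property>
  <name>fs.azure.account.key.examplestore.blob.core.windows.net</name>
  <value>DECRYPTED_STORAGE_ACCOUNT_KEY</value>
</property>

<!-- ADLS: override the client credentials copied from the web application. -->
<property>
  <name>dfs.adls.oauth2.client.id</name>
  <value>APPLICATION_CLIENT_ID</value>
</property>
<property>
  <name>dfs.adls.oauth2.credential</name>
  <value>APPLICATION_CLIENT_SECRET</value>
</property>
<property>
  <name>dfs.adls.oauth2.refresh.url</name>
  <value>https://login.microsoftonline.com/TENANT_ID/oauth2/token</value>
</property>
```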
Configuring the Hadoop Distribution
This section describes changes to Hadoop distribution configuration.
Hadoop Distribution Configuration
Effective in version 10.2.1, you configure the Hadoop distribution in cluster configuration properties.
The Distribution Name and Distribution Version properties are populated when you import a cluster configuration from the cluster. You can edit the distribution version after you finish the import process.
Previously, the Hadoop distribution was identified by the path to the distribution directory on the machine that hosts the Data Integration Service.
Effective in version 10.2.1, the following property is removed from the Data Integration Service properties:
- Data Integration Service Hadoop Distribution Directory
For more information about the Distribution Name and Distribution Version properties, see the Big Data Management 10.2.1 Administration Guide.
MapR Configuration
Effective in version 10.2.1, it is no longer necessary to configure Data Integration Service process properties for the domain when you use Big Data Management with MapR. Big Data Management supports Kerberos authentication with no user action necessary.
Previously, you configured JVM Option properties in the Data Integration Service custom properties, as well as environment variables, to enable support for Kerberos authentication.
For more information about integrating the domain with a MapR cluster, see the Big Data Management 10.2.1 Hadoop Integration Guide.
Developer Tool Configuration
Effective in version 10.2.1, you can create a Metadata Access Service. The Metadata Access Service is an application service that allows the Developer tool to access Hadoop connection information to import and preview metadata. When you import an object from a Hadoop cluster, the following adapters use the Metadata Access Service to extract the object metadata at design time:
- PowerExchange for HBase
- PowerExchange for HDFS
- PowerExchange for Hive
- PowerExchange for MapR-DB
Previously, you performed the following steps manually on each Developer tool machine to establish communication between the Developer tool machine and the Hadoop cluster at design time:
- Extracted cluster configuration files.
- Used the krb5.ini file to import metadata from Hive, HBase, and complex file sources from a Kerberos-enabled Hadoop cluster.
The Metadata Access Service eliminates the need to configure each Developer tool machine for design-time connectivity to the Hadoop cluster.
For more information, see the "Metadata Access Service" chapter in the Informatica 10.2.1 Application Service Guide.
Hadoop Connection Changes
Effective in version 10.2.1, the Hadoop connection contains new and changed properties and functionality. These include several properties that you previously configured in other connections or configuration files.
This section lists changes to the Hadoop connection in version 10.2.1.
Properties Moved from hadoopEnv.properties to the Hadoop Connection
Effective in version 10.2.1, the properties that you previously configured in the hadoopEnv.properties file are now configurable in advanced properties for the Hadoop connection.
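For example, an entry that you previously set in hadoopEnv.properties can now be entered as a name=value pair in the Hadoop connection advanced properties. The property name below is a hypothetical illustration, not a required setting:

```
infapdo.java.opts=-Xmx512M
```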
For information about Hive and Hadoop connections, see the Informatica Big Data Management 10.2.1 User Guide. For more information about configuring Big Data Management, see the Informatica Big Data Management 10.2.1 Hadoop Integration Guide.
Properties Moved from the Hive Connection to the Hadoop Connection
The following Hive connection properties, which enable mappings to run on a Hadoop cluster, are now in the Hadoop connection:
- Database Name. Namespace for tables. Use the name default for tables that do not have a specified database name.
- Advanced Hive/Hadoop Properties. Configures or overrides Hive or Hadoop cluster properties in the hive-site.xml configuration set on the machine on which the Data Integration Service runs. You can specify multiple properties.
- Temporary Table Compression Codec. Hadoop compression library for a compression codec class name.
- Codec Class Name. Codec class name that enables data compression and improves performance on temporary staging tables.
Previously, you configured these properties in the Hive connection.
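As an illustration, a common pairing on Hadoop clusters uses the Snappy codec. The class name below is the standard Hadoop Snappy codec class, shown here as a hedged example rather than a required value:

```
Temporary Table Compression Codec: Custom
Codec Class Name: org.apache.hadoop.io.compress.SnappyCodec
```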
For information about Hive and Hadoop connections, see the Informatica Big Data Management 10.2.1 Administrator Guide.
Advanced Properties for Hadoop Run-time Engines
Effective in version 10.2.1, you configure advanced properties for the Blaze, Spark, and Hive run-time engines in the Hadoop connection properties.
Informatica standardized the property names for run-time engine-related properties. The following table shows the old and new names:
| Pre-10.2.1 Property Name | 10.2.1 Hadoop Connection Properties Section | 10.2.1 Property Name |
| --- | --- | --- |
| Blaze Service Custom Properties | Blaze Configuration | Advanced Properties |
| Spark Execution Parameters | Spark Configuration | Advanced Properties |
| Hive Custom Properties | Hive Pushdown Configuration | Advanced Properties |
Previously, you configured advanced properties for run-time engines in the hadoopRes.properties or hadoopEnv.properties files, or in the Hadoop Engine Custom Properties field under Common Properties in the Administrator tool.
Additional Properties for the Blaze Engine
Effective in version 10.2.1, you can configure an additional property in the Blaze Configuration Properties section of the Hadoop connection properties.
The following table describes the property:
| Property | Description |
| --- | --- |
| Blaze YARN Node Label | Node label that determines the node on the Hadoop cluster where the Blaze engine runs. If you do not specify a node label, the Blaze engine runs on the nodes in the default partition. If the Hadoop cluster supports logical operators for node labels, you can specify a list of node labels. To list the node labels, use the operators && (AND), \|\| (OR), and ! (NOT). |
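For example, assuming hypothetical node labels named gpu and spot defined on the cluster, a label expression that runs the Blaze engine on GPU nodes that are not spot instances might look like this:

```
gpu && !spot
```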
For more information on using node labels on the Blaze engine, see the "Mappings in the Hadoop Environment" chapter in the Informatica Big Data Management 10.2.1 User Guide.
Hive Connection Properties
Effective in version 10.2.1, properties for the Hive connection have changed.
The following Hive connection properties have been removed:
- Access Hive as a source or target
- Use Hive to run mappings in a Hadoop cluster
Previously, these properties were deprecated. Effective in version 10.2.1, they are obsolete.
Configure the following Hive connection properties in the Hadoop connection:
- Database Name
- Advanced Hive/Hadoop Properties
- Temporary Table Compression Codec
- Codec Class Name
Previously, you configured these properties in the Hive connection.
For information about Hive and Hadoop connections, see the Informatica Big Data Management 10.2.1 User Guide.
Monitoring
This section describes the changes to monitoring in Big Data Management in version 10.2.1.
Spark Monitoring
Effective in version 10.2.1, changes in Spark monitoring relate to the following areas:
- Event changes
- Updates in the Summary Statistics view
Event Changes
Effective in version 10.2.1, only the monitoring information in the Spark events is checked in the session log.
Previously, all Spark events were relayed as is from the Spark application to the Spark executor. When relaying the events took a long time, performance issues occurred.
For more information, see the Informatica Big Data Management 10.2.1 User Guide.
Summary Statistics View
Effective in version 10.2.1, you can view the statistics for Spark execution based on the run stages. For instance, Spark Run Stages shows the statistics for the run stages of the Spark application. Stage_0 shows the statistics for the run stage with ID 0 in the Spark application. Rows and Average Rows/Sec show the number of rows written out of the stage and the corresponding throughput. Bytes and Average Bytes/Sec show the number of bytes broadcast in the stage and the corresponding throughput.
Previously, you could view only the Source and Target rows and the average rows processed per second for the Spark run.
For more information, see the Informatica Big Data Management 10.2.1 User Guide.
Precision and Scale on the Hive Engine
Effective in version 10.2.1, the output of user-defined functions that perform multiplication on the Hive engine can have a maximum scale of 6 if the following conditions are true:
- The difference between the precision and scale is greater than or equal to 32.
- The resultant precision is greater than 38.
Previously, the scale could be as low as 0.
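The capping rule can be sketched as follows. This is a simplified illustration of the conditions stated above, not Informatica's actual implementation; the function name and the precision arithmetic (p1 + p2 + 1 for multiplication, following the usual Hive decimal rules) are assumptions:

```python
def capped_output_scale(p1, s1, p2, s2):
    """Sketch of the 10.2.1 behavior: for multiplication on the Hive
    engine, the output scale is capped at a maximum of 6 when the
    resultant precision exceeds 38 and precision - scale >= 32."""
    # Usual Hive rules for decimal multiplication (assumption):
    precision = p1 + p2 + 1
    scale = s1 + s2
    if precision > 38 and (precision - scale) >= 32:
        return 38, min(scale, 6)  # scale capped at a maximum of 6
    return precision, scale

# DECIMAL(25,3) * DECIMAL(20,5): resultant precision 46 > 38 and
# precision - scale = 38 >= 32, so the scale is capped at 6.
print(capped_output_scale(25, 3, 20, 5))  # (38, 6)
```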
For more information, see the "Mappings in the Hadoop Environment" chapter in the Informatica Big Data Management 10.2.1 User Guide.
Sqoop
Effective in version 10.2.1, the following changes apply to Sqoop:
- When you run Sqoop mappings on the Spark engine, the Data Integration Service prints the Sqoop log events in the mapping log. Previously, the Data Integration Service printed the Sqoop log events in the Hadoop cluster log.
For more information, see the Informatica Big Data Management 10.2.1 User Guide.
- If you add a Type 4 JDBC driver .jar file that is required for Sqoop connectivity to the externaljdbcjars directory, or delete a .jar file from it, the changes take effect after you restart the Data Integration Service. If you run the mapping on the Blaze engine, the changes take effect after you restart the Data Integration Service and Blaze Grid Manager.
Note: When you run the mapping for the first time, you do not need to restart the Data Integration Service and Blaze Grid Manager. You need to restart the Data Integration Service and Blaze Grid Manager only for the subsequent mapping runs.
Previously, you did not have to restart the Data Integration Service and Blaze Grid Manager after you added or deleted a Sqoop .jar file.
For more information, see the Informatica Big Data Management 10.2.1 Hadoop Integration Guide.
Transformation Support on the Hive Engine
Effective in version 10.2.1, a Labeler or Parser transformation that performs probabilistic analysis requires the Java 8 Development Kit on any node on which it runs.
Previously, the transformations required the Java 7 Development Kit.
If you run a mapping that contains a Labeler or Parser transformation that you configured for probabilistic analysis, verify the Java version on the Hive nodes.
Note: On a Blaze or Spark node, the Data Integration Service uses the Java Development Kit that installs with the Informatica engine. Informatica 10.2.1 installs with version 8 of the Java Development Kit.
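A quick way to verify the JDK major version on a node is to parse the `java -version` banner line. The helper below and its parsing approach are illustrative assumptions, not an Informatica-supplied script:

```python
import re

def jdk_major(banner: str) -> int:
    """Parse the major version from a `java -version` banner line.
    Handles both old "1.8.0_181" and new "9.0.4" style version strings."""
    m = re.search(r'"(?:1\.)?(\d+)', banner)
    if not m:
        raise ValueError(f"unrecognized banner: {banner!r}")
    return int(m.group(1))

# On a Hive node, you could obtain the banner with, for example:
#   import subprocess
#   banner = subprocess.run(["java", "-version"], capture_output=True,
#                           text=True).stderr.splitlines()[0]
# and then require jdk_major(banner) >= 8 for probabilistic analysis.

print(jdk_major('java version "1.8.0_181"'))  # 8
print(jdk_major('openjdk version "9.0.4"'))   # 9
```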
For more information, see the Informatica 10.2.1 Installation Guide or the Informatica 10.2.1 Upgrade Guide that applies to the Informatica version that you upgrade from.