Big Data Management
This section describes new Big Data Management features in version 10.2.2.
Azure Databricks Integration
Effective in version 10.2.2, you can integrate the Informatica domain with the Azure Databricks environment.
Azure Databricks is an analytics cloud platform optimized for Microsoft Azure cloud services. It incorporates the open-source Apache Spark cluster technologies and capabilities.
The Informatica domain can be installed on an Azure VM or on-premises. The integration process is similar to the integration with the Hadoop environment. You perform integration tasks, including importing the cluster configuration from the Databricks environment. The Informatica domain uses token authentication to access the Databricks environment. The Databricks token ID is stored in the Databricks connection.
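For context, the following Scala sketch shows what token authentication against the Databricks REST API looks like outside of Informatica. The workspace URL is a placeholder, and the token is read from an environment variable rather than from a Databricks connection:

```scala
import java.net.{HttpURLConnection, URL}
import scala.io.Source

// Minimal sketch of Databricks token authentication, independent of the
// Informatica domain. The workspace URL is a placeholder; the token is
// read from an environment variable instead of a Databricks connection.
val workspaceUrl = "https://<workspace>.azuredatabricks.net"
val token        = sys.env("DATABRICKS_TOKEN")

val conn = new URL(s"$workspaceUrl/api/2.0/clusters/list")
  .openConnection().asInstanceOf[HttpURLConnection]
conn.setRequestMethod("GET")
// Databricks accepts the token as an HTTP bearer credential.
conn.setRequestProperty("Authorization", s"Bearer $token")

println(s"HTTP ${conn.getResponseCode}") // 200 indicates a valid token
Source.fromInputStream(conn.getInputStream).getLines().foreach(println)
```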
Sources and Targets
You can run mappings against the following sources and targets within the Databricks environment (see the connectivity sketch after this list):
- Microsoft Azure Data Lake Store
- Microsoft Azure Blob Storage
- Microsoft Azure SQL Data Warehouse
- Microsoft Azure Cosmos DB
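As a rough illustration of this connectivity, the following Scala sketch reads from Azure Blob Storage and writes to Azure SQL Data Warehouse on a Databricks cluster. The account, container, server, and table names are placeholders, and storage credentials are assumed to be configured on the cluster:

```scala
import org.apache.spark.sql.SparkSession

// Sketch: read Azure Blob Storage, write Azure SQL Data Warehouse from
// Spark on Databricks. All names are placeholders; storage credentials
// are assumed to be set in the cluster configuration.
val spark = SparkSession.builder().appName("AzureIOSketch").getOrCreate()

val orders = spark.read
  .option("header", "true")
  .csv("wasbs://input@myaccount.blob.core.windows.net/orders/")

orders.write
  .format("com.databricks.spark.sqldw")          // Databricks SQL DW connector
  .option("url", "jdbc:sqlserver://myserver.database.windows.net:1433;database=mydw")
  .option("tempDir", "wasbs://tmp@myaccount.blob.core.windows.net/staging")
  .option("forwardSparkAzureStorageCredentials", "true")
  .option("dbTable", "dbo.orders")
  .save()
```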
Transformations
You can add the following transformations to a Databricks mapping:
- Aggregator
- Expression
- Filter
- Joiner
- Lookup
- Normalizer
- Rank
- Router
- Sorter
- Union
The Databricks Spark engine processes these transformations in much the same way that the Spark engine processes them in the Hadoop environment.
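To make that concrete, the following Scala sketch shows roughly how Filter, Joiner, Aggregator, and Sorter logic corresponds to Spark DataFrame operations. It is illustrative only, not the code that the Data Integration Service generates, and the paths and column names are made up:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

// Illustrative mapping of transformation logic onto Spark DataFrame
// operations. Paths and column names are made up.
val spark = SparkSession.builder().appName("TransformSketch").getOrCreate()
import spark.implicits._

val orders    = spark.read.parquet("/data/orders")
val customers = spark.read.parquet("/data/customers")

val result = orders
  .filter($"status" === "SHIPPED")                 // Filter
  .join(customers, Seq("customer_id"))             // Joiner
  .groupBy($"region")                              // Aggregator
  .agg(sum($"amount").as("total_amount"))
  .orderBy($"total_amount".desc)                   // Sorter

result.show()
```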
Data Types
The following data types are supported (see the schema sketch after this list):
- Array
- Bigint
- Date/time
- Decimal
- Double
- Integer
- Map
- Struct
- Text
- String
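For reference, a Spark schema that exercises these types, including the complex Array, Map, and Struct types, might look like the following sketch. The field names are made up:

```scala
import org.apache.spark.sql.types._

// Sketch of a Spark schema that covers the supported types above.
val schema = StructType(Seq(
  StructField("id",      IntegerType),                      // Integer
  StructField("big_id",  LongType),                         // Bigint
  StructField("created", TimestampType),                    // Date/time
  StructField("price",   DecimalType(18, 4)),               // Decimal
  StructField("score",   DoubleType),                       // Double
  StructField("name",    StringType),                       // Text/String
  StructField("tags",    ArrayType(StringType)),            // Array
  StructField("attrs",   MapType(StringType, StringType)),  // Map
  StructField("address", StructType(Seq(                    // Struct
    StructField("city", StringType),
    StructField("zip",  StringType)
  )))
))
```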
Mappings
When you configure a mapping, you can choose to validate and run the mapping in the Databricks environment. When you run the mapping, the Data Integration Service generates Scala code and passes it to the Databricks Spark engine.
Workflows
You can develop cluster workflows to create ephemeral clusters in the Databricks environment.
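Under the hood, an ephemeral cluster is simply a cluster that is created for a run and terminated afterward. The following Scala sketch creates such a cluster through the Databricks REST API; the runtime version and node type are placeholders, and auto-termination stands in for the create-run-terminate cycle that a cluster workflow automates:

```scala
import java.net.{HttpURLConnection, URL}

// Sketch: create a short-lived cluster through the Databricks REST API.
// Runtime version and node type are placeholders.
val token = sys.env("DATABRICKS_TOKEN")
val body =
  """{
    |  "cluster_name": "ephemeral-mapping-run",
    |  "spark_version": "5.1.x-scala2.11",
    |  "node_type_id": "Standard_DS3_v2",
    |  "num_workers": 2,
    |  "autotermination_minutes": 30
    |}""".stripMargin

val conn = new URL("https://<workspace>.azuredatabricks.net/api/2.0/clusters/create")
  .openConnection().asInstanceOf[HttpURLConnection]
conn.setRequestMethod("POST")
conn.setRequestProperty("Authorization", s"Bearer $token")
conn.setRequestProperty("Content-Type", "application/json")
conn.setDoOutput(true)
conn.getOutputStream.write(body.getBytes("UTF-8"))
println(s"HTTP ${conn.getResponseCode}") // the response carries the new cluster_id
```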
For more information, refer to the following guides:
- Big Data Management 10.2.2 Integration Guide
- Big Data Management 10.2.2 Administrator Guide
- Big Data Management 10.2.2 User Guide
Data Preview on the Spark Engine
Effective in version 10.2.2, you can preview data within a mapping that runs on the Spark engine in the Developer tool. Previewing data helps to design and debug big data mappings.
You can choose sources and transformations as preview points in a mapping that contain the following hierarchical types:
- Array
- Struct
- Map
Data preview is available for technical preview. Technical preview functionality is supported for evaluation purposes but is unwarranted and is not production-ready. Informatica recommends that you use technical preview functionality in non-production environments only. Informatica intends to include the preview functionality in an upcoming release for production use, but might choose not to in accordance with changing market or technical circumstances. For more information, contact Informatica Global Customer Support.
For more information, see the Informatica® Big Data Management 10.2.2 User Guide.
Hierarchical Data
This section describes new features for hierarchical data in version 10.2.2.
Dynamic Complex Ports
Effective in version 10.2.2, you can add dynamic complex ports to a dynamic mapping that runs on the Spark engine. Use dynamic complex ports to manage frequent schema changes to hierarchical data in complex files.
A dynamic complex port receives new or changed elements of a complex port based on the schema changes at run time. The input rules determine the elements of a dynamic complex port. Based on the input rules, a dynamic complex port receives one or more elements of a complex port from the upstream transformation. You can use dynamic complex ports such as dynamic array, dynamic map, and dynamic struct in some transformations on the Spark engine.
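Conceptually, this resembles resolving the elements of a complex column from the run-time schema instead of hard-coding them. The following Spark sketch, with made-up paths and column names, shows the idea for a struct column:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types.StructType

// Conceptual sketch: resolve the current elements of a struct column at
// run time so that new or changed elements flow through. Names are made up.
val spark = SparkSession.builder().appName("DynamicStructSketch").getOrCreate()
val df = spark.read.parquet("/data/customers")

// Read the element names from the run-time schema instead of hard-coding them.
val addressFields = df.schema("address").dataType.asInstanceOf[StructType].fieldNames

val flattened = df.select(addressFields.map(f => df(s"address.$f").as(s"address_$f")): _*)
flattened.printSchema()
```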
For more information, see the "Processing Hierarchical Data with Schema Changes" chapter in the Informatica Big Data Management 10.2.2 User Guide.
High Availability
This section describes new high availability features in version 10.2.2.
Big Data Job Recovery
Effective in version 10.2.2, the Data Integration Service can recover a big data job configured to run on the Spark engine when a Data Integration Service node stops unexpectedly. When a Data Integration Service node fails before a job completes, the Data Integration Service sends the job to another node, which resumes processing job tasks from the point at which the node failure occurred.
To recover big data mappings, you must enable big data job recovery in Data Integration Service properties and run the job from infacmd.
For more information, see the "Data Integration Service Processing" chapter in the Informatica Big Data Management 10.2.2 Administrator Guide.
Distributed Data Integration Service Queues
Effective in version 10.2.2, the Data Integration Service uses a distributed queue to store job information when big data recovery is enabled for deployed big data jobs. The distributed queue is stored in the Model repository, and any available Data Integration Service can run jobs from the queue when resources are available.
For more information, see the "Data Integration Service Processing" chapter in the Informatica Big Data Management 10.2.2 Administrator Guide.
Intelligent Structure Model
This section describes new intelligent structure model features in version 10.2.2.
Aliases in XML Files
Effective in version 10.2.2, Intelligent Structure Discovery can process XML files that identify a namespace with aliases that differ from the aliases used in the XML file from which the intelligent structure model was created.
Data Types
Effective in version 10.2.2, and starting with the Winter 2019 March release of Informatica Intelligent Cloud Services, when a complex file reader uses an intelligent structure model, Intelligent Structure Discovery passes the data types to the output data ports.
For example, when Intelligent Structure Discovery detects that a field contains a date, it passes the data to the output data ports as a date, not as a string.
Field Names
Effective in version 10.2.2, and starting with the Winter 2019 March release of Informatica Intelligent Cloud Services, field names in complex file data objects that you import from an intelligent structure model can begin with numbers and reserved words, and can contain the following special characters: \. [ ] { } ( ) * + - ? . ^ $ |
When a field begins with a number or reserved word, the Big Data Management mapping adds an underscore (_) to the beginning of the field name. For example, if a field in an intelligent structure model begins with OR, the mapping imports the field as _OR. When the field name contains a special character, the mapping converts the character to an underscore.
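The following Scala sketch illustrates these renaming rules. It is not Informatica's implementation, and the reserved-word list is a made-up subset:

```scala
// Illustrative sketch of the renaming rules: replace special characters
// with underscores, then prefix an underscore when the name starts with
// a digit or a reserved word. The reserved-word list is a made-up subset.
val reserved = Set("OR", "AND", "NOT")

def sanitize(field: String): String = {
  val replaced = field.replaceAll("""[\\.\[\]{}()*+\-?^$|]""", "_")
  if (replaced.headOption.exists(_.isDigit) ||
      reserved.exists(w => replaced.toUpperCase.startsWith(w)))
    "_" + replaced
  else replaced
}

Seq("ORDER_DATE", "2019_sales", "price($)").map(sanitize)
// ORDER_DATE -> _ORDER_DATE, 2019_sales -> _2019_sales, price($) -> price___
```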
Processing Large XML Files
Effective in version 10.2.2, Intelligent Structure Discovery can stream XML files and process data for repeating elements in chunks. This makes the processing of large XML files more efficient.
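The streaming idea can be sketched with a standard pull parser: handle each repeating element as it arrives rather than loading the whole document into memory. The file and element names below are placeholders:

```scala
import java.io.FileInputStream
import javax.xml.stream.{XMLInputFactory, XMLStreamConstants}

// Sketch: stream a large XML file and process each repeating <record>
// element as it arrives, instead of building the full document in memory.
val reader = XMLInputFactory.newInstance()
  .createXMLStreamReader(new FileInputStream("large.xml"))

var count = 0
while (reader.hasNext) {
  if (reader.next() == XMLStreamConstants.START_ELEMENT &&
      reader.getLocalName == "record") {
    count += 1 // process one repeating element here
  }
}
println(s"processed $count records")
```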
Data Drift
Effective in version 10.2.2, and starting with the Winter 2019 March release of Informatica Intelligent Cloud Services, Intelligent Structure Discovery enhances the handling of data drift.
In Intelligent Structure Discovery, data drift occurs when the input data contains fields that the sample file did not contain. In this case, Intelligent Structure Discovery passes the undefined data to an unassigned data port on the target rather than discarding it.
Mass Ingestion
Effective in version 10.2.2, you can run an incremental load to ingest incremental data. When you run an incremental load, the Spark engine fetches incremental data based on a timestamp or an ID column and loads the incremental data to the Hive or HDFS target. If you ingest the data to a Hive target, the Spark engine can also propagate the schema changes that have been made on the source tables.
If you ingest incremental data, the Mass Ingestion Service leverages Sqoop's incremental import mode.
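The timestamp-based variant of this pattern can be sketched in Spark as follows. This is not the Mass Ingestion Service implementation; the table names and watermark column are placeholders:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, max}

// Sketch of a timestamp-based incremental load: fetch only rows newer
// than the high-water mark and append them to the target. Names are
// placeholders.
val spark = SparkSession.builder()
  .appName("IncrementalLoadSketch").enableHiveSupport().getOrCreate()

// High-water mark: the newest timestamp already loaded into the target.
val lastLoaded = spark.table("target_db.orders")
  .agg(max("updated_at")).first().getTimestamp(0)

// Fetch only the incremental slice from the source and append it.
spark.table("source_db.orders")
  .filter(col("updated_at") > lastLoaded)
  .write.mode("append")
  .saveAsTable("target_db.orders")
```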
For more information, see the Informatica Big Data Management 10.2.2 Mass Ingestion Guide.
Monitoring
This section describes the new features related to monitoring in Big Data Management in version 10.2.2.
Spark Monitoring
Effective in version 10.2.2, you can view both the pre-job and post-job tasks in the Summary Statistics pane when you monitor Spark jobs.
For more information about the pre-job and post-job tasks, see the Informatica Big Data Management 10.2.2 User Guide.
Security
This section describes the new features related to security in Big Data Management in version 10.2.2.
Enterprise Security Package
Effective in version 10.2.2, Informatica supports an Azure HDInsight cluster with Enterprise Security Package.
The Enterprise Security Package uses Kerberos for authentication and Apache Ranger for authorization.
For more information about Enterprise Security Package, see the Informatica Big Data Management 10.2.2 Administrator Guide.
Targets
This section describes new features for targets in version 10.2.2.
HDFS Flat File Targets
Effective in version 10.2.2, you can append output data to HDFS target files and reject files. To append output data, select the option to append data if the HDFS target exists.
To help you manage the files that contain appended data, the Data Integration Service appends the mapping execution ID to the names of the target files and reject files.
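The append behavior itself can be sketched with the Hadoop FileSystem API. The path and execution ID below are made-up placeholders that mirror how appended target files are tagged:

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

// Sketch of appending to an HDFS target file. The execution ID is a
// made-up placeholder that mirrors how the Data Integration Service
// tags appended target files.
val fs = FileSystem.get(new Configuration())
val executionId = "mapping_run_42" // hypothetical ID
val target = new Path(s"/output/orders_$executionId.out")

val out =
  if (fs.exists(target)) fs.append(target) // append when the target exists
  else fs.create(target)                   // otherwise create it
out.writeBytes("id,amount\n1,9.99\n")
out.close()
```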
For more information, see the "Targets" chapter in the Informatica Big Data Management 10.2.2 User Guide.