Data Engineering Integration
This section describes the changes to Data Engineering Integration in version 10.4.0.
Data Preview
Effective in version 10.4.0, the Data Integration Service uses Spark Jobserver to preview data on the Spark engine. Spark Jobserver allows for faster data preview jobs because it maintains a running Spark context instead of refreshing the context for each job. Mappings configured to run with Amazon EMR, Cloudera CDH, and Hortonworks HDP use Spark Jobserver to preview data.
Previously, the Data Integration Service used spark-submit scripts for all data preview jobs on the Spark engine. Mappings configured to run with Azure HDInsight and MapR continue to use spark-submit scripts to preview data on the Spark engine, and data preview for these mappings is available for technical preview.
For more information, see the "Data Preview" chapter in the Data Engineering Integration 10.4.0 User Guide.
Union Transformation
Effective in version 10.4.0, you can choose a Union transformation as the preview point when you preview data. Previously, the Union transformation was not supported as a preview point.
infacmd dp Commands
Effective in version 10.4.0, you can use the infacmd dp plugin to perform data preview operations. Use infacmd dp commands to manually start and stop the Spark Jobserver.
The following table describes infacmd dp commands:
| Command | Description |
| --- | --- |
| startSparkJobServer | Starts the Spark Jobserver on the Data Integration Service machine. By default, the Spark Jobserver starts when you preview hierarchical data. |
| stopSparkJobServer | Stops the Spark Jobserver running on the Data Integration Service. By default, the Spark Jobserver stops if it is idle for 60 minutes or when the Data Integration Service is stopped or recycled. |
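For example, starting and stopping the Spark Jobserver from the command line might look like the following sketch. The domain, user, and service names are placeholders, and the option names (-dn, -un, -pd, -sn) are assumed from the common infacmd pattern rather than taken from the dp plugin documentation; see the infacmd dp Command Reference for the exact syntax.

```
# Start the Spark Jobserver on the machine that runs the Data Integration Service.
infacmd.sh dp startSparkJobServer -dn MyDomain -un Administrator -pd MyPassword -sn My_DIS

# Stop the Spark Jobserver running on that Data Integration Service.
infacmd.sh dp stopSparkJobServer -dn MyDomain -un Administrator -pd MyPassword -sn My_DIS
```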
For more information, see the "infacmd dp Command Reference" chapter in the Informatica 10.4.0 Command Reference.
Date/Time Format on Databricks
Effective in version 10.4.0, when the Databricks Spark engine reads and writes date/time values, it uses the format YYYY-MM-DD HH24:MM:SS.US.
Previously, you set the date/time format in the mapping properties under the run-time preferences of the Developer tool.
You might need to perform additional tasks to continue using date/time data on the Databricks engine. For more information, see the "Databricks Integration" chapter in the Data Engineering 10.4.0 Integration Guide.
Null Values in Target
Effective in version 10.4.0, the following changes are applicable when you write data to a complex file:
- If the mapping source contains null values and you use the Create Target option to create a Parquet target file, the default schema contains optional fields and you can insert null values into the target.
Previously, all fields were created as required in the default schema, and you had to manually update the fields in the target schema from required to optional to write columns with null values to the target.
- If the mapping source contains null values and you use the Create Target option to create an Avro target file, null values are defined in the default schema and you can insert null values into the target file.
Previously, null values were not defined in the default schema, and you had to manually update the default target schema to add the "null" data type to the fields (see the sketch below).
Note: You can manually edit the schema if you do not want to allow null values in the target. You cannot edit the schema to prevent null values in the target when mapping flow is enabled.
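As a quick illustration of the Avro change, a default schema with nullable fields looks roughly like the sketch below. The record and field names are hypothetical, not values generated by the product; the point is that each field type is a union containing "null", which is what permits null values in the target. For Parquet, the equivalent change is that fields are marked optional instead of required.

```python
# Hypothetical Avro schema sketch; record and field names are illustrative only.
# A union type that includes "null" marks the column as nullable, so null values
# from the mapping source can be written to the target file.
avro_schema = {
    "type": "record",
    "name": "customer",
    "fields": [
        {"name": "customer_id", "type": ["null", "long"], "default": None},
        {"name": "customer_name", "type": ["null", "string"], "default": None},
    ],
}
```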
These changes are applicable to the following adapters:
- PowerExchange for HDFS
- PowerExchange for Amazon S3
- PowerExchange for Google Cloud Storage
- PowerExchange for Microsoft Azure Blob Storage
- PowerExchange for Microsoft Azure Data Lake Storage Gen1
Python Transformation
Effective in version 10.4.0, you access resource files in the Python code by referencing an index in the array resourceFilesArray. Use resourceFilesArray in new mappings that you create in version 10.4.0.
Previously, the array was named resourceJepFile. Upgraded mappings that use resourceJepFile will continue to run successfully.
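For example, Python transformation code that reads the first resource file might look like the following sketch. resourceFilesArray is the engine-provided array described above; the lookup logic and variable names are illustrative assumptions.

```python
# Sketch of Python transformation code that uses resourceFilesArray.
# resourceFilesArray is provided by the Data Integration Service at run time;
# index 0 refers to the first resource file configured on the transformation.
resource_path = resourceFilesArray[0]

# Illustrative only: read the resource file into a set of lookup values.
with open(resource_path) as resource_file:
    lookup_values = {line.strip() for line in resource_file}
```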
For more information, see the "Python Transformation" chapter in the Informatica Data Engineering Integration 10.4.0 User Guide.