Data Engineering Integration
This section describes new Data Engineering Integration features in version 10.4.0.
New Data Types Support
Effective in version 10.4.0, you can use the following new data types for complex files:
- When you run a mapping that reads from or writes to Avro and Parquet complex file objects in the native environment or in the Hadoop environment, you can use the following data types:
  - Date
  - Decimal
  - Timestamp
- You can use the Time data type to read and write Avro or Parquet complex file objects in the native environment or on the Blaze engine.
- You can use the Date, Time, Timestamp, and Decimal data types when you run a mapping on the Databricks Spark engine.
The new data types are applicable to the following adapters:
- PowerExchange for HDFS
- PowerExchange for Amazon S3
- PowerExchange for Google Cloud Storage
- PowerExchange for Microsoft Azure Blob Storage
- PowerExchange for Microsoft Azure Data Lake Storage Gen1
- PowerExchange for Microsoft Azure Data Lake Storage Gen2
For more information about data types, see the "Data Type Reference" chapter in the Data Engineering Integration 10.4.0 User Guide.
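For context, the hedged sketch below uses the pyarrow library (an assumption; it is not part of the Informatica installation) to write a small Parquet file whose columns use the date, time, timestamp, and decimal logical types that these adapters can now read and write. The column names and values are illustrative only.

```python
# Sketch only: illustrates the Parquet logical types behind the new
# Date, Time, Timestamp, and Decimal support. Uses pyarrow, which is
# an assumption here and not part of the Informatica product.
from datetime import date, datetime, time
from decimal import Decimal

import pyarrow as pa
import pyarrow.parquet as pq

schema = pa.schema([
    ("order_date", pa.date32()),          # Date
    ("order_time", pa.time64("us")),      # Time
    ("updated_at", pa.timestamp("ms")),   # Timestamp
    ("amount", pa.decimal128(10, 2)),     # Decimal(10,2)
])

table = pa.table(
    {
        "order_date": [date(2020, 1, 15)],
        "order_time": [time(9, 30, 0)],
        "updated_at": [datetime(2020, 1, 15, 9, 30)],
        "amount": [Decimal("199.99")],
    },
    schema=schema,
)

pq.write_table(table, "orders_sample.parquet")
```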
AWS Databricks Integration
Effective in version 10.4.0, you can integrate the Informatica domain with Databricks on AWS.
You can use AWS Databricks to run mappings with the following functionality:
AWS Databricks supports the same data types as Azure Databricks.
For more information, refer to the following guides:
- Data Engineering 10.4.0 Integration Guide
- Data Engineering 10.4.0 Administrator Guide
- Data Engineering Integration 10.4.0 User Guide
- Informatica 10.4.0 Developer Workflow Guide
Cluster Workflows for HDInsight Access to ADLS Gen2 Resources
Effective in version 10.4.0, you can create a cluster workflow to run on an Azure HDInsight cluster to access ADLS Gen2 resources.
For more information about cluster workflows, see the Informatica Data Engineering Integration 10.4.0 User Guide.
Databricks Delta Lake Storage Access
Effective in version 10.4.0, you can access Databricks Delta Lake storage as sources and targets.
Mappings can access Delta Lake resources on the AWS and Azure platforms.
For information about configuring access to Delta Lake tables, see the Data Engineering Integration Guide. For information about creating mappings to access Delta Lake tables, see the Data Engineering Integration User Guide.
Display Nodes Used in Mapping
Effective in version 10.4.0, you can view the maximum number of cluster nodes used by a mapping in a given time duration.
You can use the REST Operations Hub API ClusterStats(startTimeInmillis=[value], endTimeInmillis=[value]) to view the maximum number of Hadoop nodes for a cluster configuration used by a mapping in a given time duration.
For more information about the REST API, see the "Monitoring REST API Reference" chapter in the Data Engineering 10.4.0 Administrator Guide.
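The exact endpoint URL depends on how the REST Operations Hub is configured in your domain. The Python sketch below is a minimal illustration of calling the ClusterStats API with the startTimeInmillis and endTimeInmillis parameters described above; the host, port, base path, and authentication scheme shown are assumptions, not documented values.

```python
# Minimal sketch of querying the ClusterStats monitoring API.
# Host, port, base path, and credentials below are assumptions;
# take the actual endpoint and authentication scheme from the
# "Monitoring REST API Reference" chapter.
import requests

BASE_URL = "https://rest-operations-hub.example.com:8085"  # assumption
ENDPOINT = "/monitoring/ClusterStats"                      # assumption

params = {
    "startTimeInmillis": 1577836800000,  # 2020-01-01 00:00:00 UTC
    "endTimeInmillis": 1577923200000,    # 2020-01-02 00:00:00 UTC
}

response = requests.get(
    BASE_URL + ENDPOINT,
    params=params,
    auth=("admin_user", "admin_password"),  # assumption: basic auth
)
response.raise_for_status()
print(response.json())  # maximum Hadoop nodes used per cluster configuration
```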
Log Aggregation
Effective in version 10.4.0, you can get aggregated logs for deployed mappings that run in the Hadoop environment.
You can collect the aggregated cluster logs for a mapping based on its job ID in the Monitoring tool or by using the infacmd ms fetchAggregatedClusterLogs command, and write the compressed aggregated log file (.zip or .tar.gz) to a target directory.
For more information, see the Informatica 10.4.0 Administrator Guide.
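As a rough illustration, the Python sketch below shells out to the infacmd ms fetchAggregatedClusterLogs command named above. The option flags and values are assumptions for illustration only; verify them against the infacmd Command Reference before use.

```python
# Sketch: invoke the documented infacmd ms fetchAggregatedClusterLogs
# command from Python. The option flags and values below are assumptions;
# check the infacmd Command Reference for the actual options.
import subprocess

cmd = [
    "infacmd.sh", "ms", "fetchAggregatedClusterLogs",
    "-dn", "Domain_Name",            # assumption: domain name option
    "-sn", "Data_Integration_Svc",   # assumption: service name option
    "-un", "admin_user",             # assumption: user name option
    "-pd", "admin_password",         # assumption: password option
    "-jobId", "job_12345",           # assumption: mapping job ID option
    "-logDir", "/tmp/agg_logs",      # assumption: target directory option
]

result = subprocess.run(cmd, capture_output=True, text=True, check=False)
print(result.stdout)
if result.returncode != 0:
    print(result.stderr)
```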
Parsing Hierarchical Data on the Spark Engine
Effective in version 10.4.0, you can use complex functions to parse up to 5 MB of data midstream in a mapping.
The Spark engine can parse raw string source data using the following complex functions:
The complex functions parse JSON or XML data in the source string and generate struct target data.
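The behavior is conceptually similar to Spark SQL's from_json function, which parses a JSON string column into a struct column. The hedged PySpark sketch below shows that analogue only; it is not the Informatica complex-function syntax, and the column names and schema are illustrative.

```python
# Conceptual analogue only: parse a JSON string column into a struct
# column with Spark SQL's from_json. This is not the Informatica
# complex-function syntax; names and schema are illustrative.
from pyspark.sql import SparkSession
from pyspark.sql.functions import from_json, col
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

spark = SparkSession.builder.appName("parse_json_sketch").getOrCreate()

source = spark.createDataFrame(
    [('{"id": 1, "name": "Ana"}',), ('{"id": 2, "name": "Bo"}',)],
    ["payload"],
)

schema = StructType([
    StructField("id", IntegerType()),
    StructField("name", StringType()),
])

# The string column becomes a struct column, similar to generating
# struct target data from a source string midstream in a mapping.
parsed = source.withColumn("payload_struct", from_json(col("payload"), schema))
parsed.printSchema()
parsed.show(truncate=False)
```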
For more information, see the "Hierarchical Data Processing" chapter in the Informatica Data Engineering Integration 10.4.0 User Guide.
For more information about the complex functions, see the "Functions" chapter in the Informatica 10.4.0 Developer Transformation Language Reference.
Profiles and Sampling Options on the Spark Engine
Effective in version 10.4.0, you can run profiles and choose sampling options on the Spark engine.
- Profiling on the Spark engine
- You can create and run profiles on the Spark engine in the Informatica Developer and Informatica Analyst tools. You can perform data domain discovery and create scorecards on the Spark engine.
- Sampling options on the Spark engine
- You can choose the following sampling options when you run profiles on the Spark engine:
For information about the profiles and sampling options on the Spark engine, see the Informatica 10.4.0 Data Discovery Guide.
Python Transformation
Effective in version 10.4.0, the Python transformation has the following functionality:
Active Mode
You can create an active Python transformation. As an active transformation, the Python transformation can change the number of rows that pass through it. For example, the Python transformation can generate multiple output rows from a single input row or the transformation can generate a single output row from multiple input rows.
For more information, see the "Python Transformation" chapter in the Informatica Data Engineering Integration 10.4.0 User Guide.
Partitioned Data
You can run Python code to process incoming data based on the data's default partitioning scheme, or you can repartition the data before the Python code runs. To repartition the data before the Python code runs, select one or more input ports as a partition key.
For more information, see the "Python Transformation" chapter in the Informatica Data Engineering Integration 10.4.0 User Guide.
Sqoop
Effective in version 10.4.0, you can configure the following Sqoop arguments in the JDBC connection:
- --update-key
- --update-mode
- --validate
- --validation-failurehandler
- --validation-threshold
- --validator
- --mapreduce-job-name
- --bindir
- --class-name
- --jar-file
- --outdir
- --package-name
For more information about configuring these Sqoop arguments, see the Sqoop documentation.
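For illustration only, a combination of these arguments might look like the line below, entered as Sqoop arguments in the JDBC connection. The column name, job name, and directory are placeholders, and the right combination of arguments depends on your Sqoop job.

```
--update-key cust_id --update-mode allowinsert --mapreduce-job-name infa_sqoop_export --outdir /tmp/sqoop_gen
```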