Databricks Connector > SQL ELT with Databricks Connector > SQL ELT optimization for mapping tasks

SQL ELT optimization for mapping tasks

When you run a task configured for SQL ELT optimization, the task converts the transformation logic to an SQL query. The task sends the query to the database, and the database executes the query.

The amount of transformation logic that you can push to the database depends on the database, transformation logic, and task configuration. The Secure Agent processes all transformation logic that it cannot push to the database.

Configure SQL ELT optimization for a mapping in the tasks properties. You can configure SQL ELT optimization in task that references a mapping or a mapping in advanced mode.

Note: You can only configure SQL ELT optimization when you use the SQL warehouse to connect to Databricks.

SQL ELT optimization types

When you apply SQL ELT optimization, the task pushes transformation logic to the source or target database based on the SQL ELT optimization type you specify in the task properties. Data Integration translates the transformation logic into SQL queries or Databricks commands to the Databricks database. The database runs the SQL queries or Databricks commands to process the transformations.

You can configure the following SQL ELT optimization types in a mapping:

None: The task does not push down the transformation logic to the Databricks database.

Full: The task pushes as much of the transformation logic as possible to process in the Databricks target database.

Source: The task pushes as much as the transformation logic as possible to process in the Databricks source database. This applies only to SQL warehouse.

Read from and write to Databricks

You can configure SQL ELT optimization in a mapping to read from and write to Databricks using a Databricks connection.

Example

You work in a motorbike retail company with more than 30,000 dealerships and 2000 inspection centers globally. The company stores millions of records in Databricks hosted on Azure. You want to use Data Integration to perform some transformations on the data before you write back to Databricks.

Use a Databricks connection in the mapping to read from the Databricks source and write the processed data to the Databricks target. Configure full SQL ELT optimization in the mapping to enhance the performance.

Read from Amazon S3 and write to Databricks

You can configure SQL ELT optimization for a mapping that uses an Amazon S3 V2 connection in the Source transformation to read from Amazon S3 and a Databricks connection in the Target transformation to write to Databricks.

Example

You work for a healthcare organization. Your organization offers a suite of services to manage electronic medical records, patient engagement, telephonic health services, and care coordination services. The organization uses infrastructure based on Amazon Web Services and stores its data on Amazon S3. The management plans to load data to a data warehouse to perform healthcare analytics and create data points to improve operational efficiency. To load data from an Amazon S3 based storage object to Databricks, you must use ETL and ELT with the required transformations that support the data warehouse model.

Use an Amazon S3 V2 connection to read data from a file object in an Amazon S3 source and a Databricks connection to write to a Databricks target. Configure full SQL ELT optimization in the mapping to optimize the performance.

Read from Microsoft Azure Data Lake Storage Gen2 and write to Databricks

You can configure SQL ELT optimization for a mapping that uses an Microsoft Azure Data Lake Storage Gen2 connection in the Source transformation to read from Microsoft Azure Data Lake Storage Gen2 and a Databricks connection in the Target transformation to write to Databricks.

Example

You want to load data from an Microsoft Azure Data Lake Storage Gen2 based storage object to Databricks for analytical purposes. You want to transform the data before it is made available to users. Use an Microsoft Azure Data Lake Storage Gen2 connection to read data from a Microsoft Azure Data Lake Storage Gen2 source and a Databricks connection to write to a Databricks target. Configure full SQL ELT optimization in the mapping task to optimize the performance of loading data to Databricks. SQL ELT optimization enhances the performance of the task and reduces the cost involved.

SQL ELT compatibility

You can configure the task to push transformations, functions, and operators to the database.

When you use SQL ELT optimization, the Secure Agent converts the expression in the transformation by determining equivalent operators and functions in the database. If there is no equivalent operator and function, the Secure Agent processes the transformation logic.

Functions with Databricks

When you use SQL ELT optimization, Data Integration converts the expression in the transformation by determining equivalent functions in the database. If there is no equivalent function, Data Integration processes the transformation logic.

The following table summarizes the availability of SQL ELT functions that you can push to the Databricks database by using full or source SQL ELT optimization:

Function	Function
ABS()	MOD()
ADD_TO_DATE()	POWER()
ASCII()	RAND()
AVG()	REG_EXTRACT()
CEIL()	REG_MATCH()
CHR()	REG_REPLACE()
CONCAT()	REPLACESTR()
COS()	REPLACECHR()
COSH()	REVERSE()
COUNT()	ROUND(DATE)
CRC32()	ROUND(NUMBER)
DATE_COMPARE()	RPAD()
DATE_DIFF()	RTRIM()
DECODE()	SET_DATE_PART()
EXP()	SHA256()
FLOOR()	SIN()
GET_DATE_PART()	SINH()
GREATEST()	SQRT()
IIF()	STDDEV()
IN()	SUBSTR()
INDEXOF()	SUM()
INITCAP()	SYSDATE()
INSTR()	SYSTIMESTAMP()
IS_DATE()	TAN()
IS_NULL()	TANH()
IS_NUMBER()	TO_BIGINT
IS_SPACES()	TO_CHAR(DATE)
LAST()	TO_CHAR(NUMBER)
LAST_DAY()	TO_DATE()
LENGTH()	TO_DECIMAL()
LN()	TO_FLOAT()
LOWER()	TO_INTEGER()
LPAD()	TRUNC(DATE)
LTRIM()	TRUNC(NUMBER)
MAKE_DATE_TIME()	UPPER()
MAX()	VARIANCE()
MD5()	-
MIN()	-

Operators with Databricks

When you use SQL ELT optimization, the Secure Agent converts the expression in the transformation by determining equivalent operators in the database. If there is no equivalent operator, the Secure Agent processes the transformation logic.

The following table lists the SQL ELT operators that you can push to Databricks:

Operator	Operator
+	=
-	>=
*	<=
/	!=
%	AND
\|\|	OR
>	NOT
<

Variables with Databricks

You can use full SQL ELT optimization to push the SESSSTARTTIME and SYSDATE variable to the Databricks database.

SYSDATE is stored as a transformation date/time datatype variable. To return a static date and time, use the SESSSTARTTIME variable.

Data flow run order

You can specify the flow run order when you want Data Integration to load the targets in the mapping enabled for SQL ELT optimization in a particular order.

If your mapping contains multiple pipelines, you can define the flow run order to load the targets from the pipelines in a particular order. To define the data flow run order during SQL ELT optimization, see the Data Flow Run Order Informatica Knowledge Base Article.

Configuring SQL ELT optimization

To optimize a mapping, add the mapping to a task, and then configure SQL ELT optimization in the mapping task.

1Create a mapping task.

2In the SQL ELT optimization section on the Runtime Options tab, set the SQL ELT optimization value to Full or To Source.

3If full SQL ELT optimization is not available, select how Data Integration handles SQL ELT optimization in the Fallback Option menu:

- Partial PDO. Default. Data Integration pushes as much transformation logic as possible to the source and target database. The task processes any transformation logic that it can't push to a database. You can use Partial PDO only when you read from and write to Databrick.
- Non PDO. The task runs without SQL ELT optimization.
- Fail Task. Data Integration fails the task.

Note: The fallback options are not applicable to mappings in advanced mode.

When you run the mapping task, the transformation logic is pushed to the Databricks database.

Transformations with Databricks

When you configure SQL ELT optimization, the Secure Agent tries to push the configured transformation to Databricks.

You can use full or source SQL ELT optimization to push the following transformations to Databricks:

•Aggregator
•Expression
•Filter
•Joiner
•Lookup
•Sorter
•Union
•Router. Doesn't apply to source SQL ELT optimization.
•Rank
•SQL

Note: Expression, Filter, Lookup , Rank, and SQL transformations don't apply to mappings in advanced mode.

Aggregator transformation

You can configure full SQL ELT optimization to push an Aggregator transformation to process in Databricks.

Aggregate calculations

You can perform the following aggregate calculations:

•AVG
•COUNT
•FIRST
•LAST
•MAX
•MIN
•SUM
•STDDEV
•VARIANCE

Incoming ports

When you configure an Aggregator transformation and the incoming port is not used in an aggregate function or in a group by field, the output is not deterministic as the ANY_VALUE() function returns any value from the port.

You can pass only single arguments to the LAST, STDDEV, and VARIANCE functions.

Lookup transformation

You can configure full SQL ELT optimization to push a Lookup transformation to process in Databricks. This applies to both connected and unconnected lookups.

You can add the following lookups:

•Cached
•Uncached
•Unconnected with cached

When you configure a connected lookup, select the Multiple Matches property value as Return all rows in the lookup properties for SQL ELT optimization to work.

You can nest the unconnected lookup function with other expression functions.

When you configure an unconnected Lookup transformation, consider the following rules:

•You must select the Multiple Matches property value as Report error in the unconnected lookup properties for SQL ELT optimization to work.
•You can only configure an Expression transformation for an output received from an unconnected lookup.

SQL Transformation

You can use an SQL transformation to push supported scalar functions to Databricks.

When you configure SQL ELT optimization for a mapping, you can use scalar functions in a SQL transformation and run queries with the Databricks target endpoint.

You can use a simple SELECT statement without 'FROM' and 'WHERE' arguments. The SQL transformation only supports functions with simple SELECT statement.

The following snippet demonstrates the syntax of a simple SELECT SQL query:

SELECT <function_name1>(~Arg~), <function_name2> (~Arg~)...

For example, SELECT SQRT(~AGE~)

For more information about the supported functions, see the Databricks documentation.

Rules and guidelines for SQL transformation: Consider the following rules and guidelines when you use SQL transformation:

Features

You can configure SQL ELT optimization for a mapping that reads from the following sources and writes to a Databricks target:

•Databricks
•Amazon S3
•Microsoft Azure Data Lake Storage Gen2

When you configure a mapping, some parameters are not supported for a mapping enabled for SQL ELT optimization. You can refer to the list of parameters that each source supports.

Databricks sources, targets, and lookups

You must configure a Databricks connection with simple or hybrid mode when you enable SQL ELT optimization in a mapping task.

Source properties

When you configure SQL ELT optimization, the mappings support the following advance properties for a Databricks source:

•Source Object Type

- Single
- Multiple
- Query
- Parameter

Note: When you use the query source type to read from Databricks, you can choose to retain the field metadata and save the mapping. Even if you edit the query and run the mapping, the field metadata specified at design time is retained.

•Query Options

- Filter. You can use both simple and advanced filter conditions.

•Database Name
•Table Name
•SQL Override

Note: Contains, Ends With, and Starts With filter operators are not applicable when you use source filter to filter records.

Target properties

When you configure SQL ELT optimization, the mappings support the following properties for an Databricks target:

•Target Object Type

- Single
- Parameter
- Create New at Runtime

•Operation

- Insert
- Update
- Upsert
- Delete
- Data driven

•Create Target
•Target Database Name
•Target Table Name
•Update Mode
•Write Disposition for Insert operation.

Note: You cannot run pre-SQL or post-SQL commands in the source and target when you configure mappings for full SQL ELT optimization.

Lookup properties

When you configure SQL ELT optimization, the mappings support the following advance properties for a Databricks lookup:

•Source Object Type

- Single
- Query
- Parameter
- Multiple Matches for cached lookup

Note: Un-cached unconnected lookup is not supported for SQL ELT optimization. For cached lookup, only Return all rows is supported.

•Database Name
•Table Name
•SQL Override

Note: If you configure advanced properties that are not supported, the Secure Agent either ignores the properties or logs a SQL ELT optimization validation error in the session logs file. If the Secure Agent logs an error in the session log, the mappings run in the Informatica runtime environment without full SQL ELT optimization.

Supported features for Amazon S3 V2 source

When you configure SQL ELT optimization, the mappings support the following properties for an Amazon S3 V2 source:

•Source connection parameter
•Source Type - Single, query
•Parameter
•Format - Avro, ORC, Parquet, JSON, CSV, and Delta
•Source Type - File and directory. XML source type is not applicable.
•Folder Path
•File Name

When you configure SQL ELT optimization, the mapping supports the following transformations:

•Filter
•Expression
•Aggregator
•Sorter
•Router
•Joiner
•Lookup
•Union
•Rank

For information on how to configure the supported properties, see the help for Amazon S3 V2 Connector.

Supported features for Google Cloud Storage source

When you configure SQL ELT optimization, the mappings support the following properties for a Google Cloud Storage source:

•Source connection - Parameter, Allow parameter to be overridden at run time
•Source Type - Single, Parameter
•Format - Delimited, Avro, JSON, ORC, and Parquet
•Google Cloud Storage Path
•Source File Name
•Is Directory

Supported features for Microsoft Azure Data Lake Storage Gen2 source

When you configure SQL ELT optimization, the Microsoft Azure Data Lake Storage Gen2 connection supports the following properties:

•Account Name
•Client ID
•Client Secret
•Tenant ID
•File System Name
•Directory Path
•Adls Gen2 End-point
•Server-side Encryption

When you configure SQL ELT optimization, the mappings support the following properties for a Microsoft Azure Data Lake Storage Gen2 source:

•Source connection, connection parameter
•Source Type - Single, parameter
•Format - Avro, Parquet, JSON, ORC, CSV, and Delta
•Intelligent Structure Model
•Formatting Options
•Filesystem Name Override
•Source Type - File, Directory
•Directory Override - Absolute path; Relative path
•File Name Override - Source object
•Allow Wildcard Characters

When you configure SQL ELT optimization, the mapping supports the following transformations:

•Filter
•Expression
•Aggregator
•Sorter
•Router
•Joiner
•Lookup
•Union
•Rank

For information on how to configure the supported properties, see the help for Microsoft Azure Data Lake Storage Gen2 Connector.

Configuring a custom query for the Databricks source object

You can push down a custom query to Databricks.

Before you run a task that contains a custom query as the source object, you must set the Create Temporary View session property in the mapping task properties.

Note: If you do not set the Create Temporary View property, the mapping runs without SQL ELT optimization.

Perform the following task to set the property:

1In the mapping task, navigate to the SQL ELT Optimization section on the Runtime Options tab.
2Select Create Temporary View.
3Click Finish.

SQL ELT optimization for multiple targets

You can add multiple Databricks targets in a mapping.

When you configure a mapping to write to multiple Databricks targets, you can further optimize the write operation when you configure full SQL ELT optimization.

To optimize, you can configure an insert, update, upsert, or delete operation for multiple targets individually.

Note: You cannot configure an insert operation for unconnected target columns for mappings in advanced mode.

You can select the same Databricks target table in multiple Target transformations and perform different operations for each of the Target transformations to run independent of each other.

If your mapping contains multiple pipelines, you can define the flow run order to load the targets from the pipelines in a particular order.

Single commit for SQL ELT optimization

When you enable full SQL ELT optimization for a mapping to write to multiple Databricks targets, you can configure the mapping to commit the configured operations for all the targets within a connection group together.

You can use single commit to combine the metadata from all the targets and send the metadata for processing in a single execution call. When you use single commit, the Secure Agent separates the targets into connection groups based on equivalent connection attributes and commits the operations together for each connection group. This optimizes the performance of the write operation.

When you run a mapping with multiple targets, the Databricks connections used for these multiple target transformations that have the same connection attribute values are grouped together to form connection groups. As all the targets in a connection group have the same connection attributes, only a single connection is established for each connection group which represents that particular connection group. The transactions on each connection group runs on a single Databricks cluster.

If the Secure Agent fails to write to any of the targets, the task execution stops and the completed transactions for the targets that belong to the same connection group are not rolled back.

To enable single commit to write to multiple targets, set the EnableSingleCommit=Yes custom property in the Advanced Session Properties section on the Runtime Options tab of the mapping task.

Clean stop a SQL ELT optimization job

When a task enabled for SQL ELT optimization is running, you can clean stop the job to terminate all the issued statements and processes spawned by the job.

Use the Clean Stop option on the My Jobs page in Data Integration and the All Jobs and Running Jobs page in Monitor.

See the following exceptions before you clean stop a SQL ELT optimization task:

•When you clean stop a task enabled for source SQL ELT optimization that reads from or writes to Databricks and the target or source properties in the mapping contains pre-SQL or post-SQL statements, even if the select query is terminated, the job continues to run the target post-SQL query.
•When you start a job enabled for full SQL ELT optimization and clean stop it immediately, and if the mapping is configured to create a new target at runtime, Data Integration creates the table even if the job is terminated.

Rules and guidelines for SQL ELT optimization

Use the following rules and guidelines when you enable a mapping for SQL ELT optimization to a Databricks database:

General rules and guidelines

Use the following rules and guidelines when you configure SQL ELT optimization in a mapping:

•When you use the IN() function with OR, AND, or NOT operator in an IIF() function, the IIF() function returns incorrect results.
•When you configure mapping to read from an Amazon S3 or a Microsoft Azure Data Lake Storage Gen2 source, you cannot use Databricks connection in the Lookup transformation in the mapping.

Mapping with Databricks source and target

Use the following rules and guidelines when you configure SQL ELT optimization in a mapping that reads from and writes to Databricks:

•When you provide the database name in the connection, the following rules and guidelines apply:

- For full SQL ELT optimization, temporary staging tables or views are created in the database name provided in the target connection attribute database. If the attribute is empty the database name is picked from JDBC URL of the target connection.
- For Source SQL ELT optimization, temporary staging tables or views are created in the database name provided in the source connection attribute database. If the attribute is empty the database name is picked from JDBC URL of source connection.
- The database name provided in the connection should have read and write permission. If the database name field is empty, then the database name provided in the JDBC URL will take precedence and should have read and write permission.

Note: The guidelines are only applicable to custom query support, SQL override, and SQL ELT optimization in Databricks using flat file formats.

•You cannot use the Data Driven target operation with the Update Mode advanced target property.
•LAST function is a non-deterministic function. This function returns different results each time it is called, even when you provide the same input values.
•When you configure a Filter transformation or specify a filter condition, do not specify special characters.
•When you configure a mapping enabled for full SQL ELT optimization to read from multiple sources and you override the database name and table name from the advanced properties, the mapping fails.
•To configure a Filter transformation or specify a filter condition on columns of date or timestamp in a Databricks table, you must pass the data through the TO_DATE() function as an expression in the filter condition.
•When you specify custom query as a source object, ensure that the SQL query does not contain any partitioning hints such as COALESCE, REPARTITION, or REPARTITION_BY_RANGE.
•When you configure a mapping enabled for full SQL ELT optimization on the Databricks SQL engine and push the data to the Databricks target, ensure that you map all the fields in target. Else, the mapping fails.
•When you create a new target at runtime, you must not specify a database name and table name in the Target Database Name and Target Table Name in the target advanced properties.
•When you read data from a column of Date data type and write data into a column of Date data type, the SQL ELT query pushes the column of Date data type and casts the column to Timestamp data type.
•You cannot completely parameterize a multi-line custom query using a parameter file. If you specify a multi-line custom query in a parameter file, the mapping considers only the first line of the multi-line query.
•When you push the CRC32() funtion to Databricks, the data type of the return value is either Bigint or String.
•When you push the DATE_DIFF() function to Databricks, the function returns the integral part of the value and not the fractional part.
•When you push the GREATEST() function to Databricks and configure input value arguments of String data type, you must not specify the caseFlag argument.
•To push the TO_CHAR(DATE) function to Databricks, use the following string and format arguments:

- YYYY
- YY
- MM
- MON
- MONTH
- DD
- DDD
- DY
- DAY
- HH12
- HH24
- MI
- Q
- SS
- SS.MS
- SS.US
- SS.NS

•To push the TO_DATE(string, format) function to Databricks, you must use the following format arguments:

- YYYY
- YY
- MM
- MON
- MONTH
- DD
- DDD
- HH12
- HH24
- MI
- SS
- SS.MS
- SS.US
- SS.NS

•When you enable full SQL ELT optimization in a mapping and use the IFF() condition in an Expression transformation, the mapping fails for the following functions:

- IS_SPACES
- IS_NUMBER
- IS_DATE

•A mapping enabled with full SQL ELT optimization and contains an SQL transformation fails when the column names in the SQL override query don't match with the column names in the custom query.

Mapping with Amazon S3 source and Databricks target

Use the following rules and guidelines when you configure SQL ELT optimization in a mapping that reads from an Amazon S3 source and writes to a Databricks target:

•When you use an unconnected lookup and run a Databricks mapping enabled for SQL ELT optimization, the mapping fails if the IIF() function returns a Float value in the lookup query.
•When you select the source type as directory in the advanced source properties, ensure that all the files in the directory contain the same schema.
•When you select query as the source type in lookup, you cannot override the database name and table name in the advanced source properties.
•When you include a source transformation in a mapping enabled with SQL ELT optimization, exclude the FileName field from the source. The FileName field is not applicable.
•When you parameterize a lookup object in a mapping enabled with SQL ELT optimization, the mapping fails as you cannot exclude the filename port at runtime.
•When you parameterize the source object in a mapping task, ensure that you pass the source object parameter value with the fully qualified path in the parameter file.
•You cannot use wildcard characters for the source file name and directory name in the source transformation.
•You cannot use wildcard characters for the folder path or file name in the advanced source properties.
•When you read from a partition folder that has a transaction log file, select the source type as Directory in the advanced source properties.
•You cannot configure dynamic lookup cache.
•When you use a Joiner transformation in a mapping enabled with SQL ELT optimization and create a new target at runtime, ensure that the fields do not have a not null constraint.
•Ensure that the field names in Parquet, ORC, AVRO, or JSON files do not contain Unicode characters.

Mapping with Azure Data Lake Storage Gen2 source and Databricks target

Use the following rules and guidelines when you configure SQL ELT optimization in a mapping that reads from a Azure Data Lake Storage Gen2 source and writes to a Databricks target:

•Mappings fail if the lookup object contains unsupported data types.
•When you select the source type as directory in the advanced source property, ensure that all the files in the directory contain the same schema.
•When you select query as the source type in lookup, you cannot override the database name and table name in the advanced source properties.
•When you include a source transformation in a mapping enabled with SQL ELT optimization, exclude the FileName field from the source. The FileName field is not applicable.
•When you parameterize a lookup object in a mapping enabled with SQL ELT optimization, the mapping fails as you cannot exclude the filename port at runtime.
•When you parameterize the source object in a mapping task, ensure that you pass the source object parameter value with the fully qualified path in the parameter file.
•You cannot use wildcard characters for the source file name and directory name in the source transformation.
•When you read from a partition folder that has a transaction log file, select the source type as Directory in the advanced source properties.
•You cannot configure dynamic lookup cache.
•When you use a Joiner transformation in a mapping enabled with SQL ELT optimization and create a new target at runtime, ensure that the fields do not have a not null constraint.
•Ensure that the field names in Parquet, ORC, AVRO, or JSON files do not contain Unicode characters.

Cross workspace mappings

When you set up a mapping enabled with full SQL ELT optimization to access data from a Databricks workspace, and the associated metastore resides in a separate workspace, the mapping runs without SQL ELT optimization.

Troubleshooting SQL ELT optimization

Mapping fails when configured to read date or timestamp information and write to default date/time format: When you configure a mapping to read date or timestamp information from a string column and process the data with the default date/time format to write to Databricks target, the mapping fails with the following error:; [ERROR] The Secure Agent failed to run the full pushdown query due to the following error: [Invalid timestamp: '12/31/1972 00:00:00.000001']; To resolve this issue, set the JVM option -DHonorInfaDateFormat=true for the Secure Agent.; Perform the following steps to configure the JVM option in Administrator:
IS_DATE(), IS_SPACES(), and IS_NUMBER() functions return 0 or 1 instead of True or False.: When you use IS_DATE(), IS_SPACES(), and IS_NUMBER() functions, the functions return 0 or 1 instead of True or False.; To resolve this issue, set the JVM option -DDeltaSQLELTBooleanReturnAsString=true for the Secure Agent.; Perform the following steps to configure the JVM option in Administrator: