SQL ELT optimization for mapping tasks

When you run a task configured for SQL ELT optimization, the task converts the transformation logic to an SQL query. The task sends the query to the database, and the database executes the query.
The amount of transformation logic that you can push to the database depends on the database, transformation logic, and task configuration. The Secure Agent processes all transformation logic that it cannot push to the database.
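For example, a mapping that filters rows from a Databricks source and writes them to a Databricks target might be converted into a single query that runs entirely on Databricks, similar to the following sketch. The table and column names are illustrative, and the actual SQL that the task generates varies by mapping:
INSERT INTO target_db.customer_tgt (customer_id, customer_name)
SELECT customer_id, customer_name
FROM source_db.customer_src
WHERE country = 'US';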
Configure SQL ELT optimization for a mapping in the task properties. You can configure SQL ELT optimization in a task that references a mapping or a mapping in advanced mode.
Note: You can only configure SQL ELT optimization when you use the SQL warehouse to connect to Databricks.

SQL ELT optimization types

When you apply SQL ELT optimization, the task pushes transformation logic to the source or target database based on the SQL ELT optimization type you specify in the task properties. Data Integration translates the transformation logic into SQL queries or Databricks commands and sends them to the Databricks database. The database runs the SQL queries or Databricks commands to process the transformations.
You can configure the following SQL ELT optimization types in a mapping:
None
The task does not push down the transformation logic to the Databricks database.
Full
The task pushes as much of the transformation logic as possible to process in the Databricks target database.
Data Integration analyzes all the transformations from the source to the target. If all the transformations are compatible with the target, it pushes the entire mapping logic to the target. If it cannot push the entire mapping logic to the target, Data Integration first pushes as much transformation logic as possible to the source database and then pushes as much of the remaining transformation logic as possible to the target database.
When the mapping contains a transformation that is not supported, the task partially pushes down the mapping logic up to the point where SQL ELT optimization is supported. However, this applies only to SQL warehouse.
When you enable full SQL ELT optimization, use the Fallback Option menu to determine how Data Integration handles the job when SQL ELT optimization does not work.
Source
The task pushes as much of the transformation logic as possible to process in the Databricks source database. This applies only to SQL warehouse.
Note: You cannot enable source SQL ELT optimization for mappings in advanced mode.

Read from and write to Databricks

You can configure SQL ELT optimization in a mapping to read from and write to Databricks using a Databricks connection.

Example

You work in a motorbike retail company with more than 30,000 dealerships and 2,000 inspection centers globally. The company stores millions of records in Databricks hosted on Azure. You want to use Data Integration to perform some transformations on the data before you write it back to Databricks.
Use a Databricks connection in the mapping to read from the Databricks source and write the processed data to the Databricks target. Configure full SQL ELT optimization in the mapping to enhance the performance.

Read from Amazon S3 and write to Databricks

You can configure SQL ELT optimization for a mapping that uses an Amazon S3 V2 connection in the Source transformation to read from Amazon S3 and a Databricks connection in the Target transformation to write to Databricks.

Example

You work for a healthcare organization. Your organization offers a suite of services to manage electronic medical records, patient engagement, telephonic health services, and care coordination services. The organization uses infrastructure based on Amazon Web Services and stores its data on Amazon S3. The management plans to load data to a data warehouse to perform healthcare analytics and create data points to improve operational efficiency. To load data from an Amazon S3 based storage object to Databricks, you must use ETL and ELT with the required transformations that support the data warehouse model.
Use an Amazon S3 V2 connection to read data from a file object in an Amazon S3 source and a Databricks connection to write to a Databricks target. Configure full SQL ELT optimization in the mapping to optimize the performance.

Read from Microsoft Azure Data Lake Storage Gen2 and write to Databricks

You can configure SQL ELT optimization for a mapping that uses a Microsoft Azure Data Lake Storage Gen2 connection in the Source transformation to read from Microsoft Azure Data Lake Storage Gen2 and a Databricks connection in the Target transformation to write to Databricks.

Example

You want to load data from a Microsoft Azure Data Lake Storage Gen2 based storage object to Databricks for analytical purposes. You want to transform the data before it is made available to users. Use a Microsoft Azure Data Lake Storage Gen2 connection to read data from a Microsoft Azure Data Lake Storage Gen2 source and a Databricks connection to write to a Databricks target. Configure full SQL ELT optimization in the mapping task to optimize the performance of loading data to Databricks. SQL ELT optimization enhances the performance of the task and reduces the cost involved.

SQL ELT compatibility

You can configure the task to push transformations, functions, and operators to the database.
When you use SQL ELT optimization, the Secure Agent converts the expression in the transformation by determining equivalent operators and functions in the database. If there is no equivalent operator or function, the Secure Agent processes the transformation logic.

Functions with Databricks

When you use SQL ELT optimization, Data Integration converts the expression in the transformation by determining equivalent functions in the database. If there is no equivalent function, Data Integration processes the transformation logic.
The following table summarizes the availability of SQL ELT functions that you can push to the Databricks database by using full or source SQL ELT optimization:
Function             Function
ABS()                MOD()
ADD_TO_DATE()        POWER()
ASCII()              RAND()
AVG()                REG_EXTRACT()
CEIL()               REG_MATCH()
CHR()                REG_REPLACE()
CONCAT()             REPLACESTR()
COS()                REPLACECHR()
COSH()               REVERSE()
COUNT()              ROUND(DATE)
CRC32()              ROUND(NUMBER)
DATE_COMPARE()       RPAD()
DATE_DIFF()          RTRIM()
DECODE()             SET_DATE_PART()
EXP()                SHA256()
FLOOR()              SIN()
GET_DATE_PART()      SINH()
GREATEST()           SQRT()
IIF()                STDDEV()
IN()                 SUBSTR()
INDEXOF()            SUM()
INITCAP()            SYSDATE()
INSTR()              SYSTIMESTAMP()
IS_DATE()            TAN()
IS_NULL()            TANH()
IS_NUMBER()          TO_BIGINT()
IS_SPACES()          TO_CHAR(DATE)
LAST()               TO_CHAR(NUMBER)
LAST_DAY()           TO_DATE()
LENGTH()             TO_DECIMAL()
LN()                 TO_FLOAT()
LOWER()              TO_INTEGER()
LPAD()               TRUNC(DATE)
LTRIM()              TRUNC(NUMBER)
MAKE_DATE_TIME()     UPPER()
MAX()                VARIANCE()
MD5()                -
MIN()                -
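For example, an Expression transformation field that concatenates and uppercases two string ports might be translated into equivalent Databricks SQL functions, similar to the following sketch. The column and table names are illustrative, and the generated query can differ:
SELECT UPPER(CONCAT(first_name, last_name)) AS full_name
FROM source_db.customer_src;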

Operators with Databricks

When you use SQL ELT optimization, the Secure Agent converts the expression in the transformation by determining equivalent operators in the database. If there is no equivalent operator, the Secure Agent processes the transformation logic.
The following table lists the SQL ELT operators that you can push to Databricks:
Operator     Operator
+            =
-            >=
*            <=
/            !=
%            AND
||           OR
>            NOT
<            -
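For example, a Filter transformation condition such as ORDER_AMOUNT >= 1000 AND STATUS != 'CANCELLED' might be pushed down as a WHERE clause, similar to the following sketch. The table and column names are illustrative:
SELECT order_id, order_amount, status
FROM source_db.orders_src
WHERE order_amount >= 1000 AND status != 'CANCELLED';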

Variables with Databricks

You can use full SQL ELT optimization to push the SESSSTARTTIME and SYSDATE variables to the Databricks database.
SYSDATE is stored as a transformation date/time data type variable. To return a static date and time, use the SESSSTARTTIME variable.
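For example, an expression such as the following uses SESSSTARTTIME with the DATE_DIFF() function to return the number of days between the session start time and the value of a date/time port. ORDER_DATE is a hypothetical port name:
DATE_DIFF(SESSSTARTTIME, ORDER_DATE, 'DD')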

Configuring SQL ELT optimization

To optimize a mapping, add the mapping to a task, and then configure SQL ELT optimization in the mapping task.
    1. Create a mapping task.
    2. In the SQL ELT optimization section on the Runtime Options tab, set the SQL ELT optimization value to Full or To Source.
    3. If full SQL ELT optimization is not available, select how Data Integration handles SQL ELT optimization in the Fallback Option menu.
    Note: The fallback options are not applicable to mappings in advanced mode.
When you run the mapping task, the transformation logic is pushed to the Databricks database.

Transformations with Databricks

When you configure SQL ELT optimization, the Secure Agent tries to push the configured transformation to Databricks.
You can use full or source SQL ELT optimization to push the following transformations to Databricks:
Note: Expression, Filter, Lookup, Rank, and SQL transformations don't apply to mappings in advanced mode.

Aggregator transformation

You can configure full SQL ELT optimization to push an Aggregator transformation to process in Databricks.

Aggregate calculations

You can perform the following aggregate calculations:

Incoming ports

When you configure an Aggregator transformation and an incoming port is not used in an aggregate function or in a group by field, the output is not deterministic because the ANY_VALUE() function returns any value from the port.
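For example, for an Aggregator transformation that groups by a region port and sums a sales port, an incoming store_name port that is not used in an aggregate function or group by field might be pushed down with ANY_VALUE(), similar to the following sketch. The names are illustrative:
SELECT region, SUM(sales) AS total_sales, ANY_VALUE(store_name) AS store_name
FROM source_db.sales_src
GROUP BY region;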
You can pass only single arguments to the LAST, STDDEV, and VARIANCE functions.

Lookup transformation

You can configure full SQL ELT optimization to push a Lookup transformation to process in Databricks. This applies to both connected and unconnected lookups.
You can add the following lookups:
When you configure a connected lookup, set the Multiple Matches property to Return all rows in the lookup properties for SQL ELT optimization to work.
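As an illustration, a connected lookup that matches orders against a customer table and returns all rows might be pushed down as a join, similar to the following sketch. The table and column names are hypothetical, and the exact SQL that Data Integration generates can differ:
SELECT o.order_id, o.customer_id, c.customer_name
FROM source_db.orders_src o
LEFT OUTER JOIN source_db.customer_lkp c
ON o.customer_id = c.customer_id;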
You can nest the unconnected lookup function with other expression functions.
When you configure an unconnected Lookup transformation, consider the following rules:

SQL transformation

You can use an SQL transformation to push supported scalar functions to Databricks.
When you configure SQL ELT optimization for a mapping, you can use scalar functions in an SQL transformation and run queries with the Databricks target endpoint.
You can use a simple SELECT statement without FROM and WHERE clauses. The SQL transformation supports functions only with a simple SELECT statement.
The following snippet demonstrates the syntax of a simple SELECT SQL query:
SELECT <function_name1>(~Arg~), <function_name2> (~Arg~)...
For example, SELECT SQRT(~AGE~)
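A query that calls more than one scalar function follows the same pattern. For example, assuming that LOWER and DATEDIFF are among the supported scalar functions, a query might look like the following, where FIRST_NAME, SHIP_DATE, and ORDER_DATE are hypothetical input fields of the SQL transformation:
SELECT LOWER(~FIRST_NAME~), DATEDIFF(~SHIP_DATE~, ~ORDER_DATE~)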
For more information about the supported functions, see the Databricks documentation.
Rules and guidelines for SQL transformation
Consider the following rules and guidelines when you use an SQL transformation:

Features

You can configure SQL ELT optimization for a mapping that reads from the following sources and writes to a Databricks target:
Some parameters are not supported in a mapping enabled for SQL ELT optimization. Refer to the list of parameters that each source supports.

Databricks sources, targets, and lookups

You must configure a Databricks connection with simple or hybrid mode when you enable SQL ELT optimization in a mapping task.

Source properties

When you configure SQL ELT optimization, the mappings support the following advanced properties for a Databricks source:
Note: The Contains, Ends With, and Starts With filter operators are not applicable when you use a source filter to filter records.

Target properties

When you configure SQL ELT optimization, the mappings support the following properties for a Databricks target:
Note: You cannot run pre-SQL or post-SQL commands in the source and target when you configure mappings for full SQL ELT optimization.

Lookup properties

When you configure SQL ELT optimization, the mappings support the following advanced properties for a Databricks lookup:
Note: If you configure advanced properties that are not supported, the Secure Agent either ignores the properties or logs a SQL ELT optimization validation error in the session log file. If the Secure Agent logs an error in the session log, the mappings run in the Informatica runtime environment without full SQL ELT optimization.

Supported features for Amazon S3 V2 source

When you configure SQL ELT optimization, the mappings support the following properties for an Amazon S3 V2 source:
When you configure SQL ELT optimization, the mapping supports the following transformations:
For information on how to configure the supported properties, see the help for Amazon S3 V2 Connector.

Supported features for Microsoft Azure Data Lake Storage Gen2 source

When you configure SQL ELT optimization, the Microsoft Azure Data Lake Storage Gen2 connection supports the following properties:
When you configure SQL ELT optimization, the mappings support the following properties for a Microsoft Azure Data Lake Storage Gen2 source:
When you configure SQL ELT optimization, the mapping supports the following transformations:
For information on how to configure the supported properties, see the help for Microsoft Azure Data Lake Storage Gen2 Connector.

Configuring a custom query for the Databricks source object

You can push down a custom query to Databricks.
Before you run a task that contains a custom query as the source object, you must set the Create Temporary View session property in the mapping task properties.
Note: If you do not set the Create Temporary View property, the mapping runs without SQL ELT optimization.
Perform the following steps to set the property:
  1. In the mapping task, navigate to the SQL ELT Optimization section on the Runtime Options tab.
  2. Select Create Temporary View.
  3. Click Finish.
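For example, you might specify a custom query similar to the following as the source object. The table and column names are illustrative:
SELECT order_id, customer_id, order_amount
FROM source_db.orders_src
WHERE order_date >= '2023-01-01'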

SQL ELT optimization for multiple targets

You can add multiple Databricks targets in a mapping.
When you configure a mapping to write to multiple Databricks targets, you can further optimize the write operation when you configure full SQL ELT optimization.
To optimize, you can configure an insert, update, upsert, or delete operation for multiple targets individually.
Note: You cannot configure an insert operation for unconnected target columns for mappings in advanced mode.
You can select the same Databricks target table in multiple Target transformations and perform different operations for each of the Target transformations to run independently of each other.
If your mapping contains multiple pipelines, you can define the flow run order to load the targets from the pipelines in a particular order.

Single commit for SQL ELT optimization

When you enable full SQL ELT optimization for a mapping to write to multiple Databricks targets, you can configure the mapping to commit the configured operations for all the targets within a connection group together.
You can use single commit to combine the metadata from all the targets and send the metadata for processing in a single execution call. When you use single commit, the Secure Agent separates the targets into connection groups based on equivalent connection attributes and commits the operations together for each connection group. This optimizes the performance of the write operation.
When you run a mapping with multiple targets, the Databricks connections used in the Target transformations that have the same connection attribute values are grouped together to form connection groups. Because all the targets in a connection group share the same connection attributes, only a single connection is established for each connection group. The transactions in each connection group run on a single Databricks cluster.
If the Secure Agent fails to write to any of the targets, the task execution stops and the completed transactions for the targets that belong to the same connection group are not rolled back.
To enable single commit to write to multiple targets, set the EnableSingleCommit=Yes custom property in the Advanced Session Properties section on the Runtime Options tab of the mapping task.

Rules and guidelines for SQL ELT optimization

Use the following rules and guidelines when you enable a mapping for SQL ELT optimization to a Databricks database:

General rules and guidelines

Use the following rules and guidelines when you configure SQL ELT optimization in a mapping:

Mapping with Databricks source and target

Use the following rules and guidelines when you configure SQL ELT optimization in a mapping that reads from and writes to Databricks:

Mapping with Amazon S3 source and Databricks target

Use the following rules and guidelines when you configure SQL ELT optimization in a mapping that reads from an Amazon S3 source and writes to a Databricks target:

Mapping with Azure Data Lake Storage Gen2 source and Databricks target

Use the following rules and guidelines when you configure SQL ELT optimization in a mapping that reads from an Azure Data Lake Storage Gen2 source and writes to a Databricks target:

Cross workspace mappings

When you set up a mapping enabled with full SQL ELT optimization to access data from a Databricks workspace, and the associated metastore resides in a separate workspace, the mapping runs without SQL ELT optimization.

Troubleshooting SQL ELT optimization

Mapping fails when configured to read date or timestamp information and write to default date/time format
When you configure a mapping to read date or timestamp information from a string column and process the data with the default date/time format to write to a Databricks target, the mapping fails with the following error:
[ERROR] The Secure Agent failed to run the full pushdown query due to the following error: [Invalid timestamp: '12/31/1972 00:00:00.000001']
To resolve this issue, set the JVM option -DHonorInfaDateFormat=true for the Secure Agent.
Perform the following steps to configure the JVM option in Administrator:
  1. Select Administrator > Runtime Environments.
  2. On the Runtime Environments page, select the Secure Agent machine that runs the mapping.
  3. Click Edit.
  4. In the System Configuration Details section, select Data Integration Server as the Service and DTM as the Type.
  5. Edit the JVMOption system property and set the value to -DHonorInfaDateFormat=true.
  6. Click Save.
IS_DATE(), IS_SPACES(), and IS_NUMBER() functions return 0 or 1 instead of True or False.
When you use the IS_DATE(), IS_SPACES(), or IS_NUMBER() function, the function returns 0 or 1 instead of True or False.
To resolve this issue, set the JVM option -DDeltaSQLELTBooleanReturnAsString=true for the Secure Agent.
Perform the following steps to configure the JVM option in Administrator:
  1. Select Administrator > Runtime Environments.
  2. On the Runtime Environments page, select the Secure Agent machine that runs the mapping.
  3. Click Edit.
  4. In the System Configuration Details section, select Data Integration Server as the Service and DTM as the Type.
  5. Edit the JVMOption system property and set the value to -DDeltaSQLELTBooleanReturnAsString=true.
  6. Click Save.