- For incremental load jobs, you must enable Change Data Capture (CDC) for all source columns.
- You can access Databricks tables created on top of the following storage types:
  - Microsoft Azure Data Lake Storage (ADLS) Gen2
  - Amazon Web Services (AWS) S3
The Databricks connection uses a JDBC URL to connect to the Databricks cluster. When you configure the target, specify the JDBC URL and credentials to use for connecting to the cluster. Also define the connection information that the target uses to connect to the staging location in Amazon S3 or ADLS Gen2.
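As a rough illustration of the pieces involved, the following sketch assembles a Databricks JDBC URL from a host and an HTTP path. The URL format and parameter names (transportMode, AuthMech, httpPath) follow the general convention of the Databricks JDBC driver but are an assumption here, and the host and path values are hypothetical; always take the authoritative values from the Databricks connection properties.

```python
def build_databricks_jdbc_url(host: str, http_path: str) -> str:
    """Assemble a Databricks JDBC URL.

    The parameter layout is an assumption based on common Databricks JDBC
    driver conventions; verify it against your driver's documentation.
    """
    return (
        f"jdbc:databricks://{host}:443/default;"
        f"transportMode=http;ssl=1;AuthMech=3;"
        f"httpPath={http_path}"
    )

# Hypothetical workspace host and HTTP path, for illustration only.
url = build_databricks_jdbc_url(
    "adb-1234567890123456.7.azuredatabricks.net",
    "/sql/1.0/warehouses/abc123",
)
```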
- Before writing data to Databricks target tables, database ingestion and replication jobs stage the data in an Amazon S3 bucket or ADLS directory. You must specify the directory for the data when you configure the database ingestion and replication task.
Note: Database Ingestion and Replication does not use the ADLS Staging Filesystem Name and S3 Staging Bucket properties in the Databricks connection properties to determine the directory.
- Database Ingestion and Replication uses jobs that run once to load data from staging files on Amazon S3 or Azure Data Lake Storage (ADLS) Gen2 to external tables.
By default, Database Ingestion and Replication runs jobs on the cluster that is specified in the Databricks connection properties. If you want to run jobs on another cluster, set the dbDeltaUseExistingCluster custom property to false on the Target page in the database ingestion and replication task wizard.
- If the cluster specified in the Databricks connection properties is not up and running, the database ingestion and replication job waits for the cluster to start. By default, the job waits for 10 minutes. If the cluster does not start within 10 minutes, the connection times out and deployment of the job fails.
If you want to increase the timeout value for the connection, set the dbClusterStartWaitingTime custom property to the maximum time in seconds for which the ingestion job must wait for the cluster to be up and running. You can set the custom property on the Target page in the database ingestion and replication task wizard.
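The timeout behavior described above can be sketched as a simple polling loop. This is an illustrative assumption, not Informatica's actual implementation; get_state is a hypothetical callback that reports the cluster state, and 600 seconds mirrors the documented 10-minute default that dbClusterStartWaitingTime overrides.

```python
import time

def wait_for_cluster(get_state, timeout_seconds=600, poll_interval=1):
    """Poll until get_state() reports "RUNNING" or the timeout elapses.

    Illustrative sketch of the documented wait-for-cluster behavior;
    the polling mechanics are an assumption.
    """
    deadline = time.monotonic() + timeout_seconds
    while time.monotonic() < deadline:
        if get_state() == "RUNNING":
            return True
        time.sleep(poll_interval)
    # Cluster did not start in time: connection times out, deployment fails.
    return False
```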
- By default, Database Ingestion and Replication uses the Databricks COPY INTO feature to load data from the staging files to Databricks target tables. To disable COPY INTO for all load types, set the writerDatabricksUseSqlLoad custom property to false on the Target page in the database ingestion and replication task wizard.
- If you use an AWS cluster, you must specify the S3 Service Regional Endpoint value in the Databricks connection properties. For example:
s3.us-east-2.amazonaws.com
To test a Databricks connection using a Secure Agent on Linux, you must specify the JDBC URL in the SQL Endpoint JDBC URL field in the Databricks connection properties. After you test the connection, remove the SQL Endpoint JDBC URL value. Otherwise, when you define a database ingestion and replication task that uses the connection, a design-time error occurs: Data Ingestion and Replication tries to use the JDBC URL in addition to the required Databricks Host, Cluster ID, Organization ID, and Databricks Token values to connect to the target, resulting in login failures.
- You cannot test a Databricks connection using a Secure Agent on Windows; the test fails. Instead, perform the test using a Secure Agent on Linux. However, you can still use a Databricks connection with a Secure Agent on Windows when you create a database ingestion and replication task or run a database ingestion and replication job.
- Processing Rename Column operations on Databricks target tables without rewriting the underlying Parquet files requires the Databricks Column Mapping feature with Databricks Runtime 10.2 or later. If you set the Rename Column option to Replicate on the Schedule and Runtime Options page in the task wizard, you must alter the generated target table to set the following Databricks table properties after task deployment and before you run the job:
ALTER TABLE <target_table> SET TBLPROPERTIES (
'delta.columnMapping.mode' = 'name',
'delta.minReaderVersion' = '2',
'delta.minWriterVersion' = '5')
These properties enable the Databricks Column Mapping feature with the required reader and writer versions. If you do not set these properties, the database ingestion and replication job will fail.
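If you replicate many tables, generating this DDL per table can save manual editing. A minimal sketch that builds the statement shown above for a given table name (you would still submit it through your own SQL client or notebook):

```python
def column_mapping_ddl(table_name: str) -> str:
    """Return the ALTER TABLE statement that enables Databricks Column
    Mapping with the reader/writer versions required for Rename Column
    support, as documented above."""
    return (
        f"ALTER TABLE {table_name} SET TBLPROPERTIES (\n"
        "  'delta.columnMapping.mode' = 'name',\n"
        "  'delta.minReaderVersion' = '2',\n"
        "  'delta.minWriterVersion' = '5')"
    )
```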
- Database ingestion and replication jobs that have Databricks targets can get schema information for generating the target tables from Databricks Unity Catalog. To enable access to information in a Unity Catalog, specify the catalog name in the Catalog Name field in the Databricks connection properties. The catalog name is appended to the SQL Warehouse JDBC URL value for a data warehouse.
Note: Catalog use is optional for SQL warehouses and does not apply to job clusters.
If you use Unity Catalog, a personal storage location is automatically provisioned. To use the personal staging location, select Personal Staging Location in the Staging Environment field of the connection properties. The Parquet data files for ingestion and replication jobs can then be staged to the personal storage location, which has a data retention period of 7 days. By default, the staging location is stage://tmp/<user_name> in the root AWS or Azure location. The <user_name> is taken from the Databricks Token connection property. This user must have read and write access to the personal staging location.
- You can stage data in a Volume in Databricks. Volumes are Unity Catalog objects used to manage and secure non-tabular datasets, such as files and directories. Volumes can be managed volumes or external volumes.
- Database Ingestion and Replication supports special characters in table and column names in the source tables. By default, the special characters are replaced with an underscore (_) in the Databricks target table or column names.
Setting the targetReplacementValue=toHex key-value pair prevents Database Ingestion and Replication from replacing the special characters with an underscore in the generated target schema and instead converts the special characters to hexadecimal format.
To convert the special characters to their hexadecimal value, perform the following actions before deploying the database ingestion and replication task:
1. Create a properties file to be used at the metadata-manager layer. Add the targetReplacementValue=toHex key-value pair to the properties file.
2. Open the Runtime Environment page in Administrator and edit the Secure Agent. Create a custom property in the Custom Configuration Details area:
   - Select the Database Ingestion service.
   - Select the DBMI_AGENT_ENV type.
   - Enter DBMI_TASK_OVERRIDE_PROPERTIES as the property name.
   - Enter the properties file location as the property value.
3. Set the targetReplacementValue custom property to toHex on the Target page of the task wizard.
Before running the task, prefix the targetReplacementValue key in the properties file with the job name:
<jobname>.targetReplacementValue=toHex
If you want the property to apply to all jobs, use the alljobs. prefix instead:
alljobs.targetReplacementValue=toHex
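The two naming behaviors above can be sketched as a small sanitizer. The underscore replacement mirrors the documented default; the hexadecimal branch is illustrative only, because the exact encoding that targetReplacementValue=toHex produces is not specified here.

```python
def sanitize_identifier(name: str, to_hex: bool = False) -> str:
    """Replace special characters in a source table or column name.

    Default mode mirrors the documented underscore replacement.
    The hex mode is an assumed illustration of "convert to hexadecimal
    format"; the product's actual encoding may differ.
    """
    out = []
    for ch in name:
        if ch.isalnum() or ch == "_":
            out.append(ch)
        elif to_hex:
            out.append(format(ord(ch), "x"))  # assumed hex form
        else:
            out.append("_")
    return "".join(out)
```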
- If you select source decimal or numeric columns to replicate to Databricks target columns, ensure that each source decimal or numeric column has a scale that is less than or equal to its precision. Otherwise, when you run the job, an error is issued that reports an invalid decimal scale. This consideration applies to any source type that is mapped to a Databricks target.
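The constraint can be checked up front when auditing source column definitions. A minimal helper (the function name and pre-deployment check are illustrative, not part of the product):

```python
def valid_decimal(precision: int, scale: int) -> bool:
    """Check the documented constraint: scale must not exceed precision."""
    return 0 <= scale <= precision

valid_decimal(10, 2)  # typical DECIMAL(10,2): valid
valid_decimal(5, 7)   # scale greater than precision: job would fail
```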
- If you generated Databricks unmanaged tables as the target tables and no longer need a target unmanaged table, use a SQL DROP TABLE statement to delete the table from the target database. Do not manually delete the external directory in Amazon S3 or Azure Data Lake Storage for the unmanaged table. If you do so and then try to deploy another job that uses that table, the deployment fails with a Metadata Handler error.