Sources for Databricks

Add a Source transformation to extract data from a source.
When you add a Source transformation to a mapping, you define the source connection, source objects, and source properties related to the Databricks connection type.
The following table lists the source properties supported by the SQL warehouse, all-purpose cluster, and job cluster:
Property | SQL warehouse¹ | All-purpose cluster² | Job cluster³
Source Type - Single Object | Yes | Yes | Yes
Source Type - Parameter | Yes | Yes | Yes
Source Type - Query | Yes | No | No
Source Type - Multiple Objects | Yes | No | No
Parameter | Yes | No | Yes
Filter | Yes | No | Yes
Sort | No | No | No
Select distinct rows only | No | No | No
Database Name | Yes | No | Yes
Table Name | Yes | No | Yes
Pre SQL | Yes | No | No
Post SQL | Yes | No | No
SQL Override | Yes | No | No
Staging Location | Yes | No | Yes
Job Timeout | No | No | Yes
Job Status Poll Interval | No | No | Yes
DB REST API Timeout | No | No | Yes
DB REST API Retry Interval | No | No | Yes
Tracing Level | Yes | No | Yes
Key Range Partitioning | Yes | No | No
¹ The Secure Agent connects to the SQL warehouse at design time and runtime.
² The Secure Agent connects to the all-purpose cluster to import the metadata at design time.
³ The Secure Agent connects to the job cluster to run the mappings.

Source properties for Databricks

In a mapping, you can configure a Source transformation to represent a Databricks object.
The following table describes the Databricks source properties that you can configure in a Source transformation:
Property
Description
Connection
Name of the source connection. Select a source connection or click New Parameter to define a new parameter for the source connection.
Note: You can completely parameterize the source connection in a parameter file only for the single object source type.
Parameterization doesn't apply to mappings in advanced mode.
Source Type
Type of the source object. Select one of the following source types:
- Single Object
- Multiple Objects¹. You can use only advanced relationships with multiple objects.
- Query
- Parameter¹. Select Parameter to define the source type when you configure the task.
Multiple objects and query source types don't apply to the all-purpose cluster and job cluster.
Note: For a multi-object source, a database override overrides the database for all imported objects, while a table override overrides only the first table of the multi-object source.
Object
Name of the source object.
You cannot use the data preview option if the source fields contain hierarchical data types.
Query¹
Click Define Query and enter a valid custom query.
The Query property appears only if you select Query as the source type.
You can parameterize a custom query object at runtime in a mapping.
You can also enable Unity Catalog settings in a custom query to access a table within a particular catalog (see the example after this table).
¹ Doesn't apply to mappings in advanced mode.
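For instance, a custom query that uses Unity Catalog three-part naming to read a table within a particular catalog might look like the following minimal sketch. The catalog, schema, table, and column names (sales_catalog, analytics, orders, order_id, order_date) are illustrative placeholders, not names from this guide:

    SELECT order_id, order_date
    FROM sales_catalog.analytics.orders
    WHERE order_date >= '2023-01-01'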
The following table describes the Databricks query options that you can configure in a Source transformation:
Property
Description
Query Options
Filters the source data based on the conditions you specify. Click Configure to configure a filter option.
The Filter option filters records and reduces the number of rows that the Secure Agent reads from the source. Add conditions in a read operation to filter records from the source. You can specify the following filter conditions:
- Not parameterized. Use a basic filter to specify the object, field, operator, and value to select specific records.
- Completely parameterized*. Use a parameter to specify the filter query.
- Advanced. Use an advanced filter to define a complex filter condition.
Note: You can use the Contains, Ends With, and Starts With operators to filter records only on SQL endpoints.
Filter
Filters records based on the filter condition.
You can specify a simple filter or an advanced filter.
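An advanced filter takes a single condition that the Secure Agent applies when it reads from the source. A minimal illustrative sketch of such a condition, assuming hypothetical amount and region fields:

    amount > 1000 AND region = 'EMEA'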
The following table describes the Databricks advanced source properties that you can configure in a Source transformation:
Property
Description
Database Name
Overrides the database name specified in the connection and the database name specified during the metadata import.
Note: To read from multiple objects, ensure that you specify the database name in the connection properties.
Table Name
Overrides the table name used in the metadata import with the table name that you specify.
Pre SQL
The pre-SQL command to run on the Databricks source table before the agent reads the data.
For example, if you want to update records in the database before you read the records from the table, specify a pre-SQL statement.
The query must include a fully qualified table name. You can specify multiple pre-SQL commands, each separated by a semicolon (see the example after this table).
Post SQL
The post-SQL command to run on the Databricks table after the agent completes the read operation.
For example, if you want to delete some records after the latest records are loaded, specify a post-SQL statement.
The query must include a fully qualified table name. You can specify multiple post-SQL commands, each separated by a semicolon.
SQL Override
Overrides the default SQL query used to read data from a Databricks custom query source.
The column names in the SQL override query must match the column names in the custom query in a SQL transformation.
Note: The metadata of the source must be the same as the metadata of the SQL override query.
Staging Location
Relative directory path to store the staging files.
- If Databricks is hosted on the AWS platform, use the path relative to the Amazon S3 staging bucket.
- If Databricks is hosted on the Azure platform, use the path relative to the Azure Data Lake Storage Gen2 staging filesystem name.
Note: When you use Unity Catalog, you must provide a pre-existing location on your cloud storage as the staging location.
Job Timeout
Maximum time in seconds allowed for the Spark job to complete processing.
If the job does not complete within the specified time, the Databricks cluster terminates the job and the mapping fails.
If you do not specify a job timeout, the mapping shows success or failure based on the job completion.
Job Status Poll Interval
Poll interval in seconds at which the Secure Agent checks the status of the job completion.
Default is 30 seconds.
DB REST API Timeout
Maximum time in seconds for which the Secure Agent retries REST API calls to Databricks when there is a network connection error or when the REST endpoint returns a 5xx HTTP error code.
Default is 10 minutes.
DB REST API Retry Interval
Time interval in seconds at which the Secure Agent must retry a REST API call when there is a network connection error or when the REST endpoint returns a 5xx HTTP error code.
This value does not apply to the job status REST API. Use the Job Status Poll Interval value for the job status REST API.
Default is 30 seconds.
Tracing Level
Sets the amount of detail that appears in the log file. You can choose terse, normal, verbose initialization, or verbose data.
Default is normal.
Note: Advanced source properties are not applicable to mappings in advanced mode. Only pre-SQL and post-SQL advanced properties are applicable for custom queries.
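For example, pre-SQL and post-SQL commands with fully qualified table names might look like the following sketch. The database, table, and column names (salesdb.orders, salesdb.orders_archive, status, order_date) are illustrative assumptions:

    -- Pre SQL: runs before the read; separate multiple commands with semicolons
    UPDATE salesdb.orders SET status = 'READ_PENDING' WHERE status IS NULL;
    DELETE FROM salesdb.orders_archive WHERE order_date < '2020-01-01';

    -- Post SQL: runs after the read operation completes
    DELETE FROM salesdb.orders WHERE status = 'PROCESSED';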

Custom query source type

You can use a custom query as a source object when you use a Databricks connection.
You might want to use a custom query as the source when a source object is large. You can use the custom query to reduce the number of fields that enter the data flow. You can also create a parameter for the source type when you design your mapping so that you can define the query in the Mapping Task wizard.
To use a custom query as a source, select Query as the source type when you configure the Source transformation, and then use valid and supported SQL to define the query.
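For example, a custom query that reduces the number of fields that enter the data flow might look like the following sketch; the table and column names (default.orders, customer_id, order_id, order_total) are illustrative:

    SELECT customer_id, order_id, order_total
    FROM default.orders
    WHERE order_total > 0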

Key range partitioning

You can configure key range partitioning when you use a mapping to read data from a single Databricks source. The partition type controls how the agent distributes data among partitions at partition points. You cannot use partitioning in mappings in advanced mode.
Partitioning optimizes the mapping performance at run time. When you run a mapping configured with key range partitioning, the agent distributes rows of source data based on the field that you define as partition keys. The agent compares the field value to the range values for each partition and sends rows to the appropriate partitions.
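For example, if you choose an integer order_id field as the partition key and define two key ranges, 1-10000 and 10000-20000, each partition effectively reads a bounded slice of the source, roughly equivalent to the following sketch (the field, table, and range values are illustrative):

    -- Partition 1
    SELECT * FROM salesdb.orders WHERE order_id >= 1 AND order_id < 10000
    -- Partition 2
    SELECT * FROM salesdb.orders WHERE order_id >= 10000 AND order_id < 20000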
On the Partitions tab for the Source transformation, select key range partitioning and choose the field as the partition key.
You can add key ranges to create partitions.
Image: Configure key range partitioning on the Partitions tab of the Source transformation.
You can configure a partition key for fields of the following data types:

Rules and guidelines for key range partitioning

Consider the following rules and guidelines when you use key range partitioning: