Staging prerequisites

Before you create a connection, you must perform certain prerequisite tasks to configure the staging environment to connect to a SQL warehouse or a Databricks cluster.

SQL warehouse

Configure either the AWS or Azure staging environment for the SQL warehouse, based on where it is deployed. You also need to configure the Spark parameters for the SQL warehouse to use AWS or Azure staging.

Configure AWS staging

Configure IAM AssumeRole authentication to use AWS staging for the SQL warehouse.

IAM AssumeRole authentication

You can enable IAM AssumeRole authentication in Databricks Delta for secure and controlled access to the Amazon S3 staging bucket when you run tasks.
You can configure IAM authentication when the Secure Agent runs on an Amazon Elastic Compute Cloud (EC2) system.
Perform the following steps to configure IAM authentication on EC2:
  1. Create a minimal Amazon IAM policy.
  2. Create the Amazon EC2 role. The Amazon EC2 role is used when you create an EC2 system. A typical trust policy for the EC2 role is shown after these steps. For more information about creating the Amazon EC2 role, see the AWS documentation.
  3. Link the minimal Amazon IAM policy with the Amazon EC2 role.
  4. Create an EC2 instance. Assign the Amazon EC2 role that you created to the EC2 instance.
  5. Install the Secure Agent on the EC2 system.
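The following sample shows a typical trust policy for the Amazon EC2 role. It allows the Amazon EC2 service to assume the role and is a generic AWS example rather than a value specific to Databricks Delta:
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": {
        "Service": "ec2.amazonaws.com"
      },
      "Action": "sts:AssumeRole"
    }
  ]
}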

Temporary security credentials using AssumeRole

You can use temporary security credentials using AssumeRole to access AWS resources from the same or different AWS accounts.
Ensure that you have the sts:AssumeRole permission and a trust relationship established within the AWS accounts to use temporary security credentials. The trust relationship is defined in the trust policy of the IAM role when you create the role. The IAM role adds the IAM user as a trusted entity, allowing the IAM user to use temporary security credentials and access the AWS accounts.
For more information about how to establish the trust relationship, see the AWS documentation.
When the trusted IAM user requests temporary security credentials, the AWS Security Token Service (AWS STS) dynamically generates temporary security credentials that are valid for a specified period and provides them to the trusted IAM user. The temporary security credentials consist of an access key ID, a secret access key, and a session token.
To use the dynamically generated temporary security credentials, provide a value for the IAM Role ARN connection property when you create a Databricks Delta connection. The IAM Role ARN uniquely identifies the IAM role to assume. Then, specify the time duration in seconds during which you can use the temporary security credentials in the Temporary Credential Duration advanced source and target properties.
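For example, a connection might use the following illustrative values, where the account ID and role name are placeholders and 3600 seconds is an arbitrary duration:
IAM Role ARN: arn:aws:iam::123456789012:role/databricks-s3-staging
Temporary Credential Duration: 3600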

External ID

You can specify an external ID for more secure cross-account access to the Amazon S3 bucket when the bucket is in a different AWS account.
Optionally, you can specify the external ID in the AssumeRole request to the AWS Security Token Service (STS).
The external ID must be a string.
The following sample shows an external ID condition in the assumed IAM role trust policy:
"Statement": [
  {
    "Effect": "Allow",
    "Principal": {
      "AWS": "arn:aws:iam::AWS_Account_ID:user/user_name"
    },
    "Action": "sts:AssumeRole",
    "Condition": {
      "StringEquals": {
        "sts:ExternalId": "dummy_external_id"
      }
    }
  }
]
Note: Mass Ingestion does not support External ID.

Temporary security credentials policy

To use temporary security credentials to access AWS resources, both the IAM user and IAM role require policies.
Amazon S3 permission policy
Attach the following S3 permission policy to allow access to the Amazon S3 bucket:
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "s3:DeleteObject",
        "s3:GetObject",
        "s3:ListBucket",
        "s3:PutObject",
        "s3:PutObjectTagging",
        "s3:GetBucketAcl"
      ],
      "Resource": "arn:aws:s3:::com.amk"
    },
    {
      "Effect": "Allow",
      "Action": [
        "s3:DeleteObject",
        "s3:GetObject",
        "s3:ListBucket",
        "s3:PutObject",
        "s3:PutObjectTagging",
        "s3:GetBucketAcl"
      ],
      "Resource": "arn:aws:s3:::com.amk/*"
    }
  ]
}
The following sections list the policies required for the IAM user and the IAM role:
IAM user
An IAM user must have the sts:AssumeRole policy to use temporary security credentials in the same or a different AWS account.
The following sample policy allows an IAM user to use the temporary security credentials in an AWS account:
{
  "Version": "2012-10-17",
  "Statement": {
    "Effect": "Allow",
    "Action": "sts:AssumeRole",
    "Resource": "arn:aws:iam::<ACCOUNT-HYPHENS>:role/<ROLE-NAME>"
  }
}
IAM role
An IAM role must have the sts:AssumeRole policy and a trust policy attached to it to allow the IAM user to access the AWS resources using temporary security credentials. The policy specifies the AWS resources that the IAM user can access and the actions that the IAM user can perform. The trust policy specifies the IAM user from the AWS account that can access the AWS resources.
The following policy is a sample trust policy:
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": {
        "AWS": "arn:aws:iam::AWS-account-ID:root"
      },
      "Action": "sts:AssumeRole"
    }
  ]
}
In the Principal attribute, you can also provide the ARN of the IAM user who can use the dynamically generated temporary security credentials, to restrict access further. For example:
"Principal": { "AWS": "arn:aws:iam::AWS-account-ID:user/user-name" }

Temporary security credentials using AssumeRole for EC2

You can use temporary security credentials using AssumeRole for an Amazon EC2 role to access AWS resources from the same or different AWS accounts.
The Amazon EC2 role can assume another IAM role from the same or a different AWS account without requiring a permanent access key and secret key.
Before you use temporary security credentials with AssumeRole for EC2, configure the EC2 role to assume the IAM role provided in the IAM Role ARN connection property by selecting the Use EC2 Role to Assume Role check box in the connection properties. A sample trust policy for this configuration is shown below.
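The following sample shows a trust policy attached to the IAM role specified in the IAM Role ARN connection property. It allows the Amazon EC2 role to assume that IAM role; the account ID and role name are placeholders:
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": {
        "AWS": "arn:aws:iam::AWS-account-ID:role/EC2-role-name"
      },
      "Action": "sts:AssumeRole"
    }
  ]
}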

Create a minimal Amazon IAM policy

To stage the data in Amazon S3, use the following minimum required permissions: s3:PutObject, s3:GetObject, s3:DeleteObject, and s3:ListBucket.
You can use the following sample Amazon IAM policy:
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "s3:PutObject",
        "s3:GetObject",
        "s3:DeleteObject",
        "s3:ListBucket"
      ],
      "Resource": [
        "arn:aws:s3:::<bucket_name>/*",
        "arn:aws:s3:::<bucket_name>"
      ]
    }
  ]
}
Note: The Test Connection does not validate the IAM policy assigned to users. You can specify the Amazon S3 bucket name in the source advanced properties.

Configure Spark parameters for AWS staging

Before you use the Databricks SQL warehouse, configure the Spark parameters for SQL warehouse on the Databricks SQL Admin console.
On the Databricks SQL Admin console, navigate to SQL Warehouse Settings > Data Security, and then configure the Spark parameters for AWS under Data access configuration.
Add the following Spark configuration parameters and restart the SQL warehouse:
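The required parameters depend on your staging setup. A minimal sketch that stages data in Amazon S3 with an access key and secret key, using the standard Hadoop S3A properties, might look like the following:
spark.hadoop.fs.s3a.access.key <access key>
spark.hadoop.fs.s3a.secret.key <secret key>
spark.hadoop.fs.s3a.endpoint <S3 staging bucket endpoint>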
For example, the S3 staging bucket endpoint value is s3.ap-south-1.amazonaws.com.
Ensure that the configured access key and secret key have access to the S3 buckets where you store the data for Databricks Delta tables.

Configure environment variables

Set the following environment variables in the Secure Agent before you connect to Databricks SQL warehouse:
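The exact variables depend on your staging configuration. If the Secure Agent machine supplies the S3 staging credentials through the standard AWS credential environment variables, an illustrative sketch looks like the following:
AWS_ACCESS_KEY_ID=<access key>
AWS_SECRET_ACCESS_KEY=<secret key>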
After you set the environment variables, restart the Secure Agent.

Configure Azure staging

Before you use Microsoft Azure Data Lake Storage Gen2 to stage files, perform the following tasks:

Configure Spark parameters for Azure staging

Before you use the Databricks SQL warehouse, configure the Spark parameters for SQL warehouse on the Databricks SQL Admin console.
On the Databricks SQL Admin console, navigate to SQL Warehouse Settings > Data Security, and then configure the Spark parameters for Azure under Data access configuration.
Add the following Spark configuration parameters and restart the SQL warehouse:
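The required parameters depend on your staging setup. A minimal sketch that stages data in Microsoft Azure Data Lake Storage Gen2 with a service principal, using the standard Hadoop ABFS OAuth properties, might look like the following, where the client ID, client secret, and tenant ID are placeholders:
spark.hadoop.fs.azure.account.auth.type OAuth
spark.hadoop.fs.azure.account.oauth.provider.type org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider
spark.hadoop.fs.azure.account.oauth2.client.id <client ID>
spark.hadoop.fs.azure.account.oauth2.client.secret <client secret>
spark.hadoop.fs.azure.account.oauth2.client.endpoint https://login.microsoftonline.com/<tenant ID>/oauth2/token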
Ensure that the configured client ID and client secret have access to the file systems where you store the data for Databricks Delta tables.

Configure environment variables

Set the following environment variables in the Secure Agent before you connect to Databricks SQL warehouse:
After you set the environment variables, restart the Secure Agent.

Databricks cluster

Configure the Spark parameters for the Databricks cluster to use AWS or Azure staging, based on where the cluster is deployed.
You also need to enable the Secure Agent properties for runtime and design-time processing on the Databricks cluster.

Configure Spark parameters

Before you connect to the Databricks cluster, you must configure the Spark parameters on AWS and Azure.

Configuration on AWS

Add the following Spark configuration parameters for the Databricks cluster and restart the cluster:
Ensure that the configured access key and secret key have access to the S3 buckets where you store the data for Databricks Delta tables.

Configuration on Azure

Add the following Spark configuration parameters for the Databricks cluster and restart the cluster:
Ensure that the configured client ID and client secret have access to the file systems where you store the data for Databricks Delta tables.

Configure Secure Agent properties

When you configure mappings, the SQL warehouse processes the mapping by default.
To process the mappings on the Databricks cluster, enable the Secure Agent properties.
To connect to an all-purpose cluster and a job cluster, enable the Secure Agent properties for design time and runtime, respectively.

Setting the property for design time processing

  1. In Administrator, select the Secure Agent listed on the Runtime Environments tab.
  2. Click Edit.
  3. In the System Configuration Details section, select Data Integration Server as the Service and Tomcat JRE as the Type.
  4. Edit the JRE_OPTS field and set the value to -DUseDatabricksSql=false.

Setting the property for runtime processing

  1. In Administrator, select the Secure Agent listed on the Runtime Environments tab.
  2. Click Edit.
  3. In the System Configuration Details section, select Data Integration Server as the Service and DTM as the Type.
  4. Edit the JVMOption field.
    a. To run mappings, set the value to -DUseDatabricksSql=false.
    b. To run mappings enabled with SQL ELT optimization, set the value to -DUseDatabricksSqlForPdo=false.