Configure either the AWS or Azure staging environment for the SQL warehouse, based on where the warehouse is deployed. You also need to configure the Spark parameters for the SQL warehouse to use the AWS or Azure staging environment.
You can use a SQL warehouse on the Windows and Linux operating systems.
For more information on the types of SQL warehouses that you can connect to, see the Databricks SQL warehouses Knowledge Base article.
Configure AWS staging
Configure IAM AssumeRole authentication to use AWS staging for the SQL warehouse.
IAM AssumeRole authentication
You can enable IAM AssumeRole authentication in Databricks for secure and controlled access to the Amazon S3 staging bucket when you run mappings and mapping tasks.
You can configure IAM authentication when the Secure Agent runs on an Amazon Elastic Compute Cloud (EC2) system. When you use a serverless runtime environment, you cannot configure IAM authentication.
Note: Data Ingestion and Replication does not support IAM authentication for access to Amazon S3 staging.
Perform the following steps to configure IAM authentication on EC2:
1. Create a minimal Amazon IAM policy.
2. Create the Amazon EC2 role. The Amazon EC2 role is used when you create an EC2 system.
For more information about creating the Amazon EC2 role, see the AWS documentation.
3. Link the minimal Amazon IAM policy with the Amazon EC2 role (see the sample commands after these steps).
4. Create an EC2 instance. Assign the Amazon EC2 role that you created to the EC2 instance.
5. Install the Secure Agent on the EC2 system.
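The following AWS CLI commands are a minimal sketch of steps 1 through 4. The policy documents, role, policy, profile, and instance names, and the AMI ID are placeholder values, not required names.
# Create the minimal IAM policy from a local JSON policy document
aws iam create-policy --policy-name databricks-staging-policy --policy-document file://minimal-policy.json
# Create the EC2 role with a trust policy that allows the EC2 service to assume it
aws iam create-role --role-name databricks-ec2-role --assume-role-policy-document file://ec2-trust.json
# Link the minimal IAM policy with the EC2 role
aws iam attach-role-policy --role-name databricks-ec2-role --policy-arn arn:aws:iam::123456789012:policy/databricks-staging-policy
# Create an instance profile, add the role to it, and launch the EC2 instance with the profile
aws iam create-instance-profile --instance-profile-name databricks-ec2-profile
aws iam add-role-to-instance-profile --instance-profile-name databricks-ec2-profile --role-name databricks-ec2-role
aws ec2 run-instances --image-id ami-0abcdef1234567890 --instance-type m5.large --iam-instance-profile Name=databricks-ec2-profile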
Temporary security credentials using AssumeRole
You can use temporary security credentials with AssumeRole to access AWS resources from the same or different AWS accounts.
Note: Data Ingestion and Replication does not support using temporary security credentials for IAM users.
Ensure that you have the sts:AssumeRole permission and a trust relationship established between the AWS accounts to use temporary security credentials. The trust relationship is defined in the trust policy of the IAM role when you create the role. The IAM role adds the IAM user as a trusted entity, allowing the IAM user to use temporary security credentials to access the AWS accounts.
For more information about how to establish the trust relationship, see the AWS documentation.
When the trusted IAM user requests temporary security credentials, the AWS Security Token Service (AWS STS) dynamically generates credentials that are valid for a specified period and provides them to the trusted IAM user. The temporary security credentials consist of an access key ID, a secret access key, and a session token.
To use the dynamically generated temporary security credentials, provide a value for the IAM Role ARN connection property when you create a Databricks connection. The IAM Role ARN uniquely identifies the IAM role to assume. Then, in the Temporary Credential Duration advanced source and target properties, specify the duration in seconds for which the temporary security credentials are valid.
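For example, an IAM Role ARN has the following format, where the account ID and role name are placeholder values:
arn:aws:iam::123456789012:role/databricks-staging-role
For the Temporary Credential Duration, you might specify a value such as 900 to make the credentials valid for 15 minutes.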
External ID
You can specify an external ID for more secure cross-account access to the Amazon S3 bucket when the bucket is in a different AWS account.
Optionally, you can specify the external ID in the AssumeRole request to the AWS Security Token Service (STS).
The external ID must be a string.
The following sample shows an external ID condition in the assumed IAM role trust policy:
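In this sample, the account ID and external ID are placeholder values:
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": {
        "AWS": "arn:aws:iam::123456789012:root"
      },
      "Action": "sts:AssumeRole",
      "Condition": {
        "StringEquals": {
          "sts:ExternalId": "dummy_external_id"
        }
      }
    }
  ]
}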
An IAM role must have a permission policy and a trust policy attached to it to allow the IAM user to access AWS resources using temporary security credentials. The permission policy specifies the AWS resources that the IAM user can access and the actions that the IAM user can perform. The trust policy specifies the IAM user from the AWS account that can assume the role.
Here, in the Principal attribute, you can also provide the ARN of the IAM user who can use the dynamically generated temporary security credentials, to further restrict access. For example:
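In this snippet, the account ID and user name are placeholder values:
"Principal": {
  "AWS": "arn:aws:iam::123456789012:user/staging-user"
}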
Temporary security credentials using AssumeRole for EC2
You can use temporary security credentials with AssumeRole for an Amazon EC2 role to access AWS resources from the same or different AWS accounts.
The Amazon EC2 role can assume another IAM role from the same or a different AWS account without requiring a permanent access key and secret key.
Consider the following prerequisites when you use temporary security credentials using AssumeRole for EC2:
•Install the Secure Agent on an AWS service such as Amazon EC2.
•The EC2 role attached to the AWS EC2 service does not need access to Amazon S3 but needs permission to assume another IAM role.
•The IAM role that needs to be assumed by the EC2 role must have a permission policy and a trust policy attached to it.
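For example, the trust policy of the IAM role to be assumed might name the EC2 role as the trusted principal. The account ID and role name are placeholder values:
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": {
        "AWS": "arn:aws:iam::123456789012:role/databricks-ec2-role"
      },
      "Action": "sts:AssumeRole"
    }
  ]
}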
To configure an EC2 role to assume the IAM role provided in the IAM Role ARN connection property, select the Use EC2 Role to Assume Role check box in the connection properties.
Create a minimal Amazon IAM policy
To stage the data in Amazon S3, use the following minimum required permissions:
•PutObject
•GetObject
•DeleteObject
•ListBucket
•ListBucketMultipartUploads. Applicable only for mappings in advanced mode.
You can use the following sample Amazon IAM policy:
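In this sample, the bucket name is a placeholder value:
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "s3:PutObject",
        "s3:GetObject",
        "s3:DeleteObject",
        "s3:ListBucket",
        "s3:ListBucketMultipartUploads"
      ],
      "Resource": [
        "arn:aws:s3:::<bucket_name>",
        "arn:aws:s3:::<bucket_name>/*"
      ]
    }
  ]
}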
For mappings in advanced mode, you can use different AWS accounts within the same AWS region. Make sure that the Amazon IAM policy grants access to the AWS accounts used in these mappings.
Note: The Test Connection does not validate the IAM policy assigned to users. You can specify the Amazon S3 bucket name in the source and target advanced properties.
This information does not apply to Data Ingestion and Replication.
Configure Spark parameters for AWS staging
Before you use the Databricks SQL warehouse to run mappings, configure the Spark parameters for the SQL warehouse on the Databricks SQL Admin console.
On the Databricks SQL Admin console, navigate to SQL Warehouse Settings > Data Security, and then configure the Spark parameters for AWS under Data access configuration.
Add the following Spark configuration parameters and restart the SQL warehouse:
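The following parameters are a minimal sketch for access key authentication to the S3 staging bucket. The access key, secret key, and endpoint values are placeholders:
spark.hadoop.fs.s3a.access.key <access key>
spark.hadoop.fs.s3a.secret.key <secret key>
spark.hadoop.fs.s3a.endpoint <S3 staging bucket endpoint>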
For example, the S3 staging bucket endpoint value is s3.ap-south-1.amazonaws.com.
Ensure that the configured access key and secret key have access to the S3 buckets where you store the data for Databricks tables.
Configure Azure staging
Before you use Microsoft Azure Data Lake Storage Gen2 to stage files, perform the following tasks:
•Create a storage account to use with Microsoft Azure Data Lake Storage Gen2 and enable Hierarchical namespace in the Azure portal. A sample Azure CLI sequence for these tasks follows this list.
You can use role-based access control to authorize the users to access the resources in the storage account. Assign the Contributor role or Reader role to the users. The Contributor role grants you full access to manage all resources in the storage account, but does not allow you to assign roles. The Reader role allows you to view all resources in the storage account, but does not allow you to make any changes.
Note: To add or remove role assignments, you must have write and delete permissions, such as an Owner role.
•Register an application in Azure Active Directory to authenticate users to access the Microsoft Azure Data Lake Storage Gen2 account.
You can use role-based access control to authorize the application. Assign the Storage Blob Data Contributor or Storage Blob Data Reader role to the application. The Storage Blob Data Contributor role lets you read, write, and delete Azure Storage containers and blobs in the storage account. The Storage Blob Data Reader role lets you only read and list Azure Storage containers and blobs in the storage account.
•Create an Azure Active Directory web application for service-to-service authentication with Microsoft Azure Data Lake Storage Gen2.
Note: Ensure that you have superuser privileges to access the folders or files created in the application using the connector.
•To read and write complex files, set the JVM options for type DTM to increase the -Xms and -Xmx values in the system configuration details of the Secure Agent to avoid a Java heap space error. The recommended -Xms and -Xmx values are 512 MB and 1024 MB, respectively.
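The following Azure CLI commands are a minimal sketch of the storage account, application registration, and role assignment tasks. All names, IDs, and the resource group are placeholder values:
# Create a storage account with Hierarchical namespace enabled
az storage account create --name mystagingaccount --resource-group my-rg --location eastus --kind StorageV2 --sku Standard_LRS --hns true
# Register an application in Azure Active Directory and create a client secret
az ad app create --display-name databricks-staging-app
az ad app credential reset --id <application-id>
# Assign the Storage Blob Data Contributor role to the application on the storage account
az role assignment create --assignee <application-id> --role "Storage Blob Data Contributor" --scope /subscriptions/<subscription-id>/resourceGroups/my-rg/providers/Microsoft.Storage/storageAccounts/mystagingaccount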
Configure Spark parameters for Azure staging
Before you use the Databricks SQL warehouse to run mappings, configure the Spark parameters for the SQL warehouse on the Databricks SQL Admin console.
On the Databricks SQL Admin console, navigate to SQL Warehouse Settings > Data Security, and then configure the Spark parameters for Azure under Data access configuration.
Add the following Spark configuration parameters and restart the SQL warehouse:
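The following parameters are a minimal sketch for OAuth service principal access to the Azure Data Lake Storage Gen2 staging account. The application ID, client secret, and tenant ID values are placeholders for the Azure Active Directory application that you registered:
spark.hadoop.fs.azure.account.auth.type OAuth
spark.hadoop.fs.azure.account.oauth.provider.type org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider
spark.hadoop.fs.azure.account.oauth2.client.id <application-id>
spark.hadoop.fs.azure.account.oauth2.client.secret <client-secret>
spark.hadoop.fs.azure.account.oauth2.client.endpoint https://login.microsoftonline.com/<tenant-id>/oauth2/token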