Amazon Redshift Sources
You can use an Amazon Redshift object as a source in a Data Synchronization task. You can also use multiple related Amazon Redshift standard objects as sources in a Data Synchronization task.
When you use Amazon Redshift source objects, you can select a standard object as the primary source, and then add child objects.
When you configure the advanced source properties, you configure properties specific to Amazon Redshift. You can encrypt data, retain the staging files on Amazon S3, and securely unload the results of a query to files on Amazon S3.
Amazon Redshift Staging Directory for Amazon Redshift Sources
The agent creates staging files in the directory that you specify in the source properties. Data Synchronization, mapping, and Mapping Configuration tasks stage data in the staging directory before reading data from Amazon Redshift. The agent deletes the staging files from the staging directory when the task completes.
You cannot configure a staging directory on the Hosted Agent. The Hosted Agent creates a directory to stage data at a temporary location and deletes the staging files from the temporary location when the task completes.
To improve task performance, enable compression for staging files. Specify a staging directory with an appropriate amount of disk space for the volume of data that you want to process. Specify a directory path that is available on each Secure Agent machine in the runtime environment.
The agent creates subdirectories in the staging directory based on the time that the task runs. Subdirectories use the following naming convention:
<staging directory>/infaRedShiftStaging<MMddHHmmssSSS+xyz>
Client-side Encryption for Amazon Redshift Sources
Client-side encryption is a technique to encrypt data before transmitting the data to the Amazon Redshift server.
When you enable client-side encryption for Amazon Redshift sources, Amazon Redshift unloads the data in encrypted format, and then pushes the data to the Secure Agent. The Secure Agent writes the data to the target based on the task or mapping logic.
To enable client-side encryption, you must provide a master symmetric key in the connection properties. The master symmetric key is a 256-bit AES encryption key in Base64 format. The Secure Agent encrypts the data by using the master symmetric key. Amazon Redshift Connector uploads the data to the Amazon S3 server by using the master symmetric key and then loads the data by using the copy command with the Encrypted option and a private encryption key for additional security. To enable client-side encryption, perform the following tasks:
1. Provide the master symmetric key when you create an Amazon Redshift connection. Ensure that you provide a 256-bit AES encryption key in Base64 format.
2. Update the security policy JAR files on each Secure Agent machine in the runtime environment. Update the local_policy.jar and US_export_policy.jar files in the following directory: <Secure Agent installation directory>\jre\lib\security. You can download the JAR files supported by your Java environment from the Oracle website.
3. On the Schedule page, select Client Side Encryption as the encryption type in the advanced source properties.
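The master symmetric key is simply 32 random bytes (256 bits) encoded in Base64. As an illustrative sketch only, not part of the product, you can generate a suitable key with the Python standard library:

```python
import base64
import os

# Generate 32 cryptographically secure random bytes (256 bits)
# for an AES-256 master symmetric key.
key_bytes = os.urandom(32)

# Base64-encode the key, as the connection property expects.
key_b64 = base64.b64encode(key_bytes).decode("ascii")

print(key_b64)  # paste this value into the master symmetric key field
```

The resulting string decodes back to exactly 32 bytes, which is what a 256-bit AES key in Base64 format means.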
Unload Command
You can use the Unload command to extract data from Amazon Redshift and create staging files on Amazon S3. The Unload command uses a secure connection to unload data to one or more files on Amazon S3.
You can specify the Unload command options directly in the UnloadOptions Property File field. Enter the options in uppercase and delimit the options with a new line. The Unload command has the following options and default values:
DELIMITER=\036
ESCAPE=OFF
PARALLEL=ON
AWS_IAM_ROLE=arn:aws:iam::<account ID>:role/<role-name>
When you run a task in the Secure Agent runtime environment, you can create a property file. The property file contains the Unload command options. Include the property file path in the UnloadOptions Property File field. For example:
C:\Temp\Redshift\unloadoptions.txt
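For example, a property file that keeps the default delimiter, enables parallel unload, and supplies an IAM role might contain the following lines (the account ID and role name are placeholders that you replace with your own values):

```
DELIMITER=\036
ESCAPE=OFF
PARALLEL=ON
AWS_IAM_ROLE=arn:aws:iam::<account ID>:role/<role-name>
```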
If you run the Unload command as a pre-SQL or post-SQL command, specify the ALLOWOVERWRITE option to overwrite existing files.
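As a sketch of such a pre-SQL or post-SQL statement, the following uses a hypothetical table and bucket with the example role ARN shown later in this section; adjust the query, S3 prefix, and ARN for your environment:

```sql
UNLOAD ('SELECT * FROM sales')
TO 's3://my-bucket/staging/sales_'
IAM_ROLE 'arn:aws:iam::123123456789:role/redshift_read'
ALLOWOVERWRITE;
```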
Unload Command Options
The Unload command options extract data from Amazon Redshift and load data to staging files on Amazon S3 in a particular format. You can delimit the data with a particular character or load data to multiple files in parallel.
To add options to the Unload command, use the UnloadOptions Property File option. You can set the following options:
- DELIMITER
- A single ASCII character to separate fields in the input file. You can use characters such as pipe (|), tilde (~), or tab (\t). The delimiter that you specify must not be part of the data. If the delimiter appears in the data, use ESCAPE to read the delimiter character as a regular character. Default is \036, the octal representation of the non-printable record separator character.
- ESCAPE
- You can add an escape character for CHAR and VARCHAR columns in delimited unload files before occurrences of the following characters:
- - Linefeed \n
- - Carriage return \r
- - Delimiter character specified for the unloaded data
- - Escape character \
- - Single- or double-quote character
Default is OFF.
- PARALLEL
- The Unload command writes data in parallel to multiple files, according to the number of slices in the cluster. Default is ON. If you turn the Parallel option off, the Unload command writes data serially. The maximum size of a data file is 6.5 GB.
- AWS_IAM_ROLE
- Specify the Amazon Resource Name (ARN) of the AWS Identity and Access Management role to run the task on an agent installed on an Amazon EC2 system, in the following format: AWS_IAM_ROLE=arn:aws:iam::<account ID>:role/<role-name>
For example: arn:aws:iam::123123456789:role/redshift_read
- ADDQUOTES
- ADDQUOTES is implemented with the UNLOAD command by default. Do not specify the ADDQUOTES option in the advanced source properties. The Unload command adds quotation marks to each data field. With added quotation marks, the UNLOAD command can read data values that contain the delimiter. If a double quote (") is part of the data, use ESCAPE to read the double quote as a regular character.
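To see why ESCAPE matters when the delimiter appears in the data, consider parsing a single unloaded record. This minimal Python sketch (the record content is hypothetical, and the real parsing is done by the agent) splits a record on the \036 delimiter while treating a backslash-escaped delimiter as ordinary data:

```python
def split_fields(record: str, delim: str = "\x1e") -> list[str]:
    """Split an unloaded record on the delimiter, honoring backslash escapes."""
    fields, current, escaped = [], [], False
    for ch in record:
        if escaped:
            current.append(ch)      # escaped character is taken literally
            escaped = False
        elif ch == "\\":
            escaped = True          # next character is part of the data
        elif ch == delim:
            fields.append("".join(current))
            current = []
        else:
            current.append(ch)
    fields.append("".join(current))
    return fields

# A record where the second field contains an escaped delimiter
# (\x1e is the record separator, octal \036).
record = "a\x1eb\\\x1ec\x1ed"
print(split_fields(record))  # → ['a', 'b\x1ec', 'd']
```

Without the escape handling, the embedded delimiter would split the second field in two, which is exactly the corruption the ESCAPE option prevents.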
Partitioning
When you read data from Amazon Redshift, you can configure partitioning to optimize the mapping performance at run time. The partition type controls how the agent distributes data among partitions at partition points.
You can define the partition type as key range partitioning. Configure key range partitioning to partition Amazon Redshift data based on the value of a field or set of fields. With key range partitioning, the Secure Agent distributes rows of source data based on the fields that you define as partition keys. The Secure Agent compares the field value to the range values for each partition and sends rows to the appropriate partition.
Use key range partitioning for columns that have an even distribution of data values. Otherwise, the partitions might be of unequal size. For example, a column might have 10 rows between key values 1 and 1000 and 999 rows between key values 1001 and 2000.
With key range partitioning, a query for one partition might return rows sooner than another partition. Or, one partition can return rows while the other partitions are not returning rows. This situation occurs when the rows in the table are in a similar order as the key range. One query might be reading and returning rows while the other queries are reading and filtering the same rows.
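The routing logic described above can be sketched as follows. The key ranges and rows here are hypothetical, and the actual distribution is performed by the Secure Agent; the sketch only illustrates how a key value maps to a partition:

```python
# Key range partitioning sketch: route each row to the partition whose
# [start, end) range contains the row's partition key value.
partitions = [(1, 1001), (1001, 2001), (2001, 3001)]  # hypothetical key ranges

def partition_for(key: int) -> int:
    """Return the index of the partition whose range contains the key."""
    for i, (start, end) in enumerate(partitions):
        if start <= key < end:
            return i
    raise ValueError(f"key {key} falls outside all partition ranges")

# Route some hypothetical source rows into per-partition buckets.
rows = [{"id": 10}, {"id": 1500}, {"id": 2999}]
buckets = {}
for row in rows:
    buckets.setdefault(partition_for(row["id"]), []).append(row)

print(buckets)  # → {0: [{'id': 10}], 1: [{'id': 1500}], 2: [{'id': 2999}]}
```

If most key values fall into one range, one bucket grows much larger than the others, which is the unequal-partition effect described above.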