You can use an Amazon Redshift V2 object as a target in a mapping or mapping task. You can also create an Amazon Redshift V2 target based on the input source.
When you configure the advanced target properties, configure properties specific to Amazon Redshift V2. You can encrypt data, update statistical metadata of the database tables to improve the efficiency of queries, load data into Amazon Redshift from flat files in an Amazon S3 bucket, and use vacuum tables to recover disk space and sort rows in tables.
Note: If the distribution key column in a target table contains null values and you configure a task with an upsert operation for the same target table, the task might create duplicate rows. To avoid creating duplicate rows, you must perform one of the following tasks:
•Replace the null value with a non-null value when you load data.
•Do not configure the column as a distribution key if you expect null values in the distribution key column.
•Remove the distribution key column from the target table temporarily when you load data. You can use the Pre-SQL and Post-SQL properties to remove and then add the distribution key column in the target table.
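For example, the following Pre-SQL and Post-SQL statements are a minimal sketch of the third option, assuming a hypothetical target table named sales_target with the distribution key column cust_id. Instead of dropping the column, the sketch switches the distribution style for the duration of the load and restores the distribution key afterward:
Pre-SQL:
ALTER TABLE sales_target ALTER DISTSTYLE EVEN;   -- stop distributing on cust_id before the load
Post-SQL:
ALTER TABLE sales_target ALTER DISTKEY cust_id;  -- reinstate cust_id as the distribution key after the load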
Amazon Redshift staging directory for Amazon Redshift V2 targets
The Secure Agent creates a staging file in the directory that you specify in the target properties. The Secure Agent writes the data to the staging directory before writing the data to Amazon Redshift.
The Secure Agent deletes the staged files from the staging directory after writing the data to Amazon S3. Specify a staging directory in the mapping properties with an appropriate amount of disk space for the volume of data that you want to process. Specify a directory on the machine that hosts the Secure Agent.
The Secure Agent creates subdirectories in the staging directory. Subdirectories use the following naming convention: <staging directory>/infaRedShiftStaging<MMddHHmmssSSS+xyz>
Data encryption in Amazon Redshift V2 targets
To protect data, you can enable server-side encryption or client-side encryption to encrypt the data that you insert in Amazon Redshift.
If you enable both server-side and client-side encryption for an Amazon Redshift target, then the client-side encryption is used for data load.
Server-side encryption for Amazon Redshift V2 targets
If you want Amazon Redshift to encrypt data while uploading and staging the .csv files to Amazon S3, you must enable server-side encryption.
To enable server-side encryption, select S3 Server Side Encryption in the advanced target properties and specify the Customer Master key ID in the connection properties.
You can configure the customer master key ID generated by AWS Key Management Service (AWS KMS) in the connection properties for server-side encryption. When you use IAM authentication with server-side encryption that uses a customer master key, you must add the IAM EC2 role and the IAM Redshift role to the customer master key.
If you select the server-side encryption in the advanced target properties and do not specify the customer master key ID in the connection properties, Amazon S3-managed encryption keys are used to encrypt data.
Client-side encryption for Amazon Redshift V2 targets
Client-side encryption is a technique to encrypt data before transmitting the data to the Amazon Redshift server.
When you enable client-side encryption for Amazon Redshift V2 targets, the Secure Agent fetches the data from the source, writes the data to the staging directory, encrypts the data, and then writes the data to an Amazon S3 bucket. The Amazon S3 bucket then writes the data to Amazon Redshift.
Note: When you use a serverless runtime environment, you cannot configure client-side encryption for Amazon Redshift V2 targets.
If you enable both server-side and client-side encryption for an Amazon Redshift V2 target, then the client-side encryption is used for data load.
To enable client-side encryption, you must provide a master symmetric key in the connection properties and select S3 Client Side Encryption in the advanced target properties.
The Secure Agent encrypts the data by using the master symmetric key. The master symmetric key is a 256-bit AES encryption key in the Base64 format. Amazon Redshift V2 Connector uploads the data to the Amazon S3 server by using the master symmetric key and then loads the data to Amazon Redshift by using the copy command with the Encrypted option and a private encryption key for additional security.
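The load that the Secure Agent performs roughly corresponds to the following hedged sketch of a copy command. The table name, staging path, IAM role, and key value are hypothetical placeholders, and the Secure Agent constructs the actual command for you:
COPY sales_target
FROM 's3://my-staging-bucket/infaRedShiftStaging<timestamp>/'
IAM_ROLE 'arn:aws:iam::123123456789:role/redshift_write'
MASTER_SYMMETRIC_KEY '<Base64-encoded 256-bit AES key>'
ENCRYPTED   -- indicates that the staging files are client-side encrypted
CSV;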
Copy command
You can use the Copy command to append data in a table. The Copy command uses a secure connection to load data from flat files in an Amazon S3 bucket to Amazon Redshift.
You can specify the Copy command options directly in the Copy Options field or specify the path to a text file that contains all the command options.
Enter the options in uppercase and use a semicolon to separate the options.
For example, DELIMITER = \036;ACCEPTINVCHARS = #;QUOTE = \037;COMPUPDATE = ON;AWS_IAM_ROLE=arn:aws:iam::<account ID>:role/<role-name>
It is recommended that you use the octal representation of non-printable characters as DELIMITER and QUOTE.
Note: You cannot specify the path to a text file that contains the Copy command options in a mapping in advanced mode.
Copy command options
The Copy command options read data from Amazon S3 and write data to Amazon Redshift in a particular format. You can apply compression to data in the tables or delimit the data with a particular character.
To add options to the Copy command, use the CopyOptions Property File option.
You can set the following options:
DELIMITER
A single ASCII character to separate fields in the input file. You can use characters such as pipe (|), tilde (~), or a tab (\t). The delimiter must not be a part of the data.
The default value is \036, which is the octal representation of the non-printable character known as the record separator.
ACCEPTINVCHARS
Loads data into VARCHAR columns even if the data contains UTF-8 characters that are not valid. When you specify ACCEPTINVCHARS, the Secure Agent replaces each UTF-8 character that is not valid with an equal-length string consisting of the character that you specify in ACCEPTINVCHARS. For example, if you specify '|' in ACCEPTINVCHARS, the Secure Agent replaces an invalid three-byte UTF-8 character with '|||'.
If you do not specify ACCEPTINVCHARS, the COPY command returns an error when it encounters a UTF-8 character that is not valid. You can use the ACCEPTINVCHARS option on VARCHAR columns. Default is question mark (?).
QUOTE
Specifies the quote character to use with comma separated values. Default is \037, the octal representation of the non-printable character, unit separator.
REGION
You can use the REGION attribute when the Amazon S3 staging bucket is not in the same region as the cluster region. If Amazon Redshift resides in the US East (N. Virginia) region, you can use an Amazon S3 bucket residing in the Asia Pacific (Mumbai) region to create staging files. For example, REGION = ap-south-1.
COMPUPDATE
Overrides the current compression encoding and applies compression to an empty table. Use the COMPUPDATE option in an insert operation when the table contains more than 100,000 rows. The behavior of COMPUPDATE depends on how it is configured:
- If you do not specify COMPUPDATE, the COPY command applies compression if the target table is empty and all columns in the table have either RAW or no encoding.
- If you specify COMPUPDATE ON, the COPY command replaces the existing encodings if the target table is empty and the columns in the table have encodings other than RAW.
- If you specify COMPUPDATE OFF, the COPY command does not apply compression.
Default is OFF.
TRUNCATECOLUMNS
Truncates data in columns of the VARCHAR and CHAR data types before writing the data to the target. If the size of the data that you want to write to the target is larger than the size of the target column, the Secure Agent truncates the data before writing it to the target column.
By default, the TRUNCATECOLUMNS option is OFF. To enable the TRUNCATECOLUMNS option, specify ON as the value of the option. For example, TRUNCATECOLUMNS=ON.
AWS_IAM_ROLE
Specify the Amazon Resource Name (ARN) of the Amazon Redshift role to run the task on a Secure Agent installed on an Amazon EC2 system. Use the following format: AWS_IAM_ROLE=arn:aws:iam::<account ID>:role/<role-name>
For example: arn:aws:iam::123123456789:role/redshift_write
IGNOREBLANKLINES
If the input rows contain NULL values, set the IGNOREBLANKLINES property to ON in the Copy Options to ignore blank lines while inserting the data to the target. Specify the property in the following format: IGNOREBLANKLINES=ON. The property ignores blank lines that contain only a line feed in a data file and does not try to load them.
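The following hedged sketch shows how these options translate into an Amazon Redshift COPY statement. The table name, staging path, and role ARN are hypothetical, and the Secure Agent assembles the actual statement from the staging files that it uploads:
COPY sales_target
FROM 's3://my-staging-bucket/infaRedShiftStaging<timestamp>/'
IAM_ROLE 'arn:aws:iam::123123456789:role/redshift_write'
DELIMITER '\036'     -- octal representation of the record separator, the default delimiter
ACCEPTINVCHARS '#'   -- replace invalid UTF-8 characters with #
COMPUPDATE ON        -- apply compression encodings to an empty table
TRUNCATECOLUMNS      -- truncate VARCHAR and CHAR data that exceeds the column size
IGNOREBLANKLINES     -- skip blank lines that contain only a line feed
REGION 'ap-south-1'; -- staging bucket resides in a different region than the cluster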
Analyze target table
To optimize query performance, you can configure a task to analyze the target table. Target table analysis updates statistical metadata of the database tables.
You can use the Analyze Target Table option to extract sample rows from the table, analyze the samples, and save the column statistics. Amazon Redshift then updates the query planner with the statistical metadata. The query planner uses the statistical metadata to build and choose optimal plans to improve the efficiency of queries.
You can run the Analyze Target Table option after you load data to an existing table by using the Copy command. If you load data to a new table, the Copy command performs an analysis by default.
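Enabling the option roughly corresponds to running the Amazon Redshift ANALYZE command on the target table after the load, as in the following sketch with a hypothetical table name:
ANALYZE sales_target;   -- update the statistical metadata that the query planner uses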
Retain staging files
You can retain staging files on Amazon S3 after the Secure Agent writes data to the target. You can retain files to create a data lake of your organizational data on Amazon S3. The files you retain can also serve as a backup of your data.
When you create a target connection, you can configure a file prefix or directory prefix to save the staging files. After you provide the prefixes, the Secure Agent creates files within the directories at the Amazon S3 location specified in the target. Configure one of the following options for the Prefix for Retaining Staging Files on S3 property:
•Provide a directory prefix and a file prefix. For example, backup_dir/backup_file. The Secure Agent creates directories and staging files based on the prefixes that you provide.
•Do not provide a prefix. The Secure Agent does not save the staging files.
Vacuum tables
You can use vacuum tables to recover disk space and sort rows in a specified table or in all tables in the database.
After you run bulk operations, such as delete or load, or after you run incremental updates, you must clean the database tables to recover disk space and to improve query performance on Amazon Redshift. Amazon Redshift does not reclaim and reuse free space when you delete and update rows.
Vacuum databases or tables often to maintain consistent query performance. You can recover disk space for the entire database or for individual tables in a database. You must run vacuum when you expect minimal activity on the database or during designated database administration schedules. Long durations of vacuum might impact database operations. Run vacuum often because large unsorted regions result in longer vacuum times.
You can enable the vacuum tables option when you configure the advanced target properties.
You can select the following recovery options:
None
Does not sort rows or recover disk space.
Full
Sorts the specified table or all tables in the database and recovers disk space occupied by rows marked for deletion by previous update and delete operations.
Sort Only
Sorts the specified table or all tables in the database without recovering space freed by deleted rows.
Delete Only
Recovers disk space occupied by rows marked for deletion by previous update and delete operations, and compresses the table to free up used space.
Reindex
Analyzes the distribution of the values in the interleaved sort key columns and then performs a full vacuum operation to improve performance.
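The recovery options correspond to the Amazon Redshift VACUUM variants. The following sketch shows each variant against a hypothetical table named sales_target:
VACUUM FULL sales_target;        -- sort rows and reclaim space from deleted rows
VACUUM SORT ONLY sales_target;   -- sort rows without reclaiming space
VACUUM DELETE ONLY sales_target; -- reclaim space without sorting rows
VACUUM REINDEX sales_target;     -- reanalyze interleaved sort keys, then run a full vacuum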
Recovery and restart processing
When you run a mapping task to capture changed data from a CDC source and write the changed data to an Amazon Redshift target table, Amazon Redshift V2 Connector supports recovery and restart processing. You cannot use recovery and restart processing when you run a mapping in advanced mode.
If a mapping task fails or is stopped before completing the task, the Secure Agent uses the recovery information stored in the infa_recovery_table table on the target system to resume the extraction of changed data from the point of interruption. This functionality prevents changed data loss and inconsistencies between the source and target.
To enable recovery and restart processing, set the Recovery Strategy advanced session property to Resume from last checkpoint on the Runtime Options page when you create or edit a mapping task. With this setting, the mapping task can resume processing changed data from the point of interruption.
In special situations, you can specify a restart point for a mapping task. Typically, the first time you start a mapping task, you specify a restart point that corresponds to the target materialization time so that no change records are skipped. The default restart point is the end of log (EOL), which is the current point of CDC processing in the log. You can specify a restart point that corresponds to the extraction processing starting from the earliest available record in the log or from a specific date and time. When you use a time-based restart point, extraction processing starts in the log that contains the first unit-of-work (UOW) that has an end time later than the restart time.
When you specify a restart point, consider the following points:
•The restart point applies to all sources in the mapping that is associated with the mapping task.
•If you set a restart point that is too early, it might correspond to an expired log file. In this case, the restart point is treated as the earliest available record in the available log files.
•If you set a restart point that is later than the latest record in the log files, an error message is issued.
Note: Restart information is associated with a mapping task and a specific source and target combination. If you change the source object in a mapping, you must either create a new mapping task for the mapping or increment the restart revision number for the existing mapping task. To increment the restart revision number, navigate to the CDC Runtime page for the mapping task, open the Select Restart Point dialog box, and click OK. If you do not take one of these actions, the mapping task fails the next time you run it.
Preserve record order on write
You can retain the order number of the changed record when you capture the changed record from a CDC source to a target table. This property enables you to avoid inconsistencies between the CDC source and target. You cannot use the preserve record order on write property when you run a mapping in advanced mode.
When you modify a single record in a row several times in a CDC source, enable the Preserve record order on write option in the advanced target property to retain the order number of the changed record when you write the changed record to the target table.
For example, you have a record in the following CDC source table on which you have performed multiple operations:
Emp ID | Emp Name | Emp Description | RowType | RowID
1      | John     | L1              | Insert  | 1
1      | John     | L2              | Update  | 2
1      | John     | L3              | Update  | 3
Here, assume that the RowID shows the order of the changed record in the CDC source table.
The Secure Agent writes the following changed record along with the order number in the target table:
Emp ID | Emp Name | Emp Description | RowType | RowID
1      | John     | L3              | Update  | 3
Octal values as DELIMITER and QUOTE
In addition to printable ASCII characters, you can use octal values for printable and non-printable ASCII characters as DELIMITER and QUOTE.
To use a printable character as DELIMITER or QUOTE, you can either specify the ASCII character or the respective octal value. However, to use a non-printable character as DELIMITER or QUOTE, you must specify the respective octal value.
Example for a printable character:
DELIMITER=# or DELIMITER=\043
Example for a non-printable character, file separator:
QUOTE=\034
Octal values 000-037 and 177 represent non-printable characters and 040-176 represent printable characters. The following table lists the recommended octal values, for QUOTE and DELIMITER in the Copy command and as DELIMITER in the Unload command, supported by Amazon Redshift:
Success and error files
The Secure Agent generates success and error files after you run a mapping. Success and error files are .csv files that contain row-level details. You cannot use success and error files when you run a mapping in advanced mode.
The Secure Agent generates a success file after you run a mapping. The success file contains an entry for each record that successfully writes into Amazon Redshift. Each entry contains the values that are written for all the fields of the record. Use this file to understand the data that the Secure Agent writes to the Amazon S3 bucket and then to the Amazon Redshift target.
The error file contains an entry for each data error. Each entry in the file contains the values for all fields of the record and the error message. Use the error file to understand why the Secure Agent does not write data to the Amazon Redshift target.
The Secure Agent does not overwrite success or error files. Access the error rows files and success rows files directly from the directories where they are generated. You can manually delete the files that you no longer need.
Consider the following guidelines when you configure the mapping properties for success files:
•You must provide the file path where you want the Secure Agent to generate the success rows file.
•The success rows file uses the following naming convention: <timestamp>success
Consider the following guidelines when you configure the mapping properties for error files:
•You must provide the file path where you want the Secure Agent to generate the error rows file.
•The error rows file uses the following naming convention: <timestamp>error
Note: The error rows files for insert and upsert tasks follow the same naming convention.
•When you define an error file directory, you can use the variable $PMBadFileDir. When you use the $PMBadFileDir variable, the application writes the file to the following Secure Agent directory: <Secure Agent installation directory>/apps/Data_Integration_Server/data/error.