Database Ingestion and Replication targets - preparation and usage

Before you configure database ingestion and replication tasks for initial load, incremental load, or combined initial and incremental load operations, review the following guidelines for your target types to avoid unexpected results:

Amazon Redshift targets

The following list identifies considerations for preparing and using Amazon Redshift targets:

Amazon S3, Flat File, Google Cloud Storage, Microsoft Azure Data Lake Storage, Microsoft Fabric OneLake, and Oracle Cloud Object Storage targets

The following list identifies considerations for using Amazon S3, Flat File, Google Cloud Storage, Microsoft Azure Data Lake Storage, Microsoft Fabric OneLake, and Oracle Cloud Infrastructure (OCI) Object Storage targets:

Databricks targets

To use Databricks targets in database ingestion and replication tasks, first prepare the target and review the usage considerations.
Target preparation:
  1. In the Databricks connection properties, set the JDBC Driver Class Name property to com.databricks.client.jdbc.Driver.
  2. On Windows, install Visual C++ Redistributable Packages for Visual Studio 2013 on the computer where the Secure Agent runs.
Usage considerations:

Google BigQuery targets

The following list identifies considerations for preparing and using Google BigQuery targets:

Target preparation

Target usage

Kafka targets and Kafka-enabled Azure Event Hubs targets

The following list identifies considerations for using Kafka targets:

Generating custom message keys for Kafka targets

For all Kafka target types that use the Avro format, you can configure rules to generate a custom message key that consists of one or more columns for each source table. Database ingestion and replication incremental load jobs that have a Kafka target can then include the generated message key in the headers of the messages that they send to the target messaging system. The target messaging system can use the message key to write messages with a specific key value to the same partition in a multi-partitioned topic.
To implement this feature, you must manually create a configuration file that contains rules that identify the key columns for each source table. Then specify the file in a custom configuration property in the task wizard.

Configuration file creation

Create the rule-configuration file in a text editor and save it to a location on the Secure Agent system. The file contains a rule for each source table. Each rule defines the custom key column or columns to use for writing data to topic partitions.
Note: If you change or add a rule or change one of the other parameters after the database ingestion and replication task has been deployed, you must redeploy the task for the rule change to take effect.
Rule syntax:
Use the following syntax to define rules in the configuration file:
rule=(schema.tablename,column1,column2,column3,… )
additional rules...
[tableNotFound=ABORT]
[trace={true|false}]
[delimiter=character]
To include comments in the file, begin each comment line with the number sign (#). For example:
#This text is for informational purposes only.
Parameters:
Example rule:
rule=(testdb.ABC.DEPT,DEPTNO,DNAME)
tableNotFound=ABORT
trace=true
delimiter=;
Example generated key output based on this rule:
1234;HR;
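For reference, a complete rule-configuration file that combines these elements might look like the following sketch. The DEPT rule comes from the example above; the EMP rule and its EMPNO key column are hypothetical and are shown only to illustrate how additional rules are listed:
#Rules that define custom message key columns
rule=(testdb.ABC.DEPT,DEPTNO,DNAME)
rule=(testdb.ABC.EMP,EMPNO)
tableNotFound=ABORT
trace=false
delimiter=;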

Database ingestion and replication task configuration

When you create a database ingestion and replication incremental load task that has a Kafka target, you must set the following options to enable the generation of custom message keys:

Microsoft Azure Synapse Analytics targets

The following list identifies considerations for preparing and using Microsoft Azure Synapse Analytics targets:

Microsoft SQL Server and Azure SQL Database targets

You can use Microsoft SQL Server or Microsoft Azure SQL Database targets in initial load, incremental load, and combined initial and incremental load jobs. SQL Server targets include on-premises, RDS, and Azure SQL Managed Instance targets.
The following list identifies considerations for preparing and using Microsoft SQL Server targets:

Oracle targets

The following list identifies considerations for preparing and using Oracle targets.

Target preparation

Usage considerations

PostgreSQL targets

You can use Amazon Aurora PostgreSQL and RDS for PostgreSQL as targets in initial load, incremental load, and combined initial and incremental load jobs that have a Db2 for i, Oracle, PostgreSQL, or SQL Server source.
The following list identifies considerations for preparing and using PostgreSQL targets:

Snowflake targets

Target preparation

Target preparation varies depending on whether you use the Superpipe feature for high performance streaming of data to Snowflake target tables or write data to intermediate stage files.
With Superpipe
If you use the Superpipe feature, complete the following steps:
  1. Create a Data Ingestion and Replication user. Use the following SQL statement:
     create user INFACMI_User password 'Xxxx@xxx';
  2. Create a new user role and grant it to the Data Ingestion and Replication user. Use the following SQL statements:
     create role INFACMI_superpipe;
     grant role INFACMI_superpipe to user INFACMI_User;
  3. Grant usage on the Snowflake virtual warehouse to the new role. Use the following SQL statement:
     grant usage on warehouse warehouse_name to role INFACMI_superpipe;
  4. Grant usage on the Snowflake database to the new role. Use the following SQL statement:
     grant usage on database INFACMI_DB1 to role INFACMI_superpipe;
  5. Create a new schema. Use the following SQL statements:
     use database INFACMI_DB1;
     create schema sh_superpipe;
  6. Grant the create stream, create view, and create table privileges on the new Snowflake schema to the new role. Use the following SQL statement:
     grant create stream, create view, create table, usage on schema INFACMI_DB1.sh_superpipe to role INFACMI_superpipe;
  7. Set the default role for the newly created user. Use the following SQL statement:
     alter user INFACMI_User set default_role=INFACMI_superpipe;
  8. Define a Snowflake Data Cloud connection to the target. You must use the KeyPair option as the authentication method. See Connectors and Connections > Snowflake Data Cloud connection properties.
  9. Generate a private key with OpenSSL version 3.x.x and the PBE-SHA1-2DES or PBE-SHA1-3DES cipher. Use the following openssl commands to generate and format the private key:
     openssl genrsa 2048 | openssl pkcs8 -topk8 -inform PEM -v1 PBE-SHA1-3DES -out rsa_key.p8
  10. Generate the public key. Use the following openssl command, in which the -in option references the file (rsa_key.p8) that contains the encrypted private key:
     openssl rsa -in rsa_key.p8 -pubout -out rsa_key.pub
  11. In Snowflake, assign the public key to the Snowflake user. Use the following SQL command:
     alter user INFACMI_User set rsa_public_key='key_value';
Next step: When you create an ingestion and replication task, select the Superpipe option on the Target page of the task wizard. You can also optionally specify a Merge Frequency value, which controls the frequency at which change data rows are merged and applied to the Snowflake target table.
Without Superpipe
If you do NOT use the Superpipe feature for Snowflake targets, complete the following steps as the ACCOUNTADMIN user:
  1. Create a Data Ingestion and Replication user. Use one of the following SQL statements:
     create user INFACMI_User password 'Xxxx@xxx';
     or
     create or replace user INFACMI_User password 'Xxxx@xxx';
  2. Create a new role and grant the role to the Data Ingestion and Replication user. Use the following SQL statements:
     create role INFA_CMI_Role;
     grant role INFA_CMI_Role to user INFACMI_User;
  3. Grant usage on the Snowflake virtual warehouse to the new role. Use the following SQL statement:
     grant usage on warehouse CMIWH to role INFA_CMI_Role;
  4. Grant usage and CREATE SCHEMA privileges on the Snowflake database to the new role. Use the following SQL statement:
     grant usage, CREATE SCHEMA on database CMIDB to role INFA_CMI_Role;
  5. Set the default role for the newly created user. Use the following SQL statement:
     alter user INFACMI_User set default_role=INFA_CMI_Role;
Also, as the INFACMI_User, create a new schema:
create schema CMISchema;
Note: If the user's default role is used for ingestion and replication tasks and does not have the required privileges, the following error will be issued at runtime:
SQL compilation error: Object does not exist, or operation cannot be performed.

Usage considerations

Configure private connectivity to Snowflake

You can access Snowflake using AWS or Azure Private Link endpoints.
The AWS or Azure Private Link setup ensures that the connection to Snowflake uses the AWS or Azure internal network and does not take place over the public Internet.
To connect to the Snowflake account over the private AWS network, see AWS Private Link and Snowflake.
To connect to the Snowflake account over the private Azure network, see Azure Private Link and Snowflake.

Default directory structure for CDC files on Amazon S3, Google Cloud Storage, Microsoft Azure Data Lake Storage Gen2, and Microsoft Fabric OneLake targets

Database ingestion and replication jobs create directories on Amazon S3, Google Cloud Storage, Microsoft Azure Data Lake Storage Gen2, and Microsoft Fabric OneLake targets to store information about change data processing.
The following directory structure is created by default on the targets:
Bucket
└───connection_folder
    └───job_folder
        ├───cycle
        │   ├───completed
        │   │   ├───completed_cycle_folder
        │   │   │   └───Cycle-timestamp.csv
        │   │   │   ...
        │   │   └───completed_cycle_folder
        │   │       └───Cycle-timestamp.csv
        │   └───contents
        │       ├───cycle_folder
        │       │   └───Cycle-contents-timestamp.csv
        │       │   ...
        │       └───cycle_folder
        │           └───Cycle-contents-timestamp.csv
        └───data
            └───table_name
                ├───data
                │   ├───cycle_folder
                │   │   └───table_name_timestamp.csv
                │   │   ...
                │   └───cycle_folder
                │       └───table_name_timestamp.csv
                └───schema
                    └───V1
                        └───table_name.schema
The following list describes the directories in the default structure:
connection_folder
  Contains the Database Ingestion and Replication objects. This folder is specified in the Folder Path field of the Amazon S3 connection properties, in the Directory Path field of the Microsoft Azure Data Lake Storage Gen2 connection properties, or in the Lakehouse Path field of the Microsoft Fabric OneLake connection properties.
  Note: This folder is not created for Google Cloud Storage targets.
job_folder
  Contains job output files. This folder is specified in the Directory field on the Target page of the database ingestion and replication task wizard.
cycle/completed
  Contains a subfolder for each completed CDC cycle. Each completed cycle subfolder contains a completed cycle file.
cycle/contents
  Contains a subfolder for each CDC cycle. Each cycle subfolder contains a cycle contents file.
data
  Contains output data files and schema files for each table.
data/table_name/schema/V1
  Contains a schema file.
  Note: Database Ingestion and Replication does not save a schema file in this folder if the output files use the Parquet format.
data/table_name/data
  Contains a subfolder for each CDC cycle that produces output data files.

Cycle directories

Database Ingestion and Replication uses the following pattern to name cycle directories:
[dt=]yyyy-mm-dd-hh-mm-ss
The "dt=" prefix is added to cycle folder names if you select the Add Directory Tags check box on the Target page of the database ingestion and replication task wizard.

Cycle contents files

Cycle contents files are located in cycle/contents/cycle_folder subdirectories. Cycle contents files contain a record for each table that has had a DML event during the cycle. If no DML operations occurred on a table in the cycle, the table does not appear in the cycle contents file.
Database Ingestion and Replication uses the following pattern to name cycle content files:
Cycle-contents-timestamp.csv
A cycle contents csv file contains the following information:

Completed cycle files

Completed cycle files are located in cycle/completed/completed_cycle_folder subdirectories. A database ingestion and replication job creates a cycle file in this subdirectory after a cycle completes. If this file is not present, the cycle has not completed yet.
Database Ingestion and Replication uses the following pattern to name completed cycle files:
Cycle-timestamp.csv
A completed cycle csv file contains the following information:

Output data files

The data files contain records that include the following information:

Custom directory structure for output files on Amazon S3, Google Cloud Storage, Flat File, Microsoft Fabric OneLake, and ADLS Gen2 targets

You can configure a custom directory structure for the output files that initial load, incremental load, and combined initial and incremental load jobs write to Amazon S3, Google Cloud Storage, Flat File, Microsoft Azure Data Lake Storage (ADLS) Gen2, or Microsoft Fabric OneLake targets if you do not want to use the default structure.

Initial loads

By default, initial load jobs write output files to tablename_timestamp subdirectories under the parent directory. For Amazon S3, Flat File, and ADLS Gen2 targets, the parent directory is specified in the target connection properties if the Connection Directory as Parent check box is selected on the Target page of the task wizard.
For Google Cloud Storage targets, the parent directory is the bucket container specified in the Bucket field on the Target page of the task wizard.
For Microsoft Fabric OneLake targets, the parent directory is the path specified in the Lakehouse Path field in the Microsoft Fabric OneLake connection properties.
You can customize the directory structure to suit your needs. For example, for initial loads, you can write the output files under a root directory or directory path that is different from the parent directory specified in the connection properties to better organize the files for your environment or to find them more easily. Or you can consolidate all output files for a table directly in a directory with the table name rather than write the files to separate timestamped subdirectories, for example, to facilitate automated processing of all of the files.
To configure a directory structure, you must use the Data Directory field on the Target page of the ingestion task wizard. The default value is {TableName}_{Timestamp}, which causes output files to be written to tablename_timestamp subdirectories under the parent directory. You can configure a custom directory path by creating a directory pattern that consists of any combination of case-insensitive placeholders and directory names. The placeholders are:
A pattern can also include the following functions:
By default, the target schema is also written to the data directory. If you want to use a different directory for the schema, you can define a directory pattern in the Schema Directory field.
Example 1
You are using an Amazon S3 target and want to write output files and the target schema to the same directory, which is under the parent directory specified in the Folder Path field of the connection properties. In this case, the parent directory is idr-test/DEMO/. You want to write all of the output files for a table to a directory that has a name matching the table name, without a timestamp. You must complete the Data Directory field and select the Connection Directory as Parent check box. The following image shows this configuration on the Target page of the task wizard:
Based on this configuration, the resulting directory structure is:
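For illustration, assuming a hypothetical source table named CUSTOMERS and a Data Directory value of {TableName}, the output is organized similar to the following sketch:
idr-test/DEMO/
└───CUSTOMERS
    (output data files and the schema file for the CUSTOMERS table)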
Example 2
You are using an Amazon S3 target and want to write output data files to a custom directory path and write the target schema to a separate directory path. To use the directory specified in the Folder Path field in the Amazon S3 connection properties as the parent directory for the data directory and schema directory, select Connection Directory as Parent. In this case, the parent directory is idr-test/DEMO/. In the Data Directory and Schema Directory fields, define directory patterns by using a specific directory name, such as data_dir and schema_dir, followed by the default {TableName}_{Timestamp} placeholder value. The placeholder creates tablename_timestamp destination directories. The following image shows this configuration on the Target page of the task wizard:
Based on this configuration, the resulting data directory structure is:
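For illustration, assuming a hypothetical source table named CUSTOMERS and an illustrative timestamp value, the data directory is organized similar to the following sketch:
idr-test/DEMO/
└───data_dir
    └───CUSTOMERS_20240514093000
        (output data files for the CUSTOMERS table)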
And the resulting schema directory structure is:
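Using the same hypothetical table name and timestamp:
idr-test/DEMO/
└───schema_dir
    └───CUSTOMERS_20240514093000
        (schema file for the CUSTOMERS table)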

Incremental loads and combined initial and incremental loads

By default, incremental load and combined initial and incremental load jobs write cycle files and data files to subdirectories under the parent directory. However, you can create a custom directory structure to organize the files to best suit your organization's requirements.
This feature applies to database ingestion incremental load and combined initial and incremental load tasks that have Amazon S3, Google Cloud Storage, Microsoft Fabric OneLake, or Microsoft Azure Data Lake Storage (ADLS) Gen2 targets.
For Amazon S3 and ADLS Gen2 targets, the parent directory is set in the target connection properties if the Connection Directory as Parent check box is selected on the Target page of the task wizard.
For Google Cloud Storage targets, the parent directory is the bucket container specified in the Bucket field on the Target page of the task wizard.
For Microsoft Fabric OneLake targets, the parent directory is the path specified in the Lakehouse Path field in the Microsoft Fabric OneLake connection properties.
You can customize the directory structure to suit your needs. For example, you can write the data and cycle files under a target directory for the task instead of under the parent directory specified in the connection properties. Alternatively, you can 1) consolidate table-specific data and schema files under a subdirectory that includes the table name, 2) partition the data files and summary contents and completed files by CDC cycle, or 3) create a completely customized directory structure by defining a pattern that includes literal values and placeholders. For example, if you want to run SQL-type expressions to process the data based on time, you can write all data files directly to timestamp subdirectories without partitioning them by CDC cycle.
To configure a custom directory structure for an incremental load task, define a pattern for any of the following optional fields on the Target page of the ingestion task wizard:
Task Target Directory
  Name of a root directory to use for storing output files for an incremental load task.
  If you select the Connection Directory as Parent option, you can still optionally specify a task target directory. It is appended to the parent directory to form the root for the data, schema, cycle completion, and cycle contents directories.
  This field is required if the {TaskTargetDirectory} placeholder is specified in patterns for any of the following directory fields.
  Default: None
Connection Directory as Parent
  Select this check box to use the parent directory specified in the connection properties.
  This field is not available for the Microsoft Fabric OneLake target.
  Default: Selected
Data Directory
  Path to the subdirectory that contains the data files.
  In the directory path, the {TableName} placeholder is required if data and schema files are not partitioned by CDC cycle.
  Default: {TaskTargetDirectory}/data/{TableName}/data
Schema Directory
  Path to the subdirectory in which to store the schema file if you do not want to store it in the data directory.
  In the directory path, the {TableName} placeholder is required if data and schema files are not partitioned by CDC cycle.
  Default: {TaskTargetDirectory}/data/{TableName}/schema
Cycle Completion Directory
  Path to the directory that contains the cycle completed file.
  Default: {TaskTargetDirectory}/cycle/completed
Cycle Contents Directory
  Path to the directory that contains the cycle contents files.
  Default: {TaskTargetDirectory}/cycle/contents
Use Cycle Partitioning for Data Directory
  Causes a timestamp subdirectory to be created for each CDC cycle, under each data directory.
  If this option is not selected, individual data files are written to the same directory without a timestamp, unless you define an alternative directory structure.
  Default: Selected
Use Cycle Partitioning for Summary Directories
  Causes a timestamp subdirectory to be created for each CDC cycle, under the summary contents and completed subdirectories.
  Default: Selected
List Individual Files in Contents
  Lists individual data files under the contents subdirectory.
  If Use Cycle Partitioning for Summary Directories is cleared, this option is selected by default, and all of the individual files are listed in the contents subdirectory unless you configure custom subdirectories by using placeholders, such as those for the timestamp or date.
  If Use Cycle Partitioning for Data Directory is selected, you can still optionally select this check box to list individual files and group them by CDC cycle.
  Default: Not selected if Use Cycle Partitioning for Summary Directories is selected. Selected if Use Cycle Partitioning for Summary Directories is cleared.
A directory pattern consists of any combination of case-insensitive placeholders, shown in curly brackets { }, and specific directory names. The following placeholders are supported:
Note: The timestamp, year, month, and day placeholders indicate when the CDC cycle started when specified in patterns for data, contents, and completed directories, or indicate when the CDC job started when specified in the schema directory pattern.
Example 1
You want to accept the default directory settings for incremental load or combined initial and incremental load jobs as displayed in the task wizard. The target type is Amazon S3. Because the Connection Directory as Parent check box is selected by default, the parent directory path that is specified in the Folder Path field of the Amazon S3 connection properties is used. This parent directory is idr-test/dbmi/. You also must specify a task target directory name, in this case, s3_target, because the {TaskTargetDirectory} placeholder is used in the default patterns in the subsequent directory fields. The files in the data directory and schema directory will be grouped by table name because the {TableName} placeholder is included in their default patterns. Also, because cycle partitioning is enabled, the files in the data directory, schema directory, and cycle summary directories will be subdivided by CDC cycle. The following image shows the default configuration settings on the Target page of the task wizard, except for the specified task target directory name:
Based on this configuration, the resulting data directory structure is:
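The exact listing depends on when the job and its CDC cycles run, but based on the default patterns and the source table pgs001_src_allint_init, the hierarchy has the following shape (the cycle timestamps shown are illustrative only):
idr-test/dbmi/
└───s3_target
    ├───cycle
    │   ├───completed
    │   │   └───2024-05-14-09-30-00
    │   └───contents
    │       └───2024-05-14-09-30-00
    └───data
        └───pgs001_src_allint_init
            ├───data
            │   └───2024-05-14-09-30-00
            └───schema
                └───2024-05-14-09-30-00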
If you drill down on the data folder and then on a table in that folder (pgs001_src_allint_init), you can access the data and schema subdirectories:
If you drill down on the data folder, you can access the timestamp directories for the data files:
If you drill down on cycle, you can access the summary contents and completed subdirectories:
Example 2
You want to create a custom directory structure for incremental load or combined initial and incremental load jobs that adds the subdirectories "demo" and "d1" in all of the directory paths except in the schema directory so that you can easily find the files for your demos. Because the Connection Directory as Parent check box is selected, the parent directory path (idr-test/dbmi/) that is specified in the Folder Path field of the Amazon S3 connection properties is used. You also must specify the task target directory because the {TaskTargetDirectory} placeholder is used in the patterns in the subsequent directory fields. The files in the data directory and schema directory will be grouped by table name. Also, because cycle partitioning is enabled, the files in the data, schema, and cycle summary directories will be subdivided by CDC cycle. The following image shows the custom configuration on the Target page of the task wizard:
Based on this configuration, the resulting data directory structure is:

Supported Avro data types

Database Ingestion and Replication supports some of the primitive and logical data types that Avro schemas provide. These data types pertain to target types that support Avro or Parquet output format.
A primitive data type represents a single data value. A logical data type is an Avro primitive or complex data type that has additional attributes to represent a derived type.
Database Ingestion and Replication supports the following primitive Avro data types:
INT - 32-bit signed integer
LONG - 64-bit signed integer
FLOAT - Single precision (32-bit) IEEE 754 floating-point number
DOUBLE - Double precision (64-bit) IEEE 754 floating-point number
BYTES - Sequence of 8-bit unsigned bytes
STRING - Unicode character sequence
Database Ingestion and Replication supports the following logical Avro data types:
DECIMAL - An arbitrary-precision signed decimal number of the form unscaled × 10^(-scale).
DATE - A date, without reference to a time or time zone.
TIME - A time of day that has the precision of 1 millisecond or 1 microsecond, without reference to a time zone or date.
TIMESTAMP - A date and time value that has the precision of 1 millisecond or 1 microsecond, without reference to a particular calendar or time zone.
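To show how these logical types are expressed in an Avro schema, the following hand-written fragment follows the Avro specification conventions, in which a logicalType attribute annotates a primitive type. The record and field names are hypothetical, and the fragment is an illustrative sketch rather than output captured from a Database Ingestion and Replication job:
{
  "type": "record",
  "name": "DEPT",
  "fields": [
    {"name": "DEPTNO", "type": "int"},
    {"name": "DNAME", "type": "string"},
    {"name": "BUDGET", "type": {"type": "bytes", "logicalType": "decimal", "precision": 10, "scale": 2}},
    {"name": "HIRE_DATE", "type": {"type": "int", "logicalType": "date"}},
    {"name": "UPDATED_AT", "type": {"type": "long", "logicalType": "timestamp-micros"}}
  ]
}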
For Databricks targets, Database Ingestion and Replication does not use the following data types in the intermediate Parquet files: